Testing Is a Poor Substitute for Reasoning

Introduction

"Program testing can be used to show the presence of bugs, but never to show their absence!"– Edsger W. Dijkstra

"I'm not sure what fuzzing is, other than an admission that you don't know how to reason about your code."– John Lakos

This is the first in what I hope will be a series of posts on how & why we should be reasoning about code as opposed to merely testing it. It's something I've been thinking about for a long time but am only now starting to write down. This is all very nascent, and I'm going to start with some very simple ideas. Bear with me.

Motivation

I'd like to motivate this with a real-world example. Some time ago, while on the Security Infrastructure team at LinkedIn, I wrote a program called datavault-tool. It was a Python program, and a modest one at that. At that time, it comprised eleven files & 1,979 lines of code. Here's the end of a build:

...
:datavault-tools:pytest
===================== test session starts ======================
platform linux2 -- Python 2.7.11, pytest-3.0.3, py-1.4.31, pl...
rootdir: /home/michael/code/mp/datavault-tools_tru...
plugins: xdist-1.15.0, lipy-config-base-3.2.76, cov-2.4.0
collected 34 items

test/test_audit.py .
test/test_auditd2.py ..
test/test_comms.py ....
test/test_d2.py .....
test/test_healthcheck.py .....
test/test_request.py .....
test/test_tls.py .......
test/test_verify.py .....

================== 34 passed in 11.97 seconds ==================
:datavault-tools:validateIncompatibilityRules
:datavault-tools:check
:datavault-tools:build

BUILD SUCCESSFUL

Total time: 9 mins 6.958 secs

As you can see, I have thirty-four unit tests, and they all passed.

Does that mean my code has no bugs?

It's impossible to be certain, of course, but I suspect the reader would say no (as would I). Suppose I went off & wrote sixty-six more unit tests, so I had an even one hundred: could we then conclude that my code has no bugs?

Again, of course, no. Having one hundred (or a thousand) unit tests tells me nothing about whether my code has been exercised against all possible inputs, under every possible operating condition it might face.

Code coverage metrics seem to be popular; suppose I took a different approach & wrote \(n\) unit tests such that every line of the program was exercised during testing. Could I then conclude that my code has no bugs?

Again, of course, no: one hundred percent code coverage means… well, that you've executed every line of code over the course of your test suite. It still tells us nothing about whether or not the code has been tested against every possible input, or whether it has been tested under every possible operating condition it may face.

I'm not really sure what we could conclude from one hundred percent code coverage. Code coverage is, I suggest, something we measure because we can measure it, not because it's particularly dispositive.
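
To make that last point concrete, here's a contrived sketch (the function & test below are hypothetical, not from datavault-tool): the single test executes every line of average, giving one hundred percent line coverage, yet the division by zero waiting at count == 0 is never exercised.

#include <assert.h>

/* Every line of this function is executed by the single test below,
   yet it still divides by zero whenever count is 0. */
int average(int total, int count) {
  return total / count;
}

int main(void) {
  assert(2 == average(6, 3));  /* 100% line coverage -- and the bug remains */
  return 0;
}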

The problem is that, in the words of Dijkstra, testing can only expose the presence of bugs, not prove their absence.

Intuition

The idea of exercising our code against all possible inputs is appealing, and even feasible for simple cases.

/* Increment & return the input value, *unless* it is 255, in
   which case it shall be returned unaltered */
unsigned char capped_inc(unsigned char x) {
  if (255 != x) {
    ++x;
  }
  return x;
}

We can prove that this function works correctly through testing by simply exercising it against every possible input:

#include "capped_inc.h"
void main() {
    assert(  1 == capped_inc(  0));
    assert(  2 == capped_inc(  1));
    assert(  3 == capped_inc(  2));
    ...
    assert(255 == capped_inc(254));
    assert(255 == capped_inc(255));
}

This is, however, the exception that proves the rule: let's change capped_inc to work on unsigned long long & target a platform on which that type is 64 bits wide:

/* Increment & return the input value, *unless* it is 2^64-1 :=
   ULONGLONG_MAX, in which case it shall be returned unaltered */
unsigned long long capped_inc_2(unsigned long long x) {
  if (ULONGLONG_MAX != x) {  // line 1
    ++x;                     //      2
  }                          //      3
  return x;                  //      4
}

Our approach to proving that there are no bugs in this code breaks down: it now requires 18,446,744,073,709,551,616 assertions (over eighteen quintillion), and the test driver containing them would balloon to hundreds of exabytes.
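
To put a rough figure on that (assuming, say, forty bytes per assertion line, a back-of-the-envelope guess on my part): \(2^{64} \approx 1.8 \times 10^{19}\) assertions at \(40\) bytes apiece comes to roughly \(7 \times 10^{20}\) bytes.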

And yet… the structure of the function didn't really change from one we know to be correct. All that changed was the domain of input over which the function is defined, and it didn't change in a way that's salient to us (we're still operating on unsigned integers, with addition defined modulo \(2^N\) for some \(N\)). We can read the implementation and "just see" that it is correct; we don't really need to mechanically check every possible input.

Let's make that intuition more concrete:

Claim: capped_inc_2 works correctly for all values of unsigned long long on platforms where unsigned long long is 64 bits in size and unsigned integer arithmetic is modulo \(2^{64}\).

Proof: \(x\) is a 64-bit unsigned long long, therefore \(0 \le x \le \mathcal{M}\), where \(\mathcal{M} := 2^{64}-1\).

Cases:

  1. \(0 \le x < \mathcal{M}\): the test on line 1 evaluates to true, so line 2 is executed. \(x\) is less than the maximum representable value of unsigned long long, so the increment will not wrap around. Therefore, on line 4, the value of \(x\) will be one greater than what it was on entry.
  2. \(x = \mathcal{M}\): the test on line 1 evaluates to false, so \(x\) is returned unchanged, in accordance with the definition of capped_inc_2.

\(\blacksquare\)

Our correctness proof has collapsed from eighteen quintillion lines to seven (or thereabouts). Instead of mechanically exercising the code against every possible input, we were able to reason about different subsets of the universe of all possible inputs and for each subset prove that the code did what we wanted it to do. That's very different from failing to expose any bugs and hoping we didn't miss any.

Next Steps

This was an example contrived to prove a point: that reasoning about code can yield results that testing cannot. Doing this with real-world, production code is far more complicated (and fascinating).

I see two broad approaches to reasoning about code. I'll call them the formal approach & the informal approach. In the former, we use the techniques of formal reasoning to get a computer program to prove propositional statements about our code. In the latter, we design our code in such a way as to enlist the compiler in checking its properties.
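
As a small, hypothetical illustration of the informal approach (the function below is invented for this post, not taken from any real code base): declaring the buffer parameter const turns "this function never modifies its input" into a property the compiler checks for us, rather than one we have to test for or take on faith.

#include <stddef.h>

/* Sum the bytes in a buffer. The `const` qualifier makes "does not
   modify the buffer" a compile-time guarantee. */
unsigned long sum_bytes(const unsigned char *buf, size_t len) {
  unsigned long total = 0;
  for (size_t i = 0; i < len; ++i) {
    total += buf[i];
    /* buf[i] = 0;  <-- the compiler would reject this accidental write */
  }
  return total;
}

That one checked property doesn't prove sum_bytes correct, of course, but it is one less thing we have to reason about by hand.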

Either way, I'd like to explore what I call "design for reasonability": analogous to "design for testability", it's the idea that we should write code about which it is easy to reason.

In my next post, I'll show a "real-world" win for the formal approach. In later posts, I hope to say more about both.
