madhadron

Easier, faster software testing

Status: Finished
Confidence: Very likely

We all know that we should test our code, but the slog of figuring out what tests to write, generating all the test data, and figuring out what to assert for each case can end up taking longer than writing the code.

And most of the advice on doing so isn’t that helpful. Most of what you find is information about continuous integration, or arguments about unit tests versus integration tests, or whether test driven development is the One True Way™. And all you want to know is how to not spend two hours writing boilerplate to test ten lines of code.

Picking test cases can take only a few minutes if you have a few heuristics, and there are techniques that make generating test data vastly simpler. A few coding guidelines dramatically shrink the number of tests you have to write. And while testing concurrent systems is never easy, it can be relatively straightforward.

There are also ways to make your code easier to test, which I’ve detailed elsewhere. The code in these lessons is all on GitHub.

Lessons:

Lesson 1. Getting started

All our examples are going to be in Python 3 and use only its standard library and the third party tool mypy. Python is the closest to pseudocode or a lingua franca that we have today.

So let’s start at the beginning. We have a function under test. Let’s choose something simple, like Boolean and.

def _and(x, y):
    return x and y

By convention in Python we put tests in a file with the same name but with a test_ prefix. So in test_andfunc.py, we write

import unittest
from andfunc import _and

class TestAnd(unittest.TestCase):
    def test_and(self):
        cases = [
            ((True, True), True),
            ((True, False), False),
            ((False, True), False),
            ((False, False), False),
        ]
        for (input, expected) in cases:
            with self.subTest(input=input):
                found = _and(input[0], input[1])
                self.assertEqual(expected, found)

The unittest library provides a base class called TestCase. Each test you write is a method of your subclass with a name starting with test_.

Tip: You will often have lots of test cases that differ only by their input data and the expected output. Rather than write lots of boilerplate to do the same test for each input, set up a table of cases as we have done here and iterate over it.

Note that we have used with self.subTest(...). Each call to this context manager will make the asserts that happen in its body appear as separate tests when you run the test suite. This is important because when a test fails, you want to be able to understand what failed as quickly as possible.

Much of the time people lose in testing is spent figuring out what is causing a failing test. Think about how to organize your test suite so that a single failure leads you straight to what is wrong.

On Unix systems you can run this test suite by running

> python3 -m unittest test_andfunc.py
test_and (test_andfunc.TestAnd) ... ok
--------------------------------------
Ran 1 test in 0.000s

OK

The next thing I recommend is to install and learn to use PyCharm, a Python IDE from JetBrains. The community edition is free, and it is worth learning the environment just for the time you will save by being able to single step through failing tests in the debugger in a nicely integrated environment.

Exercises

  1. Skim through the documentation for the unittest module.
  2. Install PyCharm Community Edition and learn how to set a breakpoint and run unit tests in the debugger.
  3. Try rewriting test_andfunc.py so each of the conditions is its own test method instead of using subTest. How much more awkward is this?

Lesson 2. Formal methods are your friend

This lesson begins with a lot of exposition, but I promise we’ll get to the meat soon.

There is a world view in parts of the programming community that the only true way to show program correctness is by formal proof. This reflects the fact that the major intellectual heritage of academic computer science is from mathematics and theoretical physics. This viewpoint will hold you back.

Our world is built on things known primarily from experiment. When an engineer designs a steam turbine, they aren’t deriving the properties of water from first principles, or indeed from any mathematical model. They have big tables of the measured properties of steam that they reference. Toxicologists don’t prove a safe dose of some toxin by formal proof. They do so by carefully selected experiments. Antibiotics were not a result of a proof from first principles, nor were SSRI antidepressants. In fact, we don’t really know how either work. Their entire basis as useful artifacts is demonstrated by experiments.

So even if we don’t have a formal proof of our program’s correctness, we can have high confidence in its correctness by testing correctly chosen cases.

Now, when I say “formal methods,” many people immediately think of automatic verification of the properties of an entire system, such as the seL4 project that provides an end-to-end verified microkernel. seL4 is amazing, but it took years of work by specialists. End-to-end proof isn’t the only use for formal methods. We all use formal methods every day to get us part of the way.

Consider structured programming. When it was created in the 1960s, it was considered too abstract for real world use, too much formal mathematics. Today we are so used to working with these formal methods that the spaghetti of goto statements it forbade is something young programmers never even imagine.

Here is another example of a formal method we use when we write our code. There is a classic source of off-by-one errors in classic C loops that iterate over arrays or other data structures:

int *xs;
int len;
...
for (int i = 0; i < len; i++) {
    printf("%d\n", xs[i]);
}

Did we get the length right? Was it supposed to be <= or <? With experience, programmers restrict themselves to specific patterns and don’t use the broader possibilities of the loop, or they switch to constructs that, like structured programming, mathematically guarantee that we can’t make this class of error, such as in C++,

for (auto it = xs.begin(); it != xs.end(); ++it) {
    printf("%d\n", *it);
}

or in Python, where this was made the default,

for x in xs:
    print(x)

So this leads us to our first principle:

Principle 1. Wring everything you easily can from formal methods before you start testing.

The quickest formal method, and one you can get enormous mileage from if you master it, is type checking. Let’s look at our tests from lesson 1 again. We can call the function

def _and(x, y):
    return x and y

with arguments of any type, whether booleans, strings, numbers, or more complicated data types. And in Python, this has well defined semantics. But say that we don’t want that. We want to operate on booleans. We could change the function to throw an exception,

def _and(x, y):
    if not isinstance(x, bool) or not isinstance(y, bool):
        raise ValueError("Arguments must be boolean")
    return x and y

but now we have to write more tests to check this behavior. During the slog of writing such tests, being able to say “Look, x is an int” in C or Java starts looking pretty nice.
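
Those extra tests would look something like this sketch (assuming the exception-raising version above is what now lives in andfunc.py):

import unittest
from andfunc import _and

class TestAndRejectsNonBooleans(unittest.TestCase):
    def test_rejects_non_booleans(self):
        # Each kind of bad input now needs its own case.
        cases = [(1, True), (True, 'yes'), (None, False), (0.5, 1.5)]
        for x, y in cases:
            with self.subTest(x=x, y=y):
                with self.assertRaises(ValueError):
                    _and(x, y)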

Fortunately, Python 3 has type annotation. We can write

def _and(x: bool, y: bool) -> bool:
    return x and y

By itself, this doesn’t do anything, but we can run the tool mypy to type check our code. Adding those “: bool” annotations is much faster to write, to read, and to maintain than writing tests for the same behavior.
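
For example, if we add a deliberately mis-typed call to andfunc.py and run mypy over it, we get errors without writing a single test. The exact wording and line numbers depend on your file and mypy version, but the output looks roughly like this:

# At the bottom of andfunc.py, a call with the wrong types:
result = _and(1, "yes")

> python3 -m mypy andfunc.py
andfunc.py:5: error: Argument 1 to "_and" has incompatible type "int"; expected "bool"
andfunc.py:5: error: Argument 2 to "_and" has incompatible type "str"; expected "bool"
Found 2 errors in 1 file (checked 1 source file)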

When we are testing code, we are doing so already assuming properties given to us by formal methods, and everything we can claw out of those formal methods will save us effort testing.

Exercises

  1. Write a few calls to _and in andfunc.py that are of the wrong types. Install mypy and run it on andfunc.py to see what it produces.
  2. Look up INTERCAL’s COME FROM statement. What would this do to being able to reason about your program’s behavior?

Lesson 3. Choosing test conditions

So, we’ve got our basic test framework set up. We know that we should get everything out of formal methods that we easily can to make our testing easier and faster. How do we actually choose the conditions that we test?

For _and, it’s easy. With the type annotations, there are only four possible inputs. We’ll test them all. But what about a function head that returns the first element of a list? We can’t test every possible list, and, intuitively, it would be incredibly wasteful even to try. So how do we choose which lists are important?

First, we use formal methods again. We annotate the types by creating a generic type variable with TypeVar.

from typing import List, TypeVar
T = TypeVar('T')

def head(xs: List[T]) -> T:
    if xs == []:
        raise ValueError('Cannot take head of empty list.')
    else:
        return xs[0]

The type of elements in the list is generic, so it doesn’t matter what type we use. We could even let T be NoneType, which has only one value (None), and use lists [], [None], [None, None], … .

That turns out not to be what we want because we would not be able to distinguish if we are accidentally taking the second or last element of the list instead of the first. For other functions, like finding the length of a collection, testing with NoneType is very useful.

For this case, though, we want a bigger type that we can distinguish values of, like int. But which values of List[int] are important?

This function only touches the first element of a list (if it exists). Passing a list of fifty-one elements to this function is going to be basically the same as passing a list of fifty-two elements or a hundred and two elements. On the other hand, the empty list, a list with one element, and a list with two elements are probably all things we want to test.

This is a pattern we will see again and again. If we consider the space of possible inputs to a function, most functions have small boundaries in that space where their behavior changes dramatically, connected by broad stretches where things are pretty much the same. This gives us an heuristic:

Principle 2: Carefully test the boundary. Sample the bulk.

With that in mind, let’s write a test plan for head.

import unittest
from head import head

class TestHead(unittest.TestCase):
    # Boundary:
    #   [] -> raises ValueError
    #   [522] -> 522
    #   [5, 2] -> 5
    # Bulk:
    #   [128, 53, 921, 1022, 41] -> 128
    #   [1, 5, 3, 12, 8, 1, 3, 2, 1, 5, 3, 19] -> 1
    def test_empty_list(self):
        with self.assertRaises(ValueError):
            # head behaves quite differently on [],
            # so we can't lump it in with the other
            # cases without being ridiculously
            # complicated.
            head([])

    def test_head(self):
        cases = [
            ([522], 522),
            ([5, 2], 5),
            ([128, 53, 921, 1022, 41], 128),
            ([1, 5, 3, 12, 8, 1, 3, 2, 1, 5, 3, 19], 1),
        ]
        for input, expected in cases:
            with self.subTest(input=input):
                found = head(input)
                self.assertEqual(expected, found)

The same approach works in general for other functions on containers. Consider the function

from typing import Dict, Hashable
from math import inf

def maxval(m: Dict[Hashable, int]) -> float:
    if m == {}:
        return -inf
    else:
        return max(m.values())

that returns the highest value in the provided dict. What should its test plan be?

(Note: The Hashable for keys is a detail of Python, which requires that dictionary keys be hashable so they can be put into a hashmap. Hashable is the most general type definition that can be a key.)

The boundary obviously includes the empty dict {}, and dicts with one and two elements. We want to handle the two element cases when they have equal values and when they have different values. Then we take some large dicts out in the bulk. So we have something like:

import unittest
from maxval import maxval
from math import inf

class TestMaxval(unittest.TestCase):
    # Boundary:
    #   {} → -inf
    #   {'a': 1} → 1
    #   {'a': 1, 'b': 1} → 1
    #   {'a': 1, 'b': 2} → 2
    # Bulk:
    #   {'a': 5, 'b': 12, 'c': 42, 'd': 961, 'e': 128} → 961
    #   {'boris': 400, 'blah': 999, 'quirk': 421} → 999
    def test_maxval(self):
        cases = [
            ({}, -inf),
            ({'a': 1}, 1),
            ({'a': 1, 'b': 1}, 1),
            ({'a': 1, 'b': 2}, 2),
            ({'a': 5, 'b': 12, 'c': 42, 'd': 961, 'e': 128}, 961),
            ({'boris': 400, 'blah': 999, 'quirk': 421}, 999),
        ]
        for input, expected in cases:
            with self.subTest(input=input):
                found = maxval(input)
                self.assertEqual(expected, found)

Boundary and bulk will carry us a long way, through parse trees and recursive structures and even to describing the behavior of external services. But before we do that, we need to talk about functions with more than one parameter.

Exercises

  1. Write a boundary and bulk test plan for square(x: float) -> float that squares its input. Don’t forget to test overflow!
  2. Write a boundary and bulk test plan for length(xs: List[T]) -> int that returns the length of the list.

Lesson 4. Multiparameter functions

All the functions we have been testing so far have had one parameter. That obviously won’t do. How do we handle functions with two or more parameters? The field of statistical design of experiments provides us with the general answer, called a “factorial design.”

Factorial designs first appeared in agricultural experiments. Say we have three kinds of fertilizer, two watering schedules, and two kinds of seeds. How do we find out what effect each of them has? Until the early 20th century, scientists would have said to fix two of them and vary only one factor at a time.

It turns out that this isn’t the best way to do it. If the fertilizer you happen to choose when you hold it fixed kills the plants, then you learn nothing about watering schedule. Maybe one seed variety likes one watering schedule better and the other seed variety prefers the other schedule. For almost any sensible situation you might encounter, the best way is to do all combinations rather than vary only one factor at a time.

When we’re testing software, we do the same thing. We write down boundary and bulk test plans for each parameter, and then test all combinations of test cases for the parameters.

For example, consider the function

from typing import List, Tuple, TypeVar
T = TypeVar('T')
U = TypeVar('U')

def zip(xs: List[T], ys: List[U]) -> List[Tuple[T,U]]:
    n = min(len(xs), len(ys))
    result = []
    i = 0
    while i < n:
        result.append((xs[i], ys[i]))
        i += 1
    return result

The first element of its output is a tuple of the first elements of its two input lists, the second element is a tuple of the second elements of its inputs, etc. The output list has the length of the shortest input list.

We write down a boundary and bulk test plan for each input, such as xs equals [], [5], [3, 8], [1, 12, 5, 32, 9, 1, 3, 17], and ys equals [], [''], ['a', 'b'], ['vrrp', 'meep', 'baboon', 'trilobyte']. Our test for zip is then

import unittest
from zip import zip

class TestZip(unittest.TestCase):
    def test_zip(self):
        xs_cases = [
            [], [5], [3, 8], 
            [1, 12, 5, 32, 9, 1, 3, 17],
        ]
        ys_cases = [
            [], [''], ['a', 'b'], 
            ['vrrp', 'meep', 'baboon', 'trilobyte'],
        ]
        cases = [(xs, ys) for xs in xs_cases for ys in ys_cases]
        for xs, ys in cases:
            with self.subTest(xs=xs, ys=ys):
                found = zip(xs, ys)
                n = min(len(xs), len(ys))
                self.assertEqual([v[0] for v in found], xs[:n])
                self.assertEqual([v[1] for v in found], ys[:n])

So, this is wonderful, right? But there’s a problem. This is sixteen cases. If we have three parameters each with this many test cases, we have 64 cases. Four parameters gives us 256 cases. If we have more complicated boundaries and more cases for each parameter, it grows even faster. This is the essential problem with factorial designs: the number of test cases grows very fast!

In the field of statistics there are some clever things like fractional factorial designs to slow this growth, but the most effective ways to manage this growth when testing software are:

  1. Prove that some parameters don’t interact, so we don’t need to test all combinations of them.
  2. Alter the code to shrink the input space and the number of required test cases.

We’ll address (1) now, and (2) throughout the rest of this course.

Proving that some parameters don’t interact is like knowing that none of the fertilizers kill the plants in our introduction of factorial designs, or that the different varieties respond to watering schedules in the same way. For a scientist studying nature, there is no way to make that assertion before doing the experiment. But remember our first principle: wring everything you easily can from formal methods before you start testing.

Consider a function to print out the little lists of pages with forward and back arrows that search engines show.

import math

def format_pager(
    current_item: int, total_items: int, items_per_page: int,
    left_arrow: str, right_arrow: str
) -> str:
    current_page = math.floor(current_item / items_per_page) + 1
    last_page = math.ceil(total_items / items_per_page)

    numbers = ' '.join(
        f'*{i}*' if i == current_page else f'{i}'
        for i in range(1, last_page+1)
    )

    result = ""
    if current_page != 1:
        result += left_arrow + ' '
    result += numbers
    if current_page != last_page:
        result += ' ' + right_arrow

    return result

For example,

>>> format_pager(5, 25, 4, "<", ">")
"< 1 *2* 3 4 5 6 >"
>> format_pager(1, 5, 4, "@<<", ">>")
"*1* 2 >>"

We can study the code for format_pager and satisfy ourselves that the parameters left_arrow and right_arrow don’t interact with the other three parameters. The other three affect only whether the left or right arrows appear, not what they are. What they are is unrelated to if they appear.

With this assertion, we replace our five parameter factorial design with two smaller factorial designs. We

  1. Fix the values of left_arrow and right_arrow and test current_item, total_items, and items_per_page.
  2. Fix current_item, total_items, and items_per_page, and test left_arrow and right_arrow.
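
A rough sketch of what those two smaller designs might look like, assuming format_pager lives in a module called pager and using only a few illustrative cases per parameter (a full plan would use the boundary and bulk cases for each):

import unittest
from itertools import product
from pager import format_pager  # module name assumed for illustration

class TestFormatPager(unittest.TestCase):
    def test_paging_parameters(self):
        # Factorial design over the three paging parameters,
        # with the arrows held fixed.
        cases = [
            # (current_item, total_items, items_per_page, expected)
            (0, 1, 4, '*1*'),
            (0, 5, 4, '*1* 2 >'),
            (5, 25, 4, '< 1 *2* 3 4 5 6 7 >'),
        ]
        for current, total, per_page, expected in cases:
            with self.subTest(current=current, total=total,
                              per_page=per_page):
                found = format_pager(current, total, per_page, '<', '>')
                self.assertEqual(expected, found)

    def test_arrows(self):
        # Factorial design over the arrow strings, with the paging
        # parameters held fixed on a page that shows both arrows.
        for left, right in product(['<', '<<', '(prev)'],
                                   ['>', '>>', '(next)']):
            with self.subTest(left=left, right=right):
                found = format_pager(5, 25, 4, left, right)
                self.assertTrue(found.startswith(left + ' '))
                self.assertTrue(found.endswith(' ' + right))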

We can make a very rough calculation of how many test cases this saves us by assuming that we have N cases for each parameter. Obviously some parameters will have more and some fewer, but, like when doing complexity analysis of algorithms or data structures with big-O notation, we only care about the exponent, not the exact value out front.

If we do a factorial design in 5 parameters, we have N^5 test cases. If we do the two factorial designs with our assertion of independence, we have N^3 + N^2. For a reasonable value of N like 5, 5^5 = 3,125 while 5^3 + 5^2 = 150. One small proof reduced our number of test cases by a factor of 20.

Exercises

  1. Write a function zip3 that zips three lists instead of two, and tests for it. How big is the factorial design?
  2. Write a function zipn that takes an arbitrary number of lists and zips them into tuples. Write a test for it. Hint: do boundary and bulk on the number of lists and then on the contents of each list.

Lesson 5. Testing recursive problems

Choosing boundary and bulk is fairly obvious for primitive types like integers and floats and for containers like lists, dicts, and sets. In this lesson we’ll extend it to functions on recursive types like parse trees.

The code for this becomes unwieldy (plus the implementation needs a tokenizer, so we have other code for that), so rather than include it here, we will go through the process of selecting test cases, and I will refer you to parse.py and test_parse.py to study the implementation.

Let’s define a little Lisp-style arithmetic language to use for our examples. Its EBNF grammar is given by

Number := [0-9]+ ;
Operator := '+' | '-' | '*' | '/' ;
OperatorExpression := '(' Operator Expression Expression ')' ;
Expression := OperatorExpression | Number ;

So the following are all valid expressions in this language:

5
(+ 5 12)
(- (* 2 3) (/ 10 2))
(* 1 (+ 1 (+ 1 (+ 1 1))))

We want to test a parse function, and we’ll denote their parsed values by something like

5 
→ Number(5)

(+ 5 12)
→ OperatorExpression(Operator(+), Number(5), Number(12))

(- (* 2 3) (/ 10 2))
→ OperatorExpression(
    Operator(-),
    OperatorExpression(Operator(*), Number(2), Number(3)),
    OperatorExpression(Operator(/), Number(10), Number(2))
  )

(* 1 (+ 1 (+ 1 (+ 1 1))))
→ OperatorExpression(
    Operator(*),
    Number(1),
    OperatorExpression(
      Operator(+),
      Number(1),
      OperatorExpression(
        Operator(+),
        Number(1),
        OperatorExpression(Operator(+), Number(1), Number(1))
      )
    )
  )

The size of the results in the examples above may seem alarming if you’re not used to parse trees. They sprawl ominously. Don’t be afraid of them. They’re just awkward and large.

What is the boundary and what is the bulk? In recursive expressions, we start from our top level expression and exercise each clause in it. The bulk is largely about making sure that recursion stays uniform, and we assume that it constitutes a bulk because we will write our recursive algorithms as recursive code.

Expression was defined with two clauses,

Expression := OperatorExpression | Number

We’ll take Number first since it’s quickly dealt with, with something like 5 and 52366.

Then for OperatorExpression, defined as '(' Operator Expression Expression ')', we need to exercise both clauses of Expression for both times it appears, or

OperatorExpression(Operator, Number, Number)
OperatorExpression(Operator, OperatorExpression, Number)
OperatorExpression(Operator, Number, OperatorExpression)
OperatorExpression(
    Operator, OperatorExpression, OperatorExpression)

We stop there, since whether the sub-OperatorExpressions contain numbers or further opexprs doesn’t change our pattern matching at this level. We test this boundary with these five cases and with all four operators. We shrink this a bit by not doing all possible combinations of operators, and get the test plan:

5
52366
(+ 5 2) 
(- 1 3)
(* 4 1)
(/ 8 4)
(+ (- 3 2) (* 5 1))
(* (/ 12 4) 52)
(- 96 (+ 1 2))

Then we choose a couple of sprawling expressions in the bulk:

(* (/ (- 63 8) (+ (* 2 3) 8)) (+ (* 4 3) (* 2 12)))
(/ (+ 3 (* 2 4)) (+ 1 (+ 1 (+ 1 1))))

We can’t use type annotations or any other formal method to verify that our parser’s inputs are valid. The only way to check that is to parse and find out. So we also need to include test cases that will fail. At a minimum this will be an invalid string like 'a!5kja', but if you start defining the error messages that should result from failed parsing (as most production quality parsers need), you will end up with a much more comprehensive set.

We can write down boundary and bulk for parse errors as well. Again we start with our top level Expression. We choose examples that will not match either clause of Expression, such as 'abc'. Then we create an example that will start to match the first clause, OperatorExpression, and fail after initial match, such as (1. We do the same for Number, e.g., 2b.

As a technical detail of parsing, we need to distinguish between an unexpected character (one that does not belong in any valid input) and an unexpected token (characters that could be valid, but not where they are found). (1 has an unexpected token. 2b has an unexpected character.

Anyway, this first level we probe with the test cases:

abc
(1
2b

Next we descend. Number isn’t recursive, so it is fully covered by what we have already done. For OperatorExpression, we take each of its subexpressions in turn and generate examples where parsing fails, either with an unexpected character or an unexpected token, or, in the case of the final parenthesis, with a missing token.

Working through them left to right, invalid character then invalid token for each, we get the list of cases:

(! 3 5) # (first subexpression, invalid character)
(25 3 5) # (first subexpression, invalid token)
(+ abc 52)
(+ * 52)
(+ (* a 12) 2)
(+ (* + 12) 2)
(+ 5 abc)
(+ 5 *)
(+ 5 (* ! 5))
(+ 52 (9 1))
(+ 52 12

Again, we don’t recurse further because we have covered the structural cases for this first level. Then we generate some bulk entries where the syntax error is buried deep.

(+ (+ 1 (+ abc)) 12)
(+ (+ 1 (+ abc) 12)
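
A sketch of how these cases might be wired up, assuming parse.py exposes a parse function that returns the tree and raises a ParseError on invalid input (the actual names in the repository may differ):

import unittest
from parse import parse, ParseError  # names assumed for illustration

VALID = [
    '5', '52366',
    '(+ 5 2)', '(- 1 3)', '(* 4 1)', '(/ 8 4)',
    '(+ (- 3 2) (* 5 1))', '(* (/ 12 4) 52)', '(- 96 (+ 1 2))',
    '(* (/ (- 63 8) (+ (* 2 3) 8)) (+ (* 4 3) (* 2 12)))',
    '(/ (+ 3 (* 2 4)) (+ 1 (+ 1 (+ 1 1))))',
]

INVALID = [
    'abc', '(1', '2b',
    '(! 3 5)', '(25 3 5)', '(+ abc 52)', '(+ * 52)',
    '(+ (* a 12) 2)', '(+ (* + 12) 2)', '(+ 5 abc)', '(+ 5 *)',
    '(+ 5 (* ! 5))', '(+ 52 (9 1))', '(+ 52 12',
    '(+ (+ 1 (+ abc)) 12)', '(+ (+ 1 (+ abc) 12)',
]

class TestParse(unittest.TestCase):
    def test_valid_expressions(self):
        for source in VALID:
            with self.subTest(source=source):
                # For the boundary cases we would also assert the
                # exact tree; here we only check parsing succeeds.
                parse(source)

    def test_invalid_expressions(self):
        for source in INVALID:
            with self.subTest(source=source):
                with self.assertRaises(ParseError):
                    parse(source)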

The final list of tests is fairly lengthy, but we have constructed it completely systematically. There was no staring into space and hoping to dream up a counterexample. Indeed, it’s so systematic that you might think, “Why don’t we have a computer do this?” And for parsers we often do…or just have the computer generate the parser itself and not bother with this, as ANTLR does. The approach generalizes to other recursive problems, though, so it was worth working your way through.

Lesson 6. Structs and records

Say we have the namedtuple (Python’s version of structs)

from typing import NamedTuple, List, Tuple

class ColoredLineSegments(NamedTuple):
    color: int # 0 to 15
    vertices: List[Tuple[float, float]]

And we have a function renderSvg that takes the shape and returns an SVG string of it,

<<cls>>

def renderSvg(shape: ColoredLineSegments) -> str:
    CGA_COLORS = [
        'black', 'blue', 'green', 'cyan', 'red',
        'magenta', 'brown', 'lightgray', 'darkgray',
        'lightblue', 'lightgreen', 'lightcyan',
        'lightred', 'lightmagenta', 'yellow', 'white',
    ]
    if shape.color < 0 or shape.color > 15:
        raise ValueError(f'Color must be in [0, 15], found {shape.color}')

    vertices_str = ' '.join(
        f'{x},{y}' for (x, y) in shape.vertices
    )
    polyline = f'<polyline points="{vertices_str}" stroke="currentColor" />'
    return f'<g color="{CGA_COLORS[shape.color]}">{polyline}</g>'

This kind of record type is ubiquitous, and admits just as systematic an approach as the recursive types we saw in the last lesson.

We could simply rewrite renderSvg as

def renderSvg(color: int, vertices: List[Tuple[float, float]]) -> str:
    ...

and reduce this to a multiparameter function, with our usual path through factorial designs. Let’s work through the example.

For color, our boundary is the edges of the valid color range and the values just outside it (-1, 0, 15, 16); we’ll save a color from the middle of the range for the bulk.

For vertices, we obviously want the empty list, and a single point, but we go out to three points so we can actually see two segments being rendered: [], [(3, 5)], [(1, 0), (0, 1)], [(5, 12), (3, 16), (8, 2)].

Looking at what renderSvg does, we can satisfy ourselves that color and vertices don’t interact, so we don’t need to do a factorial design over both. We can hold one constant and work through the test cases for the other. In practice we’ll vary the one we hold constant a bit just to make our test coverage a little more robust, but it’s not necessary.

So, our boundary is:

ColoredLineSegments(-1, [(1,0), (0,1)])
→ error
ColoredLineSegments(16, [(1,0), (0,1)])
→ error
ColoredLineSegments(0, [])
→ '''
  <g color="black">
    <polyline points="" stroke="currentColor" />
  </g>
  '''
ColoredLineSegments(15, [(3, 5)])
→ '''
  <g color="white">
    <polyline points="3,5" stroke="currentColor" />
  </g>
  '''
ColoredLineSegments(0, [(1, 0), (0, 1)])
→ '''
  <g color="black">
    <polyline points="1,0 0,1" stroke="currentColor" />
  </g>
  '''
ColoredLineSegments(15, [(5, 12), (3, 16), (8, 2)])
→ '''
  <g color="white">
    <polyline points="5,12 3,16 8,2" stroke="currentColor" />
  </g>
  '''

and then some bulk with other colors and a lot of points

ColoredLineSegments(4, [(1,0), (1,1), (2, 3), (512, 368), (8, 62)])
→ '''
   <g color="red">
     <polyline points="1,0 1,1 2,3 512,368 8,62" stroke="currentColor" />
   </g>
   '''

So our full test code is,

import unittest
from rendersvg import renderSvg, ColoredLineSegments

CASES = [
    (
        ColoredLineSegments(0, []),
        (
            '<g color="black">'
            '<polyline points="" stroke="currentColor" />'
            '</g>'
        )
    ),
    (
        ColoredLineSegments(15, [(3, 5)]),
        (
            '<g color="white">'
            '<polyline points="3,5" stroke="currentColor" />'
            '</g>'
        )
    ),
    (
        ColoredLineSegments(0, [(1, 0), (0, 1)]),
        (
            '<g color="black">'
            '<polyline points="1,0 0,1" stroke="currentColor" />'
            '</g>'
        )
    ),
    (
        ColoredLineSegments(15, [(5, 12), (3, 16), (8, 2)]),
        (
            '<g color="white">'
            '<polyline points="5,12 3,16 8,2" stroke="currentColor" />'
            '</g>'
        )
    ),
    (
        ColoredLineSegments(4, [(1,0), (1,1), (2, 3), 
                                (512, 368), (8, 62)]),
        (
            '<g color="red">'
            '<polyline points="1,0 1,1 2,3 512,368 8,62" '
            'stroke="currentColor" />'
            '</g>'
        )
    )
]

class TestRenderSvg(unittest.TestCase):
    def test_errors(self):
        cases = [
            ColoredLineSegments(-1, [(1,0), (0,1)]),
            ColoredLineSegments(16, [(1,0), (0,1)]),
        ]
        for input in cases:
            with self.subTest(input=input):
                with self.assertRaises(ValueError):
                    renderSvg(input)

    def test_render(self):
        for input, expected in CASES:
            with self.subTest(input=input, expected=expected):
                self.assertEqual(expected, renderSvg(input))

Exercises

  1. Define a namedtuple Person with two fields name and age. Write a function that takes a List[Person] and returns the youngest person in the list. In case of ties it should return the one whose name is earlier in the alphabet. Write and implement a test plan for the function.
  2. Given the namedtuple GreetingSpec,
class GreetingSpec(typing.NamedTuple):
    salutation: str
    recipient: Person # This is the namedtuple from exercise 1
    punctuation: str

write a function that turns it into a string, e.g., GreetingSpec("Hello", Person("Boris", 42), "!") should return "Hello, Boris!". Write and implement a test plan for the function.

Lesson 7. More complicated boundaries

This lesson isn’t going to introduce anything new, just explore a couple of more complicated examples of what we have been doing. All the boundaries we have looked at so far have been fairly simple; the examples were chosen that way to make the ideas easy to grasp. Now let us look at some messier cases.

Consider a function that takes a list of names and formats it into a usable string for a limited amount of space.

from typing import List

def format_names(names: List[str]) -> str:
    if names == []:
        return "No one"
    elif len(names) == 1:
        return names[0]
    elif len(names) == 2:
        return f'{names[0]} and {names[1]}'
    elif len(names) == 3:
        return f'{names[0]}, {names[1]}, and {names[2]}'
    elif len(names) < 26:
        return f'{names[0]}, {names[1]}, and {len(names)-2} others'
    else:
        return f'{len(names)} people'

Some example output from this function:
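
>>> format_names([])
"No one"
>>> format_names(['Jeff Y', 'Jane P'])
"Jeff Y and Jane P"
>>> format_names(['Jeff Y', 'Jane P', 'Bill J', 'Eddy Q'])
"Jeff Y, Jane P, and 2 others"
>>> format_names(['Person ' + str(i) for i in range(30)])
"30 people"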

As the code shows, it switches to the last form at 26 names. We have two boundaries: the first runs from the empty list up to four names, and the second is at twenty-five and twenty-six names. With those boundaries, we have two bulks to sample. With that we have the tests

from format_names import format_names
import unittest

class TestFormatNames(unittest.TestCase):
    def test_format_names(self):
        cases = [
            # First boundary, around the empty list
            ([], 'No one'),
            (['Jeff Y'], 'Jeff Y'),
            (['Jeff Y', 'Jane P'], 'Jeff Y and Jane P'),
            (
                ['Jeff Y', 'Jane P', 'Bill J'], 
                'Jeff Y, Jane P, and Bill J',
            ),
            (
                ['Jeff Y', 'Jane P', 'Bill J', 'Eddy Q'], 
                'Jeff Y, Jane P, and 2 others',
            ),
            # First bulk
            (
                ['Jeff Y', 'Jane P', 'Bill J', 'Eddy Q',
                 'Morris V', 'Edna M']*2, 
                'Jeff Y, Jane P, and 10 others',
            ),
            (
                ['Jeff Y', 'Jane P', 'Bill J', 'Eddy Q',
                 'Morris V', 'Edna M']*3, 
                'Jeff Y, Jane P, and 16 others',
            ),
            # Second boundary
            (
                ['Jeff Y', 'Jane P', 'Bill J', 
                 'Eddy Q', 'Morris V']*5,
                 'Jeff Y, Jane P, and 23 others'),
            (
                ['Jeff Y', 'Jane P', 'Bill J', 
                 'Eddy Q', 'Morris V']*5 + ['Adam S'],
                '26 people',
            ),
            # Second bulk
            (['Jeff Y', 'Jane P']*50, '100 people'),
            (['Jeff Y', 'Jane P']*500, '1000 people'),
        ]
        for (input, expected) in cases:
            with self.subTest(input=input):
                found = format_names(input)
                self.assertEqual(expected, found)

You will often want to have upper boundaries as well. For example, consider a function that prints a list of numbers 1 to n, no more than 20 characters long, e.g.,

>>> f(4)
"1 2 3 4"
>>> f(9)
"1 2 3 4 5 6 7 8 9"
>>> f(100)
"1 2 3 4 5 6 ... 100"

Its implementation is

def abbrev_range(n: int) -> str:
    if n <= 10:
        return ' '.join(str(x) for x in range(1, n+1))
    elif len(str(n)) < 14:
        suffix = f'... {n}'
        prefix_len = 20 - len(suffix)
        return ''.join(
            f'{x} ' for x in range(1, prefix_len//2 + 1)
        ) + suffix
    else:
        # Our final number is too long to be shown as
        # 1 ... n in 20 characters.
        raise ValueError('Overflow')

There is a lower boundary for f(0), f(1), and f(2). There’s a boundary when we reach 20 characters and shift to having ... in the string. And there’s a boundary when the final number is so long that even 1 ... n won’t fit in 20 characters. And there are two bulks in between these three boundaries. Our tests are

from abbrev_range import abbrev_range
import unittest

class TestAbbrevRange(unittest.TestCase):
    def test_abbrev_range(self):
        cases = [
            # First boundary
            (0, ''),
            (1, '1'),
            (2, '1 2'),
            # First bulk
            (8, '1 2 3 4 5 6 7 8'),
            # Second boundary
            (10, '1 2 3 4 5 6 7 8 9 10'),
            (11, '1 2 3 4 5 6 7 ... 11'),
            (12, '1 2 3 4 5 6 7 ... 12'),
            # Second bulk
            (35, '1 2 3 4 5 6 7 ... 35'),
            (1024, '1 2 3 4 5 6 ... 1024'),
            # Third boundary
            (10**12, '1 ... 1000000000000'),
            # Then overflows under test_overflow
        ]
        for (input, expected) in cases:
            with self.subTest(input=input):
                found = abbrev_range(input)
                self.assertEqual(expected, found)

    def test_overflow(self):
        inputs = [10**13, 10**14, 10**17, 10**28]
        for input in inputs:
            with self.subTest(input=input):
                with self.assertRaises(ValueError):
                    abbrev_range(input)
            

For those with a mathematical bent, naive numerical algorithms often have surprising boundaries or limits to their convergence. For example, people doing simulations in statistical physics regularly do a particular calculation for the probability of a state. It takes the energy levels of the possible states and the system’s temperature (in Kelvin), and returns the probability of each state.

import numpy as np

def gibbs(energies: np.ndarray, temperature: float) -> np.ndarray:
    KB = 1.38064852e-23 # Boltzmann's constant in Joules/Kelvin
    partition = np.exp(-1 * energies / (KB*temperature))
    return partition / sum(partition)

For energies on the order of 10^-19 joules, this works fine. 273 kelvin is the freezing point of water, so it is a reasonable temperature to test with.

>>> gibbs(np.array([1e-19, 1.001e-19, 1.002e-19]), 273)
array([0.34221507, 0.33325514, 0.32452979])

But if we pass in energies on the order of 1 joule,

>>> gibbs(np.array([1, 1.017, 1.002]), 273)
RuntimeWarning: invalid value encountered in true_divide
array([nan, nan, nan])

If you dig into this, you will find that exp(-1 * E / (KB*temperature)) == 0.0 for E == 1 and temperature == 273 in Python. For E == 1, the argument of exp is around -2.7 × 10^20. The exp function underflows for such large negative arguments and gives us 0. Mathematically, the expression is perfectly well defined, as we can see by adjusting it. If we multiply top and bottom by exp(min(energies) / (KB*temperature)), we get the code

<<gibbs>>

def better_gibbs(
      energies: np.ndarray, temperature: float
    ) -> np.ndarray:
    KB = 1.38064852e-23 # Boltzmann's constant in Joules/Kelvin
    min_energy = energies.min()
    partition = np.exp(-1*(energies-min_energy)/(KB*temperature))
    return partition / sum(partition)

Now we have

>>> better_gibbs(np.array([1e-19, 1.001e-19, 1.002e-19]), 273)
array([0.34221507, 0.33325514, 0.32452979])
>>> better_gibbs(np.array([1, 1.017, 1.002]), 273)
array([1., 0., 0.])

Finding the boundaries for numerical functions is one of the basic ends of the field called numerical analysis, and well beyond the scope of this course. If you’re interested, I highly recommend Acton’s Real Computing Made Real as a delightful starting point.

Taking this further, the methods described here fail miserably when trying to build quality checks for machine learning systems. Unlike what we think of as well designed programs, neural nets and other such systems have lots of boundaries tracing through their input spaces in ways that are not obvious from examination or possibly not even reproducible between different networks trained on the same training data. Testing for machine learning requires a whole different set of tools.

Exercises

  1. In many languages, the conventional greeting depends on the time of day. For example, in Italian you would say “buon giorno” until after lunch, then “buona sera” for the rest of the day. Write a function that takes a time and a greeting schedule for a language and returns the proper greeting, then write and implement a test plan for it.
  2. Optional: For extra fun, add formality to the mix. In Italian, “ciao” is the informal greeting no matter the time, and “salve” is the super-formal greeting no matter the time.
  3. Optional: Write and test a function that finds the fastest route between two points on a network of roads. Then, to make the boundaries fuzzier, give each road segment a range of time it takes to traverse.

Lesson 8. “Hidden” parameters

We have been looking only at the explicit parameters of functions in our test plans, but many functions have other, implicit parameters. None of us would be surprised to dig into a project and find a function like

def query(id: int):
    DB_HOST = os.getenv('DB_HOST')
    conn = db.connect(DB_HOST)
    ... # Query for the data we want

Naively, this function would have a single parameter in its test suite, but no one would think that was satisfactory. At the very least we have to include the DB_HOST environment variable, the database connection, and the contents of the database.

If you stop to think about it, there are a remarkable number of hidden parameters that can show up: environment variables, the current time, the contents of the filesystem or a database, references to stdout, stderr, and stdin, errno, and the state of any external service you talk to.

It’s often worth adjusting your code to make a hidden parameter explicit. For example, instead of having the query function above connect to the database, pass the database connection as a parameter.

def query(conn: db.Connection, id: int):
    ...

This reduces the hidden parameters from the environment variable, the database connection, and the database contents previously to just the database contents. Some other code must deal with the environment variable and the database connection, but, as we have seen time and again, N^2 + N^2 is generally much less than N^4, so a pair of independent, two parameter functions is easier to test than one four parameter function.

Now, the database contents are still implicit, and will remain implicit. You can’t make everything explicit. And sometimes it becomes impractical, even for things you could make explicit. Are you going to pass a time function, a set of environment variables, references to stdout, stderr, and stdin, and errno as parameters to every function? That way lies madness.

We will return to the question of how to set up external state like what is in the database in the next lesson, but for now we need to be able to override the behavior of one of these hidden parameters. In fairly dynamic languages like Python we can patch the references to these. For example, if we have a function that sleeps 5 seconds before retrying,

import urllib.request
import time

def fetch(url: str) -> bytes:
    retries = 0
    while retries < 3:
        try:
            with urllib.request.urlopen(url) as r:
                return r.read()
        except:
            retries += 1
            time.sleep(5)    
    raise ValueError(f'Could not fetch from URL {url}.')

We can set up a local HTTP server in our test and have it return two errors and then a successful request to test retries, but do we really want such a basic test to spend ten seconds sleeping?

Instead we can use patch to override time.sleep and check for the number of times it was called instead. The full test code that sets up an HTTP server, queries it, and checks for two calls to sleep is:

from unittest.mock import patch, call
import unittest
import http.server
from threading import Thread
import fetch

class TestHandler(http.server.BaseHTTPRequestHandler):
    calls = 0
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def do_GET(self):
        if TestHandler.calls < 2:
            self.send_response(500, 'ERROR')
            self.end_headers()
        else:
            self.send_response(200, 'OK')
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            self.wfile.write(b'Test')
        TestHandler.calls += 1

class TestFetch(unittest.TestCase):
    @patch('time.sleep')
    def test_fetch(self, mock_sleep):
        # Start an HTTP server to query
        PORT = 9095
        httpd = http.server.HTTPServer(("127.0.0.1", PORT), TestHandler)
        def serve():
            httpd.serve_forever(0.1)
        t = Thread(target=serve)
        t.start()

        # Try fetching
        try:
            url = f'http://{httpd.server_address[0]}:{httpd.server_address[1]}'
            found = fetch.fetch(url)
        finally:
            httpd.shutdown()
            t.join()

        self.assertEqual(b'Test', found)
        mock_sleep.assert_has_calls([call(5), call(5)])

Of course, the behavior of the HTTP server in this example is another implicit parameter, and we might want to patch the call to urlopen in fetch as well rather than create an HTTP server.
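
That version might look something like the following sketch, which patches urlopen (and time.sleep) instead of standing up a server. It assumes the same fetch module as above; the faked response only needs to support the context manager protocol and the read() call that fetch uses:

from unittest.mock import patch, call, MagicMock
import urllib.error
import unittest
import fetch

def fake_response(body: bytes) -> MagicMock:
    # fetch uses urlopen as a context manager and calls read(),
    # so that is all the fake needs to support.
    r = MagicMock()
    r.__enter__.return_value.read.return_value = body
    return r

class TestFetchWithoutServer(unittest.TestCase):
    @patch('time.sleep')
    @patch('urllib.request.urlopen')
    def test_fetch_retries(self, mock_urlopen, mock_sleep):
        # Fail twice, then succeed on the third attempt.
        mock_urlopen.side_effect = [
            urllib.error.URLError('connection refused'),
            urllib.error.URLError('connection refused'),
            fake_response(b'Test'),
        ]
        found = fetch.fetch('http://example.invalid/')
        self.assertEqual(b'Test', found)
        mock_sleep.assert_has_calls([call(5), call(5)])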

In less dynamic languages, such as C++, this isn’t as simple. You can’t patch a function. The best solution I’m aware of is to create a wrapper that you can control with the preprocessor: set the __TEST__ variable in your compilation flags and it makes your wrapper function a mock. Don’t set it, and it points to the usual version. You still have to switch all callsites to use the wrapper, but that should hopefully be a pretty mechanical refactor.

Since we have discussed sleep, we must spend a moment on what to do when some external service you are testing requires you to wait for it. If you ever find yourself writing an SDK, you are likely to have this situation. All too often you will see tests such as

def test_thing(self):
    c = ServiceClient()
    c.do_something()
    # Eventually c.get_value() will reflect the change
    time.sleep(10)
    found = c.get_value()
    expected = ...
    self.assertEqual(expected, found)

Even a few tests like this can make a test suite unmanageable to run. The single largest speedup I ever got from fixing this kind of problem took the time to run a test suite from over twenty minutes to under two. The primary tool is this function, which you will probably end up copying and pasting many times throughout your career:

async def assertEventuallyTrue(
    self, predicate, message, timeout=3, sleep_time=0.1):
    t0 = time.time()
    while not predicate():
        if time.time() >= t0 + timeout:
            self.fail(message)
        await asyncio.sleep(sleep_time)

You needn’t have an async version. Just take the async off the beginning and replace await asyncio.sleep with time.sleep in the last line. This function will sleep for small bits of time and retry until it either times out or its predicate is true.
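
Spelled out, the synchronous version is:

import time

def assertEventuallyTrue(
        self, predicate, message, timeout=3, sleep_time=0.1):
    t0 = time.time()
    while not predicate():
        if time.time() >= t0 + timeout:
            self.fail(message)
        time.sleep(sleep_time)

The sleeping test above then becomes a call like self.assertEventuallyTrue(lambda: c.get_value() == expected, 'value never reflected the change'), which returns as soon as the service catches up instead of always paying the full ten seconds.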

Exercises

  1. Write a function that sleeps for a random number of milliseconds between 0 and 5000. Write a test suite of ten trivial test cases that test things like 2+2=4 and then call that function. Write the suite first where it waits for 5500 milliseconds in each test. Measure how long it takes to run. Then write another version that uses the assertEventuallyTrue function above. Measure how long this one takes to run.
  2. Instead of actually waiting, write another version of that sleep function that instead uses a threading.Condition value to let you control when it is done. Rewrite your tests so that your test cases can call notify on the Condition to tell it when to proceed instead of waiting. How much faster does this test suite run?
  3. (Optional) Add operating system and Python version to your test plan. Choose your supported operating systems and Python versions, and use Docker, virtual machines, or some other means to run your test plan with your selected variations.

Lesson 9. Setting up implicit state

We looked at mocking simple hidden parameters in the last lesson, and put off how to produce implicit state for things like the contents of a database. Now it’s time to address that.

First, you must be able to reasonably interact with the service or something that looks very like it as part of your tests. If your database is SQLite, then spinning up an in-memory SQLite database as part of your tests is trivial and easy. If your database is Amazon’s DynamoDB, creating a new instance for your unit test isn’t practical, but someone has written a lookalike implementation of it that you would never want to run in production…but is perfect for running tests against.
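
For the SQLite case, a self-contained fixture is only a few lines (a minimal sketch using the standard library sqlite3 module):

import sqlite3
import unittest

class TestWithSqlite(unittest.TestCase):
    def setUp(self):
        # A fresh, private, in-memory database for every test.
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute(
            'CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL)'
        )

    def tearDown(self):
        self.conn.close()

    def test_insert_and_query(self):
        self.conn.execute(
            "INSERT INTO people(id, name) VALUES (1, 'Boris')"
        )
        rows = list(self.conn.execute('SELECT name FROM people'))
        self.assertEqual([('Boris',)], rows)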

Sometimes a service is easy enough to write a toy implementation of for testing that you might as well do it. If your organization uses a protocol like Thrift or Protobuf, you can probably write a toy implementation of the methods a service provides pretty straightforwardly. For example, say you are using memcached and the pymemcache client. The client library’s API is fairly straightforward to implement around a simple dict, and you likely only need a subset to start with. If you are only using get and set, then we have a mock memcached client,

import queue
import time

class MockMemcachedClient:
    def __init__(self):
        self.data = {}
        self.ttls = queue.PriorityQueue()
    
    def vacuum(self):
        # Any calls to the mock client will call vacuum
        # to clean up any items that have exceeded their
        # time to live.
        now = time.time()
        try:
            while True:
                t, key = self.ttls.get(False)
                if t <= now:
                    del self.data[key]
                else:
                    self.ttls.put((t, key))
                    return
        except queue.Empty:
            return

    def set(self, key, value, expire=0, noreply=None, flags=None):
        self.data[key] = value
        if expire != 0:
            self.ttls.put((
                time.time()+expire, 
                key,
            ))
        self.vacuum()
        return True

    def get(self, key, default=None):
        self.vacuum()
        return self.data.get(key, default)

Filling in additional methods as needed is light work, removes the need to create memcached instances on your dev box, and lets you insert whatever timing control you might need to test edge cases.

On the other hand, sometimes the service is sprawling and complicated. Writing a mock of PostgreSQL, if you really use its features, is probably not worth the effort. In this case the easiest path is to require whoever is running the tests to have an accessible PostgreSQL instance running on their machine. Set environment variables or put a config file in your home directory with the necessary information for tests to connect to that instance, and have the tests create a uniquely named, temporary database, set up the required schema, insert their data, and clean up afterwards. As an example, here is a class that connects to a local PostgreSQL instance:

import os
import psycopg2
import psycopg2.sql as sql
import math
import time
from typing import Dict
import unittest

class TestCaseWithDatabase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        dbname = f'testdb{math.floor(time.time())}'
        with psycopg2.connect(dbname='postgres') as conn:
            conn.autocommit = True
            st = sql.SQL(
                'CREATE DATABASE {dbname}'
            ).format(dbname=sql.Identifier(dbname))
            with conn.cursor() as cur:
                cur.execute(st)

        conn = psycopg2.connect(dbname=dbname)
        with conn.cursor() as cur:
            cls.table_names = []
            for name, statement in cls.get_schema().items():
                cls.table_names.append(name)
                cur.execute(statement)
        cls.conn = conn
        cls.dbname = dbname

    @classmethod
    def get_schema(cls) -> Dict[str, str]:
        """Get the CREATE TABLE statements for tables."""
        raise NotImplementedError(
            'Override get_schema in your subclass.'
        )

    def clear_database(self):
        # Truncate all tables in the schema. We make this
        # a separate method, as we will want to run it from
        # subTests.
        with self.conn.cursor() as cur:
            for table in self.table_names:
                st = sql.SQL(
                    'TRUNCATE TABLE {table} CASCADE'
                ).format(table=sql.Identifier(table))
                cur.execute(st)

    def tearDown(self):
        self.clear_database()

    @classmethod
    def tearDownClass(cls):
        cls.conn.close()
        with psycopg2.connect(dbname='postgres') as conn:
            conn.autocommit = True
            st = sql.SQL(
                'DROP DATABASE {dbname}'
            ).format(dbname=sql.Identifier(cls.dbname))
            with conn.cursor() as cur:
                cur.execute(st)

which you use by subclassing and implementing get_schema,

<<dbtc>>

class TestCaseWithFriendsDatabase(TestCaseWithDatabase):
    @classmethod
    def get_schema(cls) -> Dict[str, str]:
        return {
            'people': '''CREATE TABLE people (
                id INTEGER PRIMARY KEY,
                name TEXT NOT NULL
            )''',
            'friends': '''CREATE TABLE friends (
                id1 INTEGER REFERENCES people(id),
                id2 INTEGER REFERENCES people(id),
                PRIMARY KEY (id1, id2)
            )''',
        }

At this point you may be thinking that this is a lot of work to do for all of your dependencies. This is the cost of having dependencies.

Having set up a test environment, we need to talk about what goes in it. There are two somewhat distinct viewpoints that are useful here. The first treats the contents of the database that we are querying as basically static, but the queries we are making as sufficiently complicated that we need a real SQL engine to handle them. The second treats the state as the result of a series of calls made to the service, but assumes that the range of queries is somewhat limited. In this case we will dredge up a method from the academic literature called the Trace Assertion Method to get some mileage.

Testing against static database contents is going to look like boundary and bulk again. We begin with some state, run the code under test, and assert that the result is what we expect. For example, say we have the schema above, and a function that returns the ids of the mutual friends two people have.

from typing import List

def mutual_friends(
        conn,
        id1: int, id2: int
    ) -> List[int]:
    with conn.cursor() as cur:
        cur.execute('''
            SELECT DISTINCT lf.id2 
            FROM friends AS lf 
            INNER JOIN friends AS rf 
            ON lf.id2 = rf.id1
            WHERE lf.id1 = %s and rf.id2 = %s
        ''', (id1, id2))
        return [x[0] for x in cur.fetchall()]

We create a boundary of when id1 and id2 have no connections, no shared connections but some unshared connections, one shared connection but no unshared connections, one shared connection and some unshared connections, and two shared connections with unshared connections. Then we add a bulk of many shared connections.

<<dbtc>>

from mutual_friends import mutual_friends

class TestCaseWithFriendsDatabase(TestCaseWithDatabase):
    @classmethod
    def get_schema(cls) -> Dict[str, str]:
        return {
            'people': '''CREATE TABLE people (
                id INTEGER PRIMARY KEY,
                name TEXT NOT NULL
            )''',
            'friends': '''CREATE TABLE friends (
                id1 INTEGER REFERENCES people(id),
                id2 INTEGER REFERENCES people(id),
                PRIMARY KEY (id1, id2)
            )''',
        }

    def load(self, names, friendships):
        with self.conn.cursor() as cur:
            for i, name in enumerate(names):
                cur.execute(
                  'INSERT INTO people(id,name) VALUES (%s,%s)',
                  (i, name),
                )
            for (id1, id2) in friendships:
                st = 'INSERT INTO friends(id1,id2) VALUES (%s,%s)'
                cur.execute(st, (id1, id2))
                cur.execute(st, (id2, id1))
    
    def test_mutual_friends(self):
        # We don't care about names here, just need 
        # them to make the database happy.
        names = ['']*20
        cases = [ # We will always use 0 and 1 as our ids.
            ([], []),
            ([(0, 2), (1, 3)], []),
            ([(0, 2), (1, 2)], [2]),
            ([(0, 2), (0, 3), (1, 2), (1, 4)], [2]),
            ([
                (0, 2), (0, 3), (1, 2), (1, 3), 
                (0, 4), (1, 5)
             ], [2, 3]),
            ([
                (0, 2), (0, 3), (0, 4), (0, 5), 
                (0, 6), (0, 7), (0, 8), (0, 9),
                (1, 4), (1, 5), (1, 6), (1, 7), 
                (1, 8), (1, 9), (1, 10), (1, 11),
             ], [4, 5, 6, 7, 8, 9]),
        ]
        for (friendships, expected) in cases:
            with self.subTest(friendships=friendships):
                self.clear_database()
                self.load(names, friendships)
                found = mutual_friends(self.conn, 0, 1)
                # DISTINCT doesn't guarantee an order, so sort
                # before comparing.
                self.assertEqual(expected, sorted(found))

So much for static data. What if we have an external service that we issue a series of commands to put into a particular state for our testing? There is a technique called the Trace Assertion Method due to Parnas (who is also the one who originally published the idea of decomposing programs into modules). It’s worth reading the papers on it (Bartussek and Parnas. “Using assertions about traces to write abstract specifications for software modules”, 1978), but we can summarize what we need of it fairly quickly.

Say that we have a stack. It has two methods that change its state, push and pop, and a method head that returns the value on the top of the stack if the stack is not empty. If we apply head to an empty stack, it is an error.

Say that we start with an empty stack and consider all traces of calling push and pop. Some traces are:

[]
[push(5)]
[push(5), pop()]
[push(5), push(3), pop()]

If, for any trace, we specify the value of calling head at the end, we have a full specification of the stack. At this point, this should look like another obvious candidate for boundary and bulk. For a toy implementation of such a stack (pretend that it’s a serious, remote service with all the trimmings),

class Stack():
    def __init__(self):
        self._stack = []

    def push(self, v):
        self._stack.append(v)

    def pop(self):
        if len(self._stack) == 0:
            raise IndexError('Empty stack')
        else:
            self._stack.pop()

    def head(self):
        # Calling head on an empty stack raises IndexError,
        # which matches the specification's "error."
        return self._stack[-1]

we can write tests for head in the usual boundary and bulk manner,

from stack import Stack
import unittest

def push(v):
    return ('push', [v])

def pop():
    return ('pop', [])

class TestStack(unittest.TestCase):
    def test_head(self):
        cases = [
            ([push(5)], 5),
            ([push(5), pop(), push(3)], 3),
            ([push(5), push(3), pop(), push(12), 
              push(8), pop(), pop(), push(9)], 9),
        ]
        for (input, expected) in cases:
            with self.subTest(input=input):
                st = Stack()
                for (f, args) in input:
                    getattr(st, f)(*args)
                found = st.head()
                self.assertEqual(expected, found)

    def test_head_empty(self):
        cases = [
            [],
            [push(5), pop(), pop()],
            [pop()],
        ]
        for input in cases:
            with self.subTest(input=input):
                st = Stack()
                with self.assertRaises(IndexError):
                    for (f, args) in input:
                        getattr(st, f)(*args)
                    st.head()

Exercises

  1. Write a Trace Assertion Method specification for a very, very simple chat server. It will have a single channel that users can send messages to. All users receive all messages from the channel.
  2. Write a simple chat bot for the chat server you described in the previous exercise. It should receive messages and send a response. You might have it respond to !date with the current date and !echo with whatever text follows the command. Write a test plan for the bot. Be sure to include what happens when it fails to send a message to the chat server.
  3. Write a toy implementation of your chat server spec to connect the bot to and use it to implement the test plan from exercise 2.
  4. Extend your chat bot to log requests made to it, including when the request was received, when it sent its response, the user it recognized the request from, the body of the request, its response, whether its send to the chat server succeeded, and, if it failed, what the error was. Write a test for its logging where the test opens a SQLite database (use :memory: as its path to create an in-memory database), creates a table for the log entries in it, triggers the chat server, and checks that the data shows up correctly in the database.

Lesson 10. Randomized data

In all the tests we have written so far, we have hardcoded data. The names of people in our database example were fixed (and short and all lowercase ASCII). Particular numbers and strings have been the same in every test run. This is unfortunate for two reasons:

  1. We’re not taking advantage of successive runs to probe our parameter space a little more.
  2. Programmers tend to choose data that is on the “happy path” of the program as opposed to what will break it.

Boundary and bulk is a heuristic for systematically deriving data that avoids the happy path, but when you get into the details of what goes in the fields of complex data structures it becomes unmanageable to try to use it for every detail.

So in this lesson we’re going to talk about how to use random data to fill in your test cases. There is a further step (“property based testing”) where you use random generation to create your test cases, but it requires a significantly different approach to how you assert what a correct result is. Rather than go down that (fascinating and powerful) road right now, we’ll talk about how to get mileage immediately out of randomization.

For example, say we want to check that adding an integer and its negative cancels to zero. We could try to choose values for this…or we could write something like:

def test_inverses_cancel(self):
    for nmax in range(1, 10000):
        n = random.randint(1, nmax)
        with self.subTest(n=n):
            self.assertEqual(0, n + (-n))

Or, at the beginning of this course, we wrote tests for

from typing import List, TypeVar
T = TypeVar('T')

def head(xs: List[T]) -> T:
    if xs == []:
        raise ValueError('Cannot take head of empty list.')
    else:
        return xs[0]

Our test plan for this was the empty list, a list of one element, a list of two elements, and a couple lists out in the bulk. If we were to randomize all the data in this, we would do something like:

import unittest
import random
import sys
from head import head
from typing import List

def randlist(n: int) -> List[int]:
    return [random.randint(-sys.maxsize, sys.maxsize)
            for _ in range(n)]

class TestHead(unittest.TestCase):
    def test_empty_list(self):
        with self.assertRaises(ValueError):
            head([])

    def test_head(self):
        cases = [
            randlist(1),
            randlist(2),
            randlist(6),
            randlist(35),
        ]
        for input in cases:
            with self.subTest(input=input):
                found = head(input)
                expected = input[0]
                self.assertEqual(expected, found)

This doesn’t seem to buy us much, since we know that head shouldn’t care what type is used. But where this kind of generation really shines is when you have to put text somewhere. Text is hard to generate well. Did you include Chinese? Arabic? Some accented Latin letters? Greek? Emoji? Are you covering switches from right to left and left to right? If you’re doing any case handling, is it being done correctly? Really good Unicode testing requires very deep knowledge of the world’s languages, but we can root out assumptions that the world is ASCII much more easily, and that is already a big win considering how many websites wouldn’t accept my mailing address when I had such a mild character as ‘é’ in it. Consider generating any string you need randomly:

import unicodedata
import random

# Unicode codepoints are divided into categories. We generate
# codepoints in the Basic Multilingual Plane (the first 16 bits),
# skipping surrogates (Cs), private use (Co), and unassigned (Cn)
# codepoints.
def randchar(category_blacklist=('Cs', 'Co', 'Cn')):
    while True:
        c = chr(random.randint(1, 2**16 - 1))
        if unicodedata.category(c) not in category_blacklist:
            return c

def randstr(length):
    return ''.join(randchar() for _ in range(length))

This produces text the likes of which never was written by mankind. For example,

>>> print(randstr(256))

There’s Chinese, Burmese, some arrows, some Japanese, Korean, some Arabic, a storm cloud emoji…all kinds of stuff in there. This kind of data quickly roots out assumptions about ASCII in your code. Indeed, I included the output here as an image because it rooted out such assumptions in many of the programs I tried to use to produce this file.
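
For instance, the hardcoded names in the mutual friends test earlier could be swapped for random Unicode with a one-line change. A sketch, assuming the randchar and randstr helpers above live in a (hypothetical) randtext module:

import random
from randtext import randstr  # hypothetical module holding randchar/randstr

# Instead of names = ['']*20, exercise the name column with arbitrary
# Unicode. Code that silently assumes ASCII names now fails in the test
# suite rather than in production.
names = [randstr(random.randint(1, 30)) for _ in range(20)]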

But say we are using this kind of randomized data. How do we choose our expected values? There are three tactics to know:

1. Start on the other side of trapdoors

Some functions are easy to do backwards but hard to do forwards. In cryptography these are called trapdoor functions. Factoring an integer into its prime factors is a great example. It’s trivial to multiply the primes together but hard to recover them from the product. So if you have a list of primes and a function factor that returns the prime factors of n as a List[int],

def factor(n):
    factors = []
    k = 2
    while k <= n:
        if n % k == 0:
            factors.append(k)
            n //= k  # integer division keeps n an int
        else:
            k += 1
    return factors

then you can write the tests by

import unittest
import random

from factor import factor
from typing import List

PRIMES: List[int] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41] 
# ...obviously there are more primes, but we have to stop somewhere

def product(xs):
    p = 1
    for x in xs:
        p *= x
    return p

class TestFactor(unittest.TestCase):
    def test_factor(self):
        for k in range(5):
            for _ in range(500):
                expected = sorted(random.choices(PRIMES, k=k))
                N = product(expected)
                with self.subTest(N=N, expected=expected):
                    found = sorted(factor(N))
                    self.assertEqual(expected, found)

More trivially, if we have something that is supposed to parse a line of CSV, we can go the other way: generate the fields first, escape any that contain commas, quotes, or whitespace, and concatenate them into a line of CSV. Then run your split function on that line and see if you recover the original fields.
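
Here is a minimal sketch of that round trip, using Python's csv module to stand in for both the escaping and the parsing; in your own code the reading side would be the split function under test:

import csv
import io
import random
import string
import unittest

def randfield(length):
    # Include commas, quotes, and spaces so escaping actually gets exercised.
    alphabet = string.ascii_letters + string.digits + ', " '
    return ''.join(random.choice(alphabet) for _ in range(length))

class TestCsvRoundTrip(unittest.TestCase):
    def test_fields_survive_round_trip(self):
        for _ in range(100):
            expected = [randfield(random.randint(0, 12)) for _ in range(5)]
            with self.subTest(expected=expected):
                buffer = io.StringIO()
                csv.writer(buffer).writerow(expected)
                found = next(csv.reader(io.StringIO(buffer.getvalue())))
                self.assertEqual(expected, found)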

Now, escaping and concatenating seems like a nontrivial piece of functionality as well. Maybe it should go in your library instead of your tests. Which leads us to the second tactic:

2. Test invariants

product and factor are inverses. If we generate a list of primes ps we can check that factor(product(ps)) == ps. Or we can generate a number n and check that product(factor(n)) == n. Instead of looking at the exact values produced by each function, we ask about properties of combining the functions that are invariant under whatever values we put in.

The same tests as before, written this way, become

import unittest
import random

from factor import factor
from typing import List

PRIMES: List[int] = [2, 3, 5, 7, 11, 13, 17, 19, 23, ...]

def product(xs):
    p = 1
    for x in xs:
        p *= x
    return p

class TestFactor(unittest.TestCase):
    def test_product_factor_is_identity(self):
        for k in range(5):
            for _ in range(500):
                N = random.randint(1, 10**5)
                with self.subTest(N=N):
                    self.assertEqual(N, product(factor(N)))

    def test_factor_product_is_identity(self):
        for k in range(5):
            for _ in range(500):
                primes = sorted(random.choices(PRIMES, k=k))
                with self.subTest(primes=primes):
                    self.assertEqual(primes, factor(product(primes)))

This is a very different way of thinking, and it can take some time to get used to. If you took a physics class at some point in the past, it might be worth considering it this way: you can solve Newton’s equations of motion to find the path of an object over time, or you can use conservation of energy and momentum to find properties of that path without talking about the exact path. Similarly, we are switching from tracking the exact data that goes into and out of functions to talking about relationships among the functions globally.

The biggest shift this requires is that instead of testing one function at a time, you have to start testing sets of them. The only invariants you can find on one function f are pretty boring. They are:

  1. f is the identity: f(x) == x for all x.
  2. f is constant: f(x) == c for all x and some fixed c.
  3. f is idempotent: f(f(x)) == f(x) for all x.
  4. f is an involution: f(f(x)) == x for all x.

With two functions, f and g, we can get a few more, such as

  1. f and g are inverses: f(g(x)) == x for all x.
  2. f and g commute: f(g(x)) == g(f(x)) for all x.
  3. Their composition is constant: f(g(x)) == c for all x and some fixed c.

Note that the last one is different from constant for one function. g might return different values, all of which are mapped to a constant by f. For example, count_lowercase(to_upper(s)) on an ASCII string should always return 0. But there are only so many ways to arrange the symbols f, g, identity, and constant.
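
As a sketch of that composition (count_lowercase is a hypothetical helper, and Python's built-in str.upper stands in for to_upper):

import random
import string
import unittest

def count_lowercase(s):
    # Hypothetical helper: count ASCII lowercase letters in s.
    return sum(1 for c in s if c in string.ascii_lowercase)

class TestConstantComposition(unittest.TestCase):
    def test_upper_leaves_no_lowercase(self):
        for _ in range(200):
            s = ''.join(random.choice(string.ascii_letters + ' .,')
                        for _ in range(random.randint(0, 40)))
            with self.subTest(s=s):
                self.assertEqual(0, count_lowercase(s.upper()))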

And you really do need several properties in combination. A single property, such as being an involution, doesn’t specify a function uniquely: both swapping the first two elements of a list and reversing the list are involutions, and you need additional properties to distinguish them.
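
A sketch of what that looks like (swap here is a hypothetical helper written just for the illustration): both functions pass the involution test, and it takes a second property, such as where the last element ends up, to tell them apart.

import random
import unittest

def swap(xs):
    # Hypothetical helper: swap the first two elements of a list.
    if len(xs) < 2:
        return list(xs)
    return [xs[1], xs[0]] + list(xs[2:])

class TestInvolutions(unittest.TestCase):
    def test_both_are_involutions(self):
        for _ in range(100):
            xs = [random.randint(-100, 100)
                  for _ in range(random.randint(0, 10))]
            with self.subTest(xs=xs):
                self.assertEqual(xs, list(reversed(list(reversed(xs)))))
                self.assertEqual(xs, swap(swap(xs)))

    def test_reverse_moves_last_to_front(self):
        # reverse satisfies this; swap generally does not.
        for _ in range(100):
            xs = [random.randint(-100, 100)
                  for _ in range(random.randint(3, 10))]
            with self.subTest(xs=xs):
                self.assertEqual(xs[-1], list(reversed(xs))[0])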

And sometimes properties only hold over subsets of inputs, which brings us to our last tactic,

3. Reduce the scope you generate in

Let’s look at Unicode again. If we are writing a toupper function to uppercase text, we need to consider what that means in different scripts. In English, French, Italian, or Spanish it’s straightforward, as it is in Russian. English has a unique mapping from lowercase to uppercase, even if you include archaic accents like the old-fashioned diaeresis on ‘coöperation.’ German throws a wrench in the works, though: the uppercase of ß is SS. Salish has a sound represented as t̕ᶿ ('\u0074\u0315\u1dbf' in Python), which uppercases to T̕ᶿ. These are all bicameral scripts (they have two cases). Uppercasing Chinese or Arabic is also easy, because nothing changes (they are unicameral scripts). But if we generate Unicode code points at random, most of them are going to be in non-Latin scripts, and we are very unlikely to ever get a ß.

So we want to decompose arbitrary Unicode into subdomains to test. We want to specifically test ß and some of the corner cases built into Native American languages. We want to generate random Latin characters, including those with accents like umlauts. We want to generate Cyrillic. And we want to check that Hangul and Chinese characters don’t change at all.
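
A sketch of what those subdomain generators might look like, using Python's built-in str.upper as a stand-in for the toupper under test (the code point ranges chosen here are the basic Cyrillic letters and the CJK Unified Ideographs block):

import random
import unittest

def rand_cyrillic(length):
    # Basic Cyrillic letters: U+0410..U+044F.
    return ''.join(chr(random.randint(0x0410, 0x044F))
                   for _ in range(length))

def rand_cjk(length):
    # CJK Unified Ideographs: U+4E00..U+9FFF.
    return ''.join(chr(random.randint(0x4E00, 0x9FFF))
                   for _ in range(length))

class TestUppercasing(unittest.TestCase):
    def test_sharp_s(self):
        self.assertEqual('SS', 'ß'.upper())

    def test_unicameral_text_unchanged(self):
        for _ in range(100):
            s = rand_cjk(random.randint(1, 20))
            with self.subTest(s=s):
                self.assertEqual(s, s.upper())

    def test_upper_is_idempotent_on_cyrillic(self):
        for _ in range(100):
            s = rand_cyrillic(random.randint(1, 20))
            with self.subTest(s=s):
                self.assertEqual(s.upper(), s.upper().upper())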

After all this, it may seem like randomization makes testing more complicated. It does and it doesn’t. It does in that it will often force edge cases to your attention that you would otherwise have overlooked. It doesn’t in that the domain knowledge you have to express in a random generator is the same knowledge you would need to pick good deterministic test cases by hand. Sometimes that knowledge is sufficiently convoluted that a set of deterministic tests is easier on the programmer, and on those who read the code later, than getting all the details in place to generate random tests. Often it is not, or simple changes to the system’s API make generation easier. And often you can find a balance, where you construct certain aspects deterministically and fill in fields and other areas with random data. Finding that balance is a matter of experimenting until you build up some experience.

Exercises

  1. Test if floating point addition is associative, that is, (a + b) + c == a + (b + c). Use random generation of floating point values.
  2. (Rejection sampling.) Write a function that generates pairs of floating point numbers a and b such that a*a + b*b <= 1. Do it exactly first, where you generate a between -1 and 1, and then generate b with bounds set by the a you generated. Then do it by rejection where you generate pairs of floating point numbers both between -1 and 1, and throw away pairs until you get a pair that satisfies the constraint.
  3. The function reverse that reverses a list and the function swap that swaps its first two elements are both involutions. Write down invariants for how reverse and swap commute with array slicing, that is, reverse(xs[n:m]) == reverse(xs)[?:?] and swap(xs[n:m]) == swap(xs)[?:?]. The latter will not be a simple, single formula.

Where next…

Readers with some background in software quality may notice that I haven’t used classifications of tests into unit tests, integration tests, acceptance tests, or a variety of other buckets. That is intentional. These classifications all describe aspects of how a particular organization handles its work. Unit testing and integration testing are distinguished when teams are assigned to build parts of a system that will be integrated: each team writes unit tests for its part, and a separate team builds integration tests for the whole system. Acceptance tests are part of procurement: you contract another firm to build what you need and set acceptance tests to decide whether you will accept the result and pay them. These are all orthogonal to the material here, which is about how to write the tests.

There are three tools I recommend exploring from here to see where you can push these ideas further:

  1. Jester: How do you know if you actually have adequate testing? One common measure is what percentage of the lines or branches your test suite exercises, though this has some obvious weaknesses. Jester measures adequacy instead by mutating your source code and seeing whether your tests catch the change.

  2. Hypothesis: The full development of the idea of randomly generating your test data is property based testing, which generates increasingly large datasets and tests invariants on them. If you can write your code in a way amenable to property based testing, it’s something of a superpower. (A minimal example follows this list.)

  3. American Fuzzy Lop: AFL is a fuzz tester, that is, it generates random data to try to crash the program it’s testing, and mutates its input trying to force more and more of the instructions in the program to be run.
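
For a taste of the second of these, here is the inverses-cancel test from earlier as a property-based test. This is a minimal sketch; it assumes Hypothesis is installed (pip install hypothesis).

import unittest
from hypothesis import given, strategies as st

class TestCancellation(unittest.TestCase):
    # Hypothesis generates the integers and, if the assertion ever fails,
    # shrinks the failing input to a minimal counterexample.
    @given(st.integers())
    def test_inverses_cancel(self, n):
        self.assertEqual(0, n + (-n))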

Finally, tests that aren’t run do you no good. Continuous integration is the name today for setting up a system to run your tests on every change to your program. There are many, many such systems available, and there is no reason not to use one.