Tactics for testable code

Status: Finished
Confidence: Very likely

There are certain tactics or gambits that I have found over the years that make code more testable. When I say more testable, I mean that it requires fewer test cases to demonstrate correctness and that those test cases should be easier to construct. In each of the tactics that follow we’ll make those two criteria—fewer and easier to construct—more precise.

Tactic 1. Adapt to canonical forms

Consider a function

def events_since(t) -> List[Event]:
    """Return all events that happened since time t."""

What are somethings people might want to pass in for t? Perhaps:

and heaven knows what else. 3PM on the third Thursday of last month, perhaps? Imagine if events_since knew how to interpret all of these. The test plan is going to be terrifying.

We can make a qualitative calculation of how big such a plan will be. Say we accept those four forms for dates, and for each form we need to pass in N cases. It may be fewer or more by a bit for each case, but it shouldn’t be ten times fewer or ten times more, so we can at least see how it will scale, rather the same way that when we analyze the complexity of an algorithm with big-O notation we neglect lots of constants and aim for a rough scaling law in terms of the size N of the inputs.

For four types like this, we have 4N4N cases. But what if we have a function like

def events_between(t0, tf) -> List[Event]:
    """Return events occuring after t0 and before tf."""

We combine the test plans for each parameter by taking all combinations. Expressing it in code, if we have test cases t0_cases and tf_cases, our test cases for both together are

cases = [(t0_case, tf_case)
         for t0_case in t0_cases
         for tf_case in tf_cases]

For three or more parameters we take all combinations in the same way. This is called a factorial design and is one of the foundational achievements of statistics.

For events_between, if we accept all four kinds as above, we now have 16N216N^2 cases. NN is typically on the other of 5 to 10, so we’re looking at 400 to 1600 test cases.

Instead, let’s choose a canonical form for time. In Python that would probably be the datetime type. In a packet capture system I worked on it was a 64 bit, unsigned integer representing microseconds since the epoch. Then we write adapter functions from our various input kinds to that canonical form. For example,

def ago(s: str, relative_to=None) -> datetime:
def from_unixtime(ts: int) -> datetime:
def from_iso(s: str) -> datetime:

and our function has a more restricted signature:

def events_between(t0: datetime, tf: datetime) -> List[Event]:

Now we have N cases for ago, N cases for from_unixtime, N cases for from_iso, and N2N^2 cases for events_between. In all, that is N2+3NN^2 + 3N cases, as opposed to 16N216N^2 for accepting everything. For a plausible value like N=10N = 10, that’s still only 130 cases as opposed to the 1600 we said before.

This actually underestimates the improvement, though. If we look at our API as a whole, we get similar savings for events_since as well as events_between. The more functions we have that use the canonical type instead of handling all cases, the more test cases the canonical form and adapters save us.

Now, the canonical form is not always a concrete type. It may be an opaque interface. For example, imagine that we have a system that needs to be able to notify individuals, other systems, pub/sub queues, or oncalls of an event. If we write a function

def alert(recipients, msg):

What goes it its test plan? It’s going to be huge, and it will require us to be able to set up a test environment for each of the possible kinds of recipient to check that the alert was dispatched properly. Not only is the test plan large, but the individual tests will require complex environments.

Instead, we can create an interface. There’s no “interface” type in Python, but in pseudocode it would look like

from typing import NamedTuple, Callable

class Messageable(NamedTuple):
    send: Callable[[str], None]
    name: Callable[[], str]

In C++ it would be an abstract base class. In Java or Go it would be an interface, since those languages support the idea directly. In C it would be a struct with the fields send and name that are function pointers. The alert function now has the type

def alert(recipients: Iterable[Messageable], msg: str):

And you can again write adapter functions, such as

def to_user(id: int) -> Messageable:
def to_oncall(name: str) -> Messageable:
def to_queue(name: str) -> Messageable:

and whatever else needs to receive alerts. The test plan for alert now becomes straightforward. We test entirely with a simple implementation of Messageable, such as

def make_messageable(name: str):
    msgs = []
    def _send(msg):
    def _name():
        return name
    return Messageable(send=_send, name=_name)

The individual adapters such as to_user and to_oncall need to have more complex test environments, but they stay separate, and we only need those tests for the adapters, not for any other functions that use the Messageable interface.

As is often the case, what makes the code more testable also buys us some nice maintainability. If you need to add a new kind of recipient that can receive alerts, you don’t have to touch the alert function at all, or even have access to its code. You just need to write a new adapter function that produces a Messageable.

Tactic 2. Extract and dispose

While spelunking through a piece of code I ran across a piece of spaghetti that resembled the following call tree:

b = B.__init__(line_length, locale)
a = A.__init__(b, name, owner)

Working with code that actually does this kind of ping-pong is a nightmare. What any chunk of code is supposed to do is spread across many functions and it requires a frustrating amount of brainpower just to figure out what’s going on.

Anyway, we said before that when we have multiple parameters, we have to take all combinations of their test cases to get our full test plan. With this kind of spaghetti, the parameters to both A and B have to be taken together. If they could have been tested separately, it would have been N3+N2N^3 + N^2 test cases (1100 cases for N=10N=10). Together, they need N5N^5 test cases (100,000 with N=10N=10). 1100 to 100,000 is a big difference.

If we look at what’s going on here, B.send is being called to do something and is referring to A to get information it needs for that. We can decouple them and reduce our test cases by making A give you everything you need and then disposing of it.

a = A.__init__(name, owner)
info = a.generate()

# info == {
#   'name': ...,
#   'description': ...,
#   'owner': ...,
# }

b = B.__init__(line_length, locale)

We could probably omit the classes entirely and just write

info = generate(name, owner)
send(info, line_length, locale)

Now we are back to N2+N3N^2 + N^3 test cases, and the code will be much more readable.

Similarly, if you have a function that does some computation and has side effects, you can simplify your life by splitting it into a pure function that calculates what the effects should be and a simple function that runs side effects.

So if we have a function

def update(input):
    ... # do things to global state based on input

we can split it into

def delta(input) -> List[Change]:
    ... # calculate the list of changes

def apply(ch: List[Change]):
    ... # apply changes

Testing delta requires no complex setup. It’s just inputs and outputs. Testing apply requires more setup, but its test cases are much simpler.

If you apply the extract and dispose idea recursively, you end up with pipelines, such as

x = f(input)
y = g(x)
z = h(y)
m = p(z)
n = q(m)
return n

At each stage you finish entirely with the previous work and forget everything except what you need for what follows. This is extremely easy to test, but it is also extremely easy to end up thrashing from one step to the next, which brings us to our next tactic:

Tactic 3. Match impedances and chain endomorphisms

Here’s some functions that query information about employees:

def sex_of(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
def is_bald(vs: List[Name]) -> Dict[Name, bool]:
def knows_haskell(vs: List[Name]) -> List[bool]:

Your task is to write a function that, given some names, returns their sex, if they have hair, and if they know Haskell. Looking at the types of these functions, this is going to be a bit awkward.

def query(names: List[Name]) -> Dict[str, Any]:
    people = {name: {} for name in names}
    people = sex_of(people)
    for name, ib in is_bald(names).items():
        people[name]['is_bald'] = ib
    for name, kh in zip(names, knows_haskell(names)):
        people[name]['knows_haskell'] = kh
    return people

Those three functions could all have the same type. If we choose to make them all the same type, such as

def sex_of(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
def is_bald(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
def knows_haskell(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:

then our code becomes more straightforward:

def query(names: List[Name]) -> Dict[str, Any]:
    people = {name: {} for name in names}
    people = sex_of(vs)
    people = is_bald(vs)
    people = knows_haskell(vs)
    return people

Now, this is all well and good, but how is it going to improve our testing? The number of test cases for query isn’t going to change because we reorganized this, even if the code is cleaner, and sex_of, is_bald, and knows_haskell are probably querying a database of some sort, so testing query requires setting up a test environment with that database.

Matching impedances this way buys us power, though, because we can introduce a function that can chains these transforms. Transforms from a type to itself are also called endomorphisms, so you can sound very sophisticated with the battlecry “Chain your endomorphisms!” Anyway, a chaining function:

def chain(*fs):
    def newf(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
        for f in fs:
            vs = f(vs)
        return vs
    return newf

If we test chain, then our various functions like query stop being a thing we write a separate function with testing. We just write

query = chain(sex_of, is_bald, knows_haskell)

which is something we would probably just write inline wherever we are using it and barely needs testing at all. We can test chain with whatever functions of the right signature that we want, so we can choose ones that, unlike sex_of or is_bald, need no external data source.

For a single function like query we haven’t bought much, but if we have more than just query dealing with such functions, then will reduce a lot of test cases by only testing the endomorphisms and chain and having to do very little beyond that.

Now, we could have chosen any of the three function types to standardize on. The one we chose is probably the least flexible because the functions themselves hardcode the names of the keys they write. That can be good or bad depending on your environment. You could just as easily write a rename endomorphism. For your amusement you might want to try writing a chain function for the other options.

An extreme and well known case of the kind of thrashing we’re avoiding with this tactic is object relational mismatch, where we go from the world of a relational database where we store aspects of identities to a world of objects where we store entities with all their aspects. In the relational world aspects may be added or ignored at will. In the object world an entity has an expected set of attributes. Objects contain other objects. In relational terms, you join the aspects you need and nothing contains anything else.

Tactic 4. Opacify passthroughs

The classic example of opaque types is generics, such as

def head(xs: List[T]) -> T:
    """Return the first element of xs."""

What properties of T do we need to consider when we test head? None! We can pick any type that is convenient to test it. We could use None, but we will want to be able to distinguish the various elements of the lists we pass in, so we’ll want a somewhat bigger type. int is usually a pretty solid choice in such cases, but we can choose whatever is easy to test, and then in use we can pass whatever complicated type we may need.

But now suppose have a function that queries a service for posts made bya given user between two points in time:

def posts_by_user(
    user_id: int, 
    t0: datetime, tf: datetime, 
    format_spec: FormatSpec,

What is that FormatSpec? Inevitably in an API you need to specify what parts of the data you need to return, and that is what the format spec specifies. Our function probably does nothing with it, just passes it on to whatever service it queries. If we make the client explicit and parameterize it over the type of the FormatSpec,

def posts_by_user(
    client: Client[FS],
    user_id: int,
    t0: datetime, tf: datetime,
    format_spec: FS,

We can create an interface for Client and subclass it for a fake client for testing and a real client for production, such as

from typing import Generic

class Client(Generic[FS]): ...

class MockClient(Client[None]): ...
class RealClient(Client[FormatSpec]): ...

then we can test posts_by_user with MockClient, where the only possible value of format_spec is None. That effectively removes a parameter from our test plan. Going from N4N^4 to N3N^3 (with N=10N=10, 10,000 to 1,000 test cases) is a significant win.

Especially if you have to pass a parameter through multiple layers of code, opacifying passthroughs like this can save a lot of tests.


You can structure your code to dramatically reduce the number of test cases it needs.

  1. Converting various forms to a single, canonical representation.
  2. Extracting what you need from a piece of code, then disposing of it so later code can’t spaghetti into it.
  3. “Chain your endomorphisms!”—make your transforms have the same type and write a function to chain them to avoid thrashing representations and overly complicated test situations.
  4. Test paramters that are just passed on to other layers of code by putting a generic type in for the parameter and testing with None or similarly boring types in that slot.

Handily, all of these tend to give you code that is easier to read and easier to change over time as well.

If this was helpful to you, look at my post on easier, faster testing.