Tactics for testable code
Status: Finished
Confidence: Very likely
There are certain tactics or gambits that I have found over the years that make code more testable. When I say more testable, I mean that it requires fewer test cases to demonstrate correctness and that those test cases should be easier to construct. In each of the tactics that follow we’ll make those two criteria—fewer and easier to construct—more precise.
Tactic 1. Adapt to canonical forms
Consider a function
def events_since(t) -> List[Event]:
"""Return all events that happened since time t."""
...
What are somethings people might want to pass in for t
?
Perhaps:
- Seconds since the epoch (or microseconds, or nanoseconds)
- Python’s
datetime
objects - Strings like
"2 hours ago"
or"3 weeks ago"
- Strings like
"2019-12-5 15:33:02"
and heaven knows what else. 3PM on the third Thursday of last month,
perhaps? Imagine if events_since
knew how to interpret all
of these. The test plan is going to be terrifying.
We can make a qualitative calculation of how big such a plan will be. Say we accept those four forms for dates, and for each form we need to pass in N cases. It may be fewer or more by a bit for each case, but it shouldn’t be ten times fewer or ten times more, so we can at least see how it will scale, rather the same way that when we analyze the complexity of an algorithm with big-O notation we neglect lots of constants and aim for a rough scaling law in terms of the size N of the inputs.
For four types like this, we have cases. But what if we have a function like
def events_between(t0, tf) -> List[Event]:
"""Return events occuring after t0 and before tf."""
...
We combine the test plans for each parameter by taking all
combinations. Expressing it in code, if we have test cases
t0_cases
and tf_cases
, our test cases for both
together are
= [(t0_case, tf_case)
cases for t0_case in t0_cases
for tf_case in tf_cases]
For three or more parameters we take all combinations in the same way. This is called a factorial design and is one of the foundational achievements of statistics.
For events_between
, if we accept all four kinds as
above, we now have
cases.
is typically on the other of 5 to 10, so we’re looking at 400 to 1600
test cases.
Instead, let’s choose a canonical form for time. In Python that would
probably be the datetime
type. In a packet capture system I
worked on it was a 64 bit, unsigned integer representing microseconds
since the epoch. Then we write adapter functions from our various input
kinds to that canonical form. For example,
def ago(s: str, relative_to=None) -> datetime:
...def from_unixtime(ts: int) -> datetime:
...def from_iso(s: str) -> datetime:
...
and our function has a more restricted signature:
def events_between(t0: datetime, tf: datetime) -> List[Event]:
...
Now we have N cases for ago
, N cases
for from_unixtime
, N cases for
from_iso
, and
cases for events_between
. In all, that is
cases, as opposed to
for accepting everything. For a plausible value like
,
that’s still only 130 cases as opposed to the 1600 we said before.
This actually underestimates the improvement, though. If we look at
our API as a whole, we get similar savings for events_since
as well as events_between
. The more functions we have that
use the canonical type instead of handling all cases, the more test
cases the canonical form and adapters save us.
Now, the canonical form is not always a concrete type. It may be an opaque interface. For example, imagine that we have a system that needs to be able to notify individuals, other systems, pub/sub queues, or oncalls of an event. If we write a function
def alert(recipients, msg):
...
What goes it its test plan? It’s going to be huge, and it will require us to be able to set up a test environment for each of the possible kinds of recipient to check that the alert was dispatched properly. Not only is the test plan large, but the individual tests will require complex environments.
Instead, we can create an interface. There’s no “interface” type in Python, but in pseudocode it would look like
from typing import NamedTuple, Callable
class Messageable(NamedTuple):
send: Callable[[str], None]
name: Callable[[], str]
In C++ it would be an abstract base class. In Java or Go it would be
an interface, since those languages support the idea directly. In C it
would be a struct with the fields send
and
name
that are function pointers. The alert
function now has the type
def alert(recipients: Iterable[Messageable], msg: str):
...
And you can again write adapter functions, such as
def to_user(id: int) -> Messageable:
...def to_oncall(name: str) -> Messageable:
...def to_queue(name: str) -> Messageable:
...
and whatever else needs to receive alerts. The test plan for
alert
now becomes straightforward. We test entirely with a
simple implementation of Messageable
, such as
def make_messageable(name: str):
= []
msgs def _send(msg):
msgs.append(msg)def _name():
return name
return Messageable(send=_send, name=_name)
The individual adapters such as to_user
and
to_oncall
need to have more complex test environments, but
they stay separate, and we only need those tests for the adapters, not
for any other functions that use the Messageable
interface.
As is often the case, what makes the code more testable also buys us
some nice maintainability. If you need to add a new kind of recipient
that can receive alerts, you don’t have to touch the alert
function at all, or even have access to its code. You just need to write
a new adapter function that produces a Messageable
.
Tactic 2. Extract and dispose
While spelunking through a piece of code I ran across a piece of spaghetti that resembled the following call tree:
= B.__init__(line_length, locale)
b = A.__init__(b, name, owner)
a
a.generate()
b.send(a)
a.name()
b.line_length()
a.description()
b.line_length()
b.locale()
b.get_destination()
a.name() a.owner()
Working with code that actually does this kind of ping-pong is a nightmare. What any chunk of code is supposed to do is spread across many functions and it requires a frustrating amount of brainpower just to figure out what’s going on.
Anyway, we said before that when we have multiple parameters, we have
to take all combinations of their test cases to get our full test plan.
With this kind of spaghetti, the parameters to both A
and
B
have to be taken together. If they could have been tested
separately, it would have been
test cases (1100 cases for
).
Together, they need
test cases (100,000 with
).
1100 to 100,000 is a big difference.
If we look at what’s going on here, B.send
is being
called to do something and is referring to A
to get
information it needs for that. We can decouple them and reduce our test
cases by making A
give you everything you need and then
disposing of it.
= A.__init__(name, owner)
a = a.generate()
info
# info == {
# 'name': ...,
# 'description': ...,
# 'owner': ...,
# }
= B.__init__(line_length, locale)
b b.send(info)
We could probably omit the classes entirely and just write
= generate(name, owner)
info send(info, line_length, locale)
Now we are back to test cases, and the code will be much more readable.
Similarly, if you have a function that does some computation and has side effects, you can simplify your life by splitting it into a pure function that calculates what the effects should be and a simple function that runs side effects.
So if we have a function
def update(input):
# do things to global state based on input ...
we can split it into
def delta(input) -> List[Change]:
# calculate the list of changes
...
def apply(ch: List[Change]):
# apply changes ...
Testing delta
requires no complex setup. It’s just
inputs and outputs. Testing apply
requires more setup, but
its test cases are much simpler.
If you apply the extract and dispose idea recursively, you end up with pipelines, such as
x = f(input)
y = g(x)
z = h(y)
m = p(z)
n = q(m)
return n
At each stage you finish entirely with the previous work and forget everything except what you need for what follows. This is extremely easy to test, but it is also extremely easy to end up thrashing from one step to the next, which brings us to our next tactic:
Tactic 3. Match impedances and chain endomorphisms
Here’s some functions that query information about employees:
def sex_of(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
...def is_bald(vs: List[Name]) -> Dict[Name, bool]:
...def knows_haskell(vs: List[Name]) -> List[bool]:
...
Your task is to write a function that, given some names, returns their sex, if they have hair, and if they know Haskell. Looking at the types of these functions, this is going to be a bit awkward.
def query(names: List[Name]) -> Dict[str, Any]:
= {name: {} for name in names}
people = sex_of(people)
people for name, ib in is_bald(names).items():
'is_bald'] = ib
people[name][for name, kh in zip(names, knows_haskell(names)):
'knows_haskell'] = kh
people[name][return people
Those three functions could all have the same type. If we choose to make them all the same type, such as
def sex_of(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
...def is_bald(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
...def knows_haskell(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
...
then our code becomes more straightforward:
def query(names: List[Name]) -> Dict[str, Any]:
= {name: {} for name in names}
people = sex_of(vs)
people = is_bald(vs)
people = knows_haskell(vs)
people return people
Now, this is all well and good, but how is it going to improve our
testing? The number of test cases for query
isn’t going to
change because we reorganized this, even if the code is cleaner, and
sex_of
, is_bald
, and
knows_haskell
are probably querying a database of some
sort, so testing query
requires setting up a test
environment with that database.
Matching impedances this way buys us power, though, because we can introduce a function that can chains these transforms. Transforms from a type to itself are also called endomorphisms, so you can sound very sophisticated with the battlecry “Chain your endomorphisms!” Anyway, a chaining function:
def chain(*fs):
def newf(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
for f in fs:
= f(vs)
vs return vs
return newf
If we test chain
, then our various functions like
query
stop being a thing we write a separate function with
testing. We just write
= chain(sex_of, is_bald, knows_haskell) query
which is something we would probably just write inline wherever we
are using it and barely needs testing at all. We can test
chain
with whatever functions of the right signature that
we want, so we can choose ones that, unlike sex_of
or
is_bald
, need no external data source.
For a single function like query
we haven’t bought much,
but if we have more than just query
dealing with such
functions, then will reduce a lot of test cases by only testing the
endomorphisms and chain and having to do very little beyond that.
Now, we could have chosen any of the three function types to
standardize on. The one we chose is probably the least flexible because
the functions themselves hardcode the names of the keys they write. That
can be good or bad depending on your environment. You could just as
easily write a rename
endomorphism. For your amusement you
might want to try writing a chain
function for the other
options.
An extreme and well known case of the kind of thrashing we’re avoiding with this tactic is object relational mismatch, where we go from the world of a relational database where we store aspects of identities to a world of objects where we store entities with all their aspects. In the relational world aspects may be added or ignored at will. In the object world an entity has an expected set of attributes. Objects contain other objects. In relational terms, you join the aspects you need and nothing contains anything else.
Tactic 4. Opacify passthroughs
The classic example of opaque types is generics, such as
def head(xs: List[T]) -> T:
"""Return the first element of xs."""
...
What properties of T
do we need to consider when we test
head
? None! We can pick any type that is convenient to test
it. We could use None
, but we will want to be able to
distinguish the various elements of the lists we pass in, so we’ll want
a somewhat bigger type. int
is usually a pretty solid
choice in such cases, but we can choose whatever is easy to test, and
then in use we can pass whatever complicated type we may need.
But now suppose have a function that queries a service for posts made bya given user between two points in time:
def posts_by_user(
int,
user_id:
t0: datetime, tf: datetime,
format_spec: FormatSpec, )
What is that FormatSpec
? Inevitably in an API you need
to specify what parts of the data you need to return, and that is what
the format spec specifies. Our function probably does nothing with it,
just passes it on to whatever service it queries. If we make the client
explicit and parameterize it over the type of the
FormatSpec
,
def posts_by_user(
client: Client[FS],int,
user_id:
t0: datetime, tf: datetime,
format_spec: FS, )
We can create an interface for Client
and subclass it
for a fake client for testing and a real client for production, such
as
from typing import Generic
class Client(Generic[FS]): ...
class MockClient(Client[None]): ...
class RealClient(Client[FormatSpec]): ...
then we can test posts_by_user
with
MockClient
, where the only possible value of
format_spec
is None
. That effectively removes
a parameter from our test plan. Going from
to
(with
,
10,000 to 1,000 test cases) is a significant win.
Especially if you have to pass a parameter through multiple layers of code, opacifying passthroughs like this can save a lot of tests.
tl;dr
You can structure your code to dramatically reduce the number of test cases it needs.
- Converting various forms to a single, canonical representation.
- Extracting what you need from a piece of code, then disposing of it so later code can’t spaghetti into it.
- “Chain your endomorphisms!”—make your transforms have the same type and write a function to chain them to avoid thrashing representations and overly complicated test situations.
- Test paramters that are just passed on to other layers of code by
putting a generic type in for the parameter and testing with
None
or similarly boring types in that slot.
Handily, all of these tend to give you code that is easier to read and easier to change over time as well.
If this was helpful to you, look at my post on easier, faster testing.