by Fred Ross Last updated: March 28, 2020
There are certain tactics or gambits that I have found over the years that make code more testable. When I say more testable, I mean that it requires fewer test cases to demonstrate correctness and that those test cases should be easier to construct. In each of the tactics that follow we’ll make those two criteria—fewer and easier to construct—more precise.
Consider a function events_since that takes a single parameter t, the time after which we want events.
What are some things people might want to pass in for t? Perhaps:
datetime objects
"2 hours ago" or "3 weeks ago"
"2019-12-5 15:33:02"
and heaven knows what else. 3PM on the third Thursday of last month, perhaps? Imagine if events_since knew how to interpret all of these. The test plan is going to be terrifying.
We can make a rough calculation of how big such a plan will be. Say we accept those four forms for dates, and for each form we need to pass in N cases. Each form may need a few more or a few fewer, but not ten times fewer or ten times more, so we can at least see how the plan scales. It is the same move we make when we analyze the complexity of an algorithm with big-O notation: neglect the constants and aim for a rough scaling law in terms of the size N of the inputs.
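For four types like this, we have $4N$ cases. But what if we have a function like events_between, which takes a start time t0 and an end time tf?
We combine the test plans for each parameter by taking all combinations. Expressed in code, if we have test cases t0_cases and tf_cases, our test cases for both together are every pair (t0, tf) with t0 drawn from t0_cases and tf drawn from tf_cases.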
For three or more parameters we take all combinations in the same way. This is called a factorial design and is one of the foundational achievements of statistics.
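As a rough sketch of how the combined plan gets built, Python's itertools.product enumerates every pairing. The case lists here are made-up placeholders, not a real test plan.

```python
from itertools import product

# Hypothetical per-parameter test cases; the values are placeholders.
t0_cases = ["2019-12-05 15:33:02", "2 hours ago", 1575560000]
tf_cases = ["2019-12-06 09:00:00", "1 hour ago", 1575620000]

# Every (t0, tf) pair: the full factorial design for two parameters.
both_cases = list(product(t0_cases, tf_cases))

# The size of the plan is the product of the per-parameter sizes.
assert len(both_cases) == len(t0_cases) * len(tf_cases)
```

With three parameters you would pass three lists to product, and the plan size multiplies again.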
For events_between, if we accept all four kinds as above, we now have $(4N)^2 = 16N^2$ cases. $N$ is typically on the order of 5 to 10, so we’re looking at 400 to 1600 test cases.
Instead, let’s choose a canonical form for time. In Python that would probably be the datetime type. In a packet capture system I worked on it was a 64-bit unsigned integer representing microseconds since the epoch. Then we write adapter functions from our various input kinds to that canonical form. For example,
from datetime import datetime

def ago(s: str, relative_to=None) -> datetime:
    ...

def from_unixtime(ts: int) -> datetime:
    ...

def from_iso(s: str) -> datetime:
    ...
and our function has a more restricted signature: events_between takes only datetime values for t0 and tf.
Now we have N cases for ago, N cases for from_unixtime, N cases for from_iso, and $N^2$ cases for events_between. In all, that is $N^2 + 3N$ cases, as opposed to $16N^2$ for accepting everything. For a plausible value like $N = 10$, that’s still only 130 cases as opposed to the 1600 we said before.
This actually underestimates the improvement, though. If we look at our API as a whole, we get similar savings for events_since as well as events_between. The more functions we have that use the canonical type instead of handling all cases, the more test cases the canonical form and adapters save us.
Now, the canonical form is not always a concrete type. It may be an opaque interface. For example, imagine that we have a system that needs to be able to notify individuals, other systems, pub/sub queues, or oncalls of an event. Suppose we write a single function, alert, that accepts any of these kinds of recipient directly.
What goes in its test plan? It’s going to be huge, and it will require us to be able to set up a test environment for each of the possible kinds of recipient to check that the alert was dispatched properly. Not only is the test plan large, but the individual tests will require complex environments.
Instead, we can create an interface. There’s no “interface” type in Python, but in pseudocode it would look like
from typing import NamedTuple, Callable

class Messageable(NamedTuple):
    send: Callable[[str], None]
    name: Callable[[], str]
In C++ it would be an abstract base class. In Java or Go it would be an interface, since those languages support the idea directly. In C it would be a struct whose fields send and name are function pointers. The alert function now takes a Messageable as its recipient, whatever kind of recipient sits behind it.
And you can again write adapter functions, such as
def to_user(id: int) -> Messageable:
    ...

def to_oncall(name: str) -> Messageable:
    ...

def to_queue(name: str) -> Messageable:
    ...
and whatever else needs to receive alerts. The test plan for alert now becomes straightforward. We test entirely with a simple implementation of Messageable, such as
def make_messageable(name: str):
    msgs = []

    def _send(msg):
        msgs.append(msg)

    def _name():
        return name

    return Messageable(send=_send, name=_name)
The individual adapters such as to_user and to_oncall need to have more complex test environments, but they stay separate, and we only need those tests for the adapters, not for any other functions that use the Messageable interface.
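Here is a sketch of what a test for alert might look like against a fake recipient. The body of alert and its message format are invented for illustration, and make_recording_messageable is a hypothetical variant of the fake that also hands back the list it records into, so the test has something to inspect.

```python
from typing import Callable, List, NamedTuple, Tuple

class Messageable(NamedTuple):
    send: Callable[[str], None]
    name: Callable[[], str]

def alert(recipient: Messageable, event: str) -> None:
    """Hypothetical alert: formats a message and hands it to the recipient."""
    recipient.send(f"ALERT for {recipient.name()}: {event}")

def make_recording_messageable(name: str) -> Tuple[Messageable, List[str]]:
    """Fake recipient that also returns the list it records into."""
    msgs: List[str] = []
    return Messageable(send=msgs.append, name=lambda: name), msgs

def test_alert_sends_one_message() -> None:
    recipient, msgs = make_recording_messageable("fred")
    alert(recipient, "disk full")
    assert msgs == ["ALERT for fred: disk full"]

test_alert_sends_one_message()
```

No user database, pager, or queue is involved. The only environment the test needs is a Python list.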
As is often the case, what makes the code more testable also buys us some nice maintainability. If you need to add a new kind of recipient that can receive alerts, you don’t have to touch the alert function at all, or even have access to its code. You just need to write a new adapter function that produces a Messageable.
While spelunking through a piece of code I ran across a piece of spaghetti that resembled the following call tree:
b = B.__init__(line_length, locale)
a = A.__init__(b, name, owner)
a.generate()
b.send(a)
a.name()
b.line_length()
a.description()
b.line_length()
b.locale()
b.get_destination()
a.name()
a.owner()
Working with code that actually does this kind of ping-pong is a nightmare. What any chunk of code is supposed to do is spread across many functions and it requires a frustrating amount of brainpower just to figure out what’s going on.
Anyway, we said before that when we have multiple parameters, we have to take all combinations of their test cases to get our full test plan. With this kind of spaghetti, the parameters to both A and B have to be taken together. If they could have been tested separately, it would have been $N^2 + N^3$ test cases (1,100 cases for $N = 10$). Together, they need $N^5$ test cases (100,000 with $N = 10$). 1,100 to 100,000 is a big difference.
If we look at what’s going on here, B.send is being called to do something and is referring to A to get information it needs for that. We can decouple them and reduce our test cases by making A give you everything you need and then disposing of it.
a = A.__init__(name, owner)
info = a.generate()
# info == {
# 'name': ...,
# 'description': ...,
# 'owner': ...,
# }
b = B.__init__(line_length, locale)
b.send(info)
We could probably omit the classes entirely and just write two plain functions, one that generates the info and one that sends it.
Now we are back to $N^2 + N^3$ test cases, and the code will be much more readable.
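A minimal sketch of the class-free version, under heavy assumptions: the description text is a placeholder, and render stands in for the sending step by only wrapping text to line_length (it ignores locale and sends nothing), so that it can run on its own.

```python
import textwrap
from typing import Dict, List

def generate(name: str, owner: str) -> Dict[str, str]:
    """Produce everything the later step needs, as plain data."""
    return {
        "name": name,
        "description": f"{name} (owned by {owner})",  # placeholder description
        "owner": owner,
    }

def render(info: Dict[str, str], line_length: int) -> List[str]:
    """Stand-in for the sending step: wrap the text to line_length."""
    return textwrap.wrap(f"{info['name']}: {info['description']}", width=line_length)

info = generate("report", "fred")
lines = render(info, 20)
```

Once generate has returned, nothing downstream ever calls back into it, so each function can be tested with its own parameters alone.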
Similarly, if you have a function that does some computation and has side effects, you can simplify your life by splitting it into a pure function that calculates what the effects should be and a simple function that runs side effects.
So if we have a function that both computes a set of changes from its input and applies them, we can split it into
def delta(input) -> List[Change]:
    ...  # calculate the list of changes

def apply(ch: List[Change]):
    ...  # apply changes
Testing delta requires no complex setup. It’s just inputs and outputs. Testing apply requires more setup, but its test cases are much simpler.
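As one possible illustration of the split, suppose the job is to bring a key-value store in line with a desired state. The Change type and the dictionary standing in for the store are assumptions of mine, and a real apply would presumably talk to a database or a remote service.

```python
from typing import Dict, List, NamedTuple, Optional

class Change(NamedTuple):
    key: str
    value: Optional[str]  # None means "delete this key"

def delta(current: Dict[str, str], desired: Dict[str, str]) -> List[Change]:
    """Pure: compute the changes that turn `current` into `desired`."""
    changes = [Change(k, v) for k, v in sorted(desired.items()) if current.get(k) != v]
    changes += [Change(k, None) for k in sorted(current) if k not in desired]
    return changes

def apply(store: Dict[str, str], changes: List[Change]) -> None:
    """Effectful: mutate the store in place."""
    for ch in changes:
        if ch.value is None:
            store.pop(ch.key, None)
        else:
            store[ch.key] = ch.value
```

All the interesting branching lives in delta, which we can test with literal dictionaries. apply only needs a couple of cases: one set and one delete.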
If you apply the extract and dispose idea recursively, you end up with pipelines, such as
x = f(input)
y = g(x)
z = h(y)
m = p(z)
n = q(m)
return n
At each stage you finish entirely with the previous work and forget everything except what you need for what follows. This is extremely easy to test, but it is also extremely easy to end up thrashing from one step to the next, which brings us to our next tactic:
Here are some functions that query information about employees:
def sex_of(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
    ...

def is_bald(vs: List[Name]) -> Dict[Name, bool]:
    ...

def knows_haskell(vs: List[Name]) -> List[bool]:
    ...
Your task is to write a function that, given some names, returns their sex, if they have hair, and if they know Haskell. Looking at the types of these functions, this is going to be a bit awkward.
def query(names: List[Name]) -> Dict[Name, Dict[str, Any]]:
    people = {name: {} for name in names}
    people = sex_of(people)
    for name, ib in is_bald(names).items():
        people[name]['is_bald'] = ib
    for name, kh in zip(names, knows_haskell(names)):
        people[name]['knows_haskell'] = kh
    return people
Those three functions could all have the same type. If we choose to make them all the same type, such as
def sex_of(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
    ...

def is_bald(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
    ...

def knows_haskell(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
    ...
then our code becomes more straightforward:
def query(names: List[Name]) -> Dict[Name, Dict[str, Any]]:
    people = {name: {} for name in names}
    people = sex_of(people)
    people = is_bald(people)
    people = knows_haskell(people)
    return people
Now, this is all well and good, but how is it going to improve our testing? The number of test cases for query isn’t going to change because we reorganized this, even if the code is cleaner, and sex_of, is_bald, and knows_haskell are probably querying a database of some sort, so testing query requires setting up a test environment with that database.
Matching impedances this way buys us power, though, because we can introduce a function that chains these transforms. Transforms from a type to itself are also called endomorphisms, so you can sound very sophisticated with the battlecry “Chain your endomorphisms!” Anyway, a chaining function:
def chain(*fs):
    def newf(vs: Dict[Name, Dict[str, Any]]) -> Dict[Name, Dict[str, Any]]:
        for f in fs:
            vs = f(vs)
        return vs
    return newf
If we test chain, then functions like query stop being something we write and test separately. We just write chain(sex_of, is_bald, knows_haskell), which is something we would probably write inline wherever we use it and which barely needs testing at all. We can test chain with whatever functions of the right signature we want, so we can choose ones that, unlike sex_of or is_bald, need no external data source.
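A sketch of such a test follows. set_field is a hypothetical in-memory endomorphism that I made up so the test needs no database, and Name is assumed to be a plain str.

```python
from typing import Any, Callable, Dict

People = Dict[str, Dict[str, Any]]  # Name is assumed to be a plain str here
Transform = Callable[[People], People]

def chain(*fs: Transform) -> Transform:
    def newf(vs: People) -> People:
        for f in fs:
            vs = f(vs)
        return vs
    return newf

def set_field(key: str, value: Any) -> Transform:
    """Hypothetical endomorphism: sets key=value on every person."""
    def f(vs: People) -> People:
        return {name: {**attrs, key: value} for name, attrs in vs.items()}
    return f

def test_chain_applies_in_order() -> None:
    combined = chain(set_field("x", 1), set_field("y", 2), set_field("x", 3))
    # The last write to "x" wins, which shows the functions run left to right.
    assert combined({"ann": {}}) == {"ann": {"x": 3, "y": 2}}
    # With no functions at all, chain is the identity.
    assert chain()({"ann": {}}) == {"ann": {}}

test_chain_applies_in_order()
```

Overwriting the same key twice is a cheap way to check ordering, which is about the only behavior chain has.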
For a single function like query we haven’t bought much, but if we have more than just query dealing with such functions, then we will save a lot of test cases by testing only the endomorphisms and chain and having to do very little beyond that.
Now, we could have chosen any of the three function types to standardize on. The one we chose is probably the least flexible because the functions themselves hardcode the names of the keys they write. That can be good or bad depending on your environment. You could just as easily write a rename endomorphism. For your amusement you might want to try writing a chain function for the other options.
An extreme and well known case of the kind of thrashing we’re avoiding with this tactic is object relational mismatch, where we go from the world of a relational database where we store aspects of identities to a world of objects where we store entities with all their aspects. In the relational world aspects may be added or ignored at will. In the object world an entity has an expected set of attributes. Objects contain other objects. In relational terms, you join the aspects you need and nothing contains anything else.
The classic example of opaque types is generics, such as a function head that takes a list of some type T and returns its first element.
What properties of T do we need to consider when we test head? None! We can pick any type that is convenient for testing. We could use None, but we will want to be able to distinguish the various elements of the lists we pass in, so we’ll want a somewhat bigger type. int is usually a pretty solid choice in such cases, but we can choose whatever is easy to test, and then in use we can pass whatever complicated type we may need.
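For concreteness, here is one way head might be written and tested in Python. The definition is my assumption of the usual meaning: return the first element of a non-empty list.

```python
from typing import List, TypeVar

T = TypeVar("T")

def head(xs: List[T]) -> T:
    """Return the first element; assumes a non-empty list."""
    return xs[0]

# int is enough to tell the elements apart, so we test with int.
assert head([3, 1, 2]) == 3
assert head([7]) == 7
```

Nothing in head can depend on what T is, so the cases that pass for int carry over to whatever type the production code uses.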
But now suppose we have a function, posts_by_user, that queries a service for posts made by a given user between two points in time, and that also takes a format_spec parameter of type FormatSpec.
What is that FormatSpec? Inevitably in an API you need to specify which parts of the data you want returned, and that is what the format spec does. Our function probably does nothing with it and just passes it on to whatever service it queries. Suppose we make the client an explicit parameter and parameterize it over the type of the format spec. We can then create an interface for Client and subclass it for a fake client for testing and a real client for production, such as
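from typing import Generic, TypeVar

FS = TypeVar("FS")  # the type of format spec a client accepts

class Client(Generic[FS]): ...
class MockClient(Client[None]): ...
class RealClient(Client[FormatSpec]): ...  # FormatSpec is defined by the real service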
Then we can test posts_by_user with MockClient, where the only possible value of format_spec is None. That effectively removes a parameter from our test plan. Going from $N^4$ to $N^3$ (with $N = 10$, 10,000 to 1,000 test cases) is a significant win.
Especially if you have to pass a parameter through multiple layers of code, opacifying passthroughs like this can save a lot of tests.
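A runnable sketch of the idea, with several assumptions: the fetch_posts method name, the in-memory post records, and the exact parameter list of posts_by_user are all invented for illustration.

```python
from datetime import datetime
from typing import Generic, List, Tuple, TypeVar

FS = TypeVar("FS")  # the type of format spec a client accepts

class Client(Generic[FS]):
    def fetch_posts(self, user: str, t0: datetime, tf: datetime, format_spec: FS) -> List[str]:
        raise NotImplementedError

class MockClient(Client[None]):
    """Fake client whose format spec can only be None."""
    def __init__(self, posts: List[Tuple[str, datetime, str]]) -> None:
        self.posts = posts  # (user, timestamp, text) records held in memory

    def fetch_posts(self, user: str, t0: datetime, tf: datetime, format_spec: None) -> List[str]:
        return [text for u, ts, text in self.posts if u == user and t0 <= ts < tf]

def posts_by_user(client: Client[FS], user: str, t0: datetime, tf: datetime,
                  format_spec: FS) -> List[str]:
    # format_spec is never inspected here, only handed on to the client.
    return client.fetch_posts(user, t0, tf, format_spec)

client = MockClient([
    ("ann", datetime(2020, 1, 1), "hi"),
    ("bob", datetime(2020, 1, 2), "yo"),
    ("ann", datetime(2021, 1, 1), "late"),
])
result = posts_by_user(client, "ann", datetime(2020, 1, 1), datetime(2020, 12, 31), None)
```

Because posts_by_user never looks inside format_spec, the tests vary only the user and the two times.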
You can structure your code to dramatically reduce the number of test cases it needs.
For instance, making passthrough parameters opaque lets you test with None or similarly boring types in that slot.
Handily, all of these tend to give you code that is easier to read and easier to change over time as well.
If this was helpful to you, look at my post on easier, faster testing.