Quality assurance & operations

This is part of a series on getting better as a programmer. The articles are:

In writing programs we talked about programs as texts with meanings. Now we come to how we deal with our programs when they turn into actual processes happening in the real world.

“Beware of bugs in the above code; I have only proved it correct, not tried it.” -Don Knuth

The programs most of us care about are those that specify the actions of real machines. There is a set of machines out there that we call “production.” It might be web servers. It might be people’s phones running an app you wrote, or desktop computers running a word processor, or trains running control software. It might even be your own machine and no one else’s, producing some answer you want.

There are all kinds of forces on those machines. A website can be swamped by users. A hurricane can knock a data center offline. A desktop computer can crash with unsaved data in memory. An attacker tries to break in. The users themselves do something unexpected. And, of course, programmers change the program.

Programmers changing things is the single greatest risk to a running program, so we give the problem of minimizing that risk its own name: quality assurance. Handling all the other forces on the program goes under the name operations.

Quality assurance

Let’s start with how we prevent ourselves from breaking production. There’s a perfect method, and then there are methods that are possible in reality.

The perfect method is to test every possible route through the program. Go through all possible inputs, or sequences of inputs over time for interactive programs, and check all aspects of the output for each input. If we have a program that takes two boolean values and computes their AND, we have four possible inputs (true/true, true/false, false/true, false/false). It’s hard to imagine a more comprehensive test for this program than putting in all four inputs and checking the outputs.
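For a program this small, the exhaustive ideal test fits in a few lines. A sketch in Python, where `logical_and` is a stand-in name for the program under test and the truth table is written out independently of the implementation so the test has a real oracle:

```python
# The program under test; logical_and is a stand-in name for illustration.
def logical_and(a: bool, b: bool) -> bool:
    return a and b

# The exhaustive ideal test: every possible input, checked against a truth
# table written out independently of the implementation.
TRUTH_TABLE = {
    (True, True): True,
    (True, False): False,
    (False, True): False,
    (False, False): False,
}

def test_exhaustive() -> bool:
    return all(logical_and(a, b) == want for (a, b), want in TRUTH_TABLE.items())

assert test_exhaustive()
```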

But consider something like calculating the square root of a 64 bit integer. There are $2^{64}$, or roughly $18 \times 10^{18}$, different inputs to this function. My current laptop runs at 2.6GHz and has 6 cores. Even if we could run one test per cycle per core, it would take over 37 years to test all of the possible inputs. With two arguments, such as addition of 64 bit integers, it’s $2^{64} \times 2^{64}$ possible inputs. For something like a web server, it’s just not possible.

And we don’t always know what the right answer is. If I’m writing a word processor, what is the right place to break lines in a paragraph, for every possible paragraph? To typeset his magnum opus The Art of Computer Programming, Donald Knuth wrote his own typesetting system called TeX. As part of that he had to invent new linebreaking algorithms. Or in the music typesetting program LilyPond, there’s a huge amount of effort to figure out rules that work well across different pieces of music, but the programmers cannot examine the results of all possible music. The PostgreSQL team’s query planner (the program that takes a SQL query and figures out which exact loads from disk and index lookups the database should do) is an amazing piece of software, but they publicly ask that if you find a query it doesn’t make good decisions on, you submit it as a bug so they can try to fix it. In the absence of such feedback, they don’t know what the perfect answer is. Or if you’re writing a program to calculate answers you can’t otherwise get, such as optimal routing for transit or simulations of the large scale motion of galaxies, you cannot know the right answer in advance because it’s what you’re trying to find.

As an aside, I want to point you to an old gem of a paper from 1975: “Toward a theory of test data selection” (PDF) by JB Goodenough and SL Gerhart. Aside from the utter glory of a seminal paper in the field of software testing being written by someone with the surname Goodenough, it articulates what it means to test software adequately. It’s worth pulling out some nomenclature from it:

We want ideal tests, defined as tests that pass only when the program is correct. The most obvious ideal test is the exhaustive one that tests all inputs (but we already said that’s generally impossible in practice). To find an ideal test that isn’t exhaustive, we need some criterion for selecting the inputs to test. We call a criterion that produces ideal tests complete.

Completeness breaks into

• reliability: the criterion should be consistent: for a given program, every test it generates should produce the same verdict, all passing or all failing,
• validity: the criterion produces tests whose results are meaningful, that is, for each error in the program there is a subset of inputs in the test which reveal it.

Reliability may seem a strange criterion. You pick some inputs to test and they’re valid or not, right? No! A criterion can include algorithms for generating which inputs to use based on random numbers, on the text of the program, and on the results of previous tests. For example, property based testing works by generating increasingly large random inputs and asserting whether properties hold on them. When it finds a counterexample where a property doesn’t hold, it automatically tries to shrink it to a minimal counterexample for you, and some libraries implementing it will store generated counterexamples between test runs so they can be retried. Another example is the American Fuzzy Lop fuzz tester, which observes the paths a program actually takes and uses that feedback to generate input cases that exercise as many paths through it as possible.
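A toy sketch of the generate-and-shrink loop may make this concrete. Real libraries (Hypothesis, QuickCheck) are far more sophisticated; `buggy_sort` and the length-preservation property here are invented for illustration:

```python
import random

def buggy_sort(xs):
    """The code under test: sorts, but (deliberately) drops duplicates."""
    return sorted(set(xs))

def prop_preserves_length(xs):
    return len(buggy_sort(xs)) == len(xs)

def shrink(prop, xs):
    """Greedily delete elements while the property still fails."""
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            smaller = xs[:i] + xs[i + 1:]
            if not prop(smaller):
                xs = smaller
                changed = True
                break
    return xs

def find_counterexample(prop, tries=200, seed=0):
    """Generate random inputs of growing size until the property fails,
    then shrink the failing input to a minimal counterexample."""
    rng = random.Random(seed)
    for attempt in range(tries):
        xs = [rng.randint(0, 10) for _ in range(attempt % 20)]
        if not prop(xs):
            return shrink(prop, xs)
    return None

counterexample = find_counterexample(prop_preserves_length)
```

Shrinking matters because the first failing input is usually large and noisy; here it reduces to a two-element list with a duplicate, which points straight at the bug.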

You can also use your knowledge of the program to guide what conditions you pick to test. The xmonad window manager parameterizes the data structure it uses to track windows and which window is focused and tests it with integers slotted in as the parameter, even though the real program has a structure representing a window on the screen in that slot. They leverage the formal properties they can impose in their language’s type system to limit the test data they need to consider: if their type works with integers, it will work the same way with windows.
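The same trick can be sketched in Python. This is not xmonad’s actual StackSet, just an illustration of parameterizing a focus structure over its element type and testing it with integers:

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

T = TypeVar("T")

@dataclass
class FocusList(Generic[T]):
    """Items left of the focus, the focused item, items to its right."""
    left: List[T]
    focus: T
    right: List[T]

    def focus_next(self) -> "FocusList[T]":
        if not self.right:
            return self  # nothing to the right; focus stays put
        return FocusList(self.left + [self.focus], self.right[0], self.right[1:])

# Because FocusList never looks inside T, exercising it with plain integers
# tells us exactly how it will behave with real window objects.
s = FocusList(left=[], focus=1, right=[2, 3])
assert s.focus_next().focus == 2
assert s.focus_next().focus_next().focus == 3
assert s.focus_next().focus_next().focus_next().focus == 3  # stays at the end
```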

A lot of knowledge about programs is somewhat generic and can be turned into cross-cutting tools. A program written in Rust without invoking its escape hatch into unsafe code won’t have problems with memory allocation and freeing. There is a whole world of verification tools like Frama-C or SPARK that automatically prove classes of properties about programs that then don’t need to be tested.

Other knowledge is very specific to the program. If you’re writing a distributed data store and you need to merge changes from two different nodes, the merge operation must be the join of a semilattice. If you can prove that your merge operation obeys the axioms defining a join, lots of other tests around eventual consistency become unnecessary.
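The join axioms (idempotence, commutativity, associativity) can be spot-checked directly. A sketch where set union stands in for the merge of a grow-only set replica:

```python
import random

def merge(a: frozenset, b: frozenset) -> frozenset:
    """Candidate merge for two replicas of a grow-only set: union.
    (A stand-in; the axioms below are what any real merge must satisfy.)"""
    return a | b

def satisfies_join_axioms(a, b, c) -> bool:
    idempotent = merge(a, a) == a
    commutative = merge(a, b) == merge(b, a)
    associative = merge(merge(a, b), c) == merge(a, merge(b, c))
    return idempotent and commutative and associative

rng = random.Random(1)

def random_replica() -> frozenset:
    return frozenset(rng.sample(range(10), rng.randint(0, 5)))

all_ok = all(
    satisfies_join_axioms(random_replica(), random_replica(), random_replica())
    for _ in range(100)
)
assert all_ok
```

Random spot checks are not a proof, but they catch a broken merge cheaply, before the much heavier machinery of eventual-consistency testing ever runs.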

I’ve written down most of what I know about picking test data elsewhere.

There is another consideration that I haven’t mentioned yet: how fast do these tests run? If your test suite takes a day to run, you’re going to spend most of your programming time waiting for tests to see if your change was correct. Michael Feathers, in his wonderful book Working Effectively With Legacy Code, recommends that you have at least a subset of tests that runs in a few seconds and covers the part of the program you’re working on well enough to give you confidence. If you’ve never worked in an environment like that, let me tell you: it’s incredibly liberating. You reflexively hit run on your tests when you pause to think, the way people used to reflexively save their documents when word processors crashed all the time.

But are those tests enough? Sometimes yes. Sometimes no. So maybe you have larger tests you run nightly. Maybe you have a whole, deployed environment that is very similar to production where you mirror live traffic or generate your own traffic. This is the second part of quality assurance: designing the flow from our editors to production.

This design takes the form of nested feedback loops, differentiated mostly on how long the loop takes to run and what resources it needs to run. The innermost, fastest loop is on your machine as you write code. How fast do you find out that you misspelled a function name or forgot a semicolon? In a good IDE today, it should be almost immediate as you type. If you have really small, fast subsets of your test suite that are relevant to the changes you’re making you may end up using the “save and run tests” button as the trigger for your feedback loop. The timescale of this feedback loop needs to be under a second.

At some point your edits reach the point where you want a slightly longer feedback loop, one where you explicitly switch your mental focus to examining the change you’re making as a whole. You run tests that are a bit slower, you review the changes you’ve made in your version control client, you spend a few tens of seconds on the process.

Then you get to feedback loops that take long enough that you go get a cup of coffee or tea while you wait, a few minutes. Finally you get feedback loops that take hours or days, such as code reviews, deployments to staging environments that try to imitate production, or sending out releases to beta testers. These are often nested in turn. You have to iterate through the code review feedback loop until you pass before you get to a staging environment. If it fails there, you start back through all the feedback loops again to reach it again. Once you iterate to success in the staging environment, you go to production, and your feedback loop becomes customer complaints that you have to fix.

At the same time, how much effort is it worth to avoid an error reaching production? NASA’s most intense quality assurance processes are incredibly effective at producing error free software, but they involve multiple dedicated testers per programmer and produce code at a rate that would make producing most of the world’s software too expensive to actually happen. On the other hand, if you’re throwing together a shell script for yourself and it screws up on line 22 of its input file, you just edit it by hand and get on with life.

Getting better at quality assurance is largely being able to allocate the finite time in various feedback loops most valuably, which usually means maintaining the level of errors reaching production that is acceptable for your system while minimizing the time for a programmer’s change to reach production. Issues that happen regularly (missing semicolons, messing up the inputs to a function) go in the innermost loop provided you have time, and less common issues are shunted into slower loops. It also means being able to design the system so that much of the testing of some piece can be done in inner loops and doesn’t have to wait for outer loops.

Operations

Now we move on to all the other things that go wrong with programs. This is, paradoxically, less of an issue in general than quality assurance. Computers are devices that mechanically follow instructions over and over again. With ECC RAM they’re even pretty resistant to things like cosmic rays. So long as you provide a computer with electricity, its parts haven’t failed, the connections it has to the outside world, whether a network link or a robotic arm, are still reachable, and the conditions now are about the same as they were ten minutes ago, it largely keeps working. Operations, then, is about

• mitigating the impacts of change as much as you can in advance
• being able to detect when something does change
• being able to do something about it

Mitigation is all the stuff that systems administrators do that everyone griped about. Why do I have to deal with the system being slower while backups are taken? Why can’t I use my name as my password? Why can’t we install the latest version of this software right now? The salesman told me it had this feature that I want. Why can’t I log into the production database directly from my laptop when I’m at the airport? Why can’t I run the credit card processing from my server under my desk? Why are we paying for a second server in that other data center?

You don’t start with mitigation. You start with risk models. There are piles of books and seminars and acronyms on this topic, but it boils down to pretending you are a demon out to destroy your system and dreaming up ways of doing it.

This is actually really fun.

Your server is running on a machine in a rack in a data center that’s a two hour drive away. It has sensitive information on it and needs to remain available.

Start with the big things: power and network. What happens if the data center is hit by a hurricane and its whole region loses power? What happens if a tree falls on the network cable for the data center? (These are real examples from my personal experience.) What if a breaker in the data center blows and power to your server’s rack goes away until a technician can go reset it? What happens if a network administrator at one of the backbone providers screws up a routing table and cuts off one part of the Internet from another?

The computer itself is also vulnerable. Our data is on a disk. If it’s a spinning disk, it will eventually have a mechanical failure. If it’s on an SSD, we will eventually exceed the number of writes it can do. The power supply or fans can fail.

Then there’s that sensitive information. How could you steal it? You could go in over the network. It’s a server after all. Can you get a legitimate user’s password and access it? Does that user have access to everything or do you need multiple users to get it all? Could you get into the data center and steal the physical machine, put the hard disk in another computer, and read it that way?

Oh, and it’s supposed to remain available. Can you knock it offline? Can you overwhelm it so it can’t serve legitimate traffic? Can you knock its data center offline? What about cutting the fiber optic connections to the city it’s in? A friend of mine once pointed out that with a backhoe on a trailer and a couple of hours, she could take Phoenix, AZ offline for weeks.

Part of why experienced security and operations personnel are paid more is that their experience lets them generate incredible lists like this, often with glee. Another part of why is that they can then triage these lists for what is most important to address and what they’re not going to try to mitigate. If the server is only handling traffic for a business in a small region, there’s no reason to try to keep it available if the whole region is hit by a disaster, but you had better have backups to bring it back online afterwards.

Mitigation also affects how you write software. At the crudest level, imagine your program crashes every so often and you have to manually restart it. Eliminating the crashes is mitigation. Writing a program so it starts shedding load when it is overwhelmed rather than grinding to a halt is mitigation. Setting up how you persist data so you can take backups without stopping your program is mitigation. Deny by default is one of the best mitigation strategies ever devised (see this explanation).
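Deny by default can be sketched in a few lines. The roles and actions here are invented for illustration; the point is that an allowlist grants specific permissions and everything else falls through to refusal:

```python
# Deny by default: start from an empty allowlist and grant specific
# permissions, instead of starting open and blocking known-bad cases.
ALLOWED_ACTIONS = {
    ("reader", "view_report"),
    ("admin", "view_report"),
    ("admin", "delete_report"),
}

def is_permitted(role: str, action: str) -> bool:
    # Anything not explicitly granted falls through to refusal.
    return (role, action) in ALLOWED_ACTIONS

assert is_permitted("admin", "delete_report")
assert not is_permitted("reader", "delete_report")
assert not is_permitted("intern", "reboot_server")  # unknown role: denied
```

The design choice that matters is the direction of the default: forgetting to add a rule produces an inconvenient refusal rather than a silent hole.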

There’s a trap that fast growing companies often fall into. They reward engineers for features shipped, and there are always more engineers coming in so even if the feature was written in a way that adds a lot of future operational burden, there will be more people to handle that burden. As soon as the company stops growing fast or, even worse, has layoffs, the teams immediately find themselves drowning in responding to issues that could have been mitigated when the program was written. At that point you have three choices:

1. stop doing any new feature development and fight the fires that keep starting
2. let all but the worst fires burn and start working on reducing the burden
3. let all but the worst fires burn and keep writing new features

Organizations that choose option 3 generally do so by default, so the same incentives are still in place. Those new features will still be written in a way that adds more burden because that’s what’s rewarded, and eventually there are enough sufficiently bad fires that the team ends up in option 1 anyway.

After mitigation we come to actual incidents when things go wrong. The first problem is always knowing when something goes wrong. How do you get information from production to a system that monitors it, and how do you detect problems in that information that a human should be notified of?

If you’re running a web service, getting the information is fairly straightforward, though it may be messy. If production is a desktop application or a fleet of bulldozers, how do you find out about problems? Microsoft captures telemetry from its desktop applications and sends it to their servers, as do many other organizations. Other organizations forbid the use of software on their systems that sends telemetry. Lawyers argue over contracts, engineers put in special flags to turn off telemetry for certain license keys, and you find out about incidents because humans call tech support.

The most extreme case I have dealt with is a computer lab I put together for a school on an outlying island in Polynesia. In the best case, where money is no object and the weather cooperates, getting spare parts to this school takes at least a week and involves multiple flights followed by chartering a boat and sailing for three days. Generally a boat goes only every couple of months, and only part of the year. It can take months before I hear if there is a problem. They’ll hopefully have a satellite connection at some point, which will speed this up drastically.

So, getting the information comes down to having computers send information about what they’re doing to other computers or having humans contact other humans and capturing what is communicated. The former began as printing text to a file on disk, and today has turned into three distinct kinds of data. Metrics are sequences of numbers over time, such as CPU load or available disk space or number of HTTP requests. Events are structured data describing a particular thing that happened, such as an HTTP request arriving and its status code and headers. Traces are a tree of events where each child happened as part of its parent happening, such as database queries and checking sessions as children of an HTTP request being handled. We generally use metrics to decide that something is wrong and traces to figure out what it is. Honeycomb.io is a good representation of the state of the art today, and their articles and blog are a good resource.
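A minimal sketch of the three kinds of data, with invented field names (real telemetry formats are much richer):

```python
# A metric: a sequence of (timestamp, value) samples over time.
cpu_metric = [(0, 31.5), (60, 33.0), (120, 97.2)]

# An event: structured data about one thing that happened.
event = {"name": "http_request", "status": 500, "path": "/checkout"}

# A trace: events linked by span ids into a tree, children under parents.
trace = [
    {"span": "a1", "parent": None, "name": "http_request",  "duration_ms": 812},
    {"span": "b2", "parent": "a1", "name": "check_session", "duration_ms": 9},
    {"span": "c3", "parent": "a1", "name": "sql_query",     "duration_ms": 790},
]

def slowest_child(trace, parent_span):
    """Which step under this span accounts for the most time?"""
    children = [e for e in trace if e["parent"] == parent_span]
    return max(children, key=lambda e: e["duration_ms"])["name"]

assert slowest_child(trace, "a1") == "sql_query"
```

The metric would tell you the 120-second sample looks wrong; walking the trace tells you where the time in a slow request actually went.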

Capturing data communicated by humans is a question of organization. Is there a central place that support personnel can raise issues? Do they have a way of escalating to engineering? Do they know enough to be able to do so intelligently? Are there people going through the issues that customers raise to look for problems? Doing this well is a lot of work, and companies that are accustomed to computers sending information to computers tend to avoid it like the plague. Google’s support (or their effective lack of it) is a wonderful example of this, as is Facebook’s. Even companies that have decent support personnel may not have the organizational discipline to handle what they hear well. It’s all too easy for support to end up cut off from engineering, with no one dealing with the issues coming in.

Once you have metrics about your system, how do you decide if there’s a problem? Most places do this somewhat by intuition, but there is a rigorous basis for it in systems modeling and time series statistics.

The usual place people learn systems thinking is Donella Meadows’s book Thinking In Systems. The rough structure goes like this: you abstractly describe your system as stocks and flows. A web server has a stock of currently running HTTP requests it is handling. There is a flow in of new requests and a flow out of requests as they are completed. All the subsidiary systems involved in handling the HTTP requests have their own stocks of utilized resources and flows in based on how much load they have and flows out based on completion. If there is a queue in the system, its depth is a stock and its flows in and out are the obvious pushing and popping of elements. These techniques were developed to model complex systems like cities or the whole planet (the Club of Rome models that led to the book Limits to Growth), and they are the most generic tools available to turn interlinked parts into a dynamical system with properties we can measure over time.
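A stock-and-flow model of a server queue can be simulated in a few lines. The numbers are arbitrary; the dynamic they show is the one that matters for capacity:

```python
def simulate(arrivals_per_tick, service_per_tick, ticks):
    """Stock-and-flow model of a server: the stock is queue depth, the
    inflow is arriving requests, the outflow is completed requests."""
    depth = 0                                  # the stock
    history = []
    for _ in range(ticks):
        depth += arrivals_per_tick             # flow in
        depth -= min(depth, service_per_tick)  # flow out, capped by the stock
        history.append(depth)
    return history

# Inflow above outflow capacity: the stock grows without bound.
assert simulate(12, 10, 5) == [2, 4, 6, 8, 10]
# Outflow capacity above inflow: the stock drains and stays empty.
assert simulate(5, 10, 3) == [0, 0, 0]
```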

Once you have a model of your system and the stocks you care about, you sample the level of those stocks from your system, such as capturing a metric of percent CPU utilization or IO operations per second. Once you have that, you need a way of deciding whether a given level is a problem or not. The most useful literature on this goes under the label of “statistical process control” or “statistical quality control,” but when you are trying to judge a particular measurement it boils down to one of three things. You compare the current value, or a recent set of values, to one or more of

• the value at a previous point in time, perhaps directly before the current value or at a time in the past that we expect it to be similar such as the same hour the previous day or the same day of the previous week.
• the values of other members of a cohort, such as other servers with the same workload or the duration of other similar requests.
• the values predicted by some theoretical model, perhaps predicted by your systems model or just a fixed point such as wanting your CPU utilization to stay below 80% and planning to add capacity when it goes over.
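These three comparisons can be combined into one check. A sketch with invented thresholds (the factor of two and the 80% ceiling are illustrative, not recommendations):

```python
def looks_anomalous(current, same_hour_yesterday, cohort, ceiling):
    """Compare one metric sample against the three kinds of baseline:
    its own past, its cohort, and a theoretical limit."""
    vs_past = current > 2 * same_hour_yesterday    # sudden jump vs. history
    cohort_mean = sum(cohort) / len(cohort)
    vs_cohort = current > 2 * cohort_mean          # outlier among its peers
    vs_model = current > ceiling                   # fixed capacity-planning line
    return vs_past or vs_cohort or vs_model

# 95% CPU vs. 40% yesterday, peers in the mid-40s, and an 80% ceiling:
assert looks_anomalous(95, 40, [42, 47, 44], 80)
assert not looks_anomalous(50, 40, [42, 47, 44], 80)
```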

You use these to construct an expression involving your metrics and figure out a threshold to put on them. Set your threshold too low and you’ll give lots of false alarms. Set it too high and you’ll miss real problems. I’ve written elsewhere (part 1, part 2, correction) about this problem.

When your metric goes over that threshold and you’re woken up at 2AM, what do you do? Or if a customer has notified you of a critical bug, what do you do? Always, in order:

1. Understand the problem.
2. Mitigate the problem.
3. Fix the problem.

Always understand first. If there’s a crisis, it’s tempting to dive in and start doing something. You may have people breathing down your neck expecting action. But action based on faulty understanding is often worse than doing nothing at all. You risk introducing a new problem in something unrelated if your understanding is wrong, or even making the current problem worse. There’s a famous quotation from Jay Forrester, who created the field of system dynamics:

“People know intuitively where leverage points are,” he says. “Time after time I’ve done an analysis of a company, and I’ve figured out a leverage point — in inventory policy, maybe, or in the relationship between sales force and productive force, or in personnel policy. Then I’ve gone to the company and discovered that there’s already a lot of attention to that point. Everyone is trying very hard to push it IN THE WRONG DIRECTION!”

If you’re working on a system where you’re getting data sent by the computer, then you go back to your events and traces. The same system model you constructed for understanding your metrics applies here, but events represent individual increments of a flow increasing or decreasing some stock, and you need to track down why those are happening. If you’re dealing with human reports, you will want to get someone with the problem on the phone or in person. Then figure out a minimal, isolated way of reproducing the problem.

Once you have that, you need to mitigate the problem. That does not mean fixing the problem. Fixing the problem may require quite deep changes and bears all the risk of any significant change to a program. For now you need to apply first aid. Is this one feature causing a crash? Disable it. Is a workload larger than some size causing the machine to corrupt data? Stick in a check that rejects workloads larger than that size. Like applying a tourniquet stops bleeding but can lead to the loss of a limb if left on too long, mitigations buy you time to do a real fix.
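A mitigation of that second kind might look like this sketch; the limit, names, and exception are hypothetical:

```python
MAX_SAFE_ITEMS = 10_000  # hypothetical limit; above it we have seen corruption

class WorkloadTooLarge(Exception):
    pass

def process(workload):
    return len(workload)  # stand-in for the real (currently buggy) work

def submit(workload):
    # First aid, not a fix: refuse the dangerous input until the
    # underlying corruption bug is diagnosed and properly repaired.
    if len(workload) > MAX_SAFE_ITEMS:
        raise WorkloadTooLarge(f"{len(workload)} items exceeds the safe limit")
    return process(workload)

assert submit(list(range(10))) == 10
```

Note that the check lives at the boundary, not buried in the buggy code path, so it is easy to find and remove once the real fix lands.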

Once the problem is mitigated, then you plan the fix. Often you don’t dive in immediately. If it is three in the morning and everyone involved has been up since 6AM the previous day, everyone should go to sleep and resume when they are rested. Then begin the process of fixing with focus but without rush.

Finally, it’s important to close this, the largest quality assurance feedback loop. What went wrong? What happened? There is a lot of practice in this area around running blameless postmortems of incidents.

How to ascend

The first step is to run something in production. Run a website for a group you’re part of. Run a Minecraft server for your friends. Do desktop support for your local library. There is no substitute for the experience of running real systems that people care about. Learn how to take backups. Make a free Honeycomb account and start sending some metrics and traces, even if you’re not sure if they’re exactly the right ones.

Meanwhile, write some tests. They don’t have to be good tests at first, but build the habit. I think everyone should try test driven development (that is, write a test that fails, alter the code to make it pass, repeat) for at least a little while for the discipline it builds. Read my course on choosing test data and Michael Feathers’s Working Effectively With Legacy Code.

This happens first. You have to learn the tools and the feel of this before higher strategy and theory is helpful. Once you have a little experience, you can start thinking about the higher level structure. Read at the very least Donella Meadows’s essay on leverage points and a little bit about Boyd’s work on OODA loops, and start examining the quality assurance and operations of the systems you’re working on. If you have time, read at least part of Donella Meadows’s Thinking In Systems.

Spend some time with some security people thinking up risk models. This can be as simple as taking people from the security team at your organization out for a beer a few times and having them play “what if” with you and tell you about stuff they’ve seen. Do the same with the operations team. Read Site Reliability Engineering, but don’t take it as gospel. There are a lot of good ideas there, and a lot of Google’s marketing for their hiring brand. Someone will tell you to read The Phoenix Project. You can skip it. Its author is a snake oil salesman who repackaged an older book, The Goal, by Goldratt. Goldratt is worth reading. Go to incident review at your company and listen and learn.

Meanwhile study the quality assurance and operations at your organization. What is it? You may be surprised at how few people actually know how it works. When you have it figured out, start looking for places where:

• You could mitigate something in the program that you’re monitoring and responding to in production.
• You’re testing things that could be caught faster with static analysis.
• You’re waiting on slow, involved tests because parts of the system are written in a way that doesn’t allow focused, fast tests.
• You’re maintaining large collections of test data, possibly obsolete, perhaps copied from production and anonymized. Can you replace this with techniques like property based testing, fuzzing, or using SMT solvers to generate traffic that exercises your possible cases?
• You have a feedback loop that is overloaded to the point where it is turning into the next feedback loop out, leaving a wasted inner feedback loop. Can you restore value to that inner loop?

The transcendent level is looking at a system and seeing what pieces of it are part of the solution, and what are part of the problem. It’s being able to look at a system and know the likely quality and operational issues, and then envision changes that will improve them. This verges into software architecture, but far too much writing on architecture is divorced from the operation of the running system.

This series is still being written. Subscribe to get emailed when each section is posted: