Incentives and reliability
(Based on a discussion I had about how to approach being a reliability group inside a larger software team. The discussion began with asking about hiring SREs based on my time as a production engineer at Facebook, but I quickly steered it in a different direction.)
There’s an old truism from manufacturing that you can’t inspect your way to quality. That is, it doesn’t matter how many inspectors you put downstream to check what has been made; they can’t affect what has already happened. I am always cautious about borrowing from manufacturing, but this pattern shows up constantly in software engineering organizations. Management becomes concerned about reliability and institutes a gate that must be passed before software ships, usually consisting of some kind of audit or checklist.
The gate doesn’t change what the engineers themselves do, though. Marketing is still setting release dates. The gate is now just one more thing that must be passed to hit that release date, and an engineer is assigned to trample it as flat as possible. The team does some parts and promises they’ll do the rest later. Everyone involved knows that someone is getting on a stage to make an announcement in a week, and a gatekeeper who strictly enforces the gate is going to do more damage than good in the short term, not to mention piss off any executives who have a stake in the launch. So the promises are accepted, the gate is passed, the announcement is made. Some of the promises get done. Some get displaced from the backlog by other items. So we arrive at:
A reliability organization that works by imposing gates is nearly ineffective.
The next evolution of the group is to try to do the work themselves. The engineers working on a feature didn’t write the necessary docs for customer service or runbooks for the oncall? Okay, we’ll write them.
Now for every engineer working on a feature, you have a corresponding amount of effort from the reliability team. If there are only a few items to be dealt with, it may be one reliability team member for every few software engineers, but as the software matures and becomes more complicated, that ratio is going to grow. At best, the reliability team has to grow linearly with the size of the rest of the engineering organization. So we arrive at:
A reliability organization will always fall behind on doing reliability work themselves.
So if reliability work is to reliably and completely get done, it has to be done by the engineers working on features. The only way a reliability organization can get traction is by modifying the behavior of those engineers.
Fortunately, we know a lot about modifying behavior. A behavior consists of
- an antecedent, the conditions that trigger the behavior to happen,
- the behavior itself, the action taken by the person, and
- a consequence, the response from the environment.
Consequences are either reinforcing (the person likes what happened), which makes the behavior more likely to be repeated the next time a triggering antecedent occurs, or punishing (the person did not like what happened), which makes the behavior less likely. Don’t get hung up on “punishment.” It’s a term of art in applied behavior analysis and means only a consequence that makes the behavior less likely to happen again. For example, putting your hand on a hot stove has the consequence of your hand getting burned, which is punishing.
Punishment modifies behavior, but rarely in the way you want. The behavior was happening because it was reinforcing in some way. Punishment will make the person look for a way to get the reinforcement while avoiding the punishment. If a child gets punished for taking candy from the cabinet, the modification to their behavior isn’t going to be that they stop taking candy. They’re going to take candy in a way that they can hide from you.
Looking at reliability gates through this lens, the gate is punishment. The engineer is going to still try to get the reward for shipping their feature, and they’re going to modify their behavior to do just enough to make the gate go away.
This means that the only practical tool a reliability team has in the long term is setting up antecedents and rewarding desired behavior. This is an art more than a science, and the very best practitioners—people like Barbara Heidenreich, Susan Friedman, or Ken Ramirez—are incredibly creative, but in the context of reliability work here are a couple of ideas:
Automate or systematize the jagged edges and hard parts. At Facebook or Google, everyone uses the common means of communicating between services (gRPC at Google, ServiceRouter at Facebook) because it’s easy. Authentication, routing, serialization, load balancing, failover—it’s all done. Similarly, many web services rely on the relational database’s locking to handle all concurrency issues between separate requests. Occasionally deadlock or lock timeout issues show up, but the system defaults to safe and your job is only to unsnarl the locks.
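To make the database-locking point concrete, here is a minimal sketch (using SQLite and a hypothetical `counters` table, since the post names no specific schema or database): several threads hammer the same row, and the engineers writing the handlers never think about concurrency because the database’s locking serializes the writes for them.

```python
import os
import sqlite3
import tempfile
import threading

# Hypothetical counter table standing in for shared application state.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('hits', 0)")
conn.commit()

def increment(n):
    # Each simulated "request" opens its own connection, as a web
    # handler would. timeout=30 makes SQLite wait and retry instead of
    # failing when another writer holds the lock: safe by default.
    c = sqlite3.connect(path, timeout=30)
    for _ in range(n):
        with c:  # transaction; commits on exit
            c.execute(
                "UPDATE counters SET value = value + 1 WHERE name = 'hits'"
            )
    c.close()

threads = [threading.Thread(target=increment, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every read-modify-write happened inside the database, so no
# increments are lost despite four concurrent writers.
final = conn.execute(
    "SELECT value FROM counters WHERE name = 'hits'"
).fetchone()[0]
print(final)
```

The handlers contain no locks, queues, or retries of their own; the shared system absorbs all of that, which is exactly what makes the easy path the safe path.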
Use the shared systems to impose leverage and control points. If everyone is using the same inter-service communication system, you can impose controls there and they apply everywhere. Improving telemetry of requests between services becomes easy: add your telemetry to the inter-service communication system, and you never have to ask anyone who uses it. Or, more actively, you can impose gates, such as disallowing specific routings or refusing connections from binaries that are more than thirty days old.
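A sketch of that control point, with everything hypothetical (this is not ServiceRouter or gRPC, just the shape of the idea): a single shared `call` entry point that every service uses, so telemetry and the stale-binary gate apply to all callers with no per-team work.

```python
import time
from datetime import datetime, timedelta, timezone

# Hypothetical build timestamp; a real system would stamp this into the
# binary at build time.
BUILD_TIME = datetime.now(timezone.utc) - timedelta(days=5)
MAX_BINARY_AGE = timedelta(days=30)

METRICS = []  # stand-in for a real telemetry sink


def call(service, method, payload, transport):
    """Shared RPC entry point: because every caller goes through here,
    policy checks and telemetry apply everywhere automatically."""
    # The gate: binaries past the age limit may not connect at all.
    if datetime.now(timezone.utc) - BUILD_TIME > MAX_BINARY_AGE:
        raise RuntimeError("binary too old to make connections; redeploy")
    start = time.monotonic()
    try:
        return transport(service, method, payload)
    finally:
        # The telemetry: recorded for every request, with no action
        # required from the teams that own the services.
        METRICS.append({
            "service": service,
            "method": method,
            "latency_s": time.monotonic() - start,
        })


# Usage: the transport is whatever actually moves bytes; a fake here.
result = call("inventory", "get", {"sku": 7}, lambda s, m, p: {"ok": True})
```

The design choice worth noticing is that the control lives in the path everyone already takes because it is the easiest path, not in a separate review step anyone could route around.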
Donella Meadows produced a really useful taxonomy of twelve places to intervene in a system. Standards—the gates above—are among the least effective intervention points on her list. Creating common systems like the inter-service communication layer above is also low on the list, but it is the groundwork for changing the system’s rules and incentives, which sit much higher: the shared system is what makes taking part in inter-service communication easy in the first place.
But if you look at what gets in reliability gates, a lot of it isn’t the software itself. It’s documentation, for the public or for customer support. It’s tools and instructions to deal with outages. It’s tests that give compelling evidence of correctness but are orthogonal enough that they don’t make it onerous to change the code and fast enough to be run on every check-in.
When most software engineers use the word ‘feature,’ their mental conception is of the code necessary. There may be a vague notion of tests and deployment, perhaps a dim hint of documentation over the horizon. Coming back to antecedent-behavior-consequence, this conception of a feature makes it unlikely that antecedents for writing customer service documentation will ever occur, whether or not a behavior would happen in response.
If we look at Donella Meadows’s list again, number two is mindset or paradigm. At this point I’ll make another left turn and refer to David Marquet: leadership is language (or, more darkly, Orwell’s “Politics and the English Language”). Language shapes what is easy for us to think, and human reward systems are heavily weighted towards mental ease.
So can a reliability organization affect mindset via language? ‘Feature’ is well worn and its meaning is clear. Imagine instead a separate term, implying a broader scope. This is a spot where I’m tempted to go read about Disney’s Imagineering, since they’re about the most detail-oriented creative design organization on the planet. They probably have vocabulary and process that we can co-opt, at least in part, for this purpose. The goal is to make the language everyone uses subsume ‘feature’—the feature becomes one piece of the easily accessible mental model the engineer defaults to—and to make antecedents for creating all the other pieces of reliability work common. So maybe the final dictum is:
Reliability organizations should build the ‘Disney rides’ that shape software engineers’ behavior.