
Zero downtime deployments

Status: Finished
Confidence: Certain

You have a webservice. How do you push a new version of its software without downtime? This sometimes goes under the name of “blue/green deployment” or “rolling push” or a variety of other names for special cases.

What we are replacing

Before we dive into zero downtime deployments, let’s talk about deployments that have downtime.

Here’s the scenario: You have a service listening for requests. Various users are issuing requests to the service.

[Diagram: users sending requests to the service]

At some point you have a new version of the service to deploy. Obviously, because you are a conscientious individual, you have ensured that it is well tested and you are confident this is going to go well. You turn off the service, install the new version, and turn on the service. Ideally you give some warning before you do this by scheduling a “maintenance window,” but either way no one can reach the service during this operation.

[Diagram: a deployment with downtime. The service is taken down, the new version is started, and the gap between the two is the maintenance window.]

Before you go rushing out to set up a zero downtime deployment process, reflect whether you actually need it.

Imagine you have a simple Django or Ruby on Rails web service. You’re running one copy on a single machine. It serves a few hundred concurrent sessions. The downtime for deploying a new version of the code is measured in seconds: copy the new code to the server, stop the service, and start it running the new code.

Further, many services are only used during business hours in a small range of time zones.

If your service is idle ten hours a day, it doesn’t matter if users can’t reach it for thirty minutes in those ten hours. If this is the case, you will be better served spending your time on other aspects of the system. Are you sure your backups are working?

Zero downtime deployment: the basic procedure

To do a zero downtime deployment, you need to have at least two instances of your program accepting traffic and a load balancer directing connections to them.

To do a zero downtime deployment, you:

  1. Copy the new code to one of the instances.
  2. Tell the load balancer to stop routing requests to that instance, and wait for all current requests to the instance to complete.
  3. Stop the service on the instance.
  4. Start the service on the instance with the new code.
  5. Run a sanity check that the instance came up properly.
  6. Tell the load balancer to start routing requests to the instance.
  7. Repeat for the other instance.

It should be obvious, but it must be said: if the sanity check in step 5 fails, you stop and figure out what went wrong. Once you have done so, you repeat steps 1, 3, 4, and 5 on the same instance until you have successfully brought up the new version. Leave the other instance that is still working alone. Someday you will be stressed in the middle of a deployment, and on that day I hope you remember this warning.
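Here is a rough sketch of the whole procedure as a deploy script, including stopping when the sanity check fails. The instance names and the helpers (copy_code, deregister_instance, and so on) are hypothetical stand-ins for whatever your load balancer and hosts actually provide:

```python
import sys

# Sketch only: INSTANCES and the helper functions are hypothetical stand-ins
# for your own load balancer and host tooling.
INSTANCES = ["app-1.internal", "app-2.internal"]

def deploy(version):
    for instance in INSTANCES:
        copy_code(instance, version)              # step 1
        deregister_instance(instance)             # step 2: stop new requests...
        wait_for_requests_to_drain(instance)      # ...and let current ones finish
        stop_service(instance)                    # step 3
        start_service(instance)                   # step 4
        if not sanity_check(instance):            # step 5
            # Stop right here. This instance stays out of rotation while you
            # figure out what went wrong; the other, still-working instance
            # keeps serving traffic.
            sys.exit(f"{instance} failed its sanity check, aborting the push")
        register_instance(instance)               # step 6
```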

Each of the instances must be able to handle the total traffic of your service. You pay for zero downtime deployments by running more hardware than the service’s usual load requires. On the other hand, you need that same spare hardware anyway if you want to keep providing service when you lose part of your system in a disaster.

But sessions…

The first wrinkle in this scheme comes from services that don’t interact with a user in a single request and response. Imagine you are deploying an upgrade to an online widgets catalog. In the new version you have replaced a misspelled HTTP request path /parts/wigdets/ with the correct /parts/widgets/. A user has loaded a page on your site with a link that calls /parts/wigdets/. Between the time they load the page and the time they click the link, you deploy your new version to one of your two instances. When the user clicks the link, what happens?

If the load balancer is sending requests randomly to each instance, there’s a 50% chance that the user gets what they expect, and a 50% chance that they get a 404 Page Not Found error.

In a more extreme case, consider video chat. When a server receives a datagram of video data to forward to another user, it’s not practical for it to write the datagram to shared storage and for another machine to be alerted, read it, and send it out. At best, the latency it would cause would emulate the experience of trying to video chat with someone on the moon. Video chat sessions are handled in memory by one machine, and the packets for all parties in a particular call have to go to that one machine.

What both these examples mean is that we need a way to tell the load balancer that a given user session must always go to the same instance of the program. This goes under the name of “session pinning,” “persistent sessions,” or “sticky sessions,” and any load balancer you would consider using in production should have it.

The naive way of implementing sticky sessions for N servers is to number the servers from 0 to N-1, use the client’s IP address to calculate hash(IP address) mod N, and use that as the index into the array of servers. If that server is down, look it up some other way, such as hashing the hash to get a new, deterministic, random choice of server.
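A minimal sketch of that naive approach, taking the list of servers and a health check function as inputs:

```python
import hashlib

def pick_server(client_ip, servers, is_up):
    """Naive sticky routing: hash the client IP to pick a server index.

    is_up(server) reports whether a server is currently healthy. If the
    chosen server is down, hash the hash to get a new deterministic choice.
    """
    digest = hashlib.sha256(client_ip.encode()).digest()
    for _ in range(10 * len(servers)):
        index = int.from_bytes(digest, "big") % len(servers)
        if is_up(servers[index]):
            return servers[index]
        digest = hashlib.sha256(digest).digest()  # deterministic re-roll
    raise RuntimeError("no healthy server found")

# Example: always routes 203.0.113.7 to the same healthy server.
print(pick_server("203.0.113.7", ["10.0.0.1", "10.0.0.2"], lambda s: True))
```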

This doesn’t work in general for several reasons:

  1. Many users share an IP address. Everyone behind a corporate proxy or a carrier’s NAT shows up as one client, so the load ends up very unevenly distributed.
  2. A user’s IP address can change in the middle of a session, for example when a phone moves from WiFi to a cellular network, which silently sends them to a different server.
  3. Adding or removing a server changes N, which reassigns most clients to different servers all at once.

The usual solution is to teach the load balancer about your application protocol. Amazon’s Elastic Load Balancer, for example, uses a cookie to track sessions.
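A toy version of cookie-based affinity, ignoring everything a real load balancer worries about (signing the cookie, health checks, and so on); the request object and cookie name here are hypothetical:

```python
import random

AFFINITY_COOKIE = "backend"   # hypothetical cookie name

def route(request, servers):
    """Pick a backend for this request, pinning the session with a cookie."""
    pinned = request.cookies.get(AFFINITY_COOKIE)
    if pinned in servers:
        return pinned, {}                          # existing session: same backend
    backend = random.choice(servers)               # new session: pick one...
    return backend, {AFFINITY_COOKIE: backend}     # ...and tell the client to remember it
```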

This may not be a problem

Most services aren’t video chat systems. All that complexity we were talking about above? You may be able to skip it if you:

  1. Store all session information in shared storage such as the database. If your Django app stores all session information in its MySQL database, then it doesn’t matter which instance the load balancer routes an HTTP request to (see the snippet after this list).
  2. Check for backwards compatibility in your release process. The canonical way to do this is to have lots of tests that you run, but replaying real traffic against your test environment and checking that it matches what production does can work as well, if you have enough traffic to cover all your app’s functionality.
  3. Deploy large user interface changes almost as new systems. Think about how Google rolls out major changes to Gmail’s user interface: you opt in with a button to get the new one. Any changes that don’t require some action on the user’s part are tiny adjustments that users won’t notice as they work with the app.
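For the first point, Django’s database-backed sessions are one concrete way to get there; a minimal sketch of the relevant setting:

```python
# settings.py (sketch): keep sessions in the shared database rather than in
# per-process memory, so any instance can serve any request.
SESSION_ENGINE = "django.contrib.sessions.backends.db"

# A cache-backed session store works too, as long as the cache is shared
# (a central Redis or memcached), not local to one instance:
# SESSION_ENGINE = "django.contrib.sessions.backends.cache"
```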

For a lot of situations this isn’t very onerous. It’s worth designing for if you can.

How long are sessions?

For those who do need sticky sessions, there are more difficulties. Look back at the basic procedure for zero downtime deployments. In step 2, we wait for everything in flight to the instance to finish, and with sticky sessions that means whole sessions, not single requests. How long is a video call? A minute? An hour? All day, as a family opens a video chat to a distant relative on a holiday so they can be virtually together? How long can you wait to drain this instance? Or when someone is talking to their friend as they arrive at work in an underground bunker with only secured external communications, how long do you wait after the last time you hear from them before you declare the session over?

When sessions are open ended, you need an inactivity timeout after which they expire, and you need a way to migrate them. The migration tends to be domain specific. For example, for a video chat system you might try something like this:

  1. The instance currently managing the session generates a nonce (a single use, random number) and sends it to all clients of the session.
  2. The clients open a new connection to the load balancer with the nonce as their session affinity key, so they all go to the same instance.
  3. The clients issue a special “resume” request to the new instance, including the nonce. The instance creates a session and connects all clients with that nonce to it.
  4. The clients send an acknowledgement on the old connection to the old instance.
  5. Once all the clients have acknowledged or the session expiration time has been reached, the old instance closes the session.

Even if this whole process takes two hundred milliseconds, users will experience it as only a minor blip.
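To make the shape of this concrete, here is a rough sketch of the old instance’s side of the handoff. The session and client objects and their methods are hypothetical stand-ins for whatever your connection handling actually looks like:

```python
import secrets
import time

MIGRATION_TIMEOUT_SECONDS = 5   # stop waiting for stragglers after this long

def migrate_session(session):
    """Old instance's side of the handoff (sketch; `session` is hypothetical)."""
    # Step 1: generate a nonce and tell every client to reconnect using it
    # as their session affinity key.
    nonce = secrets.token_urlsafe(16)
    for client in session.clients:
        client.send({"type": "migrate", "nonce": nonce})

    # Steps 2 to 4 happen elsewhere: each client opens a new connection with
    # the nonce, sends a "resume" request to the new instance, and then
    # acknowledges on its old connection.

    # Step 5: close the old session once everyone has acknowledged or the
    # timeout passes.
    deadline = time.monotonic() + MIGRATION_TIMEOUT_SECONDS
    acknowledged = set()
    while len(acknowledged) < len(session.clients) and time.monotonic() < deadline:
        client = session.wait_for_acknowledgement(timeout=deadline - time.monotonic())
        if client is not None:
            acknowledged.add(client)
    session.close()
```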

For a slightly more complicated example, consider a multiplayer real time game. Today they mostly follow the lead of Quake 3. Client and server send UDP datagrams, which may or may not arrive. Each datagram says, “The last time I know that you know about is a. I represent time b. The differences from a to b are x, y, and z.” A client sends such a datagram to the server with the player’s activity. The server sends such a datagram to the client to update the world the player interacts with. If you drop a datagram, it’s no big deal. The server keeps track of the last state each client has acknowledged and always sends its next update as the difference from that state.

For example, here is a server with two clients. At each step, the server sends the changes relative to the last state the client has acknowledged.

[Diagram: a server and two clients exchanging datagrams. The server sends each client the change from the last state that client has acknowledged: time 0 to 1, 1 to 2, 2 to 3, and so on. When datagrams to Client 2 are lost, the server sends ever larger deltas (time 1 to 3, then time 1 to 4) until Client 2 acknowledges again, after which both clients receive the change from time 4 to 5.]

If you miss some datagrams, who cares? As long as enough get through to keep the states enough in sync to play the game, it will work.
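A sketch of the bookkeeping on the server’s side, with the game state as a plain dictionary and the networking left out:

```python
def diff(old_state, new_state):
    """The changes needed to get from old_state to new_state."""
    return {key: value for key, value in new_state.items()
            if old_state.get(key) != value}

class GameServer:
    """Sketch of delta-update bookkeeping; networking is left out."""

    def __init__(self, initial_state):
        self.history = {0: initial_state}   # time -> full game state at that time
        self.last_acked = {}                # client id -> last time that client acknowledged

    def record_state(self, time, state):
        self.history[time] = state

    def update_for(self, client_id, current_time):
        """Build the datagram for one client: the last time I know you know
        about is a, I represent time b, here are the differences from a to b."""
        a = self.last_acked.get(client_id, 0)
        b = current_time
        return {"from": a, "to": b, "changes": diff(self.history[a], self.history[b])}

    def on_acknowledgement(self, client_id, acked_time):
        # Lost datagrams don't matter: the next update is simply a diff against
        # the newest time this client has actually acknowledged.
        self.last_acked[client_id] = max(self.last_acked.get(client_id, 0), acked_time)

# Example:
server = GameServer({"player_x": 0})
server.record_state(1, {"player_x": 5})
print(server.update_for("client-1", 1))
# {'from': 0, 'to': 1, 'changes': {'player_x': 5}}
```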

How do we migrate a session here? Something very similar to the video chat method would work. Say we are going to migrate the session at time 7, but the last state that all clients have acknowledged is 5. Then,

  1. The instance currently managing the session generates a nonce (a single use, random number) and sends it to all clients of the session. It saves a snapshot of the game state at times 5 to 7 and the last known times for each client to shared storage, using that nonce as its key (a sketch of this snapshot step follows the list). The server then ignores all updates from clients and no longer sends game state updates.
  2. The clients open a new connection to the load balancer with the nonce as their session affinity key, so they all go to the same instance.
  3. The clients issue a special “resume” request to the new instance, including the nonce. The instance creates a session and connects all clients with that nonce to it. It loads the snapshot using the nonce, and starts handling datagrams from the clients and sending out updates.
  4. The clients send an acknowledgement on the old connection to the old instance.
  5. Once all the clients have acknowledged or the session expiration time has been reached, the old instance closes the session.
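The new piece relative to the video chat case is the snapshot in shared storage. A minimal sketch, assuming Redis as the shared store (any shared key/value store would do) and game state that can be serialized as JSON; the hostname is hypothetical:

```python
import json
import redis

store = redis.Redis(host="shared-store.internal")   # hypothetical shared Redis

SNAPSHOT_TTL_SECONDS = 60   # snapshots are only needed for the handoff itself

def save_snapshot(nonce, states, last_acked):
    """Old instance: store game state at the relevant times, keyed by the nonce."""
    snapshot = {"states": states, "last_acked": last_acked}
    store.set(f"migration:{nonce}", json.dumps(snapshot), ex=SNAPSHOT_TTL_SECONDS)

def load_snapshot(nonce):
    """New instance: fetch the snapshot when the first 'resume' request arrives."""
    raw = store.get(f"migration:{nonce}")
    return json.loads(raw) if raw else None
```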

So much for sessions. Let’s talk about zero downtime deployments of bigger, more complicated systems.

More is basically the same

There is a famous adage in condensed matter physics that “more is different.” In this case, though, more is basically the same. If we have five instances behind the load balancer instead of two, or fifty, or five thousand, the basic procedure stays the same. The only thing that changes is how many and which machines we deregister and upgrade at a time. Upgrading a batch at a time like this is called a “rolling push.”

Recall that with two instances, each instance must be able to carry the entire traffic. The same principle applies here. If you have 50 instances and upgrade them 10 at a time, the remaining 40 must carry the full load, so each must be able to handle 25% more traffic than usual. If you upgrade them 5 at a time, the remaining 45 only need to handle about 11% more. On the other hand, you have to make sure each batch is healthy after you upgrade it. If you let the new version “bake” for five minutes before starting the next batch, then upgrading 5 at a time will take nearly an hour, while upgrading 10 at a time will take half that.

Take a few minutes to play with the relationship among fleet size, batch size, and bake time to get a feel for it.

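A small sketch of that arithmetic, using the example numbers from above:

```python
def rolling_push(fleet_size, batch_size, bake_minutes):
    """Headroom the remaining instances need, and how long the push takes.

    Assumes batch_size < fleet_size and counts a bake period after every batch.
    """
    batches = -(-fleet_size // batch_size)        # ceiling division
    remaining = fleet_size - batch_size           # instances still serving during a batch
    extra_load = fleet_size / remaining - 1       # extra traffic per remaining instance
    return {"batches": batches,
            "extra load per instance": f"{extra_load:.0%}",
            "push duration (minutes)": batches * bake_minutes}

print(rolling_push(fleet_size=50, batch_size=10, bake_minutes=5))
# {'batches': 5, 'extra load per instance': '25%', 'push duration (minutes)': 25}
print(rolling_push(fleet_size=50, batch_size=5, bake_minutes=5))
# {'batches': 10, 'extra load per instance': '11%', 'push duration (minutes)': 50}
```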

Once you reach a modest scale, you probably have machines in different geographic locations. This is primarily for three reasons:

  1. If a hurricane hits the data center in New York where half your servers are located, it’s very unlikely that another natural disaster will take out the data center in Los Angeles where your other servers are running at the same time.
  2. Users get lower latency using the service if there is a server nearer to them.
  3. Large companies may not be able to lease enough space and power in one data center for all their machines.

Imagine that the hurricane hits New York while you are deploying an upgrade. All the traffic that was going to New York now goes to Los Angeles, and you have to have enough machines available to handle it. So you have to handle upgrades in each region separately, and you need enough spare capacity in each data center to absorb the other’s traffic. If half your traffic normally goes to each data center, you should be running your hardware at around 50% utilization in each under normal conditions.

When you have multiple locations, you run a separate rolling push in each of them at the same time. Imagine if you pushed one whole data center and then the other: you would essentially create a disaster scenario at every push. Any other coupling among pushes in different locations still leads to uneven load. And since running a push in each data center without orchestration is simpler, there’s no reason to do otherwise.

A final variation

Say we have more than just some web servers and a database. What if we have those, plus a queueing system that feeds work to a bunch of worker nodes, a collection of caching servers, and who knows what else. How do we do a zero downtime upgrade of that?

At some level of size and complexity the answer becomes very simple: each of those things gets deployed separately, and you apply the basic procedure to each of them. Time to upgrade the workers? Take them down in batches and upgrade them. Every piece of the system has to function in an environment that is running multiple versions of the software at the same time. This situation goes under the name of “microservices.”

For modestly sized systems with lots of complex state that might be difficult to make robust to having different versions in play, there is an alternative strategy called blue/green deployments. It goes back to the simple idea of two instances behind a load balancer, but this time you deploy two whole copies of the entire system in parallel. One copy is called blue and one is called green. At any moment the load balancer sends all traffic to one copy; you deploy the new version to the idle copy, check that it works, and then switch the load balancer over to it. If something goes wrong, you switch back.
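As a sketch of what the cut-over itself might look like, assuming a hypothetical load balancer API with a single active-pool setting and hypothetical install and sanity check helpers:

```python
# Sketch only: `lb`, install, and sanity_check stand in for whatever API your
# load balancer and hosts actually expose.

POOLS = {"blue": ["blue-1.internal", "blue-2.internal"],
         "green": ["green-1.internal", "green-2.internal"]}

def deploy_blue_green(lb, new_version):
    active = lb.get_active_pool()                    # e.g. "blue"
    idle = "green" if active == "blue" else "blue"

    for host in POOLS[idle]:
        install(host, new_version)                   # upgrade the idle copy
    if not all(sanity_check(host) for host in POOLS[idle]):
        raise RuntimeError("new version failed its checks; traffic stays on " + active)

    lb.set_active_pool(idle)                         # the actual cut-over
    # If anything looks wrong afterwards, lb.set_active_pool(active) rolls back.
```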

Blue/green deployments do not scale up well for two reasons:

  1. At some point the time to push the entire system becomes so long that it’s not practical.
  2. At some point you can’t afford the hardware to run two entire copies of the system.

But for systems of five to ten distinct subsystems and a few tens of machines, it’s a useful trick to have in your back pocket.

What you should remember…

Don’t assume you need a zero downtime deployment. It’s a lot of engineering that you might be able to expend more profitably elsewhere.

If you do need a zero downtime deployment, try to design your system so you don’t have to pin sessions to individual instances behind the load balancer.

And remember the basic procedure:

  1. Copy the new code to one of the instances.
  2. Tell the load balancer to stop routing requests to that instance, and wait for all current requests to the instance to complete.
  3. Stop the service on the instance.
  4. Start the service on the instance with the new code.
  5. Run a sanity check that the instance came up properly.
  6. Tell the load balancer to start routing requests to the instance.
  7. Repeat for the other instance.

Almost everything else comes from thinking through the basic procedure in your particular circumstances.