You have a web service. How do you push a new version of its software without downtime? This sometimes goes under the name of “blue/green deployment” or “rolling push” or a variety of other names for special cases.
Before we dive into zero downtime deployments, let’s talk about deployments that have downtime.
Here’s the scenario: You have a service listening for requests. Various users are issuing requests to the service.
At some point you have a new version of the service to deploy. Obviously, because you are a conscientious individual, you have ensured that it is well tested and you are confident this is going to go well. You turn off the service, install the new version, and turn on the service. Ideally you give some warning before you do this by scheduling a “maintenance window,” but either way no one can reach the service during this operation.
Before you go rushing out to set up a zero downtime deployment process, reflect whether you actually need it.
Imagine you have a simple Django or Ruby on Rails web service. You’re running one copy on a single machine. It serves a few hundred concurrent sessions. The downtime for deploying a new version of the code is measured in seconds: copy the new code to the server, stop the service, and start it running the new code.
Further, many services are only used during business hours in a small range of time zones.
If your service is idle ten hours a day, it doesn’t matter if users can’t reach it for thirty minutes in those ten hours. If this is the case, you will be better served spending your time on other aspects of the system. Are you sure your backups are working?
To do a zero downtime deployment, you need to have at least two instances of your program accepting traffic and a load balancer directing connections to them.
To do a zero downtime deployment, you:

1. Deregister one instance from the load balancer so it receives no new connections.
2. Wait for the sessions already on that instance to complete.
3. Stop the old version of the program on that instance and install the new version.
4. Start the new version.
5. Sanity check that the new version is healthy and serving correctly.
6. Reregister the instance with the load balancer.

Then repeat the whole procedure for the other instance.
It should be obvious, but it must be said: if the sanity check in step 5 fails, you stop and figure out what went wrong. Once you have done so, you repeat steps 1, 3, 4, and 5 on the same instance until you have successfully brought up the new version. Leave the other instance that is still working alone. Someday you will be stressed in the middle of a deployment, and on that day I hope you remember this warning.
Each of the instances must be able to handle the total traffic of your service. You pay for zero downtime deployments by running more hardware than the service’s usual load requires. On the other hand, you will need the same hardware if you need to be able to continue to provide service even when you lose parts of your system in a disaster.
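As a concrete sketch, here is the basic procedure against a toy load balancer in Python. The Instance and LoadBalancer classes are stand-ins I made up for whatever your real deployment tooling provides; only the order of operations matters.

```python
# Toy sketch of a zero downtime deployment. The classes below are
# hypothetical stand-ins for your real load balancer API and tooling.

class Instance:
    def __init__(self, name):
        self.name = name
        self.version = "v1"
        self.log = []

    def wait_for_drain(self):   self.log.append("drain")
    def install(self, version): self.log.append("install"); self.version = version
    def start(self):            self.log.append("start")
    def sanity_check(self):     self.log.append("check"); return True

class LoadBalancer:
    def __init__(self, instances): self.active = set(instances)
    def deregister(self, inst):    self.active.discard(inst)
    def register(self, inst):      self.active.add(inst)

def deploy(instances, new_version, lb):
    for inst in instances:
        lb.deregister(inst)         # stop sending it new connections
        inst.wait_for_drain()       # let open sessions finish
        inst.install(new_version)
        inst.start()
        # If this check fails, stop and investigate before touching
        # the instances that are still serving the old version.
        assert inst.sanity_check()
        lb.register(inst)           # back into rotation
```

At every point in the loop, at most one instance is out of the load balancer, so the service stays reachable throughout.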
The first wrinkle in this scheme comes from services that don’t interact with a user in a single request and response. Imagine you are deploying an upgrade to an online widgets catalog. In the new version you have replaced a misspelled HTTP request path /parts/wigdets/ with the correct /parts/widgets/. A user has loaded a page on your site with a link that calls /parts/wigdets/. Between the time that they load the page and the time that they click the link, you deploy your new version to one of your two instances. When the user clicks the link, what happens?
If the load balancer is sending requests randomly to each instance, there’s a 50% chance that the user gets what they expect, and a 50% chance that they get a 404 Page Not Found error.
In a more extreme case, consider video chat. When a server receives a datagram of video data to forward to another user, it’s not practical for it to write the datagram to shared storage and for another machine to be alerted, read it, and send it out. At best, the latency it would cause would emulate the experience of trying to video chat with someone on the moon. Video chat sessions are handled in memory by one machine, and the packets for all parties in a particular call have to go to that one machine.
What both these examples mean is that we need a way to tell the load balancer that a given user session must always go to the same instance of the program. This goes under the name of “session pinning,” “persistent sessions,” or “sticky sessions,” and any load balancer you would consider using in production should have it.
The naive way of implementing sticky sessions for N servers is to number the servers from 0 to N-1, use the client’s IP address to calculate hash(IP address) mod N, and use that as the index into the array of servers. If that server is down, look it up some other way, such as hashing the hash to get a new, deterministic, random choice of server.
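Here is what that naive scheme looks like in Python. I use SHA-256 instead of Python’s built-in hash(), which is randomized per process and would break stickiness across restarts; is_up is a stand-in for however you detect a dead server.

```python
import hashlib

def pick_server(ip, servers, is_up):
    """Naive sticky sessions: hash the client's IP to a server index.

    is_up is a callable reporting whether a server is reachable; it
    stands in for your real health checking.
    """
    h = int.from_bytes(hashlib.sha256(ip.encode()).digest(), "big")
    server = servers[h % len(servers)]
    if is_up(server):
        return server
    # Server down: hash the hash for a new deterministic choice.
    h = int.from_bytes(hashlib.sha256(str(h).encode()).digest(), "big")
    return servers[h % len(servers)]
```

The same IP always maps to the same server, which is exactly the property that makes the scheme both attractive and, as discussed next, fragile.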
This doesn’t work in general for several reasons:

- Many clients can share a single IP address behind carrier-grade NAT or a corporate proxy, so the load lands very unevenly on your servers.
- A client’s IP address can change in the middle of a session, especially on mobile networks, which silently moves the session to a different server.
- Adding or removing a server changes N, which remaps almost every session at once.
The usual solution is to teach the load balancer about your application protocol. Amazon’s Elastic Load Balancer, for example, uses a cookie to track sessions.
Most services aren’t video chat systems. All that complexity we were talking about above? You may be able to skip it if you:

- keep no session state in the memory of any one instance, and
- put anything that must survive from one request to the next in shared storage that every instance can reach, such as a database or the client’s cookies.
For a lot of situations this isn’t very onerous. It’s worth designing for if you can.
For those who do need sticky sessions, there are more difficulties. Look back at the basic procedure for zero downtime deployments. In step 2, we wait for all sessions to an instance to complete. How long is a video call? A minute? An hour? All day as a family opens a video chat to a distant relative on a holiday so they can be virtually together? How long can you wait to drain this instance? Or when someone is talking to their friend as they arrive at work in an underground bunker with only secured external communications, how long do you wait from the last time you hear from them before you declare the session over?
When sessions are open ended, you need an inactivity timeout after which they expire, and you need a way to migrate them. This tends to be domain specific. For example, for a video chat system you might try something like this:

1. Pick the instance the session will move to and copy the session’s state to it.
2. Have the old instance forward incoming datagrams to the new one.
3. Tell the clients to send their datagrams to the new instance from now on.
4. Once every client has switched, drop the session from the old instance.
Even if this whole process takes two hundred milliseconds, users will experience it as only a minor blip.
For a slightly more complicated example, consider a multiplayer real time game. Today they mostly follow the lead of Quake 3. Client and server send UDP datagrams, which may or may not arrive. Each datagram says, “The last time I know that you know about is a. I represent time b. The differences from a to b are x, y, and z.” A client sends such a datagram to the server with the player’s activity. The server sends such a datagram to the client to update the world the player interacts with. If you drop a datagram, it’s no big deal. The server keeps track of the last time each client has acknowledged, and always sends updates relative to that time.
For example, here is a server with two clients. At each step, the server sends the changes relative to the last state the client has acknowledged.
If you miss some datagrams, who cares? As long as enough get through to keep the states enough in sync to play the game, it will work.
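As a sketch of the server side of this protocol (the state representation and names are my own invention): state is a dict of entity to value, and a diff is just the entries that changed. Deletions are ignored to keep the sketch short; a real game would pack all of this into UDP datagrams.

```python
class GameServer:
    def __init__(self):
        self.history = {0: {}}   # world state at each time step
        self.now = 0
        self.acked = {}          # client -> last time it acknowledged

    def tick(self, new_state):
        """Advance the world by one time step."""
        self.now += 1
        self.history[self.now] = new_state

    def update_for(self, client):
        """Build the datagram: 'the last time I know you know about
        is a, I represent time b, here is what changed.'"""
        a = self.acked.get(client, 0)
        base, current = self.history[a], self.history[self.now]
        diff = {k: v for k, v in current.items() if base.get(k) != v}
        return {"from": a, "to": self.now, "diff": diff}

    def on_ack(self, client, time):
        # Datagrams can arrive out of order; never move backwards.
        self.acked[client] = max(self.acked.get(client, 0), time)
```

If an update is lost, the client simply never acknowledges it, and the next update is computed from the older acknowledged state, so nothing is permanently missed.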
How do we migrate a session here? Something very similar to the video chat method would work. Say we are going to migrate the session at time 7, but the last state that all clients have acknowledged is 5. Then,

1. Copy state 5 and the diffs from 5 to 7 to the new instance.
2. Tell the clients to send their datagrams to the new instance.
3. The new instance sends updates relative to state 5 until the clients acknowledge something later.

Any datagrams that still arrive at the old instance during the switch can simply be dropped: the protocol already tolerates loss.
So much for sessions. Let’s talk about zero downtime deployments of bigger, more complicated systems.
There is a famous adage in condensed matter physics that “more is different.” In this case, though, more is basically the same. If we have five instances behind the load balancer instead of two, or fifty, or five thousand, the basic procedure stays the same. The only thing that changes is how many and which machines we deregister and upgrade at a time. Upgrading a batch at a time like this is called a “rolling push.”
Recall that with two instances, each instance must be able to carry the entire traffic. The same principle applies here. If you have 50 instances and upgrade them 10 at a time, the remaining 40 must carry the full load, so each of them must be able to handle 25% more traffic. If you upgrade them 5 at a time, the remaining 45 only need to handle about 11% more traffic. On the other hand, you have to make sure each batch is healthy after you upgrade it. If you let the new version “bake” for five minutes before starting the next batch, then upgrading 5 at a time will take nearly an hour, while upgrading 10 at a time will take half that.
Take a few minutes to play around with the relationship between batch size and bake time to get a feel for it:

[Interactive calculator: choose a batch size and bake time; it shows what percent utilization you can run your servers at between pushes, and how many minutes a push will take.]
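The relationship is simple arithmetic. Here is a sketch of it as a function (the names are mine):

```python
import math

def rolling_push(n_instances, batch_size, bake_minutes):
    """How long a rolling push takes and what headroom it needs.

    While a batch is down, the surviving instances carry all the
    traffic, so each one sees n / (n - batch) times its usual load.
    """
    batches = math.ceil(n_instances / batch_size)
    surviving = n_instances - batch_size
    extra_load = n_instances / surviving - 1    # e.g. 0.25 = 25% more
    max_utilization = surviving / n_instances   # stay below this normally
    return batches * bake_minutes, extra_load, max_utilization
```

For 50 instances pushed 10 at a time with a five minute bake, this gives a 25 minute push during which each surviving instance carries 25% more traffic than usual, so you must normally run below 80% utilization.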
Once you reach a modest scale, you probably have machines in different geographic locations. This is primarily for three reasons:

- to put the service close to its users and keep latency down,
- to keep the service available when a disaster takes out an entire data center or region, and
- to satisfy laws governing where particular users’ data may be stored and processed.
Imagine that a hurricane hits New York while you are deploying an upgrade. All the traffic that was going to New York is now going to Los Angeles, and you have to have enough machines available to handle it. So we have to handle upgrades in each region separately. You’re also going to need enough extra machines in each data center to handle the traffic. If half your traffic goes to each data center, you should be running your hardware at around 50% utilization in each data center under normal conditions.
When you have multiple locations, you do a separate rolling push in each of them at the same time. Imagine if you pushed one data center and then the other. You essentially create a disaster scenario at every push. Any other coupling among pushes in different locations will still lead to uneven load. And since running a push in each data center without orchestration is simpler, there’s no reason to do otherwise.
Say we have more than just some web servers and a database. What if we have those, plus a queueing system that feeds work to a bunch of worker nodes, a collection of caching servers, and who knows what else. How do we do a zero downtime upgrade of that?
At some level of size and complexity the answer becomes very simple: each of those things gets deployed separately, and you apply the basic procedure to each of them. Time to upgrade the workers? Take them down in batches and upgrade them. Every piece of the system has to function in an environment that is running multiple versions of the software at the same time. This situation goes under the name of “microservices.”
For modestly sized systems with lots of complex state that might be difficult to make robust to having different versions in play, there is an alternative strategy called blue/green deployments. It goes back to the simple idea of two instances behind a load balancer, but this time you deploy two whole copies of the entire system in parallel. One copy is called blue, one copy is called green.
Blue/green deployments do not scale up well for two reasons:

- You are running and paying for two complete copies of the entire system, so the cost grows with everything in it, not just the web servers.
- The cutover moves all traffic at once, so any problem in the new version hits every user simultaneously, and shared state such as the database still has to work with both versions during the switch.
But for systems of five to ten distinct subsystems and a few tens of machines, it’s a useful trick to have in your back pocket.
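A toy sketch of the cutover, with illustrative names: deploy to the idle copy, sanity check it, and then flip which copy is live in one step.

```python
# Toy sketch of a blue/green cutover. sanity_check stands in for
# whatever validation you run against the not-yet-live copy.

class BlueGreen:
    def __init__(self):
        self.versions = {"blue": "v1", "green": "v1"}
        self.live = "blue"

    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy(self, new_version, sanity_check):
        target = self.idle()
        self.versions[target] = new_version  # install on the idle copy
        if not sanity_check(target):
            return False                     # live copy was never touched
        self.live = target                   # the one atomic switch
        return True
```

The appeal is visible in the failure path: a bad deploy never alters the copy that users are talking to.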
Don’t assume you need a zero downtime deployment. It’s a lot of engineering that you might be able to expend more profitably elsewhere.
If you do need a zero downtime deployment, try to design your system so you don’t have to pin sessions to individual instances behind the load balancer.
And remember the basic procedure: deregister an instance from the load balancer, wait for its sessions to complete, install and start the new version, sanity check it, and reregister it. Then move on to the next instance.
Almost everything else comes from thinking through the basic procedure in your particular circumstances.