madhadron

Serializing data, wrong and right

Status: Finished
Confidence: Likely

tl;dr: Use an established serialization library/protocol like Thrift, Protobuf, Cap’n Proto, or CBOR. See the checklist at the end.

Say you’re preparing a hunk of data to be written to a file or an external key-value store, or to be sent across a network socket. How do you go from live data structures in memory to a sequence of bytes? That step is called serializing the data.

Let’s consider how not to do it first.

In the depths of a system I was working on, I ran across a piece of code that stopped me cold. I felt like I was being gaslit by my codebase. It went something like this:

template<typename T>
std::string serialize(const T& v) {
  char buffer[sizeof(T)];
  memcpy(buffer, &v, sizeof(T));
  return std::string(buffer);
}

If you haven’t dealt with the problems of serializing data before you may wonder what’s wrong with that. But the question is really, what isn’t wrong with that?

Let’s start with a simple case. We’ll serialize a 32 bit unsigned integer. We’ll add a main function to our program that prints out the hexadecimal representation of our serialization.

Why hexadecimal? It’s the most convenient notation for working with raw bytes. Consider the following sequence of bytes in various bases:
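A small sketch that produces such a table. The sample bytes are an arbitrary choice of mine for illustration, and render and sample_table are hypothetical helper names, not part of any library.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Render each byte in the given base (2, 8, 10, or 16), space separated.
std::string render(const std::vector<uint8_t>& bytes, int base) {
  std::ostringstream out;
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    if (i > 0) out << ' ';
    const unsigned v = bytes[i];
    switch (base) {
      case 2:  out << std::bitset<8>(v); break;
      case 8:  out << std::oct << std::setfill('0') << std::setw(3) << v; break;
      case 16: out << std::hex << std::setfill('0') << std::setw(2) << v; break;
      default: out << std::dec << v; break;
    }
  }
  return out.str();
}

// The same (arbitrary) eight bytes, one line per base.
std::string sample_table() {
  const std::vector<uint8_t> sample = {0xff, 0x0f, 0x02, 0x2a,
                                       0xff, 0xf0, 0x02, 0x0f};
  return "binary:  " + render(sample, 2) + "\n" +
         "octal:   " + render(sample, 8) + "\n" +
         "decimal: " + render(sample, 10) + "\n" +
         "hex:     " + render(sample, 16) + "\n";
}

// std::cout << sample_table(); prints:
//   binary:  11111111 00001111 00000010 00101010 11111111 11110000 00000010 00001111
//   octal:   377 017 002 052 377 360 002 017
//   decimal: 255 15 2 42 255 240 2 15
//   hex:     ff 0f 02 2a ff f0 02 0f
```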

Try to answer:

  1. Where are the bytes with all bits set?
  2. Where are the bytes with only the lower four bits set?
  3. What bytes have only the second bit set?

Once you’ve tried that you will probably note a few things:

  1. Recognizing all 1’s in binary requires focusing in on the contents of the string. Recognizing all 1’s in octal or decimal means memorizing the special strings 377 and 255, which aren’t that easy to distinguish from 376 or 254. In hexadecimal, it’s ff, which is the largest two digit hexadecimal number and quick to recognize.
  2. In hexadecimal, this is the same as the first digit of the pair being 0. In the other bases, who knows?
  3. In binary you can examine each. In octal and decimal you need to know what numbers have that bit set all the way up to 255. In hexadecimal, since each digit represents 4 bits, you only need to memorize which have that bit set up to 16. People who spend a lot of time messing with memory can recognize the bits set in a 4 bit hex digit reflexively.

In practice you will often use a hex editor, which shows the contents of your data as hexadecimal numbers in one pane, and as text in the other.

Anyway, our program:

#include <cstdint>
#include <iomanip>
#include <string>
#include <iostream>
#include <cstring>
#include <vector>
#include <iterator>

template<typename T>
std::string serialize(const T& v) {
  char buffer[sizeof(T)];
  memcpy(buffer, &v, sizeof(T));
  // Note: you must use this exact constructor,
  // or the string will only load as far as the
  // first 0 byte.
  return std::string(buffer, sizeof(T));
}

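template<typename T>
void hexdump(const T& value, const std::string& serialized) {
  std::cout << "Value:      " << value << std::endl;
  std::cout << "Serialized: " ;
  for (const auto c : serialized) {
    // Note: where char is signed, bytes of 0x80 and above are
    // sign-extended by this cast and print as e.g. ffffffc0.
    // Cast through unsigned char first to print just two digits.
    std::cout << std::hex << std::setfill('0') << std::setw(2) << (int)c << " ";
  }
  std::cout << std::endl;
}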

int main() {
  uint32_t value = 42;
  // Serialize the value exactly as it sits in memory.
  auto serialized = serialize(value);
  hexdump(value, serialized);
}

I saved that in serialize.cpp and at my shell ran

$ c++ -std=c++17 serialize.cpp  && ./a.out
Value:      42
Serialized: 2a 00 00 00

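2a is indeed the hex representation of 42. Now, if you ran it on a big endian machine, such as a SPARC or a POWER system running in big endian mode, you would get a different result:

$ c++ -std=c++17 serialize.cpp && ./a.out
Value:      42
Serialized: 00 00 00 2a

Intel processors, such as the one in my laptop, are “little endian”: the least significant byte comes first in memory. SPARC and traditional POWER systems are “big endian”: the most significant byte comes first. (ARM and modern POWER can run in either mode, and in practice ARM systems almost always run little endian.) Big endian more or less won on the wire, since when you are transferring over the network it is assumed that you will send big endian, and because of that Intel chips have quite a bit of infrastructure for handling big endian numbers.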
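One way to get the same bytes on every machine is to stop copying memory and build the bytes from the value with shifts. This is a minimal sketch, not a full serializer; serialize_u32_be and deserialize_u32_be are names I made up for illustration, and the reader assumes it is handed at least four bytes.

```cpp
#include <cstdint>
#include <string>

// Write a 32 bit unsigned integer as four bytes, most significant first
// (big endian, "network byte order"), whatever the host's byte order is.
std::string serialize_u32_be(uint32_t v) {
  std::string out(4, '\0');
  out[0] = static_cast<char>((v >> 24) & 0xff);
  out[1] = static_cast<char>((v >> 16) & 0xff);
  out[2] = static_cast<char>((v >> 8) & 0xff);
  out[3] = static_cast<char>(v & 0xff);
  return out;
}

// Read the value back. Assumes bytes.size() >= 4.
uint32_t deserialize_u32_be(const std::string& bytes) {
  // Go through unsigned char so bytes of 0x80 and above are not sign-extended.
  const auto b = [&bytes](int i) {
    return static_cast<uint32_t>(static_cast<unsigned char>(bytes[i]));
  };
  return (b(0) << 24) | (b(1) << 16) | (b(2) << 8) | b(3);
}
```

Because the shifts operate on the value rather than on its layout in memory, serialize_u32_be(42) is 00 00 00 2a on both little endian and big endian hosts.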

Now what happens if we put in a struct? Let’s serialize

struct MyStruct {
    int64_t a;
    char b;
    int32_t c;
};

struct MyStruct value;
value.a = 15; // 15 is easy to find in hex dumps: it's 0f.
value.b = 15;
value.c = 15;

We remove the printing of the value, since we haven’t taught C++ how to print this struct, and get

$ c++ -std=c++17 serialize.cpp  && ./a.out
Serialized: 0f 00 00 00 00 00 00 00 0f ffffffc0 ffffffef 0a 0f 00 00 00

There’s the first 8 bytes starting with a 0f for the int64_t. There’s a single byte of 0f for the char. Then there’s some kind of noise before the last 0f 00 00 00 for our int32_t. (The ffffffc0 and ffffffef are single bytes, c0 and ef; they print with leading f’s because char is signed here and the cast to int sign-extends them.) This noise is padding the compiler introduced. Processors don’t treat all bytes in memory the same. If you ask for a 32 bit read from memory on a 32 bit processor, you generally need to do so at an address that is a multiple of four. So you can read from address 0, address 4, address 8, address 12, etc. You can’t read a 32 bit value from address 1 or address 13. On some processors, like those from Intel, “unaligned” reads are slower. On others, like SPARC, unaligned reads are impossible. Similarly, 64 bit reads happen on multiples of 8. Try changing the last field of the struct to int64_t and see what happens.
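You don’t have to read hex dumps to find the padding. A sketch using offsetof and sizeof asks the compiler directly; padding_after_b and total_padding are hypothetical helper names, and the struct is redeclared so the block stands alone.

```cpp
#include <cstddef>
#include <cstdint>

struct MyStruct {
  int64_t a;
  char b;
  int32_t c;
};

// Bytes of padding the compiler placed between b and c.
constexpr std::size_t padding_after_b() {
  return offsetof(MyStruct, c) - (offsetof(MyStruct, b) + sizeof(char));
}

// Total bytes in the struct that belong to no field at all.
constexpr std::size_t total_padding() {
  return sizeof(MyStruct) - (sizeof(int64_t) + sizeof(char) + sizeof(int32_t));
}
```

On the build that produced the dump above I would expect padding_after_b() to be 3, matching the three noise bytes, but the language does not promise that number, which is exactly the problem.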

But here’s the problem. You can tell your compiler to not insert that padding. And if you try to load the data on a different architecture or in a program compiled by a different compiler or with different compiler options you may find yourself with nonsense.

I promise we’ll talk about how to handle this right, but let’s finish going through the traps that await the unwary.

The last problem with this is that it assumes that the value has a fixed size in memory and that all the data that needs to be serialized is in that same hunk of memory. What happens if we try to serialize a std::vector<char*> this way?

char a = 15;
char b = 14;
char c = 13;
std::vector<char*> value{&a, &b, &c};

Running the program now produces

$ c++ -std=c++17 serialize.cpp  && ./a.out
Serialized: 40 06 40 ffffff9c ffffffc0 7f 00 00 58 06 40 ffffff9c ffffffc0 7f 00 00 58 06 40 ffffff9c ffffffc0 7f 00 00

That is the memory footprint of the vector itself. Our data doesn’t appear anywhere in it (it would leap out as the sequence 0f 0e 0d). Somewhere in there are the memory addresses of our data, but that isn’t going to be of much help.
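To get the data itself into the output, something has to walk the vector and follow each pointer. Here is a minimal sketch under my own assumptions: a made-up format of a 4 byte big endian count followed by the pointed-to bytes, a hypothetical function name, and no null pointers in the vector.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode as: 4 byte big endian element count, then one byte per element.
// Assumes every pointer in v is non-null.
std::string serialize_pointees(const std::vector<char*>& v) {
  std::string out;
  const uint32_t n = static_cast<uint32_t>(v.size());
  for (int shift = 24; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((n >> shift) & 0xff));
  }
  for (const char* p : v) {
    out.push_back(*p);  // follow the pointer: store the data, not the address
  }
  return out;
}
```

For the vector above this yields 00 00 00 03 0f 0e 0d: the count, then the three values, with no addresses anywhere.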

What you should do

After all that trouble, we can make a list of the qualities we want out of a serialization protocol:

  1. The output should be independent of the compiler, architecture, or other details of the system where it is serialized.
  2. It should handle variable length data like vectors.
  3. It should handle indirection like pointers and references.

It may not be obvious now, but there’s another criterion you will really want:

  1. When you add fields to structs and otherwise extend the schema of the data that you’re serializing, you should still be able to load old versions of the data.
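To make that last criterion concrete, here is a sketch of the usual trick: give every field a numeric tag and a length, so a reader can skip tags it doesn’t know and default the ones that are missing. The format, the tags, and the UserV2 type are all invented for illustration. This is not the wire format of any real library, though tagged formats like Protocol Buffers and Thrift rest on the same idea.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Each field is written as: 1 byte tag, 1 byte length, then the payload.
// Sketch only: payloads must be shorter than 256 bytes.
void put_field(std::string& out, uint8_t tag, const std::string& payload) {
  out.push_back(static_cast<char>(tag));
  out.push_back(static_cast<char>(payload.size()));
  out += payload;
}

// Parse into tag -> payload. Unknown tags are kept, not rejected.
std::map<uint8_t, std::string> parse_fields(const std::string& in) {
  std::map<uint8_t, std::string> fields;
  std::size_t pos = 0;
  while (pos + 2 <= in.size()) {
    const uint8_t tag = static_cast<uint8_t>(in[pos]);
    const std::size_t len = static_cast<uint8_t>(in[pos + 1]);
    pos += 2;
    if (pos + len > in.size()) break;  // truncated input: stop parsing
    fields[tag] = in.substr(pos, len);
    pos += len;
  }
  return fields;
}

// Version 2 of a hypothetical record: tag 1 is name, tag 2 (new) is email.
struct UserV2 {
  std::string name;
  std::string email;
};

UserV2 load_user(const std::string& in) {
  const auto fields = parse_fields(in);
  UserV2 u;
  auto it = fields.find(1);
  if (it != fields.end()) u.name = it->second;
  it = fields.find(2);
  if (it != fields.end()) u.email = it->second;  // absent in old data: stays empty
  return u;
}
```

Data written before the email field existed still loads, with email left empty, and data written by some future version with extra tags loads too, because load_user ignores tags it doesn’t recognize.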

If this seems like a lot of work, that’s because it is. And you should use a pre-existing system that someone else has built. Let’s look at some of the options. There are three major lineages.

The ancient lineage

The oldest serialization protocol that’s still something you might consider is ASN.1. It’s been around since the 1980’s and a lot of big systems use it. It has always been tugged by many users and many systems to be flexible, so it’s conceptually more sprawling than the other lineages that we will discuss, and it has accumulated all kinds of extensions over the years.

If you work in an organization that already uses ASN.1 heavily, keep using it.

The Google lineage

This lineage begins with Google’s internal serialization system called Protocol Buffers. If it were only internal to Google that wouldn’t be terribly interesting for us, but there is a pattern that happened repeatedly over the years:

  1. Facebook reaches a scale that Google reached some years earlier.
  2. Engineers are always changing jobs, so some from Google join Facebook.
  3. Those engineers clone a piece of Google infrastructure to deal with this scaling issue.
  4. Facebook open sources that piece of infrastructure.
  5. Google, in response, open sources theirs.

In this case, Facebook wrote Thrift, and later passed it on to the Apache Software Foundation. Twitter and many other players also use it internally. Thrift also defines a remote procedure call system (RPC), used to make requests among network services. Protocol Buffers actually also has a remote procedure call system, but it was harder to extract sensibly from Google’s environment, so it got released later as gRPC.

The next step in this evolution is a system called Cap’n Proto, written by one of the architects of Protocol Buffers at Google. Cap’n Proto says, instead of writing to a separate format for serialization, let’s be very clever and make our actual types in the programming language serializable in the naive memcpy way this whole article started with. That saves all the computation of generating that serialization. It also includes an RPC system.

Unlike ASN.1, which was designed for a world where it would have to adapt at each step to integrate widely different systems, the Google lineage assumes that the whole world they operate in uses only their serialization and RPC system. All of Facebook is Thrift. All of Twitter is Thrift. All of Google is Protocol Buffers. When you can assume that, you don’t want ASN.1’s flexibility. You want the strictest system you can get.

At the opposite end of the spectrum from that we have our third lineage.

The JSON lineage

JSON is a subset of JavaScript that can be used to define data with no code attached. JavaScript and the web made JSON a de facto standard. There are even real standards for it now. If you are shipping data to a browser these days, you are probably doing it in JSON unless you inherited an XML based system from the 1990’s or early 2000’s.

If you are using a dynamically typed language like Python or JavaScript, then JSON provides a very low friction serialization system. You serialize whatever you have, read it in on the far end, and enforce by some other method that the data you received matches what you expected (though there are now schema systems for JSON as well).

JSON is text, though. A 64 bit integer, which fits in 8 bytes on the computer or in most serialization protocols, is represented in decimal notation as characters and may take up to 20 bytes (19 digits plus a sign). It is wasteful of network bandwidth, wasteful of memory, and computationally expensive to parse compared to the other formats we have discussed. So people have tried to fix that while still retaining the flexibility and simplicity of JSON.

The first big contender for this was BSON, used as the native format for MongoDB. BSON is a naive translation of JSON into a binary protocol. It had two major shortcomings:

  1. When you write a map or an array, you must specify its length at the beginning. This makes it impossible to incrementally build a BSON document. You have to completely calculate what is going in it before you can construct it.
  2. Its types aren’t extensible, it didn’t add to JSON’s very limited type palette, and its definition is controlled by MongoDB.

MessagePack has a richer type palette than BSON, and added the notion of extension types to fix problem 2, but still has problem 1. All its arrays and maps are length prefixed. This issue was fixed in CBOR, which appears to be the mature evolution of this lineage.
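The difference shows up in a few bytes. This sketch hand-encodes a CBOR array both ways, restricted (my simplification) to arrays of fewer than 24 items where each item is an integer from 0 to 23, since those fit in a single byte each. The function and class names are hypothetical; a real program should use a CBOR library.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Definite-length array: the header byte 0x80 | count comes first,
// so the count must be known before anything is written.
// Sketch only: fewer than 24 items, each in the range 0..23.
std::string cbor_definite_array(const std::vector<uint8_t>& items) {
  std::string out;
  out.push_back(static_cast<char>(0x80 | items.size()));
  for (uint8_t v : items) out.push_back(static_cast<char>(v));
  return out;
}

// Indefinite-length array: start with 0x9f, append items as they
// arrive, and close with the "break" byte 0xff.
class CborStreamingArray {
 public:
  CborStreamingArray() { out_.push_back(static_cast<char>(0x9f)); }
  void add(uint8_t v) { out_.push_back(static_cast<char>(v)); }  // v in 0..23
  std::string finish() {
    out_.push_back(static_cast<char>(0xff));
    return out_;
  }

 private:
  std::string out_;
};
```

For the items 1, 2, 3 the first form gives 83 01 02 03 and the second gives 9f 01 02 03 ff. The second costs one extra byte, but nothing in it had to be known in advance.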

What about XML?

There is kind of a fourth lineage, descending through SGML to XML. Both systems were originally understood to be a way of building representations of documents, often extremely large ones, rather than serialization protocols. That didn’t stop XML becoming the preferred serialization protocol in the 1990’s and early 2000’s.

There is still a place for XML. If you are dealing with large, structured data that you expect both computers and humans to interact with, XML is probably your best bet. The tooling is mature, anyone with a text editor can work with it, it’s easier for humans to read and write than JSON, and it’s been through a full hype cycle so there’s plenty of experience available around how to use it well, including the vast range of cases of when not to use it.

What should you use?

The important thing is to pick one. Here’s a list of questions that should help guide you.

  1. Are you communicating with a web browser? You’re using JSON for that. End of story.
  2. Does your organization or project already use one for everything, such as Thrift at Facebook, Protocol Buffers at Google, or ASN.1 at NCBI? If so, keep using it.
  3. Is it plausible that everything in your organization will use the same library and speak the same protocol, such as when you’re starting a new tech company? Pick one of the Google lineage.
  4. Do you need to do lots of documenting of interfaces and carefully plumbing data to legacy systems? Try ASN.1.
  5. Are you working mostly in dynamically typed languages? Consider CBOR.
  6. Are you working with mostly statically typed languages? Consider one of the Google lineage.

The Google lineage has three good options. If you or a coworker or friend have expertise with one of them already, use that. Otherwise eliminate any of the three that don’t support the languages you’re working in, and choose randomly among the remaining options.