
Data persistence

This is part of the series on getting better as a programmer. This installment covers data persistence:

Maintaining and updating state over time so as to avoid corruption and data loss, to be able to access it efficiently, and to comply with controls and regulatory requirements around handling it, such as GDPR or HIPAA.

And so…

The vast majority of programming involves maintaining some kind of state over time. Word processors must save documents for later use. Scientific simulations must save their results for analysis. Web services typically have a database of some kind. Even embedded systems at least have a few counters tracking how many times a cycle has run or the like.

Once we are persisting data through time, we have four primary concerns:

Availability
Will the state still be reachable later? Will it still be reachable if your hard drive melts down, a hurricane rolls in, malicious actors try to delete it, and the power spikes? If someone updates it, when does the update become available to you?
Corruption
Is the state that you access correct? Are those numbers still the ones that were put in? Did half of an update get applied to them and the other half was lost?
Authorization
Is the state only accessible to those who should have access to it? How do you define who should have access and how do you enforce it?
Compliance
A lot of persisted state has legal requirements on how and where it is stored, what the authorization policy should be, and even on when and how parts of it must be deleted.

We also need to talk about how we choose what format we persist data in, and what problems that can cause or avert. We’ll start with this and return to the issues above.

Serialization formats

At any given point in time, and in any given programmer subculture, there are usually a few default formats. Among web developers in 2023, it’s JSON, unless you’re inside Google where it’s Protobuf or Facebook where it’s Thrift. Among web developers in the late 1990s it was XML.

Having a default format is better than the alternative. Bioinformatics around 2008 is an instructive case study. The field was characterized by each algorithm being provided as a separate Unix command line tool, each with its own file formats for input and output. In that setting, the majority of programmer effort was spent converting among file formats.

A few file formats emerged for parts of the problem, such as SAM and then BAM (a binary version of SAM) when it turned out that storing large quantities of DNA sequence data in ASCII text was inefficient, but they only represented the exact data that their defining tool, samtools, worked on. Different information couldn’t be shoehorned in. The format didn’t allow for flexible enough schemas (the definition of the structure of the data), nor did any other format in common use in that community at that time, so everyone invented their own.

The pairing of SAM and BAM shows a common pattern in serialization formats. Both dealt with data that had the same structure, but represented it differently in bytes. A plain text format is easy to look at with a text editor, which remains the most ubiquitous tool for programming. A binary format is typically more compact and often much faster to read and write. Having multiple formats with matching semantics like this is common. Protobuf has both a binary and a JSON representation, as does Thrift. JSON did not, which has led to several binary partners emerging, notably BSON, and then CBOR to fix BSON’s limitations.

Coming back to schemas, a format can persist the data in a way where you must know the structure in order to read it (static schema), or it can write the structure along with the data so a program can read the schema and data together at runtime (dynamic schema). JSON is an example of a dynamic schema. As you read a JSON document, characters like [, {, and " tell you what the structure is that you are decoding. A sequence of four integers like [1, 2, 3, 4] carries all the information about its structure with it. I could also define a static schema where I write each number as a four-byte, big-endian integer, one after another in memory. When I read those sixteen bytes, I need to know in advance that they are four 32-bit integers as opposed to two 64-bit integers or a sixteen-character ASCII string.
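
To make the difference concrete, here is a small Python sketch of both approaches for those four integers. The big-endian layout is just the hypothetical static schema described above, not any standard format.

    import json
    import struct

    # Dynamic schema: the text itself says this is an array of four numbers.
    values = json.loads("[1, 2, 3, 4]")

    # Static schema: sixteen raw bytes mean nothing until we impose a structure.
    # ">IIII" says big-endian, four unsigned 32-bit integers.
    raw = struct.pack(">IIII", 1, 2, 3, 4)
    assert list(struct.unpack(">IIII", raw)) == values == [1, 2, 3, 4]

    # The same sixteen bytes read against a different schema give a different
    # answer: two unsigned 64-bit integers.
    print(struct.unpack(">QQ", raw))  # (4294967298, 12884901892)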

Static and dynamic schemas are orthogonal to being a text format or a binary format. I could encode each of those four numbers as ten characters (the width needed to represent any unsigned, 32-bit integer in decimal), one after another: 1 2 3 4. It is still a static schema: the reader has to know in advance that the text is four ten-character decimal fields.

In practice there is usually a mix. A program will expect certain fields in a JSON document, or read the contents of a string in a static schema to figure out what to do. And, particularly for dynamic formats, you will often have to check if the data you read is valid for your purpose (which goes under the name of “schema validation”). Schema validation ranges from hand-writing imperative code to check the data, to XML schemas, to systems like CUE that provide a logic programming language for validation.
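
At the hand-written end of that range, validation is just imperative checks over the decoded data. A minimal Python sketch, with field names invented for illustration:

    import json

    def validate_measurement(doc):
        """Check that a decoded JSON document has the shape we expect."""
        if not isinstance(doc, dict):
            raise ValueError("expected an object")
        if not isinstance(doc.get("sample"), str):
            raise ValueError("'sample' must be a string")
        if not isinstance(doc.get("readings"), list):
            raise ValueError("'readings' must be an array")
        for reading in doc["readings"]:
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(reading, bool) or not isinstance(reading, (int, float)):
                raise ValueError("every reading must be a number")
        return doc

    validate_measurement(json.loads('{"sample": "A1", "readings": [0.3, 0.5]}'))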

What types of structure should a format be able to represent? There’s some consensus on what we actually need today. JSON represents the bare minimum: records or dictionaries that associate keys with values, sequences or arrays of values, and a variety of atomic types like numbers and strings.

For both records and sequences, when speed becomes important, you need two variants: one where you put the length at the beginning, and one where you read until you get a termination character. The former is necessary to be able to preallocate memory when reading or writing. The latter is necessary when you are writing a stream whose length you don’t know in advance.
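
Here is a rough Python sketch of the two variants for a sequence of byte strings, using an invented layout:

    import io
    import struct

    def write_length_prefixed(items, out):
        """Write a count up front, then each item prefixed by its own length,
        so a reader can preallocate before it starts reading."""
        out.write(struct.pack(">I", len(items)))
        for item in items:
            out.write(struct.pack(">I", len(item)))
            out.write(item)

    def write_terminated(items, out):
        """Write items until a termination marker, so the writer never needs to
        know the count in advance. Assumes no item contains a newline."""
        for item in items:
            out.write(item + b"\n")
        out.write(b"\n")  # a blank line terminates the stream

    buf = io.BytesIO()
    write_length_prefixed([b"spam", b"eggs"], buf)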

For a static schema, or when we write schema validation, we need a bit more than just records, sequences, and atomic types. We need some way of implementing tagged unions or algebraic data types, such as a way of saying that the color we are giving is in RGB or HSV. In JSON we might write {"type": "RGB", "red": 0.3, "green": 0.5, "blue": 0} and {"type": "HSV", "hue": 0.4, "saturation": 0.1, "value": 0.8}. In a dynamic schema we can pick out the type field and react accordingly, but in a static schema we need some way of declaring the alternatives.
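
With a dynamic schema, handling that tagged union comes down to reading the tag and branching. A minimal Python sketch of the dispatch:

    import json

    def parse_color(text):
        """Dispatch on the tag to decide which alternative we were given."""
        doc = json.loads(text)
        if doc["type"] == "RGB":
            return ("rgb", doc["red"], doc["green"], doc["blue"])
        if doc["type"] == "HSV":
            return ("hsv", doc["hue"], doc["saturation"], doc["value"])
        raise ValueError("unknown color type: " + doc["type"])

    parse_color('{"type": "RGB", "red": 0.3, "green": 0.5, "blue": 0}')

A static schema has to declare the alternatives up front instead, the way Protobuf does with its oneof fields.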

Finally, there are some issues around formats that are meant for humans to write or read or which act as templates. The problems with YAML beautifully illustrate both.

YAML has many ways of representing a string. They each do different things with newlines, white space, and special characters. They were all added to make some case more convenient for someone to write, but they add up to a format that is both difficult to write a complete parser for and difficult to write by hand without accidentally triggering some special case. Writing a multiline string containing special characters in JSON is less convenient, but very easy to reason about.

YAML is also often used as a template that gets modified by injecting some data into it. It has no built-in support for templating, so any system that uses it this way introduces some combination of new directives that are not valid YAML and names within the YAML that are given special meaning. Over time each such system adds more and more complexity to try to make certain things possible. This isn’t particular to YAML. Many other formats used as templating languages have gone down the same path. The only winning move is not to play: start with a full programming language.

To summarize, a serialization format can be text or binary, and its schema can be static or dynamic; the two choices are independent. Whatever you choose, the format needs records, sequences (in both length-prefixed and terminated variants), atomic types, and some way of expressing tagged unions, plus a story for schema validation. And be wary of formats meant to be written by humans or used as templates: they accumulate special cases.

Availability

Now we return to the issues of availability, corruption, authorization, and compliance, starting with availability. We are necessarily using availability in a broader sense than it is used among database specialists, but the relevant English words have all largely been given technical meanings, so I will claim a mathematician’s privilege to abuse the language and mean multiple things by the same word.1

To set the stage, consider our data going through a sequence of states over time as people or machines update it. At some point in time, can I get the latest state?

What could prevent me from getting it? There are four possibilities:

  1. The physical device that we’re storing the state on could be destroyed.
  2. I may not be able to reach the device to read the state.
  3. If there are multiple copies of the data that are synchronized in some way, my copy may not have been updated to the latest state.
  4. Someone, whether inadvertently or maliciously, has deleted the state.

We protect against (1) and (2) very simply: we have multiple copies. If one copy is destroyed, we use a different copy. Did you spill tea on your laptop and fry it? Reach for your backup disk and hope you backed up recently. Worried about an earthquake destroying the building where your server rack lives? Put additional servers somewhere else, far away. Want to have a file shared with people but still work with it when you’re offline? Use something like OneDrive that copies files locally and synchronizes them back to a central server.

The “far away” is key. I had a very interesting week at Facebook one time because a hurricane was predicted to make landfall over the data center where the primary instances of Facebook’s core ads database ran. Normally you would just promote instances in another data center to be primary, but due to ancient problems in how the ads system was written (the rest of Facebook had mitigations for them, but using those would have required largely rewriting the ads code), you couldn’t promote another instance to primary if it was geographically too far away. In fact, the only other data center known to work as primary was also in the predicted path of the hurricane.

Not having the latest update due to unsynchronized copies only comes into play when you have multiple copies, but, as we said, multiple copies are how you protect against destruction of or being unable to reach the device the state is on. This kind of contradiction is ubiquitous in questions of data persistence. Wisdom in this area is largely around being able to make tradeoffs that are acceptable for a particular case.

On the other hand, we protect against inadvertent or malicious deletion by keeping the previous states so we can return to them. These are necessarily copies that are not synchronized. Think of undo/redo in a word processor or the previous versions of a file tracked by OneDrive or Dropbox. For malicious deletion we also want to worry about someone specifically trying to delete the previous states, so this turns into an authorization question, too.

So, we want multiple copies of our data, geographically far apart. We have to balance keeping them synchronized with keeping them available when we don’t have a connection between copies, whether because some network link is down or because our backup disk is at home and we’re not. And we want previous states of our data in a durable way that a malicious actor can’t easily destroy. And we want to do it in a way where storage doesn’t cost too much and we don’t have to do a lot of toil keeping copies in sync.

Corruption

Consider our sequence of states again. We may get an unintended new state that is wrong in some way. Perhaps our word processor corrupted our file when writing it to disk. Perhaps the disk itself is failing and bits on it are flipping randomly. Perhaps I updated a copy of the data, someone else updated another copy, and syncing them has combined those changes in a bad way. All such unintended state changes we call corruption.

There are really only two strategies for handling corruption.

First, if we have multiple copies of the data and a way to detect if a given copy is corrupted, we replace the corrupt copy with a good copy. Detecting if a copy is corrupted is generally done by storing some small redundancy alongside the data, such as a parity bit or a checksum, and recomputing and comparing it when the data is read.

These mechanisms are often built into various layers of our system. RAID arrays often keep a parity bit, as do some kinds of RAM and filesystems. Other filesystems, like ZFS, keep a checksum of each block of data. ZFS also handles multiple copies of data across multiple disks, so it can both detect a corrupt copy and replace it with a good one.
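
As a sketch of the detect-and-replace strategy, here the detection is a CRC-32 checksum stored alongside each block of data; on read, a mismatch means this copy is corrupt and we should go get another one. Real filesystems do this per block and below the application, but the shape is the same.

    import zlib

    def store_block(data: bytes):
        """Keep a CRC-32 checksum alongside the data we write."""
        return zlib.crc32(data), data

    def read_block(checksum: int, data: bytes) -> bytes:
        """Recompute the checksum on read; a mismatch means this copy is
        corrupt and we should fall back to a replica or a backup."""
        if zlib.crc32(data) != checksum:
            raise IOError("block is corrupt; fetch another copy")
        return data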

The other approach is to encode the data with extra bits in such a way that if a bit or two changes, we can still calculate what the original was. The first of these was Hamming’s (7,4) code, which uses 7 bits on disk for each 4 bits to be written. The 7 bits have 128 possible states. The 4 bits have 16. The trick is to map each of the 128 seven-bit states to one of the 16 four-bit states in such a way that flipping any single bit of an encoded value still decodes to the same four-bit state. These encodings are called error correcting codes.
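
Here is a Python sketch of the (7,4) code using the standard bit layout, purely to make the mapping concrete; real storage stacks implement this in hardware or with far more sophisticated codes.

    def hamming74_encode(nibble):
        """Encode a 4-bit value (0-15) as a 7-bit codeword."""
        d = [(nibble >> i) & 1 for i in range(4)]  # data bits d1..d4
        p1 = d[0] ^ d[1] ^ d[3]  # each parity bit covers an overlapping
        p2 = d[0] ^ d[2] ^ d[3]  # subset of the data bits
        p3 = d[1] ^ d[2] ^ d[3]
        bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]  # positions 1..7
        return sum(b << i for i, b in enumerate(bits))

    def hamming74_decode(codeword):
        """Correct up to one flipped bit and return the original 4-bit value."""
        bits = [(codeword >> i) & 1 for i in range(7)]
        # Recomputing the parities names the position of the flipped bit.
        c1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
        c2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
        c3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
        error_pos = c1 + 2 * c2 + 4 * c3  # 0 means no error
        if error_pos:
            bits[error_pos - 1] ^= 1
        return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

    codeword = hamming74_encode(0b1011)
    damaged = codeword ^ 0b0000100     # flip one bit in the stored form
    assert hamming74_decode(damaged) == 0b1011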

The most common use of error correcting codes is in ECC RAM. It is more expensive, because you need more physical bits for each byte that appears to the user, and slightly slower, but it also doesn’t randomly change bits once in a while.

Storing data on CDs and DVDs also uses error correcting codes. Audio CDs were designed so that individual bits could be lost, which made them much cheaper to manufacture. For audio, the occasional missing bit could be interpolated from what came around it. To represent data where we can’t interpolate it that way, the data is encoded in an error correcting code before writing it to the disk.

Large, archival storage systems also use error correcting codes. Rather than store a complete copy of the data everywhere, different parts of the encoded form are stored in various places. If one is lost, the data can still be calculated from the others, and the pieces are smaller than the full data.

In real systems, we address corruption using a notion from information security called “defense in depth.” We don’t rely on any single thing to stop it. We layer the protections. We use ECC RAM and a file system that keeps checksums of data, and we write data validation checks in our software.

Authorization

As soon as we start dealing with data, we have to deal with secrets. Some of these are secrets for reasons internal to our system, like passwords and encryption keys. Others are secrets because of how our system is used, such as personally identifying information on our users, medical records, trade secrets, or, for governments, classified data.

When someone tries to read data we hold, we have to first decide who we think they are, and second decide if that person is supposed to have access to the data requested. These are called, respectively, authentication and authorization. This pair of names is really unfortunate because they both get shortened to “auth.” If you’re reading the security literature you need to keep them straight, but in colloquial use I recommend “login” and “permissions” instead.

Sometimes authorization is outside our software. If data is stored only on a disk that is kept in a vault, then being able to get into the vault is an authorization scheme in its own right. Authorization via physically securing and isolating things is an enormous field of its own, leading to things like TEMPEST protection and sensitive compartmented information facilities.

Within our software, we handle login by checking for at least one of something that you know, something that you have, and something that you are. For example, something that you know could be a password, something that you have could be your phone to receive a text message with a code, and something that you are could be an iris or face scan. Something that you are tends to only work for local login. If you are transmitting login information to another computer, there is no reason to think you can trust that what you received actually came from a biometric scanner that hasn’t been tampered with. So your local computer may log you in by seeing your face (though you need special cameras and software so that holding up a picture of your face doesn’t work just as well), but to log into a website or other remote system we can only rely on things you know and things you have.

It’s best to think of login as a flow of steps. That flow may be a single step like providing a username and password to a server, which decides yes or no. It can also be quite long, such as the flow for OAuth with two factor authentication. Nor is it defined solely by the server. If I keep my password in a password manager, then my login flows all begin with having my encrypted password file and knowing the password for it. Those two things then provide the username and password for a website.
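
A minimal Python sketch of that single-step flow: the server stores a random salt and a salted hash of each password, never the password itself, and recomputes the hash at login. This uses hashlib.scrypt from the standard library to stay self-contained; real systems typically reach for a dedicated password-hashing library such as bcrypt or argon2.

    import hashlib
    import hmac
    import os

    def make_password_record(password: str):
        """Store a random salt and a salted, slow hash of the password."""
        salt = os.urandom(16)
        digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
        return salt, digest

    def check_login(password: str, salt: bytes, digest: bytes) -> bool:
        """Recompute the hash from the offered password; compare in constant time."""
        candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
        return hmac.compare_digest(candidate, digest)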

Once we have logged someone in, we have to decide if they’re supposed to have access, and what kind of access. This has four aspects.

First, what can they do on what bit of state? Only administrators should have access to password records, users shouldn’t be able to read the email of other users (at least by default), and on classified systems someone with a secret clearance shouldn’t be able to read top secret data. And if they do have access, can they only read the data, or can they change it, too? This comes down to a list of all the various bits of state and a list of who can read and write each bit.

Making that list correctly and maintaining it quickly become unmanageable, so how do we maintain permissions in bulk? The answer is always, on the people side, some variation on defining roles that we grant permission to on data and marking people as having one or more roles, and, on the data side, having groups of data such as directories or tags that can all have their permissions set together.
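
A toy Python sketch of the roles-plus-groups idea, with the role names, data groups, and users all invented for illustration:

    # Roles are granted (action, data group) pairs; users are marked with roles.
    ROLE_GRANTS = {
        "admin":   {("read", "password-records"), ("write", "password-records")},
        "support": {("read", "tickets"), ("write", "tickets")},
        "user":    {("read", "own-mail"), ("write", "own-mail")},
    }

    USER_ROLES = {
        "alice": {"admin", "user"},
        "bob": {"user"},
    }

    def is_authorized(user: str, action: str, group: str) -> bool:
        """True if any of the user's roles grants this action on this group of data."""
        return any(
            (action, group) in ROLE_GRANTS.get(role, set())
            for role in USER_ROLES.get(user, set())
        )

    assert is_authorized("alice", "write", "password-records")
    assert not is_authorized("bob", "read", "password-records")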

The permissions we can set may have constraints on them. For example, in an Orange Book system, users with only a secret clearance cannot be authorized to access data marked top secret. How do we maintain constraints on authorization?

Finally, given the permissions system in place, how do we prevent leaks? At a basic level this means properly checking permissions before providing requested data. This can get surprisingly hard and subtle (see Spectre and Heartbleed). On the management side, it also means keeping permissions sufficiently fine grained that you don’t inadvertently grant improper access. The classic example of this was programs on older Unix systems that were specially flagged to run as root so they could do a particular thing, but could then be diverted to use those permissions for another purpose entirely.

Compliance

Most data of interest has some kinds of rules around it. Some of the rules are laws. For example, healthcare data in the US is subject to HIPAA, and basically any information about people in Europe is subject to GDPR. Other rules are policies of your organization, some of which are required by contracts signed with other organizations. These all impose constraints on how you handle data.

For example, GDPR requires you to delete data about an individual on their request. That means all data, including in your backups or logs. Conversely, court cases can require organizations not to delete specific data.

If you operate in multiple jurisdictions, the rules may be different locally. Washington state in the USA has laws about disclosing financing for election-related advertisements that other states don’t have. The rules may even be contradictory. What happens if a court in the US demands that data be retained for a case while the subject of the data, in Europe, demands its deletion under the GDPR?

Some data is tied up with international politics. What do you label the island of Taiwan on a map? Depending on where in East Asia you are, people and governments will have strong opinions on the answer. What happens if someone posts Nazi propaganda on your site near the time of an election? In the US, very little. In Germany you may face massive fines.

Some data is illegal to possess. Child pornography is the internationally accepted example. If someone uploads it to your servers, what do you do? There’s a whole slew of law and international agreements that outline, though perhaps not fully define, your liability and what you are supposed to or allowed to do in such cases (though for the case of child pornography, locking things down and calling the cops is generally a good first answer).

Even beyond rules, what data do you accept and how can you use it without compromising your goals? If you provide online games that children play, you must not provide unrestricted chat. If you provide an online spreadsheet and don’t keep users’ data private from each other, you will quickly go out of business.

This all falls under the heading of compliance. What data are you allowed to keep, what do you have to do to be allowed to keep it, and what do you do when conditions change? This is not a technical question. It’s a legal question, and a question of often conflicting details.

Compliance is thankless. Any policy you arrive at will displease someone and quite likely break someone’s rules. Many of the people with opinions on how you should handle data will be hopelessly naive, such as free speech absolutists in the USA, and the more naive they are the louder and more officious they are likely to be.

How to ascend

Data persistence is huge, and it is the first area we’ve looked at that quickly involves issues outside the computer and the people directly using it.

When you start, make a first pass through all of this. Learn just enough to serialize data in a useful way, even if it’s just writing JSON to files. Learn the basics of relational databases and how to make and maintain backups. As much as possible, pick tools, like disks and filesystems, that will take care of corruption problems for you. Learn the basic login and permissions mechanisms your operating system provides. And learn the rough sketches of any laws governing your domain where you live. Stay away from distributed databases at this point. And, most importantly, learn who to ask when you need more.

After that, everyone will need parts of all of these areas, but how deep they go in each depends on their needs and their path. What does ascending look like in each area?

This series is still being written.


  1. Humpty Dumpty, 1872