Mechanical sympathy
This is part of a series on getting better as a programmer. The articles are:
- Getting better at programming
- Writing programs
- Quality assurance & operations
- Debugging & exploration
- Mechanical sympathy
- Data persistence
- Isolation, coordination, & communication
- Human communication & behavior
- Development process & architecture
- Self management
- Context
- Domain specific knowledge
Mechanical sympathy: you don’t have to be a hardware engineer, but your best programming will happen if you know enough about the system to work in harmony with it. The concept comes from race car driver Jackie Stewart, imported into software by Martin Thompson.
Most students are taught a mental model of a computer that was more or less accurate in the 1970’s: a CPU with some registers, RAM that can be read into registers or written to from registers, and disks that we read bytes from and write bytes to. At each clock cycle, the system runs one instruction that does something with the contents of registers by gating a particular circuit. With that model we try to get beginners not to write wasteful algorithms when a more efficient one is straightforward. But a typical server or personal computer or smartphone today doesn’t work like that.
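To make that textbook model concrete, here is a toy sketch of it in Python: a handful of registers, a flat RAM, and one instruction run per "cycle". The instruction set is invented purely for illustration and doesn't correspond to any real processor.

```python
# A toy version of the classic mental model: a few registers, flat RAM,
# and one instruction executed per "cycle". Purely illustrative.

ram = [0] * 256                       # flat memory, addressable by index
registers = {"r0": 0, "r1": 0, "r2": 0}

# Each instruction is (opcode, *operands).
program = [
    ("LOAD",  "r0", 10),              # r0 <- ram[10]
    ("LOAD",  "r1", 11),              # r1 <- ram[11]
    ("ADD",   "r2", "r0", "r1"),      # r2 <- r0 + r1
    ("STORE", "r2", 12),              # ram[12] <- r2
    ("HALT",),
]

ram[10], ram[11] = 3, 4

pc = 0                                # program counter
while True:                           # one instruction per loop iteration ("cycle")
    op, *args = program[pc]
    if op == "LOAD":
        registers[args[0]] = ram[args[1]]
    elif op == "STORE":
        ram[args[1]] = registers[args[0]]
    elif op == "ADD":
        registers[args[0]] = registers[args[1]] + registers[args[2]]
    elif op == "HALT":
        break
    pc += 1

print(ram[12])                        # 7
```

Real machines stopped looking like this a long time ago.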
A short, very incomplete list of the differences:
- Multicore: CPUs are divided into parallel cores, and each core has its own registers and circuitry to work independently from the rest. One clock cycle represents an instruction run on each core.
- Pipelining: The CPU cores don’t run one instruction at a time. Each instruction’s work is divided into phases, and they are run in a pipeline, so during a particular cycle one instruction completes phase one, while the one ahead of it in the pipeline completes phase two, etc.
- Memory hierarchy: Registers don’t read from and write to RAM directly. RAM is so much slower than the CPU that if you read a value from RAM and waited for it, you would waste hundreds of CPU cycles. CPUs have small caches that are fast enough to keep up with the CPU and load chunks of memory into them as needed. Those caches have a sequence of successively larger, slower caches between them and RAM.
- Branch prediction: If you run an if statement, what is the next instruction that goes in the CPU’s pipeline? Rather than wait for that, the CPU keeps a cache of which way that if statement went in the past and guesses. If it guesses right, the pipeline stays full. If it guesses wrong, it undoes its speculative work and runs the right branch instead.
- Non-uniform memory access (NUMA): Reading from different locations in RAM may take different amounts of time from different CPU cores. The memory isn’t all in one place.
- Microcoding: The raw instructions sent to your CPU may not correspond to a particular circuit on the chip. The CPU translates some instructions into programs that it has stored internally and runs that program instead.
- Hyperthreading: If your CPU core is waiting for data to make its way down from RAM or from some other slow, outside source, it could be sitting idle, so some CPUs halfway combine two cores so they share pipelines. That way if one sort-of-core stalls, the other one can keep the pipelines full.
- Reordering and barriers: CPU cores will try to reorder instructions to keep their pipelines full, so the order you send instructions may not be the order they run. This can cause problems, so there’s a whole system of barriers around certain instructions and memory accesses that can prevent reordering.
- Virtual memory mapping: Fetching from a location in RAM doesn’t go to that location in RAM. It goes through CPU circuitry that keeps a map from the addresses instructions use to actual locations in RAM. This map is controlled by the operating system.
- Block IO: We don’t read and write individual bytes or values to disks. We read and write blocks of data, usually at least 4 KB. Depending on the addresses of a collection of blocks, the order we write them, and the disks themselves, we may get faster or slower writes.
- Disks have onboard cache: When we write to disk, the data may not actually reach the persistent medium right away. Disks have their own caches and their own rules about when those caches actually get flushed.
- Vector processing (SIMD): A register may not hold one value; it may hold a whole vector of numbers, and one instruction may do the same operation on all of them.
These are not exotic. Your laptop, desktop, and smartphone all have these, and have for decades. Mechanical sympathy can mean a difference of hundreds or thousands of times in performance from fairly small differences in how a programmer lays out their program.
In this vein, we don’t want to stop with hardware. The software and abstractions beneath us matter, too. If you’re writing JavaScript to run on V8 or SpiderMonkey, there’s a lot to know about how to set up your program so their just-in-time compilers are most likely to give you high performance. If you’re writing Java, it used to be that appending a bunch of strings needed to be done with a StringBuffer, not by appending the strings themselves, because Java strings were immutable and each append allocated a new object and copied all the data involved. The compiler generally optimizes this under the hood now, but I still see young programmers carefully setting up their StringBuffers in interviews without knowing why they’re doing so. The author of Cap’n Proto realized that he could find a way of laying out data in memory that all major processors today could understand, then use that as a file format, so there was no parsing or serializing, just loading the data into memory and using it as is. Your filesystem, when you write to disk, may write back to the same place or may write a new copy with your changes made, and knowing which, and how that affects your disk, can dramatically change how fast you can write and how long your disks last.
Or at a more basic level, if you’re going to do a bunch of numeric computations in Python, knowing that Python usually stores its numbers as full objects with all the overhead that entails, and that its loops are slow because the interpreter can’t know what the next iteration will do until it inspects the objects involved, lets you go look for something like numpy, which takes your lists of numbers and escapes from Python entirely to do the calculations for you.
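As a rough illustration (assuming numpy is installed; absolute timings depend on your machine, but the gap is typically one to two orders of magnitude):

```python
import time
import numpy as np

values = list(range(10_000_000))                 # boxed Python int objects
array = np.arange(10_000_000, dtype=np.int64)    # one contiguous block of machine ints

start = time.perf_counter()
total = 0
for v in values:                                 # interpreted loop, one object at a time
    total += v
print("python loop:", time.perf_counter() - start)

start = time.perf_counter()
total = array.sum()                              # one call, tight compiled loop over raw memory
print("numpy sum:  ", time.perf_counter() - start)
```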
Or even more basic: Python, JavaScript, and Ruby can each use only one CPU core. If you want to use more, you have to run parallel copies of their runtime, each of which uses memory for copies of the same code. As the number of cores climbs to dozens, this becomes a burden for memory-intensive applications. In a system without that limit, like the Java virtual machine, you only need one copy of the code.
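What "parallel copies of the runtime" looks like in practice, sketched with Python’s standard multiprocessing module (the work function and process count here are just placeholders):

```python
from multiprocessing import Pool

def busy(n: int) -> int:
    # CPU-bound work; threads in CPython would serialize on the interpreter lock,
    # so instead we pay for whole extra interpreter processes.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # four copies of the runtime
        results = pool.map(busy, [5_000_000] * 4)
    print(sum(results))
```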
So, mechanical sympathy for us means knowing roughly how the machines work, how to figure out what they’re likely doing in a given case, and how to take advantage of that.
How they work
This is a progression. You don’t start out naively thinking about your for loops, go into a library for five years, and emerge designing microprocessors. Most programmers I know, me included, have no idea how to actually go about designing a modern microprocessor, even if we do know some electronics. You will also learn specifics of particular machines and software stacks that are relevant to you, and remain ignorant of others. That’s normal and fine.
Before diving into the complicated stuff, you need to understand the idealized model of a 1970’s computer I mentioned above and how programs fit into that. It’s the simplest cognitive model we have that provides any mechanical sympathy, and most programmers first map their ideas onto it, then expand parts of it into the modern details as needed. This is one of the sets of linked models I described in debugging & exploration.
That basic model is necessary so you have a substrate for learning how programs are represented in the machine. At the beginning this just means some basic knowledge of how data in a program becomes bits in memory, and issues like big endian versus little endian, how strings are encoded (yes, you need to learn how Unicode works), and the different ways of representing numbers. Later you’ll come back to this and learn how the language you’re using represents structures or objects in memory and how programs saved on disk get mapped into memory to be run, but that can be put off.
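Python’s standard library is enough to poke at a few of these representation issues directly; a small sketch:

```python
import struct

# The same 32-bit integer, laid out big endian versus little endian.
n = 0x12345678
print(struct.pack(">I", n).hex())   # '12345678'  (most significant byte first)
print(struct.pack("<I", n).hex())   # '78563412'  (least significant byte first)

# The same string under two Unicode encodings: different bytes, same text.
s = "héllo"
print(s.encode("utf-8"))            # b'h\xc3\xa9llo'  (é takes two bytes)
print(s.encode("utf-16-le"))        # b'h\x00\xe9\x00l\x00l\x00o\x00'

# Two ways of representing the number 1 with different tradeoffs.
print(struct.pack("<i", 1).hex())   # 32-bit integer: '01000000'
print(struct.pack("<d", 1).hex())   # 64-bit float:   '000000000000f03f'
```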
Once you have this basic understanding, you start digging into that list of topics like pipelining, virtual memory, and memory hierarchy. Depending on the field you’re working in you may also learn about other hardware such as network cards or GPUs, which have their own special issues. Interleaved with learning how various aspects of the hardware work, you will start learning about the software you depend on.
For all of this, try to find out why something ended up the way it did. It makes it easier to remember, often gives you a set of ready-made examples for thinking about how to use what you’re learning, and usually leads you into some other piece of mechanical sympathy. For example, pipelining in processors came from the observation that while running an instruction, most of the circuitry is only used for a small fraction of the time, in different phases of the instruction. The effort to reorganize the instruction sets implemented by a processor so they could be pipelined led to Reduced Instruction Set Computers (RISC), and leads you naturally into the idea of assembly language and what instructions are implemented by your processor.
Another example: creating new processes is slow on Windows. Older Unix software, if it wanted to do multiple tasks, would usually create a new process for each task, because threads were not reliably available across different Unix-like systems until the late 1990’s. Windows began with robust threading and the assumption that programs would use threads, and so implemented a heavier process model. Why? Windows comes from a different lineage than Unix, via Digital Equipment Corporation’s VMS and Mica. Windows didn’t change from the Unix conventions any more than Mandarin diverged from English.
How to figure out what they’re doing
A lot of what’s going on is not visible to us. It’s ephemeral signals in circuits. There’s no instrumentation to capture it, and we wouldn’t accept the performance hit of adding that instrumentation. So, like debugging, we’re back to running experiments, except this time we don’t get the source code. We need a way of generating hypotheses, then running experiments to check them in such a way that we get information that leads to more hypotheses.
Like scientists studying things too small to see, we need tools to make the tiny things that are not visible to us trigger changes that are visible. Think about how we measure a single carbon atom. We can either combine it with many other carbon atoms until we get diamond or graphite that we can see, or we can couple it to some apparatus like an atomic force microscope that amplifies what the carbon does, though it disturbs it in the process.
Similarly, we take a piece of our program and run it bigger (over more data, repeated more times, in more parallel threads, all the dimensions that make sense for it), or we wait for something to happen and cause a large event, such as stopping it when it makes a particular system call.
A classic example for dealing with memory hierarchy is iterating over arrays versus linked lists. If you allocate an array of five integers and a linked list of five integers, and iterate over them both, you probably won’t see much difference. Instead, we want data too big for the L1 cache (which is 32 KB per core on many Intel processors, and a comparable size on most other modern designs because of similar tradeoffs), so let’s make the array and list a million integers long. And if we allocate the linked list in one go, all its elements will still probably be lined up in memory like the array, so let’s go in and shuffle the linked list so its pointers point randomly to other elements in memory and not to the next element.
Reading from L1 cache takes about one CPU cycle. If we’re summing the numbers in our list or array, the addition is another cycle, plus a few cycles for updating the loop indexes. Five cycles to fetch a number from L1, add it to our sum, and update the loop indexes is reasonable. For the array, we will fetch a 64 byte cache line from L2 cache into L1 cache every time we exhaust the previous one. That fetch takes roughly ten cycles. For the shuffled linked list, we will fetch from L2 on almost every iteration. The loop will run roughly ten times as long with the linked list as with the array. Probably. Except for all the other things that are going on, like compiler optimization and other processes getting time on the same core…measurements are hard, and doing them well is time consuming.
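The clean version of this experiment really wants a compiled language, where you control layout. But a rough analogue is visible even from Python if you let numpy do the per-element work and only vary the access pattern. This sketch (sizes are arbitrary, and the ratio you see will be smaller than true pointer chasing in C) sums the same data visited in memory order versus random order:

```python
import timeit
import numpy as np

n = 10_000_000
data = np.arange(n, dtype=np.int64)

sequential = np.arange(n)                  # visit elements in memory order
shuffled = np.random.permutation(n)        # same elements, random order

# Identical work in both cases (gather n elements, then sum them);
# only the memory access pattern differs.
print("sequential:", timeit.timeit(lambda: data[sequential].sum(), number=10))
print("shuffled:  ", timeit.timeit(lambda: data[shuffled].sum(), number=10))
```

On typical hardware the shuffled version comes out several times slower, purely because of cache misses.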
Your other major tool is the profiler. Profilers capture information about performance from a running program. Like the atomic force microscope, they disturb the system they are measuring, and you need to be aware of that. For some things this doesn’t matter. For others, it matters a lot. Inserting profiling too intrusively can dramatically change the small scale behavior of the L1 cache and processor pipelines. You always need an experiment that you can run with and without the profiler to demonstrate that the effect you care about is preserved.
The three major kinds of profilers that you will deal with measure:
- Time spent in parts of the program by every so often pausing the program and seeing what it’s currently doing. Many such samples give you a statistical picture of where the program is spending its time.
- Time spent in parts of the program by measuring every step of the way, such as recording the time for each function call.
- Memory allocation, freeing, and how much memory is allocated in what types and data structures, usually taken by pausing the program, analyzing all the memory the program is using, writing out that information, then resuming the program.
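In Python, for example, the second kind is the standard library’s cProfile, which records every function call, while sampling profilers like py-spy implement the first kind. A minimal cProfile sketch (the function names are invented):

```python
import cProfile
import pstats

def slow_part():
    return sum(i * i for i in range(1_000_000))

def fast_part():
    return sum(range(1_000))

def main():
    for _ in range(5):
        slow_part()
        fast_part()

# Record every function call and its timing (the second kind above).
cProfile.run("main()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)   # top 5 entries by cumulative time
```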
It can be helpful to think of profilers versus isolated experiments as the difference between a biologist going out into the field and observing what is there and a biologist taking an organism into the laboratory to do experiments on it. Experiments done in the lab can easily be irrelevant to what happens in nature, so you need to observe in the field to discover what is important. Observations in the field are at the mercy of whatever happens to be going on, so confirming hypotheses about pieces of it generally needs the isolation of a laboratory.
How to take advantage of it
At the most basic level, you take advantage of this by choosing your language and tools. If you’re going to be writing loops that iterate more than a trivial number of times, use a language that produces optimized, compiled code, whether from an ahead-of-time compiler like C++ or Rust or a just-in-time compiler like Java or JavaScript. If you’re going to be doing lots of linear algebra, use libraries optimized for the purpose, whether from FORTRAN or Python (via numpy). If you’re going to use more than one CPU core, use a language that can take advantage of them well (Java, Rust, Erlang) and not one where you have to tie yourself into knots to use multiple cores (Python, JavaScript, Ruby). If you’re serializing and deserializing lots of data with the same structure, choose a format that takes advantage of that fixed structure (Protobuf, Cap’n Proto) instead of one that can serialize or deserialize any structure anywhere (JSON, CBOR). These choices make an enormous difference.
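Protobuf and Cap’n Proto are the real tools for that last point, but even Python’s standard library shows the underlying idea: when every record has the same structure, you don’t need to spell that structure out in every message. A toy comparison (the record and its layout are invented for illustration):

```python
import json
import struct

record = {"id": 12345, "x": 1.5, "y": -2.25}

# Self-describing: field names and structure repeated in every message.
as_json = json.dumps(record).encode("utf-8")

# Fixed layout agreed in advance: one 32-bit int and two 64-bit floats.
as_fixed = struct.pack("<idd", record["id"], record["x"], record["y"])

print(len(as_json), "bytes as JSON")    # about 35 bytes
print(len(as_fixed), "bytes fixed")     # 20 bytes, and almost no parsing to read back

unpacked_id, x, y = struct.unpack("<idd", as_fixed)
```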
Then you study what people have done. In computer game design, an approach to laying out data in memory evolved known as data-oriented design, which arranges everything to have mechanical sympathy with the memory hierarchy. Incidentally, that layout is exactly that of a relational database using a columnar store, and the book I linked to is accidentally one of the best books on designing schemas for relational databases that I know of. Read about the LMAX architecture (and the rest of Martin Thompson’s writing). Read what Poul-Henning Kamp has written about how he designed Varnish.
All this will probably make you throw away a lot of how you learned to structure programs, especially if you learned object oriented programming in Java.
How to ascend
The answer here is: a bit at a time. Learn enough of the basics of computer architecture and how data is represented in memory to be able to do simple measurements, even ones as trivial as comparing the size in memory of various ways of representing a bunch of numbers. Start weeding out or replacing parts of your toolkit that are mechanically unsympathetic for no good reason. Learn how to use all three kinds of profilers for your environment and get used to using them and reading their results.
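That size comparison is only a few lines of work. For example, in CPython on a 64-bit machine (exact numbers vary by version and platform):

```python
import sys
from array import array

numbers = list(range(1_000_000))

# Rough estimate: the list itself plus one full int object per element.
as_list = sys.getsizeof(numbers) + sum(sys.getsizeof(n) for n in numbers)
# Packed 64-bit machine integers, no per-element object headers.
as_array = sys.getsizeof(array("q", numbers))

print(f"list of int objects:   ~{as_list / 1e6:.0f} MB")   # roughly 36 MB
print(f"array of machine ints: ~{as_array / 1e6:.0f} MB")  # roughly 8 MB
```

The packed array is several times smaller because each Python int is a full object with a header, while the array stores bare 64-bit machine integers.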
Then keep cycling through unpacking some black box, figuring out how to measure its effects, finding software that handles it sympathetically, and learning how to do that yourself. Which black boxes you pry open and in what order depends on your career. You may have need of the internal details of MIPS instructions and never need to care about how the V8 engine represents numbers, or why the way TCP responds to network congestion drives people to use UDP for things like logging or synchronizing state in multiplayer video games. You may spend your entire career using Linux in production and never have reason to read Windows Internals (it’s a good book, though) or Don Box’s books on COM and .NET, or you may only work on Windows and never think about how the number of threads Solaris handles gracefully is much larger than what Linux does.
This series is still being written. Subscribe to get emailed when each section is posted.