The iniquities of the Unix shell

by Fred Ross
Last updated: July 29, 2011

Most bioinformaticists that I know regard the Unix shell as their interface to their computer. There are other things floating around out there in the depths of whatever Unix-like operating system they run, but in their minds everything passes through the shell; it is the milieu in which computing takes place.

This is not only wrong, it is dangerously wrong. I'll explain why it's wrong, then come back to the dangerous part afterwards.

Everyone has learned some variation of the idealized von Neumann computing machine: a processor that executes one instruction at a time, connected to some contiguous bank of memory. Instructions in the processor can take a memory address in a register and read the data found at that address in memory into another register, or write the contents of that other register to that address in memory. Perhaps the registers are replaced by a stack. The memory might be layered, with some parts smaller and faster to access, and others increasingly large and slow.

A program specifies the behavior of this machine. Typically a program starts with data at one place in memory and produces other data in another place in memory. This should all sound very familiar. A way of specifying programs is an interface to the computer.
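The idealized machine can be sketched in a few lines. The instruction set here is invented for illustration, but the shape is the essential one: data starts at one address, a processor steps through instructions one at a time, and the result lands at another address.

```python
# Toy von Neumann machine. Memory is one flat array; the program is a
# list of (opcode, address) pairs from an invented instruction set.
memory = [2, 3, 0]          # data at addresses 0 and 1; result goes to 2
program = [("load", 0), ("add", 1), ("store", 2)]

acc = 0                     # a single register
for op, addr in program:    # execute one instruction at a time
    if op == "load":
        acc = memory[addr]
    elif op == "add":
        acc += memory[addr]
    elif op == "store":
        memory[addr] = acc

print(memory[2])  # prints 5
```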

So does a shell script constitute a program for this machine?

Many of my readers doubtless said yes. After all, it begins with data at one address (which happens to be a file in the filesystem, but that's a memory address) and generally leaves data at another address (again in the file system, or perhaps stdout). It specifies the behavior of the machine. How could it not be a program for this machine?

It specifies the behavior of a machine, but not of this machine. When you run ls | tail -n 7, you do not have a coherent piece of memory. The operating system establishes the appearance of a contiguous chunk of memory for ls, and it establishes a second, completely disjoint chunk of memory for tail. In effect, there are two machines. The operating system goes through contortions to take the final pattern of data from ls and hand it to tail as its starting pattern. The shell script describes a machine more like a prototyping board in electrical engineering. The Unix shell is a more convenient way of starting programs than keying them in by hand with switches on the front of the machine.
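You can see the two machines by doing the shell's work by hand. This sketch (using Python's subprocess module, and assuming ls and tail are on the PATH) starts the same pipeline and shows that each side is a separate process with its own address space, joined only by a pipe:

```python
import subprocess

# What the shell arranges for "ls | tail -n 7": two processes,
# each with its own disjoint memory, connected by a pipe.
ls = subprocess.Popen(["ls"], stdout=subprocess.PIPE)
tail = subprocess.Popen(["tail", "-n", "7"],
                        stdin=ls.stdout, stdout=subprocess.PIPE)
ls.stdout.close()           # so tail sees EOF when ls finishes
out, _ = tail.communicate()

# Two distinct PIDs: two machines, not one.
print(ls.pid, tail.pid)
```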

Why is this error dangerous?

If you regard the Unix shell as the interface to the machine, then writing a program consists of linking up pieces in the shell. The programmer may have to venture into some other world in order to create pieces, but those pieces are thought of as the atomic units. Among such programmers, all projects will end up producing shell commands. Each command will have its own file format, or ape the format of another command. There will be no publication of algorithms, only publication of shell commands. After a certain point the programmers realize that they are running out of descriptive names that are short enough to type comfortably. They begin creating namespaces by having a single command accept subcommands: foobar create, foobar delete, foobar check, etc.

Among such programmers much of the "work" of the field will consist of dealing with file formats, resolving naming conflicts, and trying to find bigger, better ways to connect pieces, never realizing that all of these are irrelevant.

A program consists of algorithms that run on data structures. Anything that interferes with running the algorithms on the data structures is part of the problem set, not the solution set. File formats to get data from one program to another are a problem. Leave the result of an algorithm in memory for the next algorithm to operate on. Naming every possible operation and adding subcommands to avoid conflicts and reuse names is a problem. Use a language with namespace support, like anything designed since the 1970s. Bigger and better ways to connect shell commands --- these go under the moniker of "workflow managers" --- are a problem. Calling conventions ceased to be a topic of controversy in programming languages in the 1970s, and instrumenting programs to monitor and log their execution has spawned that exotic tool, the "debugger".
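The in-memory alternative looks like this (the function names and the toy peak-finding task are invented for illustration): one algorithm's result stays in memory as a plain data structure and feeds the next algorithm directly, with no file format in between.

```python
def find_peaks(signal):
    """Return indices where the signal exceeds both neighbors."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > signal[i - 1] and signal[i] > signal[i + 1]]

def widths(peaks):
    """Distances between consecutive peaks."""
    return [b - a for a, b in zip(peaks, peaks[1:])]

# The list returned by one algorithm is handed straight to the next.
# No serialization, no parsing, no intermediate files.
signal = [0, 3, 1, 4, 1, 5, 2]
print(widths(find_peaks(signal)))  # prints [2, 2]
```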

The shell mindset results in two dangers. First, as the cruft wedged between algorithms increases, more and more of the field spends more and more of its time fighting through it instead of working on real problems. Estimating from my own experience and observation of my colleagues, most bioinformaticists today spend between 90% and 100% of their time stuck in cruft. Methods are chosen because the file formats are compatible, not because of any underlying suitability. Second, algorithms vanish from the field. They are not published. They are not analyzed. They are not seen. I am not worried about a lack of understanding of the heuristics for solving NP-complete problems like sequence matching, though I would love to be. I'm worried about the number of bioinformaticists who don't understand the difference between an O(n) and an O(n²) algorithm, and don't realize that it matters.
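To make that difference concrete, here is a toy example of my own, not anything standard in the field: both functions decide whether a list contains a repeated value, but one makes roughly n²/2 comparisons while the other makes roughly n hash lookups, and the gap between them grows with the square of the input.

```python
def has_duplicate_quadratic(xs):
    # O(n^2): compares every pair of elements.
    return any(xs[i] == xs[j]
               for i in range(len(xs))
               for j in range(i + 1, len(xs)))

def has_duplicate_linear(xs):
    # O(n): one pass, remembering what has been seen in a hash set.
    seen = set()
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False

print(has_duplicate_quadratic([1, 2, 3, 2]))       # prints True
print(has_duplicate_linear(list(range(1000))))     # prints False
```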

I don't know how to halt this process. It may be too late. Biologists are pouring into the field and being taught that the irrelevancies and trivialities of file formats and workflow managers are legitimate work. They go back and tell their colleagues that it's all very complicated and difficult without ever seeing the really difficult parts of programming. We may be doomed.