(Talk at SyBit Retreat, February 2011)
Since I am soon leaving for a new job, I thought I would impart a little wisdom to you all, since I am no longer constrained to be polite.
First, I want to talk about where most people lose time in programming. There are two stages to making a change to a codebase. First you must figure out what change to make. Where is the code to alter? How do you change it so as not to break anything else? How do you check whether your change was correct? How do you minimize the disruption of the code's structure?
There isn't much I can say about this stage. Someone skilled at it can wade through an unfamiliar code base an order of magnitude faster than a novice, but I don't know how to teach it, other than to modify unfamiliar code bases again and again.
The second stage -- actually making the planned change in the code -- is much more deterministic. All programmers go through the same process: edit the code, build it, test it, document the change, share it with colleagues, and deploy the result.
The first stage is fairly fixed. I find that the longest I'm comfortable coding without going through the rest of the process is about ten minutes. Starting from that, how much time does the rest of the process take? The two extremes are working completely manually, and working with a maximum of automation. The difference in time depends mostly on the tools, and not on the skill of the programmer.
Working manually, the programmer types each command to run the compiler on each source file; exercises the program by hand; documents the change in a text file separate from the code; diffs his modified source tree against the original, and emails the resulting patch to his colleagues (and we must count the time for his colleagues to merge that patch into their own copies); and copies his code to each host which he supports, manually building it and making sure it runs.
Working automatically, the programmer issues a single command to an incremental build system; a single command to run a test suite; writes the documentation alongside the code he modified; commits the change to a version control system; and runs a single command to push the new version to all the hosts he supports.
Comparing rough estimates of the time spent in each stage, automation saves the programmer over half his time. The numbers are approximate, but the difference is too large to be sensitive to the details.
What tools make up the automation?
Build: Find a build system which is incremental; that is, it recompiles only those files which have changed. Your entire build should require one command. It is an error to require two. Most languages have their accustomed system:
make for C and C++;
distutils for Python;
cabal for Haskell; etc.
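The essence of "incremental" can be sketched in a few lines of Python: rebuild a file only when its output is missing or older than its source. This is the same timestamp comparison that make and distutils perform; the function here is an illustration, not any particular build system's API.

```python
# The heart of an incremental build system: compare modification
# times, and rebuild only when the source is newer than the target.
import os

def needs_rebuild(source, target):
    """True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)
```

A build system walks the dependency graph calling a check like this at every edge, which is why a one-command build of a large tree can finish in seconds when only one file has changed.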
Test: Everyone is familiar with unit testing. Also consider QuickCheck systems, which generate random test data and often find subtle errors missed by normal unit tests, and acceptance tests, which exercise the resulting program itself, such as with expect for command line programs or Selenium for web pages. In every case, you should be able to run all your tests with one command.
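A QuickCheck-style test needs no framework at all: generate random inputs and check a property that must hold for every one of them. The function under test here, reverse_complement, is a toy example chosen for illustration.

```python
# Property-based testing in plain Python: random inputs, one invariant.
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Reverse a DNA sequence and complement each base."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def check_property(trials=1000):
    """Property: reverse_complement is its own inverse."""
    rng = random.Random(0)  # fixed seed, so any failure is reproducible
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 50)))
        assert reverse_complement(reverse_complement(seq)) == seq, seq

check_property()
```

A thousand random sequences will probe corner cases -- the empty string, single bases, long runs -- that a hand-written unit test would likely miss.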
Document: The time to write documentation is fairly fixed. You lose time in searching through text files for where to write it. Most languages today provide a way to write documentation directly into the code. Python and Common Lisp build it into the language as docstrings. Others embed structured text in comments, such as Doxygen for C++ or Haddock for Haskell. Find the tools for your language and use them.
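In Python the documentation lives in the function itself as a docstring, where help() and pydoc can find it. A small example (the function itself is hypothetical):

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C.

    seq is a string over the alphabet ACGT. Ambiguity codes are not
    handled; any other character counts as non-GC.
    """
    if not seq:
        return 0.0
    return sum(1 for b in seq.upper() if b in "GC") / float(len(seq))
```

Because the text sits next to the code it describes, there is no separate file to search through, and no file to forget to update.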
Share: Distributed version control systems are the state of the art, and of them git and darcs are preferable. They provide the tools you need locally, so you have the same environment whether you are in the office or on an airplane over the Pacific Ocean. Equally important, you do not have to trust the server where you share code. Your local repository is authoritative, and if your server is taken down, you just push a copy of your repository to another.
Deploy: This can be as simple as a shell script that copies the code to your lab server, then logs into the server and runs the build system and tests. For more complicated software, it may mean staging servers, build reports from multiple platforms, and sophisticated infrastructure. In either case, it should be one command to push your software everywhere it needs to go and collect any errors from all the locations.
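The simple end of that spectrum can be sketched in Python: for each host, copy the tree over, then build and test remotely, collecting any failures. The host names, remote path, and make targets are hypothetical; rsync and ssh are assumed to be available.

```python
# One-command deployment sketch: copy, build, test on each host,
# and report which hosts failed.
import subprocess

HOSTS = ["lab1.example.org", "lab2.example.org"]  # hypothetical hosts
REMOTE_DIR = "~/pipeline"                          # hypothetical path

def commands_for(host):
    """The two commands to deploy to one host: copy, then build and test."""
    return [
        ["rsync", "-a", "--delete", "./", "%s:%s/" % (host, REMOTE_DIR)],
        ["ssh", host, "cd %s && make && make test" % REMOTE_DIR],
    ]

def deploy():
    """Run the deployment on every host; return (host, command) failures."""
    failures = []
    for host in HOSTS:
        for cmd in commands_for(host):
            if subprocess.call(cmd) != 0:
                failures.append((host, " ".join(cmd)))
                break  # no point testing a host that failed to copy or build
    return failures
```

Calling deploy() is then the single command; anything beyond a handful of hosts is the point at which the staging servers and build reports mentioned above earn their keep.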
Setting up the infrastructure for a small project takes a couple of hours, but will pay off in less than a week.
With that out of the way, I want to set forth a few principles that I think would fix many of the problems in the bioinformatics community today.
1. Demand elegance.
Elegant solutions have a coherent structure that absorbs corner cases and outliers without resorting to lists of special cases. Each component of the solution corresponds to some piece of reality, so modifying the solution is straightforward, and the production code implementing an elegant solution is itself elegant.
Most bioinformaticists lose elegance long before reaching code. The most egregious error is failing to estimate a physical quantity. You should be estimating a length, in meters, or a probability of a particular base being bound by a certain protein. If you find yourself calculating with "arbitrary units", you have failed.
"Arbitrary units" lead to such absurdities as normalization procedures to compare two ChIP-Seq experiments, or questions of how to compare the results of a microarray experiment to the results of an RNA-Seq experiment. These are not legitimate problems. How well a microarray experiment estimates a particular physical quantity is a legitimate question.
The next common source of problems is the statistical model. Many people practice "statistics by prayer": "I saw this in a paper once. Please, god, let it work in my project." The vast majority of statistical models are simple combinations of <30 families of probability distributions and <10 of stochastic processes. There is no reason not to build a model for your particular situation.
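Building such a model is often a few lines of arithmetic. As an illustration (the counts are hypothetical), estimating the probability that a given base is bound by a protein from k bound observations in n trials, with a Beta prior on the binomial proportion:

```python
# A model built for the situation, not borrowed from a paper:
# binomial likelihood, Beta(alpha, beta) prior, closed-form posterior.
def beta_posterior_mean(k, n, alpha=1.0, beta=1.0):
    """Posterior mean of a binomial proportion under a Beta prior.

    The posterior is Beta(k + alpha, n - k + beta), whose mean is
    (k + alpha) / (n + alpha + beta). With alpha = beta = 1 (a uniform
    prior), zero data gives 1/2, and the estimate tends to k/n as n grows.
    """
    return (k + alpha) / (n + alpha + beta)
```

The point is not this particular prior but the habit: write down a likelihood for your experiment, choose a prior you can defend, and derive the estimate, instead of hoping someone else's model transfers.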
2. Know open source sociology.
At the recent SIB days, I was in a workshop on "how much programming do bioinformaticists need to know?" During the workshop we discussed ownership and sharing of code. Many of the ideas I heard were levelled at the open source community in the 1990s, ranging from "There is no benefit to me to sharing my code" to "If I make my code public, it will be easier for someone to break into my systems." These arguments have been proved empirically wrong over the past 20 years.
The open source community has developed a mature and sophisticated sociology around sharing code, ranging from small projects with one or two developers to Debian, which is a constitutional democracy of thousands of people.
Further, it is a community that has had to explain itself to the world outside, and to teach newcomers how to take part. There are documents you can hand to a newcomer which teach him the why and how of sharing source code.
For the why, the best introduction remains Eric Raymond's essay "The Cathedral and the Bazaar". For the how, there is a book titled Producing Open Source Software available for free on the web. It goes over the details of how to organize a project, what infrastructure you need, how to resolve disputes, and what to expect as a participant in the open source community.
Everyone new to scientific programming should read these.
3. Do not create file formats.
If you are creating a new file format, you are making a mistake.
If there is an acceptable existing format, you should use it. How do you decide whether a format is acceptable?
Often you will find what you need. For storing sequences, FASTA is wonderful. FASTQ is ideal for storing short reads with quality information. SAM and BAM are completely adequate for alignments. Sometimes the formats you find aren't acceptable, as in the case of BED and WIG.
Image formats in microscopy are a nice example. Microscopy requires three things from an image format: 16-bit greyscale images, multiple images per file, and metadata stored in the file. Everyone uses TIFF, which only supports metadata as a nonstandard extension. The format you should use is PGM, which, as well as supporting metadata by default, takes only an afternoon to write a parser for. Writing a parser for TIFF takes years.
What happens if you cannot find an acceptable file format? You still should not write your own.
Use SQLite. The only thing that it won't handle better than a custom file format is huge, numerical arrays. For that purpose, use HDF5. With these two, you should never have a reason to invent a file format.
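Using SQLite from Python is a few lines with the standard library's sqlite3 module; the schema below, storing alignment summaries, is a hypothetical example.

```python
# SQLite instead of an ad hoc file format: typed columns, indexes,
# queries, and a single portable file -- for free.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a filename for a real project
conn.execute("""CREATE TABLE reads
                (name TEXT PRIMARY KEY,
                 chrom TEXT, pos INTEGER, mapq INTEGER)""")
conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)",
                 [("r1", "chr1", 100, 60),
                  ("r2", "chr1", 250, 37)])
conn.commit()
n, = conn.execute("SELECT COUNT(*) FROM reads WHERE mapq >= 40").fetchone()
```

Every question you would otherwise answer by writing a parser and a filter script becomes a SELECT statement, and any other program that speaks SQLite can read your data without you documenting a format at all.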
4. Build on existing projects.
When you need a piece of software for something, you gather up all the options and try them. If you don't find one that fits your needs exactly, your response should not be, "I need to start a new project." It should be, "Which of these projects should I modify?" If the code base was written by anyone minimally competent, it will save you time to build on someone else's work. You won't believe it's saving time, but it is.
When is it legitimate to start a new project? First, if you found no options at all, then you have no choice. There are also occasional technical cases where you do need to start fresh: if you have a large body of code in one language and need a library to parse a data format for use with that code, then, though there may be libraries in other languages, you will have to write a new one.
Aside from purely technical reasons, there may be sociopolitical reasons to start a new project. For example, Linux has two major desktop environments, KDE and GNOME. KDE was stable before GNOME began. So why start GNOME? KDE was built on a non-free widget set, and many Linux distributions' bylaws forbade them from including non-free software. Thus a large fraction of the community could not use KDE, and a new project was legitimately required.
5. Keep it simple. Iterate fast.
Remember that what you think you need and what you actually need are rarely the same thing, and if you know what you actually need, you are probably starting a new project when you shouldn't.
So how do you find out what you actually need? You do it experimentally. You build the simplest thing you can imagine that does something useful, and you put it out in the real world. You give it to a target user, then keep your hands to yourself and your mouth shut, and watch. You will learn an incredible amount about what you actually need. Analyze that and repeat.
After iterating through this cycle again and again, you will figure out what you actually need. It seems that the number of cycles required is almost independent of the length of the cycles, so the faster you can iterate, the less of your life you will waste. Imagine iterating every three days, one day to code, one day to observe, one day to analyze.
What can you write in one day? It will be embarrassingly bad, right? But that's okay, because at some point you will say, "This is the right thing, but this piece of the interface is so bad that it's unusable." Part of the feature set has just naturally frozen. Now you can focus on improving it.
6. Don't damage new programmers.
Those who are new to programming obey the Sapir-Whorf hypothesis. How they think is formed by the language they work in. Experienced programmers don't realize this. They use layered, abstract notations in their heads and don't see the incidental complexity which the new programmer is drowning in.
New programmers faced with too much complexity resort to magical thinking. They memorize incantations which they don't understand, and fail to develop critical thought about their programs.
How do you prevent this damage? Two stages of exposure will inoculate the new programmer. First, have them work in Scheme for a while. This will imbue them with the idea of an abstract syntax tree. Thereafter they can take any complicated syntax and translate it back to Scheme.
Next, following Wirth's dictum that "algorithms + data structures = programs", the new programmer needs a language in which to think about data structures and algorithms on them, but which is independent of the details of the underlying machine and doesn't encumber him with peculiar design decisions. The best choice for this is Standard ML.
Someone who intends to do more than run existing tools in the years to come should be given this exposure first. It needn't be a lot, merely a couple of weeks of each language. After that, the new programmer can be tossed into whatever language you like without mental damage.
23 February 2011