A rant about "what programming language should I learn?"

November 4, 2012

Someone asked on a forum what language they should learn for bioinformatics. I admit it, I went on a rant. Here it is, for your enjoyment:

You're asking the wrong question. This is understandable, since the skill level of the bioinformatics community is so low that most of its members ask the same wrong question. I'll even answer it: PLT Racket is the best place to learn to program today. But it's still the wrong question.

Here's the right question: "What do I need to learn to be able to effectively use the computer as a tool to do biology?"

Part of the answer will depend on what kind of science you're trying to do, but some topics are absolutely universal.

You need to learn a general purpose programming language. Here, I'll teach you Scheme: (function argument argument argument ...), and that same form can go in any of those slots. For example, (+ 2 2), (+ (* 3 3) 1), ((if (> 2 3) + -) 1 1), (define (square x) (* x x)), (square 4). Congratulations. You can learn other languages when you need them. Languages come and languages go (well, except Common Lisp and FORTRAN), and you use what you want.
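To show how little machinery that one form needs, here is a minimal sketch of an evaluator for the examples above. It is written in Python only because Python is widely installed. The names tokenize, parse, evaluate, and run are made up for this illustration, and it handles only integers, two-argument arithmetic, if, and the function-defining form of define. It is a toy, not a Scheme.

```python
import operator

def tokenize(text):
    """Split source text into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Build a nested list from a token list, consuming tokens from the front."""
    token = tokens.pop(0)
    if token == "(":
        form = []
        while tokens[0] != ")":
            form.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return form
    try:
        return int(token)
    except ValueError:
        return token  # anything that is not an integer is treated as a symbol

def evaluate(form, env):
    """Evaluate a parsed form in the environment env (a dict of names to values)."""
    if isinstance(form, str):        # symbol: look it up
        return env[form]
    if not isinstance(form, list):   # number: evaluates to itself
        return form
    head = form[0]
    if head == "if":                 # special form: only the chosen branch is evaluated
        _, test, then, otherwise = form
        return evaluate(then if evaluate(test, env) else otherwise, env)
    if head == "define":             # only (define (name params...) body) is supported
        _, (name, *params), body = form
        env[name] = lambda *args: evaluate(body, {**env, **dict(zip(params, args))})
        return None
    function = evaluate(head, env)   # the function slot is itself just another form
    arguments = [evaluate(arg, env) for arg in form[1:]]
    return function(*arguments)

def run(source, env):
    """Tokenize, parse, and evaluate one form."""
    return evaluate(parse(tokenize(source)), env)

# Assumption: these operators take exactly two arguments, unlike real Scheme's.
env = {"+": operator.add, "-": operator.sub, "*": operator.mul, ">": operator.gt}

print(run("(+ 2 2)", env))                  # 4
print(run("(+ (* 3 3) 1)", env))            # 10
print(run("((if (> 2 3) + -) 1 1)", env))   # 0
run("(define (square x) (* x x))", env)
print(run("(square 4)", env))               # 16
```

The third example is the interesting one: the function slot holds an if form that evaluates to the subtraction function, which is then applied to 1 and 1.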

You need a basic knowledge of data structures and algorithms: big-O notation, singly and doubly linked lists, arrays, binary and n-ary trees, and hash tables. You need to know what a hash function is and why it works. You need to know the general operations for manipulating these data structures, and what they're called in your language. You need to know how sorting works (though you needn't implement it yourself) and how searching works on the various data structures. You need to know about the vagaries of floating point, and how to do basic root finding and minimization (Acton's 'Real Computing Made Real' is the best source I know of for this), and how to design and write these algorithms by hand. You must know how pseudorandom number generation works, and have a good generator on hand. The Mersenne Twister is the day-to-day state of the art at this point. You need to know how Monte Carlo methods work, and how to generate random data (a.k.a. simulation).
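Three of those items fit in a few lines each. What follows is a rough sketch in Python, not a recommendation of method: bisection is simply the easiest root finder to get right, and the function names bisect_root and estimate_pi are invented for this example. CPython's random module happens to use the Mersenne Twister as its core generator, which is why it appears here.

```python
import math
import random

# Floating point: 0.1 and 0.2 have no exact binary representation.
print(0.1 + 0.2 == 0.3)               # False
print(math.isclose(0.1 + 0.2, 0.3))   # True: compare with a tolerance instead

def bisect_root(f, lo, hi, tol=1e-12):
    """Find a root of f in [lo, hi] by bisection.

    Precondition: f(lo) and f(hi) do not have the same sign.
    """
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = lo + (hi - lo) / 2
        if mid == lo or mid == hi:
            break               # interval is one float wide; tol is unreachable here
        f_mid = f(mid)
        if f_lo * f_mid <= 0:   # sign change (or exact zero) in the left half
            hi = mid
        else:                   # sign change in the right half
            lo, f_lo = mid, f_mid
    return lo + (hi - lo) / 2

def estimate_pi(samples, seed):
    """Monte Carlo estimate of pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)   # seeded, so the simulation is reproducible
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * inside / samples

print(bisect_root(lambda x: x * x - 2, 0.0, 2.0))   # close to math.sqrt(2)
print(estimate_pi(100_000, seed=1))                 # close to math.pi, within Monte Carlo error
```

Note the guard inside the loop: for large values of lo and hi, adjacent floats are farther apart than tol, and without the guard the loop would never terminate. That is exactly the kind of floating point vagary you need to know about.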

You need to know how data is represented in the computer. What are bytes and words? How are characters represented? What are the different kinds of integer representations and floating point representations? How are enumerations and symbols represented? How are more complicated data structures like structs laid out in memory? How are the representations laid out in binary file formats? (Hint: binary files are not black magic; they're just more data, laid out the way it is in memory.) You need to know the difference between machine code and byte code, compilers and interpreters, and what the relative benefits of each are (note that compilers can be interactive and interpreters can be batch only -- ignore any assertions to the contrary).
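You can poke at most of these questions directly. Here is a small sketch using Python's standard struct module. The record layout (an integer id, a double, and a four-byte tag) is a made-up example, not any real file format.

```python
import struct

# Characters: text becomes bytes only through an encoding.
print("é".encode("utf-8"))                       # b'\xc3\xa9' (two bytes for one character)

# Integers: two's complement, in a chosen byte order.
print((-1).to_bytes(2, "little", signed=True))   # b'\xff\xff'

# Floating point: an IEEE 754 double is eight bytes of sign, exponent, and fraction.
print(struct.pack(">d", 0.5).hex())              # 3fe0000000000000

# A hypothetical record: 4-byte signed int, 8-byte double, 4 raw bytes.
# "<" means little-endian with no padding between fields.
RECORD = struct.Struct("<id4s")
packed = RECORD.pack(7, 0.5, b"ACGT")
print(len(packed))                               # 16
print(packed.hex())                              # 07000000000000000000e03f41434754

# Reading a "binary file format" is just the reverse operation.
print(RECORD.unpack(packed))                     # (7, 0.5, b'ACGT')

# With native alignment ("@"), the compiler's padding rules apply,
# so the same fields may occupy more bytes; the result is platform-dependent.
print(struct.calcsize("@id4s"))
```

Writing packed to a file and reading it back gives you a complete, if tiny, binary file format.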

You need to understand recursion and the design of loops via preconditions, postconditions, and loop invariants.
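Binary search is the classic case where guessing at the loop gets you off-by-one bugs and stating the invariant does not. A sketch in Python, with the precondition, postcondition, and invariant written out as comments:

```python
def binary_search(items, target):
    """Return an index i with items[i] == target, or -1 if target is absent.

    Precondition:  items is sorted in ascending order.
    Postcondition: either the result is -1 and target is not in items,
                   or items[result] == target.
    """
    lo, hi = 0, len(items)
    # Invariant: if target occurs in items at all, its index lies in [lo, hi).
    while lo < hi:
        mid = (lo + hi) // 2          # lo <= mid < hi, so the range always shrinks
        if items[mid] < target:
            lo = mid + 1              # everything at or before mid is too small
        elif items[mid] > target:
            hi = mid                  # everything at or after mid is too large
        else:
            return mid
    # Here lo == hi: the candidate range is empty, so by the invariant
    # the target is not in items.
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))   # 3
print(binary_search([1, 3, 5, 7, 9], 4))   # -1
```

Each branch either returns or strictly shrinks the range while preserving the invariant. That gives you both termination and correctness without running a single test.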

You need to understand relational algebra and be able to manipulate relational databases (SQLite is a good place to start). You need to know what memoization is, and how to implement various forms of it. You need to know how to produce 2D graphics in a clean, composable way, such as recognizing that the data area of a chart represents a new set of coordinates that you're transforming to. You need to be able to send and receive HTTP requests, that is, opening a port and sending and receiving messages according to a fixed protocol. You need to be able to write a real parser for a file format, one that isn't a bunch of hacked-together regular expressions (go look at Haskell's Parsec -- write one for your language). You should understand what Prolog is, how to write in it, and how to implement a simple Prolog yourself.
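Two of these are easy to try in a minute. Below is a sketch using Python's built-in sqlite3 module and a hand-written memoizer. The gene and expression tables and their contents are invented for the example, and the memoize decorator is one simple form of memoization (an unbounded cache keyed on positional arguments) among many.

```python
import sqlite3

# An in-memory database: nothing to install, nothing to clean up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE expression (
        gene_id INTEGER REFERENCES gene(id),
        sample  TEXT,
        level   REAL
    );
    INSERT INTO gene VALUES (1, 'geneA'), (2, 'geneB');
    INSERT INTO expression VALUES (1, 's1', 2.0), (1, 's2', 4.0), (2, 's1', 10.0);
""")

# A join followed by a grouping and an aggregate: relational algebra in practice.
rows = conn.execute("""
    SELECT gene.name, AVG(expression.level)
    FROM gene JOIN expression ON expression.gene_id = gene.id
    GROUP BY gene.name
    ORDER BY gene.name
""").fetchall()
print(rows)   # [('geneA', 3.0), ('geneB', 10.0)]

def memoize(function):
    """Wrap function so each distinct argument tuple is computed only once."""
    cache = {}
    def wrapper(*args):
        if args not in cache:
            cache[args] = function(*args)
        return cache[args]
    return wrapper

@memoize
def fib(n):
    """Naive recursive Fibonacci; exponential time without the cache, linear with it."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))   # 12586269025
```

The recursive calls inside fib go through the memoized wrapper, because the name fib is rebound to the wrapper by the decorator. That is what turns the exponential recursion into a linear one.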

You need to be able to produce correct programs. This means knowing what each part of your program is supposed to produce for some cases, being able to check that easily (best is stating invariants that another program checks by generating increasingly huge random cases -- see QuickCheck), and being able to reason your way to where the error is in your program rather than trying things at random.
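The QuickCheck idea is simple enough to sketch by hand. The following is a bare-bones imitation in Python, not QuickCheck itself: there is no shrinking of counterexamples, and check_property, random_int_list, and dedupe are names invented for this illustration.

```python
import random

def check_property(prop, generate, trials_per_size=50, max_size=200, seed=0):
    """Test prop on random cases of increasing size.

    Returns None if prop held on every generated case,
    otherwise the first case for which it failed.
    """
    rng = random.Random(seed)             # seeded, so failures are reproducible
    for size in range(0, max_size + 1, 20):
        for _ in range(trials_per_size):
            case = generate(rng, size)
            if not prop(case):
                return case
    return None

def random_int_list(rng, size):
    """Generate a list of up to size integers drawn from [-size, size]."""
    return [rng.randint(-size, size) for _ in range(rng.randint(0, size))]

def dedupe(items):
    """Function under test: drop duplicates, keeping first occurrences in order."""
    seen, result = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

def dedupe_is_correct(items):
    """Invariants that must hold for any input, stated without reference to the algorithm."""
    result = dedupe(items)
    return (
        len(result) == len(set(result))    # no duplicates remain
        and set(result) == set(items)      # nothing lost, nothing invented
        and dedupe(result) == result       # applying it twice changes nothing
    )

print(check_property(dedupe_is_correct, random_int_list))          # None: no counterexample found
print(check_property(lambda xs: xs == xs[::-1], random_int_list))  # a list that is not a palindrome
```

The second call checks a deliberately false claim (that every list reads the same reversed) to show what a failure looks like: you get back a concrete input to reason about.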

Oh, and learn a modern version control system: git or mercurial. If someone around you already uses one of those two, use what they're using. Otherwise, flip a coin. [Edit: Since I wrote this, git has won. Use it.]

Those are the universals that will make the computer into a tool for you. Seem like a daunting list? It's actually not nearly as bad as it looks, trust me. But what about your science? That's the goal, remember: use the computer as a tool to do science. Not as a tool to move data from one file format to another (after you learn about representing data in the machine, you'll understand that all file formats are arbitrary). Not as a tool for connecting to NCBI or EMBL or anywhere else. A tool to do science. Don't lose sight of that fact. Most bioinformaticists spend between 90% and 100% of their time just messing with file formats. It's not science.

Now, before I can recommend where you go next, you'll need to tell me what kind of science you want to do.
