madhadron

A farewell to bioinformatics

Around the time I left academia, I wrote a rant saying what I thought of bioinformatics. I sent it around to my old national consortium in Switzerland, which was used to receiving my rants. I posted them on my website because they liked sharing them. This rant, like my others, was well received by that group.

Then, one Friday morning, one of my pals from Switzerland messaged me, saying that someone had posted it to reddit’s r/bioinformatics, asking what people think about it. It was the top item for a couple days.

From there it went to Hacker News and sat at the top of the front page for hours. Other forums picked it up. I got emails from their moderators asking me to come take part in the discussion and lots of email from people who just wanted to talk to me directly.

I still get a slow trickle of emails from people who want to talk to me in response to it, and it is (or was) assigned reading in bioinformatics courses at Johns Hopkins and Beijing. People in bioinformatics quote bits of it at each other, so I leave it here.

The Rant

I’m leaving bioinformatics to go work at a software company with more technically ept people and for a lot more money. This seems like an opportune time to set forth my accumulated wisdom and thoughts on bioinformatics.

My attitude towards the subject after all my work in it can probably be best summarized thus: “Fuck you, bioinformatics. Eat shit and die.”

Bioinformatics is an attempt to make molecular biology relevant to reality. All the molecular biologists, devoid of skills beyond those of a laboratory technician, cried out for the mathematicians and programmers to magically extract science from their mountain of shitty results.

And so the programmers descended and built giant databases where huge numbers of shitty results could be searched quickly. They wrote algorithms to organize shitty results into trees and make pretty graphs of them, and the molecular biologists carefully avoided telling the programmers the actual quality of the results. When it became obvious to everyone involved that a class of results was worthless, such as microarray data, there was a rush of handwaving about “not really quantitative, but we can draw qualitative conclusions” followed by a hasty switch to a new technique that had not yet been proved worthless.

And the databases grew, and everyone annotated their data by searching the databases, then submitted in turn. No one seems to have pointed out that this makes your database a reflection of your database, not a reflection of reality. Pull out an annotation in GenBank today and it’s not very long odds that it’s completely wrong.

Compare this with the most important result obtained by sequencing to date: Woese et al’s discovery of the archaea. (Did you think I was going to say the human genome? Fuck off. That was a monument to the vanity of that god-bobbering asshole Francis Collins, not a science project.) They didn’t sequence whole genomes, or even whole genes. They sequenced a small region of the 16S rRNA, and it was chosen after pilot experiments and careful thought. The conclusions didn’t require giant computers, and they didn’t require precise counting of the number of templates. They knew the limitations of their tools.

Then came clinical identification, done in combination with other assays, where a judicious bit of sequencing could resolve many ambiguities. Similarly, small scale sequencing has been an incredible boon to epidemiology. Indeed, its primary scientific use is in ecology. But how many molecular biologists do you know who know anything about ecology? I can count the ones I know on one hand.

And sequencing outside of ecology? Irene Pepperberg’s work with Alex the parrot dwarfs the scientific contributions of all other sequencing to date put together.

This all seems an inauspicious beginning for a field. Anything so worthless should quickly shrivel up and die, right? Well, intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%. Thus the thread of failures can be stretched out from years to decades, hidden by the cloak of incompetence.

And the rhetoric! The call for computational capacity, most of which is wasted! There are only two computationally difficult problems in bioinformatics, sequence alignment and phylogenetic tree construction. Most people would spend a few minutes thinking about what was really important before feeding data to an NP-complete algorithm. I ran a full set of alignments using the exact algorithms, not heuristic approximations, in a virtual machine on my underpowered laptop yesterday afternoon, so we’re not talking about truly hard problems. But no, the software is written to be inefficient, to use memory poorly, and the cry goes up for bigger, faster machines! When the machines are procured, even larger hunks of data are indiscriminately shoved through black box implementations of algorithms in hopes that meaning will emerge on the far side. It never does, but maybe with a bigger machine…

Fortunately for you, no one takes me seriously. The funding of molecular biology and bioinformatics is safe, protected by a wall of inbreeding, pointless jargon, and lies. So you all can rot in your computational shit heap. I’m gone.

Followup

If you look at the forums, you will see lots of negative comments. I didn’t get a single negative email. This struck me as strange, so I went back and counted up all the positive, negative, and neutral comments and emails.

Sentiment   Email   reddit   Hacker News   Biostars   Hubski
Positive    20      1        21            0          2
Neutral     3       16       119           13         14
Negative    0       24       21            5          2

The three neutral emails were two from people inviting me to take part in discussions (which is how I found the discussions on Biostars and Hubski), and one which contained only the line “sent from my iPad.” The neutral comments on forums were largely side discussion. Interestingly, the mixed or primarily programming fora (Hacker News and Hubski) had about equal numbers of positive and negative comments. The bioinformatics specific fora (Biostars and the reddit subgroup) were very negative.

The negatives all followed certain themes. Many were ad hominems: “From looking into this guy a bit (who I’ve never heard of before today in my 10+ years in the field)…it does not appear that he completed his PhD after several years of work” and “Sounds like a fed up academic with a stick up his backside.” There were lots of implications that I didn’t know biology, didn’t know the state of programming in the non-academic world, didn’t know bioinformatics, etc.

A number were strawmen. One did try to claim I was wrong based on my own words (“There are only two computationally difficult problems in bioinformatics, sequence alignment and phylogenetic tree construction.”), but when asked for another, all he could offer was genome assembly, which is a special case of sequence alignment. One amusing strawman took umbrage at my use of the word “ept,” claiming it didn’t exist. Someone did eventually post the reference to the OED entry.

Others took issue with my tone, saying that it was unacceptable to address people this way, but there was no substantive criticism in public, and no criticism at all in private. That means these people weren’t concerned that I was wrong, or at least had no stomach to send me a criticism without some kind of public setting where they would be part of a group. Indeed, the original poster on reddit, who was worried as he was about to start a PhD in bioinformatics, didn’t receive an actual answer (though I privately sent him one).

There’s a name for this in circles that study human behavior: group monkey dance. You should follow that link and read Rory’s article on it, and probably Rory’s books, too, but here’s a quick summary: human violence follows patterns. Most fist fights occur in the same way. Married couples will have the same arguments year after year. And social groups will turn on an outsider or perceived betrayer with a brutality that most of the group members would never display individually.

In this setting, a group monkey dance would have emotional outbursts against the transgressor (me), with repeated themes and short on rational argument, which is exactly what we find. Most of these people are folks I could sit down with and have a sensible conversation about bioinformatics. However, I attacked the group which they have made a part of their identity and triggered a group monkey dance.

So what this whole debacle has taught me is that public comments on forums encourage group monkey dances, and thus reduce the quality of the discourse on the Internet. Based on this, I dropped off all public forums for several years afterwards, and since then have only rejoined a small number of heavily moderated ones.

What I sent to the original r/bioinformatics poster

I tried to send the person who originally posted the rant to r/bioinformatics something useful by private message, which I’ll reproduce here for anyone else who may be similarly disturbed:

Hi, I’m the author of the piece. A colleague of mine still in the field pointed out that someone had posted it to reddit. I have no intention of engaging with the comment thread, but I thought I’d drop you a private message.

If you notice, no one provided any substantive criticism of what I said, no refutation of my points. There were a few strawmen, a few ad hominems, but no one addressed my actual words. If they’re words that would make you want to not do a PhD, then you need to address that. Figure out what the parts are that unsettle you (aside from the tone, which was intentionally strident), and go independently find an answer for yourself. The exercise will, at the very least, give you a useful overview of some of biology. (As a similar exercise, try writing a history of the future of the field over the next 50 years.)

Whether you decide to do your PhD or not, this is useful. If it leads you to do something else–and you should plan on what you’re going to do when you leave academia, since the data says you will, like almost everyone else–fine. If it leads you to do your PhD, you’ll have a perspective that you can use to choose what you’ll specialize in, as opposed to randomly falling into it.

If you do the PhD, though, I warn you: use the perspective to cut areas out that don’t interest you, but choose based on the professor. Your advisor, whether you trust his scientific taste, his personality, and his skill as a mentor, should be almost the only criterion in your selection of your research in a PhD. Look at his students. Are they happy, healthy, making progress? Do they respect him? What about his former students?

Good luck to you either way.