Language Wars

"What language would you recommend to introduce programming to an audience of life science students at a bachelor level?"

(Originally published on BiocodersHub)

Following several lengthy and passionate discussions in different venues on what language to use for teaching bioinformatics, I've started cutting and pasting my reply. And here it is.

You'll get a lot of different opinions on this because:

  • It's a religious issue. That is, it comes down a lot to subjective judgements and personal experience.
  • There are a lot of possible considerations for language choice in bioinformatics courses: teachable to people who aren't just going to be programmers and may not have programmed before, has a lot of useful libraries, has a community behind it, good for quick-and-dirty / one-off scripting solutions, useful for web development, fast enough, etc.
  • What "bioinformatics" means to one person and another can be quite different. I'm a bioinformaticist, you're a computational biologist, you're a genomicist and they just do a few stats ...

So a few thoughts about different languages:

Old school compiled languages, e.g. C/C++: No. Learning curve too steep, no good for quick-and-dirty problems, weak in web development. Relatively little bioinformatic work happening here. Not a good place to start. Sure, they're fast, but Java would be a better place to look.

Java: Lots of libraries and BioJava is pretty damn good. However it's not a great first language, and always feels a bit "heavy" when I'm trying to solve a small problem. Still, I expect to see a lot of development in this area with the JVM-enabled languages like Jython, JRuby and Groovy, where you can script and still use the Java libraries. Not for novices. Scala may be an interesting new entrant in the high performance stakes, although I've yet to find anyone using it for bioinformatics.

Perl: was the undisputed choice for bioinformatics 10 years ago but that lead has evaporated. Quirky, opaque and write-only. The whole Perl 6 morass doesn't help. I think you can do better. Still, there's a lot of code here and a lot of the older significant tools are written in it (e.g. GBrowse).

Ruby: I've got a love-hate relationship with Ruby. There's a lot of Good Stuff there, and the web development is excellent. People seem to like learning Ruby too. But there are a few quirks in the language and BioRuby is still a work in progress. Still, a lot of enthusiasm here.

Python: this is where the weight of attention is. BioPython has really come along in the last few years and many of the newer, excellent tools (e.g. Galaxy) are written in it. Easy to learn, kind to beginners, big community, good scientific computing support (IPython, NumPy, etc.), decent web programming tools, lots of resources for learning the language. There's an odd aspect or two I wish was developed more (I'd really like anonymous closures and better functional programming) but you couldn't go wrong here.

(Declaration of interest: This is my choice. I've taught classes using Python. I've used Python in my own work for a decade. I think Python is the best general choice. But I think some other choices are defensible - or at least not ridiculous - especially in particular contexts.)
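To give a flavour of what "kind to beginners" and "quick and dirty" look like in practice, here's a minimal sketch of the sort of one-off script a student might write early in a course: working out the GC content of a few sequences in FASTA format. The function names (parse_fasta, gc_content) and the toy sequences are invented for illustration, and it deliberately uses only plain Python rather than BioPython.

```python
def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header: emit the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    # Emit the final record.
    if name is not None:
        yield name, "".join(chunks)


def gc_content(seq):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy data, made up for this example.
EXAMPLE = """>seq1
ATGCGC
GGAT
>seq2
aattaatt
"""

for name, seq in parse_fasta(EXAMPLE):
    print(name, len(seq), round(gc_content(seq), 2))
```

In a real course you'd probably move on to BioPython for the parsing fairly quickly, but the point is that a novice can read and modify something like this.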

JavaScript: many people rave about what a great language JS is, and there are occasional feints at doing bioinformatics in it. True, there are useful things that you could do in a browser with it, perhaps involving microformatted sequence data. But while you can do work in it, should you? Mostly, it seems like a case of Atwood's law: "Any application that can be written in JavaScript, will eventually be written in JavaScript." No Bio library (or really any standard libraries at all), non-existent bio community. You could do web development entirely in JS, but it's still a fringe activity. Nope.

R: A lot of ecologists & mathematical biologists use R, a lot of expression data is analysed using R, and it's got graphics & visualization to die for. The IDE is great for beginners as well, allowing packages to easily be installed locally. There's a big commercial effort behind getting serious IDE and computation tools for R (see Revolution Analytics). I confess to a bit of a blind spot with R (some of the syntax is a bit weird), but this could be the right choice for the right group of students.

Matlab, Mathematica, Octave, etc.: There are a few people who do their work in one of the specialised analytical or mathematical languages. My experience here is admittedly limited, and some of the visualizations and models produced are nice, but the community is tiny, library choice and coverage is limited, web development tools are not common, and using proprietary tools may not suit. (That Mathematica yearly subscription still smarts.) For the right person and project, these might work but I'm not persuaded any of these are a good general answer.

Clarification: Of course, these languages are not related or very similar, but I've grouped them because they occupy similar niches. Bioinformatic library choice is low. There are tonnes of libraries in general, but as far as a Bio community goes for these languages, it seems to be a set of islands: this research group does structural bioinformatics, this investigator does ecological simulations, this project does some phylogenetics ... unless you're attached to a group working with one of these tools, you're pretty much on your own.

Lisp, Scheme, etc.: What? You're kidding me.

PHP: Despite the routine denigration PHP gets, it does drive a huge number of websites (including Facebook and the NHS), it's easy to pick up, and there's a cornucopia of resources for (non-bioinformatic) programming. There's also a BioPHP. But there's not a lot of activity here, and outside of web development it would be difficult to find any positive reason why you should opt for this choice.

(Visual/Real/whatever) Basic: I know that Basic has been used to do bioinformatics, because I've done it. It seemed like a good idea at the time: I could quickly throw together a user interface in an afternoon, and the program looked good. Fast-forward a few years and some feature creep, and I'm trying to write simulated annealing and tree-walking code in Basic. Admittedly, it may be simple for people to learn and get pretty apps up on the screen fast, but scientific computation is not a natural fit. The developer community is large but the bio community is near non-existent. And you're dealing with proprietary tools again, complicated by non-standardized dialects across competing tools. Maybe suitable for a simple one-off GUI with a very restricted scope.

C# / .Net / etc.: There's a small number of C-Sharp bioinformatic apps about. Apart from the obvious community-size problems, I'm not sure if it would be suited for one-off scripts, and there doesn't seem to be much activity in the way of bioinformatic libraries (for exceptions see here and here). The status of the non-proprietary implementation (Mono) is a little worrying. Still, this may be useful for doing a desktop GUI app.

Shell languages / Awk / etc.: Yes, you could do bio-analysis with shell scripts. But why would you? Again, see Atwood's law. Stop punching yourself in the face and use a proper language.