Posts categorized under: science

Pointers for analytics of healthcare data

A start

Recently I've been asked for pointers on analysing complex healthcare data. This is a difficult issue. Healthcare analytics / health informatics / medical informatics / etc. range over a wide area, driven by a wide variety of interests and outcomes, overlapping hugely in some areas and not at all in others. The …

The rules of analysis

Opinions, I got 'em

While mentoring some juniors, I started to think about the rules of thumb for analysing data that I've built up over the years. While I'm certainly not the world's greatest data scientist (or its greatest bioinformatician, statistician, biomedical scientist, etc.), it seems worthwhile trying to capture them here. These are …

Have you looked on EvolDir?

Reflections on the academic job search

For those who don’t know, EvolDir is a worldwide mailing list on evolutionary biology that has been running since approximately forever. Everyone who works even vaguely in the area of evolution subscribes to it. Every day it typically carries several posts on conferences, book announcements, funding opportunities and job …

Tentacles vs arms

Solving (partly) an old conundrum

You might have come across confusing statements like "octopuses have arms, squids have tentacles" and wondered what's the difference. I did and here's the (unsatisfactory) answer.

Cephalopods (squid, octopus, nautilus and a number of other aquatic creatures, which are a class of mollusc) have a number of muscular "limbs". Traditionally …

Why bioinformaticians don't get no respect

It's a common cry amongst working bioinformaticians that they're unappreciated, undervalued and generally "get no respect". While people love to complain, from discussions with peers and colleagues, the same stories come up again and again:

  • Not being consulted when projects are planned
  • Technical advice and results not taken seriously
  • Being …

Using AWS for research computing

Your own private computing cluster?

This is based upon my reply to a question on Reddit concerning experiences with using Amazon Web Services (including Elastic Computing, Glacier, etc.) for research.

I was part of a pan-European research consortium that used it for our shared computing infrastructure: databases, a few web apps and web sites, mailers …

Academic job ad red flags

Words you don't want to read in a job description
"competitive salary":
Like several other terms on this list, this phrase can be used in a completely sincere manner: We pay decently. If we offer you the job, then we'll negotiate. Unfortunately, it is just as frequently used as a way of avoiding the subject of remuneration, in the …

Random observations on the academic-scientific job search

Written without any bitterness at all
  • Maintain employable, valuable skills. Never evince any technical skills. Never casually offer to help a colleague with computer problems, etc. First you'll end up being known for that. Second, people will keep asking you to encrypt their hard-drive, build a database, etc. Third, technical matters are low status "working class …

Bayesian stats in very plain language

My pass at explaining the often misunderstood.


Some years ago, I got into an argument with someone about the relative merits of Bayesian versus Maximum Likelihood in phylogenetics. They asserted the two were basically the same or would come to the same answers. I countered that while they would often agree, they were measuring different things …
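As a toy illustration of "often agree, but measuring different things" (a coin-flip example of my own choosing, not taken from the original post or from phylogenetics): maximum likelihood reports the single value of p that makes the data most probable, while a Bayesian analysis reports a distribution over p given the data and a prior. The sketch below assumes a Beta prior, with a uniform Beta(1, 1) as the default, and the function names are hypothetical.

```python
from fractions import Fraction


def ml_estimate(heads, flips):
    """Maximum-likelihood estimate of p: the value that maximises P(data | p)."""
    return Fraction(heads, flips)


def posterior_mean(heads, flips, prior_a=1, prior_b=1):
    """Mean of the Beta posterior over p, given a Beta(prior_a, prior_b) prior."""
    return Fraction(heads + prior_a, flips + prior_a + prior_b)


if __name__ == "__main__":
    # All heads, with increasing amounts of data.
    for heads, flips in [(3, 3), (30, 30), (300, 300)]:
        ml = float(ml_estimate(heads, flips))
        bayes = float(posterior_mean(heads, flips))
        print(f"{heads}/{flips}: ML = {ml:.3f}, posterior mean = {bayes:.3f}")
```

With three heads out of three flips, the ML estimate is exactly 1 while the posterior mean is 4/5; as the data pile up the two numbers converge, which is why the methods so often appear to agree.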

What I done learned about REDCap

A few surprises

For those not in the know, REDCap is a platform for creating and editing databases through the web. And by and large, it works fine. It saves a lot of development effort. It provides good reporting tools for users. It's secure and robust. But there are some things to be …

(Re-)building databases with csvsql

Tips and traps when going from a database dump to a database

The scenario

You have a bunch of related CSV files.

Maybe they're the result of a raw database dump. Maybe they've been generated in some other way: experimental results, various public data sets, whatever. But the important thing is that you need to make a database from them. Perhaps because …

Words a bioinformatician never wants to hear

Based on hard-won experience.

(This first appeared on another site courtesy of Rad, and has since popped up elsewhere. It enjoyed some moments of viral popularity, with many aggrieved practitioners chipping in on the comments of the article. Following the resurrection of my website, it's a good opportunity to bring this piece …

Tools for data

How to store data, what to use

Prompted by a recent tweet asking what people used for storing and managing their data, I wrote down my own hard-won lessons on the topic. In rough order of preference and data complexity:

A hierarchical strategy

Use restructured text for documentation

Or markdown / asciidoc. The advantages of this being:

  • It's …

Philosophical considerations in manuscript preparation

Thought experiments and musings.

Zeno's paradox of manuscript completeness

No matter how many drafts you go through, the number of helpful suggestions made by your co-authors will approach but never quite reach zero.

Plato's allegory of collaborators and the cave wall

Distinguished or influential co-authors have a tendency to invite, introduce or insist upon …

Writing knitr in restructured text

Swapping out Markdown for a different markup.

knitr is a useful R package/tool for documenting analysis. Basically, it allows the embedding of R code "chunks" within a simple text document. This document can then be "knitted", which means that the R code is interpreted and reinserted in the document along with the results of that code …

Common tasks in Galaxy

It's all there in the documentation, but sometimes it's hard to find. This document gives you another place to look.

So how do I ...

... create admin users?

Curiously, the identity of admin users is hardcoded into the Galaxy configuration file. (Which makes it secure, I guess, but separate from the …

Compiling Quickjoin and file formats

Problems with building qjoin and getting it to read stockholm files.

Quickjoin / qjoin is an excellent commandline program for rapid construction of neighbour-joining trees. However, while using it recently, I had a few problems getting it to read Stockholm files, the most accessible of the formats it can use.

The …

Galaxy toolsheds

Relatively painless tool-sharing

This is a more recent innovation in Galaxy, which can make it a somewhat confused one: the concept of the toolshed has changed over its lifetime, the documentation is incomplete, and there's a slightly strange emphasis in the documentation that exists. So …

Mile-high description

Toolsheds …

Language Wars

"What language would you recommend to introduce programming to an audience of life science students at a bachelor level?"

(Originally published on BiocodersHub)

Following several lengthy and passionate discussions in different venues on what language to use for teaching bioinformatics, I've started cutting and pasting my reply. And here it is.

You'll get a lot of different opinions on this because:

  • It's a religious issue. That is, it comes …

Hitchhikers guide to BioPython: SeqRecords

For the novice, more-than-raw sequences.

(Previously published on BiocodersHub.)

Previously I'd spoken about how Biopython represents sequence data with the Seq class. But there is also the SeqRecord class:

  • A Seq is just raw sequence data and information about what type of sequence it is.
  • A SeqRecord is a Seq and all the other information …

Cleaning biosequences

A simple script to check and purge sequence files of possible problems.

Sometimes you need sequences that are unambiguous (i.e. only 'ACGT', lacking gaps) whether it's because of the limitations or assumptions of tools (like omegaMap) or just because you want to know where SNPs or sequencing ambiguities …
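A minimal sketch of the kind of check described, in Python rather than the language of the original script, and treating sequences as plain strings. The function names and the default 'ACGT' alphabet are assumptions for illustration only.

```python
ALLOWED = frozenset("ACGT")


def find_problems(seq, allowed=ALLOWED):
    """Return (position, character) pairs for every symbol outside `allowed`."""
    return [(i, ch) for i, ch in enumerate(seq.upper()) if ch not in allowed]


def purge(seq, allowed=ALLOWED):
    """Return the upper-cased sequence with every symbol outside `allowed` removed."""
    return "".join(ch for ch in seq.upper() if ch in allowed)


if __name__ == "__main__":
    seq = "ACGTN-ACRT"
    for pos, ch in find_problems(seq):
        print(f"position {pos}: unexpected symbol {ch!r}")
    print(purge(seq))
```

Note that purging changes the length of a sequence, so on aligned data it will break the alignment; reporting positions first is usually the safer step.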

Coloring dendroscope files

How to programmatically label phylogenies.

The need had arisen for the tips of a large phylogeny to be labelled in a systematic way. Rather than "point and click" within Dendroscope, this script takes a .den/dendro file and colors the tips according to a "color description" file. This is a simple csv file with taxa …

Consensus in BioRuby

Explaining the ill-explained ways to obtain a consensus sequence in BioRuby.

In BioRuby, alignments are equipped with several methods for obtaining consensus sequences. Unfortunately, these have terse descriptions which point you at the BioPerl documentation, with the added bonus of not quite working like the BioPerl equivalents.

First, let's create a very simple alignment, where everything agrees except the last sequence …

Galaxy miscellanea

Odds and ends and the surprising.


If you are serving the installation with a proxy redirect (e.g. the galaxy server is running on port 7070 but is being redirected by Apache to appear at port 80 on /galaxy), while you can access Galaxy at both addresses, login will …

More about MrBayes

Some (more) notes about the venerable Bayesian reconstruction program.

Error when setting parameter "Gap" (2)

When attempting to execute a Nexus file, MrBayes kept spitting back this cryptic error upon loading:

Executing file "c_vp1_nuc_seqs.nxs" [...]
Reading data block
Allocated matrix [...]
Data is Dna
Gap character matches matching or missing characters …

Ross Crozier 1943-2009

The sudden death of Ross Crozier on the 12th of November was heralded largely by a slow ripple of email, phone calls and Facebook messages across the globe. I found out from an email that started with a short but singularly complete sentence:

Terrible news.

It is sobering to think …

What works - NGS assemblers

A quick paper review on picking the best assembler

You could spend all day just keeping up with developments in next-generation sequencing. Companies announce new and revolutionary technologies seemingly every month, promising to do more, better and for less. Yet at the same time, it’s difficult to hack your way through the marketing talk and get hard figures …

Drawing sequence logos

A very simple script to do a simple but tedious task.

Sequence logos are a common way of representing SNPs and diversity in groups of sequences. This script automates the task. It's a bit rough around the edges and serves mainly as a base for further hacking.

Usage is:

drawlogo.rb [options] FILE1 [FILE2 ...]

where options are:

-h, --help Display this …

Reducing a sequence to SNPs

A script for a simple task.

A largely self-explanatory script. This will "shrink" an alignment, deleting all sites that don't contain a polymorphism in some member sequence. A little bit of script candy as well, this takes any number of files and saves the results in a new file named according to a definable schema:

#!/usr …
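For readers who only want the core idea, here is a minimal sketch of the "shrinking" step in Python. It is not the original script: it assumes the alignment is already loaded as a list of equal-length strings, skips the file handling and output-naming schema, and uses hypothetical function names.

```python
def polymorphic_columns(seqs):
    """Return indices of alignment columns in which more than one symbol occurs."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("aligned sequences must all be the same length")
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def reduce_to_snps(seqs):
    """Drop every column that is identical across all sequences."""
    keep = polymorphic_columns(seqs)
    return ["".join(s[i] for i in keep) for s in seqs]


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "TCGT"]
    print(polymorphic_columns(alignment))
    print(reduce_to_snps(alignment))
```

This sketch treats gaps and ambiguity codes as ordinary symbols, so a column containing a single gap counts as polymorphic; whether that is desirable depends on the data.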

Ordnance Survey locations

Converting between OS grid references and longitudes-latitudes.

The Ordnance Survey is a UK-peculiar geospatial format, ubiquitous via street atlases, hiking charts and (yes) farming and epidemiological maps. It is explained in great detail in several places, but here's a quick overview:

The OS grid is a set of 25 squares, 500 kilometers a side, arranged 5-by-5 that …
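As a sketch of the first half of the conversion only: the code below turns a lettered grid reference into numeric eastings and northings in metres. It does not go on to longitude and latitude, which needs a map projection and datum shift. It assumes the standard lettering scheme (the alphabet without "I", laid out 5-by-5 with "A" in the north-west corner, the first letter picking the 500 km square and the second a 100 km square within it). The function names are hypothetical, and the code does not check that the square actually lies within Great Britain.

```python
def letter_index(letter):
    """Position of a grid letter in the 25-letter alphabet that omits 'I'."""
    letter = letter.upper()
    if len(letter) != 1 or not ("A" <= letter <= "Z") or letter == "I":
        raise ValueError(f"not a valid grid letter: {letter!r}")
    idx = ord(letter) - ord("A")
    return idx - 1 if idx > 7 else idx  # letters after 'I' shift down by one


def gridref_to_easting_northing(gridref):
    """Convert e.g. 'TG 51409 13177' to (easting, northing) in metres.

    Returns the south-west corner of the square the reference identifies.
    """
    ref = gridref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    half = len(digits) // 2
    if len(letters) != 2 or len(digits) % 2 or half > 5:
        raise ValueError(f"malformed grid reference: {gridref!r}")
    if digits and not digits.isdigit():
        raise ValueError(f"malformed grid reference: {gridref!r}")

    l1, l2 = letter_index(letters[0]), letter_index(letters[1])
    # Column/row of the 100 km square, counted from the grid's false origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Pad the digits to five places so that they are expressed in metres.
    easting = int(digits[:half].ljust(5, "0")) if half else 0
    northing = int(digits[half:].ljust(5, "0")) if half else 0
    return e100km * 100_000 + easting, n100km * 100_000 + northing


if __name__ == "__main__":
    print(gridref_to_easting_northing("TG 51409 13177"))
```

Shorter references such as "SU387148" are handled by padding, so they resolve to the corner of a 100 m square rather than a 1 m one.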

Error 1 for MrBayes

What happens when a make fails.

If this happens to you when trying to compile MrBayes:

% make
gcc -DUNIX_VERSION -DUSE_READLINE -O3 -Wall    -c -o mb.o mb.c
gcc -DUNIX_VERSION -DUSE_READLINE -O3 -Wall    -c -o mcmc.o mcmc.c
gcc -DUNIX_VERSION -DUSE_READLINE -O3 -Wall    -c -o bayes.o bayes.c
bayes.c:45:31: readline/readline …

Fetch sequences from db

A simple script to grab bioseqs by accession.

This just wraps the BioRuby fetch functionality in a friendly commandline interface. In brief, it can accept accession ids on the commandline or from a piped file (one accession per line) and save the corresponding sequences from the db. Sequences may be downloaded via the bioruby or EBI servers. The …

Installing Galaxy

Setting up a production version of GMOD Galaxy for general use.

This presents one way to create an optimized production Galaxy instance. Variations are certainly possible and some of the choices presented are/were dictated by local culture. Certain settings may be more suitable for production or development environments. Nonetheless, this presents a start-to-stop process for installation and setup.

Note: this …

Parsing Dendroscope nodes

For when you have to do lots to a big tree.

Previously, I showed how Dendroscope files can be easily manipulated with brute-force regex, so you can write scripts to color a mass of nodes, rather than having to format them one-by-one in the GUI. However, more complex manipulations require …