What works - NGS assemblers

A quick paper review on picking the best assembler

You could spend all day just keeping up with developments in next-generation sequencing. Companies announce new and revolutionary technologies seemingly every month, promising to do more, do it better and do it for less. Yet at the same time, it’s difficult to hack your way through the marketing talk and get hard figures. And the road of NGS is strewn with once promising technologies …

Fortunate then that a few recent papers have looked at NGS tech and done some hard comparative studies. For example:

Finotello et al. (2012) Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform 13(3):269-280. doi: 10.1093/bib/bbr063

Here the vital step we’re talking about is assembly: how accurate is the result, where and how many gaps remain, and how good is each assembler at resolving complex genetic regions?

Of course, the test data set is critical context: what works well for one dataset may be inappropriate for others. Here it’s 454 reads of three different bacterial genomes for which high quality reference genomes were available: Zymomonas mobilis, Helicobacter pylori and E. coli. Furthermore, the effect of coverage was estimated by subsampling the original reads to different depths. So the study is essentially doing de novo assembly with different software and then assessing the results by comparison against a reference genome.
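To make the subsampling idea concrete, here is a minimal sketch of how you might thin a read set down to a target coverage. This is my own illustration, not the procedure from the paper: the function names, the in-memory list of read strings and the fixed random seed are all assumptions.

```python
import random


def coverage(reads, genome_size):
    """Average coverage: total sequenced bases divided by genome length."""
    return sum(len(read) for read in reads) / genome_size


def subsample_to_coverage(reads, genome_size, target_coverage, seed=0):
    """Randomly pick reads until their summed length reaches the target coverage.

    Hypothetical helper for illustration only. If the full read set is
    shallower than the target, every read is returned.
    """
    rng = random.Random(seed)      # fixed seed so the subsample is repeatable
    shuffled = list(reads)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)

    target_bases = target_coverage * genome_size
    picked, total_bases = [], 0
    for read in shuffled:
        if total_bases >= target_bases:
            break
        picked.append(read)
        total_bases += len(read)
    return picked
```

You would then run each assembler on the output for a series of target coverages and compare every assembly against the reference. A real pipeline would stream reads from SFF or FASTQ files rather than hold them all in a list.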

The assemblers in question are:

  • Newbler (version 2.3 and 2.5)
  • A pre-release version of PCAP
  • MIRA
  • CLC Assembly Cell
  • CABOG

(Note: this seems like a slightly odd selection of assemblers, but it may be just my personal experience speaking). A number of popular assemblers were excluded:

  • Long read assemblers like PHRAP and PCAP were not used due to previous work showing they were suboptimal
  • Short read assemblers like Velvet and ABySS were not used due to previous work showing they produce a lot of small contigs and poorer reconstruction

So what are the results? There’s a mass of figures and graphs in there that are difficult to digest, but to brutally summarize them:

  • CABOG gives the largest and most complete contiguous assemblies. Emphasis is on the word “contiguous” – CABOG is very good at putting things together, although it sometimes does this incorrectly
  • Newbler creates very accurate contigs, although more “disconnected”. There’s a curious story here where the more recent version of the software (v2.5) is better at linking the contigs at the expense of more errors
  • CLC copes better with lower coverage than other assemblers
  • MIRA has a significant error rate in assembling contigs
  • PCAP shows a somewhat erratic performance (perhaps because of being pre-release)

As an extra, the authors also ask what the “sweet spot” for coverage is – how much is “enough”, and how much effort should you invest in increasing coverage? All assemblers show a rapid improvement up to about 15-20x coverage and then almost none beyond that, out to 72x coverage. (Note this is much lower than previously suggested figures.) The authors’ take-home is that more than 20-30x coverage is unlikely to bring much benefit. In fact they suggest that exceeding this will not only be expensive but may compromise assembly quality. (The reasoning for this last point is unclear to me – perhaps the addition of more but uninformative data just creates a larger job for the assembler.)
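For a feel of what those coverage figures mean in sequencing effort, here is a back-of-envelope sketch using coverage = reads × read length / genome size. The numbers are illustrative assumptions of mine, not values from the paper: a genome of roughly 4.6 Mb (about the size of E. coli K-12) and a mean 454 read length of 400 bp.

```python
import math


def reads_needed(genome_size, mean_read_length, target_coverage):
    """Number of reads required to reach a given average coverage.

    Rearranges coverage = n_reads * mean_read_length / genome_size.
    """
    return math.ceil(target_coverage * genome_size / mean_read_length)


# Illustrative assumptions, not figures from the paper.
GENOME_SIZE = 4_600_000   # bp, roughly E. coli K-12
MEAN_READ_LENGTH = 400    # bp, assumed typical 454 read

for cov in (20, 30, 72):
    n = reads_needed(GENOME_SIZE, MEAN_READ_LENGTH, cov)
    print(f"{cov}x coverage needs about {n:,} reads")
```

Under these assumptions, going from 20x to 72x means more than tripling the number of reads, for what the paper suggests is almost no gain in assembly quality.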

The results are a little surprising: no one in my circles uses CABOG or its predecessors, while MIRA has a reasonable following. And some of the differences are quite substantial, on the level of several percent of the genome being “missed”. Which makes me wonder how much NGS data we’re going to have to throw out in the coming years when we find how bad it is …