Hitchhikers guide to Biopython: Sequences & alphabets

2013-07-11

tagged BioPython, biosequences, computational biology, Python

(Originally published on BiocodersHub.)

If you’re doing bioinformatics in Python, you’re probably using Biopython. Actually, Biopython is a good reason for using Python. But it can be formidable to newcomers: there’s a lot there and there’s not a huge amount of learning material. This then is the first of a series of rapid introductions to Biopython.

What Is It For?

What sort of features does it provide? Seemingly everything:

Reading and writing common sequence (and other) file formats
Classes for the usual datatypes (sequences, alignments etc.)
Functions for the usual things you’d do with those datatypes (alignments, Blasting)
Facilities for data base queries
A very large etc., etc. There’s a lot in there.

The downsides are:

Some of it is frankly under-documented
Having been written by many different authors, there’s a certain inconsistency across the library
The library is still evolving and some of it is a little bleeding edge

However these points are improving steadily. There’s been a marked improvement over the last few years to where BioPython could arguably be regarded as the best of the BioFoo libraries.

Installation

Installing Biopython is mercifully easy:

% easy_install biopython

if you have setuptools or something similar installed. If not, there are tarballs and binary packages available. If you’re on a Linux system, there should be a prebuilt package for most package managers. If you don’t have root access, Biopython works well with things like python-brew. There are a few suggested dependencies (NumPy, ReportLab) but you can install these later.

So how do you call the Biopython library? Like this:

>>> import Bio

Sequences

The most ubiquitous entity in BioPython would be Seq, a class for simple biosequences:

>>> from Bio import Seq
>>> mySeq = Seq.Seq('ACGTTTGCGC')

This represents a biosequence in the simplest possible way, with it acting like a string, allowing indexing, using len() to get length, slicing, etc.:

>>> import Bio
>>> from Bio import Seq
>>> mySeq = Seq.Seq ('acggtcggtggggccc')
>>> mySeq
Seq ('acggtcggtggggccc', Alphabet())
>>> mySeq[0]
Seq ('a', Alphabet())
>>> mySeq[:5]
Seq ('acggt', Alphabet())
>>> mySeq[:5] + mySeq[-3:]
Seq ('acggtccc', Alphabet())

Note that the Seq class is found inside the Seq module, which invariably confuses novices.

Another point for confusion is that there is another class SeqRecord. What’s the different? A SeqRecord wraps a Seq and all the ancillary information about a sequence, like annotations and so on. A Seq is just the sequence, a SeqRecord is the sequence and all associated data. Be warned: some Biopython functions take Seqs, others take SeqRecords.

Just for the moment, I’m going to ignore SeqRecords because we need to talk about …

Alphabets

You would have noticed Alpbabet in the code up above. Obviously the legal “letters” in a (say) mRNA sequence differ from those in a protein transcript. A Biopython alphabet dictates the allowable symbols in a sequence and how they are used. An Alphabet governs the conversion & compatibility of sequences. You can’t combine two Seqs with incompatible alphabets. Common alphabets are defined in Bio.Alphabet.IUPAC, with useful functions being held in Bio.Alphabet.

(It may seem like I’m spending a lot of words on Alphabets: this is due to my observation that they’re one of the more neglected and misunderstood areas in Biopython. But once you know what they do, you largely can ignore them.)

There’s a whole hots of standard alphabets defined by Biopython. But what if you need to make your own?

>>> import Bio
>>> from Bio import Alphabet
>>> from Alphabet import IUPAC
>>> GAPPEDAMBIG_DNA_ALPHABET = Alphabet.Gapped (
IUPAC.ambiguous_dna, '-')
>>> LegalDnaLetters = IUPAC.ambiguous_dna.letters
>>> LegalGappedDnaLetters = GAPPEDAMBIG_DNA_ALPHABET.letters

Simple.

As said, you can only combine sequences with compatible alphabets types. This means, DNA with DNA, protein with protein etc. If you combined two sequences with compatible but different alphabets, the result ‘promotes’ to more general type. By and large, this works as you would expect:

>>> seq_ud = Seq ('acgt', unambiguous_dna)
>>> seq_ad = Seq ('acgt', ambiguous_dna)
>>> seq_ur = Seq ('acgt', unambiguous_rna)
>>> seq_p = Seq ('acgt', protein)

What happens?:

>>> seq_ud + seq_ad
Seq('acgtacgt', IUPACAmbiguousDNA())
>>> seq_ud + seq_ur
TypeError Traceback (most recent call last)
...
TypeError: Incompatible alphabets IUPACUnambiguousDNA() and IUPACUnambiguousRNA()
>>> seq_ud + seq_p
TypeError Traceback (most recent call last)
...
TypeError: Incompatible alphabets IUPACUnambiguousDNA() and IUPACProtein()

Note that alphabets are no guarantee. It is still entirely possible to assign rubbish letters into a sequence. The method Seq.verify_alphabet() can be used to manually check if a seq conforms to an alphabet.

Back to sequences

Having considered alphabets, we can now look at some of the Seq methods that use alphabets. For example, getting the complement of a sequence will obviously depend on the alphabet:

>>> mySeq = Seq ('acggtcggtggggccc')
>>> mySeq.complement()
Seq('tgccagccaccccggg', Alphabet())
>>> mySeq.reverse_complement()
Seq('gggccccaccgaccgt', Alphabet())

Transcription can be handled by standalone functions in Seq (the module) not Seq (the class):

transcribe(seq_or_str): DNA to RNA
back_transcribe(seq_or_str): RNA to DNA

DNA to protein Translation can be most simply handled by the method translate:

>>> mySeq.translate()
Seq('TVGGA', ExtendedIUPACProtein())

Reading and writing sequences

A warning: there are actually many ways of reading and writing sequence data in Biopython. Most of them are old, clunky and deprecated. The new hotness is SeqIO and you should use only that. Behold:

>>> from Bio import SeqIO
>>> in_hndl = open ('myseqs.fasta', 'rU')
>>> for s in SeqIO.parse (in_hndl, 'fasta'):
# do something with the sequence you just read
print s
>>> in_hndl.close()

In plain English: SeqIO.parse takes an open file (or file-like object) and the name of the format (in lower case) and iterates over the SeqRecords within. (Note:: SeqRecords not Seqs). Format SeqIo can understand include clustal, genbank, qual, fastq, nexus …

What if you just want to capture of the the sequences in a file, say in a list. Do something like this:

>>> in_hndl = open ("opuntia.aln", "rU")
>>> seq_list = [s for s in SeqIO.parse (in_hndl, "clustal")]
>>> in_hndl.close()

or even:

>>> records = list (SeqIO.parse (open ("myaln.fasta", 'rU'), "fasta")

Note: SeqIO will not guess at the format of a file from the name. Conversely, the file name extension doesn’t have to match the actual format.

Writing out sequences is just as deliriously easy. Feed a list of SeqRecords, a file handle and a format to Bio.SeqIO.write():

# my_seqs is a list or iterator of seqrecords
>>> out_hndl = open ("dump.fasta", "w")
>>> SeqIO.write (my_seqs, out_hndl, "fasta")
>>> out_hndl.close()