"Checking gene names & identities in Biomart"

I need to check some gene names and get the sequences for them. Ordinarily, I would prefer to use Python for this sort of thing. Unfortunately, the Python interfaces to Biomart appear to have been written by insane people, so R it is.

What is Biomart? What is a biomart? A biomart is a way of providing remote access to large genomic datasets and associated information, and Biomart is one of the popular instances of it:

  • Biomart: http://central.biomart.org
  • Cosmic http://www.sanger.ac.uk/genetics/CGP/cosmic/biomart/martview
  • Ensembl http://www.ensembl.org/biomart/martview
  • Intogen http://biomart.intogen.org/

Under the hood, individual biomart servers may connect to a whole bunch of other servers, but this is invisible to users. Different servers may also specialise in different datasets, so you will need to discover and select the service you are using. You can use this data to answer queries like:

  • "Give me all the genes for this species"
  • "What's the name for this gene in this other nomenclature"
  • "Show me all the genes associated with this region"

There is a web interface to every biomart ("MartView"), but the winning feature of marts is programmatic access over the web, which is what I'll be demonstrating here.

First load the `biomaRt library and see what marts are available at the default address:

library ('biomaRt')
## Error in library("biomaRt"): there is no package called 'biomaRt'
head (listMarts())
## Error in listMarts(): could not find function "listMarts"

A little bit of nomenclature:

  • A mart is a standalone service, so a biomart server may have several marts.
  • Each mart may contain a number of datasets that can be queried
  • Each datasets consists of records with attributes (columns)
  • A record will have a value for each attribute
  • A query can use a set of filters to control and narrow its search

So, lets select Ensembl and see what datasets it has:

ensembl <- useMart('ensembl')
## Error in useMart("ensembl"): could not find function "useMart"
head (listDatasets (ensembl))
## Error in listDatasets(ensembl): could not find function "listDatasets"

Select the human dataset. Alternatively, you can select the mart and dataset in one go:

mart <- useDataset("hsapiens_gene_ensembl", mart=ensembl)
## Error in useDataset("hsapiens_gene_ensembl", mart = ensembl): could not find function "useDataset"
mart <- useMart (biomart="ensembl", dataset="hsapiens_gene_ensembl")
## Error in useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl"): could not find function "useMart"

You can see what sort of fields are available to filter upon:

filters = listFilters (mart)
## Error in listFilters(mart): could not find function "listFilters"
head (filters)
## Error in head(filters): object 'filters' not found
attribs = listAttributes (mart)
## Error in listAttributes(mart): could not find function "listAttributes"
head (attribs)
## Error in head(attribs): object 'attribs' not found

Queries are built with the getBM call. For example, to retreive genes by their gene symbol:

results <- getBM (attributes=c("ensembl_gene_id", "hgnc_symbol"), filters="hgnc_symbol", values=c("CPVL", "PYGL"), mart=mart)
## Error in getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "hgnc_symbol", : could not find function "getBM"
results
## Error in eval(expr, envir, enclos): object 'results' not found

To put this in plain English:

  • using the mart we selected earlier
  • return the ensembl_gene_id and hgnc_symbol columns for every record
  • where the hgnc_symbol column must be CPVL or PYGL.

So attributes are what we want to know and filters are how we restrict our search to a set of values.

Note that if the gene symbol does not exist in the set, the results will not include a result for it:

results <- getBM (attributes=c("ensembl_gene_id", "hgnc_symbol"), filters="hgnc_symbol", values=c("CPVL", "BOGUS"), mart=mart)
## Error in getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "hgnc_symbol", : could not find function "getBM"
results
## Error in eval(expr, envir, enclos): object 'results' not found

A query that has no results will return an empty frame:

results <- getBM (attributes=c("ensembl_gene_id", "hgnc_symbol"), filters="hgnc_symbol", values=c("BOGUS"), mart=mart)
## Error in getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "hgnc_symbol", : could not find function "getBM"
results
## Error in eval(expr, envir, enclos): object 'results' not found

Wonderful. Now let's do something real. Given a list of obselete or synonym gene names, extract the current and official ones. First, I'll connect to a different biomart in a slightly different way:

library ('biomaRt')
## Error in library("biomaRt"): there is no package called 'biomaRt'
listMarts (host='www.genenames.org')
## Error in listMarts(host = "www.genenames.org"): could not find function "listMarts"
mart <- useMart (host='www.genenames.org', biomart='g4public')
## Error in useMart(host = "www.genenames.org", biomart = "g4public"): could not find function "useMart"
head (listDatasets (mart))
## Error in listDatasets(mart): could not find function "listDatasets"
mart <- useMart (host='www.genenames.org', biomart='g4public', dataset='hgnc')
## Error in useMart(host = "www.genenames.org", biomart = "g4public", dataset = "hgnc"): could not find function "useMart"

I chose this 'mart because I wanted to query names. Also the default central mart was flaky the day I did this, which seems to be a common-enough occurence.

Next find the allowed attributes and filters:

head (listFilters (mart))
## Error in listFilters(mart): could not find function "listFilters"
head (listAttributes (mart))
## Error in listAttributes(mart): could not find function "listAttributes"

Load or details the names to query on:

# the first two are synonyms, the third and the last a primary and current identifier
query_symbols <- c ('BHLHB2', 'BRP44L', 'BOGUS', 'CPVL')

Then build and submit the query. Note that symbols that are correct (i.e. current and the primary symbol) and those that are unrecognised do not return any results:

attributes <- c(
  "gd_app_sym", "gd_app_name", 
  "gd_prev_sym",
  'md_ensembl_id',
  'f_gd_prev_sym'
)
filters <- 'f_gd_prev_sym' # or 'f_gd_aliases'
values <- query_symbols

results <- getBM (attributes=attributes, filters=filters, values=values, mart=mart)
## Error in getBM(attributes = attributes, filters = filters, values = values, : could not find function "getBM"
head (results)
## Error in head(results): object 'results' not found

A more complicated example: get upstream, downstream and coding regions for a set of genes:

library(tools)
library(biomaRt)
## Error in library(biomaRt): there is no package called 'biomaRt'
# initialize mart:
ensembl <- useMart('ensembl') 
## Error in useMart("ensembl"): could not find function "useMart"
mart <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
## Error in useDataset("hsapiens_gene_ensembl", mart = ensembl): could not find function "useDataset"
# get the central sequence
# note that seqinr masks the biomart function, lovely
pep_seqs <- biomaRt::getSequence (id=as.vector (query_symbols),
 type="hgnc_symbol", seqType="peptide", mart = mart)
## Error in loadNamespace(name): there is no package called 'biomaRt'
pep_seqs <- pep_seqs[!pep_seqs$peptide=='Sequence unavailable',]
## Error in eval(expr, envir, enclos): object 'pep_seqs' not found
pep_seqs <- pep_seqs[!duplicated(pep_seqs$hgnc_symbol),]
## Error in eval(expr, envir, enclos): object 'pep_seqs' not found
# get the upstream sequence
upstr_seqs <- biomaRt::getSequence (id=as.vector (query_symbols),
 type="hgnc_symbol", seqType="coding_gene_flank", upstream=5000, mart = mart)
## Error in loadNamespace(name): there is no package called 'biomaRt'
upstr_seqs <- upstr_seqs[!upstr_seqs$coding_gene_flank=='Sequence unavailable',]
## Error in eval(expr, envir, enclos): object 'upstr_seqs' not found
upstr_seqs <- upstr_seqs[!duplicated(upstr_seqs$hgnc_symbol),]
## Error in eval(expr, envir, enclos): object 'upstr_seqs' not found
# get the downstream sequence
downstr_seqs <- biomaRt::getSequence (id=as.vector (query_symbols),
 type="hgnc_symbol", seqType="coding_gene_flank", downstream=5000, mart=mart)
## Error in loadNamespace(name): there is no package called 'biomaRt'
downstr_seqs <- downstr_seqs[!downstr_seqs$coding_gene_flank=='Sequence unavailable',]
## Error in eval(expr, envir, enclos): object 'downstr_seqs' not found
downstr_seqs <- downstr_seqs[!duplicated(downstr_seqs$hgnc_symbol),]
## Error in eval(expr, envir, enclos): object 'downstr_seqs' not found

Some notes:

  • Biomart will usually return multiple sequences for each gene - because there are multiple sequences in the database. Some are partial, some are from abberant chromosomes, etc. That's why I deduplicate above and why you will have to filter stuff out according to your own preferences.
  • Remember, if there's no hit, nothing will be returned.
  • For reasons unclear to me, some were returned with 'Sequence unavailable'.

And here's an absolute hack. As said, sometimes Biomart can be flaky and I had a huge problem trying to download a set of several hundred sequences when the service would barf once every 100 or so. So I wrote a fallback function to retry any failed queries:

# "get_seq_fxn" does the biomart fetching and
# returns the results it gets

repeat {
   message ("trying ", ids[1])
   results <- try (get_seq_fxn())
   if (class (results) == "data.frame") return (results)
   print ("errored out")
   Sys.sleep (0.1)
}

References