Cleaning biosequences

A simple script to check and purge sequence files of possible problems.

Sometimes you need sequences that are unambiguous (i.e. only 'ACGT', with no gaps), whether because of the limitations or assumptions of a tool (like omegaMap) or just because you want to know where SNPs or sequencing ambiguities are. This script reads in sequence files of any format, reports the location and nature of ambiguous characters, and optionally corrects these from a consensus sequence and saves the result to a FASTA file.
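As a rough illustration of the checking step, the sketch below scans one sequence and lists every character that is not 'ACGT'. It is not taken from checkseqs.rb: the method name ambiguous_sites is hypothetical, and the 1-based positions and case-insensitive matching are assumptions.

```ruby
# Hypothetical helper, not the code in checkseqs.rb.
# Returns [position, character] pairs for every character outside ACGT.
# Assumptions: positions are 1-based and matching ignores case.
def ambiguous_sites(seq)
  sites = []
  seq.each_char.with_index(1) do |char, pos|
    sites << [pos, char] unless "ACGT".include?(char.upcase)
  end
  sites
end

# Report the location and nature of each problem character.
ambiguous_sites("ACGTNAC-GR").each do |pos, char|
  puts "position #{pos}: '#{char}'"
end
```

Here gaps ('-') and IUPAC ambiguity codes such as 'N' or 'R' are all treated the same way: anything that is not one of the four bases is reported.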

Usage is:

checkseqs [options] INFILE [INFILE, INFILE ...]

where options are:

  • --repair-with-conc: patch ambiguous characters with the consensus sequence and save
  • --overwrite: newly created files can write over pre-existing ones

Consensus is calculated on a 50% threshold.
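One plausible reading of the 50% threshold, sketched below, is that a base is called at a column only when it appears in more than half of the sequences, and the column is otherwise left as 'N'. This is an assumption about the rule, not a description of what checkseqs.rb does; the method names consensus and repair are hypothetical, and the input is assumed to be an alignment of equal-length sequences.

```ruby
UNAMBIGUOUS = "ACGT"

# Hypothetical sketch: call a base at each column if it occurs in strictly
# more than `threshold` of all sequences, otherwise emit "N".
# Assumes every sequence has the same length (i.e. an alignment).
def consensus(seqs, threshold = 0.5)
  length = seqs.first.length
  (0...length).map do |i|
    counts = Hash.new(0)
    seqs.each do |s|
      c = s[i].upcase
      counts[c] += 1 if UNAMBIGUOUS.include?(c)
    end
    base, count = counts.max_by { |_, n| n }
    (count && count > threshold * seqs.size) ? base : "N"
  end.join
end

# Replace every non-ACGT character with the consensus character at that site.
# If the consensus itself is "N" there, the site remains ambiguous.
def repair(seq, cons)
  seq.each_char.with_index.map do |char, i|
    UNAMBIGUOUS.include?(char.upcase) ? char : cons[i]
  end.join
end

seqs = ["ACGTA", "ACNTA", "ACGTT", "AGGTA"]
cons = consensus(seqs)
puts cons                   # ACGTA
puts repair("ACNTA", cons)  # ACGTA
```

Under this reading a column with no majority base cannot be repaired, which fits the caveat that patching from a consensus can hide real variation.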

The usual caveats apply: this is a quick hack with little error-checking. Overenthusiastic application may mask real sequence problems.

The file: checkseqs.rb