Biopython

Authors:	Tom Dunham
Date:	2009-03-31

BioPython

Types to reprisent sequences (Seq and MutableSeq)
Parsers for several file formats (including GenBank, EMBL, Fasta)
Sequence alignment
Population genetics
Clustering
BioSQL

Sequences

>>> from Bio.Seq import Seq
>>> seq = Seq("AGTACACTGGT")
>>> seq
Seq('AGTACACTGGT', Alphabet())
>>> print seq
AGTACACTGGT

Sequences

Sequences behave a lot like strings

>>> seq[:2]
Seq('AG', Alphabet())
>>> print seq[:2]
AG
>>> seq.count("A")
3
>>> Seq("ATCCGG") + Seq("TTGGAATC")
Seq('ATCCGGTTGGAATC', Alphabet())

Sequences != Strings

You can specify a specific alphabet:

>>> from Bio.Alphabet import IUPAC
>>> seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())

Sequence Methods

>>> seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> seq.complement()
Seq('TCATGTGACCA', IUPACUnambiguousDNA())
>>> seq.reverse_complement()
Seq('ACCAGTGTACT', IUPACUnambiguousDNA())

Notice that the alphabet does not change

Transcription

>>> seq = Seq('ATGAAGACAG', IUPAC.unambiguous_dna)
>>> seq
Seq('ATGAAGACAG', IUPACUnambiguousDNA())
>>> seq.transcribe()
Seq('AUGAAGACAG', IUPACUnambiguousRNA())

Notice the alphabet changes

Mutable Sequences

>>> seq
Seq('ATGAAGACAG', IUPACUnambiguousDNA())
>>> mseq = seq.tomutable()
>>> mseq
MutableSeq('ATGAAGACAG', IUPACUnambiguousDNA())

>>> mseq.reverse_complement()
>>> mseq
MutableSeq('CTGTCTTCAT', IUPACUnambiguousDNA())

Notice that this behaviour means you cannot substitute Seq and MutableSeq objects.

If you need to write functions that can manipulate Seq and MutableSeq objects without regard to their type, you can use the reverse_complement, transcribe, back_transcribe, translate functions from Bio.Seq.

Parsers

Parsers yield SeqRecord objects:

>>> from Bio import SeqIO
>>> p = SeqIO.parse(open(r"wgs_aacr_pro.dat", "rb"), "embl")
>>> p.next()
SeqRecord(seq=Seq('TG...GCA', IUPACAmbiguousDNA()),
          id='AACR01000001.1',
          name='AACR01000001',
          description='Pasteuria nishizawae str. North American PnPst10, whole genome shotgun sequence.',
          dbxrefs=[])

Documentation for SeqRecord http://biopython.org/wiki/SeqRecord

SeqRecord

>>> rec = p.next()

>>> print rec
ID: AACR01000002.1
Name: AACR01000002
Description: Pasteuria nishizawae str. North American PnEco4A, whole genome shotgun sequence.
Number of features: 1
/taxonomy=['Bacteria', 'Firmicutes', 'Bacillales', 'Pasteuriaceae', 'Pasteuria']
/references=[<Bio.SeqFeature.Reference instance at 0x0155FAA8>]
/accessions=['AACR01000002', 'AACR01000000', 'AACR00000000']
/data_file_division=PRO
/organism=Pasteuria nishizawae str. North American
/sequence_version=1
Seq('AATTCGTGGCGGGGCAAACGGGTTCATTATAAAAATTGTCGGAACGAGTTCAAC...ATT', IUPACAmbiguousDNA())

SeqRecord fields

>>> rec.seq
Seq('AATTCGTGGCGGGGCAAA...ATT', IUPACAmbiguousDNA())
>>> len(rec)
1124
>>> len(rec.seq)
1124

SeqRecord fields

>>> rec.name
'AACR01000002'
>>> len(rec.features)
1
>>> rec.features[0].type
'source'
>>> rec.features[0].strand
1

Annotations

>>> pprint(rec.annotations)
{'accessions': ['AACR01000002', 'AACR01000000', 'AACR00000000'],
 'data_file_division': 'PRO',
 'organism': 'Pasteuria nishizawae str. North American',
 'references': [<Bio.SeqFeature.Reference instance at 0x0155FAA8>],
 'sequence_version': 1,
 'taxonomy': ['Bacteria',
              'Firmicutes',
              'Bacillales',
              'Pasteuriaceae',
              'Pasteuria']}

Writing SeqRecords

from Bio import SeqIO
in_file = SeqIO.parse(open("infile"), "embl")
out_file = open(sys.argv[2], "w")
SeqIO.write(in_file, out_file, "fasta")

write will accept lists or iterators.

Excersizes

See handout

1. Write a program that takes the name of a EMBL file and calculates some statstics on the lengths of the sequences within it - start with min and max.

2. Write a program that takes the name of a EMBL file and prints it in fasta format.