| Authors: | Tom Dunham |
|---|---|
| Date: | 2009-03-31 |
>>> from Bio.Seq import Seq
>>> seq = Seq("AGTACACTGGT")
>>> seq
Seq('AGTACACTGGT', Alphabet())
>>> print seq
AGTACACTGGT
Sequences behave a lot like strings
>>> seq[:2]
Seq('AG', Alphabet())
>>> print seq[:2]
AG
>>> seq.count("A")
3
>>> Seq("ATCCGG") + Seq("TTGGAATC")
Seq('ATCCGGTTGGAATC', Alphabet())
You can specify a specific alphabet:
>>> from Bio.Alphabet import IUPAC
>>> seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> seq.complement()
Seq('TCATGTGACCA', IUPACUnambiguousDNA())
>>> seq.reverse_complement()
Seq('ACCAGTGTACT', IUPACUnambiguousDNA())
Notice that the alphabet does not change
>>> seq = Seq('ATGAAGACAG', IUPAC.unambiguous_dna)
>>> seq
Seq('ATGAAGACAG', IUPACUnambiguousDNA())
>>> seq.transcribe()
Seq('AUGAAGACAG', IUPACUnambiguousRNA())
Notice the alphabet changes
>>> seq
Seq('ATGAAGACAG', IUPACUnambiguousDNA())
>>> mseq = seq.tomutable()
>>> mseq
MutableSeq('ATGAAGACAG', IUPACUnambiguousDNA())
>>> mseq.reverse_complement()
>>> mseq
MutableSeq('CTGTCTTCAT', IUPACUnambiguousDNA())
Notice that this behaviour means you cannot substitute Seq and MutableSeq objects.
Parsers yield SeqRecord objects:
>>> from Bio import SeqIO
>>> p = SeqIO.parse(open(r"wgs_aacr_pro.dat", "rb"), "embl")
>>> p.next()
SeqRecord(seq=Seq('TG...GCA', IUPACAmbiguousDNA()),
id='AACR01000001.1',
name='AACR01000001',
description='Pasteuria nishizawae str. North American PnPst10, whole genome shotgun sequence.',
dbxrefs=[])
>>> rec = p.next()
>>> print rec
ID: AACR01000002.1
Name: AACR01000002
Description: Pasteuria nishizawae str. North American PnEco4A, whole genome shotgun sequence.
Number of features: 1
/taxonomy=['Bacteria', 'Firmicutes', 'Bacillales', 'Pasteuriaceae', 'Pasteuria']
/references=[<Bio.SeqFeature.Reference instance at 0x0155FAA8>]
/accessions=['AACR01000002', 'AACR01000000', 'AACR00000000']
/data_file_division=PRO
/organism=Pasteuria nishizawae str. North American
/sequence_version=1
Seq('AATTCGTGGCGGGGCAAACGGGTTCATTATAAAAATTGTCGGAACGAGTTCAAC...ATT', IUPACAmbiguousDNA())
>>> rec.seq
Seq('AATTCGTGGCGGGGCAAA...ATT', IUPACAmbiguousDNA())
>>> len(rec)
1124
>>> len(rec.seq)
1124
>>> rec.name 'AACR01000002' >>> len(rec.features) 1 >>> rec.features[0].type 'source' >>> rec.features[0].strand 1
>>> pprint(rec.annotations)
{'accessions': ['AACR01000002', 'AACR01000000', 'AACR00000000'],
'data_file_division': 'PRO',
'organism': 'Pasteuria nishizawae str. North American',
'references': [<Bio.SeqFeature.Reference instance at 0x0155FAA8>],
'sequence_version': 1,
'taxonomy': ['Bacteria',
'Firmicutes',
'Bacillales',
'Pasteuriaceae',
'Pasteuria']}
from Bio import SeqIO
in_file = SeqIO.parse(open("infile"), "embl")
out_file = open(sys.argv[2], "w")
SeqIO.write(in_file, out_file, "fasta")
write will accept lists or iterators.
See handout
1. Write a program that takes the name of a EMBL file and calculates some statstics on the lengths of the sequences within it - start with min and max.
2. Write a program that takes the name of a EMBL file and prints it in fasta format.