An Introduction to Biopython
Biopython is a library of freely available Python tools for computational biology. It makes it easy to write Python programs for bioinformatics.
The basic functionality provided by Biopython includes:
- Parsers for various bioinformatics file formats (BLAST output, FASTA, GenBank, etc.)
- Access to online services (NCBI, ExPASy, etc.)
- A standard sequence class that deals with sequences, IDs on sequences, and sequence features
- Interfaces to common bioinformatics programs such as NCBI BLAST (a sequence alignment tool) and the EMBOSS tools (for sequence analysis)
- Tools for performing common operations on sequences, such as translation, transcription, and weight calculations
- Integration with BioSQL, a sequence database schema
Before installing Biopython, you need to install the prerequisites, i.e. Python and NumPy.
NumPy (Numerical Python)
To install NumPy, you can use pip:
pip install numpy
After installing NumPy, you can install Biopython using pip:
pip install biopython
To check whether Biopython is installed properly, try importing the Bio package in a Python session:
import Bio
If this gives an error, Biopython is not installed.
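The check can also be scripted. Below is a minimal sketch that uses only the standard library; the helper name `biopython_status` is made up for this example and is not part of Biopython.

```python
import importlib
import importlib.util


def biopython_status():
    """Return a short message saying whether the Bio package can be imported."""
    # find_spec returns None when no top-level package named "Bio" is found
    if importlib.util.find_spec("Bio") is None:
        return "Biopython is not installed"
    bio = importlib.import_module("Bio")
    # Bio.__version__ holds the installed release string
    return "Biopython " + bio.__version__ + " is installed"


print(biopython_status())
```

Because the function returns a message instead of raising, it can be dropped into a setup script without a try/except around it.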
Let’s look at some of the features that make Biopython awesome!
- Parsing various biological file formats
Most biological data is stored in special file formats, such as FASTA-formatted and GenBank-formatted text files. Parsing these files into a form that can be manipulated in a programming language is a challenging task, and the parsers provided in Biopython simplify it.
Suppose you have a file named “ABC.fasta” in FASTA format. You can parse it and obtain a list of the sequences stored in the file as follows:
    from Bio import SeqIO

    # parse a file in FASTA format
    records = list(SeqIO.parse("ABC.fasta", "fasta"))

    # a file in GenBank format can be parsed similarly
    records = list(SeqIO.parse("ABC.gbk", "genbank"))
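To see what the parser returns without needing a file on disk, here is a small sketch that parses FASTA text held in memory. The two records are made up for illustration; with a real file you would pass the file name as in the snippet above.

```python
from io import StringIO

from Bio import SeqIO

# Two short made-up records standing in for the contents of a FASTA file
fasta_text = """>seq1 first example record
ATGCGTACGATACATACAGCGT
>seq2 second example record
GGGTTTAAACCC
"""

# SeqIO.parse accepts an open handle as well as a file name
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # record.id is the first word of the header line; record.seq is a Seq object
    print(record.id, len(record.seq), record.seq)
```

Each item in the list is a SeqRecord, so the ID, description, and sequence of every entry are available as attributes.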
- Accessing online databases
Biopython can be used to access and download biological data from several databases, such as NCBI and ExPASy. The Bio.Entrez module can be used to query the NCBI databases.
    from Bio import Entrez
    from Bio import SeqIO

    # provide your email address
    Entrez.email = "email@example.com"

    # IDs to be searched
    records = ["P68871", "Q96I25"]

    # search the database and obtain data in FASTA format
    for rec_id in records:
        handle = Entrez.efetch(db="protein", id=rec_id, rettype="fasta")
        seqRec = SeqIO.read(handle, "fasta")
        print(seqRec)
        handle.close()

Output:

    ID: P68871.2
    Name: P68871.2
    Description: P68871.2 RecName: Full=Hemoglobin subunit beta; AltName: Full=Beta-globin; AltName: Full=Hemoglobin beta chain; Contains: RecName: Full=LVV-hemorphin-7; Contains: RecName: Full=Spinorphin
    Number of features: 0
    Seq('MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDA...KYH', SingleLetterAlphabet())
    ID: Q96I25.1
    Name: Q96I25.1
    Description: Q96I25.1 RecName: Full=Splicing factor 45; AltName: Full=45 kDa-splicing factor; AltName: Full=RNA-binding motif protein 17
    Number of features: 0
    Seq('MSLYDDLGVETSDSKTEGWSKNFKLLQSQLQVKKAALTQAKSQRTKQSTVLAPV...EQV', SingleLetterAlphabet())

- Sequence objects and common operations
Biopython has sequence objects that are basically strings of letters. We can perform operations such as indexing, calculating the length, iterating through the characters, and slicing, just like we do with Python strings.

    >>> from Bio.Seq import Seq
    >>> # Bio.Alphabet was removed in Biopython 1.78; on newer releases,
    >>> # drop this import and the alphabet argument below
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("ATGCGTACGATACATACAGCGT", IUPAC.unambiguous_dna)
    >>> len(my_seq)  # length of sequence
    22
    >>> my_seq.count("A")  # count occurrences of a character
    7
    >>> my_seq[3:7]  # slicing the sequence
    Seq('CGTA', IUPACUnambiguousDNA())
    >>> for letter in my_seq[:5]:
    ...     print(letter)  # iterating through characters
    ...
    A
    T
    G
    C
    G
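Although Seq objects in Biopython and standard Python strings have some similarities, there are two major differences.
- Seq objects provide biology-specific methods, such as complement and translate, that plain strings do not have.
- A Seq object carries an alphabet (IUPAC.unambiguous_dna in the example above), which records whether the object denotes a protein sequence or a DNA sequence.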
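A small sketch of the contrast between a Seq object and a plain string. The sequence is made up for illustration, and the alphabet argument is left out so that the snippet should also run on Biopython 1.78 and later, where Bio.Alphabet no longer exists.

```python
from Bio.Seq import Seq

dna = Seq("ATGCGTACG")
text = "ATGCGTACG"

# String-like behaviour is shared
print(len(dna) == len(text))         # True
print(str(dna[:3]) == text[:3])      # True

# Sequence-specific methods exist only on the Seq object
print(hasattr(dna, "reverse_complement"))    # True
print(hasattr(text, "reverse_complement"))   # False

# str() gives back a plain Python string when one is needed
plain = str(dna)
print(type(plain).__name__)          # str
```

Converting with str() is handy when passing a sequence to code that expects an ordinary string, for example when writing it to a file.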
Some of the biologically relevant operations that can be performed using Biopython methods are demonstrated below.

    >>> # written against an older Biopython: Bio.Alphabet was removed in 1.78,
    >>> # and Bio.SeqUtils.GC was later replaced by gc_fraction
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> from Bio.SeqUtils import GC
    >>> my_seq = Seq("ATGCGTACGATACATACAGCGT", IUPAC.unambiguous_dna)
    >>> # calculating GC content of the DNA sequence
    >>> GC(my_seq)
    45.45454545454545
    >>> # complement of DNA sequence
    >>> my_seq.complement()
    Seq('TACGCATGCTATGTATGTCGCA', IUPACUnambiguousDNA())
    >>> # reverse complement of DNA
    >>> my_seq.reverse_complement()
    Seq('ACGCTGTATGTATCGTACGCAT', IUPACUnambiguousDNA())
    >>> # simulating biological DNA strands
    >>> coding_dna = my_seq
    >>> template_dna = coding_dna.reverse_complement()
    >>> template_dna
    Seq('ACGCTGTATGTATCGTACGCAT', IUPACUnambiguousDNA())
    >>> # transcription process (DNA -> mRNA)
    >>> messenger_rna = template_dna.reverse_complement().transcribe()
    >>> messenger_rna
    Seq('AUGCGUACGAUACAUACAGCGU', IUPACUnambiguousRNA())
    >>> # translation process (mRNA -> protein)
    >>> protein = messenger_rna.translate()
    >>> protein
    Seq('MRTIHTA', IUPACProtein())
Apart from these, Biopython offers many other features. So, if you are interested in bioinformatics and love to program in Python, Biopython is the perfect choice for you!