An Introduction to Biopython

By: Akshatha Nayak

WHAT’S BIOPYTHON?

Biopython is a library of freely available Python tools for computational biology. It makes it easy to write Python programs for bioinformatics.

The basic functionalities provided by Biopython include:

  • Parsers for various bioinformatics file formats (BLAST output, FASTA, GenBank, etc.)
  • Access to online services (NCBI, Expasy, etc.)
  • A standard sequence class that deals with sequences, IDs on sequences, and sequence features
  • Interfaces to some common bioinformatics programs, such as BLAST from NCBI (a tool for sequence alignment) and the EMBOSS tools (tools for sequence analysis)
  • Tools for performing common operations on sequences, such as translation, transcription and weight calculations
  • Integration with BioSQL, a sequence database schema

INSTALLATION

Before installing Biopython, you need to install its prerequisites, i.e. Python and NumPy.

  • Python

  • NumPy (Numerical Python)

To install NumPy, you can use pip:

pip install numpy

  • Biopython

After installing NumPy, you can install Biopython using pip:

pip install biopython

To check that Biopython is installed properly, run this in a Python interpreter:

import Bio

If this raises an ImportError, Biopython is not installed.
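The same check can be wrapped in a small script that also reports which version is installed. This is only a minimal sketch: the helper name biopython_version is made up for this example, and it relies on the __version__ attribute that the Bio package exposes.

```python
import importlib


def biopython_version():
    """Return the installed Biopython version string, or None if it is missing.

    Hypothetical helper for illustration; it is not part of Biopython itself.
    """
    try:
        bio = importlib.import_module("Bio")
    except ImportError:
        return None
    return getattr(bio, "__version__", "unknown")


version = biopython_version()
if version is None:
    print("Biopython is not installed; try: pip install biopython")
else:
    print("Biopython version:", version)
```

Knowing the version is useful because some parts of the API have changed between releases.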

GETTING STARTED

Let’s look at some of the functionalities that make Biopython awesome!

  • Parsing various Biological file formats

Most biological data is stored in specialised file formats such as FASTA-formatted text files, GenBank-formatted text files, etc. Parsing these files into a form that can be manipulated in a programming language is a challenging task, which the parsers provided in Biopython simplify.

Suppose you have a file named “ABC.fasta” in FASTA format. You can parse the file and obtain a list of the sequences stored in it as follows:

from Bio import SeqIO

records = list(SeqIO.parse("ABC.fasta", "fasta"))

# A file in GenBank format can be parsed similarly

from Bio import SeqIO

records = list(SeqIO.parse("ABC.gbk", "genbank"))
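Each item returned by the parser is a record with an ID and a sequence. The sketch below shows one way to inspect them. So that it runs without any file on disk, it parses a small made-up FASTA text held in memory; the sequence names seq1 and seq2 and the helper summarize_fasta are assumptions for this example only.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA content standing in for a real file such as ABC.fasta
FASTA_TEXT = """>seq1 first made-up example
ATGCGTACGATACATACAGCGT
>seq2 second made-up example
GGGCCCAAATTT
"""


def summarize_fasta(handle):
    """Return a list of (record id, sequence length) pairs (illustrative helper)."""
    return [(record.id, len(record.seq)) for record in SeqIO.parse(handle, "fasta")]


summary = summarize_fasta(StringIO(FASTA_TEXT))
for seq_id, length in summary:
    print(seq_id, length)
```

With a real file, the StringIO object would simply be replaced by the file name, as in the examples above.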

  • Accessing online databases

Biopython can be used to access and download biological data from several online databases such as NCBI, Expasy, etc. For NCBI's databases, the Bio.Entrez module can be used:

>>> from Bio import Entrez
>>> from Bio import SeqIO

# provide your email address (NCBI asks for one with every request)
>>> Entrez.email = "me@email.com"

# IDs to be searched
>>> records = ["P68871", "Q96I25"]

# search the database and obtain data in FASTA format
>>> for rec_id in records:
...     handle = Entrez.efetch(db="protein", id=rec_id, rettype="fasta")
...     seqRec = SeqIO.read(handle, "fasta")
...     print(seqRec)
...     handle.close()
...
ID: P68871.2
Name: P68871.2
Description: P68871.2 RecName: Full=Hemoglobin subunit beta; AltName: Full=Beta-globin; AltName: Full=Hemoglobin beta chain; Contains: RecName: Full=LVV-hemorphin-7; Contains: RecName: Full=Spinorphin
Number of features: 0
Seq('MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDA...KYH')

ID: Q96I25.1
Name: Q96I25.1
Description: Q96I25.1 RecName: Full=Splicing factor 45; AltName: Full=45 kDa-splicing factor; AltName: Full=RNA-binding motif protein 17
Number of features: 0
Seq('MSLYDDLGVETSDSKTEGWSKNFKLLQSQLQVKKAALTQAKSQRTKQSTVLAPV...EQV')
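Once records have been downloaded, they can be saved in FASTA format with SeqIO.write. The sketch below is only an illustration: to avoid a network request it builds two short records by hand (the IDs demo1 and demo2 and their descriptions are made up) and writes them to an in-memory buffer rather than to a file.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hand-made stand-ins for records that would normally come from Entrez.efetch
records = [
    SeqRecord(Seq("MVHLTPEEKSAV"), id="demo1", description="made-up example record"),
    SeqRecord(Seq("MSLYDDLGVETS"), id="demo2", description="made-up example record"),
]

buffer = StringIO()  # an open file handle would work the same way
written = SeqIO.write(records, buffer, "fasta")  # returns the number of records written
fasta_text = buffer.getvalue()

print(written)
print(fasta_text)
```

To save to disk instead, a file name such as "proteins.fasta" (again, just an example name) can be passed in place of the buffer.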



  • Sequence objects and common operations

Biopython has sequence objects that are basically strings of letters. We can perform operations such as indexing, calculating the length, iterating through the characters, slicing, etc., just as we do with Python strings.

>>> from Bio.Seq import Seq

# Biopython 1.78 and later no longer take an alphabet argument
# (the Bio.Alphabet module was removed), so the string alone is enough
>>> my_seq = Seq("ATGCGTACGATACATACAGCGT")

>>> len(my_seq)        # length of sequence
22

>>> my_seq.count("A")    # count occurrences of a character
7

>>> my_seq[3:7]        # slicing the sequence
Seq('CGTA')

>>> for letter in my_seq[:5]:
...     print(letter)    # iterating through characters
...
A
T
G
C
G

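Although Seq objects in Biopython and standard Python strings have some similarities, there are two major differences.

  • Seq objects have additional, biology-specific methods such as complement(), reverse_complement(), transcribe(), translate(), etc.
  • The letters carry biological meaning: a Seq object could denote a protein sequence, or a DNA sequence.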
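The contrast with plain strings can be made concrete with a short sketch. It assumes a recent Biopython release in which Seq is created from a string alone; the variable names are just for illustration.

```python
from Bio.Seq import Seq

my_seq = Seq("ATGCGTACGATACATACAGCGT")
plain_text = str(my_seq)  # convert the Seq object into an ordinary Python string

# Both objects support the usual string-like operations
print(len(my_seq) == len(plain_text))
print(plain_text[:5])

# Only the Seq object knows about biology-specific operations
seq_has_method = hasattr(my_seq, "reverse_complement")
str_has_method = hasattr(plain_text, "reverse_complement")
print(seq_has_method, str_has_method)
```

Converting with str() is handy when a sequence has to be passed to code that expects ordinary text, for example when writing it to a custom report.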

Some of the biologically relevant operations that can be performed using Biopython methods are demonstrated below.

>>> from Bio.Seq import Seq
>>> from Bio.SeqUtils import gc_fraction

>>> my_seq = Seq("ATGCGTACGATACATACAGCGT")

# calculating GC content of the DNA sequence, as a percentage
# (gc_fraction returns a value between 0 and 1; it replaces the older GC function)
>>> round(gc_fraction(my_seq) * 100, 2)
45.45

# complement of DNA sequence
>>> my_seq.complement()
Seq('TACGCATGCTATGTATGTCGCA')

# reverse complement of DNA
>>> my_seq.reverse_complement()
Seq('ACGCTGTATGTATCGTACGCAT')

# simulating biological DNA strands
>>> coding_dna = my_seq
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('ACGCTGTATGTATCGTACGCAT')

# transcription process (DNA -> mRNA)
>>> messenger_rna = template_dna.reverse_complement().transcribe()
>>> messenger_rna
Seq('AUGCGUACGAUACAUACAGCGU')

# translation process (mRNA -> protein)
# (the sequence has 22 bases, so Biopython warns about the partial trailing codon)
>>> protein = messenger_rna.translate()
>>> protein
Seq('MRTIHTA')
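To make explicit what these operations compute, here is a rough pure-Python sketch of GC content, complement, reverse complement and transcription. It is not how Biopython implements them; the function names are made up for this example, and only the four unambiguous DNA letters are handled.

```python
# Lookup table pairing each DNA base with its complement (A<->T, G<->C)
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def gc_percent(dna):
    """Percentage of bases that are G or C (illustrative helper)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)


def complement(dna):
    """Swap every base for its pairing partner."""
    return dna.upper().translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(coding_dna):
    """Turn a coding DNA strand into mRNA by replacing T with U."""
    return coding_dna.upper().replace("T", "U")


dna = "ATGCGTACGATACATACAGCGT"
print(f"{gc_percent(dna):.2f}")
print(complement(dna))
print(reverse_complement(dna))
print(transcribe(dna))
```

In practice the Biopython methods are preferable, since they also cope with ambiguous bases and RNA; the sketch is only meant to show the idea behind each operation.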

Apart from these, Biopython offers many other features. So, if you are interested in bioinformatics and love to program in Python, Biopython is the perfect choice for you!
