Data Analysis of Brain Cancer With Biopython
ISSN No:-2456-2165
Abstract:- Biopython, an open-source set of Python tools for biological computation, was first published in 2000 by Brad Chapman and Jeff Chang. Biopython's features include parsers for bioinformatics file formats, access to online bioinformatics databases, interfaces to common programs, a standard sequence class, etc. Glioblastoma is one of the most aggressive (grade IV) types of brain cancer and accounts for 15% of all brain tumors. Genetic alterations in glioblastoma include EGFR and PDGFR amplification, TERT promoter mutation, alterations in TP53, NF1, PTEN and RB, loss of chromosome arm 10q, and aberrations in RTK/Ras/PI3K signaling pathways. Pathway maps were used to understand the molecular interaction and reaction network in glioma. Multiple sequence alignment tools help us analyze the areas of similarity and the evolutionary relationships between sequences. Using Biopython tools, we perform the analysis of the nucleotide sequences. This study introduces the application of a brain tumor detection algorithm using machine learning techniques.

Keywords:- Brain Tumor, Computer Vision, Structure Analysis, Sequence Alignment, Deep Learning, Biopython.

I. INTRODUCTION

In the late 1980s, Guido van Rossum started working on Python and first published it in 1991 as Python 0.9.0.[1] Development of Biopython began in 1999, and it was first published in 2000 by Brad Chapman and Jeff Chang.[2] Python is a high-level programming language used extensively in industry and academia and available on all major operating systems. It offers a simple syntax, object-oriented programming and a wide array of libraries.[3] Biopython is a member project of the Open Bioinformatics Foundation (OBF), which hosts the Biopython web site, source code repository, bug tracking database and mailing lists. The OBF also supports related projects such as BioPerl[4], BioJava[5], BioRuby and BioSQL.

Biopython is an open-source collection of Python tools for biological computation, created by an international team of developers.[2, 6] The main reason for developing Biopython is to make bioinformatics easier for Python users by providing high-quality, reusable modules and classes for complex bioinformatics problems. Biopython's features include the ability to parse various bioinformatics file formats (BLAST, ClustalW, FASTA, GenBank, PubMed, ExPASy, SCOP, KEGG, UniGene, and SwissProt); access to online bioinformatics databases (NCBI and ExPASy); interfaces to common programs (the ClustalW alignment program, standalone BLAST from NCBI, and command-line tools from EMBOSS); a standard sequence class (dealing with sequences, sequence IDs and sequence features); tools for performing common operations on sequences (translation, transcription, and weight calculations); the Bio.Motif module, which provides analysis of sequence motifs (searching, comparing, and de novo learning);[2, 6] and the Bio.Phylo module, used for the visualization of phylogenetic trees.[7]

A brain tumor is an abnormal growth of cells that has formed in the brain. Tumors can form in the brain or in other parts of the central nervous system (CNS), such as the spine or cranial nerves. The brain controls most bodily functions, including awareness, movement, sensation, thought, speech, and memory. Tumors can affect these functions and alter the brain’s ability to operate properly.[8, 9] There are more than 120 different types of brain tumor, classified by the tissue they arise from. Brain tumors can be cancerous or non-cancerous (benign), but even non-cancerous tumors can be harmful because of their size and location.[10] Tumors that arise in the brain are called primary brain tumors, and cancers that metastasize from other parts of the body to the brain are secondary brain tumors. Brain tumors can also be classified by histological grade (I-IV) and molecular markers. Tumor diagnosis should be “layered” as histological classification, WHO grade, and molecular information and reported as an “integrated diagnosis.”[11]

Gliomas are a type of brain tumor whose cells look like glial cells.[12] The most common type of malignant glioma is glioblastoma (grade IV), which accounts for 15% of all brain tumors.[13] Glioblastoma is one of the most aggressive types of brain cancer because it arises from astrocytes, cells that support nerve cells and regulate the amount of blood that reaches them; access to a large number of blood vessels helps the cancer cells grow and spread rapidly.[14] Another reason for the aggressiveness of glioblastoma is its high recurrence rate. This is because the tumor contains glioma stem cells (GSCs), a type of self-regenerating cancer stem cell that controls the growth of tumors. In a previous study, Subhas Mukherjee and colleagues found high levels of the enzyme cyclin-dependent kinase 5 (CDK5) in GSCs and showed that blocking this enzyme inhibits the ability of GSCs to self-regenerate.[15] The cause of some glioblastoma cases is unknown. Some uncommon risk factors include genetic disorders, previous radiation therapy,[16, 17] and association with viruses (SV40,[18] HHV-6,[19, 20] and cytomegalovirus[21]). Common genetic alterations in glioblastoma include amplification of epidermal growth factor receptor (EGFR) and platelet-derived growth factor receptor (PDGFR); telomerase reverse transcriptase (TERT) promoter mutation; alteration
Clustal Omega is a multiple sequence alignment tool that generates alignments between three or more sequences using seeded guide trees and HMM profile-profile techniques. In Figure 2, we observe many variations among the nucleotide sequences. Gaps represent deletions in the sequences, and an asterisk marks a fully conserved column of the alignment.
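To make the asterisk annotation concrete, the following is a minimal sketch of how a conservation line could be derived from an alignment. It is not Clustal Omega's own implementation: the function name `conservation_line` and the toy alignment are hypothetical, and only the fully conserved marker (`*`) is reproduced.

```python
def conservation_line(aligned):
    """Return '*' for each fully conserved column and ' ' for every other column."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # Fully conserved: one distinct residue in the column and no gap character.
        if len(residues) == 1 and "-" not in residues:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


# Hypothetical toy alignment, not the sequences shown in Figure 2.
toy_alignment = ["ATG-CA", "ATGTCA", "ATG-CT"]
for row in toy_alignment:
    print(row)
print(conservation_line(toy_alignment))
```

In this toy case the fourth column contains gaps and the sixth contains a substitution, so neither receives an asterisk.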
In Figure 3, the values shown in the tree are the branch lengths, which indicate the evolutionary distance between the sequences; i.e., a larger number represents a larger amount of genetic change.
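As an illustration of what "evolutionary distance" can mean numerically, the sketch below computes the uncorrected p-distance (the proportion of differing sites) between two aligned sequences. This is a simplified stand-in chosen for illustration; the distances in Figure 3 may come from a different model, and the function name `p_distance` and the example sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # gaps carry no substitution information in this measure
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


# Hypothetical aligned sequences: one mismatch in four compared sites.
print(p_distance("ATGC", "ATGA"))
```

A larger value means that more sites differ, which matches the reading of longer branches as more genetic change.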
An open reading frame (ORF) search identifies all possible protein-coding regions in a sequence. There are three possible reading frames in each direction of a DNA sequence, i.e., six possible reading frames in total (six horizontal bars) for every DNA sequence: +1, +2 and +3 on the forward strand and -1, -2 and -3 on the reverse strand. An asterisk (*) represents a stop codon, whereas M marks a start codon. In Figures 4 and 5, the result displays all six possible reading frames of the entered query sequence. The ORFs are listed according to their size, together with a graphical representation of the sequence. The selected ORF is ORF1, in the +1 reading frame on the forward strand. ORF1 has a nucleotide length of 96 and a protein length of 31; its start codon is at position 169 and its stop codon at position 264. The longest ORF is ORF6, in the -3 reading frame on the reverse strand, with a nucleotide length of 360 and a protein length of 119; its start codon is at position 360 and its stop codon at position >1.
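The six-frame search and the length arithmetic above can be sketched in plain Python as follows. This is an illustrative sketch, not the ORF finder used to produce Figures 4 and 5: the names `find_orfs`, `protein_length` and `min_nt` are hypothetical, only ATG is treated as a start codon, and reverse-strand coordinates and frame labels are given relative to the reverse complement, which may differ from the tool's convention. Biopython's `Seq` objects offer `reverse_complement()` and `translate()` methods that could replace the hand-written helper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def protein_length(nt_length):
    """Amino acids encoded by an ORF whose nucleotide length includes the stop codon."""
    return nt_length // 3 - 1


def find_orfs(dna, min_nt=6):
    """List (frame, start, end, nt_length, protein_length) for ATG...stop ORFs.

    Positions are 1-based on the strand being read; min_nt is a hypothetical
    cut-off for the shortest ORF to report.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    nt_length = i + 3 - start
                    if nt_length >= min_nt:
                        frame = f"{strand}{offset + 1}"
                        orfs.append((frame, start + 1, i + 3, nt_length,
                                     protein_length(nt_length)))
                    start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence with one short ORF in frame +3.
print(find_orfs("CCATGAAATAGCC"))

# Consistency check on the values reported in the text.
print(264 - 169 + 1, protein_length(96), protein_length(360))
```

The last line reproduces the reported figures: positions 169 to 264 span 96 nucleotides, which is 32 codons including the stop codon and therefore 31 amino acids; likewise 360 nucleotides give 119 amino acids.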
The CDD result provides various display options and information for the conserved domains that align to the query sequence. In Figure 6, the graphical summary displays the standard result, which shows the best-scoring domain model from each source database. Small triangles in the figure indicate specific amino acids involved in conserved features such as binding and catalytic sites. Specific hits are those RPS-BLAST hits whose e-value is equal to or lower than the domain-specific threshold e-value. A specific hit denotes a high-confidence association between the query sequence and a conserved domain, i.e., the query sequence belongs to the same protein family. A superfamily is a set of conserved domain models from different families that annotate the same protein sequences, and it provides structural, functional and evolutionary information for proteins.
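The specific-hit rule described above (an e-value at or below a domain-specific threshold) can be written as a one-line comparison. The sketch below follows the rule as stated in this text only; the function name `classify_hit`, the domain labels and all numeric values are hypothetical and are not taken from Figure 6.

```python
def classify_hit(evalue, threshold):
    """Label a hit 'specific' when its e-value does not exceed the domain threshold."""
    return "specific" if evalue <= threshold else "non-specific"


# Hypothetical hits: (domain label, hit e-value, domain-specific threshold e-value).
example_hits = [
    ("domain_A", 1e-30, 1e-10),
    ("domain_B", 1e-04, 1e-10),
]
for name, evalue, threshold in example_hits:
    print(name, classify_hit(evalue, threshold))
```

Lower e-values indicate stronger matches, which is why the comparison uses "less than or equal to" rather than "greater than".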