Unit 7 (Application of Bioinformatics in Agriculture)
105 FoCARS
Foundation Course For Agricultural Research Service
Digital Repository of
Course Materials
Support Team
P. Krishnan and P. Namdev
APPLICATION OF BIOINFORMATICS
IN AGRICULTURE
M.Balakrishnan1
Introduction
Bioinformatics has evolved into a full-fledged multidisciplinary subject
that integrates developments in information and computer technology as
applied to biotechnology and the biological sciences. Bioinformatics uses
computer software tools for database creation, data management, data
warehousing, data mining and global communication networking.
Bioinformatics is the recording, annotation, storage, analysis, and
searching/retrieval of nucleic acid sequence (genes and RNAs), protein
sequence and structural information. This includes databases of the
sequences and structural information as well as methods to access, search,
visualize and retrieve the information. Bioinformatics concerns the creation
and maintenance of databases of biological information whereby
researchers can both access existing information and submit new entries.
Functional genomics, biomolecular structure, proteome analysis, cell
metabolism, biodiversity, downstream processing in chemical engineering,
and drug and vaccine design are some of the areas in which bioinformatics
is an integral component.
Bioinformatics, which came of age with the Human Genome Project (HGP),
brings together the fields of life science, computer science and statistics
and strives to understand medical and biological systems by the creative
application of statistics and computer analysis. Bioinformatics is the use
of computer technology to help scientists keep track of the genetic
information they find. Using computers, researchers can gather, store,
analyse and compare biological data with great speed and accuracy. Imagine
studying gene structures without the help of a computer. It would take many
years to compare the more than 25,000 genes of Arabidopsis to the genes of
a similar plant, and keeping track of the roughly 20,000 protein-coding
genes of a human being would be inconceivable. With
1
Principal Scientist, ICM Division, NAARM
105th FOCARS
Importance of Bioinformatics
In order to study how normal cellular activities are altered in different
disease states, the biological data must be combined to form a
comprehensive picture of these activities. Therefore, the field of
bioinformatics has evolved such that the most pressing task now involves
the analysis and interpretation of various types of data. This includes
nucleotide and amino acid sequences, protein domains, and protein
structures. The actual process of analyzing and interpreting data is referred
to as computational biology. Important sub-disciplines within
bioinformatics and computational biology include:
The development and implementation of tools that enable efficient
access to, use and management of, various types of information.
The development of new algorithms (mathematical formulas) and
statistics with which to assess relationships among members of
large data sets. For example, methods to locate a gene within a
sequence, predict protein structure and/or function, and cluster
protein sequences into families of related sequences.
National Academy of Agricultural Research Management
formal and practical problems arising from the management and analysis
of biological data.
Over the past few decades rapid developments in genomic and other
molecular research technologies and developments in information
technologies have combined to produce a tremendous amount of
information related to molecular biology. Bioinformatics is the name given
to these mathematical and computing approaches used to glean
understanding of biological processes.
Approaches
Common activities in bioinformatics include mapping and analyzing DNA
and protein sequences, aligning different DNA and protein sequences to
compare them, and creating and viewing 3-D models of protein structures.
There are two fundamental ways of modeling a biological system (e.g., a
living cell), both of which come under bioinformatics approaches.
Static
Sequences – Proteins, Nucleic acids and Peptides
Interaction data among the above entities including
microarray data and Networks of proteins, metabolites
Dynamic
Structures – Proteins, Nucleic acids, Ligands (including
metabolites and drugs) and Peptides (structures studied
with bioinformatics tools are no longer considered static,
and their dynamics are often the core of the structural
studies)
Systems Biology comes under this category including
reaction fluxes and variable concentrations of metabolites
Multi-Agent Based modeling approaches capturing cellular
events such as signaling, transcription and reaction
dynamics
A broad sub-category under bioinformatics is structural bioinformatics.
Roles of Bioinformatics
Bioinformatics today has entered every major discipline in biology. In
genomics, Bioinformatics has aided in genome sequencing, and has shown
its success in locating the genes, in phylogenetic comparison and in the
detection of transcription factor binding sites of the genes (Liu et al., 1995;
Thijs et al., 2002), just to name a few. Microarray technology has opened
Genome Annotation
In the context of genomics, annotation is the process of marking the genes
and other biological features in a DNA sequence. The first genome
annotation software system was designed in 1995 by Dr. Owen White,
who was part of the team at The Institute for Genomic Research that
sequenced and analyzed the first genome of a free-living organism to be
decoded, the bacterium Haemophilus influenzae. Dr. White built a
software system to find the genes (fragments of genomic sequence that
encode proteins), the transfer RNAs, and to make initial assignments of
function to those genes. Most current genome annotation systems work
similarly, but the programs available for analysis of genomic DNA, such
as the GeneMark program trained and used to find protein-coding genes in
Haemophilus influenzae, are constantly changing and improving.
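The gene-finding step described above can be illustrated with a very small
open reading frame (ORF) scan. This is only a sketch: it looks for an ATG
followed by an in-frame stop codon on the forward strand, whereas programs
such as GeneMark use trained statistical models. The function name
find_orfs and the min_codons parameter are hypothetical names chosen for
this example, not part of any annotation system.

```python
# Minimal forward-strand ORF scan (illustrative only; real gene finders
# such as GeneMark rely on trained statistical models, not this rule).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                # Walk codon by codon until an in-frame stop codon or the end.
                j = i + 3
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOPS:
                    j += 3
                if j + 3 <= len(seq):  # an in-frame stop codon was found
                    n_codons = (j - i) // 3  # codons before the stop, ATG included
                    if n_codons >= min_codons:
                        orfs.append((i, j + 3, frame))
                    i = j + 3
                    continue
            i += 3
    return orfs


if __name__ == "__main__":
    demo = "CCATGAAATAGGG"
    for start, end, frame in find_orfs(demo):
        print(frame, start, end, demo[start:end])
```

A fuller annotation pipeline would also scan the reverse complement and
score candidate ORFs, which this sketch deliberately leaves out.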
Future work endeavours to reconstruct the now more complex tree of life.
The area of research within computer science that uses genetic algorithms
is sometimes confused with computational evolutionary biology, but the
two areas are not necessarily related.
Literature Analysis
The growth in the volume of published literature makes it virtually
impossible to read every paper, resulting in disjointed sub-fields of
research. Literature analysis aims to employ computational and statistical
linguistics to mine this growing library of text resources. For example:
abbreviation recognition - identify the long-form and abbreviation
of biological terms,
named entity recognition - recognizing biological terms such as
gene names
protein-protein interaction - identify which proteins interact with
which proteins from text
The area of research draws from statistics and computational linguistics.
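As a rough illustration of abbreviation recognition, the sketch below looks
for the pattern "long form (ABBR)" and accepts a pair only when the initial
letters of the preceding words spell the abbreviation. This is a deliberately
naive heuristic assumed for the example; published text-mining systems use
far more robust matching. The function name find_abbreviations is
hypothetical.

```python
import re


def find_abbreviations(text):
    """Return {abbreviation: long form} for patterns like 'long form (ABBR)'."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        abbr = match.group(1)
        # Take as many words before the parenthesis as the abbreviation has letters.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(abbr):]
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(abbr) and initials == abbr:
            pairs[abbr] = " ".join(candidate)
    return pairs


if __name__ == "__main__":
    sentence = ("We used the basic local alignment search tool (BLAST) "
                "and open reading frame (ORF) prediction.")
    print(find_abbreviations(sentence))
```

Abbreviations that skip words or use internal letters would be missed by
this rule, which is one reason statistical methods are preferred in practice.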
Analysis of Regulation
Regulation is the complex orchestration of events starting with an
extracellular signal such as a hormone and leading to an increase or
decrease in the activity of one or more proteins. Bioinformatics techniques
have been applied to explore various steps in this process. For example,
promoter analysis involves the identification and study of sequence motifs
in the DNA surrounding the coding region of a gene. These motifs
influence the extent to which that region is transcribed into mRNA.
Expression data can be used to infer gene regulation: one might compare
microarray data from a wide variety of states of an organism to form
hypotheses about the genes involved in each state. In a single-cell
organism, one might compare stages of the cell cycle, along with various
stress conditions (heat shock, starvation, etc.). One can then apply
clustering algorithms to that expression data to determine which genes are
co-expressed. For example, the upstream regions (promoters) of co-
expressed genes can be searched for over-represented regulatory elements.
Examples of clustering algorithms applied in gene clustering are k-means
clustering, self-organizing maps (SOMs), hierarchical clustering, and
consensus clustering methods such as the Bi-CoPaM. The latter, Bi-CoPaM,
was proposed to address various issues specific to gene discovery
problems, such as consistent co-expression of genes over multiple
microarray datasets.
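The search for over-represented regulatory elements in the promoters of
co-expressed genes can be sketched by counting short words (k-mers) in the
cluster's upstream regions and comparing their frequency with a background
set. The frequency ratio with a pseudocount used here is a crude score
assumed for illustration; it is not the Gibbs sampling approach of the
cited papers. The names kmer_counts and enriched_kmers are hypothetical.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(cluster, background, k=6, pseudocount=1.0):
    """Rank k-mers by cluster frequency divided by background frequency."""
    fg = kmer_counts(cluster, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy upstream regions; real promoters would be hundreds of bases long.
    promoters = ["AATATA", "GGTATA", "TATACC"]
    others = ["ACGTACGT", "GGCCGGCC"]
    print(enriched_kmers(promoters, others, k=4)[:3])
```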
Comparative Genomics
The core of comparative genome analysis is the establishment of the
correspondence between genes (orthology analysis) or other genomic
features in different organisms. It is these intergenomic maps that make it
possible to trace the evolutionary processes responsible for the divergence
of two genomes. A multitude of evolutionary events acting at various
organizational levels shape genome evolution. At the lowest level, point
mutations affect individual nucleotides. At a higher level, large
chromosomal segments undergo duplication, lateral transfer, inversion,
transposition, deletion and insertion. Ultimately, whole genomes are
involved in processes of hybridization, polyploidization and
endosymbiosis, often leading to rapid speciation. The complexity of
genome evolution poses many exciting challenges to developers of
mathematical models and algorithms, who have recourse to a spectrum of
algorithmic, statistical and mathematical techniques, ranging from exact,
heuristic, fixed-parameter and approximation algorithms for problems
based on parsimony models to Markov chain Monte Carlo algorithms for
Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on homology detection and the
computation of protein families.
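One common way to approximate the gene correspondence (orthology) step is
the reciprocal best hit rule: two genes are paired when each is the other's
highest-scoring match in the opposite genome. The sketch below assumes that
pairwise similarity scores have already been computed (for example by a
sequence search program) and are supplied as dictionaries; the gene names,
scores and function names are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Made-up similarity scores between genes of genome A and genome B.
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
              ("a2", "b2"): 120, ("a3", "b2"): 130}
    b_vs_a = {("b1", "a1"): 190, ("b2", "a3"): 125, ("b2", "a2"): 110}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Reciprocal best hits miss many-to-many relationships created by gene
duplication, which is why dedicated orthology methods go further than this.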
Bioinformatics in Agriculture
Plant life plays important and diverse roles in our society, our economy,
and our global environment. Crops in particular are the plants most
important to us. Feeding the increasing world population is a challenge for
modern plant biotechnology. Crop yields have increased during the last
century and will continue to improve as agronomy combines enhanced
breeding with new biotechnology-based strategies. The onset of genomics is
providing massive amounts of information with which to improve crop
phenotypes. The accumulation of sequence data allows detailed genome
analysis through user-friendly database access and information retrieval.
Genetic and molecular genome colinearity allows efficient transfer of data
between species, revealing extensive conservation of genome organization.
The goals of genome research are the identification of the sequenced genes
and the deduction of their functions by metabolic analysis and
reverse-genetic screens of gene knockouts. Over 20% of the predicted genes occur
Evolutionary Studies
The sequencing of genomes from all three domains of life (Eukaryota,
Bacteria and Archaea) means that evolutionary studies can be performed in
a quest to determine the tree of life and the last universal common
ancestor.
Crop Improvement
Comparative genetics of the plant genomes has shown that the
organization of their genes is more conserved over evolutionary time than
was previously believed. These results suggest that information obtained
from the model crop systems can be used to suggest improvements to
other food crops. At present the complete genomes of Arabidopsis thaliana
(thale cress) and Oryza sativa (rice) are available.
Insect Resistance
Genes from Bacillus thuringiensis that can control a number of serious
pests have been successfully transferred to cotton, maize and potatoes.
This new ability of the plants to resist insect attack means that the
amount of insecticides being used can be reduced and hence the nutritional
quality of the crops is increased.
areas, thus adding more land to the global production base. Development
work is in progress on crop varieties capable of tolerating reduced water
conditions.
Bioinformatics Tools
Both standard and customized products are available to meet the
requirements of particular projects. Data-mining software retrieves data
from genomic sequence databases, and visualization tools are used to
analyse and retrieve information from proteomic databases. These tools can
be classified as homology and similarity tools, protein functional analysis
tools, sequence analysis tools and miscellaneous tools.
Here is a brief description of a few of these. Everyday bioinformatics is
done with sequence search programs like BLAST, sequence analysis
programs like the EMBOSS and Staden packages, structure prediction
programs like THREADER or PHD, and molecular imaging/modelling
programs like RasMol and WHAT IF.
Structural Analysis:
This set of tools allows you to compare structures with the known
structure databases. The function of a protein is more directly a
consequence of its structure than of its sequence, and structural
homologs tend to share functions. The determination of a protein's
2D/3D structure is therefore crucial in the study of its function.
Sequence Analysis:
This set of tools allows you to carry out further, more detailed analysis on
your query sequence including evolutionary analysis, identification of
mutations, hydropathy regions, CpG islands and compositional biases. The
identification of these and other biological properties provides clues that
aid the search to elucidate the specific function of your sequence.
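As one example of such an analysis, CpG islands can be flagged by sliding a
window along the sequence and computing its GC fraction and the ratio of
observed to expected CpG dinucleotides. The thresholds below (200-base
windows, GC fraction above 0.5, ratio above 0.6) follow commonly cited
criteria and are an assumption of this sketch, not something specified in
this unit; the function names are hypothetical.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    expected = (c * g) / n if n else 0.0  # expected CpG count given C and G content
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def cpg_island_windows(seq, size=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Return (start, end) of windows that pass both thresholds."""
    hits = []
    for start in range(0, len(seq) - size + 1, step):
        gc_fraction, ratio = cpg_stats(seq[start:start + size])
        if gc_fraction > min_gc and ratio > min_ratio:
            hits.append((start, start + size))
    return hits


if __name__ == "__main__":
    # Artificial sequence: a CpG-rich stretch followed by an AT-rich stretch.
    toy = "CG" * 100 + "AT" * 100
    print(cpg_island_windows(toy))
```

Overlapping windows that pass would normally be merged into a single island;
that step is omitted here to keep the sketch short.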
Some examples of Bioinformatics Tools:
Blast:
BLAST (Basic Local Alignment Search Tool) comes under the category of
homology and similarity tools. It is a set of search programs used to
perform fast similarity searches regardless of whether the query is
protein or DNA. A nucleotide query can be compared against the nucleotide
sequences in a database, and a protein database can be searched to find
matches to a query protein sequence. NCBI has also introduced a queuing
system for BLAST (QBLAST) that allows users to retrieve results at their
convenience and format their results multiple times with different
formatting options.
Depending on the type of sequences to compare, there are different
programs:
blastp compares an amino acid query sequence against a protein
sequence database
blastn compares a nucleotide query sequence against a nucleotide
sequence database
blastx compares a nucleotide query sequence translated in all
reading frames against a protein sequence database
tblastn compares a protein query sequence against a nucleotide
sequence database dynamically translated in all reading frames
tblastx compares the six-frame translations of a nucleotide query
sequence against the six-frame translations of a nucleotide
sequence database.
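The choice among these five programs depends only on the type of the query
and the type of the database, which can be written down as a small lookup
table. The helper below is a hypothetical convenience function for
illustration; it is not part of the BLAST distribution and does not run a
search.

```python
# (query type, database type) -> program, following the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program name; translate_both selects tblastx."""
    if translate_both:
        # tblastx compares six-frame translations of both query and database.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type}, {db_type}") from None


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
    print(choose_blast_program("nucleotide", "nucleotide", translate_both=True))
```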
Fasta
FASTA (short for FAST-All, a fast homology search for all sequence types)
is an alignment program created by Pearson and Lipman in 1988. The program is one of
the many heuristic algorithms proposed to speed up sequence comparison.
The basic idea is to add a fast pre-screen step to locate the highly matching
segments between two sequences, and then extend these matching
Emboss
EMBOSS (European Molecular Biology Open Software Suite) is a
sequence-analysis software package. It can work with data in a range of formats and
also retrieve sequence data transparently from the Web. Extensive libraries
are also provided with this package, allowing other scientists to release
their software as open source. It provides a set of sequence-analysis
programs, and also supports all UNIX platforms.
ClustalW
It is a fully automated multiple sequence alignment tool for DNA and
protein sequences. It returns the best alignment over the total length of
the input sequences, whether protein or nucleic acid.
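Aligning sequences over their total length is known as global alignment.
The sketch below shows the idea for just two sequences using the classic
Needleman-Wunsch dynamic programming method. It is not ClustalW's own
procedure, which builds a multiple alignment progressively from such
pairwise comparisons, and the scoring values (match 1, mismatch -1, gap -2)
are arbitrary assumptions for the example.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = global_alignment("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```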
RasMol:
It is a powerful research tool to display the structure of DNA, proteins, and
smaller molecules. Protein Explorer, a derivative of RasMol, is an
easier-to-use program.
Prospect
PROSPECT (PROtein Structure Prediction and Evaluation Computer
ToolKit) is a protein-structure prediction system that employs a
computational technique called protein threading to construct a protein's
3-D model.
Pattern Hunter
PatternHunter, written in Java, can identify all approximate repeats in a
complete genome in a short time using little memory on a desktop
computer. Its distinguishing features are its patented algorithm and data
structures and its Java implementation. The Java version of PatternHunter
is just 40 KB, only 1% the size of BLAST, while offering a large portion of
its functionality.
Copia:
COPIA (COnsensus Pattern Identification and Analysis) is a protein
structure analysis tool for discovering motifs (conserved regions) in a
family of protein sequences. Such motifs can be then used to determine
Perl in Bioinformatics
String manipulation, regular expression matching, file parsing and data
format interconversion are common text-processing tasks in
bioinformatics. Perl excels in such tasks and is used by many developers.
There are no standard modules designed in Perl specifically for the field
of bioinformatics; however, developers have designed several modules of
their own for the purpose, which have become quite popular and are
coordinated by the BioPerl project.
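The text-processing tasks listed above are not specific to Perl. As a
language-neutral illustration, the sketch below uses Python to parse
FASTA-formatted text, convert it to a tab-separated format, and locate a
pattern with a regular expression. The function names and the example
records are hypothetical; the pattern GAATTC is the EcoRI recognition site.

```python
import re


def parse_fasta(text):
    """File parsing: turn FASTA-formatted text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in records.items()}


def to_tab(records):
    """Format interconversion: one 'identifier<TAB>sequence' line per record."""
    return "\n".join(f"{name}\t{seq}" for name, seq in records.items())


def find_motif(records, pattern):
    """Regular expression matching: 0-based start positions of each match."""
    regex = re.compile(pattern)
    return {name: [m.start() for m in regex.finditer(seq)]
            for name, seq in records.items()}


if __name__ == "__main__":
    fasta = ">seq1 test\nACGT\nGAATTC\n>seq2\nttgaattcgaattc\n"
    parsed = parse_fasta(fasta)
    print(to_tab(parsed))
    print(find_motif(parsed, "GAATTC"))
```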
Bioinformatics Projects
BioJava
The BioJava Project is dedicated to providing Java tools for
processing biological data, which includes objects for manipulating
BioPerl
The BioPerl project is an international association of developers of Perl
tools for bioinformatics and provides an online resource for modules,
scripts and web links for developers of Perl-based software.
BioXML
A part of the BioPerl project, this is a resource to gather XML
documentation, DTDs and XML aware tools for biology in one location.
Biocorba
Interface objects have facilitated interoperability between bioperl and
other perl packages such as Ensembl and the Annotation Workbench.
However, interoperability between bioperl and packages written in other
languages requires additional support software. CORBA is one such
framework for interlanguage support, and the biocorba project is currently
implementing a CORBA interface for bioperl. With biocorba, objects
written within bioperl will be able to communicate with objects written in
biopython and biojava (see the next subsection). For more information, see
the biocorba project website at http://biocorba.org/. The Bioperl
BioCORBA server and client bindings are available in the
bioperl-corba-server and bioperl-corba-client bioperl CVS repositories
respectively (see http://cvs.bioperl.org/ for more information).
Ensembl
Ensembl is an ambitious automated genome annotation project at EBI.
Much of Ensembl's code is based on bioperl, and Ensembl developers, in
turn, have contributed significant pieces of code to bioperl. In particular,
the bioperl code for automated sequence annotation has been largely
contributed by Ensembl developers. Describing Ensembl and its
capabilities is far beyond the scope of this tutorial. The interested reader is
referred to the Ensembl website at http://www.ensembl.org/.
Bioperl-Db
Bioperl-db is a relatively new project intended to transfer some of
Ensembl's capability of integrating bioperl syntax with a standalone MySQL
database (http://www.mysql.com) to the bioperl code-base. More details
Rosalind
Rosalind is an educational resource and web project for learning
bioinformatics through problem solving and programming. Rosalind users
learn bioinformatics concepts through a problem tree that builds up
biological, algorithmic, and programming knowledge concurrently. Each
problem is checked automatically, allowing for the project to also be used
for automated homework testing in existing classes.
Rosalind is a joint project between the University of California at San
Diego and Saint Petersburg Academic University along with the Russian
Academy of Sciences. The project's name commemorates Rosalind
Franklin, whose X-ray crystallography with Raymond Gosling facilitated
the discovery of the DNA double helix by James D. Watson and Francis
Crick. It was recognized by Homolog.us as the Best Educational Resource
of 2012 in their review of the Top Bioinformatics Contributions of 2012.
As of March 2013, it hosted over 5,000 active users.
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. 1990.
Basic local alignment search tool, J. Mol. Biol., 215: 403.
Durbin, B.P., Hardin, J.S., Hawkins, D.M., Rocke, D.M. 2002. A
variance-stabilizing transformation for gene-expression microarray data,
Bioinformatics, 18(Suppl. 1): S105.
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. 1998. Cluster
analysis and display of genome-wide expression patterns, Proc Natl
Acad Sci (USA), 95: 14863.
Liu, J.S., Neuwald, A.F., Lawrence, C.E. 1995. Bayesian models for
multiple local sequence alignment and Gibbs sampling strategies, J
Amer Stat Assoc, 90: 1156.
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D.,
Friedman, N. 2003. Module networks: identifying regulatory
modules and their condition-specific regulators from gene
expression data, Nat Genet, 34(2): 166.
Sheng, Q., Moreau, Y., De Moor, B. 2003. Biclustering microarray data by
Gibbs sampling, Bioinformatics, 19(Suppl. 2): ii196.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen,
M.B., Brown, P.O., Botstein, D., Futcher, B. 1998. Comprehensive
identification of cell cycle-regulated genes of the yeast
Saccharomyces cerevisiae by microarray hybridization, Molecular
Biology of the Cell, 9: 3273.
Thijs, G., Marchal, K., Lescot, M., Rombauts, S., De Moor, B., Rouze, P.,
Moreau, Y. 2002. A Gibbs sampling method to detect over-represented
motifs in upstream regions of coexpressed genes, Journal
of Computational Biology, 9(2): 447.
Xue, J., Zhao, S., Liang, Y., Hou, C., Wang, J. Bioinformatics and its
applications in agriculture, College of Biological and Agricultural
Engineering, Jilin University, 5988 Renmin Street, Changchun, Jilin,
130022, P.R. China.
ICAR-National Academy of Agricultural Research Management
(ISO 9001:2008 Certified)
Rajendranagar, Hyderabad-500030, Telangana, India
https://www.naarm.org.in