Unit 7 (Application of Bioinformatics in Agriculture)

The document outlines the significance of bioinformatics in agriculture, detailing its applications in data management, genomic analysis, and computational biology. It emphasizes the role of bioinformatics in improving efficiency in research through tools for sequence analysis, genome annotation, and protein structure prediction. The text also discusses various approaches and methodologies used in bioinformatics to enhance understanding of biological processes and systems.

105th FoCARS
Foundation Course For Agricultural Research Service

Digital Repository of
Course Materials

• International and National Agricultural Research System in India


• Challenges and Management of Agricultural Extension in the New Millennium
• Production Systems Approach
• Economic Policies and Agricultural Development
• WTO and Agriculture Research and Development
• Intellectual Property Rights in Indian Agriculture
• Copyrights
• Designs as IPs
• Geographical Indicators
• Patents
• Trade Secrets
• Trademarks
• Application of Bioinformatics in Agriculture
Course Coordinators
K. Kareemulla and S. Ravichandran

Support Team
P. Krishnan and P. Namdev
APPLICATION OF BIOINFORMATICS
IN AGRICULTURE

M. Balakrishnan¹

Introduction
Bioinformatics has evolved into a full-fledged multidisciplinary subject
that integrates developments in information and computer technology as
applied to Biotechnology and Biological Sciences. Bioinformatics uses
computer software tools for database creation, data management, data
warehousing, data mining and global communication networking.
Bioinformatics is the recording, annotation, storage, analysis, and
searching/retrieval of nucleic acid sequence (genes and RNAs), protein
sequence and structural information. This includes databases of the
sequences and structural information as well as methods to access, search,
visualize and retrieve the information. Bioinformatics concerns the creation
and maintenance of databases of biological information whereby
researchers can both access existing information and submit new entries.
Functional genomics, biomolecular structure, proteome analysis, cell
metabolism, biodiversity, downstream processing in chemical engineering,
drug and vaccine design are some of the areas in which Bioinformatics is
an integral component.
Bioinformatics, which came to prominence with the Human Genome Project
(HGP), brings together the fields of life science, computer science and
statistics and strives to understand medical and biological systems by the
creative application of statistics and computer analysis. Bioinformatics is
the use of computer technology to help scientists keep track of the genetic
information they find. Using computers, researchers can gather, store,
analyse and compare biological data with great speed and accuracy. Imagine
studying gene structures without the help of a computer. It would take many
years to compare the roughly 27,000 protein-coding genes of Arabidopsis to
the genes of a similar plant, and keeping track of the roughly 20,000
protein-coding genes of a human being would be inconceivable. With
computers, the process of comparison is automated. By storing information
as it is discovered, computers ease the immense job of genome mapping. But
computers can analyse as well as store information. They can be used to
construct models that reduce the need for experimentation. In this way,
biotechnology has become more efficient. Scientists are able to use fairly
reliable computer-assisted predictions of test results on genetic
modifications. This complements the time-consuming process involved in
growing out every modified plant in the laboratory or greenhouse to test
for the desired modification.

¹ Principal Scientist, ICM Division, NAARM

Importance of Bioinformatics
In order to study how normal cellular activities are altered in different
disease states, the biological data must be combined to form a
comprehensive picture of these activities. Therefore, the field of
bioinformatics has evolved such that the most pressing task now involves
the analysis and interpretation of various types of data. This includes
nucleotide and amino acid sequences, protein domains, and protein
structures. The actual process of analyzing and interpreting data is referred
to as computational biology. Important sub-disciplines within
bioinformatics and computational biology include:
• The development and implementation of tools that enable efficient
access to, use and management of, various types of information.
• The development of new algorithms (mathematical formulas) and
statistics with which to assess relationships among members of
large data sets. For example, methods to locate a gene within a
sequence, predict protein structure and/or function, and cluster
protein sequences into families of related sequences.

The primary goal of bioinformatics is to increase the understanding of
biological processes. What sets it apart from other approaches, however, is
its focus on developing and applying computationally intensive techniques
to achieve this goal. Examples include: pattern recognition, data mining,
machine learning algorithms, and visualization. Major research efforts in
the field include sequence alignment, gene finding, genome assembly,
drug design, drug discovery, protein structure alignment, protein structure
prediction, prediction of gene expression and protein–protein interactions,
genome-wide association studies, and the modeling of evolution.
Bioinformatics now entails the creation and advancement of databases,
algorithms, computational and statistical techniques, and theory to solve
formal and practical problems arising from the management and analysis
of biological data.
Over the past few decades rapid developments in genomic and other
molecular research technologies and developments in information
technologies have combined to produce a tremendous amount of
information related to molecular biology. Bioinformatics is the name given
to these mathematical and computing approaches used to glean
understanding of biological processes.

Approaches
Common activities in bioinformatics include mapping and analyzing DNA
and protein sequences, aligning different DNA and protein sequences to
compare them, and creating and viewing 3-D models of protein structures.
There are two fundamental ways of modeling a biological system (e.g., a
living cell), both coming under bioinformatics approaches.

• Static
  ◦ Sequences – Proteins, Nucleic acids and Peptides
  ◦ Interaction data among the above entities, including
    microarray data and networks of proteins and metabolites
• Dynamic
  ◦ Structures – Proteins, Nucleic acids, Ligands (including
    metabolites and drugs) and Peptides (structures studied
    with bioinformatics tools are not considered static anymore
    and their dynamics is often the core of the structural
    studies)
  ◦ Systems Biology comes under this category, including
    reaction fluxes and variable concentrations of metabolites
  ◦ Multi-Agent Based modeling approaches capturing cellular
    events such as signaling, transcription and reaction
    dynamics
A broad sub-category under bioinformatics is structural bioinformatics.

Roles of Bioinformatics
Bioinformatics today has entered every major discipline in biology. In
genomics, bioinformatics has aided in genome sequencing, and has shown
its success in locating genes, in phylogenetic comparison and in the
detection of transcription factor binding sites of genes (Liu et al., 1995;
Thijs et al., 2002), just to name a few. Microarray technology has opened
the world of the transcriptome to biologists (Spellman et al., 1998; Eisen
et al., 1998). Bioinformatics provides analytical tools for microarray data.
These tools range from the image processing techniques that read out the
data, to the visualization tools that provide a first-sight hint to
biologists; from pre-processing techniques (Durbin et al., 2002) that
remove the systematic noise in the data to the clustering methods (Eisen et
al., 1998; Sheng et al., 2003) that reveal genes that behave similarly under
different experimental conditions. In proteomics, bioinformatics helps in
the study of protein structures and the discovery of sequence sites where
protein-protein interactions take place. To help understand biology at
the system level, bioinformatics begins to show promise in unravelling
genetic networks (Segal et al., 2003). Finally, in the study of the
metabolome, bioinformatics is used to study the dynamics in a cell, and
thus to simulate the cellular interactions.

Major Research Area


Sequence analysis
Since the Phage Φ-X174 was sequenced in 1977, the DNA sequences of
thousands of organisms have been decoded and stored in databases. This
sequence information is analyzed to determine genes that encode
polypeptides (proteins), RNA genes, regulatory sequences, structural
motifs, and repetitive sequences. A comparison of genes within a species
or between different species can show similarities between protein
functions, or relations between species (the use of molecular systematics
to construct phylogenetic trees). With the growing amount of data, it long
ago became impractical to analyze DNA sequences manually. Today,
computer programs such as BLAST are used daily to search sequences
from more than 260 000 organisms, containing over 190 billion
nucleotides. These programs can compensate for mutations (exchanged,
deleted or inserted bases) in the DNA sequence, to identify sequences that
are related, but not identical. A variant of this sequence alignment is used
in the sequencing process itself. The so-called shotgun sequencing
technique (which was used, for example, by The Institute for Genomic
Research to sequence the first bacterial genome, Haemophilus influenzae)
does not produce entire chromosomes. Instead it generates the sequences
of many thousands of small DNA fragments (ranging from 35 to 900
nucleotides long, depending on the sequencing technology). The ends of
these fragments overlap and, when aligned properly by a genome
assembly program, can be used to reconstruct the complete genome.
Shotgun sequencing yields sequence data quickly, but the task of
assembling the fragments can be quite complicated for larger genomes.
For a genome as large as the human genome, it may take many days of
CPU time on large-memory, multiprocessor computers to assemble the
fragments, and the resulting assembly will usually contain numerous gaps
that have to be filled in later. Shotgun sequencing is the method of choice
for virtually all genomes sequenced today, and genome assembly
algorithms are a critical area of bioinformatics research.
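
The idea of rebuilding a sequence from overlapping fragment ends can be illustrated with a toy greedy assembler. This is only a minimal sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names and the example reads are hypothetical, and production assemblers use far more elaborate graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: remaining pieces are separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


# Hypothetical error-free reads taken from one short sequence.
reads = ["ATGGCGTAC", "GTACGGATT", "GATTCCA"]
contigs = greedy_assemble(reads)
print(contigs)
```

The all-against-all overlap search is quadratic in the number of fragments, which hints at why assembling a large genome takes so much computer time.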
Another aspect of bioinformatics in sequence analysis is annotation. This
involves computational gene finding to search for protein-coding genes,
RNA genes, and other functional sequences within a genome. Not all of
the nucleotides within a genome are part of genes. Within the genomes of
higher organisms, large parts of the DNA do not serve any obvious
purpose. This so-called junk DNA may, however, contain unrecognized
functional elements. Bioinformatics helps to bridge the gap between
genome and proteome projects, for example in the use of DNA
sequences for protein identification.

Genome Annotation
In the context of genomics, annotation is the process of marking the genes
and other biological features in a DNA sequence. The first genome
annotation software system was designed in 1995 by Dr. Owen White,
who was part of the team at The Institute for Genomic Research that
sequenced and analyzed the first genome of a free-living organism to be
decoded, the bacterium Haemophilus influenzae. Dr. White built a
software system to find the genes (fragments of genomic sequence that
encode proteins), the transfer RNAs, and to make initial assignments of
function to those genes. Most current genome annotation systems work
similarly, but the programs available for analysis of genomic DNA, such
as the GeneMark program trained and used to find protein-coding genes in
Haemophilus influenzae, are constantly changing and improving.
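
As a rough illustration of computational gene finding, the sketch below scans the forward strand of a DNA string for open reading frames (an ATG start codon followed in frame by a stop codon). It is an assumption-laden toy, not the method used by GeneMark or any annotation system named above, which rely on trained statistical models; the function name, the minimum length and the example sequence are all hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for ORFs on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence containing two short ORFs in the same frame.
example = "CCATGGCTTGATAAATGAAACCCGGGTAGTT"
orfs = find_orfs(example)
for start, end, seq in orfs:
    print(start, end, seq)
```

A fuller version would also scan the reverse complement, since genes can lie on either strand.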

Computational Evolutionary Biology


Evolutionary biology is the study of the origin and descent of species, as
well as their change over time. Informatics has assisted evolutionary
biologists in several key ways; it has enabled researchers to:
• trace the evolution of a large number of organisms by measuring
changes in their DNA, rather than through physical taxonomy or
physiological observations alone,
• more recently, compare entire genomes, which permits the study of
more complex evolutionary events, such as gene duplication,
horizontal gene transfer, and the prediction of factors important in
bacterial speciation,
• build complex computational models of populations to predict the
outcome of the system over time,
• track and share information on an increasingly large number of
species and organisms.

Future work endeavours to reconstruct the now more complex tree of life.
The area of research within computer science that uses genetic algorithms
is sometimes confused with computational evolutionary biology, but the
two areas are not necessarily related.

Literature Analysis
The growth in the volume of published literature makes it virtually
impossible to read every paper, resulting in disjointed sub-fields of
research. Literature analysis aims to employ computational and statistical
linguistics to mine this growing library of text resources. For example:
• abbreviation recognition - identifying the long form and abbreviation
of biological terms,
• named entity recognition - recognizing biological terms such as
gene names,
• protein-protein interaction - identifying which proteins interact with
which proteins from text.
The area of research draws from statistics and computational linguistics.
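
Abbreviation recognition, the first task listed above, can be sketched with a simple rule: take a parenthesised short form and check whether the initials of the words just before it spell it out. The pattern, function name and example sentence below are assumptions chosen for illustration; real text-mining systems use much richer rules and statistical models.

```python
import re

# A short form is assumed to be 2-10 letters/digits starting with a capital.
SHORT_FORM = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")


def find_abbreviations(text):
    """Pair each parenthesised short form with a candidate long form."""
    pairs = {}
    for match in SHORT_FORM.finditer(text):
        short = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(short):]  # one preceding word per letter
        initials = "".join(word[0].upper() for word in candidate)
        if initials == short.upper():
            pairs[short] = " ".join(candidate)
    return pairs


sentence = ("Expressed sequence tag (EST) sequencing and serial analysis of "
            "gene expression (SAGE) both measure mRNA levels.")
found = find_abbreviations(sentence)
print(found)
```

The rule recovers EST but misses SAGE, because the word "of" contributes no letter to the abbreviation. Cases like this are why the task is treated as a research problem and not a simple lookup.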

Analysis of Gene Expression


The expression of many genes can be determined by measuring mRNA
levels with multiple techniques including microarrays, expressed cDNA
sequence tag (EST) sequencing, serial analysis of gene expression (SAGE)
tag sequencing, massively parallel signature sequencing (MPSS), RNA-
Seq, also known as "Whole Transcriptome Shotgun Sequencing" (WTSS),
or various applications of multiplexed in-situ hybridization. All of these
techniques are extremely noise-prone and/or subject to bias in the
biological measurement, and a major research area in computational
biology involves developing statistical tools to separate signal from noise
in high-throughput gene expression studies. Such studies are often used to
determine the genes implicated in a disorder: one might compare
microarray data from cancerous epithelial cells to data from non-cancerous
cells to determine the transcripts that are up-regulated and down-regulated
in a particular population of cancer cells.
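
The comparison described above is often summarised as a log2 fold change per transcript. The sketch below assumes already-normalised expression values, a pseudocount of 1 and a two-fold cut-off, all of which are illustrative choices; the gene names and numbers are invented, and a real analysis would add replicates and a statistical test.

```python
import math


def log2_fold_changes(tumour, normal, pseudocount=1.0):
    """log2(tumour / normal) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((tumour[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in tumour
        if gene in normal
    }


def classify(changes, threshold=1.0):
    """Label genes as up, down or unchanged using a log2 threshold."""
    labels = {}
    for gene, value in changes.items():
        if value >= threshold:
            labels[gene] = "up"
        elif value <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Invented expression values for three hypothetical transcripts.
tumour = {"geneA": 799.0, "geneB": 49.0, "geneC": 119.0}
normal = {"geneA": 99.0, "geneB": 199.0, "geneC": 99.0}

changes = log2_fold_changes(tumour, normal)
labels = classify(changes)
print(labels)
```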

Analysis of Regulation
Regulation is the complex orchestration of events starting with an
extracellular signal such as a hormone and leading to an increase or
decrease in the activity of one or more proteins. Bioinformatics techniques
have been applied to explore various steps in this process. For example,
promoter analysis involves the identification and study of sequence motifs
in the DNA surrounding the coding region of a gene. These motifs
influence the extent to which that region is transcribed into mRNA.
Expression data can be used to infer gene regulation: one might compare
microarray data from a wide variety of states of an organism to form
hypotheses about the genes involved in each state. In a single-cell
organism, one might compare stages of the cell cycle, along with various
stress conditions (heat shock, starvation, etc.). One can then apply
clustering algorithms to that expression data to determine which genes are
co-expressed. For example, the upstream regions (promoters) of co-
expressed genes can be searched for over-represented regulatory elements.
Examples of clustering algorithms applied in gene clustering are k-means
clustering, self-organizing maps (SOMs), hierarchical clustering, and
consensus clustering methods such as the Bi-CoPaM. The latter
has been proposed to address various issues specific
to gene discovery problems such as consistent co-expression of genes over
multiple microarray datasets.
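
Before any clustering algorithm is applied, co-expression is usually quantified with a similarity measure between expression profiles. The sketch below uses Pearson correlation with a fixed cut-off to list co-expressed gene pairs. It is a simpler stand-in for the idea, not an implementation of k-means, SOMs, hierarchical clustering or Bi-CoPaM, and the profiles and the 0.9 cut-off are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, cutoff=0.9):
    """Gene pairs whose profiles correlate at or above the cut-off."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if pearson(profiles[g1], profiles[g2]) >= cutoff
    ]


# Invented expression levels across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
pairs = coexpressed_pairs(profiles)
print(pairs)
```

The upstream regions of the genes in each reported pair would then be the candidates for a shared regulatory motif search.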

Analysis of Protein Expression


Protein microarrays and high throughput (HT) mass spectrometry (MS)
can provide a snapshot of the proteins present in a biological sample.
Bioinformatics is very much involved in making sense of protein
microarray and HT MS data; the former approach faces similar problems
as with microarrays targeted at mRNA, the latter involves the problem of
matching large amounts of mass data against predicted masses from
protein sequence databases, and the complicated statistical analysis of
samples where multiple, but incomplete peptides from each protein are
detected.
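
The mass-matching step for mass spectrometry data can be sketched as follows: compute the theoretical mass of each candidate peptide from its residues and compare it with the observed masses within a tolerance. The residue values are standard monoisotopic masses; the tolerance, the candidate peptides and the observed masses are assumptions made up for the example, and real search engines also handle charge states, modifications and fragment spectra.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def peptide_mass(sequence):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def match_masses(observed, peptides, tolerance=0.01):
    """Map each observed mass to the candidate peptides within tolerance."""
    return {
        mass: [p for p in peptides if abs(peptide_mass(p) - mass) <= tolerance]
        for mass in observed
    }


# Hypothetical candidate peptides and observed masses.
candidates = ["GG", "AG", "SAK"]
observed = [132.054, 146.069, 500.0]
hits = match_masses(observed, candidates)
print(hits)
```

Leucine and isoleucine have the same mass, one small example of why a mass alone does not always identify a peptide.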

Comparative Genomics
The core of comparative genome analysis is the establishment of the
correspondence between genes (orthology analysis) or other genomic
features in different organisms. It is these intergenomic maps that make it
possible to trace the evolutionary processes responsible for the divergence
of two genomes. A multitude of evolutionary events acting at various
organizational levels shape genome evolution. At the lowest level, point
mutations affect individual nucleotides. At a higher level, large
chromosomal segments undergo duplication, lateral transfer, inversion,
transposition, deletion and insertion. Ultimately, whole genomes are
involved in processes of hybridization, polyploidization and
endosymbiosis, often leading to rapid speciation. The complexity of
genome evolution poses many exciting challenges to developers of
mathematical models and algorithms, who have recourse to a spectrum of
algorithmic, statistical and mathematical techniques, ranging from exact,
heuristic, fixed-parameter and approximation algorithms for problems
based on parsimony models to Markov Chain Monte Carlo algorithms for
Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on homology detection and the computation
of protein families.

Network and Systems Biology


Network analysis seeks to understand the relationships within biological
networks such as metabolic or protein-protein interaction networks.
Although biological networks can be constructed from a single type of
molecule or entity (such as genes), network biology often attempts to
integrate many different data types, such as proteins, small molecules,
gene expression data, and others, which are all connected physically
and/or functionally.
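
A biological network of a single entity type can be held as an adjacency structure, and one of the simplest analyses is to count each node's connections to find highly connected "hub" proteins. The sketch below assumes an undirected protein-protein interaction list; the protein labels and the degree cut-off are hypothetical.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def hubs(network, min_degree):
    """Proteins with at least min_degree distinct interaction partners."""
    return sorted(
        node for node, partners in network.items() if len(partners) >= min_degree
    )


# Hypothetical interaction list.
interactions = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P5", "P4"),
]
network = build_network(interactions)
degrees = {node: len(partners) for node, partners in network.items()}
print(degrees)
print(hubs(network, min_degree=3))
```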
Systems biology involves the use of computer simulations of cellular
subsystems (such as the networks of metabolites and enzymes which
comprise metabolism, signal transduction pathways and gene regulatory
networks) to both analyze and visualize the complex connections of these
cellular processes. Artificial life or virtual evolution attempts to
understand evolutionary processes via the computer simulation of simple
(artificial) life forms.

High-Throughput Image Analysis


Computational technologies are used to accelerate or fully automate the
processing, quantification and analysis of large amounts of high-
information-content biomedical imagery. Modern image analysis systems
augment an observer's ability to make measurements from a large or
complex set of images, by improving accuracy, objectivity, or speed. A
fully developed analysis system may completely replace the observer.
Although these systems are not unique to biomedical imagery, biomedical
imaging is becoming more important for both diagnostics and research.
Some examples are:
• high-throughput and high-fidelity quantification and sub-cellular
localization (high-content screening, cytohistopathology, bioimage
informatics)
• morphometrics
• clinical image analysis and visualization
• determining the real-time air-flow patterns in breathing lungs of
living animals
• quantifying occlusion size in real-time imagery from the
development of and recovery during arterial injury
• making behavioral observations from extended video recordings of
laboratory animals
• infrared measurements for metabolic activity determination
• inferring clone overlaps in DNA mapping, e.g. the Sulston score

Bioinformatics in Agriculture
Plant life plays important and diverse roles in our society, our economy,
and our global environment. Crops are the most important of these plants
to us. Feeding the increasing world population is a challenge for modern
plant biotechnology. Crop yields have increased during the last century
and will continue to improve as agronomy combines enhanced breeding with
new biotechnological strategies. The onset of genomics is providing
massive information to improve crop phenotypes. The accumulation of
sequence data allows detailed genome analysis by using user-friendly
database access and information retrieval. Genetic and molecular genome
colinearity allows efficient transfer of data, revealing extensive
conservation of genome organization between species. The goals of genome
research are the identification of the sequenced genes and the deduction
of their functions by metabolic analysis and reverse genetic screens of
gene knockouts. Over 20% of the predicted genes occur as clusters of
related genes, generating a considerable proportion of gene families.
Multiple alignments provide a method to estimate the number of genes in
gene families, allowing the identification of previously undescribed
genes. This information enables new strategies to study gene expression
patterns in plants. Information available from new technologies, such as
DNA microarray expression data stored in databases, will help plant
functional genomics. Expressed sequence tags (ESTs) also give the
opportunity to perform "digital northern" comparisons of gene expression
levels, providing initial clues toward unknown regulatory phenomena. Crop
plant networks, collections of databases and bioinformatics resources for
crop plant genomics, have been built to harness the extensive work in
genome mapping. These resources facilitate the identification of
agronomically important genes by comparative analysis between crop plants
and model species, allowing the genetic engineering of crop plants
selected by the quality of the resulting products. Bioinformatics
resources have evolved beyond expectation, developing new nutritional
genomics biotechnology tools to genetically modify and improve the food
supply for an ever-increasing world population. Bioinformatics can
therefore now be leveraged to accelerate the translation of basic
discovery to agriculture. The predictive manipulation of plant growth will
affect agriculture at a time when food security, diminution of lands
available for agricultural use, stewardship of the environment, and climate
change are all issues of growing public concern.
Agri Informatics is based on a GIS (Geographical Information System)
analysis of the complex interaction between the environment and crops.
The microclimate can be analysed using simple maps of the farm. The ideal
block borders are drawn according to a variety of critical variables. In
the same way, a soil map is drawn up to describe soil qualities such as
acidity level, clay content, water retention ability and depth. The
resulting map shows the ideal block design, row alignment, rootstock,
cultivar, plant spacing, and trellis system.
The farm meso-climate is compared to other wine regions, locally and
internationally. These analyses form the basis of precision viticulture.
Aerial photography or satellite images can also be included in the system.
Satellite-based remote sensing combined with a limited field survey
provides valuable information rapidly. Data on the status of the crops
under observation on any particular date can be processed quickly. This
provides a synoptic view, making it possible to analyse the status and
trend of transplantation, crop growth and harvesting throughout the study
area. Remote sensing and GIS can be used to create a systematic and
sharable database on crop-related subjects and to answer the unanswered
questions of different stakeholders.

Evolutionary Studies
The sequencing of genomes from all three domains of life (eukaryota,
bacteria and archaea) means that evolutionary studies can be performed in
a quest to determine the tree of life and the last universal common
ancestor.

Crop Improvement
Comparative genetics of the plant genomes has shown that the
organization of their genes is more conserved over evolutionary time than
was previously believed. These results suggest that information obtained
from the model crop systems can be used to suggest improvements to
other food crops. At present the complete genomes of Arabidopsis thaliana
(thale cress) and Oryza sativa (rice) are available.

Insect Resistance
Genes from Bacillus thuringiensis that can control a number of serious
pests have been successfully transferred to cotton, maize and potatoes.
This new ability of the plants to resist insect attack means that the
amount of insecticides being used can be reduced and hence the nutritional
quality of the crops is increased.

Improve Nutritional Quality


A recent development in agriculture is the transfer of genes into rice to
increase levels of Vitamin A, iron and other micronutrients. This work
might have a deep impact in reducing occurrences of blindness and
anaemia, which are caused by deficiencies in Vitamin A and iron
respectively. Another is the insertion of a yeast gene into the tomato; the
result is a plant whose fruit stays longer on the vine and has an extended
shelf life.

Development of Drought Resistance Varieties


Progress has been made in developing cereal varieties that have a
greater tolerance for soil alkalinity, free aluminium and iron toxicities.
These varieties will allow agriculture to succeed in poorer soil
areas, thus adding more land to the global production base. Work is also
in progress on the production of crop varieties capable of tolerating
reduced water conditions.

Bioinformatics Tools
There are both standard and customized products to meet the requirements
of particular projects. There is data-mining software that retrieves data
from genomic sequence databases, and there are visualization tools to analyse
and retrieve information from proteomic databases. These can be classified
as homology and similarity tools, protein functional analysis tools,
sequence analysis tools and miscellaneous tools.
Here is a brief description of a few of these. Everyday bioinformatics is
done with sequence search programs like BLAST, sequence analysis
programs like the EMBOSS and Staden packages, structure prediction
programs like THREADER or PHD, and molecular imaging/modelling
programs like RasMol and WHATIF.

Homology and Similarity Tools:


Homologous sequences are sequences that are related by divergence from
a common ancestor. Thus the degree of similarity between two sequences
can be measured, while their homology is either true or
false. This set of tools can be used to identify similarities between novel
query sequences of unknown structure and function and database
sequences whose structure and function have been elucidated.
Protein Function Analysis:
This group of programs allows you to compare your protein sequence to the
secondary (or derived) protein databases that contain information on
motifs, signatures and protein domains. Highly significant hits against
these different pattern databases allow you to approximate the biochemical
function of your query protein.

Structural Analysis:
This set of tools allows you to compare structures with the known
structure databases. The function of a protein is more directly a
consequence of its structure rather than its sequence with structural
homologs tending to share functions. The determination of a protein's
2D/3D structure is crucial in the study of its function.

Sequence Analysis:
This set of tools allows you to carry out further, more detailed analysis on
your query sequence including evolutionary analysis, identification of
mutations, hydropathy regions, CpG islands and compositional biases. The
identification of these and other biological properties are all clues that aid
the search to elucidate the specific function of your sequence.
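
Two of the properties mentioned above, compositional bias and CpG islands, rest on very simple counts. The sketch below computes the GC fraction of a sequence and the observed/expected CpG ratio, here taken as (number of CG dinucleotides × length) / (number of C × number of G). The function names and the example sequence are hypothetical, and a real CpG island finder would slide a window along the genome and apply length and threshold criteria.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected ratio of CG dinucleotides in the sequence."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both bases
    return seq.count("CG") * len(seq) / (c_count * g_count)


# Hypothetical short sequence: CpG-rich start, AT-rich end.
example = "CGCGCGATAT"
print(round(gc_content(example), 2), round(cpg_obs_exp(example), 2))
```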
Some examples of Bioinformatics Tools:

Blast:
BLAST (Basic Local Alignment Search Tool) comes under the category of
homology and similarity tools. It is a set of search programs, not tied to
any single operating system, used to perform fast similarity searches
regardless of whether the query is protein or DNA. A nucleotide query can
be compared against the sequences in a nucleotide database, and a protein
database can be searched to find matches to a protein query sequence.
NCBI has also introduced a queuing system for BLAST (QBLAST) that allows
users to retrieve results at their convenience and format their results
multiple times with different formatting options. Depending on the type of
sequences to compare, there are different programs:
• blastp compares an amino acid query sequence against a protein
sequence database
• blastn compares a nucleotide query sequence against a nucleotide
sequence database
• blastx compares a nucleotide query sequence translated in all
reading frames against a protein sequence database
• tblastn compares a protein query sequence against a nucleotide
sequence database dynamically translated in all reading frames
• tblastx compares the six-frame translations of a nucleotide query
sequence against the six-frame translations of a nucleotide
sequence database.
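
The "translated in all reading frames" step used by blastx, tblastn and tblastx can be sketched as a six-frame translation with the standard genetic code. This is not BLAST's own code, only an illustration of the concept; the function names, frame labels and example sequence are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from position 0; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translations of the three forward and three reverse-strand frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return {
        f"{sign}{offset + 1}": translate(strand[offset:])
        for sign, strand in (("+", dna), ("-", reverse))
        for offset in range(3)
    }


frames = six_frames("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings would then be compared against the protein database in the same way as a blastp query.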
Fasta
FASTA (FAST-All) is a fast homology search program that works with all
sequence types. It grew out of an alignment program for protein sequences
and was created by Pearson and Lipman in 1988. The program is one of
the many heuristic algorithms proposed to speed up sequence comparison.
The basic idea is to add a fast pre-screen step to locate the highly matching
segments between two sequences, and then extend these matching
segments to local alignments using more rigorous algorithms such as
Smith-Waterman.
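
The rigorous step named above, Smith-Waterman local alignment, can be sketched in a few lines of dynamic programming. The version below assumes a simple scoring scheme (match +2, mismatch -1, linear gap -2) chosen only for illustration; FASTA itself uses substitution matrices and a banded search, and the example sequences are invented.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("GGTACGTCC", "AATACGTAA")
print(result)
```

The matrix has one cell per pair of positions, so the cost grows with the product of the sequence lengths. That cost is what the fast pre-screen is meant to avoid for most database entries.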

Emboss
EMBOSS (European Molecular Biology Open Software Suite) is a
sequence analysis software package. It can work with data in a range of
formats and also retrieve sequence data transparently from the Web.
Extensive libraries are provided with the package, allowing other
scientists to release their software as open source. It provides a set of
sequence-analysis programs and supports all UNIX platforms.

Clustalw
ClustalW is a fully automated multiple sequence alignment tool for DNA
and protein sequences. It produces a global alignment, i.e. the best match
over the total length of the input sequences, be they protein or nucleic acid.

RasMol:
It is a powerful research tool to display the structure of DNA, proteins, and
smaller molecules. Protein Explorer, a derivative of RasMol, is an
easier-to-use program.

Prospect
PROSPECT (PROtein Structure Prediction and Evaluation Computer
ToolKit) is a protein-structure prediction system that employs a
computational technique called protein threading to construct a protein's
3-D model.

Pattern Hunter
PatternHunter, written in Java, can identify all approximate repeats in a
complete genome in a short time using little memory on a desktop
computer. Its distinguishing features are its patented algorithm and data
structures and its implementation in Java. The Java
version of PatternHunter is just 40 KB, only 1% the size of BLAST, while
offering a large portion of its functionality.

Copia:
COPIA (COnsensus Pattern Identification and Analysis) is a protein
structure analysis tool for discovering motifs (conserved regions) in a
family of protein sequences. Such motifs can then be used to determine
membership of the family for new protein sequences, predict the secondary
and tertiary structure and function of proteins, and study the evolutionary
history of the sequences.
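The idea of using a motif to test family membership can be illustrated with a small Python sketch. It assumes the motif instances are already aligned and of equal length, and it builds a simple consensus pattern from the residues seen in each column. This is an illustration of the concept only, not COPIA's algorithm, and the example sequences are invented.

```python
import re


def consensus_pattern(instances):
    """Build a regular expression from aligned, equal-length motif
    instances: one fixed letter or one character class per column."""
    if not instances:
        raise ValueError("at least one motif instance is required")
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("motif instances must have equal length")
    parts = []
    for column in zip(*instances):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)


def contains_motif(pattern, sequence):
    """Return True if the sequence contains a match to the pattern."""
    return re.search(pattern, sequence) is not None


if __name__ == "__main__":
    # Invented motif instances taken from a hypothetical protein family.
    pattern = consensus_pattern(["CAHC", "CGHC", "CAHC"])
    print(pattern)
    print(contains_motif(pattern, "MKTCGHCLL"))
    print(contains_motif(pattern, "MKTCTHCLL"))
```

A pattern this strict only accepts residues already observed in the family; real motif tools typically weigh residue frequencies instead of requiring exact column membership.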

Web Tools and Resources of Bioinformatics


The World Wide Web provides a mechanism for unprecedented
information sharing among researchers. Today, scientists can easily post
their research findings on the Web or compare their discoveries with
previous results, often spurring innovation and further discovery. The
value of accessing data from other institutions and the relative ease of
disseminating this data have increased the opportunity for multi-institution
collaborations, which produce dramatically larger data sets than were
previously available and require advanced data management techniques
for full utilization. As a side effect of these types of collaborations, some
tools become de facto standards in the communities as they are shared
among a large number of institutions. For instance, consider the BLAST
(Altschul et al., 1990) family of applications, which allow biologists to
find homologs of an input sequence in DNA and protein sequence
libraries. BLAST is an example application that has been enhanced as a
Web source, which provides dynamic access to large data sets. Many
genomics laboratories provide a Web-based BLAST interface
(http://blast.wustl.edu/) to their sequence databases that allows scientists to
easily identify homologs of an input sequence of interest. This capability
enhances the genomics research environment by allowing scientists to
compare new sequences with every known sequence and to have their
work validated by other members of the community. The addition of new
sequences at an increasingly frequent rate (NIAS DNA Bank,
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.htm) further increases
the value of this capability. There are a number of common bioinformatics
analyses one can perform at other sites, such as European Bioinformatics
Institute (EBI), Bio Web Pasteur and Canadian Bioinformatics Resource,
including BLAST and sequence analyses, primer tools and phylogenetic
tree construction. EMBOSS sequence analysis package and SRS bio-
database access are among the widely useful web tools available at these
and other resource sites. There are numerous web lists of bioinformatics
resources, with many aimed at the biologist looking for software. Some of
these, such as Bioinformatics.net, include discussion forums on the use of
biology software. These are useful for biologists, as well as bioinformatics
engineers looking for tools related to their work, or to be used at service
centres. Many of these share a similar organization by functional
categories, with many of the same links. It is useful to compare these for
their different editorial perspectives, e.g. genomics/molecular biology or
proteomics/biochemistry, as well as their effort to update and remove obsolete
links. General resources such as Google, Amazon's Alexa and the Open
Directory Project at Mozilla.org include biology and bioinformatics
categories in their directories. These directories are populated by robots or
from submissions; they tend to lack the comprehensiveness of
biologist-maintained lists. Bioinformatics.ca provides a curated list of links that are
well organized in categories, with main sections that include human
genome and model organisms, sequences, gene expression, education and
computer-related resources. Most or all of these include useful editorial
comments on the content and value of the linked resources, making this
list especially useful in learning about resources. The Genome Web at
MRC, UK, offers a similar very useful catalogue of links with editorial
abstracts. An interesting function at Bioinformatics.ca is provided by an
XML standard for web news called RSS, for sharing bioinformatics links.
This allows customers and other web sites to have computable access to
this catalogue. For instance, you can use an RSS program to notify you of
additions and changes to this catalogue. The Bionetwork project at Pasteur
Institute provides an example of resource lists that are searchable by
several bioinformatics criteria: Biological Domain, e.g. sequence analysis
or structural biology, Resource type, e.g. database or online analysis tools,
and Organism. This biology-focused search engine proves especially
useful in finding the tool or resource most relevant to one's research. This
project has also implemented link maintenance by using semi-automatic
scanning of internet news and resources (robot-like) to update the
catalogue. A similar project is BioHunt, which uses internet robot
technology to search and update molecular biology resources. BioHunt
maintains current entries (several searches showed update times within the
month of this review), making it especially useful for finding new or
updated tools that one has heard of, but it lacks the curated cataloguing
that would make them easy to find by subject matter.
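The RSS-based notification mentioned above can be sketched in Python with the standard library. The feed content below is an invented sample in RSS 2.0 form, and the function names are illustrative; a real reader would download the feed text from the catalogue's published feed address instead.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; stands in for a catalogue's real RSS document.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example bioinformatics links</title>
    <item><title>Tool A</title><link>http://example.org/a</link></item>
    <item><title>Tool B</title><link>http://example.org/b</link></item>
  </channel>
</rss>"""


def feed_items(rss_text):
    """Return (title, link) pairs for every item in an RSS document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


def new_items(rss_text, seen_links):
    """Return only the items whose link has not been seen before."""
    return [(t, l) for t, l in feed_items(rss_text) if l not in seen_links]


if __name__ == "__main__":
    already_seen = {"http://example.org/a"}
    for title, link in new_items(SAMPLE_FEED, already_seen):
        print("New entry:", title, link)
```

Keeping the set of seen links between runs is what turns a plain feed parser into a change notifier.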
Bioinformatics.net is a catalogue of online biology resources, specializing
in bioinformatics tools. Its focus is towards the needs of molecular
biologists and life science professionals, more than for bioinformaticians,
and includes discussion and help forums on the use of software and
bioscience topics. Jonathan Rees, who developed this resource, also
curates biology lists in the Open Directory project. This service is
supported in part by advertising, as are others reviewed here, one of the
limited options available to maintain such services. Bioinformatik.de offers
a similar directory-style collection of curated bioinformatics and biology
resource links. The CMS molecular biology resource is an extensive
catalogue of biology resources, including software tools. The Southwest Biotechnology Center
also maintains a useful catalogue covering a broad range of biology
resources. Bioinformatics.org and SourceForge.net are resources that
support software developers and bioinformatics engineers, but are also
useful to biologists looking for tools. Open-source software development
in bioinformatics and other fields is being invigorated through agencies
such as these. The number of active, widely used and valuable
bioinformatics projects at these services is growing, including Generic
Model Organism Database, Gene Ontology, GeneX Gene Expression
Database and Staden Package for sequence analysis. These agencies allow
for software archiving, but the primary attractions to software developers
are infrastructure and tools that enable collaborative software
development. A historical archive or catalogue service of bioinformatics
software is limited, and maintenance of software releases is left to
developers using this service.

Application of Programmes in Bioinformatics


JAVA in Bioinformatics
Since research centers are scattered all around the globe ranging from
private to academic settings, and a range of hardware and OSs are being
used, Java is emerging as a key player in bioinformatics. Physiome
Sciences' computer-based biological simulation technologies and
Bioinformatics Solutions' Pattern Hunter are two examples of the growing
adoption of Java in bioinformatics.

Perl in Bioinformatics
String manipulation, regular expression matching, file parsing and data
format interconversion are common text-processing tasks in
bioinformatics. Perl excels in such tasks and is being used by many
developers. Yet, there are no standard modules designed in Perl
specifically for the field of bioinformatics. However, developers have
designed several of their own individual modules for the purpose, which
have become quite popular and are coordinated by the BioPerl project.
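The text-processing tasks named above (file parsing, regular expression matching and format interconversion) can be sketched briefly. The example below is written in Python rather than Perl, uses only the standard library, and its function names and sample records are illustrative assumptions rather than part of BioPerl.

```python
import re


def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word of the header line.
            name = (line[1:].split() or [""])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def find_sites(sequence, pattern):
    """Return start positions of non-overlapping regex matches."""
    return [m.start() for m in re.finditer(pattern, sequence)]


def to_tab_delimited(records):
    """Convert parsed records to a simple name<TAB>sequence format."""
    return "\n".join(f"{name}\t{seq}" for name, seq in records.items())


if __name__ == "__main__":
    fasta = ">seq1 demo\nGAATTCAA\nGGAATTC\n>seq2\nACGT\n"
    records = parse_fasta(fasta)
    # GAATTC is the recognition site of the restriction enzyme EcoRI.
    print(find_sites(records["seq1"], "GAATTC"))
    print(to_tab_delimited(records))
```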

Bioinformatics Projects
BioJava
The BioJava Project is dedicated to providing Java tools for
processing biological data, including objects for manipulating
sequences, dynamic programming, file parsers, simple statistical routines,
etc.

BioPerl
The BioPerl project is an international association of developers of Perl
tools for bioinformatics and provides an online resource for modules,
scripts and web links for developers of Perl-based software.

BioXML
A part of the BioPerl project, this is a resource to gather XML
documentation, DTDs and XML-aware tools for biology in one location.

Biocorba
Interface objects have facilitated interoperability between bioperl and
other perl packages such as Ensembl and the Annotation Workbench.
However, interoperability between bioperl and packages written in other
languages requires additional support software. CORBA is one such
framework for interlanguage support, and the biocorba project is currently
implementing a CORBA interface for bioperl. With biocorba, objects
written within bioperl will be able to communicate with objects written in
biopython and biojava (see the Biopython and Biojava section below). For
more information, see the biocorba project website at http://biocorba.org/.
The Bioperl BioCORBA server and client bindings are available in the
bioperl-corba-server and bioperl-corba-client bioperl CVS repositories
respectively (see http://cvs.bioperl.org/ for more information).

Ensembl
Ensembl is an ambitious automated-genome-annotation project at EBI.
Much of Ensembl's code is based on bioperl, and Ensembl developers, in
turn, have contributed significant pieces of code to bioperl. In particular,
the bioperl code for automated sequence annotation has been largely
contributed by Ensembl developers. Describing Ensembl and its
capabilities is far beyond the scope of this tutorial. The interested reader is
referred to the Ensembl website at http://www.ensembl.org/.

Bioperl-Db
Bioperl-db is a relatively new project intended to transfer some of
Ensembl's capability of integrating bioperl syntax with a standalone MySQL
database (http://www.mysql.com) to the bioperl code-base. More details
on bioperl-db can be found in the bioperl-db CVS directory at
http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-db/?cvsroot=bioperl.
It is worth mentioning that most of the bioperl
objects mentioned above map directly to tables in the bioperl-db schema.
Therefore object data such as sequences, their features, and annotations
can be easily loaded into the databases, as in
$loader->store($newid,$seqobj). Similarly, one can query the database in a
variety of ways and retrieve arrays of Seq objects. See biodatabases.pod,
Bio::DB::SQL::SeqAdaptor, Bio::DB::SQL::QueryConstraint, and
Bio::DB::SQL::BioQuery for examples.
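The object-to-table mapping described above can be illustrated with a small Python sketch using the standard sqlite3 module. The table layout, the class names SimpleSeq and SeqStore, and the sample records are all hypothetical; they do not reproduce the bioperl-db schema or API, and SQLite stands in for MySQL only to keep the example self-contained.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SimpleSeq:
    """Hypothetical sequence object with three stored attributes."""
    accession: str
    description: str
    residues: str


class SeqStore:
    """Stores SimpleSeq objects as rows and returns rows as objects."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sequence ("
            "accession TEXT PRIMARY KEY, description TEXT, "
            "residues TEXT NOT NULL)"
        )

    def store(self, seq):
        """Insert the object, replacing any row with the same accession."""
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO sequence VALUES (?, ?, ?)",
                (seq.accession, seq.description, seq.residues),
            )

    def fetch_by_description(self, keyword):
        """Return all stored objects whose description contains keyword."""
        rows = self.conn.execute(
            "SELECT accession, description, residues FROM sequence "
            "WHERE description LIKE ? ORDER BY accession",
            (f"%{keyword}%",),
        )
        return [SimpleSeq(*row) for row in rows]


if __name__ == "__main__":
    store = SeqStore()
    store.store(SimpleSeq("X001", "putative kinase", "MKTAYIAK"))
    store.store(SimpleSeq("X002", "heat shock protein", "MSKGPAVG"))
    print(store.fetch_by_description("kinase"))
```

Each attribute of the object corresponds to one column, which is what lets a loader write objects and a query return them without manual conversion.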

Biopython and Biojava


Biopython and biojava are open source projects with very similar goals to
bioperl. However, their code is implemented in Python and Java,
respectively. With the development of interface objects and biocorba, it is
possible to write Java or Python objects which can be accessed by a bioperl
script, or to call bioperl objects from Java or Python code. Since biopython
and biojava are more recent projects than bioperl, most effort to date has
been to port bioperl functionality to biopython and biojava rather than the
other way around. However, in the future, some bioinformatics tasks may
prove to be more effectively implemented in Java or Python, in which case
being able to call them from within bioperl will become more important.
For more information, go to the biojava (http://biojava.org/) and biopython
(http://biopython.org/) websites.

Web services in bioinformatics


SOAP and REST-based interfaces have been developed for a wide variety
of bioinformatics applications allowing an application running on one
computer in one part of the world to use algorithms, data and computing
resources on servers in other parts of the world. The main advantages
derive from the fact that end users do not have to deal with software and
database maintenance overheads.
Basic bioinformatics services are classified by the EBI into three
categories: SSS (Sequence Search Services), MSA (Multiple Sequence
Alignment), and BSA (Biological Sequence Analysis). The availability of
these service-oriented bioinformatics resources demonstrates the
applicability of web-based bioinformatics solutions, which range from a
collection of standalone tools with a common data format under a single,
standalone or web-based interface, to integrative, distributed and
extensible bioinformatics workflow management systems.
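How a client prepares calls to a REST-style analysis service can be sketched in Python with the standard library. The base URL, the run and status path layout, and the parameter names below are hypothetical placeholders, not the interface of any particular provider; the sketch only builds the requests and does not send them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder; a real service publishes its own base URL.
BASE_URL = "https://example.org/services/rest"


def build_submit_request(tool, params, base_url=BASE_URL):
    """Build a POST request that would submit a job to the named tool."""
    body = urlencode(params).encode("utf-8")
    return Request(f"{base_url}/{tool}/run", data=body, method="POST")


def build_status_request(tool, job_id, base_url=BASE_URL):
    """Build a GET request that would ask for the state of a job."""
    return Request(f"{base_url}/{tool}/status/{job_id}", method="GET")


if __name__ == "__main__":
    submit = build_submit_request(
        "blast", {"program": "blastp", "sequence": "MKT"}
    )
    print(submit.get_method(), submit.full_url, submit.data)
    status = build_status_request("blast", "job-123")
    print(status.get_method(), status.full_url)
```

Passing such a request to urllib.request.urlopen would perform the call; the submit, poll and retrieve cycle is what frees the end user from installing the software and databases locally.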

Bioinformatics workflow management systems


A bioinformatics workflow management system is a specialized form of a
workflow management system designed specifically to compose and
execute a series of computational or data manipulation steps, or a
workflow, in a bioinformatics application. Such systems are designed to
• provide an easy-to-use environment for individual application
scientists themselves to create their own workflows
• provide interactive tools for the scientists enabling them to execute
their workflows and view their results in real time
• simplify the process of sharing and reusing workflows between the
scientists
• enable scientists to track the provenance of the workflow execution
results and the workflow creation steps.
Currently, there are at least three platforms giving this service: Galaxy,
Taverna and Anduril.

Rosalind
Rosalind is an educational resource and web project for learning
bioinformatics through problem solving and programming. Rosalind users
learn bioinformatics concepts through a problem tree that builds up
biological, algorithmic, and programming knowledge concurrently. Each
problem is checked automatically, allowing for the project to also be used
for automated homework testing in existing classes.
Rosalind is a joint project between the University of California at San
Diego and Saint Petersburg Academic University along with the Russian
Academy of Sciences. The project's name commemorates Rosalind
Franklin, whose X-ray crystallography with Raymond Gosling facilitated
the discovery of the DNA double helix by James D. Watson and Francis
Crick. It was recognized by Homolog.us as the Best Educational Resource
of 2012 in their review of the Top Bioinformatics Contributions of 2012.
As of March 2013, it hosts over 5,000 active users.

References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. 1990.
Basic local alignment search tool, J. Mol. Biol., 215: 403.
Durbin, B.P., Hardin, J.S., Hawkins, D.M., Rocke, D.M. 2002. A
variance-stabilizing transformation for gene-expression microarray data,
Bioinformatics, 18(Suppl. 1): S105.
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. 1998. Cluster
analysis and display of genome-wide expression patterns, Proc Natl
Acad Sci (USA), 95: 14863.
Liu, J.S., Neuwald, A.F., Lawrence, C.E. 1995. Bayesian models for
multiple local sequence alignment and Gibbs sampling strategies, J
Amer Stat Assoc, 90: 1156.
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D.,
Friedman, N. 2003. Module networks: identifying regulatory
modules and their condition-specific regulators from gene
expression data, Nat Genet, 34(2): 166.
Sheng, Q., Moreau, Y., De Moor, B. 2003. Biclustering microarray data by
Gibbs sampling, Bioinformatics, 19(Suppl. 2): ii196.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen,
M.B., Brown, P.O., Botstein, D., Futcher, B. 1998. Comprehensive
identification of cell cycle-regulated genes of the yeast
Saccharomyces cerevisiae by microarray hybridization, Molecular
Biology of the Cell, 9: 3273.
Thijs, G., Marchal, K., Lescot, M., Rombauts, S., De Moor, B., Rouze, P.,
Moreau, Y. 2002. A Gibbs sampling method to detect over-represented
motifs in upstream regions of coexpressed genes, Journal
of Computational Biology, 9(2): 447.
Xue, J., Zhao, S., Liang, Y., Hou, C., Wang, J. Bioinformatics and its
applications in agriculture, College of Biological and Agricultural
Engineering, Jilin University, 5988 Renmin Street, Changchun, Jilin,
130022, P.R. China.

ICAR-National Academy of Agricultural Research Management
(ISO 9001:2008 Certified)
Rajendranagar, Hyderabad-500030, Telangana, India
https://www.naarm.org.in
