Bioinformatics - Trends in Gene Expression Analysis
Abstract— Bioinformatics combines biology and the computational sciences, helping scientists and researchers carry out more biological experiments to improve the lives of living beings. Gene expression is a fundamental process of cell biology. It is also responsible for the genetic as well as the physical and biochemical characteristics of an organism. Gene expression analysis helps to predict the resultant protein product, to identify abnormal functioning of cells that may be responsible for various diseases, and to design new drugs. For such analysis, the DNA microarray is an important tool because a large number of genes can be observed simultaneously. The output of a DNA microarray is a vast database that must be processed by computational tools to extract its biological significance. These tools include various algorithms from data mining, pattern recognition, support vector machines, etc. A vast literature demonstrates different algorithms and their results for this kind of analysis. Finding a single algorithm that satisfies all the requisite constraints is still an open research topic. In this paper, we discuss the major computational tools and methods that have been used over the years as trends in gene expression analysis. We also discuss the major difficulties in applying these methods to databases for analysis.

Keywords: Bioinformatics, Gene Expression Analysis, DNA Microarray, Data mining, Support vector machines

978-1-4244-6775-4/10/$26.00 © 2010 IEEE

I. INTRODUCTION

Bioinformatics is an interdisciplinary field that includes molecular biology, computer science, artificial intelligence, statistics and mathematics. It is basically the study of how to model, organize, understand and discover interesting information associated with large-scale molecular biological databases. The term Bio (molecular biology) informatics (information technology) encompasses the computational tools and methods used to manage, analyze and manipulate large sets of biological data. Bioinformatics has three major components: first, Databases, which allow storage and management of biological data; second, Algorithms and Statistics, which determine relationships among these large data sets; and last, Computational tools, which are required to analyze and interpret biological data. Bioinformatics is largely, although not exclusively, a computer-oriented discipline for the following reasons: the growth of large molecular databases, the complexity of biological data, and the advancement of internet technologies. Analyzing and manipulating this complex data requires vast, complex and repeated computations, a task for which computers are best suited.

The aim of Bioinformatics is to guide biological experiments so that we can model biological macromolecules and complexes, identify problems such as various diseases and their causes, and design new drugs, peptides and proteins to improve human life. The current research topics in Bioinformatics are sequence alignment, gene finding, protein structure alignment, protein structure prediction, gene expression analysis and genome analysis.

In the following sections we discuss the components of Bioinformatics, various research topics in Bioinformatics, and lastly microarray gene expression analysis.

II. COMPONENTS OF BIOINFORMATICS

A. Databases
In Bioinformatics, biological databases play an important role in understanding biological phenomena. These databases are very large and complex, and hence storing, accessing and manipulating their data efficiently is itself a research topic. Biological databases can be categorized as sequence databases, microarray databases, genome databases, protein structure databases and many more. The database contents[1] are semi-structured data and can be represented in the form of tables. A single table is a group of records, which may include data attributes, text descriptors, citations and ontology classifications. Sequence databases hold sequence information for all organisms. GenBank, EMBL and DDBJ are the largest databanks in this category. Microarray databases contain microarray gene expression data obtained under various biological conditions. Example databases in this category are ArrayExpress, Gene Expression Omnibus and GPX. Genome databases collect the genome sequences (genes and non-coding regions) of organisms and provide analyses of them. Xenbase, Corn, RGD and SEED are some example databases in this category. These databases may contain the genomes of many species or the genome of a single organism. Protein structure databases cover the complete domain of protein structures, organized by similarities of their amino acid sequences and three-dimensional structures. Databases in this category include PDB, SCOP and CATH. A vast amount of biological data is also available in text form in databases such as PubMed and OMIM.

B. Computational Tools
One of the aims of Bioinformatics is to increase the understanding of biological processes, and for this, huge
databases alone are not sufficient; hence we require computational tools to process this information and convert it into knowledge. Pattern recognition, data mining and machine learning are some of the computational tools required for this purpose. In pattern recognition systems, we try to find patterns in the available raw data by applying some computational procedure. Pattern recognition includes two important methods, supervised and unsupervised learning. Data mining is an advanced process of database management. It extracts important patterns from the gathered data. To do so, it applies different analyses such as cluster analysis, frequent pattern analysis and many more to find hidden patterns in data samples. Machine learning is a branch of artificial intelligence in which the focus is first to find complex patterns using statistics and probability theory; decisions are then made on the basis of the identified patterns.

C. Algorithms and Software
Various programs and tools are available that can be applied to the databases discussed above for analysis and modeling. BLAST[6] and Genscan are widely used sequence analysis programs. BLAST finds regions of local similarity between sequences; it compares the sequences and reports statistical information about the matches. FASTA/SSEARCH[7] provides sequence similarity searching against protein databases. The Smith-Waterman algorithm[9] searches protein databases to find the sequences with the best alignment. The GCG[10] software is a collection of molecular biological analysis programs. ENTREZ[11] simultaneously searches biological information in different databases. Other software and algorithms include CHROMAS[8], Gene Explorer 1.4, MAGE, etc.

III. RESEARCH TOPICS IN BIOINFORMATICS

A. Sequence alignment
Sequences are the nucleotide sequences of DNA and RNA or the amino acid sequences of proteins. Sequence alignment[2] is useful for discovering functional, structural and evolutionary information in biological sequences, and this requires comparison of sequences. Sequences that are found to be similar probably have the same function, such as a regulatory role in the case of DNA molecules, or a similar three-dimensional structure in the case of proteins. In addition, if sequences are similar, they may derive from a common ancestral sequence and can be considered homologous. An alignment indicates the changes that could have occurred between two homologous sequences and a common ancestor sequence during evolution. Serafim[4] discusses different alignment methods for large-scale genomic comparisons and suggests possible directions that will be explored in the near future. Recently, multiple sequence alignments have become very widely used in all areas of DNA and protein sequence analysis[3]. This raises the practical question of how to combine three-dimensional structure information with primary sequences to give more accurate alignments when structures are available. There is considerable research scope in using multiple sequence alignments with reduced resource requirements such as minimum space and minimum time[5].

B. Gene finding/prediction
Once the sequencing of a genome is complete, gene finding/prediction[12] is one of the first and most important steps in understanding the genome of a species. Once the sequence has been obtained, the most likely protein-encoding regions are identified, and the sequence is then annotated through databases to find genes. The SNAP gene finder is similar to Genscan and attempts to find genes using a hidden Markov model (HMM). A few recent approaches such as CONTRAST[14] and mGene[13] use support vector machines for gene prediction.

C. Protein structure alignment and prediction
Understanding the relationship between sequence and three-dimensional structure in proteins is very important and one of the big challenges in Bioinformatics. If there is a direct relationship, then the structure of a protein can be predicted directly by analyzing its sequence. The polypeptide chain is first formed on the ribosome, following the codon sequence. The secondary structure is formed through hydrogen bonds between amino acids in the chain. Because of further interactions among the amino acids, these secondary structures fold into a three-dimensional structure. How the interactions among amino acids are formed is still a research topic. Prediction of protein structure from secondary structure is achievable using different methods, but the accuracy of these methods is only approximately 70-75%. There is vast scope to design such methods so that the matching can be improved. Recent developments in structural proteomics[15] for protein structure determination include instrumental methods such as X-ray crystallography and NMR spectroscopy, and computational methods such as comparative and de novo structure prediction and molecular dynamics simulations. The relationship between protein sequence, structure and function causes many difficulties for prediction methods. The highly complex nature of these relationships is a consequence of the interplay between physics and evolution, which has been studied using a wide array of experimental and theoretical techniques. Recent work[16] reviews and designs methods that conserve sequence, structure and function.

D. Gene expression analysis
In genetics, gene expression is the most fundamental level at which the genetic constitution of a cell or organism gives rise to its biochemical and physiological characteristics. The genetic code is interpreted by gene expression, and the properties of the expression products give rise to the organism's phenotype. Major research is concentrated on the analysis of this interpreted data. A number of tools and techniques are available for such analysis, such as DNA microarray, SAGE, tiling array, etc. We discuss these topics in detail in a later part of this paper.

E. Genome analysis
Genome analysis is the complete study of the full genomes of organisms. It includes the identification of genes and the prediction of their functions.
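
To make the notion of "regions of local similarity" used by BLAST and the Smith-Waterman algorithm (Section II.C) more concrete, the following is a minimal Python sketch of Smith-Waterman local alignment scoring by dynamic programming. The function name and the scoring values (match +2, mismatch -1, linear gap penalty -1) are illustrative assumptions and are not taken from this paper or from any particular tool; production tools use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local alignment score, end position) for sequences a and b.

    Scoring values are illustrative assumptions, not parameters of a real tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap introduced in b
            left = H[i][j - 1] + gap    # gap introduced in a
            # Flooring at 0 lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Under these assumed parameters, the two example sequences share only the region ACG, so the best local score is 6 (three matches). Flooring each cell at zero is what distinguishes local from global alignment: a poorly matching prefix does not penalize a well matching region later in the sequences.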