Bioinformatics: Trends in Gene Expression Analysis

Shital A. Raut, S. R. Sathe
Dept. of Electronics & Computer Science
Visvesvaraya National Institute of Technology
Nagpur, India

Adarsh Raut
Dept. of Software
ADCC, IT Park
Nagpur, India

Abstract—Bioinformatics is a combination of biology and computational science that helps scientists and researchers carry out more biological experiments to improve the lives of living beings. Gene expression is fundamental to cell biology. It is also responsible for the genetic as well as the physical and biochemical characteristics of an organism. The study of gene expression helps in predicting the resulting protein product, in identifying abnormal functioning of cells that may be responsible for various diseases, and in designing new drugs. For analysis, the DNA microarray is an important tool because a large number of genes can be observed simultaneously. The output of a DNA microarray experiment is a vast database that must be processed by computational tools to extract its biological significance. These tools include various algorithms from data mining, pattern recognition, support vector machines, etc. A vast literature demonstrates different algorithms and their results for this analysis. Finding a single algorithm that satisfies all the requisite constraints is still a research topic. In this paper, we discuss the major computational tools and methods of recent years that represent trends in gene expression analysis. We also discuss the major difficulties in applying these methods to databases for analysis.

Keywords-Bioinformatics; Gene Expression Analysis; DNA Microarray; Data Mining; Support Vector Machines

I. INTRODUCTION

Bioinformatics is an interdisciplinary field that includes molecular biology, computer science, artificial intelligence, statistics and mathematics. It is basically the study of how to model, organize, understand and discover interesting information associated with large-scale molecular biological databases. The term Bio (molecular biology) informatics (information technology) encompasses the computational tools and methods used to manage, analyze and manipulate large sets of biological data. Bioinformatics has three major components: first, databases, which allow storage and management of biological data; second, algorithms and statistics, which determine relationships among these large data sets; and last, computational tools, which are required to analyze and interpret biological data. Bioinformatics is largely, although not exclusively, a computer-oriented discipline for the following reasons: the growth of large molecular databases, the complexity of biological data, and the advancement of internet technologies. Analyzing and manipulating this complex data requires complex, repeated and vast computations, a task for which computers are best suited.

The aim of Bioinformatics is to guide biological experiments so that we can model biological macromolecules and complexes, identify problems such as various diseases and their causes, and design new drugs, peptides and proteins to improve human life. The current research topics in Bioinformatics are sequence alignment, gene finding, protein structure alignment, protein structure prediction, gene expression analysis and genome analysis.

In the following sections we discuss the components of Bioinformatics, various research topics in Bioinformatics, and lastly microarray gene expression analysis.

II. COMPONENTS OF BIOINFORMATICS

A. Databases
In Bioinformatics, biological databases play an important role in understanding biological phenomena. These databases are very large and complex, and hence storing, accessing and manipulating their data efficiently is itself a research topic. Biological databases can be categorized as sequence databases, microarray databases, genome databases, protein structure databases and many more. The database contents [1] are semi-structured data and can be represented in the form of tables. A single table is a group of records that may include data attributes, text descriptors, citations and ontology classifications. Sequence databases hold sequence information for all organisms. GenBank, EMBL and DDBJ are the largest databanks in this category. Microarray databases contain microarray gene expression data recorded under various biological conditions. Example databases in this category are ArrayExpress, Gene Expression Omnibus and GPX. Genome databases collect organism genome sequences (genes and non-coding regions) and provide analyses of them. Xenbase, Corn, RGD and SEED are some example databases in this category. These databases may contain the genomes of many species or the genome of a single organism. Protein structure databases cover the complete domain of protein structures, organized by similarities in amino acid sequence and three-dimensional structure. Some of the databases in this category are PDB, SCOP and CATH. Vast biological data is also available in text form in databases such as PubMed and OMIM.

B. Computational Tools
One of the aims of Bioinformatics is to increase the understanding of biological processes, and for this, huge

978-1-4244-6775-4/10/$26.00 ©2010 IEEE
databases alone are not sufficient; hence we require computational tools to process this information and convert it into knowledge. Pattern recognition, data mining and machine learning are some of the computational tools required for this purpose. In pattern recognition systems, we try to find patterns in the available raw data by applying some computational procedure. Pattern recognition includes two important approaches, supervised and unsupervised learning. Data mining is an advanced process of database management that extracts important patterns from gathered data. To do so, it applies different analyses, such as cluster analysis, frequent pattern analysis and many more, to find hidden patterns in data samples. Machine learning is the part of artificial intelligence whose focus is first to find complex patterns using statistics, probability theory or artificial intelligence techniques; decisions are then made on the basis of the identified patterns.

C. Algorithms and Software
Various programs and tools are available that can be applied to the databases discussed above for analysis and modeling. BLAST [6] is used for sequence comparison: it finds regions of local similarity between sequences and calculates the statistical significance of the matches. Genscan predicts genes in genomic sequences. FASTA/SSEARCH [7] provides sequence similarity searching against protein databases. The Smith-Waterman algorithm [9] searches protein databases to find the sequences with the best local alignment. The GCG [10] software is a collection of molecular biology analysis programs. ENTREZ [11] searches biological information in different databases simultaneously. Other software and algorithms include CHROMAS [8], Gene Explorer 1.4, MAGE, etc.

III. RESEARCH TOPICS IN BIOINFORMATICS

A. Sequence alignment
Sequences are the nucleotide sequences of DNA and RNA or the amino acid sequences of proteins. Sequence alignment [2] is useful for discovering functional, structural and evolutionary information in biological sequences, and it requires the comparison of sequences. Sequences that are found to be similar probably have the same function, such as a regulatory role in the case of DNA molecules or a similar three-dimensional structure in the case of proteins. In addition, if sequences are similar, they may derive from a common ancestral sequence and can be regarded as homologous. An alignment indicates the changes that could have occurred between two homologous sequences and a common ancestor sequence during evolution. Different alignment methods are discussed by Batzoglou [4], who surveys large-scale genomic comparisons and suggests possible directions that will be explored in the near future. Recently, multiple sequence alignments have become very widely used in all areas of DNA and protein sequence analysis [3]. This raises the practical question of how to combine three-dimensional structure information with primary sequences to give more accurate alignments when structures are available. There is large research scope for multiple sequence alignment methods with reduced resource requirements, such as minimum space and minimum time [5].

B. Gene finding/prediction
Once sequencing is complete, gene finding/prediction [12] is one of the first and most important steps in understanding the genome of a species. Once a sequence has been obtained, the most likely protein-encoding regions are identified, and the sequence is then annotated using databases to find genes. The SNAP gene finder is similar to Genscan and attempts to find genes using a hidden Markov model (HMM). A few recent approaches, such as CONTRAST [14] and mGene [13], use support vector machines for gene prediction.

C. Protein structure alignment and prediction
One of the big challenges in Bioinformatics is to understand the relationship between nucleotide sequence and three-dimensional structure in proteins. If there is a direct relationship, then the structure of a protein can be predicted directly by analyzing the nucleotide sequence. The polypeptide chain is first formed on the ribosome, which reads the codon sequence of the messenger RNA. The secondary structure is formed through hydrogen bonds between amino acids in the chain. Through further interactions among the amino acids, these secondary structures fold into a three-dimensional structure. How the interactions among amino acids are formed is still a research topic, although prediction of protein structure from secondary structure is achievable using different methods. The accuracy of these methods is only approximately 70-75%, so there is vast scope to design methods that improve it. Recent developments in structural proteomics [15] for protein structure determination include instrumental methods such as X-ray crystallography and NMR spectroscopy, and computational methods such as comparative and de novo structure prediction and molecular dynamics simulations. The relationship between protein sequence, structure and function causes many difficulties for prediction methods. The highly complex nature of these relationships is a consequence of the interplay between physics and evolution, which has been studied using a wide array of experimental and theoretical techniques. Recent research [16] reviews and designs methods to conserve sequence, structure and function.

D. Gene expression analysis
In genetics, gene expression is the most fundamental level at which the genetic constitution of a cell of an organism gives rise to its biochemical and physiological characteristics. The genetic code is interpreted by gene expression, and the properties of the expression products give rise to the organism's phenotype. Major research is concentrated on the analysis of this interpreted data. A number of tools and techniques are available for this analysis, such as DNA microarrays, SAGE and tiling arrays. We discuss these topics in detail in a later part of this paper.

E. Genome analysis
It is a complete study of the full genomes of all organisms. It includes identification of genes and prediction of their functions.
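
The gene identification step mentioned above can be illustrated with a minimal Python sketch of a naive open reading frame (ORF) scan, which looks for a start codon followed in the same reading frame by a stop codon. This is only an illustration of locating candidate protein-encoding regions; it is not the HMM-based approach of Genscan or SNAP, nor the support vector machine approach of CONTRAST or mGene. The function name `find_orfs`, the `min_codons` parameter and the example sequence are assumptions made for this sketch.

```python
# Naive ORF scan on the forward strand (illustrative only, not a real gene finder).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ORFs found in the three forward frames.

    An ORF here runs from an ATG to the next in-frame stop codon.
    `end` is exclusive and includes the stop codon; `min_codons` counts
    the codons before the stop codon (the start codon included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs

# Made-up example sequence: one ORF, ATG GCT TGG AAA TAA, in frame 2.
example = "CCATGGCTTGGAAATAAGG"
print(find_orfs(example))  # → [(2, 17, 2)]
```

A practical gene finder would also scan the reverse strand and model exons, introns and splice sites, which is why statistical models are used in the tools cited in Section III-B.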

98 2010 International Conference on Bioinformatics and Biomedical Technology


The complete analysis includes an overall study of all available genomes. A newly identified gene in another organism can be compared with the existing database of information to find whether or not it has a similar function. This comparison is very helpful for establishing an organism's biochemical, cellular or developmental model. Such analysis has proved useful for detecting various diseases and for drug design.

IV. MICROARRAY GENE EXPRESSION ANALYSIS

The gene is a basic unit of all organisms. Genes hold the information needed to build and maintain an organism's cells. In cells, genes reside in DNA, which contains coding and non-coding sequences. Each kind of sequence has its own function: coding sequences determine what the gene does, and non-coding sequences determine when the gene should be active.

When a gene is active, it expresses its genetic code, and this is called gene expression. When a gene is active, its coding and non-coding sequences are copied in a process called transcription, producing RNA. This RNA later directs the synthesis of proteins using the genetic code. Here we discuss the background of the topic: the central dogma of molecular biology, what gene expression analysis is, how microarrays are useful for it, and finally computational methods for the analysis of gene expression.

A. Central Dogma of Molecular Biology
When genes are active, their coding and non-coding sequences are copied to produce RNA; this process is called 'transcription'. After extensive processing, this RNA is used to produce proteins, which is called 'translation'. The transfer of information from DNA to RNA and from RNA to protein is called the central dogma of molecular biology. The first step, transcription, takes place in the nucleus, and the second step, translation, takes place in the cell cytoplasm. Transcription and translation are highly regulated processes that respond to constantly changing environments. Regulation of a single gene can give rise to thousands of messenger RNAs encoding functionally distinct proteins. The roughly 30,000 genes of the human genome can express hundreds of thousands of proteins, each with a specific role to play. Our current knowledge is limited to this information, which is considered to be a small fraction of the information embodied in the molecule. Many researchers are developing powerful tools that will enable large-scale analysis.

B. Gene Expression Analysis
Gene expression is the process in which the information in a gene is used for the synthesis of proteins, which form the building blocks of cells. The conversion of genes to proteins is considered to be one of the most complicated biological processes. Scientists are researching the functions of genes, their expression under particular environmental conditions, their protein products, etc. This research will help us understand the complex processes by which each gene in the DNA sequence is "expressed", i.e. when, where and to what extent the gene is stimulated to produce the protein it encodes. The information drawn from this gives clues for the detection of diseases, the functional regularity of cells and drug design.

C. Microarray Gene Expression Analysis
Various tools and techniques are available for gene expression analysis, such as SAGE, DNA microarrays and tiling arrays, but the most popular technique is the DNA microarray. A microarray generally consists of thousands of DNA molecules taken from cells under specific conditions. The molecules are distributed as an array of spots on the surface of a glass slide. Solutions of fluorescently labeled DNA or RNA are poured over the array, and each molecule in the solution searches for a matching partner on the surface according to the pairing rules of nucleotide bases. Genes whose molecules pair with a complementary molecule are said to be turned on, and their level of expression is indicated by the fluorescent marker on each spot. These turned-on genes play an important role in cellular or tissue function under the specific condition. The advantage of the DNA microarray is that we are able to measure the levels of expression of tens of thousands of genes. Once the data from such tools have been accumulated, it is essential to extract their significance using computational analysis.

D. Computational Analysis of Gene Expression Data
With the rapid advance of microarray technology, gene expression data are being generated at high throughput. Hence computational analysis is required to learn functionally significant classifications of genes. Currently, computational analysis for the classification of genes falls into two categories, supervised and unsupervised methods. Supervised methods assume that each gene expression profile is associated with a known class, and they use this association in the learning process. Unsupervised methods, on the other hand, make no assumption about any gene expression profile. Unsupervised methods cannot always separate classes correctly, as they use unlabeled data to identify classes through clusters of gene expression data. Supervised methods claim to overcome this problem by exploiting biological and medical knowledge of the problem domain, using labeled data to identify and separate classes directly. Support Vector Machines (SVM) and other supervised machine learning methods have been applied to the analysis of DNA microarray gene expression data in order to classify functional groups of genes and multiple tumor types [16,17]. Valentini [18] discussed the method and results of supervised gene expression data analysis using Support Vector Machines and Multi-Layer Perceptrons.

An important aspect of gene expression analysis is to find the genes, analyze them and extract the information relevant to a certain phenotype. Supervised methods select genes based on a known phenotype. However, a certain set of genes may correspond to a new phenotype for which there is no predefined label; hence there is a need to develop unsupervised methods that try to discover such phenotype information. Ding [19] discussed such a method in his paper on unsupervised feature selection via two-way ordering in gene expression analysis.

A vast literature is available specifically on gene expression analysis using data mining methods.
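
As an illustration of how an unsupervised method identifies classes through clusters of unlabeled gene expression data, the following is a minimal Python sketch of K-means clustering with Euclidean distance. It is a sketch under stated assumptions, not the procedure of any cited paper: the gene names and expression values are made up, and the function names `euclidean` and `kmeans`, the `max_iter` and `seed` parameters, and the choice of random initial centroids are all assumptions made for this example.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(profiles, k, max_iter=100, seed=0):
    """Partition profiles into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k profiles picked at random.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:   # no change, so the clustering has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:            # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical genes with made-up expression levels under four conditions.
genes = {
    "geneA": [0.1, 0.2, 2.9, 3.1],
    "geneB": [0.0, 0.3, 3.0, 2.8],
    "geneC": [2.9, 3.2, 0.1, 0.0],
    "geneD": [3.1, 3.0, 0.2, 0.1],
}
names = list(genes)
labels, _ = kmeans([genes[n] for n in names], k=2)
for cluster in range(2):
    print(cluster, [n for n, lab in zip(names, labels) if lab == cluster])
```

No class labels are supplied: genes with similar profiles end up in the same cluster purely because of their distance to the centroids. The number of clusters `k` has to be chosen in advance, which is one reason such methods cannot always separate classes correctly.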



Supervised data mining methods include class association rules and classification, while unsupervised data mining methods mainly refer to the various clustering methods. A considerable amount of research has demonstrated that accurate and inexpensive diagnosis can be achieved with class association rules, a supervised data mining method [20][21], because of their informativeness and succinctness. A clustering routine typically groups correlated genes or samples together to find co-regulated and functionally similar genes or similarly expressed samples.

The most popular clustering algorithms adopted for gene expression data include hierarchical clustering, which iteratively joins the two closest clusters beginning with singleton clusters; K-means, which typically uses Euclidean distances to partition the space into K parts [22]; SOM, a neural network algorithm [23]; and graph-theoretic approaches such as HCS [24].

The extremely high dimensionality and the complex correlations among genes pose a great challenge for the successful application of existing supervised and unsupervised algorithms to gene expression analysis. As an intermediate solution, Gong and Chen [25] proposed the use of semi-supervised learning algorithms. The suggested method uses both labeled and unlabeled data to classify microarray data. It claims higher classification accuracy than supervised methods and is more stable when labeled examples are very few.

ACKNOWLEDGMENT

We have sincerely gone through the available literature and tried to draw out the different trends in gene expression analysis. A huge number of research papers are available on the topics discussed above, but we have tried to focus on the major algorithms and major trends in computational tools.

REFERENCES

[1] Bourne P (August 2005). "Will a biological database be different from a biological journal?". PLoS Comput. Biol. 1 (3): 179–81. doi:10.1371/journal.pcbi.0010034. PMID 16158097.
[2] Mount, David W. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 2001. ebrary Reader e-book.
[3] Iain M. Wallace, Gordon Blackshields, Desmond G. Higgins. Multiple sequence alignments. Curr Opin Struct Biol. 2005 Jun; 15 (3): 261-6. PMID 15963889.
[4] Serafim Batzoglou. The many faces of sequence alignment. Brief Bioinform. 2005 Mar; 6 (1): 6-22. PMID 15826353.
[5] Robert C. Edgar. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004 Aug 19; 5: 113. PMID 15318951.
[6] http://blast.ncbi.nlm.nih.gov/Blast.cgi
[7] http://www.ebi.ac.uk/Tools/fasta33/index.html
[8] http://www.flu.org.cn
[9] http://www.itl.nist.gov/div897/sqg/dads/HTML/smithWaterman.html
[10] http://www.pbrc.hawaii.edu/gcg/
[11] www.ncbi.nlm.nih.gov/Entre
[12] Korf I. (2004-05-14). "Gene finding in novel genomes". BMC Bioinformatics 5: 59–67. doi:10.1186/1471-2105-5-59. PMID 15144565.
[13] Schweikert et al. (2009-05-19). "mGene.web: A Web Service for Accurate Computational Gene Finding". Nucleic Acids Research.
[14] Gross et al. (2007-12-20). "CONTRAST: A Discriminative, Phylogeny-free Approach to Multiple Informant De Novo Gene Prediction". Genome Biology 8 (12): R269. doi:10.1186/gb-2007-8-12-r269. PMID 18096039.
[15] Hsuan-Liang Liu, Jyh-Ping Hsu. Recent developments in structural proteomics for protein structure determination. Proteomics. 2005 May; 5: 2056-68. PMID 15846841.
[16] M. Brown et al. Knowledge-based analysis of microarray gene expression data by using Support Vector Machines. PNAS, 97(1), 2000.
[17] T.S. Furey et al. Support Vector Machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), 2000.
[18] Giorgio Valentini. Supervised gene expression data analysis using Support Vector Machines and Multi-Layer Perceptrons. http://homes.dsi.unimi.it/~valenti/papers/KES2002.
[19] C.H.Q. Ding. Unsupervised feature selection via two-way ordering in gene expression analysis. bioinformatics.oxfordjournals.org/cgi/content/abstract/19/10/1259, doi:10.1093/bioinformatics/btg149.
[20] Jinyan Li, Huiqing Liu, James R. Downing, Allen Eng-Juh Yeoh, and Limsoon Wong. Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. Bioinformatics, 19:71–78, 2003.
[21] Jinyan Li and Limsoon Wong. Using rules to analyse bio-medical data: A comparison between C4.5 and PCL. Proc. of 4th Int. Conf. on Web-Age Information Management, 2003.
[22] Chinatsu Arima, Taizo Hanai, and Masahiro Okamoto. Gene expression analysis using fuzzy k-means clustering. Genome Informatics, 14, 2003.
[23] T. Kohonen. Self-organizing maps. Springer, 1997.
[24] Erez Hartuv and Ron Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(200):175–181, 2000.
[25] Y.C. Gong and C. L. Chen. Semi-supervised Method for Gene Expression Data Classification with Gaussian Fields and Harmonic Functions. figment.csee.usf.edu/~sfefilat/data/papers/WeAT8.28.pdf
