
Hemby & Bahn (Eds.)
Progress in Brain Research, Vol. 158
ISSN 0079-6123
Copyright © 2006 Elsevier B.V. All rights reserved
DOI: 10.1016/S0079-6123(06)58004-5

CHAPTER 4

Functional genomics and proteomics in the clinical neurosciences: data mining and bioinformatics

John H. Phan, Chang-Feng Quo and May D. Wang

The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30322, USA

Corresponding author. E-mail: [email protected]

Abstract: The goal of this chapter is to introduce some of the available computational methods for expression analysis. Genomic and proteomic experimental techniques are briefly discussed to help the reader better understand these methods and their results in the context of their biological significance. Furthermore, a case study is presented that illustrates the use of these analytical methods to extract significant biomarkers from high-throughput microarray data.
Genomic and proteomic data analysis is essential for understanding the underlying factors that are
involved in human disease. Currently, such experimental data are generally obtained by high-throughput
microarray or mass spectrometry technologies among others. The sheer amount of raw data obtained using
these methods warrants specialized computational methods for data analysis.
Biomarker discovery for neurological diagnosis and prognosis is one such example. By extracting
significant genomic and proteomic biomarkers in controlled experiments, we come closer to understanding
how biological mechanisms contribute to neurodegenerative diseases such as Alzheimer's and how drug
treatments interact with the nervous system.
In the biomarker discovery process, there are several computational methods that must be carefully
considered to accurately analyze genomic or proteomic data. These methods include quality control,
clustering, classification, feature ranking, and validation.
Data quality control and normalization methods reduce technical variability and ensure that discovered
biomarkers are statistically significant. Preprocessing steps must be carefully selected since they may
adversely affect the results of the subsequent expression analysis steps, which generally fall into two
categories: unsupervised and supervised.
Unsupervised or clustering methods can be used to group similar genomic or proteomic profiles and
therefore can elucidate relationships within sample groups. These methods can also assign biomarkers to
sub-groups based on their expression profiles across patient samples. Although clustering is useful for
exploratory analysis, it is limited due to its inability to incorporate expert knowledge.
On the other hand, classification and feature ranking are supervised, knowledge-based machine learning
methods that estimate the distribution of biological expression data and, in doing so, can extract important
information about these experiments. Classification is closely coupled with feature ranking, which is
essentially a data reduction method that uses classification error estimation or other statistical tests to score
features. Biomarkers can subsequently be extracted by eliminating insignificantly ranked features.
These analytical methods may be equally applied to genetic and proteomic data. However, because of
both biological differences between the data sources and technical differences between the experimental
methods used to obtain these data, it is important to have a firm understanding of the data sources and
experimental methods.
At the same time, regardless of the data quality, it is inevitable that some discovered biomarkers are false
positives. Thus, it is important to validate discovered biomarkers. The validation process may be slow; yet,
the overall biomarker discovery process is significantly accelerated due to initial feature ranking and data
reduction steps. Information obtained from the validation process may also be used to refine data analysis
procedures for future iteration. Biomarker validation may be performed in a number of ways — bench-side
in traditional labs, web-based electronic resources such as gene ontology and literature databases, and
clinical trials.

Introduction

Genomics, the study of gene expression, is fundamental for two reasons. First, genes encode information that is the basic blueprint of life. Second, genes are the vehicles responsible for the transmission of hereditary material from one generation to the next. Gene sequences, or DNA, are relatively simple to analyze in the sense that gene function is neatly contained and determined by the primary structure, which is the order of base pairs. In addition, the flavor of base pairs is constrained to four basic nucleotides, with the exception of some special cases. Interaction between base pairs is readily described by a simple complementary pairing rule.

On the other hand, proteins are the active agents in the cell that ultimately determine the cellular characteristics. Briefly, proteins carry out intra- and intercellular processes and facilitate communication within cells as well as between cells and larger systems. In comparison with genes, protein function may be determined by up to four levels — primary, secondary, tertiary, and quaternary structures. The primary structure of a protein is the amino acid sequence directly encoded from DNA base pairs, while secondary, tertiary, and quaternary structures are derived from various degrees of protein folding and aggregation. There is also greater variety in the flavor of amino acids, as evident from the 20 natural amino acids apart from other rare species. Interaction between amino acids arises as a function of multiple chemical factors.

The completion of the Human Genome Project created new inroads into the study of complex biological systems, such as the human body. Given the functional complexity of the human body, it is somewhat surprising that only a relatively small number of genes are required to encode this information. Expectedly, it turns out that among other classes of compounds, proteins play a leading role in expressing the wide array of functions within the human body. Consequently, the study of protein expression, or proteomics, refers to any procedure that characterizes large sets of proteins; this includes composition, modification, quantification, localization, and functional interaction (Pandey and Mann, 2000; Fields, 2001; Aebersold and Mann, 2003; Glish and Vachet, 2003; Steen and Mann, 2004).

Experimentally, because of the ease with which genes may be sequenced, as well as the relative simplicity of experiments that assay genes compared to proteins, genomics has been more widely exploited as a high-throughput means for screening. Furthermore, genomics is essential for detecting and possibly treating disease conditions in their primary manifestations in DNA resulting from inheritance and mutation. At the same time, while we may obtain a vague idea of the quality and quantity of expressed gene products by studying gene expression profiles, this level of assessment is insufficient to determine the dynamic state of cellular processes. In many instances, such as post-translational modifications, gene expression is not easily correlated to protein expression. Thus, genomics and proteomics are complementary efforts that must be coupled to reveal the secrets of complex biological systems, in particular, the human body.

Microarrays and mass spectrometry (MS) are currently common genomic and proteomic technologies that play a vital role in the clinical neurosciences, providing the tools to describe, quantify and ultimately predict the behavior of neurological systems. We may be interested in the signaling pathways involved in disease pathologies or the response of the neurological system to therapeutic agents. To meet the demands of current clinical interests, however, microarrays and MS are used to assay genomic and proteomic expression levels on a large scale, enabling a bird's-eye view of neurological systems that is, at the same time, voluminous. Consequently, to utilize experimental data fruitfully, we need data mining and, in a broader context, bioinformatics.

The objective of bioinformatics is to discover significant biomarkers. Here, we are presented with several challenges. As with all biological experiments, we need to deal with technical and biological variability, or noise, in the data. Because multiple platforms exist for microarrays, these experiments are highly variable across platforms. In addition, the complexity of the multi-step process contributes to technical variability. This problem is also inherent in MS. Thus, it is important to understand the construct of the experiments, including concrete details, so that we are aware of the limits of the data obtained as well as the kinks in the process where unwanted variability is introduced. Furthermore, within a single patient, or clinical source, pathologies are non-uniform, leading to biological variability. Quality control, both experimental and analytical, coupled with normalization methods, are steps toward reducing technical and biological noise.

At the same time, the relatively high cost of obtaining patient or clinical samples is a prohibitive factor that contributes to the common phenomenon of 'ill-posed' problems, especially in bioinformatics. While microarrays and MS may assay samples in multiple dimensions (on the order of 10^4 in genomic studies), the small number of patient samples leads to statistical problems.

Subsequently, while robust algorithms may produce mathematically valid results, these results must also be biologically relevant to be useful for clinical applications. The problem of false discovery is non-trivial given that there is no unique solution to ill-posed problems. Various tools exist for the purpose of validating analytical results with independent sources such as literature annotation or clinical experiments. The results of validation also provide performance measures and feedback for further iteration of data-mining methods. Finally, validated findings from bioinformatics then lead to clinical tools and applications. In the data-mining process, each step — quality control, normalization, biomarker discovery, interpretation, and validation — informs the others. These challenges will be addressed in detail in later sections so that we may achieve accurate analyses of genomic and proteomic data for potential clinical applications.

Experimental methods

Genomic technology

The production of mRNA in cells is the first step in the expression of genes to functional proteins. By quantifying mRNA expression, microarray technology is able to roughly measure genetic processes. The concept that underscores genomic microarray technology is the specific and complementary hybridization of nucleotide sequences. Generally, mRNA can be isolated from cells and exposed to an array of complementary sequences to which the mRNA sequences of interest have high affinity. Hybridization can then be measured by fluorescence. The two primary microarray technologies based on this premise that are widely used today are cDNA (complementary DNA) and oligonucleotide microarrays (Liao, 2005).

cDNA microarrays

cDNA microarrays, developed by Schena et al. (1995), are based on long sequences (0.6–2.4 kb) of cDNA fixed, or printed, onto a substrate (usually a glass slide) in a spotted matrix such that each spot on the array corresponds to a specific gene or transcript. cDNA sequences are selected from libraries of gene sequences and amplified using polymerase chain reaction (PCR).

RNA is extracted and isolated from separate control and test cells, reverse transcribed, then amplified with PCR. During the process of PCR, special fluorescent base pairs, Cy5 and Cy3, are incorporated into the cDNA for tagging purposes. Tags are incorporated into the control and test products in order to distinguish the different cases; for example, Cy3 in control and Cy5 in test cases represent green and red fluorescence, respectively. cDNA from both control and test cases is mixed and allowed to hybridize to the glass slide. After the microarray is washed, each cDNA spot on the array, consisting of many similar sequences, will have hybridized with sequences from both the control and test cases.

Theoretically, the amount of hybridization from each case, control or test, is proportional to the amount of cDNA, which is proportional to the amount of original mRNA expression. Each spot on the microarray can then be quantified by analyzing the ratio of fluorescence of each of the two colors, red or green. Typically, the log ratio of the two fluorescence colors is used for analysis such that, for example, a positive number would indicate over-expression of the test case relative to the control case and a negative number would indicate under-expression.

Oligonucleotide microarrays

Oligonucleotide microarrays consist of short nucleotide sequences (on the order of 20–60 base pairs) rather than the long sequences in cDNA microarrays. A photolithographic technique developed by Affymetrix (Lipshutz et al., 1999) enables production of high-density microarrays by building nucleotide sequences directly onto a substrate. Unfortunately, this method limits the length of sequences to 25 nucleotides, reducing the sensitivity and specificity of hybridization. Careful design of the microarray and selection of sequences, however, can overcome this problem. Because of the length limitation, sequences must be selected that are unique to the transcript or target gene, called perfect match sequences. For real biological experiments, different target genes may have many regions with similar sequences; therefore, several different perfect match sequences may be required for high specificity to a single target.

In addition to perfect match sequences, mismatch sequences are included on the chip. Mismatch sequences are the same as perfect match sequences with the exception of a single nucleotide. These sequences are used to quantify the amount of specificity. Sequences that normally would bind to perfect match sequences have a small chance of binding to mismatch sequences, and the amount of non-specific binding is subtracted from the amount of specific, perfect binding. Statistically, this technique corrects any bias caused by non-specific binding. As with cDNA arrays, mRNA from test cells is extracted and hybridized to oligonucleotide arrays, except that only a single fluorescence channel is necessary.

Quality control of microarrays

Variance and bias in microarray experiments can be introduced through a number of steps, including the extraction and amplification of mRNA, the design of chips to maximize hybridization specificity, dye intensity imbalances (depending on chip type), and the quantification of signals with image processing. Variability can be reduced by increasing the accuracy and precision of hardware (Zien et al., 2001), by removing outlier samples, or by increasing the number of replicates. This section will focus on the removal of outliers and various normalization methods to improve overall data quality.

Outlier removal

Model et al. (2002) describe methods of handling variations between single microarray slides and between batches of slides. They propose the use of multivariate statistical process control to detect deviations from normal working conditions. Once these deviations are detected, samples are either removed or replicates produced so that confident average values can be obtained. The simplest method of detecting outliers would be to measure the deviation of a gene on an array from the mean expression of that gene over all arrays, also known as the sample variance. The threshold of tolerance for variance can be defined depending on the desired t-distribution significance level, assuming that the data are normally distributed. If the number of outlier genes on an array reaches a threshold, the entire array is deemed an outlier. It is usually the case, however, that genes are highly correlated, and single-dimensional tests will not account for this. In such cases, the use of Hotelling's T² statistic in combination with robust PCA is better suited for multi-dimensional tests (Model et al., 2002). However, removing samples from an already small pool of samples may be problematic. An alternative would be to increase the sample size so that the effect of outliers is reduced. Unfortunately, increasing sample size may also be difficult due to the cost of microarray experiments and, depending on the study, a lack of test subjects. Many data normalization techniques have been explored to clean data without removing or adding samples.

Multi-channel microarrays

Multi-channel microarray chips, such as cDNA chips, use different colored dyes and enable the measurement of relative expression levels, indicated by the amount of fluorescence of each dye. Although these dyes have very similar properties, slight differences exist that affect hybridization or the amount of fluorescence and, ultimately, the observed gene-expression level. The most common dyes for two-channel cDNA microarrays are Cy3 (green) and Cy5 (red). A method often recommended for correcting dye intensity imbalances is to fit the data on a transformed MA scatter plot, in which the axes are

m_i = \log(R_i) - \log(G_i)   (1)

a_i = \frac{1}{2}(\log(R_i) + \log(G_i))   (2)

In the ideal case, the values along the M axis, which is the log ratio of red over green fluorescence, are approximately constant as the values along the A axis, the average intensity, increase. For some dyes, however, the intensity of one dye increases more quickly than the intensity of the other, resulting in a slight positive slope in the graph. The relationship may also be non-linear. The data can be smoothed after transformation to the MA axes using either locally weighted linear regression (LOWESS) or smoothing splines, and adjusted by the smoothing residuals to obtain normalized data (Cleveland and Devlin, 1988; Yang et al., 2002).

Normalization

Normalization has a significant effect on the detection of differentially expressed genes (Hoffmann et al., 2002). The combination of normalization and gene selection algorithms should be selected carefully so that the number of relevant genes is maximized while reducing the number of false positives. This may be a daunting task, since the process of interpreting and validating results can be very tedious. Nevertheless, many normalization algorithms have been widely used without their effects being fully understood. Some of these methods include background correction and dye, global, and quantile normalization.

Fluorescent signals representing hybridization on microarrays may include some background signal that is present regardless of hybridization. These noisy signals, which may be caused by non-specific hybridization of dye or tagged transcripts to the array, tend to reduce the signal-to-noise ratio but may also provide some information regarding dye intensities (for two-channel arrays). Assuming that the background signal is constant over the entire array and is additive, it may be subtracted from all spot intensities to normalize the array (Wolkenhauer et al., 2002). For instances in which the background signal may be variable over the array, signals can be normalized by subtracting a multiple of the standard deviation of the background, or a fraction of the background, to avoid negative signals. Background correction is often applied during the image acquisition step, when the fluorescence of each microarray spot is quantified, but may also be applied at a later step.

Global normalization methods adjust the overall intensity of each microarray by assuming that the total amount of mRNA is consistent across most cells. Gene expression signals can be divided by the sum of gene expression over the entire chip, resulting in normalized fractions of total mRNA expression (Zien et al., 2001). This method may also be used with housekeeping genes (genes that are consistently expressed) by dividing or subtracting other gene expression signals by the expression values of housekeeping genes. However, the assumptions of constant total gene expression and constant expression of housekeeping genes may not be accurate (Suzuki et al., 2000). An alternative is to use invariant or control genes, which are not necessarily housekeeping genes, as a basis for normalization. Control genes may be spiked into experiments, and in cases where controls are not available, invariant genes must be estimated from the given data. These estimations can be erroneous, however, and the use of a handful of genes estimated to be invariant for normalization of an entire chip may bias the results (Reilly et al., 2003).

Other, more general normalization techniques focus on ensuring similarity of distributions across microarrays. For example, the most basic normalization of several microarrays is mean or median centering to ensure that all chips have a similar baseline of expression. Similarly, each sample can be scaled so that all spots are expressed within the same range. Both of these methods are relatively simple, but incorrectly assume that inconsistencies in microarray experiments are linear. A more sophisticated method of distribution adjustment is quantile normalization, which forces all samples into identical distributions and can even be applied across conditions if invariant genes are taken into account (Bolstad et al., 2003). It is important that the effects of a normalization technique on a dataset are well understood before drawing conclusions from the results. The best approach may be to iteratively apply these methods to a dataset and interpret or validate results from differential expression analysis before selecting the best method (see the code sketch below). The following sections discuss several differential expression analysis methods.

Proteomic technology

Proteomics can be crudely classified into three areas (Pandey and Mann, 2000). First, peptide characterization for large-scale protein identification and post-translational modifications focuses on extracting sequence and structure information, leading to further functional studies. Second, 'differential display' for the comparison of protein levels with potential clinical applications is similar to comparative genomic microarray studies. Third, the study of protein aggregates and protein–protein interactions focuses on cellular reaction mechanisms, also leading to further functional studies.

Technologies that have been used for protein analysis include two-dimensional (2D) polyacrylamide gel electrophoresis (2D-PAGE), MS, and protein microarrays. 2D-PAGE has been the common method for protein analysis; however, for the analysis of a large number of proteins, it is difficult for several reasons. Although it can accurately identify a large number of proteins, it is very labor-intensive and requires large quantities of protein (Li et al., 2002). MS and protein microarrays are high-throughput methods that have been able to overcome the limitations of 2D-PAGE. Many of the algorithms used for genomic microarray analysis can be applied to MS and protein microarrays. Like genomic microarrays, experiments from such methods often produce data with small sample sizes and large dimensions.

2-Dimensional gel electrophoresis

2D gel electrophoresis (GE) is a highly popular technique among proteomics researchers. Among other reasons, GE is not equipment intensive and relies simply on fundamental molecular properties of the sample such as the isoelectric point (pI), molecular weight, and charge. These properties can be measured without disrupting the sample significantly. The resulting 2D gel maps can be used for 'differential display' of protein levels between control and treated samples; specific proteins may also be extracted for further analysis. It is common for 2D GE to be coupled with MS for this latter part of experimental analysis.

The first dimension in 2D GE is isoelectric focusing. The nature of the amino acid backbone inherent in all proteins allows for the formation of zwitterions — ions that possess both positive and negative charges at different sites. The amino group readily accepts a proton to become positively charged, while the carboxyl group loses a proton to become negatively charged. Thus, given an applied electric field along a simple pH gradient mounted on a gel strip, these zwitterions will migrate, or be 'focused', to their respective isoelectric points, known as the pI.
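Before continuing to the second gel dimension, the microarray normalization steps described above can be made concrete. The Python/numpy sketch below implements the MA transform of Eqs. (1)–(2), with a crude binned-median fit standing in for full LOWESS smoothing, followed by quantile normalization across arrays. The synthetic intensities, function names, and bin count are assumptions for illustration only, not the authors' pipeline.

import numpy as np

def ma_normalize(red, green, n_bins=20):
    # MA transform (Eqs. 1-2): m = log(R) - log(G), a = mean of the logs.
    m = np.log2(red) - np.log2(green)
    a = 0.5 * (np.log2(red) + np.log2(green))
    # Crude intensity-dependent trend: median of m within a-quantile bins
    # (a rough stand-in for LOWESS); subtracting it leaves the residuals.
    edges = np.quantile(a, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(a, edges) - 1, 0, n_bins - 1)
    trend = np.zeros_like(m)
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            trend[in_bin] = np.median(m[in_bin])
    return m - trend  # normalized log ratios

def quantile_normalize(X):
    # Force every array (column of X) onto the same empirical distribution:
    # replace each value by the mean of the values sharing its rank.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

rng = np.random.default_rng(0)
red = rng.lognormal(8.0, 1.0, 5000)          # hypothetical Cy5 spot intensities
green = red * rng.lognormal(0.2, 0.3, 5000)  # Cy3 with an intensity-dependent bias
print(ma_normalize(red, green)[:5])
print(quantile_normalize(rng.lognormal(8.0, 1.0, (100, 4)))[0])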

The second dimension is separation by size and molecular weight through a polyacrylamide gel mesh. The mesh is prepared using sodium dodecyl sulfate (SDS), which confers a uniform negative charge density, based on molecular size, to the proteins present. Furthermore, mercaptans are used to disrupt the secondary disulfide bonds between amino acids so that the proteins acquire a large rod-like conformation. Because of the materials used, this procedure is also termed SDS-PAGE. The size of the pores in this gel mesh depends on the concentration of polyacrylamide used. An electric field is applied across the gel with the positive electrode as the end point to attract and motivate the migration of the proteins. Given the uniform charge density, proteins move through the gel mesh based only on size and molecular weight. A variety of staining methods may be applied to visualize the locations of the proteins as spots. The resulting 2D map gives us information about the isoelectric point and molecular weight of the sample proteins.

In general, a typical gel map may contain up to 5000 spots. Detection sensitivity is heavily conditioned on the staining method applied. Even then, the number of spots visualized is practically limited and is potentially less than the number of possible gene products. This is a consequence of constraints from the separation steps, such as a limited spectrum of isoelectric points and the limited dimensions of the polyacrylamide gel, which restrict the molecular weight range to less than 250 kDa in general. Furthermore, there may be problems with loading the sample, in that the loading volume is restricted and hydrophobic proteins are problematic. In addition, issues of quantification and reproducibility do arise, as this is a labor-intensive procedure.

Mass spectrometry

MS is a popular high-throughput experimental technique that has quickly become an integral component of proteomics. The experimental procedure is conceptually simple: sample particles are ionized; the motion of these charged particles in applied electric and magnetic fields is then captured and represented in a spectrum of intensities (peaks). Depending on the sample composition, the spectrum generated contains large volumes of data previously embedded in the sample. Consequently, information extraction depends on high-precision spectrum analysis.

The mass spectrometer comprises three essential components: an ion source, an analyzer, and a detector. The ion source is the key to ionizing the sample — it should ideally only provide the sample particles with charges and not interfere with the motion of these particles. Achieving stable and abundant ions from the sample of interest is essential in MS. Subsequently, the analyzer separates the charged sample particles based on the mass-to-charge (m/z) ratio — analyzers can be classified into beam and trapping analyzers. Finally, the detector reports the intensities of sample particles based on the m/z ratio. Understanding the construct of the equipment makes it easier to refine or troubleshoot equipment specifications to achieve better spectra.

Importantly, when evaluating the appeal of MS, we consider the following factors: sample preparation, sample consumption, accuracy, and precision. The sample state is critical in determining if MS is viable. Sample preparation is required to obtain a sufficiently pure sample — a clean sample results in a spectrum with significantly fewer junk peaks. Furthermore, in many clinical applications today, sample volume is generally low. Thus, it is important that sample consumption is optimized to extract the most information. The accuracy of the spectrum peaks in locating and reporting the true m/z ratios and the relative intensities gives us both qualitative and quantitative information about the sample composition. A high-precision mass spectrometer is desired to allow the resolution of closely located peaks.

MS can be coupled with other techniques such as liquid chromatography and 2D GE to obtain more information about the sample. Furthermore, the process of MS can be repeated sequentially — this procedure is known as tandem MS.

Tandem mass spectrometry

Tandem mass spectrometry (MS/MS) is a natural extension of MS to derive further information about isolated spectrum components from a previous MS stage. Theoretically, multiple stages of MS can be involved, leading to an MS^n experiment, where n is the number of stages. Here we discuss MS/MS with two stages; the extension to higher orders is similar.

The parent ion undergoes the first stage of MS; selected components from the spectra are then isolated and may be subject to a variety of induced dissociation methods and one more stage of MS to produce daughter ions and neutral masses. This process is represented by a simple concept equation:

m^+_{parent} \rightarrow m^+_{daughter} + m_{neutral}   (3)

The signal-to-noise ratio is greatly improved in the second stage of MS because the cloud of ions, released from the ion source, that contributes to 'chemical noise' is filtered out by the first stage of MS. Furthermore, technological advances have made it possible to select either the first or second stage of MS to scan — this becomes the independent stage — and to investigate the spectra from the other stage — the dependent stage. This is the primary advantage of MS/MS.

Protein microarrays

Protein microarrays are a relatively new technology similar to genomic microarrays, and exist in two types: forward and reverse phase. Forward phase protein microarrays, like genomic microarrays, immobilize bait molecules on the surface of a chip and detect proteins from a sample by washing the chip with a solution of several analytes extracted from that sample. The result is that each chip can assay a large number of proteins for a single sample. Reverse phase arrays immobilize several analytes per spot, with each spot representing an entire sample. The entire chip is then probed with a single molecule of interest, such as a labeled antibody, so that several samples can be analyzed at once under the same conditions. A unique advantage of protein microarrays is that they can detect whether proteins are in a phosphorylated state, a switching mechanism often found in protein networks. Protein microarrays are also efficient, requiring very little cell lysate to produce several microarrays (Espina et al., 2003).

Data analysis

Microarray analysis

Microarray technology is most commonly used either for identifying patterns in gene expression by clustering or for detecting differentially expressed genes between two or more classes of samples. Gene clustering and differential expression analysis fall into the realms of unsupervised and supervised algorithms, which have been widely explored in microarray applications. Unsupervised methods comprise algorithms that do not require prior knowledge of the microarray samples. On the other hand, supervised methods require that the treatment conditions of the samples are known.

Gene clustering can be used to identify patterns in microarray data based on similarity of expression (Eisen et al., 1998). Samples can be clustered to identify groups of samples that have similar expression profiles or that have been exposed to similar treatments. In addition, genes can be clustered to identify groups of genes that may be biologically correlated.

Differential gene expression experiments are typically carried out with one set of microarray samples comprising mRNA extracted from test subjects or tissue under certain treatment conditions, and a control set of samples comprising mRNA from independent, normal subjects or tissue. Ideally, all microarrays in the experiment are identical with regard to the types and number of spotted gene or DNA transcripts. The fluorescent signal from each transcript in the set of microarrays of the same condition is representative of a single population. Since the intensities of the fluorescent signals are correlated with mRNA transcription, statistical testing between multiple populations can be performed to derive conclusions about gene expression. Gene transcripts that show significant differential expression between the two conditions may be used to infer the biological mechanisms altered by treatment of the test subjects or tissue.
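As a minimal illustration of the statistical testing just described, the sketch below applies an independent two-sample (Welch) t-test to every gene of a synthetic expression matrix and counts how many genes separate the control and test populations at a chosen significance level. The data, group sizes, and threshold are assumptions; a real analysis would also normalize the arrays and correct for multiple testing.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_ctrl, n_test = 10000, 8, 8

ctrl = rng.normal(0.0, 1.0, (n_genes, n_ctrl))   # hypothetical control arrays
test = rng.normal(0.0, 1.0, (n_genes, n_test))   # hypothetical treated arrays
test[:50] += 2.0                                 # spike in 50 "true" biomarkers

# Per-gene test of mean difference; equal_var=False avoids assuming
# equal population variances in the two conditions.
t_stat, p_val = stats.ttest_ind(ctrl, test, axis=1, equal_var=False)

alpha = 0.001
print(f"{(p_val < alpha).sum()} genes significant at alpha = {alpha}")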

Although these analytical techniques have been used for microarray analysis, they are not unique to genomics and can easily be applied to proteomics. The section on 'Statistical analysis and pattern classification' will describe some supervised and unsupervised statistical and pattern classification techniques in detail.

Mass spectrometry analysis

MS is primarily used for the identification of proteins based on cleavage of the amino acid backbone. Analytical methods can be broadly classified as database search and de novo analysis algorithms. Nonetheless, all identification methods depend heavily on the specificity of the enzyme proteases used for cleavage during sample preparation. Fortunately, with the exception of some proteases, the majority of experimentally viable proteases are sufficiently specific and consistent.

Mass spectra obtained can be searched against theoretical peptides derived from genomic databases to establish an identity. Clearly, such approaches fail in two cases: first, when there is insufficient genomic information, which is the prevailing scenario, taking into account possible sequencing errors; and second, in general, when extensive post-translational modifications occur. In addition, because there is only limited mass information available, the problem of false-positive matches returned from such database searches is non-trivial. By requiring more stringent match criteria, the number of false positives will decrease; however, at the same time, possible candidates may be eliminated. It is important to note that all these methods return a score, or probability, that the match is true. Matches are then ranked in order of the assigned scores. The best matches are not necessarily unique; furthermore, by varying the combination of match criteria, different 'best' matches may be obtained. Consequently, the rational approach is to seek peptide matches that consistently achieve high scores and to validate analysis results using other sources such as published literature or ontological approaches.

There are three common database methods for protein identification. The first, Peptide Sequence Tags, relies on the observation that most spectra usually contain a small series of peaks that is easily interpretable as a sequence by mass. It is assumed that the lowest and highest masses in the series contain information about the distance to either end of the peptide chain in mass units. The easily interpretable sequence forms the link between these two masses — these three pieces combine to form a peptide sequence tag that is matched against sequences in the genomic database. This method is most notably implemented over the Internet by the EMBL Bioanalytical Research Group in the PeptideSearch program (2005).

The second method is a correlation method that compares theoretically derived spectra, based on information returned from the genomic database search, to empirical data. This method is derived from signal-processing techniques common in communications and works better for low-resolution data. The computational cost of processing high-resolution data this way is too high for practical purposes, although this may change with technological advances. Another advantage of this method is that it is also robust for low signal-to-noise spectra.

The third approach utilizes the intensity information by matching the masses of the theoretical fragments with the experimental peaks, beginning with the most intense peak. The probability of random fragment matches, i.e. the probability that the fragments match by chance, is also determined as the basis for comparison. This is known as probability-based matching.

De novo analysis

Apart from database search methods, another approach toward spectrum analysis is de novo analysis, literally 'a new analysis'. This approach derives theoretical peptide sequences solely based on the distribution of intensities in spectra. The true identity of the sample peptide is then determined by verification from other independent sources. Because of the limitations of database search-based methods and the potential of machine-learning algorithms, de novo analysis must not be lightly dismissed as a merely academic approach.
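A toy version of the peptide sequence tag idea can make the three-piece match (N-terminal mass, short read sequence, C-terminal mass) concrete. The residue mass table (approximate monoisotopic values for a few amino acids), the tolerance, and the candidate peptides below are illustrative assumptions; this is a sketch of the concept, not the PeptideSearch program itself.

# Approximate monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.021, "A": 71.037, "S": 87.032, "P": 97.053,
    "V": 99.068, "L": 113.084, "K": 128.095, "E": 129.043,
}

def mass(seq: str) -> float:
    return sum(RESIDUE_MASS[aa] for aa in seq)

def tag_matches(candidate: str, n_mass: float, tag: str, c_mass: float,
                tol: float = 0.5) -> bool:
    # True if `candidate` contains `tag` with flanking residue masses
    # within `tol` Da of the measured N- and C-terminal partial masses.
    start = candidate.find(tag)
    while start != -1:
        n_side = mass(candidate[:start])
        c_side = mass(candidate[start + len(tag):])
        if abs(n_side - n_mass) < tol and abs(c_side - c_mass) < tol:
            return True
        start = candidate.find(tag, start + 1)
    return False

# Hypothetical database peptides and a tag read from an MS/MS spectrum.
database = ["GASPVLKE", "AVLKESPG", "KEPSAVLG"]
hits = [p for p in database if tag_matches(p, mass("GAS"), "PVL", mass("KE"))]
print(hits)  # -> ['GASPVLKE']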

The first step in de novo analysis is to assess the quality of the spectra (Bern et al., 2004). There is little purpose in investing significant effort in analyzing spectra of inferior quality. At the same time, the concept of spectrum quality is not a well-defined one. Good quality spectra may be defined in terms of resolution or signal-to-noise ratio, among other criteria. Specific criteria are determined with regard to the experimental focus.

Spectral peaks are the result of the cleavage of the amino acid backbone at specific sites based on the specificity of the proteases. In general, peptide fragments attached to the N-terminus are known as b-ions, while peptide fragments attached to the C-terminus are known as y-ions. There have been multiple attempts to reconstruct the sample peptide using graph methods as well as probabilistic network approaches (Bruni et al., 2005; Yan et al., 2005). Typically, peaks are classified as b- and y-ions and represented as nodes in a graph; these nodes are linked by weighted edges. Subsequently, the best possible peptide sequence is obtained from the path through the graph that has an optimized score along the edges.

Besides the proteases, the tertiary structure of the sample peptide influences the tendency for fragmentation at specific sites, directly affecting the locations and intensities of peaks within the spectra. Hence, there are efforts to correlate peak intensities with peptide properties such as hydrophobicity and helicity (Gay et al., 2002; Elias et al., 2004). However, such intensity-based approaches are still in their infancy because of unresolved issues with the reproducibility of absolute intensity information across replicate MS runs.

Tandem mass spectrometry analysis

In a parent-ion scan, the product (daughter) ion is fixed as the independent variable. The first spectrum is scanned for parent ions that give rise to a specified spectrum component in the second stage. In other words, we are searching for a class of parent ions that dissociate to give target product ions. The product-ion scan is analogous to the parent-ion scan. In addition, the neutral-loss scan refers to scanning both stages of MS for a specified mass loss between the two stages.

The parent-ion and neutral-loss scans can be useful for drug discovery. Homologous compounds, such as variations of a likely therapeutic compound or metabolites of a target compound, will exhibit similar m/z ratios in the second stage of MS; these homologues may also be lost as neutral masses. The product-ion scan is useful for further revealing the structure and sequence information of complex protein mixtures passed into the first stage of MS.

2-Dimensional gel electrophoresis

Because the primary mode of comparing individual gel maps is visual, there is a variety of software that is integrated with laboratory equipment to process the gel images. The huge number of spots on a single gel has also necessitated the automation of this process for high-throughput screening.

Gel maps can first be compared visually for 'differential display', i.e. a comparison of protein expression levels between control and treated samples. 'Interesting' proteins may be identified by location on the map or by quantifying the expression levels in identical spots. Furthermore, the spots can be excised and prepared for MS for further analysis.

Statistical analysis and pattern classification

Once the quality of the genomic or proteomic experiment has been verified, the next step is to employ any of the number of mathematical techniques that have been applied to or developed for genomic or proteomic data. As mentioned previously, these techniques can be categorized into either unsupervised or supervised methods. The purpose of these algorithms is to identify underlying relationships between gene or protein expression profiles or experimental samples, and to identify significant genes or proteins in a controlled experiment.

Unsupervised methods

Unsupervised methods can provide insights into the natural organization of data. Typically, for clustering algorithms, samples, genes, or protein expression levels can be arranged based on a metric of similarity. There are several methods for determining feature similarity, including Euclidean distance, correlation, and the dot product. Euclidean distance is the simplest metric. For example, given n samples, the distance between gene expression values X ∈ R^n and Y ∈ R^n can be calculated as

D(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}   (4)

where X and Y represent two different genes across all n samples. For the purpose of computational efficiency, the square root may be removed with no adverse effect on clustering.

Pearson's correlation coefficient can also be used as a distance metric and can be computed as

D(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - X_{offset}}{\Phi_X} \right) \left( \frac{Y_i - Y_{offset}}{\Phi_Y} \right)   (5)

in which \Phi is the standard deviation of gene or protein expression over the n samples. It has been suggested that the standard deviation for this metric be modified to account for a reference state that is not necessarily based on the sample mean expression (Eisen et al., 1998). The modified standard deviation is

\Phi_G = \sqrt{\sum_{i=1}^{n} \frac{(G_i - G_{offset})^2}{n}}   (6)

where G_{offset} is a reference state (perhaps the median, or another reference state not apparent in the data). It may also be necessary to divide by n − 1 instead of n to obtain the sample standard deviation if the sample size is small.

The dot product of two expression vectors may also be used as a measure of similarity. Vectors that are similar will have a large inner product, whereas orthogonal vectors will have a zero inner product. A larger result implies that genes are more closely related:

D(X, Y) = X \cdot Y   (7)

The unsupervised clustering methods described here are based on such similarity measures and include hierarchical clustering, self-organizing maps, and principal component analysis.

Hierarchical clustering

Once the distance metric has been defined, hierarchical clustering can be performed using one of several algorithms. Hierarchical clustering is an agglomerative technique, meaning that initially there are several single-member clusters, which are gradually combined based on similarity until a single cluster remains (Quackenbush, 2001). When clusters are formed, the distance between two clusters may be calculated in different ways. Single-linkage clustering uses the minimum distance between a member of one cluster and a member of the other cluster, over all members. Complete-linkage clustering uses the maximum distance, while average linkage uses the average distance, which can be computed as the average of the distances from each point in one cluster to all points in the other cluster. Averages may also be weighted to account for unbalanced cluster sizes using a method called the weighted pair-group average.

Hierarchical clustering is very useful for microarray analysis because it can be easily visualized in all dimensions. Gene expression can be represented with a heat map and ordered with hierarchical clustering so that patterns in expression become visually accessible. Figure 1 is an example of hierarchical clustering using two distance metrics. Both distance metrics result in the same two groups, consisting of genes 1, 2, 4, 6, 9 and genes 3, 5, 7, 8, 10; however, the ordering of genes within these groups differs slightly between the Euclidean and correlation distance metrics.

There are drawbacks to the use of hierarchical clustering for gene/protein expression analysis. The hierarchical structure imposes a strict ordering of groups of genes that may not be representative of all possible relationships. For multi-dimensional clustering, most metrics can result in non-unique distances, therefore failing to capture important differences between groups (Tamayo et al., 1999). More sophisticated clustering techniques such as self-organizing maps have been shown to outperform hierarchical clustering when applied to noisy biological data (Mangiameli et al., 1996).

Fig. 1. Hierarchical clustering of 10 genes and four samples using correlation (top) and Euclidean distance (bottom) metrics. Clustering groups change slightly depending on the distance metric.
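A comparison like the one in Fig. 1 can be sketched with scipy's agglomerative clustering, which supports the Euclidean and correlation metrics of Eqs. (4)–(5) and the linkage rules described above. The synthetic 10-gene, 4-sample matrix and the choice of average linkage are assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list, fcluster

rng = np.random.default_rng(2)
# Ten hypothetical genes x four samples, built from two expression patterns.
genes = np.vstack([rng.normal([0, 1, 0, 1], 0.3, (5, 4)),
                   rng.normal([1, 0, 1, 0], 0.3, (5, 4))])

for metric in ("euclidean", "correlation"):
    Z = linkage(genes, method="average", metric=metric)  # agglomerative tree
    order = leaves_list(Z)                               # leaf order of the dendrogram
    groups = fcluster(Z, t=2, criterion="maxclust")      # cut the tree into two clusters
    print(metric, "leaf order:", order, "groups:", groups)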

Self-organizing maps

The self-organizing map (SOM) is an unsupervised clustering neural network in which a single neuron, or node, is activated for each input vector, or sample. A SOM is trained by first selecting the number of nodes (equal to the desired number of clusters) and arranging them in a 1- or 2-dimensional grid. Each node is then mapped to a d-dimensional space, in which d is the number of features (genes or proteins). Nodes are randomly compared to samples and iteratively moved, in d-dimensional space, toward the sample to which each is closest. The nodes in the grid surrounding the node that is closest to a sample are also moved slightly in the direction of the sample. After a number of iterations, the nodes in the grid become organized in a manner that represents the topological structure of the input samples. The nodes will form clusters that represent clusters of the underlying data.

Mathematically, this is applied as follows (Ham and Kostanic, 2001). Each node is associated with a synaptic weight vector w_i ∈ R^d, where i = 1, 2, …, m and m is the number of nodes. For each input data point X ∈ R^d, where d is the number of features or genes, the closest match to a node is computed as

q(X) = \min_i \|X - w_i\|_2   (8)

where q(X) indicates the winning node of the associated weight vector. The synaptic weight vector of the winning node is adjusted by

w_i(k+1) = w_i(k) + \mu(k) [X(k) - w_i(k)]   (9)

where \mu(k) is the learning rate parameter, which is decreased at each iteration. Nodes in the vicinity of the winning node are also adjusted, but with a different, smaller \mu(k). The number of neighboring nodes affected by the winning node may also be reduced over time. The number of nodes, representing clusters, can be defined so that the input data can be optimally clustered. SOMs have been successfully applied to microarray gene expression data (Tamayo et al., 1999). A fundamental difference between SOMs and hierarchical clustering is the ability to select the number of nodes, or clusters, in SOMs. This is equivalent to clustering the data by the most significant features, the number of which is equal to the number of nodes.

Principal component analysis

Although both hierarchical clustering and SOMs have been applied to large-dimensional datasets, it may be desirable to reduce dimensionality without discarding any important features. Principal component analysis (PCA) has been used as a statistical tool for dimensionality reduction. Given a set of features, or genes in a microarray, the principal components retain most of the information in the original features (Wang and Gehan, 2005). PCA compresses the input sample x with an optimal transformation matrix W according to

y = Wx   (10)

where y ∈ R^m, x ∈ R^n, and m ≪ n (n is the original number of features). y retains most of the information in the input sample x by combining dimensions that have high covariance or correlation. W is a matrix of eigenvectors

W = [w_1, w_2, \ldots, w_m]^T   (11)

in which w_i is an n-dimensional eigenvector. PCA can be solved by computing the eigenvalue decomposition of the covariance matrix of the input data:

C_x = W^T \Lambda W   (12)

in which C_x is the covariance matrix, defined as E[x x^T], and \Lambda is the diagonal matrix of eigenvalues (Ham and Kostanic, 2001).

Transformation with the W matrix can serve two purposes. First, it reduces the dimensionality of each sample so that the complexity of other clustering methods is reduced. Second, PCA can be a preliminary step to feature reduction, in that the components of each eigenvector can be examined to extract the original dimensions that are most significant. For example, there are typically only a handful of eigenvalues that are much larger than the rest, corresponding to a small number of eigenvectors and components in the reduced space. Each principal eigenvector in turn has a number of components that correspond to the original data dimensions. These components can be sorted by magnitude to rank the original dimensions that contribute most to the reduced dimension (sketched in code below).

A drawback of PCA is that ranking or clustering by principal components is not intuitive; each value does not directly represent gene expression, but is instead a weighted linear combination of the original dimensions. Classical PCA has also been shown to be highly sensitive to outliers in the data; consequently, Hubert and Engelen (2004) have developed a robust PCA method to avoid this problem.

Supervised methods

In many genomic and proteomic experiments, the underlying grouping or clustering of samples is already known, in which case it would be more informative to identify the genes or proteins most affected by a controlled treatment of the samples, also known as the differentiating genes or proteins. These genes or proteins are called significant biomarkers and are actively sought after because of their potential to help understand and control biological mechanisms. They also help to reduce the complexity and increase the accuracy of classification rules that can be used for disease diagnosis and prognosis. Further applications of supervised methods are discussed by Olshen and Jain (2002).

Supervised algorithms, which are continually being developed and refined to search for these biomarkers, are not trivial for several reasons. First, the algorithm must be able to distinguish truly significant biomarkers from non-significant biomarkers in noisy data. Because of inherent variations in biological data, algorithms are seldom perfect. Second, algorithm parameters depend on properties of the data, such that a set of parameters that works well for one dataset may not be correct for another dataset. Finally, identified biomarkers must be validated, which can be a tedious process given the large number of gene and protein features assayed in a single experiment. Significant biomarker identification is an iterative process that must be mathematically and biologically sound.
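Returning to Eqs. (10)–(12): the decomposition can be carried out directly with numpy, and the eigenvector components can then be inspected to rank the original genes, as described above. The synthetic data, the number of retained components, and the variable names are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, (20, 500))   # 20 hypothetical samples x 500 genes
X[:10, :5] += 3.0                     # five genes separate two sample groups

Xc = X - X.mean(axis=0)               # center each gene
C = np.cov(Xc, rowvar=False)          # covariance matrix C_x of Eq. (12)
eigvals, eigvecs = np.linalg.eigh(C)  # eigendecomposition (C is symmetric)
order = np.argsort(eigvals)[::-1]     # largest eigenvalues first

m = 2                                 # keep m << n principal components
W = eigvecs[:, order[:m]].T           # rows w_1 ... w_m of Eq. (11)
Y = Xc @ W.T                          # projected samples, y = Wx of Eq. (10)

# Rank genes by the magnitude of their loadings on the first component.
top_genes = np.argsort(np.abs(W[0]))[::-1][:5]
print("genes contributing most to PC1:", top_genes)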

Statistical testing and pattern classification have the number of features, e.g. genes for microarrays
been used for identifying significant biomarkers and or peptides for MS, and i the sample number. The
building predictive models for diagnosis and prog- number of features for both experiments can be
nosis. Statistical testing can be coupled with pattern very large, on the order of tens of thousands and it
classification as a method of feature reduction by is possible that only a handful of these features
ranking and filtering insignificant features before significantly differentiate samples into defined
building a predictive classifier. Determination of the classes. Feature space reduction can minimize the
best statistical test to use on a particular dataset computational complexity of the classifier and
depends on the properties of the data distribution, provide useful information about underlying prop-
such as mean, variance, and normality. erties of the data. For example, a small set of fea-
Pattern classifiers can be used to for both feature tures that can be used to build an accurate
ranking and model building. The problem of pat- classifier may include features that are very im-
tern recognition is one of predicting future be- portant in the biological mechanisms under exam-
havior or output of a system based on past input ination.
for which the output is already known. In general, The problem of producing an accurate classifier,
this can be thought of as interpolating or extrap- either by selecting the appropriate algorithm or
olating the output of a system. Output behavior of adjusting parameters and feature selection are of-
past input or training data is known, thus pattern ten closely related. In order to select features, a
recognition is a supervised learning method. metric must be defined which can evaluate the
Given a set of data points X in the space Rd with overall contribution of a feature to classification.
known class labels C A{1, 2, y, l}, a pattern rec- The simplest metric that has often been used in
ognition algorithm, or classifier, defines a function microarray analysis is average fold change
that maps the data points to their correct labels in between classes:
C. This classifier can then attempt to map an ad-
G1
ditional set of test data points, which were not used F ðX Þ ¼ (13)
in the training phase, to their correct labels. Correct G2
mapping of test data points is not guaranteed and in which G1 and G2 are the average expression of
the primary problem in designing the classifier is to a single feature in conditions 1 and 2, respectively.
minimize classification error. Determining true clas- With a small number of samples, however, the
sification error is difficult, especially with small difference in mean expression may not be signifi-
sample size problems, and is discussed in the error cant. Even with an adequate number of samples, it
estimation and cross-validation section. An accu- may be useful to quantify the significance of mean
rate error estimation technique can also be used as a difference using a statistical hypothesis test.
feature ranking method. Depending on properties of the data, a z- or t-
Several pattern classifiers have been applied to test can be used to test the confidence of difference
both genomic and proteomic experiments for both between two population means. The two popula-
feature ranking and predictive model building. tion means, in this case, are the mean gene ex-
Some commonly used supervised methods — pression for each condition.
k-nearest neighbors (knn), linear discriminant The z-test can be used for data in which both
analysis and support vector machines (SVM) — populations are known to be normally distributed
are discussed in the following sections. Statistical and population variances are also known. The null
hypothesis testing is also discussed as a method of hypothesis for this two-tailed test is that mean
feature ranking. feature expression is zero.
Feature ranking with hypothesis testing

Microarray or MS data can be represented mathematically as a set of samples X_i ∈ R^d, where d is the number of features. With d often in the thousands, it is possible that only a handful of these features significantly differentiate samples into defined classes. Feature space reduction can minimize the computational complexity of the classifier and provide useful information about underlying properties of the data. For example, a small set of features that can be used to build an accurate classifier may include features that are very important in the biological mechanisms under examination.

The problems of producing an accurate classifier, either by selecting the appropriate algorithm or by adjusting its parameters, and of selecting features are often closely related. In order to select features, a metric must be defined which can evaluate the overall contribution of a feature to classification. The simplest metric, often used in microarray analysis, is the average fold change between classes:

F(X) = \frac{\bar{G}_1}{\bar{G}_2}    (13)

in which \bar{G}_1 and \bar{G}_2 are the average expression of a single feature in conditions 1 and 2, respectively. With a small number of samples, however, the difference in mean expression may not be significant. Even with an adequate number of samples, it may be useful to quantify the significance of the mean difference using a statistical hypothesis test.

Depending on the properties of the data, a z- or t-test can be used to test the confidence of the difference between two population means. The two population means, in this case, are the mean gene expression for each condition.

The z-test can be used for data in which both populations are known to be normally distributed and the population variances are also known. The null hypothesis for this two-tailed test is that the difference in mean feature expression is zero:

H_0: \mu_1 - \mu_2 = 0    (14)

H_a: \mu_1 - \mu_2 \neq 0    (15)

in which \mu_1 and \mu_2 are the population means of the feature in conditions 1 and 2, respectively. The z-statistic is computed as

z = \frac{\bar{G}_1 - \bar{G}_2 - 0}{\sqrt{\sigma_1^2/M + \sigma_2^2/N}}    (16)

in which M and N are the sample sizes in conditions 1 and 2, respectively, and \sigma_1^2 and \sigma_2^2 are the known population variances of each condition. The null hypothesis is rejected if z ≥ z_{α/2} or z ≤ −z_{α/2}, where α is selected based on the desired confidence. Even when the populations cannot be assumed to be normal and the population variances are not known, the z-statistic can still be used if the sample sizes are sufficiently large. The z-statistic in this case is slightly altered:

z = \frac{\bar{G}_1 - \bar{G}_2 - 0}{\sqrt{S_1^2/M + S_2^2/N}}    (17)

where S_1 and S_2 are the sample standard deviations.

For the typical microarray case, in which the sample size is small and the population variances are not known, the t-test can be applied if both populations are assumed to be normal. The t-statistic is very similar to the z-statistic:

T = \frac{\bar{G}_1 - \bar{G}_2 - 0}{\sqrt{S_1^2/M + S_2^2/N}}    (18)

T approximately follows a t-distribution with degrees of freedom, v, estimated as

v = \frac{(S_1^2/M + S_2^2/N)^2}{(S_1^2/M)^2/(M-1) + (S_2^2/N)^2/(N-1)}    (19)

The assumption that the populations are normal may be problematic and can be handled by methods such as significance analysis of microarrays (SAM) (Tusher et al., 2001) or the Wilcoxon rank-sum test.

The Wilcoxon rank-sum test is effective on data that is neither normally distributed nor large in sample size, and it is more robust to outliers. If the distributions of the two conditions are similar, the test statistic can be computed as

w = \sum_{i=1}^{M} r_i    (20)

where r_i is the rank of sample i of condition 1 among all M + N samples, with 1 being the lowest rank and M + N the highest. For gene expression we are interested in a two-tailed test; the rejection region of the test is therefore w ≥ c or w ≤ M(M+N+1) − c, where c can be obtained from a table of Wilcoxon critical values (Devore, 2004).

Using the methods described above, features can be ranked by ordering the p-values of the hypothesis tests or by ordering fold changes.
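To make this ranking step concrete, the following sketch scores each feature by fold change (Eq. (13)), Welch's t-test (Eqs. (18) and (19)), and the Wilcoxon rank-sum test, then orders the features by p-value. This is a hypothetical illustration using SciPy on simulated data, not code from the chapter; the 7-versus-7 design simply mirrors the sample sizes of the case study below.

import numpy as np
from scipy import stats

def rank_features(cond1, cond2):
    """Score each feature (column) of two samples-by-features matrices
    and return the features sorted by Welch t-test p-value."""
    results = []
    for j in range(cond1.shape[1]):
        x, y = cond1[:, j], cond2[:, j]
        fold = x.mean() / y.mean()                           # Eq. (13)
        t_p = stats.ttest_ind(x, y, equal_var=False).pvalue  # Eqs. (18)-(19)
        w_p = stats.ranksums(x, y).pvalue                    # Wilcoxon rank-sum
        results.append((j, fold, t_p, w_p))
    return sorted(results, key=lambda r: r[2])               # smallest p first

# hypothetical 7-vs-7 experiment with 5 features
rng = np.random.default_rng(0)
cond1 = rng.normal(5.0, 1.0, size=(7, 5))
cond2 = rng.normal(5.5, 1.0, size=(7, 5))
for j, fold, t_p, w_p in rank_features(cond1, cond2):
    print(f"feature {j}: fold = {fold:.2f}, t p = {t_p:.3f}, rank-sum p = {w_p:.3f}")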
k-Nearest neighbors

k-nearest neighbors (kNN) is a simple supervised classification method that is used to predict the class of an unknown sample based on surrounding samples with known labels (Cover and Hart, 1967). Given an unknown sample X ∈ R^d, the nearest k known samples are found by computing and sorting the distances to all adjacent samples. Each of the k samples is associated with a label y, and the predicted class of the unknown sample is determined by a majority vote. The distance can be computed using any number of metrics, including the simple Euclidean distance. Variations on kNN have been proposed, including studies of distance weighting (Dudani, 1976) and of the selection of k (Kulkarni et al., 1998).

kNN has the advantage of being very simple to implement yet competitive with other pattern classifiers. On the other hand, it can become computationally intensive as the number of samples increases. The distance metric must be computed for a large number of sample combinations, which is compounded when using different cross-validation or feature selection methods. The kNN algorithm has been applied to many gene selection problems, ranging from bone disease (Theilhaber et al., 2002) to breast cancer (Modlich et al., 2005).
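The voting scheme just described is short enough to implement directly. The sketch below is a minimal NumPy illustration (Euclidean distance, unweighted majority vote); the data and function names are hypothetical.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest
    training samples under the Euclidean distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every sample
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                # majority label

# hypothetical two-class data in R^2
X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.8, 0.9])))       # predicts class 1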
Linear discriminant analysis

Linear discriminant analysis (LDA) is similar to PCA except that it considers class labels when reducing dimensionality. The LDA solution is the projection of the data points onto an optimized feature space which maximizes the ratio of between-class to within-class variance. In other words, the data points are mapped to a space in which the classes are maximally separated. Given microarray samples {X_1, X_2, ..., X_N} belonging to classes {C_1, C_2, ..., C_C}, the between-class scatter matrix can be computed as

S_b = \frac{1}{N} \sum_{i=1}^{C} N_i (m_i - m)(m_i - m)^T    (21)

where m_i = \frac{1}{N_i} \sum_{X \in C_i} X is the mean of the samples in class C_i and m = \frac{1}{N} \sum_{i=1}^{N} X_i is the mean of all samples. The within-class scatter matrix is computed as

S_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{X_k \in C_i} (X_k - m_i)(X_k - m_i)^T    (22)

The solution is obtained by solving the optimization problem

\arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|} = [w_1, w_2, ..., w_m]    (23)

where the w_i are the eigenvectors of S_b and S_w that correspond to the largest eigenvalues \lambda_i. The problem is essentially a generalized eigenvalue decomposition:

S_b w_i = \lambda_i S_w w_i    (24)

Once W has been solved for, a data point, or sample, can be mapped to a predicted class label by determining the sample's position relative to the hyperplane defined by w (Nhat and Lee, 2005).
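Equations (21)-(24) translate almost line by line into code. The sketch below is a minimal illustration using SciPy's generalized eigensolver on toy data; note that with far more features than samples, as in microarray data, S_w is singular, so in practice some regularization or prior dimension reduction would be assumed before this step.

import numpy as np
from scipy.linalg import eig

def lda_directions(X, labels, n_components=1):
    """Build S_b and S_w (Eqs. (21)-(22)) and solve the generalized
    eigenproblem S_b w = lambda S_w w (Eq. (24))."""
    N, d = X.shape
    m = X.mean(axis=0)
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_b += len(Xc) * np.outer(mc - m, mc - m) / N         # Eq. (21)
        S_w += sum(np.outer(x - mc, x - mc) for x in Xc) / N  # Eq. (22)
    vals, vecs = eig(S_b, S_w)           # generalized eigendecomposition
    order = np.argsort(vals.real)[::-1]  # largest eigenvalues first
    return vecs[:, order[:n_components]].real

# hypothetical two-class data; samples are projected onto the top direction
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [3.1, 3.9]])
labels = np.array([0, 0, 1, 1])
W = lda_directions(X, labels)
print(X @ W)    # the two classes separate along this axis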
Support vector machines

SVMs, developed by Vapnik (1995), are a class of optimization problems that partition a set of data points according to class label with a maximal-margin hyperplane. Given a set of data points X ∈ R^d and class labels y ∈ {+1, −1}, a linear classifier defines the hyperplane

w \cdot X + b = 0    (25)

where w is also a vector in R^d. The optimization problem searches for the w and b that satisfy the conditions

y_i (X_i \cdot w + b) \geq 1 - \xi_i    (26)

in which \xi_i is a slack variable that allows for small errors in classification. The primal Lagrangian of this problem is

L_P = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \{y_i(X_i \cdot w + b) - 1 + \xi_i\} - \sum_i \mu_i \xi_i    (27)

where \alpha_i and \mu_i are Lagrange multipliers. Equation (27) can be converted into the Lagrange dual, maximizing

L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(X_i, X_j)    (28)

subject to:

0 \leq \alpha_i \leq C    (29)

\sum_i \alpha_i y_i = 0    (30)

The solution can then be obtained as

w = \sum_i \alpha_i y_i X_i    (31)

The function K(X_i, X_j) in Eq. (28) is known as the kernel function and enables the SVM to handle non-linear separations by mapping each data point X_i to an alternate space \Phi(X_i). The function \Phi(X_i) does not have to be explicit, keeping the computational complexity of the SVM at a minimum even though the alternate space, usually of high dimension, can be very complex. For the linear case, the kernel function is a dot product:

K(X_i, X_j) = X_i \cdot X_j    (32)

The mapping can also be very complex, to an infinite-dimensional space, using the Gaussian kernel function:

K(X_i, X_j) = e^{-\|X_i - X_j\|^2 / 2\sigma^2}    (33)

Once the SVM has been trained, it can be evaluated with the function

f(X) = \sum_i \alpha_i y_i K(X_i, X) + b    (34)

in which the classification of sample X is determined by the sign of f(X). Note that when evaluating the classifier, the mapping function \Phi(X) does not need to be known (Cristianini and Shawe-Taylor, 2000).
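In practice these optimization problems are rarely solved by hand; libraries expose the cost C of Eq. (29) and the kernels of Eqs. (32)-(33) as parameters. The sketch below uses scikit-learn's SVC, which wraps the same LIBSVM library used in the case study later in this chapter; the data are hypothetical.

import numpy as np
from sklearn.svm import SVC

# hypothetical two-class training data
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y = np.array([-1, -1, 1, 1])

# linear kernel, Eq. (32); C bounds the alphas as in Eq. (29)
lin = SVC(kernel="linear", C=1000).fit(X, y)

# Gaussian (radial basis) kernel, Eq. (33); sklearn's gamma = 1 / (2 sigma^2)
rbf = SVC(kernel="rbf", C=1000, gamma=0.5).fit(X, y)

# decision_function evaluates f(X) of Eq. (34); its sign is the predicted class
x_test = [[0.5, 0.6]]
print(lin.decision_function(x_test), np.sign(lin.decision_function(x_test)))
print(rbf.decision_function(x_test), rbf.predict(x_test))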

Performance evaluation

Cross validation and error-estimation methods

A pattern classifier is typically able to correctly classify all the samples on which it was trained. Of course, this depends on the type of classifier used. For example, a linear discriminant classifier will have 100% accuracy on a set of data points only if those data points can be linearly separated. A non-linear problem is more difficult for a linear classifier but can be easily handled by an SVM using a non-linear kernel. However, testing a classifier on the same set of points on which it was trained is only useful for determining the classifier's ability to map input samples to output class labels, also known as the resubstitution or training error. If, for instance, the classifier were then used to test an independent set of samples, the accuracy might be significantly lower. This phenomenon of high accuracy on training data is called overfitting, and it is a difficulty that arises in ill-posed problems such as the classification of microarray and MS data, in which the number of features is much larger than the number of samples. The accuracy of a classifier when tested with an independent set of samples corresponds to the classifier's ability to generalize. The combination of accuracy and generalization ability of a classifier can be used as a measure of overall performance. Methods used to deal with small-sample problems are cross validation, bootstrapping, and bolstering.

Typically, cross validation consists of training a classifier on a subset of samples, then testing the resulting classifier on an independent set of test samples. This method in its simplest form is also called holdout cross validation, and it is adequate for datasets that have a large number of samples. However, for small sample sizes, as is often the case with microarray experiments, alternative methods must be used (Goutte, 1997). k-fold cross validation is the process of dividing a dataset into k subsets and performing the holdout cross validation k times, each time with a different training and test set. The training set in each case is k−1 of the subsets, while the test set is the remaining subset. The extreme case of k-fold cross validation is complete leave-one-out cross validation, in which k = n, the total number of samples. The correct prediction rate or error of the classifier can be computed as the average over all k tests. k-fold cross validation reduces the variance in error estimation by increasing the number of tests.

Error estimation using cross validation for small-sample microarray studies may still be biased and highly variable (Fu et al., 2005). Therefore, bootstrapping has been proposed to reduce biases in error estimation for microarray analysis (Braga-Neto and Dougherty, 2004b). Bootstrapping is the process of randomly selecting a number of samples for training and testing on the remaining samples. Because the samples are selected randomly, the training sets may not be unique, but by repeating this process a sufficient number of times, the variance and bias of the error can be improved. Generally, bootstrap error estimation iteratively selects B random samples of n points each for training a classifier. The n points of each bootstrap sample are selected with replacement, so the number of distinct points in a bootstrap sample is usually less than n. For each of the B samples, the classifier is trained and then tested on the remaining points, resulting in an error value. The B error values are then averaged to produce the error estimate.

The 0.632 bootstrap method is similar to bootstrap cross validation except that the cross validation error rate is weighted by 0.632, the average fraction of samples that end up in a training set when samples are selected randomly with replacement. For n total samples, the training set consists of n draws with replacement; therefore some samples are duplicated, while others are left out (Efron, 1983). The total 0.632 bootstrap error is computed as

E_{0.632} = 0.632 E_0 + (1 - 0.632) E_{resub}    (35)

where E_0 is the bootstrap zero estimator (Efron, 1983; Braga-Neto and Dougherty, 2004b) and E_{resub} is the resubstitution error.

The 0.632 bootstrap method has been improved to alleviate the case in which the resubstitution error, E_{resub}, is 0 (sometimes a sign that overfitting has occurred). With this method, a larger weight (>0.632) is placed on the bootstrap zero estimator when overfitting occurs, detected when E_0 is much greater than E_{resub}. This improved bootstrap method is known as the 0.632+ estimator (Efron, 1997).

Braga-Neto and Dougherty (2004b) found that the 0.632 bootstrap error estimator shows the least bias and variability compared to resubstitution, leave-one-out, and 5- and 10-fold cross validation. Fu et al. (2005), however, compared the bootstrap, cross validation, leave-one-out bootstrap, and 0.632 bootstrap methods in a comprehensive simulation study and showed that the simpler bootstrap performs better than the leave-one-out bootstrap, the 0.632 bootstrap, and the improved 0.632+ bootstrap.

Another method for error estimation, in which the original data distribution is 'bolstered' using a kernel function, performs in most cases as well as 0.632 bootstrap error estimation but can be computed faster. This method is essentially a density estimation for a set of samples using a mixture of Gaussians. The area around a data point, defined by the kernel function, is part of a Gaussian distribution and is used to compute a smooth error. If the area around any point is misclassified, the total performance of the classifier is deducted by an amount equal to the fraction of misclassified area. The error can be quickly estimated using Monte-Carlo simulation (Braga-Neto and Dougherty, 2004a).

A set of features used for training a classifier may be an optimal set of features depending on the performance of the classifier using various cross-validation techniques for error estimation. If the estimated error is accurate and small, then these features may be significant contributors to the natural partitioning of the conditions of interest. This is the basis of feature ranking using error estimation and classifiers.
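As a worked illustration of these estimators, the sketch below compares leave-one-out cross validation with a 0.632 bootstrap (Eq. (35)) for a kNN classifier on simulated 7-versus-7 data. It is a simplified, hypothetical implementation (degenerate bootstrap draws containing only one class are simply skipped), not the exact procedure used in the case study.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def bootstrap_632(clf, X, y, B=100, seed=0):
    """0.632 bootstrap error, Eq. (35): 0.632 * E0 + (1 - 0.632) * Eresub."""
    rng = np.random.default_rng(seed)
    n, e0 = len(y), []
    for _ in range(B):
        train = rng.integers(0, n, n)             # n draws with replacement
        test = np.setdiff1d(np.arange(n), train)  # points left out of training
        if len(test) == 0 or len(np.unique(y[train])) < 2:
            continue                              # skip degenerate draws
        clf.fit(X[train], y[train])
        e0.append(np.mean(clf.predict(X[test]) != y[test]))
    e_zero = float(np.mean(e0))                       # bootstrap zero estimator
    e_resub = np.mean(clf.fit(X, y).predict(X) != y)  # resubstitution error
    return 0.632 * e_zero + (1 - 0.632) * e_resub

# hypothetical 7-vs-7 dataset with 10 features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (7, 10)), rng.normal(1, 1, (7, 10))])
y = np.array([0] * 7 + [1] * 7)
knn = KNeighborsClassifier(n_neighbors=3)
loo_error = 1 - cross_val_score(knn, X, y, cv=LeaveOneOut()).mean()
print(loo_error, bootstrap_632(knn, X, y))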
ROC curves
The receiver operating characteristic (ROC) curve
Feature ranking with classifiers is often used as a tool for measuring the perform-
The most intuitive method of ranking features ance of a machine-learning algorithm (Bradley,
with a classifier is to perform one of the above 1997). When designing a classifier with optimal
mentioned error estimation techniques on individ- generalization properties, there is always some in-
ual features and rank these features by increasing herent error involved. For the simplest two class
error or decreasing prediction rate. Once again, the problems, these errors can be categorized as either
total number of features governs the number of false positives or false negatives. The overall per-
identified significant features, which depends on formance of a classifier can be summarized with a
the desired threshold of significance. This thresh- confusion matrix (Table 1).
old of significance, as with the p-values in hypoth- The confusion matrix represents all information
esis testing, should be selected carefully and the about a classifier at a specific operating point. The
process may have to undergo validation through only information needed to produce an ROC
an iterative process in order to select the optimal curve is the sensitivity (Eq. (36)) and specificity
number of significant features. (Eq. (37)).
101

Table 1. Confusion matrix Szabo et al. discuss multivariate statistical meth-


True class Predicted class True total
ods for comparing gene combinations. The appli-
cation of these statistical methods to microarray
– + data is not easy for two reasons. First, some of
– Tn Fp Cn
these methods rely on estimating covariance, such
+ Fn Tp Cp as the Mahalanobis distance (Mahalanobis, 1936),
Predicted total Rn Rp which is a generalized statistical distance. Covar-
iance estimation is usually unreliable because of
the small number of samples often associated with
microarray experiments. Second, the number of
Tp
Sensitivity ¼ PðT p Þ ¼ (36) dimensions in the dataset is usually very large,
CP rendering a full search of all gene combinations
computationally impossible (Chilingaryan et al.,
Tn
Specificity ¼ PðT n Þ ¼ (37) 2002).
Cn
Algorithms such as the Monte-Carlo and ge-
As the decision threshold of the classifier is var- netic algorithm (GA) have been applied to over-
ied, P(Tp) and P(Tn) also vary. Several ROC come the problem of combinatorial complexity in
points can be plotted in this manner, while varying feature selection searches. The Monte-Carlo, a
the decision threshold to produce the ROC curve. random search algorithm, by no means searches
ROC points are plotted as false positives versus the entire space but has been shown to be a good
true positives, or 1–P(Tn) versus P(Tp). For a estimator for optimization problems. Chilingaryan
good classifier, a slight increase in false positives et al. (2002) proposed a multi-start random search
would result in a faster increase in true positives, algorithm which can be implemented in parallel.
whereas for random data, these values would in- This algorithm searches local maximum regions
crease at approximately the same rate. Therefore, and avoids finding global maximums by stopping
to measure the performance of the classifier, all the algorithm at an iteration limit. The reasoning
points on the curve should be considered. The area being that a global optimum may be overfitting,
under the ROC curve (AUC) is a good measure of especially in discontinuous or discrete data.
performance since, for an optimal classifier, this GAs are similar to random searches except the
area would be approximately 1, whereas for ran- solution is directed, or evolved, toward the optimal
dom data, this area would be 0.5. solution by producing new combinations from
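The threshold sweep that generates an ROC curve can be sketched as follows, computing the sensitivity and specificity of Eqs. (36)-(37) at every distinct classifier score and accumulating the area under the resulting curve with the trapezoid rule. The scores and labels here are hypothetical.

import numpy as np

def roc_points(scores, truth):
    """Sweep the decision threshold over the classifier scores and
    collect (1 - specificity, sensitivity) pairs, Eqs. (36)-(37)."""
    pts = [(0.0, 0.0)]
    for thr in sorted(scores, reverse=True):
        pred = scores >= thr
        tp = np.sum(pred & (truth == 1))
        fp = np.sum(pred & (truth == 0))
        sens = tp / np.sum(truth == 1)        # Eq. (36)
        spec = 1 - fp / np.sum(truth == 0)    # Eq. (37)
        pts.append((1 - spec, sens))
    return sorted(pts)

# hypothetical classifier scores and true labels
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
truth = np.array([1, 1, 0, 1, 0, 0])
pts = roc_points(scores, truth)
auc = sum((x1 - x0) * (y0 + y1) / 2.0         # trapezoid rule for the AUC
          for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
print(f"AUC = {auc:.2f}")                     # 1.0 is perfect, 0.5 is random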
Feature combinations and global search methods

Feature selection and ranking methods that analyze microarray genes or MS proteomic expression one dimension at a time ignore the fact that many of these features are highly correlated. Correlation information in microarray data can provide insights into gene regulatory networks and improve the effectiveness of pattern classifiers. Selection of significant feature combinations may produce classification rules with higher predictive ability than any single biomarker. Furthermore, these combinations may include genes or proteins that are not detectable by conventional univariate methods (Szabo et al., 2002).

Szabo et al. (2002) discuss multivariate statistical methods for comparing gene combinations. The application of these statistical methods to microarray data is not easy for two reasons. First, some of these methods rely on estimating covariance, as does the Mahalanobis distance (Mahalanobis, 1936), a generalized statistical distance. Covariance estimation is usually unreliable because of the small number of samples often associated with microarray experiments. Second, the number of dimensions in the dataset is usually very large, rendering a full search of all gene combinations computationally impossible (Chilingaryan et al., 2002).

Algorithms such as Monte-Carlo search and the genetic algorithm (GA) have been applied to overcome the problem of combinatorial complexity in feature selection searches. Monte-Carlo search, a random search algorithm, by no means covers the entire space but has been shown to be a good estimator for optimization problems. Chilingaryan et al. (2002) proposed a multi-start random search algorithm which can be implemented in parallel. This algorithm searches local maximum regions and avoids converging on global maxima by stopping at an iteration limit, the reasoning being that a global optimum may reflect overfitting, especially in discontinuous or discrete data.

GAs are similar to random searches except that the solution is directed, or evolved, toward the optimum by producing new combinations from previously good combinations, as evaluated by an objective function. In other words, a GA uses survival of the fittest to retain variations only if they are beneficial. Random permutations are periodically introduced into the population in the form of mutations, which prevent the algorithm from stalling in a local optimum. The GA has been applied to many biological problems (Kim et al., 2004; Ni and Liu, 2004; Paul and Iba, 2004; Liu et al., 2005).

In addition to multivariate statistical tests such as the Mahalanobis distance, cross-validation methods can be used as the objective function for Monte-Carlo and GA searches. Pattern classifiers such as kNN, linear discriminants, and SVMs can easily handle multi-dimensional data. Furthermore, the use of an effective cross validation or error estimation method as the objective function may alleviate the problem of overfitting with global search algorithms.

While the search for single biomarkers can provide some information about underlying biological mechanisms, the search for multiple-feature combinations is essential for discovering the complex interactions that govern gene and protein networks. The inference of genetic and proteomic networks, however, is an extension of the methods described here and will not be covered.
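A toy version of such a GA-driven feature search is sketched below: binary masks over features serve as chromosomes, cross-validated kNN accuracy serves as the objective function, and crossover plus bit-flip mutation generate new candidate combinations. All parameters (population size, mutation rate, generations) are illustrative assumptions, not values from the studies cited above.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Objective function: cross-validated kNN accuracy on the subset."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, mask], y, cv=3).mean()

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05):
    """Toy GA: binary feature masks as chromosomes, truncation selection,
    one-point crossover, and bit-flip mutation."""
    d = X.shape[1]
    pop = rng.random((pop_size, d)) < 0.2            # sparse initial masks
    for _ in range(generations):
        scores = np.array([fitness(m, X, y) for m in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # fitter half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, d)
            child = np.concatenate([a[:cut], b[cut:]])   # one-point crossover
            child ^= rng.random(d) < p_mut               # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    return np.flatnonzero(max(pop, key=lambda m: fitness(m, X, y)))

# hypothetical dataset in which feature 2 carries the class difference
X = rng.normal(0, 1, (14, 8))
y = np.array([0] * 7 + [1] * 7)
X[y == 1, 2] += 2.0
print(ga_select(X, y))    # feature 2 should appear in the selected subset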
Microarray case study

As a case study, a set of microarray data is analyzed according to some of the methods outlined above. CodeLink bioarrays were used to measure gene expression for two sample classes, cocaine overdose (OD) and control (C) (Hemby, S.). Seven cocaine OD and seven C samples, for a total of 14 samples, were analyzed after four samples (two from each class) were removed based on quality. CodeLink bioarrays have a very large dynamic range, on the order of thousands, which is good for accurately detecting small changes in expression. However, for the numerical stability of some computational algorithms, this dynamic range can be reduced to a more manageable range. The dataset was normalized with a log10 transformation, which reduces the data's range as well as the variances across samples (Fig. 2).

Fig. 2. The original cocaine overdose microarray data has a large dynamic range (top left) with large variance (bottom left). After normalization with a log10 transformation, the data's range (top right) and variance (bottom right) have been significantly reduced.

Dataset features were ranked individually using fold change, SAM, a linear SVM, a radial basis SVM, and polynomial SVMs of degrees 2 and 3. For each SVM, a full parameter analysis was performed to determine the optimal SVM cost or gamma (in the case of the radial basis and polynomial SVMs) that reduced the average error over all features. The SVM (implemented using LIBSVM (Chang and Lin, 2001)) was used to estimate the resubstitution, leave-one-out cross validation, and bootstrap errors for each feature. Bootstrap error estimation with the 0.632 method (Braga-Neto and Dougherty, 2004b) was performed on each feature with 100 sampling iterations.
For the linear SVM, the cost parameter was varied from 0.01 to 10,000 on the log scale to minimize the resubstitution, cross validation, and bootstrap errors. In all cases, the error tended to decrease with increasing cost; however, computational time also increased with increasing cost. Cost must be selected based on problem size and computing power, and it does not significantly improve the error beyond a problem-specific threshold (Fig. 3). The optimal cost was selected to be 1000.

Fig. 3. Parameter-selection curve for the linear SVM. The average error over all features decreases as the SVM cost is increased.

In addition to the SVM cost parameter, the radial basis kernel expects an additional parameter, gamma (inversely proportional to sigma in the radial basis kernel equation), which affects the size of the resulting classification regions. The cost and gamma must be selected in conjunction since they can have a direct effect on each other. Values for each parameter are selected by evaluating the performance of each feature across a 2D grid of parameter values (Fig. 4). For the cocaine overdose dataset, cost was varied from 1 to 1000 and gamma from 100 to 10,000 on the log scale. Each gene is evaluated independently; however, it is not feasible to select parameters for each individual gene. Therefore, the performance of all genes at a single parameter grid point is averaged, creating a smooth surface representing overall performance. The performance measures used were resubstitution, leave-one-out cross validation, and the 0.632 bootstrap. Resubstitution results (Fig. 4, top left) show that as gamma and cost are increased, the accuracy of classification on training data increases. Cross-validation results (Fig. 4, top right) show that the generalization ability of the classifier decreases as gamma increases. It is desirable to have both training accuracy and good generalization ability; therefore, a combination of resubstitution and cross validation should be used to determine the optimal parameters. The bootstrap error-estimation method achieves this, selecting an intermediate gamma (Fig. 4, bottom). The optimal gamma and cost for the radial basis SVM are 800 and 1000, respectively.

Fig. 4. Parameter-selection contour maps for several error-estimation methods: resubstitution (top left), cross validation (top right), and bootstrap (bottom).
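The grid evaluation described above can be sketched as follows: for every (cost, gamma) pair, a single-gene SVM is scored by leave-one-out cross validation and the errors are averaged over genes to form the performance surface. This is a simplified, hypothetical reimplementation in scikit-learn (the chapter's analysis used LIBSVM and also the 0.632 bootstrap); the grid ranges echo the text, and the toy data stand in for the cocaine overdose dataset.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def error_surface(X, y, costs, gammas):
    """Average the leave-one-out error of single-gene RBF SVMs over all
    genes at every (cost, gamma) grid point."""
    surface = np.zeros((len(costs), len(gammas)))
    for i, C in enumerate(costs):
        for j, g in enumerate(gammas):
            errs = [1 - cross_val_score(SVC(kernel="rbf", C=C, gamma=g),
                                        X[:, [gene]], y,
                                        cv=LeaveOneOut()).mean()
                    for gene in range(X.shape[1])]
            surface[i, j] = np.mean(errs)      # smooth surface over genes
    return surface

# hypothetical stand-in for the 14-sample log10 expression matrix
rng = np.random.default_rng(0)
X = rng.normal(3.0, 0.3, (14, 5))
X[7:, 0] += 0.5                                # one weakly informative gene
y = np.array([0] * 7 + [1] * 7)
costs = np.logspace(0, 3, 4)                   # 1 ... 1000, as in the text
gammas = np.logspace(2, 4, 5)                  # 100 ... 10,000, as in the text
surf = error_surface(X, y, costs, gammas)
i, j = np.unravel_index(surf.argmin(), surf.shape)
print(f"lowest average error at cost = {costs[i]:.0f}, gamma = {gammas[j]:.0f}")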

The genes of the dataset were ranked using the optimal selected parameters for the linear and radial basis SVMs as well as with fold change and Student's t-test. The fold change, t-test, and linear SVM rankings were somewhat correlated, while the radial basis SVM rankings were significantly different (Fig. 5). The gene ranking method must therefore be selected based on validation and interpretation results.

Fig. 5. Comparison of gene-ranking methods. Bootstrap error estimation for the linear SVM tends to increase as the t-test p-value increases (top left), fold change between class means decreases as the t-test p-value increases (top right), and bootstrap error estimation for the radial basis SVM has no correlation with the p-value.

Once genes have been ranked, the threshold for selection of significant genes should be determined so as to maximize the number of truly significant genes and minimize the false-discovery rate. To determine whether a gene is truly significant, that gene must be understood in the context of the disease in question. In many cases, this information is incomplete, which is why gene ranking and selection methods are employed in the first place. Full literature surveys of the top genes are possible but can be very time-consuming. In addition, different ranking and normalization methods can drastically change the selected top genes, thereby increasing the number of possibly significant genes.

The simplest method of selecting top genes from the ranking results is to set the threshold at a statistically significant level. For our test case, the threshold for selecting genes from the radial basis and linear SVM rankings was set to three standard deviations below the mean error. For normally distributed data, this means that there is about a 0.15% chance that any sample would fall below that threshold. However, the true errors are only approximately normal, resulting in slightly different percentages. For the fold change data, genes were selected if their fold change fell outside of three standard deviations from the mean, on either side, for an approximately 0.3% probability.
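The three-standard-deviation rules just described amount to a few lines of code. The sketch below applies a one-sided cutoff to per-gene error estimates and a two-sided cutoff to log fold changes; the arrays are hypothetical stand-ins for the case study's ranking outputs.

import numpy as np

def select_by_error(errors, k=3.0):
    """One-sided rule: keep genes whose error lies more than k standard
    deviations below the mean error (about 0.15% of a normal population)."""
    return np.flatnonzero(errors < errors.mean() - k * errors.std())

def select_by_fold(log_fold, k=3.0):
    """Two-sided rule: keep genes whose log fold change lies more than
    k standard deviations from the mean (about 0.3% if normal)."""
    dev = np.abs(log_fold - log_fold.mean())
    return np.flatnonzero(dev > k * log_fold.std())

# hypothetical per-gene bootstrap errors with three planted outliers
rng = np.random.default_rng(0)
errors = rng.normal(0.5, 0.05, 10000)
errors[:3] -= 0.4
print(select_by_error(errors))  # the planted genes, plus ~0.15% chance passes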
The overlap of significant markers among the three ranking methods is small (Table 2). There were 14 common genes that were significant in both the linear SVM ranking and the absolute fold change ranking. Only two were in common between the radial basis and linear SVMs, and none between the radial basis SVM and the absolute fold change.

Table 2. Correlation of gene ranking methods after establishing a statistically significant threshold

              RBF SVM    Linear SVM    Abs. FC
RBF SVM       31         2             0
Linear SVM    2          185           14
Abs. FC       0          14            73

Note: The radial basis function SVM (RBF SVM) and linear SVM thresholds were set to three standard deviations below the mean error. The absolute fold change (Abs. FC) threshold was set to three standard deviations from the mean.

The threshold for t-test ranking is more difficult to select since the distribution of p-values is not normal. However, the p-values were shown to be correlated with the linear SVM and absolute fold change rankings (Fig. 5).

ROC curves were generated for the radial basis SVM, linear SVM, and absolute fold change ranking methods using top-, mid-, and low-ranked genes. ROC curves for the absolute fold change method were generated using a linear SVM. Using optimal parameters, the radial basis SVM performs well with most genes, showing only a slight decrease in AUC as lower-ranked genes are used. The AUC of the linear SVM, however, decreases significantly as the classifier is trained with lower-ranked genes (Fig. 6).
Fig. 6. Linear SVM ROC curves generated using a top-, mid-, and low-ranked gene. The AUC of the ROC curves decreases as the gene rank decreases.

Interpretation and validation

Following the biomarker discovery process, efforts must be undertaken to interpret and validate the results in a clinically meaningful way. Solving the problems posed by biomarker discovery, as we have discussed, may be an academic challenge and will remain one until we verify our findings with independent sources.
molecular structures, and pathway visualizations
There are a variety of approaches to interpreting derived from multiple databases. The strength of
and validating selected biomarkers. Gene onto- such cross-platform validation lies in the fact that
logies are shared vocabularies that classify gene the sources are independent compartmentalized
products by three major classes: molecular func- views of cellular processes and that agreement
tion, biological process, and cellular component. throughout these sources is a robust verification of
These vocabularies are by no means exhaustive; the validity of the discovered biomarkers.
however, they represent significant progress toward Although microarray technology is fairly well un-
a unified standard for biological annotation. The derstood, it may be desirable to validate the mRNA
Gene Ontology Project is a collaborative effort to- expression in a more focused experiment. For this
ward this means — to achieve a consistent anno- reason, real-time RT-PCR (qPCR) has been used to
tation for gene products across various databases corroborate the expression of selected microarray
(Ashburner et al., 2000). By locating the discovered transcripts. By eliminating the many variables in-
biomarkers in the biological context of an ontolog- volved in a microarray experiment, a qPCR exper-
ical tree, we gather more evidence supporting or iment can serve to increase the confidence of
refuting the significance of these biomarkers. expression level before drawing conclusions.
Besides ontologies, considerable research has Undoubtedly, the golden standard for interpre-
been done on cellular pathways, in particular, tation and validation of biomarkers is empirical
pathways that are potentially involved in patho- and clinical studies. However, besides cost, prac-
logical expression. Coupled with high-throughput tical, and ethical considerations dictate that clin-
screening methods, such as microarrays and MS, ical studies must be fully justified before trials can
extensive networks of cellular pathways may be be conducted. Consequently, the importance of
sampled and then visualized simultaneously to interpretation and validation tools, as we have just
achieve a systems-based perspective. For instance, discussed, cannot be slighted.
the Pathway Explorer developed by the Institute of
Genomics and Bioinformatics, Graz University of Acknowledgments
Technology, Austria (Mlecnik et al., 2005) is a
popular web-based tool for visualizing biological The authors want to thank Georgia Cancer Coa-
pathways derived from databases such as KEGG, lition, Georgia Research Alliance, and NIH for
BioCarta, and GenMapp. On the other hand, Research support for this work.
107

References

Aebersold, R. and Mann, M. (2003) Mass spectrometry-based proteomics. Nature, 422: 198–207.

Ashburner, M., et al. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25: 25–29.

Bern, M., Goldberg, D., McDonald, W.H. and Yates, J.R. (2004) Automatic quality assessment of peptide tandem mass spectra. Bioinformatics, 20: i49–i54.

Bioinformatics Links Directory: Protein Interaction, Pathways, Enzymes. (2005).

Bolstad, B.M., Irizarry, R.A., Astrand, M. and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2): 185–193.

Bradley, A.P. (1997) The use of area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog., 30(7): 1145–1159.

Braga-Neto, U.M. and Dougherty, E.R. (2004a) Bolstered error estimation. Pattern Recog., 37: 1267–1281.

Braga-Neto, U.M. and Dougherty, E.R. (2004b) Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20(3): 374–380.

Bruni, R., Gianfranceschi, G. and Koch, G. (2005) On peptide de novo sequencing: a new approach. J. Pept. Sci., 11: 225–234.

Chang, C.-C. and Lin, C.-J. (2001) LIBSVM: a library for support vector machines.

Chilingaryan, A., Gevorgyan, N., Vardanyan, A., Jones, D. and Szabo, A. (2002) Multivariate approach for selecting sets of differentially expressed genes. Math. Biosci., 176: 59–69.

Cleveland, W.S. and Devlin, S.J. (1988) Locally weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc., 83(403): 596–610.

Cover, T.M. and Hart, P.E. (1967) Nearest neighbor pattern classification. IEEE Trans. Inform. Theory, IT-13(1): 21–27.

Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge.

Devore, J.L. (2004) Probability and Statistics for Engineering and the Sciences. Thomson Brooks/Cole, Toronto.

Dudani, S. (1976) The distance-weighted k-nearest-neighbor rule. IEEE Trans. Systems, Man, Cybernet., 6: 325–327.

Efron, B. (1983) Estimating the error rate of a prediction rule: some improvements on cross-validation. J. Am. Stat. Assoc., 78: 316–331.

Efron, B. (1997) Improvements on cross-validation: the .632+ bootstrap method. J. Am. Stat. Assoc., 92(438): 548–560.

Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci., 95: 14863–14868.

Elias, J., Gibbons, F., King, O., Roth, F. and Gygi, S. (2004) Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol., 22(2): 214–219.

Espina, V., Mehta, A.I., Winters, M.E., Calvert, V., Wulfkuhle, J., Petricoin, E.F. and Liotta, L.A. (2003) Protein microarrays: molecular profiling technologies for clinical specimens. Proteomics, 3: 2091–2100.

Fields, S. (2001) Proteomics in genomeland. Science, 291: 1221–1223.

Fu, W.J., Carroll, R.J. and Wang, S. (2005) Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics, 21(9): 1979–1986.

Gay, S., Binz, P.-A., Hochstrasser, D. and Appel, R. (2002) Peptide mass fingerprinting peak intensity prediction: extracting knowledge from spectra. Proteomics, 2: 1374–1391.

Glish, G. and Vachet, R. (2003) The basics of mass spectrometry in the twenty-first century. Nat. Rev. Drug Discov., 2: 140–150.

Goutte, C. (1997) Note on free lunches and cross-validation. Neural Comput., 9: 1245–1249.

Guyon, I., Weston, J., Barnhill, S. and Vapnik, V. (2002) Gene selection for cancer classification using support vector machines. Mach. Learning, 46: 389–422.

Ham, F.M. and Kostanic, I. (2001) Principles of Neurocomputing for Science and Engineering. McGraw-Hill, New York.

Hoffmann, R., Seidl, T. and Dugas, M. (2002) Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis. Bioinformatics, 21(8): 1509–1515.

Hubert, M. and Engelen, S. (2004) Robust PCA and classification in biosciences. Bioinformatics, 20: 1728–1736.

Kim, Y.H., Lee, S.Y. and Moon, B.R. (2004) A genetic approach for gene selection on microarray expression data. Lecture Notes Comput. Sci., 3102: 346–355.

Kulkarni, S.R., Lugosi, G. and Venkatesh, S.S. (1998) Learning pattern classification — a survey. IEEE Trans. Inform. Theory, 44(6): 2178–2206.

Li, J., Zhang, Z., Rosenzweig, J., Wang, Y.Y. and Chan, D.W. (2002) Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin. Chem., 48(8): 1296–1304.

Liao, M. (2005) Bayesian Models and Machine Learning with Gene Expression Analysis Applications. Institute of Statistics and Decision Sciences, Duke University.

Lipshutz, R., Fodor, S., Gingeras, T. and Lockhart, D. (1999) High density synthetic oligonucleotide arrays. Nat. Genet. Suppl., 21: 20–24.

Liu, J.J., Cutler, G., Li, W., Pan, Z., Peng, S., Hoey, T., Chen, L. and Ling, X.B. (2005) Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics, 21(11): 2691–2697.

Mahalanobis, P.C. (1936) On the generalized distance in statistics. Proc. Natl. Inst. India, 12: 49.

Mangiameli, P., Chen, S. and West, D.A. (1996) A comparison of SOM neural network and hierarchical clustering methods. Eur. J. Oper. Res., 93: 402–417.

Mlecnik, B., Scheideler, M., Hackl, H., Hartler, J., Sanchez-Cabo, F. and Trajanoski, Z. (2005) PathwayExplorer: web service for visualizing high-throughput expression data on biological pathways. Nucleic Acids Res., 33: W633–W637.

Model, F., Konig, T., Piepenbrock, C. and Adorjan, P. (2002) Statistical process control for large scale microarray experiments. Bioinformatics, 18: S155–S163.

Modlich, O., Prisack, H.B., Munnes, M., Audretsch, W. and Bojar, H. (2005) Predictors of primary breast cancers responsiveness to preoperative epirubicin/cyclophosphamide-based chemotherapy: translation of microarray data into clinically useful predictive signatures. J. Transl. Med., 3: 32.

Nhat, V.D.M. and Lee, S. (2005) Block LDA for face recognition. Lecture Notes Comput. Sci., 3512: 899.

Ni, B. and Liu, J. (2004) A novel method of searching the microarray data for the best gene subsets by using a genetic algorithm. Lecture Notes Comput. Sci., 3242: 1153–1162.

Olshen, A. and Jain, A. (2002) Deriving quantitative conclusions from microarray expression data. Bioinformatics, 18: 961–970.

Pandey, A. and Mann, M. (2000) Proteomics to study genes and genomes. Nature, 405: 837–846.

Paul, T.K. and Iba, H. (2004) Identification of informative genes for molecular classification using probabilistic model building genetic algorithm. Lecture Notes Comput. Sci., 3102: 414–425.

PeptideSearch: FingerPrint, Bioanalytical Research Group. (2005).

Quackenbush, J. (2001) Computational analysis of microarray data. Nat. Rev. Genet., 2: 418–427.

Reilly, C., Wang, C. and Rutherford, M. (2003) A method for normalizing microarrays using genes that are not differentially expressed. J. Am. Stat. Assoc., 98(464): 868–878.

Schena, M., Shalon, D., Davis, R. and Brown, P. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270: 467–470.

Steen, H. and Mann, M. (2004) The abc's (and xyz's) of peptide sequencing. Nat. Rev. Mol. Cell Biol., 5: 699–711.

Suzuki, T., Higgins, P.J. and Crawford, D.R. (2000) Control selection for RNA quantitation. Biotechniques, 29(2): 332.

Szabo, A., Boucher, K., Carroll, W.L., Klebanov, L.B., Tsodikov, A.D. and Yakovlev, A.Y. (2002) Variable selection and pattern recognition with gene expression data generated by the microarray technology. Math. Biosci., 176: 71–98.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S. and Golub, T.R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci., 96: 2907–2912.

Theilhaber, J., Connolly, T., Roman-Roman, S., Bushnell, S., Jackson, A., Call, K., Garcia, T. and Baron, R. (2002) Finding genes in the C2C12 osteogenic pathway by k-nearest-neighbor classification of expression data. Genome Res., 12: 165–176.

Tusher, V.G., Tibshirani, R. and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci., 98(9): 5116–5121.

Vapnik, V. (1995) The Nature of Statistical Learning Theory. Springer, New York.

Wang, A. and Gehan, E. (2005) Gene selection for microarray data analysis using principal component analysis. Stat. Med., 24: 2069–2087.

Wolkenhauer, O., Moller-Levet, C. and Sanchez-Cabo, F. (2002) The curse of normalization. Comp. Funct. Genom., 3: 375–379.

Yan, B., Pan, C., Olman, V., Hettich, R. and Xu, Y. (2005) A graph-theoretic approach for the separation of b and y ions in tandem mass spectra. Bioinformatics, 21(5): 563–574.

Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J. and Speed, T.P. (2002) Normalization of cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30(4): e15.

Zeeberg, B., Feng, W., Wang, G., Wang, M., Fojo, A., Sunshine, M., Narasimhan, S., Kane, D., Reinhold, W., Lababidi, S., Bussey, K., Riss, J., Barrett, J. and Weinstein, J. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol., 4: R28.

Zien, A., Aigner, T., Zimmer, R. and Lengauer, T. (2001) Centralization: a new method for the normalization of gene expression data. Bioinformatics, 17: S323–S331.