Thesis On Gene Expression Analysis
XU MIN
A THESIS SUBMITTED
2003
Acknowledgements
I am very grateful to my supervisor Dr. Rudy Setiono for his insightful suggestions in both
the content and presentation of this thesis. It was his encouragement, support and patience
that saw me through and I am ever grateful to him. I am full of gratitude to my boss, Dr. Peng
Jinrong, for his understanding and support in allowing me to take the Master's programme on a part-time basis.
I would like to thank Ms. Jane Lo, Mr. Guan Bin, and other members of the Lab of Functional
Genomics of the Institute of Molecular and Cell Biology, Singapore, for their help on numerous occasions.
For the study of the hybrid of Likelihood method and Recursive Feature Elimination method,
I would like to thank Dr. Isabelle Guyon for providing the supplementary data. I also thank
Dr. Cui Lirong, Mr. Wang Yang, Dr. Oilian Kon, Dr. Wolfgang Hartmann, and Mr. Li for their kind assistance.
I would also like to thank my parents for their love, encouragement, guidance and patience
throughout my studies.
Table of Contents
Acknowledgements
Table of Contents
List of Figures
List of Tables
Summary
1 Introduction
1.1 Background
3.3.4.1 Linearly separable learning model and its training
4.1.3 Neural networks with features selected using information gain
4.1.7 The combination of Likelihood method and Fisher's method
Reference
List of Figures
List of Tables
Table 4.1: List of combined univariate and multivariate feature selection methods tested.
Table 4.2: Performance of neural network using principal components as input.
Table 4.3: Decision trees constructed by iterative mode from all 72 samples.
Table 4.4: Decision trees constructed by leave-one-out mode from all 72 samples.
Table 4.5: Decision trees constructed by leave-one-out mode from 38 training samples with prediction accuracy on 34 test samples.
Table 4.6: Leukemia features and their information gain of all samples, training samples and test samples, sorted by gain of all samples.
Table 4.7: Prediction performance of neural network using features selected by information gain.
Table 4.8: Test of neural network using features selected by information gain to identify incorrectly predicted test samples.
Table 4.9: AdaBoost test results.
Table 4.10: Experiment result of neural network feature selector with summed square error function.
Table 4.11: Experiment result of neural network feature selector with cross entropy error function.
Table 4.12: The smallest gene set found that achieves perfect classification performance.
Table 4.13: The genes selected by the hybrid LIK+RFE method. The genes that have LIK scores of at least 1500 were selected initially. RFE was then applied to select these four genes that achieved perfect performance.
Table 4.14: Experimental results for the SRBCT dataset using the hybrid LIK+RFE.
Table 4.15: The genes selected by the hybrid LIK+RFE for the four binary classification problems.
Table 4.16: The performance of SVM and Naïve Bayesian classifiers built using the top genes selected according to their LIK scores.
Table 4.17: Test result of LIK, RFE and LIK+RFE on artificial datasets.
Table 4.18: Test of SVM with RBF kernel using different parameters.
Summary
Microarray technology enables the monitoring of the expression levels of thousands of genes in parallel. It has become one of the main tools for global gene expression analysis in
molecular biology research in recent years. The large amount of expression data generated by
this technology makes the study of certain complex biological problems possible and
machine learning methods are playing a crucial role in the analysis process. At present, many
machine learning methods have been or have the potential to be applied to major areas of
gene expression analysis. These areas include clustering, classification, dynamic modeling, and gene network inference.
In this thesis, we focus our work on using machine learning methods to solve the
classification problems arising from microarray data. We first identify the major types of the
classification problems; then apply several machine learning methods to solve the problems
and perform systematic tests on real and artificial datasets. We propose a hybrid feature selection
method to obtain high classification performance for high-dimensional classification problems.
Using the hybrid feature selection method, we are able to identify small sets of features that
give predictive accuracy that is as good as that from other methods which require many more
features.
1 Introduction
1.1 Background
With the completion of the Human Genome Project, biology research is entering the post-genome
era. Although biologists have collected a vast amount of DNA sequence data, the details of
how these sequences function still remain largely unknown. Genomes of even the simplest
organisms are very complex. Nowadays, biologists are still trying to find answers to the following questions:
• What are the functional roles of different genes and in what cellular processes do they participate?
• How are the genes regulated? How do the genes and gene products interact? What are the underlying interaction networks?
• How does the gene expression level differ in various cell types and states? How is the expression level affected by disease or by different experimental conditions?
Biology used to be a data-poor science. With more advanced techniques developed in recent
years, biologists are now able to transform vast amounts of biological information into useful
data. This makes it possible to study gene function globally, and a new field, functional
genomics, has emerged. Functional genomics aims to study gene
function by making use of the information and reagents provided by structural genomics. It is
characterized by high-throughput, large-scale experimental methodologies combined with
statistical and computational analysis of the results (Hieter and Boguski, 1997).
Several methods have been developed to understand the behavior of genes. Microarray
technology is an important one among them. It is used to monitor the expression levels of a
large number of genes in parallel. Here gene expression refers to the process of transcribing a gene's
DNA sequence into the RNA that serves as a template for protein production, and gene
expression level indicates how active a gene is in a certain tissue, at a certain time, or under a
certain experimental condition. The monitored gene expression levels provide an overall
picture of the genes being studied. They also reflect the activities of the corresponding protein products.
Several steps are involved in this technology. First, complementary DNA (cDNA) molecules
or oligos are printed onto slides as spots. Then, two kinds of dye-labeled samples, i.e. sample
and control, are hybridized. Finally, the hybridization is scanned and stored as images (see
example in Figure 1.1, a sample from Zebra fish). Using a suitable image-processing
algorithm, these images are quantified into a set of expression values representing the
intensity of spots. Usually, the dye intensity may be biased by factors like its physical
property, experimental variability in probe coupling and processing procedures, and scanner
settings. To minimize the undesirable effects caused by this biased dye intensity,
normalization is done to balance dye intensities and make expression values comparable
across experiments (Yang et al., 2001). Here the term comparable means that the difference
in any measured expression value of a gene between two experiments should reflect an
actual difference in expression rather than experimental artifacts.
Molecular biology also used to be a data-poor field, and most gene expression analysis
work was done manually with very limited information derived from experiments. The focus
of a molecular biologist was on a few genes or proteins. With the application of large-scale
biological information quantification methods like microarray and DNA sequencing, the
behavior of genes can be studied globally. Currently, there is an increasing demand for
automatic analysis of the overall relationships hidden behind large numbers of genes from their
expression.
Machine learning is the study of algorithms that could learn from experience and then
predict. The theoretical aspects of machine learning are rooted in statistics and informatics,
but computational considerations are also indispensable. Due to the complex nature of
biological information, machine learning could play an important role in the analysis process.
Microarray technology based gene expression profiling is one of the hottest research topics in
biology at present. The experimental part of this technology is already mature. Compared
with this, the exploration of automatic analysis methods is still at its early stage. In this thesis,
we study several machine learning approaches to solving several typical gene expression
analysis problems.
The main objectives of this thesis are:
• To identify typical gene expression analysis problems from a machine learning point of view.
• To apply suitable machine learning methods to these problems using public datasets, and to evaluate their performance systematically on real and artificial datasets.
This thesis is organized as follows. Chapter 2 provides a brief review of the current methods
that can be applied to microarray data analysis. Chapter 3 gives detailed illustrations of
several important machine learning methods that can be applied to classification using gene
expression data, including the univariate Likelihood method and multivariate feature selection
methods, and proposes a hybrid framework of univariate and multivariate feature selection.
Chapter 4 describes the experimental results of these methods on two different kinds of gene
expression analysis problems, and discusses the experiment results. Specifically, we perform
systematic tests of the hybrid of the Likelihood method and the Recursive Feature Elimination
method because we have obtained very good feature selection performance on several
microarray datasets; we also apply Support Vector Machines on a recently obtained Zebra fish
dataset to perform gene function prediction. Finally, Chapter 5 concludes the thesis and
discusses future work.
2 Literature review
Various automatic methods have been applied or developed for the gene expression analysis.
They are basically from fields such as machine learning, statistics, signal processing, and
informatics. The following are the relevant works categorized according to the analysis tasks.
Some methods can be used to find useful information or patterns from biological data,
which indicate relationships among the genes. These methods are unsupervised (Haykin,
1999), i.e., the learning models are optimized using pre-specified task-independent measures,
which reflect the difference or similarity of the training samples. Once the model has become
tuned to the statistical regularities of the input gene expression data, it develops the ability to
form internal representations for encoding features of the input and thereby to create new
classes automatically (Haykin, 1999). Principal Component Analysis (PCA) is a classical technique for
simplifying complex data sets (Raychaudhuri et al., 2000). Given an expression matrix with a
number of features, a set of new features is generated by PCA. These new features account
for most of the information in the original features, but the number of dimensions is smaller
than that of the original data. There are several neural network algorithms that support PCA;
these algorithms are mainly Hebbian-based algorithms (Haykin, 1999) which are self-
organizing and adaptive. Singular Value Decomposition (SVD) can also be used to perform
PCA. SVD is a linear transformation that decomposes the gene expression matrix into a
product of three matrices that represent the underlying characteristics of the original matrix.
Alter et al. (2000) applied SVD for gene expression analysis. They first obtained the principal
components from the decomposed matrices by applying SVD to expression data of yeast
genes; then rejected the genes that contribute little information to the principal components.
In their work, the information contribution was measured by Shannon entropy of the
expression values of the genes, where Shannon entropy characterizes the complexity of the
expression values. Finally, the remaining genes were sorted, and the results reflect the strong
relationship between the groups of these genes and their functional categories.
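As a concrete illustration of this idea, the sketch below performs PCA on a small expression matrix via SVD and scores each gene with the Shannon entropy of its normalized profile. It is a minimal sketch on synthetic data; the array names, the number of retained components, and the median-entropy filter are illustrative assumptions, not the exact procedure of Alter et al. (2000).

import numpy as np

# Toy expression matrix: rows are genes, columns are hybridizations.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))

# PCA via SVD: center each gene's profile, then decompose.
A_centered = A - A.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)

components = Vt[:3]                 # top 3 principal components ("eigengenes")
projected = A_centered @ Vt[:3].T   # gene coordinates in the reduced space

# Fraction of variance explained by each component.
explained = s**2 / np.sum(s**2)
print("variance explained:", np.round(explained[:3], 3))

# Shannon entropy of a gene's expression profile (illustrative filter).
def profile_entropy(x):
    p = np.abs(x) / np.sum(np.abs(x))   # normalize to a distribution
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

entropies = np.array([profile_entropy(row) for row in A])
kept = np.where(entropies > np.median(entropies))[0]
print("genes kept:", len(kept))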
Clustering is a typical way to group genes together according to their features. Certain
distance measure, which reflects the similarity of genes' expression, is needed for the clustering
process. Most clustering methods that have been studied in the gene expression analysis
literature use either Euclidean distance or Pearson correlation between expression profiles as
a distance measure (D’haeseleer, 2000). Other measures include Euclidean distance between
expression profiles and slopes (Wen et al., 1998), squared Pearson correlation (D’haeseleer et
al., 1997), Euclidean distance between pairwise correlations to all other genes (Ewing et al.,
1999), Spearman rank correlation (D’haeseleer et al., 1997), and mutual information
represented by pairwise entropy (D'haeseleer et al., 1997; Michaels et al., 1998; and Butte et al., 2000).
There are two kinds of clustering methods: hierarchical and non-hierarchical ones. A
hierarchical clustering method starts from individual genes, merging them into bigger clusters
until there is only one cluster left, in an agglomerative way. The method can also divisively
start from all genes, splitting them until no two of them are together. The output of the
method is a hierarchy of clusters, where the higher-level clusters are the sum of the lower-
level ones. On the other hand, a non-hierarchical clustering method first divides genes into a
certain number of clusters, and then iteratively refines them until certain optimization
criterion is met.
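To make the two flavors concrete, the following sketch clusters synthetic expression profiles hierarchically using a Pearson-correlation-based distance, and also runs a non-hierarchical k-means pass. The data, the average linkage, and the choice of three clusters are arbitrary illustrative settings rather than those of the works cited above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
profiles = rng.normal(size=(50, 8))   # 50 genes, 8 time points / conditions

# Hierarchical (agglomerative) clustering with 1 - Pearson correlation as distance.
dist = pdist(profiles, metric="correlation")
tree = linkage(dist, method="average")
hier_labels = fcluster(tree, t=3, criterion="maxclust")

# Non-hierarchical clustering: k-means with 3 clusters.
centroids, km_labels = kmeans2(profiles, k=3, seed=2)

print("hierarchical cluster sizes:", np.bincount(hier_labels)[1:])
print("k-means cluster sizes:", np.bincount(km_labels))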
Clustering methods that have been applied for gene expression analysis were reviewed in
(D’haeseleer, 2000) and (Tibshirani et al., 1999). Because different algorithms may be
applicable to different datasets, a data-driven method to evaluate these clustering algorithms was proposed in (Yeung et al., 2001).
Some methods can be used to find the relationship between gene expression and other
information, i.e. properties of genes and samples. These properties can be type of
hybridization sample, experimental condition, or the biological process that they are involved
in. These are basically supervised methods for classification and regression. The methods try
to construct learning models that represent the relationship between the gene expression data and these properties, which serve as labels.
Machine learning classification methods have been applied to gene expression analysis in
recent years. These methods usually employ class labels to represent different groups of
samples, and train a learning model to predict whether a tissue is cancerous or to predict the type of cancer using
gene expression. Cancer tissue classification is crucial for the diagnosis of patients. It used to
be based on morphological appearances, which is often hard to measure and differentiate, and
the classification result is very subjective. With the emergence of microarray technology, the
classification is improved greatly by going to the molecular level. The various machine
learning methods applied for classification are briefly described in the following paragraphs.
Neural networks are learning models that are based on the structure and behavior of neurons
in the human brain and can be trained to recognize and categorize complex patterns (Bishop,
1995). Khan et al. (2001) used neural networks to classify cancer tissues using gene
expression data as input. In their experiments, PCA was used to select a set of candidate
genes. A number of neural networks were then trained on the training dataset. The prediction
on test samples was achieved by summarizing all the outputs of the trained neural networks.
Support Vector Machine (SVM), which is rooted in statistical learning theory, is another
method that can also be used to perform classification. It can achieve good generalization
performance by minimizing both the training error and a generalization criterion that depends
on Vapnik-Chervonenkis (VC) dimension (Vapnik, 1998). In (Brown et al., 2000), SVM was
applied to classify yeast genes according to the biological processes they are involved in, as
represented by their expression data. In (Furey et al., 2000), it was also used to classify
cancer tissues.
Decision tree generates a tree structure consisting of leaves and decision nodes. Each leaf
indicates a class, and each decision node specifies some test to be carried out on a single
attribute value, with one branch and subtree for each possible outcome of the test (Quinlan,
1993). C4.5 is a well-known decision tree induction algorithm, which uses the information gain
measure. Cai et al. (2000) used it to classify cancer tissue samples, and compared its performance with that of other classification methods.
Naïve Bayes classification is a statistical discrimination method based on Bayes rule. In
(Keller et al., 2000), this method was used to classify cancer tissues. The algorithm is simple
and can be easily extended from two-class to multi-class classification. A Gaussian distribution
is assumed for the expression values of the genes within each class. A Bayesian network is a
probabilistic graphical model that represents the joint probability distribution of
random variables efficiently. Nodes of a Bayesian network could correspond to genes and
class labels, and represent the probability of the class label given some gene expression
levels. Hwang et al. (2001) used Bayesian Networks to classify acute leukemia samples. A
simple Bayesian Network with four gene nodes and one class label node was constructed
from gene expression data. The high prediction performance indicated that the constructed
network model can correctly represent the causal relationships of certain genes that are relevant to the classification.
Radial Basis Function (RBF) networks are a type of neural network whose hidden neurons
contain RBFs, a statistical transformation based on a Gaussian distribution, and whose output
neuron computes a linear combination of its inputs. Hwang et al. (2001) also used an RBF
network to classify acute leukemia samples. The network was larger than the constructed
Bayesian network, but test results showed the prediction accuracy of RBF networks was
higher.
Besides classification, feature selection i.e. the process of selecting genes that are most
relevant to the class labels is also an important task for gene expression analysis. In (Slonim
et al., 2000), a statistical method involving mean and variance was used to reflect the
relevance between individual genes and class labels. In their work, the acute leukemia
samples were divided into two groups according to their class labels. Those genes whose
expression values had small variance in both groups and a large mean difference between the two
groups were selected. In (Keller et al., 2000), a gene selection method based on likelihood was
proposed. It outperformed the baseline method in (Slonim et al., 2000)
on the same cancer dataset by choosing fewer genes while achieving similar
classification performance. In (Li, 2002), the linear relationship between the logarithm of the rank and the logarithm of the discrimination
ability of genes was found to obey Zipf's law (Zipf, 1965). Plots of this relationship provided
a useful tool in estimating the number of genes that is necessary for classification. Guyon et
al. (2002) proposed a Recursive Feature Elimination (RFE) method based on Support Vector
Machines. It makes use of the magnitude of the weights of trained SVMs as indicators of the
discrimination ability of the genes. The algorithm keeps eliminating the genes that have
relatively small contributions to the classification. In the test of the method on a leukemia
dataset, a small subset of genes was found to be sufficient for accurate classification.
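The elimination loop can be sketched in a few lines; the version below is a simplified illustration using scikit-learn's linear SVM on synthetic data, removing one feature at a time according to the smallest squared weight. The dataset, the one-feature-per-step removal, and the stopping size are arbitrary choices for the example, not the settings used by Guyon et al. (2002).

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic "expression" data: 60 samples, 200 features, few of them informative.
X, y = make_classification(n_samples=60, n_features=200, n_informative=5,
                           random_state=0)

remaining = list(range(X.shape[1]))
while len(remaining) > 5:
    clf = SVC(kernel="linear", C=1.0).fit(X[:, remaining], y)
    weights = clf.coef_.ravel() ** 2      # squared weight per remaining feature
    worst = int(np.argmin(weights))       # smallest contribution to the classifier
    del remaining[worst]                  # eliminate it

print("selected features:", remaining)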
Neural trees represent multilayer feed-forward neural networks as tree structures. They have
heterogeneous neuron types in a single network, and their connectivity is irregular and sparse
(Zhang et al., 1997). Compared with the conventional neural networks, neural trees are more
flexible. They can represent more complex relationship and permit structural learning and
feature selection. Evolutionary algorithms can be used to construct neural trees. Hwang et al.
(2001) constructed neural trees using gene expression, and selected relevant genes according
to the connections in the trees. Neural trees were found to have better classification
performance than the other two methods in the paper, i.e. Redial Basis networks and
Bayesian networks. Genes with significant contribution to the classification could also be
It is also important to study significant patterns and infer the dynamic model of gene
expression from hybridization samples collected at different experimental time points. The
dynamics can provide clues to the role of genes in the biological processes. Some of these
analysis methods are described below.
Filkov et al. (2001) proposed a set of analysis methods that are suitable for short-term
discrete time series data. These methods include: period detection, phase detection,
correlation significance of short sequences of different length, and edge detection of groups of
regulatory genes. The prediction analysis on yeast microarray data (Spellman et al., 1998)
showed that the amount of data is not sufficient for large regulatory pathway inference.
The characteristic modes obtained from SVD can also be used to construct dynamic models of gene expression
by deducing time translational matrices (Alter et al., 2000; Dewey and Bhan 2001; Holter et
al., 2001). In (Dewey and Bhan, 2001), the change of expression level in a gene was modeled
as a first order Markov process. The time translational coefficient matrix was computed using
a least squares method based on a combination of SVD and linear response theory. The
network model inferred from the matrix provided a way to cluster genes using their function.
The clusters derived by applying the method to yeast time series expression data were in
good agreement with known functional groupings of the genes.
The dynamics of gene expression could also be modeled as differential equations. In (Chen et
al., 1999) a linear transcription model is proposed, and two methods, Minimum Weight
Solutions for Linear Equations and Fourier Transform for Stable Systems, are proposed for
constructing the model from time series data.
Shannon entropy can be used to measure the variability of a gene's expression
pattern over time, or across anatomical regions, and therefore reveals the amount of
information carried by the gene during a disease process or during normal phenotypic change.
Shannon entropy is used by Fuhrman et al. (2000) to identify the most likely drug target candidates.
Gene network inference attempts to construct and study coarse-scale network models of
regulatory interactions between genes. It shows relationship between individual genes, and
then provides a richer structure than clustering, which only reveals relationship between
groups of genes. Gene network inference requires inference of the causal relationships among
genes, i.e. reverse engineering the network architecture from its activity profiles. Reverse
engineering a gene network involves the following issues: choosing the hybridization samples or expression data, choosing a network model,
choosing a method to construct the model, and studying the structure and dynamics of the
model. The study of network dynamics often involves time series analysis techniques.
The simplest gene network model is the Boolean network proposed by Kauffman (1969). In a
Boolean network model, each node is in one of two possible states: expressed or not expressed.
The actual state depends on the states of other nodes that are linked to it. A variety of
Boolean network construction algorithms have been developed. Somogyi et al. (1996)
employed a phylogenetic tree construction algorithm (Fitch and Margoliash, 1967) to create
and visualize the network. In (Liang et al., 1998), a more systematic and general algorithm
was developed using mutual information to identify a minimal set of inputs that uniquely
defines the output for each gene at next time step. Akutsu et al. (1999) improved Liang’s
algorithm to accept noisy expression data. Ideker et al. (2000) developed an alternative
approach in which additional perturbation experiments are used to iteratively refine the
sensitivity and specificity of the constructed networks. At each of the
iterations, a set of networks was inferred according to the expression data from different
experimental perturbations. They were then discriminated using an entropy-based approach.
Serov (1999) proposed an interactive Java applet tool for visualization and analysis of the
Boolean network constructed. Maki et al. (2001) proposed a system that uses a top-down
approach for the inference of Boolean network. The inferred networks on the simulated
expression data matched the original ones well even when one of the genes was disrupted.
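To give a feel for how coarse such a model is, the sketch below simulates a tiny Boolean network in which each gene's next state is a fixed logical function of the current states of its inputs; the three genes and their rules are invented purely for illustration.

from typing import Dict

# Each gene's next state is a Boolean function of the current global state.
# The wiring and rules below are made up for illustration only.
rules = {
    "geneA": lambda s: not s["geneC"],                 # repressed by C
    "geneB": lambda s: s["geneA"],                     # activated by A
    "geneC": lambda s: s["geneA"] and not s["geneB"],  # A AND NOT B
}

def step(state: Dict[str, bool]) -> Dict[str, bool]:
    """Synchronously update every gene from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}

state = {"geneA": True, "geneB": False, "geneC": False}
for t in range(5):
    print(t, state)
    state = step(state)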
The advantage of the Boolean network is its low construction cost. But it has the disadvantage
of being too coarse to represent the true regulation relationships between genes. Linear modeling
tries to overcome this disadvantage by using weighted sum to represent the influence of other
genes on one particular gene. In the model, the overall relationship is then represented as a
matrix. Someren et al. (2000) used linear algebra methods to construct the model. Partial
Least Squares method is a statistical method that is particularly useful for modeling large
number of variables each with few observations (Stone and Brooks, 1990). Datta (2001)
applied it to Saccharomyces cerevisiae yeast microarray data to get a linear regression model
and then predicted the expression level of a gene according to that of other genes. The results showed reasonable prediction performance.
The dependency of the expression of one gene on the expression of other genes can also be
modeled using nonlinear functions. The nonlinear approach gives the model more ability
to reveal the biological reality. However, it often introduces more difficulty in solving the
model at the same time. Maki et al. (2001) also modeled the gene interactions as an S-system
(Savageau, 1976). The S-system is one of the best formalisms to estimate complex gene
regulatory networks, but the number of
parameters to be estimated is very large compared with that of a Boolean network. To analyze
large-scale networks, the S-system approach was combined with their Boolean approach.
Bayesian networks can also be used for gene network inference. The inference process
estimates the statistical confidence of dependencies between the activities of genes. Friedman
et al. (2000) used it to analyze Saccharomyces cerevisiae yeast microarray data from
(Spellman et al., 1998). Their order relation and Markov relation analysis showed that the
constructed Bayesian network had strong links to cell cycle regulated genes. Pe'er et al.
(2001) extended this framework by adding new kinds of factors, enabling the handling of
mutations, and employing better discretization of the data for
preprocessing. Their experiment on yeast microarray data showed that the constructed
networks captured biologically significant relationships.
Several works have been conducted on the study of the dynamics of constructed network
model. Huang (1999) used Boolean network to interpret gene activity profiles as entities
related to the dynamics of both the regulatory network and functional cellular states. In this
approach, the dynamics were mapped into state space and the system properties of the network were studied.
These dynamics could be modeled more precisely as a set of differential equations. A neural
network is one of the methods that could effectively solve these equations. Both Vohradský
(2001) and D’haeseleer (2000) modeled gene network as recurrent neural networks.
Vohradský (2001) used recurrent back-propagation (Pineda 1987), and simulated annealing
to construct the network. However, D’haeseleer (2000) tried back-propagation through time
(Werbos, 1990) in the training process, with techniques such as weight decay and weight
elimination (Weigend et al., 1991) applied to simplify the model. Compared with simulated
annealing, back-propagation is a more effective training method, but its scalability is worse
because it attempts to unfold the temporal operation of the network into a layered feed-
forward network. When doing their experiments, neither Vohradský nor D'haeseleer had a
microarray dataset that was large enough for the network construction. Instead, they
used artificial data for the experiments. The trained networks appeared to match the original ones.
Szallasi (1999) illustrated some basic properties of gene networks that could affect modeling.
They include the stochastic nature, effective size, compartmentalization, and information
content of the expression matrix. In (Wessels et al., 2001; Someren et al., 2001), different
network models were categorized and compared under criteria such as inferential power,
predictive power, robustness, and consistency.
Clustering based on gene expression reflects the correlation of genes. Classification links the
expression of genes to functions. To study the change of expression of genes through time,
dynamic modeling and time series analysis method have been used. In order to obtain the
causal relationship or regulation of genes globally, the gene networks are needed to be
inferred from the expression data. The inference work is a reverse engineering process
(D’haeseleer, 2000). Reverse engineering is one of the major focuses of systems biology at
present. The field of Systems biology studies biology at system level by examining the
structure and dynamics of cellular and organismal function (Kitano, 2002). When the field of
systems biology advances to the stage of trying to unify the biological knowledge across
different levels of living organisms, we expect the understanding of the inherent complexity
of life to improve greatly.
Our work focuses on classification and feature selection methods for global gene expression
analysis. In Section 3.1, the classification problems are described. Section 3.2 is about feature
integration and univariate feature selection methods. Section 3.3 describes multivariate
feature selection and classification methods.
In this section, we present the gene expression data and class information in a mathematical
form, and illustrate two types of classification problems that are commonly encountered in
gene expression analysis. Suppose the expression measurements from a set of
hybridizations are available, with m genes and n hybridizations. The expression data can then be represented as an m × n matrix

A = [a_{ij}]_{m \times n},        Eq. 3.1.1.1

where a_{ij} represents the expression value of the i-th gene in the j-th hybridization.
Certain properties of the genes or hybridization samples need to be defined, i.e. labeling is required,
in order to find the relationship between the genes and their expression matrix. The gene's
property is defined as an m × 1 vector

x = [x_1, x_2, \ldots, x_m]^T,        Eq. 3.1.1.2

where each element represents one possible value of this property. For example, x_i = +1
means the i-th gene belongs to some biological process, while x_i = −1 means the i-th gene
does not belong to this process. Similarly, the property of the hybridization samples is defined as
an n × 1 vector

y = [y_1, y_2, \ldots, y_n]^T,        Eq. 3.1.1.3

where each element represents one possible value of this property. For example, y_i = +1
means the i-th sample is cancerous, while y_i = −1 means the i-th sample is non-cancerous.
From a microarray experiment, we can only obtain a very limited number of hybridizations
which involve a large number of genes. That is to say, n is usually no more than a hundred,
but m can be a few thousands. So there are basically two types of classification problems:
• First type: large number of samples with low dimension. When the relationship
between the genes and their property x is studied, the expression matrix A is used as the input
of the learning model and x as the output. There are n features, each of them corresponding to one
hybridization, and m samples, each of them corresponding to one gene.
• Second type: small number of samples with a large number of features and high
dimension. When the relationship between the hybridization samples and their property y is studied, the
classification problem can be expressed using B = A^T = [b_1, b_2, \ldots, b_n]^T, which is the transpose of A,
as the input of the learning model and y as the output. There are m features, each corresponding
to one gene, and n samples, each corresponding to one hybridization.
In this thesis, our work is focused mainly on solving the classification problems of the second
type, because this type of problem has a distinct nature from ordinary classification
problems, and many cancer tissue classification problems based on gene expression are of this
kind. We also have obtained a newly released Zebra fish developmental microarray dataset,
which can be used to form classification problems of the first type. Because we are able to
validate our prediction results using more precise biological experiments with the help of the
researchers who generated this dataset, we have applied Support Vector Machines to this dataset for gene function prediction.
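The following short sketch, with made-up dimensions, shows how the two problem types correspond to the same expression matrix viewed in two orientations; the array names follow the notation above and the values are random placeholders.

import numpy as np

m, n = 5000, 72                     # genes, hybridizations (typical orders of magnitude)
rng = np.random.default_rng(0)
A = rng.normal(size=(m, n))         # expression matrix, genes x hybridizations

# First type: predict a gene property x from A (m samples, n features each).
x = rng.choice([-1, +1], size=m)    # e.g. membership in a biological process
first_type_inputs, first_type_targets = A, x

# Second type: predict a sample property y from B = A^T (n samples, m features each).
y = rng.choice([-1, +1], size=n)    # e.g. cancerous vs. non-cancerous tissue
B = A.T
second_type_inputs, second_type_targets = B, y

print(first_type_inputs.shape, second_type_inputs.shape)   # (5000, 72) (72, 5000)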
Preprocessing is needed for the second type of classification problem (large number of
features). The main goal of preprocessing is to reduce the number of inputs for a learning
model without much loss, or even with some improvement of classification accuracy. Two
kinds of methods can be used for the reduction: feature selection and feature integration (Liu
and Motoda, 1998). Feature selection selects a subset of features as classifier input, while
feature integration generates a new feature set from original features as input. The feature
integration method used in this thesis is Principal Component Analysis (PCA) (Muirhead,
1982); the two univariate feature selection methods used in our research are Information Gain
(Quinlan, 1993) and Likelihood method (Keller et al., 2000). Due to space limitation, only
Information gain and Likelihood method will be described in this section. Here univariate
means the selection method only takes the contribution of individual features to the
classification into consideration. The multivariate feature selection method will be described
in Section 3.3, because most of them are based on classification methods. The term
multivariate means the selection method accounts for the combinatorial effect of the features
on the classification.
Information gain method can be used to rank the individual discrimination ability of the
features. It comes from information theory. In this method, information amount is measured
by entropy. Let S denote the set of all samples, |S| = n. Let k denote the number of classes,
and let C_i, i = 1, ..., k denote the set of samples that belong to class i, with \cup_{i=1}^{k} C_i = S.
Suppose a message announces that an arbitrary sample belongs to class C_i. This message has probability |C_i ∩ S| / |S| of being correct. So the information it
conveys is −log_2(|C_i ∩ S| / |S|). The expected information of such a message is

\mathrm{info}(S) = -\sum_{i=1}^{k} \frac{|C_i \cap S|}{|S|} \log_2 \frac{|C_i \cap S|}{|S|}.        Eq. 3.2.1.1

Similarly, the expected information amount of any subset of S can also be determined. If S
is partitioned by a test X into l subsets T_i, i = 1, ..., l, with \cup_{i=1}^{l} T_i = S and T_i ∩ T_j = ∅ for all 1 ≤ i < j ≤ l, then the expected information after the partition is

\mathrm{info}_X(S) = \sum_{i=1}^{l} \frac{|T_i|}{|S|} \, \mathrm{info}(T_i).        Eq. 3.2.1.2

The difference

\mathrm{gain}(X) = \mathrm{info}(S) - \mathrm{info}_X(S)        Eq. 3.2.1.3

is the information gain of the partition X. Information gain tends to be greater when l becomes larger, which may not truly reflect the
discrimination ability of the feature. To compensate for this, the split information

\mathrm{split\ info} = -\sum_{i=1}^{l} \frac{|T_i|}{|S|} \log_2 \frac{|T_i|}{|S|}        Eq. 3.2.1.4

is used to normalize the gain, giving the gain ratio

\mathrm{gain\ ratio} = \frac{\mathrm{gain}}{\mathrm{split\ info}}.        Eq. 3.2.1.5

Given sample expression values and their class labels, each feature's information gain ratio
can be calculated this way: sort the samples according to the expression values of this feature,
partition them and calculate the gain ratio of every possible split, and choose the maximum one
as this feature's information gain ratio. This ratio provides a measure to evaluate the
classification ability of one feature. Feature selection is done by choosing features that have high gain ratios.
Keller et al. (2000) proposed the Maximum Likelihood gene selection (LIK) method. Denote
the event that a sample belongs to class a or class b by M a and M b , respectively. The
difference in the log likelihood is used to rank the usefulness of gene g for distinguishing the
samples of one class from the other. The LIK score is computed as follows:
\mathrm{LIK}^g_{a \to b} = \log P(M_a \mid x^g_{a,1}, \ldots, x^g_{a,n_a}) - \log P(M_b \mid x^g_{a,1}, \ldots, x^g_{a,n_a})        Eq. 3.2.2.1

and

\mathrm{LIK}^g_{b \to a} = \log P(M_b \mid x^g_{b,1}, \ldots, x^g_{b,n_b}) - \log P(M_a \mid x^g_{b,1}, \ldots, x^g_{b,n_b})        Eq. 3.2.2.2

where P(M_i | x^g_{j,1}, ..., x^g_{j,n_j}) is the a posteriori probability that M_i is true given the expression
values of the g-th gene of all the training samples that belong to class j, and n_j is the number of training samples in class j. To compute this probability, Bayes rule
is used, with three assumptions required by the method. First is the assumption of equal prior probabilities for the two classes,
and second is the assumption that the conditional probability of X falling within a small non-zero interval around x can be approximated by the normal density

P(x \mid M) = \frac{1}{\delta \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\delta^2}}        Eq. 3.2.2.5

where μ and δ are the mean and standard deviation of X respectively. The values μ and
δ can be estimated from the training data. With the third assumption that the distributions of
the expression values of the genes are independent, we obtain the LIK ranking of class a
over class b for gene g:

\mathrm{LIK}^g_{a \to b} = \log \prod_{i=1}^{n_a} P(x^g_{a,i} \mid M_a) - \log \prod_{i=1}^{n_a} P(x^g_{a,i} \mid M_b)
 = \sum_{i=1}^{n_a} \left[ -\log \delta^g_a - \frac{(x^g_{a,i} - \mu^g_a)^2}{2 (\delta^g_a)^2} + \log \delta^g_b + \frac{(x^g_{a,i} - \mu^g_b)^2}{2 (\delta^g_b)^2} \right],        Eq. 3.2.2.6

and similarly, the LIK ranking of class b over class a for this gene is

\mathrm{LIK}^g_{b \to a} = \sum_{i=1}^{n_b} \left[ -\log \delta^g_b - \frac{(x^g_{b,i} - \mu^g_b)^2}{2 (\delta^g_b)^2} + \log \delta^g_a + \frac{(x^g_{b,i} - \mu^g_a)^2}{2 (\delta^g_a)^2} \right].        Eq. 3.2.2.7

Genes that have higher likelihood scores are expected to have better ability to distinguish one class from the other.
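A direct transcription of Eq. 3.2.2.6 and Eq. 3.2.2.7 into code is shown below; it is a minimal sketch on synthetic data, estimating the class means and standard deviations from the training samples and ranking genes by the sum of the two scores. Summing the two directions is one reasonable way to combine them for ranking, an assumption for this example rather than the exact rule of Keller et al. (2000).

import numpy as np

def lik_scores(expr_a, expr_b):
    """LIK scores for one gene.

    expr_a, expr_b: 1-D arrays of the gene's expression values in class a and b.
    Returns (LIK_a_to_b, LIK_b_to_a) following Eq. 3.2.2.6 and Eq. 3.2.2.7.
    """
    mu_a, sd_a = expr_a.mean(), expr_a.std(ddof=1)
    mu_b, sd_b = expr_b.mean(), expr_b.std(ddof=1)

    def log_normal(x, mu, sd):
        # constant term -log(sqrt(2*pi)) cancels in the difference, so it is omitted
        return -np.log(sd) - (x - mu) ** 2 / (2 * sd ** 2)

    lik_ab = np.sum(log_normal(expr_a, mu_a, sd_a) - log_normal(expr_a, mu_b, sd_b))
    lik_ba = np.sum(log_normal(expr_b, mu_b, sd_b) - log_normal(expr_b, mu_a, sd_a))
    return lik_ab, lik_ba

# Synthetic example: 200 genes, 20 samples per class; gene 0 is made informative.
rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 1.0, size=(20, 200))
class_b = rng.normal(0.0, 1.0, size=(20, 200))
class_b[:, 0] += 3.0

scores = np.array([sum(lik_scores(class_a[:, g], class_b[:, g])) for g in range(200)])
print("top genes by LIK:", np.argsort(scores)[::-1][:5])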
Classification is also called pattern recognition. It is the process of assigning an input pattern to one of a prescribed
number of classes (categories) (Haykin, 1999). Training of a classifier
is usually needed to establish the learning model that could reflect the relationship between
input patterns and class labels. This thesis employs four classification methods. They are
decision tree (Quinlan, 1993), neural networks (Haykin, 1999), support vector machines
(Vapnik, 1998) and Bayesian classification (Keller et al., 2000). Due to the space limitation,
only Neural Network, Support Vector Machines and Bayesian classification will be
described. Boosting technique, which combines output of multiple classifiers to form more
accurate hypothesis, will also be illustrated in this section. Three multivariate feature
selection methods, namely the neural network feature selector, the recursive feature elimination method, and the hybrid feature selection framework, are then presented.
A neural network is a massively parallel distributed processor made up of simple processing
units, which has a natural property for storing experiential knowledge and making it
available for future use. Similar to the human brain, it acquires knowledge from the environment
through a learning process, and its interneuron connection strengths store the knowledge
(Haykin, 1999; and Aleksander and Morton 1990). Neural networks as an analysis method have a
simple hierarchical structure with dense synaptic connections. A three-layer neural network
model consists of an input layer, a hidden layer and an output layer. Input neurons in the
input layer receive input signals, and send them to neurons in the hidden layer. Hidden
neurons are computational units. They combine their inputs as their local fields, and send
their output to neurons at the output layer. Output neurons then perform a similar
combination and their output is the output of the whole model. Each of the synaptic links in
the model has a weight associated with it, which is used to amplify or reduce the signal when it passes through the link.
Computationally, this model can also be described as follows: Suppose there are k 0 input
neurons, k1 hidden neurons and k 2 output neurons. The synaptic links between the input and
hidden layer can be represented as a k1 × (k 0 + 1) matrix w (1) , including biases, and the links
between hidden layer and output layer can similarly be represented as a k 2 × (k1 + 1) matrix
w^{(2)}, also including biases. Let a vector x = [+1, x_1, \ldots, x_{k_0}]^T be the input of the network. The
neurons in the hidden layer first sum up their inputs together with the bias associated with the
first element of x,

v^{(1)} = w^{(1)} x,        Eq. 3.3.1.1

and perform a certain transformation using the activation function \varphi^{(1)} to get their output

y^{(1)} = [+1, \varphi^{(1)}(v^{(1)})^T]^T = [1, y^{(1)}_1, \ldots, y^{(1)}_{k_1}]^T.        Eq. 3.3.1.2

Here we also add an additional element +1 as the first element of y^{(1)} in order to handle the bias
of the output neurons. Similarly, the neurons in the output layer perform the summation

v^{(2)} = w^{(2)} y^{(1)},        Eq. 3.3.1.3

and the transformation

y^{(2)} = \varphi^{(2)}(v^{(2)}).        Eq. 3.3.1.4
Commonly used activation functions include:
• Threshold function

y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases}        Eq. 3.3.1.5

• Piecewise-linear function

y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge a \\ v & \text{if } -a < v < a \\ 0 & \text{if } v \le -a \end{cases}, with a > 0        Eq. 3.3.1.6

• Sigmoid function

y = \varphi(v) = \frac{1}{1 + e^{-av}}, with a > 0        Eq. 3.3.1.7

• Hyperbolic tangent function

y = \varphi(v) = a \tanh(bv) = a \, \frac{e^{bv} - e^{-bv}}{e^{bv} + e^{-bv}}, with a > 0, b > 0        Eq. 3.3.1.8
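The forward computation of Eq. 3.3.1.1 to Eq. 3.3.1.4 can be written compactly with matrices; the sketch below uses a hyperbolic tangent hidden layer and a sigmoid output layer on random weights, purely to illustrate the shapes and the bias handling described above.

import numpy as np

k0, k1, k2 = 4, 3, 1                    # input, hidden and output neurons
rng = np.random.default_rng(0)
w1 = rng.normal(size=(k1, k0 + 1))      # hidden weights, first column is the bias
w2 = rng.normal(size=(k2, k1 + 1))      # output weights, first column is the bias

def forward(x_raw):
    x = np.concatenate(([1.0], x_raw))          # prepend +1 for the hidden biases
    v1 = w1 @ x                                 # Eq. 3.3.1.1
    y1 = np.concatenate(([1.0], np.tanh(v1)))   # Eq. 3.3.1.2, +1 for output biases
    v2 = w2 @ y1                                # Eq. 3.3.1.3
    y2 = 1.0 / (1.0 + np.exp(-v2))              # Eq. 3.3.1.4, sigmoid output
    return y2

print(forward(np.array([0.5, -1.2, 0.3, 2.0])))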
A neural network provides a mapping from its input to its output, and this mapping is
determined by the network’s structure and synaptic weights. With a proper mapping
established, given certain inputs, the network can produce the desired outputs. This
mechanism could be used for classification. Training is needed to obtain a neural network
that can give correct predictions. The training is done using an algorithm which usually
consists of two steps: generate initial weights, and then iteratively refine these weights until
certain stopping criterion is met. In essence, these algorithms are optimization algorithms,
which attempt to meet certain criteria when refining the weights. These criteria are measured
in terms of error functions because they reflect the difference between neural network outputs
Among many training algorithms, back-propagation is most popular (Hertz et al., 1991).
There are two types of training in back-propagation: sequential mode and batch mode. In
sequential mode, the algorithm calculates training error and updates weights each time it
receives a training sample. On the other hand, in batch mode, the algorithm calculates overall
error of all the training samples, and then updates the network’s weights. The sequential
mode is computationally slower than the batch mode. But the order of the training samples that
are presented to the training algorithm can be randomly assigned, and the stochastic nature of
the updates makes it less likely to be trapped in a local minimum. In this thesis we describe the
sequential mode. Its batch mode version could be easily adapted from the sequential mode.
Given the vector of desired outputs d, let e = d − y^{(2)} (Eq. 3.3.1.10) be the output error vector. Back-propagation uses

E = e^T e        Eq. 3.3.1.11

as the instantaneous error function of the network. The objective of back-propagation training
is to minimize the average instantaneous error function to a certain extent, given all training
samples.
Given a sample vector x with its class labels d , E can be computed using Eq. 3.3.1.1 to Eq.
3.3.1.4 and Eq. 3.3.1.10 to Eq. 3.3.1.11. To meet the objective, the correction of weights
\Delta w^{(1)} and \Delta w^{(2)} should be proportional to the partial derivatives \partial E / \partial w^{(1)} and \partial E / \partial w^{(2)}:

\Delta w^{(2)} = -\eta \frac{\partial E}{\partial w^{(2)}}
 = -\eta \frac{\partial E}{\partial e} \frac{\partial e}{\partial y^{(2)}} \frac{\partial y^{(2)}}{\partial v^{(2)}} \frac{\partial v^{(2)}}{\partial w^{(2)}}
 = -\eta \cdot e \cdot (-1) \otimes \varphi^{(2)\prime}(v^{(2)}) \cdot y^{(1)T}
 = \eta \, (e \otimes \varphi^{(2)\prime}(v^{(2)})) \, y^{(1)T}
 = \eta \, \delta^{(2)} y^{(1)T},        Eq. 3.3.1.12

here we let \delta^{(2)} = e \otimes \varphi^{(2)\prime}(v^{(2)}), where \otimes denotes element-wise multiplication; and similarly,

\Delta w^{(1)} = -\eta \frac{\partial E}{\partial w^{(1)}}
 = -\eta \frac{\partial E}{\partial y^{(1)}} \frac{\partial y^{(1)}}{\partial w^{(1)}}
 = \eta \, ((w^{(2)T} (e \otimes \varphi^{(2)\prime}(v^{(2)}))) \otimes \varphi^{(1)\prime}(v^{(1)})) \, x^T
 = \eta \, ((w^{(2)T} \delta^{(2)}) \otimes \varphi^{(1)\prime}(v^{(1)})) \, x^T
 = \eta \, \delta^{(1)} x^T,        Eq. 3.3.1.13

here we let \delta^{(1)} = (w^{(2)T} \delta^{(2)}) \otimes \varphi^{(1)\prime}(v^{(1)}).
In Eq. 3.3.1.12 and Eq. 3.3.1.13, η is a positive learning rate parameter which controls the step size of the weight updates.
The activation functions \varphi^{(1)} and \varphi^{(2)} must be differentiable for back-propagation training. The sigmoid and hyperbolic tangent functions satisfy this requirement and are commonly used in feed-forward
neural networks. Adjustment of the weights involves only the neuron signals of the successive
layers they connect, so the algorithm is a local method. This also makes it computationally
efficient.
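A single sequential-mode update following Eq. 3.3.1.12 and Eq. 3.3.1.13 is sketched below for a tanh/sigmoid network; the tiny sizes, the learning rate and the random data are placeholders chosen only to show the mechanics.

import numpy as np

rng = np.random.default_rng(0)
k0, k1, k2 = 4, 3, 1
w1 = rng.normal(scale=0.1, size=(k1, k0 + 1))
w2 = rng.normal(scale=0.1, size=(k2, k1 + 1))
eta = 0.1                                           # learning rate

x = np.concatenate(([1.0], rng.normal(size=k0)))    # input with bias element
d = np.array([1.0])                                 # desired output

# Forward pass (Eq. 3.3.1.1 - 3.3.1.4), tanh hidden layer, sigmoid output.
v1 = w1 @ x
y1 = np.concatenate(([1.0], np.tanh(v1)))
v2 = w2 @ y1
y2 = 1.0 / (1.0 + np.exp(-v2))

# Backward pass (Eq. 3.3.1.12 and Eq. 3.3.1.13).
e = d - y2
delta2 = e * y2 * (1.0 - y2)                # e ⊗ φ(2)'(v2), sigmoid derivative
w2_update = eta * np.outer(delta2, y1)

# Propagate through w2, dropping its bias column before reaching the hidden layer.
delta1 = (w2[:, 1:].T @ delta2) * (1.0 - np.tanh(v1) ** 2)
w1_update = eta * np.outer(delta1, x)

w2 += w2_update
w1 += w1_update
print("squared error before update:", float(e @ e))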
Feed-forward neural networks can be used for classification. For two class problems, a neural
network with a single output neuron is needed. Samples are labeled + 1 or − 1 . The label
value will be the desired values at the output side of the network when training the network.
For multi-class problems, a neural network that has the number of output neurons identical to
the number of classes is usually required. The desired output value in this case is a vector.
The elements of the vector are − 1 , except the one that corresponds to the class of the sample,
which is +1. If the training samples are biased, for example, when the number of positive samples
is much less than that of the negative ones, label values other than ±1 can be used to adjust
the feedback signal to obtain better performance. Many gene expression classification
problems have such biased class distributions.
One of the main disadvantages of neural networks for classification is that the training result
also depends on initial weights, which are generated randomly. Boosting can be used to
enhance the robustness of the neural network. The term Boosting refers to a machine learning
framework that combines a set of simple decision rules, which is generated by a set of
learners with different learning abilities, into a complex one that has higher accuracy and
lower variance. It is especially useful in handling real world problems that have the following
properties (Freund and Schapire, 1996): the samples have various degrees of hardness to
learn, and the learner is sensitive to changes in the training samples. This kind of
learning hardness often occurs when applying machine learning methods to tackle biological problems.
There are three main approaches to boosting:
• Boosting by filtering: If the number of training samples is large, the samples are either
accepted or discarded by the learners during training, so that each learner works on a filtered stream of samples.
• Boosting by subsampling: With a fixed training sample set size, the probability to
include samples into training sample set for learning algorithms is adjusted.
• Boosting by reweighting: This approach assumes that the training samples could be
weighted by the learning algorithms. The training errors are calculated by making use
of these weights.
AdaBoost (Freund and Schapire, 1996) is a simple and effective boosting algorithm through
subsampling. During the learning process, the algorithm tries to make the learners focus on
different portions of the training samples by refining the sampling distribution. Figure 3.1 gives the pseudocode of the algorithm.
The input of the algorithm is a set of n training samples, {(b_1, y_1), ..., (b_n, y_n)}, where b_i are the sample vectors and y_i are their class
labels, which are also the desired outputs of the learning models. Y is the set of class labels.
The algorithm starts by setting the initial sampling distribution d 1,1Kn to a uniform one. It then
enters an iterative process. At the t th iteration, the algorithm calls the function
train_learner() with training samples b1..n and sampling distribution d t ,1Kn , to get the trained
learning model M t . By calling function get_hypothesis() with M t and b1..n , the algorithm
generates the hypothesized classes of training samples. The algorithm then calculates total
error ε_t by adding up the distribution weights of all the wrongly predicted samples. If the error is bigger
than 1/2, the algorithm stops. Otherwise, it proceeds to calculate a factor β_t. This factor is
used to reduce the portion of the correctly predicted training samples in the sampling
distribution d_{t+1,1..n} of the next iteration. When the algorithm terminates, T learning models
M_{1..T} have been trained, with factors β_{1..T} indicating the contribution of the learning models to the
combined hypothesis. The combined hypothesis for a test sample b can then be calculated as
h(b) = \arg\max_{y \in Y} \sum_{t:\, \mathrm{get\_hypothesis}(M_t,\, b) = y} \log \frac{1}{\beta_t}.        Eq. 3.3.2.1

When the number of training samples is small, β_t could be zero. We set a small lower threshold for β_t to keep Eq. 3.3.2.1 well defined in this case.
d_{1,i} = 1/n  (i = 1 ... n)                       // initial sampling distribution
for t = 1 to T do
    M_t = train_learner(b_{1..n}, d_{t,1..n})
    h_{t,1..n} = get_hypothesis(M_t, b_{1..n})
    ε_t = Σ_{i: h_{t,i} ≠ y_i} d_{t,i}             // total error under d_t
    if ε_t > 1/2 then
        T = t − 1
        terminate loop                             // terminate the algorithm
    end
    β_t = ε_t / (1 − ε_t)
    d_{t+1,i} = d_{t,i} · β_t  if h_{t,i} = y_i, otherwise d_{t,i}
    normalize d_{t+1,1..n} so that it sums to one
end

Figure 3.1: The AdaBoost algorithm.
Theoretical study shows that if the hypothesis obtained by individual classifiers constantly
has error that is slightly better than random guess, the number of prediction errors of the final
hypothesis h drops to zero exponentially fast when T increases (Freund and Schapire,
1996).
Three layer neural networks can be the learning model integrated with AdaBoost. In this case,
train_learner() consists of initializing, training and optimizing the neural network weights, and
get_hypothesis() converts the network outputs into class labels. The
algorithm will construct T neural networks in total for making the combined decision.
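For completeness, here is a compact runnable version of the loop in Figure 3.1 and of the combined vote of Eq. 3.3.2.1; it uses decision stumps from scikit-learn as the weak learners instead of neural networks, simply because they train quickly, and all dataset parameters are arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
n, T = len(y), 20
d = np.full(n, 1.0 / n)                      # initial sampling distribution
models, betas = [], []

for t in range(T):
    m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=d)
    h = m.predict(X)
    eps = d[h != y].sum()                    # weighted training error
    if eps > 0.5:
        break
    beta = max(eps / (1.0 - eps), 1e-10)     # lower threshold to avoid log(1/0)
    d = np.where(h == y, d * beta, d)        # shrink weight of correct samples
    d /= d.sum()                             # renormalize
    models.append(m)
    betas.append(beta)

def combined_predict(x):
    """Weighted vote of Eq. 3.3.2.1 over the trained learners."""
    votes = {}
    for m, beta in zip(models, betas):
        label = m.predict(x.reshape(1, -1))[0]
        votes[label] = votes.get(label, 0.0) + np.log(1.0 / beta)
    return max(votes, key=votes.get)

print("training accuracy:",
      np.mean([combined_predict(xi) == yi for xi, yi in zip(X, y)]))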
This section first analyses the limitation of applying information gain measure for ranking of
features having continuous value, then describes the neural network feature selector method.
Information gain can be applied in measuring both discrete and continuous feature values.
But in the continuous case, it only takes the order of the values into consideration, which may not
truly reflect a feature's discrimination ability. Consider the following example matrix of ten
samples (rows) and three features (columns), where the first five samples belong to one class
and the last five to the other:

B = [  6   1   7
       7   2   7.5
       8   3   8
       9   4   8.5
      10   5   9
      11  16  12
      12  17  12.5
      13  18  13
      14  19  13.5
      15  20  14 ]

The three features are plotted with their class labels in Figure 3.2, the first feature at line y = 1,
the second at y = 2 and the third at y = 3. From the plot we can
see that the discrimination abilities of the features at lines y = 2 and y = 3 are better than that of
the one at line y = 1. Compared with the feature values at line y = 1, the distance between the feature
values of the two classes at line y = 2 is larger, and the density of the feature values within
each class at line y = 3 is higher. But the information gain ratios of these three
features are the same, because the measure only takes the order of the feature values into account, hence
achieving the same maximum gain ratio for all three features when split in the middle.

Figure 3.2: Plot of feature values with class labels, 'o' for '−1' and '+' for '+1'.
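The point can be checked numerically; the short sketch below implements Eq. 3.2.1.1 to Eq. 3.2.1.5 for binary class labels and prints the maximum gain ratio of each of the three features above, which indeed comes out identical.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def max_gain_ratio(values, labels):
    """Best information gain ratio over all binary splits of a continuous feature."""
    order = np.argsort(values)
    labels = np.asarray(labels)[order]
    n, best = len(labels), 0.0
    for cut in range(1, n):                       # split into [0:cut) and [cut:n)
        left, right = labels[:cut], labels[cut:]
        info_x = (cut / n) * entropy(left) + ((n - cut) / n) * entropy(right)
        gain = entropy(labels) - info_x           # Eq. 3.2.1.3
        split_info = entropy(np.r_[np.zeros(cut), np.ones(n - cut)])  # Eq. 3.2.1.4
        if split_info > 0:
            best = max(best, gain / split_info)   # Eq. 3.2.1.5
    return best

B = np.array([[6, 1, 7], [7, 2, 7.5], [8, 3, 8], [9, 4, 8.5], [10, 5, 9],
              [11, 16, 12], [12, 17, 12.5], [13, 18, 13], [14, 19, 13.5], [15, 20, 14]])
y = np.array([-1] * 5 + [+1] * 5)

for j in range(3):
    print("feature", j + 1, "max gain ratio:", round(max_gain_ratio(B[:, j], y), 3))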
The classification accuracy of a classifier built on a candidate feature subset can also be a selection criterion. The advantage of integrating a classifier into the feature
selection process is that the feature set is optimized by the classification accuracy. Moreover,
the training of the classifier and the selection of the features use the same bias. The
consistency improves the classification performance. However, the computational cost of the
integration may be high. In (Liu and Motoda, 1998) this approach is called wrapper model.
Procedure wrapper accepts full feature set (F), training samples (D) and testing samples (T)
as input. It first generates a subset of features (S), and then performs cross validation using S
and D to get classification accuracy (A). These two steps are repeated until A is sufficiently
high under certain criterion. A classification model (M) is then obtained from D with selected
S. Finally M and S are used to perform the test to measure the performance. In the algorithm,
cross validation can be done by dividing D into a training set and a validating set, or by the leave-one-out method.
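The wrapper loop described above can be sketched as follows; here the classifier is a small scikit-learn model, the candidate subsets are generated by greedy backward removal, and the acceptance criterion is a simple cross-validated accuracy threshold. All of these concrete choices are placeholders for illustration, not the exact procedure of Liu and Motoda (1998).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

D_X, D_y = make_classification(n_samples=80, n_features=30, n_informative=4,
                               random_state=0)

def cv_accuracy(features):
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, D_X[:, features], D_y, cv=5).mean()

# Greedy backward wrapper: drop the feature whose removal hurts accuracy least,
# and stop when accuracy would fall below an acceptance threshold.
S = list(range(D_X.shape[1]))
threshold = cv_accuracy(S) - 0.02
while len(S) > 1:
    candidates = [(cv_accuracy([f for f in S if f != g]), g) for g in S]
    best_acc, worst_feature = max(candidates)
    if best_acc < threshold:
        break
    S.remove(worst_feature)

final_model = LogisticRegression(max_iter=1000).fit(D_X[:, S], D_y)
print("selected features:", S, "cv accuracy:", round(cv_accuracy(S), 3))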
Setiono and Liu (1997) proposed a neural network feature selection method based on the
wrapper approach. In this thesis, we apply it to gene expression analysis, with some modifications described at the end of this section.
The algorithm starts with all features, and removes features that have minor contribution to
classification one by one. The algorithm first trains a neural network with a given feature set,
then disables each feature and estimates the classification performance of this neural network
with the remaining features. If the decrease of the estimated performance is within an
acceptable level, the algorithm constructs a neural network with the remaining features, and
calculates the actual classification performance. If the actual performance is also acceptable,
the feature is removed, and the algorithm continues searching for more features to be
removed. Otherwise, it keeps this feature and continues to test other features according to their estimated contribution to the classification.
There are two ways to train and validate the neural network. Suppose there are n samples.
The first one is to separate these samples into two sets: training set and validating set. If n is
too small to produce sufficiently large training and validating sets, the leave-one-out
technique can be used to perform training and validation n times, and obtain the average validation accuracy.
Neural network feature selector method requires the neural network training to force the
weights associated with an irrelevant input neuron to have small magnitude, in order to
reduce the effect of the corresponding feature’s removal on the classification performance.
This is implemented using weight decay on the weights between the input and the hidden
layer when applying the back-propagation training. After every element of the weights has been updated according to

w^{(1)\prime}_{i,j} = w^{(1)}_{i,j} + \Delta w^{(1)}_{i,j},        Eq. 3.3.3.1

the weights are decayed as

w^{(1)\prime\prime}_{i,j} = \left(1 - \frac{\varepsilon_j \eta}{1 + \sum_i \left(w^{(1)\prime}_{i,j}\right)^2}\right) w^{(1)\prime}_{i,j},        Eq. 3.3.3.2

where η is the learning rate parameter, and ε_j is a penalty term associated with the j-th
input neuron. Eq. 3.3.3.2 is similar to the weight decay method in (Hertz et al., 1991), but the decay here is governed by the summed square of all the weights connected to one input neuron rather than by each weight individually.
In order to improve the convergence, we also tried the cross entropy error function instead of
Eq. 3.3.1.11,
F(w^{(1)}, w^{(2)}) = -\sum_{p=1}^{n} \sum_{k=1}^{k_2} \left[ d^p_k \log y^{(2),p}_k + (1 - d^p_k) \log (1 - y^{(2),p}_k) \right],        Eq. 3.3.3.3

where d^p_k and y^{(2),p}_k are the desired and actual outputs of the k-th output neuron for the p-th training sample. Note that this error function sums over all n training samples, which is different
from the instantaneous error used in Eq. 3.3.1.11. The derivation of back-propagation with this error function proceeds as follows. We have

\frac{\partial F}{\partial w} = \sum_{p=1}^{n} \sum_{k=1}^{k_2} \frac{\partial F}{\partial y^{(2),p}_k} \frac{\partial y^{(2),p}_k}{\partial w},        Eq. 3.3.3.4

where

\frac{\partial F}{\partial y^{(2),p}_k} = -\left( \frac{d^p_k}{y^{(2),p}_k} - \frac{1 - d^p_k}{1 - y^{(2),p}_k} \right).        Eq. 3.3.3.5

Since

y^{(2),p}_k = \varphi^{(2)}\left( \sum_{j=1}^{k_1} \varphi^{(1)}\left( \sum_{i=1}^{k_0} x^p_i w^{(1)}_{ji} \right) w^{(2)}_{kj} \right),        Eq. 3.3.3.6

we have

\frac{\partial y^{(2),p}_k}{\partial w^{(2)}_{kj}} = \varphi^{(2)\prime}(v^{(2),p}_k) \cdot \varphi^{(1)}(v^{(1),p}_j) = \varphi^{(2)\prime}(v^{(2),p}_k) \cdot y^{(1),p}_j,        Eq. 3.3.3.7

and

\frac{\partial y^{(2),p}_k}{\partial w^{(1)}_{ji}} = \varphi^{(2)\prime}(v^{(2),p}_k) \cdot w^{(2)}_{kj} \cdot \varphi^{(1)\prime}(v^{(1),p}_j) \cdot x^p_i.        Eq. 3.3.3.8

Again according to the delta rule (Haykin, 1999) and the gradient descent rule (Hertz et al.,
1991), we obtain

\Delta w^{(2)}_{kj} = -\eta \frac{\partial F}{\partial w^{(2)}_{kj}} = \eta \sum_{p=1}^{n} \left( \frac{d^p_k}{y^{(2),p}_k} - \frac{1 - d^p_k}{1 - y^{(2),p}_k} \right) \varphi^{(2)\prime}(v^{(2),p}_k) \, y^{(1),p}_j        Eq. 3.3.3.9

and

\Delta w^{(1)}_{ji} = -\eta \frac{\partial F}{\partial w^{(1)}_{ji}} = \eta \sum_{p=1}^{n} \sum_{k=1}^{k_2} \left( \frac{d^p_k}{y^{(2),p}_k} - \frac{1 - d^p_k}{1 - y^{(2),p}_k} \right) \varphi^{(2)\prime}(v^{(2),p}_k) \, w^{(2)}_{kj} \, \varphi^{(1)\prime}(v^{(1),p}_j) \, x^p_i.        Eq. 3.3.3.10

For efficiency, it is convenient to rewrite the update of the weights in matrix form. Let F′ be an n × k_2 matrix with elements

f_{ij} = \frac{d_{ij}}{y^{(2),i}_j} - \frac{1 - d_{ij}}{1 - y^{(2),i}_j},

which have values opposite to those of \partial F / \partial y^{(2),p}_k. Let X be the n × k_0 matrix whose rows are the input vectors, and let V^{(1)}, Y^{(1)} and V^{(2)} be the matrices whose rows are the corresponding hidden-layer local fields, hidden-layer outputs and output-layer local fields. Then

\Delta W^{(2)} = \eta \left( F' \otimes \varphi^{(2)\prime}(V^{(2)}) \right)^T Y^{(1)}        Eq. 3.3.3.11

and

\Delta W^{(1)} = \eta \left( \left( \left( F' \otimes \varphi^{(2)\prime}(V^{(2)}) \right) W^{(2)} \right) \otimes \varphi^{(1)\prime}(V^{(1)}) \right)^T X,        Eq. 3.3.3.12

where ⊗ again denotes element-wise multiplication.
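With sigmoid activations the factor F′ ⊗ φ^(2)′(V^(2)) in Eq. 3.3.3.11 collapses to D − Y^(2), since φ′(v) = y(1 − y); the sketch below computes the batch updates this way and checks one entry against a numerical derivative of the cross entropy. The network sizes and data are arbitrary, and biases are omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)
n, k0, k1, k2 = 8, 5, 4, 2
X = rng.normal(size=(n, k0))
D = rng.integers(0, 2, size=(n, k2)).astype(float)
W1 = rng.normal(scale=0.3, size=(k1, k0))      # no bias terms, for brevity
W2 = rng.normal(scale=0.3, size=(k2, k1))
sig = lambda v: 1.0 / (1.0 + np.exp(-v))

def forward(W1, W2):
    V1 = X @ W1.T; Y1 = sig(V1)
    V2 = Y1 @ W2.T; Y2 = sig(V2)
    return Y1, Y2

def cross_entropy(W1, W2):
    _, Y2 = forward(W1, W2)
    return -np.sum(D * np.log(Y2) + (1 - D) * np.log(1 - Y2))   # Eq. 3.3.3.3

eta = 0.1
Y1, Y2 = forward(W1, W2)
G = D - Y2                                     # F' ⊗ φ(2)'(V2) for sigmoid outputs
dW2 = eta * G.T @ Y1                           # Eq. 3.3.3.11
dW1 = eta * ((G @ W2) * Y1 * (1 - Y1)).T @ X   # Eq. 3.3.3.12

# Numerical check of one weight's update (update = -eta * dF/dw).
h = 1e-6
W2p = W2.copy(); W2p[0, 0] += h
numeric = -(cross_entropy(W1, W2p) - cross_entropy(W1, W2)) / h * eta
print("analytic:", round(float(dW2[0, 0]), 6), "numeric:", round(float(numeric), 6))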
The feature selection function accepts six parameters: a set of m feature vectors {f_1, ..., f_m}, a range of penalty parameters
(ε_min, ε_max), a class label vector y, an allowable maximum decrease in validation accuracy ∆r″, a ranking factor rfactor, and a penalty scaling factor f.
At lines 2 to 5, the function copies the feature number, feature set and penalty parameter set into
three variables m′, F and E, and initializes the maximum validating accuracy r″_max to a
small positive value ε . It then enters an iterative process, at lines 6 to 40, to remove features
one by one until no more features in F can be removed with sufficiently high performance. In
each iteration, a new neural network N is first created by calling the
function initialize(). Then at line 8, N is trained and validated with the features in F and the
penalty parameters in E by calling train_validate(), which returns the trained network N, the
training accuracy r′ and the validating accuracy r″. The algorithm then updates r″_max.
At lines 11 to 15, m′ feature sets F_{1,...,m′} and penalty parameter sets E_{1,...,m′} are constructed,
each of which has one feature omitted; they are tested and validated with the corresponding
neural network by calling simulate_validate() at line 14, and the corresponding estimated training
accuracies r′_{1,...,m′} and validating accuracies r″_{1,...,m′} are returned. At line 16, r′_{1,...,m′} and r″_{1,...,m′} are
sorted according to their linear combination in descending order. The higher the linear
combination rfactor · r′_i + r″_i is, the more likely the corresponding i-th feature is to be eliminated.
The factor rfactor is usually set to be bigger than one. When the differences among r′_{1,...,m′} are
large, r′_{1,...,m′} will be the main contributor to the ranking. However, when the differences among
r′_{1,...,m′} are small, which often occurs for small training sets, the ranking is mainly affected
by r″_{1,...,m′}.
At lines 17 to 25, the candidate feature sets, taken in the sorted order, whose estimated training and validation accuracy rates r′_si and r″_si are bigger than the
thresholds (r′_min, r″_min) are retrained to get the actual training accuracy r′_si and validating accuracy
r″_si. If the validating accuracy is sufficiently high compared with the maximum validating
accuracy, as indicated in the conditions at lines 23 and 26, then the penalty parameters are
updated at lines 27 to 33. At lines 35 to 38, the selected feature f_si and penalty parameter ε_si
are removed from F and E, the feature number m′ is decreased, and r″_max is updated. At lines
29 to 33, the penalty parameters are updated as follows: for any r′_j, if it is bigger than the
average value r′, which means that the corresponding feature is likely to be removed, the
penalty parameter ε_j is multiplied by the factor f; otherwise, it is divided by f. The
parameter's lower and upper thresholds [ε_min, ε_max] are used to prevent it from becoming too large or too
small. After the feature removal process at lines 6 to 40, the selected feature set in F is
returned at line 41.
As mentioned earlier, neural network feature selector method is based on the wrapper model.
But its feature generation and cross validation part are not so distinct. In general, lines 7 to 16
and lines 26 to 39 relate to feature generation; lines 17 to 25 validate the feature set; and line
21      N_si = initialize(m' − 1)
22      (N_si, r'_si, r''_si) = train_validate(N_si, F_si, E_si, y)
23      δ = (r''_max − r''_si) / r''_max
24    end
25  until i ≥ m' or δ ≤ ∆r''
26  if δ ≤ ∆r'' then
27    r̄' = (1/m') ∑_{i=1}^{m'} r'_i
28    for j = 1 to m'
29      if r'_j ≥ r̄' and ε_j ∈ [ε_min, ε_max] then
30        ε_j = f · ε_j
31      else
32        ε_j = ε_j / f
33      end
34    end
35    F = F − {f_si}
36    E = E − {ε_si}
37    m' = m' − 1
38    r''_max = max(r''_max, r''_si)
39  end
40 until i ≥ m'
41 return F
42 end
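A compact Python sketch of the elimination loop may make the control flow clearer. It is a simplified illustration rather than the thesis implementation: train_validate and simulate_validate are assumed helpers that return a (training accuracy, validating accuracy) pair for a given feature subset, and the penalty-parameter updates of lines 26 to 39 are omitted.

def nn_feature_selector(features, y, train_validate, simulate_validate,
                        r_factor=10.0, drop_tol=0.01):
    # Greedy backward elimination in the spirit of the algorithm above.
    F = list(features)
    r_max = 1e-6                                  # best validating accuracy so far
    removed = True
    while removed and len(F) > 1:
        removed = False
        _, r_val = train_validate(F, y)
        r_max = max(r_max, r_val)
        # rank candidate sets, each with one feature left out (lines 11 to 16)
        scored = []
        for i in range(len(F)):
            r_tr_i, r_val_i = simulate_validate(F[:i] + F[i + 1:], y)
            scored.append((r_factor * r_tr_i + r_val_i, i))
        scored.sort(reverse=True)
        # retrain the candidates in order until one keeps validation accuracy high
        for _, i in scored:
            _, r_val_i = train_validate(F[:i] + F[i + 1:], y)
            if (r_max - r_val_i) / r_max <= drop_tol:
                del F[i]                          # remove the feature (lines 35 to 38)
                r_max = max(r_max, r_val_i)
                removed = True
                break
    return F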
The main modifications of the neural network feature selector in this thesis compared with the one proposed by Setiono and Liu (1997) are summarized below.
1) Use of back propagation training: Setiono and Liu employed BFGS (Broyden–Fletcher–Goldfarb–Shanno), a quasi-Newton method that has been shown to be very effective. However, in the training process BFGS computes a Hessian matrix with dimension equal to the square of the number of features. The size of the matrix is huge when the algorithm is applied to microarray datasets consisting of a large number of features. So we replace the BFGS algorithm with back propagation.
2) Use of leave-one-out cross validation: The algorithm proposed by Setiono and Liu uses a fixed partition of training and cross-validation samples as initial input. In a typical microarray dataset, where there is only a very limited number of samples, a fixed training and cross-validation sample set may introduce a large bias in estimating the generalization performance of the trained neural networks. Instead of simply splitting the sample set into two partitions, the neural network feature selector proposed in this thesis employs the leave-one-out method to reduce the bias in estimating the generalization performance.
3) Use of a different penalty function: Penalty functions are employed in both versions of the neural network feature selector to force small weights between the input layer and the hidden layer to zero, in order to reduce the effect of irrelevant features on the classification.
The penalty function used by Setiono and Liu is

P(w) = ε_1 \left( \sum_i \sum_j \frac{β (w_{i,j}^{(1)})^2}{1 + β (w_{i,j}^{(1)})^2} \right) + ε_2 \left( \sum_i \sum_j (w_{i,j}^{(1)})^2 \right),   Eq. 3.3.3.13

where ε_1, ε_2 and β are parameters that decide the detailed penalty effect. From Eq. 3.3.3.13 it can easily be seen that the derivative of the penalty function with respect to a weight is

\frac{\partial P(w)}{\partial w_{i,j}^{(1)}} = ε_1 \frac{2 β w_{i,j}^{(1)}}{\left( 1 + β (w_{i,j}^{(1)})^2 \right)^2} + 2 ε_2 w_{i,j}^{(1)}.   Eq. 3.3.3.14

As a result, the weight update tries to force individual small weights to zero. By contrast, the penalty function used in this thesis forces all weights associated with an input neuron to zero if the summed square of these weights is small. The new penalty function is implicitly implemented in Eq. 3.3.3.2. It helps reduce the effect of an input neuron on all hidden units simultaneously.
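As an illustration, the penalty of Eq. 3.3.3.13 and its gradient (Eq. 3.3.3.14) can be written in a few lines of numpy; the parameter names below are illustrative.

import numpy as np

def penalty(W1, eps1, eps2, beta):
    # W1: weights between the input and hidden layers
    W1sq = W1 ** 2
    return eps1 * np.sum(beta * W1sq / (1.0 + beta * W1sq)) + eps2 * np.sum(W1sq)

def penalty_gradient(W1, eps1, eps2, beta):
    W1sq = W1 ** 2
    return eps1 * 2.0 * beta * W1 / (1.0 + beta * W1sq) ** 2 + 2.0 * eps2 * W1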
The support vector machine (SVM) is a linear learning model that can perform classification. It was invented by Boser et al. (1992), and its theoretical foundation is Statistical Learning Theory (Vapnik, 1998). Section 3.3.4.1 describes the basic SVM learning model and its training; Section 3.3.4.2 illustrates the linearly non-separable case; Section 3.3.4.3 describes SVM with nonlinear mapping; Section 3.3.4.4 introduces a feature selection method based on the trained SVM.
From a decision-making point of view, a linear classifier tries to obtain decision surfaces that can discriminate samples of different classes. These decision surfaces are hyperplanes. Suppose there are n training samples that belong to two classes, {(x_1, y_1), ..., (x_n, y_n)}, where y_i ∈ {+1, −1}. If there exist a weight vector w and a bias b such that

w^T x_i + b ≥ 0   for y_i = +1,
w^T x_i + b < 0   for y_i = −1   Eq. 3.3.4.1
hold, then these samples are linearly separable. The distance between the separating hyperplane and its closest sample vector is called the margin of separation, denoted as ρ. A Support Vector Machine's task is to find an optimal separating hyperplane w*^T x + b* = 0 that has the biggest margin of separation among all separating hyperplanes. Obviously, the distances between this hyperplane and its nearest sample vectors on both sides are equal, so by appropriately scaling w* and b* we can write

w*^T x_i + b* ≥ +1   for y_i = +1,
w*^T x_i + b* ≤ −1   for y_i = −1.   Eq. 3.3.4.2

The sample vectors that satisfy

y_i (w*^T x_i + b*) = 1   Eq. 3.3.4.3

are called support vectors. Let us denote a pair of support vectors on the two sides of the hyperplane by x^+ and x^−, so that

w*^T x^+ + b* = +1,
w*^T x^− + b* = −1.   Eq. 3.3.4.4

We then have

ρ = \frac{1}{2} \left( \frac{w*^T x^+}{‖w*‖} − \frac{w*^T x^−}{‖w*‖} \right) = \frac{1}{2‖w*‖} \left( w*^T x^+ − w*^T x^− \right) = \frac{1}{‖w*‖},   Eq. 3.3.4.5

where ‖w*‖ = \sqrt{w*^T w*}. Maximizing the margin of separation is therefore equivalent to the constrained optimization problem

minimize   \frac{1}{2} ‖w‖^2
subject to y_i (w^T x_i + b) ≥ 1,  i = 1, ..., n.   Eq. 3.3.4.6
This constrained optimization problem can be solved using the method of Lagrange multipliers. The Lagrangian function is constructed as

J(w, b, a) = \frac{1}{2} ‖w‖^2 − \sum_{i=1}^{n} α_i \left[ y_i (w^T x_i + b) − 1 \right],   Eq. 3.3.4.7

where the nonnegative variables a = [α_1, ..., α_n]^T are called Lagrange multipliers. The optimality conditions

\frac{\partial J(w, b, a)}{\partial w} = 0,
\frac{\partial J(w, b, a)}{\partial b} = 0   Eq. 3.3.4.8

give

w = \sum_{i=1}^{n} α_i y_i x_i,
\sum_{i=1}^{n} α_i y_i = 0.   Eq. 3.3.4.9

Substituting these conditions into the Lagrangian, we obtain

J(w, b, a) = \sum_{i=1}^{n} α_i − \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j α_i α_j x_i^T x_j.   Eq. 3.3.4.10

The problem in Eq. 3.3.4.6 can then be transformed into the quadratic optimization problem

maximize   W(a) = \sum_{i=1}^{n} α_i − \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j α_i α_j x_i^T x_j
subject to \sum_{i=1}^{n} y_i α_i = 0,  α_i ≥ 0,  i = 1, ..., n.   Eq. 3.3.4.11

This quadratic optimization problem has a unique solution. Suppose the solution of the problem is a* = [α*_1, ..., α*_n]^T; according to Eq. 3.3.4.9, we have

w* = \sum_{i=1}^{n} α*_i y_i x_i.   Eq. 3.3.4.12

The bias can then be obtained from a pair of support vectors x^+ and x^− as

b* = 1 − w*^T x^+ = −1 − w*^T x^−.   Eq. 3.3.4.13

By the Karush–Kuhn–Tucker complementarity condition, α*_i can be non-zero only for samples that satisfy Eq. 3.3.4.3. So the sample vectors with positive Lagrange multipliers are support vectors.
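As a short numpy sketch, and under the assumption that the Lagrange multipliers have already been obtained from some quadratic programming solver, Eqs. 3.3.4.12 and 3.3.4.13 recover the separating hyperplane as follows; the helper names are illustrative.

import numpy as np

def hyperplane_from_dual(X, y, alpha, tol=1e-8):
    # X: n x m samples, y: labels in {-1, +1}, alpha: n Lagrange multipliers
    w = (alpha * y) @ X                      # Eq. 3.3.4.12
    sv = np.where(alpha > tol)[0]            # support vectors have positive alpha
    i = sv[0]                                # any support vector will do
    b = y[i] - w @ X[i]                      # equivalent to Eq. 3.3.4.13
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)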
In the linearly non-separable case, there is no hyperplane that can separate all samples. In this case, non-negative slack variables ξ_i are introduced to allow some samples to violate the margin, and the problem of finding the optimal hyperplane is redefined as

minimize   w^T w + C \sum_{i=1}^{n} ξ_i
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n,   Eq. 3.3.4.16

where the parameter C balances the generalization ability represented by the first term and the separation ability indicated by the second term. The problem in Eq. 3.3.4.16 can be converted to a dual problem similar to that of the separable case, in which the slack variables are omitted:
maximize   W(a) = \sum_{i=1}^{n} α_i − \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j α_i α_j x_i^T x_j
subject to \sum_{i=1}^{n} y_i α_i = 0,  0 ≤ α_i ≤ C,  i = 1, ..., n.   Eq. 3.3.4.17

The optimal weight vector is again a combination of the support vectors,

w* = \sum_{i=1}^{n_s} α_{s_i} y_{s_i} x_{s_i},   Eq. 3.3.4.18

where n_s is the number of support vectors. The Karush–Kuhn–Tucker complementarity conditions of the problem are defined as

α_i \left[ y_i (w^T x_i + b) − 1 + ξ_i \right] = 0,
ξ_i (α_i − C) = 0,   i = 1, ..., n.   Eq. 3.3.4.19
According to this condition, all sample vectors with positive Lagrange multipliers are support vectors, and the slack variable is non-zero only when its corresponding Lagrange multiplier equals C. The value of b* can be determined by choosing any support vector x_i whose Lagrange multiplier satisfies 0 < α_i < C and computing

b* = 1 − w*^T x_i   if y_i = +1   Eq. 3.3.4.20

or

b* = −1 − w*^T x_i   if y_i = −1.   Eq. 3.3.4.21
When the classification task has more than two class labels, there are two ways to transform it into binary classification problems. One way is to encode the class labels using a binary representation. Suppose there are l class labels in the task; then log_2(l) support vector machines are needed to perform the classification task together. If a sample has the k-th class label, the target output of each machine is the corresponding bit of the binary representation of k. Another way is the one-against-others method. In this method, l support vector machines are trained, one for each class. For a sample with the k-th class label, the target output vector is y = [y_1, ..., y_l]^T, where y_i = +1 if i = k and y_i = −1 otherwise, i = 1, ..., l.
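A small sketch of the two target encodings, assuming integer class labels 0, ..., l−1; the function names are illustrative.

import numpy as np

def one_against_others_targets(labels, l):
    # l target vectors, one per class k: +1 for samples of class k, -1 otherwise
    labels = np.asarray(labels)
    return [np.where(labels == k, 1, -1) for k in range(l)]

def binary_code_targets(labels, l):
    # about log2(l) target vectors; machine i learns bit i of the class index
    labels = np.asarray(labels)
    n_bits = int(np.ceil(np.log2(l)))
    return [np.where(((labels >> i) & 1) == 1, 1, -1) for i in range(n_bits)]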
The support vector machine is basically a linear model. It can be extended to handle non-linear problems: non-linear mapping functions for the transformation of the input vectors can also be employed. This approach makes the learning more flexible. Let t(x) = [t_0(x), t_1(x), ..., t_l(x)]^T be a vector of nonlinear transform functions, where t_0(x) = 1. The optimal hyperplane is then defined as follows

w^T t(x) = 0,   Eq. 3.3.4.22

where the bias term is included implicitly in w. By adapting Eq. 3.3.4.12 we get

w = \sum_{i=1}^{n} α_i y_i t(x_i),   Eq. 3.3.4.23

so that the optimal hyperplane becomes

\sum_{i=1}^{n} α_i y_i t^T(x_i) t(x) = 0.   Eq. 3.3.4.24

Let the inner product kernel K(x_i, x) = t^T(x_i) t(x) be a symmetric function; Eq. 3.3.4.24 then becomes

\sum_{i=1}^{n} α_i y_i K(x_i, x) = 0.   Eq. 3.3.4.25
The dual problem correspondingly becomes

maximize   W(a) = \sum_{i=1}^{n} α_i − \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j α_i α_j K(x_i, x_j)
subject to \sum_{i=1}^{n} y_i α_i = 0,  0 ≤ α_i ≤ C,  i = 1, ..., n.   Eq. 3.3.4.26

The optimal decision hyperplane can be found by solving the problem in Eq. 3.3.4.26 and substituting the solution into Eq. 3.3.4.25. The complexity of the target function to be learned depends on the way it is represented. The kernel approach provides a means to implicitly map input vectors into a feature space, i.e. the kernel can be used without knowing its corresponding transforming function. The introduction of a kernel simplifies the design of a learner, and may improve generalization ability. This approach can be used not only in support vector machines but also in other learning models. One commonly used kernel is the radial basis function kernel

K(x, x_i) = \exp\left( − \frac{‖x − x_i‖^2}{2σ^2} \right).   Eq. 3.3.4.28
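As an illustration, the radial basis function kernel of Eq. 3.3.4.28 and the kernel decision rule of Eq. 3.3.4.25 can be sketched as follows; alpha, b and the support vectors are assumed to come from a trained soft-margin SVM.

import numpy as np

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def kernel_decision(x, X_sv, y_sv, alpha_sv, b=0.0, sigma=1.0):
    # sign of sum_i alpha_i y_i K(x_i, x) + b over the support vectors
    s = sum(a * yi * rbf_kernel(x, xi, sigma)
            for a, yi, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(s + b)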
The weights of a trained SVM can indicate the importance of the corresponding features to the classification. Based on this idea, Guyon et al. (2002) proposed a recursive feature elimination method. After training a linear kernel SVM, its weight vector can be obtained by Eq. 3.3.4.18. The algorithm iteratively trains an SVM and eliminates the feature(s) with the smallest weights, until the feature set becomes empty. Figure 3.5 shows the algorithm:

while |S| > 1
    w = svm_training(S, y)
    f = the feature in S with the smallest |w_f|
    S = S − {f}
end
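A minimal Python sketch of this procedure is given below; train_linear_svm is an assumed helper that returns the weight vector of a linear-kernel SVM trained on the given columns of the expression matrix.

import numpy as np

def rfe(X, y, train_linear_svm):
    # X: n x m expression matrix, y: labels in {-1, +1}
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > 1:
        w = train_linear_svm(X[:, remaining], y)
        worst = int(np.argmin(np.abs(w)))        # feature with the smallest |weight|
        eliminated.append(remaining.pop(worst))
    # features eliminated last (and the survivor) are the most relevant
    return eliminated + remaining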
Keller et al. (2000) used a simple classification method based on the naïve Bayes rule. Given an unlabeled sample x and two class models M_a and M_b, the class is predicted as follows

class(x) = \arg\max_{i} P(M_i | x),   i = a, b,   Eq. 3.3.5.1

where P(M_i | x) is the a posteriori probability that M_i is true given x. Applying the Bayes rule with equal prior probabilities, and assuming that the features are independent and normally distributed within each class, we get

class(x) = \arg\max_{i} \sum_{g=1}^{m} \log P(x_g | M_i) = \arg\max_{i} \sum_{g=1}^{m} \left[ − \log δ_{ig} − \frac{(x_g − μ_{ig})^2}{2 (δ_{ig})^2} \right],   i = a, b,   Eq. 3.3.5.2

where μ_{ig} and δ_{ig} are the mean and standard deviation of the feature values of the training samples of class i. The prediction is more reliable when the difference between log P(x | M_a) and log P(x | M_b) is bigger. In order to obtain this information regarding the confidence of the classification, we compute the difference explicitly:

\log P(x | M_a) − \log P(x | M_b) = \sum_{g=1}^{m} \left[ − \log δ_{ag} − \frac{(x_g − μ_{ag})^2}{2 (δ_{ag})^2} \right] − \sum_{g=1}^{m} \left[ − \log δ_{bg} − \frac{(x_g − μ_{bg})^2}{2 (δ_{bg})^2} \right].   Eq. 3.3.5.3

A positive difference means that the sample is predicted to be of class a, and a negative difference means that the sample is predicted to be of class b. The larger the difference, the more confident we are about the classification. We also make use of this difference when computing another measure of accuracy, i.e. the acceptance rate, which will be discussed in Section 4.1.6.1.
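A few lines of numpy illustrate the scoring of Eqs. 3.3.5.2 and 3.3.5.3, fitting a per-gene Gaussian model for each class; the function names are illustrative.

import numpy as np

def fit_class_model(X_class):
    # per-gene mean and standard deviation for one class (rows are samples)
    return X_class.mean(axis=0), X_class.std(axis=0, ddof=1)

def log_likelihood(x, mu, sd):
    return np.sum(-np.log(sd) - (x - mu) ** 2 / (2.0 * sd ** 2))

def likelihood_difference(x, model_a, model_b):
    # positive -> class a, negative -> class b; magnitude reflects confidence
    return log_likelihood(x, *model_a) - log_likelihood(x, *model_b)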
We found that a number of multivariate feature ranking methods can be placed in a unified framework. These methods attempt to find a vector w such that the projection of the samples onto it maximizes or minimizes a certain objective function f(w). The function f(w) is originally used on individual features to measure the discrimination ability or diversity of a feature. The magnitude of the elements in the vector then indicates the relative importance of the features. When f(w) is the extremal margin, the method is equivalent to RFE. When f(w) is set to Fisher's criterion (see Section 3.3.6.1), the method becomes Fisher's linear discrimination method. When the function f(w) is substituted by Eq. 3.3.3.3, the method resembles the neural network feature selector in the sense that the optimized weights between the input and hidden layers indicate the relative importance of the input neurons. When f(w) is the standard deviation of the projection of all samples, the method becomes PCA. The following subsections describe two methods that consider the joint contribution of features in this way.
Let G = {g_1, ..., g_m} be a set of features. By performing the linear transforms y_{a,i} = \sum_{g∈G} w_g x^g_{a,i} and y_{b,i} = \sum_{g∈G} w_g x^g_{b,i}, that is, projecting all samples from the m-dimensional space onto a unit vector w, we can write Fisher's criterion as

F(w) = \frac{(μ'_a − μ'_b)^2}{δ'^2_a + δ'^2_b} = \frac{(w^T u_a − w^T u_b)^2}{w^T (n_a Σ_a + n_b Σ_b) w},   Eq. 3.3.6.1

where μ' and δ' are the means and standard deviations of the two classes of projections respectively, u_a and u_b are the means of the original sample vectors of the two classes respectively, and Σ_a and Σ_b are the covariance matrices of the samples from the two classes respectively. Fisher's linear discriminant tries to find the weight w that maximizes F(w), that is,

maximize   F(w)
s.t.   w^T w = 1.   Eq. 3.3.6.2

The solution of this problem is

w* = (n_a Σ_a + n_b Σ_b)^{−1} (u_a − u_b).   Eq. 3.3.6.3
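A direct numpy transcription of Eq. 3.3.6.3 is given below; the pseudo-inverse is used because, with many genes and few samples, the pooled covariance matrix can be singular.

import numpy as np

def fisher_direction(Xa, Xb):
    # Xa, Xb: sample matrices (rows are samples) of the two classes
    na, nb = len(Xa), len(Xb)
    ua, ub = Xa.mean(axis=0), Xb.mean(axis=0)
    Sa = np.cov(Xa, rowvar=False)
    Sb = np.cov(Xb, rowvar=False)
    w = np.linalg.pinv(na * Sa + nb * Sb) @ (ua - ub)   # Eq. 3.3.6.3
    return w / np.linalg.norm(w)                        # unit-length projection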
We propose a multivariate likelihood feature selection method that is based on a similar idea and considers the joint contribution of a set of features. Recall Keller's likelihood method for ranking an individual gene g, which is expressed in Eq. 3.2.2.6 and Eq. 3.2.2.7 in Section 3.2.2. Let G = {g_1, ..., g_m} be a set of features. By projecting all samples from the m-dimensional space onto a unit vector w, as above, we can obtain the projected likelihood scores

LIK_{a→b} = \sum_{i=1}^{n_a} \left[ − \log(δ_a) − \frac{(y_{a,i} − μ_a)^2}{2(δ_a)^2} + \log(δ_b) + \frac{(y_{a,i} − μ_b)^2}{2(δ_b)^2} \right]   Eq. 3.3.6.4

and

LIK_{b→a} = \sum_{i=1}^{n_b} \left[ − \log(δ_b) − \frac{(y_{b,i} − μ_b)^2}{2(δ_b)^2} + \log(δ_a) + \frac{(y_{b,i} − μ_a)^2}{2(δ_a)^2} \right],   Eq. 3.3.6.5

where μ and δ are the means and standard deviations of the two classes of projections, respectively. The weight vector is then found by solving

maximize   f(w)
s.t.   w^T w = 1,   Eq. 3.3.6.6

where f(w) = LIK_{a→b}, f(w) = LIK_{b→a} or f(w) = LIK_{a→b} + LIK_{b→a}. When the maximization problem is solved, the magnitudes of the elements of the resulting w indicate the relative importance of the features.
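The objective of Eq. 3.3.6.6 for a candidate direction w can be evaluated with the short numpy sketch below; a generic numerical optimizer would then search over unit vectors w for the maximum.

import numpy as np

def projected_lik(w, Xa, Xb):
    # Xa, Xb: sample matrices of the two classes, restricted to the feature set G
    w = w / np.linalg.norm(w)
    ya, yb = Xa @ w, Xb @ w                    # projections of the two classes
    mu_a, sd_a = ya.mean(), ya.std(ddof=1)
    mu_b, sd_b = yb.mean(), yb.std(ddof=1)
    lik_ab = np.sum(-np.log(sd_a) - (ya - mu_a) ** 2 / (2 * sd_a ** 2)
                    + np.log(sd_b) + (ya - mu_b) ** 2 / (2 * sd_b ** 2))
    lik_ba = np.sum(-np.log(sd_b) - (yb - mu_b) ** 2 / (2 * sd_b ** 2)
                    + np.log(sd_a) + (yb - mu_a) ** 2 / (2 * sd_a ** 2))
    return lik_ab, lik_ba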
Information gain and the likelihood method are univariate feature ranking methods in the sense that they assume genes contribute to the classification independently, and they rank the genes individually. But in the real world, genes often work together to perform a certain function, and this combinatorial effect is not considered by univariate selection methods. On the other hand, the neural network feature selector, recursive feature elimination and the multivariate likelihood method consider the whole contribution of a subset of features to the classification. These three approaches have the potential to select smaller subsets of features with higher classification performance. However, the selection process may be obscured when applied to microarray datasets with high dimensionality and a large number of irrelevant features. Take RFE as an example: the presence of a large number of irrelevant features hides the discriminative information from the relevant features. This can be seen in the formulation of the SVM dual problem, where the coefficients of the quadratic terms are the inner products

x_i^T x_j = \sum_{k=1}^{m} x_i^k x_j^k.   Eq. 3.3.7.1
The gene elimination process is therefore very sensitive to changes in the feature set. SVM also has the disadvantage of being sensitive to outliers in the training data. In microarray data, the outliers may be introduced by: 1) noise in the expression data, or 2) incorrectly identified or labeled samples in the training dataset. It is therefore more beneficial to apply RFE to a dataset with a reduced number of features. A univariate feature selection algorithm can be used to first efficiently reduce the large number of features originally present in the dataset, and a multivariate feature selection method such as RFE can then be applied to remove more features. To summarize, we first identify and remove genes that are expected to have low discrimination ability as indicated by their LIK scores. Then, we apply RFE to reduce the size of the feature set further. With this integrated approach to feature selection, we are able to achieve good classification performance with fewer genes than those reported by other researchers. The hybrid approach also saves computation: a typical microarray dataset contains thousands of genes, and for RFE, in order to eliminate one or more genes, a new SVM has to be trained, so the overall computational cost is Ω(m²n²). On the other hand, LIK ranks genes independently, which makes its computational complexity O(mn). Using LIK first to reject a large number of genes, and then using RFE to perform further selection, saves significant running time compared to using RFE alone. This is especially important as improvements in microarray technology make it possible to obtain gene expression values for ever larger numbers of genes.
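Putting the two stages together, a high-level sketch of the hybrid selection pipeline might look as follows; lik_score, rfe and train_linear_svm are assumed helpers (for example along the lines of the earlier sketches), and the top parameter is illustrative.

import numpy as np

def hybrid_lik_rfe(X, y, lik_score, rfe, train_linear_svm, top=20):
    # X: n x m expression matrix, y: labels in {-1, +1}
    # univariate pass: keep the top genes by LIK in each direction (a->b and b->a)
    lik_ab, lik_ba = lik_score(X, y)              # per-gene scores, length m
    keep = np.union1d(np.argsort(lik_ab)[-top:], np.argsort(lik_ba)[-top:])
    # multivariate pass: recursive feature elimination on the reduced gene set
    order = rfe(X[:, keep], y, train_linear_svm)  # elimination order within keep
    return keep[np.asarray(order)]                # mapped back to original gene ids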
This chapter describes the experimental results from, and discussions on, applying machine learning methods to global gene expression analysis in our research. We start with experiments on datasets of the second type of classification problem, which consist of a large number of features and a small number of samples. The high-dimensional nature of this kind of problem makes it distinct from common classification problems. We then describe the test results on a newly released dataset, which is of the first type of classification problem.

The main work in this thesis focuses on the second type of classification problem, which involves a small number of samples with a large number of features, and the feature selection problems associated with it. Our strategy is to first try various analysis methods, most of which were described in Chapter 3, on a well known benchmark dataset, the human acute leukemia microarray dataset (Golub et al., 1999), select the one that has the best performance, and then try that method on other datasets of the same type, including the small, round blue cell tumors (SRBCTs) dataset (Khan et al., 2001) and artificial datasets.
The human acute leukemia microarray dataset consists of 72 microarray experiments with
expression values of 7129 clones from 6817 human genes. Here the term clone refers to the
fragment of a gene. Each of the genes has a short description; and each clone is represented
by an accession number. Each microarray is assigned a class label, either Acute Myeloid
Leukemia (AML) or Acute Lymphoblastic Leukemia (ALL), according to the organism used
for the hybridization. A second type of classification problem arises from this dataset. We
used the clone id as feature id. Each of the hybridizations corresponds to a classification
sample. These samples were divided into two sets by Golub et al.: The first sample set, which
consists of 27 ALL samples and 11 AML samples, is for training the classifier. The second
sample set, consisting of 20 ALL samples and 14 AML samples, is for testing the classifier.
Figure 4.1: The combination of feature selection / integration methods and classification
methods.
Due to the time constraint, we only tested a subset of all possible combinations of these methods. The combination of the Likelihood and Recursive Feature Elimination methods achieved the best feature selection performance. With the optimal feature sets selected using this combination, the Bayesian classification method achieved the best classification performance. The combinations that were used are summarized in Table 4.1, and characteristics of some of the combinations are described below. The term homogeneous in the table refers to methods that use the same kind of criterion in single dimension and multi-dimension. The combinations that were not tested are also indicated in the table.

Table 4.1: List of combined univariate and multivariate feature selection methods tested.
Our first effort was to study the relevance of the features to the class labels. The study involves both feature integration and feature selection. For feature integration, we chose principal component analysis to see how much the most informative components extracted from the features contribute to the classification. The experiment was done using the statistics toolbox and the neural network toolbox in MATLAB v6.1. Computation of the principal components from a large number of features consumes a large amount of computer memory; due to the limitation of our computer system, the program was unable to process all 7129 features. We generated 72 components from 4500 randomly chosen features, which is the maximum number of features that can be processed by the program, and then used these components as inputs to train a neural network with different numbers of hidden units on all samples. In the training process, batch mode was used. In each leave-one-out iteration, a test of the classification performance on the 71 training samples was done after every 10 epochs. If the accuracy on the training data was 100%, the training was stopped, and the remaining sample was tested. Table 4.2 shows the performance. The result shows that although the principal components extract the most informative directions, in terms of variance, from the features, this information is hardly helpful for the classification.
We applied the C4.5 algorithm on all 72 samples of the leukemia dataset. The constructed decision tree was surprisingly simple, involving only two genes, and it could correctly classify all 72 samples:

| X54489_rna1_at <= 91 : 1
| X54489_rna1_at > 91 : -1

Using only the 38 training samples, an even simpler tree involving only one feature was generated; this tree correctly classified all 38 training samples and 31 out of 34 testing samples (accuracy: 91.2%).
The decision tree construction algorithm C4.5 also supports constructing trees in an iterative mode. In this mode, the algorithm randomly selects an initial sample subset from the training set to construct a decision tree, then iteratively adds the samples that are misclassified by the tree into the subset and reconstructs the tree, until there is no misclassification among all training samples. We tested C4.5 in iterative mode for 20 trials using all 72 samples. All trials generated a two-layer tree with three nodes. These trees are listed in Table 4.3. For simplicity, we just list the gene accession numbers of the first-layer and second-layer nodes.
Table 4.3: Decision trees constructed by iterative mode from all 72 samples.
We then tried a leave-one-out test to construct decision trees from all samples. A total of 72 trees were constructed, most of which have two layers, but some had only one layer. Table 4.4 summarizes these trees. When applying leave-one-out to the 38 training samples, all the constructed trees had only one layer; they are summarized, together with their prediction accuracy on the 34 test samples, in Table 4.5.
Table 4.4: Decision trees constructed by leave-one-out mode from all 72 samples
M27891_at 1 0.941
M31166_at 1 0.706
M55150_at 1 0.794
Table 4.5: Decision trees constructed by leave-one-out mode from 38 training samples, with prediction accuracy on the 34 test samples.
Table 4.3, Table 4.4 and Table 4.5 show that certain features occur very frequently in these
three experiments. It appears that the selectivity of C4.5 algorithm is high for the dataset. In
Table 4.3 and Table 4.4, feature M84526_at has the highest occurrence frequency, which
implies that this feature is important in deciding the classes when all 72 samples are taken
into account. But when only taking the 38 training samples into account, the algorithm
selected a very different set of features, which is shown in Table 4.5. Only feature X95735_at
appeared two times in leave-one-out mode for all 72 samples (see Table 4.4). The fact that
the trees generated on all training and test samples are very simple, and the trees generated on the training samples alone are even simpler, makes us expect that some feature selection method should be able to generate a very small feature set when presented with a very limited number of training samples. It may also be possible to obtain high classification accuracy on the test samples with certain classifiers constructed using this small feature set.
Rank   Feature        All samples   Training samples   Test samples
1 M84526_at 0.652 0.408 0.689
2 M27891_at 0.652 0.685 0.689
3 D88422_at 0.651 0.578 0.584
4 M23197_at 0.648 0.581 0.684
5 X95735_at 0.647 0.844 0.522
6 U46499_at 0.634 0.565 0.692
7 M31523_at 0.590 0.511 0.851
8 L09209_s_at 0.589 0.562 0.577
9 M83652_s_at 0.550 0.578 0.420
10 M11722_at 0.542 0.332 0.683
22 M31166_at 0.405 0.689 0.182
26 M55150_at 0.398 0.671 0.198
Table 4.6: Leukemia features and their information gain for all samples, training samples and test samples, sorted by the gain of all samples.
We continued to investigate the information gain computed for the features appearing at the first level of the trees. In Table 4.6, the top ten ranked features are listed according to the information gain of all samples. The features listed in Table 4.5 whose ranks are higher than ten are also listed in Table 4.6. The table shows that features with high gain in the training samples are likely to have high gain in all samples and in the test samples. As mentioned before, C4.5 can construct a decision tree that consists of only one feature from the training samples using information gain. The tree is very simple, but it can only classify 31 of the 34 test samples correctly, so the accuracy of classifiers constructed using only the information gain measure for continuous features is not high. We therefore decided to select more features using this method, and to test neural networks based on the selected features to see whether there is any improvement in performance.
We tested neural networks with different configurations and different numbers of features, those with the highest information gain computed from the training samples, as listed in the following tables. The test results are listed in Table 4.7. Two kinds of training methods were used: trainoss (one-step secant algorithm) and traingd (gradient descent backpropagation). The training was done in batch mode. In the training process, a test of the classification error on the training samples was performed repeatedly, after the number of epochs indicated in the test interval column of the table, until the error converged to no greater than the training error tolerance. The reason why we chose different tolerances was to check how well the trained neural network could generalize under a given over-fitting limit. If the number of tests exceeded the number indicated in the number of unsuccessful trials column, then the training was considered unsuccessful and the training result was rejected. Once the training was successfully terminated, we tested the classification accuracy of the trained network on the test samples. For every configuration, corresponding to a row in the table, we collected 100 successful trainings and then calculated the mean and standard deviation of the classification accuracy.
Feature number   Number of hidden units   Training method   Test interval (epochs)   Number of unsuccessful trials   Training error tolerance (samples)   Test accuracy
5                20                       trainoss          20                       20                              0                                    0.880±0.075
50               50                       trainoss          20                       20                              0                                    0.848±0.090

Table 4.7: Prediction performance of neural network using features selected by information gain.
Note that the experiment in the third row of Table 4.7 used the top ten features ranked by the information gain of all samples. The aim of this test was to see how high the accuracy could be when information from the test samples is used. We found that when the training error tolerance was as large as 4, with the top 20 features, the test accuracy could approximate that of the third row. If the training error tolerance was smaller than 4, the test performance was worse.

The tests in Table 4.7 gave us some indications of how to optimize the parameters to obtain better generalization ability. We continued to test, with fine-tuned parameters, whether the test errors were located on a few test samples or were evenly distributed. The method was the same as in the experiments of Table 4.7. We collected 100 successful trainings and counted the number of times each test sample was wrongly predicted. In Table 4.8 the configurations and prediction accuracy are listed, and the wrong prediction frequency is shown in Figure 4.2.

As can be seen from Figure 4.2, it seemed that for individual experiments of certain configurations, the wrong prediction frequency was very high for certain test samples. In addition, most of the trained neural networks were likely to make wrong predictions on the same test samples.
Experiment ID   Number of hidden units   Training method   Test interval (epochs)   Number of unsuccessful trials   Feature number   Training error tolerance (samples)   Mean precision   Standard deviation

Table 4.8: Test of neural network using features selected by information gain to identify the test samples that are frequently wrongly predicted.
Figure 4.2: Number of times of wrong prediction for each test sample (test sample ID 1 to 33) over 100 successful trainings, for experiments 14 to 19.
4.1.4 AdaBoost
AdaBoost has a mechanism to change the sampling distribution to focus on the training samples that have high validation errors, so it may fit the training samples well while keeping good generalization ability. Hence we expect that the AdaBoost framework might further improve the overall classification performance. In each AdaBoost experiment, a number of neural networks of identical size were consecutively trained for 50 epochs each, and those neural networks with training error no higher than the error tolerance were employed for refining the sampling distribution. The training process continued until 50 such neural networks were obtained. The final hypothesis on the test samples could then be obtained by combining the hypotheses of the individual neural networks with the factors β_1, ..., β_T, according to Eq. 3.3.2.1. One of the disadvantages of AdaBoost is that the training process is slow, so we conducted the test no more than twice for each configuration. The test results are listed in Table 4.9. The configurations in tests 1 and 4 were tested only once because their training time is extremely long.

As can be seen from Table 4.9, the test performance of the tests using the top 20 to 50 features is generally better than the results using the same numbers of features in Table 4.7, even when the error tolerance was as low as 2 validation errors. In the AdaBoost tests, there are generally one or two errors (97.1% or 94.1% accuracy) on the test samples. The trained networks are expected to fit the training samples well, and the combined hypothesis can also generalize well.
Based on the test results in the previous sections, we continued to investigate whether it is possible to further reduce the number of relevant genes without losing too much prediction accuracy. The neural network feature selector, being a wrapper feature selection approach, can exploit the relevant information between features and classes in the trained neural network models. Another advantage of the neural network feature selector is that it can decide the optimum number of selected features. Experiments were carried out using the neural network feature selector, which was implemented in MATLAB. In the experiments, a set of features with the highest information gain from the training samples was used as the initial feature set whose size was to be reduced. Because the number of training samples is very small, we used the leave-one-out approach to obtain the average training accuracy and validation accuracy when calling the functions train_validate() and simulate_validate(). Each neural network was trained for a maximum of 200 epochs in the function train_validate(). The factor r_factor was set to 10. After the selection process, 100 repetitions of training and testing were done with the selected features, and the means and standard deviations of the training and testing accuracy rates were calculated. The results are summarized in Table 4.10 and Table 4.11, for the summed square and cross entropy error functions respectively. In these two tables, the column (ε_min, ε_max) contains the thresholds of the penalty parameter ε. Because we increased or decreased the penalty parameter by a factor of 1.1, the values reflect the minimum and maximum number of times that the penalty parameter may increase or decrease cumulatively.
Experiment   Feature set size   r'_min   r''_min   ∆r''   (ε_min, ε_max)   Number of features selected   Accuracy on training samples   Accuracy on test samples
1            200                0.9      0.9       0.05   1.1^±30          6                             1.00±0.00                      0.67±0.01
2            50                 0.9      0.9       0.01   1.1^±20          3                             1.00±0.00                      0.79±0.00
3            50                 0.95     0.95      0.01   1.1^±20          4                             1.00±0.00                      0.88±0.03
4            50                 0.97     0.97      0.01   1.1^±20          4                             1.00±0.00                      0.88±0.03
5            200                0.95     0.9       0.03   1.1^±20          6                             0.98±0.01                      0.71±0.00
Table 4.10: Experiment result of neural network feature selector with summed square error
function.
Experiment   Feature set size   r'_min   r''_min   ∆r''   (ε_min, ε_max)   Number of features selected   Accuracy on training samples   Accuracy on test samples
1            200                0.9      0.9       0.05   1.1^±30          4                             1.00±0.00                      0.71±0.02
2            50                 0.9      0.9       0.01   1.1^±20          4                             1.00±0.00                      0.72±0.03
3            50                 0.95     0.95      0.01   1.1^±20          6                             1.00±0.00                      0.68±0.04
4            50                 0.97     0.97      0.01   1.1^±20          36                            1.00±0.00                      0.80±0.08
5            200                0.95     0.9       0.03   1.1^±20          191                           1.00±0.01                      0.79±0.09
Table 4.11: Experiment result of neural network feature selector with cross entropy error
function.
From Table 4.10, we can see that under certain settings the neural network feature selector
can select a small set of four out of 50 genes and the prediction accuracy on the test samples
could be as high as 88%. In comparison, the experiments in Table 4.11 show that the
selection performances are generally worse in terms of the number of the features selected
and the prediction accuracy. We noticed that back-propagation training of the neural networks was much faster when using the cross entropy error function than when using the summed square error function.
We tried the combination of the likelihood and recursive feature elimination methods. Very good feature selection performance was found using this hybrid method on the leukemia dataset, which encouraged us to test the method systematically, comparing it with feature selection using LIK and RFE alone. We also computed the acceptance rate, a confidence-weighted performance measure defined as follows.
Suppose there are n samples with predicted values o_1, ..., o_n, and their corresponding class labels are y_1, ..., y_n. Each of the class labels takes the value of either +1 or −1, and the predicted values are real numbers. If the prediction output of a classifier for a sample has the same sign as its true class, we consider this sample to be correctly classified. The usual performance measure of accuracy, i.e. the number of correctly classified samples over the total number of samples, is

accuracy = \frac{\left| \{ i : o_i y_i > 0,\ i = 1, ..., n \} \right|}{n},   Eq. 4.1.6.1

where |S| denotes the cardinality of the set S. In contrast, the acceptance rate is computed as follows:

acceptance rate = \frac{\left| \{ i : o_i y_i > − \min_{j=1,...,n}(o_j y_j),\ i = 1, ..., n \} \right|}{n}.   Eq. 4.1.6.2

The strength of the correct prediction of a sample can be obtained by multiplying its output and class label together, o_i y_i; the larger the value of the product, the better the prediction made by the classifier. When the value of the product is negative, the classifier makes a wrong prediction for the sample. To calculate the acceptance rate, we first select the worst prediction out of all test samples. The worst prediction corresponds to the sample with the smallest product o_j y_j, and the negative of this smallest product is used as a threshold. All the predictions that have an o_j y_j value bigger than this threshold are considered as being accepted. The acceptance rate is 1 when all the test samples are correctly predicted. Otherwise, it will not be greater than the accuracy, because the prediction of the classifier on some test samples may have small confidence, as indicated by output-class label products that are lower than the threshold. These test samples are correctly predicted and are counted in the computation of accuracy, but they will not be counted in the computation of the acceptance rate.
In the tables and figures in this section, we will denote accuracy and acceptance rate by acu
and acp, respectively. Obviously, the acceptance rate cannot be higher than accuracy.
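A direct numpy transcription of Eqs. 4.1.6.1 and 4.1.6.2 is given below; outputs are the real-valued classifier outputs and labels take values in {−1, +1}.

import numpy as np

def accuracy(outputs, labels):
    prod = np.asarray(outputs) * np.asarray(labels)
    return np.mean(prod > 0)

def acceptance_rate(outputs, labels):
    prod = np.asarray(outputs) * np.asarray(labels)
    threshold = -np.min(prod)            # negative of the worst prediction
    return np.mean(prod > threshold)

When every test sample is predicted correctly, the smallest product is positive, the threshold is negative, and the acceptance rate equals 1, as stated above.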
Because the number of samples was small, we used the leave-one-out method for validating the classifier on the training samples as well as on all samples. When there are n samples, leave-one-out is a technique that iteratively chooses each sample for testing and uses the remaining samples for training. A total of n classifiers were trained, and n predictions were made. The accuracy and acceptance rates were computed from the predictions and labels of the n samples. We wrote and ran our program using MATLAB 6.1. The support vector machine was constructed using a MATLAB toolbox implementation. For the SVM, we set C = 100.0 and used the linear kernel, the same as those used by Guyon et al. (2002).
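The leave-one-out evaluation used throughout this chapter can be sketched briefly; train_and_predict is an assumed helper that trains a classifier on the given samples and returns a real-valued output for the held-out sample.

import numpy as np

def leave_one_out_outputs(X, y, train_and_predict):
    # X: n x m numpy array of samples, y: array of class labels
    n = len(y)
    outputs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                     # hold out sample i
        outputs[i] = train_and_predict(X[mask], y[mask], X[i])
    return outputs   # these feed the accuracy and acceptance rate computations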
Figure 4.3 shows the sorted LIK scores. We chose equal numbers of genes with the highest
LIK ALL → AML and LIK AML→ ALL score as the initial gene sets for RFE. We plotted in this figure the
scores of the top (2 x 80) genes. The top i-th gene according to LIK_ALL→AML always has a higher LIK_ALL→AML value than the corresponding top i-th gene's LIK_AML→ALL score. Genes with
low LIK scores are not expected to be good discriminators. We decided to pick the top genes
to check their discriminating ability. In particular, we ran experiments using the top (2 x 10),
(2 x 20), (2 x 30) genes. We found the best performance was obtained when 2 x 20 top
ranking genes were selected. The performance was measured by computing the prediction
accuracy and acceptance rate of SVM and Bayesian classifiers built using the selected genes
Figure 4.4 shows the accuracy and acceptance rate using two different experimental settings:
leave-one-out and train-test split. For the leave-one-out (LOO) setting, we computed the
performance measures using only the 38 training samples as well as on the entire dataset
consisting of 72 samples. For the train-test split, the measures shown were computed on the
34 test samples, while the measures on the 38 training samples are not reported in the table. A
series of experiments were conducted to find the smallest number of genes that would give
good performance measure. The experiments started with all 40 (= 2 x 20) genes selected by
the LIK feature selection. One gene at a time was eliminated using RFE. RFE feature
selection was conducted until there was only one gene left. For a selected subset of genes, the
performance measures were computed under all experimental settings and using both the SVM and the Naïve Bayesian classifiers.
Figure 4.3: Sorted LIK scores of a subset of genes in the leukemia dataset, plotted against gene rank. Dots indicate LIK_ALL→AML scores and circles indicate LIK_AML→ALL scores. The top 28 genes according to their LIK_ALL→AML values have scores between 92014 and 1978.3, and the top 15 genes according to their LIK_AML→ALL values have scores between 22852 and 2148.7; they are not shown in this figure.
As can be seen from the figure, the SVM classifier achieved almost perfect accuracy and
acceptance rate when there were three to 14 genes used to find the separating hyperplane. On
the other hand, when the Naïve Bayesian method was used for classification, almost perfect
performance was achieved with as many as 40 genes in the model. Elimination of the genes
by RFE one by one showed that the results could be maintained as long as there are at least
three genes in the model. This stability in performance indicates the robustness of the RFE
feature selection method when given a pre-selected small subset of relevant genes, as
identified by the LIK method. It is worth noting that the acceptance rate on the test samples
was almost constant with at least three genes, both when the SVM classifier and the Naïve
Bayesian classifier were used for prediction. We emphasize here that the hybrid LIK+RFE
feature selection was run using the 38 training samples; the classifiers were also built using
the same set of training samples without the use of any information from the data in the test
set.
A set of three genes was discovered to give perfect accuracy and acceptance rate regardless of
the experimental settings and the classifiers used. These genes are listed in Table 4.12. They
have also been identified as relevant genes in this dataset by several researchers. Golub et al.
(1999) identified U05259_rna1_at and M27891_at as relevant, while Keller et al. (2000)
identified the gene X03934_at as relevant. On the other hand, Guyon et al. (2002) identified a
completely different set consisting of four genes. Among these four genes, only M27891_at was also selected by our method.
[Two panels: accuracy (acu) and acceptance rate (acp) for training-sample LOO, test samples, and all-sample LOO, plotted against the number of genes (40 down to 1), for the SVM and Naïve Bayesian classifiers.]
Figure 4.4: Classification performance of genes selected using the hybrid LIK+RFE.
Table 4.12: The smallest gene set found that achieves perfect classification performance.
[Three-dimensional scatter plot with axes U05259_rna1_at, X03934_at and M27891_at.]
Figure 4.5: Plot of the leukemia data samples according to the expression values of the three selected genes.
Since there were only three genes selected by the hybrid LIK+RFE method, we are able to
visualize the distribution of both the training and test samples in a three-dimensional space.
Figure 4.5 shows the plot of the samples. In this figure, we differentiate between acute
myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) samples. There were
actually two different types of ALL samples. These were B-cells or T-cells as determined by
whether they arose from a B or a T cell lineage (Keller et al., 2000). From the figure, we can
see that all except one B-cell sample had almost constant expression values for two genes, namely M27891_at and X03934_at. Training sample number 17 was the one ALL B-cell sample that was an outlier. On the other hand, all T-cell samples had almost constant expression values for the genes U05259_rna1_at and M27891_at, while all AML samples had similar expression values for U05259_rna1_at and X03934_at. The plot shows that the three selected
genes were also useful in differentiating ALL B-cell and T-cell samples.
The performance of classifiers built using genes selected according to their LIK scores only is shown in Figure 4.6. For the results shown in this figure, we started with the same set of 40 genes and removed one gene at a time according to their LIK scores. As can be seen from the figure, the results were not as good as those shown in Figure 4.4. In particular, using SVM classifiers, the accuracy and the acceptance rate were more than 80 percent when there were still more than 20 genes in the model, but the acceptance rate dropped drastically when there were fewer genes. Naïve Bayesian classifiers performed well when there were more than 21 genes; further removal of genes according to their LIK scores caused the acceptance rate to drop considerably. When there were fewer than five genes, the accuracy and the acceptance rate of both classifiers were poor.
[Figure 4.6: accuracy (acu) and acceptance rate (acp) for training-sample LOO, test samples, and all-sample LOO, plotted against the number of genes (40 down to 1), for the SVM and Naïve Bayesian classifiers built from genes selected by LIK scores alone.]
The performance of classifiers with genes selected using RFE alone is depicted in Figure 4.7. We started with all 7129 genes in the feature set. We built an SVM using the training samples with expression values of all the genes and measured its
performance as well. The gene that had the smallest absolute weight in the SVM-constructed
hyperplane was removed, and the process of training and testing was repeated with one fewer
gene. This process was continued until there were no more genes to be removed.
[Performance plotted against the number of genes, from 7129 down to 1.]
Figure 4.7: Performance of classifiers with genes selected using RFE alone, starting from all 7129 genes down to only one gene. The experimental setting was the training-test split, and the performance measures were computed on the test samples.
An interesting point to note from the results depicted in Figure 4.7 is the sharp improvement
in the acceptance rate of the Bayesian classifiers when the number of genes was reduced from
2773 to 2772. The gene that was eliminated at this stage was M26602_at. The acceptance
rates stayed at 100 percent when there were 2772 to 1437 genes. Further removal of genes
caused the rate to deteriorate gradually. On the other hand, the performance of SVM was
more stable. With more than 519 genes, both the accuracy and the acceptance rate were at
least 90 percent.
We also experimented with choosing the top genes according to their LIK scores. We
selected genes with LIK scores that were higher than a certain threshold. The threshold
values tested were 1500, 2000, and 2500. Note that there were always more genes selected
because of their high LIK ALL→ AML than genes selected because of their LIK AML→ ALL values. The
best performance was obtained when the threshold was set to 1500. A total of 62 genes met
this threshold value and were used to form the initial gene set for RFE. After applying RFE,
we obtained a set of four genes that achieves perfect accuracy and acceptance rate on the
training and test samples under all three experimental settings. The set of four selected genes
is shown in Table 4.13. Two of the four genes were the same as those selected using the (2 x 20) top initial genes listed in Table 4.12. These genes were U05259_rna1_at and M27891_at. The gene M16336_s_at was also found by Keller et al. (2000) to be an important gene.
The performance of the SVM classifiers with genes selected using just the RFE approach was
slightly different from that reported by Guyon et al. (2002). The reason for this could be the
variation in the implementation of the quadratic programming solvers. The Matlab toolbox
uses Sequential Minimal Optimisation algorithm (Platt, 1999), while Guyon et al. used a
variant of the soft-margin algorithm for SVM training (Cortes, 1995). Our hybrid LIK+RFE
method achieved better performance than other methods reported in the literature. To achieve
perfect performance, the RFE implementation of Guyon et al. needed eight genes. When the
number of genes was reduced to four, the leave-one-out results on the training samples using
SVM achieved only 97 percent accuracy and 97 percent acceptance rate. SVMs trained on the 38 training samples with the four selected genes achieved only 91 percent accuracy and 82 percent acceptance rate on the test samples.
Table 4.13: The genes selected by the hybrid LIK+RFE method. The genes that have LIK
scores of at least 1500 were selected initially. RFE was then applied to select these four genes
Using the genes selected according to their LIK scores and applying the Bayesian method, Keller et al. (2000) achieved 100 percent prediction accuracy with more than 150 genes. Hellem and Jonassen evaluated the contribution of genes to the classification; the classification of the samples was obtained by applying k-nearest neighbours, diagonal linear discriminant and Fisher's linear discriminant methods. Guyon et al. also mentioned the performance of other works on this dataset (Mukherjee et al., 2000; Chapelle et al., 2000; Weston et al., 2001). None of these works achieved perfect performance with as few genes as the hybrid LIK+RFE method.
Besides LIK, we also tried combining other univariate feature ranking methods with RFE: the baseline criterion proposed by Golub et al. (1999), Fisher's criterion, and the extremal margin (Guyon et al., 2002). In a way similar to LIK+RFE, we chose the top 40 genes ranked by each of these three methods, and then let RFE do further feature set reduction. Figure 4.8 shows the accuracy on the test samples when running RFE starting from each initial feature set.

It can be seen from Figure 4.8 that the prediction accuracies of the baseline and Fisher's criteria on the test samples are similar. In the elimination process, the accuracy was kept above 80% for SVM prediction and 90% for Bayesian prediction as long as there were more than 5 genes remaining in the set; the prediction accuracy dropped drastically when there were fewer than 5 genes. RFE starting from the genes selected by the extremal margin ranking performed better than the other two ranking methods, especially in Bayesian prediction accuracy, which remained no less than 97% until there was only one gene left in the gene set. In particular, perfect prediction was achieved when there were 5 genes left in the set.
[Two panels, Bayesian prediction and SVM prediction: accuracy plotted against the number of genes (40 down to 5) for RFE starting from genes selected by the Baseline, Fisher and Extremal margin criteria.]
Figure 4.8: SVM and Bayesian prediction accuracy on test samples using features selected by
RFE combined with baseline criterion, Fisher’s criterion and Extremal margin method.
The second dataset to test LIK+RFE is from small, round blue cell tumors (SRBCTs) (Khan
et al., 2001). There are 88 samples altogether with 2308 genes, divided into 4 classes:
neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphomas (BL) and the Ewing
family of tumors (EWS). Because our focus is on binary classification, we decomposed the problem into 4 one-against-rest binary classification problems of the second type. In (Khan et al., 2001) there are 63 training samples, which consist of 23 EWS, 8 BL, 12 NB, and 20 RMS samples. There are 25 test samples, which consist of 6 EWS, 3 BL, 6 NB, 5 RMS and 5 non-SRBCT samples. Besides solving the four binary classification problems, we also tested the prediction obtained by combining all the classifiers together.
We obtained the expression ratio data of Khan et al. (2001). Before we conducted our
experiments, the expression values were transformed by computing their logarithmic values.
Base 2 log transformation was used, as this is the usual practice employed by researchers
analyzing micraoarray data. In Figure 4.9, the plot of the gene ranking according to their LIK
scores is shown. The LIK scores were computed for differentiating EWS samples from non-
EWS samples. The set of top 20 genes according to their LIK EWS → Non− EWS ranking contained
eight genes that were also in the set of top 20 genes according to their LIK Non− EWS → EWS ranking.
Hence, when RFE was applied to further eliminate genes from the feature set, it started with
32 unique genes.
[LIK score plotted against gene rank.]
Figure 4.9: Sorted LIK scores of genes in the SRBCT dataset. Dots indicate LIK_EWS→Non-EWS scores and circles indicate LIK_Non-EWS→EWS scores.
For the other three classification problems, the plots would look very similar to Figure 4.9
and are not shown in this paper. For each of the four problems, LIK selected the top (2 x 20)
genes. The number of unique genes selected by LIK and the results of the experiments from
solving four binary classification problems are summarized in Table 4.14. The numbers of
unique genes selected by LIK and the smallest numbers of genes required to achieve near
perfect performance during the gene elimination process by RFE are shown in the second
column of the table. For the three classification problems to identify EWS, BL and NB, the
accuracy and the acceptance rates were at least 98 percent for all experimental settings. Those
perfect performance results are highlighted in the table. For the fourth classification problem
to differentiate between RMS and non-RMS samples, the accuracy rates were at least 92 percent. However, the acceptance rate on the test samples dropped to eight percent for the SVM classifier and 16 percent for the Bayesian classifier.
Classification problem   Initial/final number of genes   SVM training LOO (acu, acp)   SVM test (acu, acp)   SVM all LOO (acu, acp)   Bayesian training LOO (acu, acp)   Bayesian test (acu, acp)   Bayesian all LOO (acu, acp)
EWS vs non-EWS           32/5                            1.00, 1.00                    1.00, 1.00            0.99, 0.99               1.00, 1.00                         1.00, 1.00                1.00, 1.00
BL vs non-BL             37/3                            1.00, 1.00                    1.00, 1.00            1.00, 1.00               1.00, 1.00                         1.00, 1.00                1.00, 1.00
NB vs non-NB             34/3                            1.00, 1.00                    1.00, 1.00            1.00, 1.00               0.98, 0.98                         1.00, 1.00                1.00, 1.00
RMS vs non-RMS           34/4                            1.00, 1.00                    0.92, 0.08            0.99, 0.88               1.00, 1.00                         0.92, 0.16                0.97, 0.35
Table 4.14: Experimental results for the SRBCT dataset using the hybrid LIK+RFE.
The poor acceptance rate obtained when predicting RMS test samples suggests that the
differences in the output of the classifiers and the actual target values were high for the
incorrectly predicted samples. In order to verify the predictions, we plotted the distribution of
the samples according to the expression values of three genes, ImageID784224 (fibroblast
(troponin T1). The three were selected because their corresponding SVM weights were the
largest. The plot is shown in Figure 4.10. We can clearly see that the two incorrectly
classified non-RMS samples were outliers with large values for ImageID1409509 (troponin
T1). These two outliers were Sk. Muscle samples TEST-9 and TEST-13, which were
misclassified as RMS samples. It should be noted that there were no Sk. Muscle samples in the training set.
[Three-dimensional scatter plot with axes fibroblast growth factor receptor 4, sarcoglycan alpha and troponin T1, showing RMS and non-RMS training and test samples.]
Figure 4.10: Plot of all 88 RMS and non-RMS samples according to the expression values of the three selected genes.
The genes selected by the hybrid LIK+RFE for each of the four classification problems are
listed in Table 4.15. For the problem of differentiating EWS from non-EWS samples, our
method selected five genes, all of which were also selected by Khan et al. (2001). On the
other hand, to differentiate between NB and non-NB samples, only three genes were needed
and none was selected by Khan et al. All together, the hybrid LIK+RFE identified 15
important genes. This number compares favorably with the total of 96 genes selected by the
Classification problem   Reported by Khan et al. (2001)   Image ID   Description
Table 4.15: The genes selected by the hybrid LIK+RFE for the four binary classification
problems.
We also built classifiers using genes selected based purely on their LIK scores. For comparison purposes, for each of the four problems, the number of genes was set to be the same as the corresponding final number selected by the hybrid LIK+RFE shown in Table 4.14. Table 4.16 summarizes the results. For three of the four classification problems, the performance of the classifiers was not as good as the results reported in Table 4.14; the accuracy and acceptance rates dropped to as low as 52 percent. The most unexpected results came from the fourth problem, differentiating between RMS and non-RMS samples. The SVM classifier achieved perfect accuracy and acceptance rates under all experimental settings, while the Bayesian classifier achieved at least 92 percent accuracy and acceptance rate. The four genes included ImageID461425 (MLY4), a gene for insulin-like growth factor 2, and ImageID207274 (Human DNA for insulin-like growth factor II). All these genes were among the 96 genes identified by Khan et al. (2001). Of these four, only one was also selected by the hybrid LIK+RFE method.
Classification problem   Number of genes   SVM training LOO (acu, acp)   SVM test (acu, acp)   SVM all LOO (acu, acp)   Bayesian training LOO (acu, acp)   Bayesian test (acu, acp)   Bayesian all LOO (acu, acp)
EWS vs non-EWS           5                 1.00, 1.00                    0.92, 0.88            0.95, 0.88               0.98, 0.97                         0.84, 0.84                0.95, 0.86
BL vs non-BL             3                 0.95, 0.92                    0.88, 0.88            0.97, 0.83               0.98, 0.98                         0.88, 0.76                0.93, 0.88
NB vs non-NB             3                 0.95, 0.92                    0.84, 0.76            0.97, 0.86               0.97, 0.97                         0.80, 0.52                0.95, 0.92
RMS vs non-RMS           4                 1.00, 1.00                    1.00, 1.00            1.00, 1.00               0.97, 0.95                         0.92, 0.92                0.97, 0.95

Table 4.16: Experimental results for the SRBCT dataset using genes selected by their LIK scores alone.
In comparison, Khan et al. (2001) used neural networks for multiple classifications to achieve
93 percent EWS, 96 percent RMS, 100 percent BL and 100 percent NB diagnostic
classification performance on the 88 training and test samples. Since there were four classes
of training data samples, each neural network had four output units. The target outputs were
binary encoded, for example, for an EWS sample the target was (EWS=1, RMS=NB=BL=0).
A total of 3750 neural networks calibrated with 96 genes were required. The highest average
output value from all neural networks determined the predicted class of a new sample. The
Euclidean distance between the average values and the target values was computed for all
samples in order to derive the probability distribution of the distances. A test sample would
be diagnosed as a member of one of the four classes based on the highest average value given
by the neural networks. This was provided that the distance value falls within the 95th
percentile of the corresponding distance probability of the predicted class. Otherwise, the
diagnosis would be rejected and the sample would be classified as a non-SRBCT sample. Of
the 88 samples in the training and test datasets, eight were rejected. Five of these were non-
SRBCT samples in the test set, while the other three actually belonged to the correct class but did not meet the 95th percentile distance criterion.
In order to visualize the distribution of the samples based on the expression values of the selected genes, we performed clustering of the genes using the EPCLUST program; average linkage clustering and the uncentered correlation distance measure were used. Figure
4.11 shows the clusters. It can be seen clearly from this figure that there existed four distinct
clusters corresponding to the four classes in the data. Most of the samples of a class fell into
their own corresponding clusters. The five non-SRBCT samples lay between clusters. We
conjecture that samples between clusters might not belong to any classes found in the training
dataset. Two between-cluster samples, RMS-T7 and TEST-20 were exceptions. RMS-T7,
which was nearer to the two Sk. Muscle samples TEST-9 and TEST-13 was actually an RMS
sample. TEST-20, which was nearer to Prostate sample TEST-11 than to EWS cluster was
actually an EWS sample. These exceptions were consistent with the neural network
prediction results of Khan et al. (2001) as the neural networks predicted TEST-9 and TEST-
13 to be RMS class, and they predicted TEST-20 and TEST-11 to be EWS class. Both
predictions, however, did not meet the 95th percentile distance criterion and were therefore
rejected. This indicated that these samples were also difficult to differentiate by the neural
networks. Different results from our clustering and the neural network classification can be
seen for test sample TEST-3, a non-SRBCT sample. The clustering placed TEST-3 between
4-93
BL and NB clusters. But the neural networks predicted this sample as an RMS sample
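The sketch below shows one way such a clustering could be reproduced, assuming average linkage on an uncentered correlation distance as used by EPCLUST. The expression matrix here is a random placeholder and the SciPy-based implementation is an assumption, not the program we actually ran.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def uncentered_correlation_distance(X):
    """Pairwise distance 1 - r_u, where r_u is the uncentered (non mean-
    subtracted) correlation between rows of X."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)   # rows scaled to unit length
    dist = 1.0 - U @ U.T                               # 1 - uncentered correlation
    np.fill_diagonal(dist, 0.0)
    return squareform(dist, checks=False)              # condensed form for linkage()

# Hypothetical expression matrix: rows = samples, columns = selected genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

Z = linkage(uncentered_correlation_distance(X), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")        # cut the tree into 4 clusters
print(labels)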
Most of the genes selected from the Leukemia and SRBCT datasets by our hybrid LIK+RFE
method have some relevance to cancer according to a literature search in PubMed, a document
retrieval service of the National Library of Medicine of the United States. However, biological
experiments need to be done to further validate the role of these genes. There was a noticeable
difference in the acceptance rates of the classifiers between the first three binary classification
problems and the fourth problem in the SRBCT dataset. Overall, we observe that the
classification performance on the test set generally does not change much with the
consecutive elimination of a few genes. The removal of one gene would not normally cause a
drastic change in the performance of the classifier. Significant drops in accuracy and/or the
acceptance rate are observed most frequently when a gene is removed from the optimal set.
In order to study how the expression values of irrelevant genes affect the selection performance
of RFE, we generated three types of artificial datasets. Each dataset consists of 40
training samples (20 positive class + 20 negative class) and 40 test samples (also 20 positive
class + 20 negative class). All datasets consist of features that are relevant and features that
are irrelevant to the classification. The values of the irrelevant features are sampled from a
normal distribution with standard deviation 1 and mean 0, regardless of the class labels of the
samples. The three types of datasets differ in the way the relevant features are constructed.
For the first type, the relevant features contribute independently to the classification. Their
values are normally distributed with standard deviation 1 and mean x or -x according to the
class labels. For different relevant features, x is randomly chosen from a uniform distribution
on the interval (0, 1). If x is big enough, a univariate feature selection method is likely to be
able to select the relevant features. The relevant features of the second type of dataset are
constructed by considering the joint effect of the features: suppose there are k relevant
features X_1, ..., X_k, where X_1, ..., X_{k-1} are sampled from a normal distribution in the
same way as the irrelevant features. We set the remaining relevant feature to
X_k = Σ_{i=1}^{k-1} X_i ± α, where α is randomly sampled from a normal distribution with
standard deviation 0 and mean 1, and α is added or subtracted from the sum depending on the
class labels of the samples. It is expected to be harder for a univariate method to select the
first k-1 relevant features. The third type of dataset is constructed by setting the first k-1
relevant features in the same way as in the first type, while the k-th feature is set in the same
way as in the second type. This type of dataset better models the expression of genes that are
related to cancer, where the genes have certain degrees of individual contribution to the
cancer from observation, but their joint expression also carries class information.
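As an illustration, a minimal sketch of how such artificial datasets could be generated under the assumptions above is given below. The function name, the choice of k = 4 relevant features among 100, and the use of Python/NumPy are ours; the code is illustrative rather than the exact generator we used.

import numpy as np

def make_dataset(dataset_type, n_per_class=20, n_relevant=4, n_total=100, rng=None):
    """Generate one artificial dataset of the given type (1, 2 or 3).
    Rows are samples, columns are features; the first n_relevant columns are
    relevant and the remaining ones are irrelevant N(0, 1) noise. Returns (X, y)."""
    rng = rng or np.random.default_rng()
    y = np.array([+1] * n_per_class + [-1] * n_per_class)
    n = len(y)
    X = rng.normal(0.0, 1.0, size=(n, n_total))          # irrelevant background

    if dataset_type in (1, 3):
        # Independently informative features: mean +x for one class, -x for the other.
        x = rng.uniform(0.0, 1.0, size=n_relevant if dataset_type == 1 else n_relevant - 1)
        k = len(x)
        X[:, :k] = rng.normal(0.0, 1.0, size=(n, k)) + np.outer(y, x)
    if dataset_type in (2, 3):
        # Jointly informative feature: X_k = sum of the first k-1 relevant features,
        # plus or minus alpha depending on the class label (mean 1, standard
        # deviation 0, i.e. alpha = 1).
        alpha = 1.0
        X[:, n_relevant - 1] = X[:, :n_relevant - 1].sum(axis=1) + y * alpha
    return X, y

# Hypothetical usage: one type 2 training set.
X_train, y_train = make_dataset(dataset_type=2, rng=np.random.default_rng(0))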
We tried LIK, RFE and LIK+RFE on datasets of all three types above, each containing
a total of 100 features, 4 of which are relevant. In each experiment, we let the selection
method select the features based on the training samples, and then tested the classification
performance on the test samples. For testing RFE, one feature was eliminated at a time. For
testing LIK, we included the same number of top ranking features from both the LIK a→b and
LIK b→a rankings. For testing the combination LIK+RFE, we chose the top 5+5 and the top
10+10 LIK-ranked features as input to RFE. We used SVM to test the classification
performance. The setting of the SVM for RFE and classification was the same as in Section
4.1.6.1. Before applying the methods, the datasets were normalized in the same way as the
Leukemia dataset. For each setting, 100 experiments were conducted, each on a different
generated dataset, and the mean and standard deviation of the test results are shown in Table
4.17. The table shows the type of the datasets; the number of LIK-ranked features used
as input for RFE (lik_num) and, within these, the number of distinct ones (num_lik_rfe); the
accuracy of SVM prediction using all features and using only the relevant features (acc_all and
acc_rev respectively), where acc_rev indicates the ideal prediction performance; the
corresponding numbers of support vectors (num_sv_all and num_sv_rev respectively); and,
for running LIK, RFE, and LIK+RFE, the highest accuracy obtained (max_*_acc), the number
of features selected at that accuracy (num_*_acc), and the number of relevant features among
them (num_*_acc_rev).
lik_num 10 20
dataset_type 1 2 3 1 2 3
num_lik_rfe 8.57 ± 0.97 9.01 ± 1.00 8.40 ± 0.83 14.66 ± 1.34 14.97 ± 1.47 14.86 ± 1.20
num_lik_sel 2.71 ± 0.87 1.05 ± 0.59 2.91 ± 0.71 2.79 ± 0.95 1.24 ± 0.62 3.22 ± 0.66
acc_all 0.66 ± 0.09 0.55 ± 0.08 0.71 ± 0.09 0.65 ± 0.11 0.55 ± 0.08 0.71 ± 0.08
num_sv_all 35.25 ± 1.56 36.07 ± 1.61 34.39 ± 1.61 35.14 ± 1.92 36.16 ± 1.64 34.13 ± 1.94
acc_rev 0.83 ± 0.08 0.81 ± 0.07 0.88 ± 0.06 0.82 ± 0.09 0.81 ± 0.06 0.89 ± 0.06
num_sv_rev 13.15 ± 6.45 14.87 ± 4.47 8.76 ± 4.21 12.79 ± 6.66 15.08 ± 4.68 8.29 ± 3.75
max_rfe_acc 0.82 ± 0.08 0.69 ± 0.07 0.89 ± 0.07 0.82 ± 0.09 0.69 ± 0.07 0.89 ± 0.07
num_rfe_acc 9.79 ± 17.15 13.38 ± 17.62 3.51 ± 7.06 7.20 ± 11.85 11.75 ± 17.55 3.98 ± 8.70
num_rfe_acc_rev 2.19 ± 1.01 1.39 ± 0.75 1.29 ± 0.62 1.90 ± 0.89 1.39 ± 0.82 1.29 ± 0.61
contain_rfe_acc_last_rev 0.53 ± 0.50 0.90 ± 0.30 0.97 ± 0.17 0.50 ± 0.50 0.90 ± 0.30 0.94 ± 0.24
max_lik_acc 0.84 ± 0.08 0.70 ± 0.07 0.91 ± 0.06 0.84 ± 0.08 0.70 ± 0.07 0.90 ± 0.06
num_lik_acc 10.20 ± 17.07 24.26 ± 27.35 5.20 ± 11.70 6.97 ± 11.14 18.02 ± 20.83 3.76 ± 5.94
num_lik_acc_rev 2.45 ± 1.00 1.66 ± 1.06 1.81 ± 0.95 2.08 ± 0.97 1.40 ± 0.90 1.81 ± 0.97
contain_lik_acc_last_rev 0.63 ± 0.49 0.94 ± 0.24 1.00 ± 0.00 0.52 ± 0.50 0.92 ± 0.27 1.00 ± 0.00
max_lik_rfe_acc 0.81 ± 0.09 0.66 ± 0.09 0.89 ± 0.07 0.82 ± 0.09 0.66 ± 0.09 0.88 ± 0.07
num_lik_rfe_acc 3.17 ± 2.06 3.14 ± 2.56 1.83 ± 1.48 3.92 ± 3.20 4.24 ± 3.94 2.41 ± 3.15
num_lik_rfe_acc_rev 1.96 ± 0.97 0.84 ± 0.55 1.25 ± 0.58 1.91 ± 0.91 0.85 ± 0.61 1.38 ± 0.76
contain_lik_rfe_acc_last_rev 0.51 ± 0.50 0.72 ± 0.45 0.96 ± 0.20 0.41 ± 0.49 0.70 ± 0.46 0.92 ± 0.27
Table 4.17: Test result of LIK, RFE and LIK+RFE on artificial datasets
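For readers who wish to reproduce the protocol, the sketch below shows an RFE loop of the kind described above: a linear SVM is trained, the feature with the smallest squared weight is removed, and the step is repeated. The LIK score itself is defined earlier in the thesis; the simple mean-difference score below is only a placeholder for it, and scikit-learn's LinearSVC is an assumed tool rather than the implementation we actually used.

import numpy as np
from sklearn.svm import LinearSVC

def rfe_linear_svm(X, y, n_keep, C=1.0):
    """Recursive feature elimination with a linear SVM: drop the feature with
    the smallest squared weight, one at a time, until n_keep features remain.
    Returns the indices of the surviving features."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = LinearSVC(C=C, dual=True, max_iter=10000).fit(X[:, remaining], y)
        w2 = clf.coef_.ravel() ** 2
        remaining.pop(int(np.argmin(w2)))     # eliminate the weakest feature
    return remaining

def univariate_prefilter(X, y, n_top):
    """Placeholder univariate ranking (stand-in for the LIK score): rank
    features by the absolute difference of class means over the pooled spread."""
    pos, neg = X[y == +1], X[y == -1]
    score = np.abs(pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0) + 1e-12)
    return np.argsort(score)[::-1][:n_top]

# Hypothetical LIK+RFE style pipeline on random data of the same shape as above.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(40, 100)), np.repeat([+1, -1], 20)
top = univariate_prefilter(X, y, n_top=10)            # univariate prefilter
selected = [top[i] for i in rfe_linear_svm(X[:, top], y, n_keep=4)]
print(selected)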
It can be seen from Table 4.17 that, as expected, in terms of the number of features selected and
the accuracy (rows max_*_acc and num_*_acc), RFE is similar to LIK for datasets of types 1
and 3, while for datasets of type 2, RFE is significantly better than LIK. But the test results
on all three types of datasets indicate that LIK+RFE selected significantly more accurate
feature sets in terms of the proportion of relevant features among the selected features (rows
num_*_acc and num_*_acc_rev).
It can also be seen from the table that when the number of irrelevant features is large
compared with the number of relevant features, the classification performance of the SVM
deteriorates; almost all training samples become support vectors (row num_sv_all compared
with row num_sv_rev). This phenomenon also occurs when SVM is applied to datasets of the
second type of classification problem, such as the Leukemia and SRBCT datasets. From our
experience, the number of support vectors reduces significantly only when the number of
features used for training is near or below the number of training samples. We suspect the
phenomenon is related to the learning capacity of the SVM, but we have not found a
theoretical basis for this.
The assumption of a single or double normal distribution for the irrelevant and relevant features
is simple and may not accurately reflect the real situation in microarray datasets; in particular,
combinatorial effects among features are not modeled. However, as can be seen from the test
results on these simple datasets, it is quite likely that the weights of an SVM trained on a mixture
of many irrelevant features and few relevant features are unable to truly measure the
contribution of the features to the classification. Some relevant features are incorrectly
eliminated by RFE when it starts from all features. By comparison, the LIK ranking is able to
keep most of the relevant features for type 1 and type 3 datasets, which enables RFE to perform
the subsequent elimination more reliably.
We also compared LIK+RFE with combinations of the univariate and multivariate versions of
the Likelihood method and Fisher's method. In our experiment, a set of features was first
selected by a univariate method, in the same fashion as in the experiment for LIK+RFE. A
multivariate method was then used to eliminate features recursively from that feature set.
We tested the method on the Leukemia dataset and set the size of the initial set to 30, as
Fisher's linear discriminant encounters matrix inversion problems if the initial gene set size is
bigger than the number of training samples. The algorithm was implemented and run in
Matlab 6.1.
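A minimal sketch of this recursive elimination with Fisher's linear discriminant as the multivariate criterion is given below. The elimination rule (dropping the feature with the smallest squared discriminant weight) and the NumPy implementation are our assumptions for illustration; they are not the exact Matlab code we ran.

import numpy as np

def fisher_weights(X, y):
    """Fisher's linear discriminant direction w = Sw^{-1} (m+ - m-).
    Sw is the within-class scatter matrix; it becomes singular (the matrix
    inversion problem mentioned above) when the number of features exceeds
    the number of training samples."""
    Xp, Xn = X[y == +1], X[y == -1]
    Sw = (np.cov(Xp, rowvar=False) * (len(Xp) - 1)
          + np.cov(Xn, rowvar=False) * (len(Xn) - 1))
    return np.linalg.solve(Sw, Xp.mean(0) - Xn.mean(0))

def recursive_fisher_elimination(X, y, n_keep):
    """Recursively drop the feature with the smallest squared Fisher weight."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w = fisher_weights(X[:, remaining], y)
        remaining.pop(int(np.argmin(w ** 2)))
    return remaining

# Hypothetical use on an initial set of 30 genes and 38 training samples.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(38, 30)), np.repeat([+1, -1], 19)
print(recursive_fisher_elimination(X, y, n_keep=5))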
Figure 4.12 shows that when using the combinations F_F, L_L and F_L, the accuracy of
Bayesian classification on the test samples was generally better than that of the SVM
prediction based on the same set of genes, whereas for L_F the SVM prediction was better
than that of the Bayesian method. In both the SVM and Bayesian test results, L_L
outperformed the other three combinations when there were more than 15 genes remaining in
the gene set. However, its performance dropped drastically when there were fewer than four
genes remaining; under this situation, the combination F_F was the best. However, none of
these combinations of the four methods outperformed the combination LIK+RFE in our tests
on the Leukemia dataset in terms of classification accuracy based on the same number of genes.
[Figure 4.12 consists of two panels, SVM prediction and Bayesian prediction, each plotting test accuracy (0 to 1) against the number of genes remaining (from 30 down to 3) for the four combinations L_F, F_F, L_L and F_L.]
Figure 4.12: SVM and Bayesian prediction accuracy when running combinations of univariate
and multivariate feature selection methods. L_F: Likelihood + Fisher's linear discriminant;
F_F: Fisher's criterion + Fisher's linear discriminant; L_L: Likelihood + multivariate
Likelihood; F_L: Fisher's criterion + multivariate Likelihood.
The dataset we used for the low dimension analysis problem is a Zebra fish microarray dataset
from the Lab of Functional Genomics of the Institute of Molecular and Cell Biology,
Singapore (Lo et al., 2003). In recent years, Zebra fish has been adopted as a model system for
the study of vertebrate development owing to some of its unique characteristics that are
favorable for genetic studies compared to other vertebrate systems. These characteristics
include a reasonably short lifetime, a large number of progenies, external fertilization and
embryonic development, and translucent embryos (Talbot and Hopkins, 2000). In the
microarray experiment, there were altogether 11,552 Expression Sequence Tag (EST) clones
representing 3100 genes printed onto the microarray glass slides. According to a BLAST
(Basic Local Alignment Search Tool) search, 4519 of the 11,552 clones have matches to 728
distinct publicly deposited protein sequences. That is, the functions of these 4519 clones are
known, while the functions of the remaining clones are unknown. The relative expression of
the 11,552 clones in the Zebra fish's six developmental stages, including cleavage (E2),
gastrula (E3), blastula (E4), segmentation (E5), pharyngula (E6) and hatching (E7), was
measured in comparison to the expression of these clones in the unfertilized egg stage (E0).
A first type classification problem was constructed, which included 11,449 samples from the
11,552 clones with 6 features. A total of 3887 of the 11,449 samples, corresponding to the
known clones, were labeled according to whether they are muscle genes or not. Within these
3887 clones, 248 were clones from 17 muscle genes. We performed the classification using
SVM. The labeled clones were randomly split into two sets, 2500 for training and the
remaining 1387 for testing. There were 157 and 91 positive samples in the training and testing
sets respectively. The remaining 7562 unlabeled samples were then used for prediction.
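As a rough illustration of this setup, the sketch below trains an RBF-kernel SVM on a labeled split and predicts the unlabeled clones, using the parameters C = 10 and σ = 0.01 reported in Table 4.18. The random placeholder data, the scikit-learn API, and the assumption that the kernel is written as exp(−σ·||x − y||²) (so that gamma = σ) are ours; if σ instead denotes the kernel width, gamma would be 1/(2σ²).

import numpy as np
from sklearn.svm import SVC

# Hypothetical placeholders for the six-feature expression data:
# X_train, y_train  -- 2500 labeled clones (y in {+1, -1}: muscle gene or not)
# X_test,  y_test   -- 1387 labeled clones held out for testing
# X_unlabeled       -- 7562 clones with unknown function
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2500, 6)), rng.choice([-1, 1], size=2500)
X_test, y_test = rng.normal(size=(1387, 6)), rng.choice([-1, 1], size=1387)
X_unlabeled = rng.normal(size=(7562, 6))

# RBF-kernel SVM with C = 10 and sigma = 0.01 (assuming gamma = sigma).
clf = SVC(C=10.0, kernel="rbf", gamma=0.01).fit(X_train, y_train)

print("support vectors:", clf.n_support_.sum())
print("true positives on test set:",
      int(np.sum((clf.predict(X_test) == 1) & (y_test == 1))))
positive_clones = np.flatnonzero(clf.predict(X_unlabeled) == 1)
print("positively predicted unknown clones:", len(positive_clones))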
Table 4.18: Test of SVM with RBF kernel using different parameters.
Table 4.18 shows the test results of the SVM with a radial basis function kernel using
different parameters. It can be seen that the best test performance was obtained when setting
C = 10 and σ = 0.01. With these parameters, the number of correctly predicted positive test
samples (true positives) reached 44, which was the highest among all configurations we tried;
the number of incorrectly predicted negative samples (false negatives) was as low as 10,
which compares favorably with the lowest false negative count we obtained, which was 9.
The trained model under this setting is also not complex, as can be seen from the number of
support vectors, which was as low as 271. We provided a list of the 110 positively predicted
unknown clones to the biological researchers in the Lab of Functional Genomics of the
Institute of Molecular and Cell Biology. Ten of these 110 clones were selected for further
biological validation (re-sequencing of these clones). Eight of the ten clones were proven to
be from muscle genes. The remaining two were found to be known genes after repeated
sequencing; although these two are not muscle genes, they are functionally related to muscle
genes and therefore showed expression patterns similar to those of the other eight clones.
Besides these ten clones, another positively predicted clone was re-sequenced and turned out
to be a putative novel gene. In situ hybridization of this clone showed that the corresponding
gene truly had a muscle-related expression pattern.
Biological research can be treated as a knowledge discovery process. Raw information is first
collected from biological studies and stored in DNA sequence trace files, scanned microarray
images, descriptions of samples and descriptions of experimental conditions. The information
can then be quantified or symbolized into biological data with a certain structure. Machine
learning and statistical methods are then applied to extract biological knowledge from the
data. The knowledge is in essence relationships: relationships between genes inferred from
their sequence and expression similarity, relationships between gene expression and sample
classes, and so on. The knowledge that can be discovered depends on two factors: the amount
and quality of the biological data available, and the suitability of the machine learning
methods for the various biological problems.
There are more and more researchers working to accelerate this discovery process. There are
currently two main journals and several main conferences in the bioinformatics area; the
conferences include the Pacific Symposium on Biocomputing (PSB), the International
Conference on Intelligent Systems for Molecular Biology (ISMB), and the International
Conference on Research in Computational Molecular Biology (RECOMB). There is also a
conference that focuses specifically on microarray data analysis.
Although much effort has been made towards multidisciplinary collaboration in the discovery
process, researchers from non-biology disciplines still need to gain more insight, together
with biologists, into the nature of the biological problems. Really good designs of machine
learning methods lie in the full incorporation of biological knowledge, rather than in simply
abstracting the biological problem to fit well developed models. This criterion dictates the
future direction of work in this area.
This thesis focuses on the classification and feature selection problems for gene expression
analysis. In this research, we reviewed current work in the literature and identified
the classification problems. We then applied seven feature selection methods, feature
extraction methods and some of their combinations for gene selection, and employed five
classification methods for the prediction of cancer tissue type and gene function. We improved
the neural network feature selector to make it more suitable for the gene selection problem on
datasets with high dimension but few samples. We also developed a multivariate version of
the likelihood feature selection method. We found that the hybrid of the Likelihood and
Recursive Feature Elimination methods (LIK+RFE) achieved better selection performance on the
benchmark Leukemia dataset than the other methods. The hybrid LIK+RFE also selected, from
the SRBCT dataset, compact gene sets that may be of interest to biological researchers.
The thesis shows a process of understanding the nature of the problem and choosing
suitable methods. We first tested whether the most informative components could contribute to
the classification by applying principal component analysis and neural networks to the Leukemia
dataset. The results showed that those components had little discrimination ability. We then
applied decision trees to find sets of rules that were able to classify all the samples. The
simplicity of the rules implied the possibility of finding a small set of genes with high
classification ability. Due to the discrete nature of the C4.5 algorithm, the small decision tree
generated from the training samples, which have continuous expression values, did not have
good generalization ability; consequently, the prediction accuracy of the tree on the test
samples was low.
The possibly simple underlying classification model and the deficiency of the decision tree
method inspired us to use information gain as a gene ranking method while using a neural
network as the classifier, and the classification performance improved. After studying the
distribution of the neural network outputs, we then moved on to looking for methods that could
further reduce the number of genes used for classification, among them the combination of
information gain and neural network feature selection.
We then tested other combinations of univariate and multivariate selection methods, including
methods based on the extremal margin, likelihood, and Fisher's criterion. Among these
combinations, the hybrid of the Likelihood method and the Recursive Feature Elimination
method (LIK+RFE) selected the most compact gene set with perfect prediction performance.
We carried out systematic tests of this hybrid method on other datasets of the high dimension
classification problem, and the test results were very promising. Applying the classification
methods to the Zebra fish dataset, which has a large number of samples, was also fruitful: the
prediction of the function of some unknown genes was confirmed by biological experiments.
Our experiments with the hybrid LIK+RFE on the SRBCT dataset showed that feature
selection and classification methods for gene expression analysis are data dependent. Our
experiments also showed that, for microarray datasets of the high dimension classification
problem, the choice of feature selection method is more important than the choice of
classification method. The relative performance of selection methods observed on the
Leukemia dataset need not carry over to other data: although some methods used in this thesis
did not achieve high selection performance on the Leukemia dataset, they may do well on other
datasets.
The study on linear separability (Cover, 1965) suggests that when the number of samples is
small compared with the number of features, it is possible to find a number of subsets of
features that can perfectly distinguish all samples. Our experiments on the Leukemia dataset
also support this hypothesis: we found two different gene sets, consisting of just three or four
genes, which can achieve perfect classification performance. Biological studies show that
although many genes do not have direct relevance to the cancer under study, their expression
may have subtle and systematic differences between different classes of tissues (Alon et al., 1999).
Hence, a new challenge for cancer classification arises: to find as many small subsets of genes
as possible that achieve high classification performance. Using only microarray data
with these subsets of genes, we can build different classifiers and look for those that have
desirable properties, such as a large extremal margin, i.e. a wide difference between the smallest
output of the positive class samples and the largest output of the negative class samples. Another
property could be the median margin, which is the difference between the median output of the
positive class samples and the median output of the negative class samples. Exhaustively
enumerating and evaluating all gene combinations is computationally NP-hard (non-
deterministic polynomial-time hard) and is feasible only when the number of relevant genes
is small.
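As a small illustration of these two properties, the sketch below computes the extremal margin and the median margin from a classifier's real-valued outputs; the outputs shown are hypothetical.

import numpy as np

def extremal_margin(pos_outputs, neg_outputs):
    """Smallest output of the positive class minus the largest output of the
    negative class; a large positive value means the classes are well separated."""
    return np.min(pos_outputs) - np.max(neg_outputs)

def median_margin(pos_outputs, neg_outputs):
    """Difference between the median outputs of the two classes."""
    return np.median(pos_outputs) - np.median(neg_outputs)

# Hypothetical classifier outputs (e.g. SVM decision values) for one gene subset.
pos = np.array([1.2, 0.8, 1.5, 0.9])
neg = np.array([-1.1, -0.6, -1.4, -0.3])
print(extremal_margin(pos, neg))   # 0.8 - (-0.3) = 1.1
print(median_margin(pos, neg))     # 1.05 - (-0.85) = 1.9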
Due to their cost, microarray experiments conducted to identify the genes that are crucial
for cancer diagnosis are still scarce, and the measurements obtained from the experiments are
noisy. These facts make the selection of different sets of relevant genes vital. Moreover,
cancer is a complex disease; it is not caused by only a few genes, but also by many other
factors (Kiberstis and Roberts, 2002). So even the best selected subsets may not actually be
the ones most crucial to the cancer under study. They can, however, be important candidates
for a further focused study of the gene interactions within individual subsets and of the
relationship between these interactions and the disease. There has been work done on
second order selection. For example, Guyon et al. (2002) found a gene pair that could achieve
zero leave-one-out error on the training samples but performed poorly on the test
samples. Hellem and Jonassen (2002) also evaluated the contribution of pairs of genes to the
classification for the ranking of genes, but they still had to combine multiple pairs of genes
to perform classification. We plan to work on finding better ways to develop methods for
high order feature selection that would allow the classifiers to achieve high performance with
only a small number of genes.
Reference
• Akutsu, T., Miyano, S., and Kuhara, S., (1999). Identification of genetic networks
from a small number of gene expression patterns under the Boolean network model.
• Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A.
• Alter, O., Brown, P. O., and Botstein, D., (2000). Singular value decomposition for
• Bishop, C.M., (1995). Neural Networks for Pattern Recognition. Clarendon Press,
Oxford.
• Boser, B., Guyon, I., and Vapnik, V. N., (1992). A training algorithm for optimal
152.
• Brazma, A., and Vilo, J., (2000). Gene expression data analysis. FEBS Letters,
480,17-24.
• Brown, M. P. S., Grundy, W. N., Lin, D., Sugnet, C., Ares, M., and Haussler, D.,
support vector machines. Proceedings of the National Academy of Sciences, 97, 262-
267.
• Butte, A. J., and Kohane, I. S., (2000). Mutual information relevance networks:
• Cai, J., Dayanik, A., Yu, H., Hasan, N., Terauchi, T., and Grundy, W. N., (2000).
Biology.
• Chapelle, O., Vapnik, V., Bousquet, O. and Mukherjee, S., (2000). Choosing kernel
• Chen, T., Filkov, V., and Skiena, S. S., (1999). Identifying gene regulatory networks
• Cortes, C., and Vapnik, V., (1995). Support vector networks. Machine Learning, 20,
273-297.
• Datta, S., (2001). Exploring relationships in gene expressions: A partial least squares
• Dewey, T. G., and Bhan, A., (2001). A linear systems analysis of expression time
• D'haeseleer, P., (2000). Reconstructing Gene Networks from Large Scale Gene
• D'haeseleer, P., Wen, X., Fuhrman, S., and Somogyi, R., (1997). Mining the gene
expression matrix: inferring gene relationships from large scale gene expression data.
Information processing in cells and tissues, Paton, R. C., and Holcombe, M., Eds.,
• Ewing, R. M., Kahla, A. B., Poirot, O., Lopez, F., Audic, S., and Claverie, J. M.,
(1999). Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene
• Filkov, V, Skiena, S, and Zhi, J., (2001). Analysis techniques for microarray time-
• Fitch, W. M., and Margoliash, E., (1967). Construction of phylogenetic trees. Science,
155, 279-284.
• Fletcher, R., (1987). Practical Methods of Optimization, 2nd edition, Wiley, New
York.
• Freund, Y., and Schapire, R. E., (1996). Experiments with a new boosting algorithm.
• Friedman, N., Linial, M., Nachman, I., and Pe'er, D., (2000). Using Bayesian
• Fuhrman, S., Cunningham, M. J., Wen, X., Zweiger, G., Seihamer, J. J., and
• Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., and
Haussler, D., (2000). Support vector machine classification and validation of cancer
• Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and
Lander, E. S., (1999). Molecular Classification of Cancer: Class Discovery and Class
• Guyon, I., Weston, J., Barnhill, S., and Vapnik, V., (2002). Gene selection for cancer
• Hellem, T. and Jonassen, I., (2002). New feature subset selection procedures for
• Hertz, J., Krogh, A., and Palmer, R. G., (1991). Introduction to the Theory of Neural
Computation. Addison-Wesley.
• Hieter, P., and Boguski, M., (1997). Functional genomics: it's all how you read it.
• Holter, N. S., Maritan, A., Cieplak, M., Fedoroff, N. V., and Banavar, J. R., (2001).
• Huang, S., (1999). Gene expression profiling, genetic networks, and cellular states: an
• Hwang, K. B., Cho, D. Y., Park, S. W., Kim, S. D., and Zhang, B. T., (2001).
• Keller, A. D., Schummer, M., Ruzzo, W. L., and Hood, L., (2000). Bayesian
• Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M, Westermann, F, Berthold,
F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S., (2001).
• Kiberstis, P., and Roberts, L., (2002). It's Not Just the Genes. Science, 296, 685.
• Kitano, H., (2002). Systems biology: a brief overview. Science, 295, 1662-1664.
• Klevecz, R. R., (2000). Dynamic architecture of the yeast cell cycle uncovered by
Genomics, 1, 186-192.
• Li, W., (2002). Zipf's law in importance of genes for cancer classification using
• Liang, S., Fuhrman, S., and Somogyi, R., (1998). REVEAL, A general reverse
Symposium on Biocomputing.
• Liu, H., and Motoda, H., (1998). Feature selection for knowledge discovery and data
• Lo, J., Lee, S., Xu, M., Liu, F., Ruan, H., Eun, A., He, Y., Ma, W., Wang, W., Wen,
Z., and Peng, J., (2003). 15,000 Unique Zebrafish EST Clusters and Their Use in
• Maki, Y., Tominaga, D., Okamoto, M., Watanabe, S., and Eguchi, Y., (2001).
Development of a System for the Inference of Large Scale Genetic Networks. Pacific
• Michaels, G. S., Carr, D. B., Askenazi, M., Fuhrman, S., Wen, X., and Somogyi, R.,
(1998). Cluster analysis and data visualization of large-scale gene expression data.
• Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Messirov, J. P., and
• Pe'er, D., Regev, A., Elidan, G., and Friedman, N., (2001). Inferring subnetworks
• Platt, J., (1999). Fast training of SVMs using sequential minimal optimisation.
Advances in Kernel Methods: Support Vector Learning. MIT press, Cambridge, MA,
185-208.
• Quinlan, J. R., (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann
Publishers.
• Raychaudhuri, S., Stuart, J. M., and Altman, R. B., (2000). Principal components
• Samsonova, M. G., and Serov, V. N., (1999). NetWork: An interactive interface to the
tools for analysis of genetic network structure and dynamics. Pacific Symposium on
Biocomputing.
• Setiono, R., and Liu, H., (1997). Neural-network feature selector. IEEE Transactions
• Slonim, D., Tamayo, P., Mesirov, J., Golub, T. R., and Lander, E., (2000). Class
• Someren, E. V., Wessels, L. F. A., and Reinders, M. J. T., (2000). Linear modeling of
• Somogyi, R., Fuhrman, S., Askenazi, M. and Wuensche, A., (1996). The gene
• Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B.,
Brown, P. O., Botstein, D., and Futcher, B., (1998). Comprehensive identification of
squares and principal component regression. Journal of the Royal Statistical Society
• Szallasi, Z., (1999). Genetic network analysis in light of massively parallel biological
• Talbot, W.S., and Hopkins, N., (2000). Zebra fish mutations and functional analysis
• Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P., (1999).
Clustering methods for the analysis of DNA microarray data. Technical Report,
Stanford University.
• Vohradsky, J., (2001). Neural network model of gene expression. FASEB Journal, 15,
846-854.
Processing Systems, Lippmann, R. P., Moody, J., and Touretzky, D. S., Eds., Morgan
Kaufmann, 3, 875-882.
• Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L., and
• Werbos, P. J., (1990). Backpropagation through time: what it does and how to do it.
• Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T. and Vapnik, V.,
• Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P., (2001). Normalization for cDNA
SPIE.
• Yeung, K. Y., Haynor, D. R., Ruzzo, W. L., (2001). Validating clustering for gene
• Zhang, B. T., Ohm, P., and Muhlenbein, H., (1997). Evolutionary Induction of Sparse