
GLOBAL GENE EXPRESSION ANALYSIS

USING MACHINE LEARNING METHODS

XU MIN

(B.Eng. Beihang Univ., China)

A THESIS SUBMITTED

FOR THE DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF INFORMATION SYSTEMS

NATIONAL UNIVERSITY OF SINGAPORE

2003

Acknowledgements

I am very grateful to my supervisor Dr. Rudy Setiono for his insightful suggestions in both

the content and presentation of this thesis. It was his encouragement, support and patience

that saw me through, and I am ever grateful to him. I am full of gratitude to my boss, Dr. Peng Jinrong, for his understanding and support in allowing me to pursue this part-time Master of Science research work.

I would like to thank Ms. Jane Lo, Mr. Guan Bin, and other members of Lab of Functional

Genomics of the Institute of Molecular and Cell Biology, Singapore, for their generous help throughout my research work.

For the study of the hybrid of Likelihood method and Recursive Feature Elimination method,

I would like to thank Dr. Isabelle Guyon for providing the supplementary data. I also thank

Dr. Cui Lirong, Mr. Wang Yang, Dr. Oilian Kon, Dr. Wolfgang Hartmann, and Mr. Li

Guozheng for their numerous helpful consultations.

I would also like to thank my parents for their love, encouragement, guidance and patience

throughout my studies.

Table of Contents

Acknowledgements..................................................................................................................... I
Table of Contents....................................................................................................................... II
List of Figures............................................................................................................................ V
List of Tables ............................................................................................................................VI
Summary ................................................................................................................................ VII
1 Introduction ..................................................................................................................... 1-1
1.1 Background ................................................................................................... 1-1

1.1.1 Functional genomics ............................................................................... 1-1

1.1.2 Microarray Technology........................................................................... 1-2

1.1.3 Machine learning methods for global analysis ......................................... 1-3

1.2 Research Objectives....................................................................................... 1-4

1.3 Organization of chapters ................................................................................ 1-4

2 Literature review.............................................................................................................. 2-6


2.1 Finding gene groups ...................................................................................... 2-6

2.2 Finding relationships between genes and function.......................................... 2-8

2.3 Dynamic modeling and time series analysis ................................................. 2-12

2.4 Reverse engineering and gene network inference ......................................... 2-13

2.5 Overview of the field ................................................................................... 2-17

3 Machine learning methods used ......................................................................................3-18


3.1 Description of problem ................................................................................ 3-18

3.1.1 Structure of microarray data.................................................................. 3-18

3.1.2 Two types of classification problems .................................................... 3-19

3.2 Feature integration and univariate feature selection methods........................ 3-20

3.2.1 Information gain ................................................................................... 3-21

3.2.2 Likelihood method (LIK) ...................................................................... 3-23



3.3 Classification and multivariate feature selection methods............................. 3-24

3.3.1 Neural Networks ................................................................................... 3-25

3.3.2 Ensemble of neural networks (boosting)................................................ 3-30

3.3.3 Neural network feature selector............................................................. 3-33

3.3.3.1 Disadvantage of information gain measure .................................... 3-33

3.3.3.2 Neural network feature selector..................................................... 3-35

3.3.4 Support vector machines ....................................................................... 3-44

3.3.4.1 Linearly separable learning model and its training ......................... 3-45

3.3.4.2 Linearly non-separable learning model and its training.................. 3-48

3.3.4.3 Nonlinear Support Vector Machines.............................................. 3-49

3.3.4.4 SVM recursive feature elimination (RFE)...................................... 3-51

3.3.5 Bayesian classifier ................................................................................ 3-52

3.3.6 Two discriminant methods for multivariate feature selection................. 3-53

3.3.6.1 Fisher’s linear discriminant ........................................................... 3-54

3.3.6.2 Multivariate Likelihood feature ranking ........................................ 3-55

3.3.7 Combining univariate feature ranking method and multivariate feature

selection method................................................................................................. 3-56

4 Experimental results and discussion ................................................................................4-58


4.1 Second type - high dimension problems....................................................... 4-58

4.1.1 Principal Component Analysis for feature integration ........................... 4-60

4.1.2 C4.5 for feature selection ...................................................................... 4-61

4.1.3 Neural networks with features selected using information gain.............. 4-65

4.1.4 AdaBoost.............................................................................................. 4-68

4.1.5 Neural network feature selector............................................................. 4-70

4.1.6 Hybrid Likelihood and Recursive Feature Elimination method.............. 4-72

4.1.6.1 Leukemia dataset........................................................................... 4-74



4.1.6.2 Small, round blue cell tumors dataset ............................................ 4-86

4.1.6.3 Artificial datasets .......................................................................... 4-95

4.1.7 The combination of Likelihood method and Fisher’s method ................ 4-98

4.2 First type - low dimension problems .......................................................... 4-101

5 Conclusion and future work ..........................................................................................5-104


5.1 Biological knowledge discovery process.................................................... 5-104

5.2 Contribution, limitation and future work.................................................... 5-105

Reference................................................................................................................................109

List of Figures

Figure 1.1: A scanned microarray image ..................................................................... 1-3


Figure 3.1: AdaBoost algorithm ................................................................................ 3-32
Figure 3.2: Plot of feature values with class labels, ‘o’ for ‘-1’ and ‘+’ for ‘+1’ ......... 3-35
Figure 3.3: Wrapper model framework...................................................................... 3-36
Figure 3.4: Neural network feature selector .............................................................. 3-42
Figure 3.5: Recursive feature elimination algorithm .................................................. 3-52
Figure 4.1: The combination of feature selection / integration methods and classification
methods.............................................................................................................. 4-59
Figure 4.2: Wrong prediction times. .......................................................................... 4-68
Figure 4.3: Sorted LIK score of a subset of genes in the leukemia dataset................. 4-75
Figure 4.4: Classification performance of genes selected using the hybrid LIK+RFE. ...... 4-77
Figure 4.5: Plot of the leukemia data samples according to the expression values of the
three genes selected by the hybrid LIK+RFE. ..................................................... 4-78
Figure 4.6: Performance of SVM and Naïve Bayesian classifiers built using genes
selected according to LIK scores. ....................................................................... 4-80
Figure 4.7: Classification performance of purely using RFE. ..................................... 4-81
Figure 4.8: SVM and Bayesian prediction accuracy on test samples using features
selected by RFE combined with baseline criterion, Fisher’s criterion and Extremal
margin method. .................................................................................................. 4-85
Figure 4.9: Sorted LIK scores of genes in the SRBCT dataset. Dots indicate
LIK(EWS→Non-EWS) scores and circles indicate LIK(Non-EWS→EWS) scores. ......................... 4-87
Figure 4.10: Plot of RMS and non-RMS samples. Plot of all 88 RMS and non-RMS
samples according to the expression values of three of the four selected genes.... 4-89
Figure 4.11: Hierarchical clustering of SRBCT samples with selected 15 genes. ....... 4-94
Figure 4.12: SVM and Bayesian prediction accuracy when running combination of
univariate and multivariate feature selection methods. ...................................... 4-100

List of Tables

Table 4.1: List of combined univariate and multivariate feature selection methods tested.
........................................................................................................................... 4-60
Table 4.2: Performance of neural network using principal components as input......... 4-61
Table 4.3: Decision trees constructed by iterative mode from all 72 samples. ............ 4-62
Table 4.4: Decision trees constructed by leave-one-out mode from all 72 samples..... 4-63
Table 4.5: Decision trees constructed by leave-one-out mode from 38 training samples
with prediction accuracy on 34 test samples........................................................ 4-63
Table 4.6: Leukemia features and their information gain of all samples, training samples
and test samples, sorted by gain of all samples.................................................... 4-64
Table 4.7: Prediction performance of neural network using features selected by
information gain. ................................................................................................ 4-66
Table 4.8: Test of neural network using features selected by information gain to identify
incorrectly predicted test samples. ...................................................................... 4-67
Table 4.9: AdaBoost test results. ............................................................................... 4-69
Table 4.10: Experiment result of neural network feature selector with summed square
error function...................................................................................................... 4-71
Table 4.11: Experiment result of neural network feature selector with cross entropy error
function. ............................................................................................................. 4-71
Table 4.12: The smallest gene set found that achieves perfect classification performance.
........................................................................................................................... 4-78
Table 4.13: The genes selected by the hybrid LIK+RFE method. The genes that have
LIK scores of at least 1500 were selected initially. RFE was then applied to select
these four genes that achieved perfect performance............................................. 4-83
Table 4.14: Experimental results for the SRBCT dataset using the hybrid LIK+RFE. 4-88
Table 4.15: The genes selected by the hybrid LIK+RFE for the four binary classification
problems............................................................................................................. 4-90
Table 4.16: The performance of SVM and Naïve Bayesian classifiers built using the top
genes selected according to their LIK scores....................................................... 4-91
Table 4.17: Test result of LIK, RFE and LIK+RFE on artificial datasets ................... 4-97
Table 4.18: Test of SVM with RBF kernel using different parameters. .................... 4-102

Summary

Microarray is a technology for quantitatively monitoring the expression of a large number of genes

in parallel. It has become one of the main tools for global gene expression analysis in

molecular biology research in recent years. The large amount of expression data generated by

this technology makes the study of certain complex biological problems possible and

machine learning methods are playing a crucial role in the analysis process. At present, many

machine learning methods have been or have the potential to be applied to major areas of

gene expression analysis. These areas include clustering, classification, dynamic modeling

and reverse engineering.

In this thesis, we focus our work on using machine learning methods to solve the

classification problems arising from microarray data. We first identify the major types of the

classification problems; then apply several machine learning methods to solve the problems

and perform systematic tests on real and artificial datasets. We propose improvements to existing methods. Specifically, we develop a multivariate and a hybrid feature selection method to obtain high classification performance on high-dimension classification problems.

Using the hybrid feature selection method, we are able to identify small sets of features that

give predictive accuracy that is as good as that from other methods which require many more

features.

1 Introduction

1.1 Background

1.1.1 Functional genomics

With the completion of the Human Genome Project, biology research is entering the post-genome

era. Although biologists have collected a vast amount of DNA sequence data, the details of

how these sequences function still remain largely unknown. Genomes of even the simplest

organisms are very complex. Nowadays, biologists are still trying to find answers to the

following questions (Brazma and Vilo, 2000):

• What are the functional roles of different genes and in what cellular process do they

participate?

• How are the genes regulated? How do the genes and gene products interact? What are

the interaction networks?

• How does the gene expression level differ in various cell types and states? How is the

gene expression changed by various diseases or compound treatments?

Biology used to be a data-poor science. With more advanced techniques developed in recent years, biologists are now able to transform vast amounts of biological information into useful data. This makes it possible to study gene function globally, and a new field, functional genomics, has emerged. Specifically, functional genomics refers to the development and

application of global (genome-wide or system-wide) experimental approaches to assess gene

function by making use of the information and reagents provided by structural genomics. It is

characterized by high throughput or large scale experimental methodologies combined with

statistical and computational analysis of the results (Hieter and Boguski, 1997).

1.1.2 Microarray Technology

Several methods have been developed to understand the behavior of genes. Microarray

technology is an important one among them. It is used to monitor the expression levels of a large number of genes in parallel. Here gene expression refers to the process of transcribing a gene's DNA sequence into the RNA that serves as a template for protein production, and gene expression level indicates how active a gene is in a certain tissue, at a certain time, or under a certain experimental condition. The monitored gene expression level provides an overall

picture of the genes being studied. It also reflects the activities of the corresponding protein

under certain conditions.

Several steps are involved in this technology. First, complementary DNA (cDNA) molecules

or oligos are printed onto slides as spots. Then, two kinds of dye-labeled samples, i.e. the sample and the control, are hybridized. Finally, the hybridization is scanned and stored as images (see

example in Figure 1.1, a sample from Zebra fish). Using a suitable image-processing

algorithm, these images are quantified into a set of expression values representing the

intensity of spots. Usually, the dye intensity may be biased by factors like its physical

property, experimental variability in probe coupling and processing procedures, and scanner

settings. To minimize the undesirable effects caused by this biased dye intensity,

normalization is done to balance dye intensities and make expression values comparable

across experiments (Yang et al., 2001). Here the term comparable means that the difference

of any measured expression value of a gene between two experiments should reflect the

difference of its true expression levels.
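As a toy illustration of this balancing step (this is not the lowess-based procedure of Yang et al., 2001; the function and data below are invented for illustration), one dye channel can be rescaled so that both channels share the same median intensity before log-ratios are taken:

```python
import numpy as np

def global_median_normalize(red, green):
    """Toy normalization: rescale the green channel so both channels share the
    same median intensity, then return log2 expression ratios."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    scale = np.median(red) / np.median(green)   # a single global scaling factor
    green_scaled = green * scale
    return np.log2(red / green_scaled)          # normalized log-ratios

# Example: intensities of five spots in the sample (red) and control (green) channels
ratios = global_median_normalize([1200, 340, 980, 150, 2200],
                                 [600, 170, 520, 70, 1100])
print(ratios)
```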



Figure 1.1: A scanned microarray image

1.1.3 Machine learning methods for global analysis

Molecular biology also used to be a data-poor field, and most gene expression analysis work was done manually with very limited information derived from experiments. The focus

of a molecular biologist was on a few genes or proteins. With the application of large-scale

biological information quantification methods like microarray and DNA sequencing, the

behavior of genes can be studied globally. Currently, there is an increasing demand for

automatic analysis of the overall relationships hidden behind large numbers of genes from their expression data.

Machine learning is the study of algorithms that can learn from experience and then make predictions. The theoretical aspects of machine learning are rooted in statistics and informatics,

but computational considerations are also indispensable. Due to the complex nature of

biological information, machine learning could play an important role in the analysis process.

1.2 Research Objectives

Microarray technology-based gene expression profiling is one of the hottest research topics in

biology at present. The experimental part of this technology is already mature. Compared

with this, the exploration of automatic analysis methods is still at its early stage. In this thesis,

we study several machine learning approaches to solving several typical gene expression

analysis problems.

The main objectives of this research are:

• To identify typical gene expression analysis problems from a machine learning point of

view.

• To apply suitable machine learning methods to the problems from public datasets, and

to improve these methods when necessary.

• To find new approaches to the problems.

• To study the experimental results.

• To apply the methods to new datasets and validate the result.

1.3 Organization of chapters

This thesis is organized as follows. Chapter 2 provides a brief review of the current methods

that can be applied to microarray data analysis. Chapter 3 gives detailed illustrations of

several important machine learning methods that can be applied to classification using gene

expression data. In particular, we improve a neural network feature selector method, develop a multivariate likelihood feature selection method, and propose a hybrid framework combining univariate and multivariate feature selection. Chapter 4 describes the experimental

results of these methods on two different kinds of gene expression analysis problems, and

discusses the experiment results. Specifically, we perform systematic tests of the hybrid of

Likelihood method and the Recursive Feature Elimination method because we have obtained very

good feature selection performance on several microarray datasets; we also apply Support

Vector Machine on a recently obtained Zebra fish dataset to perform gene function

prediction. Finally, Chapter 5 concludes the thesis and outlines future work.

2 Literature review

Various automatic methods have been applied or developed for gene expression analysis.

They are basically from fields such as machine learning, statistics, signal processing, and

informatics. The following are the relevant works categorized according to the analysis tasks.

2.1 Finding gene groups

Some methods can be used to find useful information or patterns in biological data that indicate relationships among the genes. These methods are unsupervised (Haykin,

1999), i.e., the learning models are optimized using pre-specified task-independent measures,

which reflect the difference or similarity of the training samples. Once the model has become

tuned to the statistical regularities of the input gene expression data, it develops the ability to

form internal representations for encoding features of the input and thereby to create new

classes automatically (Becker, 1991).

Principal Component Analysis (PCA) is an exploratory multivariate statistical technique for

simplifying complex data sets (Raychaudhuri et al., 2000). Given an expression matrix with a

number of features, a set of new features is generated by PCA. These new features account

for most of the information in the original features, but the number of dimensions is smaller

than that of the original data. There are several neural network algorithms that support PCA; these are mainly Hebbian-based algorithms (Haykin, 1999), which are self-organizing and adaptive. Singular Value Decomposition (SVD) can also be used to perform

PCA. SVD is a linear transformation that decomposes the gene expression matrix into a

product of three matrices that represent the underlying characteristics of the original matrix.

Alter et al. (2000) applied SVD for gene expression analysis. They first obtained the principal

components from the decomposed matrices by applying SVD to expression data of yeast

genes; then rejected the genes that contribute little information to the principal components.

In their work, the information contribution was measured by Shannon entropy of the

expression values of the genes, where Shannon entropy characterizes the complexity of the

expression values. Finally, the remaining genes were sorted, and the results reflect the strong

relationship between the groups of these genes and their functional categories.
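The following numpy sketch illustrates the general PCA-via-SVD idea on a toy expression matrix. It is a generic illustration rather than a reproduction of Alter et al.'s procedure; all sizes and variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 100 genes (rows) x 8 hybridizations (columns)
A = rng.normal(size=(100, 8))

# Centre each gene's expression profile, then decompose: A_c = U * S * Vt
A_c = A - A.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(A_c, full_matrices=False)

# Fraction of the total variation captured by each component ("eigengene")
fraction = S**2 / np.sum(S**2)
print("variance captured:", np.round(fraction, 3))

# Project genes onto the first two components to obtain a reduced feature set
reduced = A_c @ Vt[:2].T      # shape (100, 2): new features for each gene
print(reduced.shape)
```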

Clustering is a typical way to group genes together according to their features. A distance measure that reflects the similarity of genes' expression is needed for the clustering process. Most clustering methods that have been studied in the gene expression analysis

literature use either Euclidean distance or Pearson correlation between expression profiles as

a distance measure (D’haeseleer, 2000). Other measures include Euclidean distance between

expression profiles and slopes (Wen et al., 1998), squared Pearson correlation (D’haeseleer et

al., 1997), Euclidean distance between pairwise correlations to all other genes (Ewing et al.,

1999), Spearman rank correlation (D’haeseleer et al., 1997), and mutual information

represented by pairwise entropy (D’haeseleer et al., 1997; Michaels et al., 1998; and Butte

and Kohane, 2000).

There are two kinds of clustering methods: hierarchical and non-hierarchical ones. A

hierarchical clustering method starts from individual genes, merging them into bigger clusters

until there is only one cluster left, in an agglomerative way. The method can also divisively

start from all genes, splitting them until no two of them are together. The output of the

method is a hierarchy of clusters, where the higher-level clusters are the sum of the lower-

level ones. On the other hand, a non-hierarchical clustering method first divides genes into a

certain number of clusters, and then iteratively refines them until a certain optimization criterion is met.
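As a minimal illustration of agglomerative clustering with a correlation-based distance (assuming scipy is available; the expression data below are synthetic), the following sketch builds a hierarchy of gene expression profiles and cuts it into a fixed number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
expr = rng.normal(size=(30, 10))          # 30 genes x 10 conditions (toy data)

# Distance = 1 - Pearson correlation between expression profiles
dist = pdist(expr, metric='correlation')

# Agglomerative (bottom-up) clustering with average linkage
tree = linkage(dist, method='average')

# Cut the hierarchy into 4 flat clusters
labels = fcluster(tree, t=4, criterion='maxclust')
print(labels)
```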

Clustering methods that have been applied for gene expression analysis were reviewed in

(D’haeseleer, 2000) and (Tibshirani et al., 1999). Because different algorithms may be

applicable to different datasets, in (Yeung et al., 2001) a data-driven method to evaluate these

algorithms was proposed.

2.2 Finding relationships between genes and function

Some methods can be used to find the relationship between gene expression and other

information, i.e. properties of genes and samples. These properties can be the type of hybridization sample, the experimental condition, or the biological process that the genes are involved

in. These are basically supervised methods for classification and regression. The methods try

to construct learning models that could represent the relationship when given gene expression

data as input and other information as output.

Machine learning classification methods have been applied to gene expression analysis in

recent years. These methods usually employ class labels to represent different groups of

expression data. An important application is cancer tissue classification, i.e. to construct

learning model to predict whether a tissue is cancerous or to predict the type of cancer using

gene expression. Cancer tissue classification is crucial for the diagnosis of patients. It used to

be based on morphological appearances, which are often hard to measure and differentiate, so the classification results are very subjective. With the emergence of microarray technology, the

classification is improved greatly by going to the molecular level. The various machine

learning methods applied for classification are briefly described in the following paragraphs.

Neural networks are learning models that are based on the structure and behavior of neurons

in the human brain and can be trained to recognize and categorize complex patterns (Bishop,

1995). Khan et al. (2001) used neural networks to classify cancer tissues using gene

expression data as input. In their experiments, PCA was used to select a set of candidate

genes. A number of neural networks were then trained on the training dataset. The prediction

on test samples was achieved by summarizing all the outputs of the trained neural networks.

Support Vector Machine (SVM), which is rooted in statistical learning theory, is another

method that can also be used to perform classification. It can achieve good generalization

performance by minimizing both the training error and a generalization criterion that depends

on Vapnik-Chervonenkis (VC) dimension (Vapnik, 1998). In (Brown et al., 2000), SVM was

applied to classify yeast genes according to the biological processes they are involved in, as represented by their expression data. In (Furey et al., 2000), it was also used to classify

cancer tissues.

Decision tree learning generates a tree structure consisting of leaves and decision nodes. Each leaf

indicates a class, and each decision node specifies some test to be carried out on a single

attribute value, with one branch and subtree for each possible outcome of the test (Quinlan,

1993). C4.5 is a well-known decision tree induction algorithm, which uses information gain

measure. Cai et al. (2000) used it to classify cancer tissue samples. A comparison was done

with SVM, which was found to outperform C4.5.

Naïve Bayes Classification is a statistical discrimination method based on Bayes rule. In

(Keller et al., 2000), this method was used to classify cancer tissues. The algorithm is simple

and can be easily extended from two-class to multi-class classification. The method assumes a Gaussian distribution of the data and independence of the features given the class.

A related technique is Bayesian Networks, which is also based on Bayesian rule. It is a

probabilistic graphical model that represents the unique joint probability distribution of

random variables efficiently. Nodes of a Bayesian network could correspond to genes and

class labels, and represent the probability of the class label given some gene expression

levels. Hwang et al. (2001) used Bayesian Networks to classify acute leukemia samples. A

simple Bayesian Network with four gene nodes and one class label node was constructed

from gene expression data. The high prediction performance indicated that the constructed

network model can correctly represent the causal relationships of certain genes that are

relevant to the classification.

Radial Basis Function (RBF) networks are a type of neural network whose hidden neurons

contain RBFs, a statistical transformation based on a Gaussian distribution, and whose output

neuron computes a linear combination of its inputs. Hwang et al. (2001) also used an RBF

network to classify acute leukemia samples. The network was larger than the constructed

Bayesian network, but test results showed the prediction accuracy of RBF networks was

higher.

Besides classification, feature selection, i.e. the process of selecting genes that are most relevant to the class labels, is also an important task for gene expression analysis. In (Slonim

et al., 2000), a statistical method involving mean and variance was used to reflect the

relevance between individual genes and class labels. In their work, the acute leukemia

samples were divided into two groups according to their class labels. Those genes whose

expression values had small variance in both groups and big mean difference between the two

groups were selected. In (Keller et al., 2000), a gene selection method based on likelihood was proposed. It outperformed the Baseline method in (Slonim et al., 2000) on the same cancer dataset by choosing fewer genes while achieving similar

classification performance. In (Li, 2002), the linear relationship between the logarithm of

measurement of classification ability of genes and the logarithm of rank of classification

ability of genes was found to obey Zipf’s law (Zipf, 1965). Plots of this relationship provided

a useful tool for estimating the number of genes necessary for classification. Guyon et al. (2002) proposed a Recursive Feature Elimination method based on Support Vector

Machine. It made use of the magnitude of weights of trained SVMs as indicators of the

discrimination ability of the genes. The algorithm keeps eliminating the genes that have

relatively small contributions to the classification. In tests of the method on a leukemia dataset, small sets of genes with high discrimination ability were obtained.
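A minimal sketch of the recursive feature elimination idea is given below, using a linear SVM from scikit-learn. The one-gene-at-a-time elimination schedule, the parameter settings, and the toy data are illustrative choices, not necessarily those of Guyon et al. (2002).

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep=4):
    """Repeatedly train a linear SVM and drop the feature (gene) whose
    squared weight is smallest, until n_keep features remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = SVC(kernel='linear', C=1.0)
        clf.fit(X[:, remaining], y)
        weights = clf.coef_.ravel() ** 2          # contribution of each remaining gene
        worst = int(np.argmin(weights))           # least useful remaining gene
        del remaining[worst]
    return remaining

# Toy data: 40 samples, 20 genes; the labels depend on genes 0 and 1 only
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(svm_rfe(X, y))
```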

Neural trees represent multilayer feed-forward neural networks as tree structures. They have

heterogeneous neuron types in a single network, and their connectivity is irregular and sparse

(Zhang et al., 1997). Compared with the conventional neural networks, neural trees are more

flexible. They can represent more complex relationship and permit structural learning and

feature selection. Evolutionary algorithms can be used to construct neural trees. Hwang et al.

(2001) constructed neural trees using gene expression, and selected relevant genes according

to the connections in the trees. Neural trees were found to have better classification

performance than the other two methods in the paper, i.e. Radial Basis Function networks and Bayesian networks. Genes with significant contributions to the classification could also be

found from the constructed neural trees.



2.3 Dynamic modeling and time series analysis

It is also important to study significant patterns and to infer the dynamic model of gene

expression from hybridization samples collected at different experimental time points. The

dynamics can provide clues to the role of genes in the biological processes. Some of these

methods have been successfully applied to discrete signal analysis.

Filkov et al. (2001) proposed a set of analysis methods that are suitable for short-term

discrete time series data. These methods include: period detection, phase detection,

correlation significance of short sequences of different lengths, and edge detection of groups of regulatory genes. The prediction analysis on yeast microarray data (Spellman et al., 1998) showed that the amount of data is not sufficient for large regulatory pathway inference.

Singular Value Decomposition (SVD) constructs characteristic models of gene expression.

These characteristic models can also be used to construct dynamic models of gene expression

by deducing time translational matrices (Alter et al., 2000; Dewey and Bhan 2001; Holter et

al., 2001). In (Dewey and Bhan, 2001), the change of expression level in a gene was modeled as a first-order Markov process. The time translational coefficient matrix was computed using

least squares method based on a combination of SVD and linear response theory. The

network model inferred from the matrix provided a way to cluster genes using their function.

The clusters derived by applying the method to yeast time series expression data were in

agreement with previously reported experimental work.

The dynamics of gene expression could also be modeled as differential equations. In (Chen et

al., 1999) a linear transcription model is proposed, and two methods, Minimum Weight

Solutions for Linear Equations and Fourier Transform for Stable Systems, are proposed for

constructing the model.

Shannon entropy can be used as a measure of the information content or complexity of a

measurement series. It indicates the amount of information contained in a gene's expression pattern over time, or across anatomical regions, and therefore reveals the amount of

information carried by the gene during a disease process or during normal phenotypic change.

Shannon entropy is used by Fuhrman et al. (2000) to identify the most likely drug target

candidate genes from temporal gene expression patterns.

2.4 Reverse engineering and gene network inference

Gene network inference attempts to construct and study coarse-scale network models of

regulatory interactions between genes. It shows relationships between individual genes, and then provides a richer structure than clustering, which only reveals relationships between

groups of genes. Gene network inference requires inference of the causal relationships among

genes, i.e. reverse engineering the network architecture from its activity profiles. Reverse

engineering is generally an unsupervised system identification process, which involves the following issues: choosing the hybridization samples or expression data, choosing the network model, choosing the method to construct the model, and studying the structure and dynamics of the

model. The study of network dynamics often involves time series analysis techniques

mentioned in Section 2.3.

The simplest gene network model is the Boolean network proposed by Kauffman (1969). In a Boolean network model, each node is in one of two possible states: expressed or not expressed.

The actual state depends on the states of other nodes that are linked to it. A variety of

Boolean network construction algorithms have been developed. Somogyi et al. (1996)

employed a phylogenetic tree construction algorithm (Fitch and Margoliash, 1967) to create

and visualize the network. In (Liang et al., 1998), a more systematic and general algorithm

was developed using mutual information to identify a minimal set of inputs that uniquely

defines the output for each gene at next time step. Akutsu et al. (1999) improved Liang’s

algorithm to accept noisy expression data. Ideker et al. (2000) developed an alternative

algorithm, which introduced perturbations in the expression data to iteratively and

interactively refine the sensitivity and specificity of the constructed networks. At each of the

iterations, a set of networks was inferred according to the expression data from different

experimental perturbations. They were then discriminated using an entropy-based approach. The discriminations provide guidance for further experimental perturbation design. Samsonova and

Serov (1999) proposed interactive Java applet tools for visualization and analysis of the

Boolean network constructed. Maki et al. (2001) proposed a system that uses a top-down

approach for the inference of Boolean network. The inferred networks on the simulated

expression data matched the original ones well even when one of the genes was disrupted.
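To make the Boolean network model concrete, here is a minimal sketch of a synchronous state update; the three-gene network and its update rules are invented purely for illustration.

```python
# Each gene is ON (1) or OFF (0); its next state is a Boolean function
# of the current states of the genes that regulate it (toy network).
rules = {
    'A': lambda s: 1 - s['C'],            # A is repressed by C
    'B': lambda s: s['A'],                # B is activated by A
    'C': lambda s: s['A'] and s['B'],     # C requires both A and B
}

def step(state):
    """Synchronously update every gene according to its Boolean rule."""
    return {gene: int(rule(state)) for gene, rule in rules.items()}

state = {'A': 1, 'B': 0, 'C': 0}
for t in range(5):
    print(t, state)
    state = step(state)
```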

The advantage of the Boolean network is its low construction cost, but it has the disadvantage of being too coarse to represent the true regulatory relationships between genes. Linear modeling

tries to overcome this disadvantage by using weighted sum to represent the influence of other

genes on one particular gene. In the model, the overall relationship is then represented as a

matrix. Someren et al. (2000) used linear algebra methods to construct the model. Partial

Least Squares is a statistical method that is particularly useful for modeling a large number of variables each with few observations (Stone and Brooks, 1990). Datta (2001) applied it to Saccharomyces cerevisiae yeast microarray data to obtain a linear regression model

and then predicted expression level of a gene according to that of other genes. The result

appeared to be consistent with the known biological knowledge.

The dependency of the expression of one gene on the expression of other genes can also be modeled using nonlinear functions. The nonlinear approach gives the model more ability

to reveal the biological reality. However, it often introduces more difficulty in solving the

model at the same time. Maki et al. (2001) also modeled the gene interaction as an S-system

(Savageau, 1976). S-system is one of the best formalisms to estimate the complex gene

interaction mechanisms. The disadvantage of the S-system network is that the number of parameters to be estimated is very large compared with that of a Boolean network. To analyze large-scale networks, the S-system approach was combined with their Boolean approach.

Bayesian networks can also be used for gene network inference. The inference process

estimates the statistical confidence of dependencies between the activities of genes. Friedman

et al. (2000) used it to analyze Saccharomyces cerevisiae yeast microarray data from

(Spellman et al., 1998). Their order relation and Markov relation analysis showed that the

constructed Bayesian network had strong link to cell cycle regulated genes. Pe’er et al.

(2001) extended this framework by the following steps: adding new kinds of factors such as

mediator, activator, and inhibitor; enabling construction of subnets of strong confidence;

enabling handling of mutation; and employing better discretization on the data for

preprocessing. Their experiment on yeast microarray data showed that the constructed

significant subnets could reveal biological pathways.

Several works have been conducted on the study of the dynamics of constructed network models. Huang (1999) used a Boolean network to interpret gene activity profiles as entities related to the dynamics of both the regulatory network and functional cellular states. In this

approach, the dynamics were mapped into state space and the system properties of the network

like stability, trajectories and attractors were studied.

These dynamics could be modeled more precisely as a set of differential equations. Neural

network is one of the methods that could effectively solve these equations. Both Vohradský

(2001) and D’haeseleer (2000) modeled gene network as recurrent neural networks.

Vohradský (2001) used recurrent back-propagation (Pineda 1987), and simulated annealing

to construct the network, whereas D’haeseleer (2000) tried back-propagation through time

(Werbos, 1990) in the training process, with techniques such as weight decay and weight

elimination (Weigend et al., 1991) applied to simplify the model. Compared with simulated

annealing, back-propagation is a more effective training method, but its scalability is worse

because it attempts to unfold the temporal operation of the network into a layered feed-

forward network. When doing their experiments, neither Vohradský nor D’haeseleer had a microarray dataset that was large enough for network construction. Instead, they used artificial data in their experiments. The trained networks appeared to match the original ones.

Szallasi (1999) illustrated some basic properties of gene networks that could affect modeling.

They include the stochastic nature, effective size, compartmentalization, and information

content of expression matrix. In (Wessels et al., 2001; Someren et al., 2001), different

network models were categorized and compared under criteria like inferential power,

predictive power, robustness, consistency, stability, and computational cost.



2.5 Overview of the field

Clustering based on gene expression reflects the correlation of genes. Classification links the

expression of genes to functions. To study the change of expression of genes through time,

dynamic modeling and time series analysis methods have been used. In order to obtain the causal relationships or regulation of genes globally, gene networks need to be inferred from the expression data. The inference work is a reverse engineering process

(D’haeseleer, 2000). Reverse engineering is one of the major focuses of systems biology at

present. The field of systems biology studies biology at the system level by examining the

structure and dynamics of cellular and organismal function (Kitano, 2002). When the field of

systems biology advances to the stage of trying to unify the biological knowledge across

different levels of living organisms, we expect the understanding of the inherent complexity

of living organisms will become a central issue (Michigan, 1999).



3 Machine learning methods used

Our work focuses on classification and feature selection methods for global gene expression

analysis. In Section 3.1, the classification problems are described. Section 3.2 is about feature

integration and univariate feature selection methods. Section 3.3 describes multivariate

feature selection methods and classification methods.

3.1 Description of problem

In this section, we present the gene expression data and class information in a mathematical

form, and illustrate two types of classification problems that are commonly encountered in

microarray data analysis.

3.1.1 Structure of microarray data

An expression matrix can be generated when quantified expression values of different

hybridizations are available. Suppose there are $m$ genes and $n$ hybridizations. The expression matrix $A$ is the $m \times n$ matrix

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \\ \vdots \\ \mathbf{a}_m \end{bmatrix}, \qquad \text{Eq. 3.1.1.1}$$

where $a_{ij}$ represents the expression value of the $i$-th gene in the $j$-th hybridization.

A certain property of the genes or of the hybridization samples needs to be defined, i.e. labeling is required, in order to find the relationship between the genes and their expression matrix. The genes' property is represented by an $m \times 1$ vector

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}, \qquad \text{Eq. 3.1.1.2}$$

where each element represents one possible value of this property. For example, $x_i = +1$ means the $i$-th gene belongs to some biological process, while $x_i = -1$ means the $i$-th gene does not belong to this process. Similarly, the property of the hybridization samples is defined as an $n \times 1$ vector

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad \text{Eq. 3.1.1.3}$$

where each element represents one possible value of this property. For example, $y_i = +1$ means the $i$-th hybridization sample is cancerous, while $y_i = -1$ means the $i$-th hybridization sample is non-cancerous.

3.1.2 Two types of classification problems

From a microarray experiment, we can only obtain a very limited number of hybridizations, which involve a large number of genes. That is to say, $n$ is usually no more than a hundred, but $m$ can be a few thousand. So there are basically two types of classification problems:

• First type: large number of samples with low dimension. When the relationship between genes' expression and their property (function) is studied, the classification problem takes $A$ as the input of the learning model and $\mathbf{x}$ as the output. There are $n$ features, each corresponding to one hybridization, and $m$ samples, each corresponding to one gene.



• Second type: small number of samples with a large number of features (high dimension). When the relationship between the expression of all genes under consideration and a certain property of the hybridization samples is studied, the classification problem takes the transpose $B = A^{T}$, whose rows are $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n$, as the input of the learning model and $\mathbf{y}$ as the output. There are $m$ features, each corresponding to one gene, and $n$ samples, each corresponding to one hybridization.
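A small numpy sketch of the two formulations (with invented toy sizes) may help fix the shapes involved:

```python
import numpy as np

m, n = 2000, 40                      # m genes, n hybridizations (toy sizes)
rng = np.random.default_rng(3)
A = rng.normal(size=(m, n))          # expression matrix, genes x hybridizations

# First type: classify genes. Samples are the m rows of A, labels are x (length m).
x = rng.choice([-1, 1], size=m)
print(A.shape, x.shape)              # (2000, 40) (2000,)

# Second type: classify hybridizations. Samples are the n rows of B = A^T,
# labels are y (length n); note n << m, so the problem is high-dimensional.
B = A.T
y = rng.choice([-1, 1], size=n)
print(B.shape, y.shape)              # (40, 2000) (40,)
```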

In this thesis, our work is focused mainly on solving classification problems of the second type, because this type of problem has a distinct nature from ordinary classification problems, and many cancer tissue classification problems based on gene expression are of this kind. We have also obtained a newly released Zebra fish developmental microarray dataset,

which can be used to form classification problems of the first type. Because we are able to

validate our prediction results using more precise biological experiments with the help of

researchers who generated this dataset, we have applied Support Vector Machine

classification method to this dataset as well.

3.2 Feature integration and univariate feature selection methods

Preprocessing is needed for the second type of classification problem (large number of

features). The main goal of preprocessing is to reduce the number of inputs for a learning

model without much loss of, or even with some improvement in, classification accuracy. Two

kinds of methods can be used for the reduction: feature selection and feature integration (Liu

and Motoda, 1998). Feature selection selects a subset of features as classifier input, while

feature integration generates a new feature set from original features as input. The feature

integration method used in this thesis is Principal Component Analysis (PCA) (Muirhead,

1982); the two univariate feature selection methods used in our research are Information Gain

(Quinlan, 1993) and Likelihood method (Keller et al., 2000). Due to space limitation, only

Information gain and Likelihood method will be described in this section. Here univariate

means the selection method only takes the contribution of individual features to the

classification into consideration. The multivariate feature selection method will be described

in Section 3.3, because most of them are based on classification methods. The term

multivariate means the selection method accounts for the combinatorial effect of the features

on the classification.

3.2.1 Information gain

Information gain method can be used to rank the individual discrimination ability of the

features. It comes from information theory. In this method, information amount is measured

by entropy. Let $S$ denote the set of all samples, with $|S| = n$. Let $k$ denote the number of classes, and let $C_i$, $i = 1, \ldots, k$, denote the sets of samples belonging to each class, with $\bigcup_{i=1}^{k} C_i = S$ and $C_i \cap C_j = \emptyset$ for all $1 \le i < j \le k$. Suppose we select one sample from $S$ and label it as a member of class $C_i$. This message has probability $\frac{|C_i|}{|S|}$ of being correct, so the information it conveys is $-\log_2\!\left(\frac{|C_i \cap S|}{|S|}\right)$. The expected information of such a message is

$$\mathrm{info}(S) = -\sum_{i=1}^{k} \frac{|C_i \cap S|}{|S|} \times \log_2\!\left(\frac{|C_i \cap S|}{|S|}\right). \qquad \text{Eq. 3.2.1.1}$$

Similarly, the expected information amount of any subset of $S$ can also be determined. If $S$ is partitioned into $l$ subsets $T_i$, $i = 1, \ldots, l$, with $\bigcup_{i=1}^{l} T_i = S$ and $T_i \cap T_j = \emptyset$ for all $1 \le i < j \le l$, then the expected information requirement is

$$\mathrm{info}_X(S) = \sum_{i=1}^{l} \frac{|T_i|}{|S|} \times \mathrm{info}(T_i). \qquad \text{Eq. 3.2.1.2}$$

The difference

$$\mathrm{gain} = \mathrm{info}(S) - \mathrm{info}_X(S) \qquad \text{Eq. 3.2.1.3}$$

represents the amount of information gained from this partitioning.

Information gain tends to be greater when $l$ becomes larger, which may not truly reflect the quality of the partition. So it needs to be normalized by taking split information into consideration. Split information is defined as

$$\mathrm{split\ info} = -\sum_{i=1}^{l} \frac{|T_i|}{|S|} \times \log_2\!\left(\frac{|T_i|}{|S|}\right). \qquad \text{Eq. 3.2.1.4}$$

The normalized gain is thereby defined as

$$\mathrm{gain\ ratio} = \frac{\mathrm{gain}}{\mathrm{split\ info}}. \qquad \text{Eq. 3.2.1.5}$$

Given sample expression values and their class labels, each feature’s information gain ratio

can be calculated as follows: sort the samples according to the expression values of this feature, partition them at every possible split point, calculate the gain ratio of each split, and choose the maximum one

as this feature’s information gain ratio. This ratio provides a measure to evaluate the

classification ability of one feature. Feature selection is done by choosing features that have

the highest gain ratios.
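A minimal numpy sketch of this procedure for a single continuous feature, split into two subsets at a threshold, is shown below; it is an illustrative implementation, not the exact C4.5 code.

```python
import numpy as np

def entropy(labels):
    """info(S): expected information of the class distribution in a set."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(values, labels):
    """Best normalized information gain over all binary splits of one feature."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    base = entropy(labels)
    best = 0.0
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                       # identical values give no split point
        left, right = labels[:i], labels[i:]
        p_left = i / len(values)
        info_x = p_left * entropy(left) + (1 - p_left) * entropy(right)
        split_info = -(p_left * np.log2(p_left)
                       + (1 - p_left) * np.log2(1 - p_left))
        best = max(best, (base - info_x) / split_info)
    return best

# Example: one gene's expression values and the class labels of 8 samples
print(gain_ratio([0.1, 0.3, 0.2, 0.9, 1.1, 1.0, 0.8, 0.4],
                 [-1, -1, -1, 1, 1, 1, 1, -1]))
```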



3.2.2 Likelihood method (LIK)

Keller et al. (2000) proposed the Maximum Likelihood gene selection (LIK) method. Denote the events that a sample belongs to class $a$ or class $b$ by $M_a$ and $M_b$, respectively. The difference in the log likelihood is used to rank the usefulness of gene $g$ for distinguishing the samples of one class from the other. The LIK score is computed as follows:

$$\mathrm{LIK}^{g}_{a \to b} = \log\!\left(P(M_a \mid x^{g}_{a,1}, \ldots, x^{g}_{a,n_a})\right) - \log\!\left(P(M_b \mid x^{g}_{a,1}, \ldots, x^{g}_{a,n_a})\right) \qquad \text{Eq. 3.2.2.1}$$

and

$$\mathrm{LIK}^{g}_{b \to a} = \log\!\left(P(M_b \mid x^{g}_{b,1}, \ldots, x^{g}_{b,n_b})\right) - \log\!\left(P(M_a \mid x^{g}_{b,1}, \ldots, x^{g}_{b,n_b})\right) \qquad \text{Eq. 3.2.2.2}$$

where $P(M_i \mid x^{g}_{j,1}, \ldots, x^{g}_{j,n_j})$ is the a posteriori probability that $M_i$ is true given the expression values of the $g$-th gene of all the training samples that belong to class $j$, and $n_j$ is the number of training samples that belong to class $j$. Bayes rule

$$P(M \mid X)\, P(X) = P(X \mid M)\, P(M) \qquad \text{Eq. 3.2.2.3}$$

is used, with three assumptions required by the method. The first is the assumption of equal prior probabilities of the classes,

$$P(M_a) = P(M_b), \qquad \text{Eq. 3.2.2.4}$$

and the second is the assumption that the conditional probability of $X$ falling within a small non-zero interval centered at $x$ given $M$ can be modeled by a normal distribution

$$P(x \mid M) = \frac{1}{\delta \sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\delta^2}}, \qquad \text{Eq. 3.2.2.5}$$

where $\mu$ and $\delta$ are the mean and standard deviation of $X$, respectively. The values $\mu$ and $\delta$ can be estimated from the training data. With the third assumption that the distributions of the expression values of the genes are independent, we obtain the LIK ranking of class $a$ over class $b$ for the $g$-th gene as follows:



$$\begin{aligned} \mathrm{LIK}^{g}_{a \to b} &= \log\!\left(P(x^{g}_{a,1}, \ldots, x^{g}_{a,n_a} \mid M_a)\right) - \log\!\left(P(x^{g}_{a,1}, \ldots, x^{g}_{a,n_a} \mid M_b)\right) \\ &= \log\!\left(\prod_{i=1}^{n_a} P(x^{g}_{a,i} \mid M_a)\right) - \log\!\left(\prod_{i=1}^{n_a} P(x^{g}_{a,i} \mid M_b)\right) \\ &= \sum_{i=1}^{n_a}\left[ -\log(\delta^{g}_{a}) - \frac{(x^{g}_{a,i} - \mu^{g}_{a})^2}{2(\delta^{g}_{a})^2} + \log(\delta^{g}_{b}) + \frac{(x^{g}_{a,i} - \mu^{g}_{b})^2}{2(\delta^{g}_{b})^2} \right], \end{aligned} \qquad \text{Eq. 3.2.2.6}$$

and similarly, the LIK ranking of class $b$ over class $a$ for this gene is

$$\begin{aligned} \mathrm{LIK}^{g}_{b \to a} &= \log\!\left(P(x^{g}_{b,1}, \ldots, x^{g}_{b,n_b} \mid M_b)\right) - \log\!\left(P(x^{g}_{b,1}, \ldots, x^{g}_{b,n_b} \mid M_a)\right) \\ &= \sum_{i=1}^{n_b}\left[ -\log(\delta^{g}_{b}) - \frac{(x^{g}_{b,i} - \mu^{g}_{b})^2}{2(\delta^{g}_{b})^2} + \log(\delta^{g}_{a}) + \frac{(x^{g}_{b,i} - \mu^{g}_{a})^2}{2(\delta^{g}_{a})^2} \right]. \end{aligned} \qquad \text{Eq. 3.2.2.7}$$

Genes that have higher likelihood scores are expected to have better ability to distinguish one class from the other.
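The following numpy sketch is a direct transcription of Eq. 3.2.2.6 and Eq. 3.2.2.7 for a single gene under the stated Gaussian assumptions; the toy data and the function name are ours, and natural logarithms are used (the base does not affect the ranking).

```python
import numpy as np

def lik_a_to_b(xa, xb):
    """LIK score of Eq. 3.2.2.6 for one gene: how much better the class-a samples
    xa fit the class-a Gaussian than the class-b Gaussian."""
    xa, xb = np.asarray(xa, float), np.asarray(xb, float)
    mu_a, sd_a = xa.mean(), xa.std()      # mean and standard deviation estimates
    mu_b, sd_b = xb.mean(), xb.std()
    return np.sum(-np.log(sd_a) - (xa - mu_a) ** 2 / (2 * sd_a ** 2)
                  + np.log(sd_b) + (xa - mu_b) ** 2 / (2 * sd_b ** 2))

# Toy expression values of one gene in the class-a and class-b training samples
class_a = [5.1, 4.8, 5.3, 5.0, 4.9]
class_b = [2.0, 2.4, 1.8, 2.2]
print(lik_a_to_b(class_a, class_b))   # Eq. 3.2.2.6: large score, gene separates a from b
print(lik_a_to_b(class_b, class_a))   # Eq. 3.2.2.7: same formula with the classes swapped
```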

3.3 Classification and multivariate feature selection methods

Classification is also called pattern recognition. It is the process of assigning an input pattern to one of a prescribed number of classes (categories) (Haykin, 1999). Training of a classifier

is usually needed to establish the learning model that could reflect the relationship between

input patterns and class labels. This thesis employs four classification methods. They are

decision tree (Quinlan, 1993), neural networks (Haykin, 1999), support vector machines

(Vapnik, 1998) and Bayesian classification (Keller et al., 2000). Due to space limitations,

only Neural Network, Support Vector Machines and Bayesian classification will be

described. The boosting technique, which combines the outputs of multiple classifiers to form a more accurate hypothesis, will also be illustrated in this section. Three multivariate feature

selection methods: neural network feature selector, recursive feature elimination method and

multivariate likelihood method are also described in this section.



3.3.1 Neural Networks

A neural network is a massively parallel distributed processor made up of simple processing

units, which has a natural propensity for storing experiential knowledge and making it available for future use. Similar to the human brain, it acquires knowledge from the environment through a learning process, and its interneuron connection strengths store the knowledge (Haykin, 1999; Aleksander and Morton, 1990). Neural networks as an analysis method have advantages such as nonlinearity, input-output mapping, adaptivity, evidential response,

contextual information, and fault tolerance.

Three-layer feed-forward neural network is a typical neural network architecture. It has a

simple hierarchical structure with high synaptic connections. A three-layer neural network

model consists of an input layer, a hidden layer and an output layer. Input neurons in the

input layer receive input signals, and send them to neurons in the hidden layer. Hidden

neurons are computational units. They combine their inputs as their local fields, and send

their output to neurons at the output layer. Output neurons then perform a similar

combination and their output is the output of the whole model. Each of the synaptic links in

the model has a weight associated with it, which is used to amplify or attenuate the signal passing through it.

Computationally, this model can be described as follows. Suppose there are $k_0$ input neurons, $k_1$ hidden neurons and $k_2$ output neurons. The synaptic links between the input and hidden layers can be represented as a $k_1\times(k_0+1)$ matrix $\mathbf{w}^{(1)}$, including biases, and the links between the hidden and output layers can similarly be represented as a $k_2\times(k_1+1)$ matrix $\mathbf{w}^{(2)}$, also including biases. Let the vector $\mathbf{x} = [+1, x_1, \ldots, x_{k_0}]^T$ be the input of the network. The neurons in the hidden layer first sum up their inputs together with the bias associated with the first element of $\mathbf{x}$,

$$\mathbf{v}^{(1)} = \mathbf{w}^{(1)}\mathbf{x}, \qquad \text{Eq. 3.3.1.1}$$

and perform a transformation using the activation function $\varphi^{(1)}$ to get their output

$$\mathbf{y}^{(1)} = \begin{bmatrix} +1 \\ \varphi^{(1)}(\mathbf{v}^{(1)}) \end{bmatrix} = [+1,\; y^{(1)}_1, \ldots, y^{(1)}_{k_1}]^T. \qquad \text{Eq. 3.3.1.2}$$

Here we add an additional element $+1$ as the first element of $\mathbf{y}^{(1)}$ in order to handle the biases of the output neurons. Similarly, the neurons in the output layer perform the summation

$$\mathbf{v}^{(2)} = \mathbf{w}^{(2)}\mathbf{y}^{(1)} \qquad \text{Eq. 3.3.1.3}$$

and a transformation using the activation function $\varphi^{(2)}$ to get the output

$$\mathbf{y}^{(2)} = \varphi^{(2)}(\mathbf{v}^{(2)}). \qquad \text{Eq. 3.3.1.4}$$

Here the activation function $\varphi$ usually takes one of the following forms:

• Threshold function

$$y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge 0\\ 0 & \text{if } v < 0 \end{cases} \qquad \text{Eq. 3.3.1.5}$$

• Piecewise-linear function

$$y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge a\\ v & \text{if } a \ge v > -a\\ 0 & \text{if } v < -a \end{cases}, \quad a > 0 \qquad \text{Eq. 3.3.1.6}$$

• Sigmoid function (logistic function)

$$y = \varphi(v) = \frac{1}{1+e^{-av}}, \quad a > 0 \qquad \text{Eq. 3.3.1.7}$$

• Hyperbolic tangent function

$$y = \varphi(v) = a\tanh(bv) = a\,\frac{e^{bv}-e^{-bv}}{e^{bv}+e^{-bv}}, \quad a > 0,\ b > 0 \qquad \text{Eq. 3.3.1.8}$$

• Hyperbolic tangent sigmoid function

$$y = \varphi(v) = \frac{2}{1+e^{-2v}} - 1 \qquad \text{Eq. 3.3.1.9}$$

Here $v$ and $y$ are scalars corresponding to the elements of the vectors.
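As a concrete illustration of Eq. 3.3.1.1 to Eq. 3.3.1.4, the following Python/NumPy sketch (illustrative only; weight shapes follow the $k_1\times(k_0+1)$ and $k_2\times(k_1+1)$ convention above, with biases in the first column) computes the forward pass of a three-layer network with a logistic activation.

import numpy as np

def logistic(v, a=1.0):
    # sigmoid activation, Eq. 3.3.1.7
    return 1.0 / (1.0 + np.exp(-a * v))

def forward(x, W1, W2):
    """Three-layer feed-forward pass.
    x : (k0,) input vector; W1 : (k1, k0+1); W2 : (k2, k1+1), biases in column 0."""
    x_ext = np.concatenate(([1.0], x))           # prepend +1 to handle hidden biases
    v1 = W1 @ x_ext                              # Eq. 3.3.1.1
    y1 = np.concatenate(([1.0], logistic(v1)))   # Eq. 3.3.1.2, +1 for output biases
    v2 = W2 @ y1                                 # Eq. 3.3.1.3
    return logistic(v2), v1, y1, v2              # Eq. 3.3.1.4

# Example: 4 inputs, 3 hidden neurons, 1 output
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(1, 4))
y2, _, _, _ = forward(rng.normal(size=4), W1, W2)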

A neural network provides a mapping from its input to its output, and this mapping is

determined by the network’s structure and synaptic weights. With a proper mapping

established, given certain inputs, the network can produce the desired outputs. This

mechanism could be used for classification. Training is needed to obtain a neural network

that can give correct predictions. The training is done using an algorithm which usually

consists of two steps: generate initial weights, and then iteratively refine these weights until

certain stopping criterion is met. In essence, these algorithms are optimization algorithms,

which attempt to meet certain criteria when refining the weights. These criteria are measured

in terms of error functions because they reflect the difference between neural network outputs

and the desired outputs.

Among many training algorithms, back-propagation is most popular (Hertz et al., 1991).

There are two types of training in back-propagation: sequential mode and batch mode. In

sequential mode, the algorithm calculates training error and updates weights each time it

receives a training sample. On the other hand, in batch mode, the algorithm calculates overall

error of all the training samples, and then updates the network’s weights. The sequential

mode is computationally slower than the batch mode, but the order in which training samples are presented to the training algorithm can be randomized, so the stochastic nature of the samples can be modeled. Below is a description of the back-propagation algorithm in sequential mode; its batch mode version can easily be adapted from it.

Suppose the errors at the output neurons are defined as

$$\mathbf{e} = \mathbf{d} - \mathbf{y}^{(2)} = [e_1, \ldots, e_{k_2}]^T. \qquad \text{Eq. 3.3.1.10}$$

Back-propagation uses

$$E = \mathbf{e}^T\mathbf{e} \qquad \text{Eq. 3.3.1.11}$$

as the instantaneous error function of the network. The objective of back-propagation training

is to minimize the average instantaneous error function to a certain extent, given all training

samples.

Given a sample vector $\mathbf{x}$ with its class label vector $\mathbf{d}$, $E$ can be computed using Eq. 3.3.1.1 to Eq. 3.3.1.4 and Eq. 3.3.1.10 to Eq. 3.3.1.11. To meet the objective, the corrections of the weights, $\Delta\mathbf{w}^{(1)}$ and $\Delta\mathbf{w}^{(2)}$, should be proportional to the partial derivatives $\partial E/\partial\mathbf{w}^{(1)}$ and $\partial E/\partial\mathbf{w}^{(2)}$ respectively. According to the chain rule, we have

$$\Delta\mathbf{w}^{(2)} = -\eta\,\frac{\partial E}{\partial\mathbf{w}^{(2)}} = -\eta\,\frac{\partial E}{\partial\mathbf{e}}\,\frac{\partial\mathbf{e}}{\partial\mathbf{y}^{(2)}}\,\frac{\partial\mathbf{y}^{(2)}}{\partial\mathbf{v}^{(2)}}\,\frac{\partial\mathbf{v}^{(2)}}{\partial\mathbf{w}^{(2)}} = -\eta\times\mathbf{e}\times(-1)\otimes\varphi^{(2)\prime}(\mathbf{v}^{(2)})\times\mathbf{y}^{(1)T} = \eta\,(\mathbf{e}\otimes\varphi^{(2)\prime}(\mathbf{v}^{(2)}))\times\mathbf{y}^{(1)T} = \eta\,\boldsymbol{\delta}^{(2)}\times\mathbf{y}^{(1)T}, \qquad \text{Eq. 3.3.1.12}$$

here we let $\boldsymbol{\delta}^{(2)} = \mathbf{e}\otimes\varphi^{(2)\prime}(\mathbf{v}^{(2)})$; and similarly,

$$\Delta\mathbf{w}^{(1)} = -\eta\,\frac{\partial E}{\partial\mathbf{w}^{(1)}} = -\eta\,\frac{\partial E}{\partial\mathbf{y}^{(1)}}\,\frac{\partial\mathbf{y}^{(1)}}{\partial\mathbf{w}^{(1)}} = \eta\,\big((\mathbf{w}^{(2)T}(\mathbf{e}\otimes\varphi^{(2)\prime}(\mathbf{v}^{(2)})))\otimes\varphi^{(1)\prime}(\mathbf{v}^{(1)})\big)\times\mathbf{x}^T = \eta\,\big((\mathbf{w}^{(2)T}\boldsymbol{\delta}^{(2)})\otimes\varphi^{(1)\prime}(\mathbf{v}^{(1)})\big)\times\mathbf{x}^T = \eta\,\boldsymbol{\delta}^{(1)}\times\mathbf{x}^T, \qquad \text{Eq. 3.3.1.13}$$

here we let $\boldsymbol{\delta}^{(1)} = (\mathbf{w}^{(2)T}\boldsymbol{\delta}^{(2)})\otimes\varphi^{(1)\prime}(\mathbf{v}^{(1)})$.

In Eq. 3.3.1.12 and Eq. 3.3.1.13, η is a positive learning rate parameter which controls the

amount of weight adjustment. The operator ⊗ is an element-by-element multiplication.

The activation functions $\varphi^{(1)}$ and $\varphi^{(2)}$ must be differentiable for back-propagation training. The bias elements need to be included or excluded in certain steps of the matrix multiplications in order to maintain consistency.
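The updates in Eq. 3.3.1.12 and Eq. 3.3.1.13 translate directly into code. The sketch below is a minimal, self-contained illustration (not the thesis implementation), assuming logistic activations in both layers, of one sequential-mode update for a single sample.

import numpy as np

def _logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, d, W1, W2, eta=0.1):
    """One sequential-mode back-propagation update with logistic activations.
    x : (k0,) sample, d : (k2,) desired output,
    W1 : (k1, k0+1), W2 : (k2, k1+1), biases in column 0."""
    x_ext = np.concatenate(([1.0], x))
    v1 = W1 @ x_ext                                 # Eq. 3.3.1.1
    y1 = np.concatenate(([1.0], _logistic(v1)))     # Eq. 3.3.1.2
    y2 = _logistic(W2 @ y1)                         # Eq. 3.3.1.3 - 3.3.1.4
    e = d - y2                                      # Eq. 3.3.1.10
    delta2 = e * y2 * (1.0 - y2)                    # Eq. 3.3.1.12 (logistic derivative)
    delta1 = (W2[:, 1:].T @ delta2) * y1[1:] * (1.0 - y1[1:])  # Eq. 3.3.1.13
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x_ext)
    return W1, W2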

The back-propagation algorithm could be easily extended to training multi-layer feed-forward

neural networks. The adjustment of a weight involves only the neuron signals of the two successive layers it connects, so the algorithm is a local method. This also makes it computationally efficient.

Feed-forward neural networks can be used for classification. For two class problems, a neural

network with a single output neuron is needed. Samples are labeled + 1 or − 1 . The label

value will be the desired values at the output side of the network when training the network.

For multi-class problems, a neural network that has the number of output neurons identical to

the number of classes is usually required. The desired output value in this case is a vector.

The elements of the vector are − 1 , except the one that corresponds to the class of the sample,

which is + 1 . If the training samples are biased, for example, the number of positive samples

is much less than that of the negative ones, label values other than ± 1 can be used to adjust

the feed-back signal to obtain better performance. Many gene expression classification

problems are two class problems.

3.3.2 Ensemble of neural networks (boosting)

One of the main disadvantages of neural networks for classification is that the training result also depends on the initial weights, which are generated randomly. Boosting can be used to enhance the robustness of the neural network. The term boosting refers to a machine learning framework that combines a set of simple decision rules, generated by a set of learners with different learning abilities, into a complex rule that has higher accuracy and lower variance. It is especially useful in handling real world problems that have the following properties (Freund and Schapire, 1996): the samples have various degrees of hardness to learn, and the learner is sensitive to changes in the training samples. Such variation in learning hardness often occurs when applying machine learning methods to biological problems. There are three boosting approaches (Haykin, 1999):

• Boosting by filtering: If the number of training samples is large, the samples are either

discarded or kept during training.

• Boosting by subsampling: With a fixed training sample set size, the probability to

include samples into training sample set for learning algorithms is adjusted.

• Boosting by reweighting: This approach assumes that the training samples could be

weighted by the learning algorithms. The training errors are calculated by making use

of these weights.

AdaBoost (Freund and Schapire, 1996) is a simple and effective boosting algorithm based on subsampling. During the learning process, the algorithm tries to make the learners focus on different portions of the training samples by refining the sampling distribution. Figure 3.1 shows one of the two versions of the AdaBoost training algorithm.

The input to the algorithm is a set of $n$ training samples $\{(\mathbf{b}_1, y_1),\ldots,(\mathbf{b}_n, y_n)\}$, where $\mathbf{b}_i$ $(i=1,\ldots,n)$ are the sample vectors and $y_i\in Y$ $(i=1,\ldots,n)$ are their associated class labels, which are also the desired outputs of the learning models; $Y$ is the set of class labels. The algorithm starts by setting the initial sampling distribution $d_{1,1\ldots n}$ to a uniform distribution. It then enters an iterative process. At the $t$-th iteration, the algorithm calls the function train_learner() with the training samples $\mathbf{b}_{1\ldots n}$ and sampling distribution $d_{t,1\ldots n}$ to get the trained learning model $M_t$. By calling the function get_hypothesis() with $M_t$ and $\mathbf{b}_{1\ldots n}$, the algorithm generates the hypothesized classes of the training samples. The algorithm then calculates the total error $\varepsilon_t$ by adding up the distribution weights of all wrongly predicted samples. If the error is bigger than $1/2$, the algorithm stops. Otherwise, it proceeds to calculate a factor $\beta_t$. This factor is used to reduce the portion of the correctly predicted training samples in the sampling distribution $d_{t+1,1\ldots n}$ of the next iteration. When the algorithm terminates, $T$ learning models $M_{1\ldots T}$ have been trained, with factors $\beta_{1\ldots T}$ indicating the contribution of each learning model to the combined hypothesis. The combined hypothesis for a test sample $\mathbf{b}$ can then be calculated as

$$h(\mathbf{b}) = \arg\max_{y\in Y}\;\sum_{t:\ \text{get\_hypothesis}(M_t,\mathbf{b})=y}\log\!\left(\frac{1}{\beta_t}\right). \qquad \text{Eq. 3.3.2.1}$$

When the number of training samples is small, $\beta_t$ could be zero. We set a lower threshold on $\beta_t$ when implementing the algorithm so that $h(\mathbf{b})$ can always be calculated.

function AdaBoost({(b1, y1), …, (bn, yn)})          // Input: training samples
    d1,i = 1/n   (i = 1…n)                          // Initial sampling distribution
    for t = 1 to T do
        Mt = train_learner(b1…n, dt,1…n)
        ht,1…n = get_hypothesis(Mt, b1…n)
        εt = Σ{i: ht,i ≠ yi} dt,i
        if εt > 1/2 then
            T = t − 1
            terminate loop                          // Terminate the algorithm
        end
        // Update sampling distribution to focus on the wrongly classified samples
        βt = εt / (1 − εt)
        d′t,i = βt dt,i  if ht,i = yi ;   d′t,i = dt,i  if ht,i ≠ yi    (i = 1…n)
        dt+1,i = d′t,i / Σ{i=1…n} d′t,i    (i = 1…n)
    end
    return (M1…T, β1…T)
end

Figure 3.1: AdaBoost algorithm


Theoretical study shows that if the hypotheses obtained by the individual classifiers consistently have errors slightly better than random guessing, the number of prediction errors of the final hypothesis $h$ drops to zero exponentially fast as $T$ increases (Freund and Schapire, 1996).

Three-layer neural networks can be the learning model integrated with AdaBoost. In this case, train_learner() consists of initializing, training and optimizing the neural network weights, and get_hypothesis() corresponds to the neural network's decision making. Running the algorithm will construct $T$ neural networks in total for making the combined decision.
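A compact Python sketch of the AdaBoost loop in Figure 3.1 is given below. It is an illustration only: the base learner is left abstract as a hypothetical train_learner/get_hypothesis pair (which could, for example, wrap a small neural network), labels are assumed to be ±1, and the $\beta_t$ floor mentioned above is included.

import numpy as np

def adaboost(B, y, train_learner, get_hypothesis, T=50, beta_floor=1e-10):
    """AdaBoost by subsampling weights, following Figure 3.1.
    B : (n, m) training samples, y : (n,) labels,
    train_learner(B, y, d) -> model, get_hypothesis(model, B) -> predictions."""
    n = B.shape[0]
    d = np.full(n, 1.0 / n)                  # initial sampling distribution
    models, betas = [], []
    for _ in range(T):
        model = train_learner(B, y, d)
        h = get_hypothesis(model, B)
        eps = d[h != y].sum()                # weighted training error
        if eps > 0.5:                        # stop if worse than random guessing
            break
        beta = max(eps / (1.0 - eps), beta_floor)
        d = np.where(h == y, beta * d, d)    # shrink weights of correctly classified samples
        d /= d.sum()                         # renormalize the distribution
        models.append(model)
        betas.append(beta)
    return models, betas

def adaboost_predict(b, models, betas, get_hypothesis, labels=(-1, +1)):
    # combined hypothesis of Eq. 3.3.2.1
    votes = {c: 0.0 for c in labels}
    for model, beta in zip(models, betas):
        votes[get_hypothesis(model, b.reshape(1, -1))[0]] += np.log(1.0 / beta)
    return max(votes, key=votes.get)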

3.3.3 Neural network feature selector

This section first analyses the limitation of applying the information gain measure to rank features with continuous values, and then describes the neural network feature selector method.

3.3.3.1 Disadvantage of information gain measure

Information gain can be applied to both discrete and continuous feature values. But in the continuous case, it only takes the order of the values into consideration, which may not be sufficient. For example, suppose the expression matrix is


$$B = \begin{bmatrix} 6 & 1 & 7\\ 7 & 2 & 7.5\\ 8 & 3 & 8\\ 9 & 4 & 8.5\\ 10 & 5 & 9\\ 11 & 16 & 12\\ 12 & 17 & 12.5\\ 13 & 18 & 13\\ 14 & 19 & 13.5\\ 15 & 20 & 14 \end{bmatrix}$$

and the corresponding class labels are $\mathbf{y} = [-1,-1,-1,-1,-1,+1,+1,+1,+1,+1]^T$. The three features' values are plotted in Figure 3.2 at lines $y=1$, $y=2$ and $y=3$. Intuitively we can see that the discrimination abilities of the features at lines $y=2$ and $y=3$ are better than that of the feature at line $y=1$: compared with the feature values at line $y=1$, the distance between the feature values of the two classes at line $y=2$ is larger, and the density of the feature values within each class at line $y=3$ is higher. But the information gain ratios of these three features are identical, because the measure only takes the order of the feature values into account, and hence all three features achieve the same maximum gain ratio when split in the middle.

Figure 3.2: Plot of feature values with class labels, 'o' for '-1' and '+' for '+1'
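The point can be verified numerically. The small Python sketch below (illustrative only, using the matrix $B$ and labels $\mathbf{y}$ above) computes the best-split information gain of each column and finds the same value, one bit, for all three features; the gain ratio behaves likewise.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_gain(x, y):
    """Maximum information gain of a continuous feature over all split points."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    base, n, best = entropy(y), len(y), 0.0
    for i in range(1, n):
        left, right = y[:i], y[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        best = max(best, gain)
    return best

B = np.array([[6, 1, 7], [7, 2, 7.5], [8, 3, 8], [9, 4, 8.5], [10, 5, 9],
              [11, 16, 12], [12, 17, 12.5], [13, 18, 13], [14, 19, 13.5], [15, 20, 14]])
y = np.array([-1] * 5 + [+1] * 5)
gains = [best_split_gain(B[:, j], y) for j in range(3)]   # all three gains equal 1.0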

3.3.3.2 Neural network feature selector

Since feature selection is a preprocessing process for classification, classification accuracy

can also be a selection criterion. The advantage of integrating a classifier into the feature

selection process is that the feature set is optimized by the classification accuracy. Moreover,

the training of the classifier and the selection of the features use the same bias. The

consistency improves the classification performance. However, the computational cost of the

integration may be high. In (Liu and Motoda, 1998) this approach is called the wrapper model.

This framework is shown in Figure 3.3.

Procedure wrapper accepts full feature set (F), training samples (D) and testing samples (T)

as input. It first generates a subset of features (S), and then performs cross validation using S

and D to get classification accuracy (A). These two steps are repeated until A is sufficiently

high under a certain criterion. A classification model (M) is then obtained from D with the selected S. Finally, M and S are used to perform the test that measures the performance. In the algorithm,

cross validation can be done by dividing D into a training set and a validating set; or by the

leave-one-out method (see more details below).

procedure wrapper (F, D, T)


do
S=feature_set_gen(F)
A=cross_validation(S,D)
until sufficient(A)
M=train(S,D)
test(M,S,T)
end

Figure 3.3: Wrapper model framework

Setiono and Liu (1997) proposed a neural network feature selection method based on the wrapper approach. In this thesis, we applied it to gene expression analysis with some modifications. The following is a detailed description of this method.

The algorithm starts with all features, and removes features that have minor contribution to

classification one by one. The algorithm first trains a neural network with a given feature set,

then disables each feature and estimates the classification performance of this neural network

with the remaining features. If the decrease of the estimated performance is within an

acceptable level, the algorithm constructs a neural network with the remaining features, and

calculates the actual classification performance. If the actual performance is also acceptable,

the feature is removed, and the algorithm continues searching for more features to be

removed. Otherwise, it keeps this feature and continues to test other features according to

their estimated performances. It is a greedy method, and usually achieves sub-optimal

solutions.

There are two ways to train and validate the neural network. Suppose there are n samples.

The first one is to separate these samples into two sets: training set and validating set. If n is

too small to produce sufficiently large training and validating sets, the leave-one-out

technique can be used to perform training and validation n times, and obtain the average

training and validation accuracy.

Neural network feature selector method requires the neural network training to force the

weights associated with an irrelevant input neuron to have small magnitude, in order to

reduce the effect of the corresponding feature’s removal on the classification performance.

This is implemented using weight decay on the weights between the input and the hidden

layer when applying the back-propagation training. After every element of the weights is

updated with ∆w (1) , according to Eq. 3.3.1.13,


$$w^{(1)\prime}_{i,j} = w^{(1)}_{i,j} + \Delta w^{(1)}_{i,j}, \qquad \text{Eq. 3.3.3.1}$$

the elements of the new weights are computed as follows:

$$w^{(1)\prime\prime}_{i,j} = \left(1 - \frac{\varepsilon_j\,\eta}{\left(1 + \sum_i\left(w^{(1)}_{i,j}\right)^2\right)^2}\right) w^{(1)\prime}_{i,j}, \qquad \text{Eq. 3.3.3.2}$$

where $\eta$ is the learning rate parameter and $\varepsilon_j$ is a penalty term associated with the $j$-th input neuron. Eq. 3.3.3.2 is similar to the weight decay method in (Hertz et al., 1991), but the focus is different: the latter can be used to eliminate hidden neurons.
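A minimal sketch of the decay step in Eq. 3.3.3.2, applied after the ordinary back-propagation update of Eq. 3.3.3.1, is shown below. It is an illustration only and assumes that column $j$ of W1 holds the weights leaving the $j$-th input neuron, with the bias stored in column 0.

import numpy as np

def decay_input_weights(W1, eps, eta):
    """Apply the per-input-neuron weight decay of Eq. 3.3.3.2.
    W1  : (k1, k0+1) input-to-hidden weights, bias in column 0
    eps : (k0,) penalty parameters, one per input neuron
    eta : learning rate used in the back-propagation update"""
    W1 = W1.copy()
    col_sq = np.sum(W1[:, 1:] ** 2, axis=0)            # sum_i (w_{i,j})^2 for each input j
    factor = 1.0 - (eps * eta) / (1.0 + col_sq) ** 2   # shrinkage factor per input neuron
    W1[:, 1:] *= factor                                # scale all weights of input j together
    return W1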



In order to improve convergence, we also tried the cross entropy error function instead of Eq. 3.3.1.11,

$$\mathrm{F}(\mathbf{w}^{(1)},\mathbf{w}^{(2)}) = -\left[\sum_{p=1}^{n}\sum_{k}^{k_2} d_k^p\log(y_k^{(2),p}) + (1-d_k^p)\log(1-y_k^{(2),p})\right], \qquad \text{Eq. 3.3.3.3}$$

where the desired value $d_k^p$ of the $p$-th sample for the $k$-th class is either $1$ or $0$, which is different from the one used in Eq. 3.3.1.11. The derivation of back-propagation with the error function in Eq. 3.3.3.3 is as follows:

$$\frac{\partial\mathrm{F}}{\partial w} = \sum_{p=1}^{n}\sum_{k}^{k_2}\frac{\partial\mathrm{F}}{\partial y_k^{(2),p}}\,\frac{\partial y_k^{(2),p}}{\partial w}, \qquad \text{Eq. 3.3.3.4}$$

where

$$\frac{\partial\mathrm{F}}{\partial y_k^{(2),p}} = -\left(\frac{d_k^p}{y_k^{(2),p}} - \frac{1-d_k^p}{1-y_k^{(2),p}}\right). \qquad \text{Eq. 3.3.3.5}$$

Because the three layer neural network model is

$$y_k^{(2),p} = \varphi^{(2)}\!\left(\sum_{j=1}^{k_1}\varphi^{(1)}\!\left(\sum_{i=1}^{k_0} x_i^p w_{ji}^{(1)}\right) w_{kj}^{(2)}\right), \qquad \text{Eq. 3.3.3.6}$$

we have

$$\frac{\partial y_k^{(2),p}}{\partial w_{kj}^{(2)}} = \varphi^{(2)\prime}(v_k^{(2),p})\cdot\varphi^{(1)}(v_j^{(1),p}) = \varphi^{(2)\prime}(v_k^{(2),p})\cdot y_j^{(1),p}, \qquad \text{Eq. 3.3.3.7}$$

and

$$\frac{\partial y_k^{(2),p}}{\partial w_{ji}^{(1)}} = \varphi^{(2)\prime}(v_k^{(2),p})\cdot\varphi^{(1)\prime}(v_j^{(1),p})\cdot x_i^p. \qquad \text{Eq. 3.3.3.8}$$

Again according to the delta rule (Haykin, 1999) and the gradient descent rule (Hertz et al., 1991), we obtain

$$\Delta w_{kj}^{(2)} = -\eta\,\frac{\partial\mathrm{F}}{\partial w_{kj}^{(2)}} = \eta\sum_{p=1}^{n}\left[\left(\frac{d_k^p}{y_k^{(2),p}} - \frac{1-d_k^p}{1-y_k^{(2),p}}\right)\varphi^{(2)\prime}(v_k^{(2),p})\cdot y_j^{(1),p}\right] \qquad \text{Eq. 3.3.3.9}$$

and

$$\Delta w_{ji}^{(1)} = -\eta\,\frac{\partial\mathrm{F}}{\partial w_{ji}^{(1)}} = \eta\sum_{p=1}^{n}\sum_{k}\left[\left(\frac{d_k^p}{y_k^{(2),p}} - \frac{1-d_k^p}{1-y_k^{(2),p}}\right)\varphi^{(2)\prime}(v_k^{(2),p})\cdot\varphi^{(1)\prime}(v_j^{(1),p})\cdot x_i^p\right]. \qquad \text{Eq. 3.3.3.10}$$

Because the implementation of the algorithm is in MATLAB, we transform the weight computation into matrix form. Let $\mathbf{F}'$ be an $n\times k_2$ matrix with elements $f_{ij} = \frac{d_{ij}}{y_j^{(2),i}} - \frac{1-d_{ij}}{1-y_j^{(2),i}}$, which have values opposite to those of $\frac{\partial\mathrm{F}}{\partial y_k^{(2),p}}$. Let $\mathbf{X}$ be an $n\times k_0$ matrix; $\mathbf{V}^{(1)}$ and $\mathbf{Y}^{(1)}$ be $n\times k_1$ matrices; $\mathbf{V}^{(2)}$ and $\mathbf{Y}^{(2)}$ be $n\times k_2$ matrices. Then we have

$$\Delta\mathbf{W}^{(2)} = \eta\,\big(\mathbf{F}'\otimes\varphi^{(2)\prime}(\mathbf{V}^{(2)})\big)^T\,\mathbf{Y}^{(1)} \qquad \text{Eq. 3.3.3.11}$$

and

$$\Delta\mathbf{W}^{(1)} = \eta\,\Big(\big((\mathbf{F}'\cdot\varphi^{(2)\prime}(\mathbf{V}^{(2)})^T)\otimes\mathbf{I}\big)\cdot\varphi^{(1)\prime}(\mathbf{V}^{(1)})\Big)^T\,\mathbf{X} \qquad \text{Eq. 3.3.3.12}$$

where $\mathbf{I}$ is an $n\times n$ identity matrix used to extract the diagonal elements.

Figure 3.4 shows the algorithm. At line 1, the function neural_network_feature_selector() accepts the following parameters: a set of $m$ feature vectors $\{\mathbf{f}_1,\ldots,\mathbf{f}_m\}$, a set of penalty parameters $\{\varepsilon_1,\ldots,\varepsilon_m\}$, a penalty parameter amplification factor $f$, penalty parameter thresholds $(\varepsilon_{\min},\varepsilon_{\max})$, a class label vector $\mathbf{y}$, an allowable maximum decrease in validation accuracy $\Delta r''$, training and validation thresholds $(r'_{\min}, r''_{\min})$, and a linear combination factor $r_{\mathrm{factor}}$ for the training and validation performances.

At lines 2 to 5, the function copies the feature number, the feature set and the penalty parameter set into three variables $m'$, $F$ and $E$, and initializes the maximum validation accuracy $r''_{\max}$ to a small positive value $\epsilon$. It then enters an iterative process, at lines 6 to 40, to remove features one by one until no more features in $F$ can be removed with sufficiently high performance. In the removal process, at line 7, a neural network $N$ is initialized according to $m'$ by calling the function initialize(). Then at line 8, $N$ is trained and validated with the features in $F$ and penalty parameters in $E$ by calling train_validate(), which returns the trained network $N$, the training accuracy $r'$ and the validation accuracy $r''$. The algorithm then updates $r''_{\max}$. At line 10, $m'$ neural networks $N_{1,\ldots,m'}$ are initialized with the same weights as $N$.

At lines 11 to 15, $m'$ feature sets $F_{1,\ldots,m'}$ and penalty parameter sets $E_{1,\ldots,m'}$ are constructed, each of which has one feature omitted; they are tested and validated with the corresponding neural networks by calling simulate_validate() at line 14, and the corresponding estimated training accuracies $r'_{1,\ldots,m'}$ and validation accuracies $r''_{1,\ldots,m'}$ are returned. At line 16, $r'_{1,\ldots,m'}$ and $r''_{1,\ldots,m'}$ are sorted according to their linear combination in descending order. The higher the linear combination $r_{\mathrm{factor}}\,r'_i + r''_i$ is, the more likely the corresponding $i$-th feature is to be eliminated. The factor $r_{\mathrm{factor}}$ is usually set to be bigger than one. When the differences among $r'_{1,\ldots,m'}$ are large, $r'_{1,\ldots,m'}$ will be the main contributor to the ranking. However, when the differences among $r'_{1,\ldots,m'}$ are small, which often occurs for small training sets, the ranking is mainly affected by $r''_{1,\ldots,m'}$.

At lines 17 to 25, according to the sorted indices $s_{1,\ldots,m'}$, the neural networks whose estimated training and validation accuracy rates $r'_{s_i}$ and $r''_{s_i}$ are bigger than the thresholds $(r'_{\min}, r''_{\min})$ are retrained to get the actual training accuracy $r'_{s_i}$ and validation accuracy $r''_{s_i}$. If the validation accuracy is sufficiently high compared with the maximum validation accuracy, as indicated by the conditions at lines 23 and 26, then the penalty parameters are updated at lines 27 to 33. At lines 35 to 38, the selected feature $\mathbf{f}_{s_i}$ and penalty parameter $\varepsilon_{s_i}$ are removed from $F$ and $E$; the feature number $m'$ is decreased; and $r''_{\max}$ is updated. At lines 29 to 33, the penalty parameters are updated as follows: for any $r'_j$, if it is bigger than the average value $\bar{r}'$, which means that the corresponding feature is likely to be removed, the corresponding $\varepsilon_j$ is enlarged by the factor $f$; otherwise it is reduced by $f$. A penalty parameter's lower and upper thresholds $[\varepsilon_{\min},\varepsilon_{\max}]$ are set to prevent it from becoming too large or too small. After the feature removal process at lines 6 to 40, the selected feature set $F$ is returned at line 41.

As mentioned earlier, the neural network feature selector method is based on the wrapper model, but its feature generation and cross-validation parts are not as clearly separated. In general, lines 7 to 16 and lines 26 to 39 relate to feature generation; lines 17 to 25 validate the feature set; and line 40 contains the stopping criterion.


1 function
3-42
neural_network_feature_selector({f1 ,K , f m },{ε1 ,K, ε m}, f , (ε min , ε max ), y , ∆r ′′, (rmin
′ , rmin
′′ ), rfactor )
2 m′ = m
3 F = {f1 , , f m }
4 E = {ε 1 ,K , ε m }
5 ′′ = ε
rmax
6 do
7 N = initialize (m ′)
8 ( N , r ′, r ′′) = train_vali date( N , F , E , y )
9 ′′ = max(rmax
rmax ′′ , r ′′)
10 ( N1 ,K , N m′ ) = ( N ,K, N )
11 for i = 1 to m′
12 Fi = F − {N i }
13 Ei = E − {ε i }
14 (ri′, ri′′) = simulate_validate( N i , Fi , Ei , y )
15 end
16 ((rs′1 ,K , rs′m′ ), (rs′1′ ,K, rs′′m′ )) = sort_descending( rfactor × (r1′,K , rm′ ′ ) + (r1′′, K, rm′′′ ))
17 i=0
18 do
19 i = i +1
20 if rs′ > rmin
′ and rs′′ ′′
> rmin
i i

21 N si = initialize (m′ − 1)
22 ( N si , rs′i , rs′i′ ) = train_validate( N si , Fsi , E si , y )
′′ − rs′i′
rmax
23 δ =
′′
rmax
24 end
25 untili ≥ m′ or δ ≤ ∆r ′′
26 if δ ≤ ∆r ′′ then
1 m′
27 r′ = ∑ ri′
m′ i =1
28 for j = 1 to m′
29 if r j′ ≥ r ′ and ε j ∈ [ε min , ε max ] then

30 ε j = fε j
31 else
εj
32 εj =
f
33 end
34 end
35 F = F − { f si }
36 E = E − {ε si }
37 m′ = m′ − 1
38 ′′ = max(rmax
rmax ′′ , rs′′i )
39 end
40 until i ≥ m ′
41 return F
42 end

Figure 3.4: Neural network feature selector



The main modifications of the neural network feature selector in this thesis compared with

the one proposed by Setiono and Liu (1997) are summarized below.

1) Use of the back-propagation training method: Setiono and Liu employed the BFGS (Broyden-Fletcher-Goldfarb-Shanno) method for training the neural networks, a variant of the quasi-Newton method that has been shown to be very effective. However, in the training process, BFGS computes a Hessian matrix with dimension equal to the square of the number of features. The size of this matrix is huge when the algorithm is applied to a microarray dataset consisting of a large number of features, so we replace the BFGS algorithm with the back-propagation algorithm in our implementation.

2) Use of leave-one-out validation as an estimator: The neural network feature selector

algorithm proposed by Setiono and Liu uses a fixed partition of training and cross-validation

sample set as initial input. In a typical microarray dataset, when there are only a very limited

number of samples, a fixed training and cross-validation sample set may introduce large bias

in estimating the generalization performance of the trained neural networks. Instead of simply

splitting the sample set into two partitions, the neural network feature selector proposed in

this thesis employs leave-one-out method to reduce the bias in estimating the generalization

performance.

3) Use of a different penalty function: Penalty functions are employed in both versions of the neural network feature selector to force small weights between the input layer and the hidden layer to zero, in order to reduce the effect of irrelevant features on the classification. The penalty function used in (Setiono and Liu, 1997) is


$$P(\mathbf{w}) = \varepsilon_1\left(\sum_i\sum_j\frac{\beta\,(w^{(1)}_{i,j})^2}{1+\beta\,(w^{(1)}_{i,j})^2}\right) + \varepsilon_2\left(\sum_i\sum_j(w^{(1)}_{i,j})^2\right), \qquad \text{Eq. 3.3.3.13}$$

where $\varepsilon_1$, $\varepsilon_2$ and $\beta$ are parameters that determine the detailed penalty effect. From Eq. 3.3.3.13 it can easily be seen that the derivative of the penalty function with respect to a weight involves only that weight itself:

$$\frac{\partial P(\mathbf{w})}{\partial w^{(1)}_{i,j}} = \varepsilon_1\left(\frac{2\beta\,w^{(1)}_{i,j}}{\left(1+\beta\,(w^{(1)}_{i,j})^2\right)^2}\right) + 2\varepsilon_2\,w^{(1)}_{i,j}. \qquad \text{Eq. 3.3.3.14}$$

As a result, the update of the weights tries to force individual small weights to zero. By contrast, the penalty function used in this thesis forces all weights associated with an input neuron to zero if the summed square of these weights is small. The new penalty function is implicitly implemented in Eq. 3.3.3.2. It helps reduce the effect of an input neuron on all hidden neurons when this input neuron is removed.

3.3.4 Support vector machines

Support vector machine is a linear learning model that can also perform classification. It was

invented by Boser et al. (1992). The theoretical aspect of Support Vector Machines is based

on Statistical Learning Theory (Vapnik, 1998). Section 3.3.4.1 describes the basic SVM

learning model and its training; Section 3.3.4.2 illustrates the linearly non-separable case;

Section 3.3.4.3 describes SVM with nonlinear mapping; Section 3.3.4.4 introduces a

multivariate feature selection method based on SVM.



3.3.4.1 Linearly separable learning model and its training

From a decision-making point of view, a linear classifier tries to obtain decision surfaces that can discriminate samples of different classes. These decision surfaces are hyperplanes. Suppose there are $n$ training samples belonging to two classes, $\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n)\}$, where $\mathbf{x}_i$ $(i=1,\ldots,n)$ are sample vectors and $y_i=\pm 1$ $(i=1,\ldots,n)$ are the associated class labels. If there exists a hyperplane $\mathbf{w}^T\mathbf{x}+b=0$ such that for all $n$ samples $\mathbf{x}_i$,

$$\begin{cases}\mathbf{w}^T\mathbf{x}_i+b\ge 0 & y_i=+1\\ \mathbf{w}^T\mathbf{x}_i+b< 0 & y_i=-1\end{cases} \qquad \text{Eq. 3.3.4.1}$$

hold, then these samples are linearly separable. The distance between the separating hyperplane and its closest sample vector is called the margin of separation, denoted as $\rho$. A support vector machine's task is to find an optimal separating hyperplane $\mathbf{w}^{*T}\mathbf{x}+b^*=0$ that has the biggest margin of separation among all separating hyperplanes. Obviously, the distances between this hyperplane and its nearest sample vectors on both sides are equal. With $\mathbf{w}^*$ and $b^*$ properly scaled, we have

$$\begin{cases}\mathbf{w}^{*T}\mathbf{x}_i+b^*\ge +1 & y_i=+1\\ \mathbf{w}^{*T}\mathbf{x}_i+b^*\le -1 & y_i=-1\end{cases} \qquad \text{Eq. 3.3.4.2}$$

for all samples. The sample vectors that satisfy

$$y_i(\mathbf{w}^{*T}\mathbf{x}_i+b^*) = 1 \qquad \text{Eq. 3.3.4.3}$$

are called support vectors. Let us denote a pair of support vectors on opposite sides of the separating hyperplane as $\mathbf{x}^+$ and $\mathbf{x}^-$,

$$\begin{cases}\mathbf{w}^{*T}\mathbf{x}^+ + b^* = +1\\ \mathbf{w}^{*T}\mathbf{x}^- + b^* = -1.\end{cases} \qquad \text{Eq. 3.3.4.4}$$
We then have
$$\rho = \frac{1}{2}\left(\frac{\mathbf{w}^{*T}\mathbf{x}^+}{\|\mathbf{w}^*\|} - \frac{\mathbf{w}^{*T}\mathbf{x}^-}{\|\mathbf{w}^*\|}\right) = \frac{1}{2\,\|\mathbf{w}^*\|}\left(\mathbf{w}^{*T}\mathbf{x}^+ - \mathbf{w}^{*T}\mathbf{x}^-\right) = \frac{1}{\|\mathbf{w}^*\|}, \qquad \text{Eq. 3.3.4.5}$$

where $\|\mathbf{w}^*\| = \sqrt{\mathbf{w}^{*T}\mathbf{w}^*}$.

So $\rho$ is maximized when $\|\mathbf{w}^*\|$ is minimized. The training task can then be converted into the optimization problem

$$\begin{aligned}\text{minimize}\quad & \|\mathbf{w}\|\\ \text{subject to}\quad & y_i(\mathbf{w}^T\mathbf{x}_i+b)\ge 1, \quad i=1,\ldots,n.\end{aligned} \qquad \text{Eq. 3.3.4.6}$$

This constrained optimization problem can be solved using the method of Lagrange multipliers. First, the Lagrange function corresponding to the problem in Eq. 3.3.4.6 is constructed,

$$\mathrm{J}(\mathbf{w},b,\mathbf{a}) = \frac{1}{2}\,\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{n}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i+b)-1\right], \qquad \text{Eq. 3.3.4.7}$$

where the nonnegative variables $\mathbf{a} = [\alpha_1,\ldots,\alpha_n]^T$ are called Lagrange multipliers. The problem in Eq. 3.3.4.6 is equivalent to minimizing $\mathrm{J}(\mathbf{w},b,\mathbf{a})$ with respect to $\mathbf{w}$ and $b$ and maximizing $\mathrm{J}(\mathbf{w},b,\mathbf{a})$ with respect to $\mathbf{a}$. The solution lies at the saddle point of $\mathrm{J}(\mathbf{w},b,\mathbf{a})$:

$$\frac{\partial\mathrm{J}(\mathbf{w},b,\mathbf{a})}{\partial\mathbf{w}} = 0, \qquad \frac{\partial\mathrm{J}(\mathbf{w},b,\mathbf{a})}{\partial b} = 0. \qquad \text{Eq. 3.3.4.8}$$

By solving Eq. 3.3.4.8, we have

$$\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i\mathbf{x}_i, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \text{Eq. 3.3.4.9}$$

and by substituting Eq. 3.3.4.9 into Eq. 3.3.4.7 we get

$$\mathrm{J}(\mathbf{w},b,\mathbf{a}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j\alpha_i\alpha_j\,\mathbf{x}_i^T\mathbf{x}_j. \qquad \text{Eq. 3.3.4.10}$$

The problem in Eq. 3.3.4.6 can then be transformed into the quadratic optimization problem

$$\begin{aligned}\text{maximize}\quad & \mathrm{W}(\mathbf{a}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j\alpha_i\alpha_j\,\mathbf{x}_i^T\mathbf{x}_j\\ \text{subject to}\quad & \sum_{i=1}^{n} y_i\alpha_i = 0, \quad \alpha_i\ge 0, \quad i=1,\ldots,n.\end{aligned} \qquad \text{Eq. 3.3.4.11}$$

This quadratic optimization problem has a unique solution that can be expressed as a

weighted combination of the training samples.

Suppose the solution of the problem is $\mathbf{a}^* = [\alpha_1^*,\ldots,\alpha_n^*]^T$; according to Eq. 3.3.4.9, we have

$$\mathbf{w}^* = \sum_{i=1}^{n}\alpha_i^* y_i\mathbf{x}_i, \qquad \text{Eq. 3.3.4.12}$$

and using Eq. 3.3.4.4, we get

$$b^* = 1 - \mathbf{w}^{*T}\mathbf{x}^+ = -1 - \mathbf{w}^{*T}\mathbf{x}^-, \qquad \text{Eq. 3.3.4.13}$$

where the support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ can be determined using the Karush-Kuhn-Tucker conditions (Fletcher, 1987; Bertsekas, 1995). According to the Karush-Kuhn-Tucker condition, the following equation holds:

$$\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i+b)-1\right] = 0, \quad i=1,\ldots,n. \qquad \text{Eq. 3.3.4.14}$$

So the sample vectors with positive Lagrange multipliers are the support vectors.
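For a concrete view of Eq. 3.3.4.12 to Eq. 3.3.4.14, the sketch below (an illustration using scikit-learn's SVC rather than solving the dual by hand; the toy data are hypothetical) trains an approximately hard-margin linear SVM and recovers $\mathbf{w}^*$ and the support vectors from the non-zero Lagrange multipliers.

import numpy as np
from sklearn.svm import SVC

# toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# a very large C approximates the hard-margin problem of Eq. 3.3.4.6
svm = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, so this is Eq. 3.3.4.12
w = svm.dual_coef_ @ svm.support_vectors_   # equals svm.coef_ for a linear kernel
b = svm.intercept_
margin = 1.0 / np.linalg.norm(w)            # margin of separation, Eq. 3.3.4.5
support_indices = svm.support_              # samples with positive Lagrange multipliers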

3.3.4.2 Linearly non-separable learning model and its training

In the linearly non-separable case, there is no hyperplane that can separate all samples. In this circumstance, a set of slack variables $\{\xi_i\}$, $\xi_i\ge 0$, $i=1,\ldots,n$, is introduced, and the separation condition is redefined as

$$y_i(\mathbf{w}^T\mathbf{x}_i+b)\ge 1-\xi_i, \quad i=1,\ldots,n, \qquad \text{Eq. 3.3.4.15}$$

and the optimization problem becomes

$$\begin{aligned}\text{minimize}\quad & \mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{n}\xi_i\\ \text{subject to}\quad & y_i(\mathbf{w}^T\mathbf{x}_i+b)\ge 1-\xi_i,\ \ \xi_i\ge 0, \quad i=1,\ldots,n,\end{aligned} \qquad \text{Eq. 3.3.4.16}$$

where the parameter $C$ balances the generalization ability represented by the first term and the separation ability indicated by the second term. The problem in Eq. 3.3.4.16 can be converted to a dual problem similar to that of the separable case, in which the slack variables are omitted:

$$\begin{aligned}\text{maximize}\quad & \mathrm{W}(\mathbf{a}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j\alpha_i\alpha_j\,\mathbf{x}_i^T\mathbf{x}_j\\ \text{subject to}\quad & \sum_{i=1}^{n} y_i\alpha_i = 0, \quad 0\le\alpha_i\le C, \quad i=1,\ldots,n,\end{aligned} \qquad \text{Eq. 3.3.4.17}$$

and the optimum solution becomes

$$\mathbf{w}^* = \sum_{i=1}^{n_s}\alpha_{s_i} y_{s_i}\mathbf{x}_{s_i}, \qquad \text{Eq. 3.3.4.18}$$

where $n_s$ is the number of support vectors and $s_i$, $i=1,\ldots,n_s$, are the indices corresponding to those support vectors. To identify the support vectors, the Karush-Kuhn-Tucker conditions are defined as

$$\begin{cases}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i+b)-1+\xi_i\right] = 0\\ \xi_i(\alpha_i - C) = 0\end{cases} \quad i=1,\ldots,n. \qquad \text{Eq. 3.3.4.19}$$
According to these conditions, all sample vectors with positive Lagrange multipliers are support vectors, and a slack variable is non-zero only when its corresponding Lagrange multiplier equals $C$. The value of $b^*$ can be determined by choosing any support vector $\mathbf{x}_i$ with Lagrange multiplier $0<\alpha_i<C$:

$$b^* = 1 - \mathbf{w}^{*T}\mathbf{x}_i \quad \text{if } y_i=+1 \qquad \text{Eq. 3.3.4.20}$$

or

$$b^* = \mathbf{w}^{*T}\mathbf{x}_i - 1 \quad \text{if } y_i=-1. \qquad \text{Eq. 3.3.4.21}$$

When the classification task has more than two class labels, there are two ways to transform it into binary classification problems. One way is to encode the class labels using a binary representation. Suppose there are $l$ class labels in the task; then $\lceil\log_2(l)\rceil$ support vector machines are needed to perform the classification task together. If a sample has the $k$-th class label, a vector $\mathbf{y} = [y_1,\ldots,y_{\lceil\log_2(l)\rceil}]^T$, $y_1,\ldots,y_{\lceil\log_2(l)\rceil}\in\{+1,-1\}$, where

$$y_i = \begin{cases}+1 & \text{if the } i\text{th bit of the binary form of } k \text{ is } 1\\ -1 & \text{if the } i\text{th bit of the binary form of } k \text{ is } 0\end{cases} \quad i=1,\ldots,\lceil\log_2(l)\rceil,$$

is sufficient to represent the corresponding desired outputs of the support vector machines. Another way is the one-against-others method. In this method, $l$ support vector machines are needed. The corresponding desired output of a sample with the $k$-th class label is $\mathbf{y} = [y_1,\ldots,y_l]^T$, where

$$y_i = \begin{cases}+1 & \text{if } i=k\\ -1 & \text{otherwise}\end{cases} \quad i=1,\ldots,l.$$

3.3.4.3 Nonlinear Support Vector Machines

The support vector machine is basically a linear model. It can be extended to handle non-linear cases by introducing slack variables, as shown in Section 3.3.4.2. In addition, nonlinear mapping functions for transforming the input vectors can also be employed. This approach makes the learning more flexible. Let $\mathbf{t}(\mathbf{x}) = [t_0(\mathbf{x}), t_1(\mathbf{x}),\ldots,t_l(\mathbf{x})]^T$ be a vector of nonlinear transform functions, where $t_0(\mathbf{x}) = 1$. The optimal hyperplane is then defined as

$$\mathbf{w}^T\mathbf{t}(\mathbf{x}) = 0, \qquad \text{Eq. 3.3.4.22}$$

where the bias term is included implicitly in $\mathbf{w}$. By adapting Eq. 3.3.4.12 we get

$$\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i\,\mathbf{t}(\mathbf{x}_i). \qquad \text{Eq. 3.3.4.23}$$

Substituting Eq. 3.3.4.23 into Eq. 3.3.4.22, we obtain

$$\sum_{i=1}^{n}\alpha_i y_i\,\mathbf{t}^T(\mathbf{x}_i)\mathbf{t}(\mathbf{x}) = 0. \qquad \text{Eq. 3.3.4.24}$$

Let the inner product kernel $\mathrm{K}(\mathbf{x}_i,\mathbf{x}) = \mathbf{t}^T(\mathbf{x}_i)\mathbf{t}(\mathbf{x})$ be a symmetric function; Eq. 3.3.4.24 then becomes

$$\sum_{i=1}^{n}\alpha_i y_i\,\mathrm{K}(\mathbf{x}_i,\mathbf{x}) = 0. \qquad \text{Eq. 3.3.4.25}$$

The problem in Eq. 3.3.4.17 is then reformulated as

$$\begin{aligned}\text{maximize}\quad & \mathrm{W}(\mathbf{a}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j\alpha_i\alpha_j\,\mathrm{K}(\mathbf{x}_i,\mathbf{x}_j)\\ \text{subject to}\quad & \sum_{i=1}^{n} y_i\alpha_i = 0, \quad 0\le\alpha_i\le C, \quad i=1,\ldots,n.\end{aligned} \qquad \text{Eq. 3.3.4.26}$$

The optimal decision hyperplane can be found by solving the problem in Eq. 3.3.4.26 and substituting the Lagrange multipliers into Eq. 3.3.4.25.

The complexity of the target function to be learned depends on the way it is represented. The kernel approach provides a means to implicitly map input vectors into a feature space, i.e. the kernel can be used without knowing its corresponding transform function. The introduction of a kernel simplifies the design of a learner and may improve its generalization ability. This approach can be used not only in support vector machines but also in other learning models. Some examples of kernels used in learning models are listed below (Haykin, 1999):

• Polynomial learning machine

$$\mathrm{K}(\mathbf{x},\mathbf{x}_i) = (\mathbf{x}^T\mathbf{x}_i + 1)^p. \qquad \text{Eq. 3.3.4.27}$$

• Radial-basis function network

$$\mathrm{K}(\mathbf{x},\mathbf{x}_i) = e^{-\frac{1}{2\sigma^2}\|\mathbf{x}-\mathbf{x}_i\|^2}. \qquad \text{Eq. 3.3.4.28}$$

• Three layer neural network

$$\mathrm{K}(\mathbf{x},\mathbf{x}_i) = \tanh(\beta_0\,\mathbf{x}^T\mathbf{x}_i + \beta_1), \qquad \text{Eq. 3.3.4.29}$$

where $p$, $\sigma$, $\beta_0$ and $\beta_1$ are pre-specified parameters.
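The three kernels in Eq. 3.3.4.27 to Eq. 3.3.4.29 can be written down directly; the short sketch below is an illustration only, with p, sigma, beta0 and beta1 as user-chosen parameters. In practice, libraries such as scikit-learn expose comparable kernels through options like kernel="poly", "rbf" or "sigmoid".

import numpy as np

def polynomial_kernel(x, xi, p=3):
    return (x @ xi + 1.0) ** p                                    # Eq. 3.3.4.27

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))    # Eq. 3.3.4.28

def tanh_kernel(x, xi, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (x @ xi) + beta1)                      # Eq. 3.3.4.29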

3.3.4.4 SVM recursive feature elimination (RFE)

The weights of a trained SVM can indicate the importance of the corresponding features to the classification. Based on this idea, Guyon et al. (2002) proposed a recursive feature elimination method. After training a linear-kernel SVM, its weight vector can be obtained from Eq. 3.3.4.18. The algorithm iteratively trains an SVM and eliminates the feature(s) with the smallest weights, until the feature set becomes empty. Figure 3.5 shows the algorithm:
RFE(S = {f1, …, fm}, y)
    while |S| > 1
        w = svm_training(S, y)
        f = arg min_i (w_i)^2,  i = 1, …, |S|
        S = S − {f_f}
    end

Figure 3.5: Recursive feature elimination algorithm
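Figure 3.5 can be realized in a few lines. The following Python sketch is an illustration (using scikit-learn's linear SVC and removing one feature per iteration, which is one common variant); it follows the same loop and records the elimination order.

import numpy as np
from sklearn.svm import SVC

def rfe(X, y, C=1.0):
    """SVM recursive feature elimination.
    Returns feature indices ordered from most to least important."""
    remaining = list(range(X.shape[1]))
    ranking = []                                   # features in order of removal
    while len(remaining) > 1:
        svm = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        weights = svm.coef_.ravel() ** 2           # squared weight of each remaining feature
        worst = int(np.argmin(weights))            # feature with the smallest weight
        ranking.append(remaining.pop(worst))
    ranking.append(remaining[0])
    return ranking[::-1]                           # most important features first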

3.3.5 Bayesian classifier

Keller et al. (2000) used a simple classification method based on naïve Bayes rule. Given an

expression vector x of m selected features, the classification of a sample is computed as

follows:

$$\mathrm{class}(\mathbf{x}) = \arg\max_{i}\big(\log P(M_i\mid\mathbf{x})\big), \quad i=a,b, \qquad \text{Eq. 3.3.5.1}$$

where $P(M_i\mid\mathbf{x})$ is the a posteriori probability that $M_i$ is true given $\mathbf{x}$. Applying the Bayes rule once again, the class for vector $\mathbf{x}$ can be predicted as

$$\mathrm{class}(\mathbf{x}) = \arg\max_{i}\big(\log P(\mathbf{x}\mid M_i)\big) = \arg\max_{i}\left(\sum_{g=1}^{m}\log P(x^g\mid M_i)\right) = \arg\max_{i}\left(\sum_{g=1}^{m}-\log\delta_i^g - \frac{(x^g-\mu_i^g)^2}{2(\delta_i^g)^2}\right), \quad i=a,b, \qquad \text{Eq. 3.3.5.2}$$

where $\mu_i^g$ and $\delta_i^g$ are the mean and standard deviation of the feature values of the training samples of class $i$. In a binary classification, we can be more confident about the classification when the difference between $\log P(\mathbf{x}\mid M_a)$ and $\log P(\mathbf{x}\mid M_b)$ is bigger. However, in order to obtain more information regarding the confidence of the classification, we need to compute the following:

$$\mathrm{class}(\mathbf{x}) = \log P(\mathbf{x}\mid M_a) - \log P(\mathbf{x}\mid M_b) = \sum_{g=1}^{m}\left(-\log\delta_a^g - \frac{(x^g-\mu_a^g)^2}{2(\delta_a^g)^2}\right) - \sum_{g=1}^{m}\left(-\log\delta_b^g - \frac{(x^g-\mu_b^g)^2}{2(\delta_b^g)^2}\right). \qquad \text{Eq. 3.3.5.3}$$
g =1 

A positive difference means that the sample is predicted to be class a , and a negative

difference means that the sample is predicted to be class b . The larger the difference, the

more confident we are about the classification. We also make use of this difference when

computing another measure of accuracy, i.e. acceptance rate, which will be discussed in

Section 4.1.6.1.
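The decision rule of Eq. 3.3.5.2 and the confidence score of Eq. 3.3.5.3 amount to a two-class Gaussian naive Bayes comparison. The sketch below is an illustration only (class names and variable names are hypothetical), not the exact implementation used in the experiments.

import numpy as np

class TwoClassNaiveBayes:
    """Gaussian naive Bayes score of Eq. 3.3.5.2 / Eq. 3.3.5.3 on selected features."""
    def fit(self, Xa, Xb, eps=1e-8):
        self.mu_a, self.sd_a = Xa.mean(axis=0), Xa.std(axis=0) + eps
        self.mu_b, self.sd_b = Xb.mean(axis=0), Xb.std(axis=0) + eps
        return self

    def _log_lik(self, x, mu, sd):
        return np.sum(-np.log(sd) - (x - mu) ** 2 / (2.0 * sd ** 2))

    def score(self, x):
        # positive -> class a, negative -> class b; the magnitude reflects confidence
        return self._log_lik(x, self.mu_a, self.sd_a) - self._log_lik(x, self.mu_b, self.sd_b)

    def predict(self, x):
        return "a" if self.score(x) > 0 else "b"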

3.3.6 Two discriminant methods for multivariate feature selection

We found that a number of multivariate feature ranking methods can be placed in a unified framework. The methods attempt to find a vector such that the projection of the samples onto it maximizes or minimizes a certain objective function $\mathrm{f}(\mathbf{w})$. The function $\mathrm{f}(\mathbf{w})$ is originally used on individual features to measure the discrimination ability or diversity of a feature. The magnitudes of the elements of the vector then indicate the relative importance of the features. When $\mathrm{f}(\mathbf{w})$ is the extremal margin, the method is equivalent to RFE. When $\mathrm{f}(\mathbf{w})$ is set to Fisher's criterion (see Section 3.3.6.1), the method becomes Fisher's linear discrimination method. When the function $\mathrm{f}(\mathbf{w})$ is substituted by Eq. 3.3.3.3, the method resembles the neural network feature selector in the sense that the optimized weights between the input and hidden layers can indicate the relative importance of the input neurons. When $\mathrm{f}(\mathbf{w})$ is the standard deviation of the projections of all samples, the method becomes PCA. Section 3.3.6.1 describes Fisher's linear discriminant. In Section 3.3.6.2, we attempt to use the likelihood ranking method as the objective function to rank the relative discrimination contributions of the features.

3.3.6.1 Fisher’s linear discriminant

Let $G = \{g_1,\ldots,g_m\}$ be a set of features. By performing the linear transforms $y_{a,i} = \sum_{g\in G} w_g x^g_{a,i}$ and $y_{b,i} = \sum_{g\in G} w_g x^g_{b,i}$, that is, projecting all samples from the $m$-dimensional space onto a unit vector $\mathbf{w}$, we can obtain Fisher's criterion of the projections along $\mathbf{w}$,

$$\mathrm{F}(\mathbf{w}) = \frac{(\mu'_a-\mu'_b)^2}{\delta_a'^2+\delta_b'^2} = \frac{(\mathbf{w}^T\mathbf{u}_a-\mathbf{w}^T\mathbf{u}_b)^2}{\mathbf{w}^T(n_a\Sigma_a+n_b\Sigma_b)\mathbf{w}}, \qquad \text{Eq. 3.3.6.1}$$

where $\mu'$ and $\delta'$ are the means and standard deviations of the two classes of projections, $\mathbf{u}_a$ and $\mathbf{u}_b$ are the means of the original sample vectors of the two classes, and $\Sigma_a$ and $\Sigma_b$ are the covariance matrices of the samples from the two classes. Fisher's linear discriminant tries to find the weight vector $\mathbf{w}$ that maximizes $\mathrm{F}(\mathbf{w})$, that is,

$$\begin{aligned}\text{maximize}\quad & \mathrm{F}(\mathbf{w})\\ \text{s.t.}\quad & \mathbf{w}^T\mathbf{w} = 1.\end{aligned} \qquad \text{Eq. 3.3.6.2}$$

The solution of the maximization problem is

$$\mathbf{w}^* = (n_a\Sigma_a+n_b\Sigma_b)^{-1}(\mathbf{u}_a-\mathbf{u}_b). \qquad \text{Eq. 3.3.6.3}$$

The value of $\mathbf{w}^*$ can be an indicator of the contribution of the features to the discrimination.
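Eq. 3.3.6.3 can be evaluated directly. The sketch below is an illustration (not the thesis implementation); it adds a small ridge term as a safeguard in case the pooled scatter matrix is singular, which easily happens with microarray data where features outnumber samples.

import numpy as np

def fisher_weights(Xa, Xb, ridge=1e-6):
    """Fisher's linear discriminant direction, Eq. 3.3.6.3."""
    na, nb = Xa.shape[0], Xb.shape[0]
    ua, ub = Xa.mean(axis=0), Xb.mean(axis=0)
    Sa = np.cov(Xa, rowvar=False, bias=True)
    Sb = np.cov(Xb, rowvar=False, bias=True)
    S = na * Sa + nb * Sb + ridge * np.eye(Xa.shape[1])   # pooled within-class scatter
    w = np.linalg.solve(S, ua - ub)
    return w / np.linalg.norm(w)                          # unit vector; |w_g| ranks gene g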



3.3.6.2 Multivariate Likelihood feature ranking

We propose a multivariate likelihood feature selection method that is based on a similar idea to that of Fisher's linear discriminant. Suppose there are $n = n_a + n_b$ samples with $m$ features. Recall Keller's likelihood method for ranking an individual gene $g$, expressed in Eq. 3.2.2.6 and Eq. 3.2.2.7 in Section 3.2.2. Let $G = \{g_1,\ldots,g_m\}$ be a set of features. By performing the linear transforms $y_{a,i} = \sum_{g\in G} w_g x^g_{a,i}$ and $y_{b,i} = \sum_{g\in G} w_g x^g_{b,i}$, that is, projecting all samples from the $m$-dimensional space onto a unit vector $\mathbf{w}$, we can obtain the likelihoods of the projections of the samples of the two classes on the vector $\mathbf{w}$,

$$\mathrm{LIK}_{a\to b} = \sum_{i=1}^{n_a}\left[-\log(\delta_a) - \frac{(y_{a,i}-\mu_a)^2}{2\delta_a^2} + \log(\delta_b) + \frac{(y_{a,i}-\mu_b)^2}{2\delta_b^2}\right] \qquad \text{Eq. 3.3.6.4}$$

and

$$\mathrm{LIK}_{b\to a} = \sum_{i=1}^{n_b}\left[-\log(\delta_b) - \frac{(y_{b,i}-\mu_b)^2}{2\delta_b^2} + \log(\delta_a) + \frac{(y_{b,i}-\mu_a)^2}{2\delta_a^2}\right], \qquad \text{Eq. 3.3.6.5}$$

where $\mu$ and $\delta$ are the means and standard deviations of the two classes of projections, respectively.

The multivariate feature selection process then becomes

$$\begin{aligned}\text{maximize}\quad & \mathrm{f}(\mathbf{w})\\ \text{s.t.}\quad & \mathbf{w}^T\mathbf{w} = 1,\end{aligned} \qquad \text{Eq. 3.3.6.6}$$

where $\mathrm{f}(\mathbf{w}) = \mathrm{LIK}_{a\to b}$, $\mathrm{f}(\mathbf{w}) = \mathrm{LIK}_{b\to a}$ or $\mathrm{f}(\mathbf{w}) = \mathrm{LIK}_{a\to b} + \mathrm{LIK}_{b\to a}$. When the maximization problem in Eq. 3.3.6.6 is solved, the resulting $\mathbf{w}$ becomes an indicator of the contributions of the features to the discrimination. We use the Sequential Quadratic Programming method (Fletcher, 1987) to find a suitable $\mathbf{w}$.
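The constrained maximization in Eq. 3.3.6.6 can be handed to a sequential quadratic programming routine. The minimal sketch below uses SciPy's SLSQP solver and the combined objective LIK(a->b) + LIK(b->a); it is an illustration under these assumptions, not the implementation used in the thesis.

import numpy as np
from scipy.optimize import minimize

def multivariate_lik_weights(Xa, Xb, eps=1e-8):
    """Maximize LIK_{a->b} + LIK_{b->a} of the projections over a unit vector w."""
    m = Xa.shape[1]

    def neg_lik(w):
        ya, yb = Xa @ w, Xb @ w
        mu_a, sd_a = ya.mean(), ya.std() + eps
        mu_b, sd_b = yb.mean(), yb.std() + eps
        lik_ab = np.sum(-np.log(sd_a) - (ya - mu_a) ** 2 / (2 * sd_a ** 2)
                        + np.log(sd_b) + (ya - mu_b) ** 2 / (2 * sd_b ** 2))
        lik_ba = np.sum(-np.log(sd_b) - (yb - mu_b) ** 2 / (2 * sd_b ** 2)
                        + np.log(sd_a) + (yb - mu_a) ** 2 / (2 * sd_a ** 2))
        return -(lik_ab + lik_ba)                         # minimize the negative score

    cons = {"type": "eq", "fun": lambda w: w @ w - 1.0}   # unit-norm constraint of Eq. 3.3.6.6
    w0 = np.ones(m) / np.sqrt(m)
    res = minimize(neg_lik, w0, method="SLSQP", constraints=[cons])
    return res.x                                          # |w_g| ranks the g-th feature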

3.3.7 Combining univariate feature ranking method and multivariate

feature selection method

Information gain and likelihood method are univariate feature ranking methods in the sense

that they assume genes contribute to classification independently, and rank the genes

according to their individual contribution. This assumption has computational advantages.

But in the real world, genes often work together to perform a certain function, and the combinatorial effect of these genes is not considered by univariate selection methods. On the other hand, the neural network feature selector, recursive feature elimination and the multivariate likelihood method consider the whole contribution of a subset of features to the classification.

These three approaches have the potential to select smaller subsets of features with higher classification performance. However, the selection process may be obscured when applied to microarray datasets with high dimensionality and a large number of irrelevant features. Take RFE as an example: the presence of a large number of irrelevant features hides the discriminative information of the relevant features. This can be seen in the formulation of the SVM dual problem, where the coefficients of the quadratic terms are computed as the scalar products of two inputs,

$$\mathbf{x}_i^T\mathbf{x}_j = \sum_{k=1}^{m} x_{ik}x_{jk}. \qquad \text{Eq. 3.3.7.1}$$

The gene elimination process is very sensitive to change in the feature set. SVM also has the

disadvantage that it is sensitive to outliers as discussed in Guyon et al. (2002). In microarray

data, the outliers may be introduced by: 1) noise in the expression data, or 2) incorrectly

identified or labeled samples in the training dataset. It is therefore more beneficial to apply

RFE on a dataset with a reduced number of features. A univariate feature selection algorithm

can be used to first efficiently reduce the large number of features originally present in the

dataset and a multivariate feature selection method such as RFE can then be applied to

remove more features. To summarize, we first identify and remove genes that are expected to

have low discrimination ability, as indicated by their LIK scores. Then, we apply RFE to reduce the size of the feature set further. With this integrated approach to feature selection, we are

able to achieve good classification performance with fewer genes than those reported by

Guyon et al. (2002) and Keller et al. (2000).

A multivariate method is usually very time-consuming when applied to a dataset with

thousands of genes. For RFE, in order to eliminate one or more genes, a new SVM has to be

trained, and the overall computational cost is $\Omega(m^2 n^2)$. On the other hand, LIK ranks genes independently, which makes the computational complexity of LIK $O(mn)$. Using LIK first

to reject a large number of genes, and then using RFE to perform further selection will save

significant running time compared to just using RFE alone. This is especially important when

the improvement in microarray technology makes it possible to obtain gene expression values

from tens or even hundreds of thousands of genes.
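Putting the two stages together, a hypothetical end-to-end sketch of the hybrid selection (LIK pre-filtering followed by SVM-RFE, in the spirit of the helpers sketched in earlier sections) could look as follows. It assumes labels of ±1, and the thresholds such as the number of genes kept at each stage are illustrative choices, not the values tuned in the experiments.

import numpy as np
from sklearn.svm import SVC

def lik_prefilter(Xa, Xb, keep=200, eps=1e-8):
    """Stage 1: keep the genes with the highest univariate LIK_{a->b} scores (O(mn))."""
    mu_a, sd_a = Xa.mean(0), Xa.std(0) + eps
    mu_b, sd_b = Xb.mean(0), Xb.std(0) + eps
    lik = np.sum(-np.log(sd_a) - (Xa - mu_a) ** 2 / (2 * sd_a ** 2)
                 + np.log(sd_b) + (Xa - mu_b) ** 2 / (2 * sd_b ** 2), axis=0)
    return np.argsort(-lik)[:keep]

def rfe_rank(X, y, C=1.0):
    """Stage 2: SVM-RFE on the reduced gene set (Figure 3.5)."""
    remaining, order = list(range(X.shape[1])), []
    while len(remaining) > 1:
        w = SVC(kernel="linear", C=C).fit(X[:, remaining], y).coef_.ravel()
        order.append(remaining.pop(int(np.argmin(w ** 2))))
    order.append(remaining[0])
    return order[::-1]

def hybrid_select(X, y, keep=200, final=8):
    """LIK pre-filter then RFE; returns indices of the final gene subset."""
    genes = lik_prefilter(X[y == +1], X[y == -1], keep=keep)
    ranked = rfe_rank(X[:, genes], y)
    return genes[ranked[:final]]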



4 Experimental results and discussion

This chapter describes the experimental results and discussions from applying machine learning methods to global gene expression analysis in our research. We start with experiments on datasets of the second type of classification problem, which consist of a large number of features and a small number of samples. The high-dimensional nature of this kind of problem makes it distinct from common classification problems. We then describe test results on a newly released dataset, which is of the first type of classification problem, with a large number of samples and a small number of features.

4.1 Second type - high dimension problems

The main work in this thesis focuses on the second type of classification problem, which involves a small number of samples with a large number of features, and on the feature selection problems associated with this type of problem. Our strategy is to first try various analysis methods, most of which were described in Chapter 3, on a well known benchmark dataset, the human acute leukemia microarray dataset (Golub et al., 1999), select the one that has the best performance, and then try that method on other datasets of the same type, including the small round blue cell tumors (SRBCTs) dataset (Khan et al., 2001) and artificial datasets.

The human acute leukemia microarray dataset consists of 72 microarray experiments with expression values of 7129 clones from 6817 human genes. Here the term clone refers to a fragment of a gene. Each of the genes has a short description, and each clone is represented by an accession number. Each microarray is assigned a class label, either Acute Myeloid Leukemia (AML) or Acute Lymphoblastic Leukemia (ALL), according to the organism used for the hybridization. A second type of classification problem arises from this dataset. We used the clone id as the feature id. Each of the hybridizations corresponds to a classification sample. These samples were divided into two sets by Golub et al.: the first sample set, which consists of 27 ALL samples and 11 AML samples, is for training the classifier; the second sample set, consisting of 20 ALL samples and 14 AML samples, is for testing the classifier.

Figure 4.1: The combination of feature selection / integration methods and classification

methods.

Due to time constraints, we only tested a subset of all possible combinations of these methods. In our experiments described in the following sections, the combination of the Likelihood and Recursive Feature Elimination methods achieved the best feature selection performance. With the optimal feature sets selected using this combination, the Bayesian classification method achieved the best classification performance. The combinations that were used are summarized in Table 4.1, and the characteristics of some of the combinations are described there. The term homogeneous in the table refers to methods that use the same kind of criterion in both the single-dimensional and multi-dimensional cases. The combinations that were not tested correspond to blank cells in the table.

Univariate \ Multivariate | Neural network feature selector | Fisher Linear Discriminator | Multivariate Likelihood Selection Method | Recursive Feature Elimination
Information gain | Homogeneous; stopping criterion; heavy computation | | |
Likelihood | Tested | | Homogeneous | Best selection performance; extensively studied
Fisher Criterion | | Homogeneous | Two gene sets can be obtained, each distinguishing one class from the other | Tested; second best combination we found
Extremal Margin | | | | Baseline; tested

Table 4.1: List of combined univariate and multivariate feature selection methods tested.

4.1.1 Principal Component Analysis for feature integration

Our first effort was to study the relevance of the features to the class labels. The study involves both feature integration and feature selection. For feature integration, we chose principal component analysis to see how much the most informative components extracted from the features can contribute to the classification. The experiment was done using the statistics toolbox and neural network toolbox in MATLAB v6.1. Computing the principal components from a large number of features consumes a large amount of computer memory. Due to the limitation of our computer system, the program was unable to process all 7129 features. We generated 72 components from 4500 randomly chosen features, which is the maximum number of features that can be processed by the program, and then used all samples with these components to perform leave-one-out training and validation using a three-layer feed-forward neural network with different numbers of hidden units. In the training process, batch mode was used. In each leave-one-out iteration, a test of the classification performance on the 71 samples was done after every 10 epochs. If the accuracy on the training data was 100%, the training was stopped, and the remaining sample was tested. Table 4.2 shows the performance. The results show that although the principal components extract the most informative directions from the features in terms of variance, this information is hardly relevant to the classification.
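A condensed Python sketch of this kind of PCA plus leave-one-out pipeline is shown below for illustration. It is not the MATLAB toolbox code used in the thesis, and the component and hidden-unit counts are hypothetical parameters.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import LeaveOneOut

def pca_loo_accuracy(X, y, n_components=72, hidden=50):
    """Project samples onto principal components, then leave-one-out validate an MLP."""
    Z = PCA(n_components=min(n_components, X.shape[0])).fit_transform(X)
    correct = 0
    for train, test in LeaveOneOut().split(Z):
        clf = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=2000)
        clf.fit(Z[train], y[train])
        correct += int(clf.predict(Z[test])[0] == y[test][0])
    return correct / len(y)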

Hidden neuron number Training method Performance


600 RPROP backpropagation 51.39%
600 One Step Secant Algorithm 51.39%
50 One Step Secant Algorithm 43.06%
50 One Step Secant Algorithm 48.61%

Table 4.2: Performance of the neural network using principal components as input.

4.1.2 C4.5 for feature selection

We applied the C4.5 algorithm to all 72 samples of the leukemia dataset. The constructed decision tree was surprisingly simple, involving only two genes, and it could correctly classify 71 samples. The tree is as follows:

M84526_at > 290 : -1

M84526_at <= 290 :

| X54489_rna1_at <= 91 : 1

| X54489_rna1_at > 91 : -1

Using only the 38 training samples, an even simpler tree involving only one feature was generated; this tree correctly classified all 38 training samples and 31 out of 34 test samples (accuracy: 91.2%).

X95735_at <= 938 : 1

X95735_at > 938 : -1

The decision tree construction algorithm C4.5 also supports constructing trees in iterative mode. In this mode, the algorithm randomly selects an initial sample subset from the training set to construct a decision tree, then iteratively adds the samples that are misclassified by the tree into the subset and reconstructs the tree until there is no misclassification among all training samples. We tested C4.5 in iterative mode for 20 trials using all 72 samples. All trials generated a two-layer tree with three nodes. These trees are listed in Table 4.3. For simplicity, we list only the gene accession numbers of the first and second layers to represent each tree.

Feature of first layer Feature of second layer Occurrence


M84526_at M83652_s_at 4
M84526_at D86967_at 2
M84526_at X54489_rna1_at 2
U46499_at M98833_at 1
U46499_at X86401_s_at 1
U46499_at D89289_at 1
L09209_s_at D80003_at 1
D88422_at M31166_at 5
D88422_at U23070_at 2
D88422_at M83652_s_at 1

Table 4.3: Decision trees constructed by iterative mode from all 72 samples.

We then used leave-one-out tests to construct decision trees from all samples. A total of 72 trees were constructed, most of which have two layers, but some had only one layer. Table 4.4 summarizes these trees. When applying leave-one-out to the 38 training samples, all the constructed trees had one layer. They are summarized in Table 4.5.

Feature of first layer Feature of second layer Occurrence


M23197_at M20902_at 1
M23197_at D80003_at 1
M23197_at 2
M27891_at M31166_at 2
M27891_at M55418_at 1
M27891_at 2
M84526_at X06948_at 1
M84526_at Y07604_at 1
M84526_at M81883_at 1
M84526_at D86967_at 1
M84526_at X54489_rna1_at 52
U46499_at M60527_at 1
U46499_at M98833_at 1
U46499_at U36922_at 1
U46499_at X86401_s_at 1
X95735_at HG2160-HT2230_at 1
X95735_at 1
M83652_s_at M31211_s_at 1

Table 4.4: Decision trees constructed by leave-one-out mode from all 72 samples

Feature of first layer Occurrence Classification accuracy on test samples
on test samples
X95735_at 35 0.912

M27891_at 1 0.941

M31166_at 1 0.706

M55150_at 1 0.794

Table 4.5: Decision trees constructed by leave-one-out mode from 38 training samples with

prediction accuracy on 34 test samples.



Table 4.3, Table 4.4 and Table 4.5 show that certain features occur very frequently in these

three experiments. It appears that the selectivity of C4.5 algorithm is high for the dataset. In

Table 4.3 and Table 4.4, feature M84526_at has the highest occurrence frequency, which

implies that this feature is important in deciding the classes when all 72 samples are taken

into account. But when only taking the 38 training samples into account, the algorithm

selected a very different set of features, which is shown in Table 4.5. Only feature X95735_at

appeared twice in leave-one-out mode for all 72 samples (see Table 4.4). The fact that the trees generated on all training and test samples are very simple, and the trees generated on the training samples alone are even simpler, makes us expect that there should be a feature selection method that can generate a very small feature set when presented with a very limited number of training samples. It may also be possible to obtain high classification accuracy on the test samples with classifiers constructed using such a small feature set.

Rank Feature All samples Training samples Test samples
1 M84526_at 0.652 0.408 0.689
2 M27891_at 0.652 0.685 0.689
3 D88422_at 0.651 0.578 0.584
4 M23197_at 0.648 0.581 0.684
5 X95735_at 0.647 0.844 0.522
6 U46499_at 0.634 0.565 0.692
7 M31523_at 0.590 0.511 0.851
8 L09209_s_at 0.589 0.562 0.577
9 M83652_s_at 0.550 0.578 0.420
10 M11722_at 0.542 0.332 0.683
22 M31166_at 0.405 0.689 0.182
26 M55150_at 0.398 0.671 0.198

Table 4.6: Leukemia features and their information gain of all samples, training samples and

test samples, sorted by gain of all samples.



We continued to investigate the information gain computed for the features at the first level of the trees. In Table 4.6, the top ten ranked features are listed according to the information gain computed on all samples. The features listed in Table 4.5 that rank outside the top ten are also included in Table 4.6.

Information gain is a measure of relevance of a feature to classification. It appeared that the

features with high gain in training samples are likely to have high gain in all samples and test

samples. As mentioned before, C4.5 can construct a decision tree that consists of only one

feature from training samples using information gain. The tree is very simple but it can only

correctly classify 31 of 34 test samples. The generalization ability of the classifiers

constructed using only information gain measure for continuous features is not high, as was

discussed in Chapter 3, so we tried to use information gain measure as a feature selection

method, and test neural networks based on the selected features to see whether there is any

improvement in performance.

4.1.3 Neural networks with features selected using information gain

We tested neural networks with different configurations and different numbers of features, as listed in the following tables; the features used were those with the highest information gain computed from the training samples. The test results are listed in Table 4.7. Two kinds of training methods were used: trainoss (one step secant algorithm) and traingd (gradient descent backpropagation). The training was done in batch mode. In the training process, a test of the classification error on the training samples was performed repeatedly after the number of epochs indicated in the test interval column of the table, until the error converged to no greater than the training error tolerance. The reason we chose different tolerances was to check how well the trained neural network could generalize under a given over-fitting limit. If the number of tests exceeded the number indicated in the number of unsuccessful trials column, then the training was considered unsuccessful and the training result was rejected. Once the training was successfully terminated, we tested the classification accuracy of the trained network on the test samples. For every configuration corresponding to a row in the table, we collected 100 successful trainings and then calculated the mean and standard deviation of the classification accuracy on the test samples.

Feature number | Number of hidden units | Training method | Test interval (epochs) | Number of unsuccessful trials | Training error tolerance (samples) | Test accuracy
50 | 60 | trainoss | 20 | N/A | 0 | 0.810±0.090
20 | 60 | trainoss | 20 | N/A | 0 | 0.855±0.080
10 (from all samples) | 20 | trainoss | 20 | 20 | 0 | 0.952±0.024
5 | 20 | trainoss | 20 | 20 | 0 | 0.880±0.075
100 | 40 | trainoss | 20 | 20 | 0 | 0.824±0.101
50 | 50 | trainoss | 200 | 20 | 0 | 0.843±0.089
50 | 50 | trainoss | 20 | 20 | 0 | 0.848±0.090
50 | 10 | traingd | 100 | 5 | 2 | 0.936±0.036
50 | 10 | traingd | 100 | 5 | 2 | 0.931±0.041
20 | 10 | traingd | 100 | 5 | 4 | 0.942±0.025
10 | 10 | traingd | 100 | 3 | 4 | (repeats can hardly converge)
20 | 5 | traingd | 100 | 2 | 4 | 0.947±0.021

Table 4.7: Prediction performance of neural network using features selected by information

gain.

Note that the experiment in the third row of Table 4.7 used the top ten features ranked by the information gain computed from all samples. The aim of this test was to see how high the accuracy could be when information from the test samples is used. We found that when the training error tolerance was as large as 4, with the top 20 features, the test accuracy could approach that of the third row. If the training error tolerance is smaller than 4, the test performance may be affected by over-fitting.

The tests in Table 4.7 gave us some indication of how to optimize the parameters to obtain better generalization ability. With fine-tuned parameters, we continued to test whether the test errors were concentrated on a few test samples or evenly distributed. The method was the same as in the experiments of Table 4.7. We collected 100 successful trainings and counted the number of times each test sample was wrongly predicted. The configurations and prediction accuracies are listed in Table 4.8, and the wrong prediction frequencies are shown in Figure 4.2. As can be seen from Figure 4.2, for individual experiments with certain configurations, the wrong prediction frequency was very high for certain test samples. In addition, most of the trained neural networks were likely to predict the class of test samples 28 and 29 incorrectly.

Experiment  Number of     Training  Test interval       Number of            Feature  Training error       Mean       Standard
ID          hidden units  method    (number of epochs)  unsuccessful trials  number   tolerance (samples)  precision  deviation
14          5             traingd   100                 1                    15       5                    0.893      0.044
15          3             traingd   50                  1                    50       2                    0.932      0.036
16          3             traingd   50                  1                    50       4                    0.914      0.050
17          3             traingd   50                  1                    20       4                    0.948      0.019
18          3             traingd   50                  1                    100      4                    0.924      0.030
19          3             traingd   50                  1                    100      2                    0.930      0.021

Table 4.8: Test of neural network using features selected by information gain to identify incorrectly predicted test samples.



[Figure omitted: bar chart "Distribution of wrong predictions" showing, for experiments 14 to 19, the number of times each test sample was wrongly predicted (x-axis: test sample ID; y-axis: number of times of wrong prediction).]

Figure 4.2: Wrong prediction times.

4.1.4 AdaBoost

AdaBoost has a mechanism to change the sampling distribution to focus on the training samples with high validation errors, so it may fit the training samples well while keeping good generalization ability. Hence we expected that the AdaBoost framework might further improve the overall classification performance. In each AdaBoost experiment, a number of neural networks of identical size were consecutively trained for 50 epochs each, and those neural networks with training error no higher than the error tolerance were used to refine the sampling distribution. The training process continued until 50 such neural networks were obtained. The final hypothesis for the test samples was then obtained by combining the hypotheses of the individual neural networks with the factors β_1, ..., β_T according to Eq. 3.3.2.1. One disadvantage of AdaBoost is that the training process is slow, so we conducted no more than two tests for each configuration. The test results are listed in Table 4.9. The configurations in tests 1 and 4 were tested only once because their training time was extremely long, but the remaining configurations were tested twice.
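The combination step can be sketched in MATLAB as follows. This follows the standard AdaBoost.M1 weighted-vote rule (assumed here to correspond to the thesis's Eq. 3.3.2.1); H is assumed to be a T x N matrix of individual network predictions in {0,1} for N test samples, beta a vector of the corresponding β_t factors, and the variable names are ours.

% Sketch of the AdaBoost.M1 final hypothesis: a weighted vote of the T
% trained networks, each weighted by log(1/beta_t).
T = size(H, 1);
votes = zeros(1, size(H, 2));
for t = 1:T
    votes = votes + log(1 / beta(t)) * (H(t, :) == 1);   % evidence for class 1
end
threshold = 0.5 * sum(log(1 ./ beta));
final_hypothesis = votes >= threshold;                    % combined prediction in {0,1}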

As can be seen from Table 4.9, the test performance using the top 20 to 50 features is generally better than the results obtained with the same numbers of features in Table 4.7, even when the error tolerance was as low as 2 validation errors. In the AdaBoost tests there were generally one or two errors (97.1% or 94.1% accuracy) on the test samples. The trained networks are expected to fit the training samples well, and the combined hypothesis can also achieve high prediction performance on the test samples.

Test ID  Hidden unit number  Feature number  Training error tolerance  Test accuracy
1        5                   20              3                         0.971
2        5                   10              3                         0.853
3        5                   10              3                         0.794
4        5                   15              4                         0.912
5        5                   50              4                         0.941
6        5                   50              4                         0.971
7        5                   50              2                         0.971
8        5                   50              2                         0.941
9        3                   50              2                         0.971
10       3                   50              2                         0.941

Table 4.9: AdaBoost test results.



4.1.5 Neural network feature selector

Given the test results in the previous section, we continued to investigate whether it is possible to further reduce the number of selected genes without losing too much prediction accuracy. The neural network feature selector, being a wrapper feature selection approach, can exploit the information about the relationship between features and classes that is captured in the trained neural network models. Another advantage of the neural network feature selector is that it can decide the optimum number of selected features. Experiments were carried out using the neural network feature selector, whose algorithm was implemented in MATLAB. In the experiments, a set of features with the highest information gain on the training samples was used as the initial feature set whose size was to be reduced. Because the number of training samples is very small, we used the leave-one-out approach to obtain the average training accuracy and validation accuracy when calling the functions train_validate() and simulate_validate(). Each neural network was trained for a maximum of 200 epochs in the function train_validate(). The rfactor was set to 10. After the selection process, 100 repetitions of training and testing were done with the selected features, and the mean and standard deviation of the training and testing accuracies were calculated. The results are summarized in Table 4.10 and Table 4.11, which list the performance with the summed square error function and with the cross entropy error function, respectively. In these two tables, the column (ε_min, ε_max) contains the thresholds of the penalty parameter ε. Because we increased or decreased the penalty parameter by a factor of 1.1, the values reflect the minimum and maximum number of times that the penalty parameter may increase or decrease cumulatively.

Experiment  Feature set size  r'_min  r''_min  Δr''  (ε_min, ε_max)  Number of features selected  Accuracy on training samples  Accuracy on test samples
1           200               0.9     0.9      0.05  1.1^(±30)       6                            1.00±0.00                     0.67±0.01
2           50                0.9     0.9      0.01  1.1^(±20)       3                            1.00±0.00                     0.79±0.00
3           50                0.95    0.95     0.01  1.1^(±20)       4                            1.00±0.00                     0.88±0.03
4           50                0.97    0.97     0.01  1.1^(±20)       4                            1.00±0.00                     0.88±0.03
5           200               0.95    0.9      0.03  1.1^(±20)       6                            0.98±0.01                     0.71±0.00

Table 4.10: Experiment result of neural network feature selector with summed square error function.

Experiment  Feature set size  r'_min  r''_min  Δr''  (ε_min, ε_max)  Number of features selected  Accuracy on training samples  Accuracy on test samples
1           200               0.9     0.9      0.05  1.1^(±30)       4                            1.00±0.00                     0.71±0.02
2           50                0.9     0.9      0.01  1.1^(±20)       4                            1.00±0.00                     0.72±0.03
3           50                0.95    0.95     0.01  1.1^(±20)       6                            1.00±0.00                     0.68±0.04
4           50                0.97    0.97     0.01  1.1^(±20)       36                           1.00±0.00                     0.80±0.08
5           200               0.95    0.9      0.03  1.1^(±20)       191                          1.00±0.01                     0.79±0.09

Table 4.11: Experiment result of neural network feature selector with cross entropy error function.

From Table 4.10, we can see that under certain settings the neural network feature selector can select a small set of four out of 50 genes, with prediction accuracy on the test samples as high as 88%. In comparison, the experiments in Table 4.11 show that the selection performance is generally worse in terms of both the number of features selected and the prediction accuracy. We noticed that back-propagation training of the neural networks was much faster with the cross entropy error function than with the summed square error function.

4.1.6 Hybrid Likelihood and Recursive Feature Elimination method

We tried the combination of the likelihood (LIK) and recursive feature elimination (RFE) methods. Very good feature selection performance was found using this hybrid method on the Leukemia dataset, which encouraged us to test the method systematically, comparing it with feature selection using LIK and RFE alone. We computed the acceptance rate, a performance measure that is stricter than accuracy.

Suppose there are n samples with predicted values o_1, \ldots, o_n, and their corresponding class labels are y_1, \ldots, y_n. Each class label takes the value +1 or -1, and the outputs are real numbers. If the prediction output of a classifier for a sample has the same sign as that of its true class, we consider this sample to be correctly classified. The performance measure of accuracy, i.e. the number of correctly classified samples over the total number of test samples, is defined as

accuracy = \frac{\left|\{\, i \mid o_i y_i > 0,\; i = 1, \ldots, n \,\}\right|}{n},    (Eq. 4.1.6.1)

where |S| denotes the cardinality of the set S. In contrast, the acceptance rate is computed as follows:

acceptance rate = \frac{\left|\{\, i \mid o_i y_i > -\min_{j=1,\ldots,n}(o_j y_j),\; i = 1, \ldots, n \,\}\right|}{n}.    (Eq. 4.1.6.2)

The strength of the prediction for a sample can be obtained by multiplying its output and its class label together, o_i y_i; the larger the value of the product, the better the prediction made by the classifier. When the value of the product is negative, the classifier makes a wrong prediction for the sample. To calculate the acceptance rate, we first select the worst prediction among all test samples, which corresponds to the sample whose o_j y_j is minimum. This minimum value is multiplied by -1 and used as a threshold. All predictions whose o_i y_i value is greater than this threshold are considered accepted. The acceptance rate is 1 when all test samples are correctly predicted. Otherwise, it is not greater than the accuracy, because the classifier's predictions on some test samples may have small confidence, as indicated by output-class label products that are lower than the threshold. Such test samples are correctly predicted and are counted in the computation of accuracy, but they are not counted in the computation of the acceptance rate.

In the tables and figures in this section, we will denote accuracy and acceptance rate by acu

and acp, respectively. Obviously, the acceptance rate cannot be higher than accuracy.
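For concreteness, a minimal MATLAB sketch of the two measures follows, assuming o and y are equally sized vectors of real-valued classifier outputs and {-1, +1} labels; the variable names are ours.

% Sketch of the two performance measures defined above.
margin = o .* y;                               % o_i * y_i: positive means correct
acu = sum(margin > 0) / numel(y);              % accuracy, Eq. 4.1.6.1
acp = sum(margin > -min(margin)) / numel(y);   % acceptance rate, Eq. 4.1.6.2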

Because the number of samples was small, we used the leave-one-out method for validating

the classifier on training samples as well as on all samples. When there are n samples, leave-

one-out is a technique to iteratively choose each sample for testing, and the remaining

samples for training. A total of n classifiers were trained, and n predictions were made. The

accuracy and acceptance rates were computed from the predictions and labels of the

corresponding test samples.
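The leave-one-out procedure can be sketched in MATLAB as follows; train_classifier and predict_output are placeholders standing in for whichever classifier (SVM or Naïve Bayesian) is being validated, not actual toolbox calls.

% Sketch of leave-one-out validation; X is features x samples, y holds
% the {-1,+1} labels, and o collects the n held-out predictions.
n = size(X, 2);
o = zeros(1, n);
for i = 1:n
    keep = [1:i-1, i+1:n];                           % leave sample i out
    model = train_classifier(X(:, keep), y(keep));   % placeholder training call
    o(i) = predict_output(model, X(:, i));           % placeholder prediction call
end
% acu and acp are then computed from o and y as in the previous sketch.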

We ran our experiments on a Pentium 4 1.4GHz computer with 512-megabyte memory. We

wrote and ran our program using MATLAB 6.1. The support vector machine was constructed

with the Support Vector Machine Toolbox from

http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox which was developed by Gavin Cawley. For

the SVM, we set C = 100.0 and used the linear kernel, the same as those used by Guyon et al.

(2002).

4.1.6.1 Leukemia dataset

Figure 4.3 shows the sorted LIK scores. We chose equal numbers of genes with the highest LIK_{ALL→AML} and LIK_{AML→ALL} scores as the initial gene sets for RFE. We plotted in this figure the scores of the top (2 x 80) genes. The i-th top gene according to LIK_{ALL→AML} always has a higher LIK_{ALL→AML} value than the LIK_{AML→ALL} score of the corresponding i-th top gene. Genes with low LIK scores are not expected to be good discriminators. We decided to pick the top genes to check their discriminating ability. In particular, we ran experiments using the top (2 x 10), (2 x 20), and (2 x 30) genes. We found that the best performance was obtained when the 2 x 20 top ranking genes were selected. The performance was measured by computing the prediction accuracy and acceptance rate on the test samples of SVM and Bayesian classifiers built using the selected genes.
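A minimal MATLAB sketch of how such an initial gene set can be formed from the two LIK rankings is shown below; lik_ab and lik_ba are assumed to be vectors holding the per-gene LIK_{ALL→AML} and LIK_{AML→ALL} scores, and the names are ours.

% Sketch: take the top m genes from each of the two LIK rankings
% (m = 20 gives the 2 x 20 setting used here); duplicates count once.
m = 20;
[dummy, order_ab] = sort(-lik_ab);          % descending order of LIK_{ALL->AML}
[dummy, order_ba] = sort(-lik_ba);          % descending order of LIK_{AML->ALL}
initial_genes = union(order_ab(1:m), order_ba(1:m));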

Figure 4.4 shows the accuracy and acceptance rate using two different experimental settings:

leave-one-out and train-test split. For the leave-one-out (LOO) setting, we computed the

performance measures using only the 38 training samples as well as on the entire dataset

consisting of 72 samples. For the train-test split, the measures shown were computed on the

34 test samples, while the measures on the 38 training samples are not reported in the table. A

series of experiments were conducted to find the smallest number of genes that would give

good performance measure. The experiments started with all 40 (= 2 x 20) genes selected by

the LIK feature selection. One gene at a time was eliminated using RFE. RFE feature

selection was conducted until there was only one gene left. For a selected subset of genes, the

performance measures were computed under all experimental settings and using both SVM

and Bayesian classifiers.



[Figure omitted: sorted LIK scores plotted against gene rank (y-axis: LIK score; x-axis: gene rank, 0 to 120).]

Figure 4.3: Sorted LIK scores of a subset of genes in the leukemia dataset. Dots indicate LIK_{ALL→AML} scores and circles indicate LIK_{AML→ALL} scores. The top 28 genes according to their LIK_{ALL→AML} values have scores between 92014 and 1978.3, and the top 15 genes according to their LIK_{AML→ALL} values have scores between 22852 and 2148.7; they are not shown in this figure.

As can be seen from the figure, the SVM classifier achieved almost perfect accuracy and

acceptance rate when there were three to 14 genes used to find the separating hyperplane. On

the other hand, when the Naïve Bayesian method was used for classification, almost perfect

performance was achieved with as many as 40 genes in the model. Elimination of the genes

by RFE one by one showed that the results could be maintained as long as there are at least

three genes in the model. This stability in performance indicates the robustness of the RFE

feature selection method when given a pre-selected small subset of relevant genes, as

identified by the LIK method. It is worth noting that the acceptance rate on the test samples

was almost constant with at least three genes, both when the SVM classifier and the Naïve

Bayesian classifier were used for prediction. We emphasize here that the hybrid LIK+RFE

feature selection was run using the 38 training samples; the classifiers were also built using

the same set of training samples without the use of any information from the data in the test

set.

A set of three genes was discovered to give perfect accuracy and acceptance rate regardless of

the experimental settings and the classifiers used. These genes are listed in Table 4.12. They

have also been identified as relevant genes in this dataset by several researchers. Golub et al.

(1999) identified U05259_rna1_at and M27891_at as relevant, while Keller et al. (2000)

identified the gene X03934_at as relevant. On the other hand, Guyon et al. (2002) identified a

completely different set consisting of four genes. Among these four genes, only M27891_at

occurs in the previous C4.5 result.



[Figure omitted: two panels, "LIK+RFE classification performance using SVM" and "LIK+RFE classification performance using Bayesian method", each plotting accuracy and acceptance rate (training sample LOO, test samples, all sample LOO) against the number of genes from 40 down to 1.]

Figure 4.4: Classification performance of genes selected using the hybrid LIK+RFE.

Gene accession number  Description
U05259_rna1_at         MB-1 gene
M27891_at              CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage)
X03934_at              GB DEF = T-cell antigen receptor gene T3-delta

Table 4.12: The smallest gene set found that achieves perfect classification performance.

[Figure omitted: three-dimensional scatter plot of the training and test samples (ALL T-cell, ALL B-cell and AML, train and test) against the expression values of U05259_rna1_at, M27891_at and X03934_at.]

Figure 4.5: Plot of the leukemia data samples according to the expression values of the three genes selected by the hybrid LIK+RFE.

Since only three genes were selected by the hybrid LIK+RFE method, we are able to visualize the distribution of both the training and test samples in a three-dimensional space. Figure 4.5 shows the plot of the samples. In this figure, we differentiate between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) samples. There were actually two different types of ALL samples: B-cells or T-cells, as determined by whether they arose from a B or a T cell lineage (Keller et al., 2000). From the figure, we can see that all except one B-cell sample had almost constant expression values for two genes, namely M27891_at and X03934_at. Training sample number 17 was the one ALL B-cell sample that was an outlier. On the other hand, all T-cell samples had almost constant expression values for the genes U05259_rna1_at and M27891_at, while all AML samples had similar expression values for U05259_rna1_at and X03934_at. The plot shows that the three selected genes were also useful in differentiating ALL B-cell and T-cell samples.

For comparison purposes, the classification performance of SVM and Naïve Bayesian classifiers built using genes selected according to their LIK scores only is shown in Figure 4.6. For the results shown in this figure, we started with the same set of 40 genes and removed one gene at a time according to their LIK scores. As can be seen from the figure, the results were not as good as those shown in Figure 4.4. In particular, using SVM classifiers, the accuracy and the acceptance rate were more than 80 percent when there were still more than 20 genes in the model, and the acceptance rate dropped drastically when there were fewer genes. Naïve Bayesian classifiers performed well when there were more than 21 genes. Further removal of genes according to their LIK scores caused the acceptance rate to drop considerably. When there were fewer than five genes, the accuracy and the acceptance rate of the classifiers were low.



[Figure omitted: two panels, "LIK classification performance using SVM" and "LIK classification performance using Bayesian method", each plotting accuracy and acceptance rate (training sample LOO, test samples, all sample LOO) against the number of genes from 40 down to 1.]

Figure 4.6: Performance of SVM and Naïve Bayesian classifiers built using genes selected according to LIK scores.

The performance of SVM and Naïve Bayesian classifiers built using the genes selected by RFE is depicted in Figure 4.7. We started with all 7129 genes in the feature set. We built an SVM using the training samples with the expression values of all the genes and measured its performance on the test samples. We also built a Naïve Bayesian classifier and measured its performance as well. The gene that had the smallest absolute weight in the SVM-constructed hyperplane was removed, and the process of training and testing was repeated with one fewer gene. This process was continued until there were no more genes to be removed.
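The elimination loop just described can be sketched in MATLAB as follows; svm_train_linear is a placeholder for the linear SVM training call of the toolbox and is assumed to return the label-multiplied support vector coefficients alpha and the support vector indices sv, so the details are ours rather than the exact thesis code.

% Sketch of SVM-based recursive feature elimination (RFE).
genes = 1:size(Xtr, 1);                          % Xtr is genes x training samples
while numel(genes) > 1
    [alpha, sv] = svm_train_linear(Xtr(genes, :), ytr);   % placeholder call
    w = Xtr(genes, sv) * alpha(:);               % weight vector of the hyperplane
    [dummy, worst] = min(abs(w));                % gene with the smallest |w_j|
    genes(worst) = [];                           % eliminate it and retrain
    % (classification performance would be evaluated here at each step)
end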

[Figure omitted: "RFE classification performance" plotting SVM and Bayesian accuracy and acceptance rate against the number of genes from 7129 down to 1.]

Figure 4.7: Classification performance of purely using RFE. Classification performance of SVM and Naïve Bayesian classifiers using genes selected by RFE starting from 7129 genes down to only one gene. The experimental setting was the train-test split and the performance measures were shown on the 34 test samples.



An interesting point to note from the results depicted in Figure 4.7 is the sharp improvement

in the acceptance rate of the Bayesian classifiers when the number of genes was reduced from

2773 to 2772. The gene that was eliminated at this stage was M26602_at. The acceptance

rates stayed at 100 percent when there were 2772 to 1437 genes. Further removal of genes

caused the rate to deteriorate gradually. On the other hand, the performance of SVM was

more stable. With more than 519 genes, both the accuracy and the acceptance rate were at

least 90 percent.

We also experimented with choosing the top genes according to their LIK scores. We selected genes with LIK scores higher than a certain threshold. The threshold values tested were 1500, 2000, and 2500. Note that there were always more genes selected because of their high LIK_{ALL→AML} values than genes selected because of their LIK_{AML→ALL} values. The best performance was obtained when the threshold was set to 1500. A total of 62 genes met this threshold value and were used to form the initial gene set for RFE. After applying RFE, we obtained a set of four genes that achieves perfect accuracy and acceptance rate on the training and test samples under all three experimental settings. The set of four selected genes is shown in Table 4.13. Two of the four genes were the same as those selected using the (2 x 20) top initial genes listed in Table 4.12; these were U05259_rna1_at and M27891_at. The gene M16336_s_at was also found by Keller et al. (2000) to be an important gene for classification.

The performance of the SVM classifiers with genes selected using just the RFE approach was

slightly different from that reported by Guyon et al. (2002). The reason for this could be the

variation in the implementation of the quadratic programming solvers. The Matlab toolbox

uses Sequential Minimal Optimisation algorithm (Platt, 1999), while Guyon et al. used a

variant of the soft-margin algorithm for SVM training (Cortes, 1995). Our hybrid LIK+RFE

method achieved better performance than other methods reported in the literature. To achieve

perfect performance, the RFE implementation of Guyon et al. needed eight genes. When the

number of genes was reduced to four, the leave-one-out results on the training samples using

SVM achieved only 97 percent accuracy and 97 percent acceptance rate. SVMs trained on 38

training samples with the four selected genes achieved only 91 percent accuracy and 82

percent acceptance rate on the test samples.

Gene accession number  Description
U05259_rna1_at         MB-1 gene
M16336_s_at            CD2 CD2 antigen (p50), sheep red blood cell receptor
M27891_at              CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage)
X58072_at              GATA3 GATA-binding protein 3

Table 4.13: The genes selected by the hybrid LIK+RFE method. The genes that have LIK scores of at least 1500 were selected initially. RFE was then applied to select these four genes, which achieved perfect performance.

Using the genes selected according to their LIK scores and applying the Bayesian method,

Keller et al. (2000) achieved 100 percent prediction with more than 150 genes. Hellem and

Jonassen (2002) required 20 to 30 genes to obtain accurate prediction by ranking pair-wise

contribution of genes to the classification. The classification of the samples was obtained by

applying k-nearest neighbours, diagonal linear discriminant and Fisher’s linear discriminant

methods. Guyon et al. also mentioned the performance of other works on this dataset

(Mukherjee et al., 2000; Chapelle et al., 2000; Weston et al., 2001). None of these works

reported performance results that are as good as ours.



Besides LIK, we also tried combining other univariate feature ranking methods with RFE: the baseline criterion proposed by Golub et al. (1999), Fisher's criterion, and the extremal margin criterion (Guyon et al., 2002). In a way similar to LIK+RFE, we chose the top 40 genes ranked by each of these three methods and then let RFE perform further feature set reduction. Figure 4.8 shows the accuracy on the test samples when running RFE starting from the initial feature sets selected by the three univariate methods.

It can be seen from Figure 4.8 that the prediction accuracies of the baseline and Fisher's criteria on the test samples are similar. In the elimination process, the accuracy was kept above 80% for SVM prediction and 90% for Bayesian prediction as long as there were more than five genes remaining in the set, and it dropped drastically when there were fewer than five genes remaining. RFE starting from the genes selected by the extremal margin ranking performed better than the other two ranking methods, especially in Bayesian prediction accuracy, which remained no less than 97% until there was only one gene left in the gene set. In particular, perfect prediction was achieved when there were five genes left in the set.

[Figure omitted: two panels, "Bayesian prediction" and "SVM prediction", plotting test accuracy against the number of genes (40 down to 5) for the Baseline, Fisher and Extremal margin rankings combined with RFE.]

Figure 4.8: SVM and Bayesian prediction accuracy on test samples using features selected by RFE combined with baseline criterion, Fisher's criterion and Extremal margin method.

4.1.6.2 Small, round blue cell tumors dataset

The second dataset used to test LIK+RFE is from small, round blue cell tumors (SRBCTs) (Khan et al., 2001). There are 88 samples altogether with 2308 genes, divided into four classes: neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphomas (BL) and the Ewing family of tumors (EWS). Because our focus is on binary classification, we decomposed the problem into four one-against-rest binary classification problems of the second type. In (Khan et al., 2001) there are 63 training samples, which consist of 23 EWS, 8 BL, 12 NB, and 20 RMS samples, and 25 test samples, which consist of 6 EWS, 3 BL, 6 NB, 5 RMS samples, and non-SRBCT samples. After testing the classification performance on the individual problems, we performed a test in which all the classifiers were combined to predict the samples that do not belong to any of these four classes.

We obtained the expression ratio data of Khan et al. (2001). Before we conducted our experiments, the expression values were transformed by computing their logarithmic values. A base 2 log transformation was used, as this is the usual practice employed by researchers analyzing microarray data. In Figure 4.9, the plot of the gene ranking according to the LIK scores is shown. The LIK scores were computed for differentiating EWS samples from non-EWS samples. The set of top 20 genes according to the LIK_{EWS→non-EWS} ranking contained eight genes that were also in the set of top 20 genes according to the LIK_{non-EWS→EWS} ranking. Hence, when RFE was applied to further eliminate genes from the feature set, it started with 32 unique genes.
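The preprocessing and problem decomposition described above can be sketched in a few lines of MATLAB; ratio_data and class_labels are assumed input names (an expression-ratio matrix and a cell array of class strings), not variables from the thesis.

% Sketch: base 2 log transform and one-against-rest label construction.
X = log2(ratio_data);                               % genes x samples, log-ratios
y_ews = 2 * strcmp(class_labels, 'EWS') - 1;        % +1 for EWS, -1 for the rest
y_bl  = 2 * strcmp(class_labels, 'BL')  - 1;        % and similarly for BL, NB, RMS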

[Figure omitted: sorted LIK scores plotted against gene rank (y-axis: LIK score; x-axis: gene rank, 0 to 600).]

Figure 4.9: Sorted LIK scores of genes in the SRBCT dataset. Dots indicate LIK_{EWS→non-EWS} scores and circles indicate LIK_{non-EWS→EWS} scores.

For the other three classification problems, the plots look very similar to Figure 4.9 and are not shown here. For each of the four problems, LIK selected the top (2 x 20) genes. The numbers of unique genes selected by LIK and the results of the experiments from solving the four binary classification problems are summarized in Table 4.14. The numbers of unique genes selected by LIK and the smallest numbers of genes required to achieve near perfect performance during the gene elimination process by RFE are shown in the second column of the table. For the three classification problems that identify EWS, BL and NB, the accuracy and the acceptance rates were at least 98 percent for all experimental settings; many of them are perfect performance results, as can be seen in the table. For the fourth classification problem, differentiating between RMS and non-RMS samples, the accuracy rates were at least 92 percent; however, the acceptance rate on the test samples dropped to eight percent for the SVM classifier and 16 percent for the Naïve Bayesian classifier.

                              SVM                                                Bayesian
Classification    Initial/    Train LOO      Test           All LOO             Train LOO      Test           All LOO
problem           final       Acu    Acp     Acu    Acp     Acu    Acp          Acu    Acp     Acu    Acp     Acu    Acp
                  genes
EWS vs non-EWS    32/5        1.00   1.00    1.00   1.00    0.99   0.99         1.00   1.00    1.00   1.00    1.00   1.00
BL vs non-BL      37/3        1.00   1.00    1.00   1.00    1.00   1.00         1.00   1.00    1.00   1.00    1.00   1.00
NB vs non-NB      34/3        1.00   1.00    1.00   1.00    1.00   1.00         0.98   0.98    1.00   1.00    1.00   1.00
RMS vs non-RMS    34/4        1.00   1.00    0.92   0.08    0.99   0.88         1.00   1.00    0.92   0.16    0.97   0.35

(Train LOO = leave-one-out on the training samples; Test = prediction on the test samples; All LOO = leave-one-out on all samples.)

Table 4.14: Experimental results for the SRBCT dataset using the hybrid LIK+RFE.

The poor acceptance rate obtained when predicting RMS test samples suggests that the differences between the outputs of the classifiers and the actual target values were large for the

incorrectly predicted samples. In order to verify the predictions, we plotted the distribution of

the samples according to the expression values of three genes, ImageID784224 (fibroblast

growth factor receptor 4), ImageID796258 (sarcoglycan, alpha), and ImageID1409509

(troponin T1). The three were selected because their corresponding SVM weights were the

largest. The plot is shown in Figure 4.10. We can clearly see that the two incorrectly

classified non-RMS samples were outliers with large values for ImageID1409509 (troponin

T1). These two outliers were Sk. Muscle samples TEST-9 and TEST-13, which were

misclassified as RMS samples. It should be noted that there were no Sk. Muscle samples in

the training dataset.



[Figure omitted: three-dimensional scatter plot of RMS and non-RMS training and test samples against the expression values of fibroblast growth factor receptor 4, sarcoglycan alpha and troponin T1.]

Figure 4.10: Plot of RMS and non-RMS samples. Plot of all 88 RMS and non-RMS samples according to the expression values of three of the four selected genes.

The genes selected by the hybrid LIK+RFE for each of the four classification problems are listed in Table 4.15. For the problem of differentiating EWS from non-EWS samples, our method selected five genes, all of which were also selected by Khan et al. (2001). On the other hand, to differentiate between NB and non-NB samples, only three genes were needed, and none of them was selected by Khan et al. Altogether, the hybrid LIK+RFE identified 15 important genes. This number compares favorably with the total of 96 genes selected by the PCA (Principal Component Analysis) approach of Khan et al.



Classification problem  Reported by Khan et al. (2001)  Image ID  Description
EWS vs non-EWS          Y                               377461    caveolin 1, caveolae protein, 22kD
                        Y                               295985    ESTs
                        Y                               80338     selenium binding protein 1
                        Y                               52076     olfactomedin-related ER localized protein
                        Y                               814260    follicular lymphoma variant translocation 1
BL vs non-BL            Y                               204545    ESTs
                                                        897164    catenin (cadherin-associated protein), alpha 1 (102kD)
                        Y                               241412    E74-like factor 1 (ets domain transcription factor)
NB vs non-NB                                            45632     glycogen synthase 1 (muscle)
                                                        768246    glucose-6-phosphate dehydrogenase
                                                        810057    cold shock domain protein A
RMS vs non-RMS                                          897177    phosphoglycerate mutase 1 (brain)
                        Y                               784224    fibroblast growth factor receptor 4
                        Y                               796258    sarcoglycan, alpha (50kD dystrophin-associated glycoprotein)
                        Y                               1409509   troponin T1, skeletal, slow

Table 4.15: The genes selected by the hybrid LIK+RFE for the four binary classification problems.

We also tested the classification performance of SVM and Naïve Bayesian classifiers on genes selected purely on the basis of their LIK scores. For comparison purposes, for each of the four

problems, the number of genes was set to be the same as the corresponding final number

selected by the hybrid LIK+RFE shown in Table 4.14. Table 4.16 summarizes the results. For

three of the four classification problems, the performance of the classifiers was not as good as

the results reported in Table 4.14. The accuracy and acceptance rates dropped to as low as 52

percent. The most unexpected results came from the fourth problem to differentiate between

RMS and non-RMS samples. The SVM classifier achieved perfect accuracy and acceptance

rate using four genes, while the Naïve Bayesian classifier managed to obtain at least 92

percent accuracy and acceptance rate. The four genes were ImageID461425 (MLY4),

ImageID784224 (fibroblast growth factor receptor 4), ImageID296448 (insulin-like growth


4-91

factor 2), and ImageID207274 (Human DNA for insulin-like growth factor II). All these

genes were among the 96 genes identified by Khan et al. (2001). Of these four, only one was

selected by LIK+RFE, that is, ImageID784224.

                              SVM                                                Bayesian
Classification    Number      Train LOO      Test           All LOO             Train LOO      Test           All LOO
problem           of genes    Acu    Acp     Acu    Acp     Acu    Acp          Acu    Acp     Acu    Acp     Acu    Acp
EWS vs non-EWS    5           1.00   1.00    0.92   0.88    0.95   0.88         0.98   0.97    0.84   0.84    0.95   0.86
BL vs non-BL      3           0.95   0.92    0.88   0.88    0.97   0.83         0.98   0.98    0.88   0.76    0.93   0.88
NB vs non-NB      3           0.95   0.92    0.84   0.76    0.97   0.86         0.97   0.97    0.80   0.52    0.95   0.92
RMS vs non-RMS    4           1.00   1.00    1.00   1.00    1.00   1.00         0.97   0.95    0.92   0.92    0.97   0.95

(Train LOO = leave-one-out on the training samples; Test = prediction on the test samples; All LOO = leave-one-out on all samples.)

Table 4.16: The performance of SVM and Naïve Bayesian classifiers built using the top genes selected according to their LIK scores.
genes selected according to their LIK scores.

In comparison, Khan et al. (2001) used neural networks for multiple classifications to achieve

93 percent EWS, 96 percent RMS, 100 percent BL and 100 percent NB diagnostic

classification performance on the 88 training and test samples. Since there were four classes

of training data samples, each neural network had four output units. The target outputs were

binary encoded, for example, for an EWS sample the target was (EWS=1, RMS=NB=BL=0).

A total of 3750 neural networks calibrated with 96 genes were required. The highest average

output value from all neural networks determined the predicted class of a new sample. The

Euclidean distance between the average values and the target values was computed for all

samples in order to derive the probability distribution of the distances. A test sample would

be diagnosed as a member of one of the four classes based on the highest average value given

by the neural networks. This was provided that the distance value falls within the 95th

percentile of the corresponding distance probability of the predicted class. Otherwise, the

diagnosis would be rejected and the sample would be classified as a non-SRBCT sample. Of

the 88 samples in the training and test datasets, eight were rejected. Five of these were non-

SRBCT samples in the test set, while the other three actually belonged to the correct class but

their distances lay outside the threshold of the 95th percentile.

In order to visualize the distribution of the samples based on the expression values of the

selected genes, we performed clustering of the genes using the EPCLUST program

(http://ep.ebi.ac.uk/EP/EPCLUST). The default setting of the program was adopted; the

average linkage clustering and uncentered correlation distance measure were used. Figure

4.11 shows the clusters. It can be seen clearly from this figure that there existed four distinct

clusters corresponding to the four classes in the data. Most of the samples of a class fell into

their own corresponding clusters. The five non-SRBCT samples lay between clusters. We

conjecture that samples between clusters might not belong to any classes found in the training

dataset. Two between-cluster samples, RMS-T7 and TEST-20, were exceptions. RMS-T7, which was nearer to the two Sk. Muscle samples TEST-9 and TEST-13, was actually an RMS sample. TEST-20, which was nearer to the Prostate sample TEST-11 than to the EWS cluster, was actually an EWS sample. These exceptions were consistent with the neural network

prediction results of Khan et al. (2001) as the neural networks predicted TEST-9 and TEST-

13 to be RMS class, and they predicted TEST-20 and TEST-11 to be EWS class. Both

predictions, however, did not meet the 95th percentile distance criterion and were therefore

rejected. This indicated that these samples were also difficult to differentiate by the neural

networks. Different results from our clustering and the neural network classification can be

seen for test sample TEST-3, a non-SRBCT sample. The clustering placed TEST-3 between

BL and NB clusters. But the neural networks predicted this sample as an RMS sample

without meeting the 95th percentile distance criterion.

Most of the genes selected from the Leukemia and SRBCT datasets by our hybrid LIK+RFE method have some relevance to cancer according to a literature search in PubMed, a document retrieval service of the National Library of Medicine of the United States. However, biological

experiments need to be done for further validation of the role of these genes. The

performance of the method is also data dependent, as demonstrated in the significant

difference in the acceptance rate of the classifiers for the first three binary classification

problems and the fourth problem in the SRBCT dataset. Overall, we observe that the

classification performance on the test set generally does not change much with the

consecutive elimination of a few genes. The removal of one gene would not normally cause a

drastic change in the performance of the classifier. Significant drops in accuracy and/or the

acceptance rate are observed most frequently when a gene is removed from the optimal set.

Figure 4.11: Hierarchical clustering of SRBCT samples with the 15 selected genes.



4.1.6.3 Artificial datasets

In order to study how the expression values of irrelevant genes affect the selection performance of RFE, we generated three types of artificial datasets. Each dataset consists of 40 training samples (20 positive class + 20 negative class) and 40 test samples (also 20 positive class + 20 negative class). All datasets consist of features that are relevant and irrelevant to the classification. The values of the irrelevant features are sampled from a normal distribution with standard deviation 1 and mean 0, regardless of the class labels of the samples. The three types of datasets differ in the way the relevant features are constructed. For the first type, the relevant features contribute independently to the classification: their values are normally distributed with standard deviation 1 and mean x or -x according to the class labels, where for each relevant feature x is randomly chosen from a uniform distribution on the interval (0,1). If x is big enough, a univariate feature selection method is likely to be able to select the relevant features. The relevant features of the second type of dataset are constructed by considering the joint effect of the features: suppose there are k relevant features X_1, ..., X_k, where X_1, ..., X_{k-1} are sampled from a normal distribution in the same way as the irrelevant features. We set the remaining relevant feature to X_k = \sum_{i=1}^{k-1} X_i ± α, where α is randomly sampled from a normal distribution of standard deviation 0 and mean 1, and α is added to or subtracted from the sum depending on the class label of the sample. It is expected to be harder for a univariate method to select the first k-1 relevant features. The third type of dataset is constructed by setting the first k-1 relevant features in the same way as in the first type, but the k-th feature is set in the same way as in the second type. This type of dataset better models the expression of genes related to cancer, where individual genes show a certain degree of observable contribution to the cancer but interact with other genes to cause the disease.
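A minimal MATLAB sketch of generating one first-type training set under the description above is given below (the second and third types would additionally replace the last relevant feature by the signed sum described in the text); the sizes and names are ours.

% Sketch: one type-1 artificial training set with 100 features,
% 4 of them relevant, and 20 samples per class.
n_pos = 20; n_neg = 20; n_feat = 100; n_rel = 4;
y = [ones(1, n_pos), -ones(1, n_neg)];        % class labels
X = randn(n_feat, n_pos + n_neg);             % irrelevant features ~ N(0,1)
for j = 1:n_rel
    x = rand;                                 % class-mean offset drawn from U(0,1)
    X(j, :) = X(j, :) + x * y;                % mean +x for class +1, -x for class -1
end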



We tried LIK, RFE and LIK+RFE on datasets of all three types above, each containing a total of 100 features, 4 of which are relevant. In each experiment, we let the selection method select the features based on the training samples and then tested the classification performance on the test samples. For testing RFE, one feature was eliminated at a time. For testing LIK, we included the same number of top ranking features from both the LIK_{a→b} and LIK_{b→a} rankings. For testing the combination LIK+RFE, we chose the top 5+5 and the top 10+10 LIK ranked features as input to RFE. We used SVM to test the classification performance; the SVM settings for RFE and for classification were the same as those in Section 4.1.6.1. Before applying the methods, the datasets were normalized in the same way as the Leukemia dataset. For each setting, 100 experiments were conducted, each on a different generated dataset, and the means and standard deviations of the test results are shown in Table 4.17. The table shows the type of the datasets; the number of LIK ranked features used as input for RFE (lik_num), and within it the number of distinct ones (num_lik_rfe); the accuracy of SVM prediction on the datasets using all features and using only the relevant features (acc_all and acc_rev, respectively), where acc_rev indicates the ideal prediction performance; the corresponding numbers of support vectors of the trained SVMs (num_sv_all and num_sv_rev, respectively); and, for running RFE, LIK, and LIK+RFE, the highest accuracy obtained (max_rfe_acc, max_lik_acc, and max_lik_rfe_acc, respectively), the smallest number of features needed to achieve the highest accuracy (num_rfe_acc, num_lik_acc, and num_lik_rfe_acc, respectively), the number of relevant features among the selected features (num_rfe_acc_rev, num_lik_acc_rev, and num_lik_rfe_acc_rev, respectively), and the fraction of runs in which the last relevant feature is among these selected features (contain_rfe_acc_last_rev, contain_lik_acc_last_rev, and contain_lik_rfe_acc_last_rev, respectively).



lik_num 10 20

dataset_type 1 2 3 1 2 3

num_lik_rfe 8.57 ± 0.97 9.01 ± 1.00 8.40 ± 0.83 14.66 ± 1.34 14.97 ± 1.47 14.86 ± 1.20
num_lik_sel 2.71 ± 0.87 1.05 ± 0.59 2.91 ± 0.71 2.79 ± 0.95 1.24 ± 0.62 3.22 ± 0.66

acc_all 0.66 ± 0.09 0.55 ± 0.08 0.71 ± 0.09 0.65 ± 0.11 0.55 ± 0.08 0.71 ± 0.08
num_sv_all 35.25 ± 1.56 36.07 ± 1.61 34.39 ± 1.61 35.14 ± 1.92 36.16 ± 1.64 34.13 ± 1.94
acc_rev 0.83 ± 0.08 0.81 ± 0.07 0.88 ± 0.06 0.82 ± 0.09 0.81 ± 0.06 0.89 ± 0.06
num_sv_rev 13.15 ± 6.45 14.87 ± 4.47 8.76 ± 4.21 12.79 ± 6.66 15.08 ± 4.68 8.29 ± 3.75

max_rfe_acc 0.82 ± 0.08 0.69 ± 0.07 0.89 ± 0.07 0.82 ± 0.09 0.69 ± 0.07 0.89 ± 0.07
num_rfe_acc 9.79 ± 17.15 13.38 ± 17.62 3.51 ± 7.06 7.20 ± 11.85 11.75 ± 17.55 3.98 ± 8.70
num_rfe_acc_rev 2.19 ± 1.01 1.39 ± 0.75 1.29 ± 0.62 1.90 ± 0.89 1.39 ± 0.82 1.29 ± 0.61
contain_rfe_acc_last_rev 0.53 ± 0.50 0.90 ± 0.30 0.97 ± 0.17 0.50 ± 0.50 0.90 ± 0.30 0.94 ± 0.24

max_lik_acc 0.84 ± 0.08 0.70 ± 0.07 0.91 ± 0.06 0.84 ± 0.08 0.70 ± 0.07 0.90 ± 0.06
num_lik_acc 10.20 ± 17.07 24.26 ± 27.35 5.20 ± 11.70 6.97 ± 11.14 18.02 ± 20.83 3.76 ± 5.94
num_lik_acc_rev 2.45 ± 1.00 1.66 ± 1.06 1.81 ± 0.95 2.08 ± 0.97 1.40 ± 0.90 1.81 ± 0.97
contain_lik_acc_last_rev 0.63 ± 0.49 0.94 ± 0.24 1.00 ± 0.00 0.52 ± 0.50 0.92 ± 0.27 1.00 ± 0.00

max_lik_rfe_acc 0.81 ± 0.09 0.66 ± 0.09 0.89 ± 0.07 0.82 ± 0.09 0.66 ± 0.09 0.88 ± 0.07
num_lik_rfe_acc 3.17 ± 2.06 3.14 ± 2.56 1.83 ± 1.48 3.92 ± 3.20 4.24 ± 3.94 2.41 ± 3.15
num_lik_rfe_acc_rev 1.96 ± 0.97 0.84 ± 0.55 1.25 ± 0.58 1.91 ± 0.91 0.85 ± 0.61 1.38 ± 0.76
contain_lik_rfe_acc_last_rev 0.51 ± 0.50 0.72 ± 0.45 0.96 ± 0.20 0.41 ± 0.49 0.70 ± 0.46 0.92 ± 0.27

Table 4.17: Test result of LIK, RFE and LIK+RFE on artificial datasets

It can be seen from Table 4.17 that, as expected, in terms of the number of features selected and the accuracy (rows max_*_acc and num_*_acc), RFE is similar to LIK for datasets of types 1 and 3, while RFE is significantly better than LIK for datasets of type 2. However, the results from all three types of datasets indicate that LIK+RFE selected significantly more accurate feature sets, in terms of the number of relevant features over the number of selected features, without significant loss of prediction performance (rows max_*_acc, num_*_acc and num_*_acc_rev).

It can also be seen from the table that when the number of irrelevant features is large compared with the number of relevant features, the classification performance of the SVM deteriorates and almost all training samples become support vectors (row num_sv_all compared with row num_sv_rev). This phenomenon also occurs when SVM is applied to datasets of the second type (high dimension) of classification problem, such as the Leukemia and SRBCT datasets. From our experience, the number of support vectors decreases significantly only when the number of features used for training is near or below the number of training samples. We suspect this phenomenon is related to the learning capacity of the SVM, but we have not found a theoretical basis for it.

The assumption of single or two-component normal distributions for the irrelevant and relevant features is simple and may not accurately reflect the real situation in microarray datasets; in particular, the combinatorial effects of features are not fully modeled. However, as can be seen from the results on these simple datasets, it is quite likely that the weights of an SVM trained on a mixture of many irrelevant features and few relevant features are unable to truly measure the contribution of the features to the classification. Some relevant features are incorrectly eliminated by RFE when it starts from all features. By comparison, the LIK ranking is able to keep most of the relevant features for type 1 and type 3 datasets, which enables RFE to perform further selection more effectively.

4.1.7 The combination of Likelihood method and Fisher’s method

We also compared LIK+RFE with combinations of univariate and multivariate versions of the Likelihood method and Fisher's method. In these experiments, a set of features was first selected by a univariate method, in the same fashion as in the experiments for LIK+RFE, and a multivariate method was then used to eliminate features recursively from that feature set. We tested the methods on the Leukemia dataset and set the size of the initial set to 30, as Fisher's linear discriminant encounters matrix inversion problems if the initial gene set size is bigger than the number of training samples. The algorithms were implemented and run in MATLAB 6.1.

Figure 4.12 shows that for the combinations F_F, L_L and F_L, the accuracy of Bayesian classification on the test samples was generally better than that of SVM prediction based on the same set of genes, whereas for L_F the SVM prediction was better than that of the Bayesian method. In both the SVM and the Bayesian test results, L_L outperformed the other three combinations when there were more than 15 genes remaining in the gene set. However, its performance dropped drastically when there were fewer than four genes remaining; in this situation, the combination F_F was the best. Nevertheless, none of these combinations of the four methods outperformed LIK+RFE in our tests on the Leukemia dataset in terms of classification accuracy based on the same number of genes.

[Figure omitted: two panels, "SVM prediction" and "Bayesian prediction", plotting test accuracy against the number of genes (30 down to 3) for the combinations L_F, F_F, L_L and F_L.]

Figure 4.12: SVM and Bayesian prediction accuracy when running combinations of univariate and multivariate feature selection methods. L_F: Likelihood + Fisher's linear discriminator; F_F: Fisher's criterion + Fisher's linear discriminator; L_L: Likelihood + Multivariate Likelihood Method; F_L: Fisher's criterion + Multivariate Likelihood Method.



4.2 First type - low dimension problems

The dataset we used for the low dimension analysis problem is a Zebra fish developmental microarray dataset obtained from the Lab of Functional Genomics of the Institute of Molecular and Cell Biology, Singapore (Lo et al., 2003). In recent years, the Zebra fish has been adopted as a model system for the study of vertebrate development owing to some of its unique characteristics that are favorable for genetic studies compared with other vertebrate systems. These characteristics include a reasonably short lifetime, a large number of progenies, external fertilization and embryonic development, and translucent embryos (Talbot and Hopkins, 2000). In the microarray experiment, there were altogether 11,552 Expressed Sequence Tag (EST) clones representing 3100 genes printed onto the microarray glass slides. According to a BLAST (Basic Local Alignment Search Tool) search, 4519 of the 11,552 clones have matches to 728 distinct publicly deposited protein sequences. That is, the functions of these 4519 clones are known, and the functions of the remaining clones are unknown. The relative expression of the 11,552 clones in the Zebra fish's six developmental stages, namely cleavage (E2), gastrula (E3), blastula (E4), segmentation (E5), pharyngula (E6) and hatching (E7), defined according to their developmental morphology, was monitored using microarray experiments, in comparison with the expression of these clones in the unfertilized egg stage (E0). A first type classification problem was constructed, which included 11,449 samples from the 11,552 clones with 6 features. A total of 3887 of the 11,449 samples, corresponding to known clones, were labeled according to whether or not they are from muscle genes. Within these 3887 clones, 248 were clones from 17 muscle genes. We performed the classification using an SVM. The labeled clones were randomly split into two sets, 2500 for training and the remaining 1387 for testing; there were 157 and 91 positive samples in the training and testing sets, respectively. The remaining 7562 unlabelled samples were then used for prediction.
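A minimal MATLAB sketch of this experimental protocol is shown below; svm_train_rbf and svm_predict are placeholders standing in for the SVM toolbox calls (they are not actual toolbox function names), and the variable names are ours.

% Sketch: train/test split on the labeled clones, then predict the
% unlabelled clones; Xlab is labeled clones x 6 features, y in {-1,+1}.
perm   = randperm(size(Xlab, 1));
tr_idx = perm(1:2500);  te_idx = perm(2501:end);
model  = svm_train_rbf(Xlab(tr_idx, :), y(tr_idx), 10, 0.01);   % C = 10, sigma = 0.01
test_pred    = svm_predict(model, Xlab(te_idx, :));
test_acc     = mean(test_pred == y(te_idx));
unknown_pred = svm_predict(model, Xunlab);       % candidate muscle-gene clones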

log10 δ    C    Number of support vectors    True positive    False positive    True negative    False negative    Predicted positive (unlabelled clones)
-3.0 10 312 34 9 1287 57 88
-2.5 10 291 42 9 1287 49 100
-2.0 10 271 44 10 1286 47 110
-1.5 10 263 41 10 1286 50 109
-1.0 10 271 42 10 1286 49 104
-0.5 10 379 42 14 1282 49 113
0.0 10 631 31 19 1277 60 131
-3.0 20 300 41 9 1287 50 101
-2.5 20 282 42 9 1287 49 107
-2.0 20 267 42 10 1286 49 112
-1.5 20 251 40 10 1286 51 111
-1.0 20 270 43 12 1284 48 101
-0.5 20 362 39 14 1282 52 128
0.0 20 611 33 26 1270 58 176
-3.0 50 289 42 10 1286 49 102
-2.5 50 275 43 10 1286 48 108
-2.0 50 260 43 10 1286 48 112
-1.5 50 248 41 9 1287 50 103
-1.0 50 257 42 12 1284 49 110
-0.5 50 322 36 18 1278 55 159
0.0 50 567 32 29 1267 59 238

Table 4.18: Test of SVM with RBF kernel using different parameters.

Table 4.18 shows the test results of the SVM with a radial basis function kernel using different parameters. It can be seen that the best test performance was obtained when setting C = 10 and σ = 0.01. With these parameters, the number of correctly predicted positive test samples (true positives) reached 44, the highest among all configurations we tried, and the number of negative test samples incorrectly predicted as positive (false positives) was as low as 10, which compares favorably with the lowest value we obtained, 9. The model trained under this setting is also not complex, as can be seen from the number of support vectors, which was as low as 271. We provided a list of the 110 positively predicted unknown clones to the biological researchers in the Lab of Functional Genomics of the Institute of Molecular and Cell Biology. Ten of these 110 clones were selected for further biological validation (re-sequencing of these clones). Eight of these ten clones were proven to

be from muscle genes. The remaining two were found to be known genes after repeated sequencing. Although these two are not muscle genes, they are functionally related to muscle genes and therefore showed expression patterns similar to those of the other eight clones. Besides these ten clones, another positively predicted clone, a putative novel gene, was re-sequenced. In situ hybridization on this clone showed that the corresponding gene truly has muscle function (Lo et al., 2003).



5 Conclusion and future work

5.1 Biological knowledge discovery process

Biological research can be treated as a knowledge discovery process, which has been greatly facilitated by the emergence of the field of Bioinformatics. Taking microarray data as an example, in the discovery process biological information is extracted by biological studies and stored in DNA sequence trace files, scanned microarray images, descriptions of samples and descriptions of experimental conditions. The information can then be quantified or symbolized into biological data with a certain structure, such as sequential representations of nucleotides and proteins or gene expression matrices. Computational and statistical methods are then applied to extract biological knowledge from the data. The knowledge is in essence relationships: relationships between genes, derived from their sequence and expression similarity, or relationships between gene expression and sample properties, experimental conditions, intermediate products, or cellular processes. Machine learning plays an important role in discovering these relationships. However, the amount of knowledge that can be discovered depends on two factors: the amount and quality of the biological data available, and the suitability of the machine learning methods for the various biological problems.

More and more researchers are working to accelerate this discovery process. There are currently two main journals and several main conferences in the bioinformatics area: the Journal of Bioinformatics, the Journal of Computational Biology, the Pacific Symposium on Biocomputing (PSB), the International Conference on Intelligent Systems for Molecular Biology (ISMB), and the Annual International Conference on Research in Computational Molecular Biology (RECOMB). There is also a conference specifically focused on microarray data analysis, the Critical Assessment of Microarray Data Analysis (CAMDA).

Although much effort has been made toward multidisciplinary collaboration in the discovery process, researchers from non-biology disciplines still need to gain, together with biologists, more insight into the nature of the biological problems. Truly good design of machine learning methods lies in the full incorporation of biological knowledge rather than in simply abstracting the biological problem to fit well-developed models. This criterion dictates the future direction of applying machine learning methods to biological problems.

5.2 Contribution, limitation and future work

This thesis focuses on the classification and feature selection problems in gene expression analysis. In this research, we reviewed current work in the literature and identified the classification problems. We then applied seven feature selection methods, feature extraction methods and some of their combinations for gene selection, and employed five classification methods for the prediction of cancer tissue type and gene function. We improved the neural network feature selector to make it more suitable for the gene selection problem on datasets with high dimension but few samples. We also developed a multivariate version of the likelihood feature selection method. We found that the hybrid of Likelihood and Recursive Feature Elimination achieved significantly better gene selection performance on the benchmark Leukemia dataset than the other methods. The hybrid LIK+RFE also significantly outperformed, on the SRBCT dataset, a neural network method proposed by other researchers.

The thesis illustrates a process of understanding the nature of the problem and choosing suitable methods. We first tested whether the most informative components contribute to the classification by applying principal component analysis and neural networks to the Leukemia dataset. The results showed that those components had little discrimination ability. We then applied a decision tree to find sets of rules that were able to classify all the samples. The simplicity of the rules implied that it may be possible to find a small set of genes with high classification ability. However, owing to the discrete nature of the C4.5 algorithm, the small decision tree generated from training samples with continuous expression values did not generalize well; consequently, the prediction accuracy of the tree on the test samples was low.
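To make these two exploratory steps concrete, the sketch below projects a gene expression matrix onto its leading principal components and, separately, fits a shallow decision tree on the raw expression values. It is only an illustration under assumed data: the matrix, labels, split, and parameters are synthetic stand-ins for the Leukemia data, and scikit-learn's CART tree is used here in place of C4.5.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an expression matrix: 72 samples x 7129 genes with
# binary tissue labels (e.g. ALL vs AML); not the real Leukemia data.
rng = np.random.default_rng(0)
X = rng.normal(size=(72, 7129))
y = rng.integers(0, 2, size=72)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Step 1: project onto the leading principal components and check how well a
# small neural network can classify from them (in the thesis these components
# showed little discrimination ability).
pca = PCA(n_components=5).fit(X_tr)
Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)
nn = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0).fit(Z_tr, y_tr)
print("accuracy using 5 principal components:", nn.score(Z_te, y_te))

# Step 2: a shallow tree (CART here, standing in for C4.5) fitted on the raw
# continuous expression values; such a tiny tree can fit the training samples
# yet generalize poorly to held-out ones.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
print("decision tree accuracy on held-out samples:", tree.score(X_te, y_te))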

The possibly simple underlying classification model and the deficiency of the decision tree method inspired us to use information gain as a gene ranking method, but a neural network as the classifier. This improved the classification. After studying the distribution of the neural network prediction errors, we found that the classification performance could be improved further by employing the AdaBoost technique to combine an ensemble of neural network classifiers. We then moved on to looking for methods that could reduce the number of genes used for classification without significant loss of prediction performance. With properly tuned parameters, the combination of information gain and neural network feature selection achieved that goal.
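The sketch below illustrates this combination on assumed synthetic data: genes are ranked by a univariate score (scikit-learn's mutual information estimator stands in here for information gain), the top-ranked genes are kept, and several small neural networks are combined with an AdaBoost.M1-style reweighting loop implemented by weighted resampling. It is not the exact procedure or parameter setting used in the experiments.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 72 samples x 7129 genes with binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(72, 7129))
y = rng.integers(0, 2, size=72)

# Univariate ranking: keep the k genes with the highest estimated score.
k = 50
scores = mutual_info_classif(X, y, random_state=0)
Xk = X[:, np.argsort(scores)[::-1][:k]]

# AdaBoost.M1-style ensemble of small MLPs; sample weights are applied by
# drawing a weighted bootstrap sample for each boosting round.
n_rounds, n = 5, len(y)
w = np.full(n, 1.0 / n)
models, alphas = [], []
for _ in range(n_rounds):
    idx = rng.choice(n, size=n, replace=True, p=w)
    clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000,
                        random_state=0).fit(Xk[idx], y[idx])
    miss = clf.predict(Xk) != y
    err = np.clip(np.dot(w, miss), 1e-10, 0.5 - 1e-10)  # weighted error, clipped
    alpha = np.log((1.0 - err) / err)                   # vote weight, log(1/beta)
    w = w * np.exp(alpha * miss)                        # up-weight misclassified samples
    w = w / w.sum()
    models.append(clf)
    alphas.append(alpha)

# Final prediction: weighted vote of the ensemble (labels assumed to be 0/1).
votes = sum(a * (2 * m.predict(Xk) - 1) for a, m in zip(alphas, models))
print("ensemble training accuracy:", np.mean((votes > 0).astype(int) == y))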

We then tested other combinations of univariate and multivariate selection methods, including methods based on the extremal margin, likelihood, and Fisher's criterion. Among these combinations, the hybrid of the Likelihood method and the Recursive Feature Elimination method (LIK+RFE) selected the most compact gene set with perfect prediction performance. We tested this hybrid method systematically on other datasets of the high-dimension classification problem, and the results were very promising. Applying the classification methods to the Zebrafish dataset, which has a large number of samples, was straightforward; moreover, instead of merely validating the classification performance on known genes, the predicted functions of some unknown genes were confirmed by biological experiments.
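A minimal sketch of this two-stage selection idea is given below, again on assumed synthetic data. A univariate filter first cuts the gene list down (a simple F-statistic stands in here for the likelihood score), and Recursive Feature Elimination with a linear support vector machine (Guyon et al., 2002) then prunes the survivors to a compact subset. The data shapes, the filter size, and the final subset size are illustrative only.

import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.svm import SVC

# Synthetic stand-in data: 72 samples x 7129 genes with binary labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(72, 7129))
y = rng.integers(0, 2, size=72)

# Stage 1: a univariate filter keeps the 200 highest-scoring genes
# (F-statistic here; the thesis uses a likelihood-based score instead).
filt = SelectKBest(f_classif, k=200).fit(X, y)
X_filtered = filt.transform(X)

# Stage 2: RFE with a linear SVM repeatedly drops the genes whose weights
# contribute least to the decision function, until only a few remain.
rfe = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=4, step=0.1)
rfe.fit(X_filtered, y)

# Map the surviving positions back to indices in the original gene list.
selected = filt.get_support(indices=True)[rfe.get_support(indices=True)]
print("selected gene indices:", selected)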

Our experiments with the LIK+RFE hybrid on the SRBCT dataset showed that feature selection and classification methods for gene expression analysis are data dependent. They also showed that, for microarray datasets of the high-dimension classification problem, the choice of feature selection method is more important than the choice of classification method. It is possible to design better selection methods, or better combinations of selection methods, for the Leukemia dataset; and although some methods used in this thesis did not achieve high selection performance on the Leukemia dataset, they may do well on other datasets.

The study of linear separability (Cover, 1965) suggests that when the number of samples is small compared with the number of features, it is possible to find many subsets of features that can perfectly distinguish all samples. Our experiments on the Leukemia dataset support this hypothesis: we found two different gene sets, each consisting of just three or four genes, that achieve perfect classification performance. Biological study shows that although many genes have no direct relevance to the cancer under study, their expression may differ subtly and systematically between classes of tissues (Alon et al., 1999). Hence, a new challenge for cancer classification arises: to find as many small subsets of genes as possible that achieve high classification performance. Using only the microarray data restricted to these subsets of genes, we can build different classifiers and look for those with desirable properties such as a large extremal margin, i.e. a wide gap between the smallest output on the positive class samples and the largest output on the negative class samples. Another such property is the median margin, the difference between the median output on the positive class samples and the median output on the negative class samples. Exhaustively enumerating and evaluating all gene combinations is computationally intractable (the underlying subset selection problem is NP-hard) and is feasible only when the number of relevant genes is very small.
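The two margin measures just described can be written down directly; the short sketch below does so for a real-valued classifier output, with purely illustrative numbers.

import numpy as np

def extremal_margin(pos_outputs, neg_outputs):
    # Smallest output on the positive class minus largest output on the
    # negative class; a large positive value means a wide separation.
    return np.min(pos_outputs) - np.max(neg_outputs)

def median_margin(pos_outputs, neg_outputs):
    # Difference between the median outputs of the two classes; less sensitive
    # to a single extreme sample than the extremal margin.
    return np.median(pos_outputs) - np.median(neg_outputs)

# Illustrative classifier outputs for the two classes.
pos = np.array([0.9, 0.7, 0.8])
neg = np.array([0.2, 0.4, 0.1])
print(round(extremal_margin(pos, neg), 3), round(median_margin(pos, neg), 3))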

Because of their cost, microarray experiments conducted to identify the genes that are crucial for cancer diagnosis are still scarce, and the measurements obtained from them are noisy. These facts make the selection of different sets of relevant genes vital. Moreover, cancer is a complex disease: it is not caused by only a few genes, but also by many other factors (Kiberstis and Roberts, 2002). So even the best selected subsets may not actually be the ones most crucial to the cancer under study. They can, however, be important candidates for a further focused study of the gene interactions within individual subsets, and of the relationship between these interactions and the disease. Some work has been done on second-order selection. For example, Guyon et al. (2002) found a gene pair that had zero leave-one-out error on the training samples but achieved poor performance on the test samples. Hellem and Jonassen (2002) also evaluated the contribution of pairs of genes to the classification when ranking genes, but they still had to combine multiple pairs of genes to perform the classification. We plan to develop methods for higher-order feature selection that allow classifiers to achieve high performance with different small sets of genes.
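As a concrete, if simplified, picture of second-order selection, the sketch below exhaustively scores every pair drawn from a small pre-filtered candidate list by its leave-one-out error with a simple classifier. The data, the size of the candidate list, and the nearest-neighbour classifier are assumptions made for illustration; they are not the procedures of Guyon et al. (2002) or Hellem and Jonassen (2002).

from itertools import combinations

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 40 samples, already reduced to 20 candidate genes.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 20))
y = rng.integers(0, 2, size=40)

def loo_error(pair):
    # Leave-one-out error of a 3-nearest-neighbour classifier on this gene pair.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                             X[:, list(pair)], y, cv=LeaveOneOut())
    return 1.0 - scores.mean()

# Exhaustive search over all C(20, 2) = 190 pairs; only feasible because the
# candidate list is tiny.
best = min(combinations(range(X.shape[1]), 2), key=loo_error)
print("best pair:", best, "LOO error:", round(loo_error(best), 3))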



References

• Akutsu, T., Miyano, S., and Kuhara, S., (1999). Identification of genetic networks

from a small number of gene expression patterns under the Boolean network model.

Pacific Symposium on Biocomputing, 4, 17-28.

• Aleksander, I., and Morton, H., (1990). An Introduction to Neural Computing.

Chapman and Hall, London.

• Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A.

J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor

and normal colon tissues probed by oligonucleotide arrays. Proceedings of the

National Academy of Sciences, 96, 6745-6750.

• Alter, O., Brown, P. O., and Botstein, D., (2000). Singular value decomposition for

genome-wide expression data processing and modeling. Proceedings of the National

Academy of Sciences, 97, 10101-10106.

• Becker, S., (1991). Unsupervised learning procedures for neural networks.

International Journal of Neural Systems, 2, 17-33.

• Bertsekas, D. P., (1995). Nonlinear Programming. Athena Scientific.

• Bishop, C.M., (1995). Neural Networks for Pattern Recognition. Clarendon Press,

Oxford.

• Boser, B., Guyon, I., and Vapnik, V. N., (1992). A training algorithm for optimal

margin classifiers. Fifth Annual Workshop on Computational Learning Theory. 144-

152.

• Brazma, A., and Vilo, J., (2000). Gene expression data analysis. FEBS Letters,

480,17-24.

• Brown, M. P. S., Grundy, W. N., Lin, D., Sugnet, C., Ares, M., and Haussler, D.,

(2000). Knowledge-based analysis of microarray gene expression data by using

support vector machines. Proceedings of the National Academy of Sciences, 97, 262-

267.

• Butte, A. J., and Kohane, I. S., (2000). Mutual information relevance networks:

Functional genomic clustering using pairwise entropy measurements. Pacific

Symposium on Biocomputing, 5, 415-426.

• Cai, J., Dayanik, A., Yu, H., Hasan, N., Terauchi, T., and Grundy, W. N., (2000).

Classification of Cancer Tissue Types by Support Vector Machines Using Microarray

Gene Expression Data. International Conference on Intelligent Systems for Molecular

Biology.

• Chapelle, O., Vapnik, V., Bousquet, O. and Mukherjee, S., (2000). Choosing kernel

parameters for support vector machines. AT&T Labs Technical Report.

• Chen, T., Filkov, V., and Skiena, S. S., (1999). Identifying gene regulatory networks

from experimental data. Annual International Conference on Computational Biology.

• Cortes, C., and Vapnik, V., (1995). Support vector networks. Machine Learning, 20,

273-297.

• Cover, T., (1965). Geometrical and Statistical Properties of Systems of Linear

Inequalities with Applications in Pattern Recognition. IEEE Transaction on

Electronic Computer, 14, 326-334.

• Datta, S., (2001). Exploring relationships in gene expressions: A partial least squares

approach. Gene Expression, 9, 257-264.

• Dewey, T. G., and Bhan, A., (2001). A linear systems analysis of expression time

series. Methods of Microarray Data Analysis, Kluwer Academic.



• D'haeseleer, P., (2000). Reconstructing Gene Networks from Large Scale Gene

Expression Data. Ph.D. thesis, University of New Mexico.

• D'haeseleer, P., Wen, X., Fuhrman, S., and Somogyi, R., (1997). Mining the gene

expression matrix: inferring gene relationships from large scale gene expression data.

Information processing in cells and tissues, Paton, R. C., and Holcombe, M., Eds.,

Plenum Press, 203-212.

• Ewing, R. M., Kahla, A. B., Poirot, O., Lopez, F., Audic, S., and Claverie, J. M.,

(1999). Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene

expression. Genome Research, 9, 950-959.

• Filkov, V., Skiena, S., and Zhi, J., (2001). Analysis techniques for microarray time-

series data. Annual International Conference on Computational Biology. 124-131.

• Fitch, W. M., and Margoliash, E., (1967). Construction of phylogenetic trees. Science,

155, 279-284.

• Fletcher, R., (1987). Practical Methods of Optimization, 2nd edition, Wiley, New

York.

• Freund, Y., and Schapire, R. E., (1996). Experiments with a new boosting algorithm.

Machine learning: Proceedings of the Thirteenth International Conference, 148-156.

• Friedman, N., Linial, M., Nachman, I., and Pe'er, D., (2000). Using Bayesian

networks to analyze expression data. Journal of Computational Biology, 7, 601-620.

• Fuhrman, S., Cunningham, M. J., Wen, X., Zweiger, G., Seihamer, J. J., and

Somogyi, R., (2000). The application of Shannon entropy in the identification of

putative drug targets. Biosystems, 55, 5-14.

• Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., and

Haussler, D., (2000). Support vector machine classification and validation of cancer

tissue samples using microarray expression data, Bioinformatics, 16, 906-914.



• Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,

Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and

Lander, E. S., (1999). Molecular Classification of Cancer: Class Discovery and Class

Prediction by Gene Expression Monitoring. Science, 286, 531-537.

• Guyon, I., Weston, J., Barnhill, S., and Vapnik, V., (2002). Gene selection for cancer

classification using support vector machines. Machine Learning, 46, 389-422.

• Haykin, S., (1999). Neural networks: A comprehensive foundation. Prentice Hall.

• Hellem, T. and Jonassen, I., (2002). New feature subset selection procedures for

classification of expression profiles. Genome Biology, 3(4), research0017.1-0017.11

• Hertz, J., Krogh, A., and Palmer, R. G., (1991). Introduction to the Theory of Neural

Computation. Addison-Wesley.

• Hieter, P., and Boguski, M., (1997). Functional genomics: it's all how you read it.

Science, 278, 601-602.

• Holter, N. S., Maritan, A., Cieplak, M., Fedoroff, N. V., and Banavar, J. R., (2001).

Dynamic modeling of gene expression data. Proceedings of the National Academy of

Sciences, 98, 1693-1698.

• Huang, S., (1999). Gene expression profiling, genetic networks, and cellular states: an

integrating concept for tumorigenesis and drug discovery. Journal of Molecular

Medicine, 77, 469-480.

• Hwang, K. B., Cho, D. Y., Park, S. W., Kim, S. D., and Zhang, B. T., (2001).

Applying machine learning techniques to analysis of gene expression data: cancer

diagnosis. Methods of Microarray Data Analysis. Kluwer Academic, 167-182.

• Ideker, T. E., Thorsson, V. and Karp, R. M., (2000). Discovery of regulatory

interactions through pertubation: inference and experimental design. Pacific

Symposium on Biocomputing, 5, 302-313.



• Kauffman, S. A., (1969). Metabolic stability and epigenesis in randomly connected

nets. Journal of Theoretical Biology, 22, 437-467.

• Keller, A. D., Schummer, M., Ruzzo, W. L., and Hood, L., (2000). Bayesian

classification of DNA array expression data. Technical Report, University of

Washington - Computer Science Engineering - 2000-08-01.

• Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold,

F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S., (2001).

Classification and diagnostic prediction of cancers using gene expression profiling

and artificial neural networks. Nature Medicine, 7, 673-679.

• Kiberstis, P., and Roberts, L., (2002). It's Not Just the Genes. Science, 296, 685.

• Kitano, H., (2002). Systems biology: a brief overview. Science, 295, 1662-1664.

• Klevecz, R. R., (2000). Dynamic architecture of the yeast cell cycle uncovered by

wavelet decomposition of expression microarray data. Functional and Integrative

Genomics, 1, 186-192.

• Li, W., (2002). Zipf's law in importance of genes for cancer classification using

microarray data. Journal of Theoretical Biology, 219, 539-51.

• Liang, S., Fuhrman, S., and Somogyi, R., (1998). REVEAL, A general reverse

engineering algorithm for inference of genetic network architectures, Pacific

Symposium on Biocomputing.

• Liu, H., and Motoda, H., (1998). Feature selection for knowledge discovery and data

mining. Kluwer Academic Publishers, 1998.

• Lo, J., Lee, S., Xu, M., Liu, F., Ruan, H., Eun, A., He, Y., Ma, W., Wang, W., Wen,

Z., and Peng, J., (2003). 15,000 Unique Zebrafish EST Clusters and Their Use in

Microarray for Profiling Gene Expression Patterns During Embryogenesis. Genome

Research, 13, 455-466.



• Maki, Y., Tominaga, D., Okamoto, M., Watanabe, S., and Eguchi, Y., (2001).

Development of a System for the Inference of Large Scale Genetic Networks. Pacific

Symposium on Biocomputing, 6, 446-458.

• Michaels, G. S., Carr, D. B., Askenazi, M., Fuhrman, S., Wen, X., and Somogyi, R.,

(1998). Cluster analysis and data visualization of large-scale gene expression data.

Pacific Symposium on Biocomputing, 3, 42-53.

• Michigan, (1999). Challenges and Opportunities in Understanding the Complexity of

Living Systems. University of Michigan, report into initiatives of Life Sciences

Commission within context of biocomplexity.

• Muirhead, R. J., (1982). Aspects of Multivariate Statistical Theory. Wiley series in

probability and mathematical statistics. Wiley, New York.

• Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J. P., and

Poggio, T., (2000). Support vector machine classification of microarray data. AI

memo, CBCL paper 182. MIT.

• Pe'er, D., Regev, A., Elidan, G., and Friedman, N., (2001). Inferring subnetworks

from perturbed expression profiles. International Conference on Intelligent Systems

for Molecular Biology.

• Pineda, F. J., (1987). Generalization of back-propagation to recurrent neural networks.

Physical Review Letters, 59, 2229-2232.

• Platt, J., (1999). Fast training of SVMs using sequential minimal optimisation.

Advances in Kernel Methods: Support Vector Learning. MIT press, Cambridge, MA,

185-208.

• Quinlan, J. R., (1993). C4.5: Programming for Machine Learning. Morgan Kaufmann

Publishers.

• Raychaudhuri, S., Stuart, J. M., and Altman, R. B., (2000). Principal components

analysis to summarize microarray experiments: application to sporulation time series.

Pacific Symposium on Biocomputing, 5, 452-463.

• Samsonova, M. G., and Serov, V. N., (1999). NetWork: An interactive interface to the

tools for analysis of genetic network structure and dynamics. Pacific Symposium on

Biocomputing.

• Savageau, M. A., (1976). Biochemical Systems analysis: a study of function and

design in molecular biology. Addison-Wesley, Reading.

• Setiono, R., and Liu, H., (1997). Neural-network feature selector. IEEE Transactions

on Neural Networks, 8, 654-662.

• Slonim, D., Tamayo, P., Mesirov, J., Golub, T. R., and Lander, E., (2000). Class

prediction and discovery using gene expression data. Annual International

Conference on Computational Biology.

• Someren, E. P. V., Wessels, L. F. A. and Reinders, M. J. T., (2001). Genetic Network

Models: A Comparative Study. Proceedings of SPIE, Micro-arrays: Optical

Technologies and Informatics.

• Someren, E. V., Wessels, L. F. A., and Reinders, M. J. T., (2000). Linear modeling of

genetic networks from experimental data. International Conference on Intelligent

Systems for Molecular Biology.

• Somogyi, R., Fuhrman, S., Askenazi, M. and Wuensche, A., (1996). The gene

expression matrix: towards the extraction of genetic network architectures.

Proceedings of Second World Congress of Nonlinear Analysis, 30, 1815-1824.

• Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B.,

Brown, P. O., Botstein, D., and Futcher, B., (1998). Comprehensive identification of

cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray

hybridization. Molecular Biology of the Cell, 9, 3272-3297.

• Stone, M., and Brooks, R. J., (1992). Continuum regression: cross-validated

sequentially constructed prediction embracing ordinary least squares, partial least

squares and principal component regression. Journal of the Royal Statistical Society

B, 52, 237-269: 1990. Corrigendum 54, 906-907: 1992.

• Szallasi, Z., (1999). Genetic network analysis in light of massively parallel biological

data acquisition. Pacific Symposium on Biocomputing.

• Talbot, W.S., and Hopkins, N., (2000). Zebra fish mutations and functional analysis

of the vertebrate genome. Genes and Development, 14, 755-762.

• Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P., (1999).

Clustering methods for the analysis of DNA microarray data. Technical Report,

Stanford University.

• Vapnik, V., (1998). Statistical Learning Theory. Wiley, New York.

• Vohradsky, J., (2001). Neural network model of gene expression. FASEB Journal, 15,

846-854.

• Weigend, A. S., Rumelhart, D. E., and Huberman, B. A., (1991). Generalization by

weight-elimination with application to forecasting. Advances in Neural Information

Processing Systems, Lippmann, R. P., Moody, J., and Touretzky, D. S., Eds., Morgan

Kaufmann, 3, 875-882.

• Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L., and

Somogyi, R., (1998). Large-scale temporal gene expression mapping of CNS

development. Proceedings of the National Academy of Sciences, 95, 334-339.

• Werbos, P. J., (1990). Backpropagation through time: what it does and how to do it.

Proceedings of the IEEE, 78, 1550-1560.



• Wessels, L. F. A., Someren, E. P. V., and Reinders, M. J. T., (2001). A Comparison of

Genetic Network Models. Pacific Symposium on Biocomputing, 508-519, Hawaii.

• Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T. and Vapnik, V.,

(2001). Feature Selection for SVMs. Advances in Neural Information Processing

Systems, 13, 668-674.

• Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P., (2001). Normalization for cDNA

Microarray Data. Microarrays: Optical Technologies and Informatics, Proceedings of

SPIE.

• Yeung, K. Y., Haynor, D. R., Ruzzo, W. L., (2001). Validating clustering for gene

expression data. Bioinformatics, 17, 309-318.

• Zhang, B. T., Ohm, P., and Muhlenbein, H., (1997). Evolutionary Induction of Sparse

Neural Trees. Evolutionary Computation, 5, 213-236.

• Zipf, G. F., (1965) [1935]. Psycho-Biology of Languages. Mass. MIT Press.
