Independent Principal Component Analysis For Biologically Meaningful Dimension Reduction of Large Biological Data Sets
Abstract
Background: A key question when analyzing high throughput data is whether the information provided by the
measured biological entities (gene or metabolite expression, for example) is related to the experimental conditions, or,
rather, to some interfering signals, such as experimental bias or artefacts. Visualization tools are therefore useful to
better understand the underlying structure of the data in a ‘blind’ (unsupervised) way. A well-established technique
to do so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to
the highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA as it
optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can
overcome both the high dimensionality and noisy characteristics of biological data.
Results: We propose Independent Principal Component Analysis (IPCA) that combines the advantages of both PCA
and ICA. It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the
important biological entities and reveal insightful patterns in the data. The result is a better clustering of the
biological samples on graphical representations. In addition, a sparse version is proposed that performs an internal
variable selection to identify biologically relevant features (sIPCA).
Conclusions: On simulation studies and real data sets, we showed that IPCA offers a better visualization of the
data than ICA and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the
list of genes selected with sIPCA demonstrates that the approach is well able to highlight relevant genes in the
data with respect to the biological experiment.
IPCA and sIPCA are both implemented in the R package mixOmics dedicated to the analysis and exploration of
high dimensional biological data sets, and on mixOmics’ web-interface.
© 2012 Yao et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Yao et al. BMC Bioinformatics 2012, 13:24 Page 2 of 15
http://www.biomedcentral.com/1471-2105/13/24
information and visualization of the data in a smaller subspace.
Principal component analysis (PCA) [1] is a classical tool to reduce the dimension of expression data, to
visualize the similarities between the biological samples, and to filter noise. It is often used as a
pre-processing step for subsequent analyses. PCA projects the data into a new space spanned by the principal
components (PC), which are uncorrelated and orthogonal. The PCs can successfully extract relevant information
in the data. Through sample and variable representations, they can reveal experimental characteristics, as
well as artefacts or bias. Sometimes, however, PCA can fail to accurately reflect our knowledge of biology for
the following reasons: a) PCA assumes that gene expression follows a multivariate normal distribution, whereas
recent studies have demonstrated that microarray gene expression measurements instead follow a super-Gaussian
distribution [2-5]; b) PCA decomposes the data based on the maximization of its variance. In some cases, the
biological question may not be related to the highest variance in the data [6].
A more plausible assumption of the underlying distribution of high-throughput biological data is that feature
measurements following Gaussian distributions represent noise - most genes conform to this distribution as
they are not expected to change at a given physiological or pathological transition [7]. Recently, an
alternative approach called Independent Component Analysis (ICA) [8-10] has been introduced to analyze
microarray and metabolomics data [2,6,11-13]. In contrast to PCA, ICA identifies non-Gaussian components which
are modelled as a linear combination of the biological features. These components are statistically
independent, i.e. there is no overlapping information between the components. ICA therefore involves high
order statistics, while PCA constrains the components to be mutually orthogonal, which involves second order
statistics [14]. As a result, PCA and ICA often choose different subspaces where the data are projected. As
ICA is a blind source separation technique, it is used to reduce the effects of noise or artefacts in the
signal, since noise is usually generated from independent sources [10]. In the recent literature, it has been
shown that the independent components from ICA were better at separating different biological groups than the
principal components from PCA [2,5-7]. However, although ICA has been found to be a successful alternative to
PCA, it faces some limitations due to instability, the choice of the number of components to extract, and high
dimensionality. As ICA is a stochastic algorithm, it needs to be run several times and the results averaged in
order to obtain robust results [5]. Choosing the number of independent components to extract is a hard
outstanding problem. It has been the convention to use a fixed number of components [2]. However, ICA does not
order its components by ‘relevance’. Therefore, some authors proposed to order them either with respect to
their kurtosis values [9], or with respect to their l2 norm [2], or by using Bayesian frameworks to select the
number of components [15]. In the case of high dimensional data sets, PCA is often applied as a pre-processing
step to reduce the number of dimensions [2,7]. In that particular case, ICA is applied on a subset of data
summarized by a small number of principal components from PCA.
In this paper, we propose to use ICA as a denoising process of PCA, since ICA is good at separating mixed
signals, i.e. noise vs. no noise. The aim is to generate denoised loading vectors. These vectors are crucial
in PCA or ICA as each of them indicates the weights assigned to each biological feature in the linear
combination that leads to the component. Therefore, the goal is to obtain independent components that better
reflect the underlying biology in a study and achieve better dimension reduction than PCA or ICA.
Independent Principal Component Analysis (IPCA) makes the assumption that biologically meaningful components
can be obtained if most noise has been removed in the associated loading vectors.
In IPCA, PCA is used as a pre-processing step to reduce the dimension of the data and to generate the loading
vectors. The FastICA algorithm [9] is then applied on the previously obtained PCA loading vectors, which
subsequently generates the Independent Principal Components (IPC). We use the kurtosis measure of the loading
vectors to order the IPCs. We also propose a sparse variant with a built-in variable selection procedure by
applying soft-thresholding on the independent loading vectors [16,17] (sIPCA).
In the ‘Results and Discussion’ Section, we first compare the classical PCA and ICA methodologies to IPCA on a
simulation study. On three real biological datasets (microarray and metabolomics datasets) we demonstrate the
satisfying sample clustering abilities of IPCA. We then illustrate the usefulness of variable selection with
sIPCA and compare it with the results obtained from the sparse PCA from [18]. In the ‘Methods’ Section, we
present the PCA, ICA and IPCA methodologies and describe how to perform variable selection with sIPCA.

Results and Discussion
We first performed a simulation study where the loading vectors follow a Gaussian or super-Gaussian
distribution. On three real data sets, we compared the kurtosis values of the loading vectors as a way of
measuring their non-Gaussianity and ordering the IPCs. The sample clustering ability of each approach is
assessed using the Davies Bouldin index [19]. Finally, the variable selection
performed by sIPCA and sPCA are compared on a simulated as well as on the Liver Toxicity data sets.

Simulation study
In order to understand the benefits of IPCA compared to PCA or ICA, we simulated 5000 data sets of size n = 50
samples and p = 500 variables from a multivariate normal distribution with a pre-specified variance-covariance
matrix described in the ‘Methods’ Section. Two cases were tested.
1. Gaussian case. The first two eigenvectors v1 and v2, both of length 500, follow a Gaussian distribution.
2. Super-Gaussian case. In this case the first two eigenvectors follow a mixture of Laplacian and uniform
distributions:
\[
v_{1k} \sim \begin{cases} L(0, 25), & k = 1, \ldots, 50 \\ U(0, 1), & \text{otherwise,} \end{cases}
\quad \text{and} \quad
v_{2k} \sim \begin{cases} L(0, 25), & k = 301, \ldots, 350 \\ U(0, 1), & \text{otherwise.} \end{cases}
\]

Table 2 Mean value of the kurtosis measure of the first 5 loading vectors in the simulation study for PCA, ICA
and IPCA.
Case             Loading      PCA      ICA      IPCA
Gaussian         loading 1    -0.007   -0.015    0.54
Gaussian         loading 2    -0.009   -0.013    0.21
Gaussian         loading 3    -0.012   -0.013   -0.01
Gaussian         loading 4    -0.011   -0.013   -0.20
Gaussian         loading 5    -0.015   -0.015   -0.41
super-Gaussian   loading 1    34.75     0.28    52.58
super-Gaussian   loading 2    34.16     0.43    33.81
super-Gaussian   loading 3    -0.01     0.42     0.27
super-Gaussian   loading 4    -0.01     0.44    -0.02
super-Gaussian   loading 5    -0.02     0.47    -0.25
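As an illustration of the super-Gaussian setting and of the kurtosis measure reported in Table 2, the following minimal Python sketch draws one pair of eigenvectors and computes their excess kurtosis. It is not the authors' simulation code: the function names are hypothetical, the second argument of L(0, 25) is assumed to be a scale parameter, and the seed is arbitrary.

```python
import random


def sample_laplace(rng, loc, scale):
    """Laplace(loc, scale) draw as the difference of two exponential draws."""
    return loc + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def simulate_loading(rng, p, spike_start, spike_end, scale=25.0):
    """Laplace entries on [spike_start, spike_end), U(0, 1) entries elsewhere."""
    return [
        sample_laplace(rng, 0.0, scale) if spike_start <= k < spike_end
        else rng.random()
        for k in range(p)
    ]


def excess_kurtosis(values):
    """Sample fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 4 for v in values) / (n * var ** 2) - 3.0


rng = random.Random(0)                    # arbitrary fixed seed
p = 500
v1 = simulate_loading(rng, p, 0, 50)      # k = 1, ..., 50 in the text
v2 = simulate_loading(rng, p, 300, 350)   # k = 301, ..., 350 in the text
gaussian = [rng.gauss(0.0, 1.0) for _ in range(p)]

print("kurtosis of v1:", excess_kurtosis(v1))
print("kurtosis of v2:", excess_kurtosis(v2))
print("kurtosis of a Gaussian vector:", excess_kurtosis(gaussian))
```

The two mixture vectors should show a strongly positive kurtosis, while the Gaussian vector should stay close to zero, which is the contrast that the kurtosis-based ordering relies on.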
Table 1 records the median of the angles between the simulated (known) eigenvectors and the loading vectors
estimated by the three approaches. PCA gave similar results in both simulation cases, and was able to estimate
the loading vectors well, while ICA performed poorly in both cases. IPCA performed quite poorly in the Gaussian
case, but outperformed PCA in the super-Gaussian case.
Table 2 displays the kurtosis values of the first 5 loading vectors. In IPCA the components are ordered with
respect to the kurtosis values of their associated loading vectors, while in the FastICA algorithm the
components are ordered with respect to the kurtosis values of the independent components. In the
super-Gaussian case, these results show that the kurtosis value is a good post hoc indicator of the number of
components to choose, as a sudden drop in the values corresponds to irrelevant dimensions (from 3 and
onwards). Low kurtosis values in the Gaussian case indicate that non-Gaussianity of the loading vectors cannot
be maximized, and that the assumptions of IPCA (i.e. that a small number of genes heavily contribute to the
observed biological process) are not met.
Tables 1 and 2 seem to suggest that ICA performs poorly in both the Gaussian and super-Gaussian cases, even if
we would expect quite the contrary in the super-Gaussian case. In the high dimensional case, PCA is used as a
pre-processing step in the ICA algorithm. It is likely that such a step affects the ICA input matrix and that
the ICA assumptions are not met. Therefore, the performance of ICA seems to be largely affected by the high
number of variables.
PCA gave satisfactory results in both cases. In the super-Gaussian case, PCA is even able to recover some of
the super-Gaussian distribution of the loading vectors. However, IPCA is able to recover the loading structure
better than PCA in the super-Gaussian case (angles are smaller in Table 1 and the kurtosis value is much higher
for the first loading for IPCA). Depending on the (unknown) nature of the data set to be analyzed, it is
therefore advisable to assess both approaches.

Table 1 Simulation study: angle (median value) between the simulated and estimated loading vectors simulated
with either Gaussian or super-Gaussian distributions.
Method   Gaussian v1   Gaussian v2   super-Gaussian v1   super-Gaussian v2
PCA      20.48         21.61         20.47               21.62
ICA      85.70         84.39         82.13               77.77
IPCA     70.05         69.72         12.46               14.08

Application to real data sets
Liver Toxicity study
In this study, 64 male rats were exposed to non-toxic (50 or 150 mg/kg), moderately toxic (1500 mg/kg) or
severely toxic (2000 mg/kg) doses of acetaminophen (paracetamol) in a controlled experiment [20]. In this
paper, we considered 50 and 150 mg/kg as low doses, and 1500 and 2000 mg/kg as high doses. Necropsies were
performed at 6, 18, 24 and 48 hours after exposure and the mRNA from the liver was extracted. The microarray
data is arranged in a matrix of 64 samples and 3116 transcripts.
Prostate cancer study
This study investigated whether gene expression differences could distinguish between common clinical and
pathological features of prostate cancer. Expression profiles were derived from 52 prostate tumors and from 50
non tumor prostate samples (referred to as normal) using oligonucleotide microarrays containing probes for
approximately 12,600 genes and ESTs. After preprocessing remains the expression of 6033 genes (see [21]) and
101 samples since one normal sample was suspected to be an outlier and was removed from the analysis.
Yeast metabolomic study
In this study, two Saccharomyces cerevisiae strains, wild-type (WT) and mutant (MT), were grown in batch
cultures under two different environmental conditions, aerobic (AER) and anaerobic (ANA), in standard mineral
media with glucose as the sole carbon source. After normalization and preprocessing, the metabolomic data
consist of 37 metabolites and 55 samples that include 13 MT-AER, 14 MT-ANA, 15 WT-AER and 13 WT-ANA samples
(see [22] for more details).
Choosing the number of components with the kurtosis measure
As mentioned by [5], one major limitation of ICA is the specification and the choice of the number of
components to extract. In PCA, the cumulative percentage of explained variance is a popular criterion to choose
the number of principal components, since they are ordered by decreasing explained variance [1]. For the case
of high dimensionality, many alternative ad hoc stopping rules have been proposed without, however, leading to
a consensus (see [23] for a thorough review). In Liver Toxicity, the first 3 principal components explained 63%
of the total variance; in Yeast, the first 2 principal components explained 85% of the total variance. For
Prostate, which contains a very large number of variables, the first 3 components only explain 51% of the total
variance (7 principal components would be necessary to explain more than 60%). However, from a visualization
perspective, choosing more than 3 components would be difficult to interpret.
The kurtosis values of the loading vectors from PCA, ICA and IPCA are displayed in Table 3. These values differ
from one approach to the others, as well as their order. In IPCA, the kurtosis value of the associated loading
vectors gives a good indicator of the ability of the components to separate the clusters, since we are
interested in extracting signals from non-Gaussian distributions. Respectively, the first 2, 1 and 2 components
seem enough in Liver Toxicity, Prostate and Yeast to extract relevant information with IPCA, as is further
discussed below.

Table 3 Kurtosis measures of the loading vectors for PCA, ICA and IPCA.
Dataset                   Loading      PCA      ICA      IPCA
Liver Toxicity study      loading 1     6.588    7.697    9.700
Liver Toxicity study      loading 2     1.912    2.737    6.982
Liver Toxicity study      loading 3     6.958    4.799    0.672
Prostate cancer study     loading 1    -1.527   -0.553    1.513
Prostate cancer study     loading 2    -0.561    0.723   -0.249
Prostate cancer study     loading 3     1.176    1.640   -1.509
Yeast metabolomic study   loading 1     4.532    0.274    1.551
Yeast metabolomic study   loading 2    12.261   -0.758    1.437
Yeast metabolomic study   loading 3     4.147    1.677   -0.475

Sample representation
The samples in each data set were projected in the new subspace spanned by the PCA, ICA or IPCA components
(Figure 1, 2 and 3). This kind of graphical output gives a better insight into the biological study as it
reveals the shared similarities between samples. Comparing the different graphics makes it possible to
visualize how each method is able to partition the samples in a way that reflects the internal structure of the
data, and to extract the relevant information to represent each sample. One would expect that the samples
belonging to the same biological group, or undergoing the same biological treatment, would be clustered
together and separated from the other groups.
In Liver Toxicity, IPCA tended to better cluster the low doses together, compared to PCA or ICA (Figure 1). In
Prostate (Figure 2), PCA graphical representations showed interesting patterns. Neither the first, nor the
second component in PCA was relevant to separate the two groups. Instead, it was the third component that could
give more insight into the expected biological characteristics of the samples. It is likely that PCA first
attempts to maximize the variance of noisy signals, which have a Gaussian distribution, before being able to
find the right direction to better differentiate the sample classes. For IPCA, the first component seemed
already sufficient to separate the classes (as indicated by the kurtosis value of its associated loading vector
in Table 3), while two components were necessary for ICA to achieve a satisfying clustering. For the Yeast
study (Figure 3), even though the first 2 principal components explained 85% of the total variance, it seemed
that 3 components were necessary to separate WT from MT in the AER samples with PCA, whereas 2 components were
sufficient with ICA and IPCA. For all approaches, the WT and MT samples for the ANA group remain mixed and seem
to share strong biological similarities.
Cluster validation
In order to compare how well different methods perform on a data set, different indexes were proposed to
measure the similarities between clusters in the literature [24]. We used the Davies-Bouldin index [19] (see
‘Methods’ section). This index has both a statistical and geometric rationale, and looks for compact and
well-separated clusters. The main purpose is to check whether the different approaches can distinguish between
the known biological conditions or treatments on the basis of the expression data. The approach that gives the
smallest index is considered the best clustering method based on this criterion. The results are displayed in
Table 4 for a choice of 2 or 3 components. On the Liver Toxicity study, the Davies-Bouldin index indicated that
IPCA outperformed the
Figure 1 Liver Toxicity study: Sample representation. Sample representation using the first two components from PCA, ICA and IPCA
approaches.
Figure 2 Prostate cancer study: sample representation. Sample representation using the first two or three components from PCA, ICA and
IPCA approaches.
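The cluster validation step relies on the Davies Bouldin index, where a smaller value indicates more compact and better separated groups. The sketch below uses the commonly stated form of the index (mean over groups of the worst ratio of summed within-group scatter to between-centroid distance); whether this matches the exact definition given in the ‘Methods’ section is an assumption, and the function names and sample coordinates are made up for illustration.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(clusters):
    """clusters: dict mapping a group label to a list of coordinate lists."""
    labels = list(clusters)
    centres = {g: centroid(clusters[g]) for g in labels}
    # Within-group scatter: mean Euclidean distance to the group centroid.
    scatter = {
        g: sum(math.dist(p, centres[g]) for p in clusters[g]) / len(clusters[g])
        for g in labels
    }
    total = 0.0
    for g in labels:
        # Worst-case similarity of group g to any other group.
        total += max(
            (scatter[g] + scatter[h]) / math.dist(centres[g], centres[h])
            for h in labels if h != g
        )
    return total / len(labels)


# Hypothetical 2-D component scores for two sample groups.
separated = {
    "low": [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    "high": [[10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]],
}
# Same groups with the second centroid moved closer to the first.
overlapping = {
    "low": separated["low"],
    "high": [[x - 8.0, y - 8.0] for x, y in separated["high"]],
}

print(davies_bouldin(separated))
print(davies_bouldin(overlapping))
```

In this toy setting the well-separated configuration yields the smaller index, which is the direction in which Table 4 is read.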
Table 5 displays the correct identification rate of each loading vector estimated by sPCA and sIPCA. Given this
non trivial setting, both approaches identified the important variables very well, especially on the first
dimension, where sPCA slightly outperformed sIPCA. On the second dimension, the performances of sPCA and sIPCA
differ as sPCA fails to differentiate each sparse signal separately - it tended to select variables from both
dimensions in the second loading vector. On the contrary, and especially in the super-Gaussian case, sIPCA is
able to identify each sparse eigenvector signal separately, i.e. each simulated biological process. sPCA
performed better in the Gaussian than in the super-Gaussian case, whereas sIPCA performed almost equally well
in both cases.

Real example with Liver Toxicity study
Choosing the number of genes to select
Figure 4 displays the Davies Bouldin index for various gene selection sizes. sIPCA clearly outperformed sPCA.
In order to compare the biological relevance of the two gene selections, a selection size of 50 genes per
dimension, for 2 dimensions, was arbitrarily chosen for the following analysis. Even if not optimal from the
index perspective, this choice was mostly guided by the number of subsequent annotated genes that could be
analyzed in the biological interpretation. For each approach, the gene lists of different sizes are embedded
into each other, and a compromise has to be made to obtain a sufficient but not too large list of genes to be
interpreted.
Comparison of the sparse loading vectors
The first and second sparse loading vectors for both sPCA and sIPCA are plotted in Figure 5 (absolute values).
In the first dimension, the loading vectors of the two sparse approaches are very similar (correlation of
0.98), a fact that was already indicated in the above simulation study. Both approaches select the same
variables. On the second dimension, however, the sparse loading vectors differ (correlation of 0.28) as IPCA
(similar to ICA) leads to a not necessarily orthogonal basis which may reconstruct the data better than PCA in
the presence of noise and is sensitive to high order statistics in
Figure 3 Yeast metabolomic study: sample representation. Sample representation using the first two or three components from PCA, ICA
and IPCA approaches.
the data rather than the covariance matrix only [25]. This explains why sPCA and sIPCA give different
subspaces.
Sample representation
The PCs and IPCs are displayed in Figure 6. Since most of the noisy variables were removed, sPCA seemed to give
a better clustering of the low doses compared to Figure 1. sIPCA and IPCA remain similar, which shows that IPCA
is well able to separate the noise from the biologically relevant signal.
Biological relevance of the selected genes
We have seen that the independent principal components indicate relevant biological similarities between the
samples. We next assessed whether these selected genes were relevant to the biological study. The genes
selected with either sIPCA or sPCA were further investigated using the GeneGo software [26], that can output
pathways, process networks, Gene Ontology (GO) processes and molecular functions.
We decided to focus only on the first two dimensions as they were sufficient to obtain a satisfying cluster of
Table 4 Davies Bouldin index for PCA, ICA and IPCA on the three data sets.
Dataset                   # of components   PCA     ICA     IPCA
Liver Toxicity study      2 components      1.809   1.923   1.242
Liver Toxicity study      3 components      1.523   1.578   1.525
Prostate cancer study     2 components      4.117   1.679   1.782
Prostate cancer study     3 components      3.312   2.316   2.315
Yeast metabolomic study   2 components      1.894   1.788   2.338
Yeast metabolomic study   3 components      2.119   2.139   2.037

Table 5 Simulation study: average percentage of correctly identified non-zero loadings (standard deviation)
when 50 variables are selected on each dimension (each loading vector).
Method   Gaussian v1     Gaussian v2     super-Gaussian v1   super-Gaussian v2
sPCA     90.30% (3.5)    72.5% (11.6)    85.44% (4.3)        68.22% (10.6)
sIPCA    86.7% (8.3)     87.7% (8.1)     80.80% (8.6)        82.30% (8.4)
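Table 5 refers to selecting a fixed number of variables on each loading vector. A minimal Python sketch of how soft-thresholding can produce such a sparse loading vector is given here. The rule used to pick the threshold (the magnitude of the first entry to be discarded) is an assumption made for illustration and is not taken from the paper; the function names and the toy loading vector are hypothetical.

```python
def soft_threshold(vector, lam):
    """Shrink each entry towards zero by lam; entries with |v| <= lam become 0."""
    return [
        (abs(v) - lam) * (1.0 if v > 0 else -1.0) if abs(v) > lam else 0.0
        for v in vector
    ]


def keep_top(vector, n_keep):
    """Soft-threshold so that the n_keep largest-magnitude entries stay non-zero.

    Assumed rule: lam is the (n_keep + 1)-th largest magnitude. With ties at
    that magnitude, fewer than n_keep entries may survive.
    """
    magnitudes = sorted((abs(v) for v in vector), reverse=True)
    lam = magnitudes[n_keep] if n_keep < len(vector) else 0.0
    return soft_threshold(vector, lam)


loading = [0.9, -0.05, 0.4, 0.02, -0.7, 0.1]   # made-up loading vector
sparse = keep_top(loading, 3)
selected = [k for k, v in enumerate(sparse) if v != 0.0]

print(sparse)
print(selected)
```

The indices in `selected` play the role of the selected variables; the surviving weights are shrunk but keep their sign.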
[Figure 4 (plot not recoverable): Davies Bouldin index on the vertical axis, with one curve each for sIPCA and sPCA.]
the samples (see previous results). We therefore analyzed the two lists of 50 genes selected with either sIPCA
or sPCA for each of these two dimensions. Amongst these 50 genes, between 33 and 39 genes were annotated and
recognized by the software.
Genes selected on dimension 1 Both methods selected genes previously highlighted in the literature as having
functions in detoxification and redox regulation in response to oxidative stress: 2 cytochrome P450 genes (1)
and heme oxygenase 1 were selected by sIPCA (sPCA) on the first dimension (see Additional files 1 and 2). The
expression of these genes has been found to be altered in biological pathways perturbed subsequent to incipient
toxicity [27-32]. These genes were also previously selected with other statistical approaches by other
colleagues on the same study [20].
A Gene Ontology enrichment analysis for each list of genes was performed. GO terms significantly enriched
included biological processes related to response to unfolded proteins, protein refolding and protein stimulus,
as well as response to chemical stimulus and organic substance (Additional file 3). Although very similar, the
sPCA gene list highlighted slightly more genes related to these GO terms than the sIPCA gene selection. The GO
molecular functions related to these genes were, however, more enriched with sIPCA: heme and unfolded protein
binding as well as oxidoreductase activity (Additional file 4).
Genes selected on dimension 2 The gene lists from dimension two not only highlighted response to unfolded
protein and to organic substance, but also cellular carbohydrate biosynthesis process, triglyceride,
acylglycerol, neutral metabolic processes as well as catabolic process and glucogenesis. For this dimension,
however, it is sIPCA that selected more relevant genes that enriched these terms (Additional file 5).
In terms of pathways, both approaches selected HSP70 and HSP90 genes. The HSP70 gene encodes a member of the
heat shock proteins 70 family. These proteins play a role in cell proliferation and stress response, which
explains the presence of pathways such as oxidative stress [33,34] (Additional file 6). The HSP90 proteins are
highly conserved molecular chaperones that have key roles in signal transduction, protein folding and protein
degradation. They play important roles in folding newly synthesized proteins or stabilizing and refolding
denatured proteins after stress [35].
Summary This preliminary analysis demonstrates the ability of sIPCA and sPCA to select genes that were relevant
to the biological study. These genes that are
Figure 5 Liver Toxicity study: sparse loading vectors. Comparison of the first two sparse loading vectors generated by sIPCA and sPCA.
Figure 6 Liver Toxicity study: sample representation with sparse variants. Sample representation using the first two principal components
of sPCA and sIPCA approaches when 50 variables are selected on each dimension.
ranked as being ‘important’ by both approaches, participate in the determination of the components which are
linear combinations of the original variables. Therefore, the expression of these selected genes not only helps
clustering the samples according to the different treatments or biological conditions but also has a
biologically relevant meaning for the system under study.

Conclusions
We have developed a variant of PCA called IPCA that combines the advantages of both PCA and ICA. IPCA assumes
that biologically meaningful components can be obtained if most noise has been removed from the associated
loading vectors. By identifying non-Gaussian loading vectors from the biological data, it better reflects the
internal structure of the data compared to PCA and ICA. On simulated data sets, we showed that IPCA
outperformed PCA and ICA in the super-Gaussian case, and that the kurtosis value of the loading vectors can be
used to choose the number of independent principal components. On real data sets, we assessed the cluster
validity using the Davies Bouldin index and showed that in high dimensional cases, IPCA could summarize the
information of the data better or with a smaller number of components than PCA or ICA.
We also introduced sIPCA that allows an internal variable selection procedure. By applying a soft-thresholding
penalization on the independent loading vectors, sparse loading vectors are obtained which enable variable
selection. We have shown that sIPCA can correctly identify most of the important variables in a simulation
study. For one data set, the genes selected with sIPCA and sPCA were further investigated to assess whether the
two approaches were able to select genes that were relevant to the system under study. Given these genes,
relevant GO terms, molecular functions and pathways were highlighted. This analysis demonstrated the ability of
such approaches to unravel biologically relevant information. The expression of these selected genes is also
decisive to cluster the samples according to their biological conditions.
We believe that the (s)IPCA approach can be useful, not only to improve data visualization and reveal
experimental characteristics, but also to identify biologically relevant variables. IPCA and sIPCA are
implemented in the R package mixOmics [36,37] and its associated web-interface http://mixomics.qfab.org.

Methods
Principal Component Analysis (PCA)
PCA is a classical dimension reduction and feature extraction tool in exploratory analysis, and has been used
in a wide range of fields. There exist different ways of solving PCA. The most computationally efficient
algorithm uses Singular value decomposition (SVD): suppose X is a centered n × p matrix (the mean of each column
has been subtracted), where n is the number of samples (or observations) and p is the number of variables or
biological entities that are measured. Then the SVD of data matrix X can be defined as
\[ X = U D V^T, \qquad (1) \]
where U is an n × p matrix whose columns are uncorrelated (i.e. \(U^T U = I_p\)), V is a p × p orthogonal matrix
(i.e. \(V^T V = I_p\)), and D is a p × p diagonal matrix with diagonal elements \(d_j\). We denote \(u_j\) the
columns of U and \(v_j\) the columns of V. Then \(u_j d_j\) is the jth principal component (PC) and \(v_j\) is
the corresponding loading vector [1]. The PCs are linear combinations of the original variables and the loading
vectors indicate the weights assigned to each of the variables in the linear combination. The first PC accounts
for the maximal amount of the total variance. Similarly, the jth (j = 2,..., p) PC can explain the maximal
amount of variance that is not accounted for by the previous j - 1 PCs. Therefore, most of the information
contained in X can be reduced to a few PCs. Plotting the PCs enables a visual representation of the samples
projected in the subspace spanned by the PCs. We can expect that the samples belonging to the same biological
group, or undergoing the same biological treatment, would be clustered together and separated from the other
groups.
Limitation of PCA
Sometimes, however, PCA may not be able to extract relevant information and may therefore provide meaningless
principal components that do not describe experimental characteristics. The reason is that its linear
transformation involves second order statistics (i.e. to obtain mutually orthogonal PCs) that might not be
appropriate for biological data. PCA assumes that gene expression data have Gaussian signals, while it has been
demonstrated that many gene expression data in fact have ‘super-Gaussian’ signals [2,4].

Independent Component Analysis (ICA)
Independent Component Analysis (ICA) was first proposed by [8]. ICA can reduce the effects of noise or artefacts
in the data as it aims at separating a mixture of signals into their different sources. By assuming non-Gaussian
signal distribution, ICA models observations as linear combinations of variables, or components, which are
chosen to be as statistically independent as possible (i.e. the different components represent different
non-overlapping information). ICA therefore involves higher-order statistics [14]. In fact, ICA attempts to
recover statistically independent signals from the observations of an unknown linear mixture. Several algorithms
such as
FastICA, Kernel ICA [38] and ProDenICA [39] were proposed to estimate the independent components. The FastICA algorithm maximizes the non-Gaussianity of each component, while Kernel ICA and ProDenICA minimize the mutual information between components. In this article, we used the FastICA algorithm.
Let X (n × p) be the centered data matrix and S (n × p) the matrix containing the independent components (ICs). We can solve the ICA problem by introducing a mixing matrix A of size n × n:

$$X = AS. \tag{2}$$

The mixing matrix A indicates how the independent components of S are linearly combined to construct X. If we rearrange the equation above, we get

$$S = WX, \tag{3}$$

where W (n × n) is the unmixing matrix that describes the inverse process of mixing the ICs. If we assume that A is a square and orthonormal matrix, then W is simply the transpose of A. In practice, it is very useful to whiten the data matrix X, i.e., to obtain Cov(X) = I. This allows the mixing matrix A to be orthogonal: Cov(AS) = I and $SS^T = I \Rightarrow AA^T = I$. The orthogonality of the matrix also means that fewer parameters need to be estimated. In the FastICA algorithm, PCA is used as a pre-processing step to whiten the data matrix. If we rearrange (1), we therefore obtain

$$U^T = D^{-1} V^T X^T, \tag{4}$$

since the columns of V are orthonormal. The rows of $U^T$ are uncorrelated and have zero mean. To complete the whitening step, we can multiply $U^T$ by $\sqrt{n-1}$, so that the resulting rows have unit variance. Then let $\tilde{U}$ be the whitened PCs ($\tilde{U} = \sqrt{n-1}\,U^T$). The ICs are estimated through the following equation:

$$S = W\tilde{U}. \tag{5}$$

ICA assumes that Gaussian distributions represent noise, and therefore aims at identifying non-Gaussian components in the sample space that are as independent as possible. Recent studies have observed that the signal distribution of microarray data is typically super-Gaussian since only a small number of genes contribute heavily to a specific biological process [2,5].
Two classical quantitative measures of Gaussianity are kurtosis and negentropy.

• Kurtosis, also called the fourth-order cumulant, is defined as

$$K = E\{s_i^4\} - 3, \tag{6}$$

where $s_i$ is the i-th row of S, which has zero mean and unit variance, i = 1, ..., n. The kurtosis value equals zero if $s_i$ has a Gaussian probability density function (pdf), is positive if $s_i$ has a spiky pdf (super-Gaussian, i.e. the pdf is relatively large at zero) and is negative if $s_i$ has a flat pdf (sub-Gaussian, i.e. the pdf is rather constant near zero). We are interested in the spiky and flat pdfs (i.e. non-Gaussian pdfs) since non-Gaussianity is used as a measure of independence [9]. Note that although kurtosis is both computationally and theoretically simple, it can be very sensitive to outliers. The authors in [6] proposed to order the ICs based on their kurtosis value.

• In the FastICA algorithm, negentropy is used as it is an excellent measurement of non-Gaussianity. Negentropy equals zero if $s_i$ is Gaussian and is positive if $s_i$ is non-Gaussian. It is not only easy to compute, but also very robust [9]. However, this measure does not distinguish between super-Gaussianity and sub-Gaussianity.

Limitation of ICA
Similar to PCA, ICA also suffers from high dimensionality, which sometimes leads to the inability of the ICs to reflect the (biologically expected) internal structure of the data. Furthermore, since ICA is a stochastic algorithm, it faces the problem of convergence to local optima, leading to slightly different ICs when re-analyzing the same data [40].

Independent Principal Component Analysis (IPCA)
To reduce noise and better reflect the internal structure of the data generated by the biological experiment, we propose a new approach called Independent Principal Component Analysis (IPCA). Rather than denoising the data or the PCs directly, as is done in ICA, we propose instead to reduce the noise in the loading vectors. Recall that the PCs, which are then used to visualize the samples and how they cluster together, are a linear combination of the original variables weighted by their elements in the corresponding loading vectors. Thus we will obtain denoised PCs by using ICA as a denoising process of the associated loading vectors.
We make the assumption that in a biological system, different variables (biological entities, such as genes and metabolites) have different levels of expression or abundance depending on the biological conditions. Therefore, only a few variables contribute to a biological process. These relevant variables should have important weights in the loading vectors while other irrelevant or noisy variables should have very small weights. In fact, once the loading vectors are denoised, we expect them to have a
super-Gaussian distribution (as opposed to a Gaussian distribution when noise is included, see Figure 7 for the plot of a typical super-Gaussian and a Gaussian distribution). Maximizing the non-Gaussianity of the loading vectors will thus remove most of the noise. IPCA is described below and summarized in Table 6.

[Figure 7 plot: "Super-Gaussian vs Gaussian"; two density curves (super-Gaussian, Gaussian) over the range −4 to 4.]
Figure 7 Super-Gaussian vs. Gaussian distribution. A super-Gaussian distribution (Laplace distribution for example) has a more spiky peak and a longer tail than a Gaussian distribution. The distribution of a noiseless loading vector is similar to a super-Gaussian distribution. If a large amount of noise exists in the loading vectors, their distribution will tend towards a Gaussian distribution.

Extract the loading vectors from PCA
PCA is applied to the X (n × p) centered data matrix using SVD to extract the loading vectors:

$$X = UDV^T, \tag{7}$$

where the columns of V contain the loading vectors. Since the mean of each loading vector is very close to zero, these vectors are approximately whitened and the FastICA algorithm can be applied on the loading vectors.

Dimension reduction
Dimension reduction enables a clearer interpretation without the computational burden. Therefore, only a small number of loading vectors, or, equivalently, a small number of PCs is needed to summarize most of the relevant information. However, there is no globally accepted criterion on how to choose the number of PCs to keep. We have shown that the kurtosis value of the independent loading vectors gives a post hoc indication of the number of independent principal components to be chosen (see ‘Results and Discussion’ Section). We have experimentally observed that 2 or 3 components were sufficient to highlight meaningful characteristics of the data and to discard much of the noise or irrelevant information.

Apply ICA on the loading vectors
The non-Gaussianity of the loading vectors can be maximized using equation (5):

$$S = W\tilde{V}^T, \tag{8}$$

where $\tilde{V}$ is the (p × m) matrix containing the m chosen loading vectors, W is the (m × m) unmixing matrix and S is the (m × p) matrix whose rows are the independent loading vectors. The new independent principal components (IPCs) are obtained by projecting X on $S^T$:

vector, see following paragraph). In this way, we can control how many variables to select and save some computational time.
matrix, then $\Sigma = VCV^T$. The data are then generated from a multivariate normal distribution N(0, Σ), with n = 50 samples and p = 500 variables.

Davies-Bouldin index
The Davies-Bouldin measure is an index of crisp cluster validity [19]. This index compares the within-cluster scatter with the between-cluster separation. It was chosen in this study because of its statistical and geometric rationale. The Davies-Bouldin index is defined as

$$\frac{1}{K}\sum_{i=1}^{K}\max_{j \neq i}\left\{\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}\right\},$$

where $c_i$ is the centroid of cluster i, $\sigma_i$ is the average distance of all elements in cluster i to centroid $c_i$, $d(c_i, c_j)$ is the distance between the two centroids, and K is the number of known biological conditions or treatments. Depending on the number of components that were chosen, we applied a 2- or 3-norm distance. Geometrically speaking, we are seeking to minimize the within-cluster scatter (the numerator) while maximizing the between-cluster separation (the denominator). Therefore, for a given number of components, the approach that gives the lowest index has the best clustering ability.

Additional material
Additional file 1: List of genes from sIPCA. List of genes and gene title selected by sIPCA on each dimension on Liver Toxicity study.
Additional file 2: List of genes from sPCA. List of genes and gene title selected by sPCA on each dimension on Liver Toxicity study.
Additional file 3: GeneGo analysis. Comparison of the GO processes for the genes selected on dimension 1 with sIPCA and sPCA on Liver Toxicity study.
Additional file 4: GeneGo analysis. Comparison of the GO molecular functions for the genes selected on dimension 1 with sIPCA and sPCA on Liver Toxicity study.
Additional file 5: GeneGo analysis. Comparison of the GO processes for the genes selected on dimension 2 with sIPCA and sPCA on Liver Toxicity study.
Additional file 6: GeneGo analysis. Comparison of the GeneGO pathways maps for the genes selected on dimension 1 with sIPCA and sPCA on Liver Toxicity study.

Acknowledgements
We would like to thank Dr Thibault Jombart (Imperial College) for his useful advice. This work was supported, in part, by the Wound Management Innovation CRC (established and supported under the Australian Government’s Cooperative Research Centres Program).

Author details
1 Shanghai University of Finance and Economics, Shanghai, P.R. China. 2 Queensland Facility for Advanced Bioinformatics, University of Queensland, St Lucia, QLD 4072, Australia. 3 Sup’Biotech, Villejuif, F-94800, France.

Authors’ contributions
FY performed the statistical analysis, wrote the R functions and drafted the manuscript. KALC participated in the design of the manuscript and helped drafting the manuscript. JC participated in the implementation of the R functions and implemented IPCA in the web-interface. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Received: 5 September 2011 Accepted: 3 February 2012 Published: 3 February 2012

References
1. Jolliffe I: Principal Component Analysis. Second edition. Springer, New York; 2002.
2. Lee S, Batzoglou S: Application of independent component analysis to microarrays. Genome Biology 2003, 4(11):R76.
3. Purdom E, Holmes S: Error distribution for gene expression data. Statistical Applications in Genetics and Molecular Biology 2005, 4:16.
4. Huang D, Zheng C: Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics 2006, 22(15):1855.
5. Engreitz J, Daigle B Jr, Marshall J, Altman R: Independent component analysis: Mining microarray data for fundamental human gene expression modules. Journal of Biomedical Informatics 2010, 43:932-944.
6. Scholz M, Gatzek S, Sterling A, Fiehn O, Selbig J: Metabolite fingerprinting: detecting biological features by independent component analysis. Bioinformatics 2004, 20(15):2447-2454.
7. Frigyesi A, Veerla S, Lindgren D, Höglund M: Independent component analysis reveals new and biologically significant structures in micro array data. BMC Bioinformatics 2006, 7:290.
8. Comon P: Independent component analysis, a new concept? Signal Process 1994, 36:287-314.
9. Hyvärinen A, Oja E: Independent Component Analysis: Algorithms and Applications. Neural Networks 2000, 13(4-5):411-430.
10. Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. John Wiley & Sons; 2001.
11. Liebermeister W: Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002, 18:51-60.
12. Wienkoop S, Morgenthal K, Wolschin F, Scholz M, Selbig J, Weckwerth W: Integration of Metabolomic and Proteomic Phenotypes. Molecular & Cellular Proteomics 2008, 7:1725-1736.
13. Rousseau R, Govaerts B, Verleysen M: Combination of Independent Component Analysis and statistical modelling for the identification of metabonomic biomarkers in H-NMR spectroscopy. Tech rep, Université Catholique de Louvain and Université Paris I 2009.
14. Kong W, Vanderburg C, Gunshin H, Rogers J, Huang X: A review of independent component analysis application to microarray gene expression data. BioTechniques 2008, 45(5):501.
15. Teschendorff A, Journée M, Absil P, Sepulchre R, Caldas C: Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Computational Biology 2007, 3(8):e161.
16. Jolliffe I, Trendafilov N, Uddin M: A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics 2003, 12:531-547.
17. Donoho D, Johnstone I: Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81:425-455.
18. Shen H, Huang JZ: Sparse Principal Component Analysis via Regularized Low Rank Matrix Approximation. Journal of Multivariate Analysis 2008, 99:1015-1034.
19. Davies D, Bouldin D: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1979, 2:224-227.
20. Bushel P, Wolfinger RD, Gibson G: Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Systems Biology 2007, 1.
21. Singh D, Febbo P, Ross K, Jackson D, Manola J, Ladd C, Tamayo P, Renshaw A, D’Amico A, Richie J, Lander E, Loda M, Kantoff P, Golub T, Sellers W: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1(2):203-209.