Yao et al. BMC Bioinformatics 2012, 13:24
http://www.biomedcentral.com/1471-2105/13/24

RESEARCH ARTICLE Open Access

Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets

Fangzhou Yao (1,2), Jeff Coquery (2,3) and Kim-Anh Lê Cao (2,*)

Abstract
Background: A key question when analyzing high throughput data is whether the information provided by the
measured biological entities (gene, metabolite expression for example) is related to the experimental conditions, or,
rather, to some interfering signals, such as experimental bias or artefacts. Visualization tools are therefore useful to
better understand the underlying structure of the data in a ‘blind’ (unsupervised) way. A well-established technique
to do so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to
the highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA as it
optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can
overcome both the high dimensionality and noisy characteristics of biological data.
Results: We propose Independent Principal Component Analysis (IPCA) that combines the advantages of both PCA
and ICA. It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the
important biological entities and reveal insightful patterns in the data. The result is a better clustering of the
biological samples on graphical representations. In addition, a sparse version is proposed that performs an internal
variable selection to identify biologically relevant features (sIPCA).
Conclusions: On simulation studies and real data sets, we showed that IPCA offers a better visualization of the
data than ICA and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the
list of genes selected with sIPCA demonstrates that the approach is able to highlight relevant genes in the
data with respect to the biological experiment.
IPCA and sIPCA are both implemented in the R package mixOmics dedicated to the analysis and exploration of
high dimensional biological data sets, and on the mixOmics web-interface.

* Correspondence: [email protected]
2 Queensland Facility for Advanced Bioinformatics, University of Queensland, St Lucia, QLD 4072, Australia
Full list of author information is available at the end of the article

© 2012 Yao et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

With the development of high throughput technologies, such as microarray and next generation sequencing data, the exploration of high throughput data sets is becoming a necessity to unveil the relevant information contained in the data. Efficient exploratory tools are therefore needed, not only to assess the quality of the data, but also to give a comprehensive overview of the system, extract significant information and cope with the high dimensionality. Indeed, many statistical approaches fail or perform poorly for two main reasons: the number of samples (or observations) is much smaller than the number of variables (the biological entities that are measured) and the data are extremely noisy.

In this study, we are interested in the application of unsupervised approaches to discover novel biological mechanisms and reveal insightful patterns while reducing the dimension in the data. Amongst the different categories of unsupervised approaches (clustering, model-based and projection methods), we are specifically interested in projection-based methods, which linearly decompose the data into components with a desired property. These exploratory approaches project the data into a new subspace spanned by the components. They allow dimension reduction without loss of essential

information and visualization of the data in a smaller subspace.

Principal component analysis (PCA) [1] is a classical tool to reduce the dimension of expression data, to visualize the similarities between the biological samples, and to filter noise. It is often used as a pre-processing step for subsequent analyses. PCA projects the data into a new space spanned by the principal components (PC), which are uncorrelated and orthogonal. The PCs can successfully extract relevant information in the data. Through sample and variable representations, they can reveal experimental characteristics, as well as artefacts or bias. Sometimes, however, PCA can fail to accurately reflect our knowledge of biology for the following reasons: a) PCA assumes that gene expression follows a multivariate normal distribution, whereas recent studies have demonstrated that microarray gene expression measurements instead follow a super-Gaussian distribution [2-5]; b) PCA decomposes the data based on the maximization of its variance. In some cases, the biological question may not be related to the highest variance in the data [6].

A more plausible assumption of the underlying distribution of high-throughput biological data is that feature measurements following Gaussian distributions represent noise - most genes conform to this distribution as they are not expected to change at a given physiological or pathological transition [7]. Recently, an alternative approach called Independent Component Analysis (ICA) [8-10] has been introduced to analyze microarray and metabolomics data [2,6,11-13]. In contrast to PCA, ICA identifies non-Gaussian components which are modelled as a linear combination of the biological features. These components are statistically independent, i.e. there is no overlapping information between the components. ICA therefore involves high order statistics, while PCA constrains the components to be mutually orthogonal, which involves second order statistics [14]. As a result, PCA and ICA often choose different subspaces where the data are projected. As ICA is a blind source signal separation method, it is used to reduce the effects of noise or artefacts of the signal since usually, noise is generated from independent sources [10]. In the recent literature, it has been shown that the independent components from ICA were better at separating different biological groups than the principal components from PCA [2,5-7]. However, although ICA has been found to be a successful alternative to PCA, it faces some limitations due to some instability, the choice of the number of components to extract and high dimensionality. As ICA is a stochastic algorithm, it needs to be run several times and the results averaged in order to obtain robust results [5]. The number of independent components to extract and choose is a hard outstanding problem. It has been the convention to use a fixed number of components [2]. However, ICA does not order its components by 'relevance'. Therefore, some authors proposed to order them either with respect to their kurtosis values [9], or with respect to their l2 norm [2], or by using Bayesian frameworks to select the number of components [15]. In the case of high dimensional data sets, PCA is often applied as a pre-processing step to reduce the number of dimensions [2,7]. In that particular case, ICA is applied on a subset of data summarized by a small number of principal components from PCA.

In this paper, we propose to use ICA as a denoising process of PCA, since ICA is good at separating mixed signals, i.e. noise vs. no noise. The aim is to generate denoised loading vectors. These vectors are crucial in PCA or ICA as each of them indicates the weights assigned to each biological feature in the linear combination that leads to the component. Therefore, the goal is to obtain independent components that better reflect the underlying biology in a study and achieve better dimension reduction than PCA or ICA.

Independent Principal Component Analysis (IPCA) makes the assumption that biologically meaningful components can be obtained if most noise has been removed in the associated loading vectors.

In IPCA, PCA is used as a pre-processing step to reduce the dimension of the data and to generate the loading vectors. The FastICA algorithm [9] is then applied on the previously obtained PCA loading vectors that will subsequently generate the Independent Principal Components (IPC). We use the kurtosis measure of the loading vectors to order the IPCs. We also propose a sparse variant with a built-in variable selection procedure by applying soft-thresholding on the independent loading vectors [16,17] (sIPCA).

In the 'Results and Discussion' Section, we first compare the classical PCA and ICA methodologies to IPCA on a simulation study. On three real biological datasets (microarray and metabolomics datasets) we demonstrate the satisfying samples clustering abilities of IPCA. We then illustrate the usefulness of variable selection with sIPCA and compare it with the results obtained from the sparse PCA from [18]. In the 'Methods' Section, we present the PCA, ICA and IPCA methodologies and describe how to perform variable selection with sIPCA.

Results and Discussion

We first performed a simulation study where the loading vectors follow a Gaussian or super-Gaussian distribution. On three real data sets, we compared the kurtosis values of the loading vectors as a way of measuring their non-Gaussianity and ordering the IPCs. The samples clustering ability of each approach is assessed using the Davies Bouldin index [19]. Finally, the variable selection

performed by sIPCA and sPCA is compared on a simulated data set as well as on the Liver Toxicity data set.

Simulation study

In order to understand the benefits of IPCA compared to PCA or ICA, we simulated 5000 data sets of size n = 50 samples and p = 500 variables from a multivariate normal distribution with a pre-specified variance-covariance matrix described in the 'Methods' Section. Two cases were tested.

1. Gaussian case. The first two eigenvectors v1 and v2, both of length 500, follow a Gaussian distribution.
2. Super-Gaussian case. In this case the first two eigenvectors follow a mixture of Laplacian and uniform distributions:

$$
v_{1k} \sim \begin{cases} L(0, 25) & k = 1, \ldots, 50 \\ U(0, 1) & \text{otherwise,} \end{cases}
\qquad \text{and} \qquad
v_{2k} \sim \begin{cases} L(0, 25) & k = 301, \ldots, 350 \\ U(0, 1) & \text{otherwise.} \end{cases}
$$

Table 2 Mean value of the kurtosis measure of the first 5 loading vectors in the simulation study for PCA, ICA and IPCA.

                                    PCA       ICA       IPCA
Gaussian case          loading 1    -0.007    -0.015     0.54
                       loading 2    -0.009    -0.013     0.21
                       loading 3    -0.012    -0.013    -0.01
                       loading 4    -0.011    -0.013    -0.20
                       loading 5    -0.015    -0.015    -0.41
super-Gaussian case    loading 1    34.75      0.28     52.58
                       loading 2    34.16      0.43     33.81
                       loading 3    -0.01      0.42      0.27
                       loading 4    -0.01      0.44     -0.02
                       loading 5    -0.02      0.47     -0.25
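To illustrate why the kurtosis measure separates the two simulated cases, the following is a minimal Python sketch, not the authors' R implementation. It assumes that L(0, 25) denotes a Laplace distribution with location 0 and scale 25, and that the kurtosis measure is the excess kurtosis (close to zero for Gaussian values); all function names are hypothetical. It draws loading vectors as specified above, not the loading vectors estimated by PCA, ICA or IPCA, so its values are not expected to reproduce Table 2.

```python
import math
import random


def excess_kurtosis(values):
    """Excess kurtosis m4 / m2**2 - 3 (about 0 for Gaussian samples)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / (m2 * m2) - 3.0


def sample_laplace(rng, scale):
    """Draw from a zero-centred Laplace distribution by inverting its CDF."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # max() guards against log(0) in the unlikely event that u == -0.5.
    tail = max(1.0 - 2.0 * abs(u), 1e-300)
    return -scale * math.copysign(1.0, u) * math.log(tail)


def simulate_super_gaussian_loading(rng, p=500, first=0, last=50, scale=25.0):
    """Laplace entries on positions first..last-1, U(0, 1) entries elsewhere."""
    return [
        sample_laplace(rng, scale) if first <= k < last else rng.random()
        for k in range(p)
    ]


def simulate_gaussian_loading(rng, p=500):
    """All p entries drawn from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(p)]


if __name__ == "__main__":
    rng = random.Random(2012)  # arbitrary seed, for reproducibility only
    v1 = simulate_super_gaussian_loading(rng, first=0, last=50)
    v2 = simulate_super_gaussian_loading(rng, first=300, last=350)
    gaussian = simulate_gaussian_loading(rng)
    for name, vec in (("v1", v1), ("v2", v2), ("gaussian", gaussian)):
        print(name, round(excess_kurtosis(vec), 2))
```

In this sketch a few heavy-tailed entries dominate the fourth moment, which is why a vector with 50 Laplace entries among 500 has a much larger kurtosis than a purely Gaussian one.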
Table 1 records the median of the angles between the simulated (known) eigenvectors and the loading vectors estimated by the three approaches. PCA gave similar results in both simulation cases, and was able to estimate the loading vectors well, while ICA performed poorly in both cases. IPCA performed quite poorly in the Gaussian case, but outperformed PCA in the super-Gaussian case.

Table 2 displays the kurtosis values of the first 5 loading vectors. In IPCA the components are ordered with respect to the kurtosis values of their associated loading vectors, while in the FastICA algorithm the components are ordered with respect to the kurtosis values of the independent components. In the super-Gaussian case, these results show that the kurtosis value is a good post hoc indicator of the number of components to choose, as a sudden drop in the values corresponds to irrelevant dimensions (from 3 and onwards). Low kurtosis values in the Gaussian case indicate that non-Gaussianity of the loading vectors cannot be maximized, and that the assumptions of IPCA are not met (i.e. that a small number of genes heavily contribute to the observed biological process).

Tables 1 and 2 seem to suggest that ICA performs poorly in both the Gaussian and super-Gaussian cases, even if we would expect quite the contrary in the super-Gaussian case. In the high dimensional case, PCA is used as a pre-processing step in the ICA algorithm. It is likely that such a step affects the ICA input matrix and that the ICA assumptions are not met. Therefore, the performance of ICA seems to be largely affected by the high number of variables.

PCA gave satisfactory results in both cases. In the super-Gaussian case, PCA is even able to recover some of the super-Gaussian distribution of the loading vectors. However, IPCA is able to recover the loading structure better than PCA in the super-Gaussian case (angles are smaller in Table 1 and the kurtosis value is much higher for the first loading for IPCA). Depending on the (unknown) nature of the data set to be analyzed, it is therefore advisable to assess both approaches.

Application to real data sets

Liver Toxicity study

In this study, 64 male rats were exposed to non-toxic (50 or 150 mg/kg), moderately toxic (1500 mg/kg) or severely toxic (2000 mg/kg) doses of acetaminophen (paracetamol) in a controlled experiment [20]. In this paper, we considered 50 and 150 mg/kg as low doses, and 1500 and 2000 mg/kg as high doses. Necropsies were performed at 6, 18, 24 and 48 hours after exposure and the mRNA from the liver was extracted. The microarray data are arranged in a matrix of 64 samples and 3116 transcripts.
Table 1 Simulation study: angle (median value) between the simulated and estimated loading vectors simulated with either Gaussian or super-Gaussian distributions.

            Gaussian            super-Gaussian
Method      v1       v2         v1       v2
PCA         20.48    21.61      20.47    21.62
ICA         85.70    84.39      82.13    77.77
IPCA        70.05    69.72      12.46    14.08

Prostate cancer study

This study investigated whether gene expression differences could distinguish between common clinical and pathological features of prostate cancer. Expression profiles were derived from 52 prostate tumors and from 50 non tumor prostate samples (referred to as normal) using oligonucleotide microarrays containing probes for approximately 12,600 genes and ESTs. After preprocessing, the data comprise the expression of 6033 genes (see [21]) and

101 samples since one normal sample was suspected to be an outlier and was removed from the analysis.

Yeast metabolomic study

In this study, two Saccharomyces cerevisiae strains were used, wild-type (WT) and mutant (MT), and experiments were carried out in batch cultures under two different environmental conditions, aerobic (AER) and anaerobic (ANA), in standard mineral media with glucose as the sole carbon source. After normalization and preprocessing, the metabolomic data consist of 37 metabolites and 55 samples that include 13 MT-AER, 14 MT-ANA, 15 WT-AER and 13 WT-ANA samples (see [22] for more details).

Choosing the number of components with the kurtosis measure

As mentioned by [5], one major limitation of ICA is the specification and the choice of the number of components to extract. In PCA, the cumulative percentage of explained variance is a popular criterion to choose the number of principal components, since they are ordered by decreasing explained variance [1]. For the case of high dimensionality, many alternative ad hoc stopping rules have been proposed without, however, leading to a consensus (see [23] for a thorough review). In Liver Toxicity, the first 3 principal components explained 63% of the total variance; in Yeast, the first 2 principal components explained 85% of the total variance. For Prostate, which contains a very large number of variables, the first 3 components only explain 51% of the total variance (7 principal components would be necessary to explain more than 60%). However, from a visualization perspective, choosing more than 3 components would be difficult to interpret.

The kurtosis values of the loading vectors from PCA, ICA and IPCA are displayed in Table 3. These values differ from one approach to the others, as well as their order. In IPCA, the kurtosis value of the associated loading vectors gives a good indicator of the ability of the components to separate the clusters, since we are interested in extracting signals from non-Gaussian distributions. The first 2, 1 and 2 components seem enough in Liver Toxicity, Prostate and Yeast, respectively, to extract relevant information with IPCA, as is further discussed below.

Table 3 Kurtosis measures of the loading vectors for PCA, ICA and IPCA.

Dataset                                 PCA       ICA       IPCA
Liver Toxicity study       loading 1     6.588     7.697     9.700
                           loading 2     1.912     2.737     6.982
                           loading 3     6.958     4.799     0.672
Prostate cancer study      loading 1    -1.527    -0.553     1.513
                           loading 2    -0.561     0.723    -0.249
                           loading 3     1.176     1.640    -1.509
Yeast metabolomic study    loading 1     4.532     0.274     1.551
                           loading 2    12.261    -0.758     1.437
                           loading 3     4.147     1.677    -0.475

Sample representation

The samples in each data set were projected in the new subspace spanned by the PCA, ICA or IPCA components (Figures 1, 2 and 3). This kind of graphical output gives a better insight into the biological study as it reveals the shared similarities between samples. The comparison between the different graphics makes it possible to visualize how each method is able to partition the samples in a way that reflects the internal structure of the data, and to extract the relevant information to represent each sample. One would expect that the samples belonging to the same biological group, or undergoing the same biological treatment, would be clustered together and separated from the other groups.

In Liver Toxicity, IPCA tended to better cluster the low doses together, compared to PCA or ICA (Figure 1). In Prostate (Figure 2), PCA graphical representations showed interesting patterns. Neither the first nor the second component in PCA was relevant to separate the two groups. Instead, it was the third component that could give more insight into the expected biological characteristics of the samples. It is likely that PCA first attempts to maximize the variance of noisy signals, which have a Gaussian distribution, before being able to find the right direction to better differentiate the sample classes. For IPCA, the first component seemed already sufficient to separate the classes (as indicated by the kurtosis value of its associated loading vector in Table 3), while two components were necessary for ICA to achieve a satisfying clustering. For the Yeast study (Figure 3), even though the first 2 principal components explained 85% of the total variance, it seemed that 3 components were necessary to separate the WT from the MT in the AER samples with PCA, whereas 2 components were sufficient with ICA and IPCA. For all approaches, the WT and MT samples for the ANA group remain mixed and seem to share strong biological similarities.

Cluster validation

In order to compare how well different methods perform on a data set, different indexes were proposed in the literature to measure the similarities between clusters [24]. We used the Davies-Bouldin index [19] (see 'Methods' section). This index has both a statistical and geometric rationale, and looks for compact and well-separated clusters. The main purpose is to check whether the different approaches can distinguish between the known biological conditions or treatments on the basis of the expression data. The approach that gives the smallest index is considered the best clustering method based on this criterion. The results are displayed in Table 4 for a choice of 2 or 3 components. On the Liver Toxicity study, the Davies-Bouldin index indicated that IPCA outperformed the

other approaches using 2 components. When choosing 3 components, all approaches gave similar results. On Prostate, ICA slightly outperformed IPCA for 2 components and gave similar performances for 3 components. PCA seemed clearly limited by the large number of noisy variables and was not able to provide a satisfying clustering of the samples. ICA gave good clustering performance on the Yeast data set for 2 components, followed by PCA and IPCA. It is probable that there is very little noise in this small data set.

Figure 1 Liver Toxicity study: Sample representation. Sample representation using the first two components from PCA, ICA and IPCA approaches.

In fact, the Davies-Bouldin index seemed to indicate that for large data sets (Liver Toxicity and Prostate), IPCA performs best for a smaller number of components than PCA. It is able to highlight relevant information in a very small number of dimensions.

Variable selection

We first performed a simulation study to assess whether sIPCA could identify relevant variables. We then applied sIPCA to the Liver Toxicity study. In both cases, we compared sIPCA with the sparse PCA approach (sPCA-rSVD-soft from [18]) that we will subsequently call 'sPCA'.

Simulated example

Using the simulation framework described in the 'Methods' Section, we considered two cases:

1. Gaussian case. The two sparse simulated eigenvectors followed a Gaussian distribution:

$$
v_{1k} \begin{cases} \sim N(0, 1) & k = 1, \ldots, 50 \\ = 0 & \text{otherwise,} \end{cases}
\qquad \text{and} \qquad
v_{2k} \begin{cases} \sim N(0, 1) & k = 301, \ldots, 350 \\ = 0 & \text{otherwise.} \end{cases}
$$

2. Super-Gaussian case. In this case, we have

$$
v_{1k} \begin{cases} \sim L(0, 25) & k = 1, \ldots, 50 \\ = 0 & \text{otherwise,} \end{cases}
\qquad \text{and} \qquad
v_{2k} \begin{cases} \sim L(0, 25) & k = 301, \ldots, 350 \\ = 0 & \text{otherwise.} \end{cases}
$$

Each eigenvector has 50 non-zero variables and the coefficients in the loading vectors associated with these non-zero variables follow a Gaussian or super-Gaussian distribution. sPCA and sIPCA were then applied on each generated data set. Both approaches require the degree of sparsity, which was set to 50, as an input parameter on each component. One can imagine that each eigenvector describes a particular biological process to which 50 genes contribute heavily or very heavily.
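The role of the degree of sparsity can be illustrated with a short Python sketch of soft-thresholding, the operation that sIPCA applies to the independent loading vectors. This is only an illustration under stated assumptions: choosing the threshold as the (k + 1)-th largest absolute loading, so that at most k entries remain non-zero, is an assumption made here, and the helper names are hypothetical rather than part of mixOmics.

```python
import math


def soft_threshold(vector, lam):
    """Shrink every entry towards zero by lam; entries with |x| <= lam become 0."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vector]


def sparsify(vector, keep):
    """Soft-threshold so that at most `keep` entries stay non-zero.

    Assumption: the threshold is the (keep + 1)-th largest absolute value.
    Ties at the threshold can leave fewer than `keep` non-zero entries.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    if keep >= len(vector):
        return list(vector)
    ranked = sorted((abs(x) for x in vector), reverse=True)
    return soft_threshold(vector, ranked[keep])


def support(vector):
    """Indices of the non-zero loadings, i.e. the selected variables."""
    return [k for k, x in enumerate(vector) if x != 0.0]


if __name__ == "__main__":
    loading = [0.5, -2.0, 0.1, 3.0, -0.2]  # toy loading vector
    sparse = sparsify(loading, keep=2)
    print(sparse, support(sparse))
```

With a loading vector of length 500 and `keep=50`, the indices returned by `support` would play the role of the 50 selected variables on that dimension.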

Figure 2 Prostate cancer study: sample representation. Sample representation using the first two or three components from PCA, ICA and IPCA approaches.

Table 5 displays the correct identification rate of each loading vector estimated by sPCA and sIPCA. Given this non trivial setting, both approaches identified the important variables very well, especially on the first dimension, where sPCA slightly outperformed sIPCA. On the second dimension, the performances of sPCA and sIPCA differ as sPCA fails to differentiate each sparse signal separately - it tended to select variables from both dimensions in the second loading vector. On the contrary, and especially in the super-Gaussian case, sIPCA is able to identify each sparse eigenvector signal separately, i.e. each simulated biological process. sPCA performed better in the Gaussian than in the super-Gaussian case, whereas sIPCA performed almost equally well in both cases.

Real example with Liver Toxicity study

Choosing the number of genes to select

Figure 4 displays the Davies Bouldin index for various gene selection sizes. sIPCA clearly outperformed sPCA. In order to compare the biological relevance of the two gene selections, a selection size of 50 genes per dimension, for 2 dimensions, was arbitrarily chosen for the following analysis. Even if not optimal from the index perspective, this choice was mostly guided by the number of subsequent annotated genes that could be analyzed in the biological interpretation. For each approach, the gene lists of different sizes are embedded into each other, and a compromise has to be made to obtain a sufficient but not too large list of genes to be interpreted.

Comparison of the sparse loading vectors

The first and second sparse loading vectors for both sPCA and sIPCA are plotted in Figure 5 (absolute values). In the first dimension, the loading vectors of the two sparse approaches are very similar (correlation of 0.98), a fact that was already indicated in the above simulation study. Both approaches select the same variables. On the second dimension, however, the sparse loading vectors differ (correlation of 0.28) as IPCA (similar to ICA) leads to a basis that is not necessarily orthogonal, which may reconstruct the data better than PCA in the presence of noise and is sensitive to high order statistics in

the data rather than the covariance matrix only [25]. This explains why sPCA and sIPCA give different subspaces.

Figure 3 Yeast metabolomic study: sample representation. Sample representation using the first two or three components from PCA, ICA and IPCA approaches.

Sample representation

The PCs and IPCs are displayed in Figure 6. Since most of the noisy variables were removed, sPCA seemed to give a better clustering of the low doses compared to Figure 1. sIPCA and IPCA remain similar, which shows that IPCA is well able to separate the noise from the biologically relevant signal.

Biological relevance of the selected genes

We have seen that the independent principal components indicate relevant biological similarities between the samples. We next assessed whether these selected genes were relevant to the biological study. The genes selected with either sIPCA or sPCA were further investigated using the GeneGo software [26], that can output pathways, process networks, Gene Ontology (GO) processes and molecular functions.

We decided to focus only on the first two dimensions as they were sufficient to obtain a satisfying cluster of
Table 5 Simulation study: average percentage of correctly identified non-zero loadings (standard deviation) when 50 variables are selected on each dimension (each loading vector).

           Gaussian                        super-Gaussian
Method     v1             v2               v1             v2
sPCA       90.30% (3.5)   72.5% (11.6)     85.44% (4.3)   68.22% (10.6)
sIPCA      86.7% (8.3)    87.7% (8.1)      80.80% (8.6)   82.30% (8.4)

Table 4 Davies Bouldin index for PCA, ICA and IPCA on the three data sets.

Dataset                    # of components    PCA      ICA      IPCA
Liver Toxicity study       2 components       1.809    1.923    1.242
Liver Toxicity study       3 components       1.523    1.578    1.525
Prostate cancer study      2 components       4.117    1.679    1.782
Prostate cancer study      3 components       3.312    2.316    2.315
Yeast metabolomic study    2 components       1.894    1.788    2.338
Yeast metabolomic study    3 components       2.119    2.139    2.037
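To make the Davies Bouldin values of Table 4 more concrete, the following Python sketch computes the index in its usual textbook form: the scatter of a cluster is the mean Euclidean distance of its samples to their centroid, and each cluster is scored by its worst ratio of summed scatters to centroid distance. This form is an assumption; the exact variant used in the paper is the one defined in its 'Methods' section, and the function names are hypothetical. Lower values indicate more compact, better separated clusters.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def davies_bouldin(points, labels):
    """Davies-Bouldin index of labelled points (smaller is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    centers = {lab: centroid(pts) for lab, pts in clusters.items()}
    scatter = {
        lab: sum(math.dist(p, centers[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Note: two clusters with identical centroids would divide by zero.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


if __name__ == "__main__":
    # Toy sample coordinates on two components, with known group labels.
    scores = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    groups = ["low", "low", "high", "high"]
    print(davies_bouldin(scores, groups))
```

In the setting of the paper, `points` would be the sample coordinates on the first 2 or 3 components of a given method and `labels` the known biological conditions.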

the samples (see previous results). We therefore analyzed the two lists of 50 genes selected with either sIPCA or sPCA for each of these two dimensions. Amongst these 50 genes, between 33 and 39 genes were annotated and recognized by the software.

Figure 4 Liver Toxicity study: Davies Bouldin index for sIPCA and sPCA. Comparison of the Davies Bouldin index for sIPCA and sPCA with respect to the number of variables selected on 2 components. (Plot: x-axis, number of selected genes on each dimension, from 1 to 1000; y-axis, Davies Bouldin index; one curve each for sIPCA and sPCA.)

Genes selected on dimension 1. Both methods selected genes previously highlighted in the literature as having functions in detoxification and redox regulation in response to oxidative stress: on the first dimension, sIPCA selected 2 cytochrome P450 genes and heme oxygenase 1, while sPCA selected 1 cytochrome P450 gene and heme oxygenase 1 (see Additional files 1 and 2). The expression of these genes has been found to be altered in biological pathways perturbed subsequent to incipient toxicity [27-32]. These genes were also previously selected with other statistical approaches by other colleagues on the same study [20].

A Gene Ontology enrichment analysis for each list of genes was performed. GO terms significantly enriched included biological processes related to response to unfolded proteins, protein refolding and protein stimulus, as well as response to chemical stimulus and organic substance (Additional file 3). Although very similar, the sPCA gene list highlighted slightly more genes related to these GO terms than the sIPCA gene selection. The GO molecular functions related to these genes were, however, more enriched with sIPCA: heme and unfolded protein binding as well as oxidoreductase activity (Additional file 4).

Genes selected on dimension 2. The gene lists from dimension two not only highlighted response to unfolded protein and to organic substance, but also cellular carbohydrate biosynthesis process, triglyceride, acylglycerol, neutral metabolic processes as well as catabolic process and glucogenesis. For this dimension, however, it is sIPCA that selected more relevant genes that enriched these terms (Additional file 5).

In terms of pathways, both approaches selected HSP70 and HSP90 genes. The HSP70 gene encodes a member of the heat shock protein 70 family. These proteins play a role in cell proliferation and stress response, which explains the presence of pathways such as oxidative stress [33,34] (Additional file 6). The HSP90 proteins are highly conserved molecular chaperones that have key roles in signal transduction, protein folding and protein degradation. They play an important role in folding newly synthesized proteins or stabilizing and refolding denatured proteins after stress [35].

Figure 5 Liver Toxicity study: sparse loading vectors. Comparison of the first two sparse loading vectors generated by sIPCA and sPCA.

Figure 6 Liver Toxicity study: sample representation with sparse variants. Sample representation using the first two principal components of sPCA and sIPCA approaches when 50 variables are selected on each dimension.

Summary. This preliminary analysis demonstrates the ability of sIPCA and sPCA to select genes that were relevant to the biological study. These genes that are

ranked as being 'important' by both approaches, participate in the determination of the components which are linear combinations of the original variables. Therefore, the expression of these selected genes not only helps clustering the samples according to the different treatments or biological conditions but also has a biologically relevant meaning for the system under study.

Conclusions

We have developed a variant of PCA called IPCA that combines the advantages of both PCA and ICA. IPCA assumes that biologically meaningful components can be obtained if most noise has been removed from the associated loading vectors. By identifying non-Gaussian loading vectors from the biological data, it better reflects the internal structure of the data compared to PCA and ICA. On simulated data sets, we showed that IPCA outperformed PCA and ICA in the super-Gaussian case, and that the kurtosis value of the loading vectors can be used to choose the number of independent principal components. On real data sets, we assessed the cluster validity using the Davies Bouldin index and showed that in high dimensional cases, IPCA could summarize the information of the data better or with a smaller number of components than PCA or ICA.

algorithm uses Singular value decomposition (SVD): suppose X is a centered n × p matrix (the mean of each column has been subtracted), where n is the number of samples (or observations) and p is the number of variables or biological entities that are measured. Then the SVD of data matrix X can be defined as

$$X = U D V^{T}, \qquad (1)$$

where U is an n × p matrix whose columns are uncorrelated (i.e. $U^{T} U = I_p$), V is a p × p orthogonal matrix (i.e. $V^{T} V = I_p$), and D is a p × p diagonal matrix with diagonal elements $d_j$. We denote $u_j$ the columns of U and $v_j$ the columns of V. Then $u_j d_j$ is the jth principal component (PC) and $v_j$ is the corresponding loading vector [1]. The PCs are linear combinations of the original variables and the loading vectors indicate the weights assigned to each of the variables in the linear combination. The first PC accounts for the maximal amount of the total variance. Similarly, the jth (j = 2,..., p) PC can explain the maximal amount of variance that is not accounted for by the previous j - 1 PCs. Therefore, most of the information contained in X can be reduced to a few PCs. Plotting the PCs enables a visual representation of the samples projected in the subspace spanned by the PCs. We can
We also introduced sIPCA that allows an internal expect that the samples belonging to the same biological
variable selection procedure. By applying a soft-thresh- group, or undergoing the same biological treatment
olding penalization on the independent loading vectors, would be clustered together and separated from the
sparse loading vectors are obtained which enable vari- other groups.
able selection. We have shown that sIPCA can correctly Limitation of PCA
identify most of the important variables in a simulation Sometimes, however, PCA may not be able to extract
study. For one data set, the genes selected with sIPCA relevant information and may therefore provide mean-
and sPCA were further investigated to assess whether ingless principal components that do not describe
the two approaches were able to select genes that were experimental characteristics. The reason is that its linear
relevant to the system under study given these genes, transformation involves second order statistics (i.e. to
relevant GO terms, molecular functions and pathways obtain mutually non-orthogonal PCs) that might not be
where highlighted. This analysis demonstrated the ability appropriate for biological data. PCA assumes that gene
of such approaches to unravel biologically relevant expression data have Gaussian signals, while it has been
information. The expression of these selected genes is demonstrated that many gene expression data in fact
also decisive to cluster the samples according to their have ‘super-Gaussian’ signals [2,4].
biological conditions.
We believe that (s)IPCA approach can be useful, not Independent Component Analysis (ICA)
only to improve data visualization and reveal experimen- Independent Component Analysis (ICA) was first pro-
tal characteristics, but also to identify biologically rele- posed by [8]. ICA can reduce the effects of noise or arte-
vant variables. IPCA and sIPCA are implemented in the facts in the data as it aims at separating a mixture of
R package mixomics [36,37] and its associated web- signals into their different sources. By assuming non-
interface http://mixomics.qfab.org. Gaussian signal distribution, ICA models observations as a
linear combinations of variables, or components, which
Methods are chosen to be as statistically independent as possible
Principal Component Analysis (PCA) (i.e. the different components represent different non-
PCA is a classical dimension reduction and feature overlapping information). ICA therefore involves higher-
extraction tool in exploratory analysis, and has been order statistics [14]. In fact, ICA attempts to recover statis-
used in a wide range of fields. There exists different tically independent signal from the observations of an
ways of solving PCA. The most computationally efficient unknown linear mixture. Several algorithms such as

FastICA, Kernel ICA [38] and ProDenICA [39] were proposed to estimate the independent components. The FastICA algorithm maximizes non-Gaussianity of each component, while Kernel ICA and ProDenICA minimize mutual information between components. In this article, we used the FastICA algorithm.
Let X (n × p) be the centered data matrix and S (n × p) the matrix containing the independent components (IC). We can solve the ICA problem by introducing a mixing matrix A of size n × n:

X = A S. (2)

The mixing matrix A indicates how the independent components of S are linearly combined to construct X. If we rearrange the equation above, we get

S = W X, (3)

where W (n × n) is the unmixing matrix that describes the inverse process of mixing the ICs. If we assume that A is a square and orthonormal matrix, then W is simply the transpose of A. In practice, it is very useful to whiten the data matrix X, i.e., to obtain Cov(X) = I. This allows the mixing matrix A to be orthogonal: Cov(AS) = I and S S^T = I ⇒ A A^T = I. The orthogonality of the matrix also means that fewer parameters need to be estimated. In the FastICA algorithm, PCA is used as a pre-processing step to whiten the data matrix. If we rearrange (1), we therefore obtain

U^T = D^{-1} V^T X^T, (4)

since the columns of V are orthonormal. The rows of U^T are uncorrelated and have zero mean. To complete the whitening step, we can multiply U^T by √(n − 1), so that the rows of U^T have unit variance. Then let Ũ be the whitened PCs (Ũ = √(n − 1) U^T). The ICs are estimated through the following equation:

S = W Ũ. (5)

ICA assumes that Gaussian distributions represent noise, and therefore aims at identifying non-Gaussian components in the sample space that are as independent as possible. Recent studies have observed that the signal distribution of microarray data is typically super-Gaussian since only a small number of genes contribute heavily to a specific biological process [2,5].
Two classical quantitative measures of Gaussianity are kurtosis and negentropy.

• Kurtosis, also called the fourth-order cumulant, is defined as

K = E{s_i^4} − 3, (6)

where s_i is the ith row of S, which has zero mean and unit variance, i = 1, ..., n. The kurtosis value equals zero if s_i has a Gaussian probability density function (pdf), is positive if s_i has a spiky pdf (super-Gaussian, i.e. the pdf is relatively large at zero) and is negative if s_i has a flat pdf (sub-Gaussian, i.e. the pdf is rather constant near zero). We are interested in the spiky and flat pdf (i.e. non-Gaussian pdfs) since non-Gaussianity is regarded as independence [9]. Note that although kurtosis is both computationally and theoretically simple, it can be very sensitive to outliers. The authors in [6] proposed to order the ICs based on their kurtosis value.

• In the FastICA algorithm, negentropy is used as it is an excellent measurement of non-Gaussianity. Negentropy equals zero if s_i is Gaussian and is positive if s_i is non-Gaussian. It is not only easy to compute, but also very robust [9]. However, this measure does not distinguish between super-Gaussianity and sub-Gaussianity.
Limitation of ICA
Similar to PCA, ICA also suffers from high dimensionality, which sometimes leads to the inability of the ICs to reflect the (biologically expected) internal structure of the data. Furthermore, since ICA is a stochastic algorithm, it faces the problem of convergence to local optima, leading to slightly different ICs when re-analyzing the same data [40].

Independent Principal Component Analysis (IPCA)
To reduce noise and better reflect the internal structure of the data generated by the biological experiment, we propose a new approach called Independent Principal Component Analysis (IPCA). Rather than denoising the data or the PCs directly, as is done in ICA, we propose instead to reduce the noise in the loading vectors. Recall that the PCs, which are then used to visualize the samples and how they cluster together, are a linear combination of the original variables weighted by their elements in the corresponding loading vectors. Thus we will obtain denoised PCs by using ICA as a denoising process of the associated loading vectors.
We make the assumption that in a biological system, different variables (biological entities, such as genes and metabolites) have different levels of expression or abundance depending on the biological conditions. Therefore, only a few variables contribute to a biological process. These relevant variables should have important weights in the loading vectors while other irrelevant or noisy variables should have very small weights. In fact, once the loading vectors are denoised, we expect them to have a

super-Gaussian distribution (as opposed to a Gaussian distribution when noise is included, see Figure 7 for the plot of a typical super-Gaussian and a Gaussian distribution). Maximizing non-Gaussianity of the loading vectors will thus remove most of the noise. IPCA is described below and summarized in Table 6.
Extract the loading vectors from PCA
PCA is applied to the X (n × p) centered data matrix using SVD to extract the loading vectors:

X = U D V^T, (7)

where the columns of V contain the loading vectors. Since the mean of each loading vector is very close to zero, these vectors are approximately whitened and the FastICA algorithm can be applied on the loading vectors.
Dimension reduction
Dimension reduction enables a clearer interpretation without the computational burden. Therefore, only a small number of loading vectors, or, equivalently, a small number of PCs is needed to summarize most of the relevant information. However, there is no globally accepted criterion on how to choose the number of PCs to keep. We have shown that the kurtosis value of the independent loading vectors gives a post hoc indication of the number of independent principal components to be chosen (see ‘Results and Discussion’ Section). We have experimentally observed that 2 or 3 components were sufficient to highlight meaningful characteristics of the data and to discard much of the noise or irrelevant information.
Apply ICA on the loading vectors
The non-Gaussianity of the loading vectors can be maximized using equation (5):

S = W Ṽ^T, (8)

where Ṽ is the (p × m) matrix containing the m chosen loading vectors, W is the (m × m) unmixing matrix

[Figure 7 plot: density curves of a super-Gaussian and a Gaussian distribution; x-axis from −4 to 4, y-axis from 0.0 to 0.5.]
Figure 7 Super-Gaussian vs. Gaussian distribution. A super-Gaussian distribution (Laplace distribution for example) has a more spiky peak
and a longer tail than a Gaussian distribution. The distribution of a noiseless loading vector is similar to a super-Gaussian distribution. If a large
amount of noise exists in the loading vectors, its distribution will tend towards a Gaussian distribution.
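
The contrast shown in Figure 7 can be checked numerically with the kurtosis of equation (6). The following is a minimal sketch, not the authors' code: the sample size, the random seed and the helper name `excess_kurtosis` are illustrative assumptions, and the Laplace samples are drawn as the difference of two exponential variates.

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample version of K = E{s^4} - 3 after standardising the values
    to zero mean and unit variance (hypothetical helper)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    standardised = [(v - mean) / sd for v in values]
    return statistics.fmean([z ** 4 for z in standardised]) - 3.0


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 20000               # illustrative sample size

gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
# The difference of two independent Exp(1) variates follows a Laplace(0, 1) law.
laplace = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]

k_gauss = excess_kurtosis(gaussian)
k_laplace = excess_kurtosis(laplace)
print(f"Gaussian kurtosis: {k_gauss:.2f}")
print(f"Laplace kurtosis:  {k_laplace:.2f}")
```

Under these assumptions the Gaussian sample should give a value near zero and the Laplace sample a clearly positive value (the theoretical value for a Laplace distribution is 3), which is the sense in which a spiky, long-tailed loading vector is called super-Gaussian.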

Table 6 Summary of the IPCA algorithm.

Algorithm Principal Component Analysis with Independent loadings (IPCA)
1. Implement SVD on the centered data matrix X to generate the whitened loading vectors V, and choose the number of components m to reduce the dimension.
2. Implement FastICA on the loading vectors V and obtain the independent loading vectors S^T.
3. Project the centered data matrix X on the m independent loading vectors s_j and get the independent PCs ũ_j, j = 1, ..., m.
4. Order the IPCs by the kurtosis value of their corresponding independent loading vectors.
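
Steps 3 and 4 of Table 6 can be sketched in a few lines. This is a toy illustration rather than the mixOmics implementation: it assumes that steps 1 and 2 (SVD and FastICA) have already produced the independent loading vectors, and the matrices `X` and `S`, as well as the function names, are made-up placeholders.

```python
from statistics import fmean, pstdev


def excess_kurtosis(vec):
    """K = E{s^4} - 3 on the standardised vector (hypothetical helper)."""
    mean, sd = fmean(vec), pstdev(vec)
    return fmean([((v - mean) / sd) ** 4 for v in vec]) - 3.0


def project(X, S):
    """Step 3: independent PCs as X S^T, with X of size n x p and S of size m x p."""
    return [[sum(x * s for x, s in zip(row, load)) for load in S] for row in X]


def order_ipcs(X, S):
    """Step 4: sort the loading vectors by decreasing kurtosis, then project."""
    order = sorted(range(len(S)), key=lambda j: excess_kurtosis(S[j]), reverse=True)
    return order, project(X, [S[j] for j in order])


# Toy independent loading vectors (m = 2, p = 6): one flat, one spiky.
S = [
    [0.4, -0.4, 0.4, -0.4, 0.4, -0.4],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
# Toy centered data matrix (n = 2, p = 6): every column sums to zero.
X = [
    [2.0, -1.0, 0.0, 1.0, -2.0, 3.0],
    [-2.0, 1.0, 0.0, -1.0, 2.0, -3.0],
]

order, ipcs = order_ipcs(X, S)
print(order)  # indices of the loading vectors, spikiest first
print(ipcs)   # n x m matrix of independent PCs in that order
```

In this toy case the spiky loading vector has the larger kurtosis and is therefore ranked first, so the first column of `ipcs` is the projection of the samples on it.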

and S is the (m × p) matrix whose rows are the independent loading vectors. The new independent principal components (IPCs) are obtained by projecting X on S^T:

Ũ = X S^T, (9)

where Ũ is an (n × m) matrix whose columns contain the IPCs.
Ordering the IPCs
Recall that ICA provides unordered components and that the kurtosis measure indicates the Gaussian characteristic of a pdf. [6] recently proposed to use the kurtosis measure of the ICs to order them. In IPCA, we propose instead to order the IPCs according to the kurtosis value of the m independent loading vectors s_j (j = 1, ..., m), as we are mainly interested in loading vectors with a spiky pdf, indicated by a large kurtosis value.

Sparse IPCA (sIPCA)
Similar to PCA and ICA, the elements in the loading vectors in IPCA indicate which variables are important or relevant to determine the principal components. Therefore, obtaining sparse loading vectors enables variable selection to identify important variables of potential biological relevance, as well as removing noisy variables while calculating the IPCs in the algorithm.
Various sparse PCA approaches have been proposed in the literature (SPCA [41], sPCA-rSVD [18], SPC [42]). In these approaches, the loading vectors are penalized using Lasso [43] to perform an internal variable selection. In fact, all these sparse PCA variants can be approximately solved by using soft-thresholding [17]. Our sparse IPCA therefore directly implements soft-thresholding on the independent loading vector s_j to select the variables:

ŝ_jk = sign(s_jk)(|s_jk| − γ)_+, (10)

where γ is the threshold and is applied on each element k of the loading vector s_j (k = 1, ..., p, j = 1, ..., m) so as to obtain the sparse loading vector ŝ_j. The variables whose absolute weights are smaller than the threshold γ are penalized to have zero weights. A classical method to choose γ is cross-validation. In practice, however, γ has been replaced by the degree of sparsity (i.e., the number of non-zero elements in each loading vector, see following paragraph). In this way, we can control how many variables to select and save some computational time.

Using (s)IPCA
IPCA and sIPCA are implemented in the R package mixOmics, which is dedicated to the analysis of large biological data sets [36,37]. The use of the approaches is straightforward: the user needs to input the data set, and to choose the number of components to keep (usually set to a small value). In the case of the sparse version, the number of variables to select on each sIPCA dimension must also be given. The number of components can be reconsidered afterwards by extracting the kurtosis value of the loading vectors: a sudden drop in the obtained values indicates how many components are enough to explain most of the information in the data.
The number of variables to select is still an open issue (as pinpointed by many authors working on sparse approaches, [18]) as in such studies, we are often limited by the number of samples. Tuning the number of variables to select therefore mostly relies on the biological question. Sometimes, an optimal but too short gene selection may not suffice to give a comprehensive biological interpretation, and sometimes, the experimental validation might be limited in the case of a too large gene selection.
In our example, for the sake of simplicity, we have set the same number of variables to select on each dimension.

Simulation studies
In the different simulation studies, we used the following framework (previously proposed by [18]). Σ is the variance-covariance matrix of size 500 × 500, whose first two normalized eigenvectors v_1 and v_2, both of length 500, are simulated for different cases described in the ‘Results and Discussion’ Section. The other eigenvectors were drawn from U[0, 1]. A Gram-Schmidt orthogonalization method was applied to obtain the orthogonal matrix V whose columns contain v_1 and v_2 and the other eigenvectors. To make the first two eigenvectors dominate, the first two eigenvalues were set to c_1 = 400, c_2 = 300 and c_k = 1 for k = 3, ..., 500. Let C = diag{c_1, ..., c_500} be the eigenvalue

matrix, then Σ = V C V^T. The data are then generated from a multivariate normal distribution N(0, Σ), with n = 50 samples and p = 500 variables.

Davies-Bouldin index
The Davies-Bouldin measure is an index of crisp cluster validity [19]. This index compares the within-cluster scatter with the between-cluster separation. It was chosen in this study because of its statistical and geometric rationale. The Davies-Bouldin index is defined as

(1/K) Σ_{i=1}^{K} max_{j ≠ i} (σ_i + σ_j) / d(c_i, c_j),

where c_i is the centroid of cluster i, σ_i is the average distance of all elements in cluster i to centroid c_i, d(c_i, c_j) is the distance between the two centroids, and K is the number of known biological conditions or treatments. Depending on the number of components that were chosen, we applied a 2- or 3-norm distance. Geometrically speaking, we are seeking to minimize the within-cluster scatter (the numerator) while maximizing the between-cluster separation (the denominator). Therefore, for a given number of components, the approach that gives the lowest index has the best clustering ability.

Additional material
Additional file 1: List of genes from sIPCA. List of genes and gene title selected by sIPCA on each dimension on Liver Toxicity study.
Additional file 2: List of genes from sPCA. List of genes and gene title selected by sPCA on each dimension on Liver Toxicity study.
Additional file 3: GeneGo analysis. Comparison of the GO processes for the genes selected on dimension 1 with sIPCA and sPCA on Liver Toxicity study.
Additional file 4: GeneGo analysis. Comparison of the GO molecular functions for the genes selected on dimension 1 with sIPCA and sPCA on Liver Toxicity study.
Additional file 5: GeneGo analysis. Comparison of the GO processes for the genes selected on dimension 2 with sIPCA and sPCA on Liver Toxicity study.
Additional file 6: GeneGo analysis. Comparison of the GeneGO pathways maps for the genes selected on dimension 1 with sIPCA and sPCA on Liver Toxicity study.

Acknowledgements
We would like to thank Dr Thibault Jombart (Imperial College) for his useful advice. This work was supported, in part, by the Wound Management Innovation CRC (established and supported under the Australian Government’s Cooperative Research Centres Program).

Author details
1 Shanghai University of Finance and Economics, Shanghai, P.R. China. 2 Queensland Facility for Advanced Bioinformatics, University of Queensland, St Lucia, QLD 4072, Australia. 3 Sup’Biotech, Villejuif, F-94800, France.

Authors’ contributions
FY performed the statistical analysis, wrote the R functions and drafted the manuscript. KALC participated in the design of the manuscript and helped drafting the manuscript. JC participated in the implementation of the R functions and implemented IPCA in the web-interface. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Received: 5 September 2011 Accepted: 3 February 2012
Published: 3 February 2012

References
1. Jolliffe I: Principal Component Analysis. second edition. Springer, New York; 2002.
2. Lee S, Batzoglou S: Application of independent component analysis to microarrays. Genome Biology 2003, 4(11):R76.
3. Purdom E, Holmes S: Error distribution for gene expression data. Statistical applications in genetics and molecular biology 2005, 4:16.
4. Huang D, Zheng C: Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics 2006, 22(15):1855.
5. Engreitz J, Daigle B Jr, Marshall J, Altman R: Independent component analysis: Mining microarray data for fundamental human gene expression modules. Journal of Biomedical Informatics 2010, 43:932-944.
6. Scholz M, Gatzek S, Sterling A, Fiehn O, Selbig J: Metabolite fingerprinting: detecting biological features by independent component analysis. Bioinformatics 2004, 20(15):2447-2454.
7. Frigyesi A, Veerla S, Lindgren D, Höglund M: Independent component analysis reveals new and biologically significant structures in micro array data. BMC bioinformatics 2006, 7:290.
8. Comon P: Independent component analysis, a new concept? Signal Process 1994, 36:287-314.
9. Hyvärinen A, Oja E: Independent Component Analysis: Algorithms and Applications. Neural Networks 2000, 13(4-5):411-430.
10. Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. John Wiley & Sons; 2001.
11. Liebermeister W: Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002, 18:51-60.
12. Wienkoop S, Morgenthal K, Wolschin F, Scholz M, Selbig J, Weckwerth W: Integration of Metabolomic and Proteomic Phenotypes. Molecular & Cellular Proteomics 2008, 7:1725-1736.
13. Rousseau R, Govaerts B, Verleysen M: Combination of Independent Component Analysis and statistical modelling for the identification of metabonomic biomarkers in H-NMR spectroscopy. Tech rep, Université Catholique de Louvain and Université Paris I 2009.
14. Kong W, Vanderburg C, Gunshin H, Rogers J, Huang X: A review of independent component analysis application to microarray gene expression data. BioTechniques 2008, 45(5):501.
15. Teschendorff A, Journée M, Absil P, Sepulchre R, Caldas C: Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS computational biology 2007, 3(8):e161.
16. Jolliffe I, Trendafilov N, Uddin M: A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics 2003, 12:531-547.
17. Donoho D, Johnstone I: Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81:425-455.
18. Shen H, Huang JZ: Sparse Principal Component Analysis via Regularized Low Rank Matrix Approximation. Journal of Multivariate Analysis 2008, 99:1015-1034.
19. Davies D, Bouldin D: A cluster separation measure. Pattern Analysis and Machine Intelligence, IEEE Transactions on 1979, 2:224-227.
20. Bushel P, Wolfinger RD, Gibson G: Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Systems Biology 2007, 1.
21. Singh D, Febbo P, Ross K, Jackson D, Manola J, Ladd C, Tamayo P, Renshaw A, D’Amico A, Richie J, Lander E, Loda M, Kantoff P, Golub T, Sellers W: Gene expression correlates of clinical prostate cancer behavior. Cancer cell 2002, 1(2):203-209.

22. Villas-Boâs S, Moxley J, Åkesson M, Stephanopoulos G, Nielsen J: High-throughput metabolic state analysis: the missing link in integrated functional genomics. Biochemical Journal 2005, 388:669-677.
23. Cangelosi R, Goriely A: Component retention in principal component analysis with application to cDNA microarray data. Biology Direct 2007, 2(2).
24. Bezdek J, Pal N: Some new indexes of cluster validity. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 1998, 28(3):301-315.
25. Bartlett M, Movellan J, Sejnowski T: Face recognition by independent component analysis. Neural Networks, IEEE Transactions on 2002, 13(6):1450-1464.
26. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J, Midori A, Hill D, Issel-Tarver L, Kasarskis A, Lewis S, Matese J, Richardson J, Ringwald M, Rubin G, Sherlock G: Gene Ontology: tool for the unification of biology. Nature genetics 2000, 25:25-29.
27. Bauer I, Vollmar B, Jaeschke H, Rensing H, Kraemer T, Larsen R, Bauer M: Transcriptional activation of heme oxygenase-1 and its functional significance in acetaminophen-induced hepatitis and hepatocellular injury in the rat. Journal of hepatology 2000, 33(3):395-406.
28. Hamadeh H, Bushel P, Jayadev S, DiSorbo O, Bennett L, Li L, Tennant R, Stoll R, Barrett J, Paules R, Blanchard K, Afshari C: Prediction of compound signature using high density gene expression profiling. Toxicological Sciences 2002, 67(2):232.
29. Heijne W, Slitt A, Van Bladeren P, Groten J, Klaassen C, Stierum R, Van Ommen B: Bromobenzene-induced hepatotoxicity at the transcriptome level. Toxicological Sciences 2004, 79(2):411.
30. Heinloth A, Irwin R, Boorman G, Nettesheim P, Fannin R, Sieber S, Snell M, Tucker C, Li L, Travlos G, Vansant G, Blackshear P, Tennant R, Cunningham M, Paules R: Gene expression profiling of rat livers reveals indicators of potential adverse effects. Toxicological Sciences 2004, 80:193.
31. Waring J: Development of a DNA microarray for toxicology based on hepatotoxin-regulated sequences. Environmental health perspectives 2003, 111(6):863.
32. Wormser U, Calp D: Increased levels of hepatic metallothionein in rat and mouse after injection of acetaminophen. Toxicology 1988, 53(2-3):323-329.
33. Flaherty K, DeLuca-Flaherty C, McKay D: Three-dimensional structure of the ATPase fragment of a 70 K heat-shock cognate protein. Nature 1990, 346(6285):623.
34. Tavaria M, Gabriele T, Kola I, Anderson R: A hitchhiker’s guide to the human Hsp70 family. Cell Stress & Chaperones 1996, 1:23.
35. Panaretou B, Siligardi G, Meyer P, Maloney A, Sullivan J, Singh S, Millson S, Clarke P, Naaby-Hansen S, Stein R, Cramer R, Mollapour M, Workman P, Piper P, Pearl L, Prodromou C: Activation of the ATPase activity of hsp90 by the stress-regulated cochaperone aha1. Molecular cell 2002, 10(6):1307-1318.
36. Lê Cao KA, González I, Déjean S: integrOmics: an R package to unravel relationships between two omics data sets. Bioinformatics 2009, 25(21):2855-2856.
37. mixOmics. [http://www.math.univ-toulouse.fr/~biostat/mixOmics].
38. Bach F, Jordan M: Kernel Independent Component Analysis. Journal of Machine Learning Research 2002, 3:1-48.
39. Hastie T, Tibshirani R: Independent Components Analysis through Product Density Estimation. 2002.
40. Himberg J, Hyvarinen A, Esposito F: Validating the independent components of neuroimaging time series via clustering and visualization. Neuroimage 2004, 22(3):1214-1222.
41. Zou H, Hastie T, Tibshirani R: Sparse Principal Component Analysis. J Comput Graph Statist 2006, 15(2):265-286.
42. Witten D, Tibshirani R, Hastie T: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009, 10(3):515.
43. Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 1996, 58:267-288.

doi:10.1186/1471-2105-13-24
Cite this article as: Yao et al.: Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinformatics 2012 13:24.

Ferath Kherif PCA
17 pages
fastICA
No ratings yet
fastICA
7 pages
3 - Feature Extraction
No ratings yet
3 - Feature Extraction
22 pages
A Comparison of Six Methods For Missing Data Imputation 2155 6180 1000224 PDF
No ratings yet
A Comparison of Six Methods For Missing Data Imputation 2155 6180 1000224 PDF
6 pages
Biological Databases
No ratings yet
Biological Databases
28 pages
QSRI-lecture4
No ratings yet
QSRI-lecture4
56 pages
Principal Components Analysis: Contents at A Glance
No ratings yet
Principal Components Analysis: Contents at A Glance
17 pages
Logical Modeling of Biological Systems
From Everand
Logical Modeling of Biological Systems
Luis Fariñas del Cerro
No ratings yet
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Statistical Classification: Fundamentals and Applications
From Everand
Statistical Classification: Fundamentals and Applications
Fouad Sabry
No ratings yet
Ikediashietal2012JFM
No ratings yet
Ikediashietal2012JFM
18 pages
7789imguf_CSBSNEP2023SEM-IIIandIVSyllabus
No ratings yet
7789imguf_CSBSNEP2023SEM-IIIandIVSyllabus
41 pages
Chandragouda Marigoudar
No ratings yet
Chandragouda Marigoudar
138 pages
The Posttraumatic Growth Inventory Measuring The Positive Legacy of Trauma
0% (1)
The Posttraumatic Growth Inventory Measuring The Positive Legacy of Trauma
17 pages
IDP FINAL BATCH-16 SEC-G
No ratings yet
IDP FINAL BATCH-16 SEC-G
7 pages
Research Article: Fusion Mode and Style Based On Artificial Intelligence and Clothing Design
No ratings yet
Research Article: Fusion Mode and Style Based On Artificial Intelligence and Clothing Design
16 pages
Module3 Notes
No ratings yet
Module3 Notes
13 pages
2 - Review Article - Introduction To Multivariate Analysis
No ratings yet
2 - Review Article - Introduction To Multivariate Analysis
8 pages
14_OBSTACLES_FOR_THE_SUSTAINABILITY_OF_BUSINESS_START_UPS_THE_CASE_OF_NORTH_WESTERN_PROVINCE_IN_SRI_LANKA
No ratings yet
14_OBSTACLES_FOR_THE_SUSTAINABILITY_OF_BUSINESS_START_UPS_THE_CASE_OF_NORTH_WESTERN_PROVINCE_IN_SRI_LANKA
8 pages
Entry Level Data Scientist Resume Example
No ratings yet
Entry Level Data Scientist Resume Example
1 page
Pages AFM
No ratings yet
Pages AFM
23 pages
Dynamic Factor Models M Watson
No ratings yet
Dynamic Factor Models M Watson
43 pages
Assignment
No ratings yet
Assignment
2 pages
6COM1044 2023 2024 General ML and PCA
No ratings yet
6COM1044 2023 2024 General ML and PCA
44 pages
An Introduction To t-SNE With Python Example by Andre Violante Towards Data Science
No ratings yet
An Introduction To t-SNE With Python Example by Andre Violante Towards Data Science
12 pages
Quantum Risk Analysis PDF
No ratings yet
Quantum Risk Analysis PDF
8 pages
TUGAS STATISTIK EKONOMI-AGNES - SPV (Document1)
No ratings yet
TUGAS STATISTIK EKONOMI-AGNES - SPV (Document1)
26 pages
IRRBB - Evolving Industry Challenges - April 2024
No ratings yet
IRRBB - Evolving Industry Challenges - April 2024
11 pages
Digital Maturityand Marketing Orientation
No ratings yet
Digital Maturityand Marketing Orientation
11 pages
Visual Features in The Perception of Liquids: Highlights Authors
No ratings yet
Visual Features in The Perception of Liquids: Highlights Authors
12 pages
Introduction To The Case Study: Hank Roark
No ratings yet
Introduction To The Case Study: Hank Roark
25 pages
Module - 5 Lecture Notes - 5: Remote Sensing-Digital Image Processing Information Extraction Principal Component Analysis
No ratings yet
Module - 5 Lecture Notes - 5: Remote Sensing-Digital Image Processing Information Extraction Principal Component Analysis
10 pages
Boruta Feature Selection in R - DataCamp
No ratings yet
Boruta Feature Selection in R - DataCamp
18 pages
Tutorial: Expression Analysis Using RNA-Seq
No ratings yet
Tutorial: Expression Analysis Using RNA-Seq
19 pages
American Sign Language Translation Through Sensory Glove
No ratings yet
American Sign Language Translation Through Sensory Glove
14 pages
Dimensionality Reduction
No ratings yet
Dimensionality Reduction
47 pages


Table 6 Summary of the IPCA algorithm.

