Dimensionality Reduction
Ideally, we should not need feature selection or extraction as a separate process; the classifier (or regressor)
should be able to use whichever features are necessary, discarding the irrelevant. However,
there are several reasons why we are interested in reducing dimensionality as a separate preprocessing
step:
• In most learning algorithms, the complexity depends on the number of input dimensions, d, as well as
on the size of the data sample, N, and for reduced memory and computation, we are interested in
reducing the dimensionality of the problem. Decreasing d also decreases the complexity of the
inference algorithm during testing.
• When an input is decided to be unnecessary, we save the cost of extracting it.
• Simpler models are more robust on small datasets. Simpler models have less variance, that is, they
vary less depending on the particulars of a sample, including noise, outliers, and so forth.
• When data can be explained with fewer features, we get a better idea about the process that underlies
the data and this allows knowledge extraction. These fewer features may be interpreted as hidden or
latent factors that in combination generate the observed features.
• When data can be represented in a few dimensions without loss of information, it can be plotted and
analyzed visually for structure and outliers.
There are two main methods for reducing dimensionality: feature selection and
feature extraction.
1. In feature selection, we are interested in finding k of the d dimensions that give us the
most information, and we discard the other (d − k) dimensions. Subset selection, discussed
below, is a feature selection method.
2. In feature extraction, we are interested in finding a new set of k dimensions that
are combinations of the original d dimensions. These methods may be supervised or
unsupervised depending on whether or not they use the output information.
The best known and most widely used feature extraction methods are principal
component analysis and linear discriminant analysis, which are both linear projection
methods, unsupervised and supervised respectively.
Subset Selection
The best subset contains the least number of dimensions that most contribute to
accuracy.
There are two approaches: In forward selection, we start with no variables and add
them one by one, at each step adding the one that decreases the error the most, until
any further addition does not decrease the error (or decreases it only slightly).
In backward selection, we start with all variables and remove them one by one, at
each step removing the one that decreases the error the most (or increases it only
slightly), until any further removal increases the error significantly. In either case,
checking the error should be done on a validation set distinct from the training set
because we want to test the generalization accuracy. With more features, generally we
have lower training error, but not necessarily lower validation error.
Let us denote by F a feature set of input dimensions xi, i = 1, ..., d. E(F) denotes the error incurred on the validation
sample when only the inputs in F are used. Depending on the application, the error is either the mean square error or the
misclassification error.
In sequential forward selection, we start with no features: F = ∅. At each step, for all possible xi, we train our model
on the training set and calculate E(F ∪ xi) on the validation set. Then, we choose the input xj that causes the least
error, j = argmin_i E(F ∪ xi), and we add xj to F if E(F ∪ xj) < E(F); we stop when adding any remaining feature no
longer decreases the error (or decreases it only slightly).
This algorithm is also known as the wrapper approach, where the process of feature selection is thought to "wrap"
around the learner it uses as a subroutine (Kohavi and John 1997).
Subset selection is supervised in that outputs are used by the
regressor or classifier to calculate the error, but it can be used
with any regression or classification method.
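As a concrete illustration, here is a minimal sketch of sequential forward selection as a wrapper around a learner. The estimator (a linear regressor), the mean squared error measure, and the function and argument names are assumptions for illustration; any classifier or regressor and its corresponding error could be plugged in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def forward_selection(X_train, y_train, X_val, y_val, make_model=LinearRegression):
    """Greedy wrapper: grow F while adding a feature still lowers the validation error."""
    d = X_train.shape[1]
    F = []                       # current feature subset, initially empty
    best_err = np.inf            # validation error E(F); taken as +inf for the empty set
    while True:
        candidates = [i for i in range(d) if i not in F]
        if not candidates:
            break
        errs = {}
        for i in candidates:     # evaluate E(F ∪ {x_i}) for every remaining feature
            model = make_model().fit(X_train[:, F + [i]], y_train)
            errs[i] = mean_squared_error(y_val, model.predict(X_val[:, F + [i]]))
        j = min(errs, key=errs.get)          # feature whose addition gives the least error
        if errs[j] >= best_err:              # stop when no addition decreases the error
            break
        F.append(j)
        best_err = errs[j]
    return F, best_err
```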
Principal Component Analysis-
PCA transforms the variables into a new set of variables called principal components.
These principal components are linear combinations of the original variables and are
orthogonal to each other.
The first principal component accounts for as much of the variation in the original
data as possible.
The second principal component captures as much of the remaining variance as possible while being uncorrelated with the first.
There can be only two principal components for a two-dimensional data set.
PCA Algorithm-
STEP 1: STANDARDIZATION
The aim of this step is to standardize the range of the continuous initial variables so that each one of them
contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA is that PCA is quite
sensitive to the variances of the initial variables. That is, if there are large differences between the ranges
of the initial variables, those variables with larger ranges will dominate over those with small ranges (for example, a
variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will
lead to biased results. So, transforming the data to comparable scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value
of each variable: z = (value − mean) / standard deviation.
Once the standardization is done, all the variables will be transformed to the same scale.
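A small NumPy sketch of this standardization step follows; the array name X and the samples-in-rows layout are assumptions for illustration.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract its mean and divide by its standard deviation."""
    mean = X.mean(axis=0)        # per-variable mean
    std = X.std(axis=0)          # per-variable standard deviation
    return (X - mean) / std      # every variable now has mean 0 and variance 1

# Example: a variable on a 0-100 scale and one on a 0-1 scale end up on comparable scales.
X = np.array([[10.0, 0.2], [55.0, 0.9], [90.0, 0.4]])
Z = standardize(X)
```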
STEP 2: COVARIANCE MATRIX COMPUTATION
The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each
other, or in other words, to see if there is any relationship between them. Because sometimes, variables are highly correlated in
such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance
matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances
associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z,
the covariance matrix is a 3 × 3 matrix with rows [Cov(x,x) Cov(x,y) Cov(x,z)], [Cov(y,x) Cov(y,y) Cov(y,z)], and [Cov(z,x) Cov(z,y) Cov(z,z)].
Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)), on the main diagonal (top left to bottom right)
we actually have the variances of each initial variable. And since covariance is commutative (Cov(a,b) = Cov(b,a)), the
entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower
triangular portions are equal.
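A short sketch of this covariance-matrix step, on made-up standardized data (the numbers are illustrative only):

```python
import numpy as np

X = np.array([[2.0, 1.0, 0.5],
              [3.0, 5.0, 1.0],
              [4.0, 3.0, 0.2],
              [5.0, 6.0, 1.5]])              # 4 samples of 3 variables x, y, z
Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized data from Step 1
C = np.cov(Z, rowvar=False)                  # 3 x 3 symmetric covariance matrix
# C[i, i] is the variance of variable i; C[i, j] == C[j, i] is the covariance of variables i and j.
```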
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to
determine the principal components of the data. Before getting to the explanation of these concepts, let's first understand what
we mean by principal components.
Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These
combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the
information within the initial variables is squeezed or compressed into the first components. So, the idea is that 10-dimensional
data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then
the maximum remaining information in the second, and so on; a scree plot of the variance explained by each component shows this rapid drop-off.
How PCA Constructs the Principal Components
As there are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts for
the largest possible variance in the data set, and each succeeding component accounts for as much of
the remaining variance as possible while being uncorrelated with the preceding components.
Without further ado, it is eigenvectors and eigenvalues that are behind all the magic explained
above, because the eigenvectors of the covariance matrix are actually the directions of the axes
where there is the most variance (most information), and these we call the principal components. And
eigenvalues are simply the coefficients attached to the eigenvectors, which give the amount of
variance carried in each principal component.
By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the
principal components in order of significance.
STEP 4: FEATURE VECTOR
As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues
in descending order allows us to find the principal components in order of significance. In this
step, what we do is choose whether to keep all these components or discard those of lesser
significance (of low eigenvalues), and form with the remaining ones a matrix of vectors that we
call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of
the components that we decide to keep. This makes it the first step towards dimensionality
reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data
set will have only p dimensions.
Example:
Continuing with the example from the previous step, we can either form a feature
vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the
one of lesser significance, and form a feature vector with v1 only.
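A sketch of the eigendecomposition and feature-vector choice, using an illustrative 2 × 2 covariance matrix; the variable names and the choice of k are assumptions.

```python
import numpy as np

C = np.array([[2.92, 3.67],
              [3.67, 5.67]])                    # illustrative covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh is suited to symmetric matrices
order = np.argsort(eigenvalues)[::-1]           # rank eigenvalues highest to lowest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()     # share of variance per component
k = 1                                           # keep only the most significant component
feature_vector = eigenvectors[:, :k]            # p x k matrix of kept eigenvectors (v1 only)
```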
STEP 5: RECAST THE DATA ALONG THE PRINCIPAL COMPONENT AXES
In the previous steps, apart from standardization, you do not make any changes to the data; you
just select the principal components and form the feature vector, but the input data set always
remains in terms of the original axes (i.e., in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Component Analysis). This
can be done by multiplying the transpose of the feature vector by the transpose of the
standardized original data set.
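A sketch of this final projection, assuming Z holds the standardized data (one row per sample) and feature_vector the kept eigenvectors as columns; the numbers are illustrative.

```python
import numpy as np

Z = np.array([[-1.2, -1.4],
              [ 0.1,  0.3],
              [ 1.1,  1.1]])              # illustrative standardized data (N x p)
feature_vector = np.array([[0.57],
                           [0.82]])       # illustrative p x k feature vector

# FinalData^T = feature_vector^T @ Z^T, which is the same as FinalData = Z @ feature_vector
final_data = Z @ feature_vector           # N x k data expressed in the principal component axes
```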
PRACTICE PROBLEMS BASED ON PRINCIPAL COMPONENT ANALYSIS-
Problem-01:
Consider the two dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8). Compute the principal component using the PCA algorithm.
Step-01:
Get data.
x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)
Step-02:
Calculate the mean vector (µ).
µ = ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6)
= (4.5, 5)
Thus, the mean vector µ = (4.5, 5).
Step-03:
Subtract the mean vector (µ) from each of the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Step-04:
Calculate the covariance matrix as the average of the outer products mi = (xi – µ)(xi – µ)ᵀ:
Covariance matrix M = (m1 + m2 + m3 + m4 + m5 + m6) / 6
≈ [[2.92, 3.67], [3.67, 5.67]]
Step-05:
Calculate the eigen values and eigen vectors of the covariance matrix.
The characteristic equation is |M – λI| = 0. So, we have
(2.92 – λ)(5.67 – λ) – (3.67)(3.67) = 0
From here,
λ² – 8.59λ + 3.09 = 0
Solving, the eigen values are λ1 ≈ 8.21 and λ2 ≈ 0.38.
Clearly, the second eigen value is very small compared to the first eigen value.
So, the eigen vector corresponding to the greatest eigen value is the principal component for the given data set.
We find the eigen vector from
MX = λX
where-
M = Covariance Matrix
X = Eigen vector
λ = Eigen value
Substituting λ = λ1 ≈ 8.21 gives the equations 2.92X1 + 3.67X2 = 8.21X1 and 3.67X1 + 5.67X2 = 8.21X2.
On simplification, we get 5.29X1 = 3.67X2, so the eigen vector, and hence the principal component of the
data set, is proportional to (3.67, 5.29), i.e. about (0.57, 0.82) after normalization.
https://www.gatevidyalay.com/tag/principal-component-analysis-questions-and-answers/
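The arithmetic above can be checked with a few lines of NumPy (dividing the covariance by N = 6, as in Step-04):

```python
import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
mu = X.mean(axis=0)                        # mean vector (4.5, 5.0)
D = X - mu                                 # mean-subtracted patterns
M = (D.T @ D) / len(X)                     # covariance matrix, approx [[2.92, 3.67], [3.67, 5.67]]
eigenvalues, eigenvectors = np.linalg.eigh(M)
# eigenvalues come out approx [0.38, 8.21]; the eigenvector paired with 8.21,
# approx (0.57, 0.82) up to sign, is the principal component of the data set.
print(M, eigenvalues, eigenvectors[:, np.argmax(eigenvalues)])
```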
Factor Analysis
Factor analysis is a correlational method used to find and describe the underlying factors driving data values for a large set of
variables. For example, instead of using 5 observed variables, the data can be reduced to 2 factors such as Quantitative ability
and Verbal ability. In a simple path diagram for a factor analysis model, F1 and F2 are the two common factors.
Factor analysis is a statistical data reduction and analysis technique that strives to explain correlations among multiple outcomes as the
result of one or more underlying explanations, or factors. The technique involves data reduction, as it attempts to represent a set of
variables by a smaller number.
There are two types of factor analyses, exploratory and confirmatory. Exploratory factor analysis (EFA) is a method to explore the
underlying structure of a set of observed variables, and is a crucial step in the scale development process. Confirmatory factor
analysis (CFA) is used to verify whether a hypothesized factor structure fits the observed data.
In SPSS, first go to Analyze – Dimension Reduction – Factor. Move all the observed variables over to the Variables: box to be analyzed. Under
Extraction – Method, pick Principal components and make sure to analyze the Correlation matrix. We also request the Unrotated
factor solution and the Scree plot.
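The steps above are for SPSS; a rough Python counterpart using scikit-learn's FactorAnalysis is sketched below. The random data stand in for 100 observations of 5 test variables, and the choice of 2 factors mirrors the Quantitative/Verbal example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # stand-in for 100 observations of 5 variables

fa = FactorAnalysis(n_components=2)     # extract 2 common factors
scores = fa.fit_transform(X)            # factor scores for each observation
loadings = fa.components_.T             # 5 x 2 matrix of factor loadings
```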
Multidimensional scaling (MDS)
Given only the matrix of pairwise distances dij between N points, and not the points themselves,
multidimensional scaling (MDS) is the method for placing these points in a low-dimensional space,
for example two-dimensional, such that the Euclidean distance between them there is as close as
possible to dij, the given distances in the original space.
Thus it requires a projection from some unknown dimensional space to, for
example, two dimensions.
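A small sketch of MDS with scikit-learn: given only the pairwise distance matrix dij, it places the points in two dimensions so that the Euclidean distances approximate dij. The high-dimensional points used to build the distance matrix are an assumption for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

points = np.random.default_rng(0).normal(size=(20, 5))  # hidden 5-dimensional points
D = squareform(pdist(points))                           # N x N matrix of distances d_ij

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
embedding = mds.fit_transform(D)                        # 20 x 2 coordinates approximating d_ij
```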
Linear Discriminant Analysis (LDA)
Given samples from two classes C1 and C2, we want to find the direction, as defined by a vector w,
such that when the data are projected onto w, the examples from the two classes are as well
separated as possible. As we saw before, z = wᵀx is the projection of x onto w and thus is a
dimensionality reduction from d to 1.
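A sketch of this two-class projection (Fisher's linear discriminant): the classical solution takes w proportional to S_W⁻¹(m1 − m2), where S_W is the within-class scatter matrix. The two Gaussian classes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))   # samples from class C1
X2 = rng.normal(loc=[3.0, 2.0], size=(50, 2))   # samples from class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)       # class means
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter matrix
w = np.linalg.solve(S_W, m1 - m2)               # direction maximizing class separation

z1, z2 = X1 @ w, X2 @ w                         # projections z = w^T x: from d = 2 down to 1
```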