
Module - IV: High-Dimensional Data Analysis


Learning Objectives
At the end of this module, you will be able to:
Ɣ Infer the concept of mathematical distance measures and their importance in
quantifying the similarity or dissimilarity between data points in various applications
Ɣ Explain different dimension reduction techniques and their role in reducing the
number of variables in high-dimensional datasets while preserving essential
information
Ɣ Summarise knowledge of factor analysis and its application in identifying underlying
latent variables that explain the observed variability in data
Ɣ Learn about batch effects and their impact on data analysis and discover strategies
for effectively dealing with batch effects to ensure accurate and reliable results
Ɣ Demonstrate proficiency in clustering and heatmap visualisation techniques for grouping
similar data points and uncovering patterns and structures within complex datasets

Introduction
High-Dimensional Data
A dataset is said to be high dimensional if it contains more characteristics (p) than
observations (N), which is frequently expressed as p > N.
For instance, a dataset with p = 6 features and only N = 3 observations would be
seen as having high dimensions because there are more features than observations.

People frequently assume that “high dimensional data” merely refers to a dataset
with a lot of features, but that is not the case. Even if a dataset includes 10,000 features,
it is not high dimensional if it also contains 100,000 observations.

High Dimensional Data: Why Is It a Problem?


A deterministic (unique) solution can never be achieved when the number of features in a
dataset exceeds the number of observations.
In other words, in the absence of enough data to train the model on, it becomes
hard to create a model that can capture the link between the predictor variables and the
response variable.

High-Dimensional Data Analysis


An N-by-p data matrix is typically referred to as p-dimensional, because it can be
viewed as N points sitting in a p-dimensional space.

1. Categorization


In classification, one of the p variables is used as a descriptor of class membership.
This process involves predicting the class of an observation based on its attributes or
features. Here are some examples illustrating the use of categorization in different contexts:

Consumer Financial Data:


™ The dataset consists of multiple variables measuring consumer payment history.
™ One specific variable indicates whether the consumer has declared bankruptcy.
™ The analyst aims to predict bankruptcy based on the consumer’s credit history.

Hyperspectral Image Database:


™ The database contains several variables representing spectral bands.
™ An additional variable provides an indicator of the chemical composition of the
ground truth.
™ The analyst intends to use the spectral band information to predict the chemical
composition of the ground truth.
In order to classify data, a variety of methods have been proposed, including
k-nearest neighbour classification and the identification of hyperplanes that divide the
sample space into non-overlapping groups.

2. Regression
One of the p variables in a regression setup is a quantitative response variable; the
other variables are used to predict it. Examples include the fluctuation of currency
rates today given recent exchange prices in a financial data source and an indicator of
chemical composition in a hyperspectral database. Regression modelling is supported by
a well-known and popular set of tools.

3. Analysis of Latent Variables


In the context of latent variable modelling, the equation X = AS is used, where X
represents the observable vector, S represents the vector of latent variables that are not
directly observed and A is a linear transformation that links the two. The main idea behind
this modelling approach is to identify a small number of underlying latent factors that
primarily contribute to the structure of the X data and valuable insights can be gained by
uncovering these latent variables.
Principal Component Analysis (PCA) is a pioneering technique in this field, involving
the derivation of orthogonal eigenvectors from the covariance matrix C of the observable
X. These eigenvectors are arranged as the columns of an orthogonal matrix U and the
latent variables S are defined as S = U^T X. Setting A to U recovers the latent variable
form X = AS.
Extensive applications of this approach are found in data analysis for various fields,
such as sciences, engineering and business. One practical application involves obtaining
the best rank-k mean square approximation of vector X by projecting it onto the space
spanned by the first k eigenvectors of C.

4. Clustering


Cluster analysis, which combines elements of art and science, could be regarded as
a distinct topic in its own right. The aim is to arrange an unorganised collection of objects
so that objects placed close together are similar. There is no one perfect way to achieve
this, because there are numerous approaches that each serve a different goal. Latent
semantic indexing is an obvious application area, where one might look for a document
arrangement that makes nearby documents similar and a term arrangement that makes
nearby phrases similar.

4.1 Mathematical Distance Measures


Distance measures are an essential component of the field of machine learning.
Algorithms such as k-nearest neighbours for supervised learning and k-means clustering
for unsupervised learning rely on them as fundamental building blocks.
The selection and utilisation of appropriate distance measures is contingent upon
the nature of the data. It is crucial to possess the knowledge and skills required for the
implementation and computation of various commonly used distance measures, along
with a clear understanding of the underlying principles behind the resulting scores.

4.1.1 Introduction to Distance Measures


Distance measures are essential tools used to quantify the dissimilarity or similarity
between objects in a dataset. They play a crucial role in various fields, including data
mining, machine learning and pattern recognition.
A distance measure provides an unbiased evaluation of the relative disparity
between two objects within a specific problem domain. These objects can be rows of data
describing subjects like individuals, cars, or houses, or they can represent events like
purchases, claims, or diagnoses.
In machine learning, distance measures are commonly encountered in algorithms
like the k-nearest neighbours (KNN) algorithm. KNN makes predictions for new examples
by calculating the distance between the new example and all examples in the training
dataset. The KNN algorithm selects the k examples with the smallest distance and uses
them to make predictions. The outcome is determined by either finding the mode of the
class label or calculating the mean of the real value for regression.
Other instance-based learning algorithms, such as learning vector quantization
(LVQ) and self-organising map (SOM), also utilise distance measures. These algorithms
store training examples as they are, without modifications and use a distance function to
find the most similar training example to a given test instance.
In unsupervised learning, the K-means clustering algorithm is another example
where distance measures are fundamental. It clusters data points based on their
distances to centroids.
Overall, distance measures are a fundamental concept in various machine learning
algorithms and are essential for making accurate predictions and classifications based on
similarities and dissimilarities between data points.
The following is a concise compilation of several widely used machine learning
algorithms that rely on distance measures as their fundamental components:
™ K-Nearest Neighbors
™ Learning Vector Quantization (LVQ)
™ Self-Organizing Map (SOM)


™ K-Means Clustering
Kernel-based methods, including the support vector machine (SVM) algorithm, are
another category of distance-based algorithms widely used in machine learning.
When calculating the distance between two instances with multiple data types in
their columns, it’s essential to consider using different distance measures for each data
type. For example, real values, boolean values, categorical values and ordinal values
may require distinct distance measures, which are then combined into a unified distance
score.
Numeric values often have varying scales, which can significantly impact distance
measures. To address this, it is recommended to normalise or standardise numeric
values before calculating distances.
In regression problems, numerical error can be treated as a distance measure.
The discrepancy between expected and predicted values can be quantified as a one-
dimensional distance. This error can be calculated for each example in a test set and a
total distance is obtained, representing the overall discrepancy between expected and
predicted outcomes in the dataset. This process is akin to conventional distance metrics
and is used to compute error measures like mean squared error or mean absolute error.

4.1.2 Euclidean Distance


Euclidean distance is the standard metric of classical geometry. It is simply the
ordinary straight-line separation between two points and is the measure most commonly
used in cluster analysis; k-means is one of the algorithms that relies on it. Mathematically,
it is the square root of the sum of the squared coordinate differences between two
objects: for points P(x1, y1) and Q(x2, y2),
Euclidean distance between P and Q = sqrt((x1 – x2)^2 + (y1 – y2)^2)

Figure – Euclidean Distance


Image Source: https://www.geeksforgeeks.org/measures-of-distance-in-data-mining/

4.1.3 Manhattan Distance


The absolute difference between the pair of coordinates is determined.
To determine the distance between two points, P and Q, calculate the perpendicular
distance of each point from the X-Axis and Y-Axis.
Consider a two-dimensional plane where point P is located at coordinates (x1, y1)
and point Q is located at coordinates (x2, y2).


The Manhattan distance between points P and Q is calculated by taking the
absolute difference of their x-coordinates and adding it to the absolute difference of their
y-coordinates:
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|

Image Source: https://www.geeksforgeeks.org/measures-of-distance-in-data-mining/

Here, the total length of the red line gives the Manhattan distance between the two
points.
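
As a quick illustration (a minimal sketch using base R's dist() function and two arbitrarily chosen points), both measures can be computed as follows:

# Two points in a two-dimensional plane
P <- c(1, 2)
Q <- c(4, 6)

# Euclidean distance: sqrt((1 - 4)^2 + (2 - 6)^2) = 5
dist(rbind(P, Q), method = "euclidean")

# Manhattan distance: |1 - 4| + |2 - 6| = 7
dist(rbind(P, Q), method = "manhattan")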

4.1.4 Cosine Similarity


The cosine similarity metric is commonly employed in the field of Natural Language
Processing (NLP). Cosine similarity is a mathematical measure that quantifies the cosine
of the angle between two vectors; it is used to assess whether two vectors are oriented
in nearly the same direction. In NLP, the vectors are typically term-frequency vectors: a
document may contain thousands of words, and a vector records the word frequencies
within a given document. In a scenario with five documents, each document is represented
by one such vector, so there are five vectors in total, one per document. Please find below
an illustrative example:

Let’s find the cosine similarity between two documents d1 and d2. Based on the word
counts used below, their term-frequency vectors are d1 = (5, 2, 1, 0, 1, 3, 0) and
d2 = (4, 0, 0, 2, 2, 2, 1).

Here is the formula:
cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)

Here, d1.d2 means the dot product of the two vectors d1 and d2.
d1.d2 = 5*4 + 2*0 + 1*0 + 0*2 + 1*2 + 3*2 + 0*1 = 28
||d1|| = (5*5 + 2*2 + 1*1 + 0*0 + 1*1 + 3*3 + 0*0)**0.5 = 6.32
||d2|| = (4*4 + 0*0 + 0*0 + 2*2 + 2*2 + 2*2 + 1*1)**0.5 = 5.39
cos(d1, d2) = 28 / (6.32*5.39) = 0.82


So, this is how cosine similarity is generated.
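
The same calculation can be reproduced in R; this is a minimal sketch using the term-frequency vectors assumed in the example above:

d1 <- c(5, 2, 1, 0, 1, 3, 0)
d2 <- c(4, 0, 0, 2, 2, 2, 1)

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim <- sum(d1 * d2) / (sqrt(sum(d1^2)) * sqrt(sum(d2^2)))
cos_sim   # approximately 0.82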


4.1.5 Mahalanobis Distance
In multivariate analysis, variables are typically represented in a Euclidean space
using a coordinate system consisting of two axes, commonly referred to as the x-axis
and y-axis. When dealing with a large number of variables, it becomes challenging to
effectively represent and measure them using planar coordinates. The Mahalanobis
distance (MD) is a key concept in this context. The reference point in this context is the
mean (also known as the centroid) of the multivariate data.
The Mahalanobis distance (MD) is a metric used to quantify the distance of an
observation from the centroid while taking the correlation structure of the data into
account. Hence, as the distance of an observation from the centroid increases, the
magnitude of its MD also increases. Below is the formal definition.
The Mahalanobis distance is a measure of the distance between an observation
x = (x1, x2, x3, ..., xN)^T and a set of observations with mean μ = (μ1, μ2, μ3, ..., μN)^T and
covariance matrix S:
MD(x) = sqrt((x – μ)^T S^(-1) (x – μ))
The covariance matrix is a mathematical tool used to quantify the covariance
between variables. It is employed to assess the combined impact of two or more
variables.
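
A minimal sketch in R using the built-in mahalanobis() function, which returns the squared distance (so the square root is taken), applied to the numeric columns of the iris data set purely as an example input:

x <- iris[, 1:4]                            # four numeric measurements
md2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
md <- sqrt(md2)                             # Mahalanobis distance from the centroid
head(md)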

4.2 Dimension Reduction Techniques


Dimension reduction techniques aim to decrease the number of features or
dimensions in a dataset while preserving essential information. This reduction is often
performed to simplify models, improve learning algorithm performance, or aid data
visualisation. Techniques like PCA, SVD and LDA are commonly used to project data
onto a lower-dimensional space while retaining crucial information. These methods play
a crucial role in predictive modelling, where predictors or features help anticipate future
outcomes.

4.2.1 Overview of Dimension Reduction


What exactly is Dimensionality Reduction?
Dimensionality refers to the count of input features, variables, or columns contained
within a specific dataset. The act of reducing these features is commonly referred to as
dimensionality reduction.
The dataset comprises a vast array of input features across multiple scenarios,
thereby increasing the complexity of the predictive modelling task. In situations where the
training dataset contains a large number of features, it becomes challenging to visualise
or make predictions. To address this issue, the application of dimensionality reduction
techniques becomes necessary.
Dimensionality reduction is a technique used to convert a dataset with higher
dimensions into a dataset with fewer dimensions while preserving similar information.
These techniques are commonly employed in the field of machine learning to achieve
an improved fit for predictive models when addressing classification and regression
problems.


The utilisation of this technique is prevalent in various domains that involve the
analysis of data with multiple dimensions, including but not limited to speech recognition,
signal processing and bioinformatics. Additionally, it has the capability to be utilised for
various purposes such as data visualisation, noise reduction, cluster analysis and more.

Image Source: https://www.javatpoint.com/dimensionality-reduction-technique

Challenges of Dimensionality (High-Dimensional Data)


High-dimensional data presents significant challenges in practical applications,
commonly known as the curse of dimensionality. As the dimensionality of the input
dataset increases, the complexity of machine learning algorithms and models also grows.
This can lead to overfitting, where the model memorises noise in the data rather than
capturing the underlying patterns.
Overfitting occurs due to the increased number of features in high-dimensional
data, which requires a corresponding increase in the number of samples to adequately
represent the data. Unfortunately, obtaining a sufficient number of samples can be
challenging in many real-world scenarios.
To address the curse of dimensionality and mitigate overfitting, it becomes
essential to reduce the quantity of features. Dimensionality reduction techniques offer a
viable solution to this problem by projecting data into a lower-dimensional space while
preserving crucial information, leading to improved model performance and better
generalisation.

Advantages of dimensionality reduction:


The application of dimensionality reduction techniques to the provided dataset offers
several benefits, which are outlined below:
™ The reduction of feature dimensions results in a corresponding decrease in the
amount of storage space needed for the dataset.


™ Reducing the dimensions of features results in a decrease in computation and
training time.
™ The reduction of feature dimensions in the dataset facilitates the rapid
visualisation of data.
™ The removal of redundant features, if they exist, is accomplished by addressing
multicollinearity.

Drawbacks of dimensionality reduction:


There are several drawbacks associated with the application of dimensionality
reduction, which are outlined below:
™ Data loss can occur as a result of dimensionality reduction.
™ In the context of the PCA dimensionality reduction technique, it is possible for
the required principal components to be unknown.

Approaches of dimension reduction


There are two methods for implementing the dimension reduction technique, as
outlined below:
™ Feature selection
™ Feature extraction

1. Feature selection
Feature selection is an important step in building accurate models. It involves the
selection of a subset of relevant features while excluding irrelevant ones from a dataset.
This process helps in improving the accuracy of the model. In essence, feature selection
is a methodology employed to identify and choose the most advantageous features from
a given input dataset.
There are three methods that are commonly employed for feature selection:
Ɣ Filter Methods
The dataset undergoes a filtering process to extract a subset that comprises only
the pertinent features. Filter methods employ several commonly used techniques,
including correlation, the chi-square test, ANOVA and information gain.
Ɣ Wrapper Methods
The wrapper method shares a common objective with the filter method, but it
employs a machine learning model for its evaluation. This method involves inputting
certain features into the machine learning model and assessing its performance. The
decision to include or exclude these features is contingent upon their impact on the
model’s accuracy, as determined by its performance. The aforementioned method is
characterised by a higher level of accuracy compared to the filtering method, albeit with
increased complexity in its implementation. There are several commonly used techniques
for implementing wrapper methods.
™ The forward selection method is a technique used in statistical modelling to
select the most relevant variables for inclusion in a predictive model.
™ The technique known as backward selection is a method used in statistical
modelling and machine learning to select the most relevant features or
variables for a given model.


™ Bi-directional elimination (stepwise selection) combines forward selection and
backward elimination, adding and removing variables at each step until the best
subset of features is found.
Ɣ Embedded methods
Embedded methods perform feature selection as part of the model-training process
itself: they examine various training iterations of a machine learning model and assess
the significance of individual features. Embedded methods encompass a range of
commonly employed techniques that are widely utilised in various applications and
scenarios.
LASSO, Elastic Net and Ridge Regression are examples of commonly used
techniques in statistical modelling and machine learning. These methods are employed
for various purposes, such as feature selection, regularisation and improving model
performance.
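
As an illustrative sketch of an embedded method, the snippet below fits a cross-validated LASSO model with the glmnet package (assumed to be installed); predictors whose coefficients are shrunk to zero are effectively removed from the model:

library(glmnet)                        # assumes the glmnet package is installed

x <- as.matrix(mtcars[, -1])           # predictors from the built-in mtcars data
y <- mtcars$mpg                        # response variable

cv_fit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the LASSO penalty
coef(cv_fit, s = "lambda.min")         # zero coefficients mark dropped features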

2. Feature extraction
Feature extraction refers to the procedure of converting a high-dimensional space
into a lower-dimensional space. This approach is beneficial for optimising resource usage
during information processing while retaining all necessary information.
There are several commonly used techniques for feature extraction, including:
™ Principal Component Analysis (PCA) is a statistical technique used to reduce
the dimensionality of a dataset while retaining as much information as possible.
™ Linear Discriminant Analysis (LDA) is a statistical technique used for
dimensionality reduction and classification tasks.
™ Kernel Principal Component Analysis (Kernel PCA) is a dimensionality
reduction technique that is commonly used in machine learning and data
analysis. It is an extension of Principal Component Analysis (PCA) that allows
for non-linear transformations of the input data.
™ Quadratic Discriminant Analysis

4.2.2 Singular Value Decomposition (SVD)


The Singular Value Decomposition (SVD) is a mathematical procedure that
factorises a given matrix into three separate matrices. It has intriguing algebraic
properties and provides significant geometric and theoretical insight into linear
transformations. Additionally, it has important applications within the field of data science.
The SVD of an m×n matrix A is given by the formula A = UΣV^T
where:
U: m×m matrix of the orthonormal eigenvectors of AA^T.
V^T: transpose of an n×n matrix containing the orthonormal eigenvectors of A^TA.
Σ: diagonal matrix with r elements equal to the square roots of the positive eigenvalues of
AA^T or A^TA (both matrices have the same positive eigenvalues anyway).
Following is a numerical example to illustrate how Singular Value Decomposition
(SVD) works. In this example, we’ll use a small matrix and go through the steps of
decomposing it using SVD.

Step 1: Create a Matrix


Let’s start with a 3x3 matrix, A:

A = | 2  0  1 |
    | 1  1  0 |
    | 0  2 -1 |

Step 2: Calculate A^T * A and A * A^T

Next, we calculate the product of the transpose of A with A, and the product of A with its
transpose:

A^T * A = | 2  1  0 |   | 2  0  1 |   |  5  1  2 |
          | 0  1  2 | * | 1  1  0 | = |  1  5 -2 |
          | 1  0 -1 |   | 0  2 -1 |   |  2 -2  2 |

A * A^T = | 2  0  1 |   | 2  1  0 |   |  5  2 -1 |
          | 1  1  0 | * | 0  1  2 | = |  2  2  2 |
          | 0  2 -1 |   | 1  0 -1 |   | -1  2  5 |

Step 3: Calculate Eigenvalues and Eigenvectors

Now, we find the eigenvalues and eigenvectors of both A^T * A and A * A^T.
For A^T * A:
Eigenvalues (λ) are the roots of the characteristic equation det(A^T * A – λI) = 0:

det(A^T * A – λI) = | 5-λ   1     2  |
                    |  1   5-λ   -2  | = -λ^3 + 12λ^2 - 36λ = -λ(λ - 6)^2 = 0
                    |  2   -2   2-λ  |

Solving this equation gives the eigenvalues λ1 = 6, λ2 = 6 and λ3 = 0.
For each eigenvalue, we can find the corresponding eigenvector.

Step 4: Calculate Singular Values

The singular values (σ) are the square roots of the eigenvalues of both A^T * A and A * A^T:
σ1 = sqrt(λ1) = sqrt(6) ≈ 2.449
σ2 = sqrt(λ2) = sqrt(6) ≈ 2.449
σ3 = sqrt(λ3) = 0

Step 5: Calculate the Left and Right Singular Vectors


The left singular vectors (U) are the normalised eigenvectors of A * A^T, and the right
singular vectors (V) are the normalised eigenvectors of A^T * A.

Step 6: Assemble the SVD

Now that we have σ (the singular values), U (the left singular vectors) and V (the
right singular vectors), we can assemble the SVD of the original matrix A:

A = U Σ V^T

Where:
U is a 3x3 orthogonal matrix.
Σ is a 3x3 diagonal matrix with the singular values σ1, σ2 and σ3 on the diagonal.
V^T is the transpose of the 3x3 orthogonal matrix V.
This is the Singular Value Decomposition of the original matrix A.
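
For reference, the decomposition can be checked numerically with R's built-in svd() function (a minimal sketch; for this matrix the singular values come out as roughly 2.449, 2.449 and 0):

A <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 2, -1), nrow = 3, byrow = TRUE)

s <- svd(A)
s$d                              # singular values
s$u %*% diag(s$d) %*% t(s$v)     # reconstructs A up to rounding error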

Components of SVD: u and v


The singular value decomposition (SVD) can be computed in the R programming
language by using the svd() function. In this process, the original data matrix is first
scaled (centred and standardised column by column) and the singular value
decomposition is then applied to the scaled matrix.
The singular value decomposition (SVD) of the scaled and ordered data matrix,
`dataMatrixOrdered`, is computed and stored in the variable `svd1`.
> svd1 <- svd(scale(dataMatrixOrdered))
The function svd() returns a list that consists of three components: u, d and v. The
components u and v represent the matrices of left and right singular vectors, respectively.
The component d is a vector of singular values, which corresponds to the diagonal of the
matrix D mentioned earlier.
In the following analysis, a graphical representation of the initial left and right singular
vectors, in addition to the original dataset is presented.
> par(mfrow = c(1, 3))
> image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1], main = "Original Data")
> plot(svd1$u[, 1], 40:1, ylab = "Row", xlab = "First left singular vector", pch = 19)
> plot(svd1$v[, 1], xlab = "Column", ylab = "First right singular vector", pch = 19)

Fig:Components of SVD
Image Source: https://bookdown.org/rdpeng/exdata/dimension-reduction.html

The mean shift in both the rows and columns of the matrix can be observed by
examining the first left and right singular vectors.

Using SVD to compress data


If it is assumed that the first left and right singular vectors, denoted as u1 and v1,
effectively represent all the variability in the data, it is possible to approximate the original
data matrix using these vectors:
X ≈ u1 v1^T
The original matrix contains 400 numbers, while the compressed representation contains
only about 50 numbers (the entries of u1 and v1). This represents a reduction of nearly
90% in information. The following is a representation of the original data and its
corresponding approximation.
> ## Approximate original data with outer product of first singular vectors
> approx <- with(svd1, outer(u[, 1], v[, 1]))
>
> ## Plot original data and approximated data
> par(mfrow = c(1, 2))
> image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1], main = "Original Matrix")
> image(t(approx)[, nrow(approx):1], main = "Approximated Matrix")

Figure 5.8: Approximating a matrix


Image Source: https://bookdown.org/rdpeng/exdata/dimension-reduction.html

It is evident that the two matrices are not identical; however, the approximation
appears to be reasonable in this particular scenario. This outcome is to be expected,
considering that the original data only contained a single significant feature.

4.2.3 Principal Component Analysis (PCA)


Principal Component Analysis (PCA) is an unsupervised learning algorithm widely
used for dimensionality reduction in machine learning. It is a statistical technique that
transforms a set of correlated features into a new set of linearly uncorrelated features
through an orthogonal transformation process. These new features are known as
Principal Components, which capture the most important patterns in the data.
PCA is a powerful tool for exploratory data analysis and predictive modelling. By
reducing the dimensionality of the data, it helps in visualising and understanding complex


datasets. The technique works by identifying a lower-dimensional subspace onto which
high-dimensional data can be projected.
The key idea behind PCA is to evaluate the variance of each attribute, as attributes
with high variance often contain significant information for distinguishing between classes
or patterns. By retaining the most informative variables and discarding less important
ones, PCA enables efficient data representation and processing.
Due to its versatility, PCA finds applications in various domains such as image
processing, recommendation systems and communication channel optimization, where
it enhances performance by extracting relevant features and reducing computational
complexity.
The Principal Component Analysis (PCA) algorithm relies on several mathematical
concepts, including:
™ Variance and Covariance
™ Eigenvalues and Eigenvectors
In the field of statistics and probability theory, variance and covariance are two
important concepts that are used to measure the relationship and dispersion of random
variables.
The concept of eigenvalues and eigenvectors is a fundamental topic in linear
algebra. Eigenvalues are the scalar values associated with a given linear transformation,
while eigenvectors are the corresponding non-zero vectors whose direction the
transformation leaves unchanged.
The Principal Component Analysis (PCA) algorithm employs several commonly used
terms.
Ɣ Dimensionality refers to the quantity of features or variables that are present within
a given dataset, i.e., the number of columns the dataset contains.
Ɣ Correlation is a statistical measure that indicates the strength of the relationship
between two variables. When one variable is modified, the other variable undergoes
a corresponding change. The correlation value is a numerical measure that ranges
from -1 to +1. In this context, a value of -1 is assigned when variables exhibit an
inverse relationship, while a value of +1 signifies a direct relationship between the
variables.
Ɣ The term “orthogonal” is used to indicate that variables are not correlated with each
other, resulting in a correlation coefficient of zero between the pair of variables.
Ɣ Eigenvectors are defined as non-zero vectors v that satisfy the equation Mv = λv,
where M is a square matrix and λ is a scalar. The Covariance Matrix refers to a
matrix that contains the covariance values between pairs of variables.

Example:
In this example, we’ll perform PCA on a small dataset with two features and reduce it
to one principal component.

Dataset:
Suppose we have a dataset with two features, “X1” and “X2,” representing data
points in a two-dimensional space:


X1 = [1, 2, 3, 4, 5]
X2 = [2, 3, 3, 4, 5]
We want to reduce the dimensionality of this dataset using PCA while preserving
most of the variance.
Steps of PCA:
1. Mean Calculation:
Calculate the mean of both X1 and X2:
™ Mean(X1) = (1 + 2 + 3 + 4 + 5) / 5 = 3
™ Mean(X2) = (2 + 3 + 3 + 4 + 5) / 5 = 3.4
2. Centering the Dataset:
Subtract the mean from each data point:
™ X1_centered = [1 - 3, 2 - 3, 3 - 3, 4 - 3, 5 - 3] = [-2, -1, 0, 1, 2]
™ X2_centered = [2 - 3.4, 3 - 3.4, 3 - 3.4, 4 - 3.4, 5 - 3.4] = [-1.4, -0.4, -0.4, 0.6, 1.6]
3. Covariance Matrix:
Calculate the covariance matrix of the centred dataset:
™ Covariance(X1, X2) = Σ (X1_centered * X2_centered) / (n - 1)
™ Covariance(X1, X2) = ((-2 * -1.4) + (-1 * -0.4) + (0 * -0.4) + (1 * 0.6) + (2 * 1.6)) /
(5 - 1)
™ Covariance(X1, X2) ≈ 1.75
™ Similarly, Var(X1) = 2.5 and Var(X2) = 1.3, so the covariance matrix is
[[2.5, 1.75], [1.75, 1.3]].
4. Eigenvalues and Eigenvectors:
™ Compute the eigenvalues (λ) and eigenvectors (v) of the covariance matrix.
™ Eigenvalues: λ1 ≈ 3.75 and λ2 ≈ 0.05; the eigenvector for λ1 is v1 ≈ [0.81, 0.58].
5. Explained Variance Ratio:
The explained variance ratio for the first principal component:
™ Explained Variance Ratio ≈ λ1 / (λ1 + λ2) = 3.75 / 3.8 ≈ 0.99
™ The first principal component explains about 99% of the variance in the data.
6. PCA Transformation:
Transform the original data into the new coordinate system defined by the principal
component, i.e. project each centred data point onto v1:

Transformed Dataset:
Ɣ PC1 ≈ [-2.44, -1.05, -0.23, 1.16, 2.56]
Ɣ These are the coordinates of the data points in the PC1 direction.
PCA has reduced the dimensionality of the dataset from two features (X1 and X2)
to one principal component (PC1) while preserving about 99% of the variance in the
data, as indicated by the explained variance ratio. This simplifies the dataset while
retaining essential information.
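
These numbers can be verified with R's built-in prcomp() function (a minimal sketch; the signs of the scores may be flipped, which is normal for PCA):

X1 <- c(1, 2, 3, 4, 5)
X2 <- c(2, 3, 3, 4, 5)

pca <- prcomp(cbind(X1, X2), center = TRUE, scale. = FALSE)
summary(pca)   # PC1 explains about 99% of the variance
pca$x[, 1]     # PC1 scores: approx -2.44, -1.05, -0.23, 1.16, 2.56 (possibly negated)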


PCA Algorithm Steps:


1. Dataset Splitting:
™ Split the input dataset into two halves: X (training set) and Y (validation set).
2. Data Representation:
™ Create a structured representation of the dataset, typically using a two-
dimensional matrix.
™ Each row represents a data item and each column represents a feature.
™ The dimensions of the dataset are determined by the number of columns.
3. Data Standardization:
™ Standardise the dataset to ensure that features with higher variation are
considered more significant.
™ Divide each data point in a column by the column’s standard deviation.
™ The standardised matrix is denoted as Z.
4. Covariance Matrix:
™ Calculate the covariance matrix of Z by transposing Z and then multiplying it by
Z.
5. Eigenvalues and Eigenvectors:
™ Compute the eigenvalues and eigenvectors of the covariance matrix.
™ Eigenvalues represent the coefficients of the eigenvectors, which represent the
directions of the high information axis.
6. Sorting Eigenvalues and Eigenvectors:
™ Sort the eigenvalues in decreasing order.
™ Simultaneously sort the corresponding eigenvectors accordingly.
™ The resulting matrix is denoted as P*.
7. Computing New Features (Principal Components):
™ Obtain the new features by multiplying the transposed P* matrix by Z.
™ Each observation in the resulting matrix Z* is a linear combination of the original
features.
™ The columns of Z* are independent of each other.
8. Feature Selection:
™ Decide which features to keep and which to eliminate in the new dataset (Z*).
™ Retain only relevant or significant features, excluding irrelevant information from
the new dataset.
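
A minimal sketch of steps 3 to 8 in R, using eigen() on the covariance matrix of the standardised data (the iris measurements are used purely as an example input):

X <- as.matrix(iris[, 1:4])      # any numeric data matrix

Z <- scale(X)                    # step 3: standardise the data
C <- cov(Z)                      # step 4: covariance matrix of Z
e <- eigen(C)                    # step 5: eigenvalues and eigenvectors
P_star <- e$vectors              # step 6: eigen() already sorts by decreasing eigenvalue
Z_star <- Z %*% P_star           # step 7: new features (principal components)
Z_reduced <- Z_star[, 1:2]       # step 8: keep only the leading components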

Principal Component Analysis Applications


Applications of Principal Component Analysis (PCA):
1. Dimensionality Reduction: PCA is widely used as a dimensionality reduction technique
in various AI applications.
It helps to reduce the number of features or dimensions in high-dimensional datasets.
By retaining the most important information, PCA simplifies and speeds up data
processing.


2. Image Compression and Computer Vision: In computer vision tasks such as image
compression, PCA is employed to reduce the storage and processing requirements of
images.
It extracts the most relevant features of an image, enabling efficient representation
and reconstruction.
3. Uncovering Hidden Patterns:
PCA can be utilised to discover hidden patterns or relationships within complex
datasets.
By transforming the original data into a new set of uncorrelated variables (principal
components), PCA reveals the underlying structures.
4. Applications in Finance and Data Mining:
In finance, PCA is employed for risk management and portfolio optimization.
It helps to identify the key factors that drive financial data and improve decision-making
processes.
In data mining, PCA aids in feature selection, clustering and anomaly detection.
5. Applications in Psychology and Social Sciences:
In psychology and social sciences, PCA is applied to analyse and interpret behavioural
data.
It helps in identifying latent constructs or factors that influence human behaviour.
6. Signal Processing and Audio Analysis:
In signal processing and audio analysis, PCA is used to reduce noise and extract
meaningful information from signals or audio recordings.
7. Machine Learning Model Initialization:
PCA can be used as a pre-processing step to initialise machine learning models with
a reduced set of informative features.
It can enhance the performance and convergence speed of learning algorithms.
8. Quality Control and Fault Detection:
In industrial applications, PCA aids in quality control and fault detection by identifying
patterns of normal behaviour and deviations from it.

4.2.4 Multiple Dimensional Scaling (MDS)


Multidimensional Scaling (MDS) is a method used to reduce the dimensionality of
data while preserving the pairwise distances between observations. Its primary objective
is to find a low-dimensional representation of a high-dimensional dataset without losing
the meaningful relationships between data points. To achieve this, MDS typically
minimises a stress function that measures the difference between the original pairwise
distances in the data and the distances in the reduced representation.
MDS finds application in various disciplines such as psychology, sociology and
marketing research. It is commonly employed to create visual representations of complex
datasets, providing valuable insights into the underlying structure of the data. In R,
MDS analysis can be conducted to analyse the data, interpret the results and create
informative visualisations that aid in understanding the relationships between data points.


Simple MDS Analysis


We will start off by performing a straightforward MDS analysis using the R
programming language's built-in iris data set. The iris data set contains the sepal and
petal lengths and widths of three different species of iris flowers.
# Load the iris data set
data(iris)
# Perform MDS analysis on the four numeric columns
mds_iris <- cmdscale(dist(iris[, 1:4]))
# Plot the results
plot(mds_iris[, 1], mds_iris[, 2],
     type = "n", xlab = "MDS Dimension 1",
     ylab = "MDS Dimension 2")
# Plot the points and label them with
# the first two letters of the species name
points(mds_iris[, 1], mds_iris[, 2],
       pch = 21, bg = "lightblue")
text(mds_iris[, 1], mds_iris[, 2],
     labels = substr(iris$Species, 1, 2),
     pos = 3, cex = 0.8)
# Form clusters (k-means is used here as a concrete stand-in
# for an unspecified clustering function)
clusters <- kmeans(mds_iris, centers = 3)$cluster
# Add the cluster information to the plot
points(mds_iris[, 1], mds_iris[, 2],
       pch = 21, bg = clusters, cex = 1.2)

Output

Fig:A visual representation of the iris dataset using multi-dimensional scaling (MDS) analysis,
showcases the relationship between the species of iris flowers based on their physical characteristics.


™ This code performs Multidimensional Scaling (MDS) analysis on the well-known
iris dataset, which contains measurements of 150 iris flowers from three
different species based on four traits. Initially, the data() function is used to load
the iris dataset.
™ The cmdscale() function is then applied to the pairwise distances between
observations to obtain a condensed representation of the data in two
dimensions. The resulting MDS coordinates are stored in the mds_iris object.
™ Next, the plot() function is used to create an empty plot with the x and y axes
labelled accordingly. The points() function is used to add data points to the plot
and the text() function is used to label each point with the first two letters of
the species name. The points are displayed as filled circles with a light blue
background colour.
™ The code then employs a clustering function (k-means in the sketch above) to
generate a vector of cluster assignments for each observation, effectively creating
clusters. These
clusters are added to the plot using the points() function with the pch parameter
set to 21 and the bg parameter set to the vector of cluster assignments. The
size of the points is adjusted to be larger than the default size using the cex
parameter, set to 1.2.

MDS with Custom Distance Matrix


In some instances, a custom distance matrix may be required in place of the
default settings. In this example, we compute the distance matrix for the built-in
USArrests data set explicitly and pass it to cmdscale().
# Load the USArrests data set
data(USArrests)

# Calculate the distance matrix
distance_matrix <- dist(USArrests)

# Perform MDS analysis using
# the distance matrix
mds_usarrests <- cmdscale(distance_matrix)

# Plot the results
plot(mds_usarrests[, 1], mds_usarrests[, 2],
     type = "n")
text(mds_usarrests[, 1], mds_usarrests[, 2],
     labels = row.names(USArrests))

The code above uses the explicitly supplied distance matrix to produce a scatter
plot of the MDS results. Each point corresponds to a state in the USArrests data set
and is plotted with its state label. Visualising the data in this fashion shows how the MDS
analysis groups states with similar arrest profiles close together.


Nonmetric Multidimensional Scaling


Ɣ The dimension-reduction method known as nonmetric multidimensional scaling
(NMDS) is employed in data analysis and visualisation. For visual reasons, it
converts a collection of differences or separations between objects into a lower-
dimensional representation, typically in two or three dimensions. Instead of
preserving the differences themselves, NMDS aims to keep the rank order of the
differences as accurately as feasible.
Ɣ NMDS is helpful for studying complex data structures, contrasting various object
sets, or evaluating object similarity. The relationships between things that are
difficult to see in the raw data can also be seen using this technique.
Ɣ NMDS is described as being “non metric” because it does not require the
dissimilarities to satisfy metric properties such as symmetry and the triangle
inequality. Because of this, NMDS is more robust and adaptable than other
multidimensional scaling methods that call for metric dissimilarities, like metric MDS.
Ɣ An iterative optimisation approach is commonly used to accomplish NMDS, starting
with an initial guess of the lower dimensional representation and changing it to
better approximate the dissimilarities. When the representation and the differences
agree to a satisfactory degree, the optimisation procedure is terminated. The
resulting representation can then be used to visualise and examine the data for
trends and linkages.
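
A minimal sketch of NMDS in R using isoMDS() from the MASS package (shipped with R as a recommended package); the vegan package's metaMDS() is a common alternative:

library(MASS)                    # provides isoMDS()

data(eurodist)                   # built-in road distances between European cities
nmds <- isoMDS(eurodist, k = 2)  # non-metric MDS in two dimensions
nmds$stress                      # stress: lower values mean a better rank-order fit
plot(nmds$points, type = "n")
text(nmds$points, labels = attr(eurodist, "Labels"), cex = 0.7)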

4.3 Factor Analysis


Factor analysis is a method of reducing a large number of variables into a small
number of factors. It is also used to describe variability among the observed, correlated
variables in terms of a potentially smaller number of unobserved variables termed factors.
By using the factor analysis technique, all the variables’ highest common variance
is extracted and they are combined into a single score. It is a theory that is applied when
training a machine learning model, hence data mining is closely related to it. Factor
analysis techniques work under the assumption that knowledge of the relationships
between observed variables can be utilised subsequently to condense the number of
variables in a dataset.
For complicated topics like social status, economic status, food patterns,
psychological scales, biology, psychometrics, personality theories, marketing, product
management, operations research, finance, etc., factor analysis is a very useful
technique for examining changing relationships. By condensing a vast number of
variables into a small number of essential aspects that are simple to understand, a
researcher can more quickly and readily study ideas that are difficult to measure.

4.3.1 Understanding Factor Analysis


The factor analysis method can be used to reduce a large number of variables to
a smaller number of factors. With the use of this specific technique, all variables can be
combined into a single score by determining their maximum common variance. Given
that it is an index of all the factors, this score can be utilised for more in-depth study.
It is part of the general linear model family and rests on a number of
assumptions, such as linear relationships among the variables and the absence of
perfect multicollinearity.

Objectives
The following guidelines can be used to break down factor analysis’ main goals:
Ɣ Figuring out how many elements are needed to explain common themes in a certain
set of variables.
Ɣ Figuring out how closely each dataset’s variable is related to a particular component
or theme.
Ɣ Analysing a dataset’s common factors.
Ɣ Determining how well each observed data point represents a certain theme or
aspect.

Factor analysis types:


1. EFA, or Exploratory factor analysis:
It is used to group elements that are a component of unifying concepts and to identify
composite relationships between objects. The Analyst is not permitted to make any
prior assumptions on the connections between elements. It is also employed to identify
the basic organisation of a sizable collection of variables. It reduces the volume of the
huge data to a considerably more condensed set of summary variables. It resembles
the Confirmatory Factor Analysis (CFA) almost exactly.

Comparisons include:
™ Analyse a number’s internal consistency.
™ Analyse the factors that item sets represent. They assume there is no
correlation between the variables.
™ Look into the class/grade of each item.
There are, however, certain general distinctions, most of which concern how the
factors are specified. In essence, EFA is a data-driven approach that permits all items
to load on all factors, whereas CFA requires you to designate which items load on which
factors. If a researcher has no notion of what the potential common factors might be, EFA
is a very good option, since it can explore many more alternative models for the data
than CFA. If the researcher already has some understanding of how the model should
look and wishes to test hypotheses about the data structure afterwards, CFA is the
preferable strategy.
2. Confirmatory factor analysis (CFA):
Confirmatory Factor Analysis (CFA) is a statistical method used to test the hypothesis
that items are intricately related to specific underlying factors. It employs a well-defined
equation model to assess the measurement model. By examining loadings on
the factors, CFA allows the evaluation of correlations between observed variables and
unobserved variables.
Compared to least-squares estimates, structural equation modelling, which includes
CFA, offers more flexibility and tolerance for measurement inaccuracy. It reveals
the loadings of observed variables on latent variables (factors) and the correlations
between these latent variables. Hypothesised models are rigorously evaluated against
actual data.
CFA is a valuable tool for analysts and researchers to investigate the relationship
between observable variables (manifest variables) and the underlying constructs. It


shares similarities with exploratory factor analysis, but CFA is more focused on testing
specific hypotheses about the underlying factor structure.

The primary distinction between the two is:


™ Exploratory factor analysis is used merely to investigate the pattern in the data.
™ Confirmatory factor analysis is used to test a hypothesis.
Confirmatory factor analysis sheds light on the minimum number of factors necessary
to accurately represent the data set; the total number of necessary factors can be
determined via confirmatory factor analysis. It can, for instance, respond to queries
like, “Does my 1,000-question survey have the ability to measure one particular factor
accurately?” Although it can potentially be employed in any area, the social sciences are
where it is most frequently applied.
3. Multiple Factor Analysis:
If variables are organised into flexible groupings, a multiple factor analysis should be
utilised. A teenager’s health questionnaire, for instance, can include questions on
learning impairments, bad habits, mental health, addiction to mobile phones and
sleeping patterns.

Two steps are included in the Multiple Factor Analysis and they are as follows:
™ First, each and every section of the data will be subject to the Principal
Component Analysis. A relevant eigenvalue can also be obtained from this,
which is then used to normalise the data sets for subsequent use.
™ After merging the newly created data sets into a unique matrix, a global PCA
will be carried out.
4. Generalised Procrustes Analysis (GPA):
The utilisation of this method facilitated the expansion of the GP Analysis, enabling the
comparison of more than two shapes across various dimensions. Procrustes analysis
is a recommended method for comparing two approximate sets of configurations and
shapes. It was initially designed to be comparable to the two solutions derived from
Factor Analysis. In order to achieve the intended shape, it is necessary to align the
shapes accurately. The Generalised Procrustes Analysis (GPA) primarily employs
geometric transformations.

The geometric transformations used are:


™ Isotropic rescaling,
™ Reflection,
™ Rotation,
™ Translation of matrices to compare the sets of data.

4.3.2 Factor Extraction Methods


The following are the various techniques used to extract the factors from a given
data set:



(ImageSource:https://stats.stackexchange.com/questions/50745/best-factor-extraction-methods-in-
factor-analysis)

1. Principal Axis Factoring (PAF):

Principal Axis Factoring (PAF), often known as the Principal Factor method, is the oldest
and possibly still the most widely used approach. It is an iterative application of PCA to
the matrix, with communalities in place of 1s or variances on the diagonal. Communalities
thus become more refined with each iteration until they converge, so the pairwise
correlations end up being explained by a method that aims to explain variance, not the
other way around. The benefit of the principal axis approach is that, like PCA, it can
analyse covariances and other SSCP measures (raw SSCP, cosines), in addition to
correlations. The other three procedures [in SPSS; covariances could be analysed in
other implementations] handle only correlations. This method's drawback is that it
depends on the accuracy of the initial estimates of the communalities. You might choose
alternative estimates (including those derived from prior research) over the standard
starting value of the squared multiple correlation/covariance.
2. Ordinary or Unweighted Least Squares (ULS):
The algorithm directly minimises the residuals between the input correlation matrix and
the correlation matrix reproduced by the factors, while the diagonal elements (the sums
of communality and uniqueness) aim to restore values of 1; this is the task of factor
analysis stated in a straightforward manner. Provided the number of factors is less than
the rank, the ULS approach can be employed even when dealing with singular or
non-positive semidefinite correlation matrices, although there is ongoing debate
regarding the applicability of FA in such scenarios.


3. Generalised or Weighted least squares (GLS):


A variation of the previous method is generalised or weighted least squares (GLS).
Correlation coefficients are assigned varied weights while minimising the residuals,
with correlations between variables with high uniqueness (at the present iteration)
receiving less weight. Use this strategy if you want the factors to fit highly unique
variables (i.e., those that are weakly driven by the factors) less well than highly common
variables (i.e., those that are strongly driven by the factors). This attribute is helpful
because such a wish is not unusual, especially during the questionnaire construction
process.
4. Maximum Likelihood (ML):
Maximum Likelihood (ML) assumes that the data (the correlations) come from a
population with a multivariate normal distribution, so the residuals of the correlation
coefficients must be normally distributed around zero. Under this assumption, the
loadings are iteratively estimated by the ML approach. Correlations are weighted by
uniqueness, much as in the Generalised Least Squares method. Unlike the other
methods, which only assess the sample as it is, the ML method permits some inference
about the population. It is often reported along with a number of fit indices and confidence
intervals [unfortunately, rarely in SPSS, but some have created macros for SPSS that
accomplish it]. How well the factor-reproduced correlation matrix represents the
population matrix from which the observed matrix was randomly generated is assessed
using the general fit chi-square test.
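
In R, maximum likelihood factor analysis is available through the built-in factanal() function; the sketch below is illustrative, and the number of factors and the rotation are choices the analyst must make:

# ML factor analysis of the built-in mtcars data (used purely as example input)
# with two factors and a varimax rotation (the default)
fa <- factanal(mtcars, factors = 2, rotation = "varimax")
fa$loadings        # factor loadings
fa$uniquenesses    # uniqueness of each variable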

4.3.3 Factor Rotation Techniques


Each pair of factors (columns of the loading matrix) is rotated individually and
iteratively. It would be mathematically challenging to simultaneously maximise or
minimise the objective criterion for all the elements, thus this is required. However, the
final rotation matrix Q is put together so that you can use it to replicate the rotation on
your own. By multiplying the extracted loadings A by it, AQ=S, you can obtain the rotated
factor structure matrix S. The objective criterion being optimised is a property of the
elements (loadings) of the resulting matrix S.
1. Quartimax orthogonal rotation
Maximising the total sum of loadings raised to the power 4 in S is the goal of quartimax
orthogonal rotation; the word “quarti” means “four”, hence the name. It has been
demonstrated that achieving this mathematical goal satisfies Thurstone’s third
criterion of “simple structure”, which states that for every pair of factors there should be
several (ideally >= m) variables with loadings close to zero for one of the two factors
and far from zero for the other. In other words, there will be a lot of large loadings and
a lot of small loadings. Points on the loading plot created for a pair of rotated factors
should, preferably, lie close to one of the two axes. Quartimax “simplifies” the rows of
the loading matrix in order to reduce the number of factors required to explain a variable.
However, quartimax frequently generates what is known as a “general factor” (which, in
most cases, is not desirable in the FA of variables).
2. Varimax orthogonal rotation
Maximising variance of the squared loadings in each factor in S is the goal of
varimax orthogonal rotation. Hence the name variation. As a result, each factor has
a small number of variables with high factor loadings. The interpretability of factors is
FRQVLGHUDEO\DLGHGE\9DULPD[EHFDXVHLWLPPHGLDWHO\³VLPSOL¿HV´WKHORDGLQJPDWUL[¶V

Amity Directorate of Distance & Online Education


Optimization and Dimension Reduction Techniques 167

columns. The points on the loading plot are widely spaced along a factor axis and tend
to polarise into near-zero and far-from-zero. This characteristic appears to partially Notes
satisfy a number of Thurstone’s simple structure points. However, Varimax is not
immune to creating points that are far from the axes, i.e. “complex” variables that are
loaded heavily by multiple factors. Depending on the study’s field, this may be good
or bad. It is advised to always use the so-called Kaiser’s normalisation with varimax (and
it is recommended with any other approach, too), since varimax works best in tandem
with this normalisation, which temporarily equalises communalities while rotating. Particularly
in psychometry and the social sciences, it is the most widely used orthogonal rotation
technique.
3. Equamax (rarely, Equimax) orthogonal rotation
Equamax (or, less frequently, equimax) orthogonal rotation is an attempt to sharpen
some varimax features; it was invented in an effort to make varimax even better.
Saunders (1962) added a particular weighting known as equalisation to the algorithm’s
working formula. Equamax automatically adjusts for how many rotating factors are
present. It is less likely to produce “general” factors since it tends to disperse heavily
loaded variables throughout factors more evenly than varimax does. On the other
hand, equamax was not developed to abandon the quartimax’s intention to simplify
rows; rather, equamax is a combination of quartimax and varimax rather than their
intermediary form. However, equamax is asserted to be significantly less “reliable”
or “stable” than varimax or quartimax: for certain data, it can produce disastrously
poor solutions, while for other data, it produces factors with a basic structure that are
completely comprehensible. Another approach, known as parsimax (or “maximising
parsimony”), is comparable to equamax but considerably more ambitious in its pursuit
of basic structure.
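The rotation step AQ = S described above can be sketched in R. varimax() ships with base R, while quartimax() is assumed here to come from the third-party GPArotation package; the loading matrix is again extracted from placeholder data.

# Rotating an extracted loading matrix A to obtain S = AQ.
# install.packages("GPArotation")   # provides quartimax() and related criteria
library(GPArotation)

set.seed(1)
X <- as.data.frame(matrix(rnorm(200 * 6), ncol = 6))
A <- unclass(factanal(X, factors = 2, rotation = "none")$loadings)

vmax <- varimax(A, normalize = TRUE)  # varimax with Kaiser's normalisation
qmax <- quartimax(A)                  # quartimax rotation

vmax$loadings   # rotated structure matrix S
vmax$rotmat     # the rotation matrix Q
qmax$loadings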

4.3.4 Interpreting Factor Analysis Results


Following are the steps for interpreting factor analysis results:

Step 1: Count the number of variables.


If you are unsure about how many factors to employ, first perform the analysis using the
principal components method of extraction without specifying the number of factors. The
number of factors can then be ascertained using one of the following techniques.

% Var
To calculate how much variance the factors explain, use the percentage of variance
(% Var). Keep the factors that account for a reasonable proportion of the variability; the
appropriate level depends on your application. For purely descriptive purposes you might
only need to explain 80% of the variance, whereas you might wish to have at least 90% of the
variance explained by the factors if you want to run additional analyses on the data.

Variance (Eigenvalues)
The variance is equal to the eigenvalue if you extract factors using principal
components. The size of the eigenvalue can be used to calculate the number of factors.
The factors with the highest eigenvalues should be kept. For instance, when applying the
Kaiser criterion, you only take into account the factors with eigenvalues greater than 1.

Scree plot
The eigenvalues are arranged in the scree plot from largest to smallest. A sharp

curve, a bend and then a straight line is the ideal pattern. Use the factors in the
steep curve before the first point that starts the straight-line trend.
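A minimal sketch of these checks in R, assuming a hypothetical numeric data frame X: the eigenvalues of the correlation matrix give the variances used for the Kaiser criterion, % Var and the scree plot.

set.seed(1)
X <- as.data.frame(matrix(rnorm(200 * 12), ncol = 12))   # placeholder data

ev <- eigen(cor(X))$values      # eigenvalues (variances) of each component
sum(ev > 1)                     # Kaiser criterion: factors with eigenvalue > 1
cumsum(ev) / sum(ev)            # cumulative % Var explained

plot(ev, type = "b", xlab = "Factor", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)          # reference line for the Kaiser criterion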

Unrotated Factor Loadings and Communalities

Variable Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7


Academic record 0.726 0.336 -0.326 0.104 -0.354 -0.099 0.233
Appearance 0.719 -0.271 -0.163 -0.4 -0.148 -0.362 -0.195
Communication 0.712 -0.446 0.255 0.229 -0.319 0.119 0.032
Company Fit 0.802 -0.06 0.048 0.428 0.306 -0.137 -0.067
Experience 0.644 0.605 -0.182 -0.037 -0.092 0.317 -0.209
Job Fit 0.813 0.078 -0.029 0.365 0.368 -0.067 -0.025
Letter 0.625 0.327 0.654 -0.134 0.031 0.025 0.017
Likeability 0.739 -0.295 -0.117 -0.346 0.249 0.14 0.353
Organisation 0.706 -0.54 0.14 0.247 -0.217 0.136 -0.08
Potential 0.814 0.29 -0.326 0.167 -0.068 -0.073 0.048
Resume 0.709 0.298 0.465 -0.343 -0.022 -0.107 0.024
Self-Confidence 0.719 -0.262 -0.294 -0.409 0.175 0.179 -0.159
Variance 6.3876 1.4885 1.1045 1.0516 0.6325 0.367 0.3016
% Var 0.532 0.124 0.092 0.088 0.053 0.031 0.025

Variable Factor8 Factor9 Factor10 Factor11 Factor12 Communality


Academic record 0.147 0.097 -0.142 -0.026 -0.031 1
Appearance -0.151 0.082 0.016 0.02 -0.038 1
Communication 0.088 0.023 0.204 0.012 -0.1 1
Company Fit 0.105 -0.019 -0.067 0.188 -0.021 1
Experience -0.102 0.121 0.039 0.077 0.009 1
Job Fit -0.032 0.146 0.066 -0.176 0.008 1
Letter -0.113 -0.079 -0.13 -0.043 -0.127 1
Likeability -0.142 0.051 0.022 0.064 0.012 1
Organisation -0.105 -0.02 -0.162 -0.032 0.136 1
Potential -0.112 -0.29 0.1 -0.023 0.028 1
Resume 0.17 0.008 0.09 0.01 0.156 1
Self-Confidence 0.23 -0.098 -0.061 -0.065 -0.047 1
Variance 0.2129 0.1557 0.1379 0.0851 0.075 12
% Var 0.018 0.013 0.011 0.007 0.006 1

Image Source: https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistical-modeling/multivariate/how-to/factor-analysis/interpret-the-results/key-results/

Important Findings: Variance (Eigenvalue), Scree Plot and %Var


The findings from the principal component analysis (PCA) reveal unrotated factor
loadings for each factor. The variances (eigenvalues) of the first four factors are greater
than 1, indicating their significance in explaining the data’s variability. Adding more than
six factors has minimal impact on the eigenvalues. Therefore, it appears that 4-6 factors
account for most of the data’s variability.
Factor 1 explains 53.2% (0.532) of the variability, while Factor 4 explains 8.8%
(0.088) of the variability. The scree plot further supports this, showing that the first
four components contribute significantly to the total data variability. On the other hand,
the remaining factors contribute very little to the overall variability and are likely not
significant in the analysis.

Step 2: Interpret the factors


Once the number of factors has been determined (step 1), you can rerun the analysis
using the maximum likelihood method. The factor that has the greatest impact
on each variable should then be identified by looking at the loading pattern. The variable
is heavily influenced by the factor when the loadings are near to -1 or 1. Loadings that
are not far from 0 show that the factor has little effect on the variable. Several factors may
have substantial loadings on some variables.
Interpreting unrotated factor loadings can be challenging. Factor rotation makes the
loading structure simpler, making it easier to understand the factor loadings. One rotation
technique might not always be the most effective, though. Consider experimenting with
various rotations and using the one that yields the most understandable results. To
more clearly evaluate the loadings inside a factor, you can additionally order the rotated
loadings.

Rotated Factor Loadings and Communalities

Varimax Rotation
Variable Factor1 Factor2 Factor3 Factor4 Communality
Academic record 0.481 0.51 0.086 0.188 0.534
Appearance 0.14 0.73 0.319 0.175 0.685
Communication 0.203 0.28 0.802 0.181 0.795
Company Fit 0.778 0.165 0.445 0.189 0.866
Experience 0.472 0.395 -0.112 0.401 0.553
Job Fit 0.844 0.209 0.305 0.215 0.895
Letter 0.219 0.052 0.217 0.947 0.994
Likeability 0.261 0.615 0.321 0.208 0.593
Organisation 0.217 0.285 0.889 0.086 0.926
Potential 0.645 0.492 0.121 0.202 0.714
Resume 0.214 0.365 0.113 0.789 0.814
Self-Confidence 0.239 0.743 0.249 0.092 0.679
Variance 2.5153 2.488 2.0863 1.9594 9.0491
% Var 0.21 0.207 0.174 0.163 0.754

Principal Findings: Loadings, Communality and Loading Plot


The data was rotated by a varimax algorithm to produce these findings. You can
interpret the factors as follows using the rotational factor loadings:
Ɣ Factor 1 has significant positive loadings for Company Fit (0.778), Job Fit (0.844)
and Potential (0.645), which means that this factor indicates employee fit and
potential for advancement in the organisation.
Ɣ Factor 2 has significant positive loadings for appearance (0.730), likeability (0.615)
and self-confidence (0.743), indicating that it describes personal characteristics.
Ɣ Factor 3 describes job skills as it has significant positive loadings for communication
(0.802) and organisation (0.889).
Ɣ Factor 4 describes writing abilities because it has significant positive loadings on the
letters (0.947) and resumes (0.789).
Ɣ All four factors together account for 0.754, or 75.4%, of the variation in the data.

Fig: The loading plot visually shows the loading results for the first two factors.
Image Source: https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistical-modeling/multivariate/how-to/factor-analysis/interpret-the-results/key-results/

Step 3: Check your data for problems


Use the score plot to evaluate the data structure and find clusters, outliers and
trends if the first two components explain the majority of the variance in the data. Data
clusters on the figure may represent two or more distinct data distributions. The points
are randomly distributed about the value of zero if the data have a normal distribution and
there are no outliers.

Image Source: https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistical-modeling/multivariate/how-to/factor-analysis/interpret-the-results/key-results/

Main Outcome: Score Plot


The data in this score plot look reasonable and no extreme outliers stand out.
However, the data value displayed in the lower right corner of the plot, which lies farther
from the other data values, is worth investigating.
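A score plot such as the one described here can be sketched in R by requesting regression factor scores from factanal(); the data frame below is a random placeholder.

set.seed(1)
X <- as.data.frame(matrix(rnorm(200 * 12), ncol = 12))   # placeholder data

fa <- factanal(X, factors = 4, rotation = "varimax", scores = "regression")

plot(fa$scores[, 1], fa$scores[, 2],
     xlab = "Factor 1 score", ylab = "Factor 2 score", main = "Score plot")
abline(h = 0, v = 0, lty = 3)   # points should scatter around zero if there are no outliers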

4.4 Dealing with Batch Effects


Batch effects are a type of technical variation that arises from the introduction of
external factors during the handling of samples. In situations where there is an excessive
number of samples to be labelled simultaneously, it becomes necessary to divide the
task into manageable rounds of labelling. Samples that are labelled simultaneously will
receive an equal amount of technical variance. However, samples labelled at different
times may exhibit varying levels of added variance. This process is commonly known
as batch labelling. It is crucial to ensure that this particular technical variation does not
cause confusion with the biology, specifically in terms of overlapping biological treatment
groups with technical groups. The potential confounding of technical batch effects with
biological factors can be mitigated through meticulous planning and execution of each
stage of the experiment. This component of the experimental design is of significant
importance. In what follows, it will be assumed that the experiments under consideration
have a well-designed experimental structure, and the methodology for identifying batch
effects within the dataset will be outlined.
In the context of outlier detection, the assessment of batch effects commonly
involves exploratory techniques such as distance measures, clustering and spatial

methods. Contrasting colours are highly beneficial in various applications. If the labelling
process is spread out over a period of three days, it is possible to assign different colours
to the samples labelled on each day. Specifically, samples labelled on day 1 can be
coloured red, samples labelled on day 2 can be coloured blue and samples labelled on
day 3 can be coloured green. This colour-coding system facilitates the identification of
batch effects from plots, as demonstrated in the examples provided below.

Guidelines for Spotting Batch Effects:


1. Choose a suitable plotting tool, such as Principal Component Analysis (PCA) or
hierarchical clustering.
2. The samples should be coloured based on the biological treatments being tested in
order to observe the distribution of colours within the plot.
3. The samples should be coloured based on their respective handling batches. The
following examples are provided for reference. However, if there are additional technical
variables that you believe may have influenced your experiment, it is recommended to
conduct similar tests.
™ The date of sampling and the date of RNA extraction were recorded.
™ Date of RNA Labelling: The date on which the RNA labelling process was
conducted. Date of Hybridisation: The date on which the hybridisation
procedure was performed.
™ The presentation includes an array slide, where some slides contain multiple
arrays.
4. In the event that batch effects are detected, it is advisable to verify the presence of all
biological groups within each batch. If such a scenario arises, there are three available
options for conducting downstream analysis:
™ Proceeding without any intervention, it is important to acknowledge that the
presence of batch effects can lead to increased variability within the same
group. This variability may pose challenges in identifying genes that exhibit
relatively minor differences between the biological groups.
™ The objective is to mitigate the impact of handling batches on the data by
minimising variation. Once this is achieved, the downstream analysis can
proceed as usual. This option involves modifying the data, which may result
in the elimination of undesired variations in the data. One advantage of this
approach is the ability to compare biological groups with a greater number of
replicates, resulting in increased statistical power.
™ Perform data analysis on each batch individually, followed by conducting a
meta-analysis to determine if the same genes are consistently identified across
all batches. This particular option entails the retention of the original data
without any alterations prior to conducting the analysis. However, it is important
to note that this approach does have a disadvantage, as it may result in a
reduction in statistical power.
The dataset consists of four distinct biological groups, each assigned a unique
colour. It is evident that the biological groups have been thoroughly mixed, posing a
challenge in identifying differentially expressed genes among these groups. The plot
exhibits three clusters that appear unrelated to the biological groups under investigation.

In order to determine if this issue is a result of technical manipulation, it is possible to
conduct tests to identify various types of batch effects. This can be achieved by assigning
new colours to the samples.
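The colour-coding idea can be sketched with a PCA plot in base R. The object names below (expr for a genes-by-samples matrix, batch_day for the labelling day of each sample) are assumptions for illustration only.

set.seed(1)
expr <- matrix(rnorm(1000 * 12), nrow = 1000)             # 1000 genes x 12 samples
batch_day <- factor(rep(c("day1", "day2", "day3"), each = 4))

pca <- prcomp(t(expr), scale. = TRUE)                     # samples as rows

plot(pca$x[, 1], pca$x[, 2], pch = 19,
     col = c("red", "blue", "green")[batch_day],
     xlab = "PC1", ylab = "PC2", main = "Samples coloured by labelling day")
legend("topright", legend = levels(batch_day),
       col = c("red", "blue", "green"), pch = 19)

If the samples separate by colour along PC1 or PC2, the labelling day is a likely batch effect.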

4.4.1 Understanding Batch Effects in Data


Batch Effects in RNA-Seq Data Analysis
Batch effect correction is a method employed to eliminate variability from data that
is unrelated to the variable of interest, such as cancer type. Batch effects arise from
technical variations among samples, such as disparities in sequencing machines or
even differences in the individuals performing the sample runs. The elimination of this
variability necessitates modifying the data for each individual sample. When working with
expression data, it is important to refrain from extracting individual samples from a batch
corrected set for independent analyses, such as differential expression.
To assess the presence of a batch effect, execute the BatchQC tool. The R module
presented here offers a range of interactive diagnostics, visualisations and statistical
analyses. These tools are designed to facilitate the exploration of batch variation and its
impact on your data. The following section will provide an overview of the steps involved
in running BatchQC and interpreting its output.

Executing the BatchQC process.


In order to proceed, it is necessary to have two files.
The gene-by-sample matrix consists of gene IDs listed in the first column and
sample IDs as column headers. The cells are populated with expression values that have
been quantile normalised.
The metadata file contains sample IDs in the first column, followed by relevant
information about the samples in the remaining columns. The inclusion of suspected
batch variables, such as Sequencing Platform, Date, Biopsy Site, etc., along with the
classifier (e.g. tumour type), is recommended.
Ensure that both tables are thoroughly cleaned, devoid of any empty cells, NA
values, spaces, quotes, or any other unconventional symbols in the column headers. It is
important to thoroughly review the metadata for any misspellings or undesired variations.
For instance, it is likely that “mixed” and “mixed cytology” refer to the same concept. It is
recommended to substitute any spaces in the metadata cells with underscores. The R
programming language is known for its meticulous behaviour and strict requirements.
To execute the programme, please ensure that your data is in the tab-separated text
format. Follow the steps below to proceed:

source("https://bioconductor.org/biocLite.R")  # legacy Bioconductor installer; current releases use BiocManager::install("BatchQC")
biocLite(c("BatchQC"))
library(BatchQC)

# read.delim() reads tab-separated text files
metadf <- read.delim("./metadata.tsv")
# gene IDs in the first column become row names; the remaining columns are samples
exprdata <- as.matrix(read.delim("./expressiondata.tsv", row.names = 1))

batch <- metadf$Sequencer
condition <- metadf$CancerType

batchQC(dat = exprdata, batch = batch, condition = condition,
        report_file = "batchqc_report.html", report_dir = ".",
        view_report = TRUE, interactive = TRUE, batchqc_output = TRUE)
Upon completion of the computational calculations, the programme will initiate the
launch of an interactive URL. This URL will provide users with visual representations,
in the form of plots, pertaining to their respective data. It is possible to apply colour to
the majority of these plots based on either condition or batch, thereby facilitating the
identification of any potential batch effect.

Key Points:
Ɣ Differential manifestation of batch effects is observed across various plots. Various
types of statistics are presented to fulfil specific purposes. The Circular Dendrogram
should be examined initially as it is the most straightforward plot to comprehend.
Ɣ The runtime of the programme and the ease of interaction will be negatively
impacted by the size of the datasets. This is due to the program’s need to
regenerate plots based on the user’s selections. In the event that your computer
experiences a decrease in performance, it is recommended to downsample your
gene set, thereby reducing the amount of data you are working with to a range of
20-50%.
Ɣ To initiate batch correction, navigate to the interactive interface and access the
ComBat or SVA tabs. From there, locate and select the “Run” option. The feature
enables users to review their plots and choose a “post-correction” version. To obtain
the results table, it is necessary to execute ComBat or SVA on the complete dataset.

4.4.2 Detecting and Correcting Batch Effects


The identification and resolution of batch effects in data analysis are essential for
maintaining the precision and dependability of outcomes, particularly in the analysis
of high-throughput omics data. Batch effects refer to systematic variations that occur
due to technical or experimental factors, rather than the biological factors that are the
main focus of interest. The lack of attention given to batch effects can result in incorrect
conclusions and impede the accurate interpretation of genuine biological signals within
the data. The following are the sequential steps involved in the process of detecting and
correcting batch effects:
1. Data Visualisation and Quality Control: The initial stage involves visualising the data
and conducting quality control checks. Various techniques such as principal component
analysis (PCA), multidimensional scaling (MDS) and hierarchical clustering can be
employed to identify batch effects and other sources of variability.
2. Batch Effect Assessment: To evaluate the existence and magnitude of batch effects,
a range of statistical tests and metrics can be employed. These include the F-statistic,
RUV-III and the Levene test. The purpose of these tests is to assess the significance
of batch effects by comparing the variability within and between batches.
3. Data preprocessing is a crucial step that needs to be performed before applying
batch effect correction. Preprocessing commonly involves data normalisation and

transformation to ensure that data from various batches are standardised to a
consistent scale.
4. Batch correction methods encompass a variety of techniques that aim to mitigate batch
effects. These methods are specifically tailored to target distinct types of batch effects.
There are several widely used methods that are commonly employed, including:
™ ComBat is a commonly employed empirical Bayes approach utilised for the
purpose of correcting batch effects in genomics data.
™ Surrogate Variable Analysis (SVA) is a technique used to estimate latent
variables that are linked to batch effects.
™ The Limma method is a linear modelling approach that incorporates batch
information as a covariate.
™ The Removal of Unwanted Variation (RUV) technique is employed to estimate
and eliminate batch effects by utilising negative control genes.
5. Cross-validation is a technique that can be employed to evaluate the effect of batch
correction methods on the performance of a model, following their application. The
process entails dividing the data into separate training and testing sets in order to
assess the extent to which the results can be applied to new data.
6. The inclusion of additional biological replicates can aid in the differentiation of genuine
biological variability from batch effects. Increasing the number of replicates enhances
the precision of the biological signal estimation.
7. Data integration is a critical step in the process of combining data from different
batches. It is crucial to ensure that the integration is performed accurately in order to
prevent the introduction of new batch effects.
8. Sensitivity Analysis: Performing sensitivity analysis is a crucial step in evaluating the
resilience of results to various batch correction methods.
Researchers can effectively detect and correct batch effects by adhering to the
steps above. This approach ensures that their data analysis accurately captures
the biological signals of interest while minimising the influence of technical artefacts.
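As a hedged sketch of step 4, the ComBat function from the Bioconductor sva package can adjust a genes-by-samples matrix for a known batch variable while protecting the biological condition of interest; all object names below are illustrative assumptions, not a prescribed pipeline.

# BiocManager::install("sva")   # if not already installed
library(sva)

set.seed(1)
expr      <- matrix(rnorm(1000 * 12), nrow = 1000)        # genes x samples
batch     <- factor(rep(c("A", "B"), each = 6))
condition <- factor(rep(c("tumour", "normal"), times = 6))

mod <- model.matrix(~ condition)          # protect the biological signal of interest
expr_adj <- ComBat(dat = expr, batch = batch, mod = mod, par.prior = TRUE)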

4.4.3 Normalisation and Batch Adjustment Techniques


Normalisation Techniques
Normalisation techniques are essential in the analysis of high-dimensional data,
particularly when datasets contain a significant number of features or variables. The
purpose of these techniques is to scale and standardise the data in order to ensure that
all variables have an equal contribution to the analysis and to prevent dominant features
from overshadowing others. In the field of high-dimensional data analysis, several
common normalisation techniques are employed. These techniques aim to standardise
the data and make it suitable for further analysis.

Min-Max Scaling technique:


The Min-Max Scaling technique is a data normalisation method commonly used in
statistical analysis and machine learning.
Ɣ The process of Min-Max scaling involves rescaling the data to a predetermined
range, typically ranging from 0 to 1.
Ɣ The formula used for Min-Max scaling is as follows:

x_scaled = (x - min(x)) / (max(x) - min(x))
Ɣ This approach is applicable in cases where the inherent minimum and maximum
values of the data are known and when the data does not follow a normal
distribution.

Z-Score Standardisation:
Z-Score Standardisation is a statistical technique used to rescale a dataset so that it
has a mean of 0 and a standard deviation of 1.
Ɣ Z-score standardisation, also referred to as standardisation or z-transformation, is
a statistical technique that adjusts the data to possess a mean value of 0 and a
standard deviation of 1.
Ɣ The formula used for Z-score standardisation is as follows:
z = (x - mean(x)) / std(x)
Ɣ This approach is suitable in cases where the data follows a normal distribution
and when the mean and standard deviation provide meaningful measures for
comparison.

Decimal Scaling:
Decimal scaling is a technique used in data normalisation, specifically in the field of
machine learning and data mining.
Ɣ The decimal scaling technique is a straightforward method of scaling data by
dividing it by a power of 10.
Ɣ The scaling factor is determined by calculating the maximum absolute value of the
data.
Ɣ The utilisation of this method ensures that the scaled data possesses a maximum
absolute value that is below 1, thereby facilitating its manipulation in certain
scenarios.

The log transformation:


The log transformation is a mathematical operation that is commonly used in data
analysis and signal processing.
™ The log transformation technique is employed to address data sets that exhibit
significant skewness or possess a broad range of values.
™ Utilising the logarithm function on the dataset can effectively compress the
range of values and mitigate the influence of outliers.

Robust Scaling:
Robust scaling is a normalisation technique designed to be resistant to the influence
of outliers in the data.
Ɣ The technique of robust scaling, alternatively referred to as percentile scaling or
quantile scaling, involves the scaling of data using percentiles.
Ɣ The median and interquartile range (IQR) are employed in order to standardise the
data, enhancing its resilience against outliers.

Unit Vector Scaling (L2 Norm):


Ɣ The process of unit vector scaling involves scaling each data point in such a way
that its Euclidean norm (also known as L2 norm) becomes equal to 1.

Ɣ The utilisation of this approach is prevalent in machine learning algorithms that
heavily depend on distance computations.
Every normalisation technique possesses its own set of advantages and limitations.
The selection of a technique is contingent upon the distinct characteristics of the
dataset and the analysis requirements. When choosing a normalisation technique for
high-dimensional data, it is crucial to thoroughly evaluate the data distribution and
analysis objectives. Furthermore, it is crucial to perform validation and evaluation of
the normalisation process in order to prevent the introduction of unintended biases or
distortion of the underlying structure of the data.
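The following sketch applies several of the techniques above column-wise to a hypothetical numeric matrix X using base R; which one is appropriate depends on the distribution of the data, as discussed.

set.seed(1)
X <- matrix(rexp(100 * 5), ncol = 5)      # skewed, strictly positive placeholder data

# Min-Max scaling to the range [0, 1]
minmax <- apply(X, 2, function(x) (x - min(x)) / (max(x) - min(x)))

# Z-score standardisation: mean 0, standard deviation 1
zscore <- scale(X, center = TRUE, scale = TRUE)

# Robust scaling with the median and interquartile range (IQR)
robust <- apply(X, 2, function(x) (x - median(x)) / IQR(x))

# Log transformation for skewed data
logged <- log1p(X)                        # log(1 + x) avoids log(0)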

Batch Adjustment Techniques


Batch Adjustment Techniques, commonly referred to as Batch Correction or Batch
Normalisation, play a crucial role in the analysis of high-dimensional data, specifically in
the fields of genomics, transcriptomics and other omics data. The primary objective is
to tackle the problem of batch effects, which refer to systematic variations in the data
resulting from technical or experimental factors, such as variations in processing dates,
instruments, or laboratories. Batch effects have the potential to introduce confounding
factors that can obscure the genuine biological signals present in the data, thereby
resulting in erroneous conclusions.
Batch Adjustment Techniques are employed to mitigate the impact of batch effects
while retaining the biological or experimental signals present in the data. There are
various techniques employed for batch adjustment and the following are a few prevalent
ones:
1. ComBat:
™ The ComBat method, also referred to as the Empirical Bayes framework, is a
widely utilised batch adjustment technique in genomics research.
™ The empirical Bayes framework is utilised to model the batch effects and
subsequently adjust the data by leveraging information from multiple batches.
™ The ComBat algorithm calculates batch-specific means and variances and then
adjusts them towards the overall mean and variance. This process effectively
aligns the distributions of different batches.
™ This method proves to be highly advantageous in situations where there is a
scarcity of biological replicates and significant batch effects are present.
2. Combat-Seq:
™ Combat-Seq is a specialised version of ComBat that has been specifically
developed for the analysis of single-cell RNA sequencing (scRNA-seq) data.
™ The existing ComBat method has been extended to address the sparse and
high-dimensional characteristics of scRNA-seq data, ensuring efficient removal
of batch effects.
™ The application of Combat-Seq involves the utilisation of Bayesian models for
the purpose of estimating and compensating for the effects specific to each
batch. This process leads to an enhancement in the accuracy of subsequent
analyses conducted on single-cell data.
3. Surrogate Variable Analysis (SVA) is a statistical technique used to identify and account
for hidden sources of variation in high-dimensional data analysis. It is commonly
employed in genomics and other fields where large datasets are analysed.

™ The surrogate variable analysis (SVA) is a widely used technique for batch
adjustment. It aims to detect and correct for latent or unmeasured variables,
known as surrogate variables, that are responsible for batch effects.
™ The extraction of surrogate variables, which capture the hidden sources of
variation in the data, is accomplished through the utilisation of singular value
decomposition (SVD).
™ The confounding batch effects are effectively eliminated and the true biological
variation is preserved by employing SVA and making adjustments for surrogate
variables.
4. Linear mixed models (LMM) are a statistical modelling technique used to analyse data
that has both fixed and random effects. LMMs are an extension of linear regression
models and are particularly useful when dealing with correlated or clustered data.
™ Linear mixed models (LMMs) are a flexible statistical methodology that can be
employed for batch adjustment across diverse data types.
™ The LMM (Linear Mixed Model) is capable of effectively managing intricate
study designs, including repeated measures or hierarchical structures. This
feature makes it well-suited for a wide range of high-dimensional datasets.
™ By incorporating random effects to account for batch variability, Linear Mixed
Models (LMM) can effectively mitigate the impact of batch effects and enhance
the accuracy of data analysis.
Batch adjustment techniques play a critical role in ensuring the reliability and
reproducibility of high-dimensional data analysis. These tools assist researchers in
obtaining more precise biological insights by mitigating technical biases and enabling the
emergence of genuine signals from the data.
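As a further hedged sketch, the limma approach mentioned above can be approximated with removeBatchEffect(), which fits batch as a covariate and subtracts its estimated effect; the adjusted matrix is intended for visualisation and clustering rather than as input to differential-expression tests, where batch should instead be kept as a covariate in the model. Object names are illustrative.

# BiocManager::install("limma")
library(limma)

set.seed(1)
expr      <- matrix(rnorm(1000 * 12), nrow = 1000)        # genes x samples
batch     <- factor(rep(c("A", "B"), each = 6))
condition <- factor(rep(c("tumour", "normal"), times = 6))

design   <- model.matrix(~ condition)                     # biological design to protect
expr_adj <- removeBatchEffect(expr, batch = batch, design = design)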

4.5 Clustering and Heatmaps


Clustering and heatmaps are widely employed techniques in data analysis and
visualisation for the purpose of identifying patterns and relationships within datasets that
have a high number of dimensions. In this analysis, each of the techniques at hand will
be examined and delved into.

Clustering:
Ɣ Clustering is a method employed to group data points that exhibit similarities or
proximity to one another within a multidimensional space.
Ɣ The method is classified as unsupervised learning, indicating that it does not
necessitate labelled data during the training process.
Ɣ The objective of clustering is to divide the data into clusters, where the data points
within each cluster exhibit greater similarity to one another compared to data points
in different clusters.
Ɣ The commonly used clustering algorithms comprise K-means, hierarchical clustering
and density-based clustering (DBSCAN).
Ɣ Clustering is a commonly employed technique in diverse domains, including biology,
customer segmentation, image segmentation and anomaly detection.

Heatmaps:
Ɣ Heatmaps are visual representations that utilise cells with colour coding to
effectively display the values of a dataset in two dimensions.

Ɣ Heatmaps are a valuable tool for the visualisation of extensive matrices or data
tables. They enhance the identification of patterns and trends, thereby facilitating
data analysis.
Ɣ A heatmap is a graphical representation where each row and column corresponds to
a specific data point or category. The colour intensity within each cell of the heatmap
is used to visually depict the magnitude of the corresponding data point’s value.
Ɣ Heatmaps are a frequently employed visualisation tool for the representation of
gene expression data, correlation matrices and spatial data.
Ɣ These tools have gained significant popularity, particularly in the fields of genomics,
bioinformatics and data exploration.

Integration of Clustering and Heatmaps


The integration of clustering and heatmaps involves the combination of these two
techniques to analyse and visualise data.
Ɣ The application of clustering techniques to the rows and/or columns of a dataset
enables the identification of groups of data points or categories that exhibit
similarities.
Ɣ Heatmaps can be utilised for visualising the data that has been clustered. This
involves rearranging the rows or columns of the heatmap in accordance with the
clustering outcomes.
Ɣ The heatmap has the capability to effectively visualise the patterns and relationships
present in the clustered data.
Ɣ The integration facilitates the identification of clusters of data points that exhibit
similar characteristics and enables a comprehensive understanding of their
interrelationships within the dataset.

4.5.1 Introduction to Clustering


Cluster analysis, also referred to as clustering, is a data mining technique utilised to
group data points that exhibit similarities. The objective of cluster analysis is to partition
a dataset into distinct groups, commonly referred to as clusters, where the data points
within each cluster exhibit greater similarity to one another compared to data points in
different clusters. The utilisation of this procedure is frequently employed in exploratory
data analysis, serving to detect patterns or relationships within the data that may not
be readily apparent. Cluster analysis employs various algorithms, including k-means,
hierarchical clustering and density-based clustering, to partition data into distinct groups.
The selection of an algorithm is contingent upon the precise analysis requirements and
the characteristics of the data under examination.
Cluster analysis is a methodological approach used to identify and group similar
objects together, thereby forming clusters. The algorithm is a machine learning model
that operates on unlabelled data using an unsupervised approach. A cluster is formed
when a group of data points are combined, with all objects within the cluster belonging to
the same group.
The provided data is partitioned into distinct groups by aggregating similar objects
together; before clustering, the data itself is disorganised and unstructured. A cluster
refers to a grouping of similar data elements that are collected together.
Consider a dataset comprising various vehicles, such as cars, buses, bicycles and
more. This dataset contains detailed information about each vehicle. In the context of

unsupervised learning, the available data does not include class labels such as Cars,
Bikes, etc. Additionally, the data is not structured and consists of a combination of
information from various vehicles.
The objective at hand is to transform the unlabelled data into labelled data, a
process that can be achieved through the utilisation of clusters.
The primary concept behind cluster analysis is to organise data points by grouping
them into clusters, such as a “cars” cluster that includes all car data points, a “bikes”
cluster that includes all bike data points and so on.
Cluster analysis is a technique used to partition unlabelled data into groups of similar
objects.

Characteristics of Clustering
1. Scalability of Clustering: In the present era, the volume of data has significantly
increased, necessitating the management of large-scale databases. For efficient
management of large databases, it is essential to ensure that the clustering algorithm
possesses scalability. The scalability of data is crucial for obtaining accurate results. If
data is not scalable, it can lead to incorrect outcomes.
2. High Dimensionality: The algorithm must possess the capability to effectively process
data in high-dimensional spaces, even when the dataset is relatively small in size.
3. Algorithm Usability with multiple data kinds:An algorithm is a step-by-step procedure
or set of rules for solving a specific problem or completing a specific task. Algorithms
for clustering can accommodate various types of data. The system should possess
the capability to handle various types of data, including discrete, categorical, interval-
based and binary data.
4. Managing unstructured data: It is common to encounter databases that contain missing
values, as well as noisy or erroneous data. Poor quality clusters may arise if the
algorithms are sensitive to such data. The system should possess the capability to
process unstructured data and subsequently organise it into clusters of similar data
objects, thereby imparting structure to the data. This facilitates the task of the data
expert by streamlining data processing and enabling the identification of novel patterns.
5. Interpretability: It is essential for the clustering results to possess interpretability,
comprehensibility and usability. The concept of interpretability pertains to the level of
ease with which data can be comprehended.

Clustering Methods:
The clustering methods can be classified into the following categories:
Ɣ Partitioning Method
Ɣ Hierarchical Method
Ɣ Density-based Method
Ɣ Grid-Based Method
Ɣ Model-Based Method
Ɣ Constraint-based Method

Applications of Cluster Analysis


Ɣ The application of this technology is prevalent in the fields of image processing, data
analysis and pattern recognition.

Ɣ Marketers can effectively identify distinct groups within their customer base by
leveraging purchasing patterns. This enables them to accurately characterise their
customer groups.
Ɣ The application of this technology extends to the domain of biology, where it
facilitates the derivation of animal and plant taxonomies, as well as the identification
of genes possessing similar functionalities.
Ɣ Additionally, it aids in the process of information discovery by effectively categorising
documents found on the internet.

Benefits of Cluster Analysis:


Ɣ One notable advantage is its ability to discern patterns and relationships within a
given dataset that may not be readily apparent.
Ɣ The tool has the capability to facilitate exploratory data analysis and assist in the
process of feature selection.
Ɣ This technique is employed for the purpose of reducing the dimensionality of the
data.
Ɣ This tool has the capability to perform anomaly detection and identify outliers.
Ɣ This tool has the capability to perform market segmentation and customer profiling.

Drawbacks of Cluster Analysis


Ɣ One such drawback is its sensitivity to the selection of initial conditions and the
determination of the number of clusters.
Ɣ The system may exhibit sensitivity to the presence of noise or outliers in the data.
Ɣ Interpretation of analysis results may pose challenges in cases where the clusters
lack clear definition.
Ɣ Large datasets can incur significant computational costs.
Ɣ The selection of the clustering algorithm employed can have an impact on the
outcomes of the analysis.
Ɣ The success of cluster analysis is contingent upon several factors, including the
nature of the data, the objectives of the analysis and the proficiency of the analyst in
interpreting the outcomes.

4.5.2 Clustering Algorithms (K-means, Hierarchical)


Clustering is a machine learning technique utilised to group similar objects together
and classify dissimilar objects into distinct groups. It serves as a crucial tool in data
analysis. Clustering is classified as an unsupervised learning algorithm due to its ability
to operate on unlabelled data and autonomously partition the data into distinct clusters
without any prior knowledge or predefined criteria for group formation.

Types of Clustering Algorithms


The clustering algorithm is divided into two subgroups, namely:
Ɣ Hard clustering refers to a method of grouping data entities that share a common
trait or belong to the same cluster. In this approach, each data entity is assigned
exclusively to a single cluster based on its similarity to other entities within that
cluster. The algorithm eliminates data entities from the cluster set if they fail to
satisfy a specific identity condition.

Ɣ Soft clustering is a technique that enables data entities to explore and identify other
similar data entities that have a high probability of belonging to the same cluster.
This clustering method has the capability to assign a single data entity to multiple
clusters, depending on its similarity to other data entities.

K-Means Clustering Algorithm


The K-means algorithm is an iterative procedure designed to partition data into K
clusters, ensuring that each observation is assigned exclusively to one of the K clusters.
The goal is to minimise the “within-cluster-variation”, meaning that the elements within
each cluster should exhibit high similarity. The objective of K-means is to minimise the
distance between the points in a cluster and their centroid, which is calculated as the
arithmetic mean of all the data within the cluster. In the field of Euclidean geometry, the
measurement of squared Euclidean distances is performed on pairs of points. A point is
classified as belonging to a specific cluster if its distance to the centroid of that cluster is
smaller than its distance to any other centroid.
K-means can be viewed as an optimisation problem, where the objective function J
is defined as follows:
J = Σ_n Σ_k r_nk ||x_n − μ_k||²
The subscripts “n” and “k” in this context represent the data index and cluster
index, respectively. Additionally, μ_k refers to the centroid of the kth cluster. It should be
noted that the value of r_nk is equal to 1 if the data point x_n belongs to cluster k and it is 0
otherwise.

The objective is to identify the centroids that minimise the objective function J. This
can be achieved by solving the equation ∂J/∂μ_k = 0. The resulting solution is as follows:
μ_k = (Σ_n r_nk x_n) / (Σ_n r_nk)
The K-means algorithm is a method that can be summarised as follows:


Ɣ The user is required to select the desired number of clusters, denoted as K, for
categorization. Subsequently, K data points are to be randomly chosen as the
centroids, representing the centre of each cluster.
Ɣ In order to determine the clustering of data points, it is necessary to measure the
distance between each data point and all centroids. Subsequently, each data point
should be assigned to the cluster (centroid) that is closest to it.
Ɣ The centroids are updated by computing the mean of all data points within each
cluster.
Ɣ Continue to iterate steps 2 and 3 until either the centroids remain unchanged or the
maximum number of iterations is reached.
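The steps above correspond directly to base R’s kmeans() function; the two-dimensional data below are simulated purely for illustration.

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # two simulated groups of points
           matrix(rnorm(100, mean = 4), ncol = 2))

km <- kmeans(X, centers = 2, nstart = 25)   # nstart repeats with fresh random centroids

km$centers                                   # final centroids
km$tot.withinss                              # within-cluster variation (objective J)
plot(X, col = km$cluster, pch = 19)
points(km$centers, col = 1:2, pch = 8, cex = 2)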

Benefits of K-Means algorithm


1. Simple and easy to implement:
The k-means algorithm is widely favoured for clustering tasks due to its simplicity and
ease of implementation. Its straightforward nature allows for easy comprehension and
application.
2. Fast and efficient:
This method delivers high speed and optimal performance. The K-means
algorithm is known for its computational efficiency and ability to effectively process
large datasets that have high dimensionality.
3. Scalability:
The K-means algorithm is capable of efficiently processing extensive datasets
containing a substantial number of data points. Moreover, it can be readily adapted to
accommodate even more substantial datasets without compromising its performance.
4. Flexibility:
The K-means algorithm can be readily customised to suit various applications,
accommodating different distance metrics and initialization methods.

Disadvantages of K-Means algorithm


1. Sensitivity to initial centroids:
The sensitivity to initial centroids is a characteristic of the K-means algorithm, where the
selection of initial centroids can significantly impact the convergence and potentially
lead to a suboptimal solution.
2. Requires specifying the number of clusters:
In order to proceed, it is necessary to provide a specific value for the number of
clusters. In order to execute the algorithm, it is necessary to specify the number of
clusters, denoted as k. However, in certain applications, determining the appropriate
value for k can pose a challenge.
3. Outlier Sensitivity: The K-means algorithm exhibits sensitivity to outliers, thereby
exerting a notable influence on the resultant clusters.

Applications of K-Means Clustering:


K-Means clustering finds application in various real-life scenarios and business use
cases, such as:
Ɣ Academic performance
Ɣ Diagnostic systems
Ɣ Search engines are software programmes that are designed to retrieve and display
information from the World Wide Web. They use algorithms to analyse and index
web pages.
Ɣ Wireless sensor networks (WSNs) refer to a collection of interconnected sensor
nodes that communicate wirelessly to perform various sensing and monitoring
tasks. These networks are designed to gather data.

Hierarchical Clustering Algorithm


The concept underlying hierarchical clustering algorithms exhibits notable
distinctions. In contrast to the K-Means algorithm, the hierarchical algorithm does
not necessitate the presence of a representative for every cluster. The assumption
made is that the clusters possess a hierarchical structure. Each cluster consists of
smaller clusters. The approach considers the entire dataset as a singular cluster, with

the sub-clusters within it serving as partitions of the main cluster. This assumption can
be extended to the sub-clusters within the system. The minimum unit will consist of a
cluster containing a single data point. The algorithm’s assumption enables to interpret the
cluster structure that is produced as a tree. In this tree, the leaves represent individual
data points, which are clusters consisting of only one object. The inner nodes, on the
other hand, represent collections of data points that are contained within their respective
subtrees. The process of the hierarchical clustering algorithm can be conceptualised as
the construction of a tree.
There exist two distinct methodologies for accomplishing this objective:
™ The tree is constructed using a bottom-up approach, which is commonly
referred to as the agglomerative method.
™ The tree is constructed in a top-down manner, which is commonly referred to as
the divisive approach.

Agglomerative Approach
The initial approach is characterised by its simplicity. The algorithm begins by
considering each data point as an individual cluster. This analogy pertains to the
process of constructing the leaves of a tree. At this stage, every individual data point is
considered as a distinct cluster. The provided partition for the dataset is deemed valid.
However, our objective is to surpass this level of performance. The algorithm initiates
the iterative process of merging clusters that exhibit similarity. During each iteration, the
algorithm identifies the two clusters that are most similar and combines them to create
a new cluster. The newly established cluster will expand into a larger tree structure
consisting of two child nodes. Each node within the tree represents a merged cluster.
In the subsequent iteration, the larger cluster will be regarded as a singular entity. By
iteratively merging clusters that exhibit similarity, the process will ultimately result in the
amalgamation of all data points into a singular binary tree. The algorithm will terminate at
that point.
It should be noted that the reduction of clusters is being performed incrementally,
one at a time. Hence, it is possible to obtain a variable number of clusters, ranging from 1
to the size of the dataset, by terminating at various levels of the tree. The following is the
pseudocode for the agglomerative approach:
# Agglomerative (Bottom-up) algorithm
Form initial clusters (i.e., turn each data point into a cluster).
Compute distance between each pair of clusters.
while number of clusters > 1:
Merge the two clusters with minimum distance.
Calculate the distance between the new cluster and all other clusters.

Divisive Approach
As previously stated, an alternative form of hierarchical clustering algorithm exists,
known as the divisive algorithm. The method employed is similar to the agglomerative
approach, as it aims to construct a binary tree, also known as a dendrogram. In contrast
to the bottom-up approach, the divisive algorithm initiates the construction of child nodes
iteratively, beginning from the root of the tree. The initial approach involves considering
the entire dataset as a single cluster, also referred to as the root node. During each
iteration, the algorithm selects a cluster to divide into two clusters that are the least
similar to each other. These newly created clusters are referred to as child nodes. The
process halts once the desired number of clusters is achieved. The following is the
pseudocode representation of the divisive algorithms:
# Divisive (Top-down) algorithm
Form the initial cluster (i.e., turn the whole dataset into one big cluster).
while number of clusters < number of data points:
choose a cluster and split it into 2 clusters
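In practice, the agglomerative approach sketched in the pseudocode above is available in base R through dist(), hclust() and cutree(); the data below are simulated for illustration.

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

d  <- dist(X, method = "euclidean")     # pairwise distances between data points
hc <- hclust(d, method = "average")     # repeatedly merge the two closest clusters

plot(hc, labels = FALSE, main = "Dendrogram")
clusters <- cutree(hc, k = 2)           # cut the tree to obtain 2 clusters
table(clusters)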

4.5.3 Visualisation with Heatmaps


What is Heatmap Visualization?
A heatmap is a data visualisation technique that employs a colour-coded matrix
or grid to represent data values. The visual representation offers a concise overview
of the data, with each cell in the matrix corresponding to a specific data point or a
combination of variables. The colour of each cell indicates the magnitude or intensity of
the corresponding data. This task can be achieved by utilising heatmap data visualisation
in conjunction with the Python programming language.
The primary objective of heatmap visualisation is to reveal patterns, trends and
relationships within the data by emphasising regions with high or low values. Heatmaps
facilitate rapid and intuitive data interpretation by associating colours with distinct data
ranges. This enables viewers to easily identify clusters, gradients, or anomalies within the
data.
Heatmaps are a commonly employed data visualisation technique utilised across
diverse fields and industries, encompassing scientific research, business analysis,
market research, web analytics and other domains. These techniques have the ability to
be applied to various types of data, including numerical data, categorical data and spatial
data. Heatmaps are particularly advantageous in situations involving extensive datasets
or the need to compare multiple variables concurrently.
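As a small illustration, base R’s heatmap() function draws a colour-coded grid and, by default, also clusters and reorders the rows and columns, which links heatmaps to the clustering methods above; the matrix below is a random placeholder.

set.seed(1)
mat <- matrix(rnorm(30 * 10), nrow = 30,
              dimnames = list(paste0("gene", 1:30), paste0("sample", 1:10)))

heatmap(mat,
        scale = "row",                                        # z-score each row before colouring
        col = colorRampPalette(c("blue", "white", "red"))(50))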

Types of Heatmaps in Data Visualization


1. The spatial heatmap
One exemplary instance of spatial data visualisation is the spatial heatmap,
alternatively referred to as a geographic heatmap. This visualisation technique is
employed to represent data in relation to their respective geographical locations. The
representation of data values on a map involves the utilisation of colour to indicate the
intensity or magnitude of the data in various regions or points.
This particular type of heatmap is highly advantageous for visualising spatial patterns,
including but not limited to population density, disease outbreaks, weather patterns and
various other geospatial phenomena. The map utilises colours to visually represent
different data attributes, including but not limited to population counts, temperature, or
any other quantitative or qualitative data linked to specific geographic locations.
2. The grid heatmap
It is a graphical representation that displays data values using a grid-like structure. It
is commonly used to visualise and analyse patterns, trends and relationships within

a dataset. The grid heatmap, alternatively known as an intensity heatmap or matrix
heatmap, is a widely employed form of heatmap utilised for the visualisation of tabular
data. The data values are presented in a grid or matrix format, wherein each cell
corresponds to a distinct combination of variables or categories. The colour of each
cell is determined based on the value it represents, facilitating the identification of
patterns and trends in a convenient manner.
Grid heatmaps are commonly employed for the analysis of multivariate data, the
comparison of multiple variables and the highlighting of relationships between various
categories. Colour-coded grids are extensively utilised in various fields, including
genetics, finance and social sciences, among others. These grids serve as a highly
effective means of summarising and interpreting large datasets.

Tools to Generate Heatmap Data Visualization


The following tools are commonly used for generating heat map visualisations

1. Google Charts
It is a data visualisation library that allows users to create interactive and customizable
charts for their web applications. It provides a wide range of chart types.
Google Charts is a data visualisation library that has been developed by Google. It is
both powerful and free to use. The software offers a diverse selection of interactive
charts and graphs that can be seamlessly integrated into web pages or applications.
The following section presents a comprehensive analysis of the advantages and
disadvantages associated with the utilisation of Google Charts:

Advantages of Google Charts


™ This platform prioritises ease of use, ensuring that it is accessible to users with
different levels of technical proficiency. The API of the software is extensively
documented and offers comprehensive examples and tutorials, facilitating a
straightforward initiation into the process of creating visualisations.
™ Google Charts provides a wide range of chart types, encompassing line charts,
bar charts, pie charts, scatter plots and various others. This feature enables
users to select the most appropriate chart type to meet their data visualisation
requirements.
™ Customization Options: The library offers a range of options to customise the
visual aspects of your charts. Users have the ability to customise various visual
elements such as colours, fonts, labels, tooltips and other components to align
with their preferred style or branding.

Disadvantages of Google Charts


™ Limited Customization: Although Google Charts provide a satisfactory
level of customization, certain users may perceive the available options as
comparatively restricted when compared to more advanced data visualisation
libraries. In the event that you possess particular or intricate customization
prerequisites, it may be necessary to investigate alternative tools or
frameworks.
™ Dependency on External Libraries: The utilisation of Google Charts
necessitates the inclusion of external JavaScript libraries. Consequently, your
application or web page must establish an internet connection in order to

retrieve these libraries. Limitations may arise if there are constraints on external
dependencies or a requirement for offline access.
™ Lack of Advanced Features: Google Charts may not offer certain advanced data visualisation capabilities found in more specialised libraries or frameworks. If there is a need for complex data manipulation, interactivity, or advanced statistical analysis, it may be necessary to explore alternative options.

2. Grafana
Grafana is an open-source data visualisation and monitoring tool. It assists users in the creation and presentation of interactive dashboards, graphs and charts, which are intended for the monitoring and analysis of data derived from multiple sources. This section provides an overview of Grafana, including its advantages and disadvantages.

Advantages of Grafana:
™ Broad data source support: Grafana offers extensive support for various data sources, encompassing well-known databases, cloud services, time series databases and monitoring systems. The inherent flexibility of this system enables users to establish connections and visually represent data from various sources within a single, cohesive dashboard.
™ Grafana offers a wide range of visualisation options, encompassing graphs,
charts, tables and gauges, among others. The software provides support for a
wide range of chart types and allows for extensive customization options. This
allows users to create visually appealing and informative dashboards.
™ Grafana facilitates the interactive and real-time monitoring and visualisation
of data. Users have the ability to establish alerts and notifications that are
triggered when specific thresholds are met. This feature allows for proactive
monitoring and prompt action in response to critical events or anomalies.

Disadvantages of Grafana:
™ Limited out-of-the-box data source support: some data sources are not supported natively and require additional plugins or extensions, which can mean extra setup compared with other visualisation tools.
™ Learning Curve: The utilisation of Grafana may present a learning curve,
particularly for individuals who are unfamiliar with the tool or possess limited
knowledge in data visualisation and monitoring principles. The successful setup
of intricate dashboards and the configuration of data sources may necessitate a
certain level of technical expertise.
™ Grafana exhibits resource-intensive behaviour, especially when handling
substantial datasets or frequent updates of high-frequency data. The
performance may be affected, particularly if the underlying infrastructure is not
properly provisioned.
™ Grafana’s analytics functionality is limited compared to specialised data analysis
tools, despite its powerful visualisation capabilities. In order to perform intricate
data analysis or statistical modelling tasks, it may be necessary to augment
Grafana with additional tools or libraries.


Despite the aforementioned limitations, Grafana continues to be a widely favoured option for data visualisation and monitoring. This can be attributed to its remarkable flexibility, extensive range of customization options and strong backing from a large community of users. The software is highly suitable for the creation of user-friendly and interactive dashboards that facilitate efficient data analysis and informed decision making.

3. FusionCharts
FusionCharts is a robust JavaScript-powered data visualisation library that provides an extensive selection of interactive charts, maps and gauges. The purpose of this tool is to assist developers in the creation of visually appealing and interactive data visualisations, specifically for web and mobile applications.

Advantages of FusionCharts:
™ Rich and interactive visualisations: FusionCharts offers a wide range of visually appealing and interactive charts, graphs and maps, encompassing line charts, bar charts, area charts, pie charts, maps and various others. This variety provides developers with the flexibility to select the most appropriate chart type based on their specific data visualisation needs.
™ FusionCharts provides developers with a wide range of customization options
to modify different elements of the charts. These options include the ability to
customise colours, fonts, labels, tooltips and animations. This feature empowers
developers to generate visually captivating and branded visualisations that
seamlessly match the design of their application.
™ FusionCharts offers an extensive range of features and functionalities,
encompassing drill-down capabilities, real-time updates, export options,
interactivity and responsive design. The aforementioned feature enhances the
tool’s versatility in generating dynamic and interactive data visualisations.

Disadvantages of FusionCharts
™ Licensing and Pricing: FusionCharts provides a complimentary version that
includes restricted features, but to access its complete functionality and
advanced features, a commercial licence is necessary. The pricing model may
not be conducive to accommodating all budgets, particularly for projects of
smaller scale or those associated with non-profit organisations.
™ Learning Curve: The process of acquiring proficiency in FusionCharts can be
challenging, particularly for developers who are unfamiliar with the library or
have limited knowledge of data visualisation principles. Gaining proficiency in
the API, configuring data sources and customising charts may necessitate an
initial investment of effort and technical expertise.
™ Performance Considerations: FusionCharts exhibits resource-intensive
behaviour, particularly when handling extensive datasets or intricate
visualisations. In order to achieve seamless rendering and interactivity, it is
imperative to engage in meticulous optimisation and thoughtful consideration
of performance factors. This is especially crucial in situations involving large
amounts of data or frequent updates.


4. Tableau
Tableau is a robust and extensively utilised software for data visualisation and business
intelligence. The software allows users to generate interactive and visually engaging
dashboards, reports and data visualisations using data from multiple sources.

Advantages of Tableau:
™ Intuitive and user-friendly interface: Tableau offers a user-friendly and intuitive
interface, enabling users to effortlessly manipulate data elements through
drag-and-drop actions, facilitating the creation of visualisations. The “Show Me” feature enables even users with limited technical expertise to easily identify appropriate chart types based on their data.
™ Tableau offers an extensive range of data connections, encompassing various
sources such as spreadsheets, databases, cloud services and big data
platforms. The software enables users to effortlessly establish connections and
integrate data from various sources, resulting in a comprehensive and unified
perspective of the data.
™ Tableau provides a range of interactive features that enable users to perform
exploratory analysis on data and obtain dynamic insights. The system allows
users to apply filters, perform drill-downs, apply sorting and engage with
visualisations in order to identify patterns, trends and outliers within the data.
This facilitates a decision-making process that is guided by data.

Disadvantages of Tableau:
™ High Cost of Licensing: Tableau licences are known to have a significant price
tag, particularly for entities or individuals operating within constrained financial
resources. The pricing structure could potentially pose a challenge for small-
scale or non-profit projects, as well as for those seeking advanced features and
enterprise-level deployments. Additional expenses may be required in these
cases.
™ Steep Learning Curve: Tableau exhibits a significant learning curve, especially for individuals with limited familiarity with data visualisation principles or limited exposure to analogous software applications. To fully
harness the advanced features and maximise the capabilities of Tableau, it may
be necessary to undergo specialised training or utilise dedicated educational
materials.
™ Performance Challenges with Large Datasets: The performance of Tableau
may be affected when handling large datasets or intricate visualisations.
Slower response times and the need for optimisation techniques may arise
when resource-intensive calculations or frequent data updates are performed,
necessitating the maintenance of acceptable performance levels.
Despite the aforementioned limitations, Tableau continues to be widely favoured for
data visualisation and business intelligence owing to its user-friendly interface, robust
functionality and capacity to generate captivating visual representations. The software
provides users with the ability to effectively explore and communicate data, facilitating
data-driven decision-making in diverse industries and sectors.

5. Data Wrapper
Data Wrapper is a web-based application designed to facilitate the creation of


dynamic and adaptable data visualisations. The software streamlines the procedure
of generating charts, maps and tables, enabling users to effectively showcase data in a concise and aesthetically pleasing format. This section provides an overview of Data Wrapper, including its advantages and disadvantages.

Advantages of Data Wrapper:


™ User-friendly interface: Data Wrapper offers a user-friendly interface that
makes it easy for users to create and customise visualisations without requiring
extensive technical knowledge or coding skills.
™ Streamlined workflow: The platform’s efficient, step-by-step process caters to users with varying levels of technical expertise and enables them to swiftly generate and personalise charts.
™ Data Wrapper has the capability to generate responsive charts that are able to
adapt to various screen sizes and devices. This ensures that the visualisations
remain accessible and user-friendly across different platforms. The objective
is to maintain legibility and visual appeal of the visualisations across different
platforms, such as desktops, tablets and mobile devices.
™ Data Integration and Import: Data Wrapper provides users with a range of options for importing data from spreadsheets, CSV files and other sources. This has the advantage of facilitating
the manipulation of pre-existing datasets and the seamless updating of visual
representations as new data is acquired.

Disadvantages of Data Wrapper:


™ Limited customization options: Data Wrapper has a limited range of
customization options, which may restrict users from fully tailoring their
visualisations to their specific needs or preferences.
™ Data Wrapper prioritises fundamental chart types, including line charts, bar
charts and scatter plots, while offering a limited selection of advanced features.
Although it fulfils basic charting requirements, it may lack the capability to
cater to advanced or specialised chart types that certain users may demand.
Furthermore, the utilisation of advanced functionalities such as statistical
analysis or intricate data transformations may encounter certain limitations.
™ Pricing and Data Storage: Although Data Wrapper provides a complimentary
plan, accessing additional features and expanding storage capacity may
necessitate a paid subscription. For users who possess extensive datasets or
have specific requirements, it is essential to take into account the pricing and
storage limitations that align with their particular needs.
™ Limited Customization Options for Presentation: Although Data Wrapper does offer the ability to customise the appearance of charts, it is important to note
that there are certain limitations when it comes to customising the overall
presentation. Users with specific design requirements or preferences may
encounter limitations when attempting to achieve their desired visual aesthetics.

6. Plotly
Plotly is a software library specifically designed for data visualisation purposes. It
offers a wide range of features and functionalities, including the creation of interactive
and visually appealing graphs and charts. The software platform provides support for

various programming languages, such as Python, R and JavaScript. This section provides an overview of Plotly, including its advantages and disadvantages.
Advantages of Plotly:
™ Interactive Visualisations: Plotly is a software tool that facilitates the creation
of interactive visualisations, empowering users to engage in exploration and
interact with data. The addition of zooming, panning, hover effects, tooltips and
other interactive features allows users to enhance their experience and improve
their understanding of the data.
™ Diverse Chart Types: Plotly provides users with a diverse array of chart types, including bar charts, line charts, scatter plots, pie charts, heatmaps and various others. This variety enables users to select the most appropriate chart type for their data and to accurately and efficiently depict intricate relationships or patterns, as in the brief sketch following this list.
™ Cross-Platform Compatibility: Plotly exhibits compatibility with various
programming languages, thereby ensuring accessibility to a diverse user base
and facilitating effortless integration into diverse data analysis workflows. The
software can be utilised in various environments including Jupyter Notebooks,
web applications and standalone setups.
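Since Plotly exposes a Python interface, a heatmap of the kind described above can be built in a few lines. The sketch below is only an assumed example: the z values and axis labels are invented for illustration, not taken from any dataset in this unit.

    # Minimal interactive heatmap with Plotly's Python API; data values are hypothetical.
    import plotly.graph_objects as go

    fig = go.Figure(data=go.Heatmap(
        z=[[1.0, 0.4, 0.7],      # each inner list is one row of the grid
           [0.4, 1.0, 0.2],
           [0.7, 0.2, 1.0]],
        x=["Feature A", "Feature B", "Feature C"],
        y=["Feature A", "Feature B", "Feature C"],
        colorscale="Viridis"))

    fig.update_layout(title="Interactive heatmap rendered with Plotly")
    fig.show()  # opens an interactive figure with hover tooltips, zooming and panning

Because the resulting figure is interactive, hovering over a cell reveals its exact value, which illustrates the zooming, panning and tooltip behaviour mentioned above.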

Disadvantages of Plotly
™ Limited customization options: Plotly offers a range of customization features, but these are more limited than those offered by some other visualisation libraries.
™ Learning Curve: The utilisation of Plotly may present a significant learning
curve, particularly for individuals who possess limited experience in
programming or are unfamiliar with data visualisation principles. Gaining
proficiency in the syntax, API and customization options may necessitate an
initial investment of time and experimentation.
™ Advanced Features Require Expertise: Plotly provides a comprehensive set of
features, including advanced functionalities like 3D visualisations and complex
statistical analysis. However, utilising these advanced features may necessitate
a higher level of programming skills or expertise.
™ Performance Limitations with Large Datasets: Plotly exhibits reduced
performance when handling large datasets, in contrast to specialised libraries
that are specifically designed for efficient visualisation of big data. Users who
are working with large datasets may encounter extended processing times or
encounter performance limitations.

How to Present Heat Map Visualization?


When delivering heat map visualisations, it is crucial to select the most suitable
format depending on the data and the intended message. There are several commonly
used methods for presenting heat map visualisations.
1. Choropleth maps employ colour shading or patterns to visually depict data values
across various regions or geographic areas. Geographic information systems (GIS)
are highly valuable tools for effectively visualising data pertaining to countries, states,
or any other clearly defined geographic boundaries.


2. A geographical heatmap is a visualisation tool that represents the intensity of data on


a map by utilising colour gradients or density indicators. Visual representations of data
density or concentration across a geographic area are provided.
3. Abstract positioning heat maps are a valuable tool utilised in scenarios where data
lacks inherent associations with specific geographical locations. Abstract positioning
or coordinate systems are employed to visually represent the intensity or distribution
of data.
Web heatmaps are a type of tool that is used to monitor and analyse user interactions
on a website. These interactions can include actions like clicks or mouse movements.
The purpose of using web heatmaps is to create visual representations of the areas
on a webpage that attract the most attention or engagement from users.
Risk heat maps are a visual representation that evaluates and presents the degree
of risk linked to various factors or scenarios. These tools are frequently used in the
context of risk management or decision-making procedures.
4. A clustered heatmap is a visualisation tool commonly employed to represent
hierarchical or grouped data. The data is presented in a tabular format, where
variables or categories are organised in rows and columns. The values within the
table are visually represented using colour coding. The process of clustering aids in
the identification of patterns or relationships present within a given dataset.
5. The bubble chart heatmap is a visualisation technique that integrates the principles of
scatter plots and heatmaps. Data points are visually represented using bubbles that
vary in size and colour. The size and colour of these bubbles correspond to different
variables or data values.
6. A matrix heatmap is a frequently employed visualisation technique for representing
relationships or correlations among multiple variables. Colour gradients or shades are
employed to visually depict the intensity or magnitude of the relationship.
7. A correlation heatmap is a visualisation tool that is designed to specifically depict the correlation between multiple variables. Colour coding or gradients are employed to visually depict the magnitude and direction of the correlation, as illustrated in the sketch below.
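As a minimal sketch of such a correlation heatmap in Python, assuming a small invented dataset with three numeric variables, the pandas and seaborn libraries can be combined as follows.

    # Correlation heatmap sketch; the column names and random values are assumed for illustration.
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical dataset with three numeric variables.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "height": rng.normal(170, 10, 100),
        "weight": rng.normal(70, 8, 100),
        "age": rng.normal(35, 12, 100),
    })

    # Compute the pairwise correlation matrix and colour-code its magnitude and direction.
    corr = df.corr()
    sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    plt.title("Correlation heatmap of the assumed variables")
    plt.show()

A clustered heatmap of the kind described earlier can be produced in much the same way with seaborn's clustermap function, which additionally reorders the rows and columns according to hierarchical clustering.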

Summary
Ɣ A dataset is said to be high dimensional if it contains more characteristics (p) than
observations (N), which is frequently expressed as p > N.
Ɣ Distance measures are essential tools used to quantify the dissimilarity or similarity
between objects in a dataset.
Ɣ Euclidean distance is considered the standard metric for geometric problems. It is simply defined as the ordinary straight-line distance between two points.
Ɣ The cosine similarity is a mathematical measure that quantifies the cosine of the
angle between two vectors.
Ɣ The Mahalanobis distance (MD) is a metric that quantifies the distance of a point from the centroid of a distribution, taking the covariance of the variables into account.
Ɣ Dimension reduction techniques aim to decrease the number of features or
dimensions in a dataset while preserving essential information.
Ɣ Feature extraction refers to the procedure of converting a high-dimensional space
into a lower-dimensional space.
