Module 4
Introduction
High-Dimensional Data
A dataset is said to be high dimensional if it contains more characteristics (p) than
observations (N), which is frequently expressed as p > N.
For instance, a dataset with p = 6 features and only N = 3 observations would be
seen as having high dimensions because there are more features than observations.
People frequently assume that “high dimensional data” merely refers to a dataset with a lot of features, but that is a common misconception. Even if a dataset includes 10,000 features, it is not high dimensional when it also contains 100,000 observations.
1. Categorization
2. Regression
One of the p variables in a regression setup is a quantitative response variable. To
predict it, the other variables are applied. Examples include the fluctuation of currency
rates today given recent exchange prices in a financial data source and an indicator of
chemical composition in a hyperspectral database. Regression modelling is supported by
a well-known and popular set of tools.
way to achieve this because there are numerous approaches that each serve a different goal. Latent semantic indexing is an apparent application area where it might look for a document arrangement that makes nearby documents similar and a term arrangement that makes nearby phrases similar.
K-Means Clustering
Kernel-based methods, including the support vector machine (SVM) algorithm, are another category of distance-based algorithms widely used in machine learning.
When calculating the distance between two instances with multiple data types in
their columns, it’s essential to consider using different distance measures for each data
type. For example, real values, boolean values, categorical values and ordinal values
may require distinct distance measures, which are then combined into a unified distance
score.
Numeric values often have varying scales, which can significantly impact distance
measures. To address this, it is recommended to normalise or standardise numeric
values before calculating distances.
In regression problems, numerical error can be treated as a distance measure.
The discrepancy between expected and predicted values can be quantified as a one-
dimensional distance. This error can be calculated for each example in a test set and a
total distance is obtained, representing the overall discrepancy between expected and
predicted outcomes in the dataset. This process is akin to conventional distance metrics
and is used to compute error measures like mean squared error or mean absolute error.
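As a minimal illustration of this idea, the short R sketch below computes the mean absolute error and mean squared error as aggregated one-dimensional distances; the expected and predicted vectors are made up purely for illustration.

# Hypothetical expected and predicted values for a small test set
expected  <- c(3.2, 4.1, 5.0, 6.3, 7.8)
predicted <- c(3.0, 4.5, 4.8, 6.0, 8.1)

errors <- expected - predicted          # one-dimensional distance for each example
mae <- mean(abs(errors))                # mean absolute error
mse <- mean(errors^2)                   # mean squared error
total_abs_distance <- sum(abs(errors))  # total "distance" over the whole test set

mae; mse; total_abs_distance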
Here, the total length of the red line gives the Manhattan distance between the two points.
Here, d1.d2 means the dot product of two vectors d1 and d2.
d1.d2 = 5*4 + 2*0 + 1*0 + 0*2 + 1*2 + 3*2 + 0*1 = 28
||d1|| = (5*5 + 2*2 + 1*1 + 0*0 + 1*1 + 3*3 + 0*0)**0.5 = 6.32
||d2|| = (4*4 + 0*0 + 0*0 + 2*2+ 2*2 + 2*2+1*1)**0.5 = 5.39
cos(d1, d2) = 28 / (6.32*5.39) = 0.82
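The same calculation can be reproduced in a few lines of base R; the vectors below are those implied by the arithmetic in the worked example above.

d1 <- c(5, 2, 1, 0, 1, 3, 0)
d2 <- c(4, 0, 0, 2, 2, 2, 1)

dot   <- sum(d1 * d2)                        # dot product: 28
norm1 <- sqrt(sum(d1^2))                     # ||d1|| is about 6.32
norm2 <- sqrt(sum(d2^2))                     # ||d2|| is about 5.39
cosine_similarity <- dot / (norm1 * norm2)   # about 0.82
cosine_similarity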
The utilisation of this technique is prevalent in various domains that involve the analysis of data with multiple dimensions, including but not limited to speech recognition, signal processing and bioinformatics. Additionally, it has the capability to be utilised for various purposes such as data visualisation, noise reduction, cluster analysis and more.
1. Feature selection
Feature selection is an important step in building accurate models. It involves the
selection of a subset of relevant features while excluding irrelevant ones from a dataset.
This process helps in improving the accuracy of the model. In essence, feature selection
is a methodology employed to identify and choose the most advantageous features from
a given input dataset.
There are three methods that are commonly employed for feature selection:
● Filter Methods
The dataset undergoes a filtering process to extract a subset that exclusively comprises the pertinent features. Filter methods employ several commonly used techniques, including correlation, the Chi-Square test, ANOVA and information gain.
● Wrapper Methods
The wrapper method shares a common objective with the filter method, but it
employs a machine learning model for its evaluation. This method involves inputting
certain features into the machine learning model and assessing its performance. The
decision to include or exclude these features is contingent upon their impact on the
model’s accuracy, as determined by its performance. The aforementioned method is
characterised by a higher level of accuracy compared to the filtering method, albeit with
increased complexity in its implementation. There are several commonly used techniques
for implementing wrapper methods.
The forward selection method is a technique used in statistical modelling to
select the most relevant variables for inclusion in a predictive model.
The technique known as backward selection is a method used in statistical
modelling and machine learning to select the most relevant features or
variables for a given model.
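As a rough sketch of how a filter and a wrapper approach differ in practice, the R snippet below uses the built-in mtcars data, which is only an illustration: a correlation filter ranks features against the response, while stats::step() performs backward stepwise selection (an AIC-based wrapper around a linear model). Real pipelines would typically add cross-validation.

data(mtcars)

# Filter method: rank features by absolute correlation with the response (mpg)
cors <- abs(cor(mtcars[, -1], mtcars$mpg))
sort(cors[, 1], decreasing = TRUE)

# Wrapper-style method: backward stepwise selection around a linear model
full_model   <- lm(mpg ~ ., data = mtcars)
backward_fit <- step(full_model, direction = "backward", trace = 0)
summary(backward_fit)   # features retained after backward elimination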
2. Feature extraction
Feature extraction refers to the procedure of converting a high-dimensional space
into a lower-dimensional space. This approach is beneficial for optimising resource usage
during information processing while retaining all necessary information.
There are several commonly used techniques for feature extraction, including:
Principal Component Analysis (PCA) is a statistical technique used to reduce
the dimensionality of a dataset while retaining as much information as possible.
Linear Discriminant Analysis (LDA) is a statistical technique used for
dimensionality reduction and classification tasks.
Kernel Principal Component Analysis (Kernel PCA) is a dimensionality
reduction technique that is commonly used in machine learning and data
analysis. It is an extension of Principal Component Analysis (PCA) that allows
for non-linear transformations of the input data.
Quadratic Discriminant Analysis
A * A^T = | 2 0  1 |   | 2 1  0 |   |  5  2 -1 |
          | 1 1  0 | * | 0 1  2 | = |  2  2  2 |
          | 0 2 -1 |   | 1 0 -1 |   | -1  2  5 |
Fig: Components of SVD
Image Source: https://bookdown.org/rdpeng/exdata/dimension-reduction.html
The mean shift in both the rows and columns of the matrix can be observed by
examining the first left and right singular vectors.
If it is assumed that the first left and right singular vectors, denoted as u1 and v1, effectively represent all the variability in the data, it is possible to approximate the original data matrix using these vectors.
X ≈ u1 v1' (the outer product of the first left and right singular vectors)
The original matrix contains 400 numbers, while the compressed matrix contains 50
numbers. This represents a reduction of nearly 90% in information. The following is a
representation of the original data and its corresponding approximation.
> ## Approximate original data with outer product of first singular vectors
> approx <- with(svd1, outer(u[, 1], v[, 1]))
>
> ## Plot original data and approximated data
> par(mfrow = c(1, 2))
> image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1], main = "Original Matrix")
> image(t(approx)[, nrow(approx):1], main = "Approximated Matrix")
It is evident that the two matrices are not identical; however, the approximation
appears to be reasonable in this particular scenario. This outcome is to be expected,
considering that the original data only contained a single significant feature.
Example:
In this example, we’ll perform PCA on a small dataset with two features and reduce it
to one principal component.
Dataset:
Suppose we have a dataset with two features, “X1” and “X2,” representing data
points in a two-dimensional space:
X1 = [1, 2, 3, 4, 5]
X2 = [2, 3, 3, 4, 5]
We want to reduce the dimensionality of this dataset using PCA while preserving
most of the variance.
Steps of PCA:
1. Mean Calculation:
Calculate the mean of both X1 and X2:
Mean(X1) = (1 + 2 + 3 + 4 + 5) / 5 = 3
Mean(X2) = (2 + 3 + 3 + 4 + 5) / 5 = 3.4
2. Centering the Dataset:
Subtract the mean from each data point:
X1_centered = [1 - 3, 2 - 3, 3 - 3, 4 - 3, 5 - 3] = [-2, -1, 0, 1, 2]
X2_centered = [2 - 3.4, 3 - 3.4, 3 - 3.4, 4 - 3.4, 5 - 3.4] = [-1.4, -0.4, -0.4, 0.6, 1.6]
3. Covariance Matrix:
Calculate the covariance matrix of the centred dataset:
Covariance(X1, X2) = Σ (X1_centered,i × X2_centered,i) / (n - 1)
Covariance(X1, X2) = ((-2 * -1.4) + (-1 * -0.4) + (0 * -0.4) + (1 * 0.6) + (2 * 1.6)) / (5 - 1)
Covariance(X1, X2) = 7 / 4 = 1.75
Similarly, Variance(X1) = 10 / 4 = 2.5 and Variance(X2) = 5.2 / 4 = 1.3, so the covariance matrix of the centred data is
C = | 2.50  1.75 |
    | 1.75  1.30 |
4. Eigenvalues and Eigenvectors:
Compute the eigenvalues (λ) and eigenvectors (v) of the covariance matrix:
Eigenvalues: λ1 ≈ 3.75, λ2 ≈ 0.05; Eigenvector for λ1: v1 ≈ [0.81, 0.58]
5. Explained Variance Ratio:
The explained variance ratio for the principal component:
Explained Variance Ratio ≈ λ1 / (λ1 + λ2) ≈ 3.75 / 3.80 ≈ 0.99
The first principal component therefore explains about 99% of the variance in the data.
6. PCA Transformation:
Transform the original data into the new coordinate system defined by the principal component:
Transformed Dataset:
● PC1 ≈ [-2.44, -1.05, -0.23, 1.16, 2.56] (each centred data point projected onto v1)
● These are the coordinates of the data points in the PC1 direction.
PCA has reduced the dimensionality of the dataset from two features (X1 and X2) to one principal component (PC1) while preserving nearly all of the variance in the data, as indicated by the explained variance ratio of about 0.99. This simplifies the dataset while retaining
essential information.
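The same small example can be checked with base R's prcomp(); assuming the X1 and X2 vectors defined above, the first component should account for roughly 99% of the variance.

X1 <- c(1, 2, 3, 4, 5)
X2 <- c(2, 3, 3, 4, 5)
dat <- data.frame(X1, X2)

pca <- prcomp(dat, center = TRUE, scale. = FALSE)  # PCA on the covariance matrix
summary(pca)      # proportion of variance explained by each component
pca$rotation      # loadings (eigenvectors); signs may be flipped
pca$x[, 1]        # scores of the data points on the first principal component (PC1)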
2. Image Compression and Computer Vision: In computer vision tasks, such as image compression, PCA is employed to reduce the storage and processing requirements of images.
It extracts the most relevant features of an image, enabling efficient representation and reconstruction.
3. Uncovering Hidden Patterns:
PCA can be utilised to discover hidden patterns or relationships within complex
datasets.
By transforming the original data into a new set of uncorrelated variables (principal
components), PCA reveals the underlying structures.
4. Applications in Finance and Data Mining:
In finance, PCA is employed for risk management and portfolio optimization.
It helps to identify the key factors that drive financial data and improve decision-making processes.
In data mining, PCA aids in feature selection, clustering and anomaly detection.
5. Applications in Psychology and Social Sciences:
In psychology and social sciences, PCA is applied to analyse and interpret behavioural
data.
It helps in identifying latent constructs or factors that influence human behaviour.
6. Signal Processing and Audio Analysis:
In signal processing and audio analysis, PCA is used to reduce noise and extract
meaningful information from signals or audio recordings.
7. Machine Learning Model Initialization:
PCA can be used as a pre-processing step to initialise machine learning models with
a reduced set of informative features.
It can enhance the performance and convergence speed of learning algorithms.
8. Quality Control and Fault Detection:
In industrial applications, PCA aids in quality control and fault detection by identifying
patterns of normal behaviour and deviations from it.
Output
Fig: A visual representation of the iris dataset using multi-dimensional scaling (MDS) analysis, showcasing the relationship between the species of iris flowers based on their physical characteristics.
The aforementioned code will use the custom distance matrix to produce a scatter
plot of the MDS findings. In the USArrests data collection, each point corresponds to a
different state. The graph has the state labels plotted on it. The MDS analysis divided
the observations of the states into various clusters based on the correlation between the
variables, as can be seen by visualising the data in this fashion.
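A minimal sketch of such an analysis, assuming the built-in USArrests data and a Euclidean distance on the standardised variables (the custom distance matrix referred to above is not shown here), could look like this:

data(USArrests)

d   <- dist(scale(USArrests))          # distance matrix on standardised variables
mds <- cmdscale(d, k = 2)              # classical multidimensional scaling to 2 dimensions

plot(mds[, 1], mds[, 2], type = "n",
     xlab = "Dimension 1", ylab = "Dimension 2",
     main = "MDS of USArrests")
text(mds[, 1], mds[, 2], labels = rownames(USArrests), cex = 0.7)  # label each state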
Objectives
The following guidelines can be used to break down factor analysis’ main goals:
● Figuring out how many elements are needed to explain common themes in a certain set of variables.
● Figuring out how closely each dataset’s variable is related to a particular component or theme.
● Analysing a dataset’s common factors.
● Determining how well each observed data point represents a certain theme or aspect.
Comparisons include:
Analyse a measure’s internal consistency.
Analyse the factors that item sets represent; these methods assume there is no correlation between the variables.
Look into the class/grade of each item.
There are, however, certain general distinctions, most of which are focused on the
application of components. In essence, EFA is a data-driven approach that permits
all items to load on all factors, whereas CFA necessitates that you designate which
factors must load. If anyone has no notion of what potential common causes might
exist, EFA is a really good option, because CFA cannot produce as many alternative models for the data as EFA can. CFA is the preferable strategy if the researcher already has some understanding of how the models actually look and wishes to test data-structure hypotheses afterwards.
Confirmatory factor analysis (CFA)
Confirmatory Factor Analysis (CFA) is a statistical method used to test the hypothesis that items are intricately related to specific underlying factors. It employs a well-defined equation model to assess the measurement model. By examining loadings on the factors, CFA allows the evaluation of correlations between observed variables and unobserved variables.
Compared to least-squares estimates, structural equation modelling, which includes
CFA, offers more flexibility and tolerance for measurement inaccuracy. It reveals
the loadings of observed variables on latent variables (factors) and the correlations
between these latent variables. Hypothesised models are rigorously evaluated against
actual data.
CFA is a valuable tool for analysts and researchers to investigate the relationship
between observable variables (manifest variables) and the underlying constructs. It
shares similarities with exploratory factor analysis, but CFA is more focused on testing specific hypotheses about the underlying factor structure.
Two steps are included in the Multiple Factor Analysis and they are as follows:
First, each and every section of the data will be subject to the Principal
Component Analysis. A relevant eigenvalue can also be obtained from this,
which is then used to normalise the data sets for subsequent use.
After merging the newly created data sets into a unique matrix, a global PCA
will be carried out.
4. Generalised Procrustes Analysis (GPA):
The utilisation of this method facilitated the expansion of the GP Analysis, enabling the
comparison of more than two shapes across various dimensions. Procrustes analysis
is a recommended method for comparing two approximate sets of configurations and
shapes. It was initially designed to be comparable to the two solutions derived from
Factor Analysis. In order to achieve the intended shape, it is necessary to align the
shapes accurately. The Generalised Procrustes Analysis (GPA) primarily employs
geometric transformations.
(Image Source: https://stats.stackexchange.com/questions/50745/best-factor-extraction-methods-in-factor-analysis)
columns. The points on the loading plot are widely spaced along a factor axis and tend to polarise into near-zero and far-from-zero loadings. This characteristic appears to partially satisfy a number of Thurstone’s simple-structure points. However, Varimax is not immune to creating points that are far from the axes, i.e. “complex” variables that are loaded heavily by multiple factors. Depending on the study’s field, this may be good or bad. Varimax works best in tandem with the so-called Kaiser’s normalisation, which temporarily equalises communalities while rotating, so it is advised to always use that normalisation with varimax (and it is recommended with any other rotation approach, too). Particularly in psychometry and the social sciences, varimax is the most widely used orthogonal rotation technique.
3. Equamax (rarely, Equimax) orthogonal rotation
One way to sharpen some varimax features is through equamax (or, less frequently, equimax) orthogonal rotation. It was invented in an effort to make varimax even better.
Saunders (1962) added a particular weighting known as equalisation to the algorithm’s
working formula. Equamax automatically adjusts for how many rotating factors are
present. It is less likely to produce “general” factors since it tends to disperse heavily
loaded variables throughout factors more evenly than varimax does. On the other
hand, equamax was not developed to abandon the quartimax’s intention to simplify
rows; rather, equamax is a combination of quartimax and varimax rather than their
intermediary form. However, equamax is asserted to be significantly less “reliable”
or “stable” than varimax or quartimax: for certain data, it can produce disastrously
poor solutions, while for other data, it produces factors with a basic structure that are
completely comprehensible. Another approach, known as parsimax (or “maximising
parsimony”), is comparable to equamax but considerably more ambitious in its pursuit
of basic structure.
% Var
To calculate how much variance the factors explain, use the percentage of variance (% Var). Keep the factors that account for a reasonable amount of the variability. The appropriate level depends on your application. You might only need to explain 80% of the variance for descriptive purposes. However, you might wish to have at least 90% of the variance explained by the factors if you want to run additional studies on the data.
Variance (Eigenvalues)
The variance is equal to the eigenvalue if you extract factors using principal
components. The size of the eigenvalue can be used to calculate the number of factors.
The factors with the highest eigenvalues should be kept. For instance, when applying the
Kaiser criterion, you only take into account the factors with eigenvalues greater than 1.
Scree plot
The eigenvalues are arranged in the scree plot from largest to smallest. The ideal pattern is a sharp curve, followed by a bend and then a straight line. Use the factors in the steep curve before the first point on the straight line.
Image Source: https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistical-modeling/multivariate/how-to/factor-analysis/interpret-the-results/key-results/
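As an illustration of the Kaiser criterion and a scree plot, the following sketch performs a correlation-based principal component extraction on the built-in mtcars data; the data set is only an example, not the one behind the figure above.

data(mtcars)

pc <- princomp(mtcars, cor = TRUE)     # principal component extraction on correlations
ev <- pc$sdev^2                        # eigenvalues (variances of the components)
ev
ev > 1                                 # Kaiser criterion: keep components with eigenvalue > 1
ev / sum(ev)                           # proportion of variance (% Var) for each component
screeplot(pc, type = "lines", main = "Scree plot")  # look for the bend ("elbow")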
Varimax Rotation
Variable Factor1 Factor2 Factor3 Factor4 Communality
Academic record 0.481 0.51 0.086 0.188 0.534
Appearance 0.14 0.73 0.319 0.175 0.685
Communication 0.203 0.28 0.802 0.181 0.795
Company Fit 0.778 0.165 0.445 0.189 0.866
Experience 0.472 0.395 -0.112 0.401 0.553
Job Fit 0.844 0.209 0.305 0.215 0.895
Letter 0.219 0.052 0.217 0.947 0.994
Likeability 0.261 0.615 0.321 0.208 0.593
Organisation 0.217 0.285 0.889 0.086 0.926
Potential 0.645 0.492 0.121 0.202 0.714
Resume 0.214 0.365 0.113 0.789 0.814
Self-Confidence 0.239 0.743 0.249 0.092 0.679
Variance 2.5153 2.488 2.0863 1.9594 9.0491
% Var 0.21 0.207 0.174 0.163 0.754
Fig: The loading plot visually shows the loading results for the first two factors.
Image Source: https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistical-modeling/multivariate/how-to/factor-analysis/interpret-the-results/key-results/
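A comparable analysis can be sketched in base R with factanal(), which fits a maximum-likelihood factor model and applies a varimax rotation by default. The mtcars data and the choice of three factors here are purely illustrative; this is not the applicant-rating data shown in the table above.

data(mtcars)

fa <- factanal(mtcars, factors = 3, rotation = "varimax")
fa$loadings            # rotated loadings; SS loadings and proportion of variance are printed below them
1 - fa$uniquenesses    # communalities for each variable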
methods. Contrasting colours are highly beneficial in various applications. If the labelling process is spread out over a period of three days, it is possible to assign different colours to the samples labelled on each day. Specifically, samples labelled on day 1 can be coloured red, samples labelled on day 2 can be coloured blue and samples labelled on day 3 can be coloured green. This colour-coding system facilitates the identification of batch effects from plots, as demonstrated in the examples provided below.
source("https://bioconductor.org/biocLite.R")
biocLite(c("BatchQC"))
library(BatchQC)
condition = metadf$CancerType
batchQC(dat=exprdata, batch=batch, condition=condition,
        report_file="batchqc_report.html", report_dir=".",
        report_option_binary="111111111",
        view_report=TRUE, interactive=TRUE, batchqc_output=TRUE)
Upon completion of the computational calculations, the programme will initiate the
launch of an interactive URL. This URL will provide users with visual representations,
in the form of plots, pertaining to their respective data. It is possible to apply colour to
the majority of these plots based on either condition or batch, thereby facilitating the
identification of any potential batch effect.
Key Points:
● Differential manifestation of batch effects is observed across various plots. Various
types of statistics are presented to fulfil specific purposes. The Circular Dendrogram
should be examined initially as it is the most straightforward plot to comprehend.
● The runtime of the programme and the ease of interaction will be negatively
impacted by the size of the datasets. This is due to the program’s need to
regenerate plots based on the user’s selections. In the event that your computer
experiences a decrease in performance, it is recommended to downsample your
gene set, thereby reducing the amount of data you are working with to a range of
20-50%.
● To initiate batch correction, navigate to the interactive interface and access the
ComBat or SVA tabs. From there, locate and select the “Run” option. The feature
enables users to review their plots and choose a “post-correction” version. To obtain
the results table, it is necessary to execute ComBat or SVA on the complete dataset.
Z-Score Standardisation:
Z-Score Standardisation is a statistical technique used to transform a given dataset
into a standard normal distribution.
● Z-score standardisation, also referred to as standardisation or z-transformation, is a statistical technique that adjusts the data to possess a mean value of 0 and a standard deviation of 1.
● The formula used for Z-score standardisation is as follows:
z = (x - mean(x)) / std(x)
● This approach is suitable in cases where the data follows a normal distribution and when the mean and standard deviation provide meaningful measures for comparison.
Decimal Scaling:
Decimal scaling is a technique used in data normalisation, specifically in the field of
machine learning and data mining.
● The decimal scaling technique is a straightforward method of scaling data by dividing it by a power of 10.
● The scaling factor is determined by calculating the maximum absolute value of the data.
● The utilisation of this method ensures that the scaled data possesses a maximum absolute value that is below 1, thereby facilitating its manipulation in certain scenarios.
Robust Scaling:
Robust scaling is a normalisation technique that scales data using statistics that are resistant to outliers, so that the transformation remains stable when the input contains extreme values.
● The technique of robust scaling, alternatively referred to as percentile scaling or quantile scaling, involves the scaling of data using percentiles.
● The median and interquartile range (IQR) are employed in order to standardise the data, enhancing its resilience against outliers.
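The three scaling approaches described above can be sketched in a few lines of R; the small vector x below is made up purely for illustration.

x <- c(12, 45, 7, 230, 56, 89, 3)

# Z-score standardisation: mean 0, standard deviation 1
z <- (x - mean(x)) / sd(x)

# Decimal scaling: divide by a power of 10 so that all absolute values fall below 1
j   <- ceiling(log10(max(abs(x))))
dec <- x / 10^j

# Robust scaling: centre on the median and scale by the interquartile range
robust <- (x - median(x)) / IQR(x)

z; dec; robust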
The surrogate variable analysis (SVA) is a widely used technique for batch adjustment. It aims to detect and correct for latent or unmeasured variables, known as surrogate variables, that are responsible for batch effects.
The extraction of surrogate variables, which capture the hidden sources of
variation in the data, is accomplished through the utilisation of singular value
decomposition (SVD).
The confounding batch effects are effectively eliminated and the true biological
variation is preserved by employing SVA and making adjustments for surrogate
variables.
4. Linear mixed models (LMM) are a statistical modelling technique used to analyse data that has both fixed and random effects. LMMs are an extension of linear regression models and are particularly useful when dealing with correlated or clustered data.
Linear mixed models (LMMs) are a flexible statistical methodology that can be
employed for batch adjustment across diverse data types.
The LMM (Linear Mixed Model) is capable of effectively managing intricate
study designs, including repeated measures or hierarchical structures. This
feature makes it well-suited for a wide range of high-dimensional datasets.
By incorporating random effects to account for batch variability, Linear Mixed
Models (LMM) can effectively mitigate the impact of batch effects and enhance
the accuracy of data analysis.
Batch adjustment techniques play a critical role in ensuring the reliability and
reproducibility of high-dimensional data analysis. These tools assist researchers in
obtaining more precise biological insights by mitigating technical biases and enabling the
emergence of genuine signals from the data.
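As a rough sketch of how the ComBat and SVA adjustments described above are typically invoked with the Bioconductor sva package; the expression matrix exprdata, the batch vector and the condition labels are assumed to already exist, as in the BatchQC example earlier.

library(sva)

# ComBat: remove known batch effects while protecting the biological condition
mod <- model.matrix(~ as.factor(condition))
combat_adjusted <- ComBat(dat = exprdata, batch = batch, mod = mod)

# SVA: estimate surrogate variables for unknown sources of variation
mod0   <- model.matrix(~ 1, data = data.frame(condition))
sv_fit <- sva(exprdata, mod, mod0)
sv_fit$n.sv        # number of surrogate variables detected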
Clustering:
● Clustering is a method employed to group data points that exhibit similarities or proximity to one another within a multidimensional space.
● The method is classified as unsupervised learning, indicating that it does not necessitate labelled data during the training process.
● The objective of clustering is to divide the data into clusters, where the data points within each cluster exhibit greater similarity to one another compared to data points in different clusters.
● The commonly used clustering algorithms comprise K-means, hierarchical clustering and density-based clustering (DBSCAN).
● Clustering is a commonly employed technique in diverse domains, including biology, customer segmentation, image segmentation and anomaly detection.
Heatmaps:
● Heatmaps are visual representations that utilise cells with colour coding to effectively display the values of a dataset in two dimensions.
● Heatmaps are a valuable tool for the visualisation of extensive matrices or data tables. They enhance the identification of patterns and trends, thereby facilitating data analysis.
● A heatmap is a graphical representation where each row and column corresponds to a specific data point or category. The colour intensity within each cell of the heatmap is used to visually depict the magnitude of the corresponding data point’s value.
● Heatmaps are a frequently employed visualisation tool for the representation of gene expression data, correlation matrices and spatial data.
● These tools have gained significant popularity, particularly in the fields of genomics, bioinformatics and data exploration.
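A minimal base R example of a heatmap, using the built-in mtcars data (chosen only for illustration) with column-wise scaling so that variables on different scales are comparable:

data(mtcars)

heatmap(as.matrix(mtcars),
        scale   = "column",            # standardise each column before colouring
        main    = "Heatmap of mtcars",
        margins = c(6, 6))             # extra space for row and column labels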
In unsupervised learning, the available data does not include class labels such as Cars, Bikes, etc. Additionally, the data is not structured and consists of a combination of information from various vehicles.
The objective at hand is to transform the unlabelled data into labelled data, a
process that can be achieved through the utilisation of clusters.
The primary concept behind cluster analysis is to organise data points by grouping
them into clusters, such as a “cars” cluster that includes all car data points, a “bikes”
cluster that includes all bike data points and so on.
Cluster analysis is a technique used to partition unlabelled data into groups of similar
objects.
Characteristics of Clustering
1. Scalability of Clustering: In the present era, the volume of data has significantly increased, necessitating the management of large-scale databases. For efficient management of large databases, it is essential to ensure that the clustering algorithm possesses scalability. The scalability of data is crucial for obtaining accurate results. If data is not scalable, it can lead to incorrect outcomes.
2. High Dimensionality: The algorithm must possess the capability to effectively process
data in high-dimensional spaces, even when the dataset is relatively small in size.
3. Algorithm Usability with multiple data kinds: An algorithm is a step-by-step procedure or set of rules for solving a specific problem or completing a specific task. Algorithms for clustering can accommodate various types of data. The system should possess
the capability to handle various types of data, including discrete, categorical, interval-
based and binary data.
4. Managing unstructured data: It is common to encounter databases that contain missing
values, as well as noisy or erroneous data. Poor quality clusters may arise if the
algorithms are sensitive to such data. The system should possess the capability to
process unstructured data and subsequently organise it into clusters of similar data
objects, thereby imparting structure to the data. This facilitates the task of the data
expert by streamlining data processing and enabling the identification of novel patterns.
5. Interpretability: It is essential for the clustering results to possess interpretability,
comprehensibility and usability. The concept of interpretability pertains to the level of
ease with which data can be comprehended.
Clustering Methods:
The clustering methods can be classified into the following categories:
● Partitioning Method
● Hierarchical Method
● Density-based Method
● Grid-Based Method
● Model-Based Method
● Constraint-based Method
● Marketers can effectively identify distinct groups within their customer base by leveraging purchasing patterns. This enables them to accurately characterise their customer groups.
● The application of this technology extends to the domain of biology, where it facilitates the derivation of animal and plant taxonomies, as well as the identification of genes possessing similar functionalities.
● Additionally, it aids in the process of information discovery by effectively categorising documents found on the internet.
● Soft clustering is a technique that enables data entities to explore and identify other similar data entities that have a high probability of belonging to the same cluster. This clustering method has the capability to assign a single data entity to multiple clusters, depending on its similarity to other data entities.
The indices “n” and “k” in this context represent the data index and cluster index, respectively. Additionally, μk refers to the centroid of the kth cluster. It should be noted that the value of rnk is equal to 1 if the data point xn belongs to cluster k and it is 0 otherwise.
The objective is to identify the centroids that minimise the objective function J. This can be achieved by solving the equation ∂J/∂μk = 0. The resulting solution is as follows:
μk = ( Σn rnk xn ) / ( Σn rnk )
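In practice the iterative assignment and centroid-update steps are handled by library routines; a brief base R sketch with kmeans() on made-up two-dimensional data:

set.seed(42)
# Two artificial clusters of 2-D points
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

fit <- kmeans(pts, centers = 2, nstart = 10)  # k = 2, with 10 random restarts
fit$centers                   # estimated centroids (the mu_k)
fit$cluster                   # cluster index assigned to each point (defines the r_nk indicators)
fit$tot.withinss              # value of the minimised objective J
plot(pts, col = fit$cluster, pch = 19)
points(fit$centers, col = 1:2, pch = 8, cex = 2)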
Advantages of K-Means Clustering:
1. Simplicity and ease of implementation:
Its straightforward nature allows for easy comprehension and application.
2. Fast and efficient:
The algorithm is designed to deliver high speed and optimal performance. The K-means algorithm is known for its computational efficiency and ability to effectively process large datasets that have high dimensionality.
3. Scalability:
The K-means algorithm is capable of efficiently processing extensive datasets
containing a substantial number of data points. Moreover, it can be readily adapted to
accommodate even more substantial datasets without compromising its performance.
4. Flexibility:
The K-means algorithm can be readily customised to suit various applications,
accommodating different distance metrics and initialization methods.
the sub-clusters within it serving as partitions of the main cluster. This assumption can be extended to the sub-clusters within the system. The minimum unit will consist of a cluster containing a single data point. The algorithm’s assumption enables us to interpret the
cluster structure that is produced as a tree. In this tree, the leaves represent individual
data points, which are clusters consisting of only one object. The inner nodes, on the
other hand, represent collections of data points that are contained within their respective
subtrees. The process of the hierarchical clustering algorithm can be conceptualised as
the construction of a tree.
There exist two distinct methodologies for accomplishing this objective:
The tree is constructed using a bottom-up approach, which is commonly
referred to as the agglomerative method.
The tree is constructed in a top-down manner, which is commonly referred to as
the divisive approach.
Agglomerative Approach
The initial approach is characterised by its simplicity. The algorithm begins by
considering each data point as an individual cluster. This analogy pertains to the
process of constructing the leaves of a tree. At this stage, every individual data point is
considered as a distinct cluster. The provided partition for the dataset is deemed valid.
However, our objective is to surpass this level of performance. The algorithm initiates
the iterative process of merging clusters that exhibit similarity. During each iteration, the
algorithm identifies the two clusters that are most similar and combines them to create
a new cluster. The newly established cluster will expand into a larger tree structure
consisting of two child nodes. Each node within the tree represents a merged cluster.
In the subsequent iteration, the larger cluster will be regarded as a singular entity. By
iteratively merging clusters that exhibit similarity, the process will ultimately result in the
amalgamation of all data points into a singular binary tree. The algorithm will terminate at
that point.
It should be noted that the reduction of clusters is being performed incrementally,
one at a time. Hence, it is possible to obtain a variable number of clusters, ranging from 1
to the size of the dataset, by terminating at various levels of the tree. The following is the
pseudocode for the agglomerative approach:
# Agglomerative (Bottom-up) algorithm
Form initial clusters (i.e., turn each data point into a cluster).
Compute distance between each pair of clusters.
while number of clusters > 1:
Merge the two clusters with minimum distance.
Calculate the distance between the new cluster and all other clusters.
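The agglomerative procedure above corresponds closely to base R's hclust(); a short sketch on the built-in USArrests data, cutting the resulting dendrogram into four clusters:

data(USArrests)

d  <- dist(scale(USArrests))            # pairwise distances between states
hc <- hclust(d, method = "complete")    # agglomerative (bottom-up) clustering

plot(hc, cex = 0.6, main = "Agglomerative clustering of USArrests")  # dendrogram
groups <- cutree(hc, k = 4)             # cut the tree into 4 clusters
table(groups)                           # number of states in each cluster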
Divisive Approach
As previously stated, an alternative form of hierarchical clustering algorithm exists,
known as the divisive algorithm. The method employed is similar to the agglomerative
approach, as it aims to construct a binary tree, also known as a dendrogram. In contrast
to the bottom-up approach, the divisive algorithm initiates the construction of child nodes
iteratively, beginning from the root of the tree. The initial approach involves considering
the entire dataset as a single cluster, also referred to as the root node. During each
iteration, the algorithm selects a cluster to divide into two clusters that are the least
similar to each other. These newly created clusters are referred to as child nodes. The
process halts once the desired number of clusters is achieved. The following is the
pseudocode representation of the divisive algorithms:
# Divisive (Top-down) algorithm
Form the initial cluster (i.e., turn the whole dataset into one big cluster).
while number of clusters < number of data points:
choose a cluster and split it into 2 clusters
1. Google Charts
Google Charts is a data visualisation library developed by Google that allows users to create interactive and customisable charts for web applications. It is both powerful and free to use. The software offers a diverse selection of interactive charts and graphs that can be seamlessly integrated into web pages or applications.
The following section presents a comprehensive analysis of the advantages and
disadvantages associated with the utilisation of Google Charts:
retrieve these libraries. Limitations may arise if there are constraints on external dependencies or a requirement for offline access.
The absence of advanced features is a limitation of Google Charts, as
it may not offer certain advanced data visualisation capabilities found in
more specialised libraries or frameworks. If there is a need for complex data
manipulation, interactivity, or advanced statistical analysis, it may be necessary
to explore alternative options.
2. Grafana
Grafana is a software application that falls under the category of open-source data
visualisation and monitoring tools. The purpose of this software is to assist users in
the creation and presentation of interactive dashboards, graphs and charts. These
visualisations are intended for the monitoring and analysis of data derived from multiple
sources. This document provides an overview of Grafana, including its advantages
and disadvantages.
Advantages of Grafana:
Extensive data source support: Grafana offers extensive support for various data sources, encompassing well-known databases, cloud services, time
series databases and monitoring systems. The inherent flexibility of this system
enables users to establish connections and visually represent data from various
sources within a single, cohesive dashboard.
Grafana offers a wide range of visualisation options, encompassing graphs,
charts, tables and gauges, among others. The software provides support for a
wide range of chart types and allows for extensive customization options. This
allows users to create visually appealing and informative dashboards.
Grafana facilitates the interactive and real-time monitoring and visualisation
of data. Users have the ability to establish alerts and notifications that are
triggered when specific thresholds are met. This feature allows for proactive
monitoring and prompt action in response to critical events or anomalies.
Disadvantages of Grafana.
Limited data source support: Grafana has a limited number of supported data
sources compared to other visualisation tools.
Learning Curve: The utilisation of Grafana may present a learning curve,
particularly for individuals who are unfamiliar with the tool or possess limited
knowledge in data visualisation and monitoring principles. The successful setup
of intricate dashboards and the configuration of data sources may necessitate a
certain level of technical expertise.
Grafana exhibits resource-intensive behaviour, especially when handling
substantial datasets or frequent updates of high-frequency data. The
performance may be affected, particularly if the underlying infrastructure is not
properly provisioned.
Grafana’s analytics functionality is limited compared to specialised data analysis
tools, despite its powerful visualisation capabilities. In order to perform intricate
data analysis or statistical modelling tasks, it may be necessary to augment
Grafana with additional tools or libraries.
3. FusionCharts
FusionCharts is a robust JavaScript-powered data visualisation library that provides
an extensive selection of interactive charts, maps and gauges. The purpose of this
tool is to assist developers in the creation of visually appealing and interactive data
visualisations, specifically for web and mobile applications.
Advantages of FusionCharts:
Rich and interactive visualisations: FusionCharts offers a wide range of visually
appealing and interactive charts, graphs and maps. These FusionCharts offer
a wide range of chart types, encompassing line charts, bar charts, area charts,
pie charts, maps and various others. This particular variety provides developers
with the flexibility to select the most appropriate chart type based on their
specific data visualisation needs.
FusionCharts provides developers with a wide range of customization options
to modify different elements of the charts. These options include the ability to
customise colours, fonts, labels, tooltips and animations. This feature empowers
developers to generate visually captivating and branded visualisations that
seamlessly match the design of their application.
FusionCharts offers an extensive range of features and functionalities,
encompassing drill-down capabilities, real-time updates, export options,
interactivity and responsive design. The aforementioned feature enhances the
tool’s versatility in generating dynamic and interactive data visualisations.
Disadvantages of FusionCharts
Licencing and Pricing: FusionCharts provides a complimentary version that
includes restricted features, but to access its complete functionality and
advanced features, a commercial licence is necessary. The pricing model may
not be conducive to accommodating all budgets, particularly for projects of
smaller scale or those associated with non-profit organisations.
Learning Curve: The process of acquiring proficiency in FusionCharts can be
challenging, particularly for developers who are unfamiliar with the library or
have limited knowledge of data visualisation principles. Gaining proficiency in
the API, configuring data sources and customising charts may necessitate an
initial investment of effort and technical expertise.
Performance Considerations: FusionCharts exhibits resource-intensive
behaviour, particularly when handling extensive datasets or intricate
visualisations. In order to achieve seamless rendering and interactivity, it is
imperative to engage in meticulous optimisation and thoughtful consideration
of performance factors. This is especially crucial in situations involving large
amounts of data or frequent updates.
4. Tableau
Tableau is a robust and extensively utilised software for data visualisation and business
intelligence. The software allows users to generate interactive and visually engaging
dashboards, reports and data visualisations using data from multiple sources.
Advantages of Tableau:
Intuitive and user-friendly interface: Tableau offers a user-friendly and intuitive
interface, enabling users to effortlessly manipulate data elements through
drag-and-drop actions, facilitating the creation of visualisations. The “Show
Me” feature provides users with limited technical expertise the ability to easily
access appropriate chart types based on their data.
Tableau offers an extensive range of data connections, encompassing various
sources such as spreadsheets, databases, cloud services and big data
platforms. The software enables users to effortlessly establish connections and
integrate data from various sources, resulting in a comprehensive and unified
perspective of the data.
Tableau provides a range of interactive features that enable users to perform
exploratory analysis on data and obtain dynamic insights. The system allows
users to apply filters, perform drill-downs, apply sorting and engage with
visualisations in order to identify patterns, trends and outliers within the data.
This facilitates a decision-making process that is guided by data.
Disadvantages of Tableau:
High Cost of Licencing: Tableau licences are known to have a significant price
tag, particularly for entities or individuals operating within constrained financial
resources. The pricing structure could potentially pose a challenge for small-
scale or non-profit projects, as well as for those seeking advanced features and
enterprise-level deployments. Additional expenses may be required in these
cases.
Steep Learning Curve: Tableau exhibits a significant learning curve, especially
for individuals who possess limited familiarity with data visualisation principles
or possess limited exposure to analogous software applications. To fully
harness the advanced features and maximise the capabilities of Tableau, it may
be necessary to undergo specialised training or utilise dedicated educational
materials.
Performance Challenges with Large Datasets: The performance of Tableau
may be affected when handling large datasets or intricate visualisations.
Slower response times and the need for optimisation techniques may arise
when resource-intensive calculations or frequent data updates are performed,
necessitating the maintenance of acceptable performance levels.
Despite the aforementioned limitations, Tableau continues to be widely favoured for
data visualisation and business intelligence owing to its user-friendly interface, robust
functionality and capacity to generate captivating visual representations. The software
provides users with the ability to effectively explore and communicate data, facilitating
data-driven decision-making in diverse industries and sectors.
5. Data Wrapper
Data Wrapper is a web-based application designed to facilitate the creation of
dynamic and adaptable data visualisations. The software streamlines the procedure
of generating charts, maps and tables, enabling users to effectively showcase data in
a concise and aesthetically pleasing format. This document provides an overview of
Data Wrapper, including its advantages and disadvantages.
6. Plotly
Plotly is a software library specifically designed for data visualisation purposes. It
offers a wide range of features and functionalities, including the creation of interactive
and visually appealing graphs and charts. The software platform provides support for
Disadvantages of Plotly
Limited customization options: Plotly offers a range of customization features, but they are more limited than those of some other visualisation libraries.
Learning Curve: The utilisation of Plotly may present a significant learning
curve, particularly for individuals who possess limited experience in
programming or are unfamiliar with data visualisation principles. Gaining
proficiency in the syntax, API and customization options may necessitate an
initial investment of time and experimentation.
Advanced Features Require Expertise: Plotly provides a comprehensive set of
features, including advanced functionalities like 3D visualisations and complex
statistical analysis. However, utilising these advanced features may necessitate
a higher level of programming skills or expertise.
Performance Limitations with Large Datasets: Plotly exhibits reduced
performance when handling large datasets, in contrast to specialised libraries
that are specifically designed for efficient visualisation of big data. Users who
are working with large datasets may encounter extended processing times or
encounter performance limitations.
Summary
● A dataset is said to be high dimensional if it contains more characteristics (p) than observations (N), which is frequently expressed as p > N.
● Distance measures are essential tools used to quantify the dissimilarity or similarity between objects in a dataset.
● Euclidean distance is regarded as the standard metric for geometry problems; it is simply the ordinary straight-line distance between two points.
● The cosine similarity is a mathematical measure that quantifies the cosine of the angle between two vectors.
● The Mahalanobis distance (MD) is a metric used to quantify the relative distance between two variables in relation to the centroid.
● Dimension reduction techniques aim to decrease the number of features or dimensions in a dataset while preserving essential information.
● Feature extraction refers to the procedure of converting a high-dimensional space into a lower-dimensional space.