Week 02: Classification and Clustering Techniques
LEARNING OUTCOMES
Contents:
1. Overview
2. Supervised Classification
2.1 Bayesian Classification
2.2 Support Vector Machine
2.3 Decision Trees Algorithm
2.4 k-Nearest Neighbor (kNN)
3. Unsupervised Classification
3.1 Hierarchical clustering
3.2 k-means clustering
3.3 Fuzzy C-Means (FCM)
4. Performance evaluation
5. Case studies – Implementation of classification and clustering algorithms for solving problems using R programming.
1. Overview

Educational setting:
Out of 100 students in a class, which ones are at risk of dropping out?
Can an effective curriculum be developed based on students' feedback and market needs?
etc.

Credit ratings / targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions.
etc.
Fraud detection:
Which types of transactions are likely to be fraudulent, given the demographics
and transactional history of a particular customer?
How can spam e-mails, terrorists in a crowd, hackers, etc. be recognized?
As a core part of data science, machine intelligence is essential for solving such pattern recognition / identification problems, and this need leads to the framework of classification. Classification techniques are broadly divided into supervised and unsupervised (clustering) categories. Supervised classification refers to data-driven models that are trained with a labelled data set, whereas unsupervised models are trained with unlabelled data.

In this chapter, a few significant and widely used supervised and unsupervised machine learning algorithms are discussed. In addition, these algorithms are implemented using R programming to solve a few problems as case studies.
2. Supervised Classification

Supervised learning uses labelled data to predict outcomes for unseen data, i.e., to assign unknown data to a known class. It analyzes the training data and produces an inferred function that is then used for mapping new examples. A few probabilistic and machine learning algorithms are described below.
2.1 Bayesian Classification

Bayesian classification is based on probability theory and, more specifically, on Bayes' decision theory (Duda, Hart, & Stork, 2007). The principle of the decision is to choose the most probable or the lowest-risk (expected cost) option. Assume that there is a classification task to classify feature vectors (samples) into $K$ different classes. A feature vector is denoted as $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$, where $n$ is the dimension of the vector. The probability that a feature vector $\mathbf{x}$ belongs to class $w_k$ is $P(w_k \mid \mathbf{x})$, often referred to as the posterior probability. By Bayes' rule, the posterior probability is
$$P(w_k \mid X) = \frac{p(X \mid w_k)\, P(w_k)}{P(X)}$$
where $p(X \mid w_k)$ is the class-conditional probability density function of class $w_k$ in the feature space, given that the class is $w_k$. This function describes the distribution of feature vectors in the feature space inside a particular class, i.e., it describes the class model. $P(w_k)$ indicates the a priori probability, i.e., the probability of the class before measuring any features. If prior probabilities are not actually known, they are estimated from the relative occurrences. The divisor
$$P(X) = \sum_{i=1}^{K} p(X \mid w_i)\, P(w_i)$$
is a scaling factor to assure that posterior probabilities are really probabilities, i.e., their sum is 1.
It can be shown that choosing the class with the highest posterior probability produces the minimum error probability. The prior probabilities can be estimated from the class proportions in the database using the following formula:

$$p(w_i) = \frac{P(w_i)}{P(w_1) + P(w_2)}$$
The key issue in the Bayesian classifier is the class-conditional probability density function $p(X \mid w_k)$. In practice it is almost always unknown, except in some artificial classification tasks. The distribution can be estimated from the training set with a range of methods. If the patterns $X$ from the various classes can be approximated by a normal distribution, the class-conditional distribution $p(X \mid w_k)$ has the form
$$p(X \mid w_k) = \frac{1}{(2\pi)^{M/2}\, |W|^{1/2}} \exp\!\left(-\frac{1}{2}(X - \mu)^T W^{-1} (X - \mu)\right)$$

where $\mu$ and $W$ are the mean vector and covariance matrix of class $w_k$, and $M$ is the dimension of the feature space.
For two-class classification problems, the Bayes decision can also be made based on the following comparison:

$$\frac{P(w_1 \mid X)}{P(w_2 \mid X)} \ge 1 \;\Rightarrow\; X \in w_1, \qquad \text{else } X \in w_2$$
Training:
For every class, compute the mean vector $\mu_k$ and covariance matrix $\Sigma_k$ as

$$\mu_k = \frac{1}{|w_k|} \sum_{x \in w_k} x, \qquad 1 \le k \le c$$

$$\Sigma_k = \frac{1}{|w_k|} \sum_{i:\, x_i \in w_k} (x_i - \mu_k)(x_i - \mu_k)^T$$

where $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of the $k$th class, respectively, and $|w_k|$ is the number of training samples in class $w_k$.

Testing:
Compute the posterior probability from the prior probability $P(w_k)$ and the class-conditional density $p(x \mid w_k)$ as

$$P(w_k \mid x) = \frac{p(x \mid w_k)\, P(w_k)}{P(x)}$$

where the denominator on the right-hand side is the total probability. Then compute the class label as

$$y_n = \arg\max_{k} P(w_k \mid x_n), \qquad 1 \le k \le c.$$
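To make the training and testing steps concrete, the following illustrative R sketch (ours, not part of the original case studies) fits one multivariate Gaussian per class on the built-in iris data and assigns each sample to the class with the highest posterior. The variable names (class_stats, posterior) are ours, and the mvtnorm package is assumed to be available for the Gaussian density.

R-CODE (illustrative sketch):
library(mvtnorm)   # assumed available for the multivariate Gaussian density

data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris$Species

# Training: mean vector, covariance matrix and prior proportion for every class
# (cov() uses the n-1 denominator, a minor difference from the formula above)
classes <- levels(y)
class_stats <- lapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  list(mu = colMeans(Xk), sigma = cov(Xk), prior = nrow(Xk) / nrow(X))
})
names(class_stats) <- classes

# Testing: the posterior is proportional to class-conditional density times prior
posterior <- sapply(class_stats, function(s) {
  dmvnorm(X, mean = s$mu, sigma = s$sigma) * s$prior
})
y_hat <- classes[max.col(posterior)]

# Compare predicted and actual class labels
table(Predicted = y_hat, Actual = y)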
2.2 Support Vector Machine (SVM)

The support vector machine (SVM) constructs a separating hyperplane $w^T x + b = 0$ between two classes of data points, chosen so that the margin between the classes is maximized and every training point $(x_i, y_i)$ with label $y_i \in \{+1, -1\}$ satisfies $y_i(w^T x_i + b) \ge 1$.
The hyperplane that optimally separates the data points into the two classes while satisfying this condition is obtained by solving

$$\min_{w,\,b}\; \frac{1}{2}\|w\|^2$$

The constrained problem is handled through the primal Lagrangian $L_p$, in which the $\alpha_i$ are called Lagrange multipliers under the constraint $\alpha_i \ge 0$. $L_p$ is minimized in order to find the optimal saddle point with respect to the primal variables $w$ and $b$. This problem is transformed into the dual form by differentiating $L_p$ with respect to $w$ and $b$ and introducing the
Karush-Kuhn-Tucker (KKT) conditions. The transformed dual problem is the maximization of the following objective function:

$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

$$\text{subject to } \alpha_i \ge 0,\; i = 1, 2, \ldots, n \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
For linearly separable data, the above dual formulation is sufficient. When the data are not linearly separable, the kernel trick is used to transform the feature space into a higher-dimensional space in which the data become linearly separable. In practice we do not have to map the input variables into the high-dimensional space explicitly; instead, the inner product between the features in the kernel space can be used in the optimization problem. The dual optimization problem with a kernel transformation for the SVM is
$$\text{Maximize } L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

$$\text{subject to } \alpha_i \ge 0,\; i = 1, 2, \ldots, n \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
where $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the dot product between the transformed features. In fact, any symmetric function that satisfies the Mercer conditions can be used as a kernel function. Commonly used kernel functions are the polynomial, quadratic and radial basis function (RBF) kernels. The polynomial kernel function of degree $d$ is

$$k(x_i, x_j) = (x_i^T x_j + 1)^d$$

Second- and third-degree polynomial kernels are commonly used in practice. In the linear kernel transformation, $k(x_i, x_j) = x_i^T x_j$, which is simply the inner product between the features.
A higher-dimensional non-linear mapping of the input vector $x$ turns non-linearly separable input into linearly separable vectors (Gunn, 1998). By choosing a non-linear mapping, the SVM constructs an optimal separating hyperplane in this higher-dimensional space.

Suppose the data are mapped to some other (possibly infinite-dimensional) Euclidean space $H$ using a mapping defined by

$$\Phi(\cdot): R^n \rightarrow R^{n_h}$$

In this case, the dual Lagrangian $L_D(\alpha)$, with the same constraints as before, becomes

$$\text{Maximize } L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j)$$
where

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$$

is the kernel function performing the non-linear mapping into feature space. The kernel function may be any symmetric function that satisfies the Mercer conditions (Courant, 1953). The most commonly used are the Gaussian radial basis function (RBF) and the polynomial function, whose formulas are shown below, respectively:

$$K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

$$K(x_i, x_j) = (x_i^T x_j + 1)^d$$
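To make the kernel formulas concrete, the short R sketch below (ours, not part of the case studies later in this chapter) evaluates the RBF and polynomial kernels for a pair of feature vectors; the parameter values sigma = 1.5 and d = 3 are arbitrary choices.

R-CODE (illustrative sketch):
# Evaluate the RBF and polynomial kernels for two feature vectors
rbf_kernel <- function(xi, xj, sigma = 1) {
  exp(-sum((xi - xj)^2) / (2 * sigma^2))
}
poly_kernel <- function(xi, xj, d = 2) {
  (sum(xi * xj) + 1)^d
}

xi <- c(1, 2, 3)
xj <- c(2, 0, 1)
rbf_kernel(xi, xj, sigma = 1.5)   # Gaussian RBF kernel value
poly_kernel(xi, xj, d = 3)        # third-degree polynomial kernel value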
2.3 Decision Trees Algorithm

In a decision tree, the task of predicting the class label of an object starts from the root attribute. The values of the root attribute are compared with the object's values and, on the basis of this comparison, we follow the branch corresponding to that value and jump to the next node. Let us first learn the important terminology related to decision trees.
Key Terms and Definitions

Root Node: Represents the entire population or sample; it gets divided further into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: A sub-node that splits into further sub-nodes.
Leaf / Terminal Node: A node that does not split further.
Pruning: The process of removing sub-nodes of a decision node.
Branch / Sub-Tree: A subsection of the entire tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
Fig. Illustration of decision tree terminology (root node, splitting, branch / sub-tree).
Decision trees classify the examples by sorting them down the tree from the root to some
leaf/terminal node, with the leaf/terminal node providing the classification of the example. Each
node in the tree acts as a test case for some attribute, and each edge descending from the
node corresponds to the possible answers to the test case. This process is recursive in nature
and is repeated for every subtree rooted at the new node.
CASE STUDY: CONSTRUCTING A DECISION TREE FROM EXAMPLES

Fig. A decision tree for credit card invitations: the root node tests Status (for example, the Unemployed branch leads directly to the leaf Don't Invite).
As the figure shows, each internal node in the tree is labeled with a “test” defined in terms of the
attributes and has a branch for each possible outcome for that test, and each leaf in the tree is
labeled with a class.
Attributes used for describing cases can be nominal (taking one of a pre-specified set of values)
or continuous. In the above example, Sex and Status are nominal attributes, whereas Age and
GPA are continuous ones. Typically, a test defined on a nominal attribute has one outcome for
each value of the attribute, whereas a test defined on a continuous attribute is based on a fixed
threshold and has two outcomes, one for each interval as imposed by this threshold. The
decision tree in the above figure illustrates these tests.
To find the appropriate class for a given case (individual), we start with the test at the root of the
tree and keep following the branches as determined by the values of the attributes of the case
at hand, until a leaf is reached. For example, suppose the attribute values for a given case are
as follows:
Name = Andrew; Social Security No. = 199199; Age = 22; Sex = Male;
Status = Student; Annual Income = 2,000; College GPA = 3.39.
To classify this case, we start at the root of the tree of the above-mentioned figure, which is
labeled Status, and follow the branch labeled Student from there. Then at the test node Age ≥
21, we follow the true branch, and at the test node GPA ≥ 3.0, we again follow the “true” branch.
This leads finally to a leaf labeled “invite”, indicating that this person is to be invited according to
this decision tree.
Decision tree learning is the task of constructing a decision tree classifier, such as the one in the
above figure, from a collection of historical cases. These are individuals who are already
marked by experts as being good candidates or not. Each historical case is called a training
example, or simply an example, and the collection of such examples from which a decision tree
is to be constructed is called a training sample. A training example is assumed to be
represented as a pair <X, c>, where X is a vector of attribute values describing some case, and
c is the appropriate class for that case. A collection of examples for the credit card task is shown
in the below figure. The following subsections describe how a decision tree can be constructed
from such a collection of training examples.
Given a set S of training examples, the tree-construction procedure is as follows:

Step 1: If all the examples in S are labeled with the same class, return a leaf labeled with that class.
Step 2: Choose some test t (according to some criterion) that has two or more mutually exclusive outcomes {O1, O2, ..., Or}.
Step 3: Partition S into disjoint subsets S1, S2, ..., Sr, such that Si consists of those examples having outcome Oi for the test t, for i = 1, 2, ..., r.
Step 4: Call this tree-construction procedure recursively on each of the subsets S1, S2, ..., Sr, and let the decision trees returned by these recursive calls be T1, T2, ..., Tr.
Step 5: Return a decision tree T with a node labeled t as the root and the trees T1, T2, ..., Tr as subtrees below that node.
Example: For illustration, let us apply the above procedure on the set of data represented in the
following table. We will use the Case IDs 1–15 (listed in the first column) to refer to each of
these examples.
Obviously, the quality of the tree produced by the above top-down construction procedure of
decision trees depends mainly on how tests are chosen in Step 2.
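The criterion itself is not specified in the text above; one commonly used choice is information gain based on entropy. The following R sketch (our illustration, with a made-up set of class labels and test outcomes) computes the gain of a candidate test that splits a set of examples into the subsets S1, ..., Sr.

R-CODE (illustrative sketch):
# Entropy of a set of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of a test: parent entropy minus the weighted
# entropy of the subsets S_i induced by the test outcomes O_i
info_gain <- function(labels, outcome) {
  subsets <- split(labels, outcome)
  weighted <- sum(sapply(subsets, function(s) length(s) / length(labels) * entropy(s)))
  entropy(labels) - weighted
}

# Hypothetical example: class labels and a binary test outcome
cls  <- c("Invite", "Invite", "Dont", "Dont", "Invite", "Dont")
test <- c("true", "true", "false", "false", "true", "false")
info_gain(cls, test)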
Fig. Subtrees (a), (b), (c) returned by recursive calls on training examples for credit card invitations.
2.4 k-Nearest Neighbor (kNN)

As a starting intuition for the nearest-neighbour classifier, consider the familiar rule: IF it walks like a duck and quacks like a duck, THEN it is probably a duck!
An intuitive way to classify an unlabelled test item is to look at the nearby training data points and assign a class according to the classes of those nearby labelled points. This intuition is formalized in a supervised classification approach called k-nearest neighbour (k-NN) classification. The k-NN approach looks at the k points in the training set that are closest to the test point; the test point is then assigned to the class to which the majority of these k nearest neighbours belong. Hence, three things are required for k-NN: the set of stored (labelled) records, a distance metric to compute the distance between records, and the value of k, the number of nearest neighbours to retrieve. The k nearest neighbours of a record x are the data points that have the k smallest distances to x (for example, k = 1, 2 or 3).
Training a k-NN classifier is simple: we just need to store the entire training set. Testing, however, is much slower, since it involves measuring the distance between each test point and every training point. The basic principle of kNN is to classify new cases based on a similarity measure (e.g., a distance function) computed against the stored cases. A case is classified by a majority vote of its neighbours, being assigned to the class most common amongst its k nearest neighbours as measured by a distance function. If k = 1, the case is simply assigned to the class of its nearest neighbour. The most widely used distance metrics are:
Euclidean distance: $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan distance: $\sum_{i=1}^{n} |x_i - y_i|$
Minkowski distance: $\left(\sum_{i=1}^{n} |x_i - y_i|^q\right)^{1/q}$
The above distance functions are applicable to only continuous variables. In case of categorical
variable, Hamming distance is used. It also brings up the issue of standardization of the
numerical variables between 0 and 1 when there is a mixture of numerical and categorical
variables in the dataset.
Hamming distance: $D_H = \sum_{i=1}^{n} |x_i - y_i|$, where for each attribute
$x = y \Rightarrow D = 0$
$x \neq y \Rightarrow D = 1$
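As a concrete illustration of the majority-vote rule with the Euclidean distance above, the following minimal R sketch (ours; the function knn_predict is hypothetical) classifies a single test point against a stored training set. In practice, the knn() function from the class package provides equivalent functionality.

R-CODE (illustrative sketch):
# Minimal k-NN: majority vote among the k nearest training points
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  # Euclidean distance from the test point to every training point
  d <- sqrt(rowSums((train_x - matrix(test_x, nrow(train_x), ncol(train_x),
                                      byrow = TRUE))^2))
  nn <- order(d)[1:k]                    # indices of the k nearest neighbours
  names(which.max(table(train_y[nn])))   # majority class among them
}

data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris$Species
knn_predict(X[-1, ], y[-1], X[1, ], k = 5)  # classify the first flower against the rest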
Example applications:
• Use of k-Nearest Neighbor classifier for intrusion detection — Yihua Liao and V. Rao Vemuri, Computers and Security Journal, 2002: classifies program behavior as normal or intrusive.
• Fault Detection Using the k-Nearest Neighbor Rule for Semiconductor Manufacturing Processes — Q. P. He and Jin Wang, IEEE Transactions on Semiconductor Manufacturing, 2007: early fault detection in industrial systems.
3. Unsupervised Classification

In many real-life problems, difficulty arises in data classification because the collected data are unlabelled. The outcomes are unknown, as no supervisor is available to label them, and this setting is referred to as unsupervised classification. Clustering is an unsupervised technique that partitions the unlabelled data into a set of (homogeneous) clusters on the basis of similarity amongst the patterns. A few such clustering techniques are discussed here.
3.1 Hierarchical clustering

Hierarchical clustering techniques are among the most commonly used methods for summarizing data or patterns, where the clustering of the data follows a hierarchical structure. Clusters are obtained based on the similarity or dissimilarity amongst the patterns. This helps in discovering hidden patterns in the data and in identifying outlier(s) as a preprocessing step for other algorithms.

First, compute the proximity matrix (n × n), where each element d(i, j) describes the distance or dissimilarity between the ith and jth patterns. The proximity value is usually computed using the Euclidean distance measure, on the basis of which the clusters in the data are formed. There are two types of such hierarchical clustering:
Agglomerative clustering algorithm – Also called a bottom-up approach.
Step 1: Consider each data point as an individual cluster.
Step 2: Compute the proximity values and generate the proximity matrix.
Step 3: Find the smallest proximity value and merge the corresponding clusters into one cluster (of at least two data points).
Step 4: Recompute the proximity matrix for all the initial and newly formed clusters in Step 3, and repeat from Step 3 until a single cluster remains.

Divisive clustering algorithm – Also called a top-down approach.
Step 1: Consider all the data points as one single cluster.
Step 2: Compute the proximity values and generate the proximity matrix.
Step 3: Find the largest proximity value and split the corresponding cluster to generate sub-clusters.
Step 4: Recompute the proximity matrix for all the initial and newly formed clusters in Step 3, and repeat from Step 3 until every data point forms its own cluster.
Both hierarchical algorithms execute iteratively according to the computed proximity values and produce a binary tree called a dendrogram, in which the final (all-inclusive) cluster is the root and each data item is a leaf.
The proximity / distance between two clusters is calculated from the pairwise distances between members of the clusters, following various notions of linkage such as single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance) and average linkage (mean pairwise distance).
Complete linkage gives preference to compact / spherical clusters, whereas single linkage can produce long, stretched (chained) clusters. An example is given below:
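As a small illustration of this difference (an illustrative sketch of ours), the R code below clusters the built-in USArrests data with single and complete linkage and compares the cluster sizes obtained when both dendrograms are cut into four groups; single linkage typically yields a few large, chained clusters.

R-CODE (illustrative sketch):
# Single vs. complete linkage on the USArrests data
d <- dist(USArrests, method = "euclidean")

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Cut both dendrograms into 4 clusters and compare the cluster sizes
table(Single = cutree(hc_single, k = 4))
table(Complete = cutree(hc_complete, k = 4))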
3.2 k-means clustering

The k-means clustering algorithm was first proposed by MacQueen (1967). It is an unsupervised classification method that assumes a fixed number of clusters. It belongs to the central (centroid-based) clustering category and uses the Euclidean distance as its distance metric. The algorithm minimizes the total squared error between the cluster centroids and the data points, i.e., it minimizes the following objective function:
$$J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$$
Step 1: Initialize the k cluster centroids, e.g., by choosing k seed points.
Step 2: Assign each pattern $x_n$ to the cluster whose centroid is nearest, i.e., $y_n = \arg\min_k \| x_n - \mu_k \|$.
Step 3: Recompute each centroid as the mean of the patterns currently assigned to it:

$$\mu_k = \frac{1}{|X_k|} \sum_{x \in X_k} x$$

Step 4: Stop the algorithm if the labels $\{ y_n \mid n = 1, 2, \ldots, N \}$ do not change; otherwise go back to Step 2.
The k-means algorithm can be initialized by choosing a set of k seed points. The seed points can be the first k patterns, or k patterns chosen randomly from the pattern matrix. The first seed point can also be chosen as the centroid of all the patterns, with successive seed points chosen so that they lie at a certain distance from the previously chosen seed points. Each pattern is assigned to a class based on the minimum Euclidean distance criterion. Different initial partitions can lead to different final clusterings, because k-means clustering based on the squared-error criterion can converge to a local minimum rather than the global minimum. Therefore it is sometimes necessary to run the k-means algorithm many times with different initializations; if most of the runs lead to the same result, we gain some confidence that a global minimum has been reached.

In the data-assignment step, the data are partitioned among the classes based on the minimum distance between each pattern and the respective class centroids. In the centroid-computation step, the average of all the patterns assigned to a given class is computed and replaces the previous centroid. The algorithm terminates when the criterion function can no longer be improved, i.e., when the cluster labels of all the patterns do not change between two successive iterations. A maximum number of iterations can be specified to prevent endless oscillations. The computational complexity of the k-means algorithm is of the order $O(NdkT)$, where $N$ is the total number of patterns, $d$ is the number of features, $k$ is the number of clusters, and $T$ is the number of iterations.
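The following minimal R sketch (ours) runs the built-in kmeans() function on the standardized iris measurements with several random initializations, in line with the advice above about rerunning the algorithm; k = 3 and nstart = 25 are assumed choices.

R-CODE (illustrative sketch):
# k-means with multiple random initializations (nstart) on the iris data
data(iris)
X <- scale(iris[, 1:4])            # standardize the four features

set.seed(42)
km <- kmeans(X, centers = 3, nstart = 25, iter.max = 100)

km$centers                          # cluster centroids
km$tot.withinss                     # value of the objective function J
table(Cluster = km$cluster, Species = iris$Species)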
3.3 Fuzzy C-Means (FCM)

The k-means clustering algorithm assigns each pattern to exactly one cluster, so the patterns are partitioned into disjoint sets. Patterns in one cluster are supposed to be more similar to each other than to patterns in different clusters. If the clusters are well separated, there is no ambiguity or uncertainty in assigning each pattern to one cluster. When the clusters touch or overlap, however, the cluster boundaries are not sharp and the assignment of patterns to clusters becomes difficult. For this kind of problem, fuzzy clustering is useful. Here, a pattern belongs to a class with a grade of membership, which takes a value in the interval [0, 1]. For ordinary clusters, called crisp clusters, the membership grade for a particular cluster is 1 if the pattern belongs to the cluster and 0 if it does not. With fuzzy clusters, the pattern $x_i$ has a grade of membership $u_{ji} \ge 0$, i.e., a degree of belonging to the $j$th cluster, where $\sum_{j=1}^{K} u_{ji} = 1$ and $K$ is the number of clusters. The larger $u_{ji}$, the more confidence exists that $x_i$ belongs to cluster $j$. If $u_{ji}$ is 1, pattern $x_i$ belongs to cluster $j$ with absolute certainty. The interpretation of values like 0.25 is less clear: membership grades are subjective and are based on definitions rather than measurements. The grade of membership is not the same as the probability that the pattern belongs to the cluster, even though both grades of membership and probabilities take values in the range [0, 1]. Under a probabilistic framework, a pattern $x_i$ can belong to one and only one cluster, depending on the outcome of a random experiment; in fuzzy set theory, a pattern $x_i$ can belong to two clusters simultaneously, with the membership grades determining the degree to which the two cluster labels are applicable. Most of the algorithms based on fuzzy set theory are partitional.
The fuzzy c-means algorithm minimizes the following objective function:

$$E_{FCM} = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ji}^{m} \, \| p_i - V_j \|^2$$

where $u_{ji}$ is the fuzzy membership of pattern $p_i$ in cluster $j$, $m$ is the weighting exponent (fuzzifier), and $V_j$ is the centroid of cluster $j$. The algorithm is as follows.
Step 1: Initialize the fuzzy membership matrix $U^{(0)} = [u_{ji}]$ (subject to $\sum_{j=1}^{c} u_{ji} = 1$) and set $r = 1$.

Step 2: Compute the cluster centroids as

$$V_j^{(r)} = \frac{\sum_{i=1}^{N} \left(u_{ji}^{(r-1)}\right)^m p_i}{\sum_{i=1}^{N} \left(u_{ji}^{(r-1)}\right)^m}$$

where $p_i$ is the feature vector.

Step 3: Update the fuzzy memberships as

$$u_{ji}^{(r)} = \frac{1}{\displaystyle\sum_{v=1}^{c} \left( \frac{\| p_i - V_j^{(r)} \|}{\| p_i - V_v^{(r)} \|} \right)^{2/(m-1)}}$$

Step 4: If $\| U^{(r)} - U^{(r-1)} \|_F < \varepsilon$ (a small tolerance), stop. Otherwise, set $r := r + 1$ and go to Step 2.

Step 5: The clustering decision for the $i$th data point is made by maximization,

$$\hat{\omega}(p_i) = \arg\max_{j} u_{ji}$$

where $\hat{\omega}(p_i) \in \{\omega_1, \omega_2, \ldots, \omega_c\}$, the set of clusters.
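A minimal R sketch of fuzzy c-means is given below (ours); it uses the cmeans() function from the e1071 package, with c = 3 clusters and fuzzifier m = 2 as assumed settings, and applies the Step 5 maximum-membership rule to obtain crisp labels.

R-CODE (illustrative sketch):
# Fuzzy c-means with e1071::cmeans on the iris measurements
library(e1071)

data(iris)
X <- as.matrix(iris[, 1:4])

set.seed(1)
fcm <- cmeans(X, centers = 3, m = 2, iter.max = 100, method = "cmeans")

head(fcm$membership)                          # fuzzy membership grades u_ji
crisp <- apply(fcm$membership, 1, which.max)  # Step 5: maximum-membership decision
table(Cluster = crisp, Species = iris$Species)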
4. Performance evaluation

Cross-validation is a process of splitting the total dataset into k folds such that (k−1) folds together constitute the training set and the remaining fold is the testing set. There is no strict rule for choosing the value of k, but in practice it is often set to 4 or 5, so the ratio of training to testing data is either 3:1 or 4:1. The folds are randomized when generating the training and testing samples in order to obtain the most unbiased results. For each combination, the performance of the classifier is evaluated using various measures, as described below.
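A minimal R sketch of such a randomized k-fold split is shown below (ours); with k = 5 the training-to-testing ratio is 4:1, as described above, and the iris data stand in for an arbitrary dataset.

R-CODE (illustrative sketch):
# Randomized k-fold cross-validation split (k = 5, ratio 4:1)
set.seed(123)
n <- nrow(iris)
k <- 5
folds <- sample(rep(1:k, length.out = n))   # random fold label for every row

for (i in 1:k) {
  test  <- iris[folds == i, ]               # one fold for testing
  train <- iris[folds != i, ]               # remaining k-1 folds for training
  # ... fit the classifier on 'train' and evaluate it on 'test' here ...
}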
Confusion matrix: A confusion matrix shows the number of correct and incorrect predictions
made by the classification model compared to the actual outcomes (target value) in the data.
The matrix is NxN, where N is the number of target values (classes). Performance of such
models is commonly evaluated using the data in the matrix. The following table displays a 2x2
confusion matrix for two classes (Positive and Negative).
Confusion Matrix            Target: Positive          Target: Negative
Model: Positive                    a                         b            Positive Predictive Value = a/(a+b)
Model: Negative                    c                         d            Negative Predictive Value = d/(c+d)
                           Sensitivity = a/(a+c)     Specificity = d/(b+d)        Accuracy = (a+d)/(a+b+c+d)
where (in terms of the cells above, a = TP, b = FP, c = FN, d = TN):
True Positive (TP) – an actual positive is predicted by the model as positive.
True Negative (TN) – an actual negative is predicted by the model as negative.
False Positive (FP) – an actual negative is predicted by the model as positive.
False Negative (FN) – an actual positive is predicted by the model as negative.
Ideally, both the sensitivity and the specificity should be high for a classifier to be efficient and accurate. In practice, however, there may be situations where the sensitivity is high while the specificity is low (or vice versa), and it then becomes difficult to decide which classifier is superior. To address such conflicting scenarios, the trade-off between sensitivity and specificity is modelled as a receiver operating characteristic (ROC) curve. The area under this curve, known as the AUC (area under the curve), is used as a single accuracy indicator: the higher the AUC, the better the classifier.
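The quantities in the confusion matrix above translate directly into code; the following R sketch (ours, with hypothetical counts for a, b, c and d) computes the sensitivity, specificity and accuracy.

R-CODE (illustrative sketch):
# Performance measures from a 2x2 confusion matrix
# Rows: model prediction, columns: target (actual) class
cm <- matrix(c(40, 10,    # a = TP, b = FP  (hypothetical counts)
                5, 45),   # c = FN, d = TN
             nrow = 2, byrow = TRUE,
             dimnames = list(Model  = c("Positive", "Negative"),
                             Target = c("Positive", "Negative")))

tp <- cm[1, 1]; fp <- cm[1, 2]   # a, b
fn <- cm[2, 1]; tn <- cm[2, 2]   # c, d

sensitivity <- tp / (tp + fn)              # a / (a + c)
specificity <- tn / (fp + tn)              # d / (b + d)
accuracy    <- (tp + tn) / sum(cm)         # (a + d) / (a + b + c + d)
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)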
5. Case studies: Implementing classification and clustering algorithms in R

Case study – Naive Bayes classification of the graduate admission data (binary.csv), with admit as the class label and gre, gpa and rank as predictors.

R-CODE:
# R Libraries
library(naivebayes)
library(dplyr)
library(ggplot2)
library(psych)

# Data
data <- read.csv('binary.csv', header = T)
str(data)
xtabs(~admit + rank, data = data)
data$admit <- as.factor(data$admit)
data$rank <- as.factor(data$rank)

# Train/test partition (assumed 80/20 random split; 'train' and 'test' are used below)
set.seed(1234)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[ind == 1, ]
test <- data[ind == 2, ]

# Naive Bayes model (kernel density estimates for the numeric predictors)
model <- naive_bayes(admit ~ ., data = train, usekernel = T)
model

# Summary of the gre scores for admitted students in the training set
train %>%
  filter(admit == "1") %>%
  summarise(mean(gre), sd(gre))

plot(model)

# Prediction for classification (class membership probabilities on the test set)
p <- predict(model, test, type = 'prob')
Results:
Prediction: The following table provides the likelihood of admission based on X=(gre, gpa and
rank) of students (testing data set).
IRIS data – The IRIS data set is a well-known benchmark dataset containing 150 instances in 3 classes of 50 instances each, where each class refers to a type of iris plant (Setosa, Versicolor, Virginica). There are four features, viz., the length and width of the sepal and of the petal. [Downloadable from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris ].
R-CODE:
#SVM in R
install.packages("e1071")
library(e1071)
data("iris")
mymodel <- svm(Species ~ ., data = iris, kernel = "radial")
summary(mymodel)
# Predict and tabulate predicted vs. actual species (as in the results below)
pred <- predict(mymodel, iris)
table(Predicted = pred, Actual = iris$Species)
# Visualize two of the features (slice values for the other two are assumed)
plot(mymodel, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))
Results:
Parameters:
SVM-Kernel: radial
Prediction:
Actual
Predicted setosa versicolor virginica
setosa 50 0 0
versicolor 0 48 2
virginica 0 2 48
GUI of SVM using R showing the Visualization of classes (Setosa, Versicolor, Virginica)
Case study – Decision tree on the readingSkills data from the party package (predicting nativeSpeaker from age, shoeSize and score).

R-CODE:
install.packages("party")
library(party)
print(head(readingSkills))
# Create the tree
output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)
Results:
From the decision tree shown above we can conclude that anyone whose reading-skills score is less than 38.3 and whose age is more than 6 is not a native speaker.
Description – The USArrests data set provides, for each of the 50 US states, statistics on arrests for murder, assault and rape (per 100,000 residents) together with the percentage of urban population (UrbanPop). Using these data, can we find the states that have similar crime profiles? Here, hierarchical clustering with the complete-linkage method is used in the R programming environment.
R-CODE:
install.packages("cluster")
library(cluster)
df <- USArrests
# Euclidean distance matrix between states
d <- dist(df, method = "euclidean")
# Agglomerative hierarchical clustering with complete linkage
hc1 <- hclust(d, method = "complete")
plot(hc1, cex = 0.6, hang = -1)
# Draw boxes around the clusters obtained by cutting the dendrogram
rect.hclust(hc1, k = 5, border = 2:6)
The algorithm produces the following dendrogram, in which the clusters marked with coloured boxes (k = 5 in the call to rect.hclust above) group states with similar crime records in terms of the computed proximity values.
References:
H. Almuallim, S. Kaneda, and Y. Akiba, "Development and Applications of Decision Trees," in C. T. Leondes (Ed.), Expert Systems, Academic Press, 2002, pp. 53–77.
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley.
A. Webb, Statistical Pattern Recognition, Wiley.
C. Heumann, M. Schomaker, and Shalabh, Introduction to Statistics and Data Analysis, Springer, 2016.