UNIT 4
Classification: It is a data analysis task, i.e. the process of finding a model that
describes and distinguishes data classes and concepts. Classification is the
problem of identifying which of a set of categories (sub-populations) a new
observation belongs to, on the basis of a training set of data containing
observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this
case, a classifier is required to predict class labels such as 'Safe' and 'Risky' for
adopting the project and further approving it. Classification is a two-step process:
1. Learning Step (Training Phase): Construction of the classification model.
Different algorithms are used to build a classifier by making the model learn
from the available training set. The model has to be trained to produce accurate
predictions.
2. Classification Step: The constructed model is used to predict class labels and is
applied to test data in order to estimate the accuracy of the classification rules.
Classification:
Classification is the process of finding a good model that describes the data
classes or concepts; the purpose of classification is to predict the class of
objects whose class label is unknown. In simple terms, we can think of
classification as categorizing incoming new data based on the assumptions we have
made and the data we already have.
Prediction:
We can think of prediction as estimating something that may happen in the future.
In prediction, we identify or predict the missing or unavailable value for a new
observation based on the previous data that we have and on assumptions about the
future. In prediction, the output is a continuous value.
Difference between Prediction and Classification:
1. Prediction is about estimating a missing/unknown element (a continuous value) of
a dataset, whereas classification is about determining a (categorical) class or
label for an element in a dataset.
2. Example: predicting the correct treatment for a particular disease for an
individual person can be thought of as prediction, whereas grouping patients based
on their medical records can be considered classification.
3. The model used to predict the unknown value is called a predictor, while the
model used to classify the unknown value is called a classifier.
4. The predictor is constructed from a training set, and its accuracy refers to how
well it can estimate the value of new data; a classifier is likewise constructed
from a training set composed of database records and their corresponding class
names.
Data Generalization:-
It is the process of summarizing data by replacing relatively low level values
with higher level concepts. It is a form of descriptive data mining.
There are two basic approaches to data generalization:
1. Data cube approach:
It is also known as the OLAP approach.
It is an efficient approach, as it is helpful, for example, for producing a graph of past sales.
In this approach, computation and results are stored in the Data cube.
It uses Roll-up and Drill-down operations on a data cube.
These operations typically involve aggregate functions, such as count(), sum(),
average(), and max().
2. Attribute-oriented induction:
It is an online, query-oriented, generalization-based data analysis approach.
In this approach, we perform generalization on the basis of the different values of
each attribute within the relevant data set. After that, identical tuples are merged
and their respective counts are accumulated in order to perform aggregation.
(In contrast, the data cube approach performs off-line aggregation before an OLAP
or data mining query is submitted for processing.)
The attribute-oriented induction approach uses two methods (a small sketch is given below):
(i). Attribute removal.
(ii). Attribute generalization.
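Below is a minimal Python sketch of the attribute-oriented induction idea. The relation, the attribute names and the concept hierarchies (age groups, regions) are hypothetical and only serve to illustrate attribute removal, attribute generalization and the merging of identical generalized tuples.

```python
# Minimal sketch of attribute-oriented induction (AOI); the relation,
# attribute names and concept hierarchies below are hypothetical.
from collections import Counter

# Relevant data set (initial working relation): (name, age, city)
tuples = [
    ("Anu", 21, "Delhi"), ("Ben", 23, "Mumbai"),
    ("Cho", 35, "Delhi"), ("Dev", 37, "Pune"),
]

def generalize(name, age, city):
    # Attribute removal: 'name' has many distinct values and no
    # higher-level concept, so it is dropped.
    # Attribute generalization: replace low-level values with
    # higher-level concepts from a concept hierarchy.
    age_group = "young" if age < 30 else "middle_aged"
    region = "North" if city in ("Delhi",) else "West"
    return (age_group, region)

# Identical generalized tuples are merged and their counts accumulated.
generalized = Counter(generalize(*t) for t in tuples)
for tup, count in generalized.items():
    print(tup, "count =", count)
```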
Analytical Characterization in Data Mining
Analytical characterization is used to help identify weakly relevant or irrelevant
attributes. We can exclude these unwanted, irrelevant attributes when preparing our
data for mining.
"Analytical characterization in data mining is the attribute relevance analysis
measure used in identifying irrelevant attributes"
Why Analytical Characterization?
Analytical characterization is a very important activity in data mining for the
following reasons:
Due to the limitations of OLAP tools in handling complex objects.
Due to the lack of automated generalization, we must explicitly tell the system
which attributes are irrelevant and must be removed, and similarly, we must
explicitly tell the system which attributes are relevant and must be included in the
class characterization.
4. Generate the concept description using AOI:
Perform attribute-oriented induction using a less conservative set of attribute
generalization thresholds.
3. Synchronous generalization: The contrasting class is generalized to the same
level as those in the prime target class relation or cuboid, forming the prime
contrasting class relation or cuboid.
4. Presentation of the derived comparison: The resulting class comparison
description can be visualized in the form of tables, charts, and rules
There are several descriptive statistical measures that can be mined from large
databases, i.e. measures used for knowledge discovery in large databases.
Mode:
It is nothing but the value that occurs most frequently in the data.
For Example,
In {6, 9, 3, 6, 6, 5, 2, 3}, the Mode is 6 as it occurs most often.
Inter-quartile range: It is the difference between the 75th and 25th percentiles
(IQR = Q3 – Q1).
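As a quick illustration, the mode and inter-quartile range of the example data above can be computed with Python's standard statistics module:

```python
# Quick illustration of mode and inter-quartile range on the example data.
import statistics

data = [6, 9, 3, 6, 6, 5, 2, 3]
print("Mode:", statistics.mode(data))          # 6, the most frequent value

q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles Q1, Q2 (median), Q3
print("IQR = Q3 - Q1 =", q3 - q1)
```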
Quantile Plot
It displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences).
It plots quantile information
Scatter Plot
It provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as points in the
plane.
Histogram Analysis
Statistical-based algorithms
There are two types of statistical-based algorithms which are as follows −
Regression − Regression deals with the estimation of an output value based on
input values. When used for classification, the input values are values from the
database and the output values define the classes. Regression can be used to solve
classification problems, but it is also used for other applications such as
forecasting. The elementary form of regression is simple linear regression, which
involves only one predictor and one predicted value. Regression can be used to
implement classification in two different ways (a small sketch follows the list below):
o Division − The data are divided into regions based on class.
o Prediction − Formulas are created to predict the output class's value.
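As an illustration of the prediction method, here is a small, hedged sketch: simple linear regression (one predictor) is fitted by least squares and the predicted output value is thresholded to obtain a class label. The data points and the 0.5 threshold are hypothetical.

```python
# Sketch: simple linear regression (one predictor) used for a two-class
# problem by thresholding the predicted output. Data are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # input values from the database
y = np.array([0, 0, 0, 1, 1, 1])               # output values encoding the class

# Least-squares estimates of slope and intercept: y ~ b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

def classify(x_new, threshold=0.5):
    # Prediction method: a formula predicts the output class's value,
    # which is then mapped to a class label.
    return 1 if b0 + b1 * x_new >= threshold else 0

print(classify(2.5), classify(4.5))   # expected: 0 and 1
```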
Bayesian Classification − Bayesian classifiers are statistical classifiers.
Bayesian classification is based on Bayes' theorem. Bayesian classifiers exhibit
high accuracy and speed when applied to large databases.
Bayes Theorem − Let X be a data tuple. In the Bayesian method, X is
treated as "evidence." Let H be some hypothesis, such as that the data
tuple X belongs to a specified class C. We want to determine P (H|X), the
probability that the hypothesis H holds given the "evidence," or observed
data tuple, X.
P (H|X) is the posterior probability of H conditioned on X. For instance,
suppose the data tuples are limited to users described by the attributes
age and income, and that X is a 30-year-old user with an income of Rs. 20,000.
Suppose that H is the hypothesis that the user will purchase a
computer. Then P (H|X) reflects the probability that user X will purchase a
computer given that the user's age and income are known.
P (H) is the prior probability of H. For instance, this is the probability that
any given user will purchase a computer, regardless of age, income, or any
other data. The posterior probability P (H|X) is based on more information than the
prior probability P (H), which is independent of X.
Likewise, P (X|H) is the posterior probability of X conditioned on H. It is the
probability that a user X is 30 years old and earns Rs. 20,000, given that we know
the user will purchase a computer.
P (H), P (X|H), and P (X) can be estimated from the given data.
Bayes' theorem provides a way of computing the posterior probability P
(H|X) from P (H), P (X|H), and P (X). It is given by
P(H|X) = P(X|H) P(H) / P(X)
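A small numeric sketch of Bayes' theorem on the running example is shown below; the three probabilities are made-up values used only to illustrate how the posterior is computed.

```python
# Bayes theorem on the running example: H = "user buys a computer",
# X = "user is 30 years old with Rs. 20,000 income".
# The probabilities below are hypothetical numbers for illustration only.

p_h = 0.4          # P(H): prior probability that any user buys a computer
p_x_given_h = 0.3  # P(X|H): probability of this age/income among buyers
p_x = 0.2          # P(X): probability of this age/income overall

# Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print("P(H|X) =", p_h_given_x)   # posterior probability = 0.6
```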
Distance-based algorithms
Distance-based algorithms are nonparametric methods that can be used for
classification. These algorithms classify objects by the dissimilarity between them
as measured by distance functions. Several candidate distance functions are
reviewed below along with particular classification algorithms. The
algorithms measure the distance between each pair of objects and use it to calculate a
score. Distance measures play an important role in machine learning. They provide
the foundations for many popular and effective machine learning algorithms like
KNN (K-Nearest Neighbours) for supervised learning and K-Means clustering for
unsupervised learning.
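The sketch below illustrates one simple distance-based classifier (nearest centroid under Euclidean distance); the training points and the class names 'Safe'/'Risky' are hypothetical.

```python
# Sketch of a distance-based classifier: assign a new object to the class
# whose centroid is nearest under Euclidean distance. Data are hypothetical.
import math

training = {
    "Safe":  [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9)],
    "Risky": [(3.0, 3.2), (2.8, 3.1), (3.2, 2.9)],
}

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

centroids = {label: centroid(pts) for label, pts in training.items()}

def classify(x):
    # Choose the class whose centroid has the smallest distance to x.
    return min(centroids, key=lambda label: euclidean(x, centroids[label]))

print(classify((1.0, 1.1)))   # expected: Safe
print(classify((3.1, 3.0)))   # expected: Risky
```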
Decision Tree:
Fig 1.1: an example decision tree
Fig 1.1 represents a simple decision tree that is used for the classification task of
deciding whether a customer gets a loan or not. The input features are the salary of
the person, the number of children and the age of the person. The decision tree uses
these attributes or features and asks the right question at the right step or node so
as to classify whether the loan can be provided to the person or not.
Terminologies
Node: The blue-coloured rectangles shown above are what we call the nodes of the
tree. In a decision tree, a question is asked at each node and, based on the answer,
a certain outcome is selected.
Root Node or Root: In a decision tree, the topmost node is called the root node. In
the above tree, the node that asks "age over 30?" is the root node.
Leaf node: Nodes that do not have any children are called leaf nodes (Get Loan,
Don't get Loan). Leaf nodes hold the output labels. (A small code sketch of this
tree follows below.)
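The following sketch encodes a loan decision tree like the one in fig 1.1 as nested questions; since only the attributes are described in the text, the exact thresholds and branch order are assumptions made for illustration.

```python
# Sketch of the loan decision tree from fig 1.1 as nested if/else questions.
# The exact thresholds and branch order are assumptions for illustration.

def get_loan(age, salary, children):
    # Root node: a question on age.
    if age > 30:
        # Internal node: a question on salary.
        return "Get Loan" if salary > 30000 else "Don't get Loan"
    else:
        # Internal node: a question on the number of children.
        return "Get Loan" if children == 0 else "Don't get Loan"

# Leaf nodes hold the output labels ("Get Loan" / "Don't get Loan").
print(get_loan(age=35, salary=50000, children=2))   # Get Loan
print(get_loan(age=25, salary=20000, children=1))   # Don't get Loan
```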
Some algorithms used in Decision Trees: ID3, C4.5 and CART.
Clustering
The process of grouping a set of physical objects into classes of similar objects is
called clustering.
Cluster – similar objects are grouped within a cluster, and dissimilar objects are
grouped in other clusters.
Cluster applications – pattern recognition, image processing and market research.
Clustering or cluster analysis is a machine learning technique which groups an
unlabelled dataset.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method
constructs 'k' partitions of the data. Each partition will represent a cluster and k ≤ n. It
means that it will classify the data into k groups, which satisfy the following
requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create
an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by
moving objects from one group to another.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects.
We can classify hierarchical methods on the basis of how the hierarchical
decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this approach, we start
with each object forming a separate group. It keeps on merging the objects or groups
that are close to one another. It keeps doing so until all of the groups are merged
into one or until the termination condition holds. (A small sketch is given after the
two approaches below.)
Divisive Approach
This approach is also known as the top-down approach. In this approach, we start with
all of the objects in the same cluster. In each successive iteration, a cluster is
split into smaller clusters. This is done until each object is in its own cluster or
until the termination condition holds. This method is rigid, i.e., once a merging or
splitting is done, it can never be undone.
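A minimal sketch of the agglomerative (bottom-up) approach with single-linkage merging is given below; the one-dimensional points and the termination condition (three clusters) are hypothetical.

```python
# Sketch of agglomerative (bottom-up) clustering with single linkage.
# Each object starts in its own group; the two closest groups are merged
# repeatedly until the desired number of clusters remains. Data are hypothetical.

points = [1.0, 1.5, 5.0, 5.2, 9.0]
clusters = [[p] for p in points]          # start: each object is its own cluster

def single_link(c1, c2):
    # Distance between clusters = smallest distance between their members.
    return min(abs(a - b) for a in c1 for b in c2)

target = 3                                 # termination condition
while len(clusters) > target:
    # Find the pair of clusters that are closest to one another and merge them.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i] += clusters[j]
    del clusters[j]

print(clusters)    # [[1.0, 1.5], [5.0, 5.2], [9.0]]
```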
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and then
performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue
growing the given cluster as long as the density in the neighborhood exceeds some
threshold, i.e., for each data point within a given cluster, the radius of a given
cluster has to contain at least a minimum number of points.
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a
finite number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the
quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data
for a given model. This method locates the clusters by clustering the density
function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user expectation or the
properties of desired clustering results. Constraints provide us with an interactive
way of communication with the clustering process. Constraints can be specified by
the user or the application requirement.
2. Manhattan Distance: It is the sum of the absolute differences of the coordinates,
i.e., the distance measured along the X-Axis and Y-Axis.
In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
Here the total distance of the Red line gives the Manhattan distance between both
the points.
3. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of the
intersection of those sets divided by the size of their union (the Jaccard distance
is one minus this value).
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance between P1 and P2 is given as:
D(P1, P2) = (|X1 – Y1|^h + |X2 – Y2|^h + ... + |XN – YN|^h)^(1/h), where the order
h ≥ 1 (h = 1 gives the Manhattan distance and h = 2 the Euclidean distance).
Cosine similarity: cos(θ) = (A · B) / (||A|| ||B||). Here θ (theta) gives the angle
between the two vectors, and A and B are n-dimensional vectors.
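The distance and similarity measures above can be sketched in a few lines of Python; the sample points, sets and the order parameter h used below are illustrative only.

```python
# Small sketch of the distance/similarity measures described above.
import math

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, h=3):
    # h = 1 gives the Manhattan distance, h = 2 the Euclidean distance.
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1.0 / h)

def jaccard(set1, set2):
    # Jaccard index: |intersection| / |union|
    return len(set1 & set2) / len(set1 | set2)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(manhattan((1, 2), (4, 6)))                      # 7
print(round(minkowski((1, 2), (4, 6), h=2), 3))       # 5.0 (Euclidean)
print(round(jaccard({"bread", "milk"}, {"bread", "beer"}), 3))   # 0.333
print(round(cosine_similarity((1, 0, 1), (1, 1, 1)), 3))         # 0.816
```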
In Hierarchical Clustering, the aim is to produce a hierarchical series of nested
clusters. A diagram called a Dendrogram (a dendrogram is a tree-like diagram
that records the sequences of merges or splits) graphically represents this
hierarchy; it is an inverted tree that describes the order in which objects are
merged (bottom-up view) or clusters are broken up (top-down view).
Partitioning Method:
This clustering method classifies the information into multiple groups based on
the characteristics and similarity of the data. It requires the data analyst to
specify the number of clusters that has to be generated for the clustering method.
In the partitioning method, when a database (D) contains multiple (N) objects, the
partitioning method constructs a user-specified number (K) of partitions of the data,
in which each partition represents a cluster and a particular region. There are many
algorithms that come under the partitioning method; some of the popular ones are
K-Means, PAM (K-Medoids) and the CLARA algorithm (Clustering LARge Applications).
Algorithm: K-means
Input:
K: the number of clusters into which the dataset has to be divided
D: a dataset containing N objects
Output:
A set of K clusters (the method steps are sketched below)
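Since the method steps themselves are not listed above, here is a hedged sketch of the usual K-means procedure (arbitrary initial centers, assignment of each object to the nearest center, recomputation of the means); the one-dimensional data are hypothetical.

```python
# Minimal sketch of the K-means method for the Input/Output specification above.
# Points are hypothetical one-dimensional values.
import random

def k_means(data, k, iterations=10):
    centers = random.sample(data, k)             # arbitrarily choose k initial centers
    for _ in range(iterations):
        # (Re)assign each object to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            idx = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        # Update each cluster center to the mean of its members.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

D = [1.0, 1.2, 0.8, 5.0, 5.5, 4.8, 9.1, 9.0]
clusters, centers = k_means(D, k=3)
print(clusters)
print(centers)
```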
CURE Architecture
CURE (Clustering Using REpresentatives) is a hierarchical clustering algorithm in
which each cluster is represented by a fixed number of well-scattered representative
points that are shrunk toward the cluster centroid.
Chameleon Clustering
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to
decide the similarity among pairs of clusters.
Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbor
graph into a large number of relatively small subclusters.
In Chameleon, cluster similarity is assessed depending on how well-connected
objects are inside a cluster and on the proximity of clusters. Especially, two
clusters are combined if their interconnectivity is high and they are close together.
Density-Based Clustering
Directly density-reachable: Given a set of objects D, we say that an object p is
directly density-reachable from object q if p is within the ε-neighborhood of q and
q is a core object.
Density-reachable: An object p is density-reachable from object q with respect to ε
and MinPts in a set of objects D if there is a chain of objects p1, ..., pn, where
p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect
to ε and MinPts, for 1 ≤ i ≤ n, pi ∈ D.
Density-connected: An object p is density-connected to object q with respect to ε
and MinPts in a set of objects D if there is an object o in D such that both p and q
are density-reachable from o with respect to ε and MinPts.
It needs density parameters as a termination condition.
DBSCAN Algorithm
For example, say we let MinPts = 3: an object is a core object if its ε-neighborhood
contains at least 3 objects.
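A minimal sketch of the DBSCAN idea with MinPts = 3 is given below; the points and the radius eps are hypothetical, and the implementation is a simplification of the full algorithm.

```python
# Minimal sketch of DBSCAN with MinPts = 3: core objects have at least MinPts
# neighbours inside radius eps; clusters grow from core objects through
# density-reachable points. Data and eps are hypothetical.
import math

def dbscan(points, eps=1.5, min_pts=3):
    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)      # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1             # not a core object: mark as noise (for now)
            continue
        labels[i] = cluster            # start a new cluster from this core object
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] in (None, -1):
                labels[j] = cluster
                j_nbrs = neighbours(j)
                if len(j_nbrs) >= min_pts:   # j is also a core object: expand
                    queue.extend(j_nbrs)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 10)]
print(dbscan(pts))      # [0, 0, 0, 1, 1, 1, -1]
```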
It is good for both automatic and interactive cluster analysis, including finding an
intrinsic clustering structure.
It can be represented graphically or using visualization techniques.
Grid-Based Clustering
Grid-Based Clustering method uses a multi-resolution grid data structure.
Each cell at a high level is partitioned into several smaller cells at the next
lower level.
The statistical information of each cell is calculated and stored beforehand and is
used to answer queries.
The parameters of higher-level cells can be easily calculated from the parameters of
the lower-level cells:
Count, mean, standard deviation, min, max
Type of distribution—normal, uniform, etc.
Spatial data queries are then answered using a top-down approach:
Start from a pre-selected layer—typically one with a small number of cells.
For each cell in the current level, compute the confidence interval.
Remove the irrelevant cells from further consideration.
When finished examining the current layer, proceed to the next lower level.
Repeat this process until the bottom layer is reached.
Advantages:
It is query-independent, easy to parallelize, and supports incremental updates.
Its complexity is O(K), where K is the number of grid cells at the lowest level.
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected.
CLIQUE - Clustering In QUEst
It was proposed by Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
It is based on automatically identifying the subspaces of high dimensional data
space that allow better clustering than original space.
CLIQUE can be considered as both density-based and grid-based:
It partitions each dimension into the same number of equal-length intervals.
It partitions an m-dimensional data space into non-overlapping rectangular
units.
A unit is dense if the fraction of the total data points contained in the unit
exceeds an input model parameter.
A cluster is a maximal set of connected dense units within a subspace.
Partition the data space and find the number of points that lie inside each cell
of the partition.
Identify the subspaces that contain clusters using the Apriori principle.
Identify clusters:
Determine dense units in all subspaces of interest.
Determine connected dense units in all subspaces of interest.
Advantages
It automatically finds subspaces of the highest dimensionality such that
high-density clusters exist in those subspaces.
It is insensitive to the order of records in input and does not presume some
canonical data distribution.
It scales linearly with the size of input and has good scalability as the
number of dimensions in the data increases.
Disadvantages
The accuracy of the clustering result may be degraded at the expense of the
simplicity of the method.
Model-based clustering
The model-based clustering method is an attempt to optimize the fit between the data
and some mathematical model. It uses statistical and AI approaches.
Each cluster corresponds to a different distribution, and these distributions are
assumed to be Gaussian.
Model-based clustering is a statistical approach to data clustering. The observed
(multivariate) data is considered to have been created from a finite combination of
component models. Each component model is a probability distribution, generally
a parametric multivariate distribution.
Model-based clustering is an attempt to improve the fit between the given data and
some mathematical model, and is based on the assumption that the data are generated
by a mixture of underlying probability distributions.
The types of model-based clustering are as follows −
1. Statistical approach − Expectation maximization (EM) is a popular iterative
refinement algorithm. It can be viewed as an extension of k-means:
It assigns each object to a cluster according to a weight (probability
distribution).
New means are computed based on weighted measures.
The basic idea is as follows −
Start with an initial estimate of the parameter vector.
Iteratively rescore the patterns against the mixture density produced by the
parameter vector.
The rescored patterns are then used to update the parameter estimates.
Patterns belonging to the same cluster are those placed by their scores in the
same component.
2. Machine learning approach − Machine learning is an approach that uses
complex algorithms for processing huge amounts of data and delivering results to
its users. It uses complex programs that can learn through experience and make
predictions.
The algorithms improve themselves through the frequent input of training data. The
main objective of machine learning is to learn from data and build models from data
that can be understood and used by humans.
Given a set of transactions, we can find rules that will predict the occurrence of
an item based on the occurrences of other items in the transaction.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Support: Support is the frequency of an itemset A, i.e., how frequently the itemset
appears in the dataset.
Confidence: It is the ratio of the number of transactions that contain both X and Y
to the number of transactions that contain X.
Lift: It is the strength of a rule, defined as the ratio of the rule's confidence to
the support of Y: Lift(X → Y) = Confidence(X → Y) / Support(Y).
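Using the five transactions listed above, support, confidence and lift can be computed directly; the rule {Milk, Diaper} → {Beer} below is chosen only as an example.

```python
# Computing support, confidence and lift for the five transactions listed
# above, for the example rule {Milk, Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
sup = support(X | Y)                  # fraction of transactions containing X and Y
conf = support(X | Y) / support(X)    # of those containing X, how many also contain Y
lift = conf / support(Y)              # confidence relative to Y's overall frequency

print(sup, conf, lift)                # 0.4, 0.667, 1.111
```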
Parallel Algorithm
An algorithm is a sequence of steps that take inputs from the user and after some
computation, produces an output. A parallel algorithm is an algorithm that can
execute several instructions simultaneously on different processing devices and
then combine all the individual outputs to produce the final result.
In parallel computing, multiple processors perform multiple tasks assigned to
them simultaneously. Memory in parallel systems can be either shared or
distributed. Parallel computing provides concurrency and saves time and money.
What is Parallelism?
Parallelism is the process of processing several sets of instructions simultaneously.
It reduces the total computational time. Parallelism can be implemented by
using parallel computers, i.e. computers with many processors. Parallel
computers require parallel algorithms, programming languages, compilers and
operating systems that support multitasking.
Distributed algorithm
A distributed algorithm is an algorithm designed to run on computer
hardware constructed from interconnected processors. Distributed algorithms are
used in different application areas of distributed computing, such
as telecommunications, scientific computing, distributed information processing,
and real-time process control.
Distributed algorithms are a sub-type of parallel algorithm, typically
executed concurrently, with separate parts of the algorithm being run
simultaneously on independent processors, and having limited information about
what the other parts of the algorithm are doing.
In distributed computing, we have multiple autonomous computers which appear
to the user as a single system. In distributed systems there is no shared memory,
and the computers communicate with each other through message passing. In
distributed computing, a single task is divided among different computers.
Parallel computing, also known as parallel processing, speeds up a computational
task by dividing it into smaller jobs across multiple processors inside one
computer. Distributed computing, on the other hand, uses a distributed system,
such as the internet, to increase the available computing power and enable larger,
more complex tasks to be executed across multiple machines.
S.No. | Parallel Computing | Distributed Computing
1. Many operations are performed simultaneously. | System components are located at different locations.
2. A single computer is required. | Multiple computers are used.
3. Multiple processors perform multiple operations. | Multiple computers perform multiple operations.
4. It may have shared or distributed memory. | It has only distributed memory.
5. Processors communicate with each other through a bus. | Computers communicate with each other through message passing.
6. It improves system performance. | It improves system scalability, fault tolerance and resource-sharing capabilities.
Neural Network:
In information technology (IT), an artificial neural network (ANN) is a system of
hardware and/or software patterned after the operation of neurons in the human
brain. ANNs -- also called, simply, neural networks -- are a variety of deep
learning technology, which also falls under the umbrella of artificial intelligence,
or AI.
Commercial applications of these technologies generally focus on solving
complex signal processing or pattern recognition problems. Examples of
significant commercial applications since 2000 include handwriting recognition for
check processing, speech-to-text transcription, oil-exploration data analysis,
weather prediction and facial recognition.
Neural Network Architecture:
While there are numerous different neural network architectures that have been
created by researchers, the most successful applications of neural networks in data
mining have been multilayer feedforward networks. These are networks in
which there is an input layer consisting of nodes that simply accept the input
values, and successive layers of nodes that are neurons, as depicted in the above
figure of an artificial neuron. The outputs of neurons in a layer are inputs to
neurons in the next layer. The last layer is called the output layer. Layers between
the input and output layers are known as hidden layers.
Multilayer Feedforward Neural Networks
A multilayer feedforward neural network is an interconnection of perceptrons in
which data and calculations flow in a single direction, from the input data to the
outputs. The number of layers in a neural network is the number of layers of
perceptrons. The simplest neural network is one with a single input layer and an
output layer of perceptrons. The network in Figure illustrates this type of network.
Technically, this is referred to as a one-layer feedforward network with two
outputs because the output layer is the only layer with an activation calculation.
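A small sketch of a forward pass through such a network (two inputs, one hidden layer of three neurons, one output, sigmoid activations) is shown below; all weights, biases and inputs are hypothetical.

```python
# Sketch of a forward pass through a small multilayer feedforward network
# (one hidden layer, one output). Weights and inputs are hypothetical.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.9])                  # input layer: accepts the input values

W1 = np.array([[0.2, -0.4],               # weights from inputs to 3 hidden neurons
               [0.7,  0.1],
               [-0.5, 0.3]])
b1 = np.array([0.1, -0.2, 0.05])

W2 = np.array([[0.6, -0.3, 0.8]])         # weights from hidden layer to 1 output neuron
b2 = np.array([0.05])

hidden = sigmoid(W1 @ x + b1)             # hidden layer: its outputs feed the next layer
output = sigmoid(W2 @ hidden + b2)        # output layer
print(output)
```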
Genetic Algorithms
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to
the larger class of evolutionary algorithms. Genetic algorithms are based on the
ideas of natural selection and genetics. They are an intelligent exploitation of
random search, provided with historical data, to direct the search into the region of
better performance in the solution space. They are commonly used to generate
high-quality solutions for optimization problems and search problems.
Genetic algorithms simulate the process of natural selection, which means that
the species that can adapt to changes in their environment are able to survive,
reproduce and go to the next generation. In simple words, they simulate "survival
of the fittest" among the individuals of consecutive generations to solve a
problem. Each generation consists of a population of individuals, and each
individual represents a point in the search space and a possible solution. Each
individual is represented as a string of characters/integers/floats/bits. This string
is analogous to a chromosome.
Search space
The population of individuals is maintained within the search space. Each
individual represents a solution in the search space for the given problem. Each
individual is coded as a finite-length vector (analogous to a chromosome) of
components. These variable components are analogous to genes. Thus a
chromosome (individual) is composed of several genes (variable components).
Fitness Score
A fitness score is given to each individual which shows the ability of an
individual to "compete". Individuals having an optimal (or near-optimal) fitness
score are sought.
In each generation, the algorithm repeats the following steps (a small sketch follows this list):
a) Select parent individuals based on fitness
b) Perform crossover to generate a new population
c) Perform mutation on the new population
d) Calculate fitness for the new population
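A minimal sketch of this loop is given below, evolving bit strings toward the all-ones string; the population size, mutation rate and other parameters are arbitrary choices for illustration.

```python
# Minimal sketch of the GA loop above (selection, crossover, mutation, fitness),
# evolving bit strings toward the all-ones string. Parameters are hypothetical.
import random

GENES, POP, GENERATIONS = 10, 20, 30

def fitness(ind):                     # fitness score: number of 1s in the chromosome
    return sum(ind)

def select(pop):                      # a) select a parent (tournament of size 2)
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                # b) single-point crossover
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(ind, rate=0.05):           # c) flip each gene with a small probability
    return [1 - g if random.random() < rate else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]          # d) new population, re-scored next loop

best = max(population, key=fitness)
print(best, fitness(best))
```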
K-Nearest Neighbours Classification
K-Nearest Neighbours is one of the most basic yet essential classification
algorithms in Machine Learning. It belongs to the supervised learning domain
and finds intense application in pattern recognition, data mining and intrusion
detection.
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
o K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and, at the time of
classification, it performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets
new data, it classifies that data into the category that is most similar to the
new data.
o Example: Suppose we have an image of a creature that looks similar to a cat
and a dog, but we want to know whether it is a cat or a dog. For this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new image that are most
similar to the cat and dog images, and based on the most similar features it will
put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a
new data point x1; in which of these categories will this data point lie? To solve
this type of problem, we need a K-NN algorithm. With the help of K-NN, we can
easily identify the category or class of a particular data point. Consider the below
diagram:
How does K-NN work?
The working of K-NN can be explained on the basis of the below algorithm (a small code sketch follows the steps):
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of the data points in
each category.
o Step-5: Assign the new data points to that category for which the number of
the neighbor is maximum.
o Step-6: Our model is ready.
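A minimal sketch of these steps (K = 3, Euclidean distance, majority vote) is given below; the two-dimensional points and the category labels are hypothetical.

```python
# Minimal sketch of the K-NN steps above (K = 3, Euclidean distance).
# The two-dimensional points and the category labels are hypothetical.
import math
from collections import Counter

training = [
    ((1.0, 1.1), "Category A"), ((1.2, 0.9), "Category A"), ((0.8, 1.0), "Category A"),
    ((3.0, 3.2), "Category B"), ((3.1, 2.9), "Category B"), ((2.8, 3.0), "Category B"),
]

def knn_classify(x1, k=3):
    # Step 2-3: compute Euclidean distances and take the K nearest neighbours.
    nearest = sorted(training, key=lambda item: math.dist(x1, item[0]))[:k]
    # Step 4-5: count the data points in each category and pick the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 1.0)))   # expected: Category A
print(knn_classify((3.0, 3.0)))   # expected: Category B
```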