2101CS521
Unit 1
Introduction to Data Mining (DM)
[Figure: data mining process. Input Data → Data Preprocessing (Feature Selection, Dimensionality Reduction, Normalization, Data Subsetting) → Data Mining → Post Processing (Filtering Patterns, Visualization, Pattern Interpretation) → Information]
[Figure: data mining architecture with the components Database or Data Warehouse Server, Data Mining Engine, Pattern Evaluation, and Knowledge Base]
Descriptive
• This task presents the general properties of data stored in a database.
• The descriptive tasks are used to find out patterns in data.
• E.g.: clusters, trends, etc.
Predictive
• These tasks predict the value of one attribute on the basis of values of other
attributes.
• E.g.: predicting customer visits or product sales at a store during a festival.
• Frequent Subsequence
• A sequence of patterns that occurs frequently, such as the purchase of a laptop being followed by the purchase of a digital camera and a memory card.
• Example
• Suppose we have a transactional dataset from an electronics store, and we want to discover
associations between purchased items. Here is a simple example of an association rule
generated from the data:
• buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%],
• where X is a variable representing a customer.
• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a
50% chance that she will buy software as well.
• A 1% support means that 1% of all the transactions under analysis show that computer and
software are purchased together.
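The two measures above can be computed directly from a list of transactions. The following is a minimal sketch; the transactions and the helper names `support` and `confidence` are hypothetical and chosen only for illustration, so the resulting numbers differ from the 1% and 50% in the rule above.

```python
# Hypothetical toy transactions (not data from the store example above).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
    {"computer", "camera"},
    {"camera"},
    {"printer", "camera"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here 2 of the 8 transactions contain both items and 4 contain a computer, so the sketch reports a support of 25% and a confidence of 50%.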
• Classification
• It predicts the class of objects whose class label is unknown.
• The derived model is based on the analysis of a set of training data, i.e., data objects whose
class labels are known.
• Example: Consider a scenario where you receive a large volume of emails, and you want to
automatically classify them as spam or non-spam.
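As a rough illustration of learning from labeled training data, the sketch below uses a 1-nearest-neighbor classifier with Jaccard similarity over word sets. This is only one possible technique, swapped in for illustration; the source does not name a specific classifier, and the training emails and function names are hypothetical.

```python
# Hypothetical labeled training data: (email text, class label).
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("project report attached", "non-spam"),
]


def jaccard(a, b):
    """Similarity of two word sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


def classify(text, training):
    """Return the label of the most similar training email (1-nearest neighbor)."""
    words = set(text.lower().split())
    _, best_label = max(
        training, key=lambda example: jaccard(words, set(example[0].split()))
    )
    return best_label


print(classify("claim your free prize", training))
print(classify("monday project meeting", training))
```

The first email shares the most words with a spam example and the second with a non-spam example, so they receive those labels. A real spam filter would need far more training data and a more robust model.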
• Prediction
• It is used to predict missing or unavailable numerical data values rather than class labels.
• Regression Analysis is generally used for prediction.
• Example: Let's consider a scenario where you want to predict the price of a house based
on its size (in square feet).
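The house-price scenario can be sketched with simple linear regression fitted by ordinary least squares. The sizes and prices below are hypothetical and deliberately lie on a straight line so the fit is easy to check by hand; `fit_line` is an illustrative helper, not a library function.

```python
# Hypothetical training data: house size (sq ft) and price.
sizes = [1000, 1500, 2000, 2500]
prices = [200_000, 275_000, 350_000, 425_000]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)
predicted = intercept + slope * 1800  # predict the price of an 1800 sq ft house
print(f"price = {intercept:.0f} + {slope:.0f} * size; prediction = {predicted:.0f}")
```

Unlike classification, the output is a numerical value (a price) rather than a class label.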
• Cluster Analysis
• Clustering analyzes data objects without consulting class labels.
• In many cases, class-labeled data may simply not exist at the beginning.
• Clustering can be used to generate class labels for a group of data.
• The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.
• That is, clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are rather dissimilar to objects in other clusters.
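One common way to form such groups is k-means, sketched below for one-dimensional data. The source does not prescribe a particular clustering algorithm, so k-means, the sample values, and the initial centers are all assumptions made for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd's k-means on 1-D data, starting from the given initial centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled data with two obvious groups.
values = [1, 2, 3, 10, 11, 12]
centers, clusters = k_means_1d(values, centers=[1, 12])
print(centers, clusters)
```

No class labels are used at any point; the two resulting groups could afterwards serve as generated labels for the data.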
• Outlier Analysis
• A data set may contain objects that do not comply with the general behavior or model of
the data.
• These data objects are outliers.
• Many data mining methods discard outliers as noise or exceptions.
• Example: Consider a dataset that records the attendance of students in a class over a
semester. By examining the dataset, we notice that the attendance of some students is
significantly lower than that of the other students.
Types of Attributes
• Qualitative
  • Nominal
  • Ordinal
  • Binary
    • Symmetric
    • Asymmetric
• Quantitative
  • Discrete
  • Continuous
2) Continuous Attribute
Real numbers as attribute values.
The attributes temperature, height, or weight are the examples of continuous
attributes.
Practically, real values can only be measured and represented using a finite number
of digits.
Continuous attributes are typically represented as floating-point variables.
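The effect of finite precision is easy to observe. The short sketch below shows that a floating-point sum can differ slightly from the exact real value, which is why continuous attributes are usually compared with a tolerance; the variable name is illustrative.

```python
import math

total = 0.1 + 0.2  # neither 0.1 nor 0.2 is exactly representable in binary

print(total == 0.3)              # False: the stored values are approximations
print(math.isclose(total, 0.3))  # True: equal within a small tolerance
```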
Ratio Attribute
A ratio attribute is like an interval attribute, but it must have a true (absolute)
zero value.
It tells us about the order of values and the exact difference between them.
Example
Age group: 10-20, 30-50, 35-45 (in years)
Mass: 20-30 kg, 10-15 kg
Because a ratio attribute has a true (absolute) zero, it is possible to compute ratios.
#2101CS521 (DM) Unit 1 – Introduction to
Prof. Jayesh D. Vagadiya 42
Mean (Average)
Mean is the average of a dataset.
The mean is the total of all the values, divided by the number of values.
Formula to find the mean:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Example
Find the mean of 12, 15, 11, 11, 7, 13 (here the number of values is n = 6).
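The example can be worked through directly: the values sum to 69 and there are 6 of them, so the mean is 69 / 6 = 11.5. A minimal check of this arithmetic:

```python
values = [12, 15, 11, 11, 7, 13]

n = len(values)          # number of values
total = sum(values)      # total of all the values
mean = total / n         # total divided by the number of values

print(f"n = {n}, total = {total}, mean = {mean}")
```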
[Figure: data divided into four parts of 25% each by Q1 (25th percentile), Q2 (median), and Q3 (75th percentile)]
First Quartile (Q1) or 25th Percentile: Q1 is the value below which 25% of
the data falls. This means that 25% of the data points in the dataset are
less than or equal to Q1.
Second Quartile (Q2) or Median or 50th Percentile: Q2 is the value that
separates the dataset into two equal halves.
Third Quartile (Q3) or 75th Percentile: Q3 is the value below which 75% of
the data falls. This means that 75% of the data points in the dataset are
less than or equal to Q3.
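The three quartiles can be computed with Python's standard library, as in the sketch below. The dataset is hypothetical, and the `"inclusive"` method is an assumption: several quartile conventions exist and they can give slightly different values on small datasets.

```python
import statistics

# Hypothetical dataset of 9 values.
data = [3, 5, 7, 8, 9, 11, 13, 15, 17]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
```

With this method Q2 coincides with the ordinary median of the data.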
Quantiles
The distance between the first and third quartiles is called the
interquartile range (IQR) and is defined as IQR = Q3 − Q1.
[Figure: box plot on an axis from -5 to 5 showing Minimum (Q1 − 1.5 × IQR), Q1 (25th percentile), Median Q2 (50th percentile), Q3 (75th percentile), and Maximum (Q3 + 1.5 × IQR)]
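The limits Q1 − 1.5 × IQR and Q3 + 1.5 × IQR are often used as fences for flagging outliers. The sketch below applies them to hypothetical attendance percentages; the data and the `"inclusive"` quartile method are assumptions made for illustration.

```python
import statistics

# Hypothetical attendance percentages for nine students.
attendance = [40, 88, 90, 91, 92, 93, 94, 95, 96]

q1, _, q3 = statistics.quantiles(attendance, n=4, method="inclusive")
iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # box-plot "minimum"
upper_fence = q3 + 1.5 * iqr   # box-plot "maximum"

# Values outside the fences are flagged as outliers.
outliers = [x for x in attendance if x < lower_fence or x > upper_fence]
print(f"IQR = {iqr}, fences = [{lower_fence}, {upper_fence}], outliers = {outliers}")
```

Here the single very low attendance value falls below the lower fence and is flagged.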
Data Matrix vs Dissimilarity Matrix
Data Matrix:
A data matrix, also known as a feature matrix or attribute matrix, is a structured
representation of a dataset where rows represent observations or data points, and
columns represent attributes or variables.
Each cell in the matrix contains the value of a specific attribute for a particular data
point.
A data matrix is made up of two entities or “things,” namely rows (for objects) and
columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix.
Dissimilarity Matrix :
A dissimilarity matrix, also known as a distance matrix, is a
square matrix that quantifies the dissimilarity or distance between pairs of data
points in a dataset.
Each cell in the matrix represents the dissimilarity measure between two data points.
The dissimilarity matrix contains one kind of entity (dissimilarities) and so is called a
one-mode matrix.
[Figure: Data Matrix and Dissimilarity Matrix shown side by side]
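The relationship between the two matrices can be sketched by deriving a dissimilarity matrix from a small data matrix. Euclidean distance is used here as the dissimilarity measure, which is an assumption; the data values and the function name are hypothetical.

```python
import math

# Hypothetical data matrix: 3 objects (rows) x 2 attributes (columns).
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]


def dissimilarity_matrix(rows):
    """Square n x n matrix of pairwise Euclidean distances between the rows."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


distances = dissimilarity_matrix(data_matrix)
for row in distances:
    print(row)
```

The result is square (objects by objects), has zeros on the diagonal, and is symmetric, whereas the data matrix is objects by attributes.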