
Data Mining

LAKSHMI VIVEKA KESANAPALLI


Associate Professor
Dept of CSE (Artificial Intelligence)
Pragati Engineering College

UNIT – II PART - I

Syllabus

 Data Mining: Introduction, What is Data Mining?, Motivating Challenges, The Origins of Data Mining, Data Mining Tasks, Types of Data, Data Quality (Tan & Vipin)

Data Mining
 Data mining is a process of automatically
discovering knowledge/useful information
in large data repositories.
 Eg: Predicting whether a newly arrived
customer will spend more than $100 at a
department store.
 Data mining is an integral part of
knowledge discovery in databases (KDD),
which is the overall process of converting
raw data into useful information.

Process of Knowledge Discovery in Databases (KDD)

[Figure: the stages of the KDD process, from input data through preprocessing and data mining to postprocessing and useful information]
Motivating Challenges
The following are the challenges that
motivated the development of data
mining.
 Scalability
 High Dimensionality
 Heterogeneous and Complex Data
 Data Ownership and Distribution
 Non-Traditional Analysis

 Scalability:
◦ Data mining algorithms must be scalable to
handle massive data sets.
◦ Scalability may require the implementation of
novel data structures to access individual
records in an efficient manner.
◦ Scalability can also be improved by using
sampling or developing parallel and
distributed algorithms.

 High Dimensionality:
◦ In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features/attributes.
◦ Data sets with temporal or spatial components
also tend to have high dimensionality.
◦ Traditional data analysis techniques do not work well for high-dimensional data.
◦ Moreover, the computational complexity increases rapidly as the dimensionality increases.

 Heterogeneous and Complex Data
◦ In recent years, we have increasingly seen data with heterogeneous attributes and complex data objects.
 Eg: Hyperlinks
 DNA data
 Climate Data
◦ Techniques for mining such complex objects need to take into account relationships in the data, such as graph connectivity and temporal and spatial autocorrelation.

 Data ownership and Distribution
 Sometimes, the data is geographically
distributed among resources belonging to
multiple entities.
 This requires the development of
distributed data mining techniques.

 Non-Traditional Analysis
 Current data analysis tasks often require the generation and evaluation of thousands of hypotheses; data mining techniques were developed to automate this process of hypothesis generation and evaluation.

The Origins of Data Mining
 Data mining draws upon ideas from
◦ Sampling, estimation and hypothesis testing
from statistics.
◦ Search algorithms, modelling techniques, and learning theories from artificial intelligence, pattern recognition and machine learning.

Data Mining: A Confluence of Multiple Disciplines

[Figure: data mining as a confluence of statistics, artificial intelligence, machine learning and pattern recognition]
Data Mining Tasks
 Data mining tasks are generally divided
into two major categories
◦ Predictive tasks
◦ Descriptive tasks
 Predictive tasks:
The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes.
The attribute to be predicted is called the target or dependent variable.
The attributes used for making the prediction are called explanatory or independent variables.

 Descriptive Tasks:
The objective is to derive patterns (correlations, trends, clusters, trajectories and anomalies) that summarize the underlying relationships in the data.
Descriptive data mining tasks are often exploratory in nature and frequently require post-processing techniques to validate and explain the results.

 The following diagram represents the four core data mining tasks.

[Figure: the four core data mining tasks (predictive modelling, association analysis, cluster analysis and anomaly detection) applied to a data set]
(i) Predictive Modelling
 It refers to building a model for the target
variable as a function of the explanatory
variables. There are two types of
predictive modelling tasks.
◦ Classification which is used for discrete target
variables.
◦ Eg: Predicting whether or not it will rain heavily tomorrow.

 Regression which is used for continuous
target variables.
 Eg: Predicting the future price of a stock.

The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable.
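A minimal sketch of both task types, assuming scikit-learn is available; the toy weather and stock data, the feature choices and the specific models are illustrative assumptions, not part of the lecture:

```python
# Classification and regression on toy data (assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: discrete target (1 = heavy rain tomorrow, 0 = no heavy rain)
X_weather = [[30, 85], [25, 60], [28, 90], [22, 40]]   # [temperature, humidity]
y_rain = [1, 0, 1, 0]
clf = DecisionTreeClassifier().fit(X_weather, y_rain)
print(clf.predict([[27, 88]]))          # predicted class label for a new day

# Regression: continuous target (future stock price)
X_days = [[1], [2], [3], [4]]           # day index as the explanatory variable
y_price = [100.0, 101.5, 103.1, 104.4]
reg = LinearRegression().fit(X_days, y_price)
print(reg.predict([[5]]))               # predicted price for day 5
```

Any classifier/regressor pair would illustrate the same point; a decision tree and linear regression are used here only because they are simple to set up.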

(ii) Association Analysis
It is used to discover patterns that
describe strongly associated features in the data.
The discovered patterns are typically
represented in the form of implication rules.
Eg: { bread } -> { Milk }

The applications of association analysis include:
 Finding groups of genes with related functionality
 Identifying products that are frequently co-purchased (market basket analysis)
 Finding web pages that are accessed together.
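A small plain-Python sketch of how the strength of a rule such as { bread } -> { milk } can be measured by its support and confidence; the toy transactions below are made up for illustration:

```python
# Support and confidence of the rule {bread} -> {milk} on toy transaction data.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

antecedent, consequent = {"bread"}, {"milk"}
n = len(transactions)
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n          # fraction of all transactions containing bread AND milk
confidence = both / ante    # of the transactions with bread, the fraction that also contain milk
print(f"support={support:.2f}, confidence={confidence:.2f}")   # support=0.50, confidence=0.67
```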

(iii) Cluster Analysis
It is used to find groups of closely related objects, so that objects that belong to the same cluster are more similar to each other than to objects that belong to other clusters.
Applications of cluster analysis include:
◦ Grouping customers
◦ Data compression
◦ Document clustering
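A minimal sketch of grouping customers with k-means, assuming scikit-learn is available; the two features (annual income, spending score) and the toy values are illustrative assumptions:

```python
# Grouping customers into two clusters by (annual income, spending score).
from sklearn.cluster import KMeans

customers = [
    [15, 80], [16, 75], [17, 82],    # low income, high spending
    [70, 20], [72, 25], [75, 18],    # high income, low spending
]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # cluster label assigned to each customer
print(km.cluster_centers_)  # the two cluster centroids
```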

(iv) Anomaly detection:
It is the task of identifying objects whose characteristics are significantly different from the rest of the data. Such objects are called anomalies or outliers.
A good anomaly detector must have a high detection rate and a low false alarm rate.
Applications of anomaly detection include:
◦ Fraud detection
◦ Network intrusion detection
◦ Disease detection
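A minimal sketch of anomaly detection in plain Python, flagging values that lie far from the mean; the transaction amounts and the two-standard-deviation threshold are illustrative assumptions, not a method prescribed by the lecture:

```python
# Flag amounts that lie more than two standard deviations from the mean.
amounts = [52.0, 48.5, 50.2, 47.9, 51.3, 49.8, 250.0]   # one suspicious transaction

mean = sum(amounts) / len(amounts)
std = (sum((x - mean) ** 2 for x in amounts) / len(amounts)) ** 0.5

outliers = [x for x in amounts if abs(x - mean) > 2 * std]
print(outliers)   # [250.0]
```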
Types of Data
 Data Set:
◦ A data set is a collection of data objects.
◦ Other names for a data object are record, point, vector, event, observation or entity.
◦ Data objects are described by a number of
attributes that capture the basic characteristics
of an object.
◦ Eg: student name, branch, college name.
◦ Other names for an attribute are variable, characteristic, field, feature or dimension.
 Attribute:
◦ An attribute is a property or characteristic of
an object that may vary, either from one object
to another or from one time to another.
◦ Eg: Eye color varies from person to person.
◦ Temperature varies over time.

 Different types of Attributes:
 The following properties of numbers are
typically used to describe attributes.
1. Distinctness ( = and != )
2. Order ( <, <=, >, >= )
3. Addition ( +, - )
4. Multiplication ( *, / )
 Given these properties, we can define four types of attributes:
i. Nominal
ii. Ordinal
iii. Interval
iv. Ratio
 Nominal Attributes: (Distinctness)
◦ The values of a nominal attribute are just different names. That is, nominal values provide only enough information to distinguish one object from another ( =, != ).
 Eg: Gender, Eye Color
 Ordinal Attributes: ( Distinctness and Order)
◦ The values of an ordinal attribute provide enough
information to order objects (<, >).
◦ Nominal and ordinal attributes are collectively
referred to as categorical or qualitative attributes.
◦ Eg: Grade of a Student (A, B, C, D)
◦ Shirt size (S, M, L, XL)

 Interval Attributes: ( Distinctness, Order and
Addition )
◦ For interval attributes, the differences between values are meaningful and a unit of measurement exists ( +, - ).
◦ Eg: Temperature in Celsius or Fahrenheit
◦ Calendar Dates
 Ratio Attributes: (All the four properties)
◦ For ratio attributes, both differences and ratios are meaningful ( *, / ); the short sketch below illustrates how this differs from interval attributes.
◦ Interval and ratio attributes are collectively called as
quantitative or numeric attributes.
◦ Eg: Electric Current
◦ Age
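A short plain-Python sketch of why ratios are meaningful for a ratio attribute such as temperature in Kelvin but not for an interval attribute such as temperature in Celsius; the temperatures are illustrative:

```python
# Ratios are meaningful in Kelvin (true zero) but not in Celsius (arbitrary zero).
c1, c2 = 20.0, 40.0                  # temperatures in Celsius
print(c2 / c1)                       # 2.0, but 40 C is not "twice as hot" as 20 C

k1, k2 = c1 + 273.15, c2 + 273.15    # the same temperatures in Kelvin
print(k2 / k1)                       # about 1.07, a physically meaningful ratio
```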
 Describing attributes by the number of
values:
1) Discrete Attribute:
It has only a finite or countably infinite set of values.
Eg: Pincode, Counts, ID numbers
Discrete attributes are often represented as
integer variables.
Binary attributes are a special case of discrete
attributes and assume only two values.
Eg: True/False
Male/Female
Yes/No
0/1

2) Continuous Attribute:
It has real numbers as attribute values.
Eg: Temperature, height, weight
Continuous attributes are typically
represented as floating point variables.

Types of Data Sets:
Data sets are broadly grouped into three types:
i. Record data
ii. Graph data
iii. Ordered data

 Record Data:
The data set is a collection of records,
each of which consists of a fixed set of
data fields.
Record data is usually stored in flat files
or in relational databases.
Different types of record data include:
Transaction data, data matrix, document-
term matrix.

Example:
TID   Refund   Marital Status   Taxable Income   Defaulted Borrower
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Divorced         95K              Yes
...   ...      ...              ...              ...

Transaction Data: It is a special type of record data, where each record (transaction) involves a set of items.
Ex: The set of items purchased by a customer.

 Data Matrix: A set of data objects can be interpreted as an m x n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
 Data matrix is a standard data format for most
statistical data.

Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      27     1.2
12.65                  6.25                   16.22      22     1.1
...                    ...                    ...        ...    ...
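A minimal sketch of the same idea, assuming NumPy is available; the values are copied from the example rows above and the remaining rows are omitted:

```python
# The example above as an m x n data matrix: rows are objects, columns are attributes.
import numpy as np

# columns: projection of x load, projection of y load, distance, load, thickness
data_matrix = np.array([
    [10.23, 5.27, 15.22, 27, 1.2],
    [12.65, 6.25, 16.22, 22, 1.1],
])
print(data_matrix.shape)    # (2, 5): m = 2 objects, n = 5 attributes
print(data_matrix[:, 2])    # the 'distance' column for all objects
```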

 Document-term matrix:
A document can be represented as a term vector, where each term is a component (attribute) of the vector and the value of each component is the number of times the corresponding term occurs in the document.
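A small plain-Python sketch of building a document-term matrix; the two example documents are made up for illustration:

```python
# One row per document, one column per term; entries are term counts.
from collections import Counter

docs = ["data mining finds patterns in data",
        "cluster analysis groups similar data objects"]

counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set(term for c in counts for term in c))

matrix = [[c[term] for term in vocabulary] for c in counts]
print(vocabulary)
for row in matrix:
    print(row)
```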

 Graph Data:
A Graph is a convenient and powerful
representation of data.
There are two specific cases with
graph data.
i) Data with relationship among
objects.
ii) Data with objects that are graphs.

 Data with relationships among objects:
The data objects are represented as nodes,
and the relationship among objects is
represented by links.
Eg: Linked web pages
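A minimal sketch of such data as an adjacency list in plain Python; the page names and links are made up for illustration:

```python
# Linked web pages as a graph: nodes are pages, directed edges are hyperlinks.
links = {
    "home.html":     ["products.html", "about.html"],
    "products.html": ["home.html", "contact.html"],
    "about.html":    ["home.html"],
    "contact.html":  [],
}

for page, targets in links.items():
    print(page, "->", targets)
```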

 Data with objects that are graphs:
If objects have structure, that is, the objects
contain sub-objects that have relationships, then
such objects are frequently represented as
graphs.
Eg: Structure of chemical compounds

Ordered data:
Here, the attributes have relationships
that involve order in time or space.
Sequential data:
Sequential data, also called temporal data, is data that has temporal/time information associated with it.
It can be thought of as an extension of record data, where each record has a time associated with it.
Example of pattern: "Candy sales peak before
Halloween"
 Sequence data:
Sequence data consists of a sequence of individual
entities such as a sequence of words or letters.
It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.
Eg: Genomic sequence data

 Time Series Data:
Time series data is a special type of sequential
data in which each record is a series of
measurements taken over time.
Eg: Daily prices of stocks, Temperature time series
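A small plain-Python sketch of a time series as (date, value) pairs; the dates and prices are made up for illustration:

```python
# A time series: the daily closing price of a stock as (date, value) pairs.
stock_series = [
    ("2024-02-19", 101.2),
    ("2024-02-20", 102.8),
    ("2024-02-21", 101.9),
    ("2024-02-22", 103.4),
]

# day-over-day change in the measured value
for (_, prev), (day, curr) in zip(stock_series, stock_series[1:]):
    print(day, round(curr - prev, 2))
```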

 Spatial Data:
Data that contains location information is called spatial data.
Eg: Weather data collected from various geographical locations

Data Quality

 Data mining focuses on data quality issues in two ways:
i) the detection and correction of data quality problems, and
ii) the use of algorithms that can tolerate poor data quality.
 The detection and correction of data quality problems is often called "data cleaning".
 Examples of data quality problems:
Noise
Outliers
Missing values
Duplicate data

 It is unrealistic to expect that data will be
perfect.
 There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process.
 Measurement Error refers to any problem
resulting from the measurement process.
 For continuous attributes, the numerical difference between the measured and true value is called the error.
 Data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
Noise and Artifacts
 Noise is the random component of a
measurement error.
 It may involve the distortion of a value or the addition of spurious objects.

 The term noise is often used in connection
with data that has a spatial or temporal
component.
 Techniques from signal or image
processing can frequently be used to
reduce noise.
 The elimination of noise is frequently
difficult, hence data mining focuses on
devising robust algorithms.
 Deterministic distortions of the data are called artifacts.
 Eg: A streak in the same place on a set of photographs.
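A minimal sketch of one such noise-reduction idea, a simple moving average, in plain Python; the sensor readings and the window size are illustrative assumptions:

```python
# Reduce random noise in a sequence of measurements with a simple moving average.
readings = [20.1, 19.8, 25.3, 20.2, 19.9, 20.4, 14.9, 20.0]   # noisy sensor values

window = 3
smoothed = [
    sum(readings[i:i + window]) / window
    for i in range(len(readings) - window + 1)
]
print(smoothed)
```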

Outliers
 Outliers are the data objects having
characteristics that are different from most
of the other data objects in the data set.
 Alternatively, outliers are values of an attribute that are unusual with respect to the typical values for that attribute.
 Outliers can be legitimate data objects or
values.
 Eg: In fraud detection, the goal is to find
unusual objects or events from among a
large number of normal ones

 Missing Values:
 The reasons for missing values are
i) Information was not collected
Eg: People decline to give their age or
weight.
ii) Attributes may not be applicable to all
cases.
Eg: Annual income is not applicable to
children

 Missing values can be handled in the following ways (a short sketch follows the list):
1) Eliminate data objects or attributes.
2) Estimate missing values.
3) Ignore the missing values during
analysis.
4) Replace with all possible values
(weighted by their probabilities).
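A short sketch of strategies 1) to 3) above, assuming pandas is available; the column names and values are made up for illustration:

```python
# Handling missing values: drop incomplete objects, estimate (impute), or skip them.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35],
                   "income": [50000, 62000, None, 48000]})

dropped = df.dropna()            # 1) eliminate data objects with missing values
imputed = df.fillna(df.mean())   # 2) estimate missing values (here, the column mean)
# 3) many routines simply ignore missing values, e.g. df["age"].mean() skips the NaN
print(dropped)
print(imputed)
```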

 Inconsistent Values:
 Even when all the data is present and "looks fine", there may be inconsistencies.
 Eg: A record specifies a person's height as 6 feet but a weight of only 2 kg.
 A zip code that does not belong to the specified city.
 The correction of an inconsistency
requires additional or redundant
information.
 Duplicate data:
 Data set may include data objects that are duplicates or
almost duplicates of one another.
 Eg: Many people receive duplicate mails because they appear in a database multiple times under slightly different ...
 To detect and eliminate such duplicates, two
main issues must be addressed.
◦ If there are two objects that actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved.
◦ Care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates.
◦ Eg: Two different people with identical names.
◦ The process of dealing with duplication issues is called deduplication.
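A minimal sketch of deduplication with pandas; the names and addresses are made up, and real deduplication usually needs fuzzier matching than the simple normalisation shown here:

```python
# Detect and remove (near-)duplicate mailing-list entries.
import pandas as pd

mailing_list = pd.DataFrame({
    "name":    ["A. Kumar", "A. Kumar", "A Kumar", "B. Rao"],
    "address": ["12 Main St", "12 Main St", "12 Main Street", "5 Park Rd"],
})

exact = mailing_list.drop_duplicates()    # removes only exact duplicate rows

# normalising the text first catches rows that differ only in formatting details
normalised = mailing_list.assign(
    name=mailing_list["name"].str.replace(".", "", regex=False).str.lower(),
    address=mailing_list["address"].str.lower().str.replace("street", "st", regex=False),
).drop_duplicates()

print(exact)        # 3 distinct rows
print(normalised)   # 2 distinct rows after normalisation
```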
