
Data Mining

LAKSHMI VIVEKA KESANAPALLI


Associate Professor
Dept of CSE (Artificial Intelligence)
Pragati Engineering College

UNIT – II PART - I

Syllabus

 Data Mining: Introduction, What is Data Mining?, Motivating Challenges, The Origins of Data Mining, Data Mining Tasks, Types of Data, Data Quality (Tan & Vipin)

Data Mining
 Data mining is a process of automatically
discovering knowledge/useful information
in large data repositories.
 Eg: Predicting whether a newly arrived
customer will spend more than $100 at a
department store.
 Data mining is an integral part of
knowledge discovery in databases (KDD),
which is the overall process of converting
raw data into useful information.

Process of Knowledge Discovery in Databases (KDD)

[Figure: the stages of the KDD process, from input data through preprocessing and data mining to postprocessing and useful information]
Motivating Challenges
The following are the challenges that
motivated the development of data
mining.
 Scalability
 High Dimensionality
 Heterogeneous and Complex Data
 Data Ownership and Distribution
 Non-Traditional Analysis

 Scalability:
◦ Data mining algorithms must be scalable to
handle massive data sets.
◦ Scalability may require the implementation of
novel data structures to access individual
records in an efficient manner.
◦ Scalability can also be improved by using
sampling or developing parallel and
distributed algorithms.

 High Dimensionality:
◦ In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features/attributes.
◦ Data sets with temporal or spatial components
also tend to have high dimensionality.
◦ Traditional data analysis techniques do not work well for high-dimensional data.
◦ Moreover, the computational complexity increases rapidly as the dimensionality increases.

 Heterogeneous and Complex Data
◦ In recent years, we have increasingly seen data with heterogeneous attributes and complex data objects.
 Eg: Hyperlinks
 DNA data
 Climate Data
◦ Techniques for mining such complex objects need to take into account relationships in the data, such as graph connectivity and temporal and spatial autocorrelation.

 Data ownership and Distribution
 Sometimes, the data is geographically
distributed among resources belonging to
multiple entities.
 This requires the development of
distributed data mining techniques.

 Non-Traditional Analysis
 Current data analysis tasks often require the generation and evaluation of thousands of hypotheses; data mining techniques were developed to automate this process of hypothesis generation and evaluation.

The Origins of Data Mining
 Data mining draws upon ideas from
◦ Sampling, estimation and hypothesis testing
from statistics.
◦ Search algorithms, modelling techniques, and learning theories from artificial intelligence, pattern recognition and machine learning.

Data Mining: A Confluence of Multiple Disciplines

[Figure: data mining as a confluence of statistics, artificial intelligence, machine learning and pattern recognition]
Data Mining Tasks
 Data mining tasks are generally divided
into two major categories
◦ Predictive tasks
◦ Descriptive tasks
 Predictive tasks:
The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes.
The attribute to be predicted is called the target or dependent variable.
The attributes used for making the prediction are called explanatory or independent variables.

 Descriptive Tasks:
The objective is to derive patterns (correlations, trends, clusters, trajectories and anomalies) that summarize the underlying relationships in the data.
Descriptive data mining tasks are often exploratory in nature and frequently require post-processing techniques to validate and explain the results.

 The following diagram represents the four core data mining tasks.

[Figure: the four core data mining tasks (predictive modelling, association analysis, cluster analysis and anomaly detection) applied to a data set]
(i) Predictive Modelling
 It refers to building a model for the target
variable as a function of the explanatory
variables. There are two types of
predictive modelling tasks.
◦ Classification which is used for discrete target
variables.
◦ Eg: Predicting whether or not it will rain heavily tomorrow.

 Regression which is used for continuous
target variables.
 Eg: Predicting the future price of a stock.

The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable.
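A minimal sketch of both task types, assuming scikit-learn is available; the toy weather and stock data, the feature choices and the specific models are illustrative assumptions, not part of the lecture:

```python
# Classification and regression on toy data (assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: discrete target (1 = heavy rain tomorrow, 0 = no heavy rain)
X_weather = [[30, 85], [25, 60], [28, 90], [22, 40]]   # [temperature, humidity]
y_rain = [1, 0, 1, 0]
clf = DecisionTreeClassifier().fit(X_weather, y_rain)
print(clf.predict([[27, 88]]))          # predicted class label for a new day

# Regression: continuous target (future stock price)
X_days = [[1], [2], [3], [4]]           # day index as the explanatory variable
y_price = [100.0, 101.5, 103.1, 104.4]
reg = LinearRegression().fit(X_days, y_price)
print(reg.predict([[5]]))               # predicted price for day 5
```

Any classifier/regressor pair would illustrate the same point; a decision tree and linear regression are used here only because they are simple to set up.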

(ii) Association Analysis
It is used to discover patterns that
describe strongly associated features in the data.
The discovered patterns are typically
represented in the form of implication rules.
Eg: { bread } -> { Milk }

The applications of association analysis include:
 Finding groups of genes with related functionality
 Identifying products that are frequently co-purchased (market basket analysis)
 Finding web pages that are accessed together.
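A small plain-Python sketch of how the strength of a rule such as { bread } -> { milk } can be measured by its support and confidence; the toy transactions below are made up for illustration:

```python
# Support and confidence of the rule {bread} -> {milk} on toy transaction data.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

antecedent, consequent = {"bread"}, {"milk"}
n = len(transactions)
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n          # fraction of all transactions containing bread AND milk
confidence = both / ante    # of the transactions with bread, the fraction that also contain milk
print(f"support={support:.2f}, confidence={confidence:.2f}")   # support=0.50, confidence=0.67
```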

(iii) Cluster Analysis
It is used to find groups of closely related objects, so that objects that belong to the same cluster are more similar to each other than to objects that belong to other clusters.
Applications of cluster analysis include:
◦ Grouping customers
◦ Data compression
◦ Document clustering
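A minimal sketch of grouping customers with k-means, assuming scikit-learn is available; the two features (annual income, spending score) and the toy values are illustrative assumptions:

```python
# Grouping customers into two clusters by (annual income, spending score).
from sklearn.cluster import KMeans

customers = [
    [15, 80], [16, 75], [17, 82],    # low income, high spending
    [70, 20], [72, 25], [75, 18],    # high income, low spending
]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # cluster label assigned to each customer
print(km.cluster_centers_)  # the two cluster centroids
```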

(iv) Anomaly detection:
It is the task of identifying objects whose characteristics are significantly different from the rest of the data. Such objects are called anomalies or outliers.
A good anomaly detector must have a high detection rate and a low false alarm rate.
Applications of anomaly detection include:
◦ Fraud detection
◦ Network intrusion detection
◦ Disease detection
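A minimal sketch of anomaly detection in plain Python, flagging values that lie far from the mean; the transaction amounts and the two-standard-deviation threshold are illustrative assumptions, not a method prescribed by the lecture:

```python
# Flag amounts that lie more than two standard deviations from the mean.
amounts = [52.0, 48.5, 50.2, 47.9, 51.3, 49.8, 250.0]   # one suspicious transaction

mean = sum(amounts) / len(amounts)
std = (sum((x - mean) ** 2 for x in amounts) / len(amounts)) ** 0.5

outliers = [x for x in amounts if abs(x - mean) > 2 * std]
print(outliers)   # [250.0]
```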
Types of Data
 Data Set:
◦ A data set is a collection of data objects.
◦ Other names for a data object are record, point, vector, event, observation or entity.
◦ Data objects are described by a number of
attributes that capture the basic characteristics
of an object.
◦ Eg: student name, branch, college name.
◦ Other names for an attribute are variable, characteristic, field, feature or dimension.
 Attribute:
◦ An attribute is a property or characteristic of
an object that may vary, either from one object
to another or from one time to another.
◦ Eg: Eye color varies from person to person.
◦ Temperature varies over time.

 Different types of Attributes:
 The following properties of numbers are
typically used to describe attributes.
1. Distinctness ( = and != )
2. Order ( <, <=, >, >= )
3. Addition ( +, - )
4. Multiplication ( *, / )
 Given these properties, we can define four types of attributes:
i. Nominal
ii. Ordinal
iii. Interval
iv. Ratio
 Nominal Attributes: (Distinctness)
◦ The values of a nominal attribute are just different names. That is, nominal values provide only enough information to distinguish one object from another ( =, != ).
 Eg: Gender, Eye Color
 Ordinal Attributes: ( Distinctness and Order)
◦ The values of an ordinal attribute provide enough
information to order objects (<, >).
◦ Nominal and ordinal attributes are collectively
referred to as categorical or qualitative attributes.
◦ Eg: Grade of a Student (A, B, C, D)
◦ Shirt size (S, M, L, XL)

 Interval Attributes: ( Distinctness, Order and
Addition )
◦ For interval attributes, the differences between values are meaningful and a unit of measurement exists ( +, - ).
◦ Eg: Temperature in Celsius or Fahrenheit
◦ Calendar Dates
 Ratio Attributes: (All the four properties)
◦ For ratio attributes, both differences and ratios are meaningful ( *, / ); the short sketch below illustrates how this differs from interval attributes.
◦ Interval and ratio attributes are collectively called as
quantitative or numeric attributes.
◦ Eg: Electric Current
◦ Age
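A short plain-Python sketch of why ratios are meaningful for a ratio attribute such as temperature in Kelvin but not for an interval attribute such as temperature in Celsius; the temperatures are illustrative:

```python
# Ratios are meaningful in Kelvin (true zero) but not in Celsius (arbitrary zero).
c1, c2 = 20.0, 40.0                  # temperatures in Celsius
print(c2 / c1)                       # 2.0, but 40 C is not "twice as hot" as 20 C

k1, k2 = c1 + 273.15, c2 + 273.15    # the same temperatures in Kelvin
print(k2 / k1)                       # about 1.07, a physically meaningful ratio
```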
 Describing attributes by the number of
values:
1) Discrete Attribute:
It has only a finite or countably infinite set of values.
Eg: Pincode, Counts, ID numbers
Discrete attributes are often represented as
integer variables.
Binary attributes are a special case of discrete
attributes and assume only two values.
Eg: True/False
Male/Female
Yes/No
0/1

2) Continuous Attribute:
It has real numbers as attribute values.
Eg: Temperature, height, weight
Continuous attributes are typically
represented as floating point variables.

Types of Data Sets:
Data sets are broadly grouped into three types:
i. Record data
ii. Graph data
iii. Ordered data

 Record Data:
The data set is a collection of records,
each of which consists of a fixed set of
data fields.
Record data is usually stored in flat files
or in relational databases.
Different types of record data include:
Transaction data, data matrix, document-
term matrix.

Example:
TID   Refund   Marital Status   Taxable Income   Defaulted Borrower
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Divorced         95K              Yes
...   ...      ...              ...              ...

Transaction Data: It is a special type of record data, where each record (transaction) involves a set of items.
Ex: The set of items purchased by a customer.

 Data Matrix: A set of data objects can be interpreted as an m x n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
 Data matrix is a standard data format for most
statistical data.

Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      27     1.2
12.65                  6.25                   16.22      22     1.1
...                    ...                    ...        ...    ...
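A minimal sketch of the same idea, assuming NumPy is available; the values are copied from the example rows above and the remaining rows are omitted:

```python
# The example above as an m x n data matrix: rows are objects, columns are attributes.
import numpy as np

# columns: projection of x load, projection of y load, distance, load, thickness
data_matrix = np.array([
    [10.23, 5.27, 15.22, 27, 1.2],
    [12.65, 6.25, 16.22, 22, 1.1],
])
print(data_matrix.shape)    # (2, 5): m = 2 objects, n = 5 attributes
print(data_matrix[:, 2])    # the 'distance' column for all objects
```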

 Document-term matrix:
A document can be represented as a term vector, where each term is a component (attribute) of the vector and the value of each component is the number of times the corresponding term occurs in the document.
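A small plain-Python sketch of building a document-term matrix; the two example documents are made up for illustration:

```python
# One row per document, one column per term; entries are term counts.
from collections import Counter

docs = ["data mining finds patterns in data",
        "cluster analysis groups similar data objects"]

counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set(term for c in counts for term in c))

matrix = [[c[term] for term in vocabulary] for c in counts]
print(vocabulary)
for row in matrix:
    print(row)
```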

 Graph Data:
A Graph is a convenient and powerful
representation of data.
There are two specific cases with
graph data.
i) Data with relationship among
objects.
ii) Data with objects that are graphs.

 Data with relationships among objects:
The data objects are represented as nodes,
and the relationship among objects is
represented by links.
Eg: Linked web pages
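A minimal sketch of such data as an adjacency list in plain Python; the page names and links are made up for illustration:

```python
# Linked web pages as a graph: nodes are pages, directed edges are hyperlinks.
links = {
    "home.html":     ["products.html", "about.html"],
    "products.html": ["home.html", "contact.html"],
    "about.html":    ["home.html"],
    "contact.html":  [],
}

for page, targets in links.items():
    print(page, "->", targets)
```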

 Data with objects that are graphs:
If objects have structure, that is, the objects
contain sub-objects that have relationships, then
such objects are frequently represented as
graphs.
Eg: Structure of chemical compounds

Ordered data:
Here, the attributes have relationships
that involve order in time or space.
Sequential data:
Sequential data, also called temporal data, is data that has temporal/time information associated with it.
It can be thought of as an extension of record data, where each record has a time associated with it.
Example of pattern: "Candy sales peak before
Halloween"
 Sequence data:
Sequence data consists of a sequence of individual
entities such as a sequence of words or letters.
It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.
Eg: Genomic sequence data

 Time Series Data:
Time series data is a special type of sequential
data in which each record is a series of
measurements taken over time.
Eg: Daily prices of stocks, Temperature time series
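A small plain-Python sketch of a time series as (date, value) pairs; the dates and prices are made up for illustration:

```python
# A time series: the daily closing price of a stock as (date, value) pairs.
stock_series = [
    ("2024-02-19", 101.2),
    ("2024-02-20", 102.8),
    ("2024-02-21", 101.9),
    ("2024-02-22", 103.4),
]

# day-over-day change in the measured value
for (_, prev), (day, curr) in zip(stock_series, stock_series[1:]):
    print(day, round(curr - prev, 2))
```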

 Spatial Data:
Data that contains location information is called spatial data.
Eg: Weather data collected from various geographical locations

Data Quality

 Data mining focuses on data quality issues in two ways:
i) the detection and correction of data quality problems, and
ii) the use of algorithms that can tolerate poor data quality.
 The detection and correction of data quality problems is often called "data cleaning".
 Examples of data quality problems:
Noise
Outliers
Missing values
Duplicate data

 It is unrealistic to expect that data will be
perfect.
 There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process.
 Measurement Error refers to any problem
resulting from the measurement process.
 For continuous attributes, the numerical difference between the measured and true value is called the error.
 Data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
Noise and Artifacts
 Noise is the random component of a
measurement error.
 It may involve the distortion of a value or the addition of spurious objects.

 The term noise is often used in connection
with data that has a spatial or temporal
component.
 Techniques from signal or image
processing can frequently be used to
reduce noise.
 The elimination of noise is frequently
difficult, hence data mining focuses on
devising robust algorithms.
 Deterministic distortions of the data are called artifacts.
 Eg: A streak in the same place on a set of photographs.
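A minimal sketch of one such noise-reduction idea, a simple moving average, in plain Python; the sensor readings and the window size are illustrative assumptions:

```python
# Reduce random noise in a sequence of measurements with a simple moving average.
readings = [20.1, 19.8, 25.3, 20.2, 19.9, 20.4, 14.9, 20.0]   # noisy sensor values

window = 3
smoothed = [
    sum(readings[i:i + window]) / window
    for i in range(len(readings) - window + 1)
]
print(smoothed)
```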

Outliers
 Outliers are the data objects having
characteristics that are different from most
of the other data objects in the data set.
 Alternatively, outliers are values of an attribute that are unusual with respect to the typical values for that attribute.
 Outliers can be legitimate data objects or
values.
 Eg: In fraud detection, the goal is to find
unusual objects or events from among a
large number of normal ones

 Missing Values:
 The reasons for missing values are
i) Information was not collected
Eg: People decline to give their age or
weight.
ii) Attributes may not be applicable to all
cases.
Eg: Annual income is not applicable to
children

 Missing values can be handled in the following ways (a short sketch follows the list):
1) Eliminate data objects or attributes.
2) Estimate missing values.
3) Ignore the missing values during
analysis.
4) Replace with all possible values
(weighted by their probabilities).
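A short sketch of strategies 1) to 3) above, assuming pandas is available; the column names and values are made up for illustration:

```python
# Handling missing values: drop incomplete objects, estimate (impute), or skip them.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35],
                   "income": [50000, 62000, None, 48000]})

dropped = df.dropna()            # 1) eliminate data objects with missing values
imputed = df.fillna(df.mean())   # 2) estimate missing values (here, the column mean)
# 3) many routines simply ignore missing values, e.g. df["age"].mean() skips the NaN
print(dropped)
print(imputed)
```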

 Inconsistent Values:
 Even when all the data is present and "looks fine", there may be inconsistencies.
 Eg: A record specifies a person's height as 6 feet but a weight of only 2 kg.
 A zip code that does not belong to the specified city.
 The correction of an inconsistency
requires additional or redundant
information.
 Duplicate data:
 Data set may include data objects that are duplicates or
almost duplicates of one another.
 Eg: Many people receive duplicate mails because they appear in a database multiple times under slightly different ...
 To detect and eliminate such duplicates, two
main issues must be addressed.
◦ If there are two objects that actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved.
◦ Care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates.
◦ Eg: Two different people with identical names.
◦ The process of dealing with duplication issues is called deduplication.
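A minimal sketch of deduplication with pandas; the names and addresses are made up, and real deduplication usually needs fuzzier matching than the simple normalisation shown here:

```python
# Detect and remove (near-)duplicate mailing-list entries.
import pandas as pd

mailing_list = pd.DataFrame({
    "name":    ["A. Kumar", "A. Kumar", "A Kumar", "B. Rao"],
    "address": ["12 Main St", "12 Main St", "12 Main Street", "5 Park Rd"],
})

exact = mailing_list.drop_duplicates()    # removes only exact duplicate rows

# normalising the text first catches rows that differ only in formatting details
normalised = mailing_list.assign(
    name=mailing_list["name"].str.replace(".", "", regex=False).str.lower(),
    address=mailing_list["address"].str.lower().str.replace("street", "st", regex=False),
).drop_duplicates()

print(exact)        # 3 distinct rows
print(normalised)   # 2 distinct rows after normalisation
```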
