
Data Warehousing and Data Mining

with R-Programming

—UNIT 1 —
Recommended Book:
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
April 22, 2025 UNIT 1 1
Introduction
 Motivation (for Data Mining)
 Data Mining: Definition & Functionalities
 Data Preprocessing
 Forms of Data Preprocessing
 Data Cleaning: Missing Values, Noisy Data
(Binning, Clustering, Regression, Combined Computer and
Human Inspection), Inconsistent Data
 Data Integration and Transformation
 Data Reduction: Data Cube Aggregation,
Dimensionality Reduction, Data Compression,
Numerosity Reduction, Clustering
 Discretization and Concept Hierarchy Generation
April 22, 2025 UNIT 1
Motivation
 In real-world applications data can be
inconsistent, incomplete, and/or noisy.
Errors can happen due to:
 Faulty data collection instruments
 Data entry problems
 Human misjudgment during data entry
 Data transmission problems
 Technology limitations
 Discrepancies in naming conventions
Results:
 Duplicated records
 Incomplete data
 Contradictions in data

April 22, 2025 UNIT 1


Why Data Mining?

 The Explosive Growth of Data: from terabytes to petabytes


 Data collection and data availability

Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific
simulation, …

Society and everyone: news, digital cameras
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets

April 22, 2025 UNIT 1


What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?
 The exploration and analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns.
 The extraction of implicit, previously unknown, and potentially useful
information from data, or the process of discovering useful patterns in
data.
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems

April 22, 2025 UNIT 1


Data Mining Algorithm

 Objective: Fit Data to a Model


 Descriptive (characterize the general

properties of the data in the database)


 Predictive (perform inference on the

current data in order to make prediction)


 Preference – Technique to choose the best
model
 Search – Technique to search the data
 “Query”

April 22, 2025 UNIT 1


Data Mining Process

Define & Understanding the Problem.


Data Warehousing

Collect / Extract data

Clean Data

Data Engineering

Algorithm selection / Engineering

Run Mining Algorithm

Analyze the Results

April 22, 2025 UNIT 1


Database Processing vs. Data Mining
Processing
 Query
– Database: well defined, expressed in SQL
– Data mining: poorly defined, no precise query language
 Data
– Database: operational data
– Data mining: not operational data
 Output
– Database: precise, a subset of the database
– Data mining: fuzzy, not a subset of the database

April 22, 2025 UNIT 1


Data Warehousing and Data Mining
Statistics and Data Mining
 Data Warehousing
 Provides the enterprise with a memory
 Data Mining
 Provides the enterprise with intelligence

 Statistics
 Confirmatory
 Small samples
 In-sample performance
 Data Mining
 Exploratory
 Large samples
 Out-of-sample performance

April 22, 2025 UNIT 1


Query Examples

 Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased milk
 Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)

April 22, 2025 UNIT 1


Data Mining Models and Tasks

April 22, 2025 UNIT 1


Basic Data Mining Tasks

 Classification maps data into predefined groups


or classes
 Supervised learning

 Pattern recognition

 Prediction

 Regression is used to map a data item to a real


valued prediction variable.
 Clustering groups similar data together into
clusters.
 Unsupervised learning

 Segmentation

 Partitioning

April 22, 2025 UNIT 1


Basic Data Mining Tasks (cont’d)

 Summarization maps data into subsets with


associated simple descriptions.
 Characterization

 Generalization

 Link Analysis uncovers relationships among data.


 Affinity Analysis

 Association Rules (Finds rule of the form: X=>Y Or

“ If X then Y”)
 Sequential Analysis determines sequential

patterns.
 (Artificial) Neural Networks
 Genetic algorithms
 Hypothesis Testing.
April 22, 2025 UNIT 1
Data Mining and Business Intelligence

Increasing potential to support business decisions (top of the pyramid):
 Decision Making — End User
 Data Presentation, Visualization Techniques — Business Analyst
 Data Mining, Information Discovery — Data Analyst
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses — DBA
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
April 22, 2025 UNIT 1
Data Mining vs. KDD

 Knowledge Discovery in Databases


(KDD): process of finding useful
information and patterns in data.
 Data Mining: Use of algorithms to extract
the information and patterns derived by the
KDD process.

April 22, 2025 UNIT 1


Knowledge Discovery (KDD) Process

 Data mining is the core of the knowledge discovery process
 Typical flow:
Databases → Data Cleaning → Data Integration → Data Warehouse →
Selection (task-relevant data) → Data Mining → Pattern Evaluation
April 22, 2025 UNIT 1
KDD Process

Selection: Obtain data from various sources.


Preprocessing: Cleanse data.
Transformation: Convert to common format.
Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to
user in meaningful manner.

April 22, 2025 UNIT 1


KDD Process: Several Key Steps
 Learning the application domain
 relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation
 Find useful features, dimensionality/variable reduction, invariant
representation
 Choosing functions of data mining
 summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge

April 22, 2025 UNIT 1


KDD Process Ex: Web Log
 Selection:

Select log data (dates and locations) to use
 Preprocessing:

Remove identifying URLs

Remove error logs
 Transformation:

Sessionize (sort and group)
 Data Mining:

Identify and count patterns

Construct data structure
 Interpretation/Evaluation:

Identify and display frequently accessed sequences.
 Potential User Applications:

Cache prediction

Personalization

April 22, 2025 UNIT 1


Are All the “Discovered” Patterns Interesting?

 Data mining may generate thousands of patterns: Not all of them are
interesting

Suggested approach: Human-centered, query-based, focused
mining
 Interestingness measures

A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
 Objective vs. subjective interestingness measures

Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.

Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
April 22, 2025 UNIT 1
Data Mining Development
Data mining development draws on techniques from databases, information
retrieval, statistics, algorithms, and machine learning, including:
•Similarity Measures  •Hierarchical Clustering  •Relational Data Model  •IR Systems
•SQL  •Imprecise Queries  •Association Rule Algorithms  •Textual Data
•Data Warehousing  •Scalability Techniques  •Web Search Engines
•Bayes Theorem  •Regression Analysis  •EM Algorithm  •K-Means Clustering
•Time Series Analysis  •Algorithm Design Techniques  •Algorithm Analysis
•Data Structures  •Neural Networks  •Decision Tree Algorithms

April 22, 2025 UNIT 1


Why Not Traditional Data Analysis?
 Tremendous amount of data
 Algorithms must be highly scalable to handle terabytes
of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked
data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
April 22, 2025 UNIT 1
Data Mining Functionalities
( Kind of Patterns To Be Found)

 Multidimensional concept description:


Characterization( Generalization or summarization) and
discrimination ( Comparison)

Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions
 Frequent patterns, association, correlation vs. causality
 Diaper → Beer [0.5%, 75%] (correlation or causality?)
 Classification and prediction

Construct models (functions) that describe and distinguish
classes or concepts for future prediction

E.g., classify countries based on (climate), or classify
cars based on (gas mileage)

Predict some unknown or missing numerical values
April 22, 2025 UNIT 1
Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing one: knowledge fusion
 User interaction
 Data mining query languages and ad-hoc mining
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of abstraction
 Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy

April 22, 2025 UNIT 1


Data Preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values,

lacking certain attributes of interest, or


containing only aggregate data

e.g., occupation=“ ”
 noisy: containing errors or outliers

e.g., Salary=“-10”
 inconsistent: containing discrepancies in
codes or names

e.g., Age=“42” Birthday=“03/07/1997”

e.g., Was rating “1,2,3”, now rating “A, B, C”

e.g., discrepancy between duplicate records
April 22, 2025 UNIT 1
Why Is Data Dirty?
 Incomplete data may come from

“Not applicable” data value when collected

Different considerations between the time when the data
was collected and when it is analyzed.

Human/hardware/software problems
 Noisy data (incorrect values) may come from

Faulty data collection instruments

Human or computer error at data entry

Errors in data transmission
 Inconsistent data may come from

Different data sources

Functional dependency violation (e.g., modify some linked
data)
 Duplicate records also need data cleaning
April 22, 2025 UNIT 1
Why Is Data Preprocessing
Important?

 No quality data, no quality mining results!


 Quality decisions must be based on quality data

e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
 Data warehouse needs consistent integration of
quality data
 Data extraction, cleaning, and transformation
comprises the majority of the work of building a
data warehouse

April 22, 2025 UNIT 1


Multi-Dimensional Measure of Data
Quality

 A well-accepted multidimensional view:


 Accuracy

 Completeness

 Consistency

 Timeliness

 Believability

 Value added

 Interpretability

 Accessibility

 Broad categories:
 Intrinsic, contextual, representational, and

accessibility

April 22, 2025 UNIT 1


Major Tasks in Data
Preprocessing
 Data cleaning

Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
 Data integration

Integration of multiple databases, data cubes, or files
 Data transformation

Normalization and aggregation
 Data reduction

Obtains reduced representation in volume but produces
the same or similar analytical results
 Data discretization

Part of data reduction but with particular importance,
especially for numerical data

Data discretization is defined as a process of converting
continuous data attribute values into a finite set of
intervals with minimal loss of information.

April 22, 2025 UNIT 1


Forms of Data Preprocessing

April 22, 2025 UNIT 1


Data preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Mining Data Descriptive
Characteristics
 Motivation
 For data preprocessing to be successful it is essential to better understand the data (overall
picture of your data)
 To identify the typical properties of your data and highlight which data values should be treated
as noise or outliers.
 Two Approaches
 Measure of Central Tendency - effective measure to find out the degree to which
numerical data tend to occur at the center of the data set. (Mean, Median, Mode , midrange)
 Measures of Data Dispersion - effective measure to find out the degree to which
numerical data tend to spread in the data set. (Range, Quartiles, interquartile range (IQR),
outliers, Box plot, variance, Standard Deviation)
 Kinds of Measure
 Distributive Measure:- can be computed for a given data set by partitioning the data
into smaller subsets, computing the measure for each subset, and then merging the results in
order to arrive at the measure’s value for the original data set (e.g., count, sum)
 Algebraic Measure:- can be computed by applying an algebraic function to one or more
distributive measures (e.g., mean)
 Holistic Measure:- must be computed on the entire data set as a whole (e.g., median)

April 22, 2025 UNIT 1


Measuring the Central Tendency

 Mean (algebraic measure), sample vs. population:
   $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$        $\mu = \frac{\sum x}{N}$
 Weighted arithmetic mean:
   $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Trimmed mean: chopping extreme values
 Median: a holistic measure
 Middle value if odd number of values, or average of the
middle two values otherwise
 Estimated by interpolation (for grouped data):
   $median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula: $mean - mode \approx 3 \times (mean - median)$
April 22, 2025 UNIT 1
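These measures are one-liners in R. A small illustrative sketch on a hypothetical numeric vector (stat_mode is a helper written here for illustration; base R has no mode function for data):

# Central tendency of a hypothetical sample in R
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

mean(x)                     # arithmetic mean
mean(x, trim = 0.1)         # trimmed mean: drop 10% of values at each end
w <- c(2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)   # hypothetical weights
weighted.mean(x, w)         # weighted arithmetic mean
median(x)                   # holistic measure: middle value(s)

stat_mode <- function(v) {  # most frequent value(s); may return several (multimodal)
  tab <- table(v)
  as.numeric(names(tab)[tab == max(tab)])
}
stat_mode(x)                # 52 and 70 both occur twice here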
Symmetric vs. Skewed
Data

 Median, mean, and mode of symmetric, positively skewed, and
negatively skewed data
 In a symmetric distribution the three coincide; positive skew pulls the
mean above the median, negative skew pulls it below

April 22, 2025 UNIT 1


Measuring the Dispersion of Data
 Range, Quartiles, outliers and boxplots
 Range : The range of the set is the difference between the largest & smallest values
 Percentile :The value of a variable below which a certain percent of observations fall
 Quartiles: Quartile means separating the given set of data into 4 equal parts by 3
divisions. The three separations are lower quartile, median and upper quartile. The
lower quartile is the mid data between the first number and its median and the upper
quartile is the mid data between the median and last number of a given set. The
outlier of a given set of data can be identified with the help of interquartile range. Q 1
(25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, M, Q3, max
 Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot
outlier individually
 Outlier: usually, a value higher/lower than 1.5 × IQR beyond the quartiles
 Variance and standard deviation (sample: s, population: σ)
 Variance (algebraic, scalable computation):
   $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
   $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
 Standard deviation s (or σ) is the square root of the variance s² (or σ²)

April 22, 2025 UNIT 1
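A matching R sketch for the dispersion measures, reusing the same hypothetical vector as above:

# Dispersion measures in R
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

diff(range(x))                    # range = max - min
quantile(x, c(0.25, 0.50, 0.75))  # Q1, median, Q3
IQR(x)                            # interquartile range Q3 - Q1
fivenum(x)                        # five-number summary: min, Q1, M, Q3, max
var(x)                            # sample variance (divides by n - 1)
sd(x)                             # sample standard deviation = sqrt(var(x))
boxplot(x)                        # box = quartiles; points beyond 1.5 * IQR drawn as outliers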


Properties of Normal Distribution
Curve
 The normal (distribution) curve
 From μ–σ to μ+σ: contains about 68% of the

measurements (μ: mean, σ: standard deviation)


 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it

April 22, 2025 UNIT 1


Graphic Displays of Basic Statistical Descriptions

 Histogram:
 Boxplot:
 Quantile plot: each value x_i is paired with f_i,
indicating that approximately 100·f_i % of the data are ≤
x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of
one univariant distribution against the corresponding
quantiles of another
 Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
 Loess (local regression) curve: add a smooth curve
to a scatter plot to provide better perception of the
pattern of dependence
April 22, 2025 UNIT 1
Boxplot Analysis

 Five-number summary of a distribution:


Minimum, Q1, M, Q3, Maximum
 Boxplot

Data is represented with a box

The ends of the box are at the first and third
quartiles, i.e., the height of the box is the IQR

The median is marked by a line within the box
 Whiskers: two lines outside the box extend to
Minimum and Maximum

April 22, 2025 UNIT 1


Visualization of Data Dispersion: Boxplot
Analysis

April 22, 2025 UNIT 1


OUTLIER Detection

 Quartile means separating the given set of data into 4 equal


parts by 3 divisions. The three separations are lower quartile,
median and upper quartile. The lower quartile is the mid data
between the first number and its median and the upper
quartile is the mid data between the median and last number
of a given set. The outlier of a given set of data can be
identified with the help of interquartile range.
 Formulae involved – study outliers in a set of data:
 n – the total number of elements in the set
 Median = the (n + 1)/2 -th value = Q2
 Lower quartile = the (n + 1)/4 -th value = Q1
 Upper quartile = the 3(n + 1)/4 -th value = Q3
 Interquartile range (IQR) = Q3 – Q1
 The outliers are below Q1 – 1.5·IQR and above Q3 + 1.5·IQR
April 22, 2025 UNIT 1
Example Problems – Study Outliers in a Set of
Data:

 Example 1 Calculate the outlier for the given set of data 31, 64, 69,
65, 62, 63, 62.
 Solution:
 The given set of data is 31, 64, 69, 65, 62, 63, 62.
 Organize the given set of data in ascending order. 31, 62, 62, 63, 64, 65, 69.
 The median position is (7+1)/2 = 4.
 The 4th value is the median. Thus Median = 63 = Q2.
 Lower quartile position = (7+1)/4 = 8/4 = 2.
 The element in the 2nd position is the lower quartile. Thus Q1 = 62.
 Upper quartile position = (3*(7+1))/4 = 6.
 The element in the 6th position is the upper quartile. Thus Q3 = 65.
 Interquartile range (IQR) = Q3 – Q1 = 65 – 62 = 3.
 To find the outliers, calculate Q1 – 1.5·IQR and Q3 + 1.5·IQR:
 Q1 – 1.5·IQR = 62 – 1.5*3 = 57.5
 Q3 + 1.5·IQR = 65 + 1.5*3 = 69.5
 The outliers are below 57.5 and above 69.5.
 So 31 is the outlier of the given set of data.
April 22, 2025 UNIT 1
 Example 2 – Calculate the outlier for the given set of data
50, 61, 65,64, 67, 85, 70.

 Solution: The given set of data is 50, 61, 65, 64, 67, 85, 70.
 Organize the given set of data in ascending order. 50, 61, 64, 65, 67, 70,
85.

 The median position is (7+1)/2 = 8/2 = 4. Thus the 4th element is the median.
Median = 65 = Q2.
 Lower quartile position = (7+1)/4 = 2. The 2nd element is the lower quartile.
Lower quartile = 61 = Q1.
 Upper quartile position = 3*(7+1)/4 = 6. The 6th element is the upper quartile.
Upper quartile = 70 = Q3.
 Interquartile range (IQR) = Q3 – Q1 = 70 – 61 = 9.
 To find the outliers, calculate Q1 – 1.5·IQR and Q3 + 1.5·IQR:
 Q1 – 1.5·IQR = 61 – 1.5*9 = 47.5
 Q3 + 1.5·IQR = 70 + 1.5*9 = 83.5
 The outliers are below 47.5 and above 83.5.
 So 85 is an outlier of the given set of data.
April 22, 2025 UNIT 1
 Example 3 – Calculate the outliers, using the interquartile range, for the given set of data
60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100.
 Solution: The given set of data is 60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100.
Organize the given set of data in ascending order:
55, 58, 59, 60, 61, 62, 64, 65, 67, 90, 100.
 The median position is (11+1)/2 = 6. The 6th value is the median = 62 = Q2.
 Lower quartile position = (11+1)/4 = 12/4 = 3. The element in the 3rd position is the
lower quartile = 59 = Q1.
 Upper quartile position = 3*(11+1)/4 = 9. The element in the 9th position is the
upper quartile = 67 = Q3.
 Interquartile range (IQR) = Q3 – Q1 = 67 – 59 = 8.
 To find the outliers, calculate Q1 – 1.5·IQR and Q3 + 1.5·IQR:
 Q1 – 1.5·IQR = 59 – 1.5*8 = 47
 Q3 + 1.5·IQR = 67 + 1.5*8 = 79
 The outliers are below 47 and above 79.
 So 90 and 100 are outliers of the given set of data.
April 22, 2025 UNIT 1
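The worked examples can be checked in R. A sketch that follows the slides’ (n+1)/4 positional rule for Example 1 (note that R’s quantile() uses a different interpolation rule by default, so its quartiles may differ slightly from the values above):

# Outlier detection with the IQR rule, reproducing Example 1
x <- sort(c(31, 64, 69, 65, 62, 63, 62))
n <- length(x)

q1  <- x[(n + 1) / 4]        # 2nd value -> 62
q3  <- x[3 * (n + 1) / 4]    # 6th value -> 65
iqr <- q3 - q1               # 3

lower_fence <- q1 - 1.5 * iqr            # 57.5
upper_fence <- q3 + 1.5 * iqr            # 69.5
x[x < lower_fence | x > upper_fence]     # 31 is the only outlier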
Histogram Analysis

 Graph displays of basic statistical class


descriptions

Frequency histograms

A univariate graphical method

Consists of a set of rectangles that reflect the counts
or frequencies of the classes present in the given data

April 22, 2025 UNIT 1


Quantile Plot
 Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
 Plots quantile information

For data x_i sorted in increasing order, f_i
indicates that approximately 100·f_i % of the data
are below or equal to the value x_i

April 22, 2025 UNIT 1


Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
 Allows the user to view whether there is a shift in
going from one distribution to another

April 22, 2025 UNIT 1


Scatter plot
 Provides a first look at bivariate data to see
clusters of points, outliers, etc
 Each pair of values is treated as a pair of
coordinates and plotted as points in the plane

April 22, 2025 UNIT 1


Scatter plot

April 22, 2025 UNIT 1


Loess Curve
 Adds a smooth curve to a scatter plot in order to
provide better perception of the pattern of dependence
 Loess curve is fitted by setting two parameters: a
smoothing parameter, and the degree of the
polynomials that are fitted by the regression

April 22, 2025 UNIT 1
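All of these displays are short base-R calls. A sketch on hypothetical data:

# Basic graphic displays of numeric attributes in R
x <- rnorm(200, mean = 50, sd = 10)       # hypothetical attribute
y <- 2 * x + rnorm(200, sd = 8)           # a related attribute

hist(x)                                   # frequency histogram
boxplot(x)                                # five-number summary + outliers
plot(((1:length(x)) - 0.5) / length(x), sort(x))   # quantile plot: f_i vs. x_i
qqplot(x, y)                              # quantile-quantile plot of two distributions
plot(x, y)                                # scatter plot of bivariate data
lines(lowess(x, y), col = "red")          # loess/lowess smooth over the scatter plot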


Positively and Negatively Correlated
Data

April 22, 2025 UNIT 1


Not Correlated Data

April 22, 2025 UNIT 1


Data Preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Data Cleaning
 Importance
 “Data cleaning is one of the three biggest

problems in data warehousing”—Ralph Kimball


 “Data cleaning is the number one problem in

data warehousing”—DCI survey


 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

April 22, 2025 UNIT 1


Missing Data

 Data is not always available


 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 not register history or changes of the data
 Missing data may need to be inferred.

April 22, 2025 UNIT 1


How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same
class: smarter
 the most probable value: inference-based such as Bayesian
formula or decision tree

April 22, 2025 UNIT 1
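A minimal R sketch of these options on a hypothetical data frame (column and class names are invented for illustration):

# Handling missing values in R
df <- data.frame(income = c(25000, NA, 43000, 52000, NA, 61000),
                 class  = c("low", "low", "mid", "mid", "high", "high"))

na.omit(df)                                   # ignore tuples with missing values

# fill with a global constant or the overall attribute mean
df$income_global <- ifelse(is.na(df$income), -1, df$income)
df$income_mean   <- ifelse(is.na(df$income), mean(df$income, na.rm = TRUE), df$income)

# smarter: fill with the attribute mean of the tuple's own class
class_mean <- ave(df$income, df$class, FUN = function(v) mean(v, na.rm = TRUE))
df$income_classmean <- ifelse(is.na(df$income), class_mean, df$income)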


Noisy Data
 Noise: random error or variance in a measured
variable
 Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming convention
 Other data problems which require data cleaning

duplicate records

incomplete data

inconsistent data
April 22, 2025 UNIT 1
How to Handle Noisy Data?
 Binning

first sort data and partition into (equal-frequency)
bins

then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
 Regression

smooth by fitting the data into regression functions
 Clustering

detect and remove outliers
 Combined computer and human inspection

detect suspicious values and check by human (e.g.,
deal with possible outliers)

April 22, 2025 UNIT 1


Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate
presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
April 22, 2025 UNIT 1
Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
April 22, 2025 UNIT 1
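The same example can be reproduced in R. A sketch (smooth_bounds is a small helper written here for illustration):

# Binning-based smoothing of the price data in R
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)   # already sorted
bin   <- rep(1:3, each = 4)                                # equal-frequency bins of 4 values

ave(price, bin, FUN = mean)            # smoothing by bin means (9, 22.75, 29.25; the slide rounds)

smooth_bounds <- function(v) {         # replace each value by the closer bin boundary
  lo <- min(v); hi <- max(v)
  ifelse(v - lo <= hi - v, lo, hi)
}
ave(price, bin, FUN = smooth_bounds)   # 4 4 4 15 | 21 21 25 25 | 26 26 26 34

cut(price, breaks = 3)                 # equal-width partitioning instead: 3 intervals of equal size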
Regression

Figure: noisy data points are smoothed by fitting a regression line (here y = x + 1);
the observed value Y1 at X1 is replaced by the fitted value Y1’ on the line.

April 22, 2025 UNIT 1


Cluster Analysis

April 22, 2025 UNIT 1


Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)

 Check field overloading

 Check uniqueness rule, consecutive rule and null rule

 Use commercial tools


Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections

Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and
clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified

 ETL (Extraction/Transformation/Loading) tools: allow users to

specify transformations through a graphical user interface


 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel – an interactive data
cleaning tool)
April 22, 2025 UNIT 1
Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary

April 22, 2025 UNIT 1


Data Integration
 Data integration:
 Combines data from multiple sources into a

coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources

 Entity identification problem:


 Identify real world entities from multiple data

sources, e.g., Bill Clinton = William Clinton


 Detecting and resolving data value conflicts
 For the same real world entity, attribute values

from different sources are different


 Possible reasons: different representations,

different scales, e.g., metric vs. British units

April 22, 2025 UNIT 1


Handling Redundancy in Data
Integration
 Redundant data often occur when integrating
multiple databases

Object identification: The same attribute or object
may have different names in different databases

Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality

April 22, 2025 UNIT 1


Correlation Analysis (Numerical
Data)

 Correlation coefficient (also called Pearson’s product-moment coefficient):
   $r_{A,B} = \frac{\sum (a - \bar{A})(b - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (ab) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
 where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective
means of A and B, σA and σB are the respective standard
deviations of A and B, and Σ(ab) is the sum of the AB cross-products.
 If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s do). The higher the value, the stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated
April 22, 2025 UNIT 1
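In R, Pearson’s r is available directly. A tiny sketch on hypothetical attribute vectors:

# Correlation analysis of two numeric attributes in R
A <- c(2, 4, 6, 8, 10)        # hypothetical attribute values
B <- c(1, 3, 7, 9, 12)

cor(A, B)                     # Pearson's product-moment coefficient (default method)
cor.test(A, B)                # the same r plus a significance test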
Correlation Analysis (Categorical
Data)

 Χ² (chi-square) test:
   $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
 The larger the Χ2 value, the more likely the
variables are related
 The cells that contribute the most to the Χ2 value
are those whose actual count is very different from
the expected count
 Correlation does not imply causality

April 22, 2025 UNIT 1


Chi-Square Calculation: An Example

Gender vs. Preferred Reading     Male        Female       Sum (row)
Like science fiction             250 (90)    200 (360)      450
Not like science fiction          50 (210)  1000 (840)     1050
Sum (col.)                       300        1200           1500

 Χ² (chi-square) calculation (numbers in parentheses are the expected
counts, calculated from the marginal totals of the two categories):
   $\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
 It shows that like_science_fiction and male are
correlated in the group
April 22, 2025 UNIT 1
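The same test in R; correct = FALSE turns off the continuity correction that R applies to 2×2 tables by default, so the statistic matches the 507.93 computed above:

# Chi-square test of the gender vs. preferred-reading table in R
reading <- matrix(c(250, 50,       # male column:   like SF, not like SF
                    200, 1000),    # female column: like SF, not like SF
                  nrow = 2,
                  dimnames = list(scifi  = c("like", "not_like"),
                                  gender = c("male", "female")))

chisq.test(reading, correct = FALSE)            # X-squared ~ 507.93, tiny p-value
chisq.test(reading, correct = FALSE)$expected   # expected counts: 90, 210, 360, 840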
Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube
construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small,
specified range

min-max normalization

z-score normalization

normalization by decimal scaling
 Attribute/feature construction

New attributes constructed from the given ones
April 22, 2025 UNIT 1
Data Transformation:
Normalization
 Min-max normalization to [new_minA, new_maxA]:
   $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
 Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
 Z-score normalization (μA: mean, σA: standard deviation of A):
   $v' = \frac{v - \mu_A}{\sigma_A}$
 Ex. Let μ = 54,000 and σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
 Normalization by decimal scaling:
   $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v'|) < 1
April 22, 2025 UNIT 1
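The three normalization methods, sketched in R on the income example:

# Normalization in R
income <- c(12000, 30000, 54000, 73600, 98000)          # hypothetical attribute values

# min-max normalization to [0, 1]
(income - min(income)) / (max(income) - min(income))    # 73,600 maps to ~0.716

# z-score normalization; scale() uses the sample mean and sd of the vector
(73600 - 54000) / 16000                                 # 1.225, with the slide's mu and sigma
scale(income)

# decimal scaling: divide by the smallest power of 10 that pushes every |v'| below 1
j <- ceiling(log10(max(abs(income)) + 1))
income / 10^j                                           # e.g., 98,000 -> 0.98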
Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Data Reduction Strategies
 Why data reduction?
 A database/data warehouse may store terabytes of data

 Complex data analysis/mining may take a very long time

to run on the complete data set


 Data reduction
 Obtain a reduced representation of the data set that is
much smaller in volume yet produces the same (or
almost the same) analytical results
 Data reduction strategies
 Data cube aggregation:

 Attribute Subset Selection

 Dimensionality reduction — e.g., remove unimportant

attributes
 Data Compression

 Numerosity reduction — e.g., fit data into models

 Discretization and concept hierarchy generation

April 22, 2025 UNIT 1


Data Cube Aggregation
 The lowest level of a data cube (base cuboid)

The aggregated data for an individual entity of
interest

E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes

Further reduce the size of data to deal with
 Reference appropriate levels

Use the smallest representation which is enough to
solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
April 22, 2025 UNIT 1
Data Cube Aggregation

Figure: a sales data cube with dimensions Product (TV, PC, VCR), Date
(1Qtr–4Qtr), and Country (U.S.A, Canada, Mexico), aggregated with sum
along each dimension.

April 22, 2025 UNIT 1


Attribute Subset Selection
 Feature selection (i.e., attribute subset selection):
 Select a minimum set of features such that the

probability distribution of different classes given the


values for those features is as close as possible to the
original distribution given the values of all features
 reduces the number of patterns, making them easier to
understand
 Heuristic methods (due to exponential # of choices):
 Step-wise forward selection

 Step-wise backward elimination

 Combining forward selection and backward

elimination
 Decision-tree induction

April 22, 2025 UNIT 1


Example of Decision Tree
Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}

A decision tree induced on the data tests A4 at the root and A1 and A6
below it, with leaves labeled Class 1 / Class 2; only the attributes that
appear in the tree are retained.

=> Reduced attribute set: {A1, A4, A6}

April 22, 2025 UNIT 1


Heuristic Feature Selection
Methods
 There are 2^d possible attribute subsets of d features
 Several heuristic feature selection methods:
 Best single features under the feature

independence assumption: choose by significance


tests
 Best step-wise feature selection:


The best single-feature is picked first

Then next best feature condition to the first, ...
 Step-wise feature elimination:


Repeatedly eliminate the worst feature
 Best combined feature selection and elimination

 Optimal branch and bound:


Use feature elimination and backtracking
April 22, 2025 UNIT 1
Data Compression
 String compression
 There are extensive theories and well-tuned

algorithms
 Typically lossless

 But only limited manipulation is possible without

expansion
 Audio/video compression

 Typically lossy compression, with progressive

refinement
 Sometimes small fragments of signal can be

reconstructed without reconstructing the whole


 Time sequence is not audio

 Typically short and vary slowly with time


April 22, 2025 UNIT 1
Data Compression

Figure: lossless compression maps the original data to compressed data and
back exactly; lossy compression recovers only an approximation of the
original data.

April 22, 2025 UNIT 1


Dimensionality Reduction:
Wavelet Transformation
Haar2 Daubechie4
 Discrete wavelet transform (DWT): linear signal
processing, multi-resolution analysis
 Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length L/2
 Applies two functions recursively, until reaches the desired length

April 22, 2025 UNIT 1


DWT for Image Compression
Figure: the image is passed through a low-pass / high-pass filter pair; at each
level the low-pass output is decomposed again, giving a multi-resolution
representation.

April 22, 2025 UNIT 1


Dimensionality Reduction:
Principal Component Analysis
(PCA)
 Given N data vectors from n-dimensions, find k ≤ n orthogonal
vectors (principal components) that can be best used to represent
data
 Steps

Normalize input data: Each attribute falls within the same range

Compute k orthonormal (unit) vectors, i.e., principal components

Each input data (vector) is a linear combination of the k
principal component vectors

The principal components are sorted in order of decreasing
“significance” or strength

Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with
low variance (i.e., using the strongest principal components, it
is possible to reconstruct a good approximation of the original
data)
 Works for numeric data only
 Used when the number of dimensions is large
April 22, 2025 UNIT 1
Principal Component
Analysis
Figure: data plotted in the original axes X1 and X2; the principal components
Y1 and Y2 form new orthogonal axes aligned with the directions of greatest
variance.

April 22, 2025 UNIT 1
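A short R sketch of PCA-based reduction, using the built-in USArrests data set:

# Principal Component Analysis in R
pca <- prcomp(USArrests, scale. = TRUE)   # normalize the attributes, then rotate

summary(pca)              # proportion of variance explained by each component
pca$rotation              # the principal component (loading) vectors

reduced <- pca$x[, 1:2]   # keep only the two strongest components as the reduced data
head(reduced)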


Numerosity Reduction
 Reduce data volume by choosing alternative,
smaller forms of data representation
 Parametric methods

Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)

Example: Log-linear models—obtain value at a
point in m-D space as the product on
appropriate marginal subspaces
 Non-parametric methods

Do not assume models

Major families: histograms, clustering, sampling
April 22, 2025 UNIT 1
Regression and Log-Linear
Models

 Linear regression: Data are modeled to fit a straight


line

Often uses the least-square method to fit the line
 Multiple regression: allows a response variable Y to
be modeled as a linear function of multidimensional
feature vector
 Log-linear model: approximates discrete
multidimensional probability distributions
April 22, 2025 UNIT 1
Regression Analysis and Log-Linear
Models
 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the

line and are to be estimated by using the data


at hand
 Using the least squares criterion to the known

values of Y1, Y2, …, X1, X2, ….


 Multiple regression: Y = b0 + b1 X1 + b2 X2.
 Many nonlinear functions can be transformed

into the above


 Log-linear models:
 The multi-way table of joint probabilities is
approximated by a product of lower-order tables
 Probability: $p(a, b, c, d) \approx \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
April 22, 2025 UNIT 1
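A sketch in R: the fitted coefficients are the compact, parametric representation that stands in for the raw data (variable names and values here are invented for illustration):

# Regression as parametric numerosity reduction in R
set.seed(1)
x1 <- runif(100); x2 <- runif(100)
y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(100, sd = 0.2)

fit1 <- lm(y ~ x1)        # linear regression  Y = w X + b  (least squares)
fit2 <- lm(y ~ x1 + x2)   # multiple regression  Y = b0 + b1 X1 + b2 X2
coef(fit2)                # the few stored parameters, instead of the 100 tuples

# log-linear models for multi-way count data are available via loglin() or MASS::loglm()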
Data Reduction Method (2):
Histograms
 Divide data into buckets and store the
average (or sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth): equal number of values per bucket
 V-optimal: the histogram with the least variance (weighted
sum of the original values that each bucket represents)
 MaxDiff: set bucket boundaries between the pairs of adjacent
values with the β–1 largest differences
(Figure: an equal-width histogram of prices from 10,000 to 100,000)
April 22, 2025 UNIT 1
Data Reduction Method (3):
Clustering

 Partition data set into clusters based on similarity, and


store cluster representation (e.g., centroid and diameter)
only
 Can be very effective if data is clustered but not if data is
“smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms

April 22, 2025 UNIT 1


Data Reduction Method (4):
Sampling
 Sampling: obtaining a small sample s to represent the
whole data set N
 Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
 Choose a representative subset of the data
 Simple random sampling may have very poor

performance in the presence of skew


 Develop adaptive sampling methods
 Stratified sampling:


Approximate the percentage of each class (or
subpopulation of interest) in the overall database

Used in conjunction with skewed data
 Note: Sampling may not reduce database I/Os (page
at a time)
April 22, 2025 UNIT 1
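A sketch of these sampling schemes in R, on a hypothetical data frame with a skewed class attribute:

# Simple random and stratified sampling in R
df <- data.frame(id = 1:1000,
                 class = sample(c("A", "B", "C"), 1000, replace = TRUE,
                                prob = c(0.7, 0.2, 0.1)))

srswor <- df[sample(nrow(df), 100), ]                  # SRS without replacement
srswr  <- df[sample(nrow(df), 100, replace = TRUE), ]  # SRS with replacement

# stratified sample: ~10% of every class, so rare classes stay represented
strata <- split(df, df$class)
strat  <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), ceiling(0.1 * nrow(s))), ]))
table(strat$class)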
Sampling: with or without
Replacement

Figure: from the raw data, SRSWOR (simple random sampling without
replacement) and SRSWR (simple random sampling with replacement) each
draw a small sample.
April 22, 2025 UNIT 1
Sampling: Cluster or Stratified Sampling

Raw Data Cluster/Stratified Sample

April 22, 2025 UNIT 1


Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Discretization
 Three types of attributes:

Nominal — values from an unordered set, e.g., color, profession

Ordinal — values from an ordered set, e.g., military or academic
rank

 Continuous — numeric values, e.g., integer or real numbers
 Discretization:

Divide the range of a continuous attribute into intervals

Some classification algorithms only accept categorical attributes.

Reduce data size by discretization

Prepare for further analysis

April 22, 2025 UNIT 1
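Discretizing a numeric attribute is a one-liner with cut() in R. A sketch on hypothetical ages:

# Discretization in R
age <- c(13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 46, 52, 70)

cut(age, breaks = 3)                          # equal-width intervals
cut(age, breaks = quantile(age, seq(0, 1, 0.25)),
    include.lowest = TRUE)                    # equal-frequency (quantile-based) intervals
cut(age, breaks = c(0, 20, 50, Inf),
    labels = c("young", "middle_aged", "senior"))   # one level of a concept hierarchy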


Discretization and Concept
Hierarchy
 Discretization

Reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals

Interval labels can then be used to replace actual data
values

Supervised vs. unsupervised

Split (top-down) vs. merge (bottom-up)

Discretization can be performed recursively on an attribute
 Concept hierarchy formation

Recursively reduce the data by collecting and replacing low
level concepts (such as numeric values for age) by higher
level concepts (such as young, middle-aged, or senior)
April 22, 2025 UNIT 1
Discretization and Concept Hierarchy
Generation for Numeric Data
 Typical methods: All the methods can be applied recursively
 Binning (covered above)

Top-down split, unsupervised,
 Histogram analysis (covered above)

Top-down split, unsupervised

Clustering analysis (covered above)

Either top-down split or bottom-up merge, unsupervised
 Entropy-based discretization: supervised, top-down split

Interval merging by χ² analysis: unsupervised, bottom-up merge
 Segmentation by natural partitioning: top-down split,
unsupervised

April 22, 2025 UNIT 1


Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1
and S2 using boundary T, the information after partitioning is
   $I(S, T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2)$
 Entropy is calculated based on the class distribution of the samples in
the set. Given m classes, the entropy of S1 is
   $Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
 where pi is the probability of class i in S1


 The boundary that minimizes the entropy function over all
possible boundaries is selected as a binary discretization
 The process is recursively applied to partitions obtained until
some stopping criterion is met
 Such a boundary may reduce data size and improve classification
accuracy
April 22, 2025 UNIT 1
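A minimal R sketch of a single entropy-based split on hypothetical attribute/class vectors (entropy and info_after_split are helper functions written here, not library calls):

# One step of entropy-based discretization in R
entropy <- function(cls) {
  p <- table(cls) / length(cls)
  -sum(p * log2(p))
}

info_after_split <- function(t, x, cls) {   # I(S, T) for candidate boundary t
  left <- cls[x <= t]; right <- cls[x > t]
  length(left) / length(cls) * entropy(left) +
    length(right) / length(cls) * entropy(right)
}

x   <- c(1, 2, 3, 4, 5, 6, 7, 8)                  # numeric attribute
cls <- c("a", "a", "a", "a", "b", "b", "b", "b")  # class labels

cands <- head(sort(unique(x)), -1)                # candidate boundaries
cands[which.min(sapply(cands, info_after_split, x = x, cls = cls))]   # best boundary: 4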
Interval Merge by χ² Analysis
 Merging-based (bottom-up) vs. splitting-based methods
 Merge: Find the best neighboring intervals and merge them to
form larger intervals recursively
 ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]

Initially, each distinct value of a numerical attr. A is considered
to be one interval

χ² tests are performed for every pair of adjacent intervals

Adjacent intervals with the least χ² values are merged together,
since low χ² values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined
stopping criterion is met (such as significance level, max-
interval, max inconsistency, etc.)

April 22, 2025 UNIT 1


Segmentation by Natural
Partitioning

 A simple 3-4-5 rule can be used to segment numeric


data into relatively uniform, “natural” intervals.

If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3 equi-
width intervals

If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals

If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals

April 22, 2025 UNIT 1


Example of 3-4-5 Rule
Step 1: the attribute “profit” has Min = –$351, Low (5th percentile) = –$159,
High (95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000, so round Low down to –$1,000 and High up to $2,000,
giving the working range (–$1,000 … $2,000)
Step 3: this range covers 3 distinct values at the most significant digit, so
partition it into 3 equi-width intervals:
(–$1,000 … 0], (0 … $1,000], ($1,000 … $2,000]
Step 4: adjust the boundary intervals to the actual Min and Max: the first
interval shrinks to (–$400 … 0] and a new interval ($2,000 … $5,000] is
added to cover Max
Step 5: recursively apply the rule inside each interval, e.g.
(–$400 … 0] → (–$400 … –$300], (–$300 … –$200], (–$200 … –$100], (–$100 … 0]
(0 … $1,000] → ($0 … $200], ($200 … $400], ($400 … $600], ($600 … $800], ($800 … $1,000]
($1,000 … $2,000] → ($1,000 … $1,200], …, ($1,800 … $2,000]
($2,000 … $5,000] → ($2,000 … $3,000], ($3,000 … $4,000], ($4,000 … $5,000]
April 22, 2025 UNIT 1
Concept Hierarchy Generation for
Categorical Data

 Specification of a partial/total ordering of attributes


explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by
explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute
levels) by the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state,
country}
April 22, 2025 UNIT 1
Automatic Concept Hierarchy
Generation
 Some hierarchies can be automatically generated
based on the analysis of the number of distinct
values per attribute in the data set
 The attribute with the most distinct values is

placed at the lowest level of the hierarchy


 Exceptions, e.g., weekday, month, quarter, year

 country: 15 distinct values
 province_or_state: 365 distinct values
 city: 3,567 distinct values
 street: 674,339 distinct values


April 22, 2025 UNIT 1
