Data Preparation and Exploration
DSCI 5240
Data Mining and Machine Learning for Business
Russell R. Torres
Know Your Data!
“Conducting data analysis is like drinking a fine wine. It is important to swirl and sniff the wine, to unpack the complex bouquet and to appreciate the experience. Gulping the wine doesn’t work.”
Daniel B. Wright
About Data
Acquiring Data
• Data acquisition may or may not be your concern within your organization
• In large organizations, there may be teams devoted to extracting relevant information from
the data warehouse
• In smaller organizations, that work may fall to the data miner
• We rarely have an issue with too little data
• Big Data refers to situations where datasets are so large they cannot be stored or analyzed
using traditional methods
• In instances where additional data would be helpful, it can often be acquired from
operational systems or third parties
• Data is accessed in a variety of ways
• Directly in the data warehouse (rare)
• Extracted to a database management system (DBMS), e.g., Microsoft Access
• Extracted to Microsoft Excel (very common)
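A minimal sketch of pulling such extracts into an analysis environment with Python and pandas; the file names, sheet name, and table name below are hypothetical stand-ins for whatever the extraction process produces, and a local SQLite file stands in for Microsoft Access or another DBMS:

```python
import sqlite3
import pandas as pd

# Data extracted to Microsoft Excel (very common); the workbook name is
# hypothetical, and reading .xlsx files requires the openpyxl package
customers = pd.read_excel("customer_extract.xlsx", sheet_name="customers")

# Data extracted to a DBMS; a local SQLite file stands in here for
# Microsoft Access or any other database management system
conn = sqlite3.connect("warehouse_extract.db")
orders = pd.read_sql("SELECT * FROM orders", conn)
conn.close()

print(customers.shape, orders.shape)
```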
Data Structure
Common Data Types – Non-Numeric Data
Common Data Types – Numeric Data
Common Data Types – Identifiers
Common Data Types – Text
Variable Types
Data Preprocessing
Data Preprocessing
The data contained in modern data warehouses often have significant data quality issues
• Accuracy – Do the data accurately represent what they are intended to represent?
We have a customer record, but the income field reflects household income rather than the expected individual income
• Completeness – Do we have all of the data necessary?
We have a customer record, but there is no value in the income field
• Timeliness – Was the data collected recently enough to still be useful?
We have a customer record, but the value of the income field was collected 20 years ago
• Believability – Can the data be trusted?
We have a customer record, but the value of the income field is $5B
• Interpretability – Do we really understand what the data show?
We have a customer record, but the value of the income field has been scaled several times and we are not really sure what it means
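Several of these quality dimensions can be checked programmatically. The sketch below, using a small hypothetical customer extract and assumed plausibility thresholds, illustrates simple completeness, believability, and timeliness checks in pandas:

```python
import pandas as pd

# Hypothetical customer extract used to illustrate simple quality checks
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "income": [52_000, None, 5_000_000_000, 48_500],  # missing and implausible values
    "collected_year": [2023, 2024, 2005, 2024],
})

# Completeness: what share of each field is missing?
print(customers.isna().mean())

# Believability: flag incomes outside a plausible range (threshold is an assumption)
suspect = customers[(customers["income"] < 0) | (customers["income"] > 1_000_000)]
print(suspect)

# Timeliness: flag records collected too long ago to still be useful
stale = customers[customers["collected_year"] < 2015]
print(stale)
```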
Data Quality is Key!
“I still believe to this day, regardless of the tool that’s out there, there is not a tool
today that can replace data cleansing, data quality, and data profiling.”
Data Quality
• Data quality has consistently been shown to be a critical factor in the successful use of business intelligence (BI) within organizations
• Quality depends on the intended use of the data
• Quality costs time and money
• You are looking for sufficient, rather than optimal, quality
• Some preprocessing tasks related to improving data quality are often completed before the analyst receives the data; others are completed after
Preprocessing Tasks - Overview
• Data cleaning – dealing with missing values and smoothing noisy data
• Data integration – ensuring that the incorporation of data from multiple sources has not introduced inconsistencies into the data
• Data reduction – identifying a smaller subset of the data that can produce the same (or similar) analytical results
Data Cleaning
• Missing data approaches
• Ignore the tuple – Skip it; can result in significant data loss in sparse data sets
• Manual correction – Fix it; unrealistic in most scenarios
• Global constant – Use a placeholder; can be confused with actual data
• Central tendency – Use the mean or median; can alter the variation in the data
• Class-based central tendency – Use the mean or median associated with the class to which
this record belongs
• Most probable value – Estimate it using regression, decision tree, etc.
• Noisy data approaches
• Binning – Sort the data and adjust each value based on those of its neighbors (bin mean, median, or boundary)
• Regression – Use predicted rather than actual values
• Outlier analysis – Identify and exclude “odd” records
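As a rough illustration of the central-tendency approaches above, the pandas sketch below (hypothetical data) fills a missing income value first with the overall median and then with the median of the record's own class:

```python
import pandas as pd

# Hypothetical records with a missing income value
df = pd.DataFrame({
    "segment": ["retail", "retail", "premium", "premium"],
    "income":  [48_000, None, 120_000, 135_000],
})

# Central tendency: fill with the overall median (can shrink the variation in the data)
overall = df["income"].fillna(df["income"].median())

# Class-based central tendency: fill with the median of the record's own segment
by_class = df.groupby("segment")["income"].transform(lambda s: s.fillna(s.median()))

df = df.assign(income_overall=overall, income_by_class=by_class)
print(df)
```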
Data Integration
• Entity Identification
• How do we match records in one data source with those in another?
• Does prd_id = product_id?
• Are values contained in the sources in common units?
• Redundancy
• Can a given field be derived from others within the data set?
• Can introduce statistical issues (e.g., multicollinearity) if redundant fields are used in the same model
• Duplication
• Can result from data redundancy in underlying data sources
• Can inappropriately increase the significance of relationships
• Data value conflict
• When data source A indicates price is 24.99 and data source B indicates price is 19.99, which is correct?
• What are the rules that govern conflict resolution?
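A minimal pandas sketch of these integration steps, using two hypothetical source extracts: the key columns are reconciled, duplicates are dropped, the sources are merged, and a derived (redundant) field is checked against its source field:

```python
import pandas as pd

# Hypothetical extracts from two source systems with mismatched keys and a duplicate row
source_a = pd.DataFrame({"prd_id": [1, 2, 2], "price_usd": [24.99, 19.99, 19.99]})
source_b = pd.DataFrame({"product_id": [1, 2], "weight_kg": [0.5, 1.2]})

# Entity identification: agree that prd_id and product_id refer to the same thing
source_a = source_a.rename(columns={"prd_id": "product_id"})

# Duplication: the same product appears twice in source A
source_a = source_a.drop_duplicates()

# Integration: bring the sources together on the shared key
merged = source_a.merge(source_b, on="product_id", how="inner")

# Redundancy: a field derived from another (weight in pounds) adds no new information
merged["weight_lb"] = merged["weight_kg"] * 2.20462
print(merged[["price_usd", "weight_kg", "weight_lb"]].corr())
```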
Data Reduction
Data Reduction – Reducing Rows
Data Reduction – Reducing Columns
Data Exploration
Data Exploration
• Having a sound understanding of the data you employ in models is critical
• What does each variable represent?
• How was it measured?
• From whom was it obtained?
• How is it related to the business domain?
• If predicting, is it available at the time the prediction needs to be made?
• A lack of understanding on the part of the modeler will result in poor model
performance and/or nonsensical model parameters
• In addition to a general understanding of the data, understanding their statistical
properties is also important
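One common way to start building that statistical understanding is to profile the data set directly; the sketch below (hypothetical data) relies only on standard pandas summaries:

```python
import pandas as pd

# Hypothetical modeling data set; the goal is simply to see what each variable
# looks like before it goes anywhere near a model
df = pd.DataFrame({
    "age":    [34, 41, 29, 55, 38],
    "income": [52_000, 61_000, 48_500, 97_000, 58_000],
    "region": ["N", "S", "N", "W", "S"],
})

df.info()                           # types and non-null counts for each variable
print(df.describe())                # basic statistical properties of numeric fields
print(df["region"].value_counts())  # how a categorical variable is distributed
```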
Exploratory Data Analysis (EDA)
• EDA is an approach that attempts to develop an understanding of the data to
facilitate the selection of the best possible models.
• The seminal work is Exploratory Data Analysis (Tukey 1977)
• A nice summary may be found at http://www.itl.nist.gov/div898/handbook/index.htm
• The approach is designed to:
1. Maximize insight into a data set
2. Uncover underlying structure
3. Extract important variables
4. Detect outliers and anomalies
5. Test underlying assumptions
6. Develop parsimonious models
7. Determine optimal factor settings
Important Summary Statistics
• Median
• Order a group of numbers and select
the middle number
• Like the mean, represents central tendency, but is robust to the presence of outliers
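A quick numeric illustration of that robustness, reusing the implausible $5B income value from earlier as the outlier (the other values are made up):

```python
import numpy as np

incomes = np.array([42_000, 48_000, 51_000, 55_000, 60_000])
with_outlier = np.append(incomes, 5_000_000_000)  # the $5B record from earlier

# The mean is pulled far away by the outlier; the median barely moves
print(np.mean(incomes), np.median(incomes))
print(np.mean(with_outlier), np.median(with_outlier))
```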
Important Visualizations
• Histogram
• Graphical representation of the distribution of numeric data
• Bins are constructed and the number of observations that fall within each bin is represented on a bar graph
• Important for verifying assumptions of models are not violated
• Box Plot
• Graphical representation of data through quartiles
• Bottom and top of the box represent the first and third quartiles; the middle bar (or sometimes a dot) represents the median; whisker conventions vary
• Excellent for assessing distribution and identifying outliers
• Scatter Plot
• Graphical representation of two or more variables in relation to one another
• Each variable is plotted on one axis using Cartesian coordinates
• Good for detecting relationships between variables
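All three plots can be produced with a few lines of matplotlib; the sketch below uses synthetic income and spending data purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: income and a spending variable loosely related to it
rng = np.random.default_rng(42)
income = rng.normal(55_000, 12_000, 500)
spend = 0.1 * income + rng.normal(0, 2_000, 500)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(income, bins=20)         # distribution of a numeric variable
axes[0].set_title("Histogram")

axes[1].boxplot(income)               # quartiles, median, and outliers
axes[1].set_title("Box Plot")

axes[2].scatter(income, spend, s=10)  # relationship between two variables
axes[2].set_title("Scatter Plot")

plt.tight_layout()
plt.show()
```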
Histogram
Box Plot
Scatter Plot