Data Preparation and Exploration
DSCI 5240
Data Mining and Machine Learning for Business
Russell R. Torres
Know Your Data!
“Conducting data analysis is like drinking a fine wine. It is important to swirl and sniff the wine, to unpack the complex bouquet and to appreciate the experience. Gulping the wine doesn’t work.”
Daniel B. Wright
About Data
Acquiring Data
• Data acquisition may or may not be your concern within your organization
• In large organizations, there may be teams devoted to extracting relevant information from
the data warehouse
• In smaller organizations, that work may fall to the data miner
• We rarely have an issue with too little data
• Big Data refers to situations where datasets are so large they cannot be stored or analyzed
using traditional methods
• In instances where additional data would be helpful, it can often be acquired from
operational systems or third parties
• Data is accessed in a variety of ways
• Directly in the data warehouse (rare)
• Extracted to a database management system (DBMS), e.g., Microsoft Access
• Extracted to Microsoft Excel (very common)
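A minimal sketch of pulling such extracts into an analysis environment with Python and pandas; the file names, sheet name, and table name below are hypothetical stand-ins for whatever the extraction process produces, and a local SQLite file stands in for Microsoft Access or another DBMS:

```python
import sqlite3
import pandas as pd

# Data extracted to Microsoft Excel (very common); the workbook name is
# hypothetical, and reading .xlsx files requires the openpyxl package
customers = pd.read_excel("customer_extract.xlsx", sheet_name="customers")

# Data extracted to a DBMS; a local SQLite file stands in here for
# Microsoft Access or any other database management system
conn = sqlite3.connect("warehouse_extract.db")
orders = pd.read_sql("SELECT * FROM orders", conn)
conn.close()

print(customers.shape, orders.shape)
```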
Data Structure
Common Data Types – Non-Numeric Data
Common Data Types – Numeric Data
Common Data Types – Identifiers
Common Data Types – Text
Variable Types
Data Preprocessing
Data Preprocessing
The data contained in modern data warehouses often have significant data quality issues
• Accuracy – Do the data accurately represent what they are intended to represent?
We have a customer record, but the income field reflects household income rather than the expected individual income
• Completeness – Do we have all of the data necessary?
We have a customer record, but there is no value in the income field
• Timeliness – Was the data collected recently enough to still be useful?
We have a customer record, but the value of the income field was collected 20 years ago
• Believability – Can the data be trusted?
We have a customer record, but the value of the income field is $5B
• Interpretability – Do we really understand what the data show?
We have a customer record, but the value of the income field has been scaled several times and we are not really sure what it means
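Several of these quality dimensions can be checked programmatically. The sketch below, using a small hypothetical customer extract and assumed plausibility thresholds, illustrates simple completeness, believability, and timeliness checks in pandas:

```python
import pandas as pd

# Hypothetical customer extract used to illustrate simple quality checks
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "income": [52_000, None, 5_000_000_000, 48_500],  # missing and implausible values
    "collected_year": [2023, 2024, 2005, 2024],
})

# Completeness: what share of each field is missing?
print(customers.isna().mean())

# Believability: flag incomes outside a plausible range (threshold is an assumption)
suspect = customers[(customers["income"] < 0) | (customers["income"] > 1_000_000)]
print(suspect)

# Timeliness: flag records collected too long ago to still be useful
stale = customers[customers["collected_year"] < 2015]
print(stale)
```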
Data Quality is Key!
“I still believe to this day, regardless of the tool that’s out there, there is not a tool
today that can replace data cleansing, data quality, and data profiling.”
Data Quality
• Data quality has consistently been shown to be a critical factor in the successful use of business intelligence (BI) within organizations
• Quality depends on the intended use of the data
• Quality costs time and money
• You are looking for sufficient, rather than optimal, quality
• Some preprocessing tasks related to improving data quality are often completed before the analyst receives the data; others are completed after
Preprocessing Tasks - Overview
• Data cleaning – dealing with missing values and smoothing noisy data
• Data integration – ensuring that the incorporation of data from multiple sources has not introduced inconsistencies into the data
• Data reduction – identifying a smaller subset of the data that can produce the same (or similar) analytical results
Data Cleaning
• Missing data approaches
• Ignore the tuple – Skip it; can result in significant data loss in sparse data sets
• Manual correction – Fix it; unrealistic in most scenarios
• Global constant – Use a placeholder; can be confused with actual data
• Central tendency – Use the mean or median; can alter the variation in the data
• Class-based central tendency – Use the mean or median associated with the class to which
this record belongs
• Most probable value – Estimate it using regression, decision tree, etc.
• Noisy data approaches
• Binning – Sort the data and adjust each value based on those of its neighbors (bin mean, median, or boundary)
• Regression – Use predicted rather than actual values
• Outlier analysis – Identify and exclude “odd” records
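As a rough illustration of the central-tendency approaches above, the pandas sketch below (hypothetical data) fills a missing income value first with the overall median and then with the median of the record's own class:

```python
import pandas as pd

# Hypothetical records with a missing income value
df = pd.DataFrame({
    "segment": ["retail", "retail", "premium", "premium"],
    "income":  [48_000, None, 120_000, 135_000],
})

# Central tendency: fill with the overall median (can shrink the variation in the data)
overall = df["income"].fillna(df["income"].median())

# Class-based central tendency: fill with the median of the record's own segment
by_class = df.groupby("segment")["income"].transform(lambda s: s.fillna(s.median()))

df = df.assign(income_overall=overall, income_by_class=by_class)
print(df)
```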
Data Integration
• Entity Identification
• How do we match records in one data source with those in another?
• Does prd_id = product_id?
• Are values contained in the sources in common units?
• Redundancy
• Can a given field be derived from others within the data set?
• Can introduce statistical issues (e.g., multicollinearity) if redundant fields are used in the same model
• Duplication
• Can result from data redundancy in underlying data sources
• Can inappropriately increase the significance of relationships
• Data value conflict
• When data source A indicates price is 24.99 and data source B indicates price is 19.99, which is correct?
• What are the rules that govern conflict resolution?
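A minimal pandas sketch of these integration steps, using two hypothetical source extracts: the key columns are reconciled, duplicates are dropped, the sources are merged, and a derived (redundant) field is checked against its source field:

```python
import pandas as pd

# Hypothetical extracts from two source systems with mismatched keys and a duplicate row
source_a = pd.DataFrame({"prd_id": [1, 2, 2], "price_usd": [24.99, 19.99, 19.99]})
source_b = pd.DataFrame({"product_id": [1, 2], "weight_kg": [0.5, 1.2]})

# Entity identification: agree that prd_id and product_id refer to the same thing
source_a = source_a.rename(columns={"prd_id": "product_id"})

# Duplication: the same product appears twice in source A
source_a = source_a.drop_duplicates()

# Integration: bring the sources together on the shared key
merged = source_a.merge(source_b, on="product_id", how="inner")

# Redundancy: a field derived from another (weight in pounds) adds no new information
merged["weight_lb"] = merged["weight_kg"] * 2.20462
print(merged[["price_usd", "weight_kg", "weight_lb"]].corr())
```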
Data Reduction
Data Reduction – Reducing Rows
Data Reduction – Reducing Columns
Data Exploration
Data Exploration
• Having a sound understanding of the data you employ in models is critical
• What does each variable represent?
• How was it measured?
• From whom was it obtained?
• How is it related to the business domain?
• If predicting, is it available at the time the prediction needs to be made?
• A lack of understanding on the part of the modeler will result in poor model
performance and/or nonsensical model parameters
• In addition to a general understanding of the data, understanding their statistical
properties is also important
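One common way to start building that statistical understanding is to profile the data set directly; the sketch below (hypothetical data) relies only on standard pandas summaries:

```python
import pandas as pd

# Hypothetical modeling data set; the goal is simply to see what each variable
# looks like before it goes anywhere near a model
df = pd.DataFrame({
    "age":    [34, 41, 29, 55, 38],
    "income": [52_000, 61_000, 48_500, 97_000, 58_000],
    "region": ["N", "S", "N", "W", "S"],
})

df.info()                           # types and non-null counts for each variable
print(df.describe())                # basic statistical properties of numeric fields
print(df["region"].value_counts())  # how a categorical variable is distributed
```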
Exploratory Data Analysis (EDA)
• EDA is an approach that attempts to develop an understanding of the data to
facilitate the selection of the best possible models.
• The seminal work is Exploratory Data Analysis (Tukey 1977)
• A nice summary may be found at http://www.itl.nist.gov/div898/handbook/index.htm
• The approach is designed to:
1. Maximize insight into a data set
2. Uncover underlying structure
3. Extract important variables
4. Detect outliers and anomalies
5. Test underlying assumptions
6. Develop parsimonious models
7. Determine optimal factor settings
Important Summary Statistics
• Median
• Order a group of numbers and select
the middle number
• Like the mean, represents central tendency, but is robust to the presence of outliers
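A quick numeric illustration of that robustness, reusing the implausible $5B income value from earlier as the outlier (the other values are made up):

```python
import numpy as np

incomes = np.array([42_000, 48_000, 51_000, 55_000, 60_000])
with_outlier = np.append(incomes, 5_000_000_000)  # the $5B record from earlier

# The mean is pulled far away by the outlier; the median barely moves
print(np.mean(incomes), np.median(incomes))
print(np.mean(with_outlier), np.median(with_outlier))
```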
Important Visualizations
• Histogram
• Graphical representation of the distribution of numeric data
• Bins are constructed and the number of observations that fall within each bin is represented on a bar graph
• Important for verifying assumptions of models are not violated
• Box Plot
• Graphical representation of data through quartiles
• Bottom and top of the box represent the first and third quartiles; the middle bar (or sometimes a dot) represents the median; whisker conventions vary
• Excellent for assessing distribution and identifying outliers
• Scatter Plot
• Graphical representation of two or more variables in relation to one another
• Each variable is plotted on one axis using Cartesian coordinates
• Good for detecting relationships between variables
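All three plots can be produced with a few lines of matplotlib; the sketch below uses synthetic income and spending data purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: income and a spending variable loosely related to it
rng = np.random.default_rng(42)
income = rng.normal(55_000, 12_000, 500)
spend = 0.1 * income + rng.normal(0, 2_000, 500)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(income, bins=20)         # distribution of a numeric variable
axes[0].set_title("Histogram")

axes[1].boxplot(income)               # quartiles, median, and outliers
axes[1].set_title("Box Plot")

axes[2].scatter(income, spend, s=10)  # relationship between two variables
axes[2].set_title("Scatter Plot")

plt.tight_layout()
plt.show()
```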
Histogram
Box Plot
Scatter Plot