Data Wrangling
Business Analytics
Chapter 2: Data Wrangling
Instructor: Hamed Qahri-Saremi, PhD
Data Wrangling & Management
Data Wrangling
• Real-world data are often noisy, disorganized, and hard to analyze in raw form.
– The increasing volume and variety of data compel organizations to spend large amounts of time and resources
gathering, cleaning, and organizing data before performing any analysis.
• Data analytics is a “garbage-in garbage-out” type of process!
• By common estimates, up to 80% of data analysis time and effort goes into data management!
• Data wrangling is the process of retrieving, cleansing, integrating, transforming, and enriching
data to support analysis.
– Transform raw data into a format that is more appropriate and easier to analyze.
– Objectives:
• Improve data quality.
• Reduce the time and effort required to perform analytics.
• Reveal the true intelligence in the data.
Data Inspection
• Once the raw data are extracted from the database, data warehouse, or data mart,
review and inspect the data set to assess data quality and identify the information relevant to
subsequent analysis.
– Visual inspection
– Counting
– Sorting
• Subsetting Data
– Most data analysis projects focus only on a portion (subset) of the data, rather than the entire data
set.
– Sometimes the objective of the analysis is to compare two subgroups of the data.
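The counting and sorting steps above can be sketched in a few lines of Python. This is only an illustration using the standard library; the records, field names, and values are hypothetical and do not come from the course files.

```python
from collections import Counter

# Hypothetical records for illustration only (not from the course files).
records = [
    {"name": "Ana", "gender": "F", "years": 12},
    {"name": "Ben", "gender": "M", "years": 3},
    {"name": "Cho", "gender": "F", "years": 7},
    {"name": "Dev", "gender": "M", "years": 15},
]

# Counting: how many observations fall in each category?
gender_counts = Counter(r["gender"] for r in records)

# Sorting: order the observations to spot the smallest and largest values.
by_years = sorted(records, key=lambda r: r["years"])
shortest, longest = by_years[0], by_years[-1]

print(gender_counts["F"], gender_counts["M"])  # 2 2
print(shortest["name"], longest["name"])       # Ben Dev
```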
Data Preparation: Handling Missing Values
• Missing values are a common data quality problem.
– They reduce the number of usable observations.
– They can bias results.
• Two common strategies for handling missing values:
– Omission: exclude observations with missing values from the analysis.
– Imputation: replace missing values with some reasonable values.
• If a variable with many missing values is deemed unimportant, or can be represented by a proxy
variable that does not have missing values, the variable may be excluded from the analysis.
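A minimal sketch of omission and imputation, assuming a single numeric variable with missing entries marked as None. The readings are hypothetical, and mean imputation is just one possible choice of "reasonable value" (the median or a model-based estimate are alternatives).

```python
from statistics import mean

# Hypothetical tread-depth readings; None marks a missing value.
depths = [8.5, None, 6.0, 7.5, None, 4.0]

# Omission: keep only the observations that have a value.
observed = [d for d in depths if d is not None]

# Imputation: replace each missing value with the mean of the observed values.
fill = mean(observed)
imputed = [d if d is not None else fill for d in depths]

print(len(depths), len(observed))  # 6 4
print(imputed)                     # [8.5, 6.5, 6.0, 7.5, 6.5, 4.0]
```

Omission shrinks the sample from six to four observations, while imputation keeps all six at the cost of introducing values that were never measured.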
Data Preparation: Exercise 2
A U.S. producer of automobile tires wants to learn about the conditions of its tires on
automobiles in Texas (TreadWear.xlsx). The data obtained includes the position of the tire
on the automobile, age of the tire, mileage on the tire, and depth of the remaining tread on
the tire.
Assess the quality of these data by identifying which (if any) observations have missing
values.
1. Tables ► Missing Data Pattern
2. Highlight all the columns and click Add Column
Data Preparation: Exercise 2 Solution
• In JMP, a dot signifies a missing numeric value, and a blank signifies a missing
character value.
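Outside JMP, a missing data pattern summary can be approximated by tallying which columns are missing in each row. The sketch below is an assumption-laden illustration: the column names and rows are hypothetical stand-ins for the tire data, not the contents of TreadWear.xlsx.

```python
from collections import Counter

# Hypothetical rows standing in for the tire data; None marks a missing value.
columns = ["position", "age", "miles", "tread_depth"]
rows = [
    ("LF", 2.0, 21000, 7.1),
    ("RF", None, 18000, 6.4),
    ("LR", 3.5, None, None),
    ("RR", 1.0, 9000, 8.8),
    ("LF", None, 30000, 5.2),
]

def missing_columns(row):
    """Return the names of the columns that are missing in this row."""
    return tuple(c for c, v in zip(columns, row) if v is None)

# Tally how many rows share each pattern of missing columns.
patterns = Counter(missing_columns(r) for r in rows)

# Row indices (0-based) of observations with at least one missing value.
incomplete = [i for i, r in enumerate(rows) if missing_columns(r)]

for pattern, count in patterns.items():
    print(count, pattern or "(complete)")
print(incomplete)  # [1, 2, 4]
```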
Data Preparation: Subsetting
• The process of extracting portions of a data set that are relevant to the analysis is
called subsetting.
– It is commonly used to pre-process the data prior to analysis.
– For time series data, we may create subsets of recent observations and of observations from the
distant past in order to analyze them separately.
– Subsetting can also be used to eliminate unwanted data such as observations that contain missing
values, low quality data, or outliers.
– Sometimes, subsetting involves excluding variables instead of observations, for example variables that:
• Are irrelevant to the problem
• Contain redundant information
• Have excessive amounts of missing values
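The three kinds of subsetting described above (dropping variables, dropping observations, and splitting by time) can be sketched as follows. The data, the 50% missing-value cutoff, and the 2021 cutoff year are all assumptions made for illustration.

```python
# Hypothetical data set: each row is an observation, None marks a missing value.
data = [
    {"year": 2019, "sales": 120, "notes": None},
    {"year": 2020, "sales": None, "notes": None},
    {"year": 2021, "sales": 150, "notes": "promo"},
    {"year": 2022, "sales": 170, "notes": None},
]

# Exclude variables: drop any column missing in more than half of the rows
# (the 50% cutoff is an arbitrary choice for this sketch).
n = len(data)
keep = [c for c in data[0] if sum(r[c] is None for r in data) / n <= 0.5]
trimmed = [{c: r[c] for c in keep} for r in data]

# Exclude observations: drop rows that still contain a missing value.
complete = [r for r in trimmed if None not in r.values()]

# Time-based subsets: recent versus earlier observations (cutoff year assumed).
recent = [r for r in complete if r["year"] >= 2021]
earlier = [r for r in complete if r["year"] < 2021]

print(keep)                                 # ['year', 'sales']
print(len(complete), len(recent), len(earlier))  # 3 2 1
```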
Data Preparation: Subsetting
• Subsetting can also be performed as part of descriptive analytics that helps reveal
insights in the data.
– Example: summary data of tuberculosis treatment
• By comparing subsets of medical records with different treatment results, we may identify potential
contributing factors of success in a treatment.
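Comparing subsets usually means computing the same summary for each subgroup. The sketch below assumes invented patient records and a made-up variable, days_on_treatment; it shows the mechanics only and says nothing about real tuberculosis outcomes.

```python
from statistics import mean

# Hypothetical patient records, invented purely for illustration.
patients = [
    {"outcome": "success", "days_on_treatment": 180},
    {"outcome": "success", "days_on_treatment": 170},
    {"outcome": "failure", "days_on_treatment": 90},
    {"outcome": "failure", "days_on_treatment": 110},
]

def subset(rows, outcome):
    """Return only the rows with the given treatment outcome."""
    return [r for r in rows if r["outcome"] == outcome]

# Same summary statistic, computed separately for each subgroup.
summary = {
    o: mean(r["days_on_treatment"] for r in subset(patients, o))
    for o in ("success", "failure")
}
print(summary)
```

A large gap between the subgroup averages would flag the variable as a candidate contributing factor worth further analysis; it would not by itself establish a cause.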
Data Preparation: Exercise 3
Data filter in JMP (file Ch2E1.xlsx)
1. Filter the data and create a subset of only Gender = F
– Rows ► Data Filter; Add the column “Gender” and select the level “F”
– Tables ► Subset; Select “selected rows” and “all columns”
2. Filter the data to show only employees with more than 10 years of service
– Rows ► Data Filter; Add the column “Years of Service” and adjust the data range shown in the
histogram
– Tables ► Subset; Select “selected rows” and “all columns”
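For readers working outside JMP, the two filters in this exercise correspond to simple row conditions. The sketch below reuses the exercise's column names, but the employee rows are hypothetical and are not the contents of Ch2E1.xlsx.

```python
# Hypothetical employee rows; column names mirror the exercise, values are invented.
employees = [
    {"Name": "A", "Gender": "F", "Years of Service": 14},
    {"Name": "B", "Gender": "M", "Years of Service": 22},
    {"Name": "C", "Gender": "F", "Years of Service": 4},
    {"Name": "D", "Gender": "M", "Years of Service": 9},
]

# Step 1 analogue: keep only rows where Gender is "F".
female = [e for e in employees if e["Gender"] == "F"]

# Step 2 analogue: keep only rows with more than 10 years of service.
long_service = [e for e in employees if e["Years of Service"] > 10]

print([e["Name"] for e in female])        # ['A', 'C']
print([e["Name"] for e in long_service])  # ['A', 'B']
```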
Transforming Numerical Data
• Data transformation is the process of converting data from one format or structure to
another.
– It is performed to meet the requirements of statistical and data mining techniques used for the
analysis.
– Binning:
• Group a vast range of numerical values into a small number of “bins”.
– Mathematical Transformation:
• Create new variables through mathematical transformations of existing variables.
Transforming Numerical Data: Binning
• Binning is transforming numerical variables into categorical variables by grouping
the numerical values into a small number of groups or bins.
– Important: the bins are consecutive and nonoverlapping, so each numerical value falls into exactly one
bin.
– Outliers in the data become part of the first or last bin and therefore will not distort subsequent data
analysis.
– Bins can be based on:
• Equal intervals, where each bin covers the same range of values.
• Equal counts (frequency), where each bin holds the same number of observations.
• The distribution of the data.
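The difference between equal-interval and equal-count binning is easiest to see on a small example. The two helper functions below are hypothetical names written for this sketch, and the income values are invented. The sketch assumes the values are not all identical (otherwise the interval width is zero), and ties could be split across bins in the equal-count version.

```python
def equal_interval_bins(values, k):
    """Assign each value to one of k bins of equal width (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum lands on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_count_bins(values, k):
    """Assign each value to one of k bins holding (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

# Hypothetical incomes (in $1,000s); 400 is an outlier.
incomes = [20, 25, 30, 35, 40, 400]
print(equal_interval_bins(incomes, 3))  # [0, 0, 0, 0, 0, 2]
print(equal_count_bins(incomes, 3))     # [0, 0, 1, 1, 2, 2]
```

With equal intervals the outlier stretches the range, so five of the six values crowd into the first bin and the middle bin is empty. Equal counts keeps two observations per bin and simply places the outlier in the last one.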
Transforming Numerical Data: Mathematical Transformation
• Another common approach is to create new variables through mathematical transformations
of existing variables.
– To analyze trend, we often transform raw data values into percentages.
– To standardize, we calculate z-scores or do min-max standardization.
– Sometimes, data on variables such as income, firm size, and house prices are highly skewed.
• Extremely high (or low) values of skewed variables significantly inflate (or deflate) the average for the entire data set.
• Meaningful relationships are difficult to detect with skewed variables.
• Use the natural logarithm (ln) or square root transformation to reduce data skewness.
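The standardization and skewness-reducing transformations above can be sketched with the standard library. The prices are hypothetical, and the z-scores here use the population standard deviation (pstdev); using the sample standard deviation (stdev) instead is an equally common choice.

```python
import math
from statistics import mean, pstdev

# Hypothetical house prices (in $1,000s); the last value creates right skew.
prices = [100, 200, 300, 400, 4000]

# z-scores: distance from the mean in standard-deviation units.
mu, sigma = mean(prices), pstdev(prices)
z_scores = [(p - mu) / sigma for p in prices]

# Min-max: rescale so the smallest value maps to 0 and the largest to 1.
lo, hi = min(prices), max(prices)
min_max = [(p - lo) / (hi - lo) for p in prices]

# Skewness reduction: both transformations compress large values.
log_prices = [math.log(p) for p in prices]    # requires p > 0
sqrt_prices = [math.sqrt(p) for p in prices]  # requires p >= 0

print(min_max[0], min_max[-1])  # 0.0 1.0
print(max(prices) / min(prices), max(log_prices) / min(log_prices))
```

On the raw scale the largest price is 40 times the smallest; after the log transformation the ratio drops below 2, which is why the outlier pulls far less on averages and fitted relationships.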
Transforming Categorical Data
• Potential problems when nominal or ordinal variables come with too many categories:
– They can reduce model performance.
• A model must estimate parameters for the categories of a categorical variable, so more categories mean more parameters.
– Determining the appropriate number of categories often depends on the data, context, and
disciplinary norms, but there are a few general guidelines.
• Categories with very few observations may be combined to create the “Other” category.
– The rationale is that a critical mass can be created for this “Other” category to help reveal patterns and
relationships in data.
• In JMP, to create indicator (dummy) columns:
Select the categorical variable ► Cols ► Utilities ► Make Indicator Columns
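Combining rare categories into "Other" and then creating indicator columns can be sketched as below. The color values are hypothetical, and the minimum count of 2 is an arbitrary threshold chosen for this illustration.

```python
from collections import Counter

# Hypothetical categorical column.
colors = ["red", "blue", "red", "green", "blue", "red", "teal", "pink"]

# Combine categories seen fewer than min_count times into "Other"
# (the threshold of 2 is an arbitrary choice for this sketch).
min_count = 2
counts = Counter(colors)
reduced = [c if counts[c] >= min_count else "Other" for c in colors]

# Indicator (dummy) columns: one 0/1 column per remaining category.
levels = sorted(set(reduced))
indicators = {lvl: [int(c == lvl) for c in reduced] for lvl in levels}

print(levels)               # ['Other', 'blue', 'red']
print(indicators["Other"])  # [0, 0, 0, 1, 0, 0, 1, 1]
```

Five original categories shrink to three. When indicator columns feed a regression model, one of them is typically left out to serve as the reference category.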
Transforming Categorical Data: Category Scores
• Another common transformation of categorical variables is to create category scores.
– This approach is most appropriate if the data are ordinal and have natural, ordered categories.
• This transformation allows the categorical variable to be treated as a numerical variable in
certain analytical models.
– With this transformation, we need not convert a categorical variable into several dummy
variables or reduce its categories.
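A category score is simply a mapping from ordered labels to numbers. The sketch below assumes a hypothetical four-level survey item and evenly spaced scores of 1 to 4; equal spacing is itself an assumption about how far apart the categories are.

```python
# Assumed ordering for a hypothetical satisfaction survey item.
scores = {"Poor": 1, "Fair": 2, "Good": 3, "Excellent": 4}

# Hypothetical responses.
responses = ["Good", "Poor", "Excellent", "Fair", "Good"]

# Replace each ordinal label with its numeric score.
numeric = [scores[r] for r in responses]

print(numeric)                      # [3, 1, 4, 2, 3]
print(sum(numeric) / len(numeric))  # 2.6
```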