Data Wrangling

Chapter 2 of CIS 370 focuses on data wrangling, emphasizing the importance of cleaning and organizing noisy real-world data for effective analysis. It discusses techniques for data inspection, handling missing values, subsetting data, and transforming both numerical and categorical data to improve data quality and analytical outcomes. The chapter includes practical exercises to apply these concepts using software tools like JMP.


CIS 370

Business Analytics
Chapter 2: Data Wrangling
Instructor: Hamed Qahri-Saremi, PhD
Data Wrangling & Management
Data Wrangling
• Real-world data are “noisy”, “disorganized”, and “hard to analyze” in vanilla form.
– The increasing volume and variety of data compel organizations to spend large amounts of time and resources
gathering, cleaning, and organizing data before performing any analysis.
• Data analytics is a “garbage-in garbage-out” type of process!
• Up to 80% of data analysis time and effort is focused on data management!

• Data wrangling is the process of retrieving, cleansing, integrating, transforming, and enriching
data to support analysis.
– Transform raw data into a format that is more appropriate and easier to analyze
– Objectives: improve data quality.
• Reduce the time/effort required to perform analytics and reveal the true intelligence in the data.
Data Inspection
Data Inspection
• Once the raw data are extracted from the database, data warehouse, or data mart,
review and inspect the dataset to assess data quality and relevant information for
subsequent analysis.
– Visual inspection
– Counting
– Sorting

• Data inspection helps us verify the completeness of data:


– Missing values
• Especially for important variables.
– Range of values for each variable.
• We can sort data based on a single variable or multiple variables.
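The slides perform these inspection steps in JMP; as an illustrative sketch only, the same single- and multi-variable sorting can be expressed in pandas. The column names and values below are assumptions, not the actual Ch2E1.xlsx schema.

```python
import pandas as pd

# Hypothetical employee data standing in for Ch2E1.xlsx
employees = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Age": [34, 58, 41, 58],
    "Years of Service": [5, 21, 12, 21],
})

# Sort on a single variable, descending (longest service first)
by_service = employees.sort_values("Years of Service", ascending=False)

# Sort on multiple variables: service first, then age to break ties
by_service_age = employees.sort_values(
    ["Years of Service", "Age"], ascending=[False, False]
)
```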
Data Inspection: Exercise 1
Data sorting in JMP:

1. Which employee(s) has/have the longest Years of Service?


• File ► Open; Select the file Ch2E1.xlsx
• Sort the employee data based on Years of Service in descending order (hint: right click the column
header)

2. Who is/are the oldest employee(s)?


• Sort the employee data based on Age in descending order (hint: right click the column header)
Data Inspection: Solution for Exercise 1
Data Preparation
Data Preparation
• Handling Missing Values
– There may be missing values in the key variables that are crucial for subsequent analysis.

• Subsetting Data
– Most data analysis projects focus only on a portion (subset) of the data, rather than the entire data
set.
– Sometimes the objective of the analysis is to compare two subgroups of the data.
Data Preparation: Handling Missing Values
• Missing values are a common data quality problem.
– Reduce usable observations.
– Can bias results.

• Reasons for missing values:


– Sensitive nature of questions.
• Missing values are not randomly distributed across observations.
• They are concentrated within one or more subgroups.
– Questions do not apply to every respondent.
– Data Collection / Entry Errors
• Human errors,
• Sloppy data collection
• Equipment failures
Data Preparation: Handling Missing Values
• Two strategies for dealing with missing values: omission and imputation.
– Omission:
• Exclude observations with missing values
• When the amount of missing values is small or concentrated in a small number of observations (≤ 5% of the
sample).

– Imputation:
• Replace missing values with some reasonable values.

• Some techniques require no treatment of missing values


– E.g., Decision trees are robust and can be applied to data sets even with the inclusion of missing
values.
Data Preparation: Handling Missing Values
• Imputation:
– For numerical values:
• Replace missing values with the mean / average value across relevant observations.
– Easy to implement.
– Does not increase the variability in the data set.
– If many values are missing, mean imputation will likely distort the relationships among variables, leading to biased results.
» The variable loses its variance (variability).
– More advanced methods include regression mean imputation.
» Prediction of missing values, using regression.
– In the presence of outliers, it is preferred to use the median instead of the mean to impute missing values.

– For categorical variables:


• The most frequent category (mode) is often used.
• An alternative:
– Create an “unknown” category (Useful when data are missing for a reason)

• If the variable that has many missing values is deemed unimportant or can be represented using a proxy
variable that does not have missing values, the variable may be excluded from the analysis.
– Omission.
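The omission and imputation strategies above can be sketched in pandas (a hedged illustration; the tire-style columns and values are invented, not the TreadWear.xlsx data):

```python
import pandas as pd

df = pd.DataFrame({
    "Mileage": [12000, None, 30000, 45000],   # numeric, one missing value
    "Position": ["Front", "Rear", None, "Front"],  # categorical, one missing
})

# Omission: exclude observations with any missing value
omitted = df.dropna()

# Imputation: mean for numeric variables (median if outliers are present),
# most frequent category (mode) for categorical variables
imputed = df.copy()
imputed["Mileage"] = imputed["Mileage"].fillna(imputed["Mileage"].mean())
imputed["Position"] = imputed["Position"].fillna(imputed["Position"].mode()[0])
```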
Data Preparation: Exercise 2
A U.S. producer of automobile tires wants to learn about the conditions of its tires on
automobiles in Texas (TreadWear.xlsx). The data obtained includes the position of the tire
on the automobile, age of the tire, mileage on the tire, and depth of the remaining tread on
the tire.
Assess the quality of these data by identifying which (if any) observations have missing
values.
1. Tables ► Missing Data Pattern
2. Highlight all the columns and click Add Column
Data Preparation: Exercise 2 Solution
• In JMP, a dot signifies a missing numeric value, and a blank signifies a missing
character value.
Data Preparation: Subsetting
• The process of extracting portions of a data set that are relevant to the analysis is
called subsetting.
– It is commonly used to pre-process the data prior to analysis.
– For time series data,
• We may choose to create subsets of recent observations and observations from the distant past in order
to analyze them separately.
– Subsetting can also be used to eliminate unwanted data such as observations that contain missing
values, low quality data, or outliers.
– Sometimes, subsetting involves excluding variables instead of observations.
• Irrelevant to the problem
• Contain redundant information
• Excessive amounts of missing values
Data Preparation: Subsetting
• Subsetting can also be performed as part of descriptive analytics that helps reveal
insights in the data.
– Example: summary data of tuberculosis treatment
• By comparing subsets of medical records with different treatment results, we may identify potential
contributing factors of success in a treatment.
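As a rough pandas analogue of the JMP Data Filter ► Subset workflow described above (column names are assumptions, not the Ch2E1.xlsx schema), subsetting by rows and by columns looks like:

```python
import pandas as pd

employees = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara"],
    "Gender": ["F", "M", "F"],
    "Years of Service": [5, 21, 12],
})

# Subset rows: keep only one subgroup (e.g., Gender = F)
females = employees[employees["Gender"] == "F"]

# Subset rows on a numeric range: more than 10 years of service
veterans = employees[employees["Years of Service"] > 10]

# Subset by excluding a variable deemed irrelevant or redundant
no_gender = employees.drop(columns=["Gender"])
```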
Data Preparation: Exercise 3
Data filter in JMP (file Ch2E1.xlsx)
1. Filter the data and create a subset of only Gender = F
– Rows ► Data Filter; Add the column “Gender” and select the level “F”
– Tables ► Subset; Select “selected rows” and “all columns”
Data Preparation: Exercise 3

2. Filter the data to show only employees with more than 10 years of service
– Rows ► Data Filter; Add the column “Years of Service” and adjust the data range shown in the
histogram
– Tables ► Subset; Select “selected rows” and “all columns”
Data Preparation: Exercise 3
Transforming Numerical Data
Transforming Numerical Data
• Data transformation is the data conversion process from one format or structure to
another.
– It is performed to meet the requirements of statistical and data mining techniques used for the
analysis.

– Binning:
• Group a vast range of numerical values into a small number of “bins”.

– Mathematical Transformation:
• Create new variables through mathematical transformations of existing variables.
Transforming Numerical Data: Binning
• Binning is transforming numerical variables into categorical variables by grouping
the numerical values into a small number of groups or bins.
• Important: the bins are consecutive and nonoverlapping so that each numerical value falls into only one
bin.
• Outliers in the data will fall into the first or last bin and, therefore, will not distort subsequent data
analysis.
– Bins can be based on:
• equal intervals
• equal counts (frequency),
– where individual bins have the same number of observations.
• Based on distribution.

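The three binning approaches above can be sketched with pandas (an illustration with made-up ages; the bin edges and labels are assumptions):

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 44, 52, 61, 95])  # 95 acts as an outlier

# Equal intervals: each bin spans the same range of values
equal_width = pd.cut(ages, bins=3)

# Equal counts (frequency): each bin holds the same number of observations
equal_freq = pd.qcut(ages, q=4)

# Custom, nonoverlapping bins with labels; the outlier simply lands in the
# last bin and cannot distort later analysis
labeled = pd.cut(ages, bins=[0, 30, 50, float("inf")],
                 labels=["Young", "Middle", "Senior"])
```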
Transforming Numerical Data: Mathematical Transformation
• Another common approach is to create new variables through mathematical transformations
of existing variables.
– To analyze trend, we often transform raw data values into percentages.
– To standardize, we calculate z-scores or do min-max standardization.
– Sometimes, data on variables such as income, firm size, and house prices are highly skewed.
• Extremely high (or low) values of skewed variables significantly inflate the average for the entire data set
• Difficult to detect meaningful relationships with skewed variables
• Use the natural logarithm (Ln) and square root transformation to reduce data skewness
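A minimal sketch of these transformations, using an invented, right-skewed income series:

```python
import numpy as np
import pandas as pd

income = pd.Series([30_000, 45_000, 52_000, 60_000, 1_200_000])  # right-skewed

# z-score standardization: (x - mean) / standard deviation
z = (income - income.mean()) / income.std()

# min-max standardization: rescale into [0, 1]
minmax = (income - income.min()) / (income.max() - income.min())

# Natural logarithm transformation to reduce skewness
log_income = np.log(income)
```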
Transforming Categorical Data
Transforming Categorical Data
• Potential problems when nominal or ordinal variables come with too many categories:
– Reduce model performance.
• Each category of a categorical variable adds parameters to the model, so many categories inflate model complexity.

– Categories that rarely occur.


• It is difficult to capture the impact of these categories accurately.
• A relatively small sample may not contain any observations in certain categories, creating errors when the
analytical model is later applied to a larger data set with observations in all categories.

– One category clearly dominates in terms of occurrence.


• The categorical variable will fail to make a positive impact since modeling success depends on being able
to differentiate among the observations.
Transforming Categorical Data: Category Reduction
• An effective strategy for dealing with these issues is category reduction,
– where we collapse some of the categories to create fewer nonoverlapping categories.

– Determining the appropriate number of categories often depends on the data, context, and
disciplinary norms, but there are a few general guidelines.
• Categories with very few observations may be combined to create the “Other” category.
– The rationale is that a critical mass can be created for this “Other” category to help reveal patterns and
relationships in data.

• Categories with a similar impact may be combined.


– “Weekdays” instead of Monday, Tuesday, …, Friday, if their impacts are similar, and similarly “Weekends”
instead of Saturday and Sunday.
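Both category-reduction guidelines can be sketched in pandas (illustrative only; the day and manufacturer values are invented, and the ≥ 2 frequency cutoff is an arbitrary assumption):

```python
import pandas as pd

# Collapse similar categories into fewer nonoverlapping groups
days = pd.Series(["Mon", "Tue", "Sat", "Wed", "Sun", "Fri"])
day_type = days.map(lambda d: "Weekend" if d in {"Sat", "Sun"} else "Weekday")

# Lump rarely occurring categories into an "Other" category
makes = pd.Series(["Toyota", "Toyota", "Honda", "Saab", "Toyota", "Honda"])
counts = makes.value_counts()
reduced = makes.where(makes.map(counts) >= 2, "Other")  # Saab -> "Other"
```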
Transforming Categorical Data: Dummy Variable
• A dummy variable, also referred to as an indicator or a binary variable, is commonly
used to describe two categories of a variable.
– It assumes a value of 1 for one of the categories and 0 for the other category, referred to as the
reference or the benchmark category.
– Dummy variables do not suggest any ranking of the categories.
– All interpretation of the results is made in relation to the reference category.

• Given k categories of a variable, the general rule is to create k − 1 dummy variables, using the last
category as the reference.

• In JMP:
Select the categorical variable ► Cols ► Utilities
► Make Indicator Columns
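Outside JMP, the same k − 1 rule can be illustrated with pandas (a sketch; note that `drop_first` uses the first alphabetical level as the reference, whereas the slide's rule names the last category — the count of k − 1 dummies is the same either way):

```python
import pandas as pd

df = pd.DataFrame({"Position": ["Front", "Rear", "Front", "Spare"]})

# k = 3 categories -> k - 1 = 2 dummy variables; "Front" (first level)
# becomes the reference category here
dummies = pd.get_dummies(df["Position"], drop_first=True)
```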
Transforming Categorical Data: Category Scores
• Another common transformation of categorical variables is to create category scores.
– This approach is most appropriate if the data are ordinal and have natural, ordered categories.
• This transformation allows the categorical variable to be treated as a numerical variable in
certain analytical models.
– With this transformation, we need not convert a categorical variable into several dummy
variables or to reduce its categories.

• For an effective transformation, we assume equal increments between the category


scores, which may not be appropriate in certain situations.
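A category-score transformation for an ordinal variable can be sketched as follows (the satisfaction levels and the equal 1-unit increments are illustrative assumptions):

```python
import pandas as pd

satisfaction = pd.Series(["Low", "High", "Medium", "Low", "High"])

# Assign ordered numeric scores so the variable can be treated as numerical;
# equal increments between categories are assumed, which may not always hold
scores = satisfaction.map({"Low": 1, "Medium": 2, "High": 3})
```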
Transforming Categorical Data: Exercise 4
Create a computed field
1. Which Manufacturer and Model was first in sales in March 2010?
– File ► Open; Select the file Top20Cars.xlsx
– Sort the table based on Sales in March 2010 (highest to lowest)
2. Add a new column to calculate the percent change in sales from March 2010 to March 2011.
– Cols ► New Column ► Column Properties ► Formula
3. Which Manufacturer and Model had the smallest increase in sales (most negative)?
– Sort the table based on the percent change in sales (lowest to highest)
4. Set a filter and create a subset of only vehicles manufactured by Toyota
– Rows ► Data Filter; Add the column “Manufacturer” and select the level “Toyota”
– Tables ► Subset; Select “selected rows” and “all columns”
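The computed percent-change field in step 2 can be sketched in pandas (the models and sales figures below are made up, not the Top20Cars.xlsx data):

```python
import pandas as pd

cars = pd.DataFrame({
    "Model": ["Camry", "Versa"],
    "Sales Mar 2010": [36000, 8000],
    "Sales Mar 2011": [31000, 6000],
})

# New column: percent change in sales from March 2010 to March 2011
cars["Pct Change"] = (
    (cars["Sales Mar 2011"] - cars["Sales Mar 2010"])
    / cars["Sales Mar 2010"] * 100
)

# Sort ascending to find the smallest (most negative) change
worst = cars.sort_values("Pct Change").iloc[0]["Model"]
```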
Transforming Categorical Data: Exercise 4 Solution
• Q1: Toyota Camry
• Q2: See “New Column” setting
• Q3: Nissan Versa
• Q4:
