Data Wrangling

Chapter 2 of CIS 370 focuses on data wrangling, emphasizing the importance of cleaning and organizing noisy real-world data for effective analysis. It discusses techniques for data inspection, handling missing values, subsetting data, and transforming both numerical and categorical data to improve data quality and analytical outcomes. The chapter includes practical exercises to apply these concepts using software tools like JMP.


CIS 370

Business Analytics
Chapter 2: Data Wrangling
Instructor: Hamed Qahri-Saremi, PhD
Data Wrangling & Management
Data Wrangling
• Real-world data are “noisy”, “disorganized”, and “hard to analyze” in vanilla form.
– The increasing volume and variety of data compel organizations to spend large amounts of time and resources
gathering, cleaning, and organizing data before performing any analysis.
• Data analytics is a “garbage-in garbage-out” type of process!
• Up to 80% of data analysis time and effort is focused on data management!

• Data wrangling is the process of retrieving, cleansing, integrating, transforming, and enriching
data to support analysis.
– Transform raw data into a format that is more appropriate and easier to analyze
– Objectives: improve data quality.
• Reduce the time/effort required to perform analytics and reveal the true intelligence in the data.
Data Inspection
Data Inspection
• Once the raw data are extracted from the database, data warehouse, or data mart,
review and inspect the dataset to assess data quality and relevant information for
subsequent analysis.
– Visual inspection
– Counting
– Sorting

• Data inspection helps us verify the completeness of data:


– Missing values
• Especially for important variables.
– Range of values for each variable.
• We can sort data based on a single variable or multiple variables.
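The slides perform these inspection steps in JMP; as an illustrative sketch only, the same single- and multi-variable sorting can be expressed in pandas. The column names and values below are assumptions, not the actual Ch2E1.xlsx schema.

```python
import pandas as pd

# Hypothetical employee data standing in for Ch2E1.xlsx
employees = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Age": [34, 58, 41, 58],
    "Years of Service": [5, 21, 12, 21],
})

# Sort on a single variable, descending (longest service first)
by_service = employees.sort_values("Years of Service", ascending=False)

# Sort on multiple variables: service first, then age to break ties
by_service_age = employees.sort_values(
    ["Years of Service", "Age"], ascending=[False, False]
)
```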
Data Inspection: Exercise 1
Data sorting in JMP:

1. Which employee(s) has/have the longest Years of Service?


• File ► Open; Select the file Ch2E1.xlsx
• Sort the employee data based on Years of Service in descending order (hint: right click the column
header)

2. Who is/are the oldest employee(s)?


• Sort the employee data based on Age in descending order (hint: right click the column header)
Data Inspection: Solution for Exercise 1
Data Preparation
Data Preparation
• Handling Missing Values
– There may be missing values in the key variables that are crucial for subsequent analysis.

• Subsetting Data
– Most data analysis projects focus only on a portion (subset) of the data, rather than the entire data
set.
– Sometimes the objective of the analysis is to compare two subgroups of the data.
Data Preparation: Handling Missing Values
• Missing values are a common data quality problem.
– Reduce usable observations.
– Can bias results.

• Reasons for missing values:


– Sensitive nature of questions.
• Missing values are not randomly distributed across observations.
• They are concentrated within one or more subgroups.
– Questions do not apply to every respondent.
– Data Collection / Entry Errors
• Human errors,
• Sloppy data collection
• Equipment failures
Data Preparation: Handling Missing Values
• Two strategies for dealing with missing values: omission and imputation.
– Omission:
• Exclude observations with missing values
• When the amount of missing values is small or concentrated in a small number of observations (≤ 5% of the
sample).

– Imputation:
• Replace missing values with some reasonable values.

• Some techniques require no treatment of missing values


– E.g., Decision trees are robust and can be applied to data sets even with the inclusion of missing
values.
Data Preparation: Handling Missing Values
• Imputation:
– For numerical values:
• Replace missing values with the mean / average value across relevant observations.
– Easy to implement.
– Does not increase the variability in the data set.
– If many values are missing, mean imputation will likely distort the relationships among variables, leading to biased results.
» The variable loses its variance (variability).
– More advanced methods include regression mean imputation.
» Prediction of missing values, using regression.
– In the presence of outliers, it is preferred to use the median instead of the mean to impute missing values.

– For categorical variables:


• The most frequent category (mode) is often used.
• An alternative:
– Create an “unknown” category (Useful when data are missing for a reason)

• If the variable that has many missing values is deemed unimportant or can be represented using a proxy
variable that does not have missing values, the variable may be excluded from the analysis.
– Omission.
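The omission and imputation strategies above can be sketched in pandas (a hedged illustration; the tire-style columns and values are invented, not the TreadWear.xlsx data):

```python
import pandas as pd

df = pd.DataFrame({
    "Mileage": [12000, None, 30000, 45000],   # numeric, one missing value
    "Position": ["Front", "Rear", None, "Front"],  # categorical, one missing
})

# Omission: exclude observations with any missing value
omitted = df.dropna()

# Imputation: mean for numeric variables (median if outliers are present),
# most frequent category (mode) for categorical variables
imputed = df.copy()
imputed["Mileage"] = imputed["Mileage"].fillna(imputed["Mileage"].mean())
imputed["Position"] = imputed["Position"].fillna(imputed["Position"].mode()[0])
```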
Data Preparation: Exercise 2
A U.S. producer of automobile tires wants to learn about the conditions of its tires on
automobiles in Texas (TreadWear.xlsx). The data obtained includes the position of the tire
on the automobile, age of the tire, mileage on the tire, and depth of the remaining tread on
the tire.
Assess the quality of these data by identifying which (if any) observations have missing
values.
1. Tables ► Missing Data Pattern
2. Highlight all the columns and click Add Column
Data Preparation: Exercise 2 Solution
• In JMP, a dot signifies a missing numeric value, and a blank signifies a missing
character value.
Data Preparation: Subsetting
• The process of extracting portions of a data set that are relevant to the analysis is
called subsetting.
– It is commonly used to pre-process the data prior to analysis.
– For time series data,
• We may choose to create subsets of recent observations and observations from the distant past in order
to analyze them separately.
– Subsetting can also be used to eliminate unwanted data such as observations that contain missing
values, low quality data, or outliers.
– Sometimes, subsetting involves excluding variables instead of observations.
• Irrelevant to the problem
• Contain redundant information
• Excessive amounts of missing values
Data Preparation: Subsetting
• Subsetting can also be performed as part of descriptive analytics that helps reveal
insights in the data.
– Example: summary data of tuberculosis treatment
• By comparing subsets of medical records with different treatment results, we may identify potential
contributing factors of success in a treatment.
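As a rough pandas analogue of the JMP Data Filter ► Subset workflow described above (column names are assumptions, not the Ch2E1.xlsx schema), subsetting by rows and by columns looks like:

```python
import pandas as pd

employees = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara"],
    "Gender": ["F", "M", "F"],
    "Years of Service": [5, 21, 12],
})

# Subset rows: keep only one subgroup (e.g., Gender = F)
females = employees[employees["Gender"] == "F"]

# Subset rows on a numeric range: more than 10 years of service
veterans = employees[employees["Years of Service"] > 10]

# Subset by excluding a variable deemed irrelevant or redundant
no_gender = employees.drop(columns=["Gender"])
```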
Data Preparation: Exercise 3
Data filter in JMP (file Ch2E1.xlsx)
1. Filter the data and create a subset of only Gender = F
– Rows ► Data Filter; Add the column “Gender” and select the level “F”
– Tables ► Subset; Select “selected rows” and “all columns”
Data Preparation: Exercise 3

2. Filter the data to show only employees with more than 10 years of service
– Rows ► Data Filter; Add the column “Years of Service” and adjust the data range shown in the
histogram
– Tables ► Subset; Select “selected rows” and “all columns”
Data Preparation: Exercise 3
Transforming Numerical Data
Transforming Numerical Data
• Data transformation is the data conversion process from one format or structure to
another.
– It is performed to meet the requirements of statistical and data mining techniques used for the
analysis.

– Binning:
• Group a vast range of numerical values into a small number of “bins”.

– Mathematical Transformation:
• Create new variables through mathematical transformations of existing variables.
Transforming Numerical Data: Binning
• Binning is transforming numerical variables into categorical variables by grouping
the numerical values into a small number of groups or bins.
• Important: the bins are consecutive and nonoverlapping so that each numerical value falls into only one
bin.
• Outliers in the data will fall into the first or last bin and, therefore, will not distort subsequent data
analysis.
– Bins can be based on:
• equal intervals
• equal counts (frequency),
– where individual bins have the same number of observations.
• Based on distribution.

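The three binning approaches above can be sketched with pandas (an illustration with made-up ages; the bin edges and labels are assumptions):

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 44, 52, 61, 95])  # 95 acts as an outlier

# Equal intervals: each bin spans the same range of values
equal_width = pd.cut(ages, bins=3)

# Equal counts (frequency): each bin holds the same number of observations
equal_freq = pd.qcut(ages, q=4)

# Custom, nonoverlapping bins with labels; the outlier simply lands in the
# last bin and cannot distort later analysis
labeled = pd.cut(ages, bins=[0, 30, 50, float("inf")],
                 labels=["Young", "Middle", "Senior"])
```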
Transforming Numerical Data: Mathematical Transformation
• Another common approach is to create new variables through mathematical transformations
of existing variables.
– To analyze trend, we often transform raw data values into percentages.
– To standardize, we calculate z-scores or do min-max standardization.
– Sometimes, data on variables such as income, firm size, and house prices are highly skewed.
• Extremely high (or low) values of skewed variables significantly inflate the average for the entire data set
• Difficult to detect meaningful relationships with skewed variables
• Use the natural logarithm (Ln) and square root transformation to reduce data skewness
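A minimal sketch of these transformations, using an invented, right-skewed income series:

```python
import numpy as np
import pandas as pd

income = pd.Series([30_000, 45_000, 52_000, 60_000, 1_200_000])  # right-skewed

# z-score standardization: (x - mean) / standard deviation
z = (income - income.mean()) / income.std()

# min-max standardization: rescale into [0, 1]
minmax = (income - income.min()) / (income.max() - income.min())

# Natural logarithm transformation to reduce skewness
log_income = np.log(income)
```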
Transforming Categorical Data
Transforming Categorical Data
• Potential problems when nominal or ordinal variables come with too many categories:
– Reduce model performance.
• Each category of a categorical variable adds parameters to the model, so many categories inflate model complexity.

– Categories that rarely occur.


• It is difficult to capture the impact of these categories accurately.
• A relatively small sample may not contain any observations in certain categories, creating errors when the
analytical model is later applied to a larger data set with observations in all categories.

– One category clearly dominates in terms of occurrence.


• The categorical variable will fail to make a positive impact since modeling success depends on being able
to differentiate among the observations.
Transforming Categorical Data: Category Reduction
• An effective strategy for dealing with these issues is category reduction,
– where we collapse some of the categories to create fewer nonoverlapping categories.

– Determining the appropriate number of categories often depends on the data, context, and
disciplinary norms, but there are a few general guidelines.
• Categories with very few observations may be combined to create the “Other” category.
– The rationale is that a critical mass can be created for this “Other” category to help reveal patterns and
relationships in data.

• Categories with a similar impact may be combined.


– “Weekdays” instead of Monday, Tuesday, …, Friday, if their impacts are similar, and similarly “Weekends”
instead of Saturday and Sunday.
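Both category-reduction guidelines can be sketched in pandas (illustrative only; the day and manufacturer values are invented, and the ≥ 2 frequency cutoff is an arbitrary assumption):

```python
import pandas as pd

# Collapse similar categories into fewer nonoverlapping groups
days = pd.Series(["Mon", "Tue", "Sat", "Wed", "Sun", "Fri"])
day_type = days.map(lambda d: "Weekend" if d in {"Sat", "Sun"} else "Weekday")

# Lump rarely occurring categories into an "Other" category
makes = pd.Series(["Toyota", "Toyota", "Honda", "Saab", "Toyota", "Honda"])
counts = makes.value_counts()
reduced = makes.where(makes.map(counts) >= 2, "Other")  # Saab -> "Other"
```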
Transforming Categorical Data: Dummy Variable
• A dummy variable, also referred to as an indicator or a binary variable, is commonly
used to describe two categories of a variable.
– It assumes a value of 1 for one of the categories and 0 for the other category, referred to as the
reference or the benchmark category.
– Dummy variables do not suggest any ranking of the categories.
– All interpretation of the results is made in relation to the reference category.

• Given k categories of a variable, the general rule is to create k − 1 dummy variables, using the last
category as the reference.

• In JMP:
Select the categorical variable ► Cols ► Utilities
► Make Indicator Columns
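Outside JMP, the same k − 1 rule can be illustrated with pandas (a sketch; note that `drop_first` uses the first alphabetical level as the reference, whereas the slide's rule names the last category — the count of k − 1 dummies is the same either way):

```python
import pandas as pd

df = pd.DataFrame({"Position": ["Front", "Rear", "Front", "Spare"]})

# k = 3 categories -> k - 1 = 2 dummy variables; "Front" (first level)
# becomes the reference category here
dummies = pd.get_dummies(df["Position"], drop_first=True)
```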
Transforming Categorical Data: Category Scores
• Another common transformation of categorical variables is to create category scores.
– This approach is most appropriate if the data are ordinal and have natural, ordered categories.
• This transformation allows the categorical variable to be treated as a numerical variable in
certain analytical models.
– With this transformation, we need not convert a categorical variable into several dummy
variables or to reduce its categories.

• For an effective transformation, we assume equal increments between the category


scores, which may not be appropriate in certain situations.
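A category-score transformation for an ordinal variable can be sketched as follows (the satisfaction levels and the equal 1-unit increments are illustrative assumptions):

```python
import pandas as pd

satisfaction = pd.Series(["Low", "High", "Medium", "Low", "High"])

# Assign ordered numeric scores so the variable can be treated as numerical;
# equal increments between categories are assumed, which may not always hold
scores = satisfaction.map({"Low": 1, "Medium": 2, "High": 3})
```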
Transforming Categorical Data: Exercise 4
Create a computed field
1. Which Manufacturer and Model was first in sales in March 2010?
– File ► Open; Select the file Top20Cars.xlsx
– Sort the table based on Sales in March 2010 (highest to lowest)
2. Add a new column to calculate the percent change in sales from March 2010 to March 2011.
– Cols ► New Column ► Column Properties ► Formula
3. Which Manufacturer and Model had the smallest increase in sales (most negative)?
– Sort the table based on the percent change in sales (lowest to highest)
4. Set a filter and create a subset of only vehicles manufactured by Toyota
– Rows ► Data Filter; Add the column “Manufacturer” and select the level “Toyota”
– Tables ► Subset; Select “selected rows” and “all columns”
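The computed percent-change field in step 2 can be sketched in pandas (the models and sales figures below are made up, not the Top20Cars.xlsx data):

```python
import pandas as pd

cars = pd.DataFrame({
    "Model": ["Camry", "Versa"],
    "Sales Mar 2010": [36000, 8000],
    "Sales Mar 2011": [31000, 6000],
})

# New column: percent change in sales from March 2010 to March 2011
cars["Pct Change"] = (
    (cars["Sales Mar 2011"] - cars["Sales Mar 2010"])
    / cars["Sales Mar 2010"] * 100
)

# Sort ascending to find the smallest (most negative) change
worst = cars.sort_values("Pct Change").iloc[0]["Model"]
```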
Transforming Categorical Data: Exercise 4 Solution
• Q1: Toyota Camry
• Q2: See “New Column” setting
• Q3: Nissan Versa
• Q4:
