4. Process Data from Dirty to Clean

CHAPTER 1: The importance of integrity

Focus on integrity

Data Integrity and its Significance


● Clean data is crucial for accurate data analysis, as illustrated by the
example of counting users with multiple subscriptions.
● Maintaining data integrity throughout the data analysis process is essential
for reliable results and informed decision-making.
Data Cleaning Techniques and Tools
● Data cleaning involves techniques to address issues like duplicate data,
ensuring accuracy.
● Spreadsheets and SQL are valuable tools for data cleaning, allowing for
efficient data manipulation and transformation.
Importance of Verification and Reporting
● Verifying and reporting on cleaning results is crucial to ensure the
effectiveness of the cleaning process.
● Documenting the cleaning process provides transparency and facilitates
future audits or revisions.

Data integrity and analytics objectives


Data integrity: the accuracy, completeness, consistency, and
trustworthiness of data throughout its lifecycle.

Data integrity risks:


● Inaccurate or incomplete data can lead to faulty analysis and incorrect
conclusions, impacting decision-making.
● Data integrity can be compromised during replication, transfer, or
manipulation, leading to inconsistencies and errors.
○ Data replication: the process of storing data in multiple locations.
-> If the copies fall out of sync, different people may base their findings on different data, producing inconsistencies and undermining integrity.
○ Data transfer: the process of copying data from a storage device to
memory, or from one computer to another.
-> If a transfer is interrupted, the result is an incomplete dataset.
○ Data manipulation: the process of changing data to make it more
organized and easier to read.
-> Errors made while manipulating data can also compromise its integrity.
○ Human error
○ Viruses
○ Malware
○ Hacking
○ System failures

Data integrity safeguards


● Many companies rely on data warehousing or data engineering teams to
ensure data integrity.
● Data analysts should double-check data completeness and validity before
analysis to ensure accurate results.

Data constraints, their definitions, and examples:

● Data type: values must be of a certain type (date, number, percentage, Boolean, etc.). Example: if the data type is a date, a single number like 30 would fail the constraint and be invalid.
● Data range: values must fall between predefined maximum and minimum values. Example: if the data range is 10-20, a value of 30 would fail the constraint and be invalid.
● Mandatory: values can't be left blank or empty. Example: if age is mandatory, that value must be filled in.
● Unique: values can't have a duplicate. Example: two people can't have the same mobile phone number within the same service area.
● Regular expression (regex) patterns: values must match a prescribed pattern. Example: a phone number must match ###-###-#### (no other characters allowed).
● Cross-field validation: certain conditions for multiple fields must be satisfied. Example: values are percentages, and the values from multiple fields must add up to 100%.
● Primary-key: (databases only) values must be unique per column. Example: a database table can't have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.
● Set-membership: (databases only) values for a column must come from a set of discrete values. Example: the value for a column must be set to Yes, No, or Not Applicable.
● Foreign-key: (databases only) values for a column must be unique values coming from a column in another table. Example: in a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate States table.
● Accuracy: the degree to which the data conforms to the actual entity being measured or described. Example: if values for zip codes are validated by street location, the accuracy of the data goes up.
● Completeness: the degree to which the data contains all desired components or measures. Example: if data for personal profiles required hair and eye color, and both are collected, the data is complete.
● Consistency: the degree to which the data is repeatable from different points of entry or collection. Example: if a customer has the same address in the sales and repair databases, the data is consistent.
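Several of these constraints can be checked directly with SQL before analysis. Below is a minimal sketch, assuming a hypothetical BigQuery-style table customer_data.account_holders with age and phone_number columns; the names and the 0-120 age range are illustrative only.

SELECT
  COUNTIF(age IS NULL) AS missing_age,  -- mandatory: count blank values
  COUNTIF(age < 0 OR age > 120) AS age_out_of_range,  -- data range: 0-120 is an assumed example
  COUNTIF(NOT REGEXP_CONTAINS(phone_number, r'^\d{3}-\d{3}-\d{4}$')) AS bad_phone_format  -- regex: ###-###-####
FROM
  customer_data.account_holders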

Note:
● Before analyzing data, it's crucial to verify its integrity and relevance to the
business objective. Data that hasn't been cleaned or is misaligned with the
objective can lead to inaccurate conclusions.
● When faced with limited data, explore alternative data sources, adjust your
analysis approach, or collaborate with data engineers to gather additional
information. For example, you could track new data points to gain a more
comprehensive understanding.

Well-aligned objectives and data


● When there is clean data and good alignment, you can get accurate
insights and make conclusions the data supports.
● If there is good alignment but the data needs to be cleaned, clean the data
before you perform your analysis.
● If the data only partially aligns with an objective, think about how you could
modify the objective, or use data constraints to make sure that the subset
of data better aligns with the business objective.
Overcome the challenges of insufficient data

Types of insufficient data


● Data from only one source
● Data that keeps updating
● Outdated data
● Geographically-limited data
Ways to address insufficient data
● Identify trends with the new available data
● Wait for more data if time allows
● Talk with stakeholders and adjust your objective
● Look for a new dataset

Data issue 1: No data

● Possible solution: Gather the data on a small scale to perform a preliminary analysis, and then request additional time to complete the analysis after you have collected more data.
Example: If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.
● Possible solution: If there isn’t time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround.
Example: If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic.
Data issue 2: Too little data

● Possible solution: Do the analysis using proxy data along with actual data.
Example: If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors.
● Possible solution: Adjust your analysis to align with the data you already have.
Example: If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only.

Data issue 3: Wrong data, including data with errors

● Possible solution: If you have the wrong data because requirements were misunderstood, communicate the requirements again.
Example: If you need the data for female voters and received the data for male voters, restate your needs.
● Possible solution: Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors.
Example: If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.
● Possible solution: If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias.
Example: If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with bad translation and go ahead with the analysis of the other data.
- Sample size technique
Issue: sampling bias occurs when the chosen sample doesn’t accurately represent the whole population, leading to skewed results.
Fix: use random sampling (see the SQL sketch after the terminology list below).
- Terminology:
+ Population
+ Sample
+ Margin of error: the difference between the sample’s results and the population’s results
+ Confidence level
+ Confidence interval
+ Statistical significance
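A quick illustration of taking a random sample with SQL, as referenced above (a sketch only: the survey_responses table and the sample size of 1,000 are hypothetical, and RAND() is BigQuery/MySQL syntax):

SELECT *
FROM survey_responses
ORDER BY RAND()  -- shuffle the rows randomly
LIMIT 1000       -- keep a random sample of 1,000 rows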

Test your data

Statistical Power
● Statistical power is the probability of obtaining meaningful results from a
test, indicating the likelihood of the results being reliable and not due to
random chance.
Hypothesis testing is a way to see if a survey or experiment has
meaningful results.
● A higher statistical power increases confidence in the results; a statistical
power of 0.8 (80%) or higher is generally considered the minimum needed
for results to be treated as statistically significant.
Factors Affecting Statistical Power
● Sample size plays a crucial role in statistical power, with larger samples
generally leading to greater statistical power.
● Various factors, such as external influences or biases within the sample,
can impact the accuracy of the results and should be carefully considered
when designing tests or studies

Proxy Data
● Proxy data serves as a valuable tool when direct data related to a business
objective is unavailable, allowing for estimations and predictions.
● For instance, when a new car model launches, an analyst might use the
number of clicks on the car's specifications on the dealership's website as
a proxy for potential sales.
Open Datasets as Proxy Data Sources
● Open or public datasets, accessible online, can be valuable sources of
proxy data.
● For example, a clinic might use an open dataset from a trial of a vaccine
injection to estimate potential contraindications for a newly available nasal
version of the vaccine.
Working with Open Datasets
● Kaggle, a platform for data science, offers a wide array of datasets in
various formats, including CSV, JSON, SQLite, and BigQuery.
● When using open datasets, exercise caution and check for duplicate data
and null values, as their interpretation can significantly impact analysis.

Sample Size Calculator


● A sample size calculator helps determine the number of participants
needed for your study to accurately reflect the target population.
● To use a sample size calculator, you need to know your desired
confidence level, margin of error, and population size.
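For reference, the calculation most of these calculators perform for a proportion is (a sketch, assuming a large population and the standard normal approximation):

n ≈ z² × p × (1 − p) / e²

where z is the z-score for the chosen confidence level (about 1.96 for 95%), p is the expected proportion (0.5 is the most conservative choice), and e is the margin of error. For example, with a 95% confidence level, p = 0.5, and a 5% margin of error, n ≈ (1.96² × 0.25) / 0.05² ≈ 385 respondents.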

Consider the margin of error


Margin of Error
● Margin of error reveals the maximum expected difference between your
sample's results and the actual population's results.
● A smaller margin of error suggests that your sample's results are more
likely to reflect the entire population.
How Margin of Error Works
● Imagine you surveyed a sample of people about their preference for a four-
day workweek, and 60% were in favor. If the margin of error is 10%, the
actual percentage of people who favor a four-day workweek in the entire
population likely falls between 50% and 70%.
● You can calculate the margin of error using the population size, sample
size, and desired confidence level.
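For reference, the margin of error for a proportion can be approximated (a sketch, assuming simple random sampling from a large population) as:

margin of error ≈ z × √(p × (1 − p) / n)

where z is the z-score for the confidence level, p is the observed proportion, and n is the sample size. In the four-day workweek example, with p = 0.6, a 95% confidence level (z ≈ 1.96), and a hypothetical sample of 96 people, the margin of error is about 1.96 × √(0.24 / 96) ≈ 0.098, or roughly the 10% described above.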
Review
Accuracy: The degree to which the data conforms to the actual entity being
measured or described
Completeness: The degree to which the data contains all desired components
or measures
Confidence interval: A range of values that conveys how likely a statistical
estimate reflects the population
Confidence level: The probability that a sample size accurately reflects the
greater population
Consistency: The degree to which data is repeatable from different points of
entry or collection
Cross-field validation: A process that ensures certain conditions for multiple
data fields are satisfied
Data constraints: The criteria that determine whether a piece of data is clean
and valid
Data integrity: The accuracy, completeness, consistency, and trustworthiness of
data throughout its life cycle
Data manipulation: The process of changing data to make it more organized
and easier to read
Data range: Numerical values that fall between predefined maximum and
minimum values
Data replication: The process of storing data in multiple locations
DATEDIF: A spreadsheet function that calculates the number of days, months, or
years between two dates
Estimated response rate: The average number of people who typically
complete a survey
Hypothesis testing: A process to determine if a survey or experiment has
meaningful results
Mandatory: A data value that cannot be left blank or empty
Margin of error: The maximum amount that the sample results are expected to
differ from those of the actual population
Random sampling: A way of selecting a sample from a population so that every
possible type of the sample has an equal chance of being chosen
Regular expression (RegEx): A rule that says the values in a table must match
a prescribed pattern
CHAPTER 2: Clean data for more accurate insights
Data cleaning is a must

Data cleaning is critical:


- Dirty data, plagued by inconsistencies, errors, and irrelevant information,
hinders effective analysis and can lead to inaccurate conclusions.
- Clean data, being complete, correct, and relevant, enables meaningful
analysis, pattern identification, and informed decision-making.
What is dirty data?
Duplicate data
● Description: any data record that shows up more than once.
● Possible causes: manual data entry, batch data imports, or data migration.
● Potential harm to businesses: skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval.

Outdated data
● Description: any data that is old and should be replaced with newer and more accurate information.
● Possible causes: people changing roles or companies, or software and systems becoming obsolete.
● Potential harm to businesses: inaccurate insights, decision-making, and analytics.

Incomplete data
● Description: any data that is missing important fields.
● Possible causes: improper data collection or incorrect data entry.
● Potential harm to businesses: decreased productivity, inaccurate insights, or inability to complete essential services.

Incorrect/inaccurate data
● Description: any data that is complete but inaccurate.
● Possible causes: human error during data input, fake information, or mock data.
● Potential harm to businesses: inaccurate insights or decision-making based on bad information, resulting in revenue loss.

Inconsistent data
● Description: any data that uses different formats to represent the same thing.
● Possible causes: data stored incorrectly or errors inserted during data transfer.
● Potential harm to businesses: contradictory data points leading to confusion or an inability to classify or segment customers.

First steps toward clean data

Data Cleaning Essentials


● Always make a copy of your dataset before cleaning it, so you don't lose
any data accidentally.
● Remove irrelevant data that doesn't contribute to your analysis, such as
past members in a current member analysis.

Common Data Cleaning Techniques


● Delete duplicate entries to avoid inaccurate calculations and insights,
especially when combining data from multiple sources.
● Remove extra spaces and blanks as they can affect sorting, filtering, and
searching, leading to unexpected results.
● Fix misspellings, capitalization errors, and incorrect punctuation, as they
can impact data accuracy and communication efforts.
● Standardize formatting for a cleaner look and to ensure consistency,
especially when dealing with data from various sources (Data merging).
Common Data Cleaning Pitfalls

Continue cleaning data in spreadsheets

Data Cleaning Tools


● Conditional formatting (a spreadsheet tool that changes how cells
appear when values meet specific conditions) helps you visually identify
patterns and potential errors in your data. For instance, you can use it to
highlight blank cells, making it easier to spot missing information that
needs to be addressed.
● The "Remove duplicates" tool automatically searches for and eliminates
duplicate entries within your spreadsheet, ensuring data integrity.
● The "Split text to columns" tool is very helpful when you need to separate
data that's combined within a single cell. For example, you can use it to
split a cell containing a city, state, and zip code into separate columns.
Sometimes, numbers in your spreadsheet might be formatted as text,
leading to calculation errors. The "Split text to columns" tool can also help
resolve this issue by converting those text-formatted numbers into actual
numerical values.
● Concatenate: a function that joins multiple text strings into a single string

Optimize the data cleaning process (COUNTIF, LEN, RIGHT, LEFT, MID,
CONCATENATE, TRIM)
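A few illustrative formulas for these functions (the cell references are hypothetical; Google Sheets syntax):

=COUNTIF(A2:A100, ">100")   counts the cells in A2:A100 with values greater than 100
=LEN(A2)                    returns the number of characters in cell A2
=LEFT(A2, 3)                returns the first 3 characters of A2
=RIGHT(A2, 4)               returns the last 4 characters of A2
=MID(A2, 3, 2)              returns 2 characters of A2 starting at position 3
=CONCATENATE(A2, " ", B2)   joins A2 and B2 into one text string with a space between
=TRIM(A2)                   removes leading, trailing, and repeated spaces from A2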

Data Cleaning Techniques


● Sorting and Filtering: These tools help organize and customize data for
specific projects. Sorting arranges data alphabetically or numerically to
easily find information and identify duplicates. Filtering displays data that
meets specific criteria, useful for finding values above a certain number or
even/odd values.
● Pivot Tables: These summarize and reorganize data, providing a clear
view of specific aspects. They can sort, group, count, total, or average
data, helping to focus on the most relevant information for a project.
● VLOOKUP: This function searches for a specific value in a column and
returns corresponding information from another column, even across
different sheets or databases. This is useful when data is spread across
multiple locations.
=VLOOKUP(data to look up, ‘where to look’!Range, column, false)
eg: =VLOOKUP(A2, 'Sheet 2'!A1:B31, 2, FALSE)
● Data mapping: process of matching fields from one data source to
another
-> helps match fields between databases, addressing inconsistencies like
different formats for the same data
Cleaning Checklist
● Determine the size of the dataset
● Determine the number of categories or labels
● Identify missing data
● Identify unformatted data
● Explore the different data types

Review
Clean data: Data that is complete, correct, and relevant to the problem being
solved
Compatibility: How well two or more datasets are able to work together
CONCATENATE: A spreadsheet function that joins together two or more text
strings
Conditional formatting: A spreadsheet tool that changes how cells appear
when values meet specific conditions
Data engineer: A professional who transforms data into a useful format for
analysis and gives it a reliable infrastructure
Data mapping: The process of matching fields from one data source to another
Data merging: The process of combining two or more datasets into a single
dataset
Data validation: A tool for checking the accuracy and quality of data
Data warehousing specialist: A professional who develops processes and
procedures to effectively store and organize data
Delimiter: A character that indicates the beginning or end of a data item
Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem to be
solved
Duplicate data: Any record that inadvertently shares data with another record
Field length: A tool for determining how many characters can be keyed into a
spreadsheet field
Incomplete data: Data that is missing important fields
Inconsistent data: Data that uses different formats to represent the same thing
Incorrect/inaccurate data: Data that is complete but inaccurate
LEFT: A function that returns a set number of characters from the left side of a
text string
LEN: A function that returns the length of a text string by counting the number of
characters it contains
Length: The number of characters in a text string
Merger: An agreement that unites two organizations into a single new one
MID: A function that returns a segment from the middle of a text string
Null: An indication that a value does not exist in a dataset
Outdated data: Any data that has been superseded by newer and more
accurate information
Remove duplicates: A spreadsheet tool that automatically searches for and
eliminates duplicate entries from a spreadsheet
Split: A function that divides text around a specified character and puts each
fragment into a new, separate cell
Substring: A smaller subset of a text string
Text string: A group of characters within a cell, most often composed of letters
TRIM: A function that removes leading, trailing, and repeated spaces in data
Unique: A value that can’t have a duplicate

CHAPTER 3: Data cleaning with SQL


SQL for sparkling clean data

Learn basic SQL queries

Transforming data

CAST() function
● The CAST function converts data from one type to another (like string to
float).
● In the example, CAST(purchase_price AS FLOAT64) converts the
purchase price to a float, enabling accurate numerical sorting.

CONCAT() function
● CONCAT lets you combine text from different columns to create a single,
unique identifier.
● The video uses the example of combining product codes with color data to
analyze customer preferences.

COALESCE() function
● COALESCE helps you manage missing data (null values) in your tables.
● It lets you specify a backup column to pull information from if the preferred
column has missing values.
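A minimal sketch combining the three functions (the table and column names are hypothetical; BigQuery-style syntax):

SELECT
  CONCAT(product_code, '-', product_color) AS product_key,  -- CONCAT: build one unique identifier
  CAST(purchase_price AS FLOAT64) AS price,                  -- CAST: convert a string price to a float
  COALESCE(product_name, product_code) AS display_name       -- COALESCE: fall back to product_code when product_name is NULL
FROM
  customer_data.customer_purchase
ORDER BY
  price DESC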

Review
CAST: A SQL function that converts data from one datatype to another
COALESCE: A SQL function that returns non-null values in a list
CONCAT: A SQL function that adds strings together to create new text strings
that can be used as unique keys
DISTINCT: A keyword that is added to a SQL SELECT statement to retrieve only
non-duplicate entries
Float: A number that contains a decimal
Substring: A subset of a text string
Typecasting: Converting data from one type to another

CHAPTER 4: Verify and report on cleaning results
Manually cleaning data

Data Verification
● A process to confirm that a data-cleaning effort was well-executed and the
resulting data is accurate and reliable.
● It's crucial because it helps you catch errors that might skew your analysis
and lead to wrong conclusions.
The Power of Reporting
● Reporting on your data cleaning process ensures transparency and
accountability within your team.
● Tools like changelogs help track modifications made to the dataset over
time, fostering better communication and understanding among
collaborators.
Verification process
● Compare: Start by comparing your original, unclean dataset to your
cleaned dataset. Look for common problems like nulls or misspellings that
should have been fixed during cleaning.
● Use Tools: Utilize tools like conditional formatting, filters, or the FIND
function to efficiently check for inconsistencies.
● Take a big-picture view of your project:
○ Consider the business problem. Be certain that your data makes it
possible to solve the business problem.
○ Consider the goal of the project. For example, if the goal of collecting
feedback is to improve a product, you need to know whether the data
you’ve collected and cleaned will actually help your company achieve
that goal.
○ Consider whether your data is capable of solving the problem and
meeting the project objectives. That means thinking about where the
data came from and testing your data collection and cleaning process.
● Ask yourself: do the numbers make sense?

Verification technique
For spreadsheets:
● Find and Replace: This tool helps locate a specific term in your
spreadsheet and allows you to replace it with the correct term, ensuring
consistency.
● Pivot Tables: These tools summarize data, helping you quickly spot
inconsistencies. For example, by counting the occurrences of each
supplier name, you can easily see if there are any misspellings.
● Manual Correction: After identifying errors with Find and Replace or a
pivot table, you can manually correct the misspelled supplier names in
your spreadsheet.
For SQL:
● CASE Statement: When querying databases, you can use the CASE
statement to identify and correct misspellings. The statement checks for
specific conditions, like a misspelled name, and replaces the value with the
correct one.
SELECT
  customer_id,
  CASE
    WHEN first_name = 'Tnoy' THEN 'Tony'
    ELSE first_name
  END AS cleaned_name
FROM
  project-id.customer_data.customer_name

Verification checklist
● Sources of errors: Did you use the right tools and functions to find the
source of the errors in your dataset?
● Null data: Did you search for NULLs using conditional formatting and
filters?
● Misspelled words: Did you locate all misspellings?
● Mistyped numbers: Did you double-check that your numeric data has
been entered correctly?
● Extra spaces and characters: Did you remove any extra spaces or
characters using the TRIM function?
● Duplicates: Did you remove duplicates in spreadsheets using the
Remove Duplicates function or DISTINCT in SQL?
● Mismatched data types: Did you check that numeric, date, and string
data are typecast correctly?
● Messy (inconsistent) strings: Did you make sure that all of your strings
are consistent and meaningful?
● Messy (inconsistent) date formats: Did you format the dates consistently
throughout your dataset?
● Misleading variable labels (columns): Did you name your columns
meaningfully?
● Truncated data: Did you check for truncated or missing data that needs
correction?
● Business Logic: Did you check that the data makes sense given your
knowledge of the business?
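For the null-data and duplicates items above, a hedged SQL sketch (the email and customer_id columns are hypothetical; BigQuery-style syntax):

-- Count NULL values in a column that should be mandatory
SELECT
  COUNTIF(email IS NULL) AS null_emails
FROM
  customer_data.customer_name;

-- Find customer_id values that appear more than once
SELECT
  customer_id,
  COUNT(*) AS occurrences
FROM
  customer_data.customer_name
GROUP BY
  customer_id
HAVING
  COUNT(*) > 1;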

Document the cleaning process


Reason to document
- Recover data-cleaning errors
- Inform other users of changes
- Determine quality of data
Clarification
The instructor stated that the first two benefits of documentation (1: recalling the
errors that were cleaned, and 2: informing others of the changes) assume that
the data errors aren't fixable. She then added that when the data errors are
fixable, the documentation needs to record how the data was fixed. Data-cleaning
documentation is important in both cases.
Review
CASE: A SQL statement that returns records that meet conditions by including
an if/then statement in a query
Changelog: A file containing a chronologically ordered list of modifications made
to a project
COUNTA: A spreadsheet function that counts the total number of values within a
specified range
Find and replace: A tool that finds a specified search term and replaces it with
something else
Verification: A process to confirm that a data-cleaning effort was well executed
and the resulting data is accurate and reliable
CHAPTER 5: Add data to your resume
The data analyst hiring process
Key elements of a data professional’s resume
The importance of diversity on a DA team
Highlight your skills and experience

Explore areas of interest
