4. Process Data from Dirty to Clean
Focus on integrity
Data type: Values must be of a certain type (date, number, percentage, Boolean, etc.). Example: if the data type is a date, a single number like 30 would fail the constraint and be invalid.
Data range: Values must fall between predefined maximum and minimum values. Example: if the data range is 10-20, a value of 30 would fail the constraint and be invalid.
Mandatory: Values can't be left blank or empty. Example: if age is mandatory, that value must be filled in.
Unique: Values can't have a duplicate. Example: two people can't have the same mobile phone number within the same service area.
Regular expression (regex) patterns: Values must match a prescribed pattern. Example: a phone number must match ###-###-#### (no other characters allowed).
Cross-field validation: Certain conditions for multiple fields must be satisfied. Example: values are percentages, and values from multiple fields must add up to 100%.
Primary-key: (Databases only) Value must be unique per column. Example: a database table can't have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.
Set-membership: (Databases only) Values for a column must come from a set of discrete values. Example: the value for a column must be set to Yes, No, or Not Applicable.
Foreign-key: (Databases only) Values for a column must be unique values coming from a column in another table. Example: in a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate States table.
Accuracy: The degree to which the data conforms to the actual entity being measured or described. Example: if values for zip codes are validated by street location, the accuracy of the data goes up.
Completeness: The degree to which the data contains all desired components or measures. Example: if data for personal profiles required hair and eye color and both are collected, the data is complete.
Consistency: The degree to which the data is repeatable from different points of entry or collection. Example: if a customer has the same address in the sales and repair databases, the data is consistent.
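Several of these constraints can be declared directly in a database schema. A minimal sketch, using made-up table and column names and standard SQL syntax (exact constraint support varies by database; warehouses like BigQuery enforce fewer of these):

CREATE TABLE customers (
  customer_id  INT PRIMARY KEY,                                                -- primary-key: unique identifier per row
  signup_date  DATE NOT NULL,                                                  -- data type + mandatory
  email        VARCHAR(255) NOT NULL UNIQUE,                                   -- mandatory + unique
  score        INT CHECK (score BETWEEN 10 AND 20),                            -- data range
  status       VARCHAR(20) CHECK (status IN ('Yes', 'No', 'Not Applicable')),  -- set-membership
  state_code   CHAR(2) REFERENCES states(state_code)                           -- foreign-key: must match a value in the States table
);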
Note:
● Before analyzing data, it's crucial to verify its integrity and relevance to the
business objective. Data that hasn't been cleaned or is misaligned with the
objective can lead to inaccurate conclusions.
● When faced with limited data, explore alternative data sources, adjust your
analysis approach, or collaborate with data engineers to gather additional
information. For example, you could track new data points to gain a more
comprehensive understanding.
Data issue 1: No data
● Possible solution: Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. Example: if you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.
● Possible solution: If there isn't time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround. Example: if you are analyzing peak travel times for commuters but don't have the data for a particular city, use the data from another city with a similar size and demographic.
Data issue 2: Too little data
● Possible solution: Do the analysis using proxy data along with actual data. Example: if you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors.
● Possible solution: Adjust your analysis to align with the data you already have. Example: if you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only.
Data issue 3: Wrong data, including data with errors
● Possible solution: If you have the wrong data because requirements were misunderstood, communicate the requirements again. Example: if you need the data for female voters and received the data for male voters, restate your needs.
● Possible solution: Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. Example: if your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.
● Possible solution: If you can't correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won't cause systematic bias. Example: if your dataset was translated from a different language and some of the translations don't make sense, ignore the data with bad translation and go ahead with the analysis of the other data.
- Sample size technique
Issue: Sampling bias occurs when the chosen sample doesn't accurately represent the whole population, leading to skewed results.
Solution: use random sampling, so every member of the population has an equal chance of being selected.
- Terminology:
+ Population: the entire group you care about
+ Sample: the part of the population that is actually measured
+ Margin of error: the maximum amount the sample's results are expected to differ from the population's results
+ Confidence level: how confident you can be that the sample reflects the population (commonly 95% or 99%)
+ Confidence interval: the range of values that the population's result is expected to fall within
+ Statistical significance: the determination of whether a result is likely real rather than due to random chance
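For a rough sense of how sample size drives margin of error, the standard formula for a proportion (assuming a 95% confidence level and the conservative case p = 0.5) is MOE = z x sqrt(p(1 - p) / n). With z ≈ 1.96 and n = 1,000, MOE ≈ 1.96 x sqrt(0.25 / 1000) ≈ 3.1%; with n = 100, MOE ≈ 9.8%.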
Statistical Power
● Statistical power is the probability that a test will detect a meaningful, real effect, i.e., that its results are reliable rather than due to random chance.
○ Hypothesis testing is a way to see if a survey or experiment has meaningful results.
● A higher statistical power increases confidence in the results; a power of 0.8 (80%) or higher is generally considered strong enough to treat the results as statistically meaningful.
Factors Affecting Statistical Power
● Sample size plays a crucial role in statistical power, with larger samples
generally leading to greater statistical power.
● Various factors, such as external influences or biases within the sample, can impact the accuracy of the results and should be carefully considered when designing tests or studies.
Proxy Data
● Proxy data serves as a valuable tool when direct data related to a business
objective is unavailable, allowing for estimations and predictions.
● For instance, when a new car model launches, an analyst might use the
number of clicks on the car's specifications on the dealership's website as
a proxy for potential sales.
Open Datasets as Proxy Data Sources
● Open or public datasets, accessible online, can be valuable sources of
proxy data.
● For example, a clinic might use an open dataset from a trial of a vaccine
injection to estimate potential contraindications for a newly available nasal
version of the vaccine.
Working with Open Datasets
● Kaggle, a platform for data science, offers a wide array of datasets in
various formats, including CSV, JSON, SQLite, and BigQuery.
● When using open datasets, exercise caution and check for duplicate data
and null values, as their interpretation can significantly impact analysis.
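As a quick sanity check on an imported open dataset, queries along these lines can surface duplicates and nulls (table and column names here are hypothetical; the syntax is standard SQL):

-- Duplicates: combinations of key columns that appear more than once
SELECT trial_id, participant_id, COUNT(*) AS times_seen
FROM vaccine_trial_data
GROUP BY trial_id, participant_id
HAVING COUNT(*) > 1;

-- Nulls: how many rows are missing a value the analysis depends on
SELECT COUNT(*) AS missing_outcomes
FROM vaccine_trial_data
WHERE outcome IS NULL;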
Duplicate data
Description: Any data record that shows up more than once
Possible causes: Manual data entry, batch data imports, or data migration
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval
Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information
Possible causes: People changing roles or companies, or software and systems becoming obsolete
Potential harm to businesses: Inaccurate insights, decision-making, and analytics
Incomplete data
Description: Any data that is missing important fields
Possible causes: Improper data collection or incorrect data entry
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services
Incorrect/inaccurate data
Description: Any data that is complete but inaccurate
Possible causes: Human error inserted during data input, fake information, or mock data
Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss
Inconsistent data
Description: Any data that uses different formats to represent the same thing
Possible causes: Data stored incorrectly or errors inserted during data transfer
Potential harm to businesses: Contradictory data points leading to confusion or inability to classify or segment customers
Optimize the data cleaning process (COUNTIF, LEN, RIGHT, LEFT, MID,
CONCATENATE, TRIM)
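These are spreadsheet functions; since the later examples in these notes use SQL, here is a rough SQL sketch of the same kind of string cleanup (supplier_data and its columns are made-up names; TRIM, LENGTH, SUBSTR, and CONCAT are standard SQL string functions):

SELECT
  TRIM(supplier_name)              AS supplier_name_trimmed,  -- like TRIM: strip leading/trailing spaces
  LENGTH(product_code)             AS code_length,            -- like LEN: count characters
  SUBSTR(product_code, 1, 4)       AS product_prefix,         -- like LEFT/MID: take part of a string
  CONCAT(product_code, '-', color) AS product_color_key       -- like CONCATENATE: join strings
FROM
  supplier_data;
-- A conditional count like spreadsheet COUNTIF can be done with COUNT(*) plus a WHERE clause
-- (or the COUNTIF aggregate in BigQuery).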
Review
Clean data: Data that is complete, correct, and relevant to the problem being
solved
Compatibility: How well two or more datasets are able to work together
CONCATENATE: A spreadsheet function that joins together two or more text
strings
Conditional formatting: A spreadsheet tool that changes how cells appear
when values meet specific conditions
Data engineer: A professional who transforms data into a useful format for
analysis and gives it a reliable infrastructure
Data mapping: The process of matching fields from one data source to another
Data merging: The process of combining two or more datasets into a single
dataset
Data validation: A tool for checking the accuracy and quality of data
Data warehousing specialist: A professional who develops processes and
procedures to effectively store and organize data
Delimiter: A character that indicates the beginning or end of a data item
Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem to be
solved
Duplicate data: Any record that inadvertently shares data with another record
Field length: A tool for determining how many characters can be keyed into a
spreadsheet field
Incomplete data: Data that is missing important fields
Inconsistent data: Data that uses different formats to represent the same thing
Incorrect/inaccurate data: Data that is complete but inaccurate
LEFT: A function that returns a set number of characters from the left side of a
text string
LEN: A function that returns the length of a text string by counting the number of
characters it contains
Length: The number of characters in a text string
Merger: An agreement that unites two organizations into a single new one
MID: A function that returns a segment from the middle of a text string
Null: An indication that a value does not exist in a dataset
Outdated data: Any data that has been superseded by newer and more
accurate information
Remove duplicates: A spreadsheet tool that automatically searches for and
eliminates duplicate entries from a spreadsheet
Split: A function that divides text around a specified character and puts each
fragment into a new, separate cell
Substring: A smaller subset of a text string
Text string: A group of characters within a cell, most often composed of letters
TRIM: A function that removes leading, trailing, and repeated spaces in data
Unique: A value that can’t have a duplicate
Transforming data
CAST() function
● The CAST function converts data from one type to another (like string to
float).
● In the example, CAST(purchase_price AS FLOAT64) converts the
purchase price to a float, enabling accurate numerical sorting.
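A minimal sketch of that query (the table name is an assumption; FLOAT64 is BigQuery's float type, and other databases use FLOAT or DECIMAL instead):

SELECT
  CAST(purchase_price AS FLOAT64) AS purchase_price_numeric  -- convert the string value to a float
FROM
  customer_purchase
ORDER BY
  purchase_price_numeric DESC;  -- sorts by numeric value instead of character by character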
CONCAT() function
● CONCAT lets you combine text from different columns to create a single,
unique identifier.
● The video uses the example of combining product codes with color data to
analyze customer preferences.
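A sketch of that pattern, with hypothetical column and table names:

SELECT
  CONCAT(product_code, '-', product_color) AS unique_product_id  -- one combined identifier per code/color pair
FROM
  furniture_products;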
COALESCE() function
● COALESCE helps you manage missing data (null values) in your tables.
● It lets you specify a backup column to pull information from if the preferred
column has missing values.
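A sketch, assuming product_name and product_code are both string columns in a hypothetical inventory_items table:

SELECT
  COALESCE(product_name, product_code) AS product_info  -- use product_name, fall back to product_code when the name is NULL
FROM
  inventory_items;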
Review
CAST: A SQL function that converts data from one datatype to another
COALESCE: A SQL function that returns non-null values in a list
CONCAT: A SQL function that adds strings together to create new text strings
that can be used as unique keys
DISTINCT: A keyword that is added to a SQL SELECT statement to retrieve only
non-duplicate entries
Float: A number that contains a decimal
Substring: A subset of a text string
Typecasting: Converting data from one type to another
Data Verification
● A process to confirm that a data-cleaning effort was well-executed and the
resulting data is accurate and reliable.
● It's crucial because it helps you catch errors that might skew your analysis
and lead to wrong conclusions.
The Power of Reporting
● Reporting on your data cleaning process ensures transparency and
accountability within your team.
● Tools like changelogs help track modifications made to the dataset over
time, fostering better communication and understanding among
collaborators.
Verification process
● Compare: Start by comparing your original, unclean dataset to your
cleaned dataset. Look for common problems like nulls or misspellings that
should have been fixed during cleaning.
● Use Tools: Utilize tools like conditional formatting, filters, or the FIND
function to efficiently check for inconsistencies.
● Taking a Big Picture view of your project
○ Consider the business problem. Be certain that your data actually makes it possible to solve the business problem.
○ Consider the goal of the project. For example, if the goal of collecting feedback is to make improvements to a product, you need to know whether the data you've collected and cleaned will help your company achieve that goal.
○ Consider whether your data is capable of solving the problem and meeting the project objectives. That means thinking about where the data came from and testing your data collection and cleaning process.
● Ask yourself: do the numbers make sense?
Verification technique
For spreadsheets:
● Find and Replace: This tool helps locate a specific term in your
spreadsheet and allows you to replace it with the correct term, ensuring
consistency.
● Pivot Tables: These tools summarize data, helping you quickly spot
inconsistencies. For example, by counting the occurrences of each
supplier name, you can easily see if there are any misspellings.
● Manual Correction: After identifying errors with Find and Replace or Pivot Tables, you can manually correct the misspelled supplier names directly in your spreadsheet.
For SQL:
● SQL CASE Statement: When querying databases, you can use the CASE
statement to identify and correct misspellings. This function checks for
specific conditions, like a misspelled name, and replaces it with the correct
value.
SELECT
  customer_id,
  CASE
    WHEN first_name = 'Tnoy' THEN 'Tony'  -- example condition: replace a known misspelling with the correct value
    ELSE first_name
  END AS cleaned_name
FROM
  `project-id.customer_data.customer_name`
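To confirm the correction worked, or to spot misspellings in the first place (the SQL analogue of the pivot-table count described above), a grouped count of each spelling helps; the table here is the same one used in the query above:

SELECT
  first_name,
  COUNT(*) AS name_count  -- one row per distinct spelling, with how often it appears
FROM
  `project-id.customer_data.customer_name`
GROUP BY
  first_name
ORDER BY
  name_count DESC;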
Verification checklist
● Sources of errors: Did you use the right tools and functions to find the
source of the errors in your dataset?
● Null data: Did you search for NULLs using conditional formatting and
filters?
● Misspelled words: Did you locate all misspellings?
● Mistyped numbers: Did you double-check that your numeric data has
been entered correctly?
● Extra spaces and characters: Did you remove any extra spaces or
characters using the TRIM function?
● Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL? (A short SQL sketch follows this checklist.)
● Mismatched data types: Did you check that numeric, date, and string
data are typecast correctly?
● Messy (inconsistent) strings: Did you make sure that all of your strings
are consistent and meaningful?
● Messy (inconsistent) date formats: Did you format the dates consistently
throughout your dataset?
● Misleading variable labels (columns): Did you name your columns
meaningfully?
● Truncated data: Did you check for truncated or missing data that needs
correction?
● Business Logic: Did you check that the data makes sense given your
knowledge of the business?
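As referenced in the duplicates item above, a minimal SQL sketch covering a couple of these checks (supplier_data and supplier_name are hypothetical names):

SELECT DISTINCT
  TRIM(supplier_name) AS supplier_name_clean  -- TRIM strips extra spaces; DISTINCT drops exact duplicate values
FROM
  supplier_data;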