4. Process Data from Dirty to Clean
Focus on integrity
Data type: Values must be of a certain type (date, number, percentage, Boolean, etc.). Example: if the data type is a date, a single number like 30 would fail the constraint and be invalid.
Data range: Values must fall between predefined maximum and minimum values. Example: if the data range is 10-20, a value of 30 would fail the constraint and be invalid.
Mandatory: Values can't be left blank or empty. Example: if age is mandatory, that value must be filled in.
Unique: Values can't have a duplicate. Example: two people can't have the same mobile phone number within the same service area.
Regular expression (regex) patterns: Values must match a prescribed pattern. Example: a phone number must match ###-###-#### (no other characters allowed).
Cross-field validation: Certain conditions for multiple fields must be satisfied. Example: values are percentages, and values from multiple fields must add up to 100%.
Primary-key: (Databases only) Value must be unique per column. Example: a database table can't have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.
Set-membership: (Databases only) Values for a column must come from a set of discrete values. Example: the value for a column must be set to Yes, No, or Not Applicable.
Foreign-key: (Databases only) Values for a column must be unique values coming from a column in another table. Example: in a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate States table.
Accuracy: The degree to which the data conforms to the actual entity being measured or described. Example: if values for zip codes are validated by street location, the accuracy of the data goes up.
Completeness: The degree to which the data contains all desired components or measures. Example: if data for personal profiles required hair and eye color and both are collected, the data is complete.
Consistency: The degree to which the data is repeatable from different points of entry or collection. Example: if a customer has the same address in the sales and repair databases, the data is consistent.
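Several of these constraints can be declared directly in a database schema. A minimal sketch, using made-up table and column names and standard SQL syntax (exact constraint support varies by database; warehouses like BigQuery enforce fewer of these):

CREATE TABLE customers (
  customer_id  INT PRIMARY KEY,                                                -- primary-key: unique identifier per row
  signup_date  DATE NOT NULL,                                                  -- data type + mandatory
  email        VARCHAR(255) NOT NULL UNIQUE,                                   -- mandatory + unique
  score        INT CHECK (score BETWEEN 10 AND 20),                            -- data range
  status       VARCHAR(20) CHECK (status IN ('Yes', 'No', 'Not Applicable')),  -- set-membership
  state_code   CHAR(2) REFERENCES states(state_code)                           -- foreign-key: must match a value in the States table
);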
Note:
● Before analyzing data, it's crucial to verify its integrity and relevance to the
business objective. Data that hasn't been cleaned or is misaligned with the
objective can lead to inaccurate conclusions.
● When faced with limited data, explore alternative data sources, adjust your
analysis approach, or collaborate with data engineers to gather additional
information. For example, you could track new data points to gain a more
comprehensive understanding.
Data issue 1: No data
● Possible solution: Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. Example: if you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.
● Possible solution: If there isn't time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround. Example: if you are analyzing peak travel times for commuters but don't have the data for a particular city, use the data from another city with a similar size and demographic.
Data issue 2: Too little data
● Possible solution: Do the analysis using proxy data along with actual data. Example: if you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors.
● Possible solution: Adjust your analysis to align with the data you already have. Example: if you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only.
Data issue 3: Wrong data, including data with errors
● Possible solution: If you have the wrong data because requirements were misunderstood, communicate the requirements again. Example: if you need the data for female voters and received the data for male voters, restate your needs.
● Possible solution: Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. Example: if your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.
● Possible solution: If you can't correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won't cause systematic bias. Example: if your dataset was translated from a different language and some of the translations don't make sense, ignore the data with bad translation and go ahead with the analysis of the other data.
- Sample size technique
Issue: Sampling bias occurs when the chosen sample doesn't accurately represent the whole population, leading to skewed results.
Solution: use random sampling, so every member of the population has an equal chance of being selected.
- Terminology:
+ Population: the entire group you care about
+ Sample: the part of the population that is actually measured
+ Margin of error: the maximum amount the sample's results are expected to differ from the population's results
+ Confidence level: how confident you can be that the sample reflects the population (commonly 95% or 99%)
+ Confidence interval: the range of values that the population's result is expected to fall within
+ Statistical significance: the determination of whether a result is likely real rather than due to random chance
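For a rough sense of how sample size drives margin of error, the standard formula for a proportion (assuming a 95% confidence level and the conservative case p = 0.5) is MOE = z x sqrt(p(1 - p) / n). With z ≈ 1.96 and n = 1,000, MOE ≈ 1.96 x sqrt(0.25 / 1000) ≈ 3.1%; with n = 100, MOE ≈ 9.8%.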
Statistical Power
● Statistical power is the probability that a test will detect a meaningful, real effect, i.e., that its results are reliable rather than due to random chance.
○ Hypothesis testing is a way to see if a survey or experiment has meaningful results.
● A higher statistical power increases confidence in the results; a power of 0.8 (80%) or higher is generally considered strong enough to treat the results as statistically meaningful.
Factors Affecting Statistical Power
● Sample size plays a crucial role in statistical power, with larger samples
generally leading to greater statistical power.
● Various factors, such as external influences or biases within the sample, can impact the accuracy of the results and should be carefully considered when designing tests or studies.
Proxy Data
● Proxy data serves as a valuable tool when direct data related to a business
objective is unavailable, allowing for estimations and predictions.
● For instance, when a new car model launches, an analyst might use the
number of clicks on the car's specifications on the dealership's website as
a proxy for potential sales.
Open Datasets as Proxy Data Sources
● Open or public datasets, accessible online, can be valuable sources of
proxy data.
● For example, a clinic might use an open dataset from a trial of a vaccine
injection to estimate potential contraindications for a newly available nasal
version of the vaccine.
Working with Open Datasets
● Kaggle, a platform for data science, offers a wide array of datasets in
various formats, including CSV, JSON, SQLite, and BigQuery.
● When using open datasets, exercise caution and check for duplicate data
and null values, as their interpretation can significantly impact analysis.
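As a quick sanity check on an imported open dataset, queries along these lines can surface duplicates and nulls (table and column names here are hypothetical; the syntax is standard SQL):

-- Duplicates: combinations of key columns that appear more than once
SELECT trial_id, participant_id, COUNT(*) AS times_seen
FROM vaccine_trial_data
GROUP BY trial_id, participant_id
HAVING COUNT(*) > 1;

-- Nulls: how many rows are missing a value the analysis depends on
SELECT COUNT(*) AS missing_outcomes
FROM vaccine_trial_data
WHERE outcome IS NULL;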
Duplicate data
Description: Any data record that shows up more than once
Possible causes: Manual data entry, batch data imports, or data migration
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval
Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information
Possible causes: People changing roles or companies, or software and systems becoming obsolete
Potential harm to businesses: Inaccurate insights, decision-making, and analytics
Incomplete data
Description: Any data that is missing important fields
Possible causes: Improper data collection or incorrect data entry
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services
Incorrect/inaccurate data
Description: Any data that is complete but inaccurate
Possible causes: Human error inserted during data input, fake information, or mock data
Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss
Inconsistent data
Description: Any data that uses different formats to represent the same thing
Possible causes: Data stored incorrectly or errors inserted during data transfer
Potential harm to businesses: Contradictory data points leading to confusion or inability to classify or segment customers
Optimize the data cleaning process (COUNTIF, LEN, RIGHT, LEFT, MID,
CONCATENATE, TRIM)
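These are spreadsheet functions; since the later examples in these notes use SQL, here is a rough SQL sketch of the same kind of string cleanup (supplier_data and its columns are made-up names; TRIM, LENGTH, SUBSTR, and CONCAT are standard SQL string functions):

SELECT
  TRIM(supplier_name)              AS supplier_name_trimmed,  -- like TRIM: strip leading/trailing spaces
  LENGTH(product_code)             AS code_length,            -- like LEN: count characters
  SUBSTR(product_code, 1, 4)       AS product_prefix,         -- like LEFT/MID: take part of a string
  CONCAT(product_code, '-', color) AS product_color_key       -- like CONCATENATE: join strings
FROM
  supplier_data;
-- A conditional count like spreadsheet COUNTIF can be done with COUNT(*) plus a WHERE clause
-- (or the COUNTIF aggregate in BigQuery).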
Review
Clean data: Data that is complete, correct, and relevant to the problem being
solved
Compatibility: How well two or more datasets are able to work together
CONCATENATE: A spreadsheet function that joins together two or more text
strings
Conditional formatting: A spreadsheet tool that changes how cells appear
when values meet specific conditions
Data engineer: A professional who transforms data into a useful format for
analysis and gives it a reliable infrastructure
Data mapping: The process of matching fields from one data source to another
Data merging: The process of combining two or more datasets into a single
dataset
Data validation: A tool for checking the accuracy and quality of data
Data warehousing specialist: A professional who develops processes and
procedures to effectively store and organize data
Delimiter: A character that indicates the beginning or end of a data item
Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem to be
solved
Duplicate data: Any record that inadvertently shares data with another record
Field length: A tool for determining how many characters can be keyed into a
spreadsheet field
Incomplete data: Data that is missing important fields
Inconsistent data: Data that uses different formats to represent the same thing
Incorrect/inaccurate data: Data that is complete but inaccurate
LEFT: A function that returns a set number of characters from the left side of a
text string
LEN: A function that returns the length of a text string by counting the number of
characters it contains
Length: The number of characters in a text string
Merger: An agreement that unites two organizations into a single new one
MID: A function that returns a segment from the middle of a text string
Null: An indication that a value does not exist in a dataset
Outdated data: Any data that has been superseded by newer and more
accurate information
Remove duplicates: A spreadsheet tool that automatically searches for and
eliminates duplicate entries from a spreadsheet
Split: A function that divides text around a specified character and puts each
fragment into a new, separate cell
Substring: A smaller subset of a text string
Text string: A group of characters within a cell, most often composed of letters
TRIM: A function that removes leading, trailing, and repeated spaces in data
Unique: A value that can’t have a duplicate
Transforming data
CAST() function
● The CAST function converts data from one type to another (like string to
float).
● In the example, CAST(purchase_price AS FLOAT64) converts the
purchase price to a float, enabling accurate numerical sorting.
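A minimal sketch of that query (the table name is an assumption; FLOAT64 is BigQuery's float type, and other databases use FLOAT or DECIMAL instead):

SELECT
  CAST(purchase_price AS FLOAT64) AS purchase_price_numeric  -- convert the string value to a float
FROM
  customer_purchase
ORDER BY
  purchase_price_numeric DESC;  -- sorts by numeric value instead of character by character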
CONCAT() function
● CONCAT lets you combine text from different columns to create a single,
unique identifier.
● The video uses the example of combining product codes with color data to
analyze customer preferences.
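A sketch of that pattern, with hypothetical column and table names:

SELECT
  CONCAT(product_code, '-', product_color) AS unique_product_id  -- one combined identifier per code/color pair
FROM
  furniture_products;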
COALESCE() function
● COALESCE helps you manage missing data (null values) in your tables.
● It lets you specify a backup column to pull information from if the preferred
column has missing values.
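A sketch, assuming product_name and product_code are both string columns in a hypothetical inventory_items table:

SELECT
  COALESCE(product_name, product_code) AS product_info  -- use product_name, fall back to product_code when the name is NULL
FROM
  inventory_items;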
Review
CAST: A SQL function that converts data from one datatype to another
COALESCE: A SQL function that returns non-null values in a list
CONCAT: A SQL function that adds strings together to create new text strings
that can be used as unique keys
DISTINCT: A keyword that is added to a SQL SELECT statement to retrieve only
non-duplicate entries
Float: A number that contains a decimal
Substring: A subset of a text string
Typecasting: Converting data from one type to another
Data Verification
● A process to confirm that a data-cleaning effort was well-executed and the
resulting data is accurate and reliable.
● It's crucial because it helps you catch errors that might skew your analysis
and lead to wrong conclusions.
The Power of Reporting
● Reporting on your data cleaning process ensures transparency and
accountability within your team.
● Tools like changelogs help track modifications made to the dataset over
time, fostering better communication and understanding among
collaborators.
Verification process
● Compare: Start by comparing your original, unclean dataset to your
cleaned dataset. Look for common problems like nulls or misspellings that
should have been fixed during cleaning.
● Use Tools: Utilize tools like conditional formatting, filters, or the FIND
function to efficiently check for inconsistencies.
● Taking a Big Picture view of your project
○ Consider the business problem. Be certain that your data actually makes it possible to solve the business problem.
○ Consider the goal of the project. For example, if the goal of collecting feedback is to make improvements to a product, you need to know whether the data you've collected and cleaned will help your company achieve that goal.
○ Consider whether your data is capable of solving the problem and meeting the project objectives. That means thinking about where the data came from and testing your data collection and cleaning process.
● Ask yourself: do the numbers make sense?
Verification technique
For spreadsheets:
● Find and Replace: This tool helps locate a specific term in your
spreadsheet and allows you to replace it with the correct term, ensuring
consistency.
● Pivot Tables: These tools summarize data, helping you quickly spot
inconsistencies. For example, by counting the occurrences of each
supplier name, you can easily see if there are any misspellings.
● Manual Correction: After identifying errors with Find and Replace or Pivot Tables, you can manually correct the misspelled supplier names directly in your spreadsheet.
For SQL:
● SQL CASE Statement: When querying databases, you can use the CASE
statement to identify and correct misspellings. This function checks for
specific conditions, like a misspelled name, and replaces it with the correct
value.
SELECT
  customer_id,
  CASE
    WHEN first_name = 'Tnoy' THEN 'Tony'  -- example condition: replace a known misspelling with the correct value
    ELSE first_name
  END AS cleaned_name
FROM
  `project-id.customer_data.customer_name`
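To confirm the correction worked, or to spot misspellings in the first place (the SQL analogue of the pivot-table count described above), a grouped count of each spelling helps; the table here is the same one used in the query above:

SELECT
  first_name,
  COUNT(*) AS name_count  -- one row per distinct spelling, with how often it appears
FROM
  `project-id.customer_data.customer_name`
GROUP BY
  first_name
ORDER BY
  name_count DESC;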
Verification checklist
● Sources of errors: Did you use the right tools and functions to find the
source of the errors in your dataset?
● Null data: Did you search for NULLs using conditional formatting and
filters?
● Misspelled words: Did you locate all misspellings?
● Mistyped numbers: Did you double-check that your numeric data has
been entered correctly?
● Extra spaces and characters: Did you remove any extra spaces or
characters using the TRIM function?
● Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL? (A short SQL sketch follows this checklist.)
● Mismatched data types: Did you check that numeric, date, and string
data are typecast correctly?
● Messy (inconsistent) strings: Did you make sure that all of your strings
are consistent and meaningful?
● Messy (inconsistent) date formats: Did you format the dates consistently
throughout your dataset?
● Misleading variable labels (columns): Did you name your columns
meaningfully?
● Truncated data: Did you check for truncated or missing data that needs
correction?
● Business Logic: Did you check that the data makes sense given your
knowledge of the business?
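As referenced in the duplicates item above, a minimal SQL sketch covering a couple of these checks (supplier_data and supplier_name are hypothetical names):

SELECT DISTINCT
  TRIM(supplier_name) AS supplier_name_clean  -- TRIM strips extra spaces; DISTINCT drops exact duplicate values
FROM
  supplier_data;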