DA Notes
Dirty Data: Any data that requires cleaning or preparation before analysis. It
includes:
1. Missing Data:
o Example: Missing values in variables essential for analysis, like customer ages
when analyzing purchasing behavior.
2. Duplicate Data:
o Example: Multiple identical records due to merging data from different sources.
3. Inconsistent or Incorrect Data:
o Example: Structural errors, typos, or inconsistent naming, such as mixed labels
like "Pass/Fail" and "G/B" in the same dataset.
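As a rough illustration, the three kinds of dirty data above can be counted in a few lines of Python. This is only a sketch: the records, the field names (customer, age, result), and the choice of "Pass"/"Fail" as the expected labels are made up for the example.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
records = [
    {"customer": "Ana", "age": 34, "result": "Pass"},
    {"customer": "Ben", "age": None, "result": "Fail"},
    {"customer": "Ana", "age": 34, "result": "Pass"},  # exact duplicate of the first row
    {"customer": "Cy", "age": 51, "result": "G"},      # label from a different scheme
]

# 1. Missing data: count rows where an essential field is empty.
missing_age = sum(1 for r in records if r["age"] is None)

# 2. Duplicate data: count rows that repeat an earlier row exactly.
row_counts = Counter(tuple(sorted(r.items())) for r in records)
duplicate_rows = sum(n - 1 for n in row_counts.values())

# 3. Inconsistent data: labels that fall outside the expected scheme.
expected_labels = {"Pass", "Fail"}  # assumed scheme for this example
unexpected_labels = sorted({r["result"] for r in records} - expected_labels)

print(missing_age, duplicate_rows, unexpected_labels)
```

A count like this only flags the problems; deciding whether to delete, fill in, or relabel the affected rows is a separate step.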
"Garbage In, Garbage Out" (GIGO): Incorrect data leads to incorrect results.
Foundation for Analysis: Clean data supports meaningful, reliable analysis that holds up
over time, much as a strong foundation supports a house.
Cost of Dirty Data: Poor data practices can lead to significant long-term expenses.
Goal: Properly cleaned data is essential for extracting accurate and actionable
insights.
1. Remove Duplicates:
o Select all data.
o Create a table (Insert → Table).
o Go to Data → Remove Duplicates to delete duplicate rows.
2. Handle Missing Data:
o Remove Blank Rows:
  Select all data and sort by a column (A → Z or Z → A); sorting groups the blank
  rows together at the bottom of the range.
  Select the blank rows and delete them.
o Find and Remove Blank Cells:
  Select a specific column (e.g., column F).
  Apply a filter (Data → Filter).
  In the filter dropdown, uncheck "Select All" and check only "Blanks."
  Delete the rows that remain visible (those with a blank in that column), then
  clear the filter.
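The same cleaning steps can be sketched outside a spreadsheet. The Python below mirrors the steps above under stated assumptions: the rows, the column layout (name, city, amount), and the helper is_blank are hypothetical, and a cell counts as blank if it is None or an empty string.

```python
# Hypothetical rows as (name, city, amount); "" or None marks a blank cell.
rows = [
    ("Ana", "Lima", 120),
    ("Ben", "", 80),
    ("Ana", "Lima", 120),   # duplicate row
    (None, None, None),     # fully blank row
    ("Cy", "Quito", 95),
]


def is_blank(value):
    """Treat None and empty or whitespace-only strings as blank."""
    return value is None or (isinstance(value, str) and value.strip() == "")


# Step 1: remove duplicates, keeping the first occurrence and the original order.
deduped = list(dict.fromkeys(rows))

# Step 2a: remove rows in which every cell is blank.
non_blank_rows = [row for row in deduped if not all(is_blank(v) for v in row)]

# Step 2b: remove rows with a blank in one specific column (index 1, the city).
CITY = 1
cleaned = [row for row in non_blank_rows if not is_blank(row[CITY])]

print(cleaned)
```

Unlike the sort-based approach, filtering this way leaves the remaining rows in their original order.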
Measures of Central Tendency: Include the mean, median, and mode, which estimate the
middle or average values.
Measures of Variability: Include range, standard deviation, and variance, which describe the
spread or variability in the dataset.
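A small worked example, using Python's standard statistics module on made-up scores. Note one assumption: variance and stdev treat the values as a sample (dividing by n - 1); for a whole population, pvariance and pstdev would be used instead.

```python
import statistics

scores = [4, 8, 8, 5, 10]  # hypothetical values

# Measures of central tendency
mean = statistics.mean(scores)      # sum of values divided by their count
median = statistics.median(scores)  # middle value once the data is sorted
mode = statistics.mode(scores)      # most frequent value

# Measures of variability
value_range = max(scores) - min(scores)  # largest value minus smallest
variance = statistics.variance(scores)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)       # square root of the sample variance

print(mean, median, mode, value_range, variance, round(std_dev, 3))
```

Here the mean is 7 while the median and mode are both 8, which shows that the three measures of center need not agree.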
A pivot table summarizes large amounts of data by grouping rows into categories and
aggregating each group (e.g., by sum or average).
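What a pivot table computes can be sketched in plain Python: group the rows by a category, then aggregate each group. The sales figures and region names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows as (region, amount).
sales = [
    ("North", 100),
    ("South", 50),
    ("North", 300),
    ("South", 150),
    ("East", 80),
]

# Group the amounts by region.
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Aggregate each group by sum and by average.
pivot = {
    region: {"sum": sum(amounts), "average": sum(amounts) / len(amounts)}
    for region, amounts in groups.items()
}

for region, summary in pivot.items():
    print(region, summary["sum"], summary["average"])
```

Five rows collapse into one summary line per region, which is the same reduction a spreadsheet pivot table performs on much larger data.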