Week 5 Assignme-WPS Office
Cleaning up a dataset before visualization is a critical process in data analysis, as it ensures the accuracy
and reliability of insights derived from the data. Below is a comprehensive guide on how to clean data,
including how to resolve anomalies. Let’s walk through the steps using an example dataset from
platforms like Kaggle or Google Dataset Search.
First, choose a dataset from Kaggle or Google Dataset Search. For example, you might download a CSV
file containing sales data, weather data, or customer reviews.
Next, inspect the dataset by loading it into Excel.
Open your dataset in Excel. Look at the first few rows to get an understanding of the data.
Check data types. Ensure that numbers are recognized as numbers, dates as dates, and text as text. You
can check this by selecting cells and reviewing the Number Format box on the Home tab.
- Highlight missing data: Use Excel’s Conditional Formatting to highlight blank cells.
Go to *Home > Conditional Formatting > Highlight Cells Rules > More Rules*, choose *Format only cells
that contain*, and select *Blanks* to highlight missing cells.
For numerical columns, fill missing values with the *mean* or *median*.
- Note that *Home > Fill > Down* copies the value from the cell above and *Fill > Series* extends a
sequence, so neither replaces a gap with the mean or median.
Instead, use Excel’s *AVERAGE()* or *MEDIAN()* function to calculate the value and enter it into the
blank cells manually:
=AVERAGE(B2:B100)
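If you want to cross-check the spreadsheet result, or script this step for a larger file, the same fill can be sketched in Python using only the standard library. The function name `fill_missing` and the sample values are illustrative assumptions, not part of the assignment.

```python
from statistics import mean, median

def fill_missing(values, strategy="median"):
    """Replace None entries with the mean or median of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to compute a fill value from
    fill = mean(present) if strategy == "mean" else median(present)
    return [fill if v is None else v for v in values]

# Hypothetical sales column with one blank cell
sales = [10.0, None, 20.0, 90.0]
filled_with_mean = fill_missing(sales, "mean")
filled_with_median = fill_missing(sales, "median")
```

The median is usually the safer choice when the column contains extreme values, because a single large number pulls the mean much further than it pulls the median.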
To remove duplicate rows, go to *Data > Remove Duplicates*, select the columns you want to check for
duplicates, and confirm the removal.
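The duplicate check can also be expressed as a short script. This is a minimal sketch, assuming each row is a dictionary and that, like Excel, you want to keep the first occurrence of each duplicate; the column names and the helper `remove_duplicates` are hypothetical.

```python
def remove_duplicates(rows, key_columns):
    """Keep the first row for each distinct combination of the key columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical rows: the second and third share the same order_id and customer
rows = [
    {"order_id": 1, "customer": "Ann", "amount": 50},
    {"order_id": 2, "customer": "Bob", "amount": 75},
    {"order_id": 2, "customer": "Bob", "amount": 80},
]
deduplicated = remove_duplicates(rows, ["order_id", "customer"])
```

Which columns you pass as the key matters: checking only `order_id` and `customer` treats the last two rows as duplicates even though their amounts differ.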
4. *Handle Outliers*
Use *Conditional Formatting* to highlight values that are too high or too low:
Go to *Home > Conditional Formatting > Highlight Cells Rules > Greater Than/Less Than* and specify
thresholds.
For visualization, you can filter out extreme outliers manually by applying filters to the data range.
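The steps above leave the choice of thresholds to you. One common convention, which is an assumption here and not something the assignment prescribes, is the 1.5 × IQR rule: flag anything more than 1.5 interquartile ranges below the first quartile or above the third. A minimal Python sketch follows; the helper `iqr_bounds` and the sample values are illustrative.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) thresholds based on the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical column with one suspiciously large value
values = [10, 12, 11, 13, 12, 11, 100]
low, high = iqr_bounds(values)
outliers = [v for v in values if v < low or v > high]
```

The resulting `low` and `high` can be typed into the Greater Than and Less Than rules. Quartile definitions differ slightly between tools, so Excel's quartile functions may give marginally different cut points on small samples.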
Date Format:
- Select the column with dates, right-click, and choose *Format Cells*. Choose the desired date
format, such as `YYYY-MM-DD`.
String Format:
For text, use *TRIM()* to remove leading/trailing spaces (it also collapses repeated spaces between
words) and *LOWER()* or *UPPER()* to standardize text case.
Example:
=TRIM(A2)
=LOWER(A2)
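For comparison, the same text cleanup can be sketched in Python. The function name `normalize_text` is hypothetical. Note one difference: `split()` treats tabs and other whitespace as separators, whereas Excel's *TRIM()* only handles the regular space character.

```python
def normalize_text(value):
    """Trim the ends, collapse repeated whitespace, and lowercase the text."""
    return " ".join(value.split()).lower()

# Hypothetical city names that should all be treated as the same value
cities = ["  New York", "new   york ", "NEW YORK"]
cleaned = [normalize_text(c) for c in cities]
```

After normalization the three spellings compare as equal, so they fall into a single category when you group or chart the data.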
Convert text to numbers: If numerical data is formatted as text, select the cells and click the warning
icon, then choose *Convert to Number*.
Convert text to dates: Use *Text to Columns* to convert text dates into actual date format.
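If the conversions need to be repeated on many files, they can be scripted. The sketch below is a minimal Python version under stated assumptions: numbers use a comma as the thousands separator, and dates arrive in one of the layouts listed in `DATE_FORMATS`. Both helpers and the format list are hypothetical and should be adapted to the actual dataset.

```python
from datetime import date, datetime

def to_number(text):
    """Parse text such as ' 1,234.50 ' into a float; return None if it is not numeric."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

# Assumed date layouts, tried in order
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def to_date(text):
    """Parse a text date using the first matching layout; return None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

price = to_number(" 1,234.50 ")
order_day = to_date("05/03/2024")
```

Returning `None` for values that cannot be parsed keeps them visible as missing data, which is safer than silently converting them to zero.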
7. *Categorize Data*
Use the *IF()* function or *VLOOKUP()* to categorize data into bins or groups.
=IF(B2<100,"low",IF(B2<500,"medium","high"))
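The nested *IF()* above gets harder to read as the number of bins grows. As an illustration, the same binning can be written in Python with a list of bin edges; the names `EDGES`, `LABELS`, and `categorize` are hypothetical, and the edges 100 and 500 are taken from the formula.

```python
from bisect import bisect_right

EDGES = [100, 500]                  # upper bounds of "low" and "medium"
LABELS = ["low", "medium", "high"]  # one more label than there are edges

def categorize(value):
    """Return the label of the bin containing value, matching the nested IF logic."""
    return LABELS[bisect_right(EDGES, value)]

amounts = [99, 100, 499, 500]
categories = [categorize(a) for a in amounts]
```

Adding another bin only requires one more edge and one more label, which is roughly what a sorted lookup table with *VLOOKUP()* achieves in Excel.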
- Select columns or rows that are unnecessary for analysis, right-click, and choose *Delete* to remove
them. Pressing the Delete key only clears the cell contents.
Apply a *log transformation* or other transformations using functions like *LOG()* in Excel. *LOG()*
defaults to base 10 (use *LN()* for the natural log) and only accepts positive values.
=LOG(B2)
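A minimal Python sketch of the same transformation is shown here, using base 10 to match the default of *LOG()*. The helper name `log_transform` is hypothetical, and returning `None` for zero or negative inputs is one possible choice, not the only one.

```python
import math

def log_transform(values):
    """Return base-10 logs; non-positive values have no logarithm and become None."""
    return [math.log10(v) if v > 0 else None for v in values]

# Hypothetical skewed column spanning several orders of magnitude
revenue = [1, 100, 10000, 0]
transformed = log_transform(revenue)
```

The transformation compresses large values much more than small ones, which is why it helps when a few very large numbers would otherwise dominate a chart.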
Go to *Data > Data Validation* and set validation rules (e.g., ensuring numerical ranges or valid
dates).
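The same kind of rule can be checked in a script before the data reaches Excel. This is a minimal sketch, assuming an amount that must fall between 0 and 10000 and a date written as `YYYY-MM-DD`; the function `validate_row`, the limits, and the field names are all illustrative assumptions.

```python
from datetime import datetime

def validate_row(amount, order_date, low=0, high=10000):
    """Return a list of problems found in one row (an empty list means the row is valid)."""
    problems = []
    if not isinstance(amount, (int, float)) or not (low <= amount <= high):
        problems.append("amount out of range")
    try:
        datetime.strptime(order_date, "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("invalid date")
    return problems

good_row = validate_row(250, "2024-02-29")
bad_row = validate_row(-5, "2024-13-01")
```

Collecting every problem per row, instead of stopping at the first one, makes it easier to fix all issues in a single pass.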
After cleaning your data, you can now create meaningful visualizations in Excel: