
FBA Module 3

The document provides an overview of data preprocessing and analysis, emphasizing its importance in transforming raw data into a structured format for machine learning and analysis. It outlines key steps in data preprocessing, including data collection, cleaning, transformation, and handling missing values and outliers. Additionally, it discusses techniques for data cleaning and analysis in Excel, including removing duplicates, handling inconsistencies, and using functions for data manipulation.

Uploaded by nilsa.vp

Module 3: Data Preparation and Analysis

Introduction to Data Preprocessing

• Data preprocessing is a crucial step in data analysis and machine learning, ensuring that raw data is transformed into a clean and structured format suitable for analysis. It helps improve model accuracy and reliability by handling missing values, outliers, and inconsistencies.
Steps in Data Preprocessing
1. Data Collection – Gathering raw data from various sources.
2. Data Cleaning – Handling missing values, outliers, and duplicate data.
3. Data Transformation – Scaling, encoding categorical variables, and feature engineering.
4. Data Reduction – Reducing dimensionality through techniques like PCA.
5. Data Integration – Merging data from multiple sources.
Handling Missing Data and Outliers
• Missing data can occur due to errors in data collection, equipment failure, or human mistakes. It can significantly impact the quality of analysis.
• Outliers are data points that significantly differ from the rest of the dataset and can distort statistical analysis.
Types of Missing Data
• MCAR (Missing Completely at Random): The missingness has no relationship with any values, observed or missing. Data is missing randomly, with no pattern.
  Example: A survey skips some questions due to a system error.
• MAR (Missing at Random): Missing data depends on observed values but not the missing ones. Data is missing due to a known factor but not because of the missing value itself.
  Example: Students with low attendance tend to have missing test scores. The missing data is linked to attendance, not the scores.
• MNAR (Missing Not at Random): The missing data depends on unobserved values. Data is missing for a reason related to the missing value itself.
  Example: Wealthy people avoid answering income-related survey questions because they don't want to disclose their earnings.
• In short: MCAR = pure chance, MAR = related to other data, MNAR = missing for a hidden reason.
Deletion Methods
• Listwise Deletion – Removes entire rows with missing values.
• Pairwise Deletion – Uses available data without deleting entire rows.
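
The difference between the two deletion methods can be seen in a small sketch. The records and column names below are hypothetical, and `None` stands in for a missing value; this is one plain-Python way to illustrate the idea, not a prescribed implementation.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 40000, "score": 7.0},
    {"age": 31, "income": None, "score": 8.5},
    {"age": None, "income": 52000, "score": 6.0},
    {"age": 40, "income": 61000, "score": None},
]

# Listwise deletion: keep only rows with no missing value in any column.
listwise = [r for r in rows if all(v is not None for v in r.values())]

def available_mean(column):
    """Pairwise-style statistic: use every observed value of one column."""
    return mean(r[column] for r in rows if r[column] is not None)

def pairwise_pairs(col_a, col_b):
    """Rows usable for an analysis that involves only col_a and col_b."""
    return [(r[col_a], r[col_b]) for r in rows
            if r[col_a] is not None and r[col_b] is not None]

print(len(listwise))                        # 1 complete row survives
print(available_mean("age"))                # mean of 25, 31, 40
print(len(pairwise_pairs("age", "score")))  # 2 rows have both values
```

Listwise deletion leaves a single row here, while the pairwise approach still uses three ages for the mean and two rows for an age/score comparison.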
Techniques to Handle Missing Data
Imputation Techniques
• Mean/Median/Mode Imputation – Replaces missing values with the mean, median, or mode of the feature.
• Regression Imputation – Predicts missing values using regression models.
• K-Nearest Neighbors (KNN) Imputation – Replaces missing values based on similar observations.
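
As a minimal sketch of the simplest of these techniques, the function below fills gaps with the mean, median, or mode of the observed values. The `impute` helper and the `ages` data are hypothetical names chosen for illustration; regression and KNN imputation need a fitted model and are not shown.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [22, None, 30, 22, None, 46]   # hypothetical feature with two gaps

print(impute(ages, "mean"))    # gaps filled with 30
print(impute(ages, "median"))  # gaps filled with 26.0
print(impute(ages, "mode"))    # gaps filled with 22
```

The median and mode are less sensitive to extreme values than the mean, which is one reason to prefer them for skewed features.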
Using Machine Learning Models
• Decision Trees, Random Forests, or KNN can predict missing values based on other available features.
Detection of Outliers
1. Statistical Methods:
   1. Z-Score Method: Values beyond ±3 standard deviations from the mean are considered outliers.
   2. Interquartile Range (IQR) Method:
      1. Q1 (25th percentile), Q3 (75th percentile), and IQR = Q3 - Q1.
      2. Any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.
2. Visualization Techniques:
   1. Boxplots
   2. Scatterplots
   3. Histograms
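
The two statistical methods can be sketched in plain Python as follows. The function names and the sample data are hypothetical. The sketch assumes the population standard deviation for the z-score and the "inclusive" quartile definition from Python's `statistics` module; other quartile conventions give slightly different fences.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:          # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one obvious extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 12, 95]

print(zscore_outliers(data))  # [95]
print(iqr_outliers(data))     # [95]
```

With very small samples the z-score method may never flag anything, because a single extreme value also inflates the standard deviation; the IQR method does not have that weakness.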
Handling Outliers
1. Removing Outliers: If an outlier is due to a data entry error, it can be removed.
2. Transformation Techniques: Log transformation or square root transformation can reduce the impact of outliers.
3. Capping and Floor Censoring: Extreme values can be replaced with a percentile threshold.
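
A rough sketch of capping and of a log transformation is shown below. The 5th and 95th percentile thresholds, the `cap` helper, and the `incomes` data are assumptions made for illustration; the slides do not prescribe specific percentiles.

```python
import math
from statistics import quantiles

def cap(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given percentile thresholds (capping and flooring)."""
    # cuts[p - 1] is the p-th percentile under the "inclusive" definition.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

incomes = [20, 22, 25, 27, 30, 31, 33, 35, 38, 400]  # hypothetical, in thousands

capped = cap(incomes)
logged = [math.log1p(v) for v in incomes]   # log(x + 1) compresses large values

print(max(incomes), max(capped))            # the extreme value is pulled in
print(max(incomes) / min(incomes), max(logged) / min(logged))
```

Capping keeps every row but limits how far any single value can sit from the rest, whereas the log transformation keeps the ordering and shrinks the gaps between large values.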
Data Transformation
• Data transformation is the process of converting data into a format that is more suitable for analysis.
Types of Data Transformation
1. Scaling: Adjusts the range of data values to improve model performance.
2. Normalization: Ensures data is in a standard range (e.g., [0,1] or [-1,1]).
3. Encoding Categorical Data: Converts categorical variables into numerical values (e.g., One-Hot Encoding).
4. Feature Engineering: Creating new features from existing data to improve predictive models.
5. Logarithmic Transformation: Reduces skewness in the data.
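
One-hot encoding, mentioned in the list above, can be illustrated with a short sketch. The `one_hot` function and the `colors` data are hypothetical; the categories are sorted alphabetically only to make the column order predictable.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 vector per input value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

colors = ["red", "green", "blue", "green"]   # hypothetical categorical column
categories, encoded = one_hot(colors)

print(categories)  # ['blue', 'green', 'red']
for value, vector in zip(colors, encoded):
    print(value, vector)
```

Each category becomes its own column, so no artificial ordering is implied between "red", "green", and "blue".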
Data Normalization
• Normalization ensures that numerical features have a uniform scale, improving model performance.
Common Normalization Techniques
1. Min-Max Scaling
   • Formula: $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$
   • Scales values between 0 and 1.
   • Useful for algorithms like KNN and Neural Networks.
2. Z-Score Standardization
   • Formula: $X_{std} = \frac{X - \mu}{\sigma}$
   • Converts data to a distribution with a mean of 0 and a standard deviation of 1.
   • Useful for linear regression and PCA.
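
The min-max and z-score formulas translate directly into code. The sketch below uses hypothetical function names and data, and assumes the population standard deviation for σ, which the slides do not specify.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """X_norm = (X - X_min) / (X_max - X_min); assumes max != min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """X_std = (X - mu) / sigma, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

heights = [150, 160, 170, 180, 190]   # hypothetical feature, in cm

print(min_max_scale(heights))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(heights))    # values centred on 0 with unit spread
```

Min-max scaling fixes the range but not the spread, while standardization fixes the mean and spread but leaves the range open.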
3. Log Transformation
   • Used when data is highly skewed.
   • Formula: $X_{log} = \log(X + 1)$
4. Robust Scaling
   • Uses median and IQR instead of mean and standard deviation.
   • Useful for handling outliers.
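
A short sketch of the log transformation and robust scaling follows. It assumes the natural logarithm for log(X + 1) and the common form (X - median) / IQR for robust scaling, with "inclusive" quartiles; the function names and the `sales` data are hypothetical.

```python
import math
from statistics import median, quantiles

def log_transform(values):
    """X_log = log(X + 1), natural logarithm; requires X > -1."""
    return [math.log1p(v) for v in values]

def robust_scale(values):
    """(X - median) / IQR, so extreme values do not set the scale."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

sales = [1, 2, 3, 4, 5, 6, 7, 8, 1000]   # hypothetical, one extreme value

print(log_transform(sales))
print(robust_scale(sales))   # the median maps to 0; typical values stay near it
```

Because the median and IQR barely move when the value 1000 is added, the typical values keep a sensible scale, which min-max scaling would not give for this data.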
Data Cleaning
• Data cleaning involves identifying and correcting inconsistencies, errors, and irrelevant data.
Common Data Cleaning Techniques
1. Removing Duplicates: Eliminating duplicate records to avoid redundancy.
2. Handling Inconsistent Data: Standardizing formats (e.g., date formats, text case normalization).
3. Handling Outliers and Missing Values: As discussed in previous sections.
4. Fixing Structural Errors: Correcting typos and syntax errors in categorical variables.
5. Data Validation: Checking data integrity using automated scripts or manual review.
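
Two of the techniques in this section, standardizing formats and removing duplicates, often need to be applied together: records that differ only in case, spacing, or date format are recognized as duplicates only after standardization. The sketch below is one way to show this; the accepted date formats, the field layout, and all names are assumptions.

```python
from datetime import datetime

def normalize_date(text):
    """Try a few assumed input formats and return ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def clean(records):
    """Standardize each (name, city, date) record, then drop exact duplicates."""
    seen, result = set(), []
    for name, city, date in records:
        row = (name.strip().title(), city.strip().upper(), normalize_date(date))
        if row not in seen:          # keep the first occurrence only
            seen.add(row)
            result.append(row)
    return result

raw = [
    ("alice smith", "pune", "2024-03-05"),
    ("Alice Smith ", "Pune", "05/03/2024"),   # same record, different formatting
    ("BOB RAY", "delhi", "12/01/2024"),
]

for row in clean(raw):
    print(row)
```

Raising an error on an unrecognized date, rather than guessing, doubles as a simple form of data validation.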
Data Preprocessing and Analysis in Excel
Data Cleaning: Removing duplicate rows
• Start by moving the cell cursor to any cell within your data range. Choose Data ⇨ Data Tools ⇨ Remove Duplicates, and the Remove Duplicates dialog box appears.
Identifying duplicate rows

• If you would like to identify duplicate rows so that you can examine
them without automatically deleting them, here's another method.
Unlike the technique described in the previous section, this method
looks at actual values, not formatted values. Create a formula to
the right of your data that concatenates each of the cells to the left.
• The formulas that follow assume that the data is in columns A:F.
Enter this formula into cell G2: =CONCAT(A2:F2)
• Add another formula in cell H2. This formula displays the number of
times a value in column G occurs: =COUNTIF(G:G,G2)
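
The same counting idea can be sketched outside Excel. In the hypothetical example below, a `Counter` plays the role of COUNTIF. It also shows one caveat worth checking in your own data: concatenating cells without a delimiter can make different rows produce the same key.

```python
from collections import Counter

# Hypothetical rows; the first and third are true duplicates.
rows = [
    ("ab", "c", 10),
    ("a", "bc", 10),
    ("ab", "c", 10),
]

# Keying on the whole tuple keeps cell boundaries intact.
keys = [tuple(r) for r in rows]
counts = Counter(keys)                      # analogue of COUNTIF over the key column
print([counts[k] for k in keys])            # [2, 1, 2]

# Plain concatenation: "ab"+"c" and "a"+"bc" both become "abc10".
concat_keys = ["".join(str(c) for c in r) for r in rows]
concat_counts = Counter(concat_keys)
print([concat_counts[k] for k in concat_keys])   # [3, 3, 3]
```

In a worksheet, joining the cells with a delimiter that cannot appear in the data is one way to avoid such false matches.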
Removing extra spaces

• Extra spaces in Excel can cause errors when comparing or analyzing text.
• The TRIM function helps by removing leading, trailing, and extra spaces within text, making it clean and consistent.
• However, some imported data (especially from web pages) may contain nonbreaking spaces (HTML &nbsp;), which TRIM doesn't remove.
• These spaces can be replaced using the SUBSTITUTE function with CHAR(160), which represents a nonbreaking space.
• To fully clean the text, use: =TRIM(SUBSTITUTE(A2,CHAR(160)," "))
• This formula first replaces nonbreaking spaces with normal ones and then removes extra spaces, ensuring clean data for analysis.
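
The two-step logic of that formula can be mirrored in a short sketch. The `trim` function below only approximates Excel's TRIM (it handles the ordinary space character), and both function names are hypothetical.

```python
import re

def trim(text):
    """Approximate TRIM: strip leading/trailing spaces and collapse
    runs of internal spaces to a single space."""
    return re.sub(r" +", " ", text).strip(" ")

def clean_spaces(text):
    """Replace nonbreaking spaces (character code 160) first, then trim."""
    return trim(text.replace("\u00a0", " "))

raw = "\u00a0 Total   sales\u00a0"   # hypothetical cell imported from a web page

print(repr(trim(raw)))          # the nonbreaking spaces are still there
print(repr(clean_spaces(raw)))  # 'Total sales'
```

The order matters: trimming before the substitution would leave the converted spaces at the ends of the text.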
Removing strange characters
• When you import data into Excel, it may contain invisible or unprintable characters that can cause issues.
• The CLEAN function removes these unwanted characters, making the text clean and usable.
• For example, if your data is in A2, use: =CLEAN(A2)
• This formula removes things like line breaks or control characters, but it does not remove spaces.
• If spaces are also a problem, combine it with TRIM like this: =TRIM(CLEAN(A2))
• This ensures your data is clean and properly formatted for analysis.
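
As a rough analogue, the sketch below drops the control characters with codes 0 to 31, which is the range CLEAN is documented to remove. The function name and sample text are hypothetical.

```python
def clean(text):
    """Approximate CLEAN: drop control characters with codes 0-31."""
    return "".join(ch for ch in text if ord(ch) >= 32)

raw = "  Line one\nLine\ttwo\x07  "   # hypothetical cell with a line break, tab, and bell

print(repr(clean(raw)))             # '  Line oneLinetwo  '
print(repr(clean(raw).strip(" ")))  # 'Line oneLinetwo'
```

Note that removing a line break joins the words on either side of it, so it can be worth replacing line breaks with a space before cleaning if the words should stay separate.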
Converting values
• Excel provides functions to convert measurements and number systems.
• The CONVERT function changes units like ounces to milliliters (=CONVERT(A2, "oz", "ml")) and works with weight, distance, time, temperature, and more.
• For number system conversions, Excel has specific functions: HEX2DEC converts hexadecimal to decimal (=HEX2DEC("4FF") → 1279), BIN2DEC converts binary to decimal, and OCT2DEC converts octal to decimal.
• If you need to convert decimal to other bases, use DEC2HEX, DEC2BIN, or DEC2OCT. The BASE function (introduced in Excel 2013) converts a decimal number to any base from 2 to 36, and the DECIMAL function (also introduced in Excel 2013) converts text in any of those bases back to decimal.
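
For comparison, the same number-system conversions can be sketched in Python. `int(text, base)` and `format` are built in; the `to_base` helper is a hypothetical function written here for illustration and handles non-negative integers only.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base(number, radix):
    """Convert a non-negative integer to text in any base from 2 to 36."""
    if not 2 <= radix <= 36:
        raise ValueError("radix must be between 2 and 36")
    if number == 0:
        return "0"
    out = []
    while number > 0:
        number, remainder = divmod(number, radix)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))

print(int("4FF", 16))       # 1279, matching the HEX2DEC example
print(int("1011", 2))       # 11
print(int("17", 8))         # 15
print(format(1279, "X"))    # 4FF
print(to_base(1279, 16))    # 4FF
print(int(to_base(1295, 36), 36))  # round trip back to 1295
```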
Joining Values
• To combine data in two or more columns, you can use the CONCAT function in a formula. For example, the following formula combines the contents of cells A1, B1, and C1: =CONCAT(A1:C1)
• Often, you'll need to insert spaces, or some other delimiter, between the cells (for example, if the columns contain a title, first name, and last name). Concatenating using the previous formula would produce something like Mr.ThomasJones.
• To add spaces (to produce Mr. Thomas Jones), use the TEXTJOIN function: =TEXTJOIN(" ",TRUE,A1:C1)
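
The difference between joining with and without a delimiter can be sketched as follows. The `concat` and `textjoin` functions are hypothetical stand-ins that imitate the behavior described above, including the option to skip empty cells.

```python
def concat(cells):
    """Join cells with no delimiter, like CONCAT over a range."""
    return "".join("" if c is None else str(c) for c in cells)

def textjoin(delimiter, ignore_empty, cells):
    """Join cells with a delimiter, optionally skipping empty cells."""
    texts = ["" if c is None else str(c) for c in cells]
    if ignore_empty:
        texts = [t for t in texts if t != ""]
    return delimiter.join(texts)

row = ["Mr.", "Thomas", "Jones"]

print(concat(row))                               # Mr.ThomasJones
print(textjoin(" ", True, row))                  # Mr. Thomas Jones
print(textjoin(" ", True, ["Mr.", "", "Jones"])) # Mr. Jones
```

Skipping empty cells avoids doubled delimiters when, for example, a middle name is missing.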
Filling gaps in an imported report
• If the report is small, you can enter the missing cell values manually or by using a series of Home ⇨ Editing ⇨ Fill ⇨ Down commands (or the Ctrl+D shortcut). But if you have a large list that's in this format, here's a better way:
1. Select the range that has the gaps (A3:A14, in this example).
2. Choose Home ⇨ Editing ⇨ Find & Select ⇨ Go To Special. The Go To Special dialog box appears.
3. Select the Blanks option and click OK. This action selects the blank cells in the original selection.
4. In the formula bar, type an equal sign (=) followed by the address of the first cell with an entry in the column (=A3, in this example) and press Ctrl+Enter.
5. Reselect the original range and press Ctrl+C to copy the selection.
6. Choose Home ⇨ Clipboard ⇨ Paste ⇨ Paste Values to convert the formulas to values.
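
The effect of these steps is a "fill down": each blank cell takes the value of the nearest non-blank cell above it. A minimal sketch of that logic, with a hypothetical `fill_down` function and sample column, looks like this:

```python
def fill_down(column):
    """Replace blank cells with the most recent non-blank value above them."""
    filled, last = [], None
    for cell in column:
        if cell not in ("", None):   # a real entry: remember it
            last = cell
        filled.append(last)
    return filled

# Hypothetical report column where the group label appears only once per group.
region = ["North", "", "", "South", "", "West", ""]

print(fill_down(region))
# ['North', 'North', 'North', 'South', 'South', 'West', 'West']
```

Blanks that appear before the first entry stay as `None`, since there is nothing above them to copy.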
Checking spelling
• Excel has a built-in spell-checker to help catch spelling mistakes in your data. Errors can cause serious issues, like miscategorized data (e.g., a misspelled month could create a "13-month" year).
• To check spelling, go to Review ⇨ Proofing ⇨ Spelling or press F7. If you want to check only a specific range, select it first.
• When Excel finds a mistake, the Spelling dialog box appears with options to correct, ignore, or add words to the dictionary. This helps keep your data accurate.
Following a data cleaning checklist
This section contains a list of items that could cause problems with data. Not all of these are relevant to every set of data.
• Does each column have a unique and descriptive header?
• Is each column of data formatted consistently?
• Did you check for duplicate or missing rows?
• For text data, are the words consistent in terms of case?
• Did you check for spelling errors?
• Does the data contain any extra spaces?
• Are the columns arranged in the proper (or logical) order?
• Are there any cells blank that shouldn't be blank?
• Did you correct any trailing minus signs?
• Are the columns wide enough to display all data?
Identifying Outliers with the Interquartile Range
• Outliers are extreme values that don't fit within the normal range of data.
• To find them, we use the interquartile range (IQR), which measures the middle 50% of the data (between the 25th percentile and the 75th percentile).
Steps to Find Outliers in Excel
Why This Method Works
• Extending the range by 1.5 times the IQR below the 25th percentile and above the 75th percentile means that only values far from the middle 50% of the data are flagged as outliers.
• If too many values are flagged, increase the 1.5 factor; if too few are flagged, decrease it.
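
The effect of adjusting the factor can be seen in a small sketch. The function names and the `scores` data are hypothetical, and the sketch assumes the "inclusive" quartile definition from Python's `statistics` module, so results may differ slightly from a worksheet that uses another quartile convention.

```python
from statistics import quantiles

def iqr_fences(values, factor=1.5):
    """Lower and upper fences: Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def flag_outliers(values, factor=1.5):
    """Values that fall outside the fences for the given factor."""
    low, high = iqr_fences(values, factor)
    return [v for v in values if v < low or v > high]

scores = [52, 55, 58, 60, 61, 63, 64, 66, 68, 70, 80, 120]   # hypothetical

print(iqr_fences(scores))           # (46.0, 82.0)
print(flag_outliers(scores, 1.0))   # [80, 120]  a smaller factor flags more
print(flag_outliers(scores, 1.5))   # [120]
```

A value such as 80 sits close to the fence, so whether it counts as an outlier depends on the factor chosen; that is a judgment call about the data rather than a fixed rule.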
