FBA Module 3

The document provides an overview of data preprocessing and analysis, emphasizing its importance in transforming raw data into a structured format for machine learning and analysis. It outlines key steps in data preprocessing, including data collection, cleaning, transformation, and handling missing values and outliers. Additionally, it discusses techniques for data cleaning and analysis in Excel, including removing duplicates, handling inconsistencies, and using functions for data manipulation.

Uploaded by

nilsa.vp


Module 3: Data Preparation and Analysis
Introduction to Data Preprocessing

• Data preprocessing is a crucial step in data analysis and machine learning, ensuring that raw data is transformed into a clean and structured format suitable for analysis. It helps in improving model accuracy and reliability by handling missing values, outliers, and inconsistencies.
Steps in Data Preprocessing
1. Data Collection – Gathering raw data from various sources.
2. Data Cleaning – Handling missing values, outliers, and duplicate data.
3. Data Transformation – Scaling, encoding categorical variables, and feature engineering.
4. Data Reduction – Reducing dimensionality through techniques like PCA.
5. Data Integration – Merging data from multiple sources.
Handling Missing Data and Outliers

• Missing data can occur due to errors in data collection, equipment failure, or human mistakes. It can significantly impact the quality of analysis.
• Outliers are data points that significantly differ from the rest of the dataset and can distort statistical analysis.
Types of Missing Data

• MCAR (Missing Completely at Random): The missingness is unrelated to any values, observed or missing. Data is missing randomly, with no pattern.
Example: A survey skips some questions due to a system error.

• MAR (Missing at Random): Missing data depends on observed values but not the missing ones. Data is missing due to a known factor but not because of the missing value itself.
Example: Students with low attendance tend to have missing test scores. The missing data is linked to attendance, not the scores.

• MNAR (Missing Not at Random): The missing data depends on unobserved values. Data is missing for a reason related to the missing value itself.
Example: Wealthy people avoid answering income-related survey questions because they don't want to disclose their earnings.

MCAR = pure chance, MAR = related to other observed data, MNAR = related to the missing value itself
Techniques to Handle Missing Data

Deletion Methods
• Listwise Deletion – Removes entire rows with missing values.
• Pairwise Deletion – Uses available data without deleting entire rows.

Imputation Techniques
• Mean/Median/Mode Imputation – Replaces missing values with the mean, median, or mode of the feature.
• Regression Imputation – Predicts missing values using regression models.
• K-Nearest Neighbors (KNN) Imputation – Replaces missing values based on similar observations.

Using Machine Learning Models
• Decision Trees, Random Forests, or KNN can predict missing values based on other available features.
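As a rough illustration of the deletion and simple imputation techniques listed above, the following Python sketch uses only the standard library. The function names `impute` and `listwise_delete`, and the use of `None` to mark a missing value, are assumptions made for this example and do not come from the slides.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # Pick the summary statistic that will stand in for every missing entry.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


def listwise_delete(rows):
    """Drop every row that contains at least one missing (None) value."""
    return [row for row in rows if all(v is not None for v in row)]


scores = [70, None, 90, 80, None]
print(impute(scores, "mean"))
print(listwise_delete([(1, 2), (None, 3), (4, 5)]))
```

Regression and KNN imputation follow the same pattern but derive the fill value from the other columns instead of a single summary statistic.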
Detection of Outliers
1. Statistical Methods:
   1. Z-Score Method: Values beyond ±3 standard deviations from the mean are considered outliers.
   2. Interquartile Range (IQR) Method:
      1. Q1 (25th percentile), Q3 (75th percentile), and IQR = Q3 - Q1.
      2. Any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.
2. Visualization Techniques:
   1. Boxplots
   2. Scatterplots
   3. Histograms
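The two statistical rules above can be sketched in Python as follows. This is a minimal example, not a definitive implementation: the function names are invented here, the z-score uses the population standard deviation, and the quartiles use the "inclusive" interpolation method, which is one of several common quartile conventions.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all values identical, so nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(zscore_outliers(sample))
print(iqr_outliers(sample))
```

On this nine-value sample the IQR rule flags 95 while the z-score rule flags nothing, because a single extreme value also inflates the standard deviation. That is one reason the IQR rule is often preferred for small datasets.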
Handling Outliers
1. Removing Outliers: If an outlier is due to a data entry error, it can be removed.
2. Transformation Techniques: Log transformation or square root transformation can reduce the impact of outliers.
3. Capping and Floor Censoring: Extreme values can be replaced with a percentile threshold.
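A short Python sketch of the transformation and capping options follows. The 5th and 95th percentile thresholds are an assumption chosen for illustration; the slides do not say which percentiles to use, and the function names are hypothetical.

```python
import math
from statistics import quantiles


def cap_at_percentiles(values, lower_pct=5, upper_pct=95):
    """Replace values outside the chosen percentiles with the percentile value itself."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points: 1st..99th percentile
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


def log_transform(values):
    """Compress large values with log(1 + x); requires x > -1."""
    return [math.log1p(v) for v in values]


data = list(range(1, 102))  # 1, 2, ..., 101
capped = cap_at_percentiles(data)
print(min(capped), max(capped))
print(log_transform([0, 9, 99]))
```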
Data Transformation
• Data transformation is the process of converting data into a format that is more suitable for analysis.

Types of Data Transformation
1. Scaling: Adjusts the range of data values to improve model performance.
2. Normalization: Ensures data is in a standard range (e.g., [0,1] or [-1,1]).
3. Encoding Categorical Data: Converts categorical variables into numerical values (e.g., One-Hot Encoding).
4. Feature Engineering: Creating new features from existing data to improve predictive models.
5. Logarithmic Transformation: Reduces skewness in the data.
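To make the one-hot encoding mentioned above concrete, here is a minimal Python sketch. The function name `one_hot` and the choice to order categories alphabetically are assumptions for this example.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot(["red", "blue", "red"])
print(categories)
print(rows)
```

Each input value becomes a row with a single 1 in the column of its category, so no artificial ordering is imposed on the categories.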
Data Normalization
• Normalization ensures that numerical features have a uniform scale, improving model performance.

Common Normalization Techniques
1. Min-Max Scaling:
• Formula: $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$
• Scales values between 0 and 1.
• Useful for algorithms like KNN and Neural Networks.
2. Z-Score Standardization:
• Formula: $X_{std} = \frac{X - \mu}{\sigma}$
• Converts data to a distribution with a mean of 0 and a standard deviation of 1.
• Useful for linear regression and PCA.
3. Log Transformation:
• Used when data is highly skewed.
• Formula: $X_{log} = \log(X + 1)$
4. Robust Scaling:
• Uses median and IQR instead of mean and standard deviation.
• Useful for handling outliers.
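The four techniques above can be sketched in plain Python as follows. The function names are hypothetical. The robust-scaling formula (X - median) / IQR is the commonly used form and is an assumption here, since the slides give no formula for it; quartiles use the "inclusive" interpolation method.

```python
import math
from statistics import mean, median, pstdev, quantiles


def min_max(xs):
    """Scale values to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """Standardize to mean 0 and standard deviation 1: (x - mu) / sigma."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(x - mu) / sigma for x in xs]


def log_scale(xs):
    """Apply log(x + 1) to reduce right skew."""
    return [math.log(x + 1) for x in xs]


def robust_scale(xs):
    """Center on the median and divide by the IQR (assumed form: (x - median) / IQR)."""
    med = median(xs)
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    if q3 == q1:
        raise ValueError("robust scaling is undefined when the IQR is zero")
    return [(x - med) / (q3 - q1) for x in xs]


print(min_max([2, 4, 6]))
print(robust_scale([1, 2, 3, 4, 5]))
```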
Data Cleaning
• Data cleaning involves identifying and correcting inconsistencies, errors, and irrelevant data.

Common Data Cleaning Techniques
1. Removing Duplicates: Eliminating duplicate records to avoid redundancy.
2. Handling Inconsistent Data: Standardizing formats (e.g., date formats, text case normalization).
3. Handling Outliers and Missing Values: As discussed in previous sections.
4. Fixing Structural Errors: Correcting typos and syntax errors in categorical variables.
5. Data Validation: Checking data integrity using automated scripts or manual review.
Data Preprocessing and Analysis in Excel
Data Cleaning: Removing duplicate rows

• Start by moving the cell cursor to any cell within your data range. Choose Data ⇨ Data Tools ⇨ Remove Duplicates, and the Remove Duplicates dialog box appears.
Identifying duplicate rows

• If you would like to identify duplicate rows so that you can examine
them without automatically deleting them, here's another method.
Unlike the technique described in the previous section, this method
looks at actual values, not formatted values. Create a formula to
the right of your data that concatenates each of the cells to the left.
• The formulas that follow assume that the data is in columns A:F.
Enter this formula into cell G2: =CONCAT(A2:F2)
• Add another formula in cell H2. This formula displays the number of
times a value in column G occurs: =COUNTIF(G:G,G2)
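The same idea, counting how often each full row occurs without deleting anything, can be sketched in Python. The function name `duplicate_counts` is an assumption for this example. It compares rows as tuples rather than as one concatenated string, which avoids treating ("ab", "") and ("a", "b") as the same row.

```python
from collections import Counter


def duplicate_counts(rows):
    """Return, for each row, how many times that exact row appears in the data."""
    keys = [tuple(row) for row in rows]
    counts = Counter(keys)
    return [counts[k] for k in keys]


rows = [("a", "b"), ("ab", ""), ("a", "b")]
print(duplicate_counts(rows))
```

A count greater than 1 marks a row that has at least one duplicate, which mirrors the role of the COUNTIF column.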
Removing extra spaces

• Extra spaces in Excel can cause errors when comparing or analyzing text.
• The TRIM function helps by removing leading and trailing spaces and reducing repeated spaces within text to a single space, making it clean and consistent.
• However, some imported data (especially from web pages) may contain nonbreaking spaces (HTML &nbsp;), which TRIM doesn't remove.
• These spaces can be replaced using the SUBSTITUTE function with CHAR(160), which represents a nonbreaking space.
• To fully clean the text, use: =TRIM(SUBSTITUTE(A2,CHAR(160)," "))
• This formula first replaces nonbreaking spaces with normal ones and then removes extra spaces, ensuring clean data for analysis.
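For readers working outside Excel, a rough Python equivalent of the TRIM and SUBSTITUTE combination looks like this. The function name `excel_like_trim` is hypothetical, and the sketch only handles the ordinary space and the nonbreaking space (U+00A0), not tabs or other whitespace.

```python
import re


def excel_like_trim(text):
    """Replace nonbreaking spaces, collapse runs of spaces, and strip both ends."""
    text = text.replace("\u00a0", " ")       # like SUBSTITUTE(text, CHAR(160), " ")
    text = re.sub(r" {2,}", " ", text)       # collapse repeated spaces to one
    return text.strip(" ")                   # drop leading and trailing spaces


print(excel_like_trim("\u00a0 Mr.  Thomas\u00a0Jones  "))
```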
Removing strange characters
• When you import data into Excel, it may contain invisible or unprintable characters that can cause issues.
• The CLEAN function removes these unwanted characters, making the text clean and usable.
• For example, if your data is in A2, use: =CLEAN(A2)
  This formula removes things like line breaks or control characters, but it does not remove spaces.
• If spaces are also a problem, combine it with TRIM like this: =TRIM(CLEAN(A2))
  This ensures your data is clean and properly formatted for analysis.
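A comparable step in Python might look like the sketch below. It assumes that "unprintable" means the ASCII control characters with codes 0 to 31, and the function name `excel_like_clean` is invented for this example.

```python
def excel_like_clean(text):
    """Drop ASCII control characters (codes 0-31), keeping spaces and printable text."""
    return "".join(ch for ch in text if ord(ch) >= 32)


print(excel_like_clean("line1\nline2\t!"))
```

As with CLEAN, ordinary spaces pass through untouched, so a separate trimming step is still needed when extra spaces are a problem.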
Converting values
• Excel provides powerful functions to convert measurements and number systems.
• The CONVERT function changes units like ounces to milliliters (=CONVERT(A2, "oz", "ml")) and works with weight, distance, time, temperature, and more.
• For number system conversions, Excel has specific functions: HEX2DEC converts hexadecimal to decimal (=HEX2DEC("4FF") → 1279), BIN2DEC for binary to decimal, and OCT2DEC for octal to decimal.
• If you need to convert decimal to other bases, use DEC2HEX, DEC2BIN, or DEC2OCT. The BASE function (introduced in Excel 2013) converts a decimal number to any base from 2 to 36, and the DECIMAL function (also introduced in Excel 2013) converts a text representation in a given base back to decimal (=DECIMAL("4FF",16) → 1279).
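The number-system conversions above can be reproduced in Python, which is handy for checking results. The built-in `int(text, base)` parses a string in any base from 2 to 36; the helper `to_base` below is a hypothetical function written for this example to go the other way.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(number, base):
    """Convert a non-negative integer to its text representation in base 2..36."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    out = []
    while number:
        number, remainder = divmod(number, base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


print(int("4FF", 16))     # 1279, matching the HEX2DEC example
print(to_base(1279, 16))  # 4FF
```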
Joining Values
• To combine data in two or more columns, you can use the CONCAT function in a formula. For example, the following formula combines the contents of cells A1, B1, and C1: =CONCAT(A1:C1)
• Often, you'll need to insert spaces, or some other delimiter, between the cells, for example if the columns contain a title, first name, and last name. Concatenating with the previous formula would produce something like Mr.ThomasJones.
• To add spaces (to produce Mr. Thomas Jones), use the TEXTJOIN function: =TEXTJOIN(" ",TRUE,A1:C1)
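The difference between joining with and without a delimiter can be sketched in Python. The function `text_join` is a hypothetical helper that mimics the argument order of TEXTJOIN (delimiter, ignore-empty flag, values); it is not a complete reproduction of the Excel function.

```python
def text_join(delimiter, ignore_empty, values):
    """Join values with a delimiter, optionally skipping empty or missing cells."""
    parts = ["" if v is None else str(v) for v in values]
    if ignore_empty:
        parts = [p for p in parts if p != ""]
    return delimiter.join(parts)


print(text_join("", False, ["Mr.", "Thomas", "Jones"]))
print(text_join(" ", True, ["Mr.", "Thomas", "", "Jones"]))
```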
Filling gaps in an imported report
• If the report is small, you can enter the missing cell values manually or by using a series of Home ⇨ Editing ⇨ Fill ⇨ Down commands (or the Ctrl+D shortcut). But if you have a large list that's in this format, here's a better way:
1. Select the range that has the gaps (A3:A14, in this example).
2. Choose Home ⇨ Editing ⇨ Find & Select ⇨ Go to Special. The Go to Special dialog box appears.
3. Select the Blanks option and click OK. This action selects the blank cells in the original selection.
4. In the formula bar, type an equal sign (=) followed by the address of the first cell with an entry in the column (=A3, in this example) and press Ctrl+Enter.
5. Reselect the original range and press Ctrl+C to copy the selection.
6. Choose Home ⇨ Clipboard ⇨ Paste ⇨ Paste Values to convert the formulas to values.
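The effect of the steps above is a "fill down": each blank cell takes the value of the nearest filled cell above it. A minimal Python sketch of that behavior follows, assuming blanks are represented by `None` or an empty string; the function name `fill_down` is invented for this example.

```python
def fill_down(column):
    """Replace each blank cell with the most recent non-blank value above it."""
    filled, last = [], None
    for cell in column:
        if cell is None or cell == "":
            filled.append(last)   # stays None if no value has been seen yet
        else:
            last = cell
            filled.append(cell)
    return filled


print(fill_down(["North", "", "", "South", None]))
```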
Checking spelling
• Excel has a built-in spell-checker to help catch spelling mistakes in your data. Errors can cause serious issues, like miscategorized data (e.g., a misspelled month could create a "13-month" year).
• To check spelling, go to Review ⇨ Proofing ⇨ Spelling or press F7. If you want to check only a specific range, select it first.
• When Excel finds a mistake, the Spelling dialog box appears with options to correct, ignore, or add words to the dictionary. This helps keep category labels and other text values consistent.
Following a data cleaning checklist
This section contains a list of items that could cause problems with data. Not all of these are relevant to every set of data.
• Does each column have a unique and descriptive header?
• Is each column of data formatted consistently?
• Did you check for duplicate or missing rows?
• For text data, are the words consistent in terms of case?
• Did you check for spelling errors?
• Does the data contain any extra spaces?
• Are the columns arranged in the proper (or logical) order?
• Are there any cells blank that shouldn't be blank?
• Did you correct any trailing minus signs?
• Are the columns wide enough to display all data?
Identifying Outliers with the Interquartile Range
• Outliers are extreme values that don't fit within the normal range of data.
• To find them, we use the interquartile range (IQR), which measures the spread of the middle 50% of the data (between the 25th percentile and the 75th percentile).
Steps to Find Outliers in Excel
Why This Method Works
• Extending the range by 1.5 times the IQR beyond each quartile means that only values well outside the middle 50% of the data are flagged as outliers.
• If too many or too few values are flagged, adjust the 1.5 factor up or down.
