Data Cleaning and Wrangling
The data analytics pipeline involves the tasks of data preprocessing as well as data wrangling.
Data preparation usually takes place in two phases for any data science
or data analysis project:
Data preprocessing : It is the task of transforming raw data so that it is ready
to be fed into an algorithm. It is a time-consuming yet important step
that cannot be avoided if the results of data analysis are to be accurate.
Data wrangling : It is the task of converting data into a format that is
suitable for consumption in analysis. It is also known as data
munging, and it typically follows a set of common steps such as
extracting data from various data sources, parsing the data into predefined
data structures, and storing the converted data in a data sink for
further analysis. Data wrangling is sometimes considered an add-on
to data preprocessing and is often performed by data engineers or data
scientists prior to data analysis.
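As an illustration only, the following minimal sketch assumes a hypothetical raw file sales_raw.csv with order_date and amount columns, and shows the extract, parse, and store steps of wrangling with pandas:

import pandas as pd

# Extract: read raw data from a source file (hypothetical file name)
raw = pd.read_csv("sales_raw.csv")

# Parse: coerce columns into predefined data structures/types
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# Store: write the converted data into a data sink for further analysis
raw.to_csv("sales_clean.csv", index=False)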
Types of data
• Categorical data: This type of data is non-numeric and consists of text that can be
coded as numeric. However, these numbers do not represent any fixed mathematical
notation or meaning for the text and are simply assigned as labels or codes.
• Nominal data: This type of data is used to label variables without providing any quantitative
value. For instance, gender can be labeled as 1 for Male, 2 for Female, and 3 for Others.
However, in reality, the assigned numbers for gender are not fixed and are simply assigned for
labeling.
• Ordinal data: This type of data is used to label variables that need to follow some order. For
instance, a company may take feedback about the quality of their service. In such a case, the
possible answers could be labeled as 1 (very unsatisfied), 2 (somewhat unsatisfied), 3 (neutral),
4 (somewhat satisfied), and 5 (very satisfied). Thus, each categorical value is classified on a rating
scale of 1 to 5. Ordinal data follows some order of preference, satisfaction, comfort, happiness,
or any such similar order and then accordingly labels the options.
• Numerical data: This type of data is numeric and it usually follows an order of
values. These quantitative data represent fixed, measurable values.
• Interval data: This type of data follows numeric scales in which both the order and the exact
differences between the values are considered. In other words, interval data can be
measured along a scale in which each position is equidistant from the next.
The distances between values on the interval scale are always kept equal. For
instance, age can be measured on an interval scale as 1, 2, 3, 4, 5 years, etc. Also,
income can be measured on an interval scale as Rs. 0 – 20,000, Rs. 20,001 –
40,000, Rs. 40,001 – 60,000, Rs. 60,001 – 80,000, and Rs. 80,001 – 1,00,000.
Another example is a set of years from 2009 to 2019, in which the time interval
between each of these years is the same, namely 365 days.
• Ratio data: This type of data also follows numeric scales and has an equal and
definitive ratio between values. Values are measured as multiples of one another and,
unlike interval data, can be multiplied or divided. No negative numerical value is
considered in ratio data, and zero is treated as the point of origin. For instance,
measurements of height and weight are examples of ratio data.
Understanding the various data types is important for applying the correct statistical measurements
and for choosing the appropriate data visualization tool.
Thus, dealing with the right measurement scales in exploratory data analysis (EDA) requires a thorough
understanding of the data and its types.
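As a small illustration (not part of the original notes), the sketch below shows how the nominal and ordinal data described above might be represented in pandas, using hypothetical gender and feedback columns:

import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Others", "Male"],   # nominal
    "feedback": [5, 3, 4, 1],                          # ordinal (1-5 rating scale)
    "age": [23, 35, 41, 29],                           # numeric
})

# Nominal: labels with no inherent order
df["gender"] = pd.Categorical(df["gender"])

# Ordinal: labels with an explicit order of satisfaction
df["feedback"] = pd.Categorical(df["feedback"], categories=[1, 2, 3, 4, 5], ordered=True)

print(df.dtypes)                                   # inspect the data type of each column
print(df["feedback"].min(), df["feedback"].max())  # order-aware operations on ordinal data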
Possible data error types
The raw data collected for analysis usually contains several types of
errors, and it needs to be prepared and processed before data analysis can begin.
• The various possible error types found in data are listed below:
• Missing data : Some values in the data may not be filled in for various
reasons and hence are considered missing. The data may be purposely
withheld or may be mistakenly omitted. In general, there can
be three cases of missing data: missing completely at random (MCAR),
missing at random (MAR), and missing not at random (MNAR).
• Manual input: Manual input errors are human-made errors that
usually occur while entering data during data collection. A few
examples of such errors are making an entry in the
wrong field, misinterpretation of data, spelling mistakes, and
grammatical mistakes.
• Data inconsistency: This error occurs when, for the same field, data is
stored in varying formats. For example, in the case of gender, the
input can be stored as M, Male or 1 (indicating male), but all indicate
the same value. This leads to a discrepancy in the data and may lead to
incorrect output due to misinterpretation.
• Regional formats : The format in which data is stored differs from
place to place. For instance, while working with dates, some may
follow the format as dd/mm/yyyy whereas some may follow the
format as dd month, yyyy.
• Numerical units: Data values may also drastically differ due to varying
consideration of data units. For instance, the weight of several
persons is stored partially in pounds and partially in kilograms.
• Wrong data types : Wrong data type errors usually occur when values are
not stored with the correct data type. For instance, a human may
interpret 3 and three as the same. But for a computer, 3 is numeric and
three is textual, and so they represent different data types.
• File manipulation : This problem arises when we need to deal with
data stored in CSV or text formats. The software may not
be able to correctly display the data depending on the
separator character, the qualifier, or the text encoding used.
• Missing anonymization : Data may contain sensitive or identifying
information and hence may need to be anonymized or removed before
analysis. This is usually done to maintain privacy, address security
issues, or remove bias.
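Purely as an illustration (the column names and values below are hypothetical), the following sketch shows how a few of these error types might be detected and normalized with pandas:

import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "Male", "1", "F"],                             # data inconsistency: mixed codes
    "joined": ["12/05/2020", "13/05/2020", None, "14/05/2020"],    # regional date format / missing data
    "weight": ["70", "154 lb", "68", "seventy"],                   # numerical units / wrong data types
})

# Data inconsistency: map the varying gender codes to one format
df["gender"] = df["gender"].replace({"M": "Male", "1": "Male", "F": "Female"})

# Regional formats / missing data: parse dd/mm/yyyy dates; unparseable values become NaT
df["joined"] = pd.to_datetime(df["joined"], errors="coerce", dayfirst=True)

# Wrong data types: force numeric; text like "154 lb" or "seventy" becomes NaN for later cleaning
df["weight"] = pd.to_numeric(df["weight"], errors="coerce")

print(df.dtypes)
print(df.isnull().sum())   # count the missing values introduced by coercion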
Various data preprocessing operations
• Error-prone data lead to biased results, loss of informative results, or incorrect
results which may lead to incorrect statistical analysis or business decision-
making.
• As a data engineer or data analyst, it is a primary task to handle the
unprocessed, raw, error-prone data by initially detecting the various errors
and then choosing the right operation(s) to remove the errors.
• Pre-processing operations include data cleaning, data integration, data transformation,
data reduction, and data discretization.
Data cleaning
• Dirty data can cause an error while doing data analysis. Data cleaning
is done to handle irrelevant or missing data. Data is cleaned by filling
in the missing values, smoothing any noisy data, identifying and
removing outliers, and resolving any inconsistencies.
• Therefore, an important preprocessing step is to correct the data by
following some data cleaning techniques.
1. Filling missing values
• Filling up the missing values in data is known as the imputation of
missing data. Sometimes, this imputation process becomes time-
consuming and fixing up this problem takes a longer duration than
the actual data analysis.
• Also, the method to be adopted for filling up the missing values
depends on the pattern of the data and the nature of the analysis to be
performed with the data.
• Method 1: Replace Missing Values with Zeroes -Python function used is
fillna() which accepts one argument that indicates the value with which the
NaN values should be replaced.
• Method 2: Dropping Rows with Missing Values - Python function used is
dropna() which deletes the rows consisting of missing values. This method
results in loss of data and it will work poorly if the percentage of missing
values in the dataset is comparatively high. However, once all the missing
values get removed, the dataset becomes robust and perfectly fit to be fed
for data analysis.
• Method 3: Replace Missing Values with Mean/Median/Mode – For this, a
particular column is selected and its central value (say, the median) is
found. Then all the NaN values of that particular column are replaced with
this central value. Instead of the median, the mean or mode can also
be used. Replacing NaN values with the mean, mode, or median is
considered a statistical approach to handling missing values (a short
sketch of this method is given after the Method 2 code below).
Finding and Filling Missing Values with Zero

import pandas as pd
import numpy as np

# Create a DataFrame, then reindex it to introduce rows of missing values
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['C1', 'C2', 'C3'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("\n Reindexed Data Values")
print(df)

# Counting missing values per column
print(df.isnull().sum())

print("\n\n Every Missing Value Replaced with '0':")
print(df.fillna(0))
Method 2 - Dropping Rows Having Missing Values

print("\n\n Dropping Rows with Missing Values:")
print(df.dropna())
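Method 3 (replacing missing values with the mean/median/mode) has no code in the original notes; the following minimal sketch, reusing the df created above, replaces NaN values with each column's median:

print("\n\n Missing Values Replaced with the Column Median:")
print(df.fillna(df.median()))   # use df.mean() or df.mode().iloc[0] for mean or mode imputation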
Filling Missing Values with Interpolation

The interpolate() function fills a missing value by linear interpolation, where
(x1, y1) and (x2, y2) are the two known data points used to find the value of y
for a given x value:
y = y1 + (y2 - y1) * (x - x1) / (x2 - x1)

print("\n\n Filling Missing Values with Interpolation Method:")
df_new = df.interpolate()
print(df_new)
Detect and Remove Outliers
• An outlier is a data point that is very far away from other related data
points.
• Outliers may occur due to several reasons such as measurement error,
data entry error, experimental error, intentional inclusion of outliers,
sampling error, or natural occurrence of outliers.
• For data analysis, outliers should be excluded from the dataset as much
as possible as these outliers may mislead the analysis process resulting in
incorrect results and longer training time.
• In turn, the model developed will be less accurate and will provide
comparatively poorer results.
• There are several ways to detect outliers in a given dataset.
• Probabilistic and Statistical Modeling (parametric)
• Z-Score or Extreme Value Analysis (parametric)
• Proximity Based Models (non-parametric)
• Linear Regression Models (PCA, LMS)
• High Dimensional Outlier Detection Methods (high dimensional
sparse data)
Standard deviation method
• This method of outlier detection initially calculates the mean and
standard deviation of the data points.
• Each value is then compared by checking whether the value is a
certain number of standard deviations away from the mean.
• If so, the data point is identified as an outlier.
• The specified number of standard deviations is considered as the
threshold value for which the default value is 3.
import numpy as np
from matplotlib import pyplot as plt

data = [10, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577,
        471, 615, 583, 441, 562, 563, 527, 453, 530, 433, 541, 585, 704, 443,
        569, 430, 637, 331, 511, 552, 496, 484, 566, 554, 472, 335, 440, 579,
        341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444, 578, 405,
        487, 490, 496, 398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639,
        565, 415, 486, 668, 414, 665, 763, 557, 304, 404, 454, 689, 610, 483,
        441, 657, 590, 492, 476, 437, 483, 12, 363, 711, 543]
print("Original List \n", data)
elements = np.array(data)
mean = np.mean(elements)
std = np.std(elements)

a = np.array(elements)  # For plotting a histogram of the original data
plt.hist(a, bins=[0, 100, 200, 300, 400, 500, 600, 700, 800])
plt.title("histogram")
plt.show()

# Keep only the values that lie within 2 standard deviations of the mean
final_list = [x for x in data if (x > mean - 2 * std)]
final_list = [x for x in final_list if (x < mean + 2 * std)]

a = np.array(final_list)  # For plotting a histogram after removing outliers
plt.hist(a, bins=[0, 100, 200, 300, 400, 500, 600, 700, 800])
plt.title("histogram")
plt.show()
Interquartile range method
• This method of outlier detection initially calculates the interquartile
range (IQR ) for the given data points.
• Each value is then compared with the value (1.5 x IQR). If the data
point is more than (1.5 x IQR) above the third quartile or below the
first quartile, the data point is identified as an outlier.
• This can be mathematically represented as low outliers are less than
Q1 - (1.5 x IQR), and high outliers are more than Q3 + (1.5 x IQR),
where Q1 is the first quartile and Q3 is the third quartile
import numpy as np
from matplotlib import pyplot as plt

data = [3, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577, 471,
        615, 583, 441, 562, 563, 527, 433, 541, 585, 704, 443, 569, 430, 331,
        511, 440, 579, 341, 545, 615, 548, 439, 556, 442, 624, 444]
data = sorted(data)
print("Original List \n", data)

a = np.array(data)  # For plotting a histogram of the original data
plt.hist(a, bins=[0, 100, 200, 300, 400, 500, 600, 700, 800])
plt.title("histogram")
plt.show()

# Compute the interquartile range and the lower/upper bounds
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
LB = q1 - (1.5 * iqr)
UB = q3 + (1.5 * iqr)

# Keep only the values that lie within the bounds
final_list = [x for x in data if (x > LB)]
final_list = [x for x in final_list if (x < UB)]

plt.hist(final_list, bins=[0, 100, 200, 300, 400, 500, 600, 700, 800])
plt.title("histogram")
plt.show()
Data integration
• The technique of data integration allows merging data from various disparate sources so as to
maintain a unified view of the data. It is an important technique used mainly for merging varying
data of a company in a common unified format or for combining data of more than one company
so as to maintain common data assets.
• The data sources in real life are heterogeneous and this raises the complexity of assimilating the
data of different formats into a common format to be stored in a unified data source.
• Data integration is carried out in many areas such as data warehousing, data migration,
information integration, and enterprise management.
• It is challenging work, as a lot of understanding of the system is required prior to integrating
data from multiple sources.
• Redundant data can be detected using the concept of correlation analysis. There are several
methods used in correlation analysis to find the correlation coefficient (a value between -1 and +1),
which measures the strength and the direction of a linear relationship between two variables.
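As an illustrative sketch (the column names and values are hypothetical), a correlation matrix in pandas can be used to spot redundant attributes after integration:

import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_m":  [1.50, 1.60, 1.70, 1.80, 1.90],   # redundant: same information as height_cm
    "weight_kg": [55, 62, 70, 78, 88],
})

# Pearson correlation coefficients between every pair of columns
corr = df.corr()
print(corr)
# height_cm and height_m correlate perfectly (1.0), flagging one of them as redundant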
Data Transformation
• Once the data is cleaned and integrated, it is transformed into a range of values
that are easier to be analyzed. This is done as the values for different information
are found to be in a varied range of scales.
• For example, for a company, age values for employees can be within the range of
20-55 years whereas salary values for employees can be within the range of Rs.
10,000 – Rs. 1,00,000. This indicates that one column in a dataset can carry more
weight than others due to the varying range of values.
• In such cases, applying statistical measures for data analysis across this dataset
may lead to unnatural or incorrect results. Data transformation is hence required
to solve this issue before applying any analysis of data.
• Various data transformation techniques are used during data preprocessing. The
choice of data transformation technique depends on how the data will be later
used for analysis.
Rescaling data
• When the data encompasses attributes with varying scales, many
statistical or machine learning techniques prefer rescaling the
attributes to fall within a given scale. Rescaling of data allows scaling
all data values to lie between a specified minimum and maximum
value (say, between 0 and 1).
• Data rescaling is done prior to data analysis in many cases such as, in
algorithms that weight inputs like regression and neural networks, in
optimization algorithms used in machine learning, and in algorithms
that use distance measures like K-Nearest Neighbors.
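A minimal sketch of min-max rescaling with pandas follows (the DataFrame and its values are hypothetical; scikit-learn's MinMaxScaler offers the same transformation):

import pandas as pd

df = pd.DataFrame({"age": [22, 35, 41, 55], "salary": [10000, 40000, 65000, 100000]})

# Rescale every column to the range [0, 1]:  x' = (x - min) / (max - min)
rescaled = (df - df.min()) / (df.max() - df.min())
print(rescaled)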
Hierarchical / Multi-level Indexing
Hierarchical / multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially
for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions
in lower dimensional data structures like Series (1d) and DataFrame (2d).
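For illustration (the year and quarter labels are hypothetical), a MultiIndex lets a 1-d Series hold what is conceptually 2-d data:

import pandas as pd

index = pd.MultiIndex.from_product([["2023", "2024"], ["Q1", "Q2"]],
                                   names=["year", "quarter"])
sales = pd.Series([100, 120, 130, 155], index=index)
print(sales)
print(sales.loc["2024"])          # select all quarters of one year
print(sales.loc[("2024", "Q2")])  # select a single (year, quarter) value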
Merge, join, concatenate and compare
• pandas provides various facilities for easily combining together Series
or DataFrame with various kinds of set logic for the indexes and
relational algebra functionality in the case of join / merge-type
operations.
• In addition, pandas also provides utilities to compare two Series or
DataFrame and summarize their differences.
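A short sketch (with made-up frames) of the concatenation and comparison utilities mentioned above:

import pandas as pd

df_a = pd.DataFrame({"id": [1, 2], "score": [80, 90]})
df_b = pd.DataFrame({"id": [3, 4], "score": [70, 85]})

# Stack the two frames on top of each other
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)

# Summarize the element-wise differences between two identically labelled frames
df_c = df_a.copy()
df_c.loc[1, "score"] = 95
print(df_a.compare(df_c))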
Joining the Data Frames
• When we have data spread in various data frames (or tables), we can
combine that data into a single data frame to have an overall view.
• This can typically be done when the data frames to be combined have a
common column or a common index.
• Combining data from various data frames is known as joining or
merging the data.
• The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes will be ignored. Otherwise if joining
indexes on indexes or indexes on a column or columns, the index will
be passed on.
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
             left_index=False, right_index=False, sort=False,
             suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
• Parameters:
• left : DataFrame or named Series
• right : DataFrame or named Series
• how : {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’. Type of merge to be performed:
• left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
• right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
• outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
• inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys.
• cross: creates the cartesian product from both frames, preserves the
order of the left keys.
on : label or list
Column or index level names to join on. These must be found in both DataFrames.
If on is None and not merging on indexes, then this defaults to the intersection of the columns
in both DataFrames.
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
>>> df1
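To continue the example (this completion is not in the original notes but follows the usual pandas pattern), the two frames can be merged on their key columns:

>>> df1.merge(df2, left_on='lkey', right_on='rkey')

This joins df1 and df2 on the lkey and rkey columns; the value columns appear as value_x and value_y because of the default suffixes.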