Data Cleaning and Wrangling
The data analytics pipeline involves the tasks of data preprocessing as well as data wrangling.
Data preparation usually takes place in two phases for any data science
or data analysis project:
Data preprocessing : It is the task of transforming raw data so that it is ready
to be fed into an algorithm. It is a time-consuming yet important step
that cannot be avoided if the results of data analysis are to be accurate.
Data wrangling : It is the task of converting data into a format that is
suitable for consumption in analysis. It is also known as data
munging, and it typically follows a set of common steps such as
extracting data from various data sources, parsing the data into predefined
data structures, and storing the converted data in a data sink for
further analysis. Data wrangling is sometimes considered an add-on
to data preprocessing and is often performed by data engineers or data
scientists prior to data analysis.
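As an illustration only, the following minimal sketch assumes a hypothetical raw file sales_raw.csv with order_date and amount columns, and shows the extract, parse, and store steps of wrangling with pandas:

import pandas as pd

# Extract: read raw data from a source file (hypothetical file name)
raw = pd.read_csv("sales_raw.csv")

# Parse: coerce columns into predefined data structures/types
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# Store: write the converted data into a data sink for further analysis
raw.to_csv("sales_clean.csv", index=False)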
Types of data
• Categorical data: This type of data is non-numeric and consists of text that can be
coded as numeric. However, these numbers do not represent any fixed mathematical
notation or meaning for the text and are simply assigned as labels or codes.
• Nominal data: This type of data is used to label variables without providing any quantitative
value. For instance, gender can be labeled as 1 for Male, 2 for Female, and 3 for Others.
However, in reality, the assigned numbers for gender are not fixed and are simply assigned for
labeling.
• Ordinal data: This type of data is used to label variables that need to follow some order. For
instance, a company may take feedback about the quality of their service. In such a case, the
possible answers could be labeled as 1 (very unsatisfied), 2 (somewhat unsatisfied), 3 (neutral),
4 (somewhat satisfied), and 5 (very satisfied). Thus, each categorical value is classified on a rating
scale of 1 to 5. Ordinal data follows some order of preference, satisfaction, comfort, happiness,
or any such similar order and then accordingly labels the options.
• Numerical data: This type of data is numeric and it usually follows an order of
values. These quantitative data represent fixed, measurable values.
• Interval data: This type of data follows numeric scales in which both the order and the exact
differences between the values are considered. In other words, interval data can be
measured along a scale in which each position is equidistant from the next.
The distances between values on the interval scale are always kept equal. For
instance, age can be measured on an interval scale as 1, 2, 3, 4, 5 years, etc. Also,
income can be measured on an interval scale as Rs. 0 – 20,000, Rs. 20,001 –
40,000, Rs. 40,001 – 60,000, Rs. 60,001 – 80,000, and Rs. 80,001 – 1,00,000.
Another example is a set of years from 2009 to 2019, in which the time interval
between each of these years is the same, namely 365 days.
• Ratio data: This type of data also follows numeric scales and has an equal and
definitive ratio between values. Values are measured as multiples of one another and,
unlike interval data, can be multiplied or divided. No negative numerical value is
considered in ratio data, and zero is treated as the point of origin. For instance,
measurements of height and weight are examples of ratio data.
Understanding the various data types is important for applying the correct statistical measurements
and for choosing the appropriate data visualization tool.
Thus, dealing with the right measurement scales in exploratory data analysis (EDA) requires a thorough
understanding of the data and its types.
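As a small illustration (not part of the original notes), the sketch below shows how the nominal and ordinal data described above might be represented in pandas, using hypothetical gender and feedback columns:

import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Others", "Male"],   # nominal
    "feedback": [5, 3, 4, 1],                          # ordinal (1-5 rating scale)
    "age": [23, 35, 41, 29],                           # numeric
})

# Nominal: labels with no inherent order
df["gender"] = pd.Categorical(df["gender"])

# Ordinal: labels with an explicit order of satisfaction
df["feedback"] = pd.Categorical(df["feedback"], categories=[1, 2, 3, 4, 5], ordered=True)

print(df.dtypes)                                   # inspect the data type of each column
print(df["feedback"].min(), df["feedback"].max())  # order-aware operations on ordinal data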
Possible data error types
The raw data collected for analysis usually contains several types of
errors, and it needs to be prepared and processed before data analysis can begin.
• The various possible error types found in data are listed below:
• Missing data : Some values in the data may not be filled in for various
reasons and hence are considered missing. The data may be purposely
withheld or may be mistakenly omitted. In general, there can
be three cases of missing data: missing completely at random (MCAR),
missing at random (MAR), and missing not at random (MNAR).
• Manual input: Manual input errors are human-made errors that
usually occur while entering data during data collection. A few
examples of such errors are making an entry in the
wrong field, misinterpretation of data, spelling mistakes, and
grammatical mistakes.
• Data inconsistency: This error occurs when, for the same field, data is
stored in varying formats. For example, in the case of gender, the
input can be stored as M, Male or 1 (indicating male), but all indicate
the same value. This leads to a discrepancy in the data and may lead to
incorrect output due to misinterpretation.
• Regional formats : The format in which data is stored differs from
place to place. For instance, while working with dates, some may
follow the format as dd/mm/yyyy whereas some may follow the
format as dd month, yyyy.
• Numerical units: Data values may also drastically differ due to varying
consideration of data units. For instance, the weight of several
persons is stored partially in pounds and partially in kilograms.
• Wrong data types : Wrong data type errors usually occur when values are
not stored with the correct data type. For instance, a human may
interpret 3 and three as the same. But for a computer, 3 is numeric and
three is textual, and so they represent different data types.
• File manipulation : This problem arises when we need to deal with
data stored in CSV or text formats. The software may not
be able to correctly display the data depending on the
separator character, the qualifier, or the text encoding used.
• Missing anonymization : Data may contain sensitive or identifying
information and hence may need to be anonymized or removed before
analysis. This is usually done to maintain privacy, address security
issues, or remove bias.
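Purely as an illustration (the column names and values below are hypothetical), the following sketch shows how a few of these error types might be detected and normalized with pandas:

import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "Male", "1", "F"],                             # data inconsistency: mixed codes
    "joined": ["12/05/2020", "13/05/2020", None, "14/05/2020"],    # regional date format / missing data
    "weight": ["70", "154 lb", "68", "seventy"],                   # numerical units / wrong data types
})

# Data inconsistency: map the varying gender codes to one format
df["gender"] = df["gender"].replace({"M": "Male", "1": "Male", "F": "Female"})

# Regional formats / missing data: parse dd/mm/yyyy dates; unparseable values become NaT
df["joined"] = pd.to_datetime(df["joined"], errors="coerce", dayfirst=True)

# Wrong data types: force numeric; text like "154 lb" or "seventy" becomes NaN for later cleaning
df["weight"] = pd.to_numeric(df["weight"], errors="coerce")

print(df.dtypes)
print(df.isnull().sum())   # count the missing values introduced by coercion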
Various data preprocessing operations
• Error-prone data lead to biased results, loss of informative results, or incorrect
results which may lead to incorrect statistical analysis or business decision-
making.
• As a data engineer or data analyst, it is a primary task to handle the
unprocessed, raw, error-prone data by initially detecting the various errors
and then choosing the right operation(s) to remove the errors.
• Pre-processing operations include data cleaning, data integration, data transformation,
data reduction, and data discretization.
Data cleaning
• Dirty data can cause an error while doing data analysis. Data cleaning
is done to handle irrelevant or missing data. Data is cleaned by filling
in the missing values, smoothing any noisy data, identifying and
removing outliers, and resolving any inconsistencies.
• Therefore, an important preprocessing step is to correct the data by
following some data cleaning techniques.
1. Filling missing values
• Filling up the missing values in data is known as the imputation of
missing data. Sometimes, this imputation process becomes time-
consuming and fixing up this problem takes a longer duration than
the actual data analysis.
• Also, the method to be adopted for filling up the missing values
depends on the pattern of the data and the nature of the analysis to be
performed with the data.
• Method 1: Replace Missing Values with Zeroes -Python function used is
fillna() which accepts one argument that indicates the value with which the
NaN values should be replaced.
• Method 2: Dropping Rows with Missing Values - Python function used is
dropna() which deletes the rows consisting of missing values. This method
results in loss of data and it will work poorly if the percentage of missing
values in the dataset is comparatively high. However, once all the missing
values get removed, the dataset becomes robust and perfectly fit to be fed
for data analysis.
• Method 3: Replace Missing Values with Mean/Median/Mode – For this, a
particular column is selected and its central value (say, the median) is
found. Then all the NaN values of that particular column are replaced with
this central value. Instead of the median, the mean or mode can also
be used. Replacing NaN values with the mean, mode, or median is
considered a statistical approach to handling missing values (a short
sketch of this method is given after the Method 2 code below).
Finding and Filling Missing Values with Zero

import pandas as pd
import numpy as np

# Create a DataFrame, then reindex it to introduce rows of missing values
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['C1', 'C2', 'C3'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("\n Reindexed Data Values")
print(df)

# Counting missing values per column
print(df.isnull().sum())

print("\n\n Every Missing Value Replaced with '0':")
print(df.fillna(0))
Method 2 - Dropping Rows Having Missing Values

print("\n\n Dropping Rows with Missing Values:")
print(df.dropna())
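Method 3 (replacing missing values with the mean/median/mode) has no code in the original notes; the following minimal sketch, reusing the df created above, replaces NaN values with each column's median:

print("\n\n Missing Values Replaced with the Column Median:")
print(df.fillna(df.median()))   # use df.mean() or df.mode().iloc[0] for mean or mode imputation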
Filling Missing Values with Interpolation

The interpolate() function fills a missing value by linear interpolation, where
(x1, y1) and (x2, y2) are the two known data points used to find the value of y
for a given x value:
y = y1 + (y2 - y1) * (x - x1) / (x2 - x1)

print("\n\n Filling Missing Values with Interpolation Method:")
df_new = df.interpolate()
print(df_new)
Detect and Remove Outliers
• An outlier is a data point that is very far away from other related data
points.
• Outliers may occur due to several reasons such as measurement error,
data entry error, experimental error, intentional inclusion of outliers,
sampling error, or natural occurrence of outliers.
• For data analysis, outliers should be excluded from the dataset as much
as possible as these outliers may mislead the analysis process resulting in
incorrect results and longer training time.
• In turn, the model developed will be less accurate and will provide
comparatively poorer results.
• There are several ways to detect outliers in a given dataset.
• Probabilistic and Statistical Modeling (parametric)
• Z-Score or Extreme Value Analysis (parametric)
• Proximity Based Models (non-parametric)
• Linear Regression Models (PCA, LMS)
• High Dimensional Outlier Detection Methods (high dimensional
sparse data)
Standard deviation method
• This method of outlier detection initially calculates the mean and
standard deviation of the data points.
• Each value is then compared by checking whether the value is a
certain number of standard deviations away from the mean.
• If so, the data point is identified as an outlier.
• The specified number of standard deviations is considered as the
threshold value for which the default value is 3.
import numpy as np
from matplotlib import pyplot as plt

data = [10, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577,
        471, 615, 583, 441, 562, 563, 527, 453, 530, 433, 541, 585, 704, 443,
        569, 430, 637, 331, 511, 552, 496, 484, 566, 554, 472, 335, 440, 579,
        341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444, 578, 405,
        487, 490, 496, 398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639,
        565, 415, 486, 668, 414, 665, 763, 557, 304, 404, 454, 689, 610, 483,
        441, 657, 590, 492, 476, 437, 483, 12, 363, 711, 543]
print("Original List \n", data)
elements = np.array(data)
mean = np.mean(elements)
std = np.std(elements)

a = np.array(elements)  # For plotting a histogram of the original data
plt.hist(a, bins=[0, 100, 200, 300, 400, 500, 600, 700, 800])
plt.title("histogram")
plt.show()

# Keep only the values that lie within 2 standard deviations of the mean
final_list = [x for x in data if (x > mean - 2 * std)]
final_list = [x for x in final_list if (x < mean + 2 * std)]

a = np.array(final_list)  # For plotting a histogram after removing outliers
plt.hist(a, bins=[0, 100, 200, 300, 400, 500, 600, 700, 800])
plt.title("histogram")
plt.show()
Interquartile range method
• This method of outlier detection initially calculates the interquartile
range (IQR ) for the given data points.
• Each value is then compared with the value (1.5 x IQR). If the data
point is more than (1.5 x IQR) above the third quartile or below the
first quartile, the data point is identified as an outlier.
• This can be mathematically represented as low outliers are less than
Q1 - (1.5 x IQR), and high outliers are more than Q3 + (1.5 x IQR),
where Q1 is the first quartile and Q3 is the third quartile
import numpy as np
from matplotlib import pyplot as plt

data = [3, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577, 471,
        615, 583, 441, 562, 563, 527, 433, 541, 585, 704, 443, 569, 430, 331,
        511, 440, 579, 341, 545, 615, 548, 439, 556, 442, 624, 444]
data = sorted(data)
print("Original List \n", data)

a = np.array(data)  # For plotting a histogram of the original data
plt.hist(a, bins=[0, 100, 200, 300, 400, 500, 600, 700, 800])
plt.title("histogram")
plt.show()

# Compute the interquartile range and the lower/upper bounds
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
LB = q1 - (1.5 * iqr)
UB = q3 + (1.5 * iqr)

# Keep only the values that lie within the bounds
final_list = [x for x in data if (x > LB)]
final_list = [x for x in final_list if (x < UB)]

plt.hist(final_list, bins=[0, 100, 200, 300, 400, 500, 600, 700, 800])
plt.title("histogram")
plt.show()
Data integration
• The technique of data integration allows merging data from various disparate sources so as to
maintain a unified view of the data. It is an important technique used mainly for merging varying
data of a company in a common unified format or for combining data of more than one company
so as to maintain common data assets.
• The data sources in real life are heterogeneous and this raises the complexity of assimilating the
data of different formats into a common format to be stored in a unified data source.
• Data integration is carried out in many areas such as data warehousing, data migration,
information integration, and enterprise management.
• It is challenging work, as a lot of understanding of the system is required prior to integrating
data from multiple sources.
• Redundant data can be detected using the concept of correlation analysis. There are several
methods used in correlation analysis to find the correlation coefficient (a value between -1 and +1),
which measures the strength and the direction of a linear relationship between two variables.
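As an illustrative sketch (the column names and values are hypothetical), a correlation matrix in pandas can be used to spot redundant attributes after integration:

import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_m":  [1.50, 1.60, 1.70, 1.80, 1.90],   # redundant: same information as height_cm
    "weight_kg": [55, 62, 70, 78, 88],
})

# Pearson correlation coefficients between every pair of columns
corr = df.corr()
print(corr)
# height_cm and height_m correlate perfectly (1.0), flagging one of them as redundant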
Data Transformation
• Once the data is cleaned and integrated, it is transformed into a range of values
that are easier to be analyzed. This is done as the values for different information
are found to be in a varied range of scales.
• For example, for a company, age values for employees can be within the range of
20-55 years whereas salary values for employees can be within the range of Rs.
10,000 – Rs. 1,00,000. This indicates that one column in a dataset can carry more
weight than others due to the varying range of values.
• In such cases, applying statistical measures for data analysis across this dataset
may lead to unnatural or incorrect results. Data transformation is hence required
to solve this issue before applying any analysis of data.
• Various data transformation techniques are used during data preprocessing. The
choice of data transformation technique depends on how the data will be later
used for analysis.
Rescaling data
• When the data encompasses attributes with varying scales, many
statistical or machine learning techniques prefer rescaling the
attributes to fall within a given scale. Rescaling of data allows scaling
all data values to lie between a specified minimum and maximum
value (say, between 0 and 1).
• Data rescaling is done prior to data analysis in many cases such as, in
algorithms that weight inputs like regression and neural networks, in
optimization algorithms used in machine learning, and in algorithms
that use distance measures like K-Nearest Neighbors.
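A minimal sketch of min-max rescaling with pandas follows (the DataFrame and its values are hypothetical; scikit-learn's MinMaxScaler offers the same transformation):

import pandas as pd

df = pd.DataFrame({"age": [22, 35, 41, 55], "salary": [10000, 40000, 65000, 100000]})

# Rescale every column to the range [0, 1]:  x' = (x - min) / (max - min)
rescaled = (df - df.min()) / (df.max() - df.min())
print(rescaled)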
Hierarchical / Multi-level Indexing
Hierarchical / multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially
for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions
in lower dimensional data structures like Series (1d) and DataFrame (2d).
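For illustration (the year and quarter labels are hypothetical), a MultiIndex lets a 1-d Series hold what is conceptually 2-d data:

import pandas as pd

index = pd.MultiIndex.from_product([["2023", "2024"], ["Q1", "Q2"]],
                                   names=["year", "quarter"])
sales = pd.Series([100, 120, 130, 155], index=index)
print(sales)
print(sales.loc["2024"])          # select all quarters of one year
print(sales.loc[("2024", "Q2")])  # select a single (year, quarter) value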
Merge, join, concatenate and compare
• pandas provides various facilities for easily combining together Series
or DataFrame with various kinds of set logic for the indexes and
relational algebra functionality in the case of join / merge-type
operations.
• In addition, pandas also provides utilities to compare two Series or
DataFrame and summarize their differences.
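A short sketch (with made-up frames) of the concatenation and comparison utilities mentioned above:

import pandas as pd

df_a = pd.DataFrame({"id": [1, 2], "score": [80, 90]})
df_b = pd.DataFrame({"id": [3, 4], "score": [70, 85]})

# Stack the two frames on top of each other
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)

# Summarize the element-wise differences between two identically labelled frames
df_c = df_a.copy()
df_c.loc[1, "score"] = 95
print(df_a.compare(df_c))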
Joining the Data Frames
• When we have data spread in various data frames (or tables), we can
combine that data into a single data frame to have an overall view.
• This can typically be done when the data frames to be combined have a
common column or a common index.
• Combining data from various data frames is known as joining or
merging the data.
• The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes will be ignored. Otherwise if joining
indexes on indexes or indexes on a column or columns, the index will
be passed on.
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
             left_index=False, right_index=False, sort=False,
             suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
• Parameters:
• left : DataFrame or named Series
• right : DataFrame or named Series
• how : {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’. Type of merge to be performed:
• left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
• right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
• outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
• inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys.
• cross: creates the cartesian product from both frames, preserves the
order of the left keys.
on : label or list
Column or index level names to join on. These must be found in both DataFrames.
If on is None and not merging on indexes, then this defaults to the intersection of the columns
in both DataFrames.
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
>>> df1
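To continue the example (this completion is not in the original notes but follows the usual pandas pattern), the two frames can be merged on their key columns:

>>> df1.merge(df2, left_on='lkey', right_on='rkey')

This joins df1 and df2 on the lkey and rkey columns; the value columns appear as value_x and value_y because of the default suffixes.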