FDS Chapter 3
Introduction
• Real-world data is often dirty, i.e., it exhibits data pathologies.
– Formatting issues (inconsistent capitalization, extraneous whitespace, etc.)
– Pathologies in the actual data content (duplicate values, major outliers, NULL values)
• It often requires some detective work to figure out what these issues mean in a particular situation and hence how they should be addressed.
• Data needs to be cleaned up before it can be used for the desired purpose; this clean-up is data pre-processing.
• Factors that make data dirty:
– Incomplete. When some of the attribute values are
lacking, certain attributes of interest are lacking, or
attributes contain only aggregate data.
– Noisy. When data contains errors or outliers. For
example, some of the data points in a dataset may
contain extreme values that can severely affect the
dataset’s range.
– Inconsistent. Data contains discrepancies in codes or
names. For example, if the “Name” column for
registration records of employees contains values
other than alphabetical letters, or if records do not
start with a capital letter, discrepancies are present.
• The term "dirty" used to describe data here refers to the syntactical, formatting, and structural issues with the data, ignoring all other ways the data could be "muddled up" (such as bias in the data).
Data pre-processing
Data objects & attribute types
• Attribute types:
– Qualitative: describes qualities or characteristics of the data.
• Descriptive and cannot be measured numerically
• Consists of words, pictures, symbols
• Types: nominal, ordinal, binary
– Quantitative: can be counted or measured and can be expressed using numbers.
• Types: numeric, discrete, continuous
Data Quality
Why pre-process data
• Whenever there is a large organization, a
complicated data collection process, or several
datasets that have been merged, issues tend to
pile up.
• They are rarely documented and often only come
to light when some poor data scientist is tasked
with analyzing them.
• One of the most embarrassing things that can happen in data science is to have to retract results that you've presented because you realize that you processed the data incorrectly.
• Data quality is a measure of data based on the
following factors:
– Accuracy: presence of inaccurate or noisy data with errors, e.g., due to faulty instruments or errors in collection.
– Completeness: incomplete data due to missing values, missing attributes, or only aggregate data being available.
– Consistency: discrepancies in data values, e.g., data duplication leading to inconsistency.
– Timeliness: availability of information when needed.
– Believability: the degree of trust users place in the data.
– Interpretability: how easily the data can be understood (based on how accurately the attributes are described in the data set).
Data Munging
• Also known as data manipulation / data
wrangling
• It is the process of collecting and transforming raw data into another format for better understanding and analysis.
• Often the data is not in a format that is easy
to work with.
– Eg: data stored in a way that is hard to process.
• Hence the need to convert it to something
more suitable for a computer to understand.
• All of these methods manipulate (wrangle/munge) the data to turn it into something that is more convenient or desirable.
• Eg : Consider the following text recipe.
“Add two diced tomatoes, three cloves of
garlic, and a pinch of salt in the mix.”
Ingredient | Quantity | Unit/size
Tomato     | 2        | Diced
Garlic     | 3        | Cloves
Salt       | 1        | Pinch
• This table conveys the same information as
the text, but it is more “analysis friendly.”
• There is no single systematic method for wrangling ill-formatted data into something more manageable.
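• As an illustration only (hypothetical code, not from the original slides), the recipe table above can be represented as a small pandas DataFrame, which is exactly the kind of "analysis friendly" structure munging aims for:

import pandas as pd

# The recipe text, restructured as an analysis-friendly table
recipe = pd.DataFrame({
    'Ingredient': ['Tomato', 'Garlic', 'Salt'],
    'Quantity':   [2, 3, 1],
    'Unit/size':  ['Diced', 'Cloves', 'Pinch'],
})
print(recipe)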
Data Cleaning
• Many different ways to clean dirty data.
• Handling missing data
• Handling Noisy data
• Handling formatting issues
Handling missing values
• Many real-world datasets may contain missing values
for various reasons.
• They are often encoded as NaNs, blanks or any other
placeholders.
• Training a model with a dataset that has a lot of missing values can drastically impact the machine learning model's quality.
• One way to handle this problem is to get rid of the observations that have missing data. However, you will risk losing data points with valuable information. A better strategy is to impute the missing values.
• Handling missing data:
– Sometimes data may be in the right format, but some of the values are missing.
– E.g., consider an employee table with employee data, in which some of the home phone numbers are absent.
• People may not have home phones; their mobile phone may be the primary or only phone.
• As another example, consider a log of transactions from the past year. Group the transactions by customer and add up the size for each customer, thus giving one row per customer.
– If a customer didn't have any transactions that year, then his record will be missing in the aggregate. To solve this, we join this aggregate data with some known set of all customers and fill in the appropriate missing values for the ones who were missing.
• Missing data can arise when data was never gathered in the first place for some entities.
• Data may be missing due to problems with the
process of collecting data, or an equipment
malfunction.
• Some data may not have been considered
important at the time of collection
• E.g., if data collection was limited to a certain area or region, the area code may not have been recorded for phone numbers at that time.
– But when we later decide to expand beyond that city/region, phone numbers will need an area code too.
• Data may get lost due to system or human
error while storing or transferring the data.
• Thus some strategy is needed to handle missing data.
• Methods for handling missing data (a short pandas sketch follows this list):
– Replace missing values manually: time consuming and needs expertise.
– Replace missing values with zeros
• Python: df.fillna(0)
– Drop rows with missing values: suitable for large data sets where multiple values are missing within a tuple.
• df.dropna()
– Replace missing values with the mean/median/mode
• median = df['C1'].median()
• df['C1'].fillna(median, inplace=True)
– Replace missing values with the previous/next row's value
• df.fillna(method="ffill")  # fill with the previous row's value
• df.fillna(method="bfill")  # fill with the next row's value
– Use interpolation for filling missing values
• Interpolation can be used to construct new values within the range of a discrete set of known data values.
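• A minimal pandas sketch of the options above; the DataFrame df and its values are hypothetical, used only for illustration:

import numpy as np
import pandas as pd

# Hypothetical frame with missing values
df = pd.DataFrame({'C1': [10.0, np.nan, 30.0, np.nan, 50.0],
                   'C2': ['a', 'b', None, 'd', 'e']})

filled_zero   = df.fillna(0)                # replace missing values with zeros
dropped       = df.dropna()                 # drop rows containing any missing value
median        = df['C1'].median()
filled_median = df['C1'].fillna(median)     # replace with the column median
filled_ffill  = df.fillna(method='ffill')   # fill with the previous row's value
filled_bfill  = df.fillna(method='bfill')   # fill with the next row's value
interpolated  = df['C1'].interpolate()      # construct values between known data points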
• In conclusion, there is no perfect way to
compensate for the missing values in a
dataset.
• Each strategy can perform better for certain
datasets and missing data types but may
perform much worse on other types of
datasets.
• There are some set rules to decide which
strategy to use for particular types of missing
values, but beyond that, you should
experiment and check which model works
best for your dataset.
Noisy data
• Situations when data is not missing, but is
corrupted for some reasons.
• Data corruption may be a result of faulty data
collection instruments, data entry problems,
or technology limitations.
• E.g., floating-point values like 70.1 and 70.9 may both be stored as 70 if the storage system ignores decimal points.
• This may not be a big issue, but if the values are temperature measurements, then there is a concern.
– E.g., for humans a temperature of 99.4 is normal, but 99.8 implies fever. If the storage system fails to preserve this difference, it will fail to differentiate between a sick and a healthy person.
(Example output from a feature-scaling step: Age and Salary values rescaled to the [0, 1] range for four rows labelled HR, Legal, Marketing, and Management.)
• Binarization example: thresholding the continuous Age and Salary features into 0/1 values with sklearn's Binarizer.

import pandas as pd
from sklearn.preprocessing import Binarizer

data_set = pd.read_csv('C:\\Users\\dell\\Desktop\\Data_for_Feature_Scaling.csv')
data_set.head()

# Here the features - the Age and Salary columns - are taken using slicing to binarize their values
age = data_set.iloc[:, 1].values
salary = data_set.iloc[:, 2].values
print("\nOriginal age data values : \n", age)
print("\nOriginal salary data values : \n", salary)

# Binarizer expects a 2-D array
x = age.reshape(1, -1)
y = salary.reshape(1, -1)

# For age, let the threshold be 35; for salary, let the threshold be 61000
binarizer_1 = Binarizer(threshold=35)
binarizer_2 = Binarizer(threshold=61000)

# Transformed features
print("\nBinarized age : \n", binarizer_1.fit_transform(x))
print("\nBinarized salary : \n", binarizer_2.fit_transform(y))
3 new features are added, since the Country column contains 3 unique values.
So here each category is represented by a binary vector.
• We apply One-Hot Encoding when:
– The categorical feature is not ordinal (like the countries above)
– The number of categories is small, so one-hot encoding can be applied effectively
• We apply Label Encoding when:
– The categorical feature is ordinal (like Jr. kg, Sr.
kg, Primary school, high school)
– The number of categories is quite large as
one-hot encoding can lead to high memory
consumption
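• A minimal sketch of both encodings, assuming a hypothetical toy dataset (the column names and values below are illustrative, not from the slides):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Country':   ['India', 'USA', 'Japan', 'India'],                    # nominal feature
    'Education': ['Jr. kg', 'Sr. kg', 'Primary school', 'High school']  # ordinal feature
})

# One-Hot Encoding: each unique country becomes its own binary column
one_hot = pd.get_dummies(df['Country'], prefix='Country')

# Label Encoding: each category is mapped to an integer code
# (note: LabelEncoder assigns codes alphabetically; for a true ordinal
# feature a manual mapping that preserves the order is usually better)
df['Education_code'] = LabelEncoder().fit_transform(df['Education'])

print(one_hot)
print(df)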
Data Reduction
• Data reduction is a key process in which a reduced
representation of a dataset that produces the
same or similar analytical results is obtained.
• One example of a large dataset that could warrant
reduction is a data cube.
• Another example of data reduction is removal of
unnecessary attributes.
• Reduces the data by removing unimportant and
unwanted features from the data set.
• Data Reduction techniques are methods that
one can use to preserve data in a reduced or
condensed form but without any loss of
information or fidelity.
• Different data reduction strategies are
– Dimensionality reduction
– Data cube aggregation
– Numerosity reduction
• Data reduction allows us to categorize or extract the necessary information from a huge array of data, enabling us to make conscious decisions.
• “Data reduction is the transformation of
numerical or alphabetical digital information
derived empirically or experimentally into a
corrected, ordered, and simplified form.”
• In simple terms, it simply means large
amounts of data are cleaned, organized and
categorized based on prerequisite criteria to
help in driving business decisions.
• Data cube aggregation:
– Data Cube Aggregation is a multidimensional aggregation
that uses aggregation at various levels of a data cube to
represent the original data set, thus achieving data
reduction.
– The data cube is a much more efficient way of storing data, thus achieving data reduction besides offering faster aggregation operations.
– used to aggregate data in a simpler form
– Example: consider a data set gathered for analysis that includes your company's revenue for every three months.
• But for the analysis we need annual sales rather than the quarterly figures.
• So we can summarize the data so that the result gives total sales per year instead of per quarter (see the sketch below).
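• A minimal pandas sketch of this roll-up, using hypothetical quarterly revenue figures:

import pandas as pd

# Hypothetical quarterly revenue
sales = pd.DataFrame({'year':    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
                      'quarter': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4'],
                      'revenue': [120, 150, 130, 170, 160, 180, 175, 190]})

# Aggregate one level up the "cube": quarterly -> annual totals
annual = sales.groupby('year', as_index=False)['revenue'].sum()
print(annual)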
• Dimensionality Reduction:
– In contrast with the data cube aggregation method, where data reduction is driven by the analysis task, dimensionality reduction works with respect to the nature of the data itself.
– A dimension or a column in the data spreadsheet is
referred to as a “feature,” and the goal of the process
is to identify which features to remove or collapse to a
combined feature.
– This requires identifying redundancy in the given data
and/or creating composite dimensions or features
that could sufficiently represent a set of raw features.
– Strategies for reduction include sampling, clustering,
principal component analysis, etc.
• There are mainly two types of dimensionality
reduction methods.
• Both methods reduce the number of
dimensions but in different ways.
• It is very important to distinguish between
those two types of methods.
• One type of method only keeps the most important features in the dataset and removes the redundant features: feature selection.
– No transformation is applied to the set of features.
• The other type of method finds a combination of new features: feature extraction.
– An appropriate transformation is applied to the set of features.
– The new set of features contains different values instead of the original values.
• Feature selection methods:
– Extracts a subset of features from the original set of all
features of a dataset to obtain a smaller subset that
can be used for further analysis.
– These methods only keep the most important features
in the dataset and remove the redundant features.
– Step-wise Forward Selection –
• Selection begins with an empty set of attributes; at each step the best of the remaining original attributes is added to the set, based on its relevance.
• Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step 1: {X1}
Step 2: {X1, X2}
Step 3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
– Instead of eliminating features recursively, the
algorithm attempts to train the model on a single
feature in the dataset and calculates the
performance of the model (usually, accuracy score
for a classification model and RMSE for a
regression model).
– Then, the algorithm adds (selects) one feature
(variable) at a time, trains the model on those
features and calculates the performance scores.
– The algorithm repeats adding features until it detects a small (or no) change in the performance score of the model, and stops there (see the scikit-learn sketch below).
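• A minimal forward-selection sketch (assuming scikit-learn 0.24+ for SequentialFeatureSelector), with synthetic data standing in for X1...X6:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with 6 features, only some of which are informative
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=2, random_state=0)

# Forward selection: start from an empty set and add one feature at a time
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3, direction='forward')
sfs.fit(X, y)
print("Selected feature mask:", sfs.get_support())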
– Step-wise Backward Selection –
• This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute in the set.
• This method eliminates (removes) features from a dataset
through a recursive feature elimination (RFE) process.
• The algorithm first attempts to train the model on the initial
set of features in the dataset and calculates the
performance of the model (usually, accuracy score for a
classification model and RMSE for a regression model).
• Then, the algorithm drops one feature (variable) at a time,
trains the model on the remaining features and calculates
the performance scores.
• The algorithm repeats eliminating features until it detects a small (or no) change in the performance score of the model, and stops there (see the RFE sketch below).
• Suppose the data set has the following attributes, a few of which are redundant.
• Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
Step 1: {X1, X2, X3, X4, X5}
Step 2: {X1, X2, X3, X5}
Step 3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
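• A minimal backward-elimination sketch using scikit-learn's recursive feature elimination (RFE), again on synthetic data standing in for X1...X6:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=2, random_state=0)

# RFE starts from all features and recursively removes the weakest one
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = kept):", rfe.ranking_)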
– Combination of Forward and Backward Selection –
• It allows us to remove the worst attributes and select the best ones, saving time and making the process faster.
• Univariate selection
– Works by inspecting each feature and finding the best features based on statistical tests.
– Analyzes the capability of these features with respect to the response variable (see the SelectKBest sketch below).
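• A minimal univariate-selection sketch with scikit-learn's SelectKBest, which scores each feature independently with a statistical test (here the ANOVA F-test):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-scores w.r.t. the class label
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print("Scores per feature:", selector.scores_)
print("Reduced shape:", X_new.shape)   # (150, 2)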
• Decision tree induction
– Uses the concept of decision trees for feature
extraction.
– The nodes of the tree indicate a test applied on an attribute
– The branches indicate the outcomes of the test
– Helps in discarding irrelevant attributes, i.e., those attributes that are not part of the tree.
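• A minimal sketch of tree-based feature selection: fit a decision tree and treat features with zero importance (those never used in any split) as candidates for removal. This is an illustrative approach, not the slides' exact procedure:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Features with importance 0 never appear in a tree node -> irrelevant here
for name, imp in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")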
• Feature extraction methods:
– Used to reduce data with many features to a data
set with reduced features.
– Feature selection chooses the most relevant features from a feature set, whereas feature extraction creates a new, smaller set of features that contains the most useful information.
– Most common methods of feature extraction are
• Principal component analysis
• Linear discriminant analysis
– Principal component analysis
• This method involves the identification of a few
independent tuples with ‘n’ attributes that can represent
the entire data set
• PCA is a linear dimensionality reduction technique
(algorithm) that transforms a set of correlated variables (p)
into a smaller k (k<p) number of uncorrelated variables
called principal components while retaining as much of the
variation in the original dataset as possible
• Principal Component Analysis, or PCA, is a
dimensionality-reduction method that is often used to
reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one that
still contains most of the information in the large set.
• PCA reduces the number of variables of a data set while preserving as much information as possible (a small sketch follows).
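• A minimal PCA sketch with scikit-learn, projecting a 4-feature dataset onto 2 uncorrelated principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first, since PCA is sensitive to feature scales
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_pca.shape)   # (150, 2)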
• Linear Discriminant Analysis:
– LDA is typically used for multi-class classification. It
can also be used as a dimensionality reduction
technique.
– LDA best separates or discriminates (hence the name
LDA) training instances by their classes.
– The major difference between LDA and PCA is that
LDA finds a linear combination of input features that
optimizes class separability while PCA attempts to find
a set of uncorrelated components of maximum
variance in a dataset.
– Another key difference between the two is that PCA is
an unsupervised algorithm whereas LDA is a
supervised algorithm where it takes class labels into
account.
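• A minimal LDA sketch: unlike PCA, LDA uses the class labels y to find directions that best separate the classes:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# With 3 classes, at most 2 discriminant components are possible
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # the labels y are required (supervised)
print("Reduced shape:", X_lda.shape)   # (150, 2)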
• Advantages of Dimensionality Reduction
– A lower number of dimensions in data means less
training time and less computational resources and
increases the overall performance of machine
learning algorithms
• — Machine learning problems that involve many features
make training extremely slow. Most data points in
high-dimensional space are very close to the border of that
space. This is because there’s plenty of space in high
dimensions. In a high-dimensional dataset, most data points
are likely to be far away from each other. Therefore, the
algorithms cannot effectively and efficiently train on the
high-dimensional data.
• In machine learning, that kind of problem is referred to as
the curse of dimensionality
– Dimensionality reduction is extremely useful
for data visualization — When we reduce the
dimensionality of higher dimensional data into
two or three components, then the data can easily
be plotted on a 2D or 3D plot
– Dimensionality reduction takes care
of multicollinearity — In regression,
multicollinearity occurs when an independent
variable is highly correlated with one or more of
the other independent variables. Dimensionality
reduction takes advantage of this and combines
those highly correlated variables into a set of
uncorrelated variables. This will address the
problem of multicollinearity.
– Dimensionality reduction is very useful for factor
analysis — This is a useful approach to find latent
variables which are not directly measured in a single
variable but rather inferred from other variables in the
dataset. These latent variables are called factors.
– Dimensionality reduction removes noise in the
data — By keeping only the most important features
and removing the redundant features, dimensionality
reduction removes noise in the data. This will improve
the model accuracy.
– Dimensionality reduction can be used for image
compression — image compression is a technique
that minimizes the size in bytes of an image while
keeping as much of the quality of the image as
possible. The pixels which make the image can be
considered as dimensions (columns/variables) of the
image data.
• Numerosity Reduction
– It is a data reduction technique which replaces the original data by a smaller form of data representation.
– There are two techniques for numerosity
reduction- Parametric and Non-Parametric methods.
– Parametric methods
• For parametric methods, data is represented using some
model.
• The model is used to estimate the data, so that only
parameters of data are required to be stored, instead of
actual data.
• Regression and Log-Linear methods are used for creating
such models.
– Regression:
• Regression can be simple linear regression or multiple linear regression.
– When there is only a single independent attribute, such a regression model is called simple linear regression.
– If there are multiple independent attributes, then such regression models are called multiple linear regression.
– In linear regression, the data are modeled to fit a straight line. For example, a random variable y can be modeled as a linear function of another random variable x with the equation y = ax + b, where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively.
– In multiple linear regression, y is modeled as a linear function of two or more predictor (independent) variables. (A small sketch fitting y = ax + b follows.)
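• A minimal sketch of the parametric idea: fit y = ax + b to hypothetical data, then store only the two coefficients instead of the raw points:

import numpy as np

# Hypothetical data roughly following a line
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Fit y = a*x + b; only a and b need to be stored, not the raw data
a, b = np.polyfit(x, y, deg=1)
print(f"a (slope) = {a:.3f}, b (intercept) = {b:.3f}")

# Reconstruct approximate values from the model alone
y_est = a * x + b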
• Non-Parametric Methods –
– These methods store reduced representations of the data and include histograms, clustering, sampling, and data cube aggregation.
– Histograms:
• Histogram is the data representation in terms of frequency. It
uses binning to approximate data distribution and is a popular
form of data reduction.
– Clustering:
• Clustering divides the data into groups/clusters. This technique
partitions the whole data into different clusters. In data
reduction, the cluster representation of the data are used to
replace the actual data. It also helps to detect outliers in data.
– Sampling:
• Sampling can be used for data reduction because it allows a large
data set to be represented by a much smaller random data
sample (or subset).
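• A minimal sketch of two non-parametric reductions on a hypothetical numeric column: a histogram (binned frequencies) and a random sample:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=50, scale=10, size=10_000))  # hypothetical data

# Histogram: keep only bin edges and counts instead of 10,000 raw values
counts, edges = np.histogram(values, bins=10)
print(counts, edges)

# Sampling: represent the data by a 1% random subset
sample = values.sample(frac=0.01, random_state=0)
print(len(sample))   # 100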
Data Discretization
• Data discretization refers to a method of
converting a huge number of data values into
smaller ones so that the evaluation and
management of data become easy.
• Also defined as a process of converting
continuous data attribute values into a finite
set of intervals and associating with each
interval some specific data value.
• In other words, data discretization is a
method of converting attributes values of
continuous data into a finite set of intervals
with minimum data loss.
• Often, it is easier to understand continuous
data (such as weight) when divided and stored
into meaningful categories or groups.
– For example, we can divide a continuous variable, weight, and store it in the following groups: under 100 kg (light), between 140-160 kg (mid), and over 200 kg (heavy).
• Discretization is useful if we see no objective difference between variables falling under the same weight class.
– In our example, weights of 85 kg and 56 kg convey the same information (the object is light).
– Therefore, discretization helps make our data
easier to understand if it fits the problem
statement.
• There are two forms of data discretization
– Supervised discretization,
– Unsupervised discretization.
• Supervised discretization refers to a method in
which the class data is used.
• Unsupervised discretization refers to methods that do not use class information; such methods are further distinguished by the way the operation proceeds.
– That is, they work with either a top-down splitting strategy or a bottom-up merging strategy.
• Approaches to Discretization
– Unsupervised:
— Equal-Width
— Equal-Frequency
— K-Means
– Supervised:
— Decision Trees
• Unsupervised methods:
– Binning: Binning is a data smoothing technique and it helps to group a huge number of continuous values into a smaller number of bins.
– For example, if we have data about a group of students and we want to arrange their marks into a smaller number of mark intervals, we can make bins of grades:
– One bin for grade A, one for grade B, one for C, one for D, and one for grade F.
– Equal-Width Discretization
• Separates all possible values into N bins, each having the same width. Formula for the interval width:
width = (maximum value - minimum value) / N,
where N is the number of bins or intervals.
– Equal-Frequency Discretization
• Separating all possible values into ‘N’ number of bins,
each having the same amount of observations.
– K-Means Discretization
• We apply K-Means clustering to the continuous variable, thus dividing it into discrete groups or clusters (a sketch of all three unsupervised approaches follows).
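• A minimal sketch of the three unsupervised approaches on a hypothetical "weight" column (pd.cut for equal width, pd.qcut for equal frequency, and KBinsDiscretizer with the k-means strategy):

import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
weight = pd.Series(rng.uniform(40, 220, size=500))   # hypothetical continuous values

equal_width = pd.cut(weight, bins=4)        # 4 bins of equal width
equal_freq  = pd.qcut(weight, q=4)          # 4 bins with (roughly) equal counts

# K-means based binning: bin edges follow the cluster structure of the values
kmeans_bins = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans')
weight_kmeans = kmeans_bins.fit_transform(weight.to_frame())

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(kmeans_bins.bin_edges_[0])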
• Decision trees
– Decision Trees (DTs) are a non-parametric
supervised learning method used for data
discretization.
– The goal is to create a model that predicts the
value of a target variable by learning simple
decision rules inferred from the data features.
– A Decision tree is a flowchart like tree structure,
where each internal node denotes a test on an
attribute, each branch represents an outcome of
the test, and each leaf node (terminal node) holds
a class label.
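• A minimal sketch of supervised, decision-tree-based discretization: fit a shallow tree of a single continuous feature against the class labels, then use the leaf each sample falls into as its discrete bin (an illustrative approach under these assumptions, not the slides' exact procedure):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
petal_length = X[:, 2].reshape(-1, 1)   # one continuous feature

# A shallow tree learns a few class-aware split thresholds
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(petal_length, y)

# Each leaf id acts as a discrete bin for the continuous values
bins = tree.apply(petal_length)
print(sorted(set(bins)))   # the distinct leaf ids / bins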
• Data discretization and concept hierarchy generation
– A concept hierarchy represents a sequence of mappings from a set of more general concepts to specialized concepts.
– Similarly, it can map from low-level concepts to higher-level concepts. In other words, we can speak of top-down mapping and bottom-up mapping.
– Example of a concept hierarchy for the dimension location.
• Each city can be mapped with the country to which the given city
belongs. For example, Delhi can be mapped to India and India can
be mapped to Asia.
– Top-down mapping
• Top-down mapping starts from the top with general concepts
and moves to the bottom to the specialized concepts.
– Bottom-up mapping
• Bottom-up mapping starts from the bottom with specialized concepts and moves to the top to the generalized concepts.