21BCAD5C01 IDA Module 2 Notes
Module 2
Syllabus of the Module
Data Preparation (12 Hours)
Data Collection Methods: Primary, Secondary data; Pre-Processing: Data Cleaning, Data Integration
and Transformation, Data Discretization; Dimensionality Reduction, PCA, Feature Engineering and
Selection.
In organizations, while data warehouses and data marts are home to pre-processed data, data lakes
contain data in its natural or raw format.
But the possibility exists that your data still resides in Excel files on the desktop of a domain
expert.
Finding data even within your own company can sometimes be a challenge.
As companies grow, their data becomes scattered around many places. Knowledge of the data
may be dispersed as people change positions and leave the company.
Getting access to data is another difficult task.
Organizations understand the value and sensitivity of data and often have policies in place so
everyone has access to what they need and nothing more.
These policies translate into physical and digital barriers called Chinese walls.
Quality Factors
❑ Accuracy: Data must be correct. Outdated data, typos, and redundancies can affect a dataset’s
accuracy.
❑ Consistency: The data should have no contradictions.
❑ Inconsistent data may give you different answers to the same question.
❑ Completeness: The dataset shouldn’t have missing values or empty fields.
❑ This characteristic allows data scientists to perform accurate analyses as they have access to a
complete picture of the situation the data describes.
❑ Validity: A dataset is considered valid if the data samples appear in the correct format, are within a
specified range, and are of the right type.
❑ Invalid datasets are hard to organize and analyze.
❑ Timeliness: Data should be collected as soon as the event it represents occurs.
❑ As time passes, a dataset becomes less accurate and less useful because it no longer represents the
current reality. Therefore, the topicality and relevance of data are critical data quality
characteristics.
Noisy Data
Noise includes
❑ data having incorrect attribute values,
❑ duplicate or semi-duplicates of data points,
❑ data segments of no value/attributes for a specific research process, or
❑ unwanted/irrelevant attributes.
An outlier can be treated as noise, although some consider it a valid data point.
❑ For numeric values, you can use a scatter plot or box plot to identify outliers.
Outliers
An outlier is a data point that is noticeably different from the rest. It is an unusual data point that does
not follow the general trend of the rest of the data. It can be an extreme value.
Outliers can arise for a variety of reasons, such as errors in measurement or bad data collection (e.g.
Human error, Measuring Instrument error, Experiment design error etc.).
Machine learning algorithms are sensitive to the range and distribution of attribute values.
Data outliers can spoil and mislead the training process resulting in less accurate models and poorer
results.
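A minimal sketch of both checks in Python, assuming pandas and matplotlib are installed; the column name price and the sample values are purely illustrative:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Illustrative data: one value (95) is far from the rest
    df = pd.DataFrame({"price": [12, 14, 15, 13, 14, 95, 16, 13]})

    # Visual check: a box plot marks points beyond the whiskers as outliers
    df.boxplot(column="price")
    plt.show()

    # Numeric check: the 1.5 * IQR rule
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
    print(outliers)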
❑ Data Integration
Data integration is the process of combining/ merging data from different sources.
The data sources might be heterogenous in nature.
These sources may include multiple data cubes (A data cube in a data warehouse is a
multidimensional structure used to store data), databases, or flat files.
Data integration results in a coherent/ well-organized data store and provides a unified view of the
data.
The metadata of an attribute includes its name, what it means in the particular scenario, its data
type, the range of values it can accept, and the rules the attribute follows for null, blank, or zero
values.
Analyzing this metadata information will prevent error in schema integration.
Entity identification problem:
The real-world entities from multiple sources need to be matched correctly.
For example, suppose we have customer data from two different data sources. An entity from one data
source has customer_id and the entity from the other data source has customer_number.
Here, the data analyst or the integration tool needs to understand that these two entities refer to
the same attribute.
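As an illustrative sketch (assuming pandas; the table names crm and billing and their columns are hypothetical), the analyst can tell the merge explicitly that the two keys refer to the same entity:

    import pandas as pd

    # Hypothetical sources: the same customer key is named differently in each
    crm = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})
    billing = pd.DataFrame({"customer_number": [1, 2], "balance": [250.0, 90.5]})

    # Map customer_id and customer_number onto each other during integration
    merged = pd.merge(crm, billing, left_on="customer_id", right_on="customer_number")
    print(merged)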
❑ Redundancy:
Redundancy is one of the big issues during data integration.
Redundant data is unimportant data or data that is no longer needed.
An attribute may be redundant if it can be derived or obtained from another attribute or set of
attributes.
For example, if one data set has the customer's age and the other data set has the customer's date of
birth, then age would be a redundant attribute, as it can be derived from the date of birth.
❖ Some redundancies can be detected by correlation analysis: the attributes are analyzed to detect their
interdependence, thereby revealing the correlation between them (collinearity).
❑ Collinearity:
Collinearity refers to the situation in which two or more predictor variables are closely related
to one another.
The presence of collinearity can pose problems, since it can be difficult to separate out the
individual effects of collinear variables on the response variable.
When faced with the problem of collinearity, there are two simple solutions.
The first is to drop one of the problematic variables from the data. This can usually be done
without much compromise to the prediction, since the presence of collinearity implies that the
information that this variable provides about the response is redundant in the presence of the
other variables.
The second solution is to combine the collinear variables into a single predictor. For
instance, we might take the average of standardized versions of these variables in order to
create a new variable.
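A small sketch of both solutions, assuming pandas; the columns x1, x2, and y are made-up data in which x2 is roughly twice x1:

    import pandas as pd

    df = pd.DataFrame({
        "x1": [1, 2, 3, 4, 5],
        "x2": [2.1, 3.9, 6.2, 8.1, 9.8],   # nearly collinear with x1
        "y":  [1.2, 1.9, 3.2, 3.8, 5.1],
    })

    # Correlation analysis: a Pearson correlation close to 1 flags collinearity
    print(df.corr())

    # Solution 1: drop one of the problematic variables
    reduced = df.drop(columns=["x2"])

    # Solution 2: combine them, e.g. the average of their standardized versions
    z = (df[["x1", "x2"]] - df[["x1", "x2"]].mean()) / df[["x1", "x2"]].std()
    df["x_combined"] = z.mean(axis=1)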
❑ Data Value Conflict:
Data conflict means the data merged from the different sources do not match.
Attribute values from different sources may differ for the same real-world entity. The
difference may arise because they are represented differently in the different data sets.
For example, the price of a hotel room may be represented in different currencies in different
cities. Likewise, the date format may differ like “MM/DD/YYYY” or “DD/MM/YYYY”.
Detection and resolution of data value conflicts need to be carried out carefully.
❑ Tuple Duplication:
Along with redundancies, data integration also has to deal with duplicate tuples.
Duplicate tuples may come in the resultant data if the denormalized table has been used as a
source for data integration. Normalization is used to remove redundant data from the
database and to store non-redundant and consistent data into it.
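A one-line pandas sketch (the customer table below is hypothetical) for removing duplicate tuples after integration:

    import pandas as pd

    # Duplicate tuples often appear when a denormalized table feeds the integration
    df = pd.DataFrame({"cust": ["A", "A", "B"], "city": ["Pune", "Pune", "Agra"]})
    deduped = df.drop_duplicates()
    print(deduped)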
❑ Data Transformation
Data transformation is the process of converting data from one format to another.
In essence, it involves methods for transforming data into formats from which machine learning
algorithms can learn efficiently.
Data transformation increases the efficiency of analytic processes, and it enables businesses to make
better data-driven decisions.
Data transformation can change the structure, format, or values of data through operations such as
aggregation, smoothing, normalization, discretization, and generalization.
❑ Data Aggregation
➢ Data is collected and presented in a view within the context of various time intervals:
Reporting period: The period over which data is collected for presentation. For example, Daily,
Weekly, Monthly, Quarterly, and Yearly.
Polling period: The time duration that determines how often resources are sampled for data. For
example, a group of resources might be polled every 5 minutes, meaning that a data point for each
resource is generated every 5 minutes.
❖ Example: weather forecasting, where sensor data for temperature and humidity is polled every 5 minutes,
aggregated to its minimum and maximum, and reported daily; or the daily average sales of a supermarket
store aggregated for monthly forecasting.
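A minimal sketch of the sensor example, assuming pandas and NumPy; the readings are random stand-ins for real temperature data:

    import numpy as np
    import pandas as pd

    # Hypothetical temperature readings polled every 5 minutes for two days
    idx = pd.date_range("2024-01-01", periods=576, freq="5min")
    temp = pd.Series(25 + np.random.randn(576), index=idx)

    # Aggregate to a daily reporting period: minimum and maximum per day
    daily = temp.resample("D").agg(["min", "max"])
    print(daily)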
❑ Data Smoothing
Data smoothing is performed to remove noise from a data set. This allows important patterns to more
clearly stand out.
When data collected over time displays random variation, smoothing techniques can be used to reduce
or cancel the effect of these variations.
When properly applied, these techniques smooth out the random variation in the time series data to
reveal underlying trends.
For example, Data smoothing can be used to help predict trends, such as those found in
securities prices, as well as in economic analysis.
Data analytics tools typically feature four different smoothing techniques: Exponential, Moving Average,
Double Exponential, and Holt-Winters.
Exponential and Moving Average are relatively simple smoothing techniques and should not
be performed on data sets involving seasonality.
Double Exponential and Holt-Winters are more advanced techniques that can be used on data
sets involving seasonality.
Seasonality is a characteristic of a time series in which the data experiences regular and predictable
changes that recur every calendar year. Any predictable fluctuation or pattern that recurs or repeats
over a one-year period is said to be seasonal.
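A short sketch of the two simpler techniques, assuming pandas and NumPy (Double Exponential and Holt-Winters need a dedicated time-series library and are not shown); the series below is synthetic:

    import numpy as np
    import pandas as pd

    # Synthetic noisy series: a smooth trend plus random variation
    s = pd.Series(np.sin(np.linspace(0, 6, 120)) + np.random.normal(0, 0.3, 120))

    sma = s.rolling(window=7).mean()   # moving average over a 7-point window
    ema = s.ewm(alpha=0.3).mean()      # exponential smoothing with smoothing factor 0.3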
❑ Discretization
Numerical input variables may have a highly skewed or non-standard distribution. This could be
caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.
Many machine learning algorithms prefer or perform better when numerical input variables have a
standard probability distribution.
The discretization transform provides an automatic way to change a numeric input variable to have a
different data distribution, which in turn can be used as input to a predictive model.
Discretization is the process through which we can transform continuous variables into a discrete
form.
We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired
variable.
For example, we can divide a continuous variable, weight, into the following groups: under 100 gm
(light), 100–200 gm (mid), and over 200 gm (heavy).
Methods: Equal-width, Equal-frequency, clustered/ k-means
❑ Equal-Width or Uniform Discretization:
Each bin has the same width in the span of possible values for the variable.
Separating all possible values into ‘N’ number of bins, each having the same width. Formula for
interval width:
Width = (maximum value - minimum value) / N, where N is the number of bins or intervals.
Example: 1-10: child; 11-20: teenager; 21-30: young; 31-40: middle-aged; 41-50: adults; 51-
60: aged; 61-70: senior citizens;
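A small sketch of equal-width binning with pandas; the ages are made up, and the bin edges mirror the example above:

    import pandas as pd

    ages = pd.Series([3, 8, 15, 22, 37, 45, 58, 63])

    # Seven fixed-width bins of width 10, labelled as in the example
    labels = ["child", "teenager", "young", "middle-aged", "adults", "aged", "senior citizens"]
    groups = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 60, 70], labels=labels)
    print(groups)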
❑ Equal-Frequency or Quantile Discretization:
A quantile discretization transform will attempt to split the observations for each input variable into k
groups, where the number of observations assigned to each group is approximately equal.
Each bin has the same number of values, split based on percentiles.
Separating all possible values into ‘N’ bins, each having the same number of observations.
Intervals may correspond to quantile values.
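A minimal sketch using pandas' quantile-based binning on made-up values:

    import pandas as pd

    values = pd.Series([4, 7, 9, 12, 15, 18, 21, 25, 30, 42, 55, 61])

    # Four bins with approximately the same number of observations each
    quartile_bins = pd.qcut(values, q=4)
    print(quartile_bins.value_counts())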
❑ Clustered or K-Means Discretization:
Clusters are identified and observations / data points are assigned to each group.
We apply K-Means clustering to the continuous variable, thus dividing it into discrete groups or
clusters.
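A brief sketch, assuming scikit-learn is available; the single continuous column is illustrative:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    x = np.array([[1.2], [1.5], [1.7], [5.0], [5.3], [9.8], [10.1]])

    # Bin edges are placed by running k-means on the continuous variable
    kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
    x_binned = kbins.fit_transform(x)
    print(x_binned.ravel())   # cluster/bin index for each value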
❑ Normalization
Normalization is a technique often applied as part of data preparation for machine learning. The goal of
normalization is to change the values of numeric columns in the dataset to a common scale, without
distorting differences in the ranges of values. Not every dataset requires normalization for machine
learning; it is needed only when features have different ranges. Normalization gives equal
weight/importance to each variable so that no single variable steers model performance in one direction
just because it takes larger numeric values. In such cases, normalization can substantially improve
model accuracy.
Data normalization is a method to convert the source data into another format for effective processing.
The purpose of normalization is to transform data in a way that they are either dimensionless and/or
have similar distributions.
It offers several advantages, such as making data mining /machine learning algorithms more effective,
faster data extraction, etc.
This process of normalization is known by other names such as standardization, feature scaling etc.
Normalization is an essential step in data pre-processing in any machine learning application and
model fitting.
For example, consider a data set containing two features, age (x1) and income (x2), where age ranges
from 0–100 while income ranges from 20,000–500,000. Income is roughly a thousand times larger than
age, so these two features are on very different scales. When we do further analysis, such as multivariate
linear regression, the attribute income will intrinsically influence the result more due to its larger
values. But this doesn’t necessarily mean it is more important as a predictor.
Methods of Normalization:
Rescaling: also known as “min-max normalization”; the values are shifted and rescaled so that they
end up ranging between 0 and 1. It is the simplest of all methods and is calculated as:
x' = (x - min(x)) / (max(x) - min(x))
Mean normalization: This method uses the mean of the observations in the transformation process:
x' = (x - mean(x)) / (max(x) - min(x))
Z-score normalization: Also known as standardization, this technique uses the Z-score or “standard score”.
It is widely used in machine learning algorithms such as SVM and logistic regression:
z = (x - μ) / σ
Here, z is the standard score, μ is the population mean, and σ is the population standard deviation.
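A compact sketch applying all three methods with pandas to the hypothetical age/income data from the example above:

    import pandas as pd

    df = pd.DataFrame({"age": [23, 35, 47, 59], "income": [25000, 48000, 160000, 420000]})

    # Rescaling (min-max normalization) to the range [0, 1]
    minmax = (df - df.min()) / (df.max() - df.min())

    # Mean normalization
    mean_norm = (df - df.mean()) / (df.max() - df.min())

    # Z-score normalization (pandas uses the sample standard deviation here)
    zscore = (df - df.mean()) / df.std()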
❑ Generalization
Generalization is used to convert low-level data attributes to high-level data attributes by the use of
concept hierarchy.
Data generalization is the process of creating a more broad categorization of data in a database,
essentially ‘zooming out’ from the data to create a more general picture of trends or insights it
provides.
For example, If you have a data set that includes the ages of a group of people, the data generalization
process may look like this:
Original Data: Ages: 15, 17, 20, 26, 28, 31, 33, 37, 42, 42, 46, 48, 49, 54, 57, 57, 58, 59
Generalized Data: Ages: < 40: young; ≥ 40: old
Here, ages in the numerical form of raw data (e.g., 20, 54) are converted into categorical values
(young, old).
As another example, (Customer Name, Address Detail) can be generalized to (Customer Name, City).
Data generalization replaces a specific data value with one that is less precise, which may seem
counterproductive, but actually is a widely applicable and used technique in data mining, analysis, and
secure storage.
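A small pandas sketch of climbing the concept hierarchy (the customer records are hypothetical): full addresses are generalized to cities, and exact ages to age groups:

    import pandas as pd

    customers = pd.DataFrame({
        "name": ["Meera", "John"],
        "address": ["12 MG Road, Bengaluru", "4 Park St, Kolkata"],
        "age": [24, 54],
    })

    # Address Detail -> City, exact age -> coarse age group
    customers["city"] = customers["address"].str.split(",").str[-1].str.strip()
    customers["age_group"] = customers["age"].apply(lambda a: "young" if a < 40 else "old")
    print(customers[["name", "city", "age_group"]])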
Dimensionality Reduction
Usually, datasets contain a large number of attributes (i.e., features) and hundreds of thousands of
instances (i.e., data points), making the dataset voluminous.
High dimensional datasets pose several challenges in analysis projects.
In particular, many machine learning algorithms cannot be applied directly on a high dimensional
dataset.
Large amounts of data might sometimes produce worse performances in data analytics applications.
Dimensionality reduction techniques help us reduce high dimensional data into a lower-dimensional
format that is easier to analyse and visualize.
Low Variance Filter
Consider a variable in a dataset where all the values are the same, say 1, or vary very little.
When a dataset has constant variables, or variables whose values vary very little, they cannot help
improve the model’s performance. We need to identify these low-variance variables and eliminate them
when their variance falls below a threshold value.
However, one thing you must remember about the low variance filter method is that variance is range
dependent. Thus, normalization is a must before applying this dimensionality reduction technique.
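A sketch of the low variance filter with scikit-learn, on a made-up frame with one constant column; the data is normalized first because variance is range dependent:

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    df = pd.DataFrame({
        "constant": [7, 7, 7, 7, 7],
        "spread":   [1.0, 4.0, 2.5, 9.0, 6.0],
        "narrow":   [100.0, 100.2, 100.1, 100.0, 100.3],
    })

    # Min-max normalize first; a constant column has zero range, so fill the NaNs with 0
    scaled = ((df - df.min()) / (df.max() - df.min())).fillna(0.0)

    selector = VarianceThreshold(threshold=0.01)   # drop features with variance <= 0.01
    selector.fit(scaled)
    print(df.columns[selector.get_support()])      # columns that survive the filter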
Random Forest
One approach to dimensionality reduction is to generate a large and carefully constructed set of trees
against a target attribute and then use each attribute’s usage statistics to find the most informative
subset of features.
Specifically, we can generate a large set (e.g., 2000) of very shallow trees (2 levels deep), with each tree
trained on a small subset (e.g., 3) of the total attributes.
If an attribute is often selected as the best split, it is most likely an informative feature to retain.
A score calculated on the attribute usage statistics in the random forest tells us ‒ relative to the other
attributes ‒ which are the most predictive attributes.
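A sketch of the idea with scikit-learn; the data is synthetic, with only two of the eight features actually related to the target:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))                  # 8 candidate features
    y = (X[:, 0] + 2 * X[:, 3] > 0).astype(int)    # only features 0 and 3 matter

    # Many very shallow trees, each considering only a few features per split
    forest = RandomForestClassifier(n_estimators=2000, max_depth=2,
                                    max_features=3, random_state=0)
    forest.fit(X, y)

    # Usage-based importance scores: higher means more informative to retain
    for i, score in enumerate(forest.feature_importances_):
        print(f"feature {i}: {score:.3f}")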
Principal Component Analysis (PCA)
✓ Principal components are extracted in such a way that the first principal component captures the
maximum information / variance in the dataset. The larger the variability captured, the larger the
information captured by the component.
✓ The second principal component is computed to capture the remaining variance in the dataset and
is uncorrelated with (orthogonal to) the first principal component.
✓ The third principal component is computed to capture the variance not captured by the first two
principal components, and so on.
✓ For example, 10-dimensional data gives you 10 principal components; PCA tries to put the maximum
possible information in the first component, then the maximum remaining information in the second,
and so on, as illustrated in the sketch below.
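A minimal sketch with scikit-learn on synthetic 10-dimensional data (in practice the features are usually standardized before PCA):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))     # synthetic 10-dimensional data

    pca = PCA(n_components=3)          # keep only the first 3 principal components
    X_reduced = pca.fit_transform(X)

    # Fraction of total variance captured by each component, in decreasing order
    print(pca.explained_variance_ratio_)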