
Data Preprocessing

Chapter 3
Introduction
• Real-world data is often dirty and contains data pathologies.
– Formatting issues (inconsistent capitalization, extraneous whitespace, etc.)
– Pathologies in the actual data content (duplicate values, major outliers, NULL values)
• Often requires some detective work to figure out
what these issues mean in a particular situation
and hence how they should be addressed.
• Data needs to be cleaned up before it can be used for the desired purpose; this cleanup is data pre-processing.
• Factors that make data dirty :
– Incomplete. When some of the attribute values are
lacking, certain attributes of interest are lacking, or
attributes contain only aggregate data.
– Noisy. When data contains errors or outliers. For
example, some of the data points in a dataset may
contain extreme values that can severely affect the
dataset’s range.
– Inconsistent. Data contains discrepancies in codes or
names. For example, if the “Name” column for
registration records of employees contains values
other than alphabetical letters, or if records do not
start with a capital letter, discrepancies are present.
• The term dirty, as used here, refers to the syntactical, formatting, and structural issues with the data, and ignores the other ways the data could be “muddled up” (e.g., bias in the data).
Data pre-processing
Data objects & attribute types
• Attribute types:
– Qualitative: describes qualities or characteristics of the data.
• Descriptive and cannot be measured
• Consists of words, pictures, symbols
• Types: Nominal, Ordinal, Binary
– Quantitative: can be counted or measured and can be expressed using numbers.
• Types: Numeric, Discrete, Continuous
Data Quality
Why pre-process data
• Whenever there is a large organization, a
complicated data collection process, or several
datasets that have been merged, issues tend to
pile up.
• They are rarely documented and often only come
to light when some poor data scientist is tasked
with analyzing them.
• One of the most embarrassing things that can
happen in data science is to have to retract results
that you’ve presented because you realize that
you processed the data incorrectly
• Data quality is a measure of data based on the following factors:
– Accuracy: presence of inaccurate or noisy data with errors, e.g., due to faulty instruments or errors in collection.
– Completeness: incomplete data due to missing values, missing attributes, only aggregate data, etc.
– Consistency: inconsistency due to discrepancies in data values, e.g., data duplication leading to inconsistency.
– Timeliness: availability of information when needed.
– Believability: refers to the trust users place in the data.
– Interpretability: how easily the data can be understood (based on how accurately the attributes are described in the data set).
Data Munging
• Also known as data manipulation / data
wrangling
• It is the process of collecting and transforming
raw data into another format for better
understanding , and analysis.
• Often the data is not in a format that is easy
to work with.
– Eg: data stored in a way that is hard to process.
• Hence the need to convert it to something
more suitable for a computer to understand.
• All methods manipulate/wrangle/mung data
to turn it into something that is more
convenient or desirable.
• Eg : Consider the following text recipe.
“Add two diced tomatoes, three cloves of
garlic, and a pinch of salt in the mix.”
Ingredient   Quantity   Unit/size
Tomato       2          Diced
Garlic       3          Cloves
Salt         1          Pinch
• This table conveys the same information as
the text, but it is more “analysis friendly.”
• There is no single systematic method for wrangling ill-formatted data into something more manageable.
Data Cleaning
• Many different ways to clean dirty data.
• Handling missing data
• Handling Noisy data
• Handling formatting issues
Handling missing values
• Many real-world datasets may contain missing values
for various reasons.
• They are often encoded as NaNs, blanks or any other
placeholders.
• Training a model with a dataset that has a lot of missing values can drastically impact the machine learning model’s quality.
• One way to handle this problem is to get rid of the observations that have missing data. However, you will risk losing data points with valuable information. A better strategy is to impute the missing values.
• Handling missing data:
– sometimes data may be in right format, but some of
the values are missing
– Eg: Consider an employee table with employee data,
in which some of the home phone numbers are
absent.
• People may not have home phones, their mobile phone
may be the primary or only phone.
• Another eg , consider a log of transactions from the past
year. Group the transactions, by customer, and add up the
size for each customer, thus giving one row per customer.
– If a customer didn’t have any transactions that year, then his record will be missing from the aggregate. To solve this, we join the aggregate data with some known set of all customers and fill in appropriate values for the customers who were missing.
• Missing data can arise, when data was never
gathered in the first place for some entities
• Data may be missing due to problems with the
process of collecting data, or an equipment
malfunction.
• Some data may not have been considered
important at the time of collection
• Eg: the data collection was limited to a certain
area or region, hence the area code was not
taken that time, for a phone number.
– But now when we decide to expand beyond that
city/region, then phone numbers will have area
code too
• Data may get lost due to system or human
error while storing or transferring the data.
• Thus some strategy needed to handle missing
data
• Methods for handling missing data (see the sketch after this list):
– Replace missing values manually: time consuming and needs expertise.
– Replace missing values with zeros
• Python function fillna(), e.g., df.fillna(0)
– Drop rows with missing values: suitable for large data sets where multiple values are missing within a tuple.
• df.dropna()
– Replace missing values with the mean/median/mode
• median = df['C1'].median()
• df['C1'].fillna(median, inplace=True)
– Replace missing values with the previous/next row value
• df.fillna(method="ffill") fills with the previous row value
• df.fillna(method="bfill") fills with the next row value
– Use interpolation for filling missing values
• Interpolation can be used to construct new values within the range of a discrete set of known data values.
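• The pandas calls listed above can be combined into one short script. The following is a minimal sketch, assuming a hypothetical DataFrame with a single numeric column 'C1'; the column name and values are purely illustrative.

import pandas as pd
import numpy as np

# Hypothetical data with missing values (NaN) in column 'C1'
df = pd.DataFrame({'C1': [10.0, np.nan, 30.0, np.nan, 50.0]})

df_zero   = df.fillna(0)                    # replace missing values with zeros
df_drop   = df.dropna()                     # drop rows that contain missing values
df_median = df.fillna(df['C1'].median())    # replace with the column median
df_ffill  = df.fillna(method="ffill")       # fill with the previous row's value
df_bfill  = df.fillna(method="bfill")       # fill with the next row's value
df_interp = df.interpolate()                # linear interpolation between known values

• Each call returns a new DataFrame; which strategy is appropriate depends on the dataset, as discussed next.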
• In conclusion, there is no perfect way to
compensate for the missing values in a
dataset.
• Each strategy can perform better for certain
datasets and missing data types but may
perform much worse on other types of
datasets.
• There are some set rules to decide which
strategy to use for particular types of missing
values, but beyond that, you should
experiment and check which model works
best for your dataset.
Noisy data
• Situations when data is not missing, but is
corrupted for some reasons.
• Data corruption may be a result of faulty data
collection instruments, data entry problems,
or technology limitations.
• E.g., floating point values like 70.1 and 70.9 are both stored as 70, since the storage system ignores decimal points.
• This may not be a big issue, but if the values are temperature measures, then there is a concern.
– E.g., for humans a temperature of 99.4 is normal, but 99.8 implies fever. If the storage system fails to note this difference, then the system will fail to differentiate between a sick and a healthy person.

• Similar to missing values, there is no single technique to take care of noisy data.
• Some methods to handle noisy data are
– Identify and remove outliers
• E.g., all students score between 70 and 90, but one student scores 12.
– Resolve inconsistencies in data.
• E.g., decide whether customer names should have only the first letter capitalized or all letters capitalized, and apply one convention consistently.
• Reasons for noisy data
– Duplicate entries
– Multiple entries for a single entity
– Null values
– Huge outliers
– Out-of-date data
– Artificial entries
– Irregular spacing
Formatting issues
• Irregular formatting between different
tables/columns
– Based on how data was stored in the first place
– Happens when joinable/groupable keys are irregularly
formatted between different data sets.
• Extra whitespace
– Random whitespace poses problems during analysis, e.g., while joining on the identifiers “EmpNo” and “EmpNo “.
– Whitespace is especially harmful since, when we actually print the data to the screen to check it, the whitespace may be impossible to spot.
– In Python, every string object has a strip() method
that removes whitespace from the front and end
of a string.
– The methods lstrip() and rstrip() will remove
whitespace only from the front and end,
respectively.
– If we pass a character as an argument into the
strip functions, only that character will be
stripped.
– Eg: "ABC\t".strip() returns 'ABC'
– " ABC\t".lstrip() returns 'ABC\t'
– " ABC\t".rstrip() returns ' ABC'
– "ABC".strip("C") returns 'AB'
• Irregular Capitalization
– Python provides lower() and upper() methods, which return a copy of the original string with all letters converted to lowercase or uppercase, respectively (see the sketch below).
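• When cleaning a whole column rather than a single string, the same string methods are available through pandas. The following is a minimal sketch, assuming a hypothetical join-key column 'EmpNo' with stray whitespace and mixed case.

import pandas as pd

# Hypothetical employee table with messy join keys
df = pd.DataFrame({'EmpNo': [' e101 ', 'E102\t', 'e103']})

# Strip leading/trailing whitespace and normalize case before joining
df['EmpNo'] = df['EmpNo'].str.strip().str.upper()
print(df['EmpNo'].tolist())   # ['E101', 'E102', 'E103']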
• Inconsistent delimiters
– Normally , every data set will have only one
delimiter.
– Sometimes, when different tables use different
ones, then the aggregated data set may have
multiple delimiters.
– Most common delimiters are commas, Tabs, and
Pipes
• Irregular NULL format:
– There are a number of different ways that missing
entries are encoded into CSV files, and they
should all be interpreted as NULLs when the data
is read in.
– Some popular examples are the empty string “”,
“NA,” and “NULL.”
– Occasionally, we will see others such as
“unavailable” or “unknown” as well.
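• When reading a CSV with pandas, such placeholders can be declared up front so they are loaded as NULLs. A minimal sketch, assuming a hypothetical file data.csv; the empty string, "NA", and "NULL" are already in pandas' default NA list, and the extra strings below are added to it.

import pandas as pd

# Treat additional placeholder strings as missing values (NaN) on read
df = pd.read_csv('data.csv', na_values=['unavailable', 'unknown'])
print(df.isna().sum())   # count of NULLs detected per column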
• Invalid characters:
– Data files can randomly contain invalid bytes in
the middle of them.
– Programs may raise exceptions, when invalid
bytes are encountered. Hence necessary to filter
them out from the data set.
– Python’s bytes objects provide a decode() method for this.
– decode() takes two arguments: the first is the text encoding the bytes should be converted to; the second is the action to take if such a conversion is not possible. Passing "ignore" causes invalid characters to simply be dropped.
• s = b"abc\xFF"
print(s)                      # b'abc\xff'  (the last byte is not a valid ASCII letter)
s.decode("ascii", "ignore")
'abc'
• Incompatible Datetimes:
– Datetimes are one of the most frequently mangled
types of data field
– Some common date formats seen in a data set are
• August 12, 2015
• AUG 12, ’15
• 2015-08-12
– Most of the time we have two different ways of
expressing the same information, and a perfect
translation is possible from the one to the other.
– But with dates and times, the information content
itself can be different.
• For example, we might have just the date, or there could also be a time associated with it.
• If there is a time, does it go out to the minute, hour, second,
or something else? What about time zones?
– Most scripting languages include some kind of
built-in datetime data structure, which lets us
specify any of these different parameters (and
uses reasonable defaults if we don’t specify).
– The easiest way to parse dates in Python is with a package called dateutil.
– It takes in a string, uses some reasonable rules to determine how that string encodes dates and times, and converts it into the datetime data type, as in the sketch below.
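• A minimal sketch of dateutil on formats like those listed above (the package is installed as python-dateutil):

from dateutil import parser

# Different strings encoding the same date parse to the same datetime object
print(parser.parse('August 12, 2015'))    # 2015-08-12 00:00:00
print(parser.parse('2015-08-12'))         # 2015-08-12 00:00:00
print(parser.parse('2015-08-12 14:30'))   # 2015-08-12 14:30:00 (time included)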
Data Integration
• To be as efficient and effective for various data
analyses as possible, data from various sources
commonly needs to be integrated.
• The following steps describe how to integrate
multiple databases or files.
– Combine data from multiple sources into a coherent
storage place (e.g., a single file or a database).
– Engage in schema integration, or the combining of
metadata from different sources.
– Detect and resolve data value conflicts.
• A conflict may arise, for instance, from the presence of different attributes and values from various sources for the same real-world entity.
• Reasons for this conflict could be different
representations or different scales; for example, metric
vs. British units.
– Address redundant data in data integration.
Redundant data is commonly generated in the
process of integrating multiple databases.
• The same attribute may have different names in
different databases.
• One attribute may be a “derived” attribute in another
table; for example, annual revenue.
• Correlation analysis may detect instances of redundant data (see the sketch below).
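• A minimal sketch of correlation-based redundancy detection with pandas; the column names and the 0.9 threshold are illustrative assumptions, not a fixed rule.

import pandas as pd

# Hypothetical integrated table where 'revenue_usd' is derived from 'revenue_inr'
df = pd.DataFrame({
    'revenue_inr': [100, 200, 300, 400],
    'revenue_usd': [1.2, 2.4, 3.6, 4.8],
    'employees':   [10, 12, 9, 15],
})

corr = df.corr().abs()                  # pairwise absolute correlations
redundant = [c for c in corr.columns
             if any((corr[c] > 0.9) & (corr.index != c))]
print(redundant)                        # candidate columns to review for removal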
Data transformation
• Data must be transformed so it is consistent and
readable (by a system).
• The following five processes may be used for data
transformation.
– 1. Smoothing: Remove noise from data.
– 2. Aggregation: Summarization, data cube construction.
– 3. Generalization: Concept hierarchy climbing.
– 4. Normalization: Scaled to fall within a small, specified
range and aggregation.
• a. Min–max normalization.
• b. Z-score normalization.
• c. Normalization by decimal scaling.
– 5. Attribute or feature construction.
– a. New attributes constructed from the given ones.
• Data transformation techniques
– Rescaling
– Normalizing
– Binarizing
– Standardizing
– Labeling
– One hot encoding
• Data transformation allows users to derive new variables
from existing ones.
• The transformation process can change the scale of the
variables, the grouping of the values, and the type of the
variable.
• Transformation also allows you to infer missing values, for
example, replace the missing values with new values.
• Transformations and inference make the data more useful
in the modeling process.
• Rescaling:
– Scaling is required to rescale the data and it’s used
when we want features to be compared on the same
scale for our algorithm.
– When all features are in the same scale, it also helps
algorithms to understand the relative relationship
better.
– E.g., a MinMax scaler: for each feature, the minimum value of that feature is subtracted from each value, and the result is divided by the range (original maximum minus minimum) of the same feature. It has a default range of [0, 1].
• x_scaled = (x – x_min)/(x_max – x_min)
• Though (0, 1) is the default range, we can define our range
of max and min values as well
– RobustScaler can be used when the data has large outliers and we want to suppress their effects.
• Unimportant outliers should still be removed in the first place.
• RobustScaler subtracts the column’s median and divides by the interquartile range.
– StandardScaler rescales each column to have 0
mean and 1 Standard Deviation.
• It standardizes a feature by subtracting the mean and
dividing by the standard deviation.
• If the original distribution is not normally distributed, it
may distort the relative space among the features.
• Oftentimes, we have datasets in which different
columns have different units –
– One column can be in kilograms, while another
column can be in centimeters.
– Furthermore, we can have columns like income which
can range from 20,000 to 100,000, and even more;
while an age column which can range from 0 to 100(at
the most).
• Thus, Income is about 1,000 times larger than age.
• When we feed these features to the model as is,
there is every chance that the income will
influence the result more due to its larger value.
• So, to give importance to both Age, and Income,
we need feature scaling.
• Eg of Rescaling: MinMax scaler / Normalizing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'Income': [15000, 1800, 120000, 10000],
                   'Age': [25, 18, 42, 51],
                   'Department': ['HR', 'Legal', 'Marketing', 'Management']})
– Before directly applying any feature transformation or scaling technique, we need to first deal with the categorical column, Department.
– This is because we cannot scale non-numeric
values.
– So we first create a copy of our dataframe and
store the numerical feature names in a list, and
their values as well:
df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

   Income     Age        Department
0  0.111675   0.212121   HR
1  0.000000   0.000000   Legal
2  1.000000   0.727273   Marketing
3  0.069374   1.000000   Management

The minimum value in each column became 0 and the maximum value became 1, with the other values in between.
In case we don’t want the income or age to have values like 0, let us take the range to be (5, 10):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(5, 10))
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

   Income      Age        Department
0  5.558376    6.060606   HR
1  5.000000    5.000000   Legal
2  10.000000   8.636364   Marketing
3  5.346870    10.000000  Management
• The min-max scaler lets us set the range in which we want the
variables to be.
• Standard scaler EG
– For each feature, the Standard Scaler scales the values such that the mean is 0 and the standard deviation (and hence the variance) is 1.
• x_scaled = (x – mean)/std_dev
• Standard Scaler assumes that the distribution
of the variable is normal.
• In cases where the variables are not normally
distributed, we either
– choose a different scaler
– or first, convert the variables to a normal
distribution and then apply this scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

   Income      Age        Department
0  -0.449056   -0.685248  HR
1  -0.722214   -1.218219  Legal
2   1.723796    0.609110  Marketing
3  -0.552525    1.294358  Management
Income: mean = 0.000000, std = 1.154701
Age:    mean = -5.551115e-17, std = 1.154701e+00
• Robust scaler
– The scalers we saw so far, each of them was using values
like the mean, maximum and minimum values of the
columns.
• All these values are sensitive to outliers.
• If there are many outliers in the data, they will influence the mean and the max or min value.
• Thus, even if we scale this data using the previous methods, we cannot guarantee balanced data with a normal distribution.
– The Robust Scaler, as the name suggests, is not sensitive to outliers.
– This scaler removes the median from the data and scales the data by the InterQuartile Range (IQR).
– IQR is nothing but the difference between the first and
third quartile of the variable.
– The interquartile range can be defined as-
• IQR = Q3 – Q1
– Thus the formula would be:
• x_scaled = (x – median)/(Q3 – Q1)
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

   Income      Age        Department
0   0.075075   -0.404762  HR
1  -0.321321   -0.738095  Legal
2   3.228228    0.404762  Marketing
3  -0.075075    0.833333  Management
• Normalizing
– A scaling technique, in which values are shifted
and rescaled , to fall in the range between 0 to 1.
– Also called MinMax scaling (Eg seen in previous
section)
– Minimum value transformed to 0, maximum
transformed to 1, and all other values to a decimal
between 0 and 1.
• Binarization of data
– Binarization is the process of transforming data features of
any entity into vectors of binary numbers to
make classifier algorithms more efficient.
– In a simple example, transforming an image’s gray-scale
from the 0-255 spectrum to a 0-1 spectrum is binarization.
– In machine learning, even the most complex concepts can
be transformed into binary form.
– For example, to binarize the sentence
• “The dog ate the cat,” every word is assigned an ID (for example
dog-1, ate-2, the-3, cat-4).
• Then replace each word with the tag to provide a binary vector.
• In this case the vector: <3,1,2,3,4> can be refined by providing
each word with four possible slots, then setting the slot to
correspond with a specific word: <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>.
• This is commonly referred to as the bag-of-words-method.
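• A minimal sketch of this idea in plain Python, using the word IDs from the example above (dog=1, ate=2, the=3, cat=4); note that the 16-slot vector above keeps only the unique words, while the sketch below emits one 4-slot vector per word of the sentence.

# Word IDs as assigned in the example above (1-based)
word_ids = {'dog': 1, 'ate': 2, 'the': 3, 'cat': 4}

def one_hot(word, vocab_size=4):
    """Return a binary vector with a 1 in the slot of the given word."""
    vec = [0] * vocab_size
    vec[word_ids[word] - 1] = 1
    return vec

sentence = ['the', 'dog', 'ate', 'the', 'cat']
print([word_ids[w] for w in sentence])      # [3, 1, 2, 3, 4]
print([one_hot(w) for w in sentence])       # one 4-slot binary vector per word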
• sklearn.preprocessing.Binarizer() is a method
which belongs to preprocessing module.
• It plays a key role in the discretization of
continuous feature values.
– Example #1:
Continuous data of pixel values of an 8-bit grayscale image have values ranging between 0 (black) and 255 (white), and one needs the image to be black and white. So, using Binarizer() one can set a threshold, converting pixel values from 0–127 to 0 and from 128–255 to 1.
– Syntax
• sklearn.preprocessing.Binarizer(threshold, copy)
– threshold: [float, optional] Values less than or equal to the threshold are mapped to 0, otherwise to 1. By default the threshold value is 0.0.
– copy: [boolean, optional] If set to False, it avoids a copy. By default it is True.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn import preprocessing

data_set = pd.read_csv(
    'C:\\Users\\dell\\Desktop\\Data_for_Feature_Scaling.csv')
data_set.head()

# Features - the Age and Salary columns are taken using slicing to binarize their values
age = data_set.iloc[:, 1].values
salary = data_set.iloc[:, 2].values
print("\nOriginal age data values : \n", age)
print("\nOriginal salary data values : \n", salary)

from sklearn.preprocessing import Binarizer
x = age.reshape(1, -1)
y = salary.reshape(1, -1)

# For age, let the threshold be 35
# For salary, let the threshold be 61000
binarizer_1 = Binarizer(threshold=35)
binarizer_2 = Binarizer(threshold=61000)

# Transformed features
print("\nBinarized age : \n", binarizer_1.fit_transform(x))
print("\nBinarized salary : \n", binarizer_2.fit_transform(y))
• Data set (first five rows):
     Country   Age   Salary   Purchased
  0  France    44    72000    0
  1  Spain     27    48000    1
  2  Germany   30    54000    0
  3  Spain     38    61000    0
  4  Germany   40    1000     1
• Output:
  Original age data values:
  [44 27 30 38 40 35 78 48 50 37]
  Original salary data values:
  [72000 48000 54000 61000 1000 58000 52000 79000 83000 67000]
  Binarized age: [[1 0 0 1 1 0 1 1 1 1]]
  Binarized salary: [[1 0 0 0 0 0 0 1 1 1]]
• Label encoding
– Machines understand numbers, not text.
– We need to convert each text category to
numbers in order for the machine to process them
using mathematical equations.
– Done using Label encoding and one-Hot encoding
techniques
– First we look into what is categorical encoding?
• What is categorical encoding?
– Any structured dataset includes multiple columns.
• a combination of numerical as well as categorical variables.
• A machine can only understand the numbers. It cannot understand the
text.

– That’s primarily the reason we need to convert categorical columns to numerical columns, so that a machine learning algorithm can understand them.
– This process is called categorical encoding.
– Categorical encoding is the process of converting categories to numbers.
• The two most widely used techniques, for
categorical encoding:
– Label Encoding
– One-Hot Encoding
• Label Encoding:
– Label Encoding is a popular encoding technique
for handling categorical variables.
– In this technique, each label is assigned a unique
integer based on alphabetical ordering.
– #importing the libraries
– import pandas as pd
– import numpy as np
– #reading dataset
– df=pd.read_csv("Salary.csv")

  Country   Age   Salary
  India     44    72000
  US        34    65000
  Japan     46    98000
  US        35    45000
  Japan     23    34000
– the first column, Country, is the categorical
feature as it is represented by the object data
type and the rest of them are numerical features
as they are represented by int64.
– # Import label encoder
– from sklearn import preprocessing
– label_encoder = preprocessing.LabelEncoder()
– df['Country'] = label_encoder.fit_transform(df['Country'])
– print(df.head())
   Country   Age   Salary
0  0         44    72000
1  2         34    65000
2  1         46    98000
3  2         35    45000
4  1         23    34000
– Label encoding here uses alphabetical ordering, since Country is a nominal attribute with no inherent order.
– Hence, India has been encoded with 0, Japan with 1, and the US with 2.
– For LabelEncoder, fit_transform() first learns the unique categories in the column (fit) and then maps each value to its integer code (transform) in a single step.
• Challenges with Label Encoding
– In the above scenario, the Country names do not
have an order or rank.
– But, when label encoding is performed, the
country names are ranked based on the
alphabets.
– Due to this, there is a very high probability that
the model captures the relationship between
countries such as India < Japan < the US.
– This is something that we do not want! So how
can we overcome this obstacle? Here comes the
concept of One-Hot Encoding.
• One-Hot encoding
– One-Hot Encoding is another popular technique
for treating categorical variables.
– It simply creates additional features based on the
number of unique values in the categorical
feature.
– Every unique value in the category will be added
as a feature.
– One-Hot Encoding is the process of creating
dummy variables.
– In this encoding technique, each category is
represented as a one-hot vector.
  0   1   2   Age   Salary
  1   0   0   44    72000
  0   0   1   34    65000
  0   1   0   46    98000
  0   0   1   35    45000
  0   1   0   23    34000

Three new features (columns 0, 1, 2 for India, Japan, US) are added, since Country contains 3 unique values. Here each category is represented by a binary vector (a code sketch follows the lists below).
• We apply One-Hot Encoding when:
– The categorical feature is not ordinal (like the
countries above)
– The number of categorical features is less so
one-hot encoding can be effectively applied
• We apply Label Encoding when:
– The categorical feature is ordinal (like Jr. kg, Sr.
kg, Primary school, high school)
– The number of categories is quite large as
one-hot encoding can lead to high memory
consumption
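• A minimal sketch of both encodings on the small Country table used above; pd.get_dummies is one convenient route to one-hot encoding (scikit-learn's OneHotEncoder is an equivalent alternative), and the integer dtype is chosen only to reproduce the 0/1 table shown earlier.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Country': ['India', 'US', 'Japan', 'US', 'Japan'],
                   'Age': [44, 34, 46, 35, 23],
                   'Salary': [72000, 65000, 98000, 45000, 34000]})

# Label encoding: one integer code per category (alphabetical order)
print(LabelEncoder().fit_transform(df['Country']))   # [0 2 1 2 1]

# One-hot encoding: one 0/1 indicator column per unique country
one_hot = pd.get_dummies(df, columns=['Country'], dtype=int)
print(one_hot)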
Data Reduction
• Data reduction is a key process in which a reduced
representation of a dataset that produces the
same or similar analytical results is obtained.
• One example of a large dataset that could warrant
reduction is a data cube.
• Another example of data reduction is removal of
unnecessary attributes.
• Reduces the data by removing unimportant and
unwanted features from the data set.
• Data Reduction techniques are methods that
one can use to preserve data in a reduced or
condensed form but without any loss of
information or fidelity.
• Different data reduction strategies are
– Dimensionality reduction
– Data cube aggregation
– Numerosity reduction
• Data reduction consciously allows us to
categorize or extract the necessary
information from a huge array of data to
enable us to make conscious decisions.
• “Data reduction is the transformation of
numerical or alphabetical digital information
derived empirically or experimentally into a
corrected, ordered, and simplified form.”
• In simple terms, it simply means large
amounts of data are cleaned, organized and
categorized based on prerequisite criteria to
help in driving business decisions.
• Data cube aggregation:
– Data Cube Aggregation is a multidimensional aggregation
that uses aggregation at various levels of a data cube to
represent the original data set, thus achieving data
reduction.
– Data Cube Aggregation, where the data cube is a much
more efficient way of storing data, thus achieving data
reduction, besides faster aggregation operations.
– used to aggregate data in a simpler form
– Example: consider a data set gathered for analysis that
includes the revenue of your company every three months.
• But for analysis we need the annual sales, rather than the
quarterly average.
• So we can summarize the data in such a way that the resulting
data summarizes the total sales per year instead of per quarter. It
summarizes the data.
• Dimensionality Reduction:
– In contrast with the data cube aggregation method,
where the data reduction was with the consideration
of the task, dimensionality reduction method works
with respect to the nature of the data
– A dimension or a column in the data spreadsheet is
referred to as a “feature,” and the goal of the process
is to identify which features to remove or collapse to a
combined feature.
– This requires identifying redundancy in the given data
and/or creating composite dimensions or features
that could sufficiently represent a set of raw features.
– Strategies for reduction include sampling, clustering,
principal component analysis, etc.
• There are mainly two types of dimensionality
reduction methods.
• Both methods reduce the number of
dimensions but in different ways.
• It is very important to distinguish between
those two types of methods.
• One type of method only keeps the most important features in the dataset and removes the redundant features: Feature Selection.
– There is no transformation applied to the set of features.
• The other type finds a combination of new features: Feature Extraction.
• An appropriate transformation is applied to the set of features.
• The new set of features contains different values instead of the original values.
• Feature selection methods:
– Extracts a subset of features from the original set of all
features of a dataset to obtain a smaller subset that
can be used for further analysis.
– These methods only keep the most important features
in the dataset and remove the redundant features.
– Step-wise Forward Selection –
• The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is added to the set based on its relevance.
• Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step 1: {X1}
Step 2: {X1, X2}
Step 3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
– Instead of eliminating features recursively, the
algorithm attempts to train the model on a single
feature in the dataset and calculates the
performance of the model (usually, accuracy score
for a classification model and RMSE for a
regression model).
– Then, the algorithm adds (selects) one feature
(variable) at a time, trains the model on those
features and calculates the performance scores.
– The algorithm repeats adding features until it
detects a small (or no) change in the performance
score of the model and stops there!
– Step-wise Backward Selection –
• This selection starts with a set of complete attributes in the
original data and at each point, it eliminates the worst
remaining attribute in the set.
• This method eliminates (removes) features from a dataset
through a recursive feature elimination (RFE) process.
• The algorithm first attempts to train the model on the initial
set of features in the dataset and calculates the
performance of the model (usually, accuracy score for a
classification model and RMSE for a regression model).
• Then, the algorithm drops one feature (variable) at a time,
trains the model on the remaining features and calculates
the performance scores.
• The algorithm repeats eliminating features until it detects a
small (or no) change in the performance score of the model
and stops there!
• Suppose the following attributes are in the data set, of which a few are redundant.
• Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
Step 1: {X1, X2, X3, X4, X5}
Step 2: {X1, X2, X3, X5}
Step 3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
– Combination of Forward and Backward Selection –
• It allows us to remove the worst and select the best attributes, saving time and making the process faster.
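• A minimal sketch of step-wise backward selection via scikit-learn's recursive feature elimination (RFE); the synthetic data and the choice of logistic regression as the estimator are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Synthetic data: 6 features, only 3 of which are informative
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, random_state=0)

# Recursively drop the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = selected feature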
• Univariate selection
– Works by inspecting each feature and then finding the
best feature , based on statistical tests.
– Analyzes the capability of these features in
accordance with the response variable.
• Decision tree induction
– Uses the concept of decision trees for feature selection.
– The internal nodes of the tree indicate tests applied on an attribute
– The branches indicate the outcomes of the test
– Helps in discarding irrelevant attributes, i.e., those attributes that are not part of the tree.
• Feature extraction methods:
– Used to reduce data with many features to a data
set with reduced features.
– Feature selection chooses the most relevant
features from a feature set, whereas feature
extraction creates a new , smaller set of features
that consist of most useful information.
– Most common methods of feature extraction are
• Principal component analysis
• Linear discriminant analysis
– Principal component analysis
• This method involves the identification of a few
independent tuples with ‘n’ attributes that can represent
the entire data set
• PCA is a linear dimensionality reduction technique
(algorithm) that transforms a set of correlated variables (p)
into a smaller k (k<p) number of uncorrelated variables
called principal components while retaining as much of the
variation in the original dataset as possible
• Principal Component Analysis, or PCA, is a
dimensionality-reduction method that is often used to
reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one that
still contains most of the information in the large set.
• reduce the number of variables of a data set, while
preserving as much information as possible.
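• A minimal sketch of PCA with scikit-learn, reusing the standard-scaling step from earlier because PCA is sensitive to feature scale; the synthetic data and the choice of 2 components are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 5 correlated columns built from 2 independent ones
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)                 # keep 2 principal components
X_reduced = pca.fit_transform(X_scaled)   # shape (100, 2)
print(pca.explained_variance_ratio_)      # share of variance retained per component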
• Linear Discriminant Analysis:
– LDA is typically used for multi-class classification. It
can also be used as a dimensionality reduction
technique.
– LDA best separates or discriminates (hence the name
LDA) training instances by their classes.
– The major difference between LDA and PCA is that
LDA finds a linear combination of input features that
optimizes class separability while PCA attempts to find
a set of uncorrelated components of maximum
variance in a dataset.
– Another key difference between the two is that PCA is
an unsupervised algorithm whereas LDA is a
supervised algorithm where it takes class labels into
account.
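• A minimal sketch of LDA as a supervised dimensionality reducer, using scikit-learn's built-in iris dataset purely as an illustrative example.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 4 features, 3 classes

# LDA can produce at most (number of classes - 1) components, here 2
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)      # uses the class labels y (supervised)
print(X_reduced.shape)                   # (150, 2)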
• Advantages of Dimensionality Reduction
– A lower number of dimensions in data means less
training time and less computational resources and
increases the overall performance of machine
learning algorithms
• — Machine learning problems that involve many features
make training extremely slow. Most data points in
high-dimensional space are very close to the border of that
space. This is because there’s plenty of space in high
dimensions. In a high-dimensional dataset, most data points
are likely to be far away from each other. Therefore, the
algorithms cannot effectively and efficiently train on the
high-dimensional data.
• In machine learning, that kind of problem is referred to as
the curse of dimensionality

– Dimensionality reduction is extremely useful
for data visualization — When we reduce the
dimensionality of higher dimensional data into
two or three components, then the data can easily
be plotted on a 2D or 3D plot
– Dimensionality reduction takes care
of multicollinearity — In regression,
multicollinearity occurs when an independent
variable is highly correlated with one or more of
the other independent variables. Dimensionality
reduction takes advantage of this and combines
those highly correlated variables into a set of
uncorrelated variables. This will address the
problem of multicollinearity.
– Dimensionality reduction is very useful for factor
analysis — This is a useful approach to find latent
variables which are not directly measured in a single
variable but rather inferred from other variables in the
dataset. These latent variables are called factors.
– Dimensionality reduction removes noise in the
data — By keeping only the most important features
and removing the redundant features, dimensionality
reduction removes noise in the data. This will improve
the model accuracy.
– Dimensionality reduction can be used for image
compression — image compression is a technique
that minimizes the size in bytes of an image while
keeping as much of the quality of the image as
possible. The pixels which make the image can be
considered as dimensions (columns/variables) of the
image data.
• Numerosity Reduction
– It is a data reduction technique which replaces the
original data by smaller form of data representation.
– There are two techniques for numerosity
reduction- Parametric and Non-Parametric methods.
– Parametric methods
• For parametric methods, data is represented using some
model.
• The model is used to estimate the data, so that only
parameters of data are required to be stored, instead of
actual data.
• Regression and Log-Linear methods are used for creating
such models.
– Regression:
Regression can be a simple linear regression or multiple
linear regression.
– When there is only single independent attribute, such
regression model is called simple linear regression
– If there are multiple independent attributes, then such
regression models are called multiple linear regression.
– In linear regression, the data are modeled to fit a straight line. For example, a random variable y can be modeled as a linear function of another random variable x with the equation y = ax + b, where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively.
– In multiple linear regression, y will be modeled as a linear
function of two or more predictor(independent) variables.
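• A minimal sketch of parametric numerosity reduction with simple linear regression: only the two fitted coefficients a and b need to be stored in place of the raw series. The synthetic data is an illustrative assumption.

import numpy as np

# Synthetic data roughly following y = 2x + 1 with noise
rng = np.random.default_rng(0)
x = np.arange(0, 100, dtype=float)
y = 2 * x + 1 + rng.normal(scale=3.0, size=x.size)

# Fit y = a*x + b; the whole series is now summarized by two parameters
a, b = np.polyfit(x, y, deg=1)
print(a, b)                   # approximately 2 and 1
y_estimate = a * x + b        # reconstruct approximate values from the model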
• Non-Parametric Methods –
– These methods are used for storing reduced
representations of the data
include histograms, clustering, sampling and data cube
aggregation.
– Histograms:
• Histogram is the data representation in terms of frequency. It
uses binning to approximate data distribution and is a popular
form of data reduction.
– Clustering:
• Clustering divides the data into groups/clusters. This technique
partitions the whole data into different clusters. In data
reduction, the cluster representation of the data are used to
replace the actual data. It also helps to detect outliers in data.
– Sampling:
• Sampling can be used for data reduction because it allows a large
data set to be represented by a much smaller random data
sample (or subset).
Data Discretization
• Data discretization refers to a method of
converting a huge number of data values into
smaller ones so that the evaluation and
management of data become easy.
• Also defined as a process of converting
continuous data attribute values into a finite
set of intervals and associating with each
interval some specific data value.
• In other words, data discretization is a
method of converting attributes values of
continuous data into a finite set of intervals
with minimum data loss.
• Often, it is easier to understand continuous
data (such as weight) when divided and stored
into meaningful categories or groups.
– For example, we can divide a continuous variable,
weight, and store it in the following groups :
Under 100 kg (light), between 140–160 kg (mid),
and over 200 kg (heavy).
• Discretization is useful if we see no objective
difference between variables falling under the
same weight class.
– In our example, weights of 85 kg and 56 kg convey the same information (the object is light).
– Therefore, discretization helps make our data
easier to understand if it fits the problem
statement.

• There are two forms of data discretization
– Supervised discretization,
– Unsupervised discretization.
• Supervised discretization refers to a method in
which the class data is used.
• Unsupervised discretization refers to a method that does not use class information and depends only on the way the operation proceeds.
– It can work with a top-down splitting strategy or a bottom-up merging strategy.
• Approaches to Discretization
– Unsupervised:
— Equal-Width
— Equal-Frequency
— K-Means
– Supervised:
— Decision Trees
• Unsupervised methods:
– Binning: Binning is a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins.
– For example, if we have data about a group of students and we want to arrange their marks into a smaller number of intervals, we can make bins of grades.
– One bin for grade A, one for grade B, one for C, one for D, and one for the F grade.
– Equal-Width Discretization
• Separating all possible values into ‘N’ number of bins, each
having the same width. Formula for interval width:
• Width = (maximum value - minimum value) / N
* where N is the number of bins or intervals.
– Equal-Frequency Discretization
• Separating all possible values into ‘N’ number of bins,
each having the same amount of observations.
– K-Means Discretization
• We apply K-Means clustering to the continuous
variable, thus dividing it into discrete groups or
clusters.
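• A minimal sketch of the three unsupervised approaches on one numeric column: pandas cut/qcut give equal-width and equal-frequency bins, and scikit-learn's KBinsDiscretizer covers the k-means variant. The marks column and the choice of 3 bins are illustrative assumptions.

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

marks = pd.Series([12, 35, 48, 52, 60, 67, 71, 78, 85, 93])

equal_width = pd.cut(marks, bins=3)    # 3 bins of equal width
equal_freq  = pd.qcut(marks, q=3)      # 3 bins with roughly equal counts

# K-means based discretization into 3 bins
kmeans_binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
kmeans_bins = kmeans_binner.fit_transform(marks.to_frame())

print(equal_width.value_counts())
print(equal_freq.value_counts())
print(kmeans_bins.ravel())             # bin index (0, 1, 2) for each mark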
• Decision trees
– Decision Trees (DTs) are a non-parametric
supervised learning method used for data
discretization.
– The goal is to create a model that predicts the
value of a target variable by learning simple
decision rules inferred from the data features.
– A Decision tree is a flowchart like tree structure,
where each internal node denotes a test on an
attribute, each branch represents an outcome of
the test, and each leaf node (terminal node) holds
a class label.
• Data discretization and concept hierarchy generation
– A concept hierarchy represents a sequence of mappings from a set of more general concepts to more specialized concepts.
– Similarly, there are mappings from low-level concepts to higher-level concepts. In other words, we have top-down mapping and bottom-up mapping.
– Example of a concept hierarchy for the dimension location.
• Each city can be mapped with the country to which the given city
belongs. For example, Delhi can be mapped to India and India can
be mapped to Asia.
– Top-down mapping
• Top-down mapping starts from the top with general concepts
and moves to the bottom to the specialized concepts.
• Bottom-up mapping
– Bottom-up mapping starts from the Bottom with
specialized concepts and moves to the top to the
generalized concepts.
