
MODULE 3

Chapter 1: Learning

Introduction to ML

a) Evolution of Machine Learning
• The term Machine Learning (ML) was first used by Arthur Samuel, one
of the pioneers of Artificial Intelligence at IBM, in 1959.
• Machine learning (ML) is an important tool for building technologies based on artificial intelligence.
• Because of its learning and decision-making abilities, machine learning
is often referred to as AI, though, in reality, it is a subdivision of AI.
• Until the late 1970s, it was a part of AI’s evolution. Then, it branched off
to evolve on its own.
• Machine learning is now responsible for some of the most significant
advancements in technology.

b) What is Machine Learning (ML)?

• Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

• Machine learning is an application of AI that provides systems the ability to learn on their own and improve from experience without being programmed externally.

• Machine learning was defined by Stanford University as "the science of getting computers to act without being explicitly programmed."
• Traditional programming is a manual process: the programmer creates the program. Programming aims to answer a problem using a predefined set of rules or logic.

• In machine learning, the algorithm automatically formulates the rules from the data. Machine learning seeks to construct a model or logic for the problem by analyzing its input data and answers.

c) Types of ML
• Based on the methods and way of learning, machine learning is
divided into six types.

1. Supervised Machine Learning Algorithms:
• The primary purpose of supervised learning is to learn from labeled sample data so that predictions can be made about unavailable, future or unseen data.
• In supervised learning, there are input variables (x) and an output variable (Y), and an algorithm is used to learn the mapping function from the input to the output: Y = f(x).
• The goal is to approximate the mapping function so well that
when there comes a new input data (x), the machine should be
able to predict the output variable (Y) for that data.
• Supervised machine learning includes two major
processes: classification and regression.
❑ Classification is the process of categorizing a set of data into classes (e.g., yes/no, true/false, 0/1, yes/no/maybe). There are various types of classification problems, such as binary classification, multi-class classification and multi-label classification. Examples of classification problems are: spam filtering, image classification, sentiment analysis, classifying cancerous and non-cancerous tumors, customer churn prediction, etc.

❑ Regression is the process of identifying patterns and calculating predictions for continuous outcomes. Regression analysis techniques are used when the target and independent variables show a linear or non-linear relationship between each other, and the target variable contains continuous values. Examples of regression problems are: predicting house prices, predicting a month's sales, predicting the age of a person, predicting rainfall, determining market trends, etc.

• The most widely used supervised algorithms are:
❑ Linear Regression
❑ Logistic Regression
❑ Random Forest
❑ Boosting algorithms
❑ Support Vector Machines
❑ Decision Trees
❑ Naive Bayes
❑ Nearest Neighbor.
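• As a rough illustration of the supervised workflow, a minimal sketch using scikit-learn with a made-up toy dataset (hours studied and hours slept mapped to pass/fail):

```python
# Minimal supervised-learning sketch (scikit-learn assumed installed; data is invented).
from sklearn.tree import DecisionTreeClassifier

# Labeled sample data: [hours studied, hours slept] (x) -> pass/fail label (Y).
X = [[2, 9], [1, 5], [3, 6], [5, 8], [6, 4], [7, 7]]
Y = [0, 0, 0, 1, 1, 1]                        # 0 = fail, 1 = pass

model = DecisionTreeClassifier().fit(X, Y)    # learn the mapping Y = f(x)
print(model.predict([[4, 7]]))                # predict the label for new, unseen input
```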

2. Unsupervised Machine Learning Algorithms:
• Unsupervised learning feeds on unlabeled data.
• In unsupervised machine learning algorithms, the desired results are unknown and yet to
be defined.
• Unsupervised learning algorithms apply the following techniques to describe the data:
❑ Clustering: It is an exploration of data used to segment it into meaningful groups (i.e., clusters) based on internal patterns, without any prior knowledge of group membership. The groups are defined by the similarity of individual data objects to one another and by their dissimilarity from the rest. Examples: identifying fraudulent or criminal activity, classifying network traffic, identifying fake news, etc.

❑ Dimensionality reduction: Most of the time, there is a lot of noise in the incoming data.
Machine learning algorithms use dimensionality reduction to remove this noise while
distilling the relevant information. Examples: Image compression, classify a database full of
emails into “not spam” and “spam”.

• The most widely used unsupervised algorithms are:
❑ K-means clustering
❑ PCA (Principal Component Analysis)
❑ Association rule.
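• A minimal K-means sketch (scikit-learn assumed installed; the unlabeled points are made up):

```python
# Minimal unsupervised clustering sketch with K-means.
from sklearn.cluster import KMeans

# Unlabeled points; the algorithm finds 2 groups on its own.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # learned cluster centres
```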

3. Semi-supervised Machine Learning Algorithms:
• Semi-supervised learning algorithms represent a middle ground
between supervised and unsupervised algorithms.
• In this type of learning, the algorithm is trained upon a combination of
labeled and unlabelled data.
• This combination will contain a very small amount of labeled data and
a very large amount of unlabelled data.
• The basic procedure involved is that first, the programmer will cluster
similar data using an unsupervised learning algorithm and then use
the existing labeled data to label the rest of the unlabelled data.
• Examples: Text document classifier, Speech analysis etc.
• One of the popular semi-supervised ML algorithms is the Label Propagation algorithm.
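• A minimal Label Propagation sketch (scikit-learn assumed installed; the tiny dataset is made up, with -1 marking unlabelled points):

```python
# Minimal semi-supervised sketch: labels spread from a few labeled points to the rest.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [0.8], [8.0], [8.3], [7.9]])
y = np.array([0, -1, -1, 1, -1, -1])   # only two points are labeled; -1 = unlabelled

model = LabelPropagation().fit(X, y)
print(model.transduction_)             # labels propagated to the unlabelled points
```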

4. Reinforcement Machine Learning Algorithms:
• Reinforcement learning employs a technique called exploration/exploitation.
• It’s an iterative algorithm. The action takes place, the
consequences are observed, and the next action considers
the results of the first action.
• Using this algorithm, the machine is trained to make specific
decisions.
• It works this way: The machine is exposed to an environment
where it trains itself continually using trial and error. The
machine learns from past experience and tries to capture the
best possible knowledge to make accurate business decisions.
• Examples: Video games, Self-driving cars etc.

• Most common reinforcement learning algorithms include:
❑ Q-Learning
❑ Temporal Difference (TD)
❑ Monte-Carlo Tree Search (MCTS)
❑ Asynchronous Actor-Critic Agents (A3C).
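• A toy tabular Q-learning sketch for a one-dimensional corridor, to make the trial-and-error loop concrete (the environment and parameters are invented for illustration):

```python
# Toy Q-learning: the agent learns, by trial and error, to walk right to reach the reward.
import random

n_states, actions = 5, [-1, +1]          # states 0..4; reward only at the last state
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.3    # learning rate, discount, exploration rate

for _ in range(300):                     # episodes
    s = 0
    while s != n_states - 1:
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Update: blend the old estimate with the observed reward plus best future value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next

print(max(actions, key=lambda act: Q[(0, act)]))   # learned best first move (should be +1)
```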

5. Weakly Supervised Machine Learning Algorithms:
• Weakly supervised learning is a technique of building models
based on newly generated data.
• It is a branch of machine learning that uses noisy, restricted, or
inaccurate sources to label vast quantities of training data.
• These labels are often generated by computers by using
heuristics to combine unlabeled data with a signal to create their
label.
• The algorithm learns from large amounts of weak supervisory
data. This could include:
❑ Incomplete supervision (e.g., Semi-supervised learning).
❑ Inexact supervision (e.g., Multi-instance learning).
❑ Incorrect supervision (e.g., Label noise learning).
a) Incomplete Supervision
• In this type, only a subset of the
training data is labeled.
• In most cases, this subset is
correctly and accurately labeled
but not sufficient for training a
supervised model.
• There are two techniques to deal with incomplete supervision: active learning and semi-supervised learning.
• Active learning converts the weak learning to the strong type. It requires
human experts to annotate the unlabeled data manually. Because of the
high cost of acquiring all labels from human experts, they are asked to
annotate only a subset of the unlabeled data.
• Semi-supervised learning means that the labeled data is used to train a predictive model, and the trained model is then used to assign labels to the unlabeled data. No human experts are involved in this technique.
b) Inexact Supervision
• The given labels in this type are
imprecise.
• In some cases, this type also contains some misleading records. These can accept more than one label because there are no discriminating features.
• In this type of supervision, multi-instance
learning is used where a bag (subset) of
instances is labeled according to one
of the instances (the key instance), or
the majority, inside the bag.

• For each algorithm, the bag generator specifies how many instances should
be in each bag.
• A bag can be an image, a text document, a set of records for stocks, and so on.
c) Inaccurate Supervision
• In this type, there are wrong or
low-quality labels.
• Inaccurate labels usually come from
collecting public or crowdsourcing data
sets.
• The idea is to identify the potential mislabeled instances and to
correct or remove them.
• One of the practical techniques to achieve this idea is the data-editing
approach.
• The data-editing approach constructs a graph of relative neighborhoods,
where each node is an instance, and an edge connects two nodes of
different labels. An instance (node) is considered suspicious if it is
connected to many edges. This suspicious instance is then removed or
re-labeled according to the majority.
• For crowdsourcing data sets, the labels are obtained according to the majority vote.
6. Self Supervised Machine Learning Algorithms:
• Self-supervised learning (SSL) is an evolving machine learning technique poised to solve the challenges posed by over-dependence on labeled data.
• Here the model trains itself to learn one part of the input from another
part of the input. It is also known as predictive or pretext learning.
• In this process, the unsupervised problem is transformed into a
supervised problem by auto-generating the labels.
• To make use of the huge quantity of unlabeled data, it is crucial to set
the right learning objectives to get supervision from the data itself.
• The process of the self-supervised learning method is to identify any
hidden part of the input from any unhidden part of the input.

• The model learns in two steps. First, the task is solved based on an
auxiliary or pretext classification task using pseudo-labels which
help to initialize the model parameters. Second, the actual task is
performed with supervised or unsupervised learning.
• Types: For a binary classification task, training data can be divided
into positive examples and negative examples. Positive examples
are those that match the target. Negative examples are those that do
not.
❑ Contrastive self-supervised learning: It uses both positive and
negative examples. Contrastive learning’s loss function minimizes
the distance between positive samples while maximizing the distance
between negative samples.
❑ Non-contrastive self-supervised learning: It uses only positive examples. NCSSL converges on a useful local minimum rather than collapsing to a trivial solution with zero loss.
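• A minimal sketch of the contrastive idea with plain NumPy (the vectors and margin are made up): the loss is small when an anchor embedding sits close to its positive and at least a margin away from its negative.

```python
# Toy contrastive-style objective: pull the positive pair together, push the negative apart.
import numpy as np

def contrastive_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.linalg.norm(anchor - positive)            # distance to the positive sample
    d_neg = np.linalg.norm(anchor - negative)            # distance to the negative sample
    return d_pos ** 2 + max(0.0, margin - d_neg) ** 2    # penalise close negatives

anchor   = np.array([0.1, 0.9])
positive = np.array([0.2, 0.8])   # e.g. another view of the same input
negative = np.array([0.9, 0.1])   # a different input
print(contrastive_loss(anchor, positive, negative))
```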

Chapter 2: Data

Contents
• Types of Data: Structured and Unstructured Data, Quantitative and
Qualitative Data.
• Four Levels of data (Nominal, Ordinal, Interval, Ratio Level).
1. Structured vs Unstructured
▪ Structured (Organized) Data: Data stored in a row/column structure.
• Every row represents a single observation and every column represents a characteristic of that observation.
▪ Unstructured (Unorganized) Data: Data that is in free form and does not follow any standard format/hierarchy.
• E.g.: text or raw audio signals that must be parsed further to become organized.
Pros of Structured Data
▪ Structured data is generally thought of as being much
easier to work with and analyze.
▪ Most statistical and machine learning models were
built with structured data in mind and cannot work on
the loose interpretation of unstructured data.
▪ The natural row and column structure is easy to digest
for human and machine eyes.
Example of Data Pre-processing for Text Data
• Text data is generally unstructured, and hence there is a need to transform it into a structured form.
• Few characteristics that describe the data to assist
transformation are:
❑ Word/phrase count
❑ The existence of certain special characters
❑ The relative length of text
❑ Picking out topics
Example: A Tweet
• This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn
skies.

• Pre-processing is necessary for this tweet because a vast majority of learning algorithms require numerical data.
• Pre-processing allows us to explore features that have been created from the existing features.
• For example, we can extract features such as word count and special characters from the mentioned tweet.
Example: This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins
Venus & Saturn. Afloat in the dawn skies.
1. Word/phrase counts:
• We may break down a tweet into its word/phrase count.
• The word ‘this’ appears in the tweet once, as does every other
word.
• We can represent this tweet in a structured format, converting
the unstructured set of words into a row/column format:

2. Presence of certain special characters


Example: This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins
Venus & Saturn. Afloat in the dawn skies.
• 3. Relative length of text
• This tweet is 121 characters long.
• The average tweet, as discovered by analysts, is about 30
characters in length.
• So, we calculate a new characteristic, called relative length, (which
is the length of the tweet divided by the average length), i.e.
121/30 telling us the length of this tweet as compared to average
tweet.
• This tweet is actually 4.03 times longer than the average tweet.
Example: This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins
Venus & Saturn. Afloat in the dawn skies.
• 4. Picking out topics
• This tweet is about astronomy, so we can add that information as a
column.
• Thus, we can convert a piece of text into structured/organized
data, ready for use in our models and exploratory analysis.

Topic: Astronomy
2. Qualitative/Quantitative
1. Quantitative data: Data that can be described using numbers, and on which basic mathematical procedures, such as addition and subtraction, can be performed.

2. Qualitative data: Data that cannot be described using numbers and on which basic mathematics cannot be performed; it is described using "natural" categories and natural language.
Example of Qualitative/Quantitative
Coffee Shop Data
Observations of coffee shops in a major city were made.
The following characteristics were recorded:
1. Name of coffee shop
2. Revenue (in thousands of dollars)
3. Zip code
4. Average monthly customers
5. Country of coffee origin
Let us try to classify each characteristic as Qualitative OR
Quantitative
1. Name of coffee shop
• Qualitative
• The name of a coffee shop is not expressed as a number
and we cannot perform math on the name of the shop.
2. Revenue
• Revenue – Quantitative
3. Zipcode
• This one is tricky!!!
• Zip code – Qualitative
• A zip code is always represented using numbers, but what
makes it qualitative is that it does not fit the second part of the
definition of quantitative—we cannot perform basic
mathematical operations on a zip code.
• If we add together two zip codes, it is a nonsensical
measurement.
4. Average monthly customers
• Average monthly customers – Quantitative
5. Country of coffee origin
• Country of coffee origin – Qualitative
Example 2: World alcohol consumption data

• Classification of attributes as Quantitative OR Qualitative:
• country: Qualitative
• beer_servings: Quantitative
• spirit_servings: Quantitative
• wine_servings: Quantitative
• total_litres_of_pure_alcohol: Quantitative
• continent: Qualitative
Quantitative data can be broken down, one step further, into
discrete and continuous quantities.
▪ Continuous: can take any value in an interval, e.g. in [1 to 10] the values can be 1, 1.3, 2.46, 5.378, …; it is measured. Example: temperature, 22.6 C or 83.46 F.
▪ Discrete: can only take specific values, with no decimal values, e.g. 1, 2, 3, 4, 5, …; it is counted. Example: rolling a die gives 1, 2, 3, 4, 5 or 6.

Examples: The speed of a car – Continuous
The number of cats in a house – Discrete
Your weight – Continuous
The number of students in a class – Discrete
The number of books in a shelf – Discrete
The height of a person – Continuous
Exact age - Continuous
Four Levels of Data
• It is generally understood that a specific characteristic
(feature/column) of structured data can be broken
down into one of four levels of data. The levels are:
▪ The nominal level
▪ The ordinal level
▪ The interval level
▪ The ratio level
The nominal level
• The first level of data, the nominal level, consists of
data that is described purely by name or category
with no rank order.
• Basic examples include gender, nationality, species, name of a student, color of hair, etc.
• No rank order means we cannot say that one hair color is more important than another.
• They are not described by numbers and are therefore
qualitative.
Mathematical operations allowed

• We cannot perform mathematics on the nominal level of data except the basic equality and set membership functions, as in the following example:
Being a tech entrepreneur implies being in the tech industry, but not vice versa.
Measures of center
• A measure of center is a number that describes what the data tends
to.
• It is sometimes referred to as the balance point of the data.
• Common examples include the mean, median, and mode.
• In order to find the center of nominal data, we generally turn to the
mode (the most common element) of the dataset.
2. The Ordinal level
• Categorical in nature, but with an inherent order or rank, where each option has a different value.
Examples:
Income Levels: (Low, Medium, High)
Levels of agreement: (Disagree, Neutral, Agree)
Levels of satisfaction: (Poor, Average, Good, Excellent)
All these options are still categorical, but they carry different values (a ranking difference).
Measures of center
• In order to find the center of ordinal data, we generally turn to the
median of the dataset.

• The mean is not chosen because computing it requires addition and division, which are not allowed at this level.
3. Interval level
• It is numerical (quantitative) data, i.e., data measured in numbers by default.
Examples:
Credit scores ( 300 – 850 )
GMAT scores ( 200 – 800 )
Temperature ( Fahrenheit )
Example of Interval level: Temperature

• If it is 100 degrees Fahrenheit in Texas and 80 degrees Fahrenheit in Istanbul, Turkey, then Texas is 20 degrees warmer than Istanbul.
• Thus, data at the interval level allows meaningful subtraction between data points.
Mathematical operations allowed
• We can use all the operations allowed with nominal and
ordinal(ordering, comparisons, and so on), along with
two other notable operations:
• Addition
• Subtraction
Measures of center
• We can use the mean, median and mode to describe this data.
• Usually the most accurate description of the center of the data is the arithmetic mean, more commonly referred to as simply "the mean".
• At the previous levels, addition was meaningless; therefore, the mean would have had no meaning.
• It is only at the interval level and above that the arithmetic mean makes sense.
Example: Temperature of Fridge
• Suppose we look at the temperature of a fridge containing a pharmaceutical company's new vaccine. We measure the temperature every hour with the following data points (in Fahrenheit):
• 31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26
Finding Measure of Centre
• Let’s find the mean and median of the data:
• temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]
• Mean = 30.73
• Median = 31.0
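• The same numbers can be reproduced with the Python standard library:

```python
# Reproducing the measures of centre for the fridge temperatures.
import statistics

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]
print(round(statistics.mean(temps), 2))   # 30.73
print(statistics.median(temps))           # 31
```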
Drawback with Interval Data:
• Data at the interval level does not have a "natural starting point" or a "natural zero".
• For example, being at zero degrees Celsius does not mean that you have "no temperature".
4. Ratio level
• A ratio variable, has all the properties of an interval
variable, and also has a clear definition of 0.0. When
the variable equals 0.0, there is none of that variable.
Example:
Type of chocolate: Nominal data
Satisfied: Ordinal data
Age, Groceries and choco-bars: Interval/Ratio data
Chapter 3: Feature Engineering

Feature Engineering
• Feature engineering is the pre-processing step of machine
learning, which is used to transform raw data into features that
can be used for creating a predictive model using Machine learning.
• All machine learning algorithms take input data to generate the
output.
• The input data is usually in a tabular form consisting of rows (instances or observations) and columns (variables or attributes), and these attributes are often known as features.
• Example: An image is an instance in computer vision, but a line in
the image could be the feature. In NLP, a document can be an
observation, and the word count could be the feature.
• A feature is an attribute or individual measurable property or
characteristic of a phenomenon.
• The feature engineering process selects the most useful predictor variables for the model.

• Feature engineering in ML contains mainly four processes: Feature Creation, Transformations, Feature Extraction, and Feature Selection.
1. Feature Creation:
• Feature creation is finding the most useful variables to be used in a
predictive model.
• The process is subjective, and it requires human creativity and
intervention.
• The new features are created by mixing existing features using
addition, subtraction, and ratio, and these new features have great
flexibility.

2. Transformation:
• It involves adjusting the predictor variable to improve the accuracy and
performance of the model.
• It ensures that all the variables are on the same scale, making the
model easier to understand.
• It ensures that all the features are within the acceptable range to avoid
any computational error.
3. Feature Extraction:
• Feature extraction is an automated feature engineering process that
generates new variables by extracting them from the raw data.
• The aim is to reduce the volume of data so that it can be easily
used and managed for data modelling.
• Feature extraction methods include cluster analysis, text
analytics, edge detection algorithms, and principal components
analysis (PCA).

4. Feature Selection:
• Feature selection is a way of selecting the subset of the most
relevant features from the original features set by removing the
redundant, irrelevant, or noisy features.
• This is done in order to reduce overfitting in the model and improve
the performance.
Feature Engineering Techniques
1. Imputation:
• Imputation deals with handling missing values in data.
• Deleting records that have missing values is one way of dealing with the missing data issue. But it could lead to losing out on a chunk of valuable data. This is where imputation can help.
• Data imputation can be classified into two types:
❑ Categorical Imputation: Missing categorical values are generally
replaced by the most commonly occurring value (mode) of the
feature.
❑ Numerical Imputation: Missing numerical values are generally
replaced by the mean or median of the corresponding feature.
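• A minimal imputation sketch (pandas assumed installed; the column names and values are hypothetical):

```python
# Categorical imputation with the mode, numerical imputation with the mean.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],   # categorical feature with a missing value
    "price": [10.0, None, 12.0, 14.0],       # numerical feature with a missing value
})

df["color"] = df["color"].fillna(df["color"].mode()[0])   # mode of the column
df["price"] = df["price"].fillna(df["price"].mean())      # mean (or median) of the column
print(df)
```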

• Example: Categorical Imputation

• Example: Numerical Imputation

2. Discretization:
• Discretization involves taking a set of values of data and
grouping sets of them in some logical fashion into bins (or buckets).
• Binning can apply to numerical values as well as to categorical
values.

• The grouping of data can be done as follows:
❑ Grouping of equal intervals (equal width)
❑ Grouping based on equal frequencies (of observations in the bin)
• Example:
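A minimal sketch of both grouping strategies (pandas assumed installed; the ages are made up):

```python
# Equal-width bins vs equal-frequency bins.
import pandas as pd

ages = pd.Series([22, 25, 31, 35, 41, 47, 52, 60, 64, 70])

equal_width = pd.cut(ages, bins=3)    # three intervals of equal width
equal_freq  = pd.qcut(ages, q=3)      # three intervals holding roughly equal counts
print(equal_width.value_counts())
print(equal_freq.value_counts())
```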

3. Categorical encoding:
• Categorical encoding is the technique used to encode categorical
features into numerical values which are usually simpler for an
algorithm to understand.
• This can be done by:
(i) Integer Encoding
(ii) One-Hot Encoding

(i) Integer Encoding:
• Integer encoding consists of replacing the categories with digits from 1 to n (or 0 to n-1), where n is the number of distinct categories of the variable.
• Each unique category is assigned an integer value.
• This method is also called label encoding.
• This method is used when an ordinal relationship exists among the categories.
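• A minimal integer-encoding sketch (pandas assumed installed; the ordinal mapping is our own choice):

```python
# Integer (label) encoding of an ordinal variable.
import pandas as pd

sizes = pd.Series(["small", "medium", "large", "medium"])
order = {"small": 0, "medium": 1, "large": 2}   # assumed ordinal relationship
print(sizes.map(order))                          # 0, 1, 2, 1
```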

(ii) One-Hot Encoding:
• For categorical variables where no ordinal
relationship exists, a one-hot encoding (OHE)
can be applied.
• Here a new binary variable is added for each
unique integer value.
• In the “color” variable example, there are 3
categories: red, green and blue.
• Therefore 3 binary variables: ‘color_red’,
‘color_blue’ and ‘color_green’ are needed.
• A “1” value is placed in the binary variable for
the color and “0” values for the other colors.
• The binary variables are often called
“dummy variables or indicator variables”.
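• A minimal one-hot encoding sketch for the "color" example (pandas assumed installed):

```python
# One-hot encoding: one binary (dummy) column per category.
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
print(pd.get_dummies(colors, columns=["color"]))   # color_blue, color_green, color_red
```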
4. Feature Splitting:
• Feature splitting is the process of separating features into two or
more parts to make new features.
• This technique helps the algorithms to better understand and learn
the patterns in the dataset.
• Example 1: Sale Date is split into year, month and day.

• Example 2: Time stamp is split into 6 different attributes.
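• A minimal feature-splitting sketch (pandas assumed installed; the timestamps are made up):

```python
# Split a timestamp into separate year/month/day/hour features.
import pandas as pd

df = pd.DataFrame({"sale_date": pd.to_datetime(["2023-08-03 14:25:00",
                                                 "2023-12-19 09:10:00"])})
df["year"]  = df["sale_date"].dt.year
df["month"] = df["sale_date"].dt.month
df["day"]   = df["sale_date"].dt.day
df["hour"]  = df["sale_date"].dt.hour
print(df)
```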

5. Handling outliers:
• Outliers are unusually high or low values in the dataset which
are unlikely to occur in normal scenarios.
• Since the outliers could adversely affect the model prediction they
must be handled appropriately.
• Methods of handling outliers include:
❑ Removal: The records containing outliers are removed from the
variable. However, the presence of outliers over multiple variables
could result in losing out on a large portion of the data.
❑ Replacing values: The outliers could alternatively be treated as
missing values and replaced by using appropriate imputation.
❑ Capping: Capping the maximum and minimum values and replacing
them with an arbitrary value.
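• A minimal capping sketch (NumPy assumed installed; the data and percentile thresholds are our own choice):

```python
# Cap outliers at the 5th and 95th percentiles.
import numpy as np

values = np.array([12, 14, 13, 15, 11, 95, 13, 14, 12, -40])   # 95 and -40 look like outliers
low, high = np.percentile(values, [5, 95])
print(np.clip(values, low, high))   # extreme values are pulled back to the thresholds
```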
6. Variable transformations:
• Variable transformation
techniques could help with
normalizing skewed data.
• Skewness is a measure of the
asymmetry of a distribution.
• A distribution is asymmetrical when
its left and right side are not mirror
images.
• Some of the variable transformations are the logarithmic transformation, square root transformation and Box-Cox transformation, which, when applied to heavy-tailed distributions, result in less skewed values.
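• A minimal sketch of the logarithmic transformation (NumPy assumed installed; the incomes are made up):

```python
# Log transformation compresses a heavy right tail.
import numpy as np

incomes = np.array([20_000, 25_000, 30_000, 40_000, 1_000_000])   # strongly right-skewed
print(np.log1p(incomes))   # on the log scale the extreme value no longer dominates
```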
7. Scaling:
• Feature scaling is a method used to normalize the range of
independent variables or features of data.
• The commonly used processes of scaling include:
❑ Min-Max Scaling/Normalization: This process involves the
rescaling of all values in a feature in the range 0 to 1. In other words,
the minimum value in the original range will take the value 0, the
maximum value will take 1 and the rest of the values in between the
two extremes will be appropriately scaled.

❑ Standardization/Variance scaling: Mean is subtracted from every data point and the result is divided by the standard deviation to arrive at a distribution with a 0 mean and variance of 1.
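• A minimal sketch of both scaling methods on the same feature (NumPy assumed installed; the values are made up):

```python
# Min-max scaling vs standardization.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

min_max      = (x - x.min()) / (x.max() - x.min())   # rescaled to the range [0, 1]
standardized = (x - x.mean()) / x.std()              # mean 0, variance 1
print(min_max)
print(standardized)
```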
8. Creating features:
• Feature creation involves deriving new features from existing
ones.
• This can be done by simple mathematical operations such as
aggregations to obtain the mean, median, mode, sum, or difference
and even product of two values.
• These features, although derived directly from the given data, can have an impact on performance when carefully chosen to relate to the target.
• Example:
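A minimal feature-creation sketch (pandas assumed installed; the columns are hypothetical):

```python
# Derive a new feature from existing ones.
import pandas as pd

df = pd.DataFrame({"total_price": [250.0, 90.0, 480.0], "quantity": [5, 3, 8]})
df["price_per_item"] = df["total_price"] / df["quantity"]   # new derived feature
print(df)
```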

