
Unit-3

Contents

• Basic Machine Learning Algorithms


• Association Rule mining
• Linear Regression
• Logistic Regression
• Classifiers – k-Nearest Neighbors (k-NN), k-Means
• Decision Tree – Naive Bayes – Ensemble Methods – Random Forest
• Feature Generation and Feature Selection
• Feature Selection algorithms – Filters; Wrappers; Decision Trees; Random Forests

2
Machine Learning

“Learning denotes changes in a system that ... enable a system to do the same task more
efficiently the next time.” –Herbert Simon

“Learning is constructing or modifying representations of what is being experienced.”


–Ryszard Michalski

“Learning is making useful changes in our minds.” –Marvin Minsky

3
Machine Learning

• Automating automation
• Getting computers to program
themselves
• Writing software is the bottleneck
• Let the data do the work instead!

4
Machine Learning
• Understand and improve efficiency of human learning
o Used to improve methods for teaching and tutoring people (e.g., better computer-aided instruction)

• Discover new things or structures that were previously unknown to humans


o Examples: data mining, scientific discovery

• Fill in skeletal or incomplete specifications about a domain


o Large, complex AI systems cannot be completely derived by hand and require dynamic
updating to incorporate new information.
o Learning new characteristics expands the domain of expertise and lessens the “brittleness” of
the system

• Build software agents that can adapt to their users or to other software agents

5
A general model of learning agents

6
Major paradigms of machine learning

• Rote learning - One-to-one mapping from inputs to stored representation. “Learning by memorization.” Association-based storage and retrieval.

• Induction - Use specific examples to reach general conclusions

• Clustering - Unsupervised identification of natural groups in data

• Analogy - Determine correspondence between two different representations

• Discovery - Unsupervised, specific goal not given

• Genetic algorithms - “Evolutionary” search techniques, based on an analogy to “survival of the fittest”

• Reinforcement - Feedback (positive or negative reward) given at the end of a sequence of steps

7
Machine Learning in practice

• Understanding domain, prior knowledge, and goals


• Data integration, selection, cleaning, pre-processing, etc.
• Learning models
• Interpreting results
• Consolidating and deploying discovered knowledge
• Loop

8
Association Rule Mining
• Association rule mining is a technique used to uncover hidden relationships between variables in
large datasets.

• Association Rule Mining is a method for identifying frequent patterns, correlations, associations, or
causal structures in data sets found in numerous databases such as relational databases,
transactional databases, and other types of data repositories.

• Given a set of transactions, the goal of association rule mining is to find the rules that allow us to
predict the occurrence of a specific item based on the occurrences of the other items in the
transaction.

• An association rule consists of two parts:
   • an Antecedent (if) and
   • a Consequent (then)

9
Association Rule Mining
• An antecedent is something found in the data, and a consequent is something found in combination
with the antecedent.

• Consider the following association rule:

“If a customer buys bread, they are 70% likely to also buy milk.”

Bread is the antecedent in the given association rule, and milk is the consequent.

10
Metrics for evaluating association rules

• Association rules are carefully derived from the dataset. Several metrics are commonly used to
evaluate the performance of association rule mining algorithms. Let us consider the following
transaction table.

Transaction ID    Items purchased
1                 Item1, Item2
2                 Item1, Item3, Item4, Item5
3                 Item2, Item3, Item4, Item6
4                 Item1, Item2, Item3, Item4
5                 Item1, Item2, Item3, Item6

11
Metrics for evaluating association rules
Support
It is the proportion of transactions in the dataset that contain a specific itemset. It indicates
the frequency with which the itemset appears in the data. Higher support values indicate that the rule
is more common or significant in the dataset. Rules with low support are considered less relevant.

Using the transaction table above:

There are five transactions; three of those have Item4 appearing in them.

    Support(Item4) = 3/5

Out of the five transactions, {Item2, Item6} appears together in two transactions.

    Support({Item2, Item6}) = 2/5

12
Confidence
It is the ratio of the number of transactions containing both the antecedent and the
consequent to the number of transactions containing the antecedent.

    Confidence(X → Y) = Support(X → Y) / Support(X)

    Confidence(Item2 → Item6) = Support({Item2, Item6}) / Support(Item2)
                              = (2/5) / (4/5) = 0.5

It is not symmetric, meaning the confidence for 𝑋 → 𝑌 is not the same as 𝑌 → 𝑋

13
Lift
Lift quantifies how likely the consequent is to occur when the antecedent is present
compared to when the two events are independent.

    Lift(X → Y) = Support(X → Y) / (Support(X) × Support(Y))
                = Confidence(X → Y) / Support(Y)

    Lift(Item2 → Item6) = Confidence(Item2 → Item6) / Support(Item6)
                        = 0.5 / (2/5) = 1.25

14
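To make these three metrics concrete, here is a minimal Python sketch (not part of the slides) that computes support, confidence, and lift directly from the five-transaction table above; the function names are illustrative.

transactions = [
    {"Item1", "Item2"},
    {"Item1", "Item3", "Item4", "Item5"},
    {"Item2", "Item3", "Item4", "Item6"},
    {"Item1", "Item2", "Item3", "Item4"},
    {"Item1", "Item2", "Item3", "Item6"},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support of the combined itemset divided by support of the antecedent.
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    # Confidence divided by the consequent's support.
    return confidence(antecedent, consequent) / support(consequent)

print(support({"Item4"}))                # 0.6  (3/5)
print(support({"Item2", "Item6"}))       # 0.4  (2/5)
print(confidence({"Item2"}, {"Item6"}))  # 0.5
print(lift({"Item2"}, {"Item6"}))        # 1.25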
Use Cases of Association Rule Mining
Association rule mining is commonly used in a variety of applications, some common ones are:
1. Market Basket Analysis
One of the most well-known applications of association rule mining is in market basket analysis. This
involves analyzing the items customers purchase together to understand their purchasing habits and
preferences.
For example, a retailer might use association rule mining to discover that customers who
purchase diapers are also likely to purchase baby formula. We can use this information to optimize
product placements and promotions to increase sales.

2. Fraud Detection
We can also use association rule mining to detect fraudulent activity. For example, a credit card
company might use association rule mining to identify patterns of fraudulent transactions, such as
multiple purchases from the same merchant within a short period of time.
We can then use this information to flag potentially fraudulent activity and take
preventative measures to protect customers.

15
3. Customer Segmentation
Association rule mining can also be used to segment customers based on their purchasing habits. For
example, a company might use association rule mining to discover that customers who purchase
certain types of products are more likely to be younger. Similarly, they could learn that customers who
purchase certain combinations of products are more likely to be located in specific geographic regions.
We can use this information to tailor marketing campaigns and personalized
recommendations to specific customer segments.

4. Social network analysis


Various companies use association rule mining to identify patterns in social media data that can
inform the analysis of social networks.
For example, an analysis of Twitter data might reveal that users who tweet about a particular
topic are also likely to tweet about other related topics, which could inform the identification of groups
or communities within the network.

16
5. Recommendation systems
Association rule mining can be used to suggest items that a customer might be interested in based on
their past purchases or browsing history. For example, a music streaming service might use
association rule mining to recommend new artists or albums to a user based on their listening history.

17
Apriori Algorithm
• The Apriori algorithm is a widely used algorithm to find frequent itemsets in a dataset.

• The Apriori algorithm is used to implement Frequent Pattern Mining (FPM). Frequent pattern
mining is a data mining technique to discover frequent patterns or relationships between items in a
dataset.

• Frequent pattern mining involves finding sets of items or itemsets that occur together frequently in
a dataset. These sets of items or itemsets are called frequent patterns, and their frequency is
measured by the number of transactions in which they occur.

• An itemset is a collection of one or more items that appear together in a transaction or dataset. An
itemset can be either a single item, also known as a 1-itemset, or a set of k items, also known as a k-
itemset.

18
• For example, in sales transactions of a retail store, an itemset can be referred to as products
purchased together, such as bread and milk, which would be a 2-item set.

• The Apriori algorithm can be used to discover frequent itemsets in the sales transactions of a retail
store. For instance, the algorithm might discover that customers who purchase bread and milk
together often also purchase eggs. This information can be used to recommend eggs to customers
who purchase bread and milk in the future.

• The Apriori algorithm is called "apriori" because it uses prior knowledge about the frequent
itemsets. The algorithm uses the concept of "apriori property," which states that if an itemset is
frequent, then all of its subsets must also be frequent.

19
Apriori Algorithm: Example
In this example, we will use a minimum support threshold of 3. This means an item set must
appear in at least three transactions to be considered frequent.
TID    Items
T1     {milk, bread}
T2     {bread, sugar}
T3     {bread, butter}
T4     {milk, bread, sugar}
T5     {milk, bread, butter}
T6     {milk, bread, butter}
T7     {milk, sugar}
T8     {milk, sugar}
T9     {sugar, butter}
T10    {milk, sugar, butter}
T11    {milk, bread, butter}

Item      Support (Frequency)
milk      8
bread     7
sugar     5
butter    7

The support for every item is greater than 3, so all items are considered frequent 1-itemsets and will be used to generate candidates for 2-itemsets.

20
All candidates generated from the frequent 1-itemsets identified in the previous step, with their support values:

Candidate Item Sets    Support (Frequency)
{milk, bread}          5
{milk, sugar}          4
{milk, butter}         5
{bread, sugar}         2
{bread, butter}        3
{sugar, butter}        2

Now remove the candidate itemsets that do not meet the minimum support threshold of 3. After this step, the frequent 2-itemsets are {milk, bread}, {milk, sugar}, {milk, butter}, and {bread, butter}:

Candidate Item Sets    Support (Frequency)
{milk, bread}          5
{milk, sugar}          4
{milk, butter}         5
{bread, butter}        3

21
Let's generate candidates for 3-itemsets and calculate their respective support values:

Candidate Item Sets      Support (Frequency)
{milk, bread, sugar}     1
{milk, bread, butter}    3
{milk, sugar, butter}    1

As we can see, only one candidate itemset, {milk, bread, butter}, meets the minimum support threshold of 3. As there is only one frequent 3-itemset, we can't generate candidates for 4-itemsets.

Candidate Item Sets      Support (Frequency)
{milk, bread, butter}    3

22
Now we can write the association rules and their respective metrics, using Confidence(X → Y) = Support(X → Y) / Support(X):

Rule (X → Y)              Support(X → Y)    Support(X)    Confidence
{milk, bread} → butter    3                 5             60%
{bread, butter} → milk    3                 3             100%
{milk, butter} → bread    3                 5             60%

23
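For readers who want to experiment, the following is a minimal, illustrative level-wise Apriori sketch in Python. It is a sketch under stated assumptions rather than the full algorithm from the slides: it reuses the five Item1–Item6 transactions from the earlier metrics slides and assumes a minimum support count of 3.

from itertools import combinations

transactions = [
    {"Item1", "Item2"},
    {"Item1", "Item3", "Item4", "Item5"},
    {"Item2", "Item3", "Item4", "Item6"},
    {"Item1", "Item2", "Item3", "Item4"},
    {"Item1", "Item2", "Item3", "Item6"},
]
MIN_COUNT = 3  # minimum support threshold, expressed as a transaction count

def support_count(itemset):
    # Number of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in transactions)

# Level 1: frequent 1-itemsets.
items = {item for t in transactions for item in t}
frequent = {frozenset([i]) for i in items if support_count(frozenset([i])) >= MIN_COUNT}

all_frequent = {}
k = 1
while frequent:
    all_frequent.update({fs: support_count(fs) for fs in frequent})
    # Join step: candidate (k+1)-itemsets from unions of frequent k-itemsets.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    # Prune step (apriori property): every k-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(sub) in frequent for sub in combinations(c, k))}
    frequent = {c for c in candidates if support_count(c) >= MIN_COUNT}
    k += 1

for itemset in sorted(all_frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), all_frequent[itemset])
# Frequent itemsets found: the single items Item1..Item4 and the pairs
# {Item1, Item2}, {Item1, Item3}, {Item2, Item3}, {Item3, Item4}.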
Apriori Algorithm

24
Correlation and Regression

• Relation between variables where changes in some variables may “explain” or possibly
“cause” changes in other variables.

• Explanatory variables are termed the independent variables and the variables to be
explained are termed the dependent variables.

• Regression model estimates the nature of the relationship between the independent and
dependent variables.
– Change in the dependent variables that results from changes in the independent variables,
i.e., the size of the relationship.
– Strength of the relationship.
– Statistical significance of the relationship.

25
Examples

• Dependent variable is retail price of gasoline – independent variable is the price of crude oil.

• Dependent variable is employment income – independent variables might be hours of work,


education, occupation, sex, age, region, years of experience, unionization status, etc.

• Price of a product and quantity produced or sold:

– Quantity sold affected by price. Dependent variable is quantity of product sold –


independent variable is price.

– Price affected by quantity offered for sale. Dependent variable is price – independent
variable is quantity sold.

26
[Figure: monthly time series from 1981M01 to 2008M01. Crude Oil price index, 1997=100, left axis (0–600); Regular gasoline prices, Regina, cents per litre, right axis (0–160).]
27
Bivariate and multivariate models
Bivariate or simple regression model:
    (Education) x → y (Income)

Multivariate or multiple regression model:
    (Education) x1, (Sex) x2, (Experience) x3, (Age) x4 → y (Income)
    e.g., Y = 0.2*x1 + 0.15*x2 + 0.5*x3 + 0.15*x4 (weights summing to 100%)

Model with a simultaneous relationship:
    Price of wheat ↔ Quantity of wheat produced

28
Bivariate or simple linear regression

• x is the independent variable


• y is the dependent variable
• The regression model is
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀

• The model has two variables, the independent or explanatory variable, x, and the dependent
variable y, the variable whose variation is to be explained.
• The relationship between x and y is a linear or straight line relationship.
• Two parameters to estimate – the slope of the line β1 and the y-intercept β0 (where the line crosses
the vertical axis).
• ε is the unexplained, random, or error component. Much more on this later.

29
Regression line

• The regression model is 𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀


• Data about x and y are obtained from a sample.
• From the sample of values of x and y, estimates b0 of β0 and b1 of β1 are obtained using the least
squares or another method.
• The resulting estimate of the model is
ŷ = b0 + b1·x
• The symbol ŷ is termed “y hat” and refers to the predicted values of the dependent variable y that
are associated with values of x, given the linear model.

30
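As an illustration of how b0 and b1 can be obtained by least squares, here is a short Python sketch; the hours-worked and income values are made up for illustration only and are not the data behind the chart that follows.

import numpy as np

# A minimal least-squares sketch: estimate b0 and b1 in y-hat = b0 + b1*x.
x = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50], dtype=float)
y = np.array([3000, 5500, 8200, 9500, 12500, 14800, 18000, 21000, 24500], dtype=float)

# Closed-form OLS estimates: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x                        # predicted values
residuals = y - y_hat                      # unexplained (error) component
r_squared = 1 - residuals.var() / y.var()  # proportion of variance explained

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, R^2 = {r_squared:.3f}")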
Uses of regression

• Amount of change in a dependent variable that results from changes in the independent
variable(s) – can be used to estimate elasticities, returns on investment in human
capital, etc.
• Attempt to determine causes of phenomena.
• Prediction and forecasting of sales, economic growth, etc.
• Support or negate theoretical model.
• Modify and improve theoretical models and explanations of phenomena.

31
Summer Income as a Function of Hours Worked

[Figure: scatter plot of Income (0–30,000) against Hours per Week (0–60).]

32
R2 = 0.311
Significance = 0.0031

33
34
35
Outliers

• Rare, extreme values may distort the outcome.


• Could be an error.
• Could be a very important observation.
• Outlier: more than 3 standard deviations from the mean.

36
GPA vs. Time Online

[Figure: scatter plot of Time Online (0–12) against GPA (50–100).]

37
Cost Function

38
Gradient Descent
Gradient descent was first proposed by Augustin-Louis Cauchy in the mid-19th century. Gradient descent is one of the most
commonly used iterative optimization algorithms in machine learning, used to train machine learning
and deep learning models. It helps in finding a local minimum of a function.

The behaviour of gradient descent with respect to local minima and maxima can be summarized as follows:

• If we move in the direction of the negative gradient of the function at the current point (away from the
gradient), we move towards a local minimum of that function.
• If we move in the direction of the positive gradient of the function at the current point (towards the
gradient), we move towards a local maximum of that function.

The main objective of using a gradient descent algorithm is to minimize the cost function through iteration. To achieve this goal,
it performs two steps iteratively:
• Calculate the first-order derivative of the function to compute the gradient or slope at the current point.
• Take a step in the direction opposite to the gradient, moving from the current point by alpha times the gradient, where
alpha is the learning rate. The learning rate is a tuning parameter in the optimization process which helps to decide the length of
the steps.
39
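A minimal Python sketch of these two steps, assuming a one-feature linear regression with a mean-squared-error cost; the data points and learning rate below are illustrative choices, not values from the slides.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # roughly y = 2x

b0, b1 = 0.0, 0.0
alpha = 0.01          # learning rate (step-size tuning parameter)

for _ in range(5000):
    y_hat = b0 + b1 * x
    error = y_hat - y
    # Step 1: first-order derivatives of the MSE cost with respect to b0 and b1.
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    # Step 2: move in the direction opposite to the gradient.
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(f"b0 ≈ {b0:.3f}, b1 ≈ {b1:.3f}")  # approaches the least-squares fit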
Logistic Regression

Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function
to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as:

    σ(z) = 1 / (1 + e^(−z))

The sigmoid function σ(z) takes a real value and maps it to the range (0, 1). It is nearly linear
around 0, but extreme values get squashed toward 0 or 1.

40
Logistic Regression
The step from linear regression to logistic regression is fairly straightforward. In the linear
regression model, we model the relationship between the outcome and the features with a linear
equation:

    ŷ = β0 + β1·x1 + ... + βp·xp

For classification, we prefer probabilities between 0 and 1, so we wrap the right-hand side of the equation
in the logistic function. This forces the output to take only values between 0 and 1:

    P(y = 1) = σ(β0 + β1·x1 + ... + βp·xp) = 1 / (1 + e^(−(β0 + β1·x1 + ... + βp·xp)))
41
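A small Python sketch of the sigmoid function and of how logistic regression turns a linear score into a probability; the intercept and slope below are assumed values for illustration, not fitted coefficients.

import numpy as np

def sigmoid(z):
    # Map any real value z into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -4.0, 0.8           # assumed intercept and slope
x = np.array([0.0, 2.0, 5.0, 10.0])

linear_score = beta0 + beta1 * x       # unbounded, like linear regression
probability = sigmoid(linear_score)    # squeezed into (0, 1)
predicted_class = (probability >= 0.5).astype(int)

print(probability)       # approximately [0.018 0.083 0.5 0.982]
print(predicted_class)   # [0 0 1 1]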
Common Distance measures:
Distance measure will determine how the similarity of two elements is calculated.

1. The Euclidean distance


2. Manhattan distance
3. Hamming distance
4. Mahalanobis distance
5. Maximum norm
6. Inner product space

43
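A short Python sketch of three of the listed measures (Euclidean, Manhattan, Hamming), computed on illustrative vectors and strings.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences

# Hamming distance: number of positions at which two equal-length
# sequences differ (commonly used for categorical or binary data).
s1, s2 = "10110", "11100"
hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(euclidean)   # 3.605...
print(manhattan)   # 5.0
print(hamming)     # 2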
K-Means Clustering
• K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the
nearest mean.
• The best number of clusters k leading to the greatest separation (distance) is not known a priori and must be
computed from the data.
• The objective of K-Means clustering is to minimize the total intra-cluster variance, or, the squared error function
J = Σⱼ₌₁ᵏ Σ_{x ∈ Cⱼ} ||x − cⱼ||², where cⱼ is the centroid of cluster Cⱼ.

Algorithm:
1. Clusters the data into k groups where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to
the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each
cluster.
5. Repeat steps 3 and 4 until the same points are assigned
to each cluster in consecutive rounds.

44
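A minimal Python sketch of this algorithm in one dimension, run on the visitor-age example worked out on the following slides (k = 2, initial centroids 16 and 22). Tie-breaking for equidistant points may differ from the slides, but it converges to the same two clusters.

import numpy as np

ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float)
centroids = np.array([16.0, 22.0])

while True:
    # Assignment step: each point goes to its nearest centroid.
    distances = np.abs(ages[:, None] - centroids[None, :])
    labels = distances.argmin(axis=1)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = np.array([ages[labels == k].mean() for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)          # approximately [19.5, 47.89]
print(ages[labels == 0])  # ages 15..28
print(ages[labels == 1])  # ages 35..65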
K-Means Clustering: Example
Suppose we want to group the visitors to a website using just their age as follows:
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
n = 19
Initial clusters (random centroid or average):
k=2
c1 = 16
c2 = 22

45
K-Means Clustering: Example
Iteration 1:
xi    c1    c2    Distance 1    Distance 2    Nearest Cluster
15    16    22    1             7             1
15    16    22    1             7             1
16    16    22    0             6             1
19    16    22    3             3             2
19    16    22    3             3             2
20    16    22    4             2             2
20    16    22    4             2             2
21    16    22    5             1             2
22    16    22    6             0             2
28    16    22    12            6             2
35    16    22    19            13            2
40    16    22    24            18            2
41    16    22    25            19            2
42    16    22    26            20            2
43    16    22    27            21            2
44    16    22    28            22            2
60    16    22    44            38            2
61    16    22    45            39            2
65    16    22    49            43            2

New centroids: c1 = 15.33, c2 = 36.25

46
K-Means Clustering: Example
Iteration 2:
xi    c1       c2       Distance 1    Distance 2    Nearest Cluster
15    15.33    36.25    0.33          21.25         1
15    15.33    36.25    0.33          21.25         1
16    15.33    36.25    0.67          20.25         1
19    15.33    36.25    3.67          17.25         1
19    15.33    36.25    3.67          17.25         1
20    15.33    36.25    4.67          16.25         1
20    15.33    36.25    4.67          16.25         1
21    15.33    36.25    5.67          15.25         1
22    15.33    36.25    6.67          14.25         1
28    15.33    36.25    12.67         8.25          2
35    15.33    36.25    19.67         1.25          2
40    15.33    36.25    24.67         3.75          2
41    15.33    36.25    25.67         4.75          2
42    15.33    36.25    26.67         5.75          2
43    15.33    36.25    27.67         6.75          2
44    15.33    36.25    28.67         7.75          2
60    15.33    36.25    44.67         23.75         2
61    15.33    36.25    45.67         24.75         2
65    15.33    36.25    49.67         28.75         2

New centroids: c1 = 18.56, c2 = 45.9

47
K-Means Clustering: Example
Iteration 3:
xi    c1       c2      Distance 1    Distance 2    Nearest Cluster
15    18.56    45.9    3.56          30.9          1
15    18.56    45.9    3.56          30.9          1
16    18.56    45.9    2.56          29.9          1
19    18.56    45.9    0.44          26.9          1
19    18.56    45.9    0.44          26.9          1
20    18.56    45.9    1.44          25.9          1
20    18.56    45.9    1.44          25.9          1
21    18.56    45.9    2.44          24.9          1
22    18.56    45.9    3.44          23.9          1
28    18.56    45.9    9.44          17.9          1
35    18.56    45.9    16.44         10.9          2
40    18.56    45.9    21.44         5.9           2
41    18.56    45.9    22.44         4.9           2
42    18.56    45.9    23.44         3.9           2
43    18.56    45.9    24.44         2.9           2
44    18.56    45.9    25.44         1.9           2
60    18.56    45.9    41.44         14.1          2
61    18.56    45.9    42.44         15.1          2
65    18.56    45.9    46.44         19.1          2

New centroids: c1 = 19.50, c2 = 47.89

48
K-Means Clustering: Example
Iteration 4:

xi    c1      c2       Distance 1    Distance 2    Nearest Cluster
15    19.5    47.89    4.50          32.89         1
15    19.5    47.89    4.50          32.89         1
16    19.5    47.89    3.50          31.89         1
19    19.5    47.89    0.50          28.89         1
19    19.5    47.89    0.50          28.89         1
20    19.5    47.89    0.50          27.89         1
20    19.5    47.89    0.50          27.89         1
21    19.5    47.89    1.50          26.89         1
22    19.5    47.89    2.50          25.89         1
28    19.5    47.89    8.50          19.89         1
35    19.5    47.89    15.50         12.89         2
40    19.5    47.89    20.50         7.89          2
41    19.5    47.89    21.50         6.89          2
42    19.5    47.89    22.50         5.89          2
43    19.5    47.89    23.50         4.89          2
44    19.5    47.89    24.50         3.89          2
60    19.5    47.89    40.50         12.11         2
61    19.5    47.89    41.50         13.11         2
65    19.5    47.89    45.50         17.11         2

New centroids: c1 = 19.50, c2 = 47.89

No change between iterations 3 and 4 has been noted. By using clustering, 2 groups have been identified: ages 15–28 and ages 35–65.

The initial choice of centroids can affect the output clusters, so the algorithm is often run multiple times with different starting conditions in order to get a fair view of what the clusters should be.

49
K-Means Clustering

[Figure: two scatter plots of points in the X–Y plane illustrating k-means with three clusters k1, k2, k3. Left: pick 3 initial cluster centers (randomly). Right: assign each point to the closest cluster center.]

50
K-Means Clustering

51
K-Means Clustering

52
15, 27, 14, 9, 36, 89, 74, 55, 12, 23

53
Decision Trees
• A decision tree is a simple but powerful learning paradigm. In this method, a set of training examples
is broken down into smaller and smaller subsets while, at the same time, an associated decision tree
gets incrementally developed. At the end of the learning process, a decision tree covering the
training set is returned.

• The decision tree can be thought of as a set of sentences (in Disjunctive Normal Form) written in
propositional logic.

• Some characteristics of problems that are well suited to Decision Tree Learning are:
o Attribute-value paired elements
o Discrete target function
o Disjunctive descriptions (of target function)
o Works well with missing or erroneous training data

54
Decision Trees
• Classify a pattern through a sequence of questions; the next question asked depends on the answer to
the current question

• This approach is particularly useful for non-metric data; questions can be asked in a “yes-no” or
“true-false” style that do not require any notion of metric

• Sequence of questions is displayed in a directed decision tree


• Root node, links or branches, leaf or terminal nodes

• Classification of a pattern begins at the root node and follows the branches until a leaf node is reached;
the pattern is assigned the category of that leaf node

• Benefits of decision trees:
– Interpretability: a tree can be expressed as a logical expression
– Rapid classification: a sequence of simple queries
– Higher accuracy & speed

55
Entropy
• Entropy is a measure of the level of disorder or uncertainty in a given dataset or system. It is a
metric that quantifies the amount of information in a dataset

• Entropy is used to determine the best split at each decision node in the tree by calculating the
reduction in entropy achieved by splitting the dataset based on a specific feature. The feature with
the highest reduction in entropy is chosen as the split point

• A high value of entropy means that the randomness in the system is high, making its state (like the
state of atoms or molecules in a physical system) difficult to predict. On the other hand, if the entropy
is low, predicting that state is much easier.

• The lower this disorder is, the more accurate results/predictions you can get.

    Entropy(S) = Σᵢ₌₁ᶜ −pᵢ log₂(pᵢ), where pᵢ is the proportion of examples in S belonging to class i and c is the number of classes.

56
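A minimal Python sketch of the entropy formula above; the example labels anticipate the Play Tennis data used on the next slides.

import math

def entropy(labels):
    # Entropy of a list of class labels, using base-2 logarithms.
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in proportions)

# Example: a target with 9 "Yes" and 5 "No" instances.
print(entropy(["Yes"] * 9 + ["No"] * 5))   # approximately 0.940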
Information Gain
• Information gain is a measure used in decision trees to determine the usefulness of a feature in
classifying a dataset. It is based on the concept of entropy, where entropy is the measure of
impurity in a dataset.

• Each decision tree node represents a specific feature, and the branches stemming from that node
correspond to the potential values that the feature can take. Information gain is used to determine
which feature to split on at each internal node, such that the resulting subsets of data are as pure as
possible.

• Information gain is determined by subtracting the weighted average of the entropies of the child
nodes from the entropy of the parent node. The formula for information gain can be represented as:

    Information Gain(S, A) = Entropy(S) − Σ p(S|A) · Entropy(S|A)

57
ID3 Algorithm

Day    Outlook     Temperature    Humidity    Wind      Play Tennis?
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

58
Step 1: Calculate Entropy:

To calculate entropy, we first determine the proportion of positive and negative instances in the
dataset:

Positive instances (Play Tennis = Yes): 9


Negative instances (Play Tennis = No): 5
    Entropy(S) = Σᵢ₌₁ᶜ −pᵢ log₂(pᵢ)

Entropy(Play Tennis) = -(9/14) * log2(9/14) – (5/14) * log2(5/14) = 0.940

59
Step 2: Calculating Information Gain:

We calculate the information gain for each attribute (Weather, Temperature, Humidity, Windy) and choose the
attribute with the highest information gain as the root node.

    Information Gain(S, A) = Entropy(S) − Σ p(S|A) · Entropy(S|A)

Let's consider the Wind attribute first:

Information Gain(Play Tennis, Wind) = Entropy(Play Tennis)
    − [p(Play Tennis | Wind=Weak) · Entropy(Play Tennis | Wind=Weak)]
    − [p(Play Tennis | Wind=Strong) · Entropy(Play Tennis | Wind=Strong)]

Wind (Weak) = 8 instances
Wind (Strong) = 6 instances

60
Wind = ‘Weak’
Day    Outlook     Temperature    Humidity    Wind    Play Tennis?
1      Sunny       Hot            High        Weak    No
3      Overcast    Hot            High        Weak    Yes
4      Rain        Mild           High        Weak    Yes
5      Rain        Cool           Normal      Weak    Yes
8      Sunny       Mild           High        Weak    No
9      Sunny       Cool           Normal      Weak    Yes
10     Rain        Mild           Normal      Weak    Yes
13     Overcast    Hot            Normal      Weak    Yes

Entropy(Play Tennis | Wind=Weak) = − p(No)·log₂ p(No) − p(Yes)·log₂ p(Yes)
                                 = − (2/8)·log₂(2/8) − (6/8)·log₂(6/8)
                                 = 0.811

61
Wind = ‘Strong’
Day    Outlook     Temperature    Humidity    Wind      Play Tennis?
2      Sunny       Hot            High        Strong    No
6      Rain        Cool           Normal      Strong    No
7      Overcast    Cool           Normal      Strong    Yes
11     Sunny       Mild           Normal      Strong    Yes
12     Overcast    Mild           High        Strong    Yes
14     Rain        Mild           High        Strong    No

Entropy(Play Tennis | Wind=Strong) = − p(No)·log₂ p(No) − p(Yes)·log₂ p(Yes)
                                   = − (3/6)·log₂(3/6) − (3/6)·log₂(3/6)
                                   = 1

62
Information Gain(Play Tennis, Wind) = Entropy(Play Tennis) –
[p(Play Tennis |Wind=Weak) . Entropy(Play Tennis |Wind=Weak) ] –
[p(Play Tennis |Wind=Strong) . Entropy(Play Tennis |Wind=Strong)]

= 0.940 – [(8/14)*0.811] – [(6/14)*1]


= 0.048

63
The calculations for the Wind attribute are done. Now we apply the same calculations to the other attributes:

Information Gain(Play Tennis, Outlook)
= Entropy(S) – [(5/14) * Entropy(Sunny) + (4/14) * Entropy(Overcast) + (5/14) * Entropy(Rainy)]
= 0.246

Information Gain(Play Tennis, Temperature)
= Entropy(S) – [(4/14) * Entropy(Hot) + (6/14) * Entropy(Mild) + (4/14) * Entropy(Cool)]
= 0.029

Information Gain(Play Tennis, Humidity)
= Entropy(S) – [(7/14) * Entropy(High) + (7/14) * Entropy(Normal)]
= 0.152

64
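The following Python sketch recomputes these information-gain values directly from the 14-row Play Tennis table, as a check on the hand calculations above.

import math

data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}
LABEL = 4

def entropy(rows):
    n = len(rows)
    labels = [r[LABEL] for r in rows]
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(rows, attr_index):
    # Parent entropy minus the weighted average of the child-node entropies.
    total = entropy(rows)
    n = len(rows)
    for value in set(r[attr_index] for r in rows):
        subset = [r for r in rows if r[attr_index] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

for name, idx in ATTRS.items():
    print(f"{name:12s} {information_gain(data, idx):.3f}")
# Outlook ≈ 0.247, Temperature ≈ 0.029, Humidity ≈ 0.152, Wind ≈ 0.048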
The Outlook attribute produces the highest information gain. That's why Outlook will appear in
the root node of the tree.

65
Outlook = ‘Sunny’
Here, there are 5 instances with a sunny outlook.

Day    Outlook    Temperature    Humidity    Wind      Play Tennis?
1      Sunny      Hot            High        Weak      No
2      Sunny      Hot            High        Strong    No
8      Sunny      Mild           High        Weak      No
9      Sunny      Cool           Normal      Weak      Yes
11     Sunny      Mild           Normal      Strong    Yes

P(Play Tennis = No) = 3/5
P(Play Tennis = Yes) = 2/5

Information Gain(Outlook = Sunny | Temperature) = 0.570
Information Gain(Outlook = Sunny | Humidity) = 0.970
Information Gain(Outlook = Sunny | Wind) = 0.019

Now, Humidity is the next decision node because it produces the highest score when the outlook is sunny.

66
Outlook = ‘Sunny’
Day    Outlook    Temperature    Humidity    Wind      Play Tennis?
1      Sunny      Hot            High        Weak      No
2      Sunny      Hot            High        Strong    No
8      Sunny      Mild           High        Weak      No
9      Sunny      Cool           Normal      Weak      Yes
11     Sunny      Mild           Normal      Strong    Yes

Splitting the sunny subset on Humidity:

Humidity = High:
Day    Outlook    Temperature    Humidity    Wind      Play Tennis?
1      Sunny      Hot            High        Weak      No
2      Sunny      Hot            High        Strong    No
8      Sunny      Mild           High        Weak      No

Humidity = Normal:
Day    Outlook    Temperature    Humidity    Wind      Play Tennis?
9      Sunny      Cool           Normal      Weak      Yes
11     Sunny      Mild           Normal      Strong    Yes
67
Outlook = ‘Overcast’
Day    Outlook     Temperature    Humidity    Wind      Play Tennis?
3      Overcast    Hot            High        Weak      Yes
7      Overcast    Cool           Normal      Strong    Yes
12     Overcast    Mild           High        Strong    Yes
13     Overcast    Hot            Normal      Weak      Yes

All four Overcast instances are 'Yes', so this branch becomes a leaf node labelled Yes.

68
Outlook = ‘Rain’
Here, there are 5 instances with a rainy outlook.

Day    Outlook    Temperature    Humidity    Wind      Play Tennis?
4      Rain       Mild           High        Weak      Yes
5      Rain       Cool           Normal      Weak      Yes
6      Rain       Cool           Normal      Strong    No
10     Rain       Mild           Normal      Weak      Yes
14     Rain       Mild           High        Strong    No

P(Play Tennis = No) = ??
P(Play Tennis = Yes) = ??

Information Gain(Outlook = Rain | Temperature) = 0.019973
Information Gain(Outlook = Rain | Humidity) = 0.01997
Information Gain(Outlook = Rain | Wind) = 0.9709

Here, Wind produces the highest score when the outlook is rain. That's why we need to check the Wind attribute at the second level when the outlook is rain.

69
70
71
Step 3: Selecting the Best Attribute:
The "Outlook" attribute has the highest information gain, so we select it as the root node for our
decision tree.

Step 4: Splitting the Dataset:
We split the dataset based on the values of the "Outlook" attribute into three subsets (Sunny,
Overcast, Rain).

Step 5: Repeat the Process:
We repeat the calculation of information gain within each subset, choosing the best remaining attribute
at each node, and stop splitting once a subset is pure; each leaf node is then labelled with the
(majority) class of its subset. The decision tree will look like below:

72
Random Forest
• Classification in random forests employs an ensemble methodology to attain the outcome. The
training data is fed to train various decision trees. This dataset consists of observations and
features that will be selected randomly during the splitting of nodes.

• Every decision tree consists of decision nodes, leaf nodes, and a root node.

• The leaf node of each tree is the final output produced by that specific decision tree.

• The selection of the final output follows the majority-voting system. In this case, the output chosen
by the majority of the decision trees becomes the final output of the random forest system.

73
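A minimal random forest sketch, assuming scikit-learn is available; the iris dataset and parameter values are illustrative only and are not part of the slides.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 decision trees, each trained on a bootstrap sample with a random subset
# of features considered at each split; predictions are combined by majority voting.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_)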
Random Forest

74
Feature Engineering
Feature engineering in Machine learning consists of mainly 5 processes:

1. Feature Creation,
2. Feature Transformation,
3. Feature Extraction,
4. Feature Selection, and
5. Feature Scaling.

75
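As a small illustration of two of these processes (feature scaling and filter-based feature selection), the following sketch assumes scikit-learn is available; the dataset and parameter choices are illustrative.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature scaling: standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Filter-style feature selection: keep the 2 features with the highest
# ANOVA F-score with respect to the target, independent of any model.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)

print("original shape:", X.shape)           # (150, 4)
print("selected shape:", X_selected.shape)  # (150, 2)
print("kept feature indices:", selector.get_support(indices=True))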
