Unit-3 New
Contents
2
Machine Learning
“Learning denotes changes in a system that ... enable a system to do the same task more
efficiently the next time.” –Herbert Simon
3
Machine Learning
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
4
Machine Learning
• Understand and improve the efficiency of human learning
o Used to improve methods for teaching and tutoring people (e.g., better computer-aided instruction)
• Build software agents that can adapt to their users or to other software agents
5
A general model of learning agents
6
Major paradigms of machine learning
• Reinforcement - Feedback (positive or negative reward) given at the end of a sequence of steps
7
Machine Learning in practice
8
Association Rule Mining
• Association rule mining is a technique used to uncover hidden relationships between variables in
large datasets.
• Association Rule Mining is a method for identifying frequent patterns, correlations, associations, or
causal structures in data sets found in numerous databases such as relational databases,
transactional databases, and other types of data repositories.
• Given a set of transactions, the goal of association rule mining is to find the rules that allow us to
predict the occurrence of a specific item based on the occurrences of the other items in the
transaction.
9
Association Rule Mining
• An antecedent is an item (or itemset) found in the data, and a consequent is an item that occurs in combination with the antecedent.
In the association rule {bread} → {milk}, bread is the antecedent and milk is the consequent.
10
Metrics for evaluating association rules
• Association rules are carefully derived from the dataset. Several metrics are commonly used to
evaluate the performance of association rule mining algorithms. Let us consider the following
transaction table.
Transaction ID   Items purchased
1                Item1, Item2
2                Item1, Item3, Item4, Item5
3                Item2, Item3, Item4, Item6
4                Item1, Item2, Item3, Item4
5                Item1, Item2, Item3, Item6
11
Metrics for evaluating association rules
Support
It is the proportion of transactions in the dataset that contain a specific itemset, i.e. Support(X) = (number of transactions containing X) / (total number of transactions). It indicates the frequency with which the itemset appears in the data. Higher support values indicate that the rule is more common or significant in the dataset. Rules with low support are considered less relevant.
12
Confidence
It is the ratio of the number of transactions containing both the antecedent and the consequent to the number of transactions containing the antecedent.

Confidence(X → Y) = Support(X → Y) / Support(X)

Confidence(Item2 → Item6) = (2/5) / (4/5) = 0.5
13
Lift
Lift quantifies how likely the consequent is to occur when the antecedent is present
compared to when the two events are independent.
Lift(X → Y) = Support(X → Y) / (Support(X) × Support(Y)) = Confidence(X → Y) / Support(Y)

Lift(Item2 → Item6) = 0.5 / (2/5) = 1.25
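To make the three metrics concrete, here is a small Python sketch (plain Python, no libraries; the function names are just illustrative) that recomputes Support, Confidence and Lift for the rule Item2 → Item6 from the five-transaction table above.

transactions = [
    {"Item1", "Item2"},
    {"Item1", "Item3", "Item4", "Item5"},
    {"Item2", "Item3", "Item4", "Item6"},
    {"Item1", "Item2", "Item3", "Item4"},
    {"Item1", "Item2", "Item3", "Item6"},
]

def support(itemset):
    # fraction of transactions that contain every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support(X -> Y) / Support(X)
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # Confidence(X -> Y) / Support(Y)
    return confidence(antecedent, consequent) / support(consequent)

X, Y = {"Item2"}, {"Item6"}
print(support(X | Y))      # 0.4  (2 of 5 transactions)
print(confidence(X, Y))    # 0.5  ((2/5) / (4/5))
print(lift(X, Y))          # 1.25 (0.5 / (2/5))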
14
Use Cases of Association Rule Mining
Association rule mining is commonly used in a variety of applications, some common ones are:
1. Market Basket Analysis
One of the most well-known applications of association rule mining is in market basket analysis. This
involves analyzing the items customers purchase together to understand their purchasing habits and
preferences.
For example, a retailer might use association rule mining to discover that customers who
purchase diapers are also likely to purchase baby formula. We can use this information to optimize
product placements and promotions to increase sales.
2. Fraud Detection
We can also use association rule mining to detect fraudulent activity. For example, a credit card
company might use association rule mining to identify patterns of fraudulent transactions, such as
multiple purchases from the same merchant within a short period of time.
We can then use this information to flag potentially fraudulent activity and take preventative measures to protect customers.
15
3. Customer Segmentation
Association rule mining can also be used to segment customers based on their purchasing habits. For
example, a company might use association rule mining to discover that customers who purchase
certain types of products are more likely to be younger. Similarly, they could learn that customers who
purchase certain combinations of products are more likely to be located in specific geographic regions.
We can use this information to tailor marketing campaigns and personalized
recommendations to specific customer segments.
16
4. Recommendation systems
Association rule mining can be used to suggest items that a customer might be interested in based on
their past purchases or browsing history. For example, a music streaming service might use
association rule mining to recommend new artists or albums to a user based on their listening history.
17
Apriori Algorithm
• The Apriori algorithm is a widely used algorithm to find frequent itemsets in a dataset.
• The Apriori algorithm is used to implement Frequent Pattern Mining (FPM). Frequent pattern
mining is a data mining technique to discover frequent patterns or relationships between items in a
dataset.
• Frequent pattern mining involves finding sets of items or itemsets that occur together frequently in
a dataset. These sets of items or itemsets are called frequent patterns, and their frequency is
measured by the number of transactions in which they occur.
• An itemset is a collection of one or more items that appear together in a transaction or dataset. An
itemset can be either a single item, also known as a 1-itemset, or a set of k items, also known as a k-
itemset.
18
• For example, in the sales transactions of a retail store, an itemset can refer to products purchased together, such as bread and milk, which would be a 2-itemset.
• The Apriori algorithm can be used to discover frequent itemsets in the sales transactions of a retail
store. For instance, the algorithm might discover that customers who purchase bread and milk
together often also purchase eggs. This information can be used to recommend eggs to customers
who purchase bread and milk in the future.
• The Apriori algorithm is called "apriori" because it uses prior knowledge about the frequent
itemsets. The algorithm uses the concept of "apriori property," which states that if an itemset is
frequent, then all of its subsets must also be frequent.
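The following is a minimal sketch of the Apriori idea in plain Python, written for illustration only (the toy transactions below are invented and are not the dataset of the worked example that follows). Candidate k-itemsets are generated only from frequent (k−1)-itemsets, which is exactly the pruning the apriori property allows.

from itertools import combinations

def apriori(transactions, min_support_count):
    # return all frequent itemsets (as frozensets) with their support counts
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    current = {s for s in items
               if sum(s <= t for t in transactions) >= min_support_count}
    k = 1
    while current:
        # record the support counts of the frequent k-itemsets
        for s in current:
            frequent[s] = sum(s <= t for t in transactions)
        # candidate (k+1)-itemsets: unions of frequent k-itemsets, kept only
        # if every k-subset is frequent (the apriori property)
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k))}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support_count}
        k += 1
    return frequent

transactions = [frozenset(t) for t in (
    {"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
    {"milk", "eggs"}, {"bread", "milk", "eggs"})]
for itemset, count in sorted(apriori(transactions, 2).items(),
                             key=lambda x: (len(x[0]), -x[1])):
    print(set(itemset), count)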
19
Apriori Algorithm: Example
In this example, we will use a minimum support threshold of 3. This means an item set must
appear in at least three transactions to be considered frequent.
TID   Items
T1    {milk, bread}
T2    {bread, sugar}
T3    {bread, butter}

Item   Support (Frequency)
20
All candidates generated from the frequent 1-itemsets identified in the previous step, and their support values:

Candidate Item Sets   Support (Frequency)
{milk, bread}         5
{milk, sugar}         4
{milk, butter}        5
{bread, sugar}        2
{bread, butter}       3
{sugar, butter}       2

Now remove the candidate item sets that do not meet the minimum support threshold of 3. After this step, the frequent 2-itemsets are {milk, bread}, {milk, sugar}, {milk, butter}, and {bread, butter}:

Candidate Item Sets   Support (Frequency)
{milk, bread}         5
{milk, sugar}         4
{milk, butter}        5
{bread, butter}       3
21
Next, let’s generate candidates for 3-itemsets and calculate their respective support values.
22
Now we can write the association rules and their respective metrics:

Candidate Item Sets       Support(X → Y)   Support(X)   Confidence
{milk, bread} → butter    3                5            60%
{bread, butter} → milk    3                3            100%
{milk, butter} → bread    3                5            60%

Confidence(X → Y) = Support(X → Y) / Support(X)
23
Apriori Algorithm
24
Correlation and Regression
• Relation between variables where changes in some variables may “explain” or possibly
“cause” changes in other variables.
• Explanatory variables are termed the independent variables and the variables to be
explained are termed the dependent variables.
• Regression model estimates the nature of the relationship between the independent and
dependent variables.
– Change in dependent variables that results from changes in independent variables, i.e. the size of the relationship.
– Strength of the relationship.
– Statistical significance of the relationship.
25
Examples
• Dependent variable is retail price of gasoline – independent variable is the price of crude oil.
– Price affected by quantity offered for sale. Dependent variable is price – independent
variable is quantity sold.
26
Chart: Regular gasoline prices, Regina, cents per litre, monthly, 1981M01–2008M01 (right axis).
27
Bivariate and multivariate models
Bivariate or simple regression model: x (Education) → y (Income)

Multivariate model: x1 (Education), x2 (Sex), x3 (Experience), x4 (Age) → y (Income),
e.g. Y = 0.2·x1 + 0.15·x2 + 0.5·x3 + 0.15·x4 (the weights sum to 100%)
28
Bivariate or simple linear regression
• The model has two variables: the independent or explanatory variable, x, and the dependent variable, y, whose variation is to be explained.
• The relationship between x and y is a linear or straight-line relationship: y = β0 + β1x + ε.
• Two parameters to estimate – the slope of the line β1 and the y-intercept β0 (where the line crosses the vertical axis).
• ε is the unexplained, random, or error component. Much more on this later.
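As a brief illustration, the sketch below estimates β0 and β1 by ordinary least squares with NumPy; the (x, y) values are invented purely for demonstration, and the R² shown is the usual 1 − SSres/SStot.

import numpy as np

# toy data: x = years of education, y = income in thousands (invented values)
x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20])
y = np.array([22, 27, 31, 33, 38, 43, 41, 49, 55])

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
beta1, beta0 = np.polyfit(x, y, 1)
residuals = y - (beta0 + beta1 * x)    # estimates of the error term ε

print(f"y = {beta0:.2f} + {beta1:.2f} * x")
print("R^2 =", 1 - residuals.var() / y.var())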
29
Regression line
30
Uses of regression
• Amount of change in a dependent variable that results from changes in the independent
variable(s) – can be used to estimate elasticities, returns on investment in human
capital, etc.
• Attempt to determine causes of phenomena.
• Prediction and forecasting of sales, economic growth, etc.
• Support or negate theoretical model.
• Modify and improve theoretical models and explanations of phenomena.
31
Summer Income as a Function of Hours Worked
Scatter plot: Income (0–30000, vertical axis) versus Hours per Week (0–60, horizontal axis).
32
R² = 0.311
Significance = 0.0031
33
34
35
Outliers
36
GPA vs. Time Online
Scatter plot: Time Online (vertical axis) versus GPA (50–100, horizontal axis).
37
Cost Function
38
Gradient Descent
Gradient descent was initially discovered by "Augustin-Louis Cauchy" in mid of 18th century. Gradient Descent is defined
as one of the most commonly used iterative optimization algorithms of machine learning to train the machine learning
and deep learning models. It helps in finding the local minimum of a function.
Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic (sigmoid) function is defined as:

σ(z) = 1 / (1 + e^(−z))

It takes a real value and maps it into the range (0, 1). It is nearly linear around 0, while large positive or negative values are squashed toward 1 or 0.
40
Logistic Regression
The step from linear regression to logistic regression is fairly straightforward. In the linear regression model, we modelled the relationship between the outcome and the features with a linear equation:

ŷ = β0 + β1·x1 + ... + βp·xp

For classification, we prefer probabilities between 0 and 1, so we wrap the right side of the equation in the logistic function. This forces the output to take only values between 0 and 1:

P(y = 1) = 1 / (1 + exp(−(β0 + β1·x1 + ... + βp·xp)))
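A minimal sketch of this step, with invented weights and one example feature vector (in practice the weights would be learned from data, e.g. by gradient descent on the likelihood):

import numpy as np

def sigmoid(z):
    # logistic function: squeezes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

beta0 = -1.0                        # intercept (invented for illustration)
beta = np.array([0.8, -0.4])        # weights for two features (invented)
x = np.array([2.5, 1.0])            # one example

linear_part = beta0 + beta @ x      # same right-hand side as linear regression
probability = sigmoid(linear_part)  # now guaranteed to lie between 0 and 1
label = int(probability >= 0.5)     # classify with a 0.5 threshold

print(probability, label)           # about 0.645 -> class 1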
41
Common Distance measures:
The distance measure determines how the similarity of two elements is calculated.
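For example, two of the most commonly used distance measures are the Euclidean and Manhattan distances; the short sketch below computes both for an arbitrary pair of points.

import numpy as np

a = np.array([2.0, 3.0, 5.0])
b = np.array([6.0, 1.0, 4.0])

euclidean = np.sqrt(((a - b) ** 2).sum())   # square root of summed squared differences
manhattan = np.abs(a - b).sum()             # sum of absolute differences

print(euclidean)   # about 4.58
print(manhattan)   # 7.0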
43
K-Means Clustering
• K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean.
• The best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data.
• The objective of K-Means clustering is to minimize the total intra-cluster variance, i.e. the squared error function J = Σⱼ₌₁ᵏ Σ_{xᵢ ∈ Cⱼ} ||xᵢ − μⱼ||², where μⱼ is the centroid of cluster Cⱼ.
Algorithm:
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
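A compact sketch of these steps in plain Python, applied to the one-dimensional age data and the starting centroids (16 and 22) used in the example on the following slides; it is illustrative only, not a reference implementation.

ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
centroids = [16.0, 22.0]                      # initial cluster centers (k = 2)

while True:
    # assignment step: each point goes to the nearest centroid
    clusters = [[], []]
    for x in ages:
        nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
        clusters[nearest].append(x)
    # update step: each centroid becomes the mean of its cluster
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:            # stop when centroids (and assignments) no longer change
        break
    centroids = new_centroids

print(centroids)
print(clusters)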
44
K-Means Clustering: Example
Suppose we want to group the visitors to a website using just their age as follows:
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
n = 19
Initial clusters (random centroid or average):
k=2
c1 = 16
c2 = 22
45
K-Means Clustering: Example
Iteration 1:

xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
15   16   22   1            7            1
15   16   22   1            7            1                 15.33
16   16   22   0            6            1
19   16   22   3            3            2
19   16   22   3            3            2
20   16   22   4            2            2
20   16   22   4            2            2
21   16   22   5            1            2
22   16   22   6            0            2
28   16   22   12           6            2
35   16   22   19           13           2                 36.25
40   16   22   24           18           2
41   16   22   25           19           2
42   16   22   26           20           2
43   16   22   27           21           2
44   16   22   28           22           2
60   16   22   44           38           2
61   16   22   45           39           2
65   16   22   49           43           2
46
K-Means Clustering: Example
Iteration 2:
xi   c1   c2   Distance 1   Distance 2   Nearest Cluster   New Centroid
47
K-Means Clustering: Example
Iteration 3:
xi c1 c2 Distance 1 Distance 2 Nearest Cluster New Centroid
48
K-Means Clustering: Example
Iteration 4:
49
K-Means Clustering
Two plots of points in the X–Y plane with clusters k1, k2, k3: pick 3 initial cluster centers (randomly), then assign each point to the closest cluster center.
50
K-Means Clustering
51
K-Means Clustering
52
15, 27, 14, 9, 36, 89, 74, 55, 12, 23
53
Decision Trees
• The decision tree is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences (in Disjunctive Normal Form) written in propositional logic.
• Some characteristics of problems that are well suited to Decision Tree Learning are:
o Instances described by attribute-value pairs
o Discrete target function
o Disjunctive descriptions (of the target function)
o Works well with missing or erroneous training data
54
Decision Trees
• Classify a pattern through a sequence of questions; the next question asked depends on the answer to the current question
• This approach is particularly useful for non-metric data; questions can be asked in a “yes-no” or “true-false” style that does not require any notion of a metric
55
Entropy
• Entropy is a measure of the level of disorder or uncertainty in a given dataset or system. It is a metric that quantifies the amount of information in a dataset.
• Entropy is used to determine the best split at each decision node in the tree by calculating the reduction in entropy achieved by splitting the dataset on a specific feature. The feature with the highest reduction in entropy is chosen as the split point.
• A high value of entropy means that the randomness in the system is high, making its state difficult to predict. On the other hand, if the entropy is low, predicting that state is much easier.
• The lower this disorder, the more accurate the results/predictions you can get.

Entropy(S) = − Σᵢ₌₁ᶜ pᵢ · log₂(pᵢ), where c is the number of classes and pᵢ is the proportion of instances belonging to class i.
56
Information Gain
• Information gain is a measure used in decision trees to determine the usefulness of a feature in
classifying a dataset. It is based on the concept of entropy, where entropy is the measure of
impurity in a dataset.
• Each decision tree node represents a specific feature, and the branches stemming from that node
correspond to the potential values that the feature can take. Information gain is used to determine
which feature to split on at each internal node, such that the resulting subsets of data are as pure as
possible.
• Information gain is determined by subtracting the weighted average of the entropies of the child nodes from the entropy of the parent node. The formula for information gain can be represented as:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)
57
ID3 Algorithm
Day   Outlook   Temperature   Humidity   Wind   Play Tennis?
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
58
Step 1: Calculate Entropy:
To calculate entropy, we first determine the proportion of positive and negative instances in the dataset: 9 Yes and 5 No out of 14 instances.

Entropy(Play Tennis) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) ≈ 0.940
Step 2: Calculating Information Gain:
We calculate the information gain for each attribute (Outlook, Temperature, Humidity, Wind) and choose the attribute with the highest information gain as the root node.

Wind = Weak: 8 instances
Wind = Strong: 6 instances
60
Wind = ‘Weak’

Day   Outlook    Temperature   Humidity   Wind   Play Tennis?
1     Sunny      Hot           High       Weak   No
3     Overcast   Hot           High       Weak   Yes
4     Rain       Mild          High       Weak   Yes
5     Rain       Cool          Normal     Weak   Yes
8     Sunny      Mild          High       Weak   No
9     Sunny      Cool          Normal     Weak   Yes
10    Rain       Mild          Normal     Weak   Yes
13    Overcast   Hot           Normal     Weak   Yes

Entropy(Play Tennis | Wind=Weak) = − p(No)·log2 p(No) − p(Yes)·log2 p(Yes)
                                 = − (2/8)·log2(2/8) − (6/8)·log2(6/8)
                                 = 0.811
61
Wind = ‘Strong’

Day   Outlook    Temperature   Humidity   Wind     Play Tennis?
2     Sunny      Hot           High       Strong   No
6     Rain       Cool          Normal     Strong   No
7     Overcast   Cool          Normal     Strong   Yes
11    Sunny      Mild          Normal     Strong   Yes
12    Overcast   Mild          High       Strong   Yes
14    Rain       Mild          High       Strong   No

Entropy(Play Tennis | Wind=Strong) = − p(No)·log2 p(No) − p(Yes)·log2 p(Yes)
                                   = − (3/6)·log2(3/6) − (3/6)·log2(3/6)
                                   = 1
62
Information Gain(Play Tennis, Wind) = Entropy(Play Tennis)
   − [p(Wind=Weak) · Entropy(Play Tennis | Wind=Weak)]
   − [p(Wind=Strong) · Entropy(Play Tennis | Wind=Strong)]
   = 0.940 − (8/14)·0.811 − (6/14)·1 ≈ 0.048
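As a quick check of the hand calculation, the sketch below recomputes the dataset entropy and the information gain of Wind directly from the Play Tennis table in plain Python.

from math import log2

# Play Tennis labels and Wind values for days 1..14 (from the table above)
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
wind   = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
          "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]

def entropy(ys):
    # - sum of p_i * log2(p_i) over the classes present in ys
    total = len(ys)
    return -sum((ys.count(c) / total) * log2(ys.count(c) / total)
                for c in set(ys))

def information_gain(ys, attribute_values):
    # parent entropy minus the weighted entropies of the child subsets
    gain = entropy(ys)
    for v in set(attribute_values):
        subset = [y for y, a in zip(ys, attribute_values) if a == v]
        gain -= (len(subset) / len(ys)) * entropy(subset)
    return gain

print(entropy(labels))                  # about 0.940
print(information_gain(labels, wind))   # about 0.048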
63
The calculations for the Wind attribute are complete. Now we apply the same calculations to the other attributes:
64
Gain(Play Tennis, Outlook) = 0.246, Gain(Play Tennis, Humidity) = 0.151, Gain(Play Tennis, Wind) = 0.048, Gain(Play Tennis, Temperature) = 0.029.
The Outlook attribute produces the highest score. That is why the Outlook decision will appear in the root node of the tree.
65
Outlook = ‘Sunny’
Here, there are 5 instances for sunny outlook.

Day   Outlook   Temperature   Humidity   Wind     Play Tennis?
1     Sunny     Hot           High       Weak     No
2     Sunny     Hot           High       Strong   No
8     Sunny     Mild          High       Weak     No
9     Sunny     Cool          Normal     Weak     Yes
11    Sunny     Mild          Normal     Strong   Yes

P(Play Tennis = No) = 3/5
P(Play Tennis = Yes) = 2/5

Information Gain (Outlook = Sunny | Temperature) = 0.570
Information Gain (Outlook = Sunny | Humidity) = 0.970
66
Outlook = ‘Sunny’
For the sunny subset, Humidity has the highest information gain (0.970), so Humidity is chosen as the decision node on this branch.
67
Outlook = ‘Overcast’

Day   Outlook    Temperature   Humidity   Wind     Play Tennis?
3     Overcast   Hot           High       Weak     Yes
7     Overcast   Cool          Normal     Strong   Yes
12    Overcast   Mild          High       Strong   Yes
13    Overcast   Hot           Normal     Weak     Yes

All four instances with an overcast outlook are Yes, so this branch becomes a leaf node predicting Yes.
68
Outlook = ‘Rain’
Here, there are 5 instances for rain outlook.

Day   Outlook   Temperature   Humidity   Wind     Play Tennis?
4     Rain      Mild          High       Weak     Yes
5     Rain      Cool          Normal     Weak     Yes
6     Rain      Cool          Normal     Strong   No
10    Rain      Mild          Normal     Weak     Yes
14    Rain      Mild          High       Strong   No

P(Play Tennis = No) = ??
P(Play Tennis = Yes) = ??

Information Gain (Outlook = Rain | Temperature) = 0.019973
Information Gain (Outlook = Rain | Humidity) = 0.01997
Information Gain (Outlook = Rain | Wind) = 0.9709

Here, Wind produces the highest score when the outlook is rain. That is why we need to check the Wind attribute at the second level when the outlook is rain.
69
70
71
Step 3: Selecting the Best Attribute:
The “Outlook” attribute has the highest information gain, so we select it as the root node for our decision tree.
72
Random Forest
• Classification in random forests employs an ensemble methodology to attain the outcome. The
training data is fed to train various decision trees. This dataset consists of observations and
features that will be selected randomly during the splitting of nodes.
• Every decision tree consists of decision nodes, leaf nodes, and a root node.
• The leaf node of each tree is the final output produced by that specific decision tree.
• The selection of the final output follows a majority-voting system. In this case, the output chosen by the majority of the decision trees becomes the final output of the random forest system.
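For completeness, here is a short sketch of how a random forest classifier is typically used in practice, assuming scikit-learn is available and using its built-in Iris dataset purely as example data.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# an ensemble of 100 decision trees; each tree sees a bootstrap sample of the rows
# and a random subset of the features at every split
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))   # majority vote over the trees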
73
Random Forest
74
Feature Engineering
Feature engineering in Machine learning consists of mainly 5 processes:
1. Feature Creation,
2. Feature Transformation,
3. Feature Extraction,
4. Feature Selection, and
5. Feature Scaling.
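As a small illustration of the last of these processes, the sketch below standardizes a numeric feature (zero mean, unit variance) with NumPy; the raw feature values are invented.

import numpy as np

ages = np.array([15.0, 22.0, 35.0, 41.0, 60.0])   # invented raw feature values

# feature scaling by standardization: (x - mean) / standard deviation
scaled = (ages - ages.mean()) / ages.std()

print(scaled.round(2))   # values now centered at 0 with unit variance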
75