Data Mining Unit 3
There are two forms of data analysis that can be used to extract models
describing important classes or to predict future data trends. These two forms
are as follows −
Classification
Prediction
What is classification?
Following are the examples of cases where the data analysis task is
Classification −
A bank loan officer wants to analyze the data in order to know which
customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze whether a customer with a
given profile will buy a new computer.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer
will spend during a sale at his company. In this example, the task is to
predict a numeric value. Therefore the data analysis task is an example of
numeric prediction. In this case, a model or a predictor is constructed that
predicts a continuous-valued function, or ordered value.
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The data classification process
includes two steps −
Building the Classifier or Model − This is the learning step. The classification
algorithm builds the classifier by analyzing a training set made up of
database tuples and their associated class labels.
Using the Classifier for Classification − In this step, the classifier is used for
classification. Here the test data is used to estimate the accuracy of the
classification rules. The classification rules can be applied to the new data
tuples if the accuracy is considered acceptable.
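As a hedged illustration of these two steps, here is a minimal Python sketch assuming scikit-learn and a synthetic dataset (both are illustrative choices, not part of the original example): the learning step builds the classifier from labeled training tuples, and the classification step estimates its accuracy on held-out test data.

# Minimal sketch of the two-step classification process (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled tuples: X holds attribute values, y holds class labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1 (learning): build the classifier from the training tuples.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Step 2 (classification): estimate accuracy on test data before
# applying the model to new tuples.
y_pred = clf.predict(X_test)
print("Estimated accuracy:", accuracy_score(y_test, y_pred))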
Classification and Prediction Issues
The major issue is preparing the data for classification and prediction.
Preparing the data involves the following activities −
Data Cleaning − This involves removing noise and treating missing values. Noise
is removed by applying smoothing techniques, and missing values are replaced
with the most commonly occurring value for that attribute.
Relevance Analysis − The database may contain irrelevant attributes. Correlation
analysis is used to check whether two given attributes are related.
Data Transformation and Reduction − The data can be transformed by methods such
as normalization or generalization so that it is suitable for mining.
Here are the criteria for comparing the methods of classification and prediction −
Accuracy − The accuracy of a classifier refers to its ability to predict the
class label correctly; the accuracy of a predictor refers to how well a given
predictor can guess the value of the predicted attribute for new data.
Speed − This refers to the computational cost of generating and using the
classifier or predictor.
Robustness − This refers to the ability of the classifier or predictor to make
correct predictions from noisy data.
Scalability − This refers to the ability to construct the classifier or
predictor efficiently given a large amount of data.
Interpretability − This refers to the extent to which the classifier or
predictor is understood.
The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents a class.
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or splitting subset.
Output:
A Decision Tree
Method
create a node N;
if the tuples in D are all of the same class C then
   return N as a leaf node labeled with the class C;
if attribute_list is empty then
   return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find
   the best splitting_criterion, and label node N with it;
if splitting_attribute is discrete-valued and multiway splits are allowed then
   attribute_list = attribute_list - splitting_attribute;
for each outcome j of splitting_criterion
   let Dj be the set of data tuples in D satisfying outcome j;
   if Dj is empty then
      attach a leaf labeled with the majority class in D to node N;
   else
      attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
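The following Python sketch mirrors the pseudocode above. It assumes the training tuples are plain dictionaries and uses information gain (entropy reduction) as the Attribute_selection_method; the function and variable names (generate_decision_tree, information_gain, and so on) are illustrative rather than taken from any library.

# Hedged sketch of the decision-tree induction pseudocode above.
# Assumes tuples are dicts, e.g. {"age": "youth", "income": "high", "buys_computer": "no"}.
import math
from collections import Counter

def entropy(tuples, label):
    # Uncertainty of the class label within a partition.
    counts = Counter(t[label] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def majority_class(tuples, label):
    return Counter(t[label] for t in tuples).most_common(1)[0][0]

def information_gain(tuples, attr, label):
    # Reduction in entropy obtained by splitting on attr.
    gain = entropy(tuples, label)
    total = len(tuples)
    for value in {t[attr] for t in tuples}:
        part = [t for t in tuples if t[attr] == value]
        gain -= len(part) / total * entropy(part, label)
    return gain

def generate_decision_tree(tuples, attributes, label):
    classes = {t[label] for t in tuples}
    if len(classes) == 1:                      # all tuples in the same class
        return classes.pop()
    if not attributes:                         # no attributes left: majority voting
        return majority_class(tuples, label)
    split = max(attributes, key=lambda a: information_gain(tuples, a, label))
    node = {split: {}}
    for value in {t[split] for t in tuples}:   # one branch per outcome
        partition = [t for t in tuples if t[split] == value]
        remaining = [a for a in attributes if a != split]
        node[split][value] = generate_decision_tree(partition, remaining, label)
    return node

# Example call (hypothetical data and attribute names):
# tree = generate_decision_tree(training_tuples, ["age", "income", "student"], "buys_computer")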
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due
to noise or outliers. The pruned trees are smaller and less complex.
Cost Complexity
The cost complexity is measured by the following two parameters −
the number of leaves in the tree, and
the error rate of the tree.
In this section, we will learn about tree pruning in data mining, but first, let
us briefly review what a decision tree is.
Decision Tree
A decision tree is an algorithm that is used for classification and prediction. It describes
rules in the form of a tree and presents them visually for straightforward
interpretation and understanding. It represents the decision-making
process graphically and helps to make decisions easily.
It contains three types of nodes: root, branch, and leaf nodes. The root node is the first
decision node, where the main question is asked. The branch node is an intermediate node
that helps in answering the main question asked at the root node. The leaf node is the
terminal node, which gives the final answer.
Two factors are used to draw a decision tree. The first factor is information gain, which
measures how much information the answer to a specific question provides. The second
factor is entropy, which measures how much uncertainty is present in the information.
After constructing the tree, it can be pruned to prevent overfitting. Pruning removes
branches so that the tree generalizes better to new data. We will discuss pruning in more
detail further in this section.
Example
Let us consider a dataset that records, for each day, the Weather (Sunny, Cloudy, or
Rainy), the Temperature (Hot, Mild, or Cool), and whether cricket was played outside
(Yes or No).
From this dataset, we will construct the decision tree and check whether cricket
can be played outside or not.
Below is the decision tree built from this dataset:
In the decision tree constructed above, as you can see, the root node is "Weather", as it
is the initial node used for making the first decision. It is the node where the main
question is asked, and the question is whether to play cricket outside or not.
In the decision tree, the decision nodes are "Weather" and "Temperature". The branch
nodes are "Sunny", "Cloudy", "Rainy", "Hot", "Mild", and "Cool". The leaf nodes are
"Yes" or "No".
Now, let us make decisions with the help of the decision tree. Starting at the root node,
we choose the branch that matches the day's conditions, and continue choosing branches
until we reach a leaf.
Let us decide for Day 1; the weather is sunny, so we will choose the branch node
"Sunny". After that, we will look for further decision nodes. The next decision node is
"Temperature", which is split into three branches: "Hot", "Mild", and "Cool". For Day 1,
we will choose the branch node "Mild", which has a leaf node "Yes". "Mild" weather is
suitable for playing cricket outside, which means we can play cricket outside on Day 1.
Now, let us decide for Day 2; the weather is rainy, so we will choose the branch node
"Rainy". This branch node has a leaf node that says "No", meaning we cannot play
cricket outside on Day 2.
Now, let us decide for Day 3; the weather is cloudy, so we will choose the branch
node "Cloudy". After that, we will look for further decision nodes. The next decision
node is "Temperature", which is split into three branches: "Hot", "Mild", and "Cool". For Day 3, we
will choose the branch node "Mild", which has a leaf node "Yes". So, we can play cricket
outside on Day 3.
Similarly, we can decide for Day 4 and Day 5. We have seen how easy it is to
make decisions and predict the answer using the decision tree.
Tree Pruning
When a decision tree is constructed, a tree-growing algorithm is used to build the tree.
Noise in the training data creates abnormalities in various branches of the tree while it
is being constructed. The tree pruning technique addresses this issue by removing such
branches.
In data mining, tree pruning is the technique that is used to decrease the size of the
decision tree model without lowering its accuracy. It improves the decision tree model
and decreases overfitting by removing certain branches from the fully grown tree. It
removes the abnormalities present in the training data due to noise. Trees that are
pruned are smaller in size and simple to understand.
A natural question is: what is the optimal size of the final tree? A tree that is
too large risks overfitting the training data, while a tree that is too small may miss
essential structural information.
Pre-pruning Approach
Pre-pruning is also called forward pruning or early stopping. The approach puts
constraints on the decision tree before it is constructed. In pre-pruning, the tree-building
process is halted before the tree becomes complex. It helps deal with the issue of
overfitting. Some measures can halt the tree's construction, such as the Gini index,
statistical significance, information gain, entropy, etc.
The tree is pruned by keeping a threshold on these measures in mind. If the threshold is
high, the tree may be overly simplified; if the threshold is low, there may be very little
simplification.
When a node is pruned, it becomes a leaf labeled with the most frequent class among its
subset of tuples. If partitioning the tuples at a node would result in a split whose
measure falls below the specified threshold, further partitioning is halted.
Example:
Let us consider the customer dataset provided below:

ID   Age      Salary   Class
1    Young    Low      No
2    Middle   High     Yes
3    Old      Low      No
We will construct the decision tree with a pre-pruning condition of a maximum depth of 3.
After calculating information gain and entropy, the decision tree is constructed with a
maximum depth of 3, as you can see below:
The root node is at "Depth 0" and represents the whole dataset. In the above decision
tree, the root node is "Age". The root node is divided into three intermediate nodes:
"Young", "Middle", and "Old". Then, the node "Young" is divided into two intermediate
nodes based on salary: "High" and "Low". Since our maximum depth limit is 3, we will
not split the branches at the "Middle" and "Old" nodes any further.
That's how the decision tree is created. It is easy to interpret the tree and predict the
result. In pre-pruning, a decision tree stops growing when the maximum depth limit is
reached, even when the tree can be divided further into more branches.
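A minimal pre-pruning sketch, assuming scikit-learn: the max_depth and min_samples_split values below are illustrative thresholds that halt tree growth early, in the spirit of the depth-3 limit used in the example above.

# Pre-pruning sketch: constrain the tree while it is being built
# (assumes scikit-learn and a synthetic dataset, both illustrative).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Growth is halted early: depth is capped at 3 and nodes with fewer
# than 10 tuples are not split further.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=1)
pre_pruned.fit(X, y)
print("Tree depth:", pre_pruned.get_depth())  # will not exceed 3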
Post-pruning Approach
Post-pruning is done after the tree has grown to its full depth. It is also called backward
pruning. The tree is pruned by eliminating subtrees and replacing them with leaf nodes to
prevent the decision tree model from overfitting. The most frequent class within the
subtree being replaced is then assigned as the label of the new leaf.
Example:
Consider constructing a decision tree that predicts whether students pass or fail based on
the hours they studied and the hours they slept:

Student   Hours Studied   Hours Slept   Result
1         2               8             Fail
2         6               6             Pass
3         7               5             Pass
The fully grown decision tree will look as shown below.
Now, consider that the following is the validation dataset, which has additional data on
students:

Student   Hours Studied   Hours Slept   Result
5         5               9             Fail
6         8               6             Pass
We can prune the tree to improve it by removing the branches that do not improve accuracy
on the validation data. We will prune the branch "Hours slept > 7: Fail", as it does not
add much value to the tree.
After pruning the fully grown tree, the final decision tree will look as shown below.
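A minimal post-pruning sketch, assuming scikit-learn's cost-complexity pruning and synthetic data: the tree is grown fully first, and a validation split is then used to choose how strongly to prune.

# Post-pruning sketch: grow the full tree, then prune it back
# (assumes scikit-learn; data and split are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=2)

full_tree = DecisionTreeClassifier(random_state=2).fit(X_train, y_train)

# Candidate pruning strengths computed from the fully grown tree.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Keep the pruned tree that scores best on the validation data.
best = max(
    (DecisionTreeClassifier(random_state=2, ccp_alpha=a).fit(X_train, y_train) for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("Unpruned leaves:", full_tree.get_n_leaves(), "Pruned leaves:", best.get_n_leaves())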
Bayesian classification
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are
statistical classifiers. Bayesian classifiers can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular
class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of
probabilities −
Prior Probability, P(H) − the initial probability of a hypothesis H before any
evidence is seen.
Posterior Probability, P(H|X) − the probability of H after observing the
evidence X.
Example:
Imagine you know that 1% of people in a population have a rare disease. Before
doing any tests, you assume the chance of a randomly chosen person having the
disease is 1%. This is your prior probability.
Example:
Now suppose a person takes a medical test, and it comes back positive for the
disease. The posterior probability tells you the chance that the person actually
has the disease, given both:
1. Facts:
o P(H) = 0.01: 1% of the population has the disease (prior probability).
o P(X|H) = 0.95: the test correctly detects the disease 95% of the time
(sensitivity).
o P(X|¬H) = 0.05: the test gives a false positive 5% of the time.
o P(¬H) = 0.99: 99% of the population does not have the disease.
2. Objective: find the posterior probability P(H|X), the probability that the
person has the disease given the positive test. First, compute the total
probability of a positive test:
P(X) = P(X|H)·P(H) + P(X|¬H)·P(¬H)
Substitute values:
P(X) = (0.95 × 0.01) + (0.05 × 0.99) = 0.0095 + 0.0495 = 0.059
Then apply Bayes' Theorem:
P(H|X) = P(X|H)·P(H) / P(X) = (0.95 × 0.01) / 0.059 = 0.0095 / 0.059 ≈ 0.161
If the test result is positive, there’s only about a 16.1% chance that the person
actually has the disease, despite the high sensitivity of the test. This
counterintuitive result arises because the disease is rare (P(H) = 0.01), meaning
false positives from the test are relatively more common than true positives.
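The arithmetic above can be verified with a few lines of Python, using only the numbers given in the example:

# Verifying the worked example with Bayes' Theorem.
p_h = 0.01              # prior: P(H), person has the disease
p_not_h = 0.99          # P(not H)
p_x_given_h = 0.95      # sensitivity: P(X | H)
p_x_given_not_h = 0.05  # false positive rate: P(X | not H)

# Total probability of a positive test.
p_x = p_x_given_h * p_h + p_x_given_not_h * p_not_h

# Posterior probability of disease given a positive test.
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_x, 3), round(p_h_given_x, 3))  # 0.059 and roughly 0.161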
Bayesian Belief Networks
A Bayesian Belief Network is a graphical model that represents a set of random
variables and their conditional dependencies using a directed acyclic graph.
Key terminologies:
1. Nodes:
Each node represents a random variable or a factor that influences other
variables in the network. These can be anything you want to model, such
as a disease, symptom, or weather condition.
o Example: In a medical diagnosis network, nodes could represent
"Cold", "Fever", "Cough", etc.
2. Edges:
The edges (arrows) represent the conditional dependencies between
variables. An edge from one node to another implies that the first variable
influences the second one.
o Example: An edge from "Cold" to "Fever" indicates that having a
cold influences the likelihood of having a fever.
3. Prior Probability:
The initial probability of a node before considering any other
information. It represents your belief about the state of the node before
seeing evidence.
4. Posterior Probability:
The updated probability of a node after considering new evidence or data.
It represents the belief about the state of the node after observing
evidence.
5. Joint Probability:
The probability of a combination of events or variables occurring
together. This considers the entire network.
Let's create a simple Bayesian Belief Network to model the likelihood of having
a cold, based on symptoms such as a fever and cough.
Step 1: Identify the Nodes
1. Cold (C): Whether a person has a cold or not (this is the disease).
2. Fever (F): Whether the person has a fever or not (a symptom).
3. Cough (Cg): Whether the person has a cough or not (another symptom).
Cold → Fever: If you have a cold, you are more likely to have a fever.
Cold → Cough: If you have a cold, you are more likely to have a cough.
The structure of the network is:
C (Cold) → F (Fever)
C (Cold) → Cg (Cough)
Let’s say we observe that a person has both a fever and a cough. We want to
know the probability that they have a cold, i.e., P(C|F, Cg)
(the posterior probability of having a cold given the symptoms).
By Bayes' Theorem:
P(C|F, Cg) = P(F, Cg|C)·P(C) / P(F, Cg)
Where, using P(C) = 0.1, P(F|C) = 0.8, P(Cg|C) = 0.9, P(F|¬C) = 0.2, and
P(Cg|¬C) = 0.4, and noting that Fever and Cough are conditionally independent
given Cold:
P(F, Cg|C) = P(F|C)·P(Cg|C) = 0.8 × 0.9 = 0.72
P(F, Cg|¬C) = P(F|¬C)·P(Cg|¬C) = 0.2 × 0.4 = 0.08
P(F, Cg) = P(F, Cg|C)·P(C) + P(F, Cg|¬C)·P(¬C)
         = (0.72 × 0.1) + (0.08 × 0.9) = 0.072 + 0.072 = 0.144
P(C|F, Cg) = (0.72 × 0.1) / 0.144 = 0.072 / 0.144 = 0.5
Conclusion
Given that the person has both a fever and a cough, the probability that they
actually have a cold is 50%.
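A short Python check of this network calculation, using the same probability values as the derivation above:

# Verifying P(Cold | Fever, Cough) for the simple belief network above.
p_c = 0.1                   # P(Cold)
p_f_c, p_cg_c = 0.8, 0.9    # P(Fever | Cold), P(Cough | Cold)
p_f_nc, p_cg_nc = 0.2, 0.4  # P(Fever | no Cold), P(Cough | no Cold)

# Fever and Cough are conditionally independent given Cold.
p_fcg_c = p_f_c * p_cg_c                       # 0.72
p_fcg_nc = p_f_nc * p_cg_nc                    # 0.08
p_fcg = p_fcg_c * p_c + p_fcg_nc * (1 - p_c)   # 0.144

print(p_fcg_c * p_c / p_fcg)                   # 0.5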
IF-THEN Rules
A rule-based classifier uses a set of IF-THEN rules of the form IF condition THEN
conclusion, where the IF part is the rule antecedent and the THEN part is the rule
consequent. If the condition holds true for a given tuple, then the antecedent is
satisfied.
Rule Extraction
Points to remember −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
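As a hedged illustration of these points, scikit-learn can print a trained decision tree in an indented IF-THEN-like form, where each root-to-leaf path corresponds to one rule; the iris data and the depth limit below are illustrative choices, not part of the original discussion.

# Sketch: extracting human-readable rules from a trained decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each root-to-leaf path in this printout is one IF-THEN rule:
# the ANDed split conditions form the antecedent, the leaf class the consequent.
print(export_text(tree, feature_names=list(iris.feature_names)))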
The Sequential Covering Algorithm works by iteratively selecting rules that best "cover" the
training data, removing the covered instances at each step. These rules are then combined to
form a complete classifier. The algorithm continues until all instances in the training set are
covered by the selected rules.
1. Initialize the training data: Start with the entire training dataset.
2. Select a rule:
o At each iteration, a rule is generated to cover a subset of the
training data.
o A rule is typically in the form:
IF <condition> THEN <class>,
where the condition is a conjunction of attribute-value pairs (e.g.,
"age > 30 AND income < 50000").
o The rule is selected based on its ability to cover instances that have
the correct class label (i.e., the target label).
3. Covering the data:
o The selected rule is applied to the training data, marking the
instances that match the rule as "covered" or "classified".
o The covered examples are removed from the training set and the rule
is added to the rule set, leaving the rest of the data for further
processing.
4. Repeat:
o Steps 2 and 3 are repeated until all instances in the training set are
covered by at least one rule.
o Alternatively, a stopping criterion such as a maximum number of
rules can be used.
Let’s assume we are trying to classify whether a person will buy a computer
based on two features: Age and Income.
Age   Income   Buys Computer
25    50       Yes
45    80       No
35    60       Yes
55    90       No
Step 1: Initialize Training Data
We start with the entire training dataset shown above.
Step 2: Select a Rule
Let's select a rule based on the data. Suppose we create the rule:
Rule 1:
IF Age ≤ 40 AND Income ≥ 50 THEN Buys Computer = Yes
This rule covers the first and third instances in the table. After applying this
rule, we remove the covered instances:
Age   Income   Buys Computer
45    80       No
55    90       No
Now we create another rule for the remaining instances. For these remaining
rows, we could generate:
Rule 2:
IF Age > 40 THEN Buys Computer = No
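These two rules can be checked against the example rows with a short Python sketch; the column names follow the table above, and the default value returned when no rule fires is an assumption.

# Applying Rule 1 and Rule 2 from the sequential covering example.
rows = [
    {"Age": 25, "Income": 50, "Buys Computer": "Yes"},
    {"Age": 45, "Income": 80, "Buys Computer": "No"},
    {"Age": 35, "Income": 60, "Buys Computer": "Yes"},
    {"Age": 55, "Income": 90, "Buys Computer": "No"},
]

def predict(row):
    if row["Age"] <= 40 and row["Income"] >= 50:  # Rule 1
        return "Yes"
    if row["Age"] > 40:                           # Rule 2
        return "No"
    return "Unknown"                              # default rule (assumed)

for row in rows:
    print(row, "->", predict(row))  # predictions match every class label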
Advantages:
Interpretability: the learned IF-THEN rules are easy for humans to read and
understand.
Simplicity: the algorithm is straightforward, learning one rule at a time.
Disadvantages:
Greedy: The greedy nature of the algorithm may lead to suboptimal rules
(it doesn't look ahead to find a global solution).
Overfitting: Since rules are added one by one, there is a risk of
overfitting the model to the training data, especially if the dataset is noisy.
Computationally expensive: If there are many examples and attributes,
the algorithm may take time to find the best rule at each step.
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { }; // initially, the set of learned rules is empty
for each class c do
   repeat
      Rule = Learn_One_Rule(D, Att_vals, c);
      remove tuples covered by Rule from D;
      Rule_set = Rule_set + Rule;
   until terminating condition;
end for
return Rule_set;
FOIL is one of the simple and effective methods for rule pruning. For a given
rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R,
respectively. This value increases with the accuracy of R on the pruning set;
if FOIL_Prune is higher for the pruned version of R, then R is pruned.
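A one-function Python sketch of the FOIL_Prune measure defined above (the rule counts in the example calls are illustrative):

# FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos and neg are the
# numbers of positive and negative tuples covered by rule R.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

print(foil_prune(pos=45, neg=5))   # 0.8  (illustrative counts)
print(foil_prune(pos=30, neg=20))  # 0.2  -> the first rule is preferred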
The term “lazy learner” or “lazy algorithm” is used to describe the k-Nearest
Neighbors (KNN) algorithm in machine learning. The key characteristic that
earns KNN this nickname is that it doesn’t learn a model during the training
phase. Instead, it defers the learning until the prediction or testing phase.
Advantages of KNN:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
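A minimal KNN sketch, assuming scikit-learn and the iris dataset (illustrative choices): fit() essentially just stores the training tuples, and the distance computations are deferred to prediction time, which is what makes KNN a lazy learner.

# Lazy learning sketch: k-Nearest Neighbors with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# "Training" only stores the tuples; no model is built here.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# The distance comparisons happen at prediction time.
print("Test accuracy:", knn.score(X_test, y_test))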