
Unit 3

Data Mining - Classification & Prediction

There are two forms of data analysis that can be used for extracting models
describing important classes or to predict future data trends. These two forms
are as follows −

 Classification
 Prediction

Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

What is classification?

Following are the examples of cases where the data analysis task is
Classification −

 A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
 A marketing manager at a company needs to analyze customers with a given profile to determine who will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.

What is prediction?

Following are the examples of cases where the data analysis task is Prediction −

Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are asked to predict a numeric value. Therefore the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
How Does Classification Work?

With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process
includes two steps −

 Building the Classifier or Model


 Using Classifier for Classification

Building the Classifier or Model


 This step is the learning step or the learning phase.
 In this step the classification algorithm builds the classifier.
 The classifier is built from the training set made up of database tuples and their associated class labels.
 Each tuple that constitutes the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.

Using Classifier for Classification

In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be
applied to the new data tuples if the accuracy is considered acceptable.
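
To make the two steps concrete, here is a minimal sketch in Python using scikit-learn; the tiny loan-application dataset and its attribute encoding are made up purely for illustration.

# A sketch of the two-step process: build the classifier, then use it.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative tuples: [income_in_thousands, years_employed] -> "safe" / "risky"
X = [[60, 10], [25, 1], [80, 15], [30, 2], [55, 7], [20, 0], [75, 12], [28, 3]]
y = ["safe", "risky", "safe", "risky", "safe", "risky", "safe", "risky"]

# Step 1: Building the classifier (learning phase) from the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
classifier = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: Using the classifier; the test data estimates the accuracy first.
print("Estimated accuracy:", accuracy_score(y_test, classifier.predict(X_test)))

# If the accuracy is acceptable, apply the classifier to new data tuples.
print(classifier.predict([[50, 5]]))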
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −

 Data Cleaning − Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
 Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
 Data Transformation and Reduction − The data can be transformed by any of the following methods.
o Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when neural networks or methods involving distance measurements are used in the learning step (see the sketch after this list).
o Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
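
As referenced above, here is a minimal sketch of min-max normalization in Python; the attribute values below are made up for illustration, and the target range [0, 1] is just one common choice.

# Min-max normalization: scale all values of an attribute into a small range.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    if old_max == old_min:                      # avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / (old_max - old_min)
            for v in values]

# Illustrative attribute values (e.g. incomes in thousands of dollars).
print(min_max_normalize([20, 25, 30, 55, 80]))  # [0.0, 0.083..., 0.166..., 0.583..., 1.0]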

Comparison of Classification and Prediction Methods

Here are the criteria for comparing the methods of Classification and Prediction −
 Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
 Speed − This refers to the computational cost in generating and using the classifier or predictor.
 Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
 Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
 Interpretability − This refers to the extent to which the classifier or predictor can be understood.

Decision Tree Induction


A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label. The topmost node in
the tree is the root node.

The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.


 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and
fast.

Decision Tree Induction Algorithm


A machine learning researcher, J. Ross Quinlan, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser 3) in 1980. Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. In these algorithms there is no backtracking; the trees are constructed in a top-down recursive divide-and-conquer manner.

Generating a decision tree from the training tuples of data partition D

Algorithm: Generate_decision_tree

Input:
Data partition, D, which is a set of training tuples
and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and, possibly, either a split point or a splitting subset.

Output:
A decision tree.

Method:
create a node N;

if tuples in D are all of the same class C then
    return N as a leaf node labeled with class C;

if attribute_list is empty then
    return N as a leaf node labeled with
    the majority class in D; // majority voting

apply Attribute_selection_method(D, attribute_list)
    to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
    multiway splits allowed then // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute; // remove splitting attribute

for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
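
The pseudocode above can be mirrored by a compact recursive implementation. The following Python sketch is an ID3-style illustration for categorical attributes; helper names such as best_split_attribute are our own, the attribute-selection step uses information gain (one of several possible selection measures), and the tiny play-cricket style dataset is only for illustration.

# Illustrative ID3-style sketch of Generate_decision_tree for categorical data.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split_attribute(rows, labels, attributes):
    # Attribute selection method: pick the attribute with the highest information gain.
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lbl for row, lbl in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def generate_decision_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                        # all tuples in the same class C
        return labels[0]
    if not attributes:                               # attribute_list is empty
        return Counter(labels).most_common(1)[0][0]  # majority voting
    split_attr = best_split_attribute(rows, labels, attributes)
    node = {split_attr: {}}
    remaining = [a for a in attributes if a != split_attr]
    for value in set(row[split_attr] for row in rows):   # each outcome of the split
        subset = [(r, l) for r, l in zip(rows, labels) if r[split_attr] == value]
        sub_rows, sub_labels = zip(*subset)
        node[split_attr][value] = generate_decision_tree(list(sub_rows), list(sub_labels), remaining)
    return node

rows = [{"Weather": "Sunny", "Temp": "Mild"}, {"Weather": "Rainy", "Temp": "Cool"},
        {"Weather": "Cloudy", "Temp": "Mild"}, {"Weather": "Sunny", "Temp": "Cool"},
        {"Weather": "Sunny", "Temp": "Hot"}]
labels = ["Yes", "No", "Yes", "Yes", "No"]
print(generate_decision_tree(rows, labels, ["Weather", "Temp"]))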

Tree Pruning

Tree pruning is performed in order to remove anomalies in the training data due
to noise or outliers. The pruned trees are smaller and less complex.

Tree Pruning Approaches

There are two approaches to prune a tree −

 Pre-pruning − The tree is pruned by halting its construction early.


 Post-pruning - This approach removes a sub-tree from a fully grown
tree.

Cost Complexity

The cost complexity is measured by the following two parameters −

 Number of leaves in the tree, and


 Error rate of the tree.

Tree Pruning in Data Mining


Pruning is a technique, related to decision trees, that is used to eliminate certain parts of the tree in order to reduce its size.

In this article, we will learn about tree pruning in data mining, but first, let us know
about a decision tree.

Decision Tree
A decision tree is an algorithm that is used for classification and prediction. It describes
rules in the form of a tree. It visually defines the rules simply for straightforward
interpretation and understanding. It represents the decision-making
process graphically and helps to make decisions easily.

It contains three types of nodes: root, branch, and leaf nodes. The root node is the first decision node, where the main question is asked. A branch node is an intermediate node that helps in answering the question asked at the root node. A leaf node is a terminal node, which gives the final answer.

Two measures are used to build a decision tree. The first is information gain, which measures how much information the answer to a specific question provides. The second is entropy, which measures how much uncertainty is present in the data.

After the tree is constructed, it can be pruned to prevent overfitting. Pruning removes branches to make the tree more predictive. We will discuss pruning in detail further in this article.
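
A minimal sketch of the two measures just described, in Python; the label and attribute lists below correspond to the small play-cricket example that follows.

# Entropy (uncertainty) and information gain for a categorical attribute.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lbl for val, lbl in zip(attribute_values, labels) if val == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

play = ["Yes", "No", "Yes", "Yes", "No"]                 # Play-Cricket for Day 1..5
weather = ["Sunny", "Rainy", "Cloudy", "Sunny", "Sunny"] # Weather for Day 1..5
print(round(entropy(play), 3))                   # 0.971 bits of uncertainty
print(round(information_gain(weather, play), 3)) # 0.42: gain of splitting on Weather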

Example
Let us consider the dataset provided below:

Day Weather Temperature Play-Cricket

Day 1 Sunny Mild Yes

Day 2 Rainy Cool No

Day 3 Cloudy Mild Yes

Day 4 Sunny Cool Yes

Day 5 Sunny Hot No

From the given dataset, we will construct a decision tree and check whether cricket can be played outside or not.
Below is the graphical representation of the dataset provided above:

In the decision tree constructed above, the root node is "Weather", as it is the initial node used for making the first decision. It is the node where the main question is asked, and the question is whether to play cricket outside or not.

In the decision tree, the decision nodes are "Weather" and "Temperature". The branch
nodes are "Sunny", "Cloudy", "Rainy", "Hot", "Mild", and "Cool". The leaf nodes are
"Yes" or "No".

Now, let us decide with the help of a decision tree. We must choose the branch at the
root node, which should be our decision. We must choose the branch according to the
conditions.

Let us decide for Day 1; the weather is sunny, so we will choose the branch node "Sunny". After that, we will look for further decision nodes. The further decision node is "Temperature", which is split into three branches: "Hot", "Mild", and "Cool". For Day 1, we will choose the branch node "Mild", which has the leaf node "Yes", which means we can play cricket outside on Day 1.

Now, let us decide for Day 2; the weather is rainy, so we will choose the branch node "Rainy". This branch node has a leaf node that says "No", meaning we cannot play cricket outside on Day 2.

Now, let us decide for Day 3; the weather is cloudy, so we will choose the branch node "Cloudy". After that, we will look for further decision nodes. The further decision node is "Temperature", split into three branches: "Hot", "Mild", and "Cool". For Day 3, we will choose the branch node "Mild", which has the leaf node "Yes". So, we can play cricket outside on Day 3.

Similarly, we can decide on Day 4 and Day 5. So, we have seen how easy it was to
decide and predict the answer using the decision tree.

Tree Pruning
When a decision tree is constructed, a tree-growing algorithm is used to build the tree.
The noise in the training data creates abnormalities in various branches of the tree while
constructing a tree. The tree pruning technique addresses this issue and is used to
remove it.

In data mining, tree pruning is the technique that is used to decrease the size of the
decision tree model without lowering its accuracy. It improves the decision tree model
and decreases overfitting by removing certain branches from the fully grown tree. It
removes the abnormalities present in the training data due to noise. Trees that are
pruned are smaller in size and simple to understand.

A natural question is what the optimal size of the final tree should be. A tree that is too large risks overfitting the training data, while a tree that is too small may miss important structural information.

There are two approaches to tree pruning:

Pre-pruning Approach
Pre-pruning is also called forward pruning or early stopping. The approach puts
constraints on the decision tree before it is constructed. In pre-pruning, the tree-building
process is halted before the tree becomes complex. It helps deal with the issue of
overfitting. Some measures can halt the tree's construction, such as the Gini index,
statistical significance, information gain, entropy, etc.

The tree is pruned with respect to a chosen threshold on such a measure. If the threshold is high, the tree may be overly simplified; if the threshold is low, very little simplification may occur.

When partitioning the tuples at a node would result in a split that falls below the specified threshold, further partitioning is halted and the node becomes a leaf. The leaf is labeled with the most frequent class among the subset of tuples at that node.

Example:
Let us consider the customer dataset provided below:

Customer ID Age Salary Purchase

1 Young Low No
2 Middle High Yes

3 Old Low No

4 Young High Yes

5 Middle High Yes

We will construct the decision tree with a pre-pruning condition of maximum depth 3.

After calculating information gain and entropy, the decision tree is constructed with a maximum depth of 3, as shown below:

The root node is at "Depth 0" and represents the whole dataset. In the above decision tree, the root node is "Age". The root node is divided into three intermediate nodes: "Young", "Middle", and "Old". Then, the node "Young" is divided into two intermediate nodes based on salary: "High" and "Low". Since our maximum depth limit is 3, we will not split the branches at the "Middle" and "Old" nodes.

That's how the decision tree is created. It is easy to interpret the tree and predict the
result. In pre-pruning, a decision tree stops growing when the maximum depth limit is
reached, even when the tree can be divided further into more branches.
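
In libraries, pre-pruning is typically expressed as constraints passed to the tree-growing routine. Below is a minimal sketch using scikit-learn, assuming the customer table above is encoded numerically; the encoding and the minimum-gain threshold are made-up illustrations.

# Pre-pruning sketch: constrain the tree while it is being grown.
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative numeric encoding of the customer table:
# Age: Young=0, Middle=1, Old=2; Salary: Low=0, High=1.
X = [[0, 0], [1, 1], [2, 0], [0, 1], [1, 1]]
y = ["No", "Yes", "No", "Yes", "Yes"]

# Halt construction early: limit the depth and require a minimum gain per split.
clf = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01).fit(X, y)
print(export_text(clf, feature_names=["Age", "Salary"]))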
Post-pruning Approach
Post-pruning is done after the tree has grown to its full depth. It is also called backward
pruning. A tree is pruned by eliminating its branches and replacing them with a leaf to
prevent the decision tree model from overfitting. The most frequent class among the
subtrees being replaced is then assigned as the label for the leaf.

Example:
Consider constructing a decision tree in which students pass or fail based on their
studied hours.

Following is the student's dataset:

Student ID Hours studied Hours slept Result

1 2 8 Fail

2 6 6 Pass

3 7 5 Pass

The fully grown decision tree will look as shown below:

Now, consider that the following is the validation dataset, which has additional data on
students:

Student ID Hours studied Hours slept Result


4 7 6 Pass

5 5 9 Fail

6 8 6 Pass

We can prune the tree to improve it by removing branches that do not improve accuracy on the validation data. We will prune the branch "Hours slept > 7: Fail", as it does not provide much value in the tree.

After pruning the fully grown tree, the final decision tree will look as shown below:

The pruned tree is more accurate on the validation data and easier to interpret.
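
Post-pruning is also available in libraries. The sketch below uses scikit-learn's cost-complexity pruning, a common post-pruning technique (not the exact reduced-error procedure walked through above), with the students' data encoded as [hours studied, hours slept].

# Post-pruning sketch: grow the full tree, then prune with cost-complexity pruning.
from sklearn.tree import DecisionTreeClassifier

X_train = [[2, 8], [6, 6], [7, 5]]            # [hours studied, hours slept]
y_train = ["Fail", "Pass", "Pass"]
X_valid = [[7, 6], [5, 9], [8, 6]]            # validation data
y_valid = ["Pass", "Fail", "Pass"]

# Candidate ccp_alpha values from the fully grown tree's pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Pick the ccp_alpha whose pruned tree scores best on the validation set.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_valid, y_valid),
)
print(best.get_depth(), best.score(X_valid, y_valid))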

Bayesian classification
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the
statistical classifiers. Bayesian classifiers can predict class membership
probabilities such as the probability that a given tuple belongs to a particular
class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of
probabilities −

 Posterior Probability, P(H|X)
 Prior Probability, P(H)

where X is a data tuple and H is some hypothesis.

According to Bayes' Theorem,

P(H|X) = P(X|H) P(H) / P(X)

1. Prior Probability, P(H)

 What It Is: The probability of a hypothesis H being true before seeing any new evidence or data.
 Think of it as: Your initial belief about how likely something is to happen based on past knowledge or assumptions.

Example:
Imagine you know that 1% of people in a population have a rare disease. Before
doing any tests, you assume the chance of a randomly chosen person having the
disease is 1%. This is your prior probability.

2. Posterior Probability, P(H|X)

 What It Is: The probability of a hypothesis H being true after considering new evidence X.
 Think of it as: Updating your belief about something based on new information.

Example:
Now suppose a person takes a medical test, and it comes back positive for the
disease. The posterior probability tells you the chance that the person actually
has the disease, given both:

 Your initial belief (prior probability, 1%).


 The accuracy of the test (new evidence)
Example: Diagnosing a Disease

Let’s assume a medical test for a rare disease:

1. Facts:
o P(H) = 0.01: 1% of the population has the disease (prior probability).
o P(X∣H) = 0.95: The test correctly detects the disease 95% of the time (sensitivity).
o P(X∣¬H) = 0.05: The test gives a false positive 5% of the time.
o P(¬H) = 0.99: 99% of the population does not have the disease.

Objective:

Calculate P(H∣X): The probability of having the disease (hypothesis H) given that the test result is positive (X).

Steps Using Bayes' Theorem


1. Write the formula:
P(H∣X)=P(X∣H)⋅P(H)/P(X)
2. Calculate P(X)(the evidence):

The total probability of getting a positive test result is:

P(X)=P(X∣H)⋅P(H)+P(X∣¬H)⋅P(¬H)

Substitute values:

P(X) = (0.95⋅0.01) + (0.05⋅0.99) = 0.0095 + 0.0495 = 0.059

3. Calculate P(H∣X) (posterior probability):

Substitute values into Bayes' formula:

P(H∣X) = P(X∣H)⋅P(H) / P(X) = (0.95⋅0.01) / 0.059 = 0.0095 / 0.059 ≈ 0.161

If the test result is positive, there's only about a 16.1% chance that the person actually has the disease, despite the high sensitivity of the test. This counterintuitive result arises because the disease is rare (P(H) = 0.01), meaning false positives from the test are relatively more common than true positives.
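
The same calculation can be verified with a few lines of code; a minimal sketch of the disease example above:

# Bayes' theorem for the rare-disease example: P(H|X) = P(X|H) P(H) / P(X).
p_h = 0.01              # prior: 1% of the population has the disease
p_x_given_h = 0.95      # sensitivity: positive test when diseased
p_x_given_not_h = 0.05  # false positive rate: positive test when healthy

# Evidence: total probability of a positive test result.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

posterior = p_x_given_h * p_h / p_x
print(round(p_x, 3), round(posterior, 3))   # 0.059 0.161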

Bayesian Belief Network


A Bayesian Belief Network (BBN) is a graphical model used to represent
probabilistic relationships among a set of variables. It provides a framework for
reasoning about uncertainty and making inferences based on available data.

Bayesian Belief Networks specify joint conditional probability distributions.


They are also known as Belief Networks, Bayesian Networks, or Probabilistic
Networks.

 A Belief Network allows class conditional independencies to be defined between subsets of variables.
 It provides a graphical model of causal relationships on which learning can be performed.
 We can use a trained Bayesian Network for classification.

There are two components that define a Bayesian Belief Network −

 Directed acyclic graph


 A set of conditional probability tables

Directed Acyclic Graph


 Each node in a directed acyclic graph represents a random variable.
 These variables may be discrete or continuous valued.
 These variables may correspond to the actual attributes given in the data.

key terminologies:

1. Nodes:
Each node represents a random variable or a factor that influences other
variables in the network. These can be anything you want to model, such
as a disease, symptom, or weather condition.
o Example: In a medical diagnosis network, nodes could represent
"Cold", "Fever", "Cough", etc.
2. Edges:
The edges (arrows) represent the conditional dependencies between
variables. An edge from one node to another implies that the first variable
influences the second one.
o Example: An edge from "Cold" to "Fever" indicates that having a
cold influences the likelihood of having a fever.

Conditional Probability Table (CPT):


A table that quantifies the relationships between a node and its parent
nodes. It shows the probability of a node’s state given the states of its
parent nodes.

o Example: For the "Fever" node, a CPT might describe the


probability of having a fever given that you have a cold or not.

Prior Probability:
The initial probability of a node before considering any other
information. It represents your belief about the state of the node before
seeing evidence.

o Example: The prior probability of having a cold might be 10%


(i.e., 10% of people have a cold).

Posterior Probability:
The updated probability of a node after considering new evidence or data.
It represents the belief about the state of the node after observing
evidence.

o Example: After knowing that a person has a cough, the posterior


probability of them having a cold may change.

Joint Probability:
The probability of a combination of events or variables occurring
together. This considers the entire network.

o Example: The joint probability would be the likelihood of a person


having both a cold and a cough at the same time.

Scenario: Diagnosing a Cold

Let's create a simple Bayesian Belief Network to model the likelihood of having
a cold, based on symptoms such as a fever and cough.
Step 1: Identify the Nodes

1. Cold (C): Whether a person has a cold or not (this is the disease).
2. Fever (F): Whether the person has a fever or not (a symptom).
3. Cough (Cg): Whether the person has a cough or not (another symptom).

Step 2: Define the Edges (Dependencies)

 Cold → Fever: If you have a cold, you are more likely to have a fever.
 Cold → Cough: If you have a cold, you are more likely to have a cough.

Step 3: Assign Prior Probabilities

We know the following prior probabilities:

 P(C)=0.1 (Prior probability of having a cold: 10%).


 P(¬C)=0.9 (Probability of not having a cold: 90%).

Step 4: Define Conditional Probabilities

 Fever given Cold:
P(F∣C) = 0.8 (80% chance of having a fever if you have a cold).
P(F∣¬C) = 0.2 (20% chance of having a fever if you don’t have a cold).
 Cough given Cold:
P(Cg∣C) = 0.9 (90% chance of having a cough if you have a cold).
P(Cg∣¬C) = 0.4 (40% chance of having a cough if you don’t have a cold).

C (Cold) → F (Fever)
C (Cold) → Cg (Cough)

Let’s say we observe that a person has both a fever and a cough. We want to know the probability that they have a cold, i.e., P(C∣F,Cg) (the posterior probability of having a cold given the symptoms).

We can use Bayes' Theorem:

P(C∣F,Cg)=P(F,Cg∣C)⋅P(C)/P(F,Cg)
Where, since F and Cg are conditionally independent given C in this network:

P(F,Cg∣C) = P(F∣C)⋅P(Cg∣C) = 0.8×0.9 = 0.72

Now, calculate the total probability P(F,Cg):

P(F,Cg)=P(F,Cg∣C)⋅P(C)+P(F,Cg∣¬C)⋅P(¬C)

P(F,Cg∣¬C)=P(F∣¬C)⋅P(Cg∣¬C)=0.2×0.4=0.08

P(F,Cg)=(0.72×0.1)+(0.08×0.9)=0.072+0.072=0.144

Finally, calculate the posterior probability:

P(C∣F,Cg) = (0.72×0.1) / 0.144 = 0.072 / 0.144 = 0.5

Conclusion

Given that the person has both a fever and a cough, the probability that they
actually have a cold is 50%.
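
The inference above can be reproduced with a short script. Below is a minimal sketch of the cold/fever/cough network, hard-coding the probabilities assumed in this example:

# Posterior P(Cold | Fever, Cough) for the small belief network above,
# using the conditional independence of Fever and Cough given Cold.
p_cold = 0.1
p_fever = {True: 0.8, False: 0.2}    # P(Fever | Cold), P(Fever | not Cold)
p_cough = {True: 0.9, False: 0.4}    # P(Cough | Cold), P(Cough | not Cold)

# Joint likelihood of observing both symptoms, with and without a cold.
like_cold = p_fever[True] * p_cough[True]        # 0.72
like_no_cold = p_fever[False] * p_cough[False]   # 0.08

evidence = like_cold * p_cold + like_no_cold * (1 - p_cold)   # P(Fever, Cough) = 0.144
print(like_cold * p_cold / evidence)   # 0.5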

Rule Based Classification

IF-THEN Rules

A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes


THEN buy_computer = yes
Points to remember −
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.
 The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
 The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)

If the condition holds true for a given tuple, then the antecedent is satisfied.
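
A rule such as R1 can be represented directly in code. The sketch below is illustrative only; the tuple is assumed to be a dictionary of attribute values whose keys mirror the rule's attribute tests.

# IF (age = youth) AND (student = yes) THEN buys_computer = yes
def rule_r1(tuple_):
    # Antecedent: the attribute tests, logically ANDed.
    if tuple_["age"] == "youth" and tuple_["student"] == "yes":
        return "yes"          # consequent: the class prediction
    return None               # rule does not cover this tuple

print(rule_r1({"age": "youth", "student": "yes"}))   # yes  (antecedent satisfied)
print(rule_r1({"age": "senior", "student": "yes"}))  # None (rule does not fire)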

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.

Points to remember −

To extract a rule from a decision tree −

 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm


The Sequential Covering Algorithm is a machine learning algorithm used for inductive
learning, particularly for constructing rules in a rule-based classifier. It is commonly used
in classification tasks, where the goal is to predict a class label based on various features of
the input data.

The Sequential Covering Algorithm works by iteratively selecting rules that best "cover" the
training data, removing the covered instances at each step. These rules are then combined to
form a complete classifier. The algorithm continues until all instances in the training set are
covered by the selected rules.

Steps of the Sequential Covering Algorithm

1. Initialize the training data: Start with the entire training dataset.
2. Select a rule:
o At each iteration, a rule is generated to cover a subset of the
training data.
o A rule is typically in the form:
IF <condition> THEN <class>,
where the condition is a conjunction of attribute-value pairs (e.g.,
"age > 30 AND income < 50000").
o The rule is selected based on its ability to cover instances that have
the correct class label (i.e., the target label).
3. Covering the data:
o The selected rule is applied to the training data, marking the
instances that match the rule as "covered" or "classified".
o The rule is removed from the training set (or those covered
examples are removed), leaving the rest of the data for further
processing.

4. Repeat:
o Steps 2 and 3 are repeated until all instances in the training set are
covered by at least one rule.
o Alternatively, a stopping criterion such as a maximum number of
rules can be used.

5. Generate the final set of rules:


o The algorithm outputs the set of rules that together classify all or
most of the training data.

Key Features of Sequential Covering

 Greedy Approach: Sequential Covering is a greedy algorithm, meaning it tries to find the best rule at each step based on the current dataset.
 Incremental rule creation: The algorithm creates one rule at a time, and the rules are added incrementally.
 Rule Pruning: Sometimes, after a rule is created, it may need to be
pruned to improve performance (e.g., by removing unnecessary
conditions or generalizing the rule).

Example: Sequential Covering Algorithm with a Simple Dataset

Let’s assume we are trying to classify whether a person will buy a computer
based on two features: Age and Income.

Age (years) Income (K$) Buys Computer (Class)

25 50 Yes

45 80 No

35 60 Yes

55 90 No
Step 1: Initialize Training Data

Start with the entire dataset.

Step 2: Generate First Rule

Let's select a rule based on the data. Suppose we create the rule:

Rule 1:
IF Age ≤ 40 AND Income ≥ 50 THEN Buys Computer = Yes

This rule covers the first and third instances in the table. After applying this
rule, we remove the covered instances:

Age (years) Income (K$) Buys Computer (Class)

45 80 No

55 90 No

Step 3: Generate Second Rule

Now we create another rule for the remaining instances (the second and fourth rows of the original table), for which we could generate:

Rule 2:
IF Age > 40 THEN Buys Computer = No

This rule covers both the second and fourth instances.

Step 4: Complete the Process

Now, all instances are covered by rules, so the algorithm stops.

Final Set of Rules:

 Rule 1: IF Age ≤ 40 AND Income ≥ 50 THEN Buys Computer = Yes


 Rule 2: IF Age > 40 THEN Buys Computer = No

These rules can now be used to classify new instances.


Advantages and Disadvantages of Sequential Covering

Advantages:

 Simple to implement: Sequential Covering is easy to understand and implement.
 Interpretability: The output of the algorithm is a set of interpretable
rules, which are easy for humans to understand.
 Flexible: It can be adapted to a variety of problems and data types.

Disadvantages:

 Greedy: The greedy nature of the algorithm may lead to suboptimal rules
(it doesn't look ahead to find a global solution).
 Overfitting: Since rules are added one by one, there is a risk of
overfitting the model to the training data, especially if the dataset is noisy.
 Computationally expensive: If there are many examples and attributes,
the algorithm may take time to find the best rule at each step.

Algorithm: Sequential Covering

Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:
Rule_set = { }; // initial set of rules learned is empty

for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule; // add the new rule to the rule set
    until termination condition;
end for
return Rule_set;
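
A compact Python sketch of this loop is shown below. The learn_one_rule used here is a deliberately simplified, hypothetical stand-in that greedily picks the single attribute-value test with the best precision for the target class, and the categorical customer table from the pre-pruning example is reused as illustrative data.

# Toy sequential covering: learn one rule, remove covered tuples, repeat per class.
def learn_one_rule(data, target_class):
    best, best_precision = None, 0.0
    attributes = [k for k in data[0] if k != "class"]
    for attr in attributes:
        for value in {row[attr] for row in data}:
            covered = [row for row in data if row[attr] == value]
            correct = sum(1 for row in covered if row["class"] == target_class)
            precision = correct / len(covered)
            if precision > best_precision:
                best, best_precision = (attr, value), precision
    return best  # e.g. ("Age", "Middle"), or None if no test helps

def sequential_covering(data, classes):
    rule_set = []
    for c in classes:
        remaining = list(data)
        while any(row["class"] == c for row in remaining):
            rule = learn_one_rule(remaining, c)
            if rule is None:
                break
            rule_set.append((rule, c))
            attr, value = rule
            remaining = [row for row in remaining if row[attr] != value]  # remove covered tuples
    return rule_set

data = [
    {"Age": "Young", "Income": "Low", "class": "No"},
    {"Age": "Middle", "Income": "High", "class": "Yes"},
    {"Age": "Old", "Income": "Low", "class": "No"},
    {"Age": "Young", "Income": "High", "class": "Yes"},
    {"Age": "Middle", "Income": "High", "class": "Yes"},
]
print(sequential_covering(data, ["Yes", "No"]))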
Rule Pruning

Rules are pruned for the following reasons −

 The assessment of quality is made on the original set of training data. The rule may perform well on training data but less well on subsequent data. That is why rule pruning is required.
 A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.

FOIL is one of the simplest and most effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
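
As a tiny sketch, this measure can be coded directly; the pos/neg counts below are made-up illustrations. If the value increases for the pruned version of R, the pruned rule is preferred.

def foil_prune(pos, neg):
    # FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos/neg count the
    # positive/negative tuples covered by rule R.
    return (pos - neg) / (pos + neg)

print(foil_prune(pos=45, neg=5))   # 0.8   (original rule, illustrative counts)
print(foil_prune(pos=40, neg=2))   # ~0.905: the pruned version scores higher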

"Lazy Learner" or "Lazy Algorithm"

The term “lazy learner” or “lazy algorithm” is used to describe the k-Nearest
Neighbors (KNN) algorithm in machine learning. The key characteristic that
earns KNN this nickname is that it doesn’t learn a model during the training
phase. Instead, it defers the learning until the prediction or testing phase.

Why it’s considered “lazy”:

1. No Training Phase: Traditional machine learning algorithms involve a training phase where the model learns patterns and relationships in the data. KNN, however, skips this step entirely. It doesn’t attempt to understand the underlying structure of the data during training.
2. Instance-Based Learning: KNN is an instance-based or memory-based
learning algorithm. It memorizes the entire training dataset, and when a
prediction is needed, it looks up the most similar instances (neighbors) in the
training data.

3. Decision at Prediction Time: The learning happens at the time of prediction. When you input a new data point, KNN searches for the k-nearest neighbors in the training data and makes a decision based on the majority class or average of these neighbors.

4. No Generalization: Since KNN doesn’t create a model during training, it


doesn’t generalize well to new, unseen data. Each prediction is based on the
specific instances in the training set.

K-Nearest Neighbor (KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
o The K-NN algorithm assesses the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification
but mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make
any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
o KNN algorithm at the training phase just stores the dataset and when it
gets new data, then it classifies that data into a category that is much
similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance from the new data point to the other data points
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of the data points in
each category.
o Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the
k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For points (x1, y1) and (x2, y2) it can be calculated as d = √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distances we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A (a short sketch of this procedure follows below).
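
As referenced above, here is a minimal from-scratch sketch of these steps in Python; the two-dimensional points and their categories are made up for illustration, and k = 5 as in the walkthrough.

# K-NN from scratch: Euclidean distance, pick the k nearest, majority vote.
import math
from collections import Counter

def knn_classify(training_points, new_point, k=5):
    # Steps 2-3: compute Euclidean distances and take the k nearest neighbors.
    by_distance = sorted(training_points,
                         key=lambda p: math.dist(p[0], new_point))
    nearest = by_distance[:k]
    # Steps 4-5: count the neighbors per category and assign the majority class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative 2-D points: (coordinates, category).
training_points = [((1, 2), "A"), ((2, 3), "A"), ((3, 3), "A"), ((2, 1), "A"),
                   ((7, 8), "B"), ((8, 8), "B"), ((9, 7), "B")]
print(knn_classify(training_points, (3, 2), k=5))   # 'A'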

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:

o There is no particular way to determine the best value for "K", so we need to try some values to find the best of them. A commonly used value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of outliers.
o Larger values for K reduce the effect of noise, but can make the class boundaries less distinct and increase the computation.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because the distance between the new data point and all the training samples must be calculated.
