DataMining_Unit-3
1) Classification and prediction are two fundamental techniques in data mining used to extract
meaningful insights and models from datasets.
A Decision Tree is a popular and powerful model used for both classification and regression tasks in data
mining. It is a supervised learning algorithm that splits data into subsets based on feature values, creating a
tree-like structure to make decisions.
1. Root Node: The topmost node, representing the entire dataset. It is split into two or more branches
based on a feature's value.
2. Internal Nodes: These represent the features (attributes) used for decision-making and splitting the data.
3. Edges/Branches: These represent the outcomes of splitting the data on the attribute.
4. Leaf Nodes: The terminal nodes, which represent the final class label (for classification) or predicted
value (for regression).
5. Pruning: The process of removing or cutting down specific nodes in a tree to prevent overfitting and
simplify the model.
1. Gini Index (for classification): The Gini index measures the "impurity" (heterogeneity) of a node. A
lower Gini value indicates a better split (more homogeneous groups).
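The Gini index is computed as 1 minus the sum of squared class proportions. A minimal sketch of this calculation (the function name `gini_index` is ours, not from the notes):

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 - sum of p_i^2 over the class proportions p_i."""
    n = len(labels)
    counts = Counter(labels)
    return 1 - sum((c / n) ** 2 for c in counts.values())

# A pure node has Gini 0; a 50/50 two-class node has Gini 0.5 (the worst case for two classes).
print(gini_index(["YES", "YES", "YES", "YES"]))  # 0.0
print(gini_index(["YES", "YES", "NO", "NO"]))    # 0.5
```

A split is evaluated by the weighted average Gini of its child nodes; the split with the lowest value is chosen.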
Example of Decision Tree
Decision trees are drawn upside down: the root is at the top, and it splits into several nodes below. In layman's
terms, a decision tree is a bunch of if-else statements: it checks whether a condition is true and, if so, moves to
the next node attached to that decision.
In the diagram below, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? Depending on the
answer, it then checks the next feature, humidity or wind. For example, if the weather is rainy, it checks whether
the wind is strong or weak; if the wind is weak and it is rainy, the person may go and play.
Rules:
1) If weather = cloudy, then play = YES
2) If weather = sunny, humidity = high, then play = NO
3) If weather = sunny, humidity = normal, then play = YES
4) If weather = rainy, wind=strong, then play = NO
5) If weather = rainy, wind=weak, then play = YES
Test:
Day 11: If weather = cloudy, humidity = high,wind=weak, then play = ?
Answer: YES
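Since the notes describe the tree as "a bunch of if-else statements", the five rules above can be written out directly as code. A minimal sketch (the function name `play` is ours):

```python
def play(weather, humidity=None, wind=None):
    """The five decision-tree rules from the weather/play example as if-else statements."""
    if weather == "cloudy":
        return "YES"                                    # Rule 1
    if weather == "sunny":
        return "YES" if humidity == "normal" else "NO"  # Rules 2-3
    if weather == "rainy":
        return "YES" if wind == "weak" else "NO"        # Rules 4-5
    return None  # unseen weather value

# Day 11: weather = cloudy, humidity = high, wind = weak
print(play("cloudy", humidity="high", wind="weak"))  # YES
```

Note that for Day 11 only the weather attribute matters: rule 1 fires before humidity or wind are ever consulted, which is why the answer is YES.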
3) Bayesian Classification in Data Mining
❖ Bayesian classification is a statistical method based on Bayes' theorem, which describes the probability
of an event based on prior knowledge of conditions related to the event.
❖ In data mining, Bayesian classification is used to predict the probability that a given data point belongs
to a particular class.
Example Problem
You observe a basket containing fruits, and you want to classify whether a randomly chosen fruit is an Apple or
an Orange based on the observed color.
Given Data
The basket contains:
• 60 fruits in total:
o 30 Apples: 20 Red and 10 Green.
o 30 Oranges: 15 Orange and 15 Green.
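With these counts, a green fruit can be classified by applying Bayes' theorem directly. The arithmetic below is a worked sketch we added from the given data, not part of the original notes:

```python
# Counts from the basket above
total = 60
apples, oranges = 30, 30
green_apples, green_oranges = 10, 15

# Priors and likelihoods
p_apple = apples / total                        # P(Apple) = 0.5
p_orange = oranges / total                      # P(Orange) = 0.5
p_green_given_apple = green_apples / apples     # P(Green | Apple) = 1/3
p_green_given_orange = green_oranges / oranges  # P(Green | Orange) = 0.5

# Evidence: P(Green) = P(Green|Apple)P(Apple) + P(Green|Orange)P(Orange)
p_green = p_green_given_apple * p_apple + p_green_given_orange * p_orange

# Posteriors via Bayes' theorem: P(Class | Green) = P(Green | Class) P(Class) / P(Green)
p_apple_given_green = p_green_given_apple * p_apple / p_green      # 0.4
p_orange_given_green = p_green_given_orange * p_orange / p_green   # 0.6
print(p_apple_given_green, p_orange_given_green)  # 0.4 0.6
```

Since P(Orange | Green) = 0.6 exceeds P(Apple | Green) = 0.4, a green fruit drawn from this basket would be classified as an Orange.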
Bayes' Theorem is widely used in various fields for different applications. Here are some key applications:
• Medical Diagnosis: Estimating the probability of a disease given the presence of certain symptoms and
test results.
• Spam Filtering: Classifying emails as spam or not spam based on their content and features.
• Machine Learning: Training classifiers and models in supervised learning, especially in probabilistic
algorithms.
• Risk Assessment: Evaluating risks in finance, insurance, and other industries by updating probabilities
based on new evidence.
• Recommender Systems: Improving recommendations by updating user preferences based on new
interactions or feedback.
• Fault Diagnosis: Identifying the probability of different faults in complex systems like machinery or
electronics based on observed symptoms.
• Decision Making: Supporting decision-making processes by providing probabilistic estimates and
updating them as new information becomes available.
4)K-Nearest Neighbors (KNN) / LAZY LEARNER
K-Nearest Neighbors (KNN) is a simple algorithm used in data mining and machine learning.
❖ It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and performs an action on it at the time of
classification.
❖ At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it
classifies that data into the category most similar to the new data.
❖ The K-NN algorithm can be used for Regression as well as for Classification, but it is
mostly used for Classification problems.
The K-NN working can be explained on the basis of the below algorithm:
• Instance-based Learning: KNN does not explicitly learn a model; instead, it makes predictions based
on the similarity of new data to the training data.
• Lazy Learning: It delays the learning process until a prediction is required, meaning the training phase
is minimal or non-existent.
• Similarity Measure: The algorithm uses distance metrics (e.g., Euclidean, Manhattan, Minkowski) to
find the "nearest" neighbors of a given data point.
How KNN Works
Applications
KNN is widely used in various domains for its simplicity and effectiveness.
Example
Let's say you want to predict the class of a new data point with the coordinates (3, 4), using k = 3.
1. Compute the distances: Measure the Euclidean distance from (3, 4) to every point in the training set.
2. Sort the distances: Rank the training points from nearest to farthest.
3. Identify the three nearest neighbors: The nearest neighbors are the points (3, 3), (2, 3), and (4, 5), with the
smallest distances (1, 1.41, and 1.41, respectively).
4. Make the prediction: Among these three nearest neighbors, two belong to class B and one belongs to class
A. Since class B is the majority, the new point (3, 4) is classified as B.
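The steps above can be sketched in code. The training set below is an assumption for illustration: the original only states that, of the three nearest neighbors, two are class B and one is class A, so the labels are chosen to match that:

```python
import math
from collections import Counter

# Hypothetical training set; labels chosen so two nearest neighbors are 'B' and one is 'A'
training = [((3, 3), "B"), ((2, 3), "B"), ((4, 5), "A"), ((7, 8), "A"), ((0, 0), "B")]

def knn_predict(point, data, k=3):
    """Classify `point` by majority vote among its k nearest neighbors (Euclidean distance)."""
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Distances from (3, 4): (3,3) -> 1.0, (2,3) -> 1.41, (4,5) -> 1.41, so neighbors vote B, B, A
print(knn_predict((3, 4), training))  # B
```

Note that no model is built before `knn_predict` is called, which is exactly the "lazy learner" behaviour described above.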
5)Rule-based classification in data mining
Rule-based classification in data mining is a technique in which class decisions are taken based on various
“if...then… else” rules.
The rules are often written as "IF-THEN" statements.
IF-THEN Rule
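An IF-THEN rule has the form IF condition THEN conclusion. A minimal sketch of a rule-based classifier as an ordered rule list with a default class (the rules reuse the weather example above; the list-of-rules structure is our illustration):

```python
# Each rule is a (condition, class) pair; rules are tried in order and the first match wins
rules = [
    (lambda r: r["weather"] == "cloudy", "YES"),
    (lambda r: r["weather"] == "sunny" and r["humidity"] == "high", "NO"),
    (lambda r: r["weather"] == "sunny" and r["humidity"] == "normal", "YES"),
    (lambda r: r["weather"] == "rainy" and r["wind"] == "strong", "NO"),
    (lambda r: r["weather"] == "rainy" and r["wind"] == "weak", "YES"),
]

def classify(record, rules, default="NO"):
    """Return the class of the first rule whose IF-condition matches the record."""
    for condition, label in rules:
        if condition(record):
            return label
    return default  # default class when no rule fires

print(classify({"weather": "rainy", "humidity": "high", "wind": "weak"}, rules))  # YES
```

Ordering the rules and providing a default class ensures every record receives exactly one class even when rules overlap or none applies.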