Data mining algorithms: Classification


Classification is the process of finding a set of models (or functions) that
describe and distinguish data classes or concepts, so that the model can be used
to predict the class of objects whose class label is unknown. The derived model
is based on the analysis of a set of training data (i.e., data objects whose class
label is known), and may be represented in various forms, such as classification
(if-then) rules, decision trees, and neural networks.
In other words, classification is the problem of identifying to which of a set of
categories (subpopulations) a new observation belongs, on the basis of a training
set of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case,
a classifier is required to predict class labels such as 'Safe' and 'Risky' for adopting
the project and approving it. Classification is a two-step process:
1- Learning Step (Training Phase): Construction of the classification model.
Different algorithms are used to build a classifier by making the model learn
from the available training set. The model has to be trained so that it predicts
accurate results.
2- Classification Step: The model built in the learning step is used to predict the
class labels of new observations. If its estimated accuracy on test data is
acceptable, the classifier is applied to data whose class labels are unknown.
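
To make the two steps concrete, here is a minimal sketch in Python using
scikit-learn's DecisionTreeClassifier. The tiny 'Safe'/'Risky' training set and its
two features (budget, risk score) are invented for this illustration.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: each project is described by two invented
# features (budget in $1000s, risk score 1-10) and a known class label.
X_train = [[60, 2], [20, 8], [75, 1], [30, 9], [50, 3]]
y_train = ['Safe', 'Risky', 'Safe', 'Risky', 'Safe']

clf = DecisionTreeClassifier()      # step 1: learning (training phase)
clf.fit(X_train, y_train)

new_project = [[45, 4]]             # an observation with unknown class label
print(clf.predict(new_project))     # step 2: classification, e.g. ['Safe']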
There are certain data types associated with data mining that tell us the
format of the data (whether it is in text format or in numerical format).
Attributes represent different features of an object. Different types of
attributes are:
• Symmetric: both values are equally important in all respects.
• Asymmetric: the two values are not equally important.
• Binary: possesses only two values, i.e., True or False.


Data mining has different types of classifiers, for example:

1- Decision trees
2- Covering rules
Decision Trees: The decision tree is one of the most powerful and popular tools for
classification and prediction. A decision tree is a flowchart-like tree structure, where
each internal node denotes a test on an attribute, each branch represents an outcome
of the test, and each leaf node (terminal node) holds a class label.

(Figure: a decision tree for the concept PlayTennis. Outlook is tested at the root;
the Sunny branch tests Humidity, the Overcast branch predicts Yes, and the Rain
branch tests Wind.)


Construction of Decision Tree
A tree can be "learned" by splitting the source set into subsets based on an attribute
value test. This process is repeated on each derived subset in a recursive manner,
called recursive partitioning. The recursion is complete when all records in the subset
at a node have the same value of the target variable, or when splitting no longer adds
value to the predictions. The construction of a decision tree classifier requires no
domain knowledge or parameter setting, and is therefore appropriate for exploratory
knowledge discovery. Decision trees can handle high-dimensional data, and in general
decision tree classifiers have good accuracy.
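
The recursive-partitioning idea can be sketched in a few lines of Python. The record
format and the naive choice of the first attribute are assumptions made for
illustration; a real learner would choose the best splitting attribute with a measure
such as information gain or the Gini index (described below).

from collections import Counter

def partition(rows, attr):
    # Group the rows by their value of the chosen attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups

def build_tree(rows, attrs):
    classes = Counter(row['class'] for row in rows)
    # Recursion stops when the subset is pure or no attributes remain.
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]   # leaf: (majority) class label
    attr = attrs[0]   # naive choice; ID3/CART would pick the best split here
    return {attr: {value: build_tree(subset, attrs[1:])
                   for value, subset in partition(rows, attr).items()}}

rows = [{'Outlook': 'Sunny', 'class': 'No'}, {'Outlook': 'Overcast', 'class': 'Yes'}]
print(build_tree(rows, ['Outlook']))   # {'Outlook': {'Sunny': 'No', 'Overcast': 'Yes'}}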


Decision Tree Representation


Decision trees classify instances by sorting them down the tree from the root to some
leaf node, which provides the classification of the instance. An instance is classified by
starting at the root node of the tree, testing the attribute specified by this node, and
then moving down the tree branch corresponding to the value of the attribute, as shown
in the figure above. This process is then repeated for the subtree rooted at the new node.
The decision tree in the figure above classifies a particular morning according to
whether it is suitable for playing tennis, returning the classification associated with
the leaf reached (in this case, Yes or No).
For example, the instance
(Outlook = Rain, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the Outlook = Rain branch to the Wind = Strong leaf, and would
therefore be classified as a negative instance (No).
In other words, we can say that a decision tree represents a disjunction of conjunctions
of constraints on the attribute values of instances; the tree above corresponds to
(Outlook = Sunny ^ Humidity = Normal) v (Outlook = Overcast) v (Outlook = Rain ^
Wind = Weak)
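
As a sketch, the tree implied by this disjunction can be written directly as nested
tests in Python; the function name and string values are illustrative.

def play_tennis(outlook, humidity, wind):
    # Each nested test mirrors one internal node of the tree;
    # Temperature is never tested, so it does not affect the result.
    if outlook == 'Sunny':
        return 'Yes' if humidity == 'Normal' else 'No'
    if outlook == 'Overcast':
        return 'Yes'
    if outlook == 'Rain':
        return 'Yes' if wind == 'Weak' else 'No'

# The example instance (Outlook = Rain, Humidity = High, Wind = Strong):
print(play_tennis('Rain', 'High', 'Strong'))   # -> No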
Strengths and Weaknesses of the Decision Tree Approach
The strengths of decision tree methods are:
1- Decision trees are able to generate understandable rules.
2- Decision trees perform classification without requiring much computation.
3- Decision trees are able to handle both continuous and categorical variables.
4- Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods are:
1- Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
2- Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.

3- Decision trees can be computationally expensive to train. At each node, each
candidate splitting field must be sorted before its best split can be found. In some
algorithms, combinations of fields are used and a search must be made for optimal
combining weights. Pruning algorithms can also be expensive, since many candidate
sub-trees must be formed and compared.

In building a decision tree, the major challenge is the identification of the attribute
for the root node at each level. This process is known as attribute selection. We have
two popular attribute selection measures: Information Gain and the Gini Index.
1. Information Gain
When we use a node in a decision tree to partition the training instances into
smaller subsets, the entropy changes. Information gain is a measure of this change
in entropy. For a set S in which each class i occurs with proportion pi, the entropy is
Entropy(S) = − Σi pi log2(pi),
and the information gain of splitting S on an attribute A is
Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv),
where Sv is the subset of S for which attribute A has value v.
Example: Consider a dataset with 1 blue, 2 green, and 3 red items. Here
p(blue) = 1/6, p(green) = 2/6, and p(red) = 3/6, so
Entropy = − (1/6) log2(1/6) − (2/6) log2(2/6) − (3/6) log2(3/6) ≈ 1.46.
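
A small Python helper, written for this example, verifies the calculation:

import math

def entropy(counts):
    # Shannon entropy (base 2) of a list of class counts.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

print(entropy([1, 2, 3]))   # 1 blue, 2 green, 3 red -> about 1.459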

2- Gini Index
The Gini index measures the impurity of a set S: Gini(S) = 1 − Σi pi², where pi is the
relative frequency of class i in S. For a split of S into partitions S1, ..., Sk, the Gini
index of the split is the weighted average Σj (|Sj| / |S|) × Gini(Sj).
Example: Consider the training examples for a binary classification problem.
(Table: training examples with attributes Customer ID, Gender, Car Type, and Shirt
Size; the two class labels occur equally often.)
(a) Compute the Gini index for the overall collection of training examples.
Answer: Gini = 1 − 2 × 0.5² = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer:

Each Customer ID value identifies exactly one training record, so the gini for each
Customer ID value is 0. Therefore, the overall gini for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer:
The gini for Male is 1 − 2 × 0.5² = 0.5. The gini for Female is also 0.5.
Therefore, the overall gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.
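
These Gini computations can be checked with a short Python sketch. The absolute
counts below are assumptions chosen to be consistent with the answers above (every
Customer ID partition holds exactly one record, and each gender is split evenly
between the two classes); only the ratios matter.

def gini(counts):
    # Gini impurity of a list of class counts.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    # Weighted Gini of a multiway split; each partition is a class-count list.
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(gini([10, 10]))                              # (a) overall -> 0.5
print(gini_split([[1, 0]] * 10 + [[0, 1]] * 10))   # (b) Customer ID -> 0.0
print(gini_split([[5, 5], [5, 5]]))                # (c) Gender -> 0.5

The same gini_split helper can be reused for the multiway splits in the homework below.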

H.W.:
(d) Compute the Gini index for the Car Type attribute using a multiway split.
(e) Compute the Gini index for the Shirt Size attribute using a multiway split.

Example 2: Consider the training examples for a binary classification problem.
(Table: nine training examples with binary attributes a1 and a2; four examples are
positive and five are negative.)
(a) What is the entropy of this collection of training examples with respect to the
positive class?
Answer:
There are four positive examples and five negative examples. Thus,
P(+) = 4/9 and P(−) = 5/9. The entropy of the training examples is
−(4/9) log2(4/9) − (5/9) log2(5/9) = 0.9911.

(b) What are the information gains of a1 and a2 relative to these training examples?
Answer:
For attribute a1, the corresponding counts are: a1 = T has 3 positive and 1 negative
example, and a1 = F has 1 positive and 4 negative examples.
The entropy for a1 is
(4/9) × [−(3/4) log2(3/4) − (1/4) log2(1/4)] + (5/9) × [−(1/5) log2(1/5) − (4/5) log2(4/5)] = 0.7616.
Therefore, the information gain for a1 is 0.9911 − 0.7616 = 0.2294.


For attribute a2, the corresponding counts are: a2 = T has 2 positive and 3 negative
examples, and a2 = F has 2 positive and 2 negative examples.
The entropy for a2 is
(5/9) × [−(2/5) log2(2/5) − (3/5) log2(3/5)] + (4/9) × [−(2/4) log2(2/4) − (2/4) log2(2/4)] = 0.9839.
Therefore, the information gain for a2 is 0.9911 − 0.9839 = 0.0072.
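
The gains can be verified with a short Python function, using the class counts
reconstructed above:

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent, partitions):
    # Gain = parent entropy minus the weighted entropy of the partitions.
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

print(info_gain([4, 5], [[3, 1], [1, 4]]))   # a1 -> about 0.2294
print(info_gain([4, 5], [[2, 3], [2, 2]]))   # a2 -> about 0.0072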
