Decision Trees
Presentation no.1
Definitions
Why should we use Decision Trees
The basic algorithm of Decision Trees (overview)
Common steps for using Decision Trees
Disadvantages
Application of Decision Trees in NLP
3 Decision Trees
Definition:
The decision tree method is a powerful statistical tool for
classification, prediction, interpretation, and data
manipulation that has several potential applications.
Non-parametric approach without distributional assumptions.
A decision tree can also be re-represented as if-then rules to improve human readability.
4 Why should we use Decision Trees?
Information gain
A quantitative measure of the worth of an attribute: it measures the expected reduction in entropy given the value of some attribute A (a short code sketch follows this slide).
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Variable selection.
To select the most relevant input variables that should be used to
form decision tree models.
Assessing the relative importance of variables.
Generally, variable importance is computed based on the reduction of model accuracy when the variable is removed. In most circumstances, the more records a variable has an effect on, the greater the importance of the variable.
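A minimal sketch of the entropy and information-gain computation above, assuming a tiny labelled dataset represented as Python dictionaries (the attribute names, records, and labels are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes c of p_c * log2(p_c)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v in Values(A) of |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in records):
        subset = [lab for rec, lab in zip(records, labels) if rec[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy data (invented): which attribute reduces entropy of the class labels more?
records = [{"Outlook": "sunny", "Windy": "yes"}, {"Outlook": "sunny", "Windy": "no"},
           {"Outlook": "rain",  "Windy": "yes"}, {"Outlook": "rain",  "Windy": "no"}]
labels = ["no", "yes", "no", "yes"]
print(information_gain(records, labels, "Outlook"))  # 0.0  (no information)
print(information_gain(records, labels, "Windy"))    # 1.0  (perfect split)
```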
10 Common Steps for using decision trees (2)
Stopping
All decision trees need stopping criteria; otherwise it would be possible, but undesirable, to grow a tree in which each case occupies its own node. The resulting tree would be computationally expensive, difficult to interpret, and would probably not work well on new data. Common stopping criteria (sketched in code after this list):
Number of cases in the node is less than some pre-specified limit.
Purity of the node is more than some pre-specified limit.
Depth of the node is more than some pre-specified limit.
Predictor values for all records are identical, in which case no rule could be generated to split them.
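These stopping rules are commonly exposed as hyperparameters; a minimal sketch using scikit-learn's DecisionTreeClassifier, where the specific threshold values are illustrative assumptions rather than recommendations from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each hyperparameter corresponds to one of the stopping rules listed above;
# nodes whose predictor values are all identical are never split in any case.
clf = DecisionTreeClassifier(
    min_samples_split=10,        # too few cases in the node: stop
    min_impurity_decrease=0.01,  # node already (nearly) pure enough: stop
    max_depth=5,                 # node deeper than the limit: stop
    random_state=0,
)
clf.fit(X, y)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```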
13 Common Steps for using decision trees (5)
Pruning.
In some situations, stopping rules do not work well. An alternative way to build a
decision tree model is to grow a large tree first, and then prune it to optimal size by
removing nodes that provide less additional information.
Two types:
Pre-pruning (forward pruning) uses chi-square tests or multiple-comparison adjustment methods to prevent the generation of non-significant branches.
Post-pruning is applied after generating a full decision tree to remove branches in a manner that improves the accuracy of the overall classification on a validation dataset.
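Post-pruning of this kind can be sketched with scikit-learn's cost-complexity pruning: grow a full tree, then keep the pruning strength (ccp_alpha) that scores best on held-out data. The dataset and split are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a full (unpruned) tree, then list the candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Post-pruning: keep the pruned tree that classifies the validation set best.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train) for a in alphas),
    key=lambda tree: tree.score(X_valid, y_valid),
)
print("pruned depth:", best.get_depth(), "validation accuracy:", best.score(X_valid, y_valid))
```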
14 Common Steps for using decision trees (6)
Prediction.
This is one of the most important uses of decision tree models: predicting the outcome for future, unseen records.
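A short sketch of using a fitted tree to predict outcomes for new, unseen records (the two new feature vectors are made up for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# "Future records": feature vectors the model has never seen before.
new_records = [[5.0, 3.4, 1.5, 0.2],
               [6.7, 3.0, 5.2, 2.3]]
print(clf.predict(new_records))        # predicted class for each record
print(clf.predict_proba(new_records))  # class probability estimates per record
```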
16 Disadvantages
It can be subject to overfitting and underfitting, particularly when using a small data set (illustrated in the sketch after this slide). This can limit the generalizability and robustness of the resulting models.
Strong correlation between different potential input variables may result in the selection of
variables that improve the model statistics but are not causally related to the outcome of
interest.
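The overfitting risk can be made concrete with a quick check: fit an unrestricted tree on a deliberately small training sample and compare training accuracy with held-out accuracy. The dataset and sample size below are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Keep only a small training sample to make overfitting likely.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=40, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit, no pruning
print("train accuracy:", clf.score(X_train, y_train))  # typically 1.0 (memorised)
print("test accuracy: ", clf.score(X_test, y_test))    # noticeably lower on unseen data
```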
17 Application of Decision Trees in NLP
Part of speech (POS) tagging
Text Classification
18 Application of Decision Trees in NLP
The baseline heuristic is to choose, for each word, its most probable tag according to the lexical probability.
Choosing the proper syntactic tag for a word in a particular context
can be stated as a problem of classification.
A learning algorithm is applied over the set of possible tags; classes are identified with tags.
It is possible to group all the words appearing in the corpus according to the set of their possible tags; these groups are called ambiguity classes (see the sketch below).
A taxonomy is extracted from the WSJ corpus. The general POS tagging problem is split into one classification problem for each ambiguity class.
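A minimal sketch of building ambiguity classes from a tagged corpus: each word is keyed by the set of tags it can take, and words sharing the same tag set fall into the same class. The tiny (word, tag) corpus below is invented for illustration:

```python
from collections import defaultdict

# Hypothetical (word, tag) pairs standing in for a tagged corpus such as the WSJ.
corpus = [("the", "DT"), ("can", "MD"), ("can", "NN"), ("run", "VB"),
          ("run", "NN"), ("runs", "VBZ"), ("a", "DT")]

# Collect the set of possible tags observed for each word (its lexical ambiguity).
possible_tags = defaultdict(set)
for word, tag in corpus:
    possible_tags[word].add(tag)

# Group words by their tag set: each distinct tag set is one ambiguity class,
# and one classification problem (one tree) is learned per class.
ambiguity_classes = defaultdict(list)
for word, tags in possible_tags.items():
    ambiguity_classes[frozenset(tags)].append(word)

for tags, words in ambiguity_classes.items():
    print(sorted(tags), "->", words)
```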
22 Treetagger
Classify the word using the corresponding decision tree. The ambiguity of
the context (either left or right) during classification may generate
multiple answers for the questions of the nodes. In this case, all the paths
are followed and the result is taken as a weighted average of the results of
all possible paths. The weight for each path is actually its probability.
Use the resulting probability distribution to update the probability
distribution of the word.
Discard the tags with almost zero probability, that is, those with
probabilities lower than a certain discard boundary parameter.
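A minimal sketch of this weighted-path classification, assuming a hand-built tree whose nodes ask yes/no questions about the (possibly still ambiguous) tag of a neighbouring word; all structures, tags, and probabilities are invented for illustration:

```python
def classify(node, context):
    """Return a tag distribution for the focus word, given ambiguous context tags."""
    if "leaf" in node:                       # leaf: a probability distribution over tags
        return node["leaf"]
    pos, tag = node["question"]              # e.g. (-1, "DT"): is the previous word's tag DT?
    p_yes = context[pos].get(tag, 0.0)       # probability that the answer is "yes"
    yes_dist = classify(node["yes"], context)
    no_dist = classify(node["no"], context)
    tags = set(yes_dist) | set(no_dist)
    # Weighted average of the results of both paths, weighted by path probability.
    return {t: p_yes * yes_dist.get(t, 0.0) + (1 - p_yes) * no_dist.get(t, 0.0) for t in tags}

def discard(dist, boundary=0.05):
    """Drop tags with almost zero probability and renormalise the rest."""
    kept = {t: p for t, p in dist.items() if p >= boundary}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

# Tiny hand-built tree for the ambiguity class {NN, VB}.
tree = {"question": (-1, "DT"),
        "yes": {"leaf": {"NN": 0.9, "VB": 0.1}},
        "no":  {"leaf": {"NN": 0.3, "VB": 0.7}}}

# The left-context word is itself still ambiguous: 60% DT, 40% PRP.
context = {-1: {"DT": 0.6, "PRP": 0.4}}
print(discard(classify(tree, context)))      # weighted average over both paths
```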
23 Treetagger
After the stopping criterion is satisfied, some words could still remain
ambiguous. Then there are two possibilities:
1) Choose the most probable tag for each still-ambiguous word to
completely disambiguate the text.
2) Accept the residual ambiguity.
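Both options can be sketched in a line each, assuming per-word tag distributions like those produced above (the example word and probabilities are hypothetical):

```python
remaining = {"can": {"MD": 0.55, "NN": 0.45}}   # a hypothetical still-ambiguous word

# 1) Complete disambiguation: pick the most probable tag for each ambiguous word.
forced = {word: max(dist, key=dist.get) for word, dist in remaining.items()}
print(forced)      # {'can': 'MD'}

# 2) Accept the residual ambiguity: keep the full distribution as the output.
print(remaining)
```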
Bayesian classifier
Decision Tree
K-nearest neighbor (KNN)
Support Vector Machines (SVMs)
Neural Networks
Rocchio's algorithm
29 How does a decision tree work for text classification?
When a decision tree is used for text classification, its internal nodes are labelled by terms, the branches departing from them are labelled by tests on the term weight, and the leaf nodes are labelled by the corresponding classes.
The tree classifies a document by running it through this structure from the root until it reaches a leaf, whose label is assigned as the class of the document.
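A minimal sketch of such a tree, assuming bag-of-words documents with term weights: internal nodes test whether a term's weight exceeds a threshold, leaves carry class labels, and classification walks from the root to a leaf. The terms, thresholds, and classes are invented for illustration:

```python
# Internal node: (term, weight_threshold, subtree_if_greater, subtree_otherwise).
# Leaf node: a class label (a plain string).
tree = ("ball", 0.5,
        ("score", 0.2, "sports", "sports"),
        ("election", 0.3, "politics", "other"))

def classify(node, doc_weights):
    """Walk from the root to a leaf; the leaf's label classifies the document."""
    if isinstance(node, str):                # reached a leaf: return its class label
        return node
    term, threshold, high, low = node
    branch = high if doc_weights.get(term, 0.0) > threshold else low
    return classify(branch, doc_weights)

doc = {"ball": 0.8, "score": 0.6}            # toy document as term -> weight mapping
print(classify(tree, doc))                   # -> 'sports'
```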
30 Advantages
Simplicity in understanding and interpreting, even for non-expert
users.
Handling multi-label documents reduces the cost of induction.
A decision-tree-based symbolic rule induction system for text categorization also improves text classification.
Disadvantage
When most of the training data does not fit in memory, decision tree construction becomes inefficient because training tuples must be swapped in and out of memory. This issue has been handled by approaches that work with numeric and categorical data.
31 Which classifier to use?