5 - Predictive Modeling Using Decision Trees
What is prediction?
In data science, prediction means estimating an unknown value. Machine
learning typically works with historical data: models are built and
tested using events from the past.
Predictive models for credit scoring estimate the likelihood that a
potential customer will default (become a write-off). Predictive models
for spam filtering estimate whether a given piece of email is spam.
Predictive models for fraud detection judge whether an account has
been defrauded. The key is that the model is intended to be used to
estimate an unknown value.
Models, Induction, and Prediction
Supervised learning is the creation of a model that describes a
relationship between a set of selected variables (attributes or features)
and a predefined variable called the target variable.
An instance or example represents a fact or a data point (a row in the data set).
Target – the column that gives the class for each instance. Also called the
label or dependent variable.
Let's say that for the churn problem we want to create a supervised
learning model that divides (classifies) the data into segments such as
high risk, medium risk, and low risk.
In the churn example, what variable gives us the most information about
the future churn rate of the population? Being a professional? Age?
Place of residence? Income? Number of complaints to customer service?
Amount of overage charges?
Selecting Informative Attributes
When we split on an attribute, each resulting group should, ideally,
contain only one target value.
Such a group is called "pure".
If every member of a group has the same value for the target, then the
group is pure. If there is at least one member of the group that has a
different value for the target variable than the rest of the group, then the
group is impure.
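As a minimal Python sketch (the function name and the example label lists are ours, invented for illustration), purity can be checked directly:

```python
def is_pure(labels):
    """A group is pure when every member shares the same target value."""
    return len(set(labels)) <= 1

# A group where everyone churned is pure; a mixed group is not.
print(is_pure(["churn", "churn", "churn"]))     # True
print(is_pure(["churn", "no churn", "churn"]))  # False
```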
Not all attributes are binary; many attributes have three or more
distinct values. We must also take into account that one attribute may
split the data into two groups while another splits it into three, or
seven. How do we compare these splits?
For classification problems, we can address these issues by defining a
formula that evaluates how well each attribute splits a set of examples
into segments. Such a formula is based on a purity measure; the one used
here is entropy.
entropy = − p+ log2(p+) − p− log2(p−), where p+ is the proportion of
positive instances, p− is the proportion of negative instances, and
p+ = 1 − p−.
Starting with a set of all negative instances (p+ = 0), the set has
minimal disorder (it is pure) and the entropy is zero.
If we start to switch class labels of elements of the set from − to +,
the entropy increases. Entropy is maximized at 1 when the classes are
perfectly balanced and p+ = p− = 0.5 (e.g., five of each in a
ten-element set).
As more class labels are switched, the + class starts to predominate and
the entropy decreases again. When all instances are positive, p+ = 1 and
the entropy is minimal again at zero.
Entropy
Example:
Consider a set S of 10 people, seven of the non-write-off class and
three of the write-off class, so p(non-write-off) = 0.7 and
p(write-off) = 0.3. Then
entropy(S) = −0.7 × log2(0.7) − 0.3 × log2(0.3) ≈ 0.88.
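The computation for set S can be checked with a short Python sketch (the `entropy` helper is ours, not from the text; it takes the per-class counts):

```python
import math

def entropy(counts):
    """Entropy of a set, given the count of each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

# Set S: 7 non-write-offs and 3 write-offs.
print(round(entropy([7, 3]), 3))  # 0.881
```

Note the guard `if c > 0`: an empty class contributes zero to the sum, matching the convention 0 × log2(0) = 0.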
Information Gain
Entropy tells us how impure a set is.
Information gain is a function of both a parent set and of the children
sets resulting from some partitioning of the parent set on an attribute.
Information gain for a split is the entropy of the parent minus the
weighted average of the entropies of the children:
IG(parent, children) = entropy(parent) − Σi p(ci) × entropy(ci),
where p(ci) is the proportion of instances falling into child ci.
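Continuing with set S (7 non-write-offs, 3 write-offs), here is a sketch of the calculation in Python; the particular split into children is invented for illustration:

```python
import math

def entropy(counts):
    """Entropy of a set, given the count of each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the weighted average of child entropies."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split of S: child 1 gets 6 non-write-offs and 1 write-off,
# child 2 gets 1 non-write-off and 2 write-offs.
gain = information_gain([7, 3], [[6, 1], [1, 2]])
print(round(gain, 3))  # 0.192
```

The split reduces impurity, so the gain is positive; a split that left both children as mixed as the parent would have a gain near zero.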
Splitting a class example
Consider a data set containing loan defaulters (dots) and non-defaulters
(stars), plotted along a numeric attribute.
The question: how do we pick the right splitting point, or threshold?
Conceptually, we can try every reasonable split point and choose the one
that gives the highest information gain. In practice, candidate
thresholds are usually the midpoints between adjacent values of the
sorted attribute.
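A minimal sketch of that search in Python (the data values, labels, and function names are made up for illustration; candidate thresholds are taken as midpoints between adjacent distinct sorted values):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_split(values, labels):
    """Return the (threshold, gain) pair with the highest information gain."""
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold can separate identical values
        t = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = parent - (len(left) / len(pairs)) * entropy(left) \
                      - (len(right) / len(pairs)) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Made-up example: defaulters ("dot") cluster at higher attribute values.
values = [10, 20, 30, 40, 50, 60]
labels = ["star", "star", "star", "dot", "dot", "dot"]
print(best_split(values, labels))  # (35.0, 1.0)
```

Here the split at 35.0 separates the classes perfectly, so the gain equals the parent entropy of 1.0; real data rarely allows a perfectly pure split.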