DA_Lab_Week-3
Agenda:
1. About Decision Tree
a. What is a Decision Tree?
b. What problems can be solved using DT?
c. How does a DT work?
d. Decision Trees with Package party on iris dataset
e. Decision Trees with Package rpart on iris dataset
2. Use case / Case Study on ………………
a. Data Discovery
b. Data Pre-processing
c. Model Planning and Building
d. Communicate Results
(a) What is a Decision Tree?
A decision tree is a graph that represents choices and their results in the form of a
tree. The nodes of the graph represent an event or choice, and the edges of the graph
represent the decision rules or conditions.
(or)
A decision tree is a tree-shaped algorithm used to determine a course of action. Each
branch of the tree represents a possible decision, occurrence or reaction.
Decision trees are widely used in machine learning and data mining applications, and R
provides several packages for building them.
(or)
A decision tree (also called prediction tree) uses a tree structure to specify
sequences of decisions and consequences.
Given input X = {x1, x2, …, xn}, the goal is to predict a response or output variable Y.
Each member of the set {x1, x2, …, xn} is called an input variable.
The prediction can be achieved by constructing a decision tree with test points and
branches.
At each test point, a decision is made to pick a specific branch and traverse down
the tree. Eventually, a final point is reached, and a prediction can be made.
Each test point in a decision tree involves testing a particular input variable (or
attribute), and each branch represents the decision being made.
Due to their flexibility and easy visualization, decision trees are commonly deployed in
data mining applications for classification purposes.
The input values of a decision tree can be categorical or continuous.
A decision tree employs a structure of test points (called nodes) and branches,
which represent the decision being made.
A node without further branches is called a leaf node. The leaf nodes return class labels
and, in some implementations, probability scores. A decision tree can be converted into
a set of decision rules.
In the following example rule, income and mortgage_amount are input variables, and
the response is the output variable default with a probability score.
IF income < $50,000 AND mortgage_amount > $100K
THEN default = True WITH PROBABILITY 75%
Generally, a model is created with observed data, also called training data.
A set of validation data is then used to verify and improve the model.
Decision trees have two varieties: classification trees and regression trees.
1. Classification trees usually apply to output variables that are categorical—
often binary—in nature, such as yes or no, purchase or not purchase, and so on.
2. Regression trees, on the other hand, can apply to output variables that are
numeric or continuous, such as the predicted price of a consumer good or the
likelihood a subscription will be purchased.
R has packages that can be used to create and visualize decision trees.
For a new set of predictor variables, we use this model to arrive at a decision on the
category (yes/no, spam/not spam) of the data.
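As an illustration of the two varieties, here is a minimal sketch (not from the lab text) using package rpart, which is covered in detail later in this document; the type of the response variable and the method argument decide whether a classification or a regression tree is grown, and the regression response (Sepal.Length) is chosen only for illustration.
> library(rpart)
> # classification tree: categorical output (Species)
> cls_tree <- rpart(Species ~ ., data = iris, method = "class")
> # regression tree: numeric output (Sepal.Length)
> reg_tree <- rpart(Sepal.Length ~ ., data = iris[, -5], method = "anova")
> predict(cls_tree, head(iris), type = "class")  # predicted class labels
> predict(reg_tree, head(iris[, -5]))            # predicted numeric values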
(b) What problems can be solved using DT?
(c) Build a decision tree for the iris data with function ctree() in package party
Details of the data can be found with ?iris. Sepal.Length, Sepal.Width,
Petal.Length and Petal.Width are used to predict the Species of flowers.
In the package, function ctree() builds a decision tree, and predict() makes
predictions for new data.
Before modeling, the iris data is split below into two subsets: training
(70%) and test (30%). The random seed is set to a fixed value below to
make the results reproducible.
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> set.seed(1234)
> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]
We then load package party, build a decision tree, and check the
prediction result.
Function ctree() provides some parameters, such as MinSplit, MinBucket,
MaxSurrogate and MaxDepth, to control the training of decision trees.
Below we use default settings to build a decision tree. Examples of setting
the above parameters are available in Chapter 13. In the code below,
myFormula specifies that Species is the target variable and all other variables are
independent variables.
> library(party)
> myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
# What does '~' mean in R?
# Tilde operator
# The tilde operator is used to define the relationship between the dependent
# variable and the independent variable(s) in a statistical model formula. The
# variable on the left-hand side of the tilde operator is the dependent variable,
# and the variable(s) on the right-hand side of the tilde operator is/are the
# independent variable(s).
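The same operator appears in any R model formula; for instance (an illustrative example on the built-in mtcars data, not part of the lab exercise):
> # mpg is the dependent variable; wt and hp are the independent variables
> lm(mpg ~ wt + hp, data = mtcars)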
> iris_ctree <- ctree(myFormula, data=trainData)
> # check the prediction
> table(predict(iris_ctree), trainData$Species)
setosa versicolor virginica
setosa 40 0 0
versicolor 0 37 3
virginica 0 1 31
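The table above is computed on the training data. The fitted tree can be checked on the held-out test set in the same way (a sketch following the same pattern; the exact counts depend on the random split):
> testPred <- predict(iris_ctree, newdata = testData)
> table(testPred, testData$Species)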
> print(iris_ctree)
Conditional inference tree with 4 terminal nodes
Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 112
1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643
  2)* weights = 40
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939
    4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397
      5)* weights = 21
    4) Petal.Length > 4.4
      6)* weights = 19
  3) Petal.Width > 1.7
    7)* weights = 32
> plot(iris_ctree)
> plot(iris_ctree, type="simple")
Decision Trees with Package rpart
Package rpart is used in this section to build a decision tree on the bodyfat data.
Function rpart() is used to build a decision tree, and the tree with the minimum
prediction error is selected. After that, the selected tree is applied to new data to
make predictions with function predict().
Next, the data is split into training and test subsets, and a decision tree is built on
the training data.
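The bodyfat data must be loaded before the split; a minimal sketch, assuming the copy of bodyfat shipped with package TH.data:
> data("bodyfat", package = "TH.data")
> dim(bodyfat)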
> set.seed(1234)
> ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
> bodyfat.train <- bodyfat[ind==1,]
> bodyfat.test <- bodyfat[ind==2,]
> # train a decision tree
> library(rpart)
> myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
> bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
+ control = rpart.control(minsplit = 10))
> attributes(bodyfat_rpart)
$names
[1] "frame" "where" "call"
[4] "terms" "cptable" "method"
[7] "parms" "control" "functions"
[10] "numresp" "splits" "variable.importance"
[13] "y" "ordered"
$xlevels
named list()
$class
[1] "rpart"
> print(bodyfat_rpart$cptable)
CP nsplit rel error xerror xstd
1 0.67272638 0 1.00000000 1.0194546 0.18724382
2 0.09390665 1 0.32727362 0.4415438 0.10853044
3 0.06037503 2 0.23336696 0.4271241 0.09362895
4 0.03420446 3 0.17299193 0.3842206 0.09030539
5 0.01708278 4 0.13878747 0.3038187 0.07295556
6 0.01695763 5 0.12170469 0.2739808 0.06599642
7 0.01007079 6 0.10474706 0.2693702 0.06613618
8 0.01000000 7 0.09467627 0.2695358 0.06620732
> print(bodyfat_rpart)
n= 56
node), split, n, deviance, yval
* denotes terminal node
1) root 56 7265.0290000 30.94589
2) waistcirc< 88.4 31 960.5381000 22.55645
4) hipcirc< 96.25 14 222.2648000 18.41143
8) age< 60.5 9 66.8809600 16.19222 *
9) age>=60.5 5 31.2769200 22.40600 *
5) hipcirc>=96.25 17 299.6470000 25.97000
10) waistcirc< 77.75 6 30.7345500 22.32500 *
11) waistcirc>=77.75 11 145.7148000 27.95818
22) hipcirc< 99.5 3 0.2568667 23.74667 *
23) hipcirc>=99.5 8 72.2933500 29.53750 *
3) waistcirc>=88.4 25 1417.1140000 41.34880
6) waistcirc< 104.75 18 330.5792000 38.09111
12) hipcirc< 109.9 9 68.9996200 34.37556 *
13) hipcirc>=109.9 9 13.0832000 41.80667 *
7) waistcirc>=104.75 7 404.3004000 49.72571 *
Then we select the tree with the minimum prediction error:
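A sketch of the selection step, assuming we pick the complexity parameter with the smallest cross-validated error (xerror) from the cptable above and prune the tree with it:
> opt <- which.min(bodyfat_rpart$cptable[, "xerror"])
> cp <- bodyfat_rpart$cptable[opt, "CP"]
> bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
> print(bodyfat_prune)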
> plot(bodyfat_prune)
> text(bodyfat_prune, use.n=T)
After that, the selected tree is used to make prediction and the predicted values are
compared with actual labels. In the code below, function abline() draws a diagonal line.
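A minimal sketch of this step, assuming we predict DEXfat for the test subset and plot predicted against observed values:
> DEXfat_pred <- predict(bodyfat_prune, newdata = bodyfat.test)
> xlim <- range(bodyfat$DEXfat)
> plot(DEXfat_pred ~ DEXfat, data = bodyfat.test,
+      xlab = "Observed", ylab = "Predicted", xlim = xlim, ylim = xlim)
> abline(a = 0, b = 1)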
The predictions of a good model are expected to be equal or very close to their actual
values, that is, most points should be on or close to the diagonal line.
WEEK-4
Random Forest
Package randomForest [Liaw and Wiener, 2002] is used below to build a predictive model
for the iris data.
There are two limitations with function randomForest().
First, it cannot handle data with missing values, and users have to impute data
before feeding them into the function.
Second, there is a limit of 32 to the maximum number of levels of each
categorical attribute. Attributes with more than 32 levels have to be transformed
first before using randomForest().
An alternative way to build a random forest is to use function cforest() from package
party, which is not limited to the above maximum number of levels. However, generally
speaking, categorical variables with more levels make a random forest require more
memory and take longer to build.
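For illustration only, a minimal cforest() sketch on the full iris data (the controls shown, cforest_unbiased() with ntree = 100, are an assumed choice and not part of the lab text):
> library(party)
> cf <- cforest(Species ~ ., data = iris,
+               controls = cforest_unbiased(ntree = 100))
> table(predict(cf, OOB = TRUE), iris$Species)   # out-of-bag predictions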
Again, the iris data is first split into two subsets: training (70%) and test (30%).
> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]
Then we load package randomForest and train a random forest. In the code below, the
formula is set to "Species ~ .", which means to predict Species with all other variables
in the data.
> library(randomForest)
> rf <- randomForest(Species ~ ., data=trainData, ntree=100, proximity=TRUE)
> table(predict(rf), trainData$Species)
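The model can then be evaluated on the test subset in the same way (a sketch; the exact counts depend on the random split):
> irisPred <- predict(rf, newdata = testData)
> table(irisPred, testData$Species)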