
Week-3

Aim: Build a Decision Tree using party and rpart packages.

Agenda:
1. About Decision Tree
a. What is a Decision Tree?
b. What problems can be solved using DT?
c. How does a DT work?
d. Decision Trees with Package party on iris dataset
e. Decision Trees with Package rpart on iris dataset
2. Use case / Case Study on ………………
a. Data Discovery
b. Data Pre-processing
c. Model Planning and Building
d. Communicate Results

(a)
A decision tree is a graph that represents choices and their results in the form of a tree. The nodes in the graph represent an event or choice, and the edges of the graph represent the decision rules or conditions.
(or)
A decision tree is a tree-shaped model used to determine a course of action. Each branch of the tree represents a possible decision, occurrence or reaction.
Decision trees are widely used in Machine Learning and Data Mining applications in R.

Examples of uses of decision trees are:

a. Predicting whether an email is spam or not spam,
b. Predicting whether a tumor is cancerous or not,
c. Predicting whether a loan is a good or bad credit risk based on the factors in each
of these cases.

(or)
 A decision tree (also called prediction tree) uses a tree structure to specify
sequences of decisions and consequences.
 Given an input X = {x1, x2, ..., xn}, the goal is to predict a response or output variable Y.
Each member of the set {x1, x2, ..., xn} is called an input variable.
 The prediction can be achieved by constructing a decision tree with test points and
branches.
 At each test point, a decision is made to pick a specific branch and traverse down
the tree. Eventually, a final point is reached, and a prediction can be made.
 Each test point in a decision tree involves testing a particular input variable (or
attribute), and each branch represents the decision being made.
 Due to their flexibility and easy visualization, decision trees are commonly deployed in
data mining applications for classification purposes.
 The input values of a decision tree can be categorical or continuous.
 A decision tree employs a structure of test points (called nodes) and branches,
which represent the decision being made.
 A node without further branches is called a leaf node. The leaf nodes return class labels
and, in some implementations, they return the probability scores. A decision tree can
be converted into a set of decision rules.
In the following example rule, income and mortgage_amount are input variables, and
the response is the output variable default with a probability score.
IF income < $50,000 AND mortgage_amount > $100K
THEN default = True WITH PROBABILITY 75%
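
For illustration only, the example rule above can be written as a small R function. The variables income and mortgage_amount, the thresholds, and the probability are taken from the rule itself; the function name is hypothetical:

default_rule <- function(income, mortgage_amount) {
  # apply the single decision rule: low income and a large mortgage imply default
  if (income < 50000 && mortgage_amount > 100000) {
    return(list(default = TRUE, probability = 0.75))
  }
  list(default = FALSE, probability = NA)  # the rule says nothing about other cases
}
default_rule(45000, 150000)  # returns default = TRUE with probability 0.75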

 Generally, a model is created with observed data, also called training data.
 Then a set of validation data is used to verify and improve the model.

Decision trees have two varieties: classification trees and regression trees.
1. Classification trees usually apply to output variables that are categorical—
often binary—in nature, such as yes or no, purchase or not purchase, and so on.
2. Regression trees, on the other hand, can apply to output variables that are
numeric or continuous, such as the predicted price of a consumer good or the
likelihood a subscription will be purchased.
 R has packages that are used to create and visualize decision trees.
 For a new set of predictor variables, we use this model to arrive at a decision on the
category (yes/no, spam/not spam) of the data.

(b)

(c) Build a decision tree for the iris data with function ctree() in package party
 Details of the data can be found in iris. Sepal.Length, Sepal.Width,
Petal.Length and Petal.Width are used to predict the Species of flowers.
 In the package, function ctree() builds a decision tree, and predict() makes
prediction for new data.

 Before modeling, the iris data is split below into two subsets: training
(70%) and test (30%). The random seed is set to a fixed value below to
make the results reproducible.

> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> set.seed(1234)
> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]
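
As a quick optional check (not part of the original steps), the sizes of the two subsets can be verified:

table(ind)        # how many rows were assigned to subset 1 (training) and subset 2 (test)
dim(trainData)    # about 70% of the 150 rows
dim(testData)     # the remaining ~30%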
 We then load package party, build a decision tree, and check the
prediction result.
 Function ctree() provides some parameters, such as MinSplit, MinBucket,
MaxSurrogate and MaxDepth, to control the training of decision trees.
 Below we use default settings to build a decision tree. Examples of setting
the above parameters are available in Chapter 13. In the code below, myFormula
specifies that Species is the target variable and all other variables are
independent variables.
> library(party)
> myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
# What does '~' mean in R?
# Tilde operator
# The tilde operator is used to define the relationship between the dependent variable
# and the independent variable(s) in a statistical model formula. The variable on the
# left-hand side of the tilde operator is the dependent variable, and the variable(s)
# on the right-hand side of the tilde operator is/are called the independent variable(s).
> iris_ctree <- ctree(myFormula, data=trainData)
> # check the prediction
> table(predict(iris_ctree), trainData$Species)
             setosa versicolor virginica
  setosa         40          0         0
  versicolor      0         37         3
  virginica       0          1        31
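
For reference, the overall training accuracy can be computed from this confusion matrix; a minimal sketch:

conf <- table(predict(iris_ctree), trainData$Species)
sum(diag(conf)) / sum(conf)   # proportion of training rows classified correctly (108/112 here)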

> print(iris_ctree)
Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 112
1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643
  2)* weights = 40
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939
    4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397
      5)* weights = 21
    4) Petal.Length > 4.4
      6)* weights = 19
  3) Petal.Width > 1.7
    7)* weights = 32

> plot(iris_ctree)

> plot(iris_ctree, type="simple")
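
The confusion matrix above only describes the training data. Following the same pattern, the tree can also be evaluated on the held-out test set; a sketch, not part of the original listing:

testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)   # confusion matrix on unseen data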

Decision Trees with Package rpart

 Package rpart is used in this section to build a decision tree on the bodyfat data.
 Function rpart() is used to build a decision tree, and the tree with the minimum
prediction error is selected. After that, it is applied to new data to make prediction
with function predict().

> data("bodyfat", package = "TH.data")


> dim(bodyfat)
[1] 71 10
> attributes(bodyfat)
$names
[1] "age" "DEXfat" "waistcirc" "hipcirc" "elbowbreadth"
[6] "kneebreadth" "anthro3a" "anthro3b" "anthro3c" "anthro4"
$row.names
[1] "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
[14] "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72"
[27] "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84" "85"
[40] "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98"
[53] "99" "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"
"111"
[66] "112" "113" "114" "115" "116" "117"
$class
[1] "data.frame"
> bodyfat[1:5,]
age DEXfat waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b
47 57 41.68 100.0 112.0 7.1 9.4 4.42 4.95
48 65 43.29 99.5 116.5 6.5 8.9 4.63 5.01
49 59 35.41 96.0 108.5 6.2 8.9 4.12 4.74
50 58 22.79 72.0 96.5 6.1 9.2 4.03 4.48
51 60 36.42 89.5 100.5 7.1 10.0 4.24 4.68
anthro3c anthro4
47 4.50 6.13
48 4.48 6.37
49 4.60 5.82
50 3.91 5.66
51 4.15 5.91

 Next, the data is split into training and test subsets, and a decision tree is built on
the training data.
> set.seed(1234)
> ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
> bodyfat.train <- bodyfat[ind==1,]
> bodyfat.test <- bodyfat[ind==2,]
> # train a decision tree
> library(rpart)
> myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
> bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
+ control = rpart.control(minsplit = 10))
> attributes(bodyfat_rpart)

$names
[1] "frame" "where" "call"
[4] "terms" "cptable" "method"
[7] "parms" "control" "functions"
[10] "numresp" "splits" "variable.importance"
[13] "y" "ordered"
$xlevels

11
named list()
$class
[1] "rpart"
> print(bodyfat_rpart$cptable)
CP nsplit rel error xerror xstd
1 0.67272638 0 1.00000000 1.0194546 0.18724382
2 0.09390665 1 0.32727362 0.4415438 0.10853044
3 0.06037503 2 0.23336696 0.4271241 0.09362895
4 0.03420446 3 0.17299193 0.3842206 0.09030539
5 0.01708278 4 0.13878747 0.3038187 0.07295556
6 0.01695763 5 0.12170469 0.2739808 0.06599642
7 0.01007079 6 0.10474706 0.2693702 0.06613618
8 0.01000000 7 0.09467627 0.2695358 0.06620732
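
As an optional aid (a sketch using another rpart function, not part of the original listing), the cross-validated error in the cptable can also be inspected graphically before pruning:

plotcp(bodyfat_rpart)   # plots xerror against cp; the pruning cp is usually chosen near the minimum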
> print(bodyfat_rpart)
n= 56
node), split, n, deviance, yval
* denotes terminal node
1) root 56 7265.0290000 30.94589
  2) waistcirc< 88.4 31 960.5381000 22.55645
    4) hipcirc< 96.25 14 222.2648000 18.41143
      8) age< 60.5 9 66.8809600 16.19222 *
      9) age>=60.5 5 31.2769200 22.40600 *
    5) hipcirc>=96.25 17 299.6470000 25.97000
      10) waistcirc< 77.75 6 30.7345500 22.32500 *
      11) waistcirc>=77.75 11 145.7148000 27.95818
        22) hipcirc< 99.5 3 0.2568667 23.74667 *
        23) hipcirc>=99.5 8 72.2933500 29.53750 *
  3) waistcirc>=88.4 25 1417.1140000 41.34880
    6) waistcirc< 104.75 18 330.5792000 38.09111
      12) hipcirc< 109.9 9 68.9996200 34.37556 *
      13) hipcirc>=109.9 9 13.0832000 41.80667 *
    7) waistcirc>=104.75 7 404.3004000 49.72571 *

With the code below, the built tree is plotted:


> plot(bodyfat_rpart)
> text(bodyfat_rpart, use.n=T)
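
The base graphics plot above can be hard to read. As an optional alternative (assuming the add-on package rpart.plot is installed; it is not used elsewhere in this handout), the same tree can be drawn more clearly:

# install.packages("rpart.plot")   # uncomment if the package is not installed
library(rpart.plot)
rpart.plot(bodyfat_rpart)          # draws the rpart tree with labelled splits and node summaries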

Then we select the tree with the minimum prediction error:

> opt <- which.min(bodyfat_rpart$cptable[,"xerror"])


> cp <- bodyfat_rpart$cptable[opt, "CP"]
> bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
> print(bodyfat_prune)
n= 56
node), split, n, deviance, yval
* denotes terminal node
1) root 56 7265.02900 30.94589
  2) waistcirc< 88.4 31 960.53810 22.55645
    4) hipcirc< 96.25 14 222.26480 18.41143
      8) age< 60.5 9 66.88096 16.19222 *
      9) age>=60.5 5 31.27692 22.40600 *
    5) hipcirc>=96.25 17 299.64700 25.97000
      10) waistcirc< 77.75 6 30.73455 22.32500 *
      11) waistcirc>=77.75 11 145.71480 27.95818 *
  3) waistcirc>=88.4 25 1417.11400 41.34880
    6) waistcirc< 104.75 18 330.57920 38.09111
      12) hipcirc< 109.9 9 68.99962 34.37556 *
      13) hipcirc>=109.9 9 13.08320 41.80667 *
    7) waistcirc>=104.75 7 404.30040 49.72571 *

> plot(bodyfat_prune)
> text(bodyfat_prune, use.n=T)

After that, the selected tree is used to make predictions, and the predicted values are
compared with actual labels. In the code below, function abline() draws a diagonal line.
The predictions of a good model are expected to be equal or very close to the actual
values; that is, most points should lie on or close to the diagonal line.

> DEXfat_pred <- predict(bodyfat_prune, newdata=bodyfat.test)


> xlim <- range(bodyfat$DEXfat)
> plot(DEXfat_pred ~ DEXfat, data=bodyfat.test, xlab="Observed",
+ ylab="Predicted", ylim=xlim, xlim=xlim)
> abline(a=0, b=1)
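
Besides the visual check against the diagonal, a numeric error summary can be computed on the test set; a minimal sketch:

rmse <- sqrt(mean((DEXfat_pred - bodyfat.test$DEXfat)^2))   # root mean squared error
mae  <- mean(abs(DEXfat_pred - bodyfat.test$DEXfat))        # mean absolute error
rmse; mae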

WEEK-4
Random Forest

Package randomForest [Liaw and Wiener, 2002] is used below to build a predictive model
for the iris data.
There are two limitations with function randomForest().
 First, it cannot handle data with missing values, and users have to impute data
before feeding them into the function.
 Second, there is a limit of 32 to the maximum number of levels of each
categorical attribute. Attributes with more than 32 levels have to be transformed
first before using randomForest().
An alternative way to build a random forest is to use function cforest() from package
party, which is not limited to the above maximum number of levels. However, generally
speaking, categorical variables with more levels will require more memory and take a
longer time when building a random forest.
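A minimal sketch of the cforest() alternative is shown here; it assumes a training subset named trainData, such as the one created in the code below, and the control settings are only illustrative:

library(party)
cf <- cforest(Species ~ ., data = trainData,
              controls = cforest_unbiased(ntree = 50))   # conditional inference forest
table(predict(cf, OOB = TRUE), trainData$Species)        # out-of-bag confusion matrix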
Again, the iris data is first split into two subsets: training (70%) and test (30%).
> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]
Then we load package randomForest and train a random forest. In the code below, the
formula is set to "Species ~ .", which means to predict Species with all other variables in
the data.
> library(randomForest)
> rf <- randomForest(Species ~ ., data=trainData, ntree=100, proximity=TRUE)
> table(predict(rf), trainData$Species)

             setosa versicolor virginica
  setosa         33          0         0
  versicolor      0         33         2
  virginica       0          2        35
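
The table above is again based on the training data. A sketch of evaluating the forest on the held-out test set and inspecting variable importance (importance() and varImpPlot() are functions from package randomForest):

irisPred <- predict(rf, newdata = testData)
table(irisPred, testData$Species)   # confusion matrix on unseen data
importance(rf)                      # mean decrease in Gini for each predictor
varImpPlot(rf)                      # plot of variable importance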
