ML Unit 3

Decision tree induction is a method for learning decision trees from labeled training data, where each node represents a test on an attribute and each leaf node holds a class label. The ID3 algorithm is a popular approach that uses information gain to select the best attributes for classification, while also addressing issues like overfitting through techniques such as pruning. Decision trees are widely applicable in various fields, including medicine and finance, due to their robustness and interpretability.


Learning with Trees

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is called the root node.

Decision tree learning is one of the most widely used and practical methods for inductive inference. It is a method
for approximating discrete-valued functions that is robust to noisy data and capable of learning disjunctive
expressions. The most popular family of decision tree learning algorithms includes algorithms such as ID3,
ASSISTANT, and C4.5. These decision tree learning methods search a completely expressive hypothesis space and
thus avoid the difficulties of restricted hypothesis spaces. Their inductive bias is a preference for small trees over
large trees.

INTRODUCTION
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned
function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to
improve human readability. These learning methods are among the most popular inductive inference algorithms and have been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants.

DECISION TREE REPRESENTATION

Decision trees classify instances by sorting them down the tree from the root to some leaf node, which
provides the classification of the instance. Each node in the tree specifies a test of some attribute of the
instance, and each branch descending from that node corresponds to one of the possible values for this
attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this
node, then moving down the tree branch corresponding to the value of the attribute in the given example. This
process is then repeated for the sub-tree rooted at the new node.

Figure 3.1 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings according
to whether they are suitable for playing tennis. For example, the instance
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative
instance (i.e., the tree predicts that PlayTennis = no).
In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions. For example, the decision tree shown in Figure 3.1 corresponds to the expression

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Decision Tree Induction:


Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is the root node.

HOW ARE DECISION TREES USED FOR CLASSIFICATION

● Given a tuple X for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree.
● Decision tree classifiers have good accuracy; e.g., they are used in the areas of medicine, manufacturing and production, financial analysis, etc.
● During tree construction, attribute selection measures are used to select the attribute that best partitions
the tuples into distinct classes.
● When decision trees are built, many branches may reflect noise or outliers in the training data.
● Tree pruning attempts to identify and remove such branches with the goal of improving classification
accuracy on unseen data.

THE BASIC DECISION TREE LEARNING ALGORITHM

Most algorithms that have been developed for learning decision trees are variations on a core algorithm that
employs a top-down, greedy search through the space of possible decision trees. This approach is exemplified
by the ID3 algorithm (Quinlan 1986) and its successor C4.5 (Quinlan 1993).
The ID3 algorithm learns decision trees by constructing them top-down, beginning with the attribute that best
classifies the given data. To find the best attribute, each instance attribute is evaluated using a statistical test to
determine how well it alone classifies the training examples. The best attribute is selected and used as the test
at the root node of the tree. A descendant of the root node is then created for each possible value of this
attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch
corresponding to the example's value for this attribute).
The entire process is then repeated using the training examples associated with each descendant node to select
the best attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in
which the algorithm never backtracks to reconsider earlier choices.
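
A minimal Python sketch of this top-down, greedy construction (the attribute selection uses the information gain measure described in the next subsections; the small dataset at the end is made up for illustration):

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels: -sum over classes of p*log2(p)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Expected reduction in entropy from partitioning the examples on attr
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    # Stop when the examples are pure, or when no attributes remain
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree

# Tiny illustrative dataset (attribute values and labels are made up)
rows = [{'Outlook': 'Sunny', 'Wind': 'Weak'}, {'Outlook': 'Sunny', 'Wind': 'Strong'},
        {'Outlook': 'Overcast', 'Wind': 'Weak'}, {'Outlook': 'Rain', 'Wind': 'Strong'}]
labels = ['No', 'No', 'Yes', 'No']
print(id3(rows, labels, ['Outlook', 'Wind']))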

The Best Classifying attribute:

The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. The
attribute that is most useful for classifying examples is selected. A good quantitative measure of the worth of
an attribute is defined by a statistical property, called information gain. It measures how well a given attribute
separates the training examples according to their target classification. ID3 uses this information gain measure
to select among the candidate attributes at each step while growing the tree.

3.4.1.1 ENTROPY MEASURES HOMOGENEITY OF EXAMPLES

In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, which characterizes the (im)purity of an arbitrary collection of examples. Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is

Entropy(S) = -p+ log2(p+) - p- log2(p-)

where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.

To illustrate, suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples (denoted [9+, 5-]). Then the entropy of S relative to this boolean classification is

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Notice that the entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p+ = 1), then p- is 0, and Entropy(S) = -1 · log2(1) - 0 · log2(0) = 0. Note that the entropy is 1 when the collection contains an equal number of positive and negative examples. If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.
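
A short Python check of these values (a minimal sketch that simply implements the entropy formula above):

import math

def entropy(pos, neg):
    # Entropy of a collection with pos positive and neg negative examples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:               # define 0*log2(0) = 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))    # ~0.940
print(entropy(7, 7))    # 1.0 (equal numbers of positive and negative examples)
print(entropy(14, 0))   # 0.0 (all members belong to the same class)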

Figure 3.2 shows the form of the entropy function relative to a boolean classification, as p+ varies between 0 and 1. One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with uniform probability).
More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as

Entropy(S) = Σ (i = 1 to c) -pi log2(pi)

where pi is the proportion of S belonging to class i. The logarithm is base 2 because entropy is a measure of the expected encoding length measured in bits. Note also that if the target attribute can take on c possible values, the entropy can be as large as log2 c.
Information gain is the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}). Here, the first term in the equation is the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned using attribute A. The expected entropy described by this second term is simply the sum of the entropies of each subset Sv, weighted by the fraction |Sv|/|S| of examples that belong to Sv. Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A. Put another way, Gain(S, A) is the information provided about the target function value, given the value of some other attribute A.

For example, suppose S is a collection of training-example days described by attributes including Wind, which can have the values Weak or Strong. As before, assume S is a collection containing 14 examples, [9+, 5-]. Of these 14 examples, suppose 6 of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as

Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
              = 0.940 - (8/14)(0.811) - (6/14)(1.00)
              = 0.048

Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the
tree. The use of information gain to evaluate the relevance of attributes is summarized in Figure 3.3. In this
figure the information gain of two different attributes, Humidity and Wind, is computed in order to determine
which is the better attribute for classifying the training examples shown in Table 3.2.
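
A small Python sketch reproducing the Wind calculation from the counts given above:

import math

def entropy(counts):
    # Entropy from per-class counts, e.g. [9, 5]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, subsets):
    # Information gain: parent entropy minus the weighted entropy of the subsets
    total = sum(parent_counts)
    remainder = sum(sum(sub) / total * entropy(sub) for sub in subsets)
    return entropy(parent_counts) - remainder

# S = [9+, 5-]; Wind = Weak -> [6+, 2-]; Wind = Strong -> [3+, 3-]
print(round(gain([9, 5], [[6, 2], [3, 3]]), 3))   # 0.048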



An Illustrative Example
To illustrate the operation of ID3, consider the learning task represented by the training examples of Table 3.2.

Here the target attribute PlayTennis, which can have values yes or no for different Saturday mornings, is to be
predicted based on other attributes of the morning in question. Consider the first step through the algorithm, in
which the topmost node of the decision tree is created.
ID3 determines the information gain for each candidate attribute (i.e., Outlook, Temperature, Humidity, and
Wind), then selects the one with highest information gain. The computation of information gain for two of
these attributes is shown in Figure 3.3. The information gain values for all four attributes are
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
where S denotes the collection of training examples from Table 3.2.

According to the information gain measure, the Outlook attribute provides the best prediction of the target
attribute, PlayTennis, over the training examples. Therefore, Outlook is selected as the decision attribute for
the root node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast,
and Rain). The resulting partial decision tree is shown in Figure 3.4, along with the training examples sorted to
each new descendant node. Note that every example for which Outlook = Overcast is also a positive example
of PlayTennis. Therefore, this node of the tree becomes a leaf node with the classification PlayTennis = Yes.
In contrast, the descendants corresponding to Outlook = Sunny and Outlook = Rain still have nonzero entropy,
and the decision tree will be further elaborated below these nodes.
The process of selecting a new attribute and partitioning the training examples is now repeated for each
nonterminal descendant node, this time using only the training examples associated with that node. Attributes
that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once
along any path through the tree. This process continues for each new leaf node until either of two conditions is
met: (1) every attribute has already been included along this path through the tree, or (2) the training examples
associated with this leaf node all have the same target attribute value (i.e., their entropy is zero). Figure 3.4
illustrates the computations of information gain for the next step in growing the decision tree. The final
decision tree learned by ID3 from the 14 training examples of Table 3.2 is shown in the figure below:



ID3 Algorithm:

Program to demonstrate a decision tree using a party dataset.



import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset from CSV file


df = pd.read_csv('party.csv')

# Encode categorical variables


le = LabelEncoder()
for col in df.columns:
    df[col] = le.fit_transform(df[col])

# Split features (X) and target variable (y)


X = df[['Lazy', 'Is there a party?', 'Deadline']]
print(X)
y = df['Activity']
print(y)

# Train Decision Tree Classifier


clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X, y)

# Print Decision Tree rules


print("Decision Tree Rules:")
print(export_text(clf, feature_names=['Lazy', 'Is there a party?', 'Deadline']))

# Visualize the Decision Tree


plt.figure(figsize=(10, 6))



plot_tree(clf, feature_names=['Lazy', 'Is there a party?', 'Deadline'],
          class_names=['Party', 'Study', 'Pub', 'TV'], filled=True)
plt.show()

# Predict the activity for one encoded example (Lazy=1, Is there a party?=1, Deadline=2)
print(clf.predict(pd.DataFrame([[1, 1, 2]], columns=['Lazy', 'Is there a party?', 'Deadline'])))
output:

INDUCTIVE BIAS IN DECISION TREE LEARNING:


The inductive bias of ID3 is the basis by which it chooses one of the consistent hypotheses over the others. ID3 chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search through the space of possible trees. The ID3 search strategy
(a) selects in favor of shorter trees over longer ones, and
(b) selects trees that place the attributes with the highest information gain closest to the root.
An approximate characterization of its bias is a preference for short decision trees over complex trees.
Approximate inductive bias of ID3: shorter trees are preferred over larger trees.

ID3 searches a complete hypothesis space (i.e., one capable of expressing any finite discrete-valued function).
It searches incompletely through this space, from simple to complex hypotheses, until its termination condition
is met (e.g., until it finds a hypothesis consistent with the data). Its inductive bias is solely a consequence of the
ordering of hypotheses by its search strategy. Its hypothesis space introduces no additional bias.

OVERFITTING IN DECISION TREE LEARNING

Avoiding Overfitting the Data


The ID3 algorithm grows each branch of the tree just deeply enough to perfectly classify the training examples. This can lead to difficulties when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function. In either of these cases, this simple algorithm can produce trees that overfit the training examples.
There are several approaches to avoiding overfitting in decision tree learning. These can be grouped into two
classes:



1. approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies
the training data,
2. approaches that allow the tree to overfit the data, and then post-prune the tree.

The second approach of post-pruning overfit trees has been found to be more successful in practice.

REDUCED ERROR PRUNING:

Pruning a decision node consists of removing the sub-tree rooted at that node, making it a leaf node, and
assigning it the most common classification of the training examples affiliated with that node. Nodes are
removed only if the resulting pruned tree performs no worse than the original over the validation set. This has
the effect that any leaf node added due to coincidental regularities in the training set is likely to be pruned
because these same coincidences are unlikely to occur in the validation set. Nodes are pruned iteratively,
always choosing the node whose removal most increases the decision tree accuracy over the validation set.
Pruning of nodes continues until further pruning is harmful.

RULE POST-PRUNING:

In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition) and the classification at the leaf node becomes the rule consequent (postcondition). For example, the leftmost path of the tree in Figure 3.1 is translated into the rule

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No

Next, each such rule is pruned by removing any antecedent, or precondition, whose removal does not worsen
its estimated accuracy. Given the above rule, for example, rule post-pruning would consider removing the
preconditions (Outlook = Sunny) and (Humidity = High).
It would select whichever of these pruning steps produced the greatest improvement in estimated rule
accuracy, then consider pruning the second precondition as a further pruning step. No pruning step is
performed if it reduces the estimated rule accuracy.



Dealing with Continuous Variables:

To deal with continuous variables, the continuous variables are discretized. For a continuous variable there is not just one place to split it: the variable can be broken between any pair of adjacent data points, as shown in the figure. Choosing a split is therefore more expensive for continuous variables than for discrete ones, since many candidate split points have to be evaluated. In general, only one (binary) split is made on a continuous variable at each node, at the threshold that scores best on the splitting criterion.
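
A minimal sketch of choosing such a threshold for one continuous feature by information gain (the feature values and labels below are illustrative):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Try the midpoint between each adjacent pair of sorted values
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        g = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if g > best_gain:
            best_gain, best_t = g, t
    return best_t, best_gain

temps = [40, 48, 60, 72, 80, 90]
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
print(best_threshold(temps, play))        # (54.0, ~0.459): best binary split at Temperature > 54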



Regression in Trees:
Suppose that the outputs are continuous, so that a regression model is appropriate. To evaluate the choice of
which feature to use next, we also need to find the value at which to split the dataset according to that feature.
The output is a value at each leaf. In general, this is just a constant value for the output, computed as the mean
average of all the data points that are situated in that leaf. This is the optimal choice in order to minimize the
sum-of-squares error, but it also means that we can choose the split point quickly for a given feature, by
choosing it to minimize the sum-of-squares error. We can then pick the feature that has the split point that
provides the best sum-of-squares error, and continue to use the algorithm as for classification.

CLASSIFICATION AND REGRESSION TREES (CART):

Classification and Regression Trees (CART) is a decision tree algorithm that is used for both classification and
regression tasks. It is a supervised learning algorithm that learns from labelled data to predict unseen data.

Classification trees: The tree is used to determine which "class" the target variable is most likely to fall into; these are used when the target variable is categorical (discrete).
Regression trees: These are used to predict a continuous variable's value.

Tree structure:
CART builds a tree-like structure consisting of nodes and branches. The nodes represent different decision
points, and the branches represent the possible outcomes of those decisions. The leaf nodes in the tree contain a
predicted class label or value for the target variable.



Splitting criteria:
CART uses a greedy approach to split the data at each node. It evaluates all possible splits and selects the one that best reduces the impurity of the resulting subsets. For classification tasks, CART uses Gini impurity as the splitting criterion; the lower the Gini impurity, the purer the subset. For regression tasks, CART uses the reduction in the residual sum of squares as the splitting criterion; the split that most reduces the residual error of the resulting subsets gives the best fit of the model to the data.

Pruning:
To prevent overfitting of the data, pruning is a technique used to remove the nodes that contribute little to the model's accuracy. Cost-complexity pruning and information-gain pruning are two popular pruning techniques. Cost-complexity pruning weighs each subtree's contribution to accuracy against its complexity, and removes subtrees whose accuracy gain does not justify their added complexity. Information-gain pruning involves calculating the information gain of each node and removing nodes that have a low information gain. The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does this by searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.

Gini index/Gini impurity


The Gini index is a metric for classification tasks in CART. It is based on the sum of squared probabilities of each class: it measures the probability of a specific variable being wrongly classified when it is chosen randomly, and is a variation of the Gini coefficient. It works on categorical variables, produces outcomes of either "success" or "failure", and hence conducts binary splitting only. The degree of the Gini index varies from 0 to 1, where 0 indicates that all the elements belong to a single class (only one class exists there). A Gini index close to 1 means a high level of impurity, where each class contains only a very small fraction of the elements, and the maximum value of 1 - 1/n occurs when the elements are uniformly distributed into n classes, each with an equal probability of 1/n. For example, with two classes the maximum Gini impurity is 1 - 1/2 = 0.5. Mathematically, we can write the Gini impurity as follows:

Gini = 1 - Σ (i = 1 to n) pi²

where pi is the probability of an object being classified to a particular class.
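
A small sketch of this computation (the class counts are illustrative):

def gini(counts):
    # Gini impurity from per-class counts: 1 - sum of squared class probabilities
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))     # 0.0    -> pure node, only one class present
print(gini([5, 5]))      # 0.5    -> maximum impurity for two classes (1 - 1/2)
print(gini([5, 5, 5]))   # ~0.667 -> maximum impurity for three classes (1 - 1/3)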

The CART algorithm :


Step 1: The best-split point of each input is obtained.
Step 2: Based on the best-split points of each input in Step 1, the new “best” split point is identified.
Step 3: Split the chosen input according to the “best” split point.
Step 4: Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
import matplotlib.pyplot as plt

# Dataset
data = pd.DataFrame({
    'House_Size': [750, 800, 850, 900, 950, 1000],
    'Price': [150, 180, 200, 220, 240, 260]  # Prices in $1000s
})

# Create feature and target variables


X = data[['House_Size']]
y = data['Price']

# Initialize and train the regression tree (max depth 2)


regressor = DecisionTreeRegressor(criterion='squared_error', max_depth=2)
regressor.fit(X, y)

# Visualizing the regression tree


plt.figure(figsize=(10, 6))
tree.plot_tree(
    regressor,
    feature_names=['House_Size'],
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title("Regression Tree for House Prices")
plt.show()
# Predictions based on input (kept as a DataFrame so the feature name matches the training data)
test_data = pd.DataFrame({'House_Size': [775, 850, 925, 1000]})
predictions = regressor.predict(test_data)

# Display results
for size, price in zip(test_data['House_Size'], predictions):
    print(f"Predicted price for house size {size} sq ft: ${price * 1000:.2f}")



Ensemble Learning:
What will you do when you want to purchase a new mobile phone? Will you just walk up to a shop and buy whatever is there? Usually we ask friends for suggestions, check the reviews online, compare the phone specifications in online media, and then finalize our choice. We may even put up a poll on social media asking for others' opinions. In short, the final decision is a combination of our personal opinion and the opinions we gathered from other sources.

Most real-world problems are similar in nature: a combination of methods or models is used to solve them.

Ensemble-based learning works on a similar idea. If we want to benefit from the performance of more than one machine learning algorithm, we build a model as a combination of algorithms. A machine learning ensemble consists of a concrete, finite set of alternative models combined in a flexible structure to produce better solutions.

Thus, ensemble methods combine multiple machine learning models to obtain better predictive performance than could be obtained from any of the constituent models alone.

Why ensemble learning:

· Combining predictions of an ensemble is often more accurate than the individual classifiers that make them up.

· The individual classifiers should be accurate and diverse

· An accurate classifier is one that has an error rate better than random guessing

· Uncorrelated errors of individual classifiers can be eliminated by averaging.

Random forest:

Let us think about what happens when there are several decision trees contributing to the final result. In this case we consider what the maximum number of trees are voting for. For example, when 7 trees out of 10 say 'Yes' and the other 3 say 'No', we take the final result to be 'Yes'. In this manner, majority voting leads to better accuracy in the final result.

The main aim of constructing a random forest is to arrive at better accuracy in the predictions. When a machine learning method uses a group of other models, or repeats a process several times, it is often described as an ensemble or iterative model; random forest is such a model, since it involves a group of decision trees in arriving at the final result. In the case of random forest, how the data is distributed between the various models is the main concern.

Let us discuss the first one:

Bootstrapping:

Let us imagine a dataset that contains a group of rows. Each row contains some columns, from which we take only those columns that contribute to our analysis. When we represent the relation between these columns on a graph, it is shown as a data point. Thus a dataset contains many data points.

Suppose we want to create a subset of the data points (samples) from the main set of data points. How can we do this?

Suppose we want to create a subset with 5 data points.

We have to collect 5 data points from the main set of data and put them into the subset. This can be done in two ways.

· In the first way, we can actually 'remove' the data points from the main set and put them into the subset. That means, when a data point enters the subset, it is removed from the main set and is no longer available there.

· The data points removed from the main set are not replaced by any other data. This is called creating the subset without replacement.

· The data points which were moved into subset 1 were removed from the main dataset and hence they cannot appear again in subset 2.

There is another way of creating a subset of data points. Here, we do not actually remove the data points from the main dataset. We copy the data points from the main dataset and put them into a subset. That means the original data points are still available in the main dataset, and only their copies are used in the creation of the subsets.

In this case, the same data points can be used to create various subsets of data.

The first subset is created by copying 5 data points from the main set. After creating the subset, the same data points are still available in the main set. Hence, they can be used either fully or partially in creating the second subset. This is known as creating the subset with replacement of data.

It is possible to create subsets of data from the main dataset. This can be done either with replacement or without replacement. This process is called 'bootstrapping': creating subsets from the main set of data is called 'bootstrapping'. The following are the steps in bootstrapping:

1. We decide the size of the subset to be created.

2. We create the subset with or without replacement of data.

3. We can repeat the above steps to create several subsets.

These subsets of data can be fed to the various machine learning models to observe their outputs.

E.g.: we have several decision trees in the random forest machine learning model. Each decision tree needs a subset of data on which it will act and provide a result. The results obtained from all the decision trees are weighed to arrive at the final conclusion.

Through bootstrapping, it is possible to create various subsets of data. Each subset of data is fed to one decision tree in the random forest, so every tree receives different data. Bootstrapping is therefore the technique used to create the subsets of data that are used by the decision trees in the random forest model.
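
A minimal numpy sketch of bootstrapping (sampling with replacement); the dataset here is illustrative:

import numpy as np

rng = np.random.default_rng(0)
data = np.array([10, 20, 30, 40, 50, 60, 70, 80])    # the 'main set' of data points

# Create 3 bootstrap subsets of size 5, sampling with replacement
for i in range(3):
    subset = rng.choice(data, size=5, replace=True)
    print(f"bootstrap subset {i + 1}: {subset}")
# Some values may appear more than once in a subset and others not at all;
# each such subset could be fed to one tree of a random forest.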

#train the random forest on the scikit digits dataset and check

#if the model is correctly predicting the handwritten digits

from sklearn.datasets import load_digits

digits=load_digits()

#see the column names in the dataset

dir(digits)

#digits.images--->array of images. each image is of 8x8 pixels

#digits.data--->array of data related to images. Each array is of 64 values

#digits.target--->actual digit representing the image

#display the first 10 digits images

import matplotlib.pyplot as plt

plt.gray() #show in gray color

for i in range(10):
    plt.matshow(digits.images[i])



#create dataframe with all data

import pandas as pd

df=pd.DataFrame(digits.data)

df.head()

#in the above output, the 1st row is the digit 0 and the 2nd row is the digit 1

#add target data to the dataframe

df['target']=digits.target

df.head()

#take input data as x and target data as y

x=df.drop(['target'],axis='columns')

y=df['target']

#split the data as train and test data

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)

#the word ensemble indicates using multiple algorithms(Decision trees) to predict output

#default number of random trees=n_estimators=100

from sklearn.ensemble import RandomForestClassifier

model=RandomForestClassifier()

model.fit(x_train,y_train)

#what is the score with 100 trees?

model.score(x_test,y_test)

#what is the score with 50 trees?


model=RandomForestClassifier(n_estimators=50)

model.fit(x_train,y_train)

model.score(x_test,y_test)

#make Prediction

#find out the hand written digit contained in 12th row in data

model.predict([digits.data[12]])

#display its image to verify

plt.matshow(digits.images[12])

#match with the target that shows original digit

print(digits.target[12])

Boosting:
It is one of the most popular ensemble methods. Here a collection of very poor learners, each performing only just better than chance, is put together to make an ensemble learner that can perform arbitrarily well. The principal algorithm of boosting is named AdaBoost (Adaptive Boosting). The algorithm was proposed as an improvement on the original 1990 boosting algorithm, which was rather data hungry. In that algorithm, the training set was split into three. A classifier was trained on the first third, and then tested on the second third. All of the data that was misclassified during that testing was used to form a new dataset, along with an equally sized random selection of the data that was correctly classified. A second classifier was trained on this new dataset, and then both of the classifiers were tested on the final third of the dataset. If they both produced the same output, then that data point was ignored; otherwise the data point was added to yet another new dataset, which formed the training set for a third classifier. There are various sorts of boosting algorithms that can be employed in machine learning. A few of the most well-known are AdaBoost, Gradient Boosting, Stochastic Gradient Boosting, Linear Programming Boosting, and TotalBoost.

AdaBoost:

The innovation that AdaBoost uses is to give weights to each data point according to how difficult previous classifiers have found it to classify correctly. These weights are given to the classifier as part of the input when it is trained. At each iteration a new classifier is trained on the training set, with the weights that are applied to the training set for each data point being modified at each iteration according to how successfully that data point has been classified in the past. The weights are initially all set to the same value, 1/N, where N is the number of data points in the training set. Then, at each iteration, the error (ϵ) is computed as the sum of the weights of the misclassified points, and the weights for the incorrect examples are updated by being multiplied by α = (1 − ϵ)/ϵ. Weights for correct examples are left alone, and then the whole set is normalized so that it sums to 1 (which is effectively a reduction in the importance of the correctly classified data points). Training terminates after a set number of iterations, or when either all of the data points are classified correctly, or one point contains more than half of the available weight.
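
A minimal sketch of this weighting loop, using decision stumps from scikit-learn as the weak learners (a simplified illustration of the update described above, not a complete AdaBoost implementation):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_adaboost(X, y, n_rounds=10):
    # Illustrative AdaBoost-style loop: reweight the points each stump gets wrong
    N = len(y)
    weights = np.full(N, 1.0 / N)          # all weights start equal at 1/N
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        wrong = stump.predict(X) != y
        eps = weights[wrong].sum()         # error = sum of weights of misclassified points
        if eps == 0 or eps >= 0.5:         # perfect, or no better than chance: stop
            learners.append(stump)
            alphas.append(1.0)
            break
        alpha = (1 - eps) / eps            # factor applied to the misclassified points
        weights[wrong] *= alpha
        weights /= weights.sum()           # renormalize so the weights sum to 1
        learners.append(stump)
        alphas.append(np.log(alpha))       # one common choice of voting weight
    return learners, alphas

# Usage sketch on a toy dataset
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 0, 1])
learners, alphas = simple_adaboost(X, y, n_rounds=5)
print(len(learners), "weak learners trained")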



Stumping:
There is a very extreme form of boosting that is applied to trees. The stump of a tree is the tiny piece that is left
over when you chop off the rest, and the same is true here: stumping consists of simply taking the root of the
tree and using that as the decision maker. So for each classifier you use the very first question that makes up
the root of the tree, and that is it. By using the weights to sort out when that classifier should be used, and to
what extent, as opposed to the other ones, the overall output of stumping can be very successful.

Bagging:
The simplest method of combining classifiers is known as bagging, which stands for bootstrap aggregating.
A bootstrap sample is a sample taken from the original dataset with replacement, so that we may get some data
several times and others not at all. A bootstrap dataset is a random sample of the original dataset, created by
sampling with replacement. This means that some samples from the original dataset can appear multiple times
in the bootstrap sample, while others might not appear at all. Having taken a set of bootstrap samples, the
bagging method simply requires that we fit a model to each dataset, and then combine them by taking the
output to be the majority vote of all the classifiers.
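
A minimal sketch of bagging: fit one classifier per bootstrap sample and combine them by majority vote (the dataset and the choice of base model are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy binary labels

n_models = 15
models = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
    model = DecisionTreeClassifier(max_depth=3)
    model.fit(X[idx], y[idx])
    models.append(model)

# Combine by majority vote over the individual predictions (binary labels,
# so "more than half predicted 1" is the majority vote)
all_preds = np.array([m.predict(X) for m in models])
majority = (all_preds.mean(axis=0) > 0.5).astype(int)
print("bagged training accuracy:", (majority == y).mean())

scikit-learn's BaggingClassifier implements the same idea, and RandomForestClassifier adds random feature selection on top of it.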



Example of a Bootstrap Dataset
Suppose we have a small dataset with 5 samples:

The bootstrap dataset would look like this:

Sample 3 appears twice in the bootstrap dataset, while Sample 4 does not appear at all.

Comparison with Boosting:

Boosting is exhaustive: it searches over the whole set of features at each stage, and each stage depends on the previous one. Boosting therefore has to run sequentially, and the individual steps can be expensive to run. By way of contrast, the parallelism of the random forest and the fact that it only searches over a fairly small set of features at each stage speed the algorithm up a lot. Since the algorithm only searches a small subset of the data at each stage, it cannot be expected to be as good as boosting for the same number of trees. However, since the trees are cheaper to train, we can make more of them in the same computational time, and often the results are amazingly good even on very large and complicated datasets. The most amazing thing about random forests is that they seem to deal very well with really big datasets. It is fairly clear that they should do well computationally, since both the reduced number of features to search over and the ability to parallelize should help there. However, they also seem to produce good outputs based on the surprisingly small parts of the problem space seen by each tree.

DIFFERENT WAYS TO COMBINE CLASSIFIERS:


Bagging puts most of its effort into ensuring that the different classifiers see different data, since they each see a different sample of the data. In boosting, the data stays the same, but the importance of each data point changes for the different classifiers, since the points get different weights according to how well the previous classifiers have performed. For an ensemble method, it is important how it combines the outputs of the different classifiers. Both boosting and bagging take a vote from amongst the classifiers, although they do it in different ways: boosting takes a weighted vote, while bagging simply takes the majority vote. If the number of classifiers is odd and the classifiers are each independent of each other, then majority voting will return the correct label if more than half of the classifiers agree. Assuming that each individual classifier has a success rate of p, the probability of the ensemble getting the correct answer (more than half of the T trials resulting in success, i.e., k correct classifiers with k > T/2) is given by the binomial sum

P(ensemble correct) = Σ (k > T/2) C(T, k) p^k (1 - p)^(T - k)

where T is the number of classifiers. This sum is the probability of having more than half of the trials result in success. If p > 0.5, then this sum approaches 1 as T → ∞. This is a lot of the power behind ensemble methods: even if each classifier only gets about half the answers right, if we use a decent number of classifiers (maybe 100), then the probability of the ensemble being correct gets close to 1. In fact, even with less than a 50% chance of success for each individual classifier, the ensemble can often do very well indeed. A small numerical check of this sum is sketched below.
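
A quick check of the binomial sum above (a sketch; the values of p and T are illustrative):

from math import comb

def ensemble_accuracy(p, T):
    # Probability that more than half of T independent classifiers are correct
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(T // 2 + 1, T + 1))

for T in (1, 11, 101):
    print(T, round(ensemble_accuracy(0.6, T), 3))
# With p = 0.6 the ensemble accuracy climbs towards 1 as T grows:
# roughly 0.6 for T = 1, 0.75 for T = 11, and 0.98 for T = 101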
For regression problems, rather than taking the majority vote, it is common to take the mean of the outputs. However, the mean is heavily affected by outliers, with the result that the median is a more common average to use. It is the use of the median that produces the bragging algorithm, whose name is meant to imply 'robust bagging'. There is also an algorithm that combines the classifier outputs through learned weights, known as the mixture of experts. Inputs are presented to the network, and each individual classifier makes an assessment. These outputs from the classifiers are then weighted by the relevant gate, which produces a weight w using the current inputs, and this is propagated further up the hierarchy. The most common version of the mixture of experts works as follows:



BASIC STATISTICS:

1. Averages:
The mean is the most commonly used average of a set of data, and is the value that is found by adding up
all the points in the dataset and dividing by the number of points. There are two other averages that are
used: the median and the mode. The median is the middle value, so the most common way to find it is to
sort the dataset according to size and then find the point that is in the middle. If there is an even number of
data points then there is no exact middle, so take the value halfway between the two points that are closest
to the middle. The mode is the most common value, so it just requires counting how many times each
element appears and picking the most frequent one.

2. Variance and Covariance:


The variance of a set of numbers is a measure of how spread out the values are. It is computed from the squared distances between each element in the set and the mean value of the set:

var({xi}) = (1/N) Σ (i = 1 to N) (xi - μ)²

The square root of the variance is known as the standard deviation. The variance looks at the variation in one variable compared to its mean. This can be generalized to look at how two variables vary together, which is known as the covariance. It is a measure of how dependent the two variables are, and it is computed by:

cov({xi}, {yi}) = (1/N) Σ (i = 1 to N) (xi - μ)(yi - ν)

where μ is the mean of the set {xi} and ν is the mean of the set {yi}.

If two variables are independent, then the covariance is 0 (the variables are then said to be uncorrelated). If they both increase and decrease at the same time, then the covariance is positive, and if one goes up while the other goes down, then the covariance is negative. The covariance can be used to look at the correlation between all pairs of variables within a set of data. We need to compute the covariance of each pair, and these are then put together into what is known as the covariance matrix.
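
A short numpy illustration of these quantities (the data values are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.9])    # increases with x, so cov(x, y) is positive

print(np.mean(x), np.median(x))            # mean and median of x
print(np.var(x), np.std(x))                # variance and standard deviation of x
print(np.cov(x, y))                        # 2x2 covariance matrix of the pair
# The diagonal holds the variances and the off-diagonal entries are cov(x, y)
# (np.cov uses the N-1 'sample' normalization by default).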



GAUSSIAN MIXTURE MODELS:
Consider the two datasets shown in Figure 2.13 and the test point (labelled by the large 'X' in the figures), and ask whether the 'X' is part of the data. For the figure on the left the answer would probably be yes, while for the figure on the right it would be no, even though the two points are the same distance from the centre of the data. The reason for this is how the test point lies in relation to the spread of the actual data points. If the data is tightly controlled then the test point has to be close to the mean, while if the data is very spread out, then the distance of the test point from the mean does not matter as much.

This can be used to construct a distance measure called the Mahalanobis distance, after the person who described it in 1936, and it is written as:

D_M(x) = sqrt((x - μ)^T Σ⁻¹ (x - μ))

where x is the data arranged as a column vector, μ is a column vector representing the mean, and Σ⁻¹ is the inverse of the covariance matrix.

If the covariance matrix is set to the identity matrix, then the Mahalanobis distance reduces to the Euclidean distance. Computing the Mahalanobis distance requires some heavy computational machinery: computing the covariance matrix and then its inverse. Fortunately, numpy provides a function that estimates the covariance matrix of a dataset (np.cov(x) for a data matrix x), and the inverse is computed by np.linalg.inv(x). The inverse does not have to exist in all cases.

Consider a probability distribution, which describes the probabilities of something occurring over the range of possible feature values. The most well-known probability distribution is the Gaussian or normal distribution. In one dimension it has the familiar 'bell-shaped' curve shown in Figure 2.14, and its equation in one dimension is:

p(x) = (1 / sqrt(2πσ²)) exp(-(x - μ)² / (2σ²))

where μ is the mean and σ the standard deviation.

The Gaussian distribution turns up in many problems because of the Central Limit Theorem, which says that lots of small random numbers will add up to something Gaussian. In higher dimensions it looks like:

p(x) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(-(1/2)(x - μ)^T Σ⁻¹ (x - μ))

where Σ is the n × n covariance matrix (with |Σ| being its determinant and Σ⁻¹ being its inverse). Figure 2.15 shows the appearance in two dimensions of three different cases: when the covariance matrix is the identity; when there are only numbers on the leading diagonal of the matrix; and the general case. The first case is known as a spherical covariance matrix, and has only one parameter. The second and third cases define ellipses in two dimensions, either aligned with the axes (with n parameters) or more generally, with n² parameters.



GAUSSIAN MIXTURE MODELS
If the data has target labels, this is supervised learning, and the probabilities can be learned from the labelled data. Now suppose we have the same data, but without target labels. This requires unsupervised learning. Suppose that the different classes each come from their own Gaussian distribution. This is known as multi-modal data, since there is one distribution (mode) for each different class. If we know how many classes there are in the data, then we can try to estimate the parameters for that many Gaussians, all at once. The output for any particular data point that is input to the algorithm will then be the sum of the values expected by all of the M Gaussians:

p(x) = Σ (m = 1 to M) αm φ(x; μm, Σm)

where φ(x; μm, Σm) is a Gaussian function with mean μm and covariance matrix Σm, and the αm are weights with the constraint that Σ (m = 1 to M) αm = 1.



The figure shows two examples, where the data (shown by the histograms) comes from two different Gaussians, and the model is computed as a sum or mixture of the two Gaussians together. The probability that input xi belongs to class m can be written as (where a hat on a variable (ˆ) means that we are estimating the value of that variable):

The problem is how to choose the weights αm. The common approach is to aim for the maximum likelihood solution. The likelihood is the conditional probability of the data given the model, and the maximum likelihood solution varies the model to maximize this conditional probability. In fact, it is common to compute the log likelihood and then to maximize that; it is guaranteed to be negative, since probabilities are all less than 1, and the logarithm spreads out the values, making the optimization more effective. The algorithm that is used is an example of a very general one known as the expectation-maximization (or, more compactly, EM) algorithm.
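
A short sketch of fitting such a mixture with scikit-learn's GaussianMixture, which uses the EM algorithm internally (the two-cluster data here is synthetic):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# One-dimensional data drawn from two different Gaussians
data = np.concatenate([rng.normal(-2.0, 0.5, 200),
                       rng.normal(3.0, 1.0, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(data)

print("weights (alpha_m):", gmm.weights_)        # mixing coefficients, they sum to 1
print("means (mu_m):", gmm.means_.ravel())       # estimates close to -2 and 3
print("responsibilities for x = 0:", gmm.predict_proba([[0.0]]))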

The Expectation-Maximization (EM) Algorithm:


The basic idea of the EM algorithm is that sometimes it is easier to add extra variables that are not actually known, called hidden or latent variables, and then to maximize the function over those variables. Consider the simplest interesting case of the Gaussian mixture model: a combination of just two Gaussians. Suppose the data were created by randomly choosing one of two possible Gaussians and then creating a sample from that Gaussian. If the probability of picking Gaussian one is p, then the entire model looks like this (where N(μ, σ²) specifies a Gaussian distribution with mean μ and standard deviation σ):

P(x) = p · N(x; μ1, σ1²) + (1 − p) · N(x; μ2, σ2²)



The maximum likelihood solution can be obtained by summing the logarithm of Equation (7.4) over all of the training data and differentiating the result. Although we don't know which component each data point came from, we can pretend we do, by introducing a new variable f.
If f = 0 then the data came from Gaussian one, if f = 1 then it came from Gaussian two. This is the
typical initial step of an EM algorithm: adding latent variables. Now we need to optimize over them.
The value of the variable f is unknown but we can compute its expectation from the data:

Where, D denotes the data. Note that since we have set f = 1 this means that we are choosing Gaussian
two.
Computing the value of this expectation is known as the E-step. Then this estimate of the expectation
is maximized over the model parameters (the parameters of the two Gaussians and the mixing
parameter), the M-step. This requires differentiating the expectation with respect to each of the model
parameters. These two steps are simply iterated until the algorithm converges.

There are two such information criteria that are commonly used to identify how well we can expect the trained model to perform:
• the Akaike Information Criterion (AIC)
• the Bayesian Information Criterion (BIC)
In these criteria, k is the number of parameters in the model, N is the number of training examples, and L is the best (largest) likelihood of the model. In both cases, based on the way that they are written here, the model with the largest value is taken. Both of the measures favor simple models, which is a form of Occam's razor.

Nearest Neighbor Methods:


A test point is positioned within the input space, and the training data closest to it have to be found. This requires computing the distance to each data point in the training set, which is relatively expensive. We then identify the k nearest neighbors to the test point, and set the class of the test point to be the most common class among those nearest neighbors. The choice of k is not trivial: if it is too small, nearest neighbor methods are sensitive to noise; if it is too large, the accuracy reduces as points that are too far away are considered. This method also suffers from the curse of dimensionality. The computational cost grows as the number of dimensions grows, and as the number of dimensions increases, the distance to other data points tends to increase. In addition, they can be far away in a variety of different directions: there might be points that are relatively close in some dimensions, but a long way away in others. Methods for dealing with these problems are known as adaptive nearest neighbor methods.



For the k-nearest neighbor algorithm, the bias-variance decomposition can be computed as

E[(y0 - f̂(x0))²] = σ² + [f(x0) - (1/k) Σ (l = 1 to k) f(x(l))]² + σ²/k

The way to interpret this is that when k is small, so that few neighbors are considered, the model has flexibility and can represent the underlying function well, but it makes mistakes (has high variance) because there is relatively little data in each neighborhood. As k increases, the variance (the σ²/k term) decreases, but at the cost of less flexibility and therefore more bias.

K-Nearest Neighbor (KNN) Algorithm:
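
A minimal sketch of k-nearest-neighbor classification (brute-force Euclidean distances and a majority vote; the data is made up for illustration):

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_point, k=3):
    # Classify test_point by a majority vote among its k nearest training points
    distances = np.sqrt(((train_X - test_point) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(distances)[:k]                              # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2D dataset with two classes
train_X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
train_y = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(train_X, train_y, np.array([2, 2]), k=3))   # 'A'
print(knn_predict(train_X, train_y, np.array([6, 5]), k=3))   # 'B'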

Nearest Neighbor Smoothing:


Nearest neighbor methods can also be used for regression by returning the average value of the neighbors to a point, or a spline or similar fit, as the new value. The most common methods are known as kernel smoothers, and they use a kernel, i.e., a weighting function between pairs of points that decides how much emphasis (weight) to put onto the contribution from each data point according to its distance from the input. Both of the kernels below are designed to give more weight to points that are closer to the current input, with the weights decreasing smoothly to zero as they pass out of the range of the current input, where the range is specified by a parameter λ. They are

1. the Epanechnikov quadratic kernel:

K_λ(x0, x) = D(|x - x0| / λ), with D(t) = (3/4)(1 - t²) if |t| ≤ 1, and 0 otherwise

2. and the Tricube kernel:

K_λ(x0, x) = D(|x - x0| / λ), with D(t) = (1 - |t|³)³ if |t| ≤ 1, and 0 otherwise

Efficient Distance Computations: the KD-Tree:


Computing the distances between all pairs of points is very computationally expensive. For the problem of finding nearest neighbors, the data structure of choice is the KD-tree. It has been around since the late 1970s, when it was devised by Friedman and Bentley, and it reduces the cost of finding a nearest neighbor to O(log N) for O(N) storage. The idea behind the KD-tree is very simple. Create a binary tree by choosing one dimension at a time to split into two, and placing the line through the median of the point coordinates of that dimension. The points themselves end up as leaves of the tree. Making the tree follows the same steps as usual for constructing a binary tree: identify a place to split into two choices, left and right, and then carry on down the tree. This makes it natural to write the algorithm recursively.
The choice of what to split and where is what makes the KD-tree special. Just one dimension is split in
each step, and the position of the split is found by computing the median of the points that are to be
split in that one dimension, and putting the line there. In general, the choice of which dimension to
split alternates through the different choices, or it can be made randomly. The algorithm below cycles
through the possible dimensions based on the depth of the tree so far, so that in two dimensions it
alternates horizontal and vertical splits. Suppose that we had seven two-dimensional points to make a
tree from: (5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8) (as plotted in Figure 7.5). The algorithm will
pick the first coordinate to split on initially, and the median point here is 5, so the split is through x =
5. Of those on the left of the line, the median y coordinate is 6, and for those on the right it is 5. At this
point we have separated all the points, and so the algorithm terminates with the split shown in Figure
7.6 and the tree shown in Figure 7.7.

Searching the tree is the same as for any other binary tree; here we are more interested in finding the nearest neighbors of a test point. This is fairly easy: starting at the root of the tree you recurse down through the tree, comparing just one dimension at a time, until you find a leaf node that is in the region containing the test point. Using the tree shown in Figure 7.7 we introduce the test point (3, 5), which finds (2, 2) as the leaf for the box that (3, 5) is in. However, looking at Figure 7.8 we see that this is not the closest point at all.

The first thing to be done is to label the leaf that was found as a potential nearest neighbor, and to compute the distance between the test point and this point, since any closer point has to be within that distance. Now, check any other boxes that could contain something closer. From Figure 7.8 it can be observed that point (3, 7) is closer, and that is the label of the leaf for the sibling box to the one that was returned, so the algorithm also needs to check the sibling box. However, suppose that (4.5, 2) is used as the test point. In that case the sibling is too far away, but another point (6, 1) is closer. So just checking the sibling is not enough: the siblings of the parent node must also be checked, together with their descendants.
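
A short sketch using scipy's KDTree on the example points from the text; the library's query handles these sibling checks during the nearest-neighbor search automatically:

import numpy as np
from scipy.spatial import KDTree

points = np.array([[5, 4], [1, 6], [6, 1], [7, 5], [2, 7], [2, 2], [5, 8]])
tree = KDTree(points)

for test in ([3, 5], [4.5, 2]):
    dist, idx = tree.query(test)          # nearest-neighbor query
    print(f"nearest to {test}: {points[idx]} at distance {dist:.3f}")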

Distance Measures:
The most common measure of distance between points is the Euclidean distance,

d_E(x, y) = sqrt( Σ (i = 1 to d) (xi - yi)² )

Mathematically, these distance measures are known as metrics. A metric function or norm takes two inputs and gives a scalar (the distance) back, which is positive, and 0 if and only if the two points are the same; it is symmetric and obeys the triangle inequality, i.e., the distance from a to b plus the distance from b to c should not be less than the direct distance from a to c. The Euclidean and city-block (Manhattan) distances are both instances of a class of metrics that work in any number of dimensions. The general measure is the Minkowski metric, and it is written as:

L_k(x, y) = ( Σ (i = 1 to d) |xi - yi|^k )^(1/k)

If k = 1 then we get the city-block distance, and k = 2 gives the Euclidean distance. The Euclidean metric is written as the L2 norm and the city-block distance as the L1 norm. These norms can be used to define different averages of a set of numbers. If we define the average as the point that minimizes the sum of the distances to every data point, then it turns out that the mean minimizes the Euclidean distance (the sum-of-squares distance), and the median minimizes the L1 metric. A common invariant metric in use for images is the tangent distance, which is an approximation to the Taylor expansion in first derivatives, and works very well for small rotations and scalings.

Unsupervised Learning:
Many of the learning algorithms that we have seen till now have made use of a training set that
consists of a collection of labelled target data. Targets are obviously useful, since they enable us to
show the algorithm the correct answer to possible inputs, but in many circumstances they are difficult
to obtain—they could, for instance, involve somebody labelling each instance by hand.
In addition, it doesn’t seem to be very biologically plausible: most of the time when we are learning,
we don’t get told exactly what the right answer should be. In this chapter we will consider exactly the
opposite case, where there is no information about the correct outputs available at all, and the
algorithm is left to spot some similarity between different inputs for itself. Unsupervised learning is a
conceptually different problem to supervised learning. If the algorithm can exploit similarities
between inputs in order to cluster inputs that are similar together, this might perform classification
automatically. So the aim of unsupervised learning is to find clusters of similar inputs in the data
without being explicitly told that these data points belong to one class and those to a different class.
Instead, the algorithm has to discover the similarities for itself. The supervised learning algorithms
that we have discussed so far have aimed to minimize some external error criterion—mostly the sum-
of-squares error—based on the difference between the targets and the outputs.

Calculating and minimizing this error was possible because we had target data to calculate it from, which is not true for unsupervised learning. If two inputs are close together then it means that their vectors are similar, and so the distance between them is small (distance measures were discussed in Section 7.2.3, but here we will stick to the Euclidean distance). Inputs that are close together are then identified as being similar, so that they can be clustered, while inputs that are far apart are not clustered together. We can extend this to the nodes of a network by aligning weight space with input space. Now if the weight values of a node are similar to the elements of an input vector, then that node should be a good match for the input, and for any other inputs that are similar to it.

Dealing with Noise:


The most common reason to use clustering is to deal with noisy data readings. If the clusters are
chosen correctly, then the noise is effectively removed, because each noisy data point is replaced by
the cluster center. Unfortunately, the mean average, which is central to the k-means algorithm, is very
susceptible to outliers, i.e., very noisy measurements. One way to avoid the problem is to replace the
mean average with the median, which is what is known as a robust statistic, meaning that it is not
affected by outliers (the mean of (1, 2, 1, 2, 100) is 21.2, while the median is 2). The only change that
is needed to the algorithm is to replace the computation of the mean with the computation of the
median.
