• A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.

What are Decision Trees?
• A decision tree is a tree-like structure that is used as a model for classifying data. A decision tree decomposes the data into sub-trees made of other sub-trees and/or leaf nodes.
• A decision tree is made up of three types of nodes:
– Decision Nodes: these nodes have two or more branches.
– Root Node: also a decision node, but at the topmost level.
– Leaf Nodes: the lowest nodes, which represent the final decisions.
• Topics: the ID3 Algorithm; Entropy and Information Gain.
• Consider the table below. It represents factors that affect whether John would go out to play golf or not (the attributes are Outlook, Temperature, Humidity and Windy, and the decision column is Play Golf). Using the data in the table, build a decision tree with the ID3 algorithm that can predict whether John will play golf.

Step by Step Procedure for Building a Decision Tree (ID3 Algorithm)
Step 1: Determine the Decision Column
• Since decision trees are used for classification, you need to determine the classes that form the basis for the decision. In this case it is the last column, the Play Golf column, with classes Yes and No.
• To determine the root node we need to compute the entropy. To do this, we create a frequency table for the classes (the Yes/No column).

Step 2: Calculate the Entropy for the Classes (Play Golf)
• In this step, you calculate the entropy of the Play Golf column. With 9 Yes and 5 No out of 14 records:

Entropy(Play Golf) = E(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

Step 3: Calculate the Entropy for the Other Attributes After the Split
• For the other four attributes, we need to calculate the entropy after each split:
– E(Play Golf, Outlook)
– E(Play Golf, Temperature)
– E(Play Golf, Humidity)
– E(Play Golf, Windy)
• The entropy for two variables is calculated using the formula:

E(T, X) = Σ over c in X of P(c) · E(c)

where T is the target (Play Golf), X is the splitting attribute, and E(c) is the entropy of the classes within the subset where X = c.
• Therefore, to calculate E(Play Golf, Outlook), we would use the formula above, which instantiated for Outlook is:

E(Play Golf, Outlook) = P(Sunny)·E(3, 2) + P(Overcast)·E(4, 0) + P(Rainy)·E(2, 3)

• The easiest way to approach this calculation is to create a frequency table for the two variables, Play Golf and Outlook.
• This frequency table is given below:
Frequency Table for Outlook

Outlook     Yes   No   Total
Sunny        3     2     5
Overcast     4     0     4
Rainy        2     3     5
Total        9     5    14
• Using this table, we can then calculate E(Play Golf, Outlook) by substituting the values E(Sunny) = E(3, 2) = 0.971, E(Overcast) = E(4, 0) = 0 and E(Rainy) = E(2, 3) = 0.971 into the equation:

E(Play Golf, Outlook) = P(Sunny)·E(3, 2) + P(Overcast)·E(4, 0) + P(Rainy)·E(2, 3)
                      = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971)
                      = 0.693
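These entropy values are easy to check in code. The following is a minimal Python sketch; the counts come from the frequency table above, and the helper name is our own:

import math

def entropy(*counts):
    """Entropy E(a, b, ...) of a class distribution, given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Entropy of the Play Golf column: 9 Yes, 5 No
print(entropy(9, 5))            # ≈ 0.940

# Entropy after splitting on Outlook: weighted sum over Sunny, Overcast, Rainy
e_outlook = (5/14) * entropy(3, 2) + (4/14) * entropy(4, 0) + (5/14) * entropy(2, 3)
print(e_outlook)                # ≈ 0.693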
E(Play Golf, Temperature) Calculation
• Just like in the previous calculation, the calculation of E(Play Golf, Temperature) is given below. It is easier to do if you first form the frequency table for the split on Temperature, as shown.
Step 4: Calculate the Information Gain for Each Attribute
• The information gain of a split is the drop in entropy it produces:

Gain(T, X) = E(T) - E(T, X)

• Applying this to each attribute:
– Gain(Play Golf, Outlook) = 0.94 - 0.693 = 0.247
– Gain(Play Golf, Temperature) = 0.94 - 0.911 = 0.029
– Gain(Play Golf, Humidity) = 0.94 - 0.788 = 0.152
– Gain(Play Golf, Windy) = 0.94 - 0.892 = 0.048
• Having calculated all the information gains, we now choose the attribute that gives the highest information gain after the split.
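Continuing the sketch from Step 2, the gains can be computed and compared in a few lines. The after-split entropies for Temperature, Humidity and Windy are taken as given above (they come from their frequency tables in the same way as Outlook's):

import math

def entropy(*counts):                          # same helper as in the previous sketch
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_before = entropy(9, 5)                       # ≈ 0.940, entropy before any split

e_after = {
    'Outlook':     (5/14)*entropy(3, 2) + (4/14)*entropy(4, 0) + (5/14)*entropy(2, 3),
    'Temperature': 0.911,
    'Humidity':    0.788,
    'Windy':       0.892,
}

gains = {attr: e_before - e for attr, e in e_after.items()}
print(gains)                                   # Outlook ≈ 0.247 is the highest
print(max(gains, key=gains.get))               # 'Outlook' -> root of the tree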
Step 5: Perform the First Split
Draw the First Split of the Decision Tree
• Now that we have all the information gains, we split the tree on the attribute with the highest information gain.
• From our calculation, the highest information gain comes from Outlook (0.247). Therefore the split will look like this:
• We can see that the Overcast outlook requires no further split, because it is just one homogeneous group (all four records are Yes), so it becomes a leaf node.

Step 6: Perform Further Splits
• The Sunny and the Rainy branches need to be split further.
• The Rainy outlook can be split using either Temperature, Humidity or Windy.
• What attribute would best be used for this split? Why? Answer: Humidity, because it produces homogeneous groups.
• The Rainy branch is therefore split on Humidity, into its High and Normal values, which gives us the tree below.
Split using the Humidity Attribute
• Let's now go ahead and do the same thing for the Sunny outlook. The Sunny outlook can be split using either Temperature, Humidity or Windy.
Quiz 2: What attribute would best be used for this split? Why? Answer: Windy, because it produces homogeneous groups.
Split using Windy Attribute
Step 7: Complete the Decision Tree
• The complete tree is shown in Figure 4.
• Note that the same calculations used initially could also be applied to the further splits, but that is not necessary here: you can simply look at each sub-table and determine by inspection which attribute to use for the split.

Final Decision Tree
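The same tree can also be grown programmatically. Below is a minimal scikit-learn sketch; note that scikit-learn implements CART rather than ID3, but criterion='entropy' makes it split on information gain. The file name play_golf.csv is an assumption; the column names follow this example:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed CSV with the columns used in this example
df = pd.read_csv('play_golf.csv')   # Outlook, Temperature, Humidity, Windy, Play Golf

# One-hot encode the categorical attributes
X = pd.get_dummies(df[['Outlook', 'Temperature', 'Humidity', 'Windy']])
y = df['Play Golf']

tree = DecisionTreeClassifier(criterion='entropy')  # entropy => information gain
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))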
CART (Classification and Regression Trees)
• CART is a predictive algorithm used in machine learning; it explains how the target variable's values can be predicted from the other variables. It is a decision tree in which each fork is a split on a predictor variable and each leaf node holds a prediction for the target variable.
• In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an attribute. The root node is taken as the training set and is split into two by considering the best attribute and threshold value.
• Further, the subsets are split using the same logic. This continues until the last pure subset is found or the maximum allowed number of leaves in the growing tree is reached.
• The CART algorithm works via the following process:
– The best split point of each input is obtained.
– Based on the best split points of each input in Step 1, the new "best" split point is identified.
– Split the chosen input according to the "best" split point.
– Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
• The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does so by searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.
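To make the "best split point" search concrete, the sketch below scans candidate thresholds on a single numeric input and scores each with weighted Gini impurity. This is a simplified, single-feature version of the search CART performs per input; the data values are made up for illustration:

from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try a split at each midpoint and keep the one with the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float('inf')
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical temperatures and play/no-play labels
print(best_threshold([64, 68, 70, 72, 75, 80, 85],
                     ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No']))
# -> (73.5, 0.0): a perfectly pure split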
Gini Index / Gini Impurity
• The Gini index is the metric CART uses for classification tasks. It is based on the sum of squared class probabilities: it measures the probability that a randomly chosen element is wrongly classified, and it is a variation of the Gini coefficient. It works on categorical variables, gives outcomes of either "success" or "failure", and hence performs binary splits only.
• The Gini index is calculated as

Gini = 1 - Σ over i of (p_i)^2

where p_i is the probability of an object being classified to a particular class.
• The value of the Gini index varies from 0 to 1:
– 0 means that all the elements belong to a single class (only one class exists at the node);
– 1 means that the elements are randomly distributed across many classes;
– 0.5 means that the elements are uniformly distributed over two classes.

Classification Tree
• A classification tree is used when the target variable is categorical. The algorithm identifies the "class" within which the target variable is most likely to fall. Classification trees are used when the dataset needs to be split into classes that belong to the response variable (like Yes or No).

Regression Tree
• A regression tree is used when the target variable is continuous, and the tree is used to predict its value, for example when the response variable is the temperature of the day.

Decision for the Rain Outlook
• In the play-golf example, the winner for the rain outlook is the Windy feature, because it has the minimum Gini index score among the candidate features.
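The boundary values of the Gini index listed above are easy to verify numerically; a self-contained check (the class labels are arbitrary illustrations):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(['Yes'] * 4))                  # 0.0   -> all elements in one class
print(gini(['Yes', 'Yes', 'No', 'No']))   # 0.5   -> uniform over two classes
print(gini(list('ABCDEFGH')))             # 0.875 -> approaches 1 as classes spread out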
Ensemble Learning
• Topics: Bagging (Bootstrap Aggregation); Boosting: AdaBoost, Stumping; Random Forests.
• An ensemble method is a technique that combines the predictions from multiple machine learning algorithms to make more accurate predictions than any individual model. A model comprised of many models is called an ensemble.
• Ensemble learning helps improve machine learning results by combining several models. This approach allows better predictive performance than a single model. The basic idea is to learn a set of classifiers and let them vote.
• Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model. This has been the case in a number of machine learning competitions, where the winning solutions used ensemble methods.

Types of Ensemble Learning
• Some ensembles contain the same type of learning algorithm; these are called homogeneous ensembles. Other methods combine different types of learning algorithms; these are called heterogeneous ensembles.
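A heterogeneous ensemble can be sketched with scikit-learn's VotingClassifier, which lets different learner types vote on each prediction. The model choices and the synthetic data are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data for illustration

# Three different learner types voting by majority (hard voting)
ensemble = VotingClassifier([
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier()),
    ('nb', GaussianNB()),
], voting='hard')
ensemble.fit(X, y)
print(ensemble.score(X, y))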
Bagging (Bootstrap Aggregation)
• There are two key ingredients of bagging: one is the bootstrap and the other is aggregation.
• It is a general procedure that can be used to reduce the variance of algorithms that have high variance, typically decision trees. Bagging makes each model run independently and then aggregates the outputs at the end, without preference to any model.
Random forest is a Bagging Technique
• In bagging, we take different random subsets of the dataset and combine them with the help of bootstrap sampling. In detail, given a training data set containing n training records, a sample of m training records is generated by sampling with replacement.
• In bagging, the most popular strategies for aggregating the outputs of the base learners are the majority vote in a classification task and the mean in a regression task.
• In bagging we actually combine several strong learners: all the base models are overfitted models with very high variance, and at aggregation time we simply reduce that variance without affecting the bias, so accuracy may improve.
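A minimal bagging sketch with scikit-learn, using high-variance decision trees as base learners; the parameter values and synthetic data are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data for illustration

# 50 trees, each trained on a bootstrap sample; predictions aggregated by majority vote
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # named base_estimator in scikit-learn < 1.2
    n_estimators=50,
    bootstrap=True,                       # sample with replacement
    random_state=0,
)
bagging.fit(X, y)
print(bagging.score(X, y))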
Boosting
• Boosting is an ensemble modelling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model from weak models in series: first, a model is built from the training data; then a second model is built which tries to correct the errors of the first. This procedure continues, and models are added, until either the complete training data set is predicted correctly or the maximum number of models has been added.
• Boosting is an efficient algorithm that converts a weak learner into a strong learner.
• It uses the concept of weak-learner to strong-learner conversion through weighted average values and higher vote values for prediction.

AdaBoost
• AdaBoost was the first really successful boosting algorithm developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting; it is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier".
• AdaBoost is implemented by combining several weak learners into a single strong learner. The weak learners in AdaBoost consider a single input feature and draw out a single-split decision tree, called a decision stump. Each observation is weighted equally while drawing out the first decision stump.
• The results from the first decision stump are analyzed, and if any observations are wrongly classified, they are assigned higher weights. A new decision stump is drawn by treating the higher-weight observations as more significant. Again, if any observations are misclassified, they are given a higher weight, and this process continues until all the observations fall into the right class.
• AdaBoost can be used for both classification and regression problems; however, it is more commonly used for classification.
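A short AdaBoost sketch with scikit-learn: max_depth=1 gives the decision stumps described above, and the other parameter values and synthetic data are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data for illustration

# Decision stumps (single-split trees) boosted sequentially,
# with misclassified samples up-weighted at each round
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # base_estimator in scikit-learn < 1.2
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X, y)
print(ada.score(X, y))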
Gradient Boosting
• Gradient Boosting is also based on sequential ensemble learning. Here the base learners are generated sequentially, so that the present base learner is always more effective than the previous one, i.e., the overall model improves with each iteration.
• The difference in this boosting type is that the weights of misclassified outcomes are not incremented. Instead, gradient boosting tries to optimize the loss function of the previous learner by adding a new model that adds weak learners to reduce the loss.
• The main idea is to overcome the errors of the previous learner's predictions. This form of boosting has three main components:
– Loss function: the choice of loss function depends on the type of problem. An advantage of gradient boosting is that no new boosting algorithm is needed for each loss function.
– Weak learner: in gradient boosting, decision trees are used as the weak learners. A regression tree is used to output real values, which can be combined to form correct predictions. As in the AdaBoost algorithm, small trees with a single split (decision stumps) can be used; larger trees, e.g. with 4-8 levels, are used for harder problems.
– Additive model: trees are added one at a time, and existing trees are left unchanged. As trees are added, gradient descent is used to minimize the loss function.
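A gradient boosting sketch with scikit-learn, reflecting the three components above; the parameter values and synthetic data are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data for illustration

# Shallow regression trees added one at a time; each new tree
# fits the gradient of the loss of the current ensemble
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,        # small trees, as described above
    random_state=0,
)
gb.fit(X, y)
print(gb.score(X, y))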
Random Forest
• Random forest is a supervised learning algorithm which is used for both classification and regression; however, it is mainly used for classification problems.
• As we know, a forest is made up of trees, and more trees means a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by voting (or by the mean).
• It is an ensemble method which is better than a single decision tree because it reduces over-fitting by averaging the results.
• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It combines multiple classifiers to solve a complex problem and to improve the performance of the model.
• Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset.

Working of the Random Forest Algorithm
We can understand the working of the Random Forest algorithm with the help of the following steps:
• Step 1 − First, start with the selection of random samples from a given dataset.
• Step 2 − Next, the algorithm constructs a decision tree for every sample, then gets the prediction result from every decision tree.
• Step 3 − In this step, voting is performed over every predicted result.
• Step 4 − At last, select the most voted prediction result as the final prediction.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
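The listing above stops at feature scaling. A minimal sketch of the remaining fit-and-predict steps, continuing with the same variable names (the hyperparameter values are assumptions):

# fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

# predicting the test-set results and measuring accuracy
from sklearn.metrics import accuracy_score
y_pred = classifier.predict(x_test)
print(accuracy_score(y_test, y_pred))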