MACHINE LEARNING AND DATA ANALYTICS USING PYTHON LAB

Machine learning notes for MCA students.

INDEX

SR.NO   TOPIC

1   Design and evaluate a data model using Linear Regression
2   Design and evaluate a data model using Logistic Regression
3   Design and evaluate a data model using KNN
4   Design and evaluate a data model using K Means Clustering
5   Design and evaluate a data model using SVM
6   Design and evaluate a data model using PCA
7   Design and evaluate a data model using Decision Trees
8   Design and evaluate a data model using Random Forest
9   Compare the performance of all the above ML techniques on a similar data set using matplotlib
PRACTICAL-1
Design and evaluate a data model using Linear Regression

Linear regression is a statistical method that is used to predict a continuous dependent variable (target variable) based on one or more independent variables (predictor variables). This technique assumes a linear relationship between the dependent and independent variables, which implies that the dependent variable changes proportionally with changes in the independent variables. In other words, linear regression is used to determine the extent to which one or more variables can predict the value of the dependent variable.

Assumptions We Make in a Linear Regression Model:

Given below are the basic assumptions that a linear regression model makes regarding a dataset on
which it is applied:

 Linear relationship: The relationship between response and feature variables should be linear.
The linearity assumption can be tested using scatter plots. As shown below, 1st figure represents
linearly related variables whereas variables in the 2nd and 3rd figures are most likely non-linear.
So, 1st figure will give better predictions using linear regression.

Linear relationship in the feature space

 Little or no multi-collinearity: It is assumed that there is little or no multicollinearity in the data.


Multicollinearity occurs when the features (or independent variables) are not independent of
each other.

2|Page
 Little or no autocorrelation: Another assumption is that there is little or no autocorrelation in the
data. Autocorrelation occurs when the residual errors are not independent of each other.

 No outliers: We assume that there are no outliers in the data. Outliers are data points that are far
away from the rest of the data. Outliers can affect the results of the analysis.

 Homoscedasticity: Homoscedasticity describes a situation in which the error term (that is, the
“noise” or random disturbance in the relationship between the independent variables and the
dependent variable) is the same across all values of the independent variables. As shown below,
figure 1 has homoscedasticity while Figure 2 has heteroscedasticity.

Homoscedasticity in Linear Regression

Having covered the assumptions, we now look at the main types of linear regression.

Types of Linear Regression

There are two main types of linear regression:

 Simple linear regression: This involves predicting a dependent variable based on a single
independent variable.

 Multiple linear regression: This involves predicting a dependent variable based on multiple
independent variables.
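To actually design and evaluate such a model in Python, a minimal sketch using scikit-learn is shown below. The small experience-versus-salary dataset is invented purely for illustration; any continuous target with one or more numeric predictors can be substituted.

# Minimal sketch: simple linear regression with scikit-learn (illustrative data)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data: years of experience (X) vs. salary in thousands (y)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([30, 35, 41, 44, 52, 56, 62, 66, 71, 78])

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test)
print("Coefficient:", model.coef_, "Intercept:", model.intercept_)
print("R^2 score:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))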

PRACTICAL-2
Design and evaluate a data model using Logistic Regression.

Logistic Regression
Logistic regression aims to solve classification problems. It does this by predicting categorical outcomes, unlike
linear regression that predicts a continuous outcome.

In the simplest case there are two outcomes, which is called binomial, an example of which is predicting if a tumor
is malignant or benign. Other cases have more than two outcomes to classify, in this case it is called multinomial. A
common example for multinomial logistic regression would be predicting the class of an iris flower between 3
different species.

Here we will be using basic logistic regression to predict a binomial variable. This means it has only two possible
outcomes.

In Python we have modules that will do the work for us. Start by importing the NumPy module.

import numpy

Store the independent variables in X.

Store the dependent variable in y.

Below is a sample dataset:

#X represents the size of a tumor in centimeters.


X=numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)

#Note: X has to be reshaped into a column from a row for the LogisticRegression() function to work.
#y represents whether or not the tumor is cancerous (0 for "No", 1 for "Yes").
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

We will use a method from the sklearn module, so we will have to import that module as well:

from sklearn import linear_model

From the sklearn module we will use the LogisticRegression() method to create a logistic regression object.

This object has a method called fit() that takes the independent and dependent values as parameters and fills the
regression object with data that describes the relationship:

logr = linear_model.LogisticRegression()
logr.fit(X,y)

Now we have a logistic regression object that is ready to predict whether a tumor is cancerous based on the tumor size:

#predict if tumor is cancerous where the size is 3.46cm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))

See the whole example in action:

import numpy
from sklearn import linear_model

#Reshaped for the LogisticRegression() function.


X=numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y=numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr=linear_model.LogisticRegression()
logr.fit(X,y)

#predict if tumor is cancerous where the size is 3.46cm:


predicted=logr.predict(numpy.array([3.46]).reshape(-1,1))
print(predicted)

Result

[0]
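To evaluate the fitted model a little further, we can also inspect the predicted probabilities and the training accuracy. The short follow-up below is an assumed addition that reuses the logr object and data from the example above; predict_proba() returns, for each tumor size, the probability of class 0 (not cancerous) and class 1 (cancerous).

#probability of each class for every tumor size in X
print(logr.predict_proba(X))

#overall accuracy of the model on the training data
print(logr.score(X, y))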

PRACTICAL-3
Design and evaluate a data model using KNN.

K-Nearest Neighbor(KNN) Algorithm for Machine Learning


o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.

o K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new
case into the category that is most similar to the available categories.

o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This
means that when new data appears, it can be easily classified into a well-suited category by using the K-NN
algorithm.

o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.

o K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data.

o It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.

o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies
that data into a category that is much similar to the new data.

o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know
whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images
and, based on those features, place it in either the cat or the dog category.

Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1; we need
to decide which of these categories the data point belongs to. To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors

o Step-2: Calculate the Euclidean distance of K number of neighbors

o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

o Step-4: Among these k neighbors, count the number of the data points in each category.

o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.

o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below
image:

o Firstly, we will choose the number of neighbors; here we choose k=5.

o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as: d = √((x2 - x1)² + (y2 - y1)²)

o By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in category
A and two nearest neighbors in category B. Consider the below image:

o As we can see, the majority of the nearest neighbors (3 of the 5) are from category A; hence this new data point must belong to
category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some values to find the
best out of them. The most preferred value for K is 5.

o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.

o Large values for K reduce the effect of noise, but they can blur the boundaries between categories and make the model miss finer patterns.

Advantages of KNN Algorithm:


o It is simple to implement.

o It is robust to the noisy training data

o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o The value of K always needs to be determined, which may be complex at times.

o The computation cost is high because of calculating the distance between the data points for all the
training samples.

Python implementation of the KNN algorithm


To do the Python implementation of the K-NN algorithm, we will use the same problem and dataset
which we have used in Logistic Regression. But here we will improve the performance of the model.
Below is the problem description:

Problem for K-NN Algorithm: A car manufacturer has produced a new SUV and wants to show advertisements
to the users who are most likely to buy it. For this problem we have a dataset that contains information
about users of a social network. The dataset contains many columns, but we will use Estimated Salary and
Age as the independent variables and Purchased as the dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:

o Data Pre-processing step

o Fitting the K-NN algorithm to the Training set

o Predicting the test result

o Test accuracy of the result(Creation of Confusion matrix)

o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the code for
it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
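The preprocessing code above stops at feature scaling. The remaining steps (fitting K-NN to the training set, predicting the test result, and creating the confusion matrix) can be completed as in the minimal sketch below, which continues from the x_train, x_test, y_train and y_test variables created above; the choice of k=5 and the Euclidean metric are illustrative defaults.

# fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)  # p=2 gives Euclidean distance
classifier.fit(x_train, y_train)

# predicting the test set results
y_pred = classifier.predict(x_test)

# creating the confusion matrix and computing test accuracy
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Test accuracy:", accuracy_score(y_test, y_pred))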

2ND EXAMPLE

Example of the k-nearest neighbor algorithm

# Import necessary modules

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

from sklearn.datasets import load_iris

# Loading data

irisData = load_iris()

# Create feature and target arrays

X = irisData.data

y = irisData.target

# Split into training and test set

X_train, X_test, y_train, y_test = train_test_split(

X, y, test_size = 0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train, y_train)

# Predict on dataset which model has not seen before

print(knn.predict(X_test))
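To evaluate this second example, the accuracy of the classifier on the unseen test set can be printed as well; the extra line below is a small assumed addition to the code above.

# Accuracy on the unseen test data
print(knn.score(X_test, y_test))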

PRACTICAL-4
Design and evaluate a data model using K Means Clustering.

K-means clustering on sample random data using the OpenCV library.

Prerequisites: NumPy, OpenCV, Matplotlib

Let's first visualize the test data with multiple features using Matplotlib.

# importing required tools

import numpy as np

from matplotlib import pyplot as plt

# creating two test data

X = np.random.randint(10,35,(25,2))

Y = np.random.randint(55,70,(25,2))

Z = np.vstack((X,Y))

Z = Z.reshape((50,2))

# convert to np.float32

Z = np.float32(Z)

plt.xlabel('Test Data')

plt.ylabel('Z samples')

plt.hist(Z,256,[0,256])

plt.show()

Here 'Z' is a 50x2 array that stacks the two sets of random samples (values roughly between 10 and 70), so each row is one sample with two features. Stacking the data this way is useful when more than one feature is present. Finally, the data is converted to the np.float32 type, which is what the OpenCV k-means function expects.

Output:

Now, apply the k-Means clustering algorithm to the same example as in the above test data and see

its behavior.

Steps Involved:

1) First we need to set a test data.

2) Define criteria and apply kmeans().

3) Now separate the data.

4) Finally Plot the data.

import numpy as np

import cv2

from matplotlib import pyplot as plt

X = np.random.randint(10,45,(25,2))

Y = np.random.randint(55,70,(25,2))

Z = np.vstack((X,Y))

# convert to np.float32

Z = np.float32(Z)

# define criteria and apply kmeans()

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

ret,label,center = cv2.kmeans(Z,2,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS)

# Now separate the data

A = Z[label.ravel()==0]

B = Z[label.ravel()==1]

# Plot the data

plt.scatter(A[:,0],A[:,1])

plt.scatter(B[:,0],B[:,1],c = 'r')

plt.scatter(center[:,0],center[:,1],s = 80,c = 'y', marker = 's')

plt.xlabel('Test Data'),plt.ylabel('Z samples')

plt.show()

Output:
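The same clustering can also be designed and evaluated with scikit-learn instead of OpenCV. The sketch below is an assumed alternative (not part of the original example): it runs KMeans on the same kind of random data and reports the inertia and silhouette score as simple evaluation measures.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same style of test data as above: two well-separated random blobs
X = np.random.randint(10, 45, (25, 2))
Y = np.random.randint(55, 70, (25, 2))
Z = np.float32(np.vstack((X, Y)))

# Fit k-means with 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(Z)

# Evaluate the clustering
print("Cluster centers:\n", kmeans.cluster_centers_)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)
print("Silhouette score:", silhouette_score(Z, labels))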

PRACTICAL-5
Design and evaluate a data model using SVM.

The Support Vector Machine (SVM) algorithm is commonly used for classification problems. It distinguishes between classes by finding the maximum margin between the closest data points of opposite classes, creating the optimal hyperplane. The number of features in the input data determines whether the hyperplane is a line in 2D space or a plane in an N-dimensional space.

Because multiple hyperplanes can be found to differentiate classes, maximizing the margin between points enables the algorithm to find the best decision boundary between classes. This differentiation, in turn, enables the SVM algorithm to generalize well to new data and make accurate classification predictions. The data points that lie closest to the optimal hyperplane and determine the maximal margin are known as support vectors.

The SVM algorithm is widely used in machine learning as it can handle both linear and nonlinear classification tasks. However, when the data is not linearly separable, kernel functions are used to transform the data into a higher-dimensional feature space to enable linear separation. This application of kernel functions is known as the "kernel trick," and the choice of kernel function, such as linear kernels, polynomial kernels, radial basis function (RBF) kernels, or sigmoid kernels, depends on the data characteristics and the specific use case.

In this tutorial, learn how to apply support vector classification to a credit card clients data set to predict
default payments for the following month. The tutorial provides a step-by-step guide for how to
implement this classification in Python using scikit-learn. You also gain insights into how to reduce
dimensionality within the data set using principal component analysis (PCA), enhancing the efficiency of
the model.

For more information, look at the Reducing dimensionality with principal component analysis with
Python tutorial.

Prerequisites

 Create an IBM Cloud account

 Install scikit-learn

Steps

Step 1. Set up your environment

While you can choose from a number of tools, this tutorial shows how to set up an IBM account to use a
Jupyter Notebook. Jupyter Notebooks are widely used within data science to combine code, text,
images, and data visualizations into a well-formed analysis.

1. Log in to watsonx.ai using your IBM Cloud account.


2. Create a watsonx.ai project.

3. Create a Jupyter Notebook.

From here, a notebook environment opens for you to load your data set and copy code from this
beginner tutorial to tackle a simple classification problem.

Step 2. Import libraries and load the data set

You must import the necessary Python libraries so that you can work with the default of the credit card
clients data set, perform data preprocessing, and build and evaluate your SVM model. These libraries
are crucial for data manipulation, visualization, and machine learning tasks. If they're not installed, you
can resolve this with a quick pip install.

The study associated with this data set focused on customers' default payments and compared the
predictive accuracy of the probability that a client will default on payment across six data mining
methods. It used a binary variable, "default payment" (Yes = 1; No = 0), as the response variable and
used 23 variables as explanatory variables. To explore the data definitions, refer to UCI’s data
repository.

#Load the required libraries

import pandas as pd

import numpy as np

import seaborn as sns

from sklearn.utils import resample

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV, StratifiedKFold

from sklearn.decomposition import PCA

import matplotlib.colors as colors

import matplotlib.pyplot as plt

!pip3 install xlrd

# Import the data set

df = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls', header=1)


Step 3. Explore the data set

Prior to initiating data preprocessing, you should conduct an exploratory data analysis to understand the
data's structure and format, including the types of variables, their distributions, and the overall
organization of information. This exploration directs the modeling approach.

In this step, you explore the first ten rows of the pandas DataFrame.

#Explore the first ten rows of the data set

df.head(10)


In the output provided, each row represents an individual entry. The columns represent specific features
like the identification number, credit limit, sex, education, marriage status, age, payment status across
several months, bill statement amounts, payment amounts, and a target variable indicating default in
the following month. The numerical data in each column provides information about the respective
feature or attribute. The "default payment next month" column represents the class label or target
variable for classification tasks in an SVM.

# Rename the columns

df.rename({'default payment next month': 'DEFAULT'}, axis='columns', inplace=True)

#Remove the ID column as it is not informative

df.drop('ID', axis=1, inplace=True)

df.head()


To clean up the data set, rename and simplify the "default payment next month" column to "DEFAULT",
making it convenient for further analysis or modeling tasks. The DataFrame also no longer contains
the ID column because it is considered non-informative for analysis or modeling tasks.
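The excerpt above ends after cleaning the data. A minimal sketch of the remaining training and evaluation steps is given below; it assumes the df DataFrame with the renamed DEFAULT column from the code above, and the subsample size, kernel, and hyperparameters are illustrative choices rather than the exact settings of the original tutorial.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# Work on a smaller random sample so the SVM trains quickly (illustrative choice)
sample = df.sample(n=2000, random_state=0)
X = sample.drop('DEFAULT', axis=1)
y = sample['DEFAULT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

# Scale the features - SVMs are sensitive to feature scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train and evaluate the support vector classifier
svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("Test accuracy:", accuracy_score(y_test, y_pred))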

PRACTICAL-6
Design and evaluate a data model using PCA.

Principal Component Analysis is basically a statistical procedure to convert a set of observations of


possibly correlated variables into a set of values of linearly uncorrelated variables.

Each principal component is chosen in such a way that it describes most of the still-available
variance, and all the principal components are orthogonal to each other. Among all the principal
components, the first principal component has the maximum variance.

Uses of PCA:

1. It is used to find interrelations between variables in the data.

2. It is used to interpret and visualize data.

3. The number of variables decreases, which makes further analysis simpler.

4. It’s often used to visualize genetic distance and relatedness between populations.

PCA is basically performed on a square symmetric matrix. It can be a pure sums-of-squares and
cross-products (SSCP) matrix, a covariance matrix, or a correlation matrix. A correlation matrix is used if the
individual variances differ greatly.

Objectives of PCA:

1. It is basically a non-dependent procedure in which it reduces attribute space from a large number
of variables to a smaller number of factors.

2. PCA is basically a dimension reduction process but there is no guarantee that the dimension is
interpretable.

3. The main task in PCA is to select a subset of variables from a larger set, based on which
original variables have the highest correlation with the principal components.

4. Identifying patterns: PCA can help identify patterns or relationships between variables that may
not be apparent in the original data. By reducing the dimensionality of the data, PCA can reveal
underlying structures that can be useful in understanding and interpreting the data.

5. Feature extraction: PCA can be used to extract features from a set of variables that are more
informative or relevant than the original variables. These features can then be used in modeling or
other analysis tasks.

6. Data compression: PCA can be used to compress large datasets by reducing the number of
variables needed to represent the data, while retaining as much information as possible.

7. Noise reduction: PCA can be used to reduce the noise in a dataset by identifying and removing the
principal components that correspond to the noisy parts of the data.

8. Visualization: PCA can be used to visualize high-dimensional data in a lower-dimensional space,


making it easier to interpret and understand. By projecting the data onto the principal
components, patterns and relationships between variables can be more easily visualized.

Principal Axis Method: PCA basically searches for a linear combination of variables so that we can extract
maximum variance from the variables. Once this process is complete, it removes that variance and searches for
another linear combination that explains the maximum proportion of the remaining variance, which
leads to orthogonal factors. In this method, we analyze total variance.

Eigenvector: It is a non-zero vector that stays parallel (only scaled) after matrix multiplication. Suppose x is an
eigenvector of dimension r of a matrix M of dimension r*r; then Mx and x are parallel. We need to
solve Mx = λx, where both the eigenvector x and the eigenvalue λ are unknown.
Under eigenvectors, we can say that principal components show both the common and the unique variance
of the variables. Basically, it is a variance-focused approach seeking to reproduce the total variance and the
correlation with all components. The principal components are basically the linear combinations of
the original variables weighted by their contribution to explaining the variance in a particular orthogonal
dimension.

Eigenvalues: Eigenvalues are also known as characteristic roots. An eigenvalue basically measures the variance in all the
variables that is accounted for by that factor. The ratio of eigenvalues is the ratio of the explanatory
importance of the factors with respect to the variables. If the factor is low then it is contributing less
to the explanation of variables. In simple words, it measures the amount of variance in the total given
database accounted by the factor. We can calculate the factor’s eigenvalue as the sum of its squared
factor loading for all the variables.

Now, Let’s understand Principal Component Analysis with Python.

The implementation below uses the Wine dataset (loaded from wine.csv).

Step 1: Importing the libraries

 Python

# importing required libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

Step 2: Importing the data set

Import the dataset and distribute it into X and y components for data analysis.

 Python

# importing or loading the dataset

dataset = pd.read_csv('wine.csv')

# distributing the dataset into two components X and Y

X = dataset.iloc[:, 0:13].values

y = dataset.iloc[:, 13].values

Step 3: Splitting the dataset into the Training set and Test set

 Python

# Splitting the X and Y into the

# Training set and Testing set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step 4: Feature Scaling

Doing the pre-processing part on training and testing set such as fitting the Standard scale.

 Python

# performing preprocessing part

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

Step 5: Applying PCA function

Applying the PCA function into the training and testing set for analysis.

 Python

# Applying PCA function on training

# and testing set of X component

from sklearn.decomposition import PCA

pca = PCA(n_components = 2)

X_train = pca.fit_transform(X_train)

X_test = pca.transform(X_test)

explained_variance = pca.explained_variance_ratio_

Step 6: Fitting Logistic Regression To the training set

 Python

# Fitting Logistic Regression To the training set

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)

classifier.fit(X_train, y_train)
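To evaluate the classifier built on the two principal components, predictions on the PCA-transformed test set can be compared against y_test; the short sketch below continues from the variables defined in the steps above.

# Predicting the test set results and evaluating the model
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Explained variance of the 2 components:", explained_variance)
print("Test accuracy:", accuracy_score(y_test, y_pred))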

PRACTICAL-7
Design and evaluate a data model using Decision Trees.

Decision Tree

A Decision tree is a tree-like structure that represents a set of decisions and their possible
consequences. Each node in the tree represents a decision, and each branch represents an outcome
of that decision. The leaves of the tree represent the final decisions or predictions.

Decision trees are created by recursively partitioning the data into smaller and smaller subsets. At
each partition, the data is split based on a specific feature, and the split is made in a way that
maximizes the information gain.

Decision Tree

In the above figure, the decision tree is a flowchart-like tree structure that is used to make decisions. It
consists of a root node (WINDY) and internal nodes (OUTLOOK, TEMPERATURE), which represent tests on
attributes, and leaf nodes, which represent the final decisions. The branches of the tree represent the
possible outcomes of the tests.

Key Components of Decision Trees in Python

1. Root Node: The decision tree’s starting node, which stands for the complete dataset.

2. Branch Nodes: Internal nodes that represent decision points, where the data is split based on a
specific attribute.

3. Leaf Nodes: Terminal nodes that represent the final categorization or prediction.

4. Decision Rules: Rules that govern the splitting of data at each branch node.

5. Attribute Selection: The process of choosing the most informative attribute for each split.

6. Splitting Criteria: Metrics like information gain, entropy, or the Gini Index are used to calculate
the optimal split.

Assumptions we make while using Decision tree

 At the beginning, we consider the whole training set as the root.

 Attributes are assumed to be categorical for information gain and for gini index, attributes are
assumed to be continuous.

 On the basis of attribute values records are distributed recursively.

 We use statistical methods for ordering attributes as root or internal node.

Pseudocode of Decision tree

1. Find the best attribute and place it on the root node of the tree.

2. Now, split the training set of the dataset into subsets. While making a subset, make sure that
each subset of the training dataset has the same value for the chosen attribute.

3. Find leaf nodes in all branches by repeating 1 and 2 on each subset.

Key concept in Decision Tree

Both the Gini index and information gain are used to select which of the n attributes of the
dataset should be placed at the root node or at an internal node.

Gini index

 Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified.

 It means an attribute with lower gini index should be preferred.

 Sklearn supports “gini” criteria for Gini Index and by default, it takes “gini” value.

Entropy

If a random variable x can take N different values, the i-th value occurring with probability p(i), then we can
associate the following entropy with x:

H(x) = - Σ p(i) · log2(p(i)), where the sum runs over i = 1, ..., N

 Entropy is the measure of uncertainty of a random variable, it characterizes the impurity
of an arbitrary collection of examples. The higher the entropy the more the information
content.
 Information Gain
 Definition: Suppose S is a set of instances, A is an attribute, S(v) is the subset of S with A = v,
and Values(A) is the set of all possible values of A; then

Gain(S, A) = Entropy(S) - Σ (|S(v)| / |S|) · Entropy(S(v)), where the sum runs over v ∈ Values(A)

 The entropy typically changes when we use a node in a Python decision tree to partition the
training instances into smaller subsets. Information gain is a measure of this change in entropy.

 Sklearn supports “entropy” criteria for Information Gain and if we want to use Information Gain
method in sklearn then we have to mention it explicitly.
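As a small illustration of these formulas (not part of the original notes), the snippet below computes entropy and information gain for a made-up "windy" attribute using plain NumPy.

import numpy as np

def entropy(labels):
    # H(x) = -sum(p_i * log2(p_i)) over the distinct class labels
    values, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v))
    total = entropy(labels)
    values, counts = np.unique(attribute_values, return_counts=True)
    weighted = sum((c / len(labels)) * entropy(labels[attribute_values == v])
                   for v, c in zip(values, counts))
    return total - weighted

# Toy example: 'windy' attribute vs. play/don't-play labels (made-up values)
labels = np.array(['yes', 'yes', 'no', 'no', 'yes', 'no'])
windy = np.array(['false', 'false', 'true', 'true', 'false', 'true'])
print("Entropy:", entropy(labels))
print("Information gain of 'windy':", information_gain(labels, windy))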

Python Decision Tree Implementation

Dataset Description:

Title : Balance Scale Weight & Distance

Database

Number of Instances : 625 (49 balanced, 288 left, 288 right)

Number of Attributes : 4 (numeric) + class name = 5

Attribute Information:

1. Class Name (Target variable): 3

L [balance scale tip to the left]

B [balance scale be balanced]

R [balance scale tip to the right]

2. Left-Weight: 5 (1, 2, 3, 4, 5)

3. Left-Distance: 5 (1, 2, 3, 4, 5)

4. Right-Weight: 5 (1, 2, 3, 4, 5)

5. Right-Distance: 5 (1, 2, 3, 4, 5)

Missing Attribute Values: None

Class Distribution:

1. 46.08 percent are L

2. 07.84 percent are B

3. 46.08 percent are R
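A minimal sketch of designing and evaluating a decision tree on this balance-scale data is shown below. The file location and column order are assumed from the dataset description above (class name first, then the four numeric attributes); adjust the path or URL if your copy of the data is stored elsewhere.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed location of the balance-scale data (the file has no header row)
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "balance-scale/balance-scale.data")
data = pd.read_csv(url, header=None,
                   names=['Class', 'LW', 'LD', 'RW', 'RD'])

X = data[['LW', 'LD', 'RW', 'RD']]   # left/right weights and distances
y = data['Class']                    # L, B or R

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

# Gini-based tree (sklearn's default criterion); depth limited to avoid overfitting
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=5, random_state=100)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))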

PRACTICAL-8
Design and evaluate a data model using Random Forest.

Random Forest Regression is a versatile machine-learning technique for predicting numerical values. It
combines the predictions of multiple decision trees to reduce overfitting and improve accuracy.
Python’s machine-learning libraries make it easy to implement and optimize this approach.

Ensemble Learning

Ensemble learning is a machine learning technique that combines the predictions from multiple
models to create a more accurate and stable prediction. It is an approach that leverages the collective
intelligence of multiple models to improve the overall performance of the learning system.

Types of Ensemble Methods

There are various types of ensemble learning methods, including:

1. Bagging (Bootstrap Aggregating): This method involves training multiple models on random
subsets of the training data. The predictions from the individual models are then combined,
typically by averaging.

2. Boosting: This method involves training a sequence of models, where each subsequent model
focuses on the errors made by the previous model. The predictions are combined using a
weighted voting scheme.

3. Stacking: This method involves using the predictions from one set of models as input features for
another model. The final prediction is made by the second-level model.

Random Forest

A random forest is an ensemble learning method that combines the predictions from multiple
decision trees to produce a more accurate and stable prediction. It is a type of supervised learning
algorithm that can be used for both classification and regression tasks.

Every decision tree has high variance, but when we combine all of them in parallel, the resultant
variance is low: each decision tree is trained on its own sample of the data, so the output does not
depend on any single decision tree but on multiple decision trees. In the case of a
classification problem, the final output is obtained using a majority-voting classifier. In the case of a
regression problem, the final output is the mean of all the outputs. This part is called Aggregation.

Random Forest Regression Model Working

What is Random Forest Regression?

Random Forest Regression in machine learning is an ensemble technique capable of performing


both regression and classification tasks with the use of multiple decision trees and a technique called
Bootstrap and Aggregation, commonly known as bagging. The basic idea behind this is to combine
multiple decision trees in determining the final output rather than relying on individual decision
trees.

Random Forest has multiple decision trees as base learning models. We randomly perform row
sampling and feature sampling from the dataset forming sample datasets for every model. This part is
called Bootstrap.

We need to approach the Random Forest regression technique like any other machine
learning technique.

 Design a specific question or data and get the source to determine the required data.

 Make sure the data is in an accessible format else convert it to the required format.

 Specify all noticeable anomalies and missing data points that may be required to achieve the
required data.

 Create a machine-learning model.

 Set the baseline model that you want to achieve

 Train the data machine learning model.

 Provide an insight into the model with test data

 Now compare the performance metrics of both the test data and the predicted data from the
model.

 If it doesn't satisfy your expectations, you can try improving your model accordingly, updating your
data, or using another data modeling technique.

 At this stage, you interpret the data you have gained and report accordingly.

Random Forest Regression in Python

We will be using a similar sample technique in the below example. Below is a step-by-step sample
implementation of Random Forest Regression, on the dataset that can be downloaded here-
https://bit.ly/417n3N5

Python libraries make it very easy for us to handle the data and perform typical and complex tasks
with a single line of code.

 Pandas – This library helps to load the data frame in a 2D array format and has multiple functions
to perform analysis tasks in one go.

 Numpy – Numpy arrays are very fast and can perform large computations in a very short time.

 Matplotlib/Seaborn – This library is used to draw visualizations.

 Sklearn – This module contains multiple libraries having pre-implemented functions to perform
tasks from data preprocessing to model development and evaluation.

 RandomForestRegressor – This is the regression model that is based upon the Random Forest
model or the ensemble learning that we will be using in this article using the sklearn library.

 sklearn: This library is the core machine learning library in Python. It provides a wide range of
tools for preprocessing, modeling, evaluating, and deploying machine learning models.

 LabelEncoder: This class is used to encode categorical data into numerical values.

 KNNImputer: This class is used to impute missing values in a dataset using a k-nearest neighbors
approach.

 train_test_split: This function is used to split a dataset into training and testing sets.

 StandardScaler: This class is used to standardize features by removing the mean and scaling to
unit variance.

 f1_score: This function is used to evaluate the performance of a classification model using the F1
score.

 RandomForestRegressor: This class is used to train a random forest regression model.

 cross_val_score: This function is used to perform k-fold cross-validation to evaluate the


performance of a model

Step-1: Import Libraries

Here we are importing all the necessary libraries required.

 Python3

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import sklearn

import warnings

from sklearn.preprocessing import LabelEncoder

from sklearn.impute import KNNImputer

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import f1_score

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import cross_val_score

warnings.filterwarnings('ignore')

Step-2: Import Dataset

Now let's load the dataset into a pandas DataFrame for better data handling and to leverage its
handy functions for performing complex tasks in one go.

 Python3

df= pd.read_csv('Salaries.csv')

print(df)

Output:

Position Level Salary


0 BusinessAnalyst 1 45000
1 JuniorConsultant 2 50000
2 SeniorConsultant 3 60000
3 Manager 4 80000
4 CountryManager 5 110000
5 RegionManager 6 150000
6 Partner 7 200000
7 SeniorPartner 8 300000
8 C-level 9 500000
9 CEO 10 1000000

Here the .info() method provides a quick overview of the structure, data types, and memory usage of
the dataset.

 Python3

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Position 10 non-null object
1 Level 10 non-null int64
2 Salary 10 non-null int64
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes
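Continuing from the Salaries data loaded above, a minimal sketch of fitting and evaluating the random forest regressor follows; the number of trees and the prediction at level 6.5 are illustrative choices rather than values fixed by the original notes.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Use Level as the single feature and Salary as the target
X = df[['Level']].values
y = df['Salary'].values

# Train the random forest regression model
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X, y)

# Predict the salary for an intermediate level (e.g. 6.5)
print("Predicted salary at level 6.5:", regressor.predict(np.array([[6.5]])))

# Evaluate with 5-fold cross-validation on the R^2 score
scores = cross_val_score(regressor, X, y, cv=5, scoring='r2')
print("Cross-validated R^2 scores:", scores)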

PRACTICAL-9
Compare the performance of all the above ML techniques on a similar data set using
matplotlib.

Matplotlib, a remarkable data visualization library in Python, specializes in creating accurate 2D


plots from arrays and is built on NumPy arrays, making it a versatile solution for multi-platform
applications.

Why Matplotlib?
 Developed using Python and NumPy arrays.

 Seamlessly integrates with pandas.

 Capable of generating both basic and advanced plots.

 Features a user-friendly interface.

Matplotlib Workflow
Below is the high-level Matplotlib workflow.


Let’s see what we will cover in this blog.

 Importing the library and different ways of plotting.

 Plotting data from NumPy ndarrays.

 Plotting data from pandas DataFrames.

 Customizing Plots.

 Saving and sharing the plots.

Importing Library and Plotting
In workflow we know one of the initial steps is to establish or create a plot (also known as
figure). Let’s see how we can do that.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plt.plot();

Above code just created an empty plot or figure.


x=[1,2,3,4]
y=[11,22,33,44]
plt.plot(x,y);

If you notice, both the x and y axes have changed according to the data. This quick plt.plot() call uses the stateful pyplot way of
plotting.

Also, the best practice is to use the object-oriented API rather than the pyplot API, since the pyplot API is less
flexible.

Let’s look at different object-oriented ways we can plot using matplotlib.


#1st method
fig = plt.figure() # creates a figure
ax = fig.add_subplot() # adds some axes
plt.show()

#2nd method
fig = plt.figure() # creates a figure
ax = fig.add_axes((1,1,1,1)) # adds some axes
ax.plot(x,y) # add some data
plt.show()

#3rd method (recommended method)


fig,ax = plt.subplots()
ax.plot(x,y); # add some data

As you can see, the 3rd method is the recommended way to use the Matplotlib object-oriented API.

Below is the anatomy of a Matplotlib plot.

Image taken from https://matplotlib.org/3.1.1/gallery/subplots_axes_and_figures/subplot.html

# 0) Import matplotlib
%matplotlib inline

import matplotlib.pyplot as plt

# 1) Generate/prepare data
x=[2,4,6,8]
y=[5,10,15,20]

# 2) Establish/Setup plot
fig,ax = plt.subplots(figsize=(10,10)) #x and y dimensions of the plot (width,height)

# 3) Plot data
ax.plot(x,y)

# 4) Customize plot
ax.set(title= 'First Plot',
xlabel='x-axis',
ylabel='y-axis')

# 5) Save and show (you save the complete figure)


fig.savefig('images/first-plot.png')
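Finally, to address the aim of this practical (comparing the earlier ML techniques on one data set), the sketch below trains several scikit-learn classifiers on the same train/test split of the iris data and plots their test accuracies with a Matplotlib bar chart. The iris data and the default hyperparameters are assumptions made for illustration; any common data set, for example the social-network ads data used in the KNN practical, can be substituted.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# One shared data set and split for all models
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features so that distance- and margin-based models behave well
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(kernel='rbf'),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Train each model and record its test accuracy
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)

# Compare the techniques with a matplotlib bar chart
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(list(accuracies.keys()), list(accuracies.values()))
ax.set(title='Comparison of ML techniques on the same data set',
       xlabel='Model', ylabel='Test accuracy')
ax.set_ylim(0, 1)
plt.show()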
