MACHINE LEARNING AND DATA ANALYTICS USING PYTHON LAB
PRACTICAL-1
Design and evaluate a data model using Linear Regression
Linear regression is a statistical method that is used to predict a continuous dependent variable (the target
variable) based on one or more independent variables (predictor variables). This technique assumes a
linear relationship between the dependent and independent variables, which implies that the
dependent variable changes proportionally with changes in the independent variables. In other words,
linear regression is used to determine the extent to which one or more variables can predict the value
of the dependent variable.
Given below are the basic assumptions that a linear regression model makes regarding a dataset on
which it is applied:
Linear relationship: The relationship between response and feature variables should be linear.
The linearity assumption can be tested using scatter plots: variables whose scatter plot follows a roughly
straight line are linearly related and will give better predictions with linear regression, whereas variables
whose scatter plot shows a clearly curved or patternless cloud are most likely non-linear.
Little or no autocorrelation: Another assumption is that there is little or no autocorrelation in the
data. Autocorrelation occurs when the residual errors are not independent of each other.
No outliers: We assume that there are no outliers in the data. Outliers are data points that are far
away from the rest of the data. Outliers can affect the results of the analysis.
Homoscedasticity: Homoscedasticity describes a situation in which the error term (that is, the
“noise” or random disturbance in the relationship between the independent variables and the
dependent variable) is the same across all values of the independent variables. In a residual-versus-fitted
plot, homoscedastic errors form a band of roughly constant width, whereas heteroscedastic errors fan out
or narrow as the fitted values change.
Finally, the two main types of linear regression are described below; a short code sketch follows the list.
Simple linear regression: This involves predicting a dependent variable based on a single
independent variable.
Multiple linear regression: This involves predicting a dependent variable based on multiple
independent variables.
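Since the manual gives no code listing for this practical, below is a minimal sketch of designing and evaluating a linear regression model with scikit-learn. The synthetic data, the 80/20 split, and the chosen metrics (MSE and R^2) are assumptions for illustration, not part of the original manual.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# synthetic example data: y is roughly 3*x + 4 plus some noise (illustrative values only)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 4 + rng.normal(0, 1, size=100)

# design the model: hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate the model on the held-out test set
y_pred = model.predict(X_test)
print("Coefficient:", model.coef_, "Intercept:", model.intercept_)
print("Mean squared error:", mean_squared_error(y_test, y_pred))
print("R^2 score:", r2_score(y_test, y_pred))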
PRACTICAL-2
Design and evaluate a data model using Logistic Regression.
Logistic Regression
Logistic regression aims to solve classification problems. It does this by predicting categorical outcomes, unlike
linear regression that predicts a continuous outcome.
In the simplest case there are two outcomes, which is called binomial, an example of which is predicting if a tumor
is malignant or benign. Other cases have more than two outcomes to classify, in this case it is called multinomial. A
common example for multinomial logistic regression would be predicting the class of an iris flower between 3
different species.
Here we will be using basic logistic regression to predict a binomial variable. This means it has only two possible
outcomes.
In Python we have modules that will do the work for us. Start by importing the NumPy module.
import numpy

#X represents the size of each tumor (example values).
#Note: X has to be reshaped into a column from a row for the LogisticRegression() function to work.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
#y represents whether or not the tumor is cancerous (0 for "No", 1 for "Yes").
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
We will use a method from the sklearn module, so we will have to import that module as well:
from sklearn import linear_model
From the sklearn module we will use the LogisticRegression() method to create a logistic regression object.
This object has a method called fit() that takes the independent and dependent values as parameters and fills the
regression object with data that describes the relationship:
logr = linear_model.LogisticRegression()
logr.fit(X,y)
Now we have a logistic regression object that is ready to predict whether a tumor is cancerous based on the tumor size:
#predict if tumor is cancerous where the size is 3.46mm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))
print(predicted)
Result
[0]

The result [0] means the model predicts that a tumor of size 3.46mm is not cancerous.
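As an optional extension (not part of the original example), the fitted model's class probabilities and training accuracy can be inspected with scikit-learn's built-in methods:
#inspect the predicted probability of each class for the same tumor size (optional extension)
probabilities = logr.predict_proba(numpy.array([3.46]).reshape(-1,1))
print(probabilities)  #prints [[P(not cancerous), P(cancerous)]]

#accuracy of the model on the training data
print(logr.score(X, y))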
PRACTICAL-3
Design and evaluate a data model using KNN.
o K-NN algorithm assumes the similarity between the new case/data and available cases and put the new
case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This
means that when new data appears, it can easily be classified into a well-suited category using the K-NN
algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies
that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want
to know whether it is a cat or a dog. For this identification, we can use the K-NN algorithm, since it works
on a similarity measure. Our K-NN model will find the features of the new image that are most similar to
the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In
which of these categories will this data point lie? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new point to each of the existing data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
Suppose we have a new data point that we need to put in the required category. We proceed as follows:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For two points (x1, y1) and
(x2, y2) it is calculated as d = sqrt((x2 - x1)^2 + (y2 - y1)^2).
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category
A and two nearest neighbors in category B.
o As three of the five nearest neighbors are from category A, this new data point must belong to
category A.
o There is no particular way to determine the best value for "K", so we need to try some values to find the
best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K reduce the effect of noise, but they blur the class boundaries and increase the computation time.
o One drawback of K-NN is that the computation cost is high, because the distance between the new point
and all the training samples has to be calculated for every prediction.
Problem for the K-NN Algorithm: A car manufacturer has produced a new SUV and wants to show ads
only to users who are likely to be interested in buying it. For this problem, we have a dataset that
contains information about the users of a social network. The dataset contains many columns, but we
will use Estimated Salary and Age as the independent variables and Purchased as the dependent
variable.
Steps to implement the K-NN algorithm:
o Data pre-processing step
o Fitting the K-NN algorithm to the training set
o Predicting the test result
o Checking the accuracy of the result
o Visualizing the test set result
The data pre-processing step will remain exactly the same as in Logistic Regression. Below is the code for
it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')
#extracting Age and Estimated Salary as independent variables and Purchased as the dependent
#variable (column positions assumed from the dataset description above)
x= data_set.iloc[:, [2, 3]].values
y= data_set.iloc[:, 4].values
#splitting the dataset into training and test set, then feature scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state= 0)
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
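A possible continuation of this example (a sketch, reusing the x_train, x_test, y_train and y_test produced above) fits the K-NN classifier, predicts the test set, and checks the accuracy; the parameter values shown (k=5, Euclidean distance) are the usual choices for this example rather than the manual's own.
#Fitting the K-NN classifier to the training set (p=2 with the minkowski metric gives Euclidean distance)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
#Predicting the test set result and checking the accuracy
y_pred= classifier.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))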
2ND EXAMPLE
# Loading data (imports added so the example runs end to end)
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

irisData = load_iris()
X = irisData.data
y = irisData.target

# Split into train and test sets, then fit and use the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print(knn.predict(X_test))
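To evaluate this second example (an optional addition), the classifier's accuracy on the held-out test set can be printed with the built-in score() method:
# accuracy of the fitted classifier on the test set
print(knn.score(X_test, y_test))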
PRACTICAL-4
Design and evaluate a data model using K Means Clustering.
K-means clustering
import numpy as np
from matplotlib import pyplot as plt

X = np.random.randint(10,35,(25,2))
Y = np.random.randint(55,70,(25,2))
Z = np.vstack((X,Y))
Z = Z.reshape((50,2))
# convert to np.float32 (cv2.kmeans expects float32 input)
Z = np.float32(Z)
plt.xlabel('Test Data')
plt.ylabel('Z samples')
# histogram range chosen to match the data (all values lie between 10 and 70)
plt.hist(Z.ravel(), bins=30, range=(0, 80))
plt.show()
Here ‘Z’ is a 50x2 array of two-dimensional points: 25 points drawn from the range 10-35 stacked on top
of 25 points drawn from the range 55-70. Stacking the two groups into a single array is what the clustering
algorithm expects, and this layout becomes more useful when more than one feature is present. The data is
then converted to np.float32, the type that cv2.kmeans requires.
Output: a histogram showing the two well-separated groups of values.
Now, apply the k-means clustering algorithm to the same kind of test data and observe its behavior.
Steps involved: generate the data, define the termination criteria, run cv2.kmeans with K = 2, separate the
points by cluster label, and plot the two clusters. The code below follows these steps.
import numpy as np
import cv2
from matplotlib import pyplot as plt

X = np.random.randint(10,45,(25,2))
Y = np.random.randint(55,70,(25,2))
Z = np.vstack((X,Y))
# convert to np.float32
Z = np.float32(Z)
# define criteria (stop after 10 iterations or when accuracy epsilon = 1.0 is reached) and apply kmeans()
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
ret,label,center = cv2.kmeans(Z,2,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS)
# separate the data by cluster label and plot each cluster
A = Z[label.ravel()==0]
B = Z[label.ravel()==1]
plt.scatter(A[:,0],A[:,1])
plt.scatter(B[:,0],B[:,1],c = 'r')
plt.show()
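As an alternative to OpenCV (not part of the original listing), the same clustering can be sketched with scikit-learn's KMeans, reusing the Z array defined above:
# a sketch of the equivalent clustering with scikit-learn's KMeans
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(Z)      # cluster index (0 or 1) for every point in Z
centers = km.cluster_centers_   # coordinates of the two cluster centers

plt.scatter(Z[labels == 0, 0], Z[labels == 0, 1])
plt.scatter(Z[labels == 1, 0], Z[labels == 1, 1], c='r')
plt.scatter(centers[:, 0], centers[:, 1], c='y', marker='s', s=80)
plt.show()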
Output: a scatter plot showing the two clusters in different colors.
PRACTICAL-5
Design and evaluate a data model using SVM.
The Support Vector Machine (SVM) algorithm is commonly used for classification problems. It
distinguishes between classes by finding the maximum margin between the closest data points of opposite
classes, creating the optimal hyperplane. The number of features in the input data determines whether the
hyperplane is a line in a 2-D space or a plane in an N-dimensional space.
Because multiple hyperplanes can be found to differentiate classes, maximizing the margin between
points enables the algorithm to find the best decision boundary between classes. This differentiation, in
turn, enables the SVM algorithm to generalize well to new data and make accurate classification
predictions. The margin boundaries adjacent to the optimal hyperplane are known as the support vectors,
as they run through the data points that determine the maximal margin.
The SVM algorithm is widely used in machine learning as it can handle both linear and nonlinear
classification tasks. However, when the data is not linearly separable, kernel functions are used to
transform the data into a higher-dimensional feature space to enable linear separation. This application of
kernel functions is known as the “kernel trick,” and the choice of kernel function, such as linear
kernels, polynomial kernels, radial basis function (RBF) kernels, or sigmoid kernels, depends on data
characteristics and the specific use case.
In this tutorial, learn how to apply support vector classification to a credit card clients data set to predict
default payments for the following month. The tutorial provides a step-by-step guide for how to
implement this classification in Python using scikit-learn. You also gain insights into how to reduce
dimensionality within the data set using principal component analysis (PCA), enhancing the efficiency of
the model.
For more information, look at the Reducing dimensionality with principal component analysis with
Python tutorial.
Prerequisites
Install scikit-learn
Steps
While you can choose from a number of tools, this tutorial shows how to set up an IBM account to use a
Jupyter Notebook. Jupyter Notebooks are widely used within data science to combine code, text,
images, and data visualizations to formulate a well-formed analysis.
Create a Jupyter Notebook.
From here, a notebook environment opens for you to load your data set and copy code from this
beginner tutorial to tackle a simple classification problem.
You must import the necessary Python libraries so that you can work with the default of the credit card
clients data set, perform data preprocessing, and build and evaluate your SVM model. These libraries
are crucial for data manipulation, visualization, and machine learning tasks. If they're not installed, you
can resolve this with a quick pip install.
The study associated with this data set focused on customers' default payments and compared the
predictive accuracy of the probability that a client will default on payment across six data mining
methods. It used a binary variable, "default payment" (Yes = 1; No = 0), as the response variable and
used 23 variables as explanatory variables. To explore the data definitions, refer to UCI’s data
repository.
import pandas as pd
import numpy as np
df = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls', header=1)
Step 3. Explore the data set
Prior to initiating data preprocessing, you should conduct an exploratory data analysis to understand the
data's structure and format, including the types of variables, their distributions, and the overall
organization of information. This exploration directs the modeling approach.
In this step, you explore the first ten rows of the pandas DataFrame.
df.head(10)
In the output provided, each row represents an individual entry. The columns represent specific features
like the identification number, credit limit, sex, education, marriage status, age, payment status across
several months, bill statement amounts, payment amounts, and a target variable indicating default in
the following month. The numerical data in each column provides information about the respective
feature or attribute. The "default payment next month" column represents the class label or target
variable for classification tasks in an SVM.
df.head()
To clean up the data set, rename and simplify the "default payment next month" column to "DEFAULT",
making it convenient for further analysis or modeling tasks. The DataFrame should also no longer contain
the ID column because it is considered non-informative for analysis or modeling tasks. A sketch of this
cleanup, followed by a minimal SVM model, is given below.
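The following is a minimal sketch of that cleanup together with building and evaluating a support vector classifier on the data; the 80/20 split, the standard scaling, and the RBF kernel are assumptions for illustration and may differ from the original tutorial.
# rename the target column and drop the non-informative ID column
df = df.rename(columns={'default payment next month': 'DEFAULT'})
df = df.drop(columns=['ID'])

# split features and target, scale the features, and fit a support vector classifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

X = df.drop(columns=['DEFAULT'])
y = df['DEFAULT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svm = SVC(kernel='rbf')   # RBF kernel assumed; linear or polynomial kernels can also be tried
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))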
PRACTICAL-6
Design and evaluate a data model using PCA.
Each of the principal components is chosen in such a way that it describes most of the still-available
(remaining) variance, and all these principal components are orthogonal to each other. Among all the
principal components, the first principal component has the maximum variance.
Uses of PCA:
It is often used to visualize genetic distance and relatedness between populations.
PCA is performed on a square symmetric matrix, which can be a pure sums-of-squares and cross-products
matrix, a covariance matrix, or a correlation matrix. A correlation matrix is used if the individual
variances differ greatly.
Objectives of PCA:
1. It is a non-dependent procedure that reduces the attribute space from a large number
of variables to a smaller number of factors.
2. PCA is basically a dimension reduction process but there is no guarantee that the dimension is
interpretable.
3. The main task in PCA is to select a subset of variables from a larger set, based on which
original variables have the highest correlation with the principal components.
4. Identifying patterns: PCA can help identify patterns or relationships between variables that may
not be apparent in the original data. By reducing the dimensionality of the data, PCA can reveal
underlying structures that can be useful in understanding and interpreting the data.
5. Feature extraction: PCA can be used to extract features from a set of variables that are more
informative or relevant than the original variables. These features can then be used in modeling or
other analysis tasks.
6. Data compression: PCA can be used to compress large datasets by reducing the number of
variables needed to represent the data, while retaining as much information as possible.
7. Noise reduction: PCA can be used to reduce the noise in a dataset by identifying and removing the
principal components that correspond to the noisy parts of the data.
Principal Axis Method: PCA basically searches a linear combination of variables so that we can extract
maximum variance from the variables. Once this process completes it removes it and searches for
another linear combination that gives an explanation about the maximum proportion of remaining
variance which basically leads to orthogonal factors. In this method, we analyze total variance.
Eigenvector: It is a non-zero vector that stays parallel (keeps its direction) after matrix multiplication.
Suppose x is an eigenvector of dimension r of a matrix M with dimension r*r; then Mx and x are parallel.
To obtain the eigenvectors and eigenvalues we solve Mx = λx, where both the vector x and the scalar λ
(the eigenvalue) are unknown.
Under Eigen-Vectors, we can say that Principal components show both common and unique variance
of the variable. Basically, it is variance focused approach seeking to reproduce total variance and
correlation with all components. The principal components are basically the linear combinations of
the original variables weighted by their contribution to explain the variance in a particular orthogonal
dimension.
Eigen Values: Eigenvalues are also known as characteristic roots. An eigenvalue measures the variance in
all the variables that is accounted for by that factor. The ratio of eigenvalues is the ratio of the explanatory
importance of the factors with respect to the variables. If the factor is low then it is contributing less
to the explanation of variables. In simple words, it measures the amount of variance in the total given
database accounted by the factor. We can calculate the factor’s eigenvalue as the sum of its squared
factor loading for all the variables.
Step 1: Importing the required libraries

Python

import numpy as np
import pandas as pd
Step 2: Import the dataset and distribute it into X and y components for data analysis.
Python
dataset = pd.read_csv('wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values
Step 3: Splitting the dataset into the Training set and Test set
Python

from sklearn.model_selection import train_test_split

# 80/20 split (test_size assumed); random_state = 0 as in the original listing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 4: Doing the pre-processing on the training and test sets, such as fitting the StandardScaler.
Python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Step 5: Applying the PCA function to the training and test sets for analysis.
Python
from sklearn.decomposition import PCA

pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
Step 6: Fitting Logistic Regression to the training set.

Python

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
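To evaluate the model (a sketch continuing the steps above), predict the test set and compute the confusion matrix and accuracy:

Python

# Step 7 (sketch): predicting the test set results and evaluating the model
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))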
PRACTICAL-7
Design and evaluate a data model using Decision Trees.
Decision Tree
A Decision tree is a tree-like structure that represents a set of decisions and their possible
consequences. Each node in the tree represents a decision, and each branch represents an outcome
of that decision. The leaves of the tree represent the final decisions or predictions.
Decision trees are created by recursively partitioning the data into smaller and smaller subsets. At
each partition, the data is split based on a specific feature, and the split is made in a way that
maximizes the information gain.
Decision Tree
In a typical example, the decision tree is a flowchart-like tree structure that is used to make decisions. It
consists of a root node (e.g., WINDY), internal nodes (e.g., OUTLOOK, TEMPERATURE), which represent
tests on attributes, and leaf nodes, which represent the final decisions. The branches of the tree represent
the possible outcomes of the tests.
1. Root Node: The decision tree’s starting node, which stands for the complete dataset.
2. Branch Nodes: Internal nodes that represent decision points, where the data is split based on a
specific attribute.
3. Leaf Nodes: Final categorization or prediction-representing terminal nodes.
4. Decision Rules: Rules that govern the splitting of data at each branch node.
5. Attribute Selection: The process of choosing the most informative attribute for each split.
6. Splitting Criteria: Metrics like information gain, entropy, or the Gini Index are used to calculate
the optimal split.
For information gain, attributes are assumed to be categorical; for the Gini index, attributes are
assumed to be continuous.
1. Find the best attribute and place it on the root node of the tree.
2. Now, split the training set of the dataset into subsets. While making the subsets, make sure that
each subset of the training dataset has the same value for that attribute.
3. Repeat steps 1 and 2 on each subset until leaf nodes are found in all branches of the tree.
The Gini index and information gain are both used to select, from the n attributes of the dataset, which
attribute should be placed at the root node or at an internal node.
Gini index
Gini Index is a metric that measures how often a randomly chosen element would be
incorrectly identified. For a node whose classes occur with probabilities p_i, it is
Gini = 1 - Σ_i (p_i)^2.
Sklearn supports the “gini” criterion for the Gini Index and, by default, it takes the “gini” value.
Entropy
If a random variable x can take N different values, the i-th value x_i having probability p(x_i), we can
associate the following entropy with x:

H(x) = - Σ_{i=1..N} p(x_i) log2 p(x_i)

Entropy is the measure of uncertainty of a random variable; it characterizes the impurity
of an arbitrary collection of examples. The higher the entropy, the more the information
content.
Information Gain
Definition: Suppose S is a set of instances, A is an attribute, S_v is the subset of S with A = v, and
Values(A) is the set of all possible values of A. Then

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
The entropy typically changes when we use a node in a Python decision tree to partition the
training instances into smaller subsets. Information gain is a measure of this change in entropy.
Sklearn supports “entropy” criteria for Information Gain and if we want to use Information Gain
method in sklearn then we have to mention it explicitly.
Dataset Description:

The example uses the Balance Scale Weight & Distance Database from the UCI repository.

Attribute Information:
1. Class Name: 3 (L, B, R)
2. Left-Weight: 5 (1, 2, 3, 4, 5)
3. Left-Distance: 5 (1, 2, 3, 4, 5)
4. Right-Weight: 5 (1, 2, 3, 4, 5)
5. Right-Distance: 5 (1, 2, 3, 4, 5)

Class Distribution:
1. 46.08 percent are L
2. 07.84 percent are B
3. 46.08 percent are R
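No code listing survived for this practical, so below is a minimal sketch of training and evaluating decision trees on the Balance Scale data; the UCI file URL, the 70/30 split, and the tree hyper-parameters are assumptions for illustration rather than the manual's original values.

Python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# load the Balance Scale data (URL assumed; the file has no header row)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data'
data = pd.read_csv(url, header=None,
                   names=['Class', 'Left-Weight', 'Left-Distance', 'Right-Weight', 'Right-Distance'])

X = data.drop(columns=['Class'])
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

# train one tree with the Gini criterion and one with entropy (information gain) and compare them
for criterion in ['gini', 'entropy']:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=100, max_depth=3, min_samples_leaf=5)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(criterion, 'accuracy:', accuracy_score(y_test, y_pred))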
PRACTICAL-8
Design and evaluate a data model using Random Forest.
Random Forest Regression is a versatile machine-learning technique for predicting numerical values. It
combines the predictions of multiple decision trees to reduce overfitting and improve accuracy.
Python’s machine-learning libraries make it easy to implement and optimize this approach.
Ensemble Learning
Ensemble learning is a machine learning technique that combines the predictions from multiple
models to create a more accurate and stable prediction. It is an approach that leverages the collective
intelligence of multiple models to improve the overall performance of the learning system.
1. Bagging (Bootstrap Aggregating): This method involves training multiple models on random
subsets of the training data. The predictions from the individual models are then combined,
typically by averaging.
2. Boosting: This method involves training a sequence of models, where each subsequent model
focuses on the errors made by the previous model. The predictions are combined using a
weighted voting scheme.
3. Stacking: This method involves using the predictions from one set of models as input features for
another model. The final prediction is made by the second-level model.
Random Forest
A random forest is an ensemble learning method that combines the predictions from multiple
decision trees to produce a more accurate and stable prediction. It is a type of supervised learning
algorithm that can be used for both classification and regression tasks.
Every decision tree has high variance, but when we combine all of them in parallel then the resultant
variance is low as each decision tree gets perfectly trained on that particular sample data, and hence
the output doesn’t depend on one decision tree but on multiple decision trees. In the case of a
classification problem, the final output is taken by using the majority voting classifier. In the case of a
regression problem, the final output is the mean of all the outputs. This part is called Aggregation.
Random Forest Regression Model Working
Random Forest has multiple decision trees as base learning models. We randomly perform row
sampling and feature sampling from the dataset forming sample datasets for every model. This part is
called Bootstrap.
We need to approach the Random Forest regression technique like any other machine
learning technique.
Design a specific question or data and get the source to determine the required data.
Make sure the data is in an accessible format else convert it to the required format.
Identify all noticeable anomalies and missing data points, and handle them so that the required data
can be obtained.
Create a machine-learning model.
Now compare the performance metrics of both the test data and the predicted data from the
model.
If it doesn’t satisfy your expectations, you can try improving the model, updating your data, or using
another data modeling technique.
At this stage, you interpret the data you have gained and report accordingly.
We will be using a similar sample technique in the below example. Below is a step-by-step sample
implementation of Random Forest Regression, on the dataset that can be downloaded here-
https://bit.ly/417n3N5
Python libraries make it very easy for us to handle the data and perform typical and complex tasks
with a single line of code.
Pandas – This library helps to load the data frame in a 2D array format and has multiple functions
to perform analysis tasks in one go.
Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
Sklearn – This module contains multiple libraries having pre-implemented functions to perform
tasks from data preprocessing to model development and evaluation.
RandomForestRegressor – This is the regression model that is based upon the Random Forest
model or the ensemble learning that we will be using in this article using the sklearn library.
sklearn: This library is the core machine learning library in Python. It provides a wide range of
tools for preprocessing, modeling, evaluating, and deploying machine learning models.
LabelEncoder: This class is used to encode categorical data into numerical values.
KNNImputer: This class is used to impute missing values in a dataset using a k-nearest neighbors
approach.
train_test_split: This function is used to split a dataset into training and testing sets.
StandardScaler: This class is used to standardize features by removing the mean and scaling to
unit variance.
f1_score: This function is used to evaluate the performance of a classification model using the F1
score.
Python3
import pandas as pd
import numpy as np
import sklearn
import warnings
warnings.filterwarnings('ignore')
Now let’s load the dataset into a pandas data frame for better data handling, leveraging its handy
functions to perform complex tasks in one go.
Python3
df= pd.read_csv('Salaries.csv')
print(df)
Output: the rows of the data frame (Position, Level, Salary) are printed.
Here the .info() method provides a quick overview of the structure, data types, and memory usage of
the dataset.
Python3
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Position 10 non-null object
1 Level 10 non-null int64
2 Salary 10 non-null int64
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes
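The listing stops after exploring the data, so here is a minimal sketch of fitting and evaluating a Random Forest regressor on this Salaries data. Using Level as the single feature, the number of trees, and the example prediction at level 6.5 are illustrative assumptions.

Python3

# a sketch: predict Salary from Level with a Random Forest regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

X = df[['Level']].values   # Level is the numeric feature; Position is just a label for each level
y = df['Salary'].values

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

y_pred = regressor.predict(X)
print('R2 score:', r2_score(y, y_pred))
print('MSE:', mean_squared_error(y, y_pred))

# predict the salary for an intermediate level, e.g. 6.5
print(regressor.predict([[6.5]]))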
PRACTICAL-9
Compare the performance of all the above ML techniques on a similar data set using
matplotlib.
Why Matplotlib?
Developed using Python and NumPy arrays.
Matplotlib Workflow
Below is the high-level Matplotlib workflow:
0. Importing matplotlib
1. Generating/preparing the data
2. Establishing/setting up the plot (the figure and axes)
3. Plotting the data
4. Customizing the plot (title, labels, legend)
Importing Library and Plotting
In workflow we know one of the initial steps is to establish or create a plot (also known as
figure). Let’s see how we can do that.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# 1st method: the pyplot (state-based) interface, with some example data (illustrative values)
x = [1, 2, 3, 4]
y = [11, 22, 33, 44]
plt.plot(x, y);
If you notice, both the x and y axes have adjusted to the data. This is the pyplot (state-based) way of
plotting.
Also, the best practice is to use the object-oriented API rather than the pyplot API, since the pyplot API
is less flexible.
#2nd method
fig = plt.figure() # creates a figure
ax = fig.add_axes((1,1,1,1)) # adds some axes
ax.plot(x,y) # add some data
plt.show()
The 3rd method, shown below using plt.subplots(), is the recommended way to use the matplotlib object-oriented API.
# 0) Import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
# 1) Generate/prepare data
x=[2,4,6,8]
y=[5,10,15,20]
# 2) Establish/Setup plot
fig,ax = plt.subplots(figsize=(10,10)) #x and y dimensions of the plot (width,height)
# 3) Plot data
ax.plot(x,y)
# 4) Customize plot
ax.set(title= 'First Plot',
xlabel='x-axis',
ylabel='y-axis')
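Finally, to meet the practical's goal of comparing the earlier ML techniques on one dataset, the sketch below trains several classifiers on the iris data used in Practical 3 and plots their test accuracies as a matplotlib bar chart; the model choices, parameters, and the 70/30 split are illustrative assumptions.
# a sketch: compare several classifiers on the same dataset and plot their accuracies
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# 1) Prepare the data: one shared train/test split with feature scaling
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 2) Fit each model and record its accuracy on the same test set
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=7),
    'SVM': SVC(kernel='rbf'),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# 3) Plot and 4) customize: bar chart of the test accuracies
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(list(scores.keys()), list(scores.values()))
ax.set(title='Test accuracy of the ML techniques on the iris dataset',
       xlabel='Model', ylabel='Accuracy')
plt.show()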