ML Final Prac
Introduction:
Python provides many libraries for working in different domains, solving different problems, and building different solutions. Here we will study two of these libraries, NumPy and Pandas, which are widely used today for understanding the concepts of Machine Learning.
Datatypes in Python:
1. Integer or int: Eg: 1,2,3,4,5
2. String: “Hello World”, “Machine 123”
3. Float: 0.1, 0.11, 0.111
4. Tuple: (12, 34, “Power scheme of India”, 7.41)
5. Dictionary:
data = {'Name': ['John', 'Mary', 'Peter', 'Tom'],
'Age': [25, 30, 35, 40],
'Country': ['USA', 'Canada', 'Australia', 'UK']}
6. List: My_list = [1, 2, 3, "hello", True]
NumPy
NumPy is a library for numerical computing in Python. It provides fast and efficient
multidimensional array operations. Arrays are the main data structure in NumPy.
They can be created using the array() function.
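A minimal sketch of creating and using arrays (the values here are illustrative):
import numpy as np

# Create a one-dimensional array from a Python list
arr = np.array([1, 2, 3, 4, 5])

# Create a two-dimensional (2 x 3) array
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Operations apply element-wise, without explicit loops
print(arr * 2)        # [ 2  4  6  8 10]
print(matrix.shape)   # (2, 3)
print(arr.mean())     # 3.0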
Pandas
Pandas is a library for data manipulation and analysis in Python. It provides a fast
and efficient way to work with structured data. Data frames are the main data
structure in Pandas. They can be created using the DataFrame() function.
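A short sketch of building a data frame from the dictionary shown in the Datatypes section:
import pandas as pd

data = {'Name': ['John', 'Mary', 'Peter', 'Tom'],
        'Age': [25, 30, 35, 40],
        'Country': ['USA', 'Canada', 'Australia', 'UK']}

# Each key of the dictionary becomes a column of the data frame
df = pd.DataFrame(data)
print(df.head())         # first rows of the data frame
print(df['Age'].mean())  # 32.5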
Exercises:
1. Create a Python function that takes two integers as inputs and returns their
sum.
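One possible solution sketch (the function name add_two_numbers is chosen here only for illustration):
def add_two_numbers(a, b):
    # Return the sum of the two integers
    return a + b

print(add_two_numbers(3, 4))  # 7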
import numpy as np
import pandas as pd
5. Calculate the median age of the Pandas data frame from Exercise 3
import pandas as pd
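A sketch of the median calculation, assuming the Exercise 3 data frame was built from the dictionary shown in the Datatypes section:
import pandas as pd

data = {'Name': ['John', 'Mary', 'Peter', 'Tom'],
        'Age': [25, 30, 35, 40],
        'Country': ['USA', 'Canada', 'Australia', 'UK']}
df = pd.DataFrame(data)

# Median of the Age column
print(df['Age'].median())  # 32.5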
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
df = fetch_california_housing()
dataset = pd.DataFrame(df.data)
dataset.columns = df.feature_names
X = dataset
Y = df.target
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
# Standardizing the data using the StandardScaler class
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# importing the training model and cross-validation score to evaluate the accuracy of the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
#making objects
reg = LinearRegression()
reg.fit(X_train, Y_train)
#check cross value score
mean_score = cross_val_score(reg, X_train, Y_train,
scoring='neg_mean_absolute_error', cv=10)
np.mean(mean_score)
#prediction
reg_pred = reg.predict(X_test)
#plotting displot for cv = 10 and scoring = 'neg_mean_absolute_error'
import seaborn as sns
sns.displot(Y_test - reg_pred, kind='kde')
The same pipeline is repeated with a 40% test split:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
df = fetch_california_housing()
dataset = pd.DataFrame(df.data)
dataset.columns = df.feature_names
X = dataset
Y = df.target
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4)
# Standardizing the data using the StandardScaler class
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# importing the training model and cross-validation score to evaluate the accuracy of the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
#making objects
reg = LinearRegression()
reg.fit(X_train, Y_train)
#check cross value score
mean_score = cross_val_score(reg, X_train, Y_train,
scoring='neg_mean_absolute_error', cv=10)
np.mean(mean_score)
#prediction
reg_pred = reg.predict(X_test)
#plotting displot for cv = 10 and scoring = 'neg_mean_absolute_error'
import seaborn as sns
sns.displot(Y_test - reg_pred, kind='kde')
Implementation code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=sns.load_dataset('iris')
df.head()
df
df.species.unique()
df.isnull().sum()
sns.scatterplot(x='petal_length', y='species', data=df,
hue='species')
plt.show()
sns.scatterplot(x='petal_width', y='species', data=df,
hue='species')
plt.show()
sns.scatterplot(x='sepal_width', y='species', data=df,
hue='species')
plt.show()
sns.scatterplot(x='sepal_length', y='species', data=df,
hue='species')
plt.show()
df[df['species']!='setosa']
df=df[df['species']!='setosa']
df
df['species']=df['species'].map({'versicolor':0,'virginica':1})
X=df.iloc[:,:-1]
y=df.iloc[:,-1]
X
y
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.25,
random_state=42)
x_train
x_test
len(x_test)
len(X)
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
from sklearn.metrics import accuracy_score, classification_report
score=accuracy_score(y_test,y_pred)
print(score)
print(classification_report(y_test,y_pred))
LAB ASSIGNMENT – 5
1.
2.
We import the breast cancer dataset from scikit-learn and store it in cancer_data.
cancer_data is a dictionary like object.
The keys() method returns a list of all the keys in the cancer_data object (a loading sketch follows the list below). These keys represent different parts of the dataset, such as:
● 'data': The features of the dataset (i.e., the input variables).
● 'target': The labels (i.e., the output variable, indicating whether the tumor is
malignant or benign).
● 'target_names': The names corresponding to the labels.
● 'feature_names': The names of the features.
● 'DESCR': A description of the dataset.
● 'filename': The path to the dataset file (if available).
● 'frame': DataFrame representation of the data (if pandas is available).
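A minimal sketch of the loading step described above (the variable name cancer_data follows the text; the notebook's exact code is not reproduced here):
from sklearn.datasets import load_breast_cancer

# Load the dataset; the result behaves like a dictionary
cancer_data = load_breast_cancer()

# Inspect the available keys described in the list above
print(cancer_data.keys())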
3.
This is a description of the dataset.
Dataset Characteristics
Characteristic           Details
Number of instances      569
There are 30 numerical features. These features are grouped into 3 categories
for each of the 10 original features.
● Mean
● Standard Error
● Worst (Largest Value)
The 10 original features are :
● Radius
● Texture
● Perimeter
● Area
● Smoothness
● Compactness
● Concavity
● Concave Points
● Symmetry
● Fractal Dimension
4.
5.
scaled_data will store the NumPy array containing all the scaled
features.
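A sketch of how scaled_data is typically produced with StandardScaler (assuming the features come from cancer_data.data):
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

cancer_data = load_breast_cancer()

# Fit the scaler on the features and transform them in one step
scaler = StandardScaler()
scaled_data = scaler.fit_transform(cancer_data.data)

print(scaled_data.shape)  # (569, 30): 569 samples, 30 standardized features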
6.
7.
First, we set the figure size to 8 inches wide and 6 inches tall
We represent the First Principal Component on x axis and Second
Principal Component on y axis of the scatter plot.
cmap = 'plasma' : The cmap parameter specifies the colormap to be used. The
'plasma' colormap will apply a gradient of colors to the points based on their class.
This plot is used to visualize how well the PCA transformation has separated the
different classes in the data. If the points form distinct clusters, it indicates that
PCA has captured the essential structure of the data.
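A sketch of the PCA projection and scatter plot described above, assuming the standardized breast cancer features from the previous step:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer_data = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer_data.data)

# Project the 30 standardized features onto 2 principal components
pca = PCA(n_components=2)
x_pca = pca.fit_transform(scaled_data)

# First component on the x axis, second on the y axis, colored by class
plt.figure(figsize=(8, 6))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=cancer_data.target, cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()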
2. We will retain all 30 dimensions to see how PCA performs without dimension reduction.
We see that the first few principal components capture most of the
variance. The later components capture much less, indicating that they
contribute less to the overall structure of the data.
Hence, we can understand why reducing dimensions (to 2 or 3) might still retain
most of the information.
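A sketch of inspecting the explained variance when all 30 components are kept:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled_data = StandardScaler().fit_transform(load_breast_cancer().data)

# Keep all 30 components and check how much variance each one explains
pca_full = PCA(n_components=30)
pca_full.fit(scaled_data)

print(pca_full.explained_variance_ratio_)              # share of variance per component
print(np.cumsum(pca_full.explained_variance_ratio_))   # cumulative share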
1. We import the wine dataset from scikit-learn and store it in wine_data.
wine_data is a dictionary like object.
The keys() method returns a list of all the keys in the wine_data object. These
keys represent different parts of the dataset.
2. We store data (features for all samples in a 2D array) in x.
We store target label for each sample in y.
We then get the unique classes using np.unique(y)
There are 3 unique classes. There are 13 features. There are 178 samples
4. We get an array in which each value represents the proportion of the total variance that is
explained by each linear discriminant.
It’s a measure of how much information (or variance) about class separability
is retained in the new dimensions
5. First, we set the figure size to 8 inches wide and 6 inches tall.
We represent the First Principal Component on x axis and Second
Principal Component on y axis of the scatter plot.
We have already discussed ‘c’ and ‘cmap’ parameters in PCA.
edgecolors = ‘y’ adds a yellow edge around each point in the scatter
plot.
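A sketch of the LDA step described above (with 3 classes, LDA gives at most 2 linear discriminants):
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

wine_data = load_wine()
x, y = wine_data.data, wine_data.target

# Project the 13 features onto 2 linear discriminants
lda = LinearDiscriminantAnalysis(n_components=2)
x_lda = lda.fit_transform(x, y)

# Proportion of class-separating variance captured by each discriminant
print(lda.explained_variance_ratio_)

# Scatter plot of the two discriminants, colored by class, with yellow edges
plt.figure(figsize=(8, 6))
plt.scatter(x_lda[:, 0], x_lda[:, 1], c=y, cmap='plasma', edgecolors='y')
plt.show()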
load_digits contains images of handwritten digits (0-9) that are 8*8 pixels in size. The name
‘pixel_i_j’ refers to the pixel in the ith row and jth column. Since the images are 8*8, there are
64 pixels.
Each pixel has a value that indicates the intensity (brightness) of that pixel, typically
on a scale from 0 (black) to 16 (white).
Hence, the pixels are the feature names.
We store the features in x and the target labels in y.
There are 10 unique classes : 0,1,2,3,4,5,6,7,8,9.
Each class represents a digit from 0 to 9.
There are 64 features i.e pixels.
There are 1797 samples.
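A sketch of loading the digits dataset and checking the properties described above (feature_names is available in recent scikit-learn versions):
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
x, y = digits.data, digits.target

print(digits.images[0].shape)    # (8, 8): each image is 8 x 8 pixels
print(digits.feature_names[:3])  # pixel names such as 'pixel_0_0', 'pixel_0_1', ...
print(np.unique(y))              # the 10 classes: digits 0 through 9
print(x.shape)                   # (1797, 64): 1797 samples, 64 pixel features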
We are going to perform t-distributed stochastic neighbour embedding on the Wine Dataset
and the Breast Cancer Dataset available in scikit-learn.
1.
Hence, we get a 2D array where rows represent data points and the 2
columns represent the reduced dimensions.
5.
Here, we can see that there is better separation between clusters since all the features contribute
equally to the distance calculations in t-SNE.
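A sketch of the t-SNE step on the scaled Wine features (the Breast Cancer Dataset can be treated the same way):
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

wine_data = load_wine()
x, y = wine_data.data, wine_data.target

# Scale first so every feature contributes equally to the pairwise distances
x_scaled = StandardScaler().fit_transform(x)

# Embed the 13-dimensional samples into 2 dimensions
x_tsne = TSNE(n_components=2, random_state=42).fit_transform(x_scaled)
print(x_tsne.shape)  # (178, 2): one 2-D point per sample

plt.figure(figsize=(8, 6))
plt.scatter(x_tsne[:, 0], x_tsne[:, 1], c=y, cmap='plasma')
plt.show()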
3. Should the mean of scaled_data be 0 above ? Why/ Why not ? If yes why is it not zero
above ?
Answer: Here, we are talking about the scaled data of the Breast Cancer Dataset. Ideally, the mean of scaled_data should be 0, so that each feature is centered around the origin, contributes equally to the model, and is easy to compare with the others since all features now share the same scale. In practice the printed mean is not exactly 0 but an extremely small number (on the order of 1e-16), because floating-point rounding errors remain after StandardScaler subtracts each feature's mean.
We are asked to apply a classification model on the reduced dimensions obtained after applying the three techniques (PCA, LDA and t-SNE).
Lab Activity – 6 ANN
What is an Artificial Neural Network ?
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial intelligence modeled after the brain. An artificial neural network is a computational network inspired by the biological neural networks that form the structure of the human brain. Just as the human brain has neurons interconnected with each other, an artificial neural network has neurons that are linked to each other across the various layers of the network.
UNDERSTANDINGS FROM THE NOTEBOOK
X (independent variables) :
The code first splits the data into training and test sets, then it standardizes
the training and test feature sets to ensure they are on the same scale for
better model performance.
APPLYING ANN
The code builds and compiles a basic neural network with an input layer,
two hidden layers, and an output layer for binary classification, using the
adam optimizer and binary cross-entropy loss.
This code trains using 33% of the data for validation, processes the data in
batches of 10 samples, and repeats the training process 50 times, with an
option for early stopping not shown here.
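A sketch matching the description above, assuming a Keras Sequential model and that a feature matrix X and binary labels y have already been loaded (the layer sizes and the 20% test split are illustrative; only the validation split, batch size and epoch count come from the text):
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data, then standardize the training and test features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Input layer, two hidden layers, and a sigmoid output for binary classification
model = tf.keras.Sequential([
    tf.keras.layers.Dense(11, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(7, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# 33% of the training data held out for validation, batches of 10 samples, 50 epochs
history = model.fit(X_train, y_train, validation_split=0.33, batch_size=10, epochs=50)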
This code plots a graph showing how the model's accuracy on both training
and validation data changes over each epoch of training.
This code plots a graph showing how the model's loss on both training and
validation data changes over each epoch of training.
This code predicts the results for the test data using the model and converts
the predictions to binary values (0 or 1) based on whether they are greater
than 0.5.
Accuracy score is 86.3%.
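A sketch of the prediction and thresholding step, continuing from the model above:
from sklearn.metrics import accuracy_score

# Sigmoid outputs are probabilities; threshold them at 0.5 to get 0/1 labels
y_prob = model.predict(X_test)
y_pred = (y_prob > 0.5).astype(int).ravel()

print(accuracy_score(y_test, y_pred))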
Exercise :
Q.1] How would we know at which epoch the accuracy is best. Find
a way (write/modify in the above code) so that epochs stop by
themselves when the model reaches peak accuracy.
Answer :
To stop training when the model reaches its best accuracy we can use a
feature called “Early Stopping”. This feature watches the model's
performance on a validation set. It prevents wasting time on more epochs
when the model is already performing well.
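A sketch of how Early Stopping could be added to the training call above (the patience value is illustrative):
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once validation accuracy stops improving and keep the best weights
early_stop = EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True)

history = model.fit(X_train, y_train, validation_split=0.33, batch_size=10,
                    epochs=50, callbacks=[early_stop])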
Q.2] How can we know the weights that are used by the above
models in their different iterations?
Answer
To see the weights used by the model at different points in training (a sketch follows this list):
1. Save Weights: Save the model's weights to files during training.
2. Load Weights: Load these files to check the weights.
3. Access Weights: Use functions in your machine learning library to
view or extract weights from the model.
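A sketch of the three steps listed above using the Keras APIs (file names are illustrative; the layer loop assumes Dense layers as in the sketch earlier):
from tensorflow.keras.callbacks import ModelCheckpoint

# 1. Save the model's weights to a file after every epoch during training
checkpoint = ModelCheckpoint('weights_epoch_{epoch:02d}.weights.h5', save_weights_only=True)
model.fit(X_train, y_train, validation_split=0.33, batch_size=10,
          epochs=50, callbacks=[checkpoint])

# 2. Load a saved set of weights back into the model
model.load_weights('weights_epoch_10.weights.h5')

# 3. Access the current weight matrices and bias vectors layer by layer
for layer in model.layers:
    weights, biases = layer.get_weights()
    print(layer.name, weights.shape, biases.shape)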
Q.3] What is the confusion matrix? What are the values in the
confusion matrix? Find the accuracy analytically by the formula of
accuracy given below and compare it with the accuracy given by
ANN.
ANSWER
A confusion matrix is a table used to evaluate the performance of a classification model. It shows how well the model's predictions match the actual labels. For binary classification it contains four values: true positives (TP) and true negatives (TN), the correctly predicted positive and negative samples, and false positives (FP) and false negatives (FN), the misclassified ones.
Accuracy = ( TP + TN ) / ( TP + TN + FN + FP )
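A sketch of computing the confusion matrix for the test predictions and checking the accuracy against the formula above (y_test and y_pred as in the prediction sketch earlier):
from sklearn.metrics import confusion_matrix

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy computed analytically from the matrix entries
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(cm)
print(manual_accuracy)  # should match the accuracy reported by the ANN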
Q.4] Why do we need an activation function? Name some
Answer:
An activation function introduces non-linearity into the network; without it, every layer would perform only a linear transformation and the whole network would collapse into a single linear model, unable to learn complex patterns. Commonly used activation functions are ReLU, sigmoid, tanh and softmax.
(The optimizer used above, Adam or Adaptive Moment Estimation, is an optimization algorithm for gradient descent. It is efficient when working with large problems involving a lot of data or parameters, and it requires relatively little memory.)
Lab Activity- 7 CNN
What is CNN ?
The pixel values in images typically range from 0 to 255 (as they are 8-bit RGB images).
To normalize them between 0 and 1, each pixel value is divided by 255.0.
Loading the first 25 images from the dataset along with their class names.
model.add(layers.Dense(64, activation='relu')):
● Adds a fully connected layer with 64 neurons and ReLU activation. This layer
learns patterns from the flattened features.
model.add(layers.Dense(10)):
● Adds a final output layer with 10 neurons (one for each class in CIFAR-10). The
raw outputs (logits) from this layer are typically passed to a softmax function to
predict class probabilities.
Training the model
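A sketch of a typical CIFAR-10 CNN matching the layers quoted above (the convolution and pooling blocks and the training settings are assumptions, since only the Dense layers appear in the text):
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Load CIFAR-10 and scale the pixel values from 0-255 down to 0-1
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

# Convolution and pooling blocks, then the Dense layers described above
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

# The last layer outputs logits, so the loss is told to expect logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train while validating on the test images
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))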
Questions
Q.1] An input image has been converted into a matrix of size 12 X 12 along with a filter
of size 3 X 3 with a Stride of 1. Determine the size of the convoluted matrix.
Answer:
With no padding and a stride of 1, the output size is (12 − 3)/1 + 1 = 10, so the convoluted matrix is 10 x 10.
Q.2] In the above question do you think there will be a loss of information?
Why / Why not ? What measures can we take to prevent that?
Answer:
Yes, there will be a loss of information because no padding is applied: the output shrinks from 12 x 12 to 10 x 10, and the pixels at the edges of the input are covered by fewer filter positions, so edge information is under-represented.
Measures to prevent loss:
1. Use Padding: Add zero-padding around the input matrix to maintain the input
size. This ensures that the edges of the input are considered during convolution.
2. Use larger filters or smaller strides: Smaller strides or larger filters capture more
detailed patterns.
Q.3] Explain the significance of the RELU Activation function in the Convolution Neural
Network.
Answer:
1. Non-linearity: ReLU introduces non-linearity into the model, which allows the
network to learn complex patterns. Without this, the CNN would behave like a
linear model, limiting its learning capability.
2. Computational efficiency: ReLU is simple to compute (it just thresholds values
at zero), which speeds up training compared to sigmoid or tanh.
3. Sparse Activation: ReLU outputs zero for all negative inputs, which introduces
sparsity, making the model more efficient and potentially less prone to
overfitting.
Q.4] What is the difference between a convolution layer and a pooling layer?
Answer:
1. Operation:
Convolution Layer: Slides a set of learnable filters over the input and computes
dot products between each filter and the local region it covers, producing
feature maps.
Pooling Layer: Takes small regions of the feature map (like 2x2 or 3x3) and
applies a down-sampling operation, such as selecting the maximum value (max
pooling) or averaging (average pooling).
2. Learning:
Convolution Layer: The weights of the filters are learned during the training
process.
Pooling Layer: No learning takes place; it's a fixed operation used to reduce the
size of the feature map.
3. Purpose:
Convolution Layer: Extracts features such as edges, textures and shapes from
the input.
Pooling Layer: Helps in reducing the feature map size, making the network more
computationally efficient and reducing overfitting.
Q.5] A formula is given for evaluating the size of the convoluted matrix, but it precludes
stride and padding. Give a general formula for computing size of convoluted matrix,
which takes into account stride and padding also.
Answer:
Output size = ((N − F + 2P) / S) + 1
where N is the input size, F is the filter size, P is the padding and S is the stride. For example, for the 12 x 12 input of Q.1 with a 3 x 3 filter, stride 1 and no padding, this gives ((12 − 3 + 0)/1) + 1 = 10.
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
# Classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)
Output:-
Q. Why we are getting 3X3 confusion matrix rather than 2X2?
In a confusion matrix, the dimensions are determined by the
number of classes in the target variable.
For the Wine dataset, there are 3 classes representing different
types of wine, typically labeled as 0, 1, and 2. When you apply a
classifier to this dataset, the confusion matrix will reflect this by
being of size 3 x 3, corresponding to the three different classes.
Explanation of a 3x3 Confusion Matrix
If your confusion matrix is 3x3, it means:
● Rows represent the actual classes of the wine samples.
● Columns represent the predicted classes by the classifier.
A 3x3 matrix looks like this (rows are actual classes, columns are predicted classes, and the diagonal entries are the correctly classified samples):
               Predicted 0   Predicted 1   Predicted 2
Actual 0           n00           n01           n02
Actual 1           n10           n11           n12
Actual 2           n20           n21           n22
# Load dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
data = load_iris()
X, y = data.data, data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
Output:-