
Machine Learning: Classification, Regression and Clustering

❖ Machine learning (ML) is a subset of artificial intelligence (AI) that allows systems to learn and improve without being explicitly programmed.
❖ Machine learning uses statistical techniques to enable computers to learn and make decisions.
❖ It is predicated on the idea that computers can learn from data, spot patterns and make judgments with little assistance from humans.
Difference between Traditional Programming and Machine Learning
❖ In traditional programming, we feed input data and a well-written, tested program into a machine to generate output.
❖ In machine learning, input data along with the corresponding output is fed into the machine during the learning phase, and it works out a program (a model) for itself.

How Do Machines Learn?


1. Data Collection
ML starts with data—this can be numbers, text, images, or
anything else. The more high-quality data, the better the model
learns.
2. Feature Extraction & Preprocessing
Raw data is often messy. It needs to be cleaned and transformed
into a format that ML models can understand.
Important characteristics (features) are extracted from the data.
For example, in lung cancer prediction, features could be tumor
size, age, and smoking history.
3. Choosing a Model
Different algorithms (models) are used depending on the task, for example classification, regression, or clustering algorithms.

4. Training the Model


The model learns patterns by adjusting internal parameters using
mathematical optimization techniques.
5. Testing & Evaluation
The model is tested on new, unseen data to measure
performance.
Metrics like accuracy, precision, recall, F1-score (for classification)
or mean squared error (MSE) (for regression) help determine how
well the model performs.
6. Deployment & Prediction
Once trained, the model can be used to make predictions on
real-world data.
It can be deployed in apps, websites, or medical tools to assist
decision-making.
7. Improvement & Retraining
Over time, as new data becomes available, the model can be
retrained to improve its accuracy.
Fine-tuning or choosing a different model can help if the
performance is not satisfactory.
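The seven steps above can be condensed into a short illustrative sketch; the Iris dataset and the k-NN model used here are assumptions chosen only to make the pipeline concrete, not choices prescribed by this document:

# Minimal sketch of steps 1-6: data, split, model choice, training, prediction, evaluation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                    # 1-2: features are already clean and numeric
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = KNeighborsClassifier(n_neighbors=3)           # 3: choose a model
model.fit(X_train, y_train)                           # 4: train (fit internal parameters)
y_pred = model.predict(X_test)                        # 6: predict on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))    # 5: evaluate on the held-out test set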
Machine Learning Applications
(The original table of popular machine-learning applications is not reproduced here; representative applications are listed under the learning types below.)

Types of Machine Learning


There are several types of machine learning, each with special
characteristics and applications. Some of the main types of
machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
Supervised learning is when a model is trained on a labeled dataset. A labeled dataset contains both input and output parameters, so supervised learning algorithms learn to map inputs to the correct outputs. Both the training and validation datasets are labeled.

Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed a dataset of labeled dog and cat images to the algorithm, the machine will learn to classify a dog or a cat from these labeled images. When we input new dog or cat images that it has never seen before, it uses what it has learned to predict whether the image is a dog or a cat. This is how supervised learning works, and this particular task is image classification.
There are two main categories of supervised learning that are
mentioned below:
● Classification
● Regression
Classification
Classification deals with predicting categorical target variables, which represent discrete classes or labels. For instance, classifying emails as spam or not spam, or predicting whether a patient has a high risk of heart disease. Classification algorithms learn to map the input features to one of the predefined classes (see the sketch after the algorithm list below).
Here are some classification algorithms:
● Logistic Regression
● Support Vector Machine
● Random Forest
● Decision Tree
● K-Nearest Neighbors (KNN)
● Naive Bayes
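A minimal sketch of a classifier in scikit-learn follows; the breast cancer dataset and logistic regression are illustrative assumptions, not part of the original material:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled dataset: numeric features and binary class labels (malignant / benign)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000)   # larger max_iter helps this solver converge
clf.fit(X_train, y_train)                 # learn a mapping from features to classes
print(clf.predict(X_test[:5]))            # predicted class labels for five unseen samples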
Regression
Regression, on the other hand, deals with predicting continuous target variables, which represent numerical values. For example, predicting the price of a house based on its size, location, and amenities, or forecasting the sales of a product. Regression algorithms learn to map the input features to a continuous numerical value (see the sketch after the algorithm list below).
Here are some regression algorithms:
● Linear Regression
● Polynomial Regression
● Ridge Regression
● Lasso Regression
● Decision Tree
● Random Forest
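Similarly, a minimal regression sketch; the house-size numbers below are made-up illustrative values:

import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[600], [800], [1000], [1200], [1500]])   # feature: size in square feet
prices = np.array([30, 38, 47, 55, 70])                    # target: price (illustrative units)

reg = LinearRegression().fit(sizes, prices)   # learn slope and intercept
print(reg.predict([[1100]]))                  # predicted price for an unseen size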
Advantages of Supervised Machine Learning
● Supervised learning models can achieve high accuracy because they are trained on labeled data.
● The decision-making process of supervised learning models is often interpretable.
● Pre-trained supervised models can often be reused, saving time and resources compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
● It may struggle with unseen or unexpected patterns that are not present in the training data.
● It can be time-consuming and costly because it relies on labeled data.
● It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications,
including:
● Image classification: Identify objects, faces, and other features
in images.
● Natural language processing: Extract information from text,
such as sentiment, entities, and relationships.
● Speech recognition: Convert spoken language into text.
● Recommendation systems: Make personalized
recommendations to users.
● Predictive analytics: Predict outcomes, such as sales, customer
churn, and stock prices.
● Medical diagnosis: Detect diseases and other medical conditions.
● Fraud detection: Identify fraudulent transactions.
● Autonomous vehicles: Recognize and respond to objects in the
environment.
● Email spam detection: Classify emails as spam or not spam.
● Quality control in manufacturing: Inspect products for defects.
● Credit scoring: Assess the risk of a borrower defaulting on a
loan.
● Gaming: Recognize characters, analyze player behavior, and
create NPCs.
● Customer support: Automate customer support tasks.
● Weather forecasting: Make predictions for temperature,
precipitation, and other meteorological parameters.
● Sports analytics: Analyze player performance, make game
predictions, and optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning, unsupervised learning doesn't involve providing the algorithm with labeled target outputs. The primary goal of unsupervised learning is often to discover hidden patterns, similarities, or clusters within the data, which can then be used for purposes such as data exploration, visualization, dimensionality reduction, and more.

Example: Consider a dataset that contains information about purchases made from a shop. Through clustering, the algorithm can group customers with similar purchasing behavior, revealing customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers.
There are four main categories of unsupervised learning that are
mentioned below:
● Clustering
● Association
● Dimensionality Reduction
● Anomaly Detection
Clustering
Clustering is the process of grouping data points into clusters
based on their similarity. This technique is useful for identifying
patterns and relationships in data without the need for labeled
examples.
Here are some clustering algorithms:
● K-Means Clustering algorithm
● Mean-Shift algorithm
● DBSCAN algorithm
Related unsupervised techniques such as Principal Component Analysis and Independent Component Analysis perform dimensionality reduction rather than clustering.
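To make the idea concrete, here is a brief hedged sketch of k-means clustering on synthetic, unlabeled data (a fuller example on the Iris dataset appears later in this document):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)   # unlabeled points in three loose groups
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)     # cluster index assigned to each point, with no labels provided
print(labels[:10])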
Advantages of Unsupervised Machine Learning
● It helps to discover hidden patterns and various relationships within the data.
● It is used for tasks such as customer segmentation, anomaly detection, and data exploration.
● It does not require labeled data, which reduces the effort of data labeling.
● It offers techniques such as autoencoders and dimensionality reduction that can extract meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
● Without labels, it may be difficult to assess the quality of the model's output.
● Clusters may not be clearly interpretable and may lack meaningful interpretations.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
● Clustering: Group similar data points into clusters.
● Anomaly detection: Identify outliers or anomalies in data.
● Dimensionality reduction: Reduce the dimensionality of data
while preserving its essential information.
● Recommendation systems: Suggest products, movies, or
content to users based on their historical behavior or preferences.
● Topic modeling: Discover latent topics within a collection of
documents.
● Density estimation: Estimate the probability density function of
data.
● Image and video compression: Reduce the amount of storage
required for multimedia content.
● Data preprocessing: Help with data preprocessing tasks such as
data cleaning, imputation of missing values, and data scaling.
● Market basket analysis: Discover associations between
products.
● Genomic data analysis: Identify patterns or group genes with
similar expression profiles.
● Image segmentation: Segment images into meaningful regions.
● Community detection in social networks: Identify communities
or groups of individuals with similar interests or connections.
● Customer behavior analysis: Uncover patterns and insights for
better marketing and product recommendations.
● Content recommendation: Classify and tag content to make it
easier to recommend similar items to users.
● Exploratory data analysis (EDA): Explore data and gain insights
before defining specific tasks.

Scikit-Learn: A Powerful Python Library for Machine Learning

Scikit-Learn (or sklearn) is a popular open-source machine learning library in Python. It provides simple, efficient tools for data mining and machine learning and is built on top of NumPy, SciPy, and matplotlib.

Datasets Bundled with Scikit-Learn

Scikit-learn ships with several small bundled datasets (such as Iris, Digits, Wine and Breast Cancer; the original table listing them is not reproduced here). It also provides capabilities for loading datasets from other sources, such as the 20,000+ datasets available at openml.org.
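For example, a bundled dataset loads locally while an openml.org dataset is fetched over the network; the 'mnist_784' name below is a commonly used openml dataset chosen here purely for illustration:

from sklearn.datasets import load_digits, fetch_openml

digits = load_digits()                 # bundled: ships with scikit-learn, no download needed
print(digits.data.shape)               # (1797, 64)

mnist = fetch_openml('mnist_784', version=1, as_frame=False)   # downloaded from openml.org
print(mnist.data.shape)                # (70000, 784)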


The custom dataset used in the KNN example below:

    Name     Aptitude_Score  Communication_Score  Class
0   Karuna   2               5.0                  Speaker
1   Bhuvna   2               6.0                  Speaker
2   Parimal  3               5.5                  Speaker
3   Jani     4               7.0                  Speaker
4   Bobby    5               3.0                  Intel
5   Ravi     6               2.0                  Intel
6   Gouri    6               4.0                  Intel
7   Parul    7               2.5                  Intel
8   Govind   8               3.0                  Intel
9   Susant   6               5.5                  Leader
10  Bharat   6               7.0                  Leader
11  Gaurav   7               6.0                  Leader
12  Dinesh   8               6.0                  Leader
13  Pradeep  9               7.0                  Leader

EVALUATION OF PERFORMANCE OF A MODEL
Implementation of the KNN Algorithm with a Custom Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
# Sample Dataset (Name, Aptitude Score, Communication Score, Class)
data = {
"Name": ["Karuna", "Bhuvna", "Parimal", "Jani", "Bobby",
"Ravi", "Gouri", "Parul", "Govind", "Susant",
"Bharat", "Gaurav", "Dinesh", "Pradeep"],
"Aptitude_Score": [2, 2, 3, 4, 5, 6, 6, 7, 8, 6, 6, 7, 8, 9],
"Communication_Score": [5, 6, 5.5,7, 3, 2, 4, 2.5, 3, 5.5, 7, 6,
6, 7],
"Class": ["Speaker", "Speaker", "Speaker", "Speaker", "Intel",
"Intel", "Intel", "Intel", "Intel", "Leader",
"Leader", "Leader", "Leader", "Leader"]
}

print(data)

Output:

{'Name': ['Karuna', 'Bhuvna', 'Parimal', 'Jani',


'Bobby', 'Ravi', 'Gouri', 'Parul', 'Govind', 'Susant',
'Bharat', 'Gaurav', 'Dinesh', 'Pradeep'],
'Aptitude_Score': [2, 2, 3, 4, 5, 6, 6, 7, 8, 6, 6, 7,
8, 9], 'Communication_Score': [5, 6, 5.5, 7, 3, 2, 4,
2.5, 3, 5.5, 7, 6, 6, 7], 'Class': ['Speaker',
'Speaker', 'Speaker', 'Speaker', 'Intel', 'Intel',
'Intel', 'Intel', 'Intel', 'Leader', 'Leader',
'Leader', 'Leader', 'Leader']}

# Convert to DataFrame
df = pd.DataFrame(data)
df
Output:

Name Aptitude_Score Communication_Score Class

0 Karuna 2 5.0 Speaker

1 Bhuvna 2 6.0 Speaker

2 Parimal 3 5.5 Speaker

3 Jani 4 7.0 Speaker

4 Bobby 5 3.0 Intel

5 Ravi 6 2.0 Intel

6 Gouri 6 4.0 Intel

7 Parul 7 2.5 Intel

8 Govind 8 3.0 Intel

9 Susant 6 5.5 Leader

10 Bharat 6 7.0 Leader

11 Gaurav 7 6.0 Leader

12 Dinesh 8 6.0 Leader

13 Pradeep 9 7.0 Leader

# Features (Aptitude & Communication), excluding Name


X = df[["Aptitude_Score", "Communication_Score"]]

y = df["Class"]

print(X.shape,y.shape)

Output:
(14, 2) (14,)

# Split Data into Training and Testing Sets (80% Train, 20% Test)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

Output:
(11, 2) (11,) (3, 2) (3,)

print(X_test)

Output:
Aptitude_Score Communication_Score
9 6 5.5
11 7 6.0
0 2 5.0

print(y_test)
Output:
9 Leader
11 Leader
0 Speaker
Name: Class, dtype: object

# Initialize KNN Model


knn = KNeighborsClassifier(n_neighbors=3) # k=3

# Train the Model


knn.fit(X_train, y_train)

# Make Predictions
y_pred = knn.predict(X_test)

print(y_pred)
Output:
['Leader' 'Leader' 'Speaker']
# Evaluate Model Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
Output:
Model Accuracy: 100.00%

# Display Classification Report


print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Output:
Classification Report:
precision recall f1-score support

Leader 1.00 1.00 1.00 2


Speaker 1.00 1.00 1.00 1

accuracy 1.00 3
macro avg 1.00 1.00 1.00 3
weighted avg 1.00 1.00 1.00 3
# Confusion Matrix
print("\nConfusion Matrix:")
# cm = confusion_matrix(y_test, y_pred)  # or your actual class labels
cm = confusion_matrix(y_test, y_pred, labels=['Speaker', 'Intel', 'Leader'])
print(cm)
Output:
Confusion Matrix:
[[1 0 0]
[0 0 0]
[0 0 2]]
# User Input for Prediction
aptitude = float(input("Enter Aptitude Score: "))
communication = float(input("Enter Communication Score: "))
# Predict Class
predicted_class = knn.predict([[aptitude, communication]])
print(f"Predicted Class: {predicted_class[0]}")
Output:
Enter Aptitude Score: 5
Enter Communication Score: 4.5
Predicted Class: Intel
Classification with k-Nearest Neighbors and the Digits Dataset
Our Approach
We’ll cover this case study over two sections. In this section, we’ll
begin with the basic
steps of a machine learning case study:
• Decide the data from which to train a model.
• Load and explore the data.
• Split the data for training and testing.
• Select and build the model.
• Train the model.
• Make predictions.
• Evaluate the results.
• Tune the model.
• Run several classification models to choose the best one(s).
Program:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

digits=datasets.load_digits()
print(digits.DESCR)
Output:
.. _digits_dataset:

Optical recognition of handwritten digits dataset


--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 1797


:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in
the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998

This is a copy of the test set of the UCI ML


hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits:


10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were


used to extract
normalized bitmaps of handwritten digits from a
preprinted form. From a
total of 43 people, 30 contributed to the training set
and different 13
to the test set. 32x32 bitmaps are divided into
nonoverlapping blocks of
4x4 and the number of on pixels are counted in each
block. This generates
an input matrix of 8x8 where each element is an
integer in the range
0..16. This reduces dimensionality and gives
invariance to small
distortions.

For info on NIST preprocessing routines, see M. D.


Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S.
A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition
System, NISTIR 5469,
1994.

.. dropdown:: References
- C. Kaynak (1995) Methods of Combining Multiple
Classifiers and Their
Applications to Handwritten Digit Recognition, MSc
Thesis, Institute of
Graduate Studies in Science and Engineering,
Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading
Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao
and A. Kai Qin.
Linear dimensionalityreduction using relevance
weighted LDA. School of
Electrical and Electronic Engineering Nanyang
Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin
Classification
Algorithm. NIPS. 2000.
X, y = digits.data, digits.target

# Step 2: Inspect the dataset


print(f"Shape of feature data (X): {X.shape}")
print(f"Shape of target data (y): {y.shape}")
print(f"Feature names: {digits.feature_names}")
#print(f"Target labels (first 10): {y[:10]}")
print(f"Target labels (first 10): {digits.target_names}")
Output:
Shape of feature data (X): (1797, 64)
Shape of target data (y): (1797,)
Feature names: ['pixel_0_0', 'pixel_0_1', 'pixel_0_2',
'pixel_0_3', 'pixel_0_4', 'pixel_0_5', 'pixel_0_6',
'pixel_0_7', 'pixel_1_0', 'pixel_1_1', 'pixel_1_2',
'pixel_1_3', 'pixel_1_4', 'pixel_1_5', 'pixel_1_6',
'pixel_1_7', 'pixel_2_0', 'pixel_2_1', 'pixel_2_2',
'pixel_2_3', 'pixel_2_4', 'pixel_2_5', 'pixel_2_6',
'pixel_2_7', 'pixel_3_0', 'pixel_3_1', 'pixel_3_2',
'pixel_3_3', 'pixel_3_4', 'pixel_3_5', 'pixel_3_6',
'pixel_3_7', 'pixel_4_0', 'pixel_4_1', 'pixel_4_2',
'pixel_4_3', 'pixel_4_4', 'pixel_4_5', 'pixel_4_6',
'pixel_4_7', 'pixel_5_0', 'pixel_5_1', 'pixel_5_2',
'pixel_5_3', 'pixel_5_4', 'pixel_5_5', 'pixel_5_6',
'pixel_5_7', 'pixel_6_0', 'pixel_6_1', 'pixel_6_2',
'pixel_6_3', 'pixel_6_4', 'pixel_6_5', 'pixel_6_6',
'pixel_6_7', 'pixel_7_0', 'pixel_7_1', 'pixel_7_2',
'pixel_7_3', 'pixel_7_4', 'pixel_7_5', 'pixel_7_6',
'pixel_7_7']
Target labels (first 10): [0 1 2 3 4 5 6 7 8 9]

df=pd.DataFrame(X,columns=digits.feature_names)
df
Output:
df1=pd.DataFrame(y,columns=['target'])
df1

Output:

      target
0          0
1          1
2          2
3          3
4          4
...      ...
1792       9
1793       0
1794       8
1795       9
1796       8

1797 rows × 1 columns

Visualizing the Data

digits.images.shape

Output:

(1797, 8, 8)
# Let us look at the first image, which is an 8x8 array of pixel intensities
digits.images[0]

Output:

array([[ 0., 0., 5., 13., 9., 1., 0., 0.],


[ 0., 0., 13., 15., 10., 15., 5., 0.],

[ 0., 3., 15., 2., 0., 11., 8., 0.],

[ 0., 4., 12., 0., 0., 8., 8., 0.],

[ 0., 5., 8., 0., 0., 9., 8., 0.],

[ 0., 4., 11., 0., 1., 12., 7., 0.],

[ 0., 2., 14., 5., 10., 12., 0., 0.],

[ 0., 0., 6., 13., 10., 0., 0., 0.]])


plt.imshow(digits.images[0], cmap='gray')
plt.title('Number:'+str(y[0]))
plt.show()

Output:
# set up the plots
figure, axes = plt.subplots(nrows=3, ncols=10, figsize=(15, 6))
for ax, image, number in zip(axes.ravel(), digits.images, digits.target):
    ax.axis('off')
    ax.imshow(image, cmap=plt.cm.gray_r)
    ax.set_title('Number: ' + str(number))

Output:
Split the data into training and testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99, stratify=y)

print(f"Shape of X_train: {X_train.shape}")

print(f"Shape of X_test: {X_test.shape}")

print(f"Shape of y_train: {y_train.shape}")

print(f"Shape of y_test: {y_test.shape}")

Output:

Shape of X_train: (1257, 64)

Shape of X_test: (540, 64)

Shape of y_train: (1257,)

Shape of y_test: (540,)

Fit the Model


from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train,y_train)

y_pred=knn.predict(X_test)

Performance Measure

from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_pred)

Output:

0.9851851851851852

from sklearn.metrics import classification_report

report=classification_report(y_test,y_pred)

print(report)

Output:
precision recall f1-score support

0 1.00 1.00 1.00 54


1 0.95 0.98 0.96 55
2 1.00 1.00 1.00 53
3 1.00 0.98 0.99 55
4 1.00 0.98 0.99 54
5 0.96 0.98 0.97 55
6 1.00 1.00 1.00 54
7 0.98 1.00 0.99 54
8 1.00 0.94 0.97 52
9 0.96 0.98 0.97 54

accuracy 0.99 540


macro avg 0.99 0.99 0.99 540
weighted avg 0.99 0.99 0.99 540

Evaluate using confusion matrix

from sklearn.metrics import confusion_matrix


cm=confusion_matrix(y_test,y_pred)
print(cm)

Output:

[[54 0 0 0 0 0 0 0 0 0]

[ 0 54 0 0 0 1 0 0 0 0]

[ 0 0 53 0 0 0 0 0 0 0]

[ 0 0 0 54 0 0 0 1 0 0]

[ 0 0 0 0 53 0 0 0 0 1]

[ 0 0 0 0 0 54 0 0 0 1]
[ 0 0 0 0 0 0 54 0 0 0]

[ 0 0 0 0 0 0 0 54 0 0]

[ 0 3 0 0 0 0 0 0 49 0]

[ 0 0 0 0 0 1 0 0 0 53]]

import seaborn as sns

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='nipy_spectral_r')

Output:
Each row represents one distinct class, that is, one of the digits 0–9. The columns within a row specify how many of the test samples were classified into each distinct class.

For example, row 0:

[54, 0, 0, 0, 0, 0, 0, 0, 0, 0]

represents the digit 0 class. The columns represent the ten possible target classes 0 through 9. Because we're working with digits, the classes (0–9) and the row and column index numbers (0–9) happen to match. According to row 0, 54 test samples were classified as the digit 0, and none of the test samples were misclassified as any of the digits 1 through 9. So 100% of the 0s were correctly predicted.

On the other hand, consider row 8, which represents the results for the digit 8: 49 samples were classified correctly, but 3 samples were misclassified as the digit 1, so only about 94% of the 8s were predicted correctly.
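The per-class recall values in the classification report can also be read straight off the confusion matrix; a short sketch continuing from the cm computed above:

import numpy as np

# Correct predictions sit on the diagonal; each row sums to that digit's support
per_class_recall = cm.diagonal() / cm.sum(axis=1)
for digit, recall in enumerate(per_class_recall):
    print(f'digit {digit}: recall = {recall:.2%}')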

For Testing

custom_image = np.array([
[0, 0, 5, 13, 15, 16, 14, 6],
[0, 0, 7, 18, 20, 19, 16, 7],
[0, 4, 15, 28, 30, 27, 23, 9],
[0, 8, 23, 37, 40, 37, 30, 12],
[0, 7, 21, 38, 42, 40, 32, 13],
[0, 4, 14, 30, 33, 29, 22, 8],
[0, 0, 7, 17, 19, 16, 11, 4],
[0, 0, 2, 9, 11, 10, 7, 3]
])
plt.imshow(custom_image, cmap='gray')
plt.title("Custom Digit")
plt.axis('off')
plt.show()

Output:
predicted_label = knn.predict(custom_image.reshape(1, -1))  # reshape custom_image before prediction

print(f"The predicted label for the test digit is: {predicted_label}")

Output:

The predicted label for the test digit is: [1]

Using K-Fold Cross-Validation

K-fold cross-validation method

The random sampling approach used in the holdout method has some issues:
1. With smaller datasets, it is difficult to distribute the data of every class proportionally between the training and test datasets.
2. A repeated holdout is sometimes used to ensure the randomness of the composed datasets.
❖ Several random holdouts are used to measure the model performance.
❖ In the end, the average of all performances is taken.
❖ Because multiple holdouts have been drawn, the training and test data (and validation data) contain representative data from all classes and resemble the original input data closely.
❖ This process of repeated holdout is the basis of the k-fold cross-validation technique.
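Before the full program, a minimal sketch of how k-fold splitting works; the tiny array and fold count here are assumptions for illustration only:

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(10).reshape(-1, 1)                   # ten tiny samples, one feature each
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each iteration holds out a different fifth of the data as the test fold
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print(f'fold {fold}: train={train_idx}, test={test_idx}')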
Program:
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Load digits dataset


digits = load_digits()
X, y = digits.data, digits.target

# Define K-Fold cross-validator


kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning for KNN


k_values = range(1, 21)

print("Tuning KNN with different k values:")


for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=kfold, scoring='accuracy')

# Note: these prints sit outside the loop, so they report the scores for the last k tried (k=20)
print(f'accuracy with different K: {scores}')
print(f'Mean accuracy: {scores.mean():.2%}')
Output:
Tuning KNN with different k values:
accuracy with different K: [0.97777778 0.98055556
0.95264624 0.97214485 0.97214485]
Mean accuracy: 97.11%

Running Multiple Models to Find the Best One


It’s difficult to know in advance which machine learning model(s)
will perform best for a given dataset, especially when they hide
the details of how they operate from their users.
This encourages you to run multiple models to determine which is
the best for a particular machine learning study.
Program:
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Load digits dataset


digits = load_digits()
X, y = digits.data, digits.target

# Define K-Fold cross-validator


k = 5 # You can change this to any number of splits you like
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Define the classifiers


classifiers = {
'KNN': KNeighborsClassifier(),
'SVC': SVC(),
'GaussianNB': GaussianNB()
}

# Evaluate each classifier using cross_val_score


for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=kf, scoring='accuracy')
    print(f"{name} average accuracy: {np.mean(scores):.4f}")
Output:
KNN average accuracy: 0.9861
SVC average accuracy: 0.9878
GaussianNB average accuracy: 0.8392

Hyperparameter Tuning
Earlier in this section, we mentioned that k in the k-nearest neighbors algorithm is a hyperparameter of the algorithm. Hyperparameters are set before using the algorithm to train your model. In real-world machine learning studies, you'll want to use hyperparameter tuning to choose hyperparameter values that produce the best possible predictions. To determine the best value for k in the kNN algorithm, try different values of k and then compare the estimator's performance with each. We can do this using techniques similar to comparing estimators. The following loop creates KNeighborsClassifiers with k values from 1 through 20 (odd k values are often preferred in kNN to avoid ties) and performs k-fold cross-validation on each. As you can see from the accuracy scores, the k value 1 produces the most accurate predictions for the Digits dataset. You can also see that accuracy tends to decrease for higher k values:
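scikit-learn can also automate this search with GridSearchCV; the following is an alternative sketch, not part of the original program:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
param_grid = {'n_neighbors': list(range(1, 21))}     # same candidate k values as the loop below
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(digits.data, digits.target)
print(grid.best_params_, grid.best_score_)           # best k and its mean cross-validated accuracy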
Program:
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Load digits dataset


digits = load_digits()
X, y = digits.data, digits.target
# Define K-Fold cross-validator
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning for KNN


k_values = range(1, 21)
knn_scores = []

print("Tuning KNN with different k values:")


for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=kfold, scoring='accuracy')
    avg_score = np.mean(scores)
    knn_scores.append(avg_score)
    print(f"k={k}, Accuracy={avg_score:.4f}")

# Find best k
best_k = k_values[np.argmax(knn_scores)]
print(f"\nBest k for KNN: {best_k} with accuracy
{max(knn_scores):.4f}")
Output:
Tuning KNN with different k values:
k=1, Accuracy=0.9878
k=2, Accuracy=0.9844
k=3, Accuracy=0.9866
k=4, Accuracy=0.9816
k=5, Accuracy=0.9861
k=6, Accuracy=0.9850
k=7, Accuracy=0.9861
k=8, Accuracy=0.9833
k=9, Accuracy=0.9816
k=10, Accuracy=0.9822
k=11, Accuracy=0.9816
k=12, Accuracy=0.9805
k=13, Accuracy=0.9783
k=14, Accuracy=0.9772
k=15, Accuracy=0.9772
k=16, Accuracy=0.9761
k=17, Accuracy=0.9761
k=18, Accuracy=0.9750
k=19, Accuracy=0.9733
k=20, Accuracy=0.9711

Best k for KNN: 1 with accuracy 0.9878


Case Study: Time Series and Simple Linear Regression

A time series is just a sequence of data points collected or


recorded at specific time intervals. Think of it like a timeline of
measurements—one after another—usually taken at evenly
spaced times (like every second, minute, day, month, etc.).

Examples of Time Series:

●​ Daily temperature readings


●​ Monthly sales numbers for a store
●​ Hourly stock prices of a company
●​ Weekly number of steps recorded by a fitness tracker

Simple Linear Regression

Definition:

Simple Linear Regression is a statistical method used to model


the relationship between two variables:

Independent variable (X) — the predictor

Dependent variable (Y) — the response

It assumes a linear relationship, represented by the equation:

Y = a + bX + ϵ

Where:

●​ a is the intercept (value of Y when X = 0)​

●​ b is the slope (change in Y for one unit change in X)​

●​ ϵ is the error term

Goal:​
To find the best-fitting straight line (called the regression line)
that predicts Y from X.
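For reference, the slope and intercept of the best-fitting line come from the standard least-squares formulas; the numbers below are made up purely to show the calculation:

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)     # predictor
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])        # response

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)   # slope
a = Y.mean() - b * X.mean()                                                 # intercept
print(f'Y = {a:.2f} + {b:.2f} * X')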

Applications:
●​ Predicting sales from advertising spend
●​ Estimating house prices based on size
●​ Forecasting trends using historical data

Assumptions:

●​ Linear relationship between X and Y


●​ Errors are normally distributed
●​ Constant variance of errors (homoscedasticity)
●​ Independence of observations

Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

nyc = pd.read_csv('/content/ave_hi_nyc_jan_1895-2018.csv')

nyc.head()
Output:
     Date  Value  Anomaly
0  189501   34.2     -3.2
1  189601   34.7     -2.7
2  189701   35.5     -1.9
3  189801   39.6      2.2
4  189901   36.4     -1.0

# Rename the Value column to Temperature, then truncate each date to its year
nyc.columns = ['Date', 'Temperature', 'Anomaly']
nyc.Date = nyc.Date.floordiv(100)

nyc.tail()
Output:
     Date  Temperature  Anomaly
119  2014         35.5     -1.9
120  2015         36.1     -1.3
121  2016         40.8      3.4
122  2017         42.8      5.4
123  2018         38.7      1.3
# Splitting the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    nyc.Date.values.reshape(-1, 1), nyc.Temperature.values, test_size=0.25, random_state=11)

print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
Output:
(93, 1) (31, 1) (93,) (31,)
linear_regression = LinearRegression()
linear_regression.fit(X_train,y_train)

print(linear_regression.coef_)
print(linear_regression.intercept_)
Output:
[0.01939167]
-0.30779820252656265
y_pred = linear_regression.predict(X_test)
for p, e in zip(y_pred[::5], y_test[::5]):
print(f'y_pred: {p:.2f}, y_test: {e:.2f}')
Output:
y_pred: 37.86, y_test: 31.70
y_pred: 38.69, y_test: 34.80
y_pred: 37.00, y_test: 39.40
y_pred: 37.25, y_test: 45.70
y_pred: 38.05, y_test: 32.30
y_pred: 37.64, y_test: 33.80
y_pred: 36.94, y_test: 39.70

print(linear_regression.predict([[2019]]))
Output:
[38.84399018]
predict = (lambda x: linear_regression.coef_ * x +
linear_regression.intercept_)

predict(2019)
Output:
array([38.84399018])
predict(1890)
Output:
array([36.34246432])
predict(2018)
Output:
array([38.82459851])
Visualizing the Dataset with the Regression Line
import seaborn as sns
axes = sns.scatterplot(data=nyc, x='Date', y='Temperature', hue='Temperature',
                       palette='winter', legend=False)
axes.set_ylim(10, 70)  # scale the y-axis after the axes object has been created
x = np.array([min(nyc.Date.values), max(nyc.Date.values)])
y = predict(x)
line = plt.plot(x, y)
Output:
Multiple Linear Regression with the California Housing Dataset
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
import pandas as pd

print(california.DESCR)
output:
.. _california_housing_dataset:

California Housing dataset


--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive


attributes and the target

:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per
household
- AveBedrms average number of bedrooms per
household
- Population block group population
- AveOccup average number of household
members
- Latitude block group latitude
- Longitude block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.


https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for


California districts,
expressed in hundreds of thousands of dollars
($100,000).

This dataset was derived from the 1990 U.S. census,


using one row per census
block group. A block group is the smallest
geographical unit for which the U.S.
Census Bureau publishes sample data (a block group
typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a


home. Since the average
number of rooms and bedrooms in this dataset are
provided per household, these
columns may take surprisingly large values for block
groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the


:func:`sklearn.datasets.fetch_california_housing`
function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial


Autoregressions,
Statistics and Probability Letters, 33 (1997)
291-297
california_df = pd.DataFrame(california.data, columns=california.feature_names)
california_df['MedHouseValue'] = pd.Series(california.target)

print(california_df.head())
Output:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
Longitude MedHouseValue
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422

california_df.describe()
Output:
             MedInc      HouseAge      AveRooms     AveBedrms    Population      AveOccup      Latitude     Longitude  MedHouseValue
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   20640.000000
mean       3.870671     28.639486      5.429000      1.096675   1425.476744      3.070655     35.631861   -119.569704       2.068558
std        1.899822     12.585558      2.474173      0.473911   1132.462122     10.386050      2.135952      2.003532       1.153956
min        0.499900      1.000000      0.846154      0.333333      3.000000      0.692308     32.540000   -124.350000       0.149990
25%        2.563400     18.000000      4.440716      1.006079    787.000000      2.429741     33.930000   -121.800000       1.196000
50%        3.534800     29.000000      5.229129      1.048780   1166.000000      2.818116     34.260000   -118.490000       1.797000
75%        4.743250     37.000000      6.052381      1.099526   1725.000000      3.282261     37.710000   -118.010000       2.647250
max       15.000100     52.000000    141.909091     34.066667  35682.000000   1243.333333     41.950000   -114.310000       5.000010

california_df.shape
Output:
(20640, 9)
Visualizing the Features
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
sns.set_style('whitegrid')

sample_df = california_df.sample(frac=0.1, random_state=17)

for feature in california.feature_names:
    plt.figure(figsize=(16, 9))
    sns.scatterplot(data=sample_df, x=feature, y='MedHouseValue', hue='MedHouseValue',
                    palette='cool', legend=False)
Splitting the Data for Training and Testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(california.data,
california.target, random_state=11)
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
Output:
(15480, 8) (5160, 8) (15480,) (5160,)
Training the Model

from sklearn.linear_model import LinearRegression


linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)

for i, name in enumerate(california.feature_names):
    print(f'{name:>10}: {linear_regression.coef_[i]}')
Output:
MedInc: 0.4377030215382206
HouseAge: 0.009216834565797749
AveRooms: -0.10732526637360926
AveBedrms: 0.6117133073918087
Population: -5.756822009275742e-06
AveOccup: -0.003384566465716442
Latitude: -0.4194818609649067
Longitude: -0.4337713349874023
predicted = linear_regression.predict(X_test)
expected = y_test
predicted[:5]
Output:
array([1.25396876, 2.34693107, 2.03794745, 1.8701254 ,
2.53608339])
expected[:5]
Output:
array([0.762, 1.732, 1.125, 1.37 , 1.856])
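Beyond comparing the first few values by eye, the fit can be summarized numerically with the regression metrics mentioned earlier; a short sketch continuing from predicted and expected above:

from sklearn.metrics import mean_squared_error, r2_score

# Lower MSE and an R^2 closer to 1 indicate a better fit; exact values depend on the split above
print('MSE:', mean_squared_error(expected, predicted))
print('R^2:', r2_score(expected, predicted))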
Visualizing the Expected vs. Predicted Prices
df = pd.DataFrame()
df['Expected'] = pd.Series(expected)
df['Predicted'] = pd.Series(predicted)
figure = plt.figure(figsize=(9, 9))
axes = sns.scatterplot(data=df, x='Expected', y='Predicted',
hue='Predicted', palette='cool', legend=False)
start = min(expected.min(), predicted.min())
end = max(expected.max(), predicted.max())
axes.set_xlim(start, end)
axes.set_ylim(start, end)
line = plt.plot([start, end], [start, end], 'k--')
Unsupervised Machine Learning, Part 1—Dimensionality Reduction

1. What is Dimensionality Reduction?

It’s the process of reducing the number of input variables (features) in your dataset
while preserving its structure or patterns. This is crucial when visualizing
high-dimensional data like images (e.g., 784 pixels for MNIST).

2. What is t-SNE?

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear technique for


reducing data to 2 or 3 dimensions so we can visualize it. It focuses on preserving the
local structure of data—i.e., similar points stay close in the new space.

Program:

from sklearn.datasets import load_digits


digits = load_digits()
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=11)

reduced_data = tsne.fit_transform(digits.data)
reduced_data.shape
Output:
(1797, 2)

import matplotlib.pyplot as plt

dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c='black')
dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
                   c=digits.target, cmap=plt.cm.get_cmap('nipy_spectral_r', 10))
colorbar = plt.colorbar(dots)
Unsupervised Machine Learning, Part 2—k-Means Clustering

The dataset describes 50 samples for each of three Iris flower species: Iris setosa, Iris versicolor and Iris virginica. Each sample's features are the sepal length, sepal width, petal length and petal width, all measured in centimeters. The sepals are the larger outer parts of each flower that protect the smaller inside petals before the flower buds bloom.
Program:
from sklearn.datasets import load_iris
iris = load_iris()

print(iris.DESCR)
Output:
.. _iris_dataset:

Iris plants dataset


--------------------
**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)


:Number of Attributes: 4 numeric, predictive attributes and the
class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= =====


====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= =====
====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= =====
====================

:Missing Attribute Values: None


:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988
iris.data.shape
Output:
(150, 4)
iris.target.shape
Output:
(150,)
iris.target_names
Output:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
iris.feature_names
Output:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']

Exploring the Iris Dataset: Descriptive Statistics with Pandas


import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]
iris_df.head()
Output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5                1.4               0.2   setosa
1                4.9               3.0                1.4               0.2   setosa
2                4.7               3.2                1.3               0.2   setosa
3                4.6               3.1                1.5               0.2   setosa
4                5.0               3.6                1.4               0.2   setosa

iris_df.describe()
Output:

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000

iris_df['species'].value_counts()
Output:

species count

setosa 50

versicolor 50

virginica 50

import seaborn as sns


sns.set(font_scale=1.1)
sns.set_style('whitegrid')
grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4],
hue='species')
Output:
The graphs along the top-left-to-bottom-right diagonal, show the
distribution of just the feature plotted in that column, with the
range of values (left-to-right) and the number of samples with
those values (top-to-bottom). For example, consider the
sepal-length distributions: The blue shaded area indicates that
the range of sepal length values (shown along the x axis) for Iris
setosa is approximately 4–6 centimeters and that most Iris
setosa samples are in the middle of that range (approximately 5
centimeters). Similarly, the green shaded area indicates that the
range of sepal length values for Iris virginica is approximately
4–8.5 centimeters and that the majority of Iris virginica samples
have sepal length values between 6 and 7 centimeters.

from sklearn.cluster import KMeans


kmeans = KMeans(n_clusters=3, random_state=11)

kmeans.fit(iris.data)

print(kmeans.labels_[0:50])
Output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0]
print(kmeans.labels_[50:100])
Output:
[2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1]

print(kmeans.labels_[100:150])
Output:
[2 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1
1 2 2 2 2 2 1 2 2 2
2 1 2 2 2 1 2 2 2 1 2 2 1]
label_counts = pd.Series(kmeans.labels_).value_counts()
print(label_counts)
Output:
1 61
0 50
2 39
Name: count, dtype: int64
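Because the true Iris species are known, the cluster assignments can be cross-tabulated against them; a brief sketch continuing from kmeans and iris_df above (cluster numbering is arbitrary):

import pandas as pd

# Rows are the true species, columns are the k-means cluster indices
print(pd.crosstab(iris_df.species, kmeans.labels_, rownames=['species'], colnames=['cluster']))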
Dimensionality Reduction with Principal Component Analysis (PCA)

from sklearn.decomposition import PCA


pca = PCA(n_components=2, random_state=11)

pca.fit(iris.data)

iris_pca = pca.transform(iris.data)

iris_pca.shape
Output:
(150, 2)

iris_pca_df = pd.DataFrame(iris_pca,
columns=['Component1', 'Component2'])
iris_pca_df['species'] = iris_df.species

axes = sns.scatterplot(data=iris_pca_df, x='Component1',


y='Component2', hue='species', legend='brief',palette='cool')
iris_centers = pca.transform(kmeans.cluster_centers_)
dots = plt.scatter(iris_centers[:,0], iris_centers[:,1],s=100,
c='k')
Output:
