
Machine Learning: Classification, Regression and Clustering

❖ Machine learning (ML) is a subset of artificial intelligence (AI) that allows systems to learn and improve without being explicitly programmed.
❖ Machine learning uses statistical techniques to enable computers to learn and make decisions.
❖ It is predicated on the idea that computers can learn from data, spot patterns and make judgments with little assistance from humans.
Difference between Traditional Programming and Machine Learning
❖ In traditional programming, we feed input data and a well-written, tested program into a machine to generate output.
❖ In machine learning, input data along with the corresponding output is fed into the machine during the learning phase, and it works out a program (a model) for itself.

How Do Machines Learn?


1. Data Collection
ML starts with data—this can be numbers, text, images, or
anything else. The more high-quality data, the better the model
learns.
2. Feature Extraction & Preprocessing
Raw data is often messy. It needs to be cleaned and transformed
into a format that ML models can understand.
Important characteristics (features) are extracted from the data.
For example, in lung cancer prediction, features could be tumor
size, age, and smoking history.
3. Choosing a Model
Different algorithms (models) are used depending on the task, for example classification, regression, or clustering algorithms.

4. Training the Model


The model learns patterns by adjusting internal parameters using
mathematical optimization techniques.
5. Testing & Evaluation
The model is tested on new, unseen data to measure
performance.
Metrics like accuracy, precision, recall, F1-score (for classification)
or mean squared error (MSE) (for regression) help determine how
well the model performs.
6. Deployment & Prediction
Once trained, the model can be used to make predictions on
real-world data.
It can be deployed in apps, websites, or medical tools to assist
decision-making.
7. Improvement & Retraining
Over time, as new data becomes available, the model can be
retrained to improve its accuracy.
Fine-tuning or choosing a different model can help if the
performance is not satisfactory.
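The seven steps above can be condensed into a short illustrative sketch; the Iris dataset and the k-NN model used here are assumptions chosen only to make the pipeline concrete, not choices prescribed by this document:

# Minimal sketch of steps 1-6: data, split, model choice, training, prediction, evaluation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                    # 1-2: features are already clean and numeric
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = KNeighborsClassifier(n_neighbors=3)           # 3: choose a model
model.fit(X_train, y_train)                           # 4: train (fit internal parameters)
y_pred = model.predict(X_test)                        # 6: predict on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))    # 5: evaluate on the held-out test set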
Machine Learning Applications
(The original table of popular machine-learning applications is not reproduced here; representative applications are listed under the learning types below.)

Types of Machine Learning


There are several types of machine learning, each with special
characteristics and applications. Some of the main types of
machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
Supervised learning is when a model is trained on a labeled dataset. A labeled dataset contains both input and output parameters, so supervised learning algorithms learn to map inputs to the correct outputs. Both the training and validation datasets are labeled.

Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed a dataset of labeled dog and cat images to the algorithm, the machine will learn to classify a dog or a cat from these labeled images. When we input new dog or cat images that it has never seen before, it uses what it has learned to predict whether the image is a dog or a cat. This is how supervised learning works, and this particular task is image classification.
There are two main categories of supervised learning that are
mentioned below:
● Classification
● Regression
Classification
Classification deals with predicting categorical target variables, which represent discrete classes or labels. For instance, classifying emails as spam or not spam, or predicting whether a patient has a high risk of heart disease. Classification algorithms learn to map the input features to one of the predefined classes (see the sketch after the algorithm list below).
Here are some classification algorithms:
● Logistic Regression
● Support Vector Machine
● Random Forest
● Decision Tree
● K-Nearest Neighbors (KNN)
● Naive Bayes
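A minimal sketch of a classifier in scikit-learn follows; the breast cancer dataset and logistic regression are illustrative assumptions, not part of the original material:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled dataset: numeric features and binary class labels (malignant / benign)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000)   # larger max_iter helps this solver converge
clf.fit(X_train, y_train)                 # learn a mapping from features to classes
print(clf.predict(X_test[:5]))            # predicted class labels for five unseen samples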
Regression
Regression, on the other hand, deals with predicting continuous target variables, which represent numerical values. For example, predicting the price of a house based on its size, location, and amenities, or forecasting the sales of a product. Regression algorithms learn to map the input features to a continuous numerical value (see the sketch after the algorithm list below).
Here are some regression algorithms:
● Linear Regression
● Polynomial Regression
● Ridge Regression
● Lasso Regression
● Decision Tree
● Random Forest
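Similarly, a minimal regression sketch; the house-size numbers below are made-up illustrative values:

import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[600], [800], [1000], [1200], [1500]])   # feature: size in square feet
prices = np.array([30, 38, 47, 55, 70])                    # target: price (illustrative units)

reg = LinearRegression().fit(sizes, prices)   # learn slope and intercept
print(reg.predict([[1100]]))                  # predicted price for an unseen size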
Advantages of Supervised Machine Learning
● Supervised learning models can achieve high accuracy because they are trained on labeled data.
● The decision-making process of supervised learning models is often interpretable.
● Pre-trained supervised models can often be reused, saving time and resources compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
● It may struggle with unseen or unexpected patterns that are not present in the training data.
● It can be time-consuming and costly because it relies on labeled data.
● It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications,
including:
● Image classification: Identify objects, faces, and other features
in images.
● Natural language processing: Extract information from text,
such as sentiment, entities, and relationships.
● Speech recognition: Convert spoken language into text.
● Recommendation systems: Make personalized
recommendations to users.
● Predictive analytics: Predict outcomes, such as sales, customer
churn, and stock prices.
● Medical diagnosis: Detect diseases and other medical conditions.
● Fraud detection: Identify fraudulent transactions.
● Autonomous vehicles: Recognize and respond to objects in the
environment.
● Email spam detection: Classify emails as spam or not spam.
● Quality control in manufacturing: Inspect products for defects.
● Credit scoring: Assess the risk of a borrower defaulting on a
loan.
● Gaming: Recognize characters, analyze player behavior, and
create NPCs.
● Customer support: Automate customer support tasks.
● Weather forecasting: Make predictions for temperature,
precipitation, and other meteorological parameters.
● Sports analytics: Analyze player performance, make game
predictions, and optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning, unsupervised learning doesn't involve providing the algorithm with labeled target outputs. The primary goal of unsupervised learning is often to discover hidden patterns, similarities, or clusters within the data, which can then be used for purposes such as data exploration, visualization, dimensionality reduction, and more.

Example: Consider a dataset that contains information about purchases made from a shop. Through clustering, the algorithm can group customers with similar purchasing behavior, revealing customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers.
There are four main categories of unsupervised learning that are
mentioned below:
● Clustering
● Association
● Dimensionality Reduction
● Anomaly Detection
Clustering
Clustering is the process of grouping data points into clusters
based on their similarity. This technique is useful for identifying
patterns and relationships in data without the need for labeled
examples.
Here are some clustering algorithms:
● K-Means Clustering algorithm
● Mean-Shift algorithm
● DBSCAN algorithm
Related unsupervised techniques such as Principal Component Analysis and Independent Component Analysis perform dimensionality reduction rather than clustering.
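To make the idea concrete, here is a brief hedged sketch of k-means clustering on synthetic, unlabeled data (a fuller example on the Iris dataset appears later in this document):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)   # unlabeled points in three loose groups
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)     # cluster index assigned to each point, with no labels provided
print(labels[:10])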
Advantages of Unsupervised Machine Learning
● It helps to discover hidden patterns and various relationships within the data.
● It is used for tasks such as customer segmentation, anomaly detection, and data exploration.
● It does not require labeled data, which reduces the effort of data labeling.
● It offers techniques such as autoencoders and dimensionality reduction that can extract meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
● Without labels, it may be difficult to assess the quality of the model's output.
● Clusters may not be clearly interpretable and may lack meaningful interpretations.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
● Clustering: Group similar data points into clusters.
● Anomaly detection: Identify outliers or anomalies in data.
● Dimensionality reduction: Reduce the dimensionality of data
while preserving its essential information.
● Recommendation systems: Suggest products, movies, or
content to users based on their historical behavior or preferences.
● Topic modeling: Discover latent topics within a collection of
documents.
● Density estimation: Estimate the probability density function of
data.
● Image and video compression: Reduce the amount of storage
required for multimedia content.
● Data preprocessing: Help with data preprocessing tasks such as
data cleaning, imputation of missing values, and data scaling.
● Market basket analysis: Discover associations between
products.
● Genomic data analysis: Identify patterns or group genes with
similar expression profiles.
● Image segmentation: Segment images into meaningful regions.
● Community detection in social networks: Identify communities
or groups of individuals with similar interests or connections.
● Customer behavior analysis: Uncover patterns and insights for
better marketing and product recommendations.
● Content recommendation: Classify and tag content to make it
easier to recommend similar items to users.
● Exploratory data analysis (EDA): Explore data and gain insights
before defining specific tasks.

Scikit-Learn: A Powerful Python Library for Machine Learning

Scikit-Learn (or sklearn) is a popular open-source machine learning library in Python. It provides simple, efficient tools for data mining and machine learning and is built on top of NumPy, SciPy, and matplotlib.

Datasets Bundled with Scikit-Learn

Scikit-learn ships with several small bundled datasets (such as Iris, Digits, Wine and Breast Cancer; the original table listing them is not reproduced here). It also provides capabilities for loading datasets from other sources, such as the 20,000+ datasets available at openml.org.
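For example, a bundled dataset loads locally while an openml.org dataset is fetched over the network; the 'mnist_784' name below is a commonly used openml dataset chosen here purely for illustration:

from sklearn.datasets import load_digits, fetch_openml

digits = load_digits()                 # bundled: ships with scikit-learn, no download needed
print(digits.data.shape)               # (1797, 64)

mnist = fetch_openml('mnist_784', version=1, as_frame=False)   # downloaded from openml.org
print(mnist.data.shape)                # (70000, 784)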


The custom dataset used in the KNN example below:

    Name     Aptitude_Score  Communication_Score  Class
0   Karuna   2               5.0                  Speaker
1   Bhuvna   2               6.0                  Speaker
2   Parimal  3               5.5                  Speaker
3   Jani     4               7.0                  Speaker
4   Bobby    5               3.0                  Intel
5   Ravi     6               2.0                  Intel
6   Gouri    6               4.0                  Intel
7   Parul    7               2.5                  Intel
8   Govind   8               3.0                  Intel
9   Susant   6               5.5                  Leader
10  Bharat   6               7.0                  Leader
11  Gaurav   7               6.0                  Leader
12  Dinesh   8               6.0                  Leader
13  Pradeep  9               7.0                  Leader

EVALUATION OF PERFORMANCE OF A MODEL
Implementation of the KNN Algorithm with a Custom Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
# Sample Dataset (Name, Aptitude Score, Communication Score, Class)
data = {
"Name": ["Karuna", "Bhuvna", "Parimal", "Jani", "Bobby",
"Ravi", "Gouri", "Parul", "Govind", "Susant",
"Bharat", "Gaurav", "Dinesh", "Pradeep"],
"Aptitude_Score": [2, 2, 3, 4, 5, 6, 6, 7, 8, 6, 6, 7, 8, 9],
"Communication_Score": [5, 6, 5.5,7, 3, 2, 4, 2.5, 3, 5.5, 7, 6,
6, 7],
"Class": ["Speaker", "Speaker", "Speaker", "Speaker", "Intel",
"Intel", "Intel", "Intel", "Intel", "Leader",
"Leader", "Leader", "Leader", "Leader"]
}

print(data)

Output:

{'Name': ['Karuna', 'Bhuvna', 'Parimal', 'Jani',


'Bobby', 'Ravi', 'Gouri', 'Parul', 'Govind', 'Susant',
'Bharat', 'Gaurav', 'Dinesh', 'Pradeep'],
'Aptitude_Score': [2, 2, 3, 4, 5, 6, 6, 7, 8, 6, 6, 7,
8, 9], 'Communication_Score': [5, 6, 5.5, 7, 3, 2, 4,
2.5, 3, 5.5, 7, 6, 6, 7], 'Class': ['Speaker',
'Speaker', 'Speaker', 'Speaker', 'Intel', 'Intel',
'Intel', 'Intel', 'Intel', 'Leader', 'Leader',
'Leader', 'Leader', 'Leader']}

# Convert to DataFrame
df = pd.DataFrame(data)
df
Output:

Name Aptitude_Score Communication_Score Class

0 Karuna 2 5.0 Speaker

1 Bhuvna 2 6.0 Speaker

2 Parimal 3 5.5 Speaker

3 Jani 4 7.0 Speaker

4 Bobby 5 3.0 Intel

5 Ravi 6 2.0 Intel

6 Gouri 6 4.0 Intel

7 Parul 7 2.5 Intel

8 Govind 8 3.0 Intel

9 Susant 6 5.5 Leader

10 Bharat 6 7.0 Leader

11 Gaurav 7 6.0 Leader

12 Dinesh 8 6.0 Leader

13 Pradeep 9 7.0 Leader

# Features (Aptitude & Communication), excluding Name


X = df[["Aptitude_Score", "Communication_Score"]]

y = df["Class"]

print(X.shape,y.shape)

Output:
(14, 2) (14,)

# Split Data into Training and Testing Sets (80% Train, 20% Test)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

Output:
(11, 2) (11,) (3, 2) (3,)

print(X_test)

Output:
Aptitude_Score Communication_Score
9 6 5.5
11 7 6.0
0 2 5.0

print(y_test)
Output:
9 Leader
11 Leader
0 Speaker
Name: Class, dtype: object

# Initialize KNN Model


knn = KNeighborsClassifier(n_neighbors=3) # k=3

# Train the Model


knn.fit(X_train, y_train)

# Make Predictions
y_pred = knn.predict(X_test)

print(y_pred)
Output:
['Leader' 'Leader' 'Speaker']
# Evaluate Model Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
Output:
Model Accuracy: 100.00%

# Display Classification Report


print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Output:
Classification Report:
precision recall f1-score support

Leader 1.00 1.00 1.00 2


Speaker 1.00 1.00 1.00 1

accuracy 1.00 3
macro avg 1.00 1.00 1.00 3
weighted avg 1.00 1.00 1.00 3
# Confusion Matrix
print("\nConfusion Matrix:")
# cm = confusion_matrix(y_test, y_pred)  # or your actual class labels
cm = confusion_matrix(y_test, y_pred, labels=['Speaker', 'Intel', 'Leader'])
print(cm)
Output:
Confusion Matrix:
[[1 0 0]
[0 0 0]
[0 0 2]]
# User Input for Prediction
aptitude = float(input("Enter Aptitude Score: "))
communication = float(input("Enter Communication Score: "))
# Predict Class
predicted_class = knn.predict([[aptitude, communication]])
print(f"Predicted Class: {predicted_class[0]}")
Output:
Enter Aptitude Score: 5
Enter Communication Score: 4.5
Predicted Class: Intel
Classification with k-Nearest Neighbors and the Digits Dataset
Our Approach
We’ll cover this case study over two sections. In this section, we’ll
begin with the basic
steps of a machine learning case study:
• Decide the data from which to train a model.
• Load and explore the data.
• Split the data for training and testing.
• Select and build the model.
• Train the model.
• Make predictions.
• Evaluate the results.
• Tune the model.
• Run several classification models to choose the best one(s).
Program:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

digits=datasets.load_digits()
print(digits.DESCR)
Output:
.. _digits_dataset:

Optical recognition of handwritten digits dataset


--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 1797


:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in
the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998

This is a copy of the test set of the UCI ML


hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits:


10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were


used to extract
normalized bitmaps of handwritten digits from a
preprinted form. From a
total of 43 people, 30 contributed to the training set
and different 13
to the test set. 32x32 bitmaps are divided into
nonoverlapping blocks of
4x4 and the number of on pixels are counted in each
block. This generates
an input matrix of 8x8 where each element is an
integer in the range
0..16. This reduces dimensionality and gives
invariance to small
distortions.

For info on NIST preprocessing routines, see M. D.


Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S.
A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition
System, NISTIR 5469,
1994.

.. dropdown:: References
- C. Kaynak (1995) Methods of Combining Multiple
Classifiers and Their
Applications to Handwritten Digit Recognition, MSc
Thesis, Institute of
Graduate Studies in Science and Engineering,
Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading
Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao
and A. Kai Qin.
Linear dimensionalityreduction using relevance
weighted LDA. School of
Electrical and Electronic Engineering Nanyang
Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin
Classification
Algorithm. NIPS. 2000.
X, y = digits.data, digits.target

# Step 2: Inspect the dataset


print(f"Shape of feature data (X): {X.shape}")
print(f"Shape of target data (y): {y.shape}")
print(f"Feature names: {digits.feature_names}")
#print(f"Target labels (first 10): {y[:10]}")
print(f"Target labels (first 10): {digits.target_names}")
Output:
Shape of feature data (X): (1797, 64)
Shape of target data (y): (1797,)
Feature names: ['pixel_0_0', 'pixel_0_1', 'pixel_0_2',
'pixel_0_3', 'pixel_0_4', 'pixel_0_5', 'pixel_0_6',
'pixel_0_7', 'pixel_1_0', 'pixel_1_1', 'pixel_1_2',
'pixel_1_3', 'pixel_1_4', 'pixel_1_5', 'pixel_1_6',
'pixel_1_7', 'pixel_2_0', 'pixel_2_1', 'pixel_2_2',
'pixel_2_3', 'pixel_2_4', 'pixel_2_5', 'pixel_2_6',
'pixel_2_7', 'pixel_3_0', 'pixel_3_1', 'pixel_3_2',
'pixel_3_3', 'pixel_3_4', 'pixel_3_5', 'pixel_3_6',
'pixel_3_7', 'pixel_4_0', 'pixel_4_1', 'pixel_4_2',
'pixel_4_3', 'pixel_4_4', 'pixel_4_5', 'pixel_4_6',
'pixel_4_7', 'pixel_5_0', 'pixel_5_1', 'pixel_5_2',
'pixel_5_3', 'pixel_5_4', 'pixel_5_5', 'pixel_5_6',
'pixel_5_7', 'pixel_6_0', 'pixel_6_1', 'pixel_6_2',
'pixel_6_3', 'pixel_6_4', 'pixel_6_5', 'pixel_6_6',
'pixel_6_7', 'pixel_7_0', 'pixel_7_1', 'pixel_7_2',
'pixel_7_3', 'pixel_7_4', 'pixel_7_5', 'pixel_7_6',
'pixel_7_7']
Target labels (first 10): [0 1 2 3 4 5 6 7 8 9]

df=pd.DataFrame(X,columns=digits.feature_names)
df
Output:
df1=pd.DataFrame(y,columns=['target'])
df1

Output:

      target
0          0
1          1
2          2
3          3
4          4
...      ...
1792       9
1793       0
1794       8
1795       9
1796       8

1797 rows × 1 columns

Visualizing the Data

digits.images.shape

Output:

(1797, 8, 8)
# Let us look at the first image, which is an 8x8 array of pixel intensities
digits.images[0]

Output:

array([[ 0., 0., 5., 13., 9., 1., 0., 0.],


[ 0., 0., 13., 15., 10., 15., 5., 0.],

[ 0., 3., 15., 2., 0., 11., 8., 0.],

[ 0., 4., 12., 0., 0., 8., 8., 0.],

[ 0., 5., 8., 0., 0., 9., 8., 0.],

[ 0., 4., 11., 0., 1., 12., 7., 0.],

[ 0., 2., 14., 5., 10., 12., 0., 0.],

[ 0., 0., 6., 13., 10., 0., 0., 0.]])


plt.imshow(digits.images[0], cmap='gray')
plt.title('Number:'+str(y[0]))
plt.show()

Output:
# set up the plots
figure, axes = plt.subplots(nrows=3, ncols=10, figsize=(15, 6))
for ax, image, number in zip(axes.ravel(), digits.images, digits.target):
    ax.axis('off')
    ax.imshow(image, cmap=plt.cm.gray_r)
    ax.set_title('Number: ' + str(number))

Output:
Split the data into training and testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99, stratify=y)

print(f"Shape of X_train: {X_train.shape}")

print(f"Shape of X_test: {X_test.shape}")

print(f"Shape of y_train: {y_train.shape}")

print(f"Shape of y_test: {y_test.shape}")

Output:

Shape of X_train: (1257, 64)

Shape of X_test: (540, 64)

Shape of y_train: (1257,)

Shape of y_test: (540,)

Fit the Model


from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train,y_train)

y_pred=knn.predict(X_test)

Performance Measure

from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_pred)

Output:

0.9851851851851852

from sklearn.metrics import classification_report

report=classification_report(y_test,y_pred)

print(report)

Output:
precision recall f1-score support

0 1.00 1.00 1.00 54


1 0.95 0.98 0.96 55
2 1.00 1.00 1.00 53
3 1.00 0.98 0.99 55
4 1.00 0.98 0.99 54
5 0.96 0.98 0.97 55
6 1.00 1.00 1.00 54
7 0.98 1.00 0.99 54
8 1.00 0.94 0.97 52
9 0.96 0.98 0.97 54

accuracy 0.99 540


macro avg 0.99 0.99 0.99 540
weighted avg 0.99 0.99 0.99 540

Evaluate using confusion matrix

from sklearn.metrics import confusion_matrix


cm=confusion_matrix(y_test,y_pred)
print(cm)

Output:

[[54 0 0 0 0 0 0 0 0 0]

[ 0 54 0 0 0 1 0 0 0 0]

[ 0 0 53 0 0 0 0 0 0 0]

[ 0 0 0 54 0 0 0 1 0 0]

[ 0 0 0 0 53 0 0 0 0 1]

[ 0 0 0 0 0 54 0 0 0 1]
[ 0 0 0 0 0 0 54 0 0 0]

[ 0 0 0 0 0 0 0 54 0 0]

[ 0 3 0 0 0 0 0 0 49 0]

[ 0 0 0 0 0 1 0 0 0 53]]

import seaborn as sns

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='nipy_spectral_r')

Output:
Each row represents one distinct class, that is, one of the digits 0–9. The columns within a row specify how many of the test samples were classified into each distinct class.

For example, row 0:

[54, 0, 0, 0, 0, 0, 0, 0, 0, 0]

represents the digit 0 class. The columns represent the ten possible target classes 0 through 9. Because we're working with digits, the classes (0–9) and the row and column index numbers (0–9) happen to match. According to row 0, 54 test samples were classified as the digit 0, and none of the test samples were misclassified as any of the digits 1 through 9. So 100% of the 0s were correctly predicted.

On the other hand, consider row 8, which represents the results for the digit 8: 49 samples were classified correctly, but 3 samples were misclassified as the digit 1, so only about 94% of the 8s were predicted correctly.
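The per-class recall values in the classification report can also be read straight off the confusion matrix; a short sketch continuing from the cm computed above:

import numpy as np

# Correct predictions sit on the diagonal; each row sums to that digit's support
per_class_recall = cm.diagonal() / cm.sum(axis=1)
for digit, recall in enumerate(per_class_recall):
    print(f'digit {digit}: recall = {recall:.2%}')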

For Testing

custom_image = np.array([
[0, 0, 5, 13, 15, 16, 14, 6],
[0, 0, 7, 18, 20, 19, 16, 7],
[0, 4, 15, 28, 30, 27, 23, 9],
[0, 8, 23, 37, 40, 37, 30, 12],
[0, 7, 21, 38, 42, 40, 32, 13],
[0, 4, 14, 30, 33, 29, 22, 8],
[0, 0, 7, 17, 19, 16, 11, 4],
[0, 0, 2, 9, 11, 10, 7, 3]
])
plt.imshow(custom_image, cmap='gray')
plt.title("Custom Digit")
plt.axis('off')
plt.show()

Output:
predicted_label = knn.predict(custom_image.reshape(1, -1))  # reshape custom_image before prediction

print(f"The predicted label for the test digit is: {predicted_label}")

Output:

The predicted label for the test digit is: [1]

Using K-Fold Cross-Validation

K-fold cross-validation method

The random sampling approach used in the holdout method has some issues:
1. With smaller datasets, it is difficult to distribute the data of every class proportionally between the training and test datasets.
2. A repeated holdout is sometimes used to ensure the randomness of the composed datasets.
❖ Several random holdouts are used to measure the model performance.
❖ In the end, the average of all performances is taken.
❖ Because multiple holdouts have been drawn, the training and test data (and validation data) contain representative data from all classes and resemble the original input data closely.
❖ This process of repeated holdout is the basis of the k-fold cross-validation technique.
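Before the full program, a minimal sketch of how k-fold splitting works; the tiny array and fold count here are assumptions for illustration only:

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(10).reshape(-1, 1)                   # ten tiny samples, one feature each
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each iteration holds out a different fifth of the data as the test fold
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print(f'fold {fold}: train={train_idx}, test={test_idx}')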
Program:
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Load digits dataset


digits = load_digits()
X, y = digits.data, digits.target

# Define K-Fold cross-validator


kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning for KNN


k_values = range(1, 21)

print("Tuning KNN with different k values:")


for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=kfold, scoring='accuracy')

# Note: these prints sit outside the loop, so they report the scores for the last k tried (k=20)
print(f'accuracy with different K: {scores}')
print(f'Mean accuracy: {scores.mean():.2%}')
Output:
Tuning KNN with different k values:
accuracy with different K: [0.97777778 0.98055556
0.95264624 0.97214485 0.97214485]
Mean accuracy: 97.11%

Running Multiple Models to Find the Best One


It’s difficult to know in advance which machine learning model(s)
will perform best for a given dataset, especially when they hide
the details of how they operate from their users.
This encourages you to run multiple models to determine which is
the best for a particular machine learning study.
Program:
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Load digits dataset


digits = load_digits()
X, y = digits.data, digits.target

# Define K-Fold cross-validator


k = 5 # You can change this to any number of splits you like
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Define the classifiers


classifiers = {
'KNN': KNeighborsClassifier(),
'SVC': SVC(),
'GaussianNB': GaussianNB()
}

# Evaluate each classifier using cross_val_score


for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=kf, scoring='accuracy')
    print(f"{name} average accuracy: {np.mean(scores):.4f}")
Output:
KNN average accuracy: 0.9861
SVC average accuracy: 0.9878
GaussianNB average accuracy: 0.8392

Hyperparameter Tuning
Earlier in this section, we mentioned that k in the k-nearest neighbors algorithm is a hyperparameter of the algorithm. Hyperparameters are set before using the algorithm to train your model. In real-world machine learning studies, you'll want to use hyperparameter tuning to choose hyperparameter values that produce the best possible predictions. To determine the best value for k in the kNN algorithm, try different values of k and then compare the estimator's performance with each. We can do this using techniques similar to comparing estimators. The following loop creates KNeighborsClassifiers with k values from 1 through 20 (odd k values are often preferred in kNN to avoid ties) and performs k-fold cross-validation on each. As you can see from the accuracy scores, the k value 1 produces the most accurate predictions for the Digits dataset. You can also see that accuracy tends to decrease for higher k values:
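scikit-learn can also automate this search with GridSearchCV; the following is an alternative sketch, not part of the original program:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
param_grid = {'n_neighbors': list(range(1, 21))}     # same candidate k values as the loop below
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(digits.data, digits.target)
print(grid.best_params_, grid.best_score_)           # best k and its mean cross-validated accuracy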
Program:
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Load digits dataset


digits = load_digits()
X, y = digits.data, digits.target
# Define K-Fold cross-validator
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning for KNN


k_values = range(1, 21)
knn_scores = []

print("Tuning KNN with different k values:")


for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=kfold, scoring='accuracy')
    avg_score = np.mean(scores)
    knn_scores.append(avg_score)
    print(f"k={k}, Accuracy={avg_score:.4f}")

# Find best k
best_k = k_values[np.argmax(knn_scores)]
print(f"\nBest k for KNN: {best_k} with accuracy
{max(knn_scores):.4f}")
Output:
Tuning KNN with different k values:
k=1, Accuracy=0.9878
k=2, Accuracy=0.9844
k=3, Accuracy=0.9866
k=4, Accuracy=0.9816
k=5, Accuracy=0.9861
k=6, Accuracy=0.9850
k=7, Accuracy=0.9861
k=8, Accuracy=0.9833
k=9, Accuracy=0.9816
k=10, Accuracy=0.9822
k=11, Accuracy=0.9816
k=12, Accuracy=0.9805
k=13, Accuracy=0.9783
k=14, Accuracy=0.9772
k=15, Accuracy=0.9772
k=16, Accuracy=0.9761
k=17, Accuracy=0.9761
k=18, Accuracy=0.9750
k=19, Accuracy=0.9733
k=20, Accuracy=0.9711

Best k for KNN: 1 with accuracy 0.9878


Case Study: Time Series and Simple Linear Regression

A time series is just a sequence of data points collected or


recorded at specific time intervals. Think of it like a timeline of
measurements—one after another—usually taken at evenly
spaced times (like every second, minute, day, month, etc.).

Examples of Time Series:

●​ Daily temperature readings


●​ Monthly sales numbers for a store
●​ Hourly stock prices of a company
●​ Weekly number of steps recorded by a fitness tracker

Simple Linear Regression

Definition:

Simple Linear Regression is a statistical method used to model


the relationship between two variables:

Independent variable (X) — the predictor

Dependent variable (Y) — the response

It assumes a linear relationship, represented by the equation:

Y = a + bX + ϵ

Where:

●​ a is the intercept (value of Y when X = 0)​

●​ b is the slope (change in Y for one unit change in X)​

●​ ϵ is the error term

Goal:​
To find the best-fitting straight line (called the regression line)
that predicts Y from X.
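For reference, the slope and intercept of the best-fitting line come from the standard least-squares formulas; the numbers below are made up purely to show the calculation:

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)     # predictor
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])        # response

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)   # slope
a = Y.mean() - b * X.mean()                                                 # intercept
print(f'Y = {a:.2f} + {b:.2f} * X')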

Applications:
●​ Predicting sales from advertising spend
●​ Estimating house prices based on size
●​ Forecasting trends using historical data

Assumptions:

●​ Linear relationship between X and Y


●​ Errors are normally distributed
●​ Constant variance of errors (homoscedasticity)
●​ Independence of observations

Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

nyc = pd.read_csv('/content/ave_hi_nyc_jan_1895-2018.csv')

nyc.head()
Output:
     Date  Value  Anomaly
0  189501   34.2     -3.2
1  189601   34.7     -2.7
2  189701   35.5     -1.9
3  189801   39.6      2.2
4  189901   36.4     -1.0

# Rename the Value column to Temperature, then truncate each date to its year
nyc.columns = ['Date', 'Temperature', 'Anomaly']
nyc.Date = nyc.Date.floordiv(100)

nyc.tail()
Output:
     Date  Temperature  Anomaly
119  2014         35.5     -1.9
120  2015         36.1     -1.3
121  2016         40.8      3.4
122  2017         42.8      5.4
123  2018         38.7      1.3
# Splitting the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    nyc.Date.values.reshape(-1, 1), nyc.Temperature.values, test_size=0.25, random_state=11)

print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
Output:
(93, 1) (31, 1) (93,) (31,)
linear_regression = LinearRegression()
linear_regression.fit(X_train,y_train)

print(linear_regression.coef_)
print(linear_regression.intercept_)
Output:
[0.01939167]
-0.30779820252656265
y_pred = linear_regression.predict(X_test)
for p, e in zip(y_pred[::5], y_test[::5]):
print(f'y_pred: {p:.2f}, y_test: {e:.2f}')
Output:
y_pred: 37.86, y_test: 31.70
y_pred: 38.69, y_test: 34.80
y_pred: 37.00, y_test: 39.40
y_pred: 37.25, y_test: 45.70
y_pred: 38.05, y_test: 32.30
y_pred: 37.64, y_test: 33.80
y_pred: 36.94, y_test: 39.70

print(linear_regression.predict([[2019]]))
Output:
[38.84399018]
predict = (lambda x: linear_regression.coef_ * x +
linear_regression.intercept_)

predict(2019)
Output:
array([38.84399018])
predict(1890)
Output:
array([36.34246432])
predict(2018)
Output:
array([38.82459851])
Visualizing the Dataset with the Regression Line
import seaborn as sns
axes = sns.scatterplot(data=nyc, x='Date', y='Temperature', hue='Temperature',
                       palette='winter', legend=False)
axes.set_ylim(10, 70)  # scale the y-axis after the axes object has been created
x = np.array([min(nyc.Date.values), max(nyc.Date.values)])
y = predict(x)
line = plt.plot(x, y)
Output:
Multiple Linear Regression with the California Housing Dataset
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
import pandas as pd

print(california.DESCR)
output:
.. _california_housing_dataset:

California Housing dataset


--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive


attributes and the target

:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per
household
- AveBedrms average number of bedrooms per
household
- Population block group population
- AveOccup average number of household
members
- Latitude block group latitude
- Longitude block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.


https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for


California districts,
expressed in hundreds of thousands of dollars
($100,000).

This dataset was derived from the 1990 U.S. census,


using one row per census
block group. A block group is the smallest
geographical unit for which the U.S.
Census Bureau publishes sample data (a block group
typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a


home. Since the average
number of rooms and bedrooms in this dataset are
provided per household, these
columns may take surprisingly large values for block
groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the


:func:`sklearn.datasets.fetch_california_housing`
function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial


Autoregressions,
Statistics and Probability Letters, 33 (1997)
291-297
california_df = pd.DataFrame(california.data, columns=california.feature_names)
california_df['MedHouseValue'] = pd.Series(california.target)

print(california_df.head())
Output:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
Longitude MedHouseValue
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422

california_df.describe()
Output:
             MedInc      HouseAge      AveRooms     AveBedrms    Population      AveOccup      Latitude     Longitude  MedHouseValue
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   20640.000000
mean       3.870671     28.639486      5.429000      1.096675   1425.476744      3.070655     35.631861   -119.569704       2.068558
std        1.899822     12.585558      2.474173      0.473911   1132.462122     10.386050      2.135952      2.003532       1.153956
min        0.499900      1.000000      0.846154      0.333333      3.000000      0.692308     32.540000   -124.350000       0.149990
25%        2.563400     18.000000      4.440716      1.006079    787.000000      2.429741     33.930000   -121.800000       1.196000
50%        3.534800     29.000000      5.229129      1.048780   1166.000000      2.818116     34.260000   -118.490000       1.797000
75%        4.743250     37.000000      6.052381      1.099526   1725.000000      3.282261     37.710000   -118.010000       2.647250
max       15.000100     52.000000    141.909091     34.066667  35682.000000   1243.333333     41.950000   -114.310000       5.000010

california_df.shape
Output:
(20640, 9)
Visualizing the Features
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
sns.set_style('whitegrid')

sample_df = california_df.sample(frac=0.1, random_state=17)

for feature in california.feature_names:
    plt.figure(figsize=(16, 9))
    sns.scatterplot(data=sample_df, x=feature, y='MedHouseValue', hue='MedHouseValue',
                    palette='cool', legend=False)
Splitting the Data for Training and Testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(california.data,
california.target, random_state=11)
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
Output:
(15480, 8) (5160, 8) (15480,) (5160,)
Training the Model

from sklearn.linear_model import LinearRegression


linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)

for i, name in enumerate(california.feature_names):
    print(f'{name:>10}: {linear_regression.coef_[i]}')
Output:
MedInc: 0.4377030215382206
HouseAge: 0.009216834565797749
AveRooms: -0.10732526637360926
AveBedrms: 0.6117133073918087
Population: -5.756822009275742e-06
AveOccup: -0.003384566465716442
Latitude: -0.4194818609649067
Longitude: -0.4337713349874023
predicted = linear_regression.predict(X_test)
expected = y_test
predicted[:5]
Output:
array([1.25396876, 2.34693107, 2.03794745, 1.8701254 ,
2.53608339])
expected[:5]
Output:
array([0.762, 1.732, 1.125, 1.37 , 1.856])
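Beyond comparing the first few values by eye, the fit can be summarized numerically with the regression metrics mentioned earlier; a short sketch continuing from predicted and expected above:

from sklearn.metrics import mean_squared_error, r2_score

# Lower MSE and an R^2 closer to 1 indicate a better fit; exact values depend on the split above
print('MSE:', mean_squared_error(expected, predicted))
print('R^2:', r2_score(expected, predicted))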
Visualizing the Expected vs. Predicted Prices
df = pd.DataFrame()
df['Expected'] = pd.Series(expected)
df['Predicted'] = pd.Series(predicted)
figure = plt.figure(figsize=(9, 9))
axes = sns.scatterplot(data=df, x='Expected', y='Predicted',
hue='Predicted', palette='cool', legend=False)
start = min(expected.min(), predicted.min())
end = max(expected.max(), predicted.max())
axes.set_xlim(start, end)
axes.set_ylim(start, end)
line = plt.plot([start, end], [start, end], 'k--')
Unsupervised Machine Learning, Part 1—Dimensionality Reduction

1. What is Dimensionality Reduction?

It’s the process of reducing the number of input variables (features) in your dataset
while preserving its structure or patterns. This is crucial when visualizing
high-dimensional data like images (e.g., 784 pixels for MNIST).

2. What is t-SNE?

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear technique for


reducing data to 2 or 3 dimensions so we can visualize it. It focuses on preserving the
local structure of data—i.e., similar points stay close in the new space.

Program:

from sklearn.datasets import load_digits


digits = load_digits()
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=11)

reduced_data = tsne.fit_transform(digits.data)
reduced_data.shape
Output:
(1797, 2)

import matplotlib.pyplot as plt

dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c='black')
dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
                   c=digits.target, cmap=plt.cm.get_cmap('nipy_spectral_r', 10))
colorbar = plt.colorbar(dots)
Unsupervised Machine Learning, Part 2—k-Means Clustering

The dataset describes 50 samples for each of three Iris flower species: Iris setosa, Iris versicolor and Iris virginica. Each sample's features are the sepal length, sepal width, petal length and petal width, all measured in centimeters. The sepals are the larger outer parts of each flower that protect the smaller inside petals before the flower buds bloom.
Program:
from sklearn.datasets import load_iris
iris = load_iris()

print(iris.DESCR)
Output:
.. _iris_dataset:

Iris plants dataset


--------------------
**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)


:Number of Attributes: 4 numeric, predictive attributes and the
class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= =====


====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= =====
====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= =====
====================

:Missing Attribute Values: None


:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988
iris.data.shape
Output:
(150, 4)
iris.target.shape
Output:
(150,)
iris.target_names
Output:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
iris.feature_names
Output:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']

Exploring the Iris Dataset: Descriptive Statistics with Pandas


import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]
iris_df.head()
Output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5                1.4               0.2   setosa
1                4.9               3.0                1.4               0.2   setosa
2                4.7               3.2                1.3               0.2   setosa
3                4.6               3.1                1.5               0.2   setosa
4                5.0               3.6                1.4               0.2   setosa

iris_df.describe()
Output:

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000

iris_df['species'].value_counts()
Output:

species count

setosa 50

versicolor 50

virginica 50

import seaborn as sns


sns.set(font_scale=1.1)
sns.set_style('whitegrid')
grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4],
hue='species')
Output:
The graphs along the top-left-to-bottom-right diagonal, show the
distribution of just the feature plotted in that column, with the
range of values (left-to-right) and the number of samples with
those values (top-to-bottom). For example, consider the
sepal-length distributions: The blue shaded area indicates that
the range of sepal length values (shown along the x axis) for Iris
setosa is approximately 4–6 centimeters and that most Iris
setosa samples are in the middle of that range (approximately 5
centimeters). Similarly, the green shaded area indicates that the
range of sepal length values for Iris virginica is approximately
4–8.5 centimeters and that the majority of Iris virginica samples
have sepal length values between 6 and 7 centimeters.

from sklearn.cluster import KMeans


kmeans = KMeans(n_clusters=3, random_state=11)

kmeans.fit(iris.data)

print(kmeans.labels_[0:50])
Output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0]
print(kmeans.labels_[50:100])
Output:
[2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1]

print(kmeans.labels_[100:150])
Output:
[2 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1
1 2 2 2 2 2 1 2 2 2
2 1 2 2 2 1 2 2 2 1 2 2 1]
label_counts = pd.Series(kmeans.labels_).value_counts()
print(label_counts)
Output:
1 61
0 50
2 39
Name: count, dtype: int64
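Because the true Iris species are known, the cluster assignments can be cross-tabulated against them; a brief sketch continuing from kmeans and iris_df above (cluster numbering is arbitrary):

import pandas as pd

# Rows are the true species, columns are the k-means cluster indices
print(pd.crosstab(iris_df.species, kmeans.labels_, rownames=['species'], colnames=['cluster']))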
Dimensionality Reduction with Principal Component Analysis (PCA)

from sklearn.decomposition import PCA


pca = PCA(n_components=2, random_state=11)

pca.fit(iris.data)

iris_pca = pca.transform(iris.data)

iris_pca.shape
Output:
(150, 2)

iris_pca_df = pd.DataFrame(iris_pca,
columns=['Component1', 'Component2'])
iris_pca_df['species'] = iris_df.species

axes = sns.scatterplot(data=iris_pca_df, x='Component1',


y='Component2', hue='species', legend='brief',palette='cool')
iris_centers = pca.transform(kmeans.cluster_centers_)
dots = plt.scatter(iris_centers[:,0], iris_centers[:,1],s=100,
c='k')
Output:
