15. Machine Learning Classification, Regression and Clustering
k-Nearest Neighbors Classification with Scikit-Learn:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
# Sample Dataset (Name, Aptitude Score, Communication Score, Class)
data = {
"Name": ["Karuna", "Bhuvna", "Parimal", "Jani", "Bobby",
"Ravi", "Gouri", "Parul", "Govind", "Susant",
"Bharat", "Gaurav", "Dinesh", "Pradeep"],
"Aptitude_Score": [2, 2, 3, 4, 5, 6, 6, 7, 8, 6, 6, 7, 8, 9],
"Communication_Score": [5, 6, 5.5,7, 3, 2, 4, 2.5, 3, 5.5, 7, 6,
6, 7],
"Class": ["Speaker", "Speaker", "Speaker", "Speaker", "Intel",
"Intel", "Intel", "Intel", "Intel", "Leader",
"Leader", "Leader", "Leader", "Leader"]
}
print(data)
Output:
# Convert to DataFrame
df = pd.DataFrame(data)
df
Output:
X = df[["Aptitude_Score", "Communication_Score"]]
y = df["Class"]
print(X.shape, y.shape)
Output:
(14, 2) (14,)
# Split Data into Training and Testing Sets (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # random_state value assumed for reproducibility
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
Output:
(11, 2) (11,) (3, 2) (3,)
print(X_test)
Output:
Aptitude_Score Communication_Score
9 6 5.5
11 7 6.0
0 2 5.0
print(y_test)
Output:
9 Leader
11 Leader
0 Speaker
Name: Class, dtype: object
# Build and train the k-NN classifier (the k value here is assumed; it was not shown above)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Make Predictions
y_pred = knn.predict(X_test)
print(y_pred)
Output:
['Leader' 'Leader' 'Speaker']
# Evaluate Model Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
Output:
Model Accuracy: 100.00%
print(classification_report(y_test, y_pred))
Output:
...
    accuracy                         1.00        3
   macro avg       1.00      1.00    1.00        3
weighted avg       1.00      1.00    1.00        3
# Confusion Matrix
print("\nConfusion Matrix:")
# cm = confusion_matrix(y_test, y_pred)  # default (alphabetical) label order
cm = confusion_matrix(y_test, y_pred, labels=['Speaker', 'Intel', 'Leader'])
print(cm)
Output:
Confusion Matrix:
[[1 0 0]
[0 0 0]
[0 0 2]]
# User Input for Prediction
aptitude = float(input("Enter Aptitude Score: "))
communication = float(input("Enter Communication Score: "))
# Predict Class
predicted_class = knn.predict([[aptitude, communication]])
print(f"Predicted Class: {predicted_class[0]}")
Output:
Enter Aptitude Score: 5
Enter Communication Score: 4.5
Predicted Class: Intel
Classification with k-Nearest Neighbors and the Digits Dataset
Our Approach
We’ll cover this case study over two sections. In this section, we’ll begin with the basic steps of a machine learning case study (a compact end-to-end sketch follows the list below):
• Decide the data from which to train a model.
• Load and explore the data.
• Split the data for training and testing.
• Select and build the model.
• Train the model.
• Make predictions.
• Evaluate the results.
• Tune the model.
• Run several classification models to choose the best one(s).
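These steps map onto only a few lines of scikit-learn code. Below is a minimal end-to-end sketch using the Digits dataset; the train/test split settings and the k value of 3 are illustrative assumptions, not values taken from the case study that follows.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load and explore the data
digits = load_digits()

# Split the data for training and testing (split parameters assumed)
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=11)

# Select, build and train the model (k=3 is an illustrative choice)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Make predictions and evaluate the results
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))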
Program:
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
digits=datasets.load_digits()
print(digits.DESCR)
Output:
.. _digits_dataset:

.. dropdown:: References

  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionality reduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University. 2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.
X, y = digits.data, digits.target
df=pd.DataFrame(X,columns=digits.feature_names)
df
Output:
df1=pd.DataFrame(y,columns=['target'])
df1
Output:
      target
0          0
1          1
2          2
3          3
4          4
...      ...
1792       9
1793       0
1794       8
1795       9
1796       8
digits.images.shape
Output:
(1797, 8, 8)
# Let us look at the first image, which is an 8x8 array of pixel intensities
digits.images[0]
Output:
# set up the plots
figure, axes = plt.subplots(nrows=3, ncols=10, figsize=(15, 6))
for ax, image, number in zip(axes.ravel(), digits.images, digits.target):
    ax.axis('off')
    ax.imshow(image, cmap=plt.cm.gray_r)
    ax.set_title('Number: ' + str(number))
Output:
Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99, stratify=y)
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)
Performance Measure
accuracy_score(y_test,y_pred)
Output:
0.9851851851851852
report=classification_report(y_test,y_pred)
print(report)
Output:
              precision    recall  f1-score   support
...

print(confusion_matrix(y_test, y_pred))
Output:
[[54 0 0 0 0 0 0 0 0 0]
[ 0 54 0 0 0 1 0 0 0 0]
[ 0 0 53 0 0 0 0 0 0 0]
[ 0 0 0 54 0 0 0 1 0 0]
[ 0 0 0 0 53 0 0 0 0 1]
[ 0 0 0 0 0 54 0 0 0 1]
[ 0 0 0 0 0 0 54 0 0 0]
[ 0 0 0 0 0 0 0 54 0 0]
[ 0 3 0 0 0 0 0 0 49 0]
[ 0 0 0 0 0 1 0 0 0 53]]
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='nipy_spectral_r')
Output:
Each row represents one distinct class, that is, one of the digits 0–9. The columns within a row specify how many of the test samples were classified into each distinct class. Because we’re working with digits, the classes (0–9) correspond directly to the matrix’s row and column indexes, so correct predictions appear on the main diagonal.
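One way to make the matrix easier to read, sketched here as an optional step, is to wrap it in a DataFrame whose row and column labels are the digit classes (the variable names confusion and confusion_df are illustrative, not from the original).

import pandas as pd
from sklearn.metrics import confusion_matrix

# Label rows (actual digit) and columns (predicted digit) with the classes 0-9
confusion = confusion_matrix(y_test, y_pred)
confusion_df = pd.DataFrame(confusion, index=range(10), columns=range(10))
print(confusion_df)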
Testing with a Custom Image
custom_image = np.array([
[0, 0, 5, 13, 15, 16, 14, 6],
[0, 0, 7, 18, 20, 19, 16, 7],
[0, 4, 15, 28, 30, 27, 23, 9],
[0, 8, 23, 37, 40, 37, 30, 12],
[0, 7, 21, 38, 42, 40, 32, 13],
[0, 4, 14, 30, 33, 29, 22, 8],
[0, 0, 7, 17, 19, 16, 11, 4],
[0, 0, 2, 9, 11, 10, 7, 3]
])
plt.imshow(custom_image, cmap='gray')
plt.title("Custom Digit")
plt.axis('off')
plt.show()
Output:
# Reshape custom_image into a single flat 64-feature row before prediction
predicted_label = knn.predict(custom_image.reshape(1, -1))
print(f"Predicted Label: {predicted_label[0]}")
Output:
Hyperparameter Tuning
Earlier in this section, we mentioned that k in the k-nearest
neighbors algorithm is a hyperparameter of the algorithm.
Hyperparameters are set before using the algorithm to train your
model. In real-world machine learning studies, you’ll want to use
hyperparameter tuning to choose hyperparameter values that
produce the best possible predictions. To determine the best
value for k in the kNN algorithm, try different values of k and compare the estimator’s performance with each, which we can do with k-fold cross-validation. The following loop creates a KNeighborsClassifier for each k value from 1 through 20 and performs k-fold cross-validation on each. As you can see from the accuracy scores, k = 1 produces the most accurate predictions for the Digits dataset, and accuracy tends to decrease for larger k values:
Program:
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import numpy as np
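The tuning loop itself is not reproduced above. The following is a minimal sketch that would produce the k_values and knn_scores used below, relying on the imports above; the cross-validation settings (n_splits, shuffle, random_state) are assumptions rather than values from the original run.

digits = load_digits()

# k-fold cross-validation setup (fold count and random_state are assumed)
kfold = KFold(n_splits=10, shuffle=True, random_state=11)

print("Tuning KNN with different k values:")
k_values = list(range(1, 21))
knn_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(estimator=knn, X=digits.data, y=digits.target, cv=kfold)
    knn_scores.append(scores.mean())
    print(f"k={k}, Accuracy={scores.mean():.4f}")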
# Find best k
best_k = k_values[np.argmax(knn_scores)]
print(f"\nBest k for KNN: {best_k} with accuracy
{max(knn_scores):.4f}")
Output:
Tuning KNN with different k values:
k=1, Accuracy=0.9878
k=2, Accuracy=0.9844
k=3, Accuracy=0.9866
k=4, Accuracy=0.9816
k=5, Accuracy=0.9861
k=6, Accuracy=0.9850
k=7, Accuracy=0.9861
k=8, Accuracy=0.9833
k=9, Accuracy=0.9816
k=10, Accuracy=0.9822
k=11, Accuracy=0.9816
k=12, Accuracy=0.9805
k=13, Accuracy=0.9783
k=14, Accuracy=0.9772
k=15, Accuracy=0.9772
k=16, Accuracy=0.9761
k=17, Accuracy=0.9761
k=18, Accuracy=0.9750
k=19, Accuracy=0.9733
k=20, Accuracy=0.9711
Simple Linear Regression
Definition:
Y = a + bX + ε
Where:
● Y is the dependent (response) variable and X is the independent (explanatory) variable
● a is the intercept and b is the slope of the regression line
● ε is the random error term
Goal:
To find the best-fitting straight line (called the regression line)
that predicts Y from X.
Applications:
● Predicting sales from advertising spend
● Estimating house prices based on size
● Forecasting trends using historical data
Assumptions:
● The relationship between X and Y is linear
● Observations are independent of one another
● The errors have constant variance (homoscedasticity)
● The errors are approximately normally distributed
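Before turning to scikit-learn, it can help to see how the slope b and intercept a of the best-fitting line are obtained by least squares: b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. Below is a minimal sketch with made-up points; the data is purely illustrative and not related to the dataset used in the program that follows.

import numpy as np

# Illustrative data points (not from any dataset used below)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Least-squares estimates of the slope b and intercept a
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
print(f"Regression line: Y = {a:.3f} + {b:.3f}X")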
Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
nyc = pd.read_csv('/content/ave_hi_nyc_jan_1895-2018.csv')
nyc.head()
Output:
Date Value Anomaly
# Rename the 'Value' column to 'Temperature' (later code refers to nyc.Temperature)
nyc.columns = ['Date', 'Temperature', 'Anomaly']
nyc.Date = nyc.Date.floordiv(100)
nyc.tail()
Output:
Date  Temperature  Anomaly
119 2014 35.5 -1.9
120 2015 36.1 -1.3
121 2016 40.8 3.4
122 2017 42.8 5.4
123 2018 38.7 1.3
# Splitting the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(nyc.Date.values.reshape(-1, 1), nyc.Temperature.values, test_size=0.25, random_state=11)
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
Output:
(93, 1) (31, 1) (93,) (31,)
linear_regression = LinearRegression()
linear_regression.fit(X_train,y_train)
print(linear_regression.coef_)
print(linear_regression.intercept_)
Output:
[0.01939167]
-0.30779820252656265
y_pred = linear_regression.predict(X_test)
for p, e in zip(y_pred[::5], y_test[::5]):
print(f'y_pred: {p:.2f}, y_test: {e:.2f}')
Output:
y_pred: 37.86, y_test: 31.70
y_pred: 38.69, y_test: 34.80
y_pred: 37.00, y_test: 39.40
y_pred: 37.25, y_test: 45.70
y_pred: 38.05, y_test: 32.30
y_pred: 37.64, y_test: 33.80
y_pred: 36.94, y_test: 39.70
print(linear_regression.predict([[2019]]))
Output:
[38.84399018]
predict = (lambda x: linear_regression.coef_ * x + linear_regression.intercept_)
predict(2019)
Output:
array([38.84399018])
predict(1890)
Output:
array([36.34246432])
predict(2018)
Output:
array([38.82459851])
Visualizing the Dataset with the Regression Line
import seaborn as sns
axes = sns.scatterplot(data=nyc, x='Date', y='Temperature', hue='Temperature', palette='winter', legend=False)
axes.set_ylim(10, 70)  # scale the y-axis so the regression line's slope is visible
x = np.array([min(nyc.Date.values), max(nyc.Date.values)])
y = predict(x)
line = plt.plot(x, y)
Output:
Multiple Linear Regression with the California Housing Dataset
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
import pandas as pd
print(california.DESCR)
Output:
.. _california_housing_dataset:

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

.. rubric:: References
california_df = pd.DataFrame(california.data, columns=california.feature_names)
california_df['MedHouseValue'] = pd.Series(california.target)
print(california_df.head())
Output:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
Longitude MedHouseValue
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
california_df.describe()
california_df.shape
Output:
(20640, 9)
Visualizing the Features
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
sns.set_style('whitegrid')
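The plots themselves are not shown here. One common approach, sketched under the assumption that california_df is the DataFrame built above and that we want one scatterplot of each feature against MedHouseValue, is:

# Plot a random sample of the data: one scatterplot per feature (sample fraction assumed)
sample_df = california_df.sample(frac=0.1, random_state=17)
for feature in california.feature_names:
    plt.figure(figsize=(8, 4.5))
    sns.scatterplot(data=sample_df, x=feature, y='MedHouseValue', hue='MedHouseValue', palette='cool', legend=False)
plt.show()

The multiple regression fit itself also does not appear above. A minimal sketch of training LinearRegression on all eight features to predict MedHouseValue follows; the split parameters and variable names (multiple_lr, predicted) are assumptions, not values from the original.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, random_state=11)

# One coefficient per feature is what makes the regression "multiple"
multiple_lr = LinearRegression()
multiple_lr.fit(X_train, y_train)
for name, coef in zip(california.feature_names, multiple_lr.coef_):
    print(f'{name:>12}: {coef}')

# Evaluate the predictions on the held-out test set with the R^2 score
predicted = multiple_lr.predict(X_test)
print('R^2:', r2_score(y_test, predicted))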
1. What is Dimensionality Reduction?
It’s the process of reducing the number of input variables (features) in your dataset while preserving its structure or patterns. This is crucial when visualizing high-dimensional data like images (e.g., 784 pixels for MNIST, or the 64 pixels per image in the Digits dataset).
2. What is t-SNE?
t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique that maps high-dimensional samples down to two or three components while keeping similar samples close together, which makes it well suited for visualizing datasets like Digits.
Program:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=11)  # random_state value assumed
reduced_data = tsne.fit_transform(digits.data)
reduced_data.shape
Output:
(1797, 2)
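A natural next step, not shown above, is to scatter the two t-SNE components and color each point by the digit it represents; a minimal sketch:

import matplotlib.pyplot as plt

# Plot the 2D embedding, colored by each sample's digit class (0-9)
dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=digits.target, cmap='nipy_spectral_r')
plt.colorbar(dots)
plt.show()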
K-Means Clustering with the Iris Dataset
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.DESCR)
Output:
.. _iris_dataset:
:Summary Statistics:
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[t] for t in iris.target]
iris_df.describe()
Output:
sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
iris_df['species'].value_counts()
Output:
species count
setosa 50
versicolor 50
virginica 50
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=11)  # 3 clusters for the 3 species; random_state assumed
kmeans.fit(iris.data)
print(kmeans.labels_[0:50])
Output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
print(kmeans.labels_[50:100])
Output:
[2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
print(kmeans.labels_[100:150])
Output:
[2 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2
 2 1 2 2 2 1 2 2 2 1 2 2 1]
label_counts = pd.Series(kmeans.labels_).value_counts()
print(label_counts)
Output:
1 61
0 50
2 39
Name: count, dtype: int64
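Because the Iris samples are labeled, you can check how well the three clusters line up with the three species. Below is a minimal sketch using a pandas crosstab; note that the cluster numbers assigned by k-means are arbitrary, so only the grouping matters, not the particular numbers.

import pandas as pd

# Cross-tabulate the true species names against the k-means cluster labels
comparison = pd.crosstab(iris_df['species'], kmeans.labels_, rownames=['species'], colnames=['cluster'])
print(comparison)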
Dimensionality Reduction with Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=11)  # reduce the 4 features to 2 components; random_state assumed
pca.fit(iris.data)
iris_pca = pca.transform(iris.data)
iris_pca.shape
Output:
(150, 2)
iris_pca_df = pd.DataFrame(iris_pca, columns=['Component1', 'Component2'])
iris_pca_df['species'] = iris_df.species
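The reduced data can now be plotted in two dimensions. Below is a minimal sketch that scatters the two principal components colored by species and, as an assumed extra step, overlays the k-means cluster centers transformed into the same two-component space.

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter the two principal components, one color per species
axes = sns.scatterplot(data=iris_pca_df, x='Component1', y='Component2', hue='species', legend='brief', palette='cool')

# Transform the k-means cluster centers into PCA space and overlay them
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=100, c='k')
plt.show()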