ML All Prints

import numpy as np

import pandas as pd

data=pd.read_csv(r"uber.csv")
# test_df=pd.read_csv(r"test.csv")
print (data.shape)
print (data.columns)

(200000, 9)
Index(['Unnamed: 0', 'key', 'fare_amount', 'pickup_datetime',
'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
'dropoff_latitude', 'passenger_count'],
dtype='object')

data_x = data.iloc[:,0:-1].values
data_y = data.iloc[:,-1].values
print(data_y)

[1 1 1 ... 2 1 1]

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200000 non-null int64
1 key 200000 non-null object
2 fare_amount 200000 non-null float64
3 pickup_datetime 200000 non-null object
4 pickup_longitude 200000 non-null float64
5 pickup_latitude 200000 non-null float64
6 dropoff_longitude 199999 non-null float64
7 dropoff_latitude 199999 non-null float64
8 passenger_count 200000 non-null int64
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB

data["pickup_datetime"]=pd.to_datetime(data['pickup_datetime'])

data.head()

Unnamed: 0 key fare_amount \


0 24238194 2015-05-07 19:52:06.0000003 7.5
1 27835199 2009-07-17 20:04:56.0000002 7.7
2 44984355 2009-08-24 21:45:00.00000061 12.9
3 25894730 2009-06-26 08:22:21.0000001 5.3
4 17610152 2014-08-28 17:47:00.000000188 16.0

pickup_datetime pickup_longitude pickup_latitude \


0 2015-05-07 19:52:06+00:00 -73.999817 40.738354
1 2009-07-17 20:04:56+00:00 -73.994355 40.728225
2 2009-08-24 21:45:00+00:00 -74.005043 40.740770
3 2009-06-26 08:22:21+00:00 -73.976124 40.790844
4 2014-08-28 17:47:00+00:00 -73.925023 40.744085

dropoff_longitude dropoff_latitude passenger_count


0 -73.999512 40.723217 1
1 -73.994710 40.750325 1
2 -73.962565 40.772647 1
3 -73.965316 40.803349 3
4 -73.973082 40.761247 5
As this is taxi fare data, we know there are many factors that affect the price of a ride: the distance travelled, the time of travel, and the demand for and availability of taxis. Some trips are also costlier, for example those to or from the airport or routes that involve tolls.
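
The notebook below does not derive a distance feature from these coordinates, but since travelled distance is the dominant factor, here is a minimal sketch of how one could be computed with the haversine formula (the haversine_km helper and the distance_km column are my own illustration, not part of the original code):

import numpy as np

def haversine_km(lon1, lat1, lon2, lat2):
    # great-circle distance in kilometres between two (longitude, latitude) points
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# hypothetical extra feature built from the existing columns
data['distance_km'] = haversine_km(data['pickup_longitude'], data['pickup_latitude'],
                                   data['dropoff_longitude'], data['dropoff_latitude'])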

data.describe()

Unnamed: 0 fare_amount pickup_longitude pickup_latitude \


count 2.000000e+05 200000.000000 200000.000000 200000.000000
mean 2.771250e+07 11.359955 -72.527638 39.935885
std 1.601382e+07 9.901776 11.437787 7.720539
min 1.000000e+00 -52.000000 -1340.648410 -74.015515
25% 1.382535e+07 6.000000 -73.992065 40.734796
50% 2.774550e+07 8.500000 -73.981823 40.752592
75% 4.155530e+07 12.500000 -73.967154 40.767158
max 5.542357e+07 499.000000 57.418457 1644.421482

dropoff_longitude dropoff_latitude passenger_count


count 199999.000000 199999.000000 200000.000000
mean -72.525292 39.923890 1.684535
std 13.117408 6.794829 1.385997
min -3356.666300 -881.985513 0.000000
25% -73.991407 40.733823 1.000000
50% -73.980093 40.753042 1.000000
75% -73.963658 40.768001 2.000000
max 1153.572603 872.697628 208.000000

The first thing we can see is that the minimum fare is negative (-52), which is not a valid value, so we need to remove rows with negative fares. Secondly, passenger_count has a minimum of 0 and a maximum of 208, which is impossible, so we need to remove those rows as well; to be on the safe side we can assume a taxi carries at most 7 or 8 people.

#Lets check if there is any null value


data.isnull().sum()

Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64

Here we can see there is one null value each in dropoff_latitude and dropoff_longitude. Removing one or two rows from our huge dataset will not affect the analysis, so let's drop the rows that have null values.

data.dropna(inplace=True)
print(data.isnull().sum())

Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64

import matplotlib.pyplot as plt


import seaborn as sns
%matplotlib inline

sns.distplot(data['fare_amount'])

c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated
function and will be removed in a future version. Please adapt your code to
use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)

<AxesSubplot:xlabel='fare_amount', ylabel='Density'>

The distribution plot also shows that some fare values are negative.
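
Since distplot is deprecated (as the warning above notes), the same plot can be drawn with the newer API; a minimal sketch, assuming seaborn 0.11 or later:

# axes-level histogram with a KDE overlay, the suggested replacement for distplot
sns.histplot(data['fare_amount'], kde=True)
# or the figure-level equivalent:
# sns.displot(data['fare_amount'], kde=True)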

sns.distplot(data['pickup_latitude'])

c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated
function and will be removed in a future version. Please adapt your code to
use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)

<AxesSubplot:xlabel='pickup_latitude', ylabel='Density'>
Here we can see the values extend far beyond the valid latitude range of -90 to 90: the maximum goes above 1600, which cannot be correct.

sns.distplot(data['pickup_longitude'])

c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated
function and will be removed in a future version. Please adapt your code to
use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)

<AxesSubplot:xlabel='pickup_longitude', ylabel='Density'>

Here too, both the negative and positive values extend far beyond the realistic limits for a longitude.

sns.distplot(data['dropoff_longitude'])
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated
function and will be removed in a future version. Please adapt your code to
use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)

<AxesSubplot:xlabel='dropoff_longitude', ylabel='Density'>

# Similarly, the same issue appears here as well


sns.distplot(data['dropoff_latitude'])

c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated
function and will be removed in a future version. Please adapt your code to
use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)

<AxesSubplot:xlabel='dropoff_latitude', ylabel='Density'>
print("drop_off latitude min value",data["dropoff_latitude"].min())
print("drop_off latitude max value",data["dropoff_latitude"].max())
print("drop_off longitude min value", data["dropoff_longitude"].min())
print("drop_off longitude max value",data["dropoff_longitude"].max())
print("pickup latitude min value",data["pickup_latitude"].min())
print("pickup latitude max value",data["pickup_latitude"].max())
print("pickup longitude min value",data["pickup_longitude"].min())
print("pickup longitude max value",data["pickup_longitude"].max())

drop_off latitude min value -881.9855130000001
drop_off latitude max value 872.6976279999999
drop_off longitude min value -3356.6663
drop_off longitude max value 1153.5726029999998
pickup latitude min value -74.01551500000001
pickup latitude max value 1644.421482
pickup longitude min value -1340.64841
pickup longitude max value 57.418457

We can see the range of latitude and longitude in our test dataset; let's keep the same range in our train set so that noisy data is removed and we only keep values that belong to New York.

min_longitude = -74.263242
min_latitude = 40.573143
max_longitude = -72.986532
max_latitude = 41.709555

# Let's drop all the rows that fall outside the above boundary, as those are noisy data

tempdf=data[(data["dropoff_latitude"]<min_latitude) |
(data["pickup_latitude"]<min_latitude) |
(data["dropoff_longitude"]<min_longitude) |
(data["pickup_longitude"]<min_longitude) |
(data["dropoff_latitude"]>max_latitude) |
(data["pickup_latitude"]>max_latitude) |
(data["dropoff_longitude"]>max_longitude) |
(data["pickup_longitude"]>max_longitude) ]
print("before droping",data.shape)
data.drop(tempdf.index,inplace=True)
print("after droping",data.shape)
before droping (199999, 9)
after droping (195732, 9)

# Let's remove all the rows where the fare amount is negative

print("before dropping", data.shape)
data = data[data['fare_amount'] > 0]
print("after dropping", data.shape)

before dropping (195732, 9)

after dropping (195732, 9)

The price varies with the day and time: fares in the evening are usually higher than in the afternoon, prices change around Christmas, and weekends differ from weekdays. So let's create some extra features that capture these effects.

import calendar
data['day']=data['pickup_datetime'].apply(lambda x:x.day)
data['hour']=data['pickup_datetime'].apply(lambda x:x.hour)
data['weekday']=data['pickup_datetime'].apply(lambda
x:calendar.day_name[x.weekday()])
data['month']=data['pickup_datetime'].apply(lambda x:x.month)
data['year']=data['pickup_datetime'].apply(lambda x:x.year)
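
The same features could also be derived without apply, using pandas' vectorized .dt accessor; a minimal equivalent sketch (not what the notebook does, just standard pandas):

# vectorized equivalents of the apply/lambda calls above
data['day'] = data['pickup_datetime'].dt.day
data['hour'] = data['pickup_datetime'].dt.hour
data['weekday'] = data['pickup_datetime'].dt.day_name()   # e.g. 'Monday'
data['month'] = data['pickup_datetime'].dt.month
data['year'] = data['pickup_datetime'].dt.year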

data.head()

Unnamed: 0 key fare_amount \


0 24238194 2015-05-07 19:52:06.0000003 7.5
1 27835199 2009-07-17 20:04:56.0000002 7.7
2 44984355 2009-08-24 21:45:00.00000061 12.9
3 25894730 2009-06-26 08:22:21.0000001 5.3
4 17610152 2014-08-28 17:47:00.000000188 16.0

pickup_datetime pickup_longitude pickup_latitude \


0 2015-05-07 19:52:06+00:00 -73.999817 40.738354
1 2009-07-17 20:04:56+00:00 -73.994355 40.728225
2 2009-08-24 21:45:00+00:00 -74.005043 40.740770
3 2009-06-26 08:22:21+00:00 -73.976124 40.790844
4 2014-08-28 17:47:00+00:00 -73.925023 40.744085

   dropoff_longitude  dropoff_latitude  passenger_count  day  hour   weekday  \
0         -73.999512         40.723217                1    7    19  Thursday
1         -73.994710         40.750325                1   17    20    Friday
2         -73.962565         40.772647                1   24    21    Monday
3         -73.965316         40.803349                3   26     8    Friday
4         -73.973082         40.761247                5   28    17  Thursday

month year
0 5 2015
1 7 2009
2 8 2009
3 6 2009
4 8 2014

# Here we can see the weekdays are Monday, Tuesday, and so on, so we need to convert them to numerical form
data['weekday'] = data['weekday'].map({'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3,
                                       'Thursday': 4, 'Friday': 5, 'Saturday': 6})

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195732 entries, 0 to 199999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195732 non-null int64
1 key 195732 non-null object
2 fare_amount 195732 non-null float64
3 pickup_datetime 195732 non-null datetime64[ns, UTC]
4 pickup_longitude 195732 non-null float64
5 pickup_latitude 195732 non-null float64
6 dropoff_longitude 195732 non-null float64
7 dropoff_latitude 195732 non-null float64
8 passenger_count 195732 non-null int64
9 day 195732 non-null int64
10 hour 195732 non-null int64
11 weekday 195732 non-null int64
12 month 195732 non-null int64
13 year 195732 non-null int64
dtypes: datetime64[ns, UTC](1), float64(5), int64(7), object(1)
memory usage: 22.4+ MB

# We will keep only those rows where the number of passengers is less than or equal to 8

data = data[data['passenger_count'] <= 8]

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195731 entries, 0 to 199999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195731 non-null int64
1 key 195731 non-null object
2 fare_amount 195731 non-null float64
3 pickup_datetime 195731 non-null datetime64[ns, UTC]
4 pickup_longitude 195731 non-null float64
5 pickup_latitude 195731 non-null float64
6 dropoff_longitude 195731 non-null float64
7 dropoff_latitude 195731 non-null float64
8 passenger_count 195731 non-null int64
9 day 195731 non-null int64
10 hour 195731 non-null int64
11 weekday 195731 non-null int64
12 month 195731 non-null int64
13 year 195731 non-null int64
dtypes: datetime64[ns, UTC](1), float64(5), int64(7), object(1)
memory usage: 22.4+ MB

# The key and pickup_datetime columns are no longer needed, as we have already extracted features from them

data.drop(["key", "pickup_datetime"], axis=1, inplace=True)

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195731 entries, 0 to 199999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195731 non-null int64
1 fare_amount 195731 non-null float64
2 pickup_longitude 195731 non-null float64
3 pickup_latitude 195731 non-null float64
4 dropoff_longitude 195731 non-null float64
5 dropoff_latitude 195731 non-null float64
6 passenger_count 195731 non-null int64
7 day 195731 non-null int64
8 hour 195731 non-null int64
9 weekday 195731 non-null int64
10 month 195731 non-null int64
11 year 195731 non-null int64
dtypes: float64(5), int64(7)
memory usage: 19.4 MB

Let's divide the dataset into a training set and a validation (test) set.

from sklearn.model_selection import train_test_split

x=data.drop("fare_amount", axis=1)

y=data['fare_amount']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)

x_train.head()

Unnamed: 0 pickup_longitude pickup_latitude dropoff_longitude \


7570 51992033 -73.991973 40.742657 -73.991358
155037 10241908 -73.964111 40.807957 -73.966688
67010 48963133 -73.987658 40.700823 -73.985670
155236 30446807 -73.999577 40.726656 -74.007562
187226 40739497 -73.983377 40.738938 -73.978432

dropoff_latitude passenger_count day hour weekday month year


7570 40.750086 1 31 22 1 10 2011
155037 40.803299 1 18 14 3 6 2014
67010 40.770540 1 2 22 0 2 2014
155236 40.713286 1 29 18 3 5 2013
187226 40.745286 1 12 2 6 6 2010

x_test.head()

Unnamed: 0 pickup_longitude pickup_latitude dropoff_longitude \


51869 5536882 -73.953347 40.767932 -73.990867
44724 35054768 -73.137393 41.366138 -73.137393
47705 15258057 -74.009707 40.712480 -73.962757
17345 34739111 -74.016055 40.715077 -74.008840
179351 53446498 -73.950474 40.784003 -73.971086

dropoff_latitude passenger_count day hour weekday month year


51869 40.751295 5 8 17 0 11 2009
44724 41.366138 2 11 20 0 7 2010
47705 40.758977 1 3 21 0 7 2011
17345 40.711375 3 4 6 5 1 2013
179351 40.748328 1 18 22 0 9 2011
x_train.shape

(156584, 11)

x_test.shape

(39147, 11)

# Let's run the model.

# As we have to build a regression model, let's start with linear regression

from sklearn.linear_model import LinearRegression

lrmodel=LinearRegression()
lrmodel.fit(x_train, y_train)

LinearRegression()


predictedvalues = lrmodel.predict(x_test)

#lets calculate rmse for linear Regression model


from sklearn.metrics import mean_squared_error
lrmodelrmse = np.sqrt(mean_squared_error(predictedvalues, y_test))
print("RMSE value for Linear regression is", lrmodelrmse)

RMSE value for Linear regression is 8.363019859396488
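
To put this RMSE in context, it can be compared against a naive baseline that always predicts the mean training fare; a minimal sketch (this baseline is my own addition and its result is not shown in the original run):

# RMSE of a constant predictor that always outputs the mean fare seen in training
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print("RMSE value for mean-fare baseline is", baseline_rmse)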

#Lets see with Random Forest and calculate its rmse


from sklearn.ensemble import RandomForestRegressor
# rfrmodel = RandomForestRegressor(n_estimators=100, random_state=101)
rfrmodel = RandomForestRegressor(n_estimators=50, random_state=101)

rfrmodel.fit(x_train,y_train)

RandomForestRegressor(n_estimators=50, random_state=101)

rfrmodel_pred= rfrmodel.predict(x_test)

rfrmodel_rmse=np.sqrt(mean_squared_error(rfrmodel_pred, y_test))
print("RMSE value for Random forest regression is ",rfrmodel_rmse)

RMSE value for Random forest regression is 3.9973617568779463

# The Random Forest Regressor gives a much lower RMSE, so we can use it as the final model
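
Since the random forest is kept as the final model, its feature_importances_ attribute gives a quick view of which engineered features drive the fare; a minimal sketch (the printing is my own choice, not part of the original notebook):

# rank the features the forest found most useful
importances = pd.Series(rfrmodel.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False))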


Classify the email using the binary classification method. Email spam detection has two states: a) Normal State – Not Spam, b) Abnormal State – Spam. Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze their performance. Dataset link: the emails.csv dataset on Kaggle.

Support Vector Machine is a supervised machine learning algorithm that can be used for both classification and regression challenges, although it is mostly used for classification. We plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyperplane that best separates the two classes.
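
As a minimal illustration of the hyperplane idea (a toy example of my own, separate from the spam task below), a linear SVM fitted on two tiny 2-D classes exposes the learned hyperplane w·x + b = 0 through coef_ and intercept_:

import numpy as np
from sklearn.svm import SVC

# two small, linearly separable classes in a 2-dimensional feature space
X_toy = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear')
clf.fit(X_toy, y_toy)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])   # parameters of the separating hyperplane
print("prediction for [2, 2]:", clf.predict([[2, 2]]))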

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn import svm

data = pd.read_csv('emails.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB

# Data preprocessing
X = data.drop(columns=['Email No.', 'Prediction'])  # Remove the identifier and target columns
y = data['Prediction']  # Target variable

from sklearn.model_selection import train_test_split


from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train K-Nearest Neighbors (K-NN) model


knn_model = KNeighborsClassifier(n_neighbors=5) # You can adjust n_neighbors
knn_model.fit(X_train, y_train)
knn_predictions = knn_model.predict(X_test)

# Train Support Vector Machine (SVM) model


svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)

# Model evaluation
def evaluate_model(predictions, model_name):
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    roc_auc = roc_auc_score(y_test, predictions)
    print(f"Performance metrics for {model_name}:")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-Score: {f1}")
    print(f"ROC AUC: {roc_auc}")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, predictions))

evaluate_model(knn_predictions, "K-Nearest Neighbors")


evaluate_model(svm_predictions, "Support Vector Machine")

Performance metrics for K-Nearest Neighbors:


Accuracy: 0.9893719806763285
Precision: 0.5
Recall: 0.7272727272727273
F1-Score: 0.5925925925925926
ROC AUC: 0.8597301136363636
Confusion Matrix:
[[1016 8]
[ 3 8]]
Performance metrics for Support Vector Machine:
Accuracy: 0.9893719806763285
Precision: 0.0
Recall: 0.0
F1-Score: 0.0
ROC AUC: 0.5
Confusion Matrix:
[[1024 0]
[ 11 0]]

c:\Users\Kedar\AppData\Local\Programs\Python\Python39\lib\site-packages\
sklearn\metrics\_classification.py:1327: UndefinedMetricWarning: Precision is
ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
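
The SVM above predicts every email as not-spam (precision and recall of 0), which often happens when an RBF-kernel SVC is fitted on unscaled word-count features. A sketch of one possible remedy, scaling the features before fitting (this was not part of the original run, so no metrics are shown for it):

from sklearn.preprocessing import StandardScaler

# scale the word-count features so no single column dominates the RBF kernel
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm_scaled = SVC()
svm_scaled.fit(X_train_scaled, y_train)
evaluate_model(svm_scaled.predict(X_test_scaled), "Support Vector Machine (scaled features)")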
Given a bank customer, build a neural network-based classifier that can determine whether they will leave
or not in the next 6 months. Dataset Description: The case study is from an open-source dataset from
Kaggle. The dataset contains 10,000 sample points with 14 distinct features such as CustomerId,
CreditScore, Geography, Gender, Age, Tenure, Balance, etc. Link to the Kaggle project:
https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling Perform following steps:

1. Read the dataset.


2. Distinguish the feature and target set and divide the data set into training and test sets.
3. Normalize the train and test data.
4. Initialize and build the model. Identify the points of improvement and implement the same.
5. Print the accuracy score and confusion matrix (5 points).
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('Churn_Modelling.csv')

X = data.drop(columns=['Exited', 'CustomerId', 'Surname', 'RowNumber'])  # Exclude non-feature columns
y = data['Exited']  # Target

# Step 3: Data Preprocessing


# Handle missing values and encode categorical variables

# Dropping the identifier columns (they carry no predictive information):

data = data.drop(['CustomerId', 'Surname', 'RowNumber'], axis=1)
print(data.columns)

# Replacing missing values with a specific value (e.g., mean):


# data['column_name'].fillna(data['column_name'].mean(), inplace=True)

Index(['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance',


'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary',
'Exited'],
dtype='object')

# You need to ensure that the columns 'Geography' and 'Gender' are present in the DataFrame X
# Add additional error handling to verify the column names
columns_to_encode = ['Geography', 'Gender']
for column in columns_to_encode:
    if column not in X.columns:
        raise ValueError(f"Column '{column}' not found in the DataFrame X.")

# Encode categorical variables like "Geography" and "Gender" into numerical format using one-hot encoding
X = pd.get_dummies(X, columns=['Geography', 'Gender'], drop_first=True)

scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# Step 5: Initialize and Build the Model


model = keras.Sequential([
keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
keras.layers.Dense(32, activation='relu'),
keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy',
metrics=['accuracy'])

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train the model


model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)

Epoch 1/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4769 -
accuracy: 0.7947
Epoch 2/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4413 -
accuracy: 0.8098
Epoch 3/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4229 -
accuracy: 0.8200
Epoch 4/20
250/250 [==============================] - 0s 2ms/step - loss: 0.4007 -
accuracy: 0.8299
Epoch 5/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3800 -
accuracy: 0.8406
Epoch 6/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3663 -
accuracy: 0.8486
Epoch 7/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3593 -
accuracy: 0.8511
Epoch 8/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3537 -
accuracy: 0.8551
Epoch 9/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3502 -
accuracy: 0.8575
Epoch 10/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3482 -
accuracy: 0.8574
Epoch 11/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3450 -
accuracy: 0.8585
Epoch 12/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3435 -
accuracy: 0.8581
Epoch 13/20
250/250 [==============================] - 1s 2ms/step - loss: 0.3411 -
accuracy: 0.8612
Epoch 14/20
250/250 [==============================] - 1s 2ms/step - loss: 0.3412 -
accuracy: 0.8601
Epoch 15/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3378 -
accuracy: 0.8610
Epoch 16/20
250/250 [==============================] - 1s 6ms/step - loss: 0.3371 -
accuracy: 0.8605
Epoch 17/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3364 -
accuracy: 0.8608
Epoch 18/20
250/250 [==============================] - 1s 6ms/step - loss: 0.3366 -
accuracy: 0.8612
Epoch 19/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3348 -
accuracy: 0.8604
Epoch 20/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3326 -
accuracy: 0.8634

<keras.callbacks.History at 0x1ebc037af40>

# Step 6: Evaluate the Model


y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5).astype(int) # Convert to binary prediction

63/63 [==============================] - 0s 2ms/step

accuracy = accuracy_score(y_test, y_pred)


confusion = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(confusion)

Accuracy: 0.86
Confusion Matrix:
[[1557 50]
[ 230 163]]
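
One possible "point of improvement" (step 4 of the task) is to hold out part of the training data for validation and stop training once the validation loss stops improving; a minimal sketch using Keras's EarlyStopping callback (an assumption of mine, not something the original run did):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(X_train, y_train,
          validation_split=0.2,   # hold out 20% of the training data for validation
          epochs=50, batch_size=32,
          callbacks=[early_stop], verbose=1)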
import pandas as pd
import numpy as np

data = pd.read_csv("diabetes.csv")
data

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
.. ... ... ... ... ... ...
763 10 101 76 48 180 32.9
764 2 122 70 27 0 36.8
765 5 121 72 23 112 26.2
766 1 126 60 0 0 30.1
767 1 93 70 31 0 30.4

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
.. ... ... ...
763 0.171 63 0
764 0.340 27 0
765 0.245 30 0
766 0.349 47 1
767 0.315 23 0

[768 rows x 9 columns]

df = pd.DataFrame(data)
df.head()

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

df.isnull().sum()

Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
import matplotlib.pyplot as plt   # needed for the figure call below
import seaborn as sns

plt.figure(figsize=(12,10))  # set the size of the figure to 12 by 10
p = sns.heatmap(df.corr(), annot=True, cmap='RdYlGn')  # seaborn has a very simple solution for heatmaps

Manipulating and Cleaning our dataset


cols_clean = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
              'DiabetesPedigreeFunction']
for i in cols_clean:
    df[i] = df[i].replace(0, np.NaN)          # treat zeros as missing values
    cols_mean = int(df[i].mean(skipna=True))  # column mean, ignoring the NaNs
    df[i] = df[i].replace(np.NaN, cols_mean)  # fill the missing values with the mean
data1 = df
data1.head().style.highlight_max(color="lightblue").highlight_min(color="red")

<pandas.io.formats.style.Styler at 0x2a3e3087580>
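
The loop above imputes zeros with the (integer-truncated) column mean; a less outlier-sensitive alternative is median imputation. A minimal sketch using scikit-learn's SimpleImputer, applied to the raw columns instead of the loop above (my own variant, not what the notebook does):

from sklearn.impute import SimpleImputer

# treat zeros in the clinical columns as missing and fill them with the column median
imputer = SimpleImputer(missing_values=0, strategy='median')
df[cols_clean] = imputer.fit_transform(df[cols_clean])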

print(data1.describe())
Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.00000
mean 3.845052 121.682292 72.386719 29.108073 155.28125
std 3.369578 30.435999 12.096642 8.791221 85.02155
min 0.000000 44.000000 24.000000 7.000000 14.00000
25% 1.000000 99.750000 64.000000 25.000000 121.50000
50% 3.000000 117.000000 72.000000 29.000000 155.00000
75% 6.000000 140.250000 80.000000 32.000000 155.00000
max 17.000000 199.000000 122.000000 99.000000 846.00000

BMI DiabetesPedigreeFunction Age Outcome


count 768.000000 768.000000 768.000000 768.000000
mean 32.450911 0.471876 33.240885 0.348958
std 6.875366 0.331329 11.760232 0.476951
min 18.200000 0.078000 21.000000 0.000000
25% 27.500000 0.243750 24.000000 0.000000
50% 32.000000 0.372500 29.000000 0.000000
75% 36.600000 0.626250 41.000000 1.000000
max 67.100000 2.420000 81.000000 1.000000

import matplotlib.pyplot as plt


import seaborn as sns
%matplotlib inline

graph = ['Glucose','Insulin','BMI','Age','Outcome']
sns.set()
print(sns.pairplot(data1[graph], hue='Outcome', diag_kind='kde'))

<seaborn.axisgrid.PairGrid object at 0x000002A3E0B306A0>


# For simplicity and to analyse the most relevant data, we could select only three features
# of the dataset (Glucose, Insulin and BMI); here we keep all of them
# Defining variables and features of the dataset for splitting
# q_cols = ['Glucose','Insulin','BMI','Outcome']
q_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
          'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

df = data1[q_cols]
print(df.head(2))

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148.0 72.0 35.0 155.0 33.6
1 1 85.0 66.0 29.0 155.0 26.6

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0

# # let's split the data into training and testing datasets


# split = 0.75 # 75% train and 25% test dataset
# total_len = len(df)
# split_df = int(total_len*split)
# train, test = df.iloc[:split_df,0:4],df.iloc[split_df:,0:4]
# train_x = train[['Glucose','Insulin','BMI']]
# train_y = train['Outcome']
# test_x = test[['Glucose','Insulin','BMI']]
# test_y = test['Outcome']

# Split the data into training and testing datasets


split = 0.75 # 75% train and 25% test dataset
total_len = len(df)
split_df = int(total_len * split)
train, test = df.iloc[:split_df], df.iloc[split_df:]

# Select the columns specified in q_cols for training and testing


train_x = train[q_cols[:-1]] # Exclude the 'Outcome' column from features
train_y = train['Outcome'] # Target variable
test_x = test[q_cols[:-1]] # Exclude the 'Outcome' column from features
test_y = test['Outcome'] # Target variable

a = len(train_x)
b = len(test_x)
print('Training data =', a, '\n', 'Testing data =', b, '\n', 'Total data length =', a + b)

Training data = 576
Testing data = 192
Total data length = 768

from sklearn.neighbors import KNeighborsClassifier


from sklearn import metrics

def knn(x_train, y_train, x_test, y_test, n):
    n_range = range(1, n)
    results = []
    for n in n_range:
        knn = KNeighborsClassifier(n_neighbors=n)
        knn.fit(x_train, y_train)
        # Predict the response for the test dataset
        predict_y = knn.predict(x_test)
        accuracy = metrics.accuracy_score(y_test, predict_y)
        # matrix = confusion_matrix(y_test, predict_y)
        # seaborn_matrix = sns.heatmap(matrix, annot=True, cmap="Blues", cbar=True)
        results.append(accuracy)
    return results

n= 500
output = knn(train_x,train_y,test_x,test_y,n)
n_range = range(1, n)
plt.plot(n_range, output)

[<matplotlib.lines.Line2D at 0x2a3ec0c6100>]
# The best k to optimize this model lies between 100 and 200, offering about 77% accuracy
# The ideal k value for this dataset is around 150, give or take
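
Rather than reading the best k off the accuracy curve, it can also be chosen by cross-validation on the training set; a minimal sketch, reusing train_x/train_y (the candidate grid is my own choice):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for a few candidate values of k
for k in [5, 25, 50, 100, 150, 200]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), train_x, train_y, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")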

from sklearn.metrics import confusion_matrix


from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, fbeta_score)
y_pred = knn(train_x,train_y,test_x,test_y,n)
cnf_matrix = confusion_matrix(test_y, y_pred)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18924/597529570.py in <module>
2 from sklearn.metrics import accuracy_score, precision_score,
recall_score, f1_score, fbeta_score
3 y_pred = knn(train_x,train_y,test_x,test_y,n)
----> 4 cnf_matrix = confusion_matrix(test_y, y_pred)

c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py in confusion_matrix(y_true, y_pred, labels,
sample_weight, normalize)
305 (0, 2, 1, 1)
306 """
--> 307 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
308 if y_type not in ("binary", "multiclass"):
309 raise ValueError("%s is not supported" % y_type)

c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
82 y_pred : array or indicator matrix
83 """
---> 84 check_consistent_length(y_true, y_pred)
85 type_true = type_of_target(y_true, input_name="y_true")
86 type_pred = type_of_target(y_pred, input_name="y_pred")

c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\utils\validation.py in check_consistent_length(*arrays)
385 uniques = np.unique(lengths)
386 if len(uniques) > 1:
--> 387 raise ValueError(
388 "Found input variables with inconsistent numbers of
samples: %r"
389 % [int(l) for l in lengths]

ValueError: Found input variables with inconsistent numbers of samples: [192, 499]

p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')


plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Text(0.5, 12.5, 'Predicted label')

# The knn() helper above returns a list of accuracies (one per k), not predictions for the
# test set, which is why confusion_matrix complained about 192 vs. 499 samples.
# Define a KNN function that returns the predictions instead

def knn2(x_train, y_train, x_test, y_test, n):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(x_train, y_train)
    # Predict the response for the test dataset
    predict_y = knn.predict(x_test)
    return predict_y

n = 500
y_pred = knn2(train_x, train_y, test_x, test_y, n)
cnf_matrix = confusion_matrix(test_y, y_pred)

# Now you can calculate other metrics like accuracy, precision, recall, etc.
accuracy = accuracy_score(test_y, y_pred)
precision = precision_score(test_y, y_pred)
recall = recall_score(test_y, y_pred)
f1 = f1_score(test_y, y_pred)
fbeta = fbeta_score(test_y, y_pred, beta=0.5)

c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py:1327: UndefinedMetricWarning: Precision is
ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
# Print the confusion matrix and other metrics
print("Confusion Matrix:\n", cnf_matrix)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("F-beta Score:", fbeta)

Confusion Matrix:
[[122 0]
[ 70 0]]
Accuracy: 0.6354166666666666
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
F-beta Score: 0.0

p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')


plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Text(0.5, 12.5, 'Predicted label')

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

# Load your dataset


# Replace 'your_dataset.csv' with the actual file path to your dataset
df = pd.read_csv('diabetes.csv')

# Define your feature columns and target column


q_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_col = 'Outcome'

# Split the data into features (X) and target (y)


X = df[q_cols]
y = df[target_col]

# Split the data into training and testing datasets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=42)

# Perform feature scaling (standardization) on the features


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train a K-nearest neighbors (KNN) classifier


k = 5 # You can adjust the value of k
knn_classifier = KNeighborsClassifier(n_neighbors=k)
knn_classifier.fit(X_train_scaled, y_train)

# Make predictions on the test data


y_pred = knn_classifier.predict(X_test_scaled)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the results


print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(classification_rep)

Accuracy: 0.6822916666666666
Confusion Matrix:
[[94 29]
[32 37]]
Classification Report:
precision recall f1-score support

0 0.75 0.76 0.76 123


1 0.56 0.54 0.55 69

accuracy 0.68 192


macro avg 0.65 0.65 0.65 192
weighted avg 0.68 0.68 0.68 192

p = sns.heatmap(pd.DataFrame(conf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')


plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Text(0.5, 12.5, 'Predicted label')
