ML All Prints
import pandas as pd
data=pd.read_csv(r"uber.csv")
# test_df=pd.read_csv(r"test.csv")
print (data.shape)
print (data.columns)
(200000, 9)
Index(['Unnamed: 0', 'key', 'fare_amount', 'pickup_datetime',
'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
'dropoff_latitude', 'passenger_count'],
dtype='object')
data_x = data.iloc[:,0:-1].values
data_y = data.iloc[:,-1].values
print(data_y)
[1 1 1 ... 2 1 1]
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200000 non-null int64
1 key 200000 non-null object
2 fare_amount 200000 non-null float64
3 pickup_datetime 200000 non-null object
4 pickup_longitude 200000 non-null float64
5 pickup_latitude 200000 non-null float64
6 dropoff_longitude 199999 non-null float64
7 dropoff_latitude 199999 non-null float64
8 passenger_count 200000 non-null int64
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB
data["pickup_datetime"]=pd.to_datetime(data['pickup_datetime'])
data.head()
data.describe()
Looking at the summary statistics, the first issue is that the minimum fare is negative (-62), which is not a valid value, so the rows with negative fares need to be removed. Secondly, passenger_count has a minimum of 0 and a maximum of 208, which is impossible; to be safe we can assume a taxi carries at most seven or eight passengers and drop the rest.
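The printed cells do not actually show a fare filter (and the row counts later in this print suggest it was not applied in this particular run), so the following is only a minimal sketch of how the invalid fares could be removed:
# rows with a positive fare; assigning this back to data would apply the filter
valid_fares = data[data['fare_amount'] > 0]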
print(data.isnull().sum())
Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64
Here we can see there are null values in dropoff_latitude and dropoff_longitude (one in each). Removing a single row from such a huge dataset will not affect our analysis, so let's drop the rows containing null values.
data.dropna(inplace=True)
print(data.isnull().sum())
Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64
import seaborn as sns
sns.distplot(data['fare_amount'])
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated
function and will be removed in a future version. Please adapt your code to
use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='fare_amount', ylabel='Density'>
The distribution plot also shows that some fare values are negative.
sns.distplot(data['pickup_latitude'])
<AxesSubplot:xlabel='pickup_latitude', ylabel='Density'>
Here the minimum value goes below -3000 and the maximum above 2000, which is impossible for a latitude (valid latitudes lie between -90 and 90).
sns.distplot(data['pickup_longitude'])
<AxesSubplot:xlabel='pickup_longitude', ylabel='Density'>
Here too, both the negative and positive values exceed the real-world limits by far.
sns.distplot(data['dropoff_longitude'])
<AxesSubplot:xlabel='dropoff_longitude', ylabel='Density'>
sns.distplot(data['dropoff_latitude'])
<AxesSubplot:xlabel='dropoff_latitude', ylabel='Density'>
print("drop_off latitude min value",data["dropoff_latitude"].min())
print("drop_off latitude max value",data["dropoff_latitude"].max())
print("drop_off longitude min value", data["dropoff_longitude"].min())
print("drop_off longitude max value",data["dropoff_longitude"].max())
print("pickup latitude min value",data["pickup_latitude"].min())
print("pickup latitude max value",data["pickup_latitude"].max())
print("pickup longitude min value",data["pickup_longitude"].min())
print("pickup longitude max value",data["pickup_longitude"].max())
We can see the range of latitude and longitude in the test dataset; let's keep the same range in our train set so that noisy data is removed and we only keep values that belong to New York.
min_longitude = -74.263242
min_latitude = 40.573143
max_longitude = -72.986532
max_latitude = 41.709555
# let's drop all rows that fall outside the boundary above, as those are noisy data
tempdf=data[(data["dropoff_latitude"]<min_latitude) |
(data["pickup_latitude"]<min_latitude) |
(data["dropoff_longitude"]<min_longitude) |
(data["pickup_longitude"]<min_longitude) |
(data["dropoff_latitude"]>max_latitude) |
(data["pickup_latitude"]>max_latitude) |
(data["dropoff_longitude"]>max_longitude) |
(data["pickup_longitude"]>max_longitude) ]
print("before droping",data.shape)
data.drop(tempdf.index,inplace=True)
print("after droping",data.shape)
before droping (199999, 9)
after droping (195732, 9)
Prices differ by day and time: evening fares tend to be higher than afternoon ones, fares around Christmas differ, and weekend fares differ from weekday fares. So let's create some extra features that capture these effects.
import calendar
data['day']=data['pickup_datetime'].apply(lambda x:x.day)
data['hour']=data['pickup_datetime'].apply(lambda x:x.hour)
data['weekday']=data['pickup_datetime'].apply(lambda x:calendar.day_name[x.weekday()])
data['month']=data['pickup_datetime'].apply(lambda x:x.month)
data['year']=data['pickup_datetime'].apply(lambda x:x.year)
data.head()
(data.head() output truncated in this print; the new day, hour, weekday, month and year columns are now present)
# here we can see that weekdays are stored as Monday, Tuesday and so on, so we need to convert them to numerical form
data.weekday = data.weekday.map({'Sunday':0,'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6})
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195732 entries, 0 to 199999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195732 non-null int64
1 key 195732 non-null object
2 fare_amount 195732 non-null float64
3 pickup_datetime 195732 non-null datetime64[ns, UTC]
4 pickup_longitude 195732 non-null float64
5 pickup_latitude 195732 non-null float64
6 dropoff_longitude 195732 non-null float64
7 dropoff_latitude 195732 non-null float64
8 passenger_count 195732 non-null int64
9 day 195732 non-null int64
10 hour 195732 non-null int64
11 weekday 195732 non-null int64
12 month 195732 non-null int64
13 year 195732 non-null int64
dtypes: datetime64[ns, UTC](1), float64(5), int64(7), object(1)
memory usage: 22.4+ MB
# we will keep only those rows where the number of passengers is less than or equal to 8
data=data[data['passenger_count']<=8]
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195731 entries, 0 to 199999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195731 non-null int64
1 key 195731 non-null object
2 fare_amount 195731 non-null float64
3 pickup_datetime 195731 non-null datetime64[ns, UTC]
4 pickup_longitude 195731 non-null float64
5 pickup_latitude 195731 non-null float64
6 dropoff_longitude 195731 non-null float64
7 dropoff_latitude 195731 non-null float64
8 passenger_count 195731 non-null int64
9 day 195731 non-null int64
10 hour 195731 non-null int64
11 weekday 195731 non-null int64
12 month 195731 non-null int64
13 year 195731 non-null int64
dtypes: datetime64[ns, UTC](1), float64(5), int64(7), object(1)
memory usage: 22.4+ MB
# the key and pickup_datetime columns are no longer needed, as we have already extracted features from them
data.drop(['key', 'pickup_datetime'], axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195731 entries, 0 to 199999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195731 non-null int64
1 fare_amount 195731 non-null float64
2 pickup_longitude 195731 non-null float64
3 pickup_latitude 195731 non-null float64
4 dropoff_longitude 195731 non-null float64
5 dropoff_latitude 195731 non-null float64
6 passenger_count 195731 non-null int64
7 day 195731 non-null int64
8 hour 195731 non-null int64
9 weekday 195731 non-null int64
10 month 195731 non-null int64
11 year 195731 non-null int64
dtypes: float64(5), int64(7)
memory usage: 19.4 MB
Let's divide the dataset into a train set and a validation (test) set.
from sklearn.model_selection import train_test_split
x = data.drop("fare_amount", axis=1)
y = data['fare_amount']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)
x_train.head()
x_test.head()
x_train.shape
(156584, 11)
x_test.shape
(39147, 11)
from sklearn.linear_model import LinearRegression
lrmodel = LinearRegression()
lrmodel.fit(x_train, y_train)
LinearRegression()
predictedvalues = lrmodel.predict(x_test)
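The print does not include an evaluation cell for the linear model; a minimal sketch of computing its RMSE, mirroring the random-forest evaluation below, might look like this:
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the linear regression predictions on the validation set
lrmodel_rmse = np.sqrt(mean_squared_error(y_test, predictedvalues))
print("RMSE value for Linear regression is", lrmodel_rmse)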
from sklearn.ensemble import RandomForestRegressor
rfrmodel = RandomForestRegressor(n_estimators=50, random_state=101)
rfrmodel.fit(x_train, y_train)
RandomForestRegressor(n_estimators=50, random_state=101)
rfrmodel_pred= rfrmodel.predict(x_test)
rfrmodel_rmse=np.sqrt(mean_squared_error(rfrmodel_pred, y_test))
print("RMSE value for Random forest regression is ",rfrmodel_rmse)
Support Vector Machine is a supervised machine-learning algorithm that can be used for both classification and regression problems, though it is mostly used for classification. Each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn import svm
data = pd.read_csv('emails.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB
# Data preprocessing
X = data.drop(columns=['Email No.', 'Prediction'])  # Remove the identifier and target columns
y = data['Prediction']  # Target variable (the dataset's label column, per the info() above)
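The cell that splits the data and fits the SVM is not included in this print (GridSearchCV is imported but its use is not shown either). A minimal sketch, assuming a plain SVC and an 80/20 split, might look like this:
from sklearn.model_selection import train_test_split
from sklearn import svm

# hold out 20% of the emails for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit a support vector classifier (the kernel choice is an assumption, not shown in the print)
svm_model = svm.SVC(kernel='linear')
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)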
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Model evaluation
def evaluate_model(predictions, model_name):
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    roc_auc = roc_auc_score(y_test, predictions)
    print(f"Performance metrics for {model_name}:")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-Score: {f1}")
    print(f"ROC AUC: {roc_auc}")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, predictions))
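The call that produced the warning below is not shown; assuming the SVM fit sketched above, it would look like:
evaluate_model(svm_predictions, "SVM")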
c:\Users\Kedar\AppData\Local\Programs\Python\Python39\lib\site-packages\
sklearn\metrics\_classification.py:1327: UndefinedMetricWarning: Precision is
ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Given a bank customer, build a neural network-based classifier that can determine whether they will leave or not in the next 6 months. Dataset Description: The case study uses an open-source dataset from Kaggle. The dataset contains 10,000 sample points with 14 distinct features such as CustomerId, CreditScore, Geography, Gender, Age, Tenure, Balance, etc. Link to the Kaggle project: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling. Perform the following steps:
data = pd.read_csv('Churn_Modelling.csv')
# Build the feature matrix X and target y (assuming the standard columns of this
# Kaggle dataset; the target column is 'Exited')
X = data.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Exited'])
y = data['Exited']
# Ensure that the columns 'Geography' and 'Gender' are present in the DataFrame X
# (basic error handling to verify the column names)
columns_to_encode = ['Geography', 'Gender']
for column in columns_to_encode:
    if column not in X.columns:
        raise ValueError(f"Column '{column}' not found in the DataFrame X.")
# Encode the categorical variables 'Geography' and 'Gender' into numerical format
# using one-hot encoding
X = pd.get_dummies(X, columns=['Geography', 'Gender'], drop_first=True)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
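The cell that defines and trains the network is not part of this print. A minimal sketch, assuming a small feed-forward network, an 80/20 split and batch size 32 (8,000 training rows / 32 gives the 250 steps per epoch seen in the log below), might look like this:
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

# hold out 20% of the customers for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# the hidden-layer sizes are assumptions; only the final sigmoid unit is required for binary churn output
model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32)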
Epoch 1/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4769 -
accuracy: 0.7947
Epoch 2/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4413 -
accuracy: 0.8098
Epoch 3/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4229 -
accuracy: 0.8200
Epoch 4/20
250/250 [==============================] - 0s 2ms/step - loss: 0.4007 -
accuracy: 0.8299
Epoch 5/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3800 -
accuracy: 0.8406
Epoch 6/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3663 -
accuracy: 0.8486
Epoch 7/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3593 -
accuracy: 0.8511
Epoch 8/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3537 -
accuracy: 0.8551
Epoch 9/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3502 -
accuracy: 0.8575
Epoch 10/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3482 -
accuracy: 0.8574
Epoch 11/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3450 -
accuracy: 0.8585
Epoch 12/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3435 -
accuracy: 0.8581
Epoch 13/20
250/250 [==============================] - 1s 2ms/step - loss: 0.3411 -
accuracy: 0.8612
Epoch 14/20
250/250 [==============================] - 1s 2ms/step - loss: 0.3412 -
accuracy: 0.8601
Epoch 15/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3378 -
accuracy: 0.8610
Epoch 16/20
250/250 [==============================] - 1s 6ms/step - loss: 0.3371 -
accuracy: 0.8605
Epoch 17/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3364 -
accuracy: 0.8608
Epoch 18/20
250/250 [==============================] - 1s 6ms/step - loss: 0.3366 -
accuracy: 0.8612
Epoch 19/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3348 -
accuracy: 0.8604
Epoch 20/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3326 -
accuracy: 0.8634
<keras.callbacks.History at 0x1ebc037af40>
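The evaluation cell is likewise not shown; assuming the split and model from the sketch above, the prints below could come from something like:
from sklearn.metrics import accuracy_score, confusion_matrix

# threshold the sigmoid outputs at 0.5 to get class labels
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)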
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(confusion)
Accuracy: 0.86
Confusion Matrix:
[[1557 50]
[ 230 163]]
import pandas as pd
import numpy as np
data = pd.read_csv("diabetes.csv")
data
df = pd.DataFrame(data)
df.head()
df.isnull().sum()
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12,10))  # set the figure size to 12 by 10
p = sns.heatmap(df.corr(), annot=True, cmap='RdYlGn')  # seaborn has a very simple solution for heatmaps
# data1 is assumed to be a cleaned copy of the data (the zero readings visible in df appear to have been imputed, judging by the summary below)
print(data1.describe())
Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.00000
mean 3.845052 121.682292 72.386719 29.108073 155.28125
std 3.369578 30.435999 12.096642 8.791221 85.02155
min 0.000000 44.000000 24.000000 7.000000 14.00000
25% 1.000000 99.750000 64.000000 25.000000 121.50000
50% 3.000000 117.000000 72.000000 29.000000 155.00000
75% 6.000000 140.250000 80.000000 32.000000 155.00000
max 17.000000 199.000000 122.000000 99.000000 846.00000
graph = ['Glucose','Insulin','BMI','Age','Outcome']
sns.set()
print(sns.pairplot(data1[graph], hue='Outcome', diag_kind='kde'))
df = data1[q_cols]
print(df.head(2))
a = len(train_x)
b = len(test_x)
print(' Training data =', a, '\n', 'Testing data =', b, '\n', 'Total data length =', a + b)
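The custom knn helper used below is defined in a cell that is not part of this print. A minimal sketch, assuming it fits a KNeighborsClassifier for every k from 1 to n-1 and returns the list of test accuracies that is plotted afterwards, might look like this:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn(train_x, train_y, test_x, test_y, n):
    # fit one classifier per k and record its accuracy on the test split
    scores = []
    for k in range(1, n):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(train_x, train_y)
        scores.append(accuracy_score(test_y, model.predict(test_x)))
    return scores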
n= 500
output = knn(train_x,train_y,test_x,test_y,n)
n_range = range(1, n)
plt.plot(n_range, output)
[<matplotlib.lines.Line2D at 0x2a3ec0c6100>]
# the best k for this model lies between 100 and 200, offering about 77% accuracy
# the ideal k value for this dataset is around 150, give or take
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18924/597529570.py in <module>
2 from sklearn.metrics import accuracy_score, precision_score,
recall_score, f1_score, fbeta_score
3 y_pred = knn(train_x,train_y,test_x,test_y,n)
----> 4 cnf_matrix = confusion_matrix(test_y, y_pred)
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py in confusion_matrix(y_true, y_pred, labels,
sample_weight, normalize)
305 (0, 2, 1, 1)
306 """
--> 307 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
308 if y_type not in ("binary", "multiclass"):
309 raise ValueError("%s is not supported" % y_type)
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
82 y_pred : array or indicator matrix
83 """
---> 84 check_consistent_length(y_true, y_pred)
85 type_true = type_of_target(y_true, input_name="y_true")
86 type_pred = type_of_target(y_pred, input_name="y_pred")
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\utils\validation.py in check_consistent_length(*arrays)
385 uniques = np.unique(lengths)
386 if len(uniques) > 1:
--> 387 raise ValueError(
388 "Found input variables with inconsistent numbers of
samples: %r"
389 % [int(l) for l in lengths]
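The ValueError above occurs because knn returns a list of n-1 accuracy scores rather than per-sample predictions, so its length does not match test_y. The knn2 helper used next is also defined outside this print; a minimal sketch, assuming it fits a single KNeighborsClassifier with n neighbors and returns its predictions, might be:
from sklearn.neighbors import KNeighborsClassifier

def knn2(train_x, train_y, test_x, test_y, n):
    # single model with k = n neighbors; returns predictions aligned with test_y
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(train_x, train_y)
    return model.predict(test_x)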
n = 500
y_pred = knn2(train_x, train_y, test_x, test_y, n)
cnf_matrix = confusion_matrix(test_y, y_pred)
# Now you can calculate other metrics like accuracy, precision, recall, etc.
accuracy = accuracy_score(test_y, y_pred)
precision = precision_score(test_y, y_pred)
recall = recall_score(test_y, y_pred)
f1 = f1_score(test_y, y_pred)
fbeta = fbeta_score(test_y, y_pred, beta=0.5)
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py:1327: UndefinedMetricWarning: Precision is
ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
# Print the confusion matrix and other metrics
print("Confusion Matrix:\n", cnf_matrix)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("F-beta Score:", fbeta)
Confusion Matrix:
[[122 0]
[ 70 0]]
Accuracy: 0.6354166666666666
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
F-beta Score: 0.0
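The cell that produced the improved results below (accuracy of about 0.68) is not included in this print; it appears to evaluate a re-tuned model. The "Classification Report" header suggests a call along these lines, where y_pred2 is a hypothetical name for that model's predictions:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# y_pred2 stands in for the re-tuned model's predictions (not shown in this print)
print("Accuracy:", accuracy_score(test_y, y_pred2))
print("Confusion Matrix:\n", confusion_matrix(test_y, y_pred2))
print("Classification Report:\n", classification_report(test_y, y_pred2))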
Accuracy: 0.6822916666666666
Confusion Matrix:
[[94 29]
[32 37]]
Classification Report:
precision recall f1-score support