ML All Prints
import pandas as pd
data=pd.read_csv(r"uber.csv")
# test_df=pd.read_csv(r"test.csv")
print (data.shape)
print (data.columns)
(200000, 9)
Index(['Unnamed: 0', 'key', 'fare_amount', 'pickup_datetime',
'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
'dropoff_latitude', 'passenger_count'],
dtype='object')
data_x = data.iloc[:,0:-1].values
data_y = data.iloc[:,-1].values
print(data_y)
[1 1 1 ... 2 1 1]
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200000 non-null int64
1 key 200000 non-null object
2 fare_amount 200000 non-null float64
3 pickup_datetime 200000 non-null object
4 pickup_longitude 200000 non-null float64
5 pickup_latitude 200000 non-null float64
6 dropoff_longitude 199999 non-null float64
7 dropoff_latitude 199999 non-null float64
8 passenger_count 200000 non-null int64
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB
data["pickup_datetime"]=pd.to_datetime(data['pickup_datetime'])
data.head()
data.describe()
Looking at the summary statistics, the first issue is that the minimum fare is negative (-62), which is not a valid value, so the rows with negative fares need to be removed. Secondly, passenger_count has a minimum of 0 and a maximum of 208, which is impossible; to be safe we can assume a taxi carries at most seven or eight passengers and drop the rest.
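The printed cells do not actually show a fare filter (and the row counts later in this print suggest it was not applied in this particular run), so the following is only a minimal sketch of how the invalid fares could be removed:
# rows with a positive fare; assigning this back to data would apply the filter
valid_fares = data[data['fare_amount'] > 0]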
print(data.isnull().sum())
Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64
Here we can see there are null values in dropoff_latitude and dropoff_longitude (one in each). Removing a single row from such a huge dataset will not affect our analysis, so let's drop the rows containing null values.
data.dropna(inplace=True)
print(data.isnull().sum())
Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64
import seaborn as sns
sns.distplot(data['fare_amount'])
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated
function and will be removed in a future version. Please adapt your code to
use either `displot` (a figure-level function with similar flexibility) or
`histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='fare_amount', ylabel='Density'>
The distribution plot also shows that some fare values are negative.
sns.distplot(data['pickup_latitude'])
<AxesSubplot:xlabel='pickup_latitude', ylabel='Density'>
Here the minimum value goes below -3000 and the maximum above 2000, which is impossible for a latitude (valid latitudes lie between -90 and 90).
sns.distplot(data['pickup_longitude'])
<AxesSubplot:xlabel='pickup_longitude', ylabel='Density'>
Here too, both the negative and positive values exceed the real-world limits by far.
sns.distplot(data['dropoff_longitude'])
<AxesSubplot:xlabel='dropoff_longitude', ylabel='Density'>
sns.distplot(data['dropoff_latitude'])
<AxesSubplot:xlabel='dropoff_latitude', ylabel='Density'>
print("drop_off latitude min value",data["dropoff_latitude"].min())
print("drop_off latitude max value",data["dropoff_latitude"].max())
print("drop_off longitude min value", data["dropoff_longitude"].min())
print("drop_off longitude max value",data["dropoff_longitude"].max())
print("pickup latitude min value",data["pickup_latitude"].min())
print("pickup latitude max value",data["pickup_latitude"].max())
print("pickup longitude min value",data["pickup_longitude"].min())
print("pickup longitude max value",data["pickup_longitude"].max())
We can see the range of latitude and longitude in the test dataset; let's keep the same range in our train set so that noisy data is removed and we only keep values that belong to New York.
min_longitude = -74.263242
min_latitude = 40.573143
max_longitude = -72.986532
max_latitude = 41.709555
# let's drop all rows that fall outside the boundary above, as those are noisy data
tempdf=data[(data["dropoff_latitude"]<min_latitude) |
(data["pickup_latitude"]<min_latitude) |
(data["dropoff_longitude"]<min_longitude) |
(data["pickup_longitude"]<min_longitude) |
(data["dropoff_latitude"]>max_latitude) |
(data["pickup_latitude"]>max_latitude) |
(data["dropoff_longitude"]>max_longitude) |
(data["pickup_longitude"]>max_longitude) ]
print("before droping",data.shape)
data.drop(tempdf.index,inplace=True)
print("after droping",data.shape)
before droping (199999, 9)
after droping (195732, 9)
Prices differ by day and time: evening fares tend to be higher than afternoon ones, fares around Christmas differ, and weekend fares differ from weekday fares. So let's create some extra features that capture these effects.
import calendar
data['day']=data['pickup_datetime'].apply(lambda x:x.day)
data['hour']=data['pickup_datetime'].apply(lambda x:x.hour)
data['weekday']=data['pickup_datetime'].apply(lambda x:calendar.day_name[x.weekday()])
data['month']=data['pickup_datetime'].apply(lambda x:x.month)
data['year']=data['pickup_datetime'].apply(lambda x:x.year)
data.head()
(data.head() output truncated in this print; the new day, hour, weekday, month and year columns are now present)
# here we can see that weekdays are stored as Monday, Tuesday and so on, so we need to convert them to numerical form
data.weekday = data.weekday.map({'Sunday':0,'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6})
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195732 entries, 0 to 199999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195732 non-null int64
1 key 195732 non-null object
2 fare_amount 195732 non-null float64
3 pickup_datetime 195732 non-null datetime64[ns, UTC]
4 pickup_longitude 195732 non-null float64
5 pickup_latitude 195732 non-null float64
6 dropoff_longitude 195732 non-null float64
7 dropoff_latitude 195732 non-null float64
8 passenger_count 195732 non-null int64
9 day 195732 non-null int64
10 hour 195732 non-null int64
11 weekday 195732 non-null int64
12 month 195732 non-null int64
13 year 195732 non-null int64
dtypes: datetime64[ns, UTC](1), float64(5), int64(7), object(1)
memory usage: 22.4+ MB
# we will keep only those rows where the number of passengers is less than or equal to 8
data=data[data['passenger_count']<=8]
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195731 entries, 0 to 199999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195731 non-null int64
1 key 195731 non-null object
2 fare_amount 195731 non-null float64
3 pickup_datetime 195731 non-null datetime64[ns, UTC]
4 pickup_longitude 195731 non-null float64
5 pickup_latitude 195731 non-null float64
6 dropoff_longitude 195731 non-null float64
7 dropoff_latitude 195731 non-null float64
8 passenger_count 195731 non-null int64
9 day 195731 non-null int64
10 hour 195731 non-null int64
11 weekday 195731 non-null int64
12 month 195731 non-null int64
13 year 195731 non-null int64
dtypes: datetime64[ns, UTC](1), float64(5), int64(7), object(1)
memory usage: 22.4+ MB
# the key and pickup_datetime columns are no longer needed, as we have already extracted features from them
data.drop(['key', 'pickup_datetime'], axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195731 entries, 0 to 199999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 195731 non-null int64
1 fare_amount 195731 non-null float64
2 pickup_longitude 195731 non-null float64
3 pickup_latitude 195731 non-null float64
4 dropoff_longitude 195731 non-null float64
5 dropoff_latitude 195731 non-null float64
6 passenger_count 195731 non-null int64
7 day 195731 non-null int64
8 hour 195731 non-null int64
9 weekday 195731 non-null int64
10 month 195731 non-null int64
11 year 195731 non-null int64
dtypes: float64(5), int64(7)
memory usage: 19.4 MB
Let's divide the dataset into a train set and a validation (test) set.
from sklearn.model_selection import train_test_split
x = data.drop("fare_amount", axis=1)
y = data['fare_amount']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)
x_train.head()
x_test.head()
x_train.shape
(156584, 11)
x_test.shape
(39147, 11)
from sklearn.linear_model import LinearRegression
lrmodel = LinearRegression()
lrmodel.fit(x_train, y_train)
LinearRegression()
predictedvalues = lrmodel.predict(x_test)
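The print does not include an evaluation cell for the linear model; a minimal sketch of computing its RMSE, mirroring the random-forest evaluation below, might look like this:
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the linear regression predictions on the validation set
lrmodel_rmse = np.sqrt(mean_squared_error(y_test, predictedvalues))
print("RMSE value for Linear regression is", lrmodel_rmse)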
from sklearn.ensemble import RandomForestRegressor
rfrmodel = RandomForestRegressor(n_estimators=50, random_state=101)
rfrmodel.fit(x_train, y_train)
RandomForestRegressor(n_estimators=50, random_state=101)
rfrmodel_pred= rfrmodel.predict(x_test)
rfrmodel_rmse=np.sqrt(mean_squared_error(rfrmodel_pred, y_test))
print("RMSE value for Random forest regression is ",rfrmodel_rmse)
Support Vector Machine is a supervised machine-learning algorithm that can be used for both classification and regression problems, though it is mostly used for classification. Each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn import svm
data = pd.read_csv('emails.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB
# Data preprocessing
X = data.drop(columns=['Email No.', 'Prediction'])  # Remove the identifier and target columns
y = data['Prediction']  # Target variable (the dataset's label column, per the info() above)
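The cell that splits the data and fits the SVM is not included in this print (GridSearchCV is imported but its use is not shown either). A minimal sketch, assuming a plain SVC and an 80/20 split, might look like this:
from sklearn.model_selection import train_test_split
from sklearn import svm

# hold out 20% of the emails for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit a support vector classifier (the kernel choice is an assumption, not shown in the print)
svm_model = svm.SVC(kernel='linear')
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)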
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Model evaluation
def evaluate_model(predictions, model_name):
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    roc_auc = roc_auc_score(y_test, predictions)
    print(f"Performance metrics for {model_name}:")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-Score: {f1}")
    print(f"ROC AUC: {roc_auc}")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, predictions))
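The call that produced the warning below is not shown; assuming the SVM fit sketched above, it would look like:
evaluate_model(svm_predictions, "SVM")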
c:\Users\Kedar\AppData\Local\Programs\Python\Python39\lib\site-packages\
sklearn\metrics\_classification.py:1327: UndefinedMetricWarning: Precision is
ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Given a bank customer, build a neural network-based classifier that can determine whether they will leave or not in the next 6 months. Dataset Description: The case study uses an open-source dataset from Kaggle. The dataset contains 10,000 sample points with 14 distinct features such as CustomerId, CreditScore, Geography, Gender, Age, Tenure, Balance, etc. Link to the Kaggle project: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling. Perform the following steps:
data = pd.read_csv('Churn_Modelling.csv')
# Build the feature matrix X and target y (assuming the standard columns of this
# Kaggle dataset; the target column is 'Exited')
X = data.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Exited'])
y = data['Exited']
# Ensure that the columns 'Geography' and 'Gender' are present in the DataFrame X
# (basic error handling to verify the column names)
columns_to_encode = ['Geography', 'Gender']
for column in columns_to_encode:
    if column not in X.columns:
        raise ValueError(f"Column '{column}' not found in the DataFrame X.")
# Encode the categorical variables 'Geography' and 'Gender' into numerical format
# using one-hot encoding
X = pd.get_dummies(X, columns=['Geography', 'Gender'], drop_first=True)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
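The cell that defines and trains the network is not part of this print. A minimal sketch, assuming a small feed-forward network, an 80/20 split and batch size 32 (8,000 training rows / 32 gives the 250 steps per epoch seen in the log below), might look like this:
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

# hold out 20% of the customers for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# the hidden-layer sizes are assumptions; only the final sigmoid unit is required for binary churn output
model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32)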
Epoch 1/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4769 -
accuracy: 0.7947
Epoch 2/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4413 -
accuracy: 0.8098
Epoch 3/20
250/250 [==============================] - 1s 2ms/step - loss: 0.4229 -
accuracy: 0.8200
Epoch 4/20
250/250 [==============================] - 0s 2ms/step - loss: 0.4007 -
accuracy: 0.8299
Epoch 5/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3800 -
accuracy: 0.8406
Epoch 6/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3663 -
accuracy: 0.8486
Epoch 7/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3593 -
accuracy: 0.8511
Epoch 8/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3537 -
accuracy: 0.8551
Epoch 9/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3502 -
accuracy: 0.8575
Epoch 10/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3482 -
accuracy: 0.8574
Epoch 11/20
250/250 [==============================] - 0s 2ms/step - loss: 0.3450 -
accuracy: 0.8585
Epoch 12/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3435 -
accuracy: 0.8581
Epoch 13/20
250/250 [==============================] - 1s 2ms/step - loss: 0.3411 -
accuracy: 0.8612
Epoch 14/20
250/250 [==============================] - 1s 2ms/step - loss: 0.3412 -
accuracy: 0.8601
Epoch 15/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3378 -
accuracy: 0.8610
Epoch 16/20
250/250 [==============================] - 1s 6ms/step - loss: 0.3371 -
accuracy: 0.8605
Epoch 17/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3364 -
accuracy: 0.8608
Epoch 18/20
250/250 [==============================] - 1s 6ms/step - loss: 0.3366 -
accuracy: 0.8612
Epoch 19/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3348 -
accuracy: 0.8604
Epoch 20/20
250/250 [==============================] - 1s 5ms/step - loss: 0.3326 -
accuracy: 0.8634
<keras.callbacks.History at 0x1ebc037af40>
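The evaluation cell is likewise not shown; assuming the split and model from the sketch above, the prints below could come from something like:
from sklearn.metrics import accuracy_score, confusion_matrix

# threshold the sigmoid outputs at 0.5 to get class labels
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)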
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(confusion)
Accuracy: 0.86
Confusion Matrix:
[[1557 50]
[ 230 163]]
import pandas as pd
import numpy as np
data = pd.read_csv("diabetes.csv")
data
df = pd.DataFrame(data)
df.head()
df.isnull().sum()
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12,10))  # set the figure size to 12 by 10
p = sns.heatmap(df.corr(), annot=True, cmap='RdYlGn')  # seaborn has a very simple solution for heatmaps
# data1 is assumed to be a cleaned copy of the data (the zero readings visible in df appear to have been imputed, judging by the summary below)
print(data1.describe())
Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.00000
mean 3.845052 121.682292 72.386719 29.108073 155.28125
std 3.369578 30.435999 12.096642 8.791221 85.02155
min 0.000000 44.000000 24.000000 7.000000 14.00000
25% 1.000000 99.750000 64.000000 25.000000 121.50000
50% 3.000000 117.000000 72.000000 29.000000 155.00000
75% 6.000000 140.250000 80.000000 32.000000 155.00000
max 17.000000 199.000000 122.000000 99.000000 846.00000
graph = ['Glucose','Insulin','BMI','Age','Outcome']
sns.set()
print(sns.pairplot(data1[graph], hue='Outcome', diag_kind='kde'))
df = data1[q_cols]
print(df.head(2))
a = len(train_x)
b = len(test_x)
print(' Training data =', a, '\n', 'Testing data =', b, '\n', 'Total data length =', a + b)
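The custom knn helper used below is defined in a cell that is not part of this print. A minimal sketch, assuming it fits a KNeighborsClassifier for every k from 1 to n-1 and returns the list of test accuracies that is plotted afterwards, might look like this:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn(train_x, train_y, test_x, test_y, n):
    # fit one classifier per k and record its accuracy on the test split
    scores = []
    for k in range(1, n):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(train_x, train_y)
        scores.append(accuracy_score(test_y, model.predict(test_x)))
    return scores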
n= 500
output = knn(train_x,train_y,test_x,test_y,n)
n_range = range(1, n)
plt.plot(n_range, output)
[<matplotlib.lines.Line2D at 0x2a3ec0c6100>]
# the best k for this model lies between 100 and 200, offering about 77% accuracy
# the ideal k value for this dataset is around 150, give or take
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18924/597529570.py in <module>
2 from sklearn.metrics import accuracy_score, precision_score,
recall_score, f1_score, fbeta_score
3 y_pred = knn(train_x,train_y,test_x,test_y,n)
----> 4 cnf_matrix = confusion_matrix(test_y, y_pred)
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py in confusion_matrix(y_true, y_pred, labels,
sample_weight, normalize)
305 (0, 2, 1, 1)
306 """
--> 307 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
308 if y_type not in ("binary", "multiclass"):
309 raise ValueError("%s is not supported" % y_type)
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
82 y_pred : array or indicator matrix
83 """
---> 84 check_consistent_length(y_true, y_pred)
85 type_true = type_of_target(y_true, input_name="y_true")
86 type_pred = type_of_target(y_pred, input_name="y_pred")
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\utils\validation.py in check_consistent_length(*arrays)
385 uniques = np.unique(lengths)
386 if len(uniques) > 1:
--> 387 raise ValueError(
388 "Found input variables with inconsistent numbers of
samples: %r"
389 % [int(l) for l in lengths]
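The ValueError above occurs because knn returns a list of n-1 accuracy scores rather than per-sample predictions, so its length does not match test_y. The knn2 helper used next is also defined outside this print; a minimal sketch, assuming it fits a single KNeighborsClassifier with n neighbors and returns its predictions, might be:
from sklearn.neighbors import KNeighborsClassifier

def knn2(train_x, train_y, test_x, test_y, n):
    # single model with k = n neighbors; returns predictions aligned with test_y
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(train_x, train_y)
    return model.predict(test_x)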
n = 500
y_pred = knn2(train_x, train_y, test_x, test_y, n)
cnf_matrix = confusion_matrix(test_y, y_pred)
# Now you can calculate other metrics like accuracy, precision, recall, etc.
accuracy = accuracy_score(test_y, y_pred)
precision = precision_score(test_y, y_pred)
recall = recall_score(test_y, y_pred)
f1 = f1_score(test_y, y_pred)
fbeta = fbeta_score(test_y, y_pred, beta=0.5)
c:\users\kedar\appdata\local\programs\python\python39\lib\site-packages\
sklearn\metrics\_classification.py:1327: UndefinedMetricWarning: Precision is
ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
# Print the confusion matrix and other metrics
print("Confusion Matrix:\n", cnf_matrix)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("F-beta Score:", fbeta)
Confusion Matrix:
[[122 0]
[ 70 0]]
Accuracy: 0.6354166666666666
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
F-beta Score: 0.0
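The cell that produced the improved results below (accuracy of about 0.68) is not included in this print; it appears to evaluate a re-tuned model. The "Classification Report" header suggests a call along these lines, where y_pred2 is a hypothetical name for that model's predictions:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# y_pred2 stands in for the re-tuned model's predictions (not shown in this print)
print("Accuracy:", accuracy_score(test_y, y_pred2))
print("Confusion Matrix:\n", confusion_matrix(test_y, y_pred2))
print("Classification Report:\n", classification_report(test_y, y_pred2))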
Accuracy: 0.6822916666666666
Confusion Matrix:
[[94 29]
[32 37]]
Classification Report:
precision recall f1-score support