
Supervised Regression

This document summarizes the data pre-processing steps performed on an airline ticket price dataset: 1. The dataset is loaded and initial exploratory analysis is done to understand the data types and the distribution of values. 2. Data cleaning steps include numerically encoding categorical features such as the departure/arrival stations and the extra-info categories, and splitting the date field into separate day and month columns. 3. Unneeded columns are dropped to reduce the number of features in the final preprocessed dataset.


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

MLSR - Regression Model

Project done by: Amos B

1. Data Pre-processing
Import the required libraries

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing

from warnings import filterwarnings


filterwarnings('ignore')

Load the Excel file

In [2]:

df_airline = pd.read_excel("airfare_CT3-1.xlsx")
df_airline.head(3)

Out[2]:

       Airline        Date Departure Station Arrival Station              Route Map Departure Time  Arrival Time Journey Time     Stops Extra Info  Price
0       IndiGo  24/03/2019          Banglore       New Delhi              BLR → DEL          22:20  01:10 22 Mar       2h 50m  non-stop    No info   3897
1    Air India   1/05/2019           Kolkata        Banglore  CCU → IXR → BBI → BLR          05:50         13:15       7h 25m   2 stops    No info   7662
2  Jet Airways   9/06/2019             Delhi          Cochin  DEL → LKO → BOM → COK          09:25  04:25 10 Jun          19h   2 stops    No info  13882

In [3]:

df_airline.shape

Out[3]:

(9000, 11)

In [4]:

df_airline.keys()

Out[4]:

Index(['Airline', 'Date', 'Departure Station', 'Arrival Station', 'Route Map',
       'Departure Time', 'Arrival Time', 'Journey Time', 'Stops', 'Extra Info',
       'Price'],
      dtype='object')

In [5]:

df_airline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9000 entries, 0 to 8999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Airline 9000 non-null object
1 Date 9000 non-null object
2 Departure Station 9000 non-null object
3 Arrival Station 9000 non-null object
4 Route Map 9000 non-null object
5 Departure Time 9000 non-null object
6 Arrival Time 9000 non-null object
7 Journey Time 9000 non-null object
8 Stops 9000 non-null object
9 Extra Info 9000 non-null object
10 Price 9000 non-null int64
dtypes: int64(1), object(10)
memory usage: 773.6+ KB

Prepare the data

In [6]:

df_airline.describe()

Out[6]:

Price

count 9000.000000

mean 9087.764333

std 4605.498942

min 1759.000000

25% 5228.000000

50% 8369.000000

75% 12373.000000

max 79512.000000

In [7]:

df_airline.dtypes

Out[7]:

Airline object
Date object
Departure Station object
Arrival Station object
Route Map object
Departure Time object
Arrival Time object
Journey Time object
Stops object
Extra Info object
Price int64
dtype: object

We can see from the above result that Price is numerical (dtype int64), while all the other columns are of object dtype.

Perform missing value analysis


In [8]:

# count the missing values in each variable
# 'isnull().sum()' returns the number of missing values per column

missing_total = df_airline.isnull().sum()
print(missing_total)

Airline 0
Date 0
Departure Station 0
Arrival Station 0
Route Map 0
Departure Time 0
Arrival Time 0
Journey Time 0
Stops 0
Extra Info 0
Price 0
dtype: int64

There are no missing values present in the given dataset.
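
As an optional visual check (not in the original run), the null mask can also be rendered as a heatmap using the seaborn and matplotlib imports above; an entirely blank heatmap confirms that no values are missing:

# Optional visual confirmation of the missing-value analysis:
# plot a heatmap of the boolean null mask
plt.figure(figsize=(8, 4))
sns.heatmap(df_airline.isnull(), cbar=False)
plt.title('Missing values in df_airline (none expected)')
plt.show()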

In [9]:

print(df_airline['Departure Station'].unique())

['Banglore' 'Kolkata' 'Delhi' 'Chennai' 'Mumbai']

Replacing the departure values as per station code

In [10]:

df_airline['Departure Station'] = df_airline['Departure Station'].replace({
    'Banglore': 'BLR', 'Delhi': 'DEL', 'Kolkata': 'CCU', 'Chennai': 'MAA', 'Mumbai': 'BOM'})

print(df_airline['Departure Station'].unique())

['BLR' 'CCU' 'DEL' 'MAA' 'BOM']

We have replaced the departure station values with their location codes.

Replacing arrival station values as per the location code


In [11]:

print(df_airline['Arrival Station'].unique())

['New Delhi' 'Banglore' 'Cochin' 'Kolkata' 'Delhi' 'Hyderabad']

In [12]:

df_airline['Arrival Station'] = df_airline['Arrival Station'].replace({
    'Banglore': 'BLR', 'New Delhi': 'DEL', 'Cochin': 'COK', 'Hyderabad': 'HYD',
    'Delhi': 'DEL', 'Kolkata': 'CCU', 'Chennai': 'MAA', 'Mumbai': 'BOM'})

print(df_airline['Arrival Station'].unique())

['DEL' 'BLR' 'COK' 'CCU' 'HYD']
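
Since the same city-to-code mapping applies to both station columns, an equivalent approach (shown here for illustration only, not what the notebook runs) is a single shared dictionary, which guarantees the two columns stay consistent:

# Illustrative alternative: one shared city-to-station-code mapping for both columns
station_codes = {'Banglore': 'BLR', 'New Delhi': 'DEL', 'Delhi': 'DEL',
                 'Kolkata': 'CCU', 'Chennai': 'MAA', 'Mumbai': 'BOM',
                 'Cochin': 'COK', 'Hyderabad': 'HYD'}
for col in ['Departure Station', 'Arrival Station']:
    df_airline[col] = df_airline[col].replace(station_codes)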

Cleaning the Extra Info variable

In [13]:

print(df_airline['Extra Info'].unique())

['No info' 'In-flight meal not included' 'No check-in baggage included'
'1 Short layover' 'No Info' '1 Long layover' 'Change airports'
'Business class' 'Red-eye flight']

In [14]:

# The column contains two spellings of the same category: 'No info' and 'No Info'.
# Normalize them as a first step.

df_airline['Extra Info'] = df_airline['Extra Info'].replace({"No info": "No Info"})


In [15]:

print(df_airline['Extra Info'].unique())

['No Info' 'In-flight meal not included' 'No check-in baggage included'
'1 Short layover' '1 Long layover' 'Change airports' 'Business class'
'Red-eye flight']

In [16]:

df_airline.groupby('Extra Info')['Extra Info'].count()

Out[16]:

Extra Info
1 Long layover 17
1 Short layover 1
Business class 3
Change airports 4
In-flight meal not included 1649
No Info 7055
No check-in baggage included 270
Red-eye flight 1
Name: Extra Info, dtype: int64

In [17]:

# Assign integer codes to the categories shown in the result above (ordered by frequency)

df_airline['Extra Info'] = df_airline['Extra Info'].map({
    'No Info': 0, 'In-flight meal not included': 1, 'No check-in baggage included': 2,
    '1 Long layover': 3, 'Change airports': 4, 'Business class': 5,
    '1 Short layover': 6, 'Red-eye flight': 7})

In [18]:

print(df_airline['Extra Info'].unique())

[0 1 2 6 3 4 5 7]
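
The hand-written map above keeps the integer codes in frequency order. sklearn's LabelEncoder is a common alternative (the preprocessing module is already imported), but note that it assigns codes alphabetically, so the resulting integers would differ from the ones above. A minimal sketch on throwaway labels:

# Illustrative alternative: LabelEncoder assigns integer codes
# in alphabetical order of the category labels
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
raw_labels = ['No Info', 'In-flight meal not included', 'Business class']
print(le.fit_transform(raw_labels))  # [2 1 0] -- alphabetical order
print(le.classes_)                   # shows which label each code maps to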

Cleaning the Stops variable

In [19]:

print(df_airline['Stops'].unique())

['non-stop' '2 stops' '1 stop' '3 stops']

In [20]:

df_airline['Stops'] = df_airline['Stops'].replace({'non-stop': 0, '1 stop' : 1, '2 stops': 2, '3 stops':3})

In [21]:

print(df_airline['Stops'].unique())

[0 2 1 3]

Creating Day, Month, Year variables from the Date variable

In [22]:

df_airline.head(2)

Out[22]:

Airline Date Departure Station Arrival Station Route Map Departure Time Arrival Time Journey Time Stops Extra Info Price

0 IndiGo 24/03/2019 BLR DEL BLR → DEL 22:20 01:10 22 Mar 2h 50m 0 0 3897

1 Air India 1/05/2019 CCU BLR CCU → IXR → BBI → BLR 05:50 13:15 7h 25m 2 0 7662

In [23]:

df_airline[['Day', 'Month', 'Year']] = df_airline['Date'].str.split('/', expand=True)

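An alternative worth knowing (not used in this notebook) is to parse the date once with pd.to_datetime and read the parts off the .dt accessor; unlike str.split, this validates the dates and yields integer columns directly, so the astype(int) conversion done later would not be needed:

# Illustrative alternative: parse the DD/MM/YYYY dates and extract the components
dates = pd.to_datetime(df_airline['Date'], format='%d/%m/%Y')
df_airline['Day'] = dates.dt.day
df_airline['Month'] = dates.dt.month
df_airline['Year'] = dates.dt.year
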
In [24]:

df_airline.head(2)

Out[24]:

     Airline        Date Departure Station Arrival Station              Route Map Departure Time  Arrival Time Journey Time  Stops  Extra Info  Price Day Month  Year
0     IndiGo  24/03/2019               BLR             DEL              BLR → DEL          22:20  01:10 22 Mar       2h 50m      0           0   3897  24    03  2019
1  Air India   1/05/2019               CCU             BLR  CCU → IXR → BBI → BLR          05:50         13:15       7h 25m      2           0   7662   1    05  2019

Dropping unwanted columns from the dataset

In [25]:

df_airline.drop(['Date', 'Arrival Time', 'Year'], axis='columns', inplace=True)

In [26]:

df_airline.head(2)

Out[26]:

Airline Departure Station Arrival Station Route Map Departure Time Journey Time Stops Extra Info Price Day Month

0 IndiGo BLR DEL BLR → DEL 22:20 2h 50m 0 0 3897 24 03

1 Air India CCU BLR CCU → IXR → BBI → BLR 05:50 7h 25m 2 0 7662 1 05

In [27]:

df_airline.shape

Out[27]:

(9000, 11)

2. Feature Engineering

Calculating distance

In [28]:

df_air_distance = pd.read_csv("air_distance.csv")

In [29]:

df_air_distance.head(2)

Out[29]:

Unnamed: 0 Source Dest Distance(Km)

0 0 BLR DEL 1709.71

1 1 CCU IXR 327.84


In [30]:

def getDistance(route):
    # Sum the leg distances along a route string like 'BLR → DEL'
    distance = 0.0
    route = "".join(route.split())   # strip all whitespace
    routeArray = route.split('→')    # split into station codes
    if len(routeArray) > 1:
        for i in range(len(routeArray) - 1):
            # look up the leg distance in either direction
            df_dist = df_air_distance[(df_air_distance['Source'] == routeArray[i]) &
                                      (df_air_distance['Dest'] == routeArray[i + 1])]
            if df_dist.empty:
                df_dist = df_air_distance[(df_air_distance['Source'] == routeArray[i + 1]) &
                                          (df_air_distance['Dest'] == routeArray[i])]
            distance += df_dist['Distance(Km)'].item()
    return round(distance, 2)
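
A quick sanity check of the function against the distance table loaded above (the expected value for the direct leg comes from the first row of df_air_distance):

# Sanity check: a direct route returns a single leg distance,
# a multi-leg route returns the sum of its legs
print(getDistance('BLR → DEL'))              # expected: 1709.71
print(getDistance('CCU → IXR → BBI → BLR'))  # sums the three legs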

In [31]:

df_airline['Distance(km)'] = df_airline['Route Map'].apply(getDistance)

df_airline.head(3)

Out[31]:

       Airline Departure Station Arrival Station              Route Map Departure Time Journey Time  Stops  Extra Info  Price Day Month  Distance(km)
0       IndiGo               BLR             DEL              BLR → DEL          22:20       2h 50m      0           0   3897  24    03       1709.71
1    Air India               CCU             BLR  CCU → IXR → BBI → BLR          05:50       7h 25m      2           0   7662   1    05       1838.55
2  Jet Airways               DEL             COK  DEL → LKO → BOM → COK          09:25          19h      2           0  13882   9    06       2671.33

Creating departure hour and minute from the departure time, and duration hour and minute from the journey time

In [32]:

df_airline[['Dep_Hr', 'Dep_Min']] = df_airline['Departure Time'].str.split(':', expand=True)

df_airline['Duration'] = df_airline['Journey Time'].str.replace('h ', ':').str.replace('m', '')
df_airline[['Duration_Hr', 'Duration_Min']] = df_airline['Duration'].str.split(':', expand=True)
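
As a side note, pandas can usually parse strings like '2h 50m' or '19h' directly with pd.to_timedelta, which would avoid the string surgery above and the missing-minutes nulls handled later. A sketch (the Duration_Total_Min column name is hypothetical, for illustration only):

# Illustrative alternative: parse 'Journey Time' values ('2h 50m', '19h', ...) as timedeltas
dur = pd.to_timedelta(df_airline['Journey Time'])
df_airline['Duration_Total_Min'] = (dur.dt.total_seconds() // 60).astype(int)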

Dropping unwanted columns

In [33]:

df_airline.drop(['Departure Time', 'Journey Time', 'Duration'], axis='columns', inplace=True)

In [34]:

df_airline.head(2)

Out[34]:

     Airline Departure Station Arrival Station              Route Map  Stops  Extra Info  Price Day Month  Distance(km) Dep_Hr Dep_Min Duration_Hr Duration_Min
0     IndiGo               BLR             DEL              BLR → DEL      0           0   3897  24    03       1709.71     22      20           2           50
1  Air India               CCU             BLR  CCU → IXR → BBI → BLR      2           0   7662   1    05       1838.55     05      50           7           25

In [35]:

df_airline.shape

Out[35]:

(9000, 14)

Changing the datatype as per our requirement and model design

In [36]:

df_airline['Month'] = df_airline['Month'].astype(str).astype(int)
df_airline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9000 entries, 0 to 8999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Airline 9000 non-null object
1 Departure Station 9000 non-null object
2 Arrival Station 9000 non-null object
3 Route Map 9000 non-null object
4 Stops 9000 non-null int64
5 Extra Info 9000 non-null int64
6 Price 9000 non-null int64
7 Day 9000 non-null object
8 Month 9000 non-null int32
9 Distance(km) 9000 non-null float64
10 Dep_Hr 9000 non-null object
11 Dep_Min 9000 non-null object
12 Duration_Hr 9000 non-null object
13 Duration_Min 8143 non-null object
dtypes: float64(1), int32(1), int64(3), object(9)
memory usage: 949.3+ KB

In [37]:

# Replace the null values with 0 (durations like '19h' have no minutes component)

df_airline['Duration_Min'] = df_airline['Duration_Min'].fillna(0)

In [38]:

# Strip the trailing 'h' left over in the Duration_Hr variable (e.g. '19h' -> '19')

df_airline['Duration_Hr'] = df_airline['Duration_Hr'].str.rstrip('h')
df_airline.Duration_Hr.unique()

Out[38]:

array(['2', '7', '19', '5', '4', '15', '21', '25', '13', '12', '26', '22',
'23', '20', '10', '6', '11', '8', '16', '3', '27', '1', '14', '9',
'18', '17', '24', '30', '28', '29', '37', '34', '38', '35', '36',
'47', '33', '32', '31', '42', '39', '41'], dtype=object)

In [39]:

df_airline.head(2)

Out[39]:

     Airline Departure Station Arrival Station              Route Map  Stops  Extra Info  Price Day  Month  Distance(km) Dep_Hr Dep_Min Duration_Hr Duration_Min
0     IndiGo               BLR             DEL              BLR → DEL      0           0   3897  24      3       1709.71     22      20           2           50
1  Air India               CCU             BLR  CCU → IXR → BBI → BLR      2           0   7662   1      5       1838.55     05      50           7           25

3. Regularization

Renaming a few variables for readability


In [40]:

df_airline = df_airline.rename(columns={'Departure Station': 'Source',
                                        'Arrival Station': 'Dest',
                                        'Extra Info': 'Info'})

In [41]:

df_airline.head(2)

Out[41]:

Airline Source Dest Route Map Stops Info Price Day Month Distance(km) Dep_Hr Dep_Min Duration_Hr Duration_Min

0 IndiGo BLR DEL BLR → DEL 0 0 3897 24 3 1709.71 22 20 2 50

1 Air India CCU BLR CCU → IXR → BBI → BLR 2 0 7662 1 5 1838.55 05 50 7 25

Exporting the cleaned dataset as a CSV file

In [42]:

df_airline.to_csv('Cleaned_airline.csv', index=False)

4. Apply machine learning algorithms


In [43]:

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

In [44]:

df = pd.read_csv('Cleaned_airline.csv')
df.head(2)

Out[44]:

Airline Source Dest Route Map Stops Info Price Day Month Distance(km) Dep_Hr Dep_Min Duration_Hr Duration_Min

0 IndiGo BLR DEL BLR → DEL 0 0 3897 24 3 1709.71 22 20 2 50

1 Air India CCU BLR CCU → IXR → BBI → BLR 2 0 7662 1 5 1838.55 5 50 7 25

In [45]:

plt.figure(figsize=(12,6))
# 'numeric_only=True' is needed on newer pandas versions, since df still contains object columns
df.corr(numeric_only=True)['Price'].sort_values().plot(kind='bar');

In [46]:

plt.figure(figsize=(12,6))
sns.countplot(x="Airline", data = df, palette='Set3')
plt.title('Count of Airlines', size=30)
plt.xticks(rotation=90)
plt.show()
In [47]:

plt.figure(figsize=(12,6))
sns.boxenplot(x = 'Airline', y= 'Price', data=df, palette='Set3')
plt.title('Airlines vs Price', size=30)
plt.xticks(rotation=90)
plt.show()
In [48]:

plt.figure(figsize=(12,6))
sns.countplot(x='Source', data = df, palette='Set2')
plt.title('Count of Source', size=30)
plt.xticks(rotation=90)
plt.show()

In [49]:

plt.figure(figsize=(12,6))
sns.boxenplot(x= 'Source', y= 'Price', data=df, palette='Set3')
plt.title('Source vs Price', size=30)
plt.xticks(rotation=90)
plt.show()
In [50]:

plt.figure(figsize=(12,6))
sns.countplot(x='Day', data= df, palette='Set2')
plt.title('Count of Days', size=30)
plt.xticks(rotation=90)
plt.show()

In [51]:

plt.figure(figsize=(12,6))
sns.barplot(x='Day', y='Price', data=df, palette='Set2')
plt.title('Days vs Price', size=30)
plt.xticks(rotation=90)
plt.show()

In [52]:

df['Month'] = df['Month'].map({
1:'JAN', 2:'FEB', 3:'MAR', 4:'APR', 5:'MAY', 6:'JUN',
7:'JUL', 8:'AUG', 9:'SEP', 10:'OCT', 11:'NOV', 12:'DEC'})
In [53]:

plt.figure(figsize=(12,6))
sns.barplot(x='Month', y='Price', data=df, palette='Set2')
plt.title('Month vs Price', size=30)
plt.xticks(rotation=90)
plt.show()

In [54]:

plt.figure(figsize=(12,6))
sns.barplot(x='Stops', y='Price', data=df, palette='Set2')
plt.title('Stops vs Price', size=30)
plt.xticks(rotation=90)
plt.show()
In [55]:

plt.figure(figsize=(12,6))
sns.barplot(x='Info', y='Price', data=df, palette='Set2')
plt.title('Extra Info vs Price', size=30)
plt.xticks(rotation=90)
plt.show()

In [56]:

df['Duration_bool'] = (df['Duration_Hr']*60)+df['Duration_Min']
plt.figure(figsize=(12,6))
sns.scatterplot(x= 'Duration_bool', y ='Price', data=df, palette='Set2')
plt.title('Duration vs Price', size=30)
plt.xticks(rotation=90)
plt.show()
In [57]:

ncol = ["Duration_bool"]

# Remove duration outliers using the 1.5 * IQR rule
for i in ncol:
    q75, q25 = np.percentile(df.loc[:, i], [75, 25])
    iqr = q75 - q25
    lower = q25 - (iqr * 1.5)
    upper = q75 + (iqr * 1.5)
    df = df.drop(df[df.loc[:, i] <= lower].index)
    df = df.drop(df[df.loc[:, i] >= upper].index)

df = df.dropna()
df1 = df[['Airline', 'Source', 'Dest', 'Stops',
          'Info', 'Price', 'Day', 'Month', 'Distance(km)', 'Duration_bool']]
df1 = df1.rename(columns={'Duration_bool': 'Duration'})
df1['Month'] = df1['Month'].map({
    'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
    'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12})

df.head(2)

Out[57]:

     Airline Source Dest              Route Map  Stops  Info  Price Day Month  Distance(km)  Dep_Hr  Dep_Min  Duration_Hr  Duration_Min  Duration_bool
0     IndiGo    BLR  DEL              BLR → DEL      0     0   3897  24   MAR       1709.71      22       20            2            50            170
1  Air India    CCU  BLR  CCU → IXR → BBI → BLR      2     0   7662   1   MAY       1838.55       5       50            7            25            445

In [58]:

X = df1.drop('Price', axis=1)
y = df1['Price']
In [59]:

# set figure size
fig, ax = plt.subplots(4, 3, figsize=(40, 60))

# create a box plot of Price against each variable
for var, subplot in zip(df1.columns, ax.flatten()):
    sns.boxplot(x=var, y='Price', data=df1, ax=subplot)

In [60]:

df1.to_csv('final_airfare.csv', index=False)

In [61]:

# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None

# display float values up to 6 decimal places
pd.options.display.float_format = '{:.6f}'.format

# import various functions from statsmodels
import statsmodels
import statsmodels.api as sm

# import 'stats'
from scipy import stats

# import various functions from sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, LeaveOneOut, cross_val_score, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

In [62]:

df = pd.read_csv('./final_airfare.csv')

# display first two observations using head()


df.head(2)

Out[62]:

Airline Source Dest Stops Info Price Day Month Distance(km) Duration

0 IndiGo BLR DEL 0 0 3897 24 3 1709.710000 170

1 Air India CCU BLR 2 0 7662 1 5 1838.550000 445

In [63]:

# store the target variable 'Price' in a dataframe 'df_target'
df_target = df['Price'].copy()
df_feature = df.drop('Price', axis = 1)

# display numerical features
df_num = df_feature.select_dtypes(include = [np.number])
print("display numerical features:\n", df_num.columns)

# display categorical features ('np.object' was removed in newer numpy; use the builtin 'object')
df_cat = df_feature.select_dtypes(include = [object])
print("display categorical features:\n", df_cat.columns)

# use 'get_dummies' from pandas to create dummy variables
# use 'drop_first' to create (n-1) dummy variables
dummy_var = pd.get_dummies(data = df_cat, drop_first = True)

display numerical features:
 Index(['Stops', 'Info', 'Day', 'Month', 'Distance(km)', 'Duration'], dtype='object')
display categorical features:
 Index(['Airline', 'Source', 'Dest'], dtype='object')
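
drop_first=True avoids the dummy-variable trap: with k categories only k-1 indicator columns are needed, because the dropped (alphabetically first) category is implied when all the indicators are zero. A toy illustration:

# Toy illustration of drop_first: 3 categories -> 2 dummy columns,
# with the alphabetically first category ('BLR') as the implicit baseline
toy = pd.Series(['BLR', 'DEL', 'CCU'], name='Source')
print(pd.get_dummies(toy, drop_first=True))
# row 0 (BLR): Source_CCU=0, Source_DEL=0   <- the baseline category
# row 1 (DEL): Source_CCU=0, Source_DEL=1
# row 2 (CCU): Source_CCU=1, Source_DEL=0
# (printed as 0/1 or True/False depending on the pandas version)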
In [64]:

# initialize the standard scaler
X_scaler = StandardScaler()

# standardize all the columns of the dataframe 'df_num'
num_scaled = X_scaler.fit_transform(df_num)

# create a dataframe of scaled numerical variables
# pass the required column names to the parameter 'columns'
df_num_scaled = pd.DataFrame(num_scaled, columns = df_num.columns)

# standardize the target variable explicitly and store it in a new variable 'y'
y = (df_target - df_target.mean()) / df_target.std()
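
Because the target y is standardized, every model below predicts on the z-score scale; to report a prediction in rupees the transform has to be inverted. A small helper for later use, assuming df_target stays in scope:

# invert the explicit target standardization applied above
def to_rupees(y_std):
    return y_std * df_target.std() + df_target.mean()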

In [65]:

# concatenate the dummy variables with the scaled numeric features to form the full feature matrix
# 'axis=1' concatenates the dataframes along columns
X = pd.concat([df_num_scaled, dummy_var], axis = 1)

# display the first two observations
X.head(2)

Out[65]:

      Stops      Info       Day     Month  Distance(km)  Duration  Airline_Air India  Airline_GoAir  Airline_IndiGo  Airline_Jet Airways  Airline_Jet Airways Business  Airline_Multiple carriers  ...
0 -1.221463 -0.479818  1.240175 -1.470566     -0.614115 -0.939403                  0              0               1                    0                             0                          0  ...
1  1.789648 -0.479818 -1.474359  0.249940     -0.391266 -0.374705                  1              0               0                    0                             0                          0  ...

(the remaining dummy columns for Airline, Source and Dest are truncated in this view)

Train-test split
In [66]:

# split data into train subset and test subset
# set 'random_state' to generate the same split each time you run the code
# 'test_size' sets the proportion of data to be included in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10, test_size = 0.3)

# check the dimensions of the train & test subsets using 'shape'
# print dimensions of the train set
print('X_train', X_train.shape)
print('y_train', y_train.shape)

# print dimensions of the test set
print('X_test', X_test.shape)
print('y_test', y_test.shape)

X_train (6254, 25)
y_train (6254,)
X_test (2681, 25)
y_test (2681,)

Creating RMSE values for train set
In [67]:

# create a generalized function to calculate the RMSE for the train set
def get_train_rmse(model):

    # predict the target variable from the train data
    # train_pred: prediction made by the model on the training dataset 'X_train'
    # y_train: actual values of the target variable for the train dataset
    train_pred = model.predict(X_train)

    # calculate the MSE for the train data using the 'mean_squared_error' function
    mse_train = mean_squared_error(y_train, train_pred)

    # take the square root of the MSE to calculate the RMSE
    # round the value to 4 digits using 'round()'
    rmse_train = round(np.sqrt(mse_train), 4)

    # return the training RMSE
    return rmse_train

Creating RMSE values for test data


In [68]:

# create a generalized function to calculate the RMSE for the test set
def get_test_rmse(model):

    # predict the target variable from the test data
    # test_pred: prediction made by the model on the test dataset 'X_test'
    # y_test: actual values of the target variable for the test dataset
    test_pred = model.predict(X_test)

    # calculate the MSE for the test data
    mse_test = mean_squared_error(y_test, test_pred)

    # take the square root of the MSE to calculate the RMSE
    # round the value to 4 digits using 'round()'
    rmse_test = round(np.sqrt(mse_test), 4)

    # return the test RMSE
    return rmse_test

MAPE Calculation
In [69]:

# define a function to calculate MAPE
# pass the actual and predicted values as input to the function
# return the calculated MAPE
def mape(actual, predicted):
    return np.mean(np.abs((actual - predicted) / actual)) * 100

def get_test_mape(model):

    # predict the target variable from the test data
    # test_pred: prediction made by the model on the test dataset 'X_test'
    # y_test: actual values of the target variable for the test dataset
    test_pred = model.predict(X_test)

    # calculate the MAPE for the test data using the 'mape()' function created above
    mape_test = mape(y_test, test_pred)

    # return the MAPE for the test set
    return mape_test

Creating a function to update scorecard
In [70]:

# create a function to update the score card for comparison of the scores from different algorithms
# pass the model name, model, alpha and l1_ratio as input parameters
# if 'alpha' and/or 'l1_ratio' is not specified, the function assigns '-'
def update_score_card(algorithm_name, model, alpha = '-', l1_ratio = '-'):

    # assign 'score_card' as a global variable
    global score_card

    # append the results to the dataframe 'score_card'
    # 'DataFrame.append' was removed in pandas 2.0, so 'pd.concat' is used instead
    # 'ignore_index = True' ignores the index labels
    score_card = pd.concat([score_card,
                            pd.DataFrame([{'Model_Name': algorithm_name,
                                           'Alpha (Wherever Required)': alpha,
                                           'l1-ratio': l1_ratio,
                                           'Test_MAPE': get_test_mape(model),
                                           'Test_RMSE': get_test_rmse(model),
                                           'R-Squared': get_score(model)[0],
                                           'Adj. R-Squared': get_score(model)[1]}])],
                           ignore_index = True)

Function to plot barplot


In [71]:

# define a function to plot a barplot of model coefficients
def plot_coefficients(model, algorithm_name):

    # create a dataframe of variable names and their corresponding coefficients
    # 'columns' returns the column names of the dataframe 'X'
    # 'coef_' returns the coefficient of each variable
    df_coeff = pd.DataFrame({'Variable': X.columns, 'Coefficient': model.coef_})

    # sort the dataframe in descending order
    # 'sort_values' sorts the column based on the values
    # 'ascending = False' sorts the values in descending order
    sorted_coeff = df_coeff.sort_values('Coefficient', ascending = False)

    # plot a bar plot with Coefficient on the x-axis and Variable names on the y-axis
    sns.barplot(x = "Coefficient", y = "Variable", data = sorted_coeff)

    # add axis labels; set the size of the text using 'fontsize'
    plt.xlabel("Coefficients from {}".format(algorithm_name), fontsize = 15)
    plt.ylabel('Features', fontsize = 15)

Function to generate R-squared and adjusted R-squared


In [72]:

# define a function to get the R-squared and adjusted R-squared values
def get_score(model):

    # score() returns the R-squared value
    r_sq = model.score(X_train, y_train)

    # calculate the adjusted R-squared value
    # 'n' denotes the number of observations in the train set ('shape[0]' returns number of rows)
    n = X_train.shape[0]

    # 'k' denotes the number of variables in the train set ('shape[1]' returns number of columns)
    k = X_train.shape[1]

    # calculate adjusted R-squared using the formula: 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    r_sq_adj = 1 - ((1 - r_sq) * (n - 1) / (n - k - 1))

    # return the R-squared and adjusted R-squared values
    return [r_sq, r_sq_adj]

In [73]:

# n_splits: specify the number of folds
kf = KFold(n_splits = 5)

In [74]:

# create a function 'Get_score' that fits the model on one fold and
# returns the R-squared score on the held-out fold
# 'Get_score' takes 5 input parameters
def Get_score(model, X_train_k, X_test_k, y_train_k, y_test_k):
    model.fit(X_train_k, y_train_k)  # fit the model
    return model.score(X_test_k, y_test_k)

In [75]:

# create an empty list to store the scores
scores = []

# kf.split() splits the indices of X_train into train_index and test_index
# further dividing the X_train and y_train sets into train and test folds for cross validation
# Remember: cross validation works on the training set, not on the test set
# use '\' for stacking the code
for train_index, test_index in kf.split(X_train):
    X_train_k, X_test_k, y_train_k, y_test_k = X_train.iloc[train_index], X_train.iloc[test_index], \
                                               y_train.iloc[train_index], y_train.iloc[test_index]

    # call the function 'Get_score()' and append the score to the list 'scores'
    scores.append(Get_score(LinearRegression(), X_train_k, X_test_k, y_train_k, y_test_k))

# print all scores
print('All scores: ', scores)

# print the minimum score from the list (min() returns the minimum score)
print("\nMinimum score obtained: ", np.min(scores))

# print the maximum score from the list (max() returns the maximum score)
print("Maximum score obtained: ", np.max(scores))

# print the average score from the list (np.mean() returns the average score)
print("Average score obtained: ", np.mean(scores))

All scores: [0.5837935157120637, 0.6215639618731146, 0.5774617416754073, 0.6150767867337072, 0.4863310346921561]

Minimum score obtained: 0.4863310346921561
Maximum score obtained: 0.6215639618731146
Average score obtained: 0.5768454081372898

In [76]:

# use cross_val_score() for k-fold cross validation
# estimator: pass the machine learning function; here we perform linear regression
# pass the X_train and y_train sets
# cv: the number of folds, similar to k in KFold
# scoring: pass the scoring parameter, e.g. 'r2' for R-squared, 'neg_mean_squared_error' for negative MSE
scores = cross_val_score(estimator = LinearRegression(), X = X_train,
                         y = y_train, cv = 5, scoring = 'r2')

In [77]:

# print all scores
print('All scores: ', scores)

# print the minimum score, rounded to 4 digits
print("\nMinimum score obtained: ", round(np.min(scores), 4))

# print the maximum score, rounded to 4 digits
print("Maximum score obtained: ", round(np.max(scores), 4))

# print the average score, rounded to 4 digits
print("Average score obtained: ", round(np.mean(scores), 4))

All scores: [0.58379352 0.62156396 0.57746174 0.61507679 0.48633103]

Minimum score obtained: 0.4863
Maximum score obtained: 0.6216
Average score obtained: 0.5768

These match the manual KFold results above, since cv=5 performs the same unshuffled 5-fold split.

In [78]:

# create an empty list to store the RMSE for each model
loocv_rmse = []

# instantiate the LOOCV method
loocv = LeaveOneOut()

# use a for loop to build a regression model for each cross validation
# split() divides the dataset into two subsets: one with (n-1) data points and another with 1 data point,
# where n = total number of observations
for train_index, test_index in loocv.split(X_train):

    # create the train and test folds; use iloc[] to retrieve the corresponding observations
    # use '\' for stacking the code
    X_train_l, X_test_l, y_train_l, y_test_l = X_train.iloc[train_index], X_train.iloc[test_index], \
                                               y_train.iloc[train_index], y_train.iloc[test_index]

    # instantiate the regression model
    linreg = LinearRegression()

    # fit the model on the training fold
    linreg.fit(X_train_l, y_train_l)

    # calculate the MSE on the held-out point; use predict() to predict the target variable
    mse = mean_squared_error(y_test_l, linreg.predict(X_test_l))

    # calculate the RMSE
    rmse = np.sqrt(mse)

    # use append() to add each RMSE to the list 'loocv_rmse'
    loocv_rmse.append(rmse)

In [79]:

# print the minimum, maximum, and average RMSE from the list, rounded to 4 digits
print("\nMinimum rmse obtained: ", round(np.min(loocv_rmse), 4))
print("Maximum rmse obtained: ", round(np.max(loocv_rmse), 4))
print("Average rmse obtained: ", round(np.mean(loocv_rmse), 4))

Minimum rmse obtained: 0.0
Maximum rmse obtained: 690148210.7596
Average rmse obtained: 110353.4964

The extreme maximum suggests that a few leave-one-out fits are numerically unstable (unregularized least squares can extrapolate wildly when the single held-out observation carries a rare dummy level), which is part of the motivation for trying regularized models below.

In [80]:

models = [['LinearRegression', LinearRegression(), 'na'],
          ['ElasticNet', ElasticNet(), [{'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 20, 40, 60],
                                         'l1_ratio': [0.0001, 0.0002, 0.001, 0.01, 0.1, 0.2]}]],
          ['Lasso', Lasso(), [{'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 20]}]],
          ['Ridge', Ridge(), [{'alpha': [1e-4, 1e-3, 1e-2, 0.1, 1, 5, 10, 20, 40, 60, 80, 100]}]],
          ['GradientBoostingRegressor', GradientBoostingRegressor(), 'na'],
          ['SGDRegressor', SGDRegressor(), 'na']]

In [81]:

# create an empty dataframe to store the scores for the various algorithms
score_card = pd.DataFrame(columns=['Model_Name', 'Alpha (Wherever Required)', 'l1-ratio', 'R-Squared',
                                   'Adj. R-Squared', 'Test_RMSE', 'Test_MAPE'])

for name, model, grid in models:
    if grid == 'na':
        # models without a hyperparameter grid are fitted directly
        model.fit(X_train, y_train)
        update_score_card(algorithm_name = name, model = model)
    else:
        # otherwise tune the hyperparameters with 10-fold grid search first
        model = GridSearchCV(estimator = model, param_grid = grid, cv = 10)
        model.fit(X_train, y_train)
        update_score_card(algorithm_name = name, model = model, alpha = model.best_params_.get('alpha'),
                          l1_ratio = model.best_params_.get('l1_ratio'))

# sort the dataframe 'score_card' on 'Test_RMSE' in ascending order using 'sort_values'
# 'reset_index' resets the index of the dataframe; 'drop = True' drops the previous index
score_card = score_card.sort_values('Test_RMSE').reset_index(drop = True)

# color the cell in the column 'Test_RMSE' having the minimum RMSE value
# 'style.highlight_min' assigns the specified color to the minimum value in the 'subset' column
score_card.style.highlight_min(color = 'lightblue', subset = 'Test_RMSE')

Out[81]:

Model_Name Alpha (Wherever Required) l1-ratio R-Squared Adj. R-Squared Test_RMSE Test_MAPE

0 GradientBoostingRegressor - - 0.834096 0.833430 0.387700 82.749896

1 ElasticNet 0.000100 0.200000 0.616437 0.614897 0.590600 136.275176

2 Ridge 0.100000 None 0.621233 0.619712 0.615200 135.834600

3 Lasso 0.000100 None 0.621497 0.619978 0.624500 135.599727

4 SGDRegressor - - 0.576104 0.574402 0.670100 140.212988

5 LinearRegression - - 0.621432 0.619912 13643934.498300 24643102.636528

Note: the Test_MAPE values are inflated because the target was standardized to mean 0, so actual values near zero blow up the percentage error; Test_RMSE is the more reliable comparison column here, and on that basis GradientBoostingRegressor performs best.

5. Model Creation
In [82]:

gradBoost = GradientBoostingRegressor()
gradBoost.fit(X_train, y_train)
prediction = gradBoost.predict(X_test)
print('RMSE : {}'.format(np.sqrt(mean_squared_error(y_test, prediction))))

RMSE : 0.3878337073231023

In [83]:

gradBoost.score(X_train, y_train), gradBoost.score(X_test, y_test)

Out[83]:

(0.8340958488952346, 0.8512828235466694)

In [84]:

print('MAE:', mean_absolute_error(y_test, prediction))
print('MSE:', mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(mean_squared_error(y_test, prediction)))

MAE: 0.26135248119793697
MSE: 0.15041498453598176
RMSE: 0.3878337073231023
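
These error metrics are on the standardized price scale. Since the actual and predicted values were both divided by the same standard deviation, MAE and RMSE can be converted back to rupees by multiplying by it; a sketch, assuming df_target from the pre-processing cells is still in scope:

# express the test errors on the original rupee scale
rmse_rupees = np.sqrt(mean_squared_error(y_test, prediction)) * df_target.std()
mae_rupees = mean_absolute_error(y_test, prediction) * df_target.std()
print('RMSE (rupees):', round(rmse_rupees, 2))
print('MAE  (rupees):', round(mae_rupees, 2))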
In [85]:

plt.figure(figsize = (4,4))
plt.scatter(y_test, prediction, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()

Create a Pipeline and Save Predictive Model:


In [86]:

import pickle

# use a context manager so the file handle is closed after writing
with open('final_model.pkl', 'wb') as file:
    pickle.dump(gradBoost, file)

In [87]:

# reload the saved model from disk
with open("final_model.pkl", "rb") as model_file:
    gradBoost = pickle.load(model_file)

In [88]:

from sklearn import metrics

predictions2 = gradBoost.predict(X_test)
metrics.r2_score(y_test, predictions2)

Out[88]:

0.8512828235466694

We have now created a predictive model and saved it permanently to disk; it can be reloaded and applied whenever new data needs to be scored, provided the same pre-processing steps are applied to that data first.
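
A sketch of what scoring one new itinerary could look like, assuming the fitted X_scaler, the df_num/df_cat column lists, the dummy_var layout, and df_target from this notebook are still in scope (in a real deployment these would be saved alongside the model, for example inside an sklearn Pipeline):

# Illustration only: score a single new itinerary with the saved model
import pickle

with open('final_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

new_row = pd.DataFrame([{'Airline': 'IndiGo', 'Source': 'BLR', 'Dest': 'DEL',
                         'Stops': 0, 'Info': 0, 'Day': 24, 'Month': 3,
                         'Distance(km)': 1709.71, 'Duration': 170}])

# apply the same pre-processing used for the training data
num = pd.DataFrame(X_scaler.transform(new_row[df_num.columns]), columns=df_num.columns)
cat = pd.get_dummies(new_row[df_cat.columns]).reindex(columns=dummy_var.columns, fill_value=0)
x_new = pd.concat([num, cat], axis=1)

# predict on the standardized scale, then convert back to rupees
y_std = loaded_model.predict(x_new)[0]
print(round(y_std * df_target.std() + df_target.mean(), 2))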
