Cac2 22112338

Data Loading

import pandas as pd

data1 = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data1.csv')
data2 = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data2.csv')

print("This is 1st dataset",data1.head())


print("This is 2nd dataset",data2.head())

This is 1st dataset    Count_point_id Direction_of_travel  Year        Count_date  hour  \
0             749                   E  2014  25-06-2014 00:00     7
1             749                   E  2014  25-06-2014 00:00     8
2             749                   E  2014  25-06-2014 00:00     9
3             749                   E  2014  25-06-2014 00:00    10
4             749                   E  2014  25-06-2014 00:00    11

   Region_id Region_name Region_ons_code  Local_authority_id  \
0          3    Scotland       S92000003                  39
1          3    Scotland       S92000003                  39
2          3    Scotland       S92000003                  39
3          3    Scotland       S92000003                  39
4          3    Scotland       S92000003                  39

  Local_authority_name  ...  3:00-4:00PM  4:00-5:00PM  5:00-6:00PM  6:00-7:00PM  \
0        East Ayrshire  ...        105.0        147.0        120.0         91.0
1        East Ayrshire  ...         98.0        133.0        131.0         95.0
2        East Ayrshire  ...        115.0        130.0        143.0        106.0
3        East Ayrshire  ...        127.0        122.0        144.0        122.0
4        East Ayrshire  ...        126.0        133.0        135.0        102.0

   7:00-8:00PM  8:00-9:00PM  9:00-10:00PM  10:00-11:00PM  11:00-12:00AM  Unnamed: 60
0         83.0         74.0          49.0           42.0           42.0          NaN
1         73.0         70.0          63.0           42.0           35.0          NaN
2         89.0         68.0          64.0           56.0           43.0          NaN
3         76.0         64.0          58.0           64.0           43.0          NaN
4        106.0         58.0          58.0           55.0           54.0          NaN

[5 rows x 61 columns]
This is 2nd dataset    ID  SegmentID  Roadway Name         From                To Direction  \
0   1      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
1   2      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
2   3      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
3   4      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
4   5      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB

         Date  12:00-1:00 AM  1:00-2:00AM  2:00-3:00AM  ...  2:00-3:00PM  \
0  01-09-2012           20.0         10.0         11.0  ...        104.0
1  01-10-2012           21.0         16.0          8.0  ...        102.0
2  01-11-2012           27.0         14.0          6.0  ...        115.0
3  01-12-2012           22.0          7.0          7.0  ...         71.0
4  01/13/2012           31.0         17.0          7.0  ...        113.0

   3:00-4:00PM  4:00-5:00PM  5:00-6:00PM  6:00-7:00PM  7:00-8:00PM  \
0        105.0        147.0        120.0         91.0         83.0
1         98.0        133.0        131.0         95.0         73.0
2        115.0        130.0        143.0        106.0         89.0
3        127.0        122.0        144.0        122.0         76.0
4        126.0        133.0        135.0        102.0        106.0

   8:00-9:00PM  9:00-10:00PM  10:00-11:00PM  11:00-12:00AM
0         74.0          49.0           42.0           42.0
1         70.0          63.0           42.0           35.0
2         68.0          64.0           56.0           43.0
3         64.0          58.0           64.0           43.0
4         58.0          58.0           55.0           54.0

[5 rows x 31 columns]
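The doubled-backslash paths above are easy to get wrong; a `pathlib` variant (the directory name is taken from the snippet above and is specific to one machine, so the read is guarded) sidesteps the escaping entirely:

```python
from pathlib import Path

import pandas as pd

# Directory copied from the original snippet; adjust for your own machine.
base = Path(r"C:\Users\lariy\OneDrive\Documents\Machine_Learning SEM4")
csv_path = base / "data1.csv"  # "/" joins path components, no manual backslashes

# Guarded read so the sketch also runs where the file is absent.
if csv_path.exists():
    data1 = pd.read_csv(csv_path)
```

Raw strings (`r"..."`) remove the need for doubled backslashes, and `Path` objects render correctly on every OS.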


print(data1.shape)
print(data2.shape)

(1048575, 61)
(42756, 31)

print(data1.info())
print(data2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Count_point_id 1048575 non-null int64
1 Direction_of_travel 1048575 non-null object
2 Year 1048575 non-null int64
3 Count_date 1048575 non-null object
4 hour 1048575 non-null int64
5 Region_id 1048575 non-null int64
6 Region_name 1048575 non-null object
7 Region_ons_code 1048575 non-null object
8 Local_authority_id 1048575 non-null int64
9 Local_authority_name 1048575 non-null object
10 Local_authority_code 1048575 non-null object
11 Road_name 1048575 non-null object
12 Road_category 1048575 non-null object
13 Road_type 1048575 non-null object
14 Start_junction_road_name 402756 non-null object
15 End_junction_road_name 402732 non-null object
16 Easting 1048575 non-null int64
17 Northing 1048575 non-null int64
18 Latitude 1048575 non-null float64
19 Longitude 1048575 non-null float64
20 Link_length_km 403584 non-null float64
21 Link_length_miles 403584 non-null float64
22 Pedal_cycles 1048575 non-null int64
23 Two_wheeled_motor_vehicles 1048575 non-null int64
24 Cars_and_taxis 1048575 non-null int64
25 Buses_and_coaches 1048575 non-null int64
26 LGVs 1048575 non-null int64
27 HGVs_2_rigid_axle 1048575 non-null int64
28 HGVs_3_rigid_axle 1048575 non-null int64
29 HGVs_4_or_more_rigid_axle 1048575 non-null int64
30 HGVs_3_or_4_articulated_axle 1048575 non-null int64
31 HGVs_5_articulated_axle 1048575 non-null int64
32 HGVs_6_articulated_axle 1048575 non-null int64
33 All_HGVs 1048575 non-null int64
34 All_motor_vehicles 1048575 non-null int64
35 Unnamed: 35 0 non-null float64
36 12:00-1:00 AM 15519 non-null float64
37 1:00-2:00AM 15519 non-null float64
38 2:00-3:00AM 15519 non-null float64
39 3:00-4:00AM 15519 non-null float64
40 4:00-5:00AM 15519 non-null float64
41 5:00-6:00AM 15519 non-null float64
42 6:00-7:00AM 15519 non-null float64
43 7:00-8:00AM 15519 non-null float64
44 8:00-9:00AM 15519 non-null float64
45 9:00-10:00AM 15519 non-null float64
46 10:00-11:00AM 15519 non-null float64
47 11:00-12:00PM 15519 non-null float64
48 12:00-1:00PM 15266 non-null float64
49 1:00-2:00PM 15266 non-null float64
50 2:00-3:00PM 15266 non-null float64
51 3:00-4:00PM 15266 non-null float64
52 4:00-5:00PM 15266 non-null float64
53 5:00-6:00PM 15266 non-null float64
54 6:00-7:00PM 15266 non-null float64
55 7:00-8:00PM 15266 non-null float64
56 8:00-9:00PM 15266 non-null float64
57 9:00-10:00PM 15266 non-null float64
58 10:00-11:00PM 15266 non-null float64
59 11:00-12:00AM 15266 non-null float64
60 Unnamed: 60 0 non-null float64
dtypes: float64(30), int64(20), object(11)
memory usage: 488.0+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42756 entries, 0 to 42755
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 42756 non-null int64
1 SegmentID 42756 non-null int64
2 Roadway Name 42756 non-null object
3 From 42756 non-null object
4 To 42756 non-null object
5 Direction 42756 non-null object
6 Date 42756 non-null object
7 12:00-1:00 AM 42752 non-null float64
8 1:00-2:00AM 42752 non-null float64
9 2:00-3:00AM 42752 non-null float64
10 3:00-4:00AM 42752 non-null float64
11 4:00-5:00AM 42752 non-null float64
12 5:00-6:00AM 42752 non-null float64
13 6:00-7:00AM 42752 non-null float64
14 7:00-8:00AM 42752 non-null float64
15 8:00-9:00AM 42752 non-null float64
16 9:00-10:00AM 42752 non-null float64
17 10:00-11:00AM 42753 non-null float64
18 11:00-12:00PM 42755 non-null float64
19 12:00-1:00PM 42503 non-null float64
20 1:00-2:00PM 42503 non-null float64
21 2:00-3:00PM 42503 non-null float64
22 3:00-4:00PM 42503 non-null float64
23 4:00-5:00PM 42503 non-null float64
24 5:00-6:00PM 42503 non-null float64
25 6:00-7:00PM 42503 non-null float64
26 7:00-8:00PM 42503 non-null float64
27 8:00-9:00PM 42503 non-null float64
28 9:00-10:00PM 42503 non-null float64
29 10:00-11:00PM 42503 non-null float64
30 11:00-12:00AM 42503 non-null float64
dtypes: float64(24), int64(2), object(5)
memory usage: 10.1+ MB
None

print(data1.describe())
print(data2.describe())

       Count_point_id          Year          hour     Region_id  \
count    1.048575e+06  1.048575e+06  1.048575e+06  1.048575e+06
mean     5.995700e+05  2.011974e+03  1.249989e+01  6.305353e+00
std      4.394447e+05  3.490533e+00  3.452043e+00  2.997732e+00
min      5.300000e+01  2.001000e+03  7.000000e+00  1.000000e+00
25%      5.607700e+04  2.009000e+03  9.000000e+00  4.000000e+00
50%      9.407220e+05  2.010000e+03  1.200000e+01  7.000000e+00
75%      9.462220e+05  2.015000e+03  1.500000e+01  9.000000e+00
max      9.999990e+05  2.021000e+03  1.800000e+01  1.100000e+01

       Local_authority_id       Easting      Northing      Latitude  \
count        1.048575e+06  1.048575e+06  1.048575e+06  1.048575e+06
mean         1.028254e+02  4.350583e+05  3.003754e+05  5.259198e+01
std          5.172886e+01  9.336256e+04  1.589562e+05  1.431117e+00
min          1.000000e+00  7.040600e+04  1.077500e+04  4.991714e+01
25%          6.700000e+01  3.720730e+05  1.786700e+05  5.149455e+01
50%          9.700000e+01  4.350000e+05  2.724000e+05  5.234075e+01
75%          1.410000e+02  5.105410e+05  3.966100e+05  5.346388e+01
max          2.080000e+02  6.550000e+05  1.179870e+06  6.049980e+01

          Longitude  Link_length_km  ...   3:00-4:00PM   4:00-5:00PM  \
count  1.048575e+06   403584.000000  ...  15266.000000  15266.000000
mean  -1.500683e+00        3.094704  ...    669.589152    676.870693
std    1.371715e+00        3.589633  ...    798.586645    804.420051
min   -7.425717e+00        0.100000  ...      0.000000      0.000000
25%   -2.418145e+00        0.900000  ...    253.000000    255.000000
50%   -1.479373e+00        1.900000  ...    426.000000    432.000000
75%   -3.898619e-01        3.900000  ...    736.000000    745.000000
max    1.754553e+00       46.300000  ...   6016.000000   5923.000000

        5:00-6:00PM   6:00-7:00PM   7:00-8:00PM   8:00-9:00PM  9:00-10:00PM  \
count  15266.000000  15266.000000  15266.000000  15266.000000  15266.000000
mean     676.528560    643.128848    575.776104    498.165073    429.040875
std      799.064056    784.666243    737.540903    670.361583    600.447526
min        0.000000      0.000000      0.000000      0.000000      0.000000
25%      253.000000    230.000000    194.000000    159.000000    128.000000
50%      430.000000    400.000000    345.000000    288.000000    234.000000
75%      751.000000    704.000000    612.000000    516.000000    435.000000
max     6169.000000   5810.000000   5249.000000   5102.000000   4986.000000

       10:00-11:00PM  11:00-12:00AM  Unnamed: 60
count   15266.000000   15266.000000          0.0
mean      375.862243     315.701231          NaN
std       552.187778     483.493903          NaN
min         0.000000       0.000000          NaN
25%       103.000000      79.000000          NaN
50%       193.000000     154.000000          NaN
75%       373.000000     309.000000          NaN
max      4468.000000    4815.000000          NaN

[8 rows x 50 columns]
                ID     SegmentID  12:00-1:00 AM   1:00-2:00AM   2:00-3:00AM  \
count  42756.000000  4.275600e+04   42752.000000  42752.000000  42752.000000
mean     302.926841  4.988159e+05     251.448423    178.591504    135.280318
std      504.422798  1.875303e+06     407.435712    303.030296    242.091877
min        1.000000  2.020000e+02       0.000000      0.000000      0.000000
25%       95.000000  3.402700e+04      60.000000     38.000000     26.000000
50%      193.000000  7.534300e+04     118.000000     79.000000     56.000000
75%      299.000000  1.448810e+05     241.000000    171.000000    128.000000
max     3393.000000  9.017050e+06    4805.000000   4489.000000   4818.000000

        3:00-4:00AM   4:00-5:00AM   5:00-6:00AM  6:00-7:00AM   7:00-8:00AM  \
count  42752.000000  42752.000000  42752.000000  42752.00000  42752.000000
mean     117.619359    135.677980    206.655747    352.60051    491.184673
std      215.316979    249.763737    415.614033    621.37410    697.735616
min        0.000000      0.000000      0.000000      0.00000      0.000000
25%       22.000000     27.000000     42.000000     77.00000    133.000000
50%       47.000000     56.000000     85.000000    156.00000    270.000000
75%      110.000000    125.000000    182.000000    335.00000    538.000000
max     4323.000000   4469.000000   6456.000000   7513.00000   9226.330000

       ...   2:00-3:00PM   3:00-4:00PM   4:00-5:00PM   5:00-6:00PM  \
count  ...  42503.000000  42503.000000  42503.000000  42503.000000
mean   ...    630.179376    657.691175    661.120886    657.712538
std    ...    758.610055    779.552761    776.312508    770.034661
min    ...      0.000000      0.000000      0.000000      0.000000
25%    ...    250.000000    260.000000    261.000000    258.000000
50%    ...    406.000000    425.000000    428.000000    424.000000
75%    ...    679.000000    713.000000    724.000000    724.000000
max    ...   6996.000000   7524.000000   8683.000000   9762.000000

        6:00-7:00PM   7:00-8:00PM   8:00-9:00PM  9:00-10:00PM  10:00-11:00PM  \
count  42503.000000  42503.000000  42503.000000  42503.000000   42503.000000
mean     628.825612    570.346987    496.648166    428.457779     376.555067
std      764.796252    732.905268    678.326096    614.480940     565.197252
min        0.000000      0.000000      0.000000      0.000000       0.000000
25%      237.000000    204.000000    167.000000    133.000000     108.000000
50%      396.000000    344.000000    287.000000    235.000000     196.000000
75%      680.000000    598.500000    508.000000    427.000000     366.000000
max     9879.000000  10532.000000   6659.000000   5698.000000    5460.000000

       11:00-12:00AM
count   42503.000000
mean      315.806014
std       493.563207
min         0.000000
25%        82.000000
50%       157.000000
75%       303.000000
max      5027.000000

[8 rows x 26 columns]

print(data1.columns)

print(data2.columns)

Index(['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'hour',
       'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id',
       'Local_authority_name', 'Local_authority_code', 'Road_name',
       'Road_category', 'Road_type', 'Start_junction_road_name',
       'End_junction_road_name', 'Easting', 'Northing', 'Latitude',
       'Longitude', 'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', 'Unnamed: 35', '12:00-1:00 AM', '1:00-2:00AM',
       '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM',
       '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM',
       '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM',
       '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM',
       '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM',
       '10:00-11:00PM', '11:00-12:00AM', 'Unnamed: 60'],
      dtype='object')
Index(['ID', 'SegmentID', 'Roadway Name', 'From', 'To', 'Direction', 'Date',
       '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM',
       '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM',
       '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM',
       '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM',
       '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM',
       '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM'],
      dtype='object')

print(data1['Count_point_id'].unique())
print(data2['ID'].unique())

[ 749 7033 18029 ... 945081 946667 942323]


[ 1 2 3 ... 3391 3392 3393]

Data Preparation

Checking for missing values

print("Missing values in Traffic Volume Data:")


print(data1.isnull().sum())

Missing values in Traffic Volume Data:


Count_point_id 0
Direction_of_travel 0
Year 0
Count_date 0
hour 0
...
8:00-9:00PM 1033309
9:00-10:00PM 1033309
10:00-11:00PM 1033309
11:00-12:00AM 1033309
Unnamed: 60 1048575
Length: 61, dtype: int64

## Since 'Start_junction_road_name' and 'End_junction_road_name' contain non-numerical data,
## it is better to drop these two columns.

data1.drop(columns=['Start_junction_road_name'], inplace=True)

data1.drop(columns=['End_junction_road_name'], inplace=True)

print(data1.columns)

Index(['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'hour',
       'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id',
       'Local_authority_name', 'Local_authority_code', 'Road_name',
       'Road_category', 'Road_type', 'Easting', 'Northing', 'Latitude',
       'Longitude', 'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', 'Unnamed: 35', '12:00-1:00 AM', '1:00-2:00AM',
       '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM',
       '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM',
       '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM',
       '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM',
       '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM',
       '10:00-11:00PM', '11:00-12:00AM', 'Unnamed: 60'],
      dtype='object')

## The remaining columns with missing values, 'Link_length_km' and 'Link_length_miles',
## hold numerical data, so we impute the missing values with the mean or median of their
## respective columns.

mean_link_length_km = data1['Link_length_km'].mean()
median_link_length_miles = data1['Link_length_miles'].median()
# Assignment is used instead of inplace fillna on a column, which is
# deprecated chained-assignment behaviour in recent pandas.
data1['Link_length_km'] = data1['Link_length_km'].fillna(mean_link_length_km)
data1['Link_length_miles'] = data1['Link_length_miles'].fillna(median_link_length_miles)
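Whether the mean or the median is the better fill value depends on skew: `describe()` shows `Link_length_km` with a median of 1.9 against a max of 46.3, so a few very long links pull the mean upward and the median is arguably the safer choice. A toy sketch (the values loosely echo the quartiles above and are my own):

```python
import pandas as pd

# Miniature frame mimicking the two link-length columns from the dataset.
df = pd.DataFrame({
    "Link_length_km": [0.9, 1.9, None, 46.3, 3.9],
    "Link_length_miles": [0.6, 1.2, None, 28.8, 2.4],
})

# Median resists the outlier (46.3); mean is shown for comparison.
df["Link_length_km"] = df["Link_length_km"].fillna(df["Link_length_km"].median())
df["Link_length_miles"] = df["Link_length_miles"].fillna(df["Link_length_miles"].mean())

print(df.isnull().sum().sum())  # 0 — no missing values remain
```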

##After handling missing values, we will check if there are any remaining missing values.

print("Missing values in data after handling:")


print(data1.isnull().sum())

Missing values in data after handling:


Count_point_id 0
Direction_of_travel 0
Year 0
Count_date 0
hour 0
Region_id 0
Region_name 0
Region_ons_code 0
Local_authority_id 0
Local_authority_name 0
Local_authority_code 0
Road_name 0
Road_category 0
Road_type 0
Easting 0
Northing 0
Latitude 0
Longitude 0
Link_length_km 0
Link_length_miles 0
Pedal_cycles 0
Two_wheeled_motor_vehicles 0
Cars_and_taxis 0
Buses_and_coaches 0
LGVs 0
HGVs_2_rigid_axle 0
HGVs_3_rigid_axle 0
HGVs_4_or_more_rigid_axle 0
HGVs_3_or_4_articulated_axle 0
HGVs_5_articulated_axle 0
HGVs_6_articulated_axle 0
All_HGVs 0
All_motor_vehicles 0
Unnamed: 35 1048575
12:00-1:00 AM 1033056
1:00-2:00AM 1033056
2:00-3:00AM 1033056
3:00-4:00AM 1033056
4:00-5:00AM 1033056
5:00-6:00AM 1033056
6:00-7:00AM 1033056
7:00-8:00AM 1033056
8:00-9:00AM 1033056
9:00-10:00AM 1033056
10:00-11:00AM 1033056
11:00-12:00PM 1033056
12:00-1:00PM 1033309
1:00-2:00PM 1033309
2:00-3:00PM 1033309
3:00-4:00PM 1033309
4:00-5:00PM 1033309
5:00-6:00PM 1033309
6:00-7:00PM 1033309
7:00-8:00PM 1033309
8:00-9:00PM 1033309
9:00-10:00PM 1033309
10:00-11:00PM 1033309
11:00-12:00AM 1033309
Unnamed: 60 1048575
dtype: int64

## After this step, the link-length columns are fully imputed; the hourly count columns
## and the unnamed placeholder columns still contain missing values and are handled later.

# Checking missing values in the 2nd dataset (data2)


print("Missing values in Traffic Volume Counts:")
print(data2.isnull().sum())

Missing values in Traffic Volume Counts:


ID 0
SegmentID 0
Roadway Name 0
From 0
To 0
Direction 0
Date 0
12:00-1:00 AM 4
1:00-2:00AM 4
2:00-3:00AM 4
3:00-4:00AM 4
4:00-5:00AM 4
5:00-6:00AM 4
6:00-7:00AM 4
7:00-8:00AM 4
8:00-9:00AM 4
9:00-10:00AM 4
10:00-11:00AM 3
11:00-12:00PM 1
12:00-1:00PM 253
1:00-2:00PM 253
2:00-3:00PM 253
3:00-4:00PM 253
4:00-5:00PM 253
5:00-6:00PM 253
6:00-7:00PM 253
7:00-8:00PM 253
8:00-9:00PM 253
9:00-10:00PM 253
10:00-11:00PM 253
11:00-12:00AM 253
dtype: int64

Since both datasets relate to the same problem statement, we can merge them.
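The two files do not share an obvious key column (UK count points vs. NYC segments), so a row-wise stack over the columns they have in common is one option. A minimal sketch, where the toy frames `d1` and `d2` standing in for `data1` and `data2` are my own invention:

```python
import pandas as pd

# Tiny stand-ins for the two datasets; only shared hourly columns are kept.
d1 = pd.DataFrame({"12:00-1:00 AM": [20.0], "1:00-2:00AM": [10.0]})
d2 = pd.DataFrame({"12:00-1:00 AM": [31.0], "1:00-2:00AM": [17.0]})

# Intersect the column sets, then stack rows with a key recording the source.
shared = d1.columns.intersection(d2.columns)
merged = pd.concat([d1[shared], d2[shared]], keys=["uk", "nyc"])
print(merged.shape)  # (2, 2)
```

The `keys` argument builds a MultiIndex so each row can be traced back to its source dataset after the concat.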

import pandas as pd
df = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data1.csv')
print(df)

         Count_point_id Direction_of_travel  Year        Count_date  hour  \
0                   749                   E  2014  25-06-2014 00:00     7
1                   749                   E  2014  25-06-2014 00:00     8
2                   749                   E  2014  25-06-2014 00:00     9
3                   749                   E  2014  25-06-2014 00:00    10
4                   749                   E  2014  25-06-2014 00:00    11
...                 ...                 ...   ...               ...   ...
1048570          942187                   N  2009  03-06-2009 00:00     8
1048571          942187                   N  2009  03-06-2009 00:00     9
1048572          942187                   N  2009  03-06-2009 00:00    10
1048573          942187                   N  2009  03-06-2009 00:00    11
1048574          942187                   N  2009  03-06-2009 00:00    12

         Region_id      Region_name Region_ons_code  Local_authority_id  \
0                3         Scotland       S92000003                  39
1                3         Scotland       S92000003                  39
2                3         Scotland       S92000003                  39
3                3         Scotland       S92000003                  39
4                3         Scotland       S92000003                  39
...            ...              ...             ...                 ...
1048570          7  East of England       E12000006                 126
1048571          7  East of England       E12000006                 126
1048572          7  East of England       E12000006                 126
1048573          7  East of England       E12000006                 126
1048574          7  East of England       E12000006                 126

        Local_authority_name  ...  3:00-4:00PM  4:00-5:00PM  5:00-6:00PM  \
0              East Ayrshire  ...        105.0        147.0        120.0
1              East Ayrshire  ...         98.0        133.0        131.0
2              East Ayrshire  ...        115.0        130.0        143.0
3              East Ayrshire  ...        127.0        122.0        144.0
4              East Ayrshire  ...        126.0        133.0        135.0
...                      ...  ...          ...          ...          ...
1048570              Suffolk  ...          NaN          NaN          NaN
1048571              Suffolk  ...          NaN          NaN          NaN
1048572              Suffolk  ...          NaN          NaN          NaN
1048573              Suffolk  ...          NaN          NaN          NaN
1048574              Suffolk  ...          NaN          NaN          NaN

         6:00-7:00PM  7:00-8:00PM  8:00-9:00PM  9:00-10:00PM  10:00-11:00PM  \
0               91.0         83.0         74.0          49.0           42.0
1               95.0         73.0         70.0          63.0           42.0
2              106.0         89.0         68.0          64.0           56.0
3              122.0         76.0         64.0          58.0           64.0
4              102.0        106.0         58.0          58.0           55.0
...              ...          ...          ...           ...            ...
1048570          NaN          NaN          NaN           NaN            NaN
1048571          NaN          NaN          NaN           NaN            NaN
1048572          NaN          NaN          NaN           NaN            NaN
1048573          NaN          NaN          NaN           NaN            NaN
1048574          NaN          NaN          NaN           NaN            NaN

         11:00-12:00AM  Unnamed: 60
0                 42.0          NaN
1                 35.0          NaN
2                 43.0          NaN
3                 43.0          NaN
4                 54.0          NaN
...                ...          ...
1048570            NaN          NaN
1048571            NaN          NaN
1048572            NaN          NaN
1048573            NaN          NaN
1048574            NaN          NaN

[1048575 rows x 61 columns]

import matplotlib.pyplot as plt


import seaborn as sns

# Create a heatmap of missing values


plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
missing_percentage = (df.isnull().sum() / len(df)) * 100
plt.show()
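Note that `missing_percentage` is computed in the cell above but never displayed. A small numeric companion to the heatmap (the toy frame here stands in for `df`) prints it sorted, worst columns first:

```python
import pandas as pd

# Toy frame in place of df: column "x" is half missing, "y" is complete.
df = pd.DataFrame({"x": [1, None, None, 4], "y": [1, 2, 3, 4]})

# Percentage of missing values per column, highest first.
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(missing_percentage.sort_values(ascending=False))
```

On the real `data1` this makes the near-empty columns ('Unnamed: 35', 'Unnamed: 60', the hourly counts) immediately visible without reading the heatmap.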

df.columns

Index(['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'hour',
       'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id',
       'Local_authority_name', 'Local_authority_code', 'Road_name',
       'Road_category', 'Road_type', 'Start_junction_road_name',
       'End_junction_road_name', 'Easting', 'Northing', 'Latitude',
       'Longitude', 'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', 'Unnamed: 35', '12:00-1:00 AM', '1:00-2:00AM',
       '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM',
       '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM',
       '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM',
       '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM',
       '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM',
       '10:00-11:00PM', '11:00-12:00AM', 'Unnamed: 60'],
      dtype='object')

Data Preprocessing

Data Exploration

#printing data shape


print("Dimensions of the dataset:", df.shape)

Dimensions of the dataset: (1048575, 61)

print(df.describe())

       Count_point_id          Year          hour     Region_id  \
count    1.048575e+06  1.048575e+06  1.048575e+06  1.048575e+06
mean     5.995700e+05  2.011974e+03  1.249989e+01  6.305353e+00
std      4.394447e+05  3.490533e+00  3.452043e+00  2.997732e+00
min      5.300000e+01  2.001000e+03  7.000000e+00  1.000000e+00
25%      5.607700e+04  2.009000e+03  9.000000e+00  4.000000e+00
50%      9.407220e+05  2.010000e+03  1.200000e+01  7.000000e+00
75%      9.462220e+05  2.015000e+03  1.500000e+01  9.000000e+00
max      9.999990e+05  2.021000e+03  1.800000e+01  1.100000e+01

       Local_authority_id       Easting      Northing      Latitude  \
count        1.048575e+06  1.048575e+06  1.048575e+06  1.048575e+06
mean         1.028254e+02  4.350583e+05  3.003754e+05  5.259198e+01
std          5.172886e+01  9.336256e+04  1.589562e+05  1.431117e+00
min          1.000000e+00  7.040600e+04  1.077500e+04  4.991714e+01
25%          6.700000e+01  3.720730e+05  1.786700e+05  5.149455e+01
50%          9.700000e+01  4.350000e+05  2.724000e+05  5.234075e+01
75%          1.410000e+02  5.105410e+05  3.966100e+05  5.346388e+01
max          2.080000e+02  6.550000e+05  1.179870e+06  6.049980e+01

          Longitude  Link_length_km  ...   3:00-4:00PM   4:00-5:00PM  \
count  1.048575e+06   403584.000000  ...  15266.000000  15266.000000
mean  -1.500683e+00        3.094704  ...    669.589152    676.870693
std    1.371715e+00        3.589633  ...    798.586645    804.420051
min   -7.425717e+00        0.100000  ...      0.000000      0.000000
25%   -2.418145e+00        0.900000  ...    253.000000    255.000000
50%   -1.479373e+00        1.900000  ...    426.000000    432.000000
75%   -3.898619e-01        3.900000  ...    736.000000    745.000000
max    1.754553e+00       46.300000  ...   6016.000000   5923.000000

        5:00-6:00PM   6:00-7:00PM   7:00-8:00PM   8:00-9:00PM  9:00-10:00PM  \
count  15266.000000  15266.000000  15266.000000  15266.000000  15266.000000
mean     676.528560    643.128848    575.776104    498.165073    429.040875
std      799.064056    784.666243    737.540903    670.361583    600.447526
min        0.000000      0.000000      0.000000      0.000000      0.000000
25%      253.000000    230.000000    194.000000    159.000000    128.000000
50%      430.000000    400.000000    345.000000    288.000000    234.000000
75%      751.000000    704.000000    612.000000    516.000000    435.000000
max     6169.000000   5810.000000   5249.000000   5102.000000   4986.000000

       10:00-11:00PM  11:00-12:00AM  Unnamed: 60
count   15266.000000   15266.000000          0.0
mean      375.862243     315.701231          NaN
std       552.187778     483.493903          NaN
min         0.000000       0.000000          NaN
25%       103.000000      79.000000          NaN
50%       193.000000     154.000000          NaN
75%       373.000000     309.000000          NaN
max      4468.000000    4815.000000          NaN

[8 rows x 50 columns]

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Count_point_id 1048575 non-null int64
1 Direction_of_travel 1048575 non-null object
2 Year 1048575 non-null int64
3 Count_date 1048575 non-null object
4 hour 1048575 non-null int64
5 Region_id 1048575 non-null int64
6 Region_name 1048575 non-null object
7 Region_ons_code 1048575 non-null object
8 Local_authority_id 1048575 non-null int64
9 Local_authority_name 1048575 non-null object
10 Local_authority_code 1048575 non-null object
11 Road_name 1048575 non-null object
12 Road_category 1048575 non-null object
13 Road_type 1048575 non-null object
14 Start_junction_road_name 402756 non-null object
15 End_junction_road_name 402732 non-null object
16 Easting 1048575 non-null int64
17 Northing 1048575 non-null int64
18 Latitude 1048575 non-null float64
19 Longitude 1048575 non-null float64
20 Link_length_km 403584 non-null float64
21 Link_length_miles 403584 non-null float64
22 Pedal_cycles 1048575 non-null int64
23 Two_wheeled_motor_vehicles 1048575 non-null int64
24 Cars_and_taxis 1048575 non-null int64
25 Buses_and_coaches 1048575 non-null int64
26 LGVs 1048575 non-null int64
27 HGVs_2_rigid_axle 1048575 non-null int64
28 HGVs_3_rigid_axle 1048575 non-null int64
29 HGVs_4_or_more_rigid_axle 1048575 non-null int64
30 HGVs_3_or_4_articulated_axle 1048575 non-null int64
31 HGVs_5_articulated_axle 1048575 non-null int64
32 HGVs_6_articulated_axle 1048575 non-null int64
33 All_HGVs 1048575 non-null int64
34 All_motor_vehicles 1048575 non-null int64
35 Unnamed: 35 0 non-null float64
36 12:00-1:00 AM 15519 non-null float64
37 1:00-2:00AM 15519 non-null float64
38 2:00-3:00AM 15519 non-null float64
39 3:00-4:00AM 15519 non-null float64
40 4:00-5:00AM 15519 non-null float64
41 5:00-6:00AM 15519 non-null float64
42 6:00-7:00AM 15519 non-null float64
43 7:00-8:00AM 15519 non-null float64
44 8:00-9:00AM 15519 non-null float64
45 9:00-10:00AM 15519 non-null float64
46 10:00-11:00AM 15519 non-null float64
47 11:00-12:00PM 15519 non-null float64
48 12:00-1:00PM 15266 non-null float64
49 1:00-2:00PM 15266 non-null float64
50 2:00-3:00PM 15266 non-null float64
51 3:00-4:00PM 15266 non-null float64
52 4:00-5:00PM 15266 non-null float64
53 5:00-6:00PM 15266 non-null float64
54 6:00-7:00PM 15266 non-null float64
55 7:00-8:00PM 15266 non-null float64
56 8:00-9:00PM 15266 non-null float64
57 9:00-10:00PM 15266 non-null float64
58 10:00-11:00PM 15266 non-null float64
59 11:00-12:00AM 15266 non-null float64
60 Unnamed: 60 0 non-null float64
dtypes: float64(30), int64(20), object(11)
memory usage: 488.0+ MB
None

Missing Values

# Identify missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

Missing values:
Count_point_id 0
Direction_of_travel 0
Year 0
Count_date 0
hour 0
...
8:00-9:00PM 1033309
9:00-10:00PM 1033309
10:00-11:00PM 1033309
11:00-12:00AM 1033309
Unnamed: 60 1048575
Length: 61, dtype: int64

import matplotlib.pyplot as plt


import seaborn as sns

# Create a heatmap of missing values


plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
missing_percentage = (df.isnull().sum() / len(df)) * 100

plt.show()
# Get columns with any missing values
columns_with_missing_values = df.columns[df.isnull().any()]

# Display the columns with missing values


print("Columns with missing values:")
print(columns_with_missing_values)

Columns with missing values:


Index(['Start_junction_road_name', 'End_junction_road_name', 'Link_length_km',
       'Link_length_miles', 'Unnamed: 35', '12:00-1:00 AM', '1:00-2:00AM',
       '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM',
       '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM',
       '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM',
       '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM',
       '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM',
       '10:00-11:00PM', '11:00-12:00AM', 'Unnamed: 60'],
      dtype='object')

Feature Selection

columns_to_drop = ['Unnamed: 60','Unnamed: 35']
df.drop(columns=columns_to_drop, inplace=True)

# For continuous numerical columns like 'Link_length_km' and 'Link_length_miles',
# we can impute the missing values with the mean or median of the column.
# Replace missing values in numerical columns with the mean
cont_var = ['Link_length_km', 'Link_length_miles', '12:00-1:00 AM']
df[cont_var] = df[cont_var].fillna(df[cont_var].mean())

columns_to_drop = ['Start_junction_road_name', 'End_junction_road_name']
df.drop(columns=columns_to_drop, inplace=True)

columns_with_missing_values = df.columns[df.isnull().any()]

# Display the columns with missing values


print("Columns with missing values:")
print(columns_with_missing_values)

Columns with missing values:


Index(['1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM',
       '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM',
       '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM',
       '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM',
       '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM',
       '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM'],
      dtype='object')

# List of columns with missing values
columns_with_missing_values = ['1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM',
                               '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM',
                               '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM',
                               '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM',
                               '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM',
                               '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM']

# Forward fill missing values
df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='ffill')

# Backward fill missing values
df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='bfill')

C:\Users\lariy\AppData\Local\Temp\ipykernel_19624\1721878666.py:10: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
  df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='ffill')
C:\Users\lariy\AppData\Local\Temp\ipykernel_19624\1721878666.py:13: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
  df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='bfill')
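As the FutureWarning advises, the non-deprecated spelling chains `.ffill()` and `.bfill()` directly. A minimal sketch on a toy Series:

```python
import pandas as pd

# Forward fill propagates the last seen value; backward fill then covers
# any leading gap that forward fill could not reach.
s = pd.Series([None, 1.0, None, 3.0, None])
filled = s.ffill().bfill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 3.0, 3.0]
```

The same chain works column-wise on a DataFrame slice, so `df[columns_with_missing_values].ffill().bfill()` replaces both deprecated calls above in one step.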

columns_with_missing_values = df.columns[df.isnull().any()]

# Display the columns with missing values


print("Columns with missing values:")
print(columns_with_missing_values)

Columns with missing values:


Index([], dtype='object')

df.columns

Index(['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'hour',
       'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id',
       'Local_authority_name', 'Local_authority_code', 'Road_name',
       'Road_category', 'Road_type', 'Easting', 'Northing', 'Latitude',
       'Longitude', 'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM',
       '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM',
       '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM',
       '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM',
       '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM',
       '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM',
       '11:00-12:00AM'],
      dtype='object')

coldrop1 = ['Local_authority_name', 'Local_authority_code']
df.drop(columns=coldrop1, inplace=True)

coldrop = ['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id']
df.drop(columns=coldrop, inplace=True)
coldrop1 = ['Local_authority_name', 'Local_authority_code']
df.drop(columns=coldrop1, inplace=True)

----------------------------------------------------------------------
-----
KeyError Traceback (most recent call
last)
Cell In[23], line 4
2 df.drop(columns=coldrop, inplace=True)
3 coldrop1=['Local_authority_name', 'Local_authority_code']
----> 4 df.drop(columns=coldrop1, inplace=True)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\frame.py:5344, in DataFrame.drop(self, labels, axis,
index, columns, level, inplace, errors)
5196 def drop(
5197 self,
5198 labels: IndexLabel | None = None,
(...)
5205 errors: IgnoreRaise = "raise",
5206 ) -> DataFrame | None:
5207 """
5208 Drop specified labels from rows or columns.
5209
(...)
5342 weight 1.0 0.8
5343 """
-> 5344 return super().drop(
5345 labels=labels,
5346 axis=axis,
5347 index=index,
5348 columns=columns,
5349 level=level,
5350 inplace=inplace,
5351 errors=errors,
5352 )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\generic.py:4711, in NDFrame.drop(self, labels, axis,
index, columns, level, inplace, errors)
4709 for axis, labels in axes.items():
4710 if labels is not None:
-> 4711 obj = obj._drop_axis(labels, axis, level=level,
errors=errors)
4713 if inplace:
4714 self._update_inplace(obj)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\generic.py:4753, in NDFrame._drop_axis(self, labels, axis,
level, errors, only_slice)
4751 new_axis = axis.drop(labels, level=level,
errors=errors)
4752 else:
-> 4753 new_axis = axis.drop(labels, errors=errors)
4754 indexer = axis.get_indexer(new_axis)
4756 # Case for non-unique axis
4757 else:

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\indexes\base.py:6992, in Index.drop(self, labels, errors)
6990 if mask.any():
6991 if errors != "ignore":
-> 6992 raise KeyError(f"{labels[mask].tolist()} not found in
axis")
6993 indexer = indexer[~mask]
6994 return self.delete(indexer)

KeyError: "['Local_authority_name', 'Local_authority_code'] not found in axis"
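The KeyError above occurs because this cell drops 'Local_authority_name' and 'Local_authority_code' a second time: they were already removed in the previous cell. pandas' drop accepts errors='ignore', which makes a repeated drop a no-op; a small sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame (column names are illustrative)
df_demo = pd.DataFrame({"a": [1], "b": [2]})
df_demo.drop(columns=["b"], inplace=True)

# Dropping 'b' again would raise KeyError; errors="ignore" skips missing labels
df_demo.drop(columns=["b"], inplace=True, errors="ignore")
print(list(df_demo.columns))  # ['a']
```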

coldrop2=['Road_name','Road_category']
df.drop(columns=coldrop2, inplace=True)

df.columns

Index(['hour', 'Road_type', 'Easting', 'Northing', 'Latitude', 'Longitude',
       'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM',
       '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM',
       '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM',
       '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM',
       '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM',
       '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM',
       '11:00-12:00AM'],
      dtype='object')

import pandas as pd

df.to_excel('data_revised.xlsx', index=False)

# Detect outliers using box plot


import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df['hour'])
plt.title('Box plot to detect outliers')
plt.show()

-Feature Engineering
Encoding
There is one categorical column, Road_type, with the values 'Major' and 'Minor'. We convert it to
numerical form, either by creating dummies with One-Hot Encoding or by Label Encoding.
import pandas as pd

# Assume df is your DataFrame containing the 'Road_type' column

# One-Hot Encoding
one_hot_encoded = pd.get_dummies(df['Road_type'], prefix='Road_type')

# Label Encoding
label_encoded = df['Road_type'].astype('category').cat.codes

# Replace the original 'Road_type' column with the encoded values
df['Road_type_encoded'] = label_encoded  # or use the one-hot encoded DataFrame instead

#checking the results of encoding


print(df['Road_type_encoded'])

print(df.head(5))

0 0
1 0
2 0
3 0
4 0
..
1048570 1
1048571 1
1048572 1
1048573 1
1048574 1
Name: Road_type_encoded, Length: 1048575, dtype: int8
hour Road_type Easting Northing Latitude Longitude
Link_length_km \
0 7 Major 243900 635900 55.591636 -4.478606
4.6
1 8 Major 243900 635900 55.591636 -4.478606
4.6
2 9 Major 243900 635900 55.591636 -4.478606
4.6
3 10 Major 243900 635900 55.591636 -4.478606
4.6
4 11 Major 243900 635900 55.591636 -4.478606
4.6

Link_length_miles Pedal_cycles Two_wheeled_motor_vehicles ... \


0 2.86 0 2 ...
1 2.86 0 5 ...
2 2.86 0 2 ...
3 2.86 0 1 ...
4 2.86 0 2 ...
3:00-4:00PM 4:00-5:00PM 5:00-6:00PM 6:00-7:00PM 7:00-8:00PM \
0 105.0 147.0 120.0 91.0 83.0
1 98.0 133.0 131.0 95.0 73.0
2 115.0 130.0 143.0 106.0 89.0
3 127.0 122.0 144.0 122.0 76.0
4 126.0 133.0 135.0 102.0 106.0

8:00-9:00PM 9:00-10:00PM 10:00-11:00PM 11:00-12:00AM


Road_type_encoded
0 74.0 49.0 42.0 42.0
0
1 70.0 63.0 42.0 35.0
0
2 68.0 64.0 56.0 43.0
0
3 64.0 58.0 64.0 43.0
0
4 58.0 58.0 55.0 54.0
0

[5 rows x 46 columns]

With 'Road_type_encoded' created, we can drop the original Road_type column, as it adds no
further value to the data.

df.drop(columns=['Road_type'], inplace=True)

df.columns

Index(['hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
       'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM',
       '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM',
       '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM',
       '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM',
       '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM',
       '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM',
       '11:00-12:00AM', 'Road_type_encoded'],
      dtype='object')

Adding a feature
# Define the columns to be used for calculating traffic_rate
columns_to_use = ['hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
                  'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
                  'Two_wheeled_motor_vehicles', 'Cars_and_taxis',
                  'Buses_and_coaches', 'LGVs', 'HGVs_2_rigid_axle',
                  'HGVs_3_rigid_axle', 'HGVs_4_or_more_rigid_axle',
                  'HGVs_3_or_4_articulated_axle', 'HGVs_5_articulated_axle',
                  'HGVs_6_articulated_axle', 'All_HGVs', 'All_motor_vehicles',
                  '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM',
                  '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM',
                  '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM',
                  '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM',
                  '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM',
                  '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM',
                  'Road_type_encoded']

# Calculate the total sum of all columns


total_sum = df[columns_to_use].sum(axis=1)

# Row-wise self-weighted average: each value is weighted by itself,
# i.e. the sum of squares divided by the row sum
weighted_average = (df[columns_to_use].mul(df[columns_to_use], axis=0)).sum(axis=1) / total_sum

# Express the weighted average relative to the row total, as a percentage
traffic_rate = (weighted_average / total_sum) * 100

# Add the 'traffic_rate' column to the DataFrame


df['traffic_rate'] = traffic_rate

# Display the DataFrame with the new column


print(df)

hour Easting Northing Latitude Longitude Link_length_km


\
0 7 243900 635900 55.591636 -4.478606 4.600000

1 8 243900 635900 55.591636 -4.478606 4.600000

2 9 243900 635900 55.591636 -4.478606 4.600000


3 10 243900 635900 55.591636 -4.478606 4.600000

4 11 243900 635900 55.591636 -4.478606 4.600000

... ... ... ... ... ... ...

1048570 8 628200 234310 51.960413 1.320092 3.094704

1048571 9 628200 234310 51.960413 1.320092 3.094704

1048572 10 628200 234310 51.960413 1.320092 3.094704

1048573 11 628200 234310 51.960413 1.320092 3.094704

1048574 12 628200 234310 51.960413 1.320092 3.094704

Link_length_miles Pedal_cycles
Two_wheeled_motor_vehicles \
0 2.860000 0 2

1 2.860000 0 5

2 2.860000 0 2

3 2.860000 0 1

4 2.860000 0 2

... ... ... ...

1048570 1.922684 0 1

1048571 1.922684 1 2

1048572 1.922684 1 1

1048573 1.922684 1 3

1048574 1.922684 0 2

Cars_and_taxis ... 4:00-5:00PM 5:00-6:00PM 6:00-7:00PM \


0 845 ... 147.0 120.0 91.0
1 908 ... 133.0 131.0 95.0
2 595 ... 130.0 143.0 106.0
3 590 ... 122.0 144.0 122.0
4 695 ... 133.0 135.0 102.0
... ... ... ... ... ...
1048570 28 ... 84.0 88.0 66.0
1048571 24 ... 84.0 88.0 66.0
1048572 28 ... 84.0 88.0 66.0
1048573 31 ... 84.0 88.0 66.0
1048574 81 ... 84.0 88.0 66.0

7:00-8:00PM 8:00-9:00PM 9:00-10:00PM 10:00-11:00PM 11:00-


12:00AM \
0 83.0 74.0 49.0 42.0
42.0
1 73.0 70.0 63.0 42.0
35.0
2 89.0 68.0 64.0 56.0
43.0
3 76.0 64.0 58.0 64.0
43.0
4 106.0 58.0 58.0 55.0
54.0
... ... ... ... ...
...
1048570 84.0 66.0 93.0 63.0
82.0
1048571 84.0 66.0 93.0 63.0
82.0
1048572 84.0 66.0 93.0 63.0
82.0
1048573 84.0 66.0 93.0 63.0
82.0
1048574 84.0 66.0 93.0 63.0
82.0

Road_type_encoded traffic_rate
0 0 59.450027
1 0 59.415506
2 0 59.484287
3 0 59.482806
4 0 59.443662
... ... ...
1048570 1 60.137671
1048571 1 60.144764
1048572 1 60.149633
1048573 1 60.147685
1048574 1 60.123348

[1048575 rows x 46 columns]
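Per row, the computation above reduces to sum(x²) / (sum(x))² × 100, since each value is weighted by itself and the result is divided by the row total twice. A tiny check, using hypothetical column names rather than the real dataset:

```python
import pandas as pd

# Two illustrative columns (names are hypothetical, not the real dataset)
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

total = toy.sum(axis=1)                     # row sums: 4.0 and 6.0
weighted = (toy * toy).sum(axis=1) / total  # sum of squares over the row sum
rate = weighted / total * 100               # sum(x^2) / (sum(x))^2 * 100

# Row 0: (1 + 9) / 16 * 100 = 62.5
print(rate.tolist())
```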

# 'Road_type' was already dropped in the encoding step; errors='ignore' avoids a KeyError on re-run
df.drop(columns=['Road_type'], inplace=True, errors='ignore')

df.columns

Index(['hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
       'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM',
       '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM',
       '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM',
       '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM',
       '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM',
       '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM',
       '11:00-12:00AM', 'Road_type_encoded', 'traffic_rate'],
      dtype='object')

df.dtypes

hour int64
Easting int64
Northing int64
Latitude float64
Longitude float64
Link_length_km float64
Link_length_miles float64
Pedal_cycles int64
Two_wheeled_motor_vehicles int64
Cars_and_taxis int64
Buses_and_coaches int64
LGVs int64
HGVs_2_rigid_axle int64
HGVs_3_rigid_axle int64
HGVs_4_or_more_rigid_axle int64
HGVs_3_or_4_articulated_axle int64
HGVs_5_articulated_axle int64
HGVs_6_articulated_axle int64
All_HGVs int64
All_motor_vehicles int64
12:00-1:00 AM float64
1:00-2:00AM float64
2:00-3:00AM float64
3:00-4:00AM float64
4:00-5:00AM float64
5:00-6:00AM float64
6:00-7:00AM float64
7:00-8:00AM float64
8:00-9:00AM float64
9:00-10:00AM float64
10:00-11:00AM float64
11:00-12:00PM float64
12:00-1:00PM float64
1:00-2:00PM float64
2:00-3:00PM float64
3:00-4:00PM float64
4:00-5:00PM float64
5:00-6:00PM float64
6:00-7:00PM float64
7:00-8:00PM float64
8:00-9:00PM float64
9:00-10:00PM float64
10:00-11:00PM float64
11:00-12:00AM float64
Road_type_encoded int8
traffic_rate float64
dtype: object

-Outliers detection
import pandas as pd

columns_of_interest = ['hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
                       'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
                       'Two_wheeled_motor_vehicles', 'Cars_and_taxis',
                       'Buses_and_coaches', 'LGVs', 'HGVs_2_rigid_axle',
                       'HGVs_3_rigid_axle', 'HGVs_4_or_more_rigid_axle',
                       'HGVs_3_or_4_articulated_axle', 'HGVs_5_articulated_axle',
                       'HGVs_6_articulated_axle', 'All_HGVs',
                       'All_motor_vehicles', 'traffic_rate']

# threshold for identifying outliers


threshold = 1.5

def count_outliers(column):
q1 = column.quantile(0.25)
q3 = column.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr
outliers_lower = column[column < lower_bound]
outliers_upper = column[column > upper_bound]
return len(outliers_lower), len(outliers_upper)

# Create a dictionary to store the outlier counts for each column


outlier_counts = {}
# Iterate over the columns of interest and count outliers
for col in columns_of_interest:
outliers_lower, outliers_upper = count_outliers(df[col])
outlier_counts[col] = {'lower_outliers': outliers_lower,
'upper_outliers': outliers_upper}

# Convert the dictionary to a DataFrame for easier visualization


outlier_counts_df = pd.DataFrame(outlier_counts).T

# Display the outlier counts for each column


print(outlier_counts_df)

lower_outliers upper_outliers
hour 0 0
Easting 2208 0
Northing 0 16830
Latitude 0 16518
Longitude 2160 0
Link_length_km 235872 110328
Link_length_miles 235872 110328
Pedal_cycles 0 110674
Two_wheeled_motor_vehicles 0 94091
Cars_and_taxis 0 96771
Buses_and_coaches 0 87315
LGVs 0 109842
HGVs_2_rigid_axle 0 127022
HGVs_3_rigid_axle 0 130470
HGVs_4_or_more_rigid_axle 0 143570
HGVs_3_or_4_articulated_axle 0 155028
HGVs_5_articulated_axle 0 169182
HGVs_6_articulated_axle 0 170369
All_HGVs 0 149406
All_motor_vehicles 0 100972
traffic_rate 0 705
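The IQR rule that count_outliers applies can be sanity-checked on a toy column (quantiles use pandas' default linear interpolation):

```python
import pandas as pd

# One obvious outlier among otherwise small values
col = pd.Series([1, 2, 3, 4, 5, 100])

q1, q3 = col.quantile(0.25), col.quantile(0.75)  # 2.25 and 4.75 here
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # bounds [-1.5, 8.5]
print(int(((col < lo) | (col > hi)).sum()))      # only 100 falls outside
```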

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df is your DataFrame containing the data


# Select the columns where you want to detect outliers
columns_of_interest = ['hour', 'Easting', 'Northing', 'Latitude',
'Longitude',
'Link_length_km', 'Link_length_miles',
'Pedal_cycles',
'Two_wheeled_motor_vehicles', 'Cars_and_taxis',
'Buses_and_coaches',
'LGVs', 'HGVs_2_rigid_axle',
'HGVs_3_rigid_axle',
'HGVs_4_or_more_rigid_axle',
'HGVs_3_or_4_articulated_axle',
'HGVs_5_articulated_axle',
'HGVs_6_articulated_axle', 'All_HGVs',
'All_motor_vehicles', 'traffic_rate']

# Create a boxplot for each selected column


plt.figure(figsize=(14, 8))
sns.boxplot(data=df[columns_of_interest])
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.title('Boxplot of Selected Columns to Detect Outliers')
plt.xlabel('Columns')
plt.ylabel('Values')
plt.show()

#removing outliers
import numpy as np
from scipy import stats

# Define a function to remove outliers using Z-score


def remove_outliers_zscore(df, cols):
for col in cols:
z_scores = np.abs(stats.zscore(df[col]))
df = df[(z_scores < 3)]
return df
#columns with outliers
cols_with_outliers = ['Easting', 'Northing', 'Latitude', 'Longitude',
'Link_length_km', 'Link_length_miles',
'Pedal_cycles', 'Two_wheeled_motor_vehicles',
'Cars_and_taxis', 'Buses_and_coaches',
'LGVs', 'HGVs_2_rigid_axle',
'HGVs_3_rigid_axle', 'HGVs_4_or_more_rigid_axle',
'HGVs_3_or_4_articulated_axle',
'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle',
'All_HGVs', 'All_motor_vehicles',
'traffic_rate']

# removing outliers using Z-score method


df_cleaned = remove_outliers_zscore(df, cols_with_outliers)

print("Original DataFrame shape:", df.shape)


print("DataFrame shape after removing outliers:", df_cleaned.shape)

Original DataFrame shape: (1048575, 46)


DataFrame shape after removing outliers: (759978, 46)

As we can observe, the original DataFrame had 1,048,575 rows and 46 columns. After removing
outliers with the Z-score method, the data contains 759,978 rows and 46 columns. The reduction
in row count shows that some data points were identified as outliers and removed, based on the
chosen detection method and threshold (|z| < 3).
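The |z| < 3 filter can be illustrated with plain NumPy, mirroring stats.zscore (which uses the population standard deviation, ddof=0), on a synthetic column:

```python
import numpy as np

# 99 typical values and one extreme point
x = np.array([10.0] * 99 + [1000.0])
z = np.abs((x - x.mean()) / x.std())  # equivalent to abs(scipy.stats.zscore(x))

# Only the extreme value exceeds the |z| = 3 threshold
print(int((z >= 3).sum()))
```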

-Scaling Transformation
from sklearn.preprocessing import MinMaxScaler

# Select numerical columns for scaling


numerical_cols = ['Link_length_km', 'Link_length_miles',
'Pedal_cycles', 'Two_wheeled_motor_vehicles',
'Cars_and_taxis', 'traffic_rate']

# Initialize the MinMaxScaler


scaler = MinMaxScaler()

# Fit and transform the numerical columns


df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
scaled_data = df[numerical_cols]

print(scaled_data)
Link_length_km Link_length_miles Pedal_cycles \
0 0.097403 0.097527 0.000000
1 0.097403 0.097527 0.000000
2 0.097403 0.097527 0.000000
3 0.097403 0.097527 0.000000
4 0.097403 0.097527 0.000000
... ... ... ...
1048570 0.064820 0.064879 0.000000
1048571 0.064820 0.064879 0.000453
1048572 0.064820 0.064879 0.000453
1048573 0.064820 0.064879 0.000453
1048574 0.064820 0.064879 0.000000

Two_wheeled_motor_vehicles Cars_and_taxis traffic_rate


0 0.002604 0.094382 0.465707
1 0.006510 0.101419 0.464980
2 0.002604 0.066458 0.466428
3 0.001302 0.065900 0.466397
4 0.002604 0.077628 0.465573
... ... ... ...
1048570 0.001302 0.003127 0.480189
1048571 0.002604 0.002681 0.480338
1048572 0.001302 0.003127 0.480441
1048573 0.003906 0.003463 0.480400
1048574 0.002604 0.009047 0.479887

[1048575 rows x 6 columns]
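MinMaxScaler maps each column through (x - min) / (max - min), so the smallest value becomes 0 and the largest 1; a quick hand check of that formula:

```python
import numpy as np

x = np.array([2.0, 4.0, 10.0])
scaled = (x - x.min()) / (x.max() - x.min())  # MinMaxScaler's per-column formula
print(scaled.tolist())  # [0.0, 0.25, 1.0]
```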

-Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plots of numerical features against the target variable 'traffic_rate'
plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical_cols):
plt.subplot(3, 2, i+1)
sns.scatterplot(data=df, x=col, y='traffic_rate')
plt.title(f'Scatter Plot: {col} vs traffic_rate')
plt.xlabel(col)
plt.ylabel('traffic_rate')
plt.tight_layout()
plt.show()
# Bar chart of categorical feature 'Road_type_encoded' against the target variable 'traffic_rate'
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='Road_type_encoded', y='traffic_rate')
plt.title('Bar Chart: Road_type_encoded vs traffic_rate')
plt.xlabel('Road_type_encoded')
plt.ylabel('traffic_rate')
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix for the scaled numerical columns
correlation_matrix = df[numerical_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histograms for numerical variables


plt.figure(figsize=(15, 10))
for i, col in enumerate(df.columns):
if df[col].dtype in ['int64', 'float64']:
plt.subplot(4, 3, i + 1)
sns.histplot(df[col], kde=True)
plt.title(f'Histogram of {col}')
plt.xlabel(col)
plt.tight_layout()
plt.show()

# Plot count plots for categorical variables


plt.figure(figsize=(15, 10))
for i, col in enumerate(df.columns):
if df[col].dtype == 'object':
plt.subplot(4, 3, i + 1)
sns.countplot(data=df, x=col)
plt.title(f'Count Plot of {col}')
plt.xlabel(col)
plt.tight_layout()
plt.show()

----------------------------------------------------------------------
-----
ValueError Traceback (most recent call
last)
Cell In[54], line 8
6 for i, col in enumerate(df.columns):
7 if df[col].dtype in ['int64', 'float64']:
----> 8 plt.subplot(4, 3, i + 1)
9 sns.histplot(df[col], kde=True)
10 plt.title(f'Histogram of {col}')

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)
1422 fig = gcf()
1424 # First, search for an existing subplot with a matching spec.
-> 1425 key = SubplotSpec._from_subplot_args(fig, args)
1427 for ax in fig.axes:
1428 # If we found an Axes at the position, we can re-use it if
the user passed no
1429 # kwargs or if the axes class and kwargs are identical.
1430 if (ax.get_subplotspec() == key
1431 and (kwargs == {}
1432 or (ax._projection_init
1433 ==
fig._process_projection_requirements(**kwargs)))):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure,
args)
597 else:
598 if not isinstance(num, Integral) or num < 1 or num >
rows*cols:
--> 599 raise ValueError(
600 f"num must be an integer with 1 <= num <=
{rows*cols}, "
601 f"not {num!r}"
602 )
603 i = j = num
604 return gs[i-1:j]

ValueError: num must be an integer with 1 <= num <= 12, not 13
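The ValueError arises because the loop indexes the fixed 4x3 grid with the column's position in df.columns (up to 46) rather than a count of panels actually drawn. One fix, sketched here on a synthetic frame with the plotting calls omitted, is to collect the numeric columns first and size the grid to their count:

```python
import math

import pandas as pd

# Synthetic stand-in for df: 14 numeric (float64) columns
df = pd.DataFrame({f"c{i}": [0.0, 1.0] for i in range(14)})

# Collect numeric columns first, then size the grid to fit them all
num_cols = [c for c in df.columns if str(df[c].dtype) in ("int64", "float64")]
rows = math.ceil(len(num_cols) / 3)

# Every plt.subplot(rows, 3, i + 1) index now satisfies 1 <= i + 1 <= rows * 3
assert all(i + 1 <= rows * 3 for i in range(len(num_cols)))
print(rows)  # 5 rows for 14 columns
```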
-Model Building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

#Defining target(dependent variable) and independent variable


X = df.drop(columns=['traffic_rate'])
y = df['traffic_rate']

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()
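train_test_split with test_size=0.2 holds out 20% of the rows; on a toy array of 10 samples that is an 8/2 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 8 2
```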

# Make predictions
y_pred = model.predict(X_test)

from sklearn.metrics import mean_squared_error, r2_score


import numpy as np
# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Root Mean Squared Error (RMSE)


rmse = np.sqrt(mse)

# R-squared
r2 = r2_score(y_test, y_pred)

# R-squared to percentage
lin_reg = r2 * 100

# The regression evaluation metrics


print("Mean Squared Error (MSE): {:.2f}".format(mse))
print("Root Mean Squared Error (RMSE): {:.2f}".format(rmse))
print("R-squared (R2): {:.2f}%".format(lin_reg))

Mean Squared Error (MSE): 0.01
Root Mean Squared Error (RMSE): 0.08
R-squared (R2): 66.59%
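RMSE is simply the square root of MSE, which restores the target's original units; with an illustrative unrounded MSE of 0.0064 (the notebook's printout is rounded), the RMSE works out to 0.08:

```python
import math

# RMSE is the square root of MSE (0.0064 is an illustrative value)
mse = 0.0064
print(round(math.sqrt(mse), 2))  # 0.08
```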

-Model Tuning
For the linear regression model, we can tune further with Lasso or Ridge regression.
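As a companion to the Ridge search below, Lasso adds an L1 penalty that can zero out irrelevant coefficients entirely; a small synthetic sketch (the data and alpha are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression: y depends only on the first of three features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# The L1 penalty shrinks the useful coefficient and zeroes the irrelevant ones
lasso = Lasso(alpha=0.5).fit(X, y)
print(np.round(lasso.coef_, 2))
```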

#Let's run Ridge Regression- Grid search

from sklearn.model_selection import GridSearchCV


from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define the model


model = Ridge()

# Define the hyperparameters grid


param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]  # example values for alpha
}

# Initialize Grid Search Cross Validation


grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=5, scoring='neg_mean_squared_error')
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters


best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Get the best model


best_model = grid_search.best_estimator_

# Evaluate the best model


y_pred_best = best_model.predict(X_test)
mse_best = mean_squared_error(y_test, y_pred_best)
rmse_best = np.sqrt(mse_best)
r2_grid = r2_score(y_test, y_pred_best)

# Calculate the range of the target variable


y_range = y_test.max() - y_test.min()

# Express MSE and RMSE relative to the target's range
# (note: MSE is in squared units, so this "percentage" is only indicative)
mse_percentage = (mse_best / y_range) * 100
rmse_percentage = (rmse_best / y_range) * 100

# Print results
print("Mean Squared Error (MSE) - Best Model: {:.2f}%".format(mse_percentage))
print("Root Mean Squared Error (RMSE) - Best Model: {:.2f}%".format(rmse_percentage))
print("R-squared (R2) - Best Model: {:.2f}%".format(r2_grid * 100))

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-conditioned matrix (rcond=4.51559e-18): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T

Best Hyperparameters: {'alpha': 0.1}


Mean Squared Error (MSE) - Best Model: 0.61%
Root Mean Squared Error (RMSE) - Best Model: 7.82%
R-squared (R2) - Best Model: 66.59%


from sklearn.model_selection import RandomizedSearchCV


from sklearn.linear_model import Ridge
from scipy.stats import uniform
from sklearn.metrics import r2_score

# Define the model


model = Ridge()

# Define the hyperparameters grid


param_distribution = {
    'alpha': uniform(loc=0, scale=10)  # example distribution for alpha
}

# Initialize Random Search Cross Validation


random_search = RandomizedSearchCV(estimator=model,
                                   param_distributions=param_distribution,
                                   n_iter=10, cv=5,
                                   scoring='neg_mean_squared_error',
                                   random_state=42)

# Fit the random search to the data


random_search.fit(X_train, y_train)

# Get the best hyperparameters


best_params_random = random_search.best_params_
print("Best Hyperparameters (Random Search):", best_params_random)

# Get the best model


best_model = random_search.best_estimator_
# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Calculate R-squared (R2) score


r2_random = r2_score(y_test, y_pred)

# Print the R2 score


print("R-squared (R2) - Best Model: {:.2f}%".format(r2_random * 100))

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-conditioned matrix (rcond=7.04425e-17): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=7.03766e-17): result may not be accurate.
return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=7.03707e-17): result may not be accurate.
return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=7.03109e-17): result may not be accurate.
return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=7.02953e-17): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
(The same LinAlgWarning is emitted once per candidate fit during the search, with slightly different rcond values; repeats omitted.)
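The LinAlgWarning above means the design matrix handed to Ridge's default Cholesky-based solver is nearly singular (for example, because some feature columns are almost collinear). One way to sidestep it is to request the more numerically stable SVD solver. This is a sketch on synthetic data, not the notebook's actual features:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
# Append a near-duplicate column to make the matrix ill-conditioned
X = np.hstack([X, X[:, :1] * (1 + 1e-10)])
y = X[:, 0] + 0.1 * rng.randn(100)

# solver='svd' avoids the linalg.solve call that raises LinAlgWarning
model = Ridge(alpha=0.5, solver='svd')
model.fit(X, y)
print(model.coef_)
```

Scaling the features (e.g. with StandardScaler) or dropping redundant columns also improves conditioning and usually silences the warning at its source.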

Best Hyperparameters (Random Search): {'alpha': 0.5808361216819946}


R-squared (R2) - Best Model: 66.59%
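The best alpha reported above comes from a randomized search over the regularization strength. A minimal, self-contained sketch of how such a search can be set up (the synthetic data and the uniform(0, 1) prior over alpha are assumptions; the notebook's X_train/y_train are not reproduced here):

```python
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([1.0, 0.5, -0.2, 0.3, 0.1]) + 0.01 * rng.randn(200)

# Sample 20 alpha values from Uniform(0, 1), score each with 5-fold CV
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={'alpha': uniform(0, 1)},
    n_iter=20,
    cv=5,
    scoring='r2',
    random_state=42,
)
search.fit(X, y)
print("Best Hyperparameters (Random Search):", search.best_params_)
```

With random_state fixed, the sampled alphas are reproducible, which is why the notebook's reported best alpha (0.5808...) is a specific draw rather than a round number.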

model.fit(X_train, y_train)

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=1.80398e-17): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T

Ridge(alpha=0.5)

# Plot actual vs predicted
plt.scatter(y_test, model.predict(X_test), color='blue',
            label='Actual vs Predicted')

# Plot the ideal-prediction diagonal
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--', label='Ideal Prediction')

plt.title('Actual vs Predicted Traffic Rate')
plt.xlabel('Actual Traffic Rate')
plt.ylabel('Predicted Traffic Rate')
plt.legend()
plt.show()

import pandas as pd

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Create a DataFrame with actual and predicted values
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

# Display the DataFrame
print(results_df)


Actual Predicted
781974 0.269387 0.276978
937737 0.385366 0.418029
907828 0.294860 0.343721
784628 0.255908 0.281116
662460 0.260028 0.274427
... ... ...
673443 0.643551 0.537741
656736 0.369448 0.429966
858501 0.437590 0.451252
617079 0.382499 0.283711
487559 0.297439 0.363648

[209715 rows x 2 columns]
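Beyond eyeballing the table, the same Actual/Predicted columns can be summarised with error metrics. A sketch using a few made-up rows in place of the full results_df:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual/predicted pairs standing in for the 209,715-row results_df
results_df = pd.DataFrame({
    'Actual':    [0.27, 0.39, 0.29, 0.26, 0.64],
    'Predicted': [0.28, 0.42, 0.34, 0.28, 0.54],
})

mae = mean_absolute_error(results_df['Actual'], results_df['Predicted'])
r2 = r2_score(results_df['Actual'], results_df['Predicted'])
print(f"MAE: {mae:.4f}, R2: {r2:.2%}")
```

MAE reports the average absolute prediction error in the same (normalised) units as the traffic rate, which complements the percentage-style R2 already reported above.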
