Cac2 22112338

Data Loading

import pandas as pd

data1 = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data1.csv')
data2 = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data2.csv')

print("This is 1st dataset",data1.head())


print("This is 2nd dataset",data2.head())

This is 1st dataset    Count_point_id Direction_of_travel  Year        Count_date  hour  \
0             749                   E  2014  25-06-2014 00:00     7
1             749                   E  2014  25-06-2014 00:00     8
2             749                   E  2014  25-06-2014 00:00     9
3             749                   E  2014  25-06-2014 00:00    10
4             749                   E  2014  25-06-2014 00:00    11

   Region_id Region_name Region_ons_code  Local_authority_id  \
0          3    Scotland       S92000003                  39
1          3    Scotland       S92000003                  39
2          3    Scotland       S92000003                  39
3          3    Scotland       S92000003                  39
4          3    Scotland       S92000003                  39

  Local_authority_name  ...  3:00-4:00PM  4:00-5:00PM  5:00-6:00PM  6:00-7:00PM  \
0        East Ayrshire  ...        105.0        147.0        120.0         91.0
1        East Ayrshire  ...         98.0        133.0        131.0         95.0
2        East Ayrshire  ...        115.0        130.0        143.0        106.0
3        East Ayrshire  ...        127.0        122.0        144.0        122.0
4        East Ayrshire  ...        126.0        133.0        135.0        102.0

   7:00-8:00PM  8:00-9:00PM  9:00-10:00PM  10:00-11:00PM  11:00-12:00AM  Unnamed: 60
0         83.0         74.0          49.0           42.0           42.0          NaN
1         73.0         70.0          63.0           42.0           35.0          NaN
2         89.0         68.0          64.0           56.0           43.0          NaN
3         76.0         64.0          58.0           64.0           43.0          NaN
4        106.0         58.0          58.0           55.0           54.0          NaN

[5 rows x 61 columns]
This is 2nd dataset    ID  SegmentID  Roadway Name         From                To Direction  \
0   1      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
1   2      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
2   3      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
3   4      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
4   5      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB

         Date  12:00-1:00 AM  1:00-2:00AM  2:00-3:00AM  ...  2:00-3:00PM  \
0  01-09-2012           20.0         10.0         11.0  ...        104.0
1  01-10-2012           21.0         16.0          8.0  ...        102.0
2  01-11-2012           27.0         14.0          6.0  ...        115.0
3  01-12-2012           22.0          7.0          7.0  ...         71.0
4  01/13/2012           31.0         17.0          7.0  ...        113.0

   3:00-4:00PM  4:00-5:00PM  5:00-6:00PM  6:00-7:00PM  7:00-8:00PM  \
0        105.0        147.0        120.0         91.0         83.0
1         98.0        133.0        131.0         95.0         73.0
2        115.0        130.0        143.0        106.0         89.0
3        127.0        122.0        144.0        122.0         76.0
4        126.0        133.0        135.0        102.0        106.0

   8:00-9:00PM  9:00-10:00PM  10:00-11:00PM  11:00-12:00AM
0         74.0          49.0           42.0           42.0
1         70.0          63.0           42.0           35.0
2         68.0          64.0           56.0           43.0
3         64.0          58.0           64.0           43.0
4         58.0          58.0           55.0           54.0

[5 rows x 31 columns]
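The doubled-backslash paths above are easy to get wrong; a `pathlib` variant (the directory name is taken from the snippet above and is specific to one machine, so the read is guarded) sidesteps the escaping entirely:

```python
from pathlib import Path

import pandas as pd

# Directory copied from the original snippet; adjust for your own machine.
base = Path(r"C:\Users\lariy\OneDrive\Documents\Machine_Learning SEM4")
csv_path = base / "data1.csv"  # "/" joins path components, no manual backslashes

# Guarded read so the sketch also runs where the file is absent.
if csv_path.exists():
    data1 = pd.read_csv(csv_path)
```

Raw strings (`r"..."`) remove the need for doubled backslashes, and `Path` objects render correctly on every OS.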


print(data1.shape)
print(data2.shape)

(1048575, 61)
(42756, 31)

print(data1.info())
print(data2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Count_point_id 1048575 non-null int64
1 Direction_of_travel 1048575 non-null object
2 Year 1048575 non-null int64
3 Count_date 1048575 non-null object
4 hour 1048575 non-null int64
5 Region_id 1048575 non-null int64
6 Region_name 1048575 non-null object
7 Region_ons_code 1048575 non-null object
8 Local_authority_id 1048575 non-null int64
9 Local_authority_name 1048575 non-null object
10 Local_authority_code 1048575 non-null object
11 Road_name 1048575 non-null object
12 Road_category 1048575 non-null object
13 Road_type 1048575 non-null object
14 Start_junction_road_name 402756 non-null object
15 End_junction_road_name 402732 non-null object
16 Easting 1048575 non-null int64
17 Northing 1048575 non-null int64
18 Latitude 1048575 non-null float64
19 Longitude 1048575 non-null float64
20 Link_length_km 403584 non-null float64
21 Link_length_miles 403584 non-null float64
22 Pedal_cycles 1048575 non-null int64
23 Two_wheeled_motor_vehicles 1048575 non-null int64
24 Cars_and_taxis 1048575 non-null int64
25 Buses_and_coaches 1048575 non-null int64
26 LGVs 1048575 non-null int64
27 HGVs_2_rigid_axle 1048575 non-null int64
28 HGVs_3_rigid_axle 1048575 non-null int64
29 HGVs_4_or_more_rigid_axle 1048575 non-null int64
30 HGVs_3_or_4_articulated_axle 1048575 non-null int64
31 HGVs_5_articulated_axle 1048575 non-null int64
32 HGVs_6_articulated_axle 1048575 non-null int64
33 All_HGVs 1048575 non-null int64
34 All_motor_vehicles 1048575 non-null int64
35 Unnamed: 35 0 non-null float64
36 12:00-1:00 AM 15519 non-null float64
37 1:00-2:00AM 15519 non-null float64
38 2:00-3:00AM 15519 non-null float64
39 3:00-4:00AM 15519 non-null float64
40 4:00-5:00AM 15519 non-null float64
41 5:00-6:00AM 15519 non-null float64
42 6:00-7:00AM 15519 non-null float64
43 7:00-8:00AM 15519 non-null float64
44 8:00-9:00AM 15519 non-null float64
45 9:00-10:00AM 15519 non-null float64
46 10:00-11:00AM 15519 non-null float64
47 11:00-12:00PM 15519 non-null float64
48 12:00-1:00PM 15266 non-null float64
49 1:00-2:00PM 15266 non-null float64
50 2:00-3:00PM 15266 non-null float64
51 3:00-4:00PM 15266 non-null float64
52 4:00-5:00PM 15266 non-null float64
53 5:00-6:00PM 15266 non-null float64
54 6:00-7:00PM 15266 non-null float64
55 7:00-8:00PM 15266 non-null float64
56 8:00-9:00PM 15266 non-null float64
57 9:00-10:00PM 15266 non-null float64
58 10:00-11:00PM 15266 non-null float64
59 11:00-12:00AM 15266 non-null float64
60 Unnamed: 60 0 non-null float64
dtypes: float64(30), int64(20), object(11)
memory usage: 488.0+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42756 entries, 0 to 42755
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 42756 non-null int64
1 SegmentID 42756 non-null int64
2 Roadway Name 42756 non-null object
3 From 42756 non-null object
4 To 42756 non-null object
5 Direction 42756 non-null object
6 Date 42756 non-null object
7 12:00-1:00 AM 42752 non-null float64
8 1:00-2:00AM 42752 non-null float64
9 2:00-3:00AM 42752 non-null float64
10 3:00-4:00AM 42752 non-null float64
11 4:00-5:00AM 42752 non-null float64
12 5:00-6:00AM 42752 non-null float64
13 6:00-7:00AM 42752 non-null float64
14 7:00-8:00AM 42752 non-null float64
15 8:00-9:00AM 42752 non-null float64
16 9:00-10:00AM 42752 non-null float64
17 10:00-11:00AM 42753 non-null float64
18 11:00-12:00PM 42755 non-null float64
19 12:00-1:00PM 42503 non-null float64
20 1:00-2:00PM 42503 non-null float64
21 2:00-3:00PM 42503 non-null float64
22 3:00-4:00PM 42503 non-null float64
23 4:00-5:00PM 42503 non-null float64
24 5:00-6:00PM 42503 non-null float64
25 6:00-7:00PM 42503 non-null float64
26 7:00-8:00PM 42503 non-null float64
27 8:00-9:00PM 42503 non-null float64
28 9:00-10:00PM 42503 non-null float64
29 10:00-11:00PM 42503 non-null float64
30 11:00-12:00AM 42503 non-null float64
dtypes: float64(24), int64(2), object(5)
memory usage: 10.1+ MB
None

print(data1.describe())
print(data2.describe())

       Count_point_id          Year          hour     Region_id  \
count    1.048575e+06  1.048575e+06  1.048575e+06  1.048575e+06
mean     5.995700e+05  2.011974e+03  1.249989e+01  6.305353e+00
std      4.394447e+05  3.490533e+00  3.452043e+00  2.997732e+00
min      5.300000e+01  2.001000e+03  7.000000e+00  1.000000e+00
25%      5.607700e+04  2.009000e+03  9.000000e+00  4.000000e+00
50%      9.407220e+05  2.010000e+03  1.200000e+01  7.000000e+00
75%      9.462220e+05  2.015000e+03  1.500000e+01  9.000000e+00
max      9.999990e+05  2.021000e+03  1.800000e+01  1.100000e+01

       Local_authority_id       Easting      Northing      Latitude  \
count        1.048575e+06  1.048575e+06  1.048575e+06  1.048575e+06
mean         1.028254e+02  4.350583e+05  3.003754e+05  5.259198e+01
std          5.172886e+01  9.336256e+04  1.589562e+05  1.431117e+00
min          1.000000e+00  7.040600e+04  1.077500e+04  4.991714e+01
25%          6.700000e+01  3.720730e+05  1.786700e+05  5.149455e+01
50%          9.700000e+01  4.350000e+05  2.724000e+05  5.234075e+01
75%          1.410000e+02  5.105410e+05  3.966100e+05  5.346388e+01
max          2.080000e+02  6.550000e+05  1.179870e+06  6.049980e+01

          Longitude  Link_length_km  ...   3:00-4:00PM   4:00-5:00PM  \
count  1.048575e+06   403584.000000  ...  15266.000000  15266.000000
mean  -1.500683e+00        3.094704  ...    669.589152    676.870693
std    1.371715e+00        3.589633  ...    798.586645    804.420051
min   -7.425717e+00        0.100000  ...      0.000000      0.000000
25%   -2.418145e+00        0.900000  ...    253.000000    255.000000
50%   -1.479373e+00        1.900000  ...    426.000000    432.000000
75%   -3.898619e-01        3.900000  ...    736.000000    745.000000
max    1.754553e+00       46.300000  ...   6016.000000   5923.000000

        5:00-6:00PM   6:00-7:00PM   7:00-8:00PM   8:00-9:00PM  9:00-10:00PM  \
count  15266.000000  15266.000000  15266.000000  15266.000000  15266.000000
mean     676.528560    643.128848    575.776104    498.165073    429.040875
std      799.064056    784.666243    737.540903    670.361583    600.447526
min        0.000000      0.000000      0.000000      0.000000      0.000000
25%      253.000000    230.000000    194.000000    159.000000    128.000000
50%      430.000000    400.000000    345.000000    288.000000    234.000000
75%      751.000000    704.000000    612.000000    516.000000    435.000000
max     6169.000000   5810.000000   5249.000000   5102.000000   4986.000000

       10:00-11:00PM  11:00-12:00AM  Unnamed: 60
count   15266.000000   15266.000000          0.0
mean      375.862243     315.701231          NaN
std       552.187778     483.493903          NaN
min         0.000000       0.000000          NaN
25%       103.000000      79.000000          NaN
50%       193.000000     154.000000          NaN
75%       373.000000     309.000000          NaN
max      4468.000000    4815.000000          NaN

[8 rows x 50 columns]
                ID     SegmentID  12:00-1:00 AM   1:00-2:00AM   2:00-3:00AM  \
count  42756.000000  4.275600e+04   42752.000000  42752.000000  42752.000000
mean     302.926841  4.988159e+05     251.448423    178.591504    135.280318
std      504.422798  1.875303e+06     407.435712    303.030296    242.091877
min        1.000000  2.020000e+02       0.000000      0.000000      0.000000
25%       95.000000  3.402700e+04      60.000000     38.000000     26.000000
50%      193.000000  7.534300e+04     118.000000     79.000000     56.000000
75%      299.000000  1.448810e+05     241.000000    171.000000    128.000000
max     3393.000000  9.017050e+06    4805.000000   4489.000000   4818.000000

        3:00-4:00AM   4:00-5:00AM   5:00-6:00AM  6:00-7:00AM   7:00-8:00AM  \
count  42752.000000  42752.000000  42752.000000  42752.00000  42752.000000
mean     117.619359    135.677980    206.655747    352.60051    491.184673
std      215.316979    249.763737    415.614033    621.37410    697.735616
min        0.000000      0.000000      0.000000      0.00000      0.000000
25%       22.000000     27.000000     42.000000     77.00000    133.000000
50%       47.000000     56.000000     85.000000    156.00000    270.000000
75%      110.000000    125.000000    182.000000    335.00000    538.000000
max     4323.000000   4469.000000   6456.000000   7513.00000   9226.330000

       ...   2:00-3:00PM   3:00-4:00PM   4:00-5:00PM   5:00-6:00PM  \
count  ...  42503.000000  42503.000000  42503.000000  42503.000000
mean   ...    630.179376    657.691175    661.120886    657.712538
std    ...    758.610055    779.552761    776.312508    770.034661
min    ...      0.000000      0.000000      0.000000      0.000000
25%    ...    250.000000    260.000000    261.000000    258.000000
50%    ...    406.000000    425.000000    428.000000    424.000000
75%    ...    679.000000    713.000000    724.000000    724.000000
max    ...   6996.000000   7524.000000   8683.000000   9762.000000

        6:00-7:00PM   7:00-8:00PM   8:00-9:00PM  9:00-10:00PM  10:00-11:00PM  \
count  42503.000000  42503.000000  42503.000000  42503.000000   42503.000000
mean     628.825612    570.346987    496.648166    428.457779     376.555067
std      764.796252    732.905268    678.326096    614.480940     565.197252
min        0.000000      0.000000      0.000000      0.000000       0.000000
25%      237.000000    204.000000    167.000000    133.000000     108.000000
50%      396.000000    344.000000    287.000000    235.000000     196.000000
75%      680.000000    598.500000    508.000000    427.000000     366.000000
max     9879.000000  10532.000000   6659.000000   5698.000000    5460.000000

       11:00-12:00AM
count   42503.000000
mean      315.806014
std       493.563207
min         0.000000
25%        82.000000
50%       157.000000
75%       303.000000
max      5027.000000

[8 rows x 26 columns]

print(data1.columns)

print(data2.columns)

Index(['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'hour',
       'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id',
       'Local_authority_name', 'Local_authority_code', 'Road_name',
       'Road_category', 'Road_type', 'Start_junction_road_name',
       'End_junction_road_name', 'Easting', 'Northing', 'Latitude',
       'Longitude', 'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', 'Unnamed: 35', '12:00-1:00 AM', '1:00-2:00AM',
       '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM',
       '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM',
       '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM',
       '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM',
       '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM',
       '10:00-11:00PM', '11:00-12:00AM', 'Unnamed: 60'],
      dtype='object')
Index(['ID', 'SegmentID', 'Roadway Name', 'From', 'To', 'Direction', 'Date',
       '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM',
       '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM',
       '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM',
       '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM',
       '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM',
       '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM'],
      dtype='object')

print(data1['Count_point_id'].unique())
print(data2['ID'].unique())

[ 749 7033 18029 ... 945081 946667 942323]


[ 1 2 3 ... 3391 3392 3393]

Data Preparation

Checking for missing values

print("Missing values in Traffic Volume Data:")


print(data1.isnull().sum())

Missing values in Traffic Volume Data:


Count_point_id 0
Direction_of_travel 0
Year 0
Count_date 0
hour 0
...
8:00-9:00PM 1033309
9:00-10:00PM 1033309
10:00-11:00PM 1033309
11:00-12:00AM 1033309
Unnamed: 60 1048575
Length: 61, dtype: int64

## Since 'Start_junction_road_name' and 'End_junction_road_name' contain non-numerical data,
## it is better to drop these two columns.

data1.drop(columns=['Start_junction_road_name'], inplace=True)

data1.drop(columns=['End_junction_road_name'], inplace=True)

print(data1.columns)

Index(['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'hour',
       'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id',
       'Local_authority_name', 'Local_authority_code', 'Road_name',
       'Road_category', 'Road_type', 'Easting', 'Northing', 'Latitude',
       'Longitude', 'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', 'Unnamed: 35', '12:00-1:00 AM', '1:00-2:00AM',
       '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM',
       '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM',
       '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM',
       '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM',
       '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM',
       '10:00-11:00PM', '11:00-12:00AM', 'Unnamed: 60'],
      dtype='object')

## The remaining columns with missing values, 'Link_length_km' and 'Link_length_miles',
## hold numerical data, so we impute the missing values with the mean or median of their
## respective columns.

mean_link_length_km = data1['Link_length_km'].mean()
median_link_length_miles = data1['Link_length_miles'].median()
# Assignment is used instead of inplace fillna on a column, which is
# deprecated chained-assignment behaviour in recent pandas.
data1['Link_length_km'] = data1['Link_length_km'].fillna(mean_link_length_km)
data1['Link_length_miles'] = data1['Link_length_miles'].fillna(median_link_length_miles)
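Whether the mean or the median is the better fill value depends on skew: `describe()` shows `Link_length_km` with a median of 1.9 against a max of 46.3, so a few very long links pull the mean upward and the median is arguably the safer choice. A toy sketch (the values loosely echo the quartiles above and are my own):

```python
import pandas as pd

# Miniature frame mimicking the two link-length columns from the dataset.
df = pd.DataFrame({
    "Link_length_km": [0.9, 1.9, None, 46.3, 3.9],
    "Link_length_miles": [0.6, 1.2, None, 28.8, 2.4],
})

# Median resists the outlier (46.3); mean is shown for comparison.
df["Link_length_km"] = df["Link_length_km"].fillna(df["Link_length_km"].median())
df["Link_length_miles"] = df["Link_length_miles"].fillna(df["Link_length_miles"].mean())

print(df.isnull().sum().sum())  # 0 — no missing values remain
```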

##After handling missing values, we will check if there are any remaining missing values.

print("Missing values in data after handling:")


print(data1.isnull().sum())

Missing values in data after handling:


Count_point_id 0
Direction_of_travel 0
Year 0
Count_date 0
hour 0
Region_id 0
Region_name 0
Region_ons_code 0
Local_authority_id 0
Local_authority_name 0
Local_authority_code 0
Road_name 0
Road_category 0
Road_type 0
Easting 0
Northing 0
Latitude 0
Longitude 0
Link_length_km 0
Link_length_miles 0
Pedal_cycles 0
Two_wheeled_motor_vehicles 0
Cars_and_taxis 0
Buses_and_coaches 0
LGVs 0
HGVs_2_rigid_axle 0
HGVs_3_rigid_axle 0
HGVs_4_or_more_rigid_axle 0
HGVs_3_or_4_articulated_axle 0
HGVs_5_articulated_axle 0
HGVs_6_articulated_axle 0
All_HGVs 0
All_motor_vehicles 0
Unnamed: 35 1048575
12:00-1:00 AM 1033056
1:00-2:00AM 1033056
2:00-3:00AM 1033056
3:00-4:00AM 1033056
4:00-5:00AM 1033056
5:00-6:00AM 1033056
6:00-7:00AM 1033056
7:00-8:00AM 1033056
8:00-9:00AM 1033056
9:00-10:00AM 1033056
10:00-11:00AM 1033056
11:00-12:00PM 1033056
12:00-1:00PM 1033309
1:00-2:00PM 1033309
2:00-3:00PM 1033309
3:00-4:00PM 1033309
4:00-5:00PM 1033309
5:00-6:00PM 1033309
6:00-7:00PM 1033309
7:00-8:00PM 1033309
8:00-9:00PM 1033309
9:00-10:00PM 1033309
10:00-11:00PM 1033309
11:00-12:00AM 1033309
Unnamed: 60 1048575
dtype: int64

## After this step, the link-length columns are fully imputed; the hourly count columns
## and the unnamed placeholder columns still contain missing values and are handled later.

# Checking missing values in the 2nd dataset (data2)


print("Missing values in Traffic Volume Counts:")
print(data2.isnull().sum())

Missing values in Traffic Volume Counts:


ID 0
SegmentID 0
Roadway Name 0
From 0
To 0
Direction 0
Date 0
12:00-1:00 AM 4
1:00-2:00AM 4
2:00-3:00AM 4
3:00-4:00AM 4
4:00-5:00AM 4
5:00-6:00AM 4
6:00-7:00AM 4
7:00-8:00AM 4
8:00-9:00AM 4
9:00-10:00AM 4
10:00-11:00AM 3
11:00-12:00PM 1
12:00-1:00PM 253
1:00-2:00PM 253
2:00-3:00PM 253
3:00-4:00PM 253
4:00-5:00PM 253
5:00-6:00PM 253
6:00-7:00PM 253
7:00-8:00PM 253
8:00-9:00PM 253
9:00-10:00PM 253
10:00-11:00PM 253
11:00-12:00AM 253
dtype: int64

Since both datasets relate to the same problem statement, we can merge them.
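The two files do not share an obvious key column (UK count points vs. NYC segments), so a row-wise stack over the columns they have in common is one option. A minimal sketch, where the toy frames `d1` and `d2` standing in for `data1` and `data2` are my own invention:

```python
import pandas as pd

# Tiny stand-ins for the two datasets; only shared hourly columns are kept.
d1 = pd.DataFrame({"12:00-1:00 AM": [20.0], "1:00-2:00AM": [10.0]})
d2 = pd.DataFrame({"12:00-1:00 AM": [31.0], "1:00-2:00AM": [17.0]})

# Intersect the column sets, then stack rows with a key recording the source.
shared = d1.columns.intersection(d2.columns)
merged = pd.concat([d1[shared], d2[shared]], keys=["uk", "nyc"])
print(merged.shape)  # (2, 2)
```

The `keys` argument builds a MultiIndex so each row can be traced back to its source dataset after the concat.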

import pandas as pd
df = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data1.csv')
print(df)

         Count_point_id Direction_of_travel  Year        Count_date  hour  \
0                   749                   E  2014  25-06-2014 00:00     7
1                   749                   E  2014  25-06-2014 00:00     8
2                   749                   E  2014  25-06-2014 00:00     9
3                   749                   E  2014  25-06-2014 00:00    10
4                   749                   E  2014  25-06-2014 00:00    11
...                 ...                 ...   ...               ...   ...
1048570          942187                   N  2009  03-06-2009 00:00     8
1048571          942187                   N  2009  03-06-2009 00:00     9
1048572          942187                   N  2009  03-06-2009 00:00    10
1048573          942187                   N  2009  03-06-2009 00:00    11
1048574          942187                   N  2009  03-06-2009 00:00    12

         Region_id      Region_name Region_ons_code  Local_authority_id  \
0                3         Scotland       S92000003                  39
1                3         Scotland       S92000003                  39
2                3         Scotland       S92000003                  39
3                3         Scotland       S92000003                  39
4                3         Scotland       S92000003                  39
...            ...              ...             ...                 ...
1048570          7  East of England       E12000006                 126
1048571          7  East of England       E12000006                 126
1048572          7  East of England       E12000006                 126
1048573          7  East of England       E12000006                 126
1048574          7  East of England       E12000006                 126

        Local_authority_name  ...  3:00-4:00PM  4:00-5:00PM  5:00-6:00PM  \
0              East Ayrshire  ...        105.0        147.0        120.0
1              East Ayrshire  ...         98.0        133.0        131.0
2              East Ayrshire  ...        115.0        130.0        143.0
3              East Ayrshire  ...        127.0        122.0        144.0
4              East Ayrshire  ...        126.0        133.0        135.0
...                      ...  ...          ...          ...          ...
1048570              Suffolk  ...          NaN          NaN          NaN
1048571              Suffolk  ...          NaN          NaN          NaN
1048572              Suffolk  ...          NaN          NaN          NaN
1048573              Suffolk  ...          NaN          NaN          NaN
1048574              Suffolk  ...          NaN          NaN          NaN

         6:00-7:00PM  7:00-8:00PM  8:00-9:00PM  9:00-10:00PM  10:00-11:00PM  \
0               91.0         83.0         74.0          49.0           42.0
1               95.0         73.0         70.0          63.0           42.0
2              106.0         89.0         68.0          64.0           56.0
3              122.0         76.0         64.0          58.0           64.0
4              102.0        106.0         58.0          58.0           55.0
...              ...          ...          ...           ...            ...
1048570          NaN          NaN          NaN           NaN            NaN
1048571          NaN          NaN          NaN           NaN            NaN
1048572          NaN          NaN          NaN           NaN            NaN
1048573          NaN          NaN          NaN           NaN            NaN
1048574          NaN          NaN          NaN           NaN            NaN

         11:00-12:00AM  Unnamed: 60
0                 42.0          NaN
1                 35.0          NaN
2                 43.0          NaN
3                 43.0          NaN
4                 54.0          NaN
...                ...          ...
1048570            NaN          NaN
1048571            NaN          NaN
1048572            NaN          NaN
1048573            NaN          NaN
1048574            NaN          NaN

[1048575 rows x 61 columns]

import matplotlib.pyplot as plt


import seaborn as sns

# Create a heatmap of missing values


plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
missing_percentage = (df.isnull().sum() / len(df)) * 100
plt.show()
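Note that `missing_percentage` is computed in the cell above but never displayed. A small numeric companion to the heatmap (the toy frame here stands in for `df`) prints it sorted, worst columns first:

```python
import pandas as pd

# Toy frame in place of df: column "x" is half missing, "y" is complete.
df = pd.DataFrame({"x": [1, None, None, 4], "y": [1, 2, 3, 4]})

# Percentage of missing values per column, highest first.
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(missing_percentage.sort_values(ascending=False))
```

On the real `data1` this makes the near-empty columns ('Unnamed: 35', 'Unnamed: 60', the hourly counts) immediately visible without reading the heatmap.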

df.columns

Index(['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'hour',
       'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id',
       'Local_authority_name', 'Local_authority_code', 'Road_name',
       'Road_category', 'Road_type', 'Start_junction_road_name',
       'End_junction_road_name', 'Easting', 'Northing', 'Latitude',
       'Longitude', 'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', 'Unnamed: 35', '12:00-1:00 AM', '1:00-2:00AM',
       '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM',
       '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM',
       '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM',
       '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM',
       '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM',
       '10:00-11:00PM', '11:00-12:00AM', 'Unnamed: 60'],
      dtype='object')

Data Preprocessing

Data Exploration

#printing data shape


print("Dimensions of the dataset:", df.shape)

Dimensions of the dataset: (1048575, 61)

print(df.describe())

       Count_point_id          Year          hour     Region_id  \
count    1.048575e+06  1.048575e+06  1.048575e+06  1.048575e+06
mean     5.995700e+05  2.011974e+03  1.249989e+01  6.305353e+00
std      4.394447e+05  3.490533e+00  3.452043e+00  2.997732e+00
min      5.300000e+01  2.001000e+03  7.000000e+00  1.000000e+00
25%      5.607700e+04  2.009000e+03  9.000000e+00  4.000000e+00
50%      9.407220e+05  2.010000e+03  1.200000e+01  7.000000e+00
75%      9.462220e+05  2.015000e+03  1.500000e+01  9.000000e+00
max      9.999990e+05  2.021000e+03  1.800000e+01  1.100000e+01

       Local_authority_id       Easting      Northing      Latitude  \
count        1.048575e+06  1.048575e+06  1.048575e+06  1.048575e+06
mean         1.028254e+02  4.350583e+05  3.003754e+05  5.259198e+01
std          5.172886e+01  9.336256e+04  1.589562e+05  1.431117e+00
min          1.000000e+00  7.040600e+04  1.077500e+04  4.991714e+01
25%          6.700000e+01  3.720730e+05  1.786700e+05  5.149455e+01
50%          9.700000e+01  4.350000e+05  2.724000e+05  5.234075e+01
75%          1.410000e+02  5.105410e+05  3.966100e+05  5.346388e+01
max          2.080000e+02  6.550000e+05  1.179870e+06  6.049980e+01

          Longitude  Link_length_km  ...   3:00-4:00PM   4:00-5:00PM  \
count  1.048575e+06   403584.000000  ...  15266.000000  15266.000000
mean  -1.500683e+00        3.094704  ...    669.589152    676.870693
std    1.371715e+00        3.589633  ...    798.586645    804.420051
min   -7.425717e+00        0.100000  ...      0.000000      0.000000
25%   -2.418145e+00        0.900000  ...    253.000000    255.000000
50%   -1.479373e+00        1.900000  ...    426.000000    432.000000
75%   -3.898619e-01        3.900000  ...    736.000000    745.000000
max    1.754553e+00       46.300000  ...   6016.000000   5923.000000

        5:00-6:00PM   6:00-7:00PM   7:00-8:00PM   8:00-9:00PM  9:00-10:00PM  \
count  15266.000000  15266.000000  15266.000000  15266.000000  15266.000000
mean     676.528560    643.128848    575.776104    498.165073    429.040875
std      799.064056    784.666243    737.540903    670.361583    600.447526
min        0.000000      0.000000      0.000000      0.000000      0.000000
25%      253.000000    230.000000    194.000000    159.000000    128.000000
50%      430.000000    400.000000    345.000000    288.000000    234.000000
75%      751.000000    704.000000    612.000000    516.000000    435.000000
max     6169.000000   5810.000000   5249.000000   5102.000000   4986.000000

       10:00-11:00PM  11:00-12:00AM  Unnamed: 60
count   15266.000000   15266.000000          0.0
mean      375.862243     315.701231          NaN
std       552.187778     483.493903          NaN
min         0.000000       0.000000          NaN
25%       103.000000      79.000000          NaN
50%       193.000000     154.000000          NaN
75%       373.000000     309.000000          NaN
max      4468.000000    4815.000000          NaN

[8 rows x 50 columns]

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Count_point_id 1048575 non-null int64
1 Direction_of_travel 1048575 non-null object
2 Year 1048575 non-null int64
3 Count_date 1048575 non-null object
4 hour 1048575 non-null int64
5 Region_id 1048575 non-null int64
6 Region_name 1048575 non-null object
7 Region_ons_code 1048575 non-null object
8 Local_authority_id 1048575 non-null int64
9 Local_authority_name 1048575 non-null object
10 Local_authority_code 1048575 non-null object
11 Road_name 1048575 non-null object
12 Road_category 1048575 non-null object
13 Road_type 1048575 non-null object
14 Start_junction_road_name 402756 non-null object
15 End_junction_road_name 402732 non-null object
16 Easting 1048575 non-null int64
17 Northing 1048575 non-null int64
18 Latitude 1048575 non-null float64
19 Longitude 1048575 non-null float64
20 Link_length_km 403584 non-null float64
21 Link_length_miles 403584 non-null float64
22 Pedal_cycles 1048575 non-null int64
23 Two_wheeled_motor_vehicles 1048575 non-null int64
24 Cars_and_taxis 1048575 non-null int64
25 Buses_and_coaches 1048575 non-null int64
26 LGVs 1048575 non-null int64
27 HGVs_2_rigid_axle 1048575 non-null int64
28 HGVs_3_rigid_axle 1048575 non-null int64
29 HGVs_4_or_more_rigid_axle 1048575 non-null int64
30 HGVs_3_or_4_articulated_axle 1048575 non-null int64
31 HGVs_5_articulated_axle 1048575 non-null int64
32 HGVs_6_articulated_axle 1048575 non-null int64
33 All_HGVs 1048575 non-null int64
34 All_motor_vehicles 1048575 non-null int64
35 Unnamed: 35 0 non-null float64
36 12:00-1:00 AM 15519 non-null float64
37 1:00-2:00AM 15519 non-null float64
38 2:00-3:00AM 15519 non-null float64
39 3:00-4:00AM 15519 non-null float64
40 4:00-5:00AM 15519 non-null float64
41 5:00-6:00AM 15519 non-null float64
42 6:00-7:00AM 15519 non-null float64
43 7:00-8:00AM 15519 non-null float64
44 8:00-9:00AM 15519 non-null float64
45 9:00-10:00AM 15519 non-null float64
46 10:00-11:00AM 15519 non-null float64
47 11:00-12:00PM 15519 non-null float64
48 12:00-1:00PM 15266 non-null float64
49 1:00-2:00PM 15266 non-null float64
50 2:00-3:00PM 15266 non-null float64
51 3:00-4:00PM 15266 non-null float64
52 4:00-5:00PM 15266 non-null float64
53 5:00-6:00PM 15266 non-null float64
54 6:00-7:00PM 15266 non-null float64
55 7:00-8:00PM 15266 non-null float64
56 8:00-9:00PM 15266 non-null float64
57 9:00-10:00PM 15266 non-null float64
58 10:00-11:00PM 15266 non-null float64
59 11:00-12:00AM 15266 non-null float64
60 Unnamed: 60 0 non-null float64
dtypes: float64(30), int64(20), object(11)
memory usage: 488.0+ MB
None

Missing Values

# Identify missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

Missing values:
Count_point_id 0
Direction_of_travel 0
Year 0
Count_date 0
hour 0
...
8:00-9:00PM 1033309
9:00-10:00PM 1033309
10:00-11:00PM 1033309
11:00-12:00AM 1033309
Unnamed: 60 1048575
Length: 61, dtype: int64

import matplotlib.pyplot as plt


import seaborn as sns

# Create a heatmap of missing values


plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
missing_percentage = (df.isnull().sum() / len(df)) * 100

plt.show()
# Get columns with any missing values
columns_with_missing_values = df.columns[df.isnull().any()]

# Display the columns with missing values


print("Columns with missing values:")
print(columns_with_missing_values)

Columns with missing values:


Index(['Start_junction_road_name', 'End_junction_road_name', 'Link_length_km',
       'Link_length_miles', 'Unnamed: 35', '12:00-1:00 AM', '1:00-2:00AM',
       '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM',
       '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM',
       '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM',
       '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM',
       '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM',
       '10:00-11:00PM', '11:00-12:00AM', 'Unnamed: 60'],
      dtype='object')

Feature Selection

columns_to_drop = ['Unnamed: 60','Unnamed: 35']
df.drop(columns=columns_to_drop, inplace=True)

# For continuous numerical columns like 'Link_length_km' and 'Link_length_miles',
# we can impute the missing values with the mean or median of the column.
# Replace missing values in numerical columns with the mean
cont_var = ['Link_length_km', 'Link_length_miles', '12:00-1:00 AM']
df[cont_var] = df[cont_var].fillna(df[cont_var].mean())

columns_to_drop = ['Start_junction_road_name', 'End_junction_road_name']
df.drop(columns=columns_to_drop, inplace=True)

columns_with_missing_values = df.columns[df.isnull().any()]

# Display the columns with missing values


print("Columns with missing values:")
print(columns_with_missing_values)

Columns with missing values:


Index(['1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM',
       '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM',
       '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM',
       '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM',
       '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM',
       '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM'],
      dtype='object')

# List of columns with missing values
columns_with_missing_values = ['1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM', '4:00-5:00AM',
                               '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM', '8:00-9:00AM',
                               '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM', '12:00-1:00PM',
                               '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM', '4:00-5:00PM',
                               '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM', '8:00-9:00PM',
                               '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM']

# Forward fill missing values
df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='ffill')

# Backward fill missing values
df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='bfill')

C:\Users\lariy\AppData\Local\Temp\ipykernel_19624\1721878666.py:10: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
  df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='ffill')
C:\Users\lariy\AppData\Local\Temp\ipykernel_19624\1721878666.py:13: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
  df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='bfill')
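As the FutureWarning advises, the non-deprecated spelling chains `.ffill()` and `.bfill()` directly. A minimal sketch on a toy Series:

```python
import pandas as pd

# Forward fill propagates the last seen value; backward fill then covers
# any leading gap that forward fill could not reach.
s = pd.Series([None, 1.0, None, 3.0, None])
filled = s.ffill().bfill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 3.0, 3.0]
```

The same chain works column-wise on a DataFrame slice, so `df[columns_with_missing_values].ffill().bfill()` replaces both deprecated calls above in one step.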

columns_with_missing_values = df.columns[df.isnull().any()]

# Display the columns with missing values


print("Columns with missing values:")
print(columns_with_missing_values)

Columns with missing values:


Index([], dtype='object')

df.columns

Index(['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'hour',
       'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id',
       'Local_authority_name', 'Local_authority_code', 'Road_name',
       'Road_category', 'Road_type', 'Easting', 'Northing', 'Latitude',
       'Longitude', 'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM',
       '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM',
       '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM',
       '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM',
       '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM',
       '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM',
       '11:00-12:00AM'],
      dtype='object')

coldrop1 = ['Local_authority_name', 'Local_authority_code']
df.drop(columns=coldrop1, inplace=True)

coldrop = ['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'Region_id', 'Region_name', 'Region_ons_code', 'Local_authority_id']
df.drop(columns=coldrop, inplace=True)
coldrop1 = ['Local_authority_name', 'Local_authority_code']
df.drop(columns=coldrop1, inplace=True)

----------------------------------------------------------------------
-----
KeyError Traceback (most recent call
last)
Cell In[23], line 4
2 df.drop(columns=coldrop, inplace=True)
3 coldrop1=['Local_authority_name', 'Local_authority_code']
----> 4 df.drop(columns=coldrop1, inplace=True)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\frame.py:5344, in DataFrame.drop(self, labels, axis,
index, columns, level, inplace, errors)
5196 def drop(
5197 self,
5198 labels: IndexLabel | None = None,
(...)
5205 errors: IgnoreRaise = "raise",
5206 ) -> DataFrame | None:
5207 """
5208 Drop specified labels from rows or columns.
5209
(...)
5342 weight 1.0 0.8
5343 """
-> 5344 return super().drop(
5345 labels=labels,
5346 axis=axis,
5347 index=index,
5348 columns=columns,
5349 level=level,
5350 inplace=inplace,
5351 errors=errors,
5352 )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\generic.py:4711, in NDFrame.drop(self, labels, axis,
index, columns, level, inplace, errors)
4709 for axis, labels in axes.items():
4710 if labels is not None:
-> 4711 obj = obj._drop_axis(labels, axis, level=level,
errors=errors)
4713 if inplace:
4714 self._update_inplace(obj)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\generic.py:4753, in NDFrame._drop_axis(self, labels, axis,
level, errors, only_slice)
4751 new_axis = axis.drop(labels, level=level,
errors=errors)
4752 else:
-> 4753 new_axis = axis.drop(labels, errors=errors)
4754 indexer = axis.get_indexer(new_axis)
4756 # Case for non-unique axis
4757 else:

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\indexes\base.py:6992, in Index.drop(self, labels, errors)
6990 if mask.any():
6991 if errors != "ignore":
-> 6992 raise KeyError(f"{labels[mask].tolist()} not found in
axis")
6993 indexer = indexer[~mask]
6994 return self.delete(indexer)

KeyError: "['Local_authority_name', 'Local_authority_code'] not found in axis"
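The KeyError above occurs because this cell drops 'Local_authority_name' and 'Local_authority_code' a second time: they were already removed in the previous cell. pandas' drop accepts errors='ignore', which makes a repeated drop a no-op; a small sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame (column names are illustrative)
df_demo = pd.DataFrame({"a": [1], "b": [2]})
df_demo.drop(columns=["b"], inplace=True)

# Dropping 'b' again would raise KeyError; errors="ignore" skips missing labels
df_demo.drop(columns=["b"], inplace=True, errors="ignore")
print(list(df_demo.columns))  # ['a']
```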

coldrop2=['Road_name','Road_category']
df.drop(columns=coldrop2, inplace=True)

df.columns

Index(['hour', 'Road_type', 'Easting', 'Northing', 'Latitude', 'Longitude',
       'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM',
       '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM',
       '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM',
       '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM',
       '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM',
       '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM',
       '11:00-12:00AM'],
      dtype='object')

import pandas as pd

df.to_excel('data_revised.xlsx', index=False)

# Detect outliers using box plot


import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df['hour'])
plt.title('Box plot to detect outliers')
plt.show()

-Feature Engineering
Encoding
There is one categorical column, Road_type, with the values 'Major' and 'Minor'. We convert it to
numerical form, either by creating dummies with One-Hot Encoding or by Label Encoding.
import pandas as pd

# Assume df is your DataFrame containing the 'Road_type' column

# One-Hot Encoding
one_hot_encoded = pd.get_dummies(df['Road_type'], prefix='Road_type')

# Label Encoding
label_encoded = df['Road_type'].astype('category').cat.codes

# Replace the original 'Road_type' column with the encoded values
df['Road_type_encoded'] = label_encoded  # or use the one-hot encoded DataFrame instead

#checking the results of encoding


print(df['Road_type_encoded'])

print(df.head(5))

0 0
1 0
2 0
3 0
4 0
..
1048570 1
1048571 1
1048572 1
1048573 1
1048574 1
Name: Road_type_encoded, Length: 1048575, dtype: int8
hour Road_type Easting Northing Latitude Longitude
Link_length_km \
0 7 Major 243900 635900 55.591636 -4.478606
4.6
1 8 Major 243900 635900 55.591636 -4.478606
4.6
2 9 Major 243900 635900 55.591636 -4.478606
4.6
3 10 Major 243900 635900 55.591636 -4.478606
4.6
4 11 Major 243900 635900 55.591636 -4.478606
4.6

Link_length_miles Pedal_cycles Two_wheeled_motor_vehicles ... \


0 2.86 0 2 ...
1 2.86 0 5 ...
2 2.86 0 2 ...
3 2.86 0 1 ...
4 2.86 0 2 ...
3:00-4:00PM 4:00-5:00PM 5:00-6:00PM 6:00-7:00PM 7:00-8:00PM \
0 105.0 147.0 120.0 91.0 83.0
1 98.0 133.0 131.0 95.0 73.0
2 115.0 130.0 143.0 106.0 89.0
3 127.0 122.0 144.0 122.0 76.0
4 126.0 133.0 135.0 102.0 106.0

8:00-9:00PM 9:00-10:00PM 10:00-11:00PM 11:00-12:00AM


Road_type_encoded
0 74.0 49.0 42.0 42.0
0
1 70.0 63.0 42.0 35.0
0
2 68.0 64.0 56.0 43.0
0
3 64.0 58.0 64.0 43.0
0
4 58.0 58.0 55.0 54.0
0

[5 rows x 46 columns]

With 'Road_type_encoded' created, we can drop the original Road_type column, as it adds no
further value to the data.

df.drop(columns=['Road_type'], inplace=True)

df.columns

Index(['hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
       'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM',
       '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM',
       '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM',
       '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM',
       '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM',
       '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM',
       '11:00-12:00AM', 'Road_type_encoded'],
      dtype='object')

Adding a feature
# Define the columns to be used for calculating traffic_rate
columns_to_use = ['hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
                  'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
                  'Two_wheeled_motor_vehicles', 'Cars_and_taxis',
                  'Buses_and_coaches', 'LGVs', 'HGVs_2_rigid_axle',
                  'HGVs_3_rigid_axle', 'HGVs_4_or_more_rigid_axle',
                  'HGVs_3_or_4_articulated_axle', 'HGVs_5_articulated_axle',
                  'HGVs_6_articulated_axle', 'All_HGVs', 'All_motor_vehicles',
                  '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM',
                  '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM',
                  '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM',
                  '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM',
                  '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM',
                  '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM',
                  'Road_type_encoded']

# Calculate the total sum of all columns


total_sum = df[columns_to_use].sum(axis=1)

# Row-wise self-weighted average: each value is weighted by itself,
# i.e. the sum of squares divided by the row sum
weighted_average = (df[columns_to_use].mul(df[columns_to_use], axis=0)).sum(axis=1) / total_sum

# Express the weighted average relative to the row total, as a percentage
traffic_rate = (weighted_average / total_sum) * 100

# Add the 'traffic_rate' column to the DataFrame


df['traffic_rate'] = traffic_rate

# Display the DataFrame with the new column


print(df)

hour Easting Northing Latitude Longitude Link_length_km


\
0 7 243900 635900 55.591636 -4.478606 4.600000

1 8 243900 635900 55.591636 -4.478606 4.600000

2 9 243900 635900 55.591636 -4.478606 4.600000


3 10 243900 635900 55.591636 -4.478606 4.600000

4 11 243900 635900 55.591636 -4.478606 4.600000

... ... ... ... ... ... ...

1048570 8 628200 234310 51.960413 1.320092 3.094704

1048571 9 628200 234310 51.960413 1.320092 3.094704

1048572 10 628200 234310 51.960413 1.320092 3.094704

1048573 11 628200 234310 51.960413 1.320092 3.094704

1048574 12 628200 234310 51.960413 1.320092 3.094704

Link_length_miles Pedal_cycles
Two_wheeled_motor_vehicles \
0 2.860000 0 2

1 2.860000 0 5

2 2.860000 0 2

3 2.860000 0 1

4 2.860000 0 2

... ... ... ...

1048570 1.922684 0 1

1048571 1.922684 1 2

1048572 1.922684 1 1

1048573 1.922684 1 3

1048574 1.922684 0 2

Cars_and_taxis ... 4:00-5:00PM 5:00-6:00PM 6:00-7:00PM \


0 845 ... 147.0 120.0 91.0
1 908 ... 133.0 131.0 95.0
2 595 ... 130.0 143.0 106.0
3 590 ... 122.0 144.0 122.0
4 695 ... 133.0 135.0 102.0
... ... ... ... ... ...
1048570 28 ... 84.0 88.0 66.0
1048571 24 ... 84.0 88.0 66.0
1048572 28 ... 84.0 88.0 66.0
1048573 31 ... 84.0 88.0 66.0
1048574 81 ... 84.0 88.0 66.0

7:00-8:00PM 8:00-9:00PM 9:00-10:00PM 10:00-11:00PM 11:00-


12:00AM \
0 83.0 74.0 49.0 42.0
42.0
1 73.0 70.0 63.0 42.0
35.0
2 89.0 68.0 64.0 56.0
43.0
3 76.0 64.0 58.0 64.0
43.0
4 106.0 58.0 58.0 55.0
54.0
... ... ... ... ...
...
1048570 84.0 66.0 93.0 63.0
82.0
1048571 84.0 66.0 93.0 63.0
82.0
1048572 84.0 66.0 93.0 63.0
82.0
1048573 84.0 66.0 93.0 63.0
82.0
1048574 84.0 66.0 93.0 63.0
82.0

Road_type_encoded traffic_rate
0 0 59.450027
1 0 59.415506
2 0 59.484287
3 0 59.482806
4 0 59.443662
... ... ...
1048570 1 60.137671
1048571 1 60.144764
1048572 1 60.149633
1048573 1 60.147685
1048574 1 60.123348

[1048575 rows x 46 columns]
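Per row, the computation above reduces to sum(x²) / (sum(x))² × 100, since each value is weighted by itself and the result is divided by the row total twice. A tiny check, using hypothetical column names rather than the real dataset:

```python
import pandas as pd

# Two illustrative columns (names are hypothetical, not the real dataset)
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

total = toy.sum(axis=1)                     # row sums: 4.0 and 6.0
weighted = (toy * toy).sum(axis=1) / total  # sum of squares over the row sum
rate = weighted / total * 100               # sum(x^2) / (sum(x))^2 * 100

# Row 0: (1 + 9) / 16 * 100 = 62.5
print(rate.tolist())
```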

# 'Road_type' was already dropped in the encoding step; errors='ignore' avoids a KeyError on re-run
df.drop(columns=['Road_type'], inplace=True, errors='ignore')

df.columns

Index(['hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
       'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
       'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
       'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
       'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
       'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
       'All_motor_vehicles', '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM',
       '3:00-4:00AM', '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM',
       '7:00-8:00AM', '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM',
       '11:00-12:00PM', '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM',
       '3:00-4:00PM', '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM',
       '7:00-8:00PM', '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM',
       '11:00-12:00AM', 'Road_type_encoded', 'traffic_rate'],
      dtype='object')

df.dtypes

hour int64
Easting int64
Northing int64
Latitude float64
Longitude float64
Link_length_km float64
Link_length_miles float64
Pedal_cycles int64
Two_wheeled_motor_vehicles int64
Cars_and_taxis int64
Buses_and_coaches int64
LGVs int64
HGVs_2_rigid_axle int64
HGVs_3_rigid_axle int64
HGVs_4_or_more_rigid_axle int64
HGVs_3_or_4_articulated_axle int64
HGVs_5_articulated_axle int64
HGVs_6_articulated_axle int64
All_HGVs int64
All_motor_vehicles int64
12:00-1:00 AM float64
1:00-2:00AM float64
2:00-3:00AM float64
3:00-4:00AM float64
4:00-5:00AM float64
5:00-6:00AM float64
6:00-7:00AM float64
7:00-8:00AM float64
8:00-9:00AM float64
9:00-10:00AM float64
10:00-11:00AM float64
11:00-12:00PM float64
12:00-1:00PM float64
1:00-2:00PM float64
2:00-3:00PM float64
3:00-4:00PM float64
4:00-5:00PM float64
5:00-6:00PM float64
6:00-7:00PM float64
7:00-8:00PM float64
8:00-9:00PM float64
9:00-10:00PM float64
10:00-11:00PM float64
11:00-12:00AM float64
Road_type_encoded int8
traffic_rate float64
dtype: object

-Outliers detection
import pandas as pd

columns_of_interest = ['hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
                       'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
                       'Two_wheeled_motor_vehicles', 'Cars_and_taxis',
                       'Buses_and_coaches', 'LGVs', 'HGVs_2_rigid_axle',
                       'HGVs_3_rigid_axle', 'HGVs_4_or_more_rigid_axle',
                       'HGVs_3_or_4_articulated_axle', 'HGVs_5_articulated_axle',
                       'HGVs_6_articulated_axle', 'All_HGVs',
                       'All_motor_vehicles', 'traffic_rate']

# threshold for identifying outliers


threshold = 1.5

def count_outliers(column):
q1 = column.quantile(0.25)
q3 = column.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr
outliers_lower = column[column < lower_bound]
outliers_upper = column[column > upper_bound]
return len(outliers_lower), len(outliers_upper)

# Create a dictionary to store the outlier counts for each column


outlier_counts = {}
# Iterate over the columns of interest and count outliers
for col in columns_of_interest:
outliers_lower, outliers_upper = count_outliers(df[col])
outlier_counts[col] = {'lower_outliers': outliers_lower,
'upper_outliers': outliers_upper}

# Convert the dictionary to a DataFrame for easier visualization


outlier_counts_df = pd.DataFrame(outlier_counts).T

# Display the outlier counts for each column


print(outlier_counts_df)

lower_outliers upper_outliers
hour 0 0
Easting 2208 0
Northing 0 16830
Latitude 0 16518
Longitude 2160 0
Link_length_km 235872 110328
Link_length_miles 235872 110328
Pedal_cycles 0 110674
Two_wheeled_motor_vehicles 0 94091
Cars_and_taxis 0 96771
Buses_and_coaches 0 87315
LGVs 0 109842
HGVs_2_rigid_axle 0 127022
HGVs_3_rigid_axle 0 130470
HGVs_4_or_more_rigid_axle 0 143570
HGVs_3_or_4_articulated_axle 0 155028
HGVs_5_articulated_axle 0 169182
HGVs_6_articulated_axle 0 170369
All_HGVs 0 149406
All_motor_vehicles 0 100972
traffic_rate 0 705
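The IQR rule that count_outliers applies can be sanity-checked on a toy column (quantiles use pandas' default linear interpolation):

```python
import pandas as pd

# One obvious outlier among otherwise small values
col = pd.Series([1, 2, 3, 4, 5, 100])

q1, q3 = col.quantile(0.25), col.quantile(0.75)  # 2.25 and 4.75 here
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # bounds [-1.5, 8.5]
print(int(((col < lo) | (col > hi)).sum()))      # only 100 falls outside
```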

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df is your DataFrame containing the data


# Select the columns where you want to detect outliers
columns_of_interest = ['hour', 'Easting', 'Northing', 'Latitude',
'Longitude',
'Link_length_km', 'Link_length_miles',
'Pedal_cycles',
'Two_wheeled_motor_vehicles', 'Cars_and_taxis',
'Buses_and_coaches',
'LGVs', 'HGVs_2_rigid_axle',
'HGVs_3_rigid_axle',
'HGVs_4_or_more_rigid_axle',
'HGVs_3_or_4_articulated_axle',
'HGVs_5_articulated_axle',
'HGVs_6_articulated_axle', 'All_HGVs',
'All_motor_vehicles', 'traffic_rate']

# Create a boxplot for each selected column


plt.figure(figsize=(14, 8))
sns.boxplot(data=df[columns_of_interest])
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.title('Boxplot of Selected Columns to Detect Outliers')
plt.xlabel('Columns')
plt.ylabel('Values')
plt.show()

#removing outliers
import numpy as np
from scipy import stats

# Define a function to remove outliers using Z-score


def remove_outliers_zscore(df, cols):
for col in cols:
z_scores = np.abs(stats.zscore(df[col]))
df = df[(z_scores < 3)]
return df
#columns with outliers
cols_with_outliers = ['Easting', 'Northing', 'Latitude', 'Longitude',
'Link_length_km', 'Link_length_miles',
'Pedal_cycles', 'Two_wheeled_motor_vehicles',
'Cars_and_taxis', 'Buses_and_coaches',
'LGVs', 'HGVs_2_rigid_axle',
'HGVs_3_rigid_axle', 'HGVs_4_or_more_rigid_axle',
'HGVs_3_or_4_articulated_axle',
'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle',
'All_HGVs', 'All_motor_vehicles',
'traffic_rate']

# removing outliers using Z-score method


df_cleaned = remove_outliers_zscore(df, cols_with_outliers)

print("Original DataFrame shape:", df.shape)


print("DataFrame shape after removing outliers:", df_cleaned.shape)

Original DataFrame shape: (1048575, 46)


DataFrame shape after removing outliers: (759978, 46)

As we can observe, the original DataFrame had 1,048,575 rows and 46 columns. After removing
outliers with the Z-score method, the data contains 759,978 rows and 46 columns. The reduction
in row count shows that some data points were identified as outliers and removed, based on the
chosen detection method and threshold (|z| < 3).
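The |z| < 3 filter can be illustrated with plain NumPy, mirroring stats.zscore (which uses the population standard deviation, ddof=0), on a synthetic column:

```python
import numpy as np

# 99 typical values and one extreme point
x = np.array([10.0] * 99 + [1000.0])
z = np.abs((x - x.mean()) / x.std())  # equivalent to abs(scipy.stats.zscore(x))

# Only the extreme value exceeds the |z| = 3 threshold
print(int((z >= 3).sum()))
```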

-Scaling Transformation
from sklearn.preprocessing import MinMaxScaler

# Select numerical columns for scaling


numerical_cols = ['Link_length_km', 'Link_length_miles',
'Pedal_cycles', 'Two_wheeled_motor_vehicles',
'Cars_and_taxis', 'traffic_rate']

# Initialize the MinMaxScaler


scaler = MinMaxScaler()

# Fit and transform the numerical columns


df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
scaled_data = df[numerical_cols]

print(scaled_data)
Link_length_km Link_length_miles Pedal_cycles \
0 0.097403 0.097527 0.000000
1 0.097403 0.097527 0.000000
2 0.097403 0.097527 0.000000
3 0.097403 0.097527 0.000000
4 0.097403 0.097527 0.000000
... ... ... ...
1048570 0.064820 0.064879 0.000000
1048571 0.064820 0.064879 0.000453
1048572 0.064820 0.064879 0.000453
1048573 0.064820 0.064879 0.000453
1048574 0.064820 0.064879 0.000000

Two_wheeled_motor_vehicles Cars_and_taxis traffic_rate


0 0.002604 0.094382 0.465707
1 0.006510 0.101419 0.464980
2 0.002604 0.066458 0.466428
3 0.001302 0.065900 0.466397
4 0.002604 0.077628 0.465573
... ... ... ...
1048570 0.001302 0.003127 0.480189
1048571 0.002604 0.002681 0.480338
1048572 0.001302 0.003127 0.480441
1048573 0.003906 0.003463 0.480400
1048574 0.002604 0.009047 0.479887

[1048575 rows x 6 columns]
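MinMaxScaler maps each column through (x - min) / (max - min), so the smallest value becomes 0 and the largest 1; a quick hand check of that formula:

```python
import numpy as np

x = np.array([2.0, 4.0, 10.0])
scaled = (x - x.min()) / (x.max() - x.min())  # MinMaxScaler's per-column formula
print(scaled.tolist())  # [0.0, 0.25, 1.0]
```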

-Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plots of numerical features against the target variable 'traffic_rate'
plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical_cols):
plt.subplot(3, 2, i+1)
sns.scatterplot(data=df, x=col, y='traffic_rate')
plt.title(f'Scatter Plot: {col} vs traffic_rate')
plt.xlabel(col)
plt.ylabel('traffic_rate')
plt.tight_layout()
plt.show()
# Bar chart of categorical feature 'Road_type_encoded' against the target variable 'traffic_rate'
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='Road_type_encoded', y='traffic_rate')
plt.title('Bar Chart: Road_type_encoded vs traffic_rate')
plt.xlabel('Road_type_encoded')
plt.ylabel('traffic_rate')
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix for the scaled numerical columns
correlation_matrix = df[numerical_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histograms for numerical variables


plt.figure(figsize=(15, 10))
for i, col in enumerate(df.columns):
if df[col].dtype in ['int64', 'float64']:
plt.subplot(4, 3, i + 1)
sns.histplot(df[col], kde=True)
plt.title(f'Histogram of {col}')
plt.xlabel(col)
plt.tight_layout()
plt.show()

# Plot count plots for categorical variables


plt.figure(figsize=(15, 10))
for i, col in enumerate(df.columns):
if df[col].dtype == 'object':
plt.subplot(4, 3, i + 1)
sns.countplot(data=df, x=col)
plt.title(f'Count Plot of {col}')
plt.xlabel(col)
plt.tight_layout()
plt.show()

----------------------------------------------------------------------
-----
ValueError Traceback (most recent call
last)
Cell In[54], line 8
6 for i, col in enumerate(df.columns):
7 if df[col].dtype in ['int64', 'float64']:
----> 8 plt.subplot(4, 3, i + 1)
9 sns.histplot(df[col], kde=True)
10 plt.title(f'Histogram of {col}')

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)
1422 fig = gcf()
1424 # First, search for an existing subplot with a matching spec.
-> 1425 key = SubplotSpec._from_subplot_args(fig, args)
1427 for ax in fig.axes:
1428 # If we found an Axes at the position, we can re-use it if
the user passed no
1429 # kwargs or if the axes class and kwargs are identical.
1430 if (ax.get_subplotspec() == key
1431 and (kwargs == {}
1432 or (ax._projection_init
1433 ==
fig._process_projection_requirements(**kwargs)))):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure,
args)
597 else:
598 if not isinstance(num, Integral) or num < 1 or num >
rows*cols:
--> 599 raise ValueError(
600 f"num must be an integer with 1 <= num <=
{rows*cols}, "
601 f"not {num!r}"
602 )
603 i = j = num
604 return gs[i-1:j]

ValueError: num must be an integer with 1 <= num <= 12, not 13
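The ValueError arises because the loop indexes the fixed 4x3 grid with the column's position in df.columns (up to 46) rather than a count of panels actually drawn. One fix, sketched here on a synthetic frame with the plotting calls omitted, is to collect the numeric columns first and size the grid to their count:

```python
import math

import pandas as pd

# Synthetic stand-in for df: 14 numeric (float64) columns
df = pd.DataFrame({f"c{i}": [0.0, 1.0] for i in range(14)})

# Collect numeric columns first, then size the grid to fit them all
num_cols = [c for c in df.columns if str(df[c].dtype) in ("int64", "float64")]
rows = math.ceil(len(num_cols) / 3)

# Every plt.subplot(rows, 3, i + 1) index now satisfies 1 <= i + 1 <= rows * 3
assert all(i + 1 <= rows * 3 for i in range(len(num_cols)))
print(rows)  # 5 rows for 14 columns
```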
-Model Building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

#Defining target(dependent variable) and independent variable


X = df.drop(columns=['traffic_rate'])
y = df['traffic_rate']

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()
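train_test_split with test_size=0.2 holds out 20% of the rows; on a toy array of 10 samples that is an 8/2 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 8 2
```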

# Make predictions
y_pred = model.predict(X_test)

from sklearn.metrics import mean_squared_error, r2_score


import numpy as np
# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Root Mean Squared Error (RMSE)


rmse = np.sqrt(mse)

# R-squared
r2 = r2_score(y_test, y_pred)

# R-squared to percentage
lin_reg = r2 * 100

# The regression evaluation metrics


print("Mean Squared Error (MSE): {:.2f}".format(mse))
print("Root Mean Squared Error (RMSE): {:.2f}".format(rmse))
print("R-squared (R2): {:.2f}%".format(lin_reg))

Mean Squared Error (MSE): 0.01
Root Mean Squared Error (RMSE): 0.08
R-squared (R2): 66.59%
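RMSE is simply the square root of MSE, which restores the target's original units; with an illustrative unrounded MSE of 0.0064 (the notebook's printout is rounded), the RMSE works out to 0.08:

```python
import math

# RMSE is the square root of MSE (0.0064 is an illustrative value)
mse = 0.0064
print(round(math.sqrt(mse), 2))  # 0.08
```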

-Model Tuning
For the linear regression model, we can tune further with Lasso or Ridge regression.
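As a companion to the Ridge search below, Lasso adds an L1 penalty that can zero out irrelevant coefficients entirely; a small synthetic sketch (the data and alpha are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression: y depends only on the first of three features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# The L1 penalty shrinks the useful coefficient and zeroes the irrelevant ones
lasso = Lasso(alpha=0.5).fit(X, y)
print(np.round(lasso.coef_, 2))
```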

#Let's run Ridge Regression- Grid search

from sklearn.model_selection import GridSearchCV


from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define the model


model = Ridge()

# Define the hyperparameters grid


param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]  # example values for alpha
}

# Initialize Grid Search Cross Validation


grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=5, scoring='neg_mean_squared_error')
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters


best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Get the best model


best_model = grid_search.best_estimator_

# Evaluate the best model


y_pred_best = best_model.predict(X_test)
mse_best = mean_squared_error(y_test, y_pred_best)
rmse_best = np.sqrt(mse_best)
r2_grid = r2_score(y_test, y_pred_best)

# Calculate the range of the target variable


y_range = y_test.max() - y_test.min()

# Express MSE and RMSE relative to the target's range
# (note: MSE is in squared units, so this "percentage" is only indicative)
mse_percentage = (mse_best / y_range) * 100
rmse_percentage = (rmse_best / y_range) * 100

# Print results
print("Mean Squared Error (MSE) - Best Model: {:.2f}%".format(mse_percentage))
print("Root Mean Squared Error (RMSE) - Best Model: {:.2f}%".format(rmse_percentage))
print("R-squared (R2) - Best Model: {:.2f}%".format(r2_grid * 100))

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-conditioned matrix (rcond=4.51559e-18): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T

Best Hyperparameters: {'alpha': 0.1}


Mean Squared Error (MSE) - Best Model: 0.61%
Root Mean Squared Error (RMSE) - Best Model: 7.82%
R-squared (R2) - Best Model: 66.59%


from sklearn.model_selection import RandomizedSearchCV


from sklearn.linear_model import Ridge
from scipy.stats import uniform
from sklearn.metrics import r2_score

# Define the model


model = Ridge()

# Define the hyperparameters grid


param_distribution = {
    'alpha': uniform(loc=0, scale=10)  # example distribution for alpha
}

# Initialize Random Search Cross Validation


random_search = RandomizedSearchCV(estimator=model,
                                   param_distributions=param_distribution,
                                   n_iter=10, cv=5,
                                   scoring='neg_mean_squared_error',
                                   random_state=42)

# Fit the random search to the data


random_search.fit(X_train, y_train)

# Get the best hyperparameters


best_params_random = random_search.best_params_
print("Best Hyperparameters (Random Search):", best_params_random)

# Get the best model


best_model = random_search.best_estimator_
# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Calculate R-squared (R2) score


r2_random = r2_score(y_test, y_pred)

# Print the R2 score


print("R-squared (R2) - Best Model: {:.2f}%".format(r2_random * 100))

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-conditioned matrix (rcond=7.04425e-17): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=7.03766e-17): result may not be accurate.
return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=7.03707e-17): result may not be accurate.
return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=7.03109e-17): result may not be accurate.
return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=7.02953e-17): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
(The same LinAlgWarning is emitted once per candidate fit during the search, with slightly different rcond values; repeats omitted.)
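The LinAlgWarning above means the design matrix handed to Ridge's default Cholesky-based solver is nearly singular (for example, because some feature columns are almost collinear). One way to sidestep it is to request the more numerically stable SVD solver. This is a sketch on synthetic data, not the notebook's actual features:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
# Append a near-duplicate column to make the matrix ill-conditioned
X = np.hstack([X, X[:, :1] * (1 + 1e-10)])
y = X[:, 0] + 0.1 * rng.randn(100)

# solver='svd' avoids the linalg.solve call that raises LinAlgWarning
model = Ridge(alpha=0.5, solver='svd')
model.fit(X, y)
print(model.coef_)
```

Scaling the features (e.g. with StandardScaler) or dropping redundant columns also improves conditioning and usually silences the warning at its source.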

Best Hyperparameters (Random Search): {'alpha': 0.5808361216819946}


R-squared (R2) - Best Model: 66.59%
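The best alpha reported above comes from a randomized search over the regularization strength. A minimal, self-contained sketch of how such a search can be set up (the synthetic data and the uniform(0, 1) prior over alpha are assumptions; the notebook's X_train/y_train are not reproduced here):

```python
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([1.0, 0.5, -0.2, 0.3, 0.1]) + 0.01 * rng.randn(200)

# Sample 20 alpha values from Uniform(0, 1), score each with 5-fold CV
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={'alpha': uniform(0, 1)},
    n_iter=20,
    cv=5,
    scoring='r2',
    random_state=42,
)
search.fit(X, y)
print("Best Hyperparameters (Random Search):", search.best_params_)
```

With random_state fixed, the sampled alphas are reproducible, which is why the notebook's reported best alpha (0.5808...) is a specific draw rather than a round number.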

model.fit(X_train, y_train)

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-
conditioned matrix (rcond=1.80398e-17): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T

Ridge(alpha=0.5)

# Plot actual vs predicted
plt.scatter(y_test, model.predict(X_test), color='blue',
            label='Actual vs Predicted')

# Plot the ideal-prediction diagonal
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--', label='Ideal Prediction')

plt.title('Actual vs Predicted Traffic Rate')
plt.xlabel('Actual Traffic Rate')
plt.ylabel('Predicted Traffic Rate')
plt.legend()
plt.show()

import pandas as pd

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Create a DataFrame with actual and predicted values
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

# Display the DataFrame
print(results_df)


Actual Predicted
781974 0.269387 0.276978
937737 0.385366 0.418029
907828 0.294860 0.343721
784628 0.255908 0.281116
662460 0.260028 0.274427
... ... ...
673443 0.643551 0.537741
656736 0.369448 0.429966
858501 0.437590 0.451252
617079 0.382499 0.283711
487559 0.297439 0.363648

[209715 rows x 2 columns]
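Beyond eyeballing the table, the same Actual/Predicted columns can be summarised with error metrics. A sketch using a few made-up rows in place of the full results_df:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual/predicted pairs standing in for the 209,715-row results_df
results_df = pd.DataFrame({
    'Actual':    [0.27, 0.39, 0.29, 0.26, 0.64],
    'Predicted': [0.28, 0.42, 0.34, 0.28, 0.54],
})

mae = mean_absolute_error(results_df['Actual'], results_df['Predicted'])
r2 = r2_score(results_df['Actual'], results_df['Predicted'])
print(f"MAE: {mae:.4f}, R2: {r2:.2%}")
```

MAE reports the average absolute prediction error in the same (normalised) units as the traffic rate, which complements the percentage-style R2 already reported above.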
