Cac2 22112338
import pandas as pd
data1 = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data1.csv')
data2 = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data2.csv')
Unnamed: 60
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
[5 rows x 61 columns]
This is 2nd dataset
   ID  SegmentID  Roadway Name         From                To Direction  \
0   1      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
1   2      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
2   3      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
3   4      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
4   5      15540  BEACH STREET  UNION PLACE  VAN DUZER STREET        NB
----------------------------------------------------------------------
--------------------------------------------
----------------------------------------------------------------------
--------------------------------------------
^
SyntaxError: invalid syntax
print(data1.shape)
print(data2.shape)
(1048575, 61)
(42756, 31)
print(data1.info())
print(data2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Count_point_id 1048575 non-null int64
1 Direction_of_travel 1048575 non-null object
2 Year 1048575 non-null int64
3 Count_date 1048575 non-null object
4 hour 1048575 non-null int64
5 Region_id 1048575 non-null int64
6 Region_name 1048575 non-null object
7 Region_ons_code 1048575 non-null object
8 Local_authority_id 1048575 non-null int64
9 Local_authority_name 1048575 non-null object
10 Local_authority_code 1048575 non-null object
11 Road_name 1048575 non-null object
12 Road_category 1048575 non-null object
13 Road_type 1048575 non-null object
14 Start_junction_road_name 402756 non-null object
15 End_junction_road_name 402732 non-null object
16 Easting 1048575 non-null int64
17 Northing 1048575 non-null int64
18 Latitude 1048575 non-null float64
19 Longitude 1048575 non-null float64
20 Link_length_km 403584 non-null float64
21 Link_length_miles 403584 non-null float64
22 Pedal_cycles 1048575 non-null int64
23 Two_wheeled_motor_vehicles 1048575 non-null int64
24 Cars_and_taxis 1048575 non-null int64
25 Buses_and_coaches 1048575 non-null int64
26 LGVs 1048575 non-null int64
27 HGVs_2_rigid_axle 1048575 non-null int64
28 HGVs_3_rigid_axle 1048575 non-null int64
29 HGVs_4_or_more_rigid_axle 1048575 non-null int64
30 HGVs_3_or_4_articulated_axle 1048575 non-null int64
31 HGVs_5_articulated_axle 1048575 non-null int64
32 HGVs_6_articulated_axle 1048575 non-null int64
33 All_HGVs 1048575 non-null int64
34 All_motor_vehicles 1048575 non-null int64
35 Unnamed: 35 0 non-null float64
36 12:00-1:00 AM 15519 non-null float64
37 1:00-2:00AM 15519 non-null float64
38 2:00-3:00AM 15519 non-null float64
39 3:00-4:00AM 15519 non-null float64
40 4:00-5:00AM 15519 non-null float64
41 5:00-6:00AM 15519 non-null float64
42 6:00-7:00AM 15519 non-null float64
43 7:00-8:00AM 15519 non-null float64
44 8:00-9:00AM 15519 non-null float64
45 9:00-10:00AM 15519 non-null float64
46 10:00-11:00AM 15519 non-null float64
47 11:00-12:00PM 15519 non-null float64
48 12:00-1:00PM 15266 non-null float64
49 1:00-2:00PM 15266 non-null float64
50 2:00-3:00PM 15266 non-null float64
51 3:00-4:00PM 15266 non-null float64
52 4:00-5:00PM 15266 non-null float64
53 5:00-6:00PM 15266 non-null float64
54 6:00-7:00PM 15266 non-null float64
55 7:00-8:00PM 15266 non-null float64
56 8:00-9:00PM 15266 non-null float64
57 9:00-10:00PM 15266 non-null float64
58 10:00-11:00PM 15266 non-null float64
59 11:00-12:00AM 15266 non-null float64
60 Unnamed: 60 0 non-null float64
dtypes: float64(30), int64(20), object(11)
memory usage: 488.0+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42756 entries, 0 to 42755
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 42756 non-null int64
1 SegmentID 42756 non-null int64
2 Roadway Name 42756 non-null object
3 From 42756 non-null object
4 To 42756 non-null object
5 Direction 42756 non-null object
6 Date 42756 non-null object
7 12:00-1:00 AM 42752 non-null float64
8 1:00-2:00AM 42752 non-null float64
9 2:00-3:00AM 42752 non-null float64
10 3:00-4:00AM 42752 non-null float64
11 4:00-5:00AM 42752 non-null float64
12 5:00-6:00AM 42752 non-null float64
13 6:00-7:00AM 42752 non-null float64
14 7:00-8:00AM 42752 non-null float64
15 8:00-9:00AM 42752 non-null float64
16 9:00-10:00AM 42752 non-null float64
17 10:00-11:00AM 42753 non-null float64
18 11:00-12:00PM 42755 non-null float64
19 12:00-1:00PM 42503 non-null float64
20 1:00-2:00PM 42503 non-null float64
21 2:00-3:00PM 42503 non-null float64
22 3:00-4:00PM 42503 non-null float64
23 4:00-5:00PM 42503 non-null float64
24 5:00-6:00PM 42503 non-null float64
25 6:00-7:00PM 42503 non-null float64
26 7:00-8:00PM 42503 non-null float64
27 8:00-9:00PM 42503 non-null float64
28 9:00-10:00PM 42503 non-null float64
29 10:00-11:00PM 42503 non-null float64
30 11:00-12:00AM 42503 non-null float64
dtypes: float64(24), int64(2), object(5)
memory usage: 10.1+ MB
None
print(data1.describe())
print(data2.describe())
[8 rows x 50 columns]
                 ID     SegmentID  12:00-1:00 AM   1:00-2:00AM   2:00-3:00AM  \
count  42756.000000  4.275600e+04   42752.000000  42752.000000  42752.000000
mean     302.926841  4.988159e+05     251.448423    178.591504    135.280318
std      504.422798  1.875303e+06     407.435712    303.030296    242.091877
min        1.000000  2.020000e+02       0.000000      0.000000      0.000000
25%       95.000000  3.402700e+04      60.000000     38.000000     26.000000
50%      193.000000  7.534300e+04     118.000000     79.000000     56.000000
75%      299.000000  1.448810e+05     241.000000    171.000000    128.000000
max     3393.000000  9.017050e+06    4805.000000   4489.000000   4818.000000
11:00-12:00AM
count 42503.000000
mean 315.806014
std 493.563207
min 0.000000
25% 82.000000
50% 157.000000
75% 303.000000
max 5027.000000
[8 rows x 26 columns]
print(data1.columns)
print(data2.columns)
print(data1['Count_point_id'].unique())
print(data2['ID'].unique())
Data Preparation
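# Drop the junction-name columns, which are missing in most rows
data1.drop(columns=['Start_junction_road_name', 'End_junction_road_name'], inplace=True)
print(data1.columns)
# Impute the link lengths: mean for kilometres, median for miles
mean_link_length_km = data1['Link_length_km'].mean()
median_link_length_miles = data1['Link_length_miles'].median()
# Assign the result back; fillna(inplace=True) on a selected column may not update data1 in recent pandas
data1['Link_length_km'] = data1['Link_length_km'].fillna(mean_link_length_km)
data1['Link_length_miles'] = data1['Link_length_miles'].fillna(median_link_length_miles)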
## After handling the missing values, we check whether any remain.
## After the cleaning, there are no evident missing values in the columns handled above.
Since both datasets are relevant to a single problem statement, we can merge them.
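The merge step itself is not shown in the notebook. As a rough sketch only, assuming the two tables are placed side by side by row position (an assumption; `left` and `right` below are hypothetical toy stand-ins for data1 and data2), `pd.concat` with `axis=1` produces a frame in which the shorter table leaves NaN in its columns for the rows it does not cover:

```python
import pandas as pd

# Hypothetical stand-ins for data1 (longer) and data2 (shorter).
left = pd.DataFrame({"Count_point_id": [1, 2, 3], "hour": [7, 8, 9]})
right = pd.DataFrame({"12:00-1:00 AM": [42.0, 35.0]})

# Column-wise concatenation aligns on the row index; rows that exist
# only in the longer frame get NaN in the shorter frame's columns.
merged = pd.concat([left, right], axis=1)
print(merged.shape)
print(merged.isnull().sum())
```

If the two tables share a real key, `DataFrame.merge` on that key would be the more usual choice; no such key is visible in the column listings above.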
import pandas as pd
df = pd.read_csv('C:\\Users\\lariy\\OneDrive\\Documents\\Machine_Learning SEM4\\data1.csv')
print(df)
11:00-12:00AM Unnamed: 60
0 42.0 NaN
1 35.0 NaN
2 43.0 NaN
3 43.0 NaN
4 54.0 NaN
... ... ...
1048570 NaN NaN
1048571 NaN NaN
1048572 NaN NaN
1048573 NaN NaN
1048574 NaN NaN
df.columns
---Data Preprocessing
Data exploration
print(df.describe())
[8 rows x 50 columns]
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Count_point_id 1048575 non-null int64
1 Direction_of_travel 1048575 non-null object
2 Year 1048575 non-null int64
3 Count_date 1048575 non-null object
4 hour 1048575 non-null int64
5 Region_id 1048575 non-null int64
6 Region_name 1048575 non-null object
7 Region_ons_code 1048575 non-null object
8 Local_authority_id 1048575 non-null int64
9 Local_authority_name 1048575 non-null object
10 Local_authority_code 1048575 non-null object
11 Road_name 1048575 non-null object
12 Road_category 1048575 non-null object
13 Road_type 1048575 non-null object
14 Start_junction_road_name 402756 non-null object
15 End_junction_road_name 402732 non-null object
16 Easting 1048575 non-null int64
17 Northing 1048575 non-null int64
18 Latitude 1048575 non-null float64
19 Longitude 1048575 non-null float64
20 Link_length_km 403584 non-null float64
21 Link_length_miles 403584 non-null float64
22 Pedal_cycles 1048575 non-null int64
23 Two_wheeled_motor_vehicles 1048575 non-null int64
24 Cars_and_taxis 1048575 non-null int64
25 Buses_and_coaches 1048575 non-null int64
26 LGVs 1048575 non-null int64
27 HGVs_2_rigid_axle 1048575 non-null int64
28 HGVs_3_rigid_axle 1048575 non-null int64
29 HGVs_4_or_more_rigid_axle 1048575 non-null int64
30 HGVs_3_or_4_articulated_axle 1048575 non-null int64
31 HGVs_5_articulated_axle 1048575 non-null int64
32 HGVs_6_articulated_axle 1048575 non-null int64
33 All_HGVs 1048575 non-null int64
34 All_motor_vehicles 1048575 non-null int64
35 Unnamed: 35 0 non-null float64
36 12:00-1:00 AM 15519 non-null float64
37 1:00-2:00AM 15519 non-null float64
38 2:00-3:00AM 15519 non-null float64
39 3:00-4:00AM 15519 non-null float64
40 4:00-5:00AM 15519 non-null float64
41 5:00-6:00AM 15519 non-null float64
42 6:00-7:00AM 15519 non-null float64
43 7:00-8:00AM 15519 non-null float64
44 8:00-9:00AM 15519 non-null float64
45 9:00-10:00AM 15519 non-null float64
46 10:00-11:00AM 15519 non-null float64
47 11:00-12:00PM 15519 non-null float64
48 12:00-1:00PM 15266 non-null float64
49 1:00-2:00PM 15266 non-null float64
50 2:00-3:00PM 15266 non-null float64
51 3:00-4:00PM 15266 non-null float64
52 4:00-5:00PM 15266 non-null float64
53 5:00-6:00PM 15266 non-null float64
54 6:00-7:00PM 15266 non-null float64
55 7:00-8:00PM 15266 non-null float64
56 8:00-9:00PM 15266 non-null float64
57 9:00-10:00PM 15266 non-null float64
58 10:00-11:00PM 15266 non-null float64
59 11:00-12:00AM 15266 non-null float64
60 Unnamed: 60 0 non-null float64
dtypes: float64(30), int64(20), object(11)
memory usage: 488.0+ MB
None
--Missing Values
# Identify missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)
Missing values:
Count_point_id 0
Direction_of_travel 0
Year 0
Count_date 0
hour 0
...
8:00-9:00PM 1033309
9:00-10:00PM 1033309
10:00-11:00PM 1033309
11:00-12:00AM 1033309
Unnamed: 60 1048575
Length: 61, dtype: int64
plt.show()
# Get columns with any missing values
columns_with_missing_values = df.columns[df.isnull().any()]
-Feature Selection
columns_to_drop = ['Unnamed: 60','Unnamed: 35']
df.drop(columns=columns_to_drop, inplace=True)
columns_to_drop = ['Start_junction_road_name', 'End_junction_road_name']
df.drop(columns=columns_to_drop, inplace=True)
columns_with_missing_values = df.columns[df.isnull().any()]
C:\Users\lariy\AppData\Local\Temp\ipykernel_19624\1721878666.py:10: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
  df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='ffill')
C:\Users\lariy\AppData\Local\Temp\ipykernel_19624\1721878666.py:13: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
  df[columns_with_missing_values] = df[columns_with_missing_values].fillna(method='bfill')
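The warning above names its own replacement: call `ffill()` and `bfill()` directly instead of `fillna(method=...)`. A minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps at the start, middle and end.
toy = pd.DataFrame({
    "a": [np.nan, 1.0, np.nan, 3.0],
    "b": [5.0, np.nan, np.nan, np.nan],
})

cols = toy.columns[toy.isnull().any()]

# Forward fill first, then backward fill to cover any leading gaps.
toy[cols] = toy[cols].ffill().bfill()
print(toy)
```

Forward filling alone leaves a NaN wherever a column starts with a gap, which is why the backward fill follows it.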
columns_with_missing_values = df.columns[df.isnull().any()]
df.columns
coldrop1=['Local_authority_name', 'Local_authority_code']
df.drop(columns=coldrop1, inplace=True)
coldrop = ['Count_point_id', 'Direction_of_travel', 'Year', 'Count_date', 'Region_id',
           'Region_name', 'Region_ons_code', 'Local_authority_id']
df.drop(columns=coldrop, inplace=True)
coldrop1=['Local_authority_name', 'Local_authority_code']
df.drop(columns=coldrop1, inplace=True)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[23], line 4
2 df.drop(columns=coldrop, inplace=True)
3 coldrop1=['Local_authority_name', 'Local_authority_code']
----> 4 df.drop(columns=coldrop1, inplace=True)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\frame.py:5344, in DataFrame.drop(self, labels, axis,
index, columns, level, inplace, errors)
5196 def drop(
5197 self,
5198 labels: IndexLabel | None = None,
(...)
5205 errors: IgnoreRaise = "raise",
5206 ) -> DataFrame | None:
5207 """
5208 Drop specified labels from rows or columns.
5209
(...)
5342 weight 1.0 0.8
5343 """
-> 5344 return super().drop(
5345 labels=labels,
5346 axis=axis,
5347 index=index,
5348 columns=columns,
5349 level=level,
5350 inplace=inplace,
5351 errors=errors,
5352 )
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\generic.py:4711, in NDFrame.drop(self, labels, axis,
index, columns, level, inplace, errors)
4709 for axis, labels in axes.items():
4710 if labels is not None:
-> 4711 obj = obj._drop_axis(labels, axis, level=level,
errors=errors)
4713 if inplace:
4714 self._update_inplace(obj)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\generic.py:4753, in NDFrame._drop_axis(self, labels, axis,
level, errors, only_slice)
4751 new_axis = axis.drop(labels, level=level,
errors=errors)
4752 else:
-> 4753 new_axis = axis.drop(labels, errors=errors)
4754 indexer = axis.get_indexer(new_axis)
4756 # Case for non-unique axis
4757 else:
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
pandas\core\indexes\base.py:6992, in Index.drop(self, labels, errors)
6990 if mask.any():
6991 if errors != "ignore":
-> 6992 raise KeyError(f"{labels[mask].tolist()} not found in
axis")
6993 indexer = indexer[~mask]
6994 return self.delete(indexer)
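The KeyError above is raised because 'Local_authority_name' and 'Local_authority_code' had already been dropped by an earlier cell. One way to make such a cell safe to re-run, sketched here on a hypothetical toy frame, is to pass `errors="ignore"` to `DataFrame.drop`:

```python
import pandas as pd

# Hypothetical toy frame; only one of the two labels is present.
toy = pd.DataFrame({"Local_authority_name": ["x"], "Road_name": ["A1"], "hour": [7]})
coldrop1 = ["Local_authority_name", "Local_authority_code"]

# errors="ignore" skips labels that are not in the frame instead of raising.
toy = toy.drop(columns=coldrop1, errors="ignore")
toy = toy.drop(columns=coldrop1, errors="ignore")  # second call is a no-op
print(list(toy.columns))
```

The trade-off is that a misspelled column name is then silently ignored as well, so checking `df.columns` afterwards is still worthwhile.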
coldrop2=['Road_name','Road_category']
df.drop(columns=coldrop2, inplace=True)
df.columns
import pandas as pd
df.to_excel('data_revised.xlsx', index=False)
-Feature Engineering
Encoding
Since there is one categorical column, Road_type, with the values Major and Minor, we have to convert it into numerical form by creating dummies using One-Hot Encoding.
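import pandas as pd
# One-Hot Encoding: one indicator column per category
one_hot_encoded = pd.get_dummies(df['Road_type'], prefix='Road_type')
# Label Encoding: integer codes (Major -> 0, Minor -> 1)
label_encoded = df['Road_type'].astype('category').cat.codes
# Store the codes under the column name that appears in the printed output
df['Road_type_encoded'] = label_encoded
print(df['Road_type_encoded'])
print(df.head(5))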
0 0
1 0
2 0
3 0
4 0
..
1048570 1
1048571 1
1048572 1
1048573 1
1048574 1
Name: Road_type_encoded, Length: 1048575, dtype: int8
   hour Road_type  Easting  Northing   Latitude  Longitude  Link_length_km  \
0     7     Major   243900    635900  55.591636  -4.478606             4.6
1     8     Major   243900    635900  55.591636  -4.478606             4.6
2     9     Major   243900    635900  55.591636  -4.478606             4.6
3    10     Major   243900    635900  55.591636  -4.478606             4.6
4    11     Major   243900    635900  55.591636  -4.478606             4.6
[5 rows x 46 columns]
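The two encodings used in this step behave differently, which a small sketch on a hypothetical three-row frame makes visible. One-hot encoding yields one indicator column per category, while category codes yield a single integer column:

```python
import pandas as pd

# Hypothetical toy column with the two road types.
toy = pd.DataFrame({"Road_type": ["Major", "Minor", "Major"]})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["Road_type"], prefix="Road_type")

# Label encoding: a single integer column. Categories are sorted,
# so Major maps to 0 and Minor to 1.
codes = toy["Road_type"].astype("category").cat.codes

print(list(one_hot.columns))
print(codes.tolist())
```

For a two-valued column the single code column carries the same information as the pair of dummies, which fits the int8 series printed above.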
The encoded values are stored in 'Road_type_encoded', hence we can drop the original Road_type column, as it adds no further value to the data.
df.drop(columns=['Road_type'], inplace=True)
df.columns
Adding a feature
# Define the columns to be used for calculating traffic_rate
columns_to_use = [
    'hour', 'Easting', 'Northing', 'Latitude', 'Longitude',
    'Link_length_km', 'Link_length_miles', 'Pedal_cycles',
    'Two_wheeled_motor_vehicles', 'Cars_and_taxis', 'Buses_and_coaches',
    'LGVs', 'HGVs_2_rigid_axle', 'HGVs_3_rigid_axle',
    'HGVs_4_or_more_rigid_axle', 'HGVs_3_or_4_articulated_axle',
    'HGVs_5_articulated_axle', 'HGVs_6_articulated_axle', 'All_HGVs',
    'All_motor_vehicles',
    '12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM',
    '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM',
    '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM',
    '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM',
    '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM',
    '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM',
    'Road_type_encoded',
]
         Link_length_miles  Pedal_cycles  Two_wheeled_motor_vehicles  \
0                 2.860000             0                           2
1                 2.860000             0                           5
2                 2.860000             0                           2
3                 2.860000             0                           1
4                 2.860000             0                           2
...                    ...           ...                         ...
1048570           1.922684             0                           1
1048571           1.922684             1                           2
1048572           1.922684             1                           1
1048573           1.922684             1                           3
1048574           1.922684             0                           2

         Road_type_encoded  traffic_rate
0                        0     59.450027
1                        0     59.415506
2                        0     59.484287
3                        0     59.482806
4                        0     59.443662
...                    ...           ...
1048570                  1     60.137671
1048571                  1     60.144764
1048572                  1     60.149633
1048573                  1     60.147685
1048574                  1     60.123348
df.drop(columns=['Road_type'], inplace=True, errors='ignore')  # Road_type was already dropped earlier; ignore if missing
df.columns
df.dtypes
hour int64
Easting int64
Northing int64
Latitude float64
Longitude float64
Link_length_km float64
Link_length_miles float64
Pedal_cycles int64
Two_wheeled_motor_vehicles int64
Cars_and_taxis int64
Buses_and_coaches int64
LGVs int64
HGVs_2_rigid_axle int64
HGVs_3_rigid_axle int64
HGVs_4_or_more_rigid_axle int64
HGVs_3_or_4_articulated_axle int64
HGVs_5_articulated_axle int64
HGVs_6_articulated_axle int64
All_HGVs int64
All_motor_vehicles int64
12:00-1:00 AM float64
1:00-2:00AM float64
2:00-3:00AM float64
3:00-4:00AM float64
4:00-5:00AM float64
5:00-6:00AM float64
6:00-7:00AM float64
7:00-8:00AM float64
8:00-9:00AM float64
9:00-10:00AM float64
10:00-11:00AM float64
11:00-12:00PM float64
12:00-1:00PM float64
1:00-2:00PM float64
2:00-3:00PM float64
3:00-4:00PM float64
4:00-5:00PM float64
5:00-6:00PM float64
6:00-7:00PM float64
7:00-8:00PM float64
8:00-9:00PM float64
9:00-10:00PM float64
10:00-11:00PM float64
11:00-12:00AM float64
Road_type_encoded int8
traffic_rate float64
dtype: object
-Outliers detection
import pandas as pd
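def count_outliers(column, threshold=1.5):
    """Count values below and above the IQR fences of a numeric column."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    # threshold is the IQR multiplier; 1.5 is the conventional value (assumed default)
    lower_bound = q1 - threshold * iqr
    upper_bound = q3 + threshold * iqr
    outliers_lower = column[column < lower_bound]
    outliers_upper = column[column > upper_bound]
    return len(outliers_lower), len(outliers_upper)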
lower_outliers upper_outliers
hour 0 0
Easting 2208 0
Northing 0 16830
Latitude 0 16518
Longitude 2160 0
Link_length_km 235872 110328
Link_length_miles 235872 110328
Pedal_cycles 0 110674
Two_wheeled_motor_vehicles 0 94091
Cars_and_taxis 0 96771
Buses_and_coaches 0 87315
LGVs 0 109842
HGVs_2_rigid_axle 0 127022
HGVs_3_rigid_axle 0 130470
HGVs_4_or_more_rigid_axle 0 143570
HGVs_3_or_4_articulated_axle 0 155028
HGVs_5_articulated_axle 0 169182
HGVs_6_articulated_axle 0 170369
All_HGVs 0 149406
All_motor_vehicles 0 100972
traffic_rate 0 705
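The cell that produced the table above is not shown. A minimal sketch of how such a per-column summary could be built, assuming the IQR rule with a multiplier of 1.5 (an assumption) and a hypothetical toy frame:

```python
import pandas as pd

def count_outliers(column, threshold=1.5):
    """Return (count below lower fence, count above upper fence)."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return int((column < lower).sum()), int((column > upper).sum())

# Hypothetical toy data: "x" has one large value, "y" has none.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [10, 11, 12, 13, 14]})

# One row per numeric column, two count columns.
summary = pd.DataFrame(
    {col: list(count_outliers(toy[col])) for col in toy.select_dtypes("number").columns},
    index=["lower_outliers", "upper_outliers"],
).T
print(summary)
```

Count columns with many zeros tend to have a small IQR, so the upper fence sits low and many rows are flagged, which may explain the large upper counts for the vehicle columns.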
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#removing outliers
import numpy as np
from scipy import stats
As we can observe, the original DataFrame had 1,048,575 rows and 46 columns. After removing
outliers using the Z-score method, the data contains 759,978 rows and 46 columns, so 288,597
rows (about 27.5%) were identified as containing outliers and removed. Which rows are excluded
depends on the chosen outlier detection method and threshold.
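The removal cell is not shown, so the following is only a sketch of Z-score filtering under stated assumptions: a cut-off of 3 (assumed; the notebook does not state its threshold) and a hypothetical toy frame. A row is kept only if every column lies within the cut-off:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy.loc[0, "a"] = 50.0  # plant one obvious outlier

threshold = 3  # assumed cut-off

# Column-wise z-scores using the population standard deviation.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep rows whose absolute z-score is below the cut-off in every column.
cleaned = toy[(z.abs() < threshold).all(axis=1)]
print(toy.shape, "->", cleaned.shape)
```

`scipy.stats.zscore`, which the imports above suggest was used, computes the same quantity with its default `ddof=0`.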
-Scaling Transformation
from sklearn.preprocessing import MinMaxScaler
print(scaled_data)
Link_length_km Link_length_miles Pedal_cycles \
0 0.097403 0.097527 0.000000
1 0.097403 0.097527 0.000000
2 0.097403 0.097527 0.000000
3 0.097403 0.097527 0.000000
4 0.097403 0.097527 0.000000
... ... ... ...
1048570 0.064820 0.064879 0.000000
1048571 0.064820 0.064879 0.000453
1048572 0.064820 0.064879 0.000453
1048573 0.064820 0.064879 0.000453
1048574 0.064820 0.064879 0.000000
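The scaling cell is missing apart from its import and print. A minimal sketch of how `MinMaxScaler` could produce a frame like `scaled_data`, using hypothetical toy values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values; column names borrowed from the dataset.
toy = pd.DataFrame({"Link_length_km": [1.0, 2.0, 5.0], "Pedal_cycles": [0, 10, 20]})

scaler = MinMaxScaler()  # default feature_range is (0, 1)

# fit_transform returns a NumPy array, so wrap it to keep column names and index.
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)
print(scaled)
```

Each column is mapped through (x - min) / (max - min), so a single extreme value compresses the rest of the column toward 0, as in the very small Pedal_cycles values above.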
---Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[54], line 8
6 for i, col in enumerate(df.columns):
7 if df[col].dtype in ['int64', 'float64']:
----> 8 plt.subplot(4, 3, i + 1)
9 sns.histplot(df[col], kde=True)
10 plt.title(f'Histogram of {col}')
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)
1422 fig = gcf()
1424 # First, search for an existing subplot with a matching spec.
-> 1425 key = SubplotSpec._from_subplot_args(fig, args)
1427 for ax in fig.axes:
1428 # If we found an Axes at the position, we can re-use it if
the user passed no
1429 # kwargs or if the axes class and kwargs are identical.
1430 if (ax.get_subplotspec() == key
1431 and (kwargs == {}
1432 or (ax._projection_init
1433 ==
fig._process_projection_requirements(**kwargs)))):
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\
matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure,
args)
597 else:
598 if not isinstance(num, Integral) or num < 1 or num >
rows*cols:
--> 599 raise ValueError(
600 f"num must be an integer with 1 <= num <=
{rows*cols}, "
601 f"not {num!r}"
602 )
603 i = j = num
604 return gs[i-1:j]
ValueError: num must be an integer with 1 <= num <= 12, not 13
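The ValueError above arises because the 4 x 3 grid has only 12 slots while the loop index runs over every column of df. One possible fix, sketched on a hypothetical 14-column frame, is to size the grid from the number of numeric columns. Plain matplotlib histograms stand in here for `sns.histplot`, and the `Agg` backend is chosen only so the sketch runs without a display:

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 14)), columns=[f"c{i}" for i in range(14)])

numeric_cols = toy.select_dtypes("number").columns
n_cols = 3
n_rows = math.ceil(len(numeric_cols) / n_cols)  # enough rows for every column

fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows))
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.hist(toy[col], bins=20)
    ax.set_title(f"Histogram of {col}")

# Hide any unused slots in the last row.
for ax in axes.ravel()[len(numeric_cols):]:
    ax.set_visible(False)
fig.tight_layout()
```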
-Model Building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
LinearRegression()
# Make predictions
y_pred = model.predict(X_test)
# R-squared
r2 = r2_score(y_test, y_pred)
# R-squared to percentage
lin_reg = r2 * 100
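The cells that define X, y and the train/test split are missing above. As a self-contained sketch of the full workflow on synthetic data (the features, target, split ratio and random seed are all assumptions, not the notebook's values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Hold out 20% of the rows for evaluation (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}, R2: {r2 * 100:.2f}%")
```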
-Model Tuning
For the linear regression model, we can go ahead with Lasso or Ridge Regression.
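The tuning cell is not shown. A minimal sketch of a Ridge search with `GridSearchCV`, assuming a small grid of alpha values, 5-fold cross-validation and synthetic data (all assumptions; `r2_grid` mirrors the name printed in the results):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {"alpha": [0.1, 0.5, 1.0, 5.0, 10.0]}  # assumed grid

# Cross-validated search over the regularisation strength.
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
r2_grid = best_model.score(X_test, y_test)  # R-squared on the held-out rows
print(grid.best_params_, f"R2: {r2_grid * 100:.2f}%")
```

An ill-conditioned matrix warning from Ridge usually points to strongly correlated features. In this dataset, Link_length_km and Link_length_miles carry the same information, which is one likely contributor; dropping one of them may help.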
# Print results
print("Mean Squared Error (MSE) - Best Model: {:.2f}%".format(mse_percentage))
print("Root Mean Squared Error (RMSE) - Best Model: {:.2f}%".format(rmse_percentage))
print("R-squared (R2) - Best Model: {:.2f}%".format(r2_grid * 100))
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-conditioned matrix (rcond=4.51559e-18): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
model.fit(X_train, y_train)
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_ridge.py:204: LinAlgWarning: Ill-conditioned matrix (rcond=1.80398e-17): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
Ridge(alpha=0.5)
import pandas as pd
# Make predictions
y_pred = model.predict(X_test)
Actual Predicted
781974 0.269387 0.276978
937737 0.385366 0.418029
907828 0.294860 0.343721
784628 0.255908 0.281116
662460 0.260028 0.274427
... ... ...
673443 0.643551 0.537741
656736 0.369448 0.429966
858501 0.437590 0.451252
617079 0.382499 0.283711
487559 0.297439 0.363648
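The cell that built the Actual/Predicted table is not shown. A comparison frame of this shape could be assembled as follows; the values and index labels are hypothetical stand-ins for `y_test` and `y_pred`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the held-out targets and the predictions.
y_test = pd.Series([0.25, 0.50, 0.75], index=[10, 20, 30])
y_pred = np.array([0.30, 0.45, 0.70])

# Keep the original row labels so each prediction can be traced back.
comparison = pd.DataFrame({"Actual": y_test, "Predicted": y_pred}, index=y_test.index)
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]
print(comparison)
```

The extra Error column is an addition for illustration; it makes over- and under-prediction visible row by row.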