Quikr Car Price Prediction Using Linear Regression 1717999953
1. Importing Libraries
2. Exploratory Data Analysis
3. Basic Data Cleaning
4. Data Visualization
5. Model Building using Linear Regression
✏Importing Libraries :-
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
✏Loading Dataset :-
df = pd.read_csv("quikr_car.csv")
backup = df.copy()  # independent copy, so later edits to df do not touch the backup
df
     name                                    company    year  Price          kms_driven  fuel_type
0    Hyundai Santro Xing XO eRLX Euro III    Hyundai    2007  80,000         45,000 kms  Petrol
2    Maruti Suzuki Alto 800 Vxi              Maruti     2018  Ask For Price  22,000 kms  Petrol
3    Hyundai Grand i10 Magna 1.2 Kappa VTVT  Hyundai    2014  3,25,000       28,000 kms  Petrol
4    Ford EcoSport Titanium 1.5L TDCi        Ford       2014  5,75,000       36,000 kms  Diesel
888  Tata Zest XM Diesel                     Tata       2018  2,60,000       27,000 kms  Diesel
890  Honda Amaze 1.2 E i VTEC                Honda      2014  1,80,000       Petrol      NaN
891  Chevrolet Sail 1.2 LT ABS               Chevrolet  2014  1,60,000       Petrol      NaN
df.head()
     name                                    company    year  Price          kms_driven  fuel_type
0    Hyundai Santro Xing XO eRLX Euro III    Hyundai    2007  80,000         45,000 kms  Petrol
2    Maruti Suzuki Alto 800 Vxi              Maruti     2018  Ask For Price  22,000 kms  Petrol
3    Hyundai Grand i10 Magna 1.2 Kappa VTVT  Hyundai    2014  3,25,000       28,000 kms  Petrol
4    Ford EcoSport Titanium 1.5L TDCi        Ford       2014  5,75,000       36,000 kms  Diesel
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 892 non-null object
1 company 892 non-null object
2 year 892 non-null object
3 Price 892 non-null object
4 kms_driven 840 non-null object
5 fuel_type 837 non-null object
dtypes: object(6)
memory usage: 41.9+ KB
df.describe()
top Honda City Maruti 2015 Ask For Price 45,000 kms Petrol
Clean Price
df[df['Price'].str.isnumeric() != True]['Price'].describe()
count 35
unique 1
top Ask For Price
freq 35
Name: Price, dtype: object
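The cell that converts Price to a numeric column is not part of this export, but the outputs that follow show 35 nulls and a float64 dtype. A minimal sketch of one way to get there, assuming the raw values are comma-separated digit strings; the sample frame and the helper name clean_price are made up for illustration:

```python
import pandas as pd

# Hypothetical sample mimicking the raw Price column
sample = pd.DataFrame({"Price": ["80,000", "Ask For Price", "3,25,000"]})

def clean_price(prices: pd.Series) -> pd.Series:
    """Strip thousands separators and coerce non-numeric entries to NaN."""
    no_commas = prices.str.replace(",", "", regex=False)
    return pd.to_numeric(no_commas, errors="coerce")

sample["Price"] = clean_price(sample["Price"])
print(sample["Price"].isnull().sum())  # number of entries that were not numeric
```

With errors="coerce", entries such as "Ask For Price" become NaN instead of raising, which is why the null count can be checked right afterwards.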
df.Price.isnull().sum()
35
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 892 non-null object
1 company 892 non-null object
2 year 892 non-null object
3 Price 857 non-null float64
4 kms_driven 840 non-null object
5 fuel_type 837 non-null object
dtypes: float64(1), object(5)
memory usage: 41.9+ KB
df.head()
     name                                    company    year  Price     kms_driven  fuel_type
0    Hyundai Santro Xing XO eRLX Euro III    Hyundai    2007  80000.0   45,000 kms  Petrol
2    Maruti Suzuki Alto 800 Vxi              Maruti     2018  NaN       22,000 kms  Petrol
3    Hyundai Grand i10 Magna 1.2 Kappa VTVT  Hyundai    2014  325000.0  28,000 kms  Petrol
4    Ford EcoSport Titanium 1.5L TDCi        Ford       2014  575000.0  36,000 kms  Diesel
Skip filling the NaN values in Price for now and create a new column, age, from year
df.year.describe()
count 842
unique 21
top 2015
freq 117
Name: year, dtype: object
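from datetime import datetime
current_year = datetime.now().year
df['age'] = current_year - pd.to_numeric(df['year'], errors='coerce')  # non-numeric years become NaN
df = df.drop(columns='year')  # year no longer appears in the df.info() output that follows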
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 892 non-null object
1 company 892 non-null object
2 Price 857 non-null float64
3 kms_driven 840 non-null object
4 fuel_type 837 non-null object
5 age 842 non-null float64
dtypes: float64(2), object(4)
memory usage: 41.9+ KB
df.Price.isnull().sum()
35
Clean kms_driven
# Cap Price outliers at median + 3.5 standard deviations (NaN values stay NaN)
df_median = df.Price.median()
df_std = df.Price.std()
df['Price'] = df.Price.clip(upper=df_median + 3.5*df_std)
df.kms_driven
0 45,000 kms
1 40 kms
2 22,000 kms
3 28,000 kms
4 36,000 kms
...
887 NaN
888 27,000 kms
889 40,000 kms
890 Petrol
891 Petrol
Name: kms_driven, Length: 892, dtype: object
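The raw column mixes values such as "45,000 kms" with misplaced fuel types ("Petrol") and NaN, while later outputs show plain floats. The cleaning cell itself is missing from this export; the following is only a sketch of one possible approach, with a made-up sample and a hypothetical helper clean_kms:

```python
import pandas as pd

# Hypothetical sample mimicking the raw kms_driven column, including a
# misplaced fuel type and a missing value
raw_kms = pd.Series(["45,000 kms", "40 kms", "Petrol", None])

def clean_kms(kms: pd.Series) -> pd.Series:
    """Keep the leading token, drop commas, coerce everything else to NaN."""
    first_token = kms.str.split().str[0]
    digits = first_token.str.replace(",", "", regex=False)
    return pd.to_numeric(digits, errors="coerce")

cleaned_kms = clean_kms(raw_kms)
print(cleaned_kms.tolist())
```

Coercing instead of dropping keeps the row count unchanged, so the misplaced entries can be imputed later along with the genuine NaN values.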
df.describe(include='object')
df.head()
     name                                    company    Price     kms_driven  fuel_type  age
0    Hyundai Santro Xing XO eRLX Euro III    Hyundai    80000.0   45000.0     Petrol     17.0
2    Maruti Suzuki Alto 800 Vxi              Maruti     NaN       22000.0     Petrol     6.0
3    Hyundai Grand i10 Magna 1.2 Kappa VTVT  Hyundai    325000.0  28000.0     Petrol     10.0
4    Ford EcoSport Titanium 1.5L TDCi        Ford       575000.0  36000.0     Diesel     10.0
Clean fuel_type
df.fuel_type.value_counts()
Petrol 440
Diesel 395
LPG 2
Name: fuel_type, dtype: int64
df.fuel_type.isnull().sum()
55
Clean company
df = df.drop(df.loc[df['company'].str.isnumeric()].index)
df.company = df.company.str.lower()
Drop rows with NaN values in Price
df.Price.isnull().sum()
33
df = df.dropna(subset = ['Price'],axis='rows')
df.kms_driven.describe()
count 817.000000
mean 46250.714810
std 34283.745254
min 0.000000
25% 27000.000000
50% 41000.000000
75% 56758.000000
max 400000.000000
Name: kms_driven, dtype: float64
kms_mean = df.kms_driven.mean()
kms_std = df.kms_driven.std()
print(kms_mean)
print(kms_std)
46250.71481028152
34283.745253881294
groupby_age_mean = df.groupby('age')['kms_driven'].transform('mean')
groupby_age_mean
0 60605.201221
1 41211.318182
3 41990.974417
4 41990.974417
6 50464.677643
...
887 NaN
888 20307.966667
889 40963.893617
890 41990.974417
891 41990.974417
Name: kms_driven, Length: 856, dtype: float64
df.kms_driven
0 45000.0
1 40.0
3 28000.0
4 36000.0
6 41000.0
...
886 132000.0
888 27000.0
889 40000.0
890 NaN
891 NaN
Name: kms_driven, Length: 819, dtype: float64
df['kms_driven'] = df['kms_driven'].fillna(groupby_age_mean)  # assign back; inplace on an attribute access can be a silent no-op
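The fill works because transform('mean') returns one value per row, aligned with the original index, so each missing kms_driven picks up the mean of cars with the same age. A small illustration on made-up rows (the frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two cars aged 10 (one with unknown kms) and one aged 6
toy = pd.DataFrame({
    "age": [10.0, 10.0, 6.0],
    "kms_driven": [40000.0, np.nan, 22000.0],
})

# transform('mean') returns a Series with the same index as toy
age_mean = toy.groupby("age")["kms_driven"].transform("mean")

# Assign the result back instead of relying on inplace=True
toy["kms_driven"] = toy["kms_driven"].fillna(age_mean)
print(toy["kms_driven"].tolist())
```

A group whose kms_driven values are all missing has a NaN mean, so rows in such a group would stay NaN and need a separate fallback.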
df.fuel_type.value_counts()
Petrol 428
Diesel 386
LPG 2
Name: fuel_type, dtype: int64
fuel_type = df.fuel_type.mode()[0]
fuel_type
'Petrol'
df['fuel_type'] = df['fuel_type'].fillna(fuel_type)  # fill missing fuel types with the mode
     index  name                  company    Price     kms_driven     fuel_type  age
814  886    Toyota Corolla Altis  toyota     300000.0  132000.000000  Petrol     15.0
817  890    Honda Amaze 1.2       honda      180000.0  41990.974417   Petrol     10.0
818  891    Chevrolet Sail 1.2    chevrolet  160000.0  41990.974417   Petrol     10.0
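The rows above carry an extra index column and model names shortened to their first three words (for example "Honda Amaze 1.2" instead of "Honda Amaze 1.2 E i VTEC"). The cells that did this are not in the export; the sketch below shows one way it could be done, on made-up rows, and is an assumption rather than the notebook's actual code:

```python
import pandas as pd

# Hypothetical rows whose index has gaps left by dropped rows
cars = pd.DataFrame(
    {"name": ["Honda Amaze 1.2 E i VTEC", "Tata Zest XM Diesel"]},
    index=[890, 888],
)

# Keep only the first three words of each name
cars["name"] = cars["name"].str.split().str[:3].str.join(" ")

# reset_index() without drop=True keeps the old labels in an 'index' column
cars = cars.reset_index()
print(cars)
```

Shortening the names reduces the number of distinct categories the one-hot encoder has to handle later.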
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 819 entries, 0 to 818
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 819 non-null int64
1 name 819 non-null object
2 company 819 non-null object
3 Price 819 non-null float64
4 kms_driven 819 non-null float64
5 fuel_type 819 non-null object
6 age 819 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 44.9+ KB
df.shape
(819, 7)
Cleaned Data
df
     index  name                  company    Price     kms_driven     fuel_type  age
814  886    Toyota Corolla Altis  toyota     300000.0  132000.000000  Petrol     15.0
817  890    Honda Amaze 1.2       honda      180000.0  41990.974417   Petrol     10.0
818  891    Chevrolet Sail 1.2    chevrolet  160000.0  41990.974417   Petrol     10.0
df.to_csv("cleaned_data.csv")
✏Data Visualization :-
df.head()
plt.figure(figsize = (12,8))
sns.scatterplot( x = 'Price' , y = 'age' , data = df, hue = 'fuel_type')
<AxesSubplot:xlabel='Price', ylabel='age'>
plt.subplots(figsize=(20,8))
ax=sns.boxplot(x='company',y='Price',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()
plt.figure(figsize=(20,8))
ax=sns.boxplot(x='age',y='Price',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()
plt.figure(figsize=(20,8))
ax=sns.boxplot(x='fuel_type',y='Price',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()
plt.figure(figsize = (12,8))
ax = sns.histplot(
data=df,
x="company", hue="fuel_type",
multiple="fill", stat="proportion",
discrete=True, shrink=.8
)
plt.xticks(rotation = 75)
plt.title('Company vs Fuel Type')
plt.xlabel('Company')
plt.ylabel('Proportion')
df.head()
<AxesSubplot:xlabel='Price', ylabel='kms_driven'>
plt.figure(figsize=(20,8))
ax=sns.boxplot(x='age',y='kms_driven',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()
✏Data Split :-
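from sklearn.preprocessing import OneHotEncoder

X=df[['name','company','age','kms_driven','fuel_type']]
y= df['Price']
# Fit the encoder on the full X so it knows every category, including ones that land only in the test split
ohe = OneHotEncoder()
ohe.fit(X[['name','company','fuel_type']])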
OneHotEncoder()
✏Model Building:-
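from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# pipe, Xtrain and ytrain are defined in cells that are not part of this export
pipe.fit(Xtrain,ytrain)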
Pipeline(steps=[('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('onehotencoder',
OneHotEncoder(categories=[array(['Audi A3 Cabriolet', 'Audi A4
1.8', 'Audi A4 2.0', 'Audi A6 2.0',
'Audi A8', 'Audi Q3 2.0', 'Audi Q5 2.0', 'Audi Q7', 'BMW 3 Series',
'BMW 5 Series', 'BMW 7 Series', 'BMW X1', 'BMW X1 sDrive20d',
'BMW X1 xDrive20d', 'Chevrolet Beat', 'Chevrolet Beat...
array(['audi', 'bmw', 'chevrolet', 'd
atsun', 'fiat', 'force', 'ford',
'hindustan', 'honda', 'hyundai', 'jaguar', 'jeep', 'land',
'mahindra', 'maruti', 'mercedes', 'mini', 'mitsubishi', 'nissan',
'renault', 'skoda', 'tata', 'toyota', 'volkswagen', 'volvo'],
dtype=object),
array(['Diesel', 'LPG', 'Petrol'], dt
ype=object)]),
['name', 'company',
'fuel_type'])])),
('linearregression', LinearRegression())])
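The step names in the pipeline printed above (columntransformer, onehotencoder, linearregression) are the ones scikit-learn's make_column_transformer and make_pipeline generate automatically. The cells that split the data and build the pipeline are missing from this export, so the following is only a rough sketch of how such a pipeline could be assembled. The toy frame, the variable names, and the test_size and random_state values are all assumptions, not values taken from the notebook:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame with the same columns as the cleaned data
toy = pd.DataFrame({
    "name": ["Honda Amaze 1.2", "Tata Zest XM", "Honda City 1.5", "Tata Zest XM",
             "Honda Amaze 1.2", "Honda City 1.5", "Tata Zest XM", "Honda City 1.5"],
    "company": ["honda", "tata", "honda", "tata", "honda", "honda", "tata", "honda"],
    "age": [10.0, 6.0, 9.0, 7.0, 8.0, 5.0, 4.0, 12.0],
    "kms_driven": [42000.0, 27000.0, 55000.0, 30000.0, 38000.0, 20000.0, 15000.0, 90000.0],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel", "Petrol", "Diesel", "Diesel", "Petrol"],
    "Price": [180000.0, 260000.0, 400000.0, 240000.0, 220000.0, 650000.0, 300000.0, 150000.0],
})
cat_cols = ["name", "company", "fuel_type"]
X_toy = toy.drop(columns="Price")
y_toy = toy["Price"]

# Fit the encoder on all rows so every category is known before splitting
ohe_toy = OneHotEncoder().fit(X_toy[cat_cols])

# test_size and random_state are assumed values, not taken from the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# One-hot encode the categorical columns; pass age and kms_driven through unchanged
column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe_toy.categories_), cat_cols),
    remainder="passthrough",
)
pipe_toy = make_pipeline(column_trans, LinearRegression())
pipe_toy.fit(X_tr, y_tr)
print(r2_score(y_te, pipe_toy.predict(X_te)))
```

Passing the pre-fitted categories to the encoder inside the pipeline avoids errors when a car name appears only in the test split. The score on such a tiny toy frame is not meaningful and will not match the value reported for the real data.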
from sklearn.metrics import r2_score
ypred = pipe.predict(Xtest)
r2_score(ytest,ypred)
0.834527509712749
Exciting Milestone: Successfully trained my first machine learning model, marking the beginning of my journey into data science and
predictive analytics. Looking forward to exploring more complex algorithms and applications!