Quikr Car Price Prediction Using Linear Regression


✏Contents of notebook :-

1. Importing Libraries
2. Exploratory Data Analysis
3. Basic Data Cleaning
4. Data Visualization
5. Model Building using Linear Regression

✏Importing Libraries :-

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

✏Loading Dataset :-

df = pd.read_csv("quikr_car.csv")
backup = df.copy()  # independent copy; `df = backup = ...` binds both names to the same object
df

     name                                    company    year  Price          kms_driven  fuel_type
0    Hyundai Santro Xing XO eRLX Euro III    Hyundai    2007  80,000         45,000 kms  Petrol
1    Mahindra Jeep CL550 MDI                 Mahindra   2006  4,25,000       40 kms      Diesel
2    Maruti Suzuki Alto 800 Vxi              Maruti     2018  Ask For Price  22,000 kms  Petrol
3    Hyundai Grand i10 Magna 1.2 Kappa VTVT  Hyundai    2014  3,25,000       28,000 kms  Petrol
4    Ford EcoSport Titanium 1.5L TDCi        Ford       2014  5,75,000       36,000 kms  Diesel
...  ...                                     ...        ...   ...            ...         ...
887  Ta                                      Tara       zest  3,10,000       NaN         NaN
888  Tata Zest XM Diesel                     Tata       2018  2,60,000       27,000 kms  Diesel
889  Mahindra Quanto C8                      Mahindra   2013  3,90,000       40,000 kms  Diesel
890  Honda Amaze 1.2 E i VTEC                Honda      2014  1,80,000       Petrol      NaN
891  Chevrolet Sail 1.2 LT ABS               Chevrolet  2014  1,60,000       Petrol      NaN

892 rows × 6 columns

df.head()

   name                                    company   year  Price          kms_driven  fuel_type
0  Hyundai Santro Xing XO eRLX Euro III    Hyundai   2007  80,000         45,000 kms  Petrol
1  Mahindra Jeep CL550 MDI                 Mahindra  2006  4,25,000       40 kms      Diesel
2  Maruti Suzuki Alto 800 Vxi              Maruti    2018  Ask For Price  22,000 kms  Petrol
3  Hyundai Grand i10 Magna 1.2 Kappa VTVT  Hyundai   2014  3,25,000       28,000 kms  Petrol
4  Ford EcoSport Titanium 1.5L TDCi        Ford      2014  5,75,000       36,000 kms  Diesel

✏Exploratory Data Analysis (EDA) And Basic Data Cleaning :-

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 892 non-null object
1 company 892 non-null object
2 year 892 non-null object
3 Price 892 non-null object
4 kms_driven 840 non-null object
5 fuel_type 837 non-null object
dtypes: object(6)
memory usage: 41.9+ KB

df.describe()

        name        company  year  Price          kms_driven  fuel_type
count   892         892      892   892            840         837
unique  525         48       61    274            258         3
top     Honda City  Maruti   2015  Ask For Price  45,000 kms  Petrol
freq    13          235      117   35             30          440

Clean Price

df['Price'] = df['Price'].str.replace(",", "")

df[df['Price'].str.isnumeric() != True]['Price'].describe()

count 35
unique 1
top Ask For Price
freq 35
Name: Price, dtype: object

df['Price'] = df['Price'].replace("Ask For Price", np.nan)

df.Price.isnull().sum()

35

df['Price'] = pd.to_numeric(df['Price'], errors = 'coerce')
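The Price cleaning steps above can be condensed, since `errors='coerce'` already turns any non-numeric string into NaN. A minimal sketch on a small illustrative sample in the same format as the Price column (the sample Series and the name `clean_price` are assumptions, not part of the notebook):

```python
import pandas as pd

# Illustrative sample in the same format as the Price column above
price = pd.Series(["80,000", "4,25,000", "Ask For Price", "3,25,000"])

# Strip the thousands separators, then coerce anything non-numeric to NaN
clean_price = pd.to_numeric(
    price.str.replace(",", "", regex=False),
    errors="coerce",
)

print(clean_price)
print("missing prices:", int(clean_price.isna().sum()))
```

With this approach the explicit replacement of "Ask For Price" becomes optional, although keeping it documents the intent.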

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 892 non-null object
1 company 892 non-null object
2 year 892 non-null object
3 Price 857 non-null float64
4 kms_driven 840 non-null object
5 fuel_type 837 non-null object
dtypes: float64(1), object(5)
memory usage: 41.9+ KB

df.head()

   name                                    company   year  Price     kms_driven  fuel_type
0  Hyundai Santro Xing XO eRLX Euro III    Hyundai   2007  80000.0   45,000 kms  Petrol
1  Mahindra Jeep CL550 MDI                 Mahindra  2006  425000.0  40 kms      Diesel
2  Maruti Suzuki Alto 800 Vxi              Maruti    2018  NaN       22,000 kms  Petrol
3  Hyundai Grand i10 Magna 1.2 Kappa VTVT  Hyundai   2014  325000.0  28,000 kms  Petrol
4  Ford EcoSport Titanium 1.5L TDCi        Ford      2014  575000.0  36,000 kms  Diesel

Skip filling NaN values in Price for now and create a new column, age, from year

df.year = df.year.apply(lambda x : x if x.replace('.','').replace(',','').isdigit() else np.nan)

df.year.describe()

count 842
unique 21
top 2015
freq 117
Name: year, dtype: object

from datetime import datetime

df['year'] = pd.to_numeric(df['year'] , errors = 'coerce')

current_year = datetime.now().year
df['age'] = current_year - df['year']

df.drop('year', axis= 'columns',inplace = True)
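Because `datetime.now().year` changes over time, rerunning the notebook in a later year shifts every value in the age column. A minimal sketch with a fixed reference year, so the result is reproducible (`REFERENCE_YEAR` and the sample Series are assumptions; 2024 is chosen because it matches the ages shown in this notebook's output, e.g. 17 for a 2007 car):

```python
import pandas as pd

REFERENCE_YEAR = 2024  # assumed fixed value instead of datetime.now().year

# Illustrative sample: valid years plus one stray string, as in the raw data
year = pd.Series(["2007", "2006", "zest", "2018"])

# Non-numeric entries become NaN, so their age is NaN as well
year_num = pd.to_numeric(year, errors="coerce")
age = REFERENCE_YEAR - year_num

print(age)
```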

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 892 non-null object
1 company 892 non-null object
2 Price 857 non-null float64
3 kms_driven 840 non-null object
4 fuel_type 837 non-null object
5 age 842 non-null float64
dtypes: float64(2), object(4)
memory usage: 41.9+ KB

df.Price.isnull().sum()

35

Clean kms_driven

# Cap Price outliers: values above median + 3.5 standard deviations are set to that limit
df_median = df.Price.median()
df_std = df.Price.std()
price_cap = df_median + 3.5*df_std
#df[df['Price'] > price_cap]
df['Price'] = df.Price.apply(lambda x : price_cap if x > price_cap else x)
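The capping step above can also be written with `Series.clip`, which avoids a Python-level lambda. A minimal sketch on synthetic values (the sample data and the names `upper` and `capped` are assumptions for illustration):

```python
import pandas as pd

# Synthetic sample: 29 ordinary prices and one extreme value
price = pd.Series([100.0] * 29 + [1000.0])

# Same limit as in the notebook: median + 3.5 standard deviations
upper = price.median() + 3.5 * price.std()

# clip(upper=...) replaces every value above the limit with the limit itself
capped = price.clip(upper=upper)

print("limit:", upper)
print("max before:", price.max(), "max after:", capped.max())
```

Values at or below the limit, including NaN, are left unchanged, which matches the behaviour of the lambda version.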

df.kms_driven

0 45,000 kms
1 40 kms
2 22,000 kms
3 28,000 kms
4 36,000 kms
...
887 NaN
888 27,000 kms
889 40,000 kms
890 Petrol
891 Petrol
Name: kms_driven, Length: 892, dtype: object

df['kms_driven'] = (df.kms_driven.str.replace(" kms","")).str.replace("," , "")

df.describe(include='object')

        name        company  kms_driven  fuel_type
count   892         892      840         837
unique  525         48       258         3
top     Honda City  Maruti   45000       Petrol
freq    13          235      30          440

df['kms_driven'] = pd.to_numeric(df['kms_driven'], errors= 'coerce')

df.head()

   name                                    company   Price     kms_driven  fuel_type  age
0  Hyundai Santro Xing XO eRLX Euro III    Hyundai   80000.0   45000.0     Petrol     17.0
1  Mahindra Jeep CL550 MDI                 Mahindra  425000.0  40.0        Diesel     18.0
2  Maruti Suzuki Alto 800 Vxi              Maruti    NaN       22000.0     Petrol     6.0
3  Hyundai Grand i10 Magna 1.2 Kappa VTVT  Hyundai   325000.0  28000.0     Petrol     10.0
4  Ford EcoSport Titanium 1.5L TDCi        Ford      575000.0  36000.0     Diesel     10.0

Clean fuel_type

df.fuel_type.value_counts()

Petrol 440
Diesel 395
LPG 2
Name: fuel_type, dtype: int64

df.fuel_type.isnull().sum()

55

Clean company

df = df.drop(df.loc[df['company'].str.isnumeric()].index)

df.company = df.company.str.lower()

Handle NaN values in Price (rows without a price are dropped)

df.Price.isnull().sum()

33

df = df.dropna(subset = ['Price'],axis='rows')

df.kms_driven.describe()

count 817.000000
mean 46250.714810
std 34283.745254
min 0.000000
25% 27000.000000
50% 41000.000000
75% 56758.000000
max 400000.000000
Name: kms_driven, dtype: float64

_, kms_mean, kms_std, *_ = df.kms_driven.describe()

print(kms_mean)
print(kms_std)

46250.71481028152
34283.745253881294

#df[df['kms_driven'] > kms_mean + 3.5* kms_std]


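# Cap kms_driven at mean + 3.5 standard deviations (the `else x` branch mirrors the Price capping step)
df['kms_driven'] = df['kms_driven'].apply(lambda x : kms_mean + 3.5*kms_std if( x > (kms_mean + 3.5*kms_std) ) else x)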

groupby_age_mean = df.groupby('age')['kms_driven'].transform('mean')
groupby_age_mean

0 60605.201221
1 41211.318182
3 41990.974417
4 41990.974417
6 50464.677643
...
887 NaN
888 20307.966667
889 40963.893617
890 41990.974417
891 41990.974417
Name: kms_driven, Length: 856, dtype: float64

df.dropna(subset = ['age','kms_driven','fuel_type'], how = 'all',inplace = True)

df.kms_driven

0 45000.0
1 40.0
3 28000.0
4 36000.0
6 41000.0
...
886 132000.0
888 27000.0
889 40000.0
890 NaN
891 NaN
Name: kms_driven, Length: 819, dtype: float64

# Assign the result back; in-place fillna on an attribute-accessed column is unreliable in recent pandas
df['kms_driven'] = df['kms_driven'].fillna(groupby_age_mean)
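The imputation above fills each missing kms_driven value with the mean of cars of the same age. A minimal sketch of that group-mean fill on a tiny illustrative frame (the frame `cars` and its values are assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "age": [10.0, 10.0, 10.0, 6.0, 6.0],
    "kms_driven": [30000.0, 50000.0, np.nan, 20000.0, np.nan],
})

# transform("mean") returns one value per row: the mean of that row's age group,
# computed over the non-missing entries only
group_mean = cars.groupby("age")["kms_driven"].transform("mean")

# fillna with a Series aligns on the index, so each gap gets its own group's mean
cars["kms_driven"] = cars["kms_driven"].fillna(group_mean)

print(cars)
```

Rows whose age is itself missing get no group mean, so their kms_driven stays NaN after this step.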

Fill na values in fuel_type

df.fuel_type.value_counts()

Petrol 428
Diesel 386
LPG 2
Name: fuel_type, dtype: int64

fuel_type = df.fuel_type.mode()[0]
fuel_type

'Petrol'

df['fuel_type'] = df['fuel_type'].fillna(fuel_type)

There are no null values in the age column

df['name'] = df['name'].str.split().str.slice(start = 0, stop = 3).str.join(' ')
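The line above keeps only the first three words of each car name, which reduces the number of distinct categories the encoder has to handle later. A minimal sketch of the same chain on two sample names taken from the table output (the variable names are assumptions):

```python
import pandas as pd

names = pd.Series(["Hyundai Santro Xing XO eRLX Euro III", "Ford Figo"])

# split() -> list of words, slice(0, 3) -> first three words, join(" ") -> back to a string
short = names.str.split().str.slice(start=0, stop=3).str.join(" ")

print(short.tolist())
```

Names with fewer than three words pass through unchanged.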


df.reset_index(inplace = True)
df

     index  name                    company    Price     kms_driven     fuel_type  age
0    0      Hyundai Santro Xing     hyundai    80000.0   45000.000000   Petrol     17.0
1    1      Mahindra Jeep CL550     mahindra   425000.0  40.000000      Diesel     18.0
2    3      Hyundai Grand i10       hyundai    325000.0  28000.000000   Petrol     10.0
3    4      Ford EcoSport Titanium  ford       575000.0  36000.000000   Diesel     10.0
4    6      Ford Figo               ford       175000.0  41000.000000   Diesel     12.0
...  ...    ...                     ...        ...       ...            ...        ...
814  886    Toyota Corolla Altis    toyota     300000.0  132000.000000  Petrol     15.0
815  888    Tata Zest XM            tata       260000.0  27000.000000   Diesel     6.0
816  889    Mahindra Quanto C8      mahindra   390000.0  40000.000000   Diesel     11.0
817  890    Honda Amaze 1.2         honda      180000.0  41990.974417   Petrol     10.0
818  891    Chevrolet Sail 1.2      chevrolet  160000.0  41990.974417   Petrol     10.0

819 rows × 7 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 819 entries, 0 to 818
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 819 non-null int64
1 name 819 non-null object
2 company 819 non-null object
3 Price 819 non-null float64
4 kms_driven 819 non-null float64
5 fuel_type 819 non-null object
6 age 819 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 44.9+ KB

df.shape

(819, 7)

Cleaned Data
df

     index  name                    company    Price     kms_driven     fuel_type  age
0    0      Hyundai Santro Xing     hyundai    80000.0   45000.000000   Petrol     17.0
1    1      Mahindra Jeep CL550     mahindra   425000.0  40.000000      Diesel     18.0
2    3      Hyundai Grand i10       hyundai    325000.0  28000.000000   Petrol     10.0
3    4      Ford EcoSport Titanium  ford       575000.0  36000.000000   Diesel     10.0
4    6      Ford Figo               ford       175000.0  41000.000000   Diesel     12.0
...  ...    ...                     ...        ...       ...            ...        ...
814  886    Toyota Corolla Altis    toyota     300000.0  132000.000000  Petrol     15.0
815  888    Tata Zest XM            tata       260000.0  27000.000000   Diesel     6.0
816  889    Mahindra Quanto C8      mahindra   390000.0  40000.000000   Diesel     11.0
817  890    Honda Amaze 1.2         honda      180000.0  41990.974417   Petrol     10.0
818  891    Chevrolet Sail 1.2      chevrolet  160000.0  41990.974417   Petrol     10.0

819 rows × 7 columns

df.to_csv("cleaned_data.csv")

✏Data Visualization :-

fig, ax = plt.subplots(1, 2, figsize = (16,8))

data = df['company'].value_counts()
pal = sns.color_palette("magma")
sns.set(style = "dark", color_codes = True)

# Left panel: number of cars per company
ax[0] = sns.barplot(x = data.index, y = data.values, ax = ax[0], palette = pal)

# Label each bar with its count
for bar in ax[0].patches:
    # The source line is cut off after "ha"; ha = 'center' is an assumed completion
    ax[0].annotate('{:.0f}'.format(bar.get_height()), (bar.get_x() + bar.get_width()/2, bar.get_height()), ha = 'center')
ax[0].set_xlabel("Company")
ax[0].set_ylabel("No. of cars")
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=75)

# Right panel: share of each company as a pie chart
_, _, autotexts = ax[1].pie(data.values, labels = data.index, autopct = '%.0f%%', colors = pal)

for autotext in autotexts:
    autotext.set_color("white")
plt.show()

fig, ax = plt.subplots(1, 2, figsize = (16,8))
data = df['fuel_type'].value_counts()
pal = sns.color_palette("magma", len(data))
sns.set(style = 'dark')

# Left panel: number of cars per fuel type
ax[0] = sns.barplot(x = data.index, y = data.values, palette = pal, ax = ax[0])

# Label each bar with its count
for bar in ax[0].patches:
    # The source line is cut off after "ha"; ha = 'center' is an assumed completion
    ax[0].annotate("{:.0f}".format(bar.get_height()), (bar.get_x() + bar.get_width()/2, bar.get_height()), ha = 'center')
ax[0].set_xlabel("Fuel Type")
ax[0].set_ylabel("No. Of Cars")

# Right panel: share of each fuel type as a pie chart
_, _, autotexts = ax[1].pie(data.values, labels = data.index, autopct = "%.2f%%", colors = pal)

for text in autotexts:
    text.set_color("white")
plt.show()

df.head()

   index  name                    company   Price     kms_driven  fuel_type  age
0  0      Hyundai Santro Xing     hyundai   80000.0   45000.0     Petrol     17.0
1  1      Mahindra Jeep CL550     mahindra  425000.0  40.0        Diesel     18.0
2  3      Hyundai Grand i10       hyundai   325000.0  28000.0     Petrol     10.0
3  4      Ford EcoSport Titanium  ford      575000.0  36000.0     Diesel     10.0
4  6      Ford Figo               ford      175000.0  41000.0     Diesel     12.0

plt.figure(figsize = (12,8))
sns.scatterplot( x = 'Price' , y = 'age' , data = df, hue = 'fuel_type')

<AxesSubplot:xlabel='Price', ylabel='age'>

plt.subplots(figsize=(20,8))
ax=sns.boxplot(x='company',y='Price',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()

plt.figure(figsize=(20,8))
ax=sns.boxplot(x='age',y='Price',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()

plt.figure(figsize=(20,8))
ax=sns.boxplot(x='fuel_type',y='Price',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()

plt.figure(figsize = (12,8))

ax = sns.histplot(
data=df,
x="company", hue="fuel_type",
multiple="fill", stat="proportion",
discrete=True, shrink=.8
)

plt.xticks(rotation = 75)
plt.title('Company vs Fuel Type')
plt.xlabel('Company')
plt.ylabel('Fuel Type')

Text(0, 0.5, 'Fuel Type')

df.head()

   index  name                    company   Price     kms_driven  fuel_type  age
0  0      Hyundai Santro Xing     hyundai   80000.0   45000.0     Petrol     17.0
1  1      Mahindra Jeep CL550     mahindra  425000.0  40.0        Diesel     18.0
2  3      Hyundai Grand i10       hyundai   325000.0  28000.0     Petrol     10.0
3  4      Ford EcoSport Titanium  ford      575000.0  36000.0     Diesel     10.0
4  6      Ford Figo               ford      175000.0  41000.0     Diesel     12.0

plt.figure( figsize = (16,8))


sns.scatterplot( x = 'Price' , y = 'kms_driven' , data = df)

<AxesSubplot:xlabel='Price', ylabel='kms_driven'>

plt.figure(figsize=(20,8))
ax=sns.boxplot(x='age',y='kms_driven',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()

✏Data Split :-

from sklearn.model_selection import train_test_split

X=df[['name','company','age','kms_driven','fuel_type']]
y= df['Price']

Xtrain , Xtest, ytrain, ytest = train_test_split(X,y, test_size = 0.3)
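The split above uses no `random_state`, so each run produces a different train/test partition and therefore a different score. A minimal sketch of a reproducible split on synthetic arrays (the arrays and the seed value 42 are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the target
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state makes the shuffle deterministic
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Both calls return the same partition: Xtrain, Xtest, ytrain, ytest
print(np.array_equal(split_a[1], split_b[1]))
```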


✏Data Preprocessing:-

from sklearn.linear_model import LinearRegression


from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

ohe = OneHotEncoder()
ohe.fit(X[['name','company','fuel_type']])

OneHotEncoder()

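# categories must be passed by keyword; remainder='passthrough' matches the fitted pipeline shown below
column_trans = make_column_transformer( (OneHotEncoder(categories = ohe.categories_) , ['name','company','fuel_type']) , remainder = 'passthrough')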
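Fitting the encoder on the full X first and passing its `categories_` to the transformer means every category gets a column, even if it is absent from the training split. A minimal sketch of that effect on a tiny illustrative frame (the frame `X_demo` and its values are assumptions):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

X_demo = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "LPG", "Petrol"],
    "age": [17.0, 18.0, 10.0, 6.0],
})

# Learn the full category list from all rows
ohe_demo = OneHotEncoder()
ohe_demo.fit(X_demo[["fuel_type"]])

trans_demo = make_column_transformer(
    (OneHotEncoder(categories=ohe_demo.categories_), ["fuel_type"]),
    remainder="passthrough",  # numeric columns such as age are kept as they are
)

# Fit on a subset that contains no LPG row; a column for LPG is still reserved
encoded = trans_demo.fit_transform(X_demo.iloc[[0, 1]])
print(encoded.shape)

# The LPG row can then be transformed without an unknown-category error
lpg_row = trans_demo.transform(X_demo.iloc[[2]])
print(lpg_row.shape)
```

An alternative with a similar effect is `OneHotEncoder(handle_unknown="ignore")`, which encodes unseen categories as all zeros instead of requiring the full list up front.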

✏Model Building:-

lr = LinearRegression()

pipe = make_pipeline( column_trans, lr)

pipe.fit(Xtrain,ytrain)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories=[array(['Audi A3 Cabriolet', 'Audi A4 1.8', 'Audi A4 2.0', 'Audi A6 2.0',
       'Audi A8', 'Audi Q3 2.0', 'Audi Q5 2.0', 'Audi Q7', 'BMW 3 Series',
       'BMW 5 Series', 'BMW 7 Series', 'BMW X1', 'BMW X1 sDrive20d',
       'BMW X1 xDrive20d', 'Chevrolet Beat', 'Chevrolet Beat...
                                                                            array(['audi', 'bmw', 'chevrolet', 'datsun', 'fiat', 'force', 'ford',
       'hindustan', 'honda', 'hyundai', 'jaguar', 'jeep', 'land',
       'mahindra', 'maruti', 'mercedes', 'mini', 'mitsubishi', 'nissan',
       'renault', 'skoda', 'tata', 'toyota', 'volkswagen', 'volvo'],
      dtype=object),
                                                                            array(['Diesel', 'LPG', 'Petrol'], dtype=object)]),
                                                  ['name', 'company',
                                                   'fuel_type'])])),
                ('linearregression', LinearRegression())])

ypred = pipe.predict(Xtest)

r2_score(ytest,ypred)

0.834527509712749
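A single R² value depends on which rows happened to land in the test set. A minimal sketch of how the score can vary across splits, using synthetic data rather than the Quikr dataset (the data-generating formula, the seeds, and all variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: price falls with age, plus random noise
rng = np.random.default_rng(0)
age_demo = rng.uniform(0, 20, size=(200, 1))
price_demo = 500000 - 20000 * age_demo[:, 0] + rng.normal(0, 50000, size=200)

# Refit the model on 20 different random splits and record each test R²
scores = []
for seed in range(20):
    Xtr, Xte, ytr, yte = train_test_split(
        age_demo, price_demo, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(Xtr, ytr)
    scores.append(r2_score(yte, model.predict(Xte)))

print("lowest R2:", min(scores))
print("highest R2:", max(scores))
```

Reporting the range or mean over several splits, or using cross-validation, gives a steadier picture than one split.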

Exciting Milestone: Successfully trained my first machine learning model, marking the beginning of my journey into data science and
predictive analytics. Looking forward to exploring more complex algorithms and applications!