Car Price Prediction
CERTIFICATE
This is to certify that the project entitled CAR PRICE PREDICTION USING LINEAR
REGRESSION being submitted by
CONTENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTERS
1. INTRODUCTION
1.1. Machine Learning
2. SOFTWARE REQUIREMENTS SPECIFICATIONS
2.1. Requirements Specification Document
2.2. Functional Requirements
2.3. Non-Functional Requirements
2.4. Software Requirements
2.5. Hardware Requirements
2.6. Requirement Analysis
2.7. Test Construction and Verification
2.8. Test Execution and Bug Reporting
2.9. Final Testing and Implementation
2.10. Post Implementation
2.11. Technologies Used
5. IMPLEMENTATION
5.1. Pseudo code
6. TESTING
7. SCREENSHOTS
8. FURTHER ENHANCEMENTS
9. CONCLUSION
10. REFERENCES
ABSTRACT
This study proposes a system for estimating the price of a used car with Machine
Learning techniques, which can save buyers and sellers a lot of time and money. Car price
prediction has been a high-interest research area because an accurate valuation requires
noticeable effort and the knowledge of a field expert. A considerable number of distinct
attributes must be examined for a reliable and accurate prediction. The production of cars
has been steadily increasing in the past decade, with over 70 million passenger cars
produced in the year 2016. This has given rise to the used car market, which has itself
become a booming industry. The recent advent of online portals has increased the need for
both the customer and the seller to be better informed about the trends and patterns that
determine the value of a used car in the market. To build a model for predicting the price
of used cars, we applied one machine learning technique, Linear Regression. In linear
regression there can be multiple independent variables but only one dependent variable,
whose actual and predicted values are compared to assess the accuracy of the results. Our
system treats price as the dependent variable and predicts it from factors such as
kilometers driven, car purchase year, car company, car model, and fuel type.
LIST OF FIGURES
1.3.2 Unsupervised
LIST OF TABLES
CHAPTER -1
1. INTRODUCTION
Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming, coined
the term "Machine Learning". He defined machine learning as a "field of study that gives
computers the capability to learn without being explicitly programmed". In layman's terms,
Machine Learning (ML) can be explained as automating and improving the learning process of
computers based on their experience, without explicitly programming them for each task. The
process starts with feeding in good-quality data and then training the machine (computer) by
building machine learning models from the data with different algorithms. The choice of
algorithm depends on the type of data we have and the kind of task we are trying to automate.
Example: students preparing for exams. While preparing, students do not simply cram the
subject but try to learn it with complete understanding. Before the examination, they feed
their machine (brain) a good amount of high-quality data (questions and answers from
different books, teachers' notes, or online video lectures).
In effect, they are training their brain with inputs as well as outputs, i.e., which approach
or logic solves which kind of question. Each time they solve practice test papers, they
measure their performance (accuracy/score) by comparing their answers with the answer key.
Gradually the performance improves, and they gain confidence in the adopted approach. Models
are built the same way: the machine is trained with data (both inputs and outputs are given
to the model), and when the time comes it is tested on data with inputs only. The model's
score is obtained by comparing its answers with the actual outputs, which were not fed to it
during training. Researchers are working assiduously to improve algorithms and techniques so
that these models perform even better.
Downloaded by Prashant Chaudhari
lOMoARcPSD|40893658
Example
customer will purchase a particular product under consideration or not based on his/ her
gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0 means that
the customer won’t purchase it.
Figure B: It is a Meteorological dataset that serves the purpose of predicting wind speed
based on different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
Output: Wind Speed
B. Regression:
It is a Supervised Learning task in which the output takes continuous values.
For example, in Figure B above, the output (Wind Speed) does not take discrete values
but is continuous within a particular range. The goal is to predict a value as close as
possible to the actual output, and the model is then evaluated by calculating the
error. The smaller the error, the greater the accuracy of our regression model.
Clustering: Broadly, this technique is applied to group data based on the patterns,
such as similarities or differences, that the model finds. These algorithms process
raw, unclassified data objects into groups. For example, in the above figure, we
have not given output parameter values, so this technique will be used to group clients
based on the input parameters provided by our data.
Association: This is a rule-based ML technique that finds useful relations between the
parameters of a large data set. It is used mainly for market basket analysis, which
helps to better understand the relationships between different products. For example,
shopping stores use algorithms based on this technique to find the relationship between
the sales of one product and those of another, based on customer behavior: if a customer
buys milk, then he or she may also buy bread, eggs, or butter. Once trained well, such
models can be used to increase sales by planning different offers.
Some algorithms: K-Means Clustering
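The milk-and-bread example above can be made concrete with the two standard quantities used in market basket analysis, support and confidence. The following is only a minimal sketch: the baskets are invented for illustration and the helper names `support` and `confidence` are assumptions, not part of the project code.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "eggs"},
    {"milk", "bread", "butter"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


milk_support = support({"milk"}, baskets)
milk_to_bread = confidence({"milk"}, {"bread"}, baskets)
print(f"support(milk) = {milk_support:.2f}")
print(f"confidence(milk -> bread) = {milk_to_bread:.2f}")
```

A rule such as "milk -> bread" is usually considered interesting only when both its support and its confidence exceed thresholds chosen by the analyst.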
This technique is mostly applicable in the case of image data sets where usually not all
images are labeled.
1.3.4 Reinforcement Learning:
In this technique, the model keeps improving its performance by using reward
feedback to learn the behavior or pattern. These algorithms are specific to a particular
problem, e.g., the Google self-driving car, or AlphaGo, where a bot competes with humans
and even with itself to become a better and better player of the game of Go. Each time
we feed in data, the model learns and adds the data to its knowledge, which is its
training data. So the more it learns, the better trained and hence more experienced it
becomes.
Figure 1.3.4 Reinforcement
In linear regression, the relationships are modeled using linear predictor functions whose
unknown model parameters are estimated from the data. Such models are called linear
models. Most commonly, the conditional mean of the response given the values of the
explanatory variables (or predictors) is assumed to be an affine function of those values;
less commonly, the conditional median or some other quantile is used. Like all forms of
regression analysis, linear regression focuses on the conditional probability distribution
of the response given the values of the predictors, rather than on the joint probability
distribution of all of these variables, which is the domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously, and to
be used extensively in practical applications. This is because models which depend
linearly on their unknown parameters are easier to fit than models which are non-linearly
related to their parameters and because the statistical properties of the resulting estimators
are easier to determine.
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:
Linear regression models are often fitted using the least squares approach, but they may
also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm
(as with least absolute deviations regression), or by minimizing a penalized version of the
least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm
penalty). Conversely, the least squares approach can be used to fit models that are not
linear models. Thus, although the terms "least squares" and "linear model" are closely
linked, they are not synonymous.
where T denotes the transpose, so that $x_i^{T}\beta$ is the inner product between the vectors $x_i$ and $\beta$.
Often these n equations are stacked together and written in matrix notation as
$$y = X\beta + \varepsilon,$$
The very simplest case of a single scalar predictor variable x and a single scalar response
variable y is known as simple linear regression. The extension to multiple and/or
vector-valued predictor variables (denoted with a capital X) is known as multiple linear
regression, also known as multivariable linear regression (not to be confused with
multivariate linear regression).
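For the simple linear regression case just described, the least squares estimates have a well-known closed form: the slope is the ratio of the covariance of x and y to the variance of x, and the intercept follows from the means. The sketch below illustrates this in plain Python; the function name and the age/price figures are hypothetical and chosen so that the points lie exactly on a line.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data: car age in years vs. price in lakh rupees,
# constructed to lie exactly on price = 10 - 1.5 * age.
ages = [1, 2, 3, 4, 5]
prices = [8.5, 7.0, 5.5, 4.0, 2.5]

intercept, slope = fit_simple_linear_regression(ages, prices)
print(f"price = {intercept:.2f} + ({slope:.2f}) * age")
```

With several predictors, as in multiple linear regression, the same idea is normally expressed in matrix form and solved by a library rather than by hand.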
In the more general multivariate linear regression, there is one equation of the above
form for each of m > 1 dependent variables that share the same set of explanatory
variables, for all observations indexed as i = 1, ..., n and for all dependent variables
indexed as j = 1, ..., m.
Nearly all real-world regression models involve multiple predictors, and basic
descriptions of linear regression are often phrased in terms of the multiple regression
model. Note, however, that in these cases the response variable y is still a scalar. Another
term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as
general linear regression.
L1 loss: This is the absolute difference between the predicted and actual values. Averaged
over all N observations, it is also called mean absolute error (MAE). The model computes
the absolute error of each observation and averages them to obtain the L1 loss. The
formula of L1 loss is shown below.
$$MAE = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$$
where $\hat{y}_i$ is the predicted value of $y$ and $y_i$ is the actual value of $y$.
L2 Loss: In this loss, we take the average of the squared differences between the
predicted and actual values. It is also known as Mean Squared Error (MSE). The formula of
L2 loss is shown below.
$$MSE = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
RMSE: It expresses the error as the square root of the L2 loss, i.e., MSE, so it is in
the same units as $y$. The formula of RMSE is shown below.
$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
R-squared: It measures how well the model-predicted line fits the actual values of the
data. The coefficient usually lies between 0 and 1, and a value close to 1 indicates a
well-fitted line. The formula is shown below.
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
where $\bar{y}$ is the mean value of $y$.
Note: When the data contains outliers, L1 loss is preferable, because L2 loss squares the
error and therefore gives outliers a much larger loss value. Alternatively, we can remove
the outliers first and then use L2 loss.
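The four measures above can be computed directly from a list of actual values and a list of predictions. The following is a minimal sketch in plain Python; the function name `regression_metrics` and the sample numbers are assumptions made for illustration and are not taken from the project's data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n       # L1 loss
    mse = sum(e * e for e in errors) / n        # L2 loss
    rmse = math.sqrt(mse)
    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)         # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the errors are squared before averaging, a single large error raises MSE and RMSE far more than it raises MAE, which is the outlier sensitivity mentioned in the note.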
Learning Rate:
Alpha ($\alpha$) is the learning rate in the gradient descent update rule. Its function is
to control the step size, and hence the speed, with which gradient descent approaches the
minimum. The value of alpha should be chosen carefully: if it is too large, the update can
overshoot the minimum, and if it is too small, it takes a long time to reach the minimum.
$$\theta_{new} = \theta_{old} - \alpha \frac{\partial L}{\partial \theta_{old}}$$
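The effect of the learning rate can be seen by applying the update rule to a simple one-parameter loss. The sketch below assumes a hypothetical loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3) and whose minimum is at θ = 3; the function name and the three alpha values are illustrative choices only.

```python
def gradient_descent(grad, theta, alpha, steps):
    """Repeatedly apply theta_new = theta_old - alpha * dL/dtheta."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad(theta):
    """Gradient of the hypothetical loss L(theta) = (theta - 3) ** 2."""
    return 2 * (theta - 3)


# Same starting point and step count, three different learning rates.
theta_good = gradient_descent(grad, theta=0.0, alpha=0.1, steps=100)
theta_slow = gradient_descent(grad, theta=0.0, alpha=0.001, steps=100)
theta_diverged = gradient_descent(grad, theta=0.0, alpha=1.1, steps=100)

print("alpha = 0.1   ->", theta_good)      # close to the minimum at 3
print("alpha = 0.001 ->", theta_slow)      # still far from 3 after 100 steps
print("alpha = 1.1   ->", theta_diverged)  # overshoots further on every step
```

For this particular loss, any alpha between 0 and 1 converges; real loss surfaces rarely give such a clean bound, so alpha is usually tuned experimentally.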
One approach to solve this problem is label encoding, in which we assign a numerical
value to each label, for example Male and Female mapped to 0 and 1. But this can add bias
to our model, since it may start giving higher preference to the Female label because
1 > 0, whereas ideally both labels are equally important in the dataset. To deal with this
issue we use the One Hot Encoding technique.
In this technique, a separate column is created for each label of the categorical
parameter, here Male and Female. Wherever the value is Male, the Male column holds 1 and
the Female column holds 0, and vice versa. Let's understand with an example: consider data
in which fruits, their corresponding categorical values, and their prices are given.
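As a minimal illustration of one hot encoding, the sketch below encodes a short list of fruit names in plain Python. The fruit values and the helper name `one_hot_encode` are hypothetical, since the original table is not reproduced here; a library encoder would normally be used in practice.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Hypothetical categorical column.
fruits = ["apple", "mango", "apple", "orange"]

categories, encoded = one_hot_encode(fruits)
print(categories)
for fruit, row in zip(fruits, encoded):
    print(fruit, row)
```

Each row contains exactly one 1, so no category is ranked above another, which is the property that label encoding lacks.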
Objective of the Project - The goal of this project is to create an efficient and
effective model that can predict the price of a used car with good accuracy using the
Linear Regression algorithm.
Brand or Type of the car one prefers like Ford, Hyundai
Problem Statement - It is easy for a company to price its new cars based on the
manufacturing and marketing costs involved. A used car, however, is much harder to
price because its value is influenced by various parameters such as car brand, year of
manufacture, etc. The goal of our project is to predict the best price for a pre-owned
car in the Indian market, based on previous data on sold cars, using Linear Regression.
The used car market is a steadily growing industry whose market value has almost
doubled in the last few years. The emergence of online portals such as CarDekho, Quikr,
Carwale, Cars24, and many others has increased the need for both the customer and the
seller to be better informed about the trends and patterns that determine the value of a
used car in the market. Machine Learning algorithms can be used to predict the retail
value of a car based on a certain set of features. The purpose of this project is to
provide car price prediction using machine learning without any human interference.
Cars are bought and sold every day, yet there are only limited facilities and
applications for getting an appropriate price for one's car. Now we use
We are required to model the price of cars with the available independent
variables. It will be used by the management to understand how exactly the prices vary
with the independent variables. They can accordingly manipulate the design of the cars,
the business strategy etc. to meet certain price levels. Further, the model will be a good
way for management to understand the pricing dynamics of a new market.
CHAPTER -2
document is also known by the names SRS report, software document. A software
document is primarily prepared for a project, software or any kind of application.
There are a set of guidelines to be followed while preparing the software requirement
specification document. This includes the purpose, scope, functional and non-functional
requirements, software and hardware requirements of the project. In addition to this, it
also contains the information about environmental conditions required, safety and
security requirements, software quality attributes of the project etc.
the constraints the system must work within. Following are the non-functional
requirements:
2.3.1 Performance:
The performance of the developed applications can be calculated by using following
methods: Measuring enables you to identify how the performance of your application
stands in relation to your defined performance goals and helps you to identify the
bottlenecks that affect your application performance. It helps you identify whether your
application is moving toward or away from your performance goals. Defining what you
will measure, that is, your metrics, and defining the objectives for each metric is a
critical part of your testing plan.
Performance objectives include the following:
Response time, latency, throughput, and resource utilization.
What is SRS ?
1. Requirements Analysis
2. Test Planning
3. Test Analysis
4. Test Design
5. Test Construction and Verification
6. Test Execution and Bug Reporting
7. Final Testing and Implementation
8. Post Implementation
Google Colab's major differentiator from Jupyter Notebook is that it is cloud-based and
Jupyter is not. This means that if you work in Google Colab, you do not have to worry
about downloading and installing anything on your hardware.
2.11.3 SQL
SQL (Structured Query Language) is a powerful and standard query language for
relational database systems. We use SQL to perform CRUD (Create, Read, Update,
Delete) operations on databases along with other various operations. SQL has evolved a
lot in the past decade.
Although SQL is an ANSI/ISO standard, there are different versions of the SQL
language. However, to be compliant with the ANSI standard, they all support at least the
major commands (such as SELECT, UPDATE, DELETE, INSERT, WHERE) in a
similar manner.
MySQL, the most popular Open Source SQL database management system, is
developed, distributed, and supported by Oracle Corporation.
RDBMS
RDBMS stands for Relational Database Management System. RDBMS is the basis for
SQL, and for all modern database systems such as MS SQL Server, IBM DB2, Oracle,
MySQL, and Microsoft Access. The data in RDBMS is stored in database objects called
tables. A table is a collection of related data entries and it consists of columns and rows.
2.11.4 Flask
Flask is a micro web framework written in Python. It is classified as a micro
framework because it does not require particular tools or libraries. It has no database
abstraction layer, form validation, or any other components where pre-existing third-party
libraries provide common functions.
CHAPTER -3
3. LITERATURE SURVEY
Overfitting and underfitting come into the picture when we create our statistical
models. The models might be fitted too closely to the training data and then not perform
well on the test dataset. This is called overfitting. Likewise, the models might fail to
capture the variance present in the population and perform poorly on a test data
set. This is called underfitting. A balance needs to be achieved between these two,
which leads to the concept of the bias-variance tradeoff. Pierre Geurts has introduced and
explained how the bias-variance tradeoff is achieved in both regression and classification.
The selection of variables/attributes plays a vital role in influencing both the bias and
variance of the statistical model. Robert Tibshirani proposed a method called Lasso,
which minimizes the residual sum of squares subject to a constraint on the sum of the
absolute values of the coefficients. This returns a subset of attributes that need to be
included in multiple regression to get the minimal error rate. Similarly, decision trees
suffer from overfitting if they are not pruned/shrunk. Trevor Hastie and Daryl Pregibon
have explained the concept of pruning in their research paper. Moreover, hypothesis
testing using ANOVA is needed to verify whether the different groups of errors really
differ from each other. This is explained by TK Kim and Tae Kyun in their paper. A
post-hoc test needs to be performed along with ANOVA if the number of groups exceeds two.
Tukey's Test has been explored by Haynes W. in his research paper. Using these
techniques, we will create, train and test the effectiveness of our statistical models.
The paper "Predicting the Price of Used Cars Using Machine Learning Techniques"
investigates the application of supervised machine learning techniques to predict
the price of used cars in Mauritius. The predictions are based on historical data collected from
daily newspapers. Different techniques such as multiple linear regression analysis, k-nearest
neighbors, naïve Bayes and decision trees have been used to make the predictions.
The paper "Car Price Prediction Using Machine Learning Techniques" examines a considerable
number of distinct attributes for reliable and accurate prediction. To build
a model for predicting the price of used cars in Bosnia and Herzegovina, the authors
applied three machine learning techniques (Artificial Neural Network, Support Vector
Machine and Random Forest).
The paper "Price Evaluation model in second hand car system based on BP neural
networks" proposes a price evaluation model based on big data analysis, which takes
advantage of widely circulated vehicle data and a large number of vehicle transaction
records to analyze the price data for each type of vehicle using an optimized BP neural
network algorithm. It aims to establish a second-hand car price evaluation model that
gives the price best matching the car.
Null Hypothesis
Even though pruning reduces the magnitude of overfitting, regression trees still suffer
from overfitting after pruning. This leads to our following hypothesis.
Hypothesis: Multiple and Lasso Regressions are better at predicting price than the
Regression Tree.
Linear Regression
In statistics, linear regression is a linear approach for modelling the relationship between
a scalar response and one or more explanatory variables (also known as dependent and
independent variables). The case of one explanatory variable is called simple linear
regression; for more than one, the process is called multiple linear regression. This term
is distinct from multivariate linear regression, where multiple correlated dependent
variables are predicted, rather than a single scalar variable.
CHAPTER -4
4. SYSTEM DESIGN
The Unified Modeling Language (UML) allows the software engineer to express an analysis
model using a modeling notation that is governed by a set of syntactic, semantic and
pragmatic rules. A UML system is represented using five different views that describe
the system from distinctly different perspectives. Each view is defined by a set of
diagrams, as follows:
1. User Model View
This view represents the system from the users' perspective. The analysis
representation describes a usage scenario from the end-users' perspective.
2. Structural Model View
In this model, the data and functionality are viewed from inside the system. This
model view models the static structures.
3. Behavioral Model View
It represents the dynamic or behavioral aspects of the system, depicting the
interactions between the various structural elements described in the
user model and structural model views.
4. Implementation Model View
In this view, the structural and behavioral aspects of the system are represented
as they are to be built.
5. Environmental Model View
In this view, the structural and behavioral aspects of the environment in which
the system is to be implemented are represented.
To model a system, the most important aspect is to capture its dynamic behavior, that
is, the behavior of the system when it is running/operating.
Class diagrams are the main building blocks of every object-oriented method. A
class diagram can be used to show the classes, relationships, interfaces, associations,
and collaborations. The class diagram is standardized in UML. Since classes are the
building blocks of an application based on OOP, the class diagram has an appropriate
structure to represent the classes, inheritance, relationships, and everything else that
OOP has in its context. It describes the various kinds of objects and the static
relationships between them.
The main purposes of using class diagrams are:
1. It is the only UML diagram that can appropriately depict the various aspects of
the OOP concept.
2. It makes the design and analysis of the application faster and more efficient.
3. It is the base for deployment and component diagrams.
Each class is represented by a rectangle subdivided into three compartments:
name, attributes and operations.
A narrative description of each module, its function(s), the conditions under which
it is used (called or scheduled for execution), its overall processing, logic,
interfaces to other modules, interfaces to external systems, security requirements,
etc.; explain any algorithms used by the module in detail
For COTS packages, specify any call routines or bridging programs to integrate the
package with the system and/or other COTS packages (for example, Dynamic Link
Libraries)
Data elements, record structures, and file structures associated with module input
and output
Graphical representation of the module processing, logic, flow of control, and
algorithms, using an accepted diagramming approach (for example, structure
charts, action diagrams, flowcharts, etc.)
Data entry and data output graphics; define or reference associated data elements;
if the project is large and complex or if the detailed module designs will be
incorporated into a separate document, then it may be appropriate to repeat the
screen information in this section
Report layout
The name of the diagram itself clarifies its purpose: it describes the different
states of a component in a system. The states are specific to a component/object of
the system.
The activity diagram, explained in the next chapter, is a special kind of Statechart diagram.
As a Statechart diagram defines the states, it is used to model the lifetime of an object.
A Statechart diagram is used to describe the states of different objects over their life
cycles. Emphasis is placed on the state changes caused by internal or external events.
These states of objects are important to analyze and implement accurately.
Statechart diagrams are very important for describing the states. A state can be identified
as the condition of an object when a particular event occurs.
The first state is an idle state from which the process starts. The next states are reached
through events such as send request, confirm request, and dispatch order. These events are
responsible for the state changes of the order object.
During the life cycle of an object (here order object) it goes through the following states
and there may be some abnormal exits. This abnormal exit may occur due to some
problem in the system. When the entire life cycle is complete, it is considered as a
complete transaction as shown in the following figure. The initial and final state of an
object is also shown in the following figure.
CHAPTER -5
5. IMPLEMENTATION
Step 6: Save the cleaned car data set after performing the cleaning operations on the data.
Step 9: Split the new data into 80% training data and 20% testing data.
Step 10: Train the model on the training data and evaluate it on the testing data.
Step 11: Apply a one-hot encoder and column transformer to the model.
Step 14: If the accuracy is good, use the model for prediction; otherwise fit the model
again using other random states.
Step 15: Dump the Linear Regression model to a file using pickle.
Step 16: Open PyCharm and copy the cleaned car.csv and LinearRegressionModel.pkl
files into our project.
Step 17: Read the model and dataset, and make the prediction from the webpage using
Python and Flask.
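The 80/20 split of Step 9, and the role of the random state mentioned in Step 14, can be sketched without any library as follows. This is only an illustration: the function name `train_test_split_rows` and the integer rows are hypothetical stand-ins, and a library routine would normally perform this step in the actual notebook.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` with a fixed seed and split off a test set."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle, and therefore the split, repeatable.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of 100 rows, represented here by their row numbers.
rows = list(range(100))

train_rows, test_rows = train_test_split_rows(rows, test_fraction=0.2, seed=42)
print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

Changing `seed` produces a different partition of the same data, which is why the measured accuracy can vary from one random state to another.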
car_price_predictor/master/quikr_car.csv")
car.shape
(892, 6)
car.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
# Column Non-Null Count Dtype
car['kms_driven'].unique()
array(['45,000 kms', '40 kms', '22,000 kms', '28,000 kms', '36,000 kms',
'59,000 kms', '41,000 kms', '25,000 kms', '24,530 kms',
'60,000 kms', '30,000 kms', '32,000 kms', '48,660 kms',
'4,000 kms', '16,934 kms', '43,000 kms', '35,550 kms',
'22,134 kms', '1,000 kms', '8,500 kms', '87,000 kms', '6,000 kms',
'15,574 kms', '8,000 kms', '55,800 kms', '56,400 kms',
'72,160 kms', '11,500 kms', '1,33,000 kms', '2,000 kms',
car['fuel_type'].unique()
array(['Petrol', 'Diesel', nan, 'LPG'], dtype=object)
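# Keep an untouched copy of the raw data
backup = car.copy()
# Keep only rows whose year is numeric, then convert the column to int
car = car[car['year'].str.isnumeric()]
car['year'] = car['year'].astype(int)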
car.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 842 entries, 0 to 891
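# Keep only the numeric part of 'kms_driven' (e.g. '45,000 kms' -> '45000')
car['kms_driven'] = car['kms_driven'].str.split(' ').str.get(0).str.replace(',', '')
# Drop rows whose kms value is not numeric, then convert the column to int
car = car[car['kms_driven'].str.isnumeric()]
car['kms_driven'] = car['kms_driven'].astype(int)
# Drop rows with a missing fuel type
car = car[~car['fuel_type'].isna()]
car = car.reset_index(drop=True)
# Keep only cars priced below 6e6 to exclude price outliers
car = car[car['Price'] < 6e6].reset_index(drop=True)
# Save the cleaned data set
car.to_csv('cleaned car.csv')
# Splitting the features and target
x = car.drop(columns='Price')
y = car['Price']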
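The string clean-up applied to the kms_driven column in the listing above can be illustrated on individual values without pandas. The sketch below is an assumption-laden stand-in: the helper name `parse_kms` is hypothetical, and the non-numeric sample entries ("Petrol" and a missing value) are invented examples of rows that the numeric filter would discard.

```python
def parse_kms(raw):
    """Convert strings such as '45,000 kms' to an int; return None otherwise."""
    if not isinstance(raw, str):
        return None  # missing values are not strings
    # Take the part before the first space and drop the thousands separators.
    first_token = raw.split(" ")[0].replace(",", "")
    return int(first_token) if first_token.isnumeric() else None


# The first three samples appear in the data set; the last two are invented.
samples = ["45,000 kms", "40 kms", "1,33,000 kms", "Petrol", None]

parsed = [parse_kms(sample) for sample in samples]
for sample, value in zip(samples, parsed):
    print(repr(sample), "->", value)
```

Removing every comma handles both the "45,000" and the "1,33,000" digit grouping, because only the digits matter once the separators are gone.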
ohe=OneHotEncoder()
ohe.fit(x[['name','company','fuel_type']])
OneHotEncoder()
ohe.categories_
[array(['Audi A3 Cabriolet', 'Audi A4 1.8', 'Audi A4 2.0', 'Audi A6 2.0',
'Audi A8', 'Audi Q3 2.0', 'Audi Q5 2.0', 'Audi Q7', 'BMW 3 Series',
'BMW 5 Series', 'BMW 7 Series', 'BMW X1', 'BMW X1 sDrive20d',
'BMW X1 xDrive20d', 'Chevrolet Beat', 'Chevrolet Beat Diesel',
'Chevrolet Beat LS', 'Chevrolet Beat LT', 'Chevrolet Beat PS',
'Chevrolet Cruze LTZ', 'Chevrolet Enjoy', 'Chevrolet Enjoy 1.4',
'Chevrolet Sail 1.2', 'Chevrolet Sail UVA', 'Chevrolet Spark',
'Audi A8', 'Audi Q3 2.0', 'Audi Q5 2.0', 'Audi Q7', 'BMW 3 Series',
'BMW 5 Series', 'BMW 7 Series', 'BMW X1', 'BMW X1 sDrive20d',
'BMW X1 xDrive20d', 'Chevrolet Beat', 'Chevrolet Beat...
array(['Audi', 'BMW', 'Chevrolet', 'Datsun', 'Fiat', 'Force', 'Ford',
'Hindustan', 'Honda', 'Hyundai', 'Jaguar', 'Jeep', 'Land',
'Mahindra', 'Maruti', 'Mercedes', 'Mini', 'Mitsubishi', 'Nissan',
'Renault', 'Skoda', 'Tata', 'Toyota', 'Volkswagen', 'Volvo'],
dtype=object),
array(['Diesel', 'LPG', 'Petrol'], dtype=object)]),
['name', 'company','fuel_type'])])),
('linearregression', LinearRegression())])
y_pred=pipe.predict(x_test)
y_pred
y_test
322 210000
204 500000
42 284999
606 500000
513 159000
...
801 465000
711 200000
731 300000
757 150000
379 130000
Name: Price, Length: 164, dtype: int32
r2_score(y_test,y_pred)
0.6863234123258164
pipe=make_pipeline(column_trans,lr)
pipe.fit(x_train,y_train)
y_pred=pipe.predict(x_test)
scores.append(r2_score(y_test,y_pred))
import numpy as np
np.argmax(scores)
906
scores[np.argmax(scores)]
0.7768125045875028
#prediction
array([459113.49353657])
# dumping the LinearRegressionModel.pkl file using pickle for further development process
import pickle
pickle.dump(pipe,open('LinearRegressionModel.pkl','wb'))
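A slightly more defensive way to write the same save and reload step is to open the file in a `with` block so that it is closed even if serialization fails. The sketch below is a stand-alone illustration: a plain dictionary stands in for the fitted pipeline, and the temporary directory is an assumption made so the example does not overwrite any real model file.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object behaves the same way.
model = {"intercept": 900000.0, "slope": -60000.0}

# Write into a throwaway directory so no real model file is touched.
path = os.path.join(tempfile.mkdtemp(), "LinearRegressionModel.pkl")

# The with-block closes the file even if dump() raises.
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Reload the object, as the web application does at start-up.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == model)
```

Two practical cautions apply when the real pipeline is pickled: unpickling can execute arbitrary code, so only load files from a trusted source, and a model pickled under one scikit-learn version is not guaranteed to load correctly under another.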
1. home.html
<!doctype html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
</nav>
<a class="btn btn-outline-primary" href="/logout">Log out</a>
</div>
<div class="container">
<div class="row">
<div class="card mt-50" style="width:100%;height:100%">
<div class="card-header">
<div class="col-12" style="text-align:center">
<h1>Welcome to Car Price Predictor</h1>
</div>
</div>
<div class="card-body">
<form class="form" method="post" >
</select>
</div>
</input>
</div>
</form>
<br>
<div class="row">
<div class="col-12" style="text-align: center">
<h3><span id="prediction"></span> </h3>
</div>
</div>
</div>
</div>
</div>
</div>
<script>
function load_car_models(company_id,car_model_id)
{
var company= document.getElementById(company_id);
var car_model= document.getElementById(car_model_id);
car_model.value="";
car_model.innerHTML="";
if(company.value == "{{company}}" )
{
{% for model in car_models %}
{% if company in model %}
{% endif %}
{% endfor %}
}
{% endfor %}
}
function form_handler()
{
event.preventDefault();
}
function send_data()
{
document.querySelector('form').addEventListener('submit', form_handler);
var fd= new FormData(document.querySelector('form'));
xhr.onreadystatechange= function()
{
if(xhr.readyState == XMLHttpRequest.DONE)
{
document.getElementById("prediction").innerHTML="The Predicted Price is: "+
xhr.responseText + " Rs/-";
}
}
xhr.onload=function(){};
xhr.send(fd);
}
</script>
<!-- Optional JavaScript -->
<!-- jQuery first, then Popper.js, then Bootstrap JS -->
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/umd/popper.min.js"
integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.min.js"
integrity="sha384-ChfqqxuZUCnJSK3+MXmPNIyE6ZbWh2IMqE241rYiqJxyMiZ6OW/JmZQ5stwEULTy"
crossorigin="anonymous"></script>
</body>
</html>
2. app.py
import pandas as pd
#from flask import Flask, render_template, request, url_for,redirect,session
import pickle
import numpy as np
from flask import *
import flask_login
import os
from num2words import num2words
import mysql.connector
model=pickle.load(open("LinearRegressionModel.pkl",'rb'))
car=pd.read_csv("cleaned car.csv")
app=Flask(__name__)
app.secret_key=os.urandom(24)
conn=mysql.connector.connect(
host='localhost',
user='root',
password='Password123@',
port='3306',
database='database'
)
mycursor=conn.cursor()
@app.route('/')
def login():
    if 'user_id' in session:
        return redirect('/home')
    else:
        return render_template('login.html')
@app.route('/register')
def register():
    return render_template('register.html')
@app.route('/logout')
def logout():
    # Passing a default avoids a KeyError when no user is logged in
    session.pop('user_id', None)
    return redirect('/')
@app.route('/login_validation',methods=['POST'])
def login_validation():
    email=request.form.get('email')
    password=request.form.get('password')
    if len(uinfo)>0:
        session['user_id']=uinfo[0][0]
        return redirect('/home')
    else:
        flash('Incorrect username/ password')
        return redirect('/')
@app.route('/add_user',methods=['POST'])
def add_user():
    name=request.form.get('uname')
    email=request.form.get('uemail')
    password=request.form.get('upassword')
    conn.commit()
    return render_template('login.html')
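The database statements that look up `uinfo` and insert the new user are not shown in the listing. The sketch below illustrates one common way such a lookup and insert can be written safely; it is an assumption, not the project's code. It swaps MySQL for an in-memory `sqlite3` database so that it runs stand-alone, and the `users` table, its columns, and the helper names are hypothetical. It uses parameterized queries rather than string formatting and stores a salted password hash instead of the plain password.

```python
import hashlib
import hmac
import os
import sqlite3

# In-memory SQLite stands in for the MySQL connection used by the project.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT, email TEXT UNIQUE, "
    "salt BLOB, pw_hash BLOB)"
)  # hypothetical schema


def hash_password(password, salt):
    """Derive a salted hash so the plain password is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def add_user(name, email, password):
    """Insert a user; the ? placeholders keep user input out of the SQL text."""
    salt = os.urandom(16)
    conn.execute(
        "INSERT INTO users (name, email, salt, pw_hash) VALUES (?, ?, ?, ?)",
        (name, email, salt, hash_password(password, salt)),
    )
    conn.commit()


def validate_login(email, password):
    """Return the user's id when the credentials match, otherwise None."""
    row = conn.execute(
        "SELECT id, salt, pw_hash FROM users WHERE email = ?", (email,)
    ).fetchone()
    if row is None:
        return None
    user_id, salt, pw_hash = row
    matches = hmac.compare_digest(pw_hash, hash_password(password, salt))
    return user_id if matches else None


add_user("Test User", "user@example.com", "s3cret")
print(validate_login("user@example.com", "s3cret"))   # id of the new user
print(validate_login("user@example.com", "wrong"))    # None
print(validate_login("' OR '1'='1", "anything"))      # None
```

With `mysql.connector` the same pattern applies, except that the placeholder marker is `%s` instead of `?`.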
@app.route('/home')
def home():
    companies=sorted(car['company'].unique())
    car_models = sorted(car['name'].unique())
    year = sorted(car['year'].unique(),reverse=True)
    fuel_type = (car['fuel_type'].unique())
    companies.insert(0, "Select Company")
    year.insert(0,"Select Year of Purchase")
    if 'user_id' in session:
        return render_template('home.html',companies=companies,car_models=car_models,years=year,
                               fuel_types=fuel_type)
    else:
        return redirect('/')
@app.route('/predict',methods=['POST'])
def predict():
    company= request.form.get('company')
    car_model=request.form.get('car_model')
    year=request.form.get('year')
    fuel_type=request.form.get('fuel_type')
    kms_driven=request.form.get('kms_driven')
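The remainder of `predict` is not shown in the listing. Before the form values can be passed to the model they need to be checked and converted, since every form field arrives as a string and the drop-downs contain placeholder entries such as "Select Company". The sketch below shows one possible validation step under stated assumptions: the function name `parse_prediction_form`, the error messages, and the feature keys `year` and `kms_driven` are hypothetical, while `name`, `company`, and `fuel_type` follow the categorical columns used by the encoder.

```python
# Placeholder entries inserted into the drop-downs by the home() view.
PLACEHOLDERS = {"", "Select Company", "Select Year of Purchase"}

REQUIRED = ("company", "car_model", "year", "fuel_type", "kms_driven")


def parse_prediction_form(form):
    """Return (features, error); exactly one of the two is None."""
    cleaned = {key: (form.get(key) or "").strip() for key in REQUIRED}
    missing = [key for key, value in cleaned.items() if value in PLACEHOLDERS]
    if missing:
        return None, "Please fill in: " + ", ".join(missing)
    try:
        year = int(cleaned["year"])
        kms_driven = int(cleaned["kms_driven"])
    except ValueError:
        return None, "year and kms_driven must be whole numbers"
    if kms_driven < 0:
        return None, "kms_driven cannot be negative"
    features = {
        "name": cleaned["car_model"],
        "company": cleaned["company"],
        "year": year,
        "kms_driven": kms_driven,
        "fuel_type": cleaned["fuel_type"],
    }
    return features, None


good = {"company": "Audi", "car_model": "Audi A8", "year": "2015",
        "fuel_type": "Diesel", "kms_driven": "45000"}
bad = dict(good, company="Select Company")
print(parse_prediction_form(good))
print(parse_prediction_form(bad))
```

A view function could return the error text to the page when `error` is not None and otherwise build a one-row table from `features` for the pickled pipeline, which keeps invalid input from ever reaching the model.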
CHAPTER -6
6. TESTING
Testing is the process of evaluating a system or its component(s) to determine whether it
satisfies the specified requirements. It involves executing the
system in order to identify any gaps, errors, or missing requirements relative to the
actual requirements.
Who performs testing depends on the process and the associated stakeholders of the project(s). In the IT
industry, large companies have a team responsible for evaluating the developed
software against the given requirements. Developers also test their own code,
which is called Unit Testing. In most cases, the following professionals are involved in
testing a system within their respective capacities:
● Software Tester
● Software Developer
● Project Lead/Manager
● End User
Levels of testing include different methodologies that can be used while conducting
software testing. The main levels of software testing are:
● Functional Testing
● Non-functional Testing
Functional Testing
This is a type of black-box testing that is based on the specifications of the software
under test. The application is tested by providing input, and the results are then
examined to confirm that they conform to the intended functionality. Functional
testing of software is conducted on a complete, integrated system to evaluate the
system's compliance with its specified requirements.
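A black-box functional test of this kind can be written as a table of inputs and expected outputs that is run against the public behaviour only. The sketch below uses Python's `unittest` module; `login_redirect` is a hypothetical stand-in for the application's login check, chosen only because a successful login redirects to `/home` and a failed one back to `/`, and the credentials are made up.

```python
import unittest


def login_redirect(users, email, password):
    """Hypothetical stand-in for the login check: return the redirect target."""
    return "/home" if users.get(email) == password else "/"


# Each case: description, email, password, expected redirect target.
CASES = [
    ("valid credentials", "user@example.com", "s3cret", "/home"),
    ("wrong password", "user@example.com", "wrong", "/"),
    ("unknown user", "nobody@example.com", "s3cret", "/"),
]


class LoginFunctionalTest(unittest.TestCase):
    USERS = {"user@example.com": "s3cret"}

    def test_cases(self):
        # The test only feeds inputs and checks outputs (black-box style).
        for name, email, password, expected in CASES:
            with self.subTest(name):
                self.assertEqual(
                    login_redirect(self.USERS, email, password), expected
                )


suite = unittest.defaultTestLoader.loadTestsFromTestCase(LoginFunctionalTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Keeping the cases in a table mirrors the manual test-case tables in this chapter, so each row can be traced to one automated check.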
1. Requirements Analysis
2. Test Planning
3. Test Analysis
4. Test Design
● Requirements Analysis
In this phase, testers analyze the customer requirements and work with developers during
the design phase to see which requirements are testable and how they are going to test
those requirements.
It is very important to start testing activities in the requirements phase itself, because
the cost of fixing a defect is much lower if it is found in the requirements phase than in
later phases.
● Test Planning
In this phase, all test planning is done: what needs to be tested, how the
testing will be done, the test strategy to be followed, the test environment, the
test methodologies to be used, hardware and software availability, resources, risks,
etc. A high-level test plan document that captures all of these planning inputs
is created and circulated to the stakeholders.
● Test Analysis
Once the test planning phase is over, the test analysis phase starts. In this phase we need to dig
deeper into the project and figure out what testing needs to be carried out in each SDLC
phase. Automation activities are also decided in this phase: whether automation needs to be
done for the software product, how the automation will be done, how much time it will
take to automate, and which features need to be automated. Non-functional testing
areas (stress and performance testing) are also analyzed and defined in this phase.
● Test Design
In this phase, various black-box and white-box test design techniques are used to design
the test cases. Testers start writing test cases by following those design
techniques; if automation testing is required, the automation scripts also need to be
written in this phase.
1 | http://127.0.0.1:5000/home | Dinesh, dinesh@gmail.com, 12345, Pranav@1 | Registered Successfully | Registered Successfully | Chrome | Pass | NA
12345, Pranav@gmail.com, Pranav@1
Table 6.2.5 Price Prediction Test case with selecting correct attributes
Table 6.2.6 Price Prediction Test case without selecting one or more attributes.
1 | Click on predict price button without filling all attributes | Clicking on predict price button | Fill all attributes | Fill all the attributes and price is not predicted. | Chrome | Pass | Price is not predicted
2 | Click on predict price button with filling incorrect attributes | Clicking on predict price button | Incorrect attributes | Price is not predicted. | Chrome | Pass | Price is not predicted
1 | Click on home button | Click | Refreshing home page | Home page refreshed | Chrome | Pass | Successfully refreshed
1 | Click on logout button | Click on button | Return to login page | Return back to login page | Chrome | Pass | Log out successfully
CHAPTER -7
7. SCREENSHOTS
CHAPTER -8
8. FUTURE ENHANCEMENTS
CHAPTER -9
9. CONCLUSION
The prediction error rate of all the models was well under the accepted 5%
error. However, on further analysis, the mean error of the regression tree model was found to
be higher than the mean error of the linear regression model. Even though the regression
tree has better accuracy for some seeds, its error rates are higher for the rest. This
was confirmed by performing an ANOVA. The post-hoc test also revealed that
the error rates of the multiple regression and lasso regression models are not
significantly different from each other. To obtain even more accurate models, we can
choose more advanced machine learning algorithms such as random forests, an
ensemble learning algorithm that builds multiple decision/regression trees and thereby
reduces overfitting substantially, or boosting, which tries to bias the overall model by
weighting it in favor of good performers. More data from newer websites and
different countries can also be scraped and used to retrain these models to
check for reproducibility.
CHAPTER -10
10. REFERENCES
[1]. E. Gegic, B. Isakovic, D. Keco, Z. Masetic, and J. Kevric, “Car price prediction
using machine learning techniques,” 2019.
[6]. K. Noor and S. Jan, “Vehicle price prediction system using machine learning
techniques,” International Journal of Computer Applications, vol. 167, no. 9, pp. 27–31,
2017.
[7]. M. Jabbar, “Prediction of heart disease using k-nearest neighbor and particle swarm
optimization,” Biomed. Res, vol. 28, no. 9, pp. 4154–4158, 2017.
[9]. S. Pudaruth, “Predicting the price of used cars using machine learning techniques,”
Int. J. Inf. Comput. Technol, vol. 4, no. 7, pp. 753–764, 2014.
[11]. Q. Yuan, Y. Liu, G. Peng, and B. Lv, “A prediction study on the car sales based on
web search data,” in The International Conference on E-Business and E-Government
(Index by EI), 2011, p. 5.
[12]. K. S. Durgesh and B. Lekha, “Data classification using support vector machine,”
Journal of theoretical and applied information technology, vol. 12, no. 1, pp. 1–7, 2010.
[14]. S. Veni and A. Srinivasan, “Defect classification using naive Bayes classification,”
International Journal of Applied Engineering Research, vol.
[16]. M. C. Sorkun, “Secondhand car price estimation using artificial neural network.”