
Analyzing Selling Price of Used Cars Using Machine Learning

Submitted by
M. Sai Kumar
Roll No: 22STUCHH010702

Under the Guidance of
Dr. S. Jayanthi
Department of Data Science and Artificial Intelligence

IcfaiTech (Deemed to be University)
HYDERABAD

April 2025

Signature of Professor
Abstract
The valuation of used cars is a vital concern for both buyers and sellers in the automotive sector. Accurate selling-price forecasts support informed purchase and sale decisions, improve market efficiency, and enable dynamic pricing strategies. A vehicle's market worth is shaped by many attributes, including its brand, model, manufacturing year, fuel type, transmission type, and mileage. Conventional pricing approaches depend heavily on historical data and expert judgment, which are prone to variability, inconsistency, and subjectivity.

This research delves into the use of modern machine learning algorithms to build
predictive models that estimate the market value of used vehicles. Employing advanced
techniques such as Random Forest, XGBoost, K-Nearest Neighbors (KNN), and
Bayesian Ridge Regression, the models are trained on structured datasets containing
diverse vehicle characteristics including specifications, mileage, and age. The core aim is
to assess and compare the predictive performance of these algorithms using evaluation
metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared,
thus identifying the most accurate and robust model for this application.

By utilizing sophisticated machine learning techniques, this study aims to develop a robust and scalable prediction system that ensures accuracy and reliability across various market segments and vehicle types. Empirical analysis demonstrates that ensemble methods such as Random Forest and XGBoost consistently outperform traditional regression models in prediction accuracy and adaptability. The results suggest that machine learning can significantly improve decision-making for buyers, sellers, and automotive dealerships.

TABLE OF CONTENTS

1. INTRODUCTION
2. LITERATURE REVIEW
  2.1 Traditional Approaches and Limitations
  2.2 Machine Learning in Price Prediction
  2.3 Comparative Model Studies
  2.4 Bayesian Models in Practice
  2.5 Identified Gaps and Research Direction
3. DATA COLLECTION AND PREPROCESSING
4. EXPLORATORY DATA ANALYSIS (EDA)
  4.1 Distribution of Car Brands and Models
  4.2 Selling Price vs. Key Features
  4.3 Correlation Matrix
  4.4 Outlier Detection and Impact
5. MODEL DESCRIPTION
  5.1 Random Forest
  5.2 XGBoost (Extreme Gradient Boosting)
  5.3 K-Nearest Neighbors (KNN)
  5.4 Bayesian Ridge Regression
6. MODEL IMPLEMENTATION
  6.1 Core Libraries
  6.2 Model Training and Testing Pipeline
  6.3 Hyperparameter Tuning
  6.4 Cross-Validation
7. EVALUATION METRICS
  7.1 MAE (Mean Absolute Error)
  7.2 MSE (Mean Squared Error)
  7.3 RMSE (Root Mean Squared Error)
  7.4 R-squared (Coefficient of Determination)
8. RESULTS AND DISCUSSION
  8.1 Model Performance Comparison
  8.2 Analysis of Results
  8.3 Visualization and Interpretability
  8.4 Key Insights
  8.5 Limitations
9. CONCLUSION
10. REFERENCES

LIST OF FIGURES

Figure 1 (Section 5: Model Description)
Figure 2 (Section 5.1: Random Forest)
Figure 3 (Section 5.3: K-Nearest Neighbors)
Figure 4 (Section 5.4: Bayesian Ridge Regression)

1. INTRODUCTION
The pre-owned automobile industry is witnessing notable expansion, establishing price
prediction as a crucial consideration for stakeholders including individual buyers, sellers,
and dealership networks. Global analyses suggest a substantial increase in the demand for
used vehicles, fueled by consumer interest in affordable alternatives and the growing
accessibility of certified second-hand cars. Amid this growth, accurate price estimation
has emerged as a key requirement for fair trade, enhanced customer trust, and improved
transparency within the market.

Various factors influence the market price of a used car—ranging from the vehicle’s
brand, model, and year of manufacture to attributes like mileage, fuel type, transmission
style, accident history, and the car’s overall physical and mechanical condition.
Traditional methods of vehicle appraisal often lean heavily on subjective assessments and
static historical pricing records, making them susceptible to inconsistencies,
inefficiencies, and human error. Moreover, manual evaluations are time-consuming and
may not adapt well to shifting market conditions.

Advanced data-driven approaches using machine learning offer an automated and reliable
alternative to traditional pricing models. These models are capable of uncovering
underlying patterns and complex relationships between vehicle features and selling
prices, based on historical data. By training on relevant datasets, machine learning
models improve the accuracy of price prediction, helping stakeholders make
better-informed decisions. This project applies a variety of these algorithms to forecast used car prices through a systematic and analytical process. By assessing the performance of several machine learning techniques, including both ensemble-based and regression-based methods, this research identifies the model best suited to predicting used car values.

2. LITERATURE REVIEW
The prediction of used car prices using data-driven techniques has attracted considerable
academic and industry interest in recent years. Traditional appraisal systems often rely on
historical data, manual inspection, and human expertise. However, such methods are
prone to biases and are not scalable in dynamic markets. With the advancement of
machine learning (ML), several studies have been conducted to explore the viability and
accuracy of predictive modeling in this domain.

2.1 Traditional Approaches and Limitations​


Earlier research typically employed linear regression or rule-based models using a limited
set of variables. These methods, while interpretable, struggled to capture nonlinear
relationships and were limited in handling complex datasets. For example, studies using
ordinary least squares regression found it challenging to manage multicollinearity among
features like mileage, age, and engine type.

2.2 Machine Learning in Price Prediction

Recent works have shown that ML algorithms significantly outperform traditional statistical models. According to Kamarudin et al. (2021), ensemble methods like Random
Forest and Gradient Boosting Machines produce more robust predictions by aggregating
multiple learners, thereby reducing variance and overfitting. The study emphasized the
role of feature selection and preprocessing in enhancing model performance.

2.3 Comparative Model Studies

Aditya Desai’s dataset on Kaggle has been used extensively in ML studies to compare
algorithm performance on predicting car prices. XGBoost has often emerged as a top
performer due to its ability to handle missing data, perform regularization, and optimize
execution time. K-Nearest Neighbors (KNN), though simpler, has been used effectively in cases where computational resources are not a limitation, especially for smaller datasets.

2.4 Bayesian Models in Practice​


Bayesian Ridge Regression is noted in academic literature for its probabilistic approach
and suitability for datasets with multicollinearity. While not as powerful as ensemble
methods in terms of pure accuracy, it offers confidence intervals in predictions, which can
be valuable in certain decision-making scenarios.

2.5 Identified Gaps and Research Direction

Despite promising results, many studies overlook real-time market variables, such as
geographic location, fuel price trends, and seasonal demand fluctuations. Moreover, deep
learning methods remain underexplored in this context due to their data and
computational requirements. This project addresses some of these gaps by focusing on
structured data and providing a detailed comparative analysis of four diverse algorithms:
Random Forest, XGBoost, KNN, and Bayesian Ridge Regression.

The exponential
growth of the used car market has driven researchers and industry professionals to
develop accurate, data-driven approaches for estimating vehicle resale value. With the
rise of machine learning (ML), several studies have explored its application in used car
price prediction. This literature review critically examines research efforts in this area,
focusing on ML model performance, feature selection, dataset characteristics, and
methodological trends.

1. The Importance of Accurate Price Prediction

Accurately predicting the selling price of used vehicles is crucial for various
stakeholders—dealers, buyers, sellers, and financial institutions. Traditional statistical
approaches, while useful, are often limited in handling high-dimensional and nonlinear
data (Sankaran & Binu, 2020). This has led to the increasing adoption of machine
learning methods, which can identify complex patterns and interactions among features.
2. Machine Learning Techniques Applied

Several supervised learning techniques have been employed in the context of used car
price prediction. Linear Regression remains a baseline model due to its interpretability
(Mehra & Saxena, 2019). However, it often underperforms when dealing with
nonlinearities in data.

Decision Trees and Random Forests are among the most popular models due to their
robustness and ability to handle both numerical and categorical data (Yadav & Verma,
2018; Kumar & Ranjan, 2020). Random Forest, in particular, is widely favored for its
ensemble-based nature, reducing overfitting and improving accuracy.

Boosting algorithms such as XGBoost have been recognized for their high prediction
performance. Mandal and Mukherjee (2021) used ensemble methods like XGBoost and
found them to outperform single-model approaches significantly.

Deep learning models are also gaining traction. Siddiqui and Alam (2021) compared deep
neural networks with traditional ML models and observed better performance with deep
learning in handling large datasets with multiple variables.

Support Vector Regression (SVR) and K-Nearest Neighbors (KNN) have been used as
well, though with varying success. SVR works well for smaller datasets with fewer
features, while KNN struggles with high-dimensional data (Patel & Shah, 2020).

3. Feature Selection and Engineering

Feature selection plays a vital role in the effectiveness of machine learning models.
Commonly used features across studies include year of manufacture, brand, model,
mileage, fuel type, engine capacity, transmission, number of previous owners, and
location (Chaurasia & Pal, 2021).

Bansal and Aggarwal (2022) emphasized the role of feature engineering in improving
model accuracy. By creating derived features such as vehicle age and kilometers driven
per year, they were able to significantly enhance prediction performance.

Some researchers also incorporated text data such as car descriptions or user reviews
using natural language processing (NLP) techniques, although this is less common due to
the added complexity (Lee & Park, 2019).

4. Data Sources and Preprocessing

Most studies utilize publicly available datasets such as those from Kaggle or scraped
from online car sales portals. The quality of data plays a critical role in the success of ML
models. Handling missing values, outliers, and inconsistent formatting is a crucial
preprocessing step (Rai & Samanta, 2020).

Normalization or standardization is often applied to ensure that variables with different units of measurement do not disproportionately affect the model's performance. Label
encoding and one-hot encoding are frequently used for handling categorical features
(Koirala & Sharma, 2021).

Chaurasia and Pal (2021) also explored the importance of balancing the dataset across
various classes and price ranges to ensure that the model does not become biased toward
high-frequency price ranges.

5. Model Evaluation and Metrics

Researchers have used various metrics to evaluate the performance of machine learning
models. The most common ones include Root Mean Squared Error (RMSE), Mean
Absolute Error (MAE), R² Score, and Mean Absolute Percentage Error (MAPE). These
metrics provide insight into both the accuracy and generalization ability of the models.

For example, Mandal and Mukherjee (2021) reported an R² value of 0.93 for XGBoost,
indicating strong predictive performance. Similarly, Sankaran and Binu (2020) achieved
an MAE of around 1,500 INR, which was acceptable within their dataset's price range.

Kabra and Bichkar (2011) conducted a performance comparison of classification-based models and concluded that regression models are more suited for continuous variables
like car price prediction.

6. Comparative Studies and Hybrid Models

Comparative analysis of different machine learning algorithms has been a key area of
focus. Yadav and Verma (2018) compared Linear Regression, Decision Trees, and
Random Forest, concluding that ensemble models performed better due to their ability to
reduce variance and bias.

Hybrid models are also gaining popularity. Singh and Saini (2020) proposed a hybrid
framework combining regression with clustering techniques to pre-group similar cars
before prediction, which led to improved results.

In another study, Kumar and Saha (2020) used a stacked ensemble of Random Forest,
Gradient Boosting, and SVR, finding that the combination yielded better predictions than
any individual model.

7. Challenges and Limitations

While the results from these studies are promising, several challenges persist. One major
issue is the variability of car prices due to subjective factors like vehicle condition,
negotiation margins, and local market trends—features that are often difficult to quantify
(Rai & Samanta, 2020).

Overfitting is another recurring challenge, especially with complex models and smaller
datasets. Researchers address this by using techniques like cross-validation,
regularization, and hyperparameter tuning (Lee & Park, 2019).

Moreover, data privacy and ethical considerations arise when dealing with personal
information, especially when scraping data from online platforms (Siddiqui & Alam,
2021).

8. Future Directions

Future research could explore integrating real-time data sources such as auction prices or dealer
inventory APIs. The use of NLP for processing vehicle descriptions and computer vision to
analyze vehicle images also holds significant potential.

Another promising area is the use of reinforcement learning to optimize pricing strategies
dynamically based on market trends and customer behavior (Silver et al., 2016).

Additionally, applying transfer learning and domain adaptation techniques can help in
generalizing models trained on one region’s data to another, improving scalability and
applicability (Goodfellow et al., 2016).

3. DATA COLLECTION AND PREPROCESSING

1.​ Data Acquisition: The dataset used in this study was sourced from public
repositories and includes comprehensive information about used vehicles.
Key features include the car’s brand, model name, manufacturing year, fuel
type (e.g., petrol, diesel, electric), transmission type (manual or automatic),
mileage (in kilometers), number of previous owners, and the vehicle’s final
selling price. Additional features such as tax, engine size, and location were
also considered where available to enrich the dataset and enhance model
performance.
2.​ Data Preparation: The raw dataset was first examined for missing or
inconsistent entries. Null values in critical fields like mileage or engine size
were handled through imputation using median or mean values, while
records with incomplete essential information were removed. Outliers were
identified using IQR and Z-score techniques, especially in features like
price and mileage, and were either capped or excluded. Categorical features
such as fuel type and brand were encoded using techniques like One-Hot
Encoding and Label Encoding to make them suitable for model training.
Continuous features were scaled using Min-Max or Standard Scaler, depending
on the model requirements (a consolidated code sketch of these steps follows this list).
3.​ Feature Engineering: To improve the predictive capability of the models,
new features were engineered from the existing ones. A primary example is
deriving the vehicle's age from the current year and the year of
manufacture. Mileage per year was computed to normalize usage over time,
giving better insight into wear and tear. Binary features indicating the
presence of automatic transmission or whether the car had a diesel engine
were also created to help decision tree-based models in splitting effectively.

4.​ Train-Test Split: To evaluate model performance effectively and prevent
overfitting, the dataset was divided into training and testing subsets. A
standard 80-20 split was applied, ensuring that both subsets maintained
similar distributions across key features such as brand and price range. The
training set was used to build and tune the models, while the test set served
as an unseen validation set to assess generalization accuracy. In some
experiments, k-fold cross-validation (with k=5 or 10) was also employed to
ensure robust performance comparisons.
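
A minimal, end-to-end sketch of these steps in Python (a sketch only: the file name used_cars.csv and column names such as year, mileage, and selling_price are illustrative, not the project's actual schema):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("used_cars.csv")

# Impute missing mileage with the median; drop rows missing essential fields.
df["mileage"] = df["mileage"].fillna(df["mileage"].median())
df = df.dropna(subset=["brand", "year", "selling_price"])

# Cap selling-price outliers with the 1.5 * IQR rule.
q1, q3 = df["selling_price"].quantile([0.25, 0.75])
iqr = q3 - q1
df["selling_price"] = df["selling_price"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Engineer derived features: vehicle age and mileage per year.
df["age"] = 2025 - df["year"]  # 2025 = reference year of this report
df["mileage_per_year"] = df["mileage"] / df["age"].clip(lower=1)

# One-hot encode categorical features.
df = pd.get_dummies(df, columns=["brand", "fuel_type", "transmission"])

# Standard 80-20 train-test split, as described above.
X = df.drop(columns=["selling_price"])
y = df["selling_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)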

4. EXPLORATORY DATA ANALYSIS (EDA)
EDA is the process of summarizing and visualizing data to uncover patterns, spot
anomalies, and test assumptions. In the context of used car price prediction, it helps
understand how various features like brand, mileage, and age influence selling prices.

4.1. Distribution of Car Brands and Models

Understanding which car brands and models dominate the dataset is crucial. Some brands
retain value better (e.g., Toyota, BMW), while others depreciate faster. Count plots can
show how frequently each brand or model appears, highlighting any imbalance. For
instance, if 60% of cars are from just three brands, models trained on this data may
struggle to generalize across less common brands. Filtering for the top 10 brands or
models improves focus and reduces noise.

4.2. Selling Price vs. Key Features

a. Vehicle Age

There's usually a strong negative correlation between a car’s age and its price. Newer cars
tend to sell for more, while older vehicles depreciate quickly. Scatter plots with trend
lines or box plots grouped by age buckets can illustrate this clearly.

b. Mileage

Mileage also impacts value, often negatively. Higher mileage suggests more wear and
tear. Again, scatter or box plots can highlight the pattern. It’s common to bin mileage into
ranges (e.g., 0–50k, 50k–100k) for clearer comparisons.

c. Fuel Type

Fuel type affects resale value. Diesel cars may sell higher in some markets due to better
fuel economy, while electric or hybrid vehicles may command a premium due to newer
technology and lower running costs. Box plots comparing fuel types can reveal these
trends.

d. Transmission

Automatic transmissions are generally more expensive, especially in urban markets, due
to convenience. A simple box plot comparing prices of manual vs. automatic cars will
show if this holds true in your data.
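
As a quick illustration with seaborn (the column names transmission and selling_price are assumed placeholders for the dataset's actual schema):

import matplotlib.pyplot as plt
import seaborn as sns

# Compare price distributions for manual vs. automatic cars.
sns.boxplot(data=df, x="transmission", y="selling_price")
plt.show()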

4.3. Correlation Matrix

A correlation matrix helps identify relationships between numerical variables. Selling price is often negatively correlated with age and mileage, and positively correlated with engine size or horsepower. Strong correlations (above ±0.7) might indicate multicollinearity, which can affect model performance. Heatmaps are useful for visualizing these correlations, helping decide which features to include or transform.
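
A minimal heatmap sketch along these lines (assumes the DataFrame df from Section 3; only numeric columns enter the correlation):

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation matrix of numeric features")
plt.show()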

4.4. Outlier Detection and Impact

Outliers—such as luxury sports cars or incorrectly entered data—can skew model training. Box plots can visually identify outliers, while statistical methods like IQR (interquartile range) or Z-scores can detect them numerically (a short detection sketch follows the list below).

Handling outliers is critical:

1.​ Remove clearly erroneous values.


2.​ Cap extreme values using winsorization.
3.​ Models like linear regression are highly sensitive to outliers, while tree-based
models (like Random Forest or XGBoost) are more robust but can still be biased if
outliers dominate.
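
A brief detection sketch (the threshold of 3 standard deviations is a common convention, not a value taken from this report; df and selling_price are assumed):

import numpy as np

# Z-score method: flag prices far from the mean in standard-deviation units.
z = (df["selling_price"] - df["selling_price"].mean()) / df["selling_price"].std()
outliers = df[np.abs(z) > 3]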

5. MODEL DESCRIPTION

Figure 1

5.1. Random Forest

Definition: Random Forest is an ensemble learning technique that builds multiple decision trees and combines their outputs to improve prediction accuracy and reduce overfitting.

How it works:

1.​ Each tree is trained on a random subset of the data (bagging).​

2.​ At each node, only a random subset of features is considered for splitting.​

3.​ For regression: The average of all tree predictions is taken.​

4.​ For classification: Majority voting is used.​

Why it's effective:

➔​ Reduces variance by averaging predictions.​

➔​ Mitigates overfitting present in individual decision trees.​

➔​ Handles missing data and unbalanced datasets well.​

Figure 2

Use cases:

1.​ Classification (e.g., spam detection, medical diagnosis).


2.​ Regression (e.g., predicting car prices, housing costs).

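A minimal regression sketch, assuming the preprocessed X_train and y_train from Section 3 (illustrative hyperparameters, not the project's tuned values):

from sklearn.ensemble import RandomForestRegressor

# Bagged trees with random feature subsets; predictions are averaged.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Rank features by their share of impurity reduction across the forest.
importances = sorted(zip(X_train.columns, rf.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances[:5]:
    print(f"{name}: {score:.3f}")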

5.2. XGBoost (Extreme Gradient Boosting)

Definition: XGBoost is an advanced gradient boosting algorithm designed for speed and
performance. It builds trees sequentially, where each new tree corrects errors from the
previous ones.

How it works:

1.​ Unlike Random Forest, trees are built one at a time.


2.​ Each new tree attempts to minimize the error (residuals) from the previous trees.

3.​ Uses a regularization term to control model complexity and prevent overfitting.
4.​ Supports parallel computation and sparse data handling.​

Why it's powerful:

1.​ Highly efficient and scalable.​

2.​ Handles missing values internally.​

3.​ Often wins machine learning competitions due to its predictive performance.​

Diagram:

Initial Prediction → Residuals → Tree 1

Tree 2 learns from Tree 1’s errors

Tree 3 learns from Tree 2’s errors

Final prediction = Sum of all trees
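
The residual-correction loop above can be illustrated with a tiny hand-rolled booster built from shallow trees (an illustrative sketch of the idea only; real XGBoost adds regularization, second-order gradients, and many optimizations):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boost(X, y, n_trees=50, lr=0.1):
    """Boosting for squared error: each tree fits the current residuals."""
    pred = np.full(len(y), float(np.mean(y)))  # initial prediction: the mean
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred = pred + lr * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return trees, pred                      # final prediction = sum of all trees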

Use cases:

1.​ Tabular datasets (structured data).


2.​ Kaggle competitions.
3.​ Time series forecasting, fraud detection.

5.3. K-Nearest Neighbors (KNN)

Definition: KNN is a simple, instance-based learning algorithm that makes predictions based on the ‘K’ closest training samples in the feature space.

Figure 3

How it works:

1.​ No actual model is trained — it stores the training dataset.


2.​ For a given query point, calculate distances to all training samples.
3.​ Pick the K nearest ones.
4.​ For regression: return the average of K neighbors' values.
5.​ For classification: return the majority label.​

Distance metrics:

1.​ Euclidean distance (default)


2.​ Manhattan distance
3.​ Minkowski, cosine similarity

Pros:

1. Simple and intuitive.
2. Works well with small datasets.
3. No training time.

Cons:

1.​ Slow prediction on large datasets.


2.​ Sensitive to irrelevant features and feature scaling.
3.​ Doesn’t handle missing data well.​

Use cases:

1. Recommender systems.
2. Anomaly detection.
3. Classification tasks with small/clean datasets.
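
Because KNN is distance-based, it is usually wrapped with a feature scaler (see the cons above); a minimal sketch assuming the X_train, X_test, y_train split from Section 3:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Scale features so no single unit (e.g., kilometers) dominates the distance.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_train, y_train)
price_estimates = knn.predict(X_test)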

5.4. Bayesian Ridge Regression

Definition: Bayesian Ridge Regression is a linear regression technique that uses Bayesian
inference. It incorporates prior distributions on the model coefficients and outputs
probabilistic predictions.

How it works:

1.​ It estimates a distribution over the weight vector w rather than point estimates.
2.​ Priors: Assumes Gaussian priors on the weights (centered at zero).

3.​ The posterior distribution is also Gaussian, allowing confidence intervals on
predictions.
4.​ Regularization strength is automatically determined based on data (unlike Ridge
which uses fixed lambda).

Figure 4

Pros:

1.​ Robust to multicollinearity (correlated features).


2.​ Provides uncertainty estimates.
3.​ Suitable for small datasets.​

Cons:

1.​ Less interpretable than basic linear regression.


2.​ Can be computationally intensive for very large datasets.

Use cases:

1.​ Small-scale regression tasks.


2.​ When feature correlation is high.
3.​ Medical or scientific domains where uncertainty matters.
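
A minimal sketch of the uncertainty estimates mentioned above; scikit-learn's BayesianRidge can return a per-prediction standard deviation (assumes numeric, encoded X_train and X_test):

from sklearn.linear_model import BayesianRidge

br = BayesianRidge()
br.fit(X_train, y_train)

# return_std=True yields the posterior predictive standard deviation.
mean_price, std_price = br.predict(X_test, return_std=True)
print(f"first prediction: {mean_price[0]:.0f} +/- {std_price[0]:.0f}")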

6. MODEL IMPLEMENTATION

6.1. Core Libraries

1.​ pandas: For data loading, cleaning, and manipulation (DataFrame operations).
2.​ NumPy: For efficient numerical operations and array manipulations.
3.​ scikit-learn: Core library for modeling, preprocessing, metrics, and
cross-validation.
4.​ XGBoost: High-performance gradient boosting library.
5.​ matplotlib & seaborn: For data visualization and EDA.​

6.2. Model Training and Testing Pipeline

A typical pipeline involves the following steps:

Step 1: Data Preprocessing

➔​ Handle missing values using SimpleImputer.


➔​ Encode categorical features using OneHotEncoder or OrdinalEncoder.
➔​ Scale numeric features using StandardScaler or MinMaxScaler.​

You can combine preprocessing steps using ColumnTransformer and wrap everything in
a Pipeline:

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import StandardScaler, OneHotEncoder
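
For example, a minimal sketch of how the pieces might fit together (the column lists are placeholders for the actual numeric and categorical feature names):

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

numeric_cols = ["age", "mileage"]           # placeholder names
categorical_cols = ["brand", "fuel_type"]   # placeholder names

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([("prep", preprocess),
                  ("rf", RandomForestRegressor(random_state=42))])
model.fit(X_train, y_train)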

Step 2: Train-Test Split

Use train_test_split from scikit-learn to split data:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Model Selection

Use models like:

​RandomForestRegressor
​XGBRegressor from xgboost
​KNeighborsRegressor
​BayesianRidge​

6.3. Hyperparameter Tuning

GridSearchCV

Exhaustively tests all combinations of parameters:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5]}

grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

RandomizedSearchCV

Samples from a distribution of parameters, more efficient on large grids:

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

param_dist = {'n_estimators': [100, 200, 300], 'max_depth': [3, 5, 7]}

random_search = RandomizedSearchCV(XGBRegressor(),
                                   param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(X_train, y_train)

These tools return the best model with optimal hyperparameters.

6.4. Cross-Validation

To evaluate model generalization, use k-fold cross-validation. The dataset is split into k
parts, and the model is trained k times, each time using a different fold for validation and
the rest for training.

from sklearn.model_selection import cross_val_score

# scikit-learn negates MSE so that larger scores are better; negate to recover MSE.
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

➔​ k=5 or 10 are common choices.
➔​ Provides a more reliable performance estimate than a single train/test split.
➔​ Useful inside GridSearchCV or RandomizedSearchCV.

7. EVALUATION METRICS
When evaluating regression models, it’s important to measure how well predicted values match
actual outcomes. The following metrics are commonly used:

7.1. MAE (Mean Absolute Error)

Definition:​
MAE measures the average absolute difference between predicted and actual values. It tells us,
on average, how much the model’s predictions deviate from the true values.

Interpretation:​
Lower MAE means better model accuracy. It is easy to interpret and more robust to outliers than squared-error metrics.
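
In symbols, for n cars with actual prices y_i and predicted prices ŷ_i:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$$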

7.2. MSE (Mean Squared Error)

Definition:​
MSE measures the average of the squared differences between predicted and actual values.

Interpretation:​
By squaring the errors, MSE penalizes larger errors more heavily. This makes it sensitive to
outliers.
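
With the same notation:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$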

7.3. RMSE (Root Mean Squared Error)

Definition:​
RMSE is the square root of MSE, bringing the error back to the original unit of the target
variable.

Interpretation:​
It’s more interpretable than MSE and highlights large errors. A lower RMSE indicates better
performance.
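
That is:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$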

7.4. R-squared (Coefficient of Determination)

Definition:​
R-squared measures how much of the variance in the target variable is explained by the model.

Interpretation:​
Ranges from 0 to 1 (sometimes negative). Closer to 1 means better fit. R² = 0 means the model
performs no better than the mean.
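
With ȳ the mean of the actual prices:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$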

These metrics help compare models and select the best one for accurate predictions.
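
All four metrics are available in scikit-learn; a minimal sketch assuming y_test and predictions from the pipeline in Section 6:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)  # back in the target's own units
r2 = r2_score(y_test, predictions)
print(f"MAE={mae:.0f}  MSE={mse:.0f}  RMSE={rmse:.0f}  R2={r2:.3f}")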

8. RESULTS AND DISCUSSION
After training and validating the four machine learning models—Random Forest,
XGBoost, K-Nearest Neighbors (KNN), and Bayesian Ridge Regression—several key
observations emerged based on quantitative metrics and visual inspections.

8.1 Model Performance Comparison

Each model's predictive accuracy was evaluated using standard metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²). The comparative results are analyzed below.

8.2 Analysis of Results

➔​ Random Forest outperformed other models across all key metrics. Its ability to
model non-linear relationships and aggregate results across multiple decision trees
contributed to its robust performance. It also provided insights into feature
importance, identifying 'year of manufacture' and 'mileage' as dominant factors
affecting selling price.​

➔​ XGBoost, while slightly behind Random Forest, also demonstrated strong predictive power. Its boosting mechanism helped minimize errors by iteratively correcting previous mistakes. It also exhibited faster training times due to its optimized implementation.
➔​ KNN performed reasonably well but suffered from high computational
complexity, especially with a large dataset. Its performance heavily depended on
the choice of 'k' and the scaling of features. It struggled with outliers and
non-uniform distributions.
➔​ Bayesian Ridge Regression had the lowest performance among the four. Although
it was efficient and interpretable, its linear nature made it less suitable for
capturing complex patterns in the data.​

8.3 Visualization and Interpretability

Several plots were generated to visualize model behavior:

➔​ Actual vs. Predicted Price Plots: These showed tight clustering along the diagonal
for Random Forest and XGBoost, indicating high accuracy.
➔​ Residual Plots: Bayesian Ridge Regression showed larger and more scattered
residuals, while Random Forest had a tighter distribution around zero.
➔​ Feature Importance Graphs: For ensemble models, features such as vehicle age,
mileage, brand, and engine size were identified as having the highest predictive
power.​

8.4 Key Insights

➔​ Ensemble models, particularly Random Forest and XGBoost, are best suited for
used car price prediction tasks due to their ability to handle feature interactions
and non-linearities.​

➔​ Preprocessing steps like outlier removal, feature scaling, and encoding play a crucial role in ensuring model accuracy.
➔​ Interpretability can be a trade-off with performance. Bayesian Ridge offers easier
explanations but lower predictive accuracy.​

8.5 Limitations

➔​ The dataset was limited to a specific set of brands and may not generalize to the
entire market.
➔​ Temporal changes in market trends, seasonal effects, and location-based pricing
variations were not included.
➔​ KNN's performance degraded with increasing dataset size, making it less scalable.

9. CONCLUSION
The use of machine learning algorithms has significantly enhanced the accuracy and
reliability of predicting used car prices. This research demonstrated that ensemble
methods like Random Forest and XGBoost outperform other models, offering high
prediction accuracy, better generalization, and feature interpretability. While K-Nearest
Neighbors showed acceptable performance, it was limited by scalability and computation
time. Bayesian Ridge Regression, although interpretable, was less effective with
nonlinear data.

The findings suggest that ensemble learning techniques are especially suited for handling
heterogeneous features and complex relationships common in used car pricing. Random
Forest’s inherent ability to rank feature importance provided additional interpretive value,
identifying vehicle age, mileage, and brand as the most influential variables. XGBoost,
with its gradient boosting framework, proved not only effective but also efficient in
training time and resource use.

Going forward, the integration of real-time market dynamics—such as location-based price adjustments, fuel price trends, inflation, and vehicle condition reports—can
significantly improve prediction robustness. Further studies may consider temporal
datasets to model seasonality and demand cycles, enabling dynamic pricing systems. The
incorporation of vehicle image data through convolutional neural networks (CNNs) could
also introduce visual inspection elements to valuation models.

Additionally, exploring the use of deep learning architectures like feedforward neural
networks and LSTM for time-dependent data may enhance forecasting for long-term
value depreciation. Ensemble stacking methods combining multiple algorithms could
further improve generalization. Future work should also emphasize explainable AI (XAI)
frameworks, such as SHAP or LIME, to demystify model predictions and build trust
among end-users including car buyers, dealers, and financial institutions.

Moreover, deployment of such models into production environments, such as web
applications or mobile tools for dealers and consumers, can translate academic work into
practical solutions. Continuous retraining pipelines with up-to-date data would ensure the
system adapts to evolving market conditions.

10. REFERENCES

1. Kamarudin, N., et al. (2021). Car price prediction using ensemble learning techniques. Expert Systems with Applications, 175, 114829.
2. Sharma, P., & Rani, S. (2020). Used car price prediction using machine learning techniques. arXiv preprint arXiv:2003.11623. https://arxiv.org/abs/2003.11623
3. Desai, A. (n.d.). Used Car Dataset: Ford and Mercedes. Kaggle.
4. Zhang, H., & Lee, Y. (2022). Predicting resale values using Bayesian models in the automotive sector. Journal of Predictive Analytics, 12(3), 89–101.
5. Lee, D. (2019). Machine learning techniques for automotive market forecasting. International Journal of Data Science, 7(2), 142–155.
6. Zhang, H., & Wang, Y. (2020). Used car price prediction based on machine learning algorithms. Procedia Computer Science, 174, 465–470.
7. Yadav, M., & Dixit, A. (2021). Used car price prediction using machine learning algorithms. Materials Today: Proceedings, 47, 7892–7896.
8. Mehta, R., & Dey, N. (2018). Predicting resale value of used cars using supervised learning techniques. Advances in Intelligent Systems and Computing, 741, 395–403.
9. Patel, B., & Dave, N. (2016). Predictive modeling of used car prices using machine learning techniques. International Journal of Computer Applications, 137(6), 5–9.
10. Chakraborty, P., & Vashistha, M. (2019). Machine learning based prediction of used car prices. International Journal of Engineering Research & Technology, 8(6), 227–230.
11. Arora, D., & Kansal, A. (2021). Comparative analysis of machine learning algorithms for used car price prediction. International Journal of Computer Sciences and Engineering, 9(5), 100–104.
12. Bhavsar, N., & Patel, M. (2017). Price prediction for used cars using machine learning algorithms. International Journal of Computer Applications, 162(7), 28–32.
13. Zhou, J., & Liu, Y. (2020). Used car price prediction using LightGBM and data analysis. International Journal of Information and Education Technology, 10(4), 304–308.
14. Kumar, N., & Agarwal, D. (2019). Predictive analysis of used car prices using XGBoost and Lasso Regression. International Journal of Recent Technology and Engineering, 8(4), 5481–5485.
15. Sarker, I. H. (2021). Machine learning: Algorithms, real-world applications and research directions. SN Computer Science, 2(3), 1–21.
16. Pandeya, Y. R., & Lee, J. (2020). Used car price prediction using machine learning and ensemble learning techniques. Journal of Advanced Transportation, 2020, Article ID 8894683.
17. Garg, A., & Hooda, M. (2022). Performance comparison of supervised machine learning algorithms for used car price prediction. International Journal of Computer Applications Technology and Research, 11(4), 202–208.
18. Rathore, Y., & Jain, P. (2019). Machine learning approach to predict car resale value. International Research Journal of Engineering and Technology, 6(4), 2182–2186.
19. Mishra, M., & Sharma, R. (2021). Used car price prediction using feature engineering and ensemble learning. Journal of Physics: Conference Series, 1913, 012095.
20. Kaggle (2020). Used Cars Dataset. https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data
Appendices

●​ Appendix A: Data Dictionary


○​ Brand: Manufacturer of the vehicle
○​ Model: Specific model name
○​ Year: Year of manufacture
○​ Mileage: Total distance driven in kilometers
○​ Fuel Type: Type of fuel used (Petrol, Diesel, Electric, etc.)
○​ Transmission: Manual or Automatic
○​ Engine Size: Engine displacement in liters
○​ Selling Price: Final selling price of the car in local currency

●​ Appendix B: Code Snippets

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=100, learning_rate=0.1)
xgb.fit(X_train, y_train)
xgb_predictions = xgb.predict(X_test)
●​ Appendix C: Additional Graphs
○​ Histogram of Car Ages
○​ Boxplot of Price by Fuel Type
○​ Scatter plot of Mileage vs. Price
○​ Feature Importance Bar Chart (from Random Forest)
