Project Soft
Project Soft
MEHER
TY / DS / 20
Viva College
1
TABLE OF CONTENTS
1. Abstract………………………………………………………. ………………………… 3
1.1) Methodology…………………………………………………………………….... 3
1.2) Data Preprocessing………………………………………………………………. 4
1.3) Predictive Model…………………………………………………………………. 4
2. Introduction………………………………………………………. …………………... 5
2.1) Importance of Car Price Prediction…………………………………………….. 6
3. Problem Statement……………………………………………………………………. 8
3.1) Background………………………………………………………………………. 8
3.2) Objectives………………………………………………………………………… 8
3.3) Dataset Overview………………………………………………………………… 9
3.4) Challenges………………………………………………………………………… 9
3.5) Expected Outcomes……………………………………………………………… 9
5. Project Analysis……………………………………………………………………… 13
5.1) Data Exploration and Preprocessing………………………………………… 13
5.2) Feature Engineering……………………………………………………………. 13
5.3) Model Development…………………………………………………………….. 14
5.4) Graphs And Explanation……………………………………………………… 14
5.5) Model Evaluation……………………………………………………………….. 20
5.6) Model Results…………………………………………………………………… 20
6. Final Results………………………………………………………………………….. 22
6.1) Final Results Summary………………………………………………………… 22
6.2) Feature Importance…………………………………………………………….. 23
8. References. …………………………………………………………………………… 28
8.1) Dataset Source………………………………………………………………… 28
2
1. ABSTRACT
This project focuses on developing a predictive model for car prices, leveraging Python and
advanced machine learning techniques to address the growing need for accurate price
forecasting in the automobile industry. As consumer preferences shift and technology
evolves, reliable price predictions are essential for buyers, dealers, and financial institutions
alike.
The dataset encompasses key features that influence vehicle pricing, including make, model,
year of manufacture, mileage, engine size, fuel type, and geographical location. The
methodology begins with extensive data collection from online automotive marketplaces and
reputable databases to create a robust dataset.
Following data collection, the project emphasizes data preprocessing to ensure data integrity.
This includes handling missing values, detecting outliers, and encoding categorical variables,
preparing the dataset for analysis. Effective feature engineering and exploratory data analysis
(EDA) will uncover relationships among various attributes, enhancing the model's predictive
capabilities.
A key objective is to develop a user-friendly interface, allowing non-technical users to easily
access and understand price predictions. Visualizations will provide insights into how
different features influence car values, fostering trust in the model's outputs.
Future work aims to expand the dataset to include market trends and economic indicators,
further enhancing prediction accuracy. Continuous refinement of the model and interface will
ensure the tool remains relevant in a dynamic marketplace, empowering stakeholders to make
informed, data-driven decisions .
1.1) Methodology :
Data Collection :
The first phase of our project involves extensive data collection from diverse sources to
ensure a rich dataset. We gather information from online automotive marketplaces, reputable
automotive databases, and user-generated content platforms. This dataset comprises
numerous variables that are crucial for predicting car prices.
Key features include:
- Make and Model : Different manufacturers and models have distinct pricing dynamics.
- Year of Manufacture : The age of the vehicle is a critical factor, as newer models
generally command higher prices.
- Mileage : Higher mileage typically correlates with reduced value.
3
- Engine Size and Fuel Type : These specifications can affect a vehicle's desirability and
price.
- Geographical Location : Regional demand and supply dynamics can significantly
influence prices.
Once the data is collected, we proceed to the preprocessing stage. This step is vital for
ensuring the dataset is clean and reliable for analysis. We handle missing values through
imputation or removal, depending on the extent of the missing data. Additionally, categorical
variables are encoded to convert non-numeric data into a usable format for machine learning
algorithms.
In the predictive modeling phase, we evaluate multiple algorithms to identify the most
effective approach for our dataset. We consider :
1. Linear Regression : A simple linear model that estimates the relationship between the
features and the target price.
2. Ridge Regression : A type of linear regression that includes L2 regularization, which
helps prevent overfitting by adding a penalty on the size of coefficients.
3. Lasso Regression : Similar to Ridge, but uses L1 regularization, which can shrink some
coefficients to zero, effectively performing feature selection.
4. Random Forest Regressor : An ensemble model that builds multiple decision trees and
averages their predictions, providing robustness against overfitting.
5. Gradient Boosting Regressor : Another ensemble technique that builds trees sequentially,
each new tree correcting errors made by the previous ones, often yielding strong predictive
performance.
These models were evaluated using R-squared and Mean Squared Error metrics, with
hyperparameter tuning performed for Ridge and Lasso to find the best alpha values.
4
2. INTRODUCTION
The car market is a dynamic and ever-evolving industry, influenced by various factors such
as economic conditions, technological advancements, and consumer preferences. Accurate
predictions in this market can provide valuable insights for manufacturers, dealers, and
buyers. This project aims to predict car prices based on various attributes, leveraging data
science and machine learning techniques.
The car market stands as a vital component of the global economy, characterized by its
dynamic nature and constant evolution. Factors such as economic fluctuations, technological
innovations, regulatory changes, and shifting consumer preferences significantly influence
this industry. In recent years, the emergence of electric vehicles, advancements in
autonomous driving technologies, and increasing environmental awareness have reshaped the
landscape, prompting manufacturers and consumers alike to adapt swiftly.
As car prices are inherently tied to these changing factors, accurate predictions hold immense
value for various stakeholders, including manufacturers, dealers, and consumers. For
manufacturers, understanding pricing trends can inform production strategies and inventory
management. Dealers can optimize pricing strategies and marketing approaches to better
attract potential buyers. For consumers, knowing the expected price of a vehicle based on
specific attributes can empower them to make informed purchasing decisions, potentially
saving them time and money.
In this project, we aim to harness the power of data science and machine learning techniques
to predict car prices based on an array of attributes, such as make, model, year, mileage,
engine specifications, and additional features. By analyzing historical data and identifying
correlations among these variables, we can develop models that not only provide accurate
price predictions but also offer insights into the underlying factors driving these prices.
Furthermore, as the data landscape evolves, integrating real-time data and market trends into
our models will enhance their accuracy and relevance. This approach not only contributes to
a deeper understanding of the market dynamics but also allows for continuous model
improvement. As a result, our findings could serve as a crucial resource for various
stakeholders, facilitating informed decision-making and strategic planning in a rapidly
changing automotive landscape.
The goal of this project extends beyond mere price prediction; it seeks to illuminate the
intricate interplay of factors that shape the car market, ultimately fostering a more informed
and efficient marketplace. Through the application of advanced data analytics and machine
learning methodologies, we aspire to contribute valuable insights that resonate with the needs
and challenges faced by industry participants.
The automotive industry is undergoing significant transformation, driven by rapid
technological advancements, shifting economic conditions, and evolving consumer
preferences. As these factors intertwine, the need for precise car price predictions becomes
increasingly critical for all stakeholders—manufacturers, dealers, and buyers alike. This
5
project leverages Python and machine learning techniques to build a robust predictive model
that estimates car prices based on a variety of attributes.
2.1) Importance of Car Price Prediction :
Car price prediction plays a vital role in the automotive industry for several reasons:
1. Informed Decision-Making :
- Consumers : Accurate predictions empower buyers to make informed purchasing
decisions, helping them understand fair market values based on vehicle attributes. This can
lead to better negotiating power and potential savings.
- Dealers : By understanding expected price ranges, dealers can set competitive pricing
strategies that attract customers while maximizing their profit margins.
3. Financial Insights :
- Investment Decisions : Investors and stakeholders can use price predictions to assess
market potential and make strategic investments in automotive companies or technologies.
- Valuation of Assets : Financial institutions rely on accurate price predictions to evaluate
the worth of vehicle assets, influencing loan approvals and insurance assessments.
6
- Providing a clear understanding of pricing trends fosters trust between consumers and
dealers. Transparent pricing models can reduce disparities and encourage fair competition in
the market.
In summary, car price prediction is essential for making informed decisions, optimizing
strategies, and enhancing transparency within the automotive industry. As the market
continues to evolve, accurate predictions will become increasingly critical for navigating
complexities and driving success for all stakeholders involved.
7
3. PROBLEM STATEMENT
The automotive market is characterized by numerous factors influencing car prices, including
vehicle specifications, brand reputation, and market trends. Accurate prediction of car prices
is essential for various stakeholders such as manufacturers, dealers, and consumers to make
informed decisions regarding purchases, sales, and investments. This project aims to develop
a robust predictive model for car prices using a comprehensive dataset containing multiple
vehicle attributes.
Predicting car prices is a complex task due to the multitude of variables that influence a car's
value. These include make and model, age, mileage, engine size, fuel type, and many more.
The primary goal of this project is to develop a predictive model that can estimate the price of
a car given its attributes. Accurate price predictions can help consumers make informed
purchasing decisions and assist sellers in setting competitive prices.
3.1) Background :
Car pricing is influenced by a multitude of factors, which can be broadly categorized into
quantitative and qualitative attributes. Quantitative attributes include measurable data such as
age, mileage, engine size, and fuel type. Qualitative attributes encompass subjective measures
like make, model, and brand reputation. The interplay of these variables creates a dynamic
and often unpredictable pricing landscape. Moreover, external factors such as economic
conditions, market trends, and consumer preferences further complicate the pricing equation.
3.2) Objectives :
The primary objective of this project is to create an accurate and interpretable model that can
predict car prices based on a range of features. Specific goals include:
1. Data Exploration and Understanding : Perform exploratory data analysis (EDA) to
uncover patterns, distributions, and relationships within the dataset. This will provide insights
into which features are most relevant for predicting prices.
2. Feature Engineering : Enhance the dataset by creating new features that may capture
underlying relationships better than the original attributes. This includes:
- Extracting brand names from car names to understand brand impact.
8
- Calculating new metrics like `weight_per_hp` to assess vehicle efficiency.
- Developing a `brand_luxury_index` based on average prices associated with brands,
reflecting perceived value.
4. Model Evaluation : Evaluate the models using metrics such as R-squared and Mean
Squared Error (MSE). The project will also explore cross-validation techniques to optimize
hyperparameters for Ridge and Lasso regressions, ensuring robustness in predictions.
5. Feature Importance Analysis : Investigate the significance of the engineered features and
original attributes in the final models, allowing for a better understanding of what drives car
pricing.
The dataset consists of 205 entries and 26 features, including vehicle specifications (e.g.,
engine size, horsepower, and fuel type), physical dimensions (e.g., wheelbase, car length,
width, height), and price. Key categorical variables such as brand, car body type, and drive
wheel are also included, which will be one-hot encoded to facilitate model training.
3.4) Challenges :
This project is anticipated to yield a predictive model capable of accurately estimating car
prices based on the provided features. By employing rigorous EDA, strategic feature
engineering, and advanced regression techniques, the model will offer valuable insights into
the factors that significantly influence car pricing. Furthermore, the results can aid in
9
establishing best practices for pricing strategies in the automotive industry, ultimately
benefiting consumers and manufacturers alike.
In conclusion, this project not only addresses a practical challenge in the automotive sector
but also serves as an illustrative case study on the importance of data science techniques in
real-world applications. Through comprehensive analysis and modeling, we aim to contribute
to a deeper understanding of car price dynamics and the effectiveness of various predictive
approaches.
10
4. METHODS AND ALGORITHMS
4.3) Modeling :
1. Train-Test Split : Data was split into training and testing sets using `train_test_split`.
2. Scaling : Features were standardized using `StandardScaler`.
3. Regression Models :
- Linear Regression :
- A basic regression model that estimates the relationship between features and the target
variable (price) using a linear approach.
- Ridge Regression :
11
- A regularized version of linear regression that adds L2 regularization to prevent
overfitting, particularly useful when multicollinearity is present.
- Lasso Regression :
- Another regularized linear regression technique that adds L1 regularization, which can
help in feature selection by shrinking some coefficients to zero.
- Random Forest Regressor :
- An ensemble method that constructs multiple decision trees and averages their
predictions. It’s effective for capturing non-linear relationships and interactions between
features.
- Gradient Boosting Regressor :
- Another ensemble technique that builds trees sequentially, where each tree tries to correct
the errors of the previous ones. It often yields better performance than Random Forest in
certain scenarios.
6. Cross-Validation :
- Used with the regression models to evaluate their performance more reliably by splitting
the data into multiple training and testing subsets.
1. Metrics : Used R-squared and Mean Squared Error (MSE) to evaluate model
performance.
2. Cross-Validation : Employed cross-validation to assess model stability and performance,
especially for Ridge and Lasso.
3. Feature Importance : Analyzed the importance of features in Random Forest and
Gradient Boosting models.
Summary :
- The project demonstrated how effective feature engineering and various regression
techniques can enhance the predictive performance in car price estimation. Regularization
methods (Ridge and Lasso) helped in improving model robustness, while ensemble methods
(Random Forest and Gradient Boosting) provided better accuracy through their collective
learning approach.
12
5. PROJECT ANALYSIS
5.1) Data Exploration and Preprocessing :
1. Dataset Overview :
- The dataset contains 205 entries and 26 columns, including various car features and
prices.
- Features include both numerical (e.g., horsepower, engine size) and categorical variables
(e.g., fuel type, car body).
2. Missing Values and Duplicates :
- There were no missing values or duplicates in the dataset, ensuring clean data for analysis.
3. Exploratory Data Analysis (EDA) :
- Price Distribution : A histogram showed the distribution of car prices, indicating a right-
skewed distribution with a few high-value outliers.
- Price by Car Body Type : A box plot illustrated significant price variations across
different car body types.
- Engine Size vs. Horsepower : A scatter plot displayed a positive correlation between
engine size and horsepower.
- Fuel Type Distribution : A pie chart revealed the proportion of different fuel types in the
dataset.
13
1. Brand Extraction :
- Extracted brand names from the `CarName` column for better categorization.
2. Categorical to Numeric Mapping :
- Mapped `cylindernumber` and `doornumber` to numerical values for regression analysis.
- One-hot encoded categorical features (fuel type, aspiration, etc.) to prepare them for
modeling.
3. Correlation Analysis :
- A correlation matrix indicated strong relationships between certain features (e.g., size,
weight per horsepower) and price.
14
fig = px.histogram(data, x='price', nbins=30,
title='Distribution of Car Prices',
labels={'price': 'Price', 'count': 'Frequency'},
opacity=0.7)
fig.update_layout(
xaxis_title='Price',
yaxis_title='Frequency',
bargap=0.1,
)
fig.update_traces(marker_line_width=1, marker_line_color="white")
fig.show()
- The code creates a histogram of car prices using Plotly Express (`px`):
1. Histogram Creation : `px.histogram` generates a histogram from the `data` DataFrame,
using the `'price'` column. It specifies 30 bins, a title, and labels for the axes, with a 70%
opacity for the bars.
2. Layout Update : `fig.update_layout` customizes the x-axis and y-axis titles to "Price" and
"Frequency", respectively, and sets a gap between the bars.
3. Trace Customization : `fig.update_traces` modifies the appearance of the bars, adding a
1-pixel white border around them.
4. Display : `fig.show()` renders the histogram in a web browser or notebook.
Overall, this code visually represents the distribution of car prices in an informative and
aesthetically pleasing way.
15
fig = px.box(data, x='carbody', y='price',
title='Car Prices by Car Body Type',
labels={'carbody': 'Car Body Type', 'price': 'Price'})
fig.update_layout(
xaxis_title='Car Body Type',
yaxis_title='Price',
plot_bgcolor='white'
)
fig.update_xaxes(gridcolor='lightgrey')
fig.update_yaxes(gridcolor='lightgrey')
fig.show()
- This code uses the Plotly Express library to create a box plot visualizing car prices
categorized by different car body types :
1. Data and Plot Creation : It initializes a box plot with `data`, using `'carbody'` for the x-
axis and `'price'` for the y-axis. The plot has a title and custom labels for both axes.
2. Layout Customization : The layout is updated to set the x-axis and y-axis titles, and the
background color of the plot area is set to white.
3. Grid Color Customization : The grid lines for both axes are updated to a light grey color
for better visibility.
4. Display Plot : Finally, `fig.show()` displays the interactive box plot.
Overall, this code effectively visualizes the distribution of car prices across different body
types, providing insights into price variation.
16
fig = px.scatter(data,
x='enginesize',
y='horsepower',
opacity=0.7,
labels={'enginesize': 'Engine Size', 'horsepower': 'Horsepower'},
title='Engine Size vs. Horsepower')
- This code uses the Plotly Express library to create a scatter plot that visualizes the
relationship between engine size and horsepower in a dataset.
1. Data and Plot Creation : It initializes a scatter plot using `data`, plotting `'enginesize'` on
the x-axis and `'horsepower'` on the y-axis.
2. Visual Customization : The points in the scatter plot are set to have an opacity of 0.7,
making them slightly transparent for better visibility, especially if points overlap.
3. Labels and Title : Custom labels are provided for both axes, and the plot is given a title,
"Engine Size vs. Horsepower."
4. Display Plot : `fig.show()` displays the interactive scatter plot.
Overall, this code helps to analyze the correlation between engine size and horsepower,
allowing for visual exploration of the data.
fueltype_counts = data['fueltype'].value_counts().reset_index()
fueltype_counts.columns = ['fueltype', 'count']
17
fig = px.pie(fueltype_counts,
names='fueltype',
values='count',
title='Distribution of Fuel Types',
hole=0.4,
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_traces(rotation=140, textinfo='percent+label')
fig.show()
- This code creates a pie chart to visualize the distribution of different fuel types in a
dataset using Plotly Express.
1. Count Calculation : It counts the occurrences of each unique fuel type in the `'fueltype'`
column of the `data` DataFrame and resets the index to create a new
DataFrame called `fueltype_counts`. The columns are renamed to `'fueltype'` and
`'count'`.
2. Pie Chart Creation : A pie chart is initialized using `fueltype_counts`, with the fuel types
represented by the `names` parameter and their counts represented by the `values` parameter.
The plot has a title and features a hole in the center (making it a donut chart) with a pastel
color palette.
3. Trace Update : The pie chart is customized to rotate for better aesthetics, and it displays
both the percentage and the label of each segment.
4. Display Plot : `fig.show()` displays the interactive pie chart.
Overall, this code provides a clear visual representation of the distribution of fuel types,
allowing for easy comparison of their relative proportions.
18
Correlation of Features with Price :
numeric_data = data_encoded.select_dtypes(include=[np.number])
correlation_matrix = numeric_data.corr()
correlation_with_price = correlation_matrix['price'].drop('price').sort_values(ascending=False).reset_index()
correlation_with_price.columns = ['Feature', 'Correlation with Price']
fig = px.bar(correlation_with_price,
x='Feature',
y='Correlation with Price',
title='Correlation of Features with Price',
labels={'Correlation with Price': 'Correlation Coefficient'},
color='Correlation with Price',
color_continuous_scale='viridis')
fig.update_layout(
xaxis_title='Feature',
yaxis_title='Correlation Coefficient',
xaxis=dict(tickangle=90)
)
fig.show()
- This code creates a bar chart to visualize the correlation of various features with car
prices in a dataset using Plotly Express.
1. Select Numeric Data : It extracts only the numeric columns from the `data_encoded`
DataFrame using `select_dtypes(include=[np.number])`.
2. Correlation Matrix : The correlation matrix is calculated for the numeric data using
`.corr()`, which shows the relationships between all numeric features.
19
3. Correlation with Price : The correlation values specifically with the `'price'` column are
extracted, dropping the price itself, and sorted in descending order. This results in a
DataFrame named `correlation_with_price` that lists each feature and its correlation
coefficient with price.
4. Bar Chart Creation : A bar chart is created using `correlation_with_price`, where
features are plotted on the x-axis and their corresponding correlation coefficients with price
on the y-axis. The bars are colored based on the correlation values using the 'viridis' color
scale.
5. Layout Customization : The chart layout is updated to set the x-axis and y-axis titles, and
the x-axis tick labels are rotated 90 degrees for better readability.
6. Display Plot : Finally, `fig.show()` displays the interactive bar chart.
Overall, this code provides insights into which features have the strongest positive or
negative relationships with car prices, aiding in feature selection and analysis.
- Utilized R-squared and Mean Squared Error (MSE) for evaluating model accuracy.
- Found the best random state for training/testing splits, achieving an R-squared of 0.9577
with Linear Regression.
1. Performance Metrics :
- Random Forest achieved a CV R-squared of 0.90 and a test R-squared of 0.93.
- Gradient Boosting had a CV R-squared of 0.89 and a test R-squared of 0.95.
- These results suggest that ensemble methods provided better predictive power than linear
models.
2. Feature Importance :
- For Random Forest :
- Top features influencing price : enginesize, curbweight.
- For Gradient Boosting:
- Top features : enginesize, curbweight.
3. Engineered Features Evaluation :
- Engineered features like `weight_per_hp`, `size` were shown to have significant
predictive power ranking high in both models.
20
Conclusions :
- Importance of Feature Engineering : The project highlighted the critical role of feature
engineering in improving model performance. Derived features significantly impacted the
accuracy of price predictions.
- Comparison of Models : Demonstrated the advantages of using ensemble methods
(Random Forest, Gradient Boosting) over traditional linear models for this type of regression
task.
- Insights on Car Pricing : The analysis revealed key factors affecting car prices, including
engine specifications, brand reputation, and vehicle size.
This project serves as an excellent demonstration of the power of data science techniques in
real-world applications, especially in predicting market trends based on various features.
Future work could involve experimenting with other machine learning techniques or
integrating external datasets for even deeper insights.
21
6. FINAL RESULTS
Certainly! Here’s a concise summary of the final results from car price prediction
project :
Feature Engineering :
Model Development :
- Models Implemented :
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
- Gradient Boosting Regressor
Model Evaluation :
22
- sBest Random State : 94
- Linear Regression Performance :
- R-squared : 0.9577
- Mean Squared Error : ~4,254,446
23
4. `horsepower`
5. `highwaympg`
Conclusion :
24
7. CONCLUSION AND FUTURE SCOPE
7.1) Conclusion :
In this project, we explored the intricacies of car price prediction through a robust data
analysis pipeline. Here are the key highlights:
2. Feature Engineering :
- We created new features, such as `weight_per_hp`, `size` which significantly enhanced
the model's predictive power.
- The extraction of the car brand from the name facilitated better understanding of market
positioning, helping to develop a luxury index based on average prices.
3. Model Development :
- We implemented multiple regression techniques including Linear Regression, Ridge
Regression, Lasso Regression, and ensemble methods like Random Forest and Gradient
Boosting.
- The use of StandardScaler ensured that all features contributed equally to the model
training.
4. Model Evaluation :
- Models were evaluated based on R-squared and Mean Squared Error, with Gradient
Boosting achieving the highest performance on test data.
- The feature importance analysis showed that engineered features played a significant role
in the models’ predictions, particularly in Random Forest and Gradient Boosting.
25
Moving forward, there are several avenues for enhancing this project:
1. Hyperparameter Tuning :
- Further optimization of hyperparameters for the models, especially ensemble techniques,
could yield even better performance.
4. Time-Series Analysis :
- If historical pricing data is available, applying time-series analysis could help identify
trends and seasonality in car prices.
5. Model Interpretability :
- Utilizing techniques such as SHAP (SHapley Additive exPlanations) values for better
interpretability of model predictions would provide stakeholders with actionable insights.
6. User-Friendly Application :
- Developing a user interface or web application to allow users to input car specifications
and receive price predictions could make this model accessible to a wider audience.
7. Regular Updates :
- Establishing a mechanism for regularly updating the model with new data to maintain
accuracy as market trends evolve.
By addressing these areas, we can further refine our approach to car price prediction, making
it more robust, accurate, and applicable in real-world scenarios. This project serves as a
26
foundational framework for future research and development in predictive modeling within
the automotive industry.
27
8. REFERENCES
The dataset for this project is sourced from a reputable automotive database, such as
Kaggle's " CarPrice_Assignment " .
28