car features case study
Project Description
The project explores the factors that determine car prices and fuel efficiency by comparing prices
and fuel-efficiency ratings across cars. The results should support conclusions about pricing policy
and patterns of fuel consumption, and inform advice to businesses or clients in the automotive
industry. Through quantitative analysis of the data, we expect to identify relationships, trends and
correlations in the dataset.
The aim of the project is to identify which variables influence the price of a car, whether fuel
economy changes with car specifications, and whether car brand or body type influences market
price. An interactive dashboard that presents these trends visually is also planned to support
business decisions.
The business question that this project addresses is: "On which factors does the price of a car
depend, to what extent does fuel consumption depend on them, and how can this knowledge be
used to develop optimal pricing strategies?" By knowing how engine horsepower, fuel economy or
the manufacturer influences price, manufacturers are better placed to set reasonable prices for
their cars. With knowledge of a car's fuel efficiency and features, customers are better placed to
make the right purchasing decision.
Data Sources Used in This Project
The data used in this project is a dataset of cars. The variables used in the analysis include Make,
Year, Body Style, Engine HP, Engine Cylinders, Highway MPG, Popularity and MSRP.
Data of this kind can originate from car dealerships, manufacturers, official sources such as the
United States Department of Energy, or commercial databases such as Kelley Blue Book. The dataset
covers a range of car models across different years and markets.
Data Preprocessing and Cleaning
Data cleaning and preprocessing were essential to ensure that the analysis was correct. The steps
taken were:
Handling Missing Values: A number of rows were incomplete, primarily in the Engine HP and
Highway MPG columns. Where mean imputation was feasible it was applied; in other cases, rows
with many missing values were omitted.
Removing Duplicates: The dataset was checked for duplicate entries, and duplicate rows were
deleted so that every car entry is unique and repeated records do not bias the results.
Data Type Conversion: A few variables, such as Year and MSRP, were not stored in the correct
numeric or date format. These were converted to a numeric data type so that they could be used
in Excel formulas and charts.
Outlier Detection: Some anomalous values were identified in Engine HP and MSRP. Prices or
horsepower values far above or below the overall average were either deleted or transformed,
based on prior experience and knowledge or on the advice of a statistician.
Normalization: Variables such as Engine HP and Highway MPG were normalized where needed to
suit the model, particularly for regression analysis.
Assumption on Missing Data Handling: We assumed that missing values of attributes such as
‘Engine HP’ and ‘Highway MPG’ could be replaced by mean imputation, since the missing data does
not appear to be systematic and so should not bias the results.
Data Quality: We also assumed that the data is reliable and that slight variations, such as small
differences in MSRP or Engine HP, are due to rounding and similar effects.
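The cleaning steps above were carried out in Excel. As a rough illustration only, the same ideas (type conversion, duplicate removal, mean imputation and normalization) can be sketched in plain Python. The sample rows, the function names `to_number`, `clean` and `z_scores`, and the choice of z-score scaling are assumptions made for this example and are not taken from the study.

```python
from statistics import mean, pstdev

# Hypothetical sample rows; column names follow the variables named in the text.
rows = [
    {"Make": "BMW", "Year": "2011", "Engine HP": "335", "Highway MPG": "26", "MSRP": "46135"},
    {"Make": "BMW", "Year": "2011", "Engine HP": "335", "Highway MPG": "26", "MSRP": "46135"},
    {"Make": "Audi", "Year": "2015", "Engine HP": "", "Highway MPG": "31", "MSRP": "34000"},
    {"Make": "Ford", "Year": "2014", "Engine HP": "200", "Highway MPG": "", "MSRP": "25000"},
]

NUMERIC_COLUMNS = ["Year", "Engine HP", "Highway MPG", "MSRP"]


def to_number(text):
    """Convert a text cell to float; blank or non-numeric cells become None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def clean(rows):
    # 1. Data type conversion: parse numeric columns that were stored as text.
    typed = [
        {key: to_number(val) if key in NUMERIC_COLUMNS else val for key, val in row.items()}
        for row in rows
    ]
    # 2. Remove exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in typed:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    # 3. Mean imputation for the two columns the text names.
    for column in ("Engine HP", "Highway MPG"):
        known = [row[column] for row in unique if row[column] is not None]
        fill = mean(known)
        for row in unique:
            if row[column] is None:
                row[column] = fill
    return unique


def z_scores(values):
    """Standardize values to mean 0 and unit population standard deviation.

    Assumes the values are not all identical (otherwise the deviation is zero).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


cleaned = clean(rows)
hp_scaled = z_scores([row["Engine HP"] for row in cleaned])
```

Here the duplicate BMW row is dropped, and the blank Engine HP and Highway MPG cells are filled with the mean of the remaining known values in each column.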
2. Approach
The approach combined statistical techniques, exploratory tests, data visualization and regression
models. The main analytical techniques used are:
Descriptive Statistics: Simple measures such as the mean, median and standard deviation were
used to summarize the data. These were useful for characterizing the data and getting a
preliminary feel for the variables.
Correlation Analysis: Correlation coefficients were computed for pairs of variables, for example
Engine HP and MSRP, or Engine Cylinders and Highway MPG. This was important in determining how
the various features relate to car price and fuel consumption.
Regression Analysis: A multiple regression model was used to analyze the relationship between the
dependent variable, MSRP, and the independent variables Engine HP, Highway MPG and Engine
Cylinders. This was useful in establishing which feature has the greatest effect on the price of
the car.
Pivot Tables: Pivot tables were used to summarize the data by Make, Body Style and other grouping
variables. They enabled us to find averages, distributions and patterns within subsets of the
dataset.
Data Visualization: Different types of charts were employed to display the data graphically:
Stacked column charts showing Price Distribution by Body Style.
Residual plots for investigating the nature of the relationship between Engine Cylinders and
Highway MPG.
Descriptive statistics were employed to obtain an initial, general picture of the data's
characteristics.
Below are some examples of visualizations:
[Chart: Price by make; makes shown include Tesla, Scion, Pontiac, Mitsubishi, Maybach, Lexus, Infiniti, GMC, Ferrari, Cadillac, Bentley and Acura; horizontal axis 0 to 2,000,000]
[Chart: Price distribution by model; models shown include 2, Axxess, CX-9, G6, M30, R8, Spectra and XT5; vertical axis 0 to 8,000,000]
[Chart: Fuel Efficiency by year, 1990 to 2016; vertical axis 0 to 35]
[Chart: Average of Popularity and Count of Popularity; vertical axis 0 to 6,000]
Multiple regression analysis was selected as the main method because we needed coefficient
estimates to examine the presence and direction of relationships between several variables and
the cars’ MSRP and fuel economy.
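To make the idea of estimating regression coefficients concrete, the following is a minimal ordinary least squares sketch in plain Python. It is not the Analysis ToolPak procedure used in the study: the helper names `solve` and `fit_ols`, the feature values, and the coefficients used to generate the synthetic MSRP column are all assumptions chosen so that the fitted result can be checked by eye.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, target):
    """Return [intercept, b1, b2, ...] that minimize the sum of squared errors."""
    design = [[1.0] + list(row) for row in features]  # prepend the intercept column
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical rows: (Engine HP, Highway MPG, Engine Cylinders).
features = [
    (150, 34, 4),
    (200, 30, 4),
    (250, 27, 6),
    (300, 25, 6),
    (400, 20, 8),
    (500, 17, 12),
]
# Synthetic MSRP built from made-up coefficients, so the fit should recover them.
msrp = [5000 + 120 * hp - 300 * mpg + 1500 * cyl for hp, mpg, cyl in features]

intercept, b_hp, b_mpg, b_cyl = fit_ols(features, msrp)
```

Because the synthetic prices are an exact linear function of the three features, the fitted coefficients come out at approximately 5000, 120, -300 and 1500. On real data the sign and size of each coefficient is what indicates the direction and strength of its relationship with MSRP.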
Pivot tables were used to organize the data and to speed up the calculation of averages and
distributions, for example to check whether particular manufacturers or body styles skew the
price.
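As a sketch of what such a pivot table computes, the grouping and averaging step can be written in a few lines of Python. The function name `average_by` and the sample rows are hypothetical and only illustrate the calculation; the study itself built these summaries with Excel pivot tables.

```python
from collections import defaultdict


def average_by(rows, group_key, value_key):
    """Group rows by one column and average another, like a pivot table Average field."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        bucket = totals[row[group_key]]
        bucket[0] += row[value_key]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Hypothetical rows, not taken from the study's dataset.
cars = [
    {"Make": "BMW", "Body Style": "Sedan", "MSRP": 50000},
    {"Make": "BMW", "Body Style": "Coupe", "MSRP": 60000},
    {"Make": "Toyota", "Body Style": "Sedan", "MSRP": 25000},
    {"Make": "Toyota", "Body Style": "SUV", "MSRP": 35000},
]

avg_by_make = average_by(cars, "Make", "MSRP")
avg_by_body = average_by(cars, "Body Style", "MSRP")
```

Comparing the per-make averages with the per-body-style averages is one simple way to see which grouping variable separates prices more strongly.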
Visualization techniques were important for making the results readable for both experts and
executives.
Missing Data: The dataset contained missing values, which were filled by imputation. This method
assumes that the missing values do not introduce systematic error, which is not always true.
Data Quality Issues: The data also had minor inconsistencies; for instance, MSRP or Engine HP may
have been entered in an incorrect format. Although these were handled, there could be other
unseen data issues that influenced the outcomes.
Overfitting in Regression Models: There was a risk of overfitting the regression equation,
especially when working with a large number of independent variables. This can be mitigated by
cross-validation methods, which were not used in this study.
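Cross-validation was not part of this study, but a minimal sketch may help show what it would involve. The example below runs k-fold cross-validation on a one-predictor regression of MSRP on Engine HP. The function names `fit_line` and `k_fold_mse`, the interleaved fold assignment, and the sample values are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k):
    """Average held-out mean squared error over k folds.

    Fold i holds out every k-th row starting at index i; a real analysis
    would usually shuffle the rows before assigning folds.
    """
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Hypothetical values, not taken from the study's dataset.
engine_hp = [150, 180, 200, 250, 300, 350, 400, 450]
msrp = [21000, 25000, 27000, 36000, 44000, 55000, 69000, 82000]

cv_mse = k_fold_mse(engine_hp, msrp, k=4)
```

A held-out error that is much larger than the error on the training rows would be a sign that the model is fitting noise rather than a stable relationship.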
3. Tech-Stack Used
Microsoft Excel: Excel was chosen for this project because of its wide availability, functionality
and robust data analysis tools. It supports regression analysis, pivot tables and numerous
visualization options.
Analysis ToolPak (Excel Add-in): The Analysis ToolPak add-in was used for the regression analysis.
It allows straightforward application of statistical procedures such as regression analysis,
correlation analysis and analysis of variance.
Excel Formulas: Built-in functions such as =CORREL() and =AVERAGE() were used to compute
correlations and to summarize the data.
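For readers who want to see what the =CORREL() and =AVERAGE() formulas compute, here is a minimal Python sketch of the arithmetic mean and the Pearson correlation coefficient. Python was not used in the study; the function names `average` and `correl` and the sample values are assumptions made for this example.

```python
from math import sqrt


def average(values):
    """Arithmetic mean, the quantity Excel's =AVERAGE() returns for numeric cells."""
    return sum(values) / len(values)


def correl(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's =CORREL() returns."""
    mean_x, mean_y = average(xs), average(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, not taken from the study's dataset.
engine_hp = [150, 200, 250, 300, 400]
msrp = [22000, 28000, 35000, 45000, 70000]

r = correl(engine_hp, msrp)
```

A value of r close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship.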
Excel was chosen for this work because of its popularity across organizations, its effectiveness
in data processing, and its ability to scale in complexity with the user’s level of expertise. In
addition, its charting and pivot table features suit the visualizations and analysis required for
this work.
No libraries or languages were used apart from Excel's built-in functions and the add-in. For
further development, tools such as Python (with Pandas, Matplotlib and Seaborn) or R could be used
for more complex modeling.
4. Insight
Engine HP and MSRP Correlation: The correlation between Engine HP and MSRP was positive and
considered strong. Price tended to increase with horsepower, so vehicles with higher horsepower
cost more than those with lower horsepower. Consumers therefore pay a substantial premium for
high-powered engines.
Fuel Efficiency vs. Engine Cylinders: The analysis shows that as the number of engine cylinders
increases, highway MPG decreases, and vice versa. This means that vehicles with more cylinders
consume more fuel, consistent with the common market perception that vehicles with large engines
use more fuel.
[Chart: Fuel Efficiency by year, 1990 to 2016; vertical axis 0 to 35]
Price Distribution by Manufacturer: The results showed that premium car makers such as BMW and
Mercedes-Benz had comparatively higher average MSRPs than mass-market car makers such as Toyota
and Ford.
[Chart: Price by make; makes shown include Tesla, Scion, Pontiac, Mitsubishi, Maybach, Lexus, Infiniti, GMC, Ferrari, Cadillac, Bentley and Acura; horizontal axis 0 to 2,000,000]
5. Results
Visualization of Results
Bar charts displaying the average MSRP per car maker.
Scatter plots showing the relationship between Engine Cylinders and Highway MPG.
The insights from the visualizations confirm that engine parameters play a role in determining
car prices and fuel economy. The analysis also shows the relative concentration of luxury car
brands in the higher price segments.
Data Quality Issues: Even after cleaning, the dataset may still contain biases or inaccuracies
that can influence the results.
External Factors: Other factors such as economic conditions, brand image and reputation, and
market demand were not taken into consideration in this study but can affect the results.
Incorporating More Variables: Future analysis could incorporate more variables, for instance the
car's safety features, brand identity, or price variation across geographic locations.
Advanced Modeling: Applying feature selection methods could improve the model, and it may be
worth trying models such as decision trees or random forests.
Recommendations and Conclusion
Pricing Strategy: Car manufacturers should emphasize engine attributes such as Horsepower and
Cylinders, since these two attributes are indicators of MSRP. Luxury brands, like non-luxury
brands, should base their selling prices on performance when marketing their products.
Fuel Efficiency: Consumers should be encouraged to treat MPG as one of the primary criteria when
choosing a car, especially as fuel prices are steadily rising.
Customization Options: Pricing data by Body Style and Manufacturer can help car manufacturers
adjust prices in line with consumer trends.
Excel Link:
https://docs.google.com/spreadsheets/d/1aG9uvkfzJj24Qh8X3vPFr4lToU5XzB6t/edit?
usp=sharing&ouid=100240818750079743933&rtpof=true&sd=true