Multiple Regression Analysis Project

1) The study uses data from CarDekho.com to predict the selling prices of used cars in India and analyze the impact of various factors like original price, kilometers driven, age, transmission type, and fuel type. 2) A multiple linear regression model is estimated with the log of selling price as the dependent variable and the log of original price, age, kilometers driven, and dummy variables for diesel, automatic transmission as independent variables. 3) The results show that original price, diesel and automatic transmission have a positive impact on selling price, while age and kilometers driven are negatively associated with selling price, holding other factors constant. Statistical tests support the validity of the model.

Uploaded by

Abhinand C

MULTIPLE REGRESSION ANALYSIS

PREDICTING USED CAR PRICES IN INDIA

SUBMITTED BY
1. ABHINAND.C (540)
THE USED CAR MARKET
• Covid-19 has affected industries of every scale. Automobile dealers in India witnessed a zero-sale month (April) as a result of the stringent lockdown. Automobile sales, especially of passenger cars, are expected to regain momentum once the restrictions are fully lifted.
• As analysts predict, the second-hand (used) car market is expected to boom as more people move away from crowded public transport because of social-distancing concerns, which are likely to remain part of normal life for the near future.
• The dataset we chose to work with is from CarDekho.com. We intend to predict the second-hand selling price and to estimate the causal effects of the variables we considered on the selling price of a particular car.
• With the year of make in hand, we can conveniently convert it into the age of the car; because of data limitations, we measure age in discrete units (years).

VARIABLE                                    TYPE
1. Selling Price (Dependent)                Continuous
2. Original Price/Purchase Price (INR)      Continuous
3. Kilometers Driven (km)                   Continuous
4. Age of Car (Years)                       Discrete
5. Transmission Type (Manual/Automatic)     Categorical
6. Fuel Type (Petrol/Diesel)                Categorical
MODEL ESTIMATION
The population model is:
Log(Selling Price) = β0 + β1.Log(Purchase Price) + β2.KMS_Driven + β3.Age + β4.Automatic_Dummy + β5.Diesel_Dummy + u

• The Gauss-Markov assumptions that are required to get an unbiased OLS estimator are discussed below:
1. The population model is linear in parameters.
2. Random sampling: a random sample is obtained from CarDekho.com. Some outliers were, however, removed on primary visual inspection.
3. No perfect collinearity between the independent variables is also satisfied. (Discussed in detail in Slide 4.)
4. The expected value of the error conditional on the independent variables should be 0. Some omitted variables in the error term that may affect the price, such as ‘trend factors’ and ‘vintage value’, are intuitively not correlated with the included variables. Therefore, even if the marginal effect of an omitted variable is nonzero, the near-zero correlation ensures that there is no bias from the omitted variables. Hence assumption 4 holds in our estimation. Choosing proper functional forms also helps strengthen the assumption. The correlation between each of the independent variables and the residuals is found to be 0 (Figures 1 & 2), which is consistent with a satisfied MLR.4 assumption; since OLS residuals are uncorrelated with the regressors by construction, the argument above about the omitted variables carries the main weight.

The estimated model is:
Log(Selling Price)^ = .8084958 + .8611*Log(Purchase Price) - .0377*Age - 1.62e-06*Kms_Driven + .083874*Diesel_dummy + .0562636*Automatic_dummy

(Fig.1 and Fig.2: correlations between the independent variables and the residuals.)
COEFFICIENTS AND THEIR INTERPRETATIONS
A NOTE ON SELECTION OF FUNCTIONAL FORM
• Using log(Price) reinforces the CLM assumptions, especially Assumption 4 and the homoskedasticity assumption. Strictly positive variables often have conditional distributions that are either skewed or heteroskedastic.
• Taking logs can mitigate the effect of heteroskedasticity, which was evident in the course of our study when analysing the plots of residuals against fitted values.
• So, apart from the 2 to 3 extra steps involved in converting logarithmic (percentage) changes to absolute terms, taking the logarithmic form for the relation between the selling price and the other variables is expected to give us an unbiased and more precise estimator than the alternatives.
INTERPRETATION OF COEFFICIENTS
• For a car, keeping all other variables constant, a 1% increase in purchase price increases the selling price by about .86%.
• As the age of the car increases by 1 year, the price of the car drops by about 3.77% when all the other variables are held constant.
• Keeping everything else constant, a car that has been driven 1 more km will have its selling price decreased by about .000162% (100 × 1.62e-06), that is, about .162% per 1,000 km.
• Keeping all other factors constant, a car with an automatic transmission will cost about 5.6% more than one with a manual transmission.
• Keeping all other factors the same, a car that runs on diesel costs about 8.4% more than one that runs on petrol.
• There is no practical significance in interpreting the intercept coefficient: it is the predicted Log(Selling Price) when Log(Purchase Price), age, kilometers driven and both dummies are all zero, which does not describe a realistic used car. We still keep the intercept, because if the true intercept is not zero, regression through the origin (intercept forced to 0) gives biased estimators. The cost of estimating the intercept when it is actually zero is that the variances of the OLS estimators will be larger, a trade-off which everyone has to make for getting unbiased and consistent estimators.
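The percentage readings of the coefficients can be checked with a few lines of Python. This is only a sketch: the coefficients are copied from the estimated model, while the helper name `exact_pct_change` is ours. It compares the usual approximation (100 × β) with the exact change 100 × (exp(β) − 1), which matters most for the dummy variables.

```python
import math

# Coefficients copied from the estimated model in the slides.
COEFS = {
    "age": -0.0377,
    "kms_driven": -1.62e-06,
    "diesel_dummy": 0.083874,
    "automatic_dummy": 0.0562636,
}

def exact_pct_change(beta, delta=1.0):
    """Exact % change in the selling price when a regressor rises by `delta`
    in a model whose dependent variable is in logs."""
    return 100.0 * (math.exp(beta * delta) - 1.0)

for name, beta in COEFS.items():
    approx = 100.0 * beta            # the usual "100 * beta" reading
    exact = exact_pct_change(beta)   # exact effect implied by the log form
    print(f"{name:16s} approx {approx:+.5f}%   exact {exact:+.5f}%")

# Effect of 1,000 extra km rather than a single km.
per_1000_km = exact_pct_change(COEFS["kms_driven"], delta=1000)
print(f"per 1,000 km: {per_1000_km:+.4f}%")
```

For small coefficients such as age and kilometers driven, the two readings nearly coincide; for the diesel and automatic dummies the exact figures are slightly larger than 8.4% and 5.6%.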
ISSUE OF MULTICOLLINEARITY
• A high correlation between 2 or more independent variables is called multicollinearity, and it can lead to large variances for the OLS estimators.
• The correlations between the different independent variables are shown in Fig.1. Before going into multicollinearity, we can infer that none of the variables exhibits a perfect correlation with another. Also, none of them is a linear combination of the others. Hence our Assumption 3 (no perfect collinearity between the independent variables) is satisfied.
• The impact of multicollinearity is measured by the Variance Inflation Factor (VIF). The results are given in Fig.2.
• Here none of the VIFs is large enough to become an issue, so we can proceed further with our model.
(Fig.1: correlations between the independent variables. Fig.2: variance inflation factors.)
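As a rough illustration of what the VIF measures: for regressor j, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing that variable on the other regressors. The sketch below uses made-up toy numbers (not the CarDekho data) and only two regressors, in which case R_j² is simply the squared correlation between them. The function names are ours.

```python
import math

def pearson_r(x, y):
    """Sample correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_from_r2(r2_j):
    """VIF for regressor j, given the R-squared from regressing it on the others."""
    return 1.0 / (1.0 - r2_j)

# Hypothetical toy values for two regressors (NOT taken from the dataset).
age = [1, 2, 3, 4, 5, 6, 7, 8]
kms = [12000, 18000, 41000, 35000, 52000, 49000, 80000, 66000]

r = pearson_r(age, kms)
# With exactly two regressors, R_j^2 equals the squared correlation.
vif_two_regressors = vif_from_r2(r ** 2)
print(f"correlation = {r:.3f}, VIF = {vif_two_regressors:.2f}")
```

A VIF of 1 means a regressor is uncorrelated with the others; the larger the VIF, the more the variance of that coefficient is inflated.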
ISSUE OF HETEROSKEDASTICITY
• Homoskedasticity is initially checked by plotting standardised residuals against fitted values. The plot obtained was not random (as it should be), indicating heteroskedasticity. The heteroskedasticity issue was more pronounced when the functional relationship between selling price and the explanatory variables was not logarithmic.
• Also, some observations were removed which were either seen as outliers or belonged to a particular group of cars with high variation in residuals. To our surprise, the cars that fell into this group were mostly manufactured by Toyota. Cars like the Fortuner, Innova, Corolla, Corolla Altis etc. exhibited larger-than-expected residuals, mostly on the negative side (actual selling price far below the predicted one). We attribute this to Toyota’s operations in India performing poorly in recent years, forcing the company to stop the production and services of many models. All of this lowered customer confidence. The unavailability of spare parts is also a common reason for some second-hand cars to be priced lower than their similar counterparts.
• In order to conclusively check for the presence of heteroskedasticity, a Breusch-Pagan test is done. It is first done step by step before using the built-in Stata command. The results clearly indicated statistically significant levels of heteroskedasticity. The results are shown below.
• A regression with robust standard errors for the regressors is done and the significance of the parameters is checked again. Results are shown below. Because the new standard errors are not much different, all the variables that were statistically significant before are also significant now.
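The step-by-step idea behind the Breusch-Pagan test can be sketched as follows. This is a minimal illustration under stated assumptions, not the Stata procedure used for the slides: it uses one regressor, made-up toy data, and the LM form of the statistic (n × R² from regressing the squared residuals on the regressor), and all function names are ours.

```python
import math

def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for y = a + b*x + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

def breusch_pagan_lm(x, y):
    """LM version of the test with a single regressor (1 degree of freedom)."""
    # Step 1: estimate the main regression and square its residuals.
    a, b, _ = simple_ols(x, y)
    u2 = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    # Step 2: regress the squared residuals on the regressor.
    _, _, r2_aux = simple_ols(x, u2)
    # Step 3: LM = n * R^2 of the auxiliary regression, chi-square with 1 df.
    lm = max(0.0, len(x) * r2_aux)
    # Upper tail of a chi-square(1) variable: P(Z^2 > lm) = erfc(sqrt(lm / 2)).
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Hypothetical toy data whose error spread grows with x (NOT the CarDekho data).
x = list(range(1, 13))
y = [3 + 2 * xi + (0.2 * xi if i % 2 == 0 else -0.2 * xi) for i, xi in enumerate(x)]

lm, p_value = breusch_pagan_lm(x, y)
print(f"LM = {lm:.3f}, p-value = {p_value:.4f}")
```

A small p-value rejects the null of homoskedasticity. With several regressors, the auxiliary regression includes all of them and the degrees of freedom equal their number.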
THE F-TEST
• Assumption 6: The population error is independent of the explanatory variables and is normally distributed with zero mean and variance σ^2.
• An F-test is usually conducted to test the overall significance of a regression or the joint significance of a group of variables.
• Here we only need to find the overall significance of the regression, so we take the null H0: β1=β2=β3=β4=β5=0
• Ha: At least one of the βj is different from zero.
• The F-statistic is reported by default in Stata.
• Here the p-value associated with the F-statistic is approximately 0, and hence the null hypothesis can be rejected even at the 1% significance level in favour of the alternative.
• Hence our regression is significant; that is, the independent variables help explain the variation in the dependent variable, with an R-squared of 93.3%.
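The overall F-statistic can be recovered from the R-squared as F = (R²/k) / ((1 − R²)/(n − k − 1)). The sketch below applies this with R² = 0.933 and k = 5 from the slides. The sample size is not reported in the slides, so `N_HYPOTHETICAL = 300` is purely an assumed value for illustration, and the function name is ours.

```python
def overall_f_statistic(r_squared, k, n):
    """Overall-significance F statistic and its degrees of freedom.

    r_squared: R-squared of the regression
    k: number of slope coefficients tested (all regressors)
    n: number of observations
    """
    df_num = k
    df_den = n - k - 1
    if df_den <= 0:
        raise ValueError("need more observations than estimated parameters")
    f_stat = (r_squared / df_num) / ((1.0 - r_squared) / df_den)
    return f_stat, df_num, df_den

# R-squared and k come from the slides; n is NOT reported there, so this
# sample size is a hypothetical placeholder.
N_HYPOTHETICAL = 300
f_stat, df1, df2 = overall_f_statistic(0.933, 5, N_HYPOTHETICAL)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

With an R-squared this high, the F-statistic is very large for any reasonable sample size, which is why the reported p-value is practically zero.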
SUMMARY
• We estimated the model subject to the first 4 Gauss-Markov assumptions to get unbiased estimators of the parameters of interest (both size and direction).
• The presence of multicollinearity was checked as part of the post-estimation analysis by examining the Variance Inflation Factors, and no significant levels were found.
• We detected heteroskedasticity initially in our analysis by looking at the plot of fitted values and standardised residuals. Some observations were removed and the functional relationship between the variables was also changed, which reduced the issue to some extent. A Breusch-Pagan test was done to statistically confirm the presence of heteroskedasticity. A regression with robust standard errors was run and new t-statistic values were reported.
• An F-test also showed the overall significance of the regression.
• Hence our estimated regression equation is: Log(Selling_Price) = .81 + .86*Log(Purchase Price) - .038*Age - 1.62e-06*Kms + .056*Automatic_Dummy + .084*Diesel_Dummy.
• Selling_Price = 1.003*Exp(Log[Selling_Price]): converting the log form of the dependent variable back to levels.
• Also, we built the model by excluding many of the cars manufactured by Toyota, and hence it may not be a good model for predicting their prices.
• With the help of this model, if one finds the price of a particular car much higher than the predicted result, she may try to bargain it down, or look for the reason it is higher, such as extra fittings or unnoticed features of the car.
• Also, if one finds the price of a car far below the predicted value, it will mostly be because of the declining popularity of the brand or the unavailability of spares as the model is no longer in production.
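The estimated equation and the conversion back to levels can be combined into a small price calculator. This is a sketch under assumptions: natural logarithms are assumed, the purchase price must be entered in the same units as in the estimation data (the slides do not state the scale), and the function names and the example car are ours.

```python
import math

# Full-precision coefficients from the estimated model in the slides.
INTERCEPT = 0.8084958
B_LOG_PURCHASE = 0.8611
B_AGE = -0.0377
B_KMS = -1.62e-06
B_DIESEL = 0.083874
B_AUTOMATIC = 0.0562636
LEVEL_FACTOR = 1.003  # factor from the summary slide for going from logs to levels

def predict_log_price(purchase_price, age_years, kms_driven, diesel, automatic):
    """Predicted Log(Selling Price); natural log assumed.

    purchase_price must be in the units used in the estimation data
    (an assumption, the slides do not state the scale).
    diesel and automatic are 0/1 dummies.
    """
    return (INTERCEPT
            + B_LOG_PURCHASE * math.log(purchase_price)
            + B_AGE * age_years
            + B_KMS * kms_driven
            + B_DIESEL * diesel
            + B_AUTOMATIC * automatic)

def predict_price(purchase_price, age_years, kms_driven, diesel, automatic):
    """Predicted selling price in levels: 1.003 * exp(predicted log price)."""
    return LEVEL_FACTOR * math.exp(
        predict_log_price(purchase_price, age_years, kms_driven, diesel, automatic))

# Hypothetical car: purchase price 8.0 (data units), 5 years old,
# 50,000 km, diesel, manual transmission.
predicted = predict_price(8.0, 5, 50_000, diesel=1, automatic=0)
print(f"predicted selling price: {predicted:.2f}")
```

An asking price well above the predicted value suggests room to bargain or some unobserved feature. The prediction should not be relied on for the Toyota models that were excluded from the estimation.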
THANK YOU
