Big Ipl
CASE QUESTIONS
1. Develop a simple linear regression model between the sold price and batting strike rate,
is there a statistically significant relationship between sold price and batting strike rate?
Answer:
Equation of the estimated regression line:
Y = β0 + β1 (X)
where Y is the sold price and X is the batting strike rate.
β0 = 289510.4
β1 = 2086.5
So,
Y = 289510.4 + 2086.5 (X)
We get R2 = 0.02641.
Variation in strike rate explains only about 2.6% of the variation in sold price, so the two
variables are not closely related. In general, variables with a 70% or above level of dependency
are considered to be closely related. With only 2.6%, we can conclude that there are other factors
which affect sold price more than strike rate.
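The estimates above come from fitting a least-squares line. As a minimal sketch of how β0, β1 and R2 are obtained, the function below fits a simple linear regression in plain Python. The five data points are made up for illustration and are not the IPL data; the function name `fit_simple_ols` is likewise only illustrative.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = y_bar - b1 * x_bar             # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot  # R2 = 1 - SS_res / SS_tot

# Made-up illustrative values, NOT the IPL data.
strike_rate = [100.0, 110.0, 120.0, 130.0, 140.0]
sold_price = [300.0, 500.0, 200.0, 600.0, 400.0]

b0, b1, r2 = fit_simple_ols(strike_rate, sold_price)
print(f"intercept={b0:.2f}, slope={b1:.2f}, R2={r2:.4f}")
```

A low R2 in this toy example, as in the case data, means the fitted line leaves most of the variation in the outcome unexplained even though a slope can still be estimated.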
2. What is the impact of ability to score “SIXERS” on the player’s price?
Answer:
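Y = β0 + β1 (X)
where Y is the sold price and X is the number of sixers.
β0 = 385115
β1 = 7693
So,
Y = 385115 + 7693 (X)
Each additional six is associated with an increase of about 7693 in the predicted sold price.
However, the level of dependency of sold price on sixers is only 19.6%.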
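To make the size of the sixers effect concrete, the sketch below plugs the reported coefficients (β0 = 385115, β1 = 7693) into the fitted line. The helper name `predicted_price` and the example counts of 20 and 30 sixers are illustrative assumptions, not values from the case.

```python
B0 = 385115.0  # intercept reported above
B1 = 7693.0    # change in predicted sold price per additional six

def predicted_price(sixers):
    """Predicted sold price from the fitted line Y = B0 + B1 * X."""
    return B0 + B1 * sixers

# Hypothetical players with 20 and 30 sixers.
price_gap = predicted_price(30) - predicted_price(20)
print(price_gap)  # ten extra sixers times the slope
```

The gap between two predictions depends only on the slope and the difference in sixers, which is why β1 is read as the impact of one additional six.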
3. Develop a multiple linear regression model between Sold price and batting strike rate
and Sixers. What do you conclude from this model?
Answer:
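Y = β0 + β1 (X1) + β2 (X2)
where Y is the sold price, X1 is the batting strike rate and X2 is the number of sixers.
β0 = 395327.0
β1 = -102.4
β2 = 7758.7
So,
Y = 395327.0 - 102.4 (X1) + 7758.7 (X2)
This implies that the level of dependency of sold price on batting strike rate and sixers together
is only 19%.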
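With two predictors, the coefficients are found by solving the normal equations (X'X)b = X'y. The following is a minimal plain-Python sketch under that assumption; the helper names and the toy data (generated exactly from y = 10 - 2*x1 + 5*x2) are illustrative and are not the IPL data or the method used to produce the figures above.

```python
def solve_linear_system(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ...] for y = b0 + b1*x1 + ..."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy data generated exactly from y = 10 - 2*x1 + 5*x2 (NOT the IPL data).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
y = [10 - 2 * x1 + 5 * x2 for x1, x2 in rows]

b0, b1, b2 = fit_ols(rows, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Each coefficient is interpreted holding the other predictor fixed, which is why the strike-rate coefficient in a two-predictor model can differ in size and sign from the one in a single-predictor model.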
4. Cricket in the T20 format is considered a young man’s sport, is there evidence that the
player’s price is influenced by age?
Answer:
Age is coded as a dummy variable X: 1 for Category 1 (Age < 25) and 0 for the other category.
Y = β0 + β1 (X)
β1 = 226961
Here β1 is the estimated difference in mean sold price between players under 25 and the rest.
However, the level of dependency of sold price on the player's age is only 2.6%.
So the sold price hardly depends on the age of the player.
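When the only predictor is a 0/1 dummy, the least-squares intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. The sketch below checks this on made-up numbers; the data and the name `fit_dummy_regression` are illustrative assumptions, not taken from the case.

```python
from statistics import mean

def fit_dummy_regression(dummy, y):
    """Least-squares (intercept, slope) for y = b0 + b1 * dummy."""
    x_bar, y_bar = mean(dummy), mean(y)
    sxx = sum((d - x_bar) ** 2 for d in dummy)
    sxy = sum((d - x_bar) * (yi - y_bar) for d, yi in zip(dummy, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up illustrative values, NOT the IPL data: 1 = under 25, 0 = otherwise.
under_25 = [1, 1, 0, 0, 0]
price = [700.0, 500.0, 400.0, 300.0, 200.0]

b0, b1 = fit_dummy_regression(under_25, price)
mean_young = mean([p for p, d in zip(price, under_25) if d == 1])
mean_other = mean([p for p, d in zip(price, under_25) if d == 0])
print(b0, b1, mean_young - mean_other)
```

Read this way, a dummy-variable regression is a comparison of two group means, and its R2 says how much of the total price variation that single split accounts for.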
5. Are players of Indian origin paid more than players from other countries?
Answer:
In the given data, a column was added in which the country each cricketer belongs to was coded into
two categories:
Player of Indian Origin – represented by A
Player from other countries – represented by B
The mean Sold Price has been calculated for each category individually: Rs. 652339.6 for
Category A and Rs. 430974 for Category B.
Result
Therefore, the mean selling price when the individual's age lies in Category 2 is Rs. 4,84,535.
Similarly,
Country Code A is serving as our reference or baseline. Therefore, the mean selling
price for players who are not of Indian origin is obtained by setting A equal to 0 in the
above equation.
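The equation referred to here is not reproduced in the text. As a hedged sketch, assume it has the form Y = β0 + β1 (A), with A = 1 for players of Indian origin and A = 0 otherwise, as the "A equals 0" wording suggests. Under that assumption the coefficients follow directly from the two category means reported above; the variable names are illustrative.

```python
MEAN_A = 652339.6  # mean Sold Price, players of Indian origin (reported above)
MEAN_B = 430974.0  # mean Sold Price, players from other countries (reported above)

# Assumed coding: A = 1 for Indian origin, A = 0 otherwise.
b0 = MEAN_B           # intercept: mean of the group coded 0
b1 = MEAN_A - MEAN_B  # slope: difference between the two group means

def predicted_price(is_indian_origin):
    """Predicted Sold Price under the assumed dummy coding."""
    return b0 + b1 * (1 if is_indian_origin else 0)

print(predicted_price(False), predicted_price(True), b1)
```

Setting A to 0 returns the mean for players from other countries, and the slope is the average premium for players of Indian origin implied by the two reported means.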
In the given result, the p-value of the F-statistic is 0.002015, which is well below 0.05 and
hence highly significant. This means that there is a statistically significant relationship
between Sold Price and the country the cricketer belongs to. In other words, a change in the
country category is significantly associated with a change in Sold Price.
R-squared:
In the given result, the R-squared value is 0.06481, i.e., 6.481%, which is extremely low. This
implies that the country the cricketer belongs to explains only 6.481% of the variation in Sold
Price. Thus, the predictor variable and the outcome variable are not closely related. In general,
variables with a 70% or above level of dependency are considered to be closely related. It can be
concluded that there are other factors which also affect the Sold Price.
6. Develop the model which can be used by Franchises to predict the sold price.
To develop the best model which can be used by Franchises to predict the Sold Price, four models
have been created – modelOpt1, modelOpt2, modelOpt3 and modelOpt4.
Model       p-value      R-Squared
Option 2    1.548e-05    14.68%
Option 3    3.742e-07    23.17%
Option 4    4.91e-07     31.57%
p-value (Options 2, 3 and 4): well below 0.05, hence highly significant. This means that at least
one of the predictor variables is significantly related to the outcome variable.
R-Squared (Option 2): extremely low. Thus, the predictor variables and the outcome variable are
not closely related.
R-Squared (Options 3 and 4): low. Thus, the predictor variables and the outcome variable are not
closely related.
In the given result, the p-value of the F-statistic in Option 1 is more than 0.05 (in fact, more
than 0.5). This shows that there is no significant relationship between the predictor variables
and the outcome variable. Also, the Adjusted R2 is negative, which shows that the model explains
a negligible part of the response. So this model can be removed from the four options.
At the same time, the p-values in Option 2, Option 3 and Option 4 are well below 0.05. Hence, they
represent highly significant relationships: one or more predictor variables are significantly
related to the outcome variable.
Comparing these three options, the p-value of Option 3 is the smallest, i.e., farthest below 0.05.
On this criterion, model 3 is better than the other models. But Adjusted R2 is highest in
Option 4, and on that criterion model 4 is better than the other models. The p-values of Option 3
and Option 4 do not differ much.
This contradiction in the selection of the best model between Option 3 and Option 4 may be due to
one or more insignificant variables in the models. In that case it is better to remove such
insignificant variables to show the best relation between the dependent and independent variables.
Model accuracy assessment
The overall quality of the model can be assessed by examining the R-squared (R2) and Residual
Standard Error (RSE).
R-squared:
In multiple linear regression, R represents the correlation coefficient between the observed
values of the outcome variable (y) and the fitted (i.e., predicted) values of y, and R2 is the
square of this correlation. For this reason, the value of R will never be negative and will range
from zero to one.
R2 represents the proportion of variance, in the outcome variable y, that may be predicted by
knowing the value of the x variables. An R2 value close to 1 indicates that the model explains a
large portion of the variance in the outcome variable.
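The two quality measures named above can be computed directly from the residuals. The sketch below, on made-up data, computes R2 as 1 - SS_res / SS_tot, checks that it equals the squared correlation between the observed and fitted values, and computes the Residual Standard Error with n - 2 degrees of freedom (one predictor plus the intercept). All names and values are illustrative and are not the IPL data.

```python
from math import sqrt
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)

# Made-up illustrative values, NOT the IPL data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

# Least-squares line y = b0 + b1 * x.
x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
r = pearson_r(y, fitted)           # correlation of observed vs fitted
rse = sqrt(ss_res / (len(x) - 2))  # residual standard error, n - p - 1 with p = 1

print(f"R2={r_squared:.4f}, R^2 from correlation={r * r:.4f}, RSE={rse:.4f}")
```

R2 is unit-free, while the RSE is in the units of the outcome variable, so the two measures complement each other when judging a model.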
A problem with the R2 is that it never decreases when more variables are added to the model,
even if those variables are only weakly associated with the response. A solution is to adjust the
R2 by taking into account the number of predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction for the
number of x variables included in the prediction model.
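The usual form of this correction is Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies it to the R-squared of 0.1906 reported in this section for the strike rate and sixers model. The sample size is not stated in the text, so `N_PLAYERS = 130` is a purely hypothetical value used for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

N_PLAYERS = 130  # HYPOTHETICAL sample size; the text does not state n

# R2 = 0.1906 with two predictors (strike rate and sixers).
adj = adjusted_r_squared(0.1906, N_PLAYERS, 2)

# A model with a very small R2 and several predictors can have a negative
# adjusted R2 (illustrative numbers only).
adj_weak = adjusted_r_squared(0.01, N_PLAYERS, 5)

print(round(adj, 4), round(adj_weak, 4))
```

Because the penalty grows with the number of predictors, the adjusted value is always at most R2, and it turns negative when the predictors explain less than would be expected by chance, which matches the negative Adjusted R2 seen for Option 1.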
In the given result for the model with batting strike rate and sixers, the R-squared value is
0.1906, i.e., 19.06%, which is very low. This implies that batting strike rate and sixers together
explain only 19.06% of the variation in Sold Price. Thus, the predictor variables and the outcome
variable are not closely related. In general, variables with a 70% or above level of dependency
are considered to be closely related. It can be concluded that there are other factors which also
affect the Sold Price.