Econometrics Assignment Week 1-806979
What determines the transfer value of football players? That was the research question former BSc
student Joram Jonker analysed in his BSc thesis [1]. He collected data on players’ market values and
various characteristics for the 2021/2022 season. A subset of this data can be found in the dataset
transfers.dta.
[1] Jonker, J. (2023). Econometric analysis of the drivers behind football players’ market values. BSc thesis,
Wageningen University.
c1 Open the Stata data file transfers.dta and use the summarize command to get some basic statistics.
a. How many observations are there in the dataset? Why is the average number of goals scored so
low?
ok
There are 420 observations in the dataset.
The average number of goals scored is 1.911. The dataset covers football players in general, and we
assume that not all of them play as strikers, so most players in the dataset would have had fewer
opportunities to score, which explains the ‘low’ average number of goals.
However, it is not possible to say whether the average number of goals is low or high without knowing
more about the data.
- The variable assists is statistically significant at the 0.05 level, as the table above shows
(p = 0.022 < 0.05).
- As the table above shows, the 95% CI for the coefficient on goals runs from 1.26 to 2.056. This means
that a player's market value increases by between 1.26 and 2.056 million euros per goal scored. This is
a larger increase than the 1 million euros claimed by some experts.
- The 95% CI in the table above for the variable age runs from -0.473 to 0.09. Because this interval
contains zero, it cannot be said with 95% confidence that age has a negative effect on a player's
value.
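The reasoning in the three bullets above comes down to comparing a p-value with α and checking whether a confidence interval contains a hypothesised value. A minimal Python sketch of those checks, using only the numbers quoted above. The helper name `contains` is hypothetical, and Python is used purely for illustration; it is not part of the Stata workflow of the assignment.

```python
def contains(interval, value):
    """Return True if value lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high

ALPHA = 0.05

# Numbers quoted in the answers above.
p_assists = 0.022
ci_goals = (1.26, 2.056)   # 95% CI, million euros per goal
ci_age = (-0.473, 0.09)    # 95% CI for the age coefficient

assists_significant = p_assists < ALPHA         # p-value below alpha
goals_ci_includes_1 = contains(ci_goals, 1.0)   # is "1 million per goal" inside the CI?
age_ci_includes_0 = contains(ci_age, 0.0)       # is a zero effect inside the CI?

print(assists_significant, goals_ci_includes_1, age_ci_includes_0)
```

The goals interval lies entirely above 1, while the age interval includes 0, which is what the second and third bullets rely on.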
ok
$0{,}3355 - \frac{5-1}{420-5} \cdot (1 - 0{,}3355) = 0{,}3273$
2. Test statistic:
Not how you carry out an F-test -1.5
3. Under H0, F ~ F with df = 412
4. Under Ha, 2 sided p-value
5. Use 2-tailed p-value
6. Outcome of the test statistic: β6 = 1,95 & β7 = -0,90
7. Outcome p-value: β6 = 0,052 & β7 = 0,367, both > 0,05.
8. Conclusion: H0 is not rejected at α = 0,05; there is no evidence that β6 or β7 differs from zero.
The variables minsplayed and leagueapps do not jointly contribute to the model.
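For comparison, a joint test of H0: β6 = β7 = 0 is usually based on a single F statistic rather than on two separate t statistics. A sketch of the textbook form follows, under these assumptions: the model including minsplayed and leagueapps is the unrestricted model (subscript ur), the model without them is the restricted model (subscript r), q = 2 restrictions are tested, and the denominator degrees of freedom are the 412 given in step 3. The symbols SSR, n, k and q are notation introduced here, not taken from the assignment.

```latex
F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}
  = \frac{(R^2_{ur} - R^2_r)/2}{(1 - R^2_{ur})/412}
  \sim F_{2,\,412} \quad \text{under } H_0
```

H0 is rejected at α = 0,05 when F exceeds the critical value of the F(2, 412) distribution. Evaluating the statistic requires the sums of squared residuals (or the R² values) of both models, which are not reproduced here.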
e. By including the two extra variables minsplayed and leagueapps you may have introduced a
multicollinearity problem in your model. Explain in your own words why this may be the case. Next,
investigate whether there is a multicollinearity problem, and if so, try to solve for it. Show Stata
output. In the end, draw a final conclusion on the presence of multicollinearity and the inclusion of
minsplayed and leagueapps in the model.
why this may be the case? -0.5
To investigate the multicollinearity problem, we instructed Stata to report the VIF values for our variables.
The result is the table below. From this table we conclude that, while the VIF value for minsplayed does
not exceed 10, it is still high (8.83), and the value for leagueapps is already quite high as well (5.20). To
combat this potential problem, we have chosen to omit both minsplayed and leagueapps from our model, as they
do not contribute significantly to it while adding considerable multicollinearity in return. There could be an
argument for still including leagueapps as a variable, but in our opinion this variable does not hold
much value, so we do not mind losing it.
Wrong interpretation -2
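A VIF can be translated back into the R² of the auxiliary regression of that variable on the other regressors, since VIF_j = 1 / (1 − R_j²). A small Python sketch using the two VIF values reported above. The function name `implied_r2` is hypothetical, and Python is used only for illustration.

```python
def implied_r2(vif):
    """R-squared of the auxiliary regression implied by VIF = 1 / (1 - R2)."""
    return 1 - 1 / vif

# VIF values quoted in the answer above.
vifs = {"minsplayed": 8.83, "leagueapps": 5.20}

implied = {name: round(implied_r2(v), 3) for name, v in vifs.items()}
print(implied)
```

A VIF of 8.83 therefore corresponds to roughly 89% of the variation in minsplayed being explained by the other regressors, and a VIF of 5.20 to roughly 81% for leagueapps.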
f. Compare the estimated coefficients for goals, assists, and inceptions in the original model
estimated in c2 and in the model estimated in c3. Explain any differences in the values of the
coefficients using the theory discussed in the lecture.
c4 Obtain studentized residuals for the model estimated at c2 (re-estimate the model). Since we
have many observations it may be difficult to check all studentized residuals to find influential
observations. Therefore, it is handy to create an indicator that has value 0 if the calculated
studentized residual is below the critical value, and 1 if it is larger. Therefore, use the following
commands in Stata:
As discussed before, there is a concern about potential multicollinearity in the second model. This issue
arises because both minsplayed and leagueapps are correlated with the variables goals and assists.
Consequently, in the second model, the coefficients for these variables are estimated less precisely.
Some of the explanatory power captured by the coefficients of the first model is taken over by the added
variables in the second model. This leads to larger standard errors, which reduces the reliability of our
estimates, and the wider confidence intervals make it harder to interpret the coefficients precisely.
ok
g. Based on the data overview obtained after the browse command, are there any potential
influential observations? How many?
– What is the main characteristic of all influential observations?
– Estimate the model from c2 again, but excluding the potential observations. You can use an if
statement for that in your regression command (either via the regress menu, or look at the syntax
used in the example from the lecture on land prices). Are there any differences compared to the
model output obtained at c2?
Following step c4, we identified 21 potentially influential observations. These 21 observations stand out
due to their very high or very low market values: extremely high where the studentized residual is positive
and remarkably low where it is negative, in each case beyond the critical value.
ok
Excluding these potentially influential observations produces the table below. The notable
differences include an increase in both R-squared and adjusted R-squared. The standard errors of the
coefficients have fallen noticeably, and the 95% confidence intervals are now narrower.
Another evident contrast with the model obtained in step c2 is that we now have 399 observations in the
model, down from the original 420. This reduction is a consequence of excluding the potentially influential
observations. Consequently, the remaining observations lie closer to the fitted regression line.
Specifically, the coefficient for goals is smaller, so the estimated relationship is less steep in the new
model. While the remaining coefficients have changed slightly, their values remain quite similar to those
of the original model.
ok
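The flag-and-re-estimate idea from c4 and g can also be sketched outside Stata. The Python example below is an illustration under stated assumptions, not the assignment's procedure: it uses a single regressor, made-up toy data rather than transfers.dta, and a cut-off of 2 as a stand-in for the critical value, which is not stated in the text above. All function and variable names are hypothetical.

```python
from math import sqrt

def fit(x, y):
    """OLS fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b

def studentized_residuals(x, y):
    """Externally studentized residuals for the one-regressor fit."""
    n, p = len(x), 2  # p counts the intercept and the slope
    a, b = fit(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ssr = sum(e ** 2 for e in resid)
    result = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this observation
        # Error variance estimated with this observation left out.
        s2_deleted = (ssr - e ** 2 / (1 - h)) / (n - p - 1)
        result.append(e / sqrt(s2_deleted * (1 - h)))
    return result

# Toy data (made up): value rises by about 1 per goal, except one extreme observation.
goals = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
value = [1.1, 1.9, 3.2, 3.8, 20.0, 6.1, 6.9, 8.2, 8.8, 10.1]

CUTOFF = 2.0  # assumed critical value, for illustration only
rstud = studentized_residuals(goals, value)
influential = [1 if abs(t) > CUTOFF else 0 for t in rstud]

# Re-estimate without the flagged observations (the role of the "if" condition in Stata).
kept = [(g, v) for g, v, flag in zip(goals, value, influential) if flag == 0]
a_all, b_all = fit(goals, value)
a_kept, b_kept = fit([g for g, _ in kept], [v for _, v in kept])

print(influential)
print(round(b_all, 3), round(b_kept, 3))
```

In this toy example only the extreme observation is flagged, and the slope moves once it is excluded, which mirrors the kind of comparison between the c2 model and the re-estimated model described above.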