Econometrics Assignment Week 1-806979

The document describes a dataset containing information on 420 football players, including their market value, age, goals scored, assists, and other characteristics. It asks the student to analyze the data using Stata. The student runs summary statistics, estimates OLS regression models relating market value to various player characteristics, tests hypotheses about the coefficients, and investigates potential multicollinearity and influential observations issues. Differences are found between models with and without additional variables that may cause multicollinearity. Influential observations are identified and excluding them improves the model fit.

Uploaded by

jantien De Groot

Assignment Econometrics week 1 – Football players’ transfer values

Jantien de Groot: 1068296


Robin van der Grond: 1123300

What determines the transfer value of football players? That was the research question former BSc
student Joram Jonker analysed in his BSc thesis¹. He collected data on players’ market values and
various characteristics for the season 2021/2022. A subset of this data can be found in the dataset
transfers.dta.

The following variables are available:


Variable       Description
mvalue         Market value of player (million euros)
age            Age of the player
contractdays   Remaining days in current contract
goals          Number of goals scored
assists        Number of assists (key pass before goal) given
inceptions     Number of successful interceptions
minsplyd       Total minutes played
leagueapps     Number of games in which a player appeared (part of or the whole game)

c1 Open the STATA datafile transfers.dta and use the summarize command to get some basic statistics.

¹ Jonker, J. (2023). Econometric analysis of the drivers behind football players’ market values. BSc thesis, Wageningen University.
a. How many observations are there in the dataset? Why is the average number of goals scored so
low?
ok
There are 420 observations in the dataset.
The average number of goals scored is 1.911. The dataset covers football players in general, and we
assume that not all of them are strikers, so most players will have had fewer opportunities to score,
which explains the ‘low’ average number of goals.
However, whether this average is low or high cannot really be judged without knowing more about
the data.

c2 Estimate the following model in Stata using OLS:

Need to show hypothesis and economic interpretation -1


b. Test whether the variable assists has a statistically significant impact on a player’s value.
– Some expert claims that each additional goal scored increases a player’s value by 1 million euro.
Test this hypothesis.
– Test whether age has a negative impact on a player’s value.

- The variable assists is statistically significant at the 0.05 level, as the table above shows
(p = 0.022 < 0.05).
- As can be seen in the table above, the 95% CI for the variable goals shows that its coefficient lies
between 1.26 and 2.056. This means that the market value of a player increases by between 1.26 and
2.056 million euros per goal scored. Because the hypothesised value of 1 lies below this interval, we
reject the expert's claim at the 5% level: the increase is larger than 1 million euros.
- For the variable age, the 95% CI shown in the table above ranges from -0.473 to 0.09. Because this
interval includes zero, it cannot be said with 95% confidence that age has a negative effect on a
player's value.
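The two interval-based conclusions above can be cross-checked with explicit test statistics. The sketch below is only an approximation: it backs out each standard error from the reported 95% CI and uses the standard normal distribution as a large-sample stand-in for the t distribution with 414 degrees of freedom. The function and variable names are our own and do not come from the Stata output.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, assumed close enough to t(414)


def coef_and_se(ci_low, ci_high, crit=1.96):
    """Back out the point estimate and standard error from a symmetric 95% CI."""
    estimate = (ci_low + ci_high) / 2
    se = (ci_high - ci_low) / (2 * crit)
    return estimate, se


# goals: H0: beta = 1 against Ha: beta != 1 (two-sided test)
b_goals, se_goals = coef_and_se(1.26, 2.056)
t_goals = (b_goals - 1) / se_goals
p_goals = 2 * (1 - Z.cdf(abs(t_goals)))

# age: H0: beta >= 0 against Ha: beta < 0 (one-sided test, left tail)
b_age, se_age = coef_and_se(-0.473, 0.09)
t_age = b_age / se_age
p_age = Z.cdf(t_age)

print(f"goals: t = {t_goals:.2f}, two-sided p = {p_goals:.4f}")
print(f"age:   t = {t_age:.2f}, one-sided p = {p_age:.4f}")
```

Under these assumptions the statistic for goals is roughly 3.2, so the claim of exactly 1 million euros per goal is rejected at the 5% level. The one-sided p-value for age is about 0.09, so a negative age effect is not established at the 5% level even when the test is one-sided.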

c. Calculate R2 and Adjusted R2 yourself (show calculations)


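- For the R2, we divided the model sum of squares by the total sum of squares, using the data from the table.
➔ R-squared = 29800,2819 / 88881,9269 = 0,3353

ok

- For the Adjusted R2, with n = 420 observations and K = 6 estimated coefficients (five explanatory variables plus the constant):
➔ Adjusted R-squared = R2 − (K − 1) / (n − K) × (1 − R2) = 0,3353 − (6 − 1) / (420 − 6) × (1 − 0,3353) = 0,3273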
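The calculation can be reproduced with a few lines of code. This is a minimal sketch, assuming the two sums of squares quoted above are the model and total sums of squares from the c2 regression table, and that the model has six estimated coefficients (five explanatory variables plus the constant). The function names are our own.

```python
def r_squared(model_ss, total_ss):
    """Share of the total variation explained by the model."""
    return model_ss / total_ss


def adjusted_r_squared(r2, n_obs, n_params):
    """Adjusted R2; n_params counts all coefficients, including the constant."""
    return r2 - (n_params - 1) / (n_obs - n_params) * (1 - r2)


r2 = r_squared(29800.2819, 88881.9269)
adj_r2 = adjusted_r_squared(r2, n_obs=420, n_params=6)

print(round(r2, 4), round(adj_r2, 4))
```

The ratio evaluates to about 0.3353 and the adjusted value to about 0.3273. The penalty term (K − 1)/(n − K) is small here because 420 observations are used to estimate only six coefficients.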

c3 Estimate an extended model in Stata using OLS:


d. Test whether these two additional variables (minsplayed and leagueapps) jointly contribute to the
model. Calculate the test statistic yourself, and show all steps of the test. Check your answer by
using an appropriate Stata test command.

1. H0: β6 = β7 = 0 Ha: β6 or β7 is not 0, or both

2. Test statistic:
Not how you carry out an F-test -1.5
3. Under H0, F ~ F(2, 412)
4. Under Ha, 2 sided p-value
5. Use 2-tailed p-value
6. Outcome of the individual t statistics: β6: t = 1,95 & β7: t = -0,90
7. Outcome p-values: β6: p = 0,052 & β7: p = 0,367, both > 0,05
8. Conclusion: neither p-value is below α = 0,05, so H0 is not rejected. This does not prove that β6
and β7 are zero; it only means that, on this evidence, the variables minsplayed and leagueapps are
not shown to contribute jointly to the model.
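A joint test of two restrictions is normally based on a single F statistic that compares the residual sum of squares of the restricted model (c2) with that of the unrestricted model (c3), not on the two individual t statistics. The sketch below shows the arithmetic. The restricted residual sum of squares follows from the c2 figures quoted earlier (total minus model sum of squares), but the unrestricted value of 58000.0 is a purely hypothetical placeholder, because the c3 output is not reproduced in this document. The function name is our own.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, n_restrictions, df_resid):
    """F statistic for testing n_restrictions linear restrictions jointly."""
    numerator = (ssr_restricted - ssr_unrestricted) / n_restrictions
    denominator = ssr_unrestricted / df_resid
    return numerator / denominator


# Restricted model (c2): residual SS = total SS - model SS
ssr_r = 88881.9269 - 29800.2819

# Unrestricted model (c3): HYPOTHETICAL placeholder, replace with the
# residual SS from the actual Stata output of the extended model
ssr_ur = 58000.0

f_value = f_statistic(ssr_r, ssr_ur, n_restrictions=2, df_resid=420 - 8)
print(round(f_value, 2))
```

The resulting value is compared with the right-tail critical value of the F(2, 412) distribution, which is roughly 3.0 at the 5% level. In Stata the same test should be available by running test minsplyd leagueapps directly after the c3 regression.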

e. By including the two extra variables minsplayed and leagueapps you may have introduced a
multicollinearity problem in your model. Explain in your own words why this may be the case. Next,
investigate whether there is multicollinearity problem, and if so, try to solve for it. Show Stata
output. In the end, draw a final conclusion on the presence of multicollinearity and the inclusion of
minsplayed and leagueapps in the model.
why this may be the case? -0.5

To investigate the multicollinearity problem, we had Stata report the VIF values for our variables; the result
is the table below. Although the VIF for minutes played does not exceed 10, it is still high (8.83), and the VIF
for league appearances is fairly high as well (5.20). To address this potential problem, we chose to omit both
minsplyd and leagueapps from our model, as they do not contribute significantly to it while adding
considerable multicollinearity. One could argue for still including league appearances, but in our opinion
this variable adds little, so we do not mind dropping it.
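To make the VIF values easier to read, they can be translated back into the R2 of the auxiliary regression of each variable on all other explanatory variables, using VIF = 1 / (1 − R2). The sketch below applies this to the two values reported above; the function and dictionary names are our own.

```python
def auxiliary_r2(vif):
    """R2 of regressing one explanatory variable on all the others, implied by its VIF."""
    return 1 - 1 / vif


reported_vif = {"minsplyd": 8.83, "leagueapps": 5.20}
implied_r2 = {name: auxiliary_r2(vif) for name, vif in reported_vif.items()}

for name, value in implied_r2.items():
    print(f"{name}: auxiliary R2 = {value:.3f}")
```

A VIF of 8.83 means that close to 89% of the variation in minutes played is explained by the other explanatory variables, and a VIF of 5.20 corresponds to about 81% for league appearances. The common rule-of-thumb threshold of 10 corresponds to 90%.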

Wrong interpretation -2
f. Compare the estimated coefficients for goals, assists, and inceptions in the original model
estimated in c2 and in the model estimated in c3. Explain any differences in the values of the
coefficients using the theory discussed in the lecture.

c4 Obtain studentized residuals for the model estimated at c2 (re-estimate the model). Since we
have many observations it may be difficult to check all studentized residuals to find influential
observations. Therefore, it is handy to create an indicator that has value 0 if the absolute value of the
calculated studentized residual is below the critical value, and 1 if it is larger. Therefore, use the following
commands in Stata:

predict studres, rstudent
gen indicator=0
replace indicator=1 if studres>1.96
replace indicator=1 if studres<-1.96
sort mvalue
browse mvalue studres indicator

As discussed before, there is a concern of potential multicollinearity in the second model. This issue
arises because both minutes played and league appearances are correlated with the variables goals and
assists. Consequently, in the second model the coefficients for these variables are estimated less precisely.
Some of the explanatory power reflected in the coefficients of the first model is diminished in the second
model, and the standard errors become larger, which reduces the reliability of our estimates. The resulting
wider confidence intervals make it harder to interpret the variable coefficients precisely.
ok
g. Based on the data overview obtained after the browse command, are there any potential
influential observations? How many?
– What is the main characteristic of all influential observations?
– Estimate the model from c2 again, but excluding the potential observations. You can use an if
statement for that in your regression command (either via the regress menu, or look at the syntax
used in the example from the lecture on land prices). Are there any differences compared to the
model output obtained at c2?

Following step c4, we have identified 21 potentially influential observations. These 21 observations stand
out due to their very high or very low market values: extremely high where the studentized residual is
positive and remarkably low where it is negative, with the residual exceeding the critical value of 1.96 in
absolute terms.
ok
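As a rough benchmark for the count of flagged observations, it helps to know how many studentized residuals would be expected beyond ±1.96 purely by chance. The sketch below assumes the studentized residuals are approximately standard normal, which is only an approximation; the variable names are our own.

```python
from statistics import NormalDist

n_obs = 420
cutoff = 1.96

# Two-sided tail probability of exceeding the cutoff in absolute value
tail_probability = 2 * (1 - NormalDist().cdf(cutoff))

# Number of observations expected to be flagged by chance alone
expected_flagged = n_obs * tail_probability

print(round(tail_probability, 3), round(expected_flagged, 1))
```

Under this assumption about 5% of the 420 observations, or roughly 21, would be flagged even if no observation were unusual. A flag is therefore a reason to inspect an observation, and not proof on its own that the observation is influential.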

Excluding these potential influential observations yields the table below. The notable differences include
an increase in both R-squared and adjusted R-squared. The standard errors of the coefficients are
considerably smaller, and the 95% confidence intervals are now narrower.

Another evident contrast with the model obtained in step c2 is that we now have 399 observations in the
model, down from the original 420. This reduction is a consequence of excluding the potential influential
observations. As a result, the remaining observations are clustered more closely around the fitted
regression line. Specifically, the coefficient for goals is smaller in the new model, so the estimated
relationship is less steep. While the remaining coefficients have changed slightly, their values remain quite
similar to those of the original model.

ok
