Data mining and the true size of a test

The document discusses the concept of rejecting a correct null hypothesis, highlighting that this can occur by chance due to the random distribution of test statistics. It warns against 'data mining' in regression analysis, where selecting variables without theoretical backing can inflate the true significance level. To mitigate this risk, it suggests using out-of-sample testing to validate model performance and avoid spurious relationships.


Recall that the probability of rejecting a correct null hypothesis is equal to the size of the test, denoted α.

The possibility of rejecting a correct null hypothesis arises from the fact that test statistics are assumed
to follow a random distribution and hence they will take on extreme values that fall in the rejection
region some of the time by chance alone. A consequence of this is that it will almost always be possible
to find significant relationships between variables if enough variables are examined. For example,
suppose that a dependent variable yt and twenty explanatory variables x2t, …, x21t (excluding a constant
term) are generated separately as independent normally distributed random variables. Then y is
regressed separately on each of the twenty explanatory variables plus a constant, and the significance of
each explanatory variable in the regressions is examined. If this experiment is repeated many times, then on average one of the twenty regressions will have a slope coefficient that is significant at the 5% level. The implication is that if enough explanatory variables are employed in a regression, one or more will often be significant by chance alone. More concretely, it
could be stated that if an α% size of test is used, on average one in every (100/α) regressions will have a
significant slope coefficient by chance alone.
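
The following Python sketch is a minimal illustration of this point; the sample size of 100, the 1,000 replications and the random seed are choices made here for demonstration and are not taken from the text. A y series and twenty unrelated regressors are drawn as independent standard normals, y is regressed on each regressor in turn (with a constant), and the number of slopes significant at the 5% level is recorded.

# Monte Carlo sketch: twenty regressors unrelated to y, tested at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_regressors, n_experiments, alpha = 100, 20, 1000, 0.05

significant_counts = []
for _ in range(n_experiments):
    y = rng.standard_normal(n_obs)                  # y generated independently of all x's
    X = rng.standard_normal((n_obs, n_regressors))  # x_2t, ..., x_21t, also independent of y
    count = 0
    for j in range(n_regressors):
        x = X[:, j]
        # OLS slope from regressing y on a constant and x_j
        beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
        resid = y - y.mean() - beta * (x - x.mean())
        se = np.sqrt(resid @ resid / (n_obs - 2) / ((x - x.mean()) @ (x - x.mean())))
        p_value = 2 * stats.t.sf(abs(beta / se), df=n_obs - 2)
        count += p_value < alpha
    significant_counts.append(count)

print("average number of 'significant' slopes per experiment:",
      np.mean(significant_counts))   # close to 20 * 0.05 = 1

The printed average is close to one spuriously significant slope per experiment, matching the 100/α rule of thumb above.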

Trying many variables in a regression without basing the selection of the candidate variables on a
financial or economic theory is known as ‘data mining’ or ‘data snooping’. The result in such cases is that
the true significance level will be considerably greater than the nominal significance level assumed. For
example, suppose that twenty separate regressions are conducted at a 5% nominal significance level and that three of them contain a significant regressor. The true significance level is then much higher than the nominal 5%: three rejections out of twenty tests correspond to an effective rate of 3/20 = 15%. Therefore, if the researcher then reports only the three equations containing significant regressors and states that the variables are significant at the 5% level, inappropriate conclusions concerning the significance of the variables would result.
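
As a quick illustrative calculation (not part of the original text), note that when twenty independent tests are each run at the 5% level, the probability of at least one spurious rejection is 1 − (1 − α)^20 = 1 − 0.95^20 ≈ 0.64, so reporting only the 'winning' regression greatly understates the chance of a Type I error.

# Probability of at least one spurious rejection across twenty independent 5% tests
alpha, k = 0.05, 20
prob_at_least_one_rejection = 1 - (1 - alpha) ** k
print(f"P(at least one spurious rejection in {k} tests) = {prob_at_least_one_rejection:.2f}")
# prints approximately 0.64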

As well as ensuring that the selection of candidate regressors for inclusion in a model is made on the
basis of financial or economic theory, another way to avoid data mining is by examining the forecast
performance of the model in an ‘out-of-sample’ data set (see Chapter 6). The idea is essentially that a
proportion of the data is not used in model estimation, but is retained for model testing. A relationship
observed in the estimation period that is purely the result of data mining, and is therefore spurious, is
very unlikely to be repeated for the out-of-sample period. Therefore, models that are the product of data
mining are likely to fit very poorly and to give very inaccurate forecasts for the out-of-sample period.
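
A minimal sketch of this idea is given below, using simulated data and an 80/20 estimation/hold-out split chosen purely for illustration (neither the data nor the split comes from the text). The model is estimated only on the first part of the sample, and its forecasts for the hold-out part are compared with a naive benchmark.

# Out-of-sample check: a spurious in-sample relationship should not forecast well.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
y = rng.standard_normal(n)          # y is unrelated to x, so any in-sample fit is spurious

split = int(0.8 * n)                # estimation sample vs. hold-out sample
x_in, y_in = x[:split], y[:split]
x_out, y_out = x[split:], y[split:]

# OLS estimates computed on the estimation sample only
beta = np.cov(x_in, y_in, ddof=1)[0, 1] / np.var(x_in, ddof=1)
alpha_hat = y_in.mean() - beta * x_in.mean()

# Compare out-of-sample forecast errors with a naive benchmark (in-sample mean)
mse_model = np.mean((y_out - (alpha_hat + beta * x_out)) ** 2)
mse_naive = np.mean((y_out - y_in.mean()) ** 2)
print(f"model MSE: {mse_model:.3f}  vs  naive-mean MSE: {mse_naive:.3f}")
# A data-mined relationship typically fails to beat the naive benchmark out of sample.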
