
UNIT V PREDICTIVE ANALYTICS

Linear least squares – implementation – goodness of fit – testing a linear model – weighted
resampling. Regression using StatsModels – multiple regression – nonlinear relationships –
logistic regression – estimating parameters – Time series analysis – moving averages – missing
values – serial correlation – autocorrelation. Introduction to survival analysis.

 5.1 LINEAR LEAST SQUARES – IMPLEMENTATION

In statistics, linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables. In the case of one independent variable it is called simple linear regression. For more than one independent variable, the process is called multiple linear regression. We will be dealing with simple linear regression in this tutorial.
Let X be the independent variable and Y be the dependent variable. We will define a linear relationship between these two variables as follows:

Y = mX + c

This is the equation for a line that you studied in high school. m is the slope of the line and c is the y-intercept. Today we will use this equation to train our model with a given dataset and predict the value of Y for any given value of X.
Our challenge today is to determine the values of m and c that give the minimum error for the given dataset. We will do this by using the Least Squares method.
Finding the Error
So to minimize the error we need a way to calculate the error in the first place. A loss function in machine learning is simply a measure of how different the predicted value is from the actual value.
Today we will be using the Quadratic Loss Function to calculate the loss or error in our model. It can be defined as:

L = Σᵢ₌₁ⁿ (yᵢ − pᵢ)²

where yᵢ is the actual value and pᵢ is the predicted value for the iᵗʰ point. We square the differences because, for the points below the regression line, y − p will be negative and we don't want negative values in our total error.
Least Squares method
Now that we have determined the loss function, the only thing left to do is minimize it. This is done by finding the partial derivatives of L with respect to m and c, equating them to 0, and then solving for m and c. After we do the math, we are left with these equations:

m = Σᵢ (xᵢ − x̅)(yᵢ − ȳ) / Σᵢ (xᵢ − x̅)²
c = ȳ − m x̅

Here x̅ is the mean of all the values in the input X and ȳ is the mean of all the values in the desired output Y. This is the Least Squares method. Now we will implement this in Python and make predictions.
Implementing the Model
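Below is a minimal sketch of the Least Squares fit described above, written in plain NumPy. The small X and Y arrays are illustrative assumptions; any paired numeric arrays can be substituted.

import numpy as np

# Illustrative data (assumed for this sketch); replace with your own X and Y.
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2, 18.1, 19.8])

# Means of the input and the output.
x_mean = X.mean()
y_mean = Y.mean()

# Least-squares estimates of the slope m and intercept c.
m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
c = y_mean - m * x_mean
print("slope m =", m, "intercept c =", c)

# Predict Y for the given X values and measure the quadratic loss.
Y_pred = m * X + c
loss = np.sum((Y - Y_pred) ** 2)
print("quadratic loss =", loss)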

There won't be much accuracy because we are simply taking a straight line and forcing it to fit the given data as well as possible. But you can use this to make simple predictions or to get an idea of the magnitude/range of the real value. It is also a good first step for beginners in Machine Learning.

 5.2 GOODNESS OF FIT – TESTING A LINEAR MODEL

A goodness-of-fit test, in general, measures how well the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking the model fit. As in linear regression, in essence, the goodness-of-fit test compares the observed values to the expected (fitted or predicted) values.
A goodness-of-fit statistic tests the following hypotheses:
H0: the model M0 fits
vs.
HA: the model M0 does not fit (or, some other model MA fits)
Most often the observed data represent the fit of the saturated model, the most complex model
possible with the given data. Thus, most often the alternative hypothesis (HA) will represent the
saturated model MA which fits perfectly because each observation has a separate parameter.
Later in the course, we will see that MA could be a model other than the saturated one. Let us
now consider the simplest example of the goodness-of-fit test with categorical data.
In the setting for one-way tables, we measure how well an observed variable X corresponds to
a Mult(n,π) model for some vector of cell probabilities, π. We will consider two cases:
1. when vector π is known, and
2. when vector π is unknown.
In other words, we assume that under the null hypothesis data come from
a Mult(n,π) distribution, and we test whether that model fits against the fit of the saturated
model. The rationale behind any model fitting is the assumption that a complex mechanism of
data generation may be represented by a simpler model. The goodness-of-fit test is applied to
corroborate our assumption.
Consider our dice example from Lesson 1. We want to test the hypothesis that there is an equal
probability of six faces by comparing the observed frequencies to those expected under the
assumed model: X ∼ Mult(n = 30, π0), where π0 = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). We can think of this as simultaneously testing whether the probability in each cell is equal to a specified value:
H0: π = π0
where the alternative hypothesis is that any of these elements differ from the null value. There
are two statistics available for this test.
Test Statistics
Pearson Goodness-of-fit Test Statistic
The Pearson goodness-of-fit statistic is
X² = Σⱼ₌₁ᵏ (Xj − nπ0j)² / (nπ0j)
An easy way to remember it is
X² = Σⱼ₌₁ᵏ (Oj − Ej)² / Ej
where Oj = Xj is the observed count in cell j, and Ej = E(Xj) = nπ0j is the expected count in cell j under the assumption that the null hypothesis is true.
Deviance Test Statistic
The deviance statistic is
G² = 2 Σⱼ₌₁ᵏ Xj log(Xj / (nπ0j)) = 2 Σⱼ Oj log(Oj / Ej)
In some texts, G² is also called the likelihood-ratio test (LRT) statistic, for comparing the log-likelihoods L0 and L1 of two models under H0 (reduced model) and HA (full model), respectively:
Likelihood-ratio Test Statistic
G² = −2 log(ℓ0 / ℓ1) = −2(L0 − L1)
Note that X² and G² are both functions of the observed data X and a vector of probabilities π0. For this reason, we will sometimes write them as X²(x, π0) and G²(x, π0), respectively; when there is no ambiguity, however, we will simply use X² and G². We will be dealing with these statistics throughout the course in the analysis of 2-way and k-way tables and when assessing the fit of log-linear and logistic regression models.
Testing the Goodness-of-Fit
X² and G² both measure how closely the model, in this case Mult(n, π0), "fits" the observed data. Both have an approximate chi-square distribution with k − 1 degrees of freedom when H0 is true. This allows us to use the chi-square distribution to find critical values and p-values for establishing statistical significance.
 If the sample proportions π̂j (i.e., the saturated model) are exactly equal to the model's π0j for cells j = 1, 2, …, k, then Oj = Ej for all j, and both X² and G² will be zero. That is, the model fits perfectly.
 If the sample proportions π̂j deviate from the π0j's, then X² and G² are both positive. Large values of X² and G² mean that the data do not agree well with the assumed/proposed model M0.
Example: Dice Rolls
Suppose that we roll a die 30 times and observe the following table showing the number of times
each face ends up on top.

Face Count
1 3
2 7
3 5
4 10
5 2
6 3
Total 30
We want to test the null hypothesis that the die is fair. Under this
hypothesis, X∼Mult(n=30,π0) where π0j=1/6, for j=1,…,6. This is our assumed model, and
under this H0, the expected counts are Ej=30/6=5 for each cell. We now have what we need to
calculate the goodness-of-fit statistics:
X² = (3−5)²/5 + (7−5)²/5 + (5−5)²/5 + (10−5)²/5 + (2−5)²/5 + (3−5)²/5 = 9.2
G² = 2 (3 log(3/5) + 7 log(7/5) + 5 log(5/5) + 10 log(10/5) + 2 log(2/5) + 3 log(3/5)) = 8.8
Note that even though both have the same approximate chi-square distribution, the realized
numerical values of X² and G² can be different. The p-values
are P(χ₅² ≥ 9.2) = 0.10 and P(χ₅² ≥ 8.8) = 0.12. Given these p-values, with the significance level
of α=0.05, we fail to reject the null hypothesis. But rather than concluding that H0 is true, we
simply don't have enough evidence to conclude it's false. That is, the fair-die model doesn't fit
the data exactly, but the fit isn't bad enough to conclude that the die is unfair, given our
significance threshold of 0.05.
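As a quick check of the arithmetic above, the same test can be run in Python. The sketch below uses scipy.stats.chisquare, which computes the Pearson X² statistic and its p-value from the observed and expected counts of the dice example.

from scipy.stats import chisquare

observed = [3, 7, 5, 10, 2, 3]   # counts from the 30 dice rolls
expected = [5, 5, 5, 5, 5, 5]    # 30 * (1/6) under the fair-die hypothesis

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print("X^2 =", stat)             # about 9.2
print("p-value =", p_value)      # about 0.10, so we fail to reject H0 at alpha = 0.05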
A chi-square (Χ2) goodness of fit test is a type of Pearson’s chi-square test. You can use it to
test whether the observed distribution of a categorical variable differs from your expectations.
Example: Chi-square goodness of fit test. You're hired by a dog food company to help them test three new dog food flavors.
You recruit a random sample of 75 dogs and offer each dog a choice between the three flavors by
placing bowls in front of them. You expect that the flavors will be equally popular among the
dogs, with about 25 dogs choosing each flavor.
Once you have your experimental results, you plan to use a chi-square goodness of fit test to
figure out whether the distribution of the dogs’ flavor choices is significantly different from your
expectations.

A chi-square (Χ2) goodness of fit test is a goodness of fit test for a categorical variable.
Goodness of fit is a measure of how well a statistical model fits a set of observations.

 When goodness of fit is high, the values expected based on the model are close to the
observed values.
 When goodness of fit is low, the values expected based on the model are far from the
observed values.
The statistical models that are analyzed by chi-square goodness of fit tests are distributions.
They can be any distribution, from as simple as equal probability for all groups, to as complex as
a probability distribution with many parameters.
Hypothesis testing
The chi-square goodness of fit test is a hypothesis test. It allows you to draw conclusions about
the distribution of a population based on a sample. Using the chi-square goodness of fit test, you
can test whether the goodness of fit is “good enough” to conclude that the population follows the
distribution.
With the chi-square goodness of fit test, you can ask questions such as: Was this sample drawn
from a population that has…
 Equal proportions of male and female turtles?
 Equal proportions of red, blue, yellow, green, and purple jelly beans?
 90% right-handed and 10% left-handed people?
 Offspring with an equal probability of inheriting all possible genotypic combinations
(i.e., unlinked genes)?
 A Poisson distribution of floods per year?
 A normal distribution of bread prices?

 5.3 WEIGHTED RESAMPLING.

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly
drawing samples from a training set and refitting a model of interest on each sample in order to
obtain additional information about the fitted model. For example, in order to estimate the
variability of a linear regression fit, we can repeatedly draw different samples from the training
data, fit a linear regression to each new sample, and then examine the extent to which the
resulting fits differ. Such an approach may allow us to obtain information that would not be
available from fitting the model only once using the original training sample.

Two resampling methods are often used in Machine Learning analyses:

1. The bootstrap method
2. Cross-validation
In addition there are several other methods such as the Jackknife and the Blocking methods. We
will discuss in particular cross-validation and the bootstrap method.

Resampling approaches can be computationally expensive, because they involve fitting the same
statistical method multiple times using different subsets of the training data. However, due to
recent advances in computing power, the computational requirements of resampling methods
generally are not prohibitive. In this chapter, we discuss two of the most commonly used
resampling methods, cross-validation and the bootstrap. Both methods are important tools in the
practical application of many statistical learning procedures. For example, cross-validation can

be used to estimate the test error associated with a given statistical learning method in order to
evaluate its performance, or to select the appropriate level of flexibility. The process of
evaluating a model’s performance is known as model assessment, whereas the process of
selecting the proper level of flexibility for a model is known as model selection. The bootstrap is
widely used.
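The section title also mentions weighted resampling, a small extension of ordinary resampling in which each observation is drawn with probability proportional to a weight instead of uniformly. A minimal sketch with NumPy follows; the data values and weights are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)
data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
weights = np.array([0.05, 0.10, 0.15, 0.30, 0.40])   # assumed sampling weights, must sum to 1

# Ordinary bootstrap: every observation is equally likely to be drawn.
plain_sample = rng.choice(data, size=len(data), replace=True)

# Weighted resampling: observations are drawn in proportion to their weights.
weighted_sample = rng.choice(data, size=len(data), replace=True, p=weights)

print("plain   :", plain_sample)
print("weighted:", weighted_sample)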

 Our simulations can be treated as computer experiments. This is particularly the case
for Monte Carlo methods
 The results can be analysed with the same statistical tools as we would use
analysing experimental data.
 As in all experiments, we are looking for expectation values and an estimate of
how accurate they are, i.e., possible sources for errors.

Reminder on Statistics
 As in other experiments, many numerical experiments have two classes of errors:
o Statistical errors
o Systematical errors
 Statistical errors can be estimated using standard tools from statistics
 Systematical errors are method specific and must be treated differently from case to case.
The advantage of doing linear regression is that we actually end up with analytical expressions for several statistical quantities. Standard least squares and Ridge regression allow us to derive quantities like the variance and other expectation values in a rather straightforward way.

It is assumed that εi ∼ N(0, σ²) and that the εi are independent, i.e.,

Cov(εi1, εi2) = σ² if i1 = i2, and 0 if i1 ≠ i2.

The randomness of εi implies that yi is also a random variable. In particular, yi is normally distributed, because εi ∼ N(0, σ²) and Xi,∗β is a non-random scalar. To specify the parameters of the distribution of yi we need to calculate its first two moments.

Recall that X is a matrix of dimensionality n × p. The notation Xi,∗ means that we take row number i of X and sum over all its p columns, that is, Xi,∗β = Σj Xij βj.

The assumption we have made here can be summarized as (and this is going to be useful when
we discuss the bias-variance trade off) that there exists a function f(x) and a normal
distributed error ε∼N(0,σ2) which describe our data

y=f(x)+ε

We approximate this function with our model from the solution of the linear regression equations; that is, our function f is approximated by ỹ, where we want to minimize (y − ỹ)², our MSE, with

ỹ = Xβ.

We can calculate the expectation value of y for a given element i,

E(yi) = E(Xi,∗β) + E(εi) = Xi,∗β,

while its variance is

Var(yi) = E{[yi − E(yi)]²} = E(yi²) − [E(yi)]²
        = E[(Xi,∗β + εi)²] − (Xi,∗β)²
        = E[(Xi,∗β)² + 2εi Xi,∗β + εi²] − (Xi,∗β)²
        = (Xi,∗β)² + 2E(εi) Xi,∗β + E(εi²) − (Xi,∗β)²
        = E(εi²) = Var(εi) = σ².

Hence, yi ∼ N(Xi,∗β, σ²), that is, y follows a normal distribution with mean value Xβ and variance σ² (not to be confused with the singular values of the SVD).

With the OLS expressions for the parameters β we can evaluate the expectation value

E(β) = E[(XᵀX)⁻¹XᵀY] = (XᵀX)⁻¹Xᵀ E[Y] = (XᵀX)⁻¹XᵀXβ = β.

This means that the estimator of the regression parameters is unbiased. We can also calculate the variance of β:

Var(β) = E{[β − E(β)][β − E(β)]ᵀ}
       = E{[(XᵀX)⁻¹XᵀY − β][(XᵀX)⁻¹XᵀY − β]ᵀ}
       = (XᵀX)⁻¹Xᵀ E{YYᵀ} X(XᵀX)⁻¹ − ββᵀ
       = (XᵀX)⁻¹Xᵀ {XββᵀXᵀ + σ²Inn} X(XᵀX)⁻¹ − ββᵀ
       = ββᵀ + σ²(XᵀX)⁻¹ − ββᵀ
       = σ²(XᵀX)⁻¹,

where we have used that E(YYᵀ) = XββᵀXᵀ + σ²Inn. From Var(β) = σ²(XᵀX)⁻¹, one obtains an estimate of the variance of the estimate of the j-th regression coefficient: σ²(βj) = σ²[(XᵀX)⁻¹]jj. This may be used to construct a confidence interval for the estimates.
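A short numerical sketch of these formulas follows, with a small synthetic design matrix as an assumed example: it computes the OLS estimate, plugs the residual variance into σ²(XᵀX)⁻¹, and forms approximate 95% confidence intervals for the coefficients.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)                # y = X beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                                     # OLS estimate

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)                     # estimate of sigma^2
var_beta = sigma2_hat * XtX_inv                                  # Var(beta) = sigma^2 (X^T X)^{-1}
se = np.sqrt(np.diag(var_beta))

for j in range(p):
    print(f"beta_{j}: {beta_hat[j]:.3f} +/- {1.96 * se[j]:.3f}")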

In a similar way, we can obtain analytical expressions for say the expectation values of the
parameters β and their variance when we employ Ridge regression, allowing us again to define a
confidence interval.

It is rather straightforward to show that

E[βRidge] = (XᵀX + λIpp)⁻¹(XᵀX) βOLS.

We see clearly that E[βRidge] ≠ βOLS for any λ > 0. We say then that the ridge estimator is biased.

We can also compute the variance as

Var[βRidge] = σ²[XᵀX + λI]⁻¹ XᵀX {[XᵀX + λI]⁻¹}ᵀ,

and it is easy to see that if the parameter λ goes to infinity then the variance of Ridge
parameters β goes to zero.

With this, we can compute the difference

Var[βOLS] − Var[βRidge] = σ²[XᵀX + λI]⁻¹ [2λI + λ²(XᵀX)⁻¹] {[XᵀX + λI]⁻¹}ᵀ.

The difference is non-negative definite since each component of the matrix product is non-negative definite. This means that, for λ > 0, the variance we obtain with standard OLS will always be larger than the variance of β obtained with the Ridge estimator. This has interesting consequences when we discuss the so-called bias-variance trade-off below.

Resampling methods
With all these analytical equations for both the OLS and Ridge regression, we will now outline
how to assess a given model. This will lead us to a discussion of the so-called bias-variance
tradeoff (see below) and so-called resampling methods.

One of the quantities we have discussed as a way to measure errors is the mean-squared error
(MSE), mainly used for fitting of continuous functions. Another choice is the absolute error.

In the discussions below we will focus on the MSE and in particular since we will split the data
into test and training data, we discuss the

1. the prediction error or simply the test error ErrTest, where we have a fixed training set and the test error is the MSE arising from the data reserved for testing; and
2. the training error ErrTrain, which is the average loss over the training data.
As our model becomes more and more complex, more of the training data tends to be used. The model may then adapt to more complicated structures in the data. This may lead to a decrease in the bias (see the code example below) and a slight increase of the variance for the test error. For a certain level of complexity the test error will reach a minimum, before starting to increase again. The training error reaches a saturation.

Two famous resampling methods are the independent bootstrap and the jackknife.

The jackknife is a special case of the independent bootstrap. Still, the jackknife was made popular prior to the independent bootstrap, and as the popularity of the independent bootstrap soared, new variants appeared, such as the dependent bootstrap.

The Jackknife and independent bootstrap work for independent, identically distributed random
variables. If these conditions are not satisfied, the methods will fail. Yet, it should be said that if
the data are independent, identically distributed, and we only want to estimate the variance of the sample mean x̄ (which often is the case), then there is no need for bootstrapping.

The jackknife works by making many replicas of the estimator β̂. It is a resampling method where we systematically leave out one observation from the vector of observed values x = (x1, x2, ⋯, xn). Let x(i) denote the vector

x(i) = (x1, x2, ⋯, xi−1, xi+1, ⋯, xn),

which equals the vector x with the exception that observation number i is left out. Using this notation, define β̂(i) to be the estimator β̂ computed using x(i).

Bootstrap

Bootstrapping is a nonparametric approach to statistical inference that substitutes computation


for more traditional distributional assumptions and asymptotic results. Bootstrapping offers a
number of advantages:

1. The bootstrap is quite general, although there are some cases in which it fails.
2. Because it does not require distributional assumptions (such as normally distributed
errors), the bootstrap can provide more accurate inferences when the data are not
well behaved or when the sample size is small.
3. It is possible to apply the bootstrap to statistics with sampling distributions that
are difficult to derive, even asymptotically.
4. It is relatively simple to apply the bootstrap to complex data-collection plans (such
as stratified and clustered samples).

Since β̂ = β̂(X) is a function of random variables, β̂ itself must be a random variable. Thus it has a pdf; call this function p(t). The aim of the bootstrap is to estimate p(t) by the relative frequency of β̂. You can think of this as using a histogram in the place of p(t). If the relative frequency closely resembles p(t), then, using numerics, it is straightforward to estimate all the interesting parameters of p(t) using point estimators.

In the case that β̂ has more than one component, and the components are independent, we use the same estimator on each component separately. If the probability density function of Xi, p(x), had been known, then it would have been straightforward to do this by:

1. drawing lots of numbers from p(x); suppose we call one such set of numbers (X1∗, X2∗, ⋯, Xn∗);
2. then using these numbers to compute a replica of β̂ called β̂∗.

By repeated use of (1) and (2), many estimates of β̂ could have been obtained. The idea is to use the relative frequency of β̂∗ (think of a histogram) as an estimate of p(t).

But unless there is enough information available about the process that
generated X1,X2,⋯,Xn, p(x) is in general unknown. Therefore, Efron in 1979 asked the
question: What if we replace p(x) by the relative frequency of the observation Xi; if we draw
observations in accordance with the relative frequency of the observations, will we obtain the
same result in some asymptotic sense? The answer is yes.

Instead of generating the histogram for the relative frequency of the observation Xi, just draw the
values (X1∗,X2∗,⋯,Xn∗) with replacement from the vector X.

The independent bootstrap works like this:

1. Draw with replacement n numbers from the observed variables x = (x1, x2, ⋯, xn).
2. Define a vector x∗ containing the values which were drawn from x.
3. Using the vector x∗, compute β̂∗ by evaluating β̂ under the observations x∗.
4. Repeat this process k times.

When you are done, you can draw a histogram of the relative frequency of β̂∗. This is your estimate of the probability distribution p(t). Using this probability distribution you can estimate any statistics thereof. In practice you never draw the histogram of the relative frequency of β̂∗ itself; instead you use the estimators corresponding to the statistic of interest. For example, if you are interested in estimating the variance of β̂, apply the estimator σ̂² to the values β̂∗.

Before we proceed however, we need to remind ourselves about a central theorem in statistics,
namely the so-called central limit theorem. This theorem plays a central role in understanding
why the Bootstrap (and other resampling methods) work so well on independent and identically
distributed variables.

Suppose we have a PDF p(x) from which we generate a series of N averages ⟨xi⟩. Each mean value ⟨xi⟩ is viewed as the average of a specific measurement, e.g., throwing dice 100 times and then taking the average value, or producing a certain amount of random numbers. For notational ease, we set ⟨xi⟩ = xi in the discussion which follows.

If we compute the mean z of m such mean values xi,

z = (x1 + x2 + ⋯ + xm)/m,

the question we pose is: what is the PDF of the new variable z?

The probability of obtaining an average value z is the product of the probabilities of obtaining
arbitrary individual mean values xi, but with the constraint that the average is z. We can express
this through the following expression

p̃(z) = ∫dx1 p(x1) ∫dx2 p(x2) ⋯ ∫dxm p(xm) δ(z − (x1 + x2 + ⋯ + xm)/m),

where the δ-function embodies the constraint that the mean is z. All measurements that lead to each individual xi are expected to be independent, which in turn means that we can express p̃ as the product of the individual p(xi). The independence assumption is important in the derivation of the central limit theorem.

If we use the integral expression for the δ-function,

δ(z − (x1 + x2 + ⋯ + xm)/m) = (1/2π) ∫_{−∞}^{∞} dq exp(iq(z − (x1 + x2 + ⋯ + xm)/m)),

and insert e^{iμq − iμq}, where μ is the mean value, we arrive at

p̃(z) = (1/2π) ∫_{−∞}^{∞} dq exp(iq(z − μ)) [∫_{−∞}^{∞} dx p(x) exp(iq(μ − x)/m)]^m,

with the integral over x resulting in

∫_{−∞}^{∞} dx p(x) exp(iq(μ − x)/m) = ∫_{−∞}^{∞} dx p(x) [1 + iq(μ − x)/m − q²(μ − x)²/(2m²) + …].

The second term on the rhs disappears since this is just the mean, and employing the definition of σ² we have

∫_{−∞}^{∞} dx p(x) exp(iq(μ − x)/m) = 1 − q²σ²/(2m²) + …,

resulting in

[∫_{−∞}^{∞} dx p(x) exp(iq(μ − x)/m)]^m ≈ [1 − q²σ²/(2m²) + …]^m,

and in the limit m → ∞ we obtain

p̃(z) = 1/(√(2π)(σ/√m)) exp(−(z − μ)²/(2(σ/√m)²)),

which is the normal distribution with variance σm² = σ²/m, where σ² is the variance of the PDF p(x) and μ is also the mean of the PDF p(x).

Thus, the central limit theorem states that the PDF p~(z) of the average of m random values
corresponding to a PDF p(x) is a normal distribution whose mean is the mean value of the
PDF p(x) and whose variance is the variance of the PDF p(x) divided by m, the number of values
used to compute z.

The central limit theorem leads to the well-known expression for the standard deviation, given
by

σm = σ/√m.

The latter is true only if the average value is known exactly. This is obtained in the
limit m→∞ only. Because the mean and the variance are measured quantities we obtain the
familiar expression in statistics

σm ≈ σ/√(m − 1).

In many cases however the above estimate for the standard deviation, in particular if correlations
are strong, may be too simplistic. Keep in mind that we have assumed that the variables x are
independent and identically distributed. This is obviously not always the case. For example, the
random numbers (or better pseudorandom numbers) we generate in various calculations do
always exhibit some correlations.

The theorem is satisfied by a large class of PDFs. Note however that for a finite m, it is not always possible to find a closed-form/analytic expression for p̃(x).

The following code starts with a Gaussian distribution with mean value μ = 100 and standard deviation σ = 15. We use this to generate the data used in the bootstrap analysis. The bootstrap analysis returns a data set after a given number of bootstrap operations (as many as we have data points). This data set consists of estimated mean values for each bootstrap operation. The histogram generated by the bootstrap method shows that the distribution of these mean values is also a Gaussian, centered around the mean value μ = 100 but with standard deviation σ/√n, where n is the number of data points used in each bootstrap sample (in this case the same as the number of original data points). The value of the standard deviation is what we expect from the central limit theorem.
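Here is a minimal sketch of the analysis just described, written with NumPy and Matplotlib; the number of data points is an assumption, and the number of bootstrap operations is set equal to it, as in the description.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2018)
mu, sigma, n = 100.0, 15.0, 1000          # Gaussian data: mean 100, standard deviation 15
data = np.random.normal(mu, sigma, n)

# Bootstrap: draw n values with replacement and record the mean, repeated n times.
boot_means = np.empty(n)
for i in range(n):
    sample = np.random.choice(data, size=n, replace=True)
    boot_means[i] = sample.mean()

print("mean of bootstrap means:", boot_means.mean())   # close to 100
print("std of bootstrap means :", boot_means.std())    # close to sigma / sqrt(n)
print("sigma / sqrt(n)        :", sigma / np.sqrt(n))

plt.hist(boot_means, bins=40)
plt.xlabel("bootstrap mean")
plt.ylabel("frequency")
plt.show()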

The bias-variance tradeoff

We will discuss the bias-variance tradeoff in the context of continuous predictions such as
regression. However, many of the intuitions and ideas discussed here also carry over to
classification tasks. Consider a dataset L consisting of the data XL={(yj,xj),j=0…n−1}.

Let us assume that the true data is generated from a noisy model

y=f(x)+ϵ

where ϵ is normally distributed with mean zero and variance σ².

In our derivation of the ordinary least squares method we then defined an approximation to the function f in terms of the parameters β and the design matrix X, which embody our model; that is, ỹ = Xβ.

Thereafter we found the parameters β by optimizing the mean squared error via the so-called cost function

C(X, β) = (1/n) Σᵢ₌₀ⁿ⁻¹ (yᵢ − ỹᵢ)² = E[(y − ỹ)²].

We can rewrite this as

E[(y − ỹ)²] = (1/n) Σᵢ (fᵢ − E[ỹ])² + (1/n) Σᵢ (ỹᵢ − E[ỹ])² + σ².

The first term represents the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. The second term represents the variance of the chosen model, and the last term is the variance of the error ϵ.

To derive this equation, we need to recall that the variance of y and of ϵ are both equal to σ². The mean value of ϵ is by definition equal to zero. Furthermore, the function f is not a stochastic variable, and the same holds for ỹ. We use a more compact notation in terms of the expectation value

E[(y − ỹ)²] = E[(f + ϵ − ỹ)²],

and adding and subtracting E[ỹ] we get

E[(y − ỹ)²] = E[(f + ϵ − ỹ + E[ỹ] − E[ỹ])²],

which, using the abovementioned expectation values, can be rewritten as

E[(y − ỹ)²] = E[(y − E[ỹ])²] + Var[ỹ] + σ²,

that is, the rewriting in terms of the so-called bias, the variance of the model ỹ, and the variance of ϵ.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

np.random.seed(2018)

n = 500
n_bootstraps = 100
degree = 18  # A quite high value, just to show.
noise = 0.1

# Make data set.
x = np.linspace(-1, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2) + np.random.normal(0, 0.1, x.shape)

# Hold out some test data that is never used in training.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Combine x transformation and model into one operation.
# Not necessary, but convenient.
model = make_pipeline(PolynomialFeatures(degree=degree),
                      LinearRegression(fit_intercept=False))

# The following (m x n_bootstraps) matrix holds the column vectors y_pred
# for each bootstrap iteration.
y_pred = np.empty((y_test.shape[0], n_bootstraps))
for i in range(n_bootstraps):
    x_, y_ = resample(x_train, y_train)

    # Evaluate the new model on the same test data each time.
    y_pred[:, i] = model.fit(x_, y_).predict(x_test).ravel()

# Note: Expectations and variances are taken w.r.t. different training
# data sets, hence the axis=1. Subsequent means are taken across the test data
# set in order to obtain a total value, but before this we have error/bias/variance
# calculated per data point in the test set.
# Note 2: The use of keepdims=True is important in the calculation of bias as this
# maintains the column vector form. Dropping this yields very unexpected results.
error = np.mean( np.mean((y_test - y_pred)**2, axis=1, keepdims=True) )
bias = np.mean( (y_test - np.mean(y_pred, axis=1, keepdims=True))**2 )
variance = np.mean( np.var(y_pred, axis=1, keepdims=True) )

print('Error:', error)
print('Bias^2:', bias)
print('Var:', variance)
print('{} >= {} + {} = {}'.format(error, bias, variance, bias+variance))

plt.plot(x[::5, :], y[::5, :], label='f(x)')
plt.scatter(x_test, y_test, label='Data points')
plt.scatter(x_test, np.mean(y_pred, axis=1), label='Pred')
plt.legend()
plt.show()

Error: 0.013121574015585152

Bias^2: 0.012073649446193166

Var: 0.0010479245693919886

0.013121574015585152 >= 0.012073649446193166 + 0.0010479245693919886 = 0.01312157401

 5.4 REGRESSION USING STATSMODELS – MULTIPLE REGRESSION

Linear Regression is one of the most essential techniques used in Data Science and Machine
Learning to predict the value of a certain variable based on the value of another variable.
The goal of this article is to explore in-depth how Linear Regression works and the math behind
it, explaining the relationship between dependent variables and independent variables and the
main conditions for a successful Linear Regression.
For this notebook, we are going to use Sklearn's diabetes dataset, which is particularly useful for regression tasks. We start by loading all the necessary libraries.

# Basic libraries
import pandas as pd
import numpy as np

# Data Visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# Diabetes dataset
from sklearn.datasets import load_diabetes

# For splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Metric for evaluation
from sklearn.metrics import mean_squared_error

# Statsmodels for Linear Regression
import statsmodels.api as sm

# Hiding warnings
import warnings
warnings.filterwarnings("ignore")

What is Linear Regression?

Linear Regression is a statistical method widely used in various fields, including economics,
biology, engineering, and many more for predictive modeling and hypothesis testing.
It is based on modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The primary objective of a linear regression is finding the best-fitting line that can accurately predict the output of the dependent variable given the values of the independent variable(s).
The line is given by the following equation:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ϵ

Where:
Y is the dependent variable.
β₀ is the intercept.
β₁, β₂, β₃, … are the coefficients for each independent variable.
X₁, X₂, X₃, … represent each independent variable.
ϵ is the error term, which captures the random noise in the dependent variable. It is the stochastic component.
In this equation, each component serves a specific purpose. The intercept (β₀), for instance, is the constant term that represents the value of Y (the dependent variable) when all the independent variables X₁, X₂, … are equal to zero. It locates the line up or down the y-axis on the Cartesian coordinate system, as illustrated below:

When X = 0, Y = β₀
In the case of a single independent variable, the intercept is calculated as follows:

β₀ = Y̅ − β₁X̅

Where:
Y̅ is the mean value of the dependent variable Y.
X̅ is the mean value of the independent variable X.

The Coefficients (β₁, β₂, β₃, …) are the values that measure the magnitude of impact of each independent variable on the dependent variable. A one-unit change in X₁, for instance, will result in a β₁ change in Y if all the other Xs are constant.
The equation for finding the value of β₁ in a simple linear regression model with only one single independent variable is given by the following:

β₁ = ∑ⁿᵢ₌₁ (Xᵢ − X̅)(Yᵢ − Y̅) / ∑ⁿᵢ₌₁ (Xᵢ − X̅)²
Where:
Xᵢ and Yᵢ represent the iᵗʰ sample of both X and Y.
X̅ and Y̅ represent the mean values for both X and Y.
∑ⁿᵢ ₌ ₁ represents the sum starting from the first element of the
series up until the nᵗʰ element of the series.
In this equation, the coefficient β₁ is the result of the ratio of
the covariance between X and Y, represented by the product of the
deviations between each X and Y from their respective means, and the
variance of X, given by the sum of the squares of the deviations of each X
from its mean. The effect of the independent variable on the dependent
variable can be summed up as the following:
•β₁> 0: Y increases as X increases.
•β₁ < 0: Y decreases as X increases.
•β₁ = 0: X has no effect on Y.
In the case of a linear regression with multiple independent variables, the coefficients and the intercept are obtained through matrix algebra:

β = (XᵀX)⁻¹XᵀY

Where:
β is a vector containing the intercept and coefficients.
X is a matrix where the first column is all 1s and the other columns each represent a different independent variable.
Xᵀ is the transpose of the X matrix.
(XᵀX)⁻¹ is the inverse of the product of Xᵀ and X.
Y is a vector of values of the dependent variable.
The goal of this operation is to obtain estimates of the parameters in the β vector, which are the coefficients and the intercept.
The linear regression with multiple independent variables can then be written compactly in matrix form as Y = Xβ + ϵ.
In this matrix form, the first column of the X matrix is populated by 1s for the intercept, which is represented by the first element (β₀) of the β vector.
To demonstrate how Linear Regression works, let’s load the diabetes
dataset and obtain our X and Y values.
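Below is a minimal sketch, using the libraries imported above, of loading the diabetes data and fitting a multiple regression with statsmodels; the 80/20 train/test split is an assumption.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the diabetes data as a DataFrame (X) and a Series (Y).
diabetes = load_diabetes(as_frame=True)
X = diabetes.data
Y = diabetes.target

# Hold out 20% of the data for testing (assumed split).
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# statsmodels does not add an intercept automatically, so add the column of 1s.
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

# Fit the multiple regression model and inspect coefficients, p-values, R^2, etc.
model = sm.OLS(Y_train, X_train_const).fit()
print(model.summary())

# Evaluate on the held-out data.
Y_pred = model.predict(X_test_const)
print("Test MSE:", mean_squared_error(Y_test, Y_pred))

The summary() output reports the estimated intercept and coefficients (the β vector), their standard errors, p-values, and R², which is the usual starting point for interpreting a multiple regression.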

 5.5 NONLINEAR RELATIONSHIPS

Nonlinear Relationships
The final division among correlation coefficients addresses the question of nonlinear
relationships between two variables. As noted previously, when two variables are related in a
nonlinear way, the product-moment basis for Pearson's r will understate the strength of the
relationship between the two variables. This is because r is a statement of the existence and
strength of the linear relationship between two variables. The correlation ratio, η, addresses the
relationship between a polychotomous qualitative variable and an interval/ratio level quantitative variable, but an advantage of this measure is that it also states the strength of a possible nonlinear relationship between the two variables. At the same time, it makes only a statement of the strength of that relationship, not direction, because of the polychotomous nature of the first variable. For a population, η is calculated by the following equation:
η² = 1 − Σ(Y − μc)² / Σ(Y − μt)²
The mean value of the interval/ratio scores (μc) is calculated for each category of the first variable; the grand mean (μt) is also calculated. Summation occurs over the differences between the score for each observational unit and these means. For a sample, the sample means are calculated instead. Generally, η² will be greater than r², and the difference is the degree of
nonlinearity in the data. For two interval/ratio variables, one variable can be divided into
categories of equal width and the resulting value of η will again approximate the strength of the
nonlinear relationship between the two variables. In either case, η reflects the same information
provided by an analysis of variance (ANOVA) about the relative contribution of the
between sum of squares to the total sum of squares for the two variables.
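As a small illustration of the correlation ratio, the sketch below computes η² in Python as the ratio of the between-groups sum of squares to the total sum of squares, matching the ANOVA remark above; the group labels and scores are assumed data.

import pandas as pd

# Illustrative data: a polychotomous group variable and a numeric outcome.
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "score": [2, 3, 4, 3, 2, 6, 7, 6, 8, 7, 12, 11, 13, 12, 12],
})

grand_mean = df["score"].mean()
ss_total = ((df["score"] - grand_mean) ** 2).sum()

# Between-groups sum of squares: n_c * (group mean - grand mean)^2, summed over categories.
group_stats = df.groupby("group")["score"].agg(["mean", "count"])
ss_between = (group_stats["count"] * (group_stats["mean"] - grand_mean) ** 2).sum()

eta_squared = ss_between / ss_total
print("eta^2 =", eta_squared)   # large here because the groups are well separated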

Linear Relationship: The Straightforward Path


A linear relationship is the simplest to understand in data science. Imagine a scenario where your
savings increase by a fixed amount each month. The more time passes, the more money you
accumulate. This predictable, proportional increase is a hallmark of a linear relationship.
Mathematically, it’s expressed as y = mx + c, where y changes in a fixed pattern as x varies,
with m representing the rate of change and c the starting point.
Nonlinear Relationships: The Complex Web
Nonlinear relationships are where things get interesting. Consider the spread of a viral social media post. It starts with a few likes, but as it gains momentum, the likes can skyrocket. This exponential increase is a prime example of a nonlinear pattern. Unlike linear relationships, nonlinear ones involve variables that change in more complex ways, such as quadratic (y = ax² + bx + c) or exponential (y = ae^(bx)) patterns.
Identifying Relationships in Data
The first step in identifying the nature of a relationship is often a visual examination. Scatter plots
are invaluable as they offer a graphical representation of data points. The pattern formed by these
points can reveal the relationship’s nature. Beyond visuals, residual plots or statistical tests can
offer further insights.
Applications in Data Science
Understanding the type of relationship between variables is critical in data science, particularly
for model selection. Linear regression models suit linear relationships, while nonlinear
relationships might require more complex transformations or models like Polynomial
Regression. This understanding leads to more accurate predictions and effective data analysis.
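As a brief illustration, the sketch below fits both a straight line and a quadratic model (via scikit-learn's PolynomialFeatures) to assumed data with a quadratic pattern, showing how the transformation captures the nonlinear relationship.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + 2 + rng.normal(scale=0.3, size=100)   # quadratic pattern

# A plain linear fit understates the relationship; a degree-2 polynomial captures it.
linear = LinearRegression().fit(x, y)
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print("R^2 linear   :", linear.score(x, y))
print("R^2 quadratic:", quadratic.score(x, y))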

 5.6 LOGISTIC REGRESSION – ESTIMATING PARAMETERS

In Maximum Likelihood Estimation, a probability distribution for the target variable (class label) is assumed, and then a likelihood function is defined that calculates the probability of observing the outcome given the input data and the model. This function can then be optimized to find the set of parameters that results in the largest likelihood over the training dataset.

By applying Maximum Likelihood estimation, the first thing we do is project the data points onto the candidate line. This gives each data point a log(odds) value.
We then transform this log(odds) to a probability using the formula

p = e^(log(odds)) / (1 + e^(log(odds)))

Derivation of the above formula: we already know that
log(p / (1 − p)) = log(odds)
Exponentiating both sides gives
p / (1 − p) = e^(log(odds))
and solving for p yields p = e^(log(odds)) / (1 + e^(log(odds))).

Once we calculate the probabilities, we plot them to get an s-curve. We then keep rotating the log(odds) line, projecting the data points onto it, transforming the projections to probabilities, and calculating the log-likelihood. We repeat this process until we maximize the log-likelihood.

The algorithm that finds the line with the maximum likelihood is pretty smart: each time it rotates the line, it does so in a way that increases the log-likelihood. Thus, the algorithm can find the optimal fit after a few rotations.

Estimation of Log-likelihood function

As explained, logistic regression uses Maximum Likelihood for parameter estimation. The logistic regression equation is given as

p(x) = e^(βᵀx) / (1 + e^(βᵀx))

The parameters to be estimated in the equation of a logistic regression are the β vectors.
To estimate the β vector, consider N samples with labels either 0 or 1.
Mathematically, for samples labeled as '1', we try to estimate β such that the product of all probabilities p(x) is as close to 1 as possible. For samples labeled as '0', we try to estimate β such that the product of all probabilities is as close to 0 as possible; in other words, (1 − p(x)) should be as close to 1 as possible.

The above intuition is represented as the product of p(xᵢ) over samples with yᵢ = 1 and the product of (1 − p(xᵢ)) over samples with yᵢ = 0, where xᵢ represents the feature vector for the iᵗʰ sample.

On combining the above conditions, we want to find β parameters such that the product of both of these products is maximum over all elements of the dataset:

L(β) = ∏_{i: yᵢ=1} p(xᵢ) × ∏_{i: yᵢ=0} (1 − p(xᵢ))

This function is the one we need to optimize and is called the likelihood function.

Now we combine the products and take the log-likelihood to simplify it further:

l(β) = Σᵢ [ yᵢ log p(xᵢ) + (1 − yᵢ) log(1 − p(xᵢ)) ]

Let's substitute p(x) with its exponential form, p(xᵢ) = e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ)). We then end up with the final form of the log-likelihood function, which is to be optimized and is given as

l(β) = Σᵢ [ yᵢ βᵀxᵢ − log(1 + e^(βᵀxᵢ)) ]

Maximizing Log-likelihood function

The goal here is to find the value of β that maximizes the log-
likelihood function. There are many methods to do so like

 Fixed-point iteration

 Bisection method

 Newton-raphson method

 Muller’s method

In this article, we will be using the Newton-Raphson method to estimate the β vector. The Newton-Raphson update is given as

β_new = β_old − H⁻¹ ∇l(β_old)

where ∇l(β) is the gradient (first derivative) and H is the Hessian (second derivative) of the log-likelihood function.
Let's determine the gradient first. Taking the first-order derivative of the log-likelihood function with respect to β, and replacing the exponential term with the probability p(xᵢ), gives

∇l(β) = Σᵢ xᵢ (yᵢ − p(xᵢ))

The matrix representation of the gradient is

∇l(β) = Xᵀ(y − p)

We are done with the numerator term of Newton-Raphson. Now we calculate the denominator, i.e. the second-order derivative, which is also called the Hessian matrix. Differentiating the gradient (replacing the probability with its equivalent exponential term, computing the derivative, and resubstituting the exponential term as a probability) gives

H = −Σᵢ xᵢ xᵢᵀ p(xᵢ)(1 − p(xᵢ))

The matrix representation of the Hessian matrix is

H = −XᵀWX, where W = diag(p(xᵢ)(1 − p(xᵢ)))

Having calculated the gradient and the Hessian matrix, we plug these two terms into the Newton-Raphson equation to get the final form

β_new = β_old + (XᵀWX)⁻¹ Xᵀ(y − p)

Now, we execute the final equation for t iterations until the value of β converges.

Once the coefficients have been estimated, we can plug in the values of some feature vector x to estimate the probability of it belonging to a particular class. We should choose a threshold value; probabilities above it are assigned to class 1 and those below it to class 0.
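A minimal sketch of this Newton-Raphson (also known as IRLS) update in NumPy follows; the synthetic binary dataset and the fixed number of iterations are assumptions for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # design matrix with intercept column
beta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

beta = np.zeros(X.shape[1])
for t in range(10):                       # a handful of iterations is usually enough
    p = sigmoid(X @ beta)                 # current probabilities
    gradient = X.T @ (y - p)              # X^T (y - p)
    W = np.diag(p * (1 - p))              # diagonal weight matrix
    hessian = X.T @ W @ X                 # X^T W X (negative of the true Hessian)
    beta = beta + np.linalg.solve(hessian, gradient)   # beta_new = beta_old + (X^T W X)^{-1} X^T (y - p)

print("estimated beta:", beta)
print("true beta     :", beta_true)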
 5.7 TIME SERIES ANALYSIS – MOVING AVERAGES – MISSING VALUES

Time series analysis is a statistical technique used to analyze data points gathered at consistent
intervals over a time span in order to detect patterns and trends. Understanding the fundamental
framework of the data can assist in predicting future data points and making knowledgeable
choices.
Objectives of Time Series Analysis
 To understand how time series works and what factors affect a certain variable(s) at
different points in time.
 Time series analysis will provide the consequences and insights of the given dataset’s
features that change over time.
 Supporting the prediction of future values of the time series variable.
 Assumptions: TSA makes only one key assumption, stationarity, which means that the origin of time does not affect the statistical properties of the process.
How to Analyze Time Series?
To perform the time series analysis, we have to follow the following steps:
 Collecting the data and cleaning it
 Preparing Visualization with respect to time vs key feature
 Observing the stationarity of the series
 Developing charts to understand its nature.

 Model building – AR, MA, ARMA and ARIMA
 Extracting insights from prediction
Significance of Time Series
TSA is the backbone for prediction and forecasting analysis, specific to time-based problem statements.
 Analyzing the historical dataset and its patterns
 Understanding and matching the current situation with patterns derived from the previous
stage.
 Understanding the factor or factors influencing certain variable(s) in different periods.
With the help of "Time Series," we can prepare numerous time-based analyses and results:
 Forecasting: Predicting any value for the future.
 Segmentation: Grouping similar items together.
 Classification: Classifying a set of items into given classes.
 Descriptive analysis: Analysis of a given dataset to find out what is there in it.
 Intervention analysis: Effect of changing a given variable on the outcome.
Components of Time Series Analysis
Let’s look at the various components of Time Series Analysis:
 Trend: A long-term movement with no fixed interval; any divergence within the given dataset over a continuous timeline. The trend can be positive, negative, or null.
 Seasonality: Regular or fixed-interval shifts within the dataset over a continuous timeline, often shaped like a bell curve or sawtooth.
 Cyclical: Movements with no fixed interval and uncertainty in their pattern.
 Irregularity: Unexpected situations/events/scenarios and spikes in a short time span.

Limitations of Time Series Analysis

Time series analysis has the limitations mentioned below; we have to take care of them during our data analysis.
 Like many other models, TSA does not handle missing values directly.
 The data points must be linear in their relationship.
 Data transformations are mandatory, so the analysis can be a little expensive.
 Models mostly work on univariate data.
Data Types of Time Series
Let’s discuss the time series’ data types and their influence. While discussing TS data types,
there are two major types – stationary and non-stationary.
Stationary: A dataset should follow the thumb rules below, without having the Trend, Seasonality, Cyclical, and Irregularity components of the time series.
 The mean value should be completely constant over time during the analysis.
 The variance should be constant with respect to the time frame.
 The covariance, which measures the relationship between two values of the series at different times, should be constant with respect to the time frame.
Non-stationary: If the mean, variance, or covariance changes with respect to time, the dataset is called non-stationary.
Methods to Check Stationarity
During the TSA model preparation workflow, we must assess whether the dataset is stationary or
not. This is done using Statistical Tests. There are two tests available to test if the dataset is
stationary:

 Augmented Dickey-Fuller (ADF) Test
 Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test
Augmented Dickey-Fuller (ADF) Test or Unit Root Test
The ADF test is the most popular statistical test. It is done with the following assumptions:
 Null Hypothesis (H0): Series is non-stationary
 Alternate Hypothesis (HA): Series is stationary
o p-value > 0.05: fail to reject H0 (the series is treated as non-stationary)
o p-value <= 0.05: reject H0 (the series is treated as stationary)
Kwiatkowski–Phillips–Schmidt–Shin (KPSS) Test
The KPSS test assesses a null hypothesis (H0) that the time series is stationary around a deterministic trend, in contrast to the alternative of a unit root. Since TSA is looking for stationary data for its further analysis, we have to ensure that the dataset is stationary.
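A minimal sketch of running both tests with statsmodels is shown below; the random-walk series used as input is an assumed example, and any pandas Series or NumPy array of observations could replace it.

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=300))      # a random walk, i.e. a non-stationary series

adf_stat, adf_p, *_ = adfuller(series)
print("ADF p-value :", adf_p)                 # large p-value: fail to reject "non-stationary"

kpss_stat, kpss_p, *_ = kpss(series, regression="c", nlags="auto")
print("KPSS p-value:", kpss_p)                # small p-value: reject "stationary"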
Converting Non-Stationary Into Stationary
Let’s discuss quickly how to convert non-stationary to stationary for effective time series
modeling. There are three methods available for this conversion – detrending, differencing, and
transformation.
Detrending
It involves removing the trend effects from the given dataset and showing only the differences in
values from the trend. It always allows cyclical patterns to be identified.

Differencing
This is a simple transformation of the series into a new time series, which we use to remove the
series dependence on time and stabilize the mean of the time series, so trend and seasonality are
reduced during this transformation.
 Differenced series: Y′t = Yt − Yt−1
 Yt = value of the series at time t

Transformation
This includes three different methods: Power Transform, Square Root, and Log Transform. The most commonly used one is the Log Transform.
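A short sketch of differencing and the log transform with pandas, on a small assumed series with an upward trend:

import numpy as np
import pandas as pd

# Illustrative non-stationary series with growth.
y = pd.Series([10, 12, 15, 19, 24, 30, 37, 45, 54, 64], dtype=float)

diffed = y.diff()                # first difference: y_t - y_{t-1}; the first value becomes NaN
logged = np.log(y)               # log transform to damp growing variance
log_diffed = np.log(y).diff()    # often combined: approximately the percentage change

print(pd.DataFrame({"y": y, "diff": diffed, "log": logged, "log_diff": log_diffed}))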
Moving Average Methodology
A commonly used time series method is the Moving Average. This method smooths out random short-term variations and is closely related to the components of the time series.
The Moving Average (MA), or Rolling Mean, is calculated by taking the average of the time series within a window of k periods.
Let's see the types of moving averages:
 Simple Moving Average (SMA)
 Cumulative Moving Average (CMA)
 Exponential Moving Average (EMA)
Simple Moving Average (SMA)
The Simple Moving Average (SMA) calculates the unweighted mean of the previous M or N points. The sliding window of data points is selected based on the amount of smoothing desired, as increasing the value of M or N improves smoothing but reduces accuracy.
To understand better, I will use the air temperature dataset.
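Below is a minimal sketch of a simple moving average with pandas on an assumed daily temperature series (standing in for the air temperature data), including one common way of handling a missing value (interpolation) before smoothing:

import numpy as np
import pandas as pd

# Assumed daily temperature readings with one missing value.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
temps = pd.Series([21.0, 22.5, np.nan, 23.0, 24.5, 26.0, 25.5, 24.0, 23.5, 22.0], index=dates)

# Handle the missing value by linear interpolation (forward/backward fill are alternatives).
temps_filled = temps.interpolate()

# Simple moving average over a 3-day window.
sma_3 = temps_filled.rolling(window=3).mean()

print(pd.DataFrame({"temperature": temps, "filled": temps_filled, "SMA_3": sma_3}))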

 5.8 SERIAL CORRELATION

Expectation, Variance and Covariance
Many of these definitions will be familiar to most QuantStart readers, but I am going to outline them
specifically for purposes of consistent notation.
The first definition is that of the expected value or expectation:
Expectation
The expected value or expectation, E(x), of a random variable x is its mean average value in the population.
We denote the expectation of x by μ, such that E(x)=μ.
Now that we have the definition of expectation we can define the variance, which characterises the
"spread" of a random variable:
Variance
The variance of a random variable is the expectation of the squared deviations of the variable
from the mean, denoted by σ²(x) = E[(x − μ)²].
Notice that the variance is always non-negative. This allows us to define the standard deviation:
Standard Deviation
The standard deviation of a random variable x, σ(x), is the square root of the variance of x.
Now that we've outlined these elementary statistical definitions we can generalise the variance to the
concept of covariance between two random variables. Covariance tells us how linearly related
these two variables are:
Covariance
The covariance of two random variables x and y, each having respective expectations μx and μy, is
given by σ(x,y)=E[(x−μx)(y−μy)].
Covariance tells us how two variables move together.
However since we are in a statistical situation we do not have access to the population means
μx and μy. Instead we must estimate the covariance from a sample. For this we use the respective
sample means x¯ and y¯.
If we consider a set of n pairs of elements of random variables from x and y, given by (xi,yi), the
sample covariance, Cov(x,y) (also sometimes denoted by q(x,y)) is given by:
Cov(x, y) = (1/(n − 1)) ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
Note: Some of you may be wondering why we divide by n−1 in the denominator, rather than n. This
is a valid question! The reason we choose n−1 is that it makes Cov(x,y) an unbiased estimator.
Example: Sample Covariance in R
This is actually our first usage of R on QuantStart. I am not going to discuss the installation
procedure of R here, but I will do so in later articles. Thankfully it is much more straightforward
than installing Python! Assuming you have R installed you can open up the R terminal.
In the following commands we are going to simulate two vectors of length 100, each with a linearly
increasing sequence of integers with some normally distributed noise added. Thus we are
constructing linearly associated variables by design.
We will first construct a scatter plot and then calculate the sample covariance using the cov function. In order to ensure you see exactly the same data as I do, we will set a random seed of 1 and 2 respectively for each variable:
> set.seed(1)
> x <- seq(1,100) + 20.0*rnorm(1:100)
> set.seed(2)
> y <- seq(1,100) + 20.0*rnorm(1:100)
> plot(x,y)

Scatter plot of two linearly increasing variables with normally distributed noise
There is a relatively clear association between the two variables. We can now calculate the sample covariance:
> cov(x, y)
The sample covariance is given as 681.6859.
One drawback of using the covariance to estimate linear association between two random variables
is that it is a dimensional measure. That is, it isn't normalised by the spread of the data and thus it
is hard to draw comparisons between datasets with large differences in spread. This motivates
another concept, namely correlation.
Correlation
Correlation is a dimensionless measure of how two variables vary together, or "co-vary". In
essence, it is the covariance of two random variables normalised by their respective spreads. The
(population) correlation between two variables is often denoted by ρ(x, y):
ρ(x, y) = E[(x − μx)(y − μy)] / (σx σy) = σ(x, y) / (σx σy)
The denominator product of the two spreads will constrain the correlation to lie within the
interval [−1,1]:
 A correlation of ρ(x,y)=+1 indicates exact positive linear association
 A correlation of ρ(x,y)=0 indicates no linear association at all
 A correlation of ρ(x,y)=−1 indicates exact negative linear association
As with the covariance, we can define the sample correlation, Cor(x,y):
Cor(x, y) = Cov(x, y) / (sd(x) sd(y))
Where Cov(x,y) is the sample covariance of x and y, while sd(x) is the sample standard
deviation of x.
Example: Sample Correlation in R
We will use the same x and y vectors of the previous example. The following R code will calculate the sample correlation:
> cor(x, y)
The sample correlation is given as 0.5796604, showing a reasonably strong positive linear association between the two vectors, as expected.
Stationarity in Time Series
Now that we outlined the general definitions of expectation, variance, standard deviation, covariance
and correlation we are in a position to discuss how they apply to time series data.
Firstly, we will discuss a concept known as stationarity. This is an extremely important aspect of
time series and much of the analysis carried out on financial time series data will concern

stationarity. Once we have discussed stationarity we are in a position to talk about serial
correlation and construct some correlogram plots.

 5.9 AUTOCORRELATION

In many cases, the value of a variable at a point in time is related to the value of it at a previous
point in time. Autocorrelation analysis measures the relationship of the observations between the
different points in time, and thus seeks a pattern or trend over the time series. For example, the
temperatures on different days in a month are autocorrelated.
Similar to correlation, autocorrelation can be either positive or negative. It ranges from -1
(perfectly negative autocorrelation) to 1 (perfectly positive autocorrelation). Positive
autocorrelation means that the increase observed in a time interval leads to a proportionate
increase in the lagged time interval.
The example of temperature discussed above demonstrates a positive autocorrelation. The
temperature the next day tends to rise when it’s been increasing and tends to drop when it’s been
decreasing during the previous days.
The observations with positive autocorrelation can be plotted into a smooth curve. By adding a
regression line, it can be observed that a positive error is followed by another positive one, and a
negative error is followed by another negative one.

Conversely, negative autocorrelation represents that the increase observed in a time interval
leads to a proportionate decrease in the lagged time interval. By plotting the observations with a

regression line, it shows that a positive error will be followed by a negative one and vice versa.

Autocorrelation can be computed for different numbers of time gaps, which is known as the lag. A lag-1 autocorrelation measures the correlation between observations that are one time gap apart.
For example, to learn the correlation between the temperatures of one day and the corresponding
day in the next month, a lag 30 autocorrelation should be used (assuming 30 days in that month).
Test for Autocorrelation
The Durbin-Watson statistic is commonly used to test for autocorrelation. It can be applied to a
data set by statistical software. The outcome of the Durbin-Watson test ranges from 0 to 4. An
outcome closely around 2 means a very low level of autocorrelation. An outcome closer to 0
suggests a stronger positive autocorrelation, and an outcome closer to 4 suggests a stronger
negative autocorrelation.
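A minimal sketch of both ideas in Python follows, on an assumed series with positive serial dependence: the lag-1 autocorrelation via pandas and the Durbin-Watson statistic computed on regression residuals via statsmodels.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
# Assumed series with positive serial dependence (an AR(1)-like process around 20).
values = [20.0]
for _ in range(199):
    values.append(0.8 * values[-1] + rng.normal(scale=1.0) + 4.0)
series = pd.Series(values)

print("lag-1 autocorrelation:", series.autocorr(lag=1))    # positive and fairly large here

# Durbin-Watson on the residuals of a simple trend regression.
X = sm.add_constant(np.arange(len(series)))
residuals = sm.OLS(series, X).fit().resid
print("Durbin-Watson:", durbin_watson(residuals))          # well below 2, indicating positive autocorrelation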
It is necessary to test for autocorrelation when analyzing a set of historical data. For example, in
the equity market, the stock prices on one day can be highly correlated to the prices on another
day. However, it provides little information for statistical data analysis and does not tell the
actual performance of the stock.
Therefore, it is necessary to test for the autocorrelation of the historical prices to identify to what
extent the price change is merely a pattern or caused by other factors. In finance, an ordinary
way to eliminate the impact of autocorrelation is to use percentage changes in asset prices
instead of historical prices themselves.
Autocorrelation and Technical Analysis
Although autocorrelation should be avoided in order to apply further data analysis more
accurately, it can still be useful in technical analysis, as it looks for a pattern from historical data.
The autocorrelation analysis can be applied together with the momentum factor analysis.
A technical analyst can learn how the stock price of a particular day is affected by those of
previous days through autocorrelation. Thus, he can estimate how the price will move in the
future.
If the price of a stock with strong positive autocorrelation has been increasing for several days,
the analyst can reasonably estimate the future price will continue to move upward in the recent
future days. The analyst may buy and hold the stock for a short period of time to profit from the
upward price movement.
The autocorrelation analysis only provides information about short-term trends and tells little
about the fundamentals of a company. Therefore, it can only be applied to support the trades with
short holding periods.

