MMM - Multiple Regression
ECATERINA MHITAREAN
1.1 Background
Marketing Mix Modelling is a term that covers statistical methods suitable for explanatory and
predictive modelling of some variable of interest, for example a company's sales or market shares.
This thesis focuses on modelling sales as a function of marketing instruments and environmental
variables. In this case, the goal of Marketing Mix Modelling is to explain and predict sales from
marketing instruments, while controlling for other factors that influence sales. Its main task is to
decompose sales into base volume (which occurs due to such factors as seasonality and brand
awareness) and incremental volume (which captures the weekly variation in sales driven by
marketing activities). One of the most important Marketing Mix instruments is advertising, thus
it is crucial to understand the impact of advertising expenditures on sales.
Model building in marketing started in the middle of the twentieth century. Many studies have
been conducted since then, which have helped managers understand the marketing process.
Appropriately constructed market response models help managers determine the instruments that
influence sales and take actions that affect it. Applications show that model benefits include cost
savings resulting from improvements in resource allocation. Many studies discuss and describe
the model development process, provide a structure for model building and serve as a starting
point for this thesis, including Leeflang (2015), Leeflang (2000), Hanssens (2001) and P.M. Cain (2010).
This thesis attempts to develop a general model building strategy suitable for a high level of
complexity of the data, and to establish the most appropriate functional relationships and
estimation methods for Marketing Mix Modelling projects. This strategy will be used by Nepa for
systematic analysis of the data collected. All the steps of this model building strategy are
implemented in a user-friendly way and will be applied by Nepa for designing marketing plans
for its clients.
As an illustration, the thesis analyses the relationship between marketing expenditures and sales
on a dataset provided by Nepa. The data comes from a client of Nepa, one of the largest
electronics retailers in Sweden. The dataset contains model-specific weekly sales and marketing
activity data, as well as environmental data, for two years. To overcome some of the problems
that are commonly encountered when working with marketing data, advanced estimation
methods such as ridge regression, the lasso and the elastic net were employed to quantify the
sales-marketing relationship and to identify short- and long-run effects of marketing on
performance. The thesis describes each method and presents the output for each model
introduced. Marketing dynamics were also considered in the sales model structure, by
optimizing the decays for each media variable.
1.2 Nepa
Nepa is an innovative research company founded in 2006 with the ambition to improve the
efficiency of the research industry by moving from analog to digital methodologies. It is a
company that went beyond phone interviews and mail surveys and pioneered a fully automated
and online tracking solution. Today Nepa has more than 350 clients from all over the world and
offices in Stockholm, Helsinki, Oslo, Copenhagen, London and Mumbai.
1.3 Purpose
The main purpose of the thesis is to elaborate a methodology that Nepa can use in Marketing Mix
Modelling projects. A method is needed to find the optimal parameters that give a model with as
good predictability and as low multicollinearity as possible, with the following main areas of interest:
Parameter estimation
What type of decay should each media variable have? That is, how much effect does a certain
amount invested in a media variable have one week later? This is known as the carryover
effect, and it appears when some of the marketing strategies have an impact not only in the
current period, but also in future periods.
Variable selection
It is important to efficiently tackle the problems of selecting the informative variables and
evaluating the seasonality effect. How should season and trend be handled to avoid over-
or underestimation of the effects of other variables? Estimating the impact of marketing
instruments on sales becomes difficult when advertising activities coincide with seasonal peaks.
Regression modelling
It is often the case that several marketing investments take place at the same time. The
resulting collinearity makes the parameters estimated with ordinary least squares unreliable.
The question then arises as to what estimation methods should be used to attain
predictability and stability of the models (coping with multicollinearity, variable selection, etc.).
2 Theoretical Background
This section presents the mathematical background of the common challenges that marketing mix
modellers face. It begins with the challenge of choosing the appropriate functional form,
continues with the dynamic structure of marketing variables, and finally describes an approach to
account for the effects of seasonality.
Equation 2.3 is linear in the parameters β0*, β1, where β0* = ln β0. This model is known as the
double-logarithmic or log-log model. The version of the multiplicative model that retains the
highest-order interaction among the variables for K marketing instruments is:

yt = β0 x1t^β1 x2t^β2 · · · xKt^βK εt    (2.4)

or more compactly:

yt = β0 (∏_{k=1}^{K} xkt^βk) εt    (2.5)
In this setting, if some of the variables are "dummies", the corresponding variables are used as
exponents. Besides reflecting the non-constant behavior of the sales response function, another
advantage of the multiplicative model over the linear model is that it allows for a specific form of
interaction between the various instruments. Taking the first-order partial derivative of yt with
respect to any of the independent variables xkt shows that the impact of a change in xkt on yt is a
function of yt itself, which means that it depends not only on the value of xkt but on all the other
variables as well:
∂yt/∂xkt = β0 βk x1t^β1 x2t^β2 · · · xkt^{βk−1} · · · xKt^βK    (2.6)
When the sales response function exhibits increasing returns to scale, the exponential model can be used:

yt = β0 e^{β1 x1t} εt    (2.7)

After taking logarithms of both sides it becomes the semi-logarithmic, also known as the log-linear, model:

ln yt = ln β0 + β1 x1t + ln εt    (2.8)
When the nonlinear model is log-log or log-linear, an adjustment to the forecasts of yt is required,
so that they remain unbiased ([15]). Considering the typical multiplicative specification 2.4, where
ln εt is N(0, σ²), it can be shown that:

E[yt] = β0 x1t^β1 x2t^β2 · · · xKt^βK e^{σ²/2}    (2.9)

ŷt = β̂0 x1t^β̂1 x2t^β̂2 · · · xKt^β̂K e^{σ̂²/2}    (2.10)

where hats denote the ordinary least squares (OLS) estimates. A direct re-transformation would
under-estimate the forecasts.
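As a brief illustration, the correction in 2.10 can be applied directly to the fitted values of the
linearized model; this is a minimal R sketch with an assumed fitted lm object fit, not code from
the thesis:

sigma2_hat <- summary(fit)$sigma^2           # estimate of the error variance
y_hat <- exp(fitted(fit) + sigma2_hat / 2)   # bias-corrected forecast of y_t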
the value that maximizes the MLE score. The specification is then chosen according to the value
of λ reported. If λ = 1 then the specification is essentially linear. When λ approaches 0, equation
2.11 approaches the semi-logarithmic form, since:

lim_{λ→0} (yt^λ − 1)/λ = ln yt    (2.12)
where x_{t−l}, l = 0, 1, . . ., are the lagged terms of the independent variable. The model 2.13 is called
the Infinite Distributed Lag (IDL) Model. Assuming that all coefficients of the lagged terms of a
covariate have the same sign, equation 2.13 can be rewritten as:

yt = β0 + β Σ_{l=0}^{∞} ωl x_{t−l} + εt    (2.14)

or

yt = β0 + β1 xt + β1 λ x_{t−1} + β1 λ² x_{t−2} + . . . + β1 λ^l x_{t−l} + . . . + εt,    (2.18)
where β1 = β(1 − λ). The direct short-term effect of marketing effort is β1 = β(1 − λ), while the
retention rate λ measures how much of the advertising effect in one period is retained in the next.
The implied long-term effect is β = β1/(1 − λ). This model is also approximately equivalent to the
Simple Decay-Effect Model (Broadbent (1979)):
yt = β0 + β1 at + εt , (2.19)
where at = f (xt ) is the adstock function at time t, xt is the value of the advertising variable at
time t and λ is the decay or lag weight parameter:
at = f (xt ) = xt + λat−1 , t = 2, . . . , n (2.20)
Recursively substituting and expanding, the adstock function becomes:

at = xt + λ x_{t−1} + λ² x_{t−2} + . . . + λ^n x_{t−n}    (2.21)

Since 0 < λ < 1, λ^n → 0 as n → ∞. Moving on to the case with K explanatory variables x1, . . . , xK,
each with a different retention rate λ1, . . . , λK, the model becomes:

yt = β0 + β1 a1t + β2 a2t + · · · + βK aKt + εt    (2.22)

where

ait = f(xit) = xit + λi a_{it−1}, i = 1, . . . , K    (2.23)
To estimate the marketing variable coefficients, as well as the retention rates, non-linear least
squares can be used. The algorithm is described in more detail in section 3.2. First the adstock at
time t is defined for each marketing instrument, as in equation 2.23. The estimated sales are then:

ŷt = β̂0 + β̂1 a1t + β̂2 a2t + . . . + β̂K aKt    (2.24)

Finally, the optimization problem is:

minimize Σ_{t=1}^{T} (yt − ŷt)²
subject to 0 ≤ λi < 1, i = 1, . . . , K.
For the semi-logarithmic and double-logarithmic models the equations for the predicted sales
become 2.25 and 2.26 respectively:

ln ŷt = β̂0* + β̂1 a1t + β̂2 a2t + . . . + β̂K aKt    (2.25)

and

ln ŷt = β̂0* + β̂1 ln a1t + β̂2 ln a2t + . . . + β̂K ln aKt    (2.26)

And the optimization problem is:

minimize Σ_{t=1}^{T} (ln yt − ln ŷt)²
subject to 0 ≤ λi < 1, i = 1, . . . , K.
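A minimal R sketch of this constrained fit, assuming the minpack.lm package and a data frame
regdata with one media column TV (illustrative names, not the thesis code):

library(minpack.lm)

# a_t = x_t + lambda * a_{t-1}: recursive geometric adstock
adstock <- function(x, lambda) {
  as.numeric(stats::filter(x, filter = lambda, method = "recursive"))
}

# constrained non-linear least squares: 0 <= lambda < 1
fit <- nlsLM(log(SALES_TOT) ~ b0 + b1 * adstock(TV, lam1),
             data  = regdata,
             start = list(b0 = 17, b1 = 1e-7, lam1 = 0),
             lower = c(-Inf, -Inf, 0),
             upper = c(Inf, Inf, 0.99))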
2.3 Modelling trend and seasonality
In this section the "classical decomposition" is considered:

yt = mt + δit + εt    (2.27)

where mt is a slowly changing function (the "trend component"), δit is a function with known
period d (the "seasonal component"), and εt is a stationary time series.
In trying to explain sales behavior, a linear trend variable (mt = t for t = 1, 2, · · · , T) could be
introduced into the sales response function to capture the time-dependent nature of sales growth.
If a variable follows a systematic pattern within the year, it is said to exhibit seasonality. To
deal with seasonality, s dummy variables could be introduced in the model to represent the s
seasons, as in the sketch below.
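A minimal sketch (assumed helper names; weekly data with period d = 52 as in the later
analysis) of constructing the trend and the seasonal dummies:

T <- nrow(regdata)
regdata$trend <- seq_len(T)                   # linear trend m_t = t
week <- factor(((seq_len(T) - 1) %% 52) + 1)  # season index 1..52
S <- model.matrix(~ week)[, -1]               # 51 dummies, first week as baseline
colnames(S) <- paste0("S", 1:51)
regdata <- cbind(regdata, S)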
3 Estimation
Once the appropriate functional form is decided, the parameters of the marketing model must be
estimated. A description of the estimation methods for the model parameters is provided in this
chapter.
Following the notation in [8], the total sum of squares is defined as SStot = Σ_{t=1}^{T} (yt − ȳ)².
With RSS and SStot defined above, the following relationship holds: SStot = SSreg + RSS, where
SSreg is the regression sum of squares: SSreg = Σ_{t=1}^{T} (ŷt − ȳ)². It is easy to show that the
coefficient

where the standard error SE(β̂k) for any k = 1, . . . , K is the square root of the corresponding
diagonal element of Ĉov(β̂).
3.2 Non-linear Least Squares
To model the dynamic structure with several explanatory variables, the Levenberg-Marquardt
algorithm (LMA) was used. As described in [7], the LMA interpolates between the Gauss-Newton
algorithm (GNA) and the method of gradient descent. In the current setting, following the
notation defined in the previous sections, the problem is stated as follows: given T observations
of independent and dependent variables, (xt, yt), where xt is a vector of length K containing the
K variable measurements corresponding to observation t of the dependent variable yt, the
objective is to optimize the parameters β = (β0, β1, . . . , βK)ᵀ of the model curve f(X, β) such that
the sum of the squares of the deviations

S(β) = Σ_{t=1}^{T} [yt − f(xt, β)]²    (3.6)

is minimized.
3.2.3 The Levenberg-Marquardt Method
The Levenberg-Marquardt algorithm interpolates between the Gauss-Newton method and the
method of gradient descent:

(JᵀJ + λI) δ_lm = Jᵀ[y − f(X, β)]    (3.12)

Small values of the damping parameter λ result in a Gauss-Newton update and large values of λ
result in a gradient descent update. In each step the parameter λ is iteratively adjusted: λ is
increased if S(β + δ) > S(β), and decreased otherwise. To avoid slow convergence in the direction
of a small gradient, Marquardt provided the insight that the values of λ should be scaled to the
values of JᵀJ ([7]):

[JᵀJ + λ diag(JᵀJ)] δ_lm = Jᵀ[y − f(X, β)]    (3.13)
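As a toy sketch (assumed names, not the thesis code), one damped update step of equation 3.13,
given the residual vector r = y − f(X, β) and the Jacobian J:

lm_step <- function(beta, r, J, lambda) {
  A   <- crossprod(J) + lambda * diag(diag(crossprod(J)))  # J'J + lambda*diag(J'J)
  rhs <- crossprod(J, r)                                   # J'(y - f(X, beta))
  beta + as.vector(solve(A, rhs))                          # damped Gauss-Newton step
}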
4 Validation and Testing
The process of validation and testing of the model begins with testing the model's statistical
assumptions. This part is called specification error analysis (section 4.2). The next step is to test
the regression results. This involves the tests of significance described in section 4.1.
is:

F = (SSreg/K) / (RSS/(T − K − 1))

which has an approximate F(K, T − K − 1) distribution under the null. To determine the amount of
variation "explained" by the covariates, one looks at a descriptive statistic R², called the
coefficient of determination or goodness of fit:

R² = SSreg/SStot = 1 − RSS/SStot    (4.1)
εt is normally distributed.
The matrix X has full rank, thus XᵀX is non-singular.
Table 1, based on [16], is part of a model building strategy from the perspective of violations of
the assumptions. It presents a short summary of reasons, remedies, and ways to detect possible
violations of each assumption. The table is adapted to the given problem and to the methods
applied in this thesis.
4.2.2 Heteroscedasticity
The second assumption is that all residuals εt have the same standard deviation, in which case
standard errors and F-statistics can be computed from the estimated covariance matrix. However,
if the model has heteroscedastic residuals and is misspecified as homoscedastic, then the
estimators of the standard errors of the coefficient estimates will be wrong, and therefore the
F-tests will be invalid ([13]). OLS estimates of the coefficients will still be unbiased but not
efficient. One solution is to use another estimation method, such as generalized least squares or
the method of maximum likelihood ([4]). In many cases the critical remedy is to use an
appropriately adjusted formula for the variances and covariances of the parameter estimates.
Heteroscedasticity can be detected using the Breusch-Pagan test ([21]). The idea of this test is to
run a regression of the squared residuals on the covariates from the original equation:

ε̂t² = δ0 + δ1 x1t + δ2 x2t + . . . + δK xKt + νt    (4.5)

where νt is a disturbance term with mean zero given the xkt, k = 1, . . . , K. The null hypothesis of
homoscedasticity is:

H0: δ1 = δ2 = . . . = δK = 0    (4.6)

The F-statistic of the test is calculated in the following way:

F = (R²_{ε̂²}/K) / ((1 − R²_{ε̂²})/(T − K − 1))    (4.7)

where R²_{ε̂²} is the R-squared from regression 4.5. This F-statistic has (approximately) an
F(K, T − K − 1) distribution under the null.
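A minimal sketch of this auxiliary regression, assuming a fitted lm object fit and the data frame
regdata used later in the thesis (the lmtest package offers the same test directly via bptest(fit)):

u2  <- residuals(fit)^2                      # squared residuals, eq. (4.5)
aux <- lm(u2 ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO + PRINT +
               SOCIALMEDIA + Rain..mm. + sal + HOLIDAY, data = regdata)
R2  <- summary(aux)$r.squared
K   <- length(coef(aux)) - 1
Tn  <- length(u2)
F_stat <- (R2 / K) / ((1 - R2) / (Tn - K - 1))
pf(F_stat, K, Tn - K - 1, lower.tail = FALSE)  # p-value under H0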
Table 1: Violations of the assumptions about the disturbance term: reasons, consequences, tests
and remedies (based on [16])

1. E[εt] ≠ 0
Possible reasons: incorrect functional form(s); omitted variable(s).
Consequence: biased parameter estimates.
Detection: plot residuals against each predictor variable; RESET test; Box-Cox transformation.
Remedy: modify the model specification in terms of functional form; add relevant predictors.

2. Var[εt] ≠ σ²
Possible reasons: error proportional to the variance of the predictor.
Consequence: inefficient parameter estimates.
Detection: plot residuals against each predictor variable; Breusch-Pagan test.
Remedy: modify the specification; use heteroscedasticity-consistent estimation (e.g. GLS).

3. Cov[εt, εt′] ≠ 0
Possible reasons: see 1.
Consequence: see 2.
Detection: plot residuals against time; Durbin-Watson test.
Remedy: see 1.

4. Non-normal errors
Possible reasons: see 1.
Consequence: p-values cannot be trusted.
Detection: inspect the distribution of the residuals; normality tests.
Remedy: see 1.; Box-Cox transformation.

5. Multicollinearity
Possible reasons: relations between predictor variables.
Consequence: unreliable parameters.
Detection: inspect the correlation matrix of the predictor variables; some VIF ≥ 5; condition
number of the matrix (XᵀX)⁻¹ greater than 30.
Remedy: apply other estimation methods; eliminate predictor variable(s).
4.2.3 Correlated Disturbances
Instead of assuming uncorrelated disturbances, let us consider the following simple linear
additive relation for T time-series observations:

yt = β0 + β1 xt + ut, t = 1, . . . , T    (4.8)

where the disturbances are correlated in the following way:

ut = ρ u_{t−1} + εt, |ρ| < 1    (4.9)

and:

E[εt] = 0, Cov(εt, εt′) = 0, t ≠ t′

In 4.8 the error terms u1, u2, . . . , uT follow a first-order autoregressive (AR) process with
autocorrelation parameter ρ. In this case, the parameter estimates are no longer efficient,
although still unbiased, and the usual F-statistic cannot be trusted.
A plot of the residuals against time could help to detect a violation of the assumption of
uncorrelated disturbances. Another way is to use the test developed by Durbin and Watson ([9],
[10]), based on the variance of the difference between two successive disturbances:

E[(ut − u_{t−1})²] = E[ut²] + E[u_{t−1}²] − 2E[ut u_{t−1}]    (4.10)
The Durbin-Watson test statistic varies between zero and four and is calculated in the following way:

DW = Σ_{t=2}^{T} (ût − û_{t−1})² / Σ_{t=1}^{T} ût²    (4.11)

Values of the DW statistic below (above) 2 are associated with positive (negative) autocorrelation.
The test statistic is used as described in [16]:
1. Tests for positive autocorrelation:
(a) If DW < dL, there is positive autocorrelation;
(b) If dL < DW < dU, the result is inconclusive;
(c) If DW > dU, there is no positive autocorrelation.
2. Tests for negative autocorrelation:
(a) If DW > 4 − dL, there is negative autocorrelation;
(b) If 4 − dU < DW < 4 − dL, the result is inconclusive;
(c) If DW < 4 − dU, there is no negative autocorrelation.
Here the lower and upper bounds dL and dU depend on the significance level and the sample size.
When first-order autocorrelation is detected, a two-step estimation procedure is required. The
first step involves obtaining an estimate of ρ by means of OLS estimation. The second step
requires this estimate of ρ to be used in an estimated generalized least squares (GLS) regression
([15]). However, according to [16], this remedy should only be a last resort.
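As a brief illustration, assuming a fitted lm object fit, the statistic can be computed directly from
the residuals, or via lmtest::dwtest, which also supplies a p-value:

u  <- residuals(fit)
DW <- sum(diff(u)^2) / sum(u^2)  # equation (4.11)
lmtest::dwtest(fit)              # Durbin-Watson test with p-value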
4.2.4 Nonnormal Errors
The assumption of normally distributed errors is required for hypothesis tests and confidence
intervals to be applicable. When this assumption is violated the standard statistical tests cannot
be performed, although the least squares estimates of the parameters remain unbiased as well as
consistent.
The normality of the errors can be examined through the residuals. For this, an inspection of the
distribution function of the residuals as well as normality tests can be used. In this thesis, the
Lilliefors test was employed to assess the normality assumption of the residuals.
4.2.5 Multicollinearity
In the linear model, the matrix of observations X is assumed to have full rank, otherwise XᵀX
will be singular and the OLS estimates cannot be uniquely determined. When the number of
covariates is smaller than the number of observations, XᵀX will be singular when some of the
columns of X are collinear. In practice, however, the problem that arises more often is imperfect
multicollinearity, when a column of X is nearly a linear combination of the other columns. In this
case (XᵀX)⁻¹ exists, but its elements will be large, thus the standard errors of one or more of the
regression coefficients become very large, and the point estimates of those coefficients will be
imprecise. This problem is frequently encountered in the marketing area, since the data often
show high degrees of correlation between media variables. Some methods of diagnosing
multicollinearity in a given dataset include:
1. Examining the correlation matrix of the predictor variables. A correlation coefficient close to
1 or -1 is considered an indicator of positive or negative collinearity.
2. Looking at the Variance Inflation Factor (VIF). This measure is based on the regression of
each individual predictor variable on all the other predictor variables. The VIF is computed as
1/(1 − Rk²), where the Rk² values result from the regressions above. There is no exact value of
the VIF that would be considered a sign of multicollinearity; some analysts argue that a VIF
value greater than 5 signals that collinearity is a problem.
3. Comparing the results of the F-test and the t-tests. Multicollinearity may be regarded as acute
if the F-statistic is significant while none of the t-statistics for the slope coefficients is.
4. Looking at the condition number of the matrix (XᵀX)⁻¹, which is the ratio of its largest
eigenvalue to its smallest eigenvalue, λmax/λmin. The data matrix should first be normalized so
that each column has equal (usually unit) length. A rule of thumb is that a condition index of 15
indicates some degree of collinearity, and a condition index above 30 indicates severe
multicollinearity. A sketch of diagnostics 2 and 4 is given after this list.
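A minimal sketch (assumed names, not the thesis code) of diagnostics 2 and 4, for a numeric
predictor matrix X:

# VIF_k = 1 / (1 - R_k^2) from the auxiliary regressions
vif <- sapply(seq_len(ncol(X)), function(k) {
  R2k <- summary(lm(X[, k] ~ X[, -k]))$r.squared
  1 / (1 - R2k)
})

# condition number after normalizing the columns to unit length
Xs <- scale(X) / sqrt(nrow(X) - 1)
ev <- eigen(crossprod(Xs), symmetric = TRUE)$values
max(ev) / min(ev)  # lambda_max / lambda_min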
The main solution proposed to Nepa for dealing with multicollinearity was to apply
regularization methods specifically developed for cases with severe multicollinearity.
5 Linear Model Selection and Regularization
This section discusses distinct ways of improving the linear (or linearizable) models, using
variable selection or alternative estimation methods. To stay consistent with the statistical
learning literature, in this chapter n denotes the number of observations and p the number of
covariates.
5.2 Shrinkage Methods
As an alternative to the subset selection methods described in section 5.1 above, techniques that
shrink the coefficient estimates towards zero can be used. These include ridge regression, the
lasso, and the elastic net. To better understand these techniques, the concept of the bias-variance
trade-off is first introduced.
where β̂ ∼ N(β, σ²(XᵀX)⁻¹). Note that the covariates are fixed; only the responses are random.
The expected test MSE, for a given vector x0 of length K that contains new measurements, can be
decomposed in the following way:

MSE0 = Var[x0ᵀβ̂] + Bias²(x0ᵀβ̂) + Var[ε]

where Bias(x0ᵀβ̂) = E[x0ᵀβ̂] − x0ᵀβ. In practice, some bias might be accepted in exchange for a
reduction in the variance of the coefficient estimates. This can be achieved by employing the
regularized regression methods described in the following sections.
where λ ≥ 0 is a tuning parameter to be determined, and the term λ Σ_{j=1}^{p} βj² is called a
shrinkage penalty. The result is the ridge regression estimator:

β̂ridge(λ) = (XᵀX + λI)⁻¹ Xᵀy = W(λ) β̂OLS,

where W(λ) = (XᵀX + λI)⁻¹ XᵀX. For each value of λ ridge regression will produce a set of
coefficient estimates. When λ = 0 the ridge estimates will be equal to the least squares estimates.
As λ → ∞ the ridge regression coefficient estimates approach zero. The intercept remains simply
the mean value of the response.
Because ridge coefficients change substantially when a covariate is multiplied by a constant, ridge
regression should be applied using standardized predictors ([6]):

x̃ij = xij / sqrt((1/n) Σ_{i=1}^{n} (xij − x̄j)²)    (5.4)

Standardizing the predictors also makes it possible to compare the estimated coefficients with
each other.
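A small sketch of the ridge estimator in closed form (assumed helper name; Xs is the
standardized predictor matrix and y the centered response):

ridge_beta <- function(Xs, y, lambda) {
  p <- ncol(Xs)
  solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, y))  # (X'X + lambda*I)^{-1} X'y
}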
5.2.3 The Lasso
Ridge regression will shrink all the coefficient estimates towards zero, but it will not set any of
them exactly equal to zero, which might be a drawback if the purpose of the model is also variable
selection. An alternative method called the lasso overcomes this disadvantage. The lasso
coefficients are the values that minimize the quantity:

Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij)² + λ Σ_{j=1}^{p} |βj| = RSS + λ Σ_{j=1}^{p} |βj|    (5.5)

Like ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the
case of the lasso penalty, some of the coefficient estimates will be exactly zero when the tuning
parameter λ is sufficiently large. Hence the lasso also performs variable selection, which makes
the interpretation of the model much easier.
and

minimize_β Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij)²  subject to  Σ_{j=1}^{p} βj² ≤ s,    (5.7)

respectively.
Note that the restriction Σ_{j=1}^{p} βj² ≤ s on β is a hypersphere centered at the origin with
bounded squared radius s, where the value of s corresponds to a value of the tuning parameter λ.
Figure 1 (taken from [6]) shows the restrictions for the lasso and ridge regression in the
two-parameter case.
Choosing among the regularization methods is not trivial. Which model produces better
prediction accuracy depends on the dataset used. Since the lasso assumes that several coefficients
are in fact equal to zero, it will perform better when some of the predictors are not related to the
response. When all coefficients differ substantially from zero, ridge regression is expected to
outperform the lasso. Since the number of coefficients truly related to the response is never
known, cross-validation can be used to determine the best method for each dataset. See [6] for a
deeper discussion of how to select the regularization approach.
Figure 1: (taken from [6]). Contours of the error and constraint functions for the lasso (left)
and ridge regression (right). The solid blue areas are the regions |β1| + |β2| ≤ s and β1² + β2² ≤ s,
respectively, while the red ellipses are the contours of the RSS.
being the matrix of predictors, the naive elastic net estimator β̂ is the one that minimizes the
quantity:

L(λ1, λ2, β) = Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij)² + λ1 Σ_{j=1}^{p} |βj| + λ2 Σ_{j=1}^{p} βj²    (5.8)

for any λ1 and λ2. Similar to ridge regression and the lasso, this procedure can be viewed as
penalized least squares. If α is defined as α = λ2/(λ1 + λ2), then solving for β in equation 5.8 is
equivalent to the optimization problem:

minimize_β Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij)²  subject to  (1 − α) Σ_{j=1}^{p} |βj| + α Σ_{j=1}^{p} βj² ≤ s    (5.9)
The function (1 − α) Σ_{j=1}^{p} |βj| + α Σ_{j=1}^{p} βj² is called the elastic net penalty. When α = 1, the
naive elastic net becomes simple ridge regression. For all α ∈ [0, 1) the elastic net penalty function
is singular (without first derivative) at 0, and it is strictly convex for all α > 0. Note that the lasso
penalty (α = 0) is convex but not strictly convex. The two-dimensional contours of the penalty
function for ridge, the lasso and the naive elastic net are given in Figure 2 (taken from [26]). In
[27] Hui Zou and Trevor Hastie develop a method to solve the naive elastic net problem efficiently.
Figure 2: (taken from [26]). Two-dimensional contour plots of the ridge, the lasso, and α = 0.5
elastic net penalties.
It turns out that minimizing equation 5.8 is equivalent to a lasso-type optimization problem. This
fact implies that the naive elastic net also enjoys the computational advantage of the lasso. The
next lemma is a result from [27].
Lemma 1. Given a dataset (y, X) and (λ1, λ2), an artificial dataset (y*, X*) is defined by:

X*_{(n+p)×p} = (1 + λ2)^{−1/2} [ X ; √λ2 I ],   y*_{(n+p)} = [ y ; 0 ]    (5.10)

where [A ; B] denotes stacking A on top of B. Let γ = λ1/√(1 + λ2) and β* = √(1 + λ2) β. Then
the naive elastic net criterion can be written as:

L(γ, β) = L(γ, β*) = Σ_{i=1}^{n+p} (yi* − β0* − Σ_{j=1}^{p} βj* xij*)² + γ Σ_{j=1}^{p} |βj*|    (5.11)

Let β̂* = (β̂1*, . . . , β̂p*)ᵀ be the vector that minimizes the quantity above. Then

β̂ = (1/√(1 + λ2)) β̂*    (5.12)
Note that the sample size in the augmented problem is n + p and X* has rank p, which means
that the naive elastic net can potentially select all p predictors in all situations. Lemma 1 also
shows that the naive elastic net can perform automatic variable selection in a fashion similar to
the lasso.
double shrinkage procedure, which causes an increase in bias. In their paper, Hui Zou and Trevor
Hastie propose a scaling of the naive elastic net coefficients which keeps the advantage of the
variable selection property while avoiding the undesirable double shrinkage. Following the
notation in section 5.2.6, the naive elastic net solves a lasso-type problem:

β̂* = argmin_{β*} |y* − X*β*|² + (λ1/√(1 + λ2)) |β*|₁    (5.13)

where (y*, X*) is the augmented data defined in 5.10, and (λ1, λ2) is the penalty parameter. The
corrected elastic net estimates are defined by:

β̂enet = √(1 + λ2) β̂*    (5.14)

Recall that β̂naive enet = (1/√(1 + λ2)) β̂*, thus:

β̂enet = (1 + λ2) β̂naive enet    (5.15)
In the elastic net one could choose the tuning parameters as (λ2, s), where s ∈ [0, 1] is the fraction
of the l1-norm. The tuning parameters were chosen using a two-dimensional tenfold
cross-validation method, following the procedure suggested in [27]: first a (relatively small) grid
of values for λ2 is picked, then the other tuning parameter is selected by cross-validation. The
value of λ2 is chosen so as to give the smallest CV error.
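A hedged sketch of this two-dimensional search with the elasticnet package (assumed predictor
matrix X and response y; the exact arguments of cv.enet may differ between package versions):

library(elasticnet)
lambda2_grid <- c(0, 0.01, 0.1, 1, 10, 100)
cv_err <- sapply(lambda2_grid, function(l2) {
  cv <- cv.enet(X, y, K = 10, lambda = l2,
                s = seq(0, 1, length.out = 50), mode = "fraction",
                plot.it = FALSE)
  min(cv$cv)  # smallest CV error over s for this lambda2
})
lambda2_grid[which.min(cv_err)]  # lambda2 with the smallest CV error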
6 Results
In this section the results for the models specified in the previous sections are presented. It starts
with the linear and multiplicative models described in section 2.1.1: the linear, log-linear and
log-log functional forms are estimated. Next, the retention rates for the chosen functional form
are estimated using non-linear least squares. Finally, the results for the modern approaches are
illustrated and compared.
With the variables defined above the linear model 2.1 becomes:

yt = α0 + α1 TVt + α2 DRt + α3 DR.POSTENt + α4 OUTDOORt + α5 RADIOt + α6 PRINTt
   + α7 SOCIALMEDIAt + α8 Raint + α9 salt + α10 HOLIDAYt + εt^(1)    (6.1)

and the exponential model:

yt = β0 e^{β1 TVt + β2 DRt + β3 DR.POSTENt + β4 OUTDOORt + β5 RADIOt + β6 PRINTt + β7 SOCIALMEDIAt}
   · e^{β8 Raint + β9 salt + β10 HOLIDAYt} εt^(2)    (6.2)
and after the transformation:

where:
γ8* = ln(γ8), γ9* = ln(γ9), γ10* = ln(γ10)

The following tables illustrate the estimation results for 6.1, 6.3 and 6.5, respectively.

Table 2: Estimation results of the linear model (OLS)

##
## Call:
## lm(formula = "SALES_TOT ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO + PRINT + SOCIALMEDIA + Rain..mm. + sal + HOLIDAY",
##    data = regdata)
##
## Residuals :
## Min 1Q Median 3Q Max
## -23290744 -7598418 -1641095 7793738 82440104
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 4.226 e +07 3.988 e +06 10.598 < 2e -16 ***
## TV 2.239 e +01 2.567 e +00 8.721 9.59 e -14 ***
## DR 1.453 e +01 8.034 e +00 1.809 0.07369 .
## DR . POSTEN 1.742 e +01 5.417 e +00 3.216 0.00178 **
## OUTDOOR 3.756 e +01 1.035 e +01 3.631 0.00046 ***
## RADIO -6.633 e +01 2.957 e +01 -2.243 0.02724 *
## PRINT 6.304 e +00 5.876 e +00 1.073 0.28610
## SOCIALMEDIA 1.832 e +02 2.400 e +01 7.636 1.84 e -11 ***
## Rain .. mm . 2.796 e +05 1.461 e +05 1.913 0.05877 .
## sal 6.885 e +06 3.734 e +06 1.844 0.06834 .
## HOLIDAY 2.906 e +06 6.383 e +06 0.455 0.64999
## ---
## Signif . codes :
## 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 14380000 on 94 degrees of freedom
## Multiple R - squared : 0.8203 , Adjusted R - squared : 0.8012
## F - statistic : 42.92 on 10 and 94 DF , p - value : < 2.2 e -16
For the linear functional form a relatively high value of R² was expected, given that time series
data was used. Note that R² can be compared among different models only if the models have
exactly the same LHS and exactly the same observations. The values of the F-statistics in all
Table 3: Estimation results of the log-linear model (OLS)
##
## Call:
## lm(formula = "log(SALES_TOT) ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO + PRINT + SOCIALMEDIA + Rain..mm. + sal + HOLIDAY",
##    data = regdata)
##
## Residuals :
## Min 1Q Median 3Q Max
## -0.20887 -0.07726 -0.00648 0.06248 0.32894
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 1.787 e +01 3.063 e -02 583.581 < 2e -16 ***
## TV 1.661 e -07 1.972 e -08 8.425 4.07 e -13 ***
## DR 1.273 e -07 6.171 e -08 2.062 0.041929 *
## DR . POSTEN 1.423 e -07 4.161 e -08 3.420 0.000929 ***
## OUTDOOR 3.334 e -07 7.946 e -08 4.196 6.15 e -05 ***
## RADIO -6.494 e -07 2.271 e -07 -2.860 0.005224 **
## PRINT 7.119 e -08 4.513 e -08 1.577 0.118097
## SOCIALMEDIA 1.547 e -06 1.843 e -07 8.393 4.76 e -13 ***
## Rain .. mm . 2.251 e -03 1.122 e -03 2.006 0.047757 *
## sal 5.934 e -02 2.868 e -02 2.069 0.041258 *
## HOLIDAY -4.279 e -02 4.903 e -02 -0.873 0.385054
## ---
## Signif . codes :
## 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 0.1105 on 94 degrees of freedom
## Multiple R - squared : 0.8332 , Adjusted R - squared : 0.8154
## F - statistic : 46.95 on 10 and 94 DF , p - value : < 2.2 e -16
cases indicate that all three models (especially the first two) are highly significant. The number
of significant parameters varies slightly in each model. The significant parameters that all three
models have in common are the intercept, TV, DR.POSTEN, OUTDOOR, RADIO and
SOCIALMEDIA. For these parameters the corresponding p-values are smaller than 0.05. In the
multiplicative log-linear model, the parameters for DR, Rain and sal are also significant. For the
log-log model, the parameters log(PRINT) and Rain, along with the common ones mentioned
above, are significant.
It is important to mention that each specification has its own unique economic interpretation.
That is, the choice of a log versus linear specification should be made largely based on the
underlying economics. Table 5 (taken from [25]) summarizes the interpretation of the estimates
for each case.
Table 4: Estimation results of the log-log model (OLS)
##
## Call:
## lm(formula = "log(SALES_TOT) ~ log(TV) + log(DR) + log(DR.POSTEN) + log(OUTDOOR) + log(RADIO) + log(PRINT) + log(SOCIALMEDIA) + Rain..mm. + sal + HOLIDAY",
##    data = regdata)
##
## Residuals :
## Min 1Q Median 3Q Max
## -0.36417 -0.10845 0.01072 0.06971 0.79316
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 14.953425 0.501883 29.795 < 2e -16 ***
## log ( TV ) 0.012518 0.003359 3.727 0.000331 ***
## log ( DR ) 0.005367 0.007390 0.726 0.469484
## log ( DR . POSTEN ) 0.013922 0.006767 2.057 0.042410 *
## log ( OUTDOOR ) 0.052534 0.014639 3.589 0.000530 ***
## log ( RADIO ) -0.012448 0.004007 -3.107 0.002500 **
## log ( PRINT ) 0.181852 0.036741 4.950 3.26 e -06 ***
## log ( SOCIALMEDIA ) 0.012168 0.003768 3.229 0.001709 **
## Rain .. mm . 0.004394 0.001918 2.292 0.024164 *
## sal 0.049464 0.047164 1.049 0.296974
## HOLIDAY -0.050739 0.086438 -0.587 0.558615
## ---
## Signif . codes :
## 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 0.1858 on 94 degrees of freedom
## Multiple R - squared : 0.5279 , Adjusted R - squared : 0.4777
## F - statistic : 10.51 on 10 and 94 DF , p - value : 1.001 e -11
Table 5: (taken from [25]). Summary of the interpretation of Marketing Mix Modelling functional forms

Model             Dependent variable  Independent variable  Interpretation of β  Marginal effect of ∆x
linear model      y                   x                     β = ∆y/∆x            β
log-linear model  ln(y)               x                     100·β = %∆y/∆x       y·β
log-log model     ln(y)               ln(x)                 β = %∆y/%∆x          y·β/x
RSS, ESS, and σ̂ are not comparable in size across the models. This is because several variables,
including the dependent variable, were transformed in order to estimate the multiplicative
models using least squares. Therefore, it is not possible to compare the estimated values of the
parameters across the models.
Note that the numbers in Table 4 are estimates for the parameters in 6.3, the linearized version
of the log-log model (6.2). To find the estimates for the parameters of the variables that had been
logged, an 'anti-ln' transformation must be applied. Instead of just taking the exponential of the
estimates from Table 4 to obtain proper estimates for the parameters in the log-log model, the
following
Figure 3: Plot of residuals against each predictor variable for the linear model
Figure 4: Plot of residuals against each predictor variable for the log-linear model
Figure 5: Plot of residuals against each predictor variable for the log-log model
Table 8: RESET test, using powers 2 and 3 of the fitted response
linear model RESET = 38.269, df1 = 2, df2 = 92, p-value = 8.055e-13
log-linear model RESET = 4.5081, df1 = 2, df2 = 92, p-value = 0.01356
log-log model RESET = 26.782, df1 = 2, df2 = 92, p-value = 6.818e-10
The Box-Cox test can also be used to determine whether transformations of the variables are
required. Figure 6 shows the log-likelihood for different values of λ. The best fitting
transformation is λ = −0.4242424, which is closest to the log-linear specification. Note that if the
task were purely to fit historical data, the value of λ above would have been chosen. However, it
has no economic meaning, thus the log-linear specification is preferred.
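A short sketch of this check with MASS::boxcox, assuming the fitted linear model object fit_lin
(an illustrative name):

library(MASS)
bc <- boxcox(fit_lin, lambda = seq(-2, 2, by = 0.01))  # profile log-likelihood
bc$x[which.max(bc$y)]                                  # lambda maximizing it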
Figure 6: Box-Cox transformation of the response variable, with a 95% confidence interval for
the parameter λ
The starting values for all the decays are 0, and as starting values for the parameter coefficients
the estimates of equation 6.3, shown in Table 3, were used. The results of the non-linear least
squares regression are shown in Table 9. The estimates are quite close to the ones provided in
Table 3, although their significance changes. To avoid over-fitting, one might pick the significant
decays and run a linear regression again. Observe that of all the decays only
SOCIALMEDIA_adstock has a p-value (0.055435) below 0.1, so the null hypothesis that the decay
for SOCIALMEDIA equals 0 is rejected at significance level α = 0.1. The adstock function for
SOCIALMEDIA expenditures is then:

f7(SOCIALMEDIAt, 0.2644602) = SOCIALMEDIAt + 0.2644602 · f7(SOCIALMEDIA_{t−1}), t = 2, . . . , T
Table 9: Estimation results of the log-linear model, dynamic structure using Levenberg-Marquardt
method
##
## Formula : log ( SALES_TOT ) ~ Intercept + TV_coefficient * adstock (TV , TV_adstock ) +
## DR_coefficient * adstock (DR , DR_adstock ) + DR . POSTEN_coefficient *
## adstock ( DR . POSTEN , DR . POSTEN_adstock ) + OUTDOOR_coefficient *
## adstock ( OUTDOOR , OUTDOOR_adstock ) + RADIO_coefficient * adstock ( RADIO ,
## RADIO_adstock ) + PRINT_coefficient * adstock ( PRINT , PRINT_adstock ) +
## SOCIALMEDIA_coefficient * adstock ( SOCIALMEDIA , SOCIALMEDIA_adstock ) +
## Rain .. mm . _coefficient * Rain .. mm . + sal_coefficient * sal +
## HOLIDAY_coefficient * HOLIDAY
##
## Parameters :
## Estimate Std . Error t value Pr ( >| t |)
## Intercept 1.787 e +01 4.094 e -02 436.519 < 2e -16 ***
## TV_coefficient 1.593 e -07 2.313 e -08 6.885 8.54 e -10 ***
## TV_adstock 1.488 e -02 1.286 e -01 0.116 0.908188
## DR_coefficient 1.118 e -07 6.615 e -08 1.690 0.094613 .
## DR_adstock 0.000 e +00 5.868 e -01 0.000 1.000000
## DR . POSTEN_coefficient 1.445 e -07 5.444 e -08 2.655 0.009436 **
## DR . POSTEN_adstock 7.176 e -04 3.168 e -01 0.002 0.998198
## OUTDOOR_coefficient 3.229 e -07 8.639 e -08 3.737 0.000332 ***
## OUTDOOR_adstock 0.000 e +00 3.257 e -01 0.000 1.000000
## RADIO_coefficient -5.799 e -07 2.428 e -07 -2.389 0.019075 *
## RADIO_adstock 3.523 e -02 4.130 e -01 0.085 0.932215
## PRINT_coefficient 7.731 e -08 4.689 e -08 1.649 0.102793
## PRINT_adstock 0.000 e +00 5.808 e -01 0.000 1.000000
## SOCIALMEDIA_coefficient 1.409 e -06 2.141 e -07 6.581 3.38 e -09 ***
## SOCIALMEDIA_adstock 2.645 e -01 1.362 e -01 1.941 0.055435 .
## Rain .. mm . _coefficient 2.141 e -03 1.173 e -03 1.826 0.071339 .
## sal_coefficient 5.876 e -02 3.178 e -02 1.849 0.067882 .
## HOLIDAY_coefficient -3.850 e -02 5.050 e -02 -0.762 0.447846
## ---
## Signif . codes : 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 0.1105 on 87 degrees of freedom
##
## Number of iterations to convergence : 20
## Achieved convergence tolerance : 1.49 e -08
6.3 Re-estimation and testing the OLS assumptions
In this section the lag weight parameters found using the Levenberg-Marquardt algorithm are
used, and the model re-estimated with OLS is tested. Since the only significant decay found in
the previous section is the one for SOCIALMEDIA, the following equation is estimated with OLS:

Table 10: Estimation results of the log-linear model with the adstock transformation for
SOCIALMEDIA (OLS)

##
## Call:
## lm(formula = "log(SALES_TOT) ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO + PRINT + SOCIALMEDIA + Rain..mm. + sal + HOLIDAY",
##    data = regdata)
##
## Residuals :
## Min 1Q Median 3Q Max
## -0.231972 -0.073825 -0.009095 0.061918 0.291512
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 1.787 e +01 2.955 e -02 604.603 < 2e -16 ***
## TV 1.540 e -07 1.914 e -08 8.048 2.53 e -12 ***
## DR 1.124 e -07 5.925 e -08 1.897 0.060940 .
## DR . POSTEN 1.471 e -07 3.999 e -08 3.678 0.000391 ***
## OUTDOOR 3.153 e -07 7.667 e -08 4.113 8.37 e -05 ***
## RADIO -5.581 e -07 2.197 e -07 -2.540 0.012712 *
## PRINT 8.798 e -08 4.351 e -08 2.022 0.045990 *
## SOCIALMEDIA 1.901 e -06 2.078 e -07 9.146 1.20 e -14 ***
## Rain .. mm . 2.176 e -03 1.080 e -03 2.015 0.046764 *
## sal 5.672 e -02 2.757 e -02 2.057 0.042443 *
## HOLIDAY -4.193 e -02 4.717 e -02 -0.889 0.376266
## ---
## Signif . codes : 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 0.1063 on 94 degrees of freedom
## Multiple R - squared : 0.8456 , Adjusted R - squared : 0.8292
## F - statistic : 51.48 on 10 and 94 DF , p - value : < 2.2 e -16
The results above indicate that most of the parameters are significant, except HOLIDAY. The
goodness of fit has increased, and the p-value of the F-statistic indicates that the model is highly
significant. The short-term effect of marketing effort for SOCIALMEDIA is 1.901·10⁻⁶. The
implied long-term effect is 1.901·10⁻⁶/(1 − 0.2644602) = 2.584496·10⁻⁶.
In order to test the first assumption, E[εt] = 0, the residuals are again plotted against each
predictor variable (Figure 7). The graphs do not show any systematic pattern in the residuals. Next
Figure 7: Plot of residuals against each predictor variable for the log-linear functional form with
the adstock model considered for the SOCIALMEDIA variable (equation 6.9)
the RESET test was employed with powers of the fitted response. The results shown in Table 11
indicate that there is no strong evidence of misspecification.
Table 11: RESET test for the log-linear adstock model, using powers of the fitted response
power 2: RESET = 1.1517, df1 = 1, df2 = 93, p-value = 0.286
power 3: RESET = 0.3628, df1 = 1, df2 = 93, p-value = 0.5484
powers 2 and 3: RESET = 2.6459, df1 = 2, df2 = 92, p-value = 0.07634
In order to test the second assumption, Var[εt] = σ² for all t, Figure 7 must be examined again,
now with the purpose of detecting changes in the variability of the residuals. To test for
heteroscedasticity more formally, the Breusch-Pagan test is employed, by running a regression of
the squared residuals on the explanatory variables that appear in equation 6.9. The p-value
associated with the Breusch-Pagan test is 0.9635, indicating that no significant heteroscedasticity
is detected. To test the normality of the residuals, normality tests together with visual
assessment were employed. Figure 9 shows the empirical cumulative distribution function
together with the normal cumulative distribution function. The normal probability plot is shown
in Figure 8 (right). The plots do not indicate evidence of non-normality of the residuals. To assess
non-normality of the residuals more carefully, the Lilliefors test was employed, which returned a
p-value equal to 0.7248. From the results above it can be concluded that non-normality is not an
issue. To check for the presence of multicollinearity, first the correlation matrix of the
explanatory variables must be inspected. The correlation matrix
Figure 8: Diagnostics of the regression model considered in equation 6.9. Left: scatterplot of the
residuals against fitted values. Right: normal probability plot of the residuals
Figure 9: Empirical cumulative distribution function of the residuals for the log-linear functional
form with the adstock model considered for the SOCIALMEDIA variable (equation 6.9)
Table 12: Correlation matrix of the explanatory variables
indicates that RADIO and OUTDOOR are negatively correlated (−0.708). There is also evidence
of negative correlation between DR and DR.POSTEN (−0.490). Multicollinearity might be the
cause of the non-significance of some of the coefficients. Table 13 presents the VIF values of the
explanatory variables.
The condition number of the matrix (XᵀX)⁻¹, after normalizing the data matrix, is 16.0652, also
indicating a moderate degree of multicollinearity. Although in this case the multicollinearity
detected is not severe, to provide Nepa with a strategy for cases of severe multicollinearity, this
issue will be addressed in section 6.5.
To assess autocorrelation, first the plot of the residuals over time (Figure 10) was examined.
Figure 10: Plot of residuals against time for the log-linear functional form with the adstock model
considered for the SOCIALMEDIA variable (equation 6.9)
The residuals in Figure 10 show shorter and longer runs on either side of the mean value. The
DW statistic is 1.582541, and the p-value associated with this statistic is 0.018, indicating that the
residuals are positively autocorrelated. The estimated autocorrelation parameter is 0.1833852,
meaning that the Durbin-Watson test assumes the errors are driven by the following first-order
autocorrelation process: ut = 0.1833852 · u_{t−1} + εt.
An approach that specifically considers the autocorrelation structure is the Cochrane-Orcutt
method described in [17]. The procedure behind this method is based on the estimation of the
autocorrelation coefficient, followed by a transformation of the variables. With the
autocorrelation parameter estimated as 0.1833852, the variables are transformed in the following way:

y′t = yt − 0.1833852 · y_{t−1}, t = 2, . . . , T
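A compact sketch of this transformation with assumed names (the orcutt package automates the
full iterative procedure via cochrane.orcutt(fit)):

rho <- 0.1833852
y_star <- y[-1] - rho * y[-length(y)]    # y'_t = y_t - rho * y_{t-1}
X_star <- X[-1, ] - rho * X[-nrow(X), ]  # same transformation for the predictors
fit_co <- lm(y_star ~ X_star)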
Table 14: Coefficient estimates for the Cochrane-Orcutt method, log-linear functional form with
the adstock model considered for the SOCIALMEDIA variable (equation 6.9)

##
## Call:
## lm(formula = "SALES_TOT ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO + PRINT + SOCIALMEDIA + Rain..mm. + sal + HOLIDAY",
##    data = regdata)
##
## Residuals :
## Min 1Q Median 3Q Max
## -23290744 -7598418 -1641095 7793738 82440104
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 4.226 e +07 3.988 e +06 10.598 < 2e -16 ***
## TV 2.239 e +01 2.567 e +00 8.721 9.59 e -14 ***
## DR 1.453 e +01 8.034 e +00 1.809 0.07369 .
## DR . POSTEN 1.742 e +01 5.417 e +00 3.216 0.00178 **
## OUTDOOR 3.756 e +01 1.035 e +01 3.631 0.00046 ***
## RADIO -6.633 e +01 2.957 e +01 -2.243 0.02724 *
## PRINT 6.304 e +00 5.876 e +00 1.073 0.28610
## SOCIALMEDIA 1.832 e +02 2.400 e +01 7.636 1.84 e -11 ***
## Rain .. mm . 2.796 e +05 1.461 e +05 1.913 0.05877 .
## sal 6.885 e +06 3.734 e +06 1.844 0.06834 .
## HOLIDAY 2.906 e +06 6.383 e +06 0.455 0.64999
## ---
## Signif . codes : 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 14380000 on 94 degrees of freedom
## Multiple R - squared : 0.8203 , Adjusted R - squared : 0.8012
## F - statistic : 42.92 on 10 and 94 DF , p - value : < 2.2 e -16
An alternative estimation method that deals with the problem of autocorrelation is the maximum
likelihood method. As mentioned in [17], this method is attractive because it can be used when
the structure of the errors is more complicated than an autoregressive process of order one. Table
15 shows the output of the maximum likelihood estimation assuming a first-order autoregressive
process for the residuals, using the gls function in R. The autocorrelation parameter is estimated
to be 0.234079, which is close to the value retrieved by the D-W test.
Table 15: Coefficient estimates for the maximum likelihood estimation, log-linear functional form
with the adstock model considered for the SOCIALMEDIA variable (equation 6.9)
Table 16: Best subset selection. The best model that contains a given number of predictors is
chosen according to RSS
Figure 11 displays the plots of RSS, adjusted R², Cp, and BIC for all of the models at once. It
can be seen that both adjusted R² and Cp choose the model with 9 variables, while BIC chooses
the model with 6 variables. As mentioned in [6], the BIC statistic generally places a heavier penalty
on models with many variables, and hence results in the selection of smaller models than Cp.
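A minimal sketch of best subset selection with the leaps package (assumed setup; regdata as above):

library(leaps)
best <- regsubsets(log(SALES_TOT) ~ ., data = regdata, nvmax = 10)
s <- summary(best)
c(adjr2 = which.max(s$adjr2),  # size favored by adjusted R^2
  cp    = which.min(s$cp),     # size favored by Cp
  bic   = which.min(s$bic))    # size favored by BIC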
Figure 11: RSS, adjusted R2 , Cp , and BIC shown for the best models of each size
Figure 12: Adjusted R2 , Cp and BIC for log-linear functional form with adstock model considered
for SOCIALMEDIA variable (equation 6.9)
Figure 12 displays the selected variables for the best model with a given number of predictors,
ranked according to adjusted R², Cp and BIC.
One can also choose among a set of models of different sizes using the validation set and
cross-validation approaches. For these approaches to yield accurate estimates of the test error,
only the training observations must be used. The observations are split into a training set and a
test set; next, best subset selection is performed only on the training observations. The validation
set error is computed for the best model of each model size, yielding the results in Table 17.
Table 17: Validation set errors for the best model of each model size
The best model is found to be the one that contains ten variables. Next one would have to
perform best subset selection on the full dataset and select the best ten-variable model, but since
it is the full model, the coefficients are simply re-estimated on the full dataset. Since the full
model was selected, the estimates will be the ones from Table 10.
To choose among the models of different sizes using cross-validation, best subset selection is
performed within each of the k training sets. First, each observation is allocated to one of k = 10
folds. Next, each of the folds is used in turn as a test set for the best subset selection procedure,
and the rest of the data is used as the training set. The test errors are stored in a matrix, and the
average is then calculated over the columns of this matrix in order to obtain a vector whose jth
element is the cross-validation error for the j-variable model, j = 1, . . . , 10. Figure 13 shows that
cross-validation selected the three-variable model. Performing cross-validation multiple times on
the dataset provided in the current case study, the cross-validation error always dropped at the
three-variable model, then grew, and finally decreased again towards the ten-variable model, as
shown in Figure 13. If Nepa preferred a more parsimonious model, the three-variable model
would be selected; otherwise it is also possible to choose the full model.
Figure 13: Cross-validation errors for the log-linear functional form with the adstock model
considered for the SOCIALMEDIA variable (equation 6.9)
To obtain reliable estimates for the three-variable model, it is important to perform best subset
selection on the full dataset. The results are shown in Table 18.
Table 18: Parameter Estimates for the three-variable model
(Intercept) 1.793679e+01
TV 1.739107e-07
DR.POSTEN 2.131372e-07
SOCIALMEDIA 1.948903e-06
from λ = 10¹⁰ to λ = 10⁻². This grid essentially covers all scenarios, from the model containing
only the intercept to the least squares fit. It is recommended to standardize the variables before
performing ridge regression, so that all the variables are on the same scale. The glmnet function
does this automatically, returning the coefficient estimates of the variables on the original scale.
In order to estimate the optimal parameter λ, ten-fold cross-validation was performed, using the
cv.glmnet function in R. First the ridge regression model is fitted on the training set. Next,
cross-validation is used to choose the tuning parameter λ that gives the smallest cross-validation
error; for each value of λ the test MSE is calculated. Finally, the ridge regression model is refitted
on the full dataset, using the value of λ chosen by cross-validation.
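A sketch of this workflow, assuming a predictor matrix X and response y (alpha = 0 selects the
ridge penalty in glmnet):

library(glmnet)
grid <- 10^seq(10, -2, length.out = 100)
cv_ridge <- cv.glmnet(X, y, alpha = 0, lambda = grid, nfolds = 10)
coef(cv_ridge, s = "lambda.min")  # coefficients at the CV-chosen lambda
coef(cv_ridge, s = "lambda.1se")  # most regularized model within one SE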
Figure 14: Ridge regression on the full dataset. Upper left: L1 norm against scaled coefficients.
Upper right: log(λ) against scaled coefficients. Lower left: fraction of deviance explained against
scaled coefficients. Lower right: log(λ) against MSE.
Figure 14 shows the plots of the l1-norm, ln(λ) and the fraction of deviance explained against the
coefficient estimates, as well as the plot of ln(λ) against the mean squared error (lower right). At
the top of each graph the number of nonzero coefficients is indicated.
Besides choosing the λ value that gives the smallest cross-validation error, one can also choose
the value of λ which gives the most regularized model such that the error is within one standard
error of the minimum. Table 19 presents the estimated coefficients for lambda.min and
lambda.1se chosen by cross-validation. As expected, none of the coefficients is exactly zero, as
ridge regression does not perform variable selection. Figure 15 shows the plots of predicted
against actual values of sales, for the ridge coefficients from Table 19. The coefficients
corresponding to lambda.min predict sales more accurately, since they are chosen in such a way
that the cross-validation error is minimal.
Table 19: Ridge regression coefficient estimates for lambda.min and lambda.1se chosen by
cross-validation
## lambdaminridge = 0.0160694261981513
## lambda1seridge = 0.124419731067532
Figure 15: Ridge regression fit for lambda.min (top) and lambda.1se (bottom) chosen by
cross-validation
6.6 The Lasso
To perform the lasso, the glmnet function in R is used again, for the same range of λ. Following
the same cross-validation procedure as for ridge regression, the coefficient estimates and test
MSE are obtained for lambda.min and lambda.1se chosen by cross-validation. The lasso test MSE
is close to the ridge test MSE. However, the lasso has a substantial advantage over ridge
regression in that it also performs variable selection. The results are shown in Table 20. For the
largest λ at which the MSE is within one standard error of the minimal MSE, two coefficient
estimates are zero: RADIO and HOLIDAY.
Table 20: Lasso coefficient estimates for lambda.min and lambda.1se chosen by cross-validation
## lambdaminlasso = 0.00294189136833369
## lambda1selasso = 0.0207544620232037
Figure 16 shows the path of each coefficient against the l1-norm, ln(λ) and the fraction of
deviance explained, as well as the plot of ln(λ) against the mean squared error (lower right). At
the top of each graph the number of nonzero coefficients is indicated. Figure 17 shows the plots
of predicted against actual sales, using the lasso coefficients from Table 20.
Figure 16: The lasso on the full dataset. Upper left: L1 norm against scaled coefficients. Upper
right: log(λ) against scaled coefficients. Lower left: fraction of deviance explained against scaled
coefficients. Lower right: log(λ) against MSE.
Figure 17: Lasso fit for lambda.min (top) and lambda.1se (bottom) chosen by cross-validation
6.7 Naive elastic net
As described in section 5.2.6, the naive elastic net penalty is a convex combination of the lasso
and ridge penalties:

(1 − α) Σ_{j=1}^{p} |βj| + α Σ_{j=1}^{p} βj² ≤ s

To choose the optimal parameter α, the function cv.glmnet was called with a pre-computed vector
foldid, and then this same fold vector was used in separate calls to cv.glmnet with different values
of α. Note that in the glmnet package in R the penalty is defined as

(1 − α)/2 Σ_{j=1}^{p} βj² + α Σ_{j=1}^{p} |βj| ≤ s
It can be seen in Figure 18 that ridge does about the best for the given dataset, so it seems
reasonable to choose a value of α closer to ridge. Calling the cv.glmnet function with alpha = 0.1
yields the results shown in Table 21 and Figures 19 and 20.
Figure 18: The standardized coefficients as a function of λ, displayed for several values of α
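A minimal sketch of this procedure: precompute the fold assignments, reuse them across calls to
cv.glmnet with different α values, and refit with the chosen α (names assumed):

library(glmnet)
set.seed(1)
foldid <- sample(rep(1:10, length.out = nrow(X)))
for (a in c(0, 0.1, 0.5, 1)) {
  cv <- cv.glmnet(X, y, alpha = a, foldid = foldid)
  cat("alpha =", a, " min CV error =", min(cv$cvm), "\n")
}
cv_elnet <- cv.glmnet(X, y, alpha = 0.1, foldid = foldid)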
Table 21: Naive elastic net coefficient estimates for lambda.min and lambda.1se chosen by
cross-validation
## lambdaminelnet = 0.0202773160550586
## lambda1seelnet = 0.108213937001959
Figure 19: Naive elastic net on the full dataset. Upper left: L1 norm against scaled coefficients.
Upper right: log(λ) against scaled coefficients. Lower left: fraction of deviance explained against
scaled coefficients. Lower right: log(λ) against MSE.
Figure 20: Naive elastic net fit for lambda.min (top) and lambda.1se (bottom) chosen by
cross-validation
It is clear that even the naive elastic net outperforms ridge and the lasso.
Figure 21: Left: elastic net estimates (λ2 = 1) as a function of s. Right: solution path (λ2 = 1) as
a function of s
To include the seasonal effect, the model was extended by adding a trend variable and 51 dummy
variables for the seasons, as described in section 2.3. Table 23 shows the coefficients for the
extended model with λ2 = 0.1 and s = 0.4. It can be seen that the method selects only some of the
seasonal variables, allowing a partial seasonal adjustment.
Table 22: Elastic net coecient estimates for λ2 = 1 and s = 0.47 chosen by cross-validation
## $s
## [1] 0.47
##
## $fraction
## 0
## 0.47
##
## $mode
## [1] " fraction "
##
## $coefficients
## TV DR DR . POSTEN OUTDOOR RADIO PRINT
## 1.320210 e -07 1.511765 e -07 1.506639 e -07 5.525586 e -08 0.000000 e +00 0.000000 e +00
## SOCIALMEDIA Rain .. mm . sal HOLIDAY
## 1.091444 e -06 0.000000 e +00 0.000000 e +00 0.000000 e +00
Table 23: Elastic net coefficient estimates with season and trend variables
## $s
## [1] 0.4
##
## $fraction
##   0
## 0.4
##
## $mode
## [1] "fraction"
##
## $coefficients
##            S1            S2            S3            S4            S5
##  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##            S6            S7            S8            S9           S10
##  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##           S11           S12           S13           S14           S15
##  0.000000e+00  0.000000e+00  0.000000e+00 -3.310320e-02  0.000000e+00
##           S16           S17           S18           S19           S20
##  0.000000e+00  0.000000e+00 -2.827022e-02  0.000000e+00 -2.507911e-02
##           S21           S22           S23           S24           S25
## -6.727018e-02  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##           S26           S27           S28           S29           S30
##  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##           S31           S32           S33           S34           S35
##  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  4.078321e-02
##           S36           S37           S38           S39           S40
##  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##           S41           S42           S43           S44           S45
##  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##           S46           S47           S48           S49           S50
##  0.000000e+00  0.000000e+00  2.042501e-02  0.000000e+00  1.358926e-01
##           S51         trend            TV            DR     DR.POSTEN
##  1.703252e-01  0.000000e+00  1.499851e-07  1.342277e-07  1.178116e-07
##       OUTDOOR         RADIO         PRINT   SOCIALMEDIA     Rain..mm.
##  7.377897e-08  0.000000e+00  1.595277e-08  1.445939e-06  1.432973e-04
##           sal       HOLIDAY
##  2.989432e-02  0.000000e+00
7 Conclusions & Recommendations
This thesis illustrates an application of modern statistical learning approaches on a set of data
provided by Nepa. The goal of the thesis is to construct a model building strategy suitable for a
high level of complexity of the data, with the ambition to tackle several difficulties encountered
when statistical analysis is applied to marketing economics. A marketing mix model must address
all elements of the problem being studied. In the specification step, one such element is the
choice of the appropriate functional form. To find a suitable specification describing the
relationship between the dependent and independent variables, the RESET test and the Box-Cox
transformation of the response variable were used. The plots of the residuals against each predictor
variable, as well as the tests above, suggest that the log-linear specification is appropriate. Several
subset selection methods were employed on the log-linear model. The results of the validation set
and cross-validation approaches justify the choice of the full model. To adapt the model to dynamic
marketing behavior, the optimal lag weight parameters can be found with the Levenberg-Marquardt
algorithm, using the nlsLM function in R, as sketched below.
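As an illustration, a minimal sketch of this decay optimization with the minpack.lm package: the data frame df, its log_sales and TV columns, the single-channel specification and the starting values are all assumptions for the example, not the exact model used in the thesis.

library(minpack.lm)

# Geometric adstock: a_t = x_t + decay * a_(t-1)
adstock <- function(x, decay) {
  as.numeric(stats::filter(x, decay, method = "recursive"))
}

# Fit the decay rate jointly with the regression coefficients
fit <- nlsLM(log_sales ~ b0 + b1 * adstock(TV, decay),
             data  = df,
             start = list(b0 = 0, b1 = 0.001, decay = 0.5),
             lower = c(-Inf, -Inf, 0),
             upper = c(Inf, Inf, 1))    # constrain the decay rate to [0, 1]
summary(fit)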
Since the purpose is both explanatory and predictive analysis, the assumptions made in section 4
must hold in order for statistical inferences based on the obtained point estimates to be valid. To
sum up, the results show that we cannot assume that the error terms are uncorrelated. The proposed
solution is to employ the Cochrane-Orcutt method, an approach that explicitly accounts for the
autocorrelation structure, or to use alternative estimation methods, such as the method of maximum
likelihood.
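A minimal sketch of this correction with the orcutt package: the data frame df and the lm() call are assumptions for the example, reusing the variable names from the tables above.

library(orcutt)

# OLS fit of the log-linear model (df is assumed to hold the weekly data)
fit_ols <- lm(log_sales ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO + PRINT +
                SOCIALMEDIA + Rain..mm. + sal + HOLIDAY, data = df)

# Iteratively estimate rho and refit on the transformed data until the
# residual autocorrelation is removed
fit_co <- cochrane.orcutt(fit_ols)
summary(fit_co)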
The testing of the assumptions also shows that the data exhibits a mild degree of multicollinearity.
A comparison of several estimation methods is provided, so that Nepa can use this thesis as a
guideline for future Marketing Mix Modelling projects involving data with severe multicollinearity.
The regularization methods were implemented using the glmnet and elasticnet packages in R; note
that the penalty in the glmnet package is defined differently from the penalty in the elasticnet
package. Table 24 shows the performance of ridge regression, the lasso, the naive elastic net and
the elastic net applied to the same training and validation sets. Model fitting and tuning parameter
selection by tenfold cross-validation (CV) were carried out on the training data, and the performance
of the methods was then compared by computing their prediction mean squared error (MSE) on
the test data. Although the differences between the methods are small, the lowest test MSE is
achieved by the elastic net, which also selects the smallest number of variables.
Table 24: Test MSE and selected variables for each regularization method

Method              Parameters                                           Test MSE     Variables selected
Ridge regression    λ1 = 0, λ2 = 0.01606943 ∗ 2                          0.0188158    All
Ridge regression    λ1 = 0, λ2 = 0.1244197 ∗ 2                           0.01313806   All
Lasso               λ1 = 0.002941891, λ2 = 0                             0.01927511   All
Lasso               λ1 = 0.02075446, λ2 = 0                              0.0190074    (1,2,3,4,6,7,8,9)
Naive elastic net   λ1 = 0.02027732/0.1, λ2 = 0.02027732 ∗ 2/(1 − 0.1)   0.01898267   All
Naive elastic net   λ1 = 0.1082139/0.1, λ2 = 0.1082139 ∗ 2/(1 − 0.1)     0.01207035   (1,2,3,4,6,7,8,9,10)
Elastic net         λ2 = 1, s = 0.47                                     0.01188888   (1,2,3,4,7)
Figure 23: Mean squared test errors illustrated for the different methods. It can be seen that OLS
performs worst in terms of prediction accuracy
The mean squared test errors of the models above are also illustrated in comparison with OLS
in Figure 23. The results show that while the elastic net produces a model with fewer variables, its
prediction accuracy is higher than that of the other estimation methods.
References
[1] David M. Blei. Regularized Regression. Columbia University. 2015.
[2] Charlotte H. Mason and William D. Perreault Jr. Collinearity, Power, and Interpretation of
Multiple Regression Analysis. Journal of Marketing Research, Vol. 28, No. 3 (Aug. 1991), pp.
268-280. 1991.
[3] Csilla Horvath, Marcel Kornelis and Peter S.H. Leeflang. What Marketing Scholars Should Know
about Time Series Analysis: Time Series Applications in Marketing. 2002.
[4] Russell Davidson and James G. MacKinnon. Estimation and Inference in Econometrics. 1993.
[5] J. Durbin. Testing for Serial Correlation in Least-Squares Regression When Some of the
Regressors are Lagged Dependent Variables. Econometrica, Vol. 38, No. 3 (May 1970), pp.
410-421. 1970.
[6] Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to
Statistical Learning: with Applications in R. Springer Texts in Statistics. 2013.
[7] Henri P. Gavin. The Levenberg-Marquardt Method for Nonlinear Least Squares Curve-Fitting
Problems. 2016.
[8] Alan J. Izenman. Modern Multivariate Statistical Techniques. Springer Texts in Statistics.
2008.
[9] J. Durbin and G. S. Watson. Testing for Serial Correlation in Least Squares Regression: I.
Biometrika, Vol. 37, No. 3/4 (Dec. 1950), pp. 409-428. 1950.
[10] J. Durbin and G. S. Watson. Testing for Serial Correlation in Least Squares Regression: II.
Biometrika, Vol. 38, No. 1/2 (Jun. 1951), pp. 159-177. 1951.
[11] Jack Johnston and John DiNardo. Econometric Methods, Fourth Edition. 1997.
[12] Peter Kennedy. A Guide to Econometrics, 6th Edition. 2008.
[13] Harald Lang. Elements of Regression Analysis. 2016.
[14] Peter S.H. Leeflang, Dick R. Wittink, Michel Wedel and Philippe A. Naert. Building Models
for Marketing Decisions. Springer International Series in Quantitative Marketing. 2000.
[15] Dominique M. Hanssens, Leonard J. Parsons and Randall L. Schultz. Market Response Models:
Econometric and Time Series Analysis. International Series in Quantitative Marketing, Volume 12.
2001.
[16] Peter S.H. Leeflang, Jaap E. Wieringa, Tammo H.A. Bijmolt and Koen H. Pauwels. Modeling
Markets: Analyzing Marketing Phenomena and Improving Marketing Decision Making. International
Series in Quantitative Marketing. 2015.
[17] Douglas C. Montgomery. Introduction to Linear Regression Analysis, Fifth Edition. 2013.
[18] Philip Hans Franses and Rutger van Oest. On the Econometrics of the Koyck Model. 2004.
[19] J.B. Ramsey. Classical Model Selection through Specification Tests. In: Frontiers in
Econometrics. Academic Press, New York. 1974.
[20] J.B. Ramsey. Tests for Specification Errors in Classical Linear Least Squares Regression
Analysis. 1969.
[21] T. S. Breusch and A. R. Pagan. A Simple Test for Heteroscedasticity and Random Coefficient
Variation. Econometrica, Vol. 47, No. 5 (Sep. 1979), pp. 1287-1294. 1979.
[22] Halbert White. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct
Test for Heteroskedasticity. Econometrica, Vol. 48, No. 4 (May 1980), pp. 817-838. 1980.
[23] Wayne Winston. Marketing Analytics: Data-Driven Techniques with Microsoft Excel, 1st
Edition. 2014.
[24] Jeffrey M. Wooldridge. Introductory Econometrics: A Modern Approach, 5th Edition. 2012.
[25] Elena Yusupova. Additive versus Multiplicative Marketing Mix Model. 2013. URL:
http://analytics.sd-group.com.au/blog/additive-versus-multiplicative-marketing-mix-model/.
[26] Hui Zou and Trevor Hastie. Regularization and Variable Selection via the Elastic Net. URL:
http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf.
[27] Hui Zou and Trevor Hastie. Regularization and Variable Selection via the Elastic Net. Journal
of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 67, No. 2 (2005), pp.
301-320. 2005.
TRITA-MAT-E 2017:32
ISRN-KTH/MAT/E--17/32--SE

www.kth.se