
DEGREE PROJECT IN MATHEMATICS,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Marketing Mix Modelling from the multiple regression perspective

ECATERINA MHITAREAN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Degree Projects in Mathematical Statistics (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2017
Supervisor at Nepa AB: Dr. Daniel Malmquist
Supervisor at KTH: Tatjana Pavlenko
Examiner at KTH: Tatjana Pavlenko
TRITA-MAT-E 2017:32
ISRN-KTH/MAT/E--17/32--SE

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci
Abstract

The optimal allocation of the marketing budget has become a difficult issue that every
company is facing. With the appearance of new marketing techniques, such as online advertising
and social media advertising, the complexity of data has increased, making this problem even
more challenging. Statistical tools for explanatory and predictive modelling have commonly
been used to tackle the problem of budget allocation. Marketing Mix Modelling involves the
use of a range of statistical methods which are suitable for modelling the variable of interest
(in this thesis it is sales) in terms of advertising strategies and external variables, with the aim
of constructing an optimal combination of marketing strategies that would maximize the profit.

The purpose of this thesis is to investigate a number of regression-based model building
strategies, with the focus on advanced regularization methods of linear regression, together with
an analysis of the advantages and disadvantages of each method. Several crucial problems that
modern marketing mix modelling is facing are discussed in the thesis. These include the choice
of the most appropriate functional form that describes the relationship between the set of
explanatory variables and the response, modelling the dynamical structure of the marketing
environment by choosing the optimal decay for each marketing advertising strategy, and
evaluating the seasonality effects and collinearity of marketing instruments.

To efficiently tackle two common challenges when dealing with marketing data, namely
multicollinearity and selection of informative variables, regularization methods are exploited.
In particular, the performance accuracy of ridge regression, the lasso, the naive elastic net
and the elastic net is compared, using a cross-validation approach for the selection of tuning
parameters. Specific practical recommendations for modelling and analyzing Nepa marketing
data are provided.
Sammanfattning

Allocating the marketing budget optimally is a difficult task that every company faces.
With the emergence of new marketing techniques, such as online and social media advertising,
the complexity of data has increased, making this problem even more challenging. Statistical
tools for explanatory and predictive modelling have commonly been used to handle the problem
of budget allocation. Marketing Mix Modelling is a term that covers the class of statistical
methods suitable for modelling the variable of interest (in this thesis, sales) in terms of
advertising strategies and external variables, with the goal of maximizing profit by constructing
an optimal combination of marketing strategies.

The purpose of this thesis is to construct a number of model building strategies, including
advanced regularization methods for linear regression, with an analysis of the advantages and
disadvantages of each method. Several major problems that modern marketing mix modelling
faces are considered, for example: choosing a suitable functional form that best describes the
relationship between the response and the explanatory variables, handling the dynamic
marketing environment by choosing the optimal decay for each marketing strategy, and
evaluating the seasonality effects and the collinearity of the marketing instruments.

To overcome the two most common problems in marketing econometrics, which are
multicollinearity and variable selection, regularization methods have been used. In particular,
the performance accuracy of ridge regression, the lasso, the naive elastic net and the elastic net
has been compared, in order to give specific recommendations for the Nepa data. The parameters
of the regularized regression methods were chosen by cross-validation. The model results show
a high level of prediction accuracy. The differences between the mentioned methods are not
significant for the given dataset.
Acknowledgements

I would like to thank my supervisor Tatjana Pavlenko, Associate Professor at KTH, for her support
and guidance throughout this master's thesis. I would also like to thank my supervisor at Nepa AB,
Dr. Daniel Malmquist, for his support and feedback throughout the process.
Contents

1 Introduction
  1.1 Background
  1.2 Nepa
  1.3 Purpose
2 Theoretical Background
  2.1 Methods of selecting functional forms of the model
      2.1.1 Linear and Multiplicative Models
      2.1.2 The Box-Cox transformation
  2.2 Marketing Dynamics
  2.3 Modelling trend and seasonality
3 Estimation
  3.1 Ordinary Least Squares
  3.2 Non-linear Least Squares
      3.2.1 The Gradient Descent Method
      3.2.2 The Gauss-Newton Method
      3.2.3 The Levenberg-Marquardt Method
4 Validation and Testing
  4.1 Methods of Model Assessment
  4.2 Specification Error Analysis
      4.2.1 Nonzero expectation of the residuals
      4.2.2 Heteroscedasticity
      4.2.3 Correlated Disturbances
      4.2.4 Nonnormal Errors
      4.2.5 Multicollinearity
5 Linear Model Selection and Regularization
  5.1 Subset selection
  5.2 Shrinkage Methods
      5.2.1 The Bias-Variance Trade-Off
      5.2.2 Ridge Regression
      5.2.3 The Lasso
      5.2.4 Comparing Ridge Regression and the Lasso
      5.2.5 Selecting the Tuning Parameter
      5.2.6 Naive Elastic Net
      5.2.7 Elastic Net
6 Results
  6.1 Choosing among functional forms for Marketing Mix Modelling
  6.2 Marketing dynamics
  6.3 Re-estimation and testing the OLS assumptions
  6.4 Variable selection
  6.5 Ridge regression
  6.6 The Lasso
  6.7 Naive elastic net
  6.8 Elastic net
7 Conclusions & Recommendations
1 Introduction
This section provides a short introduction to the concept of Marketing Mix Modelling, as well as a
brief presentation of the company and the purpose of the study.

1.1 Background
Marketing Mix Modelling is a term that is used to cover statistical methods which are suitable for
explanatory and predictive statistical modelling of some variable of interest, for example a company's
sales or market shares. This thesis is focused on modelling sales as a function of marketing instruments
and environmental variables. In this case, the goal of Marketing Mix Modelling is to explain and
predict sales from marketing instruments, while controlling for other factors that influence sales. Its
main task is to decompose sales into base volume (which occurs due to such factors as seasonality
and brand awareness) and incremental volume (which captures the weekly variation in sales driven
by marketing activities). One of the most important Marketing Mix instruments is advertising,
thus it is crucial to understand the impact of advertising expenditures on sales.

Model building in marketing started in the middle of the twentieth century. Many studies have
been conducted since then, which have helped managers understand the marketing process. Appropriately
constructed market response models have helped managers determine the instruments that
influence sales and take actions that affect them. Applications show that model benefits include
cost savings resulting from improvements in resource allocation. Many studies discuss and describe
the model development process, provide a structure for model building and serve as a starting point
for this thesis, including: Leeflang (2015), Leeflang (2000), Hanssens (2001), P.M. Cain (2010).
This thesis attempts to develop a general model building strategy suitable for a high level of
complexity of the data, and to establish the most appropriate functional relationships and estimation
methods for Marketing Mix Modelling projects. This strategy will be used by Nepa for systematic
analysis of the data collected. All the steps of this model building strategy are implemented
in a user-friendly way and will be applied by Nepa when designing marketing plans for its clients.

As an illustration, the thesis analyses the relationship between marketing expenditures and sales
on a dataset provided by Nepa. The data comes from a client of Nepa which is one of the largest
electronics retailers in Sweden. This dataset contains model-specific weekly sales and marketing
activity data, as well as environmental data, for two years. To overcome some of the problems
that are commonly encountered when working with marketing data, advanced estimation methods
such as ridge regression, the lasso and the elastic net were employed to quantify the sales-marketing
relationship and identify short- and long-run effects of marketing on performance. The thesis
describes each method and presents the output for each model introduced. Marketing dynamics
were also incorporated into the sales model structure, by optimizing the decays for each media
variable.

1.2 Nepa
Nepa is an innovative research company founded in 2006 with the ambition to improve the efficiency
of the research industry by moving from analog to digital methodologies. It is a company that went
beyond phone interviews and mail surveys and pioneered a fully automated, online tracking
solution. Today Nepa has more than 350 clients from all over the world and offices in Stockholm,
Helsinki, Oslo, Copenhagen, London and Mumbai.

1.3 Purpose
The main purpose of the thesis is to elaborate a methodology that Nepa can use in Marketing Mix
Modelling projects. A method is needed to find the optimal parameters to create a model with as
good predictive ability and as little multicollinearity as possible, with the following main areas of
interest:

Parameter estimation
What type of decay should each media variable have? That is, how much effect does a certain
amount invested in a media variable have one week later? This is known as the carryover
effect, and it appears when some of the marketing strategies have an impact not only in the
current period, but also in future periods.

Variable selection
It is important to efficiently tackle the problems of selecting the informative variables and
evaluating the seasonality effect. How should season and trend be handled to avoid over-
or underestimation of the effects of other variables? Estimating the impact of marketing
instruments on sales becomes difficult when advertising activities coincide with seasonal peaks.

Regression modelling
It is often the case that several marketing investments take place at the same time. The
resulting collinearity makes the parameters estimated with ordinary least squares unreliable.
The question then arises as to which estimation methods should be used to attain
predictability and stability of the models (coping with multicollinearity, variable selection,
etc.).

2 Theoretical Background
This section presents the mathematical background of the common challenges that marketing mix
modellers are facing. It begins with the challenge of choosing the appropriate functional form,
continues with the dynamic structure of marketing variables, and finally describes an approach to
account for the effects of seasonality.

2.1 Methods of selecting functional forms of the model


An important part of the model building process is deciding upon the functional form that would
reflect the most appropriate relationship between the variables. The most commonly applied
functional form in Marketing Mix Modelling is the linear model. However, it is often the case that
nonlinear functional forms are used, since they take into account such properties as diminishing or
increasing returns to scale and threshold effects. In this section the vector of model parameters β
as well as the vector of the disturbance term ε are named in the same way for different specifications,
even though the parameters differ, depending on the functional form.

2.1.1 Linear and Multiplicative Models


Linear models assume constant returns to scale and have the following structure:

y_t = β_0 + β_1 x_{1t} + β_2 x_{2t} + · · · + β_K x_{Kt} + ε_t,   (2.1)

where, following the notation in [16]:
y_t = value of the dependent variable in period t (t = 1, ..., T, where T is the number of observations),
x_{kt} = value of independent variable k in period t (k = 1, ..., K, where K is the number of covariates),
β_0, β_1, ..., β_K = model parameters,
ε_t = the (unobserved) value of the disturbance term.
A linear model is often tried first, since the estimation of the coefficients and the interpretation
of the results are easy. It shows good predictive performance and a reasonable approximation to
an underlying nonlinear function, but only over a limited range.

One drawback of the linearity assumption is that it implies constant returns to scale with respect
to each of the covariates, meaning that an increase of one unit in x_{kt} leads to an increase of β_k
units in y_t. However, the assumption of constant returns to scale is unrealistic in most real-life
marketing applications. Usually a sales response curve exhibits non-constant behavior. One type
of non-constant behavior is diminishing returns to scale, which happens when the response variable
always increases with increases in the covariates, but each additional unit of x_{kt} brings less in y_t
than the previous unit did ([15]). One of the functional forms that reflects this phenomenon is the
multiplicative power model (again, following the notation in [16]):

y_t = β_0 x_{1t}^{β_1} ε_t,   x_{1t} ≥ 0, 0 < β_1 < 1   (2.2)

Model 2.2 can be linearized by taking logarithms of both sides:

ln y_t = ln β_0 + β_1 ln x_{1t} + ln ε_t,   x_{1t} ≥ 0, 0 < β_1 < 1   (2.3)

Equation 2.3 is linear in the parameters β_0^*, β_1, where β_0^* = ln β_0. This model is known as the
double-logarithmic or the log-log model. The version of the multiplicative model that retains the
highest-order interaction among the variables for K marketing instruments is:

y_t = β_0 x_{1t}^{β_1} x_{2t}^{β_2} · · · x_{Kt}^{β_K} ε_t   (2.4)

or, more compactly:

y_t = β_0 (∏_{k=1}^{K} x_{kt}^{β_k}) ε_t   (2.5)

In this setting, if some of the variables are "dummies", the corresponding variables are used as
exponents. Besides reflecting the non-constant behavior of the sales response function, another
advantage of the multiplicative model over the linear model is that it allows for a specific form of
interaction between the various instruments. Taking the first-order partial derivative of y_t with
respect to any of the independent variables x_{kt}, the impact of a change in x_{kt} on y_t is a
function of y_t itself, which means that it depends not only on the value of x_{kt} but on all the
other variables as well:

∂y_t/∂x_{kt} = β_0 β_k x_{1t}^{β_1} x_{2t}^{β_2} · · · x_{kt}^{β_k − 1} · · · x_{Kt}^{β_K}   (2.6)
When the sales response function exhibits increasing returns to scale, the exponential model can be
used:

y_t = β_0 e^{β_1 x_{1t}} ε_t   (2.7)

After taking logarithms of both sides it becomes the semi-logarithmic, also known as the log-linear,
model:

ln y_t = ln β_0 + β_1 x_{1t} + ln ε_t   (2.8)
When the nonlinear model is log-log or log-linear, an adjustment to the forecasts of y_t is required,
so that they remain unbiased ([15]). Considering the typical multiplicative specification 2.4, where
ln ε_t is N(0, σ²), it can be shown that:

E[y_t] = β_0 x_{1t}^{β_1} x_{2t}^{β_2} · · · x_{Kt}^{β_K} e^{σ²/2}   (2.9)

The forecasts should be calculated from the expression:

ŷ_t = β̂_0 x_{1t}^{β̂_1} x_{2t}^{β̂_2} · · · x_{Kt}^{β̂_K} e^{σ̂²/2}   (2.10)

where hats denote the ordinary least squares (OLS) estimates. A direct re-transformation would
under-estimate the forecasts.

2.1.2 The Box-Cox transformation


One way to compare the linear and the multiplicative specifications is the likelihood ratio test,
using the Box-Cox transformation. It is based on the following transformation of the dependent
variable:

(y_t^λ − 1)/λ = β_0 + β_1 x_{1t} + . . . + β_K x_{Kt} + ε_t.   (2.11)

To choose the appropriate functional form, the likelihood ratio test of the model above can be used.
The idea behind this method is to compute the likelihood for different values of λ and choose the
value that maximizes it. The specification is then chosen according to the value of λ reported. If
λ = 1 then the specification is essentially linear. When λ approaches 0, equation 2.11 approaches
the semi-logarithmic form, since:

lim_{λ→0} (y_t^λ − 1)/λ = ln y_t   (2.12)
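As an illustration of this procedure, the following sketch profiles the Box-Cox log-likelihood over a
grid of λ values with the boxcox function from the R package MASS; the model formula and the data
frame regdata are illustrative assumptions, not the exact code used in the thesis.

library(MASS)

# Hypothetical linear model of sales on media spend (variable names illustrative)
fit <- lm(SALES_TOT ~ TV + RADIO + PRINT, data = regdata)

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.05), plotit = FALSE)

# Lambda maximizing the log-likelihood; ~1 suggests linear, ~0 suggests a log form
lambda_hat <- bc$x[which.max(bc$y)]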

2.2 Marketing Dynamics


Because of the evolving character of markets, the assumption that advertising expenditures have
only a current and immediate impact on sales rarely happens to be realistic. Most often it happens
that part of the media effects remains noticeable for several future periods. Thus, sales in some
period t are affected by advertising expenditures in the same period t, but also by expenditures in
previous periods t − 1, t − 2, . . . . The influence of current marketing expenditures on sales in future
periods is called the carryover effect. When the effect of a marketing variable is distributed over
several time periods, sales in any period are a function of the current and previous marketing
expenditures. In the case of just one explanatory variable the equation for sales is:

y_t = β_0 + Σ_{l=0}^{∞} β_{l+1} x_{t−l} + ε_t,   (2.13)

where x_{t−l}, l = 0, 1, . . ., are the lagged terms of the independent variable. Model 2.13 is called
the Infinite Distributed Lag (IDL) Model. Assuming that all coefficients of the lagged terms of a
covariate have the same sign, equation 2.13 can be rewritten as:

y_t = β_0 + β Σ_{l=0}^{∞} ω_l x_{t−l} + ε_t.   (2.14)

Equation 2.14 is called the Geometric Lag Model, where

ω_l ≥ 0 and Σ_{l=0}^{∞} ω_l = 1.   (2.15)

The omegas can be regarded as probabilities of a discrete-time distribution. As mentioned in [15],
the Geometric Distributed Lag (GL) Model is the most commonly used distributed-lag model in
marketing. The maximum impact of marketing expenditures on sales is registered instantaneously;
then the influence declines geometrically to zero. The impact of any past expenditure in subsequent
periods will be a constant fraction of its immediate impact. This constant fraction is called the
retention rate. If the retention rate is λ, the geometric distribution gives:

ω_l = (1 − λ)λ^l,   l = 0, 1, 2, . . .   (2.16)

where 0 < λ < 1. The specification of the sales response function becomes:

y_t = β_0 + β(1 − λ) Σ_{l=0}^{∞} λ^l x_{t−l} + ε_t,   (2.17)

or

y_t = β_0 + β_1 x_t + β_1 λ x_{t−1} + β_1 λ² x_{t−2} + . . . + β_1 λ^l x_{t−l} + . . . + ε_t,   (2.18)

where β_1 = β(1 − λ). The direct short-term effect of marketing effort is β_1 = β(1 − λ), while the
retention rate λ measures how much of the advertising effect in one period is retained in the next.
The implied long-term effect is β = β_1/(1 − λ). This model is also approximately equivalent to the
Simple Decay-Effect Model (Broadbent (1979)):

y_t = β_0 + β_1 a_t + ε_t,   (2.19)

where a_t = f(x_t) is the adstock function at time t, x_t is the value of the advertising variable at
time t and λ is the decay or lag weight parameter:

a_t = f(x_t) = x_t + λ a_{t−1},   t = 2, . . . , n   (2.20)

Recursively substituting and expanding, the equation for the adstock function becomes:

a_t = x_t + λ x_{t−1} + λ² x_{t−2} + . . . + λ^n x_{t−n},   (2.21)
Since 0 < λ < 1, λ^n → 0 as n → ∞. Moving on to the case with K explanatory variables
x_1, . . . , x_K, each with a different retention rate λ_1, . . . , λ_K, the model becomes:

y_t = β_0 + β_1 a_{1t} + β_2 a_{2t} + · · · + β_K a_{Kt} + ε_t   (2.22)

where

a_{it} = f(x_{it}) = x_{it} + λ_i a_{it−1},   i = 1, . . . , K   (2.23)

To estimate the coefficients of the marketing variables, as well as the retention rates, non-linear
least squares can be used. The algorithm is described in more detail in section 3.2. First the adstock
at time t is defined for each marketing instrument, as in equation 2.23. The estimated sales are then:

ŷ_t = β̂_0 + β̂_1 a_{1t} + β̂_2 a_{2t} + . . . + β̂_K a_{Kt}   (2.24)

Finally, the optimization problem is:

minimize Σ_{t=1}^{T} (y_t − ŷ_t)²
subject to 0 ≤ λ_i < 1, i = 1, . . . , K.

For the semi-logarithmic and double-logarithmic models the equation for the predicted sales becomes
2.25 and 2.26, respectively:

ln ŷ_t = β̂_0^* + β̂_1 a_{1t} + β̂_2 a_{2t} + . . . + β̂_K a_{Kt}   (2.25)

and

ln ŷ_t = β̂_0^* + β̂_1 ln a_{1t} + β̂_2 ln a_{2t} + . . . + β̂_K ln a_{Kt}   (2.26)

and the optimization problem is:

minimize Σ_{t=1}^{T} (ln y_t − ln ŷ_t)²
subject to 0 ≤ λ_i < 1, i = 1, . . . , K.
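A minimal R sketch of this setup is given below: the adstock recursion 2.20 and a one-dimensional
grid search that profiles the decay of a single media variable by least squares, as a simple alternative
to the full constrained problem above. The vectors y and x are assumed to hold weekly sales and
spend.

# Adstock transformation: a_t = x_t + lambda * a_{t-1}  (equation 2.20)
adstock <- function(x, lambda) {
  a <- numeric(length(x))
  a[1] <- x[1]
  for (t in 2:length(x)) a[t] <- x[t] + lambda * a[t - 1]
  a
}

# Grid search for the decay of one media variable, minimizing the RSS
fit_decay <- function(y, x, grid = seq(0, 0.95, by = 0.05)) {
  rss <- sapply(grid, function(l) sum(resid(lm(y ~ adstock(x, l)))^2))
  grid[which.min(rss)]
}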

2.3 Modelling trend and seasonality
In this section the "classical decomposition" is considered:

y_t = m_t + δ_{it} + ε_t,   (2.27)

where:
m_t is a slowly changing function (the "trend component");
δ_{it} is a function with known period d (the "seasonal component");
ε_t is a stationary time series.

In trying to explain sales behavior, a linear trend variable (m_t = 1, 2, · · · , T for t = 1, 2, · · · , T)
could be introduced into the sales response function to capture the time-dependent nature of sales
growth.
If a variable follows a systematic pattern within the year, it is said to exhibit seasonality. To
deal with seasonality, s dummy variables could be introduced in the model to express s seasons in
the following way:

δ_{it} = 1 if t is the i'th period, and 0 otherwise,   i = 1, · · · , s,  t = 1, · · · , T   (2.28)
These "dummy" variables for the seasons and the "time" variable m_t for the trend could be
incorporated into the linear model 2.1:

y_t = β_0 + m_t + δ_{1t} + . . . + δ_{st} + β_1 x_{1t} + β_2 x_{2t} + . . . + β_K x_{Kt} + ε_t,   (2.29)

and also into the multiplicative models, for example into the log-log model 2.4:

y_t = β_0 e^{(m_t + δ_{1t} + ... + δ_{st})} x_{1t}^{β_1} x_{2t}^{β_2} . . . x_{Kt}^{β_K} ε_t   (2.30)

Equation 2.30 is non-linear. For the purposes of estimation, the model is converted into an additive
form by taking natural logarithms, thus:

ln y_t = ln β_0 + m_t + δ_{1t} + . . . + δ_{st} + β_1 ln x_{1t} + . . . + β_K ln x_{Kt} + ln ε_t   (2.31)

Taking into account the dynamic structure, equation 2.31 becomes:

ln y_t = ln β_0 + m_t + δ_{1t} + . . . + δ_{st} + β_1 ln a_{1t} + . . . + β_K ln a_{Kt} + ln ε_t   (2.32)

where a_{1t}, . . . , a_{Kt} are the adstock variables defined in section 2.2. Equation 2.32 is no longer
linear, and was estimated with non-linear least squares, using the Levenberg-Marquardt algorithm
described in section 3.2.
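As a sketch of the linear specification 2.29, the fragment below builds a linear trend and, for
illustration, four quarterly seasonal dummies for a weekly series; the choice s = 4 and the variable
names y and x1 are assumptions.

trend  <- seq_along(y)                           # linear trend m_t = 1, 2, ..., T
season <- factor(((trend - 1) %/% 13) %% 4 + 1)  # four 13-week "seasons"

# Linear model 2.29 with trend, seasonal dummies and one media variable
fit <- lm(y ~ trend + season + x1)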

3 Estimation
Once the appropriate functional form is decided upon, the parameters of the marketing model must
be estimated. A description of the estimation methods for the model parameters is provided in this
section.

3.1 Ordinary Least Squares


Let us consider the linear model 2.1:

y_t = β_0 + β_1 x_{1t} + β_2 x_{2t} + · · · + β_K x_{Kt} + ε_t,   t = 1, . . . , T   (3.1)

where the notation is defined in section 2.1.1. Equation 3.1 can be rewritten in matrix form:

[ y_1 ]   [ 1  x_{11}  x_{21}  ···  x_{K1} ] [ β_0 ]   [ ε_1 ]
[ y_2 ] = [ 1  x_{12}  x_{22}  ···  x_{K2} ] [ β_1 ] + [ ε_2 ]   (3.2)
[  ⋮  ]   [ ⋮    ⋮       ⋮     ⋱     ⋮   ] [  ⋮  ]   [  ⋮  ]
[ y_T ]   [ 1  x_{1T}  x_{2T}  ···  x_{KT} ] [ β_K ]   [ ε_T ]

or:

y = Xβ + ε.   (3.3)

The OLS estimates of the parameters β̂ = (β̂_0, β̂_1, · · · , β̂_K)^⊤ in 3.1 are the values which minimize
the Residual Sum of Squares (RSS):

RSS = Σ_{t=1}^{T} (y_t − ŷ_t)² = Σ_{t=1}^{T} (y_t − β̂_0 − Σ_{k=1}^{K} β̂_k x_{kt})²   (3.4)

Following the notation in [8], the total sum of squares is defined as SS_tot = Σ_{t=1}^{T} (y_t − ȳ)².
With RSS and SS_tot defined above, the following relationship holds: SS_tot = SS_reg + RSS, where
SS_reg is the regression sum of squares: SS_reg = Σ_{t=1}^{T} (ŷ_t − ȳ)². It is easy to show that the
coefficient estimates β̂ obtained by minimizing the quantity above are:

β̂ = (X^⊤X)^{−1} X^⊤y.   (3.5)

Assuming that Cov(ε) = σ²I, the covariance matrix of β̂ is Cov(β̂) = σ²(X^⊤X)^{−1}, estimated as:

Ĉov(β̂) = RSS/(T − K − 1) · (X^⊤X)^{−1}

An F_α(1, T − K − 1)-statistic for the hypothesis β_k = 0 is calculated as:

F = (β̂_k / SE(β̂_k))²,

where the standard error SE(β̂_k) for any k = 1, . . . , K is the square root of the corresponding
diagonal element of Ĉov(β̂).
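A direct transcription of equations 3.4-3.5 into R, under the assumption that x1, x2 and y are
numeric vectors, might look as follows.

# Design matrix with an intercept column (x1, x2 are illustrative covariates)
X <- cbind(1, x1, x2)

# OLS estimates: beta_hat = (X'X)^{-1} X'y  (equation 3.5)
beta_hat <- solve(crossprod(X), crossprod(X, y))

# RSS and the estimated covariance matrix of beta_hat
rss     <- sum((y - X %*% beta_hat)^2)
sigma2  <- rss / (nrow(X) - ncol(X))   # divides by T - K - 1
cov_hat <- sigma2 * solve(crossprod(X))
se      <- sqrt(diag(cov_hat))         # standard errors SE(beta_k)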

3.2 Non-linear Least Squares
To model the dynamic structure with several explanatory variables, the Levenberg-Marquardt
algorithm (LMA) was used. As described in [7], the LMA interpolates between the Gauss-Newton
algorithm (GNA) and the method of gradient descent. In the current setting, following the notation
defined in the previous sections, the problem is stated as follows: given T observations of
independent and dependent variables, (x_t, y_t), where x_t is a vector of length K containing the K
variable measurements corresponding to the observation y_t of the dependent variable, the objective
is to optimize the parameters β = (β_0, β_1, . . . , β_K)^⊤ of the model curve f(X, β) such that the
sum of the squares of the deviations

S(β) = Σ_{t=1}^{T} [y_t − f(x_t, β)]²   (3.6)

is minimized.

3.2.1 The Gradient Descent Method


The idea behind the steepest descent method is that it updates the parameter estimates in the
direction opposite to the gradient of the objective function. The gradient of S with respect to β is

∂S(β)/∂β = ∂/∂β [(y − f(X, β))^⊤ (y − f(X, β))] = −2(y − f(X, β))^⊤ ∂f(X, β)/∂β = −2(y − f(X, β))^⊤ J   (3.7)

where the Jacobian matrix

J = ∂f(X, β)/∂β

represents the sensitivity of f(X, β) to variation in the parameters β. In each iteration step, the
parameter increment δ that moves the parameters β in the direction of steepest descent is given by

δ_gd = αJ^⊤(y − f(X, β))   (3.8)

The positive scalar α determines the length of the step in the steepest-descent direction.

3.2.2 The Gauss-Newton Method


The Gauss-Newton method assumes that the objective function is approximately quadratic in the
parameters near the optimal solution ([7]). The parameter increment δ is found by approximating
the functions f(x_t, β + δ) by their linearizations

f(x_t, β + δ) ≈ f(x_t, β) + J_t δ   (3.9)

where

J_t = ∂f(x_t, β)/∂β

The above first-order approximation of f(x_t, β + δ) gives

S(β + δ) ≈ (y − f(X, β))^⊤(y − f(X, β)) − 2(y − f(X, β))^⊤Jδ + δ^⊤J^⊤Jδ   (3.10)

Taking the derivative of S(β + δ) with respect to δ and setting the result to zero gives:

(J^⊤J)δ_gn = J^⊤[y − f(X, β)]   (3.11)

3.2.3 The Levenberg-Marquardt Method
The Levenberg-Marquardt algorithm interpolates between the Gauss-Newton method and the
method of gradient descent:

(J^⊤J + λI)δ_lm = J^⊤[y − f(X, β)]   (3.12)

Small values of the damping parameter λ result in a Gauss-Newton update, and large values of λ
result in a gradient descent update. In each step the parameter λ is iteratively adjusted: λ is
increased if S(β + δ) > S(β), and is decreased otherwise. To avoid slow convergence in the direction
of a small gradient, Marquardt provided the insight that the values of λ should be scaled to the
values of J^⊤J ([7]):

[J^⊤J + λ diag(J^⊤J)]δ_lm = J^⊤[y − f(X, β)].   (3.13)
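For the decay estimation of section 2.2, a Levenberg-Marquardt fit with box constraints on the
decays can be sketched with nlsLM from the R package minpack.lm; the adstock helper from the
earlier sketch and the variable names (tv, radio) are assumptions, not the code used in the thesis.

library(minpack.lm)

# Sales as a function of two adstocked media variables with unknown decays
fit <- nlsLM(
  y ~ b0 + b1 * adstock(tv, l1) + b2 * adstock(radio, l2),
  start = list(b0 = mean(y), b1 = 0, b2 = 0, l1 = 0.5, l2 = 0.5),
  lower = c(-Inf, -Inf, -Inf, 0,    0),     # bounds constrain the decays l1, l2
  upper = c( Inf,  Inf,  Inf, 0.99, 0.99)
)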

4 Validation and Testing
The process of validation and testing of the model begins with testing the model's statistical
assumptions. This part is called specification error analysis (section 4.2). The next step is to test
the regression results. This involves the tests of significance described in section 4.1.

4.1 Methods of Model Assessment


In this section it is assumed that there are no specification errors. Linear regression assumes that
the disturbances are normally distributed: ε ∼ N(0, σ²I), thus β̂ ∼ N(β, σ²(X^⊤X)^{−1}). A test
statistic for the hypothesis that all of the β's are equal to zero,

H_0: β_1 = β_2 = . . . = β_K = 0  vs  H_1: at least one β_i ≠ 0,

is:

F = (SS_reg/K) / (RSS/(T − K − 1))

which has an approximate F(K, T − K − 1) distribution under the null. To determine the amount
of variation "explained" by the covariates, one looks at a descriptive statistic R², called the
coefficient of determination or goodness of fit:

R² = SS_reg/SS_tot = 1 − RSS/SS_tot   (4.1)

There is also an adjusted R² that incorporates an adjustment for degrees of freedom:

R̄² = 1 − (T − 1)/(T − K − 1) · RSS/SS_tot   (4.2)

To determine which covariates are contributing to the fit, one has to examine each covariate
separately. The test statistic for the null hypothesis that a coefficient is zero is calculated as
explained in section 3.1. A common criterion for determining which covariates should enter the
regression is the Akaike Information Criterion:

AIC = T ln(RSS) + 2K.   (4.3)

The model with the lowest AIC is preferred, since it minimizes the information loss ([13]).

4.2 Specification Error Analysis

To obtain point estimates of the coefficients and perform statistical inferences based on those point
estimates (for example: tests of significance, confidence intervals) the following assumptions must
be satisfied:

- E[ε_t] = 0 for all t;
- Var[ε_t] = σ² for all t;
- Cov[ε_t, ε_{t'}] = 0 for t ≠ t';
- ε_t is normally distributed;
- the matrix X has full rank, thus X^⊤X is non-singular.

Table 1, based on [16], is a part of the model building strategy from the perspective of violation of
assumptions. It presents a short summary of reasons, remedies, and ways to detect possible
violations of each assumption. The table is adapted to the given problem and the methods applied
in this thesis.

4.2.1 Nonzero expectation of the residuals


The violation of the assumption that the residuals are normally distributed could be a sign of
incorrect functional form, or omitted variables. If the assumed functional form is incorrect, a plot
of the residuals et = yt − ŷt , t = 1, . . . , T against each predictor should show a systematic pattern
in the residual values. However, this plot will not show that a variable has been omitted ([16]).
One way to test the possibility of an omitted variable is to add additional variables in the
original regression. Ramsey recommends to add powers of the tted response as additional terms.
The test is based on the estimation of the following model:
yt = β0 + β1 x1t + β2 x2t + · · · + βK xKt + γ1 ŷt2 + γ2 ŷt3 + · · · + γm ŷtm+1 + ε?t , (4.4)
The null hypothesis is that the tested model is the true model, meaning that the additional variables
should not have an impact on the dependent variable in the model 4.4. Since the regression equation
4.4 is useful to detect both omitted variables and nonlinearities, it is dicult to determine the exact
cause of the test failure ([12]).
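Ramsey's RESET test of equation 4.4 is available in the R package lmtest; a minimal sketch,
assuming a fitted lm object fit:

library(lmtest)

# Add squared and cubed fitted values, i.e. m = 2 in equation 4.4
resettest(fit, power = 2:3, type = "fitted")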

4.2.2 Heteroscedasticity
The second assumption is that all disturbances ε_t have the same standard deviation, in which case
standard errors and F-statistics can be computed from the estimated covariance matrix. However,
if the model has heteroscedastic residuals and is misspecified as homoscedastic, then the estimators
of the standard errors of the coefficient estimates will be wrong, and therefore the F-tests will be
invalid ([13]). The OLS estimates of the coefficients of the model will still be unbiased, but not
efficient. One solution is to use another estimation method, like generalized least squares or the
method of maximum likelihood ([4]). In many cases the critical remedy is to use an appropriately
adjusted formula for the variances and covariances of the parameter estimates.

Heteroscedasticity can be detected using the Breusch-Pagan test ([21]). The idea of this test is
to run a regression of the squared residuals on the covariates from the original equation:

ε̂_t² = δ_0 + δ_1 x_{1t} + δ_2 x_{2t} + . . . + δ_K x_{Kt} + ν_t   (4.5)

where ν_t is a disturbance term with mean zero given the x_{kt}, k = 1, . . . , K. The null hypothesis
of homoscedasticity is:

H_0: δ_1 = δ_2 = . . . = δ_K = 0   (4.6)

The F-statistic of the test is calculated in the following way:

F = (R²_{ε̂²}/K) / ((1 − R²_{ε̂²})/(T − K − 1))   (4.7)

where R²_{ε̂²} is the R-squared from the regression 4.5. This F-statistic has (approximately) an
F_{K,T−K−1} distribution under the null.
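The auxiliary regression 4.5 is implemented in lmtest::bptest; a sketch, again assuming a fitted lm
object fit (note that bptest reports a studentized version of the statistic by default):

library(lmtest)

# Breusch-Pagan test: a small p-value rejects homoscedasticity
bptest(fit)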

Table 1: Violations of the assumptions about the disturbance term: reasons, consequences, tests
and remedies (based on [16])

1. E[ε_t] ≠ 0
   Possible reasons: incorrect functional form(s); omitted variable(s).
   Consequence: biased parameter estimates.
   Detection: plot residuals against each predictor variable; RESET test; Box-Cox transformation.
   Remedy: modify the model specification in terms of functional form; add relevant predictors.

2. Var[ε_t] ≠ σ²
   Possible reasons: error proportional to the variance of the predictor.
   Consequence: inefficient parameter estimates.
   Detection: plot residuals against each predictor variable; Breusch-Pagan test.
   Remedy: modify the specification; use heteroscedasticity-consistent estimation (e.g. GLS).

3. Cov[ε_t, ε_{t'}] ≠ 0
   Possible reasons: see 1.
   Consequence: see 2.
   Detection: plot residuals against time; Durbin-Watson test.
   Remedy: see 1.

4. Nonnormal errors
   Possible reasons: see 1.
   Consequence: p-values cannot be trusted.
   Detection: inspect the distribution of the residuals; normality tests.
   Remedy: see 1.; Box-Cox transformation.

5. Multicollinearity
   Possible reasons: relations between predictor variables.
   Consequence: unreliable parameters.
   Detection: inspect the correlation matrix of the predictor variables; some VIF ≥ 5; condition
   number of the matrix (X^⊤X)^{−1} greater than 30.
   Remedy: apply other estimation methods; eliminate predictor variable(s).
4.2.3 Correlated Disturbances
Instead of assuming that the disturbances are uncorrelated, let us consider the following simple
linear additive relation for T time-series observations:

y_t = β_0 + β_1 x_t + u_t,   t = 1, . . . , T   (4.8)

where the disturbances are correlated in the following way:

u_t = ρu_{t−1} + ε_t,   |ρ| < 1   (4.9)

and:

E[ε_t] = 0,   Cov(ε_t, ε_{t'}) = 0, t ≠ t'.

In 4.8 the error terms u_1, u_2, . . . , u_T follow a first-order AutoRegressive (AR) process with
autocorrelation parameter ρ. In this case, the parameter estimates are no longer efficient, although
still unbiased, and the usual F-statistic cannot be trusted.

A plot of the residuals against time could help to detect a violation of the assumption of
uncorrelated disturbances. Another way is to use the test developed by Durbin and Watson ([9],
[10]), based on the variance of the difference between two successive disturbances:

E[(u_t − u_{t−1})²] = E[u_t²] + E[u_{t−1}²] − 2E[u_t u_{t−1}]   (4.10)

The Durbin-Watson test statistic varies between zero and four and is calculated in the following
way:

DW = Σ_{t=2}^{T} (û_t − û_{t−1})² / Σ_{t=1}^{T} û_t²   (4.11)

Values of the DW test below (above) 2 are associated with positive (negative) autocorrelation. The
test statistic is used as described in ([16]):
1. Tests for positive autocorrelation:
(a) If DW < dL , there is positive autocorrelation;
(b) If dL < DW < dU , the result is inconclusive;
(c) If DW > dU , there is no positive autocorrelation.
2. Tests for negative autocorrelation:
(a) If DW > 4 − dL , there is negative autocorrelation;
(b) If 4 − dU < DW < 4 − dL , the result is inconclusive;
(c) If DW < 4 − dU , there is no negative autocorrelation.
where the lower and upper bounds d_L and d_U depend on the significance level and the sample size.
When first-order autocorrelation is detected, a two-step estimation procedure is required. The
first step involves obtaining an estimate of ρ by means of OLS estimation. The second step requires
this estimate of ρ to be used in an estimated generalized least squares (GLS) regression ([15]).
However, according to [16], this remedy should only be a last resort option.
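The statistic 4.11 is implemented in lmtest::dwtest; a sketch for a fitted lm object fit:

library(lmtest)

# Durbin-Watson test; values well below 2 suggest positive first-order autocorrelation
dwtest(fit)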

4.2.4 Nonnormal Errors
The assumption of normally distributed errors is required for hypothesis testing and confidence
intervals to be applicable. When this assumption is violated the standard statistical tests cannot
be performed, although the least squares estimates of the parameters remain unbiased as well as
consistent.

The normality of the errors can be examined through the residuals. For this, an inspection of
the distribution function of the residuals as well as normality tests may be used. In this thesis,
the Lilliefors test was employed to assess the normality assumption of the residuals.
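The Lilliefors test is available in the R package nortest; a sketch on the residuals of a fitted model
fit:

library(nortest)

# Lilliefors (Kolmogorov-Smirnov) test of normality of the residuals
lillie.test(resid(fit))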

4.2.5 Multicollinearity
In the linear model, the matrix of observations X is assumed to have full rank; otherwise X^⊤X
will be singular, and the OLS estimates cannot be uniquely determined. Even when the number of
covariates is smaller than the number of observations, X^⊤X will be singular if some of the columns
of X are collinear. In practice, however, the problem that arises more often is imperfect
multicollinearity, when a column of X is nearly a linear combination of the other columns. In this
case (X^⊤X)^{−1} exists, but its elements will be large, thus the standard errors of one or more of
the regression coefficients become very large, and the point estimates of those coefficients will be
imprecise. This problem is often encountered in the marketing area, since the data often show high
degrees of correlation between media variables. Some methods of diagnosing multicollinearity in a
given dataset include the following (a sketch of diagnostics 2 and 4 with standard R functions is
given after the list):

1. Examining the correlation matrix of the predictor variables. A correlation coefficient close to
1 or -1 is considered an indicator of positive or negative collinearity.

2. Looking at the Variance Inflation Factor (VIF). This measure is based on the regression of
each individual predictor variable on all the other predictor variables. The VIF is computed as
1/(1 − R_k²), where the R_k² values result from the regressions above. There is no exact value of
the VIF that would be considered a sign of multicollinearity. Some analysts argue that a VIF
value greater than 5 is a signal that collinearity is a problem.

3. Comparing results for the F-test and t-tests. Multicollinearity may be regarded as acute if the
F-statistic shows significance and none of the t-statistics for the slope coefficients is significant.

4. Looking at the condition number of the matrix (X^⊤X)^{−1}, which is the ratio of its largest
eigenvalue to its smallest eigenvalue, λ_max/λ_min. The data matrix should first be normalized
so that each column has equal length - usually unit length. A rule of thumb is that a condition
index of 15 indicates some degree of collinearity, and a condition index above 30 is an indicator
of severe multicollinearity.

The main solution proposed to Nepa for handling multicollinearity was to apply regularization
methods specifically developed for cases with severe multicollinearity.
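Diagnostics 2 and 4 above can be sketched with vif from the R package car and the base function
kappa; fit and X are assumed from the earlier examples, and kappa is applied here to the scaled
predictor columns rather than to (X^⊤X)^{−1} itself.

library(car)

vif(fit)   # variance inflation factors; values >= 5 flag collinearity

# Condition number of the scaled predictor matrix (rule of thumb: > 30 is severe)
kappa(scale(X[, -1]), exact = TRUE)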

5 Linear Model Selection and Regularization
In this section, distinct ways that might improve the linear (or linearizable) models are discussed,
using variable selection or alternative estimation methods. To stay consistent with the standard
literature, in this section n stands for the number of observations and p for the number of
covariates.

5.1 Subset selection


There are several methods for selecting subsets of predictors. These include best subset and
stepwise model selection procedures.

Best subset selection is performed by fitting a least squares regression for each possible
combination of the total number of p predictors. Then all the resulting 2^p models are examined,
with the goal of identifying the one that is best. Since there are 2^p models to be examined, the
number of all possible models that must be considered grows rapidly as p increases. For
computational reasons, stepwise methods come as alternatives to best subset selection; these include
forward stepwise and backward stepwise selection, as well as hybrid approaches.

Forward stepwise selection begins with no predictors, and then gradually adds predictors to the
model by adding at each step the variable that gives the greatest additional improvement to the fit
(in terms of RSS or R²). On the other hand, backward stepwise selection starts with the full model
containing all p predictors, and iteratively removes the least useful one. Finally, it is possible to
combine both forward and backward stepwise selection, in which variables are added to the model
sequentially, but at each step the method may also remove any variables that no longer provide an
improvement in the model fit. The best model can be selected according to various criteria, such as
C_p, AIC, BIC or adjusted R², where AIC and adjusted R² are defined in the previous sections, and
C_p and BIC are computed using the following equations:

C_p = (1/n)(RSS + 2pσ̂²)   (5.1)

and

BIC = (1/n)(RSS + ln(n)pσ̂²).   (5.2)

Both C_p and BIC will return a small value for models with a low test error, so, similarly to AIC,
the models with the lowest C_p and BIC are preferred.
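Best subset selection with these criteria can be sketched with regsubsets from the R package leaps;
the formula and the data frame regdata are illustrative.

library(leaps)

# Exhaustive search over all subsets of up to 10 predictors
best <- regsubsets(SALES_TOT ~ ., data = regdata, nvmax = 10)
s <- summary(best)

# Model sizes minimizing Cp and BIC, and maximizing adjusted R^2
which.min(s$cp); which.min(s$bic); which.max(s$adjr2)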

Validation and Cross-Validation


The validation set approach involves randomly dividing the observations into a training set and a
test set. The coefficients are estimated with the training set, and then used to predict the dependent
values in the test set. In the k-fold cross-validation approach the data is divided into k groups. Each
of these groups is subsequently used as a test set, while the rest of the observations are used as a
training set. Having obtained k estimates of the test error, the k-fold cross-validation estimate is
computed by averaging these values. The advantage of cross-validation relative to the methods
mentioned above is that it provides a direct estimate of the test error, and helps to avoid overfitting.
To obtain accurate estimates of the test error, only the training observations should be used.

5.2 Shrinkage Methods
As an alternative to the subset selection methods described in section 5.1 above, techniques that
shrink the coefficient estimates towards zero can be used. These include ridge regression, the lasso,
and the elastic net. To better understand these techniques, first the concept of the bias-variance
trade-off is introduced.

5.2.1 The Bias-Variance Trade-Off

Recall that the mean squared error (MSE) is estimated as:

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² = (1/n)(y − Xβ̂)^⊤(y − Xβ̂)

where β̂ ∼ N(β, σ²(X^⊤X)^{−1}). Note that the covariates are fixed; only the responses are random.
The expected test MSE, for a given vector x_0 of length K that contains new measurements, can be
decomposed in the following way:

MSE_0 = Var[x_0^⊤β̂] + Bias²(x_0^⊤β̂) + Var[ε]

where Bias(x_0^⊤β̂) = E[x_0^⊤β̂] − x_0^⊤β. In practice, some bias might be accepted for a reduction
in the variance of the coefficient estimates. This can be achieved by employing the regularized
regression methods described in the following sections.

5.2.2 Ridge Regression


The balance between bias and variance may be achieved by placing constraints on the estimated
coefficients β. Instead of minimizing the residual sum of squares RSS defined above, the ridge
regression coefficient estimates are found by minimizing the following quantity:

Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ji})² + λ Σ_{j=1}^{p} β_j² = RSS + λ Σ_{j=1}^{p} β_j²   (5.3)

where λ ≥ 0 is a tuning parameter to be determined, and the term λ Σ_{j=1}^{p} β_j² is called a
shrinkage penalty. The result is the ridge regression estimator:

β̂^ridge(λ) = (X^⊤X + λI)^{−1} X^⊤y = W(λ)β̂^OLS,

where W(λ) = (X^⊤X + λI)^{−1} X^⊤X. For each value of λ, ridge regression will produce a set of
coefficient estimates. When λ = 0 the ridge estimates will be equal to the least squares estimates.
As λ → ∞ the ridge regression coefficient estimates will approach zero. The intercept remains
simply the mean value of the response.

Because the ridge coefficients change substantially when a covariate is multiplied by a constant,
ridge regression should be applied using standardized predictors ([6]):

x̃_{ij} = x_{ij} / sqrt((1/n) Σ_{i=1}^{n} (x_{ij} − x̄_j)²)   (5.4)

Standardizing the predictors also makes it possible to compare the estimated coefficients with each
other.

5.2.3 The Lasso
Ridge regression will shrink all the coefficient estimates towards zero, but it will not set any of
them exactly equal to zero, which might be a drawback if the purpose of the model is also variable
selection. An alternative method called the lasso overcomes this disadvantage. The lasso
coefficients are the values that minimize the quantity:

Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ji})² + λ Σ_{j=1}^{p} |β_j| = RSS + λ Σ_{j=1}^{p} |β_j|   (5.5)

Like ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the
case of the lasso penalty, some of the coefficient estimates will be exactly zero when the tuning
parameter λ is sufficiently large. Hence the lasso also performs variable selection, which makes the
interpretation of the model much easier.

5.2.4 Comparing Ridge regression and The Lasso


One can show that the lasso and ridge regression coefficient estimates solve the problems

minimize_β Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij})²  subject to  Σ_{j=1}^{p} |β_j| ≤ s   (5.6)

and

minimize_β Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij})²  subject to  Σ_{j=1}^{p} β_j² ≤ s,   (5.7)

respectively.

Note that the restriction Σ_{j=1}^{p} β_j² ≤ s on β is a hypersphere centered at the origin with
bounded squared radius s, where the value of s determines the value of the tuning parameter λ.
Figure 1 (taken from [6]) shows the restrictions for the lasso and ridge regression in the
two-parameter case.

Choosing among the regularization methods is not trivial. Which model produces better
prediction accuracy depends on the dataset used. Since the lasso assumes that several coefficients
are in fact equal to zero, it will perform better when some of the predictors are not related to the
response. In the case when all coefficients differ substantially from zero, ridge regression is expected
to outperform the lasso. Since the number of coefficients related to the response is never known, a
cross-validation approach can be used to determine the best method for each dataset. See [6] for a
deeper discussion on how to select the regularization approach.

5.2.5 Selecting the Tuning Parameter


The cross-validation approach tackles the problem of selecting the appropriate tuning parameter λ
in the following way: the cross-validation error is computed over a grid of λ values, then the tuning
parameter value for which the cross-validation error is smallest is selected. Finally, the model is
refitted using all the available observations and the selected value of the tuning parameter.
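This procedure is implemented in cv.glmnet from the R package glmnet. Note that glmnet's mixing
parameter alpha follows the opposite convention to the α defined later in section 5.2.6: in glmnet,
alpha = 0 gives ridge regression and alpha = 1 the lasso. A sketch, with X an assumed predictor
matrix and y the response:

library(glmnet)

# Ten-fold cross-validation over an automatic grid of lambda values
cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)   # ridge penalty
cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)   # lasso penalty

# Tuning parameter with the smallest cross-validation error, and the refit
cv_lasso$lambda.min
coef(cv_lasso, s = "lambda.min")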

Figure 1: (taken from [1]). Contours of the error and constraint functions for the lasso (left)
and ridge regression (right). The solid blue areas are the regions |β_1| + |β_2| ≤ s and β_1² + β_2² ≤ s,
respectively, while the red ellipses are the contours of the RSS.

5.2.6 Naive Elastic Net


Considering the model above, with y being the vector of the response variable and X = (x_1, x_2, . . . , x_p)
being the matrix of predictors, the naive elastic net estimator β̂ is the one that minimizes the
quantity:

L(λ_1, λ_2, β) = Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ji})² + λ_1 Σ_{j=1}^{p} |β_j| + λ_2 Σ_{j=1}^{p} β_j²   (5.8)

for any λ_1 and λ_2. Similarly to ridge regression and the lasso, this procedure can be viewed as
penalized least squares. If α is defined as α = λ_2/(λ_1 + λ_2), then solving for β in equation 5.8 is
equivalent to the optimization problem:

minimize_β Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij})²  subject to  (1 − α) Σ_{j=1}^{p} |β_j| + α Σ_{j=1}^{p} β_j² ≤ s.   (5.9)

The function (1 − α) Σ_{j=1}^{p} |β_j| + α Σ_{j=1}^{p} β_j² is called the elastic net penalty. When α = 1,
the naive elastic net becomes simple ridge regression. For all α ∈ [0, 1), the elastic net penalty
function is singular (without first derivative) at 0, and it is strictly convex for all α > 0. Note that
the lasso penalty (α = 0) is convex but not strictly convex. The two-dimensional contours of the
penalty function for ridge, the lasso and the naive elastic net are given in Figure 2 (taken from [26]).
In the article [27], Hui Zou and Trevor Hastie develop a method to solve the naive elastic net
problem efficiently.

Figure 2: (taken from [26]). Two-dimensional contour plots of the ridge, the lasso, and α = 0.5
elastic net penalties.

It turns out that minimizing equation 5.8 is equivalent to a lasso-type optimization problem. This
fact implies that the naive elastic net also enjoys the computational advantage of the lasso. The
next lemma is a result from the paper [27].

Lemma 1. Given a dataset (y, X) and (λ_1, λ_2), an artificial dataset (y*, X*) is defined by:

X*_{(n+p)×p} = (1 + λ_2)^{−1/2} ( X ; √λ_2 I ),   y*_{(n+p)} = ( y ; 0 ),   (5.10)

where the blocks are stacked row-wise. Let γ = λ_1/√(1 + λ_2) and β* = √(1 + λ_2) β. Then the
naive elastic net criterion can be given as:

L(γ, β) = L(γ, β*) = Σ_{i=1}^{n+p} (y_i* − β_0* − Σ_{j=1}^{p} β_j* x_{ji}*)² + γ Σ_{j=1}^{p} |β_j*|.   (5.11)

Let β̂* = (β̂_1*, . . . , β̂_p*)^⊤ be the vector that minimizes the quantity above. Then

β̂ = (1/√(1 + λ_2)) β̂*   (5.12)

Note that the sample size in the augmented problem is n + p and X* has rank p, which means
that the naive elastic net can potentially select all p predictors in all situations. Lemma 1 also
shows that the naive elastic net can perform automatic variable selection in a fashion similar to
the lasso.

5.2.7 Elastic Net


The studies in [27] show that one drawback of the naive elastic net is that it performs well only
when it is very close to either ridge regression or the lasso. The estimation of the coefficients implies
a double shrinkage procedure, which causes an increase in bias. In their paper, Hui Zou and Trevor
Hastie propose a scaling of the naive elastic net coefficients which keeps the advantage of the
variable selection property while avoiding the undesirable double shrinkage. Following the notation
in section 5.2.6, the naive elastic net solves a lasso-type problem:

β̂* = argmin_{β*} |y* − X*β*|² + (λ_1/√(1 + λ_2)) |β*|_1   (5.13)

where (y*, X*) is the augmented data defined in 5.10, and (λ_1, λ_2) are the penalty parameters.
The elastic net corrected estimates are defined by:

β̂^enet = √(1 + λ_2) β̂*   (5.14)

Recall that β̂^{naive enet} = (1/√(1 + λ_2)) β̂*, thus:

β̂^enet = (1 + λ_2) β̂^{naive enet}   (5.15)

In the elastic net one could choose the tuning parameters as (λ_2, s), where s ∈ [0, 1] is the fraction
of the l_1-norm. The tuning parameters were chosen using a two-dimensional tenfold cross-validation
method, following the procedure suggested in [27]: first a (relatively small) grid of values for λ_2 is
picked, then the other tuning parameter is selected by cross-validation. The value of λ_2 is chosen
so as to give the smallest CV error.
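The two-dimensional search described above can be sketched by looping a tenfold cv.glmnet over a
small grid of glmnet's mixing parameter, which here plays the role of the second tuning parameter;
X and y are as in the previous sketch.

library(glmnet)

alphas <- seq(0.1, 0.9, by = 0.1)   # small grid for the mixing parameter
cvs <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a, nfolds = 10))

# Pick the (alpha, lambda) pair with the smallest cross-validation error
errs <- sapply(cvs, function(m) min(m$cvm))
best <- which.min(errs)
c(alpha = alphas[best], lambda = cvs[[best]]$lambda.min)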

6 Results
In this section, the results for the models specified in the previous sections are presented. It starts
with the linear and multiplicative models described in section 2.1.1: the linear, log-linear and
log-log functional forms are estimated. Next, the retention rates for the chosen functional form
are estimated using non-linear least squares. Finally, the results for the modern approaches are
illustrated and compared.

6.1 Choosing among functional forms for Marketing Mix Modelling


To estimate the parameters for the equations formulated in the above sections, the following data
provided by Nepa was used:

y_t = value of sales in week t,
TV_t = advertising expenditures for television in week t,
DR_t = advertising expenditures for ads coming with the mail in week t,
DR.POSTEN_t = investments in ads coming with the mail, but in a slightly different format,
OUTDOOR_t = investments in ads put up outdoors, e.g. at bus stops, in week t,
RADIO_t = radio advertising expenditures in week t,
PRINT_t = print advertising expenditures in week t,
SOCIALMEDIA_t = social media advertising expenditures in week t,
Rain_t = rain quantity in week t,
sal_t = dummy variable, indicating whether it was a salary week,
HOLIDAY_t = dummy variable, indicating whether there was a holiday in week t.
With the variables defined above, the linear model 2.1 becomes:

y_t = α_0 + α_1 TV_t + α_2 DR_t + α_3 DR.POSTEN_t + α_4 OUTDOOR_t
      + α_5 RADIO_t + α_6 PRINT_t + α_7 SOCIALMEDIA_t
      + α_8 Rain_t + α_9 sal_t + α_10 HOLIDAY_t + ε_t^{(1)}.   (6.1)

The exponential model 2.7 takes the form:

y_t = β_0 e^{β_1 TV_t + β_2 DR_t + β_3 DR.POSTEN_t + β_4 OUTDOOR_t + β_5 RADIO_t}
      · e^{β_6 PRINT_t + β_7 SOCIALMEDIA_t} e^{β_8 Rain_t + β_9 sal_t + β_10 HOLIDAY_t} ε_t^{(2)}   (6.2)

which, after taking the natural logarithm of both sides, becomes:

ln(y_t) = ln(β_0) + β_1 TV_t + β_2 DR_t + β_3 DR.POSTEN_t + β_4 OUTDOOR_t
          + β_5 RADIO_t + β_6 PRINT_t + β_7 SOCIALMEDIA_t
          + β_8 Rain_t + β_9 sal_t + β_10 HOLIDAY_t + ln(ε_t^{(2)}).   (6.3)

Finally, the multiplicative model 2.4 becomes:

y_t = γ_0 TV_t^{γ_1} DR_t^{γ_2} DR.POSTEN_t^{γ_3} OUTDOOR_t^{γ_4} RADIO_t^{γ_5}
      · PRINT_t^{γ_6} SOCIALMEDIA_t^{γ_7} γ_8^{Rain_t} γ_9^{sal_t} γ_10^{HOLIDAY_t} ε_t^{(3)}   (6.4)

and after the transformation:

ln(y_t) = ln(γ_0) + γ_1 ln(TV_t) + γ_2 ln(DR_t) + γ_3 ln(DR.POSTEN_t)
          + γ_4 ln(OUTDOOR_t) + γ_5 ln(RADIO_t) + γ_6 ln(PRINT_t)
          + γ_7 ln(SOCIALMEDIA_t) + γ_8^* Rain_t + γ_9^* sal_t
          + γ_10^* HOLIDAY_t + ln(ε_t^{(3)}),   (6.5)

where:
γ_8^* = ln(γ_8),
γ_9^* = ln(γ_9),
γ_10^* = ln(γ_10).

The following tables illustrate the estimation results for 6.1, 6.3 and 6.5, respectively.

Table 2: Estimation results of the linear model (OLS)

##
## Call:
## lm(formula = "SALES_TOT ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO +
##     PRINT + SOCIALMEDIA + Rain..mm. + sal + HOLIDAY", data = regdata)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -23290744  -7598418  -1641095   7793738  82440104
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.226e+07  3.988e+06  10.598  < 2e-16 ***
## TV           2.239e+01  2.567e+00   8.721 9.59e-14 ***
## DR           1.453e+01  8.034e+00   1.809  0.07369 .
## DR.POSTEN    1.742e+01  5.417e+00   3.216  0.00178 **
## OUTDOOR      3.756e+01  1.035e+01   3.631  0.00046 ***
## RADIO       -6.633e+01  2.957e+01  -2.243  0.02724 *
## PRINT        6.304e+00  5.876e+00   1.073  0.28610
## SOCIALMEDIA  1.832e+02  2.400e+01   7.636 1.84e-11 ***
## Rain..mm.    2.796e+05  1.461e+05   1.913  0.05877 .
## sal          6.885e+06  3.734e+06   1.844  0.06834 .
## HOLIDAY      2.906e+06  6.383e+06   0.455  0.64999
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14380000 on 94 degrees of freedom
## Multiple R-squared:  0.8203, Adjusted R-squared:  0.8012
## F-statistic: 42.92 on 10 and 94 DF,  p-value: < 2.2e-16

For the linear functional form, a relatively high value of R2 was expected, given the fact that time series data was used. Note that R2 can be compared among different models only if the models have exactly the same LHS and exactly the same observations. The values of the F-statistics in all

Table 3: Estimation results of the log-linear model (OLS)

##
## Call :
## lm ( formula = " log ( SALES_TOT ) ~ TV + DR + DR . POSTEN + OUTDOOR + RADIO + PRINT + SOCIALMEDIA +
Rain .. mm .+ sal + HOLIDAY ",
## data = regdata )
##
## Residuals :
## Min 1Q Median 3Q Max
## -0.20887 -0.07726 -0.00648 0.06248 0.32894
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 1.787 e +01 3.063 e -02 583.581 < 2e -16 ***
## TV 1.661 e -07 1.972 e -08 8.425 4.07 e -13 ***
## DR 1.273 e -07 6.171 e -08 2.062 0.041929 *
## DR . POSTEN 1.423 e -07 4.161 e -08 3.420 0.000929 ***
## OUTDOOR 3.334 e -07 7.946 e -08 4.196 6.15 e -05 ***
## RADIO -6.494 e -07 2.271 e -07 -2.860 0.005224 **
## PRINT 7.119 e -08 4.513 e -08 1.577 0.118097
## SOCIALMEDIA 1.547 e -06 1.843 e -07 8.393 4.76 e -13 ***
## Rain .. mm . 2.251 e -03 1.122 e -03 2.006 0.047757 *
## sal 5.934 e -02 2.868 e -02 2.069 0.041258 *
## HOLIDAY -4.279 e -02 4.903 e -02 -0.873 0.385054
## ---
## Signif . codes :
## 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 0.1105 on 94 degrees of freedom
## Multiple R - squared : 0.8332 , Adjusted R - squared : 0.8154
## F - statistic : 46.95 on 10 and 94 DF , p - value : < 2.2 e -16

cases indicate that all three models (especially the first two) are highly significant. The number of significant parameters varies slightly across the models. The significant parameters that all three models have in common are the intercept, TV, DR.POSTEN, OUTDOOR, RADIO and SOCIALMEDIA; for these parameters the corresponding p-values are smaller than 0.05. In the log-linear model the parameters for DR, Rain and sal are also significant, while in the log-log model the parameters for log(PRINT) and Rain, along with the common ones mentioned above, are significant.
It is important to mention that each specification has its own unique economic interpretation. That is, the choice of a log versus linear specification should be made largely based on the underlying economics. Table 5 (taken from [25]) summarizes the interpretation of the estimates for each case.

Table 4: Estimation results of the log-log model (OLS)

##
## Call :
## lm ( formula = " log ( SALES_TOT ) ~ log ( TV ) + log ( DR )+ log ( DR . POSTEN )+ log ( OUTDOOR )+ log (
RADIO )+ log ( PRINT )+ log ( SOCIALMEDIA )+ Rain .. mm .+ sal + HOLIDAY ",
## data = regdata )
##
## Residuals :
## Min 1Q Median 3Q Max
## -0.36417 -0.10845 0.01072 0.06971 0.79316
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 14.953425 0.501883 29.795 < 2e -16 ***
## log ( TV ) 0.012518 0.003359 3.727 0.000331 ***
## log ( DR ) 0.005367 0.007390 0.726 0.469484
## log ( DR . POSTEN ) 0.013922 0.006767 2.057 0.042410 *
## log ( OUTDOOR ) 0.052534 0.014639 3.589 0.000530 ***
## log ( RADIO ) -0.012448 0.004007 -3.107 0.002500 **
## log ( PRINT ) 0.181852 0.036741 4.950 3.26 e -06 ***
## log ( SOCIALMEDIA ) 0.012168 0.003768 3.229 0.001709 **
## Rain .. mm . 0.004394 0.001918 2.292 0.024164 *
## sal 0.049464 0.047164 1.049 0.296974
## HOLIDAY -0.050739 0.086438 -0.587 0.558615
## ---
## Signif . codes :
## 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 0.1858 on 94 degrees of freedom
## Multiple R - squared : 0.5279 , Adjusted R - squared : 0.4777
## F - statistic : 10.51 on 10 and 94 DF , p - value : 1.001 e -11

Table 5: (taken from [25]). Summary of the interpretation of Marketing Mix Modelling functional forms

                   Dependent   Independent   Interpretation       Marginal Effect
                   Variable    Variable      of β                 of Δx
linear model       y           x             β = Δy/Δx            β
log-linear model   ln(y)       x             100·β = %Δy/Δx       y·β
log-log model      ln(y)       ln(x)         β = %Δy/%Δx          y·β/x

RSS, ESS, and σ̂ are not comparable in size across the models. This is due to the fact that several variables, including the dependent variable, were transformed in order to estimate the multiplicative models using least squares. Therefore, it is not possible to compare the estimated values of the parameters across the models.
Note that the numbers in Table 4 are estimates of the parameters in 6.5, the linearized version of the log-log model (6.4). To find estimates of the parameters γ8, γ9 and γ10, which enter the linearized model through their logarithms, an 'anti-ln' transformation must be applied. Instead of just taking the exponential of the estimates from Table 4 to obtain proper estimates of the parameters in the log-log model, the following correction must be employed ([16]):

Figure 3: Plot of residuals against each predictor variable for the linear model

γ̂ = exp(γ̂*) · exp(−σ²(γ̂*)/2)   (6.6)

where σ²(γ̂*) denotes the squared standard error of γ̂*. The estimates γ̂8, γ̂9, γ̂10 for equation 6.4 become:

γ̂8 = e^0.004394 · e^(−0.001918²/2) = 1.0044
γ̂9 = e^0.049464 · e^(−0.047164²/2) = 1.0495
γ̂10 = e^(−0.050739) · e^(−0.086438²/2) = 0.9470
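As a quick check, the correction can be reproduced in a few lines of R, using the point estimates and standard errors reported in Table 4:

# Bias-corrected back-transformation of equation 6.6, applied to the
# estimates and standard errors of Rain, sal and HOLIDAY from Table 4
gamma.star <- c(Rain = 0.004394, sal = 0.049464, HOLIDAY = -0.050739)
se.star    <- c(Rain = 0.001918, sal = 0.047164, HOLIDAY = 0.086438)
exp(gamma.star) * exp(-0.5 * se.star^2)   # 1.0044, 1.0495, 0.9470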
In order to test the first assumption, that the residuals have zero expectation, the plots of the residuals against each predictor variable for the models 6.1, 6.3 and 6.5 must be examined. These plots are presented in Figures 3, 4 and 5, respectively. They should be examined by inspecting each independent variable, to assess whether the residuals differ systematically from zero for certain values. To assess this assumption more carefully, the RESET test was employed for each of the models. Tables 6 and 7 display the results of the RESET test with power 2 and power 3 of the fitted response ŷt, respectively, while Table 8 shows the results with both powers included. The most appropriate functional form is the log-linear one: for the second and third powers individually, one cannot reject the hypothesis that they are insignificant explanatory variables in the model, and the joint test of both powers is significant only at the 5% level.
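As a sketch of how these tests can be run, the resettest function from the lmtest package covers all three variants; fit.loglin below is an assumed name for the fitted lm object of model 6.3:

# RESET tests with powers of the fitted response, for the log-linear model
library(lmtest)
resettest(fit.loglin, power = 2,   type = "fitted")   # Table 6
resettest(fit.loglin, power = 3,   type = "fitted")   # Table 7
resettest(fit.loglin, power = 2:3, type = "fitted")   # Table 8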

Figure 4: Plot of residuals against each predictor variable for the log-linear model

Figure 5: Plot of residuals against each predictor variable for the log-log model

Table 8: RESET test. Power 2 and 3 of the fitted response
linear model RESET = 38.269, df1 = 2, df2 = 92, p-value = 8.055e-13
log-linear model RESET = 4.5081, df1 = 2, df2 = 92, p-value = 0.01356
log-log model RESET = 26.782, df1 = 2, df2 = 92, p-value = 6.818e-10

The Box-Cox test can also be used to determine whether transformations of variables are required. Figure 6 shows the log of the likelihood ratio test for different values of λ. The best-fitting transformation is λ = −0.4242424, which is closest to the log-linear specification (λ = 0). Note that if the task were only to fit historical data, this value of λ would have been chosen; however, it has no economic meaning, so the log-linear specification is preferred.
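A sketch of the Box-Cox profiling, assuming fit.lin is the lm object for the linear model 6.1 (the name is illustrative):

# Profile the Box-Cox log-likelihood over a grid of lambda values
library(MASS)
bc <- boxcox(fit.lin, lambda = seq(-2, 2, length.out = 100))
bc$x[which.max(bc$y)]   # best-fitting lambda; here -0.4242424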

Table 6: RESET test. Power 2 of the fitted response


linear model RESET = 52.878, df1 = 1, df2 = 93, p-value = 1.092e-10
log-linear model RESET = 1.7934, df1 = 1, df2 = 93, p-value = 0.1838
log-log model RESET = 35.013, df1 = 1, df2 = 93, p-value = 5.417e-08

Table 7: RESET test. Power 3 of the fitted response


linear model RESET = 31.869, df1 = 1, df2 = 93, p-value = 1.778e-07
log-linear model RESET = 0.3821, df1 = 1, df2 = 93, p-value = 0.538
log-log model RESET = 38.029, df1 = 1, df2 = 93, p-value = 1.785e-08

Figure 6: Box-Cox transformation of the response variable, with 95% confidence interval of the parameter λ

6.2 Marketing dynamics


Considering the specification chosen above, the next step is to find the appropriate retention rate for each marketing variable, by solving the following optimization problem:

minimize  Σ_{t=1}^{T} (ln yt − ln ŷt)²
subject to  0 ≤ λi < 1,  i = 1, . . . , K,

where:

ln ŷt = β0* + β1 f1(TVt, λ1) + β2 f2(DRt, λ2) + β3 f3(DR.POSTENt, λ3) + β4 f4(OUTDOORt, λ4) + β5 f5(RADIOt, λ5) + β6 f6(PRINTt, λ6) + β7 f7(SOCIALMEDIAt, λ7) + β8 Raint + β9 salt + β10 HOLIDAYt,   (6.7)

and f1(TVt, λ1), . . . , f7(SOCIALMEDIAt, λ7) are the adstock functions, defined recursively as:

f1(TVt, λ1) = TVt + λ1 f1(TVt−1, λ1)
. . .
f7(SOCIALMEDIAt, λ7) = SOCIALMEDIAt + λ7 f7(SOCIALMEDIAt−1, λ7)   (6.8)

The starting value for all the decays is 0, and as starting values for the parameter coefficients the estimates of equation 6.3 were used, shown in Table 3. The results of the non-linear least squares regression are shown in Table 9. The estimates are quite close to the ones provided in Table 3, although their significance changes. To avoid over-fitting, one might keep only the significant decays and run a linear regression again. Observe that among all the decays, only SOCIALMEDIA_adstock has a small p-value (0.055435), so the null hypothesis that the decay for SOCIALMEDIA equals 0 is rejected at significance level α = 0.1. The adstock function for SOCIALMEDIA expenditures then takes the following form:

f7(SOCIALMEDIAt, 0.2644602) = SOCIALMEDIAt + 0.2644602 · f7(SOCIALMEDIAt−1, 0.2644602),   t = 2, . . . , T
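The following sketch outlines this estimation with the nlsLM function from the minpack.lm package; it is shown for two channels only to keep it short, the remaining channels follow the same pattern, and all starting values are illustrative:

library(minpack.lm)

# Geometric adstock f(x_t) = x_t + rate * f(x_{t-1}), as a recursive filter
adstock <- function(x, rate) {
  as.numeric(stats::filter(x, filter = rate, method = "recursive"))
}

fit.nls <- nlsLM(
  log(SALES_TOT) ~ Intercept +
    TV_coefficient          * adstock(TV,          TV_adstock) +
    SOCIALMEDIA_coefficient * adstock(SOCIALMEDIA, SOCIALMEDIA_adstock),
  data  = regdata,
  start = list(Intercept = 17.9, TV_coefficient = 1.7e-07, TV_adstock = 0,
               SOCIALMEDIA_coefficient = 1.5e-06, SOCIALMEDIA_adstock = 0),
  lower = c(-Inf, -Inf, 0, -Inf, 0),   # decays constrained to [0, 1]
  upper = c( Inf,  Inf, 1,  Inf, 1))
summary(fit.nls)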

Table 9: Estimation results of the log-linear model, dynamic structure using Levenberg-Marquardt
method

##
## Formula : log ( SALES_TOT ) ~ Intercept + TV_coefficient * adstock (TV , TV_adstock ) +
## DR_coefficient * adstock (DR , DR_adstock ) + DR . POSTEN_coefficient *
## adstock ( DR . POSTEN , DR . POSTEN_adstock ) + OUTDOOR_coefficient *
## adstock ( OUTDOOR , OUTDOOR_adstock ) + RADIO_coefficient * adstock ( RADIO ,
## RADIO_adstock ) + PRINT_coefficient * adstock ( PRINT , PRINT_adstock ) +
## SOCIALMEDIA_coefficient * adstock ( SOCIALMEDIA , SOCIALMEDIA_adstock ) +
## Rain .. mm . _coefficient * Rain .. mm . + sal_coefficient * sal +
## HOLIDAY_coefficient * HOLIDAY
##
## Parameters :
## Estimate Std . Error t value Pr ( >| t |)
## Intercept 1.787 e +01 4.094 e -02 436.519 < 2e -16 ***
## TV_coefficient 1.593 e -07 2.313 e -08 6.885 8.54 e -10 ***
## TV_adstock 1.488 e -02 1.286 e -01 0.116 0.908188
## DR_coefficient 1.118 e -07 6.615 e -08 1.690 0.094613 .
## DR_adstock 0.000 e +00 5.868 e -01 0.000 1.000000
## DR . POSTEN_coefficient 1.445 e -07 5.444 e -08 2.655 0.009436 **
## DR . POSTEN_adstock 7.176 e -04 3.168 e -01 0.002 0.998198
## OUTDOOR_coefficient 3.229 e -07 8.639 e -08 3.737 0.000332 ***
## OUTDOOR_adstock 0.000 e +00 3.257 e -01 0.000 1.000000
## RADIO_coefficient -5.799 e -07 2.428 e -07 -2.389 0.019075 *
## RADIO_adstock 3.523 e -02 4.130 e -01 0.085 0.932215
## PRINT_coefficient 7.731 e -08 4.689 e -08 1.649 0.102793
## PRINT_adstock 0.000 e +00 5.808 e -01 0.000 1.000000
## SOCIALMEDIA_coefficient 1.409 e -06 2.141 e -07 6.581 3.38 e -09 ***
## SOCIALMEDIA_adstock 2.645 e -01 1.362 e -01 1.941 0.055435 .
## Rain .. mm . _coefficient 2.141 e -03 1.173 e -03 1.826 0.071339 .
## sal_coefficient 5.876 e -02 3.178 e -02 1.849 0.067882 .
## HOLIDAY_coefficient -3.850 e -02 5.050 e -02 -0.762 0.447846
## ---
## Signif . codes : 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 0.1105 on 87 degrees of freedom
##
## Number of iterations to convergence : 20
## Achieved convergence tolerance : 1.49 e -08

6.3 Re-estimation and testing the OLS assumptions
In this section the lag weight parameters found using the Levenberg-Marquardt algorithm are used, and the re-estimated OLS model is tested. Since the only significant decay found in the previous section is the one for SOCIALMEDIA, the following equation is estimated with OLS:

ln(yt) = β0* + β1 TVt + β2 DRt + β3 DR.POSTENt + β4 OUTDOORt + β5 RADIOt + β6 PRINTt + β7 f7(SOCIALMEDIAt, 0.2644602) + β8 Raint + β9 salt + β10 HOLIDAYt + εt*,   (6.9)

The results are shown in Table 10.


Table 10: Estimation results of the log-linear functional form with adstock model considered for
SOCIALMEDIA variable

##
## Call :
## lm ( formula = " log ( SALES_TOT ) ~ TV + DR + DR . POSTEN + OUTDOOR + RADIO + PRINT + SOCIALMEDIA +
Rain .. mm .+ sal + HOLIDAY ",
## data = regdata )
##
## Residuals :
## Min 1Q Median 3Q Max
## -0.231972 -0.073825 -0.009095 0.061918 0.291512
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 1.787 e +01 2.955 e -02 604.603 < 2e -16 ***
## TV 1.540 e -07 1.914 e -08 8.048 2.53 e -12 ***
## DR 1.124 e -07 5.925 e -08 1.897 0.060940 .
## DR . POSTEN 1.471 e -07 3.999 e -08 3.678 0.000391 ***
## OUTDOOR 3.153 e -07 7.667 e -08 4.113 8.37 e -05 ***
## RADIO -5.581 e -07 2.197 e -07 -2.540 0.012712 *
## PRINT 8.798 e -08 4.351 e -08 2.022 0.045990 *
## SOCIALMEDIA 1.901 e -06 2.078 e -07 9.146 1.20 e -14 ***
## Rain .. mm . 2.176 e -03 1.080 e -03 2.015 0.046764 *
## sal 5.672 e -02 2.757 e -02 2.057 0.042443 *
## HOLIDAY -4.193 e -02 4.717 e -02 -0.889 0.376266
## ---
## Signif . codes : 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 0.1063 on 94 degrees of freedom
## Multiple R - squared : 0.8456 , Adjusted R - squared : 0.8292
## F - statistic : 51.48 on 10 and 94 DF , p - value : < 2.2 e -16

The results indicate that most of the parameters are significant, with the exception of HOLIDAY. The goodness of fit has increased, and the p-value of the F-statistic indicates that the model is highly significant. The short-term effect of marketing effort for SOCIALMEDIA is 1.901 · 10−6; the implied long-term effect is 1.901 · 10−6/(1 − 0.2644602) = 2.584496 · 10−6.
In order to test the first assumption E[εt] = 0, the residuals against each predictor variable are plotted again (Figure 7). The graphs do not show any systematic pattern in the residuals. Next

Figure 7: Plot of residuals against each predictor variable for the log-linear functional form with
adstock model considered for SOCIALMEDIA variable (equation 6.9)

the RESET test was employed with powers of the fitted response. The results shown in Table 11 indicate that there is no strong evidence of misspecification.

Table 11: RESET test for log-linear adstock model, using powers of the fitted response
power 2 RESET = 1.1517, df1 = 1, df2 = 93, p-value = 0.286
power 3 RESET = 0.3628, df1 = 1, df2 = 93, p-value = 0.5484
power 2 and 3 RESET = 2.6459, df1 = 2, df2 = 92, p-value = 0.07634

In order to test the second assumption, Var[εt] = σ² for all t, Figure 7 must be examined again, now with the purpose of detecting changes in the variability of the residuals. To test for heteroscedasticity more formally, the Breusch-Pagan test is employed, by running a regression of the squared residuals on the explanatory variables that appear in equation 6.9. The p-value associated with the Breusch-Pagan test is 0.9635, indicating that no significant heteroscedasticity is detected. To test the normality of the residuals, normality tests together with visual assessment were employed. Figure 9 shows the empirical cumulative distribution function together with the normal cumulative distribution function, and the normal probability plot is shown in Figure 8 (right). The plots do not indicate evidence of non-normality of the residuals. To assess non-normality of the residuals more carefully, the Lilliefors test was employed, which returned a p-value equal to 0.7248. From the results above it can be concluded that non-normality is not an issue. To check for the presence of multicollinearity, first the correlation matrix of the explanatory variables (Table 12) must be inspected.
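A sketch of these two tests, assuming fit.ads is the lm object for equation 6.9 (the name is illustrative):

# Formal residual diagnostics for the model in equation 6.9
library(lmtest)    # bptest()
library(nortest)   # lillie.test()
bptest(fit.ads)                   # Breusch-Pagan test; p-value 0.9635
lillie.test(residuals(fit.ads))   # Lilliefors normality test; p-value 0.7248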

Figure 8: Diagnostics of the regression model considered in equation 6.9. Left : scatterplot of the
residuals against tted values. Right : Normal probability plot of the residuals

Figure 9: Empirical cumulative distribution function of the residuals for the log-linear functional
form with adstock model considered for SOCIALMEDIA variable (equation 6.9)

Table 12: Correlation matrix of the explanatory variables

## ( Intercept ) TV DR DR . POSTEN OUTDOOR RADIO PRINT


## ( Intercept ) 1.00 0.14 -0.240 -0.254 -0.020 0.1763 -0.610
## TV 0.14 1.00 -0.346 -0.186 0.154 -0.2015 -0.310
## DR -0.24 -0.35 1.000 -0.490 -0.096 0.1570 0.071
## DR . POSTEN -0.25 -0.19 -0.490 1.000 -0.030 -0.2383 0.186
## OUTDOOR -0.02 0.15 -0.096 -0.030 1.000 -0.7079 0.017
## RADIO 0.18 -0.20 0.157 -0.238 -0.708 1.0000 -0.266
## PRINT -0.61 -0.31 0.071 0.186 0.017 -0.2660 1.000
## SOCIALMEDIA -0.15 -0.17 0.050 -0.055 -0.158 0.1543 0.067
## Rain .. mm . -0.47 -0.11 0.148 -0.054 0.010 0.0287 -0.015
## sal 0.13 0.14 -0.110 -0.263 0.178 -0.0183 -0.362
## HOLIDAY -0.16 0.11 0.018 0.050 0.028 -0.0039 -0.176
## SOCIALMEDIA Rain .. mm . sal HOLIDAY
## ( Intercept ) -0.1485 -0.471 0.128 -0.1558
## TV -0.1671 -0.106 0.138 0.1058
## DR 0.0496 0.148 -0.110 0.0183
## DR . POSTEN -0.0551 -0.054 -0.263 0.0495
## OUTDOOR -0.1580 0.010 0.178 0.0281
## RADIO 0.1543 0.029 -0.018 -0.0039
## PRINT 0.0670 -0.015 -0.362 -0.1759
## SOCIALMEDIA 1.0000 -0.036 0.037 -0.0063
## Rain .. mm . -0.0360 1.000 0.129 0.1921
## sal 0.0370 0.129 1.000 0.0745
## HOLIDAY -0.0063 0.192 0.074 1.0000

The correlation matrix indicates that RADIO and OUTDOOR are negatively correlated (−0.708). There is also evidence of negative correlation between DR and DR.POSTEN (−0.490). Multicollinearity might be the cause of the non-significance of some of the coefficients. Table 13 presents the VIF values of the explanatory variables.

Table 13: VIF values of the explanatory variables


VIF
TV 1.919631
DR 2.122950
DR.POSTEN 2.389954
OUTDOOR 2.487447
RADIO 3.113011
PRINT 1.584613
SOCIALMEDIA 1.077286
Rain 1.099352
sal 1.416964
HOLIDAY 1.114468

The condition number of the matrix (XᵀX)⁻¹, after normalizing the data matrix, is 16.0652, also indicating a moderate degree of multicollinearity. Although the multicollinearity detected in this case is not severe, this issue will be addressed in section 6.5, in order to provide Nepa with a strategy for cases of severe multicollinearity.
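Both diagnostics can be sketched as follows, reusing the assumed fit.ads object; note that conventions for computing the condition number vary, so this shows one common variant:

# Variance inflation factors and condition number of the design matrix
library(car)
vif(fit.ads)                              # VIF per variable, as in Table 13

X <- scale(model.matrix(fit.ads)[, -1])   # normalized data matrix
kappa(X, exact = TRUE)                    # condition number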
To assess autocorrelation, the plot of the residuals over time (Figure 10) was examined first.

Figure 10: Plot of residuals against time for the log-linear functional form with adstock model
considered for SOCIALMEDIA variable (equation 6.9)

The residuals in Figure 10 show shorter and longer runs on either side of the mean value. The D-W statistic is 1.582541, and the p-value associated with this statistic is 0.018, indicating that the residuals are positively autocorrelated. The estimate of the autocorrelation parameter is 0.1833852, meaning that the Durbin-Watson test assumes that the errors are driven by the following first-order autocorrelation process: ut = 0.1833852 · ut−1 + εt.
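The test can be reproduced with the durbinWatsonTest function from the car package, which also reports the estimated autocorrelation (again assuming the lm object fit.ads):

# Durbin-Watson test for first-order autocorrelation of the residuals
library(car)
durbinWatsonTest(fit.ads)   # returns rho estimate, D-W statistic and p-value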
An approach that specifically considers the autocorrelation structure is the Cochrane-Orcutt method described in [17]. The procedure behind this method is based on the estimation of the autocorrelation coefficient, followed by a transformation of the variables. With the autocorrelation parameter estimated as 0.1833852, the variables are transformed in the following way:

yt' = yt − 0.1833852 · yt−1,   t = 2, . . . , T
xkt' = xkt − 0.1833852 · xkt−1,   t = 2, . . . , T,   k = 1, . . . , K

Table 14 summarizes the results of fitting the transformed data with linear regression. For the transformed regression model the D-W statistic is 1.889608 and the p-value is 0.576, indicating that there is no problem of autocorrelation in the transformed model.
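A minimal sketch of the transformation, reusing regdata and the assumed fit.ads from above:

# Cochrane-Orcutt transformation with the estimated rho
rho    <- 0.1833852
n      <- nrow(regdata)
y      <- log(regdata$SALES_TOT)
y.star <- y[-1] - rho * y[-n]         # y'_t = y_t - rho * y_(t-1)
X      <- model.matrix(fit.ads)[, -1]
X.star <- X[-1, ] - rho * X[-n, ]     # same transform for each regressor
fit.co <- lm(y.star ~ X.star)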

Table 14: Coefficient estimates for the Cochrane-Orcutt method, log-linear functional form with adstock model considered for SOCIALMEDIA variable (equation 6.9)

##
## Call :
## lm ( formula = " SALES_TOT ~ TV + DR + DR . POSTEN + OUTDOOR + RADIO + PRINT + SOCIALMEDIA + Rain ..
mm .+ sal + HOLIDAY ",
## data = regdata )
##
## Residuals :
## Min 1Q Median 3Q Max
## -23290744 -7598418 -1641095 7793738 82440104
##
## Coefficients :
## Estimate Std . Error t value Pr ( >| t |)
## ( Intercept ) 4.226 e +07 3.988 e +06 10.598 < 2e -16 ***
## TV 2.239 e +01 2.567 e +00 8.721 9.59 e -14 ***
## DR 1.453 e +01 8.034 e +00 1.809 0.07369 .
## DR . POSTEN 1.742 e +01 5.417 e +00 3.216 0.00178 **
## OUTDOOR 3.756 e +01 1.035 e +01 3.631 0.00046 ***
## RADIO -6.633 e +01 2.957 e +01 -2.243 0.02724 *
## PRINT 6.304 e +00 5.876 e +00 1.073 0.28610
## SOCIALMEDIA 1.832 e +02 2.400 e +01 7.636 1.84 e -11 ***
## Rain .. mm . 2.796 e +05 1.461 e +05 1.913 0.05877 .
## sal 6.885 e +06 3.734 e +06 1.844 0.06834 .
## HOLIDAY 2.906 e +06 6.383 e +06 0.455 0.64999
## ---
## Signif . codes : 0 '*** ' 0.001 '** ' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error : 14380000 on 94 degrees of freedom
## Multiple R - squared : 0.8203 , Adjusted R - squared : 0.8012
## F - statistic : 42.92 on 10 and 94 DF , p - value : < 2.2 e -16

An alternative estimation method that deals with the problem of autocorrelation is the maximum likelihood method. As mentioned in [17], this method is attractive because it can be used when the structure of the errors is more complicated than an autoregressive process of order one. Table 15 shows the output from the maximum likelihood estimation assuming a first-order autoregressive process for the residuals, using the gls function in R. The autocorrelation parameter is estimated to be 0.234079, which is close to the value retrieved by the D-W test.
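A sketch of the call, assuming regdata holds the adstock-transformed SOCIALMEDIA variable of equation 6.9:

# ML estimation with AR(1) errors via generalized least squares
library(nlme)
fit.gls <- gls(log(SALES_TOT) ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO +
                 PRINT + SOCIALMEDIA + Rain..mm. + sal + HOLIDAY,
               data = regdata, correlation = corAR1(form = ~ 1),
               method = "ML")
summary(fit.gls)   # Phi is the estimated AR(1) parameter; here 0.234079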

Table 15: Coefficient estimates shown for the maximum likelihood estimation, log-linear functional form with adstock model considered for SOCIALMEDIA variable (equation 6.9)

## Generalized least squares fit by maximum likelihood


## Model : SALES_TOT ~ .
## Data : transformedregdata
## Log - likelihood : 94.48966
##
## Coefficients :
## ( Intercept ) TV DR DR . POSTEN OUTDOOR
## 1.787338 e +01 1.454923 e -07 1.241126 e -07 1.450152 e -07 2.815697 e -07
## RADIO PRINT SOCIALMEDIA Rain .. mm . sal
## -5.038014 e -07 9.843381 e -08 1.850852 e -06 1.631049 e -03 5.712213 e -02
## HOLIDAY
## -4.609447 e -02
##
## Correlation Structure : AR (1)
## Formula : ~1
## Parameter estimate (s):
## Phi
## 0.234079
## Degrees of freedom : 105 total ; 94 residual
## Residual standard error : 0.101172

6.4 Variable selection


The next step in building the model is variable selection. Since in this case a small number of variables is used, it is possible to perform best subset selection using the regsubsets() function in R. The best model for each given number of predictors (chosen according to RSS) is shown in Table 16.

Table 16: Best Subset Selection. The best model that contains a given number of predictors is
chosen according to RSS

## TV DR DR . POSTEN OUTDOOR RADIO PRINT SOCIALMEDIA Rain .. mm . sal HOLIDAY


## 1 ( 1 )"*" " " " " " " " " " " " " " " " " " "
## 2 ( 1 )"*" " " " " " " " " " " "*" " " " " " "
## 3 ( 1 )"*" " " "*" " " " " " " "*" " " " " " "
## 4 ( 1 )"*" " " "*" "*" " " " " "*" " " " " " "
## 5 ( 1 )"*" " " "*" "*" " " " " "*" " " "*" " "
## 6 ( 1 )"*" " " "*" "*" "*" " " "*" " " "*" " "
## 7 ( 1 )"*" " " "*" "*" "*" " " "*" "*" "*" " "
## 8 ( 1 )"*" "*" "*" "*" "*" " " "*" "*" "*" " "
## 9 ( 1 )"*" "*" "*" "*" "*" "*" "*" "*" "*" " "
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"

Figure 11 displays the plots of RSS, adjusted R2 , Cp , and BIC for all of the models at once. It
can be seen that both adjusted R2 and Cp choose the model with 9 variables, while BIC chooses
the model with 6 variables. As mentioned in [6], the BIC statistic generally places a heavier penalty
on models with many variables, and hence results in the selection of smaller models than Cp .
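A sketch of the best subset search behind Figure 11, assuming the regdata data frame from above:

# Best subset selection over the ten candidate predictors
library(leaps)
best <- regsubsets(log(SALES_TOT) ~ TV + DR + DR.POSTEN + OUTDOOR + RADIO +
                     PRINT + SOCIALMEDIA + Rain..mm. + sal + HOLIDAY,
                   data = regdata, nvmax = 10)
bs <- summary(best)
which.max(bs$adjr2)   # 9, the size favoured by adjusted R^2
which.min(bs$cp)      # 9, the size favoured by Cp
which.min(bs$bic)     # 6, the size favoured by BIC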

Figure 11: RSS, adjusted R2 , Cp , and BIC shown for the best models of each size

Figure 12: Adjusted R2 , Cp and BIC for log-linear functional form with adstock model considered
for SOCIALMEDIA variable (equation 6.9)

Figure 12 displays the selected variables for the best model with a given number of predictors, ranked according to adjusted R2, Cp and BIC.
One can also choose among a set of models of different sizes using the validation set and cross-validation approaches. For these approaches to yield accurate estimates of the test error, only the training observations must be used: the observations are split into a training set and a test set, and best subset selection is performed only on the training observations. The validation set error is then computed for the best model of each model size, giving the results in Table 17.

Table 17: Validation set errors for the best model of each model size

Model size   Validation set error
1 0.03964116
2 0.03523362
3 0.04432925
4 0.02418651
5 0.02353050
6 0.01786729
7 0.01690079
8 0.01579381
9 0.01565193
10 0.01540573

The best model is found to be the one that contains ten variables. Next, best subset selection would have to be performed on the full dataset to select the best ten-variable model; but since this is the full model, the coefficients are simply re-estimated on the full dataset, and the estimates are the ones from Table 10.
To choose among the models of different sizes using cross-validation, best subset selection is performed within each of the k training sets. First, each observation is allocated to one of k = 10 folds. Next, each fold in turn is used as a test set for the best subset selection procedure, with the rest of the data used as the training set. The test errors are stored in a matrix, and the average over the columns of this matrix yields a vector whose j-th element is the cross-validation error for the j-variable model, j = 1, . . . , 10. Figure 13 shows that cross-validation selected the three-variable model. Performing cross-validation multiple times for the dataset of the current case study, the cross-validation error always dropped at the three-variable model, followed by a rise, and finally decreased again towards the ten-variable model, as shown in Figure 13. If Nepa preferred a more parsimonious model, the three-variable model would be selected; otherwise it is also possible to choose the full model.
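A sketch of this loop in the style of [6], assuming regdata from above; the small helper makes predict() work for objects returned by regsubsets():

# Helper: predictions from a regsubsets fit of a given size 'id'
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  drop(mat[, names(coefi), drop = FALSE] %*% coefi)
}

library(leaps)
set.seed(1)
k      <- 10
folds  <- sample(rep(1:k, length.out = nrow(regdata)))
cv.err <- matrix(NA, k, 10)
for (j in 1:k) {
  fit.bs <- regsubsets(log(SALES_TOT) ~ TV + DR + DR.POSTEN + OUTDOOR +
                         RADIO + PRINT + SOCIALMEDIA + Rain..mm. + sal +
                         HOLIDAY,
                       data = regdata[folds != j, ], nvmax = 10)
  for (i in 1:10) {
    pred <- predict(fit.bs, regdata[folds == j, ], id = i)
    cv.err[j, i] <- mean((log(regdata$SALES_TOT[folds == j]) - pred)^2)
  }
}
colMeans(cv.err)   # cross-validation error per model size, as in Figure 13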

Figure 13: Cross-validation errors for the log-linear functional form with adstock model considered
for SOCIALMEDIA variable (equation 6.9)

It is important to perform best subset selection on the full dataset to obtain reliable estimates
for the three-variable model. Results are shown in Table 18.
Table 18: Parameter Estimates for the three-variable model
(Intercept) 1.793679e+01
TV 1.739107e-07
DR.POSTEN 2.131372e-07
SOCIALMEDIA 1.948903e-06

6.5 Ridge regression


As seen in section 6.3, although some of the variables show correlation, there is no strong evidence of severe multicollinearity. But even though the best linear unbiased estimator of the coefficients is given by the ordinary least squares (OLS) estimator (the Gauss-Markov theorem), the least squares estimates might have high variance, making the estimates inefficient for out-of-sample data. As the purpose of the thesis is to develop a general model building strategy which Nepa will use for future projects, the next step is to compare the performance of OLS estimation with different shrinkage methods. To select the method which is most suitable for the current data, all regularization methods presented in section 5 will be applied to equation 6.9 and their prediction accuracy will be compared.
Ridge regression was performed using the function glmnet in R, over a grid of values ranging from λ = 10^10 to λ = 10^−2. This grid essentially covers all scenarios, from the model containing only the intercept to the least squares fit. It is recommended to standardize the variables before performing ridge regression, so that all the variables are on the same scale. The function glmnet does this automatically, returning the coefficient estimates of the variables on the original scale.
In order to estimate the optimal parameter λ, ten-fold cross-validation was performed, using the cv.glmnet function in R. First, the ridge regression model is fitted on the training set. Next, cross-validation is used to choose the tuning parameter λ that gives the smallest cross-validation error, and for each value of λ the test MSE is calculated. Finally, the ridge regression model is refitted on the full dataset, using the value of λ chosen by cross-validation.
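A sketch of this workflow, where x is an assumed model matrix of the predictors in equation 6.9, y the log-transformed sales, and train and test assumed index vectors for the split:

# Ridge regression (alpha = 0) over the lambda grid, tuned by 10-fold CV
library(glmnet)
grid     <- 10^seq(10, -2, length.out = 100)
cv.ridge <- cv.glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
c(cv.ridge$lambda.min, cv.ridge$lambda.1se)   # the two candidate lambdas

ridge.fit <- glmnet(x, y, alpha = 0, lambda = grid)
pred <- predict(ridge.fit, s = cv.ridge$lambda.min, newx = x[test, ])
mean((y[test] - pred)^2)                      # test MSE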

Figure 14: Ridge regression on the full dataset. Upper left: L1 norm against scaled coefficients. Upper right: Log lambda against scaled coefficients. Lower left: Fraction of deviance explained against scaled coefficients. Lower right: Log(lambda) against MSE.

Figure 14 shows the plots of the l1 norm, ln(λ) and the fraction of deviance explained against the coefficient estimates, as well as the plot of ln(λ) against the mean squared error (lower right). At the top of each graph the number of nonzero coefficients is indicated.
Besides choosing the λ value that gives the smallest cross-validation error, one can also choose the value of λ which gives the most regularized model such that the error is within one standard error of the minimum. Table 19 presents the estimated coefficients for lambda.min and lambda.1se chosen by cross-validation. As expected, none of the coefficients is exactly zero, as ridge regression does not perform variable selection. Figure 15 shows the plots of predicted values against actual values of sales, for the ridge coefficients from Table 19. The coefficients corresponding to lambda.min predict sales more accurately, since they are chosen in such a way that the cross-validation error is minimal.

Table 19: Ridge regression coefficient estimates for lambda.min and lambda.1se chosen by cross-validation

## lambdaminridge = 0.0160694261981513

## lambda1seridge = 0.124419731067532

## msetest . ridge . lambdamin = 0.0188158022878513

## msetest . ridge . lambda1se = 0.0131380574237442

## 11 x 2 sparse Matrix of class " dgCMatrix "


## ridge . coef . lambdamin ridge . coef . lambda1se
## ( Intercept ) 1.788044 e +01 1.793414 e +01
## TV 1.390966 e -07 1.014373 e -07
## DR 1.338492 e -07 1.514207 e -07
## DR . POSTEN 1.372515 e -07 1.164731 e -07
## OUTDOOR 2.476422 e -07 1.450592 e -07
## RADIO -3.331841 e -07 2.032320 e -08
## PRINT 8.226841 e -08 6.783974 e -08
## SOCIALMEDIA 1.821864 e -06 1.402040 e -06
## Rain .. mm . 2.124742 e -03 1.716374 e -03
## sal 5.090904 e -02 3.908197 e -02
## HOLIDAY -4.598482 e -02 -5.182046 e -02

Figure 15: Ridge regression fit for lambda.min (top) and lambda.1se (bottom) chosen by cross-validation

6.6 The Lasso
To perform the lasso, the glmnet function in R is used again, for the same range of λ. Following the same cross-validation procedure as for ridge regression, the coefficient estimates and test MSE are obtained for lambda.min and lambda.1se chosen by cross-validation. The lasso test MSE is close to the ridge test MSE. However, the lasso has a substantial advantage over ridge regression in that it also performs variable selection. The results are shown in Table 20. For the largest λ at which the MSE is within one standard error of the minimal MSE, two coefficient estimates are zero: RADIO and HOLIDAY.
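The lasso is the α = 1 case of the glmnet workflow sketched above, reusing the assumed x, y, train and grid objects:

# The lasso (alpha = 1), tuned on the same lambda grid
cv.lasso <- cv.glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
coef(cv.lasso, s = "lambda.1se")   # sparse: RADIO and HOLIDAY drop to zero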
Table 20: Lasso coefficient estimates for lambda.min and lambda.1se chosen by cross-validation

## lambdaminlasso = 0.00294189136833369

## lambda1selasso = 0.0207544620232037

## msetest . lasso . lambdamin = 0.0192751123266679

## msetest . lasso . lambda1se = 0.0190073954177291

## 11 x 2 sparse Matrix of class " dgCMatrix "


## lasso . coef . lambdamin lasso . coef . lambda1se
## ( Intercept ) 1.788390 e +01 1.796400 e +01
## TV 1.509192 e -07 1.479784 e -07
## DR 1.135039 e -07 7.858522 e -08
## DR . POSTEN 1.390981 e -07 1.317557 e -07
## OUTDOOR 2.518648 e -07 1.071373 e -07
## RADIO -3.414987 e -07 .
## PRINT 6.911367 e -08 9.852782 e -09
## SOCIALMEDIA 1.879224 e -06 1.606965 e -06
## Rain .. mm . 1.902475 e -03 3.636966 e -05
## sal 5.072095 e -02 1.575444 e -02
## HOLIDAY -3.266459 e -02 .

Figure 16 shows the path of each coefficient against the l1-norm, ln(λ) and the fraction of deviance explained, as well as the plot of ln(λ) against the mean squared error (lower right). At the top of each graph the number of nonzero coefficients is indicated. Figure 17 shows the plots of predicted against actual sales, using the lasso coefficients from Table 20.

Figure 16: The lasso on the full dataset. Upper left: L1 norm against scaled coefficients. Upper right: Log lambda against scaled coefficients. Lower left: Fraction of deviance explained against scaled coefficients. Lower right: Log(lambda) against MSE.

Figure 17: Lasso fit for lambda.min (top) and lambda.1se (bottom) chosen by cross-validation

6.7 Naive elastic net
As described in section 5.2.6, the naive elastic net penalty is a convex combination of the lasso and ridge penalties:

(1 − α) Σ_{j=1}^{p} |βj| + α Σ_{j=1}^{p} βj² ≤ s

To choose the optimal parameter α, the function cv.glmnet was called with a pre-computed vector foldid, and then this same fold vector was used in separate calls to cv.glmnet with different values of α. Note that in the glmnet package in R the penalty is defined as

(1 − α)/2 · Σ_{j=1}^{p} βj² + α Σ_{j=1}^{p} |βj| ≤ s

It can be seen in Figure 18 that ridge does about the best for the given dataset, so it seems reasonable to choose a value of α closer to ridge. Calling the cv.glmnet function with parameter alpha = 0.1 yields the results shown in Table 21 and in Figures 19 and 20.

Figure 18: The standardized coefficients as a function of λ, displayed for several values of α
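A sketch of the foldid comparison, reusing the assumed x and y from above:

# Compare several alpha values on identical cross-validation folds
set.seed(1)
foldid <- sample(rep(1:10, length.out = nrow(x)))
cvs <- lapply(c(ridge = 0, a0.1 = 0.1, a0.5 = 0.5, lasso = 1),
              function(a) cv.glmnet(x, y, foldid = foldid, alpha = a))
sapply(cvs, function(m) min(m$cvm))   # minimum CV error per alpha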

Table 21: Naive elastic net coefficient estimates for lambda.min and lambda.1se chosen by cross-validation

## lambdaminelnet = 0.0202773160550586

## lambda1seelnet = 0.108213937001959

## msetest . elnet . lambdamin = 0.0189826745876587

## msetest . elnet . lambda1se = 0.0120703463671487

## 11 x 2 sparse Matrix of class " dgCMatrix "


## elnet . coef . lambdamin elnet . coef . lambda1se
## ( Intercept ) 1.789012 e +01 1.795649 e +01
## TV 1.380539 e -07 1.065173 e -07
## DR 1.321510 e -07 1.361582 e -07
## DR . POSTEN 1.335377 e -07 1.171296 e -07
## OUTDOOR 2.179913 e -07 1.278893 e -07
## RADIO -2.296430 e -07 .
## PRINT 7.185855 e -08 4.996454 e -08
## SOCIALMEDIA 1.805622 e -06 1.357760 e -06
## Rain .. mm . 1.947446 e -03 1.036418 e -03
## sal 4.743022 e -02 2.685974 e -02
## HOLIDAY -3.979584 e -02 -2.557000 e -02

Figure 19: Naive elastic net on the full dataset. Upper left: L1 norm against scaled coefficients. Upper right: Log lambda against scaled coefficients. Lower left: Fraction of deviance explained against scaled coefficients. Lower right: Log(lambda) against MSE.

Figure 20: Naive elastic net fit for lambda.min (top) and lambda.1se (bottom) chosen by cross-validation

It is clear that even the naive elastic net outperforms ridge and the lasso.

6.8 Elastic net


The elastic net has the advantages of both variable selection and continuous shrinkage, similar to the lasso. The elastic net method was performed using the elasticnet package in R. The optimal parameters were chosen by ten-fold cross-validation, as described in section 5.2.7. Figure 21 shows the elastic net estimates and the solution path for λ2 = 1 as a function of s, where s refers to the ratio of the l1 norm of the coefficient vector relative to the norm of the full LS solution. The minimum cross-validation error is obtained around the value s = 0.5. Table 22 shows the elastic net coefficients for λ2 = 1 and s = 0.47. The predicted sales using the elastic net method are plotted together with the actual sales in Figure 22. The mean-squared test error for λ2 = 1 and s = 0.47 is 0.01188888.
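A sketch of the calls, reusing the assumed x, y and test objects from above:

# Elastic net via the elasticnet package: CV over the fraction s at lambda = 1
library(elasticnet)
cv.en  <- cv.enet(x, y, lambda = 1, s = seq(0, 1, length.out = 100),
                  mode = "fraction", K = 10)
en.fit <- enet(x, y, lambda = 1)
predict(en.fit, newx = x[test, ], s = 0.47,
        mode = "fraction", type = "fit")$fit   # predictions at s = 0.47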

Figure 21: Left : Elastic net estimates (λ2 = 1) as a function of s. Right : solution path (λ2 = 1) as
a function of s

Figure 22: Elastic net fit for λ2 = 1 and s = 0.47 chosen by cross-validation

To include the seasonal effect, the model was extended by adding a trend variable and 51 dummy variables for seasons, as described in section 2.3. Table 23 shows the coefficients for the extended model with λ2 = 0.1 and s = 0.4. It can be seen that the method selects only some of the seasonal variables, allowing a partial seasonal adjustment.

Table 22: Elastic net coefficient estimates for λ2 = 1 and s = 0.47 chosen by cross-validation

## $s
## [1] 0.47
##
## $fraction
## 0
## 0.47
##
## $mode
## [1] " fraction "
##
## $coefficients
## TV DR DR . POSTEN OUTDOOR RADIO PRINT
## 1.320210 e -07 1.511765 e -07 1.506639 e -07 5.525586 e -08 0.000000 e +00 0.000000 e +00
## SOCIALMEDIA Rain .. mm . sal HOLIDAY
## 1.091444 e -06 0.000000 e +00 0.000000 e +00 0.000000 e +00

Table 23: Elastic net coefficient estimates with season and trend variables

## $s
## [1] 0.4
##
## $fraction
## 0
## 0.4
##
## $mode
## [1] " fraction "
##
## $coefficients
## S1 S2 S3 S4 S5
## 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00
## S6 S7 S8 S9 S10
## 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00
## S11 S12 S13 S14 S15
## 0.000000 e +00 0.000000 e +00 0.000000 e +00 -3.310320 e -02 0.000000 e +00
## S16 S17 S18 S19 S20
## 0.000000 e +00 0.000000 e +00 -2.827022 e -02 0.000000 e +00 -2.507911 e -02
## S21 S22 S23 S24 S25
## -6.727018 e -02 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00
## S26 S27 S28 S29 S30
## 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00
## S31 S32 S33 S34 S35
## 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00 4.078321 e -02
## S36 S37 S38 S39 S40
## 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00
## S41 S42 S43 S44 S45
## 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00 0.000000 e +00
## S46 S47 S48 S49 S50
## 0.000000 e +00 0.000000 e +00 2.042501 e -02 0.000000 e +00 1.358926 e -01
## S51 trend TV DR DR . POSTEN
## 1.703252 e -01 0.000000 e +00 1.499851 e -07 1.342277 e -07 1.178116 e -07
## OUTDOOR RADIO PRINT SOCIALMEDIA Rain .. mm .
## 7.377897 e -08 0.000000 e +00 1.595277 e -08 1.445939 e -06 1.432973 e -04
## sal HOLIDAY
## 2.989432 e -02 0.000000 e +00
7 Conclusions & Recommendations
This thesis illustrates an application of modern approaches of statistical learning to a set of data provided by Nepa. The goal of the thesis is to construct a model building strategy suitable for a high level of complexity of the data, with the ambition to tackle several difficulties encountered when statistical analysis is applied to marketing economics. A marketing mix model must address all elements of the problem being studied. In the specification step, one of such elements is the choice of the appropriate functional form. To find the suitable specification, which describes the relationship between the dependent and independent variables, the RESET test and the Box-Cox transformation of the response variable were used. The plots of the residuals against each predictor variable, as well as the tests above, suggest that the log-linear specification is appropriate. Several subset selection methods were employed on the log-linear model; the results of the validation set and cross-validation approaches justify the choice of the full model. To adapt the model to the dynamic marketing behavior, the optimal lag weight parameters can be found with the Levenberg-Marquardt algorithm, using the nlsLM function in R.
Since the purpose is both explanatory and predictive analysis, the assumptions made in section 4 must hold in order to be able to perform statistical inference based on the obtained point estimates. To sum up, the results show that the error terms cannot be assumed to be uncorrelated. The solution proposed was to employ the Cochrane-Orcutt method, an approach that specifically considers the autocorrelation structure, or to use alternative estimation methods, such as the method of maximum likelihood. The testing of the assumptions also shows that the data exhibits a mild degree of multicollinearity. A comparison of several estimation methods is provided, so that Nepa can use this thesis as a guideline for future marketing mix modelling projects that involve data with severe multicollinearity. Regularization methods were performed using the glmnet and elasticnet packages in R; note that the penalty in the glmnet package is defined differently from the penalty in the elasticnet package. Table 24 shows the performance of ridge regression, the lasso, the naive elastic net and the elastic net applied to the same training set and validation set. Model fitting and tuning parameter selection by tenfold cross-validation (CV) were carried out on the training data, and the performance of the methods was then compared by computing their prediction mean-squared error (MSE) on the test data. Although the differences between the methods are not large, the lowest test MSE is achieved by the elastic net, which also selected the smallest number of variables.
Method Parameters test MSE Variables selected
Ridge regression λ1 = 0, λ2 = 0.01606943 ∗ 2 0.0188158 All
Ridge regression λ1 = 0, λ2 = 0.1244197 ∗ 2 0.01313806 All
Lasso λ1 = 0.002941891, λ2 = 0 0.01927511 All
Lasso λ1 = 0.02075446, λ2 = 0 0.0190074 (1,2,3,4,6,7,8,9)
Naive elastic net λ1 = 0.02027732/0.1, λ2 = 0.02027732 ∗ 2/(1 − 0.1) 0.01898267 All
Naive elastic net λ1 = 0.1082139/0.1, λ2 = 0.1082139 ∗ 2/(1 − 0.1) 0.01207035 (1,2,3,4,6,7,8,9,10)
Elastic net λ2 = 1, s = 0.47 0.01188888 (1,2,3,4,7)

Table 24: Comparing the mean-squared error of the regularization methods

Figure 23: Mean-squared test errors illustrated for different methods. It can be seen that OLS performs worst in terms of prediction accuracy

The mean-squared test errors of the models above are also illustrated in comparison with OLS in Figure 23. The results show that while the elastic net produces a model with fewer variables, its prediction accuracy is higher than that of the other estimation methods.

References
[1] David M. Blei. Regularized Regression. Columbia University. 2015.
[2] Charlotte H. Mason and William D. Perreault, Jr. Collinearity, Power, and Interpretation of Multiple Regression Analysis. Journal of Marketing Research, Vol. 28, No. 3 (Aug., 1991), pp. 268-280. 1991.
[3] Csilla Horvath, Peter S.H. Leeflang, and Marcel Kornelis. What marketing scholars should know about Time Series Analysis: Time Series applications in marketing. 2002.
[4] Russell Davidson and James G. MacKinnon. Estimation and Inference in Econometrics. 1993.
[5] J. Durbin. Testing for Serial Correlation in Least-Squares Regression When Some of the Regressors are Lagged Dependent Variables. Econometrica, Vol. 38, No. 3 (May, 1970), pp. 410-421. 1970.
[6] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. 2013.
[7] Henri P. Gavin. The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems. 2016.
[8] Alan J. Izenman. Modern Multivariate Statistical Techniques. Springer Texts in Statistics. 2008.
[9] J. Durbin and G. S. Watson. Testing for Serial Correlation in Least Squares Regression: I. Biometrika, Vol. 37, No. 3/4 (Dec., 1950), pp. 409-428. 1950.
[10] J. Durbin and G. S. Watson. Testing for Serial Correlation in Least Squares Regression: II. Biometrika, Vol. 38, No. 1/2 (Jun., 1951), pp. 159-177. 1951.
[11] Jack Johnston and John DiNardo. Econometric Methods, Fourth Edition. 1997.
[12] Peter Kennedy. A Guide to Econometrics, 6th Edition. 2008.
[13] Harald Lang. Elements of Regression Analysis. 2016.
[14] Leeflang P., Wittink D.R., Wedel M., and Naert P.A. Building Models for Marketing Decisions. Springer International Series in Quantitative Marketing. 2000.
[15] Dominique M. Hanssens, Leonard J. Parsons, and Randall L. Schultz. Market Response Models: Econometric and Time Series Analysis. Volume 12. 2001.
[16] Peter S.H. Leeflang, Jaap E. Wieringa, Tammo H.A. Bijmolt, and Koen H. Pauwels. Modeling Markets: Analyzing Marketing Phenomena and Improving Marketing Decision Making. International Series in Quantitative Marketing. 2015.
[17] Douglas C. Montgomery. Introduction to Linear Regression Analysis, Fifth Edition. 2013.
[18] Philip Hans Franses and Rutger van Oest. On the econometrics of the Koyck model. 2004.
[19] J.B. Ramsey. Classical model selection through specification tests. In: Frontiers in Econometrics. Academic Press, New York. 1974.
[20] J.B. Ramsey. Tests for specification errors in classical linear least squares regression analysis. 1969.
[21] T. S. Breusch and A. R. Pagan. A Simple Test for Heteroscedasticity and Random Coefficient Variation. Econometrica, Vol. 47, No. 5 (Sep., 1979), pp. 1287-1294. 1979.
[22] Halbert White. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica, Vol. 48, No. 4 (May, 1980), pp. 817-838. 1980.
[23] Wayne Winston. Marketing Analytics: Data-Driven Techniques with Microsoft Excel, 1st Edition. 2014.
[24] Jeffrey M. Wooldridge. Introductory Econometrics: A Modern Approach, 5th Edition. 2012.
[25] Elena Yusupova. Additive versus Multiplicative Marketing Mix Model. 2013. URL: http://analytics.sd-group.com.au/blog/additive-versus-multiplicative-marketing-mix-model/.
[26] Hui Zou and Trevor Hastie. Regularization and Variable Selection via the Elastic Net. URL: http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf.
[27] Hui Zou and Trevor Hastie. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 67, No. 2 (2005), pp. 301-320. 2005.
