Jol van Kesteren, 10001962, MSc Econometrics
Contents

1 Introduction
2 Theory
2.1 Digital advertising
2.2 Digital advertising channels
2.3 Multi Touch Attribution
2.3.1 Criteria
2.3.2 Models
2.3.3 Theoretical evaluation
3 Method
3.1 Models
3.2 Classification accuracy
3.3 Simulation performance
4 Data
4.1 Data description
4.2 Data processing
4.2.1 Relevant visit selection
4.2.2 Prospect identification
4.2.3 Create prospect journeys
4.3 Data insights
5 Results
5.1 Model estimation and attribution
5.1.1 Rule-based heuristics
5.1.2 Simple probabilistic model
5.1.3 Logistic regression models
5.1.4 Markov chain models
5.2 Classification accuracy
5.3 Simulation
5.3.1 Markov chain simulations
5.3.2 Additional simulation study
6 Conclusion
7 Appendix
7.1 Regression output Logistic extensions
Chapter 1
Introduction
At this moment, more than 3 billion people around the world use the internet (internetlivestats.com/internet-users). This number has been growing explosively since the introduction of the world wide web. With this vast reach, the web offers tremendous opportunities for marketing purposes. It is not surprising that digital marketing has been growing likewise, making it a $121 billion industry in 2014 with a year-on-year growth of 16% (ZenithOptimedia, 2014). In addition to the potential reach digital marketing offers, it has two other significant advantages over traditional media. Firstly, an online advertisement can be uniquely tailored to each individual, providing extensive customization possibilities. Secondly, all visits of internet users can be tracked and stored, enabling precise tracking of the number of views an advertisement gets and the number of subsequent product purchases or conversions. Theoretically, this gives marketeers the opportunity to accurately evaluate online advertisements or the channels that serve them. Typical online channels are search engine advertising, email, display and social media.
However, in practice this evaluation of channels turns out to be rather challenging. Since
potential customers or prospects typically ‘touch’ multiple channels before converting, the
contribution of each of those channels should be determined. The problem of determining
the contribution of the channels a prospect touches before conversion is referred to in the
literature as the attribution problem. Traditionally, the full conversion credit is assigned
to the last channel a prospect touches prior to conversion, a method called last touch
attribution. However, it can be easily seen that this method is fundamentally flawed,
since it completely ignores the contribution of channels in prior touches. Having realized
this flaw, both the business world and academia have devoted themselves to a solution to
the attribution problem. The result is that a plethora of attribution methods and models
have been proposed.
The first alternatives to last touch attribution that were proposed are first touch attribution (assigning all conversion credit to the first touch) and linear attribution (assigning equal credit to each touchpoint). However, all three are still rule-based methods that a priori impose a certain weight on all touches without actually consulting the data. In response, Shao and Li (2011) propose two data-driven attribution techniques:
a (bagged) logistic regression model and a simple probabilistic model. Dalessandro et al.
(2012) further refine this probabilistic model, and demonstrate that it is fundamentally
equal to the well-known Shapley Value from cooperative game theory (Shapley, 1953).
Anderl et al. (2014) introduce an entirely different solution to the attribution problem,
modelling the prospect paths as Markov chain models and calculating the attribution
through a Removal Effect. Other research has tackled the issue of attribution through
Survival Theory (Zhang et al., 2014) or Bayesian models (Li and Kannan, 2014). In short, an almost chaotic abundance of attribution methods and models has been put forward, each author advocating his or her own solution. The extant literature evidently fails to create order by addressing the question of which of all those methods is the best. Addressing and answering this question is the main goal of this thesis.
In order to do so, this thesis develops a methodology to evaluate attribution models
on three dimensions: theoretically, empirically and in a simulation study. It considers
the rule-based methods (last touch, first touch and linear), the probabilistic model, the
logistic regression model and the Markov chain models. In the theoretical analysis, seven
criteria are distilled from the extant literature and formulated by examining an abstract
conception of the perfect attribution model. The attribution models are then evaluated
in the light of these theoretical criteria. Empirically, we examine the performance of the models by their out-of-sample classification accuracy. The data for the empirical study
contains all visits to the website of a large Dutch financial institution over a period of ten months, including variables on the timing, the channel and whether a conversion takes place. By
producing out-of-sample conversion classifications for this dataset and determining the
Area Under the ROC Curve (AUC), the predictive performance of the models can be ob-
jectively assessed. Underlying this assessment is the assumption that accurate prediction
implies accurate attribution. The third component of our study consists of simulating
a wide variety of scenarios in which the true attribution is known, and calculating the
Mean Absolute Error of the models with this true attribution. This simulation study is
useful in determining which attribution method performs best under which circumstances found in the data, and provides another objective criterion with which to evaluate the models.
In addition to providing an answer to the question of which of the considered attribution models performs best, the most important contribution of this thesis is a standardized methodology for evaluating attribution models.
Chapter 2
Theory
In this chapter, the theory and literature behind digital marketing and attribution mod-
elling are discussed. First, Section 2.1 discusses digital advertising, its advantages and
preconditions. Section 2.2 introduces the most common digital channels. Finally, Section
2.3 reviews the literature on multi touch attribution. Theoretical criteria for a good at-
tribution model are derived and the most important attribution models are introduced,
discussed and evaluated in the light of those criteria.
level, enabling accurate performance evaluations. In the next paragraphs we will further elaborate on the latter two advantages of online advertising.
In the marketing literature, the effect of customization has been studied extensively.
Ansari and Mela (2003) for instance argue that customized and targeted advertisements
attract customer attention and foster customer loyalty, therewith having a considerably
higher probability of persuading customers to a desired end. Customized advertisements
are - if targeted appropriately - much more capable of fulfilling a customer’s need than
broad and general advertisements. Advertisements can be personalised through their content, message or visual representation. However, customization through traditional mass
media such as television or radio is only possible at a collective level.
In contrast, digital advertising has the major advantage that its advertisements
can be tailored to each unique individual. An advertiser can decide to change the ad-
vertisement based on the past browsing history or collected preference information of a
potential customer. This can be done through models and algorithms, making the ‘e-
customization' quick and easy. The research of Ansari and Mela (2003) is among the first to develop such a model, with the purpose of optimally customizing the content and representation of e-mails. They find that the expected click-through rate of these emails can be
increased by 62%.
Another major advantage of online marketing as opposed to traditional marketing is
that the performance of digital advertisements can be evaluated much more accurately
than ads from traditional media. Every impression, click and conversion per advertise-
ment is recorded on an individual level. The digital advertising medium is therefore
perfectly suited to accurately evaluate how many conversions and how much revenue each advertisement brings in. This enables marketeers to calculate each advertisement's marketing
Return On Investment (ROI). Based on this ROI, budget allocation to the different online
advertisements and channels can be improved, eventually resulting in a more profitable
firm.
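As a concrete illustration, consider a minimal sketch in R (the language used for all analyses in this thesis) of such an ROI calculation; the revenue and cost figures are purely hypothetical:

    # marketing Return On Investment: net return as a fraction of the spend
    roi <- function(attributed_revenue, cost) (attributed_revenue - cost) / cost
    roi(attributed_revenue = 120000, cost = 80000)  # 0.5, i.e. a 50% return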
In contrast, analysis of the performance of traditional media such as television and
radio can only be done through aggregated data or expensive and untrustworthy surveys.
Say, for instance, that we want to evaluate the performance of a large television campaign
for a hotel chain. Our best option is to compare the number of bookings during our
campaign period with the baseline number of bookings. The additional bookings can be
attributed to the television advertisement. However, this method is based on a strong
assumption, since all other factors explaining the variation in the number of bookings
are ignored. Moreover, this method becomes complex when multiple advertisements are
displayed on different channels. Alternatively, some of the customers might be asked to
fill out a questionnaire asking them which channel predominantly induced them to book.
However, such surveys are generally unreliable, for reasons such as the difficulty of acquiring a representative sample or the ignorance and forgetfulness of participants.
A precondition to both advantages of online advertising - that is the possibility of
customization and improved performance evaluation - is the ability to identify unique
persons from the data. If this precondition is fulfilled, we can reconstruct full online
prospect or customer journeys, containing all visits, touchpoints or touches (all concepts
are used interchangeably in this thesis) a person makes prior to converting. This identi-
fication of individuals across multiple visits turns out to be non-trivial. In the literature, this is usually done using HTTP cookies. Additionally, this thesis advocates the use of IP-addresses.
Websites send cookies, small pieces of data, to a user's web browser when the user visits the website for the first time. These cookies are then saved locally on the user's device. Each subsequent time the user visits the website with the same device, the browser notifies the website that it concerns the same web browser and device. It is likely that this subsequent visit, containing the same cookie, pertains to the same person. This
is how cookies identify a user across multiple visits.
However, solely using cookies to reconstruct full customer journeys, although common
in the literature, has its shortcomings. The predominant reason is that cookies are not
able to identify a person across all visits. To illustrate this, suppose a user visits the
website multiple times from different browsers or different devices. These visits cannot
be related to the same individual by the use of cookies alone. Moreover, cookie tracking
can be disabled. Cookies can therefore only relate part of a user’s visits to this same user.
A second disadvantage is that different persons can visit the website on the same device, user account and browser, in which case cookies consider them the same individual. However, since people increasingly use their own personal devices, we assume this risk to be small.
The limitation that cookies cannot bundle all visits pertaining to the same user can be partly overcome by complementing information from cookies with information from the public IP-address of a visit. A public IP-address is a numerical label that is unique to an
internet connection. Visits with the same IP-address are therefore likely to come from the same household, and thus possibly from the same person. However, relating visits to a unique person this
way should be done with great caution, since multiple persons may form a household.
Moreover, public institutions such as offices, libraries or universities generally have a
single IP-address to which multiple people connect. Due to these drawbacks, combining
visits based on IP-addresses is unusual in the literature. However, with inclusion of
some restraining conditions, we argue in Chapter 4 that IP-bundling can be done to
further fine-tune our prospect identification across multiple visits. By using cookie and
IP-address information intelligently, a full online customer journey can be reconstructed.
2.2 Digital advertising channels
achieve a higher ranking (search engine optimization, SEO). Both SEA and SEO are
important channels for advertisers, since more than 90% of all internet users make use of search engines to acquire information and orientate on the products they need or desire (Pew Internet Survey, May 2011). Advertised SEA results are displayed above the organic SEO results.
The position of a specific SEA result is dependent on the bid of the advertiser, a
quality score and the expected impact of possible extensions. The bid of advertisers on
keywords is expressed as a certain amount paid per click (Cost per Click or CPC). As long as the user does not click on the SEA result of his or her query, the advertiser therefore incurs no costs. This explains why search engines additionally base the SEA result positioning on a quality score, which is a function of the expected number of clicks (click-through rate or CTR), the relevance of an ad and a user's landing page experience. Finally,
the impact of extensions, such as features that show extra business information (e.g. a
telephone number or address), is taken into account. The amount an advertiser pays is the minimum it would have had to bid to beat the advertiser one position below, which is a special case of the Vickrey auction (Vickrey, 1961). In practice, the paid amount is
usually significantly lower than the bid, especially when one has a good quality score.
Skiera and Nabout (2013) develop a model to find the optimal bid that leads to the highest profit for each keyword. In their model, they presuppose a causal relation between position and the relative number of clicks (CTR) and estimate this relation statistically. They find that a higher (lower-numbered) position yields a higher CTR, which confirms the intuition that users scan their results from top to bottom.
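To make the pricing mechanism concrete, the following R sketch works through a stylized example. The ranking rule (ad rank as the product of bid and quality score) is a simplifying assumption for illustration, not the exact formula of any particular search engine, and the bids and quality scores are made up:

    bids    <- c(A = 2.00, B = 1.50, C = 3.00)  # CPC bids in euros
    quality <- c(A = 0.9,  B = 0.6,  C = 0.4)   # stylized quality scores
    ad_rank <- bids * quality
    ranked  <- names(sort(ad_rank, decreasing = TRUE))
    for (p in seq_len(length(ranked) - 1)) {
      payer <- ranked[p]
      below <- ranked[p + 1]
      # pay the minimum bid needed to beat the ad rank one position below
      price <- ad_rank[below] / quality[payer]
      cat(sprintf("Position %d: %s pays %.2f on a bid of %.2f\n",
                  p, payer, price, bids[payer]))
    }

In this example advertiser A wins the top position with a bid of 2.00 but pays only 1.33, illustrating that a good quality score lowers the amount actually paid.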
Within the realm of SEA, two sub-channels are distinguished based on the nature of
the keywords: branded or non-branded SEA. Branded keywords specifically refer to the
advertised brand. When someone searches branded keywords, one might assume (s)he prefers to purchase a product from, or acquire information from, that particular company. Since keywords that include a publisher's brand are highly relevant to the internet user, their quality score will generally be unbeatable. Therefore, for branded keywords relatively low bids are
sufficient to gain a top position. In contrast, the competition for non-branded or generic
keywords such as ’car insurance’ or ’laptop’ is much higher.
Search engine optimization (SEO) is the process of optimizing the ranking of unpaid
or organic results. According to an eye tracking study performed by GfK (gfk.com), around 70% of search engine users skip the advertisement results. SEO is therefore undoubtedly an extremely valuable
marketing channel. The position of a result is determined by the relevance of the content
of a website to specific keywords. Strategies that are used to improve the ranking may
include increasing the number of backlinks (incoming links to a website), editing content
2.3.1 Criteria
Besides developing a variety of models, the extant literature has been concerned with for-
mulating criteria in order to determine what is a ‘good’ attribution model. This search
for universally accepted and standardized attribution criteria is important for two rea-
sons. First and foremost, the true attribution of a certain channel is unobserved, making
the topic inevitably subjective to some extent. Secondly, the actual implementation of
attribution models by marketeers requires more practical criteria as well.
Shao and Li (2011) propose a bivariate metric to evaluate an attribution model: a
metric that evaluates both accuracy and variability. Accuracy means that a proper model
must be able to classify prospects as converters or non-converters. They evaluate accuracy
by the out-of-sample misclassification error rate. This is mathematically expressed as
$(FP + FN)(TP + TN + FP + FN)^{-1}$, with the elements explained in the confusion
matrix in Table 2.1. Strangely, Shao and Li (2011) do not report any threshold to
classify a probability as a predicted conversion or non-conversion, and it is thus unclear
how they produced the exact numbers for their accuracy metric. In addition to predictive
power, they state that the variability of the model’s parameter estimates is important.
Consequential decisions of marketeers may after all be based on the parameter estimates of
                          Predicted outcome
                      Conv                   Non-conv
    Actual   Conv     True Positive (TP)     False Negative (FN)
    outcome  Non-conv False Positive (FP)    True Negative (TN)

Table 2.1: Confusion matrix illustrating the four quadrants into which an out-of-sample prediction can fall.
the attribution model, such as performance evaluations and subsequent budget allocation
of channels. It is therefore desirable to have an attribution model with stable and reliable
parameter estimates. They calculate the variability by taking the average standard error of the estimated coefficients of the model, or $n^{-1} \sum_{i=1}^{n} SE(\hat{\beta}_i)$ for a model that has $n$ estimated coefficients.
2. Ability to predict: a good attribution model that is able to accurately judge the value of each touchpoint should be able to estimate the probability of a conversion
or a non-conversion given some touchpoints. Moreover, a model’s ability to predict
gives us an objective standard to evaluate the empirical performance of each of the
models. We will see that some of the proposed models are not predictive, which makes it much harder to determine whether they attribute correctly.
7. Intuitive restrictions: intuitively, there are two main additional restrictions that
attribution models should account for:
• The conversion credit for a channel must be between 0 and the number of
conversions that have touched this channel. The equivalent on an individual
level is that a channel’s contribution to a conversion must be between 0 and
1.
• A model should be able to incorporate information from all touchpoints in a
journey.
In the remaining sections in this chapter the attribution methods and models will be
introduced and examined in the light of these seven criteria. Interestingly, we will see that
none of the models fulfils all conditions, perhaps indicating that the perfect attribution
model is not yet around.
2.3.2 Models
Before presenting the different attribution models, it is convenient to introduce some
mathematical notation. Let there be $i = \{1, 2, \ldots, N\}$ prospects who each have an online journey with $j = \{1, 2, \ldots, J_i\}$ visits. The $j$th visit of prospect $i$ is denoted by $v_{i,j}$. For converting prospects, only visits prior to the conversion are considered. Each visit comes from a channel $C_k$, for $k = \{1, 2, \ldots, K\}$ channels. The function that maps a visit to a channel is $C(v_{i,j})$. A prospect journey can either turn into a conversion or a non-conversion:
$$ y_i = \begin{cases} 1, & \text{Conversion} \\ 0, & \text{Otherwise} \end{cases} \quad (2.1) $$
The entire journey of prospect $i$ can then be formally represented by $PJ_i = \{\{v_{i,j}\}_{j=1}^{J_i}, y_i\}$.
In case of individual-level attribution, each visit $v_{i,j}$ has an attribution $a_{i,j}$, given by the function $a_{i,j} = a(v_{i,j})$, under the restrictions that $0 \leq a(v_{i,j}) \leq 1$ and $\sum_{j=1}^{J_i} a(v_{i,j}) = y_i$.
The restrictions imply that a non-converting journey gives a credit of 0 to all visits. The
attribution $A_k$ of a channel $k$, as a percentage of the total number of conversions, can then be calculated as follows:

$$ A_k = \frac{\sum_{i=1}^{N} \sum_{j:\, C(v_{i,j}) = C_k} a_{i,j}}{\sum_{i=1}^{N} y_i} \quad (2.2) $$
As we will see, not all attribution methods are able to attribute individually, so sometimes a model produces estimates for $A_k$ directly.
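As an illustration of Equation (2.2), the following minimal R sketch aggregates individual-level attributions into channel-level attributions. The data layout (one row per visit with columns channel and a, plus a vector of conversion indicators) is a hypothetical representation of the journeys:

    # touches: one row per visit v_ij, with columns channel (C(v_ij)) and a (a_ij)
    # y: vector of conversion indicators y_i, one element per prospect
    attribute_channels <- function(touches, y) {
      tapply(touches$a, touches$channel, sum) / sum(y)  # Equation (2.2)
    }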
Now that we have formally defined all the elements of the attribution problem, we can
turn our attention to the different models to see how they propose to solve the attribution
problem (in other words, how they estimate the $A_k$'s). First, we will discuss the sim-
ple and mainstream non-statistical rule-based heuristics. Thereafter, the more complex,
mathematical or statistical models (probabilistic model/Shapley, logistic regression and
Markov chain) are introduced.
Rule-based heuristics
The most frequently applied single touch heuristic is last touch attribution. Last
touch attribution assigns all credit to the last visit a prospect touches before conversion
or, mathematically expressed:
$$ \hat{a}_{i,j,LT} = \begin{cases} 0, & j \in \{1, 2, \ldots, J_i - 1\} \\ 1, & j = J_i \end{cases} \quad (2.3) $$
The popularity of this method is due to its intuitive and computational simplicity.
Only information about the last touch serves as input to the method, making the recon-
struction of a full customer journey unnecessary. However, the fact that it completely
ignores information about the prior touches makes it a fundamentally flawed heuristic.
Suppose a prospect reaches a website of an online vacation retailer through an affiliate
party, gathers all of its information, but then needs a night's sleep to decide whether he is
going to purchase a trip. Waking up the next morning, he decides to buy it, quickly uses a
search engine to find the relevant page and instantly converts. Last click attribution will
assign the full credit to organic search (SEO) and no credit to the affiliate party, which
intuitively does not make sense. If a marketeer uses last touch attribution for attribu-
tion purposes, he might unjustly decide to stop allocating its funds to the affiliate party,
therewith lessening much more conversions than he is aware of. In practice, this means
that channels that typically appear in the beginning of a journey, while a prospect is still
in the orientation phase, are highly undervalued. Examples are banner advertisements
or affiliate parties. In contrast, channels that appear later in the journey such as direct
(typing the URL of the website in the browser) or organic search are overvalued, even
though those channels might predominantly be reached by prospects that have already
made up their mind to buy the product and are only looking for the easiest way to reach
the website.
To counter this bias that favours later touchpoints, another single touch heuristic
named first touch attribution is introduced. Mathematically, this heuristic assigns a
weight to each individual visit as follows:
$$ \hat{a}_{i,j,FT} = \begin{cases} 0, & j \in \{2, 3, \ldots, J_i\} \\ 1, & j = 1 \end{cases} \quad (2.4) $$
However, as one can imagine, first touch attribution is far from perfect either, since a
new bias is introduced. Channels that are typically touched later in the journey such as
organic or sponsored search are now underestimated, since they are given no conversion
credit in cases of more than one touch. In addition, channels that usually occur in the
beginning of a journey such as affiliate or display are overvalued.
Both the last touch and the first touch heuristic fail to take into account the informa-
tion of customer journeys with multiple touches. However, an advantage of the single
touch heuristics for the purposes of this thesis is their potential to be transformed into a
predictive model. Empirical conversion probabilities can be calculated for each channel
given a certain position (e.g. first or last) and used as probability predictions for out-of-
sample observations. To illustrate, for the last touch heuristic the empirical probability of conversion, given that the last touched channel is $C_k$, is as follows:
$$ \hat{P}(y_i = 1 \mid C(v_{i,J_i}) = C_k) = \frac{\#\{i : C(v_{i,J_i}) = C_k,\ y_i = 1\}}{\#\{i : C(v_{i,J_i}) = C_k\}} \quad (2.5) $$
This equation takes the number of conversions with last touch k divided by the total
number of journeys with last touch k. The predictive potential of single touch heuristics
is an advantage since it enables comparing performances both among each other and
among other models.
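As a sketch of how Equation (2.5) can be computed in R, assuming a hypothetical data frame journeys with one row per prospect and columns last_channel and y:

    # empirical conversion probability per last-touched channel:
    # the mean of a 0/1 indicator equals the share of conversions
    p_last <- tapply(journeys$y, journeys$last_channel, mean)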
A straightforward solution to the bias of single touch heuristics is to assign equal
conversion credit to all touchpoints:
$$ \hat{a}_{i,j,LIN} = \frac{1}{J_i}, \quad \forall j \quad (2.6) $$
This rule-based method is unsurprisingly called linear touch attribution. Although less
fundamentally flawed, linear touch attribution still assigns an arbitrary weight to each
touchpoint independent of its true contribution. It is ignorant of potential contribution
differences between channels: channel X might be generally more effective in persuading
prospects to convert than channel Y. Moreover, it completely disregards differences over time, whereas touches at the beginning or end of a journey may be much more effective and influential than touches in the middle. Linear touch attribution, although
in expectation closer to the true attribution than first or last touch, is still not the ‘holy
grail’ of the attribution problem.
Wooff and Anderson (2013) decide to employ an attribution method that integrates
the knowledge of marketing industry experts. They interview marketeers and conclude
that marketeers generally regard the last clicks most valuable, followed by the first clicks
and the intermediate clicks. Based on this conclusion, they propose to assign conversion
credits to each touchpoint on the basis of an asymmetric U-shaped function of the relative time $0 < t < 1$ in the click path, with parameters $a$ and $b$ fitted to the data. An illustration of such a fitted curve is displayed in Figure 2.2. In this example, the value of the last click is larger than the value of the first click.
Figure 2.2: Source: Wooff and Anderson (2013). The relative value of a click over time.
Although accepted by industry experts, this method is still flawed since it presupposes
a functional form. Moreover, it only takes into account attribution variability over time
but no intrinsic attribution differences between channels. Another disadvantage of both multi touch heuristics is that there is no method to make them predictive, preventing the possibility of comparing their performance with that of other models.
To conclude this subsection about rule-based heuristics, it can be said that the over-
arching disadvantage of those heuristics is that no method is truly data-driven: each
method presupposes the distribution of the attribution weights. Interestingly though,
the rule-based heuristics are the ones most commonly used in practice. For instance, the web analytics service Google Analytics only offers attribution analysis based on rule-based methods. In
the subsequent subsections statistical models are discussed that base their attributions
on parameters derived from the data. It is expected that these models perform much
better.
Simple probabilistic model

The simple probabilistic model is first proposed by Shao and Li (2011). This non-
parametric model determines attribution by calculating empirical conversion probabilities
with one and two channel touches. The empirical conversion probability of a path with a single visit from channel $k$ is as follows:
$$ \hat{P}(y_i = 1 \mid C_k) = \frac{\#\{i : J_i = 1,\ C(v_{i,1}) = C_k,\ y_i = 1\}}{\#\{i : J_i = 1,\ C(v_{i,1}) = C_k\}} \quad (2.8) $$
This expression divides the number of conversion paths with a single channel $k$ touch by the total number of paths with a single channel $k$ touch. Similarly, for paths with two
touches the empirical probability is calculated as follows:
$$ \hat{P}(y_i = 1 \mid C_k, C_l) = \frac{\#\{i : J_i = 2,\ C(v_{i,j}) = C_k,\ C(v_{i,r}) = C_l,\ y_i = 1\}}{\#\{i : J_i = 2,\ C(v_{i,j}) = C_k,\ C(v_{i,r}) = C_l\}} \quad (2.9) $$

for some $j \in \{1, 2\}$ and $r = 3 - j$. Note that the order of touching $C_k$ and $C_l$ is irrelevant.
The attribution of channel k on an aggregate level is then computed as follows:
$$ \hat{A}_{k,PROB} = \hat{P}(y_i = 1 \mid C_k) + \frac{1}{2(K-1)} \sum_{l \neq k} \left\{ \hat{P}(y_i = 1 \mid C_k, C_l) - \hat{P}(y_i = 1 \mid C_k) - \hat{P}(y_i = 1 \mid C_l) \right\} \quad (2.10) $$
The first element of this expression simply measures the conversion probability of prospect
journeys that solely contain channel k. The more interesting second element computes
the interaction effect of channels k and l, which is the conversion probability of paths with
both channels corrected by the one touch conversion probabilities of both individual chan-
nels. Note that the probabilistic model attributes at an aggregate rather than individual
level. An important assumption underlying this model is that half of this interaction
effect is attributed to each of the involved channels. Dalessandro et al. (2012) arrive at
the same model, having defined attribution as a “channel’s expected marginal impact on
conversion”. Moreover, they prove that it is a second-order approximation of the Shapley
Value, a way to distribute collective value in Cooperative Game Theory (Shapley, 1953).
Berman (2013) makes use of exactly the same formulation of this Shapley attribution
model.
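A minimal R sketch of Equation (2.10), assuming p1 is a named vector of one-touch conversion probabilities and p2 a symmetric matrix of two-touch conversion probabilities, both estimated empirically as in Equations (2.8) and (2.9):

    prob_attribution <- function(p1, p2) {
      K <- length(p1)
      sapply(seq_len(K), function(k) {
        # interaction effect of channel k with every other channel l
        interaction <- sum(p2[k, -k] - p1[k] - p1[-k])
        p1[k] + interaction / (2 * (K - 1))  # Equation (2.10)
      })
    }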
The simple probabilistic model has a number of disadvantages. Its attribution methodology solely uses conversion probabilities, therewith ignoring information about
the number of conversions. This makes the attribution method unintuitive in some cases.
Suppose a channel is only touched in a single customer journey (in a large data set), but
this journey is successful and leads to a conversion. Although the channel just contributed
to a single conversion, its conversion probability is 100%, causing the probabilistic model
to attribute it a disproportionally high share. The attributed conversions to channel X
are likely to exceed one, which is unintuitive. A second disadvantage of the probabilistic
model is the possibility of negative attributions. Furthermore, the model is unable to
integrate information of paths that contain more than two touchpoints. It is theoretically
possible to extend the model for longer paths, but Shao and Li (2011) justly argue
that from a practical standpoint this does not make sense. The estimated conversion
probabilities for longer journeys become after all highly inaccurate due to the low number
of observations. A final disadvantage is that the model is not predictive, thwarting the possibility of comparing its performance with that of other models.
Logistic regression
An alternative attribution model that is also initially proposed by Shao and Li (2011) is
a simple logistic regression. This is a regression model in which the dependent variable is binary and the functional form is characterized by the non-linear logistic function. In such a model, each customer journey constitutes an observation, with the binary conversion
indicator yi as the dependent variable. Two major advantages of this model are that
it is predictive and it takes into account all available touchpoint information. In the
form Shao and Li (2011) propose, the explanatory variables are the number of touches
of a certain channel $k$ in journey $i$, or $NC_{i,k} = \sum_{j=1}^{J_i} \#\{C(v_{i,j}) = C_k\}$. The logistic regression can then be formulated as follows:
$$ P(y_i = 1) = \Lambda\!\left(\beta_0 + \sum_{k=1}^{K} \beta_k NC_{i,k}\right) \quad (2.11) $$
where $\Lambda(x) = (1 + e^{-x})^{-1}$ is the logistic cumulative distribution function. The parameters $\beta_k$ can be estimated by maximum likelihood, although a closed-form solution such
as in the case of linear regression does not exist. These parameters are then interpreted
in order to determine each channel’s attribution to the total number of conversions.
However, the extant literature ignores or is particularly vague about the exact method
to go from the logistic model parameter estimates to attributing the channels. Theo-
retically, the most obvious method to do so would be to evaluate the marginal effects
$\frac{\partial y_i}{\partial NC_k} = \beta_k \, \lambda(\beta_0 + \sum_{k=1}^{K} \beta_k NC_{i,k})$, where $\lambda(x) = e^{x}(1 + e^{x})^{-2}$ is the logistic probability density. However, since estimated parameters can be negative, this would imply the
possibility of attribution to be negative, which is not a desirable property.
An alternative, more practical method to attribute is proposed by this thesis. For
each visit $v_{i,j}$, consider the estimated conversion probability $\hat{p}_{i,j} = \Lambda(\hat{\beta}_0 + \hat{\beta}_k)$ in case only the channel $C_k = C(v_{i,j})$ of that visit is touched. Use these estimated conversion probabilities as unnormalized attributions, and subsequently normalize them to obtain $\hat{a}_{i,j,LOG}$ for each touchpoint. Note that this method assumes that every touch of channel
k has the same effect on attribution, which is compatible with the specification of the
basic logistic regression model. Mathematically, the individual attribution according to
the logistic model is expressed in (2.12), in which for simplicity k rather than Ck is the
channel belonging to visit vi,j .
A first extension of this logistic model allows a channel's effect to differ between last touches and earlier touches, with a dummy $d_{i,k}^{LT}$ indicating whether channel $k$ is the last touch of journey $i$ and $NC_{i,k}^{NLT}$ counting the non-last touches of channel $k$:
$$ P(y_i = 1) = \Lambda\!\left(\beta_0 + \sum_{k=1}^{K} \beta_k^{LT} d_{i,k}^{LT} + \sum_{k=1}^{K} \beta_k^{NLT} NC_{i,k}^{NLT}\right) \quad (2.13) $$

The individual attribution then becomes:

$$ \hat{a}_{i,j,LOGX1} = \begin{cases} \dfrac{\Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,j})}^{LT})}{\sum_{r=1}^{J_i - 1} \Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,r})}^{NLT}) + \Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,J_i})}^{LT})}, & j = J_i \\[2ex] \dfrac{\Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,j})}^{NLT})}{\sum_{r=1}^{J_i - 1} \Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,r})}^{NLT}) + \Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,J_i})}^{LT})}, & j \neq J_i \end{cases} \quad (2.14) $$
Although at first glance much more complex, on closer inspection the only difference with Equation (2.12) is that for the last touchpoint a different estimated coefficient is evaluated than for the other touchpoints: the only difference between the cases $j = J_i$ and $j \neq J_i$ is which estimated coefficient is plugged into the logistic cumulative distribution function in the numerator. For the single $\hat{\beta}_k^{LT}$ that is not estimated due to perfect collinearity problems, we plug in 0.
Markov chain models

The rows of the transition matrix $W$ contain probabilities that sum to one, meaning that given a certain state in period $t$, there is always a probability of 1 that the system is
in a state in period t + 1. The final element required for a Markov chain is an initial state
Z0 . Journeys can thus be modelled or simulated by multiplying the initial state Z0 with
the transition matrix W , resulting in a sequence of states {Z0 , Z1 , Z2 , ..., Zt−1 , Zt , ...} over
discrete time.
Markov chains can be of different orders, denoting the number of previous states
that influence the current state. Let’s first focus on the first-order model. The possible
states $s_i \in S$ in the first-order Markov model are all channels $C_k$, a conversion state $Conv$ and a non-conversion state $NonConv$. The transition probabilities are empirically derived from the data, giving the first-order transition matrix $\hat{W}_1$. The estimated first-order initial state probability of each state is also calculated empirically, as the proportion of journeys whose first visit comes from the relevant channel, resulting in the vector $\hat{Z}_{0,1}$. Note that the initial state probabilities for $s_i = Conv$ and $s_i = NonConv$ in $\hat{Z}_{0,1}$ are zero, since a prospect journey does not start with a conversion or non-conversion.
Once a first-order Markov model is appropriately fitted this way, attribution is determined by a so-called Removal Effect. This is defined as the change in the probability of reaching the conversion state between the normal situation and the situation where the pertinent channel $s_i = C_k$ is removed from the chain. Although unexplained by Anderl et al. (2014), we assume that removing a channel $k$ means setting all row elements $w_{k,j}$ to zero, except the element for $j = NonConv$, which is set to one. This results in a so-called reduced matrix $W_{(-k),1}$.
Although not specifically mentioned by Anderl et al. (2014), we assume that the Removal Effect is considered over the steady state of the Markov chain process. The first-order Removal Effect $RE_1(C_k)$ of a channel $k$ then takes a value between 0 and the original conversion rate. Mathematically, this is expressed in Equation (2.15), where $x[Conv]$ is the $s_i = Conv$ element of vector $x$, $\hat{W}_1^T$ is the matrix product of $T$ copies of $\hat{W}_1$, and $\hat{Z}_{0,1}'$ is the transpose of $\hat{Z}_{0,1}$:

$$ \hat{RE}_1(C_k) = \hat{Z}_{0,1}' \, \hat{W}_1^T [Conv] - \hat{Z}_{0,1}' \, \hat{W}_{(-k),1}^T [Conv] \quad (2.15) $$
The aggregate attribution $\hat{A}_{k,MAR1}$ for each channel is subsequently calculated by dividing its Removal Effect by the sum of all Removal Effects:

$$ \hat{A}_{k,MAR1} = \frac{\hat{RE}_1(C_k)}{\sum_{l=1}^{K} \hat{RE}_1(C_l)} \quad (2.16) $$
Markov chain models can easily be made predictive. Given a state $s_i = C_k$ at moment $t$, the estimated transition probability to the state $s_i = Conv$ is the estimated conversion probability. Note that this probability is only based on the last touchpoint for a first-order Markov chain model, a finding that generalizes to the last $r$ touchpoints for $r$th-order models.
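A minimal R sketch of the Removal Effect computations of Equations (2.15) and (2.16), assuming W is the estimated transition matrix with named rows and columns (all channels plus "Conv" and "NonConv") and z0 the initial state vector as a 1-row matrix:

    conv_prob <- function(z0, W, T = 100) {
      p <- z0
      for (t in seq_len(T)) p <- p %*% W  # propagate T periods ahead
      p[, "Conv"]                         # mass in the absorbing conversion state
    }

    removal_effect <- function(z0, W, k, T = 100) {
      Wk <- W
      Wk[k, ] <- 0            # remove channel k: zero its outgoing transitions
      Wk[k, "NonConv"] <- 1   # redirect the removed traffic to non-conversion
      conv_prob(z0, W, T) - conv_prob(z0, Wk, T)  # Equation (2.15)
    }

    # aggregate attribution (Equation 2.16): Removal Effects divided by their sum
    channels <- setdiff(colnames(W), c("Conv", "NonConv"))
    re <- sapply(channels, function(k) removal_effect(z0, W, k))
    re / sum(re)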
Figure 2.3: First-order Markov chain graph illustrating the different states and transition possibilities.
Now we will introduce a very simple analytical example of the first-order Markov
model to clarify more intuitively how the model attributes. Suppose there are two chan-
nels, $C_1$ and $C_2$, and the maximum number of touches is two. The possible paths are $C_1$, $C_2$, $C_1 C_2$ and $C_2 C_1$, which occur respectively $N_1$, $N_2$, $N_{1,2}$ and $N_{2,1}$ times, with numbers of conversions $P_1$, $P_2$, $P_{1,2}$ and $P_{2,1}$. For ease of exposition we omit the paths $C_i C_i$, $i \in \{1, 2\}$. The graph of this Markov model is illustrated in Figure 2.3, and
the Markov transition matrix and initial state are shown in Table 2.2.
Note that the conversion and non-conversion states are absorbing states: once in such a state, a transition to another state is impossible. Now define $conv^* = Z_{0,1}' W_1^T [Conv]$, where $T$ is the matrix power. The exact value of $conv^*$ is irrelevant for our purposes. The Removal Effect $RE_1(C_1)$ for channel $C_1$ is then expressed as follows, where $N$ is the total number of journeys:

$$ RE_1(C_1) = conv^* - \frac{N_2 + N_{2,1}}{N} \cdot \frac{P_2 + P_{1,2}}{N_{1,2} + N_2 + N_{2,1}} \quad (2.17) $$
Since this Removal Effect is proportional to the attribution, we can derive that attri-
bution for C1 is a function of:
1. The relative number of journeys that start with the other channel $C_2$ (negative effect)

2. The number of conversions with last touch $C_2$ relative to the total number of touches with $C_2$ (positive effect)
Note that the attribution for $C_1$ has a negative relation with the relative number of $C_2$ last touch conversions, weighted by the relative number of $C_2$ first touches. This weighting by the number of first touches makes it intuitively less accurate than simply taking the negative of the total number of last touch conversions of $C_2$, which is basically last touch attribution. From this stylized two-channel example we can therefore conclude that the first-order Markov chain model is not only comparable to last touch attribution, but that it is even likely to attribute worse than this rule-based heuristic due to the unintuitive weighting.
To some extent, the conclusions of our stylized two channel example can be generalized
to more channels and touchpoints. However, in case of more than two channels the
transition probabilities and interactions between channels influence attribution as well.
Most straightforwardly, the transition probabilities to the state whose Removal Effect
is estimated have a positive influence on the attribution of this state. High transition
probabilities after all imply that the missed conversion of this state will be larger when it
is removed. This issue was irrelevant in our analytical example since the system reached
stability after a single iteration, meaning that only the direct traffic to conversion (and
no transitional traffic) influenced the Removal Effect.
Having explained the intuitive dynamics of first-order Markov chain models, let’s now
turn our attention to higher-order models. The most distinctive difference for higher-
order models is that multiple previous periods are taken into account: for an $r$th-order
Markov chain the present state not only depends on the previous state, but on the states
in the last r periods. It can be shown that a Markov chain of order r is equivalent to a
first-order Markov chain with $r$-tuples representing the states. For instance, in case $r = 2$, a state can be $s_i = (C_k, C_l)$, meaning that the current channel is $C_l$ and the previous channel $C_k$. We can thus express an $r$th-order Markov chain with a single transition matrix $W_r$ and initial state $Z_{0,r}$, with $r$-tuples representing the different states. The transition probabilities and initial state are again empirically derived from the data, giving $\hat{W}_r$ and
$\hat{Z}_{0,r}$. For a second-order Markov model this implies $(K+1)(K+2)+2$ states. The $K+1$ term represents the possibilities of the first element of the 2-tuple, which are all $K$ channels plus a 'none' element in case there is no channel previous to the second channel represented in the 2-tuple. The $K+2$ term represents all channels plus a conversion and a non-conversion possibility. The last 2 comes from the only absorbing states $s_i = (Conv, Conv)$ and $s_i = (NonConv, NonConv)$. In the initial state vector $Z_{0,2}$, empirical first touch probabilities are estimated for the states of the structure $s_i = (None, C_k)$.
Attribution for the higher order Markov model is again calculated by the Removal
Effect. Anderl et al. (2014) choose to calculate channel attribution by taking the average
Removal Effect of each of the states that include the respective channel. This state-based
Removal Effect is illustrated in Equation (2.18) for the case of $r = 2$, where $\hat{W}_{(-s_i),2}$ is
However, taking the mean of the Removal Effects of all states that include channel
k seems inconsistent with Anderl et al. (2014)’s own definition of the Removal Effect,
being the “change in probability of reaching the conversion state when we remove a
channel from the graph”. Therefore, more in line with this definition we propose an
alternative method to determine individual channel attribution. Simply stated, this new
method determines the Removal Effect of a channel, $RE_r(C_k)$, rather than of a state, $RE_r(s_i)$. $RE_r(C_k)$ is calculated by removing (setting to zero) all states that include channel $k$, that is, all states $s_i \in S_k$. The consequent reduced matrix is named $\hat{W}_{(-S_k),r}$. This Removal Effect for $r = 2$ is calculated in Equation (2.20).
model performs better than the logistic regression, first touch and last touch heuristics
as measured by the area under the ROC curve and the top-decile lift. Unfortunately, it
is not reported whether any model performance differences are significant.
Other models
Some other noteworthy attribution models have been developed, which we briefly mention for the interested reader. Li and Kannan (2014) propose a Bayesian
model to measure online channel consideration, visits and purchases. They calculate
carryover and spillover effects to attribute conversion credit. Zhang et al. (2014) apply
the attribution question to a framework borrowed from survival theory, producing a model
that appears quite promising in both conversion prediction and attribution. Finally, Xu
et al. (2014) employ a mutually exciting point process model to calculate attribution of
online advertising channels. These models are not discussed further in this thesis, because either our dataset is not suitable for the respective model, the model is too complex for the purposes of this thesis, or the model is expected to be less effective than the models we consider.
Table 2.3: Summary of theoretical evaluation of attribution models. A ‘+’ indicates a model fully
complies with the criterion, a ‘+/−’ implies partial compliance and a ‘−’ no compliance.
absolute numbers. In conclusion, the logistic extension satisfies most theoretical criteria,
closely followed by the normal logistic model and the higher-order Markov chain model.
Based on the theoretical criteria, the logistic models perform best.
Chapter 3
Method
This chapter explains the method that this thesis employs to answer the question of which model performs best empirically and in a simulation study. First, Section 3.1 briefly mentions
the different methods and models that will be estimated and evaluated in this thesis.
Then, Section 3.2 describes how the models are evaluated based on their classification
accuracy in an empirical study. Finally, Section 3.3 works out the method that is used
for the simulations. All statistical analyses and simulations are performed in the open
source programming language R.
3.1 Models
The models that are tested in the empirical and simulation study are all introduced in
Chapter 2. The list below sums them up, where the models that are able to produce
predictions are designated with an asterisk (*).
We choose to estimate the models on the full data set, without conditioning on the
number of touchpoints. The latter would after all quickly bring down the number of
observations for a larger amount of touchpoints. This would give insignificant parameter
estimates, severely complicating the issue of attribution. In addition, all of the models are
perfectly able to cope with a data set that contains observations with different numbers of touchpoints, and it is expected that some of the models even perform better on such a full data set. To
see this, first notice that conditioning on the number of touchpoints would not have any
implication for the rule-based heuristics. Attribution under conditioning on touchpoints
would be exactly the same as attribution under the full data set. Since the probabilistic
model already conditions on paths with one or two touchpoints and ignores longer paths,
conditioning here neither has an effect. For the logistic regression and Markov models,
it is expected that conditioning produces worse results, simply because for each estimate
less information is available. Suppose we condition on the number of touchpoints $N_t$, estimating separately for $N_t = 1$ and $N_t > 1$. It is much harder for the set that is conditioned on $N_t > 1$ to
determine the contribution of a channel, since information on the single-touch conversion
probability of this channel relative to other channels is not taken into account. This
makes its estimates for this contribution less accurate compared to the case in which all
information is included in the data. There is, in conclusion, no good reason to condition
on the number of touchpoints. Having established this, let’s now discuss how to evaluate
the empirical performance of the models.
3.2 Classification accuracy

A better measure for our purposes is the Receiver Operating Characteristic (ROC)
curve. This curve plots the True Positive Rate (TPR) on the y-axis against the False
Positive Rate (FPR) on the x-axis for all possible thresholds c ∈ (0, 1). The TPR or
sensitivity is calculated as the percentage of accurately classified positive observations
(conversions) relative to the total number of true conversions. A 100% sensitivity implies
that all true converters are correctly classified as conversions by the model. Mathematically, the sensitivity is expressed as $TPR = \frac{TP}{TP + FN}$, where $TP$ is the number of True Positive classifications and $FN$ the number of False Negative classifications. Similarly to
Positive classifications and FN the amount of False Negative classifications. Similarly to
the sensitivity, the specificity is defined as the percentage of accurately classified negative
observations (non-conversions) relative to the total number of true non-conversions. The
FPR is calculated as $1 - \text{specificity}$, or $\frac{FP}{TN + FP}$. A 100% specificity (or 0% FPR)
. A 100% specificity (or 0% FPR)
implies that all actual non-conversions are classified as such by the model.
An example of a ROC curve is plotted in Figure 3.1. For every threshold c ∈ (0, 1)
that classifies an estimated probability as positive or negative, a single point in the ROC
graph is plotted. The best possible model yields 100% sensitivity (no false negatives) and
100% specificity (no false positives), which is represented by the upper left corner or (0,1)
coordinate in the ROC graph. Random guessing gives a point on the 45° line or line of no discrimination. The extent to which a model is able to produce probability predictions that are far from the line of no discrimination and close to the upper left corner determines its
discriminatory power and thus its classification accuracy. This accuracy can be expressed in a single number: the Area Under the ROC Curve (AUC). For a model that classifies at least as well as random guessing, the AUC lies between 0.5 and 1. The closer it gets to 1, the better the performance of a model. The AUC is the measure that is used by this thesis to judge the
different models on their classification accuracy. Finally, it is important to realize that
the AUC as a measure assumes that every value of the specificity is equally important.
Alternatively, one might argue for a certain distribution of weights over the values of the specificity. Only in the case of overlapping ROC curves does the assumption of equal weights become relevant. It is thus necessary to verify whether the ROC curves of any two
models overlap.
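As a sketch of how the AUC can be computed in R, here using the pROC package (one of several packages offering ROC analysis; the column names of the test set are assumptions):

    library(pROC)
    roc_obj <- roc(response = test$y, predictor = test$p_hat)  # y in {0, 1}
    auc(roc_obj)   # Area Under the ROC Curve
    plot(roc_obj)  # plots the ROC curve, e.g. to inspect overlap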
Evidently, solely the AUC measure does not tell us whether performance differences
between models are significant. This thesis extends the literature by performing the boot-
strap procedure on the test set in order to determine the standard deviation of the difference in AUC between any two models. This way, it is tested whether performance dif-
ferences across models are significant. The bootstrap procedure is based on the idea that
the sample data can be considered as the population. Randomly sampling with replace-
ment from this new population (the sample data) is called bootstrapping. Bootstrapping
allows calculating certain measures such as the standard deviation for sample estimates
Figure 3.1: An example of a ROC curve. The 45◦ line represents random classification. The model of the
blue curve thus clearly classifies better than random.
for inference. In our case, bootstrapping $B$ times gives us $B$ estimates of the difference in AUC between two models $i$ and $j$: $\{\widehat{AUCDiff}_{i,j}(1), \widehat{AUCDiff}_{i,j}(2), \ldots, \widehat{AUCDiff}_{i,j}(B)\}$. From these $B$ estimates the standard deviation can be calculated, and a simple Z-test can show us whether AUC differences between models are in any direction significantly different from 0. In this thesis $B$ is chosen to be 1000.
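A minimal sketch of this bootstrap test in R, continuing the pROC-based setup sketched above (the column names of the test set are again assumptions):

    B <- 1000
    diffs <- replicate(B, {
      idx <- sample(nrow(test), replace = TRUE)  # resample the test set
      s <- test[idx, ]
      auc(roc(s$y, s$p_model_i)) - auc(roc(s$y, s$p_model_j))
    })
    z <- mean(diffs) / sd(diffs)  # simple Z-test on the AUC difference
    2 * pnorm(-abs(z))            # two-sided p-value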
Although the AUC measure for classification accuracy undoubtedly gives us a good method to evaluate the different models on their attribution, we should not forget the under-
lying assumption that accurate prediction implies accurate attribution. This assumption
is necessary because the true attribution can never be observed in empirical data. An
alternative way to compare the attribution models is to simulate data in which the true
attribution is known, which is the topic of the next section.
3.3 Simulation performance

The idea of the simulation study is to generate data in which the true attribution distribution over the channels is known. Let's first discuss simulation with a Markov chain DGP.
We restrict ourselves to discussing the first-order Markov DGP, but the described
method can be easily extended to the second order. We use the estimated initial state $\hat{Z}_0$ and the estimated transition matrix $\hat{W}$ from the data to simulate a new dataset with the same number of prospects, which is 734.6k. The simulation can thus be seen as an effort to
reproduce the original dataset. This simulation is iterated 10 times to account for sample
variance. By means of the Removal Effect (see Equation (2.15)) the estimated attribution
from the empirical dataset is determined, which functions as the ‘true’ attribution in our
DGP. All models are estimated on the generated data and their attribution is determined.
Subsequently, this attribution is compared with the true attribution by taking the Mean
Absolute Error (MAE). This way, we can evaluate which model attributes closest to the
true attribution.
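A minimal R sketch of drawing a single journey from the fitted first-order chain, assuming the states are named and "Conv" and "NonConv" are absorbing:

    simulate_journey <- function(z0, W, max_steps = 50) {
      states <- colnames(W)
      s <- sample(states, 1, prob = z0)        # draw the initial state
      path <- s
      while (!s %in% c("Conv", "NonConv") && length(path) < max_steps) {
        s <- sample(states, 1, prob = W[s, ])  # draw the next state
        path <- c(path, s)
      }
      path
    }

Repeating this 734.6k times, and classifying paths that end in "Conv" as conversions, reproduces a dataset of the same size as the original.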
Note, however, that the true attribution in the DGP is calculated by a specific method, namely the Removal Effect. Inevitably, this true attribution therefore represents the way the Markov attribution model perceives attribution. One might therefore argue
that this method answers the question which model attributes closest to the Markov chain
model rather than the question which model attributes closest to the truth. This is a
valid argument, and a major shortcoming of using the Markov chain DGP for simulation.
Nevertheless, this analysis can still be insightful. Generating data by both a first- and
second-order Markov chain DGP and comparing the models' performances allows us, for instance, to see which Markov model better reflects the true data.
Because of this shortcoming of the Markov chain DGP, we extend our simulation study with a different, basic and intuitive framework to generate data in which the true attribution is known. Hopefully, this framework does not favour any model over the others.
For clarity and simplicity, we restrict ourselves to a framework with two channels and
a maximum of two touchpoints. However, the framework is easily extendible to more
channels and longer journeys.
Our framework is as follows. First, $P_1$ and $P_2$ are the probabilities that the journey has respectively one and two touchpoints, where $P_2 = 1 - P_1$. Then, for the first as well as the second touchpoint, two channels can be touched. $ch_1$, $ch_{1|1}$ and $ch_{1|2}$ indicate the probability that channel 1 is touched given no prior touches, a prior touch with channel 1 and a prior touch with channel 2, respectively. $ch_2$, $ch_{2|1}$ and $ch_{2|2}$ are defined as one minus the probability that channel 1 is touched in the corresponding instance. The final element of the simulation model is the contribution of each touchpoint to the total conversion probability. The contribution of channel 1 is $p_1$, $p_{1|1}$ or $p_{1|2}$, depending on the position of the channel 1 touch. Again, $p_1$ gives the contribution to the conversion probability of channel 1 with no prior touches, $p_{1|1}$ with a prior touch of channel 1 and $p_{1|2}$ with a prior touch of channel 2. $p_2$, $p_{2|1}$ and $p_{2|2}$ are equivalently defined. It should be noted that $p_1$ and $p_2$ are contributions to the conversion probability irrespective of possible later touchpoints. The total conversion probability for a path with first channel 1 and then channel 2 is $p_1 + p_{2|1}$. In total, this simple simulation model contains 10 parameters that affect the customer journeys and thus the true attribution.
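A minimal R sketch of this data generating process, with the ten parameters collected in a named list; the names (e.g. p1_2 for p1|2) are our own shorthand:

    simulate_path <- function(par) {
      first_ch1 <- runif(1) < par$ch1
      path <- if (first_ch1) "ch1" else "ch2"
      p_conv <- if (first_ch1) par$p1 else par$p2
      if (runif(1) < par$P2) {  # the journey receives a second touchpoint
        p_second_ch1 <- if (first_ch1) par$ch1_1 else par$ch1_2
        second_ch1 <- runif(1) < p_second_ch1
        path <- c(path, if (second_ch1) "ch1" else "ch2")
        p_conv <- p_conv + if (second_ch1) {
          if (first_ch1) par$p1_1 else par$p1_2   # contribution of channel 1
        } else {
          if (first_ch1) par$p2_1 else par$p2_2   # contribution of channel 2
        }
      }
      list(path = path, y = runif(1) < p_conv)   # conversion draw
    }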
The true attribution can be calculated in a straightforward fashion from the simulated
dataset. All converting customer journeys that contain a single channel (either with one
or with two touchpoints) obviously give full credit to that channel. More interestingly,
credit should be divided in case a conversion path contains both channels. Suppose a
conversion path i touches first channel 1 and then channel 2. Channel 1 receives the
p1
conversion credit ai,1 = p1 +p2|1
, which is the contribution to the conversion probability
of channel 1 as a fraction of the total contribution to the conversion probability of each
p2|1
touchpoint. Equivalently, the contribution of channel 2 can be calculated by ai,2 = p1 +p2|1
.
Having determined the individual attributions of both channels in each instance, we can
aggregate and normalize to determine the aggregate true contribution. The attribution
models that are estimated on this simulated data are evaluated by their Mean Absolute Error with respect to this true contribution.
The performance of the attribution models is tested in a wide spectrum of simulation
scenarios. In other words, data is generated multiple times according to different param-
eter settings. The scenarios are meant to reflect a great variety of possible characteristics
to be found in real datasets. We hope to find a model that attributes consistently accurately across all scenarios, such that it is able to process all characteristics. An overview of
the scenarios can be found in Table 3.1.
In the base scenario, the probability of a single-touch journey is 0.5 ($P_1 = P_2 = 0.5$).
Furthermore, regardless of the history, each touchpoint is channel 1 with probability
0.5 ($ch_1 = ch_{1|1} = ch_{1|2} = 0.5$) and channel 2 with probability 0.5 ($ch_2 = ch_{2|1} = ch_{2|2} = 0.5$).
Each touchpoint with channel 1 contributes 5% to the conversion probability
($p_1 = p_{1|1} = p_{1|2} = 0.05$). This contribution is slightly lower, at 3%, for each touchpoint
with channel 2 ($p_2 = p_{2|1} = p_{2|2} = 0.03$).
All other scenarios are defined by single deviations from this base scenario; these
deviations are printed in bold in Table 3.1. The scenario Short has 80% single-touchpoint
journeys, while Long is characterized by journeys that contain two touchpoints in 80%
of the cases. Ch1 FT has an 80% probability that the first touch is channel 1. The
scenario Mixed paths increases the probability of two-touchpoint journeys that include
both channels. The difference in conversion probability contribution between channel 1
and channel 2 is enlarged in the scenario Ch1 conv. For Ch1 t2 conv this difference in
contribution is only enlarged for the second touchpoint, and for Ch1 t2 mixed conv it is
increased even more specifically for the second touchpoint in case the first touchpoint is
channel 2. Finally, scenario Ch1 t2 no conv decreases the contribution of channel 1 to
the conversion probability at the second touchpoint. The scenarios are chosen such that
the effect on the attribution models of each characteristic that may be found in real data
can be measured. Note, however, that this basic study only investigates these effects one
at a time; interaction effects are not considered.

Table 3.1: Parameter settings for the different simulation scenarios. Deviations from the base scenario
are printed in bold.
For each scenario, 10,000 customer journeys are generated and the true attribution is
calculated. The models are then estimated and their attribution is determined, after
which we compute the Mean Absolute Error of each model. As a check on sampling
variance, the same analysis is repeated with 100,000 generated customer journeys.
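
Building on the sketches above, the scenario grid can be encoded as single-parameter
deviations and evaluated in a loop. The 80% probabilities follow the text; the deviated
conversion contributions below are illustrative placeholders, since the exact values of
Table 3.1 are not reproduced here.

# Scenarios as single deviations from the base parameters.
scenarios = {
    "Base":           {},
    "Short":          {"P1": 0.8},   # 80% single-touchpoint journeys
    "Long":           {"P1": 0.2},   # 80% two-touchpoint journeys
    "Ch1 FT":         {"ch1": 0.8},  # first touch is channel 1 with probability 0.8
    "Ch1 conv":       {"p1": 0.08, "p1_g1": 0.08, "p1_g2": 0.08},  # illustrative
    "Ch1 t2 no conv": {"p1_g1": 0.01, "p1_g2": 0.01},              # illustrative
}

for name, deviation in scenarios.items():
    pr = {**params, **deviation}
    data = [simulate_journey(pr, rng) for _ in range(10_000)]
    truth = true_attribution(data, pr)
    # ... estimate each attribution model on `data` and report mae(model, truth)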
Chapter 4
Data
In this chapter, we describe the data that is used for the empirical part of the
thesis. The original dataset is described in Section 4.1. Section 4.2 discusses how this
original dataset has been processed to make it suitable for our empirical analysis.
Finally, Section 4.3 provides some figures on the processed dataset to give a sense of
its content.
In total, the table has 64.7 million records. The total number of unique visits is 10.8
million, meaning that a visit is on average associated with around six goals. The
predefined goals are given a goal id, which can take on 1405 distinct values. Broadly
stated, goals fall into one of the five following categories for any given product.
1. Orientation visit: a visit to pages on the site to acquire more information about
a product. This can be considered as the initial orientation phase in a prospect
journey.
3. Funnel: a visit to pages on the site that are considered the official funnel, which is
the place where transactions can be made.
4. Price calculation: the next step in the funnel. A visitor fills out their personal details
on the site in order to calculate the price they would pay when purchasing a product.
The visit id is an 11-digit number that is uniquely assigned to each visit. A visit
starts from the moment a visitor enters the website and ends as soon as the website is
left (meaning that no tab in the browser displays the website). During a session, which
is defined as the period between opening and closing the browser, multiple visits are
possible. A cookie id has 32 numbers or characters, and is uniquely assigned in order to
identify visitors throughout multiple visits. The Goals Table contains 4.9 million distinct
cookies, implying that a cookie has on average 2.2 visits. The IP address variable gives
the numerical label of the public IP address of the visitor. The datetime indicates the
exact date and time that the goal in the same record is achieved. A final indispensable
variable for this thesis is the campaign type field. This states the channel from which
a visitor lands on the company’s website. This can take on the categories of ‘organic’,
‘organic search’ (SEO), ‘sponsoredsearch’ (SEA, which is later separated into branded
and non-branded sponsoredsearch), ‘email’, ‘referrer’, ‘bannerad’ and some other small
categories summarized into the ‘other’ group.
The Goals Table functions as input for the data processing algorithms. These al-
gorithms are necessary to obtain a dataset out of which the attribution models can be
estimated. This data processing is the subject of the next section.
CHAPTER 4. DATA 38
All data processing is performed with the Structured Query Language (SQL) in SQL
Server 2014 Management Studio. This language is well suited to handling large volumes
of data efficiently, and therefore fits our purposes.
• Our data consists of all visits to the website. The pertinent firm offers multiple
products on its site. To limit our analysis, we choose to attribute for a specific
product, say product X. We therefore need to make sure that only visits (or
touchpoints) that pertain to this product are taken into account. Since the
goals are assigned specifically for each product, we can simply filter on all the goals
that include our product X.
• Some of the traffic is caused by internet bots rather than prospects interested in
product X. Examples are web crawlers that browse sites to index them for search
engines, or internal bots that check whether all pages on the website are still active
and functional. This traffic obviously does not come from potential product purchasers
and would distort our results if included. These visits are therefore filtered from
our Goals Table.
• All visits with the same cookie id are assumed to come from the same person. This
assumption is standard and accepted in the literature.
• Different cookies that have the same IP-addresses are taken together and considered
the same person. However, this step involves the significant risk that cookies of
different individuals are taken together, since IP-addresses pertain to an internet
connection rather than an individual. For this reason, it has never been done in
the literature (to this author’s knowledge). To mitigate this risk, only IP-addresses
that are linked to fewer than five cookies are considered for the bundling. This prevents
joining cookies that have been connected to a public internet connection such as
a library, office or university. The risk that visits of different persons within the
same household are bundled is still present and probable. However, our product
X is typically consumed by households rather than individuals. It is for this reason
that this thesis argues cookies with the same IP-addresses can be seen as the same
customer.
• Finally, individuals as identified according to the above two steps are given a
uniquely generated prospect id.
• Only visits before a transaction are kept. Since we are only interested in the touch-
points before the first conversion, all visits after this conversion are removed.
• Ensure customer journeys are not cut off. At this stage our dataset contains all rel-
evant visits between July 2014 and June 2015. However, this causes some prospect
journeys to be cut off. Think of conversions in early July 2014, of which the orien-
tating touchpoints in May 2014 are not included. Similarly, visits at the end of June
2015 might later be completed with a conversion or complemented by additional visits
in July 2015, which our current data cannot account for. In order to overcome this
issue, we take all visits from prospects in the period between August 2014 and May
2015, and retrieve additional visits in July 2014 and June 2015 of these prospects.
We arbitrarily set this bandwidth of one month, but later verified that almost all
customer journeys take less than a month.
• Then, several useful new variables are calculated (a code sketch of these derived
variables is given at the end of this section). ‘Visit seq’ indicates for every visit
the consecutive visit number in the journey of a person. ‘Conversion path’ states
for each visit whether its path eventually results in a conversion. Also, in order
to simplify the Markov model estimation, the campaign types of the next, previous
and second-previous visits are determined for each visit. If there is no next visit,
the next campaign type is either ‘Conversion’ or ‘Non-Conversion’, depending on
whether a transaction is made in the final visit. If it is the first visit, the previous
campaign type is ‘None’.
• The resulting dataset is called the Visits Table. For estimating the logistic
regression, however, the Visits Table does not suffice. Therefore, another table is
produced by aggregating on prospect id and creating a variable for every campaign
type category that states how many visits in the journey come from that specific
channel. This aggregated table is referred to as the Prospect Journey Table.
Figure 4.1: The number of paths per touchpoint for non-conversions and conversions. Also the conversion
percentage per touchpoint is shown.
The variables of both the Visits Table and the Prospect Journey Table are summarized
and explained in Tables 4.2 and 4.3. The next section provides more information on the
content of these two tables.
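
As referenced in the processing steps above, the following is a minimal pandas sketch
of the derived variables and the prospect-journey aggregation. The toy data and column
names are ours and do not reflect the actual schema.

import numpy as np
import pandas as pd

# Toy stand-in for the Visits Table.
visits = pd.DataFrame({
    "prospect_id":   [1, 1, 1, 2],
    "datetime":      pd.to_datetime(["2014-09-01", "2014-09-03",
                                     "2014-09-05", "2014-10-01"]),
    "campaign_type": ["email", "organic", "bannerad", "sponsoredsearch"],
    "converted":     [False, False, True, False],   # transaction in this visit?
}).sort_values(["prospect_id", "datetime"])

g = lambda col: visits.groupby("prospect_id")[col]

visits["visit_seq"] = visits.groupby("prospect_id").cumcount() + 1
visits["conversion_path"] = g("converted").transform("max")
visits["prev_campaign_type"] = g("campaign_type").shift(1).fillna("None")
visits["prev2_campaign_type"] = g("campaign_type").shift(2).fillna("None")
visits["next_campaign_type"] = g("campaign_type").shift(-1)

# The final visit of a journey transitions to the absorbing (Non-)Conversion state.
is_last = visits["visit_seq"] == g("visit_seq").transform("max")
visits.loc[is_last, "next_campaign_type"] = np.where(
    visits.loc[is_last, "converted"], "Conversion", "Non-Conversion")

# Prospect Journey table: per-channel touch counts plus the conversion outcome.
prospect_journeys = visits.pivot_table(index="prospect_id", columns="campaign_type",
                                       values="visit_seq", aggfunc="count",
                                       fill_value=0)
prospect_journeys["converted"] = g("converted").max()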
Table 4.4: The number of visits and percentage of total visits coming from the different channels.
Most prospect journeys consist of a single touchpoint, in which case every attribution
method gives full credit to the channel of that touchpoint. Converting prospects generally have more touch-
points than non-converting prospects, which intuitively makes sense since non-converting
prospects drop out of the funnel in an earlier stage. However, there are relatively more
non-converting prospects that have more than 11 touchpoints, which is presumably a
group of customers that visits the company’s website frequently but is not interested in
the product. The conversion percentage initially grows with the number of touchpoints,
indicating that customers who orientate longer have a higher probability of converting.
However, if the number of touchpoints is higher than three, this percentage gradually
declines. This implies that there exists some sort of saturation point of orientation: after a
certain amount of orientation further visits do not contribute to the chances of conversion
and even work counter-productively. People who visit the website more than three times
might actually not be interested in purchasing a product, but have other reasons for the
visit. Note, however, that we intended to filter these people out of the dataset by only
taking visits that have a goal related to the funnel of the product. Now that we have
information about the length of prospect journeys, it is also interesting to consider the
channels that lead to visits.
Table 4.4 gives insight into the number of visits coming from each of the channels. As
we can see, the total number of visits is around two million, which makes the average
number of visits or touchpoints per prospect 2.7. Interestingly, around three quarters of
all these visits come from affiliate parties. One out of every ten visits comes from the
organic channel, meaning that the website address is entered directly into the browser.
7.5 percent of the visits reach the website through sponsored search (SEA), which is more
than the five percent that use organic search (SEO) to find the product. 60% (4.5 out of
the 7.6 percent) of the sponsored-search visits use the company's brand in the search
query. Further, 2.5 percent of the visits come from people clicking on links in emails
they have received from the
insurer. Finally, the referrer, bannerad and other channels are each responsible for very
minor contributions of less than one percent.
Chapter 5
Results
In this chapter, the results of this thesis are presented. First, Section 5.1 presents the
model estimation results of the different attribution models, including each model's
actual attribution to the channels. Section 5.2 discusses the performance of the models in
classifying out-of-sample observations in the conversion or non-conversion class. Finally,
Section 5.3 shows the performance of each model in the simulation study.
Table 5.1: The attribution to the channels according to the last touch, first touch and linear method,
both in absolute numbers and in percentages.
Table 5.1 displays the attribution according to each of these three rule-based heuristics,
both in absolute number of conversions and percentages.
It is evident from Table 5.1 that differences between the attribution methods exist, even
though we have seen that most customer journeys consist of a single touchpoint, in
which case every method attributes identically. The largest difference between two
methods concerns the attribution to the organic channel, which receives 8.8 percentage
points more conversion credit under last touch than under first touch. This indicates
that organic is a channel that is generally touched last in converting journeys. The
affiliate and non-branded SEA attributions according to last touch and first touch both
differ by more than 3 percentage points. The linear method attributes, for each channel,
somewhere between last touch and first touch. This can be explained by the fact that
73% of the conversion paths have one or two touchpoints, in which case linear by
definition attributes between first and last. Only for paths with three or more
touchpoints are the middle touchpoints disregarded by both single-touch heuristics,
making it possible for the linear method to attribute outside the range spanned by first
and last touch.
attribute outside the range spanned by first and last touch. As an illustration, suppose
all conversion paths contain three touchpoints. The middle touchpoint is always channel
k, and channel k does not appear first or last. In this case, linear attributes channel k
more than first or last touch (which both attribute 0%). Since conversion paths in our
dataset are generally short, this does not happen and linear attributes neatly between
the attribution of first and last touch.
Although differences between the methods clearly exist, and even appear material when
considering the absolute numbers of conversions, it is worth checking whether they are
statistically significant. To do so, the Mean Absolute Difference (MAD) between the
methods is calculated.

Table 5.2: The difference in channel attributions between the last touch and linear rule-based methods,
including its standard error and P-value.

The MAD between methods i and j is calculated as follows, where k = 1, 2, ..., K refers
to the channels:
$$\widehat{MAD}(i, j) = \frac{1}{K} \sum_{k=1}^{K} \left| \hat{A}_{k,i} - \hat{A}_{k,j} \right| \qquad (5.1)$$
The MAD is thus the average attribution difference per channel between two methods.
The MAD between FT and LT turns out to be 2.00%, between FT and LIN 1.14% and
between LT and LIN 0.86%. Last touch and linear are thus closest, with on average less
than 1% difference per channel. By bootstrapping 1000 samples, we determine the
standard error of this difference, which is 0.045%, giving a P-value of 0.000. This
indicates that the MAD between last touch and linear is significant at any conventional
significance level. Since this is the smallest MAD, we can conclude that all differences
between the rule-based models are likely highly significant.
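
A sketch of this bootstrap procedure, assuming conversion paths are encoded as lists
of channel indices 0, ..., K-1; the credit functions and all names below are ours:

import numpy as np

K = 9  # number of channels

def last_touch_credit(path):
    c = np.zeros(K); c[path[-1]] = 1.0   # full credit to the final channel
    return c

def linear_credit(path):
    c = np.zeros(K)
    for ch in path:
        c[ch] += 1.0 / len(path)         # equal split over all touches
    return c

def attribution_shares(paths, credit_fn):
    totals = sum(credit_fn(p) for p in paths)
    return totals / totals.sum()

def bootstrap_mad_se(conv_paths, credit_i, credit_j, B=1000, seed=0):
    """Standard error of the MAD between two methods, resampling conversion paths."""
    rng = np.random.default_rng(seed)
    n = len(conv_paths)
    mads = np.empty(B)
    for b in range(B):
        sample = [conv_paths[i] for i in rng.integers(0, n, size=n)]
        diff = (attribution_shares(sample, credit_i)
                - attribution_shares(sample, credit_j))
        mads[b] = np.abs(diff).mean()
    return mads.std(ddof=1)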
Finally, it is also interesting to check whether individual channel attributions differ
significantly between last touch and linear. Again, bootstrapping is used to determine
the standard error of each channel's attribution difference. The results are shown in
Table 5.2. The table demonstrates that the attribution of seven out of nine channels
differs significantly between the two methods at the 5% (*) or even the 1% (**) level.
The exceptions are the channels email and organic search (SEO). Summarizing this
subsection, we can conclude that the rule-based methods attribute conversion credit
over the channels differently, and highly significantly so.
Table 5.3: The first-order effect, second-order effect and attribution of the probabilistic model. As a
reference, the attribution according to the linear method is displayed.
Table 5.4: Regression output of the normal logistic regression. All estimated coefficients are significant
at the 5% level except for email.
Table 5.5: Attribution according to the standard logistic model, both of its extensions, and the linear model.
Let $\hat{L}$ denote the maximum value of the likelihood function of a model and $s$ the
number of estimated parameters; the AIC is then calculated as $AIC = 2s - 2\ln(\hat{L})$.
The smaller the value of the AIC, the better the quality of a model. The second exten-
sion of the logistic regression model has the smallest value (95611). The non-linear effect
model and the first extension have slightly higher values (96261 and 96299 respectively).
The ordinary logistic regression performs worst (98932). This provides substantive evi-
dence for the added value of the logistic extensions proposed by this thesis. In addition,
the fact that the non-linear effect model has a lower value than the standard logistic re-
gression model shows that the assumption that the channel contribution increases linearly
with its number of touches is probably unrealistic.
Having estimated the logistic regression models, we now turn to the question of
attribution. Table 5.5 reports the attribution percentages according to each of the logistic
models. Again, linear attribution is included in the last column for reference. It is clear
that the differences in attribution percentages are small. However, since the number of
observations is large, all MADs between models turn out to be highly significant with
a P-value of 0.000. The MAD between logistic and linear attribution is 0.22%, with
the affiliate channel being most prominently different. The affiliate channel receives
less credit under the logistic regression, implying that affiliate generally deserves less
than a linear share of the credit in conversion paths that include other channels. This
is consistent with the small
estimated coefficient of affiliate in the logistic regression model as shown in Table 5.4.
The MAD between the two logistic extensions is remarkably small (MAD = 0.02%) but
still highly significant. The difference between the logistic model and its extensions is
larger (MAD = 0.48% and MAD = 0.50% for extension 1 and extension 2 respectively).
Most notably, the extensions give more conversion credit to affiliate and non-branded
SEA, and less to organic. This implies that affiliate and non-branded SEA last touches
are relatively valuable, while organic last touches are relatively less valuable compared
to the situation in which this timing effect is not incorporated.
Channel  Ẑ0    Aff   Ban   Em    Org   Orgs  Oth   Ref   SEAb  SEAnb Conv  NConv
Aff      0.60  0.70  0.00  0.00  0.03  0.00  0.00  0.00  0.00  0.00  0.00  0.26
Ban      0.00  0.03  0.10  0.01  0.02  0.01  0.01  0.01  0.01  0.01  0.00  0.80
Em       0.04  0.10  0.00  0.22  0.09  0.03  0.00  0.00  0.03  0.01  0.01  0.52
Org      0.11  0.13  0.00  0.02  0.24  0.04  0.00  0.01  0.03  0.01  0.02  0.49
Orgs     0.09  0.02  0.01  0.01  0.09  0.20  0.00  0.00  0.06  0.02  0.03  0.57
Oth      0.00  0.21  0.02  0.01  0.06  0.02  0.11  0.01  0.02  0.01  0.01  0.52
Ref      0.01  0.05  0.01  0.01  0.08  0.04  0.00  0.18  0.03  0.01  0.05  0.54
SEAb     0.08  0.02  0.00  0.02  0.11  0.09  0.00  0.00  0.18  0.02  0.03  0.53
SEAnb    0.06  0.03  0.01  0.01  0.05  0.05  0.00  0.00  0.03  0.16  0.03  0.63
Conv     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00
NConv    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
Table 5.6: The estimated first-order Markov chain transition matrix Ŵ and initial state Ẑ0 . States are
abbreviated.
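
As an illustration of how attribution follows from these estimates, the sketch below
computes the standard Removal Effect of Anderl et al. (2014) from the transcribed
$\hat{W}$ and $\hat{Z}_0$; rows are renormalized because the printed values are rounded. Note
that the thesis argues for a modified Removal Effect definition, so this reproduces only
the baseline method.

import numpy as np

channels = ["Aff", "Ban", "Em", "Org", "Orgs", "Oth", "Ref", "SEAb", "SEAnb"]

# Initial-state probabilities and transition rows transcribed from Table 5.6.
# Columns: the nine channels, then the absorbing states Conv and NConv.
z0 = np.array([0.60, 0.00, 0.04, 0.11, 0.09, 0.00, 0.01, 0.08, 0.06])
W = np.array([
    [0.70, 0.00, 0.00, 0.03, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.26],  # Aff
    [0.03, 0.10, 0.01, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.00, 0.80],  # Ban
    [0.10, 0.00, 0.22, 0.09, 0.03, 0.00, 0.00, 0.03, 0.01, 0.01, 0.52],  # Em
    [0.13, 0.00, 0.02, 0.24, 0.04, 0.00, 0.01, 0.03, 0.01, 0.02, 0.49],  # Org
    [0.02, 0.01, 0.01, 0.09, 0.20, 0.00, 0.00, 0.06, 0.02, 0.03, 0.57],  # Orgs
    [0.21, 0.02, 0.01, 0.06, 0.02, 0.11, 0.01, 0.02, 0.01, 0.01, 0.52],  # Oth
    [0.05, 0.01, 0.01, 0.08, 0.04, 0.00, 0.18, 0.03, 0.01, 0.05, 0.54],  # Ref
    [0.02, 0.00, 0.02, 0.11, 0.09, 0.00, 0.00, 0.18, 0.02, 0.03, 0.53],  # SEAb
    [0.03, 0.01, 0.01, 0.05, 0.05, 0.00, 0.00, 0.03, 0.16, 0.03, 0.63],  # SEAnb
])
W = W / W.sum(axis=1, keepdims=True)   # rows are rounded in the thesis; renormalize
z0 = z0 / z0.sum()

def conversion_prob(keep):
    """P(absorption in Conv) from z0, with journeys through removed channels lost."""
    Q = W[np.ix_(keep, keep)]                       # transient-to-transient block
    r = W[keep, 9]                                  # transient-to-Conv column
    t = np.linalg.solve(np.eye(len(keep)) - Q, r)   # absorption probabilities
    return z0[keep] @ t

full = conversion_prob(list(range(9)))
removal = np.array([(full - conversion_prob([j for j in range(9) if j != i])) / full
                    for i in range(9)])
attribution = removal / removal.sum()               # normalize removal effects
for ch, a in zip(channels, attribution):
    print(f"{ch:6s} {a:.3f}")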
Table 5.7: Attribution percentages for the Markov first-order model, the second-order model (method 1
by Anderl et al. (2014) and method 2 proposed by this thesis) and linear model.
simulation study. This will be our focus in the next sections of this chapter.
Figure 5.1: The ROC Curve for the second extension of the logistic regression model (best classification)
and the first touch attribution model (worst classification).
Attribution method   AUC
FT                   0.7401
LT                   0.7470
LOG                  0.7712
LOGX1                0.7761
LOGX2                0.7779
MAR1                 0.7416
MAR2                 0.7762
MAR3                 0.7760

Table 5.8: The Area Under the ROC Curve for the different attribution methods.
Table 5.8 shows that the difference in classification accuracy between the logistic
extensions and the higher-order Markov models is very small. However, the second
logistic extension has the largest AUC. Remarkably, the AUC of the second-order
Markov model is higher than that of the third-order Markov model, meaning that the
increased estimation error of a third-order Markov model, due to its large number of
parameters, outweighs the benefit of taking an additional period into account. This is
known as overfitting. A fair concern, however, is which of the differences in AUC shown
in Table 5.8 are significant.
By means of bootstrapping, the standard deviation of the difference between the AUCs
of two models is derived. This standard deviation can then be used to test whether the
differences are significant. The results of this analysis are shown in Table 5.9. We can
conclude from the table that most differences are indeed significant. All methods
classify significantly more accurately than the first touch heuristic and Markov 1; the
difference between first touch and Markov 1 itself is insignificant. Last touch performs
significantly better than first touch and Markov 1, but worse than all other, more
advanced models (logistic and higher-order Markov). Among the more advanced models,
the basic formulation of the logistic regression classifies significantly worst. The
differences in classification accuracy between the logistic extensions and the
higher-order Markov models are mostly insignificant. However, the second logistic
extension performs significantly better than the first logistic extension at the 5% level.
Moreover, the P-value of the second extension against both higher-order Markov models
is 0.12. One might thus argue that the second extension is, with some confidence,
the preferred model in terms of predictive performance.
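
A sketch of such a paired bootstrap test for an AUC difference, assuming out-of-sample
outcomes and predicted probabilities are available as NumPy arrays; the
normal-approximation p-value mirrors the MAD test above, and all names are ours:

import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def auc_diff_test(y, p_a, p_b, B=1000, seed=0):
    """Paired bootstrap for the AUC difference of two models on the same test set.

    y: binary conversion outcomes; p_a, p_b: predicted conversion probabilities.
    Returns the observed difference, its bootstrap SE and a two-sided p-value.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        while y[idx].min() == y[idx].max():      # both classes must be present
            idx = rng.integers(0, n, size=n)
        diffs[b] = roc_auc_score(y[idx], p_a[idx]) - roc_auc_score(y[idx], p_b[idx])
    observed = roc_auc_score(y, p_a) - roc_auc_score(y, p_b)
    se = diffs.std(ddof=1)
    return observed, se, 2 * (1 - norm.cdf(abs(observed) / se))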
However, as written before, these results are based on the assumption that best pre-
diction implies best attribution. This assumption is highly intuitive but not necessarily
entirely true. To further explore which model attributes best, the next section discusses
the results from a simulation study in which the true attribution is known.
Table 5.9: The differences between the AUC values of the different models, their standard deviations in
brackets, and whether they are significant.
5.3 Simulation
We start the simulation results by discussing the first-order and second-order Markov
chain simulations. Then, we focus on the additional simulation study. Since Section 5.2
shows that the difference between the second-order and third-order Markov models is
insignificant, only second-order Markov model results are reported in this section.
DGP       FT     LT     LIN    LOG    LOGX1  LOGX2  MAR1   MAR2-M1  MAR2-M2  PROB
Markov 1  1.57%  1.70%  0.67%  0.54%  1.21%  1.21%  0.16%  1.34%    0.19%    7.76%
Markov 2  1.54%  0.84%  0.70%  0.63%  1.06%  1.03%  1.04%  0.60%    0.16%    6.16%

Table 5.10: Mean Absolute Error of each attribution method with respect to the 'true' attribution of the
respective DGP.
Table 5.10 reports the Mean Absolute Error of each attribution method for both the
first-order and second-order Markov simulations. The MAE is highly consistent over the
10 iterations, implying that the results and conclusions are not affected by sampling
variance.
Table 5.10 shows that the first-order Markov model attributes very closely (MAE=0.16%)
to the truth when the data is generated by a Markov 1 process. This is unsurprising, since
the large number (734.6k) of generated customer journeys enables the first-order Markov
model to accurately estimate the transition matrix and initial state. However, the
second method of the Markov 2 model also attributes very closely to the truth, with an
MAE of 0.19%. In case the data is generated by a Markov 2 process, the second method
of the Markov 2 model performs best (MAE = 0.16%), implying that the data contains
enough observations to accurately estimate a second-order model. The MAE of the
first-order Markov model is considerably larger, at 1.04%. This indicates that the true
data is better captured by a second-order than by a first-order Markov model. Although
most journeys in the true data have a single touchpoint, this shows that a memory of
more than a single period is desirable. Consistent with our out-of-sample classification
study, we thus find that a second-order Markov model attributes better than a
first-order model. A final observation from Table 5.10 is that the logistic and linear
models both attribute quite closely to the truth as expressed by the Removal Effect,
with MAEs all smaller than 0.70%.
Repeating the analysis with 100,000 generated journeys does not alter any conclusions,
implying that sampling variance does not drive our results.
In this additional simulation study, the second extension of the logistic regression
model is not estimated, since its only added value over the first extension is that it
looks two periods back. Because all customer journeys have at most two touches in this
simulation study, this added value, and thus the second extension, is irrelevant.
However, a different logistic regression model, called LOG FT, is considered. This
model dummifies the first touch instead of the last touch. Since in the specification of
the simulation model the first touch has a specific contribution to conversion
irrespective of later touches, this model is expected to attribute better than the
logistic model based on last-touch dummification. To emphasize the contrast, the
logistic model based on last-touch dummification (normally the first extension) is
called LOG LT in this section.
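
One plausible encoding of these dummified logistic models on toy journeys; the data
and column names are ours and merely illustrate the design matrices, not the thesis's
actual implementation:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy journeys: (ordered list of channels, conversion outcome).
toy_journeys = [(["ch1", "ch2"], 1), (["ch2"], 0),
                (["ch1", "ch1"], 1), (["ch2", "ch1"], 0)]
channels = ["ch1", "ch2"]

rows = []
for path, y in toy_journeys:
    row = {f"n_{c}": path.count(c) for c in channels}                 # touch counts
    row.update({f"last_{c}": int(path[-1] == c) for c in channels})   # LOG LT dummies
    row.update({f"first_{c}": int(path[0] == c) for c in channels})   # LOG FT dummies
    row["y"] = y
    rows.append(row)

df = pd.DataFrame(rows)
count_cols = [f"n_{c}" for c in channels]

log_lt = LogisticRegression().fit(
    df[count_cols + [f"last_{c}" for c in channels]], df["y"])
log_ft = LogisticRegression().fit(
    df[count_cols + [f"first_{c}" for c in channels]], df["y"])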
The second column of Table 5.11 shows the true attribution of the first channel for each
of the scenarios. The remaining columns show, for each attribution method, the
deviation from this true attribution in %. The closer the absolute deviation is to zero,
the better the model performs. The average MAE over the scenarios is also shown, as a
single measure to evaluate the attribution models. Note that this is a rather arbitrary
measure, since it fully depends on the definition of the scenarios.
Turning to the base scenario, we see that all rule-based heuristics (FT, LT and LIN)
attribute identically. This is not a coincidence. For conversion paths that contain a
single channel, attribution is by definition equal for these three methods. Two path
types remain that could create differences between the methods, namely channel 1
followed by channel 2 (denoted P12) and channel 2 followed by channel 1 (P21).
Whenever the number of P12 conversion paths equals the number of P21 conversion
paths, it is easily derived that all rule-based heuristics attribute identically, as the
sketch below shows. Since in the base case P12 and P21 each occur with probability
$0.5 \cdot 0.5 \cdot 0.5 = 0.125$ and both have a conversion probability of 8%
($p_1 + p_{2|1} = p_2 + p_{1|2} = 0.08$), the number of P12 and P21 conversion paths is
indeed the same. This explains the equal attribution of first touch, last touch and
linear in the base case and five other scenarios.
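
To make this explicit, suppose the converting paths consist of $n$ P12 paths, $n$ P21
paths and any number of single-channel paths. The credit that channel 1 receives from
the mixed paths is then

$$\text{FT}: \underbrace{n}_{\text{from P12}} + \underbrace{0}_{\text{from P21}} = n, \qquad \text{LT}: 0 + n = n, \qquad \text{LIN}: \tfrac{1}{2}n + \tfrac{1}{2}n = n,$$

and analogously for channel 2, so the three heuristics assign identical credit.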
Table 5.11: True attribution and its deviation per attribution method (in %) for the different scenarios.

The Markov 1 and Markov 2 method 1 attributions in the base case are slightly worse
than those of the rule-based heuristics. Our proposed method of the Markov 2 model,
however, attributes extremely well in the base case, which turns out to be coincidental
once the other scenarios are considered. The logistic regression models attribute
reasonably well, though the standard form attributes better than the extensions LOG LT
and LOG FT. In this scenario, the contribution to conversion per channel does not
differ between the first and the second touchpoint, so the extensions have no added
value over the standard logistic model. Unnecessarily using two parameters for a single
effect (i.e. the contribution to conversion of a channel regardless of its timing) even
makes the attribution of the extensions less accurate than that of the standard model.
Whether the dummification is based on the first or the last touch does not matter for
the attribution, since the contribution to conversion of a channel remains stable over
the touchpoints. A final interesting observation from this base case is the perfect
performance of the probabilistic model. In general, if the channels are equally likely
to be touched and the contribution to the conversion probability of both channels is
similar regardless of the touchpoint, which is the case in the base scenario, one can
show that the probabilistic model attributes perfectly. However, although present in
multiple scenarios of this simulation study, this is not a realistic assumption for real
datasets.
Considering the other scenarios and the average MAE over these scenarios as shown
in Table 5.11, one can see that the logistic regression models perform consistently
better than all other models. The best attribution model is LOG FT, since the form of
this model is built up similarly to the specification of the DGP. The LOG FT model
distinguishes itself in the last three scenarios, which are the only scenarios in which
a channel's contribution to conversion depends on whether it is the first or the second
touchpoint. The standard logistic model performs better than the LOG LT extension,
since the latter is specified substantially differently from the DGP. Similar to the
Markov chain simulations, we thus see how the DGP of the simulation study inevitably
affects the performance of the models. Furthermore, one can see that the second method
of the second-order Markov chain model performs best within the class of Markov chain
models, providing evidence for using this method in further research. Moreover, the
second-order Markov models perform better than the first-order Markov model.
On average, though, the rule-based heuristics surprisingly attribute more accurately
than the Markov chain models. The best performing rule-based heuristic is linear,
followed by last touch and first touch. Finally, the probabilistic model attributes quite
well on average, which can be explained by its large number of perfect attributions.
However, in the generally more realistic scenarios in which the two aforementioned
requirements for perfect attribution are not fulfilled, the probabilistic model performs
quite poorly. In conclusion, the logistic regression models, and specifically the LOG FT
extension, are the evident winners of this simulation study.
An interesting question for discussion is to what extent the DGP of this additional
simulation study and its assumptions are realistic. The study aimed to reproduce
customer journeys in an intuitive way, without intentionally favouring any attribution
method over the others. However, by postulating that the first touchpoint has a certain
contribution to conversion regardless of possible later touchpoints, which is intuitively
a plausible assumption, it inadvertently favours LOG FT over LOG LT. Although not
verified in any way, this may indicate that LOG FT is better able to represent reality
and is thus the better model. On the other hand, the assumption of a fixed
first-touchpoint contribution regardless of later activity may itself be flawed. As a
good starting point to find out which is the case, further research is recommended to
test the performance of LOG FT against LOG LT in our empirical study. If LOG FT
performs better, this suggests that the assumption, and thus the DGP of this simulation
study, is realistic, and that LOG FT is justly the best logistic extension. If LOG LT
turns out to be the better classifier, the assumption of this study is untrue, and a new
simulation framework with a more realistic DGP should be built.
Chapter 6
Conclusion
This thesis has investigated the question of which attribution method is best along
three dimensions: theoretically, empirically and in a simulation study. First, seven
theoretical criteria that are desirable for accurate attribution have been formulated,
and the different attribution methods and models have been evaluated in the light of
those criteria. Second, the attribution methods have been tested empirically. Under the
assumption that accurate attribution implies accurate prediction, each model's
out-of-sample classification accuracy is measured by the Area Under the ROC Curve,
which is taken as the empirical criterion to judge the models. Third, new data is
simulated according to a representative variety of Data Generating Processes in which
the true attribution is known; the Mean Absolute Error of the different attribution
models serves as the performance measure in this simulation study. The combination of
this theoretical, empirical and simulation analysis enables this thesis to answer its
main question about the best attribution method in a well-substantiated manner. Such a
standardized and extensive methodology for this purpose has, as far as this author
knows, not yet been developed in the existing literature.
The theoretical analysis showed that the rule-based methods are fundamentally flawed
because they are not data-driven. Nor do we advocate the simple probabilistic model,
since its attribution of conversion probabilities rather than absolute conversions can
yield very counter-intuitive attribution percentages. Further, we have seen that the
first-order Markov model is very similar to the last-touch method and thus inadequate
as well. Higher-order Markov models have more potential, although we argue for a new
definition of the Removal Effect, and their aggregate nature can be inconvenient. The
logistic regression model competently accounts for channel heterogeneity and individual
attribution, but unfortunately does not capture any timing differences. This thesis
therefore develops two extensions of the logistic model that incorporate this timing
effect by dummifying last touches. These logistic model extensions fulfil most of the
postulated criteria and have much potential from a theoretical perspective.
Applying the different attribution methods to the data results in significant differences
in how they attribute conversion credit over the channels. Moreover, out-of-sample
classification accuracy also differs significantly between the models. The second
logistic extension, which incorporates dummies for the last and second-to-last touch,
performs best, although its P-value against the higher-order Markov models is a modest
0.12. The rule-based methods and the first-order Markov model classify significantly
less accurately. A Markov simulation provides evidence that the data is better captured
by a second-order Markov model than by a memoryless first-order model. Finally, a
simple additional simulation study shows that the logistic regression models on average
attribute best across a wide range of scenarios. More specifically, the DGP
specification causes an extension of the logistic model that dummifies first touches to
be the best performer. Remarkably, the simulation study further shows that the Markov
models attribute rather poorly, most of the time even worse than the linear method. As
a general remark, however, it should be noted that in both simulation studies the
chosen DGP inevitably affects the performance of the attribution models to some extent.
In conclusion, the class of logistic regression models consistently comes out first in
our theoretical, empirical and simulation studies. Extending the basic logistic model to
integrate timing effects results in an even better performing model. However, further
research should show whether this extension should be based on dummifying the last or
the first touches, since this thesis does not provide unequivocal evidence on this
issue. The model that probably performs closest to the logistic regression models is
the linear attribution method. The Markov and probabilistic models, although presented
in a triumphant fashion by Anderl et al. (2014) and Dalessandro et al. (2012)
respectively, perform much weaker than expected. In line with our expectations, the
single-touch rule-based methods do not emerge as accurate attribution techniques
either.
Although the best attribution model according to our postulated criteria, the validity
of the logistic regression model applied to the attribution problem is doubtful from an
econometric perspective. The functional form of the logistic regression model presumes
a causal relation between the channel touches and the probability to convert.
Indisputably, this relation exists, since touching channels certainly has an effect on
the conversion likelihood. However, the relation probably exists in the other direction
as well: a prospect who is looking for a product, and therefore has a higher
probability to convert, inevitably starts touching channels. This simultaneity makes
the covariates and the error term correlated, creating an endogeneity bias in the
estimated parameters and thus in the attribution over the channels. For this reason,
this thesis considers the logistic regression model a decent solution for the time
being rather than the ultimate attribution technique.
More significant than the determination of the ‘temporarily best attribution
technique’, this thesis has proposed a framework to evaluate attribution models in a
standardized, all-encompassing way. This methodology can be used to test newly
proposed or already developed, more advanced attribution models. This is an interesting
direction for further research, since the perfect attribution model does not yet exist.
Another direction is to further develop the simple simulation study proposed in this
thesis, which is still very basic and preliminary and can relatively easily be extended
to facilitate more channels, touchpoints and scenarios. This should give insight into
which model functions best under which circumstances found in the data. Moreover, the
simulation study's assumption that the first channel's contribution is irrespective of
later touches should be verified. If this research agenda is taken up, this author is
confident that we will gradually move closer to a solution to the attribution problem.
Such a result would not only be a theoretical breakthrough, but would also be of great
practical interest to all firms that engage in digital marketing. Finding this ‘holy
grail’ of the attribution problem would allow marketeers to accurately determine the
true contribution and performance of all digital channels and advertisements. The
marketing budget could then be allocated optimally over the channels, enhancing the
number of views, clicks and converters, and thus the profitability of the firm:
something every firm is looking for.
Chapter 7
Appendix
Table 7.1: The regression output of the logistic regression model including dummies for the last touch
channel.
Table 7.2: The regression output of the logistic regression model including dummies for the last touch
and second-to-last touch channel.
Table 7.3: The regression output of the non-linear effect logistic regression model including dummies for
$NC_{i,k} = 1$ and $NC_{i,k} > 1$.
Bibliography
Anderl, E., Becker, I., Wangenheim, F. V., and Schumann, J. H. (2014). Mapping the
customer journey: A graph-based framework for online attribution modeling.
Berchtold, A. and Raftery, A. (2002). The mixture transition distribution model for high-
order Markov chains and non-Gaussian time series. Statistical Science, 17(3):328–356.
Berman, R. (2013). Beyond the last touch: Attribution in online advertising. Available
at SSRN 2384211.
Cho, C. (2003). Factors influencing clicking of banner advertisement on the WWW.
CyberPsychology and Behavior, 6(2):201–215.
Dalessandro, B., Perlich, C., Stitelman, O., and Provost, F. (2012). Causally motivated
attribution for online advertising. In ADKDD '12: Proceedings of the Sixth International
Workshop on Data Mining for Online Advertising and Internet Economy.
He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on
Knowledge and Data Engineering, 21(9):1263–1284.
Kireyev, P., Pauwels, K., and Gupta, S. (2013). Do display advertisements influence
search? Attribution and dynamics in online advertising. Harvard Business School Working Paper.
Patricio, I., Fisk, R., Cunha, J., and Constantine, I. (2011). Multilevel service design:
From customer value constellation to service experience blueprinting. Journal of Service
Research, 14(5):180–200.
Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games,
2:307–317.
Skiera, B. and Nabout, N. A. (2013). Prosad: A bidding decision support system for
profit optimizing search engine advertising. Marketing Science, 32:213–220.
Xu, L., Duan, J. A., and Whinston, A. (2014). Path to purchase: A mutually excit-
ing point process model for online advertising and conversion. Management Science,
60(6):1392–1412.
Zhang, Y., Wei, Y., and Ren, J. (2014). Multi-touch attribution in online advertising
with survival theory. 2014 IEEE International Conference on Data Mining (ICDM),
pages 687–696.