MSc Thesis

Multi Touch Attribution

Searching for the Best Attribution Model

Joël van Kesteren

Student number: 10001962


Date of final version: December 18, 2015
Master’s programme: Econometrics
Supervisor: Dr. N. van Giersbergen
Second reader: Dr. K.J. van Garderen
Facilitator: MIcompany (D. de Bruin, MSc & F. de Jong, MSc)

Faculty of Economics and Business


Abstract

The topic of attribution, that is, determining the contribution of marketing channels to
a purchase or conversion, is hot. In recent years, a plethora of models and methods to
assign attribution have been proposed in the academic and business literature. However,
the question of which model functions best according to objective criteria has largely been
ignored. This thesis answers this question by developing a standardized methodology
to evaluate attribution models on three dimensions: theoretically, empirically and in a
simulation study. In the theoretical discussion, seven desirable criteria are formulated.
The models are consequently compared in the light of those theoretical criteria. For the
empirical component, website visit data from a large Dutch financial institution is used
to produce out-of-sample conversion classifications based on the touched channels. The
Area Under the ROC Curve then serves as a measure to compare the attribution models.
In the simulation study, data is generated according to a wide range of scenarios in which
the true attribution is known. The models are judged on the Mean Absolute Error of
their attribution with respect to the true attribution. This thesis finds that the logistic
regression model performs best on all three dimensions. Moreover, the performance of
this model can be significantly improved by a simple extension that incorporates timing
effects. The Markov chain and probabilistic models perform surprisingly badly. As
expected, the rule-based methods such as last touch do not perform well either. In conclusion,
although the logistic regression model is far from perfect, partly due to endogeneity issues,
this thesis recommends that companies employ it for the time being. Above all, however,
it encourages econometricians and marketeers to develop new models and methods and
to evaluate them with the methodology that this thesis has laid out. Following this route,
this author is positive that the perfect attribution model will be found.

Contents

1 Introduction

2 Theory
2.1 Digital advertising
2.2 Digital advertising channels
2.3 Multi Touch Attribution
2.3.1 Criteria
2.3.2 Models
2.3.3 Theoretical evaluation

3 Method
3.1 Models
3.2 Classification accuracy
3.3 Simulation performance

4 Data
4.1 Data description
4.2 Data processing
4.2.1 Relevant visit selection
4.2.2 Prospect identification
4.2.3 Create prospect journeys
4.3 Data insights

5 Results
5.1 Model estimation and attribution
5.1.1 Rule-based heuristics
5.1.2 Simple probabilistic model
5.1.3 Logistic regression models
5.1.4 Markov chain models
5.2 Classification accuracy
5.3 Simulation
5.3.1 Markov chain simulations
5.3.2 Additional simulation study

6 Conclusion

7 Appendix
7.1 Regression output Logistic extensions
Chapter 1

Introduction

At this moment, more than 3 billion people around the world use the internet.¹ This number
has been growing explosively since the introduction of the world wide web.
With this vast reach, the web offers tremendous opportunities for marketing purposes. It
is not surprising that digital marketing has grown likewise, making it a $121 billion
industry in 2014 with a year-on-year growth of 16%.² In addition to the potential reach
digital marketing offers, it has two other significant advantages over traditional media.
Firstly, an online advertisement can be uniquely tailored to each individual providing
perfect customization possibilities. Secondly, all visits of internet users can be tracked
and stored, enabling perfect tracking of the number of views an advertisement gets and
the number of subsequent product purchases or conversions. Theoretically, this gives
marketeers the opportunity to accurately evaluate online advertisements or the channels
that serve them. Typical online channels are search engine advertising, email, display
and social media.
However, in practice this evaluation of channels proves to be rather challenging. Since
potential customers or prospects typically ‘touch’ multiple channels before converting, the
contribution of each of those channels should be determined. The problem of determining
the contribution of the channels a prospect touches before conversion is referred to in the
literature as the attribution problem. Traditionally, the full conversion credit is assigned
to the last channel a prospect touches prior to conversion, a method called last touch
attribution. However, it can be easily seen that this method is fundamentally flawed,
since it completely ignores the contribution of channels in prior touches. Having realized
this flaw, both the business world and academia have devoted themselves to a solution to
the attribution problem. The result is that a plethora of attribution methods and models
have been proposed.
¹ http://www.internetlivestats.com/internet-users/
² Report by ZenithOptimedia, 2014


Initial alternatives to the last touch attribution that have been proposed are first
touch attribution (assigning all conversion credit to the first touch) and linear attribution
(assigning equal credit to each touchpoint). However, all three are still rule-based methods
that a priori presuppose a certain weight to all touches without actually accounting for
the data. In response, Shao and Li (2011) propose two data-driven attribution techniques:
a (bagged) logistic regression model and a simple probabilistic model. Dalessandro et al.
(2012) further refine this probabilistic model, and demonstrate that it is fundamentally
equal to the well-known Shapley Value from cooperative game theory (Shapley, 1953).
Anderl et al. (2014) introduce an entirely different solution to the attribution problem,
modelling the prospect paths as Markov chain models and calculating the attribution
through a Removal Effect. Other research has tackled the issue of attribution through
Survival Theory (Zhang et al., 2014) or Bayesian Models (Li and Kannan, 2014). In
short, it is evident that an almost chaotic abundance of attribution methods and models
has been put forward, with each author advocating their own solution. The extant
literature evidently fails to create order by addressing the question of which of all
those methods is the best solution. Addressing and answering this question is the main
goal of this thesis.
In order to do so, this thesis develops a methodology to evaluate attribution models
on three dimensions: theoretically, empirically and in a simulation study. It considers
the rule-based methods (last touch, first touch and linear), the probabilistic model, the
logistic regression model and the Markov chain models. In the theoretical analysis, seven
criteria are distilled from the extant literature and formulated by examining an abstract
conception of the perfect attribution model. The attribution models are then evaluated
in the light of these theoretical criteria. Empirically, we examine the performance of
the models by their out-of-sample classification accuracy. The data for the empirical study
contains all visits on a website of a large Dutch financial institution during ten months,
including variables on the timing, the channel and whether a conversion takes place. By
producing out-of-sample conversion classifications for this dataset and determining the
Area Under the ROC Curve (AUC), the predictive performance of the models can be ob-
jectively assessed. Underlying this assessment is the assumption that accurate prediction
implies accurate attribution. The third component of our study consists of simulating
a wide variety of scenarios in which the true attribution is known, and calculating the
Mean Absolute Error of the models with this true attribution. This simulation study is
useful in determining which attribution method performs best under which circumstances
to be found in the data, and provides another objective criterion to evaluate the models.
In addition to providing an answer to the question which of the considered attribution
models performs best, the most important contribution of this thesis is a standardized
methodology for evaluating models.
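The two numerical yardsticks named above, the Area Under the ROC Curve and the Mean Absolute Error, can be illustrated with a minimal sketch. The function names and the toy numbers are illustrative assumptions, and the choice to average the absolute error over channels is an assumption as well; the exact procedure used in this thesis is the one laid out in Chapter 3.

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen converter receives a
    higher score than a randomly chosen non-converter (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one converter and one non-converter")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_absolute_error(estimated, true):
    """Average absolute gap between estimated and true attribution.
    Assumption: the average is taken over the channels in `true`."""
    return sum(abs(estimated.get(c, 0.0) - true[c]) for c in true) / len(true)


# Hypothetical out-of-sample predictions (1 = conversion).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
example_auc = auc(labels, scores)

# Hypothetical attribution shares in a simulated scenario.
estimated = {"search": 0.5, "email": 0.3, "display": 0.2}
true = {"search": 0.4, "email": 0.4, "display": 0.2}
example_mae = mean_absolute_error(estimated, true)
```

The pairwise formulation of the AUC is quadratic in the number of observations, which is acceptable for a toy example but not for a full visit log; a rank-based computation would be preferable there.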


The structure of this thesis is as follows. In Chapter 2 the most popular attribution
methods and models are introduced, discussed and evaluated in the light of our derived
theoretical criteria. Chapter 3 explains the methodology for the empirical and simulation
analysis in this thesis. Chapter 4 describes the data used for the
empirical analysis and how they are processed. The results of our research are presented in
Chapter 5. Finally, this thesis concludes with the answer to our main question, a brief
discussion and potential directions for further research in Chapter 6.
Chapter 2

Theory

In this chapter, the theory and literature behind digital marketing and attribution mod-
elling are discussed. First, Section 2.1 discusses digital advertising, its advantages and
preconditions. Section 2.2 introduces the most common digital channels. Finally, Section
2.3 reviews the literature on multi touch attribution. Theoretical criteria for a good at-
tribution model are derived and the most important attribution models are introduced,
discussed and evaluated in the light of those criteria.

2.1 Digital advertising


Advertisements are used by a brand to communicate a message to an audience, usually
in order to persuade it to undertake some action. Such an action can for instance be pur-
chasing a product, signing up for a service or visiting a shop, and will in the remainder of
this thesis be referred to as a conversion. Traditionally, advertisements have reached their
audience through media such as television, radio, newspapers and outdoor advertising. In the
early days of the internet, commercial use of the network was prohibited.
However, this ban was gradually phased out, and since the internet became widely popular
in the late 1990s, online advertising has become one of the most popular advertising
media. In 2013, online advertising revenue was $42.8 billion in the United States alone,
thereby surpassing television broadcast spending.¹
We can distinguish three reasons that explain the popularity of advertising
on the world wide web. Firstly, the network has an immense potential reach, with an
estimated number of users that exceeds three billion at the moment of writing. Secondly,
the internet has the potential to identify customers at an individual level, enabling
advertisers to customize an advertisement, which increases its effectiveness. Finally, the reach
of and response to online advertisements can be stored and monitored on an individual
level, enabling accurate performance evaluations. In the next paragraphs we will
elaborate on the latter two advantages of online advertising.
¹ Report by ZenithOptimedia, 2014
In the marketing literature, the effect of customization has been studied extensively.
Ansari and Mela (2003) for instance argue that customized and targeted advertisements
attract customer attention and foster customer loyalty, and thereby have a considerably
higher probability of persuading customers to a desired end. Customized advertisements
are - if targeted appropriately - much more capable of fulfilling a customer’s need than
broad and general advertisements. Advertisements can be personalised through their
content, message or visual representation. However, customization through traditional mass
media such as television or radio is only possible at a collective level.
In contrast, digital advertising has the major advantage that its advertisements
can be tailored to each unique individual. An advertiser can decide to change the
advertisement based on the past browsing history or collected preference information of a
potential customer. This can be done through models and algorithms, making the
‘e-customization’ quick and easy. The research of Ansari and Mela (2003) is one of the first to
develop such a model, with the purpose of optimally customizing the content and
representation of e-mails. They find that the expected click-through rate of these e-mails can be
increased by 62%.
Another major advantage of online marketing as opposed to traditional marketing is
that the performance of digital advertisements can be evaluated much more accurately
than that of ads in traditional media. Every impression, click and conversion per
advertisement is recorded on an individual level. The digital advertising medium is therefore
well suited to accurately evaluate how many conversions or how much revenue each
advertisement brings in. This enables marketeers to calculate each advertisement’s marketing
Return On Investment (ROI). Based on this ROI, budget allocation to the different online
advertisements and channels can be improved, eventually resulting in a more profitable
firm.
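As a minimal illustration of the ROI comparison described above, the sketch below ranks channels by return on investment. The text does not spell out a formula, so the common definition ROI = (revenue - cost) / cost is assumed here, and all channel names and figures are made up for the example.

```python
def marketing_roi(revenue, cost):
    """Return on investment, assuming ROI = (revenue - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (revenue - cost) / cost


# Hypothetical (attributed revenue, advertising cost) per channel.
channels = {
    "search": (12000.0, 4000.0),
    "display": (3000.0, 2500.0),
    "email": (1500.0, 300.0),
}

roi = {name: marketing_roi(rev, cost) for name, (rev, cost) in channels.items()}

# Channels ordered from highest to lowest ROI, as a starting point for
# reallocating budget.
ranked = sorted(roi, key=roi.get, reverse=True)
```

Note that the revenue figures fed into such a calculation already presuppose an attribution of conversions to channels, which is exactly the problem this thesis addresses.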
In contrast, analysis of the performance of traditional media such as television and
radio can only be done through aggregated data or expensive and untrustworthy surveys.
Say, for instance, that we want to evaluate the performance of a large television campaign
for a hotel chain. Our best option is to compare the number of bookings during our
campaign period with the baseline number of bookings. The additional bookings can be
attributed to the television advertisement. However, this method is based on a strong
assumption, since all other factors explaining the variation in the number of bookings
are ignored. Moreover, this method becomes complex when multiple advertisements are
displayed on different channels. Alternatively, some of the customers might be asked to
fill out a questionnaire asking them which channel predominantly induced them to book.
However, those surveys are generally unreliable for reasons such as the difficulty
of acquiring a representative sample or the ignorance and forgetfulness of participants.
A precondition to both advantages of online advertising - that is, the possibility of
customization and improved performance evaluation - is the ability to identify unique
persons from the data. If this precondition is fulfilled, we can reconstruct full online
prospect or customer journeys, containing all visits, touchpoints or touches (these terms
are used interchangeably in this thesis) a person makes prior to converting. This
identification of individuals across multiple visits turns out to be non-trivial. In the literature,
this is usually done with HTTP cookies. Additionally, this thesis advocates the use of
IP-addresses.
Websites send cookies, small pieces of data, to a user’s web browser when the user
visits the website for the first time. These cookies are then saved locally on the user’s
device. Each subsequent time the user visits the website with the same browser and device, the
browser sends the cookie back, so that the website recognizes the browser and device. It is likely
that a subsequent visit containing the same cookie pertains to the same person. This
is how cookies identify a user across multiple visits.
However, solely using cookies to reconstruct full customer journeys, although common
in the literature, has its shortcomings. The predominant reason is that cookies are not
able to identify a person across all visits. To illustrate this, suppose a user visits the
website multiple times from different browsers or different devices. These visits cannot
be related to the same individual by the use of cookies alone. Moreover, cookie tracking
can be disabled. Cookies can therefore only relate part of a user’s visits to this same user.
A second disadvantage is that different persons can visit the website on the same device,
user account and browser, which cookies consider the same individual. However, since
people are increasingly using their own devices, we will assume this risk to be small.
The limitation that cookies cannot bundle all visits pertaining to the same user can be
partly overcome by complementing information from cookies with information from the
public IP-address of a visit. A public IP-address is a numerical label that is unique to an
internet connection. Visits with the same IP-address are therefore likely to come from the same
household and thus from the same person. However, relating visits to a unique person this
way should be done with great caution, since multiple persons may form a household.
Moreover, institutions such as offices, libraries or universities generally have a
single IP-address to which multiple people connect. Due to these drawbacks, combining
visits based on IP-addresses is unusual in the literature. However, with inclusion of
some restraining conditions, we argue in Chapter 4 that IP-bundling can be done to
further fine-tune our prospect identification across multiple visits. By using cookie and
IP-address information intelligently, a full online customer journey can be reconstructed,
containing all relevant touches a person has prior to (non-)conversion.
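The idea of bundling visits through shared cookies and shared IP-addresses can be sketched as a connected-components problem. The code below is only an illustration under simplifying assumptions: the field names `visit_id`, `cookie` and `ip` are hypothetical, and the restraining conditions on IP-bundling that Chapter 4 argues for are deliberately left out, so applied as-is it would merge everyone behind a shared office or library connection.

```python
def bundle_visits(visits):
    """Group visits into prospects: two visits belong to the same prospect
    if they are linked by a chain of shared cookies or shared IP-addresses.
    Returns a dict mapping visit_id to an integer prospect label."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for v in visits:
        node = ("visit", v["visit_id"])
        find(node)
        if v.get("cookie") is not None:
            union(node, ("cookie", v["cookie"]))
        if v.get("ip") is not None:
            union(node, ("ip", v["ip"]))

    labels = {}
    prospects = {}
    for v in visits:
        root = find(("visit", v["visit_id"]))
        prospects[v["visit_id"]] = labels.setdefault(root, len(labels))
    return prospects


# Hypothetical visits: 1 and 2 share an IP-address, 2 and 3 share a cookie.
visits = [
    {"visit_id": 1, "cookie": "a", "ip": "10.0.0.1"},
    {"visit_id": 2, "cookie": "b", "ip": "10.0.0.1"},
    {"visit_id": 3, "cookie": "b", "ip": "10.0.0.2"},
    {"visit_id": 4, "cookie": "c", "ip": "10.0.0.3"},
]
prospects = bundle_visits(visits)
```

In this toy input, visits 1 to 3 end up in one prospect even though visits 1 and 3 share neither cookie nor IP-address, which shows both the strength of the approach and why unrestricted IP-bundling requires caution.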


Such a customer journey is extensively described in the marketing literature. Typically, it
contains multiple engagement phases. One of the most influential models of this journey is
AIDA (Strong, 1925), an acronym for Attention, Interest, Desire and Action. Usually, these
engagement phases are illustrated as a funnel, indicating that a certain number of prospects
are lost in each consecutive phase (see Figure 2.1). However, modern-day literature considers
the AIDA model inappropriate and unrealistic. New engagement phases such as Satisfaction and
Confidence have been proposed (Barry, 1987). Moreover, new theories based on empirical
research state that prospects do not necessarily engage with each of these phases or may
do so in a different order. A customer journey is nowadays seen as a constant process of
information-gathering and decision-making (Patricio et al., 2011). Nevertheless, AIDA
still tends to serve as the reference point for research on customer journeys.

Figure 2.1: The AIDA funnel.

This thesis focuses on the online component of this customer journey. Hence, all visits
to the website prior to (non-)conversion are considered. A prospect can reach a website in
multiple ways: by entering the exact web address, by clicking on a link in an email or by using
a search engine. These ways of reaching a website are called digital or online channels.
Rather than attributing conversions to individual advertisements, this thesis aims to investigate the
contribution of each channel to the total number of conversions. It is therefore important
to become familiar with the different digital channels, which are the topic of the next
section.

2.2 Digital advertising channels


A website can be reached through different channels. The online traffic that flows from
many of those channels can be influenced by advertising. Evidently, those online ad-
vertising channels are of specific interest for marketeers. The main online advertising
channels are search engine marketing, affiliate marketing, display, email and social media
advertising. We will discuss each of those channels individually in this section, starting
with search engine marketing.
Search engine marketing aims to improve the visibility of the advertiser on search
engines such as Google, Bing or Yahoo. This can be done through bidding for specific
keywords (search engine advertising, SEA) or adjusting the website in a specific way to
achieve a higher ranking (search engine optimization, SEO). Both SEA and SEO are
important channels for advertisers, since more than 90% of all internet users make use of
search engines to acquire information and orient themselves on the products they need or
desire.² Advertised SEA results are displayed above the organic SEO results.
The position of a specific SEA result depends on the bid of the advertiser, a
quality score and the expected impact of possible extensions. The bid of an advertiser on
a keyword is expressed as an amount paid per click (Cost per Click or CPC). As
long as the user does not click on the SEA result of the query, the advertiser therefore incurs
no costs. This explains why search engines additionally base the SEA result positioning
on a quality score, which is a function of the expected number of clicks (clickthrough
rate or CTR), the relevance of an ad and a user’s landing page experience. Finally,
the impact of extensions, such as features that show extra business information (e.g. a
telephone number or address), is taken into account. The amount an advertiser pays
is the minimum it would have had to bid to beat the advertiser one position below, which
generalizes the second-price auction of Vickrey (1961). In practice, the paid amount is
usually significantly lower than the bid, especially when one has a good quality score.
Skiera and Nabout (2013) develop a model to find the optimal bid that
leads to the highest profit for each keyword. In their model, they presuppose a causal
relation between position and the relative number of clicks (CTR) and estimate this
relation statistically. They find that a numerically lower rank, that is, a higher position on
the page, gives a higher CTR, which confirms
the intuition that users scan their results from top to bottom.
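The pricing rule described above can be sketched in a few lines. This is a simplified illustration, not the actual mechanism of any search engine: it assumes that ads are ranked by the product of bid and quality score, ignores extensions and reserve prices, and uses invented advertisers and numbers.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float      # maximum cost per click the advertiser is willing to pay
    quality: float  # quality score, assumed strictly positive


def run_auction(ads):
    """Rank ads by bid * quality (an assumed ranking rule) and charge each
    advertiser the minimum bid that would still beat the ad one position
    below. Returns a list of (name, position, price per click)."""
    ranked = sorted(ads, key=lambda a: a.bid * a.quality, reverse=True)
    results = []
    for pos, ad in enumerate(ranked):
        if pos + 1 < len(ranked):
            below = ranked[pos + 1]
            price = below.bid * below.quality / ad.quality
        else:
            # No competitor below; a real auction would apply a reserve price.
            price = 0.0
        results.append((ad.name, pos + 1, round(price, 2)))
    return results


ads = [
    Ad("A", bid=2.0, quality=8.0),
    Ad("B", bid=3.0, quality=4.0),
    Ad("C", bid=1.0, quality=6.0),
]
outcome = run_auction(ads)
```

In this example advertiser A takes the top position despite bidding less than B, and pays 1.50 per click rather than its bid of 2.00, which illustrates why a good quality score lowers the amount actually paid.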
Within the realm of SEA, two sub-channels are distinguished based on the nature of
the keywords: branded and non-branded SEA. Branded keywords specifically refer to the
advertised brand. When someone searches for branded keywords, one might assume that he or she
prefers that company to purchase a product or acquire information from. Since keywords
that include an advertiser’s brand are highly relevant to the internet user, the advertiser’s
quality score will generally be unbeatable. Therefore, for branded keywords relatively low bids are
sufficient to gain a top position. In contrast, the competition for non-branded or generic
keywords such as ‘car insurance’ or ‘laptop’ is much higher.
Search engine optimization (SEO) is the process of optimizing the ranking of unpaid
or organic results. According to an eye tracking study,³ around 70% of search engine
users skip the advertisement results. SEO is therefore undoubtedly an extremely valuable
marketing channel. The position of a result is determined by the relevance of the content
of a website to specific keywords. Strategies that are used to improve the ranking may
include increasing the number of backlinks (incoming links to a website), editing content
in HTML or removing barriers to the indexing activities of search engines.
² Pew Internet Survey, May 2011
³ Performed by GfK, gfk.com


Another digital advertising channel that causes traffic to flow to a website is the
affiliate channel. An affiliate is a third party that links to the advertiser’s website. Examples
of affiliate parties include price comparison websites, web directories or product review
sites. The majority of advertisers that engage in affiliate programs reward the affiliate
with a certain amount per sale (Pay Per Sale or PPS). Closely related to affiliate is the
referral channel. Whereas affiliates are primarily motivated financially, referrers refer
prospects to brands they know well and have a good relationship with. The motives of the
third party and its relationship with the advocated brand are therefore the fundamental
difference between the affiliate and referral channel.
A very well-known digital advertisement channel is display or banner advertising. Re-
search has shown that internet users find (a large number of) banners annoying (Cho,
2003). However, Kireyev et al. (2013) find that banners have a significant indirect effect.
Although prospects do not directly click on banners because they find them annoying or
untrustworthy, Kireyev et al. (2013) find that these prospects do have a larger probabil-
ity of searching for the displayed products through other channels. The most popular
compensation scheme for display advertising, which is paying per click on a banner, may
therefore not accurately reward the true influence of the channel. As will become clear
later, this problem can be addressed by attribution modelling.
Finally, although they play a minor role in this thesis, two other digital advertising
media, e-mail and social media advertising, are also worth mentioning. The reach of
e-mail advertising is restricted to prospects that have provided their e-mail addresses on
the website and to current or past customers. However, the target group of e-mail advertising
is more engaged with the firm and is thus more likely to convert. Given the immense
popularity of Facebook and Twitter, the social media channel is important for marketeers
as well. Social media advertising can be seen as an “effort to create content that attracts
attention and encourages readers to share it across their social networks”.⁴ Social media
marketing is effective because one generally has more trust in the word of mouth of friends
in one’s social network than in firms.
Having briefly introduced the most important digital advertising channels, we now
turn to the question of how to evaluate a channel’s performance. Since prospects typically
touch multiple channels, this issue is not as straightforward as one would expect. After all,
a method is needed to assign credit fairly across the channels. These
methods or attribution models are the topic of the next section.
⁴ Wikipedia

2.3 Multi Touch Attribution


This section provides an overview of the existing literature on attribution models. We
will first review the criteria that have been formulated for a proper attribution model and
subsequently formulate our own seven criteria. Then, mathematical notation for
attribution modelling is introduced. Finally, the different attribution models are discussed
and evaluated in the light of our criteria.
The topic of attribution modelling has recently gained widespread interest in the
marketing literature. The main explanation for this popularity is the growing importance
of digital marketing and its potential to track all the online channel touches of internet
users. If a company is able to gather and store data concerning the clicks of its visitors, a
full online customer journey can be reconstructed. From these journeys, the credit of each
visit to a conversion can be attributed. Since a visit always comes from a certain channel,
channels can thus be assigned a part of the (total) conversion credit. Attribution can be
expressed as the absolute number of conversions or as a percentage of the total number
of conversions driven by a certain channel. In theory, an infinite number of attribution
methods can be thought of, raising the question of which method most accurately reflects
the ‘true’ contribution of a certain visit. Answering this question is fundamental in order to
evaluate different channels or advertisements based on their true performance.
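To make the two ways of expressing attribution concrete, the sketch below applies the three rule-based methods mentioned in the introduction (last touch, first touch and linear) to a handful of invented journeys and reports both absolute credit and shares. The data layout, a list of channel sequences with a conversion flag, is an assumption made for this illustration only.

```python
from collections import defaultdict


def attribute(journeys, rule):
    """Distribute one unit of credit per converting journey over its channels.
    journeys: list of (channels, converted); rule: 'last', 'first' or 'linear'.
    Returns a dict mapping channel to absolute conversion credit."""
    credit = defaultdict(float)
    for channels, converted in journeys:
        if not converted or not channels:
            continue
        if rule == "last":
            credit[channels[-1]] += 1.0
        elif rule == "first":
            credit[channels[0]] += 1.0
        elif rule == "linear":
            for channel in channels:
                credit[channel] += 1.0 / len(channels)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return dict(credit)


def shares(credit):
    """Express absolute credit as a fraction of all attributed conversions."""
    total = sum(credit.values())
    if total == 0:
        return {channel: 0.0 for channel in credit}
    return {channel: value / total for channel, value in credit.items()}


# Hypothetical journeys: (touched channels in order, converted?).
journeys = [
    (["display", "search", "email"], True),
    (["search"], True),
    (["display", "search"], False),
    (["email", "search"], True),
]

last = attribute(journeys, "last")
first = attribute(journeys, "first")
linear = attribute(journeys, "linear")
last_shares = shares(last)
```

All three rules distribute exactly the three observed conversions, yet display receives no credit under last touch and a full conversion under first touch, which is the kind of disagreement that motivates the criteria in the next subsection.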

2.3.1 Criteria
Besides developing a variety of models, the extant literature has been concerned with for-
mulating criteria in order to determine what is a ‘good’ attribution model. This search
for universally accepted and standardized attribution criteria is important for two rea-
sons. First and foremost, the true attribution of a certain channel is unobserved, making
the topic inevitably subjective to some extent. Secondly, the actual implementation of
attribution models by marketeers requires more practical criteria as well.
Shao and Li (2011) propose a bivariate metric to evaluate an attribution model: a
metric that evaluates both accuracy and variability. Accuracy means that a proper model
must be able to classify prospects as converters or non-converters. They evaluate accuracy
by the out-of-sample misclassification error rate. This is mathematically expressed as
$(FP + FN)(TP + TN + FP + FN)^{-1}$, with the elements explained in the confusion
matrix in Table 2.1. Strangely, Shao and Li (2011) do not report any threshold to
classify a probability as a predicted conversion or non-conversion, and it is thus unclear
how they produced the exact numbers for their accuracy metric. In addition to predictive
power, they state that the variability of the model’s parameter estimates is important.
Consequential decisions of marketeers may after all be based on the parameter estimates of
CHAPTER 2. THEORY 13

                         Predicted outcome
                         Conv'                  Non-conv'
Actual      Conv         True Positive (TP)     False Negative (FN)
outcome     Non-conv     False Positive (FP)    True Negative (TN)

Table 2.1: Confusion matrix illustrating the four quadrants into which an out-of-sample prediction can
fall

the attribution model, such as performance evaluations and subsequent budget allocation
of channels. It is therefore desirable to have an attribution model with stable and reliable
parameter estimates. They calculate the variability by taking the average standard error
of the estimated coefficients of the model, or $n^{-1} \sum_{i=1}^{n} SE(\hat{\beta}_i)$ for a model that has $n$
estimated $\hat{\beta}_i$ coefficients.
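The two metrics can be illustrated with a minimal Python sketch. Since Shao and Li (2011) report no classification threshold, the cut-off of 0.5 used here is purely an assumption, as are the function names and the example numbers.

```python
def misclassification_rate(y_true, p_hat, threshold=0.5):
    """Out-of-sample error rate (FP + FN) / (TP + TN + FP + FN).

    The threshold of 0.5 is an assumption; the source reports none.
    """
    tp = tn = fp = fn = 0
    for y, p in zip(y_true, p_hat):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return (fp + fn) / (tp + tn + fp + fn)


def average_standard_error(standard_errors):
    """Variability metric: mean of the standard errors of the coefficients."""
    return sum(standard_errors) / len(standard_errors)


# Hypothetical hold-out outcomes and predicted conversion probabilities.
y_true = [1, 0, 0, 1, 0]
p_hat = [0.8, 0.3, 0.6, 0.4, 0.1]
error = misclassification_rate(y_true, p_hat)  # one FP and one FN out of five
variability = average_standard_error([0.10, 0.20, 0.30])
```

Note that the error rate depends directly on the chosen threshold, which is why the missing threshold makes the reported accuracy numbers hard to reproduce.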


Dalessandro et al. (2012) extend Shao and Li (2011)’s criteria with interpretability,
arguing that a proper attribution model should be “generally accepted by all parties
with material interest in the system, on the basis of its statistical merit, as well as on the
basis of intuitive understanding of the components of the system”. Finally, Anderl et al.
(2014) formulate as many as six evaluation criteria for attribution models. In addition
to the mentioned criteria, they argue for the importance of versatility and algorithmic
efficiency. Versatility is defined as the ability to incorporate new information and fit
company-specific requirements, and algorithmic efficiency simply reflects the speed of
computing model outputs. These criteria are derived from the more practical aim of
Anderl et al. (2014)’s paper to develop a model that is comprehensible for managers and
easily implementable for a wide range of companies.
In order to evaluate and compare the different attribution models on the theoretical
dimension, this thesis will formulate seven desirable qualities. Due to the academic
nature of this thesis, we are less concerned with the business-relevant criteria postulated
by Anderl et al. (2014). We look for the best theoretical attribution model rather than
the one that can be most straightforwardly explained to a manager. Further, it should
be noted that Shao and Li (2011)’s criteria are not included since they are practical
performance evaluation metrics rather than theoretical qualities. Our seven theoretical
criteria are as follows:

1. Data-driven: first and foremost, a good attribution model should be data-driven.
If a model is not data-driven and attributes conversion credit based on some a
priori determined distribution, the subsequent attribution is completely biased and
unverifiable. Rather, this distribution should be based on information derived from
the data.

2. Ability to predict: a good attribution model, that is able to accurately judge the
value of each touchpoint, should be able to estimate the probability of a conversion
or a non-conversion given some touchpoints. Moreover, a model’s ability to predict
gives us an objective standard to evaluate the empirical performance of each of the
models. We will see that some proposed models are not predictive, which makes it
much harder to determine whether they attribute correctly.

3. Individual level credit attribution: a desirable quality of a model is its ability
to attribute credit at the level of an individual customer journey.

4. Channel contribution heterogeneity: given a prospect and a time, a certain
channel can contribute more to a conversion than another channel. This is a
quality that is ideally allowed for in a model’s structure.

5. Differences in contribution over time: the contribution of a channel for a
given prospect can differ over time, for instance when a channel is touched closer
to the time of conversion. This is a dimension that ideally can be incorporated into
a model as well. The timing element can either be accounted for explicitly as a
timestamp or relatively as the sequential touchpoint within a journey.

6. Prospect heterogeneity: given a touchpoint with a channel at a specific time,
its contribution to a conversion may differ across prospects. Although hard to
implement in a model, this is certainly a desirable characteristic from a conceptual
standpoint.

7. Intuitive restrictions: intuitively, there are two main additional restrictions that
attribution models should account for:

• The conversion credit for a channel must be between 0 and the number of
conversions that have touched this channel. The equivalent on an individual
level is that a channel’s contribution to a conversion must be between 0 and
1.
• A model should be able to incorporate information from all touchpoints in a
journey.

In the remaining sections in this chapter the attribution methods and models will be
introduced and examined in the light of these seven criteria. Interestingly, we will see that
none of the models fulfils all conditions, perhaps indicating that the perfect attribution
model is not yet around.

2.3.2 Models
Before presenting the different attribution models, it is convenient to introduce some
mathematical notation. Let there be $i = \{1, 2, ..., N\}$ prospects who each have an online
journey with $j = \{1, 2, ..., J_i\}$ visits. The $j$th visit of prospect $i$ is denoted by $v_{i,j}$. For
converting prospects, only visits prior to the conversion are considered. Each visit comes
from a channel $C_k$, for $k = \{1, 2, ..., K\}$ channels. The function that maps a visit to
a channel is $C(v_{i,j})$. A prospect journey can either turn into a conversion or a
non-conversion:

$$y_i = \begin{cases} 1, & \text{Conversion} \\ 0, & \text{Otherwise} \end{cases} \tag{2.1}$$

The entire journey of prospect $i$ can then be formally represented by $PJ_i = \{\{v_{i,j}\}_{j=1}^{J_i}, y_i\}$.
In case of individual-level attribution, each visit $v_{i,j}$ has an attribution $a_{i,j}$ given by the
function $a_{i,j} = a(v_{i,j})$ under the restrictions that $0 \le a(v_{i,j}) \le 1$ and $\sum_{j=1}^{J_i} a(v_{i,j}) = y_i$.
The restrictions imply that a non-converting journey gives a credit of 0 to all visits. The
attribution Ak of a channel k as a percentage of the total number of conversions can then
be calculated as follows:
$$A_k = \frac{\sum_{i=1}^{N} \sum_{j:\{C(v_{i,j}) = C_k\}} a_{i,j}}{\sum_{i=1}^{N} y_i} \tag{2.2}$$
As we will see, not all attribution methods are able to attribute individually, so sometimes
a model produces estimates for Ak directly.
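As an illustration of Equation (2.2), the following Python sketch aggregates individual attributions into channel shares. The data layout (a list of journeys, each a list of channel and attribution pairs plus the outcome) and the channel names are assumptions made for this example only.

```python
from collections import defaultdict


def channel_attribution(journeys):
    """Aggregate individual attributions a_ij into shares A_k (Equation 2.2).

    journeys: list of (visits, y), where visits is a list of
    (channel, a_ij) pairs and y is 1 for a conversion, 0 otherwise.
    """
    total_conversions = sum(y for _, y in journeys)
    credit = defaultdict(float)
    for visits, y in journeys:
        # Restriction: the attributions within a journey sum to y_i.
        if abs(sum(a for _, a in visits) - y) > 1e-9:
            raise ValueError("attributions of a journey must sum to y_i")
        for channel, a in visits:
            # Restriction: every individual attribution lies in [0, 1].
            if not 0.0 <= a <= 1.0:
                raise ValueError("attribution must lie between 0 and 1")
            credit[channel] += a
    return {k: c / total_conversions for k, c in credit.items()}


# Hypothetical journeys: two conversions and one non-conversion.
journeys = [
    ([("display", 0.5), ("search", 0.5)], 1),
    ([("search", 1.0)], 1),
    ([("display", 0.0)], 0),
]
shares = channel_attribution(journeys)
```

By construction the shares sum to one, and the non-converting journey contributes no credit to any channel.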
Now that we have formally defined all the elements of the attribution problem, we can
turn our attention to the different models to see how they propose to solve the attribution
problem (in other words, how they estimate the Ak ’s). First, we will discuss the sim-
ple and mainstream non-statistical rule-based heuristics. Thereafter, the more complex,
mathematical or statistical models (probabilistic model/Shapley, logistic regression and
Markov chain) are introduced.

Rule-based heuristics

Rule-based heuristics are non-statistical methods to attribute conversion credit. A priori,
a distribution of the weights of the touchpoints is established for these methods. We
distinguish single touch and multi touch heuristics.
The most frequently applied single touch heuristic is last touch attribution. Last
touch attribution assigns all credit to the last visit a prospect touches before conversion
or, mathematically expressed:

$$\hat{a}_{i,j,LT} = \begin{cases} 0, & j = \{1, 2, ..., (J_i - 1)\} \\ 1, & j = J_i \end{cases} \tag{2.3}$$

The popularity of this method is due to its intuitive and computational simplicity.
Only information about the last touch serves as input to the method, making the recon-
struction of a full customer journey unnecessary. However, the fact that it completely
ignores information about the prior touches makes it a fundamentally flawed heuristic.
Suppose a prospect reaches the website of an online vacation retailer through an affiliate
party, gathers all the information he needs, but then needs a night's sleep to decide whether
he is going to purchase a trip. Waking up the next morning, he decides to buy it, quickly uses a
search engine to find the relevant page and instantly converts. Last touch attribution will
assign the full credit to organic search (SEO) and no credit to the affiliate party, which
intuitively does not make sense. If a marketeer relies on last touch attribution,
he might unjustly decide to stop allocating his funds to the affiliate party,
thereby losing many more conversions than he is aware of. In practice, this means
that channels that typically appear in the beginning of a journey, while a prospect is still
in the orientation phase, are highly undervalued. Examples are banner advertisements
or affiliate parties. In contrast, channels that appear later in the journey such as direct
(typing the URL of the website in the browser) or organic search are overvalued, even
though those channels might predominantly be reached by prospects that have already
made up their mind to buy the product and are only looking for the easiest way to reach
the website.
To counter this bias that favours later touchpoints, another single touch heuristic
named first touch attribution is introduced. Mathematically, this heuristic assigns a
weight to each individual visit as follows:

$$\hat{a}_{i,j,FT} = \begin{cases} 0, & j = \{2, 3, ..., J_i\} \\ 1, & j = 1 \end{cases} \tag{2.4}$$

However, as one can imagine, first touch attribution is far from perfect either, since a
new bias is introduced. Channels that are typically touched later in the journey such as
organic or sponsored search are now underestimated, since they are given no conversion
credit in cases of more than one touch. In addition, channels that usually occur in the
beginning of a journey such as affiliate or display are overvalued.
Both the last touch and the first touch heuristic fail to take into account the information
of customer journeys with multiple touches. However, an advantage of the single
touch heuristics for the purposes of this thesis is their potential to be transformed into a
predictive model. Empirical conversion probabilities can be calculated for each channel
given a certain position (e.g. first or last) and used as probability predictions for out-of-
sample observations. To illustrate, for the last touch heuristic the empirical probability
of conversion given that the last touched channel is $k$ is as follows:

$$\hat{P}(y_i = 1 \mid C(v_{i,J_i}) = C_k) = \frac{\sum \#\{C(v_{i,J_i}) = C_k, y_i = 1\}}{\sum \#\{C(v_{i,J_i}) = C_k\}} \tag{2.5}$$
This equation takes the number of conversions with last touch k divided by the total
number of journeys with last touch k. The predictive potential of single touch heuristics
is an advantage since it enables comparing performance both among the heuristics and
with other models.
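A minimal Python sketch of Equation (2.5) is given below. The representation of a journey as an ordered list of channel names with a binary outcome, and the channel names themselves, are assumptions for the purpose of illustration.

```python
from collections import Counter


def last_touch_conversion_probabilities(journeys):
    """Empirical P(y = 1 | last touched channel = k), as in Equation (2.5).

    journeys: list of (channels, y); channels are ordered by visit.
    """
    touched = Counter()    # journeys with last touch k
    converted = Counter()  # converting journeys with last touch k
    for channels, y in journeys:
        last = channels[-1]
        touched[last] += 1
        converted[last] += y
    return {k: converted[k] / touched[k] for k in touched}


# Hypothetical in-sample journeys.
journeys = [
    (["affiliate", "search"], 1),
    (["search"], 0),
    (["display"], 0),
    (["display", "search"], 1),
]
probabilities = last_touch_conversion_probabilities(journeys)
```

An out-of-sample journey would then be assigned the probability of its last touched channel as its conversion forecast. Channels that never occur as a last touch in the sample receive no estimate in this sketch.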
A straightforward solution to the bias of single touch heuristics is to assign equal
conversion credit to all touchpoints:
$$\hat{a}_{i,j,LIN} = \frac{1}{J_i}, \quad \forall j \tag{2.6}$$
This rule-based method is unsurprisingly called linear touch attribution. Although less
fundamentally flawed, linear touch attribution still assigns an arbitrary weight to each
touchpoint independent of its true contribution. It is ignorant of potential contribution
differences between channels: channel X might be generally more effective in persuading
prospects to convert than channel Y. Moreover, it completely disregards differences
over time, whereas touches in the beginning or end of a journey may be much more
effective and influential than touches in the middle. Linear touch attribution, although
in expectation closer to the true attribution than first or last touch, is still not the ‘holy
grail’ of the attribution problem.
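The three rule-based weightings of Equations (2.3), (2.4) and (2.6) can be sketched in a few lines of Python. The helper `credit_by_channel` and the example journey (taken from the vacation retailer illustration) are only meant as an illustration.

```python
def last_touch(n_visits):
    """Equation (2.3): all credit to the last visit."""
    return [0.0] * (n_visits - 1) + [1.0]


def first_touch(n_visits):
    """Equation (2.4): all credit to the first visit."""
    return [1.0] + [0.0] * (n_visits - 1)


def linear_touch(n_visits):
    """Equation (2.6): equal credit 1 / J_i to every visit."""
    return [1.0 / n_visits] * n_visits


def credit_by_channel(channels, weights):
    """Sum the visit-level weights of one converting journey per channel."""
    credit = {}
    for channel, weight in zip(channels, weights):
        credit[channel] = credit.get(channel, 0.0) + weight
    return credit


# The vacation retailer journey: affiliate first, organic search last.
journey = ["affiliate", "organic search"]
by_last = credit_by_channel(journey, last_touch(len(journey)))
by_first = credit_by_channel(journey, first_touch(len(journey)))
by_linear = credit_by_channel(journey, linear_touch(len(journey)))
```

For this journey the three heuristics give the affiliate party a credit of 0, 1 and 0.5 respectively, which shows how strongly the result depends on the a priori chosen rule.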
Wooff and Anderson (2013) decide to employ an attribution method that integrates
the knowledge of marketing industry experts. They interview marketeers and conclude
that marketeers generally regard the last clicks most valuable, followed by the first clicks
and the intermediate clicks. Based on this conclusion, they propose to assign conversion
credits for each touchpoint on the basis of an asymmetric U-shaped function:

$$\hat{a}_{i,j,WA} = k\,t^{a-1}(1 - t)^{b-1} \tag{2.7}$$

In this expression, $0 < t < 1$ is the relative time in the click path and $a$ and $b$ are
parameters fitted to the data. An illustration of such a fitted curve is displayed in Figure 2.2.
In this example, you can see that the last click value is larger than the value of the first
click.

Figure 2.2: Source: Wooff and Anderson (2013). The relative value of a click over time.

Although accepted by industry experts, this method is still flawed since it presupposes
a functional form. Moreover, it only takes into account attribution variability over time
but no intrinsic attribution differences between channels. Another disadvantage of both
multi touch heuristics is that there is no method to make them predictive, which prevents
comparing their performance with other models.
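The U-shaped weighting of Equation (2.7) can be sketched as follows. Several choices in this sketch are assumptions rather than part of the method of Wooff and Anderson (2013): visit $j$ of $J$ is placed at relative time $t = (j - 0.5)/J$, the constant $k$ is treated as a normalizing constant so that the weights of a journey sum to one, and the values of $a$ and $b$ are arbitrary.

```python
def u_shaped_weights(n_visits, a, b):
    """Weights proportional to t^(a-1) * (1-t)^(b-1), normalized to sum to 1.

    Assumptions: visit j of J sits at t = (j - 0.5) / J, and the constant k
    of Equation (2.7) acts as the normalizing constant.
    """
    times = [(j - 0.5) / n_visits for j in range(1, n_visits + 1)]
    raw = [t ** (a - 1) * (1 - t) ** (b - 1) for t in times]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical parameters; a < 1 and b < 1 give a U-shape, and b < a makes
# the end of the click path more valuable than the beginning.
weights = u_shaped_weights(5, a=0.6, b=0.4)
```

With these parameter values the last click receives the largest weight, followed by the first click, while the middle click receives less than either, in line with the expert opinion described above.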
To conclude this subsection about rule-based heuristics, it can be said that the over-
arching disadvantage of those heuristics is that no method is truly data-driven: each
method presupposes the distribution of the attribution weights. Interestingly though,
the rule-based heuristics are most commonly used in practice. For instance, the web analytics
service Google Analytics only offers attribution analysis based on rule-based methods. In
the subsequent subsections statistical models are discussed that base their attributions
on parameters derived from the data. It is expected that these models perform much
better.

Simple probabilistic model

The simple probabilistic model is first proposed by Shao and Li (2011). This non-
parametric model determines attribution by calculating empirical conversion probabilities
with one and two channel touches. The empirical probability of a path with a single visit
with channel k is as follows:
$$\hat{P}(y_i = 1 \mid C_k) = \frac{\sum \#\{J_i = 1, C(v_{i,1}) = C_k, y_i = 1\}}{\sum \#\{J_i = 1, C(v_{i,1}) = C_k\}} \tag{2.8}$$

This expression divides the number of conversion paths with a single channel $k$ touch
by the total number of paths with a single channel $k$ touch. Similarly, for paths with two
touches the empirical probability is calculated as follows:

$$\hat{P}(y_i = 1 \mid C_k, C_l) = \frac{\sum \#\{J_i = 2, C(v_{i,j}) = C_k, C(v_{i,r}) = C_l, y_i = 1\}}{\sum \#\{J_i = 2, C(v_{i,j}) = C_k, C(v_{i,r}) = C_l\}} \tag{2.9}$$

for some $j \in \{1, 2\}$ and $r = 3 - j$. Note that the order of touching $C_k$ and $C_l$ is irrelevant.
The attribution of channel k on an aggregate level is then computed as follows:

$$\hat{A}_{k,PROB} = \hat{P}(y_i = 1 \mid C_k) + \frac{1}{2(K - 1)} \sum_{l \neq k} \{\hat{P}(y_i = 1 \mid C_k, C_l) - \hat{P}(y_i = 1 \mid C_k) - \hat{P}(y_i = 1 \mid C_l)\} \tag{2.10}$$
The first element of this expression simply measures the conversion probability of prospect
journeys that solely contain channel k. The more interesting second element computes
the interaction effect of channels k and l, which is the conversion probability of paths with
both channels corrected by the one touch conversion probabilities of both individual chan-
nels. Note that the probabilistic model attributes at an aggregate rather than individual
level. An important assumption underlying this model is that half of this interaction
effect is attributed to each of the involved channels. Dalessandro et al. (2012) arrive at
the same model, having defined attribution as a “channel’s expected marginal impact on
conversion”. Moreover, they prove that it is a second-order approximation of the Shapley
Value, a way to distribute collective value in Cooperative Game Theory (Shapley, 1953).
Berman (2013) makes use of exactly the same formulation of this Shapley attribution
model.
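To make Equation (2.10) concrete, the sketch below computes the aggregate attribution from given empirical probabilities. The dictionary layout, the channel names and all probability values are hypothetical.

```python
def probabilistic_attribution(p_single, p_pair):
    """Aggregate attribution of the simple probabilistic model (Equation 2.10).

    p_single[k]               : empirical P(y = 1 | C_k)
    p_pair[frozenset((k, l))] : empirical P(y = 1 | C_k, C_l), order irrelevant
    """
    channels = sorted(p_single)
    n_channels = len(channels)
    attribution = {}
    for k in channels:
        interaction = sum(
            p_pair[frozenset((k, l))] - p_single[k] - p_single[l]
            for l in channels
            if l != k
        )
        attribution[k] = p_single[k] + interaction / (2 * (n_channels - 1))
    return attribution


# Hypothetical empirical conversion probabilities for three channels.
p_single = {"display": 0.02, "search": 0.05, "email": 0.04}
p_pair = {
    frozenset(("display", "search")): 0.10,
    frozenset(("display", "email")): 0.05,
    frozenset(("search", "email")): 0.12,
}
attribution = probabilistic_attribution(p_single, p_pair)

# A channel pair that converts worse than its single touches would suggest
# produces a negative attribution.
negative_case = probabilistic_attribution(
    {"a": 0.01, "b": 0.50}, {frozenset(("a", "b")): 0.10}
)
```

The second call shows that nothing in the formula prevents an attribution below zero.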
The simple probabilistic model has a number of disadvantages. Its attribution
methodology solely uses conversion probabilities, thereby ignoring information about
the number of conversions. This makes the attribution method unintuitive in some cases.
Suppose a channel is only touched in a single customer journey (in a large data set), but
this journey is successful and leads to a conversion. Although the channel just contributed
to a single conversion, its conversion probability is 100%, causing the probabilistic model
to attribute it a disproportionately high share. The conversions attributed to this channel
are likely to exceed one, which is unintuitive. A second disadvantage of the probabilistic
model is the possibility of negative attributions. Furthermore, the model is unable to
integrate information of paths that contain more than two touchpoints. It is theoretically
possible to extend the model for longer paths, but Shao and Li (2011) justly argue
that from a practical standpoint this does not make sense. The estimated conversion
probabilities for longer journeys become after all highly inaccurate due to the low number
of observations. A final disadvantage is that the model is not predictive, thwarting the
possibility to evaluate its performance empirically. In conclusion, it is clear that the
simple probabilistic model may be intuitive but has many serious drawbacks.

Logistic regression

An alternative attribution model that is also initially proposed by Shao and Li (2011) is
a simple logistic regression. This is a specific regression model in which the dependent
variable is binary and the functional form characterized by the non-linear logistic function.
In such a model, each customer journey makes an observation with the binary conversion
indicator yi as the dependent variable. Two major advantages of this model are that
it is predictive and it takes into account all available touchpoint information. In the
form Shao and Li (2011) propose, the explanatory variables are the number of touches
of a certain channel $k$ in the journey $i$, or $NC_{i,k} = \sum_{j=1}^{J_i} \#\{C(v_{i,j}) = C_k\}$. The logistic
regression can then be formulated as follows:

$$P(y_i = 1) = \Lambda\Big(\beta_0 + \sum_{k=1}^{K} \beta_k NC_{i,k}\Big), \tag{2.11}$$

where $\Lambda(x) = (1 + e^{-x})^{-1}$ is the logistic cumulative distribution function. The parameters
$\beta_k$ can be estimated by maximum likelihood, although a closed form solution such
as in the case of linear regression does not exist. These parameters are then interpreted
in order to determine each channel’s attribution to the total number of conversions.
However, the extant literature ignores or is particularly vague about the exact method
to go from the logistic model parameter estimates to attributing the channels. Theoretically,
the most obvious method to do so would be to evaluate the marginal effects
$\frac{\partial y_i}{\partial NC_k} = \beta_k \lambda(\beta_0 + \sum_{k=1}^{K} (\beta_k NC_{i,k}))$, where $\lambda(x) = (e^x)(1 + e^x)^{-2}$ is the logistic
probability density. However, since estimated parameters can be negative, this would imply the
possibility of attribution to be negative, which is not a desirable property.
An alternative, more practical method to attribute is proposed by this thesis. For
each visit $v_{i,j}$, consider the estimated conversion probability $\hat{p}_{i,j} = \Lambda(\hat{\beta}_0 + \hat{\beta}_k)$ in case
only the channel $C_k = C(v_{i,j})$ of that visit is touched. Use these estimated conversion
probabilities as unnormalized attributions, and subsequently normalize them to obtain
$\hat{a}_{i,j,LOG}$ for each touchpoint. Note that this method assumes that every touch of channel
$k$ has the same effect on attribution, which is compatible with the specification of the
basic logistic regression model. Mathematically, the individual attribution according to
the logistic model is expressed in (2.12), in which for simplicity $k$ rather than $C_k$ is the
channel belonging to visit $v_{i,j}$.

$$\hat{a}_{i,j,LOG} = \frac{\hat{p}_{i,j}}{\sum_{r=1}^{J_i} \hat{p}_{i,r}} = \frac{\Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,j})})}{\sum_{r=1}^{J_i} \Lambda(\hat{\beta}_0 + \hat{\beta}_{C(v_{i,r})})} \tag{2.12}$$

This method to determine attribution from a logistic regression estimation does not yet
exist in the literature. It is important to remark that it is argued for from a practical
rather than theoretical econometric standpoint. There is no proof that the estimator
$\hat{a}_{i,j,LOG}$ is statistically unbiased. The advantage of this practical method is that it facilitates
heterogeneity in channel contributions without any undesirable quality such as a
negative attribution.
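The normalization of Equation (2.12) can be sketched in Python as follows. The coefficient values are hypothetical and are not estimates from any data set; in practice they would come from a maximum likelihood fit of Equation (2.11).

```python
import math


def logistic_cdf(x):
    """Logistic cumulative distribution function (1 + exp(-x))^(-1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_attribution(channels, beta0, beta):
    """Individual attributions of one converting journey (Equation 2.12).

    channels: ordered channel names of the visits in the journey
    beta0, beta: intercept and per-channel coefficients (assumed given)
    """
    p = [logistic_cdf(beta0 + beta[c]) for c in channels]
    total = sum(p)
    return [p_j / total for p_j in p]


# Hypothetical coefficient estimates; note the negative display coefficient.
beta0 = -3.0
beta = {"display": -0.2, "search": 0.9, "affiliate": 0.4}

journey_a = logistic_attribution(["affiliate", "search", "search"], beta0, beta)
journey_b = logistic_attribution(["display", "search"], beta0, beta)
```

Even with a negative coefficient for display, its attribution in the second journey stays strictly positive, because the unnormalized weights are probabilities. Both search touches in the first journey receive the same credit, reflecting the assumption that every touch of a channel has the same effect.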
It is important to notice that the attribution method of this logistic regression uses
point coefficient estimates β̂k , ignoring the standard error SE(β̂k ) of these estimates.
A precondition for using the logistic regression attribution method is therefore that the
number of observations in the data set (e.g. the number of customer journeys) is very
large, such that standard errors are negligible. If estimated coefficients happen to be
insignificant, this is solved by plugging in βk = 0 in (2.12). Prediction is easy with
the logistic model. Having estimated the model in Equation (2.11), one can plug the
fitted coefficients β̂k ’s and out-of-sample observations N Ci,k in the logistic cumulative
distribution function Λ(x) to produce out-of-sample conversion probability forecasts.
The logistic regression model displayed in Equation (2.11) is very general, and can
be extended in various ways. In the original formulation, it is assumed that every touch
with a channel $k$ has the same effect on the conversion probability, regardless of whether it
is the first or tenth touch. If one does not accept this assumption, one can for instance
make dummies for $NC_{i,k} = 1$ and $NC_{i,k} > 1$, which assumes that the first touch and
later touches have different effects. One can even go further by dummifying $NC_{i,k} = 1$,
$NC_{i,k} = 2$ and so on, if the effect should vary across a larger number of touches with a
channel $k$. However, as the number of prospects that have multiple touches with a single
channel rapidly decreases, estimated coefficients are likely to become insignificant.
Alternatively, one can think of an extension to the logistic model that includes infor-
mation on the relative timing of touchpoints, which is one of the theoretically derived
criteria that the standard formulation does not comply with. One can integrate this by
for instance dummifying the last touch for all channels. The result is the following model:

$$P(y_i = 1) = \Lambda\Big(\beta_0 + \sum_{k=1}^{K} \beta_k^{LT} d_{i,k}^{LT} + \sum_{k=1}^{K} \beta_k^{NLT} NC_{i,k}^{NLT}\Big) \tag{2.13}$$

In this equation, $d_{i,k}^{LT}$ is a dummy that indicates whether channel $k$ is the last touch
for prospect $i$. $NC_{i,k}^{NLT}$ counts the number of touches of channel $k$ that are not last
touch (NLT) for this same prospect $i$. The coefficients of these covariates are $\beta_k^{LT}$ and
$\beta_k^{NLT}$ respectively. Note that one of the dummies $d_{i,k}^{LT}$ should be eliminated from the
formulation to prevent perfect multicollinearity.
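The covariates of Equations (2.11) and (2.13) can be constructed from a journey as in the following sketch. The function name, the dictionary layout and the channel names are assumptions for illustration.

```python
def journey_features(channels, all_channels):
    """Build the covariates of Equations (2.11) and (2.13) for one journey.

    Returns three dicts keyed by channel:
      counts     : NC_{i,k},     number of touches of channel k
      last_dummy : d^LT_{i,k},   1 if channel k is the last touch
      non_last   : NC^NLT_{i,k}, touches of channel k that are not the last touch
    """
    counts = {k: channels.count(k) for k in all_channels}
    last_dummy = {k: int(channels[-1] == k) for k in all_channels}
    non_last = {k: counts[k] - last_dummy[k] for k in all_channels}
    return counts, last_dummy, non_last


all_channels = ["display", "search", "affiliate"]
counts, last_dummy, non_last = journey_features(
    ["display", "search", "search"], all_channels
)
```

Because every journey has exactly one last touch, the dummies $d_{i,k}^{LT}$ sum to one over $k$ and are therefore perfectly collinear with the intercept, which is why one of them has to be dropped before estimation.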
A second extension to the logistic regression model can easily be formulated by also
including dummies $d_{i,k}^{LT2}$ for the one but last touch of a channel $k$. In this model, $NC_{i,k}^{NLT2}$
counts the number of touches of a channel $k$ that are neither last touch nor one but last
touch. Theoretically, one can go further, but it should be checked whether most coefficients
are still significant. Another possible direction is creating dummies $d_{i,k}^{FT}$ for the first
touch and including the number of non-first touches $NC_{i,k}^{NFT}$. The Akaike Information
Criterion can be checked to see which model has the best fit. However, this thesis limits
its scope by considering only the last touch and one but last touch extension of the logistic
regression model. This decision is based on the research of Wooff and Anderson (2013)
and Anderl et al. (2014), who show that the last touch is a more powerful predictor for
a conversion than the first touch.
Attribution in case of this extended logistic regression model can be derived in a
similar fashion as described in Equation (2.12). For the first extension with only last
touch dummies, the individual attributions for each touchpoint âi,j,LOGX1 are calculated
in Equation (2.14). Again, note that k = C(vi,j ).

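$$\hat{a}_{i,j,LOGX1} = \begin{cases} \dfrac{\Lambda(\hat{\beta}_0 + \hat{\beta}^{LT}_{C(v_{i,j})})}{\sum_{r=1}^{J_i - 1} \Lambda(\hat{\beta}_0 + \hat{\beta}^{NLT}_{C(v_{i,r})}) + \sum_{r=J_i} \Lambda(\hat{\beta}_0 + \hat{\beta}^{LT}_{C(v_{i,r})})}, & j = J_i \\[2ex] \dfrac{\Lambda(\hat{\beta}_0 + \hat{\beta}^{NLT}_{C(v_{i,j})})}{\sum_{r=1}^{J_i - 1} \Lambda(\hat{\beta}_0 + \hat{\beta}^{NLT}_{C(v_{i,r})}) + \sum_{r=J_i} \Lambda(\hat{\beta}_0 + \hat{\beta}^{LT}_{C(v_{i,r})})}, & j \neq J_i \end{cases} \tag{2.14}$$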

Although at first glance much more complex, on closer inspection the only difference
with Equation (2.12) is that for the last touchpoint a different estimated coefficient is
evaluated than for the other touchpoints. The only difference between the cases of $j = J_i$
and $j \neq J_i$ is that a different estimated coefficient is plugged into the logistic cumulative
distribution function of the numerator. For the single $\hat{\beta}_k^{LT}$ that is not estimated due
to perfect collinearity problems, we plug in 0.
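A sketch of the attribution in Equation (2.14) is given below. All coefficient values are hypothetical; the last touch coefficient of the channel that is dropped from the regression is set to 0, as described above.

```python
import math


def logistic_cdf(x):
    """Logistic cumulative distribution function (1 + exp(-x))^(-1)."""
    return 1.0 / (1.0 + math.exp(-x))


def extended_logistic_attribution(channels, beta0, beta_lt, beta_nlt):
    """Individual attributions with last touch dummies (Equation 2.14).

    The last visit is evaluated with its last touch coefficient, all
    earlier visits with their non-last-touch coefficient.
    """
    last = len(channels) - 1
    p = [
        logistic_cdf(beta0 + (beta_lt[c] if j == last else beta_nlt[c]))
        for j, c in enumerate(channels)
    ]
    total = sum(p)
    return [p_j / total for p_j in p]


# Hypothetical estimates; "display" is the dropped last touch dummy (set to 0).
beta0 = -3.0
beta_lt = {"display": 0.0, "search": 1.2, "affiliate": 0.5}
beta_nlt = {"display": 0.1, "search": 0.6, "affiliate": 0.3}

attributions = extended_logistic_attribution(
    ["affiliate", "search", "search"], beta0, beta_lt, beta_nlt
)
```

In contrast to the basic model, the two search touches now receive different credit, because the last one is evaluated with the last touch coefficient.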

Markov chain models

An entirely different approach to the attribution problem is proposed by Anderl et al.
(2014). They state that a customer journey can be modelled as a Markov chain, which is
a probabilistic model that represents dependencies between sequences of observations of
a random variable. The random variable takes the value of one of the $p$ possible states in
the state space $S = \{s_1, s_2, ..., s_p\}$. A transition matrix $W$ determines the dependencies
between those states over discrete time, with transition probabilities to go from state $i$ to
state $j$ being $w_{i,j}$, where $0 \le w_{i,j} \le 1$ and $\sum_{j=1}^{p} w_{i,j} = 1, \forall i$. The latter condition means
that given a certain state in period $t$, there is always a probability of 1 that the system is
in a state in period $t + 1$. The final element required for a Markov chain is an initial state
$Z_0$. Journeys can thus be modelled or simulated by multiplying the initial state $Z_0$ with
the transition matrix $W$, resulting in a sequence of states $\{Z_0, Z_1, Z_2, ..., Z_{t-1}, Z_t, ...\}$ over
discrete time.
Markov chains can be of different order, denoting the number of previous observations
that influence the current state. Let’s first focus on the first-order model. The possible
states $s_i \in S$ in the first-order Markov model are all channels $C_k$, a conversion state
$Conv$ and a non-conversion state $NonConv$. The transition probabilities are empirically
derived from the data, giving the first-order transition matrix $\hat{W}_1$. The estimated first-order
initial state is also empirically calculated as the proportion of first visits
the relevant channel has with respect to all journeys, resulting in the vector $\hat{Z}_{0,1}$. Note
that the initial states for $s_i = Conv$ and $s_i = NonConv$ in $\hat{Z}_{0,1}$ are zero, since a prospect
journey does not start with a conversion or non-conversion.
Once a first-order Markov model is appropriately fitted this way, attribution is determined
by a so-called Removal Effect. This is defined as the change in probability of
reaching the conversion state in the normal situation compared to the situation where the
pertinent channel $s_i = C_k$ is removed from the chain. Although unexplained by Anderl
et al. (2014), we assume removing a channel $k$ means setting all row elements $w_{k,j}$ to zero,
except for $j = NonConv$, which is set to one. This results in a so-called reduced matrix $W_{(-k),1}$.
Although not specifically mentioned by Anderl et al. (2014), we assume that the Removal
Effect is considered over the steady state of the Markov chain process. The first-order
Removal Effect of a channel $k$, $RE_1(C_k)$, then takes a value between 0 and the original
conversion rate. Mathematically, this is expressed in Equation (2.15), where $x[Conv]$ is
the $s_i = Conv$ state from vector $x$, $\hat{W}_1^T$ is the matrix product of $T$ times $\hat{W}_1$, and $\hat{Z}'_{0,1}$ is
the transpose of $\hat{Z}_{0,1}$.

$$\widehat{RE}_1(C_k) = \lim_{T \to \infty} \{\hat{Z}'_{0,1} \hat{W}_1^T [Conv] - \hat{Z}'_{0,1} \hat{W}_{(-k),1}^T [Conv]\} \tag{2.15}$$

The aggregate attribution $\hat{A}_{k,MAR1}$ for each channel is subsequently calculated by
dividing the Removal Effect by the sum of all Removal Effects:

$$\hat{A}_{k,MAR1} = \frac{\widehat{RE}_1(C_k)}{\sum_{l=1}^{K} \widehat{RE}_1(C_l)} \tag{2.16}$$
Markov chain models can easily be made predictive. Given a state si = Ck at moment
t, the estimated transition probability to the state si = Conv is the estimated conversion
probability. Note that this probability is only based on the last touchpoint for a first-
order Markov chain model, a finding that can be generalized to the last r touchpoints for
rth order models.
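The first-order procedure of Equations (2.15) and (2.16) can be sketched in plain Python. This sketch follows the assumptions stated above (a removed channel sends all of its probability mass to the non-conversion state), and it additionally approximates the limit $T \to \infty$ by a fixed, large number of transitions. Function names and the example journeys are hypothetical.

```python
from collections import defaultdict

CONV, NONCONV = "Conv", "NonConv"


def fit_first_order(journeys):
    """Estimate the initial state vector and transition probabilities.

    journeys: list of (channels, y) with channels ordered by visit.
    """
    start_counts = defaultdict(float)
    move_counts = defaultdict(lambda: defaultdict(float))
    for channels, y in journeys:
        start_counts[channels[0]] += 1
        path = list(channels) + [CONV if y == 1 else NONCONV]
        for src, dst in zip(path, path[1:]):
            move_counts[src][dst] += 1
    n = len(journeys)
    initial = {s: c / n for s, c in start_counts.items()}
    transitions = {
        src: {dst: c / sum(row.values()) for dst, c in row.items()}
        for src, row in move_counts.items()
    }
    return initial, transitions


def conversion_probability(initial, transitions, removed=None, steps=1000):
    """Probability mass absorbed in Conv after `steps` transitions."""
    mass = dict(initial)
    converted = 0.0
    for _ in range(steps):
        nxt = defaultdict(float)
        for state, m in mass.items():
            if state == removed:
                continue  # removed channel: all mass goes to NonConv
            for dst, w in transitions[state].items():
                if dst == CONV:
                    converted += m * w
                elif dst != NONCONV:
                    nxt[dst] += m * w
        mass = nxt
    return converted


def markov_attribution(journeys):
    """Normalized Removal Effects per channel (Equation 2.16)."""
    initial, transitions = fit_first_order(journeys)
    base = conversion_probability(initial, transitions)
    effects = {
        k: base - conversion_probability(initial, transitions, removed=k)
        for k in transitions
    }
    total = sum(effects.values())
    return {k: e / total for k, e in effects.items()}


# Hypothetical journeys over two channels.
journeys = (
    [(["C1"], 1)] + [(["C1"], 0)] * 3
    + [(["C2"], 1)] + [(["C2"], 0)] * 2
    + [(["C1", "C2"], 1), (["C1", "C2"], 0)]
    + [(["C2", "C1"], 1)]
)
attribution = markov_attribution(journeys)
```

Iterating the probability mass forward avoids explicit matrix powers; for large state spaces a linear algebra library would be the more natural choice.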

Figure 2.3: First-order Markov chain graph illustrating the different states and transition possibilities.

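$$\begin{array}{l|c|cccc}
\text{State} & Z_0 & C_1 & C_2 & Conv & NonConv \\ \hline
C_1 & \frac{N_1 + N_{1,2}}{\sum N} & 0 & \frac{N_{1,2}}{N_{1,2} + N_1 + N_{2,1}} & \frac{P_1 + P_{2,1}}{N_{1,2} + N_1 + N_{2,1}} & \frac{N_1 - P_1 + N_{2,1} - P_{2,1}}{N_{1,2} + N_1 + N_{2,1}} \\
C_2 & \frac{N_2 + N_{2,1}}{\sum N} & \frac{N_{2,1}}{N_{1,2} + N_2 + N_{2,1}} & 0 & \frac{P_2 + P_{1,2}}{N_{1,2} + N_2 + N_{2,1}} & \frac{N_2 - P_2 + N_{1,2} - P_{1,2}}{N_{1,2} + N_2 + N_{2,1}} \\
Conv & 0 & 0 & 0 & 1 & 0 \\
NonConv & 0 & 0 & 0 & 0 & 1
\end{array}$$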

Table 2.2: Markov initial state Z0 and transition matrix W

Now we will introduce a very simple analytical example of the first-order Markov
model to clarify more intuitively how the model attributes. Suppose there are two chan-
nels, C1 and C2 , and the maximum number of touches is two. The possible paths are
C1 , C2 , C1 C2 and C2 C1 , which occur respectively N1 , N2 , N1,2 and N2,1 times with the
number of conversions P1 , P2 , P1,2 and P2,1 . For the ease of this example we omit the
paths Ci Ci , i = {1, 2}. The graph of this Markov model is illustrated in Figure 2.3, and
the Markov transition matrix and initial state are shown in Table 2.2.
Note that the conversion and non-conversion states are absorbing states: once in this
state, a transition to another state is impossible. Now define $conv^* = Z'_{0,1} W_1^T [Conv]$,
where $T$ is the matrix power. The exact value of $conv^*$ is irrelevant for our purposes.
The Removal Effect $RE_1(C_1)$ for channel $C_1$ is then expressed as follows, where $\sum N$ is the
total number of journeys:

$$RE_1(C_1) = conv^* - \frac{N_2 + N_{2,1}}{\sum N} \cdot \frac{P_2 + P_{1,2}}{N_{1,2} + N_2 + N_{2,1}} \tag{2.17}$$
Since this Removal Effect is proportional to the attribution, we can derive that attri-
bution for C1 is a function of:

1. The relative amount of journeys that start with the other channel C2 (negative
effect)

2. The number of conversions with last touch C2 relative to the total number of touches
with C2 (negative effect)

Note that the attribution for C1 has a negative relation with the relative number of C2
last touch conversions weighted by the relative number of C2 first touches. This weight
by the number of first touches makes it intuitively less accurate than simply taking the
negative of the total number of last touch conversions of C2 , which is basically last touch
attribution. From this stylized two channel example we can therefore conclude that the
first-order Markov chain model is not only comparable to last touch, but that it is even
likely to attribute worse than this rule-based heuristic due to the unintuitive weighing.
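The stylized two channel example can be checked numerically. The sketch below builds the initial state and transition probabilities of Table 2.2 from path counts and compares the Removal Effect of $C_1$ obtained by iterating the chain with the closed form of Equation (2.17). The path keys "1", "2", "12" and "21" and all counts are hypothetical.

```python
def removal_effect_c1(N, P, steps=200):
    """Removal Effect of C1 in the two channel example, computed two ways.

    N, P: journeys and conversions per path, keyed "1", "2", "12", "21".
    Returns (value by iterating the chain, value by Equation 2.17).
    """
    total = sum(N.values())                  # total number of journeys
    d1 = N["12"] + N["1"] + N["21"]          # number of touches with C1
    d2 = N["12"] + N["2"] + N["21"]          # number of touches with C2
    z0 = {"C1": (N["1"] + N["12"]) / total, "C2": (N["2"] + N["21"]) / total}
    to_conv = {"C1": (P["1"] + P["21"]) / d1, "C2": (P["2"] + P["12"]) / d2}
    to_other = {"C1": N["12"] / d1, "C2": N["21"] / d2}

    def conversion_probability(removed=None):
        mass, converted = dict(z0), 0.0
        for _ in range(steps):
            nxt = {"C1": 0.0, "C2": 0.0}
            for state, m in mass.items():
                if state == removed:
                    continue  # removed channel leads to NonConv only
                converted += m * to_conv[state]
                other = "C2" if state == "C1" else "C1"
                nxt[other] += m * to_other[state]
            mass = nxt
        return converted

    conv_star = conversion_probability()
    by_iteration = conv_star - conversion_probability(removed="C1")
    by_formula = conv_star - z0["C2"] * to_conv["C2"]  # Equation (2.17)
    return by_iteration, by_formula


# Hypothetical path counts and conversions.
N = {"1": 4, "2": 3, "12": 2, "21": 1}
P = {"1": 1, "2": 1, "12": 1, "21": 1}
by_iteration, by_formula = removal_effect_c1(N, P)
```

Both routes give the same Removal Effect, since with $C_1$ removed only journeys that start in $C_2$ and convert directly from there can still reach the conversion state.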
To some extent, the conclusions of our stylized two channel example can be generalized
to more channels and touchpoints. However, in case of more than two channels the
transition probabilities and interactions between channels influence attribution as well.
Most straightforwardly, the transition probabilities to the state whose Removal Effect
is estimated have a positive influence on the attribution of this state. High transition
probabilities after all imply that the missed conversion of this state will be larger when it
is removed. This issue was irrelevant in our analytical example since the system reached
stability after a single iteration, meaning that only the direct traffic to conversion (and
no transitional traffic) influenced the Removal Effect.
Having explained the intuitive dynamics of first-order Markov chain models, let’s now
turn our attention to higher-order models. The most distinctive difference for higher-order
models is that multiple previous periods are taken into account: for an rth-order
Markov chain the present state depends not only on the previous state, but on the states
in the last r periods. It can be shown that a Markov chain of order r is equivalent to a
first-order Markov chain with r-tuples representing the states. For instance, in case r = 2,
a state can be si = (Ck, Cl), meaning that the current channel is Cl and the previous
channel is Ck. We can thus express an rth-order Markov chain with a single transition matrix
Wr and initial state Z0,r, with r-tuples representing the different states. The transition
probabilities and initial state are again empirically derived from the data, giving Ŵr and
Ẑ0,r. For a second-order Markov model this implies (k+1)(k+2)+2 states. The k+1 term
represents the possibilities for the first element of the 2-tuple, which are all k channels plus
a ‘none’ element in case there is no channel previous to the second channel represented
in the 2-tuple. The k+2 term represents all channels plus a conversion and a non-conversion
possibility. The last 2 comes from the only absorbing states si = (Conv, Conv)
and si = (NonConv, NonConv). In the initial state vector Z0,2, empirical first touch
probabilities are estimated for the states of the structure si = (None, Ck).
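
To make the state count concrete, the following Python sketch enumerates the second-order state space for a hypothetical set of three channels. The function name `second_order_states` and the labels "None", "Conv" and "NonConv" are assumptions made for this illustration only; the thesis itself performs its analyses in R.

```python
from itertools import product

def second_order_states(channels):
    """Enumerate the 2-tuple states (previous, current) of a second-order chain."""
    first = ["None"] + list(channels)              # k + 1 options for the first element
    second = list(channels) + ["Conv", "NonConv"]  # k + 2 options for the second element
    states = list(product(first, second))
    # The two absorbing states are added separately.
    states += [("Conv", "Conv"), ("NonConv", "NonConv")]
    return states

channels = ["C1", "C2", "C3"]  # hypothetical channel labels
states = second_order_states(channels)
k = len(channels)
print(len(states), (k + 1) * (k + 2) + 2)
```

Both printed numbers agree with the formula (k+1)(k+2)+2, which gives 22 states for k = 3.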
Attribution for the higher order Markov model is again calculated by the Removal
Effect. Anderl et al. (2014) choose to calculate channel attribution by taking the average
Removal Effect of each of the states that include the respective channel. This state-based
Removal Effect is illustrated in Equation (2.18) for the case of r = 2, where Ŵ(−si ),2 is
the reduced second-order transition matrix with state si set to zero.

\[
\widehat{RE}_2(s_i) = \lim_{T \to \infty} \left\{ \hat{Z}_{0,2}' \, \hat{W}_2^{T}[Conv, Conv] - \hat{Z}_{0,2}' \, \hat{W}_{(-s_i),2}^{T}[Conv, Conv] \right\} \tag{2.18}
\]

Subsequently, the attribution Âk,MAR2M1 is calculated in Equation (2.19). In this
equation, Sk ∈ S is the set of all states that contain channel Ck, so either in the form
(Ck, α) or (α, Ck) for any α. |Sk| is the number of such states.

\[
\hat{A}_{k,MAR2M1} = \frac{|S_k|^{-1} \sum_{s_i \in S_k} \widehat{RE}(s_i)}{\sum_{l=1}^{K} |S_l|^{-1} \sum_{s_i \in S_l} \widehat{RE}(s_i)} \tag{2.19}
\]

However, taking the mean of the Removal Effects of all states that include channel
k seems inconsistent with Anderl et al. (2014)’s own definition of the Removal Effect,
being the “change in probability of reaching the conversion state when we remove a
channel from the graph”. Therefore, more in line with this definition, we propose an
alternative method to determine individual channel attribution. Simply stated, this new
method determines the Removal Effect of a channel, REr(Ck), rather than of a state, REr(si).
REr(Ck) is calculated by removing (setting to zero) all states that include channel
k, that is, all states si ∈ Sk. The resulting reduced matrix is named W(−Sk),r. This Removal
Effect for r = 2 is calculated in Equation (2.20).

\[
\widehat{RE}_2(C_k) = \lim_{T \to \infty} \left\{ \hat{Z}_{0,2}' \, \hat{W}_2^{T}[Conv, Conv] - \hat{Z}_{0,2}' \, \hat{W}_{(-S_k),2}^{T}[Conv, Conv] \right\} \tag{2.20}
\]

Attribution Âk,MAR2M2 is then calculated by normalizing this channel-based Removal
Effect.

\[
\hat{A}_{k,MAR2M2} = \frac{\widehat{RE}_2(C_k)}{\sum_{l=1}^{K} \widehat{RE}_2(C_l)} \tag{2.21}
\]

In the remainder of this thesis we will report the higher-order Markov attribution method
of Anderl et al. (2014) as method 1, and the method proposed by this thesis as method
2.
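
The difference between the two methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis (which is written in R): the toy transition probabilities, the dictionary representation of the transition matrix and the function names `conv_probability`, `states_with`, `method1` and `method2` are all invented for the example. The limit T → ∞ is approximated by a fixed number of steps, which is sufficient here because every journey in the toy chain is absorbed within four transitions.

```python
from collections import defaultdict

CONV = ("Conv", "Conv")
NONCONV = ("NonConv", "NonConv")

def conv_probability(z0, W, removed=frozenset(), n_steps=50):
    """Approximate lim_T z0' W^T [Conv, Conv] with the `removed` states set to zero."""
    z = {s: p for s, p in z0.items() if s not in removed}
    for _ in range(n_steps):
        nxt = defaultdict(float)
        for s, mass in z.items():
            for t, p in W.get(s, {}).items():
                if t not in removed:
                    nxt[t] += mass * p
        z = nxt
    return z.get(CONV, 0.0)

def states_with(channel, W):
    """S_k: all states whose 2-tuple contains the channel."""
    return {s for s in W if channel in s}

def method1(z0, W, channels):
    """State-based variant: average the Removal Effect over the states in S_k."""
    base = conv_probability(z0, W)
    mean_re = {}
    for c in channels:
        S = states_with(c, W)
        mean_re[c] = sum(base - conv_probability(z0, W, {s}) for s in S) / len(S)
    total = sum(mean_re.values())
    return {c: v / total for c, v in mean_re.items()}

def method2(z0, W, channels):
    """Channel-based variant: remove all states in S_k at once."""
    base = conv_probability(z0, W)
    re = {c: base - conv_probability(z0, W, states_with(c, W)) for c in channels}
    total = sum(re.values())
    return {c: v / total for c, v in re.items()}

# Toy second-order chain with two channels (all probabilities are made up).
W = {
    ("None", "C1"): {("C1", "C1"): 0.1, ("C1", "C2"): 0.3,
                     ("C1", "Conv"): 0.2, ("C1", "NonConv"): 0.4},
    ("None", "C2"): {("C2", "C1"): 0.2, ("C2", "Conv"): 0.1, ("C2", "NonConv"): 0.7},
    ("C1", "C1"): {("C1", "Conv"): 0.5, ("C1", "NonConv"): 0.5},
    ("C1", "C2"): {("C2", "Conv"): 0.3, ("C2", "NonConv"): 0.7},
    ("C2", "C1"): {("C1", "Conv"): 0.4, ("C1", "NonConv"): 0.6},
    ("C1", "Conv"): {CONV: 1.0}, ("C2", "Conv"): {CONV: 1.0},
    ("C1", "NonConv"): {NONCONV: 1.0}, ("C2", "NonConv"): {NONCONV: 1.0},
    CONV: {CONV: 1.0}, NONCONV: {NONCONV: 1.0},
}
z0 = {("None", "C1"): 0.6, ("None", "C2"): 0.4}

m1 = method1(z0, W, ["C1", "C2"])
m2 = method2(z0, W, ["C1", "C2"])
print(m1, m2)
```

In this toy chain the two methods produce different attributions for channel C1, because method 1 divides the summed state-level effects by the number of states in which a channel appears, whereas method 2 measures the effect of removing the channel as a whole.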
The higher the order of a Markov chain model, the more accurately it describes data with
multi-touch journeys. In contrast to the first-order model, not only the last touch but
the last r touches are taken into account for attribution. Based on this, the Markov
model should perform better with increasing r. However, since the number of parameters
to be estimated grows exponentially with r, a higher-order Markov chain quickly
becomes inefficient to estimate (Berchtold and Raftery, 2002): there are not
enough observations to produce accurate estimates for the transition probabilities. For
this reason, this thesis follows Anderl et al. (2014) and only estimates the first-, second-
and third-order Markov models. Anderl et al. (2014) find that the third-order Markov
model performs better than the logistic regression, first touch and last touch heuristics
as measured by the area under the ROC curve and the top-decile lift. Unfortunately, it
is not reported whether any model performance differences are significant.

Other models

Some more attribution models have been developed that are noteworthy, which we will
briefly refer to for the interested reader. Li and Kannan (2014) propose a Bayesian
model to measure online channel consideration, visits and purchases. They calculate
carryover and spillover effects to attribute conversion credit. Zhang et al. (2014) apply
the attribution question to a framework borrowed from survival theory, producing a model
that appears quite promising in both conversion prediction and attribution. Finally, Xu
et al. (2014) employ a mutually exciting point process model to calculate attribution of
online advertising channels. These models aren’t discussed in this thesis because either
our dataset is not suitable for the respective model, the model is too complex for the
purposes of this thesis or the model is expected to be less effective than our models.

2.3.3 Theoretical evaluation


So far, we have discussed the most common attribution models in the literature. Table
2.3 summarizes in a simplified way our theoretical evaluation in the light of the proposed
seven criteria. A ‘+’ indicates that the model complies with the criterion, a
‘+/−’ indicates partial compliance and a ‘−’ indicates that the model does not satisfy
the criterion.
All models except for the rule-based methods are fully data-driven. The only models
that are not able to predict are linear touch attribution and the simple probabilistic
model. The logistic regression models and rule-based heuristics are able to attribute at
an individual level; the others are not. Channel heterogeneity is allowed for in all models
except for the rule-based methods. Contribution differences over time are explicitly modelled
in the logistic extension and higher-order Markov model. The first-order Markov
model and the first and last touch methods only partly allow for differences over time, such as
the simple distinction between last touch and non last touch. The other models do not
take into account timing differences. Prospect heterogeneity is an ambitious criterion
that none of the considered models comply with. All intuitive restrictions are satisfied
by the linear method, the logistic regression models and the Markov chain models. The
first touch method, last touch method and simple probabilistic model do not incorporate
all information from all touchpoints. As explained, the simple probabilistic model also does not
attribute in an intuitive way, since it is based on conversion probabilities rather than
absolute numbers. In conclusion, the logistic extension satisfies most theoretical criteria,
closely followed by the normal logistic model and the higher-order Markov chain model.
Based on the theoretical criteria, the logistic models perform best.

Criterion                     FT    LT    LIN   PROB  LOG   LOGX  MAR1  MAR2+
Data-driven                   -     -     -     +     +     +     +     +
Ability to predict            +     +     -     -     +     +     +     +
Individual level attribution  +     +     +     -     +     +     -     -
Channel heterogeneity         -     -     -     +     +     +     +     +
Differences over time         +/-   +/-   -     -     -     +     +/-   +
Prospect heterogeneity        -     -     -     -     -     -     -     -
Intuitive restrictions        +/-   +/-   +     -     +     +     +     +

Table 2.3: Summary of theoretical evaluation of attribution models. A ‘+’ indicates a model fully
complies with the criterion, a ‘+/−’ implies partial compliance and a ‘−’ no compliance.

Chapter 3

Method

This chapter explains the method that this thesis employs to answer the question which
model performs best empirically and in a simulation. First, Section 3.1 briefly mentions
the different methods and models that will be estimated and evaluated in this thesis.
Then, Section 3.2 describes how the models are evaluated based on their classification
accuracy in an empirical study. Finally, Section 3.3 works out the method that is used
for the simulations. All statistical analyses and simulations are performed in the open
source programming language R.

3.1 Models
The models that are tested in the empirical and simulation study are all introduced in
Chapter 2. The list below sums them up, where the models that are able to produce
predictions are designated with an asterisk (*).

• Last touch attribution (LT*)
• First touch attribution (FT*)
• Linear attribution (LIN)
• Simple probabilistic model (PROB)
• Logistic regression model: basic formulation (LOG*), extension with last touch
dummies (LOGX1*) and extension with dummies for the last two touches (LOGX2*)
• Markov chain model: first-order (MAR1*), second-order (MAR2*) and third-order
(MAR3*)

We choose to estimate the models on the full data set, without conditioning on the
number of touchpoints. Conditioning would quickly reduce the number of
observations available for journeys with many touchpoints. This would give insignificant parameter
estimates, severely complicating the issue of attribution. In addition, all of the models are
perfectly able to cope with a data set that contains journeys with different numbers of touchpoints,
and it is expected that some of the models even perform better on such a full data set. To
see this, first notice that conditioning on the number of touchpoints would not have any
implication for the rule-based heuristics: attribution under conditioning on touchpoints
would be exactly the same as attribution under the full data set. Since the probabilistic
model already conditions on paths with one or two touchpoints and ignores longer paths,
conditioning has no effect here either. For the logistic regression and Markov models,
it is expected that conditioning produces worse results, simply because less information
is available for each estimate. Suppose we condition on the number of touchpoints Nt for
Nt = 1 and Nt > 1. It is much harder for the set that is conditioned on Nt > 1 to
determine the contribution of a channel, since information on the single-touch conversion
probability of this channel relative to other channels is not taken into account. This
makes its estimates for this contribution less accurate compared to the case in which all
information is included in the data. There is, in conclusion, no good reason to condition
on the number of touchpoints. Having established this, let’s now discuss how to evaluate
the empirical performance of the models.

3.2 Classification accuracy


Although channel attribution can obviously not be observed empirically, it seems plausible
that a model able to predict conversions well also attributes well. Under this assumption,
we can measure and evaluate the classification accuracy of the different models. In order
to do so, the dataset is split into a training set and a test set. Two thirds of the
observations are randomly assigned to the training set. This split is arbitrary and could have been
chosen differently. Most importantly, the number of observations is large enough to obtain
both parameter estimates in the training set and a performance statistic in the test set
that have a small variance.
All models are estimated on the training set. For the models that have been marked
by an asterisk in the list in Section 3.1, conversion probabilities are determined for the
observations in the test set. These probabilities are then turned into a conversion or a
non-conversion for a certain threshold, resulting in a Confusion matrix as is shown in
Table 2.1. Standard measures for classification accuracy, such as Shao and Li (2011)’s
misclassification error rate or the percentage correctly classified, can be calculated. How-
ever, in our case the class distribution between conversion and non-conversion is highly
skewed: the event of non-conversion is far more likely than the event of conversion. In the
case of a highly skewed distribution, He et al. (2009) show that the standard measures
perform poorly due to limited discriminative power.
A better measure for our purposes is the Receiver Operating Characteristic (ROC)
curve. This curve plots the True Positive Rate (TPR) on the y-axis against the False
Positive Rate (FPR) on the x-axis for all possible thresholds c ∈ (0, 1). The TPR or
sensitivity is calculated as the percentage of accurately classified positive observations
(conversions) relative to the total number of true conversions. A 100% sensitivity implies
that all true converters are correctly classified as conversions by the model. Mathematically,
the sensitivity is expressed as TPR = TP / (TP + FN), where TP is the number of True
Positive classifications and FN the number of False Negative classifications. Similarly to
the sensitivity, the specificity is defined as the percentage of accurately classified negative
observations (non-conversions) relative to the total number of true non-conversions. The
FPR is calculated as 1 − specificity, or FPR = FP / (TN + FP). A 100% specificity (or 0% FPR)
implies that all actual non-conversions are classified as such by the model.
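
As a small worked example of these two rates, the Python sketch below classifies a handful of made-up predicted probabilities at a threshold c and computes the TPR and FPR. The function name `rates_at_threshold`, the toy data and the convention that a score equal to the threshold counts as a predicted conversion are assumptions for illustration.

```python
def rates_at_threshold(y_true, scores, c):
    """Return (TPR, FPR) when scores >= c are classified as conversions."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = s >= c
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)   # sensitivity
    fpr = fp / (tn + fp)   # 1 - specificity
    return tpr, fpr

# Made-up test set: 2 conversions and 6 non-conversions.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.7]
tpr, fpr = rates_at_threshold(y_true, scores, c=0.5)
print(tpr, fpr)
```

At c = 0.5 one of the two conversions is detected (TPR = 1/2) and two of the six non-conversions are wrongly flagged (FPR = 2/6). Repeating the calculation for every threshold traces out the ROC curve.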
An example of a ROC curve is plotted in Figure 3.1. For every threshold c ∈ (0, 1)
that classifies an estimated probability as positive or negative, a single point in the ROC
graph is plotted. The best possible model yields 100% sensitivity (no false negatives) and
100% specificity (no false positives), which is represented by the upper left corner or (0,1)
coordinate in the ROC graph. Random guessing gives a point on the 45° line, or line of
no discrimination. The extent to which a model is able to produce probability predictions that are
far from the line of no discrimination and close to the upper left corner determines its
discriminatory power and thus classification accuracy. The classification accuracy can be
expressed in a single number, which is the Area Under the ROC Curve (AUC). For a model
that classifies at least as well as random guessing, the AUC lies between 0.5 and 1. The closer it
gets to 1, the better the performance of a model. The AUC is the measure that is used by this
thesis to judge the different models on their classification accuracy. Finally, it is important to realize that
the AUC as a measure assumes that every value of the specificity is equally important.
Alternatively, one might argue for a certain distribution of weights over each value of
the specificity. The assumption of equal weights only becomes relevant when the ROC curves
of two models cross. It is thus necessary to verify whether the ROC curves of any two
models cross.
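
The AUC can be computed without drawing the curve, using the known equivalence between the area under the ROC curve and the probability that a randomly chosen conversion receives a higher score than a randomly chosen non-conversion (ties counting one half). The Python sketch below uses this pairwise formulation on made-up data; the function name `auc` and the data are assumptions for illustration, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc(y_true, scores):
    """AUC as the share of (conversion, non-conversion) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up test set: 2 conversions and 6 non-conversions.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.7]

print(auc(y_true, scores))                    # 10 of the 12 pairs are ranked correctly
print(auc(y_true, [1 - s for s in scores]))   # reversed scores rank only 2 of 12 correctly
```

The second call shows that a model ranking systematically worse than random obtains an AUC below 0.5.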
Evidently, the AUC measure alone does not tell us whether performance differences
between models are significant. This thesis extends the literature by performing a bootstrap
procedure on the test set in order to determine the standard deviation of the
difference in AUC between any two models. This way, it is tested whether performance differences
across models are significant. The bootstrap procedure is based on the idea that
the sample data can be considered as the population. Randomly sampling with replacement
from this new population (the sample data) is called bootstrapping. Bootstrapping
allows calculating certain measures, such as the standard deviation, for sample estimates
for inference. In our case, bootstrapping B times gives us B estimates of the difference in
AUC between two models i and j: $\{\widehat{AUCDiff}_{i,j}(1), \widehat{AUCDiff}_{i,j}(2), \ldots, \widehat{AUCDiff}_{i,j}(B)\}$.
From these B estimates the standard deviation can be calculated, and a simple Z-test
can show us whether the AUC difference between two models differs significantly
from 0 in either direction. In this thesis B is chosen to be 1000.

Figure 3.1: An example of a ROC curve. The 45° line represents random classification. The model of the
blue curve thus clearly classifies better than random.

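
The bootstrap test can be sketched as follows in Python. This is an illustration under stated assumptions rather than the thesis's own R code: the synthetic test set, the function names `auc` and `bootstrap_auc_diffs`, the random seeds and the reduced number of replications (B = 200 instead of the 1000 used in this thesis, to keep the example fast) are all choices made for the example. The Z-statistic is taken here as the observed AUC difference divided by the bootstrap standard deviation.

```python
import random
import statistics

def auc(y_true, scores):
    """AUC as the share of (conversion, non-conversion) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diffs(y_true, scores_i, scores_j, B=1000, seed=1):
    """Resample the test set with replacement B times; return the AUC differences."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < B:
        idx = [rng.randrange(n) for _ in range(n)]
        y_b = [y_true[t] for t in idx]
        if 0 < sum(y_b) < n:  # a resample needs both classes for the AUC to exist
            diffs.append(auc(y_b, [scores_i[t] for t in idx])
                         - auc(y_b, [scores_j[t] for t in idx]))
    return diffs

# Synthetic test set: model i is informative, model j is pure noise.
rng = random.Random(0)
y = [1 if rng.random() < 0.2 else 0 for _ in range(200)]
scores_i = [0.5 * label + rng.random() for label in y]
scores_j = [rng.random() for _ in y]

observed = auc(y, scores_i) - auc(y, scores_j)
diffs = bootstrap_auc_diffs(y, scores_i, scores_j, B=200)
sd = statistics.stdev(diffs)
z = observed / sd
p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))  # two-sided Z-test
print(observed, sd, z, p_value)
```

Both models are scored on the same resampled observations in each replication, so the bootstrap standard deviation reflects the dependence between the two AUC estimates.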
Although the AUC measure for classification accuracy gives us a good
method to evaluate the different models on their attribution, we should not forget the underlying
assumption that accurate prediction implies accurate attribution. This assumption
is necessary because the true attribution can never be observed in empirical data. An
alternative way to compare the attribution models is to simulate data in which the true
attribution is known, which is the topic of the next section.

3.3 Simulation performance


The third dimension on which the attribution models are evaluated is their performance
in a simulation. For simulating customer journeys a specific Data Generating Process
(DGP) should be chosen. It is important to note that both the DGP and the specific
distribution of its parameters potentially influence the performance of the models. In
choosing a specific DGP one might thus inadvertently advantage one attribution model
over the others. We acknowledge this fact and therefore simulate data according to two
entirely different DGPs. Firstly, data is generated by a first- and second-order Markov
chain DGP. Secondly, this thesis proposes a very simple new framework to generate data
in which the true attribution distribution over the channels is known. Let’s first discuss
simulation with a Markov chain DGP.
We restrict ourselves to discussing the first-order Markov DGP, but the described
method can be easily extended to the second-order. We use the estimated initial state Ẑ0
and estimated transition matrix Ŵ from the data to simulate a new dataset with the same
number of prospects, which is 734.6k. The simulation can thus be seen as an effort to
reproduce the original dataset. This simulation is iterated 10 times to account for sample
variance. By means of the Removal Effect (see Equation (2.15)) the estimated attribution
from the empirical dataset is determined, which functions as the ‘true’ attribution in our
DGP. All models are estimated on the generated data and their attribution is determined.
Subsequently, this attribution is compared with the true attribution by taking the Mean
Absolute Error (MAE). This way, we can evaluate which model attributes closest to the
true attribution.
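
The two building blocks of this procedure, simulating journeys from a first-order Markov chain and scoring a model's attribution with the Mean Absolute Error, can be sketched in Python as below. The initial state and transition probabilities are made-up numbers rather than the estimates Ẑ0 and Ŵ from our data, and the names `simulate_journey` and `mean_absolute_error` as well as the safety cap `max_len` are assumptions for this illustration.

```python
import random

ABSORBING = ("Conv", "NonConv")

def simulate_journey(z0, W, rng, max_len=100):
    """Draw one journey: a first channel from z0, then transitions from W until absorption."""
    state = rng.choices(list(z0), weights=list(z0.values()))[0]
    path = [state]
    while state not in ABSORBING and len(path) < max_len:
        nxt = W[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        path.append(state)
    return path

def mean_absolute_error(attr_true, attr_model):
    """Average absolute difference between true and estimated channel attributions."""
    return sum(abs(attr_true[c] - attr_model[c]) for c in attr_true) / len(attr_true)

# Hypothetical initial state and transition matrix for two channels.
z0 = {"C1": 0.6, "C2": 0.4}
W = {
    "C1": {"C2": 0.3, "Conv": 0.2, "NonConv": 0.5},
    "C2": {"C1": 0.2, "Conv": 0.1, "NonConv": 0.7},
}

rng = random.Random(42)
journeys = [simulate_journey(z0, W, rng) for _ in range(10_000)]
conversion_rate = sum(path[-1] == "Conv" for path in journeys) / len(journeys)

# Comparing a hypothetical model attribution with a hypothetical 'true' attribution.
mae = mean_absolute_error({"C1": 0.6, "C2": 0.4}, {"C1": 0.5, "C2": 0.5})
print(conversion_rate, mae)
```

Each attribution model would then be estimated on `journeys`, and its attribution compared with the attribution implied by the Removal Effect on the generating chain.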
Note, however, that the true attribution in the DGP is calculated by a specific
method, namely the Removal Effect. Inevitably, this true attribution therefore represents
the way the Markov attribution model conceives of attribution. One might therefore argue
that this method answers the question which model attributes closest to the Markov chain
model rather than the question which model attributes closest to the truth. This is a
valid argument, and a major shortcoming of using the Markov chain DGP for simulation.
Nevertheless, this analysis can still be insightful. Generating data by both a first- and
second-order Markov chain DGP and comparing the models’ performances allows us, for
instance, to see which Markov model better reflects the true data.
Because of this shortcoming of the Markov chain DGP, we extend our simulation
study with a different, basic and intuitive framework to generate data in which the true
attribution is known. Hopefully, this framework does not favour any model over the others.
For clarity and simplicity, we restrict ourselves to a framework with two channels and
a maximum of two touchpoints. However, the framework is easily extensible to more
channels and longer journeys.
Our framework is as follows. First, P1 and P2 are the probabilities that the journey has
one and two touchpoints respectively, where P2 = 1 − P1. Then, for the first as well as
the second touchpoint, one of two channels is touched. ch1, ch1|1 and ch1|2 indicate the
probability that channel 1 is touched given no prior touches, a prior touch with channel
1 and a prior touch with channel 2 respectively. ch2, ch2|1 and ch2|2 are defined as one
minus the corresponding probability that channel 1 is touched. The final element
of the simulation model is the contribution to the total conversion probability of each
touchpoint. The contribution of channel 1 is p1 , p1|1 and p1|2 depending on the position of
this channel 1. Again, p1 gives the contribution to the conversion probability of channel
1 with no prior touches, p1|1 with a prior touch of channel 1 and p1|2 with a prior touch
of channel 2. p2 , p2|1 and p2|2 are equivalently defined. It should be noted that p1 and p2
are contributions to the conversion probability irrespective of possible later touchpoints.
The total conversion probability for a path with first channel 1 and then channel 2 is
p1 + p2|1 . In total, this simple simulation model contains 10 parameters that affect the
customer journeys and thus the true attribution.
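
A single journey under this framework can be generated as in the Python sketch below. The parameter values are those of the base scenario (P1 = 0.5, all channel probabilities 0.5, contributions of 0.05 for channel 1 and 0.03 for channel 2); the dictionary keys and the function names `conversion_probability` and `simulate_journey` are naming choices made for this illustration, and the thesis's own simulations are run in R.

```python
import random

# Base-scenario parameters; the dictionary keys are a naming choice for this sketch.
BASE = {
    "P1": 0.5, "ch1": 0.5, "ch1|1": 0.5, "ch1|2": 0.5,
    "p1": 0.05, "p2": 0.03,
    "p1|1": 0.05, "p2|1": 0.03, "p1|2": 0.05, "p2|2": 0.03,
}

def conversion_probability(path, par):
    """Sum of the touchpoint contributions, e.g. p1 + p2|1 for the path (1, 2)."""
    prob = par[f"p{path[0]}"]
    if len(path) == 2:
        prob += par[f"p{path[1]}|{path[0]}"]
    return prob

def simulate_journey(par, rng):
    """Draw one journey (a tuple of channel numbers) and its conversion outcome."""
    first = 1 if rng.random() < par["ch1"] else 2
    path = (first,)
    if rng.random() >= par["P1"]:  # with probability P2 = 1 - P1 there is a second touch
        second = 1 if rng.random() < par[f"ch1|{first}"] else 2
        path = (first, second)
    converted = rng.random() < conversion_probability(path, par)
    return path, converted

rng = random.Random(7)
journeys = [simulate_journey(BASE, rng) for _ in range(10_000)]
print(conversion_probability((1, 2), BASE))
```

The ten entries of `BASE` correspond to the ten free parameters of the framework; P2 and the ch2 probabilities follow as complements.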
The true attribution can be calculated in a straightforward fashion from the simulated
dataset. All converting customer journeys that contain a single channel (either with one
or with two touchpoints) obviously give full credit to that channel. More interestingly,
credit should be divided in case a conversion path contains both channels. Suppose a
conversion path i touches first channel 1 and then channel 2. Channel 1 receives the
conversion credit ai,1 = p1 / (p1 + p2|1), which is the contribution to the conversion probability
of channel 1 as a fraction of the total contribution to the conversion probability of each
touchpoint. Equivalently, the credit of channel 2 is calculated as ai,2 = p2|1 / (p1 + p2|1).
Having determined the individual attributions of both channels in each instance, we can
aggregate and normalize to determine the aggregate true contribution. The attribution
models that are estimated on this simulated data are evaluated by their Mean Absolute
Error with respect to this true contribution.
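
The calculation of the true attribution can be illustrated with a short Python sketch. The three converting journeys are made up, the contributions are those of the base scenario, and the names `path_credit` and `true_attribution` are assumptions for the example.

```python
# Contribution parameters of the base scenario; keys are a naming choice for this sketch.
PAR = {"p1": 0.05, "p2": 0.03, "p1|1": 0.05, "p2|1": 0.03, "p1|2": 0.05, "p2|2": 0.03}

def path_credit(path, par):
    """Split the credit of one converting journey over channels 1 and 2."""
    credit = {1: 0.0, 2: 0.0}
    if len(path) == 1:
        credit[path[0]] = 1.0
        return credit
    first, second = path
    c_first = par[f"p{first}"]
    c_second = par[f"p{second}|{first}"]
    total = c_first + c_second
    credit[first] += c_first / total
    credit[second] += c_second / total
    return credit

def true_attribution(converting_paths, par):
    """Aggregate the individual credits and normalize them to sum to one."""
    totals = {1: 0.0, 2: 0.0}
    for path in converting_paths:
        for channel, a in path_credit(path, par).items():
            totals[channel] += a
    n = sum(totals.values())
    return {channel: v / n for channel, v in totals.items()}

# Three made-up converting journeys: (1), (1, 2) and (2, 1).
attribution = true_attribution([(1,), (1, 2), (2, 1)], PAR)
print(attribution)
```

For the path (1, 2) channel 1 receives 0.05 / 0.08 = 0.625 of the credit, and for the path (2, 1) it also receives 0.05 / 0.08 = 0.625, so over these three journeys channel 1 holds 2.25 of the 3 units of credit.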
The performance of the attribution models is tested in a wide spectrum of simulation
scenarios. In other words, data is generated multiple times according to different parameter
settings. The scenarios are meant to reflect a great variety of possible characteristics
to be found in real datasets. We hope to find a model that attributes consistently accurately
over all scenarios, such that it is able to process all characteristics. An overview of
the scenarios can be found in Table 3.1.
In the base scenario, the probability of a single touch journey is 0.5 (P1 = P2 = 0.5).
Furthermore, regardless of the history, each touchpoint has a probability of 0.5 of being channel
1 (ch1 = ch1|1 = ch1|2 = 0.5) and 0.5 of being channel 2 (ch2 = ch2|1 = ch2|2 = 0.5).
For each touchpoint with channel 1, the contribution to the conversion probability is 5%
(p1 = p1|1 = p1|2 = 0.05). For each touchpoint with channel 2, this contribution is slightly
lower at 3% (p2 = p2|1 = p2|2 = 0.03).
All other scenarios are defined by single deviations from this base scenario. These
deviations are bold faced in Table 3.1. The scenario Short has 80% single touchpoint
journeys, while Long is characterized by journeys that contain two touchpoints in 80%
of the cases. Ch1 FT has 80% probability that the first touch is channel 1. The scenario
Mixed paths increases the chances of journeys with two touchpoints that include both
channels. The difference in conversion probability contribution between channel 1 and
channel 2 is enlarged in the scenario Ch1 conv. For Ch1 t2 conv this difference in contribution
is only made greater for the second touchpoint, and for Ch1 t2 mixed conv it is
increased even more specifically for the second touchpoint in case the first touchpoint is
channel 2. Finally, scenario Ch1 t2 no conv decreases the contribution to the conversion
probability of channel 1 in the second touchpoint to zero. The scenarios are chosen such that
the effect of all possible characteristics to be found in real data on the attribution models
can be measured. Note, however, that only unilateral effects and no interaction effects
can be investigated in this basic study.

Scenario            P1    ch1   ch1|1  ch1|2  p1    p2    p1|1  p2|1  p1|2  p2|2
Base                0.5   0.5   0.5    0.5    0.05  0.03  0.05  0.03  0.05  0.03
Short               0.8   0.5   0.5    0.5    0.05  0.03  0.05  0.03  0.05  0.03
Long                0.2   0.5   0.5    0.5    0.05  0.03  0.05  0.03  0.05  0.03
Ch1 FT              0.5   0.8   0.5    0.5    0.05  0.03  0.05  0.03  0.05  0.03
Mixed paths         0.5   0.5   0.2    0.8    0.05  0.03  0.05  0.03  0.05  0.03
Ch1 conv            0.8   0.5   0.5    0.5    0.10  0.01  0.10  0.01  0.10  0.01
Ch1 t2 conv         0.5   0.5   0.5    0.5    0.05  0.03  0.10  0.03  0.10  0.03
Ch1 t2 mixed conv   0.5   0.5   0.5    0.5    0.05  0.03  0.05  0.03  0.10  0.03
Ch1 t2 no conv      0.5   0.5   0.5    0.5    0.05  0.03  0.00  0.03  0.00  0.03

Table 3.1: Parameter settings for different simulation scenarios. Deviations from the base scenario are
emboldened.

For each scenario 10,000 customer journeys are generated. The true attribution is
calculated for each scenario. Models are estimated and attribution is determined.
Subsequently, for each model we determine the Mean Absolute Error. We check for sample
variance by performing the same analysis on 100,000 generated customer journeys.
Chapter 4

Data

In this chapter, we describe the data that is used for the empirical part of the
thesis. The original dataset that is used is described in Section 4.1. Section 4.2 discusses
how this original data set has been processed in order to be fit for our empirical analysis.
Finally, in Section 4.3 some figures of the processed dataset are provided in order to get
some sense of the content of the data.

4.1 Data description


The dataset that this thesis uses contains all the clicks that occurred on the website of a
large Dutch financial institution for a full year between July 2014 and June 2015. This
type of data, having a record for every click, is called clickstream data. To obtain insight
into the goals of a visitor during a visit or clickstream path, certain goals are defined that
are achieved after, for instance, clicking a particular link or seeing a specific page. Every
visit can thus have multiple goals, such as downloading a form, clicking somewhere or
being active in a funnel. The dataset that contains the goals that are reached for all
visits in the period between July 2014 and June 2015 is the basic data employed by this
thesis.
A summary of the relevant variables in this Goals Table is given in Table 4.1.

Variable        Brief explanation
Goal id         An identifier of the goal that is reached.
Visit id        An identifier of the site visit.
Cookie id       A unique cookie identifier of the visit.
IP address      The IP address from which a visit comes.
Campaign type   The channel through which the visitor came on the site.
Datetime        The exact time a specific goal is achieved.

Table 4.1: Variables in the Goals Table.

In total, the table has 64.7 million records. The total number of unique visits is 10.8 million,
meaning that a visit has on average around six goals. The predefined goals are given a
goal id, which can take on 1405 distinct values. Broadly stated, goals fall into one of
the following five categories for any given product.

1. Orientation visit: a visit to pages on the site to acquire more information about
a product. This can be considered as the initial orientation phase in a prospect
journey.

2. Download: a visitor downloads a certain brochure about a product.

3. Funnel: a visit to pages on the site that are considered the official funnel, which is
the place where transactions can be made.

4. Price calculation: the next step in the funnel. A visitor fills out their personal details
on the site in order to calculate what price they would pay when purchasing a product.

5. Transactions: a visitor purchases a product.

The visit id is an 11-digit number that is uniquely assigned to each visit. A visit
starts from the moment a visitor enters the website and ends as soon as the website is
left (meaning that no tab in the browser displays the website). During a session, which
is defined as the period between opening and closing the browser, multiple visits are
possible. A cookie id has 32 numbers or characters, and is uniquely assigned in order to
identify visitors throughout multiple visits. The Goals Table contains 4.9 million distinct
cookies, implying that a cookie has on average 2.2 visits. The IP address variable gives
the numerical label of the public IP address of the visitor. The datetime indicates the
exact date and time that the goal in the same record is achieved. A final indispensable
variable for this thesis is the campaign type field. This states the channel from which
a visitor lands on the company’s website. This can take on the categories of ‘organic’,
‘organic search’ (SEO), ‘sponsoredsearch’ (SEA, which is later separated into branded
and non-branded sponsoredsearch), ‘email’, ‘referrer’, ‘bannerad’ and some other small
categories summarized into the ‘other’ group.
The Goals Table functions as input for the data processing algorithms. These al-
gorithms are necessary to obtain a dataset out of which the attribution models can be
estimated. This data processing is the subject of the next section.

4.2 Data processing


This section describes the data transformation from the Goals Table to a Visits Table
and Prospect Journey Table. The latter two tables serve as input for the estimation of
the attribution models. This process can be roughly divided into three steps:

1. Relevant visit selection

2. Prospect identification

3. Create prospect journeys

All data processing is performed with the Structured Query Language (SQL) in SQL
2014 Management Studio. This programming language is well-suited for handling large
volumes of data in an efficient way, and therefore perfect for our purposes.

4.2.1 Relevant visit selection


The following initial steps are taken in order to select only the relevant visits for the
purposes of this thesis.

• Our data consists of all visits to the website. The pertinent firm offers multiple
products on its site. To limit our analysis, we choose to attribute for a specific
product, say product X. We therefore need to make sure that only visits (or touch-
points) that are intended towards this product are taken into account. Since the
goals are assigned specifically for each product, we can simply filter on all the goals
that include our product X.

• Some of the traffic is caused by internet bots rather than prospects interested in
product X. Examples are web crawlers that browse sites to index them for search
engines, or internal bots that check whether all pages on the website are still active
and functional. This traffic is obviously not performed by potential product pur-
chasers and if included distorts our results. These visits are therefore filtered from
our Goals Table.

• Finally, it is more convenient to work with a table on visit level than on goal level,
so the Goals Table is aggregated on visit id. For each visit, a new binary variable
called ‘Transaction’ indicates whether one of the goals during that visit is that the
visitor purchases product X. Datetime per visit is taken as the datetime of the last
goal.
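
The aggregation in the last step can be sketched as follows. The thesis performs this step in SQL; the Python version below is only an illustration, and the record layout (the keys `visit_id`, `cookie_id`, `campaign_type`, `is_transaction` and `datetime`), the sample rows and the function name `aggregate_goals_to_visits` are hypothetical.

```python
def aggregate_goals_to_visits(goal_rows):
    """Collapse goal-level records into one record per visit."""
    visits = {}
    for row in goal_rows:
        visit = visits.setdefault(row["visit_id"], {
            "cookie_id": row["cookie_id"],
            "campaign_type": row["campaign_type"],
            "transaction": False,
            "datetime": row["datetime"],
        })
        # A visit is a transaction visit if any of its goals is a purchase.
        visit["transaction"] = visit["transaction"] or row["is_transaction"]
        # The visit datetime is the datetime of its last goal (ISO strings sort correctly).
        visit["datetime"] = max(visit["datetime"], row["datetime"])
    return visits

# Hypothetical goal records that already passed the product and bot filters.
goal_rows = [
    {"visit_id": "v1", "cookie_id": "a", "campaign_type": "email",
     "is_transaction": False, "datetime": "2014-07-01T10:00:00"},
    {"visit_id": "v1", "cookie_id": "a", "campaign_type": "email",
     "is_transaction": True, "datetime": "2014-07-01T10:05:00"},
    {"visit_id": "v2", "cookie_id": "b", "campaign_type": "organic",
     "is_transaction": False, "datetime": "2014-07-02T09:30:00"},
]
visits = aggregate_goals_to_visits(goal_rows)
print(visits)
```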

Variable              Brief explanation
Visit id              An identifier of the site visit.
Prospect id           A unique number identifying the individual of the visit.
Visit seq             Indicates the consecutive visit number of that prospect.
Conversion path       States whether the journey results into a conversion.
Datetime              The exact time of a visit.
Campaign type         The channel through which the prospect arrived on the site.
Campaign type next    The channel of the next visit in the journey.
Campaign type prev    The channel of the previous visit in the journey.
Campaign type prev2   The channel of the second-previous visit in the journey.

Table 4.2: Variables in the Visits Table.

4.2.2 Prospect identification


Now that we have a table with relevant visits for our product X, we need to identify visits
done by the same person. This is done as follows:

• All visits with the same cookie id are assumed to come from the same person. This
assumption is standard and accepted in the literature.

• Different cookies that have the same IP address are taken together and considered
the same person. However, this step involves the significant risk that cookies of
different individuals are taken together, since IP addresses pertain to an internet
connection rather than an individual. For this reason, it has never been done in
the literature (to this author’s knowledge). To mitigate this risk, only IP addresses
that have fewer than five cookies are considered for the bundling. This prevents
joining cookies that have been connected to a shared internet connection such as that of
a library, office or university. The risk that visits of different persons within the
same household are bundled is still present and probable. However, our product
X is typically consumed by households rather than individuals. It is for this reason
that this thesis argues cookies with the same IP address can be seen as the same
customer.

• Finally, individuals as identified according to the above two steps are given a
uniquely generated prospect id.
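
The identification rules above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this thesis: the record layout (pairs of `cookie_id` and `ip`), the function name `assign_prospect_ids` and the union-find helper are assumptions made for the example. Only the rule that IP addresses with fewer than five cookies are bundled is taken from the text.

```python
from collections import defaultdict

def assign_prospect_ids(visits, max_cookies_per_ip=4):
    """Map every cookie id to a prospect id.

    visits: iterable of (cookie_id, ip) pairs (hypothetical layout).
    Cookies that share an IP address are merged only if that address
    is linked to at most `max_cookies_per_ip` distinct cookies.
    """
    visits = list(visits)
    cookies_by_ip = defaultdict(set)
    for cookie, ip in visits:
        cookies_by_ip[ip].add(cookie)

    # Union-find structure: every cookie starts as its own group.
    parent = {cookie: cookie for cookie, _ in visits}

    def find(cookie):
        while parent[cookie] != cookie:
            parent[cookie] = parent[parent[cookie]]  # path compression
            cookie = parent[cookie]
        return cookie

    for cookies in cookies_by_ip.values():
        if len(cookies) <= max_cookies_per_ip:
            first, *rest = sorted(cookies)
            for other in rest:
                parent[find(other)] = find(first)

    # Number the resulting groups 1, 2, 3, ...
    group_number = {}
    return {
        cookie: group_number.setdefault(find(cookie), len(group_number) + 1)
        for cookie in parent
    }

# Toy data: c1, c2 and c4 are linked through shared household connections,
# while the five cookies on "ip_public" are left unbundled.
example_visits = [
    ("c1", "ip1"), ("c2", "ip1"), ("c3", "ip2"),
    ("c2", "ip3"), ("c4", "ip3"),
] + [(f"p{i}", "ip_public") for i in range(5)]
prospect_ids = assign_prospect_ids(example_visits)
```

Note that the bundling is transitive in this sketch: a cookie seen on two small IP addresses links all cookies on both addresses into one prospect. Whether the thesis applies the rule transitively is not stated, so this is a design choice of the example.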

4.2.3 Create prospect journeys


Having identified individuals across visits, we can now undertake the last preparatory
steps to create a dataset that contains full prospect journeys.

Variable           Brief explanation
Prospect id        A unique number identifying the individual of the visit.
Affiliate          The number of touches coming from the affiliate channel in a journey.
Bannerad           The number of touches coming from the bannerad channel in a journey.
Email              The number of touches coming from the email channel in a journey.
Organic            The number of touches coming from the organic channel in a journey.
SEABranded         The number of touches coming from the SEABranded channel in a journey.
SEANonbranded      The number of touches coming from the SEANonbranded channel in a journey.
Referrer           The number of touches coming from the referrer channel in a journey.
Other              The number of touches coming from other channels in a journey.
Conversion path    States whether the path results in a conversion.

Table 4.3: Variables in the Prospect Journey Table

• Take only the visits before a transaction. Since we are only interested in the touch-
points before the first conversion, all visits after this conversion should be removed.

• Ensure customer journeys are not cut off. At this stage our dataset contains all
relevant visits between July 2014 and June 2015. However, this causes some prospect
journeys to be cut off. Think of conversions in early July 2014, of which the
orientating touchpoints in May 2014 are not included. Similarly, visits at the end of June
2015 might be completed by a conversion or complemented with additional visits
in July 2015, which our current data cannot account for. In order to overcome this
issue, we take all visits from prospects in the period between August 2014 and May
2015, and retrieve additional visits in July 2014 and June 2015 of these prospects.
We set this bandwidth of one month arbitrarily, but later verified that almost all
customer journeys take less than a month.

• Then, we calculated some useful new variables. ‘Visit seq’ indicates for every visit
the consecutive visit number in the journey of a person. ‘Conversion path’ states
for each visit whether its path eventually results in a conversion. Also, in order
to simplify the Markov model estimation, it is determined for each visit what the
campaign type of the next, previous and second-previous visits are. If there is no
next visit, the next campaign type can be either ‘Conversion’ or ‘Non-Conversion’,
depending on whether a transaction is made in the final visit. If it is the first visit,
the previous campaign type is ‘None’.

• The resulting dataset is called the Visits Table. However, for estimating the logistic
regression the Visits Table does not suffice. Therefore, another table is produced by
aggregating on prospect id and making a variable for every campaign type category
that states how many visits in the journey come from each specific channel. This
table is named the Prospect Journey Table.

Figure 4.1: The number of paths per touchpoint for non-conversions and conversions. The conversion
percentage per touchpoint is also shown.

The variables of both the Visits Table and the Prospect Journey Table are summarized
and explained in Tables 4.2 and 4.3. The next section provides more information on the
content of these two tables.
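
The journey-building steps of this subsection can be sketched as follows. This is an illustrative reconstruction, not the code used for the thesis: the dictionary keys, the function names and the use of ISO datetime strings are assumptions, as is the choice to set the second-previous campaign type to ‘None’ when it does not exist (the text only states this for the previous campaign type).

```python
from collections import Counter, defaultdict

def build_visits_table(visits):
    """visits: list of dicts with the (hypothetical) keys
    prospect_id, datetime, campaign_type and transaction (bool)."""
    by_prospect = defaultdict(list)
    for v in sorted(visits, key=lambda v: (v["prospect_id"], v["datetime"])):
        by_prospect[v["prospect_id"]].append(v)

    table = []
    for pid, journey in by_prospect.items():
        # Keep only the visits up to and including the first transaction.
        cut = next((i for i, v in enumerate(journey) if v["transaction"]), None)
        converted = cut is not None
        if converted:
            journey = journey[: cut + 1]
        channels = [v["campaign_type"] for v in journey]
        end_state = "Conversion" if converted else "Non-Conversion"
        for i, v in enumerate(journey):
            table.append({
                "prospect_id": pid,
                "visit_seq": i + 1,
                "conversion_path": converted,
                "datetime": v["datetime"],
                "campaign_type": channels[i],
                "campaign_type_next": channels[i + 1] if i + 1 < len(channels) else end_state,
                "campaign_type_prev": channels[i - 1] if i >= 1 else "None",
                # Assumption: 'None' is also used when there is no second-previous visit.
                "campaign_type_prev2": channels[i - 2] if i >= 2 else "None",
            })
    return table

def build_prospect_journey_table(visits_table):
    """Aggregate the Visits Table to one row of channel counts per prospect."""
    touches = defaultdict(Counter)
    converted = {}
    for row in visits_table:
        touches[row["prospect_id"]][row["campaign_type"]] += 1
        converted[row["prospect_id"]] = row["conversion_path"]
    return {
        pid: {"touches": dict(counts), "conversion_path": converted[pid]}
        for pid, counts in touches.items()
    }

example_visits = [
    {"prospect_id": 1, "datetime": "2014-09-01T10:00", "campaign_type": "Affiliate", "transaction": False},
    {"prospect_id": 1, "datetime": "2014-09-03T12:00", "campaign_type": "Organic", "transaction": True},
    {"prospect_id": 1, "datetime": "2014-09-10T09:00", "campaign_type": "Email", "transaction": False},
    {"prospect_id": 2, "datetime": "2014-10-05T08:00", "campaign_type": "SEAbranded", "transaction": False},
]
visits_table = build_visits_table(example_visits)
journey_table = build_prospect_journey_table(visits_table)
```

In the toy data the email visit of prospect 1 falls after the conversion and is therefore dropped, so that prospect's journey consists of one affiliate and one organic touch.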

4.3 Data insights


In this section we will briefly discuss some data insights from the Visits Table and Prospect
Journey Table. Firstly, the length of journeys and how this affects the conversion
percentage is investigated. Secondly, a table showing the number of visits coming from each
of the channels is discussed.
Let us first consider the length of the prospect journeys. Figure 4.1 displays the
number of non-conversion and conversion paths per touchpoint. The total number of
paths or prospects is around 734,000, of which 2.2% or around 16,400 are converting
prospects. Most prospects have a single touchpoint in their journey, around 61% for the
non-conversions and 50% for the conversions. The number of prospects decreases rapidly
when the number of touchpoints increases, with more than 88% of the journeys having
fewer than four touchpoints. The low average number of touchpoints is unfortunate for
the purposes of this thesis, since attribution modelling only becomes interesting with longer
customer journeys. For single-touch customer journeys all models unanimously give full

Campaign type # Visits % Visits


Affiliate 1,487,594 73.3%
Organic 197,091 9.7%
Organicsearch 112,115 5.5%
SEAbranded 92,168 4.5%
SEAnonbranded 63,608 3.1%
Email 52,110 2.6%
Referrer 17,753 0.9%
Bannerad 5,209 0.3%
Other 3,088 0.2%
Total 2,030,736 100%

Table 4.4: The number of visits and percentage of total visits coming from the different channels.

credit to the channel of that touchpoint. Converting prospects generally have more touch-
points than non-converting prospects, which intuitively makes sense since non-converting
prospects drop out of the funnel in an earlier stage. However, there are relatively more
non-converting prospects that have more than 11 touchpoints, which is presumably a
group of customers that visits the company’s website frequently but is not interested in
the product. The conversion percentage initially grows with the number of touchpoints,
indicating that customers who orientate longer have a higher probability of converting.
However, once the number of touchpoints exceeds three, this percentage gradually declines.
This suggests that there is some sort of saturation point of orientation: beyond a
certain amount of orientation, further visits do not contribute to the chances of conversion
and may even work counter-productively. People who visit the website more than three times
might actually not be interested in purchasing a product, but have other reasons for the
visit. Note, however, that we intended to filter these people out of the dataset by only
taking visits that have a goal related to the funnel of the product. Now that we have
information about the length of prospect journeys, it is also interesting to consider the
channels that lead to visits.
Table 4.4 gives insight into the number of visits coming from each of the channels. As
we can see, the total number of visits is around two million. This makes the average
number of visits or touchpoints per prospect 2.7. Interestingly, around three quarters of
all these visits come from affiliate parties. One out of every ten visits comes from the
organic channel, meaning that the website address is entered directly into the browser. 7.6 percent
reach the website through sponsored search (SEA), which is more than the 5.5 percent
that use organic search (SEO) to find the product. About 60% (4.5 out of 7.6 percent) of the
sponsored search visits use the company’s brand in their search query. Further, 2.6 percent
of the visits come from people clicking on links in emails they have received from the
insurer. Finally, the referrer, bannerad and other channels are responsible for very minor
contributions of less than one percent each.
Chapter 5

Results

In this Chapter the results of this thesis are presented. First, in Section 5.1 the model
estimation results of the different attribution models are stated, including each model’s
actual attribution to the channels. Section 5.2 discusses the performance of the models in
classifying out-of-sample observations in the conversion or non-conversion class. Finally,
Section 5.3 shows the performance of each model in the simulation study.

5.1 Model estimation and attribution


In this section results on the estimation and attribution of the different models are
shown and interpreted. Note that the estimation and attribution are done on the training
set, implying that the number of conversions displayed in the results is approximately
two thirds of the total number of conversions (which is 16,386). The order of discussion
of the different models follows Chapter 2, starting off with the rule-based heuristics
(last touch attribution, first touch attribution and linear attribution), then the simple
probabilistic model, the logistic regressions (including the extensions) and finishing with
the Markov chain models.

5.1.1 Rule-based heuristics


The estimation and attribution determination of the rule-based heuristics is extremely
simple and straightforward. For the last touch method the attribution of channel k is
simply the percentage of last touches with that channel, taken from the journeys that
convert. The first touch heuristic takes the percentage of first touches of a channel k in
conversion paths as its attribution. Linear attribution gives each of the touchpoints in
a conversion path equal credit, meaning that for a customer who touches the channels
banner and SEA non-branded prior to conversion both channels are given a credit of 0.5.
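
The three heuristics described above can be written down in a few lines. The sketch below is a minimal illustration; the function name `rule_based_attribution` and the toy conversion paths are assumptions made for the example and do not come from the dataset.

```python
from collections import defaultdict

def rule_based_attribution(conversion_paths):
    """Return absolute last touch, first touch and linear credit per channel.

    conversion_paths: list of channel lists, one per converting journey.
    """
    last = defaultdict(float)
    first = defaultdict(float)
    linear = defaultdict(float)
    for path in conversion_paths:
        first[path[0]] += 1.0
        last[path[-1]] += 1.0
        for channel in path:
            # Every touchpoint in the path receives an equal share of one conversion.
            linear[channel] += 1.0 / len(path)
    return dict(last), dict(first), dict(linear)

example_paths = [
    ["Bannerad", "SEAnonbranded"],       # each channel gets 0.5 linear credit
    ["Organic"],                         # single touch: all methods agree
    ["Affiliate", "Organic", "Organic"],
]
lt, ft, lin = rule_based_attribution(example_paths)
```

Dividing each channel's credit by the number of conversion paths gives the relative attribution. Under every heuristic the credits sum to the number of conversions, which is a quick sanity check on an implementation.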

CHAPTER 5. RESULTS 45

                  Absolute attribution        Relative attribution
Channel           LT        FT        LIN       LT       FT       LIN
Affiliate         1,677     2,028     1,855     14.9%    18.0%    16.4%
Bannerad          3         6         6         0.0%     0.1%     0.0%
Email             417       394       411       3.7%     3.5%     3.6%
Organic           3,311     2,318     2,881     29.4%    20.6%    25.6%
Organicsearch     2,046     2,103     2,055     18.1%    18.6%    18.2%
Other             17        27        21        0.2%     0.2%     0.2%
Referrer          559       612       583       5.0%     5.4%     5.2%
SEAbranded        2,155     2,276     2,189     19.1%    20.2%    19.4%
SEAnonbranded     1,092     1,513     1,275     9.7%     13.4%    11.3%
Total             11,277    11,277    11,277    100%     100%     100%

Table 5.1: The attribution to the channels according to the last touch, first touch and linear method,
both in absolute numbers and in percentages.

Table 5.1 displays the attribution according to each of these three rule-based heuristics,
both in absolute number of conversions and percentages.
It is evident from Table 5.1 that differences between attribution methods exist, even
though we have seen that most of the customer journeys consist of a single touchpoint,
in which case every method attributes identically. The largest difference between two
methods is in the attribution to the organic channel, which receives 8.8 percentage points
more conversion credit according to last touch than first touch. This indicates that organic
is a channel that is generally touched late in converting journeys. The affiliate and
non-branded SEA attributions according to last touch and first touch both differ by more than 3
percentage points. The linear method attributes for each channel somewhere between last
touch and first touch. This can be explained by the fact that 73% of the conversion paths
have one or two touchpoints, in which case linear attributes by definition between first
and last. Only for paths that have three touchpoints or more are the middle touchpoints
disregarded by both single touch heuristics, making it possible for the linear method to
attribute outside the range spanned by first and last touch. As an illustration, suppose
all conversion paths contain three touchpoints, the middle touchpoint is always channel
k, and channel k never appears first or last. In this case, linear attributes one third of the
credit to channel k, which is more than first or last touch (which both attribute 0%). Since
conversion paths in our dataset are generally short, this does not happen and linear
attributes neatly between the attribution of first and last touch.
Although differences between methods clearly exist, and even seem quite relevant if
you consider the absolute number of conversions, it is interesting to check whether they
are significant. In order to do so, the Mean Absolute Difference (MAD) between the

Channel Âk,LT − Âk,LIN St. Error P-value


Bannerad -0.02% 0.01% 0.025*
Email -0.05% 0.10% 0.297
Organic 3.81% 0.22% 0.000**
Organicsearch -0.08% 0.14% 0.332
Other -0.04% 0.02% 0.024*
Referrer -0.21% 0.07% 0.002**
SEAbranded -0.31% 0.17% 0.033*
SEAnonbranded -1.63% 0.12% 0.000**
* P<0.05
**P<0.01

Table 5.2: The difference of channel attributions between the last touch and linear rule-based method,
including its standard error and P-value.

methods is calculated. The MAD between method i and j is calculated as follows, where
k = 1, 2, ..., K refers to the channels:

\[ \widehat{\mathrm{MAD}}(i, j) = \frac{1}{K} \sum_{k=1}^{K} \left| \hat{A}_{k,i} - \hat{A}_{k,j} \right| \tag{5.1} \]
The MAD is thus the average absolute difference in a channel’s attribution between two methods.
The MAD between FT and LT turns out to be 2.00%, between FT and LIN 1.14% and
between LT and LIN 0.86%. Last touch and linear are thus the closest pair of methods,
with on average less than 1 percentage point difference per channel. By bootstrapping 1000
samples, we determine the standard error of this MAD, which is 0.045%, giving a
P-value of 0.000. This indicates that the MAD between last touch and linear is
significant at any conventional significance level. Since this is the smallest MAD, it
is likely that all differences between rule-based models are highly significant.
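
As a quick check of Equation (5.1), the MAD can be recomputed from the relative attributions in Table 5.1. The sketch below uses the rounded percentages from that table, so its results can deviate from the values reported in the text by a few hundredths of a percentage point; the function name `mad` is an assumption of the example.

```python
# Relative attributions (in %) from Table 5.1, in the channel order of that table.
lt = [14.9, 0.0, 3.7, 29.4, 18.1, 0.2, 5.0, 19.1, 9.7]
ft = [18.0, 0.1, 3.5, 20.6, 18.6, 0.2, 5.4, 20.2, 13.4]
lin = [16.4, 0.0, 3.6, 25.6, 18.2, 0.2, 5.2, 19.4, 11.3]

def mad(a, b):
    """Mean Absolute Difference between two attribution vectors (Equation 5.1)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

mad_ft_lt = mad(ft, lt)     # close to the reported 2.00%
mad_ft_lin = mad(ft, lin)   # close to the reported 1.14%
mad_lt_lin = mad(lt, lin)   # close to the reported 0.86%
```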
Finally, it is also interesting to check whether individual channel attributions differ
significantly between last touch and linear. Again, bootstrapping is used to determine the
standard error of the difference in channel attributions. The results are shown in Table 5.2.
The table demonstrates that the attribution of seven out of nine channels differs significantly
at a 5% (*) or even 1% (**) level between the two methods. The exceptions are the
channels email and organic search (SEO). Summarizing this subsection, we can conclude
that attribution over the channels differs between the rule-based methods, and
highly significantly so.
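
The bootstrap procedure behind Table 5.2 can be sketched as follows. This is a hedged illustration on toy data: the function names, the toy conversion paths and the fixed random seed are assumptions, and the thesis does not spell out implementation details such as how the P-value is derived from the standard error.

```python
import random
from statistics import stdev

def attribution_shares(paths, channel):
    """Return the (last touch, linear) attribution share of `channel`."""
    n = len(paths)
    last_touch = sum(path[-1] == channel for path in paths) / n
    linear = sum(path.count(channel) / len(path) for path in paths) / n
    return last_touch, linear

def bootstrap_se(paths, channel, n_boot=1000, seed=0):
    """Bootstrap standard error of the last touch minus linear share."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample conversion paths with replacement, keeping the sample size.
        sample = rng.choices(paths, k=len(paths))
        last_touch, linear = attribution_shares(sample, channel)
        diffs.append(last_touch - linear)
    return stdev(diffs)

# Toy conversion paths (not the thesis data).
toy_paths = (
    [["Affiliate", "Organic"]] * 30
    + [["Organic"]] * 50
    + [["Organic", "SEAbranded"]] * 20
)
lt_share, lin_share = attribution_shares(toy_paths, "Organic")
se_organic = bootstrap_se(toy_paths, "Organic")
```

Resampling whole paths rather than individual touchpoints keeps the dependence between the two attributions intact, since both are computed on the same resampled journeys.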

Channel 1st order 2nd order PROB att LIN att


Affiliate 0.6% 0.4% 5.0% 16.4%
Bannerad 0.2% -1.4% 0.0% 0.0%
Email 1.2% -0.8% 0.0% 3.6%
Organic 3.1% 0.2% 16.0% 25.6%
Organicsearch 3.2% -0.2% 14.3% 18.2%
Other 1.2% -0.6% 3.0% 0.2%
Referrer 6.8% -1.2% 26.4% 5.2%
SEAbranded 4.9% -0.6% 20.0% 19.4%
SEAnonbranded 3.2% 0.0% 15.3% 11.3%

Table 5.3: The first-order effect, second-order effect and attribution of the probabilistic model. As a
reference, attribution according to linear touch is displayed.

5.1.2 Simple probabilistic model


The second class of models that we fit to the training set is the simple probabilistic
model. We calculate empirical conversion probabilities for each channel in case of a single
touchpoint and a second-order effect in case of two touchpoints as described in Section
2.3.2. The first and second-order effect together are normalized to determine attribution.
The first and second-order effect and attribution according to this simple probabilistic
model are reported in Table 5.3. In addition, the last column of the table includes the
attribution of the linear heuristic for reference.
The table shows that the first-order effect is largest for referrer (6.8%), which means
that the conversion probability of single touch journeys is highest for this channel. In
contrast, this probability is smaller than 1% when bannerad or affiliate is touched. An-
other observation is that second-order effects are negative for most channels, implying
that the sum of two individual channel conversion percentages is generally higher than
the conversion percentage of paths with both channels combined. This is an interesting
observation in itself. Thirdly, Table 5.3 shows that attribution according to the
probabilistic model differs strongly from that of the rule-based methods. The MAD
with linear attribution, which is a reasonable attribution method according to our defined
theoretical criteria, is as high as 6.36%. This supports our intuition that attribution
according to the simple probabilistic model is most probably inaccurate. The clearest example
is the referrer channel, which obtains the highest attribution of 26.4%
or 2,974 of the total conversions due to its large first-order effect. In the data, however,
only 1,151 conversion paths touch this channel. This supports our theoretical conclusion
that the simple probabilistic model is an inadequate attribution method.

Channel β̂k T-statistic P-value


Intercept -3.6729 -244.5 0.000**
Affiliate -0.4719 -35.8 0.000**
Bannerad -1.8794 -7.7 0.000**
Email -0.0089 -0.4 0.685
Organic 0.2161 27.6 0.000**
Organicsearch 0.2651 22.6 0.000**
Other -0.2989 -2.2 0.031*
Referrer 0.4939 19.3 0.000**
SEAbranded 0.3982 32.6 0.000**
SEAnonbranded 0.3959 23.6 0.000**
* P-value < 0.05
**P-value < 0.01

Table 5.4: Regression output of the normal logistic regression. All estimated coefficients are significant
at the 5% level except for email.

5.1.3 Logistic regression models


The third class of models we consider for estimation and attribution is the logistic
regression models. In addition to the standard formulation of this model in Equation (2.11),
we also consider the extensions described in Equation (2.13). These extensions are
referred to as logistic extension 1 (including last touch dummies) and logistic
extension 2 (including both last touch and second-to-last touch dummies) respectively. Fitting the
standard logistic regression to the training set gives the regression output in Table
5.4. The coefficient estimate β̂k, its T-statistic and its P-value are given
for each channel. All estimated coefficients are significant at the 5% level except for
email, which might be explained by the relatively small number of customer journeys
that contain this channel. The referrer channel and both SEA channels have relatively
high estimated coefficients, bannerad and affiliate relatively low ones.
The regression output for both extensions can be seen in the Appendix. For the first
extension, 16 out of 18 estimated coefficients are significant at a 5% significance level.
For the second extension, this number is 21 out of 27. Regression output for a model that
allows for a non-linear relation between the number of touches with channel k and its
contribution (containing dummies for both NC_{i,k} = 1 and NC_{i,k} > 1) can also be found
in the Appendix. This model is useful to verify the assumption of the standard logistic
model that the contribution of a channel increases linearly with its number of touches.
For this non-linear effect model, almost all coefficients are significant too.
To compare the fit relative to the number of estimated parameters of the logistic
regression models, the Akaike Information Criterion (AIC) measure can be used. If L is

Channel LOG LOGX1 LOGX2 LIN


Affiliate 15.6% 16.8% 16.8% 16.4%
Bannerad 0.0% 0.0% 0.0% 0.0%
Email 3.6% 3.6% 3.6% 3.6%
Organic 25.7% 23.5% 23.6% 25.6%
Organicsearch 18.4% 18.4% 18.4% 18.2%
Other 0.2% 0.2% 0.2% 0.2%
Referrer 5.3% 5.3% 5.3% 5.2%
SEAbranded 19.8% 19.8% 19.7% 19.4%
SEAnonbranded 11.5% 12.4% 12.4% 11.3%

Table 5.5: Attribution according to the standard logistic model, its two extensions and the linear model.

the maximum value of the likelihood function of a model and s the number of estimated
parameters, then the AIC value can be calculated as follows:

AIC = −2 log L + 2s (5.2)

The smaller the value of the AIC, the better the quality of a model. The second
extension of the logistic regression model has the smallest value (95,611). The non-linear effect
model and the first extension have slightly higher values (96,261 and 96,299 respectively).
The ordinary logistic regression performs worst (98,932). This provides substantive
evidence for the added value of the logistic extensions proposed by this thesis. In addition,
the fact that the non-linear effect model has a lower AIC than the standard logistic
regression model suggests that the assumption that the channel contribution increases linearly
with its number of touches is probably unrealistic.
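
Equation (5.2) and the model comparison can be illustrated with a short sketch. The AIC values are those reported above; the model labels and the function name `aic` are assumptions of the example. The log-likelihood in the last line is back-computed from the reported AIC under the assumption that the standard logistic model has s = 10 parameters (the intercept plus nine channel coefficients in Table 5.4), so it is an implied value and not a reported result.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = -2 log L + 2 s (Equation 5.2)."""
    return -2.0 * log_likelihood + 2.0 * n_params

# AIC values as reported in the text.
reported_aic = {
    "LOG": 98932,      # standard logistic regression
    "LOGX1": 96299,    # logistic extension 1
    "NONLIN": 96261,   # non-linear effect model
    "LOGX2": 95611,    # logistic extension 2
}

# Rank the models from best (lowest AIC) to worst.
ranking = sorted(reported_aic, key=reported_aic.get)

# Difference of each model to the best model.
best = min(reported_aic.values())
delta_aic = {model: value - best for model, value in reported_aic.items()}

# Implied log-likelihood of the standard model, assuming s = 10.
implied_loglik_log = (2 * 10 - reported_aic["LOG"]) / 2
check = aic(implied_loglik_log, 10)
```

Because the AIC penalizes every additional parameter by 2, the gap of several thousand points between the standard model and its extensions is far larger than the penalty for the extra dummies.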
Having estimated the logistic regression models, let’s now turn to the question of
attribution. Table 5.5 reports the attribution percentages according to each of the logistic
models. Again, linear attribution is included in the last column for reference. It is clear
that the differences in attribution percentages are small. However, since the number of
observations is large, all MADs between models turn out to be highly significant with
a P-value of 0.000. The MAD between logistic and linear attribution is 0.22%, with
the affiliate channel differing most. The affiliate channel receives
less credit under the logistic regression, implying that affiliate generally deserves less than
linear credit in conversion paths with other channels. This is consistent with the small
estimated coefficient of affiliate in the logistic regression model as shown in Table 5.4.
The MAD between the two logistic extensions is very small (MAD = 0.02%) but
still highly significant. The difference between the standard logistic model and its extensions is
larger (MAD = 0.48% and MAD = 0.50% for extension 1 and extension 2 respectively). Most
notably, the extensions give more conversion credit to affiliate and SEA non-branded,

Channel   Ẑ0     Aff    Ban    Em     Org    Orgs   Oth    Ref    SEAb   SEAnb  Conv   NConv
Aff       0.60   0.70   0.00   0.00   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.26
Ban       0.00   0.03   0.10   0.01   0.02   0.01   0.01   0.01   0.01   0.01   0.00   0.80
Em        0.04   0.10   0.00   0.22   0.09   0.03   0.00   0.00   0.03   0.01   0.01   0.52
Org       0.11   0.13   0.00   0.02   0.24   0.04   0.00   0.01   0.03   0.01   0.02   0.49
Orgs      0.09   0.02   0.01   0.01   0.09   0.20   0.00   0.00   0.06   0.02   0.03   0.57
Oth       0.00   0.21   0.02   0.01   0.06   0.02   0.11   0.01   0.02   0.01   0.01   0.52
Ref       0.01   0.05   0.01   0.01   0.08   0.04   0.00   0.18   0.03   0.01   0.05   0.54
SEAb      0.08   0.02   0.00   0.02   0.11   0.09   0.00   0.00   0.18   0.02   0.03   0.53
SEAnb     0.06   0.03   0.01   0.01   0.05   0.05   0.00   0.00   0.03   0.16   0.03   0.63
Conv      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   1.00   0.00
NConv     0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   1.00

Table 5.6: The estimated first-order Markov chain transition matrix Ŵ and initial state Ẑ0 . States are
abbreviated.

and less to organic. This implies that affiliate and SEA non-branded last touches are
relatively valuable, while organic last touches are relatively less valuable in comparison
to the situation in which this timing effect is not incorporated.

5.1.4 Markov chain models


Finally, we report the estimation results and attribution of the Markov chain models.
However, only the first-order Markov chain model transition matrix and initial state will
be shown. Higher orders contain more than a hundred states, making it undesirable and
practically impossible to display the full estimation results. The estimated first-order
transition matrix Ŵ and initial state Ẑ0 are shown in Table 5.6. Note that the channels
are abbreviated. If unclear, in Table 5.5 the full names can be found in exactly the same
order.
From the initial state we can derive that 60% of the customer journeys start with the
affiliate channel. Other popular starting channels are organic and the search engines (both SEA
and SEO). The estimated transition matrix Ŵ shows that most paths lead directly to a
non-conversion (NConv), which is consistent with our earlier observation that most journeys
contain a single touchpoint. Further, transition probabilities are relatively high on the
diagonal and towards the organic state. The high diagonal probabilities can be described as
channel stickiness, meaning that customers are inclined to visit the website through the same
channel if they visit multiple times. From the results of last touch attribution we know
that organic is a popular channel later in the customer journey, explaining the relatively
high probabilities towards this state. The probability of going to the conversion state is
highest coming from the referrer channel. However, the number of observations for this
channel is small.
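
How an initial state and a first-order transition matrix such as those in Table 5.6 can be estimated from journeys is sketched below. This is an illustrative count-and-normalize estimator on toy data; the function name, the data layout and the state labels `Conv` and `NConv` follow the abbreviations of Table 5.6 but are otherwise assumptions of the example.

```python
from collections import Counter, defaultdict

def estimate_first_order(journeys):
    """Estimate the initial state and first-order transition probabilities.

    journeys: list of (channels, converted) pairs, where channels is the
    ordered list of touched channels and converted is a bool.
    """
    start_counts = Counter()
    transition_counts = defaultdict(Counter)
    for channels, converted in journeys:
        start_counts[channels[0]] += 1
        # Every journey ends in one of the two absorbing states.
        states = list(channels) + ["Conv" if converted else "NConv"]
        for current, nxt in zip(states, states[1:]):
            transition_counts[current][nxt] += 1

    n = len(journeys)
    z0 = {state: count / n for state, count in start_counts.items()}
    w = {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in transition_counts.items()
    }
    return z0, w

toy_journeys = [
    (["Aff"], False),
    (["Aff", "Aff"], False),
    (["Org"], True),
    (["Aff", "Org"], True),
]
z0_hat, w_hat = estimate_first_order(toy_journeys)
```

In the toy data three of the four journeys start with the affiliate channel, and one of the four transitions out of the affiliate state returns to affiliate, which is the channel stickiness visible on the diagonal of Table 5.6.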


The attribution of this first-order Markov chain model is calculated by the Removal
Effect as described by Anderl et al. (2014). For the second-order model, we have described
two methods to calculate attribution in Section 2.3.2. The first method is state-based
and proposed by Anderl et al. (2014), and the second is channel-based and suggested by
this thesis. The second-order state-based Removal Effect provides us with an empirical test to
determine whether the order of two touched channels is important. In the second-order
Markov model and the logistic extensions it is assumed this order is important, but in
most other models it is not. We find that on average the difference between the Removal
Effects of two states in which the order of the channels is swapped is quite large, namely
an average multiplier of 1.9. There are no extreme outliers: the maximum multiplier is 5.2
and the minimum 1.1. The absolute difference is largest for the states containing affiliate
and organic. The state in which affiliate is first has a missed conversion of 0.15%, the
state in which organic is first 0.05%. A journey is thus more likely to be successful
when affiliate is the first channel, followed by organic, than the other way round. Overall,
we can conclude from this small sidestep that the order matters.
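
The idea behind a removal-effect attribution for a first-order chain can be sketched as follows. This is a simplified illustration and not necessarily the exact procedure of Anderl et al. (2014) or of this thesis: it assumes that removing a channel means every transition into it ends in a non-conversion, it measures the removal effect as the relative drop in the overall conversion probability, and it uses a toy chain with hypothetical states A and B.

```python
def conversion_probability(z0, w, removed=None, n_iter=500):
    """Probability of eventually reaching 'Conv' in a first-order chain.

    z0: initial state probabilities; w: transition probabilities per state.
    Transitions into `removed` are treated as ending in a non-conversion.
    """
    states = [s for s in w if s != removed]
    reach = {s: 0.0 for s in states}
    # Fixed-point iteration on reach[s] = sum_t w[s][t] * reach[t],
    # with reach['Conv'] = 1 and reach = 0 for 'NConv' and the removed state.
    for _ in range(n_iter):
        reach = {
            s: sum(p * (1.0 if t == "Conv" else reach.get(t, 0.0))
                   for t, p in w[s].items())
            for s in states
        }
    return sum(p * reach.get(s, 0.0) for s, p in z0.items())

def removal_effect_attribution(z0, w):
    """Normalize the relative drop in conversion probability per channel."""
    base = conversion_probability(z0, w)
    effects = {
        s: 1.0 - conversion_probability(z0, w, removed=s) / base for s in w
    }
    total = sum(effects.values())
    return {s: e / total for s, e in effects.items()}

# Toy chain (hypothetical numbers).
toy_z0 = {"A": 0.5, "B": 0.5}
toy_w = {
    "A": {"B": 0.5, "NConv": 0.5},
    "B": {"Conv": 0.4, "NConv": 0.6},
}
base_prob = conversion_probability(toy_z0, toy_w)
attribution = removal_effect_attribution(toy_z0, toy_w)
```

In the toy chain every conversion passes through B, so removing B eliminates all conversions, whereas removing A only eliminates the conversions of journeys that start in A. After normalization B therefore receives three times the credit of A.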
Having verified this, let’s now turn to the attribution of the Markov models. Table 5.7
shows attribution percentages per channel for the different Markov models and methods,
including the linear model as a reference. The difference between linear and the first-order
Markov model is substantial (MAD= 1.32%). Most prominently the first-order Markov
model gives more credit to the affiliate channel and less to the SEA branded and non-
branded channels. In the linear model, only information from conversion paths is taken
into account. The transition probabilities in the first-order Markov model are determined
by all paths. It is therefore probable that affiliate plays a larger role in non-conversion
paths, while the SEA channel is more prominently present in the conversion paths. The
first-order Markov model and the second-order Markov model according to the first method
barely differ (MAD = 0.16%). However, the difference between the two methods for the
second-order Markov chain is reasonably large (MAD = 0.96%).
Concluding this section, we have seen that it matters which attribution method is used.
In general, the distribution of attribution over the channels differs significantly across models
and methods. Even though percentage differences between models may appear small,
the differences cannot be neglected in absolute numbers. Furthermore, it is important
to note that our dataset has relatively short, channel-sticky customer journeys, which
is a disadvantage for demonstrating the relevance of the choice of attribution
method. This choice becomes increasingly important for datasets that contain longer
journeys over multiple channels. Now that we have shown that differences exist, an
interesting follow-up question is which method performs best empirically and in a

Channel MAR1 MAR2-M1 MAR2-M2 LIN


Affiliate 21.1% 20.9% 16.8% 16.4%
Bannerad 0.1% 0.1% 0.1% 0.0%
Email 4.5% 4.6% 4.4% 3.6%
Organic 26.0% 26.1% 26.3% 25.6%
Organicsearch 17.2% 17.4% 18.4% 18.2%
Other 0.2% 0.2% 0.2% 0.2%
Referrer 4.2% 4.1% 4.6% 5.2%
SEAbranded 17.7% 17.4% 18.9% 19.4%
SEAnonbranded 9.0% 9.2% 10.3% 11.3%

Table 5.7: Attribution percentages for the Markov first-order model, the second-order model (method 1
by Anderl et al. (2014) and method 2 proposed by this thesis) and linear model.

simulation study. This will be our focus in the next sections of this Chapter.

5.2 Classification accuracy


In this section we will focus on the predictive performance of the different models by
testing their classification accuracy. Unfortunately, as described in Chapter 3 the linear
attribution method and the simple probabilistic model are unfit for predictive purposes.
For the remaining models, conversion probability predictions are produced for the
customer journeys in the test set. These probabilities are then compared, for all thresholds,
with the realization of a conversion or a non-conversion by calculating the sensitivity
and specificity. For each model, the results are plotted on a ROC curve. Finally, the
Area Under the ROC Curve (AUC) is determined and used as a performance measure.
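
The evaluation procedure described above can be sketched in a few lines. The function below sweeps over all thresholds, records the pair (1 − specificity, sensitivity) at each of them and integrates the resulting ROC curve with the trapezoidal rule. The function name and the toy scores are assumptions of the example; they are not predictions of any model in this thesis.

```python
def roc_auc(scores, labels):
    """Return the ROC points and the area under the ROC curve.

    scores: predicted conversion probabilities.
    labels: 1 for a conversion, 0 for a non-conversion.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]  # (1 - specificity, sensitivity)
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        j = i
        # Observations with the same score cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_pos += ranked[j][1]
            false_pos += 1 - ranked[j][1]
            j += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
        i = j

    # Trapezoidal rule over consecutive ROC points.
    auc = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return points, auc

toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
roc_points, toy_auc = roc_auc(toy_scores, toy_labels)
```

The AUC equals the share of (conversion, non-conversion) pairs in which the conversion receives the higher score. In the toy data this holds for eight of the nine pairs.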
Figure 5.1 shows the ROC curves for the second extension of the logistic regression
model and the first touch attribution method. These methods turn out to be the best and
worst classifier respectively. One can see that both methods visibly perform much
better than the line of no discrimination, which represents random classification. This is
especially surprising for the first touch method, since it implies that the first touchpoint
already provides much information about whether a prospect is going to convert or not.
However, another remarkable observation from Figure 5.1 is that the difference between
the worst and best performing model seems rather small. Although the red curve is
closer to the upper left corner than the green curve for all values of the specificity, the
difference does not appear to be very large. This should not be too surprising though,
since attribution percentages were also very similar across the different methods.
Despite the proximity of the ROC curves of the models, no two curves cross.
The assumption of equal weight over the values of the specificity, that

Figure 5.1: The ROC Curve for the second extension of the logistic regression model (best classification)
and the first touch attribution model (worst classification).

is presumed by the AUC measure, is therefore unproblematic. Having established this,
we can safely base our classification performance evaluation on the AUC measure. The
AUC for each model is displayed in Table 5.8. Observe that all AUCs are between 0.70
and 0.80. As a rough guide to evaluate classification accuracy, the traditional academic
grading system can be used, under which an AUC between 0.70 and 0.80 counts as a fair
performance. More specifically, we can roughly distinguish the methods
that have an AUC of 0.74 to 0.75 from those that have an AUC of 0.77 to 0.78.
The first group contains the rule-based methods and the first-order Markov model. It
is unsurprising that the single touch rule-based methods perform relatively poorly, since
they do not take into account at least half of the available information of multi-touch
journeys. Interestingly, the last touch heuristic classifies more accurately than the first touch
heuristic, implying that the last touch in a customer journey is more informative about
a potential conversion than the first touch. This is in line with the results
of Anderl et al. (2014). We also foresaw the poor performance of the first-order Markov
model, since as described in Section 2.3.2 this model is expected to perform worse than
the last touch heuristic, which it indeed does. The second group of more accurate classifiers
contains all logistic regression models and the higher-order Markov models. The basic
logistic model is the least accurate classifier within this group. The difference between the

Attribution method    AUC
FT                    0.7401
LT                    0.7470
LOG                   0.7712
LOGX1                 0.7761
LOGX2                 0.7779
MAR1                  0.7416
MAR2                  0.7762
MAR3                  0.7760

Table 5.8: The Area Under the ROC Curve for the different attribution methods.

logistic extensions and higher-order Markov models is very small. However, the second
logistic extension has the largest AUC. Remarkably, the AUC of the second-order
Markov model is higher than the AUC of the third-order Markov model, suggesting
that the increased estimation error of a third-order Markov model due to its large
number of parameters outweighs the benefit of taking an additional period into account.
This is known as overfitting. However, a fair question is which of the differences in AUC
shown in Table 5.8 are significant.
By means of bootstrapping, the standard deviation of the difference between the AUCs
of two models is derived. This standard deviation can then be used to test whether any
differences are significant. The results of this analysis are shown in Table 5.9. We can
conclude from the table that most differences are significant. All other methods
classify significantly more accurately than the first touch heuristic and Markov 1. The
difference between first touch and Markov 1 itself is insignificant. Last touch performs
significantly better than first touch and Markov 1, but significantly worse than all more
advanced models (logistic and higher-order Markov). Among the more advanced models, the
basic formulation of logistic regression classifies significantly worse than the others.
Differences in classification accuracy between the logistic extensions and the higher-order
Markov models are mostly insignificant. However, the second logistic extension performs
significantly better than the first logistic extension at the 5% level. Moreover, the P-value
of the difference between the second extension and each of the higher-order Markov models
is 0.12. One might thus argue that the second extension is the preferred model concerning
predictive performance, although its advantage over the higher-order Markov models is not
significant at conventional levels.
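The bootstrap step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: the function names `auc` and `bootstrap_sd_auc_diff`, the paired resampling of observations, the number of replications and the toy data are all hypothetical choices.

```python
import random
import statistics


def auc(labels, scores):
    """AUC via the rank-sum formula; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_sd_auc_diff(labels, scores_a, scores_b, n_boot=200, seed=0):
    """Standard deviation of AUC(a) - AUC(b) over paired bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample journeys with replacement
        y = [labels[i] for i in idx]
        if sum(y) in (0, n):
            continue  # AUC is undefined when a resample contains a single class
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    return statistics.stdev(diffs)


# Hypothetical toy data: one informative score and one pure-noise score.
_rng = random.Random(1)
labels = [1 if _rng.random() < 0.3 else 0 for _ in range(200)]
informative = [0.5 * y + _rng.random() for y in labels]
noise = [_rng.random() for _ in labels]

difference = auc(labels, informative) - auc(labels, noise)
sd = bootstrap_sd_auc_diff(labels, informative, noise)
print(f"AUC difference {difference:.4f}, bootstrap sd {sd:.4f}")
```

Resampling the same journeys for both models keeps the two AUC estimates paired, so the resulting standard deviation reflects the variability of the difference rather than of each AUC separately.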
However, as written before, these results are based on the assumption that best pre-
diction implies best attribution. This assumption is highly intuitive but not necessarily
entirely true. To further explore which model attributes best, the next section discusses
the results from a simulation study in which the true attribution is known.

Method   LT          LOG         LOGX1       LOGX2       MAR1        MAR2        MAR3
FT       -0.0069**   -0.0312**   -0.0361**   -0.0378**   -0.0015     -0.0361**   -0.0359**
         (0.00243)   (0.00185)   (0.00223)   (0.00240)   (0.00203)   (0.00227)   (0.00219)
LT                   -0.0242**   -0.0291**   -0.0309**   0.0054**    -0.0292**   -0.0290**
                     (0.00221)   (0.00132)   (0.00158)   (0.00156)   (0.00210)   (0.00221)
LOG                              -0.0049**   -0.0066**   0.0297**    -0.0050**   -0.0048**
                                 (0.00143)   (0.00145)   (0.00204)   (0.00150)   (0.00146)
LOGX1                                        -0.0018*    0.0346**    -0.0001     0.0001
                                             (0.00069)   (0.00187)   (0.00156)   (0.00163)
LOGX2                                                    0.0363**    0.0017      0.0019
                                                         (0.00204)   (0.00142)   (0.00162)
MAR1                                                                 -0.0346**   -0.0344**
                                                                     (0.00199)   (0.00215)
MAR2                                                                             0.0002
                                                                                 (0.00118)
*  P-value < 0.05
** P-value < 0.01

Table 5.9: The difference between the AUC values of different models (row minus column), with its
bootstrapped standard deviation in parentheses and its significance level indicated by asterisks.
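How a difference and its bootstrapped standard deviation translate into a P-value is not spelled out here. The sketch below assumes a normal approximation and a one-sided test, which is only one plausible reading; the helper name `one_sided_p` is hypothetical. Under that assumption the entries for LOGX2 against MAR2 and MAR3 in Table 5.9 both give a P-value of roughly 0.12.

```python
import math


def one_sided_p(difference, standard_deviation):
    """One-sided P-value of a z-statistic under a normal approximation (an assumption)."""
    z = abs(difference) / standard_deviation
    return 0.5 * math.erfc(z / math.sqrt(2))


# Differences and standard deviations copied from Table 5.9.
p_logx2_mar2 = one_sided_p(0.0017, 0.00142)
p_logx2_mar3 = one_sided_p(0.0019, 0.00162)
p_ft_mar1 = one_sided_p(-0.0015, 0.00203)

print(f"LOGX2 vs MAR2: {p_logx2_mar2:.3f}")
print(f"LOGX2 vs MAR3: {p_logx2_mar3:.3f}")
print(f"FT vs MAR1:    {p_ft_mar1:.3f}")
```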

5.3 Simulation
We start the simulation results by discussing the first-order and second-order Markov
chain simulations. Then, we focus on the additional simulation study. Since Section 5.2
shows that the difference between the second-order and third-order Markov models is
insignificant, only second-order Markov model results are reported in this section.

5.3.1 Markov chain simulations


Data is generated both according to the first-order and second-order Markov chain initial
state and transition matrix. The estimated parameters of the initial state and transi-
tion matrix in the data are considered the ‘true’ values for these simulations. In total,
the journey of 734.6k prospects is simulated 10 times. All attribution methods are es-
timated on this generated data to determine their attribution over the channels. This
attribution is compared to the true attribution according to the standard Removal Effect
calculations based on the true initial state and transition matrix. For the second-order
Markov chain model we chose our proposed method (method 2) to calculate the Removal
Effect. Differences with method 1 proposed by Anderl et al. (2014) are insignificantly
small. The Mean Absolute Error (MAE) with this true attribution is calculated for each
model. The MAE for a single, representative iteration is displayed in Table 5.10 for the

DGP        FT      LT      LIN     LOG     LOGX1   LOGX2   MAR1    MAR2-M1   MAR2-M2   PROB
Markov 1   1.57%   1.70%   0.67%   0.54%   1.21%   1.21%   0.16%   1.34%     0.19%     7.76%
Markov 2   1.54%   0.84%   0.70%   0.63%   1.06%   1.03%   1.04%   0.60%     0.16%     6.16%

Table 5.10: Mean Absolute Error of each attribution method with respect to the ‘true’ attribution of the
respective DGP.

first-order and second-order Markov simulations. The MAE is highly consistent over the
10 iterations, implying that the results and conclusions are not driven by sampling
variance.
Table 5.10 shows that the first-order Markov model attributes very closely (MAE = 0.16%)
to the truth when the data is generated by a Markov 1 process. This is unsurprising, since
the large number (734.6k) of generated customer journeys enables the first-order Markov
model to accurately estimate the transition matrix and initial state. However, the
second method of the Markov 2 model also attributes very closely to the truth, with an MAE
of 0.19%. When the data is generated by a Markov 2 process, the second method of the
Markov 2 model performs best (MAE = 0.16%), implying that the data contains enough
observations to accurately estimate a second-order model. The MAE of the first-order
Markov model is considerably larger at 1.04%. This indicates that the true data can be
better captured by a second-order Markov model than by a first-order model. Although most
journeys in the true data have a single touchpoint, this suggests that a memory of more
than a single period is desirable. Consistent with our out-of-sample classification study, we
find that a second-order Markov model attributes better than a first-order model.
A final observation from Table 5.10 is that the logistic (LOG) and linear (LIN) models
both attribute quite closely to the truth as expressed by the Removal Effect, with
MAEs of at most 0.70%.
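The simulation loop for the first-order case can be sketched on a toy chain. Everything in this block is an illustrative assumption: the two channels, the transition probabilities and the sample size are hypothetical, and the Removal Effect is implemented in its commonly used form (redirecting a removed channel to the null state and measuring the relative drop in conversion probability), not in the second-order variant proposed in this thesis.

```python
import random

CHANNELS = ["A", "B"]

# Hypothetical 'true' first-order chain; every row sums to one.
TRUE_P = {
    "start": {"A": 0.6, "B": 0.4},
    "A": {"A": 0.1, "B": 0.3, "conv": 0.1, "null": 0.5},
    "B": {"A": 0.2, "B": 0.1, "conv": 0.05, "null": 0.65},
}


def simulate(P, n, rng):
    """Draw n journeys that start in 'start' and end in 'conv' or 'null'."""
    journeys = []
    for _ in range(n):
        state, path = "start", ["start"]
        while state not in ("conv", "null"):
            targets = list(P[state])
            weights = [P[state][t] for t in targets]
            state = rng.choices(targets, weights=weights)[0]
            path.append(state)
        journeys.append(path)
    return journeys


def estimate(journeys):
    """Estimate transition probabilities from observed transition counts."""
    counts = {}
    for path in journeys:
        for src, dst in zip(path, path[1:]):
            row = counts.setdefault(src, {})
            row[dst] = row.get(dst, 0) + 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}


def conversion_probability(P, removed=None, sweeps=200):
    """Probability of reaching 'conv' from 'start', by fixed-point iteration."""
    value = {s: 0.0 for s in P}
    for _ in range(sweeps):
        for s in P:
            if s == removed:
                continue  # a removed channel behaves like the null state
            value[s] = sum(p * (1.0 if t == "conv" else value.get(t, 0.0))
                           for t, p in P[s].items())
    return value["start"]


def attribution(P):
    """Normalised Removal Effects per channel."""
    base = conversion_probability(P)
    effects = {c: 1 - conversion_probability(P, removed=c) / base for c in CHANNELS}
    total = sum(effects.values())
    return {c: e / total for c, e in effects.items()}


def mae(estimated, truth):
    return sum(abs(estimated[c] - truth[c]) for c in truth) / len(truth)


rng = random.Random(42)
truth = attribution(TRUE_P)
estimated = attribution(estimate(simulate(TRUE_P, 20000, rng)))
error = mae(estimated, truth)
print(truth, estimated, error)
```

With a large number of simulated journeys the estimated transition matrix approaches the true one, which is why a correctly specified Markov model is expected to attribute close to the Removal Effect benchmark.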

5.3.2 Additional simulation study


More interestingly, we will now consider the performance of the models in an additional
simulation study. Since the true attribution in this case is expected to be less dependent
on the attribution techniques (which is obviously untrue for the Markov simulations, in
which the Removal Effect is utilized to determine this true attribution), this simulation
study probably gives a more objective criterion as to which model attributes best. Data is
generated according to the parameters of different scenarios as explained in Section 3.3.
For each scenario, 10,000 customer journeys are generated, models are estimated and
the attribution and Mean Absolute Error (MAE) determined. The same procedure is
performed for 100,000 generated customer journeys. However, simulating more customer
journeys does not alter any conclusions, implying that we can disregard the issue of
sampling variance in our results.
In this additional simulation study, the second extension of the logistic regression
model is not estimated since its only added value in comparison to the first extension
is that it considers two periods back. Because all customer journeys have at most
two touches in this simulation study, this added value, and thus the second extension
of the logistic regression model, is irrelevant. However, a different logistic regression
model called LOG FT is considered. This model dummifies the first touch instead of
the last touch. Since in the specification of the simulation model the first touch has a
specific contribution to conversion irrespective of later touches, this model is expected to
attribute better than the logistic model based on last touch dummification. To emphasize
the contrast, the logistic model based on last touch dummification (normally the first
extension) will be called LOG LT in this section.
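The contrast between last touch and first touch dummification can be made concrete with a small feature-encoding sketch. The encoding below is an assumption for illustration only: it creates a dummy for the anchored touch and counts the remaining touches per channel, with feature names loosely modelled on the `lt` and `nlt` suffixes in the appendix tables. The function `encode` and the channel labels are hypothetical, and the exact covariate definitions of the thesis may differ.

```python
from collections import Counter


def encode(journey, anchor="last"):
    """Hypothetical encoding: a dummy for the anchored touch plus counts of the other touches."""
    if anchor == "last":
        anchored, rest, tag = journey[-1], journey[:-1], "lt"
    elif anchor == "first":
        anchored, rest, tag = journey[0], journey[1:], "ft"
    else:
        raise ValueError("anchor must be 'last' or 'first'")
    features = {f"{anchored}_{tag}": 1}
    for channel, count in Counter(rest).items():
        features[f"{channel}_n{tag}"] = count
    return features


journey = ["ch1", "ch2"]  # channel 1 followed by channel 2
log_lt_features = encode(journey, anchor="last")
log_ft_features = encode(journey, anchor="first")
print(log_lt_features)
print(log_ft_features)
```

For a two-touch journey the two encodings carry the same information but attach the timing-specific parameter to different touches, which is the distinction between LOG LT and LOG FT discussed above.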
The second column of Table 5.11 shows the true attribution of the first channel for each
of the scenarios. The remaining columns show the deviation from this true attribution
in % for a given attribution method. The closer this deviation is to zero, the better
the model performs. The average MAE over the scenarios is also shown as a single
measure to evaluate the attribution models. Note that this is a highly
arbitrary measure since it is fully dependent on the definition of the scenarios.
If we turn our attention to the base scenario, we see that all rule-based heuristics (FT,
LT and LIN) attribute identically. This is not a coincidence. For the conversion paths that
contain a single channel, attribution is by definition equal for these three methods. There
remain two paths that potentially create differences between the methods, namely channel
1 followed by channel 2 (denoted as P12) and channel 2 followed by channel 1 (P21).
Whenever the number of conversion paths of P12 is equal to the number of conversion
paths of P21, it can easily be derived that all rule-based heuristics attribute identically.
Since in the base case P12 and P21 each occur with probability 0.5 × 0.5 × 0.5 = 0.125
and both have a conversion probability of 8%, the expected number of conversion paths P12
and P21 is indeed the same. This explains the equal attribution of first touch, last touch and
linear for the base case and four other scenarios (Short, Long, Mixed paths and Ch1 conv).
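The argument can be checked with a short calculation. The conversion counts below are hypothetical and chosen only so that P12 and P21 convert equally often in the first case and unequally in the second; `channel1_share` is an illustrative helper, not part of the thesis.

```python
def channel1_share(conversions, rule):
    """Share of converted journeys credited to channel 1 under a rule-based heuristic."""
    credit = {1: 0.0, 2: 0.0}
    for path, count in conversions.items():
        if rule == "FT":
            credit[path[0]] += count          # all credit to the first touch
        elif rule == "LT":
            credit[path[-1]] += count         # all credit to the last touch
        elif rule == "LIN":
            for channel in path:              # credit split evenly over touches
                credit[channel] += count / len(path)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return credit[1] / (credit[1] + credit[2])


# Hypothetical numbers of converting journeys per path.
balanced = {(1,): 300, (2,): 100, (1, 1): 40, (2, 2): 10, (1, 2): 50, (2, 1): 50}
unbalanced = {(1,): 300, (2,): 100, (1, 1): 40, (2, 2): 10, (1, 2): 80, (2, 1): 20}

balanced_shares = {rule: channel1_share(balanced, rule) for rule in ("FT", "LT", "LIN")}
unbalanced_shares = {rule: channel1_share(unbalanced, rule) for rule in ("FT", "LT", "LIN")}
print(balanced_shares)
print(unbalanced_shares)
```

Only the mixed paths P12 and P21 are credited differently by the three rules, so equal counts for those two paths make the differences cancel exactly.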
The Markov 1 and Markov 2 (method 1) attributions in the base case are slightly worse
than those of the rule-based heuristics. However, our proposed method of the Markov 2 model
attributes extremely well in the base case, which turns out to be a lucky shot after
considering other scenarios. The logistic regression models attribute quite well,
though the standard form attributes better than the extensions LOG LT and LOG FT.
In this scenario, the contribution to conversion per channel does not differ between the
first and second touchpoint, so the extensions do not have an added value over the

Scenario            True att   FT      LT      LIN     MAR1    MAR2-M1   MAR2-M2   LOG    LOG LT   LOG FT   PROB
Base                63%        -4.2    -4.2    -4.2    -6.9    -6.3      -0.4      -1.6   -1.8     -1.8     0.0
Short               63%        -2.1    -2.1    -2.1    -3.8    -4.2      12.5      -0.5   -0.7     -0.7     0.0
Long                63%        -5.6    -5.6    -5.6    -8.6    -7.1      -6.1      -2.5   -2.8     -2.8     0.0
Ch1 FT              80%        5.3     -12.9   -3.8    -10.4   -5.1      -7.3      -1.6   0.3      0.3      -17.0
Mixed paths         63%        -6.7    -6.7    -6.7    -9.3    -9.1      -3.9      -2.0   -2.4     -2.4     0.0
Ch1 conv            91%        -13.7   -13.7   -13.7   -22.7   -20.5     -18.9     -5.0   -5.2     -5.2     0.0
Ch1 t2 conv         69%        -12.0   -3.5    -7.7    -8.6    -11.9     -5.9      -3.5   -5.5     -2.4     -9.5
Ch1 t2 mixed conv   66%        -13.2   -3.8    -8.5    -7.9    -12.9     -5.7      -5.7   -7.7     -4.2     -6.5
Ch1 t2 no conv      53%        7.9     -5.2    1.4     -4.3    2.3       7.9       1.8    3.4      0.6      15.7
Avg MAE             0.0        7.8     6.4     6.0     9.2     8.8       7.6       2.7    3.3      2.3      5.4

Table 5.11: True attribution and its deviation per attribution method (in %) for different scenarios.

standard logistic model. Unnecessarily using two different parameters for a single effect
(i.e. the contribution to conversion of a channel regardless of the timing) even makes
the attribution of the extensions less accurate than that of the standard
model. Whether the dummification is based on the first or the last touch does not
matter for the attribution, since the contribution to conversion of a channel remains
stable over the touchpoints. A final interesting observation from this base case is the
perfect performance of the probabilistic model. In general, if channels are equally likely
to be touched and if the contribution to the conversion probability for both channels is
similar regardless of the touchpoint, which is the case for the base scenario, one can show
that the probabilistic model attributes perfectly. However, although these conditions hold
in multiple scenarios of this simulation study, they are not realistic for real datasets.
Considering the other scenarios and the average MAE over these scenarios as shown
in Table 5.11, one can see that the logistic regression models perform consistently better
than all other models. The best attribution model is the LOG FT model, since the
form of this model is built up similarly to the specification of the DGP. The LOG FT
model distinguishes itself in the last three scenarios, which are the only scenarios that
give a channel a different contribution to conversion depending on whether it is the first
or second touchpoint. The normal logistic model performs better than the LOG LT
extension, since the latter is specified substantially differently from the DGP. Similar to
the Markov chain simulations, we thus see how the DGP of the simulation study inevitably
affects the performance of the models. Furthermore, one can see that the second method
of the second-order Markov chain model performs best among the class of Markov chain
models, providing evidence for using this method in further research. Moreover, the
second-order Markov models perform better than the first-order Markov model. On
average though, surprisingly, the rule-based heuristics attribute more accurately than
the Markov chain models. The best performing rule-based heuristic is linear, followed
by last touch and first touch. Finally, the probabilistic model attributes quite well on
average, which can be explained by the large number of perfect attributions. However,
for the generally more realistic scenarios in which the two mentioned requirements for
perfect attribution of this model are not fulfilled, the probabilistic model performs very
poorly. In conclusion, one can state that the logistic regression models, specifically
the LOG FT extension, are the evident winners of this simulation study.
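The 'Avg MAE' row of Table 5.11 can be recomputed from the per-scenario deviations as the mean of their absolute values. The sketch below copies the deviations from the table; the variable names are illustrative. Because the table shows rounded deviations, the recomputed averages agree with the reported row only to within about 0.1.

```python
METHODS = ["FT", "LT", "LIN", "MAR1", "MAR2-M1", "MAR2-M2", "LOG", "LOG LT", "LOG FT", "PROB"]

# Deviations from the true attribution per scenario, copied from Table 5.11.
DEVIATIONS = {
    "Base":              [-4.2, -4.2, -4.2, -6.9, -6.3, -0.4, -1.6, -1.8, -1.8, 0.0],
    "Short":             [-2.1, -2.1, -2.1, -3.8, -4.2, 12.5, -0.5, -0.7, -0.7, 0.0],
    "Long":              [-5.6, -5.6, -5.6, -8.6, -7.1, -6.1, -2.5, -2.8, -2.8, 0.0],
    "Ch1 FT":            [5.3, -12.9, -3.8, -10.4, -5.1, -7.3, -1.6, 0.3, 0.3, -17.0],
    "Mixed paths":       [-6.7, -6.7, -6.7, -9.3, -9.1, -3.9, -2.0, -2.4, -2.4, 0.0],
    "Ch1 conv":          [-13.7, -13.7, -13.7, -22.7, -20.5, -18.9, -5.0, -5.2, -5.2, 0.0],
    "Ch1 t2 conv":       [-12.0, -3.5, -7.7, -8.6, -11.9, -5.9, -3.5, -5.5, -2.4, -9.5],
    "Ch1 t2 mixed conv": [-13.2, -3.8, -8.5, -7.9, -12.9, -5.7, -5.7, -7.7, -4.2, -6.5],
    "Ch1 t2 no conv":    [7.9, -5.2, 1.4, -4.3, 2.3, 7.9, 1.8, 3.4, 0.6, 15.7],
}

# 'Avg MAE' row as reported in Table 5.11.
REPORTED = [7.8, 6.4, 6.0, 9.2, 8.8, 7.6, 2.7, 3.3, 2.3, 5.4]

avg_mae = {
    method: sum(abs(row[j]) for row in DEVIATIONS.values()) / len(DEVIATIONS)
    for j, method in enumerate(METHODS)
}
best_method = min(avg_mae, key=avg_mae.get)

for method, reported in zip(METHODS, REPORTED):
    print(f"{method:8s} recomputed {avg_mae[method]:.2f}  reported {reported:.1f}")
print("lowest average MAE:", best_method)
```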
An interesting question for discussion is to what extent the DGP of this additional
simulation study and its assumptions are realistic. The study aimed to reproduce customer
journeys in an intuitive way, without intentionally prioritizing any attribution
method over another. However, by postulating that the first touchpoint has a certain
contribution to conversion regardless of possible later touchpoints, which is intuitively a
plausible assumption, it inadvertently prioritizes LOG FT over LOG LT. Although not
verified in any way, this may indicate that LOG FT is better able to represent reality and
is thus a better model. Conversely, the assumption of a fixed first touchpoint contribution
regardless of later activity may also be flawed. As a good starting point to find out which
is the case, it is recommended for further research to test the performance of LOG FT in
comparison with LOG LT in our empirical study. If LOG FT performs better, this supports
the view that the assumption, and thus the DGP of this simulation study, is realistic and
that LOG FT is justly the best logistic extension. If LOG LT turns out to be a better
classifier, the assumption of this study is questionable, and a new simulation framework
with a more realistic DGP should be built.
Chapter 6

Conclusion

This thesis has investigated the question of which attribution method is best along three
dimensions: theoretically, empirically and in a simulation study. First, seven theoretical
criteria that are desirable for accurate attribution have been formulated. In the light
of those criteria, the different attribution methods and models are evaluated. Second,
the attribution methods are tested empirically. Under the assumption that accurate
prediction implies accurate attribution, the models’ out-of-sample classification accuracy
is measured by the Area Under the ROC Curve. This measure is taken as the empirical
criterion to judge the models. Third, new data is simulated according to a representative
variety of Data Generating Processes in which the true attribution is known. The Mean
Absolute Error of the different attribution models serves as the performance measure
in this simulation study. The combination of this theoretical, empirical and simulation
analysis enables this thesis to answer its main question about the best attribution method
in a well-substantiated manner. Such a standardized and extensive methodology for this
purpose has, as far as this author knows, not yet been developed in the existing literature.
From the theoretical analysis we have seen that the rule-based methods are fundamentally
flawed because they are not data-driven. Neither do we advocate the simple
probabilistic model, since its attribution of conversion probabilities rather than absolute
conversions potentially yields very counter-intuitive attribution percentages. Further, we
have seen that the first-order Markov model is very similar to the last-touch method and
thus inadequate as well. Higher-order Markov models have more potential, although we
argue for a new definition of the Removal Effect, and the aggregate nature of these models
can be inconvenient. The logistic regression model competently accounts for channel
heterogeneity and individual attribution, but unfortunately does not capture any timing
differences. However, this thesis develops two extensions of the logistic model that
incorporate this timing effect by dummifying last touches. These logistic model extensions
fulfil most of the postulated criteria and have much potential from a theoretical perspective.

Applying the different attribution methods to the data results in significant differences
in how they attribute conversion credit over the channels. Moreover, out-of-sample
classification accuracy also differs significantly between the models. The second logistic
extension, which incorporates last touch and one-but-last touch dummies, performs best,
although its P-value against the higher-order Markov models is a modest 0.12. The rule-based
methods and the first-order Markov model classify significantly less accurately. A Markov
simulation provides evidence that the data can be better captured by a second-order Markov
model than by a memoryless first-order model. Finally, a simple additional simulation study
shows that the logistic regression models on average attribute best across a wide range
of scenarios. More specifically, the DGP specification causes an extension of the logistic
model that dummifies first touches to be the best performer. Remarkably, the simulation
study further shows that the Markov models attribute rather poorly, most of the time
even worse than the linear method. However, as a general remark it should be noted
that for both simulation studies the chosen DGP inevitably affects the performance of
the attribution models to some extent.
In conclusion, the class of logistic regression models consistently comes first in
our theoretical, empirical and simulation studies. Extending the basic logistic model to
integrate timing effects results in an even better performing model. However, further
research should show whether this extension should be based on dummifying the last or
the first touches, since this thesis does not provide unequivocal evidence on this issue. The
model that probably performs closest to the logistic regression model is the linear
attribution method. The Markov and probabilistic models, although presented in a triumphant
fashion by Anderl et al. (2014) and Dalessandro et al. (2012) respectively, perform much
weaker than expected. In line with our expectations, the single-touch rule-based methods
do not emerge as accurate attribution techniques either.
Although the logistic regression models are the best attribution models according to our
postulated criteria, their validity when applied to the attribution problem is doubtful from an
econometric perspective. The functional form of the logistic regression model presumes a
causal relation between the channel touches and the probability of converting. Indisputably,
this relation exists, since touching channels certainly has an effect on the conversion
likelihood. However, it is probable that the relation exists in the other direction as well.
A prospect who is looking for a product, and therefore has a higher probability of converting,
inevitably starts touching channels. This simultaneous relation makes the covariates and
error term correlated, creating an endogeneity bias that makes the estimated parameters,
and thus the attribution over the channels, biased. For this reason, this thesis considers
the logistic regression model a decent solution for the time being rather than the ultimate
best attribution technique.
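The simultaneity argument can be illustrated with a small simulation. This is a hypothetical sketch and not a model from this thesis: a latent purchase intent (an assumed, unobserved variable) raises both the probability of touching a channel and the probability of converting, and a naive difference in conversion rates stands in for the estimated channel effect instead of a fitted logistic regression. All probabilities are illustrative.

```python
import random


def naive_channel_effect(n, true_effect, seed=0):
    """Difference in conversion rates between touched and untouched prospects."""
    rng = random.Random(seed)
    conversions = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for _ in range(n):
        intent = rng.random()                          # latent purchase intent in [0, 1]
        touched = rng.random() < 0.2 + 0.6 * intent    # intent makes a touch more likely
        p_convert = 0.02 + 0.10 * intent + (true_effect if touched else 0.0)
        converted = rng.random() < p_convert
        totals[touched] += 1
        conversions[touched] += converted
    return conversions[True] / totals[True] - conversions[False] / totals[False]


TRUE_EFFECT = 0.03  # hypothetical causal lift of a channel touch
naive = naive_channel_effect(200_000, TRUE_EFFECT)
print(f"true effect {TRUE_EFFECT:.3f}, naive estimate {naive:.3f}")
```

Because high-intent prospects are over-represented among those who touch the channel, the naive estimate exceeds the causal effect built into the simulation, which is the kind of bias the paragraph above describes.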

More important than its determination of the ‘temporarily best attribution technique’,
this thesis has proposed a framework to evaluate attribution models in a standardized,
all-encompassing way. This methodology can be used to test newly proposed or already
developed more advanced attribution models. This is an interesting direction for further
research, since the perfect attribution model still does not exist. Another direction is
further developing the simple simulation study proposed in this thesis, which is still very
basic and preliminary and can be extended relatively easily by facilitating more channels,
touchpoints and scenarios. This should give insight into which model functions best
under which circumstances found in the data. Moreover, the simulation study’s
assumption that the first channel contribution is irrespective of later touches should be
verified. If all this research is taken up, this author is confident that we will gradually
move closer to a solution to the attribution problem.
Such a result would not only be a theoretical breakthrough, but also of great practical
interest for all firms that engage in digital marketing activities. Finding the ‘holy
grail’ of the attribution problem would allow marketeers to accurately determine
the true contribution and performance of all digital channels and advertisements. This
way, the marketing budget could be optimally allocated over the channels, enhancing
the number of views, clicks, converters and thus the profitability of the firm, something
every firm is on the lookout for.
Chapter 7

Appendix

7.1 Regression output Logistic extensions


Covariate Coefficient T-statistic P-value


affiliate lt -4.9428 -193.0 0.000**
bannerad lt -7.0671 -12.2 0.000**
email lt -3.8765 -76.1 0.000**
organic lt -3.1009 -161.4 0.000**
organicsearch lt -3.2366 -135.5 0.000**
overig lt -4.2648 -17.3 0.000**
referrer lt -2.5507 -55.6 0.000**
seabranded lt -2.9127 -124.2 0.000**
seanonbranded lt -3.3794 -104.7 0.000**
affiliate nlt -0.1294 -14.7 0.000**
bannerad nlt -0.2353 -0.9 0.360
email nlt 0.0750 3.2 0.001**
organic nlt 0.1034 12.1 0.000**
organicsearch nlt 0.2110 14.0 0.000**
overig nlt -0.0070 -0.0 0.964
referrer nlt 0.1663 5.3 0.000**
seabranded nlt 0.2188 13.2 0.000**
seanonbranded nlt 0.4454 20.3 0.000**
* P-value < 0.05
**P-value < 0.01

Table 7.1: The regression output of the logistic regression model including dummies for the last touch
channel.

Covariate Coefficient T-statistic P-value


affiliate lt -4.9931 -180.3 0.000**
bannerad lt -7.3631 -12.7 0.000**
email lt -4.0382 -76.6 0.000**
organic lt -3.3433 -148.6 0.000**
organicsearch lt -3.4196 -132.3 0.000**
overig lt -4.4349 -17.9 0.000**
referrer lt -2.8050 -57.1 0.000**
seabranded lt -3.0917 -121.9 0.000**
seanonbranded lt -3.5095 -103.8 0.000**
affiliate lt2 0.0008 0.0 0.985
bannerad lt2 -0.3558 -0.7 0.487
email lt2 0.4981 7.4 0.000**
organic lt2 0.7641 23.9 0.000**
organicsearch lt2 0.7064 19.0 0.000**
overig lt2 0.7148 2.4 0.016*
referrer lt2 1.1504 15.7 0.000**
seabranded lt2 0.6910 18.3 0.000**
seanonbranded lt2 0.8784 18.3 0.000**
affiliate nlt2 -0.1470 -13.1 0.000**
bannerad nlt2 0.0597 0.2 0.842
email nlt2 0.0103 0.3 0.754
organic nlt2 0.0493 5.8 0.000**
organicsearch nlt2 0.1032 5.1 0.000**
overig nlt2 -0.0233 -0.1 0.903
referrer nlt2 0.0248 1.1 0.261
seabranded nlt2 0.0856 3.8 0.000**
seanonbranded nlt2 0.3045 10.6 0.000**
* P-value < 0.05
**P-value < 0.01

Table 7.2: The regression output of the logistic regression model including dummies for the last touch
and one but last touch channel.

Covariate Coefficient T-statistic P-value


intercept -3.9652 -171.0 0.000**
affiliate=1 -0.7725 -24.2 0.000**
bannerad=1 -1.9670 -7.6 0.000**
email=1 -0.1847 -3.7 0.000**
organic=1 0.6304 25.1 0.000**
organicsearch=1 0.5803 22.0 0.000**
other=1 -0.2795 -1.5 0.126
referrer=1 1.1628 25.5 0.000**
SEAbranded=1 0.8317 31.9 0.000**
SEAnonbranded=1 0.6808 21.7 0.000**
affiliate>1 -0.5687 -13.2 0.000**
bannerad>1 -0.3815 -0.4 0.711
email>1 0.4851 5.9 0.000**
organic>1 0.4637 13.8 0.000**
organicsearch>1 0.1544 3.6 0.000**
other>1 0.1422 0.3 0.750
referrer>1 0.5410 6.0 0.000**
SEAbranded>1 0.1223 2.8 0.000**
SEAnonbranded>1 0.3862 6.5 0.000**
* P-value < 0.05
**P-value < 0.01

Table 7.3: The regression output of the non-linear effect logistic regression model including dummies for
$NC_{i,k} = 1$ and $NC_{i,k} > 1$.
Bibliography

Anderl, E., Becker, I., Wangenheim, F. V., and Schumann, J. H. (2014). Mapping the
customer journey: A graph-based framework for online attribution modeling.

Ansari, A. and Mela, C. F. (2003). E-customization. Journal of Marketing Research,
40(2):131–145.

Barry, T. E. (1987). The development of the hierarchy of effects: An historical perspective.
Current Issues and Research in Advertising, 10(1-2):251–295.

Berchtold, A. and Raftery, A. (2002). The mixture transition distribution model for high-
order markov chains and non-gaussian time series. Statistical Science, 17(3):328–356.

Berman, R. (2013). Beyond the last touch: Attribution in online advertising. Available
at SSRN 2384211.

Cho, C. (2003). Factors influencing clicking of banner advertisement on the www.
CyberPsychology and Behavior, 6(2):201–215.

Dalessandro, B., Perlich, C., Stitelman, O., and Provost, F. (2012). Causally motivated
attribution for online advertising. ADKDD12 Proceedings of the Sixth International
Workshop on Data Mining for Online Advertising and Internet Economy.

He, H., Garcia, E., et al. (2009). Learning from imbalanced data. Knowledge and Data
Engineering, IEEE Transactions on, 21(9):1263–1284.

Kireyev, P., Pauwels, K., and Guta, S. (2013). Do display advertisements influence
search? attribution and dynamics in online advertising. Harvard Business School.

Li, H. and Kannan, P. (2014). Attributing conversions in a multichannel online marketing
environment: An empirical model and a field experiment. Journal of Marketing
Research, 51(1):40–56.

Patricio, I., Fisk, R., Cuncha, J., and Constantine, I. (2011). Multilevel service design:
From customer value constellation to service experience blueprinting. Journal of Service
Research, 14(5):180–200.


Shao, X. and Li, L. (2011). Data-driven multi-touch attribution models. Proceedings
of the 17th ACM SIGKDD International Knowledge Discovery on Data Mining, pages
258–264.

Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games,
2:307–317.

Skiera, B. and Nabout, N. A. (2013). Prosad: A bidding decision support system for
profit optimizing search engine advertising. Marketing Science, 32:213–220.

Strong, E. (1925). Theories of selling. Journal of Applied Psychology, 9(1):75–86.

Vickrey, W. (1961). Counterspeculation, auctions and competitive sealed tenders. The
Journal of Finance, 16:8–37.

Wooff, D. and Anderson, J. (2013). Time-weighted attribution of revenue to multiple
e-commerce marketing channels in the customer journey. Departmental working paper.

Xu, L., Duan, J. A., and Whinston, A. (2014). Path to purchase: A mutually excit-
ing point process model for online advertising and conversion. Management Science,
60(6):1392–1412.

Zhang, Y., Wei, Y., and Ren, J. (2014). Multi-touch attribution in online advertising
with survival theory. 2014 IEEE International Conference on Data Mining (ICDM),
pages 687–696.
