Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (Chapter 1)
https://experimentguide.com
Getting numbers is easy; getting numbers you can trust is hard. This practical guide by
experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to
accelerate innovation using trustworthy online controlled experiments, or A/B tests.
Based on practical experience at companies that each run more than 20,000 controlled
experiments a year, the authors share examples, pitfalls, and advice for students and
industry professionals getting started with experiments, plus deeper dives into advanced
topics for experienced practitioners who want to improve the way they and their
organizations make data-driven decisions.
Ron Kohavi is a vice president and technical fellow at Airbnb. This book was written
while he was a technical fellow and corporate vice president at Microsoft. He was
previously director of data mining and personalization at Amazon. He received his PhD
in Computer Science from Stanford University. His papers have more than 40,000
citations and three of them are in the top 1,000 most-cited papers in Computer Science.
Diane Tang is a Google Fellow, with expertise in large-scale data analysis and
infrastructure, online controlled experiments, and ads systems. She has an AB from
Harvard and an MS/PhD from Stanford, with patents and publications in mobile
networking, information visualization, experiment methodology, data infrastructure,
data mining, and large data.
Ya Xu heads Data Science and Experimentation at LinkedIn. She has published several
papers on experimentation and is a frequent speaker at top-tier conferences and
universities. She previously worked at Microsoft and received her PhD in Statistics
from Stanford University.
“At the core of the Lean Methodology is the scientific method: Creating hypotheses,
running experiments, gathering data, extracting insights, and validating or
modifying the hypothesis. A/B testing is the gold standard of creating
verifiable and repeatable experiments, and this book is its definitive text.”
– Steve Blank, Adjunct professor at Stanford University, father of modern
entrepreneurship, author of The Startup Owner’s Manual and
The Four Steps to the Epiphany
“This book is a great resource for executives, leaders, researchers or engineers
looking to use online controlled experiments to optimize product features, project
efficiency or revenue. I know firsthand the impact that Kohavi’s work had on Bing
and Microsoft, and I’m excited that these learnings can now reach a wider audience.”
– Harry Shum, EVP, Microsoft Artificial Intelligence and Research Group
“A great book that is both rigorous and accessible. Readers will learn how to bring
trustworthy controlled experiments, which have revolutionized internet product
development, to their organizations.”
– Adam D’Angelo, Co-founder and CEO of Quora and
former CTO of Facebook
“This book is a great overview of how several companies use online experimentation
and A/B testing to improve their products. Kohavi, Tang and Xu have a wealth of
experience and excellent advice to convey, so the book has lots of practical real world
examples and lessons learned over many years of the application of these techniques
at scale.”
– Jeff Dean, Google Senior Fellow and SVP Google Research
“Do you want your organization to make consistently better decisions? This is the new
bible of how to get from data to decisions in the digital age. Reading this book is like
sitting in meetings inside Amazon, Google, LinkedIn, Microsoft. The authors expose
for the first time the way the world’s most successful companies make decisions.
Beyond the admonitions and anecdotes of normal business books, this book shows
what to do and how to do it well. It’s the how-to manual for decision-making in the
digital world, with dedicated sections for business leaders, engineers, and data analysts.”
– Scott Cook, Intuit Co-founder & Chairman of the Executive Committee
“Online controlled experiments are powerful tools. Understanding how they work,
what their strengths are, and how they can be optimized can illuminate both
specialists and a wider audience. This book is the rare combination of technically
authoritative, enjoyable to read, and dealing with highly important matters.”
– John P.A. Ioannidis, Professor of Medicine, Health Research and Policy,
Biomedical Data Science, and Statistics at Stanford University
“Which online option will be better? We frequently need to make such choices, and
frequently err. To determine what will actually work better, we need rigorous
controlled experiments, aka A/B testing. This excellent and lively book by experts
from Microsoft, Google, and LinkedIn presents the theory and best practices of A/B
testing. A must read for anyone who does anything online!”
– Gregory Piatetsky-Shapiro, Ph.D., president of KDnuggets,
co-founder of SIGKDD, and LinkedIn Top Voice on
Data Science & Analytics.
“Ron Kohavi, Diane Tang and Ya Xu are the world’s top experts on online
experiments. I’ve been using their work for years and I’m delighted they have
now teamed up to write the definitive guide. I recommend this book to all my
students and everyone involved in online products and services.”
– Erik Brynjolfsson, Professor at MIT and Co-Author of
The Second Machine Age
“A modern software-supported business cannot compete successfully without online
controlled experimentation. Written by three of the most experienced leaders in the
field, this book presents the fundamental principles, illustrates them with compelling
examples, and digs deeper to present a wealth of practical advice. It’s a ‘must read’!”
– Foster Provost, Professor at NYU Stern School of Business & co-author of the
best-selling Data Science for Business
“In the past two decades the technology industry has learned what scientists have
known for centuries: that controlled experiments are among the best tools to
understand complex phenomena and to solve very challenging problems. The
ability to design controlled experiments, run them at scale, and interpret their
results is the foundation of how modern high tech businesses operate. Between
them the authors have designed and implemented several of the world’s most
powerful experimentation platforms. This book is a great opportunity to learn
from their experiences about how to use these tools and techniques.”
– Kevin Scott, EVP and CTO of Microsoft
“Online experiments have fueled the success of Amazon, Microsoft, LinkedIn and
other leading digital companies. This practical book gives the reader rare access to
decades of experimentation experience at these companies and should be on the
bookshelf of every data scientist, software engineer and product manager.”
– Stefan Thomke, William Barclay Harding Professor, Harvard Business School,
Author of Experimentation Works: The Surprising Power of Business Experiments
“The secret sauce for a successful online business is experimentation. But it is a secret
no longer. Here three masters of the art describe the ABCs of A/B testing so that you
too can continuously improve your online services.”
– Hal Varian, Chief Economist, Google, and author of
Intermediate Microeconomics: A Modern Approach
“Experiments are the best tool for online products and services. This book is full of
practical knowledge derived from years of successful testing at Microsoft, Google,
and LinkedIn. Insights and best practices are explained with real examples and
pitfalls, their markers and solutions identified. I strongly recommend this book!”
– Preston McAfee, former Chief Economist and VP of Microsoft
“Experimentation is the future of digital strategy and ‘Trustworthy Experiments’ will
be its Bible. Kohavi, Tang and Xu are three of the most noteworthy experts on
experimentation working today and their book delivers a truly practical roadmap
for digital experimentation that is useful right out of the box. The revealing case
studies they conducted over many decades at Microsoft, Amazon, Google and
LinkedIn are organized into easy-to-understand practical lessons with tremendous
depth and clarity. It should be required reading for any manager of a digital business.”
– Sinan Aral, David Austin Professor of Management,
MIT and author of The Hype Machine
“The only thing worse than no experiment is a misleading one, because it gives you
false confidence! This book details the technical aspects of testing based on insights
from some of the world’s largest testing programs. If you’re involved in online
experimentation in any capacity, read it now to avoid mistakes and gain confidence
in your results.”
– Chris Goward, Author of You Should Test That!,
Founder and CEO of Widerfunnel
“This is a phenomenal book. The authors draw on a wealth of experience and have
produced a readable reference that is somehow both comprehensive and detailed at
the same time. Highly recommended reading for anyone who wants to run serious
digital experiments.”
– Pete Koomen, Co-founder, Optimizely
“The authors are pioneers of online experimentation. The platforms they’ve built
and the experiments they’ve enabled have transformed some of the largest internet
brands. Their research and talks have inspired teams across the industry to adopt
experimentation. This book is the authoritative yet practical text that the industry has
been waiting for.”
– Adil Aijaz, Co-founder and CEO, Split Software
RON KOHAVI
Microsoft
DIANE TANG
Google
YA XU
LinkedIn
www.cambridge.org
Information on this title: www.cambridge.org/9781108724265
DOI: 10.1017/9781108653985
© Ron Kohavi, Diane Tang, and Ya Xu 2020
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2020
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Kohavi, Ron, author. | Tang, Diane, 1974– author. | Xu, Ya, 1982– author.
Title: Trustworthy online controlled experiments : a practical guide to A/B testing /
Ron Kohavi, Diane Tang, Ya Xu.
Description: Cambridge, United Kingdom ; New York, NY : Cambridge University Press,
2020. | Includes bibliographical references and index.
Identifiers: LCCN 2019042021 (print) | LCCN 2019042022 (ebook) | ISBN 9781108724265
(paperback) | ISBN 9781108653985 (epub)
Subjects: LCSH: Social media. | User-generated content–Social aspects.
Classification: LCC HM741 .K68 2020 (print) | LCC HM741 (ebook) | DDC 302.23/1–dc23
LC record available at https://lccn.loc.gov/2019042021
LC ebook record available at https://lccn.loc.gov/2019042022
ISBN 978-1-108-72426-5 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy
of URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Preface
How to Read This Book
Our goal in writing this book is to share practical lessons from decades of
experience running online controlled experiments at scale at Amazon and
Microsoft (Ron), Google (Diane), and Microsoft and LinkedIn (Ya). While
we are writing this book in our capacity as individuals and not as representa-
tives of Google, LinkedIn, or Microsoft, we have distilled key lessons and
pitfalls encountered over the years and provide guidance for both software
platforms and the corporate cultural aspects of using online controlled experi-
ments to establish a data-driven culture that informs rather than relies on the
HiPPO (Highest Paid Person’s Opinion) (R. Kohavi, HiPPO FAQ 2019). We
believe many of these lessons apply in the online setting, to large or small
companies, or even teams and organizations within a company. A concern we
share is the need to evaluate the trustworthiness of experiment results. We
believe in the skepticism implied by Twyman’s Law: Any figure that looks
interesting or different is usually wrong; we encourage readers to double-
check results and run validity tests, especially for breakthrough positive
results. Getting numbers is easy; getting numbers you can trust is hard!
Part I is designed to be read by everyone, regardless of background, and
consists of four chapters.
● Chapter 1 is an overview of the benefits of running online controlled
experiments and introduces experiment terminology.
● Chapter 2 uses an example to walk through the process of running an
experiment end-to-end.
● Chapter 3 describes common pitfalls and how to build experimentation
trustworthiness.
● Chapter 4 covers building an experimentation platform and an
experimentation culture.
Acknowledgments
We would like to thank our colleagues who have worked with us throughout
the years. They are too numerous to name individually, but this book is based
on our combined work, as well as that of others throughout the industry and
beyond who research and conduct online controlled experiments. We learned
a great deal from you all; thank you.
On writing the book, we’d like to call out Lauren Cowles, our editor, for
partnering with us throughout this process. Cherie Woodward provided great
line editing and style guidance to help mesh our three voices. Stephanie Grey
worked with us on all diagrams and figures, improving them in the process.
Kim Vernon provided final copy-editing and bibliography checks.
Most importantly, we owe a deep debt of gratitude to our families, as we
missed time with them to work on this book. Thank you to Ronny’s family:
Yael, Oren, Ittai, and Noga, to Diane’s family: Ben, Emma, and Leah, and to
Ya’s family: Thomas, Leray, and Tavis. We could not have written this book
without your support and enthusiasm!
Google: Hal Varian, Dan Russell, Carrie Grimes, Niall Cardin, Deirdre
O’Brien, Henning Hohnhold, Mukund Sundararajan, Amir Najmi, Patrick
Riley, Eric Tassone, Jen Gennai, Shannon Vallor, Eric Miraglia, David Price,
Crystal Dahlen, Tammy Jih Murray, Lanah Donnelly and all who work on
experiments at Google.
LinkedIn: Stephen Lynch, Yav Bojinov, Jiada Liu, Weitao Duan, Nanyu
Chen, Guillaume Saint-Jacques, Elaine Call, Min Liu, Arun Swami, Kiran
Prasad, Igor Perisic, and the entire Experimentation team.
Microsoft: Omar Alonso, Benjamin Arai, Jordan Atlas, Richa Bhayani, Eric
Boyd, Johnny Chan, Alex Deng, Andy Drake, Aleksander Fabijan, Brian
Frasca, Scott Gude, Somit Gupta, Adam Gustafson, Tommy Guy, Randy
Henne, Edward Jezierski, Jing Jin, Dongwoo Kim, Waldo Kuipers, Jonathan
Litz, Sophia Liu, Jiannan Lu, Qi Lu, Daniel Miller, Carl Mitchell, Nils
Pohlmann, Wen Qin, Thomas Schreiter, Harry Shum, Dan Sommerfield, Garnet
Vaz, Toby Walker, Michele Zunker, and the Analysis & Experimentation team.
Special thanks to Maria Stone and Marcus Persson for feedback throughout
the book, and Michelle N. Meyer for expert feedback on the ethics chapter.
Others who have given feedback include: Adil Aijaz, Jonas Alves, Alon
Amit, Kevin Anderson, Joel Barajas, Houman Bedayat, Beau Bender, Bahador
Biglari, Stuart Buck, Jike Chong, Jed Chou, Pavel Dmitriev, Yurong Fan,
Georgi Georgiev, Ilias Gerostathopoulos, Matt Gershoff, William Grosso,
Aditya Gupta, Rajesh Gupta, Shilpa Gupta, Kris Jack, Jacob Jarnvall, Dave
Karow, Slawek Kierner, Pete Koomen, Dylan Lewis, Bryan Liu, David Man-
heim, Colin McFarland, Tanapol Nearunchron, Dheeraj Ravindranath, Aaditya
Ramdas, Andre Richter, Jianhong Shen, Gang Su, Anthony Tang, Lukas
Vermeer, Rowel Willems, Yu Yang, and Yufeng Wang.
Thank you to the many who helped who are not named explicitly.
PART I
1 Introduction and Motivation
● Experiments with big impact are rare. Bing runs over 10,000 experiments a
year, but simple features resulting in such a big improvement happen only
once every few years.
● The overhead of running an experiment must be small. Bing’s engineers had
access to ExP, Microsoft’s experimentation system, which made it easy to
scientifically evaluate the idea.
● The overall evaluation criterion (OEC, described more later in this chapter)
must be clear. In this case, revenue was a key component of the OEC, but
revenue alone is insufficient as an OEC. It could lead to plastering the web
site with ads, which is known to hurt the user experience. Bing uses an OEC
that weighs revenue against user-experience metrics, including Sessions per
user (are users abandoning or increasing engagement) and several other
components. The key point is that user-experience metrics did not signifi-
cantly degrade even though revenue increased dramatically.
The next section introduces the terminology of controlled experiments.
Figure 1.2 A simple controlled experiment: 100% of users are split 50%/50%
at random between Control (the existing system) and Treatment (the existing
system with Feature X)
titles. The users’ interactions with the Bing web site were instrumented, that is,
monitored and logged. From the logged data, metrics were computed, which
allowed us to assess the difference between the variants for each metric.
In the simplest controlled experiments, there are two variants: Control (A)
and Treatment (B), as shown in Figure 1.2.
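To make the mechanics concrete, here is a minimal sketch of how such a random
split is commonly implemented: hash-based bucketing, where a user ID is hashed
together with an experiment name so each user lands in a stable variant. This is
our own illustration, not the design of any particular platform mentioned in this
book; the function name and bucketing scheme are assumptions.

    import hashlib

    def assign_variant(user_id: str, experiment: str,
                       variants=("control", "treatment")) -> str:
        """Deterministically assign a user to a variant.

        Hashing the user ID together with the experiment name keeps each
        user's assignment stable across visits, while keeping assignments
        independent across different experiments.
        """
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    # The same user always gets the same variant for a given experiment.
    assert assign_variant("user-12345", "longer-ad-titles") == \
           assign_variant("user-12345", "longer-ad-titles")

Deterministic assignment matters because a user who is re-randomized on every
visit would see an inconsistent experience and contaminate the measured
Treatment effect.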
We follow the terminology of Kohavi and Longbotham (2017), and Kohavi,
Longbotham et al. (2009) and provide related terms from other fields below.
You can find many other resources on experimentation and A/B testing at the
end of this chapter under Additional Reading.
Overall Evaluation Criterion (OEC): A quantitative measure of the
experiment’s objective. For example, your OEC might be active days per user,
indicating the number of days during the experiment that users were active
(i.e., they visited and took some action). Increasing this OEC implies that users
are visiting your site more often, which is a great outcome. The OEC must be
measurable in the short term (the duration of an experiment) yet believed to
causally drive long-term strategic objectives (see Strategy, Tactics, and their
Relationship to Experiments later in this chapter and Chapter 7). In the case of a
search engine, the OEC can be a combination of usage (e.g., sessions-per-user),
user-experience metrics, and revenue.
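As a toy illustration of a combined OEC, the sketch below scores a Treatment
against a Control as a weighted sum of relative changes in component metrics.
The metric names and weights are hypothetical assumptions of ours, not any
company’s actual OEC.

    # Hypothetical OEC: metric names and weights below are illustrative.
    OEC_WEIGHTS = {
        "sessions_per_user": 0.5,   # usage
        "success_rate": 0.3,        # user experience
        "revenue_per_user": 0.2,    # revenue
    }

    def oec_delta(treatment: dict, control: dict) -> float:
        """Weighted sum of each component metric's relative change.

        Relative changes put metrics with different units
        (sessions, rates, dollars) on a comparable scale.
        """
        return sum(
            w * (treatment[m] - control[m]) / control[m]
            for m, w in OEC_WEIGHTS.items()
        )

    control = {"sessions_per_user": 12.0, "success_rate": 0.61,
               "revenue_per_user": 1.90}
    treatment = {"sessions_per_user": 12.1, "success_rate": 0.60,
                 "revenue_per_user": 2.20}
    print(f"{oec_delta(treatment, control):+.4f}")  # positive favors Treatment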
Figure 1.3 A simple hierarchy of evidence for assessing the quality of trial design
(Greenhalgh 2014): systematic reviews (meta-analyses) of randomized controlled
experiments at the top, randomized controlled experiments below them, and
observational studies (cohort and case-control) at the base
quality, causing more crashes. It turns out that all three events are caused by a
single factor: usage. Heavy users of the product see more error messages, experi-
ence more crashes, and have lower churn rates. Correlation does not imply
causality, and overly relying on these observations leads to faulty decisions.
In 1995, Guyatt et al. (1995) introduced the hierarchy of evidence as a way to
grade recommendations in medical literature, which Greenhalgh expanded on in
her discussions on practicing evidence-based medicine (1997, 2014). Figure 1.3
shows a simple hierarchy of evidence, translated to our terminology, based on
Bailar (1983, 1). Randomized controlled experiments are the gold standard for
establishing causality. Systematic reviews, that is, meta-analyses, of controlled
experiments provide more evidence and generalizability.
More complex models, such as the Levels of Evidence from the Oxford Centre
for Evidence-Based Medicine (2009), are also available.
The experimentation platforms used by our companies allow experimenters
at Google, LinkedIn, and Microsoft to run tens of thousands of online con-
trolled experiments a year with a high degree of trust in the results. We believe
online controlled experiments are:
● The best scientific way to establish causality with high probability.
● Able to detect small changes that are harder to detect with other techniques,
such as changes over time (sensitivity); the back-of-envelope sizing sketch
below shows why detecting small changes requires scale.
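Sensitivity is largely a matter of sample size. As a rough illustration (our own
sketch, using a standard rule-of-thumb power approximation rather than anything
specific to this chapter), the users needed per variant grow with the square of the
metric’s noise-to-effect ratio:

    import math

    def users_per_variant(std_dev: float, min_effect: float) -> int:
        """Rule-of-thumb sample size for ~80% power at alpha = 0.05:
        n is about 16 * sigma^2 / delta^2, where delta is the smallest
        absolute metric change worth detecting."""
        return math.ceil(16 * std_dev ** 2 / min_effect ** 2)

    # Detecting a 0.1% absolute move in a 2% conversion rate:
    p = 0.02
    sigma = math.sqrt(p * (1 - p))  # std dev of a 0/1 conversion metric
    print(users_per_variant(sigma, 0.001))  # about 314,000 users per variant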
Tenets
There are three key tenets for organizations that wish to run online controlled
experiments (Kohavi et al. 2013):
1. The organization wants to make data-driven decisions and has formalized
an OEC.
2. The organization is willing to invest in the infrastructure and tests to run
controlled experiments and ensure that the results are trustworthy.
3. The organization recognizes that it is poor at assessing the value of ideas.
billboards, the web, retail, or anything” (Segall 2012, 42). But measuring the
incremental benefit to users from new features has cost, and objective meas-
urements typically show that progress is not as rosy as initially envisioned.
Many organizations will not spend the resources required to define and
measure progress. It is often easier to generate a plan, execute against it, and
declare success, with the key metric being “percent of plan delivered,” ignoring
whether the feature has any positive impact on key metrics.
To be data-driven, an organization should define an OEC that can be easily
measured over relatively short durations (e.g., one to two weeks). Large organ-
izations may have multiple OECs or several key metrics that are shared with
refinements for different areas. The hard part is finding metrics measurable in a
short period, sensitive enough to show differences, and that are predictive of
long-term goals. For example, “Profit” is not a good OEC, as short-term theat-
rics (e.g., raising prices) can increase short-term profit, but may hurt it in the long
run. Customer lifetime value is a strategically powerful OEC (Kohavi,
Longbotham et al. 2009). We cannot overemphasize the importance of agreeing on a
good OEC that your organization can align behind; see Chapter 6.
The terms “data-informed” or “data-aware” are sometimes used to avoid the
implication that a single source of data (e.g., a controlled experiment) “drives”
the decisions (King, Churchill and Tan 2017, Knapp et al. 2006). We use data-
driven and data-informed as synonyms in this book. Ultimately, a decision
should be made with many sources of data, including controlled experiments,
surveys, estimates of maintenance costs for the new code, and so on. A data-
driven or a data-informed organization gathers relevant data to drive a decision
and inform the HiPPO (Highest Paid Person’s Opinion) rather than relying on
intuition (Kohavi 2019).
realize how rare it is for them to succeed on the first attempt. I strongly suspect
that this experience is universal, but it is not universally recognized or
acknowledged.” Finally, Colin McFarland wrote in the book Experiment!
(McFarland 2012, 20) “No matter how much you think it’s a no-brainer,
how much research you’ve done, or how many competitors are doing it,
sometimes, more often than you might think, experiment ideas simply fail.”
Not every domain has such poor statistics, but most who have run controlled
experiments in customer-facing websites and applications have experienced
this humbling reality: we are poor at assessing the value of ideas.
shipped to users over the year, assuming they are additive. Because the team
runs thousands of experiment Treatments, and some may appear positive by
chance (Lee and Shen 2018), credit towards the 2% is assigned based on a
replication experiment: once the implementation of an idea is successful,
possibly after multiple iterations and refinements, a certification experiment
is run with a single Treatment. The Treatment effect of this certification
experiment determines the credit towards the 2% goal. Recent work suggests
shrinking the Treatment effect to improve precision (Coey and
Cunningham 2019).
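The shrinkage idea can be sketched with a simple empirical-Bayes estimator.
This is our illustrative reading, not the exact method of Coey and Cunningham
(2019): a measured effect is pulled toward zero in proportion to how noisy it is
relative to the spread of true effects across past experiments (prior_var below
is assumed to be estimated from that history).

    def shrink(observed_effect: float, std_error: float,
               prior_var: float) -> float:
        """Empirical-Bayes style shrinkage toward a prior mean of zero.

        prior_var is the variance of true effects across many past
        experiments (assumed to be estimated elsewhere); the noisier a
        measurement, the more it is discounted.
        """
        weight = prior_var / (prior_var + std_error ** 2)
        return weight * observed_effect

    # A 2% measured lift with a large standard error is heavily discounted.
    print(shrink(observed_effect=0.02, std_error=0.015,
                 prior_var=0.0001))  # roughly 0.006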
Figure 1.4 Bing Ad Revenue over Time (y-axis represents about 20% growth/
year). The specific numbers are not important
Figure 1.5 Amazon’s credit card offer with savings on cart total
Personalized Recommendations
Greg Linden at Amazon created a prototype to display personalized recom-
mendations based on items in the user’s shopping cart (Linden 2006, Kohavi,
Longbotham et al. 2009). When you add an item, recommendations come up;
add another item, new recommendations show up. Linden notes that while the
prototype looked promising, “a marketing senior vice-president was dead set
against it,” claiming it would distract people from checking out. Greg was
“forbidden to work on this any further.” Nonetheless, he ran a controlled
experiment, and the “feature won by such a wide margin that not having it
live was costing Amazon a noticeable chunk of change. With new urgency,
shopping cart recommendations launched.” Now multiple sites use cart
recommendations.
Malware Reduction
Ads are a lucrative business and “freeware” installed by users often contains
malware that pollutes pages with ads. Figure 1.6 shows what a resulting page
from Bing looked like to a user with malware. Note that multiple ads (high-
lighted in red) were added to the page (Kohavi et al. 2014).
Not only were Bing ads removed, depriving Microsoft of revenue, but low-
quality and often irrelevant ads were displayed, providing a poor experience
for users who might not have realized why they were seeing so many ads.
Microsoft ran a controlled experiment with 3.8 million users potentially
impacted, where basic routines that modify the DOM (Document Object
Model) were overridden to allow only limited modifications from trusted
sources (Kohavi et al. 2014). The results showed improvements to all of Bing’s
key metrics, including Sessions per user, indicating that users visited more
often or churned less. In addition, users were more successful in their searches,
quicker to click on useful links, and annual revenue improved by several
million dollars. Also, page-load time, a key performance metric we previously
discussed, improved by hundreds of milliseconds for the impacted pages.
Figure 1.6 Bing page when the user has malware shows multiple ads
Backend Changes
Backend algorithmic changes are often overlooked as an area to use controlled
experiments (Kohavi, Longbotham et al. 2009), but they can yield significant
results. We can see this both from how teams at Google, LinkedIn, and
Microsoft work on many incremental small changes, as we described above,
and in this example involving Amazon.
Back in 2004, there already existed a good algorithm for making
recommendations based on two sets. The signature feature for Amazon’s
recommendation was “People who bought item X bought item Y,” but this was
generalized to “People who viewed item X bought item Y” and “People who
viewed item X viewed item Y.” A proposal was made to use the same algorithm
for “People who searched for X bought item Y.” Proponents of the algorithm
gave examples of underspecified searches, such as “24,” which most people
associate with the TV show starring Kiefer Sutherland. Amazon’s search was
returning poor results (left in Figure 1.7), such as CDs with 24 Italian Songs,
clothing for 24-month-old toddlers, a 24-inch towel bar, and so on. The new
algorithm gave top-notch results (right in Figure 1.7), returning DVDs for the
show and related books, based on what items people actually purchased after
searching for “24.” One weakness of the algorithm was that some items surfaced
that did not contain the words in the search phrase; however, Amazon ran a
controlled experiment, and despite this weakness, this change increased
Amazon’s overall revenue by 3% – hundreds of millions of dollars.
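The counting idea behind the algorithm can be sketched as follows, assuming a
simplified log of searches followed by purchases. This is our own construction;
Amazon’s production system is certainly more sophisticated.

    from collections import Counter, defaultdict

    # Hypothetical (search phrase, item purchased afterwards) log entries.
    purchase_log = [
        ("24", "24: Season One DVD"),
        ("24", "24: Season One DVD"),
        ("24", "24: Season Two DVD"),
        ("24", "24-inch towel bar"),
    ]

    searched_then_bought = defaultdict(Counter)
    for query, item in purchase_log:
        searched_then_bought[query][item] += 1

    def rank_results(query: str, k: int = 3) -> list:
        """Rank items by how often searchers of `query` went on to buy them."""
        return [item for item, _ in searched_then_bought[query].most_common(k)]

    print(rank_results("24"))  # DVDs of the show outrank the towel bar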
Facebook and Twitter, opening a third pane with social search results. After
spending over $25 million on the strategy with no significant impact to key
metrics, the strategy was abandoned (Kohavi and Thomke 2017). It may be
hard to give up on a big bet, but economic theory tells us that failed bets are
sunk costs, and we should make a forward-looking decision based on the
available data, which is gathered as we run more experiments.
Eric Ries uses the term “achieved failure” for companies that successfully,
faithfully, and rigorously execute a plan that turned out to have been utterly
flawed (Ries 2011). Instead, he recommends:
The Lean Startup methodology reconceives a startup’s efforts as experiments that
test its strategy to see which parts are brilliant and which are crazy. A true
experiment follows the scientific method. It begins with a clear hypothesis that
makes predictions about what is supposed to happen. It then tests those predictions
empirically.
Additional Reading
There are several books directly related to online experiments and A/B tests
(Siroker and Koomen 2013, Goward 2012, Schrage 2014, McFarland 2012,
King et al. 2017). Most have great motivational stories but are inaccurate on
the statistics. Georgi Georgiev’s recent book includes comprehensive statis-
tical explanations (Georgiev 2019).