1©MapR Technologies - Confidential
Real-time Learning for Fun and Profit
2©MapR Technologies - Confidential
• Contact:
– tdunning@maprtech.com
– @ted_dunning
• Slides and such (available late tonight):
– http://slideshare.net/tdunning
• Hashtags: #mapr #storm #bbuzz
3©MapR Technologies - Confidential
The Challenge
• Hadoop is great at processing vats of data
– But sucks for real-time (by design!)
• Storm is great for real-time processing
– But lacks any way to deal with batch processing
• It sounds like there isn’t a solution
– Neither fashionable solution handles everything
4©MapR Technologies - Confidential
This is not a problem.
It’s an opportunity!
5©MapR Technologies - Confidential
Hadoop is Not Very Real-time
[Figure: timeline t ending at "now", with regions labeled "Fully processed", "Latest full period" and "Unprocessed Data", and the annotation "Hadoop job takes this long for this data".]
6©MapR Technologies - Confidential
Real-time and Long-time together
[Figure: timeline t ending at "now", with the annotations "Hadoop works great back here", "Storm works here" and "Blended view".]
7©MapR Technologies - Confidential
One Alternative
[Figure: diagram with the labels "Search Engine", "NoSQL du Jour", "Consumer", "Real-time" and "Long-time", and a question mark.]
8©MapR Technologies - Confidential
Problems
• Simply dumping into a NoSQL engine doesn’t quite work
• Insert rate is limited
• No load isolation
– Big retrospective jobs kill real-time
• Low scan performance
– HBase is pretty good, but not stellar
• Difficult to set boundaries
– Where does real-time end and long-time begin?
9©MapR Technologies - Confidential
Almost a Solution
• Lambda architecture talks about a function of long-time state
– A real-time approximate accelerator adjusts the previous result to the current state
• Sounds good, but …
– How does the real-time accelerator combine with long-time?
– What algorithms can do this?
– How can we avoid gaps and overlaps and other errors?
• Needs more work
10©MapR Technologies - Confidential
A Simple Example
• Let’s start with the simplest case … counting
• Counting = addition
– Addition is associative
– Addition is on-line
– We can generalize these results to all associative, on-line functions
– But let’s start simple
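A minimal sketch in R of why associativity matters for counting: partial counts can be merged in any grouping and still give the same totals. The function name `merge_counts` and the sample labels are illustrative assumptions, not part of the talk.

```r
# Merge two named count vectors by adding counts label by label.
merge_counts = function(a, b) {
  labels = union(names(a), names(b))
  out = setNames(numeric(length(labels)), labels)
  out[names(a)] = out[names(a)] + a
  out[names(b)] = out[names(b)] + b
  out
}

# Three partial counts, e.g. from three time windows (made-up data).
c1 = c(home = 3, cart = 1)
c2 = c(home = 2, buy = 4)
c3 = c(cart = 5)

# Associativity: the grouping of the merges does not change the result.
left  = merge_counts(merge_counts(c1, c2), c3)
right = merge_counts(c1, merge_counts(c2, c3))
```

Because the grouping does not matter, one side can pre-merge old data in bulk while the other merges recent data incrementally, and the two partial results can still be added together at read time.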
11©MapR Technologies - Confidential
Rough Design – Data Flow
[Figure: data-flow diagram. Components: Data Sources, Catcher Cluster, Raw Logs, ProtoSpout, Query Event Spout, Counter Bolt, Logger Bolt, Semi Agg, Snap, Hadoop Aggregator, Long agg.]
12©MapR Technologies - Confidential
Closer Look – Catcher Protocol
[Figure: Data Sources and Catcher Cluster.]
The data sources and catchers communicate with a very simple protocol.
Hello() => list of catchers
Log(topic,message) => (OK|FAIL, redirect-to-catcher)
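The two calls can be sketched as a single-process simulation in R. Everything beyond the two call shapes is an assumption made for illustration: the host names, the rule that maps a topic to its owning catcher (a simple character-sum hash), and the in-memory environment standing in for shared file storage.

```r
catchers = c("catcher-1", "catcher-2", "catcher-3")  # hypothetical host names
topic_files = new.env()                               # stand-in for shared file storage

# Assumed ownership rule: hash the topic name onto exactly one catcher.
owner_of = function(topic) {
  catchers[sum(utf8ToInt(topic)) %% length(catchers) + 1]
}

# Hello() => list of catchers
hello = function() catchers

# Log(topic, message) => (OK|FAIL, redirect-to-catcher)
log_message = function(catcher, topic, message) {
  if (!(catcher %in% catchers)) {
    return(list(status = "FAIL", redirect = NA_character_))
  }
  # In this sketch the forward is implicit: the message lands in the
  # owner's topic file whichever catcher received the request.
  old = character(0)
  if (exists(topic, envir = topic_files, inherits = FALSE)) {
    old = get(topic, envir = topic_files)
  }
  assign(topic, c(old, message), envir = topic_files)
  list(status = "OK", redirect = owner_of(topic))
}

# Client side: the first request may hit any catcher,
# later ones go straight to the catcher named in the reply.
first = log_message(hello()[1], "clicks", "event-1")
second = log_message(first$redirect, "clicks", "event-2")
```

Returning the owner in every reply lets a client learn the right host after one request, without needing to know the ownership rule itself.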
13©MapR Technologies - Confidential
Closer Look – Catcher Queues
[Figure: Catcher Cluster and Topic Files.]
The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop.
Each topic file is appended to by exactly one catcher.
Topic files are kept in shared file storage.
14©MapR Technologies - Confidential
Closer Look – ProtoSpout
[Figure: Topic Files and ProtoSpout.]
The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology.
The last fully acked position is stored in a shared, transactionally correct file system.
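A rough R sketch of the tail, parse and checkpoint cycle described above. It is not Storm code: the tab-separated record format, the function names `poll` and `ack_all`, and the use of a line count as the "position" are all simplifying assumptions, with temporary files standing in for the shared file system.

```r
topic_file = tempfile()    # stand-in for a topic file in shared storage
offset_file = tempfile()   # stand-in for the stored last fully acked position
writeLines(c("home\t1", "cart\t1", "home\t1"), topic_file)

# Position is a line count here; a real spout would more likely track bytes.
read_offset = function() {
  if (file.exists(offset_file)) as.integer(readLines(offset_file)) else 0L
}

# Read records past the last acked position and parse them into tuples.
poll = function() {
  lines = readLines(topic_file)
  start = read_offset()
  new = if (length(lines) > start) lines[(start + 1):length(lines)] else character(0)
  fields = strsplit(new, "\t", fixed = TRUE)
  list(
    tuples = lapply(fields, function(f) list(label = f[1], count = as.integer(f[2]))),
    end = start + length(new)
  )
}

# Persist the position only once every emitted tuple has been acked.
ack_all = function(end) writeLines(as.character(end), offset_file)

batch = poll()
ack_all(batch$end)
cat("buy\t1\n", file = topic_file, append = TRUE)   # a catcher appends a record
batch2 = poll()                                     # only the new record comes back
```

If the process dies before `ack_all` runs, the next `poll` starts from the old position and the same records are replayed, which is why the position has to live in storage that survives the spout.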
15©MapR Technologies - Confidential
Closer Look – Counter Bolt
• Critical design goals:
– fast ack for all tuples
– fast restart of counter
• Ack happens when the tuple hits the replay log (tens of milliseconds, group commit)
• Restart involves replaying semi-aggs + replay log (very fast)
• Replay log only lasts until the next semi-aggregate goes out
[Figure: Counter Bolt with Incoming records, Replay Log and Semi-aggregated records.]
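The replay-log idea can be sketched in a few lines of R. This is a toy model under stated assumptions, not the bolt's implementation: state is held in a plain list, the function names are invented for illustration, and group commit and durable storage are left out.

```r
new_counter = function() {
  list(replay_log = character(0),   # labels logged since the last semi-aggregate
       semi_aggs = list())          # semi-aggregated records already emitted
}

# Log the tuple first; the ack can go out as soon as it is in the replay log.
on_tuple = function(state, label) {
  state$replay_log = c(state$replay_log, label)
  state
}

# Emit a semi-aggregate for the logged tuples, then truncate the replay log.
emit_semi_agg = function(state) {
  if (length(state$replay_log) > 0) {
    state$semi_aggs[[length(state$semi_aggs) + 1]] = table(state$replay_log)
    state$replay_log = character(0)
  }
  state
}

# Restart: rebuild the counts from the semi-aggregates plus the replay log.
recover_counts = function(state) {
  from_aggs = unlist(lapply(state$semi_aggs,
                            function(t) rep(names(t), as.integer(t))))
  table(c(from_aggs, state$replay_log))
}

s = new_counter()
for (x in c("home", "cart", "home")) s = on_tuple(s, x)
s = emit_semi_agg(s)          # replay log is now empty
s = on_tuple(s, "home")       # one tuple logged after the semi-aggregate
counts = recover_counts(s)
```

The replay log stays short because it is truncated at every semi-aggregate, which is what keeps restart fast.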
16©MapR Technologies - Confidential
A Frozen Moment in Time
• Snapshot defines the dividing line
• All data in the snap is long-time, all after is real-time
• Semi-agg strategy allows clean combination of both kinds of data
• A data-synchronized snap is not needed
[Figure: Semi Agg, Snap, Hadoop Aggregator, Long agg.]
17©MapR Technologies - Confidential
Guarantees
• Counter output volume is small-ish
– the greater of k tuples per 100K inputs or k tuples/s
– 1 tuple/s/label/bolt for this exercise
• Persistence layer must provide guarantees
– distributed against node failure
– must have either readable flush or closed-append
• HDFS is distributed, but provides no guarantees and strange semantics
• MapRfs is distributed, provides all necessary guarantees
18©MapR Technologies - Confidential
Presentation Layer
• Presentation must
– read recent output of Logger bolt
– read relevant output of Hadoop jobs
– combine semi-aggregated records
• User will see
– counts that increment within 0-2 s of events
– seamless and accurate meld of short and long-term data
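One way the combine step could look, sketched in R with made-up numbers. The assumptions are that the long-time aggregate covers everything up to the snapshot, that each semi-aggregated record carries a timestamp, and that the names `blended_counts`, `long_agg` and `semi_aggs` are purely illustrative.

```r
# Long-time aggregate from the Hadoop job; covers everything up to snapshot_time.
snapshot_time = 100
long_agg = c(home = 1000, cart = 250)

# Semi-aggregated records from the real-time side (made-up data).
semi_aggs = data.frame(
  time  = c(90, 98, 101, 102, 102),
  label = c("home", "cart", "home", "home", "buy"),
  count = c(7, 2, 5, 3, 1),
  stringsAsFactors = FALSE
)

blended_counts = function(long_agg, semi_aggs, snapshot_time) {
  # Records at or before the snapshot are already inside long_agg; skip them.
  recent = semi_aggs[semi_aggs$time > snapshot_time, ]
  recent_totals = vapply(split(recent$count, recent$label), sum, numeric(1))
  labels = union(names(long_agg), names(recent_totals))
  out = setNames(numeric(length(labels)), labels)
  out[names(long_agg)] = long_agg
  out[names(recent_totals)] = out[names(recent_totals)] + recent_totals
  out
}

view = blended_counts(long_agg, semi_aggs, snapshot_time)
```

Filtering on the snapshot time is what avoids gaps and overlaps: every event is counted either in the long-time aggregate or in a later semi-aggregate, never both.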
19©MapR Technologies - Confidential
The Basic Idea
• Online algorithms generally have relatively small state (like counting)
• Online algorithms generally have a simple update (like counting)
• If we can do this with counting, we can do it with all kinds of algorithms
20©MapR Technologies - Confidential
Summary – Part 1
• Semi-agg strategy + snapshots allows correct real-time counts
– because addition is on-line and associative
• Other on-line associative operations include:
– k-means clustering (see Dan Filimon’s talk at 16.)
– count distinct (see hyper-log-log counters from streamlib or kmv from Brickhouse)
– top-k values
– top-k (count(*)) (see streamlib)
– contextual Bayesian bandits (see part 2 of this talk)
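As a small illustration of one item on this list, here is a sketch in R of "top-k values" as a mergeable operation. The function names and sample data are assumptions made for the example; this is not taken from streamlib or Brickhouse.

```r
# The state for "top-k values" is just the k largest values seen so far.
top_k = function(values, k) head(sort(values, decreasing = TRUE), k)

# Merging two states: the top k of the union is the top k of the two top-k lists.
merge_top_k = function(a, b, k) top_k(c(a, b), k)

k = 3
part1 = c(5, 1, 9, 3)   # made-up partitions of the input
part2 = c(8, 2, 7)
part3 = c(6, 10)

# Merge per-partition states, then compare with a single pass over all the data.
merged = merge_top_k(merge_top_k(top_k(part1, k), top_k(part2, k), k),
                     top_k(part3, k), k)
direct = top_k(c(part1, part2, part3), k)
```

The state is small (k values) and the merge is associative, so the same semi-aggregate plus snapshot pattern used for counts applies here as well.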
21©MapR Technologies - Confidential
Example 2 – A/B testing in real-time
• I have 15 versions of my landing page
• Each visitor is assigned to a version
– Which version?
• A conversion or sale or whatever can happen
– How long to wait?
• Some versions of the landing page are horrible
– Don’t want to give them traffic
22©MapR Technologies - Confidential
A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
23©MapR Technologies - Confidential
A Philosophical Conclusion
• Probability as expressed by humans is subjective and depends on information and experience
24©MapR Technologies - Confidential
I Dunno
25©MapR Technologies - Confidential
5 heads out of 10 throws
26©MapR Technologies - Confidential
2 heads out of 12 throws
[Figure: distribution of the probability of heads, with the mean marked.]
Using any single number as a “best” estimate denies the uncertain nature of a distribution
Adding confidence bounds still loses most of the information in the distribution and prevents good modeling of the tails
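To make the point concrete, a short R sketch for the "2 heads out of 12 throws" case. It assumes a uniform prior, so the posterior for the probability of heads is Beta(heads + 1, tails + 1), the same +1 convention that appears in the talk's own R code. The variable names are illustrative.

```r
heads = 2
throws = 12
a = heads + 1            # Beta shape parameters under a uniform prior (assumed)
b = throws - heads + 1

post_mean = a / (a + b)                    # a single-number summary
interval = qbeta(c(0.025, 0.975), a, b)    # 95% credible bounds
tail_prob = 1 - pbeta(0.5, a, b)           # P(p > 0.5): a question about the tail
samples = rbeta(10000, a, b)               # draws carry the whole distribution
```

The mean and the interval compress the distribution into one or three numbers, while the samples can answer any later question about it, including questions about the tails.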
27©MapR Technologies - Confidential
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
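The four steps above can be simulated for two bandits in R. The hidden payoff rates, the number of steps, the seed and the uniform prior (Beta with +1 on each count) are assumptions chosen for the demonstration.

```r
set.seed(1)                  # only to make the sketch repeatable
true_p = c(0.05, 0.15)       # hidden payoff rates, made up for the demo
wins = c(0, 0)
losses = c(0, 0)

for (step in 1:2000) {
  # Sample p1 and p2 from the current Beta posteriors.
  p = rbeta(2, wins + 1, losses + 1)
  # Put the coin in bandit 1 if p1 > p2, else in bandit 2.
  arm = if (p[1] > p[2]) 1 else 2
  reward = runif(1) < true_p[arm]
  # Learning is just counting wins and losses per bandit.
  if (reward) wins[arm] = wins[arm] + 1 else losses[arm] = losses[arm] + 1
}

pulls = wins + losses        # how many coins each bandit received
```

Early on the posteriors are wide, so both bandits get tried. As the counts grow, the samples concentrate and most coins go to the bandit that pays better.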
28©MapR Technologies - Confidential
And it works!
[Figure: regret (0 to 0.12) versus n (0 to 1100) for ε-greedy with ε = 0.05 and for the Bayesian Bandit with Gamma-Normal.]
29©MapR Technologies - Confidential
Video Demo
30©MapR Technologies - Confidential
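The Code
• Select an alternative
select = function(k) {
  # k has one row per alternative: column 1 = failures, column 2 = successes
  n = dim(k)[1]
  p0 = rep(0, length.out=n)
  for (i in 1:n) {
    # sample a success rate from the Beta posterior of alternative i
    p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
  }
  return (which.max(p0))
}
• Select and learn
# the wrapper name "learn" is assumed; the slide shows only the loop body
learn = function(k, steps) {
  for (z in 1:steps) {
    i = select(k)
    j = test(i)   # test() is not shown on the slide; assumed to return column 1 or 2
    k[i,j] = k[i,j]+1
  }
  return (k)
}
• But we already know how to count!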
31©MapR Technologies - Confidential
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models
• Note that learning here = counting = on-line algorithm
33©MapR Technologies - Confidential
Caveats
• Original Bayesian Bandit only requires real-time
• Generalized Bandit may require access to long history for learning
– Pseudo online learning may be easier than true online
• Bandit variables can include content, time of day, day of week
• Context variables can include user id, user features
• Bandit × context variables provide the real power
34©MapR Technologies - Confidential
• Contact:
– tdunning@maprtech.com
– @ted_dunning
• Slides and such (available late tonight):
– http://slideshare.net/tdunning
• Hashtags: #mapr #storm #bbuzz
35©MapR Technologies - Confidential
Thank You
