© 2014 MapR Technologies
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
Agenda
• Background – recommending with puppies and ponies
• Speed tricks
• Accuracy tricks
• Moving to real-time
Puppies and Ponies
Cooccurrence Analysis
How Often Do Items Co-occur?
Which Co-occurrences are Interesting?
Each row of indicators becomes a field in a search engine document
Recommendations
Alice: Alice got an apple and a puppy
Charles: Charles got a bicycle
Bob: Bob got an apple
What else would Bob like?
A puppy!
By the way, like me, Bob also
wants a pony…
Recommendations
Users: Alice, Bob, Charles, Amelia
What if everybody gets a pony?
What else would you recommend for new user Amelia?
If everybody gets a pony, it’s not a very good indicator of what else to predict...
Problems with Raw Co-occurrence
• Very popular items co-occur with everything, which is why it’s not very helpful to know that everybody wants a pony…
– Examples: Welcome document; Elevator music
• Very widespread occurrence is not useful for generating indicators for recommendation
– Unless you want to offer an item that is constantly desired, such as razor blades (or ponies)
• What we want is anomalous co-occurrence
– This is the source of interesting indicators of preference on which to base recommendation
Overview: Get Useful Indicators from Behaviors
1. Use log files to build history matrix of users x items
– Remember: this history of interactions will be sparse compared to all potential combinations
2. Transform to a co-occurrence matrix of items x items
3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix
– Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference
– ItemSimilarityJob in Apache Mahout uses LLR
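Step 1 can be sketched in a few lines of Python. This is only an illustration: the "user,item" log format and the function names are assumptions, not something the slides specify. It builds a sparse history (one set of items per user) and reports how sparse it is.

```python
from collections import defaultdict

def build_history(log_lines):
    """Parse hypothetical 'user,item' log lines into a sparse history: user -> set of items."""
    history = defaultdict(set)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item = line.split(",", 1)
        history[user.strip()].add(item.strip())
    return dict(history)

def density(history):
    """Fraction of the users x items matrix that is non-zero."""
    items = set().union(*history.values()) if history else set()
    cells = len(history) * len(items)
    filled = sum(len(user_items) for user_items in history.values())
    return filled / cells if cells else 0.0

log = ["alice,apple", "alice,puppy", "bob,apple", "charles,bicycle"]
history = build_history(log)
print(history["alice"], density(history))
```

Storing only the observed interactions keeps the cost proportional to the number of log lines rather than to users x items.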
Which one is the anomalous co-occurrence?

         A       not A
B        13      1,000
not B    1,000   100,000
Score: 0.90

         A       not A
B        1       0
not B    0       10,000
Score: 4.52

         A       not A
B        10      0
not B    0       100,000
Score: 14.3

         A       not A
B        1       0
not B    0       2
Score: 1.95

Scores are the square root of the LLR statistic for each table.
Dunning, Ted. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics vol 19 no. 1 (1993)
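As a rough check on the four scores, here is a minimal Python sketch of the LLR test for a 2x2 table, written in the entropy form. The function names are illustrative and are not Mahout's API. Taking the square root of the statistic appears to reproduce the values shown on the slide.

```python
import math

def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio statistic (G^2) for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def root_llr(k11, k12, k21, k22):
    """Square root of the LLR statistic."""
    return math.sqrt(llr(k11, k12, k21, k22))

# The four tables from the slide, as (A and B, B only, A only, neither).
tables = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]
for table in tables:
    print(table, round(root_llr(*table), 2))
```

The first table has the largest raw co-occurrence count (13) yet the lowest score, because A and B are both common enough that 13 joint occurrences is about what chance predicts.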
Collection of Documents: Insert Meta-Data
Item meta-data goes into the search engine (ingest easily via NFS)
Document for “puppy”:
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
✔ indicators: (t1)
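To make the role of the indicators field concrete, here is a minimal sketch that uses a plain Python list in place of a real search engine. Only the puppy document (t4) comes from the slide; the t1 document, the scoring rule (count of matching indicators), and the function name are assumptions for illustration.

```python
# In-memory stand-in for a search index; a real deployment would query a search engine.
docs = [
    {"id": "t1", "title": "apple", "indicators": []},  # hypothetical item
    {"id": "t4", "title": "puppy", "keywords": ["puppy", "dog", "pet"], "indicators": ["t1"]},
]

def recommend(history, docs):
    """Rank unseen items by how many of their indicators appear in the user's history."""
    seen = set(history)
    scored = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # do not recommend what the user already has
        hits = len(seen & set(doc["indicators"]))
        if hits:
            scored.append((hits, doc["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

print(recommend(["t1"], docs))  # prints ['t4']
```

The recommendation is just a query: the user's recent history is the query text, and the indicators field is what it is matched against.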
Cooccurrence Mechanics
• Cooccurrence is just a self-join
    for each user, i
        for each history item j1 in A_i* (row i of history matrix A)
            for each history item j2 in A_i*
                count pair (j1, j2)
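The loop above translates almost directly into Python. This is a sketch under the assumption that the history is held as a dict from user to a list of items; like the pseudocode, it also counts the diagonal pairs (j, j), which gives each item's user count.

```python
from collections import Counter
from itertools import product

def cooccurrence(histories):
    """Count item pairs that appear together in the same user's history."""
    counts = Counter()
    for items in histories.values():
        distinct = set(items)  # each user contributes at most once per pair
        for j1, j2 in product(distinct, repeat=2):
            counts[(j1, j2)] += 1
    return counts

histories = {
    "alice": ["apple", "puppy"],
    "bob": ["apple"],
    "charles": ["bicycle"],
}
counts = cooccurrence(histories)
print(counts[("apple", "puppy")], counts[("apple", "apple")])
```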
Cross-occurrence Mechanics
• Cross-occurrence is just a self-join of adjoined matrices
    for each user, i
        for each history item j1 in A_i* (row i of history matrix A)
            for each history item j2 in B_i* (row i of history matrix B)
                count pair (j1, j2)
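A cross-occurrence sketch differs only in that the two loops read from different behaviors. The purchase and view data below are made up for illustration; the slides do not say which two behaviors A and B represent.

```python
from collections import Counter

def cross_occurrence(a_histories, b_histories):
    """Count pairs (item in A, item in B) that share a user."""
    counts = Counter()
    for user, a_items in a_histories.items():
        b_items = set(b_histories.get(user, ()))  # users missing from B contribute nothing
        for j1 in set(a_items):
            for j2 in b_items:
                counts[(j1, j2)] += 1
    return counts

# Hypothetical example: A holds purchases, B holds page views.
purchases = {"alice": ["puppy"], "bob": ["puppy"], "charles": ["bicycle"]}
views = {"alice": ["dog food", "leash"], "bob": ["leash"]}
cross = cross_occurrence(purchases, views)
print(cross[("puppy", "leash")])
```

Passing the same dict for both arguments reduces this to plain cooccurrence.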
A word about scaling
A few pragmatic tricks
• Downsample all user histories to max length (interaction cut)
– Can be random or most-recent (no apparent effect on accuracy)
– Prolific users are often pathological anyway
– Common limit is 300 items (no apparent effect on accuracy)
• Downsample all items to limit max viewers (frequency limit)
– Can be random or earliest (no apparent effect)
– Ubiquitous items are uninformative
– Common limit is 500 users (no apparent effect)
Schelter, et al. Scalable similarity-based neighborhood methods with MapReduce.
Proceedings of the sixth ACM conference on Recommender systems. 2012
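Both tricks can be sketched as simple filters over a time-ordered stream of (user, item) events. The version below uses the deterministic variants mentioned above (most-recent for the interaction cut, earliest for the frequency limit); the function names and the event format are assumptions.

```python
from collections import defaultdict

def interaction_cut(events, k_max):
    """Keep only the k_max most recent items per user (k_max >= 1 assumed)."""
    per_user = defaultdict(list)
    for user, item in events:
        per_user[user].append(item)
    return {user: items[-k_max:] for user, items in per_user.items()}

def frequency_limit(events, f_max):
    """Keep only the earliest f_max events for each item."""
    seen = defaultdict(int)
    kept = []
    for user, item in events:
        if seen[item] < f_max:
            seen[item] += 1
            kept.append((user, item))
    return kept

events = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a"), ("u3", "a")]
print(interaction_cut(events, 2))
print(frequency_limit(events, 2))
```

With the limits from the slide, the calls would be interaction_cut(events, 300) and frequency_limit(events, 500).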
But note!
• Number of pairs for a user history with k_i distinct items is ≈ k_i²/2
• Average size of user history increases with increasing dataset
– Average may grow more slowly than N (or not!)
– Full cooccurrence cost grows strictly faster than N
– i.e. it just doesn’t scale
• Downsampling interactions places bounds on per user cost
– Cooccurrence with interaction cut is scalable
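The arithmetic behind this is easy to check. The sketch below counts unordered pairs exactly as k(k-1)/2, which is close to k²/2 for large k, and shows how a cut caps the cost of one prolific user; the history sizes are invented for the example.

```python
def pair_count(k):
    """Number of unordered pairs among k distinct items."""
    return k * (k - 1) // 2

def total_pairs(history_sizes, k_max=None):
    """Total pairs over all users, optionally capping each history at k_max items."""
    total = 0
    for k in history_sizes:
        if k_max is not None:
            k = min(k, k_max)
        total += pair_count(k)
    return total

sizes = [10, 1000]  # one ordinary user and one prolific user
print(total_pairs(sizes))             # no cut: dominated by the prolific user
print(total_pairs(sizes, k_max=300))  # with an interaction cut of 300
```

With the cut, no user can contribute more than pair_count(k_max) pairs, so total cost is at most linear in the number of users.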
[Figure: Benefit of down-sampling. Pairs (×10^9) versus user limit (0 to 1000), shown without down-sampling and for track limits of 1000, 500, and 200. Computed on 48,373,586 pair-wise triples from the million song dataset.]
Batch Scaling in Time Implies Scaling in Space
• Note:
– With frequency limit sampling, max cooccurrence count is small (<1000)
– With interaction cut, total number of non-zero pairs is relatively small
– Entire cooccurrence matrix can be stored in memory in ~10-15 GB
• Specifically:
– With interaction cut, cooccurrence scales in size
– Without interaction cut, cooccurrence does not scale size-wise
Impact of Interaction Cut Downsampling
• Interaction cut allows batch cooccurrence analysis to be O(N) in
time and space
• This is intriguing
– Amortized cost is low
– Could this be extended to an on-line form?
• Incremental matrix factorization is hard
– Could cooccurrence be a key alternative?
• Scaling matters most at scale
– Cooccurrence is very accurate at large scale
– Factorization shows benefits at smaller scales
Online update
Requirements for Online Algorithms
• Each unit of input must require O(1) work
– Theoretical bound
• The constants have to be small enough on average
– Pragmatic constraint
• Total accumulated data must be small (enough)
– Pragmatic constraint
[Figure: Real-time recommendations using MapR data platform. Components shown: log files and item meta-data (via NFS), MapR cluster with batch co-occurrence, search technology, web tier, new user history, recommendations. Annotations: “Recommendations happen in real-time” and, on the batch co-occurrence step, “Want this to be real-time”.]
Space Bound Implies Time Bound
• Because user histories are pruned, only a limited number of
value updates need be made with each new observation
• This bound is just twice the interaction cut k_max
– Which is a constant
• Bounding the number of updates trivially bounds the time
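A minimal sketch of an online update under an interaction cut is shown below. It assumes the simplest pruning rule (ignore new observations for users already at k_max) and symmetric pair counts; the class and method names are illustrative. Each observation touches at most 2 × k_max counters.

```python
from collections import Counter, defaultdict

class OnlineCooccurrence:
    """Incremental pair counts with per-user histories capped at k_max items."""

    def __init__(self, k_max):
        self.k_max = k_max
        self.history = defaultdict(list)
        self.counts = Counter()

    def observe(self, user, item):
        """Apply one (user, item) observation; return the number of count updates made."""
        items = self.history[user]
        if item in items or len(items) >= self.k_max:
            return 0  # repeat observation, or user already at the interaction cut
        for other in items:
            self.counts[(item, other)] += 1
            self.counts[(other, item)] += 1
        updates = 2 * len(items)
        items.append(item)
        return updates

oc = OnlineCooccurrence(k_max=2)
for item in ["apple", "puppy", "pony"]:
    print(item, oc.observe("bob", item))
```

Because the loop runs over a history of bounded length, the work per observation is bounded by a constant no matter how much data has accumulated.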
Implications for Online Update
With interaction cut at
But Wait, It Gets Better
• The new observation may be pruned
– For users at the interaction cut, we can ignore updates
– For items at the frequency cut, we can ignore updates
– Ignoring updates only affects indicators, not recommendation query
– At million song dataset size, half of all updates are pruned
• On average k_i is much less than the interaction cut
– For million song dataset, average appears to grow with log of frequency
limit, with little dependency on values of interaction cut > 200
• LLR cutoff avoids almost all updates to index
• Average grows slowly with frequency cut
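The fraction of observations that get pruned can be measured with a single pass over an event stream. This sketch assumes distinct (user, item) events in time order and treats an observation as pruned when either the user is at the interaction cut or the item is at the frequency cut; the tiny event list is made up and says nothing about the million song dataset.

```python
from collections import defaultdict

def pruned_fraction(events, k_max, f_max):
    """Fraction of observations ignored because of the interaction or frequency cut."""
    user_len = defaultdict(int)   # items kept so far per user
    item_freq = defaultdict(int)  # users kept so far per item
    pruned = 0
    total = 0
    for user, item in events:
        total += 1
        if user_len[user] >= k_max or item_freq[item] >= f_max:
            pruned += 1
            continue
        user_len[user] += 1
        item_freq[item] += 1
    return pruned / total if total else 0.0

stream = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a"), ("u3", "a"), ("u4", "a")]
print(pruned_fraction(stream, k_max=2, f_max=2))
```

Pruned observations cost two counter lookups and nothing else, which is why they lower the average update cost.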
[Figure: k_ave versus interaction cut (k_max), 0 to 1000, for frequency cut = 1000, 500, and 200.]
[Figure: k_ave versus frequency cut, 0 to 1000.]
Recap
• Cooccurrence-based recommendations are simple
– Deploying with a search engine is even better
• Interaction cut and frequency cut are key to batch scalability
• Similar effect occurs in online form of updates
– Only dozens of updates per transaction needed
– Data structure required is relatively small
– Very, very few updates cause search engine updates
• Fully online recommendation is very feasible, almost easy
More Details Available
available for free at http://www.mapr.com/practical-machine-learning
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
Apache Drill http://incubator.apache.org/drill/
Twitter @ApacheDrill
Q&A
@mapr maprtech
tdunning@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time


Editor's Notes

  • #18: Mention that the Pony book said “RowSimilarityJob”…