Extend Your ML Pipelines to Serve Production | Pipeline.IO
advancedspark.com | pipeline.io
Who Am I?
Streaming Data Engineer
Netflix OSS Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced .
Due Soon
Advanced Apache Spark Meetup
http://pipeline.io
Meetup Metrics
Top 10 Most-active Spark Meetup!
~4000 Members in just 12 mos!!
5100+ Docker downloads (demos)
Meetup Mission
Code deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance
Live, Interactive Demo!
Audience Participation Required!!
Cell Phone Compatible!!!
http://demo.pipeline.io
[Demo architecture diagram. Left-side labels: End User, NetflixOSS, Redis, TensorFlow, Data Scientist. Right-side labels: Kafka, Spark Streaming, Cassandra, Redis, Zeppelin, iPython.]
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
Scaling with Parallelism
[Figure labels: Peter; O(log n); O(log n); Worker Nodes]
Parallelism with Composability
Worker 1 Worker 2
Max (a max b max c max d) == (a max b) max (c max d)
Set Union (a U b U c U d) == (a U b) U (c U d)
Addition (a + b + c + d) == (a + b) + (c + d)
Multiply (a * b * c * d) == (a * b) * (c * d)
What about Division and Average?
Collect at Driver
What about Division?
Division (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
What were the Egyptians thinking?!
Not Composable
“Divide like
an Egyptian”
What about Average?
Overall AVG (values, counts): Add, Add, Add? Composable!
(3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5
Single-Node Divide at the End? Doesn’t need to be Composable!
AVG (3, 5, 5, 7) == 5
Pairwise AVG: Divide, Add, Divide? Not Composable
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
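The composable form of the average can be sketched as follows: each worker reduces its values to a (sum, count) pair, the pairs are added, and the single division happens at the end. The worker split and the function names are assumptions made for illustration.

```python
from functools import reduce

def to_pair(value):
    # Lift a raw value into a (sum, count) pair.
    return (value, 1)

def combine(left, right):
    # Adding pairs is associative, so partial results can be merged in any grouping.
    return (left[0] + right[0], left[1] + right[1])

def finish(pair):
    # The one non-composable step (division) runs once, at the end.
    total, count = pair
    return total / count

worker_1 = [3, 5]  # hypothetical partition on worker 1
worker_2 = [5, 7]  # hypothetical partition on worker 2

partial_1 = reduce(combine, map(to_pair, worker_1))
partial_2 = reduce(combine, map(to_pair, worker_2))
overall = finish(combine(partial_1, partial_2))
print(partial_1, partial_2, overall)
```

Only the small (sum, count) pairs travel between workers and the driver; the raw values stay where they are.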
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
Similarities
Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude
Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalize to unit vectors in all dimensions
Used with real-valued vectors (versus binary)
org.jblas.DoubleMatrix
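To make the magnitude bias concrete, here is a minimal pure-Python sketch (not the org.jblas code from the demo) that compares Euclidean distance with cosine similarity. The two rating vectors are made up: the second viewer has the same taste as the first but ten times the activity.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance; grows with vector magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; ignores magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

light_viewer = [1.0, 2.0, 0.0]    # hypothetical per-genre view counts
heavy_viewer = [10.0, 20.0, 0.0]  # same proportions, 10x the volume

print(euclidean_distance(light_viewer, heavy_viewer))
print(cosine_similarity(light_viewer, heavy_viewer))
```

The Euclidean distance is large even though the two vectors point in the same direction, while the cosine similarity is (up to rounding) 1.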
Jaccard Similarity
Set similarity measurement
Set intersection / set union
Bias towards popularity
Works with binary vectors
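A small sketch of the intersection-over-union calculation, using Python sets as the binary vectors. The tag sets are invented for the example.

```python
def jaccard_similarity(a, b):
    # |A intersect B| / |A union B|; defined as 0.0 when both sets are empty.
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical tag sets for two movies.
tags_a = {"action", "military", "aviation", "80s"}
tags_b = {"drama", "military", "80s"}

similarity = jaccard_similarity(tags_a, tags_b)
print(similarity)  # 2 shared tags out of 5 distinct tags
```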
Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem
Word Similarity
Edit Distance
Misspellings and autocorrect
Word2Vec
Similar words are defined by similar contexts in vector space
English Spanish
Demo!
Find Synonyms with Word2Vec
Find Synonyms using Word2Vec
Document Similarity
TF/IDF
Term Freq / Inverse Document Freq
Used by most search engines
Doc2Vec
Similar documents are determined by similar contexts
Bonus! Text Rank Document Summary
Text Rank (aka Sentence Rank)
Surface summary sentences
TF/IDF + Similarity Graph + PageRank
Most similar sentence to all other sentences
TF/IDF + Similarity Graph
Most influential sentences
PageRank
Similarity Pathways (Recommendations)
Best recommendations for 2 (or more) people
“You like Mad Max. I like Message in a Bottle.
We might like a movie similar to both.”
Item-to-Item Similarity Graph + Dijkstra Heaviest Path
Demo!
Similarity Pathway for Movie Recommendations
Load Movies with Tags into DataFrame
My
Choice
Their
Choice
Item-to-Item Tag Jaccard Similarity
Based on Tags
Calculate Jaccard Similarity
(Tag Set Similarity)
Must be Above the Given
Jaccard Similarity Threshold
Item-to-Item Tag Similarity Graph
Edge Weights
==
Jaccard Similarity
(Based on Tag Sets)
Use Dijkstra to Find Heaviest Pathway
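The demo code is not reproduced in these slides, so what follows is only one possible reading of "heaviest pathway". It assumes edge weights are Jaccard similarities in (0, 1], scores a path by the product of its similarities, and converts each edge to the cost -log(similarity) so that standard shortest-path Dijkstra finds the strongest path. The graph, the movie names other than the two from the slide, and every weight are made up.

```python
import heapq
import math

def strongest_path(graph, source, target):
    # Dijkstra over cost = -log(similarity); minimal cost == maximal product.
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    visited = set()
    while queue:
        cost, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, similarity in graph.get(node, {}).items():
            new_cost = cost - math.log(similarity)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return [], 0.0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])

# Hypothetical item-to-item similarity graph (directed, weights invented).
graph = {
    "Mad Max": {"Movie A": 0.9, "Movie B": 0.5},
    "Movie A": {"Movie B": 0.6},
    "Movie B": {"Message in a Bottle": 0.4, "Movie C": 0.7},
    "Movie C": {"Message in a Bottle": 0.5},
}

path, strength = strongest_path(graph, "Mad Max", "Message in a Bottle")
print(path, strength)
```

The movies in the middle of the returned path are the candidates to recommend to both people.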
Calculating Exact Similarity
Brute-Force Similarity
Cartesian Product
O(n^2) shuffle and compute
a.k.a. All-pairs, Pair-wise, Similarity Join
Calculating Approximate Similarity
Goal: Reduce Shuffle
Approximate Similarity
Sampling
Bucketing or Clustering
Ignore pairs with a low probability of similarity
Locality Sensitive Hashing
Twitter Algebird MinHash
Bucket
By Genre
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
Recommendations
Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: user knows they are rating or liking, can choose to dislike
Implicit User Feedback: user not explicitly aware, cannot dislike (click, hover, etc)
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances from training to detect overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features
Loss Function: Function we’re trying to minimize, such as least-squares error for Linear Regression
Cross Entropy: Loss function used for classification algorithms such as Logistic Regression
Optimizer: Technique to optimize loss function such as Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD)
Optimizes Loss Function
Least Squared Error b/w predicted and actual value
Cross Entropy Log Likelihood b/w predicted and actual probability
2-Dimensional 3-Dimensional
Features
Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)
Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription
Feature Engineering
Dimension Reduction
Reduce number of features in feature space
Principal Component Analysis (PCA)
Find principal components that best describe data variance
Peel dimensional layers back
One-Hot Encoding
Convert nominal categorical feature values into 0’s and 1’s
Remove any numerical relationship between categories
Bears    -> 1    -->    Bears    -> [1.0, 0.0, 0.0]
49’ers   -> 2    -->    49’ers   -> [0.0, 1.0, 0.0]
Steelers -> 3    -->    Steelers -> [0.0, 0.0, 1.0]
Convert Each Item to Binary Vector with Single 1.0 Column
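The team example can be sketched in plain Python as below. This is an illustration of the idea, not Spark's OneHotEncoder; the helper names are assumptions, and categories are indexed in order of first appearance.

```python
def fit_categories(values):
    # Map each distinct category to a column index, in order of first appearance.
    return {category: i for i, category in enumerate(dict.fromkeys(values))}

def one_hot(value, index):
    # Binary vector with a single 1.0 in the category's column.
    vector = [0.0] * len(index)
    vector[index[value]] = 1.0
    return vector

teams = ["Bears", "49'ers", "Steelers"]
index = fit_categories(teams)
encoded = {team: one_hot(team, index) for team in teams}
print(encoded)
```

Unlike the integer codes 1, 2, 3, the vectors carry no implied ordering or distance between teams.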
Feature Normalization & Standardization
Goal
Scale features to standard size
Prevent unbounded features
Helps avoid overfitting
Required by many ML algos
Normalize Features
Calculate L1 (or L2, etc.) norm, then divide each element by it
Standardize Features
Apply standard normal transformation (mean->0, stddev->1)
org.apache.spark.ml.feature.[Normalizer, StandardScaler]
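The two transformations differ in what they operate on: normalization rescales one feature vector (a row) by its norm, while standardization rescales one feature (a column) across all rows. The sketch below is plain Python rather than the Spark classes named above; using the sample standard deviation and requiring at least two values per column are assumptions.

```python
import math
import statistics

def l2_normalize(vector):
    # Divide each element by the vector's L2 norm (leave an all-zero vector unchanged).
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def standardize(column):
    # Shift to mean 0 and scale to standard deviation 1 (sample stddev assumed).
    mean = statistics.mean(column)
    stddev = statistics.stdev(column)
    if stddev == 0:
        return [0.0] * len(column)
    return [(x - mean) / stddev for x in column]

row = [3.0, 4.0]            # hypothetical feature vector for one instance
column = [2.0, 4.0, 6.0]    # hypothetical values of one feature across instances

normalized_row = l2_normalize(row)
standardized_column = standardize(column)
print(normalized_row, standardized_column)
```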
http://www.mathsisfun.com/data/standard-normal-distribution.html
Non-Personalized Recommendations
Cold Start Problem
“Cold Start” problem
New user, don’t know their preferences, must show something!
Movies with highest-rated actors
Top K aggregations
Facebook social graph
Friend-based recommendations
Most desirable singles
PageRank of likes and dislikes
Demo!
GraphFrame PageRank
Example: Dating Site “Like” Graph
PageRank of Top Influencers
Personalized Recommendations
Demo!
Personalized PageRank
Personalized PageRank: Outbound Links
0.15 = (1 - 0.85 “Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random
85% Among
Outbound
Network
Personalized PageRank: No Outbound
0.15 = (1 - 0.85 “Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random
85% Among
No
Outbound
Network!!
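A rough sketch of the 85% / 15% rule as a random walk with restart, in plain Python rather than GraphFrames. It makes two simplifying assumptions: the 15% always jumps back to the source user (not to a random node), and a node with no outbound links sends all of its score back to the source. The "like" graph is invented.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    rank = {n: (1.0 if n == source else 0.0) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        for node, score in rank.items():
            targets = graph.get(node, [])
            if targets:
                # 85%: spread evenly among the outbound network.
                share = damping * score / len(targets)
                for target in targets:
                    new_rank[target] += share
                # 15%: restart at the source.
                new_rank[source] += (1 - damping) * score
            else:
                # No outbound network: everything restarts at the source.
                new_rank[source] += score
        rank = new_rank
    return rank

# Hypothetical "like" graph: who likes whom.
likes = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": [],
    "dave": ["alice"],
}

ranks = personalized_pagerank(likes, "alice")
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

Because the walk always restarts at the source, users the source cannot reach (here, the one who only has an inbound link to the source) receive no score.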
User-to-User Clustering
User Similarity
Time-based
Pattern of viewing (binge or casual)
Time of viewing (am or pm)
Ratings-based
Content ratings or number of views
Average rating relative to others (critical or lenient)
Search-based
Search terms
Item-to-Item Clustering
Item Similarity
Profile text (TF/IDF, Word2Vec, NLP)
Categories, tags, interests (Jaccard Similarity, LSH)
Images, facial structures (Neural Nets, Eigenfaces)
Dating Site Example…
Cluster Similar Eigenfaces | Cluster Similar Profiles | Cluster Similar Categories
Bonus: NLP Conversation Starter Bot
“If your responses to my generic opening
lines are positive, I may read your profile.”
Spark ML, Stanford CoreNLP,
TF/IDF, DecisionTrees, Sentiment
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Bonus: Demo!
Spark + Stanford CoreNLP Sentiment Analysis
Bonus: Top 100 Country Song Sentiment
Bonus: Surprising Results…?!
Item-to-Item Based Recommendations
Based on Metadata: Genre, Description, Cast, City
Demo!
Item-to-Item-based Recommendations
One-Hot Encoding + K-Means Clustering
One-Hot Encode Tag Feature Vectors
Cluster Movie Tag Feature Vectors
Hyperparameter
Tuning
(K Clusters?)
Analyze Movie Tag Clusters
User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Lower-rank matrices approximate original when multiplied
③ Fill in the missing values of the large matrix
④ Surface k (rank) latent features from user-item interactions
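The four steps above can be sketched on a toy ratings list. The Spark demos use ALS; this illustration swaps in plain stochastic gradient descent because it fits in a few lines of pure Python. The ratings, the rank, and all hyperparameter values are made up.

```python
import math
import random

def predict(U, V, user, item):
    # Dot product of the user's and item's latent factor vectors.
    return sum(a * b for a, b in zip(U[user], V[item]))

def factorize(ratings, n_users, n_items, rank=2, epochs=1000, lr=0.01, reg=0.02, seed=42):
    # Learn user factors U and item factors V from (user, item, rating) triples.
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for user, item, rating in ratings:
            error = rating - predict(U, V, user, item)
            for k in range(rank):
                u_k, v_k = U[user][k], V[item][k]
                U[user][k] += lr * (error * v_k - reg * u_k)
                V[item][k] += lr * (error * u_k - reg * v_k)
    return U, V

def rmse(ratings, U, V):
    squared = [(r - predict(U, V, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical sparse ratings: (userId, itemId, rating); 3 of the 9 cells are missing.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]

U0, V0 = factorize(ratings, 3, 3, epochs=0)   # untrained factors, for comparison
U, V = factorize(ratings, 3, 3)
print(rmse(ratings, U0, V0), rmse(ratings, U, V))
print(predict(U, V, 0, 2))  # fills in a missing cell of the large matrix
```

The rows of U and V are the k latent features per user and per item; the later clustering slides operate on exactly these factor matrices.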
Item-to-Item Collaborative Filtering
Famous Amazon Paper circa 2003
Problem
As users grew, user-to-item collaborative filtering didn’t scale
Solution
Item-to-item similarity, nearest neighbors
Offline (Batch)
Generate itemId->List[userId] vectors
Online (Real-time)
From cart, recommend nearest-neighbors in vector space
Demo!
Collaborative Filtering-based Recommendations
Fitting the Matrix Factorization Model
Show ItemFactors Matrix from ALS
Show UserFactors Matrix from ALS
Generating Individual Recommendations
Generating Batch Recommendations
Clustering + Collaborative Filtering Recs
Cluster matrix output from Matrix Factorization
Latent features derived from user-item interaction
Item-to-Item Similarity
Cluster item-factor matrix->
User-to-User Similarity
<-Cluster user-factor matrix
Demo!
Clustering + Collaborative Filtering-based Recommendations
Show ItemFactors Matrix from ALS
Convert Item Factors to mllib.Vector
Required by K-Means Clustering Algorithm
Fit and Evaluate K-Means Cluster Model
Measures Closeness
Of Points Within Clusters
K = 5 Clusters
Netflix Genres and Clusters
Typical Genres
Documentary, Romance, Comedy, Horror, Action, Adventure
Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (approx # errors after a release)
Using machine learning or graph algos
Inherently probabilistic and approximate
Streaming aggregations
Inherently sloppy collection (exactly once?)
Approximate as much as you can get away with!
Ask for forgiveness later!!
When NOT to Approximate?
If you’ve ever heard the term…
“Sarbanes-Oxley”
…at the office.
A Few Good Algorithms
You can’t handle
the approximate!
Common to These Algos & Data Structs
Low, fixed size in memory
Store large amount of data
Known error bounds
Tunable tradeoff between size and error
Less memory than Java/Scala collections
Rely on multiple hash functions or operations
Size of hash range defines error
Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
Bloom Filter
Approximate Set.contains(key)
No means No, Yes means Maybe
Elements can only be added
Never updated or removed
Bloom Filter in Action
set(key) contains(key): Boolean
Images by @avibryant
Set.contains(key): TRUE -> maybe contains (other key hashes may overlap)
Set.contains(key): FALSE -> definitely does not contain (no key flipped all bits)
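"Hash multiple times and flip the bits wherever you land" can be sketched in a few lines. This is an illustrative toy, not the Algebird implementation: the class name, the default sizes, and the use of seeded SHA-256 as the family of hash functions are all assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One bit position per hash function (seeded SHA-256 stands in for k hashes).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        # Elements can only be added: bits are flipped on, never off.
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False means definitely absent; True means maybe present.
        return all(self.bits[position] for position in self._positions(key))

seen = BloomFilter()
seen.add("Top Gun")
print(seen.might_contain("Top Gun"))
print(BloomFilter().might_contain("Taps"))  # an empty filter contains nothing
```

A key that was never added can still report True when other keys happen to have flipped all of its bits; increasing num_bits lowers that false-positive rate at the cost of memory.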
CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
CountMin Sketch (CMS)
Approximate frequency count and TopK for key
e.g. “Heavy Hitters” on Twitter
Matei Zaharia Martin Odersky Donald Trump
CountMin Sketch In Action (TopK Count)
Use Case: TopK movies using total views
add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
x 2 occurrences of “Top Gun” for slightly additional complexity
Find minimum of all rows
Can overestimate, but never underestimate
[Figure: sketch rows showing where “Top Gun” (x 2), “A Few Good Men” and “Taps” land; some cells overlap Top Gun and A Few Good Men]
Images derived from @avibryant
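The add / getCount operations from this slide can be sketched as a small table with one hash function per row. This is a toy illustration, not the Algebird CMS: the class name, the depth and width defaults, and seeded SHA-256 as the row hashes are assumptions.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One column per row, chosen by that row's hash function.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, column in self._columns(key):
            self.table[row][column] += count

    def get_count(self, key):
        # Collisions only ever inflate a cell, so the minimum is the best estimate.
        return min(self.table[row][column] for row, column in self._columns(key))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
print(views.get_count("Top Gun"))
```

Memory stays fixed at depth x width counters no matter how many distinct keys are added; a wider table reduces overlap and therefore overestimation.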
HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
HyperLogLog (HLL)
Approximate count distinct
Slight twist
Special hash function creates uniform distribution
Hash subsets of data with single, special hash func
Error estimate
14 bits for size of range
m = 2^14 = 16,384 hash slots
error = 1.04 / sqrt(16,384) = 0.81%
Not many of these
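The error estimate above is easy to check, and the same arithmetic shows the size/error tradeoff for other register counts. The 6 bits per register used for the memory figure is an assumption matching Redis's dense HyperLogLog encoding.

```python
import math

def hll_standard_error(index_bits):
    # Standard error of HyperLogLog with m = 2^index_bits registers.
    m = 2 ** index_bits
    return 1.04 / math.sqrt(m)

def hll_register_bytes(index_bits, bits_per_register=6):
    # Memory for the registers alone (6 bits each is an assumption, as in Redis).
    return (2 ** index_bits) * bits_per_register // 8

for bits in (10, 12, 14):
    print(bits, 2 ** bits, f"{hll_standard_error(bits):.2%}", hll_register_bytes(bits))
```

With 14 index bits the registers occupy 12,288 bytes, which lines up with the 12 KB figure quoted for Redis later in the deck.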
HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie
[Figure: hashed user IDs placed along a fixed range
Top Gun: Hour 1: users 1001, 2009, 3005, 3003, 3001, 7009
Top Gun: Hour 2: users 2001, 4009, 3002, 7002, 1005, 6001, 8001, 8002
Top Gun: Hour 1 + 2: all of the above, combined across different scales (axes shown: 0 to 16 and 0 to 32)]
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
Locality Sensitive Hashing (LSH)
Approximate set similarity
Pre-process m rows into b buckets
b << m; b = buckets, m = rows
Hash items multiple times
** Similar items hash to overlapping buckets
** Hash designed to cluster similar items
Compare just contents of buckets
Much smaller cartesian compare
** Compare in parallel !!
Avoids huge cartesian all-pairs compare
Chapter 3: LSH
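One common way to build such buckets for set similarity is MinHash signatures split into bands. The sketch below is an illustration under assumptions (seeded SHA-256 hashes, 4 bands of 2 rows, non-empty made-up tag sets), not the Algebird MinHash code used in the demo.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes):
    # For each hash function, keep the smallest hash over the set (set must be non-empty).
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def candidate_pairs(tag_sets, bands=4, rows=2):
    # Items whose signatures agree on every row of at least one band share a bucket.
    buckets = defaultdict(set)
    for name, tags in tag_sets.items():
        signature = minhash_signature(tags, bands * rows)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

# Hypothetical tag sets; the first two are identical on purpose.
movies = {
    "Movie A": {"military", "aviation", "80s"},
    "Movie B": {"military", "aviation", "80s"},
    "Movie C": {"romance", "beach", "letters"},
}

pairs = candidate_pairs(movies)
print(pairs)
```

Only the candidate pairs are then compared exactly (for example with Jaccard similarity), which replaces the all-pairs cartesian compare with many small per-bucket compares that can run in parallel.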
DIMSUM
Set Similarity
“Pre-process and ignore data that is unlikely to be similar.”
DIMSUM
“Dimension Independent Matrix Square Using MapReduce”
Remove vectors with low probability of similarity
RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
40% efficiency gain over brute-force Cosine Sim
Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing
Twitter Algebird
Algebraic Fundamentals
Parallel
Associative
Composable
Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)
Redis
Implementation of HyperLogLog (Count Distinct)
12 KB per HyperLogLog key
2^64 max # of items
0.81% error
Add user views for given movie
PFADD TopGun_Hour1_HLL user1001 user2009 user3005
PFADD TopGun_Hour1_HLL user3003 user1001
Get distinct count (cardinality) of set
PFCOUNT TopGun_Hour1_HLL
Returns: 4 (distinct users viewed this movie)
Union 2 HyperLogLog Data Structures
PFMERGE TopGun_Hour1_HLL TopGun_Hour2_HLL
ignore duplicates
Tunable
Approximations in Spark Libraries
Spark Core
countByKeyApprox(timeout: Long, confidence: Double)
PartialResult
Spark SQL
approxCountDistinct(e: Column, rsd: Double)
approxQuantile(col: String, probabilities: Array[Double], relativeError: Double)
Spark ML
Stratified sampling
sampleByKey(fractions: Map[K, Double])
DIMSUM sampling
Probabilistic sampling reduces amount of shuffle
RowMatrix.columnSimilarities(threshold: Double)
Demo!
Exact Count vs. Approximate HLL and CMS Count
HashSet vs. HyperLogLog (Memory)
HashSet vs. CountMin Sketch (Memory)
Demo!
Exact Similarity vs. Approximate LSH Similarity
Brute Force Cartesian All Pair Similarity
47 seconds
Locality Sensitive Hash All Pair Similarity
6 seconds
Many More Demos!
or
Download Docker Clone on Github
http://advancedspark.com
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
Netflix Recommendations
From Ratings to Real-time
Netflix Has a Lot of Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.
The UK doesn’t have White Castle.
Renamed my favourite movie to:
“Harold and Kumar
Get the Munchies”
My favorite movie:
“Harold and Kumar
Go to White Castle”
Summary: Buy NFLX Stock!
This broke my unit tests!
Netflix Data Pipeline - Then
v1.0
v2.0
Netflix Data Pipeline – Now (Keystone)
v3.0
9 million events per second
22 GB per second!!
EC2 D2XL
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps
Auto-scaling,
Fault tolerance
A/B Tests,
Trending Now
SAMZA
Splits high and
normal priority
Netflix Recommendation Data Pipeline
Throw away
batch user
factors (U)
Keep
batch video
factors (V)
Netflix Trending Now (Time-based Recs)
Uses Spark Streaming
Personalized to user (viewing history, past ratings)
Learns and adapts to events (Valentine’s Day)
“VHS”
Number of
Plays
Number of
Impressions
Calculate
Take Rate
Bonus: Pandora Time-based Recs
Work Days
Play familiar music
User is less likely to accept new music
Evenings and Weekends
Play new music
User is more likely to accept new music
$1 Million Netflix Prize (2006-2009)
Goal
Improve movie predictions by 10% (Root Mean Sq Error)
Test data withheld to calculate RMSE upon submission
5-star Ratings Dataset
(userId, movieId, rating, timestamp)
Winning algorithm(s)
10.06% improvement (RMSE)
Ensemble of 500+ ML models combined with GBDTs
Computationally impractical
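The contest metric and the "improve by 10%" target can be written out as below. The ratings and the two RMSE scores in the example are toy values chosen for illustration, not Netflix Prize data.

```python
import math

def rmse(predicted, actual):
    # Root mean squared error between predicted and actual ratings.
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def relative_improvement(baseline_rmse, candidate_rmse):
    # Fraction by which the candidate lowers the baseline's RMSE.
    return (baseline_rmse - candidate_rmse) / baseline_rmse

predicted = [3.0, 4.0, 5.0]   # hypothetical predicted star ratings
actual = [3.0, 5.0, 3.0]      # hypothetical held-out star ratings

print(rmse(predicted, actual))
print(relative_improvement(1.0, 0.9))  # a made-up 10% improvement
```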
Secrets to the Winning Algorithms
Adjust for the following human biases…
① Alice effect: user rates lower than avg
② Inception effect: movie rated higher than avg
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Number of days since user’s first rating
⑥ Number of days since movie’s first rating
⑦ Mood, time of day, day of week, season, weather
Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
Ensembles!
Netflix Genres and Clusters
Typical Genres
Documentaries, Romance, Comedies, Horror, Action, Adventure
Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy
Netflix Social Integration
Post to Facebook after movie start (5 mins)
Recommend to new users based on friends
Helps with Cold Start problem
Netflix Search
No results? No problem… Show similar results!
Utilize extensive DVD Catalog
Metadata search (ElasticSearch)
Named entity recognition (NLP)
Empty searches are an opportunity!
Explicit feedback for future recommendations
Content to buy and produce!
Netflix A/B Tests
Users tend to click on images featuring…
Faces with strong emotional expressions
Villains over heroes
Small number of cast members
Netflix Recommendation Serving Layer
Use Case: Recommendation service depends on EVCache
Problem: EVCache cluster goes down or becomes slow (high latency)!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Circuit States
Closed: Service OK
Open: Service DOWN
Fallback to Static
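The circuit states above can be sketched in a few lines. Hystrix itself is a Java library; the Python class below is only a minimal illustration of the pattern, not the Hystrix API. The class name, the failure threshold, the reset timeout, and the trial call allowed after the timeout are all assumptions added for the example.

```python
import time

class CircuitBreaker:
    """Closed: call the primary service. Open: skip it and use the fallback."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the timeout has elapsed, let a trial call through again.
        return self.clock() - self.opened_at < self.reset_timeout_s

    def call(self, primary, fallback):
        if self.is_open():
            return fallback()              # service assumed DOWN: do not call it
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the circuit
            return fallback()
        self.failures = 0                  # success: close the circuit
        self.opened_at = None
        return result

# Hypothetical static recommendations served when the cache is unavailable.
STATIC_RECS = ["popular_1", "popular_2"]

def cache_lookup_down():
    raise ConnectionError("cache unavailable")

breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=60.0)
results = [breaker.call(cache_lookup_down, lambda: STATIC_RECS) for _ in range(3)]
```

The first two calls fail and trip the circuit; the third never touches the cache and is answered directly from the static fallback.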
Why Higher Average Ratings 2004+?
In 2004, Netflix noticed higher ratings on average
Some possible reasons why…
① Significant UI improvements deployed
② New recommendation engine deployed
③
Thank You, Everyone!!
Chris Fregly @cfregly
Research Scientist @ PipelineIO
San Francisco, California, USA
http://fluxcapacitor.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me on LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/
Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
Data Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
Data Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
Data Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
Data Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
Data Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
Data Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
Data Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
Data Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA
 


Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Real-time Aggregations, Approxs, Similarities, and Recommendations at Scale using Spark Kafka, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird - Chris Fregly, Research Scientist, PipelineIO

  • 1. Extend Your ML Pipelines to Serve Production (Pipeline.IO). advancedspark.com, pipeline.io
  • 2. Who Am I? Streaming Data Engineer; Netflix OSS Committer; Data Solutions Engineer; Apache Contributor; Principal Data Solutions Engineer; IBM Technology Center; Meetup Organizer; Advanced Meetup; Book Author; Advanced . Due Soon
  • 3. Advanced Apache Spark Meetup (http://pipeline.io). Meetup Metrics: Top 10 Most-active Spark Meetup! ~4000 Members in just 12 months!! 5100+ Docker downloads (demos). Meetup Mission: Code deep-dive into Spark and related open source projects; Surface key patterns and idioms; Focus on distributed systems, scale, and performance
  • 4. Live, Interactive Demo! Audience Participation Required!! Cell Phone Compatible!!!
  • 5. Demo: http://demo.pipeline.io [Architecture diagram; labels: End User, NetflixOSS, Redis, TensorFlow, Data Scientist, Kafka, Spark Streaming, Cassandra, Redis, Zeppelin, iPython]
  • 6. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 7. Scaling with Parallelism [Diagram; labels: Peter, O(log n), Worker Nodes]
  • 8. Parallelism with Composability [Diagram; labels: Worker 1, Worker 2, Collect at Driver]. Max: (a max b max c max d) == (a max b) max (c max d). Set Union: (a U b U c U d) == (a U b) U (c U d). Addition: (a + b + c + d) == (a + b) + (c + d). Multiply: (a * b * c * d) == (a * b) * (c * d). What about Division and Average?
  • 9. What about Division? Division: (a / b / c / d) != (a / b) / (c / d). (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8). (((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7)). 0.0134 != 0.857. Not Composable. What were the Egyptians thinking?! “Divide like an Egyptian”
  • 10. What about Average? Overall AVG, as (value, count) pairs: (3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5. Add, Add, Add? Composable! Single-Node Divide at the End? Doesn’t need to be Composable! AVG (3, 5, 5, 7) == 5. Pairwise AVG: (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5. Divide, Add, Divide? Not Composable.
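The composable form of the average can be sketched in a few lines of plain Python. This is only an illustration, not the speaker's code: the function names (`to_pair`, `combine`, `average`) and the two-partition layout are assumptions, and each partition is assumed to be non-empty.

```python
from functools import reduce


def to_pair(value):
    # Represent each value as a (sum, count) pair, as on the slide.
    return (value, 1)


def combine(a, b):
    # Element-wise addition is associative, so partial results can be
    # merged in any grouping (per worker first, then at the driver).
    return (a[0] + b[0], a[1] + b[1])


def average(partitions):
    # "Workers": reduce each non-empty partition to one (sum, count) pair.
    partials = [reduce(combine, map(to_pair, part)) for part in partitions]
    # "Driver": merge the partial pairs, then divide exactly once at the end.
    total, count = reduce(combine, partials)
    return total / count


print(average([[3, 5], [5, 7]]))   # 5.0
print(average([[3], [5, 5, 7]]))   # 5.0, independent of how values are split
```

Only the additions are distributed; the single division happens after everything has been collected, which is why it does not need to be composable.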
  • 11. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 12. Similarities
  • 13. Euclidean Similarity: Exists in Euclidean, flat space; Based on Euclidean distance; Linear measure; Bias towards magnitude
  • 14. Cosine Similarity: Angular measure; Adjusts for Euclidean magnitude bias; Normalize to unit vectors in all dimensions; Used with real-valued vectors (versus binary); org.jblas.DoubleMatrix
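The magnitude bias of Euclidean distance, and how the cosine measure removes it, can be seen in a small example. This is a minimal sketch in plain Python rather than the org.jblas.DoubleMatrix code named on the slide; the two rating vectors are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    # Straight-line distance: grows with the magnitude of the vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: dividing by both norms
    # removes the effect of magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors pointing in the same direction, at different scales.
light = [1.0, 2.0, 1.0]
heavy = [10.0, 20.0, 10.0]

print(euclidean_distance(light, heavy))  # large, despite identical direction
print(cosine_similarity(light, heavy))   # approximately 1.0
```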
  • 15. Jaccard Similarity: Set similarity measurement; Set intersection / set union; Bias towards popularity; Works with binary vectors
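As a minimal sketch, the intersection-over-union definition translates directly to Python sets. The tag sets below are hypothetical, and returning 0.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard_similarity(a, b):
    # |A intersect B| / |A union B| over two collections treated as sets.
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


# Hypothetical movie tag sets.
tags_one = {"action", "desert", "cars"}
tags_two = {"desert", "cars", "romance"}
print(jaccard_similarity(tags_one, tags_two))  # 0.5 (2 shared tags / 4 distinct tags)
```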
  • 16. Log Likelihood Similarity: Adjusts for popularity bias; Netflix “Shawshank” problem
  • 17. Word Similarity. Edit Distance: Misspellings and autocorrect. Word2Vec: Similar words are defined by similar contexts in vector space [Figure labels: English, Spanish]
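Edit distance is commonly computed as the Levenshtein distance with dynamic programming. The slide does not say which edit-distance variant was meant, so the following is a sketch under that assumption.

```python
def edit_distance(source, target):
    # Levenshtein distance: minimum number of single-character insertions,
    # deletions, and substitutions needed to turn source into target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

A spell checker could, for example, suggest dictionary words within a small distance of the misspelled input.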
  • 18. Demo! Find Synonyms with Word2Vec
  • 19. Find Synonyms using Word2Vec
  • 20. Document Similarity. TF/IDF: Term Freq / Inverse Document Freq; Used by most search engines. Doc2Vec: Similar documents are determined by similar contexts
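To make the TF/IDF weighting concrete, here is a small sketch of one common variant: relative term frequency multiplied by log(N / document frequency). Libraries differ in smoothing and normalization, so this is not claimed to match any particular implementation; whitespace tokenization and the sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    # Naive whitespace tokenization; real pipelines also remove stop words, stem, etc.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the spark", "the kafka"]  # hypothetical two-document corpus
weights = tf_idf(docs)
print(weights[0]["the"])    # 0.0: a term found in every document carries no weight
print(weights[0]["spark"])  # positive: the term is specific to this document
```

The resulting term-weight vectors can then be compared with cosine similarity to score document similarity.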
  • 21. Bonus! Text Rank Document Summary. Text Rank (aka Sentence Rank): Surface summary sentences; TF/IDF + Similarity Graph + PageRank. Most similar sentence to all other sentences: TF/IDF + Similarity Graph. Most influential sentences: PageRank
  • 22. Similarity Pathways (Recommendations): Best recommendations for 2 (or more) people. “You like Mad Max. I like Message in a Bottle. We might like a movie similar to both.” Item-to-Item Similarity Graph + Dijkstra Heaviest Path
  • 23. Demo! Similarity Pathway for Movie Recommendations
  • 24. Load Movies with Tags into DataFrame [Figure labels: My Choice, Their Choice]
  • 25. Item-to-Item Tag Jaccard Similarity Based on Tags: Calculate Jaccard Similarity (Tag Set Similarity); Must be Above the Given Jaccard Similarity Threshold
  • 26. Item-to-Item Tag Similarity Graph: Edge Weights == Jaccard Similarity (Based on Tag Sets)
  • 27. Use Dijkstra to Find Heaviest Pathway
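The slides do not spell out how "heaviest" is defined or how Dijkstra's shortest-path algorithm was adapted to it. One possible reading, sketched below purely as an assumption, is to look for the path that maximizes the product of edge similarities: with similarities in (0, 1], each edge gets the non-negative cost -log(similarity), and ordinary Dijkstra on those costs finds that path. The graph, node names, and weights are hypothetical.

```python
import heapq
import math


def heaviest_path(graph, start, goal):
    """graph maps node -> {neighbor: similarity}, with similarity in (0, 1].

    Assumption: the "heaviest" path maximizes the product of similarities,
    which equals minimizing the sum of -log(similarity) edge costs.
    """
    best_cost = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            break
        if cost > best_cost.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, similarity in graph.get(node, {}).items():
            new_cost = cost - math.log(similarity)
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if goal not in best_cost:
        return None, 0.0
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    path.reverse()
    return path, math.exp(-best_cost[goal])


# Hypothetical item-to-item graph; edge weights stand in for Jaccard similarities.
movies = {
    "A": {"B": 0.9, "C": 0.5},
    "B": {"D": 0.8},
    "C": {"D": 0.9},
    "D": {},
}
path, score = heaviest_path(movies, "A", "D")
print(path, round(score, 2))  # ['A', 'B', 'D'] 0.72
```

For an undirected similarity graph, each edge would be listed in both directions. The items in the middle of the returned path are candidates that are similar to both endpoints.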
  • 28. Calculating Exact Similarity. Brute-Force Similarity: Cartesian Product; O(n^2) shuffle and compute; aka All-pairs, Pair-wise, Similarity Join
  • 29. Calculating Approximate Similarity. Goal: Reduce Shuffle. Approximate Similarity: Sampling; Bucketing or Clustering; Ignore low-similarity probability. Locality Sensitive Hashing: Twitter Algebird MinHash [Figure label: Bucket By Genre]
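The idea behind MinHash can be sketched without any library: the fraction of hash functions on which two sets share the same minimum hash value estimates their Jaccard similarity. This is a plain-Python illustration and not the Twitter Algebird API; the seeded SHA-1 hashing scheme, the helper names, the signature length of 256, and the sample sets are all assumptions, and input sets are assumed to be non-empty.

```python
import hashlib


def _hash(seed, item):
    # Deterministic 64-bit hash per (seed, item); the seed simulates
    # a family of independent hash functions.
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=256):
    # For each hash function, keep only the minimum hash over the set.
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    # The probability that two sets share a minimum equals their Jaccard similarity.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


set_a = {f"tag{i}" for i in range(0, 30)}   # hypothetical tag sets
set_b = {f"tag{i}" for i in range(10, 40)}  # true Jaccard = 20 / 40 = 0.5
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(estimate)  # an estimate near 0.5
```

Signatures are fixed-size regardless of set size, and grouping items by bands of signature values (locality sensitive hashing) lets a job compare only items that land in the same bucket instead of all pairs.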
  • 30. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 31. Recommendations
  • 32. Basic Terminology. User: User seeking recommendations. Item: Item being recommended. Explicit User Feedback: user knows they are rating or liking, can choose to dislike. Implicit User Feedback: user not explicitly aware, cannot dislike (click, hover, etc). Instances: Rows of user feedback/input data. Overfitting: Training a model too closely to the training data & hyperparameters. Hold Out Split: Holding out some of the instances to avoid overfitting. Features: Columns of instance rows (of feedback/input data). Cold Start Problem: Not enough data to personalize (new). Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations). Model Evaluation: Compare predictions to actual values of hold out split. Feature Engineering: Modify, reduce, combine features. Loss Function: Function we’re trying to minimize, such as least-squares error for Linear Regression. Cross Entropy: Loss function used for classification algorithms such as Logistic Regression. Optimizer: Technique to optimize loss function, such as Stochastic Gradient Descent (SGD)
  • 33. Stochastic Gradient Descent (SGD): Optimizes Loss Function. Least Squared Error: b/w predicted and actual value. Cross Entropy: Log Likelihood b/w predicted and actual probability [Figure labels: 2-Dimensional, 3-Dimensional]
  • 34. Features. Binary: True or False. Numeric Discrete: Integers. Numeric: Real Values. Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon). Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5). Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots. Temporal: Time-based, Time of Day, Binge Viewing. Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming). Media: Images, Audio, Video. Geographic: (Longitude, Latitude), Geohash. Latent: Hidden Features within Data (Collaborative Filtering). Derived: Age of Movie, Duration of User Subscription
  • 35. Feature Engineering. Dimension Reduction: Reduce number of features in feature space. Principal Component Analysis (PCA): Find principal features that best describe data variance; Peel dimensional layers back. One-Hot Encoding: Convert nominal categorical feature values into 0’s and 1’s; Remove any numerical relationship between categories; Convert Each Item to Binary Vector with Single 1.0 Column. Bears -> 1 --> Bears -> [1.0, 0.0, 0.0]; 49’ers -> 2 --> 49’ers -> [0.0, 1.0, 0.0]; Steelers -> 3 --> Steelers -> [0.0, 0.0, 1.0]
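The one-hot mapping shown for the three teams can be reproduced with a short sketch. It is written in plain Python for illustration and does not use Spark's encoder; the function name and the first-seen column ordering are assumptions.

```python
def one_hot_encode(values):
    # Assign each distinct category a column, in order of first appearance.
    categories = list(dict.fromkeys(values))
    index = {category: i for i, category in enumerate(categories)}
    # Each category becomes a binary vector with a single 1.0 column.
    return {
        category: [1.0 if i == index[category] else 0.0 for i in range(len(categories))]
        for category in categories
    }


encoding = one_hot_encode(["Bears", "49'ers", "Steelers", "Bears"])
print(encoding["Bears"])     # [1.0, 0.0, 0.0]
print(encoding["49'ers"])    # [0.0, 1.0, 0.0]
print(encoding["Steelers"])  # [0.0, 0.0, 1.0]
```

Unlike the integer labels 1, 2, 3, these vectors are all equally far apart, so a model cannot infer that Steelers is somehow "greater than" Bears.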
  • 36. Feature Normalization & Standardization. Goal: Scale features to standard size; Prevent boundless features; Helps avoid overfitting; Required by many ML algos. Normalize Features: Calculate L1 (or L2, etc) norm, then divide each element by it. Standardize Features: Apply standard normal transformation (mean->0, stddev->1). org.apache.spark.ml.feature.[Normalizer, StandardScaler]. http://www.mathsisfun.com/data/standard-normal-distribution.html
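The two operations can be sketched in plain Python to show what each one computes. This is not the Spark Normalizer or StandardScaler code; in particular, using the population standard deviation and leaving all-zero vectors and constant columns unscaled are assumptions made here.

```python
import math


def l2_normalize(vector):
    # Normalization works per row: divide each element by the vector's L2 norm.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # assumed convention: leave an all-zero vector unchanged
    return [x / norm for x in vector]


def standardize(column):
    # Standardization works per feature column: shift to mean 0, scale to stddev 1.
    mean = sum(column) / len(column)
    # Population standard deviation (an assumption; libraries may use the sample form).
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mean) / std for x in column]


print(l2_normalize([3.0, 4.0]))               # [0.6, 0.8]
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```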
  • 37. Non-Personalized Recommendations
  • 38. Cold Start Problem. “Cold Start” problem: New user, don’t know their preferences, must show something! Movies with highest-rated actors: Top K aggregations. Facebook social graph: Friend-based recommendations. Most desirable singles: PageRank of likes and dislikes
  • 39. Demo! GraphFrame PageRank
  • 40. Example: Dating Site “Like” Graph
  • 41. PageRank of Top Influencers
  • 42. Personalized Recommendations
  • 43. Demo! Personalized PageRank
  • 44. Personalized PageRank: Outbound Links. 0.15 = (1 - 0.85 “Damping Factor”). 85% Probability: Choose Among Outbound Network. 15% Probability: Choose Self or Random [Figure label: 85% Among Outbound Network]
  • 45. Personalized PageRank: No Outbound. 0.15 = (1 - 0.85 “Damping Factor”). 85% Probability: Choose Among Outbound Network. 15% Probability: Choose Self or Random [Figure label: 85% Among No Outbound Network!!]
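The damping-factor arithmetic on these two slides can be sketched as a small power iteration. This is a plain-Python illustration, not the GraphFrame code used in the demo. Two choices below are assumptions: passing `source` concentrates the 15% jump on that node (the personalized case, "choose self"), while omitting it jumps uniformly at random; and rank held by nodes with no outbound links is redistributed through the same jump distribution. The "like" graph is hypothetical.

```python
def pagerank(graph, damping=0.85, source=None, iterations=50):
    """graph maps node -> list of nodes it links to (e.g., "likes")."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    if source is None:
        # Standard PageRank: the 15% jump lands on a uniformly random node.
        teleport = {n: 1.0 / len(nodes) for n in nodes}
    else:
        # Personalized PageRank: the 15% jump always returns to the source.
        teleport = {n: 1.0 if n == source else 0.0 for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank sitting on nodes with no outbound links (assumed to follow the jump).
        dangling = sum(rank[n] for n in nodes if not graph.get(n))
        new_rank = {n: (1 - damping + damping * dangling) * teleport[n] for n in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            for target in targets:
                # 85% of a node's rank is split evenly among its outbound links.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


likes = {"a": ["c"], "b": ["c"], "c": []}  # hypothetical "like" graph
overall = pagerank(likes)
personal = pagerank(likes, source="a")
print(max(overall, key=overall.get))  # c
print(personal["b"])                  # 0.0: b is unreachable from a
```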
  • 46. Extend Your ML Pipelines to Serve Production Pipeline.IOPipeline.IOExtend Your ML Pipelines to Serve Production User-to-User Clustering User Similarity Time-based Pattern of viewing (binge or casual) Time of viewing (am or pm) Ratings-based Content ratings or number of views Average rating relative to others (critical or lenient) Search-based Search terms 46
  • 47. Extend Your ML Pipelines to Serve Production Pipeline.IOPipeline.IOExtend Your ML Pipelines to Serve Production Item-to-Item Clustering Item Similarity Profile text (TF/IDF, Word2Vec, NLP) Categories, tags, interests (Jaccard Similarity, LSH) Images, facial structures (Neural Nets, Eigenfaces) Dating Site Example… 47 Cluster Similar Eigen-facesCluster Similar Profiles Cluster Similar Categories
  • 48. Bonus: NLP Conversation Starter Bot. “If your responses to my generic opening lines are positive, I may read your profile.” Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment. http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  • 49. Bonus: Demo! Spark + Stanford CoreNLP Sentiment Analysis
  • 50. Bonus: Top 100 Country Song Sentiment
  • 51. Bonus: Surprising Results…?!
  • 52. Item-to-Item Based Recommendations. Based on Metadata: Genre, Description, Cast, City
  • 53. Demo! Item-to-Item-based Recommendations: One-Hot Encoding + K-Means Clustering
  • 54. One-Hot Encode Tag Feature Vectors
  • 55. Cluster Movie Tag Feature Vectors. Hyperparameter Tuning (K Clusters?)
  • 56. Analyze Movie Tag Clusters
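The one-hot encoding and clustering steps of this demo can be sketched as follows. This is a pure-Python stand-in for illustration only, not the Spark ML pipeline used in the demo: the movie tags are made up, and the k-means loop is a plain Lloyd iteration seeded with two hand-picked centroids so the result is deterministic.

```python
def one_hot(tags, vocabulary):
    """Encode a set of tags as a 0/1 vector over the sorted tag vocabulary."""
    return [1.0 if tag in tags else 0.0 for tag in vocabulary]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points, then move centroids to the mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
    return [nearest(point, centroids) for point in points], centroids


# Hypothetical tag data (the tags are assumptions, not from the slides).
movie_tags = {
    "Top Gun": {"action", "military"},
    "A Few Good Men": {"drama", "military"},
    "Taps": {"drama", "military"},
    "Toy Story": {"animation", "comedy"},
    "Shrek": {"animation", "comedy"},
}
vocabulary = sorted(set().union(*movie_tags.values()))
vectors = [one_hot(tags, vocabulary) for tags in movie_tags.values()]
# K = 2, seeded with the first military movie and the first animated movie.
assignments, centroids = kmeans(vectors, [vectors[0], vectors[3]])
```

Choosing K is the hyperparameter the slide asks about; here K = 2 is fixed by hand, whereas a real run would compare several values of K.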
  • 57. User-to-Item Collaborative Filtering: Matrix Factorization. ① Factor the large matrix (left) into 2 smaller matrices (right). ② The lower-rank matrices approximate the original when multiplied. ③ Fill in the missing values of the large matrix. ④ Surface k (rank) latent features from user-item interactions.
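The four steps above can be illustrated with a tiny factorization. Note the swapped-in technique: this sketch fits the factors with stochastic gradient descent for brevity, not with the ALS algorithm the demos use, and the ratings, rank, and learning-rate values are all made-up assumptions.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, epochs=2000,
              lr=0.01, reg=0.02, seed=42):
    """Factor sparse (user, item, rating) triples into U (users) and V (items)."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on squared error with L2 regularization.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V


def predict(U, V, u, i):
    """Multiply the two small matrices back together for one cell."""
    return sum(a * b for a, b in zip(U[u], V[i]))


def rmse(ratings, U, V):
    squared = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return (squared / len(ratings)) ** 0.5


# Hypothetical 3 users x 3 items matrix with two missing cells.
observed = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 1, 5.0),
            (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
U0, V0 = factorize(observed, 3, 3, epochs=0)   # untrained factors
U, V = factorize(observed, 3, 3)               # trained factors
missing_cell = predict(U, V, 0, 2)             # step 3: fill in a missing value
```

The `rank` argument is the k of step ④: each user and each item is described by k latent features.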
  • 58. Item-to-Item Collaborative Filtering. Famous Amazon paper circa 2003. Problem: as users grew, user-to-item collaborative filtering didn’t scale. Solution: item-to-item similarity, nearest neighbors. Offline (batch): generate itemId->List[userId] vectors. Online (real-time): from cart, recommend nearest neighbors in vector space.
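A minimal sketch of the offline/online split described above, assuming cosine similarity over binary itemId->userIds vectors. The item names, user ids, and function names are hypothetical, and this is not Amazon's actual implementation.

```python
from math import sqrt


def cosine(users_a, users_b):
    """Cosine similarity between two binary vectors stored as sets of user ids."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))


def build_neighbors(item_users, top_n=2):
    """Offline (batch): precompute the nearest neighbors of every item."""
    neighbors = {}
    for item, users in item_users.items():
        scored = [(cosine(users, other_users), other)
                  for other, other_users in item_users.items() if other != item]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors[item] = [other for score, other in scored[:top_n] if score > 0]
    return neighbors


def recommend(cart, neighbors):
    """Online (real-time): look up the precomputed neighbors of the cart items."""
    recs = []
    for item in cart:
        for candidate in neighbors.get(item, []):
            if candidate not in cart and candidate not in recs:
                recs.append(candidate)
    return recs


# Hypothetical itemId -> userIds data.
item_users = {
    "Top Gun": {1, 2, 3},
    "A Few Good Men": {2, 3, 4},
    "Taps": {3},
    "Toy Story": {5},
}
neighbors = build_neighbors(item_users)
recs = recommend(["Top Gun"], neighbors)
```

The online step is only dictionary lookups, which is why this design scales with the number of users: the expensive similarity work happens in the batch job.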
  • 59. Demo! Collaborative Filtering-based Recommendations
  • 60. Fitting the Matrix Factorization Model
  • 61. Show ItemFactors Matrix from ALS
  • 62. Show UserFactors Matrix from ALS
  • 63. Generating Individual Recommendations
  • 64. Generating Batch Recommendations
  • 65. Clustering + Collaborative Filtering Recs. Cluster the matrix output from Matrix Factorization; the latent features are derived from user-item interaction. Item-to-item similarity: cluster the item-factor matrix. User-to-user similarity: cluster the user-factor matrix.
  • 66. Demo! Clustering + Collaborative Filtering-based Recommendations
  • 67. Show ItemFactors Matrix from ALS
  • 68. Convert Item Factors to mllib.Vector (required by the K-Means clustering algorithm)
  • 69. Fit and Evaluate K-Means Cluster Model. K = 5 clusters. Measures closeness of points within clusters.
  • 70. Netflix Genres and Clusters. Typical genres: Documentary, Romance, Comedy, Horror, Action, Adventure. Latent (hidden) clusters: Emotionally-Independent Dramas for Hopeless Romantics; Witty Dysfunctional-Family TV Animated Comedies; Romantic Crime Movies based on Classic Literature; Latin American Forbidden-Love Movies; Critically-acclaimed Emotional Drug Movie; Cerebral Military Movie based on Real Life; Sentimental Movies about Horses for Ages 11-12; Gory Canadian Revenge Movies; Raunchy Mad Scientist Comedy.
  • 71. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 72. When to Approximate? Memory- or time-constrained queries: relative vs. exact counts are OK (approx # errors after a release). Using machine learning or graph algos: inherently probabilistic and approximate. Streaming aggregations: inherently sloppy collection (exactly once?). Approximate as much as you can get away with! Ask for forgiveness later!!
  • 73. When NOT to Approximate? If you’ve ever heard the term “Sarbanes-Oxley” at the office.
  • 74. A Few Good Algorithms. You can’t handle the approximate!
  • 75. Common to These Algos & Data Structs. Low, fixed size in memory: store a large amount of data; known error bounds; tunable tradeoff between size and error; less memory than Java/Scala collections. Rely on multiple hash functions or operations: the size of the hash range defines the error.
  • 76. Bloom Filter: Set.contains(key): Boolean. “Hash Multiple Times and Flip the Bits Wherever You Land”
  • 77. Bloom Filter. Approximate Set.contains(key): No means No, Yes means Maybe. Elements can only be added, never updated or removed.
  • 78. Bloom Filter in Action. Operations: set(key) and contains(key): Boolean. Set.contains(key) = TRUE -> maybe contains (other key hashes may overlap). Set.contains(key) = FALSE -> definitely does not contain (no key flipped all bits). Images by @avibryant.
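A minimal Bloom filter sketch of "hash multiple times and flip the bits wherever you land". The sizes and the salted-SHA-256 hashing scheme are assumptions chosen for readability; this is not Algebird's BloomFilter.

```python
import hashlib


class BloomFilter:
    """Approximate Set.contains(key): False is definite, True means maybe."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Simulate several hash functions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        return all(self.bits[position] for position in self._positions(key))


empty = BloomFilter()
seen = BloomFilter()
seen.add("Top Gun")
```

There is no `remove` method on purpose: clearing a bit could also erase evidence of other keys that hashed to the same position, which is why elements can only be added.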
  • 79. CountMin Sketch: Frequency Count and TopK. “Hash Multiple Times and Add 1 Wherever You Land”
  • 80. CountMin Sketch (CMS). Approximate frequency count and TopK for a key, i.e. “Heavy Hitters” on Twitter: Matei Zaharia, Martin Odersky, Donald Trump.
  • 81. CountMin Sketch In Action (TopK Count). Use case: TopK movies using total views. Multiple hash functions (1 hash function per row); binary hash output (1 element per column). add(Top Gun, 2) records 2 occurrences of “Top Gun” (for slightly additional complexity), followed by add(A Few Good Men, 1) and add(Taps, 1); the diagram marks cells where Top Gun and A Few Good Men overlap. getCount(Top Gun): Long finds the minimum of all rows, so it can overestimate, but never underestimate. Images derived from @avibryant.
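The add/getCount walk-through above can be sketched as below. The depth, width, and salted-SHA-256 hashing are illustrative assumptions, not Algebird's CMS implementation.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts: may overestimate, never underestimates."""

    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, column in self._columns(key):
            self.table[row][column] += count

    def get_count(self, key):
        # Colliding keys only inflate counters, so the minimum is the best guess.
        return min(self.table[row][column] for row, column in self._columns(key))


views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
top_gun_views = views.get_count("Top Gun")
```

A wider table lowers the chance of overlap between keys, and more rows make it less likely that every counter for a key is inflated at once.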
  • 82. HyperLogLog: Count Distinct. “Hash Multiple Times and Uniformly Distribute Where You Land”
  • 83. HyperLogLog (HLL). Approximate count distinct. Slight twist: a special hash function creates a uniform distribution; hash subsets of the data with a single, special hash func. Error estimate: 14 bits for size of range, m = 2^14 = 16,384 hash slots, error = 1.04/sqrt(16,384) = 0.81%. (Callout: “Not many of these”.)
  • 84. HyperLogLog In Action (Count Distinct). Use case: number of distinct users who view a movie. [Figure: users hashed onto a uniform range. Top Gun: Hour 1 (range 0-16): users 1001, 2009, 3005, 3003, 3001, 7009. Top Gun: Hour 2 (range 0-32): users 2001, 4009, 3002, 7002, 1005, 6001, 8001, 8002. Top Gun: Hour 1 + 2 (range 0-32): all of the above, combined across different scales.] Uniform distribution: estimate the distinct # of users by inspecting just the beginning.
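A compact HyperLogLog sketch that reproduces the error arithmetic above and the "Hour 1 + 2" merge. It is a simplified textbook variant (SHA-256 as the uniform hash, small-range linear counting, no large-range correction), not the Redis or Algebird implementation, and the user ids are synthetic.

```python
import hashlib
from math import log, sqrt


class HyperLogLog:
    def __init__(self, p=14):
        self.p = p                    # 14 bits -> m = 2^14 = 16,384 registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, key):
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")            # 64 uniform bits
        index = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1     # position of the first 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: keep the larger value of each register."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * log(self.m / zeros)      # small-range correction
        return estimate

    def relative_error(self):
        return 1.04 / sqrt(self.m)


hour1, hour2 = HyperLogLog(), HyperLogLog()
for user in range(0, 1000):
    hour1.add(f"user{user}")
for user in range(500, 1500):        # 500 users overlap with hour 1
    hour2.add(f"user{user}")
hour1.merge(hour2)
distinct_users = hour1.count()       # true answer: 1,500 distinct users
```

Because a register only ever keeps a maximum, adding the same user twice changes nothing, and merging two hours is just a register-wise maximum.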
  • 85. Locality Sensitive Hashing: Set Similarity. “Pre-process Items into Buckets, Compare Within Buckets”
  • 86. Locality Sensitive Hashing (LSH). Approximate set similarity. Pre-process m rows into b buckets (b << m; b = buckets, m = rows). Hash items multiple times: similar items hash to overlapping buckets, because the hash is designed to cluster similar items. Compare just the contents of buckets: a much smaller cartesian compare that can run in parallel and avoids a huge cartesian all-pairs compare. Reference: Chapter 3: LSH.
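One common way to realize the bucket idea above is MinHash signatures with banding; the slide does not say which LSH family the demo uses, so treat that choice, the band/row counts, and the profile data as assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes):
    """For each salted hash function, keep the smallest hash over the set."""
    return [
        min(int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items)
        for seed in range(num_hashes)
    ]


def candidate_pairs(sets_by_id, bands=8, rows=2):
    """Bucket each signature band by band; only bucket-mates get compared."""
    buckets = defaultdict(set)
    for key, items in sets_by_id.items():
        signature = minhash_signature(items, bands * rows)
        for band in range(bands):
            chunk = tuple(signature[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    return len(a & b) / len(a | b)


# Hypothetical dating-profile interests (all sets must be non-empty).
profiles = {
    "a": {"hiking", "jazz", "sushi", "travel"},
    "b": {"hiking", "jazz", "sushi", "travel"},
    "c": {"chess", "opera", "golf", "wine"},
}
pairs = candidate_pairs(profiles)
similarities = {pair: jaccard(profiles[pair[0]], profiles[pair[1]]) for pair in pairs}
```

Only the candidate pairs get the exact Jaccard comparison, so the all-pairs cartesian compare is replaced by small compares inside each bucket, and the buckets can be processed in parallel.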
  • 87. DIMSUM: Set Similarity. “Pre-process and ignore data that is unlikely to be similar.”
  • 88. DIMSUM: “Dimension Independent Matrix Square Using MR”. Remove vectors with low probability of similarity: RowMatrix.columnSimilarities(threshold). Twitter DIMSUM case study: 40% efficiency gain over brute-force cosine similarity.
  • 89. Common Tools to Approximate. Twitter Algebird: composable library. Redis: distributed cache. Apache Spark: big data processing.
  • 90. Twitter Algebird. Algebraic fundamentals: parallel, associative, composable. Examples: Min, Max, Avg; BloomFilter (Set.contains(key)); HyperLogLog (Count Distinct); CountMin Sketch (TopK Count).
  • 91. Redis Implementation of HyperLogLog (Count Distinct). 12KB per item count; 2^64 max # of items; 0.81% error. Add user views for a given movie: PFADD TopGun_Hour1_HLL user1001 user2009 user3005, then PFADD TopGun_Hour1_HLL user3003 user1001. Get the distinct count (cardinality) of the set: PFCOUNT TopGun_Hour1_HLL returns 4 (distinct users viewed this movie). Union 2 HyperLogLog data structures: PFMERGE TopGun_Hour1_HLL TopGun_Hour2_HLL. Slide callouts: “ignore duplicates” (user1001 is added twice) and “Tunable”.
  • 92. Approximations in Spark Libraries. Spark Core: countByKeyApprox(timeout: Long, confidence: Double) returns a PartialResult. Spark SQL: approxCountDistinct(column: Column, targetResidual: Float); approxQuantile(column: Column, quantiles: Seq[Float], targetResidual: Float). Spark ML: stratified sampling with sampleByKey(fractions: Map[K, Double]); DIMSUM sampling with RowMatrix.columnSimilarities(threshold: Double), where probabilistic sampling reduces the amount of shuffle.
  • 93. Demo! Exact Count vs. Approximate HLL and CMS Count
  • 94. HashSet vs. HyperLogLog (Memory)
  • 95. HashSet vs. CountMin Sketch (Memory)
  • 96. Demo! Exact Similarity vs. Approximate LSH Similarity
  • 97. Brute Force Cartesian All Pair Similarity: 47 seconds
  • 98. Locality Sensitive Hash All Pair Similarity: 6 seconds
  • 99. Many More Demos! http://advancedspark.com or Download Docker, Clone on Github
  • 100. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 101. Netflix Recommendations: From Ratings to Real-time
  • 102. Netflix Has a Lot of Data. Netflix has a lot of data about a lot of users and a lot of movies. Netflix can use this data to buy new movies. Netflix can use this data to choose original programming. Netflix knows that a lot of people like politics and Kevin Spacey. Netflix is global: the UK doesn’t have White Castle, so my favorite movie, “Harold and Kumar Go to White Castle”, was renamed to “Harold and Kumar Get the Munchies”. This broke my unit tests! Summary: Buy NFLX Stock!
  • 103. Netflix Data Pipeline - Then: v1.0, v2.0
  • 104. Netflix Data Pipeline - Now (Keystone): v3.0. 9 million events per second; 22 GB per second!! EC2 D2XL: Disk 6 TB, 475 MB/s; RAM 30 G; Network 700 Mbps. Auto-scaling, fault tolerance. A/B Tests, Trending Now. SAMZA splits high and normal priority.
  • 105. Netflix Recommendation Data Pipeline. Throw away batch user factors (U); keep batch video factors (V).
  • 106. Netflix Trending Now (Time-based Recs). Uses Spark Streaming. Personalized to user (viewing history, past ratings). Learns and adapts to events (Valentine’s Day). Calculate take rate: number of plays, number of impressions. (Callout: “VHS”.)
  • 107. Bonus: Pandora Time-based Recs. Work days: play familiar music; user is less likely to accept new music. Evenings and weekends: play new music; user is more likely to accept new music.
  • 108. $1 Million Netflix Prize (2006-2009). Goal: improve movie predictions by 10% (Root Mean Squared Error); test data withheld to calculate RMSE upon submission. Dataset: 5-star ratings (userId, movieId, rating, timestamp). Winning algorithm(s): 10.06% improvement (RMSE); ensemble of 500+ ML models combined with GBDTs; computationally impractical.
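The prize metric is easy to state in code. A small sketch with made-up numbers; the function names are hypothetical and the values are not Netflix Prize data.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between predicted and withheld true ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty rating lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared / len(actual))


def improvement(baseline_rmse, candidate_rmse):
    """Relative RMSE improvement, e.g. 0.10 means 10% better than the baseline."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


# Illustrative values only.
off_by_one = rmse([1.0, 1.0], [2.0, 2.0])
gain = improvement(1.0, 0.9)
```

Since lower RMSE is better, a 10% improvement means the candidate's RMSE is 90% of the baseline's.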
  • 109. Secrets to the Winning Algorithms. Adjust for the following human biases: ① Alice effect: user rates lower than avg. ② Inception effect: movie rated higher than avg. ③ Overall mean rating of a movie. ④ Number of people who have rated a movie. ⑤ Number of days since user’s first rating. ⑥ Number of days since movie’s first rating. ⑦ Mood, time of day, day of week, season, weather.
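The first two adjustments above can be sketched as a simple baseline predictor: global mean plus a per-user offset (the "Alice effect") plus a per-movie offset (the "Inception effect"). This is a simplified illustration with made-up ratings, not the winning team's actual model, and it ignores the time-based effects in ⑤ to ⑦.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Return (global mean, per-user bias, per-movie bias) from rating triples."""
    mu = sum(rating for _, _, rating in ratings) / len(ratings)
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append(rating)
        by_movie[movie].append(rating)
    # Bias = how far this user's (or movie's) average sits from the global mean.
    user_bias = {u: sum(rs) / len(rs) - mu for u, rs in by_user.items()}
    movie_bias = {m: sum(rs) / len(rs) - mu for m, rs in by_movie.items()}
    return mu, user_bias, movie_bias


def predict(mu, user_bias, movie_bias, user, movie):
    """Unknown users or movies fall back to a bias of 0."""
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)


# Hypothetical ratings: alice is a critical rater, Inception is well liked.
ratings = [
    ("alice", "Inception", 4), ("alice", "Taps", 2),
    ("bob", "Inception", 5), ("bob", "Taps", 5),
]
mu, user_bias, movie_bias = fit_baseline(ratings)
alice_inception = predict(mu, user_bias, movie_bias, "alice", "Inception")
```

Subtracting these biases first leaves a residual that models such as matrix factorization can then explain.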
  • 110. Netflix Common ML Algorithms: Logistic Regression, Linear Regression, Gradient Boosted Decision Trees, Random Forest, Matrix Factorization, SVD, Restricted Boltzmann Machines, Deep Neural Nets, Markov Models, LDA, Clustering. Ensembles!
  • 111. Netflix Genres and Clusters. Typical genres: Documentaries, Romance Comedies, Horror, Action, Adventure. Latent (hidden) clusters: Emotionally-Independent Dramas for Hopeless Romantics; Witty Dysfunctional-Family TV Animated Comedies; Romantic Crime Movies based on Classic Literature; Latin American Forbidden-Love Movies; Critically-acclaimed Emotional Drug Movie; Cerebral Military Movie based on Real Life; Sentimental Movies about Horses for Ages 11-12; Gory Canadian Revenge Movies; Raunchy Mad Scientist Comedy.
  • 112. Netflix Social Integration. Post to Facebook after movie start (5 mins). Recommend to new users based on friends. Helps with the Cold Start problem.
  • 113. Netflix Search. No results? No problem… Show similar results! Utilize the extensive DVD catalog; metadata search (ElasticSearch); named entity recognition (NLP). Empty searches are an opportunity! Explicit feedback for future recommendations; content to buy and produce!
  • 114. Netflix A/B Tests. Users tend to click on images featuring: faces with strong emotional expressions; villains over heroes; a small number of cast members.
  • 115. Netflix Recommendation Serving Layer. Use case: recommendation service depends on EVCache. Problem: EVCache cluster goes down or becomes latent!? Answer: Circuit Breaker! github.com/Netflix/Hystrix. Circuit states: Closed (service OK); Open (service DOWN, fall back to static).
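The Closed/Open behavior above can be sketched generically. This is not the Hystrix API: the class, the thresholds, and the injected clock are hypothetical, and the retry-after-timeout ("half-open") step is an assumption that the slide does not mention.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None          # None means the circuit is Closed

    @property
    def is_open(self):
        return self.opened_at is not None

    def call(self, primary, fallback):
        if self.is_open and self.clock() - self.opened_at < self.reset_timeout:
            return fallback()          # Open: skip the failing service entirely
        try:
            result = primary()         # Closed, or a trial call after the timeout
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None          # success closes the circuit again
        return result


# Simulated outage with a fake clock so the example runs instantly.
now = [0.0]
attempts = []


def personalized_recs():
    attempts.append(now[0])
    raise RuntimeError("EVCache is down")


def static_recs():
    return ["popular-1", "popular-2"]


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
responses = [breaker.call(personalized_recs, static_recs) for _ in range(3)]
attempts_while_failing = len(attempts)   # the third call never reached the cache
now[0] = 11.0                            # the reset timeout has elapsed
recovered = breaker.call(lambda: ["personalized-1"], static_recs)
```

Users still get static recommendations during the outage, and the struggling cache is spared further traffic until the timeout allows a trial call.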
  • 116. Why Higher Average Ratings 2004+? In 2004, Netflix noticed higher ratings on average. Some possible reasons why: ① Significant UI improvements deployed ② New recommendation engine deployed ③
  • 117. Thank You, Everyone!! Chris Fregly @cfregly, Research Scientist @ PipelineIO, San Francisco, California, USA. http://fluxcapacitor.com. Sign up for the Meetup and Book. Contribute to Github Repo. Run all Demos using Docker. Find me on LinkedIn, Twitter, Github, Email, Fax. Image derived from http://www.duchess-france.org/