Fast Machine Learning Development with MongoDB
Rajhans Samdani - Spoke
Jane Fine - MongoDB
What is Spoke
Demo Flow
Spoke: A simpler, smarter way to manage requests
● Problems: natural language processing problems
● Challenge: customized machine learning models for every client
○ Need to learn quickly (near real time) from user interactions
○ 1000s of ML models
● MongoDB: very useful in scaling up our ML
● Spoke is workplace management software that uses machine learning to answer questions and assign requests to the right teams
○ Started in August 2016. Funded by Greylock and Accel.
Spoke Tech Stack
Machine Learning Approach
Machine Learning Problem: Team Triaging
Problem: pick the right team based on the text and context of the request
Challenge: each client has different teams, so pretraining is not possible; the system must learn from demonstration
Implication => a separate ML model for each client
Traditional ML vs Adaptive Approach
Figure panels: Traditional ML Pipeline | Adaptive ML Pipeline
Claim: most ML-driven startups are in the second bucket
Low data and low query volume domain
Startups must build quickly and adapt to users to show utility
Adapting with Online Machine Learning
● Online learning: update the model at each time step as the data sequentially arrives
● For the first year, Spoke built quickly by using online learning to deliver a slick product experience
○ Users see the utility because the system learns in real time!
● Easy to serve and scale using MongoDB
Online Learning with MongoDB
Diagrams: Serving flow | Training flow
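The serving and training diagrams are not available in this transcript, so the following is only a minimal sketch of what such flows might look like. It assumes one model document per client, uses an in-memory Map as a stand-in for a MongoDB collection, and swaps in a simple count-based model; `modelStore`, `loadModel`, `saveModel`, `applyEvent`, `train`, and `predictTeam` are hypothetical names. It also illustrates why overlapping read-update-write cycles can drop a training event and how processing events one at a time avoids that.

```javascript
// An in-memory Map stands in for a MongoDB collection that holds one model
// document per client (an assumption; the original diagrams are unavailable).
const modelStore = new Map();

function loadModel(clientId) {
  const doc = modelStore.get(clientId) || { clientId, weights: {}, updates: 0 };
  // Copy on read, the way a query returns a fresh copy of the stored document.
  return { ...doc, weights: { ...doc.weights } };
}

function saveModel(doc) {
  modelStore.set(doc.clientId, doc);
}

// One online-learning step: bump the weight of each (team, word) pair.
function applyEvent(doc, event) {
  for (const word of event.words) {
    const key = `${event.team}:${word}`;
    doc.weights[key] = (doc.weights[key] || 0) + 1;
  }
  doc.updates += 1;
  return doc;
}

// Training flow: read the model, update it, write it back.
function train(clientId, event) {
  saveModel(applyEvent(loadModel(clientId), event));
}

// Serving flow: read the model and pick the highest-scoring team.
function predictTeam(clientId, words, teams) {
  const { weights } = loadModel(clientId);
  const score = (team) =>
    words.reduce((sum, w) => sum + (weights[`${team}:${w}`] || 0), 0);
  return teams.reduce((best, team) => (score(team) > score(best) ? team : best));
}

const e1 = { words: ["laptop", "broken"], team: "IT" };
const e2 = { words: ["payroll", "late"], team: "HR" };

// Race: both handlers read before either writes, so the second write
// overwrites the first and one training event is lost.
const a = loadModel("acme");
const b = loadModel("acme");
saveModel(applyEvent(a, e1));
saveModel(applyEvent(b, e2));
const updatesAfterRace = modelStore.get("acme").updates; // 1, not 2

// Queue: draining events one at a time keeps every update.
modelStore.delete("acme");
for (const event of [e1, e2]) train("acme", event);
const updatesAfterQueue = modelStore.get("acme").updates; // 2

console.log(updatesAfterRace, updatesAfterQueue);
console.log(predictTeam("acme", ["my", "laptop"], ["IT", "HR"]));
```

Against a real database the same shape would apply, with the Map reads and writes replaced by queries and updates on the model collection.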
Online Learning: Tips
● Use feature hashing to get a bounded model size
○ E.g. a linear model with a hash of size 10k for a 5-way classification problem => model size = 200 KB.
● Load test your setup to ensure it works for your QPS
● We modified scikit-learn's online learning algorithm to change the number of class labels on the fly
● Gotcha: concurrency is a problem. If two training events arrive at the same time, one may be ignored
○ This can be avoided by using queues
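The 200 KB figure follows from 10,000 hash buckets × 5 classes × 4 bytes, if each weight is stored as a 32-bit float (the float width is an assumption that makes the arithmetic match). The sketch below shows the hashing trick with a fixed-size weight array. The FNV-1a hash and the perceptron-style update are stand-ins chosen for brevity, not Spoke's modified scikit-learn algorithm.

```javascript
const HASH_SIZE = 10000; // buckets, as in the slide's example
const NUM_CLASSES = 5;   // 5-way classification

// One float32 weight per (class, bucket) pair: the model never grows,
// no matter how many distinct tokens appear.
const weights = new Float32Array(NUM_CLASSES * HASH_SIZE);

// FNV-1a string hash reduced to a bucket index (hash choice is an assumption).
function bucket(token) {
  let h = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    h ^= token.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % HASH_SIZE;
}

function scores(tokens) {
  const out = new Array(NUM_CLASSES).fill(0);
  for (const t of tokens) {
    const b = bucket(t);
    for (let c = 0; c < NUM_CLASSES; c++) out[c] += weights[c * HASH_SIZE + b];
  }
  return out;
}

function predict(tokens) {
  const s = scores(tokens);
  return s.indexOf(Math.max(...s));
}

// Perceptron-style online update: on a mistake, move weight toward the
// correct label and away from the wrong guess.
function update(tokens, label) {
  const guess = predict(tokens);
  if (guess === label) return;
  for (const t of tokens) {
    const b = bucket(t);
    weights[label * HASH_SIZE + b] += 1;
    weights[guess * HASH_SIZE + b] -= 1;
  }
}

update(["vpn", "password", "reset"], 2);
console.log(weights.byteLength); // 200000 bytes, i.e. about 200 KB
console.log(predict(["password", "reset"]));
```

Because the array size is fixed up front, the stored model stays the same size for every client, which is what makes thousands of per-client models practical to keep in a database.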
Augmenting TensorFlow with MongoDB
● Years later, we developed a batch training environment using TensorFlow
○ Maintainable
○ Retrainable
● Still a tension between online learning and batch training
○ Good UX from online learning
○ Engineering scalability from batch training
● Achieved a good compromise: use the Mongo-based online learning model for the first few hundred responses and then silently switch to TensorFlow
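One way to picture the compromise is a small router that picks a predictor per client. This is a sketch under assumptions: the cutoff of 300, the `responsesSeen` and `batchModelReady` fields, and the `makeRouter` helper are all hypothetical, since the slide says only "first few hundred responses".

```javascript
// Hypothetical cutoff; the slide gives no exact number.
const SWITCH_AFTER = 300;

// Returns a predict function that uses the online model for new clients and
// switches to the batch-trained model once enough responses have been seen.
function makeRouter(onlinePredict, batchPredict, switchAfter = SWITCH_AFTER) {
  return function predict(client, request) {
    const useBatch =
      client.responsesSeen >= switchAfter && client.batchModelReady;
    return useBatch ? batchPredict(request) : onlinePredict(request);
  };
}

// Stub predictors standing in for the two real models.
const route = makeRouter(
  (request) => `online:${request}`,
  (request) => `batch:${request}`
);

console.log(route({ responsesSeen: 12, batchModelReady: false }, "vpn help"));
console.log(route({ responsesSeen: 450, batchModelReady: true }, "vpn help"));
```

Keeping the decision in one place means callers never need to know which model answered, which matches the "silently switch" behavior described above.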
Mongo ML Capabilities
Multiple Data Models and Access Patterns in MongoDB
Rich Queries: Point | Range | Geospatial | Faceted Search | Aggregations | JOINs | Graph Traversals
Data Models: JSON Documents | Tabular | Key-Value | Text | Graph | Geospatial
Example: Text Classification
Data Model 1:
key-value
raw text input
whole corpus
0b917217ae
7fef14c0b3
cb9eadad9a
Data Model 2: tabular
matrix: one row per article, one column per word in article
Transformation from Data Model 1: TF-IDF Vectorization

          word1  word2  word3
article1    1      0      2
article2    0      1      0
article3    0      1      1
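To make the vectorization step concrete, here is a small sketch that turns the matrix above into TF-IDF weights. It assumes the table entries are raw term counts and uses one common variant, tf × ln(N / df); other TF-IDF variants differ in smoothing and normalization. The `tfidf` function name is illustrative.

```javascript
// Term counts from the slide: rows are articles, columns are words.
const vocab = ["word1", "word2", "word3"];
const counts = [
  [1, 0, 2], // article1
  [0, 1, 0], // article2
  [0, 1, 1], // article3
];

// TF-IDF with raw term frequency and idf = ln(N / df), where N is the number
// of articles and df is the number of articles containing the word.
function tfidf(matrix) {
  const n = matrix.length;
  const df = matrix[0].map((_, j) => matrix.filter((row) => row[j] > 0).length);
  const idf = df.map((d) => (d > 0 ? Math.log(n / d) : 0));
  return matrix.map((row) => row.map((tf, j) => tf * idf[j]));
}

const weighted = tfidf(counts);
weighted.forEach((row, i) => {
  const cells = row.map((v, j) => `${vocab[j]}=${v.toFixed(3)}`).join(" ");
  console.log(`article${i + 1}: ${cells}`);
});
```

Here word1 appears in only one article, so it gets the largest idf, while word2 and word3 each appear in two of three articles and are weighted less.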
Data Model 3:
JSON documents
extract keywords and
topics and enrich
Transformation from Data Model 2: LDA topic extraction
{
  "_id": "0b917217ae",
  "title": "Document Model Design Patterns",
  "text": blob,
  "topics": [ "Models", "MVC" ],
  "top_words": [ "join", "embed", "one-to-many" ],
  "model": {
    "location": ...,
    "last_updated": Timestamp("05-29-19 00:00:00"),
    "confidence": Decimal128("0.9123"),
    ...
  },
  ...
}
Example: Text Classification & Graph Traversal
Data Model 4: graph
tree/hierarchy of topics modeled as a graph
Transformation from Data Model 3: Hierarchical Clustering
{
  "_id": "0b917217ae",
  "title": "Document Model Design Patterns",
  "text": blob,
  "parent": "Databases",
  "topics": [ "Models", "MVC" ],
  "top_words": [ "join", "embed", "one-to-many" ],
  "model": {
    "location": ...,
    "last_updated": Timestamp("05-29-19 00:00:00"),
    "confidence": Decimal128("0.9123"),
    ...
  },
  ...
}
db.topics.insert( { _id: "Models", parent: "Databases" } )
db.topics.insert( { _id: "Storage", parent: "Databases" } )
db.topics.insert( { _id: "MVCC", parent: "Databases" } )
db.topics.insert( { _id: "Databases", parent: "Programming" } )
db.topics.insert( { _id: "Languages", parent: "Programming" } )
db.topics.insert( { _id: "Programming", parent: null } )
Topic tree:
Programming
    Languages
    Databases
        Models
        Storage
        MVCC
Traverse the hierarchy with $graphLookup
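As an illustration of the traversal, the sketch below shows a `$graphLookup` pipeline that would collect the ancestors of one topic, followed by a plain in-memory walk that mimics the same parent-chasing so the result can be checked without a database. The pipeline and the `ancestorsOf` helper are illustrative, not taken from the talk.

```javascript
// Aggregation pipeline for db.topics.aggregate(...): start from a topic's
// parent and keep following parent -> _id links.
const ancestorsPipeline = [
  { $match: { _id: "Models" } },
  {
    $graphLookup: {
      from: "topics",
      startWith: "$parent",
      connectFromField: "parent",
      connectToField: "_id",
      as: "ancestors",
    },
  },
];

// The same documents as the inserts above, held in memory.
const topics = [
  { _id: "Models", parent: "Databases" },
  { _id: "Storage", parent: "Databases" },
  { _id: "MVCC", parent: "Databases" },
  { _id: "Databases", parent: "Programming" },
  { _id: "Languages", parent: "Programming" },
  { _id: "Programming", parent: null },
];

// In-memory equivalent of the traversal. Unlike $graphLookup, which does not
// guarantee result order, this returns ancestors nearest-first.
function ancestorsOf(id, docs) {
  const byId = new Map(docs.map((d) => [d._id, d]));
  const out = [];
  const start = byId.get(id);
  let parent = start ? start.parent : null;
  while (parent !== null && byId.has(parent) && !out.includes(parent)) {
    out.push(parent);
    parent = byId.get(parent).parent;
  }
  return out;
}

console.log(ancestorsOf("Models", topics)); // [ 'Databases', 'Programming' ]
```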
Indexing in MongoDB
Index Types
• Primary Index
– Every collection has a primary key index (on _id)
• Compound Index
– Index against multiple keys in the document
• MultiKey Index
– Index into arrays
• Text Indexes
– Support for text searches
• GeoSpatial Indexes
– 2d & 2dSphere indexes for spatial geometries
• Hashed Indexes
– Hash-based values for sharding
Index Features
• TTL Indexes
– Single-field indexes; documents are deleted when they expire
• Unique Indexes
– Ensure a value is not duplicated
• Partial Indexes
– Expression-based indexes, allowing indexes on subsets of data
• Case-Insensitive Indexes
– Support case-insensitive string comparisons (via collation)
• Sparse Indexes
– Only index documents which have the given field
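The index types and features above map onto the key document and options passed to `createIndex`. The sketch below lists one spec per kind and applies them through a small helper; all field names and the `applyIndexes` helper are hypothetical, and a recording stub stands in for a real collection so the example runs without a server.

```javascript
// One { keys, options } pair per index kind. Field names are made up.
const indexSpecs = [
  { keys: { clientId: 1, createdAt: -1 }, options: {} },             // compound
  { keys: { topics: 1 }, options: {} },                              // multikey (array field)
  { keys: { title: "text", body: "text" }, options: {} },            // text
  { keys: { location: "2dsphere" }, options: {} },                   // geospatial
  { keys: { clientId: "hashed" }, options: {} },                     // hashed
  { keys: { createdAt: 1 }, options: { expireAfterSeconds: 3600 } }, // TTL
  { keys: { email: 1 }, options: { unique: true } },                 // unique
  {
    keys: { status: 1 },
    options: { partialFilterExpression: { status: "open" } },        // partial
  },
  {
    keys: { name: 1 },
    options: { collation: { locale: "en", strength: 2 } },           // case-insensitive
  },
  { keys: { nickname: 1 }, options: { sparse: true } },              // sparse
];

// Works with any object exposing createIndex(keys, options), such as a
// driver collection handle.
function applyIndexes(collection, specs) {
  return specs.map(({ keys, options }) => collection.createIndex(keys, options));
}

// Recording stub so the sketch runs without a database.
const recorded = [];
const fakeCollection = {
  createIndex(keys, options) {
    recorded.push({ keys, options });
    return Object.keys(keys).join("_");
  },
};

console.log(applyIndexes(fakeCollection, indexSpecs));
```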
Scalability & Distributed Processing
Process large volumes of data in parallel
Sharded cluster: queries and aggregations run in parallel; data is returned in parallel
• Automatically scale beyond the constraints of a single node
• Optimized for query patterns and data locality
• Transparent to applications and tools
Intelligent Data Distribution: Workload Isolation
Enable different workloads on the same data
• Combine operational and analytical workloads on a single platform
• No data movement or duplication
• Extract insights in real time to enrich applications
• MongoDB Atlas - Analytics Nodes
Diagram: a single replica set serving an Application, with TRANSACTIONAL, Operational Analytics, and ANALYTICAL (ML & AI) workloads
Text Search
MongoDB Text Search
1. Create Text Index
db.restaurants.createIndex( { description: "text" } )
2. Match Text
db.restaurants.find( { $text: { $search: "java coffee shop" } } )
3. Score and Sort Results
db.restaurants.find(
   { $text: { $search: "java coffee shop" } },
   { score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } )
Text Indexing
Index any field whose value is a string or an array of string elements
A collection can only have one text search index, but that index can cover multiple fields
Optionally Specify Language:
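The slide's example for specifying a language is not in the transcript. As a hedged sketch, a text index spanning several fields with a default language could be described like this; `textIndexSpec` is a hypothetical helper and the `name` field is made up, while `default_language` is the text index option for the language.

```javascript
// Build the key document and options for a multi-field text index.
function textIndexSpec(fields, language) {
  const keys = {};
  for (const field of fields) keys[field] = "text";
  const options = language ? { default_language: language } : {};
  return { keys, options };
}

const spec = textIndexSpec(["name", "description"], "spanish");
console.log(JSON.stringify(spec));

// Equivalent mongo shell call:
// db.restaurants.createIndex(
//   { name: "text", description: "text" },
//   { default_language: "spanish" }
// )
```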
Text Matching
$text will tokenize the search string using whitespace and most punctuation as delimiters, and perform a logical OR of all such tokens in the search string.
Search for a Single Word
Match Any of the Words
Search for a Phrase
Negations
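The example queries for these four cases are missing from the transcript. The filters below sketch what each form looks like with the `$text` operator; the search strings are illustrative, and `textFilter` is a hypothetical helper.

```javascript
// Wrap a search string in a $text filter.
const textFilter = (search) => ({ $text: { $search: search } });

const examples = {
  // Single word: matches documents containing "coffee".
  singleWord: textFilter("coffee"),
  // Any of the words: space-separated terms are ORed together.
  anyOfWords: textFilter("java coffee shop"),
  // Phrase: escaped double quotes require the exact phrase.
  phrase: textFilter('"coffee shop"'),
  // Negation: a leading hyphen excludes documents containing that term.
  negation: textFilter("java shop -coffee"),
};

for (const [name, filter] of Object.entries(examples)) {
  // Each filter would be passed as db.restaurants.find(filter).
  console.log(name, JSON.stringify(filter));
}
```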
Scoring and Sorting: Control Search Results with Weights
Weight is the significance of the field (default = 1)
For each indexed field, MongoDB multiplies the number of matches by the weight and sums the results → score of the document
Use "textScore" metadata for projections, sorts, and conditions subsequent to the $match stage that includes the $text operation.
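The weighting rule can be worked through by hand. The sketch below implements the simplified description above (matches × weight, summed over fields) in plain JavaScript; it is not the server's scoring code, which also applies stemming, so real `textScore` values need not match these numbers. The document, the weights, and the function names are made up for illustration.

```javascript
// Split on anything that is not a letter or digit, lowercased.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Simplified score: for each indexed field, count matching tokens,
// multiply by the field weight, and sum across fields.
function simplifiedTextScore(doc, query, fieldWeights) {
  const terms = new Set(tokenize(query));
  let score = 0;
  for (const [field, weight] of Object.entries(fieldWeights)) {
    const matches = tokenize(String(doc[field] || "")).filter((t) =>
      terms.has(t)
    ).length;
    score += matches * weight;
  }
  return score;
}

const restaurant = {
  name: "Java Hut",
  description: "Coffee and cakes, best coffee in town",
};

// name has 1 match at weight 5; description has 2 matches at weight 1.
console.log(
  simplifiedTextScore(restaurant, "java coffee shop", { name: 5, description: 1 })
); // 7

// Weights are set when the index is created, for example:
// db.restaurants.createIndex(
//   { name: "text", description: "text" },
//   { weights: { name: 5 } }
// )
```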
Spoke: Knowledge Base Search with ML
● A user asks Spoke a question and expects a real-time response
○ Find the best knowledge base answer among 1000s of answers
● We use a combination of ML algorithms to determine the right answer
○ Scoring each answer independently is not an option due to latency
● Candidate generation to the rescue!
How Spoke uses Text Search
● Use MongoDB text search to select the top k highest-scoring articles
● Only run the extensive ML-based search on those k articles
○ Works as long as the right answer is in the top k
○ Allows us to build the latest ML algorithms without worrying too much about latency
● Tip: set your MongoDB text index weights carefully by fine-tuning
{ title: 10, body: 2, keywords: 6... }
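The two-stage pattern can be sketched without a database: a cheap scorer picks the top k candidates, and an expensive scorer runs only on those. In this sketch the cheap scorer is a keyword-overlap stand-in for MongoDB text search (using the title and body weights from the slide), and the expensive scorer is a stub for the ML models; the articles and all function names are made up.

```javascript
// Keep the k highest-scoring items under scoreFn.
function topK(items, scoreFn, k) {
  return items
    .map((item) => ({ item, score: scoreFn(item) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.item);
}

// Stage 1: cheap candidate generation. Stage 2: expensive re-ranking of k items.
function answer(question, articles, cheapScore, expensiveScore, k) {
  const candidates = topK(articles, (a) => cheapScore(question, a), k);
  return topK(candidates, (a) => expensiveScore(question, a), 1)[0];
}

const articles = [
  { id: 1, title: "Reset your VPN password", body: "Open the VPN portal and choose reset." },
  { id: 2, title: "Expense reports", body: "Submit receipts before month end." },
  { id: 3, title: "VPN setup on laptops", body: "Install the VPN client first." },
  { id: 4, title: "Holiday calendar", body: "Company holidays for the year." },
];

const words = (s) => s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
const overlap = (q, text) => words(text).filter((w) => words(q).includes(w)).length;

// Stand-in for text search with weights { title: 10, body: 2 }.
const cheapScore = (q, a) => 10 * overlap(q, a.title) + 2 * overlap(q, a.body);

// Stand-in for the slow ML scorer; counts how often it is called.
let expensiveCalls = 0;
const expensiveScore = (q, a) => {
  expensiveCalls += 1;
  const text = `${a.title} ${a.body}`;
  return overlap(q, text) / words(text).length;
};

const best = answer("how do I reset my vpn password", articles, cheapScore, expensiveScore, 2);
console.log(best.id, expensiveCalls);
```

The expensive scorer runs on k articles rather than the whole knowledge base, so latency is bounded by k, at the cost of missing the right answer whenever stage 1 leaves it out of the top k.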
Future Directions
What Spoke is working on
● Understand user queries and take actions
○ “I need access to Salesforce” => “issue_license(user, software=salesforce)”
● Assign custom labels to user questions
○ Allow customers to add labels to their requests
■ {“hardware”, “software”, “licensing”, “urgent”},
■ {“benefits”, “payroll”, “immigration”}
○ Specific custom labels for each client stored in MongoDB
○ Automatically predict the right labels for requests
What MongoDB is working on: Full Text Search (Beta)
● Based on Apache Lucene 8
● Integrated into MongoDB Atlas
● Separate process co-located with mongod
● Shard-aware
● Indexing = collection scan -> steady state
How do I use it?
Create a cluster on MongoDB Atlas using 4.2 RC (M30+)
Create a Full Text Index via the MongoDB Atlas UI or API
Query the index via the $searchBeta operator using MongoDB Compass or the shell; add it to your existing aggregation pipelines
What MongoDB is working on: Atlas Data Lake (beta)
● Serverless: no infrastructure to set up and manage
● Usage-based pricing: only pay for the queries you run
● On-demand: no need to load data; bring your own S3 bucket
● Auto-scalable: parallel execution delivers performance for large and complex queries across multiple user sessions
● Multi-format: JSON, BSON, CSV, TSV, Avro, Parquet
● Integrated with Atlas: users are managed by Atlas, enabled via the Atlas console
● The best tools to work with your data: the MongoDB Query Language enables flexible and efficient data access; integrates with Compass, MongoDB Shell and MongoDB drivers
What MongoDB is working on: Atlas Data Lake (beta)
Diagram: replica set nodes (Primary, Secondary, Secondary, Analytics, Analytics) and Data Lake, serving Transactional, in-app analytics, Operational Analytics, Aggregations, and Machine Learning and AI workloads
Q&A

MongoDB World 2019: Fast Machine Learning Development with MongoDB
