Vector Search for
Data Scientists
A Case Study with Twitter Analytics
#1 - How is my data distributed?
#2 - Are there outliers in my data?
#3 - Are my variables correlated with each other?
Common questions in Data Science
#1 - Can we capture the semantics in vector representations?
#2 - What can we learn about our data from semantic clusters?
Vector Search
Social Media Clicks and Twitter Analytics
Twitter Analytics
Twitter Analytics CSV Data
Tweet Text | Time | Impressions | Engagements | Engagement Rate | Retweets | Replies | Likes | User Profile Clicks | Url Clicks
I just published “ANN Benchmarks with Etienne Dilcoker -- Weaviate Podcast #16 on Medium.. | May 27th, 1:34pm | 1905 | 50 | 2.6% | 3 | 1 | 15 | 2 | 18
Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … | May 24th, 1:13pm | 7182 | 252 | 3.5% | 14 | 1 | 50 | 27 | 36
Feature Engineering:
Contains Emoji?
Character Count?
Word Count?
Contains “Weaviate”?
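The four feature questions above can be sketched as a small function. This is a minimal illustration, not the notebook's code: the function name `tweet_features`, the feature keys, and the emoji pattern (which covers only the main emoji code-point blocks) are assumptions.

```python
import re

# Rough emoji detector: covers the main emoji code-point blocks only.
# This is an assumption for illustration, not a complete Unicode emoji test.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def tweet_features(text: str) -> dict:
    """Derive simple symbolic features from the text of one tweet."""
    return {
        "contains_emoji": bool(EMOJI_PATTERN.search(text)),
        "character_count": len(text),
        "word_count": len(text.split()),
        "contains_weaviate": "weaviate" in text.lower(),
    }


features = tweet_features(
    "Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets!"
)
print(features)
```

Each feature is a symbol (a Boolean or a count), so it can be used directly to group or filter rows of the analytics CSV.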
Key Takeaways:
“Vector Search
for Data
Scientists”
1. Segmentation in Data
Science
2. Vector Representations of
Data
3. Vector Segmentation
4. Weaviate for Twitter
Analytics
5. Research Questions and
Discussion
Slides, Colab Notebook, Video Presentation available on:
github.com/CShorten/Vector-Search-for-Data-Scientists
Key Takeaway #1 -
Segmentation in
Data Science
Visualizing Distributions of Values
Segmentation in Data Science
● What Time was the Tweet sent?
● Is there a URL Link in the Tweet?
● Symbolic vs. Vector Segmentation
What Time was the Tweet sent?
Is there a URL Link in the Tweet?
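Symbolic segmentation of this kind amounts to a group-by followed by an aggregate. A minimal sketch, assuming each tweet is a dict with `hour`, `has_url`, and `impressions` keys; the rows below are made up for illustration and are not taken from the real analytics export.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows, made up for illustration only.
tweets = [
    {"hour": 13, "has_url": True, "impressions": 1900},
    {"hour": 13, "has_url": False, "impressions": 7200},
    {"hour": 3, "has_url": True, "impressions": 400},
    {"hour": 3, "has_url": False, "impressions": 600},
]


def segment_mean(rows, key, value="impressions"):
    """Group rows by a symbolic attribute and average a metric per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {segment: mean(values) for segment, values in groups.items()}


by_url = segment_mean(tweets, "has_url")
by_hour = segment_mean(tweets, "hour")
print(by_url)
print(by_hour)
```

The same function works for any symbolic column. What it cannot do is group by what the tweet is about, which is where vectors come in.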
Can we split Impressions based on
the Semantics of the content?
Weaviate Podcast Weaviate Tutorial AI Weekly Update
How can we segment analytics based on the
semantics of…
● Text
● Images
● Code
● Audio
● Video
● Graph-Structure
● Biological Sequences
● … !
Summary of
Takeaway #1
Segmentation in
Data Science
We visualize the Distribution
of our data to get a sense of it.
For example we see that
Impressions are somewhat
Normally Distributed.
Is that also true for Tweets sent
at 3 AM?
What about Tweets related to
Deep Learning for Robotics?
Key Takeaway #2 -
Vector
Representations
of Data
Symbols compared to Vectors
Symbols
Category - [0, 1, 0, 0, 0, 0]
Numeric - 52
Boolean - True
Vectors
[0.1, 0.8, 0.34, 0.8, … 0.2]
Vector Representations of Data
[Image: puppy, photo by Shayna Douglas on Unsplash] → [0.83, 0.35, .., 0.02]
[Image: puppy, photo by Bill Stephan on Unsplash] → [0.74, 0.01, .., 0.95]
Are these puppies similar?
Let’s ask Vector Distance!
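$\text{L2 Distance}(a, b) = \sum_i \lVert a_i - b_i \rVert^2$ (squared form, without the final square root)
$\text{L2 Distance}(\text{Puppy1}, \text{Puppy2}) = (4-2)^2 + (8-9)^2 + (10-11)^2 = 6$
$\text{L2 Distance}(\text{Puppy1}, \text{Airplane}) = (4-1)^2 + (8-20)^2 + (10-20)^2 = 253$
6 << 253, so Puppy1 is much more semantically similar to Puppy2 than to the Airplane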
Vector Name Value 1 Value 2 Value 3
Puppy1 4 8 10
Puppy2 2 9 11
Airplane 1 20 20
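The two distances can be checked in a few lines of Python. This sketch uses the three vectors from the table and computes the sum of squared differences, the same quantity as on the slide (no square root is taken); the function name `squared_l2` is an assumption.

```python
def squared_l2(a, b):
    """Sum of squared element-wise differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Vectors copied from the table above.
vectors = {
    "Puppy1": [4, 8, 10],
    "Puppy2": [2, 9, 11],
    "Airplane": [1, 20, 20],
}

puppy_distance = squared_l2(vectors["Puppy1"], vectors["Puppy2"])
airplane_distance = squared_l2(vectors["Puppy1"], vectors["Airplane"])
print(puppy_distance)     # 6
print(airplane_distance)  # 253
```

Taking the square root of both values would not change which vector is closer, so the ranking of neighbors is the same either way.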
Capturing Semantics in Vector Representations
How do Vectors represent real-world objects?
0.08 0.53 0.16 … 0.83 0.18
384 dimensional vector
Does this represent how much of a “brand” this is?
We aren’t sure! But there are research fields such as “Multimodal Neurons”
from OpenAI, and the general field of Disentangled Representation Learning
that are making great strides in understanding this.
Can we compress these vectors?
…
384 dimensional vector
Sometimes!
Ideas like Binary Passage Retrieval (shown above) - fp32 to Binary values
Ideas like Product Quantization - 384-d vector mapped to 32-d
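The fp32-to-binary idea can be illustrated with a deliberately simplified sketch: keep one bit per dimension (the sign) and compare codes with Hamming distance. Binary Passage Retrieval learns its hash function during training; the fixed sign threshold here, the function names, and the 6-dimensional toy vectors are assumptions for illustration, and the threshold only makes sense for vectors centered around zero.

```python
def binarize(vector):
    """Map each float to one bit: 1 if positive, otherwise 0."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 6-d float vectors; real embeddings might have 384 dimensions.
doc_a = [0.8, -0.1, 0.3, -0.7, 0.2, -0.4]
doc_b = [0.6, -0.3, 0.1, -0.5, 0.4, -0.2]
doc_c = [-0.5, 0.9, -0.2, 0.4, -0.6, 0.1]

code_a, code_b, code_c = binarize(doc_a), binarize(doc_b), binarize(doc_c)
print(hamming(code_a, code_b))  # 0: same sign pattern
print(hamming(code_a, code_c))  # 6: opposite sign pattern
```

At one bit per dimension, a 384-dimensional vector shrinks from 1536 bytes of fp32 values to 48 bytes, at the cost of discarding the magnitudes.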
Semantic Similarity with Vector Representations
Sentence-BERT:
Sentence Embeddings
using Siamese
BERT-Networks
Authored by
Nils Reimers and Iryna
Gurevych
Published 2019
Query Point
Positive and Negative Pair Sampling
Positive (Semantically Similar)
Negative (Semantically Different)
Another strategy - Data2Vec, Baevski et al. 2022
Do we need to train our own
models?
No! There are many pre-trained
models that work very well for
a broad range of data!
Great place to get started: Sentence Transformers
Summary of
Takeaway #2
Vector
Representations
of Data
Data such as Images, Text, Code,
… can be represented as Vectors
with Deep Learning models.
These models are trained to
maximize semantic similarity
with massive collections of data.
We often do not need to train the
models ourselves for particular
data domains to reach
reasonable performance.
Key Takeaway #3 -
Vector
Segmentation
● Text
● Images
● Code
● Audio
● Video
● Graph-Structure
● Biological Sequences
● … !
We can segment analytics based on the
semantics of…
Can we split Impressions based on
the Semantics of the content?
Weaviate Podcast Weaviate Tutorial AI Weekly Update
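One way to split Impressions by semantics is to assign each tweet to the topic whose vector it is most similar to, then average per topic. A minimal sketch under stated assumptions: the 3-dimensional vectors and impression counts are made up, and in practice both the tweet vectors and the topic vectors would come from an embedding model.

```python
import math
from collections import defaultdict
from statistics import mean


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical topic vectors, made up for illustration.
topic_vectors = {
    "Weaviate Podcast": [1.0, 0.1, 0.0],
    "Weaviate Tutorial": [0.1, 1.0, 0.0],
    "AI Weekly Update": [0.0, 0.1, 1.0],
}

# Hypothetical tweets as (embedding, impressions) pairs.
tweets = [
    ([0.9, 0.2, 0.1], 1000),
    ([0.8, 0.0, 0.2], 3000),
    ([0.1, 0.9, 0.1], 500),
    ([0.0, 0.2, 0.9], 7000),
]


def semantic_segments(tweets, topic_vectors):
    """Assign each tweet to its most similar topic and average impressions per topic."""
    segments = defaultdict(list)
    for embedding, impressions in tweets:
        best_topic = max(
            topic_vectors,
            key=lambda topic: cosine_similarity(embedding, topic_vectors[topic]),
        )
        segments[best_topic].append(impressions)
    return {topic: mean(values) for topic, values in segments.items()}


segments = semantic_segments(tweets, topic_vectors)
print(segments)
```

No tweet carries a manual topic label here; the grouping comes entirely from vector similarity.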
More Examples
House Hunting
Symbols: # of bedrooms, # of bathrooms, square feet, city
→ With Vectors we can encode:
● Visual style
● Neighborhood structure
● More flexible interface to define features with text
e-Commerce Products
Symbols: “Shoes”, “T-Shirt”, “Pants” or colors
→ With Vectors we can encode visual styles
Movies
Symbols can differentiate between genres like “Children”, “Action”, or “Sci-Fi”
→ With Vectors we can encode:
● Themes
● Characters
● Storylines
Scientific Papers
Symbols: “Biology”, “Machine Learning”
→ With Vectors we can encode
● Nuance of the ideas
● Writing style
Music
Symbols can differentiate between genres like “Hip Hop”, “Dance”
→ With Vectors we can encode:
● Tone
● Lyrics
● Instruments
“That’s the magic of deep learning:
turning meaning into vectors, then into geometric
spaces, and then incrementally learning complex
geometric transformations that map one space to
another. All you need are spaces of sufficiently high
dimensionality in order to capture the full scope of
the relationships found in the original data.”
- Francois Chollet, Deep Learning with Python, 2nd edition
Summary of
Takeaway #3
Vector
Segmentation
Vector representations, also
known as embeddings,
enable an Interface to split
analytics based on the
Semantics of the content.
This content could be Text,
Images, Code, Audio,
Videos, …
Key Takeaway #4 -
Weaviate for
Twitter Analytics
Twitter Analytics
Tweet Text | Time | Impressions | Engagements | Engagement Rate | Retweets | Replies | Likes | User Profile Clicks | Url Clicks
I just published “ANN Benchmarks with Etienne Dilcoker -- Weaviate Podcast #16 on Medium.. | May 27th, 1:34pm | 1905 | 50 | 2.6% | 3 | 1 | 15 | 2 | 18
Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … | May 24th, 1:13pm | 7182 | 252 | 3.5% | 14 | 1 | 50 | 27 | 36
Cloud Data Upload
There are many other ways to do this as well
Google Colab Weaviate Cloud
Services
GraphQL Live Demo
5 Nearest Neighbors to → “Weaviate Coding Tutorial”
Content | Impressions
“We have 4 Weaviate Podcast Episodes so far [ … ] how to utilize the Weaviate Database as a Document Store in Haystack pipelines … ” | 311
“We have 2 new coding tutorials on Weaviate YouTube…” | 1144
“@weaviate_io Love the integration of this with the GraphQL API!” | 378
“Here are some thoughts on combining Weaviate and Haystack! TLDR: Weavaite is a great Vector Search database…” | 15563
“Weaviate (@weaviate_io) is also announcing a collaboration with Jina AI (@JinaAI_)! …” | 586
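A query of roughly this shape could produce such a neighbor list through Weaviate's GraphQL `Get` function with a `nearText` argument, which requires a text vectorizer module to be enabled. The sketch below only builds the query string. The class name `Tweet`, the properties `content` and `impressions`, and the helper name are assumptions, since the schema used in the demo is not shown in the slides.

```python
import json


def nearest_tweets_query(concept: str, limit: int = 5) -> str:
    """Build a Weaviate GraphQL query string for the nearest tweets to a concept.

    Assumes a hypothetical class `Tweet` with `content` and `impressions` properties.
    """
    # json.dumps quotes and escapes the concept; fine for simple ASCII text.
    concepts = json.dumps([concept])
    return (
        "{\n"
        "  Get {\n"
        f"    Tweet(nearText: {{concepts: {concepts}}}, limit: {limit}) {{\n"
        "      content\n"
        "      impressions\n"
        "    }\n"
        "  }\n"
        "}"
    )


query = nearest_tweets_query("Weaviate Coding Tutorial")
print(query)
```

The resulting string can be pasted into a GraphQL console or sent by a client library; how it is sent depends on the client version in use.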
What was the Tweet about?
Have I tweeted something like this before?
Have any Weaviate Podcast guests
tweeted something like this recently?
Tweet, Author, Likes
GraphQL Live Demo
GraphQL Wikipedia Demo
Wikipedia Live Demo - Graph Data Model
GraphQL Wikipedia Demo
● Weaviate is a Vector
Search Database, rather
than a Library such as
Facebook’s FAISS or
ANNOY from Spotify
● Weaviate has a
Graph-like Data Model
Expanding Twitter project with Graph Model
Summary of
Takeaway #4
Weaviate for
Twitter Analytics
We can segment Impressions
on Twitter based on the
content of the tweet without
manual labeling!
Weaviate is a Vector Search
Database that can be used to
store and search through
semantic embeddings of data.
Key Takeaway #5 -
Research Questions
and Discussion
Research Questions and Discussion
● Should I fine-tune my embedding model?
● Large-Scale Vector Search with Approximate
Nearest Neighbor (ANN) Algorithms
● How does Vector Search differ from Classification or
Regression models?
Vector Search versus Regression on Impressions
8,530 Impressions
Model Prediction
Interpretability of Vector Search
Nearest Neighbors
Interpretability of Vector Search and Prediction
8,530 Impressions
Model Prediction
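The contrast between a bare model prediction and nearest neighbors can be sketched as follows: a nearest-neighbor estimate returns the number together with the past tweets it was derived from. This is an illustrative sketch only; the indexed tweets, their 2-dimensional vectors, and the function names are made up.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical index of past tweets, made up for illustration.
indexed = [
    {"text": "tweet A", "vector": [0.9, 0.1], "impressions": 8000},
    {"text": "tweet B", "vector": [0.8, 0.2], "impressions": 9000},
    {"text": "tweet C", "vector": [0.1, 0.9], "impressions": 500},
]


def predict_with_evidence(query_vector, indexed, k=2):
    """Average the impressions of the k nearest tweets and return them as evidence."""
    neighbors = sorted(indexed, key=lambda t: l2(query_vector, t["vector"]))[:k]
    prediction = sum(t["impressions"] for t in neighbors) / len(neighbors)
    return prediction, neighbors


prediction, neighbors = predict_with_evidence([1.0, 0.0], indexed)
print(prediction)
print([t["text"] for t in neighbors])
```

A regression model would return only the number. Here the neighbors show which earlier tweets drove the estimate, which is the interpretability argument made on these slides.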
What do we want to know about our Tweets?
Should I post this?
When might be a better time to post it?
What might be a better phrasing of this tweet?
Expanding from individuals to teams
● Has anyone on my team tweeted something
like this recently?
● Who on our team would be best fit to tell this
story?
● What topics should we be tweeting about?
Summary of
Takeaway #5
Research
Questions and
Discussion
How can we improve these
systems? What looks
promising?
Key Takeaways:
“Vector Search
for Data
Scientists”
1. Segmentation in Data
Science
2. Vector Representations of
Unstructured Data
3. Vector Segmentation
4. Weaviate Example for
Twitter Analytics
5. Research Questions and
Discussion
Slides, Colab Notebook, Video Presentation available on:
github.com/CShorten/Vector-Search-for-Data-Scientists
Connect with us!
Weaviate Slack Channel
YouTube: Weaviate • Vector Search Engine
Weaviate Podcast
Twitter @weaviate_io
Thank you for Watching!
Special thanks to Sebastian Witalec for
advising the development of this presentation
and Svitlana Smolianova for visual styling.
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Ad

Vector Search for Data Scientists.pdf

  • 1. Powered by Vector Search for Data Scientists A Case Study with Twitter Analytics
  • 2. #1 - How is my data distributed? #2 - Are there outliers in my data? #3 - Are my variables correlated with each other? Common questions in Data Science
  • 3. #1 - Can we capture the semantics in vector representations? #2 - What can we learn about our data from semantic clusters? Vector Search
  • 4. Social Media Clicks and Twitter Analytics
  • 6. Twitter Analytics CSV Data
      Columns: Tweet Text | Time | Impressions | Engagements | Engagement Rate | Retweets | Replies | Likes | User Profile Clicks | Url Clicks
      Row 1: I just published “ANN Benchmarks with Etienne Dilocker -- Weaviate Podcast #16 on Medium.. | May 27th, 1:34pm | 1905 | 50 | 2.6% | 3 | 1 | 15 | 2 | 18
      Row 2: Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … | May 24th, 1:13pm | 7182 | 252 | 3.5% | 14 | 1 | 50 | 27 | 36
      Feature Engineering: Contains Emoji? Character Count? Word Count? Contains “Weaviate”?
  • 7. Key Takeaways: “Vector Search for Data Scientists” 1. Segmentation in Data Science 2. Vector Representations of Data 3. Vector Segmentation 4. Weaviate for Twitter Analytics 5. Research Questions and Discussion Slides, Colab Notebook, Video Presentation available on: github.com/CShorten/Vector-Search-for-Data-Scientists
  • 8. Key Takeaway #1 - Segmentation in Data Science
  • 10. Segmentation in Data Science ● What Time was the Tweet sent? ● Is there a URL Link in the Tweet? ● Symbolic vs. Vector Segmentation
  • 11. What Time was the Tweet sent?
  • 12. Is there a URL Link in the Tweet?
  • 13. Can we split Impressions based on the Semantics of the content? Weaviate Podcast Weaviate Tutorial AI Weekly Update
  • 14. How can we segment analytics based on the semantics of… ● Text ● Images ● Code ● Audio ● Video ● Graph-Structure ● Biological Sequences ● … !
  • 15. Summary of Takeaway #1 Segmentation in Data Science We visualize the Distribution of our data to get a sense of it. For example we see that Impressions are somewhat Normally Distributed. Is that also true for Tweets sent at 3 AM? What about Tweets related to Deep Learning for Robotics?
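The symbolic side of Takeaway #1 can be sketched in a few lines of Python. This is a minimal illustration, not the code from the accompanying Colab notebook: the rows, the `add_features` and `segment_mean` helpers, and the example URLs are all hypothetical, and the impression counts are invented for the example.

```python
from statistics import mean

# Toy rows standing in for the Twitter Analytics CSV (invented values).
tweets = [
    {"text": "New Weaviate Podcast episode is out! https://example.com/ep", "impressions": 1900},
    {"text": "Approximate Nearest Neighbor algorithms power Vector Search", "impressions": 7200},
    {"text": "Weaviate coding tutorial: https://example.com/tutorial", "impressions": 1100},
    {"text": "Thoughts on Deep Learning for Robotics", "impressions": 3000},
]


def add_features(row):
    """Attach simple symbolic features derived from the tweet text."""
    text = row["text"]
    return {
        **row,
        "char_count": len(text),
        "word_count": len(text.split()),
        "contains_weaviate": "weaviate" in text.lower(),
        "contains_url": "http://" in text or "https://" in text,
    }


def segment_mean(rows, feature, metric="impressions"):
    """Group rows by a feature value and average the metric per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[metric])
    return {value: mean(values) for value, values in groups.items()}


featured = [add_features(row) for row in tweets]
by_url = segment_mean(featured, "contains_url")
print(by_url)  # mean impressions for tweets with and without a URL
```

Symbolic features like these split the data cleanly, but they cannot answer a question such as "what about Tweets related to Deep Learning for Robotics?", which is where vector segmentation comes in.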
  • 16. Key Takeaway #2 - Vector Representations of Data
  • 17. Symbols compared to Vectors. Symbols: Category - [0, 1, 0, 0, 0, 0], Numeric - 52, Boolean - True. Vectors: [0.1, 0.8, 0.34, 0.8, … 0.2]
  • 18. Vector Representations of Data Photo by Shayna Douglas on Unsplash 0.83 0.35 .. 0.02 Photo by Bill Stephan on Unsplash 0.74 0.01 .. 0.95
  • 19. Are these puppies similar? Let’s ask Vector Distance!
      L2 Distance = $\sum_i \lVert a_i - b_i \rVert^2$
      L2 Distance (Puppy1, Puppy2) = $(4-2)^2 + (8-9)^2 + (10-11)^2 = 6$
      L2 Distance (Puppy1, Airplane) = $(4-1)^2 + (8-20)^2 + (10-20)^2 = 253$
      6 << 253, Puppy1 is thus much more semantically similar to Puppy2 than Airplane
      Vector Name | Value 1 | Value 2 | Value 3
      Puppy1 | 4 | 8 | 10
      Puppy2 | 2 | 9 | 11
      Airplane | 1 | 20 | 20
  • 20. Capturing Semantics in Vector Representations
  • 21. How do Vectors represent real-world objects? 0.08 0.53 0.16 … 0.83 0.18 384 dimensional vector Does this represent how much of a “brand” this is? We aren’t sure! But there are research fields such as “Multimodal Neurons” from OpenAI, and the general field of Disentangled Representation Learning that are making great strides in understanding this.
  • 22. Can we compress these vectors? … 384 dimensional vector Sometimes! Ideas like Binary Passage Retrieval (shown above) - fp32 to Binary values Ideas like Product Quantization - 384-d vector mapped to 32-d
  • 23. Semantic Similarity with Vector Representations Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks Authored by Nils Reimers and Iryna Gurevych Published 2019
  • 24. Query Point Positive and Negative Pair Sampling
  • 27. Another strategy - Data2Vec, Baevski et al. 2022
  • 28. Do we need to train our own models? No! There are many pre-trained models that work very well for a broad range of data!
  • 29. Great place to get started: Sentence Transformers
  • 30. Summary of Takeaway #2 Vector Representations of Data Data such as Images, Text, Code, … can be represented as Vectors with Deep Learning models. These models are trained to maximize semantic similarity with massive collections of data. We often do not need to train the models ourselves for particular data domains to reach reasonable performance.
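The distance arithmetic from slide 19 and the compression idea from slide 22 can be sketched as follows. The first function computes the sum of squared differences exactly as the slide writes it (strictly, this is the squared L2 distance). The second part is a deliberately naive stand-in for binary compression: it thresholds each value to one bit and compares codes with Hamming distance. It is not Binary Passage Retrieval, which learns its binary codes, and the threshold of 9.5 is an arbitrary assumption chosen only for these toy vectors.

```python
def squared_l2(a, b):
    """Sum of squared element-wise differences, as written on the slide."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def binarize(vec, threshold):
    """Naive 1-bit-per-dimension code: 1 if the value exceeds the threshold."""
    return [1 if x > threshold else 0 for x in vec]


def hamming(bits_a, bits_b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(bits_a, bits_b))


# Toy vectors from the slide.
puppy1 = [4, 8, 10]
puppy2 = [2, 9, 11]
airplane = [1, 20, 20]

print(squared_l2(puppy1, puppy2))    # 6
print(squared_l2(puppy1, airplane))  # 253

# Hypothetical threshold, picked by hand for this example only.
THRESHOLD = 9.5
codes = {name: binarize(vec, THRESHOLD)
         for name, vec in [("puppy1", puppy1), ("puppy2", puppy2), ("airplane", airplane)]}
print(hamming(codes["puppy1"], codes["puppy2"]))    # 0
print(hamming(codes["puppy1"], codes["airplane"]))  # 1
```

In this toy case the binary codes preserve the ordering (Puppy1 is still closer to Puppy2 than to Airplane), which is the "Sometimes!" from the slide: compression trades precision for memory, and whether the ordering survives depends on the data.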
  • 31. Key Takeaway #3 - Vector Segmentation
  • 32. ● Text ● Images ● Code ● Audio ● Video ● Graph-Structure ● Biological Sequences ● … ! We can segment analytics based on the semantics of…
  • 33. Can we split Impressions based on the Semantics of the content? Weaviate Podcast Weaviate Tutorial AI Weekly Update
  • 35. House Hunting Symbols: # of bedrooms, # of bathrooms, square feet, city → With Vectors we can encode: ● Visual style ● Neighborhood structure ● More flexible interface to define features with text
  • 36. e-Commerce Products Symbols: “Shoes”, “T-Shirt”, “Pants” or colors → With Vectors we can encode visual styles
  • 37. Movies Symbols can differentiate between genres like “Children”, “Action”, or “Sci-Fi” → With Vectors we can encode: ● Themes ● Characters ● Storylines
  • 38. Scientific Papers Symbols: “Biology”, “Machine Learning” → With Vectors we can encode ● Nuance of the ideas ● Writing style
  • 39. Music Symbols can differentiate between genres like “Hip Hop”, “Dance” → With Vectors we can encode: ● Tone ● Lyrics ● Instruments
  • 40. “That’s the magic of deep learning: turning meaning into vectors, then into geometric spaces, and then incrementally learning complex geometric transformations that map one space to another. All you need are spaces of sufficiently high dimensionality in order to capture the full scope of the relationships found in the original data.” - Francois Chollet, Deep Learning with Python, 2nd edition
  • 41. Summary of Takeaway #3 Vector Segmentation Vector representations, also known as embeddings, enable an Interface to split analytics based on the Semantics of the content. This content could be Text, Images, Code, Audio, Videos, …
  • 42. Key Takeaway #4 - Weaviate for Twitter Analytics
  • 43. Twitter Analytics
      Columns: Tweet Text | Time | Impressions | Engagements | Engagement Rate | Retweets | Replies | Likes | User Profile Clicks | Url Clicks
      Row 1: I just published “ANN Benchmarks with Etienne Dilocker -- Weaviate Podcast #16 on Medium.. | May 27th, 1:34pm | 1905 | 50 | 2.6% | 3 | 1 | 15 | 2 | 18
      Row 2: Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … | May 24th, 1:13pm | 7182 | 252 | 3.5% | 14 | 1 | 50 | 27 | 36
  • 45. Cloud Data Upload There are many other ways to do this as well Google Colab Weaviate Cloud Services
  • 47. 5 Nearest Neighbors to → “Weaviate Coding Tutorial”
      Content | Impressions
      “We have 4 Weaviate Podcast Episodes so far [ … ] how to utilize the Weaviate Database as a Document Store in Haystack pipelines … ” | 311
      “We have 2 new coding tutorials on Weaviate YouTube…” | 1144
      “@weaviate_io Love the integration of this with the GraphQL API!” | 378
      “Here are some thoughts on combining Weaviate and Haystack! TLDR: Weavaite is a great Vector Search database…” | 15563
      “Weaviate (@weaviate_io) is also announcing a collaboration with Jina AI (@JinaAI_)! …” | 586
  • 49. What was the Tweet about?
  • 50. Have I tweeted something like this before?
  • 51. Have any Weaviate Podcast guests tweeted something like this recently?
  • 56. Wikipedia Live Demo - Graph Data Model
  • 58. ● Weaviate is a Vector Search Database, rather than a Library such as Facebook’s FAISS or ANNOY from Spotify ● Weaviate has a Graph-like Data Model
  • 59. Expanding Twitter project with Graph Model
  • 61. Summary of Takeaway #4 Weaviate for Twitter Analytics We can segment Impressions on Twitter based on the content of the tweet without manual labeling! Weaviate is a Vector Search Database that can be used to store and search through semantic embeddings of data.
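The "5 Nearest Neighbors" query from slide 47 can be approximated with a brute-force search. The sketch below is a plain-Python stand-in and does not use the Weaviate client or its API: the 3-dimensional vectors are invented (real sentence embeddings would have hundreds of dimensions), the content labels are placeholders, and `cosine_distance` and `nearest_neighbors` are hypothetical helper names. Only the impression counts are borrowed from the slide.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(query_vec, records, k=5):
    """Brute-force search: sort every record by distance to the query."""
    ranked = sorted(records, key=lambda r: cosine_distance(query_vec, r["vector"]))
    return ranked[:k]


# Invented toy embeddings; each record pairs a vector with its analytics.
records = [
    {"content": "coding tutorial tweet", "vector": [0.9, 0.1, 0.0], "impressions": 1144},
    {"content": "podcast tweet", "vector": [0.1, 0.9, 0.0], "impressions": 311},
    {"content": "API integration tweet", "vector": [0.8, 0.2, 0.1], "impressions": 378},
    {"content": "collaboration announcement tweet", "vector": [0.0, 0.1, 0.9], "impressions": 586},
]

# Stand-in for the embedding of the query "Weaviate Coding Tutorial".
query = [1.0, 0.0, 0.0]

top = nearest_neighbors(query, records, k=2)
segment_impressions = sum(r["impressions"] for r in top) / len(top)
for r in top:
    print(r["content"], r["impressions"])
print("mean impressions in this semantic segment:", segment_impressions)
```

A vector database replaces the full sort with an Approximate Nearest Neighbor index so that the same query stays fast on large collections; the segmentation logic on top of the results is unchanged.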
  • 62. Key Takeaway #5 - Research Questions and Discussion
  • 63. Research Questions and Discussion ● Should I fine-tune my embedding model? ● Large-Scale Vector Search with Approximate Nearest Neighbor (ANN) Algorithms ● How does Vector Search differ from Classification or Regression models?
  • 64. Vector Search versus Regression on Impressions 8,530 Impressions Model Prediction
  • 65. Interpretability of Vector Search Nearest Neighbors
  • 66. Interpretability of Vector Search and Prediction 8,530 Impressions Model Prediction
  • 67. What do we want to know about our Tweets? Should I post this? When might be a better time to post it? What might be a better phrasing of this tweet?
  • 68. Expanding from individuals to teams ● Has anyone on my team tweeted something like this recently? ● Who on our team would be best fit to tell this story? ● What topics should we be tweeting about?
  • 69. Summary of Takeaway #5 Research Questions and Discussion How can we improve these systems? What looks promising?
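One way to read the "Vector Search versus Regression" and interpretability slides is as a k-nearest-neighbor prediction: estimate a new tweet's impressions from the tweets most similar to it, and return those neighbors as the explanation. The sketch below assumes that reading; the slides do not specify the method. The 2-dimensional vectors, the impression counts, and the `predict_impressions` helper are all invented for illustration.

```python
def squared_l2(a, b):
    """Sum of squared element-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_impressions(query_vec, records, k=3):
    """Average the impressions of the k nearest records.

    Returns the prediction together with the neighbors that produced it,
    so the estimate can be inspected rather than taken on faith.
    """
    neighbors = sorted(records, key=lambda r: squared_l2(query_vec, r["vector"]))[:k]
    prediction = sum(r["impressions"] for r in neighbors) / len(neighbors)
    return prediction, neighbors


# Invented toy data: two tutorial-like tweets and one podcast-like tweet.
records = [
    {"text": "tutorial A", "vector": [1.0, 0.0], "impressions": 1000},
    {"text": "tutorial B", "vector": [0.9, 0.1], "impressions": 2000},
    {"text": "podcast A", "vector": [0.0, 1.0], "impressions": 9000},
]

draft_vector = [1.0, 0.1]  # stand-in embedding of a draft tweet
prediction, neighbors = predict_impressions(draft_vector, records, k=2)
print("predicted impressions:", prediction)
print("based on:", [r["text"] for r in neighbors])
```

Unlike a bare regression output, the returned neighbors show which past tweets drove the estimate, which helps with questions such as "Have I tweeted something like this before?".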
  • 70. Key Takeaways: “Vector Search for Data Scientists” 1. Segmentation in Data Science 2. Vector Representations of Unstructured Data 3. Vector Segmentation 4. Weaviate Example for Twitter Analytics 5. Research Questions and Discussion Slides, Colab Notebook, Video Presentation available on: github.com/CShorten/Vector-Search-for-Data-Scientists
  • 71. Connect with us! Weaviate Slack Channel YouTube: Weaviate • Vector Search Engine Weaviate Podcast Twitter @weaviate_io
  • 72. Thank you for Watching! Special thanks to Sebastian Witalec for advising on the development of this presentation and Svitlana Smolianova for visual styling.