Vector Search for Data Scientists.pdf

Vector Search for Data Scientists
A Case Study with Twitter Analytics

Common questions in Data Science
#1 - How is my data distributed?
#2 - Are there outliers in my data?
#3 - Are my variables correlated with each other?

Vector Search
#1 - Can we capture the semantics in vector representations?
#2 - What can we learn about our data from semantic clusters?

Social Media Clicks and Twitter Analytics

Twitter Analytics CSV Data
Tweet Text | Time | Impressions | Engagements | Engagement Rate | Retweets | Replies | Likes | User Profile Clicks | Url Clicks
I just published “ANN Benchmarks with Etienne Dilcoker -- Weaviate Podcast #16 on Medium.. | May 27th, 1:34pm | 1905 | 50 | 2.6% | 3 | 1 | 15 | 2 | 18
Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … | May 24th, 1:13pm | 7182 | 252 | 3.5% | 14 | 1 | 50 | 27 | 36
Feature Engineering:
Contains Emoji?
Character Count?
Word Count?
Contains “Weaviate”?
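The four features above can be derived directly from the tweet text. The following is a minimal sketch, not the notebook's code: the function name `tweet_features`, the dictionary keys, and the rough emoji pattern are all assumptions made for illustration.

```python
import re

# Rough emoji matcher (assumption): covers common emoji code-point blocks,
# not every emoji that exists.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def tweet_features(text):
    """Derive simple symbolic features from raw tweet text."""
    return {
        "contains_emoji": bool(EMOJI_PATTERN.search(text)),
        "character_count": len(text),
        "word_count": len(text.split()),  # whitespace-separated tokens
        "contains_weaviate": "weaviate" in text.lower(),
    }


example = ("Approximate Nearest Neighbor algorithms allow us to "
           "Vector Search in massive datasets!")
features = tweet_features(example)
```

Each feature is a symbol (Boolean or number), so it can be used to split or group the analytics columns directly.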

Key Takeaways: “Vector Search for Data Scientists”
1. Segmentation in Data Science
2. Vector Representations of Data
3. Vector Segmentation
4. Weaviate for Twitter Analytics
5. Research Questions and Discussion
Slides, Colab Notebook, Video Presentation available on:
github.com/CShorten/Vector-Search-for-Data-Scientists

Key Takeaway #1 - Segmentation in Data Science

Visualizing Distributions of Values

Segmentation in Data Science
● What Time was the Tweet sent?
● Is there a URL Link in the Tweet?
● Symbolic vs. Vector Segmentation

Is there a URL Link in the Tweet?
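A symbolic split such as "has a URL or not" amounts to grouping Impressions by a Boolean flag. Here is a small sketch under stated assumptions: the rows and impression counts are invented for illustration, and `has_url` is a hypothetical helper that only looks for an http(s) prefix.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows of (tweet text, impressions); the numbers are made up.
tweets = [
    ("New podcast episode https://example.com/ep", 1200),
    ("Vector search is fun", 800),
    ("Read the tutorial at https://example.com/tut", 1600),
    ("Thoughts on ANN benchmarks", 1000),
]


def has_url(text):
    """Symbolic feature: does the tweet contain a link?"""
    return "http://" in text or "https://" in text


# Group impressions by the value of the symbolic feature.
segments = defaultdict(list)
for text, impressions in tweets:
    segments[has_url(text)].append(impressions)

mean_impressions = {flag: mean(values) for flag, values in segments.items()}
```

Comparing `mean_impressions[True]` with `mean_impressions[False]` is the numeric counterpart of plotting the two distributions side by side.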

Can we split Impressions based on the Semantics of the content?
Segments shown: Weaviate Podcast, Weaviate Tutorial, AI Weekly Update
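One way to picture a semantic split is to assign each tweet to the segment whose vector it is closest to, then aggregate Impressions per segment. This is only a sketch: the 3-dimensional vectors and impression counts below are invented, and in practice both the segment descriptions and the tweets would be embedded by a model.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of the three segment descriptions (made-up values).
segment_vectors = {
    "Weaviate Podcast": [0.9, 0.1, 0.1],
    "Weaviate Tutorial": [0.1, 0.9, 0.1],
    "AI Weekly Update": [0.1, 0.1, 0.9],
}

# Hypothetical (tweet embedding, impressions) pairs.
tweets = [
    ([0.8, 0.2, 0.1], 1000),
    ([0.7, 0.1, 0.3], 3000),
    ([0.2, 0.9, 0.2], 500),
    ([0.1, 0.2, 0.8], 7000),
]


def nearest_segment(vector):
    """Name of the segment whose vector is most similar to `vector`."""
    return max(segment_vectors,
               key=lambda name: cosine(vector, segment_vectors[name]))


impressions_by_segment = defaultdict(list)
for vector, impressions in tweets:
    impressions_by_segment[nearest_segment(vector)].append(impressions)

mean_by_segment = {name: sum(v) / len(v)
                   for name, v in impressions_by_segment.items()}
```

No manual labels are involved: the segment of a tweet follows from vector similarity alone.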

How can we segment analytics based on the semantics of…
● Text
● Images
● Code
● Audio
● Video
● Graph-Structure
● Biological Sequences
● … !

Summary of Takeaway #1 - Segmentation in Data Science
We visualize the distribution of our data to get a sense of it. For example, we see that Impressions are somewhat normally distributed.
Is that also true for Tweets sent at 3 AM?
What about Tweets related to Deep Learning for Robotics?

Key Takeaway #2 - Vector Representations of Data

Symbols compared to Vectors
Symbols
Category - [0, 1, 0, 0, 0, 0]
Numeric - 52
Boolean - True
Vectors
[0.1, 0.8, 0.34, 0.8, … 0.2]

Vector Representations of Data
Photo by Shayna Douglas on Unsplash → [0.83, 0.35, .., 0.02]
Photo by Bill Stephan on Unsplash → [0.74, 0.01, .., 0.95]

Are these puppies similar?
Let’s ask Vector Distance!
Vector Name | Value 1 | Value 2 | Value 3
Puppy1 | 4 | 8 | 10
Puppy2 | 2 | 9 | 11
Airplane | 1 | 20 | 20
Squared L2 Distance: $d(a, b) = \sum_i (a_i - b_i)^2$
$d(\text{Puppy1}, \text{Puppy2}) = (4-2)^2 + (8-9)^2 + (10-11)^2 = 6$
$d(\text{Puppy1}, \text{Airplane}) = (4-1)^2 + (8-20)^2 + (10-20)^2 = 253$
6 << 253, so Puppy1 is much more semantically similar to Puppy2 than to Airplane. (Taking the square root to obtain the L2 distance itself does not change this ordering.)
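The arithmetic on this slide sums squared differences without taking a square root, so strictly it is the squared L2 distance. A short sketch that reproduces the two numbers (the function name `squared_l2` is my own):

```python
import math


def squared_l2(a, b):
    """Sum of squared element-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Values taken from the slide's table.
puppy1 = [4, 8, 10]
puppy2 = [2, 9, 11]
airplane = [1, 20, 20]

d_puppies = squared_l2(puppy1, puppy2)      # 4 + 1 + 1
d_airplane = squared_l2(puppy1, airplane)   # 9 + 144 + 100

# The L2 (Euclidean) distance is the square root; the ranking is unchanged.
l2_puppies = math.sqrt(d_puppies)
l2_airplane = math.sqrt(d_airplane)
```

Because the square root is monotonic, nearest-neighbor rankings are the same with or without it, which is why it is often skipped.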

Capturing Semantics in Vector Representations

How do Vectors represent real-world objects?
[0.08, 0.53, 0.16, …, 0.83, 0.18] (384-dimensional vector)
Does this represent how much of a “brand” this is?
We aren’t sure! But research such as OpenAI’s “Multimodal Neurons” and the general field of Disentangled Representation Learning is making great strides in understanding this.

Can we compress these vectors?
…
384 dimensional vector
Sometimes!
Ideas like Binary Passage Retrieval (shown above) - fp32 to Binary values
Ideas like Product Quantization - 384-d vector mapped to 32-d

Semantic Similarity with Vector Representations
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Authored by Nils Reimers and Iryna Gurevych
Published 2019

Query Point
Positive and Negative Pair Sampling

Positive (Semantically Similar)

Negative (Semantically Different)
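One common way to train with a query point, a positive, and a negative is a triplet margin loss. The slides do not say which objective is used, so treat this as an illustrative assumption; the 2-d vectors and the margin of 1.0 are made up.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative
    by at least `margin`; positive otherwise."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-d embeddings.
query = [1.0, 0.0]
similar = [0.9, 0.1]     # semantically similar example
different = [0.0, 1.0]   # semantically different example

good = triplet_loss(query, similar, different)
# Swapping the roles simulates a badly arranged embedding space.
bad = triplet_loss(query, different, similar)
```

Minimizing this loss over many sampled triplets pulls positives toward the query and pushes negatives away.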

Another strategy - Data2Vec, Baevski et al. 2022

Do we need to train our own models?
No! There are many pre-trained models that work very well for a broad range of data!

Great place to get started: Sentence Transformers

Summary of Takeaway #2 - Vector Representations of Data
Data such as Images, Text, Code, … can be represented as Vectors with Deep Learning models.
These models are trained on massive collections of data so that semantically similar items receive similar vectors.
We often do not need to train the models ourselves for particular data domains to reach reasonable performance.

Key Takeaway #3 - Vector Segmentation

We can segment analytics based on the semantics of…
● Text
● Images
● Code
● Audio
● Video
● Graph-Structure
● Biological Sequences
● … !

House Hunting
Symbols: # of bedrooms, # of bathrooms, square feet, city
→ With Vectors we can encode:
● Visual style
● Neighborhood structure
● More flexible interface to define features with text

e-Commerce Products
Symbols: “Shoes”, “T-Shirt”, “Pants” or colors
→ With Vectors we can encode visual styles

Movies
Symbols can differentiate between genres like “Children”, “Action”, or “Sci-Fi”
→ With Vectors we can encode
● Themes
● Characters
● Storylines

Scientific Papers
Symbols: “Biology”, “Machine Learning”
→ With Vectors we can encode
● Nuance of the ideas
● Writing style

Music
Symbols can differentiate between genres like “Hip Hop”, “Dance”
→ With Vectors we can encode
● Tone
● Lyrics
● Instruments

“That’s the magic of deep learning:
turning meaning into vectors, then into geometric
spaces, and then incrementally learning complex
geometric transformations that map one space to
another. All you need are spaces of sufficiently high
dimensionality in order to capture the full scope of
the relationships found in the original data.”
- Francois Chollet, Deep Learning with Python, 2nd edition

Summary of Takeaway #3 - Vector Segmentation
Vector representations, also known as embeddings, enable an interface to split analytics based on the semantics of the content.
This content could be Text, Images, Code, Audio, Videos, …

Key Takeaway #4 - Weaviate for Twitter Analytics

Twitter Analytics
Tweet Text | Time | Impressions | Engagements | Engagement Rate | Retweets | Replies | Likes | User Profile Clicks | Url Clicks
I just published “ANN Benchmarks with Etienne Dilcoker -- Weaviate Podcast #16 on Medium.. | May 27th, 1:34pm | 1905 | 50 | 2.6% | 3 | 1 | 15 | 2 | 18
Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … | May 24th, 1:13pm | 7182 | 252 | 3.5% | 14 | 1 | 50 | 27 | 36


Cloud Data Upload
Google Colab → Weaviate Cloud Services
There are many other ways to do this as well

5 Nearest Neighbors to → “Weaviate Coding Tutorial”
Content | Impressions
“We have 4 Weaviate Podcast Episodes so far [ … ] how to utilize the Weaviate Database as a Document Store in Haystack pipelines … ” | 311
“We have 2 new coding tutorials on Weaviate YouTube…” | 1144
“@weaviate_io Love the integration of this with the GraphQL API!” | 378
“Here are some thoughts on combining Weaviate and Haystack! TLDR: Weavaite is a great Vector Search database…” | 15563
“Weaviate (@weaviate_io) is also announcing a collaboration with Jina AI (@JinaAI_)! …” | 586
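Conceptually, a nearest-neighbor query like the one above ranks stored items by their distance to the query vector and keeps the top k. The sketch below does this by brute force in plain Python. It is not how Weaviate is queried and not its API; the item names, 2-d vectors, and impression counts are invented for illustration.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def nearest_neighbors(query_vector, items, k=5):
    """Return the k items whose vectors are closest to the query."""
    ranked = sorted(items,
                    key=lambda item: cosine_distance(query_vector, item["vector"]))
    return ranked[:k]


# Hypothetical stored tweets with made-up embeddings and impressions.
items = [
    {"content": "tweet A", "vector": [0.9, 0.1], "impressions": 1100},
    {"content": "tweet B", "vector": [0.6, 0.4], "impressions": 300},
    {"content": "tweet C", "vector": [0.0, 1.0], "impressions": 100},
]

query = [1.0, 0.0]  # stands in for the embedded query text
top2 = nearest_neighbors(query, items, k=2)
```

A vector database replaces the full sort with an approximate nearest neighbor index so that the query stays fast on large collections.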

Have I tweeted something like this before?

Have any Weaviate Podcast guests
tweeted something like this recently?

Wikipedia Live Demo - Graph Data Model

● Weaviate is a Vector Search Database, rather than a Library such as Facebook’s FAISS or ANNOY from Spotify
● Weaviate has a Graph-like Data Model

Expanding Twitter project with Graph Model

Summary of Takeaway #4 - Weaviate for Twitter Analytics
We can segment Impressions on Twitter based on the content of the tweet without manual labeling!
Weaviate is a Vector Search Database that can be used to store and search through semantic embeddings of data.

Key Takeaway #5 - Research Questions and Discussion

Research Questions and Discussion
● Should I fine-tune my embedding model?
● Large-Scale Vector Search with Approximate Nearest Neighbor (ANN) Algorithms
● How does Vector Search differ from Classification or Regression models?

Vector Search versus Regression on Impressions
8,530 Impressions
Model Prediction

Interpretability of Vector Search
Nearest Neighbors

Interpretability of Vector Search and Prediction
8,530 Impressions
Model Prediction
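One way to read the link between nearest neighbors and prediction is a k-nearest-neighbor estimate: predict Impressions as the average over the most similar past tweets, and return those tweets as the explanation. This is an assumption about how the two ideas could be combined, not the presenter's model; the vectors and counts are invented.

```python
import math

# Hypothetical history of (tweet embedding, impressions); values are made up.
history = [
    ([0.9, 0.1], 8000),
    ([0.8, 0.2], 9000),
    ([0.1, 0.9], 500),
    ([0.2, 0.8], 700),
]


def predict_impressions(query_vector, history, k=2):
    """Average the impressions of the k nearest past tweets.

    Returns both the prediction and the neighbors it was computed from,
    so the estimate can be inspected rather than taken on trust.
    """
    neighbors = sorted(history,
                       key=lambda row: math.dist(query_vector, row[0]))[:k]
    prediction = sum(impressions for _, impressions in neighbors) / len(neighbors)
    return prediction, neighbors


prediction, neighbors = predict_impressions([0.88, 0.1], history, k=2)
```

Unlike an opaque regression output, the returned neighbors show which past tweets drove the number.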

What do we want to know about our Tweets?
Should I post this?
When might be a better time to post it?
What might be a better phrasing of this tweet?

Expanding from individuals to teams
● Has anyone on my team tweeted something like this recently?
● Who on our team would be the best fit to tell this story?
● What topics should we be tweeting about?

Summary of Takeaway #5 - Research Questions and Discussion
How can we improve these systems? What looks promising?

Key Takeaways: “Vector Search for Data Scientists”
1. Segmentation in Data Science
2. Vector Representations of Unstructured Data
3. Vector Segmentation
4. Weaviate Example for Twitter Analytics
5. Research Questions and Discussion
Slides, Colab Notebook, Video Presentation available on:
github.com/CShorten/Vector-Search-for-Data-Scientists

Connect with us!
Weaviate Slack Channel
YouTube: Weaviate • Vector Search Engine
Weaviate Podcast
Twitter @weaviate_io

Thank you for Watching!
Special thanks to Sebastian Witalec for advising the development of this presentation and Svitlana Smolianova for visual styling.
