A Search Index
is not
A Database Index
Toria Gibbs
Senior Software Engineer @ Etsy
@scarletdrive
A Search Index is Not a Database Index - Full Stack Toronto 2017
Story time!
Search Index
Database Index
They hired me!
(even though I was wrong)
Agenda
0: Terminology
1: Text Search
2: Numeric Range Search
3: Storage
Terminology
Database
Table
Schema
Column
Row

id   name      breed
001  Momo      Cat
002  Naga      Cat
003  Sullivan  Dog

id: integer
name: string
breed: string

pets
id   name      breed
001  Momo      Cat
002  Naga      Cat
003  Sullivan  Dog
id: integer
name: string
breed: string

humans
id   name
001  Toria
002  Colleen
id: integer
name: string

owners
human_id  pet_id
001       001
001       002
002       003
human_id: int
pet_id: int

Terminology
Database        Search Engine
Table           Search Index
Schema          Schema
Column          Field
Row             Document
Database Index  Inverted Index
Text Search
Part 1
By Rebecca Davis
pawsomecrochet.etsy.com
Secret Santa
for Cats
Find all the
cat-related items in a
database
github.com/toriagibbs/SecretSanta

id   title           description                                          price   quantity
001  Cat hat         A very good hat for very good cats                   $15.00  4
002  Vacation hat    Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats   A set of three hats for the most extreme cat people  $25.00  1
004  Kitten hat      This is a very small hat, for kittens particularly   $11.00  2
005  Kitten mittens  Finally! An elegant, comfortable mitten for cats     $25.97  18

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';
Database Performance
O(n*m)
n = number of rows in the database
m = length of strings
Treating m as a constant, this simplifies to O(n)
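
A minimal Python sketch of why the LIKE query is linear in the number of rows: without an index, every row has to be visited and each string scanned. The function name `like_contains` and the in-memory `listings` list are assumptions made for illustration; a real database engine does this internally.

```python
def like_contains(rows, needle):
    """Return ids of rows whose title or description contains `needle`."""
    matches = []
    for row in rows:  # n iterations: every row is visited
        # Each substring check may scan the whole string (length up to m).
        if needle in row["title"] or needle in row["description"]:
            matches.append(row["id"])
    return matches


listings = [
    {"id": 1, "title": "Cat hat", "description": "A very good hat for very good cats"},
    {"id": 2, "title": "Vacation hat", "description": "Wear this hat to the beach maybe"},
    {"id": 3, "title": "Hats for cats", "description": "A set of three hats for the most extreme cat people"},
    {"id": 4, "title": "Kitten hat", "description": "This is a very small hat, for kittens particularly"},
    {"id": 5, "title": "Kitten mittens", "description": "Finally! An elegant, comfortable mitten for cats"},
]

print(like_contains(listings, "cat"))  # [1, 2, 3, 5]
```

Note that row 2 matches only because "Vacation" contains the substring "cat", and row 4 is missed entirely.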
CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id)
);

id   title
001  Cat hat
002  Vacation hat
003  Hats for cats
004  Kitten hat
005  Kitten mittens

title     id
cat       [001, 003]
hat       [001, 002, 003, 004]
vacation  [002]
for       [003]
kitten    [004, 005]
mitten    [005]
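
The title index above can be reproduced with a short Python sketch. The `normalize` rule here (lowercase, then drop a trailing "s") is an assumption standing in for whatever analysis a real search engine applies; it is only good enough for these five titles.

```python
from collections import defaultdict


def normalize(word):
    # Assumption: lowercase and strip a trailing "s" as a stand-in for real stemming.
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_inverted_index(docs):
    """Map each normalized term to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        # A set avoids listing the same document twice for one term.
        for term in {normalize(w) for w in text.split()}:
            index[term].append(doc_id)
    return dict(index)


titles = {
    1: "Cat hat",
    2: "Vacation hat",
    3: "Hats for cats",
    4: "Kitten hat",
    5: "Kitten mittens",
}

title_index = build_inverted_index(titles)
print(title_index["cat"])  # [1, 3]
print(title_index["hat"])  # [1, 2, 3, 4]
```

The work of splitting and normalizing happens once, when documents are indexed, rather than on every query.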

title
key       value
cat       [001, 003]
hat       [001, 002, 003, 004]
vacation  [002]
for       [003]
kitten    [004, 005]
mitten    [005]

description
key    value
very   [001, 004]
good   [001]
hat    [001, 002, 003, 004]
cat    [001, 003, 005]
wear   [002]
beach  [002]
...    ...

q=cat

<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

Search Index Performance
O(1) lookup: 2 hash lookups (one per queried field) = constant time
O(1) + retrieval of the matching documents
O(r) overall
r = number of results found
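
As a rough illustration of the two hash lookups, the sketch below models each field's index as a Python dict and unions the posting lists for `q=cat`. The dict contents are copied from the slides; the `search` function is a hypothetical helper, not Solr's API.

```python
title_index = {"cat": [1, 3], "hat": [1, 2, 3, 4], "kitten": [4, 5]}
description_index = {"cat": [1, 3, 5], "hat": [1, 2, 3, 4]}


def search(term, field_indexes):
    """Union the posting lists for `term` across the queried fields."""
    results = set()
    for index in field_indexes:  # one hash lookup per queried field
        results.update(index.get(term, []))  # work proportional to the postings returned
    return sorted(results)  # sorted only to make the output stable


print(search("cat", [title_index, description_index]))  # [1, 3, 5]
```

The lookups themselves do not depend on how many documents exist; the remaining cost comes from handling the r matching ids.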
Text Search Quality
Part 1 ½

Problem: case sensitivity

id   title    description                         price   quantity
001  Cat hat  A very good hat for very good cats  $15.00  4

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

Solution: SQL "LOWER"

SELECT * FROM listings
WHERE LOWER(title) LIKE '%cat%'
OR LOWER(description) LIKE '%cat%';

Problem: hidden substring ("Vacation" contains "cat")

id   title          description                                          price   quantity
002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

Solution: check punctuation & whitespace for every word form

SELECT * FROM listings
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
...

Problem: missed relevant item

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

id   title       description                                         price   quantity
004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2

SELECT * FROM listings
WHERE LOWER(title) = 'cat' OR LOWER(title) = 'cats'
OR LOWER(title) = 'kitten' OR LOWER(title) = 'kittens'
OR LOWER(title) LIKE 'cat %' OR LOWER(title) LIKE 'cats %'
OR LOWER(title) LIKE 'kitten %' OR LOWER(title) LIKE 'kittens %'
OR LOWER(title) LIKE '% cat %' OR LOWER(title) LIKE '% cats %'
OR LOWER(title) LIKE '% kitten %' OR LOWER(title) LIKE '% kittens %'
OR LOWER(title) LIKE '% cat.%' OR LOWER(title) LIKE '% cats.%'
OR LOWER(title) LIKE '% kitten.%' OR LOWER(title) LIKE '% kittens.%'
OR LOWER(title) LIKE '%.cat %' OR LOWER(title) LIKE '%.cats %'
OR LOWER(title) LIKE '%.kitten %' OR LOWER(title) LIKE '%.kittens %'
OR LOWER(title) LIKE '%.cat.%' OR LOWER(title) LIKE '%.cats.%'
OR LOWER(title) LIKE '%.kitten.%' OR LOWER(title) LIKE '%.kittens.%'
...
OR LOWER(title) LIKE '% cat' OR LOWER(title) LIKE '% cats'
OR LOWER(title) LIKE '% kitten' OR LOWER(title) LIKE '% kittens'
...

Let's solve it with a search index

Problem: case sensitivity

id   title    description                         price   quantity
001  Cat hat  A very good hat for very good cats  $15.00  4

q=cat

Solution: everything is lowercase

title (before)
key  value
cat  [003]
Cat  [001]

title (after)
key  value
cat  [001, 003]

Problem: hidden substring

id   title          description                                          price   quantity
002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1

q=cat

Solution: tokenization & stemming

"Vacation hat" → ["vacation", "hat"]
"hats" → "hat"
"cats" → "cat"
"catlike" → "cat"
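
A toy analyzer makes the difference from substring matching concrete: after tokenization, "cat" is simply not one of the tokens of "Vacation hat". The suffix rules in `SUFFIXES` are hypothetical and far cruder than a real stemmer; they are chosen only to reproduce the examples above.

```python
import re

SUFFIXES = ("like", "s")  # hypothetical toy stemming rules


def analyze(text):
    """Lowercase, split into alphanumeric tokens, then strip one known suffix."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stemmed = []
    for token in tokens:
        for suffix in SUFFIXES:
            # Keep at least three characters so short words are left alone.
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


print(analyze("Vacation hat"))   # ['vacation', 'hat']
print(analyze("Hats for cats"))  # ['hat', 'for', 'cat']
print(analyze("catlike"))        # ['cat']
```

Because matching happens on whole tokens, a query for "cat" finds "Hats for cats" but not "Vacation hat".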

Problem: missed relevant item

id   title       description                                         price   quantity
004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2

q=cat

Solution: synonyms

title (before)
key     value
cat     [001, 003]
kitten  [004, 005]

title (after)
key  value
cat  [001, 003, 004, 005]
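
One way to picture index-time synonyms is to copy the postings of a term into the postings of its synonym. The sketch below assumes a one-way rule ("kitten" also counts as "cat"); the `SYNONYMS` mapping and the helper name are illustrative, not a real search engine's configuration format.

```python
SYNONYMS = {"kitten": ["cat"]}  # hypothetical one-way synonym rule


def index_with_synonyms(postings, synonyms):
    """Return a new index where each term's ids are also filed under its synonyms."""
    expanded = {term: set(ids) for term, ids in postings.items()}
    for term, ids in postings.items():
        for target in synonyms.get(term, []):
            expanded.setdefault(target, set()).update(ids)
    return {term: sorted(ids) for term, ids in expanded.items()}


title_index = {"cat": [1, 3], "kitten": [4, 5]}
expanded = index_with_synonyms(title_index, SYNONYMS)
print(expanded["cat"])  # [1, 3, 4, 5]
```

The query stays a single lookup for "cat"; the extra work and extra entries are paid for when indexing.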

TRADE-OFFS
Database                            Search Engine
O(n) text search                    O(r) text search (where r <= n)
Poor quality due to case            High quality due to case
sensitivity, substring mismatches,  insensitivity, tokenization,
and missing terms                   stemming, and synonyms
                                    More disk space
                                    Do work at "index time"

Numeric Range Search
Part 2
By Rebecca Davis
pawsomecrochet.etsy.com
Secret Santa
for Cats
Find all the
cat-related items
under $15
in a database
github.com/toriagibbs/SecretSanta

SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
AND price <= 15;

CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id),
KEY (price)
);

Database Index

price  15.00  49.99  25.00  11.00  25.97
id     001    002    003    004    005

Index on price (ids in ascending price order):
id=004  id=001  id=003  id=005  id=002

SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
AND price <= 15;

Database Performance
O(log n)
Log base 2 for a binary tree
Log base B for a B-tree
O(log n) + retrieval
O(log n + r)
n = number of rows in the database
r = number of results found

n          log2 n
10         3.32
100        6.64
1 000      9.97
10 000     13.29
100 000    16.61
1 000 000  19.93
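
The O(log n + r) behaviour can be sketched with a sorted list and binary search, which stands in here for the tree a database would actually use. Prices are stored as integer cents to avoid floating-point comparisons; the list `price_index` and the helper name are assumptions for illustration.

```python
import bisect

# (price_in_cents, id) pairs kept sorted by price, like an index on price.
price_index = sorted([(1500, 1), (4999, 2), (2500, 3), (1100, 4), (2597, 5)])


def ids_with_price_at_most(index, max_cents):
    """Return ids whose price is <= max_cents, in ascending price order."""
    # Binary search finds the cut-off position in O(log n) comparisons.
    end = bisect.bisect_right(index, (max_cents, float("inf")))
    # Reading out the matching entries costs O(r).
    return [doc_id for _, doc_id in index[:end]]


print(ids_with_price_at_most(price_index, 1500))  # [4, 1]
```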

SIDEBAR
Why didn't we do this for text fields?!
Prefix Tree (Trie)
car
cat
ham
hat
Prefix Tree (Trie)
"car cat ham hat"
SIDEBAR
Database indexes for string fields can only search prefixes
Unless you declare a "full text" index like:
FULLTEXT (description)
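
A small trie shows why a prefix structure over whole strings cannot answer "contains" queries. This is a sketch of the general idea, not of how any particular database implements its string indexes; the class and method names are made up for the example.

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return every stored string that starts with `prefix`."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


words = Trie()
for word in ["car", "cat", "ham", "hat"]:
    words.insert(word)
print(words.with_prefix("ca"))  # ['car', 'cat']

whole_field = Trie()
whole_field.insert("car cat ham hat")
print(whole_field.with_prefix("cat"))  # []
```

When the whole field value is stored as one string, "cat" is buried in the middle and a prefix walk never reaches it.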

SIDEBAR
Database                            Search Engine
O(r) text search                    O(r) text search
Poor quality due to case            High quality due to case
sensitivity, substring mismatches,  insensitivity, tokenization,
and missing terms                   stemming, and synonyms

By Lacey Smith
hungupokanagan.etsy.com
Back to numeric searching...

price
key    value
11.00  [004]
15.00  [001]
25.00  [003]
25.97  [005]
49.99  [002]

q=cat & fq=price:[* TO 15]

<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

Naive expansion of the range filter against the price index:
price=0.00 OR price=0.01 OR
price=0.02 OR price=0.03 OR
price=0.04 OR price=0.05 OR
price=0.06 OR price=0.07 OR
price=0.08 OR price=0.09 OR
…
price=14.93 OR price=14.94 OR
price=14.95 OR price=14.96 OR
price=14.97 OR price=14.98 OR
price=14.99 OR price=15.00

price
key            value
0 - 24.99      [001, 004]
0 - 12.49      [004]
11.00          [004]
12.50 - 24.99  [001]
15.00          [001]
25.00 - 49.99  [002, 003, 005]
25.00 - 37.49  [003, 005]
25.00          [003]
25.97          [005]
37.50 - 49.99  [002]
49.99          [002]

fq=price:[25 TO 50]
price(25.00 - 49.99)
U price(50.00)

fq=price:[* TO 40]
price(0 - 24.99)
U price(25.00 - 37.49)
U price(37.50)
U price(37.51)
U price(37.52)
...
U price(40.00)

price
key            value
0 - 24.99      [001, 004]
0 - 12.49      [004]
...            ...
11.00          [004]
12.50 - 24.99  [001]
12.50 - 12.99
13.00 - 13.49
...            ...
15.00 - 15.49  [001]
15.00          [001]
...            ...

fq=price:[* TO 15]
price(0 - 12.49)
U price(12.50 - 12.99)
U price(13.00 - 13.49)
U price(13.50 - 13.99)
U price(14.00 - 14.49)
U price(14.50 - 14.99)
U price(15.00)

Search Index Performance
O(log (max-min))
For the max and min values of the field
O(1)
The number of buckets doesn't change with the size of the data
O(r)
r = number of results found
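
The bucket idea can be sketched by filing every price under several precomputed ranges and answering a range filter with a handful of bucket lookups. The bucket widths in `WIDTHS` (in cents) and the greedy `cover` function are assumptions chosen for a compact example; they do not reproduce the exact bucket boundaries on the slides or any real engine's encoding.

```python
from collections import defaultdict

WIDTHS = (1000, 100, 10, 1)  # bucket widths in cents; hypothetical choice


def add_price(index, doc_id, cents):
    # File the document under one bucket per width.
    for width in WIDTHS:
        index[(width, cents // width)].add(doc_id)


def cover(lo, hi):
    """Split the inclusive range [lo, hi] (in cents) into aligned buckets."""
    keys, pos = [], lo
    while pos <= hi:
        for width in WIDTHS:  # try the widest bucket first
            if pos % width == 0 and pos + width - 1 <= hi:
                keys.append((width, pos // width))
                pos += width
                break
    return keys


def range_query(index, lo, hi):
    result = set()
    for key in cover(lo, hi):  # one hash lookup per bucket
        result |= index.get(key, set())
    return sorted(result)


index = defaultdict(set)
for doc_id, cents in [(1, 1500), (2, 4999), (3, 2500), (4, 1100), (5, 2597)]:
    add_price(index, doc_id, cents)

print(range_query(index, 0, 1500))  # [1, 4]
print(len(cover(0, 1500)))          # 7
```

The number of buckets consulted depends on the range endpoints and the chosen widths, not on how many documents are indexed.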

Database                           Search Engine
O(n) text search                   O(r) text search (where r <= n)
Poor quality                       High quality
O(log n + r) numeric range search  O(r) numeric range search

Storage
Part 3

CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id),
KEY (price)
);

SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
AND price <= 15;

<schema name="listings">
  <fields>
    <field name="id" type="int20" required="true" indexed="true" stored="true"/>
    <field name="title" type="text" required="true" indexed="true" stored="false"/>
    <field name="description" type="text" required="true" indexed="true" stored="false"/>
    <field name="price" type="long" required="true" indexed="true" stored="false"/>
    <field name="quantity" type="int8" required="true" indexed="true" stored="false"/>
  </fields>
</schema>

q=cat & fq=price:[* TO 15]

<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

<schema name="listings">
  <fields>
    <field name="id" type="int20" stored="true"/>
    <field name="title" type="text" stored="false"/>
    <field name="description" type="text" stored="false"/>
    <field name="price" type="long" stored="false"/>
    <field name="quantity" type="int8" stored="false"/>
  </fields>
</schema>

<schema name="listings">
  <fields>
    <field name="id" type="int20" stored="true"/>
    <field name="title" type="text" stored="true"/>
    <field name="description" type="text" stored="true"/>
    <field name="price" type="long" stored="true"/>
    <field name="quantity" type="int8" stored="true"/>
  </fields>
</schema>
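
The effect of the stored flag can be pictured as a filter on what the engine is able to hand back for a matching document. The two schema dicts below are hypothetical mirrors of the stored="..." attributes above (with fewer fields), and `stored_view` is an illustrative helper rather than a real API.

```python
ID_ONLY_SCHEMA = {  # only the id is stored
    "id": {"stored": True},
    "title": {"stored": False},
    "price": {"stored": False},
}

ALL_STORED_SCHEMA = {name: {"stored": True} for name in ID_ONLY_SCHEMA}


def stored_view(document, schema):
    """Return only the fields the engine could return in search results."""
    return {name: value for name, value in document.items() if schema[name]["stored"]}


doc = {"id": 1, "title": "Cat hat", "price": 1500}

print(stored_view(doc, ID_ONLY_SCHEMA))     # {'id': 1}
print(stored_view(doc, ALL_STORED_SCHEMA))  # {'id': 1, 'title': 'Cat hat', 'price': 1500}
```

With only the id stored, the application takes the returned ids and fetches the full rows from the database; with every field stored, the search engine starts to act as the data store itself.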

A search index is not a database index
But a search engine can totally be a database
Don't do it
By Darcy Quinn
riotcakes.etsy.com

Database                           Search Engine
O(n) text search                   ✓ O(r) text search (where r <= n)
Poor quality                       ✓ High quality
O(log n + r) numeric range search  ✓ O(r) numeric range search
✓ Good at storage                  'Meh' at storage

By Ashley Fehribach
furballfanatic.etsy.com
@nerdymathlete
Thank you
Toria Gibbs
Senior Software Engineer @ Etsy
@scarletdrive
Ad

More Related Content

What's hot (15)

Ruby things
Ruby thingsRuby things
Ruby things
Julio Santos
 
Data Science for Folks Without (or With!) a Ph.D.
Data Science for Folks Without (or With!) a Ph.D.Data Science for Folks Without (or With!) a Ph.D.
Data Science for Folks Without (or With!) a Ph.D.
Douglas Starnes
 
Getting to know Arel
Getting to know ArelGetting to know Arel
Getting to know Arel
Ray Zane
 
令和から本気出す
令和から本気出す令和から本気出す
令和から本気出す
Takashi Kitano
 
{tidytext}と{RMeCab}によるモダンな日本語テキスト分析
{tidytext}と{RMeCab}によるモダンな日本語テキスト分析{tidytext}と{RMeCab}によるモダンな日本語テキスト分析
{tidytext}と{RMeCab}によるモダンな日本語テキスト分析
Takashi Kitano
 
Python data structures
Python data structuresPython data structures
Python data structures
Harry Potter
 
Python for High School Programmers
Python for High School ProgrammersPython for High School Programmers
Python for High School Programmers
Siva Arunachalam
 
Python WATs: Uncovering Odd Behavior
Python WATs: Uncovering Odd BehaviorPython WATs: Uncovering Odd Behavior
Python WATs: Uncovering Odd Behavior
Amy Hanlon
 
Brixton Library Technology Initiative
Brixton Library Technology InitiativeBrixton Library Technology Initiative
Brixton Library Technology Initiative
Basil Bibi
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
Yanchang Zhao
 
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver){tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
Takashi Kitano
 
Python PCEP Tuples and Dictionaries
Python PCEP Tuples and DictionariesPython PCEP Tuples and Dictionaries
Python PCEP Tuples and Dictionaries
IHTMINSTITUTE
 
Elixir
ElixirElixir
Elixir
Andrew Babichev
 
Predictions European Championships 2020
Predictions European Championships 2020Predictions European Championships 2020
Predictions European Championships 2020
Ruben Kerkhofs
 
Spruce up your ggplot2 visualizations with formatted text
Spruce up your ggplot2 visualizations with formatted textSpruce up your ggplot2 visualizations with formatted text
Spruce up your ggplot2 visualizations with formatted text
Claus Wilke
 
Data Science for Folks Without (or With!) a Ph.D.
Data Science for Folks Without (or With!) a Ph.D.Data Science for Folks Without (or With!) a Ph.D.
Data Science for Folks Without (or With!) a Ph.D.
Douglas Starnes
 
Getting to know Arel
Getting to know ArelGetting to know Arel
Getting to know Arel
Ray Zane
 
令和から本気出す
令和から本気出す令和から本気出す
令和から本気出す
Takashi Kitano
 
{tidytext}と{RMeCab}によるモダンな日本語テキスト分析
{tidytext}と{RMeCab}によるモダンな日本語テキスト分析{tidytext}と{RMeCab}によるモダンな日本語テキスト分析
{tidytext}と{RMeCab}によるモダンな日本語テキスト分析
Takashi Kitano
 
Python data structures
Python data structuresPython data structures
Python data structures
Harry Potter
 
Python for High School Programmers
Python for High School ProgrammersPython for High School Programmers
Python for High School Programmers
Siva Arunachalam
 
Python WATs: Uncovering Odd Behavior
Python WATs: Uncovering Odd BehaviorPython WATs: Uncovering Odd Behavior
Python WATs: Uncovering Odd Behavior
Amy Hanlon
 
Brixton Library Technology Initiative
Brixton Library Technology InitiativeBrixton Library Technology Initiative
Brixton Library Technology Initiative
Basil Bibi
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
Yanchang Zhao
 
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver){tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
Takashi Kitano
 
Python PCEP Tuples and Dictionaries
Python PCEP Tuples and DictionariesPython PCEP Tuples and Dictionaries
Python PCEP Tuples and Dictionaries
IHTMINSTITUTE
 
Predictions European Championships 2020
Predictions European Championships 2020Predictions European Championships 2020
Predictions European Championships 2020
Ruben Kerkhofs
 
Spruce up your ggplot2 visualizations with formatted text
Spruce up your ggplot2 visualizations with formatted textSpruce up your ggplot2 visualizations with formatted text
Spruce up your ggplot2 visualizations with formatted text
Claus Wilke
 

Recently uploaded (20)

Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Web and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in RajpuraWeb and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in Rajpura
Erginous Technology
 
Vaibhav Gupta BAML: AI work flows without Hallucinations
Vaibhav Gupta BAML: AI work flows without HallucinationsVaibhav Gupta BAML: AI work flows without Hallucinations
Vaibhav Gupta BAML: AI work flows without Hallucinations
john409870
 
MINDCTI revenue release Quarter 1 2025 PR
MINDCTI revenue release Quarter 1 2025 PRMINDCTI revenue release Quarter 1 2025 PR
MINDCTI revenue release Quarter 1 2025 PR
MIND CTI
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Mastering Advance Window Functions in SQL.pdf
Mastering Advance Window Functions in SQL.pdfMastering Advance Window Functions in SQL.pdf
Mastering Advance Window Functions in SQL.pdf
Spiral Mantra
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Web and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in RajpuraWeb and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in Rajpura
Erginous Technology
 
Vaibhav Gupta BAML: AI work flows without Hallucinations
Vaibhav Gupta BAML: AI work flows without HallucinationsVaibhav Gupta BAML: AI work flows without Hallucinations
Vaibhav Gupta BAML: AI work flows without Hallucinations
john409870
 
MINDCTI revenue release Quarter 1 2025 PR
MINDCTI revenue release Quarter 1 2025 PRMINDCTI revenue release Quarter 1 2025 PR
MINDCTI revenue release Quarter 1 2025 PR
MIND CTI
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Mastering Advance Window Functions in SQL.pdf
Mastering Advance Window Functions in SQL.pdfMastering Advance Window Functions in SQL.pdf
Mastering Advance Window Functions in SQL.pdf
Spiral Mantra
 
Ad

A Search Index is Not a Database Index - Full Stack Toronto 2017

  • 1. A Search Index is not A Database Index Toria Gibbs Senior Software Engineer @ Etsy @scarletdrive
  • 6. They hired me! 6 (even though I was wrong)
  • 7. Agenda 0: Terminology 1: Text Search 2: Numeric Range Search 3: Storage
  • 8. Terminology Database Table Schema Column Row 8 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 9. Terminology Database Table Schema Column Row 9 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 10. Terminology Database Table Schema Column Row 10 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 11. Terminology Database Table Schema Column Row 11 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 12. Terminology Database Table Schema Column Row 12 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog pets id: integer name: string Breed: string id name 001 Toria 002 Colleen humans id: integer name: string human_id pet_id 001 001 001 002 002 003 owners human_id: int pet_id: int
  • 13. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document 13
  • 14. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document Database Index 14 ?
  • 15. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document Database Index Inverted Index 15
  • 16. 16
  • 18. By Rebecca Davis pawsomecrochet.etsy.com Secret Santa for Cats Find all the cat-related items in a database github.com/toriagibbs/SecretSanta
  • 19. 19 id title description price quantity 001 Cat hat A very good hat for very good cats $15.00 4 002 Vacation hat Wear this hat to the beach maybe $49.99 22 003 Hats for cats A set of three hats for the most extreme cat people $25.00 1 004 Kitten hat This is a very small hat, for kittens particularly $11.00 2 005 Kitten mittens Finally! An elegant, comfortable mitten for cats $25.97 18
  • 20. 20 SELECT * FROM listings WHERE title LIKE “%cat%” OR description LIKE “%cat%”;
  • 21. Database Performance n*m 21 n = number of rows in the database m = length of strings
  • 22. Database Performance O(n) n = number of rows in the database 22
  • 23. 23 id title description price quantity 001 Cat hat A very good hat for very good cats $15.00 4 002 Vacation hat Wear this hat to the beach maybe $49.99 22 003 Hats for cats A set of three hats for the most extreme cat people $25.00 1 004 Kitten hat This is a very small hat, for kittens particularly $11.00 2 005 Kitten mittens Finally! An elegant, comfortable mitten for cats $25.97 18
  • 24. 24 CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id) );
  • 25. 25 id title 001 Cat hat 002 Vacation hat 003 Hats for cats 004 Kitten hat 005 Kitten mittens
  • 26. 26 id title 001 Cat hat 002 Vacation hat 003 Hats for cats 004 Kitten hat 005 Kitten mittens title id cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005]
  • 27. 27 key value cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005] key value very [001] good [001] hat [001, 002, 003, 004] cat [001, 003, 005] wear [002] beach [002] ... ... q=cat <requestHandler name=”myHandler” default=true> <lst name=”defaults”> <str name=”qf”>title description</str> </lst> </requestHandler> title description
  • 28. 28 key value cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005] key value very [001] good [001] hat [001, 002, 003, 004] cat [001, 003, 005] wear [002] beach [002] ... ... q=cat <requestHandler name=”myHandler” default=true> <lst name=”defaults”> <str name=”qf”>title description</str> </lst> </requestHandler> title description
  • 29. Search Index Performance O(1) 2 hash lookups = constant time 29
  • 30. Search Index Performance O(1) + retrieval 2 hash lookups = constant time 30
  • 31. Search Index Performance O(r) r = number of results found 31
  • 33. Problem: case sensitivity
       id  | title   | description                        | price  | quantity
       001 | Cat hat | A very good hat for very good cats | $15.00 | 4
       SELECT * FROM listings
       WHERE title LIKE "%cat%"
          OR description LIKE "%cat%";
  • 34. Solution: SQL "LOWER"
       SELECT * FROM listings
       WHERE LOWER(title) LIKE "%cat%"
          OR LOWER(description) LIKE "%cat%";
  • 35. Problem: hidden substring
       id  | title         | description                                         | price  | quantity
       002 | Vacation hat  | Wear this hat to the beach maybe                    | $49.99 | 22
       003 | Hats for cats | A set of three hats for the most extreme cat people | $25.00 | 1
       SELECT * FROM listings
       WHERE title LIKE "%cat%"
          OR description LIKE "%cat%";
  • 36. Solution: check punctuation & whitespace for every word form
       SELECT * FROM listings
       WHERE title LIKE "cat" OR title LIKE "cats"
          OR title LIKE "cat %" OR title LIKE "cats %"
          OR title LIKE "% cat" OR title LIKE "% cats"
          OR title LIKE "% cat %" OR title LIKE "% cats %"
          OR title LIKE "% cat.%" OR title LIKE "% cats.%"
          OR title LIKE "%.cat %" OR title LIKE "%.cats %"
          ...
  • 37. Problem: missed relevant item
       SELECT * FROM listings
       WHERE title LIKE "%cat%"
          OR description LIKE "%cat%";
       id  | title      | description                                        | price  | quantity
       004 | Kitten hat | This is a very small hat, for kittens particularly | $11.00 | 2
  • 38. SELECT * FROM listings
       WHERE LOWER(title) = "cat" OR LOWER(title) = "cats"
          OR LOWER(title) = "kitten" OR LOWER(title) = "kittens"
          OR LOWER(title) LIKE "cat %" OR LOWER(title) LIKE "cats %"
          OR LOWER(title) LIKE "kitten %" OR LOWER(title) LIKE "kittens %"
          OR LOWER(title) LIKE "% cat %" OR LOWER(title) LIKE "% cats %"
          OR LOWER(title) LIKE "% kitten %" OR LOWER(title) LIKE "% kittens %"
          OR LOWER(title) LIKE "% cat.%" OR LOWER(title) LIKE "% cats.%"
          OR LOWER(title) LIKE "% kitten.%" OR LOWER(title) LIKE "% kittens.%"
          OR LOWER(title) LIKE "%.cat %" OR LOWER(title) LIKE "%.cats %"
          OR LOWER(title) LIKE "%.kitten %" OR LOWER(title) LIKE "%.kittens %"
          OR LOWER(title) LIKE "%.cat.%" OR LOWER(title) LIKE "%.cats.%"
          OR LOWER(title) LIKE "%.kitten.%" OR LOWER(title) LIKE "%.kittens.%"
          ...
          OR LOWER(title) LIKE "% cat" OR LOWER(title) LIKE "% cats"
          OR LOWER(title) LIKE "% kitten" OR LOWER(title) LIKE "% kittens"
          ...
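The hidden-substring and missed-item problems can be reproduced with a small Python sketch using the standard-library `sqlite3` module. The in-memory table and the `like_search` helper are assumptions for illustration. Note that SQLite's `LIKE` is case-insensitive for ASCII by default, so the case-sensitivity problem from the slides does not show up in this particular engine.

```python
import sqlite3

# In-memory table holding only the titles from the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO listings (id, title) VALUES (?, ?)",
    [
        (1, "Cat hat"),
        (2, "Vacation hat"),
        (3, "Hats for cats"),
        (4, "Kitten hat"),
        (5, "Kitten mittens"),
    ],
)


def like_search(pattern):
    """Return ids of listings whose title matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM listings WHERE title LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


# Matches "Vacation hat" (hidden substring) and misses both kitten items.
substring_hits = like_search("%cat%")
```

Here `substring_hits` contains listing 002 because "Vacation" contains "cat", while listings 004 and 005 are absent because nothing in the query knows that a kitten is a cat.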
  • 39. Let’s solve it with a search index
  • 40. Problem: case sensitivity
       id  | title   | description                        | price  | quantity
       001 | Cat hat | A very good hat for very good cats | $15.00 | 4
       q=cat
  • 41. Solution: everything is lowercase
       q=cat
       title (case-sensitive keys)
       key | value
       cat | [003]
       Cat | [001]
       title (lowercased keys)
       key | value
       cat | [001, 003]
  • 42. Problem: hidden substring
       id  | title         | description                                         | price  | quantity
       002 | Vacation hat  | Wear this hat to the beach maybe                    | $49.99 | 22
       003 | Hats for cats | A set of three hats for the most extreme cat people | $25.00 | 1
       q=cat
  • 43. Solution: tokenization & stemming
       "Vacation hat" → ["vacation", "hat"]
       "hats" → "hat"
       "cats" → "cat"
       "catlike" → "cat"
  • 44. Problem: missed relevant item
       id  | title      | description                                        | price  | quantity
       004 | Kitten hat | This is a very small hat, for kittens particularly | $11.00 | 2
       q=cat
  • 45. Solution: synonyms
       q=cat
       title (without synonyms)
       key    | value
       cat    | [001, 003]
       kitten | [004, 005]
       title (with synonyms)
       key | value
       cat | [001, 003, 004, 005]
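The three solutions (lowercasing, tokenization with stemming, synonyms) amount to an analysis step run on every title before it is indexed. The sketch below is an illustration under loud assumptions: `stem` is a deliberately naive suffix rule (it would not reduce "catlike" to "cat"), and `SYNONYMS` is a hypothetical one-entry map. Real engines use configurable analyzers instead.

```python
import re
from collections import defaultdict

# Hypothetical synonym map, for illustration only.
SYNONYMS = {"kitten": "cat"}

TITLES = {
    1: "Cat hat",
    2: "Vacation hat",
    3: "Hats for cats",
    4: "Kitten hat",
    5: "Kitten mittens",
}


def stem(token):
    """Naive stemmer (assumption): strip a trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, split into alphanumeric tokens, stem, then map synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SYNONYMS.get(stem(token), stem(token)) for token in tokens]


def build_index(docs):
    """Inverted index over analyzed tokens: token -> sorted list of ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


title_index = build_index(TITLES)
cat_hits = title_index["cat"]
```

All of this work happens once, at index time, so the query `q=cat` is still a single hash lookup. "Vacation hat" is tokenized as a whole word and no longer matches.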
  • 46.
       Database | Search Engine
       O(n) text search | O(r) text search (where r <= n)
       Poor quality due to case sensitivity, substring mismatches, and missing terms | High quality due to case insensitivity, tokenization, stemming, and synonyms
  • 47. TRADE-OFFS
       More disk space
       Do work at "index time"
  • 49. Secret Santa for Cats
       Find all the cat-related items under $15 in a database
       github.com/toriagibbs/SecretSanta
       By Rebecca Davis, pawsomecrochet.etsy.com
  • 50. SELECT * FROM listings
       WHERE (title LIKE "%cat%" OR description LIKE "%cat%")
         AND price <= 15;
  • 51. CREATE TABLE listings (
         id bigint(20),
         title varchar(1024),
         description longtext,
         price decimal(10,2),
         quantity int(8),
         PRIMARY KEY (id)
       );
  • 52. CREATE TABLE listings (
         id bigint(20),
         title varchar(1024),
         description longtext,
         price decimal(10,2),
         quantity int(8),
         PRIMARY KEY (id),
         KEY (price)
       );
  • 53. Database Index
       price | id
       15.00 | 001
       49.99 | 002
       25.00 | 003
       11.00 | 004
       25.97 | 005
       Index entries, ordered by price: id=004, id=001, id=003, id=005, id=002
  • 54.
       price | id
       15.00 | 001
       49.99 | 002
       25.00 | 003
       11.00 | 004
       25.97 | 005
       Index entries, ordered by price: id=004, id=001, id=003, id=005, id=002
       SELECT * FROM listings
       WHERE (title LIKE "%cat%" OR description LIKE "%cat%")
         AND price <= 15;
  • 55. Database Performance: O(log n)
       Log base 2 for a binary tree
       Log base B for a B-tree
  • 56. Database Performance: O(log n) + retrieval
       Log base 2 for a binary tree
       Log base B for a B-tree
  • 57. Database Performance: O(log n + r)
       n = number of rows in the database
       r = number of results found
  • 58.
       n         | log2 n
       10        | 3.32
       100       | 6.64
       1 000     | 9.97
       10 000    | 13.29
       100 000   | 16.61
       1 000 000 | 19.93
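The O(log n + r) range lookup can be sketched with a sorted list standing in for the tree. This is an assumption made for brevity: a real database index is a B-tree on disk, but binary search over sorted keys has the same shape, one O(log n) search to find the boundary and then an O(r) scan to collect results. Prices are floats here only to mirror the slides.

```python
from bisect import bisect_right

# (price, id) pairs from the slides, kept sorted by price like KEY (price).
PRICE_INDEX = sorted(
    [(15.00, 1), (49.99, 2), (25.00, 3), (11.00, 4), (25.97, 5)]
)
PRICES = [price for price, _ in PRICE_INDEX]


def ids_with_price_at_most(limit):
    """Binary-search the boundary (O(log n)), then collect the ids (O(r))."""
    end = bisect_right(PRICES, limit)
    return [doc_id for _, doc_id in PRICE_INDEX[:end]]


cheap_ids = ids_with_price_at_most(15)
```

The ids come back in price order, matching the ordered index entries on the "Database Index" slide.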
  • 59. SIDEBAR: Why didn’t we do this for text fields?!
  • 61. SIDEBAR: Prefix Tree (Trie)
       "car cat ham hat"
  • 62. SIDEBAR: Database indexes for string fields can only search prefixes
       Unless you declare a "full text" index like: FULLTEXT (description)
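A small Python trie over the four words from the sidebar shows why an ordered string index helps with prefixes but not with substrings. The class and function names are hypothetical, and this is a sketch of the idea rather than how any particular database stores its string indexes.

```python
class TrieNode:
    """One node per character; is_word marks the end of an inserted word."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def words_with_prefix(root, prefix):
    """Walk down the prefix, then collect every word below that node."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    found = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        if current.is_word:
            found.append(text)
        for ch, child in current.children.items():
            stack.append((child, text + ch))
    return sorted(found)


root = TrieNode()
for word in "car cat ham hat".split():
    insert(root, word)

ca_words = words_with_prefix(root, "ca")
at_words = words_with_prefix(root, "at")
```

Searching for "ca" follows two edges from the root and finds both words. Searching for "at" finds nothing, even though "cat" and "hat" contain it, because the structure can only be entered from the first character.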
  • 63. SIDEBAR
       Database | Search Engine
       O(r) text search | O(r) text search
       Poor quality due to case sensitivity, substring mismatches, and missing terms | High quality due to case insensitivity, tokenization, stemming, and synonyms
  • 65. price
       key   | value
       11.00 | [004]
       15.00 | [001]
       25.00 | [003]
       25.97 | [005]
       49.99 | [002]
  • 66. q=cat & fq=price:[* TO 15]
       <requestHandler name="myHandler" default="true">
         <lst name="defaults">
           <str name="qf">title description</str>
         </lst>
       </requestHandler>
       price
       key   | value
       11.00 | [004]
       15.00 | [001]
       25.00 | [003]
       25.97 | [005]
       49.99 | [002]
  • 67. q=cat & fq=price:[* TO 15]
       <requestHandler name="myHandler" default="true">
         <lst name="defaults">
           <str name="qf">title description</str>
         </lst>
       </requestHandler>
       Naive expansion of the range into one lookup per possible value:
       price=0.00 OR price=0.01 OR price=0.02 OR price=0.03 OR price=0.04
       OR price=0.05 OR price=0.06 OR price=0.07 OR price=0.08 OR price=0.09
       OR …
       price=14.93 OR price=14.94 OR price=14.95 OR price=14.96
       OR price=14.97 OR price=14.98 OR price=14.99 OR price=15.00
       price
       key   | value
       11.00 | [004]
       15.00 | [001]
       25.00 | [003]
       25.97 | [005]
       49.99 | [002]
  • 68. price
       key           | value
       0 - 24.99     | [001, 004]
       0 - 12.49     | [004]
       11.00         | [004]
       12.50 - 24.99 | [001]
       15.00         | [001]
       25.00 - 49.99 | [002, 003, 005]
       25.00 - 37.49 | [003, 005]
       25.00         | [003]
       25.97         | [005]
       37.50 - 49.99 | [002]
       49.99         | [002]
       fq=price:[25 TO 50] → price(25.00 - 49.99) U price(50.00)
       fq=price:[* TO 40] → price(0 - 24.99) U price(25.00 - 37.49) U price(37.50) U price(37.51) U price(37.52) ... U price(40.00)
  • 69. price
       key           | value
       0 - 24.99     | [001, 004]
       0 - 12.49     | [004]
       ...           | ...
       11.00         | [004]
       12.50 - 24.99 | [001]
       12.50 - 12.99 |
       13.00 - 13.49 |
       ...           | ...
       15.00 - 15.49 | [001]
       15.00         | [001]
       ...           | ...
       fq=price:[* TO 15] → price(0 - 12.49) U price(12.50 - 12.99) U price(13.00 - 13.49) U price(13.50 - 13.99) U price(14.00 - 14.49) U price(14.50 - 14.99) U price(15.00)
  • 70. price
       key           | value
       0 - 24.99     | [001, 004]
       0 - 12.49     | [004]
       ...           | ...
       11.00         | [004]
       12.50 - 24.99 | [001]
       12.50 - 12.99 |
       13.00 - 13.49 |
       ...           | ...
       15.00 - 15.49 | [001]
       15.00         | [001]
       ...           | ...
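The nested-bucket idea can be sketched in Python as a hash index keyed by ranges. Several things here are assumptions rather than details from the slides: prices are stored as integer cents, buckets are formed by repeatedly halving a hypothetical domain of 0 to 8192 cents (the slides use different bucket widths), and the function names are invented. Each value is indexed under every bucket that contains it, and a range query is rewritten as a short union of bucket lookups.

```python
from collections import defaultdict

# Hypothetical domain in cents: [0, 8192), a power of two above 49.99.
DOMAIN_LO, DOMAIN_HI = 0, 8192

PRICES_IN_CENTS = {1: 1500, 2: 4999, 3: 2500, 4: 1100, 5: 2597}


def path_buckets(value):
    """Every bucket [lo, hi) containing value, from widest to single value."""
    lo, hi = DOMAIN_LO, DOMAIN_HI
    buckets = [(lo, hi)]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if value < mid:
            hi = mid
        else:
            lo = mid
        buckets.append((lo, hi))
    return buckets


def build_range_index(prices):
    """Index time: file each id under all buckets on its path (more space)."""
    index = defaultdict(list)
    for doc_id, cents in sorted(prices.items()):
        for bucket in path_buckets(cents):
            index[bucket].append(doc_id)
    return index


def covering_buckets(a, b, lo=DOMAIN_LO, hi=DOMAIN_HI):
    """Fewest buckets whose union is exactly [a, b) inside the domain."""
    if b <= lo or hi <= a:
        return []
    if a <= lo and hi <= b:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return covering_buckets(a, b, lo, mid) + covering_buckets(a, b, mid, hi)


def range_search(index, a, b):
    """Query time: one hash lookup per covering bucket, then union the ids."""
    hits = set()
    for bucket in covering_buckets(a, b):
        hits.update(index.get(bucket, []))
    return sorted(hits)


index = build_range_index(PRICES_IN_CENTS)
under_15 = range_search(index, 0, 1501)      # price <= 15.00
between = range_search(index, 2500, 5001)    # 25.00 <= price <= 50.00
```

The number of covering buckets depends on the width of the value domain, not on how many listings are indexed, which is the trade of extra index-time work and disk space for fewer lookups at query time.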
  • 71. Search Index Performance: O(log (max-min))
       For the max and min values of the field
  • 72. Search Index Performance: O(1)
       The number of buckets doesn’t change with the size of the data
  • 73. Search Index Performance: O(r)
       r = number of results found
  • 74.
       Database | Search Engine
       O(n) text search | O(r) text search (where r <= n)
       Poor quality | High quality
  • 75.
       Database | Search Engine
       O(n) text search | O(r) text search (where r <= n)
       Poor quality | High quality
       O(log n + r) numeric range search |
  • 76.
       Database | Search Engine
       O(n) text search | O(r) text search (where r <= n)
       Poor quality | High quality
       O(log n + r) numeric range search | O(r) numeric range search
  • 78. CREATE TABLE listings (
         id bigint(20),
         title varchar(1024),
         description longtext,
         price decimal(10,2),
         quantity int(8),
         PRIMARY KEY (id),
         KEY (price)
       );
       SELECT * FROM listings
       WHERE (title LIKE "%cat%" OR description LIKE "%cat%")
         AND price <= 15;
  • 79. <schema name="listings">
         <fields>
           <field name="id" type="int20" required="true" indexed="true" stored="true"/>
           <field name="title" type="text" required="true" indexed="true" stored="false"/>
           <field name="description" type="text" required="true" indexed="true" stored="false"/>
           <field name="price" type="long" required="true" indexed="true" stored="false"/>
           <field name="quantity" type="int8" required="true" indexed="true" stored="false"/>
         </fields>
       </schema>
       q=cat & fq=price:[* TO 15]
       <requestHandler name="myHandler" default="true">
         <lst name="defaults">
           <str name="qf">title description</str>
         </lst>
       </requestHandler>
  • 80. Only id is stored:
       <schema name="listings">
         <fields>
           <field name="id" type="int20" stored="true"/>
           <field name="title" type="text" stored="false"/>
           <field name="description" type="text" stored="false"/>
           <field name="price" type="long" stored="false"/>
           <field name="quantity" type="int8" stored="false"/>
         </fields>
       </schema>
  • 81. Every field is stored:
       <schema name="listings">
         <fields>
           <field name="id" type="int20" stored="true"/>
           <field name="title" type="text" stored="true"/>
           <field name="description" type="text" stored="true"/>
           <field name="price" type="long" stored="true"/>
           <field name="quantity" type="int8" stored="true"/>
         </fields>
       </schema>
  • 82. A search index is not a database index
       But a search engine can totally be a database
  • 83. Don’t do it
       By Darcy Quinn, riotcakes.etsy.com
  • 84.
       Database | Search Engine
       O(n) text search | O(r) text search (where r <= n)
       Poor quality | High quality
       O(log n + r) numeric range search | O(r) numeric range search
       Good at storage | ‘Meh’ at storage
       ✓ ✓ ✓ ✓
  • 87. Thank you Toria Gibbs Senior Software Engineer @ Etsy @scarletdrive