An Approach To Highly Intuitive Fuzzy Search In Elasticsearch With Typo Handling

Neelambuj Singh, Software Engineer
5 min read · Jul 15, 2019

How to design a search that can:

1. Support substring search and general fuzzy search.

2. Handle typos in the search term.

3. Return exact matches at the top of the results.

As the title suggests, I am going to discuss how to build a query that is
highly intuitive: it should handle fuzzy search, substring search, and typos
made by the end user while searching. The search is also expected to return
the most relevant results at the top.

Okay, let's start from ground zero.

What is Elasticsearch?

Elasticsearch is an open-source, enterprise-grade search engine that can
power very fast searches for data discovery applications. With Elasticsearch
we can store, search, and analyze large volumes of data quickly and in near
real time. It is generally used as the underlying search engine for
applications that have simple or complex search features and requirements.

Now, some good-to-know things about Elasticsearch.

Built on top of Lucene
Elasticsearch is built on top of Lucene, a full-featured information
retrieval library, so it offers powerful full-text search capabilities.
Lucene is also already familiar to many developers.

Full-text search
Elasticsearch implements many features, such as customizable splitting of
text into words, customizable stemming, and faceted search.

Fuzzy searching
A fuzzy search tolerates spelling errors: you can find what you are looking
for even if the search term is misspelled. Note: fuzziness should be used
judiciously, because without suitable analyzers it can lead to an explosion
of search results.

RESTful API
Elasticsearch is API driven; actions can be performed through a simple
RESTful API.

There are other advantages as well, but we will stop here.

Now, coming back to our main topic: how to design a search that meets the
three goals listed at the start.

Let's first create a basic index for our data.

Mappings

{
  "properties": {
    "firstname": {
      "type": "text",
      "analyzer": "ngram_token_analyzer",
      "search_analyzer": "search_term_analyzer"
    },
    "surname": {
      "type": "text",
      "analyzer": "ngram_token_analyzer",
      "search_analyzer": "search_term_analyzer"
    }
  }
}

Index Settings

{
  "index": {
    "analysis": {
      "analyzer": {
        "search_term_analyzer": {
          "type": "custom",
          "stopwords": "_none_",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "no_stop"
          ],
          "tokenizer": "whitespace"
        },
        "ngram_token_analyzer": {
          "type": "custom",
          "stopwords": "_none_",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "no_stop",
            "ngram_filter"
          ],
          "tokenizer": "whitespace"
        }
      },
      "filter": {
        "no_stop": {
          "type": "stop",
          "stopwords": "_none_"
        },
        "ngram_filter": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "9"
        }
      }
    }
  }
}

Now index a sufficient number of documents.

We have created and populated an index called “person” to store the
firstname and surname of people.

Let us form a query that performs a highly intuitive search on a person's
name (first name or surname). But before that, let's look at the
shortcomings of a conventional fuzzy search that relies only on n-gram
tokens.

For example, take the name “shoaib”. It will be n-gram tokenized and
indexed as

[sh, sho, shoa, shoai, shoaib, ho, hoa, hoai, hoaib, oa, oai, oaib, ai, aib, ib]
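
The token list above can be reproduced outside Elasticsearch with a few lines of Python. This is only a sketch of what the two analyzers do to a single ASCII word: the function names are my own, and the asciifolding and stop filters are left out on the assumption that they do not change this input.

```python
def ngrams(token, min_gram=2, max_gram=9):
    """Return every substring of token whose length is in [min_gram, max_gram]."""
    grams = []
    for start in range(len(token)):
        longest = min(max_gram, len(token) - start)
        for size in range(min_gram, longest + 1):
            grams.append(token[start:start + size])
    return grams


def index_time_tokens(text):
    """Rough stand-in for ngram_token_analyzer: whitespace split, lowercase, n-grams."""
    tokens = []
    for word in text.split():
        tokens.extend(ngrams(word.lower()))
    return tokens


def search_time_tokens(text):
    """Rough stand-in for search_term_analyzer: whitespace split and lowercase only."""
    return [word.lower() for word in text.split()]


print(index_time_tokens("Shoaib"))
print(search_time_tokens("Shoi"))
```

The search term is deliberately not split into n-grams, so a query for “shoi” is looked up as one whole token, and that token is absent from the indexed list.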

Shortcomings of conventional fuzzy search using n-gram tokens

First shortcoming

Unless you search for a term that exactly matches one of the indexed tokens,
the document will not be found. For example, if you search for “shoa” or
“hoaib”, you will get results, because both are indexed tokens.

Now let's introduce some typos and search for “shoi” and “hoiab”.

With these typos, the document containing “shoaib” will not be found,
because neither search term matches any indexed token.

Solution: use fuzziness in the query. Fuzziness allows a match within an
edit distance of at most 2; with the value “AUTO” the allowance depends on
the term length (no edits for 1-2 characters, one edit for 3-5 characters,
two edits for longer terms). “shoi” is one substitution away from the token
“shoa”, so it matches. “hoiab” matches “hoaib” in the same way, because
swapping two adjacent characters counts as a single edit.
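
To see why these two typos are within reach, it helps to compute the edit distances directly. The sketch below implements the optimal string alignment variant of Damerau-Levenshtein distance, in which swapping two adjacent characters counts as one edit, plus a helper that mirrors the default “AUTO” thresholds. Both function names are hypothetical, and this is an illustration of the idea, not the code Elasticsearch or Lucene runs internally.

```python
def edit_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ia" <-> "ai".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


def auto_fuzziness(term):
    """Edits allowed by the default AUTO setting for a term of this length."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


for typed, token in [("shoi", "shoa"), ("hoiab", "hoaib")]:
    print(typed, token, edit_distance(typed, token), auto_fuzziness(typed))
```

Both search terms are four or five characters long, so “AUTO” allows one edit for each, which is exactly the distance to the nearest indexed token.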

Second shortcoming

When you search for an exact match, the desired result may not appear at
the top because of Elasticsearch's internal scoring.

Solution: boost the score of exact matches so that they appear at the top.

Now that we have discussed both solutions, let's turn them into a query.

Query

Explanation: the “phrase” multi_match clause matches only documents that
contain the search term exactly as typed, and boosts them by 10 so that
they end up at the top of the results. The “most_fields” clause returns the
fuzzy matches for the search term. Its fuzziness value allows tokens to
match the search term up to an edit distance of 2. For more information on
edit distance, see:

https://qbox.io/blog/elasticsearch-optimization-fuzziness-performance

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "hoiab",
            "type": "phrase",
            "fields": [
              "firstname",
              "surname"
            ],
            "boost": 10
          }
        },
        {
          "multi_match": {
            "query": "hoiab",
            "type": "most_fields",
            "fields": [
              "firstname",
              "surname"
            ],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}
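
Because the search term appears twice in this query, it can be convenient to assemble the body in code so the term is written only once. A minimal sketch, assuming a helper name `build_name_query` of my own; it only builds the request body as a dictionary and does not contact a cluster.

```python
import json


def build_name_query(term, fields=("firstname", "surname"), boost=10, fuzziness="AUTO"):
    """Build the bool/should body: a boosted phrase clause plus a fuzzy clause."""
    field_list = list(fields)
    exact_clause = {
        "multi_match": {
            "query": term,
            "type": "phrase",
            "fields": field_list,
            "boost": boost,
        }
    }
    fuzzy_clause = {
        "multi_match": {
            "query": term,
            "type": "most_fields",
            "fields": field_list,
            "fuzziness": fuzziness,
        }
    }
    return {"query": {"bool": {"should": [exact_clause, fuzzy_clause]}}}


print(json.dumps(build_name_query("hoiab"), indent=2))
```

The printed JSON has the same structure as the query above and can be sent as the body of a search request against the “person” index.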

Result

We searched for a substring with a typo, “hoiab”, and still got the desired
result.

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 9.432327,
    "hits": [
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "tZkR8GsB_cNxb0UgV49n",
        "_score": 9.432327,
        "_source": {
          "firstname": "shoaib",
          "surname": "akhtar"
        }
      }
    ]
  }
}

Now let's look at a case where we search for an exact match.

Query

We are searching for the name “Serene”.

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Serene",
            "type": "phrase",
            "fields": [
              "firstname",
              "surname"
            ],
            "boost": 10
          }
        },
        {
          "multi_match": {
            "query": "Serene",
            "type": "most_fields",
            "fields": [
              "firstname",
              "surname"
            ],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}

Result

The exact match for “Serene” comes first, followed by a similar name.

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 6.56169,
    "hits": [
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "uJkR8GsB_cNxb0Ugpo8_",
        "_score": 6.56169,
        "_source": {
          "firstname": "Serene",
          "surname": "smith"
        }
      },
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "uZkR8GsB_cNxb0Ugy4_T",
        "_score": 4.4320574,
        "_source": {
          "firstname": "serenity",
          "surname": "flair"
        }
      }
    ]
  }
}
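
In application code, a response shaped like the one above usually has to be reduced to a ranked list of names. The sketch below does that on a trimmed copy of the response; `ranked_names` is a hypothetical helper. The branch for a dictionary-valued `hits.total` is included because newer Elasticsearch versions report the total as an object with a `value` field rather than a plain number.

```python
def ranked_names(response):
    """Return (total, rows), where rows are (score, firstname, surname) in hit order."""
    hits = response.get("hits", {})
    total = hits.get("total", 0)
    if isinstance(total, dict):  # newer response shape: {"value": N, "relation": "eq"}
        total = total.get("value", 0)
    rows = []
    for hit in hits.get("hits", []):
        source = hit.get("_source", {})
        rows.append((hit.get("_score"), source.get("firstname"), source.get("surname")))
    return total, rows


# Trimmed copy of the "Serene" response shown above.
sample = {
    "hits": {
        "total": 2,
        "max_score": 6.56169,
        "hits": [
            {"_score": 6.56169, "_source": {"firstname": "Serene", "surname": "smith"}},
            {"_score": 4.4320574, "_source": {"firstname": "serenity", "surname": "flair"}},
        ],
    }
}

total, rows = ranked_names(sample)
for score, firstname, surname in rows:
    print(f"{score:.2f} {firstname} {surname}")
```

Elasticsearch already returns hits sorted by descending score, so the helper keeps the order it receives instead of sorting again.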

Important note:

To avoid bloating the results with irrelevant matches, use a fuzziness
value of “0” for search terms of 3 characters or fewer, and “AUTO” for
search terms longer than 3 characters. The limit of 3 is not fixed; you can
always adjust it to suit your use case.

I hope you find this information useful.
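
The rule in this note is easy to apply at query-building time. A minimal sketch, assuming a hypothetical helper `choose_fuzziness` whose `short_term_limit` parameter stands for the adjustable limit of 3; it returns the integer 0 for short terms and the string "AUTO" otherwise.

```python
def choose_fuzziness(term, short_term_limit=3):
    """Disable fuzziness for short terms, use AUTO for longer ones."""
    if len(term.strip()) <= short_term_limit:
        return 0
    return "AUTO"


for term in ["ib", "sho", "shoi", "hoiab"]:
    print(term, choose_fuzziness(term))
```

The returned value can be placed directly in the "fuzziness" field of the “most_fields” clause.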


Written by Neelambuj Singh, Software Engineer at Trimble Inc.
