An Approach To Highly Intuitive Fuzzy Search In Elasticsearch With Typo Handling
Neelambuj Singh @ Software engineer
5 min read · Jul 15, 2019

How to design a search which can:

1. Support substring search and general fuzzy search.

2. Handle typos in the search term.

3. Put exact matches at the top of the results.

As the topic suggests, I am going to discuss how to come up with a query
that is highly intuitive, i.e. one that handles fuzzy search, substring
search, and typos made by the end user while searching. The search is also
expected to return the most relevant results at the top.

Okay, let’s start from ground zero.

What is Elasticsearch?

Elasticsearch is an open-source, enterprise-grade search engine that can
power very fast searches for data discovery applications. With
Elasticsearch we can store, search, and analyze large volumes of data
quickly and in near real time. It is generally used as the underlying search
engine for applications with simple or complex search features and
requirements.

Now, some things that are good to know about Elasticsearch:

Built on top of Lucene
Elasticsearch is built on top of Lucene, a full-featured information
retrieval library, so it offers powerful full-text search capabilities.
Lucene is also already familiar to many developers.

Full-text search
Elasticsearch implements many features, such as customizable splitting of
text into words, customizable stemming, faceted search, etc.

Fuzzy Searching
A fuzzy search is good for spelling errors. You can find what you are
searching for even if the search term contains a spelling mistake. Note:
fuzziness should be used judiciously, as it can lead to an explosion of
search results if it is not combined properly with analyzers.

RESTful API
Elasticsearch is API driven; actions can be performed using a simple
RESTful API.

There are other advantages as well, but we will stop here.

Now, coming back to our main topic: how do we design a search that meets
the three requirements listed at the start?

Let’s first create a basic index for our data.

Mappings

{
  "properties": {
    "firstname": {
      "type": "text",
      "analyzer": "ngram_token_analyzer",
      "search_analyzer": "search_term_analyzer"
    },
    "surname": {
      "type": "text",
      "analyzer": "ngram_token_analyzer",
      "search_analyzer": "search_term_analyzer"
    }
  }
}

Index Settings

{
  "index": {
    "analysis": {
      "analyzer": {
        "search_term_analyzer": {
          "type": "custom",
          "stopwords": "_none_",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "no_stop"
          ],
          "tokenizer": "whitespace"
        },
        "ngram_token_analyzer": {
          "type": "custom",
          "stopwords": "_none_",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "no_stop",
            "ngram_filter"
          ],
          "tokenizer": "whitespace"
        }
      },
      "filter": {
        "no_stop": {
          "type": "stop",
          "stopwords": "_none_"
        },
        "ngram_filter": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "9"
        }
      }
    }
  }
}
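
The mappings and settings above are shown separately. As a rough sketch of how they might be combined into one create-index request body, the Python snippet below assembles both into a single dictionary. It assumes Elasticsearch 7.x or later, whereas the "_type" field in the results later in this article suggests the original setup ran on 6.x. Three adjustments are therefore assumptions rather than part of the original setup: the "standard" token filter is dropped (it was removed in 7.0), the filter type is spelled "ngram" instead of "nGram", and "max_ngram_diff" is raised because 7.x defaults to a gap of 1 between min_gram and max_gram. The analyzer-level "stopwords" key is also left out here, on the assumption that the "no_stop" filter already expresses the same intent.

```python
import json

MIN_GRAM, MAX_GRAM = 2, 9

# Filters shared by both analyzers (assumption: "standard" dropped for 7.x+).
BASE_FILTERS = ["lowercase", "asciifolding", "no_stop"]

settings = {
    "index": {
        # Assumption: 7.x+ rejects a min/max gap larger than this value.
        "max_ngram_diff": MAX_GRAM - MIN_GRAM,
        "analysis": {
            "analyzer": {
                "search_term_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": BASE_FILTERS,
                },
                "ngram_token_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": BASE_FILTERS + ["ngram_filter"],
                },
            },
            "filter": {
                "no_stop": {"type": "stop", "stopwords": "_none_"},
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": MIN_GRAM,
                    "max_gram": MAX_GRAM,
                },
            },
        },
    }
}


def name_field():
    """Mapping for a name field: n-grams at index time, whole terms at search time."""
    return {
        "type": "text",
        "analyzer": "ngram_token_analyzer",
        "search_analyzer": "search_term_analyzer",
    }


body = {
    "settings": settings,
    "mappings": {"properties": {"firstname": name_field(), "surname": name_field()}},
}

print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a create-index request for the "person" index. On 6.x the "properties" block would instead sit under a mapping type name, so treat this as a starting point to adapt, not a drop-in configuration.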

Now index a sufficient number of documents.

We have created and populated an index called “person” to store the
firstname and surname of people.

Let us form a query that performs a highly intuitive search on the name of
a person (first name or last name). Before that, let’s look at the
shortcomings of a conventional fuzzy search that relies only on n-gram
tokens.

For example, take the name “shoaib”. It will be n-gram tokenized and
indexed as

[sh, sho, shoa, shoai, shoaib, ho, hoa, hoai, hoaib, oa, oai, oaib, ai, aib, ib]
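
To make the token list above concrete, here is a minimal Python sketch that mimics what an n-gram filter with min_gram 2 and max_gram 9 produces for a single word. The function name ngrams is made up for illustration, and the sketch only imitates the filter's output; it is not how Elasticsearch implements it.

```python
def ngrams(token, min_gram=2, max_gram=9):
    """Return every substring of `token` whose length is in [min_gram, max_gram]."""
    token = token.lower()
    grams = []
    for start in range(len(token)):
        # The longest gram starting here is capped by max_gram and by the word end.
        longest = min(max_gram, len(token) - start)
        for size in range(min_gram, longest + 1):
            grams.append(token[start:start + size])
    return grams


print(ngrams("shoaib"))
```

For “shoaib” this yields the same 15 tokens listed above, ordered by start position and then by length.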

Shortcomings of Conventional Fuzzy Search Using N-gram Tokens

First shortcoming

Unless you search for a token that was actually indexed, you won’t find the
document. For example, if you search for “shoa” or “hoaib”, you will get
results.

Let’s introduce some typos into the search term and search for “shoi” and
“hoiab”.

With these typos, the document containing “shoaib” will not be found,
because the search term doesn’t match any indexed token.

Solution: Use fuzziness in the query for the search term. “shoi” will be
matched to “shoa”, since fuzziness lets the search term match tokens within
an edit distance of at most 2. In the same way, “hoiab” will be matched to
“hoaib”.

Second shortcoming

When you search for an exact match, the desired exact result may not appear
at the top because of the internal scoring mechanism of Elasticsearch.

Solution: Boost the score of exact matches, so that they appear at the top.

Now that we have discussed both solutions, let’s convert them into queries.

Query

Explanation: the multi_match clause with type “phrase” matches only the
exact search term and boosts it by 10, so that it ends up at the top of the
results. The clause with type “most_fields” retrieves the fuzzy matches for
the search term. The fuzziness value allows tokens to match the search term
up to an edit distance of 2. For more information on edit distance, see:

https://qbox.io/blog/elasticsearch-optimization-fuzziness-performance

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "hoiab",
            "type": "phrase",
            "fields": [
              "firstname",
              "surname"
            ],
            "boost": 10
          }
        },
        {
          "multi_match": {
            "query": "hoiab",
            "type": "most_fields",
            "fields": [
              "firstname",
              "surname"
            ],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}
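
Since the same query shape is reused for every search term, it may help to generate it from a small helper. The sketch below builds the request body shown above as a Python dictionary; the function name build_name_query and its parameters are hypothetical, and only the JSON structure comes from the article.

```python
import json


def build_name_query(term, fuzziness="AUTO", boost=10, fields=("firstname", "surname")):
    """Build the bool query: a boosted phrase clause plus a fuzzy most_fields clause."""
    field_list = list(fields)
    exact_clause = {
        "multi_match": {
            "query": term,
            "type": "phrase",
            "fields": field_list,
            "boost": boost,  # pushes exact matches to the top
        }
    }
    fuzzy_clause = {
        "multi_match": {
            "query": term,
            "type": "most_fields",
            "fields": field_list,
            "fuzziness": fuzziness,  # tolerates typos in the search term
        }
    }
    return {"query": {"bool": {"should": [exact_clause, fuzzy_clause]}}}


print(json.dumps(build_name_query("hoiab"), indent=2))
```

The printed JSON can be used as the body of a search request against the “person” index.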

Result

We can see that although we searched for a substring with a typo, “hoiab”,
we still get the desired result.

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 9.432327,
    "hits": [
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "tZkR8GsB_cNxb0UgV49n",
        "_score": 9.432327,
        "_source": {
          "firstname": "shoaib",
          "surname": "akhtar"
        }
      }
    ]
  }
}

Now let’s take a look at the case where we give an exact match.

Query

We are searching for the name “Serene”.

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Serene",
            "type": "phrase",
            "fields": [
              "firstname",
              "surname"
            ],
            "boost": 10
          }
        },
        {
          "multi_match": {
            "query": "Serene",
            "type": "most_fields",
            "fields": [
              "firstname",
              "surname"
            ],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}

Result

We can see that the exact match for “Serene” comes out on top, followed by
a similar name.

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 6.56169,
    "hits": [
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "uJkR8GsB_cNxb0Ugpo8_",
        "_score": 6.56169,
        "_source": {
          "firstname": "Serene",
          "surname": "smith"
        }
      },
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "uZkR8GsB_cNxb0Ugy4_T",
        "_score": 4.4320574,
        "_source": {
          "firstname": "serenity",
          "surname": "flair"
        }
      }
    ]
  }
}

Important Note:

To avoid bloating the results with irrelevant matches, use a fuzziness value
of “0” for search terms of 3 characters or fewer. For search terms longer
than 3 characters, use a fuzziness value of “AUTO”. However, 3 is not a
hard limit. You can always set the limit according to your use case.
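
The rule above is easy to apply in application code before the query is sent. Here is a minimal Python sketch; the function name pick_fuzziness and the short_limit parameter are made up for illustration, with the default of 3 taken from the note.

```python
def pick_fuzziness(term, short_limit=3):
    """Return "0" for short search terms and "AUTO" for longer ones."""
    # Surrounding whitespace should not count towards the term length.
    return "0" if len(term.strip()) <= short_limit else "AUTO"


for term in ("sho", "shoi", "Serene"):
    print(term, "->", pick_fuzziness(term))
```

The returned value would then be placed in the "fuzziness" field of the “most_fields” clause, and short_limit can be tuned per use case.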

I hope you find this information useful.

Written by Neelambuj Singh
Software Engineer at Trimble Inc.
