An Approach To Highly Intuitive Fuzzy Search in Elasticsearch With Typo Handling - by Neelambuj Singh at Software Engineer - Medium
What is Elasticsearch?
Full-text search
Elasticsearch implements many full-text features, such as customizable
tokenization (splitting text into words), customizable stemming, and faceted
search.
Fuzzy Searching
A fuzzy search tolerates spelling errors: you can find what you are looking
for even if the search term contains a spelling mistake. Note: the fuzziness
feature should be used judiciously, because it can lead to an explosion of
search results if it is not combined properly with analyzers.
RESTful API
Elasticsearch is API-driven: actions can be performed through a simple
RESTful API.
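As a rough illustration of the API-driven style, the Python sketch below only builds (and does not send) a search request. The host http://localhost:9200 and the helper name build_search_request are assumptions made for this example; the index name person is the one that appears in the results later in this article.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, body):
    """Build (but do not send) a POST request for the _search endpoint."""
    url = f"{base_url.rstrip('/')}/{index}/_search"
    data = json.dumps(body).encode("utf-8")
    return Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local node; adjust the host for your own cluster.
req = build_search_request(
    "http://localhost:9200", "person", {"query": {"match_all": {}}}
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually send it, which requires a running cluster.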
Mappings
{
  "properties": {
    "firstname": {
      "type": "text",
      "analyzer": "ngram_token_analyzer",
      "search_analyzer": "search_term_analyzer"
    },
    "surname": {
      "type": "text",
      "analyzer": "ngram_token_analyzer",
      "search_analyzer": "search_term_analyzer"
    }
  }
}
Index Settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "search_term_analyzer": {
          "type": "custom",
          "stopwords": "_none_",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "no_stop"
          ],
          "tokenizer": "whitespace"
        },
        "ngram_token_analyzer": {
          "type": "custom",
          "stopwords": "_none_",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "no_stop",
            "ngram_filter"
          ],
          "tokenizer": "whitespace"
        }
      },
      "filter": {
        "no_stop": {
          "type": "stop",
          "stopwords": "_none_"
        },
        "ngram_filter": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "9"
        }
      }
    }
  }
}
Let us form a query that performs a highly intuitive search on the name of a
person (either first name or last name). But before that, let’s look at the
shortcomings of a conventional fuzzy search that uses only n-gram tokens.
For example, take the name “shoaib”. It will be n-gram tokenized and
indexed as
[sh, sho, shoa, shoai, shoaib, ho, hoa, hoai, hoaib, oa, oai, oaib, ai, aib, ib]
Unless you search for one of these exact tokens, you won’t find the
document. For example, if you search for “shoa” or “hoaib”, you will get
results.
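To make the token list above concrete, here is a minimal Python sketch that imitates what an n-gram filter with min_gram 2 and max_gram 9 produces for a single lowercase token. It is a plain-Python imitation for illustration, not Elasticsearch's own implementation, and the function name ngrams is made up for this example.

```python
def ngrams(token, min_gram=2, max_gram=9):
    """Return every substring of `token` with a length from min_gram to max_gram."""
    grams = []
    for start in range(len(token)):
        # The longest gram starting here is capped by max_gram and by the token end.
        longest = min(max_gram, len(token) - start)
        for size in range(min_gram, longest + 1):
            grams.append(token[start:start + size])
    return grams


indexed = ngrams("shoaib")
print(indexed)

# A search term only matches if it is exactly one of the indexed tokens.
for term in ["shoa", "hoaib", "shoi", "hoiab"]:
    print(term, term in indexed)
```

The first two terms are found among the tokens, while the two misspelled terms are not.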
Now let’s introduce some typos into the search term and search for “shoi” and
“hoiab”. Neither is one of the indexed tokens, so n-gram matching alone
returns no results.
Solution: Use fuzziness in the query for the search term. “shoi” will be
matched to the indexed token “shoa” because fuzziness allows the search term
to differ from a token by a maximum edit distance of 2; “hoiab” is matched to
“hoaib” in the same way.
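The edit distances involved can be checked with a small Python sketch. It computes the optimal string alignment distance, where an insertion, a deletion, a substitution, or a swap of two adjacent characters each count as one edit (Elasticsearch counts transpositions as a single edit by default). The auto_fuzziness helper mirrors the documented default thresholds of “AUTO”; both function names are made up for this illustration, and the code is not how Elasticsearch evaluates fuzzy queries internally.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # A swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def auto_fuzziness(term):
    """Edits allowed by "AUTO" with its default thresholds (3 and 6)."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


for typo, token in [("shoi", "shoa"), ("hoiab", "hoaib")]:
    print(typo, token, edit_distance(typo, token), auto_fuzziness(typo))
```

Both misspelled terms are a single edit away from an indexed token, which is within what “AUTO” allows for terms of their length.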
Second shortcoming
When you search for an exact match, the desired exact result may not appear
at the top because of Elasticsearch’s internal scoring mechanism.
Solution: Boost the score of exact matches so that they appear at the top.
Now that we have discussed the two solutions, let’s convert them into
queries.
Query
Explanation: The “phrase” type in multi_match matches only the exact search
term and boosts it by 10, so that it ends up at the top of the results.
The “most_fields” type is used to get the fuzzy matches for the search term.
The fuzziness value allows tokens to match the search term up to an edit
distance of 2. For more information on edit distance, see
https://qbox.io/blog/elasticsearch-optimization-fuzziness-performance
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "hoiab",
            "type": "phrase",
            "fields": [
              "firstname",
              "surname"
            ],
            "boost": 10
          }
        },
        {
          "multi_match": {
            "query": "hoiab",
            "type": "most_fields",
            "fields": [
              "firstname",
              "surname"
            ],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}
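If the search term comes from user input, the query above can be assembled programmatically. The following Python sketch builds the same request body for any term; the helper name build_name_query and its parameters are assumptions made for this example, while the field names, boost, and fuzziness value are the ones used in the query above.

```python
import json


def build_name_query(term, fields=("firstname", "surname"), boost=10, fuzziness="AUTO"):
    """Build the two-clause bool/should query body for a search term."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        # Exact (phrase) match, boosted so it ranks first.
                        "multi_match": {
                            "query": term,
                            "type": "phrase",
                            "fields": list(fields),
                            "boost": boost,
                        }
                    },
                    {
                        # Fuzzy match that tolerates typos in the term.
                        "multi_match": {
                            "query": term,
                            "type": "most_fields",
                            "fields": list(fields),
                            "fuzziness": fuzziness,
                        }
                    },
                ]
            }
        }
    }


body = build_name_query("hoiab")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request against the index.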
Result
We can observe that even though we searched for a substring with a typo,
“hoiab”, we still get the desired result.
{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 9.432327,
    "hits": [
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "tZkR8GsB_cNxb0UgV49n",
        "_score": 9.432327,
        "_source": {
          "firstname": "shoaib",
          "surname": "akhtar"
        }
      }
    ]
  }
}
Query
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Serene",
            "type": "phrase",
            "fields": [
              "firstname",
              "surname"
            ],
            "boost": 10
          }
        },
        {
          "multi_match": {
            "query": "Serene",
            "type": "most_fields",
            "fields": [
              "firstname",
              "surname"
            ],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}
Result
We can see that the exact match for “Serene” is at the top, followed by a
similar name.
{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 6.56169,
    "hits": [
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "uJkR8GsB_cNxb0Ugpo8_",
        "_score": 6.56169,
        "_source": {
          "firstname": "Serene",
          "surname": "smith"
        }
      },
      {
        "_index": "person",
        "_type": "_doc",
        "_id": "uZkR8GsB_cNxb0Ugy4_T",
        "_score": 4.4320574,
        "_source": {
          "firstname": "serenity",
          "surname": "flair"
        }
      }
    ]
  }
}
Important Note:
To avoid bloating the search with irrelevant results, use a fuzziness value of
“0” for search terms whose length is less than or equal to 3, and a fuzziness
value of “AUTO” for search terms longer than 3. However, 3 is not a hard
limit; you can always set the limit according to your use case.
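The rule in this note can be sketched as a small Python helper. The names choose_fuzziness, fuzzy_clause, and short_limit are made up for this example, and the default limit of 3 is simply the value suggested in the note, so treat it as a starting point rather than a fixed rule.

```python
def choose_fuzziness(term, short_limit=3):
    """Return "0" for short search terms and "AUTO" for longer ones."""
    return "0" if len(term.strip()) <= short_limit else "AUTO"


def fuzzy_clause(term, fields=("firstname", "surname"), short_limit=3):
    """Build the fuzzy multi_match clause with a length-dependent fuzziness."""
    return {
        "multi_match": {
            "query": term,
            "type": "most_fields",
            "fields": list(fields),
            "fuzziness": choose_fuzziness(term, short_limit),
        }
    }


for term in ["sho", "shoi", "hoiab"]:
    print(term, choose_fuzziness(term))
```

With the default limit, “sho” is searched without fuzziness, while “shoi” and “hoiab” use “AUTO”.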