Automatically Build Solr Synonym List
Using Machine Learning
Chao Han
VP, Head of Data Science, Lucidworks
Goal
• Automatically generate a Solr synonym list that includes synonyms, common
misspellings and misplaced blank spaces. Choose the right Solr synonym format
(e.g., one-way or bi-directional).
• Examples:
• Synonym: bag, case; four, iv; mac, apple mac, mac book, macbook
• Acronym: playstation, ps
• Misspelling: accesory, accesoire, accessoire, accessorei => accessory
• Misplaced blank spaces: book end, bookend; whirl pool => whirlpool
Agenda
• Introduction
• Existing methods and challenges
• Walkthrough of our approach
• Evaluation and comparison
• Misspelling and phrase extraction
• Demo of synonym detection job in Fusion
• Future work
Existing Methods and Challenges
• Knowledge-base methods, such as those built on WordNet, do not cover a
customer’s own ontology well.
• Example results from WordNet on an ecommerce dataset:
• Lack of usefulness:
• mankind, humanity; luck, chance; interference, noise
• Missing context-specific synonyms:
• galaxy, Samsung galaxy; noise, quiet; vac, vacuum
• Not updated frequently.
Existing Methods and Challenges
• Find synonyms with word2vec.
• Example results from word2vec on an ecommerce dataset:
• Returns related words instead of interchangeable words:
• king, queen; red, blue; broom, floor
• Returns surrounding words:
• battery, rechargeable; unlocked, phone; power, supply
• Sensitive to hyper-parameters; can converge to a local optimum.
Proposed method: Step 1 – Find similar queries
• Use customer behavior data to focus on queries that lead to a similar set of clicked
documents, then extract token-level and phrase-level synonyms from them.
Query               Doc Set   Num of Clicks
apple mac charger   1         500
apple mac charger   2         300
apple mac charger   3         100
apple mac charger   4         30
Mac power           1         200
Mac power           2         100
Mac power           3         50
Use the Jaccard index to measure query similarity:
$$J(\mathit{query}_1, \mathit{query}_2) = \frac{|\mathit{DocSet}_1 \cap \mathit{DocSet}_2|}{|\mathit{DocSet}_1 \cup \mathit{DocSet}_2|}$$
Each Doc Set is weighted by number of clicks to de-noise.
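
The similarity measure above can be sketched in a few lines of Python. The slide does not say how the click weighting is applied, so the `weighted_jaccard` function below is only one plausible reading: it converts each query's clicks to shares and divides the sum of per-document minimums by the sum of maximums. Both function names are illustrative and not part of Fusion.

```python
def jaccard(docs_a, docs_b):
    """Plain Jaccard index between two collections of clicked document ids."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_jaccard(clicks_a, clicks_b):
    """Weighted Jaccard on click shares (assumed weighting, not from the slides).

    clicks_a / clicks_b map document id -> number of clicks for one query.
    Documents with few clicks get small weights, which damps click noise.
    """
    total_a = sum(clicks_a.values())
    total_b = sum(clicks_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    numerator = denominator = 0.0
    for doc in set(clicks_a) | set(clicks_b):
        share_a = clicks_a.get(doc, 0) / total_a
        share_b = clicks_b.get(doc, 0) / total_b
        numerator += min(share_a, share_b)
        denominator += max(share_a, share_b)
    return numerator / denominator


# Click data from the table above.
apple_mac_charger = {1: 500, 2: 300, 3: 100, 4: 30}
mac_power = {1: 200, 2: 100, 3: 50}

print(jaccard(apple_mac_charger, mac_power))           # 3 shared docs out of 4
print(weighted_jaccard(apple_mac_charger, mac_power))  # higher, doc 4 has few clicks
```

With the plain index the two queries share 3 of 4 documents. Under the assumed weighting, document 4 carries only a small share of clicks, so the weighted score comes out higher.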
Proposed method: Step 2 – Query pre-processing
• Stemming, stop word removal
• Find misspellings separately and correct them in queries:
• If misspellings are left in, the output is: mattress, matress, mattrass, mattresss
when it should be: matress, mattrass, mattresss => mattress
• Identify phrases in queries to find multi-word synonyms: mac, mac_book
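
A minimal sketch of this pre-processing, assuming the misspelling corrections and the phrase list have already been produced by the separate jobs described later in the deck. The function name, the stop word set and the two-word phrase limit are illustrative choices; stemming is left out here.

```python
def preprocess(query, corrections, phrases, stop_words):
    """Lowercase, drop stop words, fix known misspellings, join known phrases.

    corrections: dict mapping a misspelled token to its correction.
    phrases: set of (first_token, second_token) tuples treated as one phrase.
    """
    tokens = [t for t in query.lower().split() if t not in stop_words]
    tokens = [corrections.get(t, t) for t in tokens]

    joined = []
    i = 0
    while i < len(tokens):
        # Greedily merge a known two-word phrase into a single token.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            joined.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined


corrections = {"matress": "mattress", "mattrass": "mattress", "mattresss": "mattress"}
phrases = {("mac", "book")}
stop_words = {"for", "the", "a"}

print(preprocess("Charger for mac book", corrections, phrases, stop_words))
print(preprocess("matress cover", corrections, phrases, stop_words))
```

Correcting misspellings before synonym extraction keeps variants such as matress out of the synonym sets, so they can be emitted as one-way mappings instead.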
Proposed method: Step 3 – Extract synonyms
• Extract synonyms (tokens/phrases) from similar queries by finding tokens/phrases that
appear before/after the same word:
• E.g. Similar queries: laptop charger, laptop power
Synonym: charger, power
Similar queries: playstation console, ps console
Synonym: playstation, ps
• Measure synonym similarity by the number of occurrences in similar queries, adjusted by the counts
of the synonyms in the corpus.
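
The extraction step can be illustrated as follows: strip the words two similar queries share at the start and at the end, and treat what is left as a candidate pair. The scoring formula is not given on the slide, so dividing the pair count by the geometric mean of the corpus counts is an assumption made for this sketch, as are the function names.

```python
from collections import Counter


def candidate_pair(tokens_a, tokens_b):
    """Return the differing parts of two queries that share a leading or trailing word."""
    shortest = min(len(tokens_a), len(tokens_b))
    prefix = 0
    while prefix < shortest and tokens_a[prefix] == tokens_b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < shortest - prefix and tokens_a[-1 - suffix] == tokens_b[-1 - suffix]:
        suffix += 1
    if prefix + suffix == 0:
        return None  # no shared context word
    rest_a = tokens_a[prefix:len(tokens_a) - suffix]
    rest_b = tokens_b[prefix:len(tokens_b) - suffix]
    if not rest_a or not rest_b:
        return None  # one query is contained in the other
    return tuple(sorted((" ".join(rest_a), " ".join(rest_b))))


def score_pairs(similar_queries, corpus_counts):
    """Count candidate pairs, then adjust by corpus frequency (assumed formula)."""
    pair_counts = Counter()
    for query_a, query_b in similar_queries:
        pair = candidate_pair(query_a.split(), query_b.split())
        if pair is not None:
            pair_counts[pair] += 1
    return {
        pair: n / (corpus_counts.get(pair[0], 1) * corpus_counts.get(pair[1], 1)) ** 0.5
        for pair, n in pair_counts.items()
    }


similar_queries = [
    ("laptop charger", "laptop power"),
    ("playstation console", "ps console"),
]
corpus_counts = {"charger": 4, "power": 1, "playstation": 1, "ps": 1}  # made-up counts
scores = score_pairs(similar_queries, corpus_counts)
print(scores)
```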
Proposed method: Step 4 – De-noise
• Drop synonym pairs that appear together in the same query.
• Use a graph model to find relationships among synonyms, to put multiple synonyms
into the same set and to drop non-synonyms.
Synonym group: mac, apple mac, mac book
[Figure: synonym graph with nodes LCD tv, tv, LED tv and mac book, mac, apple mac]
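
The speaker notes mention the Bron-Kerbosch clique algorithm for this step. The sketch below is a basic version without pivoting that keeps only cliques of at least three terms. The edge list is an assumption reconstructed from the figure labels (the slide image itself is not available), and `synonym_groups` is an illustrative name.

```python
def bron_kerbosch(current, candidates, excluded, adjacency, cliques):
    """Collect all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    if not candidates and not excluded:
        cliques.append(frozenset(current))
        return
    for vertex in list(candidates):
        bron_kerbosch(
            current | {vertex},
            candidates & adjacency[vertex],
            excluded & adjacency[vertex],
            adjacency,
            cliques,
        )
        candidates = candidates - {vertex}
        excluded = excluded | {vertex}


def synonym_groups(pairs, min_size=3):
    """Build an undirected graph from synonym pairs and keep the larger cliques."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    cliques = []
    bron_kerbosch(set(), set(adjacency), set(), adjacency, cliques)
    return [clique for clique in cliques if len(clique) >= min_size]


# Assumed edges: the mac terms are fully connected, the tv terms are not.
pairs = [
    ("mac", "apple mac"),
    ("mac", "mac book"),
    ("apple mac", "mac book"),
    ("tv", "LCD tv"),
    ("tv", "LED tv"),
]
print(synonym_groups(pairs))
```

Because LCD tv and LED tv are not linked to each other, they never form a clique of three with tv, so only the mac group survives. That matches the intent of dropping non-synonyms that are merely connected through a shared term.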
Proposed method: Step 5 – Categorize output
• A tree-based model is built on features generated from the steps above
to help choose between a bare synonym pair and a pair that keeps its context word:
• Example features: synonym similarity, number of contexts the synonym appears
in, token overlap, synonym counts, etc.
Evaluation and comparison with word2vec
• Run word2vec on the catalog and trim rare words that do not appear in queries (with
the same misspelling and phrase extraction steps).
Evaluation and comparison with word2vec
• Manually evaluated synonym pairs generated from the ecommerce dataset.
Method                        Precision   Recall   F1
LW synonym job                83%         81%      82%
word2vec                      31%         28%      29%
word2vec with de-noise step   45%         25%      32%
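
As a quick consistency check, the F1 column can be recomputed from the precision and recall columns with the standard harmonic-mean formula. The dictionary below just restates the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


results = {
    "LW synonym job": (0.83, 0.81),
    "word2vec": (0.31, 0.28),
    "word2vec with de-noise step": (0.45, 0.25),
}

for method, (precision, recall) in results.items():
    print(f"{method}: F1 = {f1_score(precision, recall):.0%}")
```

The recomputed values round to the 82%, 29% and 32% reported in the table.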
Spell Correction in Fusion 4.0:
• An offline job that finds misspellings and provides corrections based on the number of
occurrences of words/phrases. Compared to the Solr spell checker, the advantages of this job
are:
• If query clicks are captured after the Solr spell checker was turned on, the misspellings
found from click data mainly identify erroneous or missing corrections from Solr.
• It allows offline human review to make sure the changes are all correct. If the user has a dictionary
(e.g. a product catalog) to check the list against, the job goes through the result list to make
sure misspellings do not exist in the dictionary and corrections do exist in the dictionary.
Spell Correction in Fusion 4.0:
• High accuracy rate (96%). In addition to the basic Solr spell checker settings:
• When there are multiple possible corrections, we rank them on multiple criteria in
addition to edit distance.
• Rather than using a fixed maximum edit distance filter, we use an edit distance threshold relative to
the query length, which gives long queries more wiggle room.
• Since the job runs offline, it eases concerns about expensive spell check tasks in the Solr
spell checker. E.g., it does not limit the maximum number of possible matches to review
(the maxInspections parameter in Solr).
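
The speaker notes give the length-relative filter as edit_distance <= query_length/length_scale, with length_scale=4 allowing one edit for lengths 4 to 7 and two edits for lengths 8 to 11. A minimal sketch of that rule follows. Integer division and measuring the length of the misspelled string are assumptions that reproduce the ranges in the notes; the function names are illustrative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def max_edit_distance(query_length, length_scale=4):
    """Allowed edit distance grows with query length (floor division assumed)."""
    return query_length // length_scale


def within_threshold(misspelling, correction, length_scale=4):
    """Keep a pair only if its edit distance fits the length-relative threshold."""
    allowed = max_edit_distance(len(misspelling), length_scale)
    return levenshtein(misspelling, correction) <= allowed


print(within_threshold("matress", "mattress"))  # length 7, one edit allowed
print(within_threshold("ps", "playstation"))    # length 2, no edits allowed
```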
Spell Correction in Fusion 4.0:
• Several fields are provided to facilitate the review process:
• By default, results are sorted by "mis_string_len" (descending) and "edit_dist" (ascending) to position more
probable corrections at the top.
• Soundex or last character match indicator.
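
The default review ordering can be reproduced on an exported result list with an ordinary sort. The field names "mis_string_len" and "edit_dist" come from the slide; the "misspelling" and "correction" keys, the `last_char_match` helper and the sample rows (including the made-up misspelling mattrezz) are illustrative assumptions.

```python
def review_order(rows):
    """Sort by mis_string_len descending, then edit_dist ascending."""
    return sorted(rows, key=lambda row: (-row["mis_string_len"], row["edit_dist"]))


def last_char_match(misspelling, correction):
    """True when both strings end with the same character."""
    return misspelling[-1:] == correction[-1:]


rows = [
    {"misspelling": "matress", "correction": "mattress", "mis_string_len": 7, "edit_dist": 1},
    {"misspelling": "mattrezz", "correction": "mattress", "mis_string_len": 8, "edit_dist": 2},
    {"misspelling": "mattresss", "correction": "mattress", "mis_string_len": 9, "edit_dist": 1},
    {"misspelling": "mattrass", "correction": "mattress", "mis_string_len": 8, "edit_dist": 1},
]

for row in review_order(rows):
    flag = last_char_match(row["misspelling"], row["correction"])
    print(row["misspelling"], row["mis_string_len"], row["edit_dist"], flag)
```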
Spell Correction in Fusion 4.0:
• Several additional fields disclose the relationship between
token corrections and phrase corrections to help further reduce the list:
• The suggested_corrections field helps automatically choose between phrase-level and token-level
correction. If confidence in the correction is low, a “review” label is attached.
Spell Correction in Fusion 4.0:
• The resulting corrections can be used in various ways, for example:
• Put into the synonym list in Solr to perform auto-correction.
• Help evaluate and guide Solr spellcheck configuration.
• Put into a typeahead or autosuggest list.
• Perform document cleansing (e.g. clean product catalog or medical records) by
mapping misspellings to corrections.
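
For the first use case, reviewed corrections can be written out as lines in the Solr synonyms file format: a one-way mapping uses "=>", and a comma-separated line without "=>" declares equivalent terms. The helper names below are illustrative, and the sketch only formats strings; it does not write to or call Solr.

```python
def one_way_line(misspellings, correction):
    """Explicit mapping: every misspelling is rewritten to the correction."""
    return ", ".join(misspellings) + " => " + correction


def equivalent_line(terms):
    """Equivalent terms: each term matches all the others."""
    return ", ".join(terms)


lines = [
    one_way_line(["matress", "mattrass", "mattresss"], "mattress"),
    one_way_line(["whirl pool"], "whirlpool"),
    equivalent_line(["mac", "apple mac", "mac book", "macbook"]),
]
print("\n".join(lines))
```

Misspellings go in one direction only, so a correctly spelled query is never expanded to its misspelled variants.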
Phrase Extraction in Fusion:
• Income tax -> tax; Income tax -> income
• A Spark job detects commonly co-occurring terms as phrases.
• Usage:
A. In the query pipeline, boost any phrase that appears,
e.g. rewrite the query red ipad case to red "ipad case"~10^2
B. Treat phrases as a single token (ipad_case) and feed them into downstream
jobs such as clustering/classification/synonym detection.
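
Both usages can be sketched as small string rewrites, assuming the detected phrases are available as a set of two-word tuples. The function names and the two-word limit are illustrative; the slop of 10 and boost of 2 are taken from the slide's example.

```python
def boost_phrases(query, phrases, slop=10, boost=2):
    """Usage A: wrap known phrases in quotes with a slop and a boost."""
    tokens = query.split()
    parts = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            parts.append(f'"{tokens[i]} {tokens[i + 1]}"~{slop}^{boost}')
            i += 2
        else:
            parts.append(tokens[i])
            i += 1
    return " ".join(parts)


def phrases_as_tokens(query, phrases):
    """Usage B: join known phrases with an underscore for downstream jobs."""
    tokens = query.split()
    parts = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            parts.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            parts.append(tokens[i])
            i += 1
    return parts


phrases = {("ipad", "case")}
print(boost_phrases("red ipad case", phrases))
print(phrases_as_tokens("red ipad case", phrases))
```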
Synonym review process in Fusion 4.2
Automatic tail query rewriting
Tail reason investigation
Tail rewriting at query time
User searched for “red case for macbook.pro”
After query rewriting: “macbook pro case”~10^2 color: red
Future work
• Use query rewrites in session logs.
• Explore deep learning embeddings and attention weights
(source: Rush et al (2014): https://arxiv.org/pdf/1409.0473.pdf).
• Evaluate results on more types of data.
Thank you!
Chao Han
VP, Head of Data Science, Lucidworks



Editor's Notes

  • #3: A synonym list plays an important part in search. However, it usually takes the search or ontology group in a company a long time to detect and maintain synonyms. This talk is set within the context of an ecommerce search use case.
  • #5: There are already experiments around automatically generating synonyms. I will talk about two of the most popular methods here.
  • #6: Word2vec is a shallow neural network that tries to predict target words from nearby words, or vice versa. We then take the dense vectors out, basically moving from word space to vector space, and find nearest neighbors through cosine similarity. Because the vectors live in a vast high-dimensional space, two vectors can be similar in any sense. E.g. red and blue are similar because they are both colors; broom and floor share a functional relationship. They are related but not interchangeable. In a search application we usually require synonyms to be bi-directional and interchangeable, so related-but-not-interchangeable pairs can lead to relevancy problems. E.g. if I want a king bed sheet, I may not want a queen bed sheet. Red paint is not blue paint. Because of the way the w2v model is constructed, trying to predict context from target words, it tends to find surrounding words. Since w2v is a neural network model trained with SGD, it can converge to a local optimum. Overall, some of the failed examples from the w2v results here are due to a lack of constraints. The problem with WordNet is a mismatched semantic context between customer data and the general dictionary.
  • #8: To tackle the above problems, we propose a 5-step synonym detection algorithm. Nowadays websites can easily track and store user events such as queries, result clicks and purchases. We can use this collective behavior to create clickstream or LTR models, and we can also use it to help find synonyms. The first step is to find similar queries, from which we can then extract synonyms. This way we are putting constraints on the problem through the input data.
  • #9: Since we don’t want to put all the stemmed and non-stemmed pairs into the synonym list, we leave the stemming work to Solr.
  • #10: This looks like a naïve method without fancy modeling involved, but it turns out to work pretty well. I think that is because it’s a straightforward way to replicate how people construct language. Also, here we are not projecting the words into a different vector space as in w2v, so we are getting the first-order similarity between words.
  • #11: All methods produce some noise because of the nature of click data. Synonyms should be transitive, so we use a graph algorithm to find a community that has enough edges in the graph. With the Bron-Kerbosch clique algorithm, an example clique is frozenset({‘ear’, ‘ear bud’, ‘earbud’, ‘earphone’, ‘headset’}). Requiring only a connected component would be messy: audio, headphone, ear bud, ipod, headset, earbud, head, beat, heartbeat, ie, ibeat, tour, ear headphone, earphone, ear. To keep good recall, I’m also considering loose cliques, i.e., if two triangles have 2 edges between each other, they can be treated as 1 clique, which is looser than the strict clique definition.
  • #12: A problem we face is that some of the synonyms we extract are too abstract and do not work outside a certain context. In this algorithm’s output, we find the most frequently occurring words before/after the synonym pair. We call this the context pair. In this case, the tree model predicts that we should include the word console in the synonym pair to make it clearer.
  • #17: Many query misspellings may be due to the same tokens or phrases. So in Fusion 4, we have a new job called the token and phrase wise spell checker, which can help you find misspellings and suggest corrections. The Solr spell checker, by contrast, is index-based and executes at query time.
  • #18: The basic settings include min prefix match, max edit distance, min length of misspelling, count thresholds of misspellings and corrections, and collation check. Specifically, we apply a filter such that only pairs with edit_distance <= query_length/length_scale are kept. E.g., if we choose length_scale=4, for queries with lengths between 4 and 7, edit distance has to be 1 to be chosen, while for queries with lengths between 8 and 11, edit distance can be 2. The job is also able to find comprehensive lists of spelling errors resulting from misplaced whitespace (breakWords in Solr).
  • #19: Results can also be sorted by the ratio of correction traffic to misspelling traffic, to keep only corrections that boost high traffic.