Apache Lucene/Solr London User Group
How the Lucene More Like This Works
Alessandro Benedetti, Software Engineer
16th May 2019
Who I am
▪ Search Consultant
▪ R&D Software Engineer
▪ Master in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Passionate about Semantic, NLP and Machine Learning technologies
▪ Beach Volleyball Player & Snowboarder
Alessandro Benedetti
Sease
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
Agenda
● Document Similarity
● Apache Lucene More Like This
● Term Scorer
● BM25
● Interesting Terms Retrieval
● Query Building
● DEMO
● Future Work
● JIRA References
Document Similarity
Problem: find documents similar to a seed document
Solution(s):
● Collaborative approach (user interactions)
● Content Based
● Hybrid
Similar?
● Documents accessed in similar ways by similar people
● Term distributions
● All of the above
Real World Use Cases - Streaming Services
Real World Use Cases - Hotels
Apache Lucene
Apache Lucene™ is a high-performance, full-featured text search engine library
written entirely in Java.
It is a technology suitable for nearly any application that requires full-text
search, especially cross-platform.
Apache Lucene is an open source project available for free download.
Apache Lucene
● Search Library (Java)
● Structured Documents
● Inverted Index
● Similarity Metrics (TF-IDF, BM25)
● Fast Search
● Support for advanced queries
● Relevancy tuning
Inverted Index
Indexing
More Like This - Break Up
[Diagram components: Input Document, More Like This Params, Interesting Terms Retriever, Term Scorer, Query Builder, QUERY]
More Like This Params
Responsibility: define a set of parameters (and defaults) that affect the various components of the More Like This module
● Regulates MLT behavior
● Groups parameters specific to each component
● Javadoc documentation
● Default values
● Useful container for the various parameters to be passed
Term Scorer
Responsibility: assign a score to a term that measures how distinctive the term is for the input document
● Field Name
● Field Stats (Document Count)
● Term Stats (Document Frequency)
● Term Frequency
● TF-IDF -> tf * (log(numDocs / (docFreq + 1)) + 1)
● BM25
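The TF-IDF scoring on this slide can be sketched in a few lines. This is a minimal illustration, not the Lucene implementation: it reads the formula with the `+ 1` inside the denominator (an assumption about the intended precedence), and the function name and sample numbers are hypothetical.

```python
import math

def tf_idf_score(tf: int, doc_freq: int, num_docs: int) -> float:
    """Score a term as tf * idf, with idf = log(numDocs / (docFreq + 1)) + 1."""
    idf = math.log(num_docs / (doc_freq + 1)) + 1.0
    return tf * idf

# Same term frequency in the input document, very different document frequency
# in the index: the rarer term is considered more distinctive.
rare = tf_idf_score(tf=3, doc_freq=9, num_docs=1000)
common = tf_idf_score(tf=3, doc_freq=899, num_docs=1000)
print(round(rare, 3), round(common, 3))
```

The rare term ends up with a much larger score, which is what lets it survive the later "top scored terms" selection.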
BM25 Term Scorer
● Originates from Probabilistic Information Retrieval
● Default Similarity since Lucene 6.0 [1]
● 25th iteration in improving TF-IDF
● TF
● IDF
● Document Length
[1] LUCENE-6789
BM25 Term Scorer - Inverse Document Frequency
IDF Score: has very similar behavior
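A small sketch can make the "similar behavior" tangible. It uses the commonly used BM25 IDF form, log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), and compares it with a classic TF-IDF style IDF. Treating the classic IDF as the point of comparison is an assumption about what the slide refers to, and the index size is made up.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25 IDF: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Classic TF-IDF style IDF: log(docCount / (docFreq + 1)) + 1."""
    return math.log(doc_count / (doc_freq + 1)) + 1.0

# Both curves fall as the term appears in more documents.
for df in (1, 10, 100, 1000):
    print(df, round(bm25_idf(df, 10_000), 3), round(classic_idf(df, 10_000), 3))
```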
BM25 Term Scorer - Term Frequency
TF Score: approaches (k+1) asymptotically
k=1.2 in this example
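The saturation described here can be checked numerically. The sketch below uses the textbook BM25 term-frequency component tf * (k + 1) / (tf + k) and ignores document length; the function name is hypothetical.

```python
def bm25_tf(tf: float, k: float = 1.2) -> float:
    """BM25 term-frequency component without length normalisation."""
    return tf * (k + 1.0) / (tf + k)

# The score grows quickly for the first occurrences, then flattens below k + 1.
for tf in (1, 2, 5, 20, 1000):
    print(tf, round(bm25_tf(tf), 3))
```

With k=1.2 the values stay below 2.2 however often the term occurs, so repeating a term many times yields diminishing returns.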
BM25 Term Scorer - Document Length
Document Length / Avg Document Length: affects how fast we saturate the TF score
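To illustrate the effect of document length, the sketch below adds the usual BM25 length normalisation to the term-frequency component. The parameter `b = 0.75` is the conventional default and does not appear on the slide; treat it and the sample lengths as assumptions.

```python
def bm25_tf_norm(tf: float, doc_len: float, avg_doc_len: float,
                 k: float = 1.2, b: float = 0.75) -> float:
    """BM25 TF component with document length normalisation."""
    # Longer-than-average documents enlarge the denominator, so the same
    # term frequency earns a lower score and saturates more slowly.
    norm = k * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k + 1.0) / (tf + norm)

for doc_len in (50, 100, 200):
    print(doc_len, round(bm25_tf_norm(tf=3, doc_len=doc_len, avg_doc_len=100), 3))
```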
Interesting Term Retriever
Responsibility: retrieve from the document a queue of weighted interesting terms
Params Used
● Analyzer
● Max Num Token Parsed
● Min Term Frequency
● Min/Max Document Frequency
● Max Query Terms
● Query Time Field Boost

● Analyze content / Term Vector
● Skip Tokens
● Score Tokens
● Build Queue of Top Scored terms
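The four steps (analyze, skip, score, build the queue) can be sketched as follows. This is a simplified stand-in, not the Lucene code: a regex tokenizer replaces the Analyzer, the index statistics come from a plain dictionary, scoring uses a TF-IDF style formula, and all names and default values are hypothetical.

```python
import heapq
import math
import re
from collections import Counter

def interesting_terms(text, doc_freqs, num_docs, min_tf=2, min_df=5,
                      max_df=None, max_query_terms=25, max_tokens=5000):
    # Analyze: a naive lowercase word tokenizer stands in for a real Analyzer.
    tokens = re.findall(r"\w+", text.lower())[:max_tokens]
    term_freqs = Counter(tokens)
    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        # Skip tokens that fail the frequency thresholds.
        if tf < min_tf or df < min_df or (max_df is not None and df > max_df):
            continue
        # Score tokens (TF-IDF style).
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((tf * idf, term))
    # Build the queue of top scored terms, best first.
    return heapq.nlargest(max_query_terms, scored)

doc_freqs = {"lucene": 40, "search": 600, "the": 990}
text = "the lucene search library: lucene makes the search fast, the end"
top = interesting_terms(text, doc_freqs, num_docs=1000, max_df=900)
print(top)
```

In this toy input "the" is dropped by the max document frequency threshold and the single-occurrence words by the min term frequency threshold, leaving "lucene" ahead of "search".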
More Like This Query Builder
Params Used
● Term Boost Enabled
Interesting terms (field:term, score):
Field1:Term1 (3.0), Field2:Term2 (4.0), Field1:Term3 (4.5), Field1:Term4 (4.8), Field3:Term5 (7.5)
Q = Field1:Term1^3.0 Field2:Term2^4.0 Field1:Term3^4.5 Field1:Term4^4.8 Field3:Term5^7.5
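The mapping from scored terms to the query Q can be sketched as a simple string builder. It only renders the query syntax shown on the slide and does not construct real Lucene query objects; the function name is hypothetical.

```python
def build_mlt_query(scored_terms, boost=True):
    """Render (field, term, score) triples as a boosted query string."""
    clauses = []
    for field, term, score in scored_terms:
        clause = f"{field}:{term}"
        if boost:
            # With term boost enabled, the term score becomes the clause weight.
            clause += f"^{score}"
        clauses.append(clause)
    return " ".join(clauses)

terms = [("Field1", "Term1", 3.0), ("Field2", "Term2", 4.0),
         ("Field1", "Term3", 4.5), ("Field1", "Term4", 4.8),
         ("Field3", "Term5", 7.5)]
query = build_mlt_query(terms)
print(query)
```

Passing `boost=False` produces the same clauses without weights, which corresponds to switching the term boost off.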
More Like This Boost
Term Boost
● on/off
● Affects each term weight in the MLT query
● It is the term score (it depends on the Term Scorer implementation chosen)
Field Boost
● field1^5.0 field2^2.0 field3^1.5
● Affects the Term Scorer
● Affects the interesting terms retrieved
N.B. a highly boosted field can dominate the interesting terms retrieval
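The N.B. can be shown with a toy example. The sketch assumes the field boost simply multiplies the term score before the top terms are selected; that multiplication, the function name and the raw scores are illustrative assumptions, while the boost values are the ones from the slide.

```python
import heapq

def top_terms(term_scores, field_boosts, max_query_terms=3):
    """Apply per-field boosts to raw term scores and keep the best terms."""
    boosted = [(score * field_boosts.get(field, 1.0), field, term)
               for field, term, score in term_scores]
    return heapq.nlargest(max_query_terms, boosted)

term_scores = [("field1", "a", 1.0), ("field1", "b", 0.9), ("field1", "c", 0.8),
               ("field2", "x", 1.2), ("field3", "y", 1.5)]

plain = top_terms(term_scores, {})
skewed = top_terms(term_scores, {"field1": 5.0, "field2": 2.0, "field3": 1.5})
print([f"{field}:{term}" for _, field, term in plain])
print([f"{field}:{term}" for _, field, term in skewed])
```

Without boosts the three slots are shared across the fields; with field1 boosted five times, all three slots go to field1 terms.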
More Like This Usage - Lucene Classification
● Given a document D to classify
● K Nearest Neighbours Classifier
● Find the Top K documents similar to D (MLT)
● Classes are extracted
● Class Frequency + Class ranking -> Class probability
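The last step of the classifier can be sketched as a vote among the neighbours. This simplified version uses class frequency only and ignores the ranking component mentioned above; the function name and the sample classes are hypothetical, and the list of neighbour classes is assumed to come from a More Like This search.

```python
from collections import Counter

def classify(neighbour_classes):
    """Turn the classes of the top K similar documents into probabilities."""
    counts = Counter(neighbour_classes)
    k = len(neighbour_classes)
    return {cls: n / k for cls, n in counts.most_common()}

# Classes of the 5 most similar documents, best match first.
probs = classify(["sport", "sport", "politics", "sport", "finance"])
print(probs)
```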
More Like This Usage - Apache Solr
● More Like This query parser (can be concatenated with other queries)
● More Like This search component (can be assigned to a Request Handler)
● More Like This handler (handler with specific request parameters)
More Like This Demo - Movie Data Set
This data consists of the following fields:
● id - unique identifier for the movie
● Title - Name of the movie
● Directors - The person(s) who directed the making of the film
● Genres - The genre(s) that the movie belongs to
More Like This Demo - Tuned
● Enable/Disable Term Boost
● Min Term Frequency
● Min Document Frequency
● Field Boost
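As a rough idea of how these four knobs might look in a request, the sketch below builds a URL with Solr's `mlt.boost`, `mlt.mintf`, `mlt.mindf` and `mlt.qf` parameters. The host, the `movies` collection, the `/mlt` handler path, the seed document id, the lowercase field names and all values are assumptions for illustration; the demo's real configuration is not given in the slides.

```python
from urllib.parse import urlencode

params = {
    "q": "id:123",                     # seed document (hypothetical id)
    "mlt.fl": "title,genres",          # fields used to extract terms
    "mlt.boost": "true",               # enable/disable term boost
    "mlt.mintf": 1,                    # min term frequency
    "mlt.mindf": 2,                    # min document frequency
    "mlt.qf": "title^5.0 genres^2.0",  # field boost
}

# Hypothetical endpoint: a More Like This handler registered at /mlt.
url = "http://localhost:8983/solr/movies/mlt?" + urlencode(params)
print(url)
```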
Future Work
● Query Builder just uses Terms and Term Score
● Term Positions?
● Phrase Queries Boost (for terms close in position)
● Sentence boundaries
● Field centric vs Document centric (should highly boosted fields kick out relevant terms from low boosted fields?)
Future Work - More Like These
● Multiple documents in input
● Interesting terms across documents
● Useful for Content Based recommender engines
More Like This
Pros
● Apache Lucene Module
● Advanced Params
● Input:
  - structured document
  - just text
● Builds an advanced query
● Leverages the Inverted Index (and additional data structures)
Cons
● Massive single class
● Low cohesion
● Low readability
● Minimal test coverage
● Difficult to extend (and improve)
JIRA References
● LUCENE-7498 - Introducing BM25 Term Scorer
● LUCENE-7802 - Architectural Refactor
● LUCENE-8326 - MLT Params Refactor
Questions ?
Thanks!
Ad

More Related Content

What's hot (20)

Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
Knoldus Inc.
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Lucidworks
 
Solrで多様なランキングモデルを活用するためのプラグイン開発 #SolrJP
Solrで多様なランキングモデルを活用するためのプラグイン開発 #SolrJPSolrで多様なランキングモデルを活用するためのプラグイン開発 #SolrJP
Solrで多様なランキングモデルを活用するためのプラグイン開発 #SolrJP
Yahoo!デベロッパーネットワーク
 
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
AWSKRUG - AWS한국사용자모임
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
Trey Grainger
 
주니어 개발자의 서버 로그 관리 개선기
주니어 개발자의 서버 로그 관리 개선기주니어 개발자의 서버 로그 관리 개선기
주니어 개발자의 서버 로그 관리 개선기
Yeonhee Kim
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Qi Guo
 
Count min sketch
Count min sketchCount min sketch
Count min sketch
DaeMyung Kang
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
NAVER D2
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
Timothy Spann
 
[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화
NAVER D2
 
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB .local Toronto 2019: Tips and Tricks for Effective IndexingMongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
Joey Wen
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
Jean-Philippe Chateau
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
Sujit Pal
 
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
ABC Talks
 
Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 구축 사례Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Hyoungjun Kim
 
Apache Spark Performance tuning and Best Practise
Apache Spark Performance tuning and Best PractiseApache Spark Performance tuning and Best Practise
Apache Spark Performance tuning and Best Practise
Knoldus Inc.
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
Knoldus Inc.
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Lucidworks
 
Solrで多様なランキングモデルを活用するためのプラグイン開発 #SolrJP
Solrで多様なランキングモデルを活用するためのプラグイン開発 #SolrJPSolrで多様なランキングモデルを活用するためのプラグイン開発 #SolrJP
Solrで多様なランキングモデルを活用するためのプラグイン開発 #SolrJP
Yahoo!デベロッパーネットワーク
 
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
AWSKRUG - AWS한국사용자모임
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
Trey Grainger
 
주니어 개발자의 서버 로그 관리 개선기
주니어 개발자의 서버 로그 관리 개선기주니어 개발자의 서버 로그 관리 개선기
주니어 개발자의 서버 로그 관리 개선기
Yeonhee Kim
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Qi Guo
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
NAVER D2
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
Timothy Spann
 
[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화
NAVER D2
 
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB .local Toronto 2019: Tips and Tricks for Effective IndexingMongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
Joey Wen
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
Sujit Pal
 
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
ABC Talks
 
Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 구축 사례Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Hyoungjun Kim
 
Apache Spark Performance tuning and Best Practise
Apache Spark Performance tuning and Best PractiseApache Spark Performance tuning and Best Practise
Apache Spark Performance tuning and Best Practise
Knoldus Inc.
 

Similar to How the Lucene More Like This Works (20)

Faceted search using Solr and Ontopia
Faceted search using Solr and OntopiaFaceted search using Solr and Ontopia
Faceted search using Solr and Ontopia
Geir Ove Grønmo
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache Lucene
Alessandro Benedetti
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
Sease
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1
GokulD
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
israelekpo
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Sease
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...
Andrei Lopatenko
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6
DEEPAK KHETAWAT
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
Sease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
Andrea Gazzarini
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
Tommaso Teofili
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdf
Abanti Aazmin
 
Search explained T3DD15
Search explained T3DD15Search explained T3DD15
Search explained T3DD15
Hans Höchtl
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Max Irwin
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
Asad Abbas
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
sagar chaturvedi
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Search Intelligence @elo7.com
Search Intelligence @elo7.comSearch Intelligence @elo7.com
Search Intelligence @elo7.com
Fernando Meyer
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
Charlie Hull
 
Faceted search using Solr and Ontopia
Faceted search using Solr and OntopiaFaceted search using Solr and Ontopia
Faceted search using Solr and Ontopia
Geir Ove Grønmo
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache Lucene
Alessandro Benedetti
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
Sease
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1
GokulD
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
israelekpo
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Sease
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...
Andrei Lopatenko
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6
DEEPAK KHETAWAT
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
Sease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
Andrea Gazzarini
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
Tommaso Teofili
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdf
Abanti Aazmin
 
Search explained T3DD15
Search explained T3DD15Search explained T3DD15
Search explained T3DD15
Hans Höchtl
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Max Irwin
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
Asad Abbas
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
sagar chaturvedi
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Search Intelligence @elo7.com
Search Intelligence @elo7.comSearch Intelligence @elo7.com
Search Intelligence @elo7.com
Fernando Meyer
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
Charlie Hull
 
Ad

More from Sease (20)

Hybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank FusionHybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank Fusion
Sease
 
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache SolrBlazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Sease
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Hybrid Search With Apache Solr
Hybrid Search With Apache SolrHybrid Search With Apache Solr
Hybrid Search With Apache Solr
Sease
 
Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
Sease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
Sease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
Sease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
Sease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
Sease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Sease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
Sease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
Sease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Sease
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
Sease
 
Hybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank FusionHybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank Fusion
Sease
 
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache SolrBlazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Sease
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Hybrid Search With Apache Solr
Hybrid Search With Apache SolrHybrid Search With Apache Solr
Hybrid Search With Apache Solr
Sease
 
Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
Sease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
Sease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
Sease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
Sease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
Sease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Sease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
Sease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
Sease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Sease
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
Sease
 
Ad


How the Lucene More Like This Works

  • 1. Apache Lucene/Solr London User Group: How the Lucene More Like This Works. Alessandro Benedetti, Software Engineer, 16th May 2019
  • 2. Who I am: Alessandro Benedetti
 ▪ Search Consultant
 ▪ R&D Software Engineer
 ▪ Master in Computer Science
 ▪ Apache Lucene/Solr Enthusiast
 ▪ Passionate about Semantic, NLP and Machine Learning technologies
 ▪ Beach Volleyball Player & Snowboarder
  • 3. Sease - Search Services
 ● Open Source Enthusiasts
 ● Apache Lucene/Solr experts
 ● Community Contributors
 ● Active Researchers
 ● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
  • 4. Agenda
 ● Document Similarity
 ● Apache Lucene More Like This
 ● Term Scorer
 ● BM25
 ● Interesting Terms Retrieval
 ● Query Building
 ● DEMO
 ● Future Work
 ● JIRA References
  • 5. Document Similarity
 Problem: find documents similar to a seed document
 Solution(s):
 ● Collaborative approach (user interactions)
 ● Content based
 ● Hybrid
 Similar?
 ● Documents accessed in similar ways by similar people
 ● Term distributions
 ● All of the above
  • 6. Real World Use Cases - Streaming Services
  • 7. Real World Use Cases - Hotels
  • 8. Apache Lucene: Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
  • 9. Apache Lucene
 ● Search Library (Java)
 ● Structured Documents
 ● Inverted Index
 ● Similarity Metrics (TF-IDF, BM25)
 ● Fast Search
 ● Support for advanced queries
 ● Relevancy tuning
  • 10. Inverted Index - Indexing
  • 11. More Like This - Break Up [diagram components: Input Document, More Like This Params, Interesting Terms Retriever, Term Scorer, Query Builder, QUERY]
  • 12. More Like This Params
 Responsibility: define a set of parameters (and defaults) that affect the various components of the More Like This module
 ● Regulates MLT behavior
 ● Groups parameters specific to each component
 ● Javadoc documentation
 ● Default values
 ● Useful container for the various parameters to be passed around
  • 13. Term Scorer
 Responsibility: assign a score to a term that measures how distinctive the term is for the input document
 ● Field Name
 ● Field Stats (Document Count)
 ● Term Stats (Document Frequency)
 ● Term Frequency
 ● TF-IDF -> tf * (log(numDocs / (docFreq + 1)) + 1)
 ● BM25
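The TF-IDF scoring on this slide can be sketched in a few lines. This is a plain-Python illustration, not Lucene's Java code: the function name `tf_idf_term_score` is hypothetical, and the sketch assumes the slide's formula is meant to be read as `numDocs / (docFreq + 1)`.

```python
import math


def tf_idf_term_score(term_freq: int, doc_freq: int, num_docs: int) -> float:
    """Score a term as tf * idf.

    Assumption: idf = log(numDocs / (docFreq + 1)) + 1, which is one reading
    of the formula on the slide.
    """
    idf = math.log(num_docs / (doc_freq + 1)) + 1.0
    return term_freq * idf


# Same term frequency in the input document, different rarity in the index.
rare_term = tf_idf_term_score(term_freq=3, doc_freq=9, num_docs=1000)
common_term = tf_idf_term_score(term_freq=3, doc_freq=499, num_docs=1000)
print(rare_term, common_term)
```

With equal term frequency, the term that appears in fewer indexed documents receives the higher score, which is what makes it "distinctive" for the input document.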
  • 14. BM25 Term Scorer
 ● Originates from probabilistic information retrieval
 ● Default similarity since Lucene 6.0 [1]
 ● 25th iteration in improving TF-IDF
 ● TF
 ● IDF
 ● Document Length
 [1] LUCENE-6789
  • 15. BM25 Term Scorer - Inverse Document Frequency: the IDF score has very similar behavior (compared with TF-IDF)
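To make the "very similar behavior" remark concrete, the sketch below prints two IDF variants side by side. It assumes the comparison on the slide is between the TF-IDF style IDF and the BM25 IDF. The BM25 expression is the commonly used `log(1 + (N - n + 0.5) / (n + 0.5))`. The function names and the index size are illustrative only.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    # TF-IDF style IDF, assuming the form log(N / (n + 1)) + 1.
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    # BM25 IDF in its commonly used, always-positive form.
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


NUM_DOCS = 10_000  # hypothetical index size
doc_freqs = [1, 10, 100, 1_000, 5_000]
classic_curve = [classic_idf(df, NUM_DOCS) for df in doc_freqs]
bm25_curve = [bm25_idf(df, NUM_DOCS) for df in doc_freqs]

for df, c, b in zip(doc_freqs, classic_curve, bm25_curve):
    print(f"docFreq={df:>5}  classic={c:.3f}  bm25={b:.3f}")
```

Both curves decrease as the document frequency grows, so rare terms are rewarded in either scheme.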
  • 16. BM25 Term Scorer - Term Frequency: the TF score asymptotically approaches (k+1); k=1.2 in this example
  • 17. BM25 Term Scorer - Document Length: the ratio Document Length / Avg Document Length affects how fast the TF score saturates
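The saturation and length effects described on the BM25 slides can be checked numerically. The sketch below uses the textbook BM25 term-frequency component. The value `k1=1.2` comes from the slide, while `b=0.75` is an assumption (a commonly used default that the slides do not mention), and the function name is hypothetical.

```python
def bm25_tf_component(term_freq: float, doc_len: float, avg_doc_len: float,
                      k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 TF part: tf * (k1 + 1) / (tf + k1 * length_norm)."""
    # length_norm is 1.0 for an average-length document,
    # above 1.0 for longer documents and below 1.0 for shorter ones.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return term_freq * (k1 + 1.0) / (term_freq + k1 * length_norm)


# Saturation: the score grows with tf but never reaches k1 + 1 = 2.2.
saturation = [bm25_tf_component(tf, 100, 100) for tf in (1, 5, 50, 1000)]

# Document length: same tf, but a longer document saturates more slowly.
short_doc = bm25_tf_component(3, doc_len=50, avg_doc_len=100)
average_doc = bm25_tf_component(3, doc_len=100, avg_doc_len=100)
long_doc = bm25_tf_component(3, doc_len=200, avg_doc_len=100)

print(saturation)
print(short_doc, average_doc, long_doc)
```

For an average-length document with tf=1 the component is exactly 1.0, and it stays below 2.2 no matter how large tf gets.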
  • 18. Interesting Term Retriever
 Responsibility: retrieve from the document a queue of weighted interesting terms
 Params used:
 ● Analyzer
 ● Max Num Token Parsed
 ● Min Term Frequency
 ● Min/Max Document Frequency
 ● Max Query Terms
 ● Query Time Field Boost
 Steps:
 ● Analyze content / Term Vector
 ● Skip Tokens
 ● Score Tokens
 ● Build Queue of Top Scored terms
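The four steps of the retriever (analyze, skip, score, build queue) can be sketched as below. This is a plain-Python approximation, not the Lucene implementation: the regex tokenizer stands in for a real Analyzer, the `doc_freqs` dictionary stands in for index statistics, the TF-IDF scorer is one possible Term Scorer, and the default parameter values are chosen for illustration only.

```python
import heapq
import math
import re
from collections import Counter


def interesting_terms(text, doc_freqs, num_docs, min_term_freq=1, min_doc_freq=1,
                      max_doc_freq=None, max_query_terms=25,
                      max_tokens_parsed=5000, field_boost=1.0):
    """Return up to max_query_terms (score, term) pairs, best first."""
    # 1. Analyze content (stand-in analyzer: lowercase, split on non-alphanumerics).
    tokens = re.findall(r"[a-z0-9]+", text.lower())[:max_tokens_parsed]
    term_freqs = Counter(tokens)

    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        # 2. Skip tokens that are too rare in the document or too rare/common in the index.
        if tf < min_term_freq or df < min_doc_freq:
            continue
        if max_doc_freq is not None and df > max_doc_freq:
            continue
        # 3. Score tokens (TF-IDF here, multiplied by the query-time field boost).
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((tf * idf * field_boost, term))

    # 4. Build the queue of top scored terms.
    return heapq.nlargest(max_query_terms, scored)


doc_freqs = {"lucene": 10, "search": 200, "index": 150, "the": 1000}  # made-up statistics
top = interesting_terms("Lucene search, Lucene index: the the the",
                        doc_freqs, num_docs=1000,
                        max_doc_freq=500, max_query_terms=2)
print(top)
```

In this toy run "the" is skipped by the max document frequency filter, and only the two best-scoring terms survive the queue size limit.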
  • 19. More Like This Query Builder
 Params used:
 ● Term Boost Enabled
 Scored terms (field : term, score):
 Field1 : Term1  3.0
 Field2 : Term2  4.0
 Field1 : Term3  4.5
 Field1 : Term4  4.8
 Field3 : Term5  7.5
 Q = Field1:Term1^3.0 Field2:Term2^4.0 Field1:Term3^4.5 Field1:Term4^4.8 Field3:Term5^7.5
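The query on this slide can be reproduced with a small sketch that turns scored terms into a boosted query string. The function name `build_mlt_query` is hypothetical, and the output is a query-syntax string for illustration rather than the query objects Lucene builds internally.

```python
def build_mlt_query(scored_terms, term_boost_enabled=True):
    """Join (field, term, score) triples into a space-separated query string."""
    clauses = []
    for field, term, score in scored_terms:
        clause = f"{field}:{term}"
        if term_boost_enabled:
            # With term boost enabled, the term score becomes the clause boost.
            clause += f"^{score}"
        clauses.append(clause)
    return " ".join(clauses)


scored_terms = [
    ("Field1", "Term1", 3.0),
    ("Field2", "Term2", 4.0),
    ("Field1", "Term3", 4.5),
    ("Field1", "Term4", 4.8),
    ("Field3", "Term5", 7.5),
]
boosted_query = build_mlt_query(scored_terms)
plain_query = build_mlt_query(scored_terms, term_boost_enabled=False)
print(boosted_query)
print(plain_query)
```

With term boost disabled, every clause contributes with the same weight and only the choice of terms carries information from the scoring step.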
  • 20. More Like This Boost
 Term Boost
 ● on/off
 ● Affects each term's weight in the MLT query
 ● It is the term score (it depends on the Term Scorer implementation chosen)
 Field Boost
 ● field1^5.0 field2^2.0 field3^1.5
 ● Affects the Term Scorer
 ● Affects the interesting terms retrieved
 N.B. a highly boosted field can dominate the interesting terms retrieval
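The note about a highly boosted field dominating the retrieval can be illustrated with a toy example. The sketch assumes the field boost is simply multiplied into each term score before the top terms are selected. The field names, terms and scores are made up.

```python
import heapq


def top_terms(candidates, field_boosts, max_query_terms):
    """Apply per-field boosts to (field, term, score) triples and keep the best ones."""
    boosted = [(score * field_boosts.get(field, 1.0), field, term)
               for field, term, score in candidates]
    return heapq.nlargest(max_query_terms, boosted)


candidates = [  # hypothetical unboosted term scores
    ("title", "lucene", 2.0),
    ("title", "search", 1.5),
    ("body", "similarity", 4.0),
    ("body", "recommendation", 3.5),
]

without_boost = top_terms(candidates, {}, max_query_terms=2)
with_boost = top_terms(candidates, {"title": 5.0}, max_query_terms=2)
print(without_boost)
print(with_boost)
```

Without boosts both selected terms come from `body`. With `title^5.0` the two `title` terms push every `body` term out of the queue, even though the `body` terms had the higher raw scores.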
  • 21. More Like This Usage - Lucene Classification
 ● Given a document D to classify
 ● K Nearest Neighbours Classifier
 ● Find the top K documents similar to D (MLT)
 ● Classes are extracted
 ● Class Frequency + Class ranking -> Class probability
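A rough sketch of the last step, turning the classes of the top K similar documents into class probabilities, is shown below. How frequency and ranking are combined is an assumption here (each neighbour votes with its similarity score). This is not claimed to be what the Lucene classification module does, and the labels and scores are made up.

```python
from collections import defaultdict


def class_probabilities(neighbours):
    """neighbours: (class_label, similarity_score) pairs for the top K MLT hits."""
    weights = defaultdict(float)
    for label, score in neighbours:
        # Frequency and ranking combined: every occurrence of a class adds its score.
        weights[label] += score
    total = sum(weights.values())
    if total == 0:
        return {}
    return {label: weight / total for label, weight in weights.items()}


# Hypothetical top-3 neighbours of a document D, best match first.
neighbours = [("sport", 3.0), ("politics", 2.0), ("sport", 1.0)]
probabilities = class_probabilities(neighbours)
predicted = max(probabilities, key=probabilities.get)
print(probabilities, predicted)
```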
  • 22. More Like This Usage - Apache Solr
 ● More Like This query parser (can be concatenated with other queries)
 ● More Like This search component (can be assigned to a Request Handler)
 ● More Like This handler (handler with specific request parameters)
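The three Solr entry points could be called with requests along the following lines. The sketch only builds URLs. The host, the `movies` collection name and the `/mlt` handler path are assumptions about a local setup (a dedicated MLT handler has to be configured before it can be called), and the document id `42` is made up. The field names are taken from the demo data set.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/movies"  # hypothetical host and collection


def mlt_query_parser_url(doc_id, field):
    # Query parser: the MLT local params wrap the id of the seed document.
    local_params = "{!mlt qf=%s mintf=1 mindf=1}" % field
    return f"{BASE}/select?" + urlencode({"q": local_params + doc_id})


def mlt_component_url(query, fields):
    # Search component: similar documents are attached to a normal search response.
    params = {"q": query, "mlt": "true", "mlt.fl": ",".join(fields),
              "mlt.mintf": 1, "mlt.mindf": 1}
    return f"{BASE}/select?" + urlencode(params)


def mlt_handler_url(query, fields):
    # Dedicated handler (assumed to be registered at /mlt in the Solr configuration).
    params = {"q": query, "mlt.fl": ",".join(fields), "mlt.boost": "true",
              "mlt.interestingTerms": "details"}
    return f"{BASE}/mlt?" + urlencode(params)


parser_url = mlt_query_parser_url("42", "Title")
component_url = mlt_component_url("id:42", ["Title", "Genres"])
handler_url = mlt_handler_url("id:42", ["Title", "Genres"])
print(parser_url)
print(component_url)
print(handler_url)
```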
  • 23. More Like This Demo - Movie Data Set
 This data consists of the following fields:
 ● id - unique identifier for the movie
 ● Title - name of the movie
 ● Directors - the person(s) who directed the making of the film
 ● Genres - the genre(s) that the movie belongs to
  • 24. More Like This Demo - Tuned
 ● Enable/Disable Term Boost
 ● Min Term Frequency
 ● Min Document Frequency
 ● Field Boost
  • 25. Future Work
 ● Query Builder just uses Terms and Term Score
 ● Term Positions?
 ● Phrase Queries Boost (for terms close in position)
 ● Sentence boundaries
 ● Field centric vs Document centric (should highly boosted fields kick out relevant terms from low boosted fields?)
  • 26. Future Work - More Like These
 ● Multiple documents in input
 ● Interesting terms across documents
 ● Useful for Content Based recommender engines
  • 27. More Like This
 Pros
 ● Apache Lucene Module
 ● Advanced Params
 ● Input: structured document or just text
 ● Builds an advanced query
 ● Leverages the Inverted Index (and additional data structures)
 Cons
 ● Massive single class
 ● Low cohesion
 ● Low readability
 ● Minimal test coverage
 ● Difficult to extend (and improve)
  • 28. JIRA References
 ● LUCENE-7498 - Introducing BM25 Term Scorer
 ● LUCENE-7802 - Architectural Refactor
 ● LUCENE-8326 - MLT Params Refactor
  • 29. Questions?
  • 30. Thanks!