An Introduction to NLP4L

Natural Language Processing tool for Apache Lucene
Koji Sekiguchi @kojisays
Founder & CEO, RONDHUIT

My contributions
• CharFilter framework & MappingCharFilter
• FastVectorHighlighter

Agenda
• What's NLP4L?
• How NLP improves search experience
• Calculate probabilities using ShingleFilter
• Transliteration (Application for HMM)
• NLP4L Framework (coming soon)

Agenda
• What's NLP4L?

What's NLP4L?
• GOAL
  • Improve Lucene users' search experience
• FEATURES
  • Use of a Lucene index as a corpus database
  • Lucene API front-end written in Scala
• NLP4L provides
  • Preprocessors for existing ML tools
  • ML algorithms and applications (e.g. Transliteration)

Agenda
• How NLP improves search experience

Evaluation Measures
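[Venn diagram: the "target" set and the "result" set overlap; tp = in both, fp = in result only, fn = in target only, tn = in neither]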
precision = tp / (tp + fp)
recall = tp / (tp + fn)
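The two formulas above can be sketched with plain Python sets. This is a minimal illustration, not part of NLP4L; the function name `precision_recall` and the document IDs are hypothetical.

```python
def precision_recall(target, result):
    """Compute precision and recall from two sets of document IDs."""
    tp = len(target & result)   # in both target and result
    fp = len(result - target)   # in result only
    fn = len(target - result)   # in target only
    precision = tp / (tp + fp) if result else 0.0
    recall = tp / (tp + fn) if target else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
target = {1, 2, 3, 4}
result = {3, 4, 5, 6, 7, 8}
precision, recall = precision_recall(target, result)
print(precision, recall)
```

Here tp = 2, fp = 4 and fn = 2, so precision is 2/6 and recall is 2/4.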

Recall, Precision
[Venn diagram: target and result sets with the tp, fp, fn and tn regions]

Solution
• recall: n-gram, synonym dictionary, etc. (e.g. Transliteration)
• precision: facet (filter query) (e.g. Named Entity Extraction)
• recall, precision: Ranking Tuning

gradual precision improvement
q=watch
[Venn diagram: target and result sets]

filter by Gender=Men's
[Venn diagram: target and result sets]

filter by Gender=Men's
filter by Price=100-150
[Venn diagram: target and result sets]

Structured Documents
ID | product | price | gender
1 | CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men's
2 | Suiksilver The Gamer Watch | 87.99 | Men's

Unstructured Documents
ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.

Make them Structured
ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain
NEE[1] extracts interesting words.
[1] Named Entity Extraction

Agenda
• Calculate probabilities using ShingleFilter

Language Model
• An LM represents the fluency of language
• The N-gram model is the most widely used LM
• Calculation example for a 2-gram:
P(apple | an) = totalTermFreq("word2g", "an apple") / totalTermFreq("word", "an")
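The ratio of the two term frequencies can be sketched without Lucene. In this minimal Python illustration, two `Counter` objects stand in for the "word" and "word2g" fields; the toy sentences and the function name `bigram_probability` are assumptions for the example, and the code does not use ShingleFilter or any NLP4L API.

```python
from collections import Counter

# Toy corpus, lowercased and pre-tokenized by whitespace (illustrative only).
sentences = [
    "alice ate an apple .",
    "mike likes an orange .",
    "an apple is red .",
]

unigram = Counter()  # plays the role of totalTermFreq on the "word" field
bigram = Counter()   # plays the role of totalTermFreq on the "word2g" field
for sentence in sentences:
    words = sentence.split()
    unigram.update(words)
    bigram.update(zip(words, words[1:]))


def bigram_probability(prev, word):
    """Maximum-likelihood estimate: count(prev word) / count(prev)."""
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, word)] / unigram[prev]


p = bigram_probability("an", "apple")
print(round(p, 3))  # 2 occurrences of "an apple" / 3 occurrences of "an"
```

An unseen 2-gram gets probability 0 under this estimate; smoothing is outside the scope of the sketch.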

Part-of-Speech Tagging
Our Corpus for training
Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.

NNP: Proper noun, singular
VB: Verb
AT: Article
JJ: Adjective
.: Period

Hidden Markov Model
• Series of Words (observed)
• Series of Part-of-Speech tags (hidden states)

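HMM state diagram
Initial probabilities: NNP 0.667, AT 0.333, VB 0.0, JJ 0.0, . 0.0
Transition probabilities:
  NNP → VB 0.6, NNP → . 0.4
  VB → AT 0.667, VB → JJ 0.333
  AT → NNP 1.0
  JJ → . 1.0
Emission probabilities:
  NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2
  VB: ate 0.333, is 0.333, likes 0.333
  AT: an 1.0
  JJ: red 1.0
  .: . 1.0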
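The probabilities in the state diagram follow from counting over the three tagged training sentences on the Part-of-Speech Tagging slide. The sketch below assumes plain relative-frequency estimates with lowercased words; the variable and function names are hypothetical and this is not NLP4L's own HMM code.

```python
from collections import Counter, defaultdict

tagged_corpus = [
    "Alice/NNP ate/VB an/AT apple/NNP ./.",
    "Mike/NNP likes/VB an/AT orange/NNP ./.",
    "An/AT apple/NNP is/VB red/JJ ./.",
]


def parse(sentence):
    """Split 'word/TAG' tokens into (lowercased word, tag) pairs."""
    tokens = (token.rsplit("/", 1) for token in sentence.split())
    return [(word.lower(), tag) for word, tag in tokens]


initial = Counter()               # tag of the first word in each sentence
transition = defaultdict(Counter) # tag -> following tag
emission = defaultdict(Counter)   # tag -> word

for sentence in tagged_corpus:
    pairs = parse(sentence)
    initial[pairs[0][1]] += 1
    for word, tag in pairs:
        emission[tag][word] += 1
    for (_, prev_tag), (_, next_tag) in zip(pairs, pairs[1:]):
        transition[prev_tag][next_tag] += 1


def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


initial_p = normalize(initial)
transition_p = {tag: normalize(c) for tag, c in transition.items()}
emission_p = {tag: normalize(c) for tag, c in emission.items()}

print(round(initial_p["NNP"], 3))            # 0.667
print(round(transition_p["NNP"]["VB"], 3))   # 0.6
print(round(emission_p["NNP"]["apple"], 3))  # 0.4
```

Tags that never start a sentence (VB, JJ, .) simply do not appear in `initial_p`, which corresponds to the 0.0 entries in the diagram.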

Agenda
• Transliteration (Application for HMM)

Transliteration
Transliteration is the process of transcribing letters or words from one
alphabet into another to facilitate comprehension and pronunciation
for non-native speakers.
Examples of transliteration from English to Japanese:
computer → コンピューター
server → サーバー
internet → インターネット
mouse → マウス
information → インフォメーション

It helps improve recall
• You search for the English word "mouse"
• The results include マウス (= mouse), highlighted in Japanese

Training data in NLP4L
train_data/alpha_katakana.txt
academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー

train_data/alpha_katakana_aligned.txt
アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy
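Each line of the aligned file interleaves a Katakana chunk with the alphabet chunk it maps to. A minimal Python sketch of reading one such line is shown here; the regular expression and the function name `parse_aligned` are assumptions made for illustration, not NLP4L's own parser.

```python
import re

# One or more Katakana characters (U+30A1..U+30FC, which includes the
# long-vowel mark ー) followed by one or more lowercase ASCII letters.
CHUNK = re.compile(r"([\u30A1-\u30FC]+)([a-z]+)")


def parse_aligned(line):
    """Return a list of (katakana chunk, alphabet chunk) pairs."""
    return CHUNK.findall(line)


pairs = parse_aligned("アaカcaデdeミーmy")
katakana = "".join(k for k, _ in pairs)
alphabet = "".join(a for _, a in pairs)
print(pairs)
print(alphabet, katakana)
```

Concatenating the chunks on each side recovers the unaligned pair, here "academy" and "アカデミー".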

Demo: Transliteration
nlp4l> :load examples/trans_katakana_alpha.scala
Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry

Gathering loan words
① Crawl
② Gather Katakana-Alphabet string pairs (e.g. アルゴリズム, algorithm)
③ Transliterate the Katakana string (アルゴリズム → algorism)
④ Calculate the edit distance between the prediction and the gathered alphabet string
⑤ Store the pair of strings if the edit distance is small enough
⑥ synonyms.txt
Got 1,800+ records of synonym knowledge from jawiki
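The edit-distance check can be sketched as follows. The deck does not say which edit distance or threshold is used, so this example assumes the Levenshtein distance and a hypothetical threshold `MAX_DISTANCE`; the function names are illustrative as well.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


MAX_DISTANCE = 2  # hypothetical threshold; the deck only says "small enough"


def is_synonym_candidate(predicted, gathered):
    """Accept the pair when the prediction is close to the gathered string."""
    return edit_distance(predicted, gathered) <= MAX_DISTANCE


print(edit_distance("algorism", "algorithm"))  # 2
print(is_synonym_candidate("algorism", "algorithm"))
```

With this threshold, the pair (アルゴリズム, algorithm) would be kept because the prediction "algorism" is only two edits away from "algorithm".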

Agenda
• NLP4L Framework (coming soon)

NLP4L Framework
• A pluggable framework that improves the search experience (mainly for
Lucene-based search systems).
• Reference implementations of plug-ins and corpora are provided.
• Uses NLP/ML technologies to output models, dictionaries and indexes.
• Since NLP/ML are not perfect, an interface that lets users examine the
output dictionaries themselves is provided as well.

[Architecture diagram]
• Data Source: Corpus (Text data, Lucene index); Query Log; Access Log
• TermExtractor; Transliteration; NEE; Classification; Document Vectors; Language Detection; Learning to Rank; Personalized Search
• Dictionaries: Suggestion (auto complete); Did you mean?; synonyms.txt; userdic.txt; keyword attachment
• maintenance
• Model files; Tagged Corpus; Document Vectors
• Solr; ES; Mahout; Spark

Keyword Attachment
• Keyword attachment is a general format that enables the
following functions:
  • Learning to Rank
  • Personalized Search
  • Named Entity Extraction
  • Document Classification
[Diagram: Lucene doc → Lucene doc + keyword (increase boost)]

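Before Learning to Rank
[Venn diagram: target and result sets, with result ranks 1, 2, 3, … and 50, 100, 500, …]

After Learning to Rank
[Venn diagram: target and result sets, with result ranks 1, 2, 3, … and 50, 100, 500, …]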

Learning to Rank
• The program learns, from the access log and other
sources, that the score of document d for a
query q should be larger than the normal
score(q,d)
[Diagram: Lucene doc d with q, q, … attached]
https://en.wikipedia.org/wiki/Learning_to_rank

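Personalized Search
q=apple (computer …)
[Venn diagram: target and result sets, with result ranks 1, 2, 3, … and 50, 100, 500, …]

Personalized Search
q=apple (fruit …)
[Venn diagram: target and result sets, with result ranks 1, 2, 3, … and 50, 100, 500, …]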

Personalized Search
• The program learns, from the access log and other sources, that
the score of document d for a query q by user u should
be larger than the normal score(q,d)
• Since Lucene does not let you specify score(q,d,u), you have to
specify score(qu,d) instead.
• Because the number of q-u combinations can be enormous, limit
the data to high-order queries or divide the fields by user.
[Diagram: Lucene doc d1 with q1u1, q2u2 attached; Lucene doc d2 with q2u1, q1u2 attached]
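The score(qu,d) idea can be sketched as building one combined token per (query, user) pair and attaching it to the documents that user chose for that query. Everything below is a hypothetical illustration in plain Python: the access-log tuples, the `combined_key` function and the `attachments` mapping are assumptions, and no Lucene or NLP4L API is involved.

```python
from collections import defaultdict


def combined_key(query, user):
    """Join a query and a user ID into a single 'qu' token.

    Plain concatenation mirrors the q1u1 notation on the slide; a real
    system would need a separator that cannot occur in either part.
    """
    return f"{query}{user}"


# Hypothetical access log entries: (user, query, clicked document ID).
access_log = [
    ("u1", "q1", "d1"),
    ("u2", "q2", "d1"),
    ("u1", "q2", "d2"),
    ("u2", "q1", "d2"),
]

# Document ID -> set of 'qu' keywords to attach to that document.
attachments = defaultdict(set)
for user, query, doc_id in access_log:
    attachments[doc_id].add(combined_key(query, user))

print(sorted(attachments["d1"]))  # ['q1u1', 'q2u2']
print(sorted(attachments["d2"]))  # ['q1u2', 'q2u1']
```

At search time, the same `combined_key(query, user)` token would be matched against the attached keywords, so the score depends on the user without needing a three-argument score function.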

Join and Code with Us!
Contact us at koji at apache dot org for the details.
