Getting Started
Japanese Search and Calculate Similarity
with Apache Lucene
May 2016 Eiji Shinohara
Who am I?
Name: Eiji Shinohara / 篠原 英治 / @shinodogg
Role: AWS Solutions Architect
Subject Matter Expert:
・Amazon CloudSearch
・Amazon Elasticsearch Service
Which Search Engine/Service do you use?
• Apache Solr
• Elasticsearch
• Amazon CloudSearch
• Amazon Elasticsearch Service
On top of Apache Lucene
• Apache Solr
• Elasticsearch
• Amazon CloudSearch
• Amazon Elasticsearch Service
Have you used Apache Lucene?
• Apache Lucene is a free and open-source information retrieval software library, originally written in Java by Doug Cutting.
• It is supported by the Apache Software Foundation and is released under the Apache Software License.
https://en.wikipedia.org/wiki/Lucene
Doug Cutting – Hadoop/Nutch/Lucene
• Hadoop: MapReduce
• The name my kid gave a stuffed yellow elephant.
• Nutch: Crawler
• Nutch was a word my oldest son used when he was two; I think it came from "lunch".
• Lucene: Search
• Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.
http://www.mwsoft.jp/programming/hadoop/where_come_from.html
Maybe the most proper naming :)
Apache Lucene
• Full-text search
• Easy to use
1. Index
• new Document → addDocument → commit
2. Query
• Build the query string
3. Search
• Search and fetch the matching documents
4. Display
• Get the contents of the fetched documents to show
http://www.lucenetutorial.com/lucene-in-5-minutes.html
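The four steps above can be sketched without any library at all. The following is a toy in-memory inverted index in plain Java; it is not Lucene's API, and the class and method names (`ToyIndex`, `addDocument`, `search`, `get`) are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy in-memory inverted index. NOT Lucene's API, only the same four steps.
class ToyIndex {
    private final List<String> docs = new ArrayList<>();
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // 1. Index: store the document and record which terms it contains.
    int addDocument(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // 2. Query + 3. Search: look up a single term and return matching doc ids.
    List<Integer> search(String queryTerm) {
        TreeSet<Integer> hits = postings.get(queryTerm.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // 4. Display: fetch the stored text of a hit.
    String get(int docId) {
        return docs.get(docId);
    }

    // Naive tokenizer: lower-case, then split on non-word characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("\\W+");
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("A Cat sat on the mat.");
        index.addDocument("Cats are sitting on the mat.");
        for (int docId : index.search("mat")) {
            System.out.println(docId + ": " + index.get(docId));
        }
    }
}
```

The naive `\W+` split only works for space-delimited text. Japanese has no spaces between words, which is why the slides later use a morphological analyzer (JapaneseAnalyzer) instead.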
Evernote and LinkedIn are using Lucene
• with their own thin HTTP wrappers
• Presentations at Lucene Solr Revolution 2014
https://www.youtube.com/watch?v=drOmahIie6c
https://www.youtube.com/watch?v=8O7cF75intk
Build your own Search engine?
• Some companies are doing that
http://www.slideshare.net/lucidworks/galene-linkedins-search-architecture-presented-by-diego-buthay-sriram-sankar-linkedin/8
I'll attend Lucene Solr Revolution 2016
Apache Lucene入門 (Introduction to Apache Lucene), in Japanese
http://rondhuit.com/lucene-for-bea-060710.pdf
http://www.amazon.co.jp/dp/4774127809
Lucene in Action
https://www.amazon.com/dp/1933988177
Uchida-san's Blog in Japanese
http://mocobeta-backup.tumblr.com/post/54371099587/lucene-in-action
Uchida-san: Search Consultant at Rondhuit
Lucene in Action chap5: Term Vector (2)
Calculate Document Similarity
http://mocobeta-backup.tumblr.com/post/49779999073/
• Just tried running it on my local MacBook Air :)
• Created 2 classes
• Indexer
• Indexes some documents
• CalcCosineSimilarityTester
• Compares 2 documents
• Calculates cosine similarity
• Using Luke for browsing the index
• https://github.com/DmitryKey/luke
• Uchida-san is also a Luke committer
Lucene 6.0
• I had a Lucene 5.5 environment, but got this error:
• Invalid directory at the location, check console for more information. Last exception:
• java.lang.IllegalArgumentException: Could not load codec 'Lucene60'. Did you forget to add lucene-backward-codecs.jar?
Lucene 6.0
• So I created a new Maven project
• pom.xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-kuromoji</artifactId>
  <version>6.0.0</version>
</dependency>
Indexer
public class Indexer {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new JapaneseAnalyzer();
        // ... (omitted) ...
        File[] files = new File("/Users/xxx/lucene_test/docs/").listFiles();
        for (File file : files) {
            Document doc = new Document();
            // ... (omitted) ...
            // stored, tokenized field that also keeps term vectors
            FieldType contentsType = new FieldType();
            contentsType.setStored(true);
            contentsType.setTokenized(true);
            contentsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            contentsType.setStoreTermVectors(true);
            // ... (omitted) ...
            doc.add(new Field("contents", sb.toString(), contentsType));
            writer.addDocument(doc);
        }
        writer.commit();
        writer.close();
    }
}
• Read file -> add Document -> Commit
Indexer
• Files
• Found examples on the internet :)
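• http://www.pahoo.org/e-soul/webtech/php06/php06-21-01.shtm
PHP: Hypertext Preprocessor(ピー・エイチ・ピー ハイパーテキスト プリプロセッサー)とは、動的に HTML データを生成することによって、動的なウェブページを実現することを主な目的としたプログラミング言語、およびその言語処理系である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語として分類される。この言語処理系自体は、C言語で記述されている。
PHP(Hypertext Preprocessor;ピー・エイチ・ピー)とは、動的に HTML データを生成することによって、動的なウェブページを実現すること目的としたプログラミング言語である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語の一種で、処理系自体は C言語で記述されている。
(English gloss: both passages define PHP as a programming language for building dynamic web pages by generating HTML dynamically, an HTML-embedded server-side scripting language whose implementation is written in C. The second passage is a slightly reworded version of the first.)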
Indexer
• Files
• Found examples on the internet :)
• http://www.fisproject.jp/2015/01/cosine_similarity/
• The Japanese pair below is exactly the same text
A Cat sat on the mat.
Cats are sitting on the mat.
人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
(English gloss: "Items randomly selected from a population and used to test hypotheses about that population.")
Indexer
• Run
Luke
• Index Browsing
$ mvn package
$ ./luke.sh
Calculate Document Similarity
• mocobeta/CalcCosineSimilarityTest.java
• https://gist.github.com/mocobeta/5525864
• Search for the documents in the index
• TF-IDF from Term Vector
• TF-IDF
• how important a word is to a document in a collection or corpus
• TF: how frequently a term occurs in a document
• IDF: a measure of how rare a term is across the collection
• Get Cosine-Similarity
• The code returns the angle (arccos of the cosine), so a lower value means more similar
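For reference, the quantities listed above can be written out as follows. This is one common formulation of TF-IDF, given as an assumption; the gist may weight terms differently. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t)

\cos\theta = \frac{\sum_{t} w_{t,1}\, w_{t,2}}
                  {\sqrt{\sum_{t} w_{t,1}^{2}}\;\sqrt{\sum_{t} w_{t,2}^{2}}}, \qquad
\theta = \arccos(\cos\theta)
```

Identical documents give $\cos\theta = 1$ and $\theta = 0$; documents with no shared terms give $\cos\theta = 0$ and $\theta = \pi/2$.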
Calculate Document Similarity
public class CalcCosineSimilarityTester {
    public static void main(String[] args) throws IOException {
        // ... (omitted) ...
        TopDocs hits = searcher.search(new TermQuery(new Term("path", path1)), 1);
        int docId1 = hits.scoreDocs[0].doc;
        Map<String, Double> map1 = buildDocumentVector(docId1);
        hits = searcher.search(new TermQuery(new Term("path", path2)), 1);
        int docId2 = hits.scoreDocs[0].doc;
        Map<String, Double> map2 = buildDocumentVector(docId2);
        // print the angle between the two document vectors
        System.out.println(computeAngle(map1, map2));
    }

    // create HashMap (key: keyword, value: TF-IDF) for each document
    private Map<String, Double> buildDocumentVector(int docId) throws IOException {
        // ... (omitted) ...
    }

    // calculate the angle from the cosine similarity
    private double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) {
        // ... (omitted) ...
    }
}
Calculate Document Similarity
private Map<String, Double> buildDocumentVector(int docId) throws IOException {
    Terms vector = reader.getTermVector(docId, "contents");
    // ... (omitted) ...
    // collect the TF-IDF inputs from the term vector
    TermsEnum itr = vector.iterator();
    // ... (omitted) ...
    while ((ref = itr.next()) != null) {
        String term = ref.utf8ToString();
        TermFreq freq = new TermFreq(term, maxDoc);
        freq.setTc(itr.totalTermFreq());                         // term count in this document
        freq.setDf(reader.docFreq(new Term("contents", term)));  // number of documents containing the term
        list.add(freq);
        tcSum += itr.totalTermFreq();
    }
    // build HashMap (key: keyword, value: TF-IDF)
    Map<String, Double> docVector = new HashMap<String, Double>();
    for (TermFreq freq : list) {
        // ... (omitted) ...
    }
    return docVector;
}
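The slide omits how `tc`, `df`, `tcSum` and `maxDoc` are combined into a weight. As a rough, library-free sketch of one plausible combination (an assumption, not the gist's actual formula), the hypothetical helper below computes `(tc / tcSum) * log(maxDoc / df)` per term from plain maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene or of the original gist.
class DocVectorSketch {
    // termCounts: term -> occurrences in this document (tc)
    // docFreqs:   term -> number of documents containing the term (df)
    // maxDoc:     total number of documents in the index
    static Map<String, Double> buildVector(Map<String, Integer> termCounts,
                                           Map<String, Integer> docFreqs,
                                           int maxDoc) {
        int tcSum = 0;
        for (int tc : termCounts.values()) {
            tcSum += tc;
        }
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            double tf = (double) e.getValue() / tcSum;
            // Assumption: a term missing from docFreqs is treated as df = 1.
            int df = docFreqs.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) maxDoc / df);
            vector.put(e.getKey(), tf * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> tc = new HashMap<>();
        tc.put("lucene", 2);
        tc.put("search", 2);
        Map<String, Integer> df = new HashMap<>();
        df.put("lucene", 1);
        df.put("search", 4);
        System.out.println(buildVector(tc, df, 4));
    }
}
```

Under this variant a term that appears in every document gets weight 0, so a corpus in which all documents share all terms yields zero vectors. Many implementations add smoothing to the IDF term to avoid that.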
Calculate Document Similarity
private double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) {
    double dotProduct = 0; // inner product
    for (String term : vec1.keySet()) {
        if (vec2.containsKey(term)) {
            dotProduct += vec1.get(term) * vec2.get(term);
        }
    }
    double denominator = getNorm(vec1) * getNorm(vec2);
    double ratio = dotProduct / denominator; // cosine value
    return Math.acos(ratio); // angle in radians
}

private double getNorm(Map<String, Double> vec) {
    double sumOfSquares = 0;
    for (Double val : vec.values()) {
        sumOfSquares += val * val;
    }
    return Math.sqrt(sumOfSquares);
}
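A standalone variant of the same calculation can be run without an index. The sketch below is an illustration with made-up names (`AngleSketch`, `angle`, `norm`), and it adds two guards that the slide code does not have: it returns NaN for a zero vector, and it clamps the cosine into [-1, 1], because floating-point rounding can push the ratio slightly above 1, where `Math.acos` returns NaN.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only; mirrors computeAngle/getNorm with extra guards.
class AngleSketch {
    static double norm(Map<String, Double> vec) {
        double sumOfSquares = 0;
        for (double val : vec.values()) {
            sumOfSquares += val * val;
        }
        return Math.sqrt(sumOfSquares);
    }

    static double angle(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double denominator = norm(a) * norm(b);
        if (denominator == 0) {
            return Double.NaN; // the angle is undefined for a zero vector
        }
        // Clamp so rounding error cannot leave the domain of acos.
        double cosine = Math.max(-1.0, Math.min(1.0, dot / denominator));
        return Math.acos(cosine); // radians
    }

    public static void main(String[] args) {
        Map<String, Double> v1 = new HashMap<>();
        v1.put("cat", 3.0);
        v1.put("mat", 4.0);
        Map<String, Double> v2 = new HashMap<>();
        v2.put("dog", 1.0);
        System.out.println("same vector:     " + angle(v1, v1));
        System.out.println("no shared terms: " + angle(v1, v2));
    }
}
```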
Calculate Document Similarity
• result (angle in radians, from Math.acos)
• 0.5000430658877127
PHP: Hypertext Preprocessor(ピー・エイチ・ピー ハイパーテキスト プリプロセッサー)とは、動的に HTML データを生成することによって、動的なウェブページを実現することを主な目的としたプログラミング言語、およびその言語処理系である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語として分類される。この言語処理系自体は、C言語で記述されている。
PHP(Hypertext Preprocessor;ピー・エイチ・ピー)とは、動的に HTML データを生成することによって、動的なウェブページを実現すること目的としたプログラミング言語である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語の一種で、処理系自体は C言語で記述されている。
Calculate Document Similarity
• result
• 1.2734113128621865
A Cat sat on the mat.
Cats are sitting on the mat.
Calculate Document Similarity
• result
• 0.0
人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
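Because the printed values are angles, it can help to convert them back to cosine similarities, where higher means more similar. The small sketch below (the class name `AngleToCosine` is made up) does that for the three results above; the first two come out at roughly 0.88 and 0.29, and the identical pair at exactly 1.0.

```java
// Converts the angles reported on the slides back to cosine similarity.
class AngleToCosine {
    static double cosineOf(double angleRadians) {
        return Math.cos(angleRadians);
    }

    public static void main(String[] args) {
        double[] angles = {0.5000430658877127, 1.2734113128621865, 0.0};
        for (double angle : angles) {
            System.out.printf("angle=%.4f rad (%.1f deg), cosine=%.4f%n",
                    angle, Math.toDegrees(angle), cosineOf(angle));
        }
    }
}
```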
Lucene 6.0
• A bunch of changes…
Lucene 6.0
• N-best
• LUCENE-6837: Add N-best output capability to JapaneseTokenizer
N-best
• Contributed by Yahoo! Japan
http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest
Nihongo Muzukashii-ne… (Japanese is difficult, isn't it…)
• Need to analyze more or maintain dictionaries?
http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest
Nihongo Muzukashii-ne…
• A search for “一眼レフ” (single-lens reflex) gets no hits?
http://blog.yoslab.com/entry/2014/09/12/005207
N-best
• Seems cool :)
• I'm going to try it…
http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest
Ad

More Related Content

What's hot (20)

Apache Solr 入門
Apache Solr 入門Apache Solr 入門
Apache Solr 入門
順平 西本
 
Hive chapter 2
Hive chapter 2Hive chapter 2
Hive chapter 2
masahiro_minami
 
SolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようSolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみよう
Shinsuke Sugaya
 
RubyもApache Arrowでデータ処理言語の仲間入り
RubyもApache Arrowでデータ処理言語の仲間入りRubyもApache Arrowでデータ処理言語の仲間入り
RubyもApache Arrowでデータ処理言語の仲間入り
Kouhei Sutou
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
Toru Takahashi
 
Hot の書き方(Template Version 2015-04-30) 前編
Hot の書き方(Template Version 2015-04-30) 前編Hot の書き方(Template Version 2015-04-30) 前編
Hot の書き方(Template Version 2015-04-30) 前編
irix_jp
 
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
Kentaro Yoshida
 
AWS小ネタ集
AWS小ネタ集AWS小ネタ集
AWS小ネタ集
Takehito Tanabe
 
PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
 PHPでPostgreSQLとPGroongaを使って高速日本語全文検索! PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
Kouhei Sutou
 
Elasticsearchプラグインの作り方
Elasticsearchプラグインの作り方Elasticsearchプラグインの作り方
Elasticsearchプラグインの作り方
Shinsuke Sugaya
 
named_scope more detail - WebCareer
named_scope more detail - WebCareernamed_scope more detail - WebCareer
named_scope more detail - WebCareer
Kyosuke MOROHASHI
 
GDG Tokyo Firebaseを使った Androidアプリ開発
GDG Tokyo Firebaseを使った Androidアプリ開発GDG Tokyo Firebaseを使った Androidアプリ開発
GDG Tokyo Firebaseを使った Androidアプリ開発
Fumihiko Shiroyama
 
AWS SDK for Haskell開発
AWS SDK for Haskell開発AWS SDK for Haskell開発
AWS SDK for Haskell開発
Nomura Yusuke
 
OpenStack API
OpenStack APIOpenStack API
OpenStack API
Akira Yoshiyama
 
実行時のために最適なデータ構造を作成しよう
実行時のために最適なデータ構造を作成しよう実行時のために最適なデータ構造を作成しよう
実行時のために最適なデータ構造を作成しよう
Hiroki Omae
 
Embulk 20150411
Embulk 20150411Embulk 20150411
Embulk 20150411
Hiroshi Nakamura
 
TerraformでECS+ECRする話
TerraformでECS+ECRする話TerraformでECS+ECRする話
TerraformでECS+ECRする話
Satoshi Hirayama
 
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システムMySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
Kouhei Sutou
 
Firebaseで驚くほど簡単に作れるリアルタイムイベントドリブンアプリ
Firebaseで驚くほど簡単に作れるリアルタイムイベントドリブンアプリFirebaseで驚くほど簡単に作れるリアルタイムイベントドリブンアプリ
Firebaseで驚くほど簡単に作れるリアルタイムイベントドリブンアプリ
Fumihiko Shiroyama
 
SolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようSolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみよう
Shinsuke Sugaya
 
RubyもApache Arrowでデータ処理言語の仲間入り
RubyもApache Arrowでデータ処理言語の仲間入りRubyもApache Arrowでデータ処理言語の仲間入り
RubyもApache Arrowでデータ処理言語の仲間入り
Kouhei Sutou
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
Toru Takahashi
 
Hot の書き方(Template Version 2015-04-30) 前編
Hot の書き方(Template Version 2015-04-30) 前編Hot の書き方(Template Version 2015-04-30) 前編
Hot の書き方(Template Version 2015-04-30) 前編
irix_jp
 
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
Kentaro Yoshida
 
PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
 PHPでPostgreSQLとPGroongaを使って高速日本語全文検索! PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
PHPでPostgreSQLとPGroongaを使って高速日本語全文検索!
Kouhei Sutou
 
Elasticsearchプラグインの作り方
Elasticsearchプラグインの作り方Elasticsearchプラグインの作り方
Elasticsearchプラグインの作り方
Shinsuke Sugaya
 
named_scope more detail - WebCareer
named_scope more detail - WebCareernamed_scope more detail - WebCareer
named_scope more detail - WebCareer
Kyosuke MOROHASHI
 
GDG Tokyo Firebaseを使った Androidアプリ開発
GDG Tokyo Firebaseを使った Androidアプリ開発GDG Tokyo Firebaseを使った Androidアプリ開発
GDG Tokyo Firebaseを使った Androidアプリ開発
Fumihiko Shiroyama
 
AWS SDK for Haskell開発
AWS SDK for Haskell開発AWS SDK for Haskell開発
AWS SDK for Haskell開発
Nomura Yusuke
 
実行時のために最適なデータ構造を作成しよう
実行時のために最適なデータ構造を作成しよう実行時のために最適なデータ構造を作成しよう
実行時のために最適なデータ構造を作成しよう
Hiroki Omae
 
TerraformでECS+ECRする話
TerraformでECS+ECRする話TerraformでECS+ECRする話
TerraformでECS+ECRする話
Satoshi Hirayama
 
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システムMySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
MySQL・PostgreSQLだけで作る高速でリッチな全文検索システム
Kouhei Sutou
 
Firebaseで驚くほど簡単に作れるリアルタイムイベントドリブンアプリ
Firebaseで驚くほど簡単に作れるリアルタイムイベントドリブンアプリFirebaseで驚くほど簡単に作れるリアルタイムイベントドリブンアプリ
Firebaseで驚くほど簡単に作れるリアルタイムイベントドリブンアプリ
Fumihiko Shiroyama
 

Viewers also liked (12)

WikipediaからのSolr用類義語辞書の自動生成
WikipediaからのSolr用類義語辞書の自動生成WikipediaからのSolr用類義語辞書の自動生成
WikipediaからのSolr用類義語辞書の自動生成
Koji Sekiguchi
 
IVS CTO Night and Day Recap - #CTONight 2016 Spring
IVS CTO Night and Day Recap - #CTONight 2016 SpringIVS CTO Night and Day Recap - #CTONight 2016 Spring
IVS CTO Night and Day Recap - #CTONight 2016 Spring
Eiji Shinohara
 
エンジニアの為のAWS実践講座
エンジニアの為のAWS実践講座エンジニアの為のAWS実践講座
エンジニアの為のAWS実践講座
Eiji Shinohara
 
検索技術の活用による広告配信Relevance向上
検索技術の活用による広告配信Relevance向上検索技術の活用による広告配信Relevance向上
検索技術の活用による広告配信Relevance向上
Eiji Shinohara
 
Ad Tech on AWS - IVS CTO Night and Day Spring 2016
Ad Tech on AWS - IVS CTO Night and Day Spring 2016Ad Tech on AWS - IVS CTO Night and Day Spring 2016
Ad Tech on AWS - IVS CTO Night and Day Spring 2016
Eiji Shinohara
 
AWS Summit New York 2016 Recap : AWS Application Load Balancer and Amazon ECS
AWS Summit New York 2016 Recap : AWS Application Load Balancer and Amazon ECSAWS Summit New York 2016 Recap : AWS Application Load Balancer and Amazon ECS
AWS Summit New York 2016 Recap : AWS Application Load Balancer and Amazon ECS
Eiji Shinohara
 
Tips for getting the most out of AWS re:Invent IN ENGLISH
Tips for getting the most out of AWS re:Invent IN ENGLISHTips for getting the most out of AWS re:Invent IN ENGLISH
Tips for getting the most out of AWS re:Invent IN ENGLISH
Eiji Shinohara
 
Accelerating AdTech on AWS #AWSAdTechJP
Accelerating AdTech on AWS #AWSAdTechJPAccelerating AdTech on AWS #AWSAdTechJP
Accelerating AdTech on AWS #AWSAdTechJP
Eiji Shinohara
 
個人的にAmazon EMR5.0.0でSpark 2.0を使ってZeppelinでSQL集計してみる
個人的にAmazon EMR5.0.0でSpark 2.0を使ってZeppelinでSQL集計してみる個人的にAmazon EMR5.0.0でSpark 2.0を使ってZeppelinでSQL集計してみる
個人的にAmazon EMR5.0.0でSpark 2.0を使ってZeppelinでSQL集計してみる
Eiji Shinohara
 
Global AWS AdTech use-cases
Global AWS AdTech use-casesGlobal AWS AdTech use-cases
Global AWS AdTech use-cases
Eiji Shinohara
 
IVS CTO Night and Day Recap - #CTONight 2016 Winter
IVS CTO Night and Day Recap - #CTONight 2016 WinterIVS CTO Night and Day Recap - #CTONight 2016 Winter
IVS CTO Night and Day Recap - #CTONight 2016 Winter
Eiji Shinohara
 
Search Solutions on AWS
Search Solutions on AWSSearch Solutions on AWS
Search Solutions on AWS
Eiji Shinohara
 
WikipediaからのSolr用類義語辞書の自動生成
WikipediaからのSolr用類義語辞書の自動生成WikipediaからのSolr用類義語辞書の自動生成
WikipediaからのSolr用類義語辞書の自動生成
Koji Sekiguchi
 
IVS CTO Night and Day Recap - #CTONight 2016 Spring
IVS CTO Night and Day Recap - #CTONight 2016 SpringIVS CTO Night and Day Recap - #CTONight 2016 Spring
IVS CTO Night and Day Recap - #CTONight 2016 Spring
Eiji Shinohara
 
エンジニアの為のAWS実践講座
エンジニアの為のAWS実践講座エンジニアの為のAWS実践講座
エンジニアの為のAWS実践講座
Eiji Shinohara
 
検索技術の活用による広告配信Relevance向上
検索技術の活用による広告配信Relevance向上検索技術の活用による広告配信Relevance向上
検索技術の活用による広告配信Relevance向上
Eiji Shinohara
 
Ad Tech on AWS - IVS CTO Night and Day Spring 2016
Ad Tech on AWS - IVS CTO Night and Day Spring 2016Ad Tech on AWS - IVS CTO Night and Day Spring 2016
Ad Tech on AWS - IVS CTO Night and Day Spring 2016
Eiji Shinohara
 
AWS Summit New York 2016 Recap : AWS Application Load Balancer and Amazon ECS
AWS Summit New York 2016 Recap : AWS Application Load Balancer and Amazon ECSAWS Summit New York 2016 Recap : AWS Application Load Balancer and Amazon ECS
AWS Summit New York 2016 Recap : AWS Application Load Balancer and Amazon ECS
Eiji Shinohara
 
Tips for getting the most out of AWS re:Invent IN ENGLISH
Tips for getting the most out of AWS re:Invent IN ENGLISHTips for getting the most out of AWS re:Invent IN ENGLISH
Tips for getting the most out of AWS re:Invent IN ENGLISH
Eiji Shinohara
 
Accelerating AdTech on AWS #AWSAdTechJP
Accelerating AdTech on AWS #AWSAdTechJPAccelerating AdTech on AWS #AWSAdTechJP
Accelerating AdTech on AWS #AWSAdTechJP
Eiji Shinohara
 
個人的にAmazon EMR5.0.0でSpark 2.0を使ってZeppelinでSQL集計してみる
個人的にAmazon EMR5.0.0でSpark 2.0を使ってZeppelinでSQL集計してみる個人的にAmazon EMR5.0.0でSpark 2.0を使ってZeppelinでSQL集計してみる
個人的にAmazon EMR5.0.0でSpark 2.0を使ってZeppelinでSQL集計してみる
Eiji Shinohara
 
Global AWS AdTech use-cases
Global AWS AdTech use-casesGlobal AWS AdTech use-cases
Global AWS AdTech use-cases
Eiji Shinohara
 
IVS CTO Night and Day Recap - #CTONight 2016 Winter
IVS CTO Night and Day Recap - #CTONight 2016 WinterIVS CTO Night and Day Recap - #CTONight 2016 Winter
IVS CTO Night and Day Recap - #CTONight 2016 Winter
Eiji Shinohara
 
Search Solutions on AWS
Search Solutions on AWSSearch Solutions on AWS
Search Solutions on AWS
Eiji Shinohara
 
Ad

Similar to Getting Started Japanese Search and Calculate Similarity with Apache Lucene (20)

Tour of distributed systems 1 - ZooKeeper
Tour of distributed systems 1 - ZooKeeperTour of distributed systems 1 - ZooKeeper
Tour of distributed systems 1 - ZooKeeper
Chris Birchall
 
Elastic circle ci-co-webinar-20210127
Elastic circle ci-co-webinar-20210127Elastic circle ci-co-webinar-20210127
Elastic circle ci-co-webinar-20210127
Shotaro Suzuki
 
Hive undocumented feature
Hive undocumented featureHive undocumented feature
Hive undocumented feature
tamtam180
 
AWSとAnsibleで実践!プロビジョニング入門‐Lamp+Laravel-
AWSとAnsibleで実践!プロビジョニング入門‐Lamp+Laravel-AWSとAnsibleで実践!プロビジョニング入門‐Lamp+Laravel-
AWSとAnsibleで実践!プロビジョニング入門‐Lamp+Laravel-
靖 小田島
 
SolrCloud on Amazon ECS
SolrCloud on Amazon ECSSolrCloud on Amazon ECS
SolrCloud on Amazon ECS
Eiji Shinohara
 
Apache Geode の Apache Lucene Integration を試してみた
Apache Geode の Apache Lucene Integration を試してみたApache Geode の Apache Lucene Integration を試してみた
Apache Geode の Apache Lucene Integration を試してみた
Akihiro Kitada
 
APIMeetup 20170329_ichimura
APIMeetup 20170329_ichimuraAPIMeetup 20170329_ichimura
APIMeetup 20170329_ichimura
Tomohiro Ichimura
 
Google Dev Fest 2010 Japan LT: OpenSocial JavaScript API is good, Lightweight...
Google Dev Fest 2010 Japan LT: OpenSocial JavaScript API is good, Lightweight...Google Dev Fest 2010 Japan LT: OpenSocial JavaScript API is good, Lightweight...
Google Dev Fest 2010 Japan LT: OpenSocial JavaScript API is good, Lightweight...
Nobuhiro Nakajima
 
Java ee6 with scala
Java ee6 with scalaJava ee6 with scala
Java ee6 with scala
Satoshi Kubo
 
Gradle布教活動
Gradle布教活動Gradle布教活動
Gradle布教活動
Nemoto Yusuke
 
データカタログソフトウェア CKAN
データカタログソフトウェア CKANデータカタログソフトウェア CKAN
データカタログソフトウェア CKAN
Fumihiro Kato
 
AWS Black Belt Tech シリーズ 2015 - AWS Elastic Beanstalk
AWS Black Belt Tech シリーズ 2015 - AWS Elastic BeanstalkAWS Black Belt Tech シリーズ 2015 - AWS Elastic Beanstalk
AWS Black Belt Tech シリーズ 2015 - AWS Elastic Beanstalk
Amazon Web Services Japan
 
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニックOpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
Etsuji Nakai
 
クラウドオーケストレーション「OpenStack Heat」に迫る!
クラウドオーケストレーション「OpenStack Heat」に迫る!クラウドオーケストレーション「OpenStack Heat」に迫る!
クラウドオーケストレーション「OpenStack Heat」に迫る!
Etsuji Nakai
 
Using Kubernetes on Google Container Engine
Using Kubernetes on Google Container EngineUsing Kubernetes on Google Container Engine
Using Kubernetes on Google Container Engine
Etsuji Nakai
 
Hiveハンズオン
HiveハンズオンHiveハンズオン
Hiveハンズオン
Satoshi Noto
 
CloudStack Ecosystem Day - OpenStack/Swift
CloudStack Ecosystem Day - OpenStack/SwiftCloudStack Ecosystem Day - OpenStack/Swift
CloudStack Ecosystem Day - OpenStack/Swift
irix_jp
 
JOSUG2014 OpenStack 4th birthday party in Japan; the way of OpenStack API Dragon
JOSUG2014 OpenStack 4th birthday party in Japan; the way of OpenStack API DragonJOSUG2014 OpenStack 4th birthday party in Japan; the way of OpenStack API Dragon
JOSUG2014 OpenStack 4th birthday party in Japan; the way of OpenStack API Dragon
Naoto Gohko
 
Real world android akka
Real world android akkaReal world android akka
Real world android akka
Taisuke Oe
 
Tour of distributed systems 1 - ZooKeeper
Tour of distributed systems 1 - ZooKeeperTour of distributed systems 1 - ZooKeeper
Tour of distributed systems 1 - ZooKeeper
Chris Birchall
 
Elastic circle ci-co-webinar-20210127
Elastic circle ci-co-webinar-20210127Elastic circle ci-co-webinar-20210127
Elastic circle ci-co-webinar-20210127
Shotaro Suzuki
 
Hive undocumented feature
Hive undocumented featureHive undocumented feature
Hive undocumented feature
tamtam180
 
AWSとAnsibleで実践!プロビジョニング入門‐Lamp+Laravel-
AWSとAnsibleで実践!プロビジョニング入門‐Lamp+Laravel-AWSとAnsibleで実践!プロビジョニング入門‐Lamp+Laravel-
AWSとAnsibleで実践!プロビジョニング入門‐Lamp+Laravel-
靖 小田島
 
SolrCloud on Amazon ECS
SolrCloud on Amazon ECSSolrCloud on Amazon ECS
SolrCloud on Amazon ECS
Eiji Shinohara
 
Apache Geode の Apache Lucene Integration を試してみた
Apache Geode の Apache Lucene Integration を試してみたApache Geode の Apache Lucene Integration を試してみた
Apache Geode の Apache Lucene Integration を試してみた
Akihiro Kitada
 
Google Dev Fest 2010 Japan LT: OpenSocial JavaScript API is good, Lightweight...
Google Dev Fest 2010 Japan LT: OpenSocial JavaScript API is good, Lightweight...Google Dev Fest 2010 Japan LT: OpenSocial JavaScript API is good, Lightweight...
Google Dev Fest 2010 Japan LT: OpenSocial JavaScript API is good, Lightweight...
Nobuhiro Nakajima
 
Java ee6 with scala
Java ee6 with scalaJava ee6 with scala
Java ee6 with scala
Satoshi Kubo
 
データカタログソフトウェア CKAN
データカタログソフトウェア CKANデータカタログソフトウェア CKAN
データカタログソフトウェア CKAN
Fumihiro Kato
 
AWS Black Belt Tech シリーズ 2015 - AWS Elastic Beanstalk
AWS Black Belt Tech シリーズ 2015 - AWS Elastic BeanstalkAWS Black Belt Tech シリーズ 2015 - AWS Elastic Beanstalk
AWS Black Belt Tech シリーズ 2015 - AWS Elastic Beanstalk
Amazon Web Services Japan
 
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニックOpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
OpenStackをさらに”使う”技術 - OpenStack&Docker活用テクニック
Etsuji Nakai
 
クラウドオーケストレーション「OpenStack Heat」に迫る!
クラウドオーケストレーション「OpenStack Heat」に迫る!クラウドオーケストレーション「OpenStack Heat」に迫る!
クラウドオーケストレーション「OpenStack Heat」に迫る!
Etsuji Nakai
 
Using Kubernetes on Google Container Engine
Using Kubernetes on Google Container EngineUsing Kubernetes on Google Container Engine
Using Kubernetes on Google Container Engine
Etsuji Nakai
 
Hiveハンズオン
HiveハンズオンHiveハンズオン
Hiveハンズオン
Satoshi Noto
 
CloudStack Ecosystem Day - OpenStack/Swift
CloudStack Ecosystem Day - OpenStack/SwiftCloudStack Ecosystem Day - OpenStack/Swift
CloudStack Ecosystem Day - OpenStack/Swift
irix_jp
 
JOSUG2014 OpenStack 4th birthday party in Japan; the way of OpenStack API Dragon
JOSUG2014 OpenStack 4th birthday party in Japan; the way of OpenStack API DragonJOSUG2014 OpenStack 4th birthday party in Japan; the way of OpenStack API Dragon
JOSUG2014 OpenStack 4th birthday party in Japan; the way of OpenStack API Dragon
Naoto Gohko
 
Real world android akka
Real world android akkaReal world android akka
Real world android akka
Taisuke Oe
 
Ad

More from Eiji Shinohara (17)

Indexing with Algolia Ruby API Client
Indexing with Algolia Ruby API ClientIndexing with Algolia Ruby API Client
Indexing with Algolia Ruby API Client
Eiji Shinohara
 
Getting Started Algolia with InstantSearch.js
Getting Started Algolia with InstantSearch.jsGetting Started Algolia with InstantSearch.js
Getting Started Algolia with InstantSearch.js
Eiji Shinohara
 
Algolia introduction in Kanazawa - July 2019
Algolia introduction in Kanazawa - July 2019Algolia introduction in Kanazawa - July 2019
Algolia introduction in Kanazawa - July 2019
Eiji Shinohara
 
Scalable and Cost Effective Systems Architecture on AWS
Scalable and Cost Effective Systems Architecture on AWSScalable and Cost Effective Systems Architecture on AWS
Scalable and Cost Effective Systems Architecture on AWS
Eiji Shinohara
 
#AWSAdTechJP
#AWSAdTechJP#AWSAdTechJP
#AWSAdTechJP
Eiji Shinohara
 
Accelerating AdTech on AWS in Japan
Accelerating AdTech on AWS in JapanAccelerating AdTech on AWS in Japan
Accelerating AdTech on AWS in Japan
Eiji Shinohara
 
AWS Summit New York 2017 Keynote Recap
AWS Summit New York 2017 Keynote RecapAWS Summit New York 2017 Keynote Recap
AWS Summit New York 2017 Keynote Recap
Eiji Shinohara
 
#CTONight powered by AWS
#CTONight powered by AWS#CTONight powered by AWS
#CTONight powered by AWS
Eiji Shinohara
 
AWS Summit San Francisco 2017 Werner Vogelsによる基調講演を徹底紹介
AWS Summit San Francisco 2017 Werner Vogelsによる基調講演を徹底紹介AWS Summit San Francisco 2017 Werner Vogelsによる基調講演を徹底紹介
AWS Summit San Francisco 2017 Werner Vogelsによる基調講演を徹底紹介
Eiji Shinohara
 
[要約] Building a Real-Time Bidding Platform on AWS #AWSAdTechJP
[要約] Building a Real-Time Bidding Platform on AWS #AWSAdTechJP[要約] Building a Real-Time Bidding Platform on AWS #AWSAdTechJP
[要約] Building a Real-Time Bidding Platform on AWS #AWSAdTechJP
Eiji Shinohara
 
Scaling on AWS - Feb 2016
Scaling on AWS - Feb 2016Scaling on AWS - Feb 2016
Scaling on AWS - Feb 2016
Eiji Shinohara
 
AWS Search Services
AWS Search ServicesAWS Search Services
AWS Search Services
Eiji Shinohara
 
Application Deployment on AWS
Application Deployment on AWSApplication Deployment on AWS
Application Deployment on AWS
Eiji Shinohara
 
AWS Startup Use Cases 2015
AWS Startup Use Cases 2015AWS Startup Use Cases 2015
AWS Startup Use Cases 2015
Eiji Shinohara
 
AWS Startup Tech Lightning Talks 2015 Summer at dots.
AWS Startup Tech Lightning Talks 2015 Summer at dots.AWS Startup Tech Lightning Talks 2015 Summer at dots.
AWS Startup Tech Lightning Talks 2015 Summer at dots.
Eiji Shinohara
 
(Best) practices for working globally in IT industry - DMM.Study Night
(Best) practices for working globally in IT industry - DMM.Study Night(Best) practices for working globally in IT industry - DMM.Study Night
(Best) practices for working globally in IT industry - DMM.Study Night
Eiji Shinohara
 
Bay Area Startup Report - IVS CTO Night & Day in Miyazaki
Bay Area Startup Report - IVS CTO Night & Day in MiyazakiBay Area Startup Report - IVS CTO Night & Day in Miyazaki
Bay Area Startup Report - IVS CTO Night & Day in Miyazaki
Eiji Shinohara
 
Indexing with Algolia Ruby API Client
Indexing with Algolia Ruby API ClientIndexing with Algolia Ruby API Client
Indexing with Algolia Ruby API Client
Eiji Shinohara
 
Getting Started Algolia with InstantSearch.js
Getting Started Algolia with InstantSearch.jsGetting Started Algolia with InstantSearch.js
Getting Started Algolia with InstantSearch.js
Eiji Shinohara
 
Algolia introduction in Kanazawa - July 2019
Algolia introduction in Kanazawa - July 2019Algolia introduction in Kanazawa - July 2019
Algolia introduction in Kanazawa - July 2019
Eiji Shinohara
 
Scalable and Cost Effective Systems Architecture on AWS
Scalable and Cost Effective Systems Architecture on AWSScalable and Cost Effective Systems Architecture on AWS
Scalable and Cost Effective Systems Architecture on AWS
Eiji Shinohara
 
Accelerating AdTech on AWS in Japan
Accelerating AdTech on AWS in JapanAccelerating AdTech on AWS in Japan
Accelerating AdTech on AWS in Japan
Eiji Shinohara
 
AWS Summit New York 2017 Keynote Recap
AWS Summit New York 2017 Keynote RecapAWS Summit New York 2017 Keynote Recap
AWS Summit New York 2017 Keynote Recap
Eiji Shinohara
 
#CTONight powered by AWS
#CTONight powered by AWS#CTONight powered by AWS
#CTONight powered by AWS
Eiji Shinohara
 
AWS Summit San Francisco 2017 Werner Vogelsによる基調講演を徹底紹介
AWS Summit San Francisco 2017 Werner Vogelsによる基調講演を徹底紹介AWS Summit San Francisco 2017 Werner Vogelsによる基調講演を徹底紹介
AWS Summit San Francisco 2017 Werner Vogelsによる基調講演を徹底紹介
Eiji Shinohara
 
[要約] Building a Real-Time Bidding Platform on AWS #AWSAdTechJP
[要約] Building a Real-Time Bidding Platform on AWS #AWSAdTechJP[要約] Building a Real-Time Bidding Platform on AWS #AWSAdTechJP
[要約] Building a Real-Time Bidding Platform on AWS #AWSAdTechJP
Eiji Shinohara
 
Scaling on AWS - Feb 2016
Scaling on AWS - Feb 2016Scaling on AWS - Feb 2016
Scaling on AWS - Feb 2016
Eiji Shinohara
 
Application Deployment on AWS
Application Deployment on AWSApplication Deployment on AWS
Application Deployment on AWS
Eiji Shinohara
 
AWS Startup Use Cases 2015
AWS Startup Use Cases 2015AWS Startup Use Cases 2015
AWS Startup Use Cases 2015
Eiji Shinohara
 
AWS Startup Tech Lightning Talks 2015 Summer at dots.
AWS Startup Tech Lightning Talks 2015 Summer at dots.AWS Startup Tech Lightning Talks 2015 Summer at dots.
AWS Startup Tech Lightning Talks 2015 Summer at dots.
Eiji Shinohara
 
(Best) practices for working globally in IT industry - DMM.Study Night
(Best) practices for working globally in IT industry - DMM.Study Night(Best) practices for working globally in IT industry - DMM.Study Night
(Best) practices for working globally in IT industry - DMM.Study Night
Eiji Shinohara
 
Bay Area Startup Report - IVS CTO Night & Day in Miyazaki
Bay Area Startup Report - IVS CTO Night & Day in MiyazakiBay Area Startup Report - IVS CTO Night & Day in Miyazaki
Bay Area Startup Report - IVS CTO Night & Day in Miyazaki
Eiji Shinohara
 

Getting Started Japanese Search and Calculate Similarity with Apache Lucene

  • 1. Getting Started Japanese Search and Calculate Similarity with Apache Lucene May 2016 Eiji Shinohara
  • 2. Name: Eiji Shinohara / 篠原 英治 / @shinodogg Role: AWS Solutions Architect Subject Matter Expert ・Amazon CloudSearch ・Amazon Elasticsearch Service Who am I?
  • 3. Which Search Engine/Service do you use? • Apache Solr • Elasticsearch • Amazon CloudSearch • Amazon Elasticsearch Service
  • 4. On top of Apache Lucene • Apache Solr • Elasticsearch • Amazon CloudSearch • Amazon Elasticsearch Service
  • 5. Have you used Apache Lucene? •Apache Lucene is a free and open- source information retrieval software library, originally written in Java by Doug Cutting. •It is supported by theApache Software Foundation and is released under the Apache Software License. https://en.wikipedia.org/wiki/Lucene
  • 6. Doug Cutting – Hadoop/Nutch/Lucene •Hadoop: MapReduce •The name my kid gave a stuffed yellow elephant. •Nutch: Crawler •Nutch was the way my oldest son when he was two, I think it came from lunch. •Lucene: Search •Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name. http://www.mwsoft.jp/programming/hadoop/where_come_from.html
  • 7. Doug Cutting – Hadoop/Nutch/Lucene • Hadoop: MapReduce • The name my kid gave a stuffed yellow elephant. • Nutch: Crawler • Nutch was a word my oldest son used when he was two; I think it came from "lunch". • Lucene: Search • Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name. http://www.mwsoft.jp/programming/hadoop/where_come_from.html • Maybe the most proper naming :)
  • 8. Apache Lucene • Full-text search • Easy to use http://www.lucenetutorial.com/lucene-in-5-minutes.html
  • 9. Apache Lucene • Full-text search • Easy to use 1. Index • new Document → addDocument → commit 2. Query • Generate the query string 3. Search • Search and fetch the matching documents 4. Display • Get the contents of the fetched documents to show them http://www.lucenetutorial.com/lucene-in-5-minutes.html
  • 10. Evernote and LinkedIn are using Lucene • with their own thin HTTP wrapper • Presentations at Lucene Solr Revolution 2014 https://www.youtube.com/watch?v=drOmahIie6c https://www.youtube.com/watch?v=8O7cF75intk
  • 11. Build your own search engine? • Some companies are doing that http://www.slideshare.net/lucidworks/galene-linkedins-search-architecture-presented-by-diego-buthay-sriram-sankar-linkedin/8
  • 12. I'll join Lucene Solr Revolution 2016
  • 13. Apache Lucene入門 (Introduction to Apache Lucene), in Japanese http://rondhuit.com/lucene-for-bea-060710.pdf http://www.amazon.co.jp/dp/4774127809
  • 15. Uchida-san's blog, in Japanese http://mocobeta-backup.tumblr.com/post/54371099587/lucene-in-action
  • 17. Lucene in Action chap5: Term Vector (2) Calculate Document Similarity http://mocobeta-backup.tumblr.com/post/49779999073/
  • 18. Lucene in Action chap5: Term Vector (2) Calculate Document Similarity • Just tried running it on my local MacBook Air :) • Created 2 classes • Indexer: indexes some documents • CalculationSimilarityTester: compares 2 documents and calculates their cosine similarity • Using Luke for browsing the index • https://github.com/DmitryKey/luke • Uchida-san is also a Luke committer
  • 19. Lucene 6.0 • I had a Lucene 5.5 environment, but got this error: • Invalid directory at the location, check console for more information. Last exception: • java.lang.IllegalArgumentException: Could not load codec 'Lucene60'. Did you forget to add lucene-backward-codecs.jar?
  • 20. Lucene 6.0 • So I created a new Maven project • pom.xml
      <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>6.0.0</version>
      </dependency>
      <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>6.0.0</version>
      </dependency>
      <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>6.0.0</version>
      </dependency>
      <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-kuromoji</artifactId>
        <version>6.0.0</version>
      </dependency>
  • 21. Indexer • Read file -> add Document -> commit
      public class Indexer {
        public static void main(String args[]) throws IOException {
          // JapaneseAnalyzer (kuromoji) tokenizes the Japanese text
          Analyzer analyzer = new JapaneseAnalyzer();
          // ... (omitted) ...
          File[] files = new File("/Users/xxx/lucene_test/docs/").listFiles();
          for (File file : files) {
            Document doc = new Document();
            // ... (omitted) ...
            FieldType contentsType = new FieldType();
            contentsType.setStored(true);
            contentsType.setTokenized(true);
            contentsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            // store term vectors so they can be read back per document later
            contentsType.setStoreTermVectors(true);
            // ... (omitted) ...
            doc.add(new Field("contents", sb.toString(), contentsType));
            writer.addDocument(doc);
          }
          writer.commit();
          writer.close();
        }
      }
  • 22. Indexer • Files • Found examples on the internet :) • http://www.pahoo.org/e-soul/webtech/php06/php06-21-01.shtm • Two similar Japanese descriptions of PHP:
      PHP: Hypertext Preprocessor(ピー・エイチ・ピー ハイパーテキスト プリプロセッサー)とは、動的に HTML データを生成することによって、動的なウェブページを実現することを主な目的としたプログラミング言語、およびその言語処理系である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語として分類される。この言語処理系自体は、C言語で記述されている。
      PHP(Hypertext Preprocessor;ピー・エイチ・ピー)とは、動的に HTML データを生成することによって、動的なウェブページを実現すること目的としたプログラミング言語である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語の一種で、処理系自体は C言語で記述されている。
  • 23. Indexer • Files • Found examples on the internet :) • http://www.fisproject.jp/2015/01/cosine_similarity/
      A Cat sat on the mat.
      Cats are sitting on the mat.
    • Exactly the same text, twice:
      人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
      人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
  • 27. Luke • Index browsing • $ mvn package • $ ./luke.sh
  • 30. Calculate Document Similarity • mocobeta/CalcCosineSimilarityTest.java • https://gist.github.com/mocobeta/5525864 • Search the documents from the index • Get TF-IDF from the term vector • TF-IDF: how important a word is to a document in a collection or corpus • TF: how frequently a term occurs in a document • IDF: a measure of the rareness of a term • Get the cosine similarity • The result is returned as an angle (the arccosine of the cosine), so a lower value means more similar
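The TF and IDF definitions above can be tried without Lucene. The following is a minimal sketch, not the formula used in the gist (the slides omit that part): it assumes TF = term count / total terms in the document and IDF = ln(N / df) + 1, and it assumes the document being scored is itself part of the corpus. The class and method names (`TfIdfSketch`, `tfIdfVector`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TfIdfSketch {

    /** Builds a term -> TF-IDF map for one tokenized document that belongs to the corpus. */
    static Map<String, Double> tfIdfVector(List<String> doc, List<List<String>> corpus) {
        // Term count (tc): how often each term occurs in this document.
        Map<String, Integer> counts = new HashMap<>();
        for (String term : doc) {
            counts.merge(term, 1, Integer::sum);
        }
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Document frequency (df): number of documents that contain the term.
            int df = 0;
            for (List<String> other : corpus) {
                if (other.contains(e.getKey())) {
                    df++;
                }
            }
            double tf = (double) e.getValue() / doc.size();
            // Assumed IDF variant: the "+ 1" keeps terms found in every document above zero.
            double idf = Math.log((double) corpus.size() / df) + 1.0;
            vector.put(e.getKey(), tf * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("cat", "sat", "on", "the", "mat"),
                Arrays.asList("cats", "are", "sitting", "on", "the", "mat"));
        // Rare terms such as "cat" get a higher weight than shared terms such as "on".
        System.out.println(tfIdfVector(corpus.get(0), corpus));
    }
}
```

In the Lucene version, the term count and document frequency come from the stored term vector and the index reader instead of being recounted from raw tokens.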
  • 31. Calculate Document Similarity
      public class CalcCosineSimilarityTester {
        public static void main(String args[]) throws IOException {
          // ... (omitted) ...
          // look up each document by its "path" field and build its TF-IDF vector
          TopDocs hits = searcher.search(new TermQuery(new Term("path", path1)), 1);
          int docId1 = hits.scoreDocs[0].doc;
          Map<String, Double> map1 = buildDocumentVector(docId1);
          hits = searcher.search(new TermQuery(new Term("path", path2)), 1);
          int docId2 = hits.scoreDocs[0].doc;
          Map<String, Double> map2 = buildDocumentVector(docId2);
          System.out.println(computeAngle(map1, map2));
        }
        // create HashMap(Key:Keyword, Value:TF-IDF) for each document
        private Map<String, Double> buildDocumentVector(int docId) { /* ... (omitted) ... */ }
        // calculate cosine similarity
        private double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) { /* ... (omitted) ... */ }
      }
  • 32. Calculate Document Similarity
      private Map<String, Double> buildDocumentVector(int docId) throws IOException {
        Terms vector = reader.getTermVector(docId, "contents");
        // ... (omitted) ...
        // get TF-IDF from Term Vector
        TermsEnum itr = vector.iterator();
        // ... (omitted) ...
        while ((ref = itr.next()) != null) {
          String term = ref.utf8ToString();
          TermFreq freq = new TermFreq(term, maxDoc);
          // tc: occurrences of the term in this document's term vector
          freq.setTc(itr.totalTermFreq());
          // df: number of documents in the index that contain the term
          freq.setDf(reader.docFreq(new Term("contents", term)));
          list.add(freq);
          tcSum += itr.totalTermFreq();
        }
        // Build HashMap Key:Keyword, Value:TF-IDF
        Map<String, Double> docVector = new HashMap<String, Double>();
        for (TermFreq freq : list) {
          // ... (omitted) ...
        }
        return docVector;
      }
  • 33. Calculate Document Similarity
      private double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) {
        double dotProduct = 0; // inner product
        for (String term : vec1.keySet()) {
          if (vec2.containsKey(term)) {
            dotProduct += vec1.get(term) * vec2.get(term);
          }
        }
        double denominator = getNorm(vec1) * getNorm(vec2);
        double ratio = dotProduct / denominator; // cosine value
        return Math.acos(ratio);
      }
      private double getNorm(Map<String, Double> vec) {
        double sumOfSquares = 0;
        for (Double val : vec.values()) {
          sumOfSquares += val * val;
        }
        return Math.sqrt(sumOfSquares);
      }
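To experiment with the angle calculation outside Lucene, here is a self-contained sketch of the same idea. It differs from the slide in two ways that are my own assumptions, not the gist's behavior: the cosine is clamped to [-1, 1] because floating-point rounding can push it slightly above 1 (where `Math.acos` returns NaN), and an empty or all-zero vector is treated as orthogonal. The class name `CosineAngleSketch` and the sample weights are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class CosineAngleSketch {

    /** Returns the angle in radians between two sparse term-weight vectors. */
    static double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) {
        double dotProduct = 0;
        // Only terms present in both vectors contribute to the inner product.
        for (Map.Entry<String, Double> e : vec1.entrySet()) {
            Double other = vec2.get(e.getKey());
            if (other != null) {
                dotProduct += e.getValue() * other;
            }
        }
        double denominator = norm(vec1) * norm(vec2);
        if (denominator == 0) {
            // Assumption: an empty or all-zero vector is treated as orthogonal.
            return Math.PI / 2;
        }
        double cosine = dotProduct / denominator;
        // Clamp so that rounding errors never make acos return NaN.
        cosine = Math.max(-1.0, Math.min(1.0, cosine));
        return Math.acos(cosine);
    }

    /** Euclidean length of the vector. */
    static double norm(Map<String, Double> vec) {
        double sumOfSquares = 0;
        for (double val : vec.values()) {
            sumOfSquares += val * val;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("cat", 1.0);
        a.put("sat", 1.0);
        a.put("mat", 1.0);
        Map<String, Double> b = new HashMap<>();
        b.put("dog", 2.0);
        System.out.println(computeAngle(a, a)); // identical vectors: angle at or very near 0
        System.out.println(computeAngle(a, b)); // no shared terms: pi/2
    }
}
```

With non-negative weights such as TF-IDF, the angle stays between 0 (same direction) and pi/2 (no terms in common).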
  • 34. Calculate Document Similarity • result • 0.5000430658877127
      PHP: Hypertext Preprocessor(ピー・エイチ・ピー ハイパーテキスト プリプロセッサー)とは、動的に HTML データを生成することによって、動的なウェブページを実現することを主な目的としたプログラミング言語、およびその言語処理系である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語として分類される。この言語処理系自体は、C言語で記述されている。
      PHP(Hypertext Preprocessor;ピー・エイチ・ピー)とは、動的に HTML データを生成することによって、動的なウェブページを実現すること目的としたプログラミング言語である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語の一種で、処理系自体は C言語で記述されている。
  • 35. Calculate Document Similarity • result • 1.2734113128621865
      A Cat sat on the mat.
      Cats are sitting on the mat.
  • 36. Calculate Document Similarity • result • 0.0
      人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
      人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
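The three results above are angles in radians, since `computeAngle` returns `Math.acos` of the cosine. As a small illustration (the class name `AngleToCosine` is hypothetical), they can be converted back to cosine similarities, where higher means more similar: roughly 0.88 for the two PHP descriptions, roughly 0.29 for the two English sentences, and 1.0 for the identical Japanese sentences.

```java
class AngleToCosine {

    /** Converts an angle in radians back to a cosine similarity. */
    static double toCosine(double angleRadians) {
        return Math.cos(angleRadians);
    }

    public static void main(String[] args) {
        // Angles reported on slides 34, 35 and 36.
        double[] angles = {0.5000430658877127, 1.2734113128621865, 0.0};
        for (double angle : angles) {
            System.out.printf("angle=%.4f rad (%.1f deg) cosine=%.4f%n",
                    angle, Math.toDegrees(angle), toCosine(angle));
        }
    }
}
```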
  • 37. Lucene 6.0 • A bunch of changes
  • 38. Lucene 6.0 • N-best • LUCENE-6837: Add N-best output capability to JapaneseTokenizer
  • 39. N-best • Contribution from Yahoo! Japan http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest
  • 42. Nihongo Muzukashii-ne… (Japanese is difficult, isn't it) • Need to analyze more, or maintain dictionaries? http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest
  • 43. Nihongo Muzukashii-ne… (Japanese is difficult, isn't it) • Doesn't hit with “一眼レフ” (single-lens reflex)? http://blog.yoslab.com/entry/2014/09/12/005207
  • 44. N-best • Seems cool :) • I'm going to try… http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest