Text Classification Powered by Apache Mahout and Lucene

Text classification
With Apache Mahout and Lucene

Isabel Drost-Fromm

Software Engineer at Nokia Maps*
Member of the Apache Software Foundation
Co-Founder of Berlin Buzzwords and
Berlin Apache Hadoop GetTogether
Co-founder of Apache Mahout

*We are hiring, talk to me or mail careers@here.com

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

… provide your own success story online.

January 8, 2008 by Pink Sherbet Photography
http://www.flickr.com/photos/pinksherbet/2177961471/

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Image by jasondevilla
http://www.flickr.com/photos/jasondv/91960897/

How a linear classifier sees data

Image by ZapTheDingbat (Light meter)
http://www.flickr.com/photos/zapthedingbat/3028168415

Instance*
(sometimes also called example, item, or in databases a row)

Feature*
(sometimes also called attribute, signal, predictor, co-variate, or column in databases)

Label*
(sometimes also called class, target variable)

Image taken in Lisbon/ Portugal.

● Remove noise.
● Convert text to vectors.

Text consists of terms and phrases.

Encoding issues?
Chinese? Japanese?
“New York” vs. new York?
“go” vs. “going” vs. “went” vs. “gone”?
“go” vs. “Go”?
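
To make these questions concrete, here is a minimal sketch of a tokenizer in plain Python. It is only a stand-in for a real Lucene Analyzer, and the function name `tokenize` is an assumption made for this illustration. It handles Unicode normalization and case, but deliberately does nothing about phrases, stemming, or languages written without spaces.

```python
import re
import unicodedata


def tokenize(text):
    """Normalize Unicode, fold case, and split into word tokens."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    return re.findall(r"\w+", normalized)


# Case differences disappear, but the phrase "New York" is split into two terms.
print(tokenize("Go to New York"))
# Without stemming, the inflected forms stay four distinct terms.
print(tokenize("go going went gone"))
```

Treating "New York" as one term or mapping "went" to "go" needs extra steps such as phrase detection and stemming, which this sketch leaves out.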

Now we have terms – how to turn them into vectors?

If we looked at two phrases only:
Sunny weather

High performance computing

Binary bag of words
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Entry in vector is one if the word occurs in the text:
  $b_{i,j} = \begin{cases} 1 & \text{if } x_i \in d_j \\ 0 & \text{else} \end{cases}$
● Problem:
  – How to know all possible terms in unknown text?
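
A minimal sketch of the binary bag-of-words encoding in plain Python, assuming whitespace tokenization. The helper names `build_vocabulary` and `binary_bag_of_words` are made up for this example. It also shows the problem named on the slide: a term that was not in the vocabulary has no dimension and is lost.

```python
def build_vocabulary(texts):
    """Assign one dimension index to every distinct token in the known texts."""
    vocabulary = {}
    for text in texts:
        for token in text.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def binary_bag_of_words(text, vocabulary):
    """Return a 0/1 vector: entry i is 1 if the term of dimension i occurs in text."""
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        index = vocabulary.get(token)
        if index is not None:  # terms outside the vocabulary are dropped
            vector[index] = 1
    return vector


vocabulary = build_vocabulary(["sunny weather", "high performance computing"])
print(binary_bag_of_words("sunny sunny weather", vocabulary))  # [1, 1, 0, 0, 0]
print(binary_bag_of_words("quantum computing", vocabulary))    # [0, 0, 0, 0, 1]
```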

Term Frequency
● Entry in vector equal to the word's frequency:
  $b_{i,j} = n_{i,j}$
● Problem:
  – Common words dominate vectors.
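
As an illustration, term-frequency vectors can be computed with a counter. This is a sketch under the assumption of whitespace tokenization and a fixed, hand-picked vocabulary; the function name is hypothetical.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Entry i is the number of times vocabulary[i] occurs in text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


vocabulary = ["the", "weather", "is", "sunny"]
tf = term_frequency_vector("the weather is the weather the", vocabulary)
print(tf)  # [3, 2, 1, 0]
```

The largest entry belongs to "the", which says nothing about the topic of the text.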

TF with stop wording
● Filter stopwords.
● Entry in vector equal to the word's frequency:
  $b_{i,j} = n_{i,j}$
● Problem:
  – Common and uncommon words with same weight.
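
A small sketch of stop word filtering before counting. The stop word list here is a tiny illustrative assumption, not the list shipped with any particular analyzer.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption for this example).
STOP_WORDS = {"the", "is", "a", "of", "and"}


def tf_without_stop_words(text):
    """Count term frequencies after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


tf = tf_without_stop_words("The weather is the weather of the day")
print(tf)
```

Every surviving term still counts the same per occurrence, whether it is rare in the corpus or appears in almost every text.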

TF-IDF
● Filter stopwords.
● Entry in vector equal to the weighted frequency:
  $b_{i,j} = n_{i,j} \times \log\left(\frac{|D|}{|\{d : t_i \in d\}|}\right)$
● Problem:
  – Long texts get larger values.
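
The weighting on this slide can be sketched in a few lines, assuming whitespace tokenization, the natural logarithm, and no stop word filtering. The function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document, weight = n * log(|D| / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: n * math.log(n_docs / df[t]) for t, n in counts.items()})
    return vectors


docs = [
    "sunny weather today",
    "rainy weather today",
    "high performance computing today",
]
vectors = tf_idf_vectors(docs)
print(vectors[0])
```

A term that occurs in every document, here "today", gets weight zero, while a term unique to one document gets the largest weight. The values still grow with raw counts, so longer texts produce larger entries unless the vectors are normalized.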

Hashed feature vectors
● Each word in texts = hashed to one dimension.
● Entry in vector set to one if a word hashed to it.
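
A minimal sketch of feature hashing in plain Python. The vector size of 16 and the use of MD5 as the hash function are assumptions chosen for the example, not what Mahout uses internally.

```python
import hashlib


def hashed_binary_vector(text, dimensions=16):
    """Hash every token to one of `dimensions` slots and set that slot to 1."""
    vector = [0] * dimensions
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dimensions
        vector[index] = 1
    return vector


print(hashed_binary_vector("sunny weather"))
print(hashed_binary_vector("never seen before"))
```

No vocabulary is needed, so unknown words in new text still land in some dimension. The price is that two different words can collide in the same slot, more often the smaller the vector.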

HTML → Apache Tika → Fulltext → Lucene Analyzer → Tokenstream+x → FeatureVectorEncoder → Vector → Online Learner → Model
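
The last stages of this pipeline (tokens to vector to online learner to model) can be imitated in plain Python. This is a toy sketch and not Mahout code: `encode` and `TinyOnlineLearner` are hypothetical names, whitespace splitting stands in for the analyzer, and the learner is a simple logistic regression trained with stochastic gradient descent on a made-up four-document training set.

```python
import hashlib
import math

DIMENSIONS = 2 ** 16  # size of the hashed feature space (an arbitrary choice)


def encode(text):
    """Tokenize on whitespace and hash each token to the index of an active dimension."""
    indices = set()
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        indices.add(int.from_bytes(digest[:4], "big") % DIMENSIONS)
    return indices


class TinyOnlineLearner:
    """Logistic regression that is updated one instance at a time."""

    def __init__(self, learning_rate=0.5):
        self.weights = [0.0] * DIMENSIONS
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, indices):
        """Return the estimated probability of the positive label."""
        score = self.bias + sum(self.weights[i] for i in indices)
        return 1.0 / (1.0 + math.exp(-score))

    def train(self, label, indices):
        """One SGD step on the log-likelihood for a single labeled instance."""
        step = self.learning_rate * (label - self.predict(indices))
        self.bias += step
        for i in indices:
            self.weights[i] += step


# Made-up training data: label 1 = weather, label 0 = computing.
training_data = [
    (1, "sunny weather forecast"),
    (1, "rain and wind tomorrow"),
    (0, "high performance computing"),
    (0, "parallel cluster computing"),
]

learner = TinyOnlineLearner()
for _ in range(50):  # several passes over the tiny training set
    for label, text in training_data:
        learner.train(label, encode(text))

print(learner.predict(encode("sunny weather forecast")))
print(learner.predict(encode("high performance computing")))
```

Because the learner sees one instance at a time, the training data never has to fit in memory at once.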

Goals
● Did I use the best model parameters?
  Tune model parameters, experiment with tokenization, experiment with vector encoding.
● How well will my model perform in the wild?
  Compute expected performance.

Performance
● Use same data for training and testing. DON'T
● Problem:
  – Highly optimistic.
  – Model generalization unknown.

Performance
● Use just a fraction for training.
● Set some data aside for testing.
● Problems:
  – Pessimistic predictor: Not all data used for training.
  – Result may depend on which data was set aside.
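
A hold-out split can be sketched as follows. The 20% test fraction and the fixed seed are assumptions for the example, and the function name is hypothetical.

```python
import random


def train_test_split(instances, test_fraction=0.2, seed=42):
    """Shuffle the instances and set a fraction of them aside for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(10))
print(len(train), len(test))  # 8 2
```

Changing the seed changes which instances end up in the test set, which is exactly why the measured result may depend on the split.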

Performance
● Partition your data into n fractions.
● Each fraction set aside for testing in turn.
● Problem:
  – Still a pessimistic predictor.
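
The scheme on this slide is n-fold cross-validation. A minimal sketch, assuming the instances are already in random order; `k_fold_splits` and `cross_validate` are hypothetical helper names, and `train_and_score` is whatever function trains a model on one part and scores it on the other.

```python
def k_fold_splits(instances, n_folds):
    """Yield (train, test) pairs; each instance is in the test part exactly once."""
    instances = list(instances)
    for fold in range(n_folds):
        test = [x for i, x in enumerate(instances) if i % n_folds == fold]
        train = [x for i, x in enumerate(instances) if i % n_folds != fold]
        yield train, test


def cross_validate(instances, n_folds, train_and_score):
    """Average the score returned by train_and_score(train, test) over all folds."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_splits(instances, n_folds)]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(range(10), n_folds=5))
for train, test in splits:
    print(test, train)
```

Each model is still trained on only part of the data (here 8 of 10 instances), so the estimate stays somewhat pessimistic.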

Performance
● Set some data aside for tuning and testing. DON'T
● Problems:
  – Highly optimistic.
  – Parameters manually tuned to testing data.

Performance
● Set some data aside for tuning.
● Set another set of data aside for testing.
● Problems:
  – Pretty pessimistic as not all data is used.
  – May depend on which data was set aside.
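
A sketch of this three-way protocol: parameters are chosen on the tuning set only, and the test set is touched exactly once at the end. All names, the split fractions, and the toy "model" (a plain decision threshold) are assumptions made for illustration.

```python
import random


def three_way_split(instances, tune_fraction=0.2, test_fraction=0.2, seed=7):
    """Shuffle, then cut the data into training, tuning, and testing parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_tune = int(len(shuffled) * tune_fraction)
    n_test = int(len(shuffled) * test_fraction)
    tune = shuffled[:n_tune]
    test = shuffled[n_tune:n_tune + n_test]
    train = shuffled[n_tune + n_test:]
    return train, tune, test


def select_and_evaluate(candidates, fit, score, train, tune, test):
    """Pick the parameter that scores best on tune, then score it once on test."""
    best = max(candidates, key=lambda p: score(fit(train, p), tune))
    return best, score(fit(train, best), test)


# Toy problem: the label is True exactly when x >= 50.
data = [(x, x >= 50) for x in range(100)]
train, tune, test = three_way_split(data)


def fit(train, threshold):
    """Toy 'training': the model is just the threshold parameter itself."""
    return threshold


def score(model, instances):
    """Accuracy of predicting True whenever x >= model."""
    return sum((x >= model) == label for x, label in instances) / len(instances)


best, test_accuracy = select_and_evaluate([50, 20, 80], fit, score, train, tune, test)
print(best, test_accuracy)  # 50 1.0
```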

[Figure: confusion matrix. Columns: correct prediction negative / positive. Rows: model prediction negative / positive.]

Accuracy
$ACC = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{false positive} + \text{false negative} + \text{true negative}}$
● Problems:
  – What if class distribution is skewed?
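
A short worked example of the skew problem, using made-up counts: 990 negative and 10 positive instances, and a model that always predicts "negative".

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# The model never predicts "positive": all 10 positives become false negatives.
always_negative = accuracy(tp=0, fp=0, fn=10, tn=990)
print(always_negative)  # 0.99
```

An accuracy of 99% looks excellent, although the model does not find a single positive instance.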

Precision/ Recall
$\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$
$\text{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$
● Problem:
  – Depends on decision threshold.
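
The threshold dependence can be seen in a small sketch. The scores and labels below are made up, and the function name is hypothetical. An instance is predicted positive when its score is at or above the threshold.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]

print(precision_recall(scores, labels, threshold=0.7))  # precision 1.0, recall 2/3
print(precision_recall(scores, labels, threshold=0.3))  # precision 0.75, recall 1.0
```

The same model trades precision for recall just by moving the threshold, so a single pair of numbers does not describe it fully.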

ROC Curves
[Figure: ROC curve. Axes: true orange rate vs. false orange rate.]

AUC – area under ROC
[Figure: area under the ROC curve. Axes: true orange rate vs. false orange rate.]
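
An ROC curve is obtained by sweeping the decision threshold over all scores and recording the true and false positive rates. The sketch below uses made-up scores, assumes there are no tied scores, and computes the area with the trapezoidal rule; the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, best score first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve through the given points (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]
points = roc_points(scores, labels)
print(points)
print(auc(points))  # 5/6: five of the six positive/negative pairs are ranked correctly
```

Unlike precision and recall, the area does not depend on any single threshold.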

Photo taken by fras1977
http://www.flickr.com/photos/fras/4992313333/

Image by Medienmagazin pro
http://www.flickr.com/photos/medienmagazinpro/6266643422

http://www.flickr.com/photos/generated/943078008/

Apache Hadoop-ready
Recommendations/ Collaborative filtering

kNN and matrix factorization based Collaborative filtering
Classification/ Naïve Bayes, random forest
Frequent item sets/ (P)FPGrowth

Classification/ Logistic Regression/ SGD

Clustering/ Mean shift, k-Means, Canopy, Dirichlet Process, Co-Location search

Sequence learning/ HMM

Math libs/ Mahout collections

LDA

Libraries to have a look at:
Vowpal Wabbit
Mallet
LibSvm
LibLinear
Libfm
Incanter
GraphLab
Scikit-learn

Where to get more information:
“Mahout in Action” - Manning
“Taming Text” - Manning
“Machine Learning” - Andrew Ng
https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks
play/MAHOUT/Reference+Reading
Image by pareeerica
http://www.flickr.com/photos/pareeerica/3711741298/

Frameworks worth mentioning:
Apache Mahout
Matlab/ Octave
Shogun
RapidI

Apache Giraph
R
Weka
MyMediaLite

Get your hands dirty:
http://kaggle.com
play/MAHOUT/Collections

Where to meet these people:
RecSys
NIPS
KDD
PKDD
ApacheCon
O'Reilly Strata

ICML
ECML
WSDM
JMLR
Berlin Buzzwords

Get started today with the right tools.

January 8, 2008 by dreizehn28
http://www.flickr.com/photos/1328/2176949559

Discuss ideas and problems online.

November 16, 2005 [phil h]
http://www.flickr.com/photos/hi-phi/64055296

Images taken at Berlin Buzzwords 2011/12/13 by
Philipp Kaden. See you there end of May 2014.

Discuss ideas and problems in person.

BerlinBuzzwords.de – End of May 2014 in Berlin/ Germany.


Online – user/dev@mahout.apache.org, java-user@lucene.apache.org,
dev@lucene.apache.org

Interest in solving hard problems.
Being part of lively community.
Engineering best practices.

Bug reports, patches, features.
Documentation, code, examples.
Image by: Patrick McEvoy
