Machine Learning and Apache Mahout : An Introduction

Machine Learning and Apache Mahout
Varad Meru
Software Development Engineer
Orzota, Inc.
about.me/vrdmr

© Varad Meru, 2013


Who Am I

Orzota, Inc.
  Making BigData Easy
  Designing a Cloud-based platform for ETL, Analytics

Past Work Experience
  Persistent Systems Ltd.
  Recommendation Engines and User Behavior Analytics.

Area of Interest
  Machine Learning
  Distributed Systems
  Recommendation Engines


Outline

Introduction

Machine Learning
  Introduction and History
  Types of Learning Algorithms
  Applications
  What’s New

Apache Mahout
  History
  Architecture
  Applications and Examples

Conclusion
Machine Learning
Rise of the Machine-Era


Introduction
“Machine Learning is Programming Computers to
optimize a Performance Criterion using Example Data
or Past Experience”


Term coined by Arthur Samuel

“Field of study that gives computers the ability to learn without being explicitly programmed.”



Branch of Artificial Intelligence and Statistics



Focuses on prediction based on known properties



Used as a sub-process in Data Mining.


Data Mining focuses on discovering new, unknown properties.


Learning Algorithms

Supervised Learning
  Labelled input data.
  Creating classifiers to predict unseen inputs.

Unsupervised Learning
  Unlabelled input data.
  Creating a function to predict the relation and output

Semi-Supervised Learning
  Combines Supervised and Unsupervised Learning methodology

Reinforcement Learning
  Reward-Punishment based agent.


Supervised Learning
Introduction


Learn from the Data



Data is already labelled




Expert, Crowd-sourced or case-based labelling of data.

Applications


Handwriting Recognition



Spam Detection



Information Retrieval




Personalisation based on ranks

Speech Recognition


Supervised Learning
Algorithms


Decision Trees



k-Nearest Neighbours



Naive Bayes



Logistic Regression



Perceptron and Multi-layer Perceptrons



Neural Networks



SVM and Kernel estimation


Supervised Learning
Example: Naive Bayes Classifier


President Obama’s Speech’s Word Map


Supervised Learning


A Spam Document’s Word Map


Supervised Learning


Running a test on the Classifier

“Order a trial Adobe chicken daily EABList new summer savings, welcome!”
  → Classifier → Spam Bin
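The spam test above can be sketched with a tiny multinomial Naive Bayes classifier. This is a minimal illustration, not Mahout's implementation: the training sentences, the `NaiveBayes` class and the `tokenize` helper are all made up for the example, and add-one (Laplace) smoothing is assumed.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it on anything that is not a letter."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


class NaiveBayes:
    """Hypothetical toy classifier; names are illustrative only."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, label, text):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch.
nb = NaiveBayes()
nb.train("spam", "order a free trial now and enjoy summer savings")
nb.train("spam", "new savings daily, order today, welcome offer")
nb.train("ham", "the president spoke about the economy and jobs")
nb.train("ham", "congress will vote on the new health care bill")

message = "Order a trial Adobe chicken daily EABList new summer savings, welcome!"
print(nb.classify(message))
```

Words the classifier has never seen (such as "Adobe" or "EABList") contribute almost equally to both labels, so the decision is driven by words like "order", "trial" and "savings".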


Unsupervised Learning
Introduction


Finding hidden structure in data



Unlabelled Data



Subject-matter experts (SMEs) are needed in post-processing to verify, validate and use the output



Used in exploratory analysis rather than predictive analytics



Applications


Pattern Recognition



Groupings based on a distance measure


Group of People, Objects, ...


Unsupervised Learning
Algorithms


Clustering


k-Means, MinHash, Hierarchical Clustering



Hidden Markov Models



Feature Extraction methods



Self-organizing Maps (Neural Nets)


Example: K-Means

Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
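As a rough sketch of what the k-Means picture shows, the loop below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points (Lloyd's algorithm). The sample points and the `kmeans` function are illustrative assumptions, not code from the slides or from Mahout.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on tuples of floats; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as a starting guess
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters


# Two well-separated groups of made-up 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

The distance measure here is Euclidean (`math.dist`, Python 3.8+); swapping in another measure changes which groupings are found.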


Learning Problem
Cat and Dog Problem


Humans can easily classify which is a cat and which is a dog.



But how can a computer do that?



Some attempts used clustering mechanisms to solve it, e.g. Co-occurrence Clustering and Deep Learning

Apache Mahout
Scalable Machine Learning Library


History and Etymology


Inspired by “Map-Reduce for Machine Learning on Multicore”, Ng et al.



Written in Java. Apache License.



Founders


Mahout – Isabel Drost, Grant Ingersoll, Karl Wettin.



Taste – Sean Owen



Mahout – Keeper/Driver of Elephants.



Current Release – 0.8 (stable)


Need

BigData
  Ever-growing data.
  Yesterday’s methods to process tomorrow’s data
  Cheap Storage

Scalable from Ground Up
  Should be built on top of any existing Distributed Systems framework
  Should contain distributed versions of ML algorithms

Size, Classification and Tools

  Lines (Sample Data)
    Analysis and Visualisation: Whiteboard, Bash, ...

  KBs – low MBs (Prototype Data)
    Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...

  MBs – low GBs (Online Data)
    Storage: MySQL (DBs), ...
    Analysis: NumPy, SciPy, Pandas, Weka, ...
    Visualisation: Flare, AmCharts, Raphael

  GBs – TBs – PBs
    Storage: HDFS, HBase, Cassandra, ...
    Analysis: Hive, Giraph, Hama, Mahout


Mahout Modules

Applications

Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM, Dimension Reduction

Utilities (Lucene/Vectorizer), Math (Vectors/Matrices/SVD), Collections (Primitives)

Hadoop


Recommender Systems


Recommender Systems
Introduction


Types of Recommender Systems
  Content Based Recommendations
  Collaborative Filtering Recommendations
    User-User Recommendations
    Item-Item Recommendations
  Dimensionality Reduction (SVD) Recommendations

Applications
  Products you would like to buy
  People you might want to connect with
  Potential Life-Partners
  Recommending Songs you might like
  ...


Recommender Systems
Collaborative Filtering in Action



Assuming people have seen at least one movie.

Cold Start?

1: seen
0: not seen




Tanimoto Coefficient

$$T(a, b) = \frac{N_C}{N_A + N_B - N_C}$$

$N_A$ – Number of Customers who bought A

$N_B$ – Number of Customers who bought B

$N_C$ – Number of Customers who bought A and B
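A worked instance of the Tanimoto coefficient, as a minimal sketch: customers are represented as sets, and the names and purchases are invented for illustration.

```python
def tanimoto(customers_a, customers_b):
    """N_C / (N_A + N_B - N_C) for two sets of customer ids."""
    n_a = len(customers_a)                # customers who bought A
    n_b = len(customers_b)                # customers who bought B
    n_c = len(customers_a & customers_b)  # customers who bought both
    denominator = n_a + n_b - n_c
    return n_c / denominator if denominator else 0.0


# Made-up purchase data.
bought_a = {"alice", "bob", "carol", "dave"}
bought_b = {"bob", "carol", "erin"}

# N_A = 4, N_B = 3, N_C = 2, so T = 2 / (4 + 3 - 2) = 0.4
print(tanimoto(bought_a, bought_b))
```

The denominator is the number of customers who bought at least one of the two items, so the coefficient is 1 only when exactly the same customers bought both.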




Cosine Coefficient

$$C(a, b) = \frac{N_C}{\sqrt{N_A \times N_B}}$$

$N_A$ – Number of Customers who bought A

$N_B$ – Number of Customers who bought B

$N_C$ – Number of Customers who bought A and B
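A small check of the cosine coefficient on made-up data, assuming the usual form for binary purchase data (co-purchases divided by the square root of the product of the two purchase counts). The second function computes the ordinary vector cosine on 0/1 purchase vectors to show that the two views agree; all names here are illustrative.

```python
import math


def cosine_coefficient(customers_a, customers_b):
    """N_C / sqrt(N_A * N_B) for two sets of customer ids."""
    n_a, n_b = len(customers_a), len(customers_b)
    n_c = len(customers_a & customers_b)
    if n_a == 0 or n_b == 0:
        return 0.0
    return n_c / math.sqrt(n_a * n_b)


def cosine_of_vectors(x, y):
    """Standard cosine similarity between two equal-length numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0


# Made-up purchase data.
customers = ["alice", "bob", "carol", "dave", "erin"]
bought_a = {"alice", "bob", "carol", "dave"}
bought_b = {"bob", "carol", "erin"}

# The same data as 0/1 vectors, one entry per customer.
vec_a = [1 if c in bought_a else 0 for c in customers]
vec_b = [1 if c in bought_b else 0 for c in customers]

print(cosine_coefficient(bought_a, bought_b))
print(cosine_of_vectors(vec_a, vec_b))
```

Both calls evaluate 2 / sqrt(4 × 3), up to floating-point rounding.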


Apache Mahout
Recommender System
Architecture


Two Modes





Stand-alone, non-distributed (“Taste”)
Scalable, distributed algorithmic version for Collaborative Filtering

Top-level Packages


Data Model



User Similarity



Item Similarity



User Neighbourhood



Recommender


Naive Bayes Classifier

“Order a trial Adobe chicken daily EABList new summer savings, welcome!”

Classifier


Naive Bayes Classifier


Naive Bayes is a fairly involved process in Mahout: training the classifier requires four separate Hadoop jobs.

Training:

1. Read the Features
2. Calculate per-Document Statistics
3. Normalize across Categories
4. Calculate normalizing factor of each label

Testing:

Classification (fifth job, explicitly invoked)


K-Means Clustering
Iterations


K-Means Clustering
MapReduce Version
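One common way to phrase k-means as MapReduce is one job per iteration: the mapper emits each point keyed by its nearest centroid, and the reducer averages the points for each key into a new centroid. The single-process sketch below follows that formulation; it does not reproduce Mahout's actual job structure, and all names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def map_phase(points, centroids):
    """Mapper: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Group pairs by centroid index (the shuffle), then average each group."""
    groups = defaultdict(list)
    for index, p in pairs:
        groups[index].append(p)
    new_centroids = list(centroids)  # a centroid that received no points stays put
    for index, members in groups.items():
        new_centroids[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return new_centroids


def kmeans_mapreduce(points, centroids, max_iterations=10):
    """Run one map + reduce pass per iteration until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Made-up points and starting centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_mapreduce(points, initial))
```

Only the small list of centroids has to be shared with every mapper between iterations, which is what makes the algorithm a natural fit for splitting the points across machines.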


Summary

• Machine Learning
  • Learning Algorithms
  • Varied Applications

• Mahout
  • Scaling to Giga/Tera/Peta Scale
  • Free and Open Source


More Info.

1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl. RecSys 2012.
2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson. Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).
3. http://mahout.apache.org/ - Apache Mahout Project Page
4. http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
5. [VIDEO] “Collaborative filtering at scale” by Sean Owen
6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.