Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential © 2016 MapR Technologies1
Real-World Machine Learning - Leverage the
Features of MapR Converged Data Platform
Mathieu Dumoulin (mdumoulin@mapr.com)
Mateusz Dymczyk (mateusz@h2o.ai)
Hadoop Summit Tokyo 2016

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential 2
Today’s goals
• Machine Learning projects in the Enterprise
have a LOT of requirements beyond training a
good ML model
• Current options are too complex
• Need a Converged Data Platform
• Introduce specific features useful for ML:
– MapR-FS, Volumes, Mirrors and Topologies
– MapR-DB and MapR Streams

Mathieu Dumoulin, Data Engineer
• Master’s degree in text classification
on Hadoop at Fujitsu Canada’s
Innovation Lab
• In Tokyo, I’ve worked as a Data
Scientist, Search Engineer and Data
Engineer
• I like Scikit-Learn and H2O
• 日本料理が大好き。とくに鍋としゃ
ぶしゃぶです。

Mateusz Dymczyk, Software Engineer
• M.Sc. in CS (Software and
System Engineering) @ AGH
UST, Poland
• Ph.D. (Machine Learning) dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP @ Fujitsu
Laboratories and en-japan inc
• I’m taking Sommelier classes

A common machine learning pipeline
*Image from scikit-learn.org

… meets the real world (Enterprise IT)

… meets the real world
Data comes from
many sources
maybe very large
Data isn’t
always labeled!

… meets the real world
Data comes from
many sources,
maybe very large
Needs ETL
and cleaning
Finding the best
algorithm and
parameters can use a
lot of CPU
Data isn’t
always labeled!

… Meets the real world
Data comes from
many sources,
maybe very large
Needs ETL
and cleaning
Finding the best
algorithm and
parameters can use a
lot of CPU
Data isn’t
always labeled!
From production
systems?
Is it real time?
What server will
serve predictions?
The predictions are
used by another
system...

Machine learning here...

Is not the same when you do it here

Enterprise machine learning matters
Growing number of ML use cases at successful companies
Anomaly
Detection
異常検出
Customer 360
Fraud
Detection
不正検出
Log Security
Analysis
ログ分析
Recommender
Engines
レコメンデーション
Sensor Data
Analysis (IoT)
Personalized
Offers
個人化
Ad Tech

…but it’s HARD
Ref: http://advancedspark.com/ , https://github.com/fluxcapacitor/pipeline

There must be a better way...

Big data Enterprise IT infrastructure for ML
• You can start simple and show value quickly
• It just works. Easy configuration and administration.
• Works with existing systems, and tools
• Includes common basics (File storage, DB, Streams)
• Strong ecosystem support (Apache projects)
• Enterprise class (multi-tenancy, security, HA, support)
An ideal platform for ML:

© 2016 MapR Technologies© 2016 MapR TechnologiesMapR Confidential © 2016 MapR Technologies 16
MapR Converged Data Platform

MapR Converged Data Platform
Open Source Engines & Tools Commercial Engines & Applications
Utility-Grade Platform Services
DataProcessing
Enterprise Storage
MapR-FS MapR-DB MapR Streams
Database Event Streaming
Global Namespace High Availability Data Protection Self-healing Unified Security Real-time Multi-tenancy
Search &
Others
Cloud &
Managed
Services
Custom Apps
UnifiedManagementandMonitoring

MapR is great for Enterprise ML projects
●MapR-FS and NFS mount
●Volumes and Topologies
●Mirrors and Snapshots
●

MapR Filesystem
•Native implementation in C/C++, it’s fast
•Use it like your own local filesystem
•Everything that can use files works as usual
•Unique MapR technology
•For more info watch on Youtube:
•What is MapR-FS
•MapR-FS vs. HDFS
Working, battle-tested distributed read-write filesystem

NFS Mount
Mount the cluster as a regular folder
$> sudo mount -o hard,nolock ip-10-0-0-110:/mapr /mapr
$> ll /mapr/hadoopsummit/ 
total 3 
drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:21 apps 
drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:12 hbase 
drwxr-xr-x. 3 root root 1 Oct 13 11:21 installer 
drwxr-xr-x. 2 mapr mapr 0 Oct 13 11:14 opt 
drwxrwxrwx. 2 mapr mapr 1 Oct 14 10:41 tmp 
drwxr-xr-x. 6 mapr mapr 4 Oct 14 10:52 user 
drwxr-xr-x. 3 mapr mapr 1 Oct 13 11:13 var

© 2014 MapR Technologies 21
MapR NFS and Volumes
[mapr@ip-10-0-0-110 mapr]$ pwd
/mapr/hadoopsummit/user/mapr

MapR-FS and NFS mount for ML
• Get started quickly and simply
• Use your favorite tool like...
– Custom code (Scikit-learn, R)
– SPSS, SAS, RapidMiner
– Apache Spark, Drill, Flink
• Super easy data import
– Just save to file on MapR
– Integrate with legacy servers
and code
– Use any ecosystem (Sqoop) it
all works
• Quick and scalable roundtrip
during development
– ETL/cleaning -> train/test ->
predict
– Don’t copy data (cluster to
cluster, local to cluster)
• Run in production direct from
the cluster
– no copying around

Volumes and Topologies - Managed in MCS

Volumes and Topologies
Volumes are just “regular” volumes

Volumes and Topologies
Volumes are just “regular” volumes
Select what nodes for
volume data = Topology

Volumes and Topologies for ML
• With YARN’s Node Labels, run tasks on nodes with
guaranteed data locality
– Special nodes with GPU, high memory or big CPU
• Multi-Tenancy
– Share cluster with business use cases in production
– Data isolation guaranteed
– Easy unified admin (Data scientists != Hadoop
admin)
– Bigger cluster, more reliable and faster

Snapshots and Mirrors

Snapshots - Instant point in time save

Mirrors - Physical copy

Snapshots
[... mateusz]$ cd .snapshot
[... .snapshot]$ ll 
total 1 
drwxr-xr-x. 2 mapr mapr 1 Oct 14 10:56
mateusz.snap1

Snapshots and Mirrors for ML
• Versioned data and models = Repeatable results
– same model, same data guaranteed
– Go back in time for free
• Keep intermediate transformations
– Quickly change your mind, don’t redo work
• A/B Testing easy-mode

Real-time events and DB for ML
• Built-in, no config, it just works
• Support next-gen use cases
– hyper-personalization of web/store content
– IoT Sensor data
• easy to start small but grows with your data/use case

MapR Converged Application Blueprint
• Microservices connected by real-time streams
– Ideal to serve predictions from ML models
• Next-Generation large-scale architecture
• Working example: https://www.mapr.com/appblueprint/
overview

Converged Data Platform 💖
Machine Learning
• Features that work together to support all phases of ML
• Supports your existing tools/code and the state of the art
large scale frameworks
• Easier to manage, more robust and secure.
• MapR is made for the enterprise and great for ML!

Demo of H2O on MapR: Features in Action

Agenda
• Why tooling matters in Machine Learning
• What is H2O and Sparkling Water
• Why MapR
• Demo

ML project problems
• Multiple data sources
• Different formats
• Large volumes of data to be read
• System bootstrap time
• Collaboration between data scientists
• Comparing models
• Deployment of the model
• Versioning
• Too many moving parts!
• etc.etc.

Successful ML platform
• Fast ingestion and manipulation of versatile data
• Intuitive modeling UI/API
• Easy model validation, visualisation and comparison
• Easy model deployment w/ versioning for fast predictions

• Written in high performance Java - native Java API
• Supports multiple file formats and data sources
• ETL capabilities
• Highly paralleled and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Runs on top of most major Hadoop distributions
ML
platform
Ingestions
platform
Big data
platform
What is H2O?
• Open source platform
• Exposes math and predictive algorithms
• GLM, Random Forest, GBM, Deep Learning etc.

FlowUI
• Notebook style open
source interface for H2O
• Code execution,
mathematics, plots, and
rich media

Why H2O?
• Fast ingestion and manipulation of versatile data
• Blazing fast data parsing, supports multiple formats and
data sources
• Intuitive modeling UI/API
• FlowUI, R/Python/REST APIs
• Easy model validation, visualisation and comparison
• Cross-validation, FlowUI graphs, comparison via Steam
• Easy model deployment /w versioning for fast predictions
• Model export as POJO, deploy as service via Steam

What is Sparkling Water?
• Framework integrating Spark and H2O
• H2O instances on Spark executors
• Allows to call Spark and H2O methods together

Why MapR?
• H2O + MapR-FS = fast data ingestion made even faster
• Data resilience
• MapR snapshots + H2O modelling from checkpoints =
continuous and versioned modelling

Airline delay classification
Model predicting
flight delays
ETL Modelling Predictions
Load data from CSVs
Model using
H2O’s GLM
* https://github.com/h2oai/sparkling-water/tree/master/examples/scripts

Q & A
@mapr
mdumoulin@mapr.com
Engage with us!
mapr-technologies

Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform

Recommended

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform (20)

More from Mathieu Dumoulin (6)

Recently uploaded (20)

Real-World Machine Learning - Leverage the Features of MapR Converged Data Platform