Scalable Machine Learning
me: 
Sam Bessalah 
Software Engineer, Freelance 
Big Data, Distributed Computing, Machine Learning 
Paris Data Geek Co-organizer 
@samklr @DataParis
Machine Learning Land 
VOWPAL WABBIT
Some Observations in Big Data Land 
● New use cases push towards faster execution platforms and real-time prediction engines. 
● Traditional MapReduce on Hadoop is fading away, especially for Machine Learning. 
● Apache Spark has become the darling of the Big Data world, thanks to its high-level API and performance. 
● Rise of public Machine Learning APIs to easily integrate models into applications and other data processing workflows.
● Used to be the only machine learning framework on Hadoop MapReduce 
● Has moved from MapReduce towards modern and faster backends 
● Now provides a fluent DSL that integrates with Scala and Spark
Mahout Example 
Simple co-occurrence analysis in Mahout: 
val A = drmFromHDFS("hdfs://nivdul/babygirl.txt") 
val cooccurrencesMatrix = A.t %*% A 
val numInteractions = drmBroadcast(A.colSums) 
val I = cooccurrencesMatrix.mapBlock() { 
  case (keys, block) => 
    val indicatorBlock = sparse(row, col) 
    for (r <- block) 
      indicatorBlock = computeLLR(row, nbInt) 
    keys -> indicatorBlock 
}
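The computeLLR step above is where co-occurrence counts become indicator scores. As a self-contained sketch (not Mahout's actual code, whose names and structure differ), the log-likelihood ratio over a 2x2 contingency table of event counts can be computed like this:

```scala
// Dunning's log-likelihood ratio over a 2x2 contingency table, the
// statistic behind co-occurrence indicator scoring. Plain-Scala sketch.
object LLR {
  // x * ln(x), with the 0 * ln(0) = 0 convention
  private def xLogX(x: Double): Double = if (x == 0) 0.0 else x * math.log(x)

  // Unnormalized Shannon-style entropy of a set of counts
  private def entropy(counts: Double*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  /** k11: both events together, k12/k21: one event alone, k22: neither. */
  def logLikelihoodRatio(k11: Double, k12: Double, k21: Double, k22: Double): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    // Clamp tiny negative values caused by floating-point error
    math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
  }
}
```

A table of independent events (e.g. all four counts equal) scores 0; strongly co-occurring events score high, which is what makes the statistic usable as an indicator.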
Dataflow system, materialized by immutable, lazy, in-memory distributed collections suited for the iterative and complex transformations found in most Machine Learning algorithms. 
Those in-memory collections are called Resilient Distributed Datasets (RDDs). 
They provide: 
● Partitioned data 
● High-level operations (map, filter, collect, reduce, zip, join, sample, etc.) 
● No side effects 
● Fault recovery via lineage
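RDD transformations are lazy: map and filter only record work, and nothing executes until an action forces evaluation. As a loose plain-Scala analogy (no Spark involved), LazyList shows the same deferred-evaluation behavior:

```scala
// Not Spark: a plain-Scala analogy for RDD laziness using LazyList.
// Transformations (map/filter) are merely recorded; the work happens
// only when an "action" (here, toList) forces the collection.
object LazyDemo {
  def run(): (Int, List[Int]) = {
    var evaluations = 0
    val data = LazyList.from(1).take(5)           // 1, 2, 3, 4, 5 (unevaluated)
    val transformed = data
      .map { x => evaluations += 1; x * 2 }       // recorded, not yet run
      .filter(_ % 4 == 0)
    val before = evaluations                      // still 0: nothing ran yet
    val result = transformed.toList               // forces the whole pipeline
    (before, result)
  }
}
```

In Spark the same idea, combined with immutability, is what makes lineage-based fault recovery possible: a lost partition is simply recomputed by replaying the recorded transformations.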
Some operations on RDDs
Spark Ecosystem
MLlib 
Machine Learning library within Spark: 
● Provides an integrated predictive and data analysis workflow 
● A broad collection of algorithms and applications 
● Integrates with the whole Spark ecosystem 
APIs in three languages: Scala, Java, and Python
Algorithms in MLlib
Example: Clustering via K-means 
import org.apache.spark.mllib.clustering.KMeans 
import org.apache.spark.mllib.linalg.Vectors 
// Load and parse data 
val data = sc.textFile("hdfs://bbgrl/dataset.txt") 
val parsedData = data.map { x => 
  Vectors.dense(x.split(" ").map(_.toDouble)) 
}.cache() 
// Cluster the data into 5 classes using K-means 
val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20) 
// Evaluate model error 
val cost = clusters.computeCost(parsedData)
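computeCost above returns the within-set sum of squared errors (WSSSE): each point's squared Euclidean distance to its nearest center, summed over the dataset. A small plain-Scala sketch of that metric (no Spark needed) makes the definition concrete:

```scala
// Within-set sum of squared errors (WSSSE), the metric behind
// KMeans.computeCost: for each point, the squared distance to its
// nearest cluster center, summed over all points. Plain-Scala sketch.
object KMeansCost {
  type Point = Array[Double]

  // Squared Euclidean distance between two points
  private def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def computeCost(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum
}
```

Lower cost means tighter clusters; plotting cost against k is the usual "elbow" heuristic for choosing the number of clusters.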
Coming in Spark 1.2 
● Ensembles of decision trees: Random Forests 
● Boosting 
● Topic modeling 
● Streaming K-means 
● A pipeline interface for machine learning workflows 
Many contributions from the community
Machine Learning Pipeline 
Typical machine learning workflows are complex! 
Coming in the next iterations of MLlib
● H2O is a fast (really fast) statistics, Machine Learning and math engine on the JVM. 
● Developed by 0xdata (a commercial entity), with a focus on bringing robust and highly performant machine learning algorithms to popular Big Data workloads. 
● Has APIs in R, Java, Scala and Python, and integrates with third-party tools like Tableau and Excel.
Example in R 
library(h2o) 
localH2O = h2o.init(ip = 'localhost', port = 54321) 
irisPath = system.file("extdata", "iris.csv", package="h2o") 
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex") 
iris.data.frame <- as.data.frame(iris.hex) 
> colnames(iris.hex) 
[1] "C1" "C2" "C3" "C4" "C5" 
>
Simple logistic regression to predict prostate cancer outcomes: 
> prostate.hex = h2o.importFile(localH2O, 
path="https://raw.github.com/0xdata/h2o/../prostate.csv", 
key = "prostate.hex") 
> prostate.glm = h2o.glm(y = "CAPSULE", x =c("AGE","RACE","PSA","DCAPS"), 
data = prostate.hex,family = "binomial", nfolds = 10, alpha = 0.5) 
> prostate.fit = h2o.predict(object=prostate.glm, newdata = prostate.hex)
> (prostate.fit) 
IP Address: 127.0.0.1 
Port : 54321 
Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9 
predict X0 X1 
1 0 0.7452267 0.2547732 
2 1 0.3969807 0.6030193 
3 1 0.4120950 0.5879050 
4 1 0.3726134 0.6273866 
5 1 0.6465137 0.3534863 
6 1 0.4331880 0.5668120
Sparkling Water 
Transparent use of H2O data and algorithms with the Spark API. 
Provides a custom RDD : H2ORDD
val sqlContext = new SQLContext(sc) 
import sqlContext._ 
airlinesTable.registerTempTable("airlinesTable") // H2O methods 
val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'" 
val result = sql(query) 
result.count
Same but with Spark API 
// H2O Context provides useful implicits for conversions 
val h2oContext = new H2OContext(sc) 
import h2oContext._ 
// Create RDD wrapper around DataFrame 
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData) 
airlinesTable.count 
// And use Spark RDD API directly 
val flightsOnlyToSF = airlinesTable.filter(f => 
f.Dest==Some("SFO") || f.Dest==Some("SJC") || f.Dest==Some("OAK") 
) 
flightsOnlyToSF.count
Build a model 
import hex.deeplearning._ 
import hex.deeplearning.DeepLearningModel.DeepLearningParameters 
val dlParams = new DeepLearningParameters() 
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek, 
                                  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier, 
                                  'FlightNum, 'TailNum, 'CRSElapsedTime, 
                                  'Origin, 'Dest, 'Distance, 'IsDepDelayed) 
dlParams._response_column = 'IsDepDelayed.name 
// Create a new model builder 
val dl = new DeepLearning(dlParams) 
val dlModel = dl.train.get
Predict 
// Use the model to score data 
val prediction = dlModel.score(result)('predict) 
// Collect predicted values via the RDD API 
val predictionValues = toRDD[DoubleHolder](prediction) 
  .collect 
  .map(_.result.getOrElse("NaN"))
Slides: http://speakerdeck.com/samklr/
