Quick! Quick! Exploration!:
A framework for searching a predictive
model on Apache Spark
Masato Asahara*, Yoshiki Takahashi+
and Kazuyuki Shudo+
* NEC Corporation, + Tokyo Institute of Technology
Jun/21/2018 @DataWorks Summit 2018
2 © NEC Corporation 2018
Who are we?
▌Masato Asahara (Ph.D.)
▌Principal Software Architect and Researcher
at NEC System Platform Research Laboratory
Masato Asahara (Ph.D.) currently leads the development of Spark-based
machine learning and data analytics systems that fully automate
predictive modeling.
Masato received his Ph.D. degree from Keio University, and has worked at
NEC for 8 years as a researcher in the field of distributed computing
systems and computing resource management technologies.
▌Yoshiki Takahashi
▌Master's student at Tokyo Institute of Technology
Yoshiki Takahashi is a student in the master's program in computer science
at the graduate school of Tokyo Institute of Technology. His academic
research proposal was accepted at SysML 2018, a conference that has attracted
attention since its earlier days as a NIPS workshop.
He worked on the development of a Spark-based machine learning platform
for automatic predictive modeling during his internship at NEC Data
Science Research Laboratories in 2017. He received his B.S. degree in 2017
from Tokyo Institute of Technology.
Agenda
[Figure: selecting the best model out of 1000+ candidate patterns]
Agenda
[Figure: finding the best model with a framework that is quick, scalable, and plug-in]
Agenda
Our framework: ×13 faster than MLlib
(A cluster with 16 CPU cores, using the HIGGS data set from the UCI ML Repository)
Predictive Modeling Automation Framework
Predictive Analysis in Enterprise Area
• Driver Risk Assessment
• Inventory Optimization
• Churn Retention
• Predictive Maintenance
• Product Price Optimization
• Sales Optimization
• Energy/Water Operation Mgmt.
Pain of Modern Predictive Modeling
• High skill: precious CS Ph.D.s
• Evolving ML technology: quick trial w/ new ML algo.
• Long time: many tuning parameters
Our Framework Automates Predictive Modeling!
• High skill: precious CS Ph.D.s
• Evolving ML technology: quick trial w/ new ML algo.
• Long time: many tuning parameters
Values of Our Automation Framework
• Democratized to business users
• Quick model selection
• Easy integration with future ML implementations
Design Challenges and Solutions
High Level Architecture
[Figure: architecture diagram. Labels: Training Data, Validation Data, Training, Validation, Criteria, Run, HDFS]
Design Challenges
• High Scalability
• Open for ML Implementations
Just Adding Nodes Doesn't Improve Performance
[Figure: job timeline on the cluster. Time labels: 5 min, 4 min, 1 min, 2 min, wait 6 min]
Scheduling as a Combinatorial Optimization Problem
[Figure: a scheduler assigns Job 1, Job 2, and Job 3 to workers W1 and W2, which finish at times T1 and T2. Time labels shown: 5 min, 5 min; 2 min, 1 min, 5 min]
Objective: minimize max{T1, T2}
Problem: the execution time of each job is not known in advance (??? min)
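The objective on these slides, minimizing max{T1, T2}, can be illustrated with a small sketch. The slides do not say which algorithm the scheduler uses; the greedy longest-job-first heuristic below is an assumption chosen for brevity, and the function name `lpt_schedule` is hypothetical. The job times reuse the 2, 1, and 5 minute labels from the figure, assuming they belong to Job 1, Job 2, and Job 3 in that order.

```python
import heapq


def lpt_schedule(job_minutes, n_workers):
    """Greedy heuristic: give the longest remaining job to the least-loaded worker."""
    # Min-heap of (current finish time, worker index).
    loads = [(0.0, worker) for worker in range(n_workers)]
    heapq.heapify(loads)
    assignment = {worker: [] for worker in range(n_workers)}

    # Visit jobs from longest to shortest.
    for job, minutes in sorted(job_minutes.items(), key=lambda kv: -kv[1]):
        finish, worker = heapq.heappop(loads)
        assignment[worker].append(job)
        heapq.heappush(loads, (finish + minutes, worker))

    # The makespan is max{T1, ..., Tn}: the finish time of the busiest worker.
    makespan = max(finish for finish, _ in loads)
    return assignment, makespan


# Assumed job times (minutes), taken from the labels in the figure.
jobs = {"Job 1": 2.0, "Job 2": 1.0, "Job 3": 5.0}
assignment, makespan = lpt_schedule(jobs, n_workers=2)
print(assignment, makespan)
```

With these inputs the heuristic places Job 3 alone on one worker and Jobs 1 and 2 on the other, so the makespan is 5 minutes. The heuristic only works when job times are known, which is why the unknown "??? min" matters.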
Scheduler Pre-profiles Job Time via Sampled Data
[Figure: the scheduler runs training for each parameter setting (eta, max_depth, ..., round) on small sampled data and records "This training takes xx.x sec."]
Scheduler Pre-profiles Job Time via Sampled Data
[Figure: two phases. Profiling: the scheduler measures job times on small sampled data. Scheduling: automated predictive modeling then runs on the entire data.]
Preliminary Evaluation: Pre-profiling Takes Little Time
2.34%
In pre-profiling,
• 1% of the training data is sampled
• training runs over the same search space as
the automatic predictive modeling
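The pre-profiling step described above can be sketched as follows: take a 1% sample, run one training per grid point, and record how long each run takes. This is only a minimal sketch under stated assumptions. `profile_grid` and `toy_train` are hypothetical names, the toy training loop stands in for a real ML library call, and the grid values are invented for illustration (only the parameter names eta, max_depth, and round come from the slides).

```python
import random
import time
from itertools import product


def profile_grid(train_fn, rows, grid, sample_fraction=0.01, seed=0):
    """Time train_fn on a small random sample for every grid point."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * sample_fraction))
    sample = rng.sample(rows, sample_size)

    names = sorted(grid)
    timings = {}
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        start = time.perf_counter()
        train_fn(sample, params)
        timings[values] = time.perf_counter() - start
    return names, sample_size, timings


def toy_train(sample, params):
    """Hypothetical stand-in for a real training job."""
    total = 0.0
    for _ in range(params["round"] * params["max_depth"]):
        for row in sample:
            total += params["eta"] * row
    return total


rows = [float(i) for i in range(10_000)]
# Illustrative grid values; not the grid used in the talk.
grid = {"eta": [0.1, 0.3], "max_depth": [3, 6], "round": [10, 20]}
names, sample_size, timings = profile_grid(toy_train, rows, grid)
for values, seconds in timings.items():
    print(dict(zip(names, values)), f"takes {seconds:.4f} sec.")
```

The measured times could then serve as the job-time estimates that the scheduler needs before it assigns jobs to workers.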
Design Challenges
• High Scalability
• Open for ML Implementations
Reducing Implementation Costs to Add New ML impl.
• Distributed Learning
• Validation / Model Selection
Naïve Design: Requires Many Changes to Plug In New ML
[Figure: Distributed Learning invokes each Training implementation directly, and each one needs its own data format (TF format, XGB format, new ML format), so code must be added for every new ML implementation]
Easy Integration with New ML impl. by Encapsulation
[Figure: Distributed Learning invokes an encapsulated Training interface on common-format data; code for a new ML implementation is added only behind that interface]
Easy Integration with New ML impl. by Encapsulation
[Figure: Validation / Model Selection invokes an encapsulated Prediction interface on common-format data; code for a new ML implementation is added only behind that interface]
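The encapsulation idea on these two slides can be illustrated with a small sketch. The framework's real interface is not shown in the slides, so everything here is an assumption: the `Learner` base class, the `register` decorator, and the toy `MeanLearner` are hypothetical names, and a nested sequence of floats stands in for the common data format.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Type

Matrix = Sequence[Sequence[float]]  # assumed common data format


class Learner(ABC):
    """Hypothetical common interface that the framework would call."""

    @abstractmethod
    def train(self, features: Matrix, labels: Sequence[float]) -> None: ...

    @abstractmethod
    def predict(self, features: Matrix) -> List[float]: ...


LEARNERS: Dict[str, Type[Learner]] = {}


def register(name: str):
    """Class decorator that plugs a new implementation into the registry."""
    def decorator(cls: Type[Learner]) -> Type[Learner]:
        LEARNERS[name] = cls
        return cls
    return decorator


@register("mean")
class MeanLearner(Learner):
    """Toy learner: always predicts the mean training label."""

    def train(self, features, labels):
        self.mean_ = sum(labels) / len(labels)

    def predict(self, features):
        return [self.mean_ for _ in features]


def train_and_predict(name, train_x, train_y, test_x):
    """Framework side: talks only to the Learner interface."""
    learner = LEARNERS[name]()
    learner.train(train_x, train_y)
    return learner.predict(test_x)


predictions = train_and_predict("mean", [[0.0], [1.0]], [0.0, 1.0], [[5.0], [6.0]])
print(predictions)
```

In this sketch, adding another ML implementation means writing one more registered class; `train_and_predict` stays unchanged.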
Evaluation
Evaluation Setup
▌Dataset
HIGGS (UCI Dataset Repository)
• 1M samples each for the training,
validation, and test data
• 28 features
▌Scheduler Training
Executes the same grids as the training
Uses a 1% sample of the training data
▌Environment
Apache Spark 2.3.0
Apache Hadoop 3.1.0
▌Algorithms Explored
Gradient Boosting Tree (GBT)
• XGBoost 0.8
• 864 grid points
Multi-layer Perceptron (MLP)
• TensorFlow 1.8.0
• 324 grid points
Logistic Regression (LR)
• scikit-learn 0.18.1
• 5 grid points
Random Forest (RF)
• scikit-learn 0.18.1
• 18 grid points
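The grid sizes listed above add up to the size of the whole search space, which is consistent with the "1000+ Patterns" on the agenda slide. A quick check (the dictionary name is illustrative only):

```python
# Grid points per algorithm, copied from the evaluation setup above.
grid_points = {
    "GBT (XGBoost)": 864,
    "MLP (TensorFlow)": 324,
    "LR (scikit-learn)": 5,
    "RF (scikit-learn)": 18,
}

total_grid_points = sum(grid_points.values())
print(total_grid_points)  # 1211
```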
Evaluation Result: Total Execution Time
×13.1 faster!
Spark MLlib Focuses on Scaling out for Huge Data Size
[Figure: with MLlib, the training dataset is distributed across Core 1 to Core 3, and the cores shuffle data while training a single model; the next model starts after training completes]
We Focus on the Huge Search Space of Parameter Tuning
[Figure: in our framework, each core reads the entire training dataset and trains a different model without shuffle; a core starts its next model as soon as it completes training]
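The contrast above, one model per core on the entire data instead of one model spread over all cores, can be sketched with the Python standard library. This is not the framework's code: a thread pool stands in for Spark cores, and `train_and_score`, `search`, and the toy threshold model are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(data, params):
    """Toy 'training': score a threshold classifier on the entire data."""
    threshold = params["threshold"]
    correct = sum((x > threshold) == label for x, label in data)
    return params, correct / len(data)


def search(data, grid, n_workers=3):
    """Evaluate every grid point independently; no data is exchanged between workers."""
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[name] for name in names))]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda p: train_and_score(data, p), candidates))
    # Model selection: keep the candidate with the highest accuracy.
    return max(results, key=lambda result: result[1])


data = [(0.1, False), (0.4, False), (0.6, True), (0.9, True)]
best_params, best_accuracy = search(data, {"threshold": [0.0, 0.5, 1.0]})
print(best_params, best_accuracy)
```

Because each candidate is trained on its own, the only coordination needed is collecting the scores at the end, which is the property the slide contrasts with shuffle-heavy distributed training.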
Evaluation Result: Execution Performance for Scalability
[Figure: scalability chart. Data labels: 72.7%, 78.4%, 81.7%, 84.7%]
Evaluation Result: Improvement of Accuracy and AUC
Model                      Classification Accuracy    AUC
Best model*                0.756                      0.837
Gradient Boosting Tree**   0.743 (-0.013)             0.825 (-0.012)
Logistic Regression**      0.642 (-0.114)             0.684 (-0.153)
Random Forest**            0.724 (-0.032)             0.801 (-0.036)
* Best model produced by our framework.
** Using the default hyperparameters of XGBoost and scikit-learn.
Evaluation Result: Amount of Code for Adding New ML
# Lines of code w/o comments
• 151 lines
• 292 lines (Python: 116, Scala: 176)
• 290 lines (Python: 90, Scala: 200)
Summary and Future work
Summary: Automation Framework for Predictive Modeling
[Figure: finding the best model with a framework that is quick, scalable, and plug-in]
Values
• Democratized to business users
• Quick model selection
• Easy integration with future ML implementations
Design Challenges (Addressed)
• High Scalability
• Open for ML Implementations
Future work - Convert Data Structure for Each ML impl.
[Figure: the common format, Double[ ][ ], is memory-copied and converted into the structure each ML implementation expects: sparse, column-oriented, or row-oriented]
A Common Memory Format That Can Be Read w/o Copy Is Better
[Figure: a common format (still open: ????) that sparse, column-oriented, and row-oriented consumers read with zero copy]
Apache Arrow …?
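The difference between "memory copy and convert" and "zero-copy read" can be illustrated with Python's built-in buffer protocol. This sketch does not use Apache Arrow and says nothing about how the framework stores data; it only shows that a view shares memory with the original buffer while a converted copy does not. The variable names are illustrative.

```python
from array import array

# A 2 x 3 matrix of doubles stored row-major in one contiguous buffer.
common = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Zero-copy read: the memoryview and its slice share memory with `common`.
view = memoryview(common)
second_row_view = view[3:6]

# Copy and convert: a new list with its own storage.
second_row_copy = list(common[3:6])

# Change the original buffer and compare what each reader sees.
common[4] = 50.0
print(second_row_view[1])  # 50.0, the view sees the update
print(second_row_copy[1])  # 5.0, the copy does not
```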
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

Quick! Quick! Exploration!: A framework for searching a predictive model on Apache Spark

  • 1. Quick! Quick! Exploration!: A framework for searching a predictive model on Apache Spark Masato Asahara*, Yoshiki Takahashi+ and Kazuyuki Shudo+ * NEC Corporation, + Tokyo Institute of Technology Jun/21/2018 @DataWorks Summit 2018
  • 2. Who we are. Masato Asahara (Ph.D.), Principal Software Architect and Researcher at NEC System Platform Research Laboratory: Masato Asahara currently leads the development of Spark-based machine learning and data analytics systems that fully automate predictive modeling. He received his Ph.D. degree from Keio University and has worked at NEC for 8 years as a researcher in the field of distributed computing systems and computing resource management technologies. Yoshiki Takahashi, master's course student at Tokyo Institute of Technology: Yoshiki Takahashi is a student in the master of computer science program at the graduate school of Tokyo Institute of Technology. His academic research proposal was accepted at SysML 2018, which has attracted attention since its earlier days as a workshop at NIPS. He worked on the development of a Spark-based machine learning platform for automatic predictive modeling during his internship at NEC Data Science Research Laboratories in 2017. He received his B.S. degree in 2017 from Tokyo Institute of Technology.
  • 3. Agenda: Best model; 1000+ Patterns
  • 4. Agenda: Best model; Quick!; Scalable!; Plug-in
  • 5. Agenda: Our framework is ×13 faster than MLlib (a cluster with 16 CPU cores, using the HIGGS data set from the UCI ML repository)
  • 7. Predictive Analysis in Enterprise Area: Driver Risk Assessment; Inventory Optimization; Churn Retention; Predictive Maintenance; Product Price Optimization; Sales Optimization; Energy/Water Operation Mgmt.
  • 8. Pain of Modern Predictive Modeling: High skill (Precious CS Ph.D.s); Evolving ML Technology (Quick Trial w/ New ML algo.); Long time (Many Tuning Parameters)
  • 9. Our Framework Automates Predictive Modeling!: High skill (Precious CS Ph.D.s); Evolving ML Technology (Quick Trial w/ New ML algo.); Long time (Many Tuning Parameters)
  • 10. Values of Our Automation Framework: Democratized to business users; Quick model selection; Easy integration with future ML implementations
  • 12. High Level Architecture (diagram labels: Training Data; Validate Data; Training; Validate; Criteria; Run; HDFS)
  • 13. Design Challenges: High Scalability; Open for ML Implementations
  • 14. Design Challenges: High Scalability; Open for ML Implementations
  • 15. Just Adding Nodes Doesn't Improve Performance (diagram labels: 5 min; 4 min; 1 min; 2 min; Wait; 6 min)
  • 16. Scheduling as a Combinatorial Optimization Problem: the Scheduler assigns Job 1, Job 2, Job 3, … to $W_1$ and $W_2$, which finish at times $T_1$ and $T_2$; the total time is $\max\{T_1, T_2\}$ (diagram labels: 5 min; 5 min; 2 min; 1 min; 5 min)
  • 17. Scheduling as a Combinatorial Optimization Problem: the Scheduler assigns Job 1, Job 2, Job 3, … to $W_1$ and $W_2$, which finish at times $T_1$ and $T_2$; the goal is to minimize $\max\{T_1, T_2\}$ (diagram labels: 5 min; 5 min; 2 min; 1 min; 5 min)
  • 18. Scheduling as a Combinatorial Optimization Problem: the Scheduler assigns Job 1, Job 2, Job 3, … to $W_1$ and $W_2$, which finish at times $T_1$ and $T_2$; the goal is to minimize $\max\{T_1, T_2\}$, but each job's time is unknown (diagram labels: ??? min; ??? min; ??? min; 5 min; 5 min)
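The objective on slides 16 to 18, minimizing $\max\{T_1, T_2\}$, can be illustrated with a small sketch. The slides do not say which algorithm the scheduler uses; the code below substitutes the standard longest-processing-time-first (LPT) greedy heuristic, and the function name `schedule_lpt` and the job durations are illustrative assumptions rather than anything taken from the framework.

```python
import heapq

def schedule_lpt(durations, n_workers):
    """Assign jobs to workers with the longest-processing-time-first heuristic.

    Returns (makespan, assignment), where assignment[w] lists the job
    indices placed on worker w. This is a heuristic, not an exact solver.
    """
    # Min-heap of (current load, worker id): the least-loaded worker is on top.
    loads = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(loads)
    assignment = [[] for _ in range(n_workers)]
    # Place long jobs first so that short ones can fill the remaining gaps.
    for job in sorted(range(len(durations)), key=lambda j: -durations[j]):
        load, w = heapq.heappop(loads)
        assignment[w].append(job)
        heapq.heappush(loads, (load + durations[job], w))
    makespan = max(load for load, _ in loads)
    return makespan, assignment

# Hypothetical job durations in minutes, two workers.
job_minutes = [5, 4, 1, 2]
makespan, assignment = schedule_lpt(job_minutes, n_workers=2)
print(makespan, assignment)
```

With these durations both workers end up with 6 minutes of work, which is also the lower bound (12 minutes of total work split over 2 workers). In general LPT only approximates the optimum, and it needs the job durations in advance, which is the gap slide 18 points out.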
  • 19. Scheduler Pre-profiles Job Time via Sampled Data: for each hyperparameter setting (eta, max_depth, …, round) the Scheduler runs the training on Small Sampled Data and records "This training takes xx.x sec."
  • 20. Scheduler Pre-profiles Job Time via Sampled Data (diagram labels: Scheduler; Profiling on Small Sampled Data; Scheduling; Automated Predictive Modeling on Entire Data)
  • 21. Preliminary Evaluation: Pre-profiling Takes Little Time (2.34%). In pre-profiling: sampling 1% of the data from the training data; executing trainings for the same search space as automatic predictive modeling
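The pre-profiling step on slides 19 to 21 can be sketched as follows: run every grid point once on a small sample and record the wall-clock time. This is a minimal illustration under stated assumptions, not the framework's code; `pre_profile`, `toy_train`, and the toy grid are hypothetical names and values, and only the 1% sampling fraction and the parameter names `eta` and `round` come from the slides.

```python
import random
import time
from itertools import product

def pre_profile(train_fn, rows, grid, fraction=0.01, seed=0):
    """Time train_fn on a small random sample for every grid point.

    Returns a list of (params, seconds) pairs, one per grid point.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * fraction))
    sample = rng.sample(rows, sample_size)
    keys = sorted(grid)
    profile = []
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        start = time.perf_counter()
        train_fn(sample, params)
        profile.append((params, time.perf_counter() - start))
    return profile

def toy_train(rows, params):
    # Hypothetical stand-in for a real training job: cost grows with "round".
    total = 0.0
    for _ in range(params["round"]):
        total += sum(x * params["eta"] for x in rows)
    return total

rows = [float(i) for i in range(10_000)]
grid = {"eta": [0.1, 0.3], "round": [10, 100]}
profile = pre_profile(toy_train, rows, grid)
for params, seconds in profile:
    print(params, f"{seconds:.6f} sec")
```

The measured times are only relative estimates of the full-data job times, which is what a scheduler needs in order to balance jobs across workers.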
  • 22. Design Challenges: High Scalability; Open for ML Implementations
  • 23. Reducing Implementation Costs to Add New ML impl.: Distributed Learning; Validation / Model Selection
  • 24. Naïve Design: Requires Many Changes to Plug in New ML (diagram labels: Distributed Learning; invoke; Training; Add code; New ML Format Data; TF Format Data; XGB Format Data)
  • 25. Easy Integration with New ML impl. by Encapsulation (diagram labels: Distributed Learning; invoke; Encapsulation; Training; Common Format Data; Add code)
  • 26. Easy Integration with New ML impl. by Encapsulation (diagram labels: Validation / Model selection; invoke; Encapsulation; Prediction; Common Format Data; Add code)
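The encapsulation idea on slides 25 and 26 can be sketched as one small interface that every ML implementation fills in, so that the search loop never changes when a new implementation is plugged in. Everything named below (`Learner`, `ThresholdLearner`, `select_best`) is a hypothetical illustration; the slides do not show the framework's actual interface. The common data format is assumed here to be a plain list of rows of floats.

```python
from abc import ABC, abstractmethod
from itertools import product

class Learner(ABC):
    """Hypothetical plug-in interface: all data arrives in one common format."""

    @abstractmethod
    def train(self, rows, labels, params):
        """Return a trained model for the given hyperparameters."""

    @abstractmethod
    def predict(self, model, rows):
        """Return one predicted label per row."""

class ThresholdLearner(Learner):
    """Toy plug-in: predicts 1 when the first feature exceeds a threshold."""

    def train(self, rows, labels, params):
        return {"threshold": params["threshold"]}

    def predict(self, model, rows):
        return [1 if r[0] > model["threshold"] else 0 for r in rows]

def accuracy(predicted, labels):
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)

def select_best(learner, grid, train, valid):
    """Grid search that only talks to the Learner interface."""
    keys = sorted(grid)
    best = None
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = learner.train(train[0], train[1], params)
        score = accuracy(learner.predict(model, valid[0]), valid[1])
        if best is None or score > best[0]:
            best = (score, params)
    return best

rows = [[0.1], [0.4], [0.6], [0.9]]
labels = [0, 0, 1, 1]
best_score, best_params = select_best(
    ThresholdLearner(), {"threshold": [0.0, 0.5, 1.0]}, (rows, labels), (rows, labels)
)
print(best_score, best_params)
```

Adding another implementation then means writing one more `Learner` subclass; `select_best` stays untouched, which is the "Add code" cost the slides aim to keep small.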
  • 28. Evaluation Setup
      ▌Dataset: HIGGS (UCI Dataset Repository)
         • 1M sampled data for each of training, validation and test data
         • 28 features
      ▌Scheduler Training: executes the same grids as the training, using a 1% sample of the training data
      ▌Environment: Apache Spark 2.3.0; Apache Hadoop 3.1.0
      ▌Exploring Algorithms
         • Gradient Boosting Tree (GBT): XGBoost 0.8; 864 grid points
         • Multi-layer Perceptron (MLP): TensorFlow 1.8.0; 324 grid points
         • Logistic Regression (LR): scikit-learn 0.18.1; 5 grid points
         • Random Forest (RF): scikit-learn 0.18.1; 18 grid points
  • 29. Evaluation Result: Total Execution Time: ×13.1 faster!!
  • 30. Spark MLlib Focuses on Scaling Out for Huge Data Size (diagram labels: Training Dataset; Shuffle; Core 1, Core 2, Core 3; Complete training!; Next Model)
  • 31. We Focus on the Huge Search Space of Parameter Tuning (diagram labels: Our Framework; Training Dataset; Read entire data; No-Shuffle; Core 1, Core 2, Core 3; Complete training!; Next Model)
  • 32. Evaluation Result: Execution Performance for Scalability (chart values: 72.7%; 78.4%; 81.7%; 84.7%)
  • 33. Evaluation Result: Improvement of Error and AUC
        Model                       Classification Accuracy   AUC
        Best model*                 0.756                     0.837
        Gradient Boosting Tree**    0.743 (-0.013)            0.825 (-0.012)
        Logistic Regression**       0.642 (-0.114)            0.684 (-0.153)
        Random Forest**             0.724 (-0.032)            0.801 (-0.036)
        * Best model produced by our framework.
        ** Using default hyperparameters of XGBoost and scikit-learn.
  • 34. Evaluation Result: Amount of Code for Adding New ML. Lines of code w/o comments: 151 lines; 292 lines (python: 116, scala: 176); 290 lines (python: 90, scala: 200)
  • 36. Summary: Automation Framework for Predictive Modeling: Best model; Quick!; Scalable!; Plug-in
  • 37. Values: Democratized to business users; Quick model selection; Easy integration with future ML implementations
  • 38. Design Challenges (Addressed): High Scalability; Open for ML Implementations
  • 39. Future work: Convert Data Structure for Each ML impl.: the Common Format (Double[ ][ ]) is memory-copied and converted into Sparse, Column-oriented and Row-oriented structures
  • 40. A Common Memory Format That Can Be Read w/o Copy Is Better: Common Format: ????; zero-copy read as Sparse, Column-oriented and Row-oriented; Apache Arrow …?
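The copy-and-convert cost described on slide 39 can be made concrete with a small sketch. It assumes the common format is a dense row-oriented matrix (a Python list of lists standing in for `Double[ ][ ]`); the helper names `to_column_oriented` and `to_sparse` are hypothetical and not part of the framework.

```python
def to_column_oriented(rows):
    """Copy a row-oriented matrix into a list of columns."""
    return [list(col) for col in zip(*rows)]

def to_sparse(rows):
    """Copy the non-zero entries into a {(row, col): value} dictionary."""
    return {
        (i, j): v
        for i, row in enumerate(rows)
        for j, v in enumerate(row)
        if v != 0.0
    }

# Common format: dense, row-oriented.
common = [[1.0, 0.0, 2.0],
          [0.0, 3.0, 0.0]]

columns = to_column_oriented(common)  # new allocation, every value copied
sparse = to_sparse(common)            # new allocation, non-zero values copied
print(columns)
print(sparse)
```

Each conversion walks the whole matrix and allocates a second structure, so memory use and conversion time grow with the data size for every ML implementation that needs its own layout. A shared format that each implementation can read in place would avoid these copies, which is the motivation slide 40 gives for considering something like Apache Arrow.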