TOWARDS AN ANALYTICS QUERY ENGINE
Nantia Makrynioti and Vasilis Vassalos
Athens University of Economics and Business

MEDAL 2016, Bordeaux, France
INTRODUCTION
• Huge amounts of available data

• Decreasing storage costs

• Widespread use of systems that facilitate data analytics in a
distributed fashion
Towards an Analytics Query Engine, Nantia Makrynioti & Vasilis Vassalos
CURRENT STATE OF THE ART
• Libraries of algorithms

• Systems that provide operators for developing
distributed algorithms

✦ Combination of procedural and declarative
programming

✦ Integration of declarative operators into imperative
languages

✦ Closer to the style of statistical computing languages
LIBRARIES OF ALGORITHMS
Implementations of machine learning algorithms on top
of a given distributed framework

    // Train a 10-class logistic regression model with Spark MLlib (Java API)
    final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
        .setNumClasses(10)
        .run(training.rdd());

Examples: Apache Mahout, MLlib, MADlib
+ Easy to use
- Too coarse-grained
SYSTEMS OF OPERATORS (1)
A programming model between SQL and MapReduce

    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;

A sequence of steps, with each step performing a single high-level transformation
Use of variables to store intermediate results

Examples: Jaql, Pig Latin, Stratosphere’s Sopremo layer, U-SQL
+ Closer to data analysis users, yet fairly declarative
- Limited support for iteration
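As a rough illustration of this step-by-step style, the following Python sketch mirrors the two Pig Latin statements on plain in-memory records. The sample data and the helper names `filter_by` and `group_by` are assumptions made for the example; they are not part of Pig Latin or of any system named above.

```python
from collections import defaultdict

# Hypothetical sample relation; field names follow the Pig Latin snippet.
urls = [
    {"url": "a.example", "category": "news", "pagerank": 0.9},
    {"url": "b.example", "category": "news", "pagerank": 0.1},
    {"url": "c.example", "category": "sports", "pagerank": 0.5},
    {"url": "d.example", "category": "news", "pagerank": 0.3},
]


def filter_by(records, predicate):
    """One high-level transformation: keep records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


def group_by(records, key):
    """One high-level transformation: collect records under their key value."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)


# Each step stores its intermediate result in a variable, as on the slide.
good_urls = filter_by(urls, lambda r: r["pagerank"] > 0.2)
groups = group_by(good_urls, lambda r: r["category"])

for category, members in sorted(groups.items()):
    print(category, [r["url"] for r in members])
```

Each statement names a single transformation and says nothing about how it is executed, which is what keeps this style fairly declarative. An iterative algorithm would need a loop around such statements, which this kind of language supports only to a limited extent.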
SYSTEMS OF OPERATORS (2)
Extensions of the MapReduce model

    val textFile = sc.textFile("hdfs://...")
    val df = textFile.toDF("line")
    val errors = df.filter(col("line").like("%ERROR%"))
    errors.count()

Richer set of operators
Arbitrary data flow graphs

Examples: DryadLINQ, Stratosphere’s PACT, Spark, Tupleware
+ Iteration support
- Considerable amount of user-defined code
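The Spark snippet can be approximated on a single machine with Python's built-in `filter` and `map` plus `functools.reduce`. This is only a sketch of the map/reduce style with user-defined functions. The sample log lines and the function `is_error` are assumptions for illustration, and nothing here is distributed.

```python
from functools import reduce

# Hypothetical log lines standing in for the HDFS text file.
log_lines = [
    "INFO  service started",
    "ERROR disk full",
    "WARN  slow response",
    "ERROR connection lost",
]


def is_error(line):
    """User-defined predicate, playing the role of like("%ERROR%")."""
    return "ERROR" in line


# filter and map are lazy in Python 3; reduce drives the computation,
# loosely resembling how count() triggers execution in the Spark snippet.
errors = filter(is_error, log_lines)
ones = map(lambda _line: 1, errors)
error_count = reduce(lambda a, b: a + b, ones, 0)

print(error_count)
```

Even in this tiny pipeline the filtering logic lives in user code (`is_error` and the lambdas), which is opaque to an optimizer. That is the drawback noted above.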
SYSTEMS OF OPERATORS (3)
R-like programming models

    while (i < 20) {
        H = H * (t(W) %*% V) / (t(W) %*% W %*% H);
        W = W * (V %*% t(H)) / (W %*% H %*% t(H));
        i = i + 1;
    }

Support for matrix operations

Examples: MLI API, SystemML
+ Easier to express ML algorithms due to linear algebra operations
- Still closer to imperative programming
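The loop on this slide has the shape of the multiplicative update rules for non-negative matrix factorization, V ≈ W H. A minimal pure-Python sketch of the same updates is given below, assuming small dense matrices stored as lists of lists. The helper names, the initial matrices and the small `eps` added to the denominator are assumptions made for the example and do not come from the slide.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def multiply_divide(X, num, den, eps=1e-12):
    """Element-wise X * num / den; eps (an assumption) avoids division by zero."""
    return [[x * n / (d + eps) for x, n, d in zip(rx, rn, rd)]
            for rx, rn, rd in zip(X, num, den)]


def squared_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - p) ** 2 for rv, rp in zip(V, WH) for v, p in zip(rv, rp))


def nmf(V, W, H, iterations=20):
    for _ in range(iterations):
        Wt = transpose(W)
        # H = H * (t(W) %*% V) / (t(W) %*% W %*% H)
        H = multiply_divide(H, matmul(Wt, V), matmul(matmul(Wt, W), H))
        Ht = transpose(H)
        # W = W * (V %*% t(H)) / (W %*% H %*% t(H))
        W = multiply_divide(W, matmul(V, Ht), matmul(matmul(W, H), Ht))
    return W, H


# Hypothetical data and starting point, all entries non-negative.
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W0 = [[0.5, 0.4], [0.3, 0.9], [0.8, 0.2]]
H0 = [[0.6, 0.1], [0.2, 0.7]]

W, H = nmf(V, W0, H0)
print(squared_error(V, W0, H0), squared_error(V, W, H))
```

The two update lines correspond one-to-one to the matrix expressions on the slide, but the control flow (the loop and the order of updates) still has to be written out by hand.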
WHAT’S MISSING?
Relational queries: very close to SQL style

    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;

Machine learning algorithms: very close to imperative languages

    val points = spark.textFile(…).map(parsePoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
        val grad = spark.accumulator(new Vector(D))
        for (p <- points) {
            val s = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
            grad += s * p.x
        }
        w -= grad.value
    }

Can we increase declarativity in the specification of ML algorithms?
VISION
An analytics query engine that would enable
declarative specification, operator-based execution
and cost-based optimization of machine learning
algorithms on distributed processing platforms.
DECLARATIVE SYNTAX
• A language with a SQL-like syntax

• Use of Datalog to express ML algorithms in a
declarative manner

✦ Translation of existing programming models for analytics
into Datalog programs [13]

✦ Extension of Datalog to cover a wider range of machine
learning tasks [14]
DECOMPOSITION OF LINEAR REGRESSION
Hypothesis

$$h_\theta(x) = \theta^T x$$

Cost function

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Optimization function

Repeat {
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
}
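The three parts of this decomposition (hypothesis, cost function, repeated update) can be sketched in plain Python as follows. For the squared-error cost, the partial derivative with respect to the j-th parameter works out to the mean of (prediction − target) times the j-th feature, which is what `gradient` computes. The function names, the learning rate, the iteration count and the sample data are assumptions chosen for the example.

```python
def hypothesis(theta, x):
    """Linear hypothesis: the dot product of the parameters and the features."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """Squared-error cost averaged over the m examples, with the 1/(2m) factor."""
    m = len(X)
    return sum((hypothesis(theta, x) - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)


def gradient(theta, X, y):
    """Partial derivatives of the cost with respect to each parameter."""
    m = len(X)
    errors = [hypothesis(theta, x) - yi for x, yi in zip(X, y)]
    return [sum(e * x[j] for e, x in zip(errors, X)) / m
            for j in range(len(theta))]


def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Repeat the simultaneous update of all parameters a fixed number of times."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(theta, X, y)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical data generated from y = 1 + 2x; the first feature is a constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))
```

The sketch uses only dot products, a sum over examples and a loop, so it needs three kinds of building blocks: linear algebra, aggregation and iteration.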
DEFINE ML OPERATORS
Based on the example of Linear Regression we need:

✦ Linear algebra operators

✦ Aggregation operators

✦ Iteration

✦ Optimization functions are also good candidates to
define operators
OPTIMIZATION TECHNIQUES
• Query rewriting as in databases: mostly applicable to relational operators

✦ Operators for ML algorithms would result in clearer
semantics and more opportunities for query rewriting

• Compiler optimizations (function inlining, SIMD
vectorization): useful for user-defined code
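One example of a rewrite that becomes possible once matrix multiplication is an operator with known algebraic properties: because the product is associative, an optimizer may choose the cheaper parenthesization of a chain such as t(W) %*% W %*% H. The sketch below compares the two plans under a deliberately simple cost model (the number of scalar multiplications). The cost model, the function names and the matrix shapes are assumptions for illustration, not a description of any existing optimizer.

```python
def multiply(left, right):
    """Return (scalar multiplications, result shape) for left @ right."""
    rows, inner = left
    inner2, cols = right
    if inner != inner2:
        raise ValueError("incompatible shapes")
    return rows * inner * cols, (rows, cols)


def plan_costs(a, b, c):
    """Cost of both parenthesizations of the chain A B C, given the shapes."""
    cost_ab, ab = multiply(a, b)
    cost_ab_c, _ = multiply(ab, c)
    cost_bc, bc = multiply(b, c)
    cost_a_bc, _ = multiply(a, bc)
    return {"(A B) C": cost_ab + cost_ab_c, "A (B C)": cost_bc + cost_a_bc}


def cheapest_plan(a, b, c):
    costs = plan_costs(a, b, c)
    return min(costs, key=costs.get)


# Hypothetical shapes: W is m x k and H is k x n, with k much smaller than m and n.
m, k, n = 1_000_000, 10, 1_000_000
Wt, W, H = (k, m), (m, k), (k, n)

costs = plan_costs(Wt, W, H)
best = cheapest_plan(Wt, W, H)
print(best, costs)
```

Both plans compute the same result, so the choice can be left to a cost-based optimizer instead of being fixed by the order in which the user wrote the expression.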
OPTIMIZING PROGRAMS
• Optimization techniques from machine learning

✦ Speed up training with large datasets using
stochastic gradient descent instead of batch
gradient descent

✦ Feature scaling for faster convergence

Note that these optimizations may produce slightly
different output
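To make the first technique concrete, the sketch below trains the same linear model once with batch gradient descent (one update per pass over all examples) and once with stochastic gradient descent (one update per example, in shuffled order). The data, learning rates, epoch counts and random seed are assumptions chosen for the example. Feature scaling is not shown.

```python
import random


def predict(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))


def batch_gradient_descent(X, y, alpha, epochs):
    """Each update uses the gradient averaged over the whole dataset."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(epochs):
        errors = [predict(theta, x) - yi for x, yi in zip(X, y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) / m
                for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


def stochastic_gradient_descent(X, y, alpha, epochs, seed=0):
    """Each update uses the gradient of a single, randomly ordered example."""
    theta = [0.0] * len(X[0])
    rng = random.Random(seed)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = predict(theta, X[i]) - y[i]
            theta = [t - alpha * error * xj for t, xj in zip(theta, X[i])]
    return theta


# Hypothetical noiseless data from y = 1 + 2x, with x in [0, 1].
X = [[1.0, i / 10] for i in range(11)]
y = [1.0 + 2.0 * x[1] for x in X]

theta_batch = batch_gradient_descent(X, y, alpha=0.5, epochs=500)
theta_sgd = stochastic_gradient_descent(X, y, alpha=0.1, epochs=200)
print(theta_batch, theta_sgd)
```

On this noiseless toy dataset both variants approach the same parameters. With noisy data, a different seed or fewer passes, the stochastic variant would generally end at a slightly different point. An engine that substitutes one variant for the other therefore changes the result slightly as well as the running time.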
ACTION PLAN
• Define a language with a declarative syntax

• Translate this language into a set of operators, which
express a wide range of machine learning tasks

• Based on their semantics, define algebraic properties
for query optimization

• Implement these operators on a distributed
processing framework
Thank you!
Questions?
REFERENCES
1. Apache Mahout. http://mahout.apache.org/.

2. X. Meng et al. MLlib: Machine learning in Apache Spark. CoRR, abs/1505.06807, 2015.

3. J. Cohen et al. MAD skills: New analysis practices for big data. Proc. VLDB Endow., 2(2):1481–1492, Aug. 2009.

4. K. S. Beyer et al. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272–1283, 2011.

5. C. Olston et al. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1099–1110, 2008.

6. A. Alexandrov et al. The Stratosphere platform for big data analytics. The VLDB Journal, 23(6):939–964, Dec. 2014.

7. U-SQL. http://usql.io/.

8. A. Crotty et al. Tupleware: Redefining modern analytics. CoRR, abs/1406.6667, 2014.

9. A. Ghoting et al. SystemML: Declarative machine learning on MapReduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE ’11, pages 231–242, 2011.

10. E. R. Sparks et al. MLI: An API for distributed machine learning.

11. Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI ’08, pages 1–14, 2008.

12. M. Zaharia et al. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud ’10, pages 10–10, 2010.

13. Y. Bu et al. Scaling Datalog for Machine Learning on Big Data. CoRR, abs/1203.0160, 2012.

14. V. Barany et al. Declarative Statistical Modeling with Datalog. CoRR, abs/1412.2221, 2014.