Distributed Systems from
Scratch - Part 2
Handling third party libraries
https://github.com/phatak-dev/distributedsystems
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Idea
● Motivation
● Architecture of existing big data system
● Function abstraction
● Third party libraries
● Implementing third party libraries
● MySQL task
● Code example
Idea
“What does it take to build a distributed processing system like Spark?”
Motivation
● The first version of Spark had only 1600 lines of Scala code
● It had all the basic pieces of RDD and the ability to run a distributed system using Mesos
● We recreate the same code with a step-by-step understanding
● Ample time in hand
Distributed systems from 30000ft
● Distributed Storage (HDFS/S3)
● Distributed Cluster Management (YARN/Mesos)
● Distributed Processing Systems (Spark/MapReduce)
● Data Applications
Our distributed system
● Mesos
● Scala function based abstraction
● Scala functions to express logic
Function abstraction
● The whole Spark API can be summarized as a Scala function, which can be represented as follows:
() => T
● This Scala function can be parallelized and sent over the network to run on multiple systems using Mesos
● The function is represented as a task inside the framework
● FunctionTask.scala
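As a rough sketch (not the actual FunctionTask.scala from the repository; the class shape and names here are assumptions), a function task can simply wrap a () => T closure that the framework serializes, ships over Mesos and runs on an executor:

```scala
// A minimal sketch of a function task: it wraps a zero-argument Scala
// function and exposes a run() method the executor calls after
// deserializing it. The real FunctionTask.scala may differ.
class FunctionTask[T](val body: () => T) extends Serializable {
  def run(): T = body()
}

// Usage: the framework serializes this task, ships it to an executor
// over Mesos, deserializes it there and calls run().
object FunctionTaskExample {
  def main(args: Array[String]): Unit = {
    val task = new FunctionTask(() => (1 to 10).sum)
    println(task.run()) // 55
  }
}
```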
Spark API as distributed function
● The initial API of Spark revolved around the Scala function abstraction for processing, with RDD as the data abstraction
● Every API like map and flatMap is represented as a function task which takes one parameter and returns one value
● The distribution of the functions was initially done by Mesos, and later ported to other cluster managers
● This shows how Spark started with functional programming
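To make the idea concrete, here is a small sketch (not Spark's or the repository's actual code; runOnCluster and distributedMap are hypothetical names) of how a map-style API reduces to the () => T shape, with the data and the user function captured in the closure:

```scala
object ClosureSketch {
  // Stand-in for the framework's task runner: in the real system the
  // closure would be serialized and executed on a remote executor.
  def runOnCluster[T](task: () => T): T = task()

  // A map-style API expressed as a zero-argument function task.
  def distributedMap[A, B](partition: Seq[A])(f: A => B): Seq[B] =
    runOnCluster(() => partition.map(f))

  def main(args: Array[String]): Unit = {
    println(distributedMap(Seq(1, 2, 3))(_ * 2)) // List(2, 4, 6)
  }
}
```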
Till now
● Discussion about Mesos and its abstractions
● Hello world code on Mesos
● Defining the Function interface
● Implementing
○ Scheduler to run Scala code
○ Custom executor for Scala
○ Serializing and deserializing a Scala function
● https://www.youtube.com/watch?v=Oy9ToN4O63c
What can a local function do?
● Access local data. Even in Spark, the function normally accesses HDFS-local data
● Access the classes provided by the framework
● Run any logic which can be serialized
What can it not do?
● Access classes from outside the framework
● Access the results of other functions (shuffle)
● Access lookup data (broadcast)
Need for third party libraries
● The ability to add third party libraries to a distributed processing framework is important
● Third party libraries allow us to
○ Connect to third party sources
○ Use libraries to implement custom logic, like matrix manipulation, inside the function abstraction
○ Extend the base framework using a set of libraries, e.g. spark-sql
○ Optimize for specific hardware
Approaches to third party libraries
● There are two different approaches to distributing third party jars
● UberJar: build all the dependencies along with your application code into a single jar
● The second approach is to distribute the libraries separately and add them to the classpath of the executors
● UberJar suffers from issues of jar size and versioning
● So we are going to follow the second approach, which is similar to the one followed in Spark
Design for distributing jars
[Diagram] The Scheduler/Driver runs the scheduler code alongside a jar-serving HTTP server; Executor 1 and Executor 2 each download the jars from it over HTTP.
Distributing jars
● Third party jars are distributed across the cluster over the HTTP protocol
● Whenever the scheduler/driver comes up, it starts an HTTP server to serve the jars passed to it by the user
● Whenever executors are created, the scheduler passes on the URI of the HTTP server to connect to
● Executors connect to the jar server and download the jars to their respective machines, then add them to their classpath
Code for implementing
● We need multiple changes to our existing code base to support third party jars
● The following are the different steps:
○ Implementation of an embedded HTTP server
○ Change to the scheduler to start the HTTP server
○ Change to the executor to download the jars and add them to the classpath
○ A function which uses a third party library
Http Server
● We implement an embedded HTTP server using Jetty
● Jetty is a popular HTTP server and Java servlet container from the Eclipse Foundation
● One of the strengths of Jetty is that it can be embedded inside another program to provide HTTP interfaces to certain functionality
● Initial versions of Spark used Jetty for jar distribution; newer versions use Netty
● https://eclipse.org/jetty/
● HttpServer.scala
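A minimal sketch of such an embedded server, assuming a Jetty 9-style API (the actual HttpServer.scala in the repository may be structured differently, and Jetty's API varies across versions):

```scala
import java.net.InetAddress

import org.eclipse.jetty.server.{Handler, Server, ServerConnector}
import org.eclipse.jetty.server.handler.{DefaultHandler, HandlerList, ResourceHandler}

// Embedded Jetty server that serves every file under jarDirectory over HTTP.
// The scheduler starts it once and hands `uri` to the executors.
class JarServer(jarDirectory: String) {
  private val server = new Server()
  private val connector = new ServerConnector(server)

  def start(): Unit = {
    connector.setPort(0)                     // let the OS pick any free port
    server.addConnector(connector)

    val resources = new ResourceHandler()
    resources.setResourceBase(jarDirectory)  // directory containing the jars
    resources.setDirectoriesListed(true)

    val handlers = new HandlerList()
    handlers.setHandlers(Array[Handler](resources, new DefaultHandler()))
    server.setHandler(handlers)

    server.start()
  }

  // URI of the running server, e.g. http://10.0.0.5:43123
  def uri: String =
    s"http://${InetAddress.getLocalHost.getHostAddress}:${connector.getLocalPort}"

  def stop(): Unit = server.stop()
}
```

Binding to port 0 lets the operating system choose a free port, so several schedulers can run on one machine without clashing.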
Scheduler change
● Once we have the HTTP server, we need to start it when we start our scheduler
● We will use the registered callback for creating our jar server
● As part of starting the jar server, we will copy all the jars provided by the user to a location which will become the base directory for the server
● Once we have the server running, we pass the server URI on to all the executors
● TaskScheduler.scala
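A sketch of the jar-server startup step, reusing the JarServer sketch from the previous slide and a hypothetical startJarServer helper that would be called from the scheduler's registered callback (the actual TaskScheduler.scala may differ):

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

object SchedulerJarSetup {
  // Copy the user-supplied jars into a fresh directory, start the embedded
  // HTTP server over that directory, and return its URI so the scheduler
  // can pass it on to every executor it launches.
  def startJarServer(userJars: Seq[String]): String = {
    val baseDir = Files.createTempDirectory("jar-server")
    userJars.foreach { jar =>
      val source = Paths.get(jar)
      Files.copy(source, baseDir.resolve(source.getFileName),
        StandardCopyOption.REPLACE_EXISTING)
    }
    val server = new JarServer(baseDir.toString)
    server.start()
    server.uri
  }
}
```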
Executor side
● In the executor, we download the jars by making calls to the jar server running on the master
● Once we have downloaded the jars, we add them to the classpath using a URLClassLoader
● We use this classloader to run our functions so that they have access to all the jars
● We plug this code into the registered callback of the executor so it runs only once
● TaskExecutor.scala
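A sketch of the executor-side download step, using hypothetical names (downloadJars, jarServerUri); the actual TaskExecutor.scala may organize this differently:

```scala
import java.net.{URL, URLClassLoader}
import java.nio.file.{Files, StandardCopyOption}

object ExecutorJarSetup {
  // Download every jar from the scheduler's HTTP server to a local
  // directory and build a classloader that sees those jars on top of
  // the framework classes.
  def downloadJars(jarServerUri: String, jarNames: Seq[String]): URLClassLoader = {
    val localDir = Files.createTempDirectory("downloaded-jars")
    val localUrls = jarNames.map { name =>
      val target = localDir.resolve(name)
      val in = new URL(s"$jarServerUri/$name").openStream()
      try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
      target.toUri.toURL
    }
    new URLClassLoader(localUrls.toArray, getClass.getClassLoader)
  }
}
```

The executor can then, for example, set this loader as the thread context classloader before running its function tasks, so deserialized closures can see the downloaded jars.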
MySQL function
● This example is a function which accesses the MySQL driver classes to run JDBC queries against a MySQL instance
● We ship the MySQL jar using our jar distribution framework, so it is not part of our application jar
● There is no change in our function API, as it is a normal function like the other examples
● MySQLTask.scala
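A sketch of such a task, assuming hypothetical connection details and table names (the actual MySQLTask.scala may differ):

```scala
import java.sql.DriverManager

object MySqlTaskSketch {
  // A () => T style task: the MySQL JDBC driver class is only available
  // because the mysql jar was downloaded from the jar server and added
  // to the executor's classpath.
  val task: () => Seq[String] = () => {
    Class.forName("com.mysql.jdbc.Driver")
    val connection = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/test", "user", "password")
    try {
      val rs = connection.createStatement().executeQuery("select name from users")
      val names = scala.collection.mutable.ArrayBuffer[String]()
      while (rs.next()) names += rs.getString("name")
      names.toSeq
    } finally connection.close()
  }
}
```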
References
● http://blog.madhukaraphatak.com/mesos-single-node-setup-ubuntu/
● http://blog.madhukaraphatak.com/mesos-helloworld-scala/
● http://blog.madhukaraphatak.com/custom-mesos-executor-scala/
● http://blog.madhukaraphatak.com/distributing-third-party-libraries-in-mesos/
