SlideShare a Scribd company logo
Introduction to Spark
Eric Eijkelenboom - UserReport - userreport.com
• What is Spark and why should I care?
• Architecture and programming model
• Examples
• Mini demo
• Related projects
Introduction to apache spark
RTFM
• A general-purpose computation framework that leverages distributed
memory
• More flexible than MapReduce (it supports general execution graphs)
• Linear scalability and fault tolerance
• It supports a rich set of higher-level tools including
• Shark (Hive on Spark) and Spark SQL
• MLlib for machine learning
• GraphX for graph processing
• Spark Streaming
Who cares?
!
!
!
!
• Slow due to serialisation & replication
• Inefficient for iterative computing & interactive
querying
Limitations of MapReduce
Input iter. 1 iter. 2 . . .
HDFS

read
HDFS

write
HDFS

read
HDFS

write
Map
Map
Map
Reduce
Reduce
Input Output
Leveraging memory
iter. 1 iter. 2 . . .
Input
HDFS

read
HDFS

write
HDFS

read
HDFS

write
Leveraging memory
iter. 1 iter. 2 . . .
Input
iter. 1 iter. 2 . . .
Input
HDFS

read
HDFS

write
HDFS

read
HDFS

write
Leveraging memory
iter. 1 iter. 2 . . .
Input
iter. 1 iter. 2 . . .
Input
HDFS

read
HDFS

write
HDFS

read
HDFS

write
Not tied to 2-stage
MapReduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly
So, Spark is…
• In-memory analytics, many times faster than
Hadoop/Hive
• Designed for running iterative algorithms &
interactive querying
• Highly compatible with Hadoop’s Storage APIs
• Can run on your existing Hadoop Cluster Setup
• Programming in Scala, Python or Java
Spark stack
Architecture
HDFS
Datanode Datanode Datanode....
Spark Worker Spark Worker Spark Worker
....
Cache Cache Cache
Block Block Block
Cluster Manager
Spark Driver (Master)
Architecture
HDFS
Datanode Datanode Datanode....
Spark Worker Spark Worker Spark Worker
....
Cache Cache Cache
Block Block Block
Cluster Manager
Spark Driver (Master)
• YARN
• Mesos
• Standalone
Programming model
• Resilient Distributed Datasets (RDDs) are basic building blocks
• Distributed collection of objects, cached in-memory across
cluster nodes
• Automatically rebuilt on failure
• RDD operations
• Transformations: create new RDDs from existing ones
• Actions: return a value to the master node after running a
computation on the dataset
As you know…
• … Hadoop is a distributed system for counting
words
• Here is how it’s done is Spark
As you know…
• … Hadoop is a distributed system for counting
words
• Here is how it’s done is Spark
Blue code: Spark operations
Red code: functions
(closures) that get passed
to the cluster automatically
Text search
Text search
In memory text search:
!
!
caches the RDD in memory for faster reuse
Logistic regression
!
• 100 GB of data on a 100 node cluster
Easy unit testing
Spark shell
Mini demo
Introduction to apache spark
Hive on Spark = Shark
• A large scale data warehouse system just like Hive
• Highly compatible with Hive (HQL, metastore,
serialization formats, and UDFs)
• Built on top of Spark (thus a faster execution engine)
• Provision of creating in-memory materialized tables
(Cached Tables)
• And cached tables utilise columnar storage instead of
raw storage
Shark
Shark uses the existing Hive client and metastore
MLlib
• Machine learning library based on Spark
!
!
• Supports a range of machine learning algorithms,
including classification, regression, clustering,
collaborative filtering, dimensionality reduction, and
more
Spark Streaming
• Write streaming applications in the same way as
batch applications
• Reuse code between batch processing and
streaming
• Write more than analytics applications:
• Join streams against historical data
• Run ad-hoc queries on stream state
Spark Streaming
• Count tweets on a sliding window
!
!
• Find words with higher frequency than historic data
GraphX: graph computing
Introduction to apache spark
Ad

More Related Content

What's hot (20)

Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
Shravan (Sean) Pabba
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark Summit
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
prajods
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlib
pumaranikar
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
Spark Summit
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
Muktadiur Rahman
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Spark Summit
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedin
Yukti Kaura
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
Ryan Bosshart
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark Group
Phaneendra Chiruvella
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Spark Summit
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Databricks
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark Summit
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
prajods
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlib
pumaranikar
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
Spark Summit
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
Muktadiur Rahman
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Spark Summit
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedin
Yukti Kaura
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark Group
Phaneendra Chiruvella
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Spark Summit
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Databricks
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 

Viewers also liked (14)

La filiera Gas
La filiera GasLa filiera Gas
La filiera Gas
Enegan
 
Monitoring tool partner school
Monitoring tool partner schoolMonitoring tool partner school
Monitoring tool partner school
evangeline quibuyen
 
Marketing Cloud
Marketing CloudMarketing Cloud
Marketing Cloud
Oorjit
 
Presentation 224 b margie ware insurance and benefits counseling as a core se...
Presentation 224 b margie ware insurance and benefits counseling as a core se...Presentation 224 b margie ware insurance and benefits counseling as a core se...
Presentation 224 b margie ware insurance and benefits counseling as a core se...
The ALS Association
 
Bestow Showcase: Godrej Chotukool
Bestow Showcase: Godrej ChotukoolBestow Showcase: Godrej Chotukool
Bestow Showcase: Godrej Chotukool
Bestow
 
Bridal in Knysna 060 529 6330
Bridal in Knysna 060 529 6330Bridal in Knysna 060 529 6330
Bridal in Knysna 060 529 6330
knysnaarea
 
Vioxx tn2015
Vioxx tn2015Vioxx tn2015
Vioxx tn2015
Pooja Kashyap
 
Key challenges to scale up climate change resillience in Botswana lr experien...
Key challenges to scale up climate change resillience in Botswana lr experien...Key challenges to scale up climate change resillience in Botswana lr experien...
Key challenges to scale up climate change resillience in Botswana lr experien...
PROCASUR Corporation / Corporación PROCASUR
 
Structural Resin Injection for Soil Stabilization
Structural Resin Injection for Soil StabilizationStructural Resin Injection for Soil Stabilization
Structural Resin Injection for Soil Stabilization
Uretek Mid-Atlantic
 
Presentation 208 b sue walsh_an evaluation of newly diagnosed patient needs
Presentation 208 b sue walsh_an evaluation of newly diagnosed  patient needsPresentation 208 b sue walsh_an evaluation of newly diagnosed  patient needs
Presentation 208 b sue walsh_an evaluation of newly diagnosed patient needs
The ALS Association
 
Call for Application: Learning Initiative on Practical solutions to adapt to ...
Call for Application: Learning Initiative on Practical solutions to adapt to ...Call for Application: Learning Initiative on Practical solutions to adapt to ...
Call for Application: Learning Initiative on Practical solutions to adapt to ...
PROCASUR Corporation / Corporación PROCASUR
 
A Low Impact Solution for Increasing Existing Structural Loads
A Low Impact Solution for Increasing Existing Structural LoadsA Low Impact Solution for Increasing Existing Structural Loads
A Low Impact Solution for Increasing Existing Structural Loads
Uretek Mid-Atlantic
 
Ws routesa regional workshop_december 2014
Ws routesa regional workshop_december 2014Ws routesa regional workshop_december 2014
Ws routesa regional workshop_december 2014
PROCASUR Corporation / Corporación PROCASUR
 
Prezi andrea
Prezi andreaPrezi andrea
Prezi andrea
AnyJr
 
La filiera Gas
La filiera GasLa filiera Gas
La filiera Gas
Enegan
 
Marketing Cloud
Marketing CloudMarketing Cloud
Marketing Cloud
Oorjit
 
Presentation 224 b margie ware insurance and benefits counseling as a core se...
Presentation 224 b margie ware insurance and benefits counseling as a core se...Presentation 224 b margie ware insurance and benefits counseling as a core se...
Presentation 224 b margie ware insurance and benefits counseling as a core se...
The ALS Association
 
Bestow Showcase: Godrej Chotukool
Bestow Showcase: Godrej ChotukoolBestow Showcase: Godrej Chotukool
Bestow Showcase: Godrej Chotukool
Bestow
 
Bridal in Knysna 060 529 6330
Bridal in Knysna 060 529 6330Bridal in Knysna 060 529 6330
Bridal in Knysna 060 529 6330
knysnaarea
 
Structural Resin Injection for Soil Stabilization
Structural Resin Injection for Soil StabilizationStructural Resin Injection for Soil Stabilization
Structural Resin Injection for Soil Stabilization
Uretek Mid-Atlantic
 
Presentation 208 b sue walsh_an evaluation of newly diagnosed patient needs
Presentation 208 b sue walsh_an evaluation of newly diagnosed  patient needsPresentation 208 b sue walsh_an evaluation of newly diagnosed  patient needs
Presentation 208 b sue walsh_an evaluation of newly diagnosed patient needs
The ALS Association
 
A Low Impact Solution for Increasing Existing Structural Loads
A Low Impact Solution for Increasing Existing Structural LoadsA Low Impact Solution for Increasing Existing Structural Loads
A Low Impact Solution for Increasing Existing Structural Loads
Uretek Mid-Atlantic
 
Prezi andrea
Prezi andreaPrezi andrea
Prezi andrea
AnyJr
 
Ad

Similar to Introduction to apache spark (20)

Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practice
Darko Marjanovic
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Marius Soutier
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
Dorian Beganovic
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Anirudh
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
cdmaxime
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Mac Moore
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practice
Darko Marjanovic
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Anirudh
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
cdmaxime
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Mac Moore
 
Ad

Recently uploaded (20)

Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 

Introduction to apache spark

  • 1. Introduction to Spark Eric Eijkelenboom - UserReport - userreport.com
  • 2. • What is Spark and why should I care? • Architecture and programming model • Examples • Mini demo • Related projects
  • 4. RTFM • A general-purpose computation framework that leverages distributed memory • More flexible than MapReduce (it supports general execution graphs) • Linear scalability and fault tolerance • It supports a rich set of higher-level tools including • Shark (Hive on Spark) and Spark SQL • MLlib for machine learning • GraphX for graph processing • Spark Streaming
  • 6. ! ! ! ! • Slow due to serialisation & replication • Inefficient for iterative computing & interactive querying Limitations of MapReduce Input iter. 1 iter. 2 . . . HDFS
 read HDFS
 write HDFS
 read HDFS
 write Map Map Map Reduce Reduce Input Output
  • 7. Leveraging memory iter. 1 iter. 2 . . . Input HDFS
 read HDFS
 write HDFS
 read HDFS
 write
  • 8. Leveraging memory iter. 1 iter. 2 . . . Input iter. 1 iter. 2 . . . Input HDFS
 read HDFS
 write HDFS
 read HDFS
 write
  • 9. Leveraging memory iter. 1 iter. 2 . . . Input iter. 1 iter. 2 . . . Input HDFS
 read HDFS
 write HDFS
 read HDFS
 write Not tied to 2-stage MapReduce paradigm 1. Extract a working set 2. Cache it 3. Query it repeatedly
  • 10. So, Spark is… • In-memory analytics, many times faster than Hadoop/Hive • Designed for running iterative algorithms & interactive querying • Highly compatible with Hadoop’s Storage APIs • Can run on your existing Hadoop Cluster Setup • Programming in Scala, Python or Java
  • 12. Architecture HDFS Datanode Datanode Datanode.... Spark Worker Spark Worker Spark Worker .... Cache Cache Cache Block Block Block Cluster Manager Spark Driver (Master)
  • 13. Architecture HDFS Datanode Datanode Datanode.... Spark Worker Spark Worker Spark Worker .... Cache Cache Cache Block Block Block Cluster Manager Spark Driver (Master) • YARN • Mesos • Standalone
  • 14. Programming model • Resilient Distributed Datasets (RDDs) are basic building blocks • Distributed collection of objects, cached in-memory across cluster nodes • Automatically rebuilt on failure • RDD operations • Transformations: create new RDDs from existing ones • Actions: return a value to the master node after running a computation on the dataset
  • 15. As you know… • … Hadoop is a distributed system for counting words • Here is how it’s done is Spark
  • 16. As you know… • … Hadoop is a distributed system for counting words • Here is how it’s done is Spark Blue code: Spark operations Red code: functions (closures) that get passed to the cluster automatically
  • 18. Text search In memory text search: ! ! caches the RDD in memory for faster reuse
  • 19. Logistic regression ! • 100 GB of data on a 100 node cluster
  • 24. Hive on Spark = Shark • A large scale data warehouse system just like Hive • Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs) • Built on top of Spark (thus a faster execution engine) • Provision of creating in-memory materialized tables (Cached Tables) • And cached tables utilise columnar storage instead of raw storage
  • 25. Shark Shark uses the existing Hive client and metastore
  • 26. MLlib • Machine learning library based on Spark ! ! • Supports a range of machine learning algorithms, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and more
  • 27. Spark Streaming • Write streaming applications in the same way as batch applications • Reuse code between batch processing and streaming • Write more than analytics applications: • Join streams against historical data • Run ad-hoc queries on stream state
  • 28. Spark Streaming • Count tweets on a sliding window ! ! • Find words with higher frequency than historic data