Certified Apache Spark and Scala Training – DataFlair
Introduction to Apache Spark
Agenda
• Before Spark
• Need for Spark
• What is Apache Spark?
• Goals
• Why Spark?
• RDD and its Operations
• Features of Spark
Before Spark
• Batch processing
• Stream processing
• Interactive processing
• Graph processing
• Machine learning
Need For Spark
• Need for a powerful engine that can process data in real time
(streaming) as well as in batch mode
• Need for a powerful engine that can respond in sub-second time and
perform in-memory analytics
• Need for a powerful engine that can handle diverse workloads:
– Batch
– Streaming
– Interactive
– Graph
– Machine Learning
What is Apache Spark?
Apache Spark is a powerful open source engine which can handle:
– Batch processing
– Real-time (stream) processing
– Interactive processing
– Graph processing
– Machine learning (iterative)
– In-memory processing
Introduction to Apache Spark
• Lightning-fast cluster computing tool
• General-purpose distributed system
• Provides APIs in Scala, Java, Python, and R
History
Timeline, 2009 to 2015:
• Introduced by UC Berkeley
• Open sourced
• Donated to Apache
• Became a top-level project
• World record in sorting
• Most active project at Apache
Sort Record (Src: Databricks)
                          Hadoop MapReduce    Spark
Data size                 102.5 TB            100 TB
Time taken                72 min              23 min
No. of nodes              2100                206
No. of cores              50400 physical      6592 virtualized
Cluster disk throughput   3150 GB/s           618 GB/s
Network                   Dedicated 10 Gbps   Virtualized 10 Gbps
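The sort record figures are easier to compare as ratios. The short sketch below does that arithmetic using only the numbers quoted on the slide; the dictionary and function names are illustrative and not taken from the deck.

```python
# Figures copied from the sort record slide (source: Databricks).
hadoop = {"data_tb": 102.5, "minutes": 72, "nodes": 2100}
spark = {"data_tb": 100.0, "minutes": 23, "nodes": 206}


def tb_per_minute(run):
    """Sorting rate of the whole cluster, in TB per minute."""
    return run["data_tb"] / run["minutes"]


def tb_per_node_minute(run):
    """Sorting rate normalized by the number of nodes."""
    return tb_per_minute(run) / run["nodes"]


time_ratio = hadoop["minutes"] / spark["minutes"]   # how much less time Spark took
node_ratio = hadoop["nodes"] / spark["nodes"]       # how many fewer nodes Spark used
per_node_ratio = tb_per_node_minute(spark) / tb_per_node_minute(hadoop)

print(f"time ratio:     {time_ratio:.1f}x")
print(f"node ratio:     {node_ratio:.1f}x")
print(f"per-node rate:  {per_node_ratio:.1f}x")
```

Spark finished in roughly a third of the time on roughly a tenth of the nodes, which is why the per-node rate differs by far more than the raw time does.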
Goals
One stack to rule them all: batch, streaming, and interactive.
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with existing open source ecosystem
Why Spark?
• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation.
  Disk-based: Disk → Operation 1 → Disk → Operation 2 → Disk → … → Operation n → Disk
  In-memory: Disk → Operation 1 → Operation 2 → … → Operation n → Disk
• Language support for Scala, Java, Python, and R.
• Supports real-time and batch processing.
  Input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data
• Lazy operations: the job is optimized before execution.
• Support for multiple transformations and actions.
  RDD1 → Transformation 1, map() → RDD2 → Transformation 2, filter() → RDD3 → Action (collect) → Result
• Compatible with Hadoop; can process existing Hadoop data.
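The real-time and batch point (an input stream cut into batches that the engine then processes) can be mimicked in plain Python. This is a toy model and not the Spark Streaming API: `micro_batches`, `process_batch`, and the batch size of 3 are hypothetical. Batches are cut by record count here to keep the run deterministic, whereas a streaming engine would typically cut them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly endless) iterator into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(records):
    """Ordinary batch logic: total length of the words in one batch."""
    return sum(len(word) for word in records)


# Hypothetical input stream; a real job would read from a live source.
input_stream = iter(["spark", "runs", "batch", "and", "stream", "jobs", "alike"])

# The same batch function is applied to every micro-batch of the stream.
processed = [process_batch(batch) for batch in micro_batches(input_stream, 3)]
print(processed)
```

The design point is that `process_batch` knows nothing about streaming; the same function would work on a complete dataset, which is the sense in which one engine serves both modes.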
Spark Architecture
Spark Nodes
• Master node: Master
• Slave nodes: Worker
Basic Spark Architecture
[Figure: one unit of Work split into many Sub Work units]
Resilient Distributed Dataset (RDD)
• An RDD is a simple, immutable collection of objects (Obj1, Obj2, Obj3, …, Obj n).
• An RDD can contain any type of Scala, Java, Python, or R object.
• Each RDD is split into multiple partitions (Partition1 to Partition6 in the figure), which may be computed on
different nodes of the cluster.
Resilient Distributed Dataset (RDD): Create RDD
[Figure: Employee-data.txt is stored as blocks B1 to B12 on a Hadoop cluster; creating an RDD from it yields Partition-1, Partition-2, Partition-3, Partition-4, Partition-5, …]
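The two RDD properties above, immutability and partitioning, can be sketched with a small plain-Python class. This is a toy model and not Spark's RDD: `ToyPartitionedData`, `from_items`, and the contiguous chunking rule are assumptions made for illustration, and the `B1` to `B12` strings merely stand in for the blocks in the figure.

```python
import math


class ToyPartitionedData:
    """Immutable collection whose items are grouped into partitions."""

    def __init__(self, partitions):
        # Tuples are used so that the stored data cannot be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_items(cls, items, num_partitions):
        items = list(items)
        # Contiguous chunks; an arbitrary placement rule chosen for this sketch.
        size = math.ceil(len(items) / num_partitions) if items else 0
        return cls(items[i * size:(i + 1) * size] for i in range(num_partitions))

    @property
    def partitions(self):
        return self._partitions

    def map(self, fn):
        # Each partition is handled independently, so each could run on a
        # different node. A new object is returned; self is left untouched.
        return ToyPartitionedData(tuple(fn(x) for x in p) for p in self._partitions)

    def collect(self):
        return [x for p in self._partitions for x in p]


blocks = [f"B{i}" for i in range(1, 13)]        # stand-ins for B1 to B12
data = ToyPartitionedData.from_items(blocks, num_partitions=6)
lowered = data.map(str.lower)

print(data.partitions[0])
print(lowered.collect()[:3])
```

Because `map` builds a new object instead of changing the old one, `data` can still be reused after `lowered` has been derived from it.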
RDD Operations
• Transformations
• Actions
• Persistence
RDD Operations – Transformation
Transformation:
• A set of operations that define how an RDD should be transformed
• Creates a new RDD from the existing one to process the data
• Lazy evaluation: computation doesn't start until an action is called
• E.g. map, flatMap, filter, union, groupBy, etc.
RDD Operations – Action
Action:
• Triggers job execution.
• Returns the result to the driver or writes it to storage.
• E.g. count, collect, reduce, take, etc.
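The split between lazy transformations and job-triggering actions can be illustrated without Spark. The sketch below is a toy model in plain Python, not the Spark API: `LazyDataset` and the `calls` list are hypothetical names, and only `map`, `filter`, `collect`, and `count` are modeled.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source         # a re-iterable input, e.g. a list
        self._steps = tuple(steps)    # pending (kind, function) pairs

    # Transformations: return a new LazyDataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger execution and return a plain Python value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def double(x):
    calls.append(x)    # side effect used to observe when work actually happens
    return x * 2


pipeline = LazyDataset([1, 2, 3, 4, 5]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)    # nothing has run yet
result = pipeline.collect()         # the action triggers the whole chain
print(calls_before_action, result)
```

Deferring the work until `collect()` is what gives an engine the chance to look at the whole chain before running it.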
RDD Operations – Persistence
Persistence:
• Spark allows caching (persisting) an entire dataset in memory
• The cached RDD is kept in memory and reused by future operations
[Figure: Primary Storage and Cache]
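The benefit of persistence shows up when several actions run on the same dataset. The following is a minimal sketch in plain Python, assuming a hypothetical `ToyDataset` class and an `expensive_load` function that counts how often the source is recomputed; it models the idea of caching and is not Spark's implementation.

```python
class ToyDataset:
    """Recomputes its data on every action unless cache() was requested."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function producing the items
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only a marker: nothing is computed until the first action.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return self._cached
        return list(self._compute())

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


runs = {"n": 0}


def expensive_load():
    runs["n"] += 1                   # counts how often the source is recomputed
    return [x * x for x in range(5)]


uncached = ToyDataset(expensive_load)
uncached.count()
uncached.total()
runs_without_cache = runs["n"]       # one computation per action

runs["n"] = 0
cached = ToyDataset(expensive_load).cache()
cached.count()
cached.total()
runs_with_cache = runs["n"]          # the second action reuses the cached list

print(runs_without_cache, runs_with_cache)
```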
RDD Operations
• Transformations (map(), flatMap()…): create a new RDD based on custom business logic
• Actions (saveAsTextFile(), count()…): return output to the driver or export data to a storage system after computation
[Figure: Parent RDD → Transformations → RDD (lineage) → Actions → Result]
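The lineage shown in the figure, where each dataset remembers the parent it came from, is what allows lost results to be rebuilt. The sketch below illustrates that idea in plain Python under stated assumptions: `LineageNode` is a hypothetical class, every derivation is an element-wise function, and "losing" data is simulated by resetting a field. It is not how Spark stores lineage internally.

```python
class LineageNode:
    """A dataset that remembers its parent and the function that derived it."""

    def __init__(self, name, parent=None, fn=None, source=None):
        self.name = name
        self.parent = parent
        self.fn = fn
        self.source = source      # only the root node holds actual input data
        self.data = None          # materialized result; may be lost

    def derive(self, name, fn):
        """Transformation: create a child node without computing anything."""
        return LineageNode(name, parent=self, fn=fn)

    def lineage(self):
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute(self):
        """Action: materialize this node, replaying parents where needed."""
        if self.data is None:
            if self.parent is None:
                self.data = list(self.source)
            else:
                self.data = [self.fn(x) for x in self.parent.compute()]
        return self.data


parent = LineageNode("parent", source=[1, 2, 3])
squared = parent.derive("squared", lambda x: x * x)
shifted = squared.derive("shifted", lambda x: x + 1)

first = shifted.compute()
squared.data = None               # simulate losing the computed results
shifted.data = None
recovered = shifted.compute()     # rebuilt by replaying the lineage

print(shifted.lineage(), recovered)
```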
Features of Spark
• Processing: diverse processing platform
• Memory management: automatic memory management
• Window criteria: time-based window criteria
• Fault tolerance: recovers automatically
• Duplicate elimination: processes every record exactly once
• Speed: up to 100x faster than Hadoop
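A time-based window criterion groups records by when they arrived. The slide gives no details, so the sketch below assumes the simplest variant, fixed non-overlapping (tumbling) windows, and uses a hypothetical `tumbling_windows` helper with made-up events; it is plain Python and not Spark's windowing API.

```python
from collections import defaultdict


def tumbling_windows(events, width_seconds):
    """Group (timestamp_seconds, value) pairs into fixed, non-overlapping windows."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Every timestamp maps to the start of the window that contains it.
        window_start = (timestamp // width_seconds) * width_seconds
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


# Made-up events: (arrival time in seconds, payload).
events = [(0, "a"), (4, "b"), (9, "c"), (10, "d"), (17, "e"), (25, "f")]

windows = tumbling_windows(events, 10)
counts = {start: len(values) for start, values in windows.items()}
print(counts)
```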
Thank You
DataFlair
Ad

More Related Content

What's hot (20)

Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | Edureka
Edureka!
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Spark
SparkSpark
Spark
Heena Madan
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Spark
SparkSpark
Spark
Koushik Mondal
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Edureka!
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Apache spark
Apache sparkApache spark
Apache spark
TEJPAL GAUTAM
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
Prashanth Babu
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | Edureka
Edureka!
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Edureka!
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!
 

Similar to Introduction to apache spark (20)

Module01
 Module01 Module01
Module01
NPN Training
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Dharmjit Singh
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala
Edureka!
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
Austin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - SparkAustin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - Spark
Steve Blackmon
 
Apache spark
Apache sparkApache spark
Apache spark
Prashant Pranay
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
DataFactZ
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
Edureka!
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala
Edureka!
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
Austin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - SparkAustin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - Spark
Steve Blackmon
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
DataFactZ
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
Edureka!
 
Ad

Recently uploaded (20)

2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Ad

Introduction to apache spark

  • 1. Certified Apache Spark and Scala Training – DataFlair Introduction to Apache Spark
  • 2. Certified Apache Spark and Scala Training – DataFlair  Before Spark  Need for Spark  What is Apache Spark ?  Goals  Why Spark ?  RDD & its Operations  Features Of Spark Agenda
  • 3. Certified Apache Spark and Scala Training – DataFlair Before Spark Batch Processing Stream Processing Interactive Processing Graph Processing Machine Learning
  • 4. Certified Apache Spark and Scala Training – DataFlair Need For Spark • Need for a powerful engine that can process the data in Real-Time (streaming) as well as in Batch mode • Need for a powerful engine that can respond in Sub-second and perform In-memory analytics • Need for a powerful engine that can handle diverse workloads: – Batch – Streaming – Interactive – Graph – Machine Learning
  • 5. Certified Apache Spark and Scala Training – DataFlair Apache Spark is a powerful open source engine which can handle: – Batch processing – Real-time (stream) – Interactive – Graph – Machine Learning (Iterative) – In-memory What is Apache Spark?
  • 6. Certified Apache Spark and Scala Training – DataFlair Introduction to Apache Spark  Lightening fast cluster computing tool  General purpose distributed system  Provides APIs in Scala, Java, Python, and R
  • 7. Certified Apache Spark and Scala Training – DataFlair History Introduced by UC Berkeley Open Sourced Donated to Apache Became Top-level project World record in sorting Most active project at Apache 2010 2011 2012 2013 2014 20152009
  • 8. Sort Record (source: Databricks)
      Metric                     Hadoop MapReduce      Spark
      Data size                  102.5 TB              100 TB
      Time taken                 72 min                23 min
      Number of nodes            2100                  206
      Number of cores            50400 physical        6592 virtualized
      Cluster disk throughput    3150 GB/s             618 GB/s
      Network                    Dedicated 10 Gbps     Virtualized 10 Gbps
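The sort record figures can be turned into a few ratios with simple arithmetic. The sketch below is only a back-of-the-envelope calculation on the slide's numbers; the variable names and the 1 TB = 1000 GB convention are assumptions made for illustration.

```python
# Back-of-the-envelope ratios from the sort record slide.
# All inputs are the slide's figures; 1 TB = 1000 GB is an assumed convention.
hadoop = {"data_tb": 102.5, "minutes": 72, "nodes": 2100}
spark = {"data_tb": 100.0, "minutes": 23, "nodes": 206}


def gb_per_min_per_node(run):
    """Sorted data volume per minute per node, in GB."""
    return run["data_tb"] * 1000 / run["minutes"] / run["nodes"]


time_ratio = hadoop["minutes"] / spark["minutes"]  # how much shorter the run was
node_ratio = hadoop["nodes"] / spark["nodes"]      # how many fewer machines were used
per_node_ratio = gb_per_min_per_node(spark) / gb_per_min_per_node(hadoop)

print(f"Time: {time_ratio:.1f}x shorter")                  # Time: 3.1x shorter
print(f"Nodes: {node_ratio:.1f}x fewer")                   # Nodes: 10.2x fewer
print(f"Per-node sort rate: {per_node_ratio:.0f}x higher")  # Per-node sort rate: 31x higher
```

The two runs used different hardware (physical versus virtualized cores, dedicated versus virtualized network), so these ratios are a rough comparison rather than a controlled benchmark.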
  • 9. Goals: One stack to rule them all (batch, streaming, interactive). Easy to combine batch, streaming, and interactive computations.
  • 10. Goals: Easy to combine batch, streaming, and interactive computations. Easy to develop sophisticated algorithms.
  • 11. Goals: Easy to combine batch, streaming, and interactive computations. Easy to develop sophisticated algorithms. Compatible with the existing open-source ecosystem.
  • 12. Why Spark? Up to 100x faster than Hadoop MapReduce.
  • 13. Why Spark? Up to 100x faster than Hadoop MapReduce. In-memory computation. [Diagram: operations chained between disk reads and writes; partial build of the figure on the next slide]
  • 14. Why Spark? Up to 100x faster than Hadoop MapReduce. In-memory computation. [Diagram: disk-based pipeline: Disk → Operation 1 → Disk → Operation 2 → Disk → … → Operation n → Disk; in-memory pipeline: Disk → Operation 1 → Operation 2 → … → Operation n → Disk]
  • 15. Why Spark? Up to 100x faster than Hadoop MapReduce. In-memory computation. Language support for Scala, Java, Python, and R.
  • 16. Why Spark? Up to 100x faster than Hadoop MapReduce. In-memory computation. Language support for Scala, Java, Python, and R. Supports real-time and batch processing. [Diagram: input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data]
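The streaming flow on this slide (an input stream cut into batches, each batch handed to the batch engine) can be sketched in plain Python. This is a minimal illustration, not Spark Streaming's API: `micro_batches` and `process_batch` are hypothetical names, and batches are cut by record count here, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterator into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Stand-in for the batch engine: here it simply sums the batch."""
    return sum(batch)


input_stream = range(1, 11)  # pretend these records arrive over time
processed = [process_batch(b) for b in micro_batches(input_stream, batch_size=4)]
print(processed)  # [10, 26, 19]
```

Because each batch is an ordinary collection, the same `process_batch` logic could be reused for a one-off batch job, which is the point of running streaming on top of a batch engine.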
  • 17. Why Spark? Up to 100x faster than Hadoop MapReduce. In-memory computation. Language support for Scala, Java, Python, and R. Supports real-time and batch processing. Lazy operations: the job is optimized before execution.
  • 18. Why Spark? Up to 100x faster than Hadoop MapReduce. In-memory computation. Language support for Scala, Java, Python, and R. Supports real-time and batch processing. Lazy operations: the job is optimized before execution. Support for multiple transformations and actions. [Diagram: RDD1 → Transformation 1, map() → RDD2 → Transformation 2, filter() → RDD3 → Action, collect() → Result]
  • 19. Why Spark? Up to 100x faster than Hadoop MapReduce. In-memory computation. Language support for Scala, Java, Python, and R. Supports real-time and batch processing. Lazy operations: the job is optimized before execution. Support for multiple transformations and actions. Compatible with Hadoop; can process existing Hadoop data.
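The chain RDD1 → map() → RDD2 → filter() → RDD3 → collect() and the idea of lazy operations can be mimicked in a few lines of plain Python. This is a toy sketch under stated assumptions: `LazyDataset` is a hypothetical class, not part of Spark or PySpark, and it only records a plan and runs it when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs nothing until an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset; the original is left unchanged.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also just extends the recorded plan.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded plan executed.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def double(x):
    calls.append(x)  # lets us observe when work actually happens
    return x * 2


rdd1 = LazyDataset([1, 2, 3, 4, 5])
rdd2 = rdd1.map(double)               # Transformation 1
rdd3 = rdd2.filter(lambda x: x > 4)   # Transformation 2
print(len(calls))      # 0: nothing has run yet
print(rdd3.collect())  # [6, 8, 10]
print(len(calls))      # 5: the work happened inside collect()
```

Deferring execution until the action is what gives an engine the chance to look at the whole plan before running it.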
  • 20. Spark Architecture
  • 21. Spark Nodes: [Diagram: nodes are divided into a master node (Master) and slave nodes (Worker)]
  • 22. Basic Spark Architecture: [Diagram: a unit of Work divided into many Sub Work units]
  • 23. Resilient Distributed Dataset (RDD): An RDD is a simple, immutable collection of objects. [Diagram: an RDD containing Obj1, Obj2, Obj3, …, Obj n]
  • 24. Resilient Distributed Dataset (RDD): An RDD is a simple, immutable collection of objects. An RDD can contain any type of Scala, Java, Python, or R objects.
  • 25. Resilient Distributed Dataset (RDD): An RDD is a simple, immutable collection of objects. An RDD can contain any type of Scala, Java, Python, or R objects. Each RDD is split into partitions, which may be computed on different nodes of the cluster. [Diagram: an RDD made up of Partition1 through Partition6]
  • 26. Resilient Distributed Dataset (RDD): [Diagram: Employee-data.txt stored as blocks B1 to B12 on a Hadoop cluster; creating an RDD from the file yields Partition-1, Partition-2, Partition-3, Partition-4, Partition-5, …]
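The idea that an RDD is split into partitions that may be computed on different nodes can be illustrated with a small stand-alone sketch. Everything here is an assumption for illustration: the employee names are made-up sample data, `split_into_partitions` and `count_long_names` are hypothetical helpers, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def count_long_names(partition):
    """Per-partition task: runs independently of the other partitions."""
    return sum(1 for name in partition if len(name) > 3)


# Made-up sample records standing in for the contents of an employee file.
employees = ["Ann", "Bharat", "Chen", "Dia", "Emeka", "Fay", "Goran"]
partitions = split_into_partitions(employees, 3)

# Threads play the role of worker nodes, one task per partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_long_names, partitions))

print(partitions)           # [['Ann', 'Bharat', 'Chen'], ['Dia', 'Emeka'], ['Fay', 'Goran']]
print(partial_counts)       # [2, 1, 1]
print(sum(partial_counts))  # 4
```

Each task touches only its own partition, so the partial results can be computed in any order and combined at the end.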
  • 27. RDD Operations: Transformations, Actions, Persistence
  • 28. RDD Operations – Transformation: A set of operations that define how an RDD should be transformed. Creates a new RDD from the existing one to process the data. Lazy evaluation: computation doesn't start until an action is called. E.g. map(), flatMap(), filter(), union(), groupBy(), etc.
  • 29. RDD Operations – Action: Triggers job execution. Returns the result or writes it to storage. E.g. count(), collect(), reduce(), take(), etc.
  • 30. RDD Operations – Persistence: Spark allows caching (persisting) an entire dataset in memory. Caches the RDD in memory for future operations. [Diagram: primary storage and cache]
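The effect of persistence, computing a dataset once and reusing it for later operations, can be shown with a toy class. This is a sketch under assumptions, not Spark's implementation: `CachedDataset` and its `compute_count` counter are hypothetical and exist only to make the recomputation visible.

```python
class CachedDataset:
    """Toy dataset that recomputes on every action unless persist() was called."""

    def __init__(self, source, fn):
        self._source = source
        self._fn = fn
        self._persist = False
        self._cache = None
        self.compute_count = 0  # how many times the full computation ran

    def persist(self):
        # Mark the dataset so the first computed result is kept in memory.
        self._persist = True
        return self

    def _compute(self):
        if self._cache is not None:
            return self._cache
        self.compute_count += 1
        result = [self._fn(x) for x in self._source]
        if self._persist:
            self._cache = result
        return result

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())


uncached = CachedDataset(range(5), lambda x: x * x)
uncached.count()
uncached.collect()
print(uncached.compute_count)  # 2: each action recomputed the data

cached = CachedDataset(range(5), lambda x: x * x).persist()
cached.count()
cached.collect()
print(cached.compute_count)    # 1: the second action reused the cache
```

The saving grows with the cost of the computation and the number of actions that reuse it, which is why iterative workloads benefit most.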
  • 31. RDD Operations: [Diagram: a parent RDD passes through Transformations (map(), flatMap()…), each of which creates a new RDD based on custom business logic and extends the lineage; Actions (saveAsTextFile(), count()…) return output to the driver or export data to a storage system after computation, producing the Result]
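Lineage, the record of which parent and which transformation produced an RDD, is what allows lost data to be rebuilt. The toy sketch below illustrates the idea only: `LineageDataset` is a hypothetical class, it computes eagerly to stay short, and a lost partition is simulated by setting it to `None`.

```python
class LineageDataset:
    """Toy partitioned dataset that remembers its parent and the function applied."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # lineage: where this dataset came from
        self.fn = fn                  # lineage: how it was derived

    def map(self, fn):
        # Computed eagerly here to keep the sketch short.
        derived = [[fn(x) for x in part] for part in self.partitions]
        return LineageDataset(derived, parent=self, fn=fn)

    def get_partition(self, index):
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; nothing to recompute from")
            # Recompute only the missing partition from the parent's matching one.
            parent_part = self.parent.get_partition(index)
            self.partitions[index] = [self.fn(x) for x in parent_part]
        return self.partitions[index]


base = LineageDataset([[1, 2], [3, 4], [5, 6]])
squared = base.map(lambda x: x * x)

squared.partitions[1] = None     # simulate losing one partition
print(squared.get_partition(1))  # [9, 16]
```

Only the missing partition is rebuilt; the other partitions are left untouched.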
  • 32. Features of Spark: Processing: diverse processing platform. Memory management: automatic memory management. Window criteria: time-based window criteria. Fault tolerance: recovers automatically. Duplicate elimination: processes every record exactly once. Speed: up to 100x faster than Hadoop MapReduce.
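A time-based window criterion groups records by when they occurred rather than by how many there are. The sketch below is a simplified illustration with assumed names and made-up events: `tumbling_window_counts` is hypothetical, and it uses fixed, non-overlapping windows for brevity, whereas Spark Streaming windows are defined by a window length and a slide interval and may overlap.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size time window, keyed by window start time."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Made-up (timestamp in seconds, payload) pairs.
events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
window_counts = tumbling_window_counts(events, window_seconds=10)
print(window_counts)  # {0: 3, 10: 1, 20: 1}
```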
  • 33. Thank You. DataFlair: /c/DataFlairWS, /DataFlairWS