APACHE SPARK
S.Deepa
Assistant Professor (Sr.G)
Department of Computer Technology – PG
Kongu Engineering College
Introduction
• Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
• It is based on the Hadoop MapReduce model and extends it to efficiently support more types of
computation, including interactive queries and stream processing.
• The main feature of Spark is its in-memory cluster computing that increases the processing speed
of an application.
• Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming.
• Besides supporting all these workloads in a single system, it reduces the management burden of
maintaining separate tools.
Evolution of Apache Spark
• Spark started as one of Hadoop’s sub-projects, developed in 2009 in UC Berkeley’s AMPLab by Matei
Zaharia.
• It was open-sourced in 2010 under a BSD license.
• It was donated to the Apache Software Foundation in 2013.
• Apache Spark became a top-level Apache project in February 2014.
Features of Apache Spark
• Speed
• Supports multiple languages
• Advanced Analytics
• Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory,
and 10 times faster when running on disk.
• This is possible by reducing the number of read/write operations to disk.
• It stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you
can write applications in different languages. Spark also comes with 80 high-level operators for
interactive querying.
• Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries,
streaming data, machine learning (ML), and graph algorithms.
• Powerful Caching − A simple programming layer provides powerful caching and disk persistence
capabilities.
• Deployment − It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster
manager.
• Real-Time − It offers real-time computation and low latency because of in-memory computation.
• Polyglot − Spark provides high-level APIs in Java, Scala, Python, and R, and Spark code can be
written in any of these four languages. It also provides a shell in Scala and Python.
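To make the high-level operators and the interactive shells concrete, here is a minimal PySpark word-count sketch. It assumes a local Spark installation, and the input path "input.txt" is a placeholder.

```python
# Minimal word-count sketch in PySpark; "input.txt" is a hypothetical path.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

counts = (sc.textFile("input.txt")                # read the file as an RDD of lines
            .flatMap(lambda line: line.split())   # split each line into words
            .map(lambda word: (word, 1))          # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.take(10))                            # action: fetch 10 (word, count) pairs
sc.stop()
```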
Spark Built on Hadoop
Ways of Spark deployment
• Standalone − In a standalone deployment, Spark occupies the place on top of HDFS (Hadoop
Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce
run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN − In a YARN deployment, Spark simply runs on YARN without any pre-installation or
root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and it
allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to
standalone deployment. With SIMR, a user can start Spark and use its shell without any
administrative access.
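As a rough illustration of how these deployment modes are selected in practice, the cluster manager is usually chosen through the master URL passed to spark-submit or set when building the session. The host names below are placeholders, not a prescribed setup.

```python
# Hedged sketch: choosing the cluster manager via the master URL.
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("DeploymentModes")

# Spark's own standalone cluster manager (placeholder host):
# spark = builder.master("spark://master-host:7077").getOrCreate()

# Hadoop YARN (often selected with --master yarn via spark-submit):
# spark = builder.master("yarn").getOrCreate()

# Local mode, useful for development and testing:
spark = builder.master("local[*]").getOrCreate()
print(spark.sparkContext.master)                  # confirm which master is in use
spark.stop()
```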
Components of Spark
• Apache Spark Core
-- Spark Core is the underlying general execution engine for the Spark platform upon which all
other functionality is built.
-- It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL
-- Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
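For illustration only: the SchemaRDD abstraction evolved into the DataFrame API in later Spark releases. A minimal sketch of structured querying through the modern SparkSession interface follows; the table name, columns, and rows are made up.

```python
# Minimal Spark SQL sketch over structured data (illustrative schema and rows).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

people = spark.createDataFrame(
    [("alice", 34), ("bob", 29)], ["name", "age"])  # structured data with a schema

people.createOrReplaceTempView("people")            # expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```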
• Spark Streaming
-- Spark Streaming is an extension of the core Spark API that allows data engineers and
data scientists to process real-time data from various sources including (but not limited
to) Kafka, Flume, and Amazon Kinesis.
-- This processed data can be pushed out to file systems, databases, and live dashboards.
-- Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics.
-- It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
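A hedged sketch of this mini-batch model: a DStream is created from a local text socket (the host and port are placeholders), and RDD-style transformations are applied to each batch.

```python
# Streaming word count over 5-second mini-batches (placeholder socket source).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 5)                       # 5-second batch interval

lines = ssc.socketTextStream("localhost", 9999)     # live text stream (placeholder)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # print each batch's counts

ssc.start()
ssc.awaitTermination()
```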
• MLlib (Machine Learning Library)
-- Built on top of Spark, MLlib is a scalable machine learning library consisting of common
learning algorithms and utilities, including classification, regression, clustering,
collaborative filtering, dimensionality reduction, and underlying optimization primitives.
-- MLlib is a distributed machine learning framework above Spark because of the
distributed memory-based Spark architecture.
-- According to benchmarks done by the MLlib developers against the Alternating Least
Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop
disk-based version of Apache Mahout (before Mahout gained a Spark interface).
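As a hedged illustration of MLlib usage, here is a small collaborative-filtering sketch with the ALS algorithm mentioned above, using the DataFrame-based API; the rating triples and parameter values are made up.

```python
# Collaborative filtering with ALS (illustrative data and parameters).
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ALSExample").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=5)
model = als.fit(ratings)                  # train the recommendation model
model.transform(ratings).show()           # predicted ratings for the same pairs

spark.stop()
```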
• GraphX
• GraphX is a distributed graph-processing framework on top of Spark.
• It provides an API for expressing graph computations that can model user-defined
graphs by using the Pregel abstraction API.
• It also provides an optimized runtime for this abstraction.
• You can view the same data as both graphs and collections, transform and join
graphs with RDDs efficiently, and write custom iterative graph algorithms using
the Pregel API.
Resilient Distributed Datasets
• Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark.
• It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster.
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
• Formally, an RDD is a read-only, partitioned collection of records.
• RDDs can be created through deterministic operations on either data on stable storage or other
RDDs.
• RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• There are two ways to create RDDs − parallelizing an existing collection in your driver program,
or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase,
or any data source offering a Hadoop Input Format.
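A minimal sketch of these two creation paths, assuming a local Spark context; the file location is a placeholder.

```python
# Two ways to create RDDs: parallelize a collection, or reference external storage.
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDCreation")

# 1. Parallelize an existing collection in the driver program.
rdd_from_collection = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# 2. Reference a dataset in external storage (placeholder HDFS URI).
rdd_from_file = sc.textFile("hdfs://namenode:9000/data.txt")

print(rdd_from_collection.count())        # action on the parallelized RDD
sc.stop()
```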
• Spark uses the concept of RDD to achieve faster and more efficient MapReduce operations. Let
us first discuss how MapReduce operations take place and why they are not so efficient.
• Data Sharing is slow in MapReduce.
• MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster.
• It allows users to write parallel computations, using a set of high-level operators, without having
to worry about work distribution and fault tolerance.
• Unfortunately, in most current frameworks, the only way to reuse data between computations (Ex
− between two MapReduce jobs) is to write it to an external stable storage system (Ex − HDFS).
• Although this framework provides numerous abstractions for accessing a cluster’s computational
resources, users still want more.
• Both Iterative and Interactive applications require faster data sharing across parallel jobs.
• Data sharing is slow in MapReduce due to replication, serialization, and disk IO.
• Regarding the storage system, most Hadoop applications spend more than 90% of their
time doing HDFS read/write operations.
Iterative Operations on MapReduce
• Reuse intermediate results across multiple computations in multi-stage applications.
• The following illustration explains how the current framework works, while doing the iterative
operations on MapReduce.
• This incurs substantial overheads due to data replication, disk I/O, and serialization, which makes
the system slow.
Interactive Operations on MapReduce
• User runs ad-hoc queries on the same subset of data.
• Each query will do the disk I/O on the stable storage, which can dominate application execution
time.
• The following illustration explains how the current framework works while doing the interactive
queries on MapReduce.
Data Sharing using Spark RDD
• Data sharing is slow in MapReduce due to replication, serialization, and disk IO.
• Most Hadoop applications spend more than 90% of their time doing HDFS read/write
operations.
• Recognizing this problem, researchers developed a specialized framework called Apache Spark.
• The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory
processing computation.
• This means it stores the state of memory as an object across jobs, and the object is sharable
between those jobs.
• Data sharing in memory is 10 to 100 times faster than over the network or from disk.
Spark RDD
• Iterative Operations on Spark RDD
Interactive Operations on Spark RDD
• This illustration shows interactive operations on Spark RDD.
• If different queries are run on the same set of data repeatedly, this particular data can be kept in
memory for better execution times.
• By default, each transformed RDD may be recomputed each time you run an
action on it.
• However, you may also persist an RDD in memory, in which case Spark will keep
the elements around on the cluster for much faster access, the next time you
query it.
• There is also support for persisting RDDs on disk or replicating them across multiple
nodes.
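A short sketch of this persistence behaviour, assuming a placeholder log file: the first action materializes the cached RDD, and later actions reuse it instead of recomputing.

```python
# Persisting an RDD in memory with disk spill (placeholder input path).
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "PersistExample")

logs = sc.textFile("access.log")                  # placeholder input file
errors = logs.filter(lambda line: "ERROR" in line)

errors.persist(StorageLevel.MEMORY_AND_DISK)      # keep partitions across actions

print(errors.count())                             # first action fills the cache
print(errors.take(5))                             # reuses the cached partitions

sc.stop()
```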
Spark Architecture
• Apache Spark has a well-defined layered architecture where all the Spark components and layers are
loosely coupled.
• This architecture is further integrated with various extensions and libraries.
• Apache Spark Architecture is based on two main abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
Spark Eco-System
• The Spark ecosystem is composed of various components like Spark SQL, Spark Streaming, MLlib,
GraphX, and the Core API component.
Spark Core
• Spark Core is the base engine for large-scale parallel and distributed data processing.
• Further, additional libraries built on top of the core allow diverse workloads such as
streaming, SQL, and machine learning.
• It is responsible for memory management, fault recovery, scheduling, distributing and
monitoring jobs on a cluster, and interacting with storage systems.
Spark Streaming
• Spark Streaming is the component of Spark which is used to process real-time streaming data.
• Thus, it is a useful addition to the core Spark API.
• It enables high-throughput and fault-tolerant stream processing of live data streams.
• Spark SQL
Spark SQL is a Spark module that integrates relational processing with Spark’s functional
programming API.
• It supports querying data either via SQL or via the Hive Query Language.
• For those of you familiar with RDBMSs, Spark SQL will be an easy transition from your earlier
tools, letting you extend the boundaries of traditional relational data processing.
• GraphX
GraphX is the Spark API for graphs and graph-parallel computation.
• Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.
• At a high-level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed
Property Graph (a directed multigraph with properties attached to each vertex and edge).
• MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in
Apache Spark.
• SparkR
• It is an R package that provides a distributed data frame implementation.
• It also supports operations such as selection, filtering, and aggregation, but on large datasets.
Resilient Distributed Dataset(RDD)
RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: Fault tolerant and capable of rebuilding data on failure
• Distributed: Data is distributed among multiple nodes in a cluster
• Dataset: A collection of partitioned data with values
• It is a layer of abstracted data over the distributed collection. It is immutable in
nature and follows lazy transformations.
• The data in an RDD is split into chunks based on a key.
• RDDs are highly resilient, i.e., they are able to recover quickly from any issues, as
the same data chunks are replicated across multiple executor nodes.
• Thus, even if one executor node fails, another will still process the data.
• This allows you to perform your functional calculations against your dataset very
quickly by harnessing the power of multiple nodes.
• Moreover, once you create an RDD, it becomes immutable.
• Immutable means an object whose state cannot be modified after it is created; RDDs can,
however, be transformed into new RDDs.
• Talking about the distributed environment, each dataset in RDD is divided into
logical partitions, which may be computed on different nodes of the cluster.
• Due to this, you can perform transformations or actions on the complete data in
parallel.
• Also, you don’t have to worry about the distribution, because Spark takes care of
that.
• There are two ways to create RDDs − parallelizing an existing collection in your driver program, or
by referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase,
etc.
• With RDDs, you can perform two types of operations:
• Transformations: They are the operations that are applied to create a new RDD.
• Actions: They are applied on an RDD to instruct Apache Spark to apply computation and pass the
result back to the driver.
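A small sketch contrasting the two operation types: transformations are lazy and only describe a new RDD, while actions trigger computation and return a result to the driver.

```python
# Transformations (lazy) versus actions (eager) on an RDD.
from pyspark import SparkContext

sc = SparkContext("local[*]", "TransformationsVsActions")

numbers = sc.parallelize(range(10))

squares = numbers.map(lambda x: x * x)            # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)      # another lazy transformation

print(evens.collect())                            # action: computation happens here
print(evens.count())                              # action: returns a number to the driver

sc.stop()
```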
Working of Spark Architecture
• On the master node, you have the driver program, which drives your application. The code you
write behaves as the driver program, or, if you are using the interactive shell, the shell acts as
the driver program.
• Inside the driver program, the first thing you do is create a Spark context.
• Assume that the Spark context is a gateway to all the Spark functionalities.
• It is similar to your database connection.
• Any command you execute in your database goes through the database connection.
• Likewise, anything you do on Spark goes through Spark context.
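A minimal sketch of creating that gateway inside a driver program; modern applications typically build a SparkSession, which wraps the Spark context. The application name and master URL are placeholders.

```python
# Creating the Spark context (via SparkSession) inside a driver program.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("MyDriverProgram")
         .master("local[*]")                      # cluster manager URL, e.g. "yarn"
         .getOrCreate())

sc = spark.sparkContext                           # the gateway to Spark functionality
print(sc.applicationId)                           # confirm the application is registered

spark.stop()
```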
• Now, this Spark context works with the cluster manager to manage various jobs.
• The driver program and Spark context take care of job execution within the
cluster.
• A job is split into multiple tasks, which are distributed over the worker nodes.
• Anytime an RDD is created in Spark context, it can be distributed across various
nodes and can be cached there.
• Worker nodes are the slave nodes whose job is to execute the tasks. These tasks are
executed on the partitioned RDDs in the worker nodes, and the results are returned to
the Spark context.
• The Spark context takes the job, breaks it into tasks, and distributes them to the
worker nodes. These tasks work on the partitioned RDDs, perform operations,
collect the results, and return them to the main Spark context.
• If you increase the number of workers, you can divide jobs into more partitions
and execute them in parallel over multiple systems, which is much faster.
• As the number of workers increases, the available memory also increases, so you
can cache jobs to execute them faster.
• The workflow of the Spark architecture can be summarized in the following steps:
• STEP 1:
• The client submits the Spark user application code.
• When the application code is submitted, the driver implicitly converts the user code containing
transformations and actions into a logical directed acyclic graph (DAG).
• At this stage, it also performs optimizations such as pipelining transformations.
• STEP 2:
• After that, it converts the logical graph (DAG) into a physical execution plan with many
stages.
• After converting into a physical execution plan, it creates physical execution units called tasks
under each stage.
• Then the tasks are bundled and sent to the cluster.
• STEP 3:
• Now the driver talks to the cluster manager and negotiates the resources.
• Cluster manager launches executors in worker nodes on behalf of the driver.
• At this point, the driver will send the tasks to the executors based on data
placement.
• When executors start, they register themselves with the driver.
• So, the driver has a complete view of the executors that are executing the tasks.
• STEP 4: During the course of task execution, the driver program monitors the set of executors that
are running. The driver node also schedules future tasks based on data placement.