SPARK
Dr. Nilesh M. Patil
Associate Professor
Computer Engineering
SVKM’s DJSCE, Mumbai
Features of Spark
• Apache Spark is an open-source, distributed processing system used for big
data workloads.
• It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.
• Spark can run applications on a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. This is achieved by reducing the number of read/write operations to disk.
• Spark supports much more than just ‘Map’ and ‘Reduce’ operations.
• It provides development APIs in Java, Scala, Python and R, and supports
code reuse across multiple workloads like batch processing, interactive
queries, real-time analytics, machine learning, and graph processing.
Spark Built on Hadoop
• The following diagram shows the ways in which Spark can be deployed with Hadoop components.
• Standalone − In a Spark Standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop Yarn − In a Hadoop Yarn deployment, Spark simply runs on Yarn without any pre-installation or root access required. This helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
• Apache Mesos − Apache Mesos (Mesosphere's Cluster Operating
System) is a cluster management technology that can be used with
Apache Spark to manage resources in a distributed environment.
Mesos provides a layer of abstraction between the cluster's resources
and the distributed applications running on top of it, enabling multiple
frameworks to run concurrently and share the same set of cluster
resources.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.
Components of Spark (1/6)
Components of Spark (2/6)
Spark Core
• Spark Core is the foundation of the platform.
• It is responsible for memory management, fault recovery, scheduling,
distributing & monitoring jobs, and interacting with storage systems.
• Spark Core is exposed through application programming interfaces (APIs) built for Java, Scala, Python and R.
• These APIs hide the complexity of distributed processing behind simple, high-level operators, as the brief sketch below illustrates.
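A minimal PySpark sketch (not from the slides) of how a few high-level Spark Core operators express a parallel word count; the input data is made up and the job runs on a local SparkContext.

```python
from pyspark import SparkContext

# Local SparkContext; on a cluster the master URL would point to a cluster manager.
sc = SparkContext("local[*]", "CoreOperatorsSketch")

# Made-up input, distributed across partitions.
lines = sc.parallelize(["spark core hides complexity",
                        "spark exposes simple operators"])

# flatMap / map / reduceByKey run in parallel across partitions,
# but read like ordinary collection processing.
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

print(word_counts.collect())
sc.stop()
```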
Components of Spark (3/6)
MLlib (Machine Learning)
• Spark includes MLlib, a library of algorithms to do machine learning
on data at scale.
• Machine Learning models can be trained by data scientists with R or
Python on any Hadoop data source, saved using MLlib, and imported
into a Java or Scala-based pipeline.
• Spark was designed for fast, interactive computation that runs in
memory, enabling machine learning to run quickly.
• The algorithms include the ability to do classification, regression,
clustering, collaborative filtering, and pattern mining.
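As a brief illustration of the MLlib capabilities listed above, the sketch below trains a classifier with pyspark.ml on a tiny made-up dataset and saves the fitted model; the data, parameters, and save path are illustrative assumptions, not part of the slides.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# Tiny made-up training set: (label, features).
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.5, 1.3])),
     (1.0, Vectors.dense([2.2, 0.8]))],
    ["label", "features"])

# Train a classification model; MLlib distributes the computation across the cluster.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)

# A saved model can later be loaded into a Java- or Scala-based pipeline.
model.write().overwrite().save("/tmp/lr_model")  # illustrative path

spark.stop()
```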
Components of Spark (4/6)
Spark Streaming (Real-time)
• Spark Streaming is a real-time solution that leverages Spark Core’s
fast scheduling capability to do streaming analytics.
• It ingests data in mini-batches and enables analytics on that data with the same application code written for batch analytics (see the sketch after this list).
• This improves developers' productivity, because they can use the same code for batch processing and for real-time streaming applications.
• Spark Streaming supports data from Twitter, Kafka, Flume, HDFS,
and ZeroMQ, and many others found from the Spark Packages
ecosystem.
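A minimal sketch of the mini-batch model described above, using a plain text socket as the source (rather than Kafka or Flume) so it stays self-contained; the host, port, and 5-second batch interval are assumptions.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingSketch")
ssc = StreamingContext(sc, 5)  # 5-second mini-batches

# Assumed source: a text socket, e.g. fed by `nc -lk 9999`.
lines = ssc.socketTextStream("localhost", 9999)

# The same RDD-style code used for batch jobs is applied to each mini-batch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print each mini-batch's word counts

ssc.start()
ssc.awaitTermination()
```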
Components of Spark (5/6)
Spark SQL (Interactive Queries)
• Spark SQL is a distributed query engine that provides low-latency, interactive queries
up to 100x faster than MapReduce.
• It includes a cost-based optimizer, columnar storage, and code generation for fast
queries, while scaling to thousands of nodes.
• Business analysts can use standard SQL or the Hive Query Language for querying
data.
• Developers can use APIs, available in Scala, Java, Python, and R.
• It supports various data sources out-of-the-box including JDBC, ODBC, JSON,
HDFS, Hive, ORC, and Parquet.
• Other popular stores—Amazon Redshift, Amazon S3, Couchbase, Cassandra,
MongoDB, Salesforce.com, Elasticsearch, and many others can be found from
the Spark Packages ecosystem.
Components of Spark (6/6)
GraphX (Graph Processing)
• Spark GraphX is a distributed graph processing framework built on
top of Spark.
• GraphX provides ETL, exploratory analysis, and iterative graph computation to enable users to interactively build and transform graph data structures at scale.
• It comes with a highly flexible API and a selection of distributed graph algorithms.
Architecture of Spark (1/2)
Architecture of Spark (2/2)
• When the Driver Program in the Apache Spark architecture executes, it calls the actual program of the application and creates a SparkContext, which contains all of the basic functions. The Spark Driver also contains several other components, including a DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, all of which are responsible for translating user-written code into jobs that are actually executed on the cluster.
• The Cluster Manager manages the execution of the various jobs in the cluster. The Spark Driver works in conjunction with the Cluster Manager to control the execution of these jobs, and the Cluster Manager allocates the resources for each job. Once a job has been broken down into smaller jobs, which are then distributed to worker nodes, the Spark Driver controls their execution. Many worker nodes can be used to process an RDD created in the SparkContext, and the results can also be cached.
• The SparkContext receives task information from the Cluster Manager and enqueues it on worker nodes.
• The Executor is in charge of carrying out these tasks. The lifespan of executors is the same as that of the Spark application. To improve the performance of the system, the number of workers can be increased, so that jobs can be divided into more, smaller parts (a brief configuration sketch follows).
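The sketch below shows the part of this flow a user actually touches: the driver program creates a SparkContext through a cluster-manager master URL, and work submitted through it is scheduled onto executors. The master URL and configuration values are illustrative assumptions, not part of the slides.

```python
from pyspark import SparkConf, SparkContext

# The driver program creates the SparkContext, which asks the cluster manager
# (standalone, YARN, or Mesos) for executors on the worker nodes.
conf = (SparkConf()
        .setAppName("ArchitectureSketch")
        .setMaster("spark://master-host:7077")  # illustrative standalone master URL
        .set("spark.executor.memory", "2g"))    # illustrative executor setting

sc = SparkContext(conf=conf)

# Work submitted through the SparkContext is split into stages and tasks by the
# driver's schedulers and executed by the executors; results flow back to the driver.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.sum())

sc.stop()
```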
Apache Hadoop Vs Apache Spark
• Performance: Hadoop offers slower performance, uses disks for storage, and depends on disk read and write speed; Spark offers fast in-memory performance with reduced disk reading and writing operations.
• Cost: Hadoop is an open-source platform that is less expensive to run, uses affordable consumer hardware, and trained Hadoop professionals are easier to find; Spark is also open source but relies on memory for computation, which considerably increases running costs.
• Data Processing: Hadoop is best for batch processing and uses MapReduce to split a large dataset across a cluster for parallel analysis; Spark is suitable for iterative and live-stream data analysis and works with RDDs and DAGs to run operations.
• Scalability: Hadoop is easily scalable by adding nodes and disks for storage and supports tens of thousands of nodes without a known limit; Spark is a bit more challenging to scale because it relies on RAM for computations, and supports thousands of nodes in a cluster.
• Security: Hadoop is extremely secure and supports LDAP, ACLs, Kerberos, SLAs, etc.; Spark is not secure, its security is turned off by default, and it relies on integration with Hadoop to achieve the necessary security level.
• Ease of Use and Language Support: Hadoop is more difficult to use, with fewer supported languages, and uses Java or Python for MapReduce apps; Spark is more user friendly, allows an interactive shell mode, and its APIs can be written in Java, Scala, R, Python, and Spark SQL.
• Machine Learning: Hadoop is slower than Spark, data fragments can be too large and create bottlenecks, and Mahout is the main library; Spark is much faster with in-memory processing and uses MLlib for computations.
• Scheduling and Resource Management: Hadoop uses external solutions (YARN is the most common option for resource management, and Oozie is available for workflow scheduling); Spark has built-in tools for resource allocation, scheduling, and monitoring.
RDD in Spark
• RDD is a core abstraction in Spark, which stands for Resilient Distributed Dataset.
• It enables large data to be partitioned into smaller pieces that fit on each machine, so that computation can be done in parallel across multiple machines.
• Moreover, RDDs automatically recover from node failures to ensure storage resilience.
• It is an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
• An RDD can contain any type of object and is created by loading an external dataset or
distributing a collection from the driver program.
• RDDs support two types of operations:
• Transformations are operations (such as map, filter, join, union, and so on) that are
performed on an RDD and which yield a new RDD containing the result.
• Actions are operations (such as reduce, count, first, and so on) that return a value after
running a computation on an RDD.
Features of RDD
Here are some features of RDD in Spark:
• Resilience: RDDs track data lineage information to recover lost data automatically on failure. This is also called fault tolerance.
• Distributed: Data present in an RDD resides on multiple nodes. It is distributed
across different nodes of a cluster.
• Lazy evaluation: Data does not get loaded into an RDD even when you define it. Transformations are actually computed only when you call an action, such as count or collect, or when you save the output to a file system.
• Immutability: Data stored in an RDD is read-only: you cannot edit the data that is present in the RDD. However, you can create new RDDs by performing transformations on the existing RDDs.
• In-memory computation: An RDD stores any intermediate data that is generated in memory (RAM) rather than on disk, so that it provides faster access.
• Partitioning: Partitioning can be done on any existing RDD to create logical parts that are mutable. You can achieve this by applying transformations to the existing partitions.
Advantages of RDD
• RDD aids in increasing the execution speed of Spark.
• RDDs are the basic unit of parallelism and hence help in achieving the
consistency of data.
• RDDs help in performing and saving the actions separately.
• They are persistent, as they can be reused repeatedly.
Limitation of RDD
• There is no input optimization available in RDDs.
• One of the biggest limitations of RDDs is that the execution process does not start instantly.
• No changes can be made to an RDD once it is created.
• RDDs degrade when there is not enough memory to store them.
• Run-time type safety is absent in RDDs.
RDD Example
• In this example, we first create a SparkContext
with the name "RDD Example".
• Then, we create an RDD from a list of numbers
using the parallelize() method.
• We then perform two transformations on the
RDD: map() to square each element, and
filter() to keep only the elements that are
greater than 10.
• Finally, we perform an action collect() to
retrieve the final result of the computation,
which is a list of numbers [16, 25]. The result
is printed to the console.
• Note that RDDs are lazily evaluated, meaning that the transformations are not executed until an action is performed. In this example, collect() is the action that triggers the computation of the transformations (see the sketch below).
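A sketch of the code the bullets above describe, assuming the input list is [1, 2, 3, 4, 5] (the slide only states the final result [16, 25]).

```python
from pyspark import SparkContext

sc = SparkContext("local", "RDD Example")

# Create an RDD from a list of numbers (assumed here to be [1, 2, 3, 4, 5]).
numbers = sc.parallelize([1, 2, 3, 4, 5])

squared = numbers.map(lambda x: x * x)       # transformation: square each element
filtered = squared.filter(lambda x: x > 10)  # transformation: keep elements > 10

result = filtered.collect()                  # action: triggers the computation
print(result)                                # [16, 25]

sc.stop()
```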
Spark SQL
• Spark SQL is a Spark component that supports querying data either via
SQL or via the Hive Query Language.
• It originated as the Apache Hive port to run on top of Spark (in place
of MapReduce) and is now integrated with the Spark stack.
• Spark SQL introduces SchemaRDD, a new data abstraction that
provides support for structured and semi-structured data.
Features of Spark SQL
• Easy to Integrate: One can mix SQL queries with Spark programs easily.
Structured data can be queried inside Spark programs using either SQL or a
Dataframe API. Running SQL queries alongside analytic algorithms is easy
because of this tight integration.
• Compatibility with Hive: Hive queries can be executed in Spark SQL as they are.
• Unified Data Access: Loading and querying data from various sources is possible.
• Standard Connectivity: Spark SQL can be accessed through industry-standard JDBC (Java Database Connectivity) and ODBC (Open Database Connectivity) APIs.
• Performance and Scalability: To make queries fast while scaling to hundreds of nodes with the Spark engine, Spark SQL incorporates a code generator, a cost-based optimizer, and columnar storage. This provides complete mid-query fault tolerance.
Advantages of Spark SQL
• It helps in easy data querying. SQL queries can be mixed with Spark programs for querying structured data as a distributed dataset (RDD), and SQL queries can be run alongside analytic algorithms thanks to Spark SQL's tight integration.
• Another important advantage of Spark SQL is that the loading and querying
can be done for data from different sources. Hence, the data access is
unified.
• It offers standard connectivity as Spark SQL can be connected through
JDBC or ODBC.
• It can be used for faster processing of Hive tables.
• Another important offering of Spark SQL is that it can run unmodified Hive
queries on existing warehouses as it allows easy compatibility with existing
Hive data and queries.
Disadvantages of Spark SQL
• Creating or reading tables containing union fields is not possible with
Spark SQL.
• It does not report an error in situations where a varchar is oversized.
• It does not support Hive transactions.
• It also does not support the Char type (fixed-length strings). Hence,
reading or creating a table with such fields is not possible.
Spark SQL Example
• In this example, we first create a SparkSession
with the name "Spark SQL Example".
• Then, we read a CSV file as a DataFrame
using the read.csv() method.
• We then register the DataFrame as a
temporary view using the
createOrReplaceTempView() method, which
allows us to query it using SQL.
• We then run a SQL query on the table, which
counts the number of occurrences of each
distinct value in column1.
• The result of the query is stored in a
DataFrame called result.
• Finally, we display the result using the show() method (see the sketch below).
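A sketch of the code the bullets above describe; the CSV file path, the read options, and the view name my_table are assumptions, while column1 comes from the slide.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

# Read a CSV file as a DataFrame (path and options are illustrative).
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("my_table")

# Count the occurrences of each distinct value in column1.
result = spark.sql(
    "SELECT column1, COUNT(*) AS cnt FROM my_table GROUP BY column1")

result.show()  # display the result
spark.stop()
```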
Schedulers in Spark
• In Spark, the DAG (Directed Acyclic Graph) scheduler and the task scheduler are two important components of the
execution engine that work together to execute the user's application code.
• DAG Scheduler
1. The DAG scheduler is responsible for generating a DAG of stages based on the user's code and optimizing it for
efficient execution.
2. The stages represent a set of tasks that can be executed in parallel.
3. The DAG scheduler divides the computation into smaller stages based on the dependencies between RDDs (Resilient
Distributed Datasets) and operators in the code.
4. The DAG scheduler then submits the stages to the task scheduler for execution.
• Task Scheduler
1. The task scheduler is responsible for assigning tasks to workers in the cluster.
2. It receives the stages from the DAG scheduler and breaks them down into smaller tasks that can be executed in parallel.
3. The task scheduler then assigns these tasks to workers based on the data locality of the tasks and the availability of
resources in the cluster.
4. It also monitors the progress of the tasks and handles any failures that may occur.
• Together, the DAG scheduler and task scheduler ensure that the user's code is executed efficiently and reliably on the Spark
cluster.
• They work in tandem to maximize the parallelism of the computation and minimize the data movement across the network.
• This helps to achieve high performance and scalability when processing large amounts of data.
Shared Variables in Spark
• Shared variables are variables that need to be used by many functions and methods in parallel.
• Spark provides two special types of shared variables: Broadcast Variables and Accumulators.
• Broadcast variables:
1. Used to cache a value in memory on all nodes. Here only one instance of this read-only variable is shared
between all computations throughout the cluster.
2. Spark sends the broadcast variable to each node concerned by the related task. After that, each node caches
it locally in serialized form.
3. Now, before executing each of the planned tasks, the system retrieves the values locally from the cache instead of fetching them from the driver.
4. Broadcast variables are: immutable (unchangeable), distributed (i.e., broadcast to the cluster), and able to fit in memory.
• Accumulators:
1. As the name suggests, an Accumulator's main role is to accumulate values. Accumulators are variables that are used to implement counters and sums. Spark provides accumulators of numeric types only.
2. The user can create named or unnamed accumulators.
3. Unlike Broadcast Variables, accumulators are writable. However, written values can only be read in the driver program, which is why accumulators work well as data aggregators (see the sketch below).
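A combined sketch of both shared-variable types described above; the lookup table and input records are made up for illustration.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "SharedVariablesSketch")

# Broadcast variable: a read-only value cached once on every node.
country_codes = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: a numeric counter that tasks can only add to and the driver reads.
unknown_count = sc.accumulator(0)

def resolve(code):
    mapping = country_codes.value   # read from the locally cached copy
    if code not in mapping:
        unknown_count.add(1)        # tasks add; only the driver reads the total
    return mapping.get(code, "unknown")

records = sc.parallelize(["IN", "US", "XX", "IN"])
print(records.map(resolve).collect())  # ['India', 'United States', 'unknown', 'India']
print(unknown_count.value)             # 1, read back in the driver program

sc.stop()
```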