
SENg5302 - Chapter VI: Big Data Analytics with Apache Spark
What is Spark
Apache Spark is an open-source data processing framework for performing big data
analytics on a distributed computing cluster.
It supports a wide variety of operations beyond the Map and Reduce functions.
It provides concise and consistent APIs in Scala, Java and Python.
Spark is written in the Scala programming language and runs on the JVM.

It currently supports the following programming languages for developing Spark applications:
Scala (default)
Java
Python
R
Evolution of Apache Spark
• Spark began as one of Hadoop's sub-projects, developed in 2009 in UC
Berkeley's AMPLab by Matei Zaharia. It was open-sourced in
2010 under a BSD license and donated to the Apache
Software Foundation in 2013; Apache Spark has been a
top-level Apache project since February 2014. Today, most
organizations across the world have incorporated Apache
Spark to empower their Big Data applications.
Spark Overview
Spark was introduced by the Apache Software Foundation to
speed up the Hadoop computational software process.

Contrary to common belief, Spark is not a modified
version of Hadoop and is not really dependent on
Hadoop, because it has its own cluster management.
Hadoop is just one of the ways to implement Spark.

Spark uses Hadoop in two ways – one is storage and the
second is processing. Since Spark has its own cluster
management computation, it uses Hadoop for
storage purposes only.
Spark Overview contd…
Apache Spark has become a key cluster computing
framework that has set the world of big data on fire.
It is a more accessible, powerful, and capable data
tool for dealing with a variety of big data challenges.
Apache Spark is a framework that is supported
in Scala, Python, R Programming, and Java.
Below are different implementations of Spark:
Spark – default interface for Scala and Java
PySpark – Python interface for Spark
SparklyR – R interface for Spark.
Spark Overview contd…
It is a general engine for Big Data analysis,
processing, and computations. It provides several
advantages over MapReduce: it is faster, easier to
use, and runs virtually everywhere.

Its built-in tools for SQL, Machine Learning,
and streaming make it very popular and
one of the most in-demand tools in the IT industry. Spark is
written in Scala.
Apache Spark overview contd..
Apache Spark is a lightning-fast cluster computing
technology, designed for fast computation. It is based on
Hadoop MapReduce and extends the MapReduce model
to use it efficiently for more types of computations,
including interactive queries and stream processing. The
main feature of Spark is its in-memory cluster
computing, which increases the processing speed of an
application.

Spark is designed to cover a wide range of workloads
such as batch applications, iterative algorithms,
interactive queries and streaming. Apart from
supporting all these workloads in a single
system, it reduces the management burden of
maintaining separate tools.
Is Spark available as a managed service from cloud providers?
Spark on Azure
As part of its big data capabilities and Databricks partnership, Azure provides Apache Spark. HDInsight, Azure
Databricks, and Azure Synapse Analytics are among the Azure tools and services that integrate with Apache Spark.
Spark on AWS
AWS (Amazon Web Services) includes Apache Spark in its big data services. Amazon EMR (Elastic MapReduce), AWS Glue,
and Amazon SageMaker are all services and tools that connect with Apache Spark.
• Amazon EMR is a fully managed big data platform that includes Apache Spark as
well as the Hadoop, Hive, and Presto big data processing engines.
Spark on GCP
Google Cloud Platform (GCP) includes Apache Spark in its big data offerings Cloud Dataproc and Cloud Dataflow.
Cloud Dataproc is a fully managed big data platform which includes Apache Spark and other big data processing
engines such as Hadoop and Hive.
What is In-Memory Processing?

 In-memory processing is the practice of taking action on
data entirely in computer memory (e.g., in RAM).
 This is in contrast to other techniques of processing data
which rely on reading and writing data to and from slower
media such as disk drives (HDDs) or SSDs.
 In-memory processing typically implies large-scale
environments where multiple computers are pooled
together so their collective RAM can be used as a large and
fast storage medium.
 Since the storage appears as one big, single allocation of
RAM, large data sets can be processed all at once, versus
processing only the data sets that fit into the RAM of a single
computer. This is often done using a technology known as in-
memory data grids (IMDG).
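As a minimal Scala sketch of this idea in Spark (the input path is an assumption), caching keeps a dataset in the cluster's collective RAM so that repeated computations avoid re-reading from disk:

val logs = sc.textFile("/path/to/large_logs.txt")         // hypothetical input path
val errors = logs.filter(line => line.contains("ERROR"))
errors.cache()                                            // ask Spark to keep this RDD in executor memory
errors.count()                                            // first action reads from disk and fills the cache
errors.filter(line => line.contains("timeout")).count()   // later actions reuse the in-memory data

Caching like this is what lets iterative and interactive workloads run much faster than re-reading the data from HDFS on every pass.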
Limitations of MapReduce in Hadoop
Why choose Apache Spark over Hadoop?
Speed – Apache Spark: 100 times faster for in-memory computations and about ten times faster on disk; Hadoop: better than traditional systems.
Easy to manage – Apache Spark: everything runs in the same cluster; Hadoop: different engines are required for different tasks.
Real-time analysis – Apache Spark: supports live data streaming; Hadoop: only efficient for batch processing.
Features of Apache Spark
Speed − Spark helps run an application in a Hadoop
cluster up to 100 times faster in memory, and 10 times
faster when running on disk. This is possible by
reducing the number of read/write operations to disk;
it stores the intermediate processing data in memory.

Supports multiple languages − Spark provides built-in
APIs in Java, Scala, and Python, so you can write
applications in different languages. Spark comes with
80 high-level operators for interactive querying.

Advanced Analytics − Spark not only supports 'Map'
and 'Reduce'. It also supports SQL queries, streaming
data, machine learning (ML), and graph algorithms.
Features of Apache Spark
Limitations of Apache Spark
How does Apache Spark fit in the Hadoop
Ecosystem?
• Apache Spark can be used together with Hadoop or
Hadoop YARN. It can be deployed on
Hadoop in three ways: Standalone, YARN, and
SIMR.
Spark Built on Hadoop
Standalone − Spark Standalone deployment means Spark occupies the place on top of
HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly.
Here, Spark and MapReduce run side by side to cover all Spark jobs on the
cluster.
Hadoop YARN − Hadoop YARN deployment means, simply, that Spark runs on YARN
without any pre-installation or root access required. It helps to integrate Spark into
the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in
addition to standalone deployment. With SIMR, a user can start Spark and use its shell
without any administrative access.
Apache Spark Advantages
•Spark is a general-purpose, in-memory, fault-
tolerant, distributed processing engine that allows you to
process data efficiently in a distributed fashion.
•Applications running on Spark can be up to 100x faster
than traditional systems.
•You will get great benefits from using Spark for data
ingestion pipelines.
•Using Spark we can process data from Hadoop
HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and
many other file systems.
•Spark is also used to process real-time data
using Spark Streaming and Kafka.
•Using Spark Streaming you can also stream files from
the file system as well as from a socket.
• Spark natively has machine learning and graph libraries.
Why should we consider using Hadoop and Spark
together?
1. Efficient storage and cluster management
2. Easy resource management and task scheduling across a cluster
3. Disaster recovery capabilities
4. Better data security
5. A fast computation engine like Spark on top of Hadoop's storage
6. Runs virtually everywhere
Industries using Apache Spark(Use Cases)
Apache Spark Components
Components of Spark cont..
i. Spark Core
It is the kernel of Spark, which provides an execution
platform for all the Spark applications. It is a
generalized platform to support a wide array of
applications.
ii. Spark SQL
It enables users to run SQL/HQL queries on top of
Spark. Using Apache Spark SQL, we can process
structured as well as semi-structured data. It also
provides an engine for Hive to run unmodified queries
up to 100 times faster on existing deployments.
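As a minimal Scala sketch of Spark SQL (the JSON path, view name, and column names are assumptions), a DataFrame can be registered as a temporary view and queried with ordinary SQL:

val people = spark.read.json("/path/to/people.json")   // hypothetical structured/semi-structured source
people.createOrReplaceTempView("people")               // expose the DataFrame to the SQL engine
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()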
Components of Spark cont..
iii. Spark Streaming
Apache Spark Streaming enables powerful interactive and analytical
applications across live streaming data. The live streams are converted into
micro-batches which are executed on top of Spark Core (a minimal code sketch
follows this list).
iv. Spark MLlib
It is the scalable machine learning library which delivers both efficiency and
high-quality algorithms. Apache Spark MLlib is one of the hottest
choices for Data Scientists due to its capability of in-memory data processing,
which improves the performance of iterative algorithms drastically.
v. Spark GraphX
Apache Spark GraphX is the graph computation engine built on top of Spark
that enables processing graph data at scale.

vi. SparkR
It is an R package that gives a lightweight frontend to use Apache Spark from R.
It allows data scientists to analyze large datasets and interactively run jobs on
them from the R shell. The main idea behind SparkR was to explore different
techniques to integrate the usability of R with the scalability of Spark.
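Referring back to the Spark Streaming component above, here is a minimal Scala sketch of the micro-batch model (the socket host, port, and batch interval are assumptions):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))          // every 5 seconds of data becomes one micro-batch
val lines = ssc.socketTextStream("localhost", 9999)     // hypothetical live source
val errorCounts = lines.filter(_.contains("ERROR")).count()
errorCounts.print()                                     // runs on top of Spark Core for each micro-batch

ssc.start()                                             // start receiving and processing the stream
ssc.awaitTermination()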
Apache Spark Architecture
Working Spark Architecture
The Apache Spark framework uses a master-slave architecture that consists of a driver, which
runs on a master node, and many executors that run across the worker nodes in the cluster.
Apache Spark can be used for batch processing as well as real-time processing.

The Driver Program in the Apache Spark architecture calls the main program of an application
and creates the SparkContext. A SparkContext contains all the basic functionalities.

The Spark Driver contains various other components such as the DAG Scheduler, Task Scheduler,
Backend Scheduler, and Block Manager, which are responsible for translating the user-written
code into jobs that are actually executed on the cluster.
Working Spark Architecture cont..
The Spark Driver and SparkContext collectively watch over the job
execution within the cluster. The Spark Driver works with the Cluster
Manager to manage various other jobs. The Cluster Manager does the
resource allocation work. The job is then split into multiple
smaller tasks which are further distributed to worker nodes.

Whenever an RDD is created in the SparkContext, it can be
distributed across many worker nodes and can also be cached
there.
Worker nodes execute the tasks assigned by the Cluster Manager
and return the results to the SparkContext.

An executor is responsible for the execution of these tasks. The
lifetime of executors is the same as that of the Spark application. If
we want to increase the performance of the system, we can increase
the number of workers so that the jobs can be divided into more
logical portions.
Spark Architecture Overview cont..
Apache Spark has a well-defined layered
architecture where all the Spark components and
layers are loosely coupled. This architecture is
further integrated with various extensions and
libraries. Apache Spark Architecture is based on
two main abstractions:

• Resilient Distributed Dataset (RDD)

• Directed Acyclic Graph (DAG)
Spark Eco-Systems
As you can see, Spark comes packed with high-level libraries, including
support for R, SQL, Python, Scala, Java, etc. These standard libraries
enable seamless integration in complex workflows. On top of this, it also
allows various sets of services to integrate with it, such as MLlib,
GraphX, SQL + DataFrames, Streaming services, etc.
Resilient Distributed Dataset(RDD)
•Resilient: Fault-tolerant and capable of rebuilding
data on failure
•Distributed: Data is distributed among the multiple
nodes in a cluster
• Dataset: Collection of partitioned data with values
An RDD is a read-only, partitioned collection of records (like a DFS)
but with a record of how the dataset was created as a
combination of transformations from other dataset(s).
RDD contd..
An RDD (Resilient Distributed Dataset) is an immutable
distributed collection of objects. An RDD is a logical
reference to a dataset which is partitioned across many
server machines in the cluster. RDDs are immutable
and are self-recovering in case of failure. An RDD can
come from any data source, e.g. text files, a database via
JDBC, etc.
Creating an RDD

val rdd = sc.textFile("/some_file", 3)   // the argument 3 specifies the number of partitions
val lines = sc.parallelize(List("this is", "an example"))
Partitions

An RDD is a collection of data; if the data cannot fit
into a single node, it is partitioned across
various nodes. This means that the more partitions
there are, the more parallelism is possible. The partitions of
an RDD are distributed across all the nodes in the
network.
RDDs Operations(Transformations and Actions)

• There are two types of operations that you can perform on
an RDD: Transformations and Actions.
• A Transformation applies some function to an RDD and creates
a new RDD; it does not modify the RDD that you apply the
function to (remember that RDDs are immutable). Also, the
new RDD keeps a pointer to its parent RDD.
• Transformations are lazy operations on an RDD that create
one or many new RDDs,
e.g. map, filter, reduceByKey, join, cogroup, randomSplit.
• At a high level, there are two kinds of transformations that can be
applied to RDDs, namely narrow transformations and
wide transformations. Wide transformations basically result
in stage boundaries, as illustrated in the sketch below.
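Here is the sketch referred to above: a minimal Scala example (the sample data is an assumption) contrasting a narrow and a wide transformation:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)   // two partitions
val doubled = pairs.mapValues(v => v * 2)    // narrow: each partition is processed independently, no shuffle
val totals  = pairs.reduceByKey(_ + _)       // wide: values with the same key must be shuffled together,
                                             // which introduces a stage boundary
totals.collect().foreach(println)            // an action; only now are the transformations executed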
RDD-Transformations
• Narrow transformation — doesn't require the data to
be shuffled across the partitions; for example, map,
filter, etc.
• Wide transformation — requires the data to be
shuffled; for example, reduceByKey, etc.
• By applying transformations you incrementally build
an RDD lineage with all the parent RDDs of the final
RDD(s). Transformations are lazy, i.e. they are not executed
immediately. Transformations are only executed after an
action is called.

val rdd = sc.textFile("spam.txt")
val filtered = rdd.filter(line => line.contains("money"))
filtered.count()
Transformations contd..
sc.textFile() and rdd.filter() do not get executed
immediately; they only get executed once you call an
action on the RDD — here filtered.count(). An
action is used either to save a result to some location or
to display it. You can also print the RDD lineage
information by using
filtered.toDebugString (filtered is the RDD here).
RDDs can also be thought of as a set of instructions
that has to be executed, the first instruction being the load
instruction.
RDD
What is DAG in Apache Spark?
A DAG is a finite directed graph with no directed cycles. There are finitely
many vertices and edges, where each edge is directed from one vertex to
another. It contains a sequence of vertices such that every edge is
directed from earlier to later in the sequence. It is a strict generalization
of the MapReduce model. DAG operations can do better global
optimization than other systems like MapReduce. The picture of the DAG
becomes clearer in more complex jobs. Apache Spark's DAG view allows the
user to dive into a stage and expand the detail of any stage. In the
stage view, the details of all RDDs belonging to that stage are
expanded. The scheduler splits the Spark RDD operations into stages
based on the various transformations applied.
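As a minimal Scala sketch (the input and output paths are assumptions), the classic word count shows how the scheduler derives stages: the flatMap and map steps are pipelined into one stage, and reduceByKey introduces a shuffle and therefore a stage boundary.

val counts = sc.textFile("/path/to/input.txt")   // hypothetical input path
  .flatMap(line => line.split("\\s+"))           // pipelined with the next operator in the same stage
  .map(word => (word, 1))
  .reduceByKey(_ + _)                            // shuffle: the DAG scheduler starts a new stage here
counts.saveAsTextFile("/path/to/wordcounts")     // hypothetical output path; this action triggers execution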
Need of Directed Acyclic Graph in Spark
• The limitations of Hadoop MapReduce became a key reason to introduce the
DAG in Spark. The computation through MapReduce proceeds in three steps:
• The data is read from HDFS.
• Then Map and Reduce operations are applied.
• The computed result is written back to HDFS.
• Each MapReduce operation is independent of the others, and Hadoop has
no idea which MapReduce job will come next. Sometimes, for some
iterations, it is irrelevant to read and write back the intermediate results between
two MapReduce jobs. In such cases, the memory in stable storage (HDFS) or
disk gets wasted.
• In multi-step pipelines, every job is blocked from the beginning until the
completion of the previous job. As a result, complex computation can require a long
time even with small data volumes.
• In Spark, by contrast, a DAG (Directed Acyclic Graph) of consecutive
computation stages is formed. In this way, the execution plan is optimized,
e.g. to minimize shuffling data around. In MapReduce, by contrast, this is done
manually by tuning each MapReduce step.
How DAG works in Spark?
 The interpreter is the first layer: using a Scala interpreter,
Spark interprets the code with some modifications.
 Spark creates an operator graph when you enter your code in the
Spark console.
 When we call an action on a Spark RDD, at a high level Spark
submits the operator graph to the DAG Scheduler.
 The DAG Scheduler divides the operators into stages of tasks. A stage
contains tasks based on the partitions of the input data. The DAG
scheduler pipelines operators together; for example, map operators are
scheduled in a single stage.
 The stages are passed on to the Task Scheduler, which launches the tasks
through the cluster manager. The dependencies between stages are
unknown to the Task Scheduler.
 The workers execute the tasks on the slave nodes.
 The image below briefly describes the steps of how the DAG
works in Spark job execution.
INTERNALS OF JOB EXECUTION IN
SPARK
Advantages of DAG in Spark
There are multiple advantages of the Spark DAG:
 A lost RDD can be recovered using the Directed Acyclic Graph.
 MapReduce has just two operations, map and reduce, but in a DAG we have
multiple levels. So for executing SQL queries, the DAG is more flexible.
 The DAG helps to achieve fault tolerance; thus we can recover lost data.
 It can do better global optimization than a system like Hadoop MapReduce.
Working of the DAG Optimizer in Spark

Apache Spark optimizes the DAG by rearranging and combining operators
wherever possible. For example, if we submit a Spark job which has a map()
operation followed by a filter() operation, the DAG Optimizer will rearrange the
order of these operators, since filtering reduces the number of records that
have to undergo the map operation.

The DAG in Apache Spark is an alternative to MapReduce. It is a programming
style used in distributed systems. In MapReduce, we just have two functions (map and
reduce), while a DAG has multiple levels that form a tree structure. Hence, DAG
execution is faster than MapReduce because intermediate results are not written to
disk.
Apache Spark is a framework
• Apache Spark is a framework that is supported in
Scala, Python, R Programming, and Java. Below are
different implementations of Spark.
• Spark – Default interface for Scala and Java
• PySpark – Python interface for Spark
• SparklyR – R interface for Spark
Spark Shell:
Apache Spark provides an interactive spark-shell. It
helps Spark applications to run easily from the command
line of the system. Using the Spark shell we can
run/test our application code interactively. Spark can
read from many types of data sources, so that it can
access and process large amounts of data.
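For example, after launching the shell with ./bin/spark-shell from the Spark installation directory, the pre-created sc and spark objects can be used interactively (the file path below is an assumption):

val readme = sc.textFile("/path/to/README.md")   // hypothetical local file
readme.filter(line => line.contains("Spark")).count()
readme.first()                                   // quick look at the first line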
Cluster Manager Types
As of this writing, Spark supports the cluster managers below:

Standalone – a simple cluster manager included with Spark that
makes it easy to set up a cluster.
Apache Mesos – Mesos is a cluster manager that can also run
Hadoop MapReduce and Spark applications.
Hadoop YARN – the resource manager in Hadoop 2. This is the most
commonly used cluster manager.
Kubernetes – an open-source system for automating deployment,
scaling, and management of containerized applications.
local – not really a cluster manager, but still worth mentioning, as we
use "local" for master() in order to run Spark on
your laptop/computer.
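A minimal Scala sketch (the application name is an assumption) of how the master setting selects a cluster manager when building a SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")   // hypothetical application name
  .master("local[*]")              // run locally on all cores; alternatives include "yarn",
                                   // "mesos://host:5050", "spark://host:7077", or "k8s://https://host:443"
  .getOrCreate()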
DataFrame Spark with Basic Examples

A DataFrame is a distributed collection of data
organized into named columns. It is conceptually
equivalent to a table in a relational database or a data
frame in R/Python, but with richer optimizations
under the hood. DataFrames can be constructed from
a wide array of sources such as structured data files,
tables in Hive, external databases, or existing RDDs.
DataFrame creation
The simplest way to create a DataFrame is from a Seq
collection. A DataFrame can also be created from an
RDD and by reading files from several sources.
You can create a DataFrame by using the createDataFrame()
function of the SparkSession.
Using createDataFrame()

// sample data (illustrative values)
val data = Seq(("James", "", "Smith", "1991-04-01", "M", 3000),
               ("Anna", "Rose", "", "2000-05-01", "F", 4100))
val columns = Seq("firstname", "middlename", "lastname", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(columns: _*)

Since DataFrames are a structured format containing column names,
df.show() displays the first 20 rows of the DataFrame.
Using toDF() function

•Once we have an RDD, let's use toDF() to create a DataFrame in Spark.
By default, it creates column names "_1" and "_2", as we have
two columns in each row.
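A minimal Scala sketch (the sample data is an assumption) of toDF() with and without explicit column names:

import spark.implicits._   // brings the toDF() conversion into scope

val rdd = spark.sparkContext.parallelize(Seq(("Java", 20000), ("Python", 100000)))
val dfDefault = rdd.toDF()                      // columns are named _1 and _2 by default
val dfNamed   = rdd.toDF("language", "users")   // or supply explicit column names
dfNamed.show()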
Create Spark DataFrame from CSV
• In all the above examples, you have learned how Spark creates a DataFrame from RDDs
and data collection objects. In real applications these are less used; in this and the
following sections, you will learn how to create a DataFrame from data
sources like CSV, text, JSON, Avro, etc.

• Spark provides an API to read delimited files, such as comma-, pipe-, or tab-
separated files, and it also provides several options for handling headers,
double quotes, data types, etc. A minimal CSV example follows.
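Here is a minimal Scala sketch of reading a CSV file (the path and options are assumptions):

val csvDf = spark.read
  .option("header", "true")        // treat the first line as column names
  .option("inferSchema", "true")   // let Spark infer the column data types
  .option("delimiter", ",")        // use "|" or "\t" for pipe- or tab-separated files
  .csv("/path/to/people.csv")      // hypothetical path

csvDf.printSchema()
csvDf.show()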

 Ex: JavaScript Object Notation (JSON format)
{
"name": "Mohamed",
"age": 35,
"isProfessor": true,
"subjects": ["AI", "ML", "NLP"]
}
Creating from TXT,JSON
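A minimal Scala sketch (the file paths are assumptions) of reading text and JSON sources into DataFrames:

// Plain text: each line becomes one row in a single column named "value".
val txtDf = spark.read.text("/path/to/notes.txt")     // hypothetical path

// JSON: by default Spark expects one JSON object per line; use multiLine for pretty-printed files.
val jsonDf = spark.read
  .option("multiLine", "true")
  .json("/path/to/professor.json")                    // hypothetical path, e.g. the record shown above
jsonDf.printSchema()
jsonDf.show()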
Creating from an XML file
Creating from Hive
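A minimal Scala sketch (the database and table names are assumptions) of creating a DataFrame from a Hive table; Hive support must be enabled on the SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveExample")    // hypothetical application name
  .enableHiveSupport()       // lets Spark SQL read tables registered in the Hive metastore
  .getOrCreate()

val empDf = spark.sql("SELECT * FROM default.employees")   // hypothetical database.table
empDf.show()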
Create DataFrame from HBase table
UNIT-VI Conclusion –Apache Spark
As a result, we have seen every aspect of Apache Spark: what
Apache Spark programming is and the definition of Spark, the history of
Spark, why Spark is needed, the components of Apache Spark,
Spark RDDs, features of Spark RDDs, Spark Streaming, features
of Apache Spark, limitations of Apache Spark, and Apache Spark
use cases.
It provides a collection of technologies that increase the value of
big data and permit new Spark use cases. It gives us a unified
framework for creating, managing and implementing Spark big
data processing requirements.
In addition to MapReduce-style operations, one can also
implement SQL queries and process streaming data through
Spark, which were drawbacks of Hadoop 1. With Spark,
developers can use Spark features either on a stand-
alone basis or combine them with MapReduce programming
techniques.
