Unit V Big data
Apache Spark: Introducing Apache Spark, Why Hadoop plus Spark?, Components of Spark,
Apache Spark RDD, Apache Spark installation, Apache Spark architecture, Introducing real-time
processing, Architecture of Spark Streaming, Spark Streaming transformations and actions, Input
sources and output stores, Spark Streaming with Kafka and HBase.
………………………………………………………………………………………
Industries use Hadoop extensively to analyze their data sets, because the Hadoop framework is
based on a simple programming model (MapReduce) and it enables a computing solution that is
scalable, flexible, fault-tolerant and cost-effective. Here, the main concern is maintaining speed
while processing large datasets, both in the waiting time between queries and in the waiting time
to run a program.
Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational
processing.
Contrary to common belief, Spark is not a modified version of Hadoop and does not really depend
on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy
Spark.
Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark has its
own cluster management, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more
types of computations, including interactive queries and stream processing. The main feature of
Spark is its in-memory cluster computing, which increases the processing speed of an
application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.
Spark began as one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by
Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software
Foundation in 2013, and became a top-level Apache project in February 2014.
Features of Apache Spark
Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in
memory, and 10 times faster when running on disk. This is possible by reducing the
number of read/write operations to disk; intermediate processing data is stored in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala and Python, so
you can write applications in different languages. Spark also comes with 80 high-level
operators for interactive querying.
Advanced Analytics − Spark supports not only 'Map' and 'Reduce' but also SQL queries,
streaming data, machine learning (ML), and graph algorithms.
There are three ways in which Spark can be deployed with Hadoop components:
Standalone − Spark Standalone deployment means Spark occupies the place on top of
HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly.
Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN − Hadoop YARN deployment means, simply, Spark runs on YARN without
any pre-installation or root access required. It helps to integrate Spark into the Hadoop
ecosystem or Hadoop stack, and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in
addition to standalone deployment. With SIMR, a user can start Spark and use its shell
without any administrative access.
Why Hadoop plus Spark?
Combining Hadoop with Apache Spark offers several advantages for big data processing and
analytics:
1. Complementary Ecosystem:
- Hadoop and Spark are complementary frameworks that can be integrated seamlessly to
leverage the strengths of both.
- Hadoop provides a reliable and scalable storage layer through Hadoop Distributed File
System (HDFS), while Spark offers fast and efficient data processing capabilities.
2. Enhanced Performance:
- Spark's in-memory processing capabilities make it significantly faster than Hadoop
MapReduce, especially for iterative algorithms and interactive analytics.
- By running Spark on top of Hadoop, organizations can take advantage of Spark's performance
benefits while leveraging Hadoop's distributed storage and fault tolerance mechanisms.
3. Rich Ecosystem:
- Both Hadoop and Spark have rich ecosystems of tools and libraries for various data
processing tasks, including machine learning, graph processing, SQL querying, and more.
- By integrating Hadoop and Spark, organizations can access a wide range of tools and libraries
to build sophisticated data-driven applications and analytics solutions.
4. Cost Efficiency:
- Hadoop provides cost-effective storage solutions for storing large volumes of data, while
Spark's efficient processing engine reduces compute costs.
- By combining Hadoop's storage layer with Spark's processing engine, organizations can
optimize resource utilization and achieve cost efficiency in their big data infrastructure.
Overall, combining Hadoop with Apache Spark provides organizations with a powerful and
flexible platform for big data processing and analytics, enabling them to handle diverse
workloads, optimize performance, and drive insights from their data effectively.
Components of Spark
Spark Core
Spark Core is the underlying general execution engine for the Spark platform that all other
functionality is built upon. It provides in-memory computing and the ability to reference datasets
in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework built on top of Spark, thanks to the distributed
memory-based Spark architecture. According to benchmarks done by the MLlib developers
against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast
as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computations together with an optimized runtime for this abstraction.
Apache Spark RDD
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through
deterministic operations either on data in stable storage or on other RDDs. An RDD is a
fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file system,
HDFS, HBase, or any data source offering a Hadoop InputFormat.
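As a brief sketch in Scala of these two creation routes (the master URL and input path below are placeholder values, not taken from the notes):
```
import org.apache.spark.{SparkConf, SparkContext}

object RDDCreation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDCreation").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Parallelize an existing collection in the driver program
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    println(numbers.reduce(_ + _))                     // prints 15

    // 2. Reference a dataset in external storage (path is only an example)
    val lines = sc.textFile("hdfs:///data/input.txt")
    println(lines.count())

    sc.stop()
  }
}
```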
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations.
Let us first discuss how MapReduce operations take place and why they are not so efficient.
MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set of
high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations
(e.g., between two MapReduce jobs) is to write it to an external stable storage system (e.g.,
HDFS). Although this framework provides numerous abstractions for accessing a cluster's
computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data
sharing is slow in MapReduce due to replication, serialization, and disk I/O. Regarding the storage
system, most Hadoop applications spend more than 90% of their time doing HDFS read-write
operations.
Iterative Operations on MapReduce
Iterative applications reuse intermediate results across multiple computations. In MapReduce, the
only way to share these results between jobs is to write them to stable storage (e.g., HDFS), which
incurs overheads for replication, serialization, and disk I/O.
Interactive Operations on MapReduce
A user runs ad-hoc queries on the same subset of data. Each query does its own disk I/O against
stable storage, which can dominate application execution time.
Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O; most Hadoop
applications spend more than 90% of their time doing HDFS read-write operations. Recognizing
this problem, researchers developed a specialized framework called Apache Spark. The key idea
of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing: it
stores the state of memory as an object across jobs, and that object is sharable between those
jobs. Data sharing in memory is 10 to 100 times faster than network and disk.
Let us now look at how iterative and interactive operations take place with Spark RDDs.
Iterative operations on Spark RDD store intermediate results in distributed memory instead of
stable storage (disk), which makes the system faster.
Note − If the distributed memory (RAM) is not sufficient to store the intermediate results (the
state of the job), those results are spilled to disk.
For interactive operations on Spark RDD, if different queries are run on the same set of data
repeatedly, this particular data can be kept in memory for better execution times.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the
elements around on the cluster for much faster access the next time you query it. There is also
support for persisting RDDs on disk, or replicating them across multiple nodes.
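The following spark-shell-style sketch illustrates persisting an RDD; the log file path is an assumed example:
```
import org.apache.spark.storage.StorageLevel

// 'sc' is the SparkContext already provided by spark-shell; the path is an example
val logs = sc.textFile("hdfs:///data/app.log")
val errors = logs.filter(_.contains("ERROR"))

// Keep the filtered RDD in memory so repeated actions avoid re-reading from disk
errors.cache()                                        // shorthand for persist(StorageLevel.MEMORY_ONLY)

println(errors.count())                               // first action computes and caches the RDD
println(errors.filter(_.contains("timeout")).count()) // reuses the cached partitions

// Spill to disk when memory is insufficient:
// errors.persist(StorageLevel.MEMORY_AND_DISK)
```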
Apache Spark Installation
Installing Apache Spark involves several steps, including downloading the Spark distribution,
configuring the environment, and setting up dependencies. Below are the general steps to install
Apache Spark:
1. Prerequisites:
- Ensure that you have Java installed on your system. Apache Spark requires Java 8 or later.
- Verify that you have sufficient memory and disk space available for running Spark jobs.
2. Download Spark:
- Go to the official Apache Spark website: https://spark.apache.org/downloads.html
- Download the pre-built version of Spark that matches your Hadoop version. Choose a
package type (e.g., TAR or ZIP) based on your preference.
- Alternatively, you can also build Spark from source by downloading the source code package.
3. Extract Spark:
- Once the download is complete, extract the Spark archive to a directory of your choice on
your local machine.
- For example, you can use the following command to extract the TAR package:
```
tar -xvf spark-3.2.0-bin-hadoop3.2.tgz
```
4. Configuration:
- Spark comes with default configuration files located in the `conf` directory.
- You may need to customize the configuration settings based on your environment and
requirements. Key configuration files include `spark-defaults.conf`, `spark-env.sh`, and
`log4j.properties`.
5. Verify Installation:
- Open a terminal window and run the `spark-shell` command to launch the Spark interactive
shell (Scala REPL) or `pyspark` command for Python.
- If Spark starts successfully without any errors, it means that the installation was successful.
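As a quick sanity check, you can run a minimal sketch like the following inside spark-shell:
```
// spark-shell already provides a SparkContext named 'sc'
println(sc.version)                    // prints the installed Spark version
val data = sc.parallelize(1 to 1000)
println(data.sum())                    // should print 500500.0
```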
Spark Architecture
Spark follows a master-slave architecture. Its cluster consists of a single master and multiple
slaves. The Spark architecture depends upon two abstractions: the Resilient Distributed Dataset
(RDD) and the Directed Acyclic Graph (DAG).
Resilient Distributed Datasets are groups of data items that can be stored in memory on the
worker nodes. Here,
o Resilient: Restore the data on failure.
o Distributed: Data is distributed among different nodes.
o Dataset: Group of data.
A Directed Acyclic Graph is a finite directed graph that represents a sequence of computations
performed on data. Each node is an RDD partition, and each edge is a transformation on top of
the data. The graph describes the navigation of the computation, while directed and acyclic refer
to how it is carried out.
Driver Program
The Driver Program is a process that runs the main() function of the application and creates
the SparkContext object. The purpose of the SparkContext is to coordinate the Spark applications,
which run as independent sets of processes on a cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster managers and
then performs the following tasks: it acquires executors on nodes in the cluster, sends the
application code to those executors, and finally sends tasks to the executors to run.
Cluster Manager
o The role of the cluster manager is to allocate resources across applications. Spark is
capable of running on a large number of clusters.
o There are various types of cluster managers, such as Hadoop YARN, Apache Mesos
and the Standalone Scheduler.
o Here, the Standalone Scheduler is a standalone Spark cluster manager that makes it
possible to install Spark on an empty set of machines.
Worker Node
o The worker node is a slave node.
o Its role is to run the application code in the cluster.
Executor
o An executor is a process launched for an application on a worker node.
o It runs tasks and keeps data in memory or on disk storage across them.
o It reads and writes data to external sources.
o Every application has its own executors.
Task
o A unit of work that will be sent to one executor.
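To tie these pieces together, here is a minimal, hedged sketch of a driver program: main() creates the SparkContext, the chosen master URL selects the cluster manager, and the operations below execute as tasks inside executors on the worker nodes (the master URL and input path are placeholders):
```
import org.apache.spark.{SparkConf, SparkContext}

object ArchitectureDemo {
  def main(args: Array[String]): Unit = {
    // The driver runs main() and creates the SparkContext
    val conf = new SparkConf()
      .setAppName("ArchitectureDemo")
      .setMaster("yarn")                 // or "local[*]", "spark://host:7077", "mesos://..."
    val sc = new SparkContext(conf)

    // The cluster manager allocates executors; these operations run as tasks inside them
    val topWords = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .take(10)

    topWords.foreach(println)
    sc.stop()
  }
}
```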
Introducing Real-Time Processing
Real-time processing, also known as stream processing, involves the analysis and processing of
data streams in near real time, as data is generated or ingested into the system. Apache Spark
Streaming is a component of Apache Spark that enables real-time stream processing, allowing
developers to analyze continuous streams of data with low latency and high throughput. Here's
an introduction to real-time processing with Apache Spark Streaming:
1. Streaming Context:
- In Apache Spark, real-time processing is achieved through the StreamingContext API, which
provides a high-level abstraction for creating and managing streaming applications.
- The StreamingContext represents the entry point for creating DStreams (Discretized
Streams), which are the fundamental abstraction for processing streaming data.
2. Micro-Batch Processing:
- Spark Streaming divides the incoming data stream into small batches (micro-batches) based on
a configurable batch interval.
- Each batch of data is treated as an RDD (Resilient Distributed Dataset) in Spark, enabling
developers to use familiar RDD transformations and actions for stream processing.
3. Input Sources:
- Apache Spark Streaming supports various input sources for ingesting streaming data,
including Kafka, Flume, Kinesis, TCP sockets, file systems (e.g., HDFS), and custom receivers.
- Developers can create DStreams from these input sources to process data streams in real-
time.
4. Window Operations:
- Spark Streaming supports window operations, which enable developers to perform
computations over sliding windows of data.
- Window operations allow for aggregate computations, such as counting the number of events
in a time window or calculating averages over a windowed time interval.
5. Output Operations:
- Output operations in Spark Streaming allow developers to write the processed data to various
output sinks, including file systems, databases, dashboards, and external systems.
- Spark Streaming provides built-in support for output operations such as `saveAsTextFiles`,
`saveAsHadoopFiles`, and `foreachRDD`.
6. Fault Tolerance:
- Apache Spark Streaming provides fault tolerance through lineage information and
checkpointing.
- Checkpointing enables Spark to recover from failures by saving the state of the streaming
application periodically to a fault-tolerant storage system, such as HDFS.
Overall, Apache Spark Streaming provides a powerful and scalable platform for real-time stream
processing, enabling developers to build and deploy distributed streaming applications for a wide
range of use cases, including monitoring, fraud detection, anomaly detection, IoT data
processing, and more.
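A minimal word-count sketch over a TCP socket source shows these pieces working together (the batch interval, host and port are assumed values for illustration):
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))   // each 5-second micro-batch becomes one RDD

    // Ingest text lines from a TCP socket input source
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.print()            // output operation: prints the first ten elements of each batch

    ssc.start()               // start the streaming computation
    ssc.awaitTermination()    // wait for it to finish (or be stopped)
  }
}
```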
Architecture of Spark Streaming
The architecture of Apache Spark Streaming is designed to enable scalable and fault-tolerant
stream processing of real-time data streams. It builds upon the core architecture of Apache Spark
while introducing additional components and concepts specific to stream processing. Here's an
overview of the architecture of Spark Streaming:
1. Streaming Context:
- The entry point for creating and managing streaming applications in Apache Spark Streaming
is the StreamingContext.
- The StreamingContext represents the connection to the Spark cluster and provides a high-
level API for creating DStreams (Discretized Streams) and defining the streaming computation.
2. Receivers:
- Receivers are components responsible for ingesting data from input sources and converting it
into RDDs.
- Receivers run as long-running tasks on Executor nodes and continuously receive data from
the input sources.
- Receivers buffer the received data and push it to Spark's RDDs for processing.
3. Transformations:
- Spark Streaming provides a rich set of high-level transformations for processing DStreams.
- Transformations include map, filter, reduceByKey, window, join, and others, allowing
developers to perform complex stream processing operations on data streams.
4. Output Operations:
- Output operations in Spark Streaming enable developers to write the processed data from
DStreams to external storage systems, databases, or other downstream systems.
- Spark Streaming provides built-in support for output operations such as saveAsTextFiles,
saveAsHadoopFiles, and foreachRDD.
5. Fault Tolerance:
- Spark Streaming provides fault tolerance through RDD lineage information and
checkpointing.
- RDD lineage information is used to recompute lost or corrupted RDD partitions in case of
failures.
- Checkpointing allows Spark Streaming to periodically save the state of the streaming
application to a fault-tolerant storage system, such as HDFS, enabling recovery from failures.
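A hedged sketch of checkpoint-based recovery (the checkpoint directory, host and port are placeholder values): the factory builds a fresh context and registers the checkpoint directory, while StreamingContext.getOrCreate restores the saved state on restart if a checkpoint already exists.
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Factory that builds a new StreamingContext and registers the checkpoint directory
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/streaming-app")   // fault-tolerant storage, e.g. HDFS

  // Minimal pipeline so the context has an output operation to execute
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.count().print()
  ssc
}

// On restart, recover from the checkpoint if present; otherwise build a new context
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/streaming-app", createContext _)
ssc.start()
ssc.awaitTermination()
```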
Spark Streaming Transformations and Actions
In Apache Spark Streaming, transformations and actions are used to process and analyze
streaming data in near real time. These operations are applied to DStreams (Discretized
Streams), which represent continuous streams of data divided into small, discrete batches or
micro-batches. Here's an overview of transformations and actions in Spark Streaming:
1. Transformations:
- Transformations in Spark Streaming are similar to those in Apache Spark batch processing
and operate on DStreams to produce new DStreams.
- These transformations are applied to each RDD (micro-batch) within the DStream, allowing
developers to perform various stream processing operations.
- Map: Applies a function to each element of the DStream and returns a new DStream.
- FlatMap: Similar to map, but each input item can be mapped to 0 or more output items.
- ReduceByKey: Performs a reduction operation on the elements with the same key.
- UpdateStateByKey: Allows maintaining arbitrary state while processing each key over time.
- Window: Groups RDDs into windows and applies transformations on each window of data.
Supports operations like countByWindow, reduceByKeyAndWindow, etc.
- Join: Performs inner, outer, left outer, and right outer joins between two DStreams.
- Union: Concatenates two DStreams to create a new DStream containing elements from both.
2. Actions:
- Actions in Spark Streaming are operations that trigger the execution of the streaming
computation and produce a result.
- They typically involve aggregating, saving, or printing the results of the stream processing.
- Print: Prints the first ten elements of each RDD in the DStream to the console for debugging
purposes.
- SaveAsTextFiles: Saves each RDD of the DStream as a text file in the specified directory.
- ForeachRDD: Allows executing arbitrary code for each RDD of the DStream. Useful for
writing custom output operations or integrating with external systems.
- Collect: Collects all elements of the DStream and returns them as an array (not recommended
for large streams due to memory constraints).
Transformations and actions in Spark Streaming enable developers to perform a wide range of
stream processing operations, such as filtering, aggregating, joining, and saving results, allowing
for real-time analytics and insights from streaming data.
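The sketch below combines several of these operations: flatMap and map transformations, a windowed reduce over the last 30 seconds sliding every 10 seconds, and the print and saveAsTextFiles output operations (the source host/port, checkpoint directory and output prefix are assumed values):
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))     // 5-second micro-batches
    ssc.checkpoint("/tmp/streaming-checkpoint")          // needed for stateful operations such as updateStateByKey

    val lines = ssc.socketTextStream("localhost", 9999)

    // Windowed reduce: aggregate counts over the last 30 seconds, recomputed every 10 seconds
    val windowedCounts = lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedCounts.print()                               // debugging output, first ten elements per batch
    windowedCounts.saveAsTextFiles("/tmp/word-counts")   // one directory of text files per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```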
Input Sources and Output Stores
In Apache Spark Streaming, input sources and output stores are essential components that enable
developers to ingest streaming data from external sources, process it, and store or output the
results. Here's an overview of common input sources and output stores used in Spark Streaming:
1. Input Sources:
- Spark Streaming supports various input sources for ingesting streaming data, including:
- Kafka: Apache Kafka is a distributed streaming platform that is commonly used as an input
source for Spark Streaming. Spark provides built-in integration with Kafka, allowing developers
to consume Kafka topics as DStreams.
- Flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data. Spark Streaming can consume data
from Flume agents using custom Flume receivers.
- Kinesis: Amazon Kinesis is a platform for streaming data on AWS. Spark Streaming provides
integration with Kinesis through the Kinesis Receiver, allowing developers to consume data
streams from Kinesis shards.
- TCP Sockets: Spark Streaming can ingest data from TCP sockets, enabling direct streaming
from network sockets. This is useful for scenarios where data is generated by custom
applications or devices.
- File Systems: Spark Streaming can process data from various file systems, including HDFS
(Hadoop Distributed File System), local file systems, Amazon S3, and other distributed file
systems. It supports reading data as text files, sequence files, or any other Hadoop InputFormat.
- Custom Receivers: Developers can create custom receivers to ingest data from any source
that Spark does not directly support. Custom receivers extend the Receiver class and implement
the logic for data ingestion.
2. Output Stores:
- After processing streaming data, Spark Streaming allows developers to write the results to
various output stores or sinks:
- File Systems: Spark Streaming can save the processed data to file systems such as HDFS,
Amazon S3, or local file systems. It supports writing data as text files, sequence files, Parquet
files, or any other Hadoop OutputFormat.
- Databases: Spark Streaming integrates with relational databases and NoSQL databases,
allowing developers to write the results to databases such as MySQL, PostgreSQL, Cassandra,
MongoDB, or Elasticsearch.
- Message Brokers: Spark Streaming can publish the results to message brokers such as Kafka,
RabbitMQ, or ActiveMQ. This enables downstream processing or consumption by other systems
or applications.
- External Systems: Spark Streaming can output data to external systems or services using
custom output operations. Developers can implement custom logic to integrate with external
APIs, web services, or analytics platforms.
- Dashboards: Spark Streaming can feed processed data into visualization tools or dashboards
for real-time monitoring and visualization. Common tools include Apache Zeppelin, Grafana,
Kibana, or custom web applications.
- Alerting Systems: Spark Streaming can integrate with alerting systems or notification services
to trigger alerts or notifications based on certain conditions detected in the streaming data.
By supporting a wide range of input sources and output stores, Apache Spark Streaming provides
flexibility and scalability for building real-time streaming applications and pipelines for various
use cases, including real-time analytics, monitoring, alerting, and data integration.
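When writing to a database via foreachRDD, the usual pattern is to open the connection on the executors, once per partition rather than once per record. A hedged sketch using plain JDBC (it assumes a JDBC driver is on the classpath; the connection URL, credentials and table name are placeholders):
```
import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

// Writes (word, count) pairs from each micro-batch into a relational table
def saveCountsToDatabase(counts: DStream[(String, Long)]): Unit = {
  counts.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Connection is created on the executor, once per partition
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/metrics", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO word_counts(word, cnt) VALUES (?, ?)")
      records.foreach { case (word, count) =>
        stmt.setString(1, word)
        stmt.setLong(2, count)
        stmt.executeUpdate()
      }
      stmt.close()
      conn.close()
    }
  }
}
```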
Spark Streaming with Kafka and HBase
Integrating Apache Spark Streaming with Apache Kafka and Apache HBase allows for efficient
and scalable real-time data processing and storage. Here's an overview of how you can set up
Spark Streaming to consume data from Kafka and then store the processed results into HBase:
1. Setting up Kafka:
- Install and configure Apache Kafka on your system or cluster.
- Create a Kafka topic where you'll publish the streaming data. You can use the Kafka
command-line tools to create topics.
- Ensure that Kafka is running and accessible from your Spark cluster.
2. Setting up HBase:
- Install and configure Apache HBase on your system or cluster.
- Create an HBase table where you'll store the processed data. You can use the HBase shell or
HBase APIs to create tables.
- Ensure that HBase is running and accessible from your Spark cluster.
3. Processing Data:
- Apply transformations and processing logic to the DStream to manipulate the data as needed.
- Perform any necessary data cleansing, aggregation, or analytics operations.
- Use Spark's RDD or DataFrame API to perform the processing tasks (see the end-to-end sketch
at the end of this section).
4. Monitoring:
- Monitor the Spark Streaming application's progress and performance using the Spark UI and
HBase monitoring tools.
By integrating Spark Streaming with Kafka and HBase, you can build robust and scalable real-
time data processing pipelines for various use cases, including real-time analytics, monitoring,
and data warehousing.
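As an end-to-end sketch (assuming the spark-streaming-kafka-0-10 integration and the HBase client libraries are on the classpath; the broker address, topic name "events", HBase table "word_counts" and column family "cf" are placeholder values), the pipeline reads messages from Kafka, counts words per batch, and writes the counts to HBase inside foreachRDD:
```
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object KafkaToHBase {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToHBase")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Kafka consumer settings (broker, group id and topic are placeholders)
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-streaming-group",
      "auto.offset.reset"  -> "latest"
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("events"), kafkaParams))

    // Word count over the message values of each micro-batch
    val counts = stream.map(_.value)
      .flatMap(_.split(" "))
      .map((_, 1L))
      .reduceByKey(_ + _)

    // Store the results in HBase: one connection per partition
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val hbaseConf = HBaseConfiguration.create()
        val connection = ConnectionFactory.createConnection(hbaseConf)
        val table = connection.getTable(TableName.valueOf("word_counts"))
        partition.foreach { case (word, count) =>
          val put = new Put(Bytes.toBytes(word))                   // row key = word
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"),
            Bytes.toBytes(count.toString))
          table.put(put)
        }
        table.close()
        connection.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```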