Features of Apache Spark
Introduction
Apache Spark is a lightning-fast cluster computing technology designed for fast
computation. It builds on the Hadoop MapReduce model and extends it to efficiently
support more types of computations, including interactive queries and stream
processing. The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications,
iterative algorithms, interactive queries, and streaming. By supporting all of these
workloads in a single system, it reduces the management burden of maintaining
separate tools.
Spark deployment:
Components of Spark:
Spark Core:
Spark Core is the underlying general execution engine for the Spark platform that all
other functionality is built upon. It provides in-memory computing and the ability to
reference datasets in external storage systems.
Spark SQL:
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
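As a quick illustration, the sketch below uses the DataFrame API (the successor to SchemaRDD in current PySpark versions) to run a SQL query over structured data; the table name, column names, and values are made up for the example.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for Spark SQL
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Build a small DataFrame from an in-memory list (hypothetical sample data)
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()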
Spark Streaming:
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.
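A minimal sketch of this mini-batch model is the classic streaming word count below; it assumes a text source is sending lines to a socket on localhost:9999 (a placeholder), and it processes each 5-second mini-batch with ordinary RDD-style transformations.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Local SparkContext and a StreamingContext with 5-second mini-batches
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)

# Each mini-batch read from the socket becomes an RDD of lines
lines = ssc.socketTextStream("localhost", 9999)

# Word counts via RDD-style transformations on every mini-batch
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()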
GraphX:
GraphX is a distributed graph-processing framework built on top of Spark. It provides
an API for expressing graph computations on the same Spark engine.
RDD:
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each dataset in an RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain
any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. RDD
is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: by parallelizing an existing collection in your driver
program, or by referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
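A minimal sketch of both creation paths is shown below; the file path passed to textFile is only a placeholder (an HDFS URI such as hdfs://... would work the same way).

from pyspark import SparkContext

sc = SparkContext("local", "RDDCreationExample")

# 1. Parallelize an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.map(lambda x: x * 2).collect())   # [2, 4, 6, 8, 10]

# 2. Reference a dataset in an external storage system (placeholder path)
lines = sc.textFile("C:/data/sample.txt")
print(lines.count())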
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they
are not so efficient.
Why RDD?
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most
Hadoop applications spend more than 90% of their time doing HDFS read-write
operations.
With iterative operations on Spark RDDs, intermediate results are stored in distributed
memory instead of stable storage (disk), which makes the system faster (see the sketch
below).
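As an illustration of keeping intermediate results in memory, the hedged sketch below caches an RDD so that repeated iterations reuse the in-memory data instead of re-reading it from disk; the data and the number of iterations are made up.

from pyspark import SparkContext

sc = SparkContext("local", "IterativeRDDExample")

# Base dataset (hypothetical); persist() keeps it in distributed memory
data = sc.parallelize(range(1, 100001)).persist()

# Each iteration reuses the cached RDD instead of recomputing it from storage
total = 0
for i in range(1, 4):
    total += data.map(lambda x: x * i).sum()
print(total)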
Features of RDD:
● In-memory computation: intermediate results are kept in RAM rather than on disk.
● Immutability: once created, an RDD cannot be changed; transformations produce new RDDs.
● Lazy evaluation: transformations are not executed until an action is called.
● Fault tolerance: lost partitions are recomputed from lineage information.
● Partitioning: data is split into partitions that can be processed in parallel across the cluster.
Installation of PySpark:
Prerequisite:
Java (JDK) and Anaconda (for Python and Jupyter Notebook) should already be installed.
Download the Spark release from https://spark.apache.org/downloads.html and extract it,
so that the bin folder is available at:
● C:\Spark\spark-3.1.1-bin-hadoop2.7\bin
Select Edit the system environment variables; in the System Properties pop-up window,
on the Advanced tab, click Environment Variables:
Variable Value
SPARK_HOME C:\Spark\spark-3.1.1-bin-hadoop2.7
PYSPARK_DRIVER_PYTHON jupyter
PYSPARK_DRIVER_PYTHON_OPTS notebook
HADOOP_HOME C:\hadoop (the folder that contains bin\winutils.exe)
JAVA_HOME C:\Java\jdk1.8.0_291
Enter the above variables and their values under the User variables section for your Windows user account.
Make sure the respective paths specified in the Value column are also added to the Path
variable under System variables.
To check whether Spark is installed, run the following command in the Windows command
prompt:
● spark-shell --version
To run Jupyter Notebook, open the Anaconda command prompt and run the command:
● jupyter notebook
In a notebook, run the following code to check whether PySpark is working:
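The original snippet is not reproduced here; the sketch below is one common way to verify the setup (the findspark approach described in reference 1), assuming the findspark package has been installed with pip.

import findspark
findspark.init()          # locate the Spark installation via SPARK_HOME

from pyspark import SparkContext

# Create a local SparkContext and run a trivial job
sc = SparkContext("local", "InstallCheck")
print(sc.parallelize([1, 2, 3, 4, 5]).sum())   # should print 15
sc.stop()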
Reference:
1. https://changhsinlee.com/install-pyspark-windows-jupyter/
2. https://www.youtube.com/watch?v=AB2nUrKYRhw
3. https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm