
Unit – 5 SPARK

1. Spark - Introduction
Industries use Hadoop extensively to analyze their data sets. The reason is that the Hadoop
framework is based on a simple programming model (MapReduce), and it enables a computing
solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is
maintaining speed in processing large datasets, in terms of both the waiting time between queries
and the waiting time to run the program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop
computational process.
Contrary to common belief, Spark is not a modified version of Hadoop and is not really
dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways
to implement Spark.
Spark can use Hadoop in two ways: storage and processing. Since Spark has its own
cluster management, it uses Hadoop for storage purposes only.

Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce, and it extends the MapReduce model to use it efficiently for
more types of computations, including interactive queries and stream processing. The main
feature of Spark is its in-memory cluster computing, which increases the processing speed of an
application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark


Spark is one of Hadoop’s sub-projects, developed in 2009 in UC Berkeley’s AMPLab by Matei
Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software
Foundation in 2013, and Apache Spark became a top-level Apache project in February 2014.

Features of Apache Spark


Apache Spark has the following features.
• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in
memory, and 10 times faster when running on disk. This is possible by reducing the number of
read/write operations to disk; it stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala and Python, so
you can write applications in different languages. Spark also comes with 80 high-level
operators for interactive querying.
• Advanced Analytics − Spark supports not only ‘Map’ and ‘Reduce’ but also SQL
queries, streaming data, machine learning (ML), and graph algorithms.

Spark Built on Hadoop


The following diagram shows three ways of how Spark can be built with Hadoop components.
There are three ways of Spark deployment, as explained below; a hedged spark-submit illustration follows the list.
• Standalone − In a Spark Standalone deployment, Spark occupies the place on top of
HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here,
Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop Yarn − In a Hadoop YARN deployment, Spark simply runs on YARN without any
pre-installation or root access required. It helps integrate Spark into the Hadoop ecosystem
or Hadoop stack, and it allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in
addition to standalone deployment. With SIMR, the user can start Spark and use its shell
without any administrative access.
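As a hedged illustration of the first two modes, the same packaged application could be submitted with spark-submit; the class name, JAR name and master URL below are placeholders, not values taken from this text.
Standalone mode (point --master at the standalone master URL):
$ spark-submit --class com.example.MyApp --master spark://master-host:7077 myapp.jar
Hadoop YARN mode (let YARN allocate the executors):
$ spark-submit --class com.example.MyApp --master yarn-cluster myapp.jar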

Components of Spark
The following illustration depicts the different components of Spark.
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform that all other
functionality is built upon. It provides in-memory computing and the ability to reference datasets
in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework above Spark, thanks to the distributed
memory-based Spark architecture. According to benchmarks done by the MLlib developers
against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast
as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model user-defined graphs by using the Pregel abstraction
API. It also provides an optimized runtime for this abstraction.

2. Spark - Installation
Spark is Hadoop’s sub-project. Therefore, it is better to install Spark on a Linux-based system.
The following steps show how to install Apache Spark.

Step1: Verifying Java Installation


Java installation is one of the mandatory steps in installing Spark. Try the following command
to verify the Java version.
$java -version
If Java is already installed on your system, you will see the following response −
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
In case you do not have Java installed on your system, install Java before proceeding to the next
step.

Step2: Verifying Scala Installation


Spark is implemented in the Scala language, so Scala is required. Let us verify the Scala
installation using the following command.
$scala -version
If Scala is already installed on your system, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, then proceed to next step for Scala
installation.

Step3: Downloading Scala


Download the latest version of Scala by visiting the following link: Download Scala. For this
tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in
the download folder.
Step4: Installing Scala
Follow the below given steps for installing Scala.
Extract the Scala tar file
Type the following command for extracting the Scala tar file.
$ tar xvf scala-2.11.6.tgz
Move Scala software files
Use the following commands to move the Scala software files to their respective
directory (/usr/local/scala).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command to set the PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
Verifying Scala Installation
After installation, it is better to verify it. Use the following command to verify the Scala
installation.
$scala -version
If Scala is installed correctly, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step5: Downloading Apache Spark


Download the latest version of Spark by visiting the following link: Download Spark. For this
tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find
the Spark tar file in the download folder.

Step6: Installing Spark


Follow the steps given below for installing Spark.
Extracting Spark tar
Use the following command to extract the spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Moving Spark software files
Use the following commands to move the Spark software files to their respective
directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Setting up the environment for Spark
Add the following line to the ~/.bashrc file. This adds the location where the Spark software
files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command to source the ~/.bashrc file.
$ source ~/.bashrc

Step7: Verifying the Spark Installation


Write the following command to open the Spark shell.
$spark-shell
If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls
disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
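As a quick sanity check, you can run a one-line job directly from the shell (a minimal sketch; the numbers are arbitrary and not part of this tutorial):

scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050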
3. Spark – RDD - Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable
distributed collection of objects. Each dataset in RDD is divided into logical partitions, which
may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java,
or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through
deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant
collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file system,
HDFS, HBase, or any data source offering a Hadoop Input Format.
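A hedged sketch of both creation paths inside spark-shell is shown below; the HDFS path is a placeholder, not a file from this tutorial.

// Parallelizing an existing collection in the driver program
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

// Referencing a dataset in external storage (placeholder HDFS path)
val lines = sc.textFile("hdfs://namenode:9000/user/hadoop/input.txt")

println(distData.count() + " elements, " + lines.count() + " lines")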
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations.
Let us first discuss how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce


MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set of
high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations
(for example, between two MapReduce jobs) is to write it to an external stable storage system
such as HDFS. Although these frameworks provide numerous abstractions for accessing a cluster’s
computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data
sharing is slow in MapReduce due to replication, serialization, and disk I/O. Regarding the storage
system, most Hadoop applications spend more than 90% of their time doing HDFS
read-write operations.

Iterative Operations on MapReduce


In multi-stage applications, intermediate results are reused across multiple computations. The
following illustration explains how the current framework works while doing iterative
operations on MapReduce. This incurs substantial overheads due to data replication, disk I/O,
and serialization, which makes the system slow.
Interactive Operations on MapReduce
The user runs ad-hoc queries on the same subset of data. Each query does disk I/O against the
stable storage, which can dominate application execution time.
The following illustration explains how the current framework works while doing interactive
queries on MapReduce.

Data Sharing using Spark RDD


Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most
Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark.
The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory
processing. This means it stores the state of memory as an object across jobs, and the object
is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than network
and disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD


The illustration given below shows the iterative operations on Spark RDD. Spark stores
intermediate results in distributed memory instead of stable storage (disk), which makes the
system faster.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the state of
the job), then it will store those results on disk.

Interactive Operations on Spark RDD


This illustration shows interactive operations on Spark RDD. If different queries are run on the
same set of data repeatedly, this particular data can be kept in memory for better execution times.

By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the elements
around on the cluster for much faster access the next time you query it. There is also support for
persisting RDDs on disk, or replicating them across multiple nodes.
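A minimal sketch of persisting an RDD in spark-shell follows; the file name input.txt is a placeholder.

// Build a word-count RDD once and cache it in memory for repeated actions
val words = sc.textFile("input.txt").flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

counts.persist()                            // same as counts.cache(): memory-only by default
println(counts.count())                     // first action computes and caches the RDD
println(counts.filter(_._2 > 1).count())    // later actions reuse the in-memory data

// To also spill to disk when memory is insufficient:
// counts.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)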

4. Spark Architecture

Spark follows a master-slave architecture. Its cluster consists of a single master and
multiple slaves.

The Spark architecture depends upon two abstractions:

o Resilient Distributed Dataset (RDD)


o Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

The Resilient Distributed Datasets are the group of data items that can be stored in-memory on
worker nodes. Here,
o Resilient: Restore the data on failure.
o Distributed: Data is distributed among different nodes.
o Dataset: Group of data.

We will learn about RDD later in detail.

Directed Acyclic Graph (DAG)

A Directed Acyclic Graph is a finite directed graph that performs a sequence of computations on data.
Each node is an RDD partition, and each edge is a transformation on top of the data. Here, the graph
refers to the navigation, whereas directed and acyclic refer to how it is done.

Let's understand the Spark architecture.

Driver Program

The Driver Program is a process that runs the main() function of the application and creates
the SparkContext object. The purpose of the SparkContext is to coordinate the Spark applications,
which run as independent sets of processes on a cluster.

To run on a cluster, the SparkContext connects to a cluster manager and then
performs the following tasks:

o It acquires executors on nodes in the cluster.
o Then, it sends your application code to the executors. Here, the application code can be
defined by JAR or Python files passed to the SparkContext.
o Finally, the SparkContext sends tasks to the executors to run. A minimal driver sketch is given below.
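The flow above can be seen in a minimal driver program. This is only a sketch; the object name and the numbers are illustrative assumptions, not part of this text.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]) {
    // The driver creates the SparkContext, which talks to the cluster manager
    val conf = new SparkConf().setAppName("SimpleDriver")
    val sc = new SparkContext(conf)

    // The transformations and action below are turned into tasks that the
    // SparkContext sends to executors running on the worker nodes
    val total = sc.parallelize(1 to 1000).map(_ * 2).sum()
    println("Total: " + total)

    sc.stop()   // release the executors acquired for this application
  }
}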

Cluster Manager
o The role of the cluster manager is to allocate resources across applications. Spark is
capable of running on a large number of clusters.
o There are various types of cluster managers such as Hadoop YARN, Apache Mesos
and the Standalone Scheduler.
o Here, the Standalone Scheduler is a standalone Spark cluster manager that makes it possible to
install Spark on an empty set of machines.

Worker Node

o The worker node is a slave node.


o Its role is to run the application code in the cluster.

Executor

o An executor is a process launched for an application on a worker node.


o It runs tasks and keeps data in memory or disk storage across them.
o It reads and writes data to external sources.
o Every application has its own executors.

Task

o A unit of work that will be sent to one executor.

5. Spark Components

The Spark project consists of different types of tightly integrated components. At its core, Spark
is a computational engine that can schedule, distribute and monitor multiple applications.

Let's understand each Spark component in detail.


Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage
systems and memory management.

Spark SQL
o Spark SQL is built on top of Spark Core. It provides support for structured data.
o It allows you to query the data via SQL (Structured Query Language) as well as the Apache
Hive variant of SQL, called HQL (Hive Query Language). A hedged sketch follows this list.
o It supports JDBC and ODBC connections that establish a relation between Java objects
and existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
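A hedged sketch of the SQL path, assuming a Spark 1.4-style DataFrame API; people.json and its columns are placeholders, not files from this tutorial.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)        // built on top of Spark Core's SparkContext

// Load a JSON source into a DataFrame (placeholder file and schema)
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")

// Query the structured data with SQL
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()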

Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant
processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data
(a minimal sketch follows this list).
o Its design ensures that applications written for streaming data can be reused to
analyze batches of historical data with little modification.
o The log files generated by web servers can be considered a real-time example of a data
stream.
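A minimal streaming word-count sketch; the socket host, port and 10-second batch interval are assumptions for illustration only.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the existing SparkContext and process data in 10-second mini-batches
val ssc = new StreamingContext(sc, Seconds(10))

// Placeholder source: lines of text arriving on a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()              // start receiving data and running the mini-batch jobs
ssc.awaitTermination()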

MLlib
o MLlib is a Machine Learning library that contains various machine learning
algorithms.
o These include correlations and hypothesis testing, classification and regression,
clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
A small clustering sketch is given below.
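A small clustering sketch using the RDD-based MLlib API; the input file name and the values of k and the iteration count are placeholders.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one space-separated numeric vector per line
val data = sc.textFile("kmeans_data.txt")
val parsed = data.map(line => Vectors.dense(line.split(" ").map(_.toDouble))).cache()

// Train a k-means model with 2 clusters and 20 iterations
val model = KMeans.train(parsed, 2, 20)
model.clusterCenters.foreach(println)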
GraphX
o GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
o It facilitates the creation of a directed graph with arbitrary properties attached to each vertex
and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph,
joinVertices, and aggregateMessages (see the sketch below).
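A hedged sketch of building a small property graph and applying one of these operators; the vertex names and edge labels are invented for illustration.

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, attribute) pairs; edges carry a relationship label
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "likes")))

val graph = Graph(vertices, edges)
println("edges: " + graph.numEdges)

// subgraph keeps only the edges whose label is "follows"
val follows = graph.subgraph(epred = triplet => triplet.attr == "follows")
follows.triplets.collect().foreach(println)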

6. Scala

This Scala tutorial provides basic and advanced concepts of Scala. It is designed for
beginners and professionals.

Scala is an object-oriented and functional programming language.

This Scala tutorial includes all topics of the Scala language, such as data types, conditional expressions,
comments, functions, examples on OOP concepts, constructors, method overloading, the this
keyword, inheritance, final, exception handling, file handling, tuples, strings, string interpolation,
case classes, singleton objects, collections, etc.

What is Scala

Scala is a general-purpose programming language. It supports object-oriented, functional and
imperative programming approaches. It is a strongly, statically typed language. In Scala, everything is an
object, whether it is a function or a number; there is no concept of primitive data types.

It was designed by Martin Odersky. It was officially released for the Java platform in early 2004 and
for the .NET framework in June 2004. Later, Scala dropped .NET support in 2012.

Scala is influenced by Java, Haskell, Lisp, Pizza, etc., and has influenced F#, Fantom, Red, etc.

The file extension of a Scala source file may be either .scala or .sc.

You can create any kind of application like web application, enterprise application, mobile
application, desktop based application etc.

Scala Program Example

Let's see a simple Scala program. A detailed description of this program is given in the next
chapters.
1. object MainObject{
2. def main(args:Array[String]){
3. print("Hello Scala")
4. }
5. }

Where to use Scala


o Web applications
o Utilities and libraries
o Data streaming with Akka
o Parallel batch processing
o Concurrency and distributed application
o Data analysis with Spark
o AWS lambda expression
o Ad hoc scripting in REPL etc.

In Scala, you can create any type of application in less time and with less code, whether it is a web-based,
mobile-based or desktop-based application. Scala provides powerful tools and APIs with
which you can create applications. For example, you can use the Play framework, which provides a platform
to build web applications rapidly.

8. Scala Object and Class

Unlike Java, Scala is a pure object-oriented programming language. It allows us to create objects
and classes so that you can develop object-oriented applications.

Object

An object is a real-world entity. It contains state and behavior. A laptop, a car and a cell phone are
real-world objects. An object typically has two characteristics:

1) State: data values of an object are known as its state.

2) Behavior: functionality that an object performs is known as its behavior.

Object in scala is an instance of class. It is also known as runtime entity.

Class

A class is a template or a blueprint. It is also known as a collection of objects of similar type.

In scala, a class can contain:

1. Data member
2. Member method
3. Constructor
4. Block
5. Nested class
6. Super class information etc.

You must initialize all instance variables in the class. There is no default (package-private) access
scope; if you don't specify an access modifier, the member is public. There must be an object in which
the main method is defined; it provides the starting point for your program. Here, we have created an example of a class.

Scala Sample Example of Class


1. class Student{
2. var id:Int = 0; // All fields must be initialized
3. var name:String = null;
4. }
5. object MainObject{
6. def main(args:Array[String]){
7. var s = new Student() // Creating an object
8. println(s.id+" "+s.name);
9. }
10. }

Output:

0 null

Scala Sample Example2 of Class

In Scala, you can also create a class like this. Here, the constructor is created in the class definition. This is
called the primary constructor.

1. class Student(id:Int, name:String){ // Primary constructor


2. def show(){
3. println(id+" "+name)
4. }
5. }
6. object MainObject{
7. def main(args:Array[String]){
8. var s = new Student(100,"Martin") // Passing values to constructor
9. s.show() // Calling a function by using an object
10. }
11. }

Output:

100 Martin

Scala Example of class that maintains the records of students


1. class Student(id:Int, name:String){
2. def getRecord(){
3. println(id+" "+name);
4. }
5. }
6.
7. object MainObject{
8. def main(args: Array[String]){
9. var student1 = new Student(101,"Raju");
10. var student2 = new Student(102,"Martin");
11. student1.getRecord();
12. student2.getRecord();
13. }
14. }

Output:

101 Raju

102 Martin

9. Scala Variables and Data Types

A variable is a name which is used to refer to a memory location. You can create mutable and
immutable variables in Scala. Let's see how to declare variables.

Mutable Variable

You can create a mutable variable using the var keyword. It allows you to change the value after
the declaration of the variable.

1. var data = 100


2. data = 101 // It works, No error.

In the above code, var is a keyword and data is a variable name. It contains an integer value 100.
Scala is a type-inferred language, so you don't need to specify the data type explicitly. You can also
mention the data type of a variable explicitly, as we have done below.

Another example of variable

1. var data:Int = 100 // Here, we have mentioned the data type Int after a colon (:)

Immutable Variable
1. val data = 100
2. data = 101 // Error: reassignment to val

The above code throws an error because we have changed the content of an immutable variable, which
is not allowed. If you want to change the content, it is advisable to use var instead of val.
Data Types in Scala

Data types in Scala are much like those in Java in terms of their storage and length, except that in Scala
there is no concept of primitive data types: every type is an object and starts with a capital letter. A
table of data types is given below. You will see their uses later.

Data Type   Default Value   Size

Boolean     false           true or false

Byte        0               8 bit signed value (-2^7 to 2^7-1)

Short       0               16 bit signed value (-2^15 to 2^15-1)

Char        '\u0000'        16 bit unsigned Unicode character (0 to 2^16-1)

Int         0               32 bit signed value (-2^31 to 2^31-1)

Long        0L              64 bit signed value (-2^63 to 2^63-1)

Float       0.0F            32 bit IEEE 754 single-precision float

Double      0.0D            64 bit IEEE 754 double-precision float

String      null            A sequence of characters
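A minimal sketch showing a few of these types with explicit annotations; the values themselves are arbitrary.

object TypeDemo {
  def main(args: Array[String]) {
    val flag: Boolean = true       // Boolean: true or false
    val b: Byte = 100              // 8 bit signed value
    val ch: Char = 'A'             // 16 bit unsigned Unicode character
    val i: Int = 42                // 32 bit signed value
    val l: Long = 123456789L       // 64 bit signed value
    val f: Float = 3.14F           // 32 bit IEEE 754 single-precision float
    val d: Double = 2.71828        // 64 bit IEEE 754 double-precision float
    val s: String = "Scala"        // a sequence of characters
    println(s + " " + i + " " + d)
  }
}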

10. Scala while loop

In Scala, the while loop is used to execute code repeatedly as long as the specified condition holds. It tests a boolean expression
and iterates again and again. It is recommended to use a while loop if you don't know the number
of iterations in advance.

Syntax

1. while(boolean expression){
2. // Statements to be executed
3. }

Flowchart:

Scala while loop Example


1. object MainObject {
2. def main(args: Array[String]) {
3. var a = 10; // Initialization
4. while( a<=20 ){ // Condition
5. println(a);
6. a = a+2 // Incrementation
7. }
8. }
9. }

Output:

10

12

14

16
18

20

11. Scala for loop

In Scala, the for loop is known as a for-comprehension. It can be used to iterate, filter and return an
iterated collection. A for-comprehension looks a bit like a for loop in imperative languages,
except that it constructs a list of the results of all iterations.

Syntax

1. for( i <- range){


2. // statements to be executed
3. }

In the above syntax, range is a value which has a start and an end point. You can pass a range by
using the to or until keyword.

Scala for-loop example by using to keyword


1. object MainObject {
2. def main(args: Array[String]) {
3. for( a <- 1 to 10 ){
4. println(a);
5. }
6. }
7. }

Output:

1
2
3
4
5
6
7
8
9
10
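The until keyword works the same way but excludes the end point. A minimal sketch:

object MainObject {
  def main(args: Array[String]) {
    for( a <- 1 until 5 ){   // prints 1, 2, 3, 4 (5 is excluded)
      println(a)
    }
  }
}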

12. Scala Inheritance

Inheritance is an object-oriented concept which is used for reusability of code. You can achieve
inheritance by using the extends keyword. To achieve inheritance, a class must extend another class.
A class which is extended is called a super class or parent class; a class which extends another class is called a
derived class or child class.

Syntax

1. class SubClassName extends SuperClassName(){


2. /* Write your code
3. * methods and fields etc.
4. */
5. }

Understand the Simple Example of Inheritance


Scala Single Inheritance Example
1. class Employee{
2. var salary:Float = 10000
3. }
4.
5. class Programmer extends Employee{
6. var bonus:Int = 5000
7. println("Salary = "+salary)
8. println("Bonus = "+bonus)
9. }
10.
11. object MainObject{
12. def main(args:Array[String]){
13. new Programmer()
14. }
15. }

Output:

Salary = 10000.0

Bonus = 5000

Types of Inheritance in Scala

Scala supports various types of inheritance including single, multilevel, multiple, and hybrid.
You can use single, multilevel and hierarchical inheritance directly in your classes. Multiple and hybrid
inheritance can only be achieved by using traits; a trait-based sketch is given after the multilevel example below.
Here, we represent all types of inheritance in pictorial form.
Scala Multilevel Inheritance Example
1. class A{
2. var salary1 = 10000
3. }
4.
5. class B extends A{
6. var salary2 = 20000
7. }
8.
9. class C extends B{
10. def show(){
11. println("salary1 = "+salary1)
12. println("salary2 = "+salary2)
13. }
14. }
15.
16. object MainObject{
17. def main(args:Array[String]){
18. var c = new C()
19. c.show()
20.
21. }
22. }

Output:

salary1 = 10000

salary2 = 20000
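As noted above, multiple inheritance is achieved with traits. A minimal sketch; the trait and class names are invented for illustration.

trait Printable {
  def printMessage() {
    println("Printing a document")
  }
}

trait Showable {
  def showMessage() {
    println("Showing a document")
  }
}

// A class mixes in both traits using extends and with
class Document extends Printable with Showable

object MainObject {
  def main(args:Array[String]){
    var d = new Document()
    d.printMessage()
    d.showMessage()
  }
}

Output:

Printing a document

Showing a document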
