
Unit – 5 SPARK

1. Spark - Introduction
Industries use Hadoop extensively to analyze their data sets. The reason is that the Hadoop
framework is based on a simple programming model (MapReduce), and it enables a computing
solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is
maintaining speed in processing large datasets, in terms of both the waiting time between queries
and the waiting time to run the program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop
computational process.
Contrary to common belief, Spark is not a modified version of Hadoop and is not really
dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways
to implement Spark.
Spark can use Hadoop in two ways: storage and processing. Since Spark has its own
cluster management, it uses Hadoop for storage purposes only.

Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce, and it extends the MapReduce model to use it efficiently for
more types of computations, including interactive queries and stream processing. The main
feature of Spark is its in-memory cluster computing, which increases the processing speed of an
application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark


Spark is one of Hadoop’s sub-projects, developed in 2009 in UC Berkeley’s AMPLab by Matei
Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software
Foundation in 2013, and Apache Spark became a top-level Apache project in February 2014.

Features of Apache Spark


Apache Spark has the following features.
• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in
memory, and 10 times faster when running on disk. This is possible by reducing the number of
read/write operations to disk; it stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala and Python, so
you can write applications in different languages. Spark also comes with 80 high-level
operators for interactive querying.
• Advanced Analytics − Spark supports not only ‘Map’ and ‘Reduce’ but also SQL
queries, streaming data, machine learning (ML), and graph algorithms.

Spark Built on Hadoop


The following diagram shows three ways of how Spark can be built with Hadoop components.
There are three ways of Spark deployment, as explained below; a hedged spark-submit illustration follows the list.
• Standalone − In a Spark Standalone deployment, Spark occupies the place on top of
HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here,
Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop Yarn − In a Hadoop YARN deployment, Spark simply runs on YARN without any
pre-installation or root access required. It helps integrate Spark into the Hadoop ecosystem
or Hadoop stack, and it allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in
addition to standalone deployment. With SIMR, the user can start Spark and use its shell
without any administrative access.
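As a hedged illustration of the first two modes, the same packaged application could be submitted with spark-submit; the class name, JAR name and master URL below are placeholders, not values taken from this text.
Standalone mode (point --master at the standalone master URL):
$ spark-submit --class com.example.MyApp --master spark://master-host:7077 myapp.jar
Hadoop YARN mode (let YARN allocate the executors):
$ spark-submit --class com.example.MyApp --master yarn-cluster myapp.jar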

Components of Spark
The following illustration depicts the different components of Spark.
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform that all other
functionality is built upon. It provides in-memory computing and the ability to reference datasets
in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework above Spark, thanks to the distributed
memory-based Spark architecture. According to benchmarks done by the MLlib developers
against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast
as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model user-defined graphs by using the Pregel abstraction
API. It also provides an optimized runtime for this abstraction.

2. Spark - Installation
Spark is Hadoop’s sub-project. Therefore, it is better to install Spark on a Linux-based system.
The following steps show how to install Apache Spark.

Step1: Verifying Java Installation


Java installation is one of the mandatory steps in installing Spark. Try the following command
to verify the Java version.
$java -version
If Java is already installed on your system, you will see the following response −
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
In case you do not have Java installed on your system, install Java before proceeding to the next
step.

Step2: Verifying Scala Installation


Spark is implemented in the Scala language, so Scala is required. Let us verify the Scala
installation using the following command.
$scala -version
If Scala is already installed on your system, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, then proceed to next step for Scala
installation.

Step3: Downloading Scala


Download the latest version of Scala by visiting the following link: Download Scala. For this
tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in
the download folder.
Step4: Installing Scala
Follow the below given steps for installing Scala.
Extract the Scala tar file
Type the following command for extracting the Scala tar file.
$ tar xvf scala-2.11.6.tgz
Move Scala software files
Use the following commands to move the Scala software files to their respective
directory (/usr/local/scala).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command to set the PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
Verifying Scala Installation
After installation, it is better to verify it. Use the following command to verify the Scala
installation.
$scala -version
If Scala is installed correctly, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step5: Downloading Apache Spark


Download the latest version of Spark by visiting the following link: Download Spark. For this
tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find
the Spark tar file in the download folder.

Step6: Installing Spark


Follow the steps given below for installing Spark.
Extracting Spark tar
Use the following command to extract the spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Moving Spark software files
Use the following commands to move the Spark software files to their respective
directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Setting up the environment for Spark
Add the following line to the ~/.bashrc file. This adds the location where the Spark software
files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command to source the ~/.bashrc file.
$ source ~/.bashrc

Step7: Verifying the Spark Installation


Write the following command to open the Spark shell.
$spark-shell
If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls
disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
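As a quick sanity check, you can run a one-line job directly from the shell (a minimal sketch; the numbers are arbitrary and not part of this tutorial):

scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050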
3. Spark – RDD - Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable
distributed collection of objects. Each dataset in RDD is divided into logical partitions, which
may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java,
or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through
deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant
collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file system,
HDFS, HBase, or any data source offering a Hadoop Input Format.
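A hedged sketch of both creation paths inside spark-shell is shown below; the HDFS path is a placeholder, not a file from this tutorial.

// Parallelizing an existing collection in the driver program
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

// Referencing a dataset in external storage (placeholder HDFS path)
val lines = sc.textFile("hdfs://namenode:9000/user/hadoop/input.txt")

println(distData.count() + " elements, " + lines.count() + " lines")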
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations.
Let us first discuss how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce


MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set of
high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations
(for example, between two MapReduce jobs) is to write it to an external stable storage system
such as HDFS. Although these frameworks provide numerous abstractions for accessing a cluster’s
computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data
sharing is slow in MapReduce due to replication, serialization, and disk I/O. Regarding the storage
system, most Hadoop applications spend more than 90% of their time doing HDFS
read-write operations.

Iterative Operations on MapReduce


In multi-stage applications, intermediate results are reused across multiple computations. The
following illustration explains how the current framework works while doing iterative
operations on MapReduce. This incurs substantial overheads due to data replication, disk I/O,
and serialization, which makes the system slow.
Interactive Operations on MapReduce
The user runs ad-hoc queries on the same subset of data. Each query does disk I/O against the
stable storage, which can dominate application execution time.
The following illustration explains how the current framework works while doing interactive
queries on MapReduce.

Data Sharing using Spark RDD


Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most
Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark.
The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory
processing. This means it stores the state of memory as an object across jobs, and the object
is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than network
and disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD


The illustration given below shows the iterative operations on Spark RDD. Spark stores
intermediate results in distributed memory instead of stable storage (disk), which makes the
system faster.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the state of
the job), then it will store those results on disk.

Interactive Operations on Spark RDD


This illustration shows interactive operations on Spark RDD. If different queries are run on the
same set of data repeatedly, this particular data can be kept in memory for better execution times.

By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the elements
around on the cluster for much faster access the next time you query it. There is also support for
persisting RDDs on disk, or replicating them across multiple nodes.
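A minimal sketch of persisting an RDD in spark-shell follows; the file name input.txt is a placeholder.

// Build a word-count RDD once and cache it in memory for repeated actions
val words = sc.textFile("input.txt").flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

counts.persist()                            // same as counts.cache(): memory-only by default
println(counts.count())                     // first action computes and caches the RDD
println(counts.filter(_._2 > 1).count())    // later actions reuse the in-memory data

// To also spill to disk when memory is insufficient:
// counts.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)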

4. Spark Architecture

Spark follows a master-slave architecture. Its cluster consists of a single master and
multiple slaves.

The Spark architecture depends upon two abstractions:

o Resilient Distributed Dataset (RDD)


o Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

The Resilient Distributed Datasets are the group of data items that can be stored in-memory on
worker nodes. Here,
o Resilient: Restore the data on failure.
o Distributed: Data is distributed among different nodes.
o Dataset: Group of data.

We will learn about RDD later in detail.

Directed Acyclic Graph (DAG)

A Directed Acyclic Graph is a finite directed graph that performs a sequence of computations on data.
Each node is an RDD partition, and each edge is a transformation on top of the data. Here, the graph
refers to the navigation, whereas directed and acyclic refer to how it is done.

Let's understand the Spark architecture.

Driver Program

The Driver Program is a process that runs the main() function of the application and creates
the SparkContext object. The purpose of the SparkContext is to coordinate the Spark applications,
which run as independent sets of processes on a cluster.

To run on a cluster, the SparkContext connects to a cluster manager and then
performs the following tasks:

o It acquires executors on nodes in the cluster.
o Then, it sends your application code to the executors. Here, the application code can be
defined by JAR or Python files passed to the SparkContext.
o Finally, the SparkContext sends tasks to the executors to run. A minimal driver sketch is given below.
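The flow above can be seen in a minimal driver program. This is only a sketch; the object name and the numbers are illustrative assumptions, not part of this text.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]) {
    // The driver creates the SparkContext, which talks to the cluster manager
    val conf = new SparkConf().setAppName("SimpleDriver")
    val sc = new SparkContext(conf)

    // The transformations and action below are turned into tasks that the
    // SparkContext sends to executors running on the worker nodes
    val total = sc.parallelize(1 to 1000).map(_ * 2).sum()
    println("Total: " + total)

    sc.stop()   // release the executors acquired for this application
  }
}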

Cluster Manager
o The role of the cluster manager is to allocate resources across applications. Spark is
capable of running on a large number of clusters.
o There are various types of cluster managers such as Hadoop YARN, Apache Mesos
and the Standalone Scheduler.
o Here, the Standalone Scheduler is a standalone Spark cluster manager that makes it possible to
install Spark on an empty set of machines.

Worker Node

o The worker node is a slave node.


o Its role is to run the application code in the cluster.

Executor

o An executor is a process launched for an application on a worker node.


o It runs tasks and keeps data in memory or disk storage across them.
o It reads and writes data to external sources.
o Every application has its own executors.

Task

o A unit of work that will be sent to one executor.

5. Spark Components

The Spark project consists of different types of tightly integrated components. At its core, Spark
is a computational engine that can schedule, distribute and monitor multiple applications.

Let's understand each Spark component in detail.


Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage
systems and memory management.

Spark SQL
o Spark SQL is built on top of Spark Core. It provides support for structured data.
o It allows you to query the data via SQL (Structured Query Language) as well as the Apache
Hive variant of SQL, called HQL (Hive Query Language). A hedged sketch follows this list.
o It supports JDBC and ODBC connections that establish a relation between Java objects
and existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
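A hedged sketch of the SQL path, assuming a Spark 1.4-style DataFrame API; people.json and its columns are placeholders, not files from this tutorial.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)        // built on top of Spark Core's SparkContext

// Load a JSON source into a DataFrame (placeholder file and schema)
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")

// Query the structured data with SQL
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()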

Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant
processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data
(a minimal sketch follows this list).
o Its design ensures that applications written for streaming data can be reused to
analyze batches of historical data with little modification.
o The log files generated by web servers can be considered a real-time example of a data
stream.
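A minimal streaming word-count sketch; the socket host, port and 10-second batch interval are assumptions for illustration only.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the existing SparkContext and process data in 10-second mini-batches
val ssc = new StreamingContext(sc, Seconds(10))

// Placeholder source: lines of text arriving on a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()              // start receiving data and running the mini-batch jobs
ssc.awaitTermination()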

MLlib
o MLlib is a Machine Learning library that contains various machine learning
algorithms.
o These include correlations and hypothesis testing, classification and regression,
clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
A small clustering sketch is given below.
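A small clustering sketch using the RDD-based MLlib API; the input file name and the values of k and the iteration count are placeholders.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one space-separated numeric vector per line
val data = sc.textFile("kmeans_data.txt")
val parsed = data.map(line => Vectors.dense(line.split(" ").map(_.toDouble))).cache()

// Train a k-means model with 2 clusters and 20 iterations
val model = KMeans.train(parsed, 2, 20)
model.clusterCenters.foreach(println)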
GraphX
o GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
o It facilitates the creation of a directed graph with arbitrary properties attached to each vertex
and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph,
joinVertices, and aggregateMessages (see the sketch below).
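A hedged sketch of building a small property graph and applying one of these operators; the vertex names and edge labels are invented for illustration.

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, attribute) pairs; edges carry a relationship label
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "likes")))

val graph = Graph(vertices, edges)
println("edges: " + graph.numEdges)

// subgraph keeps only the edges whose label is "follows"
val follows = graph.subgraph(epred = triplet => triplet.attr == "follows")
follows.triplets.collect().foreach(println)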

6. Scala

This Scala tutorial provides basic and advanced concepts of Scala. It is designed for
beginners and professionals.

Scala is an object-oriented and functional programming language.

This Scala tutorial includes all topics of the Scala language, such as data types, conditional expressions,
comments, functions, examples on OOP concepts, constructors, method overloading, the this
keyword, inheritance, final, exception handling, file handling, tuples, strings, string interpolation,
case classes, singleton objects, collections, etc.

What is Scala

Scala is a general-purpose programming language. It supports object-oriented, functional and
imperative programming approaches. It is a strongly, statically typed language. In Scala, everything is an
object, whether it is a function or a number; there is no concept of primitive data types.

It was designed by Martin Odersky. It was officially released for the Java platform in early 2004 and
for the .NET framework in June 2004. Later, Scala dropped .NET support in 2012.

Scala is influenced by Java, Haskell, Lisp, Pizza, etc., and has influenced F#, Fantom, Red, etc.

The file extension of a Scala source file may be either .scala or .sc.

You can create any kind of application like web application, enterprise application, mobile
application, desktop based application etc.

Scala Program Example

Let's see a simple Scala program. A detailed description of this program is given in the next
chapters.
1. object MainObject{
2. def main(args:Array[String]){
3. print("Hello Scala")
4. }
5. }

Where to use Scala


o Web applications
o Utilities and libraries
o Data streaming with Akka
o Parallel batch processing
o Concurrency and distributed application
o Data analysis with Spark
o AWS lambda expression
o Ad hoc scripting in REPL etc.

In Scala, you can create any type of application in less time and with less code, whether it is a web-based,
mobile-based or desktop-based application. Scala provides powerful tools and APIs with
which you can create applications. For example, you can use the Play framework, which provides a platform
to build web applications rapidly.

8. Scala Object and Class

Unlike Java, Scala is a pure object-oriented programming language. It allows us to create objects
and classes so that you can develop object-oriented applications.

Object

An object is a real-world entity. It contains state and behavior. A laptop, a car and a cell phone are
real-world objects. An object typically has two characteristics:

1) State: data values of an object are known as its state.

2) Behavior: functionality that an object performs is known as its behavior.

Object in scala is an instance of class. It is also known as runtime entity.

Class

A class is a template or a blueprint. It is also known as a collection of objects of similar type.

In scala, a class can contain:

1. Data member
2. Member method
3. Constructor
4. Block
5. Nested class
6. Super class information etc.

You must initialize all instance variables in the class. There is no default (package-private) access
scope; if you don't specify an access modifier, the member is public. There must be an object in which
the main method is defined; it provides the starting point for your program. Here, we have created an example of a class.

Scala Sample Example of Class


1. class Student{
2. var id:Int = 0; // All fields must be initialized
3. var name:String = null;
4. }
5. object MainObject{
6. def main(args:Array[String]){
7. var s = new Student() // Creating an object
8. println(s.id+" "+s.name);
9. }
10. }

Output:

0 null

Scala Sample Example2 of Class

In Scala, you can also create a class like this. Here, the constructor is created in the class definition. This is
called the primary constructor.

1. class Student(id:Int, name:String){ // Primary constructor


2. def show(){
3. println(id+" "+name)
4. }
5. }
6. object MainObject{
7. def main(args:Array[String]){
8. var s = new Student(100,"Martin") // Passing values to constructor
9. s.show() // Calling a function by using an object
10. }
11. }

Output:

100 Martin

Scala Example of class that maintains the records of students


1. class Student(id:Int, name:String){
2. def getRecord(){
3. println(id+" "+name);
4. }
5. }
6.
7. object MainObject{
8. def main(args: Array[String]){
9. var student1 = new Student(101,"Raju");
10. var student2 = new Student(102,"Martin");
11. student1.getRecord();
12. student2.getRecord();
13. }
14. }

Output:

101 Raju

102 Martin

9. Scala Variables and Data Types

A variable is a name which is used to refer to a memory location. You can create mutable and
immutable variables in Scala. Let's see how to declare variables.

Mutable Variable

You can create a mutable variable using the var keyword. It allows you to change the value after
the declaration of the variable.

1. var data = 100


2. data = 101 // It works, No error.

In the above code, var is a keyword and data is a variable name. It contains an integer value 100.
Scala is a type-inferred language, so you don't need to specify the data type explicitly. You can also
mention the data type of a variable explicitly, as we have done below.

Another example of variable

1. var data:Int = 100 // Here, we have mentioned the data type Int after a colon (:)

Immutable Variable
1. val data = 100
2. data = 101 // Error: reassignment to val

The above code throws an error because we have changed the content of an immutable variable, which
is not allowed. If you want to change the content, it is advisable to use var instead of val.
Data Types in Scala

Data types in Scala are much like those in Java in terms of their storage and length, except that in Scala
there is no concept of primitive data types: every type is an object and starts with a capital letter. A
table of data types is given below. You will see their uses later.

Data Type   Default Value   Size

Boolean     false           true or false

Byte        0               8 bit signed value (-2^7 to 2^7-1)

Short       0               16 bit signed value (-2^15 to 2^15-1)

Char        '\u0000'        16 bit unsigned Unicode character (0 to 2^16-1)

Int         0               32 bit signed value (-2^31 to 2^31-1)

Long        0L              64 bit signed value (-2^63 to 2^63-1)

Float       0.0F            32 bit IEEE 754 single-precision float

Double      0.0D            64 bit IEEE 754 double-precision float

String      null            A sequence of characters
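A minimal sketch showing a few of these types with explicit annotations; the values themselves are arbitrary.

object TypeDemo {
  def main(args: Array[String]) {
    val flag: Boolean = true       // Boolean: true or false
    val b: Byte = 100              // 8 bit signed value
    val ch: Char = 'A'             // 16 bit unsigned Unicode character
    val i: Int = 42                // 32 bit signed value
    val l: Long = 123456789L       // 64 bit signed value
    val f: Float = 3.14F           // 32 bit IEEE 754 single-precision float
    val d: Double = 2.71828        // 64 bit IEEE 754 double-precision float
    val s: String = "Scala"        // a sequence of characters
    println(s + " " + i + " " + d)
  }
}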

10. Scala while loop

In Scala, the while loop is used to execute code repeatedly as long as the specified condition holds. It tests a boolean expression
and iterates again and again. It is recommended to use a while loop if you don't know the number
of iterations in advance.

Syntax

1. while(boolean expression){
2. // Statements to be executed
3. }

Flowchart:

Scala while loop Example


1. object MainObject {
2. def main(args: Array[String]) {
3. var a = 10; // Initialization
4. while( a<=20 ){ // Condition
5. println(a);
6. a = a+2 // Incrementation
7. }
8. }
9. }

Output:

10

12

14

16
18

20

11. Scala for loop

In Scala, the for loop is known as a for-comprehension. It can be used to iterate, filter and return an
iterated collection. A for-comprehension looks a bit like a for loop in imperative languages,
except that it constructs a list of the results of all iterations.

Syntax

1. for( i <- range){


2. // statements to be executed
3. }

In the above syntax, range is a value which has a start and an end point. You can pass a range by
using the to or until keyword.

Scala for-loop example by using to keyword


1. object MainObject {
2. def main(args: Array[String]) {
3. for( a <- 1 to 10 ){
4. println(a);
5. }
6. }
7. }

Output:

1
2
3
4
5
6
7
8
9
10
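The until keyword works the same way but excludes the end point. A minimal sketch:

object MainObject {
  def main(args: Array[String]) {
    for( a <- 1 until 5 ){   // prints 1, 2, 3, 4 (5 is excluded)
      println(a)
    }
  }
}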

12. Scala Inheritance

Inheritance is an object-oriented concept which is used for reusability of code. You can achieve
inheritance by using the extends keyword. To achieve inheritance, a class must extend another class.
A class which is extended is called a super class or parent class; a class which extends another class is called a
derived class or child class.

Syntax

1. class SubClassName extends SuperClassName(){


2. /* Write your code
3. * methods and fields etc.
4. */
5. }

Understand the Simple Example of Inheritance


Scala Single Inheritance Example
1. class Employee{
2. var salary:Float = 10000
3. }
4.
5. class Programmer extends Employee{
6. var bonus:Int = 5000
7. println("Salary = "+salary)
8. println("Bonus = "+bonus)
9. }
10.
11. object MainObject{
12. def main(args:Array[String]){
13. new Programmer()
14. }
15. }

Output:

Salary = 10000.0

Bonus = 5000

Types of Inheritance in Scala

Scala supports various types of inheritance including single, multilevel, multiple, and hybrid.
You can use single, multilevel and hierarchical inheritance directly in your classes. Multiple and hybrid
inheritance can only be achieved by using traits; a trait-based sketch is given after the multilevel example below.
Here, we represent all types of inheritance in pictorial form.
Scala Multilevel Inheritance Example
1. class A{
2. var salary1 = 10000
3. }
4.
5. class B extends A{
6. var salary2 = 20000
7. }
8.
9. class C extends B{
10. def show(){
11. println("salary1 = "+salary1)
12. println("salary2 = "+salary2)
13. }
14. }
15.
16. object MainObject{
17. def main(args:Array[String]){
18. var c = new C()
19. c.show()
20.
21. }
22. }

Output:

salary1 = 10000

salary2 = 20000
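As noted above, multiple inheritance is achieved with traits. A minimal sketch; the trait and class names are invented for illustration.

trait Printable {
  def printMessage() {
    println("Printing a document")
  }
}

trait Showable {
  def showMessage() {
    println("Showing a document")
  }
}

// A class mixes in both traits using extends and with
class Document extends Printable with Showable

object MainObject {
  def main(args:Array[String]){
    var d = new Document()
    d.printMessage()
    d.showMessage()
  }
}

Output:

Printing a document

Showing a document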
