Unit V
1. Spark - Introduction
Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce), and it enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. The main concern, however, is the speed of processing large datasets, in terms of the waiting time between queries and the waiting time to run a program.
Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.
Contrary to a common belief, Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.
Spark uses Hadoop in two ways: one is storage and the second is processing. Since Spark has its own cluster management, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop MapReduce model and extends it so that it can be used efficiently for more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Components of Spark
Spark is made up of the following components.
Apache Spark Core
Spark Core is the underlying general execution engine of the Spark platform on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework built on top of Spark, taking advantage of Spark's distributed, memory-based architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
2. Spark - Installation
Spark is closely tied to the Hadoop ecosystem, so it is best installed on a Linux-based system. Installation essentially involves installing Java and Scala, downloading a Spark release, and adding Spark's bin directory to the PATH. Once installation is complete, starting the Spark shell (bin/spark-shell) prints output similar to the following:
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
3. Spark – RDD - Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable
distributed collection of objects. Each dataset in RDD is divided into logical partitions, which
may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java,
or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through
deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant
collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file system,
HDFS, HBase, or any data source offering a Hadoop Input Format.
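For example, both approaches can be tried directly in the Spark shell, where sc is already available (the file name below is just a placeholder):

// parallelizing an existing collection in the driver program
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
println(numbers.count())   // 5

// referencing a dataset in external storage (placeholder path)
val lines = sc.textFile("input.txt")
println(lines.count())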
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce-style operations.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
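A minimal sketch of persisting an RDD in the Spark shell (the storage level is chosen only for illustration):

import org.apache.spark.storage.StorageLevel

val squares = sc.parallelize(1 to 10000).map(x => x * x)
squares.persist(StorageLevel.MEMORY_ONLY)   // keep the elements in memory
println(squares.count())   // first action computes and caches the RDD
println(squares.count())   // second action reads the cached data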
4. Spark Architecture
Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.
The Resilient Distributed Datasets are the group of data items that can be stored in-memory on
worker nodes. Here,
o Resilient: Restore the data on failure.
o Distributed: Data is distributed among different nodes.
o Dataset: Group of data.
A Directed Acyclic Graph (DAG) is a finite directed graph that represents a sequence of computations performed on the data. Each node is an RDD partition, and each edge is a transformation on top of the data. Here, the graph refers to the navigation, whereas directed and acyclic refer to how it is done.
Driver Program
The driver program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as an independent set of processes on a cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster managers and then acquires executors on nodes in the cluster, sends the application code to those executors, and finally sends tasks to the executors to run.
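As an illustration, a driver program typically creates its SparkContext roughly as follows (the application name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
    def main(args: Array[String]) {
        // the master URL tells the SparkContext which cluster manager to connect to
        val conf = new SparkConf().setAppName("DriverExample").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val data = sc.parallelize(1 to 100)
        println(data.sum())
        sc.stop()
    }
}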
Cluster Manager
o The role of the cluster manager is to allocate resources across applications. Spark can run on a number of different cluster managers.
o The supported cluster managers include Hadoop YARN, Apache Mesos and the Standalone Scheduler.
o Here, the Standalone Scheduler is Spark's own built-in cluster manager, which makes it possible to install Spark on an empty set of machines.
Worker Node
A worker node is any node that can run application code in the cluster; it hosts the executor processes.
Executor
An executor is a process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk across them; each application has its own executors.
Task
A task is a unit of work that is sent to one executor.
5. Spark Components
The Spark project consists of different types of tightly integrated components. At its core, Spark
is a computational engine that can schedule, distribute and monitor multiple applications.
Spark SQL
o Spark SQL is built on top of Spark Core. It provides support for structured data.
o It allows the data to be queried via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between Java objects and existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
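A minimal sketch of querying a JSON file through Spark SQL from the shell (Spark 1.x-style API; the file name and column names are assumed):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// load a JSON data source (placeholder file name)
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")

// query the data with SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()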
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that applications written for streaming data can be reused to analyze batches of historical data with little modification.
o The log files generated by web servers can be considered a real-time example of a data stream.
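A minimal word-count sketch over a socket stream (the host, port and batch interval are assumptions for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// reuse the existing SparkContext and process data in 1-second mini-batches
val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()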
MLlib
o The MLlib is a Machine Learning library that contains various machine learning
algorithms.
o These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
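For example, the correlation between two numeric RDDs can be computed with the MLlib statistics package (the values are made up for illustration):

import org.apache.spark.mllib.stat.Statistics

val x = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val y = sc.parallelize(Array(2.0, 4.0, 6.0, 8.0))

// Pearson correlation between the two series
val correlation = Statistics.corr(x, y, "pearson")
println(correlation)   // 1.0 for perfectly linear data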
GraphX
o GraphX is a library that is used to manipulate graphs and perform graph-parallel computations.
o It facilitates the creation of a directed graph with arbitrary properties attached to each vertex and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices, and aggregateMessages.
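A minimal sketch of building and filtering a small property graph (the vertex and edge data are made up):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
println(graph.numVertices)   // 3

// subgraph keeps only the vertices (and their incident edges) matching the predicate
val withoutBob = graph.subgraph(vpred = (id, name) => name != "Bob")
println(withoutBob.numEdges)   // 0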
6. Scala
This section provides basic and advanced concepts of Scala and is designed for beginners and professionals alike.
It covers the main topics of the Scala language such as data types, conditional expressions, comments, functions, examples of OOP concepts, constructors, method overloading, the this keyword, inheritance, final, exception handling, file handling, tuples, strings, string interpolation, case classes, singleton objects, collections, etc.
What is Scala
Scala is a general-purpose programming language that combines object-oriented and functional programming. It was designed by Martin Odersky and was officially released for the Java platform in early 2004 and for the .NET framework in June 2004; Scala later dropped .NET support in 2012.
Scala is influenced by Java, Haskell, Lisp, Pizza, etc., and has in turn influenced F#, Fantom, Red, etc.
You can create any kind of application with it, such as web applications, enterprise applications, mobile applications, and desktop-based applications.
Let's look at a simple Scala program. A detailed description of this program is given in the following sections.
object MainObject{
    def main(args:Array[String]){
        print("Hello Scala")
    }
}
In Scala, you can create any type of application, whether web-based, mobile-based or desktop-based, with less time and code. Scala provides powerful tools and APIs with which you can build applications; for example, the Play framework provides a platform to build web applications rapidly.
Unlike Java, Scala is a pure object-oriented programming language. It allows us to create objects and classes so that you can develop object-oriented applications.
Object
An object is a real-world entity. It contains state and behavior. A laptop, a car and a cell phone are real-world objects. An object typically has two characteristics:
1. State: the data (field values) held by the object.
2. Behavior: the functionality (methods) the object exposes.
Class
A class is a template or blueprint for creating objects. A class can contain:
1. Data members
2. Member methods
3. Constructors
4. Blocks
5. Nested classes
6. Super class information, etc.
You must initialize all instance variables in the class. There is no default scope; if you don't specify an access modifier, a member is public. There must be an object in which the main method is defined, as it provides the starting point for your program. Here, we have created an example of a class.
Output:
0 null
In Scala, you can also create a class like this, where the constructor is defined as part of the class definition itself. This is called a primary constructor.
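The original listing is missing here; a minimal reconstruction with a primary constructor that produces the output shown (names are assumed) is:

class Student(id:Int, name:String){
    def show(){
        println(id+" "+name)
    }
}
object MainObject{
    def main(args:Array[String]){
        var s = new Student(100, "Martin")
        s.show()
    }
}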
Output:
100 Martin
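The listing for the next output is also missing; one plausible reconstruction simply creates two objects through the same primary constructor, reusing the Student class from the previous sketch:

object MainObject{
    def main(args:Array[String]){
        var s1 = new Student(101, "Raju")
        var s2 = new Student(102, "Martin")
        s1.show()
        s2.show()
    }
}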
Output:
101 Raju
102 Martin
Variables in Scala
A variable is a name that refers to a memory location. You can create mutable and immutable variables in Scala. Let's see how to declare a variable.
Mutable Variable
You can create a mutable variable using the var keyword. It allows you to change the value after the declaration of the variable. For example:
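var data = 100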
In the above code, var is a keyword and data is a variable name. It contains the integer value 100. Scala is a type-inferred language, so you don't need to specify the data type explicitly. You can also mention the data type of the variable explicitly, as shown below.
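var data:Int = 100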
Immutable Variable
val data = 100
data = 101 // Error: reassignment to val
The above code throws an error because we have changed the content of an immutable variable, which is not allowed. If you want to change the content later, it is advisable to use var instead of val.
Data Types in Scala
Data types in Scala are very similar to those in Java in terms of their storage and length, except that in Scala there is no concept of primitive data types: every type is an object and starts with a capital letter. A table of data types is given below. You will see their uses later.
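Data Type   Default Value   Description
Byte        0               8-bit signed value (-128 to 127)
Short       0               16-bit signed value
Int         0               32-bit signed value
Long        0L              64-bit signed value
Float       0.0F            32-bit IEEE 754 single-precision float
Double      0.0D            64-bit IEEE 754 double-precision float
Char        '\u0000'        16-bit unsigned Unicode character
String      null            A sequence of Chars
Boolean     false           Either true or false
Unit        -               Corresponds to no value
Null        -               null or empty reference
Nothing     -               Subtype of every other type; contains no values
Any         -               Supertype of every type
AnyRef      -               Supertype of all reference types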
While Loop
In Scala, the while loop is used to iterate code as long as the specified condition holds. It tests a boolean expression and iterates again and again. You are recommended to use a while loop when you do not know the number of iterations in advance.
Syntax
while(boolean expression){
    // Statements to be executed
}
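The listing that produced the output below is missing from the notes; a minimal reconstruction that prints the even numbers from 10 to 20 is:

object MainObject{
    def main(args:Array[String]){
        var i = 10
        while(i <= 20){
            println(i)
            i = i + 2
        }
    }
}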
Output:
10
12
14
16
18
20
For Loop
In Scala, the for loop is known as a for-comprehension. It can be used to iterate, filter and return a collection. The for-comprehension looks a bit like a for-loop in imperative languages, except that it constructs a list of the results of all iterations.
Syntax
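for(i <- range){
    // Statements to be executed
}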
In the above syntax, range is a value that has a start and an end point. You can pass a range by using the to or until keyword.
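The listing for the output below is missing; one plausible reconstruction filters a list inside the for-comprehension (the list values are assumed):

object MainObject{
    def main(args:Array[String]){
        var list = List(3, 7, 8, 10)
        for(i <- list if i > 6){
            println(i)
        }
    }
}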
Output:
7
8
10
Inheritance
Inheritance is an object-oriented concept that is used for reusability of code. You can achieve inheritance by using the extends keyword: to achieve inheritance, a class must extend another class. The class that is extended is called the super (or parent) class; the class that extends it is called the derived (or child) class.
Syntax
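class DerivedClass extends BaseClass{
    // members of the derived class
}

The listing for the output below is missing; a minimal reconstruction in which a derived class inherits a salary field (the class and field names are assumed) is:

class Employee{
    var salary:Float = 10000
}
class Programmer extends Employee{
    var bonus:Int = 5000
    println("Salary = " + salary)
    println("Bonus = " + bonus)
}
object MainObject{
    def main(args:Array[String]){
        new Programmer()
    }
}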
Output:
Salary = 10000.0
Bonus = 5000
Scala supports various types of inheritance including single, multilevel, multiple, hierarchical and hybrid. You can use single, multilevel and hierarchical inheritance directly with classes; multiple and hybrid inheritance can only be achieved by using traits.
Scala Multilevel Inheritance Example
class A{
    var salary1 = 10000
}

class B extends A{
    var salary2 = 20000
}

class C extends B{
    def show(){
        println("salary1 = "+salary1)
        println("salary2 = "+salary2)
    }
}

object MainObject{
    def main(args:Array[String]){
        var c = new C()
        c.show()
    }
}
Output:
salary1 = 10000
salary2 = 20000