UNIT 4 Part 2
Spark can use Hadoop in two ways: for storage and for processing. Since Spark includes its own
computation layer for cluster management, it typically uses Hadoop for storage purposes only.
Apache Spark is a distributed, open-source processing system used for big data workloads.
Spark uses optimized query execution and in-memory caching for fast queries across data of any
size. Put simply, it is a fast, general-purpose engine for large-scale data processing.
It is much faster than earlier approaches to big data such as classical MapReduce. Spark is
faster because it executes in RAM/memory, which makes processing quicker than working from
disk drives.
Spark is simple because it can be used for many different things, such as working with data
streams or graphs, running machine learning algorithms, ingesting data into a database, building
data pipelines, executing distributed SQL, and more.
Apache Spark is a lightning-fast unified analytics engine for cluster computing on large data
sets, such as big data stored in Hadoop, with the aim of running programs in parallel across
multiple nodes. It combines multiple stack libraries such as SQL and DataFrames, GraphX, MLlib,
and Spark Streaming.
Spark can run in the following modes:
Standalone Mode: Here all processes run within the same JVM process.
Standalone Cluster Mode: In this mode, Spark uses its built-in job-scheduling framework.
Apache Mesos: In this mode, the worker nodes run on various machines, but the driver runs
only on the master node.
Hadoop YARN: In this mode, the driver runs inside the application master process and is
managed by YARN on the cluster.
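As a sketch of how an application selects among these modes, the master URL passed to the session builder decides which cluster manager Spark talks to. The URL formats below are the standard ones; the application name is hypothetical, and running against Mesos or YARN assumes a configured cluster.

```scala
import org.apache.spark.sql.SparkSession

object MasterUrlDemo {
  def main(args: Array[String]): Unit = {
    // The master URL decides which cluster manager Spark talks to:
    //   "local[*]"           -> everything in one JVM (standalone mode)
    //   "spark://host:7077"  -> standalone cluster mode
    //   "mesos://host:5050"  -> Apache Mesos
    //   "yarn"               -> Hadoop YARN (reads HADOOP_CONF_DIR)
    val spark = SparkSession.builder()
      .appName("MasterUrlDemo")
      .master("local[*]") // swap for the cluster manager you target
      .getOrCreate()

    println(s"Running against: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```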
Spark Installation:
There are some different things to use and install Spark. We can install Spark on our machine as
any stand-alone framework or use the images of Spark VM (Virtual Machine) available from
many vendors such as MapR, HortonWorks, and Cloudera. Also, we can use Spark configured
and installed inside the cloud (such as Databricks Clouds).
Spark Application:
Spark applications are programs written using the Apache Spark framework, which is designed
for large-scale data processing. These applications can perform various tasks on big data sets,
such as data transformation, analysis, machine learning, and graph processing. Here are some
common types of Spark applications used in big data:
1. Data Processing: Spark can efficiently process large volumes of data in parallel.
Applications might involve filtering, aggregating, joining, and transforming data.
2. Machine Learning: Spark's MLlib library provides scalable machine learning algorithms
for classification, regression, clustering, collaborative filtering, and dimensionality
reduction.
3. Graph Processing: Spark GraphX enables the processing of graph data structures and
implements graph-parallel algorithms for tasks like PageRank, community detection, and
graph coloring.
4. Streaming Analytics: Spark Streaming allows real-time processing of streaming data.
Applications might involve processing continuous streams of data from various sources
like Kafka, Flume, Twitter, etc., for tasks such as anomaly detection, sentiment analysis,
and real-time recommendations.
5. SQL and Data Warehousing: Spark SQL provides a DataFrame API and SQL interface
for working with structured data, enabling users to run SQL queries and perform analytics
on large-scale datasets.
6. ETL (Extract, Transform, Load): Spark is often used for ETL tasks, where data is
extracted from various sources, transformed into a suitable format, and loaded into a data
warehouse or data lake for further analysis.
7. Data Exploration and Visualization: Spark can be used for exploratory data analysis and
visualization tasks, allowing users to gain insights from large datasets using tools like
Spark SQL, DataFrame operations, and visualization libraries like Matplotlib or Seaborn.
8. Natural Language Processing (NLP): Spark's ecosystem includes libraries like Spark NLP,
which provides scalable natural language processing capabilities for tasks such as text
classification, entity recognition, sentiment analysis, and topic modeling.
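As a minimal sketch of the data-processing and SQL use cases (items 1 and 5 above), the following filters and aggregates a tiny hypothetical in-memory DataFrame; a real application would read from a large external source instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SalesSummary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SalesSummary")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory dataset standing in for a large input
    val sales = Seq(
      ("east", 100.0), ("west", 250.0), ("east", 75.0)
    ).toDF("region", "amount")

    // Filtering and aggregating with the DataFrame API
    sales.filter($"amount" > 50)
      .groupBy("region")
      .agg(sum("amount").as("total"))
      .show()

    spark.stop()
  }
}
```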
Consider using low-level RDDs, rather than DataFrames or Datasets, in scenarios such as these
(a short RDD sketch follows this list):
You want low-level transformations, actions, and control over your dataset;
your data is unstructured, such as media streams or streams of text;
you want to manipulate your data with functional programming constructs rather than
domain-specific expressions;
you don’t care about imposing a schema, such as a columnar format, while processing or
accessing data attributes by name or column; and
you can forgo some of the optimization and performance benefits available with DataFrames
and Datasets for structured and semi-structured data.
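A minimal RDD sketch under those assumptions: unstructured text lines, no schema, and functional constructs (flatMap, filter, map) instead of domain-specific expressions. The data and names are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Unstructured lines of text: no schema is imposed
    val lines = sc.parallelize(Seq("spark is fast", "rdds are low level"))

    // Functional constructs instead of domain-specific expressions
    val wordLengths = lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(w => (w, w.length))

    // collect() is an action: it triggers job submission
    wordLengths.collect().foreach(println)
    spark.stop()
  }
}
```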
Job Submission: A Spark job is submitted automatically when an action is performed on an RDD.
This internally calls runJob() on the SparkContext, which passes the call to the scheduler that
runs as a part of the driver.
The scheduler is made up of two components:
DAG Scheduler - Breaks down a job into stages.
Task Scheduler - Submits the tasks from each stage to the cluster.
DAG Construction: A Spark job is split into multiple stages. Each stage runs a specific set of
tasks. There are mainly two types of tasks: shuffle map tasks and result tasks.
Shuffle map tasks: Each shuffle map task runs a computation on one RDD and writes its
output to a new set of partitions, which are fetched in a later stage. Shuffle map tasks run
in all stages except the final stage.
Result tasks: These run in the final stage and return the result to the user's program. Each
result task runs the computation on its RDD partition, then sends the result back to the
driver, which assembles the results from all partitions into a final result. Note that each
task is given a placement preference by the DAG scheduler to allow the task scheduler to
take advantage of data locality. Once the DAG scheduler completes construction of the
DAG of stages, it submits each stage's set of tasks to the task scheduler. Child stages are
only submitted once their parents have completed successfully.
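As a concrete illustration, here is a hedged sketch of a two-stage job (toy data and a local master, for illustration only): the map side of reduceByKey runs as shuffle map tasks, and collect() runs result tasks in the final stage.

```scala
import org.apache.spark.sql.SparkSession

object StageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StageDemo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Stage 1: map() runs as shuffle map tasks; their output is
    // partitioned by key and fetched across the shuffle boundary.
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // shuffle boundary: splits the job into two stages

    // Stage 2: collect() is the action that submits the job; result tasks
    // each send their partition's result back to the driver.
    counts.collect().foreach(println)
    spark.stop()
  }
}
```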
Task Scheduling: When the task scheduler receives a set of tasks, it uses its list of executors
that are running for the application and constructs a mapping of tasks to executors on the basis
of placement preferences. For a given executor, the scheduler first assigns process-local tasks,
then node-local tasks, then rack-local tasks, before assigning any arbitrary (random) task.
Executors send status updates to the driver when a task is completed or has failed. In case of
task failure, the task scheduler resubmits the task on another executor. It also launches
speculative tasks for tasks that are running slowly, if this feature is enabled. Speculative tasks
are duplicates of existing tasks, which the scheduler may run as a backup if a task is running
more slowly than expected.
Task Execution: The executor first makes sure that the JAR and file dependencies are up to date,
keeping a local cache of dependencies from previous tasks. It then deserializes the task code
(which consists of the user's functions) from the serialized bytes that were sent as part of the
launch-task message. Finally, the task code is executed. A task can return a result to the driver;
the result is serialized and sent to the executor backend, and finally to the driver as a status
update message.
Spark on YARN:
A workload on YARN is either a single job (where "job" refers to a Spark job, a Hive query, or
any similar construct) or a DAG (Directed Acyclic Graph) of jobs. YARN divides the functionality
of resource management between a global ResourceManager and a per-application ApplicationMaster;
the unit of scheduling on a YARN cluster is an application. YARN itself is a generic
resource-management framework for distributed workloads. It supports many different compute
frameworks, such as Spark and Tez, as well as MapReduce, and it is a part of the Hadoop system.
SCALA:
Scala is a programming language that can be used in conjunction with Hadoop, a distributed computing
framework, to build scalable and high-performance data processing applications. Scala is a versatile
language that runs on the Java Virtual Machine (JVM) and is compatible with the Hadoop ecosystem.
2. Hadoop Streaming: Hadoop Streaming is a feature that allows you to write MapReduce
jobs in any programming language, including Scala. You can write Scala scripts to
process data using Hadoop Streaming and submit them to a Hadoop cluster for execution.
Apache Hive: Hive provides a SQL-like query language for querying and
analyzing data in Hadoop. You can write Hive queries in Scala using the Hive
JDBC or ODBC driver.
Apache Pig: Pig is a high-level platform for creating MapReduce programs
using a scripting language called Pig Latin. You can write Pig scripts in Scala
for data transformation tasks.
Apache Spark: Apache Spark, a fast and general-purpose cluster computing
framework, provides native support for Scala. You can write Spark
applications in Scala to process large-scale data in-memory and perform batch
processing, stream processing, machine learning, and graph processing tasks.
4. Hadoop Libraries: Scala can leverage various Hadoop libraries, such as Hadoop
Common, Hadoop HDFS, and Hadoop YARN, to interact with Hadoop clusters, manage
files in HDFS, and submit jobs for execution.
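As an illustration of the Spark-in-Scala route, here is a minimal word-count sketch that reads from and writes to HDFS. The namenode host and paths are placeholders, not real endpoints; a deployed job would receive the master URL from spark-submit.

```scala
import org.apache.spark.sql.SparkSession

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HdfsWordCount").getOrCreate()
    val sc = spark.sparkContext

    // hdfs://namenode:9000/... are placeholder paths; point them at
    // real files in your cluster's HDFS.
    val counts = sc.textFile("hdfs://namenode:9000/data/input.txt")
      .flatMap(_.split("\\s+")) // split lines into words
      .map((_, 1))              // pair each word with a count of 1
      .reduceByKey(_ + _)       // sum counts per word

    counts.saveAsTextFile("hdfs://namenode:9000/data/output")
    spark.stop()
  }
}
```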
Generally speaking, in OOP it is correct to say that objects are instances of a class. However,
Scala also has an object keyword, which we can use to define a singleton object.
By singleton, we mean an object that can be instantiated only once. Creating an object requires
just the object keyword and an identifier.
Classes are blueprints for creating objects. When we define a class, we can then create new
objects (instances) from the class.
We define a class using the class keyword followed by whatever name we give for that class.
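A minimal sketch of both constructs; Logger, Point, and Demo are hypothetical names chosen for illustration.

```scala
// A singleton object: exactly one instance, created on first use
object Logger {
  def log(msg: String): Unit = println(s"[LOG] $msg")
}

// A class: a blueprint from which many instances can be created
class Point(val x: Int, val y: Int) {
  def describe(): String = s"Point($x, $y)"
}

object Demo extends App {
  Logger.log("starting")  // no `new` needed for a singleton object
  val p = new Point(1, 2) // `new` creates an instance of a class
  println(p.describe())
}
```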
Basic Types & Operators of SCALA:
A data type is a categorization of data that tells the compiler what type of value a
variable holds. For example, if a variable has the Int data type, it holds a numeric value. In
Scala, the data types are similar to Java's in terms of length and storage. In Scala, data types
are treated as objects, so the first letter of each data type is capitalized. The data types
available in Scala are shown in the table below.
DataType   Default Value   Description
Boolean    false           true or false
Byte       0               8-bit signed value. Range: -128 to 127
Short      0               16-bit signed value. Range: -2^15 to 2^15 - 1
Char       '\u0000'        16-bit unsigned Unicode character. Range: 0 to 2^16 - 1
Int        0               32-bit signed value. Range: -2^31 to 2^31 - 1
Long       0L              64-bit signed value. Range: -2^63 to 2^63 - 1
Float      0.0F            32-bit IEEE 754 single-precision float
Double     0.0D            64-bit IEEE 754 double-precision float
String     null            A sequence of characters
Unit – Corresponds to no value.
Nothing – It is a subtype of every other type and contains no value.
Any – It is the supertype of all other types.
AnyVal – It is the root of all value types.
AnyRef – It is the root of all reference types.
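A small sketch exercising these types; the values are chosen arbitrarily.

```scala
object TypesDemo extends App {
  // Type names start with a capital letter because each type is an object
  val flag: Boolean = true
  val b: Byte = 127        // 8-bit signed
  val s: Short = 32767     // 16-bit signed: 2^15 - 1
  val c: Char = '\u0041'   // 16-bit unsigned Unicode character ('A')
  val i: Int = 42          // 32-bit signed
  val l: Long = 42L        // 64-bit signed
  val f: Float = 3.14F     // 32-bit IEEE 754
  val d: Double = 3.14159  // 64-bit IEEE 754
  val str: String = "Scala"

  val u: Unit = ()         // Unit corresponds to no value
  val anything: Any = 42   // Any is the supertype of all types
  println(s"$flag $b $s $c $i $l $f $d $str $u $anything")
}
```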
An operator is a symbol that represents an operation to be performed on one or more operands.
Operators are the foundation of any programming language and allow us to perform different
kinds of operations on operands. Scala has the following types of operators:
Arithmetic Operators
These are used to perform arithmetic/mathematical operations on operands.
Addition(+) operator adds two operands. For example, x+y.
Subtraction(-) operator subtracts two operands. For example, x-y.
Multiplication(*) operator multiplies two operands. For example, x*y.
Division(/) operator divides the first operand by the second. For example, x/y.
Modulus(%) operator returns the remainder when the first operand is divided by the
second. For example, x%y.
Exponentiation has no dedicated operator in Scala; use math.pow(x, y) to raise x to the
power y.
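A quick sketch of these operators on Int values, using math.pow for exponentiation as noted above; expected results are shown in comments.

```scala
object ArithmeticDemo extends App {
  val x = 7
  val y = 3
  println(x + y)          // 10   addition
  println(x - y)          // 4    subtraction
  println(x * y)          // 21   multiplication
  println(x / y)          // 2    integer division truncates
  println(x % y)          // 1    remainder
  println(math.pow(x, y)) // 343.0  exponent: no ** operator in Scala
}
```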
Relational Operators
Relational operators or Comparison operators are used for comparison of two values. Let’s
see them one by one:
Equal To(==) operator checks whether the two given operands are equal or not. If so, it
returns true. Otherwise it returns false. For example, 5==5 will return true.
Not Equal To(!=) operator checks whether the two given operands are equal or not. If not,
it returns true. Otherwise it returns false. It is the exact boolean complement of the ‘==’
operator. For example, 5!=5 will return false.
Greater Than(>) operator checks whether the first operand is greater than the second
operand. If so, it returns true. Otherwise it returns false. For example, 6>5 will return true.
Less Than(<) operator checks whether the first operand is less than the second operand.
If so, it returns true. Otherwise it returns false. For example, 6<5 will return false.
Greater Than Equal To(>=) operator checks whether the first operand is greater than or
equal to the second operand. If so, it returns true. Otherwise it returns false. For example,
5>=5 will return true.
Less Than Equal To(<=) operator checks whether the first operand is less than or equal
to the second operand. If so, it returns true. Otherwise it returns false. For example, 5<=5
will also return true.
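The same comparisons from the examples above, as a runnable sketch:

```scala
object RelationalDemo extends App {
  println(5 == 5) // true
  println(5 != 5) // false
  println(6 > 5)  // true
  println(6 < 5)  // false
  println(5 >= 5) // true
  println(5 <= 5) // true
}
```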
Logical Operators
They are used to combine two or more conditions/constraints or to complement the evaluation
of the original condition in consideration. They are described below:
Logical AND(&&) operator returns true when both conditions in consideration are
satisfied, and false otherwise. For example, a && b returns true only when both a and b
are true.
Logical OR(||) operator returns true when at least one of the conditions in
consideration is satisfied, and false otherwise. For example, a || b returns true if either
a or b is true; of course, it also returns true when both a and b are true.
Logical NOT(!) operator returns true when the condition in consideration is not satisfied,
and false otherwise. For example, !true returns false. In Scala these operators work on
Boolean values.
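A small sketch of the logical operators. The short-circuit demonstration at the end uses a hypothetical helper, loud(), to show that && skips its right side when the left side already decides the result.

```scala
object LogicalDemo extends App {
  val a = true
  val b = false
  println(a && b) // false: both must be true
  println(a || b) // true: at least one is true
  println(!a)     // false: negation

  // && and || short-circuit: the right side is skipped when the
  // left side already determines the result.
  def loud(v: Boolean): Boolean = { println(s"evaluated $v"); v }
  println(false && loud(true)) // prints false; loud() is never called
}
```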
Assignment Operators
Assignment operators are used to assign a value to a variable. The left-side operand of the
assignment operator is a variable and the right-side operand is a value. The value on the right
side must be of the same data type as the variable on the left side, otherwise the compiler will
raise an error. The different types of assignment operators are shown below.
Simple Assignment (=) operator is the simplest assignment operator. This operator is used
to assign the value on the right to the variable on the left.
Add AND Assignment (+=) operator is used to add the right operand to the left operand and
assign the result to the variable on the left.
Subtract AND Assignment (-=) operator is used to subtract the right operand from the left
operand and assign the result to the variable on the left.
Multiply AND Assignment (*=) operator is used to multiply the left operand by the right
operand and assign the result to the variable on the left.
Divide AND Assignment (/=) operator is used to divide the left operand by the right operand
and assign the result to the variable on the left.
Modulus AND Assignment (%=) operator is used to compute the modulus of the left operand
with the right operand and assign the result to the variable on the left.
(There is no exponent-and-assignment operator in Scala, since ** is not a Scala operator.)
Left Shift AND Assignment (<<=) operator is used to perform a binary left shift of the left
operand by the right operand and assign the result to the variable on the left.
Right Shift AND Assignment (>>=) operator is used to perform a binary right shift of the left
operand by the right operand and assign the result to the variable on the left.
Bitwise AND Assignment (&=) operator is used to perform a bitwise AND of the left operand
with the right operand and assign the result to the variable on the left.
Bitwise Exclusive OR AND Assignment (^=) operator is used to perform a bitwise exclusive
OR of the left operand with the right operand and assign the result to the variable on the left.
Bitwise Inclusive OR AND Assignment (|=) operator is used to perform a bitwise inclusive OR
of the left operand with the right operand and assign the result to the variable on the left.
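A sketch walking a single var through the compound assignment operators; the expected value after each step is shown in comments.

```scala
object AssignmentDemo extends App {
  var n = 10 // must be a var: vals cannot be reassigned
  n += 5;  println(n) // 15
  n -= 3;  println(n) // 12
  n *= 2;  println(n) // 24
  n /= 4;  println(n) // 6
  n %= 4;  println(n) // 2
  n <<= 3; println(n) // 16  (2 shifted left 3 places)
  n >>= 1; println(n) // 8
  n &= 12; println(n) // 8   (1000 & 1100)
  n ^= 3;  println(n) // 11  (1000 ^ 0011)
  n |= 4;  println(n) // 15  (1011 | 0100)
}
```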
Bitwise Operators
In Scala, there are 7 bitwise operators which work at bit level or used to perform bit by bit
operations. Following are the bitwise operators:
Bitwise AND (&): Takes two numbers as operands and does AND on every bit of two
numbers. The result of AND is 1 only if both bits are 1.
Bitwise OR (|): Takes two numbers as operands and does OR on every bit of the two numbers.
The result of OR is 1 if either of the two bits is 1.
Bitwise XOR (^): Takes two numbers as operands and does XOR on every bit of two
numbers. The result of XOR is 1 if the two bits are different.
Bitwise left Shift (<<): Takes two numbers, left shifts the bits of the first operand, the
second operand decides the number of places to shift.
Bitwise right Shift (>>): Takes two numbers, right shifts the bits of the first operand, the
second operand decides the number of places to shift.
Bitwise Ones' Complement (~): This operator takes a single number and inverts every bit
of it. For a 32-bit Int, ~x equals -x - 1.
Bitwise shift right zero fill(>>>): In shift right zero fill operator, left operand is shifted
right by the number of bits specified by the right operand, and the shifted values are filled
up with zeros.
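A sketch of the bitwise operators on small Int values; binary forms and expected results are shown in comments.

```scala
object BitwiseDemo extends App {
  val a = 12        // 1100 in binary
  val b = 10        // 1010 in binary
  println(a & b)    // 8   (1000): 1 only where both bits are 1
  println(a | b)    // 14  (1110): 1 where either bit is 1
  println(a ^ b)    // 6   (0110): 1 where the bits differ
  println(a << 2)   // 48:  shift left two places
  println(a >> 2)   // 3:   arithmetic shift right two places
  println(~a)       // -13: flips all 32 bits of an Int (~x == -x - 1)
  println(-8 >>> 1) // 2147483644: shift right, filling with zeros
}
```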
SCALA Closures:
Scala closures are functions that use one or more free variables, where the return value of
the function depends on these variables. The free variables are defined outside the closure
function and are not included as parameters of the function. So the difference between a
closure function and a normal function is the free variable: a free variable is any variable
that is neither defined within the function nor passed as a parameter of it. The function
itself does not bind the free variable to a value; it picks the value up from the enclosing
scope.
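A minimal closure sketch: rate is the free variable and applyRate closes over it (both names are hypothetical). Because Scala closures capture the variable itself, updating the free variable changes what the closure returns.

```scala
object ClosureDemo extends App {
  var rate = 0.1 // free variable: defined outside the function below

  // applyRate is a closure: it "closes over" rate rather than
  // receiving it as a parameter.
  val applyRate = (amount: Double) => amount * rate

  println(applyRate(100)) // 10.0
  rate = 0.2              // the closure sees the updated free variable
  println(applyRate(100)) // 20.0
}
```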
Inheritance in SCALA:
Inheritance is an important pillar of OOP (Object-Oriented Programming). It is the
mechanism in Scala by which one class is allowed to inherit the features (fields and
methods) of another class.
Important terminology:
Super Class: The class whose features are inherited is known as superclass (or a base class
or a parent class).
Sub Class: The class that inherits the other class is known as subclass (or a derived class,
extended class, or child class). The subclass can add its own fields and methods in addition
to the superclass fields and methods.
Reusability: Inheritance supports the concept of “reusability”, i.e. when we want to create
a new class and there is already a class that includes some of the code that we want, we
can derive our new class from the existing class. By doing this, we are reusing the fields
and methods of the existing class.
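A minimal inheritance sketch; Animal, Dog, and the method names are hypothetical. Dog reuses the superclass's field and adds its own method, illustrating the reusability point above.

```scala
// Superclass (base class): its members are inherited
class Animal(val name: String) {
  def speak(): String = s"$name makes a sound"
}

// Subclass: inherits name and speak(), and adds its own behavior
class Dog(name: String) extends Animal(name) {
  def fetch(): String = s"$name fetches the ball" // new method, reusing the base field
  override def speak(): String = s"$name barks"   // overrides inherited behavior
}

object InheritanceDemo extends App {
  val d = new Dog("Rex")
  println(d.speak()) // Rex barks
  println(d.fetch()) // Rex fetches the ball
}
```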