Unit 4 (Big Data Analytics)
Executors are worker-node processes in charge of running the individual tasks of a given Spark job. The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
DAG
A DAG is a directed acyclic graph. They are commonly used in computer systems for task execution.
In this context, a graph is a collection of nodes that are connected by edges. In the case of Hadoop and
Spark, the nodes represent executable tasks, and the edges are task dependencies. Think of the DAG like
a flow chart that tells the system which tasks to execute and in what order. The following is a simple
example of an undirected graph of tasks.
This graph is undirected because it does not capture which node is the start node and which is the end node. In other words, it does not tell us whether the reduce task should feed the map tasks or vice versa. The next graph shows a directed graph of tasks.
A directed graph gives an unambiguous direction for each edge. This means that we know that the map
tasks feed into the reduce task, rather than the other way around. This property is essential for
executing complex workflows since we need to know which tasks should be executed in which order.
Lastly, the graph is acyclic because it does not contain any cycles. A cycle occurs when it is possible to loop back to a previous node. Cycles are useful for tasks involving recursion but are a poor fit for large-scale distributed execution. The following are two examples of graphs with cycles.
Spark context
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs,
accumulators and broadcast variables on that cluster.
Only one SparkContext should be active per JVM. You must stop() the active SparkContext before
creating a new one.
C# (Microsoft.Spark API reference):
public sealed class SparkContext
Inheritance
Object
SparkContext
Constructors
SparkContext() - Create a SparkContext that loads settings from system properties (for instance, when launching with spark-submit).
SparkContext(SparkConf) - Create a SparkContext object with the given config.
SparkContext(String, String) - Initializes a SparkContext instance with a specific master and application name.
SparkContext(String, String, SparkConf) - Alternative constructor that allows setting common Spark properties directly.
SparkContext(String, String, String) - Alternative constructor that allows setting common Spark properties directly.
Properties
DefaultParallelism - Default level of parallelism to use when not given by the user (e.g. Parallelize()).
Methods
AddFile(String, Boolean) - Add a file to be downloaded with this Spark job on every node.
Broadcast<T>(T) - Broadcast a read-only variable to the cluster, returning a Microsoft.Spark.Broadcast object for reading it in distributed functions. The variable will be sent to each executor only once.
ClearJobGroup() - Clear the current thread's job group ID and its description.
GetConf() - Returns the SparkConf object associated with this SparkContext object. Note that modifying the SparkConf object will not have any impact.
GetOrCreate(SparkConf) - This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.
SetCheckpointDir(String) - Sets the directory under which RDDs are going to be checkpointed.
SetJobDescription(String) - Sets a human readable description of the current job.
SetJobGroup(String, String, Boolean) - Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
SetLogLevel(String) - Control our logLevel. This overrides any user-defined log settings.
Stop() - Shut down the SparkContext.
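Although the table above comes from the C# (Microsoft.Spark) API reference, the same concepts apply in Scala. A minimal sketch (the application name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Build a configuration; "local[*]" is a placeholder master for local testing.
val conf = new SparkConf().setAppName("ExampleApp").setMaster("local[*]")
val sc = new SparkContext(conf)

// Only one SparkContext may be active per JVM, so stop it before creating another.
sc.stop()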
Spark Session
Spark session is the unified entry point of a Spark application from Spark 2.0 onwards. It provides a way to interact with the various functionalities of Spark using a smaller number of constructs.
Prior to Spark 2.0, SparkContext was the entry point of any Spark application and was used to access all Spark features; it needed a SparkConf, which held all the cluster configuration parameters, to create a SparkContext object. With SparkContext we could primarily create only RDDs, and we had to create specific contexts for any other Spark interactions: SQLContext for SQL, HiveContext for Hive, and StreamingContext for streaming. Internally, SparkSession creates a SparkContext and exposes all of these functionalities through a single object.
The Spark session builder will try to get an existing Spark session if one has already been created, or it will create a new one and assign the newly created SparkSession as the global default. Note that enableHiveSupport here is similar to creating a HiveContext: all it does is enable access to the Hive metastore, Hive serdes, and Hive UDFs.
Note that we do not have to create a Spark session object when using spark-shell; it is already created for us as the spark variable:
scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2bd158ea
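In a standalone application, a SparkSession is typically built as follows. This is a minimal sketch; the application name and master URL are placeholders:

import org.apache.spark.sql.SparkSession

// getOrCreate() returns the existing session if one exists, otherwise creates a new one.
val spark = SparkSession.builder()
  .appName("ExampleApp")
  .master("local[*]")
  .enableHiveSupport()   // optional: enables Hive metastore, serdes and UDF access
  .getOrCreate()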
What is RDD?
RDD stands for “Resilient Distributed Dataset”. It is the fundamental data structure of Apache Spark.
RDD in Apache Spark is an immutable collection of objects which is computed on the different nodes of the
cluster.
Decomposing the name RDD:
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and therefore able to recompute
missing or damaged partitions due to node failures.
Distributed, since the data resides on multiple nodes.
Dataset represents the records of the data you work with. The user can load the data set externally,
which can be a JSON file, CSV file, text file or a database via JDBC, with no specific data structure.
Hence, each dataset in an RDD is logically partitioned across many servers so that it can be
computed on different nodes of the cluster. RDDs are fault tolerant, i.e. they possess self-recovery in the
case of failure.
There are three ways to create RDDs in Spark: from data in stable storage, from other RDDs, and by
parallelizing an existing collection in the driver program. One can also operate on Spark RDDs in parallel
with a low-level API that offers transformations and actions. We will study these Spark RDD operations
later in this section.
Spark RDDs can also be cached and manually partitioned. Caching is beneficial when we use an RDD several
times, and manual partitioning is important to correctly balance partitions. Generally, smaller partitions
allow distributing RDD data more equally among more executors, but too many small partitions add
scheduling overhead, so the number of partitions should balance the two.
Programmers can also call a persist method to indicate which RDDs they want to reuse in future
operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not
enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk
or replicating it across machines, through flags to persist.
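A minimal sketch of the three creation approaches and of caching, assuming an active SparkSession named spark and a file path that is only a placeholder:

// 1. Parallelize an existing collection in the driver program
val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Load data from stable storage (the path is a placeholder)
val lines = spark.sparkContext.textFile("/tmp/input.txt")

// 3. Derive a new RDD from an existing one via a transformation
val doubled = numbers.map(_ * 2)

// Cache an RDD that will be reused several times
doubled.cache()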
Several features of Apache Spark RDD are: in-memory computation, lazy evaluation, fault tolerance, immutability, partitioning, persistence (caching), and coarse-grained operations.
Transformations
In Spark, the core data structures are immutable, meaning they cannot be changed once created. This
might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? In
order to "change" a DataFrame you have to instruct Spark how you would like to modify the
DataFrame you have into the one that you want. These instructions are called transformations.
Transformations are the core of how you will be expressing your business logic using Spark. There are
two types of transformations, those that specify narrow dependencies and those that specify wide
dependencies.
Narrow Transformation
Narrow transformations are the result of map() and filter() functions and these compute data that live
on a single partition meaning there will not be any data movement between partitions to execute
narrow transformations.
Functions such as map(), mapPartitions(), flatMap(), filter() and union() are some examples of narrow
transformations.
Wider Transformation
Wider transformations are the result of groupByKey() and reduceByKey() functions and these compute
data that live on many partitions, meaning there will be data movement between partitions to execute
wider transformations. Since these shuffle the data, they are also called shuffle transformations.
Functions such as groupByKey(), aggregateByKey(), aggregate(), join() and repartition() are some examples
of wider transformations.
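A small sketch contrasting the two kinds of transformations, assuming an RDD of words named words:

// Narrow transformations: each output partition depends on a single input partition
val pairs = words.map(w => (w, 1)).filter { case (w, _) => w.nonEmpty }

// Wide (shuffle) transformation: data with the same key must be moved between partitions
val counts = pairs.reduceByKey(_ + _)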
Spark Actions
RDD action methods and their definitions:
aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U - Aggregates the elements of each partition, and then the results for all the partitions.
collect(): Array[T] - Returns the complete dataset as an Array.
count(): Long - Returns the count of elements in the dataset.
countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble] - Returns an approximate count of elements in the dataset; this method returns an incomplete result when the execution time exceeds the timeout.
countApproxDistinct(relativeSD: Double = 0.05): Long - Returns an approximate number of distinct elements in the dataset.
countByValue(): Map[T, Long] - Returns a Map[T, Long] where each key represents a unique value in the dataset and the value represents how many times that value occurs.
countByValueApprox(timeout: Long, confidence: Double = 0.95)(implicit ord: Ordering[T] = null): PartialResult[Map[T, BoundedDouble]] - Same as countByValue() but returns an approximate result.
first(): T - Returns the first element in the dataset.
fold(zeroValue: T)(op: (T, T) ⇒ T): T - Aggregates the elements of each partition, and then the results for all the partitions.
foreach(f: (T) ⇒ Unit): Unit - Iterates over all elements in the dataset, applying the function f to each element.
foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit - Similar to foreach, but applies the function f to each partition.
min()(implicit ord: Ordering[T]): T - Returns the minimum value from the dataset.
max()(implicit ord: Ordering[T]): T - Returns the maximum value from the dataset.
reduce(f: (T, T) ⇒ T): T - Reduces the elements of the dataset using the specified binary operator.
saveAsObjectFile(path: String): Unit - Saves the RDD as serialized objects to the storage system.
saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit - Saves the RDD as a compressed text file.
saveAsTextFile(path: String): Unit - Saves the RDD as a text file.
take(num: Int): Array[T] - Returns the first num elements of the dataset.
takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] - Returns the first num (smallest) elements from the dataset; this is the opposite of the take() action. Note: Use this method only when the resulting array is small, as all the data is loaded into the driver's memory.
takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T] - Returns a subset of the dataset in an Array. Note: Use this method only when the resulting array is small, as all the data is loaded into the driver's memory.
toLocalIterator(): Iterator[T] - Returns the complete dataset as an Iterator. Note: Use this method only when the resulting data is small, as all the data is loaded into the driver's memory.
top(num: Int)(implicit ord: Ordering[T]): Array[T] - Returns the top num (largest) elements from the dataset, as defined by the implicit ordering. Note: Use this method only when the resulting array is small, as all the data is loaded into the driver's memory.
treeAggregate - Aggregates the elements of this RDD in a multi-level tree pattern.
treeReduce - Reduces the elements of this RDD in a multi-level tree pattern.
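A brief sketch exercising a few of these actions on a small RDD, assuming a SparkContext named sc:

val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5))

println(rdd.count())                       // 5
println(rdd.first())                       // 3
println(rdd.reduce(_ + _))                 // 14
println(rdd.take(2).mkString(","))         // 3,1
println(rdd.takeOrdered(2).mkString(",")) // 1,1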
DataFrame
A DataFrame is a programming abstraction in the Spark SQL module. DataFrames resemble relational
database tables or Excel spreadsheets with headers: the data resides in rows and columns of different
datatypes.
The information for distributed data is structured into schemas. Every column in a DataFrame has a
column name, datatype and nullable property. When nullable is set to true, the column
accepts null values as well.
RDD to DataFrame
Spark provides an implicit function toDF() which can be used to convert an RDD, Seq[T] or List[T] to a
DataFrame. In order to use the toDF() function, we should first import the implicits using import
spark.implicits._.
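A minimal sketch producing the schema shown below; the RDD name and the (language, users_count) sample values are illustrative:

import spark.implicits._

// Hypothetical sample data: (language, users_count) pairs
val rdd = spark.sparkContext.parallelize(Seq(("Java", "20000"), ("Python", "100000")))
val dfFromRDD = rdd.toDF()
dfFromRDD.printSchema()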
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
toDF() has another signature that takes arguments to define the column names, as shown below.
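Continuing the sketch above:

val dfWithNames = rdd.toDF("language", "users_count")
dfWithNames.printSchema()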
root
|-- language: string (nullable = true)
|-- users_count: string (nullable = true)
By default, the datatype of these columns is inferred from the data and nullable is set to true. We can
change this behavior by supplying a schema using StructType, where we can specify the column name, data
type and nullable flag for each field/column.
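A sketch of supplying an explicit schema; the column names and types are illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Explicit schema: column name, data type and nullable flag for each field
val schema = StructType(Seq(
  StructField("language", StringType, nullable = true),
  StructField("users_count", LongType, nullable = false)
))
val rowRDD = spark.sparkContext.parallelize(Seq(Row("Java", 20000L), Row("Python", 100000L)))
val dfWithSchema = spark.createDataFrame(rowRDD, schema)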
Convert RDD to DataFrame – Using createDataFrame()
The SparkSession class provides a createDataFrame() method to create a DataFrame; it takes an RDD object as
an argument. We can chain it with toDF() to assign names to the columns.
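Continuing the sketch above:

val dfFromCreate = spark.createDataFrame(rdd).toDF("language", "users_count")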
(Figure: Phases of the query plan in Spark SQL. Rounded squares represent the Catalyst trees.)
Data Frame Transformations
The transformations themselves can be divided into two groups: DataFrame transformations and
column transformations. The first group transforms the entire DataFrame, for example:
df.select(col1, col2, col3)
df.filter(col('user_id') == 123)
Page 10 of 28
df.orderBy('age')
...
The most frequently used DataFrame transformations are probably the following (but it of course depends on the use case): select(), filter(), withColumn(), orderBy(), groupBy() and join().
Spark SQL also provides built-in date and time functions (signature and description):
next_day(date: Column, dayOfWeek: String): Column - Returns the first date that is later than the value of the date column and falls on the specified day of the week. For example, `next_day('2015-07-27', "Sunday")` returns 2015-08-02 because that is the first Sunday after 2015-07-27.
trunc(date: Column, format: String): Column - Returns the date truncated to the unit specified by the format. For example, `trunc("2018-11-19 12:01:19", "year")` returns 2018-01-01. format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month.
date_trunc(format: String, timestamp: Column): Column - Returns the timestamp truncated to the unit specified by the format. For example, `date_trunc("year", "2018-11-19 12:01:19")` returns 2018-01-01 00:00:00. format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month; 'day', 'dd' to truncate by day. Other options are: 'second', 'minute', 'hour', 'week', 'quarter'.
year(e: Column): Column - Extracts the year as an integer from a given date/timestamp/string.
quarter(e: Column): Column - Extracts the quarter as an integer from a given date/timestamp/string.
month(e: Column): Column - Extracts the month as an integer from a given date/timestamp/string.
dayofweek(e: Column): Column - Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday.
dayofmonth(e: Column): Column - Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(e: Column): Column - Extracts the day of the year as an integer from a given date/timestamp/string.
weekofyear(e: Column): Column - Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.
last_day(e: Column): Column - Returns the last day of the month to which the given date belongs. For example, input "2015-07-27" returns "2015-07-31", since July 31 is the last day of the month of July 2015.
from_unixtime(ut: Column): Column - Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(ut: Column, f: String): Column - Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp(): Column - Returns the current Unix timestamp (in seconds) as a long.
unix_timestamp(s: Column): Column - Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale.
unix_timestamp(s: Column, p: String): Column - Converts a time string with the given pattern to a Unix timestamp (in seconds).
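A short usage sketch, assuming a DataFrame df with a date column named order_date:

import org.apache.spark.sql.functions.{col, year, month, dayofweek, last_day}

val withParts = df.select(
  col("order_date"),
  year(col("order_date")).as("order_year"),
  month(col("order_date")).as("order_month"),
  dayofweek(col("order_date")).as("order_dow"),
  last_day(col("order_date")).as("month_end")
)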
Working with Nulls in Data
We know Spark needs to be aware of nulls in the data, but you, as a programmer, should also be
aware of some details. Null in Spark is not as straightforward as we would wish it to be. At the beginning of
this article, I stated that this is not a simple problem we face here; here I'm going to discuss why I think
that's the case:
Spark is null safe, well, almost!
The fact that Spark functions are null safe (at least most of the time) is quite pleasant. Take a look at
the following example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

// Data and schema reconstructed from the output shown below; v3 is assumed to be v1 + v2.
val schema = StructType(Seq(StructField("v1", IntegerType, true), StructField("v2", IntegerType, true)))
val data = Seq(Row(1, 2), Row(3, 4), Row(null, 5))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
val result = df.withColumn("v3", col("v1") + col("v2"))
As you can see, the third row of our data contains a null, and as the following output shows, Spark
evaluates the result for that row as null (which is the desired value when one operand of your calculation is
already null):
scala> result.show
+----+---+----+
| v1| v2| v3|
+----+---+----+
| 1| 2| 3|
| 3| 4| 7|
|null| 5|null|
+----+---+----+
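To locate or replace such nulls afterwards, the DataFrame API offers the usual helpers. A brief sketch continuing the example above:

import org.apache.spark.sql.functions.col

// Rows where the computed column ended up null
result.filter(col("v3").isNull).show()

// Replace nulls in numeric columns with a default value
result.na.fill(0).show()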
Working with Complex Types
Apache Spark natively supports complex data types, and in some cases like JSON where an appropriate
data source connector is available, it makes a pretty decent dataframe representation of the data. Top-level
key-value pairs are presented in their own columns, whilst more complex hierarchical data is
persisted using a column cast to a complex data type. Using dot notation within a select clause,
individual data points within a complex object can be selected. For example:
from pyspark.sql.functions import col

jsonStrings = ['{"car":{"color":"red", "model":"jaguar"},"name":"Jo","address":{"city":"Houston",' + \
    '"state":"Texas","zip":{"first":1234,"second":4321}}}']
otherPeopleRDD = spark.sparkContext.parallelize(jsonStrings)
source_json_df = spark.read.json(otherPeopleRDD)

source_json_df.select(col("car.color"), col("car.model")).show()
This mechanism is simple and it works. However, if the data is complex, has multiple levels, spans a large
number of attributes and/or columns, each aligned to a different schema, and the consumer of the data
isn't able to cope (i.e. like most BI tools, which like to report from relational databases like Oracle,
MySQL, etc.), then problems will ensue. The manual approach of writing out the select statement for every
nested attribute quickly becomes tedious and error-prone.
To simplify working with complex data, this article presents a function designed to transform multi-level
nested data into a dataframe that has no complex data type columns. All nested attributes are assigned their own column.
Let's assume that we need to transform the following JSON, which has been loaded into Spark using
spark.read.json:
{
"car":{
"color":"red",
"model":"jaguar"
},
"name":"Jo",
"address":{
"city":"Houston",
"state":"Texas",
"zip":{
"first":1234,
"second":4321
}
}
}
The first task is to create a function that can parse the schema bound to the Dataframe. The schema is
accessed via a property of the same name found on the dataframe itself.
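The original article's (PySpark) function is not reproduced in these notes; the following is a minimal Scala sketch of the same idea, assuming the JSON has been loaded into a DataFrame named df: walk the schema recursively and build a flat list of columns, renaming each nested attribute after its dotted path.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively turn every nested field into a top-level column named after its path.
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] =
  schema.fields.flatMap { field =>
    val name = if (prefix == null) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case st: StructType => flattenSchema(st, name)
      case _              => Array(col(name).alias(name.replace(".", "_")))
    }
  }

// Usage: produces columns such as car_color, address_zip_first, etc.
val flatDf = df.select(flattenSchema(df.schema): _*)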
Group By
The GROUP BY clause is used to group the rows based on a set of specified grouping expressions and
compute aggregations on the group of rows based on one or more specified aggregate functions. Spark
also supports advanced aggregations to do multiple aggregations for the same input record set
via GROUPING SETS, CUBE, ROLLUP clauses. The grouping expressions and advanced aggregations can
be mixed in the GROUP BY clause and nested in a GROUPING SETS clause. See more details in
the Mixed/Nested Grouping Analytics section. When a FILTER clause is attached to an aggregate
function, only the matching rows are passed to that function.
Syntax
GROUP BY group_expression [ , group_expression [ , ... ] ] [ WITH { ROLLUP | CUBE } ]
CUBE
Specifies that aggregations should be computed for all possible combinations of the grouping columns. For example, GROUP BY warehouse, product WITH CUBE is equivalent to GROUP BY GROUPING SETS ((warehouse, product), (warehouse),
(product), ()). The N elements of a CUBE specification result in
2^N GROUPING SETS.
aggregate_name
Specifies an aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
DISTINCT
Removes duplicates in input rows before they are passed to aggregate functions.
FILTER
Filters the input rows: only the rows for which the boolean_expression in the WHERE clause evaluates to true are
passed to the aggregate function; other rows are discarded.
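A small sketch using a hypothetical sales table registered as a temporary view (the table and column names are assumptions; the FILTER clause requires a recent Spark version):

// Total quantity per warehouse and product, plus a filtered sum counting only large orders.
val grouped = spark.sql("""
  SELECT warehouse, product,
         SUM(quantity) AS total_qty,
         SUM(quantity) FILTER (WHERE quantity > 10) AS large_order_qty
  FROM sales
  GROUP BY warehouse, product
""")
grouped.show()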
Window Functions
Join in Spark SQL
Following are the different types of Joins:
1. INNER JOIN
The INNER JOIN returns the dataset which has the rows that have matching values in both the datasets
2. CROSS JOIN
The CROSS JOIN returns the dataset which is the number of rows in the first dataset multiplied by the
number of rows in the second dataset. Such kind of result is called the Cartesian Product.
3. LEFT OUTER JOIN
The LEFT OUTER JOIN returns all the rows from the left dataset along with the matched rows from the right dataset.
4. LEFT SEMI JOIN
The LEFT SEMI JOIN returns the rows from the left dataset that have a correspondence in the right dataset. Unlike the LEFT OUTER JOIN, the returned dataset in LEFT SEMI JOIN contains only the rows that have their matching in the right dataset. It also contains only the columns from the left dataset.
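A brief sketch of the DataFrame join API, assuming two DataFrames: employees with a dept_id column and departments with an id column (both names are illustrative):

// Inner join keeps only employees whose dept_id matches a department id.
val inner = employees.join(departments, employees("dept_id") === departments("id"), "inner")

// Left semi join keeps employee rows that have a match, with only the employee columns.
val semi = employees.join(departments, employees("dept_id") === departments("id"), "left_semi")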
Data sources
This section describes the Apache Spark data sources you can use in Databricks. Many include a
notebook that demonstrates how to use the data source to read and write data.
The following data sources are either directly supported in Databricks Runtime or require simple shell
commands to enable access:
Avro file
Binary file
CSV file
Hive table
Image
JSON file
LZO compressed file
MLflow experiment
Parquet file
XML file
Zip files
Broadcast Variables
In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are cached
and made available on all nodes in a cluster so that tasks can access or use them. Instead of sending
this data along with every task, Spark distributes broadcast variables to the machines using
efficient broadcast algorithms to reduce communication costs.
How does Spark Broadcast work?
Broadcast variables are used in the same way for RDD, DataFrame, and Dataset.
When you run a Spark RDD or DataFrame job that has broadcast variables defined and used, Spark
does the following:
Spark breaks the job into stages that have distributed shuffling, and actions are executed within each
stage.
Later stages are also broken into tasks.
Spark broadcasts the common (reusable) data needed by tasks within each stage.
The broadcast data is cached in serialized form and deserialized before executing each task.
You should create and use broadcast variables for data that is shared across multiple stages and
tasks.
Note that broadcast variables are not sent to the executors with the sc.broadcast(variable) call; instead, they
are sent to the executors when they are first used.
How to create Broadcast variable
The Spark Broadcast is created using the broadcast(v) method of the SparkContext class. This method
takes the argument v that you want to broadcast.
import org.apache.spark.sql.SparkSession

// A minimal runnable completion of the example: broadcast the lookup maps and use them in an RDD.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))
val bStates = spark.sparkContext.broadcast(states)
val bCountries = spark.sparkContext.broadcast(countries)

// Illustrative rows: (name, country code, state code)
val rdd = spark.sparkContext.parallelize(Seq(("James","USA","CA"), ("Maria","USA","NY")))
val rdd2 = rdd.map { case (n, c, s) => (n, bCountries.value(c), bStates.value(s)) }

println(rdd2.collect().mkString("\n"))
Accumulators
Accumulators are variables that are only “added” to through an associative operation and can
therefore, be efficiently supported in parallel. They can be used to implement counters (as in
MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can
add support for new types. If accumulators are created with a name, they will be displayed in Spark’s
UI. This can be useful for understanding the progress of running stages (NOTE − this is not yet
supported in Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running
on the cluster can then add to it using the add method or the += operator (in Scala and Python).
However, they cannot read its value. Only the driver program can read the accumulator’s value, using
its value method.
The code given below shows an accumulator being used to add up the elements of an array −
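The original snippet is not preserved in these notes; a sketch consistent with the output shown below, using the older accumulator API described above in the spark-shell (where sc is predefined), would be as follows. On Spark 2.0 and later you would use sc.longAccumulator and accum.add(x) instead.

scala> val accum = sc.accumulator(0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)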
If you want to see the output of the above code, use the following command −
scala> accum.value
Output
res2: Int = 10
On-Premises Cluster Deployments
Implementing an enterprise-ready, on-premises Spark deployment can be very complex, and it requires significant setup and ongoing operational effort.
BlueData makes it easier to deploy Apache Spark on-premises. With BlueData, you can spin up virtual
Spark clusters within minutes – providing secure, self-service, on-demand access to Big Data analytics
and infrastructure. You can deploy Spark in standalone mode or with Hadoop / YARN. You can also build
analytical pipelines and create Spark clusters using its RESTful APIs, and use web-based Zeppelin notebooks.
BlueData's software platform leverages virtualization and Docker containers – combined with its own
patent-pending innovations – to make it faster and more cost-effective for enterprises to get up and running with Spark on-premises.
Apache Spark Standalone Cluster Manager
Standalone mode is a simple cluster manager incorporated with Spark. It makes it easy to set up a
cluster that Spark itself manages, and it can run on Linux, Windows, or macOS. It is often the simplest
way to run a Spark application in a clustered environment. Learn how to install Apache Spark in
standalone mode.
a. How does Spark Standalone Cluster Works?
It has a master and a number of workers, each with a configured amount of memory and CPU cores. In Spark
standalone cluster mode, Spark allocates resources based on cores. By default, an application will
grab all the cores in the cluster.
In the standalone cluster manager, a ZooKeeper quorum can recover the master using a standby master. Using the
file system, we can also achieve manual recovery of the master. Spark supports authentication with the
help of a shared secret across the entire cluster; the user configures each node with the shared secret.
For communication protocols, data is encrypted using SSL, while block transfers use SASL
encryption.
To monitor an application, each Apache Spark application has a web user interface. The web UI provides
information about executors, storage usage and running tasks in the application. In this cluster manager, we
also have a web UI to view cluster and job statistics, as well as detailed log output for each job. If an
application has logged events over its lifetime, the Spark web UI will reconstruct the application's UI after the
application exits.
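To submit an application to a standalone cluster, the master URL of the standalone master is passed to spark-submit (the host and port below are placeholders):
$ ./bin/spark-submit --class path.to.your.Class --master spark://master-host:7077 [options] <app jar> [app options]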
Spark on YARN
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN
on the cluster, and the client can go away after initiating the application. In client mode, the driver runs
in the client process, and the application master is only used for requesting resources from YARN.
Unlike other cluster managers supported by Spark in which the master’s address is specified in the --
master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop
configuration. Thus, the --master parameter is yarn.
To launch a Spark application in cluster mode:
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar>
[app options]
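To launch in client mode, the same command is used with --deploy-mode client (or with the option omitted, since client is the default):
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode client [options] <app jar> [app options]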
Spark log files
Apache Spark log files can be useful in identifying issues with your Spark processes.
Table 1 lists the base log files that Spark generates.
Log file - Location
Master logs - $SPARK_LOG_DIR/spark-userID-org.apache.spark.deploy.master.Master-instance-host.out
Worker logs - $SPARK_LOG_DIR/spark-userID-org.apache.spark.deploy.worker.Worker-instance-host.out
Driver logs (client deploy mode) - Printed on the command line by default
Driver logs (cluster deploy mode) - stdout: $SPARK_WORKER_DIR/driverID/stdout, stderr: $SPARK_WORKER_DIR/driverID/stderr
Executor logs - stdout: $SPARK_WORKER_DIR/applID/executorID/stdout, stderr: $SPARK_WORKER_DIR/applID/executorID/stderr
Table 1. Apache Spark log files
The Spark UI- Spark UI History Server
You can use an AWS CloudFormation template to start the Apache Spark history server and view the
Spark web UI. These templates are samples that you should modify to meet your requirements.
To start the Spark history server and view the Spark UI using AWS CloudFormation
1. Choose one of the Launch Stack buttons in the following table. This launches the stack on the AWS
CloudFormation console.
The table in the original documentation provides a Launch Stack button per AWS Region, for Glue 1.0/2.0 and for Glue 3.0: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Africa (Cape Town), Asia Pacific (Hong Kong), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Milan), Europe (Paris), Europe (Stockholm), and Middle East (Bahrain). For Africa (Cape Town), Asia Pacific (Hong Kong), Europe (Milan), and Middle East (Bahrain), you must first enable console access to the Region.
To debug a Scala or Java application, you need to run the application with the JVM option -agentlib:jdwp,
where jdwp is the Java Debug Wire Protocol (JDWP) option, followed by a comma-separated list
of sub-options:
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
But to run with spark-submit, you need to pass the -agentlib:jdwp option via --conf
spark.driver.extraJavaOptions, as shown below.
spark-submit \
--name SparkByExamples.com \
--class org.sparkbyexamples.SparkWordCountExample \
--conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
spark-by-examples.jar
By running the above command, the JVM prints a message that it is listening for a debugger connection, and your application pauses until a debugger attaches.
The Spark project consists of different types of tightly integrated components. At its core, Spark is a
computational engine that can schedule, distribute and monitor multiple applications.
Let's understand each Spark component in detail.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage systems,
and memory management.
Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for structured data.
o It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive
variant of SQL, called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between Java objects and
existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of
streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that the applications written for streaming data can be reused to analyze
batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time example of a data
stream.
MLlib
o The MLlib is a Machine Learning library that contains various machine learning algorithms.
o These include correlations and hypothesis testing, classification and regression, clustering, and
principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
GraphX
o The GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
o It facilitates creating a directed graph with arbitrary properties attached to each vertex and
edge.
o To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices,
and aggregateMessages.
If you are running a Spark application on a remote node and you want to debug it via IntelliJ, you need to
set the environment variable SPARK_SUBMIT_OPTS with the debug information.
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5050
Now run your spark-submit, which will wait for the debugger.
Finally, open IntelliJ and follow the points above; for the host, enter the remote host where
your Spark application is running.