HDP Training Tesco - II Notes
After the Spark context is created, it waits for the resources. Once the resources are available, the Spark
context sets up internal services and establishes a connection to a Spark execution environment.
Yarn Resource Manager, Application Master & launching of executors (containers).
Once the Spark context is created, it checks with the Cluster Manager and launches the
Application Master, i.e., launches a container and registers signal handlers.
Once the Application Master is started, it establishes a connection with the Driver.
Now, the Yarn Container will perform the below operations as shown in the diagram.
(Diagram from jaceklaskowski.gitbooks.io; not reproduced in these notes.)
ii) YarnRMClient will register with the Application Master.
iii) YarnAllocator: Will request 3 executor containers, each with 2 cores and 884 MB memory
including 384 MB overhead
Now the Yarn Allocator receives tokens from Driver to launch the Executor nodes and start the
containers.
• Launching container
YARN executor launch context assigns each executor an executor ID to identify the
corresponding executor (via the Spark WebUI) and starts a CoarseGrainedExecutorBackend.
Netty-based RPC - It is used to communicate between worker nodes, spark context, executors.
NettyRPCEndPoint is used to track the result status of the worker node.
RpcEndpointAddress is the logical address for an endpoint registered to an RPC Environment,
with RpcAddress and name.
It is in the format as shown below:
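For example, the driver's scheduler endpoint typically appears as follows (the host and port shown here are placeholder values):
spark://CoarseGrainedScheduler@192.168.1.5:40655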
This is the first moment when CoarseGrainedExecutorBackend initiates communication with the
driver available at driverUrl through RpcEnv.
SparkListeners
(Diagram from jaceklaskowski.gitbooks.io; not reproduced in these notes.)
SparkListener (Scheduler listener) is a class that listens to execution events from Spark's
DAGScheduler and logs event information for an application, such as executor and driver
allocation details, along with jobs, stages, tasks and changes to environment properties.
SparkContext starts the LiveListenerBus that resides inside the driver. It registers
JobProgressListener with the LiveListenerBus, which collects all the data needed to show the statistics in the
Spark UI.
By default, only the listener for the WebUI is enabled, but if we want to add any other listeners
we can use spark.extraListeners.
Spark comes with two listeners that showcase most of the activities
i) StatsReportListener
ii) EventLoggingListener
EventLoggingListener: If you want to analyze the performance of your applications further,
beyond what is available as part of the Spark history server, then you can process the event log
data. The Spark event log records info on processed jobs/stages/tasks. It can be enabled as shown
below.
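A minimal sketch of switching event logging on through the Spark configuration (the log directory below is a placeholder; it must exist and be readable by the history server):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")              // record job/stage/task events
  .set("spark.eventLog.dir", "hdfs:///spark-history") // placeholder path shared with the history server
val sc = new SparkContext(conf)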
To enable the listener, you register it with the SparkContext. This can be done in two ways:
i) Using the SparkContext.addSparkListener(listener: SparkListener) method inside your Spark
application. (See the CustomListener example for implementing custom listeners.)
ii) Using the spark.extraListeners configuration option on the command line.
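As a sketch of both approaches (sc is assumed to be an existing SparkContext; StatsReportListener ships with Spark):

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.StatsReportListener

// i) register programmatically on a running SparkContext
sc.addSparkListener(new StatsReportListener())

// ii) or pass the listener class through configuration (usable with --conf as well)
val conf = new SparkConf()
  .set("spark.extraListeners", "org.apache.spark.scheduler.StatsReportListener")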
Let’s read a sample file and perform a count operation to see the StatsReportListener.
RDDs are created either by using a file in the Hadoop file system, or an existing Scala collection in
the driver program, and transforming it.
Let’s take a sample snippet as shown below
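A representative snippet (the input path is a placeholder) that reads a file and runs a word count ending in a count action:

val lines = sc.textFile("hdfs:///data/sample.txt")   // placeholder path
val counts = lines.flatMap(_.split(" "))             // narrow transformation
  .map(word => (word, 1))                            // narrow transformation
  .reduceByKey(_ + _)                                // wide transformation (shuffle)
counts.count()                                       // action: triggers the job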
Now the data will be read into the driver using the broadcast variable.
• Wide transformation: Here each operation requires the data to be shuffled; hence, for
each wide transformation a new stage will be created, for example reduceByKey.
6.2 Physical Plan: In this phase, once we trigger an action on the RDD, the DAG Scheduler looks
at the RDD lineage and comes up with the best execution plan with stages and tasks; together with
TaskSchedulerImpl, it executes the job as a set of tasks running in parallel.
Once we perform an action operation, the SparkContext triggers a job and registers the RDD lineage up to
the first stage (i.e., before any wide transformations) with the DAGScheduler.
Now, before moving on to the next stage (wide transformations), it checks whether there is any
partition data that needs to be shuffled and whether any parent operation results it
depends on are missing; if any such stage is missing, it re-executes that part of the operation by making use of
the DAG (Directed Acyclic Graph), which makes it fault tolerant.
Next, the DAGScheduler looks for the newly runnable stages and triggers the next stage
(reduceByKey) operation.
On completion of each task, the executor returns the result back to the driver.
Spark-WebUI
Spark-UI helps in understanding the code execution flow and the time taken to complete a
particular job. The visualization helps in finding out any underlying problems that take place
during the execution and optimizing the spark application further.
We will see the Spark-UI visualization as part of the previous step 6.
Once the job is completed you can see the job details such as the number of stages and the number of
tasks that were scheduled during the job's execution.
On clicking a completed job we can view the DAG visualization, i.e., the different wide and
narrow transformations that are part of it.
You can see the execution time taken by each stage.
Clicking on a particular stage of the job shows the complete details of where the
data blocks are residing, the data size, the executor used, the memory utilized and the time taken to
complete a particular task. It also shows the number of shuffles that take place.
Further, we can click on the Executors tab to view the Executor and driver used.
Now that we have seen how Spark works internally, you can determine the flow of execution by
making use of the Spark UI and logs, and by tweaking the Spark EventListeners, to determine an optimal setup
for the submission of a Spark job.
Apache Spark: core concepts, architecture and internals
03 MARCH 2016 on Spark, scheduling, RDD, DAG, shuffle
This post covers core concepts of Apache Spark such as RDD, DAG, execution workflow, forming
stages of tasks and shuffle implementation and also describes architecture and main components
of Spark Driver. There's a github.com/datastrophic/spark-workshop project created alongside
this post which contains Spark application examples and a dockerized Hadoop environment
to play with. Slides are also available at slideshare.
Intro
Spark is a generalized framework for distributed data processing providing a functional API for
manipulating data at scale, with in-memory data caching and reuse across computations. It applies a set
of coarse-grained transformations over partitioned data and relies on the dataset's lineage to
recompute tasks in case of failures. Worth mentioning is that Spark supports the majority of data
formats, has integrations with various storage systems and can be executed on Mesos or YARN.
A powerful and concise API in conjunction with rich libraries makes it easier to perform data
operations at scale, e.g. performing backup and restore of Cassandra column families in Parquet
format:
// Requires the spark-cassandra-connector and an active SparkContext `sc`;
// `toDF()` assumes the SQL implicits are in scope.
import com.datastax.spark.connector._

def backup(path: String, config: Config): Unit = {
  sc.cassandraTable(config.keyspace, config.table) // read the column family as an RDD
    .map(_.toEvent).toDF()                         // convert rows to a DataFrame
    .write.parquet(path)                           // persist as Parquet
}
• HadoopRDD:
getPartitions = HDFS blocks
getDependencies = None
compute = load block in memory
getPreferredLocations = HDFS block locations
partitioner = None
• MapPartitionsRDD
getPartitions = same as parent
getDependencies = parent RDD
compute = compute parent and apply map()
getPreferredLocations = same as parent
partitioner = None
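These five properties (partitions, dependencies, compute, preferred locations, partitioner) are what every RDD supplies. A minimal sketch of a custom narrow-dependency RDD (the class and names are made up for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, TaskContext}

// Doubles every element of its parent. Partitions, preferred locations and the
// partitioner are effectively inherited from the parent (a one-to-one dependency).
class TimesTwoRDD(parent: RDD[Int]) extends RDD[Int](parent) {
  override protected def getPartitions: Array[Partition] = parent.partitions
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    parent.iterator(split, context).map(_ * 2)
}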
RDD Operations
Operations on RDDs are divided into several groups:
• Transformations
apply user function to every element in a partition (or to the whole partition)
apply aggregation function to the whole dataset (groupBy, sortBy)
introduce dependencies between RDDs to form DAG
provide functionality for repartitioning (repartition, partitionBy)
• Actions
trigger job execution
used to materialize computation results
• Extra: persistence
explicitly store RDDs in memory, on disk or off-heap (cache, persist)
checkpointing for truncating RDD lineage
Here's a code sample of a job which aggregates data from Cassandra in lambda style,
combining previously rolled-up data with the data from raw storage, and demonstrates some of the
transformations and actions available on RDDs:
//aggregate events after a specific date for the given campaign
val events =
  sc.cassandraTable("demo", "event")
    .map(_.toEvent)
    .filter { e =>
      e.campaignId == campaignId && e.time.isAfter(watermark)
    }
    .keyBy(_.eventType)
    .reduceByKey(_ + _)
    .cache()

//campaigns is assumed here to be a similarly keyed and rolled-up RDD, e.g.:
val campaigns =
  sc.cassandraTable("demo", "campaign")
    .map(_.toCampaign)
    .keyBy(_.eventType)
    .reduceByKey(_ + _)
    .cache()

//materialize the per-type totals on the driver
val campaignTotals =
  campaigns.map { case (t, e) => s"$t -> ${e.value}" }
    .collect()
Execution workflow recap
Here's a quick recap on the execution workflow before digging deeper into details: user code
containing RDD transformations forms a Directed Acyclic Graph which is then split into stages of tasks
by the DAGScheduler. Stages combine tasks which don't require shuffling/repartitioning of the data.
Tasks run on workers and the results are then returned to the client.
DAG
Here's a DAG for the code sample above. So basically any data processing workflow could be
defined as reading the data source, applying a set of transformations and materializing the result in
different ways. Transformations create dependencies between RDDs and here we can see different
types of them.
The dependencies are usually classified as "narrow" and "wide":
• Narrow (pipelineable)
each partition of the parent RDD is used by at most one partition of the child RDD
allow for pipelined execution on one cluster node
failure recovery is more efficient as only lost parent partitions need to be recomputed
• Wide (shuffle)
multiple child partitions may depend on one parent partition
require data from all parent partitions to be available and to be shuffled across the nodes
if some partition is lost from all the ancestors a complete recomputation is needed
Splitting DAG into Stages
Spark stages are created by breaking the RDD graph at shuffle boundaries
• RDD operations with "narrow" dependencies, like map() and filter(), are pipelined together into
one set of tasks in each stage; operations with shuffle dependencies require multiple stages
(one to write a set of map output files, and another to read those files after a barrier).
• In the end, every stage will have only shuffle dependencies on other stages, and may compute
multiple operations inside it. The actual pipelining of these operations happens in the
RDD.compute() functions of various RDDs
There are two types of tasks in Spark: ShuffleMapTask which partitions its input for shuffle and
ResultTask which sends its output to the driver. The same applies to types of stages:
ShuffleMapStage and ResultStage correspondingly.
Shuffle
During the shuffle, ShuffleMapTask writes blocks to the local drive, and then the tasks in the next stage
fetch these blocks over the network.
• Shuffle Write
redistributes data among partitions and writes files to disk
each hash shuffle task creates one file per “reduce” task (total = MxR)
sort shuffle task creates one file with regions assigned to reducer
sort shuffle uses in-memory sorting with spillover to disk to get final result
• Shuffle Read
fetches the files and applies reduce() logic
if data ordering is needed then it is sorted on “reducer” side for any type of shuffle
In Spark, Sort Shuffle has been the default since 1.2, but Hash Shuffle is available too.
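The implementation can be chosen explicitly; a sketch (hash shuffle was removed in Spark 2.0, so the alternative value only applies to 1.x):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.shuffle.manager", "sort") // or "hash" on Spark 1.x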
Sort Shuffle
• Incoming records are accumulated and sorted in memory according to their target partition ids
• Sorted records are written to file or multiple files if spilled and then merged
• index file stores offsets of the data blocks in the data file
• Sorting without deserialization is possible under certain conditions (SPARK-7081)
Spark Components
At a 10,000-foot view there are three major components:
• Spark Driver
separate process to execute user applications
creates SparkContext to schedule jobs execution and negotiate with cluster manager
• Executors
run tasks scheduled by driver
store computation results in memory, on disk or off-heap
interact with storage systems
• Cluster Manager
Mesos
YARN
Spark Standalone
Spark Driver contains more components responsible for translation of user code into actual jobs
executed on cluster:
• SparkContext
◦ represents the connection to a Spark cluster, and can be used to create RDDs, accumulators
and broadcast variables on that cluster (a short sketch follows this list)
• DAGScheduler
◦ computes a DAG of stages for each job and submits them to TaskScheduler
◦ determines preferred locations for tasks (based on cache status or shuffle files locations)
and finds minimum schedule to run the jobs
• TaskScheduler
◦ responsible for sending tasks to the cluster, running them, retrying if there are failures, and
mitigating stragglers
• SchedulerBackend
◦ backend interface for scheduling systems that allows plugging in different
implementations(Mesos, YARN, Standalone, local)
• BlockManager
◦ provides interfaces for putting and retrieving blocks both locally and remotely into various
stores (memory, disk, and off-heap)
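A short sketch of those SparkContext facilities (Spark 2.x API assumed; sc is an existing SparkContext):

val data   = sc.parallelize(1 to 1000, 4)           // an RDD with 4 partitions
val errors = sc.longAccumulator("errorCount")       // an accumulator owned by the driver
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))  // a read-only broadcast variable

data.foreach { x =>
  if (!lookup.value.contains("a")) errors.add(1)    // executors read the broadcast and update the accumulator
}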
Memory Management in Spark 1.6
Executors run as Java processes, so the available memory is equal to the heap size. Internally
available memory is split into several regions with specific functions.
• Execution Memory
◦ storage for data needed during tasks execution
◦ shuffle-related data
• Storage Memory
◦ storage of cached RDDs and broadcast variables
◦ possible to borrow from execution memory (spill otherwise)
◦ safeguard value is 50% of Spark Memory when cached blocks are immune to eviction
• User Memory
◦ user data structures and internal metadata in Spark
◦ safeguarding against OOM
• Reserved memory
◦ memory needed for running executor itself and not strictly related to Spark
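These regions are governed by the unified memory manager's two main knobs; a sketch with commonly cited default values (exact defaults vary by Spark version, so treat the numbers as illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")        // execution + storage share of (heap - reserved memory)
  .set("spark.memory.storageFraction", "0.5") // portion of that within which cached blocks are immune to eviction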
Architecture
We talked about spark jobs in chapter 3. In this chapter, we will talk about the architecture and
how master, worker, driver and executors are coordinated to finish a job.
Feel free to skip code if you prefer diagrams.
Deployment diagram
We have seen the following diagram in overview chapter.
Next, we will talk about some details about it.
Job submission
The diagram below illustrates how driver program (on master node) produces job, and then
submits it to worker nodes.
Driver side behavior is equivalent to the code below:
finalRDD.action()
=> sc.runJob()
// send tasks
=> sparkDeploySchedulerBackend.reviveOffers()
=> driverActor ! ReviveOffers
=> sparkDeploySchedulerBackend.makeOffers()
=> sparkDeploySchedulerBackend.launchTasks()
=> foreach task
CoarseGrainedExecutorBackend(executorId) ! LaunchTask(serializedTask)
Explanation:
When the following code is evaluated, the program will launch a bunch of driver-side components,
e.g. the job's executors, threads, actors, etc.
val sc = new SparkContext(sparkConf)
This line defines the role of the driver.
Task distribution
After sparkDeploySchedulerBackend gets TaskSet, the Driver Actor sends serialized tasks to
CoarseGrainedExecutorBackend Actor on worker node.
Job reception
After receiving tasks, worker will do the following things:
coarseGrainedExecutorBackend ! LaunchTask(serializedTask)
=> executor.launchTask()
=> executor.threadPool.execute(new TaskRunner(taskId, serializedTask))
Executor packages each task into a TaskRunner, and picks a free thread to run the
task. A CoarseGrainedExecutorBackend process has exactly one executor.
Task execution
The diagram below shows the execution of a task received by worker node and how driver
processes task results.
After receiving a serialized task, the executor deserializes it into a normal task, and then runs the
task to get a directResult, which will be sent back to the driver. It is noteworthy that the data package sent
from an Actor cannot be too big:
• If the result is too big (e.g. the one of groupByKey), it will be persisted to "memory + hard disk"
and managed by blockManager. Driver will only get indirectResult containing the storage
location. When result is needed, driver will fetch it via HTTP.
• If the result is not too big (less than spark.akka.frameSize = 10MB), then it will be directly sent to
driver.
Some more details about blockManager:
When directResult > akka.frameSize, the memoryStore of the BlockManager creates a
LinkedHashMap to hold the data stored in memory, whose size should be less than
Runtime.getRuntime.maxMemory * spark.storage.memoryFraction (default 0.6). If the
LinkedHashMap has no space to hold the incoming data, the data is sent to the diskStore, which
persists data to hard disk if the data's storageLevel contains "disk".
In TaskRunner.run()
// deserialize the task, run it and then send the result to the driver
=> coarseGrainedExecutorBackend.statusUpdate()
=> task = ser.deserialize(serializedTask)
=> value = task.run(taskId)
=> directResult = new DirectTaskResult(ser.serialize(value))
=> if( directResult.size() > akkaFrameSize() )
indirectResult = blockManager.putBytes(taskId, directResult, MEMORY+DISK+SER)
else
return directResult
=> coarseGrainedExecutorBackend.statusUpdate(result)
=> driver ! StatusUpdate(executorId, taskId, result)
The results produced by ShuffleMapTask and ResultTask are different.
• ShuffleMapTask produces MapStatus containing 2 parts:
the BlockManagerId of the task's BlockManager: (executorId + host, port, nettyPort)
the size of each output FileSegment of a task
• ResultTask produces the execution result of the specified function on one partition, e.g. the
function of count() simply counts the number of records in a partition. Since
ShuffleMapTask needs FileSegments for writing to disk, OutputStream writers are needed.
These writers are produced and managed by the blockManager and shuffleBlockManager.
In task.run(taskId)
// if the task is ShuffleMapTask
=> shuffleMapTask.runTask(context)
=> shuffleWriterGroup = shuffleBlockManager.forMapTask(shuffleId, partitionId,
numOutputSplits)
=> shuffleWriterGroup.writers(bucketId).write(rdd.iterator(split, context))
=> return MapStatus(blockManager.blockManagerId, Array[compressedSize(fileSegment)])
Shuffle read
In the preceding paragraph, we talked about task execution and result processing; now we will talk
about how the reducer (the tasks that need shuffled data) gets its input data. The shuffle read part of the
last chapter has already covered how the reducer processes the input data.
How does the reducer know where to fetch the data?
Reducer needs to know on which node the FileSegments produced by ShuffleMapTask of parent
stage are. This kind of information is sent to driver’s mapOutputTrackerMaster when
ShuffleMapTask is finished. The information is also stored in mapStatuses:
HashMap[stageId, Array[MapStatus]]. Given a stageId, we can get Array[MapStatus], which
contains information about the FileSegments produced by the ShuffleMapTasks. Array(taskId) contains
the location (blockManagerId) and the size of each FileSegment.
When the reducer needs to fetch input data, it first invokes blockStoreShuffleFetcher to get the input
data's location (FileSegments). blockStoreShuffleFetcher calls the local MapOutputTrackerWorker to
do the work. MapOutputTrackerWorker uses mapOutputTrackerMasterActorRef to communicate
with mapOutputTrackerMasterActor in order to get the MapStatus. blockStoreShuffleFetcher
processes the MapStatus and finds out where the reducer should fetch the FileSegment information, and then
it stores this information in blocksByAddress. blockStoreShuffleFetcher tells
basicBlockFetcherIterator to fetch the FileSegment data.
rdd.iterator()
=> rdd(e.g., ShuffledRDD/CoGroupedRDD).compute()
=> SparkEnv.get.shuffleFetcher.fetch(shuffledId, split.index, context, ser)
=> blockStoreShuffleFetcher.fetch(shuffleId, reduceId, context, serializer)
=> statuses = MapOutputTrackerWorker.getServerStatuses(shuffleId, reduceId)
=> connectionManager.receiveMessage(bufferMessage)
=> handleMessage(connectionManagerId, message, connection)
Discussion
In terms of architecture design, functionalities and modules are pretty independent.
BlockManager is well designed, but it seems to manage too many things (data block, memory, disk
and network communication)
This chapter discussed how the modules of spark system are coordinated to finish a job
(production, submission, execution, results collection, results computation and shuffle). A lot of
code is pasted, many diagrams are drawn. More details can be found in source code, if you want.
Apache Spark-Apache Hive connection configuration
You need to understand the workflow and service changes involved in accessing ACID table data
from Spark. You can configure Spark properties in Ambari for using the Hive Warehouse
Connector.
Prerequisites
You need to use the following software to connect Spark and Hive using the
HiveWarehouseConnector library:
• HDP 3.15
• Spark2
• Hive with HiveServer Interactive (HSI)
The Hive Warehouse Connector (HWC) and low-latency analytical processing (LLAP) are
required for certain tasks, as shown in the following table:
Table 1. Spark Compatibility
• Read Hive managed tables from Spark: HWC required: Yes; LLAP required: Yes; Ranger ACLs enforced.*
• Write Hive managed tables from Spark: HWC required: Yes; LLAP required: No; Ranger ACLs enforced.* Supports ORC only.
• Read Hive external tables from Spark: HWC required: No; LLAP required: Only if HWC is used; Ranger ACLs not enforced.
• Write Hive external tables from Spark: HWC required: No; LLAP required: No; Ranger ACLs enforced.
* Ranger column level security or column masking is supported for each access pattern when
you use HWC.
You need low-latency analytical processing (LLAP) in HSI to read ACID, or other Hive-managed
tables, from Spark. You do not need LLAP to write to ACID, or other managed tables, from
Spark. The HWC library internally uses the Hive Streaming API and LOAD DATA Hive
commands to write the data. You do not need LLAP to access external tables from Spark with
caveats shown in the table above.
Required properties
You must add several Spark properties through spark-2-defaults in Ambari to use the Hive
Warehouse Connector for accessing data in Hive. Alternatively, configuration can be provided for
each job using --conf.
spark.sql.hive.hiveserver2.jdbc.url
spark.datasource.hive.warehouse.metastoreUri
spark.datasource.hive.warehouse.load.staging.dir
The HDFS temp directory for batch writes to Hive, /tmp for example
spark.hadoop.hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum
spark.sql.hive.hiveserver2.jdbc.url
In Ambari, copy the value from Services > Hive > Summary > HIVESERVER2 INTERACTIVE
JDBC URL.
spark.datasource.hive.warehouse.metastoreUri
Copy the value from hive.metastore.uris. In Hive, at the hive> prompt, enter set
hive.metastore.uris and copy the output. For example, thrift://mycluster-1.com:9083.
spark.hadoop.hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum
spark.datasource.hive.warehouse.write.path.strictColumnNamesMapping Validates the
mapping of columns against those in Hive to alert the user to input errors. Default = true.
spark.sql.hive.conf.list Propagates one or more configuration properties from the HWC to
Hive. Set properties on the command line using the --conf option. For example:
--conf
spark.sql.hive.conf.list="hive.vectorized.execution.filesink.arrow.native.enabled=true;hive.vectorized.execution.enabled=true"
Do not attempt to set spark.sql.hive.conf.list programmatically.
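For illustration, a spark-2-defaults fragment with placeholder values (host names, ZooKeeper quorum and the LLAP application name differ per cluster; copy the real values from Ambari as described above):

spark.sql.hive.hiveserver2.jdbc.url jdbc:hive2://host1.example.com:2181,host2.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
spark.datasource.hive.warehouse.metastoreUri thrift://host1.example.com:9083
spark.datasource.hive.warehouse.load.staging.dir /tmp
spark.hadoop.hive.llap.daemon.service.hosts @llap0
spark.hadoop.hive.zookeeper.quorum host1.example.com:2181,host2.example.com:2181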
Spark on a Kerberized YARN cluster
In Spark client mode on a kerberized YARN cluster, set the following
property: spark.sql.hive.hiveserver2.jdbc.url.principal.
This property must be equal to hive.server2.authentication.kerberos.principal. In Ambari, copy the
value for this property
from hive.server2.authentication.kerberos.principal in Services > Hive > Configs > Advanced
> Advanced hive-site.
Integrate Spark-SQL with Hive
Integrate Spark-SQL with Hive when you want to run Spark-SQL queries on Hive tables.
Spark 1.5.2 and Spark 1.6.1 are built using Hive 1.2 artifacts; however, you can configure Spark-
SQL to work with Hive 0.13 and Hive 1.0. Spark 1.3.1 and Spark 1.4.1 are built using Hive
0.13; other versions of Hive are not supported with Spark-SQL. For additional details on Spark-
SQL and Hive support, see Spark Feature Support.
Note: If you installed Spark with the MapR Installer, the following steps are not required.
1. Copy hive-site.xml file into the SPARK_HOME/conf directory so that Spark and Spark-SQL
recognize the Hive Metastore configuration.
2. Configure the Hive version in the /opt/mapr/spark/spark-<version>/mapr-
util/compatibility.version file:
hive_versions=<version>
3. If you are running Spark 1.5.2 or Spark 1.6.1, add the following additional properties to the
/opt/mapr/spark/spark-<version>/conf/spark-defaults.conf file:
1.2/lib/accumulo-core-1.6.0.jar:/opt/mapr/hive/hive-1.2/lib/hive-contrib-1.2.0-mapr-1508.jar:/opt/mapr/hive/hive-1.2/lib/*
For more information, see the Apache Spark
documentation.
4. To verify the integration, run the following command as the mapr user or as a user that mapr
impersonates:
MASTER=<master-url> <spark-home>/bin/run-example sql.hive.HiveFromSpark
The master URL for the cluster is either spark://<host>:7077, yarn-client, or yarn-cluster.
Note: The default port for both HiveServer2 and the Spark Thrift server is 10000. Therefore,
before you start the Spark Thrift server on a node where HiveServer2 is running, verify that there is
no port conflict.
Background
There are several open source Spark HBase connectors available either as Spark packages, as
independent projects or in HBase trunk.
Spark has moved to the Dataset/DataFrame APIs, which provides built-in query plan optimization.
Now, end users prefer to use DataFrames/Datasets based interface.
The HBase connector in the HBase trunk has rich support at the RDD level, e.g. BulkPut, etc., but
its DataFrame support is not as rich. The HBase trunk connector relies on the standard
HadoopRDD with HBase's built-in TableInputFormat, which has some performance limitations. In
addition, BulkGet performed in the driver may be a single point of failure.
There are some other alternative implementations. Take Spark-SQL-on-HBase as an example. It
applies very advanced custom optimization techniques by embedding its own query
optimization plan inside the standard Spark Catalyst engine, ships the RDD to HBase and
performs complicated tasks, such as partial aggregation, inside the HBase coprocessor. This
approach is able to achieve high performance, but it is difficult to maintain due to its
complexity and the rapid evolution of Spark. Also, allowing arbitrary code to run inside a
coprocessor may pose security risks.
The Spark-on-HBase Connector (SHC) has been developed to overcome these potential
bottlenecks and weaknesses. It implements the standard Spark Datasource API, and
leverages the Spark Catalyst engine for query optimization. In parallel, the RDD is
constructed from scratch instead of using TableInputFormat in order to achieve high
performance. With this customized RDD, all critical techniques can be applied and fully
implemented, such as partition pruning, column pruning, predicate pushdown and data
locality. The design makes the maintenance very easy, while achieving a good tradeoff
between performance and simplicity.
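As a hedged sketch of what that Datasource API usage looks like (the catalog layout and the datasource class name follow the SHC README and may vary across SHC versions; the table and columns are hypothetical, and spark is an existing SparkSession):

// Describe the HBase table mapping; "catalog" is the option key SHC expects.
val catalog =
  """{
    |  "table":  {"namespace": "default", "name": "contacts"},
    |  "rowkey": "key",
    |  "columns": {
    |    "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
    |    "name": {"cf": "info",   "col": "name", "type": "string"}
    |  }
    |}""".stripMargin

val df = spark.read
  .options(Map("catalog" -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.filter(df("id") === "42").show()   // predicate pushdown and column pruning apply here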
Architecture
We assume Spark and HBase are deployed in the same cluster, and Spark executors are co-located
with region servers, as illustrated in the figure below.
Note: The NameNode and ResourceManager can reside in the same machine or different machine
depending upon the configuration of the cluster.
Now, when the client submits a job, it goes to the master machine. It will talk to the NameNode
and the NameNode will do various checks like –
• It will check, whether the client has appropriate permission to read the Input path.
• Whether the client has appropriate permission to write onto the Output path.
• Whether the Input and Output path is valid or not.
• And many more.
Once it verifies that everything is in place, it will assign a Job ID to the Job and then allocate the
Job ID into a Job Queue.
So, in Job Queue there can be multiple jobs waiting to get processed.
As soon as a job is assigned to the Job Queue, its corresponding information about the Job, like
the Input/Output Path, the location of the Jar, etc., is written into the temp location of HDFS.
Let’s talk a little about the temp location of HDFS. This is the location in each data node where
the intermediate data goes in. The location of this path is set in the file named “core-site.xml”
under location “hadoop-dir/etc/hadoop”.
That is, every detail of each job will be stored in the temp location. After this, the Job is finally
“Accepted”.
In the next step, whenever the turn of a Job comes for execution from the Job Queue, the Resource
Manager will randomly select a DataNode (worker node) and start a Java process called
Application Master in the DataNode.
Note: For each Job, there will be an Application Master.
Now, on behalf of the Resource Manager, the Application Master will go to the temp location and
the details of the Job will be checked/collected from the temp location. Subsequently, the
Application Master communicates with the NameNode, which further takes the call to figure out
where the files (blocks) are located in the cluster and how many resources (number of CPUs, number
of nodes, memory required) the job will need. So, the NameNode will do its computation and
figure out those things.
Once all the evaluations are done, the Application Master sends all the resource request
information to the Resource Manager.
Now, the Resource Manager will look into the request and will send the resource allocation request
of the job to the DataNodes.
Now, let's assume a scenario: the resource request which the Resource Manager has received
from the Application Master is of just 2 Cores and 2 GB memory, and the data nodes in the
cluster have a configuration of 4 Cores and 16 GB RAM. In this case, the Resource Manager will
send the resource allocation request to one of the DataNodes, requesting it to allocate 2 Cores and
2 GB memory (i.e. a portion of RAM and Cores) to the Job. So, the Resource Manager sends the
request of 2 Cores and 2 GB memory packed together as a Container. These containers are
known as Executors.
The resource allocation requests are handled by the NodeManager of each individual worker
node, which is responsible for the resource allocation of the job.
Finally, the code/Task will start executing in the Executor.
Execution Mode:
In Spark, there are two modes to submit a job: i) Client mode (ii) Cluster mode.
Client mode: In the client mode, we have Spark installed in our local client machine, so the
Driver program (which is the entry point to a Spark program) resides in the client machine i.e. we
will have the SparkSession or SparkContext in the client machine.
Whenever we place any request like “spark-submit” to submit any job, the request goes to
Resource Manager then the Resource Manager opens up the Application Master in any of the
Worker nodes.
Note: I am skipping the detailed intermediate steps explained above here.
The Application Master launches the Executors (i.e. Containers in terms of Hadoop) and the jobs
will be executed.
After the Executors are launched they start communicating directly with the Driver program i.e.
SparkSession or SparkContext and the output will be directly returned to the client.
The drawback of Spark Client mode w.r.t. YARN is that the client machine needs to be available at
all times whenever any job is running. You cannot submit your job, then turn off your laptop
and leave the office before your job is finished.
In this case, it won’t be able to give the output as the connection between Driver and Executors will
be broken.
Cluster Mode: The only difference in this mode is that Spark is installed in the cluster, not in the
local machine. Whenever we place any request like “spark-submit” to submit any job, the request
goes to Resource Manager then the Resource Manager opens up the Application Master in any of
the Worker nodes.
Now, the Application Master will launch the Driver Program (which will be having
the SparkSession/SparkContext) in the Worker node.
That means, in cluster mode the Spark driver runs inside an application master process which is
managed by YARN on the cluster, and the client can go away after initiating the application.
Whereas in client mode, the driver runs in the client machine, and the application master is only
used for requesting resources from YARN.
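For reference, the mode is chosen at submission time; a sketch (the application class and jar names are placeholders):

spark-submit --master yarn --deploy-mode client  --class com.example.MyApp myapp.jar
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar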
In the next blog, I have explained how the Spark Driver and Executors work.
Integrate Spark with YARN
To communicate with the YARN Resource Manager, Spark needs to be aware of your Hadoop
configuration. This is done via the HADOOP_CONF_DIR environment variable. The
SPARK_HOME variable is not mandatory but is useful when submitting Spark jobs from the
command line.
• Edit the “bashrc” file and add the following lines:
export HADOOP_CONF_DIR=/<path of hadoop dir>/etc/hadoop
export YARN_CONF_DIR=/<path of hadoop dir>/etc/hadoop
export SPARK_HOME=/<path of spark dir>
export LD_LIBRARY_PATH=/<path of hadoop dir>/lib/native:$LD_LIBRARY_PATH
• Restart your session by logging out and logging in again.
• Rename the Spark default template config file:
mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
• Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:
spark.master yarn
Copy all the Spark jars from $SPARK_HOME/jars to HDFS so that they can be shared among all the
worker nodes:
hdfs dfs -put *.jar /user/spark/share/lib
Add/modify the following parameters in spark-defaults.conf:
spark.master yarn
spark.yarn.jars hdfs://hmaster:9000/user/spark/share/lib/*.jar
spark.executor.memory 1g
spark.driver.memory 512m
spark.yarn.am.memory 512m
Example Spark Application Web Application
Consider a job consisting of a set of transformations to join data from an accounts dataset with a
weblogs dataset in order to determine the total number of web hits for every account, and then an
action to write the result to HDFS. In this example, the write is performed twice, resulting in two
jobs. To view the application UI, in the History Server click the link in the App ID column:
The following screenshot shows the timeline of the events in the application including the jobs that
were run and the allocation and deallocation of executors. Each job shows the last action,
saveAsTextFile, run for the job. The timeline shows that the application acquires executors over
the course of running the first job. After the second job finishes, the executors become idle and are
returned to the cluster.
You can manipulate the timeline as follows:
• Pan - Press and hold the left mouse button and swipe left and right.
• Zoom - Select the Enable zooming checkbox and scroll the mouse up and down.
To view the details for Job 0, click the link in the Description column. The following screenshot
shows details of each stage in Job 0 and the DAG visualization. Zooming in shows finer detail for
the segment from 28 to 42 seconds:
Clicking a stage shows further details and metrics:
The web page for Job 1 shows how preceding stages are skipped because Spark retains the results
from those stages:
Example Spark SQL Web Application
In addition to the screens described above, the web application UI of an application that uses the
Spark SQL API also has an SQL tab. Consider an application that loads the contents of two tables
into a pair of DataFrames, joins the tables, and then shows the result. After you click the
application ID, the SQL tab displays the final action in the query:
If you click the show link you see the DAG of the job. Clicking the Details link on this page
displays the logical query plan:
Example Spark Streaming Web Application
Note: The following example demonstrates the Spark driver web UI. Streaming information is not
captured in the Spark History Server.
The Spark driver web application UI also supports displaying the behavior of streaming
applications in the Streaming tab. If you run the example described in Spark Streaming Example,
and provide three bursts of data, the top of the tab displays a series of visualizations of the
statistics summarizing the overall behavior of the streaming application:
The application has one receiver that processed 3 bursts of event batches, which can be observed in
the events, processing time, and delay graphs. Further down the page you can view details of
individual batches:
To view the details of a specific batch, click a link in the Batch Time column. Clicking the
2016/06/16 14:23:20 link with 8 events in the batch, provides the following details:
Apache Spark has proven an efficient and accessible platform for distributed computation. In some
areas, it almost approaches the Holy Grail of making parallelization “automagic” — something we
human programmers appreciate precisely because we are rarely good at it.
Nonetheless, although it is easy to get something to run on Spark, it is not always easy to tell
whether it's running optimally, nor — if we get a sense that something isn't right — how to fix it.
For example, a classic Spark puzzler is the batch job that runs the same code on the same sort of
cluster and similar data night after night … but every so often it seems to take much longer to
finish. What could be going on?
In this article, I am going to show how to identify some common Spark issues the easy way: by
looking at a particularly informative graphical report that is built into the Spark Web UI. The Web
UI Stage Detail view[1] is my go-to page for tuning and troubleshooting, and is also one of the most
information-dense spots in the whole UI.
Let's briefly describe a “stage” and how to find the relevant UI screen. Then we'll look at the data in
each part of that page.
What Exactly is a Stage?
In Apache Spark execution terminology, operations that physically move data in order to produce
some result are called “jobs.” Some jobs are triggered by user API calls (so-called “Action” APIs,
such as “.count” to count records). Other jobs live behind the scenes and are implicitly triggered —
e.g., data schema inference requires Spark to physically inspect some data, hence it requires a job
of its own.
Jobs are decomposed into “stages” by separating where a shuffle is required. The shuffle is
essential to the “reduce” part of the parallel computation — it is the part that is not fully parallel
but where we must, in general, move data in order to complete the current phase of computation.
For example, to sort a distributed set of numbers, it's not enough to locally sort partitions of the
data … sooner or later we need to impose a global ordering and that requires comparisons of data
records from all over the cluster, necessitating a shuffle.
So, Spark's stages represent segments of work that run from data input (or data read from a
previous shuffle) through a set of operations called tasks — one task per data partition — all the
way to a data output or a write into a subsequent shuffle.
Locating the Stage Detail View UI
Start by opening a browser to the Spark Web UI[2].
Unless you already know the precise details of jobs and stages running on your Spark cluster, it's
probably useful to navigate via the “Jobs” tab at the top of the UI, which provides a clear drill-
down by job (rather than the “Stages” tab, which lists all stages but doesn't clearly distinguish
them by job).
From the “Jobs” tab, you can locate the job you're interested in, and click its “Description” link to
get to the Job Detail view, which lists all of the stages in your job, along with some useful stats.
We get to our final destination by clicking on a stage's “Description” link. This link leads to the
Stage Detail view, which is the report we're analyzing today.
We'll look at major parts of this report, proceeding from top to bottom on the page.
Event Timeline
One of my favorite parts of the Stage Detail view is initially hidden behind the “Event Timeline”
dropdown. Click that dropdown link to get a large, colored timeline graph showing each of the
tasks in the stage, plotted by start time (horizontally) and grouped by executor (vertically).
Within each task's colored bar — representing time — the full duration is further broken down via
colored segments to show how the time was spent.
There is exactly one colored bar per task, so we can see how many tasks there are, and get a feel for
whether there are too many or too few tasks. Since tasks are one-to-one with data partitions, this
really helps us answer the question: How many partitions should I have?
More precisely: What would this graph look like if there are too few partitions (and tasks)?
In the most extreme case, we might see fewer tasks than an executor has cores — perhaps we have
40 cores across our cluster but see only 32 tasks. That is usually not what we want, and it's easy to
identify and change.
But, more subtly, what if we have a larger number of tasks than we have cores but still too few for
optimal performance? How would we recognize that situation?
Look at the right hand edge of the graph, and locate the last one or two tasks to complete. Those
tasks are essentially limiting the progress of the job, because they have to complete before Spark
can move on. Look at the timescale, and see whether the span between those tasks' end time and
the previous few tasks' end time is significant (for example, hundreds of milliseconds or perhaps
much more). What we're looking at is the period at the end of the stage when the cluster cores are
underutilized. If this is substantial, it's an indicator of too few partitions/tasks, or of skew in data
size, compute time, or both.
Notice that one “straggler” task finishes more than 1 full second after the next longest running
tasks, and almost 2 full seconds after most tasks are complete. This image — with its high fraction
of green — displays some artifacts of a small dataset on a local mode Spark, but it also suggests
some skew in one of the partitions or tasks. The tiny tasks suggest near-empty partitions or near-
no-op tasks.
On the opposite end, we might have too many partitions, leading to too many tasks. How would
this situation appear? Lots of very short tasks, dominated by time spent in non-compute activities.
Such tasks come from a stage with too many partitions; notice how many of the tasks show only
a little bit of green "Computing Time", certainly less than 70%.
The green color in the task indicates “Executor Computing Time” and we would ideally like this to
make up at least 70% of the time spent on the task. If you see many tasks filled up with other
colors, representing non-compute activities such as “Task Deserialization,” and only a small slice
of green “Computing Time,” that is an indicator that you may have too many tasks/partitions or,
equivalently, that the partitions are too small (data) or require too little work (compute) to be
optimally efficient.
Note two gotchas about these timeline graphs in general:
• Although the task bars are arranged to show start time and length, the “swim lanes” only separate
executors. One “row” within an executor swim lane does not represent a specific core or
thread, and tasks scheduled on the same thread do not show that fact in any way.
• In general, the vertical layout of the bars within an executor “swim lane” are purely an artifact of
trying to show many tasks within a limited space on the page. The vertical positioning,
overlap, etc., of multiple tasks has no meaning.
Summary Metrics for Completed Tasks
Next on the page we find the Summary Metrics, showing various metrics at the 0 (Min), 25th, 50th
(Median), 75th, and 100th (Max) percentiles (among the tasks in the stage). More metrics can be
revealed by selecting checkboxes hidden under “Show Additional Metrics” earlier on the page.
How do we make sense of this part of the report?
In a perfect world, where our computation was completely symmetric across tasks, we would see
all of the statistics clustered tightly around the 50th percentile value. There would be minimal
variance; the distance between 0 and 100% values would be small. In the real world, though,
things don't always work out that way, but we can see how far off they are — and get an idea of
why.
Suppose the 25%-75% spread isn't too wide but some Max metric figures are substantially higher
than the corresponding 75% figures. That suggests a number of "straggler" tasks, taking too much
time to compute (or triggering excess GC), and/or operating over partitions with larger skewed
amounts of data.
On the other end, suppose that the distribution is reasonable, except that we have a bunch of Min
values at or close to zero. That suggests we have empty (or near empty) partitions and/or tasks
that aren't computing anything (our compute logic might not be the same for all sorts of records,
so we could have large partitions whose tasks do no work).
Summary Metrics corresponding to the task timeline view (above) which had suggested skew. Note
that the Max task took 10x the time and read about 10x the data of the 75th-percentile task. There
is skew at the low end as well: Min time and data are zero, and 25th percentile data is around 1% of
the median.
Aggregated Metrics by Executor
The next segment of the report is a set of summarized statistics, this time collected by executor.
How is this helpful? In theory, given the vicissitudes of scheduling on threads and across a
network, we would expect that when running a job several times — or running a long job with
many stages and tasks — we would see similar statistics across all our executors. There's that
perfect world again. What could go wrong?
If we see one or more executors consistently showing worse metrics than most, it could indicate
several possible situations:
• The JVM (executor) is sick — perhaps we should kill it and start a new one.
• The node hosting the executor is sick — the UI shows which node each executor lives on, so if
multiple problematic executors are always on the same node we might suspect the node.
• Data locality trouble — Since Spark attempts to schedule tasks where their partition data is
located, over time it should be successful at a consistent rate. But suppose three of your Spark
executors happen to be collocated with HDFS replicas of tasks' data, while one is allocated
(by, say, YARN) far away from your job's data. That executor is going to consistently take
longer to read the data over the network.
• Good locality but difficult data — Conversely, an executor may have great locality to the part of
the data which your Spark job is using most heavily. So those tasks take longer and/or
process more data than other tasks.
All of those possibilities can be mitigated, and the report gives us hints about what to inspect so
that we can do so.
Tasks List
The last section of the Stage Detail view is a grid containing a row for every single task in the stage.
The data shown for each task is similar to the data shown in the graphical timeline, but includes
the addition of a few fields such as data quantity read/written and — something not shown
anywhere else — the specific data locality level at which each task ran[3].
Assuming you know something about where your data is located at this stage of the computation,
the locality info will tell you whether tasks are generally being scheduled in a way that minimizes
transporting data over the network.
Let's return by way of example to our mysterious scenario from the start of the article — a job that
is in all respects similar each night, but every so often takes much longer to finish. Perhaps your
Spark cluster coincides with your HDFS data “most of the time” but occasionally ends up getting
launched with terrible proximity to the data block replicas it needs, and so it runs successfully but
much slower on those occasions. Comparing the locality info in this part of the report to the
observed behavior from other runs will give you an indication of what has happened.
Since a stage could easily have thousands of tasks, this is probably a good time to mention that the
Spark UI data are also accessible through a REST API[4]. So, if you want to monitor and plot
locality in your stages, you don't need to read or scrape the thousands of task table rows.
Finally, a couple rules of thumb around partition sizing and task duration. Two of the most
common questions in the Spark classes and workshops I teach for ProTech are: “How many
partitions should I have? And how long should tasks execute?”
The proper answer is that it depends on so many variables — workload, cluster configuration, data
sources — that it always requires hands-on tuning. However, new users are understandably
desperate to have some a priori numbers to start with, so — purely as a bootstrapping mechanism
— I suggest starting with code and configuration that causes each partition to contain 100-200MB
of data and each task to take 50-200ms.
Those numbers are not a final goal, but once you have your app running, you can use the
knowledge from this article to start to tune your partition counts and improve the speed and
consistency of your Spark application.
If this approach to tuning Spark sounds helpful, and you'd like to dig even deeper into optimizing
and troubleshooting Spark for real-world production scenarios in the enterprise, check out
ProTech's 3-Day Spark 2.0 Programming course.
Diving into Apache Spark Streaming’s Execution Model
by Tathagata Das, Matei Zaharia and Patrick Wendell Posted in ENGINEERING BLOG July 30,
2015
With so many distributed stream processing engines available, people often ask us about the
unique benefits of Apache Spark Streaming. From early on, Apache Spark has provided a unified
engine that natively supports both batch and streaming workloads. This is different from other
systems that either have a processing engine designed only for streaming, or have similar batch
and streaming APIs but compile internally to different engines. Spark’s single execution engine
and unified programming model for batch and streaming lead to some unique benefits over other
traditional streaming systems. In particular, four major aspects are:
• Fast recovery from failures and stragglers
• Better load balancing and resource usage
• Combining of streaming data with static datasets and interactive queries
• Native integration with advanced processing libraries (SQL, machine learning, graph processing)
In this post, we outline Spark Streaming’s architecture and explain how it provides the above
benefits. We also discuss some of the interesting ongoing work in the project that leverages the
execution model.
Stream Processing Architectures – The Old and the New
At a high level, modern distributed stream processing pipelines execute as follows:
• Receive streaming data from data sources (e.g. live logs, system telemetry data, IoT device data,
etc.) into some data ingestion system like Apache Kafka, Amazon Kinesis, etc.
• Process the data in parallel on a cluster. This is what stream processing engines are designed to
do, as we will discuss in detail next.
• Output the results out to downstream systems like HBase, Cassandra, Kafka, etc.
To process the data, most traditional stream processing systems are designed with a continuous
operator model, which works as follows:
• There is a set of worker nodes, each of which runs one or more continuous operators.
• Each continuous operator processes the streaming data one record at a time and forwards the
records to other operators in the pipeline.
• There are “source” operators for receiving data from ingestion systems, and “sink” operators that
output to downstream systems.
Figure 1: Architecture of traditional stream processing systems
Continuous operators are a simple and natural model. However, with today’s trend towards larger
scale and more complex real-time analytics, this traditional architecture has also met some
challenges. We designed Spark Streaming to satisfy the following requirements:
• Fast failure and straggler recovery – With greater scale, there is a higher likelihood of a cluster
node failing or unpredictably slowing down (i.e. stragglers). The system must be able to
automatically recover from failures and stragglers to provide results in real time.
Unfortunately, the static allocation of continuous operators to worker nodes makes it
challenging for traditional systems to recover quickly from faults and stragglers.
• Load balancing – Uneven allocation of the processing load between the workers can cause
bottlenecks in a continuous operator system. This is more likely to occur in large clusters and
dynamically varying workloads. The system needs to be able to dynamically adapt the
resource allocation based on the workload.
• Unification of streaming, batch and interactive workloads – In many use cases, it is also
attractive to query the streaming data interactively (after all, the streaming system has it all in
memory), or to combine it with static datasets (e.g. pre-computed models). This is hard in
continuous operator systems as they are not designed to dynamically introduce new
operators for ad-hoc queries. This requires a single engine that can combine batch, streaming
and interactive queries.
• Advanced analytics like machine learning and SQL queries – More complex workloads require
continuously learning and updating data models, or even querying the “latest” view of
streaming data with SQL queries. Again, having a common abstraction across these analytic
tasks makes the developer’s job much easier.
To address these requirements, Spark Streaming uses a new architecture called discretized streams
that directly leverages the rich libraries and fault tolerance of the Spark engine.
Architecture of Spark Streaming: Discretized Streams
Instead of processing the streaming data one record at a time, Spark Streaming discretizes the
streaming data into tiny, sub-second micro-batches. In other words, Spark Streaming’s Receivers
accept data in parallel and buffer it in the memory of Spark's worker nodes. Then the latency-
optimized Spark engine runs short tasks (tens of milliseconds) to process the batches and output
the results to other systems. Note that unlike the traditional continuous operator model, where the
computation is statically allocated to a node, Spark tasks are assigned dynamically to the workers
based on the locality of the data and available resources. This enables both better load balancing
and faster fault recovery, as we will illustrate next.
In addition, each batch of data is a Resilient Distributed Dataset (RDD), which is the basic
abstraction of a fault-tolerant dataset in Spark. This allows the streaming data to be processed
using any Spark code or library.
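A minimal DStream word count illustrates the model (the socket source on localhost:9999 is a placeholder; each 1-second micro-batch is an RDD processed with ordinary Spark operations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc  = new StreamingContext(conf, Seconds(1))           // 1-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)         // placeholder source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                               // output operation

ssc.start()              // start receiving and processing
ssc.awaitTermination()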
Figure 2: Spark Streaming Architecture
Benefits of Discretized Stream Processing
Let’s see how this architecture allows Spark Streaming to achieve the goals we set earlier.
Dynamic load balancing
Dividing the data into small micro-batches allows for fine-grained allocation of computations to
resources. For example, consider a simple workload where the input data stream needs to be
partitioned by a key and processed. In the traditional record-at-a-time approach taken by most
other systems, if one of the partitions is more computationally intensive than the others, the node
statically assigned to process that partition will become a bottleneck and slow down the pipeline.
In Spark Streaming, the job’s tasks will be naturally load balanced across the workers — some
workers will process a few longer tasks, others will process more of the shorter tasks.
Figure 3: Dynamic load balancing
Fast failure and straggler recovery
In case of node failures, traditional systems have to restart the failed continuous operator on
another node and replay some part of the data stream to recompute the lost information. Note that
only one node is handling the recomputation, and the pipeline cannot proceed until the new node
has caught up after the replay. In Spark, the computation is already discretized into small,
deterministic tasks that can run anywhere without affecting correctness. So failed tasks can be
relaunched in parallel on all the other nodes in the cluster, thus evenly distributing all the
recomputations across many nodes, and recovering from the failure faster than the traditional
approach.
Then add a breakpoint where d() throws the exception in your debugger. I’m using IntelliJ’s
debugger for this image.
Here you can see that the string we added to d() is part of the stack frame because it’s a local
variable. Debuggers operate inside the Stack and give you a detailed picture of each frame.
Forcing a Thread Dump
Thread dumps are great post-mortem tools, but they can be useful for runtime issues too. If your
application stops responding or is consuming more CPU or memory than you expect, you can
retrieve information about the running app with jstack.
Modify main() so the application will run until killed:
public static void main(String[] args) throws Exception {
    try {
        while (true) {
            Thread.sleep(1000);
        }
    } catch (InterruptedException ie) {
        ie.printStackTrace();
    }
}
Run the app, determine its pid, and then run jstack. On Windows, you’ll need to press ctrl-break in
the DOS window you’re running your code in.
$ jstack <pid>
Jstack will generate a lot of output.
2019-05-13 10:06:17
Full thread dump OpenJDK 64-Bit Server VM (12+33 mixed mode, sharing):
Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   NewRatio         = 2
   SurvivorRatio    = 8
   MaxMetaspaceSize = 17592186044415 MB
   G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
PS Young Generation
Eden Space:
   26.038428412543404% used
From Space:
   capacity = 5767168 (5.5MB)
   73.0313387784091% used
To Space:
   used = 0 (0.0MB)
   0.0% used
PS Old Generation
   57.9674630794885% used
The two general flags that are used while collecting heap dumps are "-dump" and "-histo".
While the former gives the heap dump in the form of a binary file with the collection of objects at a
particular time, the latter provides the details of live objects in a text format.
#<jmap-used-by-process>/jmap -dump:file=<location-to-redirect-the-
output>/heapdump.hprof,format=b <PID>
If histo label needs to be used,
#<jmap-used-by-process>/jmap -histo <pid> > jmap.out
NOTE:
1. jmap and jstack are CPU-intensive processes, so please use them with caution.
2. Please try not to use -F as much as possible, as critical data can be missed with this option. If the -F
option does need to be used with either command, see the example below.
Example:
#/usr/jdk64/jdk1.8.0_112/bin/jmap -dump:file=/tmp/jmap21887.hprof,format=b -F 21887
#/usr/jdk64/jdk1.8.0_112/bin/jmap -histo -F 21887 > /tmp/jmaphistoF.out
Heap dumps contain a snapshot of the application's memory, including the values of variables at
the time the dump was created, so they are useful for diagnosing problems that occur at run time.
Services
Heap dumps are enabled by passing options (sometimes known as opts, or parameters) to the JVM
when a service is started. For most Apache Hadoop services, you can modify the shell script used
to start the service to pass these options.
In each script, there's an export for *_OPTS, which contains the options passed to the JVM. For
example, in the hadoop-env.sh script, the line that begins with export
HADOOP_NAMENODE_OPTS= contains the options for the NameNode service.
Map and reduce processes are slightly different, as these operations are child processes of the
MapReduce service. Each map or reduce process runs in a child container, and there are two
entries that contain the JVM options. Both are contained in mapred-site.xml:
• mapreduce.admin.map.child.java.opts
• mapreduce.admin.reduce.child.java.opts
Note
We recommend using Apache Ambari to modify both the scripts and mapred-site.xml settings,
as Ambari handles replicating changes across nodes in the cluster. See the Using Apache Ambari
section for specific steps.
Enable heap dumps
-XX:+HeapDumpOnOutOfMemoryError
The + indicates that this option is enabled. The default is disabled.
Warning
Heap dumps are not enabled for Hadoop services on HDInsight by default, as the dump files can
be large. If you do enable them for troubleshooting, remember to disable them once you have
reproduced the problem and gathered the dump files.
Dump location
The default location for the dump file is the current working directory. You can control where the
file is stored using the following option:
-XX:HeapDumpPath=/path
For example, using -XX:HeapDumpPath=/tmp causes the dumps to be stored in the /tmp
directory.
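To check that these options take effect, a tiny throwaway program that deliberately exhausts the heap can be used; a minimal sketch (OomDemo is a hypothetical class, not part of any Hadoop service):

import java.util.ArrayList;
import java.util.List;

// Run with: java -Xmx64m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp OomDemo
public class OomDemo {
    public static void main(String[] args) {
        List<byte[]> hog = new ArrayList<>();
        while (true) {
            hog.add(new byte[1024 * 1024]);   // keep allocating 1 MB blocks until the heap is exhausted
        }
    }
}

When the OutOfMemoryError is thrown, the JVM writes a java_pid<pid>.hprof file into the configured dump directory.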
Scripts
You can also trigger a script when an OutOfMemoryError occurs. For example, triggering a
notification so you know that the error has occurred. Use the following option to trigger a script on
an OutOfMemoryError:
-XX:OnOutOfMemoryError=/path/to/script
Note
Since Apache Hadoop is a distributed system, any script used must be placed on all nodes in the
cluster that the service runs on.
The script must also be in a location that is accessible by the account the service runs as, and must
have execute permissions. For example, you may wish to store scripts in /usr/local/bin and use
chmod go+rx /usr/local/bin/filename.sh to grant read and execute permissions.
Using Apache Ambari
• Using the Filter... entry, enter opts. Only items containing this text are displayed.
• Find the *_OPTS entry for the service you want to enable heap dumps for, and add the options
you wish to enable. In the following image, I've added -XX:
+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ to the
HADOOP_NAMENODE_OPTS entry:
Note: When enabling heap dumps for the map or reduce child process, look for the fields named
mapreduce.admin.map.child.java.opts and mapreduce.admin.reduce.child.java.opts.
• Use the Save button to save the changes. You can enter a short note describing the changes.
• Once the changes have been applied, the Restart required icon appears beside one or more
services.
• Select each service that needs a restart, and use the Service Actions button to Turn On
Maintenance Mode. Maintenance mode prevents alerts from being generated from the service while it is restarted.
Most Java applications developed today involve multiple threads, which, in contrast to its benefits,
carries with it a number of subtle difficulties. In a single-threaded application, all resources
(shared data, Input/Output (IO) devices, etc.) can be accessed without coordination, knowing that
the single thread of execution will be the only thread that utilizes the resource at any given time
within the application.
In the case of multithreaded applications, a trade-off is made — increased complexity for a possible
gain in performance, where multiple threads can utilize the available (often more than one)
Central Processing Unit (CPU) cores. In the right conditions, an application can see a
significant performance increase using multiple threads (formalized by Amdahl's Law), but special
attention must be paid to ensure that multiple threads coordinate properly when accessing a
resource that is needed by two threads. In many cases, frameworks, such as Spring, will abstract
direct thread management, but even the improper use of these abstracted threads can cause some
hard-to-debug issues. Taking all of these difficulties into consideration, it is likely that, eventually,
something will go wrong, and we, as developers, will have to start diagnosing the indeterministic
realm of threads.
Fortunately, Java has a mechanism for inspecting the state of all threads in an application at any
given time —the thread dump. In this article, we will look at the importance of thread dumps and
how to decipher their compact format, as well as how to generate and analyze thread dumps in
realistically-sized applications. This article assumes the reader has a basic understanding of
threads and the various issues that surround threads, including thread contention and shared
resource management. Even with this understanding, before generating and examining a thread
dump, it is important to solidify some central threading terminology.
Understanding the Terminology
Java thread dumps can appear cryptic at first, but making sense of thread dumps requires an
understanding of some basic terminology. In general, the following terms are key in grasping the
meaning and context of a Java thread dump:
• Thread — A discrete unit of concurrency that is managed by the Java Virtual Machine (JVM).
Threads are mapped to Operating System (OS) threads, called native threads, which
provide a mechanism for the execution of instructions (code). Each thread has a unique
identifier, name, and may be categorized as a daemon thread or non-daemon thread,
where a daemon thread runs independent of other threads in the system and is only killed
when either the Runtime.exit method has been called (and the security manager authorizes
the exiting of the program) or all non-daemon threads have died. For more information, see
the Thread class documentation.
Alive thread — a running thread that is performing some work (the normal thread state).
Blocked thread — a thread that attempted to enter a synchronized block but another
thread already locked the same synchronized block.
Waiting thread — a thread that has called the wait method (with a possible timeout) on
an object and is currently waiting for another thread to call the notify method
(or notifyAll) on the same object. Note that a thread is not considered waiting if it calls
the wait method on an object with a timeout and the specified timeout has expired.
Sleeping thread — a thread that is currently not executing as a result of calling
the Thread.sleep method (with a specified sleep length).
• Monitor — a mechanism employed by the JVM to facilitate concurrent access to a single object.
This mechanism is instituted using the synchronized keyword, where each object in Java has
an associated monitor allowing any thread to synchronize, or lock, an object, ensuring that
no other thread accesses the locked object until the lock is released (the synchronized block is
exited). For more information, see the Synchronization section (17.1) of the Java Language
Specification (JLS); a minimal example of monitor usage follows this list.
• Deadlock — a scenario in which one thread holds some resource, A, and is blocked, waiting for
some resource, B, to become available, while another thread holds resource B and is blocked,
waiting for resource A to become available. When a deadlock occurs, no progress is made
within a program. It is important to note that a deadlock may also occur with more than two
threads, where three or more threads all hold a resource required by another thread and are
simultaneously blocked, waiting for a resource held by another thread. A special case of this
occurs when some thread, X, holds resource A and requires resource C, thread Y holds
resource B and requires resource A, and thread Z holds resource C and requires resource
B (formally known as the Dining Philosophers Problem).
• Livelock — a scenario in which thread A performs an action that causes thread B to perform an
action that in turn causes thread A to perform its original action. This situation can be
visualized as a dog chasing its tail. Similar to deadlock, live-locked threads do not make
progress, but unlike deadlock, the threads are not blocked (and instead, are alive).
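As a minimal illustration of the monitor mechanism described in the list above (class and thread names are illustrative): two threads synchronize on the same object, so whichever thread arrives second is reported as BLOCKED in a thread dump until the first one exits the synchronized block.

public class MonitorExample {
    private static final Object LOCK = new Object();

    public static void main(String[] args) {
        Runnable work = () -> {
            synchronized (LOCK) {                 // acquire the monitor associated with LOCK
                try {
                    Thread.sleep(5000);           // hold the monitor long enough to observe in a dump
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        new Thread(work, "holder").start();
        new Thread(work, "blocked-waiter").start();   // BLOCKED until "holder" releases the monitor
    }
}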
The above definitions do not constitute a comprehensive vocabulary for Java threads or thread
dumps but make up a large portion of the terminology that will be experienced when reading a
typical thread dump. For a more detailed lexicon of Java threads and thread dumps, see Section 17
of the JLS and Java Concurrency in Practice.
With this basic understanding of Java threads, we can progress to creating an application from
which we will generate a thread dump and, later, examine the key portion of the thread dump to
garner useful information about the threads in the program.
Creating an Example Program
In order to generate a thread dump, we need to first execute a Java application. While a simple
"hello, world!" application results in an overly simplistic thread dump, a thread dump from
an even moderately-sized multithreaded application can be overwhelming. For the sake of
understanding the basics of a thread dump, we will use the following program, which starts two
threads that eventually become deadlocked:
// Reconstructed so that the fragments preserved in these notes form a complete, compilable
// listing; the structure follows the description in the surrounding text.
public class DeadlockProgram {
    public static void main(String[] args) throws Exception {
        Object resourceA = new Object();
        Object resourceB = new Object();
        Thread threadLockingResourceAFirst = new Thread(new DeadlockRunnable(resourceA, resourceB));
        Thread threadLockingResourceBFirst = new Thread(new DeadlockRunnable(resourceB, resourceA));
        threadLockingResourceAFirst.start();
        Thread.sleep(500);
        threadLockingResourceBFirst.start();
    }
    private static void printLockedResource(Object resource) {
        System.out.println(Thread.currentThread().getName() + ": locked resource -> " + resource);
    }
    private static class DeadlockRunnable implements Runnable {
        private final Object firstResource, secondResource;
        DeadlockRunnable(Object firstResource, Object secondResource) { this.firstResource = firstResource; this.secondResource = secondResource; }
        @Override
        public void run() {
            try {
                synchronized (firstResource) {
                    printLockedResource(firstResource);
                    Thread.sleep(1000);
                    synchronized (secondResource) { printLockedResource(secondResource); }
                }
            } catch (InterruptedException e) {
                System.out.println("Exception occurred: " + e);
            }
        }
    }
}
This program simply creates two resources, resourceA and resourceB, and starts two threads,
threadLockingResourceAFirst and threadLockingResourceBFirst, that lock each of these resources.
The key to causing deadlock is ensuring that threadLockingResourceAFirst tries to lock
resourceA and then resourceB, while threadLockingResourceBFirst tries to lock resourceB and
then resourceA. Delays are added to ensure that threadLockingResourceAFirst sleeps before it is
able to lock resourceB and that threadLockingResourceBFirst is given enough time to lock resourceB
before threadLockingResourceAFirst wakes. threadLockingResourceBFirst then sleeps and, when
both threads wake, they find that the second resource they desire has already been locked and
both threads block, waiting for the other thread to relinquish its locked resource (which never
occurs).
Executing this program results in the following output, where the object hashes (the numeric
following java.lang.Object@) will vary between each execution:
At the completion of this output, the program appears as though it is running (the process
executing this program does not terminate), but no further work is being done. This is a deadlock
in practice. In order to troubleshoot the issue at hand, we must generate a thread dump manually
and inspect the state of the threads in the dump.
Generating a Thread Dump
In practice, a Java program might terminate abnormally and generate a thread dump
automatically, but, in some cases (such as with many deadlocks), the program does not terminate
but appears as though it is stuck. To generate a thread dump for this stuck program, we must first
discover the Process ID (PID) for the program. To do this, we use the JVM Process Status (JPS)
tool that is included with all Java Development Kit (JDK) 7+ installations. To find the PID for our
deadlocked program, we simply execute jps in the terminal (either Windows or Linux):
$ jps
11568 DeadlockProgram
15584 Jps
15636
The first column represents the Local VM ID (lvmid) for the running Java process. In the context
of a local JVM, the lvmid maps to the PID for the Java process. Note that this value will likely differ
from the value above. The second column represents the name of the application, which may map
to the name of the main class, a Java Archive (JAR) file, or Unknown, depending on the
characteristics of the program run.
In our case, the application name is DeadlockProgram, which matches the name of the main class
file that was executed when our program started. In the above example, the PID for our program is
11568, which provides us with enough information to generate a thread dump. To generate the
dump, we use the jstack program (included with all JDK 7+ installations), supplying the -l flag
(which creates a long listing) and the PID of our deadlocked program, and piping the output to
some text file (i.e. thread_dump.txt):
jstack -l 11568 > thread_dump.txt
This thread_dump.txt file now contains the thread dump for our deadlocked program and includes
some very useful information for diagnosing the root cause of our deadlock problem. Note that if we
did not have a JDK 7+ installed, we could also generate a thread dump by sending a SIGQUIT signal
to the deadlocked program, which makes the JVM print a thread dump without terminating. To do
this on Linux, simply signal the deadlocked program using its PID (11568 in our example) with the -3 flag:
kill -3 11568
Reading a Simple Thread Dump
Opening the thread_dump.txt file, we see that it contains the following:
2018-06-19 16:44:44
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0.1+10 mixed mode):
0x00000250e54d0800
}
"Reference Handler" #2 daemon prio=10 os_prio=2 tid=0x00000250e4979000 nid=0x3c28
waiting on condition [0x000000b82a9ff000]
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@10.0.1/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@10.0.1/Reference.java:174)
at java.lang.ref.Reference.access$000(java.base@10.0.1/Reference.java:44)
at java.lang.ref.Reference$ReferenceHandler.run(java.base@10.0.1/Reference.java:138)
- None
at java.lang.Object.wait(java.base@10.0.1/Native Method)
at java.lang.ref.ReferenceQueue.remove(java.base@10.0.1/ReferenceQueue.java:151)
at java.lang.ref.ReferenceQueue.remove(java.base@10.0.1/ReferenceQueue.java:172)
at java.lang.ref.Finalizer$FinalizerThread.run(java.base@10.0.1/Finalizer.java:216)
- None
java.lang.Thread.State: RUNNABLE
- None
java.lang.Thread.State: RUNNABLE
- None
"C2 CompilerThread0" #6 daemon prio=9 os_prio=2 tid=0x00000250e4995800 nid=0x4198
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
- None
java.lang.Thread.State: RUNNABLE
No compile task
Locked ownable synchronizers:
- None
java.lang.Thread.State: RUNNABLE
No compile task
- None
- None
java.lang.Thread.State: RUNNABLE
- None
"Common-Cleaner" #11 daemon prio=8 os_prio=1 tid=0x00000250e54cf000 nid=0x1610 in
Object.wait() [0x000000b82b2fe000]
at java.lang.Object.wait(java.base@10.0.1/Native Method)
at java.lang.ref.ReferenceQueue.remove(java.base@10.0.1/ReferenceQueue.java:151)
at jdk.internal.ref.CleanerImpl.run(java.base@10.0.1/CleanerImpl.java:148)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
at jdk.internal.misc.InnocuousThread.run(java.base@10.0.1/InnocuousThread.java:134)
- None
"Thread-0" #12 prio=5 os_prio=0 tid=0x00000250e54d1800 nid=0xdec waiting for monitor
entry [0x000000b82b4ff000]
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
- None
"Thread-1" #13 prio=5 os_prio=0 tid=0x00000250e54d2000 nid=0x415c waiting for monitor
entry [0x000000b82b5ff000]
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
- None
- None
=============================
"Thread-0":
waiting to lock monitor 0x00000250e4982480 (object 0x00000000894465b0, a
java.lang.Object),
"Thread-1":
===================================================
"Thread-0":
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
"Thread-1":
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
Found 1 deadlock.
Introductory Information
Although this file may appear overwhelming at first, it is actually simple if we take each section one
step at a time. The first line of the dump displays the timestamp of when the dump was generated,
while the second line contains the diagnostic information about the JVM from which the dump
was generated:
2018-06-19 16:44:44
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0.1+10 mixed mode):
While these lines do not provide any information with regards to the threads in our system, they
provide a context from which the rest of the dump can be framed (i.e. which JVM generated the
dump and when the dump was generated).
General Threading Information
The next section begins to provide us with some useful information about the threads that were
running at the time the thread dump was taken:
This section contains the thread list Safe Memory Reclamation (SMR) information,
which enumerates the addresses of all non-JVM internal threads (e.g. non-VM and non-Garbage
Collection (GC)). If we examine these addresses, we see that they correspond to the tid value — the
address of the native thread object, not the Thread ID, as we will see shortly — of each of the
numbered threads in the dump (note that ellipses are used to hide superfluous information):
Threads
Directly following the SMR information is the list of threads. The first thread listed in for our
deadlocked program is the Reference Handler thread:
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@10.0.1/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@10.0.1/Reference.java:174)
at java.lang.ref.Reference.access$000(java.base@10.0.1/Reference.java:44)
at java.lang.ref.Reference$ReferenceHandler.run(java.base@10.0.1/Reference.java:138)
- None
Thread Summary
The first line of each thread represents the thread summary, which contains the following items:
• Priority (example: prio=10): The numeric priority of the Java thread. Note that this does not
necessarily correspond to the priority of the OS thread to which the Java thread is dispatched. The
priority of a Thread object can be set using the setPriority method and obtained using the
getPriority method.
• OS Thread Priority (example: os_prio=2): The OS thread priority. This priority can differ from the
Java thread priority and corresponds to the OS thread on which the Java thread is dispatched.
• Address (example: tid=0x00000250e4979000): The address of the Java thread. This address
represents the pointer address of the Java Native Interface (JNI) native Thread object (the C++
Thread object that backs the Java thread through the JNI). This value is obtained by converting the
pointer to this (of the C++ object that backs the Java Thread object) to an integer on line 879 of
hotspot/share/runtime/thread.cpp:
st->print("tid=" INTPTR_FORMAT " ", p2i(this));
Although the key for this item (tid) may appear to be the thread ID, it is actually the address of the
underlying JNI C++ Thread object and thus is not the ID returned when calling getId on a Java
Thread object.
• Last Known Java Stack Pointer (example: [0x000000b82a9ff000]): The last known Stack Pointer
(SP) for the stack associated with the thread. This value is supplied using native C++ code and is
interlaced with the Java Thread class using the JNI. This value is obtained using the last_Java_sp()
native method and is formatted into the thread dump on line 2886 of
hotspot/share/runtime/thread.cpp:
st->print_cr("[" INTPTR_FORMAT "]", (intptr_t)last_Java_sp() & ~right_n_bits(12));
For simple thread dumps, this information may not be useful, but for more complex diagnostics,
this SP value can be used to trace lock acquisition through a program.
Thread State
The second line represents the current state of the thread. The possible states for a thread are
captured in the Thread.State enumeration:
• NEW
• RUNNABLE
• BLOCKED
• WAITING
• TIMED_WAITING
• TERMINATED
For more information on the meaning of each state, see the Thread.State documentation.
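These states can also be observed directly from code; a small sketch (the class and thread names are illustrative):

public class ThreadStateDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread sleeper = new Thread(() -> {
            try { Thread.sleep(60000); } catch (InterruptedException ignored) { }
        }, "sleeper");
        System.out.println(sleeper.getState());   // NEW (created but not yet started)
        sleeper.start();
        Thread.sleep(100);                         // give the thread time to reach Thread.sleep
        System.out.println(sleeper.getState());   // TIMED_WAITING (sleeping with a timeout)
    }
}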
Thread Stack Trace
The next section contains the stack trace for the thread at the time of the dump. This stack trace
resembles the stack trace printed when an uncaught exception occurs and simply denotes the class
and line that the thread was executing when the dump was taken. In the case of the Reference
Handler thread, there is nothing of particular importance that we see in the stack trace, but if we
look at the stack trace for Thread-0, we see a difference from the standard stack trace:
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
- None
Within this stack trace, we can see that locking information has been added, which tells us that this
thread is waiting for a lock on an object with an address of 0x00000000894465b0 (and a type
of java.lang.Object) and, at this point in the stack trace, holds a lock on an object with an address
of 0x00000000894465a0 (also of type java.lang.Object). This supplemental lock information is
important when diagnosing deadlocks, as we will see in the following sections.
Locked Ownable Synchronizer
The last portion of the thread information contains a list of synchronizers (objects that can be used
for synchronization, such as locks) that are exclusively owned by a thread. According to the official
Java documentation, "an ownable synchronizer is a synchronizer that may be exclusively owned by
a thread and uses AbstractOwnableSynchronizer (or its subclass) to implement its synchronization
property. ReentrantLock and the write-lock (but not the read-lock) of ReentrantReadWriteLock
are two examples of ownable synchronizers provided by the platform."
For more information on locked ownable synchronizers, see this Stack Overflow post.
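A small sketch of how an ownable synchronizer ends up in a dump (assuming the dump is taken with jstack -l while the lock is held; the class and thread names are illustrative):

import java.util.concurrent.locks.ReentrantLock;

public class OwnableSynchronizerDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        Thread holder = new Thread(() -> {
            lock.lock();                  // the lock is now exclusively owned by this thread
            try {
                Thread.sleep(60000);      // hold it long enough to take a dump with jstack -l
            } catch (InterruptedException ignored) {
            } finally {
                lock.unlock();
            }
        }, "lock-holder");
        holder.start();
        holder.join();
    }
}

While the lock-holder thread sleeps, its entry in the dump lists the ReentrantLock under "Locked ownable synchronizers".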
JVM Threads
The next section of the thread dump contains the JVM-internal (non-application) threads that are
bound to the OS. Since these threads do not exist within a Java application, they do not have a
thread ID. These threads are usually composed of GC threads and other threads used by the JVM
to run and maintain a Java application:
For many simple issues, this information is unused, but it is important to understand the
importance of these global references. For more information, see this Stack Overflow post.
Deadlocked Threads
The final section of the thread dump contains information about discovered deadlocks. This is not
always the case: If the application does not have one or more detected deadlocks, this section will
be omitted. Since our application was designed with a deadlock, the thread dump correctly
captures this contention with the following message:
=============================
"Thread-0":
===================================================
"Thread-0":
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
"Thread-1":
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
Found 1 deadlock.
The first subsection describes the deadlock scenario: Thread-0 is waiting to lock a monitor
(through the synchronized statement around the firstResource and secondResource in our
application) that is held by Thread-1, while Thread-1 is waiting to lock a monitor held by Thread-0. This circular
dependency is the textbook definition of a deadlock (contrived by our application) and is
illustrated in the figure below:
In addition to the description of the deadlock, the stack trace for each of the threads involved is
printed in the second subsection. This allows us to track down the line and locks (the objects being
used as monitor locks in this case) that are causing the deadlock. For example, if we examine line
34 of our application, we find the following content:
printLockedResource(secondResource);
This line represents the first line of the synchronized block causing the deadlock and tips us off to
the fact that synchronizing on secondResource is the root of the deadlock. In order to solve this
deadlock, we would have to instead synchronize on resourceA and resourceB in the same order in
both threads. If we do this, we end up with the following application:
// Reconstructed complete listing; the only change from the deadlocked version is that both
// runnables now receive (resourceA, resourceB), so both threads lock the resources in the same order.
public class DeadlockProgram {
    public static void main(String[] args) throws Exception {
        Object resourceA = new Object();
        Object resourceB = new Object();
        Thread threadLockingResourceAFirst = new Thread(new DeadlockRunnable(resourceA, resourceB));
        Thread threadLockingResourceBFirst = new Thread(new DeadlockRunnable(resourceA, resourceB));
        threadLockingResourceAFirst.start();
        Thread.sleep(500);
        threadLockingResourceBFirst.start();
    }
    private static void printLockedResource(Object resource) {
        System.out.println(Thread.currentThread().getName() + ": locked resource -> " + resource);
    }
    private static class DeadlockRunnable implements Runnable {
        private final Object firstResource, secondResource;
        DeadlockRunnable(Object firstResource, Object secondResource) { this.firstResource = firstResource; this.secondResource = secondResource; }
        @Override
        public void run() {
            try {
                synchronized (firstResource) {
                    printLockedResource(firstResource);
                    Thread.sleep(1000);
                    synchronized (secondResource) { printLockedResource(secondResource); }
                }
            } catch (InterruptedException e) {
                System.out.println("Exception occurred: " + e);
            }
        }
    }
}
This application produces the following output and completes without deadlocking (note that the
addresses of the Object objects will vary by execution):
Thread-0: locked resource -> java.lang.Object@1ad895d1
In summary, using only the information provided in the thread dump, we can find and fix a
deadlocked application. Although this inspection technique is sufficient for many simple
applications (or applications that have only a small number of deadlocks), dealing with more
complex thread dumps may need to be handled in a different way.
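Before turning to heavier tooling, note that the same circular wait can also be detected programmatically from inside the JVM, for example from a watchdog thread; a minimal sketch using ThreadMXBean (the class name is illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    public static void main(String[] args) {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        long[] deadlockedIds = mxBean.findDeadlockedThreads();   // null when no deadlock is present
        if (deadlockedIds != null) {
            for (ThreadInfo info : mxBean.getThreadInfo(deadlockedIds)) {
                System.out.println(info.getThreadName() + " is deadlocked waiting for " + info.getLockName());
            }
        }
    }
}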
Handling More Complex Thread Dumps
When handling production applications, thread dumps can become overwhelming very quickly. A
single JVM may have hundreds of threads running at the same time and deadlocks may involve
more than two threads (or there may be more than one concurrency issue as a side-effect of a
single cause) and parsing through this firehose of information can be tedious and unruly.
In order to handle these large-scale situations, Thread Dump Analyzers (TDAs) should be the tool
of choice. These tools parse Java thread dumps and display otherwise confusing information in a
manageable form (commonly with a graph or other visual aid) and may even perform static
analysis of the dump to discover issues. While the best tool for a situation will vary by
circumstance, some of the most common TDAs include the following:
• fastThread
• Spotify TDA
• IBM Thread and Monitor Dump Analyzer for Java
• irockel TDA
While this is far from a comprehensive list of TDAs, each performs enough analysis and visual
sorting to reduce the manual burden of deciphering thread dumps.
How to Analyze Java Thread Dumps
The content of this article was originally written by Tae Jin Gu on the Cubrid blog.
When there is a problem, or when a Java-based web application is running much slower than
expected, we need to use thread dumps. If thread dumps feel very complicated to you, this
article may help you a great deal. Here I will explain what threads are in Java, their types, how they
are created, how to manage them, how you can dump threads from a running application, and
finally how you can analyze them and determine the bottleneck or blocking threads. This article is
a result of long experience in Java application debugging.
Java and Thread
A web server uses tens to hundreds of threads to process a large number of concurrent users. If
two or more threads utilize the same resources, a contention between the threads is inevitable, and
sometimes deadlock occurs.
Thread contention is a status in which one thread is waiting for a lock, held by another thread,
to be lifted. Different threads frequently access shared resources on a web application. For
example, to record a log, the thread trying to record the log must obtain a lock and access the
shared resources.
Deadlock is a special type of thread contention, in which two or more threads are waiting for the
other threads to complete their tasks in order to complete their own tasks.
Different issues can arise from thread contention. To analyze such issues, you need to use the
thread dump. A thread dump will give you the information on the exact status of each thread.
Background Information for Java Threads
Thread Synchronization
A thread can run alongside other threads at the same time. In order to ensure consistency
when multiple threads are trying to use shared resources, one thread at a time should be allowed
to access the shared resources by using thread synchronization.
Thread synchronization in Java can be done using a monitor. Every Java object has a single
monitor. The monitor can be owned by only one thread. For a thread to own a monitor that is
owned by a different thread, it needs to wait in the wait queue until the other thread releases its
monitor.
Thread Status
In order to analyze a thread dump, you need to know the status of threads. The statuses of threads
are stated on java.lang.Thread.State.
Use the extracted PID as the parameter of jstack to obtain a thread dump.
Use the extracted PID as the parameter of kill -SIGQUIT (signal 3) to obtain a thread dump.
Thread Information from the Thread Dump File
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
• Thread name: When using the java.lang.Thread class to generate a thread, the thread will be named
Thread-(Number), whereas when using java.util.concurrent.ThreadFactory class, it will be
named pool-(number)-thread-(number).
• Priority: Represents the priority of the threads.
• Thread ID: Represents the unique ID for the threads. (Some useful information, including the
CPU usage or memory usage of the thread, can be obtained by using the thread ID; see the sketch after this list.)
• Thread status: Represents the status of the threads.
• Thread callstack: Represents the call stack information of the threads.
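As referenced in the Thread ID item above, per-thread CPU time can be looked up by thread ID through ThreadMXBean; a minimal sketch (the class name is illustrative; values are in nanoseconds and may be -1 if CPU time measurement is unsupported or disabled):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCpuUsage {
    public static void main(String[] args) {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        for (long threadId : mxBean.getAllThreadIds()) {
            long cpuNanos = mxBean.getThreadCpuTime(threadId);   // CPU time consumed by this thread
            System.out.println("thread " + threadId + " cpu=" + cpuNanos + " ns");
        }
    }
}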
Thread Dump Patterns by Type
When Unable to Obtain a Lock (BLOCKED)
This is when the overall performance of the application slows down because a thread is occupying
the lock and prevents other threads from obtaining it. In the following example, BLOCKED_TEST
pool-1-thread-1 thread is running with <0x0000000780a000b0> lock, while BLOCKED_TEST
pool-1-thread-2 and BLOCKED_TEST pool-1-thread-3 threads are waiting to obtain
<0x0000000780a000b0> lock.
java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.PrintStream.write(PrintStream.java:432)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
at java.io.PrintStream.newLine(PrintStream.java:496)
at java.io.PrintStream.println(PrintStream.java:687)
- locked <0x0000000780a04118> (a java.io.PrintStream)
at
com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:44)
- locked <0x0000000780a000b0> (a
com.nbp.theplatform.threaddump.ThreadBlockedState)
at
com.nbp.theplatform.threaddump.ThreadBlockedState$1.run(ThreadBlockedState.java:7)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
- <0x0000000780a31758> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
"BLOCKED_TEST pool-1-thread-2" prio=6 tid=0x0000000007673800 nid=0x260c waiting for
monitor entry [0x0000000008abf000]
at
com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:43)
at
com.nbp.theplatform.threaddump.ThreadBlockedState$2.run(ThreadBlockedState.java:26)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
- <0x0000000780b0c6a0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at
com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:42)
at
com.nbp.theplatform.threaddump.ThreadBlockedState$3.run(ThreadBlockedState.java:34)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
- <0x0000000780b0e1b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(T
hreadDeadLockState.java:197)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.monitorOurLock(Thre
adDeadLockState.java:182)
- locked <0x00000007d58f5e48> (a
com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockS
tate.java:135)
Locked ownable synchronizers:
- None
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(T
hreadDeadLockState.java:197)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.monitorOurLock(Thre
adDeadLockState.java:182)
- locked <0x00000007d58f5e60> (a
com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockS
tate.java:135)
- None
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(T
hreadDeadLockState.java:197)
- locked <0x00000007d58f5e78> (a
com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockS
tate.java:135)
- None
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at sun.nio.cs.StreamDecoder.read0(StreamDecoder.java:107)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:93)
at java.io.InputStreamReader.read(InputStreamReader.java:151)
at
com.nbp.theplatform.threaddump.ThreadSocketReadState$1.run(ThreadSocketReadState.java:27
)
at java.lang.Thread.run(Thread.java:662)
When Waiting
The thread is maintaining the WAITING status. In the thread dump, the IoWaitThread thread keeps waiting to
receive a message from LinkedBlockingQueue. If there continues to be no message for
LinkedBlockingQueue, then the thread status will not change.
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007d5c45850> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSy
nchronizer.java:1987)
at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:440)
at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:629)
at
com.nbp.theplatform.threaddump.ThreadIoWaitState$IoWaitHandler2.run(ThreadIoWaitState.ja
va:89)
at java.lang.Thread.run(Thread.java:662)
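For reference, this WAITING pattern can be reproduced with a consumer that blocks on an empty queue; a minimal sketch (the class and thread names are illustrative):

import java.util.concurrent.LinkedBlockingQueue;

public class IoWaitDemo {
    public static void main(String[] args) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread consumer = new Thread(() -> {
            try {
                String message = queue.take();    // parks the thread until a message arrives
                System.out.println("received: " + message);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "IoWaitThread");
        consumer.start();
        // No producer ever offers a message, so the consumer stays in WAITING state indefinitely.
    }
}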
From the application, find out which thread is using the CPU the most.
Acquire the Light Weight Process (LWP) that uses the CPU the most and convert its unique
number (10039) into a hexadecimal number (0x2737).
2. After acquiring the thread dump, check the thread's action.
Extract the thread dump of an application with a PID of 10029, then find the thread with an nid of
0x2737.
"NioProcessor-2" prio=10 tid=0x0a8d2800 nid=0x2737 runnable [0x49aa5000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
external.org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:65)
at
external.org.apache.mina.common.AbstractPollingIoProcessor$Worker.run(AbstractPollingIoPro
cessor.java:708)
at
external.org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Extract thread dumps several times every hour, and check the status change of the threads to
determine the problem.
Example 2: When the Processing Performance is Abnormally Slow
After acquiring thread dumps several times, find the list of threads with BLOCKED status.
" DB-Processor-13" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry
[0x000000000825f000]
at beans.cus.ServiceCnt.getTodayCount(ServiceCnt.java:111)
at beans.cus.ServiceCnt.insertCount(ServiceCnt.java:43)
at beans.ConnectionPool.getConnection(ConnectionPool.java:102)
at beans.cus.ServiceCnt.getTodayCount(ServiceCnt.java:111)
at beans.cus.ServiceCnt.insertCount(ServiceCnt.java:43)
" DB-Processor-3" daemon prio=5 tid=0x00928248 nid=0x8b waiting for monitor entry
[0x000000000825d080]
java.lang.Thread.State: RUNNABLE
at oracle.jdbc.driver.OracleConnection.isClosed(OracleConnection.java:570)
at beans.ConnectionPool.getConnection(ConnectionPool.java:112)
at beans.cus.Cue_1700c.GetNationList(Cue_1700c.java:66)
at org.apache.jsp.cue_1700c_jsp._jspService(cue_1700c_jsp.java:120)
Acquire the list of threads with BLOCKED status after getting the thread dumps several times.
If the threads are BLOCKED, extract the threads related to the lock that the threads are trying to
obtain.
Through the thread dump, you can confirm that the thread status stays BLOCKED because
the <0xe0375410> lock could not be obtained. This problem can be solved by analyzing the stack trace
of the thread currently holding the lock.
There are two reasons why the above pattern frequently appears in applications using DBMS. The
first reason is inadequate configurations. Despite the fact that the threads are still working,
they cannot show their best performance because the configurations for DBCP and the like are not
adequate. If you extract thread dumps multiple times and compare them, you will often see that
some of the threads that were BLOCKED previously are in a different state.
The second reason is the abnormal connection. When the connection with DBMS stays
abnormal, the threads wait until the time is out. In this case, even after extracting the thread
dumps several times and comparing them, you will see that the threads related to DBMS are still in
a BLOCKED state. By adequately changing the values, such as the timeout value, you can shorten
the time in which the problem occurs.
Coding for Easy Thread Dump
Naming Threads
When a thread is created using java.lang.Thread object, the thread will be named Thread-
(Number). When a thread is created using java.util.concurrent.DefaultThreadFactory object, the
thread will be named pool-(Number)-thread-(Number). When analyzing tens to thousands of
threads for an application, if all the threads still have their default names, analyzing them becomes
very difficult, because it is difficult to distinguish the threads to be analyzed.
Therefore, it is recommended that you develop the habit of naming threads whenever a new
thread is created.
When you create a thread using java.lang.Thread, you can give the thread a custom name by using
the constructor parameter.
public Thread(Runnable target, String name);
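For example (the task body and thread name here are illustrative):

Thread fileCleaner = new Thread(() -> {
    // ... the actual work of the thread goes here ...
}, "file-cleaner-1");      // this name is what will appear in the thread dump
fileCleaner.start();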
When you create a thread using java.util.concurrent.ThreadFactory, you can name it by generating
your own ThreadFactory. If you do not need special functionalities, then you can use
MyThreadFactory as described below:
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Reconstructed from the fragments preserved in these notes: a ThreadFactory that gives
// every thread a descriptive, per-pool name so it is easy to identify in a thread dump.
public class MyThreadFactory implements ThreadFactory {
    private static final ConcurrentHashMap<String, AtomicInteger> POOL_NUMBER = new ConcurrentHashMap<String, AtomicInteger>();
    private final ThreadGroup group;
    private final AtomicInteger threadNumber = new AtomicInteger(1);
    private final String namePrefix;

    public MyThreadFactory(String threadPoolName) {
        if (threadPoolName == null) {
            throw new NullPointerException("threadPoolName");
        }
        group = Thread.currentThread().getThreadGroup();
        AtomicInteger poolCount = POOL_NUMBER.putIfAbsent(threadPoolName, new AtomicInteger(1));
        if (poolCount == null) {
            namePrefix = threadPoolName + "-pool-1-thread-";
        } else {
            namePrefix = threadPoolName + "-pool-" + poolCount.incrementAndGet() + "-thread-";
        }
    }

    public Thread newThread(Runnable runnable) {
        Thread thread = new Thread(group, runnable, namePrefix + threadNumber.getAndIncrement(), 0);
        if (thread.isDaemon()) {
            thread.setDaemon(false);                    // never hand out daemon threads
        }
        if (thread.getPriority() != Thread.NORM_PRIORITY) {
            thread.setPriority(Thread.NORM_PRIORITY);   // normalize the priority
        }
        return thread;
    }
}
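A hypothetical usage of this factory with a thread pool, so that pool threads carry recognizable names in a dump:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newFixedThreadPool(4, new MyThreadFactory("worker"));
// Threads created by this pool appear in a thread dump as "worker-pool-1-thread-1",
// "worker-pool-1-thread-2", and so on, instead of the default "pool-N-thread-M".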
Thread information can also be obtained programmatically at runtime through java.lang.management.ThreadMXBean:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
long[] threadIds = mxBean.getAllThreadIds();
ThreadInfo[] threadInfos = mxBean.getThreadInfo(threadIds);
for (ThreadInfo threadInfo : threadInfos) {
    if (threadInfo == null) continue;                  // a thread may have died in the meantime
    System.out.println(threadInfo.getThreadName());
    System.out.println(threadInfo.getBlockedCount());
    System.out.println(threadInfo.getBlockedTime());   // -1 unless contention monitoring is enabled
    System.out.println(threadInfo.getWaitedCount());
    System.out.println(threadInfo.getWaitedTime());    // -1 unless contention monitoring is enabled
}
You can acquire the amount of time that the threads WAITed or were BLOCKED by using the
methods in ThreadInfo, and by using this you can also obtain the list of threads that have been
inactive for an abnormally long period of time.
In Conclusion
In writing this article, I was concerned that for developers with a lot of experience in multi-thread
programming this material may be common knowledge, whereas for less experienced developers
I felt that I was skipping straight to thread dumps without providing enough background
information about thread activities. This was because I was not able to explain the thread
activities in a clear yet concise manner. I sincerely hope that this article will prove helpful for
many developers.
How to collect a thread dump using jcmd and analyse it?
jsensharma Super Mentor
Created on 12-17-2016 01:11 PM
[Related Article On Ambari Server Tuning :
https://community.hortonworks.com/articles/131670/ambari-server-performance-tuning-
troubleshooting-c... ]
The jcmd utility comes with the JDK and is present inside "$JAVA_HOME/bin". It is used to
send diagnostic command requests to the JVM, where these requests are useful for controlling
Java Flight Recordings and for troubleshooting and diagnosing JVM and Java applications.
Following are the conditions for using this utility:
- 1. It must be used on the same machine where the JVM is running.
- 2. Only a user who owns the JVM process can connect to it using this utility.
This utility can help us in getting many details about the JVM process. Some of the most useful
pieces of information are the following. Syntax:
jcmd $PID $ARGUMENT
Example1: Classes taking the most memory are listed at the top, and classes are listed in a
descending order.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID GC.class_histogram > /tmp/22421_ClassHistogram.txt
Example2: Generate Heap Dump
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID GC.heap_dump /tmp/test123.hprof
Example3: Explicitly request JVM to trigger a Garbage Collection Cycle.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID GC.run
Example4: Generate Thread dump.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID Thread.print
Example5: List JVM properties.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID VM.system_properties
Example6: The Command line options along with the CLASSPATH setting.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID VM.command_line
NOTE: To use a few specific features offered by the "jcmd" tool, the "-XX:+UnlockDiagnosticVMOptions" JVM option needs to be enabled.
.
When to Collect Thread Dumps?
---------------------------------------------------------
Here we will look at a very common scenario where we find that the JVM process is taking a lot
of time to process requests. Many times we see that the JVM process is stuck, slow, or
completely hung. In such a scenario, in order to investigate the root cause of the slowness, we need to
collect the thread dumps of the JVM process, which will tell us about the various activities those
threads are actually performing. Sometimes some threads are involved in very CPU-intensive
operations, which might also cause slowness in getting the response, so we should
collect the thread dump as well as the CPU data using the "top" command. A few things to consider
while collecting the thread dumps:
- 1. Collect the thread dump when we see the issue (slowness, stuck/hung scenario, etc.).
- 2. A single thread dump is usually not very useful. Whenever we collect thread dumps, we should
collect at least 5-6 of them at a fixed interval, for example one dump every 10 seconds, so that we
get around 5-6 thread dumps within a minute.
- 3. If we also suspect that a few threads might be consuming high CPU cycles, then in order
to find the APIs that are actually consuming the high CPU we must collect the thread dump as well
as the "top" command output data at almost the same time.
.
- In order to make this easy we can use a simple script "threaddump_cpu_with_cmd.sh" and
use it for our troubleshooting & JVM data collection. The following script can be downloaded from:
https://github.com/jaysensharma/MiddlewareMagicDemos/tree/master/HDP_Ambari/JVM
#!/bin/sh
# Takes the JavaApp PID as an argument.
# Make sure you set JAVA_HOME
# Create thread dumps a specified number of times (i.e. LOOP) at a specified INTERVAL.
# Thread dumps will be collected in the file "jcmd_threaddump.out", in the same directory from where this script is executed.
# Usage:
# sudo - $user_Who_Owns_The_JavaProcess
# ./threaddump_cpu_with_cmd.sh <JAVA_APP_PID>
#
#
# Example:
# NameNode PID is "5752" and it is started by user "hdfs" then run this utility as following:
#
# su -l hdfs -c "/tmp/threaddump_cpu_with_cmd.sh 5752"
###################################################################
#############################
# Setting the Java Home, by giving the path where your JDK is kept
# USERS MUST SET THE JAVA_HOME before running this script as follows:
JAVA_HOME=/usr/jdk64/jdk1.8.0_60
---------------------------------------------------------
While running the JCMD we might see the below mentioned error. Here the "5752" is the
NameNode PID.
[root@c6401 keys]# /usr/jdk64/jdk1.8.0_60/bin/jcmd 5752 help
5752:
com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not
responding or HotSpot VM not loaded
at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:106)
at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:63)
at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:208)
at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:147)
at sun.tools.jcmd.JCmd.main(JCmd.java:131)
This error occurred because the jcmd utility only allows us to connect to a JVM process that we
own. In this case we see that the "NameNode" process is owned by the "hdfs" user, whereas in the
above command we are trying to connect to the NameNode process via the "jcmd" utility as the
"root" user. The root user here does not own the process, hence we see the error.
-
"hdfs" user owned process
# su -l hdfs -c "/usr/jdk64/jdk1.8.0_60/bin/jcmd -l"
5752 org.apache.hadoop.hdfs.server.namenode.NameNode
5546 org.apache.hadoop.hdfs.tools.DFSZKFailoverController
5340 org.apache.hadoop.hdfs.server.datanode.DataNode
4991 org.apache.hadoop.hdfs.qjournal.server.JournalNode
.
- "root" user owned process
[root@c6401 keys]# /usr/jdk64/jdk1.8.0_60/bin/jcmd -l
1893 com.hortonworks.support.tools.server.SupportToolServer
6470 com.hortonworks.smartsense.activity.ActivityAnalyzerFacade
16774 org.apache.ambari.server.controller.AmbariServer
29100 sun.tools.jcmd.JCmd -l
6687 org.apache.zeppelin.server.ZeppelinServer
More information about this utility can be found at:
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr006.html