HDP Training Tesco - II Notes
After the Spark context is created, it waits for the resources. Once the resources are available, the Spark
context sets up internal services and establishes a connection to a Spark execution environment.
Yarn Resource Manager, Application Master & launching of executors (containers).
Once the Spark context is created, it checks with the Cluster Manager and launches the
Application Master, i.e., launches a container and registers signal handlers.
Once the Application Master is started, it establishes a connection with the Driver.
Now, the Yarn Container will perform the below operations as shown in the diagram.
(Diagram from jaceklaskowski.gitbooks.io; not reproduced in these notes.)
ii) YarnRMClient will register with the Application Master.
iii) YarnAllocator: Will request 3 executor containers, each with 2 cores and 884 MB memory
including 384 MB overhead
Now the Yarn Allocator receives tokens from Driver to launch the Executor nodes and start the
containers.
• Launching container
YARN executor launch context assigns each executor an executor ID to identify the
corresponding executor (via the Spark WebUI) and starts a CoarseGrainedExecutorBackend.
Netty-based RPC - It is used to communicate between worker nodes, spark context, executors.
NettyRPCEndPoint is used to track the result status of the worker node.
RpcEndpointAddress is the logical address for an endpoint registered to an RPC Environment,
with RpcAddress and name.
It is in the format as shown below:
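For example, the driver's scheduler endpoint typically appears as follows (the host and port shown here are placeholder values):
spark://CoarseGrainedScheduler@192.168.1.5:40655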
This is the first moment when CoarseGrainedExecutorBackend initiates communication with the
driver available at driverUrl through RpcEnv.
SparkListeners
(Diagram from jaceklaskowski.gitbooks.io; not reproduced in these notes.)
SparkListener (Scheduler listener) is a class that listens to execution events from Spark's
DAGScheduler and logs event information for an application, such as executor and driver
allocation details, along with jobs, stages, tasks and changes to environment properties.
SparkContext starts the LiveListenerBus that resides inside the driver. It registers
JobProgressListener with the LiveListenerBus, which collects all the data needed to show the statistics in the
Spark UI.
By default, only the listener for the WebUI is enabled, but if we want to add any other listeners
we can use spark.extraListeners.
Spark comes with two listeners that showcase most of the activities
i) StatsReportListener
ii) EventLoggingListener
EventLoggingListener: If you want to analyze the performance of your applications further,
beyond what is available as part of the Spark history server, then you can process the event log
data. The Spark event log records info on processed jobs/stages/tasks. It can be enabled as shown
below.
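A minimal sketch of switching event logging on through the Spark configuration (the log directory below is a placeholder; it must exist and be readable by the history server):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")              // record job/stage/task events
  .set("spark.eventLog.dir", "hdfs:///spark-history") // placeholder path shared with the history server
val sc = new SparkContext(conf)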
To enable the listener, you register it with the SparkContext. This can be done in two ways:
i) Using the SparkContext.addSparkListener(listener: SparkListener) method inside your Spark
application. (See the CustomListener example for implementing custom listeners.)
ii) Using the spark.extraListeners configuration option on the command line.
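As a sketch of both approaches (sc is assumed to be an existing SparkContext; StatsReportListener ships with Spark):

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.StatsReportListener

// i) register programmatically on a running SparkContext
sc.addSparkListener(new StatsReportListener())

// ii) or pass the listener class through configuration (usable with --conf as well)
val conf = new SparkConf()
  .set("spark.extraListeners", "org.apache.spark.scheduler.StatsReportListener")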
Let’s read a sample file and perform a count operation to see the StatsReportListener.
RDDs are created either by using a file in the Hadoop file system, or an existing Scala collection in
the driver program, and transforming it.
Let’s take a sample snippet as shown below
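A representative snippet (the input path is a placeholder) that reads a file and runs a word count ending in a count action:

val lines = sc.textFile("hdfs:///data/sample.txt")   // placeholder path
val counts = lines.flatMap(_.split(" "))             // narrow transformation
  .map(word => (word, 1))                            // narrow transformation
  .reduceByKey(_ + _)                                // wide transformation (shuffle)
counts.count()                                       // action: triggers the job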
Now the data will be read into the driver using the broadcast variable.
• Wide transformation: Here each operation requires the data to be shuffled; hence, for
each wide transformation a new stage will be created, for example reduceByKey.
6.2 Physical Plan: In this phase, once we trigger an action on the RDD, the DAG Scheduler looks
at the RDD lineage and comes up with the best execution plan with stages and tasks; together with
TaskSchedulerImpl, it executes the job as a set of tasks running in parallel.
Once we perform an action operation, the SparkContext triggers a job and registers the RDD lineage up to
the first stage (i.e., before any wide transformations) with the DAGScheduler.
Now, before moving on to the next stage (wide transformations), it checks whether there is any
partition data that needs to be shuffled and whether any parent operation results it
depends on are missing; if any such stage is missing, it re-executes that part of the operation by making use of
the DAG (Directed Acyclic Graph), which makes it fault tolerant.
Next, the DAGScheduler looks for the newly runnable stages and triggers the next stage
(reduceByKey) operation.
On completion of each task, the executor returns the result back to the driver.
Spark-WebUI
Spark-UI helps in understanding the code execution flow and the time taken to complete a
particular job. The visualization helps in finding out any underlying problems that take place
during the execution and optimizing the spark application further.
We will see the Spark-UI visualization as part of the previous step 6.
Once the job is completed you can see the job details such as the number of stages and the number of
tasks that were scheduled during the job's execution.
On clicking a completed job we can view the DAG visualization, i.e., the different wide and
narrow transformations that are part of it.
You can see the execution time taken by each stage.
Clicking on a particular stage of the job shows the complete details of where the
data blocks are residing, the data size, the executor used, the memory utilized and the time taken to
complete a particular task. It also shows the number of shuffles that take place.
Further, we can click on the Executors tab to view the Executor and driver used.
Now that we have seen how Spark works internally, you can determine the flow of execution by
making use of the Spark UI and logs, and by tweaking the Spark EventListeners, to determine an optimal setup
for the submission of a Spark job.
Apache Spark: core concepts, architecture and internals
03 MARCH 2016 on Spark, scheduling, RDD, DAG, shuffle
This post covers core concepts of Apache Spark such as RDD, DAG, execution workflow, forming
stages of tasks and shuffle implementation and also describes architecture and main components
of Spark Driver. There's a github.com/datastrophic/spark-workshop project created alongside
this post which contains Spark application examples and a dockerized Hadoop environment
to play with. Slides are also available at slideshare.
Intro
Spark is a generalized framework for distributed data processing providing a functional API for
manipulating data at scale, with in-memory data caching and reuse across computations. It applies a set
of coarse-grained transformations over partitioned data and relies on the dataset's lineage to
recompute tasks in case of failures. Worth mentioning is that Spark supports the majority of data
formats, has integrations with various storage systems and can be executed on Mesos or YARN.
A powerful and concise API in conjunction with rich libraries makes it easier to perform data
operations at scale, e.g. performing backup and restore of Cassandra column families in Parquet
format:
// Requires the spark-cassandra-connector and an active SparkContext `sc`;
// `toDF()` assumes the SQL implicits are in scope.
import com.datastax.spark.connector._

def backup(path: String, config: Config): Unit = {
  sc.cassandraTable(config.keyspace, config.table) // read the column family as an RDD
    .map(_.toEvent).toDF()                         // convert rows to a DataFrame
    .write.parquet(path)                           // persist as Parquet
}
• HadoopRDD:
getPartitions = HDFS blocks
getDependencies = None
compute = load block in memory
getPreferredLocations = HDFS block locations
partitioner = None
• MapPartitionsRDD
getPartitions = same as parent
getDependencies = parent RDD
compute = compute parent and apply map()
getPreferredLocations = same as parent
partitioner = None
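These five properties (partitions, dependencies, compute, preferred locations, partitioner) are what every RDD supplies. A minimal sketch of a custom narrow-dependency RDD (the class and names are made up for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, TaskContext}

// Doubles every element of its parent. Partitions, preferred locations and the
// partitioner are effectively inherited from the parent (a one-to-one dependency).
class TimesTwoRDD(parent: RDD[Int]) extends RDD[Int](parent) {
  override protected def getPartitions: Array[Partition] = parent.partitions
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    parent.iterator(split, context).map(_ * 2)
}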
RDD Operations
Operations on RDDs are divided into several groups:
• Transformations
apply user function to every element in a partition (or to the whole partition)
apply aggregation function to the whole dataset (groupBy, sortBy)
introduce dependencies between RDDs to form DAG
provide functionality for repartitioning (repartition, partitionBy)
• Actions
trigger job execution
used to materialize computation results
• Extra: persistence
explicitly store RDDs in memory, on disk or off-heap (cache, persist)
checkpointing for truncating RDD lineage
Here's a code sample of a job which aggregates data from Cassandra in lambda style,
combining previously rolled-up data with the data from raw storage, and demonstrates some of the
transformations and actions available on RDDs:
//aggregate events after a specific date for the given campaign
val events =
  sc.cassandraTable("demo", "event")
    .map(_.toEvent)
    .filter { e =>
      e.campaignId == campaignId && e.time.isAfter(watermark)
    }
    .keyBy(_.eventType)
    .reduceByKey(_ + _)
    .cache()

//campaigns is assumed here to be a similarly keyed and rolled-up RDD, e.g.:
val campaigns =
  sc.cassandraTable("demo", "campaign")
    .map(_.toCampaign)
    .keyBy(_.eventType)
    .reduceByKey(_ + _)
    .cache()

//materialize the per-type totals on the driver
val campaignTotals =
  campaigns.map { case (t, e) => s"$t -> ${e.value}" }
    .collect()
Execution workflow recap
Here's a quick recap on the execution workflow before digging deeper into details: user code
containing RDD transformations forms a Directed Acyclic Graph which is then split into stages of tasks
by the DAGScheduler. Stages combine tasks which don't require shuffling/repartitioning of the data.
Tasks run on workers and the results are then returned to the client.
DAG
Here's a DAG for the code sample above. So basically any data processing workflow could be
defined as reading the data source, applying a set of transformations and materializing the result in
different ways. Transformations create dependencies between RDDs and here we can see different
types of them.
The dependencies are usually classified as "narrow" and "wide":
• Narrow (pipelineable)
each partition of the parent RDD is used by at most one partition of the child RDD
allow for pipelined execution on one cluster node
failure recovery is more efficient as only lost parent partitions need to be recomputed
• Wide (shuffle)
multiple child partitions may depend on one parent partition
require data from all parent partitions to be available and to be shuffled across the nodes
if some partition is lost from all the ancestors a complete recomputation is needed
Splitting DAG into Stages
Spark stages are created by breaking the RDD graph at shuffle boundaries
• RDD operations with "narrow" dependencies, like map() and filter(), are pipelined together into
one set of tasks in each stage; operations with shuffle dependencies require multiple stages
(one to write a set of map output files, and another to read those files after a barrier).
• In the end, every stage will have only shuffle dependencies on other stages, and may compute
multiple operations inside it. The actual pipelining of these operations happens in the
RDD.compute() functions of various RDDs
There are two types of tasks in Spark: ShuffleMapTask which partitions its input for shuffle and
ResultTask which sends its output to the driver. The same applies to types of stages:
ShuffleMapStage and ResultStage correspondingly.
Shuffle
During the shuffle, ShuffleMapTask writes blocks to the local drive, and then the tasks in the next stage
fetch these blocks over the network.
• Shuffle Write
redistributes data among partitions and writes files to disk
each hash shuffle task creates one file per “reduce” task (total = MxR)
sort shuffle task creates one file with regions assigned to reducer
sort shuffle uses in-memory sorting with spillover to disk to get final result
• Shuffle Read
fetches the files and applies reduce() logic
if data ordering is needed then it is sorted on “reducer” side for any type of shuffle
In Spark, Sort Shuffle has been the default since 1.2, but Hash Shuffle is available too.
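The implementation can be chosen explicitly; a sketch (hash shuffle was removed in Spark 2.0, so the alternative value only applies to 1.x):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.shuffle.manager", "sort") // or "hash" on Spark 1.x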
Sort Shuffle
• Incoming records are accumulated and sorted in memory according to their target partition ids
• Sorted records are written to file or multiple files if spilled and then merged
• index file stores offsets of the data blocks in the data file
• Sorting without deserialization is possible under certain conditions (SPARK-7081)
Spark Components
At a 10,000-foot view there are three major components:
• Spark Driver
separate process to execute user applications
creates SparkContext to schedule jobs execution and negotiate with cluster manager
• Executors
run tasks scheduled by driver
store computation results in memory, on disk or off-heap
interact with storage systems
• Cluster Manager
Mesos
YARN
Spark Standalone
Spark Driver contains more components responsible for translation of user code into actual jobs
executed on cluster:
• SparkContext
◦ represents the connection to a Spark cluster, and can be used to create RDDs, accumulators
and broadcast variables on that cluster (a short sketch follows this list)
• DAGScheduler
◦ computes a DAG of stages for each job and submits them to TaskScheduler
◦ determines preferred locations for tasks (based on cache status or shuffle files locations)
and finds minimum schedule to run the jobs
• TaskScheduler
◦ responsible for sending tasks to the cluster, running them, retrying if there are failures, and
mitigating stragglers
• SchedulerBackend
◦ backend interface for scheduling systems that allows plugging in different
implementations(Mesos, YARN, Standalone, local)
• BlockManager
◦ provides interfaces for putting and retrieving blocks both locally and remotely into various
stores (memory, disk, and off-heap)
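A short sketch of those SparkContext facilities (Spark 2.x API assumed; sc is an existing SparkContext):

val data   = sc.parallelize(1 to 1000, 4)           // an RDD with 4 partitions
val errors = sc.longAccumulator("errorCount")       // an accumulator owned by the driver
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))  // a read-only broadcast variable

data.foreach { x =>
  if (!lookup.value.contains("a")) errors.add(1)    // executors read the broadcast and update the accumulator
}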
Memory Management in Spark 1.6
Executors run as Java processes, so the available memory is equal to the heap size. Internally
available memory is split into several regions with specific functions.
• Execution Memory
◦ storage for data needed during tasks execution
◦ shuffle-related data
• Storage Memory
◦ storage of cached RDDs and broadcast variables
◦ possible to borrow from execution memory (spill otherwise)
◦ safeguard value is 50% of Spark Memory when cached blocks are immune to eviction
• User Memory
◦ user data structures and internal metadata in Spark
◦ safeguarding against OOM
• Reserved memory
◦ memory needed for running executor itself and not strictly related to Spark
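These regions are governed by the unified memory manager's two main knobs; a sketch with commonly cited default values (exact defaults vary by Spark version, so treat the numbers as illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")        // execution + storage share of (heap - reserved memory)
  .set("spark.memory.storageFraction", "0.5") // portion of that within which cached blocks are immune to eviction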
Architecture
We talked about spark jobs in chapter 3. In this chapter, we will talk about the architecture and
how master, worker, driver and executors are coordinated to finish a job.
Feel free to skip code if you prefer diagrams.
Deployment diagram
We have seen the following diagram in overview chapter.
Next, we will talk about some details about it.
Job submission
The diagram below illustrates how driver program (on master node) produces job, and then
submits it to worker nodes.
Driver side behavior is equivalent to the code below:
finalRDD.action()
=> sc.runJob()
// send tasks
=> sparkDeploySchedulerBackend.reviveOffers()
=> driverActor ! ReviveOffers
=> sparkDeploySchedulerBackend.makeOffers()
=> sparkDeploySchedulerBackend.launchTasks()
=> foreach task
CoarseGrainedExecutorBackend(executorId) ! LaunchTask(serializedTask)
Explanation:
When the following code is evaluated, the program will launch a bunch of driver-side components,
e.g. the job's executors, threads, actors, etc.
val sc = new SparkContext(sparkConf)
This line defines the role of the driver.
Task distribution
After sparkDeploySchedulerBackend gets TaskSet, the Driver Actor sends serialized tasks to
CoarseGrainedExecutorBackend Actor on worker node.
Job reception
After receiving tasks, worker will do the following things:
coarseGrainedExecutorBackend ! LaunchTask(serializedTask)
=> executor.launchTask()
=> executor.threadPool.execute(new TaskRunner(taskId, serializedTask))
Executor packages each task into a TaskRunner, and picks a free thread to run the
task. A CoarseGrainedExecutorBackend process has exactly one executor.
Task execution
The diagram below shows the execution of a task received by worker node and how driver
processes task results.
After receiving a serialized task, the executor deserializes it into a normal task, and then runs the
task to get a directResult, which will be sent back to the driver. It is noteworthy that the data package sent
from an Actor cannot be too big:
• If the result is too big (e.g. the one of groupByKey), it will be persisted to "memory + hard disk"
and managed by blockManager. Driver will only get indirectResult containing the storage
location. When result is needed, driver will fetch it via HTTP.
• If the result is not too big (less than spark.akka.frameSize = 10MB), then it will be directly sent to
driver.
Some more details about blockManager:
When directResult > akka.frameSize, the memoryStore of the BlockManager creates a
LinkedHashMap to hold the data stored in memory, whose size should be less than
Runtime.getRuntime.maxMemory * spark.storage.memoryFraction (default 0.6). If the
LinkedHashMap has no space to hold the incoming data, the data is sent to the diskStore, which
persists data to hard disk if the data's storageLevel contains "disk".
In TaskRunner.run()
// deserialize the task, run it and then send the result to the driver
=> coarseGrainedExecutorBackend.statusUpdate()
=> task = ser.deserialize(serializedTask)
=> value = task.run(taskId)
=> directResult = new DirectTaskResult(ser.serialize(value))
=> if( directResult.size() > akkaFrameSize() )
indirectResult = blockManager.putBytes(taskId, directResult, MEMORY+DISK+SER)
else
return directResult
=> coarseGrainedExecutorBackend.statusUpdate(result)
=> driver ! StatusUpdate(executorId, taskId, result)
The results produced by ShuffleMapTask and ResultTask are different.
• ShuffleMapTask produces MapStatus containing 2 parts:
the BlockManagerId of the task's BlockManager: (executorId + host, port, nettyPort)
the size of each output FileSegment of a task
• ResultTask produces the execution result of the specified function on one partition, e.g. the
function of count() simply counts the number of records in a partition. Since
ShuffleMapTask needs FileSegments for writing to disk, OutputStream writers are needed.
These writers are produced and managed by the blockManager and shuffleBlockManager.
In task.run(taskId)
// if the task is ShuffleMapTask
=> shuffleMapTask.runTask(context)
=> shuffleWriterGroup = shuffleBlockManager.forMapTask(shuffleId, partitionId,
numOutputSplits)
=> shuffleWriterGroup.writers(bucketId).write(rdd.iterator(split, context))
=> return MapStatus(blockManager.blockManagerId, Array[compressedSize(fileSegment)])
Shuffle read
In the preceding paragraph, we talked about task execution and result processing; now we will talk
about how the reducer (the tasks that need shuffled data) gets its input data. The shuffle read part of the
last chapter has already covered how the reducer processes the input data.
How does the reducer know where to fetch the data?
Reducer needs to know on which node the FileSegments produced by ShuffleMapTask of parent
stage are. This kind of information is sent to driver’s mapOutputTrackerMaster when
ShuffleMapTask is finished. The information is also stored in mapStatuses:
HashMap[stageId, Array[MapStatus]]. Given a stageId, we can get Array[MapStatus], which
contains information about the FileSegments produced by the ShuffleMapTasks. Array(taskId) contains
the location (blockManagerId) and the size of each FileSegment.
When the reducer needs to fetch input data, it first invokes blockStoreShuffleFetcher to get the input
data's location (FileSegments). blockStoreShuffleFetcher calls the local MapOutputTrackerWorker to
do the work. MapOutputTrackerWorker uses mapOutputTrackerMasterActorRef to communicate
with mapOutputTrackerMasterActor in order to get the MapStatus. blockStoreShuffleFetcher
processes the MapStatus and finds out where the reducer should fetch the FileSegment information, and then
it stores this information in blocksByAddress. blockStoreShuffleFetcher tells
basicBlockFetcherIterator to fetch the FileSegment data.
rdd.iterator()
=> rdd(e.g., ShuffledRDD/CoGroupedRDD).compute()
=> SparkEnv.get.shuffleFetcher.fetch(shuffledId, split.index, context, ser)
=> blockStoreShuffleFetcher.fetch(shuffleId, reduceId, context, serializer)
=> statuses = MapOutputTrackerWorker.getServerStatuses(shuffleId, reduceId)
=> connectionManager.receiveMessage(bufferMessage)
=> handleMessage(connectionManagerId, message, connection)
Discussion
In terms of architecture design, functionalities and modules are pretty independent.
BlockManager is well designed, but it seems to manage too many things (data block, memory, disk
and network communication)
This chapter discussed how the modules of spark system are coordinated to finish a job
(production, submission, execution, results collection, results computation and shuffle). A lot of
code is pasted, many diagrams are drawn. More details can be found in source code, if you want.
Apache Spark-Apache Hive connection configuration
You need to understand the workflow and service changes involved in accessing ACID table data
from Spark. You can configure Spark properties in Ambari for using the Hive Warehouse
Connector.
Prerequisites
You need to use the following software to connect Spark and Hive using the
HiveWarehouseConnector library:
• HDP 3.15
• Spark2
• Hive with HiveServer Interactive (HSI)
The Hive Warehouse Connector (HWC) and low-latency analytical processing (LLAP) are
required for certain tasks, as shown in the following table:
Table 1. Spark Compatibility
• Read Hive managed tables from Spark: HWC required: Yes; LLAP required: Yes; Ranger ACLs enforced.*
• Write Hive managed tables from Spark: HWC required: Yes; LLAP required: No; Ranger ACLs enforced.* Supports ORC only.
• Read Hive external tables from Spark: HWC required: No; LLAP required: Only if HWC is used; Ranger ACLs not enforced.
• Write Hive external tables from Spark: HWC required: No; LLAP required: No; Ranger ACLs enforced.
* Ranger column level security or column masking is supported for each access pattern when
you use HWC.
You need low-latency analytical processing (LLAP) in HSI to read ACID, or other Hive-managed
tables, from Spark. You do not need LLAP to write to ACID, or other managed tables, from
Spark. The HWC library internally uses the Hive Streaming API and LOAD DATA Hive
commands to write the data. You do not need LLAP to access external tables from Spark with
caveats shown in the table above.
Required properties
You must add several Spark properties through spark-2-defaults in Ambari to use the Hive
Warehouse Connector for accessing data in Hive. Alternatively, configuration can be provided for
each job using --conf.
spark.sql.hive.hiveserver2.jdbc.url
spark.datasource.hive.warehouse.metastoreUri
spark.datasource.hive.warehouse.load.staging.dir
The HDFS temp directory for batch writes to Hive, /tmp for example
spark.hadoop.hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum
spark.sql.hive.hiveserver2.jdbc.url
In Ambari, copy the value from Services > Hive > Summary > HIVESERVER2 INTERACTIVE
JDBC URL.
spark.datasource.hive.warehouse.metastoreUri
Copy the value from hive.metastore.uris. In Hive, at the hive> prompt, enter set
hive.metastore.uris and copy the output. For example, thrift://mycluster-1.com:9083.
spark.hadoop.hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum
spark.datasource.hive.warehouse.write.path.strictColumnNamesMapping Validates the
mapping of columns against those in Hive to alert the user to input errors. Default = true.
spark.sql.hive.conf.list Propagates one or more configuration properties from the HWC to
Hive. Set properties on the command line using the --conf option. For example:
--conf
spark.sql.hive.conf.list="hive.vectorized.execution.filesink.arrow.native.enabled=true;hive.vectorized.execution.enabled=true"
Do not attempt to set spark.sql.hive.conf.list programmatically.
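For illustration, a spark-2-defaults fragment with placeholder values (host names, ZooKeeper quorum and the LLAP application name differ per cluster; copy the real values from Ambari as described above):

spark.sql.hive.hiveserver2.jdbc.url jdbc:hive2://host1.example.com:2181,host2.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
spark.datasource.hive.warehouse.metastoreUri thrift://host1.example.com:9083
spark.datasource.hive.warehouse.load.staging.dir /tmp
spark.hadoop.hive.llap.daemon.service.hosts @llap0
spark.hadoop.hive.zookeeper.quorum host1.example.com:2181,host2.example.com:2181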
Spark on a Kerberized YARN cluster
In Spark client mode on a kerberized YARN cluster, set the following
property: spark.sql.hive.hiveserver2.jdbc.url.principal.
This property must be equal to hive.server2.authentication.kerberos.principal. In Ambari, copy the
value for this property
from hive.server2.authentication.kerberos.principal in Services > Hive > Configs > Advanced
> Advanced hive-site.
Integrate Spark-SQL with Hive
Integrate Spark-SQL with Hive when you want to run Spark-SQL queries on Hive tables.
Spark 1.5.2 and Spark 1.6.1 are built using Hive 1.2 artifacts; however, you can configure Spark-
SQL to work with Hive 0.13 and Hive 1.0. Spark 1.3.1 and Spark 1.4.1 are built using Hive
0.13; other versions of Hive are not supported with Spark-SQL. For additional details on Spark-
SQL and Hive support, see Spark Feature Support.
Note: If you installed Spark with the MapR Installer, the following steps are not required.
1. Copy hive-site.xml file into the SPARK_HOME/conf directory so that Spark and Spark-SQL
recognize the Hive Metastore configuration.
2. Configure the Hive version in the /opt/mapr/spark/spark-<version>/mapr-
util/compatibility.version file:
hive_versions=<version>
3. If you are running Spark 1.5.2 or Spark 1.6.1, add the following additional properties to the
/opt/mapr/spark/spark-<version>/conf/spark-defaults.conf file:
1.2/lib/accumulo-core-1.6.0.jar:/opt/mapr/hive/hive-1.2/lib/hive-contrib-1.2.0-mapr-1508.jar:/opt/mapr/hive/hive-1.2/lib/*
For more information, see the Apache Spark
documentation.
4. To verify the integration, run the following command as the mapr user or as a user that mapr
impersonates:
MASTER=<master-url> <spark-home>/bin/run-example sql.hive.HiveFromSpark
The master URL for the cluster is either spark://<host>:7077, yarn-client, or yarn-cluster.
Note: The default port for both HiveServer2 and the Spark Thrift server is 10000. Therefore,
before you start the Spark Thrift server on a node where HiveServer2 is running, verify that there is
no port conflict.
Background
There are several open source Spark HBase connectors available either as Spark packages, as
independent projects or in HBase trunk.
Spark has moved to the Dataset/DataFrame APIs, which provides built-in query plan optimization.
Now, end users prefer to use DataFrames/Datasets based interface.
The HBase connector in the HBase trunk has rich support at the RDD level, e.g. BulkPut, etc., but
its DataFrame support is not as rich. The HBase trunk connector relies on the standard
HadoopRDD with HBase's built-in TableInputFormat, which has some performance limitations. In
addition, BulkGet performed in the driver may be a single point of failure.
There are some other alternative implementations. Take Spark-SQL-on-HBase as an example. It
applies very advanced custom optimization techniques by embedding its own query
optimization plan inside the standard Spark Catalyst engine, ships the RDD to HBase and
performs complicated tasks, such as partial aggregation, inside the HBase coprocessor. This
approach is able to achieve high performance, but it is difficult to maintain due to its
complexity and the rapid evolution of Spark. Also, allowing arbitrary code to run inside a
coprocessor may pose security risks.
The Spark-on-HBase Connector (SHC) has been developed to overcome these potential
bottlenecks and weaknesses. It implements the standard Spark Datasource API, and
leverages the Spark Catalyst engine for query optimization. In parallel, the RDD is
constructed from scratch instead of using TableInputFormat in order to achieve high
performance. With this customized RDD, all critical techniques can be applied and fully
implemented, such as partition pruning, column pruning, predicate pushdown and data
locality. The design makes the maintenance very easy, while achieving a good tradeoff
between performance and simplicity.
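As a hedged sketch of what that Datasource API usage looks like (the catalog layout and the datasource class name follow the SHC README and may vary across SHC versions; the table and columns are hypothetical, and spark is an existing SparkSession):

// Describe the HBase table mapping; "catalog" is the option key SHC expects.
val catalog =
  """{
    |  "table":  {"namespace": "default", "name": "contacts"},
    |  "rowkey": "key",
    |  "columns": {
    |    "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
    |    "name": {"cf": "info",   "col": "name", "type": "string"}
    |  }
    |}""".stripMargin

val df = spark.read
  .options(Map("catalog" -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.filter(df("id") === "42").show()   // predicate pushdown and column pruning apply here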
Architecture
We assume Spark and HBase are deployed in the same cluster, and Spark executors are co-located
with region servers, as illustrated in the figure below.
Note: The NameNode and ResourceManager can reside in the same machine or different machine
depending upon the configuration of the cluster.
Now, when the client submits a job, it goes to the master machine. It will talk to the NameNode
and the NameNode will do various checks like –
• It will check, whether the client has appropriate permission to read the Input path.
• Whether the client has appropriate permission to write onto the Output path.
• Whether the Input and Output path is valid or not.
• And many more.
Once it verifies that everything is in place, it will assign a Job ID to the Job and then allocate the
Job ID into a Job Queue.
So, in Job Queue there can be multiple jobs waiting to get processed.
As soon as a job is assigned to the Job Queue, its corresponding information about the Job, like
the Input/Output Path, the location of the Jar, etc., is written into the temp location of HDFS.
Let’s talk a little about the temp location of HDFS. This is the location in each data node where
the intermediate data goes in. The location of this path is set in the file named “core-site.xml”
under location “hadoop-dir/etc/hadoop”.
That is, every detail of each job will be stored in the temp location. After this, the Job is finally
“Accepted”.
In the next step, whenever the turn of a Job comes for execution from the Job Queue, the Resource
Manager will randomly select a DataNode (worker node) and start a Java process called
Application Master in the DataNode.
Note: For each Job, there will be an Application Master.
Now, on behalf of the Resource Manager, the Application Master will go to the temp location and
the details of the Job will be checked/collected from the temp location. Subsequently, the
Application Master communicates with the NameNode, which further takes the call to figure out
where the files (blocks) are located in the cluster and how many resources (number of CPUs, number
of nodes, memory required) the job will need. So, the NameNode will do its computation and
figure out those things.
Once all the evaluations are done, the Application Master sends all the resource request
information to the Resource Manager.
Now, the Resource Manager will look into the request and will send the resource allocation request
of the job to the DataNodes.
Now, let's assume a scenario: the resource request which the Resource Manager has received
from the Application Master is of just 2 Cores and 2 GB memory, and the data nodes in the
cluster have a configuration of 4 Cores and 16 GB RAM. In this case, the Resource Manager will
send the resource allocation request to one of the DataNodes, requesting it to allocate 2 Cores and
2 GB memory (i.e. a portion of RAM and Cores) to the Job. So, the Resource Manager sends the
request of 2 Cores and 2 GB memory packed together as a Container. These containers are
known as Executors.
The resource allocation requests are handled by the NodeManager of each individual worker
node, which is responsible for the resource allocation of the job.
Finally, the code/Task will start executing in the Executor.
Execution Mode:
In Spark, there are two modes to submit a job: i) Client mode (ii) Cluster mode.
Client mode: In the client mode, we have Spark installed in our local client machine, so the
Driver program (which is the entry point to a Spark program) resides in the client machine i.e. we
will have the SparkSession or SparkContext in the client machine.
Whenever we place any request like “spark-submit” to submit any job, the request goes to
Resource Manager then the Resource Manager opens up the Application Master in any of the
Worker nodes.
Note: I am skipping the detailed intermediate steps explained above here.
The Application Master launches the Executors (i.e. Containers in terms of Hadoop) and the jobs
will be executed.
After the Executors are launched they start communicating directly with the Driver program i.e.
SparkSession or SparkContext and the output will be directly returned to the client.
The drawback of Spark Client mode w.r.t. YARN is that the client machine needs to be available at
all times whenever any job is running. You cannot submit your job, then turn off your laptop
and leave the office before your job is finished.
In this case, it won’t be able to give the output as the connection between Driver and Executors will
be broken.
Cluster Mode: The only difference in this mode is that Spark is installed in the cluster, not in the
local machine. Whenever we place any request like “spark-submit” to submit any job, the request
goes to Resource Manager then the Resource Manager opens up the Application Master in any of
the Worker nodes.
Now, the Application Master will launch the Driver Program (which will be having
the SparkSession/SparkContext) in the Worker node.
That means, in cluster mode the Spark driver runs inside an application master process which is
managed by YARN on the cluster, and the client can go away after initiating the application.
Whereas in client mode, the driver runs in the client machine, and the application master is only
used for requesting resources from YARN.
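For reference, the mode is chosen at submission time; a sketch (the application class and jar names are placeholders):

spark-submit --master yarn --deploy-mode client  --class com.example.MyApp myapp.jar
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar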
In the next blog, I have explained how the Spark Driver and Executors work.
Integrate Spark with YARN
To communicate with the YARN Resource Manager, Spark needs to be aware of your Hadoop
configuration. This is done via the HADOOP_CONF_DIR environment variable. The
SPARK_HOME variable is not mandatory but is useful when submitting Spark jobs from the
command line.
• Edit the “bashrc” file and add the following lines:
export HADOOP_CONF_DIR=/<path of hadoop dir>/etc/hadoop
export YARN_CONF_DIR=/<path of hadoop dir>/etc/hadoop
export SPARK_HOME=/<path of spark dir>
export LD_LIBRARY_PATH=/<path of hadoop dir>/lib/native:$LD_LIBRARY_PATH
• Restart your session by logging out and logging in again.
• Rename the Spark default template config file:
mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
• Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:
spark.master yarn
Copy all the Spark jars from $SPARK_HOME/jars to HDFS so that they can be shared among all the
worker nodes:
hdfs dfs -put *.jar /user/spark/share/lib
Add/modify the following parameters in spark-defaults.conf:
spark.master yarn
spark.yarn.jars hdfs://hmaster:9000/user/spark/share/lib/*.jar
spark.executor.memory 1g
spark.driver.memory 512m
spark.yarn.am.memory 512m
Example Spark Application Web Application
Consider a job consisting of a set of transformations to join data from an accounts dataset with a
weblogs dataset in order to determine the total number of web hits for every account, and then an
action to write the result to HDFS. In this example, the write is performed twice, resulting in two
jobs. To view the application UI, in the History Server click the link in the App ID column:
The following screenshot shows the timeline of the events in the application including the jobs that
were run and the allocation and deallocation of executors. Each job shows the last action,
saveAsTextFile, run for the job. The timeline shows that the application acquires executors over
the course of running the first job. After the second job finishes, the executors become idle and are
returned to the cluster.
You can manipulate the timeline as follows:
• Pan - Press and hold the left mouse button and swipe left and right.
• Zoom - Select the Enable zooming checkbox and scroll the mouse up and down.
To view the details for Job 0, click the link in the Description column. The following screenshot
shows details of each stage in Job 0 and the DAG visualization. Zooming in shows finer detail for
the segment from 28 to 42 seconds:
Clicking a stage shows further details and metrics:
The web page for Job 1 shows how preceding stages are skipped because Spark retains the results
from those stages:
Example Spark SQL Web Application
In addition to the screens described above, the web application UI of an application that uses the
Spark SQL API also has an SQL tab. Consider an application that loads the contents of two tables
into a pair of DataFrames, joins the tables, and then shows the result. After you click the
application ID, the SQL tab displays the final action in the query:
If you click the show link you see the DAG of the job. Clicking the Details link on this page
displays the logical query plan:
Example Spark Streaming Web Application
Note: The following example demonstrates the Spark driver web UI. Streaming information is not
captured in the Spark History Server.
The Spark driver web application UI also supports displaying the behavior of streaming
applications in the Streaming tab. If you run the example described in Spark Streaming Example,
and provide three bursts of data, the top of the tab displays a series of visualizations of the
statistics summarizing the overall behavior of the streaming application:
The application has one receiver that processed 3 bursts of event batches, which can be observed in
the events, processing time, and delay graphs. Further down the page you can view details of
individual batches:
To view the details of a specific batch, click a link in the Batch Time column. Clicking the
2016/06/16 14:23:20 link with 8 events in the batch, provides the following details:
Apache Spark has proven an efficient and accessible platform for distributed computation. In some
areas, it almost approaches the Holy Grail of making parallelization “automagic” — something we
human programmers appreciate precisely because we are rarely good at it.
Nonetheless, although it is easy to get something to run on Spark, it is not always easy to tell
whether it's running optimally, nor — if we get a sense that something isn't right — how to fix it.
For example, a classic Spark puzzler is the batch job that runs the same code on the same sort of
cluster and similar data night after night … but every so often it seems to take much longer to
finish. What could be going on?
In this article, I am going to show how to identify some common Spark issues the easy way: by
looking at a particularly informative graphical report that is built into the Spark Web UI. The Web
UI Stage Detail view[1] is my go-to page for tuning and troubleshooting, and is also one of the most
information-dense spots in the whole UI.
Let's briefly describe a “stage” and how to find the relevant UI screen. Then we'll look at the data in
each part of that page.
What Exactly is a Stage?
In Apache Spark execution terminology, operations that physically move data in order to produce
some result are called “jobs.” Some jobs are triggered by user API calls (so-called “Action” APIs,
such as “.count” to count records). Other jobs live behind the scenes and are implicitly triggered —
e.g., data schema inference requires Spark to physically inspect some data, hence it requires a job
of its own.
Jobs are decomposed into “stages” by separating where a shuffle is required. The shuffle is
essential to the “reduce” part of the parallel computation — it is the part that is not fully parallel
but where we must, in general, move data in order to complete the current phase of computation.
For example, to sort a distributed set of numbers, it's not enough to locally sort partitions of the
data … sooner or later we need to impose a global ordering and that requires comparisons of data
records from all over the cluster, necessitating a shuffle.
So, Spark's stages represent segments of work that run from data input (or data read from a
previous shuffle) through a set of operations called tasks — one task per data partition — all the
way to a data output or a write into a subsequent shuffle.
Locating the Stage Detail View UI
Start by opening a browser to the Spark Web UI[2].
Unless you already know the precise details of jobs and stages running on your Spark cluster, it's
probably useful to navigate via the “Jobs” tab at the top of the UI, which provides a clear drill-
down by job (rather than the “Stages” tab, which lists all stages but doesn't clearly distinguish
them by job).
From the “Jobs” tab, you can locate the job you're interested in, and click its “Description” link to
get to the Job Detail view, which lists all of the stages in your job, along with some useful stats.
We get to our final destination by clicking on a stage's “Description” link. This link leads to the
Stage Detail view, which is the report we're analyzing today.
We'll look at major parts of this report, proceeding from top to bottom on the page.
Event Timeline
One of my favorite parts of the Stage Detail view is initially hidden behind the “Event Timeline”
dropdown. Click that dropdown link to get a large, colored timeline graph showing each of the
tasks in the stage, plotted by start time (horizontally) and grouped by executor (vertically).
Within each task's colored bar — representing time — the full duration is further broken down via
colored segments to show how the time was spent.
There is exactly one colored bar per task, so we can see how many tasks there are, and get a feel for
whether there are too many or too few tasks. Since tasks are one-to-one with data partitions, this
really helps us answer the question: How many partitions should I have?
More precisely: What would this graph look like if there are too few partitions (and tasks)?
In the most extreme case, we might see fewer tasks than an executor has cores — perhaps we have
40 cores across our cluster but see only 32 tasks. That is usually not what we want, and it's easy to
identify and change.
But, more subtly, what if we have a larger number of tasks than we have cores but still too few for
optimal performance? How would we recognize that situation?
Look at the right hand edge of the graph, and locate the last one or two tasks to complete. Those
tasks are essentially limiting the progress of the job, because they have to complete before Spark
can move on. Look at the timescale, and see whether the span between those tasks' end time and
the previous few tasks' end time is significant (for example, hundreds of milliseconds or perhaps
much more). What we're looking at is the period at the end of the stage when the cluster cores are
underutilized. If this is substantial, it's an indicator of too few partitions/tasks, or of skew in data
size, compute time, or both.
Notice that one “straggler” task finishes more than 1 full second after the next longest running
tasks, and almost 2 full seconds after most tasks are complete. This image — with its high fraction
of green — displays some artifacts of a small dataset on a local mode Spark, but it also suggests
some skew in one of the partitions or tasks. The tiny tasks suggest near-empty partitions or near-
no-op tasks.
On the opposite end, we might have too many partitions, leading to too many tasks. How would
this situation appear? Lots of very short tasks, dominated by time spent in non-compute activities.
Such tasks come from a stage with too many partitions; notice how many of the tasks show only
a little bit of green "Computing Time", certainly less than 70%.
The green color in the task indicates “Executor Computing Time” and we would ideally like this to
make up at least 70% of the time spent on the task. If you see many tasks filled up with other
colors, representing non-compute activities such as “Task Deserialization,” and only a small slice
of green “Computing Time,” that is an indicator that you may have too many tasks/partitions or,
equivalently, that the partitions are too small (data) or require too little work (compute) to be
optimally efficient.
Note two gotchas about these timeline graphs in general:
• Although the task bars are arranged to show start time and length, the “swim lanes” only separate
executors. One “row” within an executor swim lane does not represent a specific core or
thread, and tasks scheduled on the same thread do not show that fact in any way.
• In general, the vertical layout of the bars within an executor “swim lane” are purely an artifact of
trying to show many tasks within a limited space on the page. The vertical positioning,
overlap, etc., of multiple tasks has no meaning.
Summary Metrics for Completed Tasks
Next on the page we find the Summary Metrics, showing various metrics at the 0 (Min), 25th, 50th
(Median), 75th, and 100th (Max) percentiles (among the tasks in the stage). More metrics can be
revealed by selecting checkboxes hidden under “Show Additional Metrics” earlier on the page.
How do we make sense of this part of the report?
In a perfect world, where our computation was completely symmetric across tasks, we would see
all of the statistics clustered tightly around the 50th percentile value. There would be minimal
variance; the distance between 0 and 100% values would be small. In the real world, though,
things don't always work out that way, but we can see how far off they are — and get an idea of
why.
Suppose the 25%-75% spread isn't too wide but some Max metric figures are substantially higher
than the corresponding 75% figures. That suggests a number of "straggler" tasks, taking too much
time to compute (or triggering excess GC), and/or operating over partitions with larger skewed
amounts of data.
On the other end, suppose that the distribution is reasonable, except that we have a bunch of Min
values at or close to zero. That suggests we have empty (or near empty) partitions and/or tasks
that aren't computing anything (our compute logic might not be the same for all sorts of records,
so we could have large partitions whose tasks do no work).
Summary Metrics corresponding to the task timeline view (above) which had suggested skew. Note
that the Max task took 10x the time and read about 10x the data of the 75th-percentile task. There
is skew at the low end as well: Min time and data are zero, and 25th percentile data is around 1% of
the median.
Aggregated Metrics by Executor
The next segment of the report is a set of summarized statistics, this time collected by executor.
How is this helpful? In theory, given the vicissitudes of scheduling on threads and across a
network, we would expect that when running a job several times — or running a long job with
many stages and tasks — we would see similar statistics across all our executors. There's that
perfect world again. What could go wrong?
If we see one or more executors consistently showing worse metrics than most, it could indicate
several possible situations:
• The JVM (executor) is sick — perhaps we should kill it and start a new one.
• The node hosting the executor is sick — the UI shows which node each executor lives on, so if
multiple problematic executors are always on the same node we might suspect the node.
• Data locality trouble — Since Spark attempts to schedule tasks where their partition data is
located, over time it should be successful at a consistent rate. But suppose three of your Spark
executors happen to be collocated with HDFS replicas of tasks' data, while one is allocated
(by, say, YARN) far away from your job's data. That executor is going to consistently take
longer to read the data over the network.
• Good locality but difficult data — Conversely, an executor may have great locality to the part of
the data which your Spark job is using most heavily. So those tasks take longer and/or
process more data than other tasks.
All of those possibilities can be mitigated, and the report gives us hints about what to inspect so
that we can do so.
Tasks List
The last section of the Stage Detail view is a grid containing a row for every single task in the stage.
The data shown for each task is similar to the data shown in the graphical timeline, but includes
the addition of a few fields such as data quantity read/written and — something not shown
anywhere else — the specific data locality level at which each task ran[3].
Assuming you know something about where your data is located at this stage of the computation,
the locality info will tell you whether tasks are generally being scheduled in a way that minimizes
transporting data over the network.
Let's return by way of example to our mysterious scenario from the start of the article — a job that
is in all respects similar each night, but every so often takes much longer to finish. Perhaps your
Spark cluster coincides with your HDFS data “most of the time” but occasionally ends up getting
launched with terrible proximity to the data block replicas it needs, and so it runs successfully but
much slower on those occasions. Comparing the locality info in this part of the report to the
observed behavior from other runs will give you an indication of what has happened.
Since a stage could easily have thousands of tasks, this is probably a good time to mention that the
Spark UI data are also accessible through a REST API[4]. So, if you want to monitor and plot
locality in your stages, you don't need to read or scrape the thousands of task table rows.
Finally, a couple rules of thumb around partition sizing and task duration. Two of the most
common questions in the Spark classes and workshops I teach for ProTech are: “How many
partitions should I have? And how long should tasks execute?”
The proper answer is that it depends on so many variables — workload, cluster configuration, data
sources — that it always requires hands-on tuning. However, new users are understandably
desperate to have some a priori numbers to start with, so — purely as a bootstrapping mechanism
— I suggest starting with code and configuration that causes each partition to contain 100-200MB
of data and each task to take 50-200ms.
Those numbers are not a final goal, but once you have your app running, you can use the
knowledge from this article to start to tune your partition counts and improve the speed and
consistency of your Spark application.
If this approach to tuning Spark sounds helpful, and you'd like to dig even deeper into optimizing
and troubleshooting Spark for real-world production scenarios in the enterprise, check out
ProTech's 3-Day Spark 2.0 Programming course.
Diving into Apache Spark Streaming’s Execution Model
by Tathagata Das, Matei Zaharia and Patrick Wendell Posted in ENGINEERING BLOG July 30,
2015
With so many distributed stream processing engines available, people often ask us about the
unique benefits of Apache Spark Streaming. From early on, Apache Spark has provided a unified
engine that natively supports both batch and streaming workloads. This is different from other
systems that either have a processing engine designed only for streaming, or have similar batch
and streaming APIs but compile internally to different engines. Spark’s single execution engine
and unified programming model for batch and streaming lead to some unique benefits over other
traditional streaming systems. In particular, four major aspects are:
• Fast recovery from failures and stragglers
• Better load balancing and resource usage
• Combining of streaming data with static datasets and interactive queries
• Native integration with advanced processing libraries (SQL, machine learning, graph processing)
In this post, we outline Spark Streaming’s architecture and explain how it provides the above
benefits. We also discuss some of the interesting ongoing work in the project that leverages the
execution model.
Stream Processing Architectures – The Old and the New
At a high level, modern distributed stream processing pipelines execute as follows:
• Receive streaming data from data sources (e.g. live logs, system telemetry data, IoT device data,
etc.) into some data ingestion system like Apache Kafka, Amazon Kinesis, etc.
• Process the data in parallel on a cluster. This is what stream processing engines are designed to
do, as we will discuss in detail next.
• Output the results out to downstream systems like HBase, Cassandra, Kafka, etc.
To process the data, most traditional stream processing systems are designed with a continuous
operator model, which works as follows:
• There is a set of worker nodes, each of which runs one or more continuous operators.
• Each continuous operator processes the streaming data one record at a time and forwards the
records to other operators in the pipeline.
• There are “source” operators for receiving data from ingestion systems, and “sink” operators that
output to downstream systems.
Figure 1: Architecture of traditional stream processing systems
Continuous operators are a simple and natural model. However, with today’s trend towards larger
scale and more complex real-time analytics, this traditional architecture has also met some
challenges. We designed Spark Streaming to satisfy the following requirements:
• Fast failure and straggler recovery – With greater scale, there is a higher likelihood of a cluster
node failing or unpredictably slowing down (i.e. stragglers). The system must be able to
automatically recover from failures and stragglers to provide results in real time.
Unfortunately, the static allocation of continuous operators to worker nodes makes it
challenging for traditional systems to recover quickly from faults and stragglers.
• Load balancing – Uneven allocation of the processing load between the workers can cause
bottlenecks in a continuous operator system. This is more likely to occur in large clusters and
dynamically varying workloads. The system needs to be able to dynamically adapt the
resource allocation based on the workload.
• Unification of streaming, batch and interactive workloads – In many use cases, it is also
attractive to query the streaming data interactively (after all, the streaming system has it all in
memory), or to combine it with static datasets (e.g. pre-computed models). This is hard in
continuous operator systems as they are not designed to dynamically introduce new
operators for ad-hoc queries. This requires a single engine that can combine batch, streaming
and interactive queries.
• Advanced analytics like machine learning and SQL queries – More complex workloads require
continuously learning and updating data models, or even querying the “latest” view of
streaming data with SQL queries. Again, having a common abstraction across these analytic
tasks makes the developer’s job much easier.
To address these requirements, Spark Streaming uses a new architecture called discretized streams
that directly leverages the rich libraries and fault tolerance of the Spark engine.
Architecture of Spark Streaming: Discretized Streams
Instead of processing the streaming data one record at a time, Spark Streaming discretizes the
streaming data into tiny, sub-second micro-batches. In other words, Spark Streaming’s Receivers
accept data in parallel and buffer it in the memory of Spark's worker nodes. Then the latency-
optimized Spark engine runs short tasks (tens of milliseconds) to process the batches and output
the results to other systems. Note that unlike the traditional continuous operator model, where the
computation is statically allocated to a node, Spark tasks are assigned dynamically to the workers
based on the locality of the data and available resources. This enables both better load balancing
and faster fault recovery, as we will illustrate next.
In addition, each batch of data is a Resilient Distributed Dataset (RDD), which is the basic
abstraction of a fault-tolerant dataset in Spark. This allows the streaming data to be processed
using any Spark code or library.
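A minimal DStream word count illustrates the model (the socket source on localhost:9999 is a placeholder; each 1-second micro-batch is an RDD processed with ordinary Spark operations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc  = new StreamingContext(conf, Seconds(1))           // 1-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)         // placeholder source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                               // output operation

ssc.start()              // start receiving and processing
ssc.awaitTermination()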
Figure 2: Spark Streaming Architecture
Benefits of Discretized Stream Processing
Let’s see how this architecture allows Spark Streaming to achieve the goals we set earlier.
Dynamic load balancing
Dividing the data into small micro-batches allows for fine-grained allocation of computations to
resources. For example, consider a simple workload where the input data stream needs to be
partitioned by a key and processed. In the traditional record-at-a-time approach taken by most
other systems, if one of the partitions is more computationally intensive than the others, the node
statically assigned to process that partition will become a bottleneck and slow down the pipeline.
In Spark Streaming, the job’s tasks will be naturally load balanced across the workers — some
workers will process a few longer tasks, others will process more of the shorter tasks.
Figure 3: Dynamic load balancing
Fast failure and straggler recovery
In case of node failures, traditional systems have to restart the failed continuous operator on
another node and replay some part of the data stream to recompute the lost information. Note that
only one node is handling the recomputation, and the pipeline cannot proceed until the new node
has caught up after the replay. In Spark, the computation is already discretized into small,
deterministic tasks that can run anywhere without affecting correctness. So failed tasks can be
relaunched in parallel on all the other nodes in the cluster, thus evenly distributing all the
recomputations across many nodes, and recovering from the failure faster than the traditional
approach.
Then add a breakpoint where d() throws the exception in your debugger. I’m using IntelliJ’s
debugger for this image.
Here you can see that the string we added to d() is part of the stack frame because it’s a local
variable. Debuggers operate inside the Stack and give you a detailed picture of each frame.
Forcing a Thread Dump
Thread dumps are great post-mortem tools, but they can be useful for runtime issues too. If your
application stops responding or is consuming more CPU or memory than you expect, you can
retrieve information about the running app with jstack.
Modify main() so the application will run until killed:
public static void main(String[] args) throws Exception {
    try {
        while (true) {
            Thread.sleep(1000);
        }
    } catch (InterruptedException ie) {
        ie.printStackTrace();
    }
}
Run the app, determine its pid, and then run jstack. On Windows, you’ll need to press ctrl-break in
the DOS window you’re running your code in.
$ jstack <pid>
Jstack will generate a lot of output.
2019-05-13 10:06:17
Full thread dump OpenJDK 64-Bit Server VM (12+33 mixed mode, sharing):
Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   NewRatio         = 2
   SurvivorRatio    = 8
   MaxMetaspaceSize = 17592186044415 MB
   G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
PS Young Generation
Eden Space:
   26.038428412543404% used
From Space:
   capacity = 5767168 (5.5MB)
   73.0313387784091% used
To Space:
   used = 0 (0.0MB)
   0.0% used
PS Old Generation
   57.9674630794885% used
The two general flags that are used while collecting heap dumps are "-dump" and "-histo".
While the former gives the heap dump in the form of a binary file with the collection of objects at a
particular time, the latter provides the details of live objects in a text format.
#<jmap-used-by-process>/jmap -dump:file=<location-to-redirect-the-
output>/heapdump.hprof,format=b <PID>
If histo label needs to be used,
#<jmap-used-by-process>/jmap -histo <pid> > jmap.out
NOTE:
1. jmap and jstack are CPU-intensive processes, so please use them with caution.
2. Please try not to use -F as much as possible, as critical data can be missed with this option. If the -F
option does need to be used with either command, see the example below.
Example:
#/usr/jdk64/jdk1.8.0_112/bin/jmap -dump:file=/tmp/jmap21887.hprof,format=b -F 21887
#/usr/jdk64/jdk1.8.0_112/bin/jmap -histo -F 21887 > /tmp/jmaphistoF.out
Heap dumps contain a snapshot of the application's memory, including the values of variables at
the time the dump was created, so they are useful for diagnosing problems that occur at run time.
Services
Heap dumps are enabled by passing options (sometimes known as opts, or parameters) to the JVM
when a service is started. For most Apache Hadoop services, you can modify the shell script used
to start the service to pass these options.
In each script, there's an export for *_OPTS, which contains the options passed to the JVM. For
example, in the hadoop-env.sh script, the line that begins with export
HADOOP_NAMENODE_OPTS= contains the options for the NameNode service.
Map and reduce processes are slightly different, as these operations are child processes of the
MapReduce service. Each map or reduce process runs in a child container, and there are two
entries that contain the JVM options. Both are contained in mapred-site.xml:
• mapreduce.admin.map.child.java.opts
• mapreduce.admin.reduce.child.java.opts
Note
We recommend using Apache Ambari to modify both the scripts and mapred-site.xml settings,
as Ambari handles replicating changes across nodes in the cluster. See the Using Apache Ambari
section for specific steps.
Enable heap dumps
-XX:+HeapDumpOnOutOfMemoryError
The + indicates that this option is enabled. The default is disabled.
Warning
Heap dumps are not enabled for Hadoop services on HDInsight by default, as the dump files can
be large. If you do enable them for troubleshooting, remember to disable them once you have
reproduced the problem and gathered the dump files.
Dump location
The default location for the dump file is the current working directory. You can control where the
file is stored using the following option:
-XX:HeapDumpPath=/path
For example, using -XX:HeapDumpPath=/tmp causes the dumps to be stored in the /tmp
directory.
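To check that these options take effect, a tiny throwaway program that deliberately exhausts the heap can be used; a minimal sketch (OomDemo is a hypothetical class, not part of any Hadoop service):

import java.util.ArrayList;
import java.util.List;

// Run with: java -Xmx64m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp OomDemo
public class OomDemo {
    public static void main(String[] args) {
        List<byte[]> hog = new ArrayList<>();
        while (true) {
            hog.add(new byte[1024 * 1024]);   // keep allocating 1 MB blocks until the heap is exhausted
        }
    }
}

When the OutOfMemoryError is thrown, the JVM writes a java_pid<pid>.hprof file into the configured dump directory.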
Scripts
You can also trigger a script when an OutOfMemoryError occurs. For example, triggering a
notification so you know that the error has occurred. Use the following option to trigger a script on
an OutOfMemoryError:
-XX:OnOutOfMemoryError=/path/to/script
Note
Since Apache Hadoop is a distributed system, any script used must be placed on all nodes in the
cluster that the service runs on.
The script must also be in a location that is accessible by the account the service runs as, and must
have execute permissions. For example, you may wish to store scripts in /usr/local/bin and use
chmod go+rx /usr/local/bin/filename.sh to grant read and execute permissions.
Using Apache Ambari
• Using the Filter... entry, enter opts. Only items containing this text are displayed.
• Find the *_OPTS entry for the service you want to enable heap dumps for, and add the options
you wish to enable. In the following image, I've added -XX:
+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ to the
HADOOP_NAMENODE_OPTS entry:
Note: When enabling heap dumps for the map or reduce child process, look for the fields named
mapreduce.admin.map.child.java.opts and mapreduce.admin.reduce.child.java.opts.
• Use the Save button to save the changes. You can enter a short note describing the changes.
• Once the changes have been applied, the Restart required icon appears beside one or more
services.
• Select each service that needs a restart, and use the Service Actions button to Turn On
Maintenance Mode. Maintenance mode prevents alerts from being generated from the service while it is restarted.
Most Java applications developed today involve multiple threads, which, in contrast to its benefits,
carries with it a number of subtle difficulties. In a single-threaded application, all resources
(shared data, Input/Output (IO) devices, etc.) can be accessed without coordination, knowing that
the single thread of execution will be the only thread that utilizes the resource at any given time
within the application.
In the case of multithreaded applications, a trade-off is made — increased complexity for a possible
gain in performance, where multiple threads can utilize the available (often more than one)
Central Processing Unit (CPU) cores. In the right conditions, an application can see a
significant performance increase using multiple threads (formalized by Amdahl's Law), but special
attention must be paid to ensure that multiple threads coordinate properly when accessing a
resource that is needed by two threads. In many cases, frameworks, such as Spring, will abstract
direct thread management, but even the improper use of these abstracted threads can cause some
hard-to-debug issues. Taking all of these difficulties into consideration, it is likely that, eventually,
something will go wrong, and we, as developers, will have to start diagnosing the indeterministic
realm of threads.
Fortunately, Java has a mechanism for inspecting the state of all threads in an application at any
given time —the thread dump. In this article, we will look at the importance of thread dumps and
how to decipher their compact format, as well as how to generate and analyze thread dumps in
realistically-sized applications. This article assumes the reader has a basic understanding of
threads and the various issues that surround threads, including thread contention and shared
resource management. Even with this understanding, before generating and examining a thread
dump, it is important to solidify some central threading terminology.
Understanding the Terminology
Java thread dumps can appear cryptic at first, but making sense of thread dumps requires an
understanding of some basic terminology. In general, the following terms are key in grasping the
meaning and context of a Java thread dump:
• Thread — A discrete unit of concurrency that is managed by the Java Virtual Machine (JVM).
Threads are mapped to Operating System (OS) threads, called native threads, which
provide a mechanism for the execution of instructions (code). Each thread has a unique
identifier, name, and may be categorized as a daemon thread or non-daemon thread,
where a daemon thread runs independent of other threads in the system and is only killed
when either the Runtime.exit method has been called (and the security manager authorizes
the exiting of the program) or all non-daemon threads have died. For more information, see
the Thread class documentation.
Alive thread — a running thread that is performing some work (the normal thread state).
Blocked thread — a thread that attempted to enter a synchronized block but another
thread already locked the same synchronized block.
Waiting thread — a thread that has called the wait method (with a possible timeout) on
an object and is currently waiting for another thread to call the notify method
(or notifyAll) on the same object. Note that a thread is not considered waiting if it calls
the wait method on an object with a timeout and the specified timeout has expired.
Sleeping thread — a thread that is currently not executing as a result of calling
the Thread.sleep method (with a specified sleep length).
• Monitor — a mechanism employed by the JVM to facilitate concurrent access to a single object.
This mechanism is instituted using the synchronized keyword, where each object in Java has
an associated monitor allowing any thread to synchronize, or lock, an object, ensuring that
no other thread accesses the locked object until the lock is released (the synchronized block is
exited). For more information, see the Synchronization section (17.1) of the Java Language
Specification (JLS); a minimal example of monitor usage follows this list.
• Deadlock — a scenario in which one thread holds some resource, A, and is blocked, waiting for
some resource, B, to become available, while another thread holds resource B and is blocked,
waiting for resource A to become available. When a deadlock occurs, no progress is made
within a program. It is important to note that a deadlock may also occur with more than two
threads, where three or more threads all hold a resource required by another thread and are
simultaneously blocked, waiting for a resource held by another thread. A special case of this
occurs when some thread, X, holds resource A and requires resource C, thread Y holds
resource B and requires resource A, and thread Z holds resource C and requires resource
B (formally known as the Dining Philosophers Problem).
• Livelock — a scenario in which thread A performs an action that causes thread B to perform an
action that in turn causes thread A to perform its original action. This situation can be
visualized as a dog chasing its tail. Similar to deadlock, live-locked threads do not make
progress, but unlike deadlock, the threads are not blocked (and instead, are alive).
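As a minimal illustration of the monitor mechanism described in the list above (class and thread names are illustrative): two threads synchronize on the same object, so whichever thread arrives second is reported as BLOCKED in a thread dump until the first one exits the synchronized block.

public class MonitorExample {
    private static final Object LOCK = new Object();

    public static void main(String[] args) {
        Runnable work = () -> {
            synchronized (LOCK) {                 // acquire the monitor associated with LOCK
                try {
                    Thread.sleep(5000);           // hold the monitor long enough to observe in a dump
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        new Thread(work, "holder").start();
        new Thread(work, "blocked-waiter").start();   // BLOCKED until "holder" releases the monitor
    }
}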
The above definitions do not constitute a comprehensive vocabulary for Java threads or thread
dumps but make up a large portion of the terminology that will be experienced when reading a
typical thread dump. For a more detailed lexicon of Java threads and thread dumps, see Section 17
of the JLS and Java Concurrency in Practice.
With this basic understanding of Java threads, we can progress to creating an application from
which we will generate a thread dump and, later, examine the key portion of the thread dump to
garner useful information about the threads in the program.
Creating an Example Program
In order to generate a thread dump, we need to first execute a Java application. While a simple
"hello, world!" application results in an overly simplistic thread dump, a thread dump from
an even moderately-sized multithreaded application can be overwhelming. For the sake of
understanding the basics of a thread dump, we will use the following program, which starts two
threads that eventually become deadlocked:
// Reconstructed so that the fragments preserved in these notes form a complete, compilable
// listing; the structure follows the description in the surrounding text.
public class DeadlockProgram {
    public static void main(String[] args) throws Exception {
        Object resourceA = new Object();
        Object resourceB = new Object();
        Thread threadLockingResourceAFirst = new Thread(new DeadlockRunnable(resourceA, resourceB));
        Thread threadLockingResourceBFirst = new Thread(new DeadlockRunnable(resourceB, resourceA));
        threadLockingResourceAFirst.start();
        Thread.sleep(500);
        threadLockingResourceBFirst.start();
    }
    private static void printLockedResource(Object resource) {
        System.out.println(Thread.currentThread().getName() + ": locked resource -> " + resource);
    }
    private static class DeadlockRunnable implements Runnable {
        private final Object firstResource, secondResource;
        DeadlockRunnable(Object firstResource, Object secondResource) { this.firstResource = firstResource; this.secondResource = secondResource; }
        @Override
        public void run() {
            try {
                synchronized (firstResource) {
                    printLockedResource(firstResource);
                    Thread.sleep(1000);
                    synchronized (secondResource) { printLockedResource(secondResource); }
                }
            } catch (InterruptedException e) {
                System.out.println("Exception occurred: " + e);
            }
        }
    }
}
This program simply creates two resources, resourceA and resourceB, and starts two threads,
threadLockingResourceAFirst and threadLockingResourceBFirst, that lock each of these resources.
The key to causing deadlock is ensuring that threadLockingResourceAFirst tries to lock
resourceA and then resourceB, while threadLockingResourceBFirst tries to lock resourceB and
then resourceA. Delays are added to ensure that threadLockingResourceAFirst sleeps before it is
able to lock resourceB and that threadLockingResourceBFirst is given enough time to lock resourceB
before threadLockingResourceAFirst wakes. threadLockingResourceBFirst then sleeps and, when
both threads wake, they find that the second resource they desire has already been locked and
both threads block, waiting for the other thread to relinquish its locked resource (which never
occurs).
Executing this program results in the following output, where the object hashes (the numeric
following java.lang.Object@) will vary between each execution:
At the completion of this output, the program appears as though it is running (the process
executing this program does not terminate), but no further work is being done. This is a deadlock
in practice. In order to troubleshoot the issue at hand, we must generate a thread dump manually
and inspect the state of the threads in the dump.
Generating a Thread Dump
In practice, a Java program might terminate abnormally and generate a thread dump
automatically, but, in some cases (such as with many deadlocks), the program does not terminate
but appears as though it is stuck. To generate a thread dump for this stuck program, we must first
discover the Process ID (PID) for the program. To do this, we use the JVM Process Status (JPS)
tool that is included with all Java Development Kit (JDK) 7+ installations. To find the PID for our
deadlocked program, we simply execute jps in the terminal (either Windows or Linux):
$ jps
11568 DeadlockProgram
15584 Jps
15636
The first column represents the Local VM ID (lvmid) for the running Java process. In the context
of a local JVM, the lvmid maps to the PID for the Java process. Note that this value will likely differ
from the value above. The second column represents the name of the application, which may map
to the name of the main class, a Java Archive (JAR) file, or Unknown, depending on the
characteristics of the program run.
In our case, the application name is DeadlockProgram, which matches the name of the main class
file that was executed when our program started. In the above example, the PID for our program is
11568, which provides us with enough information to generate a thread dump. To generate the
dump, we use the jstack program (included with all JDK 7+ installations), supplying the -l flag
(which creates a long listing) and the PID of our deadlocked program, and piping the output to
some text file (i.e. thread_dump.txt):
jstack -l 11568 > thread_dump.txt
This thread_dump.txt file now contains the thread dump for our deadlocked program and includes
some very useful information for diagnosing the root cause of our deadlock problem. Note that if we
did not have a JDK 7+ installed, we could also generate a thread dump by sending a SIGQUIT signal
to the deadlocked program, which makes the JVM print a thread dump without terminating. To do
this on Linux, simply signal the deadlocked program using its PID (11568 in our example) with the -3 flag:
kill -3 11568
Reading a Simple Thread Dump
Opening the thread_dump.txt file, we see that it contains the following:
2018-06-19 16:44:44
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0.1+10 mixed mode):
0x00000250e54d0800
}
"Reference Handler" #2 daemon prio=10 os_prio=2 tid=0x00000250e4979000 nid=0x3c28
waiting on condition [0x000000b82a9ff000]
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@10.0.1/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@10.0.1/Reference.java:174)
at java.lang.ref.Reference.access$000(java.base@10.0.1/Reference.java:44)
at java.lang.ref.Reference$ReferenceHandler.run(java.base@10.0.1/Reference.java:138)
- None
at java.lang.Object.wait(java.base@10.0.1/Native Method)
at java.lang.ref.ReferenceQueue.remove(java.base@10.0.1/ReferenceQueue.java:151)
at java.lang.ref.ReferenceQueue.remove(java.base@10.0.1/ReferenceQueue.java:172)
at java.lang.ref.Finalizer$FinalizerThread.run(java.base@10.0.1/Finalizer.java:216)
- None
java.lang.Thread.State: RUNNABLE
- None
java.lang.Thread.State: RUNNABLE
- None
"C2 CompilerThread0" #6 daemon prio=9 os_prio=2 tid=0x00000250e4995800 nid=0x4198
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
- None
java.lang.Thread.State: RUNNABLE
No compile task
Locked ownable synchronizers:
- None
java.lang.Thread.State: RUNNABLE
No compile task
- None
- None
java.lang.Thread.State: RUNNABLE
- None
"Common-Cleaner" #11 daemon prio=8 os_prio=1 tid=0x00000250e54cf000 nid=0x1610 in
Object.wait() [0x000000b82b2fe000]
at java.lang.Object.wait(java.base@10.0.1/Native Method)
at java.lang.ref.ReferenceQueue.remove(java.base@10.0.1/ReferenceQueue.java:151)
at jdk.internal.ref.CleanerImpl.run(java.base@10.0.1/CleanerImpl.java:148)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
at jdk.internal.misc.InnocuousThread.run(java.base@10.0.1/InnocuousThread.java:134)
- None
"Thread-0" #12 prio=5 os_prio=0 tid=0x00000250e54d1800 nid=0xdec waiting for monitor
entry [0x000000b82b4ff000]
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
- None
"Thread-1" #13 prio=5 os_prio=0 tid=0x00000250e54d2000 nid=0x415c waiting for monitor
entry [0x000000b82b5ff000]
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
- None
- None
=============================
"Thread-0":
waiting to lock monitor 0x00000250e4982480 (object 0x00000000894465b0, a
java.lang.Object),
"Thread-1":
===================================================
"Thread-0":
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
"Thread-1":
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
Found 1 deadlock.
Introductory Information
Although this file may appear overwhelming at first, it is actually simple if we take each section one
step at a time. The first line of the dump displays the timestamp of when the dump was generated,
while the second line contains the diagnostic information about the JVM from which the dump
was generated:
2018-06-19 16:44:44
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0.1+10 mixed mode):
While these lines do not provide any information with regards to the threads in our system, they
provide a context from which the rest of the dump can be framed (i.e. which JVM generated the
dump and when the dump was generated).
General Threading Information
The next section begins to provide us with some useful information about the threads that were
running at the time the thread dump was taken:
This section contains the thread list Safe Memory Reclamation (SMR) information,
which enumerates the addresses of all non-JVM internal threads (e.g. non-VM and non-Garbage
Collection (GC)). If we examine these addresses, we see that they correspond to the tid value — the
address of the native thread object, not the Thread ID, as we will see shortly — of each of the
numbered threads in the dump (note that ellipses are used to hide superfluous information):
Threads
Directly following the SMR information is the list of threads. The first thread listed in for our
deadlocked program is the Reference Handler thread:
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@10.0.1/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@10.0.1/Reference.java:174)
at java.lang.ref.Reference.access$000(java.base@10.0.1/Reference.java:44)
at java.lang.ref.Reference$ReferenceHandler.run(java.base@10.0.1/Reference.java:138)
- None
Thread Summary
The first line of each thread represents the thread summary, which contains the following items:
• Priority (example: prio=10): The numeric priority of the Java thread. Note that this does not
necessarily correspond to the priority of the OS thread to which the Java thread is dispatched. The
priority of a Thread object can be set using the setPriority method and obtained using the
getPriority method.
• OS Thread Priority (example: os_prio=2): The OS thread priority. This priority can differ from the
Java thread priority and corresponds to the OS thread on which the Java thread is dispatched.
• Address (example: tid=0x00000250e4979000): The address of the Java thread. This address
represents the pointer address of the Java Native Interface (JNI) native Thread object (the C++
Thread object that backs the Java thread through the JNI). This value is obtained by converting the
pointer to this (of the C++ object that backs the Java Thread object) to an integer on line 879 of
hotspot/share/runtime/thread.cpp:
st->print("tid=" INTPTR_FORMAT " ", p2i(this));
Although the key for this item (tid) may appear to be the thread ID, it is actually the address of the
underlying JNI C++ Thread object and thus is not the ID returned when calling getId on a Java
Thread object.
• Last Known Java Stack Pointer (example: [0x000000b82a9ff000]): The last known Stack Pointer
(SP) for the stack associated with the thread. This value is supplied using native C++ code and is
interlaced with the Java Thread class using the JNI. This value is obtained using the last_Java_sp()
native method and is formatted into the thread dump on line 2886 of
hotspot/share/runtime/thread.cpp:
st->print_cr("[" INTPTR_FORMAT "]", (intptr_t)last_Java_sp() & ~right_n_bits(12));
For simple thread dumps, this information may not be useful, but for more complex diagnostics,
this SP value can be used to trace lock acquisition through a program.
Thread State
The second line represents the current state of the thread. The possible states for a thread are
captured in the Thread.State enumeration:
• NEW
• RUNNABLE
• BLOCKED
• WAITING
• TIMED_WAITING
• TERMINATED
For more information on the meaning of each state, see the Thread.State documentation.
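These states can also be observed directly from code; a small sketch (the class and thread names are illustrative):

public class ThreadStateDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread sleeper = new Thread(() -> {
            try { Thread.sleep(60000); } catch (InterruptedException ignored) { }
        }, "sleeper");
        System.out.println(sleeper.getState());   // NEW (created but not yet started)
        sleeper.start();
        Thread.sleep(100);                         // give the thread time to reach Thread.sleep
        System.out.println(sleeper.getState());   // TIMED_WAITING (sleeping with a timeout)
    }
}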
Thread Stack Trace
The next section contains the stack trace for the thread at the time of the dump. This stack trace
resembles the stack trace printed when an uncaught exception occurs and simply denotes the class
and line that the thread was executing when the dump was taken. In the case of the Reference
Handler thread, there is nothing of particular importance that we see in the stack trace, but if we
look at the stack trace for Thread-0, we see a difference from the standard stack trace:
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
- None
Within this stack trace, we can see that locking information has been added, which tells us that this
thread is waiting for a lock on an object with an address of 0x00000000894465b0 (and a type
of java.lang.Object) and, at this point in the stack trace, holds a lock on an object with an address
of 0x00000000894465a0 (also of type java.lang.Object). This supplemental lock information is
important when diagnosing deadlocks, as we will see in the following sections.
Locked Ownable Synchronizer
The last portion of the thread information contains a list of synchronizers (objects that can be used
for synchronization, such as locks) that are exclusively owned by a thread. According to the official
Java documentation, "an ownable synchronizer is a synchronizer that may be exclusively owned by
a thread and uses AbstractOwnableSynchronizer (or its subclass) to implement its synchronization
property. ReentrantLock and the write-lock (but not the read-lock) of ReentrantReadWriteLock
are two examples of ownable synchronizers provided by the platform."
For more information on locked ownable synchronizers, see this Stack Overflow post.
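A small sketch of how an ownable synchronizer ends up in a dump (assuming the dump is taken with jstack -l while the lock is held; the class and thread names are illustrative):

import java.util.concurrent.locks.ReentrantLock;

public class OwnableSynchronizerDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        Thread holder = new Thread(() -> {
            lock.lock();                  // the lock is now exclusively owned by this thread
            try {
                Thread.sleep(60000);      // hold it long enough to take a dump with jstack -l
            } catch (InterruptedException ignored) {
            } finally {
                lock.unlock();
            }
        }, "lock-holder");
        holder.start();
        holder.join();
    }
}

While the lock-holder thread sleeps, its entry in the dump lists the ReentrantLock under "Locked ownable synchronizers".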
JVM Threads
The next section of the thread dump contains the JVM-internal (non-application) threads that are
bound to the OS. Since these threads do not exist within a Java application, they do not have a
thread ID. These threads are usually composed of GC threads and other threads used by the JVM
to run and maintain a Java application:
For many simple issues, this information is unused, but it is important to understand the
importance of these global references. For more information, see this Stack Overflow post.
Deadlocked Threads
The final section of the thread dump contains information about discovered deadlocks. This is not
always the case: If the application does not have one or more detected deadlocks, this section will
be omitted. Since our application was designed with a deadlock, the thread dump correctly
captures this contention with the following message:
=============================
"Thread-0":
===================================================
"Thread-0":
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
"Thread-1":
at DeadlockProgram$DeadlockRunnable.run(DeadlockProgram.java:34)
at java.lang.Thread.run(java.base@10.0.1/Thread.java:844)
Found 1 deadlock.
The first subsection describes the deadlock scenario: Thread-0 is waiting to lock a monitor
(through the synchronized statement around the firstResource and secondResource in our
application) that is held by Thread-1, while Thread-1 is waiting to lock a monitor held by Thread-0. This circular
dependency is the textbook definition of a deadlock (contrived by our application) and is
illustrated in the figure below:
In addition to the description of the deadlock, the stack trace for each of the threads involved is
printed in the second subsection. This allows us to track down the line and locks (the objects being
used as monitor locks in this case) that are causing the deadlock. For example, if we examine line
34 of our application, we find the following content:
printLockedResource(secondResource);
This line represents the first line of the synchronized block causing the deadlock and tips us off to
the fact that synchronizing on secondResource is the root of the deadlock. In order to solve this
deadlock, we would have to instead synchronize on resourceA and resourceB in the same order in
both threads. If we do this, we end up with the following application:
// Reconstructed complete listing; the only change from the deadlocked version is that both
// runnables now receive (resourceA, resourceB), so both threads lock the resources in the same order.
public class DeadlockProgram {
    public static void main(String[] args) throws Exception {
        Object resourceA = new Object();
        Object resourceB = new Object();
        Thread threadLockingResourceAFirst = new Thread(new DeadlockRunnable(resourceA, resourceB));
        Thread threadLockingResourceBFirst = new Thread(new DeadlockRunnable(resourceA, resourceB));
        threadLockingResourceAFirst.start();
        Thread.sleep(500);
        threadLockingResourceBFirst.start();
    }
    private static void printLockedResource(Object resource) {
        System.out.println(Thread.currentThread().getName() + ": locked resource -> " + resource);
    }
    private static class DeadlockRunnable implements Runnable {
        private final Object firstResource, secondResource;
        DeadlockRunnable(Object firstResource, Object secondResource) { this.firstResource = firstResource; this.secondResource = secondResource; }
        @Override
        public void run() {
            try {
                synchronized (firstResource) {
                    printLockedResource(firstResource);
                    Thread.sleep(1000);
                    synchronized (secondResource) { printLockedResource(secondResource); }
                }
            } catch (InterruptedException e) {
                System.out.println("Exception occurred: " + e);
            }
        }
    }
}
This application produces the following output and completes without deadlocking (note that the
addresses of the Object objects will vary by execution):
Thread-0: locked resource -> java.lang.Object@1ad895d1
In summary, using only the information provided in the thread dump, we can find and fix a
deadlocked application. Although this inspection technique is sufficient for many simple
applications (or applications that have only a small number of deadlocks), dealing with more
complex thread dumps may need to be handled in a different way.
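Before turning to heavier tooling, note that the same circular wait can also be detected programmatically from inside the JVM, for example from a watchdog thread; a minimal sketch using ThreadMXBean (the class name is illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    public static void main(String[] args) {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        long[] deadlockedIds = mxBean.findDeadlockedThreads();   // null when no deadlock is present
        if (deadlockedIds != null) {
            for (ThreadInfo info : mxBean.getThreadInfo(deadlockedIds)) {
                System.out.println(info.getThreadName() + " is deadlocked waiting for " + info.getLockName());
            }
        }
    }
}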
Handling More Complex Thread Dumps
When handling production applications, thread dumps can become overwhelming very quickly. A
single JVM may have hundreds of threads running at the same time and deadlocks may involve
more than two threads (or there may be more than one concurrency issue as a side-effect of a
single cause) and parsing through this firehose of information can be tedious and unruly.
In order to handle these large-scale situations, Thread Dump Analyzers (TDAs) should be the tool
of choice. These tools parse Java thread dumps and display otherwise confusing information in a
manageable form (commonly with a graph or other visual aid) and may even perform static
analysis of the dump to discover issues. While the best tool for a situation will vary by
circumstance, some of the most common TDAs include the following:
• fastThread
• Spotify TDA
• IBM Thread and Monitor Dump Analyzer for Java
• irockel TDA
While this is far from a comprehensive list of TDAs, each performs enough analysis and visual
sorting to reduce the manual burden of deciphering thread dumps.
How to Analyze Java Thread Dumps
The content of this article was originally written by Tae Jin Gu on the Cubrid blog.
When there is a problem, or when a Java-based web application is running much slower than
expected, we need to use thread dumps. If thread dumps feel very complicated to you, this
article may help you a great deal. Here I will explain what threads are in Java, their types, how they
are created, how to manage them, how you can dump threads from a running application, and
finally how you can analyze them and determine the bottleneck or blocking threads. This article is
a result of long experience in Java application debugging.
Java and Thread
A web server uses tens to hundreds of threads to process a large number of concurrent users. If
two or more threads utilize the same resources, a contention between the threads is inevitable, and
sometimes deadlock occurs.
Thread contention is a status in which one thread is waiting for a lock, held by another thread,
to be lifted. Different threads frequently access shared resources on a web application. For
example, to record a log, the thread trying to record the log must obtain a lock and access the
shared resources.
Deadlock is a special type of thread contention, in which two or more threads are waiting for the
other threads to complete their tasks in order to complete their own tasks.
Different issues can arise from thread contention. To analyze such issues, you need to use the
thread dump. A thread dump will give you the information on the exact status of each thread.
Background Information for Java Threads
Thread Synchronization
A thread can run alongside other threads at the same time. In order to ensure consistency
when multiple threads are trying to use shared resources, one thread at a time should be allowed
to access the shared resources by using thread synchronization.
Thread synchronization in Java can be done using a monitor. Every Java object has a single
monitor. The monitor can be owned by only one thread. For a thread to own a monitor that is
owned by a different thread, it needs to wait in the wait queue until the other thread releases its
monitor.
Thread Status
In order to analyze a thread dump, you need to know the status of threads. The statuses of threads
are stated on java.lang.Thread.State.
Use the extracted PID as the parameter of jstack to obtain a thread dump.
Use the extracted PID as the parameter of kill -SIGQUIT (signal 3) to obtain a thread dump.
Thread Information from the Thread Dump File
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
• Thread name: When using the java.lang.Thread class to generate a thread, the thread will be named
Thread-(Number), whereas when using java.util.concurrent.ThreadFactory class, it will be
named pool-(number)-thread-(number).
• Priority: Represents the priority of the threads.
• Thread ID: Represents the unique ID for the threads. (Some useful information, including the
CPU usage or memory usage of the thread, can be obtained by using the thread ID; see the sketch after this list.)
• Thread status: Represents the status of the threads.
• Thread callstack: Represents the call stack information of the threads.
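As referenced in the Thread ID item above, per-thread CPU time can be looked up by thread ID through ThreadMXBean; a minimal sketch (the class name is illustrative; values are in nanoseconds and may be -1 if CPU time measurement is unsupported or disabled):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCpuUsage {
    public static void main(String[] args) {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        for (long threadId : mxBean.getAllThreadIds()) {
            long cpuNanos = mxBean.getThreadCpuTime(threadId);   // CPU time consumed by this thread
            System.out.println("thread " + threadId + " cpu=" + cpuNanos + " ns");
        }
    }
}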
Thread Dump Patterns by Type
When Unable to Obtain a Lock (BLOCKED)
This is when the overall performance of the application slows down because a thread is occupying
the lock and prevents other threads from obtaining it. In the following example, BLOCKED_TEST
pool-1-thread-1 thread is running with <0x0000000780a000b0> lock, while BLOCKED_TEST
pool-1-thread-2 and BLOCKED_TEST pool-1-thread-3 threads are waiting to obtain
<0x0000000780a000b0> lock.
java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.PrintStream.write(PrintStream.java:432)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
at java.io.PrintStream.newLine(PrintStream.java:496)
at java.io.PrintStream.println(PrintStream.java:687)
- locked <0x0000000780a04118> (a java.io.PrintStream)
at
com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:44)
- locked <0x0000000780a000b0> (a
com.nbp.theplatform.threaddump.ThreadBlockedState)
at
com.nbp.theplatform.threaddump.ThreadBlockedState$1.run(ThreadBlockedState.java:7)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
- <0x0000000780a31758> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
"BLOCKED_TEST pool-1-thread-2" prio=6 tid=0x0000000007673800 nid=0x260c waiting for
monitor entry [0x0000000008abf000]
at
com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:43)
at
com.nbp.theplatform.threaddump.ThreadBlockedState$2.run(ThreadBlockedState.java:26)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
- <0x0000000780b0c6a0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at
com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:42)
at
com.nbp.theplatform.threaddump.ThreadBlockedState$3.run(ThreadBlockedState.java:34)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
- <0x0000000780b0e1b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(T
hreadDeadLockState.java:197)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.monitorOurLock(Thre
adDeadLockState.java:182)
- locked <0x00000007d58f5e48> (a
com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockS
tate.java:135)
Locked ownable synchronizers:
- None
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(T
hreadDeadLockState.java:197)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.monitorOurLock(Thre
adDeadLockState.java:182)
- locked <0x00000007d58f5e60> (a
com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockS
tate.java:135)
- None
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(T
hreadDeadLockState.java:197)
- locked <0x00000007d58f5e78> (a
com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at
com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockS
tate.java:135)
- None
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at sun.nio.cs.StreamDecoder.read0(StreamDecoder.java:107)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:93)
at java.io.InputStreamReader.read(InputStreamReader.java:151)
at
com.nbp.theplatform.threaddump.ThreadSocketReadState$1.run(ThreadSocketReadState.java:27
)
at java.lang.Thread.run(Thread.java:662)
When Waiting
The thread is maintaining the WAITING status. In the thread dump, the IoWaitThread thread keeps waiting to
receive a message from LinkedBlockingQueue. If there continues to be no message for
LinkedBlockingQueue, then the thread status will not change.
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007d5c45850> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSy
nchronizer.java:1987)
at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:440)
at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:629)
at
com.nbp.theplatform.threaddump.ThreadIoWaitState$IoWaitHandler2.run(ThreadIoWaitState.ja
va:89)
at java.lang.Thread.run(Thread.java:662)
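For reference, this WAITING pattern can be reproduced with a consumer that blocks on an empty queue; a minimal sketch (the class and thread names are illustrative):

import java.util.concurrent.LinkedBlockingQueue;

public class IoWaitDemo {
    public static void main(String[] args) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread consumer = new Thread(() -> {
            try {
                String message = queue.take();    // parks the thread until a message arrives
                System.out.println("received: " + message);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "IoWaitThread");
        consumer.start();
        // No producer ever offers a message, so the consumer stays in WAITING state indefinitely.
    }
}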
From the application, find out which thread is using the CPU the most.
Acquire the Light Weight Process (LWP) that uses the CPU the most and convert its unique
number (10039) into a hexadecimal number (0x2737).
2. After acquiring the thread dump, check the thread's action.
Extract the thread dump of an application with a PID of 10029, then find the thread with an nid of
0x2737.
"NioProcessor-2" prio=10 tid=0x0a8d2800 nid=0x2737 runnable [0x49aa5000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
external.org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:65)
at
external.org.apache.mina.common.AbstractPollingIoProcessor$Worker.run(AbstractPollingIoPro
cessor.java:708)
at
external.org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Extract thread dumps several times every hour, and check the status change of the threads to
determine the problem.
Example 2: When the Processing Performance is Abnormally Slow
After acquiring thread dumps several times, find the list of threads with BLOCKED status.
" DB-Processor-13" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry
[0x000000000825f000]
at beans.cus.ServiceCnt.getTodayCount(ServiceCnt.java:111)
at beans.cus.ServiceCnt.insertCount(ServiceCnt.java:43)
at beans.ConnectionPool.getConnection(ConnectionPool.java:102)
at beans.cus.ServiceCnt.getTodayCount(ServiceCnt.java:111)
at beans.cus.ServiceCnt.insertCount(ServiceCnt.java:43)
" DB-Processor-3" daemon prio=5 tid=0x00928248 nid=0x8b waiting for monitor entry
[0x000000000825d080]
java.lang.Thread.State: RUNNABLE
at oracle.jdbc.driver.OracleConnection.isClosed(OracleConnection.java:570)
at beans.ConnectionPool.getConnection(ConnectionPool.java:112)
at beans.cus.Cue_1700c.GetNationList(Cue_1700c.java:66)
at org.apache.jsp.cue_1700c_jsp._jspService(cue_1700c_jsp.java:120)
Acquire the list of threads with BLOCKED status after getting the thread dumps several times.
If the threads are BLOCKED, extract the threads related to the lock that the threads are trying to
obtain.
Through the thread dump, you can confirm that the thread status stays BLOCKED because
the <0xe0375410> lock could not be obtained. This problem can be solved by analyzing the stack trace
of the thread currently holding the lock.
There are two reasons why the above pattern frequently appears in applications using DBMS. The
first reason is inadequate configurations. Despite the fact that the threads are still working,
they cannot show their best performance because the configurations for DBCP and the like are not
adequate. If you extract thread dumps multiple times and compare them, you will often see that
some of the threads that were BLOCKED previously are in a different state.
The second reason is the abnormal connection. When the connection with DBMS stays
abnormal, the threads wait until the time is out. In this case, even after extracting the thread
dumps several times and comparing them, you will see that the threads related to DBMS are still in
a BLOCKED state. By adequately changing the values, such as the timeout value, you can shorten
the time in which the problem occurs.
Coding for Easy Thread Dump
Naming Threads
When a thread is created using java.lang.Thread object, the thread will be named Thread-
(Number). When a thread is created using java.util.concurrent.DefaultThreadFactory object, the
thread will be named pool-(Number)-thread-(Number). When analyzing tens to thousands of
threads for an application, if all the threads still have their default names, analyzing them becomes
very difficult, because it is difficult to distinguish the threads to be analyzed.
Therefore, it is recommended that you develop the habit of naming threads whenever a new
thread is created.
When you create a thread using java.lang.Thread, you can give the thread a custom name by using
the constructor parameter.
public Thread(Runnable target, String name);
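For example (the task body and thread name here are illustrative):

Thread fileCleaner = new Thread(() -> {
    // ... the actual work of the thread goes here ...
}, "file-cleaner-1");      // this name is what will appear in the thread dump
fileCleaner.start();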
When you create a thread using java.util.concurrent.ThreadFactory, you can name it by generating
your own ThreadFactory. If you do not need special functionalities, then you can use
MyThreadFactory as described below:
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Reconstructed from the fragments preserved in these notes: a ThreadFactory that gives
// every thread a descriptive, per-pool name so it is easy to identify in a thread dump.
public class MyThreadFactory implements ThreadFactory {
    private static final ConcurrentHashMap<String, AtomicInteger> POOL_NUMBER = new ConcurrentHashMap<String, AtomicInteger>();
    private final ThreadGroup group;
    private final AtomicInteger threadNumber = new AtomicInteger(1);
    private final String namePrefix;

    public MyThreadFactory(String threadPoolName) {
        if (threadPoolName == null) {
            throw new NullPointerException("threadPoolName");
        }
        group = Thread.currentThread().getThreadGroup();
        AtomicInteger poolCount = POOL_NUMBER.putIfAbsent(threadPoolName, new AtomicInteger(1));
        if (poolCount == null) {
            namePrefix = threadPoolName + "-pool-1-thread-";
        } else {
            namePrefix = threadPoolName + "-pool-" + poolCount.incrementAndGet() + "-thread-";
        }
    }

    public Thread newThread(Runnable runnable) {
        Thread thread = new Thread(group, runnable, namePrefix + threadNumber.getAndIncrement(), 0);
        if (thread.isDaemon()) {
            thread.setDaemon(false);                    // never hand out daemon threads
        }
        if (thread.getPriority() != Thread.NORM_PRIORITY) {
            thread.setPriority(Thread.NORM_PRIORITY);   // normalize the priority
        }
        return thread;
    }
}
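A hypothetical usage of this factory with a thread pool, so that pool threads carry recognizable names in a dump:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newFixedThreadPool(4, new MyThreadFactory("worker"));
// Threads created by this pool appear in a thread dump as "worker-pool-1-thread-1",
// "worker-pool-1-thread-2", and so on, instead of the default "pool-N-thread-M".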
Thread information can also be obtained programmatically at runtime through java.lang.management.ThreadMXBean:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
long[] threadIds = mxBean.getAllThreadIds();
ThreadInfo[] threadInfos = mxBean.getThreadInfo(threadIds);
for (ThreadInfo threadInfo : threadInfos) {
    if (threadInfo == null) continue;                  // a thread may have died in the meantime
    System.out.println(threadInfo.getThreadName());
    System.out.println(threadInfo.getBlockedCount());
    System.out.println(threadInfo.getBlockedTime());   // -1 unless contention monitoring is enabled
    System.out.println(threadInfo.getWaitedCount());
    System.out.println(threadInfo.getWaitedTime());    // -1 unless contention monitoring is enabled
}
You can acquire the amount of time that the threads WAITed or were BLOCKED by using the
methods in ThreadInfo, and by using this you can also obtain the list of threads that have been
inactive for an abnormally long period of time.
In Conclusion
In writing this article, I was concerned that for developers with a lot of experience in multi-thread
programming this material may be common knowledge, whereas for less experienced developers
I felt that I was skipping straight to thread dumps without providing enough background
information about thread activities. This was because I was not able to explain the thread
activities in a clear yet concise manner. I sincerely hope that this article will prove helpful for
many developers.
How to collect a thread dump using jcmd and analyse it?
jsensharma Super Mentor
Created on 12-17-2016 01:11 PM
[Related Article On Ambari Server Tuning :
https://community.hortonworks.com/articles/131670/ambari-server-performance-tuning-
troubleshooting-c... ]
The jcmd utility comes with the JDK and is present inside "$JAVA_HOME/bin". It is used to
send diagnostic command requests to the JVM, where these requests are useful for controlling
Java Flight Recordings and for troubleshooting and diagnosing JVM and Java applications.
Following are the conditions for using this utility:
- 1. It must be used on the same machine where the JVM is running.
- 2. Only a user who owns the JVM process can connect to it using this utility.
This utility can help us in getting many details about the JVM process. Some of the most useful
pieces of information are the following. Syntax:
jcmd $PID $ARGUMENT
Example1: Classes taking the most memory are listed at the top, and classes are listed in a
descending order.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID GC.class_histogram > /tmp/22421_ClassHistogram.txt
Example2: Generate Heap Dump
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID GC.heap_dump /tmp/test123.hprof
Example3: Explicitly request JVM to trigger a Garbage Collection Cycle.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID GC.run
Example4: Generate Thread dump.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID Thread.print
Example5: List JVM properties.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID VM.system_properties
Example6: The Command line options along with the CLASSPATH setting.
/usr/jdk64/jdk1.8.0_60/bin/jcmd $PID VM.command_line
NOTE: To use a few specific features offered by the "jcmd" tool, the "-XX:+UnlockDiagnosticVMOptions" JVM option needs to be enabled.
.
When to Collect Thread Dumps?
---------------------------------------------------------
Here we will look at a very common scenario where we find that the JVM process is taking a lot
of time to process requests. Many times we see that the JVM process is stuck, slow, or
completely hung. In such a scenario, in order to investigate the root cause of the slowness, we need to
collect the thread dumps of the JVM process, which will tell us about the various activities those
threads are actually performing. Sometimes some threads are involved in very CPU-intensive
operations, which might also cause slowness in getting the response, so we should
collect the thread dump as well as the CPU data using the "top" command. A few things to consider
while collecting the thread dumps:
- 1. Collect the thread dump when we see the issue (slowness, stuck/hung scenario, etc.).
- 2. A single thread dump is usually not very useful. Whenever we collect thread dumps, we should
collect at least 5-6 of them at a fixed interval, for example one dump every 10 seconds, so that we
get around 5-6 thread dumps within a minute.
- 3. If we also suspect that a few threads might be consuming high CPU cycles, then in order
to find the APIs that are actually consuming the high CPU we must collect the thread dump as well
as the "top" command output data at almost the same time.
.
- In order to make this easy we can use a simple script "threaddump_cpu_with_cmd.sh" and
use it for our troubleshooting & JVM data collection. The following script can be downloaded from:
https://github.com/jaysensharma/MiddlewareMagicDemos/tree/master/HDP_Ambari/JVM
#!/bin/sh
# Takes the JavaApp PID as an argument.
# Make sure you set JAVA_HOME
# Create thread dumps a specified number of times (i.e. LOOP) at a specified INTERVAL.
# Thread dumps will be collected in the file "jcmd_threaddump.out", in the same directory from where this script is executed.
# Usage:
# sudo - $user_Who_Owns_The_JavaProcess
# ./threaddump_cpu_with_cmd.sh <JAVA_APP_PID>
#
#
# Example:
# NameNode PID is "5752" and it is started by user "hdfs" then run this utility as following:
#
# su -l hdfs -c "/tmp/threaddump_cpu_with_cmd.sh 5752"
###################################################################
#############################
# Setting the Java Home, by giving the path where your JDK is kept
# USERS MUST SET THE JAVA_HOME before running this script as follows:
JAVA_HOME=/usr/jdk64/jdk1.8.0_60
---------------------------------------------------------
While running the JCMD we might see the below mentioned error. Here the "5752" is the
NameNode PID.
[root@c6401 keys]# /usr/jdk64/jdk1.8.0_60/bin/jcmd 5752 help
5752:
com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not
responding or HotSpot VM not loaded
at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:106)
at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:63)
at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:208)
at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:147)
at sun.tools.jcmd.JCmd.main(JCmd.java:131)
This error occurred because the jcmd utility only allows us to connect to a JVM process that we
own. In this case we see that the "NameNode" process is owned by the "hdfs" user, whereas in the
above command we are trying to connect to the NameNode process via the "jcmd" utility as the
"root" user. The root user here does not own the process, hence we see the error.
-
"hdfs" user owned process
# su -l hdfs -c "/usr/jdk64/jdk1.8.0_60/bin/jcmd -l"
5752 org.apache.hadoop.hdfs.server.namenode.NameNode
5546 org.apache.hadoop.hdfs.tools.DFSZKFailoverController
5340 org.apache.hadoop.hdfs.server.datanode.DataNode
4991 org.apache.hadoop.hdfs.qjournal.server.JournalNode
.
- "root" user owned process
[root@c6401 keys]# /usr/jdk64/jdk1.8.0_60/bin/jcmd -l
1893 com.hortonworks.support.tools.server.SupportToolServer
6470 com.hortonworks.smartsense.activity.ActivityAnalyzerFacade
16774 org.apache.ambari.server.controller.AmbariServer
29100 sun.tools.jcmd.JCmd -l
6687 org.apache.zeppelin.server.ZeppelinServer
More information about this utility can be found at:
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr006.html