Hadoop Interview Questions

 What is Big Data

Big data is a vast amount of data (generally GBs or TBs in size) that exceeds the regular processing capacity of traditional computing servers and requires a special parallel processing mechanism. This data is too big and its rate of growth keeps accelerating. It can be either structured or unstructured data, which legacy databases may not be able to process.

 What is Hadoop

Hadoop is an open-source framework from the Apache Software Foundation for storing and processing large-scale data, usually called Big Data, using clusters of commodity hardware.

 Who uses Hadoop

 Big organizations whose data grows exponentially day by day require a platform like Hadoop to process such huge data. For example, Facebook, Google, Amazon, Twitter, IBM and LinkedIn use Hadoop technology to solve their big data processing problems.

 What is commodity hardware

 Commodity hardware is inexpensive hardware that is not of particularly high quality or high availability.
 Hadoop can be installed on any commodity hardware. We don't need supercomputers or high-end hardware to work on Hadoop. Commodity hardware still needs a reasonable amount of RAM, because some services will be running in memory.

 What is the basic difference between traditional


RDBMS and Hadoop

 Traditional RDBMS is used for transactional systems to report and archive data, whereas Hadoop is an approach to store huge amounts of data in a distributed file system and process it.
 RDBMS is useful when we want to seek one record from big data, whereas Hadoop is useful when we want to take in big data in one shot and perform analysis on it later.

 What are the modes in which Hadoop can run

 Hadoop can run in three modes.


 Standalone or local mode - no daemons run in this mode and everything runs in a single JVM.
 Pseudo-distributed mode – all the Hadoop daemons run on a local machine, simulating a cluster on a small scale.
 Fully distributed mode - a cluster of machines is set up in master/slave architecture to distribute and process the data across various nodes of commodity hardware.

 What are main components/projects in Hadoop


architecture

 Hadoop Common: The common utilities that support the


other Hadoop modules.
 HDFS: Hadoop distributed file system that provides high-
throughput access to application data.
 Hadoop YARN: A framework for job scheduling and
cluster resource management.
 Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.

 List important default configuration files in


Hadoop cluster

 The default configuration files in hadoop cluster are:


 core-default.xml
 hdfs-default.xml
 yarn-default.xml
 mapred-default.xml

 List important site-specific configuration files in


Hadoop cluster
 In order to override any hadoop configuration property’s
default values, we need to provide configuration values in
site-specific configuration files. Below are the four site-
specific .xml configuration files and environment variable
setup file.
 core-site.xml : Common properties are configured in
this file.
 hdfs-site.xml : Site specific hdfs properties are
configured in this file
 yarn-site.xml : Yarn specific properties can be
provided in this file.
 mapred-site.xml : Mapreduce framework specific properties will be defined here.
 hadoop-env.sh : Hadoop environment variables are set up in this file.
 All these configuration files should be placed in Hadoop's configuration directory, etc/hadoop, under Hadoop's home directory.

 How many hadoop daemon processes run on a


Hadoop System

 As of hadoop-2.5.0 release, three hadoop daemon


processes run on a hadoop cluster.
 NameNode daemon – Only one daemon runs for entire
hadoop cluster.
 Secondary NameNode daemon – Only one daemon runs
for entire hadoop cluster.
 DataNode daemon – One datanode daemon runs on each datanode in the hadoop cluster.

 How to start all hadoop daemons at a time

 The $ start-dfs.sh command can be used to start all the HDFS daemons from the terminal at a time.

 If some hadoop daemons are already running and we need to start any one remaining daemon process, what are the commands to use

 Instead of start-dfs.sh, which triggers all three hadoop daemons at a time, we can also start each daemon separately with the commands below.
 $ hadoop-daemon.sh start namenode
 $ hadoop-daemon.sh start secondarynamenode
 $ hadoop-daemon.sh start datanode

 How to stop all the three hadoop daemons at a


time

 By using stop-dfs.sh command, we can stop the above


three daemon processes with a single command.

 What are commands that need to be used to bring


down a single hadoop daemon

 Below hadoop-daemon.sh commands can be used to bring


down each hadoop daemon separately.
 $ hadoop-daemon.sh stop namenode
 $ hadoop-daemon.sh stop secondarynamenode
 $ hadoop-daemon.sh stop datanode

 How many YARN daemon processes run on a


cluster

 Two types of Yarn daemons will be running on hadoop


cluster in master/slave fashion.
 ResourceManager – Master daemon process
 NodeManager – One Slave daemon process per node in a
cluster.

 How to start Yarn daemon processes on a hadoop


cluster

 Yarn daemons can be started by running the $ start-yarn.sh command from the terminal on the ResourceManager machine; it starts the ResourceManager locally and a NodeManager on each slave machine.

 How to verify whether the daemon processes are


running or not

 Java's process status command, $ jps, can be used to check which java processes are running on a machine. This command lists all the daemon processes running on the machine along with their process ids.

 How to bring down the Yarn daemon processes


 Using $ stop-yarn.sh command, we can bring down both
the Yarn daemon processes running on a machine.

 Can we start both Hadoop daemon processes and


Yarn daemon processes with a single command

 Yes, we can start all the above mentioned five daemon


processes (3 hadoop + 2 Yarn) with a single command $
start-all.sh

 Can we stop all the above five daemon processes


with a single command
 Yes, by using the $ stop-all.sh command all the above five daemon processes can be brought down in a single shot.

 Which operating systems are supported for


Hadoop deployment
 The only supported operating system for hadoop’s
production deployment is Linux. However, with some
additional software Hadoop can be deployed on Windows
for test environments.

 How are the various components of a Hadoop cluster deployed in production
 Both Name Node and Resource Manager can be deployed
on a Master Node, and Data nodes and node managers can
be deployed on multiple slave nodes.
 There is a need for only one master node for namenode
and Resource Manager on the system. The number of
slave nodes for datanodes & node managers depends on
the size of the cluster.
 One more node with hardware specifications same as
master node will be needed for secondary namenode.

 What is structured and unstructured data


 Structured data is the data that is easily identifiable as it is
organized in a structure. The most common form of
structured data is a database where specific information is
stored in tables, that is, rows and columns.
 Unstructured data refers to any data that cannot be
identified easily. It could be in the form of images, videos,
documents, email, logs and random text. It is not in the
form of rows and columns.

 Is Namenode also a commodity


 No. The Namenode should not be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS, so the Namenode has to be a highly available, reliable machine.

 What is the difference between jps and jps -lm


commands
 jps command returns the process id and short names for
running processes. But jps -lm returns long messages
along with process id and short names as shown below.
 hadoop1@ubuntu-1:~$ jps
 5314 SecondaryNameNode
 5121 DataNode
 5458 Jps
 4995 NameNode
 hadoop1@ubuntu-1:~$ jps -lm
 5314
org.apache.hadoop.hdfs.server.namenode.SecondaryNam
eNode
 5121 org.apache.hadoop.hdfs.server.datanode.DataNode
 5473 sun.tools.jps.Jps -lm
 4995
org.apache.hadoop.hdfs.server.namenode.NameNode

 What is Mapreduce
 Mapreduce is a framework for processing big data (huge
data sets using a large number of commodity computers).
It processes the data in two phases namely Map and
Reduce phase. This programming model is inherently
parallel and can easily process large-scale data with the
commodity hardware itself.
 It is tightly integrated with the hadoop distributed file system so that processing is distributed across the data nodes of the cluster.

 What is YARN
 YARN stands for Yet Another Resource Negotiator which
is also called as Next generation Mapreduce or Mapreduce
2 or MRv2.
 It was introduced in the hadoop 0.23 release to overcome the scalability shortcomings of the classic Mapreduce framework by splitting the functionality of the JobTracker into a global ResourceManager and per-application ApplicationMasters.
 What is data serialization
 Serialization is the process of converting object data into
byte stream data for transmission over a network across
different nodes in a cluster or for persistent data storage.

 What is deserialization of data


 Deserialization is the reverse process of serialization and
converts byte stream data into object data for reading data
from HDFS. Hadoop provides Writables for serialization
and deserialization purpose.

 What are the Key/Value Pairs in Mapreduce


framework
 Mapreduce framework implements a data model in which
data is represented as key/value pairs. Both input and
output data to mapreduce framework should be in
key/value pairs only.

 What are the constraints to Key and Value classes


in Mapreduce
 Any data type that can be used for a Value field in a
mapper or reducer must implement
org.apache.hadoop.io.Writable Interface to enable the
field to be serialized and deserialized.
 By default Key fields should be comparable with each
other. So, these must implement hadoop’s
org.apache.hadoop.io.WritableComparable Interface
which in turn extends hadoop’s Writable interface and
java.lang.Comparable interfaces.
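 For illustration, a minimal sketch of a hypothetical custom key type that satisfies these constraints (the class name and field are assumptions, not part of Hadoop):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key: WritableComparable gives it serialization
// (write/readFields) plus the ordering the framework needs for sorting keys.
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public YearKey() { }                          // no-arg constructor required by the framework
    public YearKey(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                       // serialization
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                      // deserialization
    }

    @Override
    public int compareTo(YearKey other) {         // keys must be mutually comparable
        return Integer.compare(year, other.year);
    }

    @Override
    public int hashCode() {                       // used by the default HashPartitioner
        return year;
    }
}

 A value-only type would need to implement just org.apache.hadoop.io.Writable, i.e. only write() and readFields().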
 What are the main components of Mapreduce Job
 Main driver class which provides job configuration
parameters.
 Mapper class which must extend
org.apache.hadoop.mapreduce.Mapper class and provide
implementation for map () method.
 Reducer class which should extend
org.apache.hadoop.mapreduce.Reducer class.

 What are the Main configuration parameters that


user need to specify to run Mapreduce Job
 At a high level, the user of the mapreduce framework needs to specify the following (a driver sketch follows the list):
 The job's input location(s) in the distributed file system.
 The job's output location in the distributed file system.
 The input format.
 The output format.
 The class containing the map function.
 The class containing the reduce function (optional).
 The JAR file containing the mapper, reducer and driver classes.
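 The classic word-count job is a convenient way to see where each of these is specified; below is a minimal, self-contained driver sketch (the class names and the tokenizing/summing logic are illustrative, not a fixed recipe):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);                 // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));         // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);             // JAR containing mapper, reducer, driver

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output location in HDFS

        job.setInputFormatClass(TextInputFormat.class);        // input format (this is the default)
        job.setOutputFormatClass(TextOutputFormat.class);      // output format (this is the default)

        job.setMapperClass(TokenMapper.class);                 // class containing map()
        job.setReducerClass(SumReducer.class);                 // class containing reduce() (optional)
        job.setCombinerClass(SumReducer.class);                // optional combiner

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}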

 What is identity Mapper


 Identity Mapper is a default Mapper class provided by hadoop. When no mapper class is specified in a Mapreduce job, this mapper will be executed.
 It doesn't process, manipulate or perform any computation on the input data; it simply writes the input data to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityMapper.

 What is identity Reducer


 It is the reduce phase's counterpart of the Identity Mapper in the map phase. It simply passes the input key/value pairs through to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityReducer.
 When no reducer class is specified in a Mapreduce job, this class is picked up by the job automatically.

 What is a combiner
 Combiner is a semi-reducer in the Mapreduce framework. It is an optional component and can be specified with the Job.setCombinerClass() method.
 Combiner functions are suitable for producing summary information from a large data set. Hadoop doesn't guarantee how many times a combiner function will be called for each map output key; it may be called zero, one or many times.

 What are the constraints on combiner


implementation
 Combiner class must implement Reducer interface and
must provide implementation for reduce() method. The
combiner class’s reduce() method must have same input
and output key-value types as the reducer class.
 What are the advantages of combiner over
reducer or why do we need combiner when we are
using same reducer class as combiner class
 The main purpose of Combiner in Mapreduce frame work
is to limit the volume of data transfer between map and
reduce tasks.
 It is a general practice that same reducer class is used as a
combiner class, in this case, the only benefit of combiner
class is to minimize the input data to reduce phase from
data nodes.

 What are the primitive data types in Hadoop


 Below are the list of primitive writable data types available
in Hadoop.
 BooleanWritable
 ByteWritable
 IntWritable
 VIntWritable
 FloatWritable
 LongWritable
 VLongWritable
 DoubleWritable

 What is speculative execution in Mapreduce


 Speculative execution is a mechanism of running multiple copies of the same map or reduce task on different slave nodes to cope with slow individual machines.
 In large clusters of hundreds of machines, there may be machines which are not performing as fast as others. This may delay a full job because of only one machine not performing well. To avoid this, speculative execution in hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first copy to finish are used.
 If other copies were executing speculatively, Hadoop tells
the Task Trackers to abandon the tasks and discard their
outputs.
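 Speculative execution is on by default. A small sketch of how it can be switched off programmatically, using the Hadoop 2.x property names:

import org.apache.hadoop.conf.Configuration;

public class SpeculationSettings {
    // Disable speculative execution for map and reduce tasks respectively.
    public static Configuration withoutSpeculation() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return conf;
    }
}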
 What will happen if we run a Mapreduce job with an output directory that already exists
 Job will fail with
org.apache.hadoop.mapred.FileAlreadyExistsException.
In this case, delete the output directory and re-execute the
job.
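 Many drivers guard against this by removing a stale output directory before submitting the job; a hedged sketch using the FileSystem API (the method and directory names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputDirCleanup {
    // Delete the output directory if it exists so FileOutputFormat
    // does not throw FileAlreadyExistsException at submission time.
    public static void deleteIfExists(Configuration conf, String outputDir) throws Exception {
        Path out = new Path(outputDir);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true);   // true = recursive delete
        }
    }
}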

 What are the naming conventions for output files


from Map phase and Reduce Phase
 Output files from the map phase are named part-m-xxxxx and output files from the reduce phase are named part-r-xxxxx. Each individual task creates its own part file, where xxxxx is the partition number starting from 00000 and increasing sequentially (00001, 00002 and so on).

 Where the output does from Map tasks are stored


 The mapper’s output (intermediate data) is stored on the
Local file system (not on HDFS) of each individual mapper
nodes. This is typically a temporary directory location
which can be setup in mapreduce.cluster.local.dir
configuration property. The intermediate data is deleted
after the Hadoop Job completes.

 When will the reduce() method will be called from


reducers in Mapreduce job flow
 In a MapReduce job, reducers do not start executing the
reduce() method until all Map jobs are completed.
Reducers start copying intermediate output data from the
mappers as soon as they are available. But reduce()
method is called only after all the mappers have finished.
 If reducers do not start before all the mappers are completed, then why does the progress of a MapReduce job show something like Map(80%) Reduce(20%)
 As said above, reducers start copying intermediate output data from map tasks as soon as it is available, and the task progress calculation counts this copying as well. So, even though the actual reduce() method has not yet been triggered on the map output, job progress displays the completion percentage of the reduce phase as 10% or 20%. The actual reduce() processing starts only after the map phase is 100% complete.

 Where the output does from Reduce tasks are


stored
 The output from reducers is stored on the HDFS cluster, not on the local file system. All reducers store their output part-r-xxxxx files in the output directory specified in the mapreduce job instead of in the local FS. Map task output files, on the other hand, are not stored on HDFS but on each individual data node's local file system.
 Can we set an arbitrary number of Map tasks in a mapreduce job
 No. The number of map tasks is determined by the number of input splits and cannot be forced per job. The mapreduce.job.maps configuration property in mapred-site.xml is only a hint to the framework.

 Can we set arbitrary number of Reduce tasks in a


mapreduce job and if yes, how
 Yes, we can set the no of reduce tasks at job level in
Mapreduce. Arbitrary number of reduce tasks in a job can
be setup with job.setNumReduceTasks(N);
 Here N is the no of reduce tasks of our choice. Reduce
tasks can be setup at site level as well with
mapreduce.job.reduces configuration property in mapred-
site.xml file.

 Can Reducers talk with each other


 No, Reducers run in isolation. MapReduce programming
model does not allow reducers to communicate with each
other.

 What are the primary phases of the Mapper


 Primary phases of Mapper are: Record Reader, Mapper,
Combiner and Partitioner.

 What are the primary phases of the Reducer


 Primary phases of the Reducer are: Shuffle, Sort, Reduce and Output Format.

 What are the side effects of not running a


secondary name node
 The cluster performance will degrade over time since edit
log will grow bigger and bigger. If the secondary
namenode is not running at all, the edit log will grow
significantly and it will slow the system down. Also, the
system will go into safemode for an extended time since
the namenode needs to combine the edit log and the
current fs checkpoint image.

 How many racks do you need to create an Hadoop


cluster in order to make sure that the cluster
operates reliably
 In order to ensure reliable operation it is recommended to have at least 2 racks with rack placement configured. Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.

 What is the procedure for namenode recovery


 A namenode can be recovered in two ways:
 Starting new namenode from backup metadata
 Promoting secondary namenode to primary namenode.
 Hadoop WebUI shows that half of the datanodes are in
decommissioning mode. What does that mean Is it safe to
remove those nodes from the network
 This means that the namenode is trying to retrieve data from those datanodes by moving replicas to the remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes before decommissioning has finished.

 What does the Hadoop administrator have to do


after adding new datanodes to the Hadoop cluster
 Since the new nodes will not have any data on them, the
administrator needs to start the balancer to redistribute
data evenly between all nodes.
 If the Hadoop administrator needs to make a change,
which configuration file does he need to change
 Each node in the Hadoop cluster has its own configuration files, and the changes need to be made in every file. One of the reasons for this is that the configuration can be different for every node.
 Map Reduce jobs take too long. What can be done to
improve the performance of the cluster
 One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of tasks. The number of tasks has to match the number of available slots on the cluster. Hadoop is not a hardware-aware system. It is the responsibility of the developers and the administrators to make sure that resource supply and demand match.
 After increasing the replication level, I still see that data is
under replicated. What could be wrong
 Data replication takes time due to the large quantities of data. The Hadoop administrator should allow sufficient time for data replication depending on the data size. If the data is big enough, it is not uncommon for replication to take from a few minutes to a few hours.

 What is HDFS
 HDFS is the distributed file system implemented in Hadoop's framework. It is a block-structured distributed file system designed to store vast amounts of data on low-cost commodity hardware and to provide high-speed processing of that data.
 HDFS stores files across multiple machines and maintains reliability and fault tolerance. HDFS supports parallel processing of data by the Mapreduce framework.

 What are the objectives of HDFS file system


 Easily Store large amount of data across multiple
machines
 Data reliability and fault tolerance by maintaining
multiple copies of each block of a file.
 Capacity to move computation to data instead of moving
data to computation server. I.e. processing data locally.
 Able to provide parallel processing of data by Mapreduce
framework.

 What are the limitations of HDFS file systems


 HDFS supports file operations reads, writes, appends and
deletes efficiently but it doesn’t support file updates.
 HDFS is not suitable for a large number of small files but best suits large files, because the file system namespace maintained by the Namenode is limited by its main memory capacity; the namespace is stored in the namenode's main memory and a large number of files results in a big fsimage file.

 What is a block in HDFS and what is its size


 It is a fixed-size chunk of data, usually of size 128 MB. It is the minimum amount of data that HDFS reads or writes as a unit.
 HDFS files are broken into these fixed-size chunks of data spread across multiple machines on a cluster.
 Thus, blocks are the building bricks of an HDFS file. Each block is maintained in multiple copies, typically 3, as specified by the replication factor in the Hadoop configuration, to provide data redundancy and maintain fault tolerance.

 What are the core components in HDFS Cluster


 Name Node
 Secondary Name Node
 Data Nodes
 Checkpoint Nodes
 Backup Node

 What is a NameNode
 Namenode is a dedicated machine in an HDFS cluster which acts as the master server; it maintains the file system namespace in its main memory and serves the file access requests of users. The file system namespace mainly consists of the fsimage and edits files. Fsimage is a file which contains the file names, file permissions and block locations of each file.
 Usually only one active namenode is allowed in the default HDFS architecture.

 What is a DataNode
 DataNodes are slave nodes of HDFS architecture which
store the blocks of HDFS files and sends blocks
information to namenode periodically through heart beat
messages.
 Data Nodes serve read and write requests of clients on
HDFS files and also perform block creation, replication
and deletions.

 Is Namenode machine same as datanode machine


as in terms of hardware
 It depends upon the cluster we are trying to create. The Hadoop VM can be on the same machine or on another machine. For instance, in a single-node cluster there is only one machine, whereas in a development or testing environment the Namenode and datanodes are on different machines.

 What is a Secondary Namenode


 The Secondary NameNode is a helper to the primary
NameNode. Secondary NameNode is a specially dedicated
node in HDFS cluster whose main function is to take
checkpoints of the file system metadata present on
namenode. It is not a backup namenode and doesn’t act as
a namenode in case of primary namenode’s failures. It just
checkpoints namenode’s file system namespace.

 What is a Checkpoint Node


 It is an enhanced secondary namenode whose main
functionality is to take checkpoints of namenode’s file
system metadata periodically. It replaces the role of
secondary namenode. Advantage of Checkpoint node over
the secondary namenode is that it can upload the result of
merge operation of fsimage and edits log files while
checkpointing.

 What is a checkpoint


 During Checkpointing process, fsimage file is merged with
edits log file and a new fsimage file will be created which is
usually called as a checkpoint.

 What is a daemon
 Daemon is a process or service that runs in background. In
general, we use this word in UNIX environment. Hadoop
or Yarn daemons are Java processes which can be verified
with jps command.

 What is a heartbeat in HDFS


 A heartbeat is a signal a node sends to indicate that it is alive. A datanode sends heartbeats to the Namenode and a task tracker sends its heartbeats to the job tracker. If the Namenode or job tracker does not receive heartbeats, they will decide that there is some problem with the datanode, or that the task tracker is unable to perform the assigned task.

 Do the Namenode and Resource Manager run on the same host
 No, in practical environment, Namenode runs on a
separate host and Resource Manager runs on a separate
host.

 What is the communication mode between


namenode and datanode
 Datanodes communicate with the namenode through Hadoop RPC over TCP (heartbeats and block reports); SSH is used only by the cluster start/stop scripts to launch daemons on remote nodes.

 If we want to copy 20 blocks from one machine to


another, but another machine can copy only 18.5
blocks, can the blocks be broken at the time of
replication
 In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the master node will figure out the actual amount of space required, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.

 How indexing is done in HDFS


 Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS keeps storing the last part of the data, which indicates where the next part of the data is located. In fact, this is the basis of HDFS.

 If a datanode is full, then how is it identified

 When data is stored in datanode, then the metadata of that


data will be stored in the Namenode. So Namenode will
identify if the data node is full.

 If datanodes increase, then do we need to upgrade the Namenode

 While installing the Hadoop system, the Namenode capacity is determined based on the size of the cluster. Most of the time we do not need to upgrade the Namenode, because it does not store the actual data but just the metadata, so such a requirement rarely arises.

 Why is reading done in parallel and writing not in HDFS

 Reading is done in parallel because by doing so we can access the data fast. But we do not perform write operations in parallel, because doing so might result in data written by one node being overwritten by another.

 For example, if we have a file and two nodes are trying to write data into the file in parallel, then the first node does not know what the second node has written and vice versa. So, this makes it ambiguous which data should be stored and accessed.

 What is fsimage and edit log in hadoop

 Fsimage is a file which contains file names, file


permissions and block locations of each file, and this file is
maintained by Namenode for indexing of files in HDFS.
We can call it as metadata about HDFS files. The fsimage
file contains a serialized form of all the directory and file
inodes in the filesystem.

 EditLog is a transaction log which contains records for


every change that occurs to file system metadata.

 Note: Whenever a NameNode is restarted, the latest status


of FsImage is built by applying edits records on last saved
copy of FsImage.
 After a restart of the namenode, Mapreduce jobs that worked fine before the restart started failing. What could be wrong
 The cluster could be in a safe mode after the restart of a
namenode. The administrator needs to wait for namenode
to exit the safe mode before restarting the jobs again. This
is a very common mistake by Hadoop administrators.

 What do you always have to specify for a


MapReduce job
• The classes for the mapper and reducer.
• The classes for the mapper, reducer, and combiner.
• The classes for the mapper, reducer, partitioner,
and combiner.
• None; all classes have default implementations.

 How many times will a combiner be executed


• At least once.
• Zero or one times.
• Zero, one, or many times.
• It’s configurable.

 You have a mapper that for each key produces an


integer value and the following set of
 reduce operations
 Reducer A: outputs the sum of the set of integer values.
 Reducer B: outputs the maximum of the set of values.
 Reducer C: outputs the mean of the set of values.
 Reducer D: outputs the difference between the largest and
smallest values
 in the set.

 Which of these reduce operations could safely be


used as a combiner
- All of them.
- A and B.
- A, B, and D.
- C and D.
- None of them.
 Explanation: Reducer C cannot be used because if such
reduction were to occur, the final reducer could receive
from the combiner a series of means with no knowledge of
how many items were used to generate them, meaning the
overall mean is impossible to calculate.
 Reducer D is subtle as the individual tasks of selecting a
maximum or minimum are safe for use as combiner
operations. But if the goal is to determine the overall
variance between the maximum and minimum value for
each key, this would not work. If the combiner that
received the maximum key had values clustered around it,
this would generate small results; similarly for the one
receiving the minimum value. These sub ranges have little
value in isolation and again the final reducer cannot
construct the desired result.

 What is Uber task in YARN


 If the job is small, the application master may choose to run the tasks in the same JVM as itself, since it judges that the overhead of allocating new containers and running tasks in them outweighs the gain of running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.)
 Such a job is said to be uberized, or run as an uber task.
 What are the ways to debug a failed mapreduce job
 Commonly there are two ways.
 By using mapreduce job counters
 YARN Web UI for looking into syslogs for actual error
messages or status.
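 Counters are the lightweight option: a task can bump a named counter and the totals show up in the job client output and the YARN web UI. A small sketch (the mapper, counter name and record check are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AuditMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // Hypothetical counter used purely for illustration.
    enum Quality { MALFORMED_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().split(",").length < 3) {
            // Count the bad record instead of failing the whole task.
            context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
            return;
        }
        context.write(value, NullWritable.get());   // pass good records through
    }
}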

 What is the importance of heartbeats in


HDFS/Mapreduce Framework
 A heartbeat in a master/slave architecture is a signal a slave sends to indicate that it is alive. A datanode sends heartbeats to the Namenode, and node managers send their heartbeats to the Resource Manager, to tell the master node that they are still alive.
 If the Namenode or Resource Manager does not receive heartbeats from a slave node, it will decide that there is some problem with that data node or node manager and that it is unable to perform the assigned tasks; the master (namenode or resource manager) will then reassign the same tasks to other live nodes.

 Can we rename the output file


 Yes, we can control the names of the output files by using the MultipleOutputs class.

 What are the default input and output file formats


in Mapreduce jobs
 If the input or output file formats are not specified, then the default text formats (TextInputFormat and TextOutputFormat) are used.

 What is side data distribution in Mapreduce


framework
 The extra read-only data needed by a mapreduce job to
process the main data set is called as side data.
 There are two ways to make side data available to all the
map or reduce tasks.
• Job Configuration
• Distributed cache
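 For the first option, small pieces of side data can travel inside the job configuration itself; a minimal sketch (the property name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SideDataViaConfiguration {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Small side data is set as an ordinary configuration property.
        conf.set("example.lookup.currency", "USD");
        return Job.getInstance(conf, "side data via configuration");
        // A mapper or reducer reads it back with:
        //   context.getConfiguration().get("example.lookup.currency");
    }
}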
 What is Distributed Cache in Mapreduce
 Distributed cache mechanism is an alternative way of side
data distribution by copying files and archives to the task
nodes in time for the tasks to use them when they run.
 To save network bandwidth, files are normally copied to
any particular node once per job.
 How to supply files or archives to mapreduce job in
distributed cache mechanism
 The files that need to be distributed can be specified as a
comma-separated list of URIs as the argument to the -files
option in hadoop job command. Files can be on the local
file system, on HDFS.
 Archive files (ZIP files, tar files, and gzipped tar files) can
also be copied to task nodes by distributed cache by using -
archives option. these are un-archived on the task node.
 The -libjars option will add JAR files to the classpath of
the mapper and reducer tasks.

• $ hadoop jar example.jar ExampleProgram -files


Inputpath/example.txt input/filename /output/
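 The same can be done from the driver code; a hedged sketch using the Job API (the paths are illustrative):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileSetup {
    public static Job withCachedLookup() throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed cache demo");
        // Java-side equivalent of the -files option; the file is copied to each
        // task node and symlinked under the name after the '#' fragment.
        job.addCacheFile(new URI("/user/hadoop/lookup/example.txt#example.txt"));
        return job;
        // A task can then open the localized copy, e.g. new java.io.File("example.txt").
    }
}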

 How distributed cache works in Mapreduce


Framework
 When a mapreduce job is submitted with distributed cache options, the node managers copy the files specified by the -files, -archives and -libjars options from the distributed cache to a local disk. The files are said to be localized at this point.
 local.cache.size property can be configured to setup cache
size on local disk of node managers. Files are localized
under the ${hadoop.tmp.dir}/mapred/local directory on
the node manager nodes.

 What will hadoop do when a task is failed in a list


of suppose 50 spawned tasks
 It will restart the map or reduce task on some other node manager, and only if the task fails more than 4 times will it kill the job. The default maximum number of attempts for map tasks and reduce tasks can be configured with the properties below in the mapred-site.xml file.
 mapreduce.map.maxattempts
 mapreduce.reduce.maxattempts
 The default value for both of the above properties is 4.

 Consider case scenario: In Mapreduce system,


HDFS block size is 256 MB and we have 3 files of
size 256 KB, 266 MB and 500 MB then how many
input splits will be made by Hadoop framework
 Hadoop will make 5 splits as follows
- 1 split for 256 KB file
- 2 splits for 266 MB file (1 split of size 256 MB
and another split of size 10 MB)
- 2 splits for 500 MB file (1 Split of size 256 MB
and another of size 244 MB)
 Why can’t we just have the file in HDFS and have the
application read it instead of distributed cache
 Distributed cache copies the file to all node managers at
the start of the job. Now if the node manager runs 10 or 50
map or reduce tasks, it will use the same file copy from
distributed cache.
 On the other hand, if a file needs to read from HDFS in the
job then every map or reduce task will access it from
HDFS and hence if a node manager runs 100 map tasks
then it will read this file 100 times from HDFS. Accessing
the same file from node manager’s Local FS is much faster
than from HDFS data nodes.

 What is the default block size in HDFS


 As of the Hadoop 2.x releases (including Hadoop 2.4.0), the default block size in HDFS is 128 MB, and prior to that (Hadoop 1.x) it was 64 MB.
 What is the benefit of large block size in HDFS
 The main benefit of a large block size is reduced seek overhead: the time to transfer a large file of multiple blocks is dominated by the disk transfer rate rather than by seek time.

 What are the overheads of maintaining too large


Block size
 Usually in Mapreduce framework, each map task operate
on one block at a time. So, having too few blocks result in
too few map tasks running in parallel for longer time
which finally results in overall slow down of job
performance.
 If a file of size 10 MB is copied onto an HDFS with a block size of 256 MB, then how much storage will be allocated to the file on HDFS
 A file which is smaller than a single block doesn't occupy a full block's worth of underlying storage. So, in this case, the file will occupy just 10 MB and not 256 MB.

 What are the benefits of block structure concept


in HDFS
• The main benefit is the ability to store very large files, which can even be larger than a single disk (node), as each file is broken into blocks and distributed across various nodes of the cluster.
• Another important advantage is simplicity of
storage management as the blocks are fixed size, it
is easy to calculate how many can be stored on a
given disk.
• Blocks replication feature is useful in fault
tolerance.

 What if we upgrade our Hadoop version in which,


default block size is higher than the current
Hadoop version's default block size. Suppose 64 MB (Hadoop 0.20.2) to 128 MB (Hadoop 2.4.0).
 All the existing files keep their blocks at 64 MB, but any new files copied onto the upgraded hadoop are broken into blocks of size 128 MB.

 What is Block replication


 Block replication is a way of maintaining multiple copies of the same block across various nodes of the cluster to achieve fault tolerance. With this, even if one of the data nodes containing the block dies, the block data can still be obtained from other live data nodes which hold a copy of the same block.

 What is default replication factor and how to


configure it
 The default replication factor in fully distributed HDFS is
3.
 This can be configured with dfs.replication in hdfs-site.xml
file at site level.
 Replication factor can be setup at file level with below
FS command.
• $ hadoop fs -setrep N /filename
 In above command ‘N’ is the new replication factor for the
file “/filename”.

 What is HDFS distributed copy (distcp)


 distcp is a utility for launching MapReduce jobs to copy large amounts of HDFS files within or between HDFS clusters.
 Syntax for using this tool.
• $ hadoop distcp hdfs://namenodeX/src
hdfs://namenodeY/dest

 What is the use of fsck command in HDFS


 The HDFS fsck command is useful for getting the file and block details of the file system. Its syntax is:

• $ hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
 Below are the command options and their purpose.
 -move Move corrupted files to /lost+found
 -delete Delete corrupted files.
 -openforwrite Print out files opened for write.
 -files Print out files being checked.
 -blocks Print out block report.
 -locations Print out locations for every block.
 -racks Print out network topology for data-node
locations.

 What is a Backup Node


 It is an extended checkpoint node that performs
checkpointing and also supports online streaming of file
system edits.
 It maintains an in memory, up-to-date copy of file system
namespace and accepts a real time online stream of file
system edits and applies these edits on its own copy of
namespace in its main memory.
 Thus, it maintains always a latest backup of current file
system namespace.

 What are the differences between Backup Node


and Checkpoint node or Secondary Namenode
• Multiple checkpoint nodes can be registered with
namenode but only a single backup node is
allowed to register with namenode at any point of
time.
• To create a checkpoint, checkpoint node or
secondary namenode needs to download fsimage
and edits files from active namenode and apply
edits to fsimage and saves a copy of new fsimage as
a checkpoint.
 A backup node, in contrast, does not need to download the fsimage and edits files from the active namenode, because it already has an up-to-date copy of the fsimage in its main memory and receives an online stream of edits from the namenode. It applies these edits to the fsimage in its own main memory and saves a copy in its local FS.
 So, checkpoint creation on a backup node is faster than on a checkpoint node or secondary namenode.
• The difference between a checkpoint node and a secondary namenode is that the checkpoint node can upload the new copy of the fsimage file back to the namenode after checkpoint creation, whereas a secondary namenode can't upload it and can only store it in its local FS.
• Backup node provides the option of running
namenode with no persistent storage but a
checkpoint node or secondary namenode doesn’t
provide such option.
• In case of namenode failures, some data loss with a checkpoint node or secondary namenode is certain, at least a minimal amount, due to the time gap between two checkpoints.
 With a backup node, data loss is not certain, since it maintains a namespace which is in sync with the namenode at any point of time.

 What is Safe Mode in HDFS


 Safe Mode is a maintenance state of NameNode during
which Name Node doesn’t allow any changes to the file
system.
 During Safe Mode, the HDFS cluster is read-only and doesn't replicate or delete blocks.
 The Name Node automatically enters safe mode during its start up and leaves it once block replication is within the minimum and maximum allowable replication limits.
 What is Data Locality in HDFS
 One of the HDFS design idea is that “Moving Computation
is cheaper than Moving data”.
 If data sets are huge, running applications on nodes where
the actual data resides will give efficient results than
moving data to nodes where applications are running.
 This concept of moving applications to data, is called Data
Locality.
 This reduces network traffic and increases speed of data
processing and accuracy of data since there is no chance of
data loss during data transfer through network channels
because there is no need to move data.

 What is a rack
 Rack is a storage area with all the datanodes put together.
These datanodes can be physically located at different
places. Rack is a physical collection of datanodes which are
stored at a single location. There can be multiple racks in a
single location.

 What is Rack Awareness


 The concept of maintaining Rack Id information by
NameNode and using these rack ids for choosing closest
data nodes for HDFS file read or writes requests is called
Rack Awareness.
 Choosing the closest data nodes for read/write requests through the rack awareness policy minimizes write cost and maximizes read speed.
 How does HDFS File Deletes or Undeletes work
 When a file is deleted from HDFS, it will not be removed
immediately from HDFS, but HDFS moves the file into
/trash directory. After certain period of time interval, the
NameNode deletes the file from the HDFS /trash
directory. The deletion of a file releases the blocks
associated with the file.
 Time interval for which a file remains in /trash directory
can be configured with fs.trash.interval property stored in
core-site.xml.
 As long as a file remains in /trash directory, the file can be
undeleted by moving the file from /trash directory into
required location in HDFS. Default trash interval is set to
0. So, HDFS Deletes file without storing in trash.

 What is a Rebalancer in HDFS


 Rebalancer is an administration tool in HDFS that balances the distribution of blocks uniformly across all the data nodes in the cluster.
 Rebalancing is done on demand only; it does not get triggered automatically.
 The HDFS administrator issues this command on request to balance the cluster
• $ hdfs balancer
 If a rebalance is triggered, the NameNode scans the entire data node list and
• when an under-utilized data node is found, it moves blocks from over-utilized data nodes (or data nodes that are not under-utilized) to this data node,
• when an over-utilized data node is found, it moves blocks from this data node to other data nodes that are under-utilized (or not over-utilized).

 What is the need for Rebalancer in HDFS


 Whenever a new data node is added to the existing HDFS
cluster or a data node is removed from the cluster then
some of the data nodes in the cluster will have more/less
blocks compared to other data nodes.
 In this unbalanced cluster, data read/write requests
become very busy on some data nodes and some data
nodes are under utilized.
 In such cases, to make sure the space on all data nodes is uniformly utilized for block distribution, rebalancing is done by the Hadoop administrator.

 When will be a cluster in balanced status


 A cluster is in a balanced status when, % of space used in
each data node is within limits of Average % of space used
on data nodes +/- Threshold size .
 Percentage space used on a data node should not be less
than Average % of space used on data nodes – Threshold
size.
 Percentage space used on a data node should not be
greater than Average % of space used on data nodes +
Threshold size.
 Here the threshold size is a configurable value, which is 10% of used space by default.
 What is Hadoop Streaming
 Streaming is a generic API that allows programs written in
virtually any language to be used as Hadoop Mapper and
Reducer implementations
 What is the characteristic of the streaming API that makes it flexible enough to run mapreduce jobs in languages like Perl, Ruby, Awk etc.
 Hadoop Streaming allows us to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

 What is a metadata
 Metadata is the information about the data stored in data
nodes such as
 location of the file, size of the file and so on.

 What is the lowest granularity at which you can apply a replication factor in HDFS
– We can choose a replication factor per directory – True
– We can choose a replication factor per file in a directory – True
– We can choose a replication factor per block of a file – False
 What happens if you get a ‘connection refused
java exception’ when you type hadoop fs -ls /
 It could mean that the Namenode is not working on our
hadoop cluster.

 Can we have multiple entries in the master files


 Yes, we can have multiple entries in the Master files.

 Why do we need a password-less SSH in Fully


Distributed environment
 We need a password-less SSH in a Fully-Distributed
environment because when the cluster is LIVE and
running in Fully Distributed environment, the
communication is too frequent. The Resource
Manager/Namenode should be able to send a task to Node
manager /Datanodes quickly.

 Does this lead to security issues


 No, not at all. A Hadoop cluster is an isolated cluster and generally it has nothing to do with the internet. It has a different kind of configuration, so we needn't worry about that kind of security breach, for instance, someone hacking in through the internet. Hadoop also implements Kerberos as a strong security mechanism for connecting to other machines to fetch and process data.

 Which port does SSH work on


 SSH works on Port 22, though it can be configured. 22 is
the default Port number.

 How can we create an empty file in HDFS


 We can create empty file with hadoop fs -touchz
command. It creates a zero byte file.
• $ hadoop fs -touchz /user/hadoop/filename

 How can we see the contents of a Snappy


compressed file or SequenceFile via command
line
 With the help of $ hadoop fs -text /user/hadoop/filename
command we can see the contents of sequencefile or any
compressed format file in text format.

 How can we check the existence of a file in HDFS


 The hadoop test is used for file test operations. The
syntax is shown below:
• hadoop fs -test -[ezd] URI
 Here "e" checks the existence of a file, "z" checks whether the file is zero length, and "d" checks whether the path is a directory. If the check succeeds, the test command returns 0, otherwise a non-zero value.
 How can we set the replication factor of directory
via command line
 Hadoop setrep is used to change the replication factor of a file. Use the -R option to change the replication factor recursively.
• $ hadoop fs -setrep -R -w 4 /user/hadoop/dir

 How can we apply the permission to all the files


under a HDFS directory recursively
 Using $ hadoop fs -chmod -R 755 /user/hadoop/dir
command we can set the permissions to a directory
recursively.

 What is hadoop fs -stat command used for


 Hadoop stat returns the stats information of a file. It
returns the last updated date and time. The syntax of stat
is shown below:

• $ hadoop fs -stat /filename
 2014-09-28 18:34:06

 What is the expunge command in HDFS and what is it used for
 Hadoop fs expunge command is used to empty the trash
directory in HDFS. Syntax is:

• $ hadoop fs -expunge
 How can we see the output of a MR job as a single
file if the reducer might have created multiple
part-r-0000* files
 We can use hadoop fs -getmerge command to combine all
the part-r-0000* files into single file and this file can be
browsed to view the entire results of the MR job at a time.
Syntax is:

• hadoop fs -getmerge <hdfs-src> <local-destination> [addnl]
 The addnl option is for adding new line character at the
end of each file.

 Which of the following is most important when


selecting hardware for our new Hadoop cluster
 The number of CPU cores and their speed.
 The amount of physical memory.
 The amount of storage.
 The speed of the storage.
 It depends on the most likely workload.
 Answer 5 – Though some general guidelines are possible
and we may need to generalize whether our cluster will be
running a variety of jobs, the best fit depends on the
anticipated workload.

 Why would you likely not want to use network


storage in your cluster
 Because it may introduce a new single point of failure.
 Because it most likely has approaches to redundancy and
fault-tolerance that may
 be unnecessary given Hadoop’s fault tolerance.
 Because such a single device may have inferior
performance to Hadoop’s use of
 multiple local disks simultaneously.
 All of the above.
 Answer 4: Network storage comes in many flavors but in
many cases we may find a large Hadoop cluster of
hundreds of hosts reliant on a single (or usually a pair of)
storage devices. This adds a new failure scenario to the
cluster and one with a less uncommon likelihood than
many others. Where storage technology does look to
address failure mitigation it is usually through disk-level
redundancy.

 We will be processing 10 TB of data on our cluster.


Our main MapReduce job processes financial
transactions, using them to produce statistical
models of behavior and future forecasts. Which of
the following hardware choices would be our first
choice for the cluster
 20 hosts each with fast dual-core processors, 4 GB
memory, and one 500 GB
 disk drive.
 30 hosts each with fast dual-core processors, 8 GB
memory, and two 500 GB
 disk drives.
 30 hosts each with fast quad-core processors, 8 GB
memory, and one 1 TB disk drive.
 40 hosts each with 16 GB memory, fast quad-core
processors, and four 1 TB
 disk drives.
 Answer 3. Probably! We would suggest avoiding the first
configuration as, though it has just enough raw storage
and is far from under powered, there is a good chance the
setup will provide little room for growth. An increase in
data volumes would immediately require new hosts and
additional complexity in the MapReduce job could require
additional processor power or memory.
 Configurations B and C both look good as they have
surplus storage for growth and provide similar head-room
for both processor and memory. B will have the higher
disk I/O and C the better CPU performance. Since the
primary job is involved in financial modelling and
forecasting, we expect each task to be reasonably
heavyweight in terms of CPU
 and memory needs. Configuration B may have higher I/O
but if the processors are running at 100 percent utilization
it is likely the extra disk throughput will not be used. So
the hosts with greater processor power are likely the better
fit.
 Configuration D is more than adequate for the task and we
don’t choose it for that very reason; why buy more capacity
than we know we need

 What is the difference between an HDFS Block


and Input Split
 HDFS Block is the physical division of the data and Input
Split is the logical division of the data.

 What is keyvaluetextinputformat
 In KeyValueTextInputFormat, each line in the text file is a record. The first separator character (a tab by default) divides each line: everything before the separator is the key and everything after the separator is the value. Both key and value are of type Text.
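 A hedged sketch of wiring this input format into a job, with an illustrative non-default separator:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Override the default tab separator (the "," here is just an example).
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        Job job = Job.getInstance(conf, "key-value input demo");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}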

 Why we cannot do aggregation (addition) in a


mapper Why we require reducer for that
 We cannot do aggregation (addition) in a mapper because sorting and grouping by key are not done on the mapper side; they happen only on the reducer side. Each mapper instance is initialized per input split and sees only a subset of the records for any given key, so while aggregating it would lose the values processed by the other mappers. The reducer, by contrast, receives all the values for a given key grouped together, which is why aggregation is done there.

 What is the need for serialization in Mapreduce


 Below are the two reasons serialization is needed in Hadoop.
• In a Hadoop cluster, data is stored only in binary stream format; object-structured data can't be stored directly on hadoop data nodes.
• Only binary stream data can be transferred across data nodes in a hadoop cluster. So, serialization is needed to convert object-structured data into binary stream format.

 How does the nodes in a hadoop cluster
communicate with each other
• Inter-process communication between nodes in a hadoop cluster is implemented using Remote Procedure Calls (RPC).
 Inter-process communication happens in the three stages below.
• The RPC protocol uses serialization to convert the message from the source data node into binary stream data.
• The binary stream data is transferred to the remote destination node.
• The destination node then uses de-serialization to convert the binary stream data back into object-structured data and then reads it.

 What is the Hadoop in built serialization


framework
 Writables are Hadoop's own serialization format, which serializes the data into a compact size and ensures fast transfer across nodes. Writables are written in Java and supported only by Java.

 What is Writable and its methods in hadoop


library
 Writable is an interface in the hadoop library and it provides the below two methods for serializing and de-serializing data:
 void write(DataOutput out) – serializes the object to the output stream.
 void readFields(DataInput in) – deserializes the object from the input stream.

 What is partitioner in Map Phase


 Partitioner takes the intermediate output from mappers as input and splits it into partitions. Each partition is fed separately into a reducer. The keys are partitioned in such a way that all records for any given key end up in a single partition.
 The partitioned data is written to the local file system for each map task and is then transferred to its respective reducer.
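 For illustration, a hypothetical custom partitioner (not part of Hadoop; the default is HashPartitioner) that routes records by the first letter of the key:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All keys starting with the same letter land in the same partition,
// so a single reducer sees every record for a given key.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        int first = Character.toLowerCase(key.toString().charAt(0));
        return (first % numPartitions + numPartitions) % numPartitions;  // always in range
    }
}

 It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class);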

 Is data locality optimization possible in the reduce phase

 No. Reduce tasks cannot generally be started on the nodes where the map outputs are located, because reduce tasks are usually far fewer in number than map tasks and sometimes a single reducer is required to process the output of all the map tasks.
 So, map outputs need to be transferred to the nodes on which the reduce tasks get executed.
 What is copy phase in Reduce tasks
 In Mapreduce framework, the map tasks may finish at
different times, but the reduce tasks start copying map
task outputs as soon as each map task completes. This is
known as the copy phase of the reduce task.
 The reduce task has five copier threads by default so that it
can fetch map outputs in parallel, but this number can be
changed by setting the mapred.reduce.parallel.copies
property.

 Is it possible that a Job has 0 reducers


 Yes, It is legal to set the number of reduce-tasks to zero if
no reduction is desired.

 What happens if number of reducers are 0


 In this case, the output of the map tasks go directly onto
the HDFS output directory. But the framework does not
sort the map outputs before writing them out to the output
directory.

 What is the default input split size in Mapreduce


Job
 The default input split size in a Mapreduce job is equal to the size of the HDFS block, which is 128 MB as of the hadoop-2.4.0 release. Each of these input splits is processed by a separate map task.

 What are the advantages/disadvantages of small


input split size
 In the Mapreduce framework, map tasks are executed in parallel to process these input splits. If the splits are small, the processing will be better load-balanced, since a faster node will be able to process proportionally more splits over the course of the job than a slower node.
 But if the splits are much smaller than the default HDFS block size, then managing the splits and creating the map tasks becomes an overhead that dominates the job execution time.
 Using command line in Linux, how will we see all
jobs running in the hadoop cluster and how will
we kill a job
• $ hadoop job -list
 $ hadoop job -kill jobid

 Is it possible to provide multiple input to Hadoop


If yes then how can we give multiple directories as
input to the Hadoop job
 Yes. FileInputFormat provides methods to add multiple directories as input to a Hadoop job, through the FileInputFormat.addInputPath() method.
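 A small sketch (directory names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultipleInputDirs {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple inputs demo");
        // Each call appends another input path to the job.
        FileInputFormat.addInputPath(job, new Path("/data/sales/2014"));
        FileInputFormat.addInputPath(job, new Path("/data/sales/2015"));
        // A comma-separated list of paths also works.
        FileInputFormat.addInputPaths(job, "/data/returns/2014,/data/returns/2015");
        return job;
    }
}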

 Whether Mapreduce Job mapper output files are


replicated
 No. Mapper output files (part-m-00000) are stored on
local file system of data nodes and these are not replicated
to provide fault tolerance as these files exist only during
job execution. Once the mapreduce job is completed these
mapper output files will be flushed out.

 Whether Mapreduce Job reducer output file


blocks are also replicated If yes how many copies
are maintained
 Yes. Since Mapreduce job reducer output files (part-r-00000) are stored on HDFS instead of on the local FS as the mapper output is, each block of the reducer output files is maintained in 3 copies, which is equal to the default replication factor. For each HDFS block of the reduce output, the first replica is stored on the local node, with the other replicas being stored on off-rack nodes.

 How is the number of map tasks required to run a mapreduce job determined
 A Mapreduce programmer can't specify the number of map tasks to be instantiated in a mapreduce job. The number of map tasks is decided by the input data in the map phase, i.e. the number of input splits from the input file.
 Each split (which is generally of one block size) is processed by a separate map task. So, the total number of map tasks is decided by the number of input splits.
 How to compress the output from a mapreduce Job
 Output from a mapreduce job can be compressed by setting the below two properties in mapred-site.xml configurations.
 mapreduce.output.compress property to true
 mapred.output.compression.codec property to the
classname of the compression codec we want to use
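 The same effect can be achieved from the driver with the FileOutputFormat helpers; gzip is chosen below purely as an example codec:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputCompressionSetup {
    public static void enable(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}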

 Describe Jobtrackers and Tasktrackers.


 There is one jobtracker for many tasktrackers. The jobtracker schedules tasks on the tasktrackers, reschedules failed tasks and holds a record of overall progress.
 What is a good split size What happens when a
split is too small
 64MB or the size of an HDFS block.
 A larger split would have to use network bandwidth as well as local data, because the split would span more than a single block.
 A smaller split would create overhead in managing all the
splits and metadata associated with task creation.

 What are the three possibilities of Map task/HDFS block locality
 Data local
 Rack local
 Off-rack

 Why do map tasks write their output to local disk instead of HDFS
 Output from a map task is temporary and would be
overkill to store in HDFS. If a map task fails, the mapper is
re-run so there is no point in keeping the intermediate
data.

 True or False: Input of a reduce task is output from all map tasks so there is no benefit of data locality
 TRUE
 True or False: The number of reduce tasks is
governed by the size of the input.
 False

 What is "the shuffle"


 The data flow between map and reduce tasks.

 What does the Combiner do


 It can reduce the amount of data transferred between
mapper and reducer. Combiner can be an instance of the
reducer class.
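 For example, a sum-style job can reuse its reducer as the combiner (valid only because summing is commutative and associative); IntSumReducer ships with Hadoop's mapreduce library:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerSetup {
    public static void configure(Job job) {
        job.setReducerClass(IntSumReducer.class);
        job.setCombinerClass(IntSumReducer.class);  // runs on map-side partial output
    }
}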

 What are the 6 key points of HDFS Design


 Storing very large files
 Streaming data access - A write once, read many pattern.
The time it takes to read the full data set is more
important than the latency of reading the first record
 Commodity Hardware - Designed to carry on working
through node failures which are higher on commodity
hardware
 Low-latency data access - Optimized for high throughput
at the expense of latency
 Lots of small files - not ideal: the Namenode stores filesystem metadata in memory, therefore the maximum number of files is governed by the amount of memory on the Namenode
 Multiple writers, arbitrary file modifications - not supported: HDFS files are written by a single writer and writes must always be made at the end of the file

 True or False: A file in HDFS that is smaller than a single block will occupy a full block's worth of underlying storage
 False, unlike a file system for a single disk, it does not

 Why are HDFS blocks so large


 To minimize the cost of seeks

 What is a Namenode
 The master node that manages the file system namespace.
It maintains the file system tree and metadata for all the
files/directories within. Keeps track of the location of all
the datanodes for a given file.

 Are block locations persistently stored


 No, Block locations are reconstructed from datanodes
when the system starts

 What are the two files the Namenode stores data in
 namespace image file (fsimage)
 edit log

 What is it called when an administrator brings a namenode down manually for routine maintenance
 graceful failover

 What does dfs.replication =1 mean


 There would be only one replica of each block. Typically we aim for at least 3.

 What does fs.default.name do


 Sets the default filesystem for Hadoop. If set, you do not need to specify the filesystem explicitly when you use -copyFromLocal via the command line. ex: fs.default.name = hdfs://localhost/

 How would you list the files in the root directory of the local filesystem via command line
 hadoop fs -ls file:///

 True or False: As long as dfs.replication.min replicas are written, a write will succeed.
 True, (default = 1) The namenode can schedule further
replication afterward.

 What is Hadoop's default replica placement


 1st - on the same node as the client
 2nd - on a node on a different rack
 3rd - on the same rack as the 2nd, but on a different node

 Any content written to a file is not guaranteed to be visible, even if the stream is flushed. Why
 Once more than a block's worth of data has been written,
the first block will be visible to the readers. The current
block is always invisible to new readers and will display a
length of zero.

 What is Apache Flume


• What is a sample use-case
 A system for moving large quantities of streaming data
into HDFS.
• use case: Collecting log data from one system and
aggregating it into HDFS for later analysis.

 What is Sqoop used for (use case)


 Bulk imports of data into HDFS from structured datastores such as relational databases
 use case: An organization runs nightly Sqoop imports to load the day's data into the Hive data warehouse for analysis

What is distcp (use case)


How do you run distcp
What are its options (2)
A Hadoop program for copying large amounts
of data to and from the Hadoop Filesystem in
parallel.

 use case: Transferring data between two HDFS clusters


 hadoop distcp hdfs://namenode1/foo
hdfs://namenode2/bar
 -overwrite - overwrite files at the destination (without this option, distcp skips files that already exist)
 -update - update only the files that have changed

 Can distcp work on two different versions of Hadoop
 Yes, you would have to use hftp (or the newer webhdfs)
 ex: hadoop distcp webhdfs://namenode1:50070/foo
webhdfs://namenode2:50070/bar

 How is distcp implemented


 As a mapreduce job with the copying being done by the
maps and no reducers. Each file is copied by a single map,
distcp tries to give each map the same amount of data.

 What is the Hadoop Archive (HAR files)


- How do you use them
- How do you list the file contents of a .har
- What are the limitations of Hadoop
Archives
• It is a file archiving facility that packs files into
HDFS blocks more efficiently. They can be used as
an input to a mapreduce job. It reduces namenode
memory usage, while allowing transparent access
to files.
• hadoop archive -archiveName files.har /my/files
/my
• hadoop fs -lsr har:///my/files.har
• 1. Creates a copy of the files (disk space usage)
 Archives are immutable
 No compression on archives, only files

 How do checksums work What type of hardware do you need for them
 Checksums are computed once when the data first enters
the system and again whenever it is transmitted across a
channel. The checksums are compared to check if the data
was corrupted. No way to fix the data, merely serves as
error detection.

 How do datanodes deal with checksums


 Datanodes are responsible for verifying the data they
receive before storing the data and its checksum. A client
writing data sends it to a pipeline of datanodes. The last
datanode verifies the checksum. If there is an error, the
client receives a checksum exception.
 Each datanode keeps a persistent log of checksum
verifications. (knows when each block was last verified)
 Each datanode runs a DataBlockScanner in a background
thread that periodically verifies all blocks stored on the
datanode.

 How are corrupted blocks "healed"


 If a client detects an error when reading a block, it reports
a bad block & datanode to the namenode, and throws a
ChecksumException. The namenode marks the copy as
corrupt and stops traffic to it. The namenode schedules a
copy of the block to be replicated on another datanode.
The corrupted replica is deleted.

 What are the benefits of File Compression


 Reduces the space needed to store files.
 Speeds up data transfer across the network to and from
the disk

 What is a codec
 An implementation of a compression-decompression algorithm. In Hadoop, it's represented by an implementation of the CompressionCodec interface.
 ex: GzipCodec - encapsulates the compression-decompression algorithm for gzip.
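 As a sketch, a codec can be instantiated by class name and used to wrap any output stream; the gzip codec is assumed here since it ships with Hadoop:

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecExample {
    public static void compress(OutputStream rawOut, byte[] data) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
                Class.forName("org.apache.hadoop.io.compress.GzipCodec"), conf);
        CompressionOutputStream out = codec.createOutputStream(rawOut);
        out.write(data);
        out.finish();  // flush the compressed trailer without closing the underlying stream
    }
}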

 What are the possible Hadoop compression codecs Are they supported natively or do they use a Java implementation
 DEFLATE - (Java yes, Native yes)
 org.apache.hadoop.io.compress.DefaultCodec
 gzip ( Java yes, Native yes)
 org.apache.hadoop.io.compress.GzipCodec
 bzip2 (Java yes, Native no)
 org.apache.hadoop.io.compress.BZip2Codec
 LZO (Java no, Native yes)
 com.hadoop.compression.lzo.LzoCodec
 LZ4 (Java no, Native yes)
 org.apache.hadoop.io.compress.Lz4Codec
 Snappy (Java no, Native yes)
 org.apache.hadoop.io.compress.SnappyCodec
 If you store 3 separate configs for:
• single
• pseudo-distributed
• distributed
 How do you start/stop and specify those to the
daemon
 start-dfs.sh --config path/to/config/dir
 start-mapred.sh --config path/to/config/dir

 How do you list the files running in pseudo/single/distributed mode
 hadoop fs -conf conf/hadoop-xxx.xml -ls .
 hadoop-xxx-.xml (config file for single or dist or pseudo)

 How do you run a MapReduce job on a cluster


 hadoop jar hadoop-examples.jar v3MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp

 What is Job History


 Where are the files stored
 How long are History files kept
 How do you view job history via command line
 Events and configuration for a completed job.
 The local filesystem of the jobtracker (in a history subdirectory of the logs directory), set by hadoop.job.history.location. A second copy is kept in _logs/history under the job's output directory (hadoop.job.history.user.location).
 30 days in the jobtracker location; the copy in the user location is never deleted.
 hadoop job -history

 Reduce tasks are broken down on the jobtracker web UI. What do:
 copy
 sort
 reduce
 refer to
 When map outputs are being transferred to the reducer's tasktracker.
 When the reduce inputs are being merged.
 When the reduce function is being run to produce the file
output.

 What does "hadoop fs -getmerge max-temp max-


temp-local" do
 Gets all the files specified in a HDFS directory and merges
them into a single file on the local file system.
 What are things to look for on the Tuning
Checklist (How can I make a job run faster)
 Number of Mappers
 Number of Reducers
 Combiners
 Intermediate Compression
 Custom Serialization
 Shuffle Tweaks
 A mapper should run for about a minute. Any shorter and you should reduce the number of mappers (use larger splits).
 Use slightly fewer reducers than the number of reduce slots in the cluster. This allows the reducers to finish in a single wave, using the cluster fully.
 Check whether a combiner can be used to reduce the amount of data going through the shuffle.
 Job execution time can almost always benefit from enabling map output compression.
 Use RawComparator if you are using your own custom Writable objects or custom comparators.
 Lots of tuning parameters for memory management

 True or False: It's better to add more jobs than to add more complexity to the mapper.
 TRUE
 What does the Hadoop Library Class
ChainMapper do
 Allows you to run a chain of mappers, followed by a
reducer and another chain of mappers in a single job.
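 A minimal sketch using the new (org.apache.hadoop.mapreduce) API; the identity Mapper stands in for real mapper classes, and the key/value types are illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

public class ChainExample {
    public static void configure(Job job) throws IOException {
        // First mapper in the chain.
        ChainMapper.addMapper(job, Mapper.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class,
                new Configuration(false));
        // Second mapper, fed the output of the first.
        ChainMapper.addMapper(job, Mapper.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class,
                new Configuration(false));
        // A reducer (and mappers after it) would be added with ChainReducer.
    }
}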

 What is Apache Oozie


 What are its two main parts
 What is the difference between Oozie and
JobControl
 What do action nodes do control nodes
 What are two possible types of callbacks
 A system for running workflows of dependent jobs
 (a) Workflow engine - stores and runs workflows
composed of different types of Hadoop jobs (MapReduce,
Pig, Hive)
• Coordinator engine - runs workflow jobs based on
pre-defined schedules and data availability.
 JobControl runs on the client machine submitting the jobs. Oozie runs as a service in the cluster, and clients submit workflow definitions for immediate or later execution.
 (a) Performs a workflow task such as: moving files in
HDFS, running MapReduce, Streaming or Pig jobs, Sqoop
imports, shell scripts, java programs
 (b) Governs the workflow execution using conditional logic
 (a) On workflow completion, HTTP callback to client to
inform workflow status.
 (b) receive callbacks every time a workflow enters/exits an
action node
 All Oozie workflows must have which control nodes
• start node <start to="max-temp-mr"/>
 1 end node <end name="end"/>
 1 kill node <kill name="fail"><message>MapReduce failed error...</message></kill>
 When the workflow starts it goes to the node specified in
start.
 If workflow succeeds -> end
 If workflow fails -> kill

 How do you run an Oozie workflow job


 export OOZIE_URL="http://localhost:11000/oozie"
 (tells oozie command which server to use)
 oozie job -config ch05/src.../max-temp-
workflow.properties -run
 (run - runs the workflow)
 (config - local java properties file containing definitions for
the parameters in the workflow xml)
 oozie job -info 000000009-112....-oozie-tom-W (shows
the status, also available via web url)

 What are the four entities of MapReduce1


 The client
 The jobtracker (coordinate the job run)
 The tasktrackers (running tasks)
 The distributed filesystem (sharing job files)
 At how many nodes does MapReduce1 hit scalability bottlenecks
 over 4,000 nodes

 What does YARN stand for


 Yet Another Resource Negotiator

 What are the YARN entities


 The Client - submits the MapReduce job
 resource manager - coordinates allocation of compute
resources
 node managers - launch and monitor the compute
containers on machines in the cluster
 application master - coordinates the tasks running the
MapReduce job. The application master and MapReduce
tasks run in containers that are scheduled by the resource
manager and managed by node managers.
 Distributed Filesystem

 True or False: It is possible for users to run different versions of MapReduce on the same YARN cluster.
 True, makes upgrades more manageable.
 YARN takes responsibilities of the jobtracker and
divides them between which 2 components
 Resource Manager
 Application Master

 True or False: Streaming and Pipes work the same way in MapReduce1 vs MapReduce 2.
 True, only difference is the child and subprocesses run on
the node managers not tasktrackers

 What does a Jobtracker do when it is notified of a task attempt which has failed
 How many times will a task be re-tried before job
failure
 What are 2 ways to configure failure conditions
 It will reschedule the task on a new tasktracker (node)
 4 times (default)
 mapred.map.max.attempts
 mapred.reduce.max.attempts
 If tasks are allowed to fail to a certain percentage
 mapred.max.map.failures.percentage
 mapred.max.reduce.failures.percentage
 Note: Killed tasks do not count as failures.

 How is tasktracker failure handled


 If the jobtracker doesn't receive a heartbeat from a tasktracker for 10 minutes (mapred.tasktracker.expiry.interval, in milliseconds), the jobtracker removes it from the pool. Any tasks that were running on it when it was removed have to be re-run.

 When are tasktrackers blacklisted How do blacklisted tasktrackers behave
 If more than 4 tasks from the same job fail on a particular tasktracker (mapred.max.tracker.failures), the jobtracker records this as a fault. If the number of faults is over the minimum threshold (mapred.max.tracker.blacklists, default 4), the tasktracker is blacklisted.
 They are not assigned tasks. They still communicate with
the jobtracker. Faults expire over time (1 per day) so they
will get a chance to run again. If the fault can be fixed (ex:
hardware) when it restarts it will be re-added.

 How does MR1 handle jobtracker failure


 It is a single point of failure, however it is unlikely that
particular machine will go down. After restarting, all jobs
need to be resubmitted.

 How does MR2 handle runtime exception/failure and sudden JVM exits
 Hanging tasks
 What are criteria for job failure
 The application master marks them as failed.
 The application master notices an absence of ping over
umbilical channel, task attempt is marked as failed.
 Same as MR1, same config options. A task is marked as
failed after 4 attempts or percentage map/reduce tasks
fail.

 When are node managers blacklisted By what
 If more than 3 tasks fail
(mapreduce.job.maxtaskfailures.per.tracker)
 by the application master

 How does a FIFO Scheduler work


 Typically each job would use the whole cluster, so jobs had to wait their turn. It has the ability to set a job's priority (very high, high, normal, low, very low). It will choose the highest-priority task first, but there is no preemption (once a task is running, it can't be replaced).

 How does the Fair Scheduler work


 Every user gets a fair share of the cluster capacity over
time.
 A single job running on the cluster would use full capacity.
 A short job belonging to one user will complete in a
reasonable time, even while another users long job is
running.
 Jobs are placed in pools, and by default each user gets their own pool. It's possible to create custom pools with a minimum number of slots. The Fair Scheduler supports preemption - if a pool hasn't received its fair share over time, the scheduler will kill tasks in pools running over capacity in order to give more slots to under-capacity pools.

 How do you set a MapReduce taskscheduler


 mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.FairScheduler

 How does the Capacity Scheduler work


 A cluster is made up of a number of queues, which may be hierarchical, and each queue has a capacity. Within each queue, jobs are scheduled using FIFO scheduling (with priorities). This allows users (defined by queues) to simulate separate clusters. It does not enforce fair sharing like the Fair Scheduler.

 How does the map portion of MapReduce write output
 Each map task has a circular memory buffer (100MB by default, io.sort.mb) that it writes output to.
 When the contents of the buffer reach a threshold size (80% by default, io.sort.spill.percent), a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill is taking place. If the buffer fills up before the spill is complete, the map waits.
 Spills are written round-robin to the directories specified by mapred.local.dir. Before it writes to disk, the thread divides the data into partitions based on reducer, and each partition is sorted by key. If a combiner exists, it is then run.
 Each time the memory buffer reaches the spill threshold, a new spill file is created. If there are at least 3 spill files (min.num.spills.for.combine), the combiner is run again before all spill files are merged into a single partitioned and sorted output file.
 It is a good idea to compress output (Not set by default)
 Output file partitions are made available to reducers via
HTTP. The max amount of worker threads used to serve
partitions is controlled by tasktracker.http.threads = 40
(default) (setting per tasktracker, not per map) This is set
automatically in MR2 by the number of processors on the
machine. (2 x amt of processors)

 How does the reduce side of the Shuffle work


 (Copy Phase)
 Copy Phase - after a map task completes, the reduce task
starts copying their outputs. Small numbers of copier
threads are used so it can fetch output in parallel. (default
= 5 mapred.reduce.parallel.copies)
 The output is copied to the reduce task JVM's memory (if it's small enough); otherwise, it's copied to disk. When the in-memory buffer reaches a threshold size, or reaches a threshold number of map outputs, it is merged and spilled to disk.
 mapred.job.shuffle.merge.percent
 mapred.inmem.merge.threshold
 A combiner would be run here if specified. Any map
outputs that were compressed have to be decompressed in
memory. When all map outputs have been copied we
continue to the Sort phase.
 (Sort Phase & Reduce phase)
 Sort Phase - (merge phase) This is done in rounds: the number of map outputs divided by the merge factor (io.sort.factor, default 10).
 50/10 = 5 rounds (5 intermediate files)
 Reduce Phase - the reduce function is invoked for every key of the sorted output. The result is written to the output filesystem, typically HDFS.

 What is Speculative Execution of tasks


 Hadoop detects when a task is running slower than
expected and launches another equivalent task as backup.
When a task completes successfully, the duplicate tasks
are killed. Turned on by default. It is an optimization and
not used to make tasks run more reliably.

 What are reasons to turn off Speculative Execution
 On a busy cluster it can reduce overall throughput since
there are duplicate tasks running. Admins can turn it off
and have users override it per job if necessary.
 For reduce tasks, since duplicate tasks have to transfer
duplicate map inputs which increases network traffic
 Tasks that are not idempotent. You can make tasks idempotent using OutputCommitter. Idempotent (def): applying the operation multiple times doesn't change the result.
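 To turn it off per job from the driver, something like the sketch below works; the property names shown are the MR2 ones (MR1 uses mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

import org.apache.hadoop.mapreduce.Job;

public class SpeculationOff {
    public static void configure(Job job) {
        job.getConfiguration().setBoolean("mapreduce.map.speculative", false);
        job.getConfiguration().setBoolean("mapreduce.reduce.speculative", false);
    }
}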
 Hadoop runs tasks _____________ to isolate
them from other running tasks
 in their own Java virtual machine

 How do you handle corrupt records that are failing in the mapper and reducer code
 Detect and ignore
 Abort job, throwing an Exception
 Count the total number of bad records in the jobs using
Counters to see how widespread the problem is.
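 A sketch of the detect-ignore-and-count approach inside a mapper; the counter group/name and the parsing logic are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkipBadRecordsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            String[] fields = value.toString().split(",");
            int measurement = Integer.parseInt(fields[1]);  // may throw on a corrupt line
            context.write(new Text(fields[0]), new IntWritable(measurement));
        } catch (RuntimeException e) {
            // Ignore the bad record, but count it so the extent of the problem is visible.
            context.getCounter("Quality", "CorruptRecords").increment(1);
        }
    }
}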

 What are the specs of a typical "commodity hardware" machine
 Processor - 2 quad core 2-2.5 Ghz
 Memory - 16-24GB ECC RAM (error-correcting code)
 Storage - Four 1TB SATA disks
 Network -Gigabit Ethernet

 HDFS clusters do not benefit from RAID


 True, The redundancy that RAID provides is not needed
since HDFS handles replication between nodes.
 Should machines running a namenode be 32-bit or 64-bit
 64-bit, to avoid the 3GB limit on Java heap size on 32-bit machines

 (1) True or False: Hadoop has enough Unix assumptions that it is unwise to run it on non-Unix platforms in production
 (2)True or False: For a small cluster (10 nodes) it
is acceptable to have the namenode and
jobtracker on a single machine
 TRUE
 TRUE, as long as you have a copy of the namenode metadata on a remote machine. Eventually, as the number of files grows, the namenode should be moved to a separate machine because it is a memory hog.

 What are masters and slaves files used for


 Contains a list of machine hosts names or IP addresses.
 Masters file - determines which machines should run a
secondary namenode
 Slaves file - determines which machines the datanodes and
tasktrackers are run on.
- Used only by the control scripts running on the
namenode or jobtracker

 How is the namenode machine decided


 It runs on the machine that the startup scripts were run
on.

 What does running start-dfs.sh do ( 3 steps)


 Starts a namenode on the machine the script was run on
 Starts a datanode on each machine listed in the slaves file
 Starts a secondary namenode on each machine listed in
the masters file

 What does running start-mapred.sh do (2 steps)


 Starts a jobtracker on the local machine
 Starts a tasktracker on each machine in the slaves file

 True or False: Each node in the cluster should run a datanode & tasktracker
 True

 How much memory does Hadoop allocate per daemon Where is it controlled
 1GB
 HADOOP_HEAPSIZE in hadoop-env.sh

 Which property controls the maximum number of map/reduce tasks that can run on a tasktracker at one time
 mapred.tasktracker.map.tasks.maximum (default 2)
 mapred.tasktracker.reduce.tasks.maximum (default 2)

 True or False: A good rule of thumb is to have more tasks than processors
 true

 How much memory do the namenode, secondary namenode and jobtracker daemons use by default
 1GB
 Namenode - 1 GB per million blocks storage

 How do you increase namenode memory


 HADOOP_NAMENODE_OPTS in hadoop-env.sh
 HADOOP_SECONDARYNAMENODE_OPTS
 The value specified should be a JVM option, e.g. -Xmx2000m would allocate 2GB

 Where are logs stored by default How and where should you move them
 $HADOOP_INSTALL/logs
 set HADOOP_LOG_DIR in hadoop-env.sh
 Move it outside the install path to avoid deletion during
upgrades
 What are the two types of log files
 Logs ending in .log are made by log4j and are never
deleted. These logs are for most daemon tasks
 Logs ending in .out act as a combination standard error
and standard output log. Only the last 5 are retained and
they are rotated out when the daemon restarts.

 HDFS: fs.default.name
 Takes an HDFS filesystem URI: the host is the namenode's hostname or IP and the port is the port the namenode will listen on (the property defaults to file:///; HDFS URIs default to port 8020). It specifies the default filesystem so you can use relative paths.

 HDFS: dfs.name.dir
 Specifies a list of directories where the namenode
metadata will be stored.

 HDFS: dfs.data.dir
 List of directories for a datanode to store its blocks

 HDFS: fs.checkpoint.dir
 Where the secondary namenode stores it's checkpoints of
the filesystem

 MAPRED: mapred.job.tracker
 Hostname and port the jobtracker's RPC server runs on.
(default = local)

 MAPRED: mapred.local.dir
 A list of directories where MapReduce stores intermediate
temp data for jobs (cleared at job end)

 MAPRED: mapred.system.dir
 Relative to fs.default.name where shared files are stored
during job run.

 MAPRED:
mapred.tasktracker.map.tasks.maximum
 Int (default=2) number of map tasks run on a tasktracker
at one time.

 MAPRED:
mapred.tasktracker.reduce.tasks.maximum
 Int (default=2) number of reduce tasks run on a
tasktracker at one time.

 MAPRED: mapred.child.java.opts
 String (default -Xmx200m) The JVM options used to launch the tasktracker child processes that run map and reduce tasks (can be set on a per-job basis)
 MAPRED: mapreduce.map.java.opts
 String (default -Xmx200m) The JVM options used for the child process that runs map tasks

 MAPRED: mapreduce.reduce.java.opts
 String (default -Xmx200m) The JVM options used for the child process that runs reduce tasks

 What are the following HTTP Server default ports


• mapred.job.tracker.http.address
• mapred.task.tracker.http.address
• dfs.http.address
• dfs.datanode.http.address
• dfs.secondary.http.address
• 0.0.0.0:50030
• 0.0.0.0:50060
• 0.0.0.0:50070
• 0.0.0.0:50075
• 0.0.0.0:50090

 Which settings are used for commissioning/decommissioning nodes
 Commissioning:
 dfs.hosts (datanodes)
 mapred.hosts (tasktrackers)
 Decommissioning:
 dfs.hosts.exclude (datanodes)
 mapred.hosts.exclude (tasktrackers)

 How do you change the default buffer size


 io.file.buffer.size (default 4KB) recommended 128KB
 core-site.xml

 How do you change HDFS block size


 dfs.block.size (hdfs-site.xml)
 (default 64MB recommended 128MB )

 How do you reserve storage space for non-HDFS use
 dfs.datanode.du.reserved (amount in bytes)

 How do you setup Trash


 Where do you find Trash files
 Will programmatically deleted files be put in Trash
 How do you manually take out Trash for non-
HDFS filesystems
 fs.trash.interval, set to greater than 0 in core-site.xml
 In your user/home directory in a .trash folder
 No, they will be permanently deleted
 hadoop fs -expunge

 How do you set a space limit on a user's home directory
 hadoop dfsadmin -setSpaceQuota 1t /user/username

 True or False: Under YARN, you no longer run a jobtracker or tasktrackers.
 True, there is a single resource manager running on the
same machine as the HDFS namenode(small clusters) or a
dedicated machine with node managers running on each
worker node.

 What does YARN use rather than tasktrackers


 Shuffle handlers, auxiliary services running in the node managers.

 How much memory do you dedicate to the node manager
 1GB for the node manager daemon
 + 1GB for the datanode daemon
 + extra for other running processes
 = generally around 8GB left for containers
 What happens when a container uses more
memory than allocated
 It is marked as failed and terminated by the node manager

 Where do you find the list of all benchmarks


 hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar

 How do you benchmark HDFS


 Write
 hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 (writes 10 files of 1,000 MB each)
 cat TestDFSIO_results.log in /benchmarks/TestDFSIO
 cat TestDFSIO_results.log in /benchmarks/TestDFSIO
 Read
 hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 (reads 10 files of 1,000 MB each)
 Clean
 hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
TestDFSIO -clean

 How do you benchmark mapreduce


 Write
 hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar
randomwriter random-data (Generate some data
randomly)
 Sort
 hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar
sort random-data sorted-data (runs the sort program) Can
see progress at the jobtracker web url.
 Verify Data is Sorted Correctly
 hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar
testmapredsort -sortInput random-data -sortOutput
sorted-data (returns success or failure)

 What does a newly formatted namenode look like (directory structure)
 ${ dfs.name.dir }
- current
 version
 edits
 fsimage
 fstime

 What does the namenode's VERSION file contain (4)
 It's a java properties file that contains information about
the version of HDFS running
 namespaceID - a unique identifier for the filesystem.
Namenode uses it to identify new datanodes since they will
not know it until they have registered.
 cTime = 0 Marks the creation time of the namenode's storage. It is updated from 0 to a timestamp when the filesystem is upgraded
 storageType = NAME_NODE - Indicates the storage
directory that contains data structures for the namenode.
 layoutVersion = -18 Always negative. Indicates the version
of HDFS

 What is the namenode's fsimage file


 A persistent checkpoint of filesystem metadata

 Describe the Checkpoint process. (5 steps)


 The secondary asks the primary to roll its edits file. edits
=> edits.new (on primary)
 Secondary retrieves fsimage and edits from primary
 Secondary loads fsimage into memory and applies edits, then creates a new fsimage file.
 Secondary sends the new fsimage to primary (HTTP POST)
 Primary replaces the old fsimage with the new one, and the old edits file with edits.new. Updates fstime to record the time the checkpoint was taken.

 How would an administrator run the checkpoint process manually while in safe mode
 hadoop dfsadmin -saveNamespace
 What controls the schedule for checkpointing (2)
 Every hour (fs.checkpoint.period)
 If the edits log reaches 64MB (fs.checkpoint.size)

 What is the directory structure for the secondary namenode What are the key points in its design
 ${ dfs.checkpoint.dir }
- current/
 version
 edits
 fsimage
 fstime
- previous.checkpoint/
 version
 edits
 fsimage
 fstime
 Previous checkpoint can act as a stale backup
 If the secondary is taking over, you can use -importCheckpoint when starting the namenode daemon to use the most recent checkpoint

 What is the directory structure of the Datanode


 ${ dfs.data.dir }
- current/
 version
 blk_<id_1>
 blk_<id_1>.meta
 blk_<id_2>
 blk_<id_2>.meta
 ....
 --subdir0/
 --subdir1/

 What does the Datanode's VERSION file contain (5)
 namespaceID - received from the namenode when the datanode first connects
 storageID = DS-5477177.... used by the namenode to
uniquely identify the datanode
 cTime = 0
 storageType = DATA_NODE
 layoutVersion = -18

 What are the two types of blk files in the datanode directory structure and what do they do
 HDFS blocks themselves (consist of a files raw bytes)
 The metadata for a block, made up of a header with version and type information, and a series of checksums for sections of the block.

 When does the datanode create a new subdirectory for blocks
 Every time the number of blocks in a directory reaches 64 (dfs.datanode.numblocks). This way the datanode ensures there is a manageable number of blocks spread out in different directories.

 When does Safe Mode start When does it end


 Safe Mode starts when the namenode is started. (after
loading fsimage and edit logs)
 When the minimum replication factor has been met, plus
an additional 30 seconds ( dfs.replication.min) When you
are starting a newly formed cluster it does not go into
safemode since there are no blocks in the system yet.

 What are the restrictions while being in Safe Mode (2)
 Offers only a read-only view of the filesystem to clients.
 No blocks are written to or replicated on the datanodes. This is because the namenode has references to where the blocks live on the datanodes and must collect them all (via block reports) before coordinating any instructions to the datanodes.

 What do the following Safe Mode properties do


• dfs.replication.min
• dfs.safemode.threshold.pct
• dfs.safemode.extension
• minimum number of replicas that have to be
written for a write to be successful.
• (0.999) Proportion of blocks that must meet
minimum replication before the system will exit
Safe Mode
• (30,000) Time(ms) to extend Safe Mode after the
minimum replication has been satisfied

 How do you check if you are in Safe Mode


 hadoop dfsadmin -safemode get
 or front page of HDFS web UI

 How do you set a script to run after Safe Mode is over
 hadoop dfsadmin -safemode wait #command to
read/write a file

 What is the command to enter Safe Mode


 hadoop dfsadmin -safemode enter

 What is the command to exit Safe Mode


 hadoop dfsadmin -safemode leave

 How would you make sure a namenode stays in Safe Mode indefinitely
 set dfs.safemode.threshold.pct to a value greater than 1
 What is Audit logging and how do you enable it
 HDFS can log all filesystem access requests, using log4j at the INFO level
 log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit = INFO (default WARN)

 What is dfsadmin
 Tool for finding information on the state of HDFS and
performing administrative actions on HDFS

 What do the following options for dfsadmin do


• -help
• -report
• -metasave
• -safemode
• -saveNamespace
• -refreshNodes
• -upgradeProgress
• -finalizeUpgrade
• -setQuota
• -clrQuota
• -setSpaceQuota
• -clrSpaceQuota
• -refreshServiceACL
• shows help for given command or -all
• shows filesystem statistics & info on datanodes
• Dumps info on blocks being replicated/deleted and
connected datanodes to logs
• Changes or queries to the state of Safe Mode
• Saves the current in-memory filesystem image to a new fsimage file and resets the edits file (only in safe mode)
• Updates the set of datanodes that are permitted to
connect to the namenode
• Gets info on the process of an HDFS upgrade and
forces an upgrade to proceed
• After upgrade is complete it deletes the previous
version of the namenode and datanode directories
• Sets a directory quota - a limit on the number of files/directories in the directory tree. Preserves namenode memory by preventing a large number of small files.
• Clears specified directory quotas
• Sets space quotas on directories. Limit on size of
files in directory tree.
• Clears specified space quotas
• Refreshes the namenodes service-level
authorization policy file.

 What does FileSystem check (fsck) do and what is its usage
 Checks the health of files in HDFS. Looks for blocks that
are missing from all datanodes as well as under/over
replicated blocks. fsck does a check by looking at the
metadata files for blocks and checking for inconsistencies.
 hadoop fsck / (directory to recursively search)
 Which components does fsck measure and what
do they do
 over-replicated blocks - extra replicas are automatically deleted
 under-replicated blocks - new replicas are automatically created
 mis-replicated blocks - blocks that don't satisfy the replica placement policy. They are re-replicated.
 corrupt blocks - blocks whose replicas are all corrupt.
Blocks with 1 non-corrupt replica are not marked as
corrupt.
 missing replicas - blocks with no replicas anywhere. Data
has been lost. You can specify to -move the affected files to
the lost and found directories or -delete the files (cannot
be recovered)

 How do you find which blocks are in any


particular file
 hadoop fsck /user/tom/part-0007 -files -blocks -racks

 What is the Datanode block scanner


 It verifies all blocks stored on the datanode, allowing bad blocks to be detected and fixed. The DataBlockScanner maintains a list of blocks to verify (dfs.datanode.scan.period.hours, default 504 hours, i.e. three weeks). Corrupt blocks are reported to the namenode to be fixed.
 How do you get the block verification report for a
datanode How do you get a list of blocks on the
datanode and their status
• http://datanode:50075/blockScannerReport
• http://datanode:50075/blockScannerReport?listblocks

 What is a Balancer program Where does it output


• A hadoop daemon that redistributes blocks from over-utilized datanodes to under-utilized datanodes, while still adhering to the replica placement policy. It moves blocks until the cluster is deemed "balanced".
• Output goes to the standard log directory

 How do you run the balancer


 start-balancer.sh
 -threshold - specifies the threshold percentage at which the cluster is deemed "balanced" (optional)
 Only one balancer can be run at a time

 The balancer runs until: (3)


 Cluster is balanced
 It cannot move any more blocks
 It loses contact with the Namenode

 What is the bandwidth used by the balancer


 1MB/s (dfs.balance.bandwidthPerSec) in hdfs-site.xml
 Limits bandwidth used for copying blocks between nodes.
Designed to run in the background.

 How do you get a stack trace for a component


 http://jobtracker-host:50030/stacks

 What is the difference between Metrics and Counters
 Metrics - collected by Hadoop daemons (administrators)
 Counters - are collected from mapreduce tasks and
aggregated for the job.
 The collection mechanism for metrics is decoupled from the component that receives the updates, and there are various pluggable outputs:
• local files
• Ganglia
• JMX
 The daemon collecting metrics does aggregation.

 Where can you view a component's metrics


 http://jobtracker-host:50030/metrics
- format=json (optional)
 What is Ganglia
 An open source distributed monitoring system for very
large scale clusters. Using Ganglia context you can inject
Hadoop metrics into Ganglia. Low overhead and collects
info about memory/CPU usage.

 What should be run regularly for maintenance


 fsck
 balancer

 How do you do metadata backups


 Keep multiple copies of different ages (1hr, 1day,1week)
 Write a script to periodically archive the secondary
namenodes previous.checkpoint subdir to an offsite
location
 Integrity of the backup is tested by starting a local
namenode daemon and verifying it has read fsimage and
edits successfully.

 How do you manage data backups


 Prioritize your data. What must not be lost, what can be
lost easily
 Use distcp to make a backup to other HDFS clusters
(preferably to a different hadoop version to prevent
version bugs)
 Have a policy in place for user directories in HDFS. (how
big when are they backed up)

 Describe the process of commissioning nodes (6 steps).
 Add network address of new nodes to the include file.
 Update the namenode with the new set of permitted datanodes: hadoop dfsadmin -refreshNodes
 Update the jobtracker with the new set of permitted
tasktrackers: hadoop mradmin -refreshNodes
 Update the slaves file with the new nodes
 Start the new datanode/tasktrackers
 Check that the new datanodes/tasktrackers show up in the
web UI.

 Datanodes permitted/not permitted to connect to the namenode are specified in ___________.
 Tasktrackers that may/ may not connect to the
jobtracker are specified in ___________.
 dfs.hosts / dfs.hosts.exclude
 mapred.hosts / mapred.hosts.exclude

 Describe the process of decommissioning nodes (7 steps).
 Add network address of decommissioned node to exclude
file.
 Update the namenode: hadoop dfsadmin -refreshNodes
 Update jobtracker: hadoop mradmin -refreshNodes
 Web UI - check that node status is "decommission in
progress"
 When status = "decommissioned" all blocks are replicated.
The node can be shut down.
 Remove from the include file, then
 hadoop dfsadmin -refreshNodes
 hadoop mradmin -refreshNodes
 Remove nodes from slaves file.

 If you shut down a tasktracker that is running, the jobtracker will reschedule the task on another tasktracker.
 TRUE

 A tasktracker may connect if it's in the include file and not in the exclude file.
 TRUE

 If a datanode is in the include and not in the exclude, will it connect If a datanode is in the include and exclude, will it connect
 yes
 yes, but it will be decommissioned
 Describe the upgrade process. ( 9 Steps)
 Make sure any previous upgrade is finalized before
proceeding
 Shut down MapReduce and kill any orphaned
tasks/processes on the tasktrackers
 Shut down HDFS, and back up namenode directories.
 Install new versions of Hadoop HDFS and MapReduce on
cluster and clients.
 Start HDFS with -upgrade option
 Wait until upgrade completes.
 Perform sanity checks on HDFS (fsck)
 Start MapReduce
 Roll back or finalize upgrade

 After a successful upgrade, what should you do


 remove old installation and config files
 fix any warnings in your code or config
 Change the environment variables in your path.
 HADOOP_HOME => NEW_HADOOP_HOME

 What are the steps of an upgrade (when the filesystem layout hasn't changed)
 Install new versions of HDFS and MapReduce
 Shut down old daemons
 Update config files
 Start up new daemons and use new libraries
 What should you do before upgrading
 A full disk fsck (save output and compare after upgrade)
 clear out temporary files
 delete the previous version (finalizing the upgrade)

 Once an upgrade is finalized, you can't roll back to a previous version.
 TRUE

 How do you start an upgrade


 $NEW_HADOOP_HOME/bin/start-dfs.sh -upgrade

 How do you check the progress of an upgrade


 $NEW_HADOOP_HOME/bin/hadoop dfsadmin -upgradeProgress status

 How do you roll back an upgrade


 $NEW_HADOOP_HOME/bin/stop-dfs.sh
 $OLD_HADOOP_HOME/bin/start-dfs.sh

 How do you finalize an upgrade


 $NEW_HADOOP_HOME/bin/hadoop dfsadmin -finalizeUpgrade
