Hadoop Interview Qs
What is Hadoop
Hadoop is an open-source framework for the distributed storage and processing of large data sets across clusters of commodity hardware. Its core components are HDFS for storage and Mapreduce (with YARN for resource management from Hadoop 2 onwards) for processing.
What is Mapreduce
Mapreduce is a framework for processing big data (huge data sets) using a large number of commodity computers. It processes the data in two phases, namely the Map and Reduce phases. This programming model is inherently parallel and can easily process large-scale data on commodity hardware.
It is tightly integrated with the Hadoop distributed file system, so processing is distributed across the data nodes of the cluster.
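As an illustration (not taken from the original notes), a minimal word-count Mapper and Reducer written against the org.apache.hadoop.mapreduce API could look like the sketch below; the class and field names are assumptions made for the example.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in the input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }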
What is YARN
YARN stands for Yet Another Resource Negotiator; it is also called next-generation Mapreduce, Mapreduce 2 or MRv2.
It was introduced in the hadoop 0.23 release to overcome the scalability shortcomings of the classic Mapreduce framework by splitting the functionality of the Job tracker into a global Resource Manager and per-application Application Masters.
What is data serialization
Serialization is the process of converting object data into
byte stream data for transmission over a network across
different nodes in a cluster or for persistent data storage.
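In Hadoop, records that flow between map and reduce tasks are usually serialized through the Writable interface. A minimal sketch (the class and field names below are assumptions for illustration):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A custom value type: Hadoop turns it into a byte stream with write()
    // and rebuilds it from the stream with readFields().
    public class PageViewWritable implements Writable {
        private long timestamp;
        private int viewCount;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(timestamp);
            out.writeInt(viewCount);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            timestamp = in.readLong();
            viewCount = in.readInt();
        }
    }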
What is a combiner
A combiner is a semi-reducer in the Mapreduce framework. It is an optional component and can be specified with the Job.setCombinerClass() method.
Combiner functions are suitable for producing summary information from a large data set. Hadoop doesn't guarantee how many times a combiner function will be called for each map output key: it may be called zero, one or many times.
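In the word-count sketch above, the reducer simply sums counts, so the same class can double as the combiner. A hedged sketch of the driver wiring (the driver class name and argument handling are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            // Optional: pre-aggregate map output locally before the shuffle.
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }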
What is HDFS
HDFS is the distributed file system implemented in the Hadoop framework. It is a block-structured distributed file system designed to store vast amounts of data on low-cost commodity hardware while ensuring high-speed processing of that data.
HDFS stores files across multiple machines and maintains reliability and fault tolerance. HDFS supports parallel processing of data by the Mapreduce framework.
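As an illustration, a client can read an HDFS file through the Java FileSystem API; the file path below is a hypothetical example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);      // the filesystem named by fs.default.name
            Path file = new Path("/user/example/input.txt");  // hypothetical HDFS path
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }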
What is a NameNode
The Namenode is a dedicated machine in the HDFS cluster which acts as the master server: it maintains the file system namespace in its main memory and serves file access requests from users. The file system namespace is persisted mainly in the fsimage and edits files. Fsimage is a file which contains the file names, file permissions and the blocks that make up each file; block-to-datanode locations are rebuilt from datanode block reports rather than stored in fsimage.
Usually only one active namenode is allowed in the default HDFS architecture.
What is a DataNode
DataNodes are the slave nodes of the HDFS architecture. They store the blocks of HDFS files and periodically send block information to the namenode through heartbeat messages.
Data Nodes serve read and write requests from clients on HDFS files and also perform block creation, replication and deletion.
What is a daemon
A daemon is a process or service that runs in the background; the term comes from the UNIX environment. Hadoop and Yarn daemons are Java processes and can be listed with the jps command.
What is a rack
A rack is a physical collection of datanodes kept together in one storage area at a single location. There can be multiple racks in a single location, and different racks may be physically located at different places.
What is metadata
Metadata is the information about the data stored in the data nodes, such as the location of the file, the size of the file and so on.
How can we see the output of a MR job as a single
file if the reducer might have created multiple
part-r-0000* files
We can use the hadoop fs -getmerge command to combine all the part-r-0000* files into a single local file, which can then be browsed to view the entire result of the MR job at once.
Syntax is:
$ hadoop fs -getmerge <hdfs-output-dir> <local-destination-file>
What is KeyValueTextInputFormat
In KeyValueTextInputFormat, each line in the text file is a 'record'. The line is split at the first occurrence of the separator character (a tab by default): everything before the separator is the key and everything after it is the value, both of type Text.
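A hedged sketch of wiring this input format into a job; the separator property name follows the newer mapreduce API and should be verified against your Hadoop version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class KeyValueInputExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Split each line at the first ',' instead of the default tab
            // (property name assumed for Hadoop 2+).
            conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
            Job job = Job.getInstance(conf, "kv input example");
            job.setInputFormatClass(KeyValueTextInputFormat.class);  // keys and values are both Text
            // ... set mapper/reducer and input/output paths, then submit the job
        }
    }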
What is a Namenode
The master node that manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories within it, and keeps track of which datanodes hold the blocks of a given file.
What is a codec
An implementation of a compression-decompression algorithm. In Hadoop, it is represented by an implementation of the CompressionCodec interface.
ex: GzipCodec encapsulates the compression-decompression algorithm for gzip.
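A minimal sketch of compressing a local file with GzipCodec through the CompressionCodec interface; the file names are assumptions for the example.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class GzipCodecExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Instantiate the codec via ReflectionUtils so its Configuration is injected.
            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
            try (FileInputStream in = new FileInputStream("input.txt");  // hypothetical input
                 CompressionOutputStream out =
                         codec.createOutputStream(new FileOutputStream("input.txt.gz"))) {
                IOUtils.copyBytes(in, out, 4096, false);  // compress while copying
            }
        }
    }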
HDFS: fs.default.name
The URI of the default filesystem. For HDFS the host is the namenode's hostname or IP and the port is the port the namenode listens on (8020 by convention); the property itself defaults to file:///. Because it names the default filesystem, relative paths can be used.
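These properties normally live in core-site.xml, hdfs-site.xml and mapred-site.xml, but as an illustration (the hostname below is hypothetical) they can also be inspected or overridden through the Configuration API:

    import org.apache.hadoop.conf.Configuration;

    public class ConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();  // loads *-site.xml from the classpath
            conf.set("fs.default.name", "hdfs://namenode.example.com:8020");  // hypothetical namenode
            System.out.println(conf.get("fs.default.name"));  // effective default filesystem
        }
    }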
HDFS: dfs.name.dir
Specifies a list of directories where the namenode
metadata will be stored.
HDFS: dfs.data.dir
List of directories for a datanode to store its blocks
HDFS: fs.checkpoint.dir
Where the secondary namenode stores its checkpoints of the filesystem.
MAPRED: mapred.job.tracker
Hostname and port the jobtracker's RPC server runs on.
(default = local)
MAPRED: mapred.local.dir
A list of directories where MapReduce stores intermediate
temp data for jobs (cleared at job end)
MAPRED: mapred.system.dir
A directory, relative to fs.default.name, where shared files are stored while a job runs.
MAPRED:
mapred.tasktracker.map.tasks.maximum
Int (default = 2): the maximum number of map tasks run on a tasktracker at one time.
MAPRED:
mapred.tasktracker.reduce.tasks.maximum
Int (default = 2): the maximum number of reduce tasks run on a tasktracker at one time.
MAPRED: mapred.child.java.opts
String (e.g. -Xmx2000m): JVM options used to launch the tasktracker child processes that run map and reduce tasks (can be set on a per-job basis).
MAPRED: mapreduce.map.java.opts
String (e.g. -Xmx2000m): JVM options used for the child processes that run map tasks.
MAPRED: mapreduce.reduce.java.opts
String (e.g. -Xmx2000m): JVM options used for the child processes that run reduce tasks.
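Since these opts can be set per job, a hedged sketch of overriding them in a driver (the heap sizes are illustrative values):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JavaOptsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Give map and reduce child JVMs different heap sizes for this job only.
            conf.set("mapreduce.map.java.opts", "-Xmx1024m");
            conf.set("mapreduce.reduce.java.opts", "-Xmx2048m");
            Job job = Job.getInstance(conf, "java opts example");
            // ... set mapper/reducer and paths, then submit the job
        }
    }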
What is dfsadmin
A command-line tool for finding information on the state of HDFS and performing administrative actions on it; for example, hadoop dfsadmin -report prints cluster capacity, remaining space and datanode status.