Big Data Hadoop Interview Questions and Answers
These are Hadoop Basic Interview Questions and Answers for freshers and
experienced.
1. What is Big Data?
Big data is defined as the voluminous amount of structured, unstructured or semi-structured
data that has huge potential for mining but is so large that it cannot be processed using
traditional database systems. Big data is characterized by its high velocity, volume and variety, which require cost-effective and innovative methods of information processing to draw meaningful business insights. More than the volume of the data, it is the nature of the data that determines whether it is considered Big Data or not.
3. How big data analysis helps businesses increase their revenue? Give
example.
Big data analysis helps businesses differentiate themselves. For example, Walmart, the world's largest retailer in 2014 in terms of revenue, is using big data analytics to increase its sales through better predictive analytics, customized recommendations and new products launched based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue. There are many more
companies like Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase, Bank of America, etc.
using big data analytics to boost their revenue.
5. Differentiate between Structured and Unstructured data.
Data which can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, is referred to as structured data. Data which can be stored only partially in traditional database systems, for example data in XML records, is referred to as semi-structured data. Unorganized, raw data that cannot be categorized as either semi-structured or structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.
1) HDFS (Hadoop Distributed File System) - the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks and it operates on a master-slave architecture.
1) Hadoop Common
2) HDFS
3) Hadoop MapReduce
4) YARN
Data Management and Monitoring Components are - Ambari, Oozie and Zookeeper.
10. What are the most commonly defined input formats in Hadoop?
The most common InputFormats defined in Hadoop are:
TextInputFormat - this is the default input format in Hadoop; each line of the file is treated as a record.
KeyValueInputFormat - this input format is used for plain text files where each line is broken into a key and a value.
SequenceFileInputFormat - this input format is used for reading files stored in Hadoop's sequence file format.
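For example, with Hadoop Streaming the input format can be selected on the command line. The sketch below is only illustrative: the streaming jar location and the input/output paths are placeholders, and KeyValueTextInputFormat is referenced from the older org.apache.hadoop.mapred package that streaming expects -
$hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
 -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
 -input /user/test/input -output /user/test/output \
 -mapper cat -reducer cat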
Block Scanner - the Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any checksum errors. Block scanners use a throttling mechanism to limit the disk bandwidth they consume on the DataNode.
edits file - a log of the changes that have been made to the namespace since the last checkpoint.
Checkpoint Node:
The Checkpoint Node keeps track of the latest checkpoint, in a directory that has the same structure as that of the NameNode's directory. It creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
BackupNode:
The Backup Node also provides checkpointing functionality like the Checkpoint Node, but it additionally maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
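For reference, a Checkpoint Node and a Backup Node are typically started with the hdfs namenode command; the exact scripts and required configuration depend on the Hadoop version and the hdfs-site.xml settings, so treat the lines below as a sketch -
$hdfs namenode -checkpoint (starts a Checkpoint Node)
$hdfs namenode -backup (starts a Backup Node)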
Commodity hardware refers to inexpensive, widely available systems that do not offer high availability or premium quality. Commodity hardware includes RAM because there are specific services that need to execute in RAM. Hadoop can run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.
4. What is the port number for NameNode, Task Tracker and Job Tracker?
NameNode - 50070
Task Tracker - 50060
Job Tracker - 50030
1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below -
$hadoop fs -setrep -w 2 /my/test_file (test_file is the file whose replication factor will be set to 2)
2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the command below -
$hadoop fs -setrep -w 5 /my/test_dir (test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5)
NAS runs on a single machine and thus there is no data redundancy, whereas HDFS runs on a cluster of different machines and there is data redundancy because of the replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines in the cluster.
In NAS, data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because the computations are moved to the data.
8. Explain what happens if during the PUT operation, HDFS block is
assigned a replication factor 1 instead of the default value 3.
Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times each block is replicated, in order to ensure high data availability. For a replication factor of n, the cluster stores n copies of each block (the original plus n-1 duplicates). So, if the replication factor during the PUT operation is set to 1 instead of the default value of 3, only a single copy of the data will be stored. Under these circumstances, if the DataNode holding that copy crashes, the single copy of the data is lost.
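As a sketch of how this situation arises, the replication factor can be overridden at PUT time with a -D option; the file and path names below are placeholders -
$hadoop fs -D dfs.replication=1 -put sample.txt /user/test/sample.txt
$hadoop fs -stat %r /user/test/sample.txt (prints the replication factor of the stored file, here 1)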
11. What is rack awareness and on what basis is data stored in a rack?
All the DataNodes put together form a storage area, and the physical location of the DataNodes is referred to as a rack in HDFS. The rack information, i.e. the rack id of each DataNode, is acquired by the NameNode. The process of selecting closer DataNodes based on this rack information is known as rack awareness.
The contents of a file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting the NameNode, the client writes each data block to the 3 DataNodes chosen for it. For each data block, two copies exist in one rack and the third copy is placed in another rack. This is generally referred to as the Replica Placement Policy.
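The rack id of each DataNode usually comes from a topology script configured on the NameNode (the net.topology.script.file.name property in core-site.xml). As a quick check, an administrator can print the rack assignments the NameNode currently knows about -
$hdfs dfsadmin -printTopology (lists the racks and the DataNodes assigned to each of them)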
1) setup() - this method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
2) reduce() - this method is called once for every key, with its associated list of values, and performs the main reduce work.
3) cleanup() - this method is called only once, at the end of the reduce task, for clearing all the temporary files.
A new class must be created that extends the pre-defined Partitioner class.
The custom partitioner can then be set for the job either through a configuration property in the wrapper which runs the Hadoop MapReduce job (a sketch is shown below), or by calling the setPartitionerClass method on the Job object.
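As an illustrative sketch of the configuration approach, the partitioner class can be passed through a -D option when the job is launched; the jar, driver and partitioner class names below are hypothetical, and this assumes the driver uses ToolRunner so that generic options are honoured (the property name shown is the Hadoop 2.x one) -
$hadoop jar my-app.jar com.example.WordCountDriver \
 -D mapreduce.job.partitioner.class=com.example.CustomPartitioner \
 /user/test/input /user/test/output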
5. What is the relationship between Job and Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
7. What is the process of changing the split size if there is limited storage
space on Commodity Hardware?
If there is limited storage space on commodity hardware, the split size can be changed by
implementing the Custom Splitter. The call to Custom Splitter can be made from the main
method.
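Alternatively, on Hadoop 2.x the split size can also be influenced through configuration properties rather than a custom splitter. The sketch below assumes a driver that uses ToolRunner; the jar, class and path names are placeholders and the byte values are only examples -
$hadoop jar my-app.jar com.example.Driver \
 -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
 -D mapreduce.input.fileinputformat.split.minsize=33554432 \
 /user/test/input /user/test/output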
1)Shuffle
2)Sort
3)Reduce
9. What is a TaskInstance?
The actual Hadoop MapReduce tasks that run on each slave node are referred to as task instances. Every task instance runs in its own JVM process; by default, a new JVM process is spawned for each task instance.
3) If the application demands key-based access to data while retrieving it.
Zookeeper- It takes care of the coordination between the HBase Master component and the
client.
Catalog Tables - the two important catalog tables are -ROOT- and .META. The -ROOT- table tracks where the .META. table is, and the .META. table stores all the regions in the system.
Table-level operational commands in HBase are - describe, list, drop, disable and scan.
4. Explain the difference between RDBMS data model and HBase data
model.
RDBMS is a schema-based database, whereas HBase has a schema-less data model.
RDBMS does not have built-in support for partitioning, whereas HBase provides automated partitioning.
6. What are column families? What happens if you alter the block size of a
column family on an already populated database?
The logical division of data is represented through a key known as a column family. Column families consist of the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size whereas new data that comes in takes the new block size. When compaction takes place, the old data is rewritten to the new block size so that the existing data is read correctly.
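For illustration, the block size of a column family can be altered from the HBase shell; the table name, column family name and block size below are placeholders -
$echo "alter 'demo_table', {NAME => 'cf1', BLOCKSIZE => '131072'}" | hbase shell
$echo "major_compact 'demo_table'" | hbase shell (a major compaction rewrites the existing data with the new block size)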
1) Family Delete Marker - this marker marks all the columns for a column family.
sqoop import \
--connect jdbc:mysql://localhost/db \
--username root \
The process of performing an incremental data load in Sqoop is to synchronize the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop.
The incremental load can be performed using the Sqoop import command or by loading the data into Hive without overwriting it. The different attributes that need to be specified during an incremental load in Sqoop are -
1) Mode (--incremental) - the mode defines how Sqoop will determine which rows are new. The mode can have the value append or lastmodified.
2) Col (--check-column) - this attribute specifies the column that should be examined to find the rows to be imported.
3) Value (--last-value) - this denotes the maximum value of the check column from the previous import operation.
1)Append
2)Last Modified
Append should be used in the import command to insert only new rows, while lastmodified should be used to insert new rows and also pick up rows that have been updated.
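A typical incremental import in append mode looks like the sketch below; the connection string, table, column and directory names are placeholders -
$sqoop import --connect jdbc:mysql://localhost/db --username root \
 --table orders --incremental append --check-column order_id \
 --last-value 12345 --target-dir /user/test/orders_delta
(For the lastmodified mode, --check-column would instead point to a timestamp column that records when each row was last updated.)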
6. How can you check all the tables present in a single database using
Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as
follows-
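The command itself is not shown above; a typical invocation looks like the following, where the connection string and credentials are placeholders and -P prompts for the password -
$sqoop list-tables --connect jdbc:mysql://localhost/db --username root -P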
Large objects in Sqoop are handled by importing them into a file referred to as a LobFile, i.e. a Large Object File. The LobFile has the ability to store records of huge size; thus each record in the LobFile is a large object.
8. Can free form SQL queries be used with Sqoop import command? If yes,
then how can they be used?
Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.
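A sketch of a free-form query import is shown below; the table, column and directory names are placeholders. Note that the query must contain the $CONDITIONS token and that either --split-by or -m 1 must be supplied so Sqoop can parallelize (or serialize) the import -
$sqoop import --connect jdbc:mysql://localhost/db --username root \
 --query 'SELECT o.*, c.name FROM orders o JOIN customers c ON o.cust_id = c.id WHERE $CONDITIONS' \
 --split-by o.order_id --target-dir /user/test/order_report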
10. What are the limitations of importing RDBMS tables into Hcatalog
directly?
There is an option to import RDBMS tables into HCatalog directly by making use of the --hcatalog-database option together with --hcatalog-table, but the limitation is that several arguments such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.
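A direct HCatalog import typically looks like the sketch below; the table and database names are placeholders -
$sqoop import --connect jdbc:mysql://localhost/db --username root \
 --table employees --hcatalog-database default --hcatalog-table employees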
Source- This is the component through which data enters Flume workflows.
Client - the component that transmits the event to the source that operates with the agent.
Apache Flume has two sinks for HBase - HBaseSink (org.apache.flume.sink.hbase.HBaseSink), which supports secure HBase clusters and also the novel HBase IPC that was introduced in version HBase 0.96, and AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink), which has better performance than the HBase sink as it can easily make non-blocking calls to HBase.
Working of the HBaseSink
In HBaseSink, a Flume event is converted into HBase Increments or Puts. The serializer implements the HBaseEventSerializer interface and is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume event into the HBase Increments and Puts that are sent to the HBase cluster.
Working of the AsyncHBaseSink
AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. The sink invokes the setEvent method and then calls the getIncrements and getActions methods, similar to the HBase sink. When the sink stops, the cleanUp method of the serializer is called.
4) Explain about the different channel types in Flume. Which channel type
is faster?
The 3 different built-in channel types available in Flume are -
MEMORY Channel - events are read from the source into memory and passed to the sink.
JDBC Channel - the JDBC channel stores the events in an embedded Derby database.
FILE Channel - the file channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
The MEMORY channel is the fastest of the three, but it carries the risk of data loss. The channel you choose depends entirely on the nature of the big data application and the value of each event.
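As an illustration of how channels are declared, the snippet below appends a memory channel and a file channel to an agent configuration file; the agent name, channel names, paths and capacity value are placeholders -
$cat >> flume-agent.conf <<'EOF'
agent1.channels = mem_ch file_ch
agent1.channels.mem_ch.type = memory
agent1.channels.mem_ch.capacity = 10000
agent1.channels.file_ch.type = file
agent1.channels.file_ch.checkpointDir = /var/flume/checkpoint
agent1.channels.file_ch.dataDirs = /var/flume/data
EOF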
3 or more independent servers collectively form a ZooKeeper ensemble and elect a master. A client connects to any one of the servers and migrates to another if that particular node fails. The ensemble of ZooKeeper nodes stays alive as long as the majority of the nodes are working. The master node in ZooKeeper is dynamically selected by consensus within the ensemble, so if the master node fails, the role of master migrates to another node which is selected dynamically. Writes are linear and reads are concurrent in ZooKeeper.
The zookeeper-client command is used to launch the command line client. If the initial prompt is hidden by log messages after entering the command, users can just hit ENTER to view the prompt.
The znodes that get destroyed as soon as the client that created them disconnects are referred to as ephemeral znodes.
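For illustration, an ephemeral znode can be created from the command line client with the -e flag; the server address, path and data below are placeholders, and the znode disappears as soon as this client session ends -
$zookeeper-client -server localhost:2181
create -e /app/live_worker "up" (run at the client prompt)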
Pig coding approach is comparatively slower than the fully tuned MapReduce
coding approach.
Read More in Detail- http://www.dezyre.com/article/-mapreduce-vs-pig-vs-hive/163
8) What is the usage of foreach operation in Pig scripts?
The FOREACH operation in Apache Pig is used to apply a transformation to each element in a data bag, so that the respective action is performed on every element to generate new data items.
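A minimal sketch using Pig's -e option is shown below; the input path and field names are placeholders -
$pig -e "A = LOAD '/user/test/students' AS (name:chararray, score:int); B = FOREACH A GENERATE name, score * 2; DUMP B;"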
Tuples - similar to a row in a table, where different items are separated by a comma. Tuples can have multiple attributes.
Release 2.4.1
4)Is it possible to change the default location of Managed Tables in Hive, if so how?
8) What is SerDe in Hive? How can you write your own custom SerDe?
9)In case of embedded Hive, can the same metastore be used by multiple users?
Or
5)What are the modules that constitute the Apache Hadoop 2.0 framework?
We hope that these Hadoop Interview Questions and Answers have pre-charged you for your next Hadoop interview. Get the ball rolling and answer the unanswered questions in the comments below. Please do! It's all part of our shared mission to ease Hadoop interviews for all prospective Hadoopers. We invite you to get involved.