Hadoop Interview Questions
HDFS is fault-tolerant because it replicates data across different DataNodes. By default, each block of data is replicated on three DataNodes, placed on different nodes (and typically different racks). If one node crashes, the data can still be retrieved from another DataNode that holds a replica.
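As a quick illustration, the replication of a file can be inspected or changed from the command line; this is a minimal sketch assuming a standard Hadoop installation, and the path is a placeholder:

    hdfs dfs -setrep -w 3 /user/hadoop/sample.txt    # set the replication factor to 3 and wait until done
    hdfs fsck /user/hadoop/sample.txt -files -blocks # report the file's blocks and their replication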
3 If you have an input file of 350 MB, how many input splits would HDFS
create and what would be the size of each input split?
By default, the HDFS block size is 128 MB, and every block except the last one is exactly 128 MB. For an input file of 350 MB, there are three input splits in total: 128 MB, 128 MB, and 94 MB (since 128 + 128 + 94 = 350).
5 How do you copy data from the local system onto HDFS?
Use the copyFromLocal or put command.
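For example, assuming a local file named sample.txt and a standard Hadoop installation, either command copies it into HDFS:

    hdfs dfs -copyFromLocal sample.txt /user/hadoop/   # source must be a local file
    hdfs dfs -put sample.txt /user/hadoop/             # put is more general and can also read from stdin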
7 What is a Combiner?
This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works
on it, and then passes its output to the reducer phase.
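To make the idea concrete, here is a toy sketch in plain Python (not the Hadoop API) of how a combiner pre-aggregates each mapper's word-count output before it is shuffled to the reducer:

    from collections import Counter

    # Hypothetical output of two map tasks: (word, 1) pairs
    map_task_1 = [("cat", 1), ("dog", 1), ("cat", 1)]
    map_task_2 = [("cat", 1), ("dog", 1)]

    def combine(pairs):
        # Runs on each mapper node: locally sums counts per word,
        # shrinking the data sent over the network to the reducer.
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())

    combined_1 = combine(map_task_1)  # [('cat', 2), ('dog', 1)]
    combined_2 = combine(map_task_2)  # [('cat', 1), ('dog', 1)]

    # The reducer applies the same aggregation to the combined outputs.
    print(combine(combined_1 + combined_2))  # [('cat', 3), ('dog', 2)]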
Common Writable data types in Hadoop include:
IntWritable
FloatWritable
LongWritable
DoubleWritable
BooleanWritable
Yes, Hadoop v2 allows more than one ResourceManager. You can run a high-availability YARN cluster with an active ResourceManager and a standby ResourceManager, where ZooKeeper handles the coordination and failover between them.
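For illustration, ResourceManager HA is typically enabled through properties in yarn-site.xml like the following; the cluster id, hostnames, and ZooKeeper addresses below are placeholders:

    yarn.resourcemanager.ha.enabled = true
    yarn.resourcemanager.cluster-id = my-yarn-cluster
    yarn.resourcemanager.ha.rm-ids = rm1,rm2
    yarn.resourcemanager.hostname.rm1 = master1.example.com
    yarn.resourcemanager.hostname.rm2 = master2.example.com
    yarn.resourcemanager.zk-address = zk1.example.com:2181,zk2.example.com:2181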
The main components of Hive architecture are:
User Interface
Metastore
Compiler
Execution Engine
External tables in Hive refer to data at an existing location outside the warehouse directory; Hive does not move or manage this data. Internal (managed) tables, by contrast, manage their data and move it into the Hive warehouse directory by default.
If one drops an external table, Hive deletes only the metadata of the table and does not change the table data present in HDFS. If one drops a managed table, the metadata along with the table data is deleted from the Hive warehouse directory.
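The difference is easiest to see in HiveQL; in this sketch the table names, columns, and HDFS path are placeholders:

    -- Managed (internal) table: Hive owns the data; DROP removes data and metadata
    CREATE TABLE managed_logs (id INT, msg STRING);

    -- External table: Hive only tracks metadata; DROP leaves /data/logs untouched
    CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/logs';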
Script file
Complex data types in Pig include:
Bag
Map
18 What are the relational operators in Pig?
COGROUP
CROSS
FOREACH
JOIN
LIMIT
SPLIT
UNION
ORDER
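As a quick illustration, here is a hypothetical Pig Latin script that uses a few of these operators; the file path and field names are made up:

    users = LOAD '/data/users.csv' USING PigStorage(',') AS (name:chararray, amount:int);
    grouped = GROUP users BY name;                          -- one bag of rows per name
    totals = FOREACH grouped GENERATE group, SUM(users.amount);
    ordered = ORDER totals BY $1 DESC;                      -- sort by the summed amount
    top5 = LIMIT ordered 5;
    DUMP top5;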
Key components of HBase architecture include:
HMaster
ZooKeeper
Column families are groups of columns that are defined during table creation. Each column family can hold many column qualifiers, and a column is addressed by combining the family and qualifier with a colon delimiter, as in family:qualifier.
scan 'table_name'
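For context, a minimal HBase shell session around that command might look like this; the table, column family, and values are placeholders:

    create 'table_name', 'cf'                         # table with one column family
    put 'table_name', 'row1', 'cf:col1', 'value1'     # write one cell
    get 'table_name', 'row1'                          # read a single row
    scan 'table_name'                                 # read all rows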
26 What are the default file formats to import data using Sqoop?
Sqoop imports data in two file formats: Delimited Text File Format (the default) and SequenceFile Format.
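For example, a hypothetical import where the connection string, credentials file, and table name are placeholders; text format is the default, and --as-sequencefile switches to SequenceFile:

    sqoop import \
      --connect jdbc:mysql://dbhost/salesdb \
      --username sqoop_user --password-file /user/hadoop/.pw \
      --table customers \
      --as-sequencefile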
Spark supports the following cluster managers:
Standalone Mode
Apache Mesos
Hadoop YARN
Kubernetes
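The cluster manager is selected with the --master flag of spark-submit; in this sketch the application file and host addresses are placeholders:

    spark-submit --master yarn app.py                       # Hadoop YARN
    spark-submit --master spark://host:7077 app.py          # Spark standalone
    spark-submit --master mesos://host:5050 app.py          # Apache Mesos
    spark-submit --master k8s://https://host:6443 app.py    # Kubernetes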
Transformations are lazy operations (such as map and filter) that define a new RDD from an existing one. Actions (such as count, collect, take, and reduce) trigger the actual computation and return a result to the driver or write it to storage.
The filter() function is used to create a new RDD by selecting the elements of an existing RDD that satisfy the predicate function passed as its argument.
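A minimal PySpark sketch, assuming an existing SparkSession named spark (created as shown in the next answer) and made-up data:

    # Keep only the even numbers; filter() builds the new RDD lazily
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
    evens = rdd.filter(lambda x: x % 2 == 0)
    print(evens.collect())  # [2, 4, 6]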
30 What is a SparkSession?
SparkSession is the unified entry point for reading data and working with DataFrames in Spark. Introduced in Spark 2.0, it consolidates the older SQLContext and HiveContext.
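A typical way to create one in PySpark; the application name and file path are placeholders:

    from pyspark.sql import SparkSession

    # getOrCreate() returns the existing session if one is already active
    spark = SparkSession.builder \
        .appName("InterviewExamples") \
        .getOrCreate()

    df = spark.read.json("/path/to/data.json")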
Lazy evaluation in Spark means that transformations are not executed until an action (such as count() or collect()) is triggered.
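A small sketch of this behavior, assuming the SparkSession above:

    rdd = spark.sparkContext.parallelize(range(1000))
    doubled = rdd.map(lambda x: x * 2)   # transformation: nothing executes yet
    total = doubled.count()              # action: the job actually runs here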
Parquet is a columnar storage file format optimized for use with big data processing frameworks like
Apache Spark. It provides efficient data compression and encoding schemes with enhanced performance
to handle complex nested data structures.
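Reading and writing Parquet in PySpark, with placeholder data and paths:

    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    df.write.mode("overwrite").parquet("/tmp/people.parquet")   # columnar, compressed on disk
    people = spark.read.parquet("/tmp/people.parquet")          # schema is preserved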
DataFrame is a distributed collection of data organized into named columns, similar to a table in a
relational database.
Benefits of DataFrames and Datasets in Spark include:
Compile-time analysis
Faster Computation
Less Memory consumption
Query Optimization
Qualified Persistent storage
Single Interface for multiple languages
The take(n) function is an action that returns the first n elements of an RDD to the driver (the local node).
The reduce() function is an action that aggregates the elements of an RDD by repeatedly applying a binary function until a single value remains.
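Both actions in a short PySpark sketch, with made-up data:

    rdd = spark.sparkContext.parallelize([5, 1, 4, 2, 3])
    print(rdd.take(3))                     # [5, 1, 4]: the first three elements
    print(rdd.reduce(lambda a, b: a + b))  # 15: pairwise sums until one value remains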