BDA Question Answer
Question Bank
• Big data refers to the massive data sets collected from a variety of data
sources to meet business needs and reveal new insights for optimized decision making.
• "Big data" is a field that treats ways to analyze, systematically extract information
from, or otherwise deal with data sets that are too large or complex to be dealt with
by traditional data-processing application software.
• Big data generates value from the storage and processing of digital data that cannot be
analyzed by traditional computing techniques.
• It is the result of various trends such as the cloud, increased computing resources, and
data generation by mobile computing, social networks, sensors, web data, etc.
In recent years, Big Data was defined by the “3Vs”, but now there are “5Vs” of
Big Data, also termed the characteristics of Big Data, as follows:
1. Volume:
• The name ‘Big Data’ itself refers to an enormous size.
• Volume denotes the huge amount of data.
• The size of data plays a crucial role in determining its value: whether a
particular data set can actually be considered Big Data depends on its
volume.
• Hence, while dealing with Big Data, it is necessary to consider the
characteristic ‘Volume’.
• Example: In the year 2016, the estimated global mobile traffic was
6.2 exabytes (6.2 billion GB) per month. It was also projected that by the
year 2020 there would be almost 40,000 exabytes of data.
2. Velocity:
• Velocity refers to the high speed of accumulation of data.
• In Big Data, velocity describes data flowing in from sources like machines,
networks, social media, mobile phones, etc.
• There is a massive and continuous flow of data. Velocity determines the
potential of data: how fast the data is generated and processed
to meet demands.
• Sampling data can help in dealing with issues of velocity.
• Example: More than 3.5 billion searches per day are made
on Google. Also, Facebook users are increasing by approximately 22%
year over year.
3. Variety:
• It refers to the nature of data: structured, semi-structured and
unstructured.
• It also refers to heterogeneous sources.
• Variety is basically the arrival of data from new sources, both inside
and outside an enterprise. It can be structured, semi-structured or
unstructured.
• Structured data: This is basically organized data. It generally refers to
data with a defined length and format.
• Semi-structured data: This is basically semi-organized data. It is generally
a form of data that does not conform to the formal structure of data. Log
files are an example of this type of data.
• Unstructured data: This basically refers to unorganized data. It generally
refers to data that doesn’t fit neatly into the traditional row-and-column
structure of a relational database. Texts, pictures, videos, etc. are examples
of unstructured data, which can’t be stored in the form of rows and columns.
4. Veracity:
• It refers to inconsistencies and uncertainty in data, that is data
which is available can sometimes get messy and quality and
accuracy are difficult to control.
• Big Data is also variable because of the multitude of data
dimensions resulting from multiple disparate data types and
sources.
• Example: Data in bulk can create confusion, whereas too little data can
convey only half or incomplete information.
5. Value:
• After taking the 4 Vs into account, there comes one more V, which stands
for Value. Bulk data having no value is of no good to the company unless
it is turned into something useful.
• Data in itself is of no use or importance; it needs to be converted into
something valuable to extract information. Hence, one can argue that
Value is the most important of all the 5 Vs.
2. Differentiate between Traditional Data and Big Data
Ans:
Traditional Data | Big Data
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes.
Traditional data is generated per hour or per day or more. | Big data is generated far more frequently, often every second.
Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
Its data model is strict schema based and it is static. | Its data model is flat schema based and it is dynamic.
Traditional data is stable with known inter-relationships. | Big data is not stable and its relationships are unknown.
Structured Data
Structured data is the easiest to work with. It is highly organized with dimensions defined by
set parameters.
Think spreadsheets; every piece of information is grouped into rows and columns. Specific
elements defined by certain variables are easily discoverable.
It’s all your quantitative data:
• Age
• Billing
• Contact
• Address
• Expenses
• Debit/credit card numbers
Because structured data is already tangible numbers, it’s much easier for a program to sort
through and collect data.
• Structured data follows schemas: essentially road maps to specific data points. These
schemas outline where each datum is and what it means.
• Structured data is the easiest type of data to analyze because it requires little to no
preparation before processing. A user might need to cleanse data and pare it down
to only relevant points, but it won’t need to be interpreted or converted too deeply
before a true inquiry can be performed.
Example: A payroll database will lay out employee identification information, pay rate,
hours worked, how compensation is delivered, etc. The schema defines each of these
dimensions for whatever application is using it. The program won’t have to dig into the
data to discover what it actually means; it can go straight to work collecting and
processing it.
OR
An ‘Employee’ table in a database is an example of Structured Data
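As an illustration, this can be made concrete with a small sketch. The table name and values below are hypothetical; the point is that structured data lives in a fixed schema of typed columns that a program can query directly.

```python
import sqlite3

# Hypothetical 'Employee' table: structured data, where every row
# follows the same fixed schema of typed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, pay_rate REAL)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Asha", 52.5), (2, "Ravi", 48.0)],
)

# Because the schema is known, a query can go straight to the data
# without first interpreting what each field means.
rows = conn.execute("SELECT name FROM Employee WHERE pay_rate > 50").fetchall()
print(rows)  # [('Asha',)]
```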
Semi-Structured Data
• Semi-structured data toes the line between structured and unstructured. Most
of the time, this translates to unstructured data with metadata attached to it.
This can be inherent data collected, such as time, location, device ID stamp or
email address, or it can be a semantic tag attached to the data later.
• Let us understand this with an example: say you take a picture of your cat
with your phone. It automatically logs the time the picture was taken, the
GPS data at the time of capture, and your device ID. If you’re using any
kind of web service for storage, like iCloud, your account info becomes
attached to the file.
• If you send an email, the time sent, email addresses to and from, the IP address
from the device sent from, and other pieces of information are linked to the
actual content of the email.
• In both scenarios, the actual content (i.e. the pixels that compose the photo
and the characters that make up the email) is not structured, but there are
components that allow the data to be grouped based on certain
characteristics.
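The photo scenario above can be sketched as a record: the pixel content itself is unstructured, while the attached metadata gives it partial structure. All field names and values here are hypothetical.

```python
# Hypothetical semi-structured photo record: the pixel data has no
# inherent structure, but the attached metadata (time, GPS, device ID)
# is structured and can be queried like a regular field.
photo = {
    "metadata": {
        "taken_at": "2024-05-01T10:32:00Z",
        "gps": {"lat": 18.52, "lon": 73.85},
        "device_id": "phone-1234",
    },
    "pixels": "<binary blob, no inherent structure>",
}

# Grouping/filtering is possible through the metadata alone.
print(photo["metadata"]["device_id"])  # phone-1234
```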
Unstructured Data
• Nowadays organizations have a wealth of data available with them, but
unfortunately they don’t know how to derive value out of it, since this
data is in its raw, unstructured format.
From its name, one might assume that the Secondary Namenode is a backup
node, but it is not.
The Namenode holds the metadata for HDFS, such as block information and size.
This information is stored in main memory as well as on disk for persistent
storage. There are two important files that reside in the namenode’s current
directory:
• Editlogs: keeps track of each and every change to HDFS.
• Fsimage: stores a snapshot of the file system.
Any change made to HDFS gets noted in the edit logs, so that file grows, whereas
the size of the fsimage remains the same. This has no impact until we restart the
server. When we restart the server, the edit logs are merged into the fsimage file
and loaded into main memory, which takes some time. If we restart the cluster
after a long time, there will be considerable downtime, since the edit log file will
have grown. The Secondary Namenode comes to the rescue for this problem: it
periodically gets the edit logs from the namenode and applies them to its copy of
the fsimage. The new fsimage is copied back to the namenode, which uses it for
the next restart, reducing the startup time.
It is a helper node to the Namenode; to be precise, the Secondary Namenode’s
whole purpose is to perform checkpoints in HDFS, which helps the namenode
function effectively. Hence, it is also called the Checkpoint node.
The main function of the Secondary Namenode is to store the latest copy of the
FsImage and Edits Log files.
Checkpoint:-
A checkpoint is nothing but the updating of the latest FsImage file by applying
the latest Edits Log files to it. If the time gap between checkpoints is large, too
many Edits Log entries are generated, and it becomes cumbersome and
time-consuming to apply them all at once to the latest FsImage file.
The above figure shows the working of Secondary Namenode
1. It gets the edit logs from the namenode at regular intervals and applies
them to the fsimage.
2. Once it has the new fsimage, it copies it back to the namenode.
3. The namenode will use this fsimage for the next restart, which will reduce
the startup time.
▪ Hadoop keeps multiple copies for all data that is present in HDFS. If Hadoop is
aware of the rack topology, each copy of data can be kept in a different rack. By
doing this, in case an entire rack suffers a failure for some reason, the data can be
retrieved from a different rack.
▪ Replication of data blocks across multiple racks in HDFS via rack awareness is
done using a policy called the Replica Placement Policy.
▪ The policy states that “No more than one replica is placed on one node, and no
more than two replicas are placed on the same rack.”
The reasons for the Rack Awareness in Hadoop are:
• To reduce the network traffic while file read/write, which improves the cluster
performance.
• To achieve fault tolerance, even when the rack goes down.
• Achieve high availability of data so that data is available even in unfavorable
conditions.
• To reduce the latency, that is, to make the file read/write operations done with
lower delay.
What is rack awareness policy?
NameNode on multiple rack cluster maintains block replication by using inbuilt Rack
awareness policies which are:
▪ Not more than one replica be placed on one node.
▪ Not more than two replicas are placed on the same rack.
▪ Also, the number of racks used for block replication should always be smaller than
the number of replicas.
Rack awareness ensures that read/write requests go to replicas in the closest
rack or the same rack. This maximizes reading speed and minimizes writing cost.
Rack awareness also maximizes network bandwidth by keeping block transfers
within a rack.
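The placement rules above can be sketched in a few lines. This is a simplified, hypothetical model of a cluster (real HDFS also weighs node load, randomness, and topology scripts); it just shows one 3-replica layout that satisfies the policy: at most one replica per node and at most two per rack.

```python
# Sketch of HDFS's default 3-replica placement: replica 1 on the
# writer's node, replicas 2 and 3 on two different nodes of one
# other rack. Cluster layout and node names are made up.
def place_replicas(writer, cluster):
    # cluster: {rack_name: [node, ...]}; writer: (rack, node)
    writer_rack, writer_node = writer
    placements = [(writer_rack, writer_node)]          # replica 1: local node
    remote_rack = [r for r in cluster if r != writer_rack][0]
    remote_nodes = cluster[remote_rack]
    placements.append((remote_rack, remote_nodes[0]))  # replica 2: other rack
    placements.append((remote_rack, remote_nodes[1]))  # replica 3: same rack as 2
    return placements

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas(("rack1", "n1"), cluster)
print(replicas)  # [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]
```

Note how no node holds two replicas and no rack holds more than two, matching the policy stated above.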
7. Explain Speculative Execution? How Map Reduce job can be optimized using
Speculative Execution?
Ans:
1. In MapReduce, jobs are broken into tasks and the tasks are run in parallel to make
the overall job execution time smaller than it would be if the tasks ran
sequentially. Among the divided tasks, if one task takes more time than desired,
the overall execution time of the job increases.
2. Tasks may be slow for various reasons, including hardware degradation or software
misconfiguration, but the causes may be hard to detect since the tasks may still
complete successfully, albeit after a longer time than expected.
3. Apache Hadoop does not fix or diagnose slow-running tasks. Instead, it tries to
detect when a task is running slower than expected and launches another,
equivalent task as a backup (the backup task is called a speculative task). This
process is called speculative execution in MapReduce.
4. Speculative execution in Hadoop does not mean launching duplicate tasks at the
same time so they can race, as this would waste cluster resources. Rather, a
speculative task is launched only after a task has run for a significant amount of
time and the framework detects it running slowly compared to other tasks of the
same job.
5. When a task completes successfully, any duplicate tasks that are still running are
killed, since they are no longer needed.
6. If the original task finishes before the speculative task, the speculative task is killed.
7. On the other hand, if the speculative task finishes first, then the original one is
killed. Speculative execution in Hadoop is just an optimization; it is not a feature
to make jobs run more reliably.
8. To summarize: the speed of a MapReduce job is dominated by the slowest task.
MapReduce first detects slow tasks, then runs redundant (speculative) tasks, which
will optimistically commit before the corresponding stragglers. This process is
known as speculative execution. Only one copy of a straggler is allowed to be
speculated. Whichever of the two copies of a task commits first becomes the
definitive copy, and the other copy is killed by the framework.
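In practice, speculative execution can be switched on or off per task type through job configuration. To my knowledge, the property names below exist in Hadoop 2.x and later (both default to true); this is a minimal sketch of a mapred-site.xml fragment, not a complete configuration.

```xml
<!-- mapred-site.xml: enable/disable speculative execution per task type -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```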
8. What is shuffling and sorting in Map Reduce?
Ans:
1. Shuffling can start even before the map phase has finished, which saves
some time and completes the tasks sooner.
2. The keys generated by the mapper are automatically sorted by
MapReduce.
3. Values passed to each reducer are not sorted and can be in any order.
Sorting helps the reducer easily distinguish when a new reduce task
should start.
4. This saves time for the reducer: the reducer starts a new reduce task
when the next key in the sorted input data differs from the previous one.
5. Each reduce task takes key-value pairs as input and generates key-value
pairs as output.
Extra:
▪ Shuffle phase in Hadoop transfers the map output from Mapper to a
Reducer in MapReduce. Sort phase in MapReduce covers the merging
and sorting of map outputs. Data from the mapper are grouped by the
key, split among reducers and sorted by the key.
▪ The process of transferring data from the mappers to reducers is known
as shuffling i.e. the process by which the system performs the sort and
transfers the map output to the reducer as input. So, MapReduce shuffle
phase is necessary for the reducers, otherwise, they would not have any
input (or input from every mapper). As shuffling can start even before
the map phase has finished so this saves some time and completes the
tasks in lesser time.
▪ The keys generated by the mapper are automatically sorted by
MapReduce Framework. Sorting in Hadoop helps reducer to easily
distinguish when a new reduce task should start. This saves time for
the reducer. Reducer starts a new reduce task when the next key in
the sorted input data is different than the previous. Each reduce task
takes key-value pairs as input and generates key-value pair as output.
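The shuffle-and-sort step described above can be sketched in a single process: mapper output pairs are sorted by key, and the values for each key are grouped before the reducer sees them. The sample keys are made up.

```python
from itertools import groupby
from operator import itemgetter

# A minimal sketch of shuffle-and-sort: the framework sorts mapper
# (key, value) pairs by key and groups all values for the same key
# before handing them to the reducer.
map_output = [("cat", 1), ("dog", 1), ("cat", 1), ("ant", 1)]

shuffled = sorted(map_output, key=itemgetter(0))        # sort by key
reduce_input = [
    (key, [value for _, value in group])                # group values per key
    for key, group in groupby(shuffled, key=itemgetter(0))
]
print(reduce_input)  # [('ant', [1]), ('cat', [1, 1]), ('dog', [1])]
```

Because the input is sorted, the reducer can detect a new key simply by comparing it with the previous one, as point 4 above describes.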
9. What is Input Format? Write difference between HDFS Block and Input Split.
Ans:
Input Format
▪ InputFormat defines how the input files are split and read.
▪ InputFormat creates the InputSplits.
▪ Based on the splits, InputFormat defines the number of map tasks in the mapping phase.
▪ The job driver invokes the InputFormat directly to decide the number of InputSplits
and the location of map task execution.
1. Data Representation
• Block – HDFS Block is the physical representation of data in
Hadoop.
• InputSplit – MapReduce InputSplit is the logical representation of
data present in the block in Hadoop. It is basically used during
data processing in MapReduce program or other processing
techniques. The main thing to focus is that InputSplit doesn’t
contain actual data; it is just a reference to the data.
2. Size
• Block – By default, the HDFS block size is 128 MB, which you can
change as per your requirement. All HDFS blocks are the same
size except the last block, which can be either the same size or
smaller. The Hadoop framework breaks files into 128 MB blocks and
then stores them in the Hadoop file system.
• InputSplit – The InputSplit size is, by default, approximately equal to
the block size. It is user defined: in a MapReduce program the user
can control the split size based on the size of the data.
Example of Block and InputSplit in Hadoop
Suppose we need to store the file in HDFS. Hadoop HDFS stores files as
blocks. Block is the smallest unit of data that can be stored or retrieved from
the disk. The default size of the block is 128MB. Hadoop HDFS breaks files into
blocks. Then it stores these blocks on different nodes in the cluster.
For example, if we have a file of 132 MB, HDFS will break this file into
two blocks: one of 128 MB and one of 4 MB.
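The arithmetic behind this example is simply a ceiling division of the file size by the block size, which can be sketched as:

```python
import math

# Sketch: how a 132 MB file maps onto 128 MB HDFS blocks.
BLOCK_SIZE_MB = 128
file_size_mb = 132

num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)            # full + partial blocks
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB  # size of final block
print(num_blocks, last_block_mb)  # 2 4
```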
10. Illustrate the main components of the Hadoop system.
Ans:
▪ Hadoop Common – the libraries and utilities used by other Hadoop modules.
▪ Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores
data across multiple machines without prior organization.
▪ YARN – (Yet Another Resource Negotiator) provides resource management for the
processes running on Hadoop.
▪ MapReduce – a parallel processing software framework. It comprises two steps.
In the Map step, a master node takes inputs, partitions them into smaller
subproblems, and distributes them to worker nodes. After the map step has
taken place, the master node takes the answers to all of the subproblems and
combines them to produce the output. It is used to distribute work around a cluster.
Hadoop Common
▪ It consists of the Java ARchive (JAR) files and the scripts needed to start Hadoop.
▪ It requires Java Runtime Environment (JRE) 1.6 or higher version.
▪ The standard start up and shutdown script need Secure Shell (SSH) to be setup
between the nodes in the cluster.
▪ HDFS ( storage) and Map Reduce(processing) are two core components of Apache
Hadoop.
HDFS
▪ The Hadoop Distributed File System (HDFS) allows applications to run across multiple
servers. HDFS is highly fault tolerant, runs on low-cost hardware, and provides high-
throughput access to data.
▪ Java-based scalable system that stores data across multiple machines without prior
organization.
▪ Data in a Hadoop cluster is broken into smaller pieces called blocks, and then
distributed throughout the cluster.
▪ Blocks, and copies of blocks, are stored on other servers in the Hadoop cluster.
▪ That is, an individual file is stored as smaller blocks that are replicated across
multiple servers in the cluster.
▪ An HDFS cluster has two types of nodes: the NameNode (master) and DataNodes (workers).
YARN
▪ YARN is also called MapReduce 2.0; it is a software rewrite that decouples
MapReduce’s resource management and scheduling capabilities from the data
processing component.
▪ With YARN, Hadoop clusters are now able to run stream data processing and
interactive querying side by side with MapReduce batch jobs. Managing cluster
resources is done by YARN.
▪ This involves using the right amounts of RAM, CPU and disk space on the Hadoop
cluster, which should be taken care of during YARN configuration on the Hadoop
cluster.
▪ In Hadoop 1.0, the batch processing framework MapReduce was closely paired
with the Hadoop Distributed File System, and MapReduce itself handled resource
management and job scheduling on the Hadoop systems while processing and
condensing data in a parallel manner. Managing cluster resources was done by
the job tracker.
MapReduce
▪ MapReduce is mainly the data processing component of Hadoop.
▪ It is a programming model for processing large data sets.
▪ It carries out the task of data processing and distributes the particular tasks across
the nodes.
▪ It consists of two phases:
▪ Map: converts a typical dataset into another set of data where individual
elements are broken down into key-value pairs.
▪ Reduce: takes the output files from a map task as input and integrates the data
tuples into a smaller set of tuples. It is always executed after the map job is done.
▪ In between Map and Reduce, there is a small phase called shuffle and sort in
MapReduce.
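The two phases above can be sketched as a single-process word count; in a real Hadoop job these functions run on separate nodes under the framework's control, and the input string here is made up.

```python
from collections import defaultdict

# Single-process sketch of the Map and Reduce phases (word count).
def map_phase(document):
    # Map: emit a (word, 1) key-value pair for every word.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle/sort: group values by key; Reduce: sum each key's values.
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase("big data big ideas"))
print(counts)  # {'big': 2, 'data': 1, 'ideas': 1}
```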
11. What is Map Reduce Partitioner? What is need of Partitioner? How many
partitioners are there in HADOOP?
Ans:
▪ The Partitioner in MapReduce controls the partitioning of the key of the
intermediate mapper output.
▪ A hash function on the key (or a subset of the key) is used to derive the partition.
▪ The total number of partitions depends on the number of reduce tasks.
▪ Each mapper output is partitioned according to the key: records having the same
key value go into the same partition (within each mapper), and then each
partition is sent to a reducer.
▪ The Partitioner class determines which partition a given (key, value) pair will go to.
▪ The partition phase takes place after the map phase and before the reduce phase.
Need
▪ A MapReduce job takes an input data set and produces a list of key-value pairs as
the result of the map phase. Then the output from the map phase is sent to the
reduce tasks, which process the user-defined reduce function on the map outputs.
▪ Before the reduce phase, the map output is partitioned on the basis of the key
and sorted.
▪ This partitioning ensures that all the values for each key are grouped together and
that all the values of a single key go to the same reducer, thus allowing even
distribution of the map output over the reducers.
▪ Partitioner in Hadoop MapReduce redirects the mapper output to the reducer by
determining which reducer is responsible for the particular key.
How many Partitioner?
▪ The total number of Partitioners that run in Hadoop is equal to the number of
reducers i.e. Partitioner will divide the data according to the number of reducers
which is set by JobConf.setNumReduceTasks() method.
▪ Thus, the data from a single partitioner is processed by a single reducer. A
partitioner is created only when there are multiple reducers.
Poor Partitioning in Hadoop MapReduce
• If, in the input data, one key appears more often than any other key, we can use two
mechanisms to send data to partitions:
– The most frequent key is sent to its own partition.
– All other keys are sent to partitions according to their hashCode().
• But if the hashCode() method does not uniformly distribute the other keys’ data over
the partition range, then the data will not be sent evenly to the reducers.
• Poor partitioning of data means that some reducers will have more input data than
others, i.e. they will have more work to do than other reducers. So the entire job will
wait for one reducer to finish its extra-large share of the load.
• How to overcome poor partitioning in MapReduce?
To overcome poor partitioning in Hadoop MapReduce, we can create a custom partitioner,
which allows spreading the workload uniformly across different reducers.
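The default hash-based routing described above can be sketched as follows. This is a simplified Python model, not Hadoop's actual Java HashPartitioner, though the hash-modulo-reducers idea is the same; a custom partitioner would replace partition_for() with its own rule.

```python
# Sketch of hash partitioning: each key is routed to one of
# num_reducers partitions based on its hash.
def partition_for(key, num_reducers):
    # & 0x7FFFFFFF keeps the hash non-negative, mirroring Hadoop's
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks idiom.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

num_reducers = 3
keys = ["apple", "banana", "cherry", "apple"]
partitions = [partition_for(k, num_reducers) for k in keys]

# The same key always lands in the same partition (hence same reducer).
print(partitions[0] == partitions[3])  # True
```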
12. What is Map Reduce Combiner? Write advantages and dis advantages of
Map Reduce Combiner?
Ans:
• Hadoop Combiner is also known as “Mini-Reducer” that summarizes the Mapper output
record with the same Key before passing to the Reducer.
• On a large dataset, when we run a MapReduce job, large chunks of intermediate data
are generated by the Mapper, and this intermediate data is passed on to the Reducer
for further processing, which leads to enormous network congestion.
• MapReduce framework provides a function known as Hadoop Combiner that plays a key
role in reducing network congestion.
• The primary job of Combiner is to process the output data from the Mapper, before
passing it to Reducer. It runs after the mapper and before the Reducer and its use is
optional.
• Combiner: When the reduce function is both associative and commutative (e.g. sum,
max; note that an average cannot be combined directly unless counts are carried
along), some of the work of the reduce function can be assigned to the combiner.
• Instead of sending all the Mapper data to the reducer, some values are computed on
the mapper side itself using the combiner, and only then are they sent to the reducer.
• For example, if a particular word w appears k times among all the documents assigned
to the process, there will be k (word, 1) key-value pairs as a result of the Map
execution, which can be grouped into a single pair (word, k) provided to the reduce task.
Advantages of MapReduce Combiner
• Hadoop Combiner reduces the time taken for data transfer between mapper
and reducer.
• It decreases the amount of data that needs to be processed by the reducer.
• The Combiner improves the overall performance of the reducer.
Disadvantages of the Hadoop combiner in MapReduce
• MapReduce jobs cannot depend on the Hadoop combiner’s execution because
there is no guarantee that it will run.
• Hadoop may or may not execute a combiner; also, if required, it may execute it
more than once. So MapReduce jobs should not depend on the combiner’s
execution.
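The combiner's effect can be sketched as local aggregation of one mapper's output before anything crosses the network. The sample pairs are made up; the combine logic mirrors a sum reducer, which is safe because sum is associative and commutative.

```python
from collections import Counter

# Sketch of a combiner as a "mini-reducer": it sums (word, 1) pairs
# on the mapper side before they are shuffled to the reducer.
mapper_output = [("data", 1), ("big", 1), ("data", 1), ("data", 1)]

def combine(pairs):
    # Same logic as a sum reducer, applied to a single mapper's output.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

combined = combine(mapper_output)
# 4 pairs shrink to 2, cutting the data shuffled across the network.
print(sorted(combined))  # [('big', 1), ('data', 3)]
```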
13. What is role of record reader in HADOOP?
Ans:
Record Reader:
▪ It communicates with the InputSplit and converts the data into key-value pairs
suitable for reading by the mapper.
▪ By default, it uses TextInputFormat for converting data into key-value pairs.
▪ The RecordReader communicates with the InputSplit until the file reading is
completed.
▪ It assigns a byte offset (a unique number) to each line present in the file. Then
these key-value pairs are sent to the mapper for further processing.
Extra:
▪ A RecordReader converts the byte-oriented view of the input to a record-oriented
view for the Mapper and Reducer tasks for processing.
▪ To understand Hadoop RecordReader, we need to understand MapReduce Dataflow.
Let us learn how the data flow:
▪ MapReduce is a simple model of data processing. Inputs and outputs for the map
and reduce functions are key-value pairs. Following is the general form of the map
and reduce functions:
Map: (K1, V1) → list (K2, V2)
Reduce: (K2, list (V2)) → list (K3, V3)
After running setup(), nextKeyValue() is called repeatedly on the context to populate the
key and value objects for the mapper. The key and value are retrieved from the record
reader by way of the context and passed to the map() method to do its work. An input to
the map function, which is a key-value pair (K, V), gets processed as per the logic
mentioned in the map code. When the reader reaches the end of the records, the
nextKeyValue() method returns false.
A RecordReader usually stays within the boundaries created by the InputSplit to generate
key-value pairs, but this is not mandatory. A custom implementation can even read data
outside of the InputSplit, though this is not encouraged.
Types of Hadoop RecordReader
InputFormat defines the RecordReader instance in Hadoop. By default, using
TextInputFormat, the RecordReader converts data into key-value pairs. TextInputFormat
provides two types of RecordReaders, as follows:
1. LineRecordReader
It is the default RecordReader, provided by TextInputFormat. It treats each line of the
input file as a new value; the associated key is the byte offset. It always skips the first
line in the split (or part of it) if it is not the first split, and it always reads one line beyond
the boundary of the split at the end (if data is available, i.e. it is not the last split).
2. SequenceFileRecordReader
This Hadoop RecordReader reads data specified by the header of a sequence file.
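What LineRecordReader produces can be sketched as a generator yielding (byte offset, line) pairs; this is a simplified model that ignores split boundaries, and the input text is made up.

```python
import io

# Sketch of LineRecordReader's output: for each line, emit a
# (byte offset, line text) key-value pair for the mapper.
def line_records(stream):
    offset = 0
    for line in stream:
        yield offset, line.rstrip("\n")       # key: offset, value: line
        offset += len(line.encode("utf-8"))   # advance by the line's byte length

data = io.StringIO("hello world\nbig data\n")
records = list(line_records(data))
print(records)  # [(0, 'hello world'), (12, 'big data')]
```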
Distinct key
- Due to lack of consistency, they can’t be used for updating part of a value or querying the
database, i.e. they cannot provide traditional database capabilities.
2. Column-based
– They work on columns and are based on the BigTable paper by Google.
– Every column is treated separately. Values of single-column databases are stored contiguously.
– They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc.
– They are widely used to manage data warehouses, business intelligence, CRM, and library
card catalogs.
Searching:
The column and key-value types lack a formal structure and hence cannot be indexed, so
searching is not possible. This can be resolved by a document store: using a single ID, a query
can retrieve any item out of the document store. This is possible because everything inside a
document is automatically indexed.
- The difference between a key-value store and a document store is that a key-value store
loads the entire document into memory as the value portion, whereas a document store can
extract a subsection of a document.
• A “document path” is used like a key to access the leaf values of a document:
• Employee[id=‘2003’]/address/street/buildingname/text()
4. Graphs based
- A graph-type database stores entities as well as the relationships amongst those entities.
- An entity is stored as a node, with the relationships as edges.
- Graph databases are mostly used for social networks, logistics, and spatial data.
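The node-and-edge model can be sketched as an adjacency mapping; a real graph database adds indexing and traversal languages on top. The names and relationship types below are hypothetical.

```python
# Sketch of the graph model: entities as nodes, relationships as
# typed edges, stored as a simple adjacency mapping.
edges = {
    "Alice": [("FRIEND_OF", "Bob")],
    "Bob": [("FRIEND_OF", "Carol"), ("WORKS_AT", "Acme")],
}

def neighbours(node, relation):
    # Follow only edges of the given relationship type.
    return [target for rel, target in edges.get(node, []) if rel == relation]

print(neighbours("Bob", "FRIEND_OF"))  # ['Carol']
```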