BDA Question Answer
Question Bank
• Big data refers to the massive data sets collected from a variety of data
sources to meet business needs and reveal new insights for optimized decision making.
• "Big data" is a field that treats ways to analyze, systematically extract information
from, or otherwise deal with data sets that are too large or complex to be dealt with
by traditional data-processing application software.
• Big data generates value from the storage and processing of digital data that cannot be
analyzed by traditional computing techniques.
• It is the result of various trends such as the cloud, increased computing resources, and
data generation by mobile computing, social networks, sensors, web data, etc.
In recent years, Big Data was defined by the “3Vs”, but now there are “5Vs” of
Big Data, also termed the characteristics of Big Data, as follows:
1. Volume:
• The name ‘Big Data’ itself refers to an enormous size.
• Volume denotes the huge amount of data.
• The size of data plays a crucial role in determining its value: whether a
particular data set can actually be considered Big Data depends on its
volume.
• Hence, while dealing with Big Data, it is necessary to consider the
characteristic ‘Volume’.
• Example: In the year 2016, the estimated global mobile traffic was
6.2 exabytes (6.2 billion GB) per month. It was also projected that by the
year 2020 there would be almost 40,000 exabytes of data.
2. Velocity:
• Velocity refers to the high speed of accumulation of data.
• In Big Data, velocity describes data flowing in from sources like machines,
networks, social media, mobile phones, etc.
• There is a massive and continuous flow of data. Velocity determines the
potential of data: how fast the data is generated and processed
to meet demands.
• Sampling data can help in dealing with issues of velocity.
• Example: More than 3.5 billion searches per day are made
on Google. Also, Facebook users are increasing by approximately 22%
year over year.
3. Variety:
• It refers to the nature of data: structured, semi-structured and
unstructured.
• It also refers to heterogeneous sources.
• Variety is basically the arrival of data from new sources, both inside
and outside an enterprise. It can be structured, semi-structured or
unstructured.
• Structured data: This is basically organized data. It generally refers to
data with a defined length and format.
• Semi-structured data: This is basically semi-organized data. It is generally
a form of data that does not conform to the formal structure of data. Log
files are an example of this type of data.
• Unstructured data: This basically refers to unorganized data. It generally
refers to data that doesn’t fit neatly into the traditional row-and-column
structure of a relational database. Texts, pictures, videos, etc. are examples
of unstructured data, which can’t be stored in the form of rows and columns.
4. Veracity:
• It refers to inconsistencies and uncertainty in data, that is data
which is available can sometimes get messy and quality and
accuracy are difficult to control.
• Big Data is also variable because of the multitude of data
dimensions resulting from multiple disparate data types and
sources.
• Example: Data in bulk can create confusion, whereas too little data can
convey only half or incomplete information.
5. Value:
• After taking the 4 Vs into account, there comes one more V, which stands
for Value. Bulk data having no value is of no good to the company unless
it is turned into something useful.
• Data in itself is of no use or importance; it needs to be converted into
something valuable to extract information. Hence, one can argue that
Value is the most important of all the 5 Vs.
2. Differentiate between Traditional Data and Big Data
Ans:
Traditional Data | Big Data
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes.
Traditional data is generated per hour or per day or more. | Big data is generated far more frequently, often every second.
Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
Its data model is strict schema based and it is static. | Its data model is flat schema based and it is dynamic.
Traditional data is stable with known inter-relationships. | Big data is not stable and its relationships are unknown.
Structured Data
Structured data is the easiest to work with. It is highly organized with dimensions defined by
set parameters.
Think spreadsheets; every piece of information is grouped into rows and columns. Specific
elements defined by certain variables are easily discoverable.
It’s all your quantitative data:
• Age
• Billing
• Contact
• Address
• Expenses
• Debit/credit card numbers
Because structured data is already tangible numbers, it’s much easier for a program to sort
through and collect data.
• Structured data follows schemas: essentially road maps to specific data points. These
schemas outline where each datum is and what it means.
• Structured data is the easiest type of data to analyze because it requires little to no
preparation before processing. A user might need to cleanse data and pare it down
to only relevant points, but it won’t need to be interpreted or converted too deeply
before a true inquiry can be performed.
Example: A payroll database will lay out employee identification information, pay rate,
hours worked, how compensation is delivered, etc. The schema defines each of these
dimensions for whatever application is using it. The program won’t have to dig into the
data to discover what it actually means; it can go straight to work collecting and
processing it.
OR
An ‘Employee’ table in a database is an example of Structured Data
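As an illustration, this can be made concrete with a small sketch. The table name and values below are hypothetical; the point is that structured data lives in a fixed schema of typed columns that a program can query directly.

```python
import sqlite3

# Hypothetical 'Employee' table: structured data, where every row
# follows the same fixed schema of typed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, pay_rate REAL)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Asha", 52.5), (2, "Ravi", 48.0)],
)

# Because the schema is known, a query can go straight to the data
# without first interpreting what each field means.
rows = conn.execute("SELECT name FROM Employee WHERE pay_rate > 50").fetchall()
print(rows)  # [('Asha',)]
```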
Semi-Structured Data
• Semi-structured data toes the line between structured and unstructured. Most
of the time, this translates to unstructured data with metadata attached to it.
This can be inherent data collected, such as time, location, device ID stamp or
email address, or it can be a semantic tag attached to the data later.
• Let us understand this with an example: say you take a picture of your cat
with your phone. It automatically logs the time the picture was taken, the
GPS data at the time of capture, and your device ID. If you’re using any
kind of web service for storage, like iCloud, your account info becomes
attached to the file.
• If you send an email, the time sent, email addresses to and from, the IP address
from the device sent from, and other pieces of information are linked to the
actual content of the email.
• In both scenarios, the actual content (i.e. the pixels that compose the photo
and the characters that make up the email) is not structured, but there are
components that allow the data to be grouped based on certain
characteristics.
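The photo scenario above can be sketched as a record: the pixel content itself is unstructured, while the attached metadata gives it partial structure. All field names and values here are hypothetical.

```python
# Hypothetical semi-structured photo record: the pixel data has no
# inherent structure, but the attached metadata (time, GPS, device ID)
# is structured and can be queried like a regular field.
photo = {
    "metadata": {
        "taken_at": "2024-05-01T10:32:00Z",
        "gps": {"lat": 18.52, "lon": 73.85},
        "device_id": "phone-1234",
    },
    "pixels": "<binary blob, no inherent structure>",
}

# Grouping/filtering is possible through the metadata alone.
print(photo["metadata"]["device_id"])  # phone-1234
```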
Unstructured Data
• Nowadays organizations have a wealth of data available with them, but
unfortunately they don’t know how to derive value out of it, since this
data is in its raw, unstructured format.
From its name, one might assume that the Secondary Namenode is a backup
node, but it is not.
The Namenode holds the metadata for HDFS, such as block information and size.
This information is stored in main memory as well as on disk for persistent
storage. There are two important files that reside in the namenode’s current
directory:
• Editlogs: keeps track of each and every change to HDFS.
• Fsimage: stores a snapshot of the file system.
Any change made to HDFS gets noted in the edit logs, so that file grows, whereas
the size of the fsimage remains the same. This has no impact until we restart the
server. When we restart the server, the edit logs are merged into the fsimage file
and loaded into main memory, which takes some time. If we restart the cluster
after a long time, there will be considerable downtime, since the edit log file will
have grown. The Secondary Namenode comes to the rescue for this problem: it
periodically gets the edit logs from the namenode and applies them to its copy of
the fsimage. The new fsimage is copied back to the namenode, which uses it for
the next restart, reducing the startup time.
It is a helper node to the Namenode; to be precise, the Secondary Namenode’s
whole purpose is to perform checkpoints in HDFS, which helps the namenode
function effectively. Hence, it is also called the Checkpoint node.
The main function of the Secondary Namenode is to store the latest copy of the
FsImage and Edits Log files.
Checkpoint:-
A checkpoint is nothing but the updating of the latest FsImage file by applying
the latest Edits Log files to it. If the time gap between checkpoints is large, too
many Edits Log entries are generated, and it becomes cumbersome and
time-consuming to apply them all at once to the latest FsImage file.
The above figure shows the working of Secondary Namenode
1. It gets the edit logs from the namenode at regular intervals and applies
them to the fsimage.
2. Once it has the new fsimage, it copies it back to the namenode.
3. The namenode will use this fsimage for the next restart, which will reduce
the startup time.
▪ Hadoop keeps multiple copies for all data that is present in HDFS. If Hadoop is
aware of the rack topology, each copy of data can be kept in a different rack. By
doing this, in case an entire rack suffers a failure for some reason, the data can be
retrieved from a different rack.
▪ Replication of data blocks across multiple racks in HDFS via rack awareness is
done using a policy called the Replica Placement Policy.
▪ The policy states that “No more than one replica is placed on one node, and no
more than two replicas are placed on the same rack.”
The reasons for the Rack Awareness in Hadoop are:
• To reduce the network traffic while file read/write, which improves the cluster
performance.
• To achieve fault tolerance, even when the rack goes down.
• Achieve high availability of data so that data is available even in unfavorable
conditions.
• To reduce the latency, that is, to make the file read/write operations done with
lower delay.
What is rack awareness policy?
NameNode on multiple rack cluster maintains block replication by using inbuilt Rack
awareness policies which are:
▪ Not more than one replica be placed on one node.
▪ Not more than two replicas are placed on the same rack.
▪ Also, the number of racks used for block replication should always be smaller than
the number of replicas.
Rack awareness ensures that read/write requests go to replicas in the closest
rack or the same rack. This maximizes reading speed and minimizes writing cost.
Rack awareness also maximizes network bandwidth by keeping block transfers
within a rack.
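The placement rules above can be sketched in a few lines. This is a simplified, hypothetical model of a cluster (real HDFS also weighs node load, randomness, and topology scripts); it just shows one 3-replica layout that satisfies the policy: at most one replica per node and at most two per rack.

```python
# Sketch of HDFS's default 3-replica placement: replica 1 on the
# writer's node, replicas 2 and 3 on two different nodes of one
# other rack. Cluster layout and node names are made up.
def place_replicas(writer, cluster):
    # cluster: {rack_name: [node, ...]}; writer: (rack, node)
    writer_rack, writer_node = writer
    placements = [(writer_rack, writer_node)]          # replica 1: local node
    remote_rack = [r for r in cluster if r != writer_rack][0]
    remote_nodes = cluster[remote_rack]
    placements.append((remote_rack, remote_nodes[0]))  # replica 2: other rack
    placements.append((remote_rack, remote_nodes[1]))  # replica 3: same rack as 2
    return placements

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas(("rack1", "n1"), cluster)
print(replicas)  # [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]
```

Note how no node holds two replicas and no rack holds more than two, matching the policy stated above.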
7. Explain Speculative Execution? How Map Reduce job can be optimized using
Speculative Execution?
Ans:
1. In MapReduce, jobs are broken into tasks and the tasks are run in parallel to make
the overall job execution time smaller than it would be if the tasks ran
sequentially. Among the divided tasks, if one task takes more time than desired,
the overall execution time of the job increases.
2. Tasks may be slow for various reasons, including hardware degradation or software
misconfiguration, but the causes may be hard to detect since the tasks may still
complete successfully, albeit after a longer time than expected.
3. Apache Hadoop does not fix or diagnose slow-running tasks. Instead, it tries to
detect when a task is running slower than expected and launches another,
equivalent task as a backup (the backup task is called a speculative task). This
process is called speculative execution in MapReduce.
4. Speculative execution in Hadoop does not mean launching duplicate tasks at the
same time so they can race, as this would waste cluster resources. Rather, a
speculative task is launched only after a task has run for a significant amount of
time and the framework detects it running slowly compared to other tasks of the
same job.
5. When a task completes successfully, any duplicate tasks that are still running are
killed, since they are no longer needed.
6. If the original task finishes before the speculative task, the speculative task is killed.
7. On the other hand, if the speculative task finishes first, then the original one is
killed. Speculative execution in Hadoop is just an optimization; it is not a feature
to make jobs run more reliably.
8. To summarize: the speed of a MapReduce job is dominated by the slowest task.
MapReduce first detects slow tasks, then runs redundant (speculative) tasks, which
will optimistically commit before the corresponding stragglers. This process is
known as speculative execution. Only one copy of a straggler is allowed to be
speculated. Whichever of the two copies of a task commits first becomes the
definitive copy, and the other copy is killed by the framework.
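In practice, speculative execution can be switched on or off per task type through job configuration. To my knowledge, the property names below exist in Hadoop 2.x and later (both default to true); this is a minimal sketch of a mapred-site.xml fragment, not a complete configuration.

```xml
<!-- mapred-site.xml: enable/disable speculative execution per task type -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```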
8. What is shuffling and sorting in Map Reduce?
Ans:
1. Shuffling can start even before the map phase has finished, which saves
some time and completes the tasks sooner.
2. The keys generated by the mapper are automatically sorted by
MapReduce.
3. Values passed to each reducer are not sorted and can be in any order.
Sorting helps the reducer easily distinguish when a new reduce task
should start.
4. This saves time for the reducer: the reducer starts a new reduce task
when the next key in the sorted input data differs from the previous one.
5. Each reduce task takes key-value pairs as input and generates key-value
pairs as output.
Extra:
▪ Shuffle phase in Hadoop transfers the map output from Mapper to a
Reducer in MapReduce. Sort phase in MapReduce covers the merging
and sorting of map outputs. Data from the mapper are grouped by the
key, split among reducers and sorted by the key.
▪ The process of transferring data from the mappers to reducers is known
as shuffling i.e. the process by which the system performs the sort and
transfers the map output to the reducer as input. So, MapReduce shuffle
phase is necessary for the reducers, otherwise, they would not have any
input (or input from every mapper). As shuffling can start even before
the map phase has finished so this saves some time and completes the
tasks in lesser time.
▪ The keys generated by the mapper are automatically sorted by
MapReduce Framework. Sorting in Hadoop helps reducer to easily
distinguish when a new reduce task should start. This saves time for
the reducer. Reducer starts a new reduce task when the next key in
the sorted input data is different than the previous. Each reduce task
takes key-value pairs as input and generates key-value pair as output.
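The shuffle-and-sort step described above can be sketched in a single process: mapper output pairs are sorted by key, and the values for each key are grouped before the reducer sees them. The sample keys are made up.

```python
from itertools import groupby
from operator import itemgetter

# A minimal sketch of shuffle-and-sort: the framework sorts mapper
# (key, value) pairs by key and groups all values for the same key
# before handing them to the reducer.
map_output = [("cat", 1), ("dog", 1), ("cat", 1), ("ant", 1)]

shuffled = sorted(map_output, key=itemgetter(0))        # sort by key
reduce_input = [
    (key, [value for _, value in group])                # group values per key
    for key, group in groupby(shuffled, key=itemgetter(0))
]
print(reduce_input)  # [('ant', [1]), ('cat', [1, 1]), ('dog', [1])]
```

Because the input is sorted, the reducer can detect a new key simply by comparing it with the previous one, as point 4 above describes.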
9. What is Input Format? Write difference between HDFS Block and Input Split.
Ans:
Input Format
▪ InputFormat defines how the input files are split and read.
▪ InputFormat creates the InputSplits.
▪ Based on the splits, InputFormat defines the number of map tasks in the mapping phase.
▪ The job driver invokes the InputFormat directly to decide the number of InputSplits
and the location of map task execution.
1. Data Representation
• Block – HDFS Block is the physical representation of data in
Hadoop.
• InputSplit – MapReduce InputSplit is the logical representation of
data present in the block in Hadoop. It is basically used during
data processing in MapReduce program or other processing
techniques. The main thing to focus is that InputSplit doesn’t
contain actual data; it is just a reference to the data.
2. Size
• Block – By default, the HDFS block size is 128 MB, which you can
change as per your requirement. All HDFS blocks are the same
size except the last block, which can be either the same size or
smaller. The Hadoop framework breaks files into 128 MB blocks and
then stores them in the Hadoop file system.
• InputSplit – The InputSplit size is, by default, approximately equal to
the block size. It is user defined: in a MapReduce program the user
can control the split size based on the size of the data.
Example of Block and InputSplit in Hadoop
Suppose we need to store the file in HDFS. Hadoop HDFS stores files as
blocks. Block is the smallest unit of data that can be stored or retrieved from
the disk. The default size of the block is 128MB. Hadoop HDFS breaks files into
blocks. Then it stores these blocks on different nodes in the cluster.
For example, if we have a file of 132 MB, HDFS will break this file into
two blocks: one of 128 MB and one of 4 MB.
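The arithmetic behind this example is simply a ceiling division of the file size by the block size, which can be sketched as:

```python
import math

# Sketch: how a 132 MB file maps onto 128 MB HDFS blocks.
BLOCK_SIZE_MB = 128
file_size_mb = 132

num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)            # full + partial blocks
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB  # size of final block
print(num_blocks, last_block_mb)  # 2 4
```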
10. Illustrate the main components of the Hadoop system.
Ans:
▪ Hadoop Common – the libraries and utilities used by other Hadoop modules.
▪ Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores
data across multiple machines without prior organization.
▪ YARN – (Yet Another Resource Negotiator) provides resource management for the
processes running on Hadoop.
▪ MapReduce – a parallel processing software framework. It comprises two steps.
In the Map step, a master node takes inputs, partitions them into smaller
subproblems, and distributes them to worker nodes. After the map step has
taken place, the master node takes the answers to all of the subproblems and
combines them to produce the output. It is used to distribute work around a cluster.
Hadoop Common
▪ It consists of the Java ARchive (JAR) files and the scripts needed to start Hadoop.
▪ It requires Java Runtime Environment (JRE) 1.6 or higher version.
▪ The standard start up and shutdown script need Secure Shell (SSH) to be setup
between the nodes in the cluster.
▪ HDFS ( storage) and Map Reduce(processing) are two core components of Apache
Hadoop.
HDFS
▪ The Hadoop Distributed File System (HDFS) allows applications to run across multiple
servers. HDFS is highly fault tolerant, runs on low-cost hardware, and provides high-
throughput access to data.
▪ Java-based scalable system that stores data across multiple machines without prior
organization.
▪ Data in a Hadoop cluster is broken into smaller pieces called blocks, and then
distributed throughout the cluster.
▪ Blocks, and copies of blocks, are stored on other servers in the Hadoop cluster.
▪ That is, an individual file is stored as smaller blocks that are replicated across
multiple servers in the cluster.
▪ An HDFS cluster has two types of nodes: the NameNode (master) and DataNodes (workers).
YARN
▪ YARN is also called MapReduce 2.0; it is a software rewrite that decouples
MapReduce’s resource management and scheduling capabilities from the data
processing component.
▪ With YARN, Hadoop clusters are now able to run stream data processing and
interactive querying side by side with MapReduce batch jobs. Managing cluster
resources is done by YARN.
▪ This involves using the right amounts of RAM, CPU and disk space on the Hadoop
cluster, which should be taken care of during YARN configuration on the Hadoop
cluster.
▪ In Hadoop 1.0, the batch processing framework MapReduce was closely paired
with the Hadoop Distributed File System, and MapReduce itself handled resource
management and job scheduling on the Hadoop systems while processing and
condensing data in a parallel manner. Managing cluster resources was done by
the job tracker.
MapReduce
▪ MapReduce is mainly the data processing component of Hadoop.
▪ It is a programming model for processing large data sets.
▪ It carries out the task of data processing and distributes the particular tasks across
the nodes.
▪ It consists of two phases:
▪ Map: converts a typical dataset into another set of data where individual
elements are broken down into key-value pairs.
▪ Reduce: takes the output files from a map task as input and integrates the data
tuples into a smaller set of tuples. It is always executed after the map job is done.
▪ In between Map and Reduce, there is a small phase called shuffle and sort in
MapReduce.
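The two phases above can be sketched as a single-process word count; in a real Hadoop job these functions run on separate nodes under the framework's control, and the input string here is made up.

```python
from collections import defaultdict

# Single-process sketch of the Map and Reduce phases (word count).
def map_phase(document):
    # Map: emit a (word, 1) key-value pair for every word.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle/sort: group values by key; Reduce: sum each key's values.
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase("big data big ideas"))
print(counts)  # {'big': 2, 'data': 1, 'ideas': 1}
```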
11. What is Map Reduce Partitioner? What is need of Partitioner? How many
partitioners are there in HADOOP?
Ans:
▪ The Partitioner in MapReduce controls the partitioning of the key of the
intermediate mapper output.
▪ A hash function on the key (or a subset of the key) is used to derive the partition.
▪ The total number of partitions depends on the number of reduce tasks.
▪ Each mapper output is partitioned according to the key: records having the same
key value go into the same partition (within each mapper), and then each
partition is sent to a reducer.
▪ The Partitioner class determines which partition a given (key, value) pair will go to.
▪ The partition phase takes place after the map phase and before the reduce phase.
Need
▪ A MapReduce job takes an input data set and produces a list of key-value pairs as
the result of the map phase. Then the output from the map phase is sent to the
reduce tasks, which process the user-defined reduce function on the map outputs.
▪ Before the reduce phase, the map output is partitioned on the basis of the key
and sorted.
▪ This partitioning ensures that all the values for each key are grouped together and
that all the values of a single key go to the same reducer, thus allowing even
distribution of the map output over the reducers.
▪ Partitioner in Hadoop MapReduce redirects the mapper output to the reducer by
determining which reducer is responsible for the particular key.
How many Partitioner?
▪ The total number of Partitioners that run in Hadoop is equal to the number of
reducers i.e. Partitioner will divide the data according to the number of reducers
which is set by JobConf.setNumReduceTasks() method.
▪ Thus, the data from a single partitioner is processed by a single reducer. A
partitioner is created only when there are multiple reducers.
Poor Partitioning in Hadoop MapReduce
• If, in the input data, one key appears more often than any other key, we can use two
mechanisms to send data to partitions:
– The most frequent key is sent to its own partition.
– All other keys are sent to partitions according to their hashCode().
• But if the hashCode() method does not uniformly distribute the other keys’ data over
the partition range, then the data will not be sent evenly to the reducers.
• Poor partitioning of data means that some reducers will have more input data than
others, i.e. they will have more work to do than other reducers. So the entire job will
wait for one reducer to finish its extra-large share of the load.
• How to overcome poor partitioning in MapReduce?
To overcome poor partitioning in Hadoop MapReduce, we can create a custom partitioner,
which allows spreading the workload uniformly across different reducers.
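The default hash-based routing described above can be sketched as follows. This is a simplified Python model, not Hadoop's actual Java HashPartitioner, though the hash-modulo-reducers idea is the same; a custom partitioner would replace partition_for() with its own rule.

```python
# Sketch of hash partitioning: each key is routed to one of
# num_reducers partitions based on its hash.
def partition_for(key, num_reducers):
    # & 0x7FFFFFFF keeps the hash non-negative, mirroring Hadoop's
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks idiom.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

num_reducers = 3
keys = ["apple", "banana", "cherry", "apple"]
partitions = [partition_for(k, num_reducers) for k in keys]

# The same key always lands in the same partition (hence same reducer).
print(partitions[0] == partitions[3])  # True
```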
12. What is Map Reduce Combiner? Write advantages and dis advantages of
Map Reduce Combiner?
Ans:
• Hadoop Combiner is also known as “Mini-Reducer” that summarizes the Mapper output
record with the same Key before passing to the Reducer.
• On a large dataset, when we run a MapReduce job, large chunks of intermediate data
are generated by the Mapper, and this intermediate data is passed on to the Reducer
for further processing, which leads to enormous network congestion.
• MapReduce framework provides a function known as Hadoop Combiner that plays a key
role in reducing network congestion.
• The primary job of Combiner is to process the output data from the Mapper, before
passing it to Reducer. It runs after the mapper and before the Reducer and its use is
optional.
• Combiner: When the reduce function is both associative and commutative (e.g. sum,
max; note that an average cannot be combined directly unless counts are carried
along), some of the work of the reduce function can be assigned to the combiner.
• Instead of sending all the Mapper data to the reducer, some values are computed on
the mapper side itself using the combiner, and only then are they sent to the reducer.
• For example, if a particular word w appears k times among all the documents assigned
to the process, there will be k (word, 1) key-value pairs as a result of the Map
execution, which can be grouped into a single pair (word, k) provided to the reduce task.
Advantages of MapReduce Combiner
• Hadoop Combiner reduces the time taken for data transfer between mapper
and reducer.
• It decreases the amount of data that needs to be processed by the reducer.
• The Combiner improves the overall performance of the reducer.
Disadvantages of the Hadoop combiner in MapReduce
• MapReduce jobs cannot depend on the Hadoop combiner’s execution because
there is no guarantee that it will run.
• Hadoop may or may not execute a combiner; also, if required, it may execute it
more than once. So MapReduce jobs should not depend on the combiner’s
execution.
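The combiner's effect can be sketched as local aggregation of one mapper's output before anything crosses the network. The sample pairs are made up; the combine logic mirrors a sum reducer, which is safe because sum is associative and commutative.

```python
from collections import Counter

# Sketch of a combiner as a "mini-reducer": it sums (word, 1) pairs
# on the mapper side before they are shuffled to the reducer.
mapper_output = [("data", 1), ("big", 1), ("data", 1), ("data", 1)]

def combine(pairs):
    # Same logic as a sum reducer, applied to a single mapper's output.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

combined = combine(mapper_output)
# 4 pairs shrink to 2, cutting the data shuffled across the network.
print(sorted(combined))  # [('big', 1), ('data', 3)]
```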
13. What is role of record reader in HADOOP?
Ans:
Record Reader:
▪ It communicates with the InputSplit and converts the data into key-value pairs
suitable for reading by the mapper.
▪ By default, it uses TextInputFormat for converting data into key-value pairs.
▪ The RecordReader communicates with the InputSplit until the file reading is
completed.
▪ It assigns a byte offset (a unique number) to each line present in the file. Then
these key-value pairs are sent to the mapper for further processing.
Extra:
▪ A RecordReader converts the byte-oriented view of the input to a record-oriented
view for the Mapper and Reducer tasks for processing.
▪ To understand Hadoop RecordReader, we need to understand MapReduce Dataflow.
Let us learn how the data flow:
▪ MapReduce is a simple model of data processing. Inputs and outputs for the map
and reduce functions are key-value pairs. Following is the general form of the map
and reduce functions:
Map: (K1, V1) → list (K2, V2)
Reduce: (K2, list (V2)) → list (K3, V3)
After running setup(), nextKeyValue() is called repeatedly on the context to populate the
key and value objects for the mapper. The key and value are retrieved from the record
reader by way of the context and passed to the map() method to do its work. An input to
the map function, which is a key-value pair (K, V), gets processed as per the logic
mentioned in the map code. When the reader reaches the end of the records, the
nextKeyValue() method returns false.
A RecordReader usually stays within the boundaries created by the InputSplit to generate
key-value pairs, but this is not mandatory. A custom implementation can even read data
outside of the InputSplit, though this is not encouraged.
Types of Hadoop RecordReader
InputFormat defines the RecordReader instance in Hadoop. By default, using
TextInputFormat, the RecordReader converts data into key-value pairs. TextInputFormat
provides two types of RecordReaders, as follows:
1. LineRecordReader
It is the default RecordReader, provided by TextInputFormat. It treats each line of the
input file as a new value; the associated key is the byte offset. It always skips the first
line in the split (or part of it) if it is not the first split, and it always reads one line beyond
the boundary of the split at the end (if data is available, i.e. it is not the last split).
2. SequenceFileRecordReader
This Hadoop RecordReader reads data specified by the header of a sequence file.
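What LineRecordReader produces can be sketched as a generator yielding (byte offset, line) pairs; this is a simplified model that ignores split boundaries, and the input text is made up.

```python
import io

# Sketch of LineRecordReader's output: for each line, emit a
# (byte offset, line text) key-value pair for the mapper.
def line_records(stream):
    offset = 0
    for line in stream:
        yield offset, line.rstrip("\n")       # key: offset, value: line
        offset += len(line.encode("utf-8"))   # advance by the line's byte length

data = io.StringIO("hello world\nbig data\n")
records = list(line_records(data))
print(records)  # [(0, 'hello world'), (12, 'big data')]
```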
Distinct key
- Due to lack of consistency, they can’t be used for updating part of a value or querying the
database, i.e. they cannot provide traditional database capabilities.
2. Column-based
– They work on columns and are based on the BigTable paper by Google.
– Every column is treated separately. Values of single-column databases are stored contiguously.
– They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc.
– They are widely used to manage data warehouses, business intelligence, CRM, and library
card catalogs.
Searching:
The column and key-value types lack a formal structure and hence cannot be indexed, so
searching is not possible. This can be resolved by a document store: using a single ID, a query
can retrieve any item out of the document store. This is possible because everything inside a
document is automatically indexed.
- The difference between a key-value store and a document store is that a key-value store
loads the entire document into memory as the value portion, whereas a document store can
extract a subsection of a document.
• A “document path” is used like a key to access the leaf values of a document:
• Employee[id=‘2003’]/address/street/buildingname/text()
4. Graphs based
- A graph-type database stores entities as well as the relationships amongst those entities.
- An entity is stored as a node, with the relationships as edges.
- Graph databases are mostly used for social networks, logistics, and spatial data.
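The node-and-edge model can be sketched as an adjacency mapping; a real graph database adds indexing and traversal languages on top. The names and relationship types below are hypothetical.

```python
# Sketch of the graph model: entities as nodes, relationships as
# typed edges, stored as a simple adjacency mapping.
edges = {
    "Alice": [("FRIEND_OF", "Bob")],
    "Bob": [("FRIEND_OF", "Carol"), ("WORKS_AT", "Acme")],
}

def neighbours(node, relation):
    # Follow only edges of the given relationship type.
    return [target for rel, target in edges.get(node, []) if rel == relation]

print(neighbours("Bob", "FRIEND_OF"))  # ['Carol']
```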