Assignment No 01
By:
Aakanksha Padmanabhan
BE-IT
Roll No:01
1) List different types of data and explain structured, semi-structured and unstructured data with examples.
Structured Data:
Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Structured data is the simplest kind of data to process and manage. Example: relational data.
Another common storage format for structured data is the comma-separated values (CSV) file. Figure 2 shows structured data in CSV format. While it might look messy at first, a closer look shows that it follows a rigorous structure that can easily be converted into a spreadsheet-like view. Each row has a value for a product code, order number, etc. These values are separated by a semicolon ";" so they can easily be related to the right column header. For example, the first value in every row indicates a product code.
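For illustration only (the figure itself is not reproduced here, and the column names and values below are hypothetical), a few such rows might look like:
    product_code;order_number;quantity;price
    P1001;ORD-501;2;19.99
    P1002;ORD-502;1;5.49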
1. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but that has some organizational properties which make it easier to analyse. With some processing you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structured form exists to ease storage. Example: XML data.
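A small illustrative XML fragment (the element names are hypothetical) shows the self-describing, loosely structured nature of such data:
    <order number="ORD-501">
        <product code="P1001">Keyboard</product>
        <note>Deliver after 5 pm</note>
    </order>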
2. Unstructured data –
Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model; thus, it is not a good fit for a mainstream relational database. There are alternative platforms for storing and managing unstructured data, which is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.
There is a plethora of examples of unstructured data. Just think of any image (e.g. JPEG), video (e.g. MP4), song (e.g. MP3), document (e.g. PDF or DOCX) or any other file type. The image below shows just one concrete example of unstructured data: a product image and description text. Even though this type of data might be easy for us humans to consume, it has no degree of organization and is therefore difficult for machines to analyse and interpret.
4. Set the HADOOP_HOME environment variable on Windows 10 (see Steps 1, 2, 3 and 4 below).
5. Set the JAVA_HOME environment variable on Windows 10 (see Steps 1, 2, 3 and 4 below).
6. Next, add the Hadoop bin directory path and the Java bin directory path to the Path variable.
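For illustration only (the JDK install path below is an assumption, not a value from the original assignment), the variables might be set as:
    HADOOP_HOME = C:\Hadoop-2.8.0
    JAVA_HOME = C:\java\jdk1.8.0_152
and the Path variable extended with:
    %HADOOP_HOME%\bin;%HADOOP_HOME%\sbin;%JAVA_HOME%\bin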
Configuration
1. Edit the file C:/Hadoop-2.8.0/etc/hadoop/core-site.xml, paste the XML snippet shown below, and save the file.
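A typical minimal core-site.xml for a single-node setup looks like the following (the port 9000 is an assumption; it is the value commonly used in Hadoop 2.8.0 guides for Windows):
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>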
Create folder “datanode” under “C:\Hadoop-2.8.0\data”
Create folder “namenode” under “C:\Hadoop-2.8.0\data”
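These two folders are typically referenced from C:/Hadoop-2.8.0/etc/hadoop/hdfs-site.xml; a minimal sketch, assuming the paths above and a single-node replication factor of 1, is:
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>C:\Hadoop-2.8.0\data\namenode</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>C:\Hadoop-2.8.0\data\datanode</value>
        </property>
    </configuration>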
Hadoop Configuration:
1. Open cmd and type the command "hdfs namenode -format".
Once Hadoop is started, you will see separate windows for:
Hadoop Namenode
Hadoop Datanode
YARN Resource Manager
YARN Node Manager
4. Open: http://localhost:8088
5. Open: http://localhost:50070
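For reference, a typical command sequence (assuming the Hadoop bin and sbin directories are on the Path) is:
    hdfs namenode -format
    start-all.cmd
start-all.cmd (or start-dfs.cmd followed by start-yarn.cmd) opens the four daemon windows listed above, after which the YARN ResourceManager UI is available at http://localhost:8088 and the NameNode UI at http://localhost:50070.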
3) Explain HDFS architecture with diagram.
Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. The namenode software can run on ordinary commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual datanodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in Hadoop 2.x), but it can be increased as needed by changing the HDFS configuration.
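For example, in Hadoop 2.x the block size can be changed through the dfs.blocksize property in hdfs-site.xml (the 128 MB value below is only an illustration):
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>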
2. Document Oriented:
In MongoDB, all the data is stored in documents instead of tables as in an RDBMS. In these documents the data is stored as fields (key-value pairs) instead of rows and columns, which makes the data much more flexible than in an RDBMS. Each document contains its own unique object id.
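A minimal illustrative document (the field names and values are hypothetical, not taken from the assignment) might look like:
    {
        "_id": ObjectId("507f1f77bcf86cd799439011"),
        "name": "Aakanksha",
        "roll_no": 1,
        "courses": ["Big Data", "IT"]
    }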
3. Indexing:
In a MongoDB database, fields in documents can be indexed with primary and secondary indices, which makes it easier and faster to get or search data from the pool of data. If the data is not indexed, the database must scan every document against the specified query, which takes a lot of time and is not efficient.
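For example, a secondary index can be created from the mongo shell as follows (the collection and field names are hypothetical):
    db.students.createIndex({ roll_no: 1 })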
4. Scalability:
MongoDB provides horizontal scalability with the help of sharding. Sharding means distributing data across multiple servers: a large amount of data is partitioned into data chunks using the shard key, and these chunks are evenly distributed across shards that reside on many physical servers. New machines can also be added to a running database.
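For illustration, sharding a collection from the mongo shell (the database, collection and shard key names are hypothetical) looks like:
    sh.enableSharding("school")
    sh.shardCollection("school.students", { roll_no: "hashed" })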
5. Replication:
MongoDB provides high availability and redundancy with the help of replication: it creates multiple copies of the data and sends these copies to different servers, so that if one server fails, the data can be retrieved from another server.
6. Aggregation:
It allows operations to be performed on grouped data to obtain a single or computed result, similar to the SQL GROUP BY clause. It provides three different ways to aggregate: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
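A minimal aggregation-pipeline example from the mongo shell (the collection and field names are hypothetical), equivalent to a SQL GROUP BY with SUM:
    db.orders.aggregate([
        { $group: { _id: "$customer", total: { $sum: "$amount" } } }
    ])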
7. High Performance:
MongoDB offers very high performance and data persistence compared to other databases, due to features such as scalability, indexing, and replication.
Data is stored in NoSQL databases in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the data is stored in the form of key-value pairs. The key is usually a sequence of strings, integers or characters, but can also be a more advanced data type. The value is typically linked or co-related to the key. Key-value pair storage databases generally store data as a hash table where each key is unique. The value can be of any type (JSON, BLOB (Binary Large Object), strings, etc.). This type of pattern is usually used in shopping websites or e-commerce applications.
Advantages:
Limitations:
Complex queries may involve multiple key-value pairs, which may delay performance.
Data involving many-to-many relationships may collide.
Examples:
DynamoDB
Berkeley DB
2. Column Store Database:
Rather than storing data in relational tuples, the data is stored in individual cells which are further grouped into columns. Column-oriented databases work only on columns. They store large amounts of data together in columns. The format and titles of the columns can diverge from one row to another. Every column is treated separately, but each individual column may still contain multiple other columns, as in traditional databases. Basically, columns are the mode of storage in this type.
Advantages:
Examples:
HBase
Bigtable by Google
Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be described as a complex data structure. A document here can be in the form of text, arrays, strings, JSON, XML or any such format. The use of nested documents is also very common. It is very effective, as most of the data created is usually in the form of JSON and is unstructured.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storage, retrieval and management of documents is easy.
Limitations:
Examples:
MongoDB
CouchDB
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data as graphs. Graphs are basically structures that depict connections between two or more objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier. Each node serves as a point of contact for the graph. This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or many characteristics which are connected by edges.
The relational database pattern has tables which are loosely connected, whereas graphs are often very strong and rigid in nature.
Advantages:
Limitations:
Examples:
Neo4J
FlockDB( Used by Twitter)
6) Write a short note on:
a) Hadoop Architectural Model.
Hadoop is an open-source framework from Apache used to store, process and analyse data that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3. MapReduce: This is a framework which helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and then the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the JobTracker and NameNode, whereas the slave nodes include the DataNode and TaskTracker.
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.
NameNode
o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming and closing files and directories.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives the task and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
MapReduce processing begins when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
b) Hive and its architecture.
The Hive architecture consists of three main components:
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
Hive Clients:
Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication. For Java applications, it provides JDBC drivers, and for other types of applications it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
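As a minimal sketch of how a Java application talks to Hive over JDBC (the host, port and database in the connection URL are assumptions for a default local HiveServer2 installation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to HiveServer2 (host, port and database are assumed for a local setup)
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // List the tables in the default database
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}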
Hive Services:
Client interactions with Hive can be performed through Hive Services. If the client wants to
perform any query related operations in Hive, it has to communicate through Hive Services.
The CLI (command line interface) acts as the Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and then with the main driver in the Hive services, as shown in the architecture diagram above.
The driver present in the Hive services is the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver processes requests from the different applications and passes them to the metastore and file systems for further processing.
Hive Storage and Computing:
Hive services such as the Metastore, File system, and Job Client in turn communicate with Hive storage and perform the following actions:
Metadata information of tables created in Hive is stored in the Hive "meta storage database".
Query results and data loaded into the tables are stored on the Hadoop cluster in HDFS.
7) What are JobTracker and TaskTracker? Explain the benefits of block Transfer.
JobTracker and TaskTracker are two essential processes involved in MapReduce execution in MRv1 (Hadoop version 1). Both processes are deprecated in MRv2 (Hadoop version 2) and replaced by the ResourceManager, ApplicationMaster and NodeManager daemons.
Job Tracker –
The JobTracker runs on the master node; it accepts MapReduce jobs from clients, schedules the individual map and reduce tasks on TaskTrackers, monitors their progress, and re-executes failed tasks.
TaskTracker –
The TaskTracker runs on each slave node; it executes the map and reduce tasks assigned to it by the JobTracker and reports its progress back through periodic heartbeats.
8) What is a Role of Combiner in MapReduce Framework? Explain with the help of one
example.
Combiner
The Combiner class is used in between the Map class and the Reduce class to reduce the
volume of data transfer between Map and Reduce. Usually, the output of the map task is large
and the data transferred to the reduce task is high.
The following MapReduce task diagram shows the COMBINER PHASE.
How Combiner Works?
Here is a brief summary on how MapReduce Combiner works −
A combiner does not have a predefined interface and it must implement the Reducer interface's reduce() method.
A combiner operates on each map output key. It must have the same output key-value types as
the Reducer class.
A combiner can produce summary information from a large dataset because it replaces the
original Map output.
Although the Combiner is optional, it helps segregate data into multiple groups for the Reduce phase, which makes the data easier to process.
Example:
In the above example, we can see that the two Mappers contain different data: the main text file is divided between two Mappers, and each Mapper is assigned a different line of our data. In our example we have two lines of data, so we have two Mappers to handle each line. The Mappers produce intermediate key-value pairs, where the particular word is the key and its count is the value. For example, for the data "Geeks For Geeks For" the key-value pairs are shown below.
// Key Value pairs generated for data Geeks For Geeks For
(Geeks,1)
(For,1)
(Geeks,1)
(For,1)
The key-value pairs generated by the Mapper are known as the intermediate key-value pairs or
intermediate output of the Mapper. Now we can minimize the number of these key-value pairs
by introducing a combiner for each Mapper in our program. In our case, we have four key-value pairs generated by each Mapper. Since these intermediate key-value pairs are not ready to be fed directly to the Reducer (doing so would increase network congestion), the Combiner combines them before sending them to the Reducer. The Combiner merges the intermediate key-value pairs according to their key. For the data "Geeks For Geeks For" above, the Combiner will partially reduce them by merging pairs with the same key and generate new key-value pairs as shown below.
// Partially reduced key-value pairs with combiner
(Geeks,2)
(For,2)
With the help of the Combiner, the Mapper output is partially reduced in size (fewer key-value pairs), and this reduced output can then be made available to the Reducer for better performance. The Reducer then reduces the output obtained from the Combiners and produces the final output, which is stored on HDFS (Hadoop Distributed File System).
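A minimal runnable sketch of the same word-count example using Hadoop's MapReduce API is shown below; the class names are illustrative, and the key point is the single line job.setCombinerClass(IntSumReducer.class), which plugs the reducer logic in as a combiner because its input and output key-value types are identical.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // e.g. (Geeks, 1), (For, 1), ...
            }
        }
    }

    // Sums counts per word; used as both the combiner and the reducer,
    // since its input and output key-value types are identical.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);            // e.g. (Geeks, 2), (For, 2)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner: partial reduction of each mapper's output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}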
9) Show the MapReduce Implementation for the following tasks with the help of one
example:
a) Matrix-Vector Multiplication.
MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time; it is mostly used in distributed systems. It has two important parts:
Mapper: It takes raw input data and organizes it into key-value pairs. For example, in a dictionary you search for the word "Data" and its associated meaning is "facts and statistics collected together for reference or analysis". Here the key is "Data" and the value associated with it is "facts and statistics collected together for reference or analysis".
Reducer: It is responsible for processing the data in parallel and producing the final output.
Let us consider the matrix-vector multiplication example to visualize MapReduce (the matrix figure from the original assignment is not reproduced here).
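As a sketch under stated assumptions (the input is taken as triples (i, j, m[i][j]) and the vector v is assumed small enough to be held in memory on every map worker), the map step emits (i, m[i][j] * v[j]) for each matrix entry and the reduce step sums the partial products for each row index i. The self-contained Java simulation below (the matrix and vector values are hypothetical) illustrates the idea:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MatrixVectorMapReduceSketch {

    // "Map" step: for one matrix entry m[i][j], emit the pair (i, m[i][j] * v[j]).
    static List<Map.Entry<Integer, Double>> map(int i, int j, double mij, double[] v) {
        return List.of(Map.entry(i, mij * v[j]));
    }

    // "Reduce" step: sum all partial products that share the same row index.
    static double reduce(List<Double> partialProducts) {
        return partialProducts.stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};   // hypothetical 2x2 matrix
        double[] v = {5, 6};               // hypothetical vector
        Map<Integer, List<Double>> grouped = new TreeMap<>();
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                for (Map.Entry<Integer, Double> kv : map(i, j, m[i][j], v)) {
                    grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
                }
            }
        }
        // Prints x[0] = 17.0 and x[1] = 39.0, i.e. the product m * v.
        grouped.forEach((row, vals) -> System.out.println("x[" + row + "] = " + reduce(vals)));
    }
}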
b) Grouping and Aggregation.
Usually, understanding grouping and aggregation takes a bit of time when we learn SQL, but not when we understand these operations using MapReduce. The logic is already there in the working of the map: map workers implicitly group keys, and the reduce function acts upon the aggregated values to generate the output.
Map Function: For each row in the table, take the attributes on which grouping is to be done as the key; the value will be the attributes on which aggregation is to be performed. For example, if a relation has four columns A, B, C, D and we want to group by A, B and aggregate over C, we make (A, B) the key and C the value.
Reduce Function: Apply the aggregation operation (sum, max, min, avg, …) on the list of
values and output the result.
For our example, let's group by (A, B) and apply sum as the aggregation.
After applying the map function and grouping the keys, the data has (A, B) as the key and C as the value, and D is discarded as if it didn't exist.
After applying the sum over the value lists we get the final output.
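As a small illustration (the rows below are hypothetical, since the original figures are not reproduced here): suppose the relation contains the (A, B, C, D) rows (1, 2, 3, 9), (1, 2, 5, 9) and (2, 3, 7, 9). The map function produces ((1, 2), 3), ((1, 2), 5) and ((2, 3), 7); grouping gives (1, 2) : [3, 5] and (2, 3) : [7]; applying sum in the reduce function yields ((1, 2), 8) and ((2, 3), 7).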
Here too, as with the difference operation, we cannot get rid of the reduce stage. The context of whole tuples is not needed here, but the aggregation function makes it necessary for all the values of a single key to be in one place. This operation is also inefficient compared to selection, projection, union, and intersection. Columns that appear in neither the aggregation nor the grouping clause are ignored and are not required, so if the data is stored in a columnar format we can save the cost of loading a lot of data. Since usually only a few columns are involved in grouping and aggregation, this saves a lot of cost, both in terms of data sent over the network and data that must be loaded into main memory for execution.
c) Selection and Projection.
Selection:
To perform selections using map reduce we need the following Map and Reduce functions:
Map Function: For each row r in the table, apply the condition and produce the key-value pair (r, r) if the condition is satisfied, else produce nothing; i.e., the key and the value are the same.
Reduce Function: The reduce function has nothing to do in this case. It simply writes the value for each key it receives to the output.
For our example we will do Selection(B <= 3): select all the rows where the value of B is less than or equal to 3.
Let's consider the data we have, initially distributed as files on the map workers; the data looks like the following figure.
After applying the map function (and grouping; there are no common keys in this case as each row is unique), we get the output as follows. The tuples are constructed with the 0th index containing values from column A and the 1st index containing values from column B. In actual implementations this information can be sent either as extra metadata or within each value itself, making keys and values look something like ({A: 1}, {B: 2}), which does look somewhat inefficient.
After this, based on the number of reduce workers (2 in our case), a hash function is applied as explained in the Hash Function section. The files for the reduce workers on the map workers will look like:
After this step, the files destined for reduce worker 1 are sent to it, and those for reduce worker 2 are sent to it. The data at the reduce workers will look like:
The final output is obtained after applying the reduce function, which ignores the keys and just considers the values.
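As a small illustration (the rows below are hypothetical, since the original figures are not reproduced here): if the relation contains the (A, B) rows (1, 2), (4, 5), (2, 3) and (6, 7), the map function emits only ((1, 2), (1, 2)) and ((2, 3), (2, 3)), because the other rows fail the condition B <= 3; the reduce function then simply writes out (1, 2) and (2, 3).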
Projection:
Map Function: For each row r in the table, produce a key-value pair (r', r'), where r' contains only the attributes to be projected.
Reduce Function: The reduce function will get outputs in the form r' : [r', r', r', ...], since after removing some columns the output may contain duplicate rows. So it just takes the value at the 0th index, getting rid of duplicates. (Note that this deduplication is done because we are implementing the operations so that they produce the outputs expected under relational algebra.)
The keys will be partitioned using a hash function as was the case in selection. The data will look
like:
At the reduce node the keys will be aggregated again, as the same keys might have occurred at multiple map workers. As we already know, the reduce function operates on the values of each key only once. The reduce function is applied, which considers only the first value of the values list and ignores the rest of the information.
The point to remember is that here the reduce function is required only for duplicate elimination. If that is not needed (as in SQL, which allows duplicates), we can get rid of the reduce stage, meaning we do not have to move data around; the operation can then be implemented without actually passing data between workers.
10) List Relational Algebra Operations. Explain any two using MapReduce.
2. Projection:
For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S. The result of this projection is denoted π_S(R).
Projection is performed similarly to selection. As projection may cause the same tuple to appear several times, the reduce function eliminates duplicates.