Assignment No 01
By:
Aakanksha Padmanabhan
BE-IT
Roll No:01
1) List different types of data and explain structured, semi-structured and unstructured data with examples.
Structured Data:
Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Structured data is the simplest kind of data to process and manage. Example: relational data.
Another common storage format for structured data is the comma-separated values (CSV) file. Figure 2 shows structured data in CSV format. While it might look messy at first, a closer look shows that it follows a rigorous structure that can easily be converted into a spreadsheet-like view. Each row has a value for a product code, order number, etc. These values are separated by a semicolon ";" so they can easily be related to the right column header. For example, the first value in every row indicates a product code.
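For illustration only (the figure itself is not reproduced here, and the column names and values below are hypothetical), a few such rows might look like:
    product_code;order_number;quantity;price
    P1001;ORD-501;2;19.99
    P1002;ORD-502;1;5.49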
1. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but that has some organizational properties which make it easier to analyse. With some processing you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structured form exists to ease storage. Example: XML data.
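A small illustrative XML fragment (the element names are hypothetical) shows the self-describing, loosely structured nature of such data:
    <order number="ORD-501">
        <product code="P1001">Keyboard</product>
        <note>Deliver after 5 pm</note>
    </order>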
2. Unstructured data –
Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model; thus, it is not a good fit for a mainstream relational database. There are alternative platforms for storing and managing unstructured data, which is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.
There is a plethora of examples of unstructured data. Just think of any image (e.g. JPEG), video (e.g. MP4), song (e.g. MP3), document (e.g. PDF or DOCX) or any other file type. The image below shows just one concrete example of unstructured data: a product image and description text. Even though this type of data might be easy for us humans to consume, it has no degree of organization and is therefore difficult for machines to analyse and interpret.
4. Set the HADOOP_HOME environment variable on Windows 10 (see Steps 1, 2, 3 and 4 below).
5. Set the JAVA_HOME environment variable on Windows 10 (see Steps 1, 2, 3 and 4 below).
6. Next, add the Hadoop bin directory path and the Java bin directory path to the Path variable.
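For illustration only (the JDK install path below is an assumption, not a value from the original assignment), the variables might be set as:
    HADOOP_HOME = C:\Hadoop-2.8.0
    JAVA_HOME = C:\java\jdk1.8.0_152
and the Path variable extended with:
    %HADOOP_HOME%\bin;%HADOOP_HOME%\sbin;%JAVA_HOME%\bin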
Configuration
1. Edit the file C:/Hadoop-2.8.0/etc/hadoop/core-site.xml, paste the XML snippet shown below, and save the file.
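A typical minimal core-site.xml for a single-node setup looks like the following (the port 9000 is an assumption; it is the value commonly used in Hadoop 2.8.0 guides for Windows):
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>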
Create folder “datanode” under “C:\Hadoop-2.8.0\data”
Create folder “namenode” under “C:\Hadoop-2.8.0\data”
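These two folders are typically referenced from C:/Hadoop-2.8.0/etc/hadoop/hdfs-site.xml; a minimal sketch, assuming the paths above and a single-node replication factor of 1, is:
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>C:\Hadoop-2.8.0\data\namenode</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>C:\Hadoop-2.8.0\data\datanode</value>
        </property>
    </configuration>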
Hadoop Configuration:
1. Open cmd and type the command "hdfs namenode -format".
Once Hadoop is started, you will see separate windows for:
Hadoop Namenode
Hadoop Datanode
YARN Resource Manager
YARN Node Manager
4. Open: http://localhost:8088
5. Open: http://localhost:50070
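For reference, a typical command sequence (assuming the Hadoop bin and sbin directories are on the Path) is:
    hdfs namenode -format
    start-all.cmd
start-all.cmd (or start-dfs.cmd followed by start-yarn.cmd) opens the four daemon windows listed above, after which the YARN ResourceManager UI is available at http://localhost:8088 and the NameNode UI at http://localhost:50070.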
3) Explain HDFS architecture with diagram.
Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. The namenode software can run on ordinary commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual datanodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in Hadoop 2.x), but it can be increased as needed by changing the HDFS configuration.
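For example, in Hadoop 2.x the block size can be changed through the dfs.blocksize property in hdfs-site.xml (the 128 MB value below is only an illustration):
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>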
2. Document Oriented:
In MongoDB, all the data is stored in documents instead of tables as in an RDBMS. In these documents the data is stored as fields (key-value pairs) instead of rows and columns, which makes the data much more flexible than in an RDBMS. Each document contains its own unique object id.
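A minimal illustrative document (the field names and values are hypothetical, not taken from the assignment) might look like:
    {
        "_id": ObjectId("507f1f77bcf86cd799439011"),
        "name": "Aakanksha",
        "roll_no": 1,
        "courses": ["Big Data", "IT"]
    }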
3. Indexing:
In a MongoDB database, fields in documents can be indexed with primary and secondary indices, which makes it easier and faster to get or search data from the pool of data. If the data is not indexed, the database must scan every document against the specified query, which takes a lot of time and is not efficient.
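For example, a secondary index can be created from the mongo shell as follows (the collection and field names are hypothetical):
    db.students.createIndex({ roll_no: 1 })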
4. Scalability:
MongoDB provides horizontal scalability with the help of sharding. Sharding means distributing data across multiple servers: a large amount of data is partitioned into data chunks using the shard key, and these chunks are evenly distributed across shards that reside on many physical servers. New machines can also be added to a running database.
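For illustration, sharding a collection from the mongo shell (the database, collection and shard key names are hypothetical) looks like:
    sh.enableSharding("school")
    sh.shardCollection("school.students", { roll_no: "hashed" })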
5. Replication:
MongoDB provides high availability and redundancy with the help of replication: it creates multiple copies of the data and sends these copies to different servers, so that if one server fails, the data can be retrieved from another server.
6. Aggregation:
It allows operations to be performed on grouped data to obtain a single or computed result, similar to the SQL GROUP BY clause. It provides three different ways to aggregate: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
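A minimal aggregation-pipeline example from the mongo shell (the collection and field names are hypothetical), equivalent to a SQL GROUP BY with SUM:
    db.orders.aggregate([
        { $group: { _id: "$customer", total: { $sum: "$amount" } } }
    ])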
7. High Performance:
MongoDB offers very high performance and data persistence compared to other databases, due to features such as scalability, indexing, and replication.
Data is stored in NoSQL databases in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the data is stored in the form of key-value pairs. The key is usually a sequence of strings, integers or characters, but can also be a more advanced data type. The value is typically linked or co-related to the key. Key-value pair storage databases generally store data as a hash table where each key is unique. The value can be of any type (JSON, BLOB (Binary Large Object), strings, etc.). This type of pattern is usually used in shopping websites or e-commerce applications.
Advantages:
Limitations:
Complex queries may involve multiple key-value pairs, which may delay performance.
Data involving many-to-many relationships may collide.
Examples:
DynamoDB
Berkeley DB
2. Column Store Database:
Rather than storing data in relational tuples, the data is stored in individual cells which are further grouped into columns. Column-oriented databases work only on columns. They store large amounts of data together in columns. The format and titles of the columns can diverge from one row to another. Every column is treated separately, but each individual column may still contain multiple other columns, as in traditional databases. Basically, columns are the mode of storage in this type.
Advantages:
Examples:
HBase
Bigtable by Google
Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be described as a complex data structure. A document here can be in the form of text, arrays, strings, JSON, XML or any such format. The use of nested documents is also very common. It is very effective, as most of the data created is usually in the form of JSON and is unstructured.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storage, retrieval and management of documents is easy.
Limitations:
Examples:
MongoDB
CouchDB
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data as graphs. Graphs are basically structures that depict connections between two or more objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier. Each node serves as a point of contact for the graph. This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or many characteristics which are connected by edges.
The relational database pattern has tables which are loosely connected, whereas graphs are often very strong and rigid in nature.
Advantages:
Limitations:
Examples:
Neo4J
FlockDB( Used by Twitter)
6) Write a short note on:
a) Hadoop Architectural Model.
Hadoop is an open-source framework from Apache used to store, process and analyse data that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3. MapReduce: This is a framework which helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and then the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the JobTracker and NameNode, whereas the slave nodes include the DataNode and TaskTracker.
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.
NameNode
o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming and closing files and directories.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives the task and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
MapReduce processing begins when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
b) Hive and its architecture.
The Hive architecture consists of three main components:
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
Hive Clients:
Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication. For Java applications, it provides JDBC drivers, and for other types of applications it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
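As a minimal sketch of how a Java application talks to Hive over JDBC (the host, port and database in the connection URL are assumptions for a default local HiveServer2 installation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to HiveServer2 (host, port and database are assumed for a local setup)
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // List the tables in the default database
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}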
Hive Services:
Client interactions with Hive can be performed through Hive Services. If the client wants to
perform any query related operations in Hive, it has to communicate through Hive Services.
The CLI (command line interface) acts as the Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and then with the main driver in the Hive services, as shown in the architecture diagram above.
The driver present in the Hive services is the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver processes requests from the different applications and passes them to the metastore and file systems for further processing.
Hive Storage and Computing:
Hive services such as the Metastore, File system, and Job Client in turn communicate with Hive storage and perform the following actions:
Metadata information of tables created in Hive is stored in the Hive "meta storage database".
Query results and data loaded into the tables are stored on the Hadoop cluster in HDFS.
7) What are JobTracker and TaskTracker? Explain the benefits of block Transfer.
JobTracker and TaskTracker are two essential processes involved in MapReduce execution in MRv1 (Hadoop version 1). Both processes are deprecated in MRv2 (Hadoop version 2) and replaced by the ResourceManager, ApplicationMaster and NodeManager daemons.
Job Tracker –
The JobTracker runs on the master node; it accepts MapReduce jobs from clients, schedules the individual map and reduce tasks on TaskTrackers, monitors their progress, and re-executes failed tasks.
TaskTracker –
The TaskTracker runs on each slave node; it executes the map and reduce tasks assigned to it by the JobTracker and reports its progress back through periodic heartbeats.
8) What is a Role of Combiner in MapReduce Framework? Explain with the help of one
example.
Combiner
The Combiner class is used in between the Map class and the Reduce class to reduce the
volume of data transfer between Map and Reduce. Usually, the output of the map task is large
and the data transferred to the reduce task is high.
The following MapReduce task diagram shows the COMBINER PHASE.
How Combiner Works?
Here is a brief summary on how MapReduce Combiner works −
A combiner does not have a predefined interface and it must implement the Reducer interface's reduce() method.
A combiner operates on each map output key. It must have the same output key-value types as
the Reducer class.
A combiner can produce summary information from a large dataset because it replaces the
original Map output.
Although the Combiner is optional, it helps segregate data into multiple groups for the Reduce phase, which makes the data easier to process.
Example:
In the above example, we can see that the two Mappers contain different data: the main text file is divided between two Mappers, and each Mapper is assigned a different line of our data. In our example we have two lines of data, so we have two Mappers to handle each line. The Mappers produce intermediate key-value pairs, where the particular word is the key and its count is the value. For example, for the data "Geeks For Geeks For" the key-value pairs are shown below.
// Key Value pairs generated for data Geeks For Geeks For
(Geeks,1)
(For,1)
(Geeks,1)
(For,1)
The key-value pairs generated by the Mapper are known as the intermediate key-value pairs or
intermediate output of the Mapper. Now we can minimize the number of these key-value pairs
by introducing a combiner for each Mapper in our program. In our case, we have four key-value pairs generated by each Mapper. Since these intermediate key-value pairs are not ready to be fed directly to the Reducer (doing so would increase network congestion), the Combiner combines them before sending them to the Reducer. The Combiner merges the intermediate key-value pairs according to their key. For the data "Geeks For Geeks For" above, the Combiner will partially reduce them by merging pairs with the same key and generate new key-value pairs as shown below.
// Partially reduced key-value pairs with combiner
(Geeks,2)
(For,2)
With the help of the Combiner, the Mapper output is partially reduced in size (fewer key-value pairs), and this reduced output can then be made available to the Reducer for better performance. The Reducer then reduces the output obtained from the Combiners and produces the final output, which is stored on HDFS (Hadoop Distributed File System).
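A minimal runnable sketch of the same word-count example using Hadoop's MapReduce API is shown below; the class names are illustrative, and the key point is the single line job.setCombinerClass(IntSumReducer.class), which plugs the reducer logic in as a combiner because its input and output key-value types are identical.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // e.g. (Geeks, 1), (For, 1), ...
            }
        }
    }

    // Sums counts per word; used as both the combiner and the reducer,
    // since its input and output key-value types are identical.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);            // e.g. (Geeks, 2), (For, 2)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner: partial reduction of each mapper's output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}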
9) Show the MapReduce Implementation for the following tasks with the help of one
example:
a) Matrix-Vector Multiplication.
MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time; it is mostly used in distributed systems. It has two important parts:
Mapper: It takes raw input data and organizes it into key-value pairs. For example, in a dictionary you search for the word "Data" and its associated meaning is "facts and statistics collected together for reference or analysis". Here the key is "Data" and the value associated with it is "facts and statistics collected together for reference or analysis".
Reducer: It is responsible for processing the data in parallel and producing the final output.
Let us consider the matrix-vector multiplication example to visualize MapReduce (the matrix figure from the original assignment is not reproduced here).
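As a sketch under stated assumptions (the input is taken as triples (i, j, m[i][j]) and the vector v is assumed small enough to be held in memory on every map worker), the map step emits (i, m[i][j] * v[j]) for each matrix entry and the reduce step sums the partial products for each row index i. The self-contained Java simulation below (the matrix and vector values are hypothetical) illustrates the idea:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MatrixVectorMapReduceSketch {

    // "Map" step: for one matrix entry m[i][j], emit the pair (i, m[i][j] * v[j]).
    static List<Map.Entry<Integer, Double>> map(int i, int j, double mij, double[] v) {
        return List.of(Map.entry(i, mij * v[j]));
    }

    // "Reduce" step: sum all partial products that share the same row index.
    static double reduce(List<Double> partialProducts) {
        return partialProducts.stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};   // hypothetical 2x2 matrix
        double[] v = {5, 6};               // hypothetical vector
        Map<Integer, List<Double>> grouped = new TreeMap<>();
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                for (Map.Entry<Integer, Double> kv : map(i, j, m[i][j], v)) {
                    grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
                }
            }
        }
        // Prints x[0] = 17.0 and x[1] = 39.0, i.e. the product m * v.
        grouped.forEach((row, vals) -> System.out.println("x[" + row + "] = " + reduce(vals)));
    }
}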
b) Grouping and Aggregation.
Usually, understanding grouping and aggregation takes a bit of time when we learn SQL, but not when we understand these operations using MapReduce. The logic is already there in the working of the map: map workers implicitly group keys, and the reduce function acts upon the aggregated values to generate the output.
Map Function: For each row in the table, take the attributes on which grouping is to be done as the key; the value will be the attributes on which aggregation is to be performed. For example, if a relation has four columns A, B, C, D and we want to group by A, B and aggregate over C, we make (A, B) the key and C the value.
Reduce Function: Apply the aggregation operation (sum, max, min, avg, …) on the list of
values and output the result.
For our example, let's group by (A, B) and apply sum as the aggregation.
After applying the map function and grouping the keys, the data has (A, B) as the key and C as the value, and D is discarded as if it didn't exist.
After applying the sum over the value lists we get the final output.
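As a small illustration (the rows below are hypothetical, since the original figures are not reproduced here): suppose the relation contains the (A, B, C, D) rows (1, 2, 3, 9), (1, 2, 5, 9) and (2, 3, 7, 9). The map function produces ((1, 2), 3), ((1, 2), 5) and ((2, 3), 7); grouping gives (1, 2) : [3, 5] and (2, 3) : [7]; applying sum in the reduce function yields ((1, 2), 8) and ((2, 3), 7).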
Here too, as with the difference operation, we cannot get rid of the reduce stage. The context of whole tuples is not needed here, but the aggregation function makes it necessary for all the values of a single key to be in one place. This operation is also inefficient compared to selection, projection, union, and intersection. Columns that appear in neither the aggregation nor the grouping clause are ignored and are not required, so if the data is stored in a columnar format we can save the cost of loading a lot of data. Since usually only a few columns are involved in grouping and aggregation, this saves a lot of cost, both in terms of data sent over the network and data that must be loaded into main memory for execution.
c) Selection and Projection.
Selection:
To perform selections using map reduce we need the following Map and Reduce functions:
Map Function: For each row r in the table, apply the condition and produce the key-value pair (r, r) if the condition is satisfied, else produce nothing; i.e., the key and the value are the same.
Reduce Function: The reduce function has nothing to do in this case. It simply writes the value for each key it receives to the output.
For our example we will do Selection(B <= 3): select all the rows where the value of B is less than or equal to 3.
Let's consider the data we have, initially distributed as files on the map workers; the data looks like the following figure.
After applying the map function (and grouping; there are no common keys in this case as each row is unique), we get the output as follows. The tuples are constructed with the 0th index containing values from column A and the 1st index containing values from column B. In actual implementations this information can be sent either as extra metadata or within each value itself, making keys and values look something like ({A: 1}, {B: 2}), which does look somewhat inefficient.
After this, based on the number of reduce workers (2 in our case), a hash function is applied as explained in the Hash Function section. The files for the reduce workers on the map workers will look like:
After this step, the files destined for reduce worker 1 are sent to it, and those for reduce worker 2 are sent to it. The data at the reduce workers will look like:
The final output is obtained after applying the reduce function, which ignores the keys and just considers the values.
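As a small illustration (the rows below are hypothetical, since the original figures are not reproduced here): if the relation contains the (A, B) rows (1, 2), (4, 5), (2, 3) and (6, 7), the map function emits only ((1, 2), (1, 2)) and ((2, 3), (2, 3)), because the other rows fail the condition B <= 3; the reduce function then simply writes out (1, 2) and (2, 3).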
Projection:
Map Function: For each row r in the table, produce a key-value pair (r', r'), where r' contains only the attributes to be projected.
Reduce Function: The reduce function will get outputs in the form r' : [r', r', r', ...], since after removing some columns the output may contain duplicate rows. So it just takes the value at the 0th index, getting rid of duplicates. (Note that this deduplication is done because we are implementing the operations so that they produce the outputs expected under relational algebra.)
The keys will be partitioned using a hash function as was the case in selection. The data will look
like:
At the reduce node the keys will be aggregated again, as the same keys might have occurred at multiple map workers. As we already know, the reduce function operates on the values of each key only once. The reduce function is applied, which considers only the first value of the values list and ignores the rest of the information.
The point to remember is that here the reduce function is required only for duplicate elimination. If that is not needed (as in SQL, which allows duplicates), we can get rid of the reduce stage, meaning we do not have to move data around; the operation can then be implemented without actually passing data between workers.
10) List Relational Algebra Operations. Explain any two using MapReduce.
2. Projection:
For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S. The result of this projection is denoted π_S(R).
Projection is performed similarly to selection. As projection may cause the same tuple to appear several times, the reduce function eliminates duplicates.