0% found this document useful (0 votes)

364 views

Big Data Assignment PDF

The document discusses how communication occurs between the namenode and datanodes in Hadoop. It explains that the datanode initiates all communication by sending heartbeats to the namenode. The heartbeats include statistics and block reports. The namenode responds with commands for the datanode like instructions for block replication. It also describes the read and write operations in HDFS and how data flows from clients to the datanode pipeline and is replicated across multiple datanodes.

Uploaded by

Mukul Mishra

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

364 views

Big Data Assignment PDF

Uploaded by

Mukul Mishra

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 18

Q1: Explain the Hadoop Ecosystem and discuss how communication is taking place between

namenode and datanodes.

What is Hadoop?

Hadoop is a framework that allows for the distributed processing of large datasets across clusters
of computers using simple programming models

 Economical: Its systems are highly economical as ordinary computers can be used for data
processing.
 Reliable: It is reliable as it stores copies of the data on different machines and is resistant
to hardware failure.
 Scalable: It is easily scalable both, horizontally and vertically. A few extra nodes help in
scaling up the framework.
 Flexible: It is flexible and you can store as much structured and unstructured data as you
need to and decide to use them later.

Hadoop Ecosystem

Hadoop Ecosystem Hadoop has an ecosystem that has evolved from its three core components processing,
resource management, and storage.

Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises the following
twelve components:

 HDFS(Hadoop Distributed file system)- Hadoop Distributed File System

 HBase
 Sqoop
 Flume
 Spark
 Hadoop MapReduce
 Pig

1|Page
 Impala
 Hive
 Cloudera Search
 Oozie
 Hue.
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing

HDFS:

 HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
NameNode/Master Node

 Stores metadata for the files, like the directory structure of a typical FS.
 The server holding the NameNode instance is quite crucial, as there is only one.
 Transaction log for file deletes/adds,etc ,Does not use transaction for whole blocks or
file-streams, only metadata.
 Handles creation of more replica blocks when necessary after a DataNode failure.

DataNode/Slave Node

 Stores the actual data in HDFS

 Can run on any underlying filesystem (ext3/4, NTFS, etc)
 Notifies NameNode of what blocks it has
 NameNode replicates blocks 2x in local rack, 1x elsewhere

2|Page
YARN:

 Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop System.
 Consists of three major components i.e.
1. Resource Manager
2. Nodes Manager
3. Application Manager
 Resource manager has the privilege of allocating resources for the applications in a system
whereas Node managers work on the allocation of resources such as CPU, memory,
bandwidth per machine and later on acknowledges the resource manager. Application
manager works as an interface between the resource manager and node manager and
performs negotiations as per the requirement of the two.

MapReduce:

 By making the use of distributed and parallel algorithms, MapReduce makes it possible to
carry over the processing’s logic and helps to write applications which transform big data
sets into a manageable one.
 MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of data and thereby organizing them in the form
of group. Map generates a key-value pair based result which is later on processed by
the Reduce() method.
2. Reduce(), as the name suggests does the summarization by aggregating the mapped
data. In simple, Reduce() takes the output generated by Map() as input and combines
those tuples into smaller set of tuples.

3|Page
Communication takes place between NameNode and DataNode

How a NameNode and DataNode communicate with each other?

All communication between Namenode and Datanode is initiated by the Datanode, and
responded to by the Namenode. The Namenode never initiates communication to
the Datanode, although Namenode responses may include commands to the Datanode that
cause it to send further communications.
DataNode sends information to NameNode through four major interfaces defined in the

DataNodeProtocol These Four are:

1. DataNode Registration. The DataNode informs NameNode of its existence. NameNode

returns its registration id. This registration id is a parameter of other DataNode functions.
Registration is triggered when a new DataNode is initiated, an old one is re-initiated, or
when a new NameNode is initiated.

2. DataNode sends heartbeat. The DataNode sends a heartbeat message every few seconds.
This includes some information statistics about capacity and current activity. NameNode
returns a list of block oriented commands for DataNode to execute. These commands

4|Page
primarily consist of instructions to transfer blocks to other DataNodes for replication
purposes, or instructions to delete blocks. The NameNode can also command an
immediate Block Report from the DataNode, but this is only done to recover from severe
problems.
3. DataNode sends block report. DataNode periodically reports the blocks contained in its
storage. The period is typically configured to hourly.
4. DataNode notifies Block Received. DataNode reports that it has received a new block,
either from a Client (during file write) or from another DataNode (during replication). It
reports each block immediately upon receipt.

Read Operation In HDFS

1. A client initiates read request by calling 'open()' method of FileSystem object; it is an

object of type DistributedFileSystem.
2. This object connects to namenode using RPC and gets metadata information such as the
locations of the blocks of the file. Please note that these addresses are of first few blocks
of a file.
3. In response to this metadata request, addresses of the DataNodes having a copy of that
block is returned back.

5|Page
4. Once addresses of DataNodes are received, an object of type FSDataInputStream is
returned to the client. FSDataInputStream contains DFSInputStream which takes care
of interactions with DataNode and NameNode. In step 4 shown in the above diagram, a
client invokes 'read()' method which causes DFSInputStream to establish a connection
with the first DataNode with the first block of a file.
5. Data is read in the form of streams wherein client invokes 'read()' method repeatedly. This
process of read() operation continues till it reaches the end of block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on
to locate the next DataNode for the next block
7. Once a client has done with the reading, it calls a close() method.

Write Operation In HDFS

1. A client initiates write operation by calling 'create()' method of DistributedFileSystem

object which creates a new file DistributedFileSystem object connects to the NameNode
using RPC call and initiates new file creation. However, this file creates operation does
not associate any blocks with the file. It is the responsibility of NameNode to verify that

6|Page
the file (which is being created) does not exist already and a client has correct
permissions to create a new file. If a file already exists or client does not have sufficient
permission to create a new file, then IOException is thrown to the client. Otherwise, the
operation succeeds and a new record for the file is created by the NameNode.
2. Once a new record in NameNode is created, an object of type FSDataOutputStream is
returned to the client. A client uses it to write data into the HDFS. Data write method is
invoked (step 3 in the diagram).
3. FSDataOutputStream contains DFSOutputStream object which looks after
communication with DataNodes and NameNode. While the client continues writing
data, DFSOutputStream continues creating packets with this data. These packets are
enqueued into a queue which is called as DataQueue.
4. There is one more component called DataStreamer which consumes this DataQueue.
DataStreamer also asks NameNode for allocation of new blocks thereby picking desirable
DataNodes to be used for replication.
5. Now, the process of replication starts by creating a pipeline using DataNodes. In our
case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the
pipeline.
6. The DataStreamer pours packets into the first DataNode in the pipeline.
7. Every DataNode in a pipeline stores packet received by it and forwards the same to the
second DataNode in a pipeline.
8. Another queue, 'Ack Queue' is maintained by DFSOutputStream to store packets which
are waiting for acknowledgment from DataNodes.
9. Once acknowledgment for a packet in the queue is received from all DataNodes in the
pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure,
packets from this queue are used to reinitiate the operation.
10. After a client is done with the writing data, it calls a close() method (Step 9 in the
diagram) Call to close(), results into flushing remaining data packets to the pipeline
followed by waiting for acknowledgment.
11. Once a final acknowledgment is received, NameNode is contacted to tell it that the file
write operation is complete.

7|Page
Q2: Explain the steps required for the configuration of eclipse with Apache Hadoop.
Support your answer with screenshots.

Eclipse is the most popular Integrated Development Environment (IDE) for developing Java
applications. It is robust, feature-rich, easy-to-use and powerful IDE which is the #1 choice of
almost Java programmers in the world. And it is totally FREE.

1. Download and Install Eclipse IDE

https://www.eclipse.org/downloads/download.php?file=/oomph/epp/2020

After downloading the file, we need to move this file to the home directory.

2. Extracting the file

8|Page
Two methods to extract the file either right click on the file and click on extract here, or

Unzip the file using following command:

$ tar -xzvf eclipse-java-inst-1-linux-gtk-x86_64.tar.gz

4. Choose a Workspace Directory:

Eclipse organizes projects by workspaces. A workspace is a group of related projects and it is

actually a directory on your computer. That’s why when you start Eclipse, it asks to choose a
workspace location like this:

9|Page
5. Create Java Project in Package Explorer:
By clicking in the left side of the interface

File > New > Java Project > (Name it – WordCountProject1) > Finish

The new Java Project has created

6. After that we need to create the Package, by clicking:

Right Click > New > Package (Name it – WordCountPackage1) > Finish

7. After creating the Package the third step is to create the class

Right Click on Package > New > Class (Name it – WordCount)

8. Download Hadoop Libraries

Add Following Reference Libraries –

 Hadoop-core-1.2.1.jar
 Commons-cli-1.2.jar

10 | P a g e
 Hadoop-core-1.2.1.jar

 Commons-cli-1.2.jar

11 | P a g e
9. To Add this Following libraries

Right Click on Project > Build Path> Add External Archivals

 Downloads/hadoop-core-1.2.1.jar
 Downloads/commons-cli-1.2.jar

12 | P a g e
10. The Hadoop Libraries has successfully added.

The next step is the mapper and reducer logic write in the console for the word count project.

Q3.Explain YARN model and give its comparison with that of Hadoop.

Hadoop YARN knits the storage unit of Hadoop i.e. HDFS (Hadoop Distributed File System) with
the various processing tools. YARN stands for “Yet Another Resource Negotiator”.

Why YARN?

In Hadoop version 1.0, MapReduce performed both processing and resource management
functions. It consisted of a Job Tracker which was the single master. The Job Tracker allocated the
resources, performed scheduling and monitored the processing jobs. It assigned map and reduce
tasks on a number of subordinate processes called the Task Trackers. The Task Trackers
periodically reported their progress to the Job Tracker.

This design resulted in scalability bottleneck due to a single Job Tracker. IBM mentioned in its
article that according to Yahoo!, the practical limits of such a design are reached with a cluster of
5000 nodes and 40,000 tasks running concurrently. Apart from this limitation, the utilization of
computational resources is inefficient in MRV1. Also, the Hadoop framework became limited
only to MapReduce processing paradigm.

To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over

13 | P a g e
the responsibility of Resource Management and Job Scheduling. YARN started to give Hadoop
the ability to run non-MapReduce jobs within the Hadoop framework.

Introduction to Hadoop YARN

In version 2.0, YARN. YARN allows different data processing methods like graph processing,
interactive processing, stream processing as well as batch processing to run and process data stored
in HDFS. Therefore YARN opens up Hadoop to other types of distributed applications beyond
MapReduce.

YARN enabled the users to perform operations as per requirement by using a variety of tools
like Spark for real-time processing, Hive for SQL, HBase for NoSQL and others.

Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all
your processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN
Architecture consists of the following main components:

1. Resource Manager: Runs on a master daemon and manages the resource allocation in the
cluster.
2. Node Manager: They run on the slave daemons and are responsible for the execution of a
task on every single Data Node.

14 | P a g e
3. Application Master: Manages the user job lifecycle and resource needs of individual
applications. It works along with the Node Manager and monitors the execution of tasks.
4. Container: Package of resources including RAM, CPU, Network, HDD etc on a single
node.

Components of YARN

The first component of YARN Architecture is,

Resource Manager

 It is the ultimate authority in resource allocation.

 On receiving the processing requests, it passes parts of requests to corresponding node
managers accordingly, where the actual processing takes place.
 It is the arbitrator of the cluster resources and decides the allocation of the available
resources for competing applications.
 Optimizes the cluster utilization like keeping all resources in use all the time against
various constraints such as capacity guarantees, fairness, and SLAs.
 It has two major components: a) Scheduler b) Application Manager

15 | P a g e
a) Scheduler

 The scheduler is responsible for allocating resources to the various running applications
subject to constraints of capacities, queues etc.
 It is called a pure scheduler in ResourceManager, which means that it does not perform
any monitoring or tracking of status for the applications.
 If there is an application failure or hardware failure, the Scheduler does not guarantee to
restart the failed tasks.
 Performs scheduling based on the resource requirements of the applications.
 It has a pluggable policy plug-in, which is responsible for partitioning the cluster
resources among the various applications. There are two such plug-ins: Capacity
Scheduler and Fair Scheduler, which are currently used as Schedulers in
ResourceManager.

b) Application Manager

 It is responsible for accepting job submissions.

 Negotiates the first container from the Resource Manager for executing the application
specific Application Master.
 Manages running the Application Masters in a cluster and provides service for restarting
the Application Master container on failure.

Coming to the second component which is:

Node Manager

 It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow
on the given node.
 It registers with the Resource Manager and sends heartbeats with the health status of the
node.
 Its primary goal is to manage application containers assigned to it by the resource
manager.
 It keeps up-to-date with the Resource Manager.

16 | P a g e
 Application Master requests the assigned container from the Node Manager by sending it
a Container Launch Context(CLC) which includes everything the application needs in
order to run. The Node Manager creates the requested container process and starts it.
 Monitors resource usage (memory, CPU) of individual containers.
 Performs Log management.
 It also kills the container as directed by the Resource Manager.

The third component of Apache Hadoop YARN is,

Application Master

 An application is a single job submitted to the framework. Each such application has a
unique Application Master associated with it which is a framework specific entity.
 It is the process that coordinates an application’s execution in the cluster and also
manages faults.
 Its task is to negotiate resources from the Resource Manager and work with the Node
Manager to execute and monitor the component tasks.
 It is responsible for negotiating appropriate resource containers from the
ResourceManager, tracking their status and monitoring progress.
 Once started, it periodically sends heartbeats to the Resource Manager to affirm its health
and to update the record of its resource demands.

The fourth component is:

Container

 It is a collection of physical resources such as RAM, CPU cores, and disks on a single
node.
 YARN containers are managed by a container launch context which is container life-
cycle(CLC). This record contains a map of environment variables, dependencies stored in
a remotely accessible storage, security tokens, payload for Node Manager services and
the command necessary to create the process.

17 | P a g e
 It grants rights to an application to use a specific amount of resources (memory, CPU
etc.) on a specific host.

18 | P a g e

Narrative As Virtual Reality 2 - Marie Laure Ryan - Text PDF
100% (1)
Narrative As Virtual Reality 2 - Marie Laure Ryan - Text PDF
153 pages
Az 104 Dumps
100% (7)
Az 104 Dumps
269 pages
Redemption Accomplished and Applied John Murray
No ratings yet
Redemption Accomplished and Applied John Murray
51 pages
Double, Not Dual: A Contextual and Formal Analysis of Peter Eisenman's Max Reinhardt Haus (Berlin, 1992)
No ratings yet
Double, Not Dual: A Contextual and Formal Analysis of Peter Eisenman's Max Reinhardt Haus (Berlin, 1992)
32 pages
Ineo+ - 224e - 284e - 364e - 454e - 554e - Product Guide - e - 130211 - F
No ratings yet
Ineo+ - 224e - 284e - 364e - 454e - 554e - Product Guide - e - 130211 - F
48 pages
Mastering Apache Cassandra - Second Edition
From Everand
Mastering Apache Cassandra - Second Edition
Nishant Neeraj
No ratings yet
Professional Hadoop Solutions
From Everand
Professional Hadoop Solutions
Boris Lublinsky
4/5 (2)
DRBD-Cookbook: How to create your own cluster solution, without SAN or NAS!
From Everand
DRBD-Cookbook: How to create your own cluster solution, without SAN or NAS!
Joerg Christian Seubert
No ratings yet
HBase
No ratings yet
HBase
31 pages
BDA Answers-1
No ratings yet
BDA Answers-1
15 pages
Big Query
No ratings yet
Big Query
5 pages
Uml Final
No ratings yet
Uml Final
70 pages
Web Services
No ratings yet
Web Services
10 pages
Database Lab Manual
No ratings yet
Database Lab Manual
86 pages
07 - Ingesting New Datasets Into Google BigQuery
No ratings yet
07 - Ingesting New Datasets Into Google BigQuery
8 pages
Big Data 3
No ratings yet
Big Data 3
16 pages
Lab - Qlik Replicate With Google BigQuery
No ratings yet
Lab - Qlik Replicate With Google BigQuery
23 pages
Chapter - 1: Introduction To Cloud Computing
No ratings yet
Chapter - 1: Introduction To Cloud Computing
36 pages
Lab 4 - Installation of Hadoop and MapReduce WordCount Example
100% (1)
Lab 4 - Installation of Hadoop and MapReduce WordCount Example
16 pages
04 Bigdata Hive
No ratings yet
04 Bigdata Hive
22 pages
Big Data Technologies
No ratings yet
Big Data Technologies
4 pages
16-029 Pentaho Hadoop Ebook v7
No ratings yet
16-029 Pentaho Hadoop Ebook v7
22 pages
Web Technologies 0108
No ratings yet
Web Technologies 0108
15 pages
Cs1028 Network Security
No ratings yet
Cs1028 Network Security
120 pages
Data Base Complete
No ratings yet
Data Base Complete
75 pages
Apache Kafka
No ratings yet
Apache Kafka
17 pages
Class: CS 237 Distributed Systems Middleware Instructor: Nalini Venkatasubramanian
No ratings yet
Class: CS 237 Distributed Systems Middleware Instructor: Nalini Venkatasubramanian
55 pages
Apache Sqoop
No ratings yet
Apache Sqoop
21 pages
Query 1:: Unique Liquor Stores in Iowa
No ratings yet
Query 1:: Unique Liquor Stores in Iowa
3 pages
Project Doc1
No ratings yet
Project Doc1
26 pages
Bus Reservation System WebBased Apllication
0% (1)
Bus Reservation System WebBased Apllication
73 pages
Compare Hadoop vs. Spark vs. Kafka For Your Big Data Strategy
No ratings yet
Compare Hadoop vs. Spark vs. Kafka For Your Big Data Strategy
10 pages
Hortonworks Data Platform (HDP)
100% (1)
Hortonworks Data Platform (HDP)
56 pages
Optimizing Hadoop for MapReduce
From Everand
Optimizing Hadoop for MapReduce
Khaled Tannir
No ratings yet
Big Data Fund
No ratings yet
Big Data Fund
5 pages
Software Requirements Specification:: Online System To Learning Center
No ratings yet
Software Requirements Specification:: Online System To Learning Center
27 pages
Lab 12 Introduction To Rapidminer/Weka.: Objective
No ratings yet
Lab 12 Introduction To Rapidminer/Weka.: Objective
24 pages
Harnessing Big Data
No ratings yet
Harnessing Big Data
29 pages
Web Services
No ratings yet
Web Services
16 pages
What Is WSDL?: Web Services
No ratings yet
What Is WSDL?: Web Services
4 pages
Using Hibernate in A Java Swing Application
No ratings yet
Using Hibernate in A Java Swing Application
22 pages
AACS1304 01 - Introduction To Is 202005
No ratings yet
AACS1304 01 - Introduction To Is 202005
49 pages
Distributed Systems Lecturer Notes Latest 25 148
No ratings yet
Distributed Systems Lecturer Notes Latest 25 148
124 pages
SQL Practical
No ratings yet
SQL Practical
16 pages
Oltp Olap Rtap
No ratings yet
Oltp Olap Rtap
53 pages
Blockchain As A Service
No ratings yet
Blockchain As A Service
17 pages
IOT (Unit III) ...
No ratings yet
IOT (Unit III) ...
39 pages
MCA - BigData Notes
No ratings yet
MCA - BigData Notes
136 pages
IT Infrastructure Architecture: Hosting and Deployment Options (Chapter 13 - 14)
No ratings yet
IT Infrastructure Architecture: Hosting and Deployment Options (Chapter 13 - 14)
31 pages
Introduction To Cloud Computing
No ratings yet
Introduction To Cloud Computing
92 pages
Basic of Azure
No ratings yet
Basic of Azure
4 pages
Design Key-Value Database + Real Usecase
No ratings yet
Design Key-Value Database + Real Usecase
49 pages
MVC Design Pattern PPT Presented by QuontraSolutions
No ratings yet
MVC Design Pattern PPT Presented by QuontraSolutions
35 pages
Stream Processing at Lyft
No ratings yet
Stream Processing at Lyft
20 pages
Big Data Analytics – Unit 4
No ratings yet
Big Data Analytics – Unit 4
32 pages
Hadoop Distributed File System (HDFS) : Suresh Pathipati
No ratings yet
Hadoop Distributed File System (HDFS) : Suresh Pathipati
43 pages
BIG DATA For BBA
No ratings yet
BIG DATA For BBA
80 pages
Nimbus Cloud Technology
No ratings yet
Nimbus Cloud Technology
5 pages
Kudu
No ratings yet
Kudu
9 pages
Advance Computer Architecture: Unit:Ii System Interconnect Architectures
No ratings yet
Advance Computer Architecture: Unit:Ii System Interconnect Architectures
53 pages
MIT 820 Architectures For Software Systems and Emerging
No ratings yet
MIT 820 Architectures For Software Systems and Emerging
26 pages
CC MODULE 5
No ratings yet
CC MODULE 5
26 pages
NoSQL Module 2
No ratings yet
NoSQL Module 2
76 pages
3 Lecture 3-ETL
100% (1)
3 Lecture 3-ETL
42 pages
Essentials of Cloud Computing 1st Edition K. Chandrasekaran pdf download
100% (1)
Essentials of Cloud Computing 1st Edition K. Chandrasekaran pdf download
51 pages
Big Data Ca
No ratings yet
Big Data Ca
14 pages
Hadoop Karunesh
No ratings yet
Hadoop Karunesh
14 pages
Big Data Assignment PDF
No ratings yet
Big Data Assignment PDF
18 pages
About The Cas1 General Food
No ratings yet
About The Cas1 General Food
6 pages
4 Hadoop 2.0
No ratings yet
4 Hadoop 2.0
5 pages
A132835183 - 21576 - 23 - 2019 - CASE BASED Ca2
No ratings yet
A132835183 - 21576 - 23 - 2019 - CASE BASED Ca2
6 pages
Bajaj
No ratings yet
Bajaj
2 pages
Case Study
100% (4)
Case Study
7 pages
Class 3
No ratings yet
Class 3
1 page
Presented By: Nisha Agrawal (Roll-27)
No ratings yet
Presented By: Nisha Agrawal (Roll-27)
11 pages
MKT525 Business To Business Marketing: Dr. Prashant Chauhan
No ratings yet
MKT525 Business To Business Marketing: Dr. Prashant Chauhan
9 pages
Introduction To Personal Finance
No ratings yet
Introduction To Personal Finance
5 pages
CS SemisV2
No ratings yet
CS SemisV2
3 pages
Attention, Awareness, and FL Behaviour
No ratings yet
Attention, Awareness, and FL Behaviour
43 pages
66b760504bd0c80018cd7741 - ## - Major Test - 04 - Test Paper
No ratings yet
66b760504bd0c80018cd7741 - ## - Major Test - 04 - Test Paper
9 pages
Ket Vocabulary and Grammar U1-4
No ratings yet
Ket Vocabulary and Grammar U1-4
4 pages
t25 - Learner Centred Foreign Language Teaching and Learning
100% (1)
t25 - Learner Centred Foreign Language Teaching and Learning
5 pages
Final Summer Holiday Homework Class 8
No ratings yet
Final Summer Holiday Homework Class 8
10 pages
SQL Cheat Sheet Bascis - MD
No ratings yet
SQL Cheat Sheet Bascis - MD
1 page
Two-Sample Tests of Hypothesis
No ratings yet
Two-Sample Tests of Hypothesis
55 pages
TIMO 202021 Heat P3
100% (3)
TIMO 202021 Heat P3
8 pages
1 BULAN Up 190 08.27.20
No ratings yet
1 BULAN Up 190 08.27.20
2 pages
Keiler, Bernard. Instruments and Music of Bolivia. (1963)
100% (1)
Keiler, Bernard. Instruments and Music of Bolivia. (1963)
6 pages
Fujoshi: The University of Chicago Press
No ratings yet
Fujoshi: The University of Chicago Press
23 pages
Product Data Sheet 6ES7214-1AD23-0XB0
No ratings yet
Product Data Sheet 6ES7214-1AD23-0XB0
6 pages
Grammar & Vocabulary Practice - Intermediate B1 - Revision - Tests - Test4
No ratings yet
Grammar & Vocabulary Practice - Intermediate B1 - Revision - Tests - Test4
2 pages
BW Guide and Best Practices
100% (2)
BW Guide and Best Practices
41 pages
Walter Scott
No ratings yet
Walter Scott
11 pages
A1 C2 Standard For Students
No ratings yet
A1 C2 Standard For Students
7 pages
Listening: Elements of Good Listening
No ratings yet
Listening: Elements of Good Listening
17 pages
A3 Worksheet - Mini Data Collection Program
No ratings yet
A3 Worksheet - Mini Data Collection Program
5 pages
Converting GRIB Data To A UM Dump Via The Reconfiguration: Unified Model Documentation Paper 303
No ratings yet
Converting GRIB Data To A UM Dump Via The Reconfiguration: Unified Model Documentation Paper 303
6 pages
Cloud Computing Internship Tasks
No ratings yet
Cloud Computing Internship Tasks
13 pages
Roll No Name Mobile No Language City Status Sunderkand
No ratings yet
Roll No Name Mobile No Language City Status Sunderkand
1 page
Class 10 Information Technology Sample Paper Set 6
No ratings yet
Class 10 Information Technology Sample Paper Set 6
8 pages
Embedded System Lab
100% (1)
Embedded System Lab
62 pages
07 - Chapter 2
No ratings yet
07 - Chapter 2
431 pages

Big Data Assignment PDF

Uploaded by

Big Data Assignment PDF

Uploaded by

Q1: Explain the Hadoop Ecosystem and discuss how communication is taking place between

namenode and datanodes.

 HDFS(Hadoop Distributed file system)- Hadoop Distributed File System

 Stores the actual data in HDFS

How a NameNode and DataNode communicate with each other?

DataNodeProtocol These Four are:

1. DataNode Registration. The DataNode informs NameNode of its existence. NameNode

Read Operation In HDFS

1. A client initiates read request by calling 'open()' method of FileSystem object; it is an

Write Operation In HDFS

1. A client initiates write operation by calling 'create()' method of DistributedFileSystem

1. Download and Install Eclipse IDE

2. Extracting the file

Unzip the file using following command:

$ tar -xzvf eclipse-java-inst-1-linux-gtk-x86_64.tar.gz

4. Choose a Workspace Directory:

Eclipse organizes projects by workspaces. A workspace is a group of related projects and it is

The new Java Project has created

6. After that we need to create the Package, by clicking:

Right Click on Package > New > Class (Name it – WordCount)

8. Download Hadoop Libraries

Add Following Reference Libraries –

Right Click on Project > Build Path> Add External Archivals

Introduction to Hadoop YARN

The first component of YARN Architecture is,

 It is the ultimate authority in resource allocation.

 It is responsible for accepting job submissions.

Coming to the second component which is:

The third component of Apache Hadoop YARN is,

The fourth component is:

You might also like