
HDFS Interfaces

and
MapReduce Model
ISIT312/912 LECTURE 2

1
Content
1. Hadoop Distributed File System (HDFS)
a) Shell Interface
b) Java Interface
c) Internals
2. The MapReduce model

2
Recall Hadoop’s Core
Components

3
Interacting with HDFS
HDFS provides multiple interfaces to read, write, interrogate, and
manage the filesystem:
Ø The filesystem shell (Command-Line Interface): hadoop fs or
hdfs dfs
Ø The Hadoop Filesystem Java API
Ø Hadoop’s simple Web UI
Ø Other interfaces, such as RESTful proxy interfaces (e.g., HttpFS)

4
Hadoop’s home directory
Commands are issued from the Bash shell:
> which bash
/bin/bash

> cd $HADOOP_HOME
… > ls
bin include libexec logs README.txt share
etc lib LICENSE.txt NOTICE.txt sbin

You will mostly use scripts in the bin and sbin folders, and use jar files
in the share folder.

5
Hadoop Daemons
> jps
28530 SecondaryNameNode
11188 NodeManager
28133 NameNode
28311 DataNode
10845 ResourceManager
3542 Jps

Hadoop is running properly only if the above services are running.

6
HDFS Command-Line Interface
Create an HDFS home directory for the user (already created in the VM):
> bin/hdfs dfs -mkdir -p /user/bigdata

Create a folder “input”:

> bin/hadoop fs -mkdir input

View the folder:


> bin/hadoop fs -ls
Found 1 item
drwxr-xr-x - bigdata supergroup 0 2017-07-17 16:33 input

7
HDFS Command-Line Interface
Upload a file to HDFS:
> bin/hadoop fs -put README.txt input
> bin/hadoop fs -ls input
-rw-r--r-- 1 bigdata supergroup 1494 2017-07-12 17:53 input/README.txt

Read a file in HDFS:


> bin/hadoop fs -cat input/README.txt
<contents of README.txt shown here…>

8
Paths in HDFS
A path in HDFS is represented as a URI with the prefix “hdfs://”.
For example,
◦ “hdfs://<hostname>:<port>/user/bigdata/input” refers to the “input” directory in HDFS under the user “bigdata”
◦ “hdfs://<hostname>:<port>/user/bigdata/input/README.txt” refers to the file “README.txt” in the above “input” directory in HDFS.

When interacting with HDFS’s interfaces in the default setting, one can omit the hostname, port and user, and simply give the directory or file.
Thus, the full form of “hadoop fs -ls input” is:
“hadoop fs -ls hdfs://<hostname>:<port>/user/bigdata/input”

9
Some Usual Commands
Command          Description
-put             Upload a file (or files) from the local filesystem to HDFS
-mkdir           Create a directory in HDFS
-ls              List the files in a directory in HDFS
-cat             Read the content of a file (or files) in HDFS
-copyFromLocal   Copy a file from the local filesystem to HDFS (similar to -put)
-copyToLocal     Copy a file (or files) from HDFS to the local filesystem
-rm              Delete a file (or files) in HDFS
-rm -r           Delete a directory in HDFS

10
Web Interface of HDFS

11
Web Interface of HDFS

12
Java Interface:
The FileSystem API
A file in a Hadoop filesystem (including HDFS) is represented by a
Hadoop Path object
◦ Its syntax is a URI, e.g., hdfs://localhost:8020/user/bigdata/input/README.txt

To get an instance of FileSystem, use the following factory methods:


◦ public static FileSystem get(Configuration conf) throws IOException
◦ public static FileSystem get(URI uri, Configuration conf) throws IOException
◦ public static FileSystem get(URI uri, Configuration conf, String user)
throws IOException

The following method gets a local filesystem instance:


◦ public static FileSystem getLocal(Configuration conf) throws IOException
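
A minimal sketch (ours, not from the lecture) of how these factory methods are called is shown below; the class name GetFileSystems and the hdfs://localhost:8020 URI are assumptions that match the example above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class GetFileSystems {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // built from the Hadoop configuration files
    FileSystem fs = FileSystem.get(conf);          // the default filesystem (fs.defaultFS)
    FileSystem byUri = FileSystem.get(URI.create("hdfs://localhost:8020/"), conf);
    FileSystem local = FileSystem.getLocal(conf);  // the local filesystem
    System.out.println(fs.getUri() + " " + byUri.getUri() + " " + local.getUri());
  }
}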

13
The FileSystem API: Reading
A Configuration object is determined by the Hadoop configuration files
or user-provided parameters.
Using the default configuration, one can simply set
◦ Configuration conf = new Configuration()

With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file:
◦ public FSDataInputStream open(Path f) throws IOException
◦ public FSDataInputStream open(Path f, int bufferSize) throws IOException

A Path object can be created by using a designated URI. For example:
◦ Path f = new Path(uri)

14
A File Reading Application
Putting it together, we can create the following class (imports included for completeness):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];                 // e.g., input/README.txt or a full hdfs:// URI

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path path = new Path(uri);
    FSDataInputStream in = fs.open(path);           // open the file for reading
    IOUtils.copyBytes(in, System.out, 4096, true);  // copy to stdout; true closes the stream
  }
}

15
Compiling and Running an App in
Hadoop
Compilation simply uses the “javac” command, but the Hadoop dependencies
need to be on the classpath.
> export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
> javac -cp $HADOOP_CLASSPATH FileSystemCat.java

Then, package the class into a jar file and run it as follows:

> hadoop jar FileSystemCat.jar FileSystemCat input/README.txt

The output is the same as running the “hadoop fs -cat” command.

16
The FileSystem API: Write
Suppose an input stream is created to read a local file.
To write a file to HDFS, the simplest way is to call create() with a Path object
for the file to be created; it returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
Then simply copy the input stream to the output stream.
Another (more flexible) way is to read the input stream into a buffer
and then write to the output stream.

17
A File Writing Application
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemPut {

  public static void main(String[] args) throws Exception {
    String localStr = args[0];   // source file on the local filesystem
    String hdfsStr = args[1];    // destination path in HDFS

    Configuration conf = new Configuration();
    FileSystem local = FileSystem.getLocal(conf);
    FileSystem hdfs = FileSystem.get(URI.create(hdfsStr), conf);

    Path localFile = new Path(localStr);
    Path hdfsFile = new Path(hdfsStr);

    FSDataInputStream in = local.open(localFile);    // read from the local file
    FSDataOutputStream out = hdfs.create(hdfsFile);  // create the HDFS file

    IOUtils.copyBytes(in, out, 4096, true);          // copy and close the streams
  }
}

18
Another File Writing
Application
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemPutAlt {
  public static void main(String[] args) throws Exception {
    String localStr = args[0];
    String hdfsStr = args[1];
    Configuration conf = new Configuration();
    FileSystem local = FileSystem.getLocal(conf);
    FileSystem hdfs = FileSystem.get(URI.create(hdfsStr), conf);
    Path localFile = new Path(localStr);
    Path hdfsFile = new Path(hdfsStr);
    FSDataInputStream in = local.open(localFile);
    FSDataOutputStream out = hdfs.create(hdfsFile);
    byte[] buffer = new byte[256];
    int bytesRead = 0;
    while ((bytesRead = in.read(buffer)) > 0) {   // read into the buffer, then write it out
      out.write(buffer, 0, bytesRead);
    }
    in.close();
    out.close();
  }
}

19
Other FileSystem API
§ The method mkdirs() creates a directory (and any missing parents)
§ The method getFileStatus() gets the metadata for a single file or directory
§ The method listStatus() lists the contents of a directory
§ The method exists() checks whether a file or directory exists
§ The method delete() removes a file or directory

v The Java API enables the implementation of customised applications to interact with HDFS, as sketched below
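
A hedged, self-contained sketch of these methods in use is given below; the class name FileSystemOps and the printed fields are our own choices, not part of the lecture:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemOps {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                           // an HDFS directory path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path dir = new Path(uri);
    fs.mkdirs(dir);                                 // create the directory (and any parents)

    if (fs.exists(dir)) {                           // check that it exists
      FileStatus status = fs.getFileStatus(dir);    // metadata of a single file/directory
      System.out.println(status.getPath() + " modified at " + status.getModificationTime());

      for (FileStatus s : fs.listStatus(dir)) {     // list the directory's contents
        System.out.println(s.getPath() + " " + s.getLen() + " bytes");
      }
    }

    fs.delete(dir, true);                           // remove the directory recursively
  }
}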

20
Read Data in HDFS: What
Happens Inside

21
Read Data in HDFS
Step 1: The client opens the file it
wishes to read by calling open() on
the FileSystem object, which for
HDFS is an instance of
DistributedFileSystem.
Step 2: DistributedFileSystem calls
the namenode, using remote
procedure calls (RPCs), to determine
the locations of the first few blocks
in the file.

Step 3: The DistributedFileSystem returns an FSDataInputStream to the client, and the client calls read() on the stream.

22
Read Data in HDFS
Step 4: FSDataInputStream
connects to the first datanode for
the first block in the file, and then
data is streamed from the
datanode back to the client, by
calling read() repeatedly on the
stream.
Step 5: When the end of the block
is reached, FSDataInputStream will
close the connection to the
datanode, then find the best
(possibly the same) datanode for
the next block.
Step 6: When the client has finished reading, it calls close() on the
FSDataInputStream .

23
Write Data in HDFS

24
Write Data In HDFS
Step 1: The client creates the file by
calling create() on
DistributedFileSystem.
Step 2: DistributedFileSystem makes
an RPC call to the namenode to
create a new file in the filesystem’s
namespace and returns an
FSDataOutputStream for the client
to start writing data to.
Step 3: The client writes data into
the FSDataOutputStream.
Step 4: Data written to the FSDataOutputStream is split into packets, which are pushed into a queue; the packets are sent to the first datanode for the block and forwarded to the other (usually two) datanodes in the pipeline.

25
Write Data In HDFS
Step 5: When the FSDataOutputStream receives an acknowledgement from the datanodes, the data packets are removed from the queue.
Step 6: When the client has
finished writing data, it calls
close() on the stream.
Step 7: The client signals the
namenode that the writing is
completed.

26
The MapReduce
Model

27
Key-Value Pairs: MapReduce’s
Basic Data Model
Key           Value
City          Sydney
Employer ID   Employee Albot’s profile

Input, output and intermediate records in MapReduce are represented as key-value pairs (aka name-value/attribute-value pairs).
The key is an identifier (e.g., the name of an attribute).
◦ In MapReduce, the key is not required to be unique.
The value is the data associated with the key.
◦ It may be a simple value or a complex object.
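
As a small illustrative aside (not from the lecture), Hadoop represents such keys and values with Writable types such as Text and IntWritable; the class name KeyValueExample is ours:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class KeyValueExample {
  public static void main(String[] args) {
    Text key = new Text("City");             // the key: an identifier, not necessarily unique
    Text value = new Text("Sydney");         // the value: the data associated with the key
    IntWritable count = new IntWritable(1);  // values may also be numbers or complex objects
    System.out.println(key + " -> " + value + ", count = " + count.get());
  }
}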

28
MapReduce Model
[Diagram: Map (implemented by developer) → Partition → Shuffle & Sort (implemented by platform) → Reduce]

29
An Abstract MapReduce
Program: WordCount
function map(Long lineNo, String line):
  // lineNo: the position no. of a line in the text
  // line: a line of text
  for each word w in line:
    emit (w, 1)

function reduce(String w, List loc):
  // w: a word
  // loc: a list of counts
  sum = 0
  for each c in loc:
    sum += c
  emit (w, sum)

Example: for the input “text to pass to reader”, map emits (text, 1), (to, 1), (pass, 1), (to, 1), (reader, 1), and reduce outputs (text, 1), (to, 2), (pass, 1), (reader, 1).

30
The MapReduce Model

31
Map Phase
ØMap Phase uses input format and record reader functions to derive
records in the form of key-value pairs for the input data
ØMap Phase applies a function or functions to each key-value pair over a
portion of the dataset
vIn the case of a dataset hosted in HDFS, this portion is usually called
a block.
vIf there are n blocks of data in the input dataset, there will be at
least n Map tasks (also referred to as Mappers).

32
Map Phase
Each Map task operates against
one filesystem (HDFS) block.
As illustrated, a Map task will call
its map() function, represented by
M in the diagram, once for each
record, or key-value pair; for
example, rec1, rec2, and so on.

33
Map Phase
Each call of the map() function
accepts one key-value pair and
emits (or outputs) zero or more
key-value pairs:
map (in_key, in_value) →
list (itm_key, itm_value)
The emitted data from Mapper,
also in the form of lists of key-value
pairs, will be subsequently
processed in the Reduce phase.
Different Mappers do not
communicate or share data with
each other!

34
Examples of Map Functions
Common map() functions include filtering of specific keys, such as
filtering log messages if you only want to count or analyse ERROR log
messages:
let map (k, v) = if (ERROR in v) then emit (k, v)
Another example of a map() function would be to manipulate values,
such as a map() function that converts a text value to lowercase:
let map (k, v) = emit (k, v.toLowercase ( ))
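
As a hedged illustration of how the filtering map() above might look in Hadoop’s Java API (the class name ErrorFilterMapper and the line-oriented input types are our assumptions):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only the records whose value contains "ERROR": a filtering Mapper.
public class ErrorFilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().contains("ERROR")) {
      context.write(key, value);   // emit (k, v) unchanged
    }
  }
}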

35
Partitioning Function
ØThe Partition function, or Partitioner, ensures that each key and its list of values is passed to one and only one Reduce task (Reducer)
ØThe number of partitions is determined by the (default or user-defined) number of Reducers
ØCustom Partitioners are developed for various practical purposes; a minimal sketch is given below
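
A hedged sketch of a custom Partitioner in Hadoop’s Java API (the class name and the simple hash scheme are assumptions for illustration); it would be registered on a job with job.setPartitionerClass(SimpleHashPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to one partition (and hence one Reducer) by hashing the key.
public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}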

36
Reduce Phase
ØInput of the Reduce phase is the output of the Map phase (via shuffle-and-sort)
ØEach Reduce task (or Reducer) executes a reduce() function for each intermediate key and its list of associated intermediate values.
ØThe output from each reduce() function is zero or more key-value pairs:
reduce (intermediate_key, list (intermediate_value))
→ (out_key, out_value)
ØNote that, in reality, the output from Reducer may be the input of
another Map phase in a complex multistage computational workflow.

37
Example of Reduce Functions
The simplest and most common reduce() function is the summation,
which simply sums a list of values for each key:
let reduce (k, list <v>) = {
sum = 0
for int i in list <v> :
sum += i
emit (k, sum) }
A count operation is as simple as summing a set of numbers
representing instances of the values you wish to count.
Other examples of the reduce() function: max/min and average
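
A hedged Java rendering of this summation reduce() (the class name SumReducer and the Text/IntWritable types are our assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the list of values associated with each key, as in the pseudocode above.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));   // emit (k, sum)
  }
}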

38
Shuffle and Sort
ØShuffle-and-sort is the process where data are transferred from
Mapper to Reducer
q It is “the heart of MapReduce where the ‘magic’ happens.”

ØThe most important purpose of Shuffle-and-Sort is to minimise the data transmitted through network I/O.
Ø In general, in Shuffle-and-Sort, the Mapper output is sent to the target
Reducer according to the partitioning function.

39
Shuffle and Sort in
MapReduce

40
Combine Phase

Suppose the Reduce function is a sum.

41
Combine Function
ØIf the Reduce function is commutative and associative, it can be
performed before the Shuffle-and-Sort phase. In this case, the Reduce
function is called a Combiner function.
Ø E.g., sum (or count) is commutative and associative, but average is not.

ØThe use of a Combiner can minimise the amount of data transferred to the Reduce phase, alleviating the network transfer overhead.

42
Map-Only MapReduce
A MapReduce application may contain zero Reduce tasks. In this case, it
is a map-only application.
Examples of map-only MapReduce jobs:
◦ ETL routines without data
summarization, aggregation
and reduction
◦ File format conversion jobs
◦ Image processing jobs

43
An Election Analogy for
MapReduce

44
MapReduce Example:
Average Contact Number
For a database of 1 billion people, compute the average number of
social contacts a person has according to age.
In SQL-like language:

SELECT age, AVG(contacts)
FROM social.person
GROUP BY age

45
MapReduce Example:
Average Contact Number
Now suppose these records are stored in different datanodes. In
MapReduce:

function Map is
  input: integer K between 1 and 1000
  // thus each integer represents a batch of 1 million social.person records
  for each social.person record in the K-th batch do
    produce one output record (Y, (N, 1))
    where Y is the person's age
    and N is the number of contacts that the person has
end function

46
MapReduce Example:
Average Contact Number
function Reduce is
  input: age (in years) Y, number of contacts N, count C
  for each input record (Y, (N, C)) do
    Accumulate in S the sum of N*C
    Accumulate in D the sum of C
  produce one output record (Y, S/D)
end function

MapReduce sends the code to the location of each data batch (not the other way around)

Question: the output from Map is multiple copies of (Y, (N, 1)), but the input to
Reduce is (Y, (N, C)), so what fills the gap?

47
Submit A MapReduce
Application to Hadoop
A MapReduce application in Hadoop is a Java implementation of the
MapReduce model for a specific problem (e.g., word count).
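
A hedged, self-contained sketch of such an implementation is given below. It follows the widely used WordCount structure; the class and variable names are illustrative and are not taken from the lecture's screenshots.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configure and submit the job; args[0] = input path, args[1] = output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // sum is commutative and associative
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

As with FileSystemCat, this would be compiled against the Hadoop classpath, packaged into a jar, and submitted with the hadoop jar command; setting job.setNumReduceTasks(0) instead would make it a map-only job.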

[Screenshot: a MapReduce job being submitted]

48
Sample run on the Screen
[Screenshot: sample run of the application]

49
50
Behind the Screen: Running of
MapReduce Jobs
Client: submits an MR Job
YARN resource manager:
coordinates the allocation of
computing resources in the
cluster
YARN node manager(s): launch & monitor
containers on machines in the cluster.
MapReduce application master: runs in a
container, and coordinates the tasks in a
MapReduce job.
HDFS: used for sharing job files between the other entities.

51
Summary
How to interact with Hadoop’s storage system HDFS
◦ Command-Line Interface: the hadoop fs (and hdfs dfs) commands
◦ Java API: Read, write and other operations
The MapReduce model and its implementation in
Hadoop
◦ Map stage, Reduce stage, Shuffle-and-Sort, Partitioner, Combiner

52
