HDFS and MapReduce Model
ISIT312/912 LECTURE 2
1
Content
1. Hadoop Distributed File System (HDFS)
a) Shell Interface
b) Java Interface
c) Internals
2. The MapReduce model
2
Recall Hadoop’s Core
Components
3
Interacting with HDFS
HDFS provides multiple interfaces to read, write, interrogate, and
manage the filesystem:
Ø The filesystem shell (Command-Line Interface): hadoop fs or
hdfs dfs
Ø The Hadoop Filesystem Java API
Ø Hadoop’s simple Web UI
Ø Other interfaces, such as RESTful proxy interfaces (e.g., HttpFS)
4
Hadoop’s home directory
Commands are issued in the Bash shell:
> which bash
/bin/bash
> cd $HADOOP_HOME
… > ls
bin include libexec logs README.txt share
etc lib LICENSE.txt NOTICE.txt sbin
You will mostly use scripts in the bin and sbin folders, and use jar files
in the share folder.
5
Hadoop Daemons
> jps
28530 SecondaryNameNode
11188 NodeManager
28133 NameNode
28311 DataNode
10845 ResourceManager
3542 Jps
6
HDFS Command-Line Interface
Create a home directory for the HDFS user “bigdata” (already created in the VM):
> bin/hdfs dfs -mkdir -p /user/bigdata
7
HDFS Command-Line Interface
Upload a file to HDFS:
> bin/hadoop fs -put README.txt input
> bin/hadoop fs -ls input
-rw-r--r-- 1 bigdata supergroup 1494 2017-07-12 17:53 input/README.txt
8
Paths in HDFS
The path in HDFS is represented as a URI with the prefix “hdfs://”
For example,
◦ “hdfs://<hostname>:<port>/user/bigdata/input” refers to the “input”
directory in HDFS under the user “bigdata”
◦ “hdfs://<hostname>:<port>/user/bigdata/input/README.txt” refers to the
file “README.txt” in the above “input” directory in HDFS.
When interacting with HDFS in the default setting, one can omit the
hostname, port and user, and simply give the directory or file name.
Thus, the full spelling of “hadoop fs -ls input” is:
“hadoop fs -ls hdfs://<hostname>:<port>/user/bigdata/input”
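For example, on a pseudo-distributed installation where the namenode address is hdfs://localhost:9000 (the hostname and port are installation specific), the following two commands are equivalent:
> hadoop fs -ls input
> hadoop fs -ls hdfs://localhost:9000/user/bigdata/input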
9
Some Usual Commands
Command Description
-put Upload a file (or files) from the local filesystem to HDFS
-mkdir Create a directory in HDFS
-ls List the files in a directory in HDFS
-cat Read the content of a file (or files) in HDFS
-copyFromLocal Copy a file from the local filesystem to HDFS (similar to “put”)
-copyToLocal Copy a file (or files) from HDFS to the local filesystem
-rm Delete a file (or files) in HDFS
-rm -r Delete a directory in HDFS
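For example, continuing with the input directory created earlier (the local destination path is illustrative):
> hadoop fs -cat input/README.txt
> hadoop fs -copyToLocal input/README.txt /tmp/README.txt
> hadoop fs -rm input/README.txt
> hadoop fs -rm -r input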
10
Web Interface of HDFS
11
Web Interface of HDFS
12
Java Interface:
The FileSystem API
A file in a Hadoop filesystem (including HDFS) is represented by a
Hadoop Path object
◦ Its syntax is that of a URI,
◦ e.g., hdfs://localhost:8020/user/bigdata/input/README.txt
13
The FileSystem API: Reading
A Configuration object is determined by the Hadoop configuration files
or user-provided parameters.
Using the default configuration, one can simply set
◦ Configuration conf = new Configuration()
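A minimal reading sketch under the default configuration (using the README.txt uploaded earlier; the classes are from org.apache.hadoop.conf and org.apache.hadoop.fs, and the FileSystemCat class on the next slide follows the same pattern):
Configuration conf = new Configuration();                        // default configuration
FileSystem fs = FileSystem.get(conf);                            // the default filesystem (HDFS here)
FSDataInputStream in = fs.open(new Path("input/README.txt"));    // open() returns a seekable input stream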
14
A File Reading Application
Putting it all together, we can create the following class:
public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                                   // HDFS path, e.g. input/README.txt
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, System.out, 4096, true);          // copy to stdout, then close the stream
  }
}
15
Compiling and Running APP in
Hadoop
The compilation simply uses the “javac” command, but the Hadoop
dependencies must be on the classpath.
> export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
> javac -cp $HADOOP_CLASSPATH FileSystemCat.java
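One way to run the compiled class (a sketch; it assumes the class file is in the current directory and that input/README.txt exists in HDFS):
> java -cp .:$HADOOP_CLASSPATH FileSystemCat input/README.txt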
16
The FileSystem API: Write
Suppose an input stream is created to read a local file.
To write a file on HDFS, the simplest way is to take a Path object for the
file to be created and return an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
And then just copy the input stream to the output stream.
Another (more flexible) way is to read the input stream into a buffer
and then write to the output stream.
17
A File Writing Application
public class FileSystemPut {
  // ... same Configuration/FileSystem/Path setup as FileSystemPutAlt on the next slide ...
  FSDataInputStream in = local.open(localFile);
  FSDataOutputStream out = hdfs.create(hdfsFile);
  IOUtils.copyBytes(in, out, 4096, true);   // copy the input stream to the output stream, then close both
}
18
Another File Writing
Application
public class FileSystemPutAlt {
public static void main(String[] args) throws Exception {
String localStr = args[0];
String hdfsStr = args[1];
Configuration conf = new Configuration();
FileSystem local = FileSystem.getLocal(conf);
FileSystem hdfs = FileSystem.get(URI.create(hdfsStr), conf);
Path localFile = new Path(localStr);
Path hdfsFile = new Path(hdfsStr);
FSDataInputStream in = local.open(localFile);
FSDataOutputStream out = hdfs.create(hdfsFile);
byte[] buffer = new byte[256];
int bytesRead = 0;
while ((bytesRead = in.read(buffer)) > 0) {
out.write(buffer, 0, bytesRead);
}
in.close();
out.close();
}
}
19
Other FileSystem API
§ The method mkdirs() creates a directory
§ The method getFileStatus() gets the meta information for a single file
or directory
§ The method listStatus() lists the files in a directory
§ The method exists() checks whether a file exists
§ The method delete() removes a file
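A minimal sketch exercising these methods (assuming fs is a FileSystem object obtained as in the earlier examples, the input directory from the earlier slides, and FileStatus from org.apache.hadoop.fs):
FileStatus[] statuses = fs.listStatus(new Path("input"));          // one FileStatus per entry
for (FileStatus status : statuses) {
    System.out.println(status.getPath() + "  " + status.getLen()); // path and length in bytes
}
if (fs.exists(new Path("input/README.txt"))) {
    fs.delete(new Path("input/README.txt"), false);                // false: do not delete recursively
}
fs.mkdirs(new Path("newdir"));                                     // create a new directory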
20
Read Data in HDFS: What
Happens Inside
21
Read Data in HDFS
Step 1: The client opens the file it
wishes to read by calling open() on
the FileSystem object, which for
HDFS is an instance of
DistributedFileSystem.
Step 2: DistributedFileSystem calls
the namenode, using remote
procedure calls (RPCs), to determine
the locations of the first few blocks
in the file.
Step 3: DistributedFileSystem returns
an FSDataInputStream to the client,
which then calls read() on the stream.
22
Read Data in HDFS
Step 4: FSDataInputStream
connects to the first datanode for
the first block in the file, and then
data is streamed from the
datanode back to the client, by
calling read() repeatedly on the
stream.
Step 5: When the end of the block
is reached, FSDataInputStream will
close the connection to the
datanode, then find the best
(possibly the same) datanode for
the next block.
Step 6: When the client has finished reading, it calls close() on the
FSDataInputStream.
23
Write Data in HDFS
24
Write Data In HDFS
Step 1: The client creates the file by
calling create() on
DistributedFileSystem.
Step 2: DistributedFileSystem makes
an RPC call to the namenode to
create a new file in the filesystem’s
namespace and returns an
FSDataOutputStream for the client
to start writing data to.
Step 3: The client writes data into
the FSDataOutputStream.
Step 4: Data written to the FSDataOutputStream is split into packets, which
are pushed onto a data queue; the packets are streamed to a datanode and
forwarded to the other (usually two) datanodes in the pipeline.
25
Write Data In HDFS
Step 5: When FSDataOutputStream
receives an acknowledgement from
the datanodes, the data packets are
removed from the queue.
Step 6: When the client has
finished writing data, it calls
close() on the stream.
Step 7: The client signals the
namenode that the writing is
completed.
26
The MapReduce
Model
27
Key-Value Pairs: MapReduce’s
Basic Data Model
Key            Value
City           Sydney
Employee ID    Employee Albot’s profile
28
MapReduce Model
(Diagram: Map → Partition → Reduce; the Map and Reduce functions are implemented by the developer.)
29
An Abstract MapReduce
Program: WordCount
function map(Long lineNo, String line):
// lineNo: the position no. of a line in the text
// line: a line of text
for each word w in line:
emit (w, 1)
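The corresponding reduce function (a sketch in the same abstract style) sums the 1’s emitted for each word:
function reduce(String word, List<Integer> counts):
// word: a word emitted by map
// counts: the list of 1’s emitted for that word
sum = 0
for each c in counts:
sum += c
emit (word, sum)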
30
The MapReduce Model
31
Map Phase
ØMap Phase uses input format and record reader functions to derive
records in the form of key-value pairs for the input data
ØMap Phase applies a function or functions to each key-value pair over a
portion of the dataset
vIn the case of a dataset hosted in HDFS, this portion is usually called
a block.
vIf there are n blocks of data in the input dataset, there will be at
least n Map tasks (also referred to as Mappers).
32
Map Phase
Each Map task operates against
one filesystem (HDFS) block.
As illustrated, a Map task will call
its map() function, represented by
M in the diagram, once for each
record, or key-value pair; for
example, rec1, rec2, and so on.
33
Map Phase
Each call of the map() function
accepts one key-value pair and
emits (or outputs) zero or more
key-value pairs:
map (in_key, in_value) →
list (itm_key, itm_value)
The emitted data from Mapper,
also in the form of lists of key-value
pairs, will be subsequently
processed in the Reduce phase.
Different Mappers do not
communicate or share data with
each other!
34
Examples of Map Functions
Common map() functions include filtering of specific keys, such as
filtering log messages if you only want to count or analyse ERROR log
messages:
let map (k, v) = if (ERROR in v) then emit (k, v)
Another example of a map() function would be to manipulate values,
such as a map() function that converts a text value to lowercase:
let map (k, v) = emit (k, v.toLowerCase())
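The same filtering idea written against Hadoop’s Java MapReduce API (a sketch; the class name ErrorFilterMapper is illustrative):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ErrorFilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit the record only if the line contains "ERROR"
        if (value.toString().contains("ERROR")) {
            context.write(key, value);
        }
    }
}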
35
Partitioning Function
ØPartition function, or Partitioner, ensures each key and its list of values
is passed to one and only one Reduce task or Reducer
ØThe number of partitions is determined by the (default or user-
defined) number of Reducers
ØCustom Partitioners are developed for various practical purposes
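A sketch of a custom Partitioner in Hadoop’s Java API (the class name and routing rule are illustrative); it would be registered on the job with job.setPartitionerClass(...):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All keys starting with "ERROR" go to reducer 0; other keys are hashed over the reducers.
public class ErrorPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().startsWith("ERROR")) {
            return 0;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}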
36
Reduce Phase
ØInput of the Reduce phase is the output of the Map phase (via shuffle-
and-sort)
ØEach Reduce task (or Reducer) executes a reduce() function for each
intermediate key and its list of associated intermediate values.
ØThe output from each reduce() function is zero or more key-value pairs:
reduce (intermediate_key, list (intermediate_value))
→ (out_key, out_value)
ØNote that, in reality, the output from Reducer may be the input of
another Map phase in a complex multistage computational workflow.
37
Example of Reduce Functions
The simplest and most common reduce() function is the summation,
which simply sums a list of values for each key:
let reduce (k, list<v>) = {
sum = 0
for each i in list<v>:
sum += i
emit (k, sum) }
A count operation is as simple as summing a set of numbers
representing instances of the values you wish to count.
Other examples of the reduce() function: max/min and average
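The same summation written against Hadoop’s Java API (a sketch; the class name is illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {     // sum the list of values for this key
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}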
38
Shuffle and Sort
ØShuffle-and-sort is the process where data are transferred from
Mapper to Reducer
q It is “the heart of MapReduce where the ‘magic’ happens.”
39
Shuffle and Sort in
MapReduce
40
Combine Phase
Suppose the Reduce function is a sum.
41
Combine Function
ØIf the Reduce function is commutative and associative, it can be
performed before the Shuffle-and-Sort phase. In this case, the Reduce
function is called a Combiner function.
Ø E.g., sum (or count) is commutative and associative, but average is not.
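In Hadoop, a Combiner is configured on the job; for a sum it is common to reuse the Reducer class (assuming a sum Reducer such as the IntSumReducer sketched earlier and a Job object named job):
job.setCombinerClass(IntSumReducer.class);   // applied to map output before shuffle-and-sort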
42
Map-Only MapReduce
A MapReduce application may contain zero Reduce tasks. In this case, it
is a map-only application.
Examples of map-only MapReduce jobs:
◦ ETL routines without data
summarization, aggregation
and reduction
◦ File format conversion jobs
◦ Image processing jobs
43
An Election Analogy for
MapReduce
44
MapReduce Example:
Average Contact Number
For a database of 1 billion people, compute the average number of
social contacts a person has according to age.
In SQL-like language:
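(A sketch of such a query; the column names are assumptions:)
SELECT age, AVG(contacts)
FROM social.person
GROUP BY age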
45
MapReduce Example:
Average Contact Number
Now suppose these records are stored in different datanodes. In
MapReduce:
function Map is
input: integer K between 1 and 1000 // thus each integer
representing a batch of 1 million social.person records
for each social.person record in the K-th batch do
produce one output record (Y,(N,1))
where Y is the person's age
and N is the number of contacts that the person has
end function
46
MapReduce Example:
Average Contact Number
function Reduce is
input: age (in years) Y, number of contacts N, count C
for each input record (Y,(N,C)) do
Accumulate in S the sum of N*C
Accumulate in D the sum of C
produce one output record (Y, S/D)
end function
MapReduce sends the codes to the location of each data batch (not the other
way around)
Question: the output from Map is multiple copies of (Y, (N, 1)), but the input to
Reduce is (Y, (N, C)), so what fills the gap?
47
Submit A MapReduce
Application to Hadoop
A MapReduce application in Hadoop is a Java implementation of the
MapReduce model for a specific problem (e.g., word count).
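For example, the WordCount program in Hadoop’s bundled examples jar can be submitted as follows (the exact jar file name depends on the Hadoop version; input and output are HDFS paths):
> bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output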
48
Sample run on the Screen
49
50
Behind the Screen: Running of
MapReduce Jobs
Client: submits an MR Job
YARN resource manager:
coordinates the allocation of
computing resources in the
cluster
YARN node manager(s): launch & monitor
containers on machines in the cluster.
MapReduce application master: runs in a
container, and coordinates the tasks in a
MapReduce job.
HDFS: used for sharing job files between
the other entities.
51
Summary
How to interact with Hadoop’s storage system HDFS
◦ Command-Line Interface: the hadoop fs (and hdfs dfs) commands
◦ Java API: Read, write and other operations
The MapReduce model and its implementation in
Hadoop
◦ Map stage, Reduce stage, Shuffle and Sort, Partitioner, Combiner
52