
The Google File System

Firas Abuzaid
Why build GFS?

● Node failures happen frequently
● Files are huge – multi-GB
● Most files are modified by appending at the end
○ Random writes (and overwrites) are practically non-existent
● High sustained bandwidth is more important than low latency
○ Place more priority on processing data in bulk
Typical workloads on GFS

● Two kinds of reads: large streaming reads & small random reads
○ Large streaming reads usually read 1MB or more
○ Oftentimes, applications read through contiguous regions in the file
○ Small random reads are usually only a few KBs at some arbitrary offset
● Also many large, sequential writes that append data to files
○ Similar operation sizes to reads
○ Once written, files are seldom modified again
○ Small writes at arbitrary offsets do not have to be efficient
● Multiple clients (e.g. ~100) concurrently appending to a single file
○ e.g. producer-consumer queues, many-way merging
Interface

● Not POSIX-compliant, but supports typical file system operations: create, delete, open, close, read, and write
● snapshot: creates a copy of a file or a directory tree at low cost
● record append: allows multiple clients to append data to the same file concurrently
○ The append is guaranteed to be atomic at least once
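As a rough illustration of this interface, here is a minimal sketch of what a GFS-like client API could look like in Python. The class, method names, and Handle type are hypothetical stand-ins, not the actual GFS client library.

```python
# Hypothetical sketch of a GFS-like client interface; names are illustrative only.
class Handle:
    """Opaque handle returned by open()."""

class GFSClient:
    # Familiar (but not POSIX) operations:
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> Handle: ...
    def close(self, handle: Handle) -> None: ...
    def read(self, handle: Handle, offset: int, length: int) -> bytes: ...
    def write(self, handle: Handle, offset: int, data: bytes) -> None: ...

    def snapshot(self, src: str, dst: str) -> None:
        """Copy a file or directory tree at low cost (copy-on-write)."""

    def record_append(self, handle: Handle, record: bytes) -> int:
        """Append `record` atomically at least once; returns the offset GFS chose."""
```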
Architecture

● Very important: data flow is decoupled from control flow
○ Clients interact with the master for metadata operations
○ Clients interact directly with chunkservers for all file operations
○ This means performance can be improved by scheduling expensive data flow based on the network topology
● Neither the clients nor the chunkservers cache file data
○ Working sets are usually too large to be cached; chunkservers can rely on Linux's buffer cache instead
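A sketch of what this decoupled read path looks like from the client's side; master.find_chunk and read_chunk are hypothetical stand-ins for the metadata and data RPCs.

```python
# Control flow goes to the master (metadata only); data flow goes straight to a chunkserver.
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks

def read(master, chunkservers, path, offset, length):
    chunk_index = offset // CHUNK_SIZE
    # 1. Metadata: which chunk is this, and which chunkservers hold replicas?
    handle, replica_ids = master.find_chunk(path, chunk_index)
    # 2. Data: fetch the bytes directly from one replica (ideally the topologically closest).
    replica = chunkservers[replica_ids[0]]
    return replica.read_chunk(handle, offset % CHUNK_SIZE, length)
```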
The Master Node

● Responsible for all system-wide activities
○ managing chunk leases, reclaiming storage space, load-balancing
● Maintains all file system metadata
○ Namespaces, ACLs, mappings from files to chunks, and current locations of chunks
○ All kept in memory; namespaces and file-to-chunk mappings are also stored persistently in the operation log
● Periodically communicates with each chunkserver in HeartBeat messages
○ This lets the master determine chunk locations and assess the overall state of the system
○ Important: The chunkserver has the final word over which chunks it does or does not have on its own disks – not the master
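A sketch of the master's in-memory tables and how a HeartBeat report might be folded in. The structures and method names are assumptions for illustration; the key point is that chunk locations are rebuilt from chunkserver reports rather than persisted.

```python
from dataclasses import dataclass, field

@dataclass
class MasterState:
    namespace: dict = field(default_factory=dict)        # pathname -> file metadata (persisted via the operation log)
    file_to_chunks: dict = field(default_factory=dict)   # pathname -> [chunk handles] (persisted via the operation log)
    chunk_locations: dict = field(default_factory=dict)  # chunk handle -> {chunkserver ids} (in memory only)

    def on_heartbeat(self, chunkserver_id: str, reported_chunks: list):
        # The chunkserver has the final word about which chunks it holds,
        # so rebuild its location entries from the report instead of trusting old state.
        for servers in self.chunk_locations.values():
            servers.discard(chunkserver_id)
        for handle in reported_chunks:
            self.chunk_locations.setdefault(handle, set()).add(chunkserver_id)
```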
The Master Node

● For the namespace metadata, master does not use any per-directory data
structures – no inodes! (No symlinks or hard links, either.)
○ Every file and directory is represented as a node in a lookup table, mapping pathnames to
metadata. Stored efficiently using prefix compression (< 64 bytes per namespace entry)

● Each node in the namespace tree has a corresponding read-write lock to manage concurrency
○ Because all metadata is stored in memory, the master can efficiently scan the entire state of the system periodically in the background
○ Master’s memory capacity does not limit the size of the system
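A sketch of the flat pathname-to-metadata table and the locking pattern described above. The RWLock class is a simplified stand-in (Python's standard library has no reader-writer lock), and the real master also prefix-compresses the table.

```python
import threading

class RWLock:
    """Stand-in for a reader-writer lock; a plain mutex keeps the sketch short."""
    def __init__(self):
        self._lock = threading.Lock()
    def acquire_read(self):  self._lock.acquire()
    def acquire_write(self): self._lock.acquire()
    def release(self):       self._lock.release()

class Namespace:
    def __init__(self):
        self.table = {}   # full pathname -> metadata (no inodes, no per-directory structures)
        self.locks = {}   # full pathname -> RWLock

    def locks_for_create(self, path: str):
        # Creating /home/user/foo: read-lock /home and /home/user, write-lock /home/user/foo.
        parts = path.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts) - 1)]
        for ancestor in ancestors:
            self.locks.setdefault(ancestor, RWLock()).acquire_read()
        self.locks.setdefault(path, RWLock()).acquire_write()
```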
The Operation Log

● Only persistent record of metadata
● Also serves as a logical timeline that defines the serialized order of concurrent operations
● Master recovers its state by replaying the operation log
○ To minimize startup time, the master checkpoints the log periodically
■ The checkpoint is represented in a B-tree like form, can be directly mapped into
memory, but stored on disk
■ Checkpoints are created without delaying incoming requests to master, can be
created in ~1 minute for a cluster with a few million files
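A sketch of recovery from a checkpoint plus the operation log. The JSON layout and field names here are invented for illustration (the real checkpoint is a compact B-tree-like structure mapped into memory).

```python
import json

def apply_mutation(state, record):
    # Minimal example: only file creation is shown.
    if record["op"] == "create":
        state["namespace"][record["path"]] = {"chunks": []}

def recover(checkpoint_path, log_path):
    # 1. Load the latest checkpoint (a snapshot of the metadata).
    with open(checkpoint_path) as f:
        state = json.load(f)
    # 2. Replay only the log records appended after the checkpoint was taken.
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record["seqno"] > state["last_log_seqno"]:
                apply_mutation(state, record)
                state["last_log_seqno"] = record["seqno"]
    return state
```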
Why a Single Master?

● The master now has global knowledge of the whole system, which
drastically simplifies the design
● But the master is (hopefully) never the bottleneck
○ Clients never read and write file data through the master; client only requests from master
which chunkservers to talk to
○ Master can also provide additional information about subsequent chunks to further reduce
latency
○ Further reads of the same chunk don’t involve the master, either
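A sketch of the client-side caching this relies on; find_chunks is a hypothetical master RPC that also returns locations for a few subsequent chunks, so later reads can skip the master entirely.

```python
class ChunkLocationCache:
    def __init__(self, master):
        self.master = master
        self.cache = {}   # (path, chunk index) -> (chunk handle, replica locations)

    def lookup(self, path: str, chunk_index: int):
        key = (path, chunk_index)
        if key not in self.cache:
            # One metadata request; the master piggybacks info about subsequent chunks.
            for idx, handle, replicas in self.master.find_chunks(path, chunk_index, count=4):
                self.cache[(path, idx)] = (handle, replicas)
        return self.cache[key]
```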
Why a Single Master?

● Master state is also replicated for reliability on multiple machines, using the operation log and checkpoints
○ If master fails, GFS can start a new master process at any of these replicas and modify
DNS alias accordingly
○ “Shadow” masters also provide read-only access to the file system, even when primary
master is down
■ They read a replica of the operation log and apply the same sequence of changes
■ Not mirrors of master – they lag primary master by fractions of a second
■ This means we can still read up-to-date file contents while master is in recovery!
Chunks and Chunkservers

● Files are divided into fixed-size chunks, each of which has an immutable, globally unique 64-bit chunk handle
○ By default, each chunk is replicated three times across multiple chunkservers (users can change the replication level)
● Chunkservers store the chunks on local disks as Linux files
○ Metadata per chunk is < 64 bytes (stored in master)
■ Current replica locations
■ Reference count (useful for copy-on-write)
■ Version number (for detecting stale replicas)
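The per-chunk metadata listed above could be represented roughly like this (field names are illustrative, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                                   # immutable, globally unique 64-bit chunk handle
    version: int = 0                              # bumped on each new lease; detects stale replicas
    refcount: int = 1                             # > 1 after a snapshot => copy-on-write on next mutation
    replicas: set = field(default_factory=set)    # chunkserver ids currently holding this chunk
```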
Chunk Size

● 64 MB, a key design parameter (much larger than the block size of most file systems)
● Disadvantages:
○ Wasted space due to internal fragmentation
○ Small files consist of a few chunks, which then get lots of traffic from concurrent clients
■ This can be mitigated by increasing the replication factor
● Advantages:
○ Reduces clients’ need to interact with master (reads/writes on the same chunk only require one
request)
○ Since client is likely to perform many operations on a given chunk, keeping a persistent TCP
connection to the chunkserver reduces network overhead
○ Reduces the size of the metadata stored in master → metadata can be entirely kept in memory
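The client-side arithmetic that makes this cheap is simple: with 64 MB chunks, a byte range maps to chunk indices by integer division, so a typical 1 MB streaming read stays inside one chunk and needs at most one location request.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB

def chunks_spanned(offset: int, length: int) -> list:
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))

# A 1 MB streaming read at offset 100 MB touches only chunk 1:
assert chunks_spanned(100 * 2**20, 2**20) == [1]
```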
GFS’s Relaxed Consistency Model

● Terminology:
○ consistent: all clients will always see the same data, regardless of which replicas they read
from
○ defined: same as consistent and, furthermore, clients will see what the modification is in
its entirety
● Guarantees (summarizing the paper’s consistency table):
○ Serial successful write: defined
○ Concurrent successful writes: consistent but undefined
○ Successful record append (serial or concurrent): defined, interspersed with inconsistent regions
○ Failed modification: inconsistent
Data Modifications in GFS

● After a successful sequence of modifications, the modified file region is guaranteed to be defined and to contain the data written by the last modification
● GFS applies modification to a chunk in the same order on all its replicas
● A chunk is lost irreversibly if and only if all its replicas are lost before the
master node can react, typically within minutes
○ even in this case, data is lost, not corrupted
Record Appends

● A modification operation that guarantees that data (the “record”) will be appended atomically at least once – but at an offset of GFS’s choosing
○ The offset chosen by GFS is returned to the client so that the application is aware
● GFS may insert padding or record duplicates in between different record append operations
● Applications are encouraged to use record append instead of write
○ Applications should also write self-validating records (e.g. with checksums) and unique IDs to handle padding/duplicates, as in the sketch below
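One way to build such self-validating records, assuming a simple length + CRC32 + unique-id header (this framing is illustrative, not a format GFS prescribes): readers skip bytes that fail the checksum (padding, partial writes) and drop record ids they have already seen (duplicates from retries).

```python
import struct
import zlib

HEADER = struct.Struct("<IIQ")   # payload length, CRC32 of payload, unique record id

def encode_record(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload

def decode_records(blob: bytes):
    seen, offset = set(), 0
    while offset + HEADER.size <= len(blob):
        length, crc, record_id = HEADER.unpack_from(blob, offset)
        payload = blob[offset + HEADER.size : offset + HEADER.size + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:            # drop duplicates from client retries
                seen.add(record_id)
                yield payload
            offset += HEADER.size + length
        else:
            offset += 1                          # padding or garbage: resynchronize
```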
System Interactions

● If the master receives a modification operation for a particular chunk:
○ Master finds the chunkservers that have the chunk and grants a chunk lease to one of them
■ This server is called the primary; the other servers are called secondaries
■ The primary determines the serialization order for all of the chunk’s modifications, and the secondaries follow that order
○ After the lease expires (~60 seconds), master may grant primary status to a different server for that chunk
■ The master can, at times, revoke a lease (e.g. to disable modifications when a file is being renamed)
■ As long as the chunk is being modified, the primary can request lease extensions indefinitely
○ If master loses contact with the primary, that’s okay: just grant a new lease after the old one expires
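A sketch of the master-side lease bookkeeping this describes, with hypothetical structures: roughly 60-second leases, extensions only for the current primary, and a new primary chosen only once the old lease has lapsed.

```python
import time

LEASE_SECONDS = 60

class LeaseTable:
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary chunkserver id, lease expiry time)

    def grant(self, handle, replica_ids):
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if primary is not None and time.time() < expiry:
            return primary                              # an unexpired lease already exists
        primary = next(iter(replica_ids))               # pick one replica as the new primary
        self.leases[handle] = (primary, time.time() + LEASE_SECONDS)
        return primary

    def extend(self, handle, requester):
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if requester == primary and time.time() < expiry:
            self.leases[handle] = (primary, time.time() + LEASE_SECONDS)
            return True
        return False                                    # lease lapsed or requester is not the primary
```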
System Interactions
1. Client asks master which chunkserver holds the lease for the chunk (the primary) and for the locations of the other replicas (the secondaries)
2. Master grants a new lease on chunk, increases the
chunk version number, tells all replicas to do the
same. Replies to client. Client no longer has to talk
to master
3. Client pushes data to all servers, not necessarily to
primary first
4. Once data is acked, client sends write request to
primary. Primary decides serialization order for all
incoming modifications and applies them to the
chunk
System Interactions
5. After finishing the modification, primary forwards
write request and serialization order to secondaries,
so they can apply modifications in same order. (If
primary fails, this step is never reached.)
6. All secondaries reply back to the primary once they
finish the modifications
7. Primary replies back to the client, either with
success or error
○ If write succeeds at primary but fails at any of
the secondaries, then we have inconsistent
state → error returned to client
○ Client can retry steps (3) through (7)
Note: If a write straddles a chunk boundary, GFS splits it into multiple write operations
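Steps (1) through (7) seen from the client, as a sketch; the master and chunkserver objects and their methods are stand-ins for the real RPC interface.

```python
def write_within_chunk(master, path, chunk_index, offset_in_chunk, data):
    # (1)-(2) Control flow: learn who holds the lease (primary) and where the replicas are.
    handle, primary, secondaries = master.get_lease_holder(path, chunk_index)
    # (3) Data flow: push the bytes to every replica (in practice pipelined along a chain).
    for server in [primary] + secondaries:
        server.push_data(handle, data)
    # (4)-(7) Control flow: the primary serializes the write, forwards it to the
    # secondaries, and reports success or error back to the client.
    if not primary.apply_write(handle, offset_in_chunk, secondaries):
        raise IOError("replicas may now be inconsistent; retry steps (3)-(7)")
```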
System Interactions for Record Appends

● Same as before, but with the following extra steps:
○ In step (4), the primary checks whether appending the record to the current chunk would exceed the max chunk size (64 MB)
■ If so, it pads the chunk, notifies the secondaries to do the same, and tells the client to retry the request on the next chunk
■ Record append is restricted to ¼ of the max chunk size → at most, padding will be 16 MB
● If record append fails at any of the replicas, the client must retry
○ This means that replicas of the same chunk may contain duplicates
● A successful record append means the data must have been written at the same offset on all replicas of the chunk
○ Hence, GFS guarantees that regions written by record append are defined, interspersed with inconsistent regions
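The primary's extra check in step (4) amounts to a little arithmetic, sketched below with hypothetical names; because records are capped at ¼ of the chunk size, the padding written when a record does not fit is at most 16 MB.

```python
CHUNK_SIZE = 64 * 2**20
MAX_RECORD = CHUNK_SIZE // 4        # records larger than this are rejected up front

def plan_record_append(bytes_used_in_chunk: int, record_len: int):
    assert record_len <= MAX_RECORD, "record append is restricted to 1/4 of the chunk size"
    if bytes_used_in_chunk + record_len > CHUNK_SIZE:
        # Pad out the current chunk (on all replicas) and have the client retry on the next chunk.
        return ("pad_and_retry", CHUNK_SIZE - bytes_used_in_chunk)
    # Otherwise append here; this offset is what gets returned to the client.
    return ("append_at", bytes_used_in_chunk)
```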
Conclusions

● Decoupling data flow from control flow is super-important
● Single-master design can be, in certain circumstances, quite advantageous
● Focusing on the core use cases of the file system (e.g. atomic appends)
can lead you to the right abstractions
Questions?
