THE GOOGLE FILE SYSTEM
S. GHEMAWAT, H. GOBIOFF AND S. LEUNG
APRIL 7, 2015
CSI5311: Distributed Databases and Transaction Processing
Winter 2015
Prof. Iluju Kiringa
University of Ottawa
Presented By:
Ajaydeep Grewal
Roopesh Jhurani
1
AGENDA
• Introduction
• Design Overview
• System Interactions
• Master Operations
• Fault Tolerance and Diagnosis
• Measurements
• Conclusion
• References
2
Introduction
 Google File System (GFS) is a distributed file
system developed by Google for its own use.
 It is a scalable file system for large distributed
data-intensive applications.
 It is widely used within Google as a storage
platform for the generation and processing of data.
3
Inspirational factors
 Multiple clusters distributed worldwide.
 Thousands of queries served per second.
 A single query can read hundreds of MB of data.
 Google stores dozens of copies of the entire Web.
Conclusion
 Need large, distributed, highly fault tolerant file system.
 Large data processing needs Performance, Reliability,
Scalability and Availability.
4
Design Assumptions
 Component Failures
The file system consists of hundreds of machines built from
commodity parts.
The quantity and quality of the machines guarantee that some
nodes are non-functional at any given time.
 Huge File Sizes
 Workload
Large streaming reads.
Small random reads.
Large, sequential writes that append data to files.
 Applications & API are co-designed
Increases flexibility.
The goal is a simple file system that places a light burden on applications. 5
GFS Architecture
Master
Chunk Servers
GFS Client API
6
GFS Architecture
Master
Contains the system metadata, such as:
• Namespaces
• Access Control Information
• Mappings from files to chunks
• Current location of chunks
Also helps in:
◦ Garbage collection
◦ Synchronization with chunk servers through periodic heartbeat messages
7
GFS Architecture
Chunk Servers
 Machines that store the physical file data, divided into fixed-size chunks.
 Each Master server can have a number of associated chunk
servers.
 For reliability, each chunk is replicated on multiple chunk
servers.
Chunk Handle
 An immutable and globally unique 64-bit chunk handle assigned by the
master at the time of chunk creation.
8
GFS Architecture
GFS Client code
 Code at the client machine that interacts with GFS.
 Interacts with the master for metadata operations.
 Interacts with Chunk Servers for all Read-Write operations.
9
GFS Architecture
1. The GFS client code requests a particular file from the master.
2. The master replies with the chunk handle and the locations of the chunk servers.
3. The client caches this information and interacts directly with the chunk server.
4. Changes are periodically replicated across all the replicas.
10
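The four steps above boil down to one metadata round trip followed by direct data transfer between client and chunk server. The sketch below is a toy illustration of that read path (the class names, the cache dictionary, and the example file are invented for this sketch; they are not the GFS client API):

```python
class Master:
    """Toy master: maps (file name, chunk index) -> (chunk handle, replica locations)."""
    def __init__(self):
        self.table = {("/logs/web.0", 0): ("chunk-0001", ["cs-17", "cs-42"])}

    def lookup(self, filename, chunk_index):
        return self.table[(filename, chunk_index)]


class ChunkServer:
    """Toy chunk server: holds chunk data keyed by chunk handle."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read_chunk(self, handle):
        return self.chunks[handle]


metadata_cache = {}  # client-side cache: (file, chunk index) -> (handle, locations)

def read(master, servers, filename, chunk_index):
    key = (filename, chunk_index)
    if key not in metadata_cache:                       # steps 1-2: one master round trip
        metadata_cache[key] = master.lookup(filename, chunk_index)
    handle, locations = metadata_cache[key]
    return servers[locations[0]].read_chunk(handle)     # step 3: data comes from a chunk server

master = Master()
servers = {"cs-17": ChunkServer({"chunk-0001": b"hello"}),
           "cs-42": ChunkServer({"chunk-0001": b"hello"})}
print(read(master, servers, "/logs/web.0", 0))  # later reads of this chunk skip the master
```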
Chunk Size
Having a large uniform chunk size of 64 MB has the
following advantages:
 Reduced Client-Master interaction.
 Reduced Network-Overhead.
 Reduction in the amount of metadata stored.
11
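One reason a 64 MB chunk reduces client-master interaction is that even a large read maps to only a handful of chunk indices, so a single request to the master can cover it. A minimal sketch of that offset-to-chunk-index arithmetic (the helper name is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed GFS chunk size

def chunk_indices(offset, length):
    """Return the chunk indices covered by the byte range [offset, offset + length)."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))

# A 200 MB sequential read starting at byte 0 touches only 4 chunks,
# i.e. at most 4 chunk lookups at the master.
print(chunk_indices(0, 200 * 1024 * 1024))   # [0, 1, 2, 3]
```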
Metadata
 The file and chunk namespaces.
 The mappings from files to chunks.
 Locations of each chunk's replicas.
The first two are kept persistently in the operation log to
ensure reliability and recoverability.
Chunk locations are not persisted; they are held by the chunk servers.
The master polls the chunk servers at start-up and periodically
thereafter.
12
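A minimal sketch of the three kinds of metadata, assuming plain in-memory dictionaries on the master (the real data structures are not specified at this level). Note that chunk locations are never persisted; they are rebuilt from chunk-server reports:

```python
# In-memory master metadata (simplified sketch):
namespace = {"/dir1/file_a": {"owner": "app1"}}                  # namespaces + access control
file_to_chunks = {"/dir1/file_a": ["chunk-0001", "chunk-0002"]}  # file -> ordered chunk handles
chunk_locations = {}  # chunk handle -> set of chunk servers; NOT persisted

def on_chunkserver_report(server_id, chunk_handles):
    """Rebuild chunk locations from chunk-server reports at start-up and in heartbeats."""
    for handle in chunk_handles:
        chunk_locations.setdefault(handle, set()).add(server_id)

on_chunkserver_report("cs-17", ["chunk-0001"])
on_chunkserver_report("cs-42", ["chunk-0001", "chunk-0002"])
print(chunk_locations["chunk-0001"])  # {'cs-17', 'cs-42'}
```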
Operation Logs
 The operation log contains a historical record of critical
metadata changes.
 Metadata updates are recorded as (old value, new value) pairs.
 Because the operation log is critical, it is replicated on remote
machines.
 Global snapshots (checkpoints):
 A checkpoint is stored in a compact B-tree-like form that can be
mapped directly into memory.
 New checkpoints are created as the log grows, so recovery only
needs to replay the log records written after the latest checkpoint.
13
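To make the recovery story concrete, here is a toy sketch, with an invented record format, of how the operation log and checkpoints fit together: mutations are logged as (old, new) pairs before being applied, and recovery replays the log on top of the latest checkpoint.

```python
def append_log_record(log, path, old_value, new_value):
    """Append one metadata mutation to the operation log before applying it."""
    log.append({"path": path, "old": old_value, "new": new_value})

def recover(checkpoint, log):
    """Rebuild master state: start from the checkpoint, then replay the log."""
    state = dict(checkpoint)
    for record in log:
        state[record["path"]] = record["new"]
    return state

checkpoint = {"/home/user/foo": "chunk-0001"}   # state captured at checkpoint time
log = []                                        # records written after the checkpoint
append_log_record(log, "/home/user/bar", None, "chunk-0002")
print(recover(checkpoint, log))                 # both files present after recovery
```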
System Interactions
 Mutation
A mutation is an operation that changes the contents or metadata
of a chunk, such as a write or an append operation.
 Lease mechanism
Leases are used to maintain a consistent mutation order across
replicas.
◦ First, the master grants a chunk lease to one of the replicas,
which becomes the primary.
◦ The primary determines the order of mutations applied by all the
other replicas.
14
Write Control and Data Flow
15
1. Client requests a write operation.
2. Master replies with the location of the chunk primary and the replicas.
3. Client caches the information and pushes the write data to all the replicas.
4. The primary and the replicas store the data in a buffer and send a confirmation.
5. Primary sends a mutation order to all the secondaries.
6. Secondaries commit the mutations and send a confirmation to the primary.
7. Primary sends a confirmation to the client.
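A highly simplified model of this message flow is sketched below (in-memory objects only, no networking or failure handling; all names are invented). It shows the essential ordering: data is staged at every replica first, then the primary assigns a serial number that fixes the mutation order everywhere.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}      # data id -> bytes, staged but not yet applied
        self.applied = []     # mutations applied, in primary-assigned order

    def push_data(self, data_id, data):          # steps 3-4: stage data, acknowledge
        self.buffer[data_id] = data
        return "ack"

    def apply(self, serial_no, data_id):         # step 6: commit in the assigned order
        self.applied.append((serial_no, self.buffer.pop(data_id)))
        return "done"

def write(client_data, primary, secondaries):
    data_id = "d1"
    for r in [primary] + secondaries:            # step 3: push data to every replica
        assert r.push_data(data_id, client_data) == "ack"
    serial_no = len(primary.applied) + 1         # step 5: primary picks the mutation order
    primary.apply(serial_no, data_id)
    for s in secondaries:                        # step 6: secondaries apply in the same order
        s.apply(serial_no, data_id)
    return "write acknowledged to client"        # step 7

primary, secondaries = Replica("p"), [Replica("s1"), Replica("s2")]
print(write(b"record", primary, secondaries))
```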
Consistency
 Consistent: All the replicated chunks have the
same data.
 Inconsistent: A failed mutation makes the region
inconsistent, i.e., different clients may see different
data.
16
Master Operations
1. Namespace Management and Locking
2. Replica Placement
3. Creation, Re-replication and Rebalancing
4. Garbage Collection
5. Stale Replica Detection
17
Master Operations
Namespace Management and Locking
 Separate locks over namespace regions ensure:
 Proper serialization
 Concurrent operations at the master, avoiding delays.
 Each master operation acquires a set of locks before it runs.
 An operation on /dir1/dir2/dir3/leaf requires the following locks:
 Read-Lock on /dir1, /dir1/dir2, /dir1/dir2/dir3
 Read-Lock or Write-Lock on /dir1/dir2/dir3/leaf
 File creation doesn't require a write-lock on the parent directory: a read-lock is
enough to protect it from deletion, renaming, or snapshotting.
 Write-locks on file names serialize attempts to create a file with the same name twice.
18
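A sketch of the lock-set rule described above, assuming a small helper that derives the ancestor directories of a path (the function is illustrative, not the master's actual interface):

```python
def locks_for(path, leaf_mode="write"):
    """Read-locks on every ancestor directory, plus a read- or write-lock on the leaf."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [("read", p) for p in ancestors] + [(leaf_mode, path)]

# Creating /home/user/foo while /home/user is being snapshotted conflicts on /home/user:
print(locks_for("/home/user/foo", "write"))
# [('read', '/home'), ('read', '/home/user'), ('write', '/home/user/foo')]
print(locks_for("/home/user", "write"))
# [('read', '/home'), ('write', '/home/user')]  <- the snapshot's write-lock on /home/user
```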
Master Operations
Locking Mechanism
 Snapshotting /home/user to /save/user acquires:
 Read Locks on: /home, /save
 Write Locks on: /home/user, /save/user
 Creating /home/user/foo acquires:
 Read Locks on: /home, /home/user
 Write Lock on: /home/user/foo
 The two operations conflict on /home/user and are therefore serialized.
19
Master Operations
Replica Placement
 Serves two purposes:
 Maximize data reliability and availability
 Maximize Network Bandwidth utilization
 Spread Chunk replicas across racks:
 To ensure chunk survivability
 To exploit aggregate read bandwidth of multiple racks
 Trade-off: write traffic has to flow through multiple racks.
20
Master Operations
Creation, re-replication and rebalancing
 Creation: The master considers several factors:
 Place new replicas on chunk servers with below-average disk utilization.
 Limit the number of “recent” creations on each chunk server.
 Spread replicas of a chunk across racks.
 Re-replication:
 The master re-replicates a chunk when the number of replicas falls below a goal level.
 Chunks needing re-replication are prioritized based on several factors.
 The master limits the number of active clone operations, both for the cluster and
for each chunk server.
 Each chunk server limits the bandwidth it spends on each clone operation.
 Balancing:
 The master rebalances replicas periodically for better disk utilization and load balancing.
 The master gradually fills up a new chunk server rather than instantly swamping it with
new chunks.
21
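The creation heuristics can be read as a simple scoring pass over candidate chunk servers. The sketch below is an invented approximation (the threshold, fields, and weights are made up for illustration): filter out servers with too many recent creations, prefer low disk utilization, and spread the chosen replicas across racks.

```python
def pick_servers(servers, num_replicas=3, recent_limit=5):
    """Pick chunk servers for a new chunk: prefer low disk utilization,
    skip servers with too many recent creations, spread across racks."""
    candidates = [s for s in servers if s["recent_creations"] < recent_limit]
    candidates.sort(key=lambda s: s["disk_utilization"])
    chosen, racks = [], set()
    for s in candidates:                     # first pass: at most one replica per rack
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == num_replicas:
            break
    return chosen

servers = [
    {"id": "cs1", "rack": "r1", "disk_utilization": 0.40, "recent_creations": 1},
    {"id": "cs2", "rack": "r1", "disk_utilization": 0.20, "recent_creations": 2},
    {"id": "cs3", "rack": "r2", "disk_utilization": 0.55, "recent_creations": 0},
    {"id": "cs4", "rack": "r3", "disk_utilization": 0.35, "recent_creations": 9},
]
print([s["id"] for s in pick_servers(servers)])  # ['cs2', 'cs3'] (cs4 has too many recent creations)
```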
Master Operations
Garbage Collection
 GFS reclaims the storage of a deleted file lazily.
 Mechanism (sketched below):
 The master logs the deletion like any other change.
 The file is renamed to a hidden name that includes the deletion timestamp.
 The master removes expired hidden files during its regular namespace
scan, thus erasing their in-memory metadata.
 A similar scan of the chunk namespace identifies orphaned
chunks and erases their metadata.
 Chunk servers delete any chunks not present in the master's
metadata, learned during the regular heartbeat message exchange.
22
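A toy sketch of lazy deletion, with an invented hidden-name convention and grace period (the paper's default grace period is three days; the exact policy here is illustrative):

```python
import time

GRACE_PERIOD = 3 * 24 * 3600  # seconds before a hidden file is reclaimed (illustrative)

def delete_file(namespace, path, now=None):
    """Deletion just renames the file to a hidden name carrying a timestamp."""
    now = now or time.time()
    hidden = f"{path}.deleted.{int(now)}"
    namespace[hidden] = namespace.pop(path)
    return hidden

def namespace_scan(namespace, now=None):
    """Regular scan: actually remove hidden files older than the grace period."""
    now = now or time.time()
    for name in list(namespace):
        if ".deleted." in name and now - int(name.rsplit(".", 1)[1]) > GRACE_PERIOD:
            del namespace[name]   # its chunks become orphaned and are reclaimed later

ns = {"/home/user/foo": ["chunk-0001"]}
hidden = delete_file(ns, "/home/user/foo", now=0)
namespace_scan(ns, now=GRACE_PERIOD + 1)
print(hidden, ns)   # hidden name with timestamp, then an empty namespace
```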
Master Operations
Stale Replica Detection
 Problem: A chunk replica may become stale if a chunk server fails and
misses mutations.
 Solution: for each chunk, the master maintains a version number.
 Whenever the master grants a new lease on a chunk, it increases
the version number and informs the up-to-date replicas (the version
number is stored persistently on the master and the associated chunk servers).
 The master detects that a chunk server has a stale replica when the chunk
server restarts and reports its set of chunks and associated version
numbers.
 The master removes stale replicas in its regular garbage collection.
 The master includes the chunk version number when it informs clients
which chunk server holds a lease on a chunk, or when it instructs a
chunk server to read the chunk from another chunk server in a
cloning operation.
23
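The version-number bookkeeping amounts to a comparison at report time; a minimal sketch, assuming the master keeps one authoritative version per chunk (function names are illustrative):

```python
chunk_versions = {"chunk-0001": 7}   # master's authoritative version numbers

def grant_lease(handle):
    """Granting a new lease bumps the chunk version; up-to-date replicas are informed."""
    chunk_versions[handle] += 1
    return chunk_versions[handle]

def check_report(handle, reported_version):
    """Called when a restarted chunk server reports its chunks and their versions."""
    current = chunk_versions[handle]
    if reported_version < current:
        return "stale: schedule for garbage collection"
    if reported_version > current:
        return "replica ahead of master: master adopts the higher version"
    return "up to date"

grant_lease("chunk-0001")                 # version becomes 8
print(check_report("chunk-0001", 7))      # this replica missed the lease -> stale
```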
Fault Tolerance and Diagnosis
High Availability
 Strategies: Fast recovery and Replication.
 Fast Recovery:
 Master and chunk servers are designed to restore their state in seconds.
 There is no distinction between normal and abnormal termination
(servers are routinely shut down just by killing the process).
 Clients and other servers experience a minor hiccup as outstanding requests time out;
they reconnect to the restarted server and retry.
 Chunk Replication:
 Each chunk is replicated on multiple chunk servers on different racks (different parts of the
file namespace can have different replication levels).
 The master clones existing replicas as chunk servers go offline or report corrupted
replicas (detected by checksum verification).
 Master Replication
 Shadow masters provide read-only access to the file system even when the primary master is
down.
 The master's operation log and checkpoints are replicated on multiple machines for
reliability.
24
Fault Tolerance and Diagnosis
Data Integrity
 Each chunk server uses checksumming to detect corruption of stored
chunks.
 Each chunk is broken into 64 KB blocks, each with an associated 32-bit checksum.
 Checksums are metadata kept in memory and stored persistently with
logging, separate from user data.
 For READS: the chunk server verifies the checksums of the data blocks that
overlap the read range before returning any data.
 For WRITES (overwrites): the chunk server verifies the checksums of the first and last
data blocks that overlap the write range before performing the write, and
finally computes and records the new checksums.
25
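A sketch of per-block checksumming, using CRC32 as a stand-in (the slides only say "32-bit checksum"; the exact checksum function is an assumption). Reads verify every block that overlaps the requested range before returning data.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks

def compute_checksums(chunk_data):
    """One 32-bit checksum per 64 KB block of the chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data, checksums, offset, length):
    """Verify every block that overlaps the requested range before returning data."""
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for i in range(first, last + 1):
        block = chunk_data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[i]:
            raise IOError(f"checksum mismatch in block {i}: report to master")
    return chunk_data[offset:offset + length]

data = bytes(300 * 1024)                    # a 300 KB chunk for illustration
sums = compute_checksums(data)
print(len(verify_read(data, sums, 100 * 1024, 50 * 1024)))   # 51200 bytes returned
```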
Measurements
Micro-benchmarks: GFS cluster
 One master, two master replicas, 16 chunk servers, and 16 clients.
 Each machine: dual 1.4 GHz PIII processors, 2 GB RAM, two 80 GB 5400 RPM
disks, and a 100 Mbps full-duplex Ethernet NIC connected to an HP 2524
switch (10/100 ports plus a Gigabit uplink).
26
Measurements
Micro-benchmarks: READS
 Each client reads a randomly selected 4 MB region 256 times (= 1 GB of
data) from a 320 GB file set.
 Aggregate chunk-server memory is 32 GB, so at most a 10% hit rate in the
Linux buffer cache is expected.
27
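The 10% figure is just the ratio of aggregate cache to the data being sampled; a quick arithmetic check under those assumptions:

```python
aggregate_cache_gb = 16 * 2      # 16 chunk servers x 2 GB RAM each = 32 GB
file_set_gb = 320                # size of the file set the random 4 MB reads are drawn from
hit_rate = aggregate_cache_gb / file_set_gb
print(f"expected Linux buffer-cache hit rate: at most {hit_rate:.0%}")  # at most 10%
```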
Measurements
Micro-benchmarks: WRITE
 Each client writes 1 GB of data to a new file in a series of 1 MB writes.
 The network stack does not interact very well with the pipelining scheme
used for pushing data to the chunk replicas: network congestion is
more likely for 16 writers than for 16 readers because each write
involves three different replicas.
28
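The speaker notes quote a theoretical aggregate write limit of 67 MB/s; that number follows from pushing each byte to three replicas over 100 Mbps links. A quick check of the arithmetic:

```python
chunk_servers = 16
nic_input_mb_per_s = 12.5        # 100 Mbps full-duplex link per chunk server
replicas_per_write = 3           # every byte is written to three chunk servers
aggregate_limit = chunk_servers * nic_input_mb_per_s / replicas_per_write
print(f"theoretical aggregate write limit: about {aggregate_limit:.0f} MB/s")  # ~67 MB/s
```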
Measurements
Micro-benchmarks: RECORD APPENDS
 Each client appends simultaneously to a single file.
 Performance is limited by the network bandwidth of the 3 chunk
servers that store the last chunk of the file, independent of the number
of clients.
29
Conclusion
Google File System
 Supports large-scale data processing workloads on commodity (COTS) x86 servers.
 Component failures are the norm rather than the exception.
 Optimized for huge files that are mostly appended to and then read sequentially.
 Fault tolerance through constant monitoring, replication of crucial data, and
fast, automatic recovery.
 Delivers high aggregate throughput to many concurrent readers and
writers.
Future Improvements
 Networking stack limit: write throughput could be improved in the
future.
30
References
1. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The
Google File System." ACM SIGOPS Operating Systems Review
37(5): 29–43, 2003.
2. Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee.
Frangipani: A scalable distributed file system. In Proceedings of the
16th ACM Symposium on Operating System Principles, pages 224–
237, Saint-Malo, France, October 1997.
3. http://en.wikipedia.org/wiki/Google_File_System
4. http://computer.howstuffworks.com/internet/basics/google-file-
system.htm
5. http://en.wikiversity.org/wiki/Big_Data/Google_File_System
6. http://storagemojo.com/google-file-system-eval-part-i/
7. https://www.youtube.com/watch?v=d2SWUIP40Nw
31
Thank You!!
32
Editor's Notes
• #26: Each chunk server uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunk servers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, does not guarantee identical replicas. Therefore, each chunk server must independently verify the integrity of its own copy by maintaining checksums.
A chunk is broken up into 64 KB blocks. Each has a corresponding 32-bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data. For reads, the chunk server verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunk server. Therefore chunk servers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunk server returns an error to the requestor and reports the mismatch to the master. In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunk server that reported the mismatch to delete its replica.
Checksumming has little effect on read performance for several reasons. Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunk server are done without any I/O, and checksum calculation can often be overlapped with I/Os.
Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read. In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.
During idle periods, chunk servers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.
  • #27: In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google. We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunk servers, and 16 clients. Note that this configuration was set up for ease of testing. Typical clusters have hundreds of chunk servers and hundreds of clients. All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.
• #28: N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunk servers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold cache results. Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunk server.
• #29: N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunk servers, each with a 12.5 MB/s input connection. The write rate for one client is 6.3 MB/s, about half of the limit. The main culprit for this is our network stack. It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. Delays in propagating data from one replica to another reduce the overall write rate. Aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunk server as the number of clients increases. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas. Writes are slower than we would like. In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.
• #30: Figure 3(c) shows record append performance. N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunk servers that store the last chunk of the file, independent of the number of clients. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients. Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. Therefore, the chunk server network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunk servers for another file are busy.