The Google File System
Tut Chi Io (modified by Fengchang)
WHAT IS GFS
• Google File System (GFS)
• A scalable distributed file system (DFS)
• Fault tolerance
• Reliability
• Scalability
• Availability and performance for large
networks of connected nodes
WHAT IS GFS
• Built from low-cost commodity
hardware components
• Optimized to accommodate Google's
different data use and storage needs
• Capitalizes on the strengths of off-the-
shelf servers while minimizing
hardware weaknesses
Design Overview – Assumption
• Inexpensive commodity hardware
• Large files: Multi-GB
• Workloads
– Large streaming reads
– Small random reads
– Large, sequential appends
• Concurrent append to the same file
• High Throughput > Low Latency
Design Overview – Interface
• Create
• Delete
• Open
• Close
• Read
• Write
• Snapshot
• Record Append
What does it look like
Design Overview – Architecture
• Single master, multiple chunk servers,
multiple clients
– User-level process running on commodity
Linux machine
– GFS client code linked into each client
application to communicate
• File -> 64MB chunks -> Linux files
– on local disks of chunk servers
– replicated on multiple chunk servers (3r)
• Clients cache metadata but not chunk
data
Design Overview – Single Master
• Why centralization? Simplicity!
• Global knowledge is needed for
– Chunk placement
– Replication decisions
Design Overview – Chunk Size
• 64 MB – much larger than typical file
system block sizes, why?
– Advantages
• Reduce client-master interaction
• Reduce network overhead
• Reduce the size of the metadata
– Disadvantages
• Internal fragmentation
– Solution: lazy space allocation
• Hot Spots – many clients accessing a 1-chunk
file, e.g. executables
– Solutions:
• Higher replication factor
• Stagger application start times
• Client-to-client communication
Design Overview – Metadata
• File & chunk namespaces
– In master’s memory
– In master’s and chunk servers’ storage
• File-chunk mapping
– In master’s memory
– In master’s and chunk servers’ storage
• Location of chunk replicas
– In master’s memory
– Ask chunk servers when
• Master starts
• Chunk server joins the cluster
– If locations were persisted, the master and
chunk servers would have to be kept in sync
Design Overview – Metadata – In-memory DS
• Why in-memory data structure for the
master?
– Fast! Enables periodic scans for garbage
collection (GC) and load balancing (LB)
• Will it pose a limit on the number of
chunks -> total capacity?
– No, a 64MB chunk needs less than 64B
metadata (640TB needs less than 640MB)
• Most chunks are full
• Prefix compression on file names
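The slide's capacity arithmetic can be checked directly. The constants below restate the slide's own figures (less than 64 B of metadata per 64 MB chunk); this is a back-of-envelope sketch, not GFS code:

```python
# Back-of-envelope check of the slide's claim, using its own figures:
# less than 64 B of in-memory metadata per 64 MB chunk.

CHUNK_SIZE = 64 * 2**20        # 64 MB per chunk
METADATA_PER_CHUNK = 64        # upper bound from the slide, in bytes

def metadata_bytes(total_storage: int) -> int:
    """Metadata needed if every chunk is full (the common case)."""
    return (total_storage // CHUNK_SIZE) * METADATA_PER_CHUNK

# 640 TB of file data -> 640 MB of chunk metadata, as the slide states
print(metadata_bytes(640 * 2**40) // 2**20)  # 640
```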
Design Overview – Metadata – Log
• The only persistent record of metadata
• Defines the order of concurrent
operations
• Critical
– Replicated on multiple remote machines
– Respond to client only after the log record
is flushed both locally and remotely
• Fast recovery by using checkpoints
– Use a compact B-tree like form directly
mapping into memory
– Switch to a new log file and create new
checkpoints in a separate thread
Design Overview – Consistency Model
• Consistent
– All clients will see the same data,
regardless of which replicas they read
from
• Defined
– Consistent, and clients will see what the
mutation writes in its entirety
Design Overview – Consistency Model
• After a sequence of successful mutations, a
region is guaranteed to be defined
– Same order on all replicas
– Chunk version number to detect stale
replicas
• What if clients cache stale chunk locations?
– Limited by cache entry’s timeout
– Most files are append-only
• A stale replica returns a premature end of
chunk
System Interactions – Lease
• Minimizes management overhead
• Granted by the master to one of the
replicas to become the primary
• Primary picks a serial order of
mutation and all replicas follow
• 60 seconds timeout, can be extended
• Can be revoked
System Interactions – Mutation Order
1. Client asks the master which chunk server holds
the current lease (the primary)
2. Master replies with the identity of the primary and
the locations of the other replicas (cached by client)
3. Client pushes the data to all replicas (3a, 3b, 3c)
4. Client sends the write request to the primary
5. Primary assigns serial numbers to the mutations
and applies them locally
6. Primary forwards the write request to all
secondary replicas
7. Secondaries apply mutations in the same serial
order and reply "operation completed"
8. Primary replies to the client: "operation
completed" or an error report
System Interactions – Data Flow
• Decouple data flow and control flow
• Control flow
– Master -> Primary -> Secondaries
• Data flow
– Carefully picked chain of chunk servers
• Forward to the closest first
• Distances estimated from IP addresses
– Linear (not tree), to fully utilize outbound
bandwidth (not divided among recipients)
– Pipelining, to exploit full-duplex links
• Time to transfer B bytes to R replicas = B/T +
RL
• T: network throughput, L: latency
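The slide's transfer-time formula can be tried with illustrative numbers (1 MB payload, 100 Mbps links, 1 ms per-hop latency are assumptions, not GFS measurements):

```python
# Worked example of the slide's formula: pushing B bytes through a
# linear, pipelined chain of R replicas takes about B/T + R*L.
# The numbers below are illustrative assumptions, not GFS measurements.

def transfer_time(B: float, R: int, T: float, L: float) -> float:
    """Seconds to deliver B bytes to R replicas; T in bytes/s, L in s/hop."""
    return B / T + R * L

B = 2**20          # 1 MB of data
T = 100e6 / 8      # 100 Mbps link, in bytes per second
L = 1e-3           # 1 ms latency per hop
# Pipelining makes the latency term R*L additive rather than multiplying
# the whole transfer, so 3 replicas cost only ~3 ms extra.
print(f"{transfer_time(B, 3, T, L) * 1000:.1f} ms")  # 86.9 ms
```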
System Interactions – Atomic Record Append
• Concurrent appends are serializable
– Client specifies only data
– GFS appends at least once atomically
– Return the offset to the client
– Heavily used at Google to implement files as
• multiple-producer/single-consumer queues
• Merged results from many different clients
– On failures, the client retries the
operation
– Data are defined, intervening regions are
inconsistent
• A Reader can identify and discard extra
padding and record fragments using the
checksums
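The at-least-once semantics above can be sketched as a client retry loop: if an acknowledgment is lost after a replica applied the append, the retry produces a duplicate record that readers must skip. All class and function names here are hypothetical, not the real GFS API:

```python
# Toy model of at-least-once record append. The mutation always lands,
# but the first acknowledgment is "lost", forcing a client retry and
# leaving a duplicate record for readers to discard.

class ChunkReplica:
    def __init__(self):
        self.records = []
        self._drop_next_ack = True

    def append(self, record):
        self.records.append(record)      # mutation is applied...
        if self._drop_next_ack:
            self._drop_next_ack = False
            raise IOError("ack lost")    # ...but the client never hears back

def record_append(chunk, record, retries=3):
    """Client-side retry loop: at-least-once, duplicates possible."""
    for _ in range(retries):
        try:
            chunk.append(record)
            return len(chunk.records) - 1   # offset returned to the client
        except IOError:
            continue                        # client simply retries
    raise RuntimeError("append failed after retries")

chunk = ChunkReplica()
off = record_append(chunk, "rec1")
print(chunk.records)   # ['rec1', 'rec1'] -- duplicate a reader must skip
print(off)             # 1
```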
System Interactions – Snapshot
• Makes a copy of a file or a directory
tree almost instantaneously
• Use copy-on-write
• Steps
– Revokes lease
– Logs operations to disk
– Duplicates metadata, pointing to the same
chunks
• On first write, the chunk server creates the
real duplicate locally
– Local disks are about 3 times as fast as
100 Mb Ethernet links
Master Operation – Namespace Management
• No per-directory data structure
• No support for aliases (hard or symbolic links)
• Lock over regions of namespace to
ensure serialization
• Lookup table mapping full pathnames
to metadata
– Prefix compression -> In-Memory
Master Operation – Namespace Locking
• Each node (file/directory) has a read-
write lock
• Scenario: prevent /home/user/foo from
being created while /home/user is
being snapshotted to /save/user
– Snapshot
• Read locks on /home, /save
• Write locks on /home/user, /save/user
– Create
• Read locks on /home, /home/user
• Write lock on /home/user/foo
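The scenario above can be sketched as lock-set arithmetic: each operation read-locks every ancestor directory and write-locks its leaf, and two operations conflict when either write set intersects the other's locks. `lock_sets` and `conflicts` are illustrative helpers (modeling one path at a time), not GFS code:

```python
# Sketch of the slide's locking scheme over full pathnames.

def lock_sets(path: str, leaf_write: bool = True):
    """Return (read_locks, write_locks) for an operation on `path`."""
    parts = path.strip("/").split("/")
    ancestors = {"/" + "/".join(parts[:i]) for i in range(1, len(parts))}
    leaf = "/" + "/".join(parts)
    if leaf_write:
        return ancestors, {leaf}
    return ancestors | {leaf}, set()

def conflicts(a, b) -> bool:
    """Two operations conflict if either write set meets the other's locks."""
    ra, wa = a
    rb, wb = b
    return bool(wa & (rb | wb)) or bool(wb & (ra | wa))

# The slide's scenario: snapshotting /home/user write-locks it, so a
# concurrent create of /home/user/foo (which read-locks /home/user) waits.
print(conflicts(lock_sets("/home/user"), lock_sets("/home/user/foo")))  # True
# Two creates in the same directory only read-lock it -> they can proceed.
print(conflicts(lock_sets("/home/user/a"), lock_sets("/home/user/b")))  # False
```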
Master Operation – Policies
• New chunks creation policy
– New replicas on below-average disk
utilization
– Limit # of “recent” creations on each chunk
server
– Spread replicas of a chunk across racks
• Re-replication priority
– Far from replication goal first
– Chunk that is blocking client first
– Live files first (rather than deleted)
• Rebalance replicas periodically
Master Operation – Garbage Collection
• Lazy reclamation
– Logs deletion immediately
– Rename to a hidden name
• Remove 3 days later
• Undelete by renaming back
• Regular scan for orphaned chunks
– Not garbage:
• All references to chunks: file-chunk mapping
• All chunk replicas: Linux files under designated
directory on each chunk server
– Erase metadata
– HeartBeat message to tell chunk servers to
delete chunks
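The lazy-reclamation flow can be sketched as a rename plus a periodic scan. The hidden-name scheme and the dict-as-namespace are hypothetical simplifications, not the real on-disk format:

```python
# Sketch of lazy reclamation: delete() only renames the file to a hidden,
# timestamped name; a regular scan erases entries older than the grace
# period, and undelete is just renaming back before that.

GRACE_SECONDS = 3 * 24 * 3600          # "remove 3 days later"

def delete(namespace: dict, path: str, now: int) -> str:
    hidden = f"{path}.deleted.{now}"   # rename to a hidden name
    namespace[hidden] = namespace.pop(path)
    return hidden                      # undelete = rename back

def gc_scan(namespace: dict, now: int) -> list:
    """Regular scan: erase hidden entries past the grace period."""
    expired = [p for p in namespace
               if ".deleted." in p
               and now - int(p.rsplit(".", 1)[1]) > GRACE_SECONDS]
    for p in expired:
        del namespace[p]
    return expired

ns = {"/home/user/foo": "chunk handles..."}
delete(ns, "/home/user/foo", now=0)
print(gc_scan(ns, now=3600))               # [] -- still restorable
print(gc_scan(ns, now=GRACE_SECONDS + 1))  # ['/home/user/foo.deleted.0']
```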
Master Operation – Garbage Collection
• Advantages
– Simple & reliable
• Chunk creation may fail
• Deletion messages may be lost
– Uniform and dependable way to clean up
useless replicas
– Done in batches and the cost is amortized
– Done when the master is relatively free
– Safety net against accidental, irreversible
deletion
Master Operation – Garbage Collection
• Disadvantage
– Hard to fine-tune when storage is tight
• Solution
– Deleting a deleted file again explicitly
expedites storage reclamation
– Different policies for different parts of the
namespace
• Stale Replica Detection
– Master maintains a chunk version number
Fault Tolerance – High Availability
• Fast Recovery
– Restore state and start in seconds
– Do not distinguish normal and abnormal
termination
• Chunk Replication
– Different replication levels for different
parts of the file namespace
– Keep each chunk fully replicated as chunk
servers go offline or detect corrupted
replicas through checksum verification
Fault Tolerance – High Availability
• Master Replication
– Log & checkpoints are replicated
– Master failures?
• Monitoring infrastructure outside GFS starts a
new master process
– “Shadow” masters
• Read-only access to the file system when the
primary master is down
• Enhance read availability
• Reads a replica of the growing operation log
Fault Tolerance – Data Integrity
• Use checksums to detect data corruption
• A chunk (64 MB) is broken into 64 KB
blocks, each with a 32-bit checksum
• Chunk server verifies checksums before
returning data, so errors do not propagate
• Record append
– Incrementally update the checksum for the last
block; errors are detected when the block is read
• Random write
– Read and verify the first and last block first
– Perform write, compute new checksums
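The per-block scheme can be sketched directly; `zlib.crc32` stands in for GFS's (unspecified here) 32-bit checksum, and the block layout is the slide's:

```python
# Sketch of per-block integrity checking: a chunk is split into 64 KB
# blocks, each with its own 32-bit checksum, so a read verifies only
# the blocks it touches. zlib.crc32 is a stand-in checksum.
import zlib

BLOCK_SIZE = 64 * 1024

def block_checksums(chunk: bytes) -> list:
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify(chunk: bytes, sums: list) -> bool:
    """Chunk server check before returning data, so errors don't propagate."""
    return block_checksums(chunk) == sums

data = bytes(200 * 1024)                  # a 200 KB chunk -> 4 blocks
sums = block_checksums(data)
corrupted = data[:70000] + b"\x01" + data[70001:]   # flip a byte in block 1
print(verify(data, sums))        # True
print(verify(corrupted, sums))   # False
```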
Conclusion
• GFS supports large-scale data
processing using commodity hardware
• Reexamines traditional file system
assumptions
– based on application workload and
technological environment
– Treat component failures as the norm
rather than the exception
– Optimize for huge files that are mostly
appended
– Relax the standard file system interface
Conclusion
• Fault tolerance
– Constant monitoring
– Replicating crucial data
– Fast and automatic recovery
– Checksumming to detect data corruption
at the disk or IDE subsystem level
• High aggregate throughput
– Decouple control and data transfer
– Minimize operations by large chunk size
and by chunk lease
Reference
• Sanjay Ghemawat, Howard Gobioff, and
Shun-Tak Leung, “The Google File
System”, SOSP 2003
Ad

More Related Content

What's hot (20)

Cluster based storage - Nasd and Google file system - advanced operating syst...
Cluster based storage - Nasd and Google file system - advanced operating syst...Cluster based storage - Nasd and Google file system - advanced operating syst...
Cluster based storage - Nasd and Google file system - advanced operating syst...
Antonio Cesarano
 
Google File System
Google File SystemGoogle File System
Google File System
guest2cb4689
 
Google File System
Google File SystemGoogle File System
Google File System
Junyoung Jung
 
Google File System
Google File SystemGoogle File System
Google File System
Amgad Muhammad
 
The Google file system
The Google file systemThe Google file system
The Google file system
Sergio Shevchenko
 
Replication, Durability, and Disaster Recovery
Replication, Durability, and Disaster RecoveryReplication, Durability, and Disaster Recovery
Replication, Durability, and Disaster Recovery
Steven Francia
 
The Google File System (GFS)
The Google File System (GFS)The Google File System (GFS)
The Google File System (GFS)
Romain Jacotin
 
advanced Google file System
advanced Google file Systemadvanced Google file System
advanced Google file System
diptipan
 
Google File Systems
Google File SystemsGoogle File Systems
Google File Systems
Azeem Mumtaz
 
Google file system
Google file systemGoogle file system
Google file system
Dhan V Sagar
 
google file system
google file systemgoogle file system
google file system
diptipan
 
Google file system
Google file systemGoogle file system
Google file system
Ankit Thiranh
 
Google
GoogleGoogle
Google
rpaikrao
 
Google file system
Google file systemGoogle file system
Google file system
Roopesh Jhurani
 
Gfs介绍
Gfs介绍Gfs介绍
Gfs介绍
yiditushe
 
The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems
Ioanna Tsalouchidou
 
Google File System
Google File SystemGoogle File System
Google File System
DreamJobs1
 
Google file system GFS
Google file system GFSGoogle file system GFS
Google file system GFS
zihad164
 
Google File System
Google File SystemGoogle File System
Google File System
nadikari123
 
Unit 2.pptx
Unit 2.pptxUnit 2.pptx
Unit 2.pptx
PriyankaAher11
 
Cluster based storage - Nasd and Google file system - advanced operating syst...
Cluster based storage - Nasd and Google file system - advanced operating syst...Cluster based storage - Nasd and Google file system - advanced operating syst...
Cluster based storage - Nasd and Google file system - advanced operating syst...
Antonio Cesarano
 
Google File System
Google File SystemGoogle File System
Google File System
guest2cb4689
 
Replication, Durability, and Disaster Recovery
Replication, Durability, and Disaster RecoveryReplication, Durability, and Disaster Recovery
Replication, Durability, and Disaster Recovery
Steven Francia
 
The Google File System (GFS)
The Google File System (GFS)The Google File System (GFS)
The Google File System (GFS)
Romain Jacotin
 
advanced Google file System
advanced Google file Systemadvanced Google file System
advanced Google file System
diptipan
 
Google File Systems
Google File SystemsGoogle File Systems
Google File Systems
Azeem Mumtaz
 
Google file system
Google file systemGoogle file system
Google file system
Dhan V Sagar
 
google file system
google file systemgoogle file system
google file system
diptipan
 
The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems
Ioanna Tsalouchidou
 
Google File System
Google File SystemGoogle File System
Google File System
DreamJobs1
 
Google file system GFS
Google file system GFSGoogle file system GFS
Google file system GFS
zihad164
 
Google File System
Google File SystemGoogle File System
Google File System
nadikari123
 

Similar to Gfs google-file-system-13331 (20)

GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File System
tutchiio
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
RahulBhole12
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance
BIOVIA
 
Chaptor 2- Big Data Processing in big data technologies
Chaptor 2- Big Data Processing in big data technologiesChaptor 2- Big Data Processing in big data technologies
Chaptor 2- Big Data Processing in big data technologies
GulbakshiDharmale
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management
BIOVIA
 
Ch8 main memory
Ch8   main memoryCh8   main memory
Ch8 main memory
Welly Dian Astika
 
Gfs final
Gfs finalGfs final
Gfs final
AmitSaha123
 
Toronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELKToronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELK
Andrew Trossman
 
The Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsThe Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systems
Romain Jacotin
 
Lecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptxLecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptx
Amanuelmergia
 
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAs
Ahmed Misbah
 
Operating systems- Main Memory Management
Operating systems- Main Memory ManagementOperating systems- Main Memory Management
Operating systems- Main Memory Management
Dr. Chandrakant Divate
 
08 operating system support
08 operating system support08 operating system support
08 operating system support
Sher Shah Merkhel
 
Memory Management.pdf
Memory Management.pdfMemory Management.pdf
Memory Management.pdf
SujanTimalsina5
 
Tuning Linux for MongoDB
Tuning Linux for MongoDBTuning Linux for MongoDB
Tuning Linux for MongoDB
Tim Vaillancourt
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systems
Viet-Trung TRAN
 
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
Microsoft Technet France
 
Hadoop
HadoopHadoop
Hadoop
Girish Khanzode
 
08 operating system support
08 operating system support08 operating system support
08 operating system support
Anwal Mirza
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File System
tutchiio
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
RahulBhole12
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance
BIOVIA
 
Chaptor 2- Big Data Processing in big data technologies
Chaptor 2- Big Data Processing in big data technologiesChaptor 2- Big Data Processing in big data technologies
Chaptor 2- Big Data Processing in big data technologies
GulbakshiDharmale
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management
BIOVIA
 
Toronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELKToronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELK
Andrew Trossman
 
The Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsThe Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systems
Romain Jacotin
 
Lecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptxLecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptx
Amanuelmergia
 
Operating systems- Main Memory Management
Operating systems- Main Memory ManagementOperating systems- Main Memory Management
Operating systems- Main Memory Management
Dr. Chandrakant Divate
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systems
Viet-Trung TRAN
 
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
Exchange Server 2013 : les mécanismes de haute disponibilité et la redondance...
Microsoft Technet France
 
08 operating system support
08 operating system support08 operating system support
08 operating system support
Anwal Mirza
 
Ad

Recently uploaded (20)

DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Ad

Gfs google-file-system-13331

  • 1. The Google File System Tut Chi Io(Modified by Fengchang)
  • 2. WHAT IS GFS • Google FILE SYSTEM(GFS) • scalable distributed file system (DFS) • falt tolerence • Reliability • Scalability • availability and performance to large networks and connected nodes.
  • 3. WHAT IS GFS • built from low-cost COMMODITY HARDWARE components • optimized to accomodate Google's different data use and storage needs, • capitalized on the strength of off-the- shelf servers while minimizing hardware weaknesses
  • 4. Design Overview – Assumption • Inexpensive commodity hardware • Large files: Multi-GB • Workloads – Large streaming reads – Small random reads – Large, sequential appends • Concurrent append to the same file • High Throughput > Low Latency
  • 5. Design Overview – Interface • Create • Delete • Open • Close • Read • Write • Snapshot • Record Append
  • 6. What does it look like
  • 7. Design Overview – Architecture • Single master, multiple chunk servers, multiple clients – User-level process running on commodity Linux machine – GFS client code linked into each client application to communicate • File -> 64MB chunks -> Linux files – on local disks of chunk servers – replicated on multiple chunk servers (3r) • Cache metadata but not chunk on clients
  • 8. Design Overview – Single Master • Why centralization? Simplicity! • Global knowledge is needed for – Chunk placement – Replication decisions
  • 9. Design Overview – Chunk Size • 64MB – Much Larger than ordinary, why? – Advantages • Reduce client-master interaction • Reduce network overhead • Reduce the size of the metadata – Disadvantages • Internal fragmentation – Solution: lazy space allocation • Hot Spots – many clients accessing a 1-chunk file, e.g. executables – Solution: – Higher replication factor – Stagger application start times – Client-to-client communication
  • 10. Design Overview – Metadata • File & chunk namespaces – In master’s memory – In master’s and chunk servers’ storage • File-chunk mapping – In master’s memory – In master’s and chunk servers’ storage • Location of chunk replicas – In master’s memory – Ask chunk servers when • Master starts • Chunk server joins the cluster – If persistent, master and chunk servers must be in sync
  • 11. Design Overview – Metadata – In-memory DS • Why in-memory data structure for the master? – Fast! For GC and LB • Will it pose a limit on the number of chunks -> total capacity? – No, a 64MB chunk needs less than 64B metadata (640TB needs less than 640MB) • Most chunks are full • Prefix compression on file names
  • 12. Design Overview – Metadata – Log • The only persistent record of metadata • Defines the order of concurrent operations • Critical – Replicated on multiple remote machines – Respond to client only when log locally and remotely • Fast recovery by using checkpoints – Use a compact B-tree like form directly mapping into memory – Switch to a new log, Create new checkpoints in a separate threads
  • 13. Design Overview – Consistency Model • Consistent – All clients will see the same data, regardless of which replicas they read from • Defined – Consistent, and clients will see what the mutation writes in its entirety
Design Overview – Consistency Model
• After a sequence of successful mutations, a region is guaranteed to be defined
– Mutations are applied in the same order on all replicas
– Chunk version numbers detect stale replicas
• Can clients cache stale chunk locations?
– Limited by the cache entry’s timeout
– Most files are append-only
• A stale replica returns a premature end of chunk
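The version-number check can be sketched as follows (hypothetical function, assuming the master tracks one current version per chunk): a replica whose version lags the master's missed a mutation while its server was down, so it is treated as stale and skipped.

```python
def live_replicas(master_version: int, replica_versions: dict) -> list:
    """Return the chunk servers whose replica version matches the
    master's; replicas with an older version are stale and ignored."""
    return [server for server, v in replica_versions.items()
            if v == master_version]
```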
System Interactions – Lease
• Minimizes management overhead
• Granted by the master to one of the replicas, which becomes the primary
• The primary picks a serial order for mutations and all replicas follow it
• 60-second timeout, can be extended
• Can be revoked
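A sketch of the lease lifetime rules (illustrative; the real master piggybacks extensions on HeartBeat messages):

```python
LEASE_TIMEOUT = 60.0  # seconds

class Lease:
    """Toy primary lease: valid until `expires`; the primary may
    request extensions while mutations are in flight."""
    def __init__(self, granted_at: float):
        self.expires = granted_at + LEASE_TIMEOUT

    def valid(self, now: float) -> bool:
        return now < self.expires

    def extend(self, now: float) -> None:
        if self.valid(now):            # only a live lease can be extended
            self.expires = now + LEASE_TIMEOUT
```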
System Interactions – Mutation Order
• Client asks the master which chunk server holds the current lease
• Master replies with the identity of the primary and the locations of the other replicas (cached by the client)
• Client pushes data to all replicas (3a, 3b, 3c)
• Client sends the write request to the primary
• Primary assigns serial numbers to mutations, applies them, and forwards the write request to the secondaries
• Secondaries apply mutations in the same serial order and reply “operation completed”
• Primary replies to the client: operation completed, or an error report
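The core of the diagram is that the primary imposes a single serial order. A minimal sketch (toy classes, not the chunk server protocol):

```python
class Primary:
    """The primary assigns consecutive serial numbers, applies each
    mutation, and forwards it so every secondary applies the same
    mutations in the same order."""
    def __init__(self, secondaries):
        self.serial = 0
        self.secondaries = secondaries  # each modeled as a list of records
        self.applied = []

    def mutate(self, data):
        self.serial += 1
        self.applied.append((self.serial, data))
        for s in self.secondaries:      # forward in serial-number order
            s.append((self.serial, data))
        return self.serial
```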
System Interactions – Data Flow
• Decouple data flow and control flow
• Control flow
– Master -> primary -> secondaries
• Data flow
– Carefully picked chain of chunk servers
• Forward to the closest server first
• Distances estimated from IP addresses
– Linear chain (not a tree), so each machine’s full outbound bandwidth is used (not divided among multiple recipients)
– Pipelining, to exploit full-duplex links
• Time to transfer B bytes to R replicas = B/T + RL
• T: network throughput, L: latency
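Plugging numbers into the slide's formula (values here mirror the paper's example: 1 MB over a 100 Mbps link with 1 ms per-hop latency and 3 replicas):

```python
def transfer_time(B: float, T: float, R: int, L: float) -> float:
    """Ideal pipelined transfer time B/T + R*L: with pipelining, each
    extra replica costs only one hop's latency, not a full retransmit."""
    return B / T + R * L

# 100 Mbps = 12.5 MB/s throughput; 1 MB payload; 3 replicas; 1 ms hops
t = transfer_time(1e6, 12.5e6, 3, 1e-3)   # about 83 ms
```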
System Interactions – Atomic Record Append
• Concurrent appends are serializable
– The client specifies only the data, not the offset
– GFS appends it at least once atomically
– The chosen offset is returned to the client
– Heavily used by Google for files that serve as
• Multiple-producer/single-consumer queues
• Merged results from many different clients
– On failure, the client retries the operation
– Appended data is defined; intervening regions are inconsistent
• A reader can identify and discard extra padding and record fragments using checksums
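A sketch of the at-least-once behavior (toy model, not the real protocol): when an attempt fails partway, replicas that stored it keep a duplicate and the others keep filler at that offset, so the retried record lands at the same offset everywhere and that copy is defined.

```python
class ChunkReplica:
    def __init__(self):
        self.records = []  # each slot models a region of the chunk

def record_append(replicas, record, fail_first=False):
    """At-least-once append: a failed attempt leaves a duplicate on
    some replicas and padding on the rest; the retry succeeds at a
    common offset, which is returned to the client."""
    if fail_first:
        replicas[0].records.append(record)   # this replica got the failed attempt
        for r in replicas[1:]:
            r.records.append("<pad>")        # filler at the same offset
    offset = len(replicas[0].records)        # identical on every replica now
    for r in replicas:
        r.records.append(record)             # the retried append succeeds
    return offset
```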
System Interactions – Snapshot
• Makes a copy of a file or a directory tree almost instantaneously
• Uses copy-on-write
• Steps
– Revoke outstanding leases
– Log the operation to disk
– Duplicate the metadata, pointing to the same chunks
• On the first write, create a real duplicate locally
– Disks are 3 times as fast as 100 Mb Ethernet links
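The copy-on-write step can be sketched with reference counts (toy model; a real chunk handle is not a string): snapshot only bumps counts, and a chunk is actually copied the first time a write hits a shared chunk.

```python
class ChunkTable:
    """Snapshot increments refcounts (data is shared); a write to a
    chunk with refcount > 1 triggers the actual copy."""
    def __init__(self):
        self.refcount = {}

    def create(self, handle):
        self.refcount[handle] = 1

    def snapshot(self, handles):
        for h in handles:
            self.refcount[h] += 1      # metadata duplicated, chunks shared

    def write(self, handle):
        if self.refcount[handle] > 1:  # shared: copy before writing
            self.refcount[handle] -= 1
            copy = handle + "-copy"
            self.refcount[copy] = 1
            return copy                # the write goes to the new copy
        return handle
```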
Master Operation – Namespace Management
• No per-directory data structure
• No support for aliases (hard or symbolic links)
• Locks over regions of the namespace ensure serialization
• Lookup table mapping full pathnames to metadata
– Prefix compression keeps it in memory
Master Operation – Namespace Locking
• Each node (file/directory) has a read-write lock
• Scenario: prevent /home/user/foo from being created while /home/user is being snapshotted to /save/user
– Snapshot acquires
• Read locks on /home, /save
• Write locks on /home/user, /save/user
– Create acquires
• Read locks on /home, /home/user
• Write lock on /home/user/foo
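The lock-set rule for a single path can be sketched as: read locks on every ancestor directory, plus a write lock on the final path for mutating operations (hypothetical helper; the real master also handles the snapshot target path and lock ordering to avoid deadlock).

```python
def locks_for(op: str, path: str):
    """Locks the master acquires for one path: read locks on all
    ancestors, and a read or write lock on the path itself."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    mode = "write" if op in ("create", "delete", "snapshot") else "read"
    return [("read", a) for a in ancestors] + [(mode, path)]
```

In the slide's scenario, snapshot holds a write lock on /home/user while create needs a read lock on it, so the two operations are serialized.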
Master Operation – Policies
• Chunk creation policy
– Place new replicas on chunk servers with below-average disk utilization
– Limit the number of “recent” creations on each chunk server
– Spread replicas of a chunk across racks
• Re-replication priority
– Chunks furthest from their replication goal first
– Chunks that are blocking clients first
– Chunks of live files first (rather than deleted files)
• Rebalance replicas periodically
Master Operation – Garbage Collection
• Lazy reclamation
– Deletion is logged immediately
– The file is renamed to a hidden name
• Removed 3 days later
• Undelete by renaming back
• Regular scan for orphaned chunks
– Not garbage:
• All chunks referenced by a file-chunk mapping
• All chunk replicas present as Linux files under a designated directory on each chunk server
– Erase metadata
– HeartBeat messages tell chunk servers which chunks to delete
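A minimal sketch of the hidden-rename scheme (toy namespace as a dict; name format and 3-day constant follow the slide, the rest is illustrative):

```python
HIDDEN_TTL_DAYS = 3

def delete(namespace, path, now):
    """Lazy deletion: rename to a hidden name carrying a timestamp;
    renaming back before the scan reclaims it would undelete."""
    hidden = f".deleted-{path}-{now}"
    namespace[hidden] = namespace.pop(path)
    return hidden

def scan(namespace, now):
    """Regular namespace scan: erase metadata of hidden files older
    than the grace period, leaving their chunks orphaned for GC."""
    for name in list(namespace):
        if name.startswith(".deleted-"):
            stamp = int(name.rsplit("-", 1)[1])
            if now - stamp >= HIDDEN_TTL_DAYS:
                del namespace[name]
```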
Master Operation – Garbage Collection
• Advantages
– Simple & reliable
• Chunk creation may fail, leaving orphaned replicas
• Deletion messages may be lost
– Uniform and dependable way to clean up useless replicas
– Done in batches, so the cost is amortized
– Done when the master is relatively free
– Safety net against accidental, irreversible deletion
Master Operation – Garbage Collection
• Disadvantage
– Hard to fine-tune when storage is tight
• Solutions
– Deleting a file twice explicitly expedites storage reclamation
– Different policies for different parts of the namespace
• Stale replica detection
– The master maintains a version number for each chunk
Fault Tolerance – High Availability
• Fast recovery
– Restore state and start in seconds
– Do not distinguish normal and abnormal termination
• Chunk replication
– Different replication levels for different parts of the file namespace
– Keep each chunk fully replicated as chunk servers go offline or corrupted replicas are detected through checksum verification
Fault Tolerance – High Availability
• Master replication
– Log & checkpoints are replicated
– Master failures?
• Monitoring infrastructure outside GFS starts a new master process
– “Shadow” masters
• Read-only access to the file system even when the primary master is down
• Enhance read availability
• Each reads a replica of the growing operation log
Fault Tolerance – Data Integrity
• Use checksums to detect data corruption
• A chunk (64MB) is broken into 64KB blocks, each with a 32-bit checksum
• The chunk server verifies the checksum before returning data, so errors do not propagate
• Record append
– Incrementally update the checksum for the last partial block; corruption there is detected on the next read
• Random write
– Read and verify the first and last blocks being overwritten
– Perform the write, then compute new checksums
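The block-checksum layout can be sketched as follows (CRC32 is used here as a stand-in; the paper does not specify the exact 32-bit checksum function):

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity inside a 64 MB chunk

def checksums(data: bytes) -> list:
    """One 32-bit checksum per 64 KB block of chunk data."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data: bytes, sums: list) -> bool:
    """The check a chunk server performs before returning data,
    so corruption is caught instead of propagated to readers."""
    return checksums(data) == sums
```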
Conclusion
• GFS supports large-scale data processing using commodity hardware
• Reexamines traditional file system assumptions
– Based on application workloads and the technological environment
– Treats component failures as the norm rather than the exception
– Optimizes for huge files that are mostly appended to
– Relaxes the standard file system interface
Conclusion
• Fault tolerance
– Constant monitoring
– Replicating crucial data
– Fast and automatic recovery
– Checksumming to detect data corruption at the disk or IDE subsystem level
• High aggregate throughput
– Decouple control and data transfer
– Minimize master operations via the large chunk size and chunk leases
Reference
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” SOSP 2003