The Google File System
Tut Chi Io
Design Overview – Assumptions
- Inexpensive commodity hardware
- Large files: multi-GB
- Workloads: large streaming reads, small random reads, large sequential appends, concurrent appends to the same file
- High sustained throughput is valued over low latency
Design Overview – Interface
- Familiar operations: create, delete, open, close, read, write
- GFS-specific operations: snapshot, record append
Design Overview – Architecture
- Single master, multiple chunk servers, multiple clients
- Each runs as a user-level process on a commodity Linux machine
- GFS client code is linked into each application to communicate with the master and the chunk servers
- A file is divided into 64 MB chunks, stored as Linux files on the chunk servers' local disks and replicated on multiple chunk servers (three replicas by default)
- Clients cache metadata but not chunk data
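A minimal sketch of the client-side offset-to-chunk translation implied by this layout; `CHUNK_SIZE` and the master lookup described in the comment are illustrative, not the real GFS API:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_index(offset: int) -> int:
    """Translate a byte offset within a file into a chunk index."""
    return offset // CHUNK_SIZE

# A client read sends (filename, chunk_index) to the master, caches the
# returned chunk handle and replica locations, then fetches the data
# directly from the closest chunk server.
```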
Design Overview – Single Master
- Why centralize? Simplicity
- Global knowledge is needed for chunk placement and replication decisions
Design Overview – Chunk Size
- 64 MB, much larger than typical file system block sizes. Why?
- Advantages: fewer client-master interactions, less network overhead, smaller metadata
- Disadvantage: internal fragmentation; mitigated by lazy space allocation
- Disadvantage: hot spots when many clients access a one-chunk file (e.g. an executable); mitigated by a higher replication factor, staggered application start times, or client-to-client communication
Design Overview – Metadata
- File and chunk namespaces: in the master's memory, persisted in the operation log on the master's disk and on remote machines
- File-to-chunk mapping: in the master's memory, persisted likewise
- Locations of chunk replicas: in the master's memory only; the master asks the chunk servers at startup and whenever a chunk server joins the cluster
- Why not persist replica locations? The master and chunk servers would then have to be kept in sync
Design Overview – Metadata – In-Memory Data Structures
- Why keep metadata in the master's memory? Speed: master operations are fast, and background scans for garbage collection (GC) and load balancing (LB) are cheap
- Does memory limit the number of chunks, and hence total capacity? In practice, no: a 64 MB chunk needs less than 64 bytes of metadata, so 640 TB of data needs less than 640 MB
- Most chunks are full (only the last chunk of a file is partially filled)
- File names are stored compactly using prefix compression
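The capacity arithmetic checks out; a quick sketch, taking the paper's under-64-bytes-per-chunk estimate at face value:

```python
CHUNK_SIZE = 64 * 1024**2    # 64 MB per chunk
META_PER_CHUNK = 64          # upper bound on metadata bytes per chunk

data_bytes = 640 * 1024**4            # 640 TB of file data
chunks = data_bytes // CHUNK_SIZE     # 10,485,760 chunks
print(chunks * META_PER_CHUNK / 1024**2, "MB")  # 640.0 MB of master memory
```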
Design Overview – Metadata – Operation Log
- The only persistent record of metadata; it also defines the order of concurrent operations
- Critical, so it is replicated on multiple remote machines; the master responds to a client only after flushing the log record both locally and remotely
- Fast recovery using checkpoints, stored in a compact B-tree-like form that maps directly into memory
- To checkpoint, the master switches to a new log file and creates the new checkpoint in a separate thread
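A minimal sketch of the "flush locally and remotely before acknowledging" rule; the replica transport and record format here are invented for illustration:

```python
import os

class OperationLog:
    """Sketch: append-only metadata log, flushed locally and to remote replicas."""

    def __init__(self, path, remote_replicas):
        self.file = open(path, "ab")
        self.remotes = remote_replicas  # hypothetical objects with .append(record)

    def commit(self, record: bytes) -> None:
        # 1. Write and flush the record to the local disk.
        self.file.write(record + b"\n")
        self.file.flush()
        os.fsync(self.file.fileno())
        # 2. Replicate to every remote machine.
        for replica in self.remotes:
            replica.append(record)
        # Only now may the master respond to the client.
```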
Design Overview – Consistency Model
- Consistent: all clients see the same data, regardless of which replica they read from
- Defined: consistent, and clients see what the mutation wrote in its entirety
Design Overview – Consistency Model
- After a sequence of successful mutations, the mutated region is guaranteed to be defined
- Achieved by applying mutations in the same order on all replicas and by using chunk version numbers to detect stale replicas
- Can clients read from cached, stale chunk locations? The window is limited by the cache entry's timeout
- Since most files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data
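A sketch of version-number-based staleness detection, with hypothetical master-side bookkeeping:

```python
class ChunkRecord:
    """Sketch: the master's view of one chunk's version and replicas."""

    def __init__(self):
        self.version = 0
        self.replica_versions = {}  # chunk server -> version it last reported

    def grant_lease(self):
        # The master bumps the version number each time it grants a new lease.
        self.version += 1

    def stale_replicas(self):
        # A replica that missed a mutation (e.g. its server was down)
        # reports an older version; such replicas are garbage collected.
        return [server for server, v in self.replica_versions.items()
                if v < self.version]
```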
System Interactions – Lease
- Minimizes management overhead at the master
- Granted by the master to one of the replicas, which becomes the primary
- The primary picks a serial order for mutations, and all replicas follow it
- 60-second timeout; can be extended, and can be revoked
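A minimal sketch of lease granting with a timeout; the class and method names are invented:

```python
import time

LEASE_TIMEOUT = 60.0  # seconds, as in the paper

class LeaseManager:
    """Sketch: master-side lease bookkeeping for a single chunk."""

    def __init__(self):
        self.primary = None
        self.expires = 0.0

    def grant(self, replica) -> None:
        if time.monotonic() < self.expires:
            raise RuntimeError("lease still held")
        self.primary = replica  # this replica becomes the primary
        self.expires = time.monotonic() + LEASE_TIMEOUT

    def extend(self) -> None:
        self.expires = time.monotonic() + LEASE_TIMEOUT

    def revoke(self) -> None:
        self.primary, self.expires = None, 0.0
```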
System Interactions – Mutation Order
1. The client asks the master which chunk server holds the current lease (the primary) and where the other replicas are
2. The master replies with the identity of the primary and the locations of the replicas; the client caches this
3. The client pushes the data to all replicas (steps 3a, 3b, 3c in the paper's figure)
4. The client sends a write request to the primary; the primary assigns serial numbers to the mutations and applies them locally
5. The primary forwards the write request to all secondaries, which apply the mutations in the same serial-number order
6. The secondaries reply to the primary that the operation completed
7. The primary replies to the client: operation completed, or an error report
System Interactions – Data Flow
- Data flow is decoupled from control flow
- Control flow: client -> primary -> secondaries
- Data flow: pushed along a carefully picked chain of chunk servers, each forwarding to the closest machine next (distances estimated from IP addresses)
- Linear chain (not a tree), so each machine's full outbound bandwidth is used for forwarding rather than divided among recipients
- Pipelined, to exploit full-duplex links: a chunk server forwards data as soon as it starts receiving it
- Time to transfer B bytes to R replicas ≈ B/T + RL, where T is the network throughput and L is the per-hop latency
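Plugging the paper's example numbers into the formula shows why the chain is cheap; a quick sketch:

```python
def transfer_time(size_bytes, replicas, throughput_bps, hop_latency_s):
    """Idealized pipelined transfer time for B bytes to R replicas: B/T + R*L."""
    return (size_bytes * 8) / throughput_bps + replicas * hop_latency_s

# Pushing 1 MB to 3 replicas over 100 Mbps links with ~1 ms per-hop
# latency takes well under 100 ms (the paper quotes about 80 ms).
print(transfer_time(10**6, 3, 100e6, 1e-3))  # 0.083 seconds
```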
System Interactions – Atomic Record Append
- Concurrent appends are serializable: the client specifies only the data, and GFS appends it at least once atomically, returning the resulting offset to the client
- Heavily used at Google for files that act as multiple-producer/single-consumer queues or that merge results from many different clients
- On failure, the client retries the operation; the successfully appended data is defined, while intervening regions may be inconsistent
- A reader can identify and discard padding and record fragments using checksums embedded in the records
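A sketch of the at-least-once retry loop on the client side; `gfs_client.record_append` is a hypothetical API stand-in:

```python
def append_record(gfs_client, path: str, data: bytes, max_retries: int = 5) -> int:
    """Sketch: retry record append until it succeeds on all replicas.

    Failed attempts may leave fragments on some replicas, so the record
    can appear more than once; readers must tolerate duplicates.
    """
    for _ in range(max_retries):
        try:
            # GFS guarantees the record exists at least once at this offset.
            return gfs_client.record_append(path, data)  # hypothetical call
        except IOError:
            continue  # retry; readers skip padding and fragments via checksums
    raise IOError("record append failed after retries")
```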
System Interactions – Snapshot
- Makes a copy of a file or a directory tree almost instantaneously, using copy-on-write
- Steps: revoke outstanding leases, log the operation to disk, then duplicate the metadata so the copy points to the same chunks
- On the first subsequent write to a shared chunk, each chunk server creates the real duplicate locally; local disks are about three times as fast as 100 Mb Ethernet links, so this avoids copying over the network
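A sketch of the copy-on-write bookkeeping at the metadata level; the exact structures and the `clone_chunk` helper are illustrative assumptions:

```python
import uuid

def clone_chunk(handle: str) -> str:
    """Hypothetical stand-in for the chunk servers' local chunk copy."""
    return f"{handle}-copy-{uuid.uuid4().hex[:8]}"

class SnapshotMetadata:
    """Sketch: snapshots duplicate metadata only; chunks are copied lazily."""

    def __init__(self):
        self.files = {}     # path -> list of chunk handles
        self.refcount = {}  # chunk handle -> reference count

    def snapshot(self, src: str, dst: str) -> None:
        # Duplicate metadata; both paths now point at the same chunks.
        self.files[dst] = list(self.files[src])
        for handle in self.files[dst]:
            self.refcount[handle] += 1

    def before_write(self, path: str, i: int) -> None:
        # A write to a shared chunk first forces a real, local copy.
        handle = self.files[path][i]
        if self.refcount[handle] > 1:
            new = clone_chunk(handle)
            self.refcount[handle] -= 1
            self.files[path][i] = new
            self.refcount[new] = 1

ns = SnapshotMetadata()
ns.files["/a"], ns.refcount["c1"] = ["c1"], 1
ns.snapshot("/a", "/b")
ns.before_write("/a", 0)  # /a gets a fresh chunk; /b still points at c1
```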
Master Operation – Namespace Management
- No per-directory data structures and no support for aliases (hard or symbolic links)
- A lookup table maps full pathnames to metadata, held in memory with prefix compression
- Read-write locks over regions of the namespace ensure proper serialization
Master Operation – Namespace Locking
- Each node in the namespace tree (a file or directory path) has a read-write lock
- Scenario: prevent /home/user/foo from being created while /home/user is being snapshotted to /save/user
- The snapshot acquires read locks on /home and /save, and write locks on /home/user and /save/user
- The create acquires read locks on /home and /home/user, and a write lock on /home/user/foo
- The two conflict on /home/user (write vs. read), so the operations are serialized; see the sketch below
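A small sketch of which locks each operation would take, assuming the read-locks-on-ancestors rule above; the function is illustrative:

```python
def locks_needed(path: str, write: bool):
    """Sketch: the lock set the master takes for an operation on `path`.

    Read locks go on every ancestor directory; the lock on the full
    path itself is read or write depending on the operation.
    """
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return ([(p, "read") for p in ancestors]
            + [(path, "write" if write else "read")])

print(locks_needed("/home/user", write=True))      # snapshot target
print(locks_needed("/home/user/foo", write=True))  # create
# Both touch /home/user, one write and one read -> serialized.
```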
Master Operation – Policies
- Chunk creation: place new replicas on chunk servers with below-average disk utilization; limit the number of "recent" creations on each chunk server; spread replicas of a chunk across racks
- Re-replication priority: chunks furthest from their replication goal first, then chunks blocking client progress, then chunks of live files (rather than deleted ones)
- Replicas are also rebalanced periodically
Master Operation – Garbage Collection
- Lazy reclamation: the deletion is logged immediately, but the file is only renamed to a hidden name; it is removed during a later scan, e.g. three days afterwards, and until then can be undeleted by renaming it back
- A regular scan also finds orphaned chunks. Not garbage: chunks referenced by the file-to-chunk mapping, and chunk replicas stored as Linux files under a designated directory on each chunk server
- The master erases the orphaned metadata and uses HeartBeat messages to tell chunk servers which of their chunks to delete
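A sketch of the hide-then-reclaim scheme; the hidden-name format and scan details are invented:

```python
import time

GRACE_PERIOD = 3 * 24 * 3600  # three days, as in the paper

class Namespace:
    """Sketch: lazy file deletion by renaming to a hidden name."""

    def __init__(self):
        self.files = {}  # path -> metadata

    def delete(self, path: str) -> None:
        # Log the deletion, then hide the file instead of reclaiming it.
        hidden = f"{path}.deleted.{int(time.time())}"
        self.files[hidden] = self.files.pop(path)

    def scan(self, now=None) -> None:
        # The regular namespace scan removes hidden files past the grace
        # period; their chunks become orphaned and are later reclaimed
        # via HeartBeat exchanges with the chunk servers.
        now = now or time.time()
        for name in list(self.files):
            if ".deleted." in name:
                hidden_at = int(name.rsplit(".", 1)[1])
                if now - hidden_at > GRACE_PERIOD:
                    del self.files[name]
```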
Master Operation – Garbage Collection
- Advantages: simple and reliable. Chunk creation may fail and deletion messages may be lost; garbage collection gives a uniform, dependable way to clean up replicas not known to be useful
- The work is done in batches, so the cost is amortized, and only when the master is relatively free
- The delay is a safety net against accidental, irreversible deletion
Master Operation – Garbage Collection
- Disadvantage: hard to fine-tune when storage is tight
- Solutions: explicitly deleting an already-deleted file expedites storage reclamation; different reclamation policies can be applied to different parts of the namespace
Stale Replica Detection
- The master maintains a chunk version number for each chunk, bumped whenever it grants a new lease; replicas with old versions are stale and are garbage collected
Fault Tolerance – High Availability
- Fast recovery: the master and chunk servers restore their state and start in seconds; normal and abnormal termination are not distinguished
- Chunk replication: different replication levels for different parts of the file namespace; the master keeps each chunk fully replicated as chunk servers go offline or corrupted replicas are detected through checksum verification
Fault Tolerance – High Availability
- Master replication: the operation log and checkpoints are replicated on multiple machines
- On master failure, monitoring infrastructure outside GFS starts a new master process
- "Shadow" masters provide read-only access to the file system even when the primary master is down, enhancing read availability; each shadow reads a replica of the growing operation log
Fault Tolerance – Data Integrity
- Checksums detect data corruption: each 64 MB chunk is broken into 64 KB blocks, each with a 32-bit checksum
- A chunk server verifies the checksum of a block before returning it, so corruption never propagates to other machines
- Record append: the checksum of the last partial block is updated incrementally; if that block was already corrupt, the error is detected when it is next read
- Random write: read and verify the first and last blocks of the range first, then perform the write and compute the new checksums
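A sketch of per-block checksumming; CRC32 is a stand-in, since the paper does not name the checksum function:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks, as in GFS

def checksums(chunk: bytes):
    """One 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verified_read(chunk: bytes, sums, block: int) -> bytes:
    """Sketch: a chunk server verifies a block before returning it."""
    data = chunk[block * BLOCK_SIZE:(block + 1) * BLOCK_SIZE]
    if zlib.crc32(data) != sums[block]:
        # Report the mismatch to the master; the client reads another replica.
        raise IOError("checksum mismatch in block %d" % block)
    return data
```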
Conclusion
- GFS supports large-scale data processing on commodity hardware
- It reexamines traditional file system assumptions in light of application workloads and the technological environment
- Component failures are treated as the norm rather than the exception
- Optimized for huge files that are mostly appended to; the standard file system interface is relaxed
Conclusion
- Fault tolerance: constant monitoring, replication of crucial data, and fast, automatic recovery; checksumming detects data corruption at the disk or IDE-subsystem level
- High aggregate throughput: control and data transfer are decoupled, and master involvement is minimized by the large chunk size and by chunk leases
Reference
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," SOSP 2003.