Chapter 2
The Google File System (GFS) is a distributed file system developed by Google
to store and process the massive datasets generated by its services and
applications. It is designed to provide high availability, reliability, and
scalability across a large number of commodity hardware components.
Characteristics of Google File System (GFS):
2. Fault Tolerance: GFS is built with fault tolerance in mind. It replicates data
across multiple servers to ensure data durability and availability even in the
event of hardware failures or network issues.
4. Consistency Model: GFS provides a relaxed consistency model, favoring
availability and performance over strong consistency. Updates to the file
system may not be immediately visible to all clients, but all replicas
eventually converge to a consistent state.
4. Fault Tolerance: Ensuring fault tolerance is a key goal of GFS. It employs
techniques such as data replication across multiple nodes and automatic
recovery mechanisms to maintain data availability and integrity even in the
presence of hardware failures or other system faults.
4. Limited Support for Small Files: GFS is optimized for handling large files
and is less efficient when dealing with a large number of small files. This
can lead to inefficiencies in storage utilization and metadata management
for applications that primarily deal with small files.
6. Limited Security Features: GFS lacks advanced security features such as
encryption at rest or access control mechanisms beyond basic file
permissions. This can be a concern for deployments where data security
and access control are critical requirements.
3. Large Files: Designed for handling large files (hundreds of MBs to GBs).
6. Concurrent Appends: Expects many clients appending to the same file
concurrently (for example, producer-consumer queues), with files typically
read sequentially once written.
💡 Draw the architecture of GFS. Explain the process of data reading and
data writing in GFS. (5+5)
💡 What is the main role of GFS Master during read and write processes?
(4)
Architecture
1. Master:
Rebalancing and Load Distribution: It may initiate data rebalancing
across Chunk Servers to maintain balanced storage utilization.
2. Chunk: Files are divided into fixed-size chunks (64 MB by default), each
identified by a globally unique 64-bit chunk handle assigned by the master
at creation time.
3. Chunk Server:
Data Storage: Chunk Servers store the actual data content of chunks on
local disks.
Data Serving: Chunk Servers respond to read and write requests from
clients, serving data directly to clients.
4. Client:
File Access: Clients interact with the file system to read from and write
to files by accessing the appropriate chunks.
Read Process:
Client Request: When a client application wants to read data from the file
system, it contacts the GFS master server.
Metadata Lookup: The GFS master checks its metadata to determine the
locations of the required data chunks and their replicas.
Data Retrieval: The GFS master provides the client with the locations of the
appropriate chunk servers containing the requested data.
Data Transfer: The client then directly contacts the chunk servers to
retrieve the required data chunks. It can fetch data from multiple replicas
for faster access and fault tolerance.
Data Aggregation: If the requested data spans multiple chunks, the client
aggregates the data chunks to reconstruct the complete file.
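The read path above can be sketched in a few lines of Python. This is an illustrative model, not a real API: the `Master`, `ChunkServer`, and `read_file` names are hypothetical, while the fixed 64 MB chunk size and the "metadata from the master, data directly from chunk servers" split follow the description above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed-size 64 MB chunks

class Master:
    """Holds only metadata: file -> chunk handles, handle -> replica locations."""
    def __init__(self, file_chunks, chunk_locations):
        self.file_chunks = file_chunks          # e.g. {"/logs/a": ["h1", "h2"]}
        self.chunk_locations = chunk_locations  # e.g. {"h1": ["cs1", "cs2"]}

    def lookup(self, path, offset):
        """Map (file, byte offset) to (chunk handle, replica locations)."""
        handle = self.file_chunks[path][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

class ChunkServer:
    def __init__(self, chunks):
        self.chunks = chunks                    # handle -> bytes on local disk

    def read(self, handle):
        return self.chunks[handle]

def read_file(master, servers, path, num_chunks):
    """Client: ask the master for locations, fetch each chunk directly
    from a replica, then aggregate the chunks into the full file."""
    data = b""
    for i in range(num_chunks):
        handle, locations = master.lookup(path, i * CHUNK_SIZE)
        replica = servers[locations[0]]  # a real client picks the closest replica
        data += replica.read(handle)     # file data never flows through the master
    return data
```

Note that the master only answers the metadata lookup; the bulk data transfer happens directly between client and chunk servers, which keeps the master off the data path.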
💡 How data and control messages flow in GFS architecture. Explain with
diagram . (5)
1. Client contacts the master:
When a client wants to read or write data in GFS, it contacts the master
server to determine which chunk server holds the current lease for the
data chunk it needs and the locations of the other replicas.
If no chunk server currently holds a lease for the requested chunk, the
master grants one to a replica it selects.
2. Master responds with replica information:
The master replies to the client with the identity of the primary replica
(which currently holds the lease for the chunk) and the locations of
other secondary replicas.
The client caches this information for future use, reducing the need to
contact the master for subsequent operations unless there's a change
in lease status.
3. Client pushes data to the replicas:
The client pushes the data to all replicas. This can be done in any
order. Each chunk server stores the received data in its internal LRU
(Least Recently Used) buffer cache until it is either used or evicted
due to cache management policies.
4. Client sends the write request to the primary:
Once all replicas acknowledge receiving the data, the client sends a
write request to the primary replica. The primary assigns consecutive
serial numbers to all mutations it receives (possibly from multiple
clients), then applies each mutation to its own local state in
serial-number order.
5. Primary forwards the request:
After applying the mutation locally, the primary forwards the write
request to all secondary replicas. Each secondary replica applies the
mutations in the same serial-number order assigned by the primary.
6. Secondaries acknowledge:
Once the secondaries have applied the mutations, they reply to the
primary, indicating that they have completed the operation.
7. Primary replies to the client:
Any errors encountered during the operation at any of the replicas are
reported back to the client.
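The write steps above can be condensed into a small Python sketch. The class and method names (`Replica`, `Primary`, `client_write`) are hypothetical; what the sketch preserves is the real control flow: data is pushed to all replicas first, then the primary assigns serial numbers and forwards the write so every replica applies mutations in the same order.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}   # staging cache: write id -> pushed (but unapplied) data
        self.state = []    # applied mutations, in serial-number order

    def push_data(self, write_id, data):
        self.buffer[write_id] = data   # data is staged, not yet applied
        return "ack"

    def apply(self, serial, write_id):
        self.state.append((serial, self.buffer.pop(write_id)))

class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, write_id):
        serial = self.next_serial     # the primary picks the serial number,
        self.next_serial += 1         # defining one mutation order for all replicas
        self.apply(serial, write_id)  # apply locally first ...
        for s in self.secondaries:    # ... then forward to every secondary
            s.apply(serial, write_id)
        return serial

def client_write(primary, all_replicas, write_id, data):
    # Step 1: push data to every replica (any order); wait for all acks.
    assert all(r.push_data(write_id, data) == "ack" for r in all_replicas)
    # Step 2: send the write request to the primary, which orders and forwards it.
    return primary.write(write_id)
```

Decoupling the data flow (push to all replicas) from the control flow (ordering via the primary) is what lets GFS pipeline large data transfers while still keeping one consistent mutation order.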
In the Google File System (GFS), having a single master node might seem like a
potential bottleneck because it handles metadata operations such as
namespace management, file-to-chunk mapping, and maintaining the overall
system state. However, GFS is designed in a way that mitigates the impact of
this single point of failure:
2. Chunk Servers Handle Data: The actual data storage and retrieval
operations are handled by chunk servers, which operate independently of
the master. This distribution of data management tasks means that the
master is not overwhelmed with data-handling operations.
3. Compact Metadata: The master's metadata footprint grows with the
number of files and directories rather than the volume of data stored,
making it less prone to becoming a bottleneck as the system grows.
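A back-of-envelope calculation shows why the master's metadata stays small relative to the data it indexes. The figure of under 64 bytes of master metadata per 64 MB chunk comes from the GFS design; the helper below is purely illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # bytes of file data per chunk
META_PER_CHUNK = 64             # bytes of master metadata per chunk (upper bound)

def master_metadata_bytes(total_data_bytes):
    """Estimate master metadata needed to index a given volume of file data."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * META_PER_CHUNK

petabyte = 10 ** 15
# Roughly 1 GB of master metadata suffices to index 1 PB of file data,
# which comfortably fits in a single machine's memory.
print(master_metadata_bytes(petabyte) / 10 ** 9)
```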
💡 How does GFS differ from other file systems? List five distinct
differences. (5)
Google File System (GFS) differs from traditional file systems in several key
aspects:
4. Access Patterns: GFS is optimized for large, sequential read and write
operations commonly encountered in data-intensive applications like web
indexing and log processing. Traditional file systems may be more suited to
a wider range of access patterns, including random access and small,
frequent updates.
7. Cost: While GFS leverages commodity hardware to achieve scalability and
cost-effectiveness, traditional file systems may require more specialized
hardware and software licenses, potentially increasing overall costs.
4. Efficient Chunk Storage and Retrieval: The primary function of GFS chunk
servers is to store and serve data chunks. By having numerous chunk
servers distributed across the network, GFS can efficiently store large
volumes of data and handle concurrent read and write operations from
multiple clients.
💡 Why does GFS use large, fixed-size chunks? What are the demerits of
that design? (10)
Disadvantages:
1. Wasted Space: Fixed-sized chunks may lead to wasted space if the data
being stored is smaller than the chunk size. This can result in inefficient
storage utilization, especially for small files.
2. Increased Latency for Small Files: Accessing small files stored in large
chunks may introduce additional latency, as the entire chunk needs to be
read or written even if only a small portion of it is required.
3. Management Complexity: Some workloads involve extremely large files. In
such cases, managing and processing these large chunks efficiently can
become complex.
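The wasted-space demerit is easy to quantify. The arithmetic below is illustrative (the function name is made up); note that real GFS mitigates on-disk waste with lazy space allocation, so the cost of small files shows up mainly as hot-spotting and per-chunk metadata overhead rather than literal disk waste.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB fixed chunk size

def wasted_fraction(file_size):
    """Fraction of the file's last chunk left unused (internal fragmentation)."""
    used_in_last = file_size % CHUNK_SIZE or CHUNK_SIZE
    return (CHUNK_SIZE - used_in_last) / CHUNK_SIZE

# A 1 MB file leaves ~98.4% of its single 64 MB chunk unused:
print(round(wasted_fraction(1 * 1024 * 1024), 3))   # 0.984
```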
1. Operation Log: The operation log is a persistent record of critical
metadata changes. It is replicated on multiple remote machines, and GFS
responds to a client operation only after the corresponding log record
has been flushed both locally and remotely.
2. In-Memory Data Structures: The master keeps file and chunk metadata in
memory, which makes master operations fast and allows efficient periodic
scans of the chunk state.
3. Consistency Model: GFS offers a relaxed consistency model. This means
that GFS doesn't necessarily guarantee that all clients see the same
version of the file system at any given time.
System Interaction
GFS is designed with a master-slave architecture where the master node
oversees the overall operation of the file system while chunk servers handle
the actual storage of data. Minimizing the master's involvement means that
routine operations should be handled locally by chunk servers or clients
without constantly requiring the master's intervention. This helps in distributing
the workload and reducing the potential bottleneck at the master node.
Client, Master, and Chunkservers Interaction: In GFS, clients are responsible
for interacting with the file system on behalf of users, the master node
manages metadata (such as namespace, file-to-chunk mapping), and chunk
servers store the actual data. These components interact to implement various
operations efficiently:
Leases and Mutation Order: GFS uses leases to manage data consistency
and ensure that only one client can write to a particular chunk at a time.
This reduces the need for constant communication with the master node
for every write operation, as the client holding the lease can perform
multiple writes to the chunk without involving the master each time.
Mutation order ensures that changes to the data are applied in a consistent
and sequential manner across replicas.
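The master-side lease bookkeeping described above can be sketched as follows. This is a hypothetical structure, not GFS code; the 60-second default lease duration matches the GFS design, and the invariant it enforces is the one from the text: at most one unexpired primary per chunk.

```python
import time

LEASE_DURATION = 60.0   # seconds; the GFS default lease timeout

class LeaseManager:
    """Master-side bookkeeping: at most one primary (lease holder) per chunk."""
    def __init__(self):
        self.leases = {}    # chunk handle -> (primary replica, expiry time)

    def grant(self, handle, replica, now=None):
        """Return the current primary for a chunk, granting a new lease
        only if no unexpired lease exists."""
        now = time.monotonic() if now is None else now
        holder = self.leases.get(handle)
        if holder and holder[1] > now:
            return holder[0]                 # an unexpired lease already exists
        self.leases[handle] = (replica, now + LEASE_DURATION)
        return replica
```

Because a lease covers many writes until it expires, clients mostly talk to the primary directly, which is exactly how GFS keeps the master out of the per-write critical path.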
💡 How does GFS provide fault tolerance? How does it tolerate chunk
server failures? (10)
Availability
Google File System (GFS) manages availability through several key
mechanisms:
3. Master Server: GFS includes a master server that manages metadata about
file locations, chunk locations, and replication. The master server keeps
track of which chunks are stored on which chunk servers and ensures that
each chunk is replicated as required. By centralizing metadata
management, GFS simplifies data access and ensures consistency across
the system.
4. Rebalancing: The master periodically redistributes replicas across chunk
servers to ensure even distribution and optimal performance. This dynamic
rebalancing helps maintain availability and performance in the face of
changing conditions.
Fault Tolerance
Google File System (GFS) provides fault tolerance through several
mechanisms:
2. Stale Replica Deletion: GFS identifies stale replicas, which are replicas that
have missed mutations while a chunkserver was down. The master
maintains a chunk version number to differentiate between up-to-date
replicas and stale replicas. Stale replicas are identified and removed during
regular garbage collection by the master. This ensures that only up-to-date
data is stored in the system.
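Stale-replica detection via chunk version numbers can be sketched in one function. The function and data shapes below are illustrative; the mechanism they model is the one described above: the master bumps a chunk's version on each new lease grant, so a replica that was down during a mutation reports an older version and is flagged for garbage collection.

```python
def find_stale_replicas(master_versions, replica_reports):
    """
    master_versions: chunk handle -> current version number kept by the master
    replica_reports: (chunkserver, handle, version) tuples from heartbeats
                     or chunkserver restarts
    Returns the (chunkserver, handle) pairs whose version lags the master's,
    i.e. replicas that missed mutations while their server was down.
    """
    return [(cs, h) for cs, h, v in replica_reports if v < master_versions[h]]
```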
3. Shadow Master: GFS uses shadow masters for reliability. A shadow master
is a shadow, not a mirror: its state may lag slightly behind the primary
master. It provides read-only access to the file system even when the
primary master is down, while mutations and background activities remain
the responsibility of the primary. This enhances fault tolerance by
providing redundancy at the master level.
4. Data Integrity via Checksums: Each chunkserver keeps checksums for the
data it stores and can independently verify the integrity of its copies by
comparing stored checksums against the data read from disk. This helps
detect and mitigate data corruption, enhancing fault tolerance by
ensuring data integrity.
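Chunkserver-side integrity checking can be sketched as below. The 64 KB block granularity and 32-bit checksums match the GFS design; the use of CRC32 and the function names here are illustrative choices.

```python
import zlib

BLOCK = 64 * 1024   # 64 KB checksum granularity, as in GFS

def checksums(chunk_data):
    """Compute one 32-bit CRC per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verify(chunk_data, stored_checksums):
    """A chunkserver recomputes checksums before serving data;
    any mismatch indicates corruption of that block."""
    return checksums(chunk_data) == stored_checksums
```

Verifying per 64 KB block, rather than per 64 MB chunk, means a read only pays to checksum the blocks it actually touches, and corruption is localized to a small region that can be re-fetched from another replica.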
1. Replication
4. Fast Recovery: GFS emphasizes fast recovery from chunk server failures.
When a chunk server fails, the master quickly detects the failure through
the heartbeat mechanism and initiates recovery procedures. These
procedures typically involve replicating data stored on the failed chunk
server to other healthy chunk servers to ensure continued availability and
fault tolerance.
2. Regular Scanning: GFS periodically scans the file namespace to identify
and remove hidden files that have exceeded the deletion threshold. These
hidden files are candidates for garbage collection. By regularly scanning
and cleaning up the file namespace, GFS ensures that storage space is
efficiently reclaimed without impacting system performance.
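The lazy garbage-collection scan can be modeled in a few lines. The data shapes here are hypothetical, but the behavior follows the design above: deleted files are first hidden with a deletion timestamp, and the periodic scan reclaims only those past the retention threshold (three days by default in GFS).

```python
RETENTION_SECONDS = 3 * 24 * 3600   # GFS's default 3-day grace period

def gc_scan(namespace, now):
    """
    namespace: path -> deletion timestamp (None means the file is live).
    Removes hidden files whose deletion time exceeded the threshold;
    returns the reclaimed paths. Until then, a hidden file can still be
    undeleted by renaming it back.
    """
    reclaimed = [p for p, t in namespace.items()
                 if t is not None and now - t > RETENTION_SECONDS]
    for p in reclaimed:
        del namespace[p]    # metadata erased; orphaned chunks are freed later
    return reclaimed
```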
1. Scalability: GFS is designed to scale horizontally to accommodate the
growing volume of data in large-scale distributed systems. It achieves
scalability by distributing data across multiple servers (chunk servers) and
allowing the system to add more servers as needed. This scalability
ensures that GFS can handle petabytes or even exabytes of data without
significant performance degradation.
4. Data Locality: GFS optimizes data access by leveraging data locality, which
ensures that computation tasks are performed on data stored in close
proximity to the processing nodes. By minimizing data transfer over the
network, data locality reduces latency and improves overall system
performance, particularly for large-scale data processing workflows.
5. Data Compression: Compressing stored data can reduce storage
requirements and network bandwidth usage, and improve overall system
efficiency, particularly for large-scale data sets with high redundancy or
repetitiveness.