The predecessor of Colossus and archetype of HDFS.
- Key observations of the application workloads and technological environment:
- Component failures are the norm rather than the exception, which requires constant error detection, fault tolerance and automatic failure recovery.
- Files are huge (multi-GBs) by the standards at that time, which challenges traditional design assumptions.
- Most files are mutated by appending new data rather than overwriting existing data, which makes client-side caching of data blocks unappealing.
- Some flexibility can be gained by co-designing the applications and the file system API, such as the relaxed consistency model.
- The GFS interface consists of:
- Familiar file system APIs and structure, albeit not POSIX-compliant.
- The snapshot operation.
- The atomic record append operation.
- A GFS cluster consists of:
- A single master (with shadow replicas).
- Hundreds of chunkservers.
- Multiple concurrent clients.
- Files are divided into fixed-size chunks (64 MB) and each chunk is globally identified by a 64-bit chunk handle assigned by the master.
- Each chunk is replicated on multiple chunkservers as per the user-designated replication level (3 replicas by default).
- Each chunk is stored as a plain Linux file on a chunkserver and is extended only as needed to avoid internal fragmentation.
- Chunkservers independently verify data integrity of their own copies by persisting a 32-bit checksum for each 64 KB block.
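A minimal sketch of that per-block checksum scheme, assuming CRC32 as the 32-bit checksum (the choice of algorithm and the function names are illustrative assumptions, not GFS's actual on-disk format):

```python
import zlib

BLOCK_SIZE = 64 * 1024   # checksums cover 64 KB blocks within a chunk

def block_checksums(chunk_data: bytes) -> list:
    """One 32-bit checksum per 64 KB block (CRC32 used here as a stand-in)."""
    return [
        zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
        for off in range(0, len(chunk_data), BLOCK_SIZE)
    ]

def verify_before_read(chunk_data: bytes, checksums: list) -> bool:
    """A chunkserver verifies its own copy before handing data to a requester,
    so corruption on one replica is not propagated to other machines."""
    return block_checksums(chunk_data) == checksums

data = b"some record data" * 10_000       # ~160 KB, i.e. three blocks
sums = block_checksums(data)
assert verify_before_read(data, sums)
```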
- Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.
- The client sends a file name and a chunk index, and the master replies with the chunk handle and the locations of the replicas.
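A toy illustration of that metadata lookup, with made-up names and an in-memory dict standing in for the master's tables; the client turns a byte offset into a chunk index using the fixed 64 MB chunk size, and no file data ever flows through the master:

```python
from dataclasses import dataclass

@dataclass
class ChunkInfo:
    handle: int        # 64-bit chunk handle assigned by the master
    replicas: list     # addresses of chunkservers holding a replica

# Toy stand-in for the master's metadata: (file name, chunk index) -> ChunkInfo.
MASTER_TABLE = {
    ("/logs/web.0", 0): ChunkInfo(0x1A2B3C4D5E6F7081, ["cs1:700", "cs2:700", "cs3:700"]),
}

CHUNK_SIZE = 64 * 1024 * 1024

def master_lookup(path: str, byte_offset: int) -> ChunkInfo:
    """Metadata-only exchange: reply with the chunk handle and replica locations."""
    return MASTER_TABLE[(path, byte_offset // CHUNK_SIZE)]

# The client caches the reply for a limited time and then reads or writes the
# chunk by contacting one of the listed chunkservers directly.
info = master_lookup("/logs/web.0", 12_345)
print(hex(info.handle), info.replicas)
```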
- To minimize the master’s involvement in reads and writes:
- The client caches the chunk location info for a limited time.
- The client typically batches multiple inquiries in a single request.
- The master can also incorporate info about chunks immediately following the requested chunks in a reply.
- Pros and cons of a large chunk size:
- Clients’ interaction with the master is reduced.
- Network overhead (e.g. TCP connection establishment) is reduced.
- Metadata size is reduced (< 64 bytes per chunk), enabling all of it to be kept in the master’s memory (e.g. 1 PB of data in 64 MB chunks is ~16 million chunks, i.e. roughly 1 GB of chunk metadata).
- Chunkservers serving the single chunk of a small file may become a hotspot, which can be mitigated by a higher replication factor, etc.
- The metadata in the master includes:
- The file and chunk namespaces (both in memory and on disk).
- The mapping from files to chunks (both in memory and on disk).
- The locations of each chunk’s replicas (in-memory only).
- Mutations on persistent metadata are kept in an operation log stored in the master and replicated remotely.
- The operation log is replicated synchronously to remote machines (which back the shadow masters), with log records batched before flushing for better throughput.
- The operation log is compacted by checkpointing in the form of a B-tree like structure, which speeds up crash recovery of the master.
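A rough sketch of this log-then-apply discipline and of checkpointing, assuming hypothetical file names and a JSON encoding; the real checkpoint is the compact B-tree-like structure mentioned above, and the log is also flushed to remote replicas before a mutation is acknowledged:

```python
import json, os

LOG = "oplog.jsonl"        # operation log; GFS also flushes it to remote replicas
CKPT = "checkpoint.json"   # periodic checkpoint of the in-memory metadata

state: dict = {}           # file name -> list of chunk handles

def log_mutation(rec: dict) -> None:
    # The record is durable before the change is applied or acknowledged.
    with open(LOG, "a") as f:
        f.write(json.dumps(rec) + "\n")
        f.flush()
        os.fsync(f.fileno())

def create_file(path: str) -> None:
    log_mutation({"op": "create", "path": path})
    state[path] = []

def checkpoint() -> None:
    # Dump the state and start a fresh log, so crash recovery only replays the
    # (short) log written after the latest checkpoint.
    with open(CKPT, "w") as f:
        json.dump(state, f)
    open(LOG, "w").close()

def recover() -> dict:
    s = json.load(open(CKPT)) if os.path.exists(CKPT) else {}
    if os.path.exists(LOG):
        for line in open(LOG):
            rec = json.loads(line)
            if rec["op"] == "create":
                s[rec["path"]] = []
    return s
```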
- The master periodically scans through its entire state to do:
- Chunk garbage collection.
- Re-replication in the presence of chunkserver failures.
- Chunk migration for rebalancing.
- Chunkservers have the final word over which chunks they hold, although chunk locations are managed and monitored by the master.
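A small illustration of why the master can treat chunkserver reports as authoritative: replica locations live only in the master's memory (see above) and are rebuilt from chunkserver reports; the names here are hypothetical:

```python
from collections import defaultdict

# Replica locations are never persisted by the master; this map is rebuilt from
# chunkserver reports at startup and kept fresh through regular heartbeats.
chunk_locations = defaultdict(set)   # chunk handle -> set of chunkserver addresses

def handle_heartbeat(chunkserver: str, reported_handles: list) -> None:
    # What the chunkserver reports is authoritative for what it actually stores.
    for holders in chunk_locations.values():
        holders.discard(chunkserver)
    for handle in reported_handles:
        chunk_locations[handle].add(chunkserver)

handle_heartbeat("cs1:700", [0x1A, 0x2B])
handle_heartbeat("cs2:700", [0x1A])
print(sorted(chunk_locations[0x1A]))   # ['cs1:700', 'cs2:700']
```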
- A file region is consistent if all clients see the same data regardless of which replica they read from, and defined if it is consistent and clients see what the mutation wrote in its entirety.
- Two types of data mutations:
- Writes at specified offsets (including regular appends).
- Record appends at unspecified offsets.
- Consistency guarantees:
- A successful record append is guaranteed to happen at least once atomically, but some replicas may have duplicates.
- Concurrent writes to the same region are not serializable; the region ends up consistent but possibly undefined, containing mingled fragments from multiple writes.
- A mutated file region is defined after successful serial mutations.
- Unsuccessful mutations are reported to the client.
- The master grants (or extends) a lease to a chunkserver for being the primary of a chunk, and can safely grant it to another chunkserver after waiting for the old lease to surely expire.
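A toy version of that lease bookkeeping on the master; the 60-second initial lease timeout is from the paper, while the class and method names are assumptions:

```python
import time

LEASE_SECONDS = 60.0   # initial lease timeout used in the paper

class LeaseManager:
    """Toy per-chunk lease bookkeeping on the master."""
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary chunkserver, expiry)

    def grant(self, handle, chunkserver) -> bool:
        current = self.leases.get(handle)
        if current and time.monotonic() < current[1]:
            return False   # an unexpired lease exists; wait it out (or revoke it first)
        self.leases[handle] = (chunkserver, time.monotonic() + LEASE_SECONDS)
        return True

    def extend(self, handle, chunkserver) -> bool:
        # Extension requests piggyback on heartbeats while mutations keep flowing.
        current = self.leases.get(handle)
        if current and current[0] == chunkserver:
            self.leases[handle] = (chunkserver, time.monotonic() + LEASE_SECONDS)
            return True
        return False

mgr = LeaseManager()
assert mgr.grant(0x1A, "cs1:700")
assert not mgr.grant(0x1A, "cs2:700")   # refused until cs1's lease has surely expired
```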
- For mutations, data flow and control flow are decoupled to use the network efficiently:
- First, the client pushes the data to all the replicas (details below).
- Then, the client sends a write request to the primary.
- The primary assigns consecutive serial numbers to all the mutations, and applies them locally.
- The write request is forwarded to all the secondaries, which apply the mutations in the same order.
- The primary replies to the client after receiving all the replies from the secondaries.
- The data is pushed from the client to one of the replicas (probably the closest) at first and then propagated to the rest linearly along a chain, so that each machine’s network bandwidth is fully utilized.
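A self-contained sketch of the decoupled data and control flow above, with in-process objects standing in for chunkservers; the chain push is reduced to a loop and serial-number assignment is collapsed into a single offset for brevity:

```python
class Replica:
    """In-process stand-in for a chunkserver holding one replica of a chunk."""
    def __init__(self, name: str):
        self.name = name
        self.buffered = {}          # data pushed (data flow) but not yet committed
        self.chunk = bytearray()

    def push(self, data_id: int, data: bytes) -> None:
        self.buffered[data_id] = data   # no ordering is decided at this point

    def apply(self, data_id: int, offset: int) -> None:
        data = self.buffered.pop(data_id)
        self.chunk[offset:offset + len(data)] = data

def write(data: bytes, replicas: list, offset: int) -> None:
    # 1. Data flow: push the bytes to every replica. In GFS this goes linearly
    #    along a chain starting at the closest replica; a plain loop here.
    data_id = id(data)
    for r in replicas:
        r.push(data_id, data)
    # 2. Control flow: the primary (replicas[0] here) orders the mutation and
    #    applies it locally; real GFS assigns consecutive serial numbers.
    primary, secondaries = replicas[0], replicas[1:]
    primary.apply(data_id, offset)
    # 3. The secondaries apply the same mutation in the same order.
    for s in secondaries:
        s.apply(data_id, offset)
    # 4. Only after all secondaries have replied does the client see success.

rs = [Replica("cs1"), Replica("cs2"), Replica("cs3")]
write(b"hello", rs, 0)
assert all(r.chunk == bytearray(b"hello") for r in rs)
```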
- For a record append, the primary first needs to check if doing so would exceed the size of the chunk:
- If so, pad the chunk to the full size (replicated by the secondaries) and reject the request so that the client retries the append on the next chunk.
- Otherwise, apply the append locally and then forward the write to the secondaries with the exact offset specified (see the sketch below).
- For a snapshot operation, the master first revokes all the leases on the chunks involved; when a mutation on one of those chunks is next requested, the chunkservers first copy the chunk locally (copy-on-write) before the master grants a lease on the new copy.
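The record-append decision on the primary, sketched with hypothetical names; padding replication and the forwarding to secondaries are reduced to comments:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB

def primary_record_append(chunk: bytearray, record: bytes):
    """Return the offset chosen by the primary, or None if the client must retry
    the append on the next chunk."""
    if len(chunk) + len(record) > CHUNK_SIZE:
        # Pad this chunk to its full size (the secondaries replicate the padding)
        # and reject the request so the client retries on a fresh chunk.
        chunk.extend(b"\x00" * (CHUNK_SIZE - len(chunk)))
        return None
    offset = len(chunk)    # the primary picks the offset,
    chunk.extend(record)   # applies the append locally,
    return offset          # and forwards this exact offset to the secondaries.

chunk = bytearray()
print(primary_record_append(chunk, b"record-1"))   # 0: appended at the start of the chunk
```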
- GFS does not keep a per-directory data structure listing all the files in a directory, but does allocate an entry (associated with a read-write lock) for each file and directory in the namespace lookup table.
- File creation only requires read locks along the parent directory path (plus a write lock on the new file name), preventing the directory from being deleted, renamed or snapshotted while allowing concurrent mutations within the directory.
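A sketch of which namespace-table entries a file creation would lock and in which mode, with hypothetical helper names; Python's standard library has no reader-writer lock, so the modes are only labeled here, not enforced:

```python
def locks_for_create(file_path: str):
    """Which namespace-table entries a file creation locks, and in which mode."""
    parts = file_path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts) - 1)]
    # Read locks on the ancestor directories keep them from being deleted,
    # renamed, or snapshotted; a write lock on the new file name serializes
    # concurrent attempts to create the same file.
    return [(d, "read") for d in ancestors] + [(file_path, "write")]

print(locks_for_create("/home/user/foo"))
# [('/home', 'read'), ('/home/user', 'read'), ('/home/user/foo', 'write')]
```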
- The chunk replica placement policy is aimed to maximize data reliability, availability and network bandwidth utilization.
- Factors to consider when placing a replica:
- Disk space utilization should be equalized among chunkservers.
- The number of “recent” chunk creations should be limited on each chunkserver.
- Replicas should be spread across racks to reduce fate sharing.
- The metadata of a deleted file is not erased until three days (configurable) after the file is marked as deleted.
- Chunkservers delete orphaned chunks after learning from the master’s replies to regular heartbeat messages which of their chunks are no longer referenced.
- The master bumps the version number of a chunk every time a new lease is granted on it, and stale chunk replicas are removed in regular garbage collections.
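A minimal sketch of that stale-replica detection via chunk version numbers; the names are illustrative:

```python
# Latest version per chunk on the master, bumped whenever a new lease is granted,
# so a replica that missed mutations while its chunkserver was down can be detected.
current_version: dict = {}

def grant_lease(handle: int) -> int:
    current_version[handle] = current_version.get(handle, 0) + 1
    return current_version[handle]

def is_stale(handle: int, reported_version: int) -> bool:
    # Compared against the version each chunkserver reports; stale replicas are
    # removed in the regular garbage collection pass.
    return reported_version < current_version.get(handle, 0)

grant_lease(0x1A)            # version 1
grant_lease(0x1A)            # version 2 (some replica was down and missed it)
print(is_stale(0x1A, 1))     # True: that replica will be garbage-collected
```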
- When the master is down, shadow masters, which may slightly lag behind, only provide read-only access to the file system to enhance read availability for files not being actively mutated.