[Paper Notes] The Google File System (GFS)

Devin Z
Sep 11, 2022

The predecessor of Colossus and archetype of HDFS.

[Photo: Google Bay View Campus, August 28, 2022]
  • Key observations of the application workloads and technological environment:
    - Component failures are the norm rather than the exception, which requires constant error detection, fault tolerance and automatic failure recovery.
    - Files are huge (multi-GBs) by the standards at that time, which challenges traditional design assumptions.
    - Most files are mutated by appending new data rather than overwriting existing data, which makes client-side caching of data blocks unappealing.
    - Some flexibility can be gained by co-designing the applications and the file system API, such as the relaxed consistency model.
  • The GFS interface consists of:
    - Familiar file system APIs and structure, albeit not POSIX-compliant.
    - The snapshot operation.
    - The atomic record append operation.
  • A GFS cluster consists of:
    - A single master (with shadow replicas).
    - Hundreds of chunkservers.
    - Multiple concurrent clients.
  • Files are divided into fixed-size chunks (64 MB) and each chunk is globally identified by a 64-bit chunk handle assigned by the master.
  • Each chunk is replicated on multiple chunkservers as per the user-designated replication level (3 replicas by default).
  • Each chunk is stored as a plain Linux file on a chunkserver and is extended only as needed to avoid internal fragmentation.
  • Chunkservers independently verify data integrity of their own copies by persisting a 32-bit checksum for each 64 KB block.
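A minimal sketch of the per-block checksum idea, assuming Python's zlib.crc32 as a stand-in for GFS's 32-bit checksum; the function names are mine, only the 64 KB granularity comes from the paper.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # checksum granularity: one 32-bit checksum per 64 KB block

def compute_block_checksums(chunk_data: bytes) -> list[int]:
    """Compute a checksum for every 64 KB block of a chunk replica."""
    return [
        zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
        for off in range(0, len(chunk_data), BLOCK_SIZE)
    ]

def verify_read(chunk_data: bytes, checksums: list[int], offset: int, length: int) -> bool:
    """Verify only the blocks overlapping a read range before returning data."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for i in range(first, last + 1):
        block = chunk_data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[i]:
            return False  # mismatch: report to the master, serve the read from another replica
    return True
```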
  • Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.
  • The client sends a file name and a chunk index, and the master replies with the chunk handle and the locations of the replicas.
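As a rough illustration of that exchange (the class, field, and call names below are hypothetical, not from the paper), the client derives the chunk index from a byte offset and asks the master for the handle and replica locations:

```python
from dataclasses import dataclass

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

@dataclass
class ChunkInfo:
    chunk_handle: int             # globally unique 64-bit id assigned by the master
    replica_locations: list[str]  # chunkserver addresses holding replicas

def chunk_index_for_offset(byte_offset: int) -> int:
    """Translate a byte offset within a file into a chunk index."""
    return byte_offset // CHUNK_SIZE

# Hypothetical request/reply: the client sends (file_name, chunk_index) and the
# master replies with a ChunkInfo.
# info = master.lookup("/logs/web-00", chunk_index_for_offset(200 * 2**20))
```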
  • To minimize the master’s involvement in reads and writes:
    - The client caches the chunk location info for a limited time.
    - The client typically batches multiple inquiries in a single request.
    - The master can also incorporate info about chunks immediately following the requested chunks in a reply.
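A sketch of the client-side caching part of this, assuming a simple time-to-live; the TTL value and class name are illustrative, not from the paper.

```python
import time

class ChunkLocationCache:
    """Caches (file_name, chunk_index) -> chunk location info for a limited time."""

    def __init__(self, ttl_seconds: float = 60.0):  # TTL chosen arbitrarily for the sketch
        self.ttl = ttl_seconds
        self._entries = {}  # (file_name, chunk_index) -> (expiry_time, info)

    def put(self, file_name: str, chunk_index: int, info) -> None:
        # Also used to store extra entries the master returned for the chunks
        # immediately following the requested ones.
        self._entries[(file_name, chunk_index)] = (time.monotonic() + self.ttl, info)

    def get(self, file_name: str, chunk_index: int):
        entry = self._entries.get((file_name, chunk_index))
        if entry is None:
            return None
        expiry, info = entry
        if time.monotonic() > expiry:
            del self._entries[(file_name, chunk_index)]
            return None  # expired: ask the master again
        return info
```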
  • Pros and cons of a large chunk size:
    - Clients’ interaction with the master is reduced.
    - Network overhead (e.g. TCP connection establishment) is reduced.
    - Metadata size is reduced (< 64 bytes per chunk), enabling it to be cached in memory.
    - Chunkservers serving the single chunk of a small file may become a hotspot, which can be mitigated by a higher replication factor, etc.
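A back-of-the-envelope check of why the small per-chunk metadata matters; the 1 PB figure is just an example, only the 64 MB chunk size and the sub-64-byte metadata size come from the notes above.

```python
CHUNK_SIZE = 64 * 2**20      # 64 MB
METADATA_PER_CHUNK = 64      # upper bound: < 64 bytes of metadata per chunk

total_data = 10**15          # example: roughly 1 PB of file data
num_chunks = total_data // CHUNK_SIZE
metadata_bytes = num_chunks * METADATA_PER_CHUNK
print(f"{num_chunks:,} chunks -> under {metadata_bytes / 2**20:.0f} MB of chunk metadata in master memory")
```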
  • The metadata in the master includes:
    - The file and chunk namespaces (both in memory and on disk).
    - The mapping from files to chunks (both in memory and on disk).
    - The locations of each chunk’s replicas (in-memory only).
  • Mutations on persistent metadata are kept in an operation log stored in the master and replicated remotely.
  • The operation log is replicated synchronously to remote machines (whose copies also serve the shadow masters), and log records are batched before flushing for better throughput.
  • The operation log is compacted by checkpointing in the form of a B-tree like structure, which speeds up crash recovery of the master.
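A toy sketch combining the two ideas (batched log flushes, plus a checkpoint that bounds replay time); it is in-memory only, stores generic key/value records, and omits both the remote replication and the B-tree-like checkpoint format.

```python
import json

class OperationLog:
    """Toy write-ahead log: batch records before flushing, checkpoint to shorten recovery."""

    def __init__(self, batch_size: int = 16):
        self.pending = []   # records appended but not yet flushed
        self.flushed = []   # stand-in for the durable (locally and remotely flushed) log
        self.batch_size = batch_size

    def append(self, record: dict) -> None:
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()    # one flush covers many records, improving throughput

    def flush(self) -> None:
        self.flushed.extend(self.pending)
        self.pending.clear()

    def checkpoint(self, state: dict) -> str:
        """Serialize the current metadata state; older log records can then be discarded."""
        self.flushed.clear()
        return json.dumps(state)

def recover(snapshot: str, log_tail: list) -> dict:
    """Crash recovery: load the latest checkpoint, then replay only the records after it."""
    state = json.loads(snapshot)
    for record in log_tail:
        state[record["key"]] = record["value"]
    return state
```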
  • The master periodically scans through its entire state to do:
    - Chunk garbage collection.
    - Re-replication in the presence of chunkserver failures.
    - Chunk migration for rebalancing.
  • Chunkservers have the final word over which chunks they store; the master learns and monitors chunk locations through periodic heartbeats rather than persisting them.
  • A file region is consistent if all clients see the same data regardless of which replicas they read from; it is defined if, in addition, clients see what the mutation wrote in its entirety.
  • Two types of data mutations:
    - Writes at specified offsets (including regular appends).
    - Record appends at unspecified offsets.
  • Consistency guarantees:
    - A successful record append is guaranteed to happen at least once atomically, but some replicas may have duplicates.
    - Concurrent successful writes to the same region leave it consistent but undefined: all clients see the same data, but it may mingle fragments from multiple writes.
    - A mutated file region is defined after a successful mutation with no interference from concurrent writers.
    - Unsuccessful mutations are reported to the client.
  • The master grants (or extends) a lease to one chunkserver to act as the primary for a chunk, and can safely grant it to another replica after waiting for the old lease to expire.
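A sketch of the master-side lease bookkeeping; the 60-second duration matches the paper's default lease timeout, everything else is illustrative.

```python
import time

LEASE_DURATION = 60.0  # seconds; the paper's default lease timeout

class LeaseManager:
    def __init__(self):
        self._leases = {}  # chunk_handle -> (primary chunkserver, expiry time)

    def grant_or_extend(self, chunk_handle: int, chunkserver: str) -> bool:
        now = time.monotonic()
        current = self._leases.get(chunk_handle)
        if current is not None:
            primary, expiry = current
            if primary == chunkserver:
                # In GFS, extensions are piggybacked on heartbeats; here we simply renew.
                self._leases[chunk_handle] = (primary, now + LEASE_DURATION)
                return True
            if now < expiry:
                # A new primary cannot be chosen until the old lease has surely expired
                # (or has been explicitly revoked).
                return False
        self._leases[chunk_handle] = (chunkserver, now + LEASE_DURATION)
        return True
```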
  • For mutations, data flow and control flow are decoupled to use the network efficiently:
    - First, the client pushes the data to all the replicas (details below).
    - Then, the client sends a write request to the primary.
    - The primary assigns consecutive serial numbers to all the mutations, and applies them locally.
    - The write request is forwarded to all the secondaries, which apply the mutations in the same order.
    - The primary replies to the client after receiving all the replies from the secondaries.
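A condensed sketch of that control flow from the primary's point of view; the state dictionary and the secondaries' apply() method are hypothetical stand-ins for the real RPCs.

```python
def handle_write(primary_state: dict, secondaries: list, mutation: bytes) -> str:
    """Primary-side control flow for a write whose data was already pushed to all replicas."""
    # 1. Assign the next serial number so every replica applies mutations in one order.
    serial = primary_state["next_serial"]
    primary_state["next_serial"] += 1

    # 2. Apply the mutation locally in serial-number order.
    primary_state["applied"].append((serial, mutation))

    # 3. Forward the write request (control only, not the data) to all secondaries.
    replies = [secondary.apply(serial, mutation) for secondary in secondaries]

    # 4. Reply to the client only after every secondary has acknowledged.
    return "success" if all(replies) else "error: client should retry the mutation"
```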
  • The data is pushed from the client to one of the replicas (probably the closest) at first and then propagated to the rest linearly along a chain, so that each machine’s network bandwidth is fully utilized.
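The paper models this pipeline as follows: pushing B bytes through a chain of R replicas ideally takes B/T + R·L, where T is each machine's network throughput and L is the per-hop latency. Plugging in the paper's example numbers (100 Mbps links, roughly 1 ms hops):

```python
B = 10**6              # 1 MB of data to distribute
T = 100 * 10**6 / 8    # 100 Mbps link throughput, in bytes per second
R = 3                  # replicas in the chain
L = 0.001              # ~1 ms per-hop latency

ideal_seconds = B / T + R * L
print(f"~{ideal_seconds * 1000:.0f} ms to push 1 MB through {R} replicas")  # the paper quotes roughly 80 ms
```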
  • For a record append, the primary first checks whether appending the record would exceed the maximum chunk size (64 MB):
    - If so, it pads the chunk to the maximum size (as do the secondaries) and tells the client to retry on the next chunk.
    - Otherwise, apply the append locally and then forward the write to the secondaries with the exact offset specified.
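A sketch of that decision on the primary; the padding bytes and return convention are made up for illustration.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB maximum chunk size

def primary_record_append(chunk: bytearray, record: bytes):
    """Decide where a record append lands, from the primary's perspective."""
    if len(chunk) + len(record) > CHUNK_SIZE:
        # Pad the current chunk to its maximum size (the secondaries do the same)
        # and tell the client to retry the append on the next chunk.
        chunk.extend(b"\x00" * (CHUNK_SIZE - len(chunk)))
        return None  # caller treats None as "retry on the next chunk"
    offset = len(chunk)
    chunk.extend(record)  # apply locally, then forward the write with this exact offset
    return offset
```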
  • For a snapshot operation, the master first revokes all outstanding leases on the chunks involved and then duplicates the file and chunk metadata; when a mutation on one of those chunks is later requested, each chunkserver makes a local copy of the chunk (copy-on-write) before the master grants a lease on the new copy.
  • GFS keeps no per-directory data structure listing the files in a directory; instead, the namespace is a lookup table mapping full pathnames to metadata, with an entry (and an associated read-write lock) for each file and directory.
  • File creation requires read locks on all ancestor directory names and a write lock on the new file’s name; the read lock on the parent prevents that directory from being deleted, renamed or snapshotted, while still allowing concurrent mutations within it.
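A sketch of the locks involved in creating /home/user/foo, the example used in the paper; Python's standard library has no read-write lock, so the sketch only lists which locks would be taken, in acquisition order and in which mode.

```python
def locks_for_file_creation(new_file_path: str) -> list[tuple[str, str]]:
    """Return, in acquisition order, the locks needed to create a file: read locks on
    every ancestor directory name and a write lock on the new file's full name.
    Acquiring locks in a consistent total order (by depth, then lexicographically) avoids deadlock."""
    parts = new_file_path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [(p, "read") for p in ancestors] + [(new_file_path, "write")]

# locks_for_file_creation("/home/user/foo")
# -> [('/home', 'read'), ('/home/user', 'read'), ('/home/user/foo', 'write')]
```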
  • The chunk replica placement policy is aimed to maximize data reliability, availability and network bandwidth utilization.
  • Factors to consider when placing a replica:
    - Disk space utilization should be equalized among chunkservers.
    - The number of “recent” chunk creations should be limited on each chunkserver.
    - Replicas should be spread across racks to reduce fate sharing.
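A sketch of how those factors might be combined when choosing a chunkserver for a new replica; the scoring and the eligibility threshold are arbitrary, only the criteria themselves come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_utilization: float   # fraction of disk space in use, 0.0 - 1.0
    recent_creations: int     # chunk creations in the recent window

def pick_replica_target(candidates: list, racks_already_used: set,
                        max_recent_creations: int = 10) -> Chunkserver:
    """Prefer new racks, emptier disks, and chunkservers with few recent creations."""
    eligible = [c for c in candidates if c.recent_creations < max_recent_creations]

    def score(c: Chunkserver):
        # Lower is better: unused rack first, then lower disk utilization,
        # then fewer recent creations.
        return (c.rack in racks_already_used, c.disk_utilization, c.recent_creations)

    return min(eligible, key=score)
```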
  • A deleted file is first renamed to a hidden name, and its metadata is only erased during a later namespace scan, after the file has been hidden for more than three days (the interval is configurable).
  • Chunkservers learn which of their chunks are orphaned from the master’s replies to regular heartbeat messages, and are then free to delete them.
  • The master bumps the version number of a chunk every time a new lease is granted on it, and stale chunk replicas are removed in regular garbage collections.
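A sketch of the version-number bookkeeping; the dictionaries stand in for the master's real metadata structures.

```python
def bump_version_on_lease_grant(master_versions: dict, chunk_handle: int) -> int:
    """Whenever the master grants a new lease on a chunk, it increases the version number."""
    master_versions[chunk_handle] = master_versions.get(chunk_handle, 0) + 1
    return master_versions[chunk_handle]

def is_stale(master_versions: dict, chunk_handle: int, replica_version: int) -> bool:
    """A replica that missed mutations (e.g. while its chunkserver was down) reports an older version."""
    return replica_version < master_versions.get(chunk_handle, 0)

# Stale replicas are never returned to clients and are removed during regular garbage collection.
```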
  • When the master is down, shadow masters, which may slightly lag behind, only provide read-only access to the file system to enhance read availability for files not being actively mutated.
