The predecessor of Colossus and archetype of HDFS.
- Key observations of the application workloads and technological environment:
- Component failures are the norm rather than the exception, which requires constant error detection, fault tolerance and automatic failure recovery.
- Files are huge (multi-GBs) by the standards at that time, which challenges traditional design assumptions.
- Most files are mutated by appending new data rather than overwriting existing data, which makes client-side caching of data blocks unappealing.
- Some flexibility can be gained by co-designing the applications and the file system API, such as the relaxed consistency model.
- The GFS interface consists of:
- Familiar file system APIs and structure, albeit not POSIX-compliant.
- The snapshot operation.
- The atomic record append operation.
- A GFS cluster consists of:
- A single master (with shadow replicas).
- Hundreds of chunkservers.
- Multiple concurrent clients.
- Files are divided into fixed-size chunks (64 MB) and each chunk is globally identified by a 64-bit chunk handle assigned by the master.
- Each chunk is replicated on multiple chunkservers as per the user-designated replication level (3 replicas by default).
- Each chunk is stored as a plain Linux file on a chunkserver and is extended only as needed to avoid internal fragmentation.
- Chunkservers independently verify data integrity of their own copies by persisting a 32-bit checksum for each 64 KB block.
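A minimal sketch of that per-block checksum scheme, assuming CRC32 as the 32-bit checksum (the choice of algorithm and the function names are illustrative assumptions, not GFS's actual on-disk format):

```python
import zlib

BLOCK_SIZE = 64 * 1024   # checksums cover 64 KB blocks within a chunk

def block_checksums(chunk_data: bytes) -> list:
    """One 32-bit checksum per 64 KB block (CRC32 used here as a stand-in)."""
    return [
        zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
        for off in range(0, len(chunk_data), BLOCK_SIZE)
    ]

def verify_before_read(chunk_data: bytes, checksums: list) -> bool:
    """A chunkserver verifies its own copy before handing data to a requester,
    so corruption on one replica is not propagated to other machines."""
    return block_checksums(chunk_data) == checksums

data = b"some record data" * 10_000       # ~160 KB, i.e. three blocks
sums = block_checksums(data)
assert verify_before_read(data, sums)
```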
- Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.
- The client sends a file name and a chunk index, and the master replies with the chunk handle and the locations of the replicas.
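A toy illustration of that metadata lookup, with made-up names and an in-memory dict standing in for the master's tables; the client turns a byte offset into a chunk index using the fixed 64 MB chunk size, and no file data ever flows through the master:

```python
from dataclasses import dataclass

@dataclass
class ChunkInfo:
    handle: int        # 64-bit chunk handle assigned by the master
    replicas: list     # addresses of chunkservers holding a replica

# Toy stand-in for the master's metadata: (file name, chunk index) -> ChunkInfo.
MASTER_TABLE = {
    ("/logs/web.0", 0): ChunkInfo(0x1A2B3C4D5E6F7081, ["cs1:700", "cs2:700", "cs3:700"]),
}

CHUNK_SIZE = 64 * 1024 * 1024

def master_lookup(path: str, byte_offset: int) -> ChunkInfo:
    """Metadata-only exchange: reply with the chunk handle and replica locations."""
    return MASTER_TABLE[(path, byte_offset // CHUNK_SIZE)]

# The client caches the reply for a limited time and then reads or writes the
# chunk by contacting one of the listed chunkservers directly.
info = master_lookup("/logs/web.0", 12_345)
print(hex(info.handle), info.replicas)
```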
- To minimize the master’s involvement in reads and writes:
- The client caches the chunk location info for a limited time.
- The client typically batches multiple inquiries in a single request.
- The master can also incorporate info about chunks immediately following the requested chunks in a reply.
- Pros and cons of a large chunk size:
- Clients’ interaction with the master is reduced.
- Network overhead (e.g. TCP connection establishment) is reduced.
- Metadata size is reduced (< 64 bytes per chunk), enabling all of it to be kept in the master’s memory (e.g. 1 PB of data in 64 MB chunks is ~16 million chunks, i.e. roughly 1 GB of chunk metadata).
- Chunkservers serving the single chunk of a small file may become a hotspot, which can be mitigated by a higher replication factor, etc.
- The metadata in the master includes:
- The file and chunk namespaces (both in memory and on disk).
- The mapping from files to chunks (both in memory and on disk).
- The locations of each chunk’s replicas (in-memory only).
- Mutations on persistent metadata are kept in an operation log stored in the master and replicated remotely.
- The operation log is replicated synchronously to remote machines (which back the shadow masters), with log records batched before flushing for better throughput.
- The operation log is compacted by checkpointing in the form of a B-tree like structure, which speeds up crash recovery of the master.
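A rough sketch of this log-then-apply discipline and of checkpointing, assuming hypothetical file names and a JSON encoding; the real checkpoint is the compact B-tree-like structure mentioned above, and the log is also flushed to remote replicas before a mutation is acknowledged:

```python
import json, os

LOG = "oplog.jsonl"        # operation log; GFS also flushes it to remote replicas
CKPT = "checkpoint.json"   # periodic checkpoint of the in-memory metadata

state: dict = {}           # file name -> list of chunk handles

def log_mutation(rec: dict) -> None:
    # The record is durable before the change is applied or acknowledged.
    with open(LOG, "a") as f:
        f.write(json.dumps(rec) + "\n")
        f.flush()
        os.fsync(f.fileno())

def create_file(path: str) -> None:
    log_mutation({"op": "create", "path": path})
    state[path] = []

def checkpoint() -> None:
    # Dump the state and start a fresh log, so crash recovery only replays the
    # (short) log written after the latest checkpoint.
    with open(CKPT, "w") as f:
        json.dump(state, f)
    open(LOG, "w").close()

def recover() -> dict:
    s = json.load(open(CKPT)) if os.path.exists(CKPT) else {}
    if os.path.exists(LOG):
        for line in open(LOG):
            rec = json.loads(line)
            if rec["op"] == "create":
                s[rec["path"]] = []
    return s
```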
- The master periodically scans through its entire state to do:
- Chunk garbage collection.
- Re-replication in the presence of chunkserver failures.
- Chunk migration for rebalancing.
- Chunkservers have the final word over which chunks they hold, although chunk locations are managed and monitored by the master.
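A small illustration of why the master can treat chunkserver reports as authoritative: replica locations live only in the master's memory (see above) and are rebuilt from chunkserver reports; the names here are hypothetical:

```python
from collections import defaultdict

# Replica locations are never persisted by the master; this map is rebuilt from
# chunkserver reports at startup and kept fresh through regular heartbeats.
chunk_locations = defaultdict(set)   # chunk handle -> set of chunkserver addresses

def handle_heartbeat(chunkserver: str, reported_handles: list) -> None:
    # What the chunkserver reports is authoritative for what it actually stores.
    for holders in chunk_locations.values():
        holders.discard(chunkserver)
    for handle in reported_handles:
        chunk_locations[handle].add(chunkserver)

handle_heartbeat("cs1:700", [0x1A, 0x2B])
handle_heartbeat("cs2:700", [0x1A])
print(sorted(chunk_locations[0x1A]))   # ['cs1:700', 'cs2:700']
```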
- A file region is consistent if all clients see the same data regardless of which replica they read from, and defined if it is consistent and clients see what the mutation wrote in its entirety.
- Two types of data mutations:
- Writes at specified offsets (including regular appends).
- Record appends at unspecified offsets.
- Consistency guarantees:
- A successful record append is guaranteed to happen at least once atomically, but some replicas may have duplicates.
- Concurrent writes to the same region are not serializable; the region ends up consistent but possibly undefined, containing mingled fragments from multiple writes.
- A mutated file region is defined after successful serial mutations.
- Unsuccessful mutations are reported to the client.
- The master grants (or extends) a lease to a chunkserver for being the primary of a chunk, and can safely grant it to another chunkserver after waiting for the old lease to surely expire.
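A toy version of that lease bookkeeping on the master; the 60-second initial lease timeout is from the paper, while the class and method names are assumptions:

```python
import time

LEASE_SECONDS = 60.0   # initial lease timeout used in the paper

class LeaseManager:
    """Toy per-chunk lease bookkeeping on the master."""
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary chunkserver, expiry)

    def grant(self, handle, chunkserver) -> bool:
        current = self.leases.get(handle)
        if current and time.monotonic() < current[1]:
            return False   # an unexpired lease exists; wait it out (or revoke it first)
        self.leases[handle] = (chunkserver, time.monotonic() + LEASE_SECONDS)
        return True

    def extend(self, handle, chunkserver) -> bool:
        # Extension requests piggyback on heartbeats while mutations keep flowing.
        current = self.leases.get(handle)
        if current and current[0] == chunkserver:
            self.leases[handle] = (chunkserver, time.monotonic() + LEASE_SECONDS)
            return True
        return False

mgr = LeaseManager()
assert mgr.grant(0x1A, "cs1:700")
assert not mgr.grant(0x1A, "cs2:700")   # refused until cs1's lease has surely expired
```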
- For mutations, data flow and control flow are decoupled to use the network efficiently:
- First, the client pushes the data to all the replicas (details below).
- Then, the client sends a write request to the primary.
- The primary assigns consecutive serial numbers to all the mutations, and applies them locally.
- The write request is forwarded to all the secondaries, which apply the mutations in the same order.
- The primary replies to the client after receiving all the replies from the secondaries.
- The data is pushed from the client to one of the replicas (probably the closest) at first and then propagated to the rest linearly along a chain, so that each machine’s network bandwidth is fully utilized.
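A self-contained sketch of the decoupled data and control flow above, with in-process objects standing in for chunkservers; the chain push is reduced to a loop and serial-number assignment is collapsed into a single offset for brevity:

```python
class Replica:
    """In-process stand-in for a chunkserver holding one replica of a chunk."""
    def __init__(self, name: str):
        self.name = name
        self.buffered = {}          # data pushed (data flow) but not yet committed
        self.chunk = bytearray()

    def push(self, data_id: int, data: bytes) -> None:
        self.buffered[data_id] = data   # no ordering is decided at this point

    def apply(self, data_id: int, offset: int) -> None:
        data = self.buffered.pop(data_id)
        self.chunk[offset:offset + len(data)] = data

def write(data: bytes, replicas: list, offset: int) -> None:
    # 1. Data flow: push the bytes to every replica. In GFS this goes linearly
    #    along a chain starting at the closest replica; a plain loop here.
    data_id = id(data)
    for r in replicas:
        r.push(data_id, data)
    # 2. Control flow: the primary (replicas[0] here) orders the mutation and
    #    applies it locally; real GFS assigns consecutive serial numbers.
    primary, secondaries = replicas[0], replicas[1:]
    primary.apply(data_id, offset)
    # 3. The secondaries apply the same mutation in the same order.
    for s in secondaries:
        s.apply(data_id, offset)
    # 4. Only after all secondaries have replied does the client see success.

rs = [Replica("cs1"), Replica("cs2"), Replica("cs3")]
write(b"hello", rs, 0)
assert all(r.chunk == bytearray(b"hello") for r in rs)
```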
- For a record append, the primary first needs to check if doing so would exceed the size of the chunk:
- If so, pad the chunk to the full size (replicated by the secondaries) and reject the request so that the client retries the append on the next chunk.
- Otherwise, apply the append locally and then forward the write to the secondaries with the exact offset specified (see the sketch below).
- For a snapshot operation, the master first revokes all the leases on the chunks involved; when a mutation on one of those chunks is next requested, the chunkservers first copy the chunk locally (copy-on-write) before the master grants a lease on the new copy.
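The record-append decision on the primary, sketched with hypothetical names; padding replication and the forwarding to secondaries are reduced to comments:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB

def primary_record_append(chunk: bytearray, record: bytes):
    """Return the offset chosen by the primary, or None if the client must retry
    the append on the next chunk."""
    if len(chunk) + len(record) > CHUNK_SIZE:
        # Pad this chunk to its full size (the secondaries replicate the padding)
        # and reject the request so the client retries on a fresh chunk.
        chunk.extend(b"\x00" * (CHUNK_SIZE - len(chunk)))
        return None
    offset = len(chunk)    # the primary picks the offset,
    chunk.extend(record)   # applies the append locally,
    return offset          # and forwards this exact offset to the secondaries.

chunk = bytearray()
print(primary_record_append(chunk, b"record-1"))   # 0: appended at the start of the chunk
```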
- GFS does not keep a per-directory data structure listing all the files in a directory, but does allocate an entry (associated with a read-write lock) for each file and directory in the namespace lookup table.
- File creation only requires read locks along the parent directory path (plus a write lock on the new file name), preventing the directory from being deleted, renamed or snapshotted while allowing concurrent mutations within the directory.
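A sketch of which namespace-table entries a file creation would lock and in which mode, with hypothetical helper names; Python's standard library has no reader-writer lock, so the modes are only labeled here, not enforced:

```python
def locks_for_create(file_path: str):
    """Which namespace-table entries a file creation locks, and in which mode."""
    parts = file_path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts) - 1)]
    # Read locks on the ancestor directories keep them from being deleted,
    # renamed, or snapshotted; a write lock on the new file name serializes
    # concurrent attempts to create the same file.
    return [(d, "read") for d in ancestors] + [(file_path, "write")]

print(locks_for_create("/home/user/foo"))
# [('/home', 'read'), ('/home/user', 'read'), ('/home/user/foo', 'write')]
```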
- The chunk replica placement policy is aimed to maximize data reliability, availability and network bandwidth utilization.
- Factors to consider when placing a replica:
- Disk space utilization should be equalized among chunkservers.
- The number of “recent” chunk creations should be limited on each chunkserver.
- Replicas should be spread across racks to reduce fate sharing.
- The metadata of a deleted file is not erased until three days (configurable) after the file is marked as deleted.
- Chunkservers delete orphaned chunks after learning from the master’s replies to regular heartbeat messages which of their chunks are no longer referenced.
- The master bumps the version number of a chunk every time a new lease is granted on it, and stale chunk replicas are removed in regular garbage collections.
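A minimal sketch of that stale-replica detection via chunk version numbers; the names are illustrative:

```python
# Latest version per chunk on the master, bumped whenever a new lease is granted,
# so a replica that missed mutations while its chunkserver was down can be detected.
current_version: dict = {}

def grant_lease(handle: int) -> int:
    current_version[handle] = current_version.get(handle, 0) + 1
    return current_version[handle]

def is_stale(handle: int, reported_version: int) -> bool:
    # Compared against the version each chunkserver reports; stale replicas are
    # removed in the regular garbage collection pass.
    return reported_version < current_version.get(handle, 0)

grant_lease(0x1A)            # version 1
grant_lease(0x1A)            # version 2 (some replica was down and missed it)
print(is_stale(0x1A, 1))     # True: that replica will be garbage-collected
```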
- When the master is down, shadow masters, which may slightly lag behind, only provide read-only access to the file system to enhance read availability for files not being actively mutated.