[Paper Notes] F1: A Distributed SQL Database That Scales

3 min readSep 26, 2022

--

A pioneer in HTAP (hybrid transactional/analytical processing).

  • Goals of F1’s design:
    - Better scalability than AdWords’ old MySQL system.
    - High availability for mission-critical business.
    - Consistency, including ACID transaction support.
    - Usability, including full SQL query support.
  • To mitigate increased latency due to remote data:
    - Make data clustering explicit to improve locality.
    - Make heavy use of batching, parallelism, and asynchronous reads (see the sketch after this list).
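
A minimal sketch of the batching-plus-parallelism idea in Python with asyncio; `fetch_shard` and the modulo shard layout are hypothetical stand-ins, not F1's actual client API:

```python
import asyncio

# Hypothetical stand-in for one remote read against a single Spanner shard.
async def fetch_shard(shard_id: int, keys: list[int]) -> dict[int, str]:
    await asyncio.sleep(0.001)  # simulate a network round trip
    return {k: f"row-{k}" for k in keys}

async def batched_read(keys: list[int], num_shards: int = 4) -> dict[int, str]:
    # Batch keys by shard so each shard receives one request instead of
    # one request per key...
    batches: dict[int, list[int]] = {}
    for k in keys:
        batches.setdefault(k % num_shards, []).append(k)
    # ...then issue all shard requests in parallel instead of serially,
    # so total latency is roughly one round trip rather than one per shard.
    results = await asyncio.gather(
        *(fetch_shard(s, ks) for s, ks in batches.items())
    )
    merged: dict[int, str] = {}
    for partial in results:
        merged.update(partial)
    return merged

rows = asyncio.run(batched_read(list(range(20))))
```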
  • Basic components:
    - F1 servers, which receive requests and coordinate query execution.
    - A shared slave pool of F1 processes that execute parts of distributed query plans.
    - The F1 master, which monitors slave process health and distributes the list of available slaves to F1 servers.
    - Spanner servers, which retrieve data from the Colossus File System (CFS) in the same datacenter.
Figure: Basic Architecture of F1 (Source: [1])
  • F1 servers are typically co-located in the same set of datacenters as the Spanner servers storing the data, but can communicate with Spanner servers outside their own datacenters when necessary.
  • F1 servers are mostly stateless except when holding locks for a pessimistic transaction.
  • Data model:
    - Physically, F1 stores each child table clustered with and interleaved within its parent table's rows, which reduces the number of Spanner reads required to join related data (see the key-layout sketch after this list).
    - The hierarchically clustered physical schema also reduces the number of Paxos groups involved in an update, thereby avoiding the performance penalty of two-phase commit (2PC).
    - Supports Protocol Buffer-typed columns to remove the impedance mismatch between the database and applications.
    - Allows repeated fields, avoiding the performance impact and complexity of storing and joining many small child records.
    - Encourages local indexes, which are co-located with the root rows they index, over global indexes, which are sharded across directories.
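
A rough sketch of the interleaved key layout, using the paper's Customer → Campaign → AdGroup hierarchy; the tuple encoding is only an illustration of the idea, not Spanner's actual storage format:

```python
# A child row's primary key is prefixed by its parent's key, so sorting
# by key interleaves children directly under their parents.
rows = {
    ("Customer", 1): {"name": "acme"},
    ("Customer", 1, "Campaign", 3): {"budget": 100},
    ("Customer", 1, "Campaign", 3, "AdGroup", 5): {"bid": 2},
    ("Customer", 1, "Campaign", 3, "AdGroup", 6): {"bid": 4},
    ("Customer", 1, "Campaign", 4): {"budget": 50},
    ("Customer", 2): {"name": "globex"},
}

# Fetching Customer 1 together with all of its campaigns and ad groups
# becomes one contiguous range scan over the prefix ("Customer", 1),
# i.e. a single Spanner read instead of one read per table.
prefix = ("Customer", 1)
subtree = [k for k in sorted(rows) if k[: len(prefix)] == prefix]
```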
  • F1 supports fully non-blocking schema changes: changes are applied asynchronously, at most two schema versions are active at any time, and each change is broken into a sequence of mutually compatible intermediate states (sketched below).
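
A rough illustration of those intermediate states for one common change, adding an index; the state names paraphrase the companion schema-change work and are not an API:

```python
# Consecutive states are mutually compatible, so servers running two
# adjacent schema versions at once never corrupt the index.
INDEX_ADD_STATES = [
    "absent",       # the index does not exist yet
    "delete-only",  # servers delete index entries but never add them
    "write-only",   # servers maintain index entries for all new writes
    "backfill",     # a background job adds entries for preexisting rows
    "public",       # the index is complete and visible to queries
]
```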
  • Three types of F1 transactions:
    - Read-only snapshot transactions, which are the default mode for SQL queries and MapReduces.
    - Pessimistic transactions, which map directly onto Spanner transactions and must abort if the F1 server holding their locks dies.
    - Optimistic transactions, each of which consists of (1) a long-running, lock-free read phase followed by (2) a short-lived Spanner pessimistic transaction that checks for conflicts and commits the writes (see the sketch after this list).
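
A minimal sketch of the optimistic protocol in Python; the `storage` dict stands in for Spanner, and the function names are hypothetical:

```python
import itertools

# Hypothetical store: key -> (value, last_modified_timestamp).
storage = {"k1": ("v1", 100), "k2": ("v2", 100)}
clock = itertools.count(101)  # monotonically increasing commit timestamps

def optimistic_read(keys):
    # Read phase: takes no locks, but remembers each row's
    # last-modified timestamp alongside its value.
    return {k: storage[k] for k in keys}

def optimistic_commit(snapshot, writes):
    # Commit phase: a short-lived pessimistic transaction re-checks the
    # timestamps and aborts if any row changed since the read phase.
    for key, (_, ts_seen) in snapshot.items():
        if storage[key][1] != ts_seen:
            return False  # conflict detected: abort
    commit_ts = next(clock)
    for key, value in writes.items():
        storage[key] = (value, commit_ts)
    return True

snap = optimistic_read(["k1", "k2"])
assert optimistic_commit(snap, {"k1": "v1-new"})    # commits
assert not optimistic_commit(snap, {"k2": "oops"})  # aborts: k1 changed
```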
  • Optimistic transactions avoid conflicts between reads and writes, but have two drawbacks:
    - Insertion phantoms: newly inserted rows carry no timestamps to validate against, which may require taking parent-table locks.
    - Low throughput under high contention, which may require batching updates at the application level.
  • Beyond the default row-level locking, F1 supports finer-grained (column-level) locking to avoid conflicts between transactions that update independent columns of the same row.
  • F1's change history can be used for change data capture and incremental processing.
  • For each transaction, one ChangeBatch is written per distinct root row touched; change history lives in child tables, close to the data being tracked (sketched below).
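
A hypothetical sketch of incremental processing on top of change history; the `ChangeBatch` shape and the `history` list below are illustrative stand-ins:

```python
from dataclasses import dataclass

@dataclass
class ChangeBatch:
    root_row: str          # key of the root row this batch belongs to
    commit_timestamp: int  # commit time of the writing transaction
    changes: list          # (table, key, before, after) tuples

# Stand-in for the ChangeBatch child tables stored next to their roots.
history = [
    ChangeBatch("Customer/1", 101, [("Campaign", "1/3", None, {"budget": 100})]),
    ChangeBatch("Customer/2", 105, [("Customer", "2", {"n": "x"}, {"n": "y"})]),
]

checkpoint = 0  # commit timestamp of the last batch already processed

def process_incrementally(handle):
    # Read every batch committed after the checkpoint, hand it to the
    # consumer, and advance the checkpoint so a restart resumes cleanly.
    global checkpoint
    for batch in sorted(history, key=lambda b: b.commit_timestamp):
        if batch.commit_timestamp > checkpoint:
            handle(batch)
            checkpoint = batch.commit_timestamp

process_incrementally(lambda b: print(b.root_row, b.commit_timestamp))
```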
  • F1 SQL supports both:
    - Centralized execution of low-latency OLTP-style queries.
    - Distributed execution of long-running OLAP-style queries.
  • In contrast with disk latency, network latency:
    - Can be mitigated by batching or pipelining data accesses.
    - Benefits more from parallel data accesses, since requests do not contend for a scarce resource the way disk accesses contend for disk heads.
  • A distributed query plan is organized as a DAG of plan parts whose results flow up to a single root, the query coordinator (or, for MapReduce consumption, to multiple partitioned consumers).
  • Because Spanner reshards data dynamically, F1 cannot take advantage of explicit co-partitioning of data, and it applies only hash partitioning for repartitioning (sketched below).
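
A small sketch of hash repartitioning for a distributed join; the worker layout and helper names are hypothetical:

```python
import hashlib

def shard_for(key: str, num_workers: int) -> int:
    # Hash on the join key so equal keys from both inputs always land
    # on the same downstream worker.
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_workers

def repartition(rows, key_of, num_workers):
    buckets = [[] for _ in range(num_workers)]  # one per plan part
    for row in rows:
        buckets[shard_for(key_of(row), num_workers)].append(row)
    return buckets

# Both join inputs are repartitioned the same way, so worker i can join
# left[i] against right[i] entirely locally.
left = repartition([("c1", "x"), ("c2", "y")], lambda r: r[0], 4)
right = repartition([("c1", 10), ("c3", 7)], lambda r: r[0], 4)
```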
  • For protocol buffers, F1 queries support:
    - Using path expressions to extract individual fields.
    - Querying and passing around entire protocol buffers.
    - Accessing repeated fields with PROTO JOIN (examples below).
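
Two illustrative queries in the spirit of the paper's AdWords examples, shown here as SQL strings in Python; the table and field names are paraphrased, not an exact schema:

```python
# Path expression: read an individual field out of a protobuf-valued
# column (c.Info is a protocol buffer; country_code is a field in it).
q_path = """
    SELECT c.CustomerId, c.Info.country_code
    FROM Customer AS c
"""

# PROTO JOIN: join a row against a repeated field inside one of its own
# protobuf columns, yielding one output row per repeated element.
q_proto_join = """
    SELECT c.CustomerId, f.feature_id
    FROM Customer AS c
    PROTO JOIN c.Whitelist.feature AS f
    WHERE f.status = 'ENABLED'
"""
```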
