[Paper Notes] Spanner: Google’s Globally-Distributed Database

Devin Z
3 min read · Sep 4, 2022


Cross-datacenter multi-object transactions in industry.

Photo: Google Bay View Campus, August 28, 2022
  • Database features of Spanner:
    - Schematized semi-relational tables.
    - SQL-based query language.
    - General purpose transactions.
    - Atomic schema changes.
  • Distributed system features of Spanner:
    - Fine-grained configuration for data replication.
    - Externally consistent reads and writes.
    - Globally consistent reads.
  • The purposes of cross-datacenter data replication:
    - Global availability;
    - Geographic locality.
  • A datacenter consists of one or more zones, and a zone consists of
    - One zonemaster, which assigns data to spanservers;
    - Up to thousands of spanservers, which serve data;
    - Location proxies, which locate data for clients.
  • A directory, as the unit of data placement, is a set of contiguous keys that share a common prefix.
  • A tablet, as the data managed by a Paxos state machine, may consist of multiple directories so that data frequently accessed together could be colocated.
  • Along with a bag of multi-versioned key-value mappings, the tablet also stores the metadata and log of the corresponding Paxos state machine.
  • A Paxos group is the set of replicas of the same tablet across different spanservers.
  • A spanserver may serve 100–1000 tablets, each belonging to a different Paxos group (see the data-model sketch after these notes).
  • The TrueTime API exposes clock uncertainty by returning, for each query of the current time, an interval [TT.now().earliest, TT.now().latest] that is guaranteed to contain the true absolute time (a minimal sketch follows these notes).
  • TrueTime uses both GPS and atomic clocks, which have different failure modes.
  • Two-phase locking (2PL) is used for read-write transactions, because optimistic concurrency control would perform poorly for long-lived transactions in the presence of conflicts.
  • Lock-free snapshot isolation is used for read-only transactions, which execute at a system-chosen timestamp and can proceed on any replica that is sufficiently up-to-date.
  • A Paxos leader maintains a lock table, which maps ranges of keys to 2PL lock states (a toy version appears after these notes), and a transaction manager for coordinating transactions that span multiple Paxos groups.
  • For a transaction spanning multiple Paxos groups, one group is chosen as the coordinator for the two-phase commit (2PC); the state of the transaction is replicated within the coordinator group, which mitigates the availability problem of 2PC.
  • 2PC is skipped if only one Paxos group is involved in a transaction.
  • Writes that occur in a transaction are buffered at the client until commit.
  • A Paxos leader is long-lived, and has a lease that defaults to 10 seconds.
  • Disjointness invariant: for each Paxos group, each Paxos leader’s lease interval is disjoint from every other leader’s.
  • External-consistency invariant: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.
  • Two rules guarantee the external-consistency invariant (sketched in code after these notes):
    - Start: the assigned commit timestamp is no less than TT.now().latest;
    - Commit wait: the coordinator leader doesn’t make the commit visible until TT.now().earliest exceeds the commit timestamp.
  • The commit timestamp assigned by the coordinator leader must be no less than any of the prepare timestamps assigned by the leaders of the participant groups.
  • For a read-only transaction spanning multiple Paxos groups, Spanner avoids a round of negotiation with all the groups by simply assigning the read timestamp TT.now().latest (which may require waiting for replicas to catch up to that timestamp).
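
To keep the placement vocabulary straight (zone → spanserver → tablet → directory), here is a minimal Go sketch of the hierarchy described above. The type names and fields are my own illustrative assumptions for exposition, not Spanner’s actual code.

```go
package main

import "fmt"

// Directory is the unit of data placement: a set of contiguous keys
// sharing a common prefix.
type Directory struct {
	KeyPrefix string
}

// Tablet is the data managed by one Paxos state machine; it may hold
// several directories so that frequently co-accessed data is colocated.
// (The Paxos metadata and log stored alongside are elided here.)
type Tablet struct {
	Directories []Directory
}

// Spanserver serves tablets: 100-1000 per spanserver per the paper,
// each tablet belonging to a different Paxos group.
type Spanserver struct {
	Tablets []Tablet
}

// Zone groups a zonemaster, spanservers, and location proxies.
type Zone struct {
	Zonemaster  string       // assigns data to spanservers
	Spanservers []Spanserver // up to several thousand per zone
	Proxies     []string     // help clients locate their data
}

func main() {
	d := Directory{KeyPrefix: "user/42/"}
	t := Tablet{Directories: []Directory{d}}
	z := Zone{Zonemaster: "zm-1", Spanservers: []Spanserver{{Tablets: []Tablet{t}}}}
	fmt.Printf("zone has %d spanserver(s)\n", len(z.Spanservers))
}
```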
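A minimal sketch of the TT.now() interface, assuming a fixed uncertainty bound epsilon; the real TrueTime derives a time-varying bound from GPS and atomic-clock time masters.

```go
package main

import (
	"fmt"
	"time"
)

// Interval bounds the clock uncertainty: the true absolute time of the
// TT.now() invocation is guaranteed to lie within [Earliest, Latest].
type Interval struct {
	Earliest, Latest time.Time
}

// epsilon is a stand-in uncertainty bound (the paper reports an average
// of roughly 4 ms in their deployment); real TrueTime computes it
// dynamically from its time references.
const epsilon = 4 * time.Millisecond

// Now models TT.now(): the local clock reading widened by the uncertainty.
func Now() Interval {
	t := time.Now()
	return Interval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
}

func main() {
	tt := Now()
	fmt.Println("TT.now() in [", tt.Earliest, ",", tt.Latest, "]")
}
```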
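A toy version of the leader’s lock table: a map from key ranges to 2PL lock states. The range representation, lock modes, and conflict rule here are illustrative assumptions, not the paper’s data structure.

```go
package main

import "fmt"

// KeyRange is a half-open range [Start, End) of keys.
type KeyRange struct{ Start, End string }

// LockMode is a minimal two-phase-locking lock state.
type LockMode int

const (
	Shared LockMode = iota
	Exclusive
)

// LockTable maps key ranges to their current lock mode.
type LockTable struct {
	locks map[KeyRange]LockMode
}

func overlaps(a, b KeyRange) bool {
	return a.Start < b.End && b.Start < a.End
}

// Acquire grants the lock unless a conflicting lock overlaps the range;
// two locks conflict when at least one of them is exclusive.
func (lt *LockTable) Acquire(r KeyRange, m LockMode) bool {
	for held, hm := range lt.locks {
		if overlaps(held, r) && (m == Exclusive || hm == Exclusive) {
			return false // a real system would block or abort here
		}
	}
	lt.locks[r] = m
	return true
}

func main() {
	lt := &LockTable{locks: map[KeyRange]LockMode{}}
	fmt.Println(lt.Acquire(KeyRange{"a", "c"}, Exclusive)) // true
	fmt.Println(lt.Acquire(KeyRange{"b", "d"}, Shared))    // false: overlaps an exclusive lock
}
```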
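Finally, the timestamp rules rendered as code: the start rule, the prepare-timestamp constraint, commit wait, and the read-only shortcut. This is a schematic continuation of the TrueTime sketch above (same package, reusing Interval and Now); the busy-wait loop is a simplification, not Spanner’s implementation.

```go
// assignCommitTimestamp applies the start rule (s is no less than
// TT.now().latest at the coordinator) and the constraint that s is no
// less than every participant's prepare timestamp.
func assignCommitTimestamp(prepareTimestamps []time.Time) time.Time {
	s := Now().Latest // start rule
	for _, p := range prepareTimestamps {
		if p.After(s) {
			s = p
		}
	}
	return s
}

// commitWait blocks until TT.now().earliest exceeds s, guaranteeing that
// s is in the past everywhere before the commit becomes visible.
func commitWait(s time.Time) {
	for !Now().Earliest.After(s) {
		time.Sleep(time.Millisecond)
	}
}

// readOnlyTimestamp picks the timestamp for a read-only transaction that
// spans multiple Paxos groups: simply TT.now().latest, which avoids a
// negotiation round; replicas may then wait until they are caught up.
func readOnlyTimestamp() time.Time {
	return Now().Latest
}
```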
