Cross-datacenter multi-object transactions in industry.
- Database features of Spanner:
  - Schematized semi-relational tables.
  - SQL-based query language.
  - General-purpose transactions.
  - Atomic schema changes.
- Distributed system features of Spanner:
  - Fine-grained configuration for data replication.
  - Externally consistent reads and writes.
  - Globally consistent reads.
- The purposes of cross-datacenter data replication:
  - Global availability;
  - Geographic locality.
- A datacenter consists of one or more zones, and a zone consists of:
  - One zonemaster, which assigns data to spanservers;
  - Up to thousands of spanservers, which serve data to clients;
  - Location proxies, which clients use to locate the spanservers holding their data.
- A directory, as the unit of data placement, is a set of contiguous keys that share a common prefix.
- A tablet, as the data managed by a Paxos state machine, may consist of multiple directories so that data frequently accessed together could be colocated.
- Along with a bag of timestamped (multi-versioned) key-value mappings, the metadata and log of the corresponding Paxos state machine are also stored in the tablet.
- A Paxos group is the set of replicas of the same tablet across different spanservers.
- A spanserver may serve 100–1000 tablets, each belonging to a different Paxos group.
- The TrueTime API exposes clock uncertainty explicitly: each query for the current time returns an interval, from TT.now().earliest to TT.now().latest, that is guaranteed to contain the absolute time at which the query was made.
- TrueTime uses both GPS and atomic clocks, which have different failure modes.
- Two-phase locking (2PL) is used for read-write transactions, because optimistic concurrency control would perform poorly for long-lived transactions in the presence of conflicts.
- Lock-free snapshot isolation is used for read-only transactions, which execute at a system-chosen timestamp and can proceed on any replica that is sufficiently up-to-date.
- Each Paxos leader maintains a lock table, which maps ranges of keys to 2PL lock states, and a transaction manager, which coordinates transactions that span multiple Paxos groups.
- For a transaction across multiple Paxos groups, one Paxos group is chosen as the coordinator for the two-phase commit (2PC), and the state of the transaction is replicated within the coordinator group so that the availability issue of the 2PC is mitigated.
- 2PC is skipped if only one Paxos group is involved in a transaction.
- Writes that occur in a transaction are buffered at the client until commit.
- A Paxos leader is long-lived, and has a lease that defaults to 10 seconds.
- Disjointness invariant: for each Paxos group, each Paxos leader’s lease interval is disjoint from every other leader’s.
- External-consistency invariant: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.
- Two rules to guarantee the external-consistency invariant:
  - Start: the coordinator leader assigns a commit timestamp no less than TT.now().latest, evaluated after the commit request arrives;
  - Commit wait: the coordinator leader does not apply the commit, or let clients see the committed data, until TT.now().earliest exceeds the commit timestamp.
- The commit timestamp assigned by the coordinator leader must be no less than any of the prepare timestamps assigned by the leaders of the participant groups.
- A read-only transaction can simply read at the timestamp TT.now().latest, which avoids a round of negotiation among the Paxos groups involved.
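
The TrueTime interval, the start rule, and the commit-wait rule above can be illustrated with a small Python sketch. This is a toy model rather than Spanner's implementation: the names `TTInterval`, `TrueTime`, `commit`, and `read_timestamp` are hypothetical, the uncertainty bound `epsilon` is an arbitrary constant instead of a value derived from GPS and atomic clocks, and Paxos, locking, and 2PC messaging are omitted entirely.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    """Interval assumed to contain the true current time."""
    earliest: float
    latest: float


class TrueTime:
    """Toy stand-in for TrueTime (hypothetical, not Spanner's API).

    Assumes the true time lies within +/- epsilon of the local clock.
    """

    def __init__(self, epsilon=0.005, clock=time.monotonic):
        self.epsilon = epsilon
        self.clock = clock

    def now(self):
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only once t has definitely passed.
        return self.now().earliest > t


def commit(tt, prepare_timestamps=(), sleep=time.sleep):
    """Pick a commit timestamp and perform the commit wait.

    Start rule: the timestamp is no less than TT.now().latest, and no
    less than any participant's prepare timestamp.
    Commit wait: return only once TT.now().earliest exceeds it.
    """
    s = max([tt.now().latest, *prepare_timestamps])
    while not tt.after(s):
        sleep(tt.epsilon / 10)
    return s


def read_timestamp(tt):
    """Timestamp for a read-only transaction: TT.now().latest."""
    return tt.now().latest


if __name__ == "__main__":
    tt = TrueTime()
    s1 = commit(tt)  # T1 commits (waits roughly 2 * epsilon)
    s2 = commit(tt)  # T2 starts after T1 has committed
    print("external consistency holds:", s2 > s1)
```

Because `commit` returns only after `TT.now().earliest` has passed the chosen timestamp, any transaction that starts afterwards sees a `TT.now().latest` that is strictly larger, which is how the two rules combine to give the external-consistency invariant in this model. The cost is a wait of roughly twice the clock uncertainty per commit.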