[Paper Notes] TAO: Facebook’s Distributed Data Store for the Social Graph

Devin Z
4 min readNov 5, 2022

A look-through cache that powers a social network.

Sunnyvale, November 3, 2022
  • Aggregation and filtering for items from the social graph is performed each time the content is viewed (i.e. “pull” instead of “push”) due to fine-grained customization.
  • Fundamental problems with the look-aside cache:
    - A key-value cache is inefficient for operations on edge lists.
    - Distributed control logic increases the number of failure modes and makes it difficult to avoid thundering herds.
    - Maintaining read-after-write consistency is expensive since it doesn’t employ the data model of the social graph.
  • TAO is optimized heavily for reads, and explicitly favors efficiency and availability over consistency.
  • Scale:
    - Thousands of machines across multiple geographical locations.
    - Billions of reads per second.
    - Millions of writes per second.
  • TAO data model:
    - Objects are typed and identified by a 64-bit integer that is unique across all objects regardless of type.
    - Associations are identified by the (1) source object, (2) association type and (3) destination object.
    - At most one association of a given type can exist between any two objects.
    - Each association has a 32-bit time field, which is used in queries specifying time ranges.
    - Actions that happen at most once or record state transitions are modeled by associations.
    - Repeatable actions are better represented by objects.
    - An association commonly has a tightly coupled inverse.
  • TAO APIs:
    - CRUD on objects.
    - Mutations on associations: creation, deletion, change on type.
    - Queries on associations: search, count, pagination.
  • Storage layer:
    - MySQL is adopted over more expressive non-SQL alternatives, taking account of data accesses that don’t use TAO APIs (e.g. bulk migrations).
    - Data is sharded way more than the number of servers to balance load.
    - The shard id of an object is embedded in its object id.
    - An association is stored on the shard of its source object.
  • Cache layer:
    - A cache tier is a set of cache servers collectively capable of serving any TAO request.
    - The cache contains objects, (partial) association lists, and association counts.
    - The cache is filled on demand and uses the LRU eviction policy.
    - Updates on inverse edges are not atomic and may need repairs by asynchronous jobs.
  • Two-level caching:
    - Large tiers are prone to hotspots and have a quadratic growth in all-to-all connections.
    - Members of the leader tier read from and write to the storage layer.
    - Members of the follower tier forward read misses and writes to a leader (in the local region).
    - Clients talk to the closest follower tier and may fail over to another nearby follower tier.
    - The follower that issued the write is updated synchronously on reply from the leader, while other followers are updated on receiving asynchronous messages (invalidation or refill) from the leader.
    - The single leader serializes concurrent writes and protects the database from thundering herds.
Multi-region TAO configuration (image source: [1])
  • Multi-region configuration:
    - Network round trip times become the bottleneck when geographical locations expand.
    - Read misses by followers are 25 times as frequent as writes.
    - Writes are forwarded by the local leader to the leader in the master region.
    - Read misses are serviced locally so that read latency is independent of inter-region latency.
    - Shards in theory are independent in master region control, but locating all master databases in a single region is preferred when considering writes on inverse associations.
    - TAO and memcache uses the same pipeline for synchronizing delivery of invalidations and refills with the database replication stream.
    - The local leader that forwards a write may update its value before the replication stream arrives.
  • Under normal operations, TAO provides eventual consistency across all tiers and read-after-write consistency within a single tier.
  • Stronger consistency can be provided for reads marked as critical (e.g. authentication) by proxying them to the master region.
  • TAO avoids head-of-line blocking on multiplexed connections by using a protocol with out-of-order responses.
  • RAM is partitioned into arenas to isolate types of different behaviors.
  • An object is serialized into a single MySQL table column.
  • How to alleviate hotspots in cache sharding?
    - Shard cloning: loads for a shard is balanced across multiple followers in a tier.
    - The client caches the data and version if the access rate exceeds a threshold.

--

--