[Paper Notes] TAO: Facebook’s Distributed Data Store for the Social Graph
A look-through cache that powers a social network.
- Aggregation and filtering for items from the social graph are performed each time the content is viewed (i.e. “pull” instead of “push”), because results are customized per viewer at fine granularity.
- Fundamental problems with the look-aside cache:
- A key-value cache is inefficient for operations on edge lists.
- Distributed control logic increases the number of failure modes and makes it difficult to avoid thundering herds.
- Maintaining read-after-write consistency is expensive because the look-aside cache is unaware of the social-graph data model.
- TAO is optimized heavily for reads, and explicitly favors efficiency and availability over consistency.
- Scale:
- Thousands of machines across multiple geographical locations.
- Billions of reads per second.
- Millions of writes per second.
- TAO data model:
- Objects are typed and identified by a 64-bit integer that is unique across all objects regardless of type.
- Associations are identified by the (1) source object, (2) association type and (3) destination object.
- At most one association of a given type can exist between any two objects.
- Each association has a 32-bit time field, which is used in queries specifying time ranges.
- Actions that happen at most once or record state transitions are modeled by associations.
- Repeatable actions are better represented by objects.
- An association commonly has a tightly coupled inverse.
- TAO APIs:
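A minimal in-memory sketch of the object/association model and a few of the association queries. Class names, method signatures, and the `GraphStore` helper are illustrative, not the real API:

```python
# Toy sketch of TAO's object/association model. All names here are
# illustrative; this is not the actual TAO API surface.
from dataclasses import dataclass, field

@dataclass
class TaoObject:
    id: int                      # 64-bit id, unique across all object types
    otype: str
    data: dict = field(default_factory=dict)

@dataclass
class Assoc:
    id1: int                     # source object
    atype: str                   # association type
    id2: int                     # destination object
    time: int                    # 32-bit time field, used by range queries
    data: dict = field(default_factory=dict)

class GraphStore:
    def __init__(self):
        self.objects = {}
        # At most one association of a given type can exist between two
        # objects, so a dict keyed by (id1, atype, id2) models the
        # uniqueness constraint directly.
        self.assocs = {}

    def assoc_add(self, id1, atype, id2, time, data=None):
        self.assocs[(id1, atype, id2)] = Assoc(id1, atype, id2, time, data or {})

    def assoc_count(self, id1, atype):
        return sum(1 for (a, t, _) in self.assocs if a == id1 and t == atype)

    def assoc_range(self, id1, atype, pos, limit):
        # Association lists are ordered newest-first by the time field.
        matches = sorted((x for x in self.assocs.values()
                          if x.id1 == id1 and x.atype == atype),
                         key=lambda x: -x.time)
        return matches[pos:pos + limit]

g = GraphStore()
g.assoc_add(1, "LIKES", 2, time=100)
g.assoc_add(1, "LIKES", 3, time=200)
g.assoc_add(1, "LIKES", 3, time=300)   # overwrites: at most one per triple
assert g.assoc_count(1, "LIKES") == 2
assert [a.id2 for a in g.assoc_range(1, "LIKES", 0, 10)] == [3, 2]
```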
- CRUD on objects.
- Mutations on associations: creation, deletion, change of type.
- Queries on associations: search, count, pagination.
- Storage layer:
- MySQL is adopted over more expressive NoSQL alternatives, taking into account data accesses that bypass the TAO APIs (e.g. bulk migrations).
- Data is split into far more shards than there are servers so that load can be balanced.
- The shard id of an object is embedded in its object id.
- An association is stored on the shard of its source object.
- Cache layer:
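The sharding scheme above (shard id embedded in the object id, associations colocated with their source object) can be sketched as follows; the bit layout is an assumption for illustration, not the actual encoding:

```python
# Hypothetical id layout: low bits of the 64-bit object id carry the shard.
# TAO embeds the shard id in the object id; the exact layout is an assumption.
SHARD_BITS = 16

def make_object_id(shard_id: int, seq: int) -> int:
    return (seq << SHARD_BITS) | shard_id

def shard_of(object_id: int) -> int:
    return object_id & ((1 << SHARD_BITS) - 1)

def shard_of_assoc(id1: int, id2: int) -> int:
    # An association lives on the shard of its *source* object, so all of
    # id1's outgoing edges of a type can be queried on a single shard.
    return shard_of(id1)

oid = make_object_id(shard_id=7, seq=12345)
assert shard_of(oid) == 7
assert shard_of_assoc(oid, make_object_id(3, 99)) == 7
```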
- A cache tier is a set of cache servers collectively capable of serving any TAO request.
- The cache contains objects, (partial) association lists, and association counts.
- The cache is filled on demand and uses the LRU eviction policy.
- Updates on inverse edges are not atomic and may need repairs by asynchronous jobs.
- Two-level caching:
- Large tiers are prone to hotspots and have a quadratic growth in all-to-all connections.
- Members of the leader tier read from and write to the storage layer.
- Members of the follower tier forward read misses and writes to a leader (in the local region).
- Clients talk to the closest follower tier and may fail over to another nearby follower tier.
- The follower that issued the write is updated synchronously on reply from the leader, while other followers are updated on receiving asynchronous messages (invalidation or refill) from the leader.
- The single leader serializes concurrent writes and protects the database from thundering herds.
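The leader/follower protocol above can be sketched as below. Class and method names are illustrative; in real TAO the invalidation/refill messages are asynchronous, which is flattened here for clarity:

```python
# Sketch of two-level caching: followers serve hits locally and forward
# misses/writes to the shard's single leader, which talks to the database
# and fans out invalidations to the other followers.
class Leader:
    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.followers = []

    def read(self, key):
        if key not in self.cache:           # one leader per shard means at
            self.cache[key] = self.db[key]  # most one DB fill (no herd)
        return self.cache[key]

    def write(self, key, value, from_follower):
        self.db[key] = value
        self.cache[key] = value
        for f in self.followers:
            if f is not from_follower:
                f.invalidate(key)           # asynchronous in real TAO

class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.cache = {}
        leader.followers.append(self)

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.leader.read(key)   # forward the miss
        return self.cache[key]

    def write(self, key, value):
        self.leader.write(key, value, from_follower=self)
        self.cache[key] = value             # synchronous update on reply

    def invalidate(self, key):
        self.cache.pop(key, None)

db = {"u:1": "alice"}
leader = Leader(db)
f1, f2 = Follower(leader), Follower(leader)
assert f2.read("u:1") == "alice"
f1.write("u:1", "bob")
assert f2.read("u:1") == "bob"   # f2 was invalidated, refetches via leader
```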
- Multi-region configuration:
- Network round-trip times become the bottleneck once data centers span distant geographical regions.
- Read misses by followers are 25 times as frequent as writes.
- Writes are forwarded by the local leader to the leader in the master region.
- Read misses are serviced locally so that read latency is independent of inter-region latency.
- In theory each shard can be mastered in a different region, but locating all master databases in a single region is preferred because a write and the write on its inverse association may span two shards.
- TAO and memcache use the same pipeline to synchronize the delivery of invalidations and refills with the database replication stream.
- The local leader that forwards a write may update its value before the replication stream arrives.
- Under normal operations, TAO provides eventual consistency across all tiers and read-after-write consistency within a single tier.
- Stronger consistency can be provided for reads marked as critical (e.g. authentication) by proxying them to the master region.
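The routing choice for critical reads can be sketched as below; the names and the two-region setup are illustrative:

```python
# Sketch: normal reads hit the closest (possibly stale) follower; reads
# marked critical are proxied to the master region, paying an inter-region
# round trip in exchange for fresher data.
class Region:
    def __init__(self, store):
        self.store = store

    def read(self, key):
        return self.store.get(key)

master = Region({"session:42": "valid"})
local = Region({})                  # replica lagging behind the master

def tao_read(key, critical=False):
    region = master if critical else local
    return region.read(key)

assert tao_read("session:42") is None                     # stale replica
assert tao_read("session:42", critical=True) == "valid"   # proxied read
```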
- TAO avoids head-of-line blocking on multiplexed connections by using a protocol with out-of-order responses.
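The idea behind out-of-order responses can be sketched with id-tagged framing; the JSON encoding here is an assumption for illustration, not TAO's wire protocol:

```python
# Sketch of avoiding head-of-line blocking on a multiplexed connection:
# each request carries an id, so responses can arrive in any order and a
# slow request never delays delivery of a fast one.
import json

def encode_request(req_id: int, query: str) -> bytes:
    return json.dumps({"id": req_id, "q": query}).encode()

def match_responses(pending: dict, raw_responses: list) -> dict:
    """Pair out-of-order responses with their pending requests by id."""
    results = {}
    for raw in raw_responses:
        msg = json.loads(raw)
        results[pending[msg["id"]]] = msg["result"]
    return results

# The slow query (id=2) was sent first, but the fast response (id=1) is
# delivered as soon as it arrives.
pending = {2: "slow_query", 1: "fast_query"}
wire = [json.dumps({"id": 1, "result": "a"}).encode(),
        json.dumps({"id": 2, "result": "b"}).encode()]
assert match_responses(pending, wire) == {"fast_query": "a", "slow_query": "b"}
```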
- RAM is partitioned into arenas so that object and association types with different access patterns are isolated from one another.
- An object is serialized into a single MySQL table column.
- How to alleviate hotspots in cache sharding?
- Shard cloning: load on a hot shard is balanced across multiple followers in a tier.
- For extremely hot items, the client caches the data together with a version number once the access rate exceeds a threshold; the server can then answer with the version alone when the data is unchanged.