[Paper Notes] The Tail at Scale

San Francisco, May 15, 2022
  • Just as fault-tolerant computing aims to create a reliable whole out of less-reliable parts, a tail-tolerant system aims to create a predictably responsive whole out of less-predictable parts.
  • Variability of response time comes from resource sharing (among requests, among applications or among machines), background daemons, queuing, performance variability of hardware and so forth.
  • Large-scale online services parallelize a request by fanning it out from a root to a large number of leaf servers and merging the responses via a request-distribution tree. This reduces latency, but it also magnifies the latency variability of individual components at the service level, because the root has to wait for its slowest leaf.
  • Reduce component variability by
    - Differentiating service classes and keeping low-level queues short, so that higher-level queuing policies take effect more quickly.
    - Reducing head-of-line blocking via time-slicing.
    - Synchronizing background activities across many machines.
  • Hedged requests: send a second request after the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests, so that the additional load is limited to ~5% while the latency tail is substantially shortened.
  • Tied requests and cross-server cancellation: enqueue copies of a request in multiple servers simultaneously and allow servers to communicate updates on the status of these copies to each other.
  • Micro-partitions: generate many more partitions than there are machines in the service, and then do dynamic assignment and load balancing of these partitions to particular machines.
  • Selective replication: create copies of items that are likely to cause load imbalance and use the additional replicas to spread the load of these hot micro-partitions across multiple machines.
  • Latency-induced probation: put slow machines on probation while continuing to issue shadow requests to them to collect latency statistics, and reincorporate them when the problem abates.
  • Canary requests for preventing correlated crashes: a root server first sends a request to one or two leaf servers, and sends it to the remaining servers only if a successful response is received within a reasonable period of time.
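
The magnification effect of fan-out noted above can be checked with a little arithmetic. This is a minimal sketch that assumes leaf latencies are independent and that the root waits for every leaf; the function name and parameters are made up for illustration.

```python
def prob_request_is_slow(p_leaf_slow: float, fanout: int) -> float:
    """Chance that at least one of `fanout` leaves is slow.

    Assumes leaf latencies are independent, which real clusters
    only approximate.
    """
    return 1.0 - (1.0 - p_leaf_slow) ** fanout


# Each leaf is slow 1% of the time; the root waits for every leaf.
for fanout in (1, 10, 100):
    print(fanout, round(prob_request_is_slow(0.01, fanout), 3))
```

With a 1% slow rate per leaf and a fan-out of 100, this comes out to roughly 63% of requests being slow, which is the figure the paper cites.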

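The hedged-request idea from the notes above could look roughly like the following with asyncio. This is only a sketch under assumptions: `hedged_call`, `send`, and the replica indices are hypothetical names, and `hedge_delay` stands in for the 95th-percentile expected latency of the request class, which a real system would have to measure.

```python
import asyncio


async def hedged_call(send, hedge_delay: float, replicas=(0, 1)):
    """Send to the first replica; if no reply arrives within
    `hedge_delay` seconds, also send to the second replica and
    return whichever reply arrives first."""
    primary = asyncio.create_task(send(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()  # fast path: no extra load generated

    # The primary is slower than expected, so issue the hedge.
    backup = asyncio.create_task(send(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the request that lost the race
    # Note: if the first task to finish failed, its exception is raised here.
    return done.pop().result()


async def _demo():
    calls = []

    async def fake_send(replica):
        # Hypothetical backend: replica 0 is slow, replica 1 is fast.
        calls.append(replica)
        await asyncio.sleep(0.5 if replica == 0 else 0.01)
        return f"reply from replica {replica}"

    reply = await hedged_call(fake_send, hedge_delay=0.05)
    return reply, calls


result, calls = asyncio.run(_demo())
print(result, calls)
```

In the demo the primary is deliberately slow, so the hedge fires and the backup replica answers first. When the primary answers within `hedge_delay`, no second request is sent, which is what keeps the extra load small.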

Devin Z

Seek truth and solve problems