Basics of Distributed Systems
Curated learning notes on computer science topics (1/4).

- Three concerns about a software system:
- Reliability: a system should continue to work correctly even in the face of adversity (hardware faults, software errors or human errors).
- Scalability: a system should maintain reasonable performance as the load increases.
- Maintainability: a system should be easy to operate, simple to understand, and flexible to extend.
- Problems with vertical scaling:
- Super-linear cost growth and even hard limits.
- Limited fault tolerance (no redundancy).
- Motivations for horizontal scaling:
- Higher availability (fault tolerance).
- Higher throughput (scalability).
- Lower latency (performance).
- Leader-based replication:
- Writes go to the leader, while reads go to any replica.
- Synchronous replication has poor performance and fault tolerance.
- Asynchronous replication sacrifices durability (a crashed leader may lose writes that were not yet replicated) but is widely used.
- Problems with failover:
- Risks of split brain.
- Durability and consistency issues due to an out-of-date new leader.
- Difficulty in setting the timeout for failure detection.
- How to implement replication logs:
- Statement-based replication (prone to nondeterminism issues).
- Trigger-based replication (prone to bugs, greater overheads).
- Write-ahead log shipping (closely coupled with the storage engine).
- Logical (row-based) log replication.
- Solutions for read-after-write consistency:
- Read a thing from the leader if the user may have modified it (e.g. one’s own profile).
- Read a thing from the leader if it was recently modified by any user.
- Read anything from the leader if the user recently wrote something (doesn't work across multiple devices of the same user); see the sketch below.
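A minimal sketch of the last approach, assuming an application-level router in front of one leader and several followers; the names (`record_write`, `pick_replica`) and the one-second lag bound are hypothetical:

```python
import time

# Assumed upper bound on replication lag (hypothetical value): reads within
# this window of a user's own write are pinned to the leader.
REPLICATION_LAG_BOUND_S = 1.0

_last_write_at: dict[str, float] = {}

def record_write(user_id: str) -> None:
    """Call this after routing a write to the leader."""
    _last_write_at[user_id] = time.monotonic()

def pick_replica(user_id: str, leader: str, followers: list[str]) -> str:
    """Route a read: the leader if this user wrote recently, else a follower."""
    wrote_at = _last_write_at.get(user_id)
    if wrote_at is not None and time.monotonic() - wrote_at < REPLICATION_LAG_BOUND_S:
        return leader
    return followers[0]  # any follower-selection policy works here
```

As the note says, tracking the last write in one client process does not carry over to the user's other devices; that would require storing `_last_write_at` somewhere shared.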
- Monotonic reads:
- A user should never see things moving backward in time.
- One solution is to make sure that a user always reads from the same replica (sketch below).
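One hedged way to implement that pinning is to hash the user ID to a replica, so repeated reads by the same user land on the same copy. This sketch ignores replica failure, which forces rerouting and can briefly break monotonicity:

```python
import hashlib

def replica_for_user(user_id: str, replicas: list[str]) -> str:
    """Deterministically map a user to one replica for all of their reads."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]
```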
- Consistent prefix reads:
- Users should see the data in a state that makes causal sense, even if reads are from different partitions.
- One solution is to keep track of causal dependencies.
- Use cases of multi-leader replication:
- Multi-datacenter operation.
- Clients with offline operations.
- Collaborative editing.
- Automatic conflict resolution technologies:
- Conflict-free replicated data types (CRDTs); see the sketch after this list.
- Mergeable persistent data structures.
- Operational transformation (e.g. Google Docs).
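As a taste of CRDTs, here is a minimal grow-only counter (G-Counter): each replica increments only its own entry, merge takes the elementwise maximum, and the value is the sum, so replicas converge regardless of the order in which updates arrive.

```python
class GCounter:
    """Grow-only counter CRDT: increments commute, merges are idempotent."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # Each node only ever bumps its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Elementwise max makes merging commutative, associative, idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```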
- Leaderless replication (Dynamo-style):
- Read repair: when a client's read detects stale replicas, it writes the newer value back.
- Anti-entropy process: some background process constantly diffs replicas and copies missing values.
- Strict quorum (r + w > n) doesn’t guarantee linearizability (for example, in the case of concurrent writes).
- Sloppy quorum with hinted handoff can be used to increase write availability.
- Use version vectors to capture the "happens-before" relationship and detect concurrency (comparison sketch below).
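A minimal comparison of version vectors, represented here as plain dicts from node ID to counter: one write happened before another iff its vector is dominated by the other's, and if neither dominates, the writes are concurrent and must be merged or resolved.

```python
def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if version vector b has seen every event recorded in a."""
    return all(count <= b.get(node, 0) for node, count in a.items())

def compare(a: dict[str, int], b: dict[str, int]) -> str:
    if a == b:
        return "equal"
    if dominates(a, b):
        return "a happened before b"
    if dominates(b, a):
        return "b happened before a"
    return "concurrent"  # a genuine conflict: neither saw the other's write

# Two clients updated different replicas starting from the same ancestor:
print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}))  # concurrent
```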
- Strategies for partitioning data:
- Partitioning by key range.
- Partitioning by hash of key.
- One solution for hot spots is to concatenate a hot key with a random number.
- Consistent hashing: place servers (as multiple virtual nodes) and partitions on a hash ring, assigning each partition to the next server point clockwise, so that only a small range of partitions needs to be remapped when a server is added or removed (sketch below).
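A sketch of such a ring, assuming SHA-256 as the hash and 100 virtual nodes per server (both arbitrary choices):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, servers: list[str], vnodes: int = 100):
        # Each server owns many points on the ring (virtual nodes), which
        # evens out the load and smooths rebalancing.
        self._points = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )

    def server_for(self, key: str) -> str:
        # A key belongs to the first server point at or after its hash,
        # wrapping around the end of the ring.
        i = bisect.bisect_left(self._points, (_hash(key), ""))
        return self._points[i % len(self._points)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.server_for("user:42"))  # stable unless nearby ring points change
```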
- Secondary indexes for partitioned data:
- Local indexes (scatter-gather pattern, prone to tail latency amplification).
- Global indexes (optimized for reads but not for writes).
- Requirements for partition rebalancing:
- Even distribution.
- No downtime.
- Minimum data transfer.
- Strategies for rebalancing:
- Create a fixed number of partitions up front and assign several partitions to each node.
- Dynamically split a partition when it’s oversized.
- Make the number of partitions proportional to the number of nodes.
- Approaches to request routing (service discovery):
- Partition-aware client.
- Peer-to-peer forwarding (e.g. via gossip protocols).
- Separate routing tier (e.g. ZooKeeper, etcd).
- Models of distributed systems:
- Node behavior: (1) crash-stop, (2) crash-recovery, (3) Byzantine.
- Network behavior: (1) reliable, (2) fair-loss, (3) arbitrary (active adversary).
- Timing: (1) synchronous, (2) partially synchronous, (3) asynchronous.
- Transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.
- Serializability is an isolation property of multi-object transactions, while linearizability (strong consistency) is a recency guarantee on the reads and writes of a single object (with multiple copies).
- The FLP impossibility: no deterministic algorithm can guarantee consensus in an asynchronous system if even one node may crash.
- The CAP theorem: strong consistency and total availability cannot be simultaneously achieved if network partitions are present.
- Downsides of strong consistency:
- Performance overhead.
- The leader can be a bottleneck for scalability.
- Limited availability (quorum).
- The Lamport timestamp provides a total order that is consistent with causality, but it is not sufficient to implement a uniqueness constraint, since a node cannot know when that order is final (sketch below).
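A minimal Lamport clock sketch: increment before each local event or send, jump past the incoming timestamp on receive, and break ties by node ID to get a total order consistent with causality.

```python
class LamportClock:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.time = 0

    def tick(self) -> tuple[int, str]:
        """Local event or message send: advance and stamp."""
        self.time += 1
        return (self.time, self.node_id)

    def on_receive(self, sender_time: int) -> tuple[int, str]:
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, sender_time) + 1
        return (self.time, self.node_id)

# Timestamps are (counter, node_id) pairs; plain tuple comparison yields the
# total order. A node still cannot tell whether a smaller timestamp is yet
# to arrive, which is why uniqueness constraints need consensus instead.
```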
- Causal broadcast: the message delivery order on each node is consistent with causality.
- Total order broadcast: messages are delivered to every node in the same order.
- Two-phase commit (2PC):
- The coordinator sends to all the participants a prepare request in phase I, and sends either a commit or abort request in phase II.
- In-doubt transaction: if the coordinator crashes or the network fails after a participant has replied "yes" to a prepare request, the participant must wait indefinitely for the decision, holding locks that may block other transactions.
- Nonblocking atomic commit requires a perfect failure detector, which is impossible in a network with unbounded delay.
- XA coordinates 2PC across heterogeneous data systems.
- 2PC provides atomic commit in a distributed transaction, whereas 2PL (two-phase locking) provides serializable isolation; a coordinator sketch follows.
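A toy sketch of the coordinator's two phases, with a hypothetical in-memory participant; a real coordinator must durably log its decision before phase II so it can finish after a crash.

```python
class Participant:
    """Hypothetical in-memory participant; real ones stage writes durably."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name, self.can_commit, self.state = name, can_commit, "init"

    def prepare(self) -> bool:
        # Phase I vote: after voting yes, the participant is in doubt and
        # must hold its locks until the coordinator's decision arrives.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"

def two_phase_commit(participants: list[Participant]) -> bool:
    votes = [p.prepare() for p in participants]   # phase I: collect votes
    decision = all(votes)                         # any "no" aborts everyone
    for p in participants:                        # phase II: broadcast it
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```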
- Any fault-tolerant consensus algorithm requires a strict majority (more than half) of the nodes to function correctly, and more than two-thirds of the nodes to be free of Byzantine faults.
- Fault-tolerant consensus algorithms:
- Multi-Paxos.
- Raft.
- Zab.
- Viewstamped Replication.
- What a reverse proxy (web server) does:
- Authentication.
- Rate limiting (throttling).
- SSL termination.
- Response compression.
- Static content caching.
- Two types of load balancers:
- TCP load balancer (lower latency, fewer resources).
- HTTP load balancer (more flexibility).
- When doing load balancing:
- Optimize latency and throughput at the global level.
- Optimize resource utilization within a data center.
- Distributed cache:
- Cache-aside enables caching at the application level, decoupled from DB schemas and queries.
- In contrast with write-through caching, write-back caching trades durability for lower write latency.
- Update or invalidate a cache-aside entry only after the persistent store has been updated.
- Keep promises instead of values in the cache to prevent the thundering herd problem (sketch below).
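A sketch of promise-based caching with Python futures (the in-process cache and `load_from_db` callback are hypothetical): the first miss installs a `Future` and loads the value, while concurrent readers block on that same `Future` instead of stampeding the database.

```python
import threading
from concurrent.futures import Future

_cache: dict[str, Future] = {}
_lock = threading.Lock()

def cached_get(key: str, load_from_db):
    with _lock:
        promise = _cache.get(key)
        is_loader = promise is None
        if is_loader:
            promise = _cache[key] = Future()
    if is_loader:
        try:
            promise.set_result(load_from_db(key))  # only one DB hit per key
        except Exception as exc:
            promise.set_exception(exc)
            with _lock:
                _cache.pop(key, None)  # allow a later call to retry
    return promise.result()  # other callers block here until it resolves
```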
- Three types of systems:
- Services (online systems).
- Batch processing systems (offline systems).
- Stream processing systems (near-real-time systems).
- The service-oriented architecture (SOA, a.k.a. microservices) makes services independently deployable and evolvable.
- Dataflow through services:
- RPC exploits the efficiency of binary encodings (e.g. Protocol Buffers) and is mostly used within the same organization.
- REST web services are generally based on JSON and are predominant for public services.
- SOAP web services are based on XML and have fallen out of favor outside large enterprises.
- Rolling upgrades (a.k.a. staged rollouts) allow deployment with no downtime and thus encourage more frequent releases and better evolvability.
- Backward compatibility means new code can read data written by old code, and forward compatibility means old code can read data written by new code.
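An illustration with JSON-like records (field names hypothetical): adding an optional field with a default preserves both directions, because new code tolerates its absence and old code ignores its presence.

```python
old_record = {"id": 1, "name": "alice"}                          # old schema
new_record = {"id": 2, "name": "bob", "email": "b@example.com"}  # new schema

def read_new_code(record: dict):
    # Backward compatibility: the added field must be optional with a default.
    return record["id"], record["name"], record.get("email")

def read_old_code(record: dict):
    # Forward compatibility: unknown fields are simply ignored, not rejected.
    return record["id"], record["name"]

assert read_new_code(old_record) == (1, "alice", None)
assert read_old_code(new_record) == (2, "bob")
```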
- Advantages of a message broker:
- Decouple the sender from the recipient.
- Buffer the data if the recipient is unavailable or overloaded.
- Enable fan-out and load balancing. - Log-based message brokers combine the durable storage of databases and the low-latency notification facilities of messaging. They are suited for situations where (1) the message throughput is high, (2) each message is fast to process and (3) message ordering is important, compared with traditional message brokers.
- Some production practices:
- Use N+2 configuration to tolerate both a planned outage and an unplanned outage.
- Use randomized exponential backoff with limited retries to avoid positive feedback loops between clients and servers (see the sketch after this list).
- Slowly increase load on a new cluster to warm up the cold cache.
- Golden signals of monitoring: errors, latency, traffic, and saturation.
- Keep granular metrics in a short period of time for diagnostic purposes and roll them up for long-term growth planning.
- Design configuration options for the human operators who will use them to change system behavior.
- Canary: a partial, time-limited deployment of a change in a service, followed by an evaluation of metrics attributable to the service's SLIs.
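A sketch of the backoff practice mentioned above, using "full jitter" (each delay drawn uniformly from zero up to a capped exponential); the parameter values are arbitrary.

```python
import random
import time

def call_with_backoff(request, max_retries: int = 5,
                      base_delay: float = 0.1, max_delay: float = 10.0):
    """Retry with capped, jittered exponential backoff so that many failing
    clients do not retry in synchronized waves."""
    for attempt in range(max_retries):
        try:
            return request()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```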