Basics of Distributed Systems

Devin Z
7 min read · Jun 6, 2022


Curated learning notes on computer science topics (1/4).

Municipal Rose Garden, May 14, 2022
  • Three concerns about a software system:
    - Reliability: a system should continue to work correctly even in the face of adversity, such as hardware faults, software errors or human errors.
    - Scalability: a system should maintain reasonable performance as the load increases.
    - Maintainability: a system should be easy to operate, simple to understand and flexible to extend.
  • Problems with vertical scaling:
    - Super-linear cost growth and even hard limits.
    - Limited fault tolerance (no redundancy).
  • Motivations of horizontal scaling:
    - Higher availability (fault tolerance).
    - Higher throughput (scalability).
    - Lower latency (performance).
  • Leader-based replication:
    - Writes go to the leader, while reads go to any replica.
    - Synchronous replication has poor performance and fault tolerance.
    - Asynchronous replication weakens durability (recent writes can be lost if the leader fails) but is widely used.
  • Problems with failover:
    - Risks of split brain.
    - Durability and consistency issues due to an out-of-date new leader.
    - Difficulty in setting the timeout for failure detection.
  • How to implement replication logs:
    - Statement-based replication, which is prone to nondeterminism issues.
    - Trigger-based replication, which is prone to bugs and has greater overheads.
    - Write-ahead log shipping, which is closely coupled with the storage engine.
    - Logical (row-based) log replication.
  • Solutions for read-after-write consistency:
    - Read an item from the leader if the current user may have modified it (e.g. one’s own profile).
    - Read an item from the leader if it was recently modified by any user.
    - Read everything from the leader if the user recently wrote anything (this doesn’t work across multiple devices of the same user).
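A minimal sketch of the third option in Python, assuming the application tracks a per-user last-write timestamp and an upper bound on replication lag (both names are illustrative):

```python
import random
import time

REPLICATION_LAG_BOUND = 1.0   # assumed upper bound on replication lag, in seconds
last_write_at = {}            # hypothetical per-user record of the last write time

def record_write(user_id):
    last_write_at[user_id] = time.time()

def pick_server_for_read(user_id, leader, followers):
    """Route to the leader if the user wrote recently; otherwise any follower will do."""
    recently_wrote = time.time() - last_write_at.get(user_id, 0.0) < REPLICATION_LAG_BOUND
    return leader if recently_wrote else random.choice(followers)
```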
  • Monotonic reads:
    - A user should never see things moving backward in time.
    - One solution is to make sure that a user always makes reads from the same replica.
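A minimal sketch of the same-replica approach: hashing the user ID pins each user to one replica (the replica names are illustrative).

```python
import hashlib

def replica_for_user(user_id: str, replicas: list) -> str:
    """Pin each user to one replica so successive reads never move backward in time."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

# replica_for_user("alice", ["replica-1", "replica-2", "replica-3"]) is deterministic
```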
  • Consistent prefix reads:
    - Users should see the data in a state that makes causal sense, even if reads are from different partitions.
    - One solution is to keep track of causal dependencies.
  • Use cases of multi-leader replication:
    - Multi-datacenter operation.
    - Clients with offline operations.
    - Collaborative editing.
  • Automatic conflict resolution technologies:
    - Conflict-free replicated datatypes (CRDTs).
    - Mergeable persistent data structures.
    - Operational transformations for ordered items (e.g. Google Docs).
  • Leaderless replication (Dynamo-style):
    - Read repair: conflicts are resolved by the client on query.
    - Anti-entropy process: some background process constantly diffs replicas and copies missing values.
    - Strict quorum (r + w > n) doesn’t guarantee linearizability (for example, in the case of concurrent writes).
    - Sloppy quorum with hinted handoff can be used to increase write availability.
    - Use version vectors to capture “happens-before” relationship and concurrency.
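A small illustration of why a strict quorum (w + r > n) forces every read set to overlap with at least one node that took the write (it shows the overlap property only, not linearizability):

```python
from itertools import combinations

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Check that every possible write set intersects every possible read set."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

assert quorums_overlap(n=3, w=2, r=2)        # strict quorum: overlap guaranteed
assert not quorums_overlap(n=3, w=1, r=2)    # w + r <= n: a read can miss the latest write
```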
  • How do version vectors work?
    - A version vector is a collection of version numbers from all the replicas for a key.
    - A client merges concurrent values (siblings) and passes back the version vector it previously read when issuing a write, so the server knows which values the write supersedes.
    - A server automatically overwrites an old value when two values are not concurrent.
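A minimal sketch of comparing two version vectors, represented here as dicts mapping replica IDs to counters (an assumed representation):

```python
def compare(a: dict, b: dict) -> str:
    """Return 'a<b', 'b<a', 'equal', or 'concurrent' for two version vectors."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<b"         # a happened before b, so b can safely overwrite a
    if b_le_a:
        return "b<a"
    return "concurrent"      # siblings: keep both and let the client merge

# compare({"r1": 2, "r2": 1}, {"r1": 2, "r2": 3})  -> 'a<b'
# compare({"r1": 2}, {"r2": 1})                    -> 'concurrent'
```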
  • Strategies for partitioning data:
    - Partitioning by key range.
    - Partitioning by hash of key.
    - One solution for hot spots is to concatenate a hot key with a random number (i.e. further sharding the key).
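A minimal sketch of hash partitioning plus hot-key salting; the partition count, salt count and key names are illustrative:

```python
import hashlib
import random

NUM_PARTITIONS = 8
HOT_KEY_SALTS = 10   # assumed number of sub-keys used to spread one known hot key

def partition_of(key: str) -> int:
    """Map a key to a partition by hash, not by raw key range."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def salted_key(hot_key: str) -> str:
    """Append a random suffix so writes to the hot key spread across partitions;
    readers must query all sub-keys and merge the results."""
    return f"{hot_key}#{random.randrange(HOT_KEY_SALTS)}"

# partition_of(salted_key("celebrity-timeline"))  # lands on up to 10 different partitions
```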
  • Secondary indexes for partitioned data:
    - Local indexes: the scatter-gather pattern, which is prone to tail latency amplification.
    - Global indexes: efficient reads at the price of fan-out writes.
  • Requirements for partition rebalancing:
    - Even distribution.
    - No downtime.
    - Minimum data transfer.
  • Strategies for rebalancing:
    - Create a fixed number of partitions upfront and adaptively assign partitions to nodes (e.g. consistent hashing).
    - Dynamically split a partition when it’s oversized (e.g. Bigtable).
    - Make the number of partitions proportional to the number of nodes and transfer data between partitions.
  • Consistent hashing: distribute both partitions and servers (virtual nodes) on a hash ring and assign each server to a range of partitions, so that only a small range of partitions needs to be remapped when a server is added or removed.
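A minimal sketch of such a hash ring with virtual nodes, assuming MD5 as the ring hash and a fixed number of virtual nodes per server:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=100):
        # place `vnodes` virtual nodes per server on the ring
        self.ring = sorted((self._hash(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the next virtual node."""
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[idx][1]

# Adding or removing a server only remaps the ranges owned by its own virtual nodes:
# ring = HashRing(["node-a", "node-b", "node-c"]); ring.lookup("user:42")
```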
  • Approaches to request routing (service discovery):
    - partition-aware clients;
    - peer-to-peer forwarding;
    - a separate routing tier.
  • How does a partition-aware client or routing tier keep up with the partition assignment?
    - Using a coordination service, such as ZooKeeper.
    - Using a gossip protocol to disseminate updates (e.g. Dynamo).
  • Models of distributed systems:
    - Node behavior: (1) crash-stop, (2) crash-recovery, (3) Byzantine.
    - Network behavior: (1) reliable, (2) fair-loss, (3) arbitrary (active adversary).
    - Timing: (1) synchronous, (2) partially synchronous, (3) asynchronous.
  • Linearizability (strong consistency) means a system behaves as if there were only one copy of the data and all operations on it were atomic.
    - Linearizability implies total ordering and causal consistency.
    - Serializability is an isolation property of multi-object transactions, while linearizability is a recency guarantee on the reads and writes of a single object with multiple copies.
    - Transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of the replicas in the face of delays and faults.
    - Strong consistency guarantees both integrity and timeliness, while eventual consistency compromises the latter in exchange for availability and performance.
  • The FLP impossibility: no deterministic algorithm can be guaranteed to solve consensus in an asynchronous system if even one node may crash.
  • The CAP theorem: strong consistency and total availability cannot be simultaneously achieved if network partitions are present.
  • Downsides of strong consistency:
    - Performance overhead.
    - The leader can be a bottleneck for scalability.
    - Limited availability (quorum).
  • Lamport timestamps:
    - Lamport timestamps provide a total order that is consistent with causality for events across nodes.
    - Lamport timestamps are insufficient for implementing a uniqueness constraint, since the total order only emerges after the fact: a node cannot immediately tell whether another node is concurrently claiming the same value.
    - Unlike version vectors, Lamport timestamps cannot decide on whether two events are concurrent.
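A minimal sketch of a Lamport clock; pairing the counter with a node ID yields the total order mentioned above:

```python
class LamportClock:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        """Increment on every local event or outgoing message."""
        self.counter += 1
        return (self.counter, self.node_id)

    def on_receive(self, remote_counter: int):
        """Jump past the sender's counter so the order respects causality."""
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

# Timestamps compare as tuples: first by counter, then by node ID as a tie-breaker.
```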
  • Causal broadcast: the message delivery order on each node is consistent with causality.
  • Total order broadcast: messages are delivered to every node in the same order.
    - This is stronger than Lamport timestamps since the order is fixed at the time the messages are delivered.
    - This is weaker than linearizability as it doesn’t guarantee recency for reads.
    - This is equivalent to consensus (atomic commit), and can be used to implement a lock service and state machine replication.
  • Two-phase commit (2PC):
    - The coordinator sends to all the participants a prepare request in phase I, and sends either a commit or abort request in phase II.
    - In-doubt transaction: if the coordinator crashes or the network fails after a participant replied “yes” to a prepare request, the participant must wait indefinitely, and other transactions may be blocked on the locks it holds.
    - Nonblocking atomic commit, such as 3PC, requires a perfect failure detector, which is impossible in a network with unbounded delay.
    - XA is a standard that coordinates 2PC across heterogeneous data systems.
    - 2PC provides atomic commit in a distributed transaction, whereas 2PL (two-phase locking) provides serializable isolation.
    - 2PL ensures there’s a point in time when all the objects involved in a transaction are locked, while 2PC ensures there’s a point in time when all the nodes participating in a distributed transaction are holding locks.
    - Strictly speaking, 2PC does not fulfill the termination property of a consensus algorithm since a transaction cannot proceed unless the coordinator recovers.
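A minimal sketch of the coordinator's decision logic, assuming each participant exposes prepare/commit/abort calls (an illustrative interface); a real coordinator must also write its decision to a durable log before starting phase II:

```python
def two_phase_commit(participants, txn_id) -> bool:
    """Phase I: collect votes. Phase II: commit only if every participant voted yes."""
    try:
        votes = [p.prepare(txn_id) for p in participants]   # phase I
    except Exception:
        votes = [False]                                     # treat a failure as a 'no' vote
    if all(votes):
        # the commit decision must be durably logged before this point
        for p in participants:
            p.commit(txn_id)                                # phase II: commit everywhere
        return True
    for p in participants:
        p.abort(txn_id)                                     # phase II: abort everywhere
    return False
```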
  • Any fault-tolerant consensus algorithm requires more than 1/2 of the nodes to be functioning correctly, and tolerating Byzantine faults requires more than 2/3 of the nodes to be non-faulty.
    - In contrast, a distributed transaction involving multiple partitions requires every participant to acknowledge.
  • Fault-tolerant consensus algorithms:
    - Multi-Paxos.
    - Raft.
    - Zab.
    - Viewstamped Replication.
  • What a reverse proxy (web server) does:
    - Authentication.
    - Rate limiting (throttling).
    - SSL termination.
    - Response compression.
    - Static content caching.
  • Two types of load balancers:
    - TCP load balancer (less latency and resources).
    - HTTP load balancer (more flexibility).
  • When doing load balancing:
    - optimize latency and throughput at the global level;
    - optimize resource utilization within a data center.
  • Distributed cache (e.g. Facebook Memcache):
    - In contrast with look-through, look-aside enables caching at the application level, decoupled from DB schemas and queries.
    - In contrast with write-through, write-back trades durability for lower latency.
    - Invalidate a look-aside cache after the persistent storage is updated.
    - Prevent the thundering herd problem by placing a promise (lease) in the cache entry that is being updated, so that concurrent readers wait for it instead of all querying the database.
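A minimal sketch of the look-aside read path and the invalidate-after-update write path, assuming generic cache and db clients with get/set/delete and query/update methods (illustrative names):

```python
def read(key, cache, db, ttl=60):
    """Look-aside: try the cache, fall back to the database, then populate the cache."""
    value = cache.get(key)
    if value is None:
        value = db.query(key)        # cache miss: read from persistent storage
        cache.set(key, value, ttl)   # populate so the next read is a hit
    return value

def write(key, value, cache, db):
    """Update the database first, then invalidate (rather than update) the cache entry."""
    db.update(key, value)
    cache.delete(key)                # the next read repopulates from the fresh DB state
```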
  • Three types of systems:
    - Services (online systems).
    - Batch processing systems (offline systems).
    - Stream processing systems (near-real-time systems).
  • MapReduce:
    - Mappers are co-partitioned with inputs, and reducers are co-partitioned with outputs.
    - The distributed file system is a uniform interface that allows for diverse storage formats (schema-on-read).
    - Frequent job preemptions in a resource-overcommitted production environment call for granular fault tolerance.
    - MapReduce proactively materializes intermediate results, while dataflow engines (e.g. Spark) track data lineage and recompute data efficiently.
    - MapReduce offers a more general programming model than SQL, albeit sometimes less expressive.
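As an illustration of the programming model, a framework-agnostic word-count sketch (the function names are placeholders, not a real API):

```python
def map_fn(document: str):
    """Mapper: emit (word, 1) pairs; the framework shuffles them to reducers by key."""
    for word in document.split():
        yield word, 1

def reduce_fn(word: str, counts):
    """Reducer: sum the counts for one word and emit the final (word, total) pair."""
    yield word, sum(counts)
```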
  • The service-oriented architecture (SOA, a.k.a. microservices) makes services independently deployable and evolvable.
  • Dataflow through services:
    - RPC exploits the efficiency of binary encodings (e.g. Protocol Buffers) and is mostly used within the same organization.
    - Web services are those based on HTTP, but not necessarily used on the web.
    - REST web services are generally based on JSON and are predominant for public services.
    - SOAP web services are based on XML and have fallen out of favor outside large enterprises.
  • The end-to-end argument: functions such as duplicate suppression and data integrity checks can only be implemented completely and correctly at the application level (end to end).¹
  • Rolling upgrades (a.k.a. staged rollouts) allow deployment with no downtime and thus encourage more frequent releases and better evolvability.
  • Backward compatibility means new code can read data written by old code, and forward compatibility means old code can read data written by new code.
  • Advantages of a message broker:
    - Decouple the sender from the recipient.
    - Buffer the data if the recipient is unavailable or overloaded.
    - Enable fan-out and load balancing.
  • Log-based message brokers (e.g. Kafka) combine the durable storage of databases and the low-latency notification facilities of messaging. They are suited for situations where:
    - The message throughput is high.
    - Each message is fast to process.
    - Message ordering is important (unlike traditional message brokers, log-based brokers preserve ordering within a partition).
  • Some production practices:
    - Use N+2 configuration to tolerate both a planned outage and an unplanned outage.
    - Use randomized exponential backoff with limited retries to avoid positive feedback loops between clients and servers (see the sketch after this list).
    - Slowly increase the load on a new cluster to warm up the cold cache.
    - Golden signals of monitoring: errors, latency, traffic, and saturation.
    - Keep granular metrics in a short period of time for diagnostic purposes and roll them up for long-term analysis.
    - Design configurations for human users to change the system behavior.
    - Canary: a partial and time-limited deployment of a change in a service, followed by an evaluation of metrics attributable to the change against SLIs.
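A minimal sketch of randomized (full-jitter) exponential backoff with a bounded retry count; the parameters are illustrative:

```python
import random
import time

def call_with_backoff(request, max_retries=5, base=0.1, cap=10.0):
    """Retry a failing call with full-jitter exponential backoff, then give up."""
    for attempt in range(max_retries):
        try:
            return request()
        except Exception:
            if attempt == max_retries - 1:
                raise                # bounded retries: give up instead of looping forever
            # sleep a random amount up to an exponentially growing, capped ceiling
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```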
