[Book Notes] System Design Interview — An Insider’s Guide (Volume 2)
A must-read when jobs in tech are at risk.
The following contents are my takeaways from Alex Xu and Sahn Lam’s book¹. Reading the original book is highly recommended for those who want to ace system design interviews.
Chapter 1. Proximity Service
- Functional requirements:
- Return all businesses based on a user’s location and a specified radius.
- View details of a business.
- Business administration operations, which may take effect the next day.
- High-level design:
- Location-based service.
- Business service (CRUD operations).
- Use a relational database, such as MySQL, for the read-heavy business data, and set up sharding and primary-secondary replication.
- The evenly divided grid doesn’t work well because the distribution of businesses is not even.
- Geospatial indexing:
- Geohash: a 1D string that recursively subdivides the world into smaller and smaller grids with each additional bit.
- Quadtree: an in-memory data structure that recursively subdivides a plane into four quadrants.
- Google S2: maps the sphere to a 1D index based on the Hilbert curve; two points that are close on the curve are also close in 2D space.
- For 200 million businesses, the quadtree can easily fit in the working set of a single server and takes only a few minutes to build.
- Geohash is easier to update on location change, but a quadtree can dynamically adjust the grid size based on population density.
- Use a Redis cluster to cache:
- The business info by id.
- The business ids in a grid by geohash.
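A minimal sketch of the geohash idea and the two cache keys above. The encoder follows the standard base32 geohash algorithm; the Redis key names and the redis-py calls in the comments are illustrative assumptions, not the book’s exact design:

```python
# Standard base32 geohash encoder; precision=6 yields cells of roughly
# 1.2 km x 0.6 km, a reasonable grid size for a proximity search.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lng: float, precision: int = 6) -> str:
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, use_lng = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid          # keep the upper half
        else:
            rng[1] = mid          # keep the lower half
        use_lng = not use_lng     # alternate longitude/latitude bits
        bit_count += 1
        if bit_count == 5:        # 5 bits per base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# Illustrative Redis key scheme for the two caches above (redis-py):
#   business:{id}     -> hash holding the business info
#   geohash:{prefix}  -> set of business ids whose location falls in that cell
# r.hset(f"business:{biz_id}", mapping=info)
# r.sadd(f"geohash:{geohash_encode(lat, lng)}", biz_id)
print(geohash_encode(37.7749, -122.4194))   # a ~1 km cell in San Francisco
```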
Chapter 2. Nearby Friends
- Functional requirements:
- Fetch nearby friends within a radius.
- Refresh the user location every 30 seconds.
- High-level design:
- RESTful API servers for updating user profiles and friendships.
- WebSocket servers, which maintain persistent connections with active users.
- Redis Pub/Sub servers, which provide lightweight channels for propagating location updates to WebSocket connection handlers.
- Redis cluster, which caches user locations (latitude, longitude, timestamp) with TTLs.
- Cassandra for location history, sharded by user id.
- WebSocket servers are stateful, so auto-scaling them requires care: a server must be drained (stop receiving new connections at the load balancer) before it is removed.
- The location cache should be sharded by user id because the traffic of location updates is high.
- Creating a new Redis Pub/Sub channel is lightweight, so at client initialization time a user can subscribe to the channels of all friends, regardless of whether they are active.
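A minimal redis-py sketch of these per-user channels. The channel naming (location:{user_id}) and the JSON payload are assumptions for illustration:

```python
import json
import redis  # assumes the redis-py client and a reachable Redis instance

r = redis.Redis()

def publish_location(user_id: int, lat: float, lng: float, ts: int) -> None:
    """Called by the WebSocket handler when a user reports a new location."""
    payload = json.dumps({"user_id": user_id, "lat": lat, "lng": lng, "ts": ts})
    r.publish(f"location:{user_id}", payload)   # one channel per user

def subscribe_to_friends(friend_ids: list[int]) -> redis.client.PubSub:
    """Called once at client initialization, regardless of friends' activity."""
    pubsub = r.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe(*(f"location:{fid}" for fid in friend_ids))
    return pubsub

# The connection handler then polls pubsub.get_message() and pushes updates
# that fall within the search radius down the user's WebSocket.
```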
- Use consistent hashing to shard Redis Pub/Sub servers, and use ZooKeeper for service discovery.
- Pub/Sub servers are stateful due to the subscriber list of each channel, and thus take some overhead to scale out.
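A minimal sketch of consistent hashing for assigning Pub/Sub channels to servers, as mentioned above; the virtual-node count and the MD5 hash are arbitrary choices for illustration:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys (e.g. Pub/Sub channel names) onto a ring of server nodes,
    so that adding or removing a server only remaps a small share of keys."""

    def __init__(self, nodes: list[str], vnodes: int = 100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)   # virtual nodes smooth load
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["pubsub-1", "pubsub-2", "pubsub-3"])
print(ring.get_node("location:42"))   # the server responsible for this channel
```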
Chapter 3. Google Maps
- Features:
- User location update.
- Navigation service, including ETA.
- Real-time map rendering.
- The mobile client should use as little data and battery as possible.
- Geocoding is the process of converting addresses to geographic coordinates.
- Geocoding info could be stored in Redis since it’s read frequently but written infrequently.
- Hierarchical routing tiles:
- Local roads.
- Arterial roads connecting districts.
- Major highways connecting cities.
- Routing tiles could be stored in object storage, such as S3.
- Location history should be batched before being sent to the server. Even so, it would be good to choose a database optimized for write throughput, such as Cassandra.
- User location data is also logged into Kafka, where it could be consumed by other services, such as for traffic monitoring and for personalization.
- Use tiles of different resolutions for map rendering at different zoom levels. These tiles should be pre-generated and cached on a CDN.
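Tiles are typically addressed by (zoom, x, y); a minimal sketch of the standard Web Mercator (“slippy map”) tile math, which is a common convention rather than something the book prescribes:

```python
import math

def lat_lng_to_tile(lat: float, lng: float, zoom: int) -> tuple[int, int]:
    """Tile coordinates at a zoom level: 2^zoom x 2^zoom tiles cover the world,
    so each extra zoom level quadruples the number of (finer) tiles."""
    n = 2 ** zoom
    x = int((lng + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

# The client fetches only the tiles covering its viewport, e.g. from a CDN URL
# shaped like /tiles/{zoom}/{x}/{y}.png (illustrative path).
print(lat_lng_to_tile(37.7749, -122.4194, zoom=12))   # (655, 1582)
```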
- The advantages of a vector map over a rasterized map:
- Better chances for compression.
- Better zooming experience.
Chapter 4. Distributed Message Queue
- Additional features:
- Long data retention.
- Repeated consumption.
- In-order delivery (for each partition).
- Configurable data delivery semantics.
- Support two messaging models:
- Publish-subscribe, which is implemented by topics.
- Point-to-point, which is supported by consumer groups.
- A topic is divided into partitions, and a single partition can only be consumed by one consumer in the same group.
- An optional key in the message can be used to determine the partition.
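A minimal sketch of how the routing logic might map a message to a partition: hash the key when present, otherwise round-robin. This mirrors common practice in Kafka-style producers rather than the book’s exact algorithm:

```python
import hashlib
import itertools

_round_robin = itertools.count()

def choose_partition(key: bytes | None, num_partitions: int) -> int:
    """Same key -> same partition, which preserves per-key ordering;
    keyless messages are spread evenly across partitions."""
    if key is None:
        return next(_round_robin) % num_partitions
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

print(choose_partition(b"user-42", 8))   # deterministic for this key
print(choose_partition(None, 8))         # rotates across partitions
```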
- Storage includes:
- Messages (on brokers’ local disks).
- Consumer states (in ZooKeeper).
- Topic metadata (in ZooKeeper).
- A coordination service (e.g. ZooKeeper) is used for:
- Service discovery (which brokers are active).
- Leader election (which broker is the active controller).
- Persist messages as write-ahead log (WAL) files on disk to take advantage of the sequential access pattern.
- Wrap the routing layer into the producer to:
- Minimize the number of network hops.
- Batch messages in a single request.
- Latency-throughput trade-off: use a large batch size for high throughput and a small batch size for low latency.
- Use the pull model so that:
- Consumers control the consumption rate.
- It enables aggressive batching of data.
- For each consumer group, one of the brokers is assigned as the coordinator of the group for consumer rebalancing.
- To rebalance a consumer group, the coordinator elects one of the consumers as the leader and asks it to generate a partition dispatch plan, which will then be forwarded to each group member.
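A minimal sketch of the dispatch plan the group leader might produce; round-robin assignment is an assumption here, as real brokers support several assignment strategies:

```python
def dispatch_plan(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Assign each partition to exactly one consumer in the group,
    spreading partitions as evenly as possible."""
    plan: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        plan[consumers[i % len(consumers)]].append(partition)
    return plan

# 6 partitions, 2 consumers -> {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
print(dispatch_plan(list(range(6)), ["c1", "c2"]))
```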
- A partition is replicated on multiple brokers and consumers only read from the leader replica.
- One of the brokers in the cluster is elected as the active controller for maintaining the leader/follower relationship for all the partitions in the topic metadata.
- Provide an acknowledgement setting (e.g. ACK=ALL) for trade-offs between performance and durability.
- Decommissioned partitions can only be removed after the retention period.
- For message filtering:
- Filter messages on the broker side to avoid unnecessary traffic.
- Attach tags to the metadata of each message so that brokers don’t need to decrypt or deserialize the payload.
Chapter 5. Metrics Monitoring and Alerting System
- Requirements:
- Operational system metrics both at the node level and at the service level.
- 1 year data retention.
- Rolled up to 1 min resolution after 7 days.
- Rolled up to 1 hour resolution after 30 days.
- Multiple alert channels.
- Workload pattern:
- Constant heavy write load.
- Spiky read load.
- Five components:
- Data collection.
- Data transmission, which updates the TSDB.
- Data storage (TSDB).
- Alerting, which queries the TSDB.
- Visualization, which queries the TSDB.
- A time series can be uniquely identified by its name, and optionally by a set of labels.
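A minimal sketch of such an identifier, using a Prometheus-style notation (the exact format is illustrative):

```python
def series_id(name: str, labels: dict[str, str] | None = None) -> str:
    """Metric name plus labels sorted by key, so the same combination of
    name and labels always maps to the same time series."""
    if not labels:
        return name
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}}"

print(series_id("cpu_load", {"host": "web-01", "region": "us-east-1"}))
# cpu_load{host="web-01",region="us-east-1"}
```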
- Use an industrial-scale time-series database (TSDB) as storage, such as InfluxDB or Prometheus. They have features like caching, indexing and expressive query languages, which could obviate the need for a standalone query service.
- The pull model for data collection:
- A metrics collector regularly pulls metrics from sources via HTTP.
- Use ZooKeeper to maintain configurations of active sources.
- Use consistent hashing for service partitioning among collectors.
- Metrics can be viewed at any time for debugging purposes.
- The push model for data collection:
- The collection agent at each source sends metrics to the metrics collector.
- The collection agent may aggregate and buffer metrics before sending them.
- The collectors could be organized as an auto-scaling group behind a load balancer.
- Short-lived batch jobs can only use the push model.
- Three places for data aggregation:
- The collection agent at each source.
- The metrics collectors, which write to the TSDB.
- The TSDB readers.
- Use Kafka to buffer data and to decouple data collection from data processing, which ingests metrics into the TSDB.
- Set up an alert manager to:
- Poll the TSDB or the query service.
- Filter, merge and dedupe alerts.
Chapter 6. Ad Click Event Aggregation
- Requirements:
- Aggregate the number of clicks of an ad in the past M minutes.
- Return the top K most clicked ads in the past M minutes.
- Support data filtering by different attributes.
- Correctness must be guaranteed.
- End-to-end latency should be a few minutes at most.
- Store both the raw data and the aggregated data, and use Cassandra for the write-heavy workload.
- Use two Kafka instances to:
- Decouple the log watcher and the services that persist or aggregate the raw data.
- Provide end-to-end exactly-once semantics for delivering the aggregated data.
- Use the MapReduce framework for the aggregation service, which consumes the raw data from the first Kafka and produces the aggregated data in the second Kafka.
- Pre-define filtering criteria and aggregate data based on them.
- The Lambda architecture maintains the batch and stream processing paths separately, whereas the Kappa architecture combines them into one.
- Two places to generate timestamps for aggregation:
- The event time on the client side, which is accurate but could be skewed.
- The processing time on the server side, which can be delayed.
- How to handle delayed events when the event time is used?
- Watermark: extend each aggregation window by 15 seconds, which improves data accuracy but increases overall latency.
- End-of-day reconciliation.
- The aggregation service performs a single distributed transaction to:
- Send data to the downstream Kafka (with an ACK).
- Record the offset in an external storage, such as HDFS.
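To make the watermark trade-off above concrete, here is a minimal in-memory sketch of a tumbling-window click counter; the one-minute window and the 15-second watermark follow the numbers in the notes, everything else is illustrative:

```python
from collections import defaultdict

WINDOW_SEC = 60       # aggregate clicks per ad in 1-minute tumbling windows
WATERMARK_SEC = 15    # keep each window open 15 extra seconds for late events

# (window_start, ad_id) -> click count
counts: dict[tuple[int, str], int] = defaultdict(int)

def on_click(ad_id: str, event_ts: int, now: int) -> None:
    window = event_ts - event_ts % WINDOW_SEC
    if now >= window + WINDOW_SEC + WATERMARK_SEC:
        return    # too late for the watermark; left to end-of-day reconciliation
    counts[(window, ad_id)] += 1

def flush(now: int) -> dict[tuple[int, str], int]:
    """Emit only windows whose watermark has passed; a longer watermark improves
    accuracy but delays every result by that much."""
    ready = {k: v for k, v in counts.items()
             if now >= k[0] + WINDOW_SEC + WATERMARK_SEC}
    for key in ready:
        del counts[key]
    return ready
```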
Chapter 7. Hotel Reservation System
- Functional requirements:
- View hotels and rooms.
- Reserve a hotel room.
- Add/delete/update a hotel room.
- Support overbooking.
- Non-functional requirements:
- Support high concurrency.
- Moderate latency (a few seconds for reservation).
- Choose a relational database for the read-heavy workload and ACID properties.
- Include an idempotency key in the API for making reservations to avoid double booking.
- Use the microservice architecture and use remote procedure calls for inter-service communication.
- Keep a pre-populated room type inventory (for 2 years ahead) to support reservations by room type (instead of by room id).
- Concurrency control:
- Pessimistic locking is prone to deadlocks and tends to have bad performance.
- Optimistic locking is usually faster, but its performance drops dramatically when contention is heavy (see the sketch at the end of this chapter).
- How to scale up?
- Shard the database by hotel since most queries need to filter by it.
- Cache inventory information (aggregates) in Redis to reduce database load and improve read performance.
- Use the same service to manage reservations and the inventory to avoid data consistency issues that require 2PC or Saga.
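A minimal sqlite3 sketch of the optimistic-locking approach, using a version column on the room-type inventory row; the schema and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE room_inventory (
    hotel_id INTEGER, room_type_id INTEGER, date TEXT,
    total_reserved INTEGER, total_inventory INTEGER, version INTEGER,
    PRIMARY KEY (hotel_id, room_type_id, date))""")
conn.execute("INSERT INTO room_inventory VALUES (1, 2, '2025-01-01', 0, 10, 0)")

def reserve(hotel_id: int, room_type_id: int, date: str, rooms: int) -> bool:
    reserved, inventory, version = conn.execute(
        "SELECT total_reserved, total_inventory, version FROM room_inventory "
        "WHERE hotel_id=? AND room_type_id=? AND date=?",
        (hotel_id, room_type_id, date)).fetchone()
    if reserved + rooms > inventory:
        return False                      # sold out (overbooking buffer omitted)
    updated = conn.execute(
        "UPDATE room_inventory SET total_reserved=?, version=version+1 "
        "WHERE hotel_id=? AND room_type_id=? AND date=? AND version=?",
        (reserved + rooms, hotel_id, room_type_id, date, version))
    return updated.rowcount == 1          # 0 rows: a concurrent writer won; retry

print(reserve(1, 2, "2025-01-01", 2))     # True on the first attempt
```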
Chapter 8. Distributed Email Service
- The traditional mail sending path:
- Sender client =>
- Sender SMTP server =>
- Receiver SMTP server =>
- Receiver mail server storage =>
- Receiver IMAP/POP server =>
- Receiver client.
- The proposed path:
- Sender client =>
- (LB =>) Web server =>
- Outgoing queue =>
- Sender-side SMTP worker =>
- (LB =>) Receiver-side SMTP server =>
- Incoming queue =>
- Mail processing workers =>
- WebSocket-based real-time servers =>
- Receiver client (if online) and storage.
- Storage consists of:
- Metadata DB (NoSQL with denormalization).
- Attachment store (Amazon S3).
- Distributed cache (Redis).
- Search store (Elasticsearch).
- Kafka is used to decouple the services that trigger reindexing from the services that actually perform it.
Chapter 9. S3-like Object Storage
- Three types of storage:
- Block storage (e.g. HDD, SSD).
- File storage, which usually has a hierarchical directory structure.
- Object storage, which stores all data as objects in a flat structure.
- Characteristics of object storage:
- Highly durable.
- Vastly scalable.
- Low cost.
- Low performance.
- Mainly for archival and backup.
- Functional requirements:
- Bucket creation.
- Object uploading and downloading.
- Object versioning.
- Listing objects in a bucket.
- Separate services for mutable metadata and immutable object data.
- Life of a read request:
- The load balancer forwards the request to the API service.
- The API service performs authentication and authorization.
- The API service queries the metadata service.
- The API service queries the storage service.
- A response is sent back to the client.
- Components of the data store:
- The stateless data routing service, which serves user requests.
- The highly-reliable placement service, which maintains a virtual cluster map and monitors data nodes.
- The data nodes with synchronous replication.
- Problems with too many small files:
- Internal fragmentation.
- It may exceed the system’s i-node capacity.
- Append multiple (small) objects into a large file and serialize writes to the same file.
- Multipart upload: slice a large object into smaller parts and upload them independently to the object store.
- Maintain the local object location info in SQLite at each data node.
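A minimal sketch of that per-node SQLite mapping; the table and column names follow the idea in the notes, but the exact schema is an assumption:

```python
import sqlite3

db = sqlite3.connect("object_mapping.db")   # one local database per data node
db.execute("""CREATE TABLE IF NOT EXISTS object_mapping (
    object_id    TEXT PRIMARY KEY,   -- UUID of the object
    file_name    TEXT,               -- large append-only file holding the object
    start_offset INTEGER,            -- byte offset of the object within that file
    object_size  INTEGER)""")

def locate(object_id: str) -> tuple[str, int, int] | None:
    """Returns (file_name, start_offset, object_size); the node then serves the
    object with a single seek and a read of object_size bytes."""
    return db.execute(
        "SELECT file_name, start_offset, object_size FROM object_mapping "
        "WHERE object_id = ?", (object_id,)).fetchone()
```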
- Two ways to achieve data redundancy:
- Replication, which adds no computation cost or extra tail latency.
- Erasure coding, which achieves higher durability with less storage.
- Use checksums to detect data corruption in files.
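A minimal hashlib sketch of the checksum verification; SHA-256 is an arbitrary choice here — any strong hash stored alongside the object works:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, stored_checksum: str) -> bool:
    """A mismatch means this replica is corrupt; read from another replica
    (or reconstruct from erasure-coded fragments) instead."""
    return checksum(data) == stored_checksum
```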
- Shard the object metadata by the combination of the bucket name and the object name.
- Support pagination of bucket listing by materializing the listing data in a separate table sharded by bucket_id.
- Add an object_version column to support object versioning.
Chapter 10. Real-time Gaming Leaderboard
- Functional requirements:
- Display top 10 players.
- Show a specific user’s rank along with the four players above and below her.
- The client directly queries the leaderboard but updates scores only through the game service.
- If fan-out on score updates (e.g. notification) is needed, use a message broker to connect the game service and the leaderboard service.
- Store user profiles in one MySQL table and cache hot ones in Redis.
- Store scores in another table and use a Redis sorted set to index users’ ranks.
- A sorted set is implemented by a hash table (user -> score) and a skip list (score -> user).
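A minimal redis-py sketch of the leaderboard operations above; the key name is illustrative:

```python
import redis  # assumes redis-py and a reachable Redis instance

r = redis.Redis(decode_responses=True)
KEY = "leaderboard:2025-06"   # e.g. one sorted set per monthly leaderboard

def add_point(user_id: str, points: int = 1) -> None:
    r.zincrby(KEY, points, user_id)          # O(log n) score update

def top_10() -> list[tuple[str, float]]:
    return r.zrevrange(KEY, 0, 9, withscores=True)

def rank_with_neighbors(user_id: str) -> list[tuple[str, float]]:
    """The user's rank plus the four players above and below."""
    rank = r.zrevrank(KEY, user_id)          # 0-based rank, highest score first
    if rank is None:
        return []
    return r.zrevrange(KEY, max(rank - 4, 0), rank + 4, withscores=True)
```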
- Use AWS API Gateway and AWS Lambda to auto-scale the service.
- The score data can be partitioned by either the score range or the user key.
- Limitation of partitioning by score range:
- The score distribution needs to be known a priori to keep the load balanced.
- Limitations of partitioning by user key:
- Determining a user’s rank is not straightforward.
- Tail latency amplification and more response traffic for top-k queries.
- An alternative to MySQL + Redis is AWS DynamoDB with a global secondary index on scores.
- More partitions balance the load better but complicate scatter-gather queries.
Chapter 11. Payment System
- The focus is on how to correctly handle payment transactions, rather than aiming for high throughput.
- The payment flow consists of a pay-in flow (buyer -> e-commerce) and a pay-out flow (e-commerce -> seller).
- A payment event may contain several payment orders.
- In the pay-in flow, the payment service:
- Accepts payment events and performs risk checks.
- Sends payment orders to the payment executor.
- Updates the wallet for a seller upon success of executing an order.
- Updates the double-entry ledger.
- Keep numbers in string format during transmission and storage to ensure precision.
- Payment Service Provider (PSP) integration:
- The payment executor sends a payment registration request with a UUID called nonce.
- The PSP returns a token, which is the UUID of the payment order on the PSP side and is persisted by the payment executor.
- The client displays a PSP-hosted payment page.
- The web page is then redirected to the redirect URL set by the payment system based on the payment status returned by the PSP.
- Asynchronously, the PSP returns the payment status to the payment executor via a webhook.
- Reconciliation is a practice that periodically compares the states among related services in order to verify that they are in agreement.
- Put a failed payment into either the retry queue or the dead letter queue.
- Implementation of exactly-once delivery (see the sketch below):
- Ensure at-least-once using retries.
- Ensure at-most-once using idempotency checks.
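A minimal sketch combining the two halves above: retries for at-least-once and an idempotency check for at-most-once. The in-memory set stands in for a database table with a unique constraint on the idempotency key:

```python
import time

processed: set[str] = set()   # in production: a DB table keyed by idempotency key

def execute_payment(order_id: str, amount: str) -> None:
    """Placeholder for the PSP call; the amount stays a string for precision."""
    ...

def pay_exactly_once(idempotency_key: str, order_id: str, amount: str,
                     max_retries: int = 5) -> None:
    if idempotency_key in processed:          # at-most-once: drop duplicates
        return
    for attempt in range(max_retries):        # at-least-once: retry on failure
        try:
            execute_payment(order_id, amount)
            processed.add(idempotency_key)
            return
        except Exception:
            time.sleep(2 ** attempt)          # exponential backoff
    raise RuntimeError("retries exhausted; route to the dead letter queue")
```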
Chapter 12. Digital Wallet
- The functional requirement is to provide cross-wallet balance transfer.
- It takes thousands of DB nodes to support millions of TPS.
- Distributed transaction:
- X/Open XA is a standard for coordinating heterogeneous databases to achieve 2PC.
- 2PC is not performant and has a single point of failure (the coordinator).
- 2PC wraps two phases in the same transaction, whereas TC/C separates them into separate transactions.
- 2PC is executed at the database level, whereas TC/C and Saga are executed at the application level.
- TC/C is preferable to Saga when the latency requirement is demanding.
- Event sourcing (see the sketch at the end of this chapter):
- The advantage of event sourcing is reproducibility.
- Only validated commands generate events.
- Event generation (based on the state and the command) may contain randomness.
- Applying an event to update the state must be deterministic.
- Events are saved as immutable history.
- Command-query responsibility segregation (CQRS):
- A single state machine for updating the state.
- Multiple state machines for building read-only views.
- Use mmap to log commands and events on the local disk and to cache recent ones in memory.
- Use LSM tree-based RocksDB as the state store for optimizing write performance.
- Use the Raft consensus algorithm to replicate the event list across multiple machines.
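A minimal event-sourcing sketch for the wallet, tying together the points above (validated commands generate events, event application is deterministic, events form an immutable history); the account names and amounts are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransferCommand:
    from_account: str
    to_account: str
    amount: int

@dataclass(frozen=True)
class TransferredEvent:        # immutable, append-only history
    from_account: str
    to_account: str
    amount: int

balances: dict[str, int] = {"A": 100, "B": 0}
events: list[TransferredEvent] = []

def handle(cmd: TransferCommand) -> None:
    """Only validated commands generate events."""
    if balances.get(cmd.from_account, 0) < cmd.amount:
        raise ValueError("insufficient balance")   # command rejected, no event
    event = TransferredEvent(cmd.from_account, cmd.to_account, cmd.amount)
    events.append(event)
    apply_event(event)

def apply_event(e: TransferredEvent) -> None:
    """Deterministic state update, so replaying `events` from the initial
    state reproduces exactly the same balances (reproducibility)."""
    balances[e.from_account] -= e.amount
    balances[e.to_account] += e.amount

handle(TransferCommand("A", "B", 30))
print(balances)   # {'A': 70, 'B': 30}
```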
Chapter 13. Stock Exchange
- Clients interact with the stock exchange system via the brokers to
- Place orders.
- View executions by order.
- View market data (L2 order book).
- Download historical data (candlestick charts) for analysis.
- An order is an inbound instruction, and an execution is an outbound matched result.
- The latency-sensitive trading flow consists of:
- The client gateway, which provides authentication, rate limiting, validation, etc.
- The order manager, which evaluates risks and checks users’ wallets.
- The sequencer, which stamps every inbound order and every outbound execution with a sequence ID to provide functional determinism.
- The matching engine, which matches buy and sell orders.
- The market data publisher receives executions from the matching engine and builds (1) order books and (2) candlestick charts.
- The order manager merges incoming orders and outgoing executions into a reporting flow.
- An order book is a list of buy and sell orders for a specific security or financial instrument organized by price level.
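A minimal sketch of such an order book with price-time priority; quantities are omitted (every order is treated as one unit), so this only illustrates the data structure, not a full matching engine:

```python
import heapq

class OrderBook:
    """Best bid = highest buy price; best ask = lowest sell price."""

    def __init__(self) -> None:
        self.bids: list[tuple[float, int, str]] = []   # (-price, seq, order_id)
        self.asks: list[tuple[float, int, str]] = []   # (price, seq, order_id)
        self.seq = 0                                   # arrival order = time priority

    def add(self, side: str, price: float, order_id: str) -> None:
        self.seq += 1
        if side == "buy":
            heapq.heappush(self.bids, (-price, self.seq, order_id))
        else:
            heapq.heappush(self.asks, (price, self.seq, order_id))

    def match(self) -> list[tuple[str, str, float]]:
        """Match while the best bid crosses the best ask (price-time priority)."""
        executions = []
        while self.bids and self.asks and -self.bids[0][0] >= self.asks[0][0]:
            _, _, buy_id = heapq.heappop(self.bids)
            ask_price, _, sell_id = heapq.heappop(self.asks)
            executions.append((buy_id, sell_id, ask_price))
        return executions

book = OrderBook()
book.add("buy", 100.5, "o1")
book.add("sell", 100.0, "o2")
print(book.match())   # [('o1', 'o2', 100.0)]
```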
- To reduce end-to-end latency of the trading flow:
- Put everything in a single server to eliminate network hops.
- Use mmap and shared memory for interprocess communication.
- Make each application loop single-threaded and pinned to a fixed CPU core.
- Keep a warm matching engine as a fast failover of the hot one.
- Use the Raft protocol to elect a new server when the active one dies.