[Book Notes] System Design Interview — An Insider’s Guide (Volume 2)
A must-read when jobs in tech are at risk.
The following contents are my takeaways from Alex Xu and Sahn Lam’s book¹. Reading the original book is highly recommended for those who want to ace system design interviews.
Chapter 1. Proximity Service
- Functional requirements:
- Return all businesses based on a user’s location and a specified radius.
- View details of a business.
- Business administration operations, which may take effect the next day.
- High-level design:
- Location-based service.
- Business service (CRUD operations).
- Use a relational database, such as MySQL, for the read-heavy business data, and set up sharding and primary-secondary replication.
- An evenly divided grid doesn’t work well because businesses are not evenly distributed.
- Geospatial indexing:
- Geohash: a 1D string that recursively subdivides the world into smaller and smaller grids with each additional bit.
- Quadtree: an in-memory data structure that recursively subdivides a plane into four quadrants.
- Google S2: mapping a sphere to a 1D index based on the Hilbert curve, on which two points are close if they are close in 2D space.
- For 200 million businesses, the quadtree can easily fit in the working set of a single server and takes just a few minutes to build.
- Geohash is easier to update on location change, but a quadtree can dynamically adjust the grid size based on population density.
- Use a Redis cluster to cache:
- The business info by id.
- The business ids in a grid by geohash.
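As a minimal illustration of the geohash-based cache above, the sketch below encodes a coordinate into a geohash and stores business ids in Redis sets keyed by that hash. The key layout (`geohash:{hash}` -> set of business ids) and the precision choice are illustrative assumptions, not the book’s exact schema.

```python
# Minimal sketch of the geohash cache, assuming an illustrative key layout.
import redis

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lng: float, precision: int = 6) -> str:
    """Encode a (lat, lng) pair into a geohash string of the given length."""
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, result = 0, 0, []
    even = True                          # even bits encode longitude, odd bits latitude
    while len(result) < precision:
        rng = lng_range if even else lat_range
        val = lng if even else lat
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:                    # 5 bits per base32 character
            result.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(result)

r = redis.Redis(decode_responses=True)

def add_business(business_id: str, lat: float, lng: float) -> None:
    # Cache the business id under its geohash cell (precision 6 ~ 1.2 km cell).
    r.sadd(f"geohash:{geohash_encode(lat, lng)}", business_id)

def nearby_business_ids(lat: float, lng: float) -> set:
    # Look up the user's cell; a real service would also query the 8 neighbor cells.
    return r.smembers(f"geohash:{geohash_encode(lat, lng)}")
```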
Chapter 2. Nearby Friends
- Functional requirements:
- Fetch nearby friends within a radius.
- Refresh the user location every 30 seconds.
- High-level design:
- RESTful API servers for updating user profiles and friendships.
- WebSocket servers, which maintain persistent connections with active users.
- Redis Pub/Sub servers, which provide lightweight channels for propagating location updates to WebSocket connection handlers.
- Redis cluster, which caches user locations (latitude, longitude, timestamp) with TTLs.
- Cassandra for location history, sharded by user id.
- The load balancer should be able to auto-scale the stateful WebSocket servers.
- The location cache should be sharded by user id because the traffic of location updates is high.
- Creating a new Redis Pub/Sub channel is lightweight, so at client initialization time a user can subscribe to the channel of every friend, regardless of whether the friend is active.
- Use consistent hashing to shard Redis Pub/Sub servers, and use ZooKeeper for service discovery.
- Pub/Sub servers are stateful due to the subscriber list of each channel, so scaling them out incurs some overhead.
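A rough sketch of the Pub/Sub fan-out described above, assuming one channel per user named `location_updates:{user_id}` (an illustrative naming convention, not the book’s):

```python
# Minimal sketch: publish a user's location update and subscribe to friends' channels.
import json
import redis

r = redis.Redis()

def publish_location(user_id: str, lat: float, lng: float, ts: int) -> None:
    # The WebSocket handler that received the update publishes it to the user's
    # own channel; every subscribed friend's handler will receive it.
    payload = json.dumps({"user_id": user_id, "lat": lat, "lng": lng, "ts": ts})
    r.publish(f"location_updates:{user_id}", payload)

def subscribe_to_friends(friend_ids: list[str]) -> redis.client.PubSub:
    # At connection time, subscribe to every friend's channel, active or not.
    pubsub = r.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe(*[f"location_updates:{fid}" for fid in friend_ids])
    return pubsub

# Example usage: drain messages and forward them over the WebSocket connection.
# for message in subscribe_to_friends(["42", "77"]).listen():
#     forward_over_websocket(json.loads(message["data"]))
```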
Chapter 3. Google Maps
- Features:
- User location update.
- Navigation service, including ETA.
- Real-time map rendering.
- The mobile client should use as little data and battery as possible.
- Geocoding is the process of converting addresses to geographic coordinates.
- Geocoding info could be stored in Redis since it’s read frequently but written infrequently.
- Hierarchical routing tiles:
- Local roads.
- Arterial roads connecting districts.
- Major highways connecting cities.
- Routing tiles could be stored in object storage, such as S3.
- Location history should be batched before being sent to the server. Even so, it would be good to choose a database optimized for write throughput, such as Cassandra.
- User location data is also logged into Kafka, where it could be consumed by other services, such as for traffic monitoring and for personalization.
- Use tiles of different resolutions for map rendering of different zoom levels. These tiles should be pre-generated and be cached on a CDN.
- The advantages of a vector map over a rasterized map:
- Better chances for compression.
- Better zooming experience.
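The book’s pre-generated, CDN-cached tiles per zoom level resemble the standard Web Mercator (“slippy map”) tiling scheme; the sketch below derives which tile a client should fetch for a given location and zoom. The CDN URL pattern is an illustrative assumption.

```python
# Minimal sketch: standard Web Mercator tile indexing for a given zoom level.
import math

def latlng_to_tile(lat: float, lng: float, zoom: int) -> tuple[int, int]:
    """Convert a coordinate to (x, y) tile indices at the given zoom level."""
    n = 2 ** zoom                          # number of tiles per axis at this zoom
    x = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tile_url(lat: float, lng: float, zoom: int) -> str:
    x, y = latlng_to_tile(lat, lng, zoom)
    return f"https://cdn.example.com/tiles/{zoom}/{x}/{y}.png"  # hypothetical CDN path

print(tile_url(37.7749, -122.4194, 12))    # tile covering downtown San Francisco
```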
Chapter 4. Distributed Message Queue
- Additional features:
- Long data retention.
- Repeated consumption.
- In-order delivery (for each partition).
- Configurable data delivery semantics.
- Support two messaging models:
- Publish-subscribe, which is implemented by topics.
- Point-to-point, which is supported by consumer groups.
- A topic is divided into partitions, and a single partition can only be consumed by one consumer in the same group.
- An optional key in the message can be used to determine the partition.
- Storage includes:
- Messages (on brokers’ local disks).
- Consumer states (in ZooKeeper).
- Topic metadata (in ZooKeeper).
- A coordination service (e.g. ZooKeeper) is used for:
- Service discovery (which brokers are active).
- Leader election (which broker is the active controller).
- Persist messages as write-ahead log (WAL) files on disk to take advantage of the sequential access pattern.
- Wrap the routing layer into the producer to:
- Minimize the number of network hops.
- Batch messages in a single request.
- Latency-throughput trade-off: use a large batch size for high throughput and a small batch size for low latency.
- Use the pull model so that:
- Consumers control the consumption rate.
- It enables aggressive batching of data.
- For each consumer group, one of the brokers is assigned as the coordinator of the group for consumer rebalancing.
- To rebalance a consumer group, the coordinator elects one of the consumers as the leader and asks it to generate a partition dispatch plan, which will then be forwarded to each group member.
- A partition is replicated on multiple brokers and consumers only read from the leader replica.
- One of the brokers in the cluster is elected as the active controller for maintaining the leader/follower relationship for all the partitions in the topic metadata.
- Provide an acknowledgement setting (e.g. ACK=ALL) for trade-offs between performance and durability.
- Decommissioned partitions can only be removed after the retention period.
- For message filtering:
- Filter messages on the broker side to avoid unnecessary traffic.
- Attach tags to the metadata of each message so that brokers don’t need to decrypt or deserialize the payload.
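The “optional key determines the partition” rule above can be sketched as a small producer-side routing layer: messages with the same key land on the same partition (preserving per-key order), while keyless messages are spread round-robin. The class and field names are illustrative, not from the book.

```python
# Minimal sketch of producer-side partition selection by message key.
import hashlib
import itertools

class PartitionRouter:
    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key: bytes | None) -> int:
        if key is None:
            return next(self._round_robin)     # no key: spread messages evenly
        digest = hashlib.md5(key).digest()     # stable hash across producer instances
        return int.from_bytes(digest[:4], "big") % self.num_partitions

router = PartitionRouter(num_partitions=8)
# The same key always routes to the same partition, so per-key order is preserved.
assert router.partition_for(b"user-42") == router.partition_for(b"user-42")
```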
Chapter 5. Metrics Monitoring and Alerting System
- Requirements:
- Operational system metrics both at the node level and at the service level.
- 1 year data retention.
- Rolled up to 1 min resolution after 7 days.
- Rolled up to 1 hour resolution after 30 days.
- Multiple alert channels.
- Workload pattern:
- Constant heavy write load.
- Spiky read load.
- Five components:
- Data collection.
- Data transmission, which updates the TSDB.
- Data storage (TSDB).
- Alerting, which queries the TSDB.
- Visualization, which queries the TSDB.
- A time series can be uniquely identified by its name, and optionally by a set of labels.
- Use an industrial-scale time-series database (TSDB), such as InfluxDB or Prometheus, as storage. They have features like caching, indexing and expressive query languages, which could obviate the need for a standalone query service.
- The pull model for data collection:
- A metrics collector regularly pulls metrics from sources via HTTP.
- Use ZooKeeper to maintain configurations of active sources.
- Use consistent hashing for service partitioning among collectors.
- Metrics can be viewed at any time for debugging purposes.
- The push model for data collection:
- The collection agent at each source sends metrics to the metrics collector.
- The collection agent may aggregate and buffer metrics before sending them.
- The collectors could be organized as an auto-scaling group behind a load balancer.
- Short-lived batch jobs can only use the push model.
- Three places for data aggregation:
- The collection agent at each source.
- The metrics collectors, which write to the TSDB.
- The TSDB readers.
- Use Kafka to buffer data and to decouple data collection from data processing, which ingests metrics into the TSDB.
- Set up an alert manager to:
- Poll the TSDB or the query service.
- Filter, merge and dedupe alerts.
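The “name plus labels” identity mentioned above can be made concrete with a small sketch; the dataclass layout below is an assumption for illustration, not the book’s schema.

```python
# Minimal sketch: a time series is keyed by its metric name plus sorted labels,
# and each sample in the series is a (timestamp, value) pair.
from dataclasses import dataclass

@dataclass(frozen=True)
class SeriesId:
    name: str                                  # e.g. "cpu.load"
    labels: tuple[tuple[str, str], ...]        # sorted (key, value) pairs

    @staticmethod
    def of(name: str, **labels: str) -> "SeriesId":
        return SeriesId(name, tuple(sorted(labels.items())))

@dataclass
class Sample:
    timestamp: int                             # unix seconds
    value: float

# Two samples from the same host for the same metric share one series id and
# are appended to the same series in the TSDB.
series = SeriesId.of("cpu.load", host="10.10.10.1", env="prod")
samples = [Sample(1658290000, 0.13), Sample(1658290010, 0.17)]
```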
Chapter 6. Ad Click Event Aggregation
- Requirements:
- Aggregate the number of clicks of an ad in the past M minutes.
- Return the top K most clicked ads in the past M minutes.
- Support data filtering by different attributes.
- Correctness must be guaranteed.
- End-to-end latency should be a few minutes at most.
- Store both the raw data and the aggregated data, and use Cassandra for the write-heavy workload.
- Use two Kafka instances to:
- Decouple the log watcher and the services that persist raw data or perform the aggregation.
- Provide end-to-end exactly-once semantics for dumping the aggregated data.
- The aggregation service uses the MapReduce paradigm to consume raw data from the first Kafka instance and writes aggregated data to the second Kafka instance.
- Pre-define filtering criteria and aggregate data based on them.
- The Lambda architecture maintains the batch and stream processing paths separately, whereas the Kappa architecture combines them into one.
- Two places to generate timestamps for aggregation:
- The event time on the client side, which is accurate but could be skewed.
- The processing time on the server side, which is reliable but can be delayed.
- How to handle delayed events when the event time is used?
- Watermark: extend each aggregation window by 15 seconds, which improves data accuracy but increases overall latency.
- End-of-day reconciliation.
- The aggregation service performs a single distributed transaction to:
- Send data to the downstream Kafka (with an ACK).
- Update the offset in an external storage, such as HDFS.
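A toy sketch of the event-time windowing with a watermark described above: a one-minute tumbling window is only finalized some seconds after it ends, so slightly late click events are still counted. Window size, watermark and data structures are illustrative assumptions.

```python
# Minimal sketch: tumbling-window click aggregation with a watermark delay.
from collections import defaultdict

WINDOW_SECONDS = 60
WATERMARK_SECONDS = 15

windows: dict[tuple[int, str], int] = defaultdict(int)   # (window_start, ad_id) -> clicks

def on_click(ad_id: str, event_time: int) -> None:
    window_start = event_time - (event_time % WINDOW_SECONDS)
    windows[(window_start, ad_id)] += 1

def flush_ready_windows(processing_time: int) -> dict[tuple[int, str], int]:
    """Emit windows whose end time plus the watermark has passed."""
    ready = {k: v for k, v in windows.items()
             if k[0] + WINDOW_SECONDS + WATERMARK_SECONDS <= processing_time}
    for k in ready:
        del windows[k]                   # finalized windows go downstream to Kafka
    return ready
```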
Chapter 7. Hotel Reservation System
- Functional requirements:
- View hotels and rooms.
- Reserve a hotel room.
- Add/delete/update a hotel room.
- Support overbooking.
- Non-functional requirements:
- Support high concurrency.
- Moderate latency (a few seconds for reservation).
- Choose a relational database for the read-heavy workload and ACID properties.
- Keep an inventory table to track the different room types of each hotel on each day.
- The primary key is (hotel_id, room_type, date).
- Include an idempotency key in the API for making reservations to avoid double booking.
- A client sends a request to get a unique reservation id before submission.
- On clicking the submit button, the actual reservation request with the unique id is sent.
- Use the microservice architecture and remote procedure calls for inter-service communication.
- Keep a pre-populated room type inventory (for 2 years ahead) to support reservations by room type (instead of by room id).
- Concurrency control:
- Pessimistic locking is prone to deadlocks and tends to have bad performance.
- Optimistic locking is usually faster, but performance drops dramatically when contention is heavy (see the sketch at the end of this chapter).
- How to scale up?
- Shard the database by hotel since most queries need to filter by it.
- Cache inventory information (aggregates) in Redis to reduce database load and improve read performance.
- Use CDC to propagate DB writes into the caching layer.
- Use the same service to manage reservations and the inventory to avoid data consistency issues that require 2PC or Saga.
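The optimistic locking mentioned above can be sketched with a version column on the inventory table: the UPDATE only succeeds if no other writer has modified the row since it was read. The table and column names are assumptions modeled on these notes, and sqlite3 stands in for MySQL; the table is assumed to exist.

```python
# Minimal sketch: optimistic locking on a pre-populated room_type_inventory table.
import sqlite3

def reserve_room(conn: sqlite3.Connection, hotel_id: int, room_type: str, date: str) -> bool:
    row = conn.execute(
        "SELECT total_reserved, total_inventory, version FROM room_type_inventory "
        "WHERE hotel_id = ? AND room_type = ? AND date = ?",
        (hotel_id, room_type, date),
    ).fetchone()
    if row is None or row[0] >= row[1]:
        return False                                     # sold out (overbooking aside)
    reserved, _, version = row
    cursor = conn.execute(
        "UPDATE room_type_inventory SET total_reserved = ?, version = version + 1 "
        "WHERE hotel_id = ? AND room_type = ? AND date = ? AND version = ?",
        (reserved + 1, hotel_id, room_type, date, version),
    )
    conn.commit()
    return cursor.rowcount == 1    # 0 rows updated => another writer won; caller retries
```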
Chapter 8. Distributed Email Service
- SMTP is used to transfer emails from one mail server to another.
- IMAP is used to receive and download emails from a mail server.
- The traditional mail sending path:
- sending client =>
- sender SMTP server (storage) =>
- receiver SMTP server =>
- receiver mail server storage =>
- receiver IMAP/POP server =>
- receiving client.
- Traditional mail servers don’t have a scalable storage layer for a large number of users or for large attachments.
- The proposed path:
- A client talks to a Web server for sending or fetching emails.
- A client keeps a connection with a WebSocket server for receiving emails.
- Use a message queue to decouple the Web server that receives emails from sending clients and the SMTP workers that forward emails to receiver SMTP servers.
- Use another message queue to decouple the SMTP servers that receive incoming emails and the mail processing workers that filter and pass emails to WebSocket servers.
- Storage consists of:
- Metadata DB (Cassandra sharded by user_id and denormalized for read_status).
- Attachment store (Amazon S3).
- Distributed cache (Redis).
- Search store (Elasticsearch with async reindexing).
Chapter 9. S3-like Object Storage
- Three types of storage:
- Block storage (e.g. HDD, SSD).
- File storage, which usually has a hierarchical directory structure.
- Object storage, which stores all data as objects in a flat structure.
- Characteristics of object storage:
- Highly durable.
- Vastly scalable.
- Low cost.
- Low performance.
- Mainly for archival and backup.
- Functional requirements:
- Bucket creation.
- Object uploading and downloading.
- Object versioning.
- Listing objects in a bucket.
- Separate services for mutable metadata and immutable object data.
- Life of a read request:
- The load balancer forwards the request to the API service.
- The API service performs authentication and authorization.
- The API service queries the metadata service.
- The API service queries the storage service.
- A response is sent back to the client.
- Components of the data store:
- The stateless data routing service, which serves user requests.
- The highly-reliable placement service, which maintains a virtual cluster map and monitors data nodes.
- The data nodes with synchronous replication.
- Problems with too many small files:
- Internal fragmentation.
- It may exceed the system’s i-node capacity.
- Append multiple (small) objects into a large file, and serialize writes to the same file.
- Multipart upload: slice a large object into smaller parts and upload them independently to the object store.
- Maintain the local object location info in SQLite at each data node.
- Two ways to achieve data redundancy:
- Replication, which requires no extra computation and adds no tail latency.
- Erasure coding, which achieves higher durability with less storage.
- Use checksums to guard against data corruption of files.
- Shard the object metadata by the combination of the bucket name and the object name.
- Support pagination of bucket listing by materializing the listing data in a separate table sharded by bucket_id.
- Add an object_version column to support object versioning.
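The checksum idea above can be sketched as follows: store a digest alongside each object at write time and verify it on every read to detect silent data corruption. The on-disk layout is an illustrative assumption, not the book’s format.

```python
# Minimal sketch: per-object checksums verified on read.
import hashlib
from pathlib import Path

def put_object(data_dir: Path, object_id: str, data: bytes) -> None:
    checksum = hashlib.sha256(data).hexdigest()
    (data_dir / object_id).write_bytes(data)
    (data_dir / f"{object_id}.sha256").write_text(checksum)

def get_object(data_dir: Path, object_id: str) -> bytes:
    data = (data_dir / object_id).read_bytes()
    expected = (data_dir / f"{object_id}.sha256").read_text()
    if hashlib.sha256(data).hexdigest() != expected:
        # A corrupted replica; the read should fall back to another replica.
        raise IOError(f"checksum mismatch for object {object_id}")
    return data
```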
Chapter 10. Real-time Gaming Leaderboard
- Functional requirements:
- Display top 10 players.
- Show a specific user’s rank along with the four players above and below her.
- The client directly queries the leaderboard but updates scores only through the game service.
- If fan-out on score updates (e.g. notification) is needed, use a message broker to connect the game service and the leaderboard service.
- Store user profiles in one MySQL table and cache hot ones in Redis.
- Store points (scoring history) in another MySQL table.
- Use a Redis sorted set to index users’ ranks.
- A sorted set is implemented by a hash table (user -> score) and a skip list (score -> user).
- Use ZINCRBY to update a score.
- Use ZREVRANGE to fetch the top-10 players.
- Use ZREVRANK to get the rank of a player.
- Use AWS API Gateway and AWS Lambda to auto-scale the service.
- With AWS Lambda, we don’t need to provision or manage servers on our own.
- The score data can be partitioned by either the score range or the user key.
- Limitation of partitioning by score range:
- The score distribution needs to be known a priori to balance the load.
- Limitations of partitioning by user key:
- Determining a user’s rank is not straightforward.
- Tail latency amplification and more response traffic for top-k queries.
- An alternative to MySQL + Redis is AWS DynamoDB with a global secondary index on scores.
- More partitions balance load better but complicate scatter-gather queries.
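The sorted-set commands listed above map directly onto redis-py calls; the monthly key name below is an illustrative convention.

```python
# Minimal sketch: leaderboard operations on a Redis sorted set.
import redis

r = redis.Redis(decode_responses=True)
KEY = "leaderboard:2024-07"

def add_point(user_id: str, points: int = 1) -> None:
    r.zincrby(KEY, points, user_id)                  # ZINCRBY: update a score

def top10() -> list[tuple[str, float]]:
    return r.zrevrange(KEY, 0, 9, withscores=True)   # ZREVRANGE: top-10 players

def rank_with_neighbors(user_id: str) -> list[tuple[str, float]]:
    rank = r.zrevrank(KEY, user_id)                  # ZREVRANK: 0-based rank
    if rank is None:
        return []
    # Four players above and below the user, per the requirement above.
    return r.zrevrange(KEY, max(rank - 4, 0), rank + 4, withscores=True)
```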
Chapter 11. Payment System
- The focus is on how to correctly handle payment transactions, rather than aiming for high throughput.
- The payment flow consists of a pay-in flow (buyer -> e-commerce) and a pay-out flow (e-commerce -> seller).
- A payment event may contain several payment orders that will be executed separately.
- In the pay-in flow, the payment service:
- Accepts payment events and performs risk checks.
- Sends payment orders to the payment executor.
- Updates the wallet for a seller upon success of executing an order.
- Updates the double-entry ledger.
- Keep numbers in string format during transmission and storage to ensure precision.
- Payment Service Provider (PSP) integration:
- The payment executor sends a payment registration request, which contains a UUID called nonce and a redirect URL.
- The PSP returns a token, which is the UUID of the payment order on the PSP side and is persisted by the payment executor.
- The client displays a PSP-hosted payment page to collect sensitive payment info.
- The web page is then redirected to the redirect URL set by the payment system, based on the payment status returned by the PSP. If payment processing is delayed, the payment service returns a pending status to the client.
- Asynchronously, the PSP returns the payment status to the payment executor via a webhook.
- Reconciliation is a practice that periodically compares the states among related services in order to verify that they are in agreement.
- Put a failed payment into either the retry queue or the dead letter queue.
- Implementation of exactly-once delivery:
- Ensure at-least-once using retries.
- Ensure at-most-once using idempotency checks.
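A toy sketch of the “exactly-once = at-least-once + at-most-once” idea above: the caller retries transient failures, while the executor records idempotency keys so a retried payment order is never executed twice. All names (including `call_psp` and `TransientError`) are illustrative placeholders.

```python
# Minimal sketch: retries plus idempotency checks for payment execution.
import time

class TransientError(Exception):
    """A retryable failure, e.g. a network timeout while calling the PSP."""

def call_psp(order: dict) -> str:
    return "SUCCESS"                       # placeholder for the real PSP call

_processed: dict[str, str] = {}            # idempotency_key -> result (stand-in for a DB table)

def execute_payment(idempotency_key: str, order: dict) -> str:
    # At-most-once: a key that was already processed returns the stored result.
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = call_psp(order)
    _processed[idempotency_key] = result
    return result

def pay_with_retries(idempotency_key: str, order: dict, attempts: int = 3) -> str:
    # At-least-once: retry transient failures, reusing the same idempotency key.
    for attempt in range(attempts):
        try:
            return execute_payment(idempotency_key, order)
        except TransientError:
            time.sleep(2 ** attempt)       # exponential backoff before retrying
    raise RuntimeError("payment failed after retries; send to the dead letter queue")
```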
Chapter 12. Digital Wallet
- The functional requirement is to provide cross-wallet balance transfer.
- It takes thousands of DB nodes to support millions of TPS.
- Distributed transaction:
- X/Open XA is a standard to coordinate heterogeneous databases to implement 2PC.
- 2PC is not performant and is prone to a single point of failure.
- 2PC wraps two phases in the same transaction, whereas TC/C separates them into separate transactions.
- 2PC is executed at the database level, whereas TC/C and Saga are executed at the application level.
- TC/C is better than Saga when the latency requirement is high.
- Event sourcing:
- The advantage of event sourcing is reproducibility.
- Only validated commands generate events.
- Event generation is based on the state and validated commands, and may contain randomness.
- State update on an event must be deterministic.
- Events are saved as immutable history. - Command-query responsibility segregation (CQRS):
- A single state machine for updating the state.
- Multiple state machines for building read-only views.
- Performance improvement:
- Use mmap(2) to log commands and events on the local disk and to cache recent ones in memory.
- Use LSM tree-based RocksDB as the state store for optimizing write performance.
- One balance transfer request is executed as a distributed transaction across multiple shards, coordinated by a single Saga coordinator.
- The debit and credit accounts may be in two separate shards.
- In each shard, use the Raft consensus algorithm to replicate the event list across multiple machines.
- The Raft leader receives commands from the Saga coordinator and generates events.
- No need to replicate commands.
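A compact sketch of the event-sourcing rules above: commands are validated against the current state, only validated commands produce events, and applying an event is deterministic, so replaying the immutable event list reproduces the balances. The command/event shapes are assumptions for illustration.

```python
# Minimal sketch: validate commands, apply events deterministically, replay history.
from dataclasses import dataclass

@dataclass(frozen=True)
class TransferValidated:                       # event: immutable, append-only
    from_account: str
    to_account: str
    amount: int

def validate(state: dict[str, int], from_acct: str, to_acct: str, amount: int):
    """Command handler: only a command that passes validation yields an event."""
    if state.get(from_acct, 0) < amount:
        return None                            # rejected command, no event generated
    return TransferValidated(from_acct, to_acct, amount)

def apply(state: dict[str, int], event: TransferValidated) -> dict[str, int]:
    """Deterministic state transition: same state + same event => same result."""
    new_state = dict(state)
    new_state[event.from_account] -= event.amount
    new_state[event.to_account] = new_state.get(event.to_account, 0) + event.amount
    return new_state

# Replaying the immutable event history reproduces the current balances.
events = [TransferValidated("A", "B", 5), TransferValidated("B", "C", 3)]
state = {"A": 10, "B": 0, "C": 0}
for e in events:
    state = apply(state, e)
print(state)   # {'A': 5, 'B': 2, 'C': 3}
```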
Chapter 13. Stock Exchange
- Clients interact with the stock exchange system via brokers to:
- Place orders.
- View executions by order.
- View market data (L2 order book).
- Download historical data (candlestick charts) for analysis.
- An order is an inbound instruction, and an execution is an outbound matched result.
- A limit order specifies a fixed price and doesn’t require the order to be executed immediately, whereas a market order is the opposite.
- The exchange system consists of:
- a trading flow,
- a reporting flow,
- a market data flow.
- The latency-sensitive trading flow consists of:
- The client gateway, which provides authentication, rate limiting, validation, etc.
- The order manager, which performs risk checks and verifies that the buyer’s wallet has sufficient funds.
- The inbound sequencer stamps each inbound order with a sequence ID to provide functional determinism.
- The matching engine, which matches buy and sell orders.
- The outbound sequencer, which stamps each outbound execution pair with a sequence ID and passes them back to the order manager.
- The market data publisher receives executions from the matching engine and builds (1) order books and (2) candlestick charts.
- A candlestick includes the open, close, high and low prices over a time interval.
- The order manager merges incoming orders and outgoing executions into a reporting flow.
- An order book is a list of buy and sell orders for a specific security or financial instrument organized by price level.
- An L2 order book includes the quantities at different bid and ask price levels.
- To reduce end-to-end latency of the trading flow:
- Put everything in a single server to eliminate network hops.
- Use mmap(2) and shared memory for interprocess communication.
- Make each application loop single-threaded and pin it to a fixed CPU core.
- Institutional clients may use different gateways (or even a colocation service) from retail clients.
- Keep a warm matching engine as a fast failover for the hot one.
- Use the Raft protocol to replicate events to backup servers and to elect a new server when the active one dies.
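A toy sketch of the L2 order book above: each side keeps the total quantity at every price level, and an incoming limit buy matches against the best (lowest) ask levels first. This only illustrates the data layout and price-level matching, not the book’s actual matching engine.

```python
# Minimal sketch: price-level order book with a simple limit-buy match.
from collections import defaultdict

class OrderBook:
    def __init__(self):
        self.bids: dict[int, int] = defaultdict(int)    # price -> total quantity
        self.asks: dict[int, int] = defaultdict(int)

    def add_limit_buy(self, price: int, qty: int) -> int:
        """Match against asks at or below the limit price; return the filled quantity."""
        filled = 0
        for ask_price in sorted(self.asks):
            if ask_price > price or qty == 0:
                break
            take = min(qty, self.asks[ask_price])
            self.asks[ask_price] -= take
            if self.asks[ask_price] == 0:
                del self.asks[ask_price]
            filled, qty = filled + take, qty - take
        if qty > 0:
            self.bids[price] += qty                     # the rest rests as a new bid level
        return filled

    def l2_snapshot(self, depth: int = 5):
        """Top price levels on each side, as published by the market data service."""
        return (sorted(self.bids.items(), reverse=True)[:depth],
                sorted(self.asks.items())[:depth])

book = OrderBook()
book.asks.update({101: 300, 102: 500})
print(book.add_limit_buy(price=101, qty=200))   # 200 shares filled at price 101
print(book.l2_snapshot())                       # ([], [(101, 100), (102, 500)])
```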