[Book Notes] System Design Interview — An Insider’s Guide (Volume 2)

A must-read when jobs in tech are at risk.

Image Source: Amazon.com

The following content is my takeaways from Alex Xu and Sahn Lam’s book¹. Reading the original book is highly recommended for those who want to ace system design interviews.

Chapter 1. Proximity Service

  • Functional requirements:
    - Return all businesses based on a user’s location and a specified radius.
    - View details of a business.
    - Business administration operations, which may take effect the next day.
  • High-level design:
    - Location-based service.
    - Business service (CRUD operations).
  • Use a relational database, such as MySQL, for the read-heavy business data, and set up sharding and primary-secondary replication.
  • The evenly divided grid doesn’t work well because the distribution of business is not even.
  • Geospatial indexing:
    - Geohash: a 1D string that recursively subdivides the world into smaller and smaller grids with each additional bit.
    - Quadtree: an in-memory data structure that recursively subdivides a plane into four quadrants.
    - Google S2: mapping a sphere to a 1D index based on the Hilbert curve.
  • For 200 million businesses, the quadtree can easily fit in the working set of a single server and just takes a few minutes to build.
  • Geohash is easier to update on location change, but a quadtree can dynamically adjust the grid size based on population density.
  • Use a Redis cluster to cache:
    - The business info by id.
    - The business ids in a grid by geohash.

Chapter 2. Nearby Friends

  • Functional requirements:
    - Fetch nearby friends within a radius.
    - Refresh the user location every 30 seconds.
  • High-level design:
    - RESTful API servers for updating user profiles and friendships.
    - WebSocket servers, which maintain persistent connections with active users.
    - Redis Pub/Sub servers, which provide lightweight channels for propagating location updates to WebSocket connection handlers.
    - Redis cluster, which caches user locations (latitude, longitude, timestamp) with TTLs.
    - Cassandra for location history sharded by user id.
  • Creating a new Redis pub/sub channel is lightweight, so a user can subscribe to all the friends regardless of whether they are active or not.
  • Use consistent hashing to shard Redis Pub/Sub servers, and use ZooKeeper for service discovery.
  • Pub/Sub servers are stateful due to the subscriber list of each channel, and thus take some overhead to scale.

Chapter 3. Google Maps

  • Features:
    - User location update.
    - Navigation service, including ETA.
    - Real-time map rendering.
  • The mobile client should use as little data and battery as possible.
  • Geocoding is the process of converting addresses to geographic coordinates.
  • Geocoding info could stored in Redis since it’s read frequently but written infrequently.
  • Hierarchical routing tiles:
    - Local roads.
    - Arterial roads connecting districts.
    - Major highways connecting cities.
  • Routing tiles could be stored in an object storage, such as S3.
  • Location history should be batched before being sent to the server. Even so, it would be good to choose a database optimized for write throughput, such as Cassandra.
  • User location data is also logged into Kafka, where it could be consumed by other services, such as the traffic update service.
  • Use tiles of different resolution for map rendering of different zoom levels. These tiles should be pre-generated and be cached on a CDN.
  • The advantages of a vector map over a rasterized map:
    - Better chances for compression.
    - Better zooming experience.

Chapter 4. Distributed Message Queue

  • Additional features:
    - Long data retention.
    - Repeated consumption.
    - In-order delivery (for each partition).
    - Configurable data delivery semantics.
  • Support two messaging models:
    - Publish-subscribe, which is implemented by topics.
    - Point-to-point, which is supported by consumer groups.
  • A topic is divided into partitions, and a single partition can only be consumed by one consumer in the same group.
  • An optional key in the message can be used to determine the partition.
  • Storage includes:
    - Messages (on brokers’ local disks).
    - Consumer states (in ZooKeeper).
    - Topic metadata (in ZooKeeper).
  • A coordination service (e.g. ZooKeeper) is used for:
    - Service discovery (which brokers are active).
    - Leader election (which broker is the active controller).
  • Persist messages as WAL log files on disk to take advantage of the sequential access pattern.
  • Wrap the routing layer into the producer to:
    - Minimize the number of network hops.
    - Batch messages in a single request.
  • Latency-throughput trade-off: use a large batch size for high throughput and a small batch size for low latency.
  • Use the pull model so that:
    - Consumers control the consumption rate.
    - This enables aggressive batching of data.
  • For each consumer group, one of the brokers is assigned as the coordinator of the group for consumer rebalancing.
  • To rebalance a consumer group, the coordinator elects one of the consumers as the leader and asks it to generate a partition dispatch plan, which will then be forwarded to each group member.
  • A partition is replicated on multiple brokers and consumers only read from the leader replica.
  • One of the brokers in the cluster is elected as the active controller for maintaining the leader/follower relationship for all the partitions in the topic metadata.
  • Provide an acknowledgement setting (e.g. ACK=ALL) for trade-offs between performance and durability.
  • Decommissioned partitions can only be removed after the retention period.
  • For message filtering:
    - Filter messages on the broker side to avoid unnecessary traffic.
    - Attach tags to the metadata of each message so that brokers don’t need to decrypt or deserialize the payload.

Chapter 5. Metrics Monitoring and Alerting System

  • Requirements:
    - Operational system metrics both at the node level and at the service level.
    - 1 year data retention.
    - Rolled up to 1 min resolution after 7 days.
    - Rolled up to 1 hour resolution after 30 days.
    - Multiple alert channels.
  • Workload pattern:
    - Constant heavy write load.
    - Spiky read load.
  • Five components:
    - Data collection.
    - Data transmission, which updates the TSDB.
    - Data storage (TSDB).
    - Alerting, which queries the TSDB.
    - Visualization, which queries the TSDB.
  • A time series can be uniquely identified by its name, and optionally by a set of labels.
  • Use an industrial-scale time-series database (TSDB) as storage, such as InfluxDB or Prometheus. They have features like caching, indexing and expressive query languages, which could obviate the need of an standalone query service.
  • The pull model for data collection:
    - A metrics collector regularly pulls metrics from sources via HTTP.
    - Use ZooKeeper to maintain configurations of active sources.
    - Use consistent hashing for service partitioning among collectors.
    - Metrics can be viewed at any time for debugging purposes.
  • The push model for data collection:
    - The collection agent at each source sends metrics to the metrics collector.
    - The collection agent may aggregate and buffer metrics before sending them.
    - The collectors could be organized as an auto-scaling group behind a load balancer.
    - Short-lived batch jobs can only use the push model.
  • Three places for data aggregation:
    - The collection agent at each source.
    - The metrics collectors, which write the TSDB.
    - The TSDB readers.
  • Use Kafka to buffer data and to decouple data collection from data processing, which ingests metrics into the TSDB.
  • Set up an alert manager to:
    - Poll the TSDB or the query service.
    - Filter, merge and dedupe alerts.
Palo Alto, November 10, 2022

Chapter 6. Ad Click Event Aggregation

  • Requirements:
    - Aggregate the number of clicks of an ad in the past M minutes.
    - Return the top K most clicked ads in the past M minutes.
    - Support data filtering by different attributes.
    - Correctness must be guaranteed.
    - End-to-end latency should be a few minutes at most.
  • Store both the raw data and the aggregated data, and use Cassandra for the write-heavy workload.
  • Use two Kafka instances to:
    - Decouple the log watcher and the services that persist or aggregate the raw data.
    - Provide end-to-end exactly-once semantics for delivering the aggregated data.
  • Use the MapReduce framework for the aggregation service, which consumes the raw data from the first Kafka and produces the aggregated data in the second Kafka.
  • Pre-define filtering criteria and aggregate data based on them.
  • The Lambda architecture maintains the batch and stream processing paths separately, whereas the Kappa architecture combines them into one.
  • Two places to generate timestamps for aggregation:
    - The event time on the client side, which is accurate but could be skewed.
    - The processing time on the server side, which can be delayed.
  • How to handle delayed events when the event time is used?
    - Watermark: extend each aggregation window by 15 seconds, which improves data accuracy but increases overall latency.
    - End-of-day reconciliation.
  • The aggregation service peforms a single distributed transaction to:
    - Send data to the downstream Kafka (with an ACK).
    - Record the offset in an external storage, such as HDFS.

Chapter 7. Hotel Reservation System

  • Functional requirements:
    - View hotels and rooms.
    - Reserve a hotel room.
    - Add/delete/update a hotel room.
    - Support overbooking.
  • Non-functional requirements:
    - Support high concurrency.
    - Moderate latency (a few seconds for reservation).
  • Choose a relational database for the read-heavy workload and ACID properties.
  • Include an idempotency key in the API for making reservations to avoid double booking.
  • Use the microservice architecture and use remote procedure calls for inter-service communication.
  • Keep a pre-populated room type inventory (for 2 years ahead) to support reservations by room type (instead of by room id).
  • Concurrency control:
    - Pessimistic locking is prone to deadlocks and tends to have bad performance.
    - Optimistic locking is usually faster but performance dramatically drops when contention is heavy.
  • How to scale up?
    - Shard the database by hotel since most queries need to filter by it.
    - Cache inventory information (aggregates) in Redis to reduce database load and improve read performance.
  • Use the same service to manage reservations and the inventory to avoid data consistency issues that require 2PC or Saga.

Chapter 8. Distributed Email Service

  • The traditional mail sending path:
    - Sender client =>
    - Sender SMTP server =>
    - Receiver SMTP server =>
    - Receiver mail server storage =>
    - Receiver IMAP/POP server =>
    - Receiver client.
  • The proposed path:
    - Sender client =>
    - (LB =>) Web server =>
    - Outgoing queue =>
    - Sender-side SMTP worker =>
    - (LB =>) Receiver-side SMTP server =>
    - Incoming queue =>
    - Mail processing workers =>
    - WebSocket-based real-time servers =>
    - Receiver client (if online) and storage.
  • Storage consists of:
    - Metadata DB (NoSQL with denormalization).
    - Attachment store (Amazon S3).
    - Distributed cache (Redis).
    - Search store (Elasticsearch).
  • Kafka is used to decouple services that trigger reindexing and services that actually perform reindexing.

Chapter 9. S3-like Object Storage

  • Three types of storage:
    - Block storage (e.g. HDD, SSD).
    - File storage, which usually has a hierarchical directory structure.
    - Object storage, which stores all data as objects in a flat structure.
  • Characteristics of object storage:
    - Highly durable.
    - Vastly scalable.
    - Low cost.
    - Low performance.
    - Mainly for archival and backup.
  • Functional requirements:
    - Bucket creation.
    - Object uploading and downloading.
    - Object versioning.
    - Listing objects in a bucket.
  • Separate services for mutable metadata and immutable object data.
  • Life of a read request:
    - The load balancer forwards the request to the API service.
    - The API service performs authentication and authorization.
    - The API service queries the metadata service.
    - The API service queries the storage service.
    - A response is sent back to the client.
  • Components of the data store:
    - The stateless data routing service, which serves user requests.
    - The highly-reliable placement service, which maintains a virtual cluster map and monitors data nodes.
    - The data nodes with synchronous replication.
  • Problems with too many small files:
    - Internal fragmentation.
    - It may exceed the system’s i-node capacity.
  • Append multiple (small) objects into a large file and serialize writes to the same file.
  • Multipart upload: slice a large object into smaller parts and upload them independently to the object store.
  • Maintain the local object location info in SQLite at each data node.
  • Two ways to achieve data redundancy:
    - Replication, which entails no computation or tail latency.
    - Erasure coding, which achieves higher durability with less storage.
  • Use checksums to check against data corruption of files.
  • Shard the object metadata by the combination of the bucket name and the object name.
  • Support pagination of bucket listing by materializing the listing data in a separate table sharded by bucket_id.
  • Add an object_version column to support object versioning.
Sunnyvale, November 12, 2022

Chapter 10. Real-time Gaming Leaderboard

  • Functional requirements:
    - Display top 10 players.
    - Show a specific user’s rank along with the four players above and below her.
  • The client directly queries the leaderboard but updates scores only through the game service.
  • If fan-out on score updates (e.g. notification) is needed, use a message broker to connect the game service and the leaderboard service.
  • Store user profiles in one MySQL table and cache hot ones in Redis.
  • Store scores in another table and use a Redis sorted set to index users’ ranks.
  • A sorted set is implemented by a hash table (user -> score) and a skip list (score -> user).
  • Use AWS API Gateway and AWS Lambda to auto-scale the service.
  • The score data can be partitioned by either the score range or the user key.
  • Limitation of partitioning by score range:
    - Score distribution needs to be known a-priori to make load balanced.
  • Limitations of partitioning by user key:
    - Determining a user’s rank is not straightforward.
    - Tail latency amplification and more response traffic for top-k queries.
  • An alternative to MySQL + Redis is AWS DynamoDB with a global secondary index on scores.
  • More partitions balances load better but complicates scatter-gather queries.

Chapter 11. Payment System

  • The focus is on how to correctly handle payment transactions, rather than aiming for high throughput.
  • The payment flow consists of a pay-in flow (buyer -> e-commerce) and a pay-out flow (e-commerce -> seller).
  • A payment event may contain several payment orders.
  • In the pay-in flow, the payment service:
    - Accepts payment events and performs risk checks.
    - Sends payment orders to the payment executor.
    - Updates the wallet for a seller upon success of executing an order.
    - Updates the double-entry ledger.
  • Keep numbers in string format during transmission and storage to ensure precision.
  • Payment Service Provider (PSP) integration:
    - The payment executor sends a payment registration request with a UUID called nonce.
    - The PSP returns a token, which is the UUID of the payment order on the PSP side and is persisted by the payment executor.
    - The client displays a PSP-hosted payment page.
    - The web page is then redirected to the redirect URL set by the payment system based on the payment status returned by the PSP.
    - Asynchronously, the PSP returns the payment status to the payment executor via a webhook.
  • Reconciliation is a practice that periodically compares the states among related services in order to verify that they are in agreement.
  • Put a failed payment to either the retry queue or the dead letter queue.
  • Implementation of exact-once delivery:
    - Ensure at-least-once using retries.
    - Ensure at-most-once using idempotency checks.

Chapter 12. Digital Wallet

  • The functional requirement is to provide cross-wallet balance transfer.
  • It takes thousands of DB nodes to support millions of TPS.
  • Distributed transaction:
    - X/Open XA is a standard to coordinate heterogeneous database to achieve 2PC.
    - 2PC is not performant and is prone to the single point of failure.
    - 2PC wraps two phases in the same transaction, whereas TC/C separates them into separate transactions.
    - 2PC is executed at the database level, whereas TC/C and Saga are executed at the application level.
    - TC/C is better than Saga when latency requirement is high.
  • Event sourcing:
    - The advantage of event sourcing is reproducibility.
    - Only validated commands generate events.
    - Event generation (based on the state and the command) may contain randomness.
    - State update on event must be deterministic.
    - Events are saved as immutable history.
  • Command-query responsibility segregation (CQRS):
    - A single state machine for updating the state.
    - Multiple state machines for building read-only views.
  • Use mmap to log commands and events on the local disk and to cache recent ones in memory.
  • Use LSM tree-based RocksDB as the state store for optimizing write performance.
  • Use the Raft consensus algorithm to replicate the event list across multiple machines.

Chapter 13. Stock Exchange

  • Clients interact with the stock exchange system via the brokers to
    - Place orders.
    - View executions by order.
    - View market data (L2 order book).
    - Download historical data (candlestick charts) for analysis.
  • An order is an inbound instruction, and an execution is an outbound matched result.
  • The latency-sensitive trading flow consists of:
    - The client gateway, which provides authentication, rate limiting, validation, etc.
    - The order manager, which evaluates risks and checks users’ wallets.
    - The sequencer, stamps every inbound order and outbound execution pair with a sequence ID to provide functional determinism.
    - The matching engine, which matches buy and sell orders.
  • The market data publisher receives executions from the matching engine and builds (1) order books and (2) candlestick charts.
  • The order manager merges incoming orders and outgoing executions into a reporting flow.
  • An order book is a list of buy and sell orders for a specific security or financial instrument organized by price level.
  • To reduce end-to-end latency of the trading flow:
    - Put everything in a single server to eliminate network hops.
    - Use mmap and shared memory for interprocess communication.
    - Make each application loop single-threaded and pinned to a fixed CPU core.
  • Keep a warm matching engine as a fast failover of the hot one.
  • Use the Raft protocol to elect a new server when the active one dies.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store