[Book Notes] Production Practices of Data Processing Pipelines

Devin Z
3 min read · Apr 30, 2024


Notes from the SRE book¹ and the SRE workbook².

[Photo: Hakone Estate and Gardens, March 31, 2024]
  • Prefer skipping launches of a cron job to risking double launches, since the former tends to be easier to recover from.
  • Use the distributed cron service itself to track the state of cron jobs rather than storing it externally in a distributed file system.
    - The state is typically small, whereas a distributed file system such as GFS is optimized for large files, not many small writes.
    - Base services with a large blast radius should have as few dependencies as possible.
  • The replicas of a cron service are deployed as a Paxos group to ensure a consistent shared state.
    - Only the leader actively launches new jobs.
    - Followers synchronously track the state of the world, in particular before a launch begins and after it finishes.
    - Paxos logs and snapshots are stored on local disks of cron service replicas.
    - Paxos snapshots are also backed up in a distributed file system.
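
A minimal sketch of the leader-only launch protocol described above, assuming a Paxos-backed log abstraction; ReplicatedLog and start_job are illustrative stand-ins, not the actual service's API:

```python
from dataclasses import dataclass, field

# Sketch of the leader-only launch protocol. ReplicatedLog stands in for the
# Paxos-backed log; in the real service each append would be a consensus
# operation replicated to all followers before it returns.

@dataclass
class ReplicatedLog:
    entries: list = field(default_factory=list)

    def append(self, entry: dict) -> None:
        # A real implementation blocks until a quorum has accepted the entry.
        self.entries.append(entry)

def start_job(job_name: str) -> None:
    # Illustrative stand-in for handing the job to the datacenter scheduler.
    print(f"launching {job_name}")

class CronLeader:
    def __init__(self, log: ReplicatedLog):
        self.log = log

    def launch(self, job_name: str, scheduled_time: str) -> None:
        # Record the intent to launch *before* acting, so a follower that takes
        # over knows this launch may already have happened.
        self.log.append({"job": job_name, "time": scheduled_time, "state": "launch_started"})
        start_job(job_name)
        # Record completion once the launch attempt is done.
        self.log.append({"job": job_name, "time": scheduled_time, "state": "launch_finished"})

CronLeader(ReplicatedLog()).launch("daily-report", "2024-04-30T02:00Z")
```
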
  • Thundering herd: too many jobs are scheduled to run at the same time, causing spiky load.
  • Moiré load pattern: two or more pipelines run simultaneously and occasionally overlap in consuming a shared resource.
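
One common mitigation for the thundering-herd pattern above is to add deterministic jitter to each job's start time; a small sketch, assuming a fixed one-hour scheduling window and made-up job names:

```python
import hashlib

# Deterministic per-job jitter: spread nominally simultaneous launches across a
# scheduling window. The one-hour window and the job names are assumptions.

def jittered_start_offset(job_name: str, window_seconds: int = 3600) -> int:
    """Stable offset in [0, window_seconds) derived from the job name."""
    digest = hashlib.sha256(job_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % window_seconds

for job in ("rollup-ads", "rollup-search", "rollup-mail"):
    print(job, jittered_start_offset(job))
```

Because the offset is derived from the job name, it stays stable across scheduler restarts rather than reshuffling launches every run.
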
  • Google Workflow:
    - The Task Master uses the system prevalence pattern.
    - All job states are held in memory.
    - Mutations are synchronously journaled to persistent disk.
    - Workers are observers of the Task Master, in much the same way as a view is an observer of the model.
    - A worker keeps writing to uniquely named scratch files and may commit its work only if, at commit time, it still holds a valid lease and has an up-to-date configuration version.
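
A rough sketch of that commit-time check, assuming the Task Master exposes the current lease owner and configuration version; TaskMaster, Lease, and try_commit are illustrative names, not the actual Workflow API:

```python
import time
from dataclasses import dataclass

# Illustrative commit-time check: the worker may commit its scratch files only
# if it still holds a valid lease and its configuration version is current.

@dataclass
class Lease:
    owner: str
    expires_at: float  # seconds since the epoch

@dataclass
class TaskMaster:
    lease: Lease
    config_version: int

def commit_scratch_files(paths: list[str]) -> None:
    print("committing", paths)  # e.g. atomically rename/link into place

def try_commit(worker_id: str, worker_config_version: int,
               master: TaskMaster, scratch_files: list[str]) -> bool:
    lease_valid = (master.lease.owner == worker_id
                   and master.lease.expires_at > time.time())
    config_current = worker_config_version == master.config_version
    if lease_valid and config_current:
        commit_scratch_files(scratch_files)
        return True
    return False  # stale lease or stale config: abandon the work
```
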
  • Define and measure SLOs in terms of data freshness and data correctness.
    - Separate SLOs for data of different priorities.
    - For a multi-stage pipeline, SLOs should be measured end-to-end to capture real customer experience, while efficiency and resource usage should be measured at each individual stage.
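
As a concrete illustration, end-to-end freshness can be tracked as the age of the newest record to clear the final pipeline stage; a minimal sketch with an assumed two-hour SLO threshold:

```python
from datetime import datetime, timedelta, timezone

# End-to-end freshness: the age of the newest record to clear the final stage.
# The two-hour threshold is an assumed example SLO, not a value from the books.

FRESHNESS_SLO = timedelta(hours=2)

def freshness(last_end_to_end_success: datetime) -> timedelta:
    return datetime.now(timezone.utc) - last_end_to_end_success

def freshness_slo_met(last_end_to_end_success: datetime) -> bool:
    return freshness(last_end_to_end_success) <= FRESHNESS_SLO

# Example: the latest record to reach the final stage is 45 minutes old.
print(freshness_slo_met(datetime.now(timezone.utc) - timedelta(minutes=45)))  # True
```
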
  • Pipeline development lifecycle:
    - prototyping
    - testing with a 1% dry run
    - staging environment (near-production data)
    - canarying (partial tasks or partial data)
    - partial deployment (traffic ramp-up)
    - deploying to production
  • Use the idempotent mutation design pattern to prevent storing duplicate or incorrect data.
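
A minimal sketch of the idempotent mutation pattern: derive a deterministic key from each record's identifying fields so retries overwrite the same row instead of creating duplicates; the in-memory dict stands in for the output store (in practice, an upsert):

```python
import hashlib
import json

# Idempotent mutation: the same input always maps to the same key, so
# reprocessing a record overwrites the existing row rather than duplicating it.

output_store: dict[str, dict] = {}

def mutation_key(record: dict) -> str:
    canonical = json.dumps({"user": record["user"], "day": record["day"]}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def apply_mutation(record: dict) -> None:
    output_store[mutation_key(record)] = record

record = {"user": "alice", "day": "2024-04-30", "plays": 17}
apply_mutation(record)
apply_mutation(record)  # retrying is safe: still exactly one row
print(len(output_store))  # 1
```
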
  • Two-phase mutation:
    - The mutations induced by a transformation pipeline are first written to a temporary location.
    - Validation is conducted against the planned mutations.
    - The verified mutations are applied via a follow-up pipeline.
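
A small sketch of two-phase mutation, under the assumption that validation is a simple bound on how many rows may change; the 10% threshold is an illustrative placeholder:

```python
# Two-phase mutation: stage planned mutations, validate them, then apply.

def two_phase_apply(planned_mutations: list[dict], live_store: dict) -> None:
    # Phase 1: write the planned mutations to a temporary (staging) location.
    staging = {m["key"]: m["value"] for m in planned_mutations}

    # Validate against the staged data before anything touches production.
    changed = sum(1 for k, v in staging.items() if live_store.get(k) != v)
    if live_store and changed > 0.1 * len(live_store):
        raise RuntimeError("validation failed: too many rows would change")

    # Phase 2: a follow-up step applies only the verified mutations.
    live_store.update(staging)

store = {f"row{i}": i for i in range(100)}
two_phase_apply([{"key": "row7", "value": 700}], store)
print(store["row7"])  # 700
```
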
  • The most common culprits of pipeline failures are data delay and data corruption.
  • Set the number of shards M much larger than the number of workers N, so that work can be reassigned in fine-grained chunks.
    - rule of thumb: M = k * N * log(N)
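
A worked example of the rule of thumb, assuming k = 2 and a natural logarithm (both arbitrary choices here):

```python
import math

# Worked example of M = k * N * log(N). Both k = 2 and the natural logarithm
# are arbitrary; the point is only that M grows a bit faster than N.

def shard_count(num_workers: int, k: float = 2.0) -> int:
    return math.ceil(k * num_workers * math.log(num_workers))

print(shard_count(100))  # 922 shards for 100 workers (2 * 100 * ln 100 ≈ 921)
```
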
  • Spotify event delivery system:
    - An event is an end-user interaction.
    - The system isolates events of different types.
    - Events are published by microservices running in both Spotify datacenters and Google Compute Engine.
    - Event collection and delivery are decoupled into separate failure domains by Google Cloud Pub/Sub.
    - Delivered events are grouped into hourly buckets (in Google Cloud Storage) and then further grouped into event-type directories.
    - Timeliness SLO is defined as the maximum delay of delivering an hourly bucket of data.
    - Skewness SLO is defined as the maximal percentage of events that may be misplaced into the wrong hourly bucket on a given day.
    - Completeness SLO is defined as the percentage of events that are delivered after they are successfully published to the system.
    - CPU usage is a basic monitoring signal and also guides capacity planning (roughly 50% utilization at peak hours).
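
A minimal sketch of how the three SLOs above could be computed for one day of hourly buckets; the data model (published/delivered counts, misplaced counts, bucket delivery times) is assumed for illustration, not Spotify's actual schema:

```python
from datetime import datetime, timedelta

# Computing the three delivery SLO metrics for one day of hourly buckets.

def timeliness(bucket_due: list[datetime], bucket_delivered: list[datetime]) -> timedelta:
    """Maximum delay between when each hourly bucket was due and when it was delivered."""
    return max(d - due for due, d in zip(bucket_due, bucket_delivered))

def skewness(misplaced_events: int, delivered_events: int) -> float:
    """Percentage of the day's events that landed in the wrong hourly bucket."""
    return 100.0 * misplaced_events / delivered_events

def completeness(delivered_events: int, published_events: int) -> float:
    """Percentage of successfully published events that were eventually delivered."""
    return 100.0 * delivered_events / published_events

print(skewness(misplaced_events=120, delivered_events=1_000_000))          # 0.012
print(completeness(delivered_events=999_950, published_events=1_000_000))  # 99.995
```
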

References:

  1. Betsy Beyer et al. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. 2016.
  2. Betsy Beyer et al. The Site Reliability Workbook: Practical Ways to Implement SRE. O’Reilly Media. 2018.
