Notes from the SRE book¹ and the SRE workbook².
- Prefer skipping launches of a cron job to risking double launches, since the former tends to be easier to recover from.
- Use the distributed cron service itself to track the state of cron jobs rather than storing it externally in a distributed file system.
- The state is typically small, whereas GFS is designed for large files, so it is a poor fit.
- Base services that have a large blast radius should have fewer dependencies.
- The replicas of a cron service are deployed as a Paxos group to ensure a consistent shared state.
- Only the leader actively launches new jobs.
- Followers synchronously track the state of the world, particularly before a launch starts and after it finishes (see the sketch below).
- Paxos logs and snapshots are stored on local disks of cron service replicas.
- Paxos snapshots are also backed up in a distributed file system.
- Thundering herd: too many jobs are scheduled to run at the same time, causing spiky load.
- Moire load pattern: two or more pipelines run simultaneously and consume some shared resource.
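A minimal sketch of the launch protocol described above, assuming a hypothetical consensus client with a `replicate()` call that commits a record to the Paxos group and a `job` object with `name` and `run()`; none of these names come from the actual implementation.

```python
import time
import uuid


class CronLeader:
    """Only the leader launches jobs. The intent to launch is committed to the
    replicated state *before* any external side effect, and completion is
    committed after the job finishes, so a newly elected leader can tell
    whether a launch may already be in flight."""

    def __init__(self, paxos_group):
        self.paxos = paxos_group  # hypothetical consensus client

    def launch(self, job):
        launch_id = str(uuid.uuid4())
        # Followers see this record before the launch has any visible effect.
        self.paxos.replicate({"job": job.name, "launch": launch_id,
                              "state": "STARTING", "ts": time.time()})
        try:
            job.run()  # e.g. ask the datacenter scheduler to start the job
        except Exception:
            # Prefer skipping this run over retrying into a double launch;
            # a missed run is usually easier to recover from.
            self.paxos.replicate({"job": job.name, "launch": launch_id,
                                  "state": "FAILED", "ts": time.time()})
            return
        # Completion is recorded so a failover does not relaunch the same run.
        self.paxos.replicate({"job": job.name, "launch": launch_id,
                              "state": "FINISHED", "ts": time.time()})
```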
- Google Workflow:
- The Task Master uses the system prevalence pattern.
- All job states are held in memory.
- Mutations are synchronously journaled to persistent disk.
- Workers are observers of the Task Master, in much the same way as a view is an observer of the model.
- A worker continues writing to uniquely named scratch files and can only commit its work if it still holds a valid lease and an up-to-date configuration version at commit time (see the sketch below).
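A rough sketch of the two mechanisms above, assuming made-up names (`TaskMaster`, `Worker.try_commit`, the journal format): the Task Master holds all task state in memory and synchronously journals each mutation before applying it, and a worker only promotes its scratch file while its lease and configuration version are still current.

```python
import json
import os
import time
from dataclasses import dataclass


class TaskMaster:
    """System prevalence sketch: state lives in memory, and every mutation is
    appended and fsync'd to a journal on persistent disk before it is applied,
    so the in-memory state can be rebuilt by replaying the journal."""

    def __init__(self, journal_path="taskmaster.journal"):
        self.tasks = {}                          # all job state held in memory
        self.journal = open(journal_path, "a")

    def mutate(self, task_id, **fields):
        record = {"task": task_id, "fields": fields, "ts": time.time()}
        self.journal.write(json.dumps(record) + "\n")
        self.journal.flush()
        os.fsync(self.journal.fileno())          # journal hits disk first
        self.tasks.setdefault(task_id, {}).update(fields)  # then mutate memory


@dataclass
class Worker:
    """Workers observe the Task Master (model/view style); output goes to a
    uniquely named scratch file and is committed only while the lease and the
    configuration version are still valid."""
    lease_expires_at: float
    config_version: int

    def try_commit(self, scratch_path: str, final_path: str,
                   current_config_version: int) -> bool:
        if time.time() >= self.lease_expires_at:
            return False                         # lease lost: abandon the output
        if self.config_version != current_config_version:
            return False                         # stale config: abandon the output
        os.replace(scratch_path, final_path)     # atomic rename stands in for "commit"
        return True
```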
- Define and measure SLOs in terms of data freshness and data correctness.
- Separate SLOs for data of different priorities.
- For a multi-stage pipeline, SLOs should be measured end-to-end to capture the real customer experience, while efficiency and resource usage should be measured at each individual stage.
- Pipeline development lifecycle:
- prototyping
- testing with a 1% dry run
- staging environment (almost production data)
- canarying (partial tasks or partial data)
- partial deployment (traffic ramp-up)
- deploying to production
- Use the idempotent mutation design pattern to prevent storing duplicate or incorrect data.
- Two-phase mutation:
- The mutations produced by a transformation pipeline are first written to a temporary location.
- Validation is conducted against the planned mutations.
- The verified mutations are applied via a follow-up pipeline (see the sketch below).
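A minimal sketch of the two-phase mutation pattern, with an illustrative interface: `transform` writes its planned output under a staging directory and returns the file paths, and `validate` checks that staged output (row counts, checksums, schema) before anything touches the real destination.

```python
import os
import shutil


def run_two_phase(transform, validate, staging_dir, final_dir):
    """Phase 1: write planned mutations to a temporary location only.
    Phase 2 (the follow-up pipeline): apply them after validation passes."""
    os.makedirs(staging_dir, exist_ok=True)

    staged_files = transform(staging_dir)        # planned mutations, staged only

    if not validate(staged_files):
        # Nothing was applied to the real destination, so a failed validation
        # never leaves duplicate or corrupt data behind.
        raise RuntimeError(f"validation failed; {final_dir} was not modified")

    os.makedirs(final_dir, exist_ok=True)
    for path in staged_files:
        shutil.move(path, os.path.join(final_dir, os.path.basename(path)))
```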
- The most common culprits of pipeline failures are data delay and data corruption.
- Set the number of shards M much bigger than the number of workers N.
- rule of thumb: M = k * N * log(N) (see the sketch below)
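The rule of thumb in code; `k` is a small constant chosen by the operator, and the natural logarithm is assumed here.

```python
import math


def num_shards(num_workers: int, k: float = 3.0) -> int:
    """M = k * N * log(N): keep the number of shards M much larger than the
    number of workers N, so slow or failed shards can be reassigned without
    leaving workers idle near the end of a run."""
    n = max(num_workers, 2)                      # log(N) needs N > 1 to be useful
    return max(num_workers, math.ceil(k * n * math.log(n)))


print(num_shards(100))  # ~1382 shards for 100 workers with k = 3
```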
- Spotify event delivery system:
- An event is an end-user interaction.
- The system isolates events of different types.
- Events are published by microservices running in both Spotify datacenters and Google Compute Engine.
- Event collection and delivery are decoupled into separate failure domains by Google Cloud Pub/Sub.
- Delivered events are grouped into hourly buckets (in Google Cloud Storage) and then further grouped into event-type directories.
- Timeliness SLO is defined as the maximum delay of delivering an hourly bucket of data.
- Skewness SLO is defined as the maximum percentage of data that can be misplaced (delivered to the wrong hourly bucket) on a daily basis.
- Completeness SLO is defined as the percentage of events that are delivered after they are successfully published to the system (see the sketch below).
- CPU usage is a basic signal for monitoring and also guides capacity planning (about 50% at peak hours).
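A sketch of how the three SLOs above could be computed from per-bucket delivery metadata; the record layout is invented for illustration and is not Spotify's actual schema.

```python
from dataclasses import dataclass


@dataclass
class HourlyBucket:
    hour_close_ts: float      # end of the hour this bucket covers (epoch seconds)
    delivered_ts: float       # when the bucket was delivered to Cloud Storage
    events_published: int     # events successfully published for this hour
    events_delivered: int     # events that ended up in this hour's directories
    events_misplaced: int     # events that landed in the wrong hourly bucket


def timeliness(buckets):
    """Timeliness: maximum delay of delivering an hourly bucket."""
    return max(b.delivered_ts - b.hour_close_ts for b in buckets)


def skewness(buckets):
    """Skewness: percentage of a day's data that was misplaced."""
    delivered = sum(b.events_delivered for b in buckets)
    misplaced = sum(b.events_misplaced for b in buckets)
    return 100.0 * misplaced / delivered if delivered else 0.0


def completeness(buckets):
    """Completeness: percentage of published events that were delivered."""
    published = sum(b.events_published for b in buckets)
    delivered = sum(b.events_delivered for b in buckets)
    return 100.0 * delivered / published if published else 100.0
```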
References:
- Betsy Beyer et al. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. 2016.
- Betsy Beyer et al. The Site Reliability Workbook: Practical Ways to Implement SRE. O’Reilly Media. 2018.