Notes from the SRE book¹ and the SRE workbook².
- Prefer skipping launches of a cron job to risking double launches, since the former tends to be easier to recover from.
- Use the distributed cron service itself to track the state of cron jobs rather than storing it externally in a distributed file system.
- The state is typically small, whereas GFS is designed for large files, so it is a poor fit.
- Base services that have a large blast radius should have fewer dependencies.
- The replicas of a cron service are deployed as a Paxos group to ensure a consistent shared state.
- Only the leader actively launches new jobs.
- Followers synchronously track the state of the world, particularly before a launch starts and after it finishes (see the sketch below).
- Paxos logs and snapshots are stored on local disks of cron service replicas.
- Paxos snapshots are also backed up in a distributed file system.
- Thundering herd: too many jobs are scheduled to run at the same time, causing spiky load.
- Moire load pattern: two or more pipelines run simultaneously and consume some shared resource.
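A minimal sketch of the launch protocol described above, assuming a hypothetical consensus client with a `replicate()` call that commits a record to the Paxos group and a `job` object with `name` and `run()`; none of these names come from the actual implementation.

```python
import time
import uuid


class CronLeader:
    """Only the leader launches jobs. The intent to launch is committed to the
    replicated state *before* any external side effect, and completion is
    committed after the job finishes, so a newly elected leader can tell
    whether a launch may already be in flight."""

    def __init__(self, paxos_group):
        self.paxos = paxos_group  # hypothetical consensus client

    def launch(self, job):
        launch_id = str(uuid.uuid4())
        # Followers see this record before the launch has any visible effect.
        self.paxos.replicate({"job": job.name, "launch": launch_id,
                              "state": "STARTING", "ts": time.time()})
        try:
            job.run()  # e.g. ask the datacenter scheduler to start the job
        except Exception:
            # Prefer skipping this run over retrying into a double launch;
            # a missed run is usually easier to recover from.
            self.paxos.replicate({"job": job.name, "launch": launch_id,
                                  "state": "FAILED", "ts": time.time()})
            return
        # Completion is recorded so a failover does not relaunch the same run.
        self.paxos.replicate({"job": job.name, "launch": launch_id,
                              "state": "FINISHED", "ts": time.time()})
```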
- Google Workflow:
- The Task Master uses the system prevalence pattern.
- All job states are held in memory.
- Mutations are synchronously journaled to persistent disk.
- Workers are observers of the Task Master, in much the same way as a view is an observer of the model.
- A worker continues writing to uniquely named scratch files and can only commit its work if it still holds a valid lease and an up-to-date configuration version at commit time (see the sketch below).
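A rough sketch of the two mechanisms above, assuming made-up names (`TaskMaster`, `Worker.try_commit`, the journal format): the Task Master holds all task state in memory and synchronously journals each mutation before applying it, and a worker only promotes its scratch file while its lease and configuration version are still current.

```python
import json
import os
import time
from dataclasses import dataclass


class TaskMaster:
    """System prevalence sketch: state lives in memory, and every mutation is
    appended and fsync'd to a journal on persistent disk before it is applied,
    so the in-memory state can be rebuilt by replaying the journal."""

    def __init__(self, journal_path="taskmaster.journal"):
        self.tasks = {}                          # all job state held in memory
        self.journal = open(journal_path, "a")

    def mutate(self, task_id, **fields):
        record = {"task": task_id, "fields": fields, "ts": time.time()}
        self.journal.write(json.dumps(record) + "\n")
        self.journal.flush()
        os.fsync(self.journal.fileno())          # journal hits disk first
        self.tasks.setdefault(task_id, {}).update(fields)  # then mutate memory


@dataclass
class Worker:
    """Workers observe the Task Master (model/view style); output goes to a
    uniquely named scratch file and is committed only while the lease and the
    configuration version are still valid."""
    lease_expires_at: float
    config_version: int

    def try_commit(self, scratch_path: str, final_path: str,
                   current_config_version: int) -> bool:
        if time.time() >= self.lease_expires_at:
            return False                         # lease lost: abandon the output
        if self.config_version != current_config_version:
            return False                         # stale config: abandon the output
        os.replace(scratch_path, final_path)     # atomic rename stands in for "commit"
        return True
```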
- Define and measure SLOs in terms of data freshness and data correctness.
- Separate SLOs for data of different priorities.
- For a multi-stage pipeline, SLOs should be measured end-to-end to capture the real customer experience, while efficiency and resource usage should be measured at each individual stage.
- Pipeline development lifecycle:
- prototyping
- testing with a 1% dry run
- staging environment (almost production data)
- canarying (partial tasks or partial data)
- partial deployment (traffic ramp-up)
- deploying to production
- Use the idempotent mutation design pattern to prevent storing duplicate or incorrect data.
- Two-phase mutation:
- The mutations produced by a transformation pipeline are first written to a temporary location.
- Validation is conducted against the planned mutations.
- The verified mutations are applied via a follow-up pipeline (see the sketch below).
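A minimal sketch of the two-phase mutation pattern, with an illustrative interface: `transform` writes its planned output under a staging directory and returns the file paths, and `validate` checks that staged output (row counts, checksums, schema) before anything touches the real destination.

```python
import os
import shutil


def run_two_phase(transform, validate, staging_dir, final_dir):
    """Phase 1: write planned mutations to a temporary location only.
    Phase 2 (the follow-up pipeline): apply them after validation passes."""
    os.makedirs(staging_dir, exist_ok=True)

    staged_files = transform(staging_dir)        # planned mutations, staged only

    if not validate(staged_files):
        # Nothing was applied to the real destination, so a failed validation
        # never leaves duplicate or corrupt data behind.
        raise RuntimeError(f"validation failed; {final_dir} was not modified")

    os.makedirs(final_dir, exist_ok=True)
    for path in staged_files:
        shutil.move(path, os.path.join(final_dir, os.path.basename(path)))
```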
- The most common culprits of pipeline failures are data delay and data corruption.
- Set the number of shards M much bigger than the number of workers N.
- rule of thumb: M = k * N * log(N) (see the sketch below)
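The rule of thumb in code; `k` is a small constant chosen by the operator, and the natural logarithm is assumed here.

```python
import math


def num_shards(num_workers: int, k: float = 3.0) -> int:
    """M = k * N * log(N): keep the number of shards M much larger than the
    number of workers N, so slow or failed shards can be reassigned without
    leaving workers idle near the end of a run."""
    n = max(num_workers, 2)                      # log(N) needs N > 1 to be useful
    return max(num_workers, math.ceil(k * n * math.log(n)))


print(num_shards(100))  # ~1382 shards for 100 workers with k = 3
```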
- Spotify event delivery system:
- An event is an end-user interaction.
- The system isolates events of different types.
- Events are published by microservices running in both Spotify datacenters and Google Compute Engine.
- Event collection and delivery are decoupled into separate failure domains by Google Cloud Pub/Sub.
- Delivered events are grouped into hourly buckets (in Google Cloud Storage) and then further grouped into event-type directories.
- Timeliness SLO is defined as the maximum delay of delivering an hourly bucket of data.
- Skewness SLO is defined as the maximum percentage of data that can be misplaced (delivered to the wrong hourly bucket) on a daily basis.
- Completeness SLO is defined as the percentage of events that are delivered after they are successfully published to the system (see the sketch below).
- CPU usage is a basic signal for monitoring and also guides capacity planning (about 50% at peak hours).
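A sketch of how the three SLOs above could be computed from per-bucket delivery metadata; the record layout is invented for illustration and is not Spotify's actual schema.

```python
from dataclasses import dataclass


@dataclass
class HourlyBucket:
    hour_close_ts: float      # end of the hour this bucket covers (epoch seconds)
    delivered_ts: float       # when the bucket was delivered to Cloud Storage
    events_published: int     # events successfully published for this hour
    events_delivered: int     # events that ended up in this hour's directories
    events_misplaced: int     # events that landed in the wrong hourly bucket


def timeliness(buckets):
    """Timeliness: maximum delay of delivering an hourly bucket."""
    return max(b.delivered_ts - b.hour_close_ts for b in buckets)


def skewness(buckets):
    """Skewness: percentage of a day's data that was misplaced."""
    delivered = sum(b.events_delivered for b in buckets)
    misplaced = sum(b.events_misplaced for b in buckets)
    return 100.0 * misplaced / delivered if delivered else 0.0


def completeness(buckets):
    """Completeness: percentage of published events that were delivered."""
    published = sum(b.events_published for b in buckets)
    delivered = sum(b.events_delivered for b in buckets)
    return 100.0 * delivered / published if published else 100.0
```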
References:
- Betsy Beyer et al. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. 2016.
- Betsy Beyer et al. The Site Reliability Workbook: Practical Ways to Implement SRE. O’Reilly Media. 2018.