Scale programming to reduce the amortized cost.
The following are notes I took from the free online book Software Engineering at Google: Lessons Learned from Programming over Time¹, with (possibly) slight personal paraphrasing.
- Three critical differences between software engineering and programming: time, scale and trade-offs.
- Software engineering is programming over time.
- Software is sustainable when, for the expected life span of the code, we are capable of responding to changes in dependencies, technology, or product requirements.
- Hyrum’s Law: with a sufficient number of users of an API, it does not matter what you promise in the contract; all observable behaviors of your system will be depended on by somebody.
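A minimal sketch of how Hyrum’s Law shows up in practice. All names here (`list_user_ids_v1`, `list_user_ids_v2`, `oldest_user_id`) are hypothetical. The documented contract never promises an ordering, yet a caller ends up depending on the ordering the first implementation happened to produce.

```python
def list_user_ids_v1(users: dict[str, int]) -> list[int]:
    """Contract: returns all user ids, in no particular order."""
    # Happens to return sorted ids: observable, but never promised.
    return sorted(users.values())


def list_user_ids_v2(users: dict[str, int]) -> list[int]:
    """Same contract; a later rewrite that no longer sorts."""
    return list(users.values())


def oldest_user_id(list_ids, users: dict[str, int]) -> int:
    """A caller that silently depends on the unpromised sort order."""
    return list_ids(users)[0]


users = {"carol": 30, "alice": 10, "bob": 20}
v1_result = oldest_user_id(list_user_ids_v1, users)  # smallest id
v2_result = oldest_user_id(list_user_ids_v2, users)  # first inserted id
```

Both implementations honor the written contract, but the caller’s answer changes between them, so the "compatible" rewrite is a breaking change for that caller.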
- It’s programming if “clever” is a compliment, but it’s software engineering if “clever” is an accusation.
- The churn rule: infrastructure teams must do the work to move their internal users to new versions themselves or do the update in place in backward-compatible fashion.
  - The infrastructure team has better domain knowledge.
  - The benefits of a migration are often diffused across an organization, and centralizing the migration to a dedicated group internalizes the externalities that an unfunded mandate creates.
- The Beyoncé rule: if a product experiences outages or other problems as a result of infrastructure changes, but the issue wasn’t surfaced by tests in our Continuous Integration (CI) systems, it is not the fault of the infrastructure change.
- Every task your organization has to do repeatedly should be scalable (linear or better) in terms of human input. Policies are a wonderful tool for making process scalable.
- Shifting left: finding problems earlier in the developer workflow usually reduces costs.
- The limiting factor in software engineering is usually personnel cost instead of financial cost, so keeping engineers happy, focused and engaged can easily dominate other factors.
- It is far more important to optimize for obstacle-free brainstorming than to protect against someone wandering off with a bunch of markers.
- Jevons Paradox: consumption of a resource may increase as a response to greater efficiency in its use. Ex: a more efficient distributed build system leads to more bloated or unnecessary dependencies.
- Contrary to some people’s instincts, leaders who admit mistakes are more respected, not less.
- Humans are mostly a collection of intermittent bugs.
- The Genius Myth is the tendency that we as humans need to ascribe the success of a team to a single person/leader.
- Fail early, fail fast, fail often.
- The bus factor: the number of people that need to get hit by a bus before a project is completely doomed.
- It is better to be one part of a successful project than the critical part of a failed project.
- Three pillars of social skills: humility, respect and trust.
- A proper postmortem should always contain an explanation of what was learned and what is going to change as a result of the learning experience.
- The more open you are to influence, the more you are able to influence; the more vulnerable you are, the stronger you appear.
- Psychological safety is the foundation for fostering a knowledge-sharing environment.
- Written knowledge scales better than tribal knowledge to a larger organization, but it comes with a maintenance cost and might be less applicable to individual learners’ situations.
- It’s important not to mistakenly equate “seniority” with “knowing everything”.
- Chesterton’s fence: before removing or changing something, first understand why it’s there.
- One reason that people avoid becoming managers is that quantifying management work is difficult.
- Peter Principle: in a hierarchy every employee tends to rise to his level of incompetence.
- Servant leadership: the most important thing you can do as a leader is to serve your team and strive to create an atmosphere of humility, respect and trust.
- Traditional managers worry about how to get things done, whereas great managers worry about what things get done.
- If an individual succeeds, praise them in front of the team; if an individual fails, give constructive criticism in private.
- The leader is always on stage, so always stay calm.
- In many cases, knowing the right person is more valuable than knowing the right answer.
- Be kind and empathetic when delivering constructive criticism, without resorting to a compliment sandwich.
- If you are new to a leadership role, try to delegate work to other engineers even if it will take them a lot longer than you would to accomplish the work.
- If you’ve been leading teams for a while or if you pick up a new team, get your hands dirty and take on a grungy task that no one else wants to do.
- Shield your team from uncertainty and organizational chaos.
- Increase people’s intrinsic motivation by giving them autonomy, mastery and purpose.
- Optimize for the reader of the code rather than the author, since given the passage of time the code will be read far more frequently than it is written.
- Code review reinforces to software engineers that code is not “theirs” but in fact part of a collective enterprise.
- Code review involves an exchange of ideas and knowledge sharing, which benefits both the reviewer and the reviewee.
- Code reviewers should defer to authors on particular approaches, and only point out alternatives if the author’s approach is deficient.
- Code reviewers should avoid responding to the code review in piecemeal fashion.
- The code review itself provides a historical record.
- Because documentation’s benefits are all necessarily downstream, the author generally doesn’t reap them immediately.
- Use the same developer workflow (ownership, source control, bug tracking, code review, etc.) to write and maintain code and documents.
- Differentiate API comments from implementation comments.
- Keep documents focused on one purpose and targeted at a specific audience.
- The act of writing tests also improves the design of your systems by forcing you to confront the issues early on in the development cycle.
- Test sizes in terms of resources consumed by a test:
  - A small test must run in a single process and must not perform I/O or make any other blocking calls.
  - A medium test must be contained on a single machine and is not allowed to make network calls to systems other than localhost.
  - Large tests are the rest.
- Test scopes in terms of how much code a test is intended to validate:
  - Unit tests are designed to validate the logic of an individual class or function (about 80% of the test suite).
  - Integration tests are designed to verify interactions between a small number of components (about 15% of the test suite).
  - System tests are designed to validate the interaction of several distinct parts of the system (about 5% of the test suite).
- At the appropriate test scope, make the test size as small as possible.
- Prevent brittle tests by
  - testing via public APIs rather than implementations;
  - testing states rather than interactions.
- Control flow statements such as conditionals and loops are discouraged in tests, so that a failing test is easy to inspect.
- Rather than writing a test for each method, write a test for each behavior.
- Structure a test into “given”, “when” and “then”.
- Descriptive And Meaningful Phrases (DAMP): a little bit of duplication is okay in tests so long as it makes tests simpler and clearer.
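The test-writing advice above can be sketched with Python’s standard `unittest` module. `ShoppingCart` is a hypothetical class invented for this illustration. Each test is named after one behavior, goes through the public API only, asserts on resulting state, and contains no loops or conditionals.

```python
import unittest


class ShoppingCart:
    """Hypothetical system under test."""

    def __init__(self):
        self._items = {}  # name -> (price_cents, quantity)

    def add(self, name: str, price_cents: int, quantity: int = 1) -> None:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        _, old_quantity = self._items.get(name, (price_cents, 0))
        self._items[name] = (price_cents, old_quantity + quantity)

    def total_cents(self) -> int:
        return sum(price * qty for price, qty in self._items.values())


class ShoppingCartTest(unittest.TestCase):
    def test_adding_same_item_twice_accumulates_quantity(self):
        # Given a cart that already holds one pen
        cart = ShoppingCart()
        cart.add("pen", 150)
        # When two more pens are added
        cart.add("pen", 150, quantity=2)
        # Then the total reflects three pens
        self.assertEqual(cart.total_cents(), 450)

    def test_rejects_non_positive_quantity(self):
        # Given an empty cart (setup repeated on purpose: DAMP over DRY)
        cart = ShoppingCart()
        # When / Then an invalid quantity is refused and nothing is added
        with self.assertRaises(ValueError):
            cart.add("pen", 150, quantity=0)
        self.assertEqual(cart.total_cents(), 0)
```

Neither test inspects `_items`, so the internal representation can change without breaking them.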
- A seam is a way to make code testable by allowing for the use of test doubles — it makes it possible to use different dependencies for the system under test rather than dependencies used in a production environment.
- A mock is a test double whose behavior is specified inline in a test.
- A fake is a lightweight implementation of an API that behaves similarly to the real implementation but isn’t suitable for production. It should be written and maintained by the team owning the real implementation.
- Stubbing is the process of giving behavior to a function that otherwise has no behavior on its own, usually done through mocking frameworks.
- Overusing stubbing can make tests unclear, brittle and less effective.
- For the sake of fidelity, a real implementation is preferred if it is fast, deterministic, and has simple dependencies.
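A small sketch of the test-double vocabulary above, assuming a hypothetical `PaymentGateway` dependency and `Checkout` class. The constructor argument is the seam. `FakePaymentGateway` is a fake, and the `unittest.mock` object is a stub whose behavior is specified inline.

```python
from unittest import mock


class PaymentGateway:
    """Hypothetical production dependency (would call a remote service)."""

    def charge(self, account: str, cents: int) -> bool:
        raise NotImplementedError("talks to the network in production")


class FakePaymentGateway(PaymentGateway):
    """Fake: a lightweight in-memory implementation of the same API."""

    def __init__(self, balances: dict[str, int]):
        self.balances = dict(balances)

    def charge(self, account: str, cents: int) -> bool:
        if self.balances.get(account, 0) < cents:
            return False
        self.balances[account] -= cents
        return True


class Checkout:
    # Seam: the gateway is injected, so a test can pass in a double.
    def __init__(self, gateway: PaymentGateway):
        self._gateway = gateway

    def buy(self, account: str, cents: int) -> str:
        return "ok" if self._gateway.charge(account, cents) else "declined"


# With the fake, the test can check real-looking state afterwards.
fake = FakePaymentGateway({"alice": 500})
fake_outcome = Checkout(fake).buy("alice", 300)

# With a stub from a mocking framework, behavior is hard-coded inline.
stub = mock.create_autospec(PaymentGateway, instance=True)
stub.charge.return_value = False
stub_outcome = Checkout(stub).buy("alice", 300)
```

The fake keeps the balance logic honest, whereas the stub returns whatever the test tells it to, which is one reason heavy stubbing can drift away from real behavior.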
- Configuration changes are the number one reason for major outages at Google.
- Each system under test (SUT) has two conflicting factors: hermeticity and fidelity.
- Strive to make tests
  - first, network hermetic (everything being contained in the same machine);
  - second, version hermetic (everything being built at the same CL).
- Hermetic servers are still prone to some sources of nondeterminism, like system time, random number generation and race conditions.
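One common way to tame two of these nondeterminism sources (system time and random numbers) is to pass them in as parameters. The sketch below is an illustration, not something taken from the book. `make_session_token` is a hypothetical function, and race conditions are not addressed.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Callable


def make_session_token(
    now: Callable[[], datetime],
    rng: random.Random,
    ttl: timedelta = timedelta(minutes=30),
) -> tuple[str, datetime]:
    """Return (token, expiry). Time and randomness are passed in, not read globally."""
    token = f"{rng.getrandbits(64):016x}"
    return token, now() + ttl


# In a test, a frozen clock and a seeded generator make the result repeatable.
fixed_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first = make_session_token(lambda: fixed_time, random.Random(42))
second = make_session_token(lambda: fixed_time, random.Random(42))
```

Production code would pass a real clock such as `lambda: datetime.now(timezone.utc)`. The `random` module is used here only for illustration and is not suitable for security-sensitive tokens.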
- Some types of integration tests:
  - functional testing,
  - performance, load and stress testing,
  - A/B diff testing,
  - deployment configuration testing,
  - probers and canary analysis.
- Chaos engineering involves writing programs that continuously introduce a background level of faults into the system and see what happens.
- A good design includes a test strategy that identifies risks and larger tests that mitigate them.
- Code is a liability, not an asset, since it carries a maintenance cost over time and a deprecation cost in the end.
- New systems need to have transformative benefits in order to incentivize users to migrate on their own.
- A deprecation warning needs to be actionable and relevant.
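As an illustration of an actionable and relevant warning, here is a sketch using Python’s standard `warnings` module. `fetch_rows` and `fetch_rows_v2` are hypothetical names. The message names the exact replacement call, and `stacklevel=2` attributes the warning to the calling code, which is where the fix has to happen.

```python
import warnings


def fetch_rows_v2(table: str, limit: int = 100) -> list[str]:
    """Hypothetical replacement API."""
    return [f"{table}:{i}" for i in range(limit)]


def fetch_rows(table: str) -> list[str]:
    """Deprecated: use fetch_rows_v2(table, limit=...) instead."""
    warnings.warn(
        "fetch_rows() is deprecated; call fetch_rows_v2(table, limit=100) "
        "for the same behavior.",
        DeprecationWarning,
        stacklevel=2,  # point at the caller, who is the one able to act
    )
    return fetch_rows_v2(table, limit=100)
```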
- The idea of downloading the entire codebase and having access to it locally in decentralized version control systems (e.g. Git) doesn’t scale well at Google.
- Version control is about both the tooling and the policy. Even in a decentralized version control system, it’s important to define one repository (and one branch) to be the ultimate source of truth.
- Google uses the in-house version control system to make the concept of ownership and approval more explicit and enforced.
- The One-Version Rule (for preventing the diamond dependency issue): developers within an organization must not have a choice where to commit, or which version of an existing component to depend upon.
- No long-lived dev branches: work must be done in small increments against trunk, committed regularly.
- Google’s code search moved from a suffix array-based indexing solution to a token-based n-gram indexing solution to take advantage of Google’s primary indexing and search stack.
- Unlike artifact-based build systems, task-based build systems struggle to perform incremental builds or to parallelize build steps.
- Bazel treats tools, such as a compiler, as dependencies, which may be configured globally at the workspace level.
- On supported systems, Bazel isolates each action from every other action via a filesystem sandbox.
- Strict transitive dependencies: a target is not allowed to reference a symbol without depending on it directly.
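The strict transitive dependency rule can be pictured as a simple check over a dependency graph. The sketch below is a toy model with made-up target and symbol names. It is not how Bazel implements the rule.

```python
def strict_deps_violations(
    direct_deps: dict[str, set[str]],
    symbol_owner: dict[str, str],
    symbols_used: dict[str, set[str]],
) -> list[tuple[str, str, str]]:
    """Return (target, symbol, owner) for each symbol used without a direct dep."""
    violations = []
    for target, symbols in sorted(symbols_used.items()):
        for symbol in sorted(symbols):
            owner = symbol_owner[symbol]
            if owner != target and owner not in direct_deps.get(target, set()):
                violations.append((target, symbol, owner))
    return violations


# Hypothetical graph: app -> http -> json.
# app uses json.parse but only declares a dependency on http.
direct_deps = {"app": {"http"}, "http": {"json"}, "json": set()}
symbol_owner = {"http.get": "http", "json.parse": "json"}
symbols_used = {"app": {"http.get", "json.parse"}, "http": {"json.parse"}}

violations = strict_deps_violations(direct_deps, symbol_owner, symbols_used)
```

Here `app` can reach `json` only transitively through `http`, so its use of `json.parse` is flagged. If `http` later dropped its `json` dependency, `app` would otherwise break without having changed.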
- External dependencies should be versioned explicitly under source control.
- A static analysis tool needs to strive for a low user-perceived false positive rate (e.g. less than 10%).
- Source control is far easier than dependency management.
- Semantic versioning:
  - A changed major version indicates a change to an existing API that can break existing usage.
  - A changed minor version indicates purely added functionality.
  - A changed patch version is reserved for non-API-impacting implementation details and bug fixes.
- Minimum Version Selection: try to use the version as close to the one the dependency author developed against as possible.
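A simplified sketch of both ideas, assuming plain `MAJOR.MINOR.PATCH` strings with no pre-release tags and requirements that all sit within one major version. The helper names are hypothetical, and this is far from a full resolver.

```python
def parse(version: str) -> tuple[int, int, int]:
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify a version change as 'major', 'minor', 'patch' or 'none'."""
    for name, a, b in zip(("major", "minor", "patch"), parse(old), parse(new)):
        if a != b:
            return name
    return "none"


def select_minimum_version(minimums: list[str]) -> str:
    """Pick the lowest version that satisfies every stated minimum.

    That is the highest of the minimums; nothing newer is chosen even if
    newer releases exist.
    """
    return max(minimums, key=parse)


kind = bump_kind("1.4.2", "2.0.0")
chosen = select_minimum_version(["1.2.0", "1.10.1", "1.3.5"])
```

Versions are compared as integer tuples rather than strings, so `1.10.1` correctly ranks above `1.3.5`.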
- A change in a dependency can be evaluated as breaking or non-breaking only in the context of how the dependency is being used. It’s best to use testing and CI to check whether a new set of versions actually work together.
- The C++ standard library comes close to providing ABI compatibility; Abseil provides limited API compatibility but no ABI compatibility; Boost provides no API compatibility at all.
- Haunted graveyard: a system that is so ancient, obtuse or complex that no one dares to enter it.
- CI should prioritize quicker, more reliable tests on presubmit and leave slower, less deterministic tests to post-submit.
- Mid-air collision: two changes that touch completely different files, each passing on its own, cause a test to fail once both are submitted.
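A toy simulation of a mid-air collision, with a codebase modelled as a dictionary of file contents and a stand-in CI check. Everything here (file names, `apply_change`, `ci_passes`) is hypothetical and only meant to make the failure mode concrete.

```python
def apply_change(codebase: dict[str, str], change: dict[str, str]) -> dict[str, str]:
    """Return a copy of the codebase with the changed files replaced."""
    return {**codebase, **change}


def ci_passes(codebase: dict[str, str]) -> bool:
    """Stand-in for CI: load every file, then run each function named check_*."""
    namespace: dict = {}
    try:
        for path in sorted(codebase):
            exec(codebase[path], namespace)
        for name, fn in list(namespace.items()):
            if name.startswith("check_"):
                fn()
        return True
    except Exception:
        return False


head = {"lib.py": "def area(w, h):\n    return w * h\n", "checks.py": ""}
# Change A touches only lib.py: renames area() to rect_area().
change_a = {"lib.py": "def rect_area(w, h):\n    return w * h\n"}
# Change B touches only checks.py: adds a check that calls the old name.
change_b = {"checks.py": "def check_area():\n    assert area(2, 3) == 6\n"}

a_alone = ci_passes(apply_change(head, change_a))
b_alone = ci_passes(apply_change(head, change_b))
both = ci_passes(apply_change(apply_change(head, change_a), change_b))
```

Each change passes when tested against the same starting point, but the combined codebase fails. Post-submit testing is one way such breakages get caught.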
- Hermetic testing can both reduce instability in larger-scoped tests and help isolate failures.
- Faster is safer: ship early and often in small batches to reduce the risk of each release as well as to minimize time to market.
- If you are late for the release train, it will leave without you.