[Book Notes] Software Engineering at Google

Jul 31, 2022

Scale programming to reduce the amortized cost.

The following content is the notes I took from the free online book Software Engineering at Google: Lessons Learned from Programming over Time¹ with (possibly) slight personal paraphrasing.

Three critical differences between software engineering and programming: time, scale and trade-offs.
Software engineering is programming over time.
Software is sustainable when, for the expected life span of the code, we are capable of responding to changes in dependencies, technology, or product requirements.
Hyrum’s Law: with a sufficient number of users of an API, it does not matter what you promise in the contract; all observable behaviors of your system will be depended on by somebody.
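A minimal sketch of Hyrum's Law, assuming a hypothetical `find_user` function whose documented contract only promises to raise `KeyError` for a missing user. The caller depends on the exact wording of the error message, an observable behavior the contract never promised. All names here are made up for illustration.

```python
# Hypothetical API: the documented contract is only
# "raises KeyError if the user is missing".
_USERS = {"alice": 1}


def find_user(name: str) -> int:
    if name not in _USERS:
        raise KeyError(f"user not found: {name}")
    return _USERS[name]


def fragile_caller(name: str) -> str:
    """Depends on the message wording, which the contract never promised."""
    try:
        return str(find_user(name))
    except KeyError as err:
        # Matching on the message text couples this caller to an implementation detail.
        if "user not found" in str(err):
            return "missing"
        return "unknown error"
```

If the maintainers reword the message, `fragile_caller` starts returning "unknown error" even though the documented contract has not changed.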
It’s programming if “clever” is a compliment, but it’s software engineering if “clever” is an accusation.
The churn rule: infrastructure teams must do the work to move their internal users to new versions themselves or do the update in place in a backward-compatible fashion.
- The infrastructure team has better domain knowledge.
- The benefits of a migration are often diffused across an organization, and centralizing the migration to a dedicated group internalizes the externalities that an unfunded mandate creates.
The Beyonce rule: if a product experiences outages or other problems as a result of infrastructure changes, but the issue wasn’t surfaced by tests in our Continuous Integration (CI) systems, it is not the fault of the infrastructure change.
Every task your organization has to do repeatedly should be scalable (linear or better) in terms of human input. Policies are a wonderful tool for making process scalable.
Shifting left: finding problems earlier in the developer workflow usually reduces costs.
The limiting factor in software engineering is usually personnel cost instead of financial cost, so keeping engineers happy, focused and engaged can easily dominate other factors.
It is far more important to optimize for obstacle-free brainstorming than to protect against someone wandering off with a bunch of markers.
Jevons Paradox: consumption of a resource may increase as a response to greater efficiency in its use. For example, a more efficient distributed build system can lead to more bloated or unnecessary dependencies.
Contrary to some people’s instincts, leaders who admit mistakes are more respected, not less.
Humans are mostly a collection of intermittent bugs.
The Genius Myth is the tendency that we as humans need to ascribe the success of a team to a single person/leader.
Fail early, fail fast, fail often.
The bus factor: the number of people that need to get hit by a bus before a project is completely doomed.
It is better to be one part of a successful project than the critical part of a failed project.
Three pillars of social skills: humility, respect and trust.
A proper postmortem should always contain an explanation of what was learned and what is going to change as a result of the learning experience.
The more open you are to influence, the more you are able to influence; the more vulnerable you are, the stronger you appear.

The definition of being Googley (Image Source¹)

Psychological safety is the foundation for fostering a knowledge-sharing environment.
Written knowledge scales better than tribal knowledge to a larger organization, but it comes with a maintenance cost and might be less applicable to individual learners’ situations.
It’s important not to mistakenly equate “seniority” with “knowing everything”.
Chesterton’s fence: before removing or changing something, first understand why it’s there.
One reason that people avoid becoming managers is that quantifying management work is difficult.
Peter Principle: in a hierarchy every employee tends to rise to his level of incompetence.
Servant leadership: the most important thing you can do as a leader is to serve your team and strive to create an atmosphere of humility, respect and trust.
Traditional managers worry about how to get things done, whereas great managers worry about what things get done.
If an individual succeeds, praise them in front of the team; if an individual fails, give constructive criticism in private.
The leader is always on stage, so always stay calm.
In many cases, knowing the right person is more valuable than knowing the right answer.
Be kind and empathetic when delivering constructive criticism, without resorting to the compliment sandwich.
If you are new to a leadership role, try to delegate work to other engineers even if it will take them a lot longer than you would to accomplish the work.
If you’ve been leading teams for a while or if you pick up a new team, get your hands dirty and take on a grungy task that no one else wants to do.
Shield your team from uncertainty and organizational chaos.
Increase people’s intrinsic motivation by giving them autonomy, mastery and purpose.
Optimize for the reader of the code rather than the author, since given the passage of time the code will be read far more frequently than it is written.
Code review reinforces to software engineers that code is not “theirs” but in fact part of a collective enterprise.
Code review involves an exchange of ideas and knowledge sharing, which benefits both the reviewer and the reviewee.
Code reviewers should defer to authors on particular approaches, and only point out alternatives if the author’s approach is deficient.
Code reviewers should avoid responding to the code review in piecemeal fashion.
The code review itself provides a historical record.
Because documentation’s benefits are all necessarily downstream, they generally don’t reap immediate benefits to the author.
Use the same developer workflow (ownership, source control, bug tracking, code review, etc.) to write and maintain both code and documents.
Differentiate API comments from implementation comments.
Keep each document focused on one purpose and targeted at a specific audience.
The act of writing tests also improves the design of your systems by forcing you to confront the issues early on in the development cycle.
Test sizes in terms of resources consumed by a test:
- A small test must run in a single process and must not perform I/O or make any other blocking calls.
- A medium test must be contained in a single machine and is not allowed to make network calls to systems other than localhost.
- Large tests are the rest.
Test scopes in terms of how much code a test is intended to validate:
- Unit tests are designed to validate the logic of an individual class or function (80% of the test suite).
- Integration tests are designed to verify interactions between a small number of components (15% of the test suite).
- System tests are designed to validate the interaction of several distinct parts of the system (5% of the test suite).
At the appropriate test scope, make the test size as small as possible.
Prevent brittle tests by
- Testing via public APIs rather than implementations.
- Testing states rather than interactions.
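The two guidelines above can be sketched with a hypothetical `Counter` class (not from the book). The first test drives the public API and checks state; the second checks an interaction with a private helper and is shown only to illustrate the brittle style.

```python
from unittest import mock


class Counter:
    """Hypothetical class under test."""

    def __init__(self) -> None:
        self._count = 0

    def increment(self) -> None:
        self._bump(1)

    def _bump(self, amount: int) -> None:  # private helper, free to change
        self._count += amount

    @property
    def value(self) -> int:
        return self._count


def test_state_via_public_api() -> None:
    # Preferred: drive the public API, then assert on the resulting state.
    counter = Counter()
    counter.increment()
    assert counter.value == 1


def test_interaction_with_private_helper() -> None:
    # Brittle: asserts how the result is produced, so renaming or inlining
    # _bump breaks this test even though the behavior is unchanged.
    counter = Counter()
    with mock.patch.object(Counter, "_bump") as bump:
        counter.increment()
    bump.assert_called_once_with(1)
```

Both tests pass today, but only the first one survives a refactoring that removes `_bump`.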
Control flow statements such as conditionals and loops are discouraged in tests, because straight-line test code is easier to inspect when a test fails.
Rather than writing a test for each method, write a test for each behavior.
Structure a test into “given”, “when” and “then”.
Descriptive And Meaningful Phrases (DAMP): a little bit of duplication is okay in tests so long as it makes tests simpler and clearer.
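A small sketch of behavior-focused tests laid out as given/when/then, assuming a hypothetical `Account` class. The setup line is repeated in both tests on purpose, in the DAMP spirit.

```python
class Account:
    """Hypothetical system under test."""

    def __init__(self, balance: int) -> None:
        self.balance = balance

    def withdraw(self, amount: int) -> bool:
        if amount > self.balance:
            return False
        self.balance -= amount
        return True


# One test per behavior of withdraw(), not one test per method.
def test_withdraw_reduces_balance_when_funds_are_sufficient() -> None:
    # Given an account with 100 units
    account = Account(balance=100)
    # When 30 units are withdrawn
    succeeded = account.withdraw(30)
    # Then the withdrawal succeeds and the balance drops to 70
    assert succeeded
    assert account.balance == 70


def test_withdraw_is_rejected_when_funds_are_insufficient() -> None:
    # Given an account with 100 units (setup repeated on purpose)
    account = Account(balance=100)
    # When more than the balance is withdrawn
    succeeded = account.withdraw(150)
    # Then the withdrawal fails and the balance is unchanged
    assert not succeeded
    assert account.balance == 100
```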
A seam is a way to make code testable by allowing for the use of test doubles — it makes it possible to use different dependencies for the system under test rather than dependencies used in a production environment.
A mock is a test double whose behavior is specified inline in a test.
A fake is a lightweight implementation of an API that behaves similarly to the real implementation but isn’t suitable for production. It should be written and maintained by the team owning the real implementation.
Stubbing is the process of giving behavior to a function that otherwise has no behavior on its own, usually done through mocking frameworks.
Overusing stubbing can make tests unclear, brittle and less effective.
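To make the vocabulary above concrete, here is a rough sketch using a hypothetical `UserStore` dependency and `greeting` function (neither is from the book). The `store` parameter is the seam, `FakeUserStore` is a fake, and the second test stubs a return value through Python's `unittest.mock`.

```python
from typing import Optional, Protocol
from unittest import mock


class UserStore(Protocol):
    """Hypothetical dependency interface."""

    def get_name(self, user_id: int) -> Optional[str]: ...


class FakeUserStore:
    """Fake: a lightweight in-memory implementation, not suitable for production."""

    def __init__(self) -> None:
        self._names = {}

    def add(self, user_id: int, name: str) -> None:
        self._names[user_id] = name

    def get_name(self, user_id: int) -> Optional[str]:
        return self._names.get(user_id)


def greeting(store: UserStore, user_id: int) -> str:
    # The `store` parameter is the seam: tests pass a double,
    # production code passes the real store.
    name = store.get_name(user_id)
    return f"Hello, {name}!" if name else "Hello, stranger!"


def test_greeting_with_fake() -> None:
    store = FakeUserStore()
    store.add(1, "Ada")
    assert greeting(store, 1) == "Hello, Ada!"
    assert greeting(store, 2) == "Hello, stranger!"


def test_greeting_with_stub() -> None:
    # Stubbing via a mocking framework: behavior is specified inline in the test.
    store = mock.Mock()
    store.get_name.return_value = "Ada"
    assert greeting(store, 1) == "Hello, Ada!"
```

The fake exercises real lookup logic, while the stub returns whatever the test tells it to, which is why heavy stubbing tends to say less about how the system actually behaves.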
For the sake of fidelity, a real implementation is preferred if it is fast, deterministic, and has simple dependencies.
Configuration changes are the number one reason for major outages at Google.
Each system under test (SUT) has two conflicting factors: hermeticity and fidelity.
Strive to make tests
- first, network hermetic (everything being contained in the same machine);
- second, version hermetic (everything being built at the same CL).
Hermetic servers are still prone to some sources of nondeterminism, like system time, random number generation and race conditions.
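One common way to tame the first two sources is to inject them. The sketch below assumes a hypothetical `make_token` function that takes its clock and random number generator as parameters, so a test can pin both.

```python
import random
from datetime import datetime, timezone
from typing import Callable


def make_token(now: Callable[[], datetime], rng: random.Random) -> str:
    """Hypothetical function that depends on the time and on randomness."""
    stamp = now().strftime("%Y%m%d")
    suffix = rng.randrange(10_000)
    return f"{stamp}-{suffix:04d}"


def fixed_clock() -> datetime:
    # Test-only clock that always returns the same instant.
    return datetime(2022, 7, 31, tzinfo=timezone.utc)


def test_make_token_is_reproducible() -> None:
    # Same clock and same seed give the same token on every run.
    first = make_token(fixed_clock, random.Random(42))
    second = make_token(fixed_clock, random.Random(42))
    assert first == second
    assert first.startswith("20220731-")
```

Race conditions are not addressed by this sketch; they usually need separate treatment.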
Some types of integration tests:
- functional testing,
- performance, load and stress testing,
- A/B diff testing,
- deployment configuration testing,
- probers and canary analysis.
Chaos engineering involves writing programs that continuously introduce a background level of faults into the system and observing what happens.
A good design includes a test strategy that identifies risks and larger tests that mitigate them.

Code is a liability, not an asset, since it carries a maintenance cost over time and a deprecation cost in the end.
New systems need to have transformative benefits in order to incentivize users to migrate on their own.
A deprecation warning needs to be actionable and relevant.
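As an illustration, a deprecation warning in Python could name the replacement and point at the caller's line. The functions `old_fetch` and `new_fetch` are hypothetical; `warnings.warn` with `DeprecationWarning` and `stacklevel` is the standard library mechanism.

```python
import warnings


def new_fetch(url: str, timeout: float = 10.0) -> str:
    """Hypothetical replacement API."""
    return f"fetched {url} (timeout={timeout})"


def old_fetch(url: str) -> str:
    """Hypothetical deprecated API that forwards to the replacement."""
    warnings.warn(
        # Actionable: says exactly what to call instead.
        "old_fetch() is deprecated; call new_fetch(url, timeout=...) instead.",
        DeprecationWarning,
        stacklevel=2,  # relevant: attribute the warning to the caller's line
    )
    return new_fetch(url)
```

A message like "this API is deprecated" with no replacement named would fail the actionable test.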
The idea of downloading the entire codebase and having access to it locally in decentralized version control systems (e.g. Git) doesn’t scale well at Google.
Version control is about both the tooling and the policy. Even in a decentralized version control system, it’s important to define one repository (and one branch) to be the ultimate source of truth.
Google uses the in-house version control system to make the concept of ownership and approval more explicit and enforced.
The One-Version Rule (for preventing the diamond dependency issue): developers within an organization must not have a choice where to commit, or which version of an existing component to depend upon.
No long-lived dev branches: work must be done in small increments against trunk, committed regularly.
Google’s code search moved from a suffix array-based indexing solution to a token-based n-gram indexing solution to take advantage of Google’s primary indexing and search stack.
As opposed to artifact-based build systems, task-based build systems struggle to perform incremental builds or to parallelize build steps reliably.
Bazel treats tools, such as a compiler, as dependencies, which may be configured globally at the workspace level.
On supported systems, Bazel isolates each action from every other action via a filesystem sandbox.
Strict transitive dependencies: a target is not allowed to reference a symbol without depending on it directly.
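The rule can be sketched as a simple check over a toy build graph. This is not how Bazel implements it; the target names and both dictionaries are invented for illustration.

```python
# Hypothetical build graph: target -> directly declared dependencies.
DIRECT_DEPS = {
    "app": {"lib_a"},
    "lib_a": {"lib_b"},
    "lib_b": set(),
}

# Hypothetical usage data: target -> targets whose symbols it references.
REFERENCED = {
    "app": {"lib_a", "lib_b"},  # app reaches lib_b only transitively via lib_a
    "lib_a": {"lib_b"},
    "lib_b": set(),
}


def strict_dep_violations(direct_deps, referenced):
    """Return {target: referenced targets that are not declared as direct deps}."""
    violations = {}
    for target, used in referenced.items():
        missing = used - direct_deps.get(target, set())
        if missing:
            violations[target] = missing
    return violations
```

Here `strict_dep_violations(DIRECT_DEPS, REFERENCED)` flags `app`, which would have to declare `lib_b` directly to pass.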
External dependencies should be versioned explicitly under source control.
A static analysis tool needs to strive for a low user-perceived false positive rate (e.g. less than 10%).
Source control is far easier than dependency management.
Semantic versioning:
- A changed major version indicates a change to an existing API that can break existing usage.
- A changed minor version indicates purely added functionality.
- A changed patch version is reserved for non-API-impacting implementation details and bug fixes.
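The three rules above can be read off a pair of version strings. This sketch assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes; `parse` and `classify_bump` are illustrative names.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Name the most significant component that differs between two versions."""
    old_v, new_v = parse(old), parse(new)
    if new_v[0] != old_v[0]:
        return "major"  # existing API changed; callers may break
    if new_v[1] != old_v[1]:
        return "minor"  # purely added functionality
    if new_v[2] != old_v[2]:
        return "patch"  # implementation details and bug fixes
    return "none"
```

The classification only reflects what the version number claims; whether a release is actually safe for a given caller is a separate question.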
Minimum Version Selection: try to use the version as close to the one the dependency author developed against as possible.
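A simplified sketch of the idea: each module states the minimum version it needs of each dependency, and the build picks, per dependency, the highest of those stated minimums rather than the newest release available. The graph below is invented, and this is not a faithful copy of any package manager's algorithm.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]
Module = Tuple[str, Version]

# Hypothetical requirement graph: (module, version) -> {dependency: minimum version}.
REQUIREMENTS: Dict[Module, Dict[str, Version]] = {
    ("app", (1, 0, 0)): {"liba": (1, 2, 0), "libb": (1, 0, 0)},
    ("liba", (1, 2, 0)): {"libc": (1, 1, 0)},
    ("libb", (1, 0, 0)): {"libc": (1, 3, 0)},
    ("libc", (1, 1, 0)): {},
    ("libc", (1, 3, 0)): {},
    ("libc", (1, 9, 0)): {},  # newest release, but nobody asks for it
}


def select_versions(root: Module) -> Dict[str, Version]:
    """Walk every reachable requirement and keep the highest minimum per module."""
    selected: Dict[str, Version] = {}
    stack = [root]
    seen = set()
    while stack:
        module = stack.pop()
        if module in seen:
            continue
        seen.add(module)
        name, version = module
        if name not in selected or version > selected[name]:
            selected[name] = version
        for dep_name, dep_min in REQUIREMENTS[module].items():
            stack.append((dep_name, dep_min))
    return selected
```

For the root `("app", (1, 0, 0))`, `libc` resolves to `(1, 3, 0)`: high enough for both `liba` and `libb`, yet not the untested `(1, 9, 0)`.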
A change in a dependency can be evaluated as breaking or non-breaking only in the context of how the dependency is being used. It’s best to use testing and CI to check whether a new set of versions actually works together.
The C++ standard library promises nearly indefinite ABI compatibility; Abseil provides limited API compatibility but no ABI compatibility; Boost generally promises no compatibility between versions.
Haunted graveyard: a system that is so ancient, obtuse or complex that no one dares to enter it.
CI should run quicker, more reliable tests on presubmit and leave slower, less deterministic tests for post-submit.
Mid-air collision: two changes that touch completely different files, each passing on its own, together cause a test to fail.
Hermetic testing can both reduce instability in larger-scoped tests and help isolate failures.
Faster is safer: ship early and often in small batches to reduce the risk of each release as well as to minimize time to market.
If you are late for the release train, it will leave without you.