[Book Notes] Software Engineering at Google

Devin Z
Jul 31, 2022

Scale programming to reduce the amortized cost.

San Francisco, May 15, 2022

The following are notes I took from the free online book Software Engineering at Google: Lessons Learned from Programming over Time¹, with (possibly) slight personal paraphrasing.

  • Three critical differences between software engineering and programming: time, scale and trade-offs.
  • Software engineering is programming over time.
  • Software is sustainable when, for the expected life span of the code, we are capable of responding to changes in dependencies, technology, or product requirements.
  • Hyrum’s Law: with a sufficient number of users of an API, it does not matter what you promise in the contract; all observable behaviors of your system will be depended on by somebody.
  • It’s programming if “clever” is a compliment, but it’s software engineering if “clever” is an accusation.
  • The churn rule: infrastructure teams must do the work to move their internal users to new versions themselves, or do the update in place in a backward-compatible fashion.
    - The infrastructure team has better domain knowledge.
    - The benefits of a migration are often diffused across an organization, and centralizing the migration to a dedicated group internalizes the externalities that an unfunded mandate creates.
  • The Beyoncé rule: if a product experiences outages or other problems as a result of infrastructure changes, but the issue wasn’t surfaced by tests in our Continuous Integration (CI) systems, it is not the fault of the infrastructure change.
  • Every task your organization has to do repeatedly should be scalable (linear or better) in terms of human input. Policies are a wonderful tool for making process scalable.
  • Shifting left: finding problems earlier in the developer workflow usually reduces costs.
  • The limiting factor in software engineering is usually personnel cost instead of financial cost, so keeping engineers happy, focused and engaged can easily dominate other factors.
  • It is far more important to optimize for obstacle-free brainstorming than to protect against someone wandering off with a bunch of markers.
  • Jevons Paradox: consumption of a resource may increase as a response to greater efficiency in its use. Ex: a more efficient distributed build system leads to more bloated or unnecessary dependencies.
  • Contrary to some people’s instincts, leaders who admit mistakes are more respected, not less.
  • Humans are mostly a collection of intermittent bugs.
  • The Genius Myth is the tendency we as humans have to ascribe the success of a team to a single person/leader.
  • Fail early, fail fast, fail often.
  • The bus factor: the number of people that need to get hit by a bus before a project is completely doomed.
  • It is better to be one part of a successful project than the critical part of a failed project.
  • Three pillars of social skills: humility, respect and trust.
  • A proper postmortem should always contain an explanation of what was learned and what is going to change as a result of the learning experience.
  • The more open you are to influence, the more you are able to influence; the more vulnerable you are, the stronger you appear.
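
To make the Hyrum’s Law note above concrete, here is a minimal Python sketch. All names (`list_users_v1`, `list_users_v2`, `first_alphabetical`) are hypothetical and only for illustration. Both implementations honor the same documented contract (“returns all user names, in unspecified order”), yet a caller that quietly relied on the observed ordering changes behavior when the implementation changes.

```python
from __future__ import annotations


def list_users_v1(users: set[str]) -> list[str]:
    """Contract: return all user names, in unspecified order."""
    # Implementation detail: the result happens to be sorted ascending.
    return sorted(users)


def list_users_v2(users: set[str]) -> list[str]:
    """Same contract as v1, different (still valid) implementation."""
    # Implementation detail: the result now happens to be sorted descending.
    return sorted(users, reverse=True)


def first_alphabetical(list_users, users: set[str]) -> str:
    """A caller that depends on ordering the contract never promised."""
    return list_users(users)[0]


USERS = {"carol", "alice", "bob"}

# Both versions satisfy the written contract...
assert set(list_users_v1(USERS)) == USERS
assert set(list_users_v2(USERS)) == USERS

# ...but the caller's result silently changes between them.
before = first_alphabetical(list_users_v1, USERS)  # "alice"
after = first_alphabetical(list_users_v2, USERS)   # "carol"
```

With enough callers, someone will have written the equivalent of `first_alphabetical`, which is why a change that respects the contract can still break users.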
The definition of being Googley (Image Source¹)
  • Psychological safety is the foundation for fostering a knowledge-sharing environment.
  • Written knowledge scales better than tribal knowledge to a larger organization, but it comes with a maintenance cost and might be less applicable to individual learners’ situations.
  • It’s important not to mistakenly equate “seniority” with “knowing everything”.
  • Chesterton’s fence: before removing or changing something, first understand why it’s there.
  • One reason that people avoid becoming managers is that quantifying management work is difficult.
  • Peter Principle: in a hierarchy every employee tends to rise to his level of incompetence.
  • Servant leadership: the most important thing you can do as a leader is to serve your team and strive to create an atmosphere of humility, respect and trust.
  • Traditional managers worry about how to get things done, whereas great managers worry about what things get done.
  • If an individual succeeds, praise them in front of the team; if an individual fails, give constructive criticism in private.
  • The leader is always on stage, so always stay calm.
  • In many cases, knowing the right person is more valuable than knowing the right answer.
  • Be kind and empathetic when delivering constructive criticism, without resorting to a compliment sandwich.
  • If you are new to a leadership role, try to delegate work to other engineers even if it will take them a lot longer than you would to accomplish the work.
  • If you’ve been leading teams for a while or if you pick up a new team, get your hands dirty and take on a grungy task that no one else wants to do.
  • Shield your team from uncertainty and organizational chaos.
  • Increase people’s intrinsic motivation by giving them autonomy, mastery and purpose.
  • Optimize for the reader of the code rather than the author, since given the passage of time the code will be read far more frequently than it is written.
  • Code review reinforces to software engineers that code is not “theirs” but in fact part of a collective enterprise.
  • Code review involves an exchange of ideas and knowledge sharing, which benefits both the reviewer and the reviewee.
  • Code reviewers should defer to authors on particular approaches, and only point out alternatives if the author’s approach is deficient.
  • Code reviewers should avoid responding to the code review in piecemeal fashion.
  • The code review itself provides a historical record.
  • Because documentation’s benefits are all necessarily downstream, it generally doesn’t bring immediate benefits to the author.
  • Use the same developer workflow (ownership, source control, bug tracking, code review, etc.) to write and maintain code and documents.
  • Differentiate API comments from implementation comments.
  • Keep documents focused on one purpose and targeted at a specific audience.
  • The act of writing tests also improves the design of your systems by forcing you to confront the issues early on in the development cycle.
  • Test sizes in terms of resources consumed by a test:
    - A small test must run in a single process and must not perform I/O or make any other blocking calls.
    - A medium test must be contained in a single machine and is not allowed to make network calls to any system other than localhost.
    - Large tests are the rest.
  • Test scopes in terms of how much code a test is intended to validate:
    - Unit tests are designed to validate the logic of an individual class or function (80% of the test suite).
    - Integration tests are designed to verify interactions between a small number of components (15% of the test suite).
    - System tests are designed to validate the interaction of several distinct parts of the system (5% of the test suite).
  • At the appropriate test scope, make the test size as small as possible.
  • Prevent brittle tests by
    - Testing via public APIs rather than implementations.
    - Testing states rather than interactions.
  • The use of control flow statements like conditionals and loops in a test is discouraged, so that test failures are easier to inspect.
  • Rather than writing a test for each method, write a test for each behavior.
  • Structure a test into “given”, “when” and “then”.
  • Descriptive And Meaningful Phrases (DAMP): a little bit of duplication is okay in tests so long as it makes tests simpler and clearer.
  • A seam is a way to make code testable by allowing for the use of test doubles — it makes it possible to use different dependencies for the system under test rather than dependencies used in a production environment.
  • A mock is a test double whose behavior is specified inline in a test.
  • A fake is a lightweight implementation of an API that behaves similarly to the real implementation but isn’t suitable for production. It should be written and maintained by the team owning the real implementation.
  • Stubbing is the process of giving behavior to a function that otherwise has no behavior on its own, usually done through mocking frameworks.
  • Overusing stubbing can make tests unclear, brittle and less effective.
  • For the sake of fidelity, a real implementation is preferred if it is fast, deterministic, and has simple dependencies.
  • Configuration changes are the number one reason for major outages at Google.
  • Each system under test (SUT) has two conflicting factors: hermeticity and fidelity.
  • Strive to make tests
    - first, network hermetic (everything being contained in the same machine);
    - second, version hermetic (everything being built at the same CL).
  • Hermetic servers are still prone to some sources of nondeterminism, like system time, random number generation and race conditions.
  • Some types of integration tests:
    - functional testing,
    - performance, load and stress testing,
    - A/B diff testing,
    - deployment configuration testing,
    - probers and canary analysis.
  • Chaos engineering involves writing programs that continuously introduce a background level of faults into the system and observing what happens.
  • A good design includes a test strategy that identifies risks and larger tests that mitigate them.
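
Several of the testing notes above (seams, fakes, testing state rather than interactions, one test per behavior, and the “given / when / then” structure) can be shown together in one small Python sketch. `FakeUserStore` and `Registration` are hypothetical names invented for this illustration; they do not come from the book.

```python
from __future__ import annotations

from typing import Optional


class FakeUserStore:
    """Hypothetical in-memory fake of a user-store API (not production-grade)."""

    def __init__(self) -> None:
        self._emails: dict[str, str] = {}

    def put(self, user_id: str, email: str) -> None:
        self._emails[user_id] = email

    def get(self, user_id: str) -> Optional[str]:
        return self._emails.get(user_id)


class Registration:
    """System under test. The constructor argument is the seam:
    production code would pass a real store, tests pass the fake."""

    def __init__(self, store) -> None:
        self._store = store

    def register(self, user_id: str, email: str) -> bool:
        if self._store.get(user_id) is not None:
            return False  # behavior 1: duplicates are rejected
        self._store.put(user_id, email.lower())  # behavior 2: email is normalized
        return True


def test_register_stores_normalized_email() -> None:
    # Given an empty store
    store = FakeUserStore()
    registration = Registration(store)
    # When a new user registers with a mixed-case email
    accepted = registration.register("u1", "Ada@Example.com")
    # Then the resulting state (not the sequence of calls) is checked
    assert accepted
    assert store.get("u1") == "ada@example.com"


def test_register_rejects_duplicate_user() -> None:
    # Given a store that already contains the user
    store = FakeUserStore()
    store.put("u1", "ada@example.com")
    registration = Registration(store)
    # When the same user id registers again
    accepted = registration.register("u1", "other@example.com")
    # Then the original record is left untouched
    assert not accepted
    assert store.get("u1") == "ada@example.com"
```

Each test covers one behavior through the public API, has no conditionals or loops, and repeats a little setup rather than hiding it in helpers, in the DAMP spirit. Neither test asserts which store methods were called, so `Registration` can be refactored without breaking them.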
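
The note above on nondeterminism names system time and random number generation as two sources that survive even in hermetic setups. One common way to tame them, sketched below, is to inject the clock and the random source instead of reading them globally. `make_token`, `fixed_clock` and `FIXED_NOW` are hypothetical names for this illustration, and race conditions are outside the scope of the sketch.

```python
from __future__ import annotations

import random
from datetime import datetime, timedelta, timezone
from typing import Callable


def make_token(
    now: Callable[[], datetime],
    rng: random.Random,
    ttl_seconds: int = 60,
) -> tuple[str, datetime]:
    """Return a (token, expiry) pair using only the injected clock and RNG."""
    token = f"{rng.getrandbits(32):08x}"  # 8 hex characters
    expiry = now() + timedelta(seconds=ttl_seconds)
    return token, expiry


# Test-side doubles: a frozen clock and a seeded random generator.
FIXED_NOW = datetime(2022, 7, 31, tzinfo=timezone.utc)


def fixed_clock() -> datetime:
    return FIXED_NOW


token_a, expiry_a = make_token(fixed_clock, random.Random(42))
token_b, expiry_b = make_token(fixed_clock, random.Random(42))

# Same seed and same clock give the same result on every run.
assert token_a == token_b
assert expiry_a == expiry_b == FIXED_NOW + timedelta(seconds=60)
```

In production the caller would pass something like `lambda: datetime.now(timezone.utc)` and an unseeded `random.Random()`. Tokens that need to be unpredictable should come from the `secrets` module rather than `random`.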
Life of a Code Change (Image Source¹)
  • Code is a liability, not an asset, since it carries a maintenance cost over time and a deprecation cost in the end.
  • New systems need to have transformative benefits in order to incentivize users to migrate on their own.
  • A deprecation warning needs to be actionable and relevant.
  • The idea of downloading the entire codebase and having access to it locally in decentralized version control systems (e.g. Git) doesn’t scale well at Google.
  • Version control is about both the tooling and the policy. Even in a decentralized version control system, it’s important to define one repository (and one branch) to be the ultimate source of truth.
  • Google uses an in-house version control system to make the concept of ownership and approval more explicit and enforced.
  • The One-Version Rule (for preventing the diamond dependency issue): developers within an organization must not have a choice where to commit, or which version of an existing component to depend upon.
  • No long-lived dev branches: work must be done in small increments against trunk, committed regularly.
  • Google’s code search moved from a suffix array-based indexing solution to a token-based n-gram indexing solution to take advantage of Google’s primary indexing and search stack.
  • As opposed to artifact-based build systems, task-based build systems have difficulty performing incremental builds and parallelizing build steps.
  • Bazel treats tools, such as a compiler, as dependencies, which may be configured globally at the workspace level.
  • On supported systems, Bazel isolates each action from every other action via a filesystem sandbox.
  • Strict transitive dependencies: a target is not allowed to reference a symbol without depending on it directly.
  • External dependencies should be versioned explicitly under source control.
  • A static analysis tool needs to strive for a low user-perceived false positive rate (e.g. less than 10%).
  • Source control is far easier than dependency management.
  • Semantic versioning:
    - A changed major version indicates a change to an existing API that can break existing usage.
    - A changed minor version indicates purely added functionality.
    - A changed patch version is reserved for non-API-impacting implementation details and bug fixes.
  • Minimum Version Selection: try to use the version as close to the one the dependency author developed against as possible.
  • A change in a dependency can be evaluated as breaking or non-breaking only in the context of how the dependency is being used. It’s best to use testing and CI to check whether a new set of versions actually work together.
  • The C++ standard library promises almost indefinite backward compatibility, including ABI compatibility; Abseil promises limited API compatibility but no ABI compatibility; Boost promises no compatibility between versions.
  • Haunted graveyard: a system that is so ancient, obtuse or complex that no one dares to enter it.
  • CI should run quicker, more reliable tests on presubmit, and slower, less deterministic tests on post-submit.
  • Mid-air collision: two changes that touch completely different files, each passing on its own, cause a test to fail once both are submitted.
  • Hermetic testing can both reduce instability in larger-scoped tests and help isolate failures.
  • Faster is safer: ship early and often in small batches to reduce the risk of each release as well as to minimize time to market.
  • If you are late for the release train, it will leave without you.
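
The Minimum Version Selection note above can be sketched as a small graph walk. This is a simplified illustration in the spirit of the approach used by Go modules, not a faithful reimplementation of any package manager. The module names and the `REQUIRES` table are made up. Each module version declares the minimum versions it was developed against, and the build uses, for every module, the highest of those declared minimums and nothing newer.

```python
from __future__ import annotations

from collections import deque

Version = tuple[int, int, int]  # (major, minor, patch)
Module = tuple[str, Version]    # (name, version)


def select_versions(
    root: Module, requires: dict[Module, list[Module]]
) -> dict[str, Version]:
    """Walk the requirement graph from root and keep, per module name,
    the highest version that any reachable requirement asks for."""
    selected: dict[str, Version] = {}
    seen: set[Module] = set()
    queue = deque([root])
    while queue:
        module = queue.popleft()
        if module in seen:
            continue
        seen.add(module)
        name, version = module
        if name not in selected or version > selected[name]:
            selected[name] = version
        queue.extend(requires.get(module, []))
    return selected


# Hypothetical dependency data for illustration only.
REQUIRES: dict[Module, list[Module]] = {
    ("app", (1, 0, 0)): [("liba", (1, 2, 0)), ("libb", (1, 0, 0))],
    ("liba", (1, 2, 0)): [("libc", (1, 1, 0))],
    ("libb", (1, 0, 0)): [("libc", (1, 3, 0))],
    ("libc", (1, 1, 0)): [],
    ("libc", (1, 3, 0)): [],
    ("libc", (1, 4, 0)): [],  # published, but nothing requires it
}

build_list = select_versions(("app", (1, 0, 0)), REQUIRES)
assert build_list["libc"] == (1, 3, 0)  # not 1.4.0, which nobody asked for
```

The point of the design is that `libc` 1.4.0 is never picked just because it is the newest release: the build only moves to a version that some dependency author actually developed against.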
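
The semantic versioning rules in the notes above can be approximated by comparing two snapshots of a public API. The sketch below is deliberately naive and rests on assumptions: `required_bump` is a hypothetical helper, and an API is modeled as a mapping from symbol name to a signature string. As the notes point out, whether a change really breaks anyone depends on how the dependency is used, so a check like this is at best a first estimate.

```python
from __future__ import annotations


def required_bump(old_api: dict[str, str], new_api: dict[str, str]) -> str:
    """Classify the version bump implied by going from old_api to new_api.

    - "major": an existing symbol was removed or its signature changed
    - "minor": only new symbols were added
    - "patch": the public API surface is unchanged
    """
    for name, signature in old_api.items():
        if new_api.get(name) != signature:
            return "major"
    if set(new_api) - set(old_api):
        return "minor"
    return "patch"


# Hypothetical API snapshots for illustration only.
v1 = {"parse": "(text: str) -> dict"}
v1_plus_dump = {**v1, "dump": "(obj: dict) -> str"}
v1_bytes_only = {"parse": "(text: bytes) -> dict"}

assert required_bump(v1, v1) == "patch"
assert required_bump(v1, v1_plus_dump) == "minor"
assert required_bump(v1, v1_bytes_only) == "major"
```

A signature comparison cannot see behavioral changes, which is where Hyrum’s Law comes back in: a “patch” by this measure can still break a user who depended on unpromised behavior.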
Mountain View, January 18, 2019
