Culture of Quality

What is Quality?

We judge code quality along two different dimensions.

High quality code satisfies its current requirements, both functional or non-functional.
High quality code is maintainable, meaning that it can be easily understood, extended, and modified to take into account new or changing requirements.

Why do we invest in Quality?

In the short term, time pressures often tempt us to trade future maintainability for faster initial development. If not quickly corrected, this tradeoff is almost always bad in the long term, as maintaining poor quality code requires huge amounts of time and effort.

Having a Culture of Quality is important for making the necessary investments in high quality code to ensure that we can continue to GSD in the future.

Just as under-investing in quality can lead to less GSD in the future, over-investing in quality can lead to less GSD in the present. Perfection is the enemy of GSD; we must find a balance that allows us to both make progress now as well as not limit us in the future.

Code Reviews and Design Reviews

Reviews are perhaps the most important method we use to maintain a culture of quality. We generically refer to reviews of both code and designs as "code reviews".

Code reviews have multiple benefits:

Because developers know that others will be looking at their designs and their code, they are more likely to invest the effort to keep the quality of their work high.
Having other developers look at code quickly is a good way to evaluate whether the code is understandable and maintainable by somebody who does not have the same mental context as the author did while writing the code.
Code reviews are great places to both see how other developers are creating software as well as to offer alternate suggestions for how to structure logic and applications. They teach us how to better write code and to read code written by our peers.
Code reviews allow us to enforce the spirit of our rules instead of the letter. The measure of how readable a piece of code is depends on the judgement of developers actually reading the code, not metrics like the number of methods per class or cyclomatic complexity.

Code reviews are a good example of a keystone habit. Knowing that others will be looking at their code encourages developers to invest the time and effort to create high quality code, even before it is reviewed.

In general, all designs and all code should be reviewed, and all reviews should be completed before the code is deployed. Thus, it is important for all reviewers to complete their reviews in a timely manner. A good standard is that no code review should be waiting for your feedback for longer than 2 business days. An even better goal to strive for is "code review zero", where you have successfully completed all of your open code reviews!

In addition to ensuring that the code itself is high quality, we also use code reviews to communicate how we write code to each other. Each developer is assigned a rotating reviewer who should be added to all reviews. This additional reviewer is focusing on learning from the review instead of providing constructive feedback on what may be an unfamiliar system. (Although offering constructive feedback is still encouraged, of course!)

Because of the importance of completing code reviews quickly, developers should incorporate code review time into their daily routine. At the same time, they should not consume so much time as to feel burdensome. If you feel overwhelmed with the number of reviews that you need to complete, please speak up! As teams grow and change, we often need to adjust who is reviewing the code for each project. Conversely, also feel free to add yourself as a default reviewer to any project that you have an interest in following.

Continuous Improvement

Whenever we make modifications to a system, we should be leaving it in a better place than it was before. The easiest time to let quality slip is when making a minor change to an otherwise stable project. Unless we make an effort to continuously improve code whenever we touch it, these small reductions in quality will slowly change well-written software into an unmaintainable mess.

The principle of continuous improvement also lets us think about quality as a process, not a state. We may sometimes need to make compromises on quality in order to satisfy other requirements -- code debt is something that needs to be managed, not avoided entirely.

In practice, our philosophy of continuous improvement takes several forms:

We hold periodic retrospectives to surface problems and impediments that may otherwise stay unnoticed.
After outages and failures, we have a post-mortem discussion to determine what went wrong and mitigate the risk of a similar outage in the future.
We have quarterly performance evaluations for identifying where we can thoughtfully improve in the future.
We build slack into our development plans to ensure that developers have time to think about and address low-quality code or processes when they are identified.
And, in all cases, we give and receive feedback with humility, respect, and trust. Continuous improvement requires a safe environment where we can talk about how we fall short of an ideal without attacking each other or feeling like we are being attacked.

Best Practices and Standards

Just like the other rules and processes described in this guide, our best practices and standards for developing software are intended to provide a useful default and describe good habits to cultivate. They will not apply universally -- don't be afraid to break the rules, although this should hopefully not be necessary very often.

Code Standards and Style Guides

One simple way of making code more maintainable is to make it easier to visually parse by both the original author and other members of the team. At the same time, making standards easy to follow is essential to preventing them from being overly burdensome. Thus, our standards are mostly encoded as automatic formatting rules, although we also have a somewhat poorly maintained Scala Style Guide.

Testing

We are strong believers in the power of tests to validate the behavior of an application both now and in the future. Tests form a concrete specification for how software should behave. Our code should have both unit tests to verify the behavior of small pieces of code as well as integration tests to ensure that different projects are operating with each other according to our expectations.

SCM

One of the key measurements of maintainability is whether another developer can quickly and easily recreate an application without modifying any code at all. If the code is inaccessible, this is clearly impossible. All code should be checked into our source control. Generally speaking, commits should small and oriented around a single logical change. This aids in documenting, reverting, merging, and otherwise understanding what each changeset does.

Build Server

All production builds should happen on our build server. This ensures that builds complete cleanly and are not relying on any local state on a developer's personal machine. We do make a small exception for builds that will never be used outside of an individual developer's work. (For example, when working on multiple projects simultaneously, it may be necessary to publish an upstream project to a local repository for the downstream project to consume.) However, all builds that reach our shared developer repository and potentially beyond to production should be built on the build server.

Standard Deployment Environments

All production applications should run on one of our standard deployment environments:

Long running server-style applications should be deployed on EC2 via Rodeo.
Both periodic and on-demand script-style applications should be deployed on our cron server.
Distributed reports should be launched on Elastic MapReduce from the cron server and deployed via Spark Plug.

Sticking to a few standard deployment environments reduces the overhead of maintaining a large amount of infrastructure. Our common deployment tools also ensure that we can track failures, roll back changes when necessary, and successfully re-deploy services on demand.

Security/IAM

We believe that having healthy security restrictions not only protects our systems from others but also protects them from ourselves. All developers have the responsibility to help maintain our production environment, including shutting down and deleting resources when they are no longer needed. However, developers should only be able to make destructive changes intentionally. Ideally, no developer should be able to accidentally shut down production services or delete production data.

To protect against accidental modifications to production systems, destructive changes to production systems should be checked into our SCM as scripts and reviewed according to our usual process. The changes should not actually be made until after the review has been completed. You can find a more detailed explanation of the steps involved on the wiki.

Our permissions are generally controlled via Amazon's Identity and Access Management.