David Baron's weblog: February 2008

Wednesday 2008-02-27

The costs of monoculture (13:15 -0800)

The original purpose of tinderbox was simple: to ensure that when a developer updated to the very latest code in Mozilla's code repository, that code would compile without errors. This makes people more likely to stay up to date, leads to more testing of the latest code, and leads to more testing of how changes still being written interact with the latest code. It saves time wasted dealing with code that doesn't compile. And it saves us from losing contributors who give up on building Mozilla because it doesn't compile on their machine.

To do this, build machines on various platforms and in various configurations pull and build the code and report whether they succeed, so that we know what's broken. For tinderbox to serve its purpose, these build machines should reflect the diversity both of how developers (and potential developers) build and of what we ship. They need to build across the platforms used, across the compilers used, across different versions of both platforms and compilers, and across the options used (for example, the debugging options we want developers to use and the release options that we ship to users). Since we want developers building across these configurations to be able to pull new code without fear of compilation errors, this diversity is a good thing.
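To make the idea of coverage concrete, here is a minimal sketch of a build matrix and a check for configurations that no machine builds. Everything in it is an assumption for illustration: the platform, compiler, and option names are placeholders, and `build_matrix` and `uncovered` are hypothetical helpers, not part of tinderbox.

```python
from itertools import product

# Placeholder axes; the real set of tinderbox configurations is not listed here.
PLATFORMS = ["linux", "mac", "windows"]
COMPILERS = {
    "linux": ["gcc-old", "gcc-new"],
    "mac": ["gcc-old", "gcc-new"],
    "windows": ["msvc-old", "msvc-new"],
}
BUILD_OPTIONS = ["debug", "release"]


def build_matrix():
    """Return every (platform, compiler, options) combination we care about."""
    matrix = []
    for platform in PLATFORMS:
        for compiler, options in product(COMPILERS[platform], BUILD_OPTIONS):
            matrix.append((platform, compiler, options))
    return matrix


def uncovered(matrix, machines):
    """Return the configurations in the matrix that no build machine covers."""
    covered = set(machines)
    return [config for config in matrix if config not in covered]


if __name__ == "__main__":
    # A single standardized machine leaves most of the matrix untested.
    machines = [("linux", "gcc-new", "release")]
    for config in uncovered(build_matrix(), machines):
        print("not built anywhere:", config)
```

With only one machine configuration in `machines`, nearly every combination shows up as untested, which is the gap that a diverse set of build machines is meant to close.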

Since tinderbox was created, we've started using it for more tests than just ensuring that the code compiles correctly. We use these builds to measure speed and memory use and to test correctness in many ways. Since we want Mozilla to function correctly and perform well across a variety of systems, this diversity is also a good thing.

(The diversity of tinderbox can be painful sometimes: the more diverse the machines we test on, the more often we catch unexpected problems. But it's almost always better to catch these problems sooner rather than later. And we can fix much of the pain with tools like the try server that allow running the various tests in some or all of the different environments, potentially in ways that allow the results to be analyzed further, such as performance profiles and detailed memory tool output. The diversity can also be painful because many of the tinderbox machines are slow: when unexpected results take multiple hours rather than a fraction of an hour to arrive, we lose time waiting for results and waiting to confirm that problems thought to be fixed actually have been. Eight years ago I never would have imagined that today we'd still be dealing with multi-hour tinderbox machine cycles.)

Compiling the binary releases that Mozilla ships to users is an entirely different story. Changing which compiler is used to compile a release, which operating system version it is compiled on, or which build options we use can change which users those builds function correctly for. So when we build releases, we need to test the environment that we use to make the builds and ensure that we don't accidentally change that environment. To help with this, over the past few years, we've become much better about documenting how we build the environments we use to compile our releases, and in many cases we create standard virtual machine images according to that documentation so that they can be reused. This consistency is a good thing.
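One way to guard against accidentally changing a release environment is to compare the documented environment with what a build machine actually reports. The sketch below is only an illustration under assumed inputs: the dictionary keys and values are invented, and `fingerprint` and `changed_keys` are hypothetical helpers rather than anything Mozilla uses.

```python
import hashlib
import json


def fingerprint(env):
    """Hash an environment description so two descriptions can be compared quickly."""
    # sort_keys makes the hash independent of the order the keys were written in.
    canonical = json.dumps(env, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(documented, actual):
    """Return the sorted names of settings that differ between the two descriptions."""
    keys = set(documented) | set(actual)
    return sorted(k for k in keys if documented.get(k) != actual.get(k))


if __name__ == "__main__":
    # Invented example values, not a real release configuration.
    documented = {"os": "example-os 1.0", "compiler": "example-cc 3.4", "options": "release"}
    actual = {"os": "example-os 1.0", "compiler": "example-cc 4.1", "options": "release"}

    if fingerprint(documented) != fingerprint(actual):
        print("release environment drifted:", changed_keys(documented, actual))
```

The hash gives a cheap yes-or-no answer, and the key comparison says what changed when the answer is no.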

But lately I've been worried by what appears to be a trend of applying this consistency to the places where we want diversity. Sure, this does make maintaining these build machines easier. But at Mozilla we should recognize the costs of this consistency as well: we lose the benefits of the machine diversity that I described above. After all, we often hear how requiring every computer in a company to run Windows XP with IE7 reduces the maintenance burden and makes writing Intranet sites easier, since they only have to be written for one browser. (And we often reply by describing both the specific advantages of Firefox and the general benefits of competition.) A monoculture of build machines has costs, just like the monocultures of operating systems and Web browsers that we are accustomed to fighting. We could trap ourselves into overly specific build requirements and become unable to attract new developers or take advantage of new compilers. We could end up optimizing our performance for operating systems that are no longer widely used. We could even trap ourselves into overly specific runtime requirements that hurt our ability to support users on the wide variety of systems that run Mozilla. These changes aren't likely to happen in sudden steps, but they could happen gradually and hurt us in the long term. We should consider these costs carefully.