Mrrrgn's Mumbles
{tags: notes }

A Note on Deterministic Builds
This note is from 2014 and was originally posted on my old blog. It was written to accompany simple-rootkit. I've since joined the JS Engine team, but releng and ops work remain close to my heart.

Since I joined Mozilla's Release Engineering team I've had the opportunity to put my face into a firehose of interesting new knowledge and challenges. Maintaining a release pipeline for binary installers and updates used by a substantial portion of the Earth's population is a whole other kind of beast from ops roles where I've focused on serving some kind of SaaS or internal analytics infrastructure. It's really exciting!

One of the most interesting problems I've seen getting attention lately is deterministic builds, that is, builds that produce the same sequence of bytes from the same source on a given platform at any time.

What good are deterministic builds?

For starters, they aid in detecting "Trusting Trust" attacks. That's where a compromised compiler produces malicious binaries from perfectly harmless source code by replacing certain patterns during compilation. It sort of negates the primary security advantage of open source, right?

Luckily for us users, a fellow named David A. Wheeler rigorously proved that this class of attack can be detected with a technique he calls "Diverse Double-Compiling" (DDC). The gist of it is, you compile the compiler's source code with a second, trusted toolchain, use the result to compile that same source again, then compare a hash of the output with a hash of the potentially malicious compiler binary. If the hashes match, you're safe.

The same rebuild-and-compare approach also detects the less clever scenario where an adversary patches otherwise open source code during the build process and serves up malwareified packages. In either case, it's easy to see that this works if and only if builds are deterministic.
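The comparison step itself is small. Here is a minimal sketch of it, assuming you already have a locally rebuilt artifact and a downloaded one on disk. The function names are my own, and I've used SHA-256 as the hash, which is an assumption rather than anything DDC prescribes.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large installers don't have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_trusted_build(trusted_path, suspect_path):
    """True when both artifacts hash to the same digest."""
    return sha256_of(trusted_path) == sha256_of(suspect_path)
```

A mismatch only tells you the two files differ. It can't tell you whether the cause is an attack or plain old non-determinism, which is exactly why the rest of this note matters.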

Aside from security, deterministic builds can also help projects that support many platforms take advantage of cross-building with less stress. That is, one could compile ARM packages on an x86_64 host, then compare the results to a native build and make sure everything matches up. This can be a huge win for folks who want to cut back on infrastructure overhead.

How can I make a project more deterministic?

One bit of good news is, most compilers are already pretty deterministic (on a given platform). Take hello.c for example:

#include <stdio.h>
int main(void) {
    printf("Hello World!");
}

Compile that a million times and take the md5sum. Chances are you'll end up with a million identical md5sums. Scale that up to a million lines of code, and there's no reason why this won't hold true.
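If you'd rather not take my word for it, the experiment is easy to script. This is a rough sketch, not a polished tool: `build_digests` is a name I made up, the build command is whatever you pass in, and I hash with SHA-256 instead of md5sum.

```python
import hashlib
import subprocess


def build_digests(build_cmd, artifact, runs=3):
    """Run `build_cmd` `runs` times and return the set of artifact digests."""
    digests = set()
    for _ in range(runs):
        # check=True raises if the build itself fails.
        subprocess.run(build_cmd, check=True)
        with open(artifact, "rb") as f:
            digests.add(hashlib.sha256(f.read()).hexdigest())
    return digests


# Example (assumes a C compiler named `cc` and hello.c in the current directory):
#   build_digests(["cc", "-o", "hello", "hello.c"], "hello")
```

A set with exactly one element means every run produced the same bytes. More than one means something in the build varies from run to run.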

However, take a look at this doozy:

#include <stdio.h>
int main(void) {
    printf("Hello from %s! @ %s", __FILE__, __TIME__);
}

Having timestamps and other platform-specific metadata baked into source code is a huge no-no for creating deterministic builds. Compile that a million times, and you'll likely get a million different md5sums.

In fact, in an attempt to make the Linux kernel more deterministic, all uses of the __TIME__ macro were removed, and the makefile now specifies a compiler option (-Werror=date-time) that turns any use of it into an error.
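If your compiler doesn't have that option, a crude scan can catch the same thing before it lands. The sketch below is only an illustration: the function name is mine, and it looks for __DATE__, __TIME__ and __TIMESTAMP__, the three macros GCC's date-time warning covers. It is a plain text match, so it will also flag occurrences inside comments and string literals.

```python
import re

# The timestamp macros covered by GCC's -Wdate-time warning.
TIMESTAMP_MACRO = re.compile(r"\b__(?:DATE|TIME|TIMESTAMP)__\b")


def find_timestamp_macros(source_text):
    """Return (line_number, macro) pairs for each timestamp macro found."""
    hits = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for match in TIMESTAMP_MACRO.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```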

Unfortunately, removing all traces of such metadata from a mature code base can be all but impossible. However, a fantastic tool called Gitian lets you compile projects within a virtual environment where timestamps and other metadata are controlled.
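For build scripts you control, you can also pin the timestamp yourself. The sketch below is not how Gitian works. It assumes the SOURCE_DATE_EPOCH environment variable, a convention from the Reproducible Builds project that this note doesn't otherwise cover, and the function names are hypothetical.

```python
import os
import time


def build_timestamp(environ=os.environ):
    """Return the build time as whole seconds since the epoch.

    Prefers SOURCE_DATE_EPOCH so repeated builds embed the same value.
    """
    pinned = environ.get("SOURCE_DATE_EPOCH")
    if pinned is not None:
        return int(pinned)
    return int(time.time())  # Non-reproducible fallback.


def build_banner(environ=os.environ):
    """Format the timestamp in UTC so the host's timezone cannot leak in."""
    stamp = time.gmtime(build_timestamp(environ))
    return time.strftime("Built %Y-%m-%d %H:%M:%S UTC", stamp)
```

Setting the variable to, say, the time of the latest commit gives every builder the same banner for the same source.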

Another trouble spot to consider is static linking. Here, unless you're careful, determinism sits at the mercy of third parties. Be sure that your build system has access to identical libraries from anywhere it may be used. Containers and pre-baked VMs seem like a good choice for fixing this issue, but remember that you could also be passing around a tainted compiler!
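One cheap way to check that every machine links against identical libraries is to pin a digest for each input and verify it before building. This is a minimal sketch under my own assumptions: `pinned` is a hypothetical mapping from file path to expected SHA-256 hex digest, which you would keep in version control.

```python
import hashlib


def verify_pinned_inputs(pinned):
    """Return the paths whose SHA-256 digest differs from the pinned value.

    `pinned` maps a file path to its expected hex digest.
    """
    mismatched = []
    for path in sorted(pinned):  # Sorted so the report order is stable too.
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != pinned[path]:
            mismatched.append(path)
    return mismatched
```

An empty list means every library matches its pin. Anything else is a reason to stop the build.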

Scripts that automate parts of the build process are also a potent breeding ground for non-deterministic behaviors. Take this Python snippet for example:

import os
with open('manifest', 'w') as manifest:
    for dirpath, dirnames, filenames in os.walk("."):
        for filename in filenames:
            manifest.write(os.path.join(dirpath, filename) + "\n")  # assumed loop body

The problem here is that os.walk will not always yield filenames in the same order: it returns them in whatever order the underlying filesystem lists them. :(
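The usual fix is to sort at every level of the walk. Here is a minimal sketch. The function names are mine, and it assumes a manifest with one path per line is what you want.

```python
import os


def sorted_file_list(root):
    """Return every file path under `root` in a stable order."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place fixes the order os.walk descends in.
        dirnames.sort()
        for filename in sorted(filenames):
            paths.append(os.path.join(dirpath, filename))
    return paths


def write_manifest(root, manifest_path):
    """Write one path per line with fixed "\n" line endings."""
    with open(manifest_path, "w", newline="\n") as manifest:
        for path in sorted_file_list(root):
            manifest.write(path + "\n")
```

Path separators still differ between platforms, so this only promises identical output for the same tree on the same platform.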

One also has to keep in mind that certain data structures become very dangerous in such scripts. Consider this pseudo-python that auto generates some sort of source code in a compiled language:

weird_mapping = dict(file_a=99, file_b=1)
things_in_a_set = set(["thing_a", "thing_b", "thing_c"])
for k, v in weird_mapping.items():
    ...  # generate some code
for thing in things_in_a_set:
    ...  # generate some code

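A pattern like this would dash any hope that your project had of being deterministic, because it iterates over unordered data structures. Sets have no guaranteed iteration order, and neither did dicts before Python 3.7 (they now preserve insertion order).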
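Sorting before you iterate removes the problem. The sketch below is illustrative only: the generated "#define" and "extern" lines are made up, standing in for whatever code your script really emits.

```python
def generate_source(mapping, names):
    """Emit C-like declarations in sorted order so the output is byte-stable."""
    lines = []
    # sorted() on items orders by key, regardless of insertion or hash order.
    for key, value in sorted(mapping.items()):
        lines.append("#define %s %d" % (key.upper(), value))
    for name in sorted(names):
        lines.append("extern int %s;" % name)
    return "\n".join(lines) + "\n"
```

Called with the mapping and set from the snippet above, this returns the same text on every run, no matter how the inputs were built up.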

Enforcing determinism from the beginning of a project's life cycle is the ideal situation, so I would highly recommend incorporating it into CI flows. When a developer submits a patch, it should include a hash of their latest build. If the CI system builds and the hashes don't match, reject that non-deterministic code! :)
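A CI gate for this can be tiny. This is a sketch under assumptions of my own: the patch carries a SHA-256 hex digest, CI has already produced its own artifact, and `check_submitted_hash` is a hypothetical helper rather than part of any real CI system.

```python
import hashlib


def check_submitted_hash(artifact_path, submitted_hex):
    """Compare CI's own build against the hash a developer submitted.

    Returns (ok, message) so the caller decides how to fail the job.
    """
    with open(artifact_path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    # Tolerate stray whitespace and uppercase hex in the submitted value.
    expected = submitted_hex.strip().lower()
    if actual == expected:
        return True, "build reproduced: " + actual
    return False, "hash mismatch: CI built %s, patch claims %s" % (actual, expected)
```

The CI job would exit with a non-zero status whenever the first element is False.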


Of course, this hardly scratches the surface of why deterministic builds are important, but I hope this is enough for a person to get started on. It's a very interesting topic with lots of fun challenges that need solving. :) If you'd like to do some further reading, I've listed a few useful sources below.
01-23-2016