Confessions of a Build Engineer

When setting up your build pipelines you are creating a system for other engineers and game devs to use, whether they see its direct effects or not. In most cases, a large number of people working in a studio will never even notice your work. They check things into source control and, at some point afterwards, a build of the game is delivered to a network share or other delivery system for them to pull down and review or test later. It also means that, when using the system, most devs are not actually concerned with how it goes about producing the builds they consume.

You will eventually reach a point, though, where a dev comes to you and tells you that the build they've just downloaded isn't working. The kicker is that if they sync to the same changelist and do a complete build locally, everything works. This is well known in software engineering circles as "works on my machine" syndrome.

This is where things get a bit tricky. A build system is made up of many, many connected components, held together with whatever glue the Build Engineers could come up with. Processes are broken up and split across multiple machines to run in parallel, and they often rely on shared caching systems designed to speed up the same work for the whole studio. If you are using a bank of build machines that are not dedicated to specific build jobs, you may end up running on a machine that hasn't been allocated for a while and is therefore full of stale intermediates left over from the last time the software was built on it.
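
One pragmatic mitigation is to have each build machine record what it last built and refuse to reuse intermediates across different streams of work. Here's a minimal sketch of that idea, assuming a hypothetical workspace layout and fingerprint scheme (none of these paths or names come from a real system):

```python
#!/usr/bin/env python3
"""Hypothetical pre-build guard: wipe intermediates when this machine
last built a different branch than the one now requested."""

import json
import shutil
from pathlib import Path

INTERMEDIATE_DIR = Path("D:/BuildWorkspace/Intermediate")  # assumed layout
STAMP_FILE = INTERMEDIATE_DIR / "last_build.json"

def ensure_clean_intermediates(branch: str, changelist: int) -> None:
    """Delete intermediates if they came from a different stream of work."""
    requested = {"branch": branch, "changelist": changelist}
    if STAMP_FILE.exists():
        previous = json.loads(STAMP_FILE.read_text())
        # Same branch: incremental intermediates are probably safe to reuse.
        if previous.get("branch") == branch:
            STAMP_FILE.write_text(json.dumps(requested))
            return
    # Different (or unknown) branch: start from a known-clean state.
    shutil.rmtree(INTERMEDIATE_DIR, ignore_errors=True)
    INTERMEDIATE_DIR.mkdir(parents=True, exist_ok=True)
    STAMP_FILE.write_text(json.dumps(requested))
```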

So the biggest problem when builds behave differently depending on where they were produced is working out what those differences are and why they have the effect that they do.

A large source of these problems comes from the CI/CD system and the way it has been set up, especially when working with large game engines; more so if that engine is an in-house engine and not publicly available (public or open engines, by their nature, have to be able to work outside the company's infrastructure). The long and the short of it is that the engine is often built in a way that makes it easy for content editors to use, so it's coupled with launchers, editors, caching systems and all manner of scripts and tools that a developer can run on their system to get through the tasks that result in magic happening.

In a lot of cases, when these things start to get turned into an automation system, the Build Engineers have to re-invoke that magic to make things happen in a similar way. This is often done by setting specific environment variables within the build system, or by calling the same scripts and mocking up the parameters that would normally be passed in through a UI, so that the build system pretends it is being run the same way a user would run it. Add to that a whole bunch of other settings and values put in place so the build system itself does the right thing, and you get a cocktail of potential misconfiguration.
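
To make that concrete, here's what such a wrapper often ends up looking like. Every environment variable, path and script name below is an assumption standing in for whatever your engine actually expects; none of it is a real engine's API:

```python
#!/usr/bin/env python3
"""Illustrative wrapper showing how an automation job might recreate the
environment a desktop user would normally have."""

import os
import subprocess

def run_cook_step(changelist: int, platform: str) -> None:
    env = os.environ.copy()
    # Values a launcher or editor UI would normally supply (hypothetical names).
    env["ENGINE_USER_PROFILE"] = r"C:\BuildUser\Profile"
    env["ASSET_CACHE_ENDPOINT"] = "tcp://asset-cache.internal:7777"
    env["INTERACTIVE_SESSION"] = "0"  # tell tools not to pop UI dialogs

    # Call the same script a developer runs locally, with the UI's
    # parameters mocked up on the command line.
    subprocess.run(
        ["python", "Engine/Build/cook_content.py",
         "--platform", platform,
         "--changelist", str(changelist),
         "--unattended"],
        env=env,
        check=True,
    )
```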

Another issue that can trip things up is when parts of the engine have to run as a specific user, because that's what they've always expected. Under automation there is often no specific user, which means you frequently end up with the build system running as a custom Build User: its own account, its own elevated levels of system access, but no actual real-world human tied to it. As well as the other problems this can cause, it's also a potentially large security risk.

So what do we do about this? In my experience the best way to combat it is to reduce the customisation gap. So… what exactly do I mean by that? The aim is to make sure that whatever processes you go through to get a build via automation, you are equally able to run those same processes on your local machine. Often the best way to do this is to make sure that the steps you run as part of your build, and the scripts you use to run them, are available for anyone working on the game. In effect, you're abstracting your build process out of the scheduler itself, leaving the scheduler to do the things it's good at: allocating available build machines and scheduling jobs accordingly (there's a minimal sketch of this after the list below). This has multiple effects:

  1. As the Build Engineer, you can continue improving the scripts and running tests locally without ever committing those changes to the automation system. This makes your iteration process faster and easier, as you don't have to keep pausing the build system to update settings or configuration values. It also makes it easier for a Build Team to work on parts of the system together: anyone should be able to run a script on their machine to get the desired result, without having to take over the build system itself to do it. Over the years I've seen enough build systems where the only way to change how a build happens is to change a config on the actual build system, which is a massively risky thing to do. Giving your own team the means to rapidly iterate and test their own changes, knowing that the system itself will operate in exactly the same way, makes everybody's lives better. It's easier to iterate and it's easier to test.
  2. You know that whatever happens on the build machine can be executed on any developer machine to get the same result. So, when a developer comes to you asking why something looks different in the build that the farm produced, you can guide them through running the necessary scripts on their own machine and let them debug what's going wrong (or why their assumptions are incorrect) without affecting the rest of the build system.
  3. You keep those pesky developers away from the build system itself - no more “Can I just log on to the machine and take a look?”, which should be the death-knell for any machine sitting in an automation farm.
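
What does that abstraction look like in practice? Below is a minimal sketch of a single entry-point script that both the scheduler and a developer can invoke. The step names and step scripts are illustrative assumptions, not a prescription:

```python
#!/usr/bin/env python3
"""Sketch of a single build entry point shared by CI and developers.
The scheduler's job definition shrinks to one command, and a developer
can run exactly the same command locally."""

import argparse
import subprocess
import sys

STEPS = {
    "compile": ["python", "Build/compile.py"],  # hypothetical step scripts
    "cook":    ["python", "Build/cook_content.py"],
    "package": ["python", "Build/package.py"],
}

def main() -> int:
    parser = argparse.ArgumentParser(description="Run one or more build steps.")
    parser.add_argument("steps", nargs="+", choices=STEPS.keys())
    parser.add_argument("--platform", default="win64")
    args = parser.parse_args()

    for step in args.steps:
        print(f"--- running step: {step} ---")
        result = subprocess.run(STEPS[step] + ["--platform", args.platform])
        if result.returncode != 0:
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

With something like this in place, the CI job definition reduces to the one command `python build.py compile cook package --platform win64`, which is exactly what a developer can paste into their own shell when chasing a discrepancy.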

There is one other aspect to all this, and that's build determinism. I may well cover this subject in more depth at some point, but the short version goes like this:

  • Whatever goes into the build is what makes the build (and nothing else!)

Now that may seem obvious, and there are many people who will argue that their build systems always work that way, and they may well be right. But it might also be that they've not looked closely enough. Many moons ago, Frostbite had issues with the datetime being stamped into various components at the moment of the build (I believe that, in real terms, this was fixed a long, long time ago). What this meant is that the build was not deterministic. You could run the build and get a build out of it, clear all intermediates, run the build again and get a result that was very, very close to, but not bit-for-bit identical to, the previous build, even though absolutely nothing had changed.
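
This property is cheap to check for, even if it can be expensive to fix. A minimal smoke test, assuming a hypothetical build command and output layout, is simply: build twice from clean and diff the hashes:

```python
#!/usr/bin/env python3
"""Minimal determinism smoke test: build twice from a clean state and
compare output hashes. The build command and directories are
placeholders for whatever your pipeline actually produces."""

import hashlib
import shutil
import subprocess
from pathlib import Path

BUILD_CMD = ["python", "Build/build.py", "package"]  # hypothetical
OUTPUT_DIR = Path("Output")

def hash_tree(root: Path) -> dict[str, str]:
    """Map each output file to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def clean_build() -> dict[str, str]:
    shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
    shutil.rmtree("Intermediate", ignore_errors=True)
    subprocess.run(BUILD_CMD, check=True)
    return hash_tree(OUTPUT_DIR)

first, second = clean_build(), clean_build()
for name in sorted(set(first) | set(second)):
    if first.get(name) != second.get(name):
        print(f"NON-DETERMINISTIC: {name}")
```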

Now this may seem like a small, innocuous problem, but those datetime stamps could mean that other things were misaligned in compiled files, or that they caused other systems to rebuild things that hadn't changed. And if those parts also weren't deterministic, we're suddenly off rebuilding even more unnecessary stuff. And so on. Things get worse when the non-deterministic part is something less obvious, like a GUID or a piece of meaningful data, instead of something human-readable and obvious like a datetime stamp.
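
As an illustration of the GUID case, the usual fix is to derive identifiers from the content rather than generating them fresh at build time. A sketch using Python's standard `uuid.uuid5` (the namespace derivation here is just an example value):

```python
import uuid

# A fixed, project-chosen namespace (this derivation is just an example).
ASSET_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "assets.mygame.example")

def asset_id(asset_path: str, content_hash: str) -> uuid.UUID:
    """Deterministic ID: the same path + content always yields the same GUID."""
    return uuid.uuid5(ASSET_NAMESPACE, f"{asset_path}:{content_hash}")

# By contrast, uuid.uuid4() would change on every build even when
# nothing did, invalidating downstream caches for no reason.
```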

As a Build Engineer, you probably won't be able to solve all (or even many!) of these determinism problems; they're down to the engine programmers to fix. What you can do is make sure that your own systems are not injecting non-deterministic factors into the build. If you can guarantee that, it's much easier to make a case for getting the remaining problems solved by their respective owners, especially if they're actually hurting build turnaround times or reliability.
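
On the build-system side, that mostly means pinning down the environment you hand to every step. `SOURCE_DATE_EPOCH` is a real convention that many toolchains honour; the rest of this sketch is an assumption about what a typical setup might fix or strip:

```python
"""Illustrative environment normalisation before invoking build steps,
so the build system itself stops being a source of non-determinism."""

import os

def normalised_env(changelist_timestamp: int) -> dict[str, str]:
    env = os.environ.copy()
    # Pin any embedded timestamps to the change being built, not "now".
    env["SOURCE_DATE_EPOCH"] = str(changelist_timestamp)
    # Fix locale and timezone so tools don't format things differently
    # depending on which build machine picked up the job.
    env["LC_ALL"] = "C"
    env["TZ"] = "UTC"
    # Drop machine-specific noise that tools sometimes capture into outputs.
    for noisy in ("COMPUTERNAME", "USERNAME", "USERDOMAIN"):
        env.pop(noisy, None)
    return env
```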