Confessions of a Build Engineer

During the section on Revision Control Systems I briefly discussed how we would sync source code or data from our RCS and the decisions that would lead to that pull. It is fairly safe to say that scheduling and orchestration is one of those areas that seems simple on the surface and yet can explode into something enormously complicated depending on the structure of the software you’re trying to build, the spread of the workforce, the resources available to build the product and so on.

In simple terms, the thing we’re trying to achieve is the smallest Time To Dev (T2D) value possible. T2D can be defined as the time it takes from the minute an engineer/artist/whoever on the studio floor hits the submit button to that change (whether it is code, data or something else) making its way through the system and becoming a usable part of the product. Your T2D value might also include the time it takes for that product to go through a barrage of tests (automated or manual - it doesn’t really matter), but whatever it covers, the end goal is to measure the time taken for a change you make to get into the final product and potentially onto a consumer’s console. The reason for this is hopefully obvious - the quicker a change is reflected in the product, the quicker you are able to respond to critical issues or make important updates. In game dev this is often reflected in how quickly the studio is able to review changes to the game. The iterative cycle of “build, test, play, review” is extremely important, and it’s that process that often helps teams “find fun, fast”. It’s easy to plan something out and expect it to be fun, but it’s not until you actually try it that you’ll know whether you succeeded.
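As a concrete illustration, T2D for a single change is just the gap between two timestamps: when the change was submitted and when the resulting build became usable. A minimal sketch of measuring it might look like the following - the record format and values are hypothetical, and in practice the timestamps would come from your RCS and build system logs.

```python
from datetime import datetime, timedelta

# Hypothetical records: (changelist, submitted_at, build_available_at)
CHANGES = [
    (11, datetime(2024, 3, 4, 9, 12), datetime(2024, 3, 4, 10, 47)),
    (12, datetime(2024, 3, 4, 9, 30), datetime(2024, 3, 4, 11, 2)),
]

def time_to_dev(submitted_at: datetime, available_at: datetime) -> timedelta:
    """T2D for one change: submit button pressed -> change usable in a build."""
    return available_at - submitted_at

t2d_values = [time_to_dev(submitted, available) for _, submitted, available in CHANGES]
average_t2d = sum(t2d_values, timedelta()) / len(t2d_values)
print(f"average T2D: {average_t2d}")  # the number we are trying to drive down
```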

So what’s the best way to handle this scheduling? One of the first thoughts here is often to trigger the build every time a change is submitted. In very small applications or microservices this may be possible. It’s also possible if you have a small team that isn’t submitting a barrage of changes every minute. If your T2D is very small this may be the perfect solution for you. In game dev, however, builds are often big and can take a significant amount of time to process. The hardware requirements are often fairly large and high-spec, so allowing this to happen for every single change would require massive amounts of infrastructure and hardware - and probably more time than you realistically have in a day.

With this in mind we often resort to a timed polling of the RCS. What this means in practice is that every X minutes we poll the RCS and check for changes. If changes are found we start the build pipeline. This is where things get exciting. It’s very rare for a game pipeline to simply be about pressing ‘go’ and the game popping out the end. We have to take into consideration a lot of things about the way that changes propagate through the system and how different outputs are used by the studio during production.
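At its core, that polling loop is not much more than the sketch below. The latest_changelist and start_build functions are placeholders - in practice they would wrap whatever your RCS and build system actually expose (a Perforce query, a Jenkins API call and so on) - and the five minute interval is just an example.

```python
import time

POLL_INTERVAL_SECS = 5 * 60  # poll the RCS every five minutes (our 'X')

def latest_changelist() -> int:
    """Placeholder: ask the RCS for the newest submitted changelist number."""
    raise NotImplementedError("hook this up to your RCS tooling")

def start_build(changelist: int) -> None:
    """Placeholder: kick off the build pipeline at a specific changelist."""
    raise NotImplementedError("hook this up to your build system")

def poll_forever() -> None:
    last_built = latest_changelist()   # start from wherever the depot is now
    while True:
        time.sleep(POLL_INTERVAL_SECS)
        head = latest_changelist()
        if head > last_built:          # something new arrived since our last build
            start_build(head)          # build at the head changelist...
            last_built = head          # ...and remember what we have consumed
```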

A simplified example of this is described below, whereby a member of the team can be working in one of two areas of the game - the actual game code (hardcore physics, rendering, audio engineering) or the game content (the level definitions, character models, game modes, all manner of other things that make games fun);

  • Code
    • Produces the game executables as well as some tooling
    • Tooling
      • Used to build the Content data
  • Content
    • Basically a huge database of all assets containing textures, audio, 3D models, game mode definitions etc
    • Requires a set of prebuilt tools from the code line to edit properly or ‘process’

It would seem fair to assume that we could orchestrate this so that any submission to the RCS starts this pipeline, building Code and then Content in one big, time-consuming run. There are a number of considerations here, though, that we should look into. The first is breaking this pipeline into two. If an artist has a set of tools that works and is only making texture changes, do we really need to rebuild the code every single time a content creator submits a change? I suspect the answer is “probably not” - why rebuild when we don’t need to? Great! So we’ve just chopped a huge amount of unnecessary time out of our build and reduced our T2D significantly.
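One simple way to make that split is to look at which paths a changelist touched and only trigger the pipelines that care about them. The depot layout below is made up purely for illustration - adjust the prefixes to match your own structure.

```python
# Hypothetical depot layout: code under //depot/code/..., content under //depot/content/...
CODE_PREFIX = "//depot/code/"
CONTENT_PREFIX = "//depot/content/"

def pipelines_for_change(changed_files: list[str]) -> set[str]:
    """Decide which pipelines a changelist actually needs to trigger."""
    pipelines = set()
    for path in changed_files:
        if path.startswith(CODE_PREFIX):
            pipelines.add("code")      # code changed: rebuild executables and tools
        elif path.startswith(CONTENT_PREFIX):
            pipelines.add("content")   # content changed: rebuild the data only
    return pipelines

# A texture-only changelist triggers just the content build:
print(pipelines_for_change(["//depot/content/textures/rock_01.dds"]))  # {'content'}
```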

Can we do the same for the programmers working on code? Let’s assume that Susan is a programmer working on support for a new feature. Jeff is an artist putting in a new 3D model for a character. Both of them got in first thing this morning and grabbed the latest version of the code and data from our RCS. They’re both on CL10 and working diligently to make their updates. At some point Susan submits her code (CL11) and the code builds successfully; let’s short-circuit our build here and not do the data build - after all, Susan hasn’t changed any data as she’s not an artist.

An hour later, Jeff finishes his new character model and submits it, but the build breaks. Jeff is adamant that his data is fine as it was all built with a working copy (CL10) of the tools, but the code has moved on and, unbeknownst to him, Susan has submitted a breaking change that means the data no longer builds. The only way we’d have known this is if Susan’s code changes had been used to build the data immediately after they were available.

Let’s now flip the scenario and assume Susan hasn’t broken the tool chain. Jeff is using an older version of the tools, which works fine to produce the model he’s making. If he submits now, it might all be OK, but it might also break. Jeff now has a choice - he can get the latest version of the tools and make sure his changes are OK, or he can submit and let the build pipeline tell him if he’s got it wrong. The safest option here might be to make Jeff grab the latest version of the tools and make sure he’s on the latest working version before submitting, but Jeff probably isn’t the only artist in the studio making changes and Susan probably isn’t the only engineer. That rate of change quickly makes it untenable to ensure that everyone is always on the latest version of everything in this manner.

In this relationship, the content creators are definitely downstream and very reliant on the engineers doing their job correctly and not submitting code that is going to break them. You would hope that no engineer would submit code that is just nonsense and doesn’t even compile - after all, they should be building and testing all of their changes locally before they go anywhere near the submit button - but when there are fifty other people also updating the code and making changes, breakages happen. We cannot therefore guarantee that the code in the RCS is always going to be ‘good’, but equally we can’t hang our content creators out to dry.

One way to work around this problem is by implementing a ‘last known good’ system. As the code gets submitted by the engineers, it gets compiled and the tool chain built as usual. The building of the data, as well as being a requirement of our ability to ship the game, then serves another purpose: it becomes a validation step for the engineers. If their code builds, and it is successfully used to build the data that is in the RCS at the time their code changes were submitted, then it becomes a ‘last known good’ build. We can mark this build as LKG in some form (a piece of metadata, a record in a database or some other method) and the content creators can then make sure that they only take versions of the tools that have been tested against the data that’s currently in the RCS. The onus is then on Jeff to make sure that his changes work with the LKG code. As well as handling the revision of our code and data assets, we now have the extra problem of tracking the revision of our built executables and binaries within the system itself.
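A minimal sketch of that LKG marker could be as simple as a small metadata record that the pipeline writes after the toolset has successfully built the data, and that the content creators’ tooling reads before grabbing any tools. The file location and field names here are entirely hypothetical - a database row or a label in the RCS would serve just as well.

```python
import json
from pathlib import Path

# Hypothetical location on a shared drive where the pipeline publishes LKG metadata
LKG_FILE = Path("/builds/toolsets/last_known_good.json")

def mark_last_known_good(code_changelist: int, toolset_path: str) -> None:
    """Called by the pipeline once this toolset has successfully built the data
    that was in the RCS when the code was submitted."""
    record = {"code_changelist": code_changelist, "toolset": toolset_path}
    LKG_FILE.write_text(json.dumps(record))

def fetch_last_known_good() -> dict:
    """Called by the content creators' tooling to pick up a validated toolset."""
    return json.loads(LKG_FILE.read_text())
```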

This is quite a difficult problem to solve already, and in this example we only have one set of dependencies - the tools that are required to build the data. In reality, game dev often has many, many dependencies in the system, as you might imagine. The tool pipeline often cannot be summarised as just ‘tools’ - it’s model compilers, audio compressors, video codecs, light baking solutions, editors for each of these things and converters from commercial packages like Photoshop or 3DS Max. There’s a whole lot of chaining that happens here and the requirements and dependencies get more and more complicated.

Unfortunately this isn’t the end of the story - even in our simple example. Let’s assume for the moment that the game that Susan and Jeff are working on is being released on three different consoles - let’s call them Mony, Nicrosoft and Sintendo. Each of these will have its own format or requirements for the way that its executables are built and the way that data might be laid out. Each of these consoles will almost certainly have its own development kit, which itself will be released in different versions by the relevant manufacturer. One of the consoles may be a different ‘generation’ and require an HD build while the others require UHD, meaning that for that one console the data assets have to be built in a different manner. Taking all this into account, we eventually end up with three products at the end of the pipeline, our game in three different variants. How do we declare our game is good? What if the code builds for everything but on one console the data fails for some reason? Maybe it all builds on each console but one of them exhibits odd behaviour that only shows up when running the game? What if one of the consoles enacts region locking, so you have to build one version of the game for North America, one for Europe and one for the rest of the world?

Typically we would also produce a minimum of three different code configurations - debug, release (a variant that is almost retail and much more lightweight than a full debug build, but still contains some debugging and metrics ability) and retail (the version that customers will get on the disc when they buy the game).

The final piece of this puzzle is something that has thankfully simplified a lot in recent years: the final packaging of the product. This is when we take all the executables and data that we’ve built and bundle them up into some sort of image that will work on the relevant consoles. This used to mean we would end up producing ISO images and layouts for Blu-ray or DVD discs. Some consoles use an SD card instead of an optical disc format, so again the layout is different, but at the end of the day it basically means taking everything that we’ve made and formatting it so the final retail console can understand it.

For a single branch of our simple game example, our end of line builds and configurations might look something like this;

  • Toolset required for building Data
  • Mony
    • ‘Generation 3’ console
      • Code
        • Debug
        • Release
        • Retail
      • Data
        • HD Assets
      • Packaging
        • DVD ISO
          • US Region
          • European Region
          • Rest Of World Region
    • ‘Generation 4’ console
      • Code
        • Debug
        • Release
        • Retail
      • Data
        • UHD Assets
      • Packaging
        • Blu-ray ISO
          • US Region
          • European Region
          • Rest Of World Region
  • Nicrosoft
    • Code
      • Debug
      • Release
      • Retail
      • Performance Checking Variant
    • Data
      • UHD Assets
    • Packaging
      • Blu-ray ISO (universal / regionless)
      • Online store downloadable version
  • Sintendo
    • Code
      • Debug
      • Release
      • Retail
    • Data
      • HD Assets
    • Packaging
      • DVD ISO (universal / regionless)

The explosion of options here can become quite staggering, and at the front of all this sits the scheduling and orchestration needed to make sure all the pieces end up in the right place at the right time, all while maintaining as efficient a T2D as possible.
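Simply enumerating that matrix gives a feel for the scale. The sketch below mirrors the tree above; the platforms, configurations, asset tiers, packages and regions are obviously illustrative, so swap in your own.

```python
from itertools import product

# Mirrors the (made-up) matrix above for a single branch
MATRIX = {
    "Mony Gen3": {"configs": ["debug", "release", "retail"], "data": ["HD"],
                  "packages": ["DVD ISO"], "regions": ["US", "EU", "ROW"]},
    "Mony Gen4": {"configs": ["debug", "release", "retail"], "data": ["UHD"],
                  "packages": ["Blu-ray ISO"], "regions": ["US", "EU", "ROW"]},
    "Nicrosoft": {"configs": ["debug", "release", "retail", "perf"], "data": ["UHD"],
                  "packages": ["Blu-ray ISO", "Store download"], "regions": ["ALL"]},
    "Sintendo":  {"configs": ["debug", "release", "retail"], "data": ["HD"],
                  "packages": ["DVD ISO"], "regions": ["ALL"]},
}

def enumerate_targets():
    """Yield every (platform, config, data tier, package, region) combination."""
    for platform, axes in MATRIX.items():
        for combo in product(axes["configs"], axes["data"], axes["packages"], axes["regions"]):
            yield (platform, *combo)

targets = list(enumerate_targets())
print(f"{len(targets)} build/package targets on one branch")  # 29 for this toy matrix
```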

Something we also need to consider in more detail here is the actual scheduling times. Often teams ask for an ‘overnight build’. Mainly, the aim here is to take some downtime and turn it into something productive. Maybe the aim is to clear all intermediates and produce a ‘clean build’ overnight, perhaps with some more intensive tests run automatically once it’s done. The desire to come in first thing in the morning and find a nice tidy error report of all the things that went badly (or things that went well; let’s stay optimistic!) is quite enticing. But what happens when you onboard a team halfway round the world to work on some specific aspect of the game? This might not happen for the vast majority of game studios, but for any studio that is owned by a large publisher I’m willing to bet it will happen at least once per game. In the current post-COVID environment, when studios are more willing to hire specialists from anywhere in the world and are allowing or even encouraging remote work, the concept of an ‘overnight build’ evaporates quite quickly. Where you were once doing builds at 3am because that was a nice tidy gap in the schedule before everyone made it into the office at 9am, you’ve now got a large chunk of downtime in the middle of someone’s working day. By the way - this concept of scheduling overnight work also has an impact on the other large area that we see for big systems: maintenance. You will never have two hours spare to ‘run maintenance’, so you need to start considering how you’re going to do it whilst keeping everything else alive and responsive. We’ll cover this more later.

 

The actual scheduling & orchestration tech

When I first started working at EA we were using BuildForge to produce our builds for Need For Speed: Most Wanted (2012). One of the problems with this was that it was a paid-for product where we were effectively paying for agent licences. In order to grow our system and add more nodes to do more work, we would have to pay for more agents. This actually worked out fine, as the system was fairly stable in its size and wasn’t growing too much, but it was a minor ongoing concern every now and then. BuildForge handled all of the scheduling for our build jobs, but we had to write plugins and components that worked out whether there were changes that we needed to sync in order to do the builds. We also had to hack the job list so that steps were squeezed into the process at runtime, depending on the state that we were in at the time. This was quite messy and actually fairly error-prone, but it was the only system that we had. A breakage by the build team here would take down the entire dev team for however long it took us to fix it, which was pretty unacceptable. I’m disappointed to say that in my first few weeks at EA I managed to break this whole system a couple of times as I was still learning, but generally the studio team were very gracious and didn’t shout at me (too much!).

Once we had moved to using the Frostbite engine for our games we inherited a build system specific to Frostbite (lovingly called MonkeyFarm). This system was OK, but nothing special. It was particularly difficult to work with because it originally came bundled within the actual game code itself, which meant making changes was quite hard. I won’t go into all the complications of it here, but one of the biggest problems we had was that the configuration was only loaded at system startup, so every time we needed to make a change to the config - to add a new branch, adjust some settings and so on - we had to shut down the system and restart it with an updated config file. We moved away from it relatively quickly.

Eventually, after some discussion in our wider organisation, DRE decided that we should all try and settle on the same basic system, so we all moved to Jenkins. The intention here was that all of the DRE teams in different studios across the world would then be able to help out in a crisis, as we would all be using the same system and therefore should all be familiar with the workings of it. However, as mentioned earlier, all games are constructed differently, and so the ideal of everyone being able to help out another team anywhere in the world never materialised. Game teams are also notoriously protective about giving systems and source access to people not in their immediate circle of colleagues, so the idea that someone from North America could just come along and restart machines or fix things for a team in Europe was understandably hampered by restrictive access rights. We’ll cover this more in a section about globally distributed teams.

 

No Maintenance Windows

What this all boils down to is that large, potentially global publishers need to realise that they are running in an environment that supports 24/7 development, and that their studios are now also working in this environment instead of being a collection of isolated groups of people all congregating in various offices around the world. In the modern era this should not come as too much of a surprise; other large corporations have been operating in this manner for many, many years. But for some reason, game publishers and studios have yet to catch up. My own personal feeling is that a lot of what makes games fun comes from that close personal interaction with people bouncing ideas off each other in rooms covered in post-it notes, or in corridors on the way to the cafe. This naturally means that a lot of studios are still not in a mode of operation that accepts their staff being spread over large distances as the standard way of working.

When you are working in this environment, the notion that you can segment away a couple of hours in order to take your entire system out of working condition and run some maintenance on it seems absurd. You are going to have to find a way to run that maintenance in between the various daily routine tasks that make up the bulk of your work. Items for consideration here are normally large processing jobs that need to be ‘chunked’ so that they are not running for many hours at a time. If, for example, you have a job that tidies up a network drive by removing old builds (a fairly common pattern in the games industry), don’t try to identify and remove all the builds in one go. Handle them by branch; identify and clean up 10 at a time. Maybe consider running this process in a two-step approach: identify first, remove later. None of this is particularly complicated and uses common approaches that DBAs and other organisations that work with large datasets and 24/7 operation have been dealing with for years, but for some reason we still seem to be struggling with accepting this ‘no maintenance windows’ rule.
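As a rough sketch of that ‘identify first, remove later, never more than a handful at a time’ pattern - the paths, retention count and chunk size below are all made up:

```python
import shutil
from pathlib import Path

BUILD_ROOT = Path("/mnt/buildshare/builds")  # hypothetical network drive of archived builds
KEEP_LATEST = 20                             # builds to keep per branch
CHUNK_SIZE = 10                              # never delete more than this per pass

def identify_stale(branch_dir: Path) -> list[Path]:
    """Step one: identify candidates, oldest first, without touching anything."""
    builds = sorted(branch_dir.iterdir(), key=lambda p: p.stat().st_mtime)
    return builds[:-KEEP_LATEST] if len(builds) > KEEP_LATEST else []

def remove_chunk(candidates: list[Path]) -> None:
    """Step two: remove a small chunk so the job never runs for hours on end."""
    for build in candidates[:CHUNK_SIZE]:
        shutil.rmtree(build)  # assumes each build is a directory on the share

def cleanup_pass() -> None:
    """One short pass; run it regularly in between the daily routine jobs."""
    for branch_dir in BUILD_ROOT.iterdir():  # handle builds branch by branch
        if branch_dir.is_dir():
            remove_chunk(identify_stale(branch_dir))
```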