Confessions of a Build Engineer

There are many ways to run your build services in order to ship your game, and the right one depends on a number of factors, some of which are probably outside of your control: the size and distribution of the studio, the amount of hardware (physical or virtual) that you have at your disposal, the orchestration tech that you have settled on, and the actual structure of your game and its related projects.

Generally, in my experience, there are two main patterns, followed by some smaller support ones. The first of these is essential if you ever plan on shipping your title. The second is pretty much optional, but will generally be required in some form as the project gets bigger and more contributors become involved. After that, there are a few other services that we tend to put in place for potential optimisation or convenience.

  • The Release Train
  • Preflights
  • Code and Data Validation Builds
  • Other support jobs

 

The Release Train

For build engineers, the release train is your product. As a build engineer, your primary job is to put in place the pipeline by which changes from the engineers and artists working on the game get through the build and into the final version of the game that will end up on paying customers' consoles. It’s almost certain that this pattern will consist of multiple branches in the RCS, most likely with their own supporting build sub-patterns to ensure those branches are in a good state. The general approach, though, is to make sure that when changes get checked into your main dev branch, they can be filtered and integrated (manually or, most probably in the earlier branches, automatically) down the various layers of branches until they reach the final branch, which is where the release is cut from.

Getting to a good pattern on the release train has two main benefits. The first is that during the early days of development, the game team gets reliable, regular builds which they can review and iterate on at speed. This allows the dev team to quickly work out which features and options go together well to get to a fun time. Obviously they do this on their own dev kits, but it’s only when all the various features from different teams are merged that the product as a whole can be reviewed. This is vital for working out whether it all goes together well or has some glaringly obvious problems that might not be noticed when looking at feature sets in isolation.

Secondly, if you can get the whole release train in place early enough in your dev cycle, then you’re not changing the method of delivery for the final builds. By the time of release, the whole game team is used to the process and methodology that has been in place for a significant amount of time, and they will trust that when it says something has gone wrong, it really has. Additionally, if your pipeline is consistently producing good builds all the way to the final shippable item, they will know that they have a mandate to maintain the state of the build and that doing so will produce the best product in the long run.

When a dev team starts to lose faith in the reliability or speed of turnaround of its supporting build system, you will find yourself trying to debug build failures that are not even yours to investigate (maybe just some C++ failure, or a link problem). This is because the dev team has given up on the state of the build system and no longer trusts that the errors it’s reporting are theirs to deal with. This is the evil that you must combat as a build engineer!

I’ve shipped many Need For Speed titles over the years. Historically they’ve always been a bit ‘spicy’. Need For Speed Unbound, however, seemed different. Now, I can’t speak to the final shipping and how that eventually went down (as I left EA at the end of June 2022 and the game shipped later in the year), but I remember quite clearly going through the expectations with the rest of the team in January as to what the flow of 2022 would probably look like. It normally revolves around some ups and downs of various pattern changes in the build system, with chaos brewing towards the summer as we get ready for shipping the final release. Then, while the release is still going through the final stages, we shift our focus towards patching and game updates, so that when the game is finally going out the door, the team is already looking at how we produce our patches and all the complications that are entailed in that process. This can be a pretty chaotic and busy time, often with an explosion of hardware requirements and sometimes more branches being created late on.

Unbound, though, was different. At the start of the year, when the team all came together in January, we discussed the above flow and what was expected to happen, but by the end of February and going into March it felt like we were in complete chaos. Things were moving so fast, and hardware requirements were increasing so unexpectedly, that the team was beginning to feel a bit pressured by it all. Luckily, we had a fantastic team and they got through all the complication and fuss by using a methodical approach, targeting the right areas and not getting too flustered. I remember saying to the team (we’d had some new starters joining us around May time) that although it had been a bit hard for the last 6 weeks, things were probably about to hot up as we headed into June and that ‘this is where the game dev cycle gets interesting for us’.

But... it didn’t happen. May came and went without issue. June looked like it was relatively calm and there were no crazy requests coming in from the game team as we headed into the summer. We looked for what might have been the reason for this, and it was only when we took a breather to look at the state of the system that we realised the position we had put ourselves in. The same build system that we were shipping Unbound on had been used for multiple years and had shipped multiple Frostbite titles. Over all that time, the team had been diligently making improvements and system upgrades at the same time as shipping the actual titles. When it was Unbound’s turn to ship, we were, for the first time, so far ahead of plan that we had the full shipping and patching / update pipeline in place at the beginning of March. It turns out that was what all the unexpected chaos was; we were about 6 months ahead of schedule. Now we were in a position where, probably for the first time, we were able to prove out and test the full shipping and patching cycle well ahead of time. This gave both the studio and the associated QA folks much-needed information on the state of builds well ahead of where we usually were and, as a result, the last-minute requests and crazy times that we were normally waiting for never emerged.

 

Preflights

There are many definitions of preflights used within various companies and studios. This is the definition of preflights from my point of view and the definition that steered the work that my teams have been involved in over the last eight or so years.

Simply put, Preflights allow your game dev teams to test the validity of the changes that they are currently working on so that they don’t break the build when they submit their changelists. As mentioned in the Scheduling and Orchestration section, there are points throughout the build process where a build can be declared as ‘good’ to a point. These are effectively logical checkpoints that can be built upon with reliability. We can call each of these a Last Known Good (LKG) build. The notion of Preflights builds upon the concept of the LKG. By recording the state of the build (the changelist, hash, build number or whatever) to generate our LKG, and then allowing engineers to supply work-in-progress (WIP) changelists to the Preflight system, we can have our build farm sync the code base to the LKG, apply the user’s WIP changes over the top of it and then attempt to build the product. What this tells us is that, if no other changes happen to the checked-in codebase, we have a valid representation of whether the user's changes are good or not.
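The sync, apply and build sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_preflight`, `PreflightResult` and the three callables (`sync`, `apply_change`, `build`) are hypothetical names, standing in for whatever your revision control and build tooling actually provide.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class PreflightResult:
    lkg: str           # identifier of the Last Known Good build used as the base
    change: str        # identifier of the user's work-in-progress changelist
    succeeded: bool    # did the product build with the change applied?
    details: str = ""  # build output to report back to the user


def run_preflight(
    lkg: str,
    change: str,
    sync: Callable[[str], None],
    apply_change: Callable[[str], None],
    build: Callable[[], Tuple[bool, str]],
) -> PreflightResult:
    """Sync to the LKG, overlay the WIP change, then attempt the build.

    The three callables are placeholders (assumptions) for real tooling.
    """
    sync(lkg)             # 1. put the workspace at the recorded LKG
    apply_change(change)  # 2. apply the user's WIP changes over the top
    ok, log = build()     # 3. attempt to build the product
    return PreflightResult(lkg, change, ok, log)


if __name__ == "__main__":
    steps = []
    result = run_preflight(
        lkg="CL1000",       # made-up identifiers for illustration
        change="WIP-42",
        sync=lambda cl: steps.append(f"sync {cl}"),
        apply_change=lambda cl: steps.append(f"apply {cl}"),
        build=lambda: (True, "build ok"),
    )
    print(steps, result.succeeded)
```

Passing the steps in as callables keeps the sketch independent of any particular RCS or build tool, and makes the ordering (LKG first, WIP change second, build last) easy to test in isolation.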

There are a number of things that can happen here. If the build fails, you simply report back to the user that it failed and give them the details of the failure. This is pretty much the same as any CI/CD system in terms of informing your users of their failures.

If the build succeeds, you can also report that back to the user, but you could also submit the change for them. Or you could make it an option in your Preflight system. It’s your choice.

This does not, however, guarantee a totally smooth transition to nice stable builds from your mainline. If Susan and Geoff both submit their changes to the Preflight system and they both come back good, there’s still no way we can know whether both changes will be problem free together, as they may actually break each other. This will have to be dealt with the same as any standard mainline breakage: Susan and Geoff will have to talk through it together and come to an agreement about who owes who a cup of coffee from the onsite cafe.
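To make the Susan and Geoff problem concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the ‘codebase’ is just a dictionary mapping each function to the functions it calls, and the ‘build’ only checks that every call has a target. Each change passes on its own against the LKG, but the two together do not.

```python
from typing import Callable, Dict, Set

Codebase = Dict[str, Set[str]]  # function name -> names of functions it calls


def build(code: Codebase) -> bool:
    """Toy 'build': passes only if every called function is defined."""
    return all(callee in code for calls in code.values() for callee in calls)


def apply_changes(base: Codebase, *changes: Callable[[Codebase], None]) -> Codebase:
    """Return a copy of the base with each change applied in order."""
    code = dict(base)  # shallow copy so the LKG itself is never modified
    for change in changes:
        change(code)
    return code


# Hypothetical LKG state: main() calls load_level().
LKG: Codebase = {"load_level": set(), "main": {"load_level"}}


def susans_change(code: Codebase) -> None:
    # Susan renames load_level to load_world and updates the existing caller.
    code.pop("load_level")
    code["load_world"] = set()
    code["main"] = {"load_world"}


def geoffs_change(code: Codebase) -> None:
    # Geoff adds a new function that calls load_level, which exists in the LKG.
    code["editor"] = {"load_level"}


if __name__ == "__main__":
    print(build(apply_changes(LKG, susans_change)))                 # True
    print(build(apply_changes(LKG, geoffs_change)))                 # True
    print(build(apply_changes(LKG, susans_change, geoffs_change)))  # False
```

Both preflights were correct about the question they were asked; the breakage only exists once both changes land on the mainline.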

Another issue that comes from this is something we definitely experienced on the later Battlefield titles, and that’s infrastructure usage. As users become familiar with the Preflight system, and as studios make it a requirement before changes are submitted, it’s likely that you will see bottlenecks in the system as everybody starts preflighting their work. In order not to block the studio from making changes, you need to provide enough hardware in the Preflight system to make sure throughput is fast enough for people to get their changes in with a decent turnaround. Initially we supplied enough hardware for preflights to basically check the whole of the game build, but as the build got bigger, the build time got longer and more devs were submitting their changes, we had to make a decision to cut back some of that. The simple thing to do here was to only build a minimal layer of the platforms, for example the toolset, PC build and PS4. This allowed us to utilise our hardware better and provide more ‘lanes’ for building and checking concurrently.

Eventually even this became too much for our on-prem hardware to handle, so we started moving this workload to the cloud. For a task like Preflight, where the only thing you’re interested in is the result (you don’t have to make the build artifacts available anywhere for retrieval), making these build checks available via cloud resources is a great way of improving your resource usage. Also, Preflight is a typical example of workload explosion towards the release of a product, with many more changes going through the pipe as the deadlines loom closer. This kind of elastic workload is what the cloud was built for, and it allows you to expand or contract as your usage patterns do.

One final thing to consider when implementing a Preflight system is user abuse. There is definitely the possibility that users get lazy and just start submitting changelists to the Preflight system instead of simply pressing F7 on their own keyboard to build stuff. I can distinctly remember having to have a conversation about resources with one programmer who had about four Preflights on the go at the same time. After I checked through them, I discovered that most of the changes between each preflight check were simple things like missing semicolons or badly formatted code that would have been flagged up quickly and easily if only they’d at least run the build locally first. When asked about this, they simply said that they were too busy, needed their machine to do other stuff and didn’t have time to wait around for it to build when they could be getting on with something else and let the build farm take the strain. Unfortunately, their actions were actually preventing three other users from preflighting legitimate large changes, so they were duly told that we were unceremoniously killing their running preflights and that others would be told why they were queueing up behind this one person. I think they behaved themselves after that.

 

Code and Data Validation Builds

These are pretty simple and don’t need a lot of explanation. In most cases we can check an awful lot of things on a lowest common denominator first. This can allow for a super quick turnaround time. If we have a build process that has a T2D of an hour, but we can provide a code build validation of a single platform (for example PC) really quickly, then whilst our machinery is off building the full build of everything across all platforms, we can still provide a way of checking all the code changes that are going in to make sure that nothing insane is sneaking in. It’s not unusual for a straight code validation using incremental building, even across a large product and codebase, to take only a few minutes, meaning changes can continue to get checked in and validated for the duration of that hour-long full build. Then, when that finishes and the next full build is due to start, you should already know that there are no code problems likely to come crawling out of the woodwork and kill that build 30 minutes into its long, arduous process. If problems did get found, the responsible dev has most likely already fixed them by the time the big build comes round again.

 

Other support jobs

Finally we come to the ‘other support jobs’. These can be as varied as they are complicated but usually these boil down to the fact that someone, somewhere in the studio has a job that involves them pressing a button to do something, maybe once or twice a day. You’re the team that automates the build, so they come over and ask you to automate it. Depending on your team mission statement you might take on these jobs or not. Personally I’ve often agreed to take on jobs that are tangentially related to building the game (a level optimisation pass, a rendering job, a data export, sending something across the world for an outsourcer to use) but I’ve turned down jobs that are not directly related to doing the build (“QA would love it if you could just run this job for them, as their own engineering team is too busy or doesn’t have an automation system” - in this case, it’s a great way for QA to highlight their lack of resources and try and argue the case for more).

The biggest thing when taking on these tasks is that the requester must prove to you that the job can be run via a script or command line, without user interaction, before you ever consider automating it. You’ll be surprised how many people go away to work it out and never come back, having found out that they (the expert on said system) couldn’t automate it on their own PC, let alone get us to do it across a bank of machines spread around the world.
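One rough way to check the ‘no user interaction’ requirement is to run the job with no input available and a time limit. The Python sketch below is a minimal illustration, not a complete test: `runs_unattended` is a hypothetical helper, the 60 second default is an arbitrary assumption, and a job that pops up a GUI dialog will only be caught if it blocks long enough to hit the timeout.

```python
import subprocess
import sys
from typing import Sequence


def runs_unattended(command: Sequence[str], timeout_seconds: float = 60.0) -> bool:
    """Return True if the command exits cleanly with no input available.

    stdin is connected to the null device, so anything waiting on a
    console prompt either fails straight away or hangs until the timeout.
    """
    try:
        completed = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,  # nobody is there to type an answer
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False  # still running: probably waiting on someone
    except OSError:
        return False  # the command could not be started at all
    return completed.returncode == 0


if __name__ == "__main__":
    # A job that needs nothing from the user.
    print(runs_unattended([sys.executable, "-c", "print('hello')"]))     # True
    # A job that stops to ask a question.
    print(runs_unattended([sys.executable, "-c", "input('Continue? ')"]))  # False
```

If the requester’s script passes a check like this on their own PC, it is at least a candidate for running on the build farm.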