Confessions of a Build Engineer

The first choice on any large project - the revision control system (RCS) - is often not even in your hands as a Build Engineer. Most often it's decided by the company or studio you're working for, for a whole multitude of reasons ranging from cost to efficiency, or even simply "it's what we've always used". With any luck, your revision control system is relatively mature and has been in use within your studio for a number of years before anyone even realised they needed a Build Engineer. Regardless, there are two main features that you need to rely on as a Build Engineer (obviously the primary functions of putting files in and getting them out again are a given, otherwise it wouldn't be of use to anyone at all, let alone Build Engineers).

The first is that the RCS allows you to retrieve any file or set of files at a specific version (date stamp, hash ID, sequential number, user-defined tag, or some other unique mechanism). The second is that you can determine whether anything has changed since you last looked. Let me know if you ever find an RCS that can't do both, as I'd imagine it would be fairly useless. Both of these features need to be accessible via some form of API, even if it's just a command line.
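In Perforce terms (the RCS we'll lean on for concrete examples later), those two operations might look something like the sketch below, driven from a batch file. The depot path is a hypothetical placeholder; substitute whatever your studio actually uses.

```batch
@echo off
rem A minimal sketch of the two operations every usable RCS exposes,
rem shown via the Perforce command line. //depot/MyGame/main/... is
rem a hypothetical depot path.

rem 1. Retrieve a set of files at a specific version (changelist 12345).
p4 sync //depot/MyGame/main/...@12345

rem 2. Determine whether anything has changed since we last looked:
rem "p4 changes -m1" reports the most recent changelist that touched
rem the path; the second token of its output is the CL number.
for /f "tokens=2" %%c in ('p4 changes -m1 //depot/MyGame/main/...') do set HEAD_CL=%%c
echo Head changelist is currently %HEAD_CL%
```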

So, what's left after that? Connectivity, bandwidth, reliability, uptime, the ability to query for specific metadata or subsets of assets, automated merging of branches, automatic or scheduled server mirroring and proxies. All of these things have an impact on Build Engineers, but they are usually in the hands of some form of infrastructure or IT engineer - not usually you or your team.

When your project becomes huge you start to incur a significant startup cost whenever a branch needs to be built. As I write this, a typical single branch of a large triple-A game weighs in at around 500GB, excluding raw art/audio assets (which are often huge). Even if you try to spread that load among several machines (set aside to perform the actual builds of the product in parallel), you can end up with a large collection of machines all trying to sync many GBs of code and data, typically all at once. This can have knock-on implications for your network bandwidth, the load on the RCS server itself and potentially any users trying to use that service at the same time. After the initial sync has completed things get easier, but any single check-in by a developer that dumps huge assets back into the RCS can cause similar, albeit probably smaller, delays again. There's nothing described here that couldn't be achieved by a batch file on a loop, and it's this perceived simplicity that often leads to the dreaded "it's just a batch script". So, for now, let's start with a simple first-pass version of that batch script.
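Here's what that naive first pass might look like - every path, command and interval below is a hypothetical placeholder, and the build step is hand-waved into a build.bat that stands in for your real pipeline:

```batch
@echo off
rem First-pass "it's just a batch script" build loop:
rem sync, build, deploy, wait, repeat - regardless of change.

:loop
rem Pull the latest code and data from the RCS.
p4 sync //depot/MyGame/main/...

rem Build the project (a stand-in for your real build pipeline).
call build.bat

rem Copy the output to the shared repository.
xcopy /e /i /y output \\fileshare\builds\latest

rem Wait 15 minutes and go again.
timeout /t 900
goto loop
```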

At first you have a system that is syncing and then building the project on a regular cadence, until you realise that what's actually happening is that every X minutes you're building the project, regardless of change. There are already two potential issues here.

The first is the perception that you are constantly building new versions of the project and deploying them to your repository (file share, artifact server, CD burner, whatever). Depending on how you identify those builds, you could be causing QA all sorts of headaches. You could ID your builds using the date/time stamp: every time you make a build, you copy it to your repository using the timestamp as its identifier. It might seem like a good starting point, but you could be burning hundreds of GBs in space and bandwidth copying what is effectively the same set of binaries to the repository under a different ID.

That's not a great idea, so maybe you choose to use some specific aspect of the RCS itself. At EA we used Perforce, which identifies every single check-in to the system with an ever-increasing changelist number (CL). Let's assume you're going to use that, so that every build you produce is identified in the final repository by its head CL (the version of the files that were retrieved) at the time the build started. The first build goes fine and copies to the repository as expected. By the time the next build triggers, no changes have been checked in, so the system sparks up and builds what it has again. Your sync step is really fast as there are no changes and, hopefully, your build is quick too, it being an incremental process with all your intermediates unchanged. But when it comes time to copy that to the repository, the CL already exists in the system. What do you do now?

You could just copy it over the top of the existing one, but what if QA has already grabbed that copy and is in the middle of testing it? You really shouldn't be copying over the top of that build - what if they declare it golden and that's the one to go to the consumer? Potentially you've just overwritten it with an untested version of the project. Ideally it should be identical, but maybe it isn't because of some unexpected aspect of your build system - maybe the virus checker on the build machine kicked in and locked a file, which meant the system couldn't read it properly when deploying the build. Maybe some other aspect of the build system itself isn't deterministic enough to guarantee that every build from the same starting point produces binary-identical end results. Depending on how you have constructed your 'copy to the repository' functionality, you either have a new build with a single old file in the middle of it, a build with a file missing, or possibly a completely different build altogether. None of that is good, so the game team requests that you only ever build each unique CL once. That should solve that problem, right?
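A first defensive step, still in our hypothetical batch script, is to make the deploy step refuse to touch a CL that has already been shipped to the repository:

```batch
@echo off
rem Sketch: deploy the build under its head changelist number, and
rem refuse to overwrite a CL that has already been deployed.
rem Paths and depot locations are hypothetical placeholders.

for /f "tokens=2" %%c in ('p4 changes -m1 //depot/MyGame/main/...') do set HEAD_CL=%%c

if exist \\fileshare\builds\CL%HEAD_CL% (
    echo CL %HEAD_CL% is already in the repository - leaving it alone.
    exit /b 0
)

xcopy /e /i /y output \\fileshare\builds\CL%HEAD_CL%
```

That protects whatever QA already has, but the machines are still burning time building the same CL over and over.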

So, we move to a system whereby we check the RCS first and only start a build if we can detect that there has been a change. That should solve the problem and prevent us from rebuilding something that's already been built. It also means hardware is only spinning when it needs to, not just because it can. But this leads us to our next potential issue. What if the build fails for a reason outside of your control - a network glitch, a build machine catching fire, or something else that wasn't the fault of you or the game team? The game team doesn't want to check in another change simply to force a new build to start, as they don't want to introduce more change and therefore more risk. Besides, if they introduce another change, QA may have to run a whole batch of tests anew, which could take many hours and add, at best, a day's delay to the release. So, the inevitable question comes…
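The change-detecting version of our script might look something like this - note that, honouring the game team's "build each unique CL exactly once" request, it records the CL as handled before the build runs, which is exactly the trap the next few paragraphs walk into:

```batch
@echo off
rem Sketch: only build when the head CL has moved since the last run.
rem The last-handled CL is persisted in a text file between runs.
rem All paths are hypothetical placeholders.

set LAST_CL=0
if exist last_built_cl.txt set /p LAST_CL=<last_built_cl.txt

for /f "tokens=2" %%c in ('p4 changes -m1 //depot/MyGame/main/...') do set HEAD_CL=%%c

if "%HEAD_CL%"=="%LAST_CL%" (
    echo No changes since CL %LAST_CL% - nothing to do.
    exit /b 0
)

rem Record the CL as handled *before* building, so it is never built
rem twice - even if the build fails part-way through.
(echo %HEAD_CL%) > last_built_cl.txt

p4 sync //depot/MyGame/main/...@%HEAD_CL%
call build.bat
xcopy /e /i /y output \\fileshare\builds\CL%HEAD_CL%
```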

"Can we just try and build that CL again?"

"Sure! Maybe. Well actually now that I think about it; I don't know."

Depending on how we implemented the game team's original request to never check for, or possibly even build, the same CL twice, we may be able to start the machinery again - but we might also have short-circuited the system to specifically disallow it. That takes time to unpick and resolve, even if we know what the fix is. I sure don't want to be changing that at the end of a long day (and years of development) when we're waiting on the final build that will potentially end up on hundreds of thousands of discs. And what happens if/when we've fixed the issue and we can press the GO button on the same CL? Is it guaranteed to be the same? It might be OK, but you have now also introduced a new element that wasn't in the system before: your 'fixes to the system itself'.

In my experience, dev teams like certainty and deterministic results more than anything else. They will shout about things being too slow or not transparent enough, but at the end of the day they will sacrifice those desires to ensure that the build is right. Sometimes it may be OK to just say, "You know, we could do that, but I can't guarantee it will be what you wanted it to be. You may have to do a whole load of testing again". The game team can then decide whether they are willing to take the risk, and will also know that extra diligence needs to be paid to certain areas during testing. Or maybe, just maybe, they'll squeeze in that one last fix to trigger the build - after all, they'll have to do all that testing again anyway. A sensible approach here could be a no-op check-in made simply to generate a new CL number, though that depends on your RCS and whether that's even possible. And let's be honest: in a truly deterministic system there is really no such thing as a no-op check-in, as every change has a consequence.
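One way to avoid unpicking the machinery under ship-day pressure is to build the escape hatch in from day one. A hypothetical sketch, extending our batch loop with an explicit override switch rather than hard-wiring the 'never twice' rule:

```batch
@echo off
rem Sketch: an explicit "force" switch that deliberately bypasses the
rem only-build-each-CL-once guard, so a failed CL can be re-run as a
rem conscious, recorded decision rather than a last-minute hack.

set LAST_CL=0
if exist last_built_cl.txt set /p LAST_CL=<last_built_cl.txt
for /f "tokens=2" %%c in ('p4 changes -m1 //depot/MyGame/main/...') do set HEAD_CL=%%c

if /i "%1"=="force" (
    echo FORCE requested - rebuilding CL %HEAD_CL% regardless of history.
    goto build
)
if "%HEAD_CL%"=="%LAST_CL%" exit /b 0

:build
(echo %HEAD_CL%) > last_built_cl.txt
p4 sync //depot/MyGame/main/...@%HEAD_CL%
call build.bat
```

Whether a forced rebuild of the same CL is then allowed to overwrite the repository copy is a separate conversation to have with the game team, well before the day you need it.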

Assuming we have agreed with our game team how we are going to handle all of these different circumstances, another question arises: where do we actually put our configuration? Strictly speaking, any changes to the system we are constructing should be revision controlled too, and it's highly likely that we will have more control over where that lives. Do we continue to use the same RCS as the game team, in the same location, or do we move to a separate server or system altogether? If we do move to a separate system, we now have the additional challenge of another piece of information describing the build that pops out the end of our pipeline: our build-system-specific CL. You may need to have a conversation with your game team about how they want the builds identified. It's perfectly possible that they don't want your ID anywhere near the build, because they want the build ID to represent the state of the game code/data at that point in time - in which case you only ID the build using whatever you previously agreed on. Sometimes it's also unpalatable that a change you made to fix an issue in your system triggers a new build of the game itself; this can lead to confusion about why there is a new build with a new ID number coming off the end of the pipeline when no changes have been checked in by the game team. Towards the end of a project, check-ins are often strictly limited to reduce risk, and you tweaking a value in your system that the game team haven't signed off on - regardless of whether you think it'll make a difference to the game build or not - can cause significant distress and confusion to all involved.
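One possible hedge - sketched here with hypothetical paths and depot locations - is to keep the agreed game-team CL as the public build ID, but quietly record both identifiers inside the build itself for later forensics:

```batch
@echo off
rem Sketch: the build is publicly identified only by the game team's
rem head CL, but a manifest inside it also records the CL of the build
rem system's own configuration.

for /f "tokens=2" %%c in ('p4 changes -m1 //depot/MyGame/main/...') do set HEAD_CL=%%c
for /f "tokens=2" %%c in ('p4 changes -m1 //depot/BuildSystem/...') do set TOOLS_CL=%%c

if not exist \\fileshare\builds\CL%HEAD_CL% mkdir \\fileshare\builds\CL%HEAD_CL%
(
    echo game_cl=%HEAD_CL%
    echo buildsystem_cl=%TOOLS_CL%
    echo built_at=%DATE% %TIME%
) > \\fileshare\builds\CL%HEAD_CL%\manifest.txt
```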

To end this segment, let's look at one final, last-minute scenario: it's the final day and you're expecting the shippable build of a multimillion-dollar production to roll out of the pipeline today. You and your team (and the game team, QA, the entire company) have spent somewhere in the region of two to five years (for several hundred people) waiting for this to happen, and someone on the game team just needs to check in one final change. The project rolls off the production line and is deployed to your repository at CL 12345678. Disaster! The project builds and initially looks great, but then QA discovers a game-breaking bug. It turns out that, for whatever reason, the game team needed that last build to have a specific flag or value set in the build system, but they totally forgot to tell any of the Build Engineers (don't laugh - it'll happen to you one day!). The fix is pretty easy, but we fall straight back into the trap of how that build is identified before we can even get the ball rolling. We're back in the same situation again, but this time the pressure is palpable, and you may need to hack in a fix that nobody likes or agrees on, but one that does at least work. Now hold that thought - we'll come back to it when we talk about archiving our game projects later.