One of the most important responsibilities a Build Engineer takes on is failure analysis.
There are a few things that need clarifying when we talk about failure analysis in the context of building large scale games.
Firstly we need to discuss the difference between failure detection and failure analysis. Now it sounds obvious to say it out loud, but failure detection is simply the act of discovering that something has gone wrong. In most build systems this is done by checking return codes from… well, everything! Any script that you run, any command line that you call, any executable you fire off will one way or another return an error code. If you’re lucky, it actually follows a convention and means something. Typically the convention will be something along the lines of a return code of zero meaning success and any other number meaning failure. Most publicly available CI/CD systems that you can use to construct your build system will automatically handle these results and report failure, or possibly terminate the build run for you, if a step returns a non-zero code.
When creating your pipeline you can take advantage of this convention to make sure that you are correctly catching errors and terminating a build if it’s not going to succeed.
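As a sketch of that convention, here's how a minimal pipeline driver might chain steps and bail out on the first non-zero return code. The step names and commands are made up for illustration; your real pipeline will have its own:

```python
import subprocess
import sys

def run_step(name, cmd):
    """Run one pipeline step; kill the whole build on any non-zero return code."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"[FAIL] step '{name}' exited with code {result.returncode}")
        sys.exit(result.returncode)  # by convention, non-zero means failure
    print(f"[ OK ] step '{name}'")
    return result

# Each step only runs if the previous one returned zero.
run_step("compile", [sys.executable, "-c", "print('compiling...')"])
run_step("package", [sys.executable, "-c", "print('packaging...')"])
```

Most CI systems do exactly this for you per job step, but it's worth having the same discipline in any glue scripts of your own.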
The second part of this aspect of the role is more complicated, and that’s failure analysis. When a step returns failure for whatever reason and terminates a build (presuming that’s what happens) it is normally the job of the build engineer to look at all the logs, output and return codes to work out what went wrong. This is usually a slow and difficult process, especially if you aren’t familiar with all of the code that’s actually being built (which is highly likely for a build engineer).
Initially, most failure analysis is done by looking through the logs and highlighting the lines just before the process terminated, in order to surface what the problem might be. The next phase is to use some form of pattern search, like regex, to pull out known failure patterns and display those to the user.
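A minimal sketch of that kind of pattern search might look like the following. The patterns shown are hypothetical examples, not a definitive list; real ones would come from your own toolchain's output:

```python
import re

# Hypothetical failure patterns; build these up from logs you've actually analysed.
KNOWN_PATTERNS = [
    (re.compile(r"error C\d{4}:"), "msvc-compile-error"),
    (re.compile(r"fatal error: .+ file not found"), "clang-missing-header"),
    (re.compile(r"LINK : fatal error"), "msvc-link-error"),
]

def analyse_log(log_text):
    """Scan a finished build log; return (pattern name, offending line) pairs."""
    hits = []
    for line in log_text.splitlines():
        for pattern, name in KNOWN_PATTERNS:
            if pattern.search(line):
                hits.append((name, line))
    return hits
```

The important property is that this runs over a *finished* log, after the build has already failed, for the reasons discussed next.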
There is a risk here that I want to bring to your attention though. When building up your analysis system, it’s important to be clear about the differentiation between detection and analysis. When you start putting in analysis of console output, in my experience it’s always better to let the build fail and then do the analysis, rather than trying to do the analysis while the build is running. The reason is fairly simple: if you have analysis that looks for specific keywords or log output that looks a particular way, then you are at risk of your analysis terminating the build unnecessarily. I’ve seen enough systems interrupt a perfectly successful build and terminate the process just because the word ERROR turned up somewhere in a piece of output. You will also get pressure from others to detect certain ‘early signs’ that a build might fail and kill it early. It’s up to you as the engineer in charge of the system whether you implement that early-out or whether you let the build fail at its normal death point.
So let’s assume that you’ve got an analysis system in place but it’s currently virgin; it doesn’t contain any regex patterns or other failure definitions… yet. A build is running and it’s about to fail, let’s watch it now… here comes the output, there’s some stuff in there that looks familiar aaaaaand… yep, it’s failed. It looks like somewhere in the code some pesky game engineer did a naughty and checked in something that was never going to even compile. Who has ownership of the failure? Sorry, but I’ve got news for you - it’s most likely to be you. Now before you throw your hands up in the air and tell me I’m stupid, let me explain. There are a million different things that can go wrong during a build and stop a studio in its tracks. I contend that all unidentified errors are the build team’s responsibility to own, at least until the error is analysed and a mechanism for automatic identification has been set up. During this identification and classification, there are normally a few default categories that you can rely on.
These are:

- Unclassified
  - Any error that is being seen for the first time and has yet to have a classification set for it. The Build Team is responsible for identifying who actually owns the problem and setting up analysis that auto-classifies the error and assigns the correct responsibility.
- Build Team responsibility
  - Any error that can be placed on the build team to fix. These are normally errors in the actual build scripts or the system itself. Errors here tend to be more in line with configuration errors when setting up jobs or infrastructure related to the build system, and not actual code compilation errors.
- Game Team responsibility
  - Any obvious error introduced in the game code or data itself. Normally compilation or asset build issues. It should be fairly obvious what the issue is, as compilers are pretty good at telling you where things have gone wrong, but game devs don’t like looking through thousands of lines of output, so it’s up to you to identify these errors and make them easy to spot in future.
- Infrastructure failures
  - The grey area that exists in game dev. These are usually transient errors that can happen and resolve themselves without full investigation or intervention. There can be many causes, but they are typically things like network glitches, lag and delays that cause other components to time out. Sometimes the responsibility here lies with the Build Team, but often it lies with the studio IT team instead. On rare occasions it can lie with the Game Team, especially if it’s a system they’ve created as part of the build process (these are often data-related compilers or compressors). These tend to be the areas where requests for ‘automatic retries’ come in. Resist these requests as much as possible, as you’re simply hiding another issue that may bite harder later. Auto-retries should only be implemented when the issue itself is relatively well understood and retrying is deemed the most sensible option.
When a failure is seen for the first time, it will most likely fall to the Build Team to categorise it and hand responsibility for the fix to the right team.
As Build Engineers we also (generally) have two main responsibilities: building the full and final product that ships, and providing the Game Team with continuous feedback on the state of the code. The types of failure analysis that you implement in each of these build types will depend on you, the team you’re supporting and the type of project you’re building. I would be more willing to implement a system that does live console output analysis, and could potentially stop the build and terminate processes early, within code validation builds than within full shippable builds. My main reason is that if an engineer is waiting to see whether their code changes are good, then it makes sense to catch those problems as early as possible, especially if a small failure early on won’t be reported until the rest of a 20-minute process has finished.
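If you do decide to implement that early-out for validation builds, one possible sketch is to stream the build's output live and terminate the process at the first line matching a known-fatal pattern. The pattern below is a placeholder; choose yours very carefully, for exactly the false-positive reasons covered earlier:

```python
import re
import subprocess

# Placeholder fatal pattern - tune this to lines that are *unambiguously* fatal.
FATAL = re.compile(r"error C\d{4}:|fatal error")

def run_validation_build(cmd):
    """Stream a validation build's output; terminate at the first fatal line.

    Reserved for fast code-validation builds, never full shippable builds.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        print(line, end="")
        if FATAL.search(line):
            proc.terminate()  # fail fast so the waiting engineer hears sooner
            proc.wait()
            return False
    return proc.wait() == 0
```

The trade-off is exactly the one above: a faster signal for the engineer, at the cost of a pattern that can kill a healthy build if it's too loose.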
Finally, discovering and identifying the failure is only a small part of the resolution here. The last thing that you need to do is work out how to get that information to the right person as quickly and efficiently as possible. Again, this part differs depending on your team makeup and the studio you work at. Your team may also be responsible for building dashboards that display this information, or messaging systems that get this failure information where it needs to be. Another thing that you will probably be responsible for is some form of trend spotting. As you identify each problem and categorise it, you should also be logging that instance into some form of database or other store that allows you to track how often you see each issue. This will let you spot things that you might not see on an instance-by-instance basis - maybe late every night you see a specific network glitch that needs investigating. Maybe there’s that one engineer who gets in super early and checks in a breakage while half asleep, every Monday morning. Once you can engage in some sensible trend spotting, you can potentially predict issues or start working out how to fix root causes.
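As a sketch of that kind of trend store, here's a minimal SQLite-backed log of classified failures with a query for the most frequent offenders. The schema and names are assumptions, not a prescription; any database your studio already runs would do the same job:

```python
import sqlite3
from datetime import datetime, timezone

def open_store(path=":memory:"):
    """Open (or create) a tiny failure-trend store."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS failures (
        seen_at TEXT, job TEXT, category TEXT, pattern TEXT)""")
    return db

def record_failure(db, job, category, pattern):
    """Log one classified failure instance with a UTC timestamp."""
    db.execute("INSERT INTO failures VALUES (?, ?, ?, ?)",
               (datetime.now(timezone.utc).isoformat(), job, category, pattern))
    db.commit()

def top_offenders(db, limit=5):
    """Which failure patterns show up most often? Raw input for trend spotting."""
    return db.execute("""SELECT pattern, COUNT(*) AS n FROM failures
                         GROUP BY pattern ORDER BY n DESC LIMIT ?""",
                      (limit,)).fetchall()
```

Once the data is accumulating, the nightly network glitch or the Monday-morning breakage shows up as a spike you can actually point at.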
At the end of the day, failure analysis is a fundamental part of being a Build Engineer, even if your main goal is to hand that failure on to someone else as quickly and efficiently as possible.