Confessions of a Build Engineer

At some point in this process you will probably come to the decision that having your actual infrastructure and/or configuration defined as revision-controlled code might also be a sensible pattern. There are many reasons to do this, but for us the most important has often been the speed of disaster recovery. We should aim to have as much of our system defined in code as possible, primarily to help in those unpleasant situations where it has all gone wrong.

There have been more than a few occasions where, for some unknown reason, a build machine has eaten itself and left the system failing. The best thing to do is to get that machine fixed and operational again so that we can continue to process builds as fast as possible. But wait, there's a problem! We've already discussed the desire not to introduce unknowns into the build system, and to revision control as much as we can so that we can define exactly what a build is when it pops out of the pipeline. When a part of the machinery goes wrong it's super important to fix it and understand what the issue was, but if we leave it in a state where a human being has been tinkering with it, our revision control guarantee is null and void. No matter how many times we sync from the RCS in the future, we have infected the machine with an unknown number of human clicks, button presses and command-line executions made to resolve the original problem, and we can no longer trust that machine to be in a valid state for the rest of its life.

OK, so what about backups? Backups are great! As a last line of defence against having nothing at all, you should always have backups, but they would not be my first go-to for a resolution. As mentioned, our build machines are huge beasts taking many hundreds of GBs of disk space purely to sync the game source and base assets, before we've even tried to execute any compilation or asset builds on them. Backing those up on a regular basis is costly in both time and storage. Retrieving a backup can also take a long time, and the older it is the longer it takes, because backups migrate to slower storage as they age (you are using a reliable tiered backup solution, right?).

Don't rely on backups to quickly get you functional under strained and difficult conditions. Under these circumstances, moving as much of your system as possible to configuration as code appears to us to be the right approach, for a number of reasons. There's a fairly famous phrase in DevOps communities which goes something along the lines of "Treat your infrastructure as cattle, not pets". What this means is that you should be able to kill parts of your infrastructure as quickly as necessary and replace them with identical copies with little to no effort or, if possible, human intervention. For us, if a machine goes bonkers, we should be aiming to disconnect it from the system and spin up a replacement relatively quickly, knowing that it is the same as the one we're replacing (or marginally better, with the original bug fixed). We can keep the original machine around for diagnostic purposes and use that information to pre-emptively fix our scripts, so that the next time a machine is created it is free of the original bug. Once we have done that, we can kill the troublesome machine. At the moment we must still accept that once the machine is ready there is a lengthy initial sync to bring all the game source back down ready for building, but there are potentially ways to reduce that too.
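
To make that concrete, here's a minimal sketch of the "replace rather than repair" flow using VMware PowerCLI (more on our use of the VMware APIs later). The server, template and machine names are invented placeholders, and taking the node offline in Jenkins is left as a comment, so treat it as an illustration of the shape of the process rather than our actual scripts.

```powershell
# A minimal "cattle, not pets" sketch using the VMware.PowerCLI module.
# All names here are hypothetical placeholders.
Import-Module VMware.PowerCLI

$brokenName  = 'build-agent-07'     # the machine that has eaten itself
$replacement = 'build-agent-07b'

Connect-VIServer -Server 'vsphere.example.local'

# Take the sick machine out of rotation but keep it running for diagnosis.
# (Marking the node offline on the Jenkins master is omitted here.)
$broken = Get-VM -Name $brokenName

# Spin up a replacement from the revision-controlled template, so we know
# exactly what state it starts life in.
$template = Get-Template -Name 'build-agent-template-v42'
$vmHost   = Get-VMHost | Sort-Object MemoryUsageGB | Select-Object -First 1

New-VM -Name $replacement -Template $template -VMHost $vmHost |
    Start-VM

# Once diagnostics are done and the template/scripts have been fixed,
# the troublesome machine can be destroyed:
# Stop-VM   -VM $broken -Confirm:$false
# Remove-VM -VM $broken -DeletePermanently -Confirm:$false
```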

Putting in place a known pipeline for the creation of your actual infrastructure should also free you from the mundane but critical disaster recovery scenario and allow you to focus on improvements to smaller subsections of your pipeline. Once you can go from nothing to a fully created machine that's ready to do work, you can move on to the next thing. Maybe you want to map out what that looks like for a whole branch of your product and define a layer that creates a standard collection of machines in different configurations, ready to work on a single problem space for you. Maybe you know that for a single branch of your game you require (for example) a Jenkins master, 10 build machines of varying power and configuration, an Elasticsearch box for logging information and metadata, and any number of other useful tools. As you solve the problem of creating one machine and move on to banks of machines in a single collection, other issues emerge: you have all these machines created, but now it would be really handy if they could automatically connect themselves to the Jenkins master, for example. As you solve each problem with a defined, automated solution you improve your ability to react to unexpected requests. "Team C now needs a set of automation spun up on their branch ASAP!" Easy: add Team C's branch to the list of branches that you support and let your infrastructure spin itself up while you get on with something else.
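
As a sketch of what that could look like, the following hypothetical branch definition is just data: a provisioning layer can loop over it and create whatever each branch needs. The branch names, machine counts and the New-BranchInfrastructure stub are all invented for illustration.

```powershell
# Stand-in for whatever actually creates the machines (PowerCLI, Terraform, ...)
# and joins the agents to their Jenkins master.
function New-BranchInfrastructure {
    param([string]$Branch, [hashtable]$Spec)
    Write-Host "Would create for ${Branch}: master=$($Spec.JenkinsMaster), agents=$($Spec.BuildAgents), ES=$($Spec.Elasticsearch)"
}

# Hypothetical per-branch infrastructure definitions, kept in revision control.
$branches = @{
    'game/main'   = @{ JenkinsMaster = $true; BuildAgents = 10; Elasticsearch = $true }
    'game/team-c' = @{ JenkinsMaster = $true; BuildAgents = 4;  Elasticsearch = $false }   # "Team C needs automation ASAP"
}

foreach ($branch in $branches.Keys) {
    New-BranchInfrastructure -Branch $branch -Spec $branches[$branch]
}
```

The point is less the format than the fact that "Team C's branch" becomes one more entry in a file under revision control, rather than a day of manual machine setup.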

This all sounds like a fairly rosy world to be living in, and it would be if we were actually there. Unfortunately the complexities within our games, and the differing levels of physical infrastructure at different sites, can sometimes mean that we have to ditch a system and start again, or take an existing system from another build team that may or may not work as described on the imaginary box. Switching from supporting one game to another is also hugely disruptive and in the past has meant that we have needed to shift focus or start a new system to support it and its differences. This is worse if we're moving on to support another studio on a project and we're coming to it midway through. In a perfect world this shouldn't be the case. There is a (perhaps reasonable) perception that if all our games are built off the same game engine, then building games based on that engine should be a solved problem too. In theory this is correct: the basics of building the games are generally aligned. But no game is identical to the last one, and there are always special requests that mean you have to adjust the way you're doing things to such an extent that you can't necessarily reuse the system you had before. The engine is also always improving, which means that at some point it gets upgraded and all of a sudden the game doesn't build anymore, because a command has changed or, worse, been removed, or because a fundamental piece of your pipeline has been deprecated (your version of MS Visual Studio, for example).

There are a few systems out there that allow for Configuration as Code, and organisations worldwide have been using these approaches for years. Upon investigation, these systems are often built around one of a few workflows: either a virtual machine is created on the fly, builds the product and then destroys itself again, or any number of virtual machines are connected to a system and, at any point, any one of those machines can be tasked with building the product. Often these approaches work fine for a lot of the smaller projects we've seen them used for (small Java applications, tools, product plugins, large products that are mainly built up of many microservices, etc.) but they don't function too well when your initial cost is a sync of several hundred GBs instead of a source base measured in MBs. We are still relatively early in this process, having worked with our own home-brewed version of configuration as code for quite a long time, and we're always trying to improve the efficiency or dynamism of our systems. With this in mind, we are beginning to use some of the public tools that are available to do this job; after all, why reinvent the wheel when tried and tested solutions already exist?
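
A quick back-of-envelope calculation shows why the throwaway-VM-per-build model struggles at this scale; the 500 GB workspace and the saturated 1 Gbps link below are assumed figures for illustration, not measurements from our systems.

```powershell
# Rough, assumed numbers: a 500 GB initial sync over a saturated 1 Gbps link.
$workspaceGB = 500
$linkGbps    = 1
$seconds     = ($workspaceGB * 8) / $linkGbps      # GB -> gigabits, then / Gbps
"{0:N1} hours just to populate a fresh machine" -f ($seconds / 3600)
# Over an hour in the best case, before any compilation or asset builds -
# fine to pay once per machine, painful to pay on every single build.
```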

For quite a while now we've been achieving good results with sets of PowerShell scripts (remember: "it's just a batch script") running in a variety of Jenkins jobs, each with a defined role, working in conjunction with the VMware APIs to access our vSphere environment and build the machines or infrastructure we need. However, as time goes on, these scripts become bigger and more cumbersome, with more special-case handling and more specifics tied to our hardware or infrastructure setup. This makes it hard to hire talented people, as there is a large custom layer involved in which applicants will naturally have no experience. It also makes the simple act of moving to another team to help on a different project hard for the same reasons.
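
The fragment below is an invented example of the kind of special-casing that accumulates in scripts like these; none of the site names, datastores or values are real, but the shape will look familiar to anyone who has had to maintain them.

```powershell
# Hypothetical site-specific branching that creeps into home-grown
# provisioning scripts over time.
$siteName = 'StudioSouth'

switch ($siteName) {
    'StudioNorth' {
        $datastore   = 'fast-ssd-01'     # only this site has the SSD array
        $networkName = 'BuildVLAN-12'
    }
    'StudioSouth' {
        $datastore   = 'bulk-sata-03'
        $networkName = 'BuildVLAN-7'
        $cpuCount    = 8                 # older hosts, fewer cores per VM
    }
    default { throw "Unknown site '$siteName' - add another special case here..." }
}

"Provisioning against $datastore on $networkName"
```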

By moving towards some of the more publicly available, industry-standard tools (Packer, Kubernetes, Docker, Chef, Ansible, Terraform, etc.) we aim not only to use tried and tested solutions, but also to give ourselves the ability to hire people with the knowledge and experience we need, so that they can start their careers here from a position of knowledge and contribute much more easily than they could if we had to teach them all of our special cases and custom software. This is often slow going, involving small teams who must still maintain the existing systems for the games that they support, whilst trying to find breakout time to innovate with and investigate new (well, new to the team anyway) technology. One of the ways we have tried to address this is by joining up with our sister teams in other studios and sharing the load. Over the last few years we have made massive strides in collaborative working with those teams and in solving joint problems together, but I'm sure there is still a long way to go. Often we were hampered by the aforementioned differences in physical infrastructure, but these are problems that can be worked around as the main bulk of the solutions come online. At the end of the day, a lot of this comes back to the age-old problem discussed at the start of this series: human communication and effective sharing. I'll also cover some of these issues when I talk about having globally distributed teams.