Confessions of a Build Engineer

Back in 2015/16 we were working on Star Wars Battlefront 2 across four locations. The lead studio was DICE in Stockholm, Sweden, with elements of Ghost in Gothenburg, Sweden, also contributing. Starfighter combat was being produced by Criterion in Guildford, UK, and the single player campaign was being developed by Motive in Montreal, Canada.

This was an extremely challenging project to be working on for a whole bunch of reasons (not least the initial loot-box game design shenanigans), but it's significant to me for two in particular.

Firstly, it's where I started taking on more responsibility and management work instead of just being an engineer trying to get stuff done. This is where our initial thoughts around how to structure and manage our build team started to take root, and it's the group that eventually formed Team Cobra - the one team that took on Build and Release duties for all DICE, Criterion and Ghost games by the start of 2018.

Secondly, it's where we made our first big decision about how we would operate in the future. The decision we made as a team was to move to one tech stack. At that point it didn't mean one solution to rule them all; we still had individual Jenkins instances in each of the on-prem locations, but we had decided that we were going to settle on Jenkins as our chosen solution. All the scripts that ran our builds were stored in the same place and would all be written in Python (we had a mix of Python and PowerShell, basically split by site), and we started reworking a lot of the built-in logic so that it utilised a location modifier that could be set in the main Jenkins configuration for each site. This meant that, at least to look at, all our Jenkins jobs were basically doing the same thing, just with little tweaks here and there like the name of a network share, or the service account used to log in to systems. Making these changes meant that we could actually resolve issues on the other side of the world, because their system looked pretty much like ours.
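Concretely, the location-modifier pattern looked something like this - a minimal sketch, with made-up site names, shares and service accounts rather than anything from the real configuration:

```python
# Minimal sketch of per-location configuration. The LOCATION value,
# site names, shares and accounts here are illustrative only.
import os

# Each Jenkins instance sets a single LOCATION value (e.g. a global
# environment variable); everything site-specific hangs off it.
SITE_CONFIG = {
    "stockholm": {"share": r"\\sto-builds\drops", "svc_account": "svc-build-sto"},
    "guildford": {"share": r"\\gld-builds\drops", "svc_account": "svc-build-gld"},
    "montreal":  {"share": r"\\mtl-builds\drops", "svc_account": "svc-build-mtl"},
}

def site_config() -> dict:
    """Resolve settings for the location this Jenkins instance runs in."""
    location = os.environ["LOCATION"]  # set once, in the Jenkins config
    return SITE_CONFIG[location]

# Every build script asks for its settings the same way, so the job
# logic is identical everywhere; only the lookup result differs.
cfg = site_config()
print(f"Publishing to {cfg['share']} as {cfg['svc_account']}")
```

Because every script resolved its settings through the same lookup, a job that worked in Stockholm would run essentially unmodified in Montreal.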

The other thing this meant was that it forced us to always consider all the other locations when making changes to the system. Although it still happened on occasion, we had to get out of the mindset of "I'll just fix it for now and Montreal can handle their side when they wake up", because we were all using the same code and scripts. Ensuring everyone was on a level playing field and that each location was treated as a first class citizen in the system made a huge difference to the way we worked. Sometimes we'd come in on a Wednesday morning and see that there had been a whole bunch of build failures overnight, but that they had magically been resolved at some point and things had just started working. When we dug in, it became apparent that one of the team in Motive had fixed an issue, which meant that we started the day from a good baseline. In return we'd often fix things that, whilst not actually broken anywhere, gave one of the other locations a little performance boost.

Another reason to behave like this is that it leads you to build more robust and reliable patterns. The teams in Sweden generally had much faster hardware than the teams in the UK or Canada, especially when it came to copying data to a network share for distribution. This meant we had to get creative with our solutions to make sure that race conditions couldn't bite and that the slower sites weren't holding back the faster ones - all whilst keeping throughput and T2D as good as they could be in each location.
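One of the patterns that falls out of this is to never copy a build directly into its final location on the share: stage it under a temporary name, then make it visible with a single atomic rename. A rough sketch of the idea, with hypothetical paths and layout:

```python
# Sketch: publish a build to a network share without ever exposing a
# half-copied directory to consumers. Paths and layout are illustrative.
import shutil
from pathlib import Path

def publish_build(local_dir: Path, share_root: Path, changelist: int) -> None:
    final = share_root / f"CL{changelist}"
    staging = share_root / f".incoming-CL{changelist}"

    if final.exists():
        return  # another (faster) site already published this CL

    # Copy into a dot-prefixed staging directory that consumers ignore.
    shutil.copytree(local_dir, staging)

    # A rename on the same volume is atomic, so consumers either see
    # the complete build or nothing at all - never a partial copy.
    staging.rename(final)
```

With this shape, a faster site publishing the same CL first becomes a no-op rather than a collision, and nobody ever picks up a half-copied build.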

A lot of this stuff these days is handled more simply, by building your infrastructure in the cloud (to be fair, you still have to worry about locations and putting infrastructure close to your workload, but you don't really have to worry about whether you actually have infrastructure there). But back on SW: BF2, we were only just starting to explore pushing some of our workflows into the cloud (6. Build Service Patterns / Preflights).

Even if the actual workload is in the cloud, there are still a lot of logical issues that you need to deal with. When dealing with global teams, downtime and maintenance windows just aren't suitable - it's always the middle of somebody's working day. You need to find a way to make that maintenance happen without disrupting your general day to day work (5. Scheduling & Orchestration / No Maintenance Windows).
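One way to do that is to never take the whole system down at all, and instead drain one agent at a time: mark it offline so it picks up no new builds, let the running ones finish, service it, and bring it back. A sketch against Jenkins' standard REST endpoints - the URL, credentials and the do_maintenance() hook are placeholders:

```python
# Sketch: rolling maintenance by draining one Jenkins agent at a time.
# JENKINS, AUTH, node names and do_maintenance() are placeholders.
import time
import requests

JENKINS = "https://jenkins.example.com"
AUTH = ("svc-build", "api-token")  # service account + API token

def do_maintenance(node: str) -> None:
    """Placeholder for the actual patching/reboot work."""

def drain_and_service(node: str) -> None:
    # Mark the node temporarily offline so it accepts no new builds.
    requests.post(
        f"{JENKINS}/computer/{node}/toggleOffline",
        params={"offlineMessage": "rolling maintenance"},
        auth=AUTH,
    ).raise_for_status()

    # Wait for in-flight builds to finish rather than killing them.
    while True:
        state = requests.get(
            f"{JENKINS}/computer/{node}/api/json", auth=AUTH
        ).json()
        if state["idle"]:
            break
        time.sleep(30)

    do_maintenance(node)

    # toggleOffline flips the flag, so calling it again brings it back.
    requests.post(
        f"{JENKINS}/computer/{node}/toggleOffline", auth=AUTH
    ).raise_for_status()
```

Run across the fleet one node at a time and capacity dips slightly, but no site ever sees the system as "down".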

There are also issues with initial sync times when you need to spin up new build infrastructure. Syncing a huge code base is bad enough at the best of times, but when you potentially have machines all over the world syncing, you need to start looking into ways to speed that process up, or at least take the load off the servers responsible. In our case we used a combination of Perforce Proxies and Edge servers in the different datacenters that EA had around the world, and utilised scheduled replications between them.
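As an illustration of the proxy side of this, you can pre-warm a site's proxy cache on a schedule so that the first machine to sync in the morning isn't the one paying the WAN cost. A sketch using Perforce's -Zproxyload option for prepopulating a proxy cache (the proxy address, client name and depot paths here are made up):

```python
# Sketch: scheduled cache pre-warming against a site's Perforce proxy.
# Proxy address, client and depot paths are illustrative.
import subprocess

PROXY = "p4proxy-gld.example.com:1666"
PATHS = ["//game/main/...", "//game/tools/..."]

def warm_proxy_cache() -> None:
    for path in PATHS:
        # -Zproxyload pulls file content into the proxy's cache
        # without writing it into a local workspace.
        subprocess.run(
            ["p4", "-p", PROXY, "-c", "proxy-warm-client",
             "-Zproxyload", "sync", f"{path}#head"],
            check=True,
        )

# Run from a nightly scheduled job at each site, timed to finish
# before the local working day starts.
if __name__ == "__main__":
    warm_proxy_cache()
```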

Even with all the locations effectively having their own builds (albeit running off the same scripts and job definitions, using the same RCS source), different builds would become available in different sites across the world. A build would be kicked off on a certain CL in one location but, because another location was still (more slowly) building a previous CL, the sites would drift out of sync with each other regarding which builds were available. So we also had to transfer complete final builds back and forth across the sites to make sure that QA in each site always had the same set of builds available in order to test and validate the state of the game.
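The transfer itself can be as simple as a periodic reconciliation loop: diff what the local site has against every other site and pull across whatever finished builds are missing. A rough sketch, assuming a hypothetical share layout of <share>/<CL>/ with a DONE marker written once an upload is complete:

```python
# Sketch: keep each site's build share in step with the others by
# copying over any finished builds it is missing. Shares, layout and
# the DONE-marker convention are assumptions for illustration.
import shutil
from pathlib import Path

SITES = {
    "stockholm": Path(r"\\sto-builds\drops"),
    "guildford": Path(r"\\gld-builds\drops"),
    "montreal":  Path(r"\\mtl-builds\drops"),
}

def finished_builds(share: Path) -> set[str]:
    # Only count builds whose DONE marker exists, i.e. fully uploaded.
    return {p.name for p in share.iterdir() if (p / "DONE").exists()}

def mirror_missing_builds(local_site: str) -> None:
    local = SITES[local_site]
    have = finished_builds(local)
    for site, share in SITES.items():
        if site == local_site:
            continue
        for build in finished_builds(share) - have:
            # Stage then rename - the same atomic-publish trick as above.
            staging = local / f".incoming-{build}"
            shutil.copytree(share / build, staging)
            staging.rename(local / build)
            have.add(build)
```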

Quite possibly we could have built everything centrally and only distributed the final builds, instead of building in multiple locations. Initially that was the position we were in, but it became apparent very quickly that although we were all working on the same game, each studio was very clearly working on its own bit (single player, starfighter, multiplayer) and had its own focus and workflows for its iterative reviews. In the end it was agreed that every site would be the master of its own builds and requirements, but that those things would be supported by a central job structure and configuration that was available everywhere.

But it's not just technology issues you face when dealing with a team spread around the world. As I've mentioned before, communication is one of the hardest things to get right, and we initially tried to handle it the way most teams do - by booking in meetings that we expected people to turn up to. But when you have teams on either side of the Atlantic this is never really a great solution, as someone is always staying late or getting up early. We did maintain a few of those meetings, but we dropped the cadence a lot. We relied on tools that, post-pandemic, seem commonplace - Slack and Zoom being the most obvious and heavily used. But we also made sure that we budgeted for face to face. When we got the whole team together, including all the build engineers from all four sites in one place (I can't actually remember whether we met up in Guildford or Stockholm), it was truly inspirational to have a room full of people who knew exactly who we were, what we were trying to achieve and why, and, more importantly, how we were going to do it. Gone were the "It's just a batch script" conversations, and we were able to spend a good week together digging into the details. A lot of problems were solved that week, but we'd also achieved something more important. We spoke face to face. We shook hands. We went out to dinner. We built a team.

With all of the chaos that surrounded SW: BF2, the release itself was actually pretty good. Sure, there were some difficulties, but nothing show-stopping, and in the end I think the game itself turned out great, even if it had to stumble a little to get there.

At the end of 2018 a bunch of our team met up in Stockholm for a pre-Christmas get together. I took the DVD of "Hidden Figures" with me (a really great film, I'd recommend it to everyone) and we'd planned to have a movie night in the office. For some reason it took us about 20 minutes to get the right combination of TV inputs and consoles working in the big meeting room at the very top of DICE's office, but when we did we all sat down with nibbles in a dark room with a huge screen and watched the film. I'd seen it many times before, but quite a few of my team hadn't. At the end, some of them went out of their way to thank me for bringing it, saying it was a great film with many important messages (I agree). Now I may be reading too much into it, but I feel that the team really bonded during that film night.

Roll on a further year and I was in Stockholm again for another catchup. Little did we know that this was probably going to be the last time we’d all sit in a room together for about 2 years. As I’m sure you’re aware, the pandemic hit in March 2020 and EA sent us all home immediately and basically said “We’ll figure it out from here”.

I need to be very clear here - I think the way EA handled the pandemic was fantastic. As a company they didn't have all the answers, but they basically told everyone to go home and be safe, and said they'd work out how we'd all actually get back to work as we went along.

My team, however, didn't struggle too much (apart from the personal and family stuff that came along with the pandemic). As far as work was concerned, we'd already been operating as a remote team for quite a while, and we'd even shipped a few large, important games with that process. So when the pandemic finally landed for most of us, it was business as usual - just with a different background on Zoom. And thanks to the work we'd put in building the team up and making sure it bonded well, inviting everyone into each other's homes over video felt like the final step to being a globally distributed team. I'm not with that team anymore, but I hope they're all doing fantastically well, whatever it is they're doing now.

So what's the most important message here? Why talk about global distribution of teams at all? I wanted to talk about this to cover two main aspects that are equally important. Yes, it's important that you get your technology right and that you're servicing 'the system' as best you can; there are a bunch of good patterns and anti-patterns involved in doing that. But the two most important things are:

  • Who are you building it for? - As a build team, where are your 'customers' and how do you provide for them in the best way that you can?
  • Who are you building it with? - As a build team, the way you operate with each other is equally as important as, if not more important than, the way you interact with your customer teams. Allowing everyone on the team to operate to the best of their ability, regardless of circumstances, is massively important. Everyone must grow and achieve, but we need to show patience and compassion too - something which studios and publishers sometimes (but not always) forget.