Build systems are often very big and very complicated. Even if you’re paying attention and have all of your processes in place to stop you checking in nonsense, you can still screw up your system in unexpected ways. Often this happens with the best of intentions and under the auspices of ‘improvements’. Even more often it’s because one improvement to the system actually does what it’s intended to do and fixes an issue… but that just exposes the fact that the next part of the process, or the next job, has a previously unknown weakness that was only hanging on by a thread in the original use case.
When you follow all of the advice I’ve given in the previous articles and start seeing your build farm as individual blocks of an overall system, it’s actually easy to forget that you still need to take the system as a whole and remember the end goal: getting the builds out as fast and as reliably as you can. Accidentally introducing a problem during optimisation is nothing new, but in systems of the size and scope we sometimes have to deal with, it can be surprisingly difficult to understand what’s going wrong.
I have two examples of issues that were thrown up by improvements made to our build systems on various projects, to illustrate the sort of problems that can be caused. In hindsight they both come down to the fairly common programming pattern of running things in parallel and causing yourself threading issues, although the symptoms and the underlying causes were different.
Issue 1
One build system I worked on would produce builds way in excess of 100GB per SKU. These would then get copied on to the network for QA to take and test, as well as being submitted to the autotest system. At some point someone looked at how we might make this delivery a bit faster (studios will always ask for this to be faster; get used to it!). On investigating the throughput of the build machines, it seemed that when they copied their built binaries to the network share (using Windows’ robocopy tool) the copy was running single-threaded and actually had capacity to spare. So the decision was made to add the /MT flag and run the copy multithreaded. This seemed to work in testing, and all the machines copying to the network were pushing their builds out much quicker. We even looked at the network throughput to make sure that we weren’t overloading the network capacity too much.

Everything looked good for a little while, but after about a week IT asked us to look at it again, as people in the building were complaining that access to boring things like legal documents and accounts spreadsheets was getting stupidly slow. It turned out that although our build machines were capable of copying everything to the network share quickly enough, as the builds landed and were submitted to the autotest system, that system was then trying to deploy them all to the console test farm, and it was actually this that was flattening the network. Now, the key thing here was that our test farm was on an isolated network, separated so it couldn’t impact the other networks around the building (that was our first thought), and our network share was virtualised so we were apparently on a different drive to the rest of the building. But when IT investigated the actual throughput of the NAS, it turned out we were killing the read ops for the rest of the building because our builds were basically reading very large (multi-GB) individual files. This was happening so much during the deploys that when someone tried to open a 3KB spreadsheet, the scheduling that needed to occur in the NAS just kept getting deprioritised, and people could find themselves waiting 45 minutes for that spreadsheet to open. Back in those days all the drives were still spinning metal, so the more we wrote and the more we tried to read, the slower everything else got.

We slowly cranked the multithreading back down on our robocopy deploys: although our part of the process was keeping up fine, unbeknownst to us the rest of the building was suffering. Because most of the people affected were non-technical, a lot of them just assumed there was an issue that would resolve itself, and it took a week before anyone even mentioned it to IT. It took a further week or two for people to put two and two together and work out that it was us, especially since our NAS at the time was already an old system and didn’t really come with enough tooling to let IT see where the problem actually was; instead it pretty much said “everything is A-OK!”.
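To make that knob concrete, here’s a minimal sketch of the kind of deploy step we were tuning, written as a Python wrapper purely for illustration; the paths, function name and retry settings are made up, but the /MT value is exactly the setting we ended up dialling back down:

```python
import subprocess

def deploy_build(source: str, dest: str, threads: int = 8) -> int:
    """Copy a finished build to the network share with robocopy.

    `threads` maps to robocopy's /MT flag. Keeping it modest avoids
    saturating shared storage even when the build machine itself still
    has copy capacity to spare.
    """
    cmd = [
        "robocopy", source, dest,
        "/E",              # include subdirectories, even empty ones
        f"/MT:{threads}",  # multithreaded copy - the knob we had to dial back
        "/R:2", "/W:5",    # a couple of retries rather than robocopy's defaults
        "/NP",             # no per-file progress spam in the build log
    ]
    result = subprocess.run(cmd)
    # Robocopy exit codes 0-7 indicate varying degrees of success; 8+ are failures.
    if result.returncode >= 8:
        raise RuntimeError(f"robocopy failed with exit code {result.returncode}")
    return result.returncode

# e.g. deploy_build(r"D:\builds\latest", r"\\nas01\builds\latest", threads=4)
```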
Issue 2
I don’t know if it’s been adjusted now, but at some point in the past the Frostbite data build basically consumed as much CPU as was available on the machine. I believe work was being done to modify the system so that you could specify a number of CPU threads, but at the time of this issue that work hadn’t been completed. Once again we were asked to look at ways of improving the throughput of some of the build processes, and the data build came under the spotlight. Watching the machine metrics, it seemed we could give the VMs doing the builds (we’d gone virtual by then, but not quite cloud yet) a few more vCPUs to play with, and knowing the way the build worked we figured that would allow a few more things to run in parallel and get through stuff quicker.
So we increased the vCPUs for each build machine and watched as it did indeed have the desired effect on the build, and things got a little bit faster. In our hubris, though, we thought we could increase those numbers again and get another speed increase - after all, we’d checked the number of actual CPUs we had in the farm and could see that we had plenty of capacity left, so we might as well make use of it. We tweaked the numbers up again. And that’s when it all fell apart. We hadn’t done anything too crazy - the first increase was something like moving from 2 vCPUs per machine to about 8. The second was only a small one, taking everything up to 10.
Why did it all fall apart? Because we were looking at the CPU numbers and watching the processor usage on each of the machines, which was all registering well under capacity for each build as it ran. What we weren’t looking at, perhaps obviously, was RAM usage. To free ourselves from a little of the blame here, memory usage wasn’t going mad; but that last small change seemed to reach a tipping point in the build system where the process thought it had enough CPU available to run just the wrong number of things at the same time. Each of those individual processes then started grabbing huge amounts of memory - much more than they were grabbing with only 8 vCPUs. Machines then started to grind to a halt across our farm as memory was consumed at a silly rate. It turns out this had already been happening a little after the first increase, but because our machines were all virtualised and sharing allocations of RAM from the actual hardware blades they lived on, Windows had decided to get itself involved and started to thrash the hell out of its pagefile. Because the default option was to allow the pagefile to grow (up to about 64GB, I think), it was now thrashing 64GB pagefiles over the network which, as I’m sure you’re aware, is definitely not as fast as reaching out to the actual onboard RAM. Anyway, that was another ‘improvement’ that we had to dial back down in order to actually reach a balanced performance.
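The lesson is easy to turn into something you can run: watch memory and swap alongside CPU on the build agents. Here’s a minimal sketch of that kind of check, assuming the psutil library is available; the function name, interval and warning threshold are all made up for illustration, not part of our actual tooling:

```python
import time
import psutil  # assumed to be installed on the build agent

def watch_build_host(interval_s: float = 30.0, ram_warn_pct: float = 85.0) -> None:
    """Log CPU *and* memory pressure side by side.

    We had been watching CPU alone; a loop like this would have surfaced
    the memory (and pagefile) pressure long before the farm ground to a halt.
    """
    while True:
        cpu = psutil.cpu_percent(interval=1)     # sampled over one second
        mem = psutil.virtual_memory()            # physical RAM usage
        swap = psutil.swap_memory()              # pagefile usage on Windows
        print(f"cpu={cpu:.0f}% ram={mem.percent:.0f}% swap={swap.percent:.0f}%")
        if mem.percent > ram_warn_pct:
            print("WARNING: memory pressure - builds may start thrashing the pagefile")
        time.sleep(interval_s)
```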
What are my take-aways from all this? I guess if I had to summarise, it’d go something like this:
- When managing a build farm you should always be looking for balance within the system
- Yes, you can make some things faster, but there will almost always be a trade-off somewhere - you just probably haven’t found it yet
- People will always ask for things to be faster - never be afraid to say “No”, at least initially, until you understand what the trade-off is
- A build system is always a whole ecosystem, from start to finish. Even if you share ownership of the individual bits among the team, you must always try to be aware of the impact on the whole. After all - it’s more than just the sum of its parts.