It’s always funny to see blog posts that present ephemeral environments, containers, or Bazel as the solution to CI.
This blog post matches my experience of running a CI system (but at a smaller scale). It’s messy, and throwing more resources and money at the problem often yields sharply diminishing returns for the amount spent.
I’d love to someday try and fix this problem for other companies.
> As mentioned before, the CI critical path is bound by its longest stretch of dependent actions. If one test consistently takes 20 minutes to execute and flakes, and has some logic to retry on failure, let’s say up to 3 times, it’ll take up to 60 minutes. It doesn’t matter if all other builds and tests execute in 30 seconds. That one slow, flaky test holds everyone’s builds back for up to 1 hour.
Honestly, I was really surprised not to see this mentioned until the end. Some of the other things in the article were almost jaw-dropping ($1+ million in instance savings, needing 48 cores to run CI, etc.), but flaky tests regularly causing problems and forcing reruns of extremely expensive jobs is something I would argue should have been addressed first, not last.
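A minimal sketch of the arithmetic in the quoted passage (the 20-minute test, up-to-3 attempts, and 30-second jobs are the article's numbers; the helper function itself is purely illustrative):

```python
# Worst-case critical path when a slow, flaky test retries on failure.
# Numbers come from the quoted passage; the function is an illustration only.
def critical_path_minutes(flaky_test_minutes: float, max_attempts: int,
                          other_jobs_minutes: float) -> float:
    # Every attempt of the flaky test sits on the same dependency chain,
    # so the worst case is paying for all attempts back to back.
    worst_case_flaky = flaky_test_minutes * max_attempts
    # The critical path is bound by its longest stretch of dependent actions.
    return max(worst_case_flaky, other_jobs_minutes)

print(critical_path_minutes(20, 3, 0.5))  # 60.0 -- everyone waits up to an hour
```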
It's really interesting how profound the impact of flaky tests is in a CI environment; it's like the classic "Herbie" story from Goldratt. The slowest step ultimately determines the pace of the whole process.
I mean, that's just Amdahl's law, right? Your speedup is directly related to how much the thing you improved was the bottleneck.
(Though of course that's just for raw speed; for cost and total compute needed, every cycle/IO counts.)
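A rough illustration of that point, with made-up numbers (the 95%/5% split below is an assumption, not something from the article):

```python
# Amdahl's law: overall speedup when a fraction p of the work is sped up by s.
def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# If the slow, flaky test is ~95% of the critical path (assumed split),
# making everything else nearly infinitely fast barely moves the needle,
# while a modest 3x win on the test itself dominates.
print(amdahl_speedup(p=0.05, s=1_000_000))  # ~1.05x from optimizing the fast jobs
print(amdahl_speedup(p=0.95, s=3.0))        # ~2.7x from speeding up the slow test
```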
Nice post! I enjoyed reading about how many teams were involved in the process. CI has to be a collaborative effort to be really successful.
It's often underestimated how much benefit you'll get from taking a good look at your cache usage. It all worked great the day your platform team set up the build system, but 100 new CI jobs later there will be tons of room for improvement. It's a similar story with consolidating CI jobs in general: if we just keep tacking things on, eventually we have to step back and optimize.