It’s always funny to see blog posts that present ephemeral environments, containers, or Bazel as the solution to CI.
This blog post matches my experience of running a CI system (but at a smaller scale). It’s messy, and throwing more resources and money at the problem often yields sharply diminishing returns for the amount spent.
I’d love to someday try and fix this problem for other companies.
> As mentioned before, the CI critical path is bound by its longest stretch of dependent actions. If one test consistently takes 20 minutes to execute and flakes, and has some logic to retry on failure, let’s say up to 3 times, it’ll take up to 60 minutes. It doesn’t matter if all other builds and tests execute in 30 seconds. That one slow, flaky test holds everyone’s builds back for up to 1 hour.
Honestly, I was really surprised not to see this mentioned until the end. Some of the other things in the article were almost jaw-dropping ($1+ million in instance savings, needing 48 cores to run CI, etc.), but flaky tests regularly causing problems and forcing reruns of extremely expensive jobs is something I would argue should have been addressed first, not last.
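A minimal sketch of the arithmetic in the quoted passage (the 20-minute test, up-to-3 attempts, and 30-second jobs are the article's numbers; the helper function itself is purely illustrative):

```python
# Worst-case critical path when a slow, flaky test retries on failure.
# Numbers come from the quoted passage; the function is an illustration only.
def critical_path_minutes(flaky_test_minutes: float, max_attempts: int,
                          other_jobs_minutes: float) -> float:
    # Every attempt of the flaky test sits on the same dependency chain,
    # so the worst case is paying for all attempts back to back.
    worst_case_flaky = flaky_test_minutes * max_attempts
    # The critical path is bound by its longest stretch of dependent actions.
    return max(worst_case_flaky, other_jobs_minutes)

print(critical_path_minutes(20, 3, 0.5))  # 60.0 -- everyone waits up to an hour
```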
It's really interesting how profound the impact of flaky tests is in a CI environment; it's like the classic "Herbie" story from Goldratt. The slowest step ultimately determines the pace of the whole process.
I mean, that's just Amdahl's law, right? Your speedup is directly related to how much the thing you improved was the bottleneck.
(Though of course that's just for raw speed; for cost and total compute needed, every cycle/IO counts.)
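A rough illustration of that point, with made-up numbers (the 95%/5% split below is an assumption, not something from the article):

```python
# Amdahl's law: overall speedup when a fraction p of the work is sped up by s.
def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# If the slow, flaky test is ~95% of the critical path (assumed split),
# making everything else nearly infinitely fast barely moves the needle,
# while a modest 3x win on the test itself dominates.
print(amdahl_speedup(p=0.05, s=1_000_000))  # ~1.05x from optimizing the fast jobs
print(amdahl_speedup(p=0.95, s=3.0))        # ~2.7x from speeding up the slow test
```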
Nice post! I enjoyed reading about how many teams were involved in the process. CI has to be a collaborative effort to be really successful.
It's often underestimated how much benefit you'll get from taking a good look at your cache usage. It all worked great the day your platform team set up the build system, but 100 new CI jobs later there will be tons of room for improvement. It's a similar story with consolidating CI jobs in general: if we just keep tacking things on, eventually we have to step back and optimize.