Backup agents across our regions started crashing on Saturday, 23 May. No new software had recently been deployed, no major configuration changes had been made, and nothing on our side had changed in a way that could plausibly explain this sudden change in behaviour.
An analysis of crash data quickly led the team to a simple bug in a central piece of logic. This code had not been modified recently, but the bug was simple to correct, and a fix was swiftly made.
A patched version of the software was rolled out to production early on Sunday, 24 May (around midnight UTC), on a single server. Once we had confirmed the patch was effective, it was rolled out to all regions over the course of Sunday.
We believe external factors triggered this bug. The bug was clearly ours, and ours to fix, but it was not recently introduced. Most likely, an otherwise innocuous change at a primary workload provider slightly altered how these code paths execute, causing the bug to suddenly trigger.