Backup agents across our regions started crashing on Saturday, 23 May. No new software had recently been deployed, no major configuration changes had been made, and nothing on our side had changed in a way that could plausibly explain this sudden change in behaviour.
An analysis of crash data quickly led the team to a simple bug in a central piece of logic. This code had not been modified recently, but the bug was simple to correct, and a fix was swiftly made.
A patched version of the software was rolled out to production early on Sunday, 24 May (around midnight UTC), on a single server. Once we had confirmed the patch was effective, it was rolled out to all regions over the course of Sunday.
We believe external factors triggered this bug. The bug was clearly ours, and ours to fix, but it was not recently introduced. Most likely, an otherwise innocuous change at a primary workload provider slightly altered how these code paths execute, causing the bug to suddenly trigger.