General backup performance degradation

Incident Report for Keepit

Resolved

We are satisfied with the results of the fix, and it continues to perform well.

Posted Jul 08, 2020 - 12:04 UTC

Update

All regions have now caught up with backlogs and backups seem to run as expected everywhere.

We will continue to monitor the performance of the fix and analyse results further.

Posted Jul 05, 2020 - 04:01 UTC

Update

Everything indicates that the fix is working and having a very positive effect on error rates as well as backup job completion times across all regions.

We are still feathering in job executions on the EU environment, so that region is not yet fully up to speed. We are continuing to monitor and adjust.

Posted Jul 04, 2020 - 08:49 UTC

Monitoring

The fix has been deployed in the US and AU regions as well.

All running backup jobs were saved and stopped. They are now being started again and will continue from where they left off.

We are following the performance of the platform closely to validate that the fix performs as expected in all situations.

So far the situation is looking very promising but we will continue to follow the performance of the backups closely.

Posted Jul 03, 2020 - 19:59 UTC

Update

A fix for this problem has been deployed on the EU region after going through QA.

Existing backup jobs have been restarted (after saving their state) to facilitate an immediate switch to the new codebase. We are now ramping jobs up again on the EU region.

The AU and US regions will receive this fix as soon as we validate that it works as expected on the EU production environment.

Posted Jul 03, 2020 - 19:17 UTC

Update

Work is progressing towards a resolution of this problem.

We interrupted a large number of backup jobs in the EU region while reconfiguring systems to allow further investigation. The backup jobs saved their state before being interrupted and will continue more or less where they left off once they are started back up.

Please note that we are ramping jobs back up slowly in this region and that this is not a solution to the ongoing problem. We will provide further updates with status on the resolution effort.

Posted Jul 03, 2020 - 12:59 UTC

Identified

We have been analysing the logs of older and newer backup jobs for
devices that include OneDrive and SharePoint workloads, and we have
noticed that a certain type of network error started to occur more
frequently during the month of June.

Network errors are a normal part of running a workload over the
Internet and our systems are perfectly able to deal with such errors.
However, where we used to see maybe 10 such errors per day for a given
device at the start of June, we now see tens of thousands of such
errors per day for the same device at the end of June.

This thousand-fold increase in errors is now at a level where it is
causing a decrease in backup performance for customers.

We are working on three tracks to address this problem. The core
problem seems to reside outside of our network, so the first track
focuses on working with external parties to investigate and resolve
it. The second track covers things that we may be able to do
internally to make the problem less likely to trigger. The third
track covers changes that we can implement to lessen the impact of
the problem when it occurs.

We will be providing updates as we progress towards resolution of this
issue - we apologise for any inconvenience this decrease in backup
performance may cause and assure you that we are working hard towards
a resolution of this problem.

Posted Jul 03, 2020 - 09:56 UTC

This incident affected: Denmark, Copenhagen (dk-co) (SaaS Backup), United States, Washington DC (us-dc) (SaaS Backup), and Australia, Sydney (au-sy) (SaaS Backup).