Degraded service performance at the AU DC
Incident Report for Keepit
Resolved
We can confirm that backup jobs are no longer delayed and that no job queues of any type remain at this time. All functions and operations have been restored to normal and are working as intended.

We took several actions to address the issues that the service experienced.

For a period of time, we limited the number of backup jobs that could run simultaneously. We have since returned to full capacity, and the service is now running the same number of concurrent backup jobs as before the problem appeared. We do not observe any backlogs.

We installed additional hardware to further increase the capacity of each server. These upgrades triggered the automatic restart of some backup jobs, but they resulted in a significant improvement in overall performance.

We installed all the hardware necessary to switch to an Internet Exchange Point (IXP) and are in the process of completing some final maintenance work that does not impact the service. We do not expect any major interruptions when activating the IXP. We will also continue to use public networks as a backup option, as well as for processing and downloading any data that is not reachable via the IXP.

We appreciate your understanding and patience during this period.
Posted Jul 28, 2020 - 20:24 UTC
Update
We are taking the next step toward resolving the problems in the Australian environment. We are currently installing the additional hardware needed to increase capacity further. As part of this process, we will have to restart our systems, which will cause ongoing backup jobs to fail. New backup jobs will be scheduled automatically and will start as soon as a slot becomes available.
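The rescheduling behavior described above can be sketched in a few lines. This is a hypothetical illustration only — Keepit's scheduler internals are not public, and all names here are invented for the example:

```python
from collections import deque

# Hypothetical sketch: when a maintenance restart interrupts running jobs,
# the scheduler moves them back into the queue so they start again
# automatically once a slot frees up. Not Keepit's actual implementation.
def reschedule_interrupted(running, queue):
    """Re-queue jobs that were killed by the restart."""
    for job in running:
        queue.append(job)
    running.clear()

queue = deque(["tenant-c"])              # jobs already waiting
running = ["tenant-a", "tenant-b"]       # jobs interrupted by the restart
reschedule_interrupted(running, queue)
# queue now holds tenant-c, tenant-a, tenant-b; nothing is left running
```

No backup is lost in this model: interrupted jobs simply wait their turn behind the jobs already queued.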

This upgrade will allow more backup jobs to run simultaneously, which will reduce the queue of backup jobs. When all work is completed, we expect improved performance, though not yet optimal, as we are still waiting for another piece of equipment.

We apologize for any inconvenience this may have caused.
Posted Jul 22, 2020 - 07:47 UTC
Monitoring
We are experiencing degraded service performance at the AU DC.

We are very aware of the degraded service performance observed in the Australian environment over the last two weeks, which is partially related to the incident opened on July 3. Some customers may be missing up to a week of backups of their data. We apologize for the service interruption and are doing everything we can to remedy it as quickly as possible.

With a much higher inflow of data resulting from a recent Microsoft API change, we are encountering problems with memory and network allocation. As a result, backups are running longer than expected, which adds pressure on completing regular backup jobs as the queue of backup jobs grows. Restore jobs are indirectly impacted by this, so actual restore operations may start with some delay.

To remedy the saturation issues while keeping backups running, we had to limit the number of backups running simultaneously for a short period.
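Capping the number of simultaneous jobs is a standard throttling technique. The minimal sketch below uses a semaphore to enforce the cap; it is a generic illustration under assumed names, not Keepit's actual code:

```python
import threading

# Hypothetical sketch: a fixed pool of "slots" caps how many backup jobs
# run at once. Jobs beyond the cap block until a slot is released.
class ThrottledRunner:
    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def run(self, job):
        with self._slots:          # blocks while all slots are taken
            return job()

runner = ThrottledRunner(max_concurrent=2)
results = []
threads = [
    threading.Thread(target=lambda i=i: results.append(runner.run(lambda: i)))
    for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# all five jobs complete, but never more than two at the same time
```

Lowering `max_concurrent` trades throughput for stability, which matches the temporary limit described above.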

At this point, we have been able to address the memory consumption, which has allowed us to significantly increase the number of slots available for concurrent backup jobs. In addition to this optimization, we are adding more hardware, which has now cleared Australian customs, by the end of the week to increase capacity further and return to providing optimal service to our customers.

We are also making significant changes to the network architecture, which will benefit our customers once completed. The service currently uses public networks to transfer data, which naturally has limitations, and we have periodically been hitting the maximum allowed bandwidth. We have increased the available bandwidth several times over the past 12 months, but we still reach maximum utilization at times, which negatively impacts overall system performance.

To resolve this challenge, we are changing the network architecture to use an Internet Exchange Point (IXP), which is already the main method of data transfer in all our other datacenters. An IXP allows networks (Keepit and Microsoft/Google/Salesforce) to interconnect directly via the exchange rather than going through one or more third-party networks, making transfers more efficient and less error-prone. As a result, performance will be much faster, not only compared to the current situation but also compared to the past, especially for SharePoint and OneDrive backups.

An IXP is a physical infrastructure component that requires specialized equipment to be installed. The hardware required for this was just released from customs and will be installed and configured over the next couple of days. Our goal is to have this in place by the end of the weekend. We will maintain our current public network capability as a fallback option in case unexpected circumstances occur with the exchange.
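The routing policy described in the two paragraphs above (prefer the direct IXP path, fall back to the public network for anything the exchange cannot reach) can be expressed as a one-line decision. This is purely illustrative; the peer set and function names are assumptions, not Keepit's configuration:

```python
# Hypothetical sketch of the fallback policy: traffic to providers peered
# at the exchange goes over the IXP; everything else uses the public
# network, which also serves as the fallback if the IXP is unavailable.
IXP_PEERS = {"microsoft", "google", "salesforce"}  # assumed peer set

def pick_route(provider, ixp_up=True):
    """Return which network path to use for a given cloud provider."""
    if ixp_up and provider in IXP_PEERS:
        return "ixp"
    return "public"
```

For example, `pick_route("microsoft")` selects the direct exchange path, while `pick_route("microsoft", ixp_up=False)` or any non-peered provider falls back to the public network.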

We are doing our best to resolve the existing problems as quickly as possible and are working 24x7 to get past this situation. We will update this status page as the work progresses to keep you informed of the timeline. We apologize for any inconvenience this may have caused.
Posted Jul 21, 2020 - 20:06 UTC
This incident affected: Australia, Sydney (au-sy) (User Interface, SaaS Backup).