Now that summer vacay is sadly over, let’s turn our attention to a less enjoyable form of downtime.
As we reported earlier this year, data centre outages are becoming more frequent and costing businesses more money. In 2021, more than a quarter (26 per cent) of all outages were suffered by large public cloud providers. Aren’t these kinds of unfortunate events covered by service level agreements (SLAs)?
Yes … and no.
Under most public cloud provider SLAs, “compensation is offered in service credits, not cash,” according to a report by Owen Rogers, director of cloud computing research at Uptime. “SLA compensation is poor and is highly unlikely to cover the business impacts that result from downtime,” he added.
In a subsequent blog post, Rogers pointed out that seeking SLA compensation after a public cloud outage can be cumbersome for businesses that are already busy trying to keep the proverbial lights on during the incident.
“When a (cloud) failure occurs, the user is responsible for measuring downtime and requesting compensation; this is not provided automatically. Users usually need to raise a report request with service logs to show proof of the outage … These approaches mean that users must detect an outage and apply for a service credit … which is unlikely to cover the cost of the outage,” Rogers wrote.
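Since the burden of measuring downtime falls on the customer, here is a minimal sketch of how a business might total up downtime from its own health-check records. The function name, the probe format and the sample timestamps are all hypothetical and do not come from Rogers's post or any provider's tooling. The sketch also assumes an outage starts at the first failed probe and ends at the next successful one.

```python
from datetime import datetime, timedelta

def total_downtime(samples):
    """Sum the downtime seen in a list of (timestamp, is_up) probes.

    Assumptions (ours, for illustration): samples are sorted by time,
    an outage starts at the first failed probe and ends at the next
    successful probe. An outage still open at the last sample is not
    counted, because its end is unknown.
    """
    down_since = None
    total = timedelta()
    for timestamp, is_up in samples:
        if not is_up and down_since is None:
            down_since = timestamp          # outage begins
        elif is_up and down_since is not None:
            total += timestamp - down_since  # outage ends
            down_since = None
    return total

# Hypothetical probe log: one outage between 10:05 and 10:15.
samples = [
    (datetime(2022, 9, 1, 10, 0), True),
    (datetime(2022, 9, 1, 10, 5), False),
    (datetime(2022, 9, 1, 10, 10), False),
    (datetime(2022, 9, 1, 10, 15), True),
]
print(total_downtime(samples))  # 0:10:00
```

A record like this is only a starting point. Each provider sets its own rules for what counts as proof of an outage.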
Rogers also gave a hypothetical breakdown of how much compensation a business would receive based on the actual SLA terms offered by one of the world’s biggest cloud providers.
Under that provider’s 99 per cent uptime SLA, “if a single virtual machine goes down for less than seven hours and 18 minutes” in one month, the cloud provider’s “total compensation for this outage would be 30 cents,” Rogers wrote. “If a virtual machine goes down for less than 36 hours” in a month, he added, the provider’s compensation would be “just under $1.”
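The seven hours and 18 minutes figure is consistent with simple arithmetic: it is 1 per cent of an average 730-hour month. The short sketch below works that out. The 730-hour month (8,760 hours a year divided by 12) is our assumption, and a provider's SLA may define the billing month differently. The function names are ours too.

```python
from fractions import Fraction

HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12 months

def allowed_downtime_minutes(uptime_percent, hours_in_month=HOURS_PER_MONTH):
    """Minutes of downtime per month that still meet the uptime target.

    uptime_percent is passed as a string (e.g. "99") so that Fraction
    can represent it exactly and avoid floating-point rounding.
    """
    downtime_share = 1 - Fraction(uptime_percent) / 100
    return downtime_share * hours_in_month * 60

def format_hours_minutes(minutes):
    """Render a minute count as hours and minutes, dropping any seconds."""
    whole_minutes = int(minutes)
    return f"{whole_minutes // 60}h {whole_minutes % 60:02d}m"

print(format_hours_minutes(allowed_downtime_minutes("99")))  # 7h 18m
```

In other words, under these assumptions a virtual machine can be down for up to 438 minutes in a month before it breaches a 99 per cent uptime target.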
For many businesses, $1 in compensation for the financial and reputational losses suffered during a 35-hour outage may not cut it. As Neta Rozy succinctly stated in an interview with Protocol.com, cloud providers “only give you credit for the service you didn’t receive.”
To address the gap between downtime-related business losses and SLA compensation levels, Rozy co-founded a startup called Parametrix to offer cloud downtime insurance.
Businesses can buy insurance policies to cover losses they suffer due to downtime at their cloud provider. Those losses can include lost revenue, lost employee productivity, reputational damage and even the knock-on damage from losses suffered by their own customers.
Parametrix has a proprietary system that monitors downtime incidents at cloud providers like AWS, Azure, Google Cloud and Oracle Cloud, plus major cloud-based services like Salesforce, Cloudflare and Shopify. When Parametrix detects downtime at one of those providers, its system automatically notifies policyholders so they can activate an insurance claim.
Parametrix coverage kicks in just one hour after a policyholder is affected by a cloud outage, a much shorter waiting period than most cyber insurance policies require before they can be activated.
Cost-effective cloud resilience
There’s another way businesses can try to limit their losses after a cloud provider outage. According to Rogers’s report, they can add resiliency to their own cloud and application architecture in the most cost-effective way possible.
To help organizations do that, Rogers examined seven methods of boosting resiliency in IT architectures to protect against cloud provider outages. He evaluated the cost-effectiveness of each method based mainly on:
- how much extra resilience it adds
- the cost (borne by the cloud service customer) of adding that extra resilience
- how much compensation the cloud service customer would receive from a provider’s SLA in the event of an outage
Based on Rogers’s findings, the most cost-effective steps businesses can take to make their architecture more resilient to public cloud outages are:
- distributing workloads across multiple availability zones vs. just one zone
- setting up failover to a backup region using a pre-enabled DNS service
The most salient points of Rogers’s research, however, have as much to do with responsibility as they do with resilience.
The study reminds public cloud users that it’s their responsibility to detect and track an outage suffered by their provider. With most SLAs offering little compensation, it’s also the responsibility of cloud customers to cover the financial losses their business suffers when downtime hits their cloud provider.
Architecting some extra IT resilience might save a business more headaches (and money) than any SLA.