Sep 4, 2023

Azure Australia East data center outage and building resilient solutions

The recent outage in the Azure Australia East Data Centre serves as a stark reminder: no matter how robust cloud services become, they aren't exempt from outages. Azure already has some great capabilities that can help mitigate zone and region outages. In this post, we will explore what happened and how to build resilient systems using multi-region architectural patterns.

What Happened?

Azure's Australia East Data Centre experienced an outage due to a utility power sag at around 8:30 AM on 30 August. This event caused a subset of cooling units to go offline, which raised datacenter temperatures. To prevent potential hardware damage, Microsoft decided to power down selected compute and storage units. Consequently, a number of Azure services that depended on this infrastructure were impacted, including:

Compute
Storage
Analytics
Integration
Management and Governance
IoT
Hybrid/Multi-cloud

A small number of these services experienced prolonged impact, predominantly as a result of dependencies in recovering subsets of Storage, SQL, and/or Cosmos DB services. A full description of the incident can be found here.

The incident lasted for nearly two days, with varying degrees of impact across different services.

How do we design systems to withstand regional outages like this?

This leads us to the main question of this article: how do we design more resilient systems that can withstand regional outages?

Image from https://datacenters.microsoft.com/globe/explore

In Australia, Microsoft has three data centres with Region Pairing. This gives Azure customers an important benefit when setting up effective disaster recovery (DR) strategies.

We are going to discuss some basic configurations, assuming an application stack that uses Virtual Machines and Azure SQL Database.

Pilot Light

The Pilot Light DR strategy maintains a small replica of the environment that is always running in a secondary region. Only the essential data (such as databases and storage accounts) is replicated to the secondary region. The rest of the infrastructure, such as application servers or web servers, is not running full-time but is pre-configured and ready to be started rapidly. In case of a disaster, this replica can be scaled up to handle production workloads. However, because most of the infrastructure has to be started first, it takes longer to have the services fully up and running in the un-impacted region.

Pilot Light DR configuration
low cost, high recovery time

Hot Standby (Active/Passive)

This is a very common method in enterprise environments: a full-scale replica of the production environment runs in another region, with data replicated in real time or near real time. In the event of a disaster, traffic can be redirected to the environment in the other region almost immediately. This method recovers much faster than Pilot Light, but it is also the most expensive of the three strategies.

Hot Standby DR configuration
very high cost, moderate recovery time
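
The redirect step described above can be sketched as priority-based routing: traffic goes to the primary region while it is healthy and falls back to the standby otherwise. This is a minimal Python sketch, not how Azure implements it; the `Endpoint` class, the `pick_endpoint` function, and the region names used here are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """A regional deployment as seen by the traffic router (hypothetical model)."""
    name: str
    priority: int        # lower number = preferred
    healthy: bool = True


def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Return the healthy endpoint with the lowest priority number."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoint available")
    return min(healthy, key=lambda e: e.priority)


# Illustrative region names; substitute your own primary and standby.
primary = Endpoint("australiaeast", priority=1)
standby = Endpoint("australiasoutheast", priority=2)

print(pick_endpoint([primary, standby]).name)  # primary serves traffic
primary.healthy = False                        # simulate a regional outage
print(pick_endpoint([primary, standby]).name)  # standby takes over
```

In a real deployment the `healthy` flag would come from health probes against each region rather than being set by hand.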

Active/Active

In the Active/Active configuration, the production workload runs in multiple cloud regions simultaneously. If one site fails, the other site(s) can take over, ensuring high availability and resilience. Azure Traffic Manager operates at the DNS layer and directs incoming DNS requests based on the health of the application stack. In the event of a disaster in one region, it automatically starts routing traffic to the un-impacted region only. On the database side, when Azure SQL Database is configured with an auto-failover group, the read-only replica in Region 2 is automatically promoted to read-write, which makes the recovery process smooth.

Active/Active DR configuration
high cost, low recovery time
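
The two behaviours described above (health-based routing across regions and promotion of the read-only database replica) can be modelled in a few lines. This is a rough Python sketch under stated assumptions: `Region`, `route`, and `fail_over_database` are hypothetical names, and the round-robin loop simply stands in for whichever routing method is configured. It does not call any Azure API.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Region:
    """Hypothetical model of one region in an Active/Active setup."""
    name: str
    healthy: bool = True
    db_role: str = "read-only"   # "read-write" on the current primary


def route(regions, request_count):
    """Spread requests round-robin over the regions that are currently healthy."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    targets = cycle(healthy)
    return [next(targets).name for _ in range(request_count)]


def fail_over_database(regions):
    """If the read-write region is unhealthy, promote a healthy replica."""
    primary = next(r for r in regions if r.db_role == "read-write")
    if primary.healthy:
        return primary
    replica = next((r for r in regions if r.healthy), None)
    if replica is None:
        raise RuntimeError("no healthy replica to promote")
    primary.db_role, replica.db_role = "read-only", "read-write"
    return replica


region1 = Region("region-1", db_role="read-write")
region2 = Region("region-2")
regions = [region1, region2]

print(route(regions, 4))      # both regions share the load
region1.healthy = False       # simulate an outage in Region 1
print(route(regions, 4))      # only Region 2 receives traffic
print(fail_over_database(regions).name, region2.db_role)
```

The point of the model is that routing and database promotion are separate decisions: traffic can move to the healthy region immediately, but writes only succeed there once the replica has been promoted.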

Use cloud-native services

Disaster recovery becomes much easier to manage when using cloud-native services. Wherever possible, we encourage our clients to use cloud-native services and to treat DR as a key element of system architecture design. For example:

Multi-Region Azure App Services / Function Apps

Azure App Services is a fully managed service (including things such as security patching and automatic scaling) that offers a number of DR options, including multi-region deployments. In this example, the App Service is deployed in two regions in an Active/Active configuration, but it can just as easily be configured for the Active/Passive or Pilot Light methods.


Multi-region AppService deployment

Multi-Region Azure Kubernetes Service (AKS)

Although AKS is a fully managed service, a single cluster can’t be deployed in two regions. But there are great architectural solutions available to make this happen within Azure.

As depicted in the diagram below:

Multiple AKS clusters are deployed, each in a separate Azure region. During normal operations, network traffic is routed across all regions. If one region becomes unavailable, Traffic Manager or Front Door routes the traffic to the healthy region.

In addition, shared (but non-critical) services, such as Azure Container Registry and Log Analytics, can either be deployed in both regions or be spun up on demand using an infrastructure-as-code (IaC) deployment. In most cases (if the shared services aren’t impacted), recovery does not require redeploying the shared services. In the worst-case scenario (such as this Azure outage, where ACR was also impacted), it should be possible to spin them up using IaC and by tweaking the CI/CD pipelines.

Multi-region AKS deployment

We have discussed three main DR strategies for regional outages here, with varying levels of complexity and cost. These strategies can also be combined to save costs or reduce recovery time.
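
As a rough way to compare the three strategies, the cost and recovery-time ratings from the figure captions in this post can be put on an ordinal scale and filtered by the recovery time you can tolerate. The sketch below is illustrative only: the numeric levels and the `cheapest_strategy` helper are assumptions, and the ratings are the qualitative labels used above, not measured values.

```python
# Ordinal scale for the qualitative ratings used in this post (assumed ordering).
LEVELS = {"low": 1, "moderate": 2, "high": 3, "very high": 4}

# Ratings copied from the figure captions above.
STRATEGIES = {
    "Pilot Light":   {"cost": "low",       "recovery_time": "high"},
    "Hot Standby":   {"cost": "very high", "recovery_time": "moderate"},
    "Active/Active": {"cost": "high",      "recovery_time": "low"},
}


def cheapest_strategy(max_recovery_time: str) -> str:
    """Return the lowest-cost strategy whose recovery-time rating is acceptable."""
    limit = LEVELS[max_recovery_time]
    candidates = [
        name for name, rating in STRATEGIES.items()
        if LEVELS[rating["recovery_time"]] <= limit
    ]
    return min(candidates, key=lambda name: LEVELS[STRATEGIES[name]["cost"]])


print(cheapest_strategy("high"))      # Pilot Light
print(cheapest_strategy("moderate"))  # Active/Active
print(cheapest_strategy("low"))       # Active/Active
```

With these ratings Hot Standby is never selected, which reflects the captions rather than a general rule. Factors such as operational complexity and application support for running in multiple regions are not modelled here.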

A number of other factors come into play when designing a reliable strategy, and you need to test these strategies on a regular basis. Some of the more advanced multi-region solutions also need careful design consideration. Every organisation should evaluate its own requirements and choose the solution that is appropriate for it. Feel free to reach out to us if your organisation needs any help in evaluating or implementing DR solutions in Azure.

Additional reading

More information about the Azure outage

Enterprise-scale disaster recovery Azure Architecture Center

Interested in hearing more?
Let's connect.