Azure Australia East data center outage and building resilient solutions


What Happened?

Azure's Australia East data centre experienced an outage after a utility power sag at around 8:30 AM on 30 August 2023. The sag took a subset of cooling units offline, which raised temperatures in the data centre. To prevent hardware damage, Microsoft powered down selected compute and storage units. As a result, a number of Azure services that depended on this infrastructure were impacted, including:

Compute

Container Apps
Container Registry
Virtual Machines
Logic Apps
App Service
Arc enabled Kubernetes
Batch
Service Fabric

Storage

Cosmos DB
Databricks
Data Explorer
Database for PostgreSQL/MySQL flexible servers
Data Factory
NetApp Files
Redis Cache
SQL Database
Storage

Analytics

Activity Logs & Alerts
Application Insights
Chaos Studio
Event Hubs
HDInsight
Log Analytics
Log Search Alerts
Purview
Stream Analytics

Integration

API Management
Azure API for FHIR
Health Data Services
Notification Hubs
Service Bus
Relay

Management and Governance

Backup

IoT

Digital Twins
Device Update for IoT Hub
IoT Hub
Kubernetes Service (AKS)
IoT Central

Hybrid/Multi-cloud

ExpressRoute
Microsoft Sentinel

A small number of these services experienced prolonged impact, predominantly because they depended on the recovery of subsets of Storage, SQL, and/or Cosmos DB services. A full description of the incident is linked under Additional reading at the end of this article.

The incident lasted for nearly two days, with varying degrees of impact across different services.

How do we design systems to withstand regional outages like this?

An outage of this scale raises the main question of this article: how do we design systems that keep running when an entire region is unavailable?

Image from https://datacenters.microsoft.com/globe/explore

Microsoft operates several Azure regions in Australia that use Region Pairing; Australia East, for example, is paired with Australia Southeast. Paired regions give Azure customers a natural secondary location on which to build an effective DR strategy.

The configurations below are basic examples and assume an application stack built on Virtual Machines and Azure SQL Database.


Pilot Light

The Pilot Light DR strategy keeps a small replica of the environment running in a secondary region at all times. Only the essential data (such as databases and storage accounts) is replicated to the secondary region. The rest of the infrastructure, such as application servers and web servers, does not run full-time but is pre-configured and ready to be started. In a disaster, this replica is scaled up to handle production workloads. Because compute has to be started and scaled first, it takes longer to have services fully up and running in the unaffected region.

low cost, high recovery time


Hot Standby (Active/Passive)

This is a very common method in enterprise environments: a full-scale replica of the production environment runs in another region, with data replicated in real time or near real time. In a disaster, traffic can be redirected to the standby region almost immediately. This method recovers much faster than Pilot Light, but it is also the most expensive, because a full-scale environment sits idle until it is needed.

very high cost, moderate recovery time
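The failover decision in an Active/Passive setup can be sketched as priority-based selection: traffic goes to the primary while it is healthy and moves to the standby only when it is not. The snippet below is a minimal illustration of that idea, not an Azure API; the `Endpoint` class, the `pick_endpoint` function and the region labels are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str             # hypothetical region label
    priority: int         # lower number = preferred
    healthy: bool = True  # would normally come from a health probe


def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Return the healthy endpoint with the lowest priority number."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoint available")
    return min(healthy, key=lambda e: e.priority)


primary = Endpoint("australiaeast", priority=1)
standby = Endpoint("australiasoutheast", priority=2)
endpoints = [primary, standby]

print(pick_endpoint(endpoints).name)  # primary while it is healthy
primary.healthy = False               # simulate the regional outage
print(pick_endpoint(endpoints).name)  # traffic moves to the standby
```

In a real deployment the health flag would be driven by probes against the application stack, and the selection would be performed by the traffic routing service rather than by application code.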


Active/Active

In the Active/Active configuration, the production workload runs in multiple cloud regions simultaneously. If one site fails, the other site(s) take over, ensuring high availability and resilience. Azure Traffic Manager operates at the DNS layer and answers incoming DNS queries based on the health of the application stack in each region. In a disaster in one region, it automatically starts routing traffic to the unaffected region only. For the database, when Azure SQL Database is configured with an auto-failover group, the read-only replica in Region 2 is automatically promoted to read-write, which makes the recovery process smooth.

high cost, low recovery time
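While a database failover is in progress, existing connections are dropped and new ones can fail for a short time, so applications usually retry instead of giving up on the first error. The sketch below shows one common approach, retry with exponential backoff, using a simulated connection. `TransientConnectionError`, `connect_with_retry` and `fake_connect` are hypothetical names for this example; a real application would catch the specific errors raised by its database driver.

```python
import time


class TransientConnectionError(Exception):
    """Stand-in for the driver-specific error seen while a failover is in progress."""


def connect_with_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return connect()
        except TransientConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Simulated connection: fails twice (failover in progress), then succeeds.
calls = {"count": 0}


def fake_connect():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise TransientConnectionError("database is failing over")
    return "connected"


delays = []  # record the waits instead of actually sleeping
result = connect_with_retry(fake_connect, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in production code the default `time.sleep` would be used.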


Use cloud-native services

Disaster Recovery becomes much easier to manage when using cloud-native services. Wherever possible, we encourage our clients to use cloud-native services and to treat DR as a key element of system architecture design. For example:

Using Azure App Service or Azure Functions instead of Virtual Machines, which makes multi-region compute deployments much easier.
Using AKS instead of deploying and operating your own Kubernetes clusters.
Using Azure SQL Database instead of deploying VMs to host the databases.


Multi-Region Azure App Services / Function Apps

Azure App Service is a fully managed service (including things such as security patching and automatic scaling) that offers a number of DR options, including multi-region deployments. In this example, the App Service is deployed in two regions in an Active/Active configuration, but it can just as easily be configured for the Active/Passive or Pilot Light methods.

Multi-Region Azure Kubernetes Service (AKS)

Although AKS is a fully managed service, a single cluster can't be deployed across two regions. However, Azure supports architectures that achieve multi-region resilience by running multiple clusters.

In this design, multiple AKS clusters are deployed, each in a separate Azure region. During normal operations, network traffic is distributed across all regions. If one region becomes unavailable, Traffic Manager or Azure Front Door routes the traffic to the healthy region.

In addition to this, shared (but non-critical) services, such as Azure Container Registry and Log Analytics, can either be deployed in both regions or be spun up on demand using an IaC deployment. In most cases (if the shared services aren't impacted), recovery does not require a shared services deployment. In the worst-case scenario (such as this Azure outage, where ACR was also impacted), it should be possible to spin them up using IaC and by tweaking the CI/CD pipelines.
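The multi-cluster routing behaviour described above, where traffic is spread across all regions and an unhealthy region is taken out of rotation, can be illustrated with a small round-robin router. This is a simplified sketch and not how Traffic Manager or Front Door is implemented; `RegionRouter` and the region labels are assumptions for the example.

```python
class RegionRouter:
    """Round-robin over regions, skipping any that are marked unhealthy."""

    def __init__(self, regions):
        self.regions = list(regions)
        self.health = {region: True for region in self.regions}
        self._next = 0  # index of the next region to try

    def mark(self, region, healthy):
        """Record the latest health state for a region."""
        self.health[region] = healthy

    def route(self):
        """Return the next healthy region in rotation."""
        for _ in range(len(self.regions)):
            region = self.regions[self._next]
            self._next = (self._next + 1) % len(self.regions)
            if self.health[region]:
                return region
        raise RuntimeError("no healthy region available")


router = RegionRouter(["australiaeast", "australiasoutheast"])
normal = [router.route() for _ in range(4)]         # both regions share the load
router.mark("australiaeast", False)                 # simulate the regional outage
during_outage = [router.route() for _ in range(4)]  # only the healthy region is used

print(normal)
print(during_outage)
```

The cluster in the surviving region must have enough capacity (or be able to scale quickly enough) to absorb the full load once the other region is removed from rotation.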


We have discussed three main DR strategies for regional outages here, each with a different level of complexity and cost. These strategies can also be combined to save costs or reduce recovery time.
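One way to compare the strategies is to add up the steps each one needs before production traffic is served again. The sketch below does this with made-up durations; the step names and the minutes are purely illustrative assumptions, not measurements, and real values depend entirely on the workload.

```python
# All durations are hypothetical minutes, chosen only to illustrate the comparison.
RECOVERY_STEPS = {
    "Pilot Light": {
        "detect outage": 5,
        "start pre-configured servers": 45,
        "scale up to production size": 20,
        "redirect traffic": 5,
    },
    "Hot Standby": {
        "detect outage": 5,
        "promote database replica": 5,
        "redirect traffic": 5,
    },
    "Active/Active": {
        "detect outage": 2,
        "remove failed region from rotation": 3,
    },
}


def estimated_recovery_minutes(steps):
    """Sum the duration of every step in a recovery plan."""
    return sum(steps.values())


totals = {name: estimated_recovery_minutes(steps) for name, steps in RECOVERY_STEPS.items()}
ranking = sorted(totals, key=totals.get)  # fastest recovery first

for name in ranking:
    print(f"{name}: ~{totals[name]} min")
```

Replacing the placeholder numbers with timings from your own DR tests turns this into a simple way to check a strategy against a recovery time objective.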

A number of other factors come into play when designing a reliable strategy, and you need to test your chosen strategy on a regular basis. Some of the more advanced multi-region solutions also need careful design. You should evaluate your requirements and choose the solution that is appropriate for your organisation. Feel free to reach out to us if your organisation needs any help evaluating or implementing DR solutions in Azure.

Additional reading

More information about the Azure outage

Enterprise-scale disaster recovery (Azure Architecture Center)