What You Need to Know About Fault Domains in Azure


Fault Domains (FDs) are a critical concept in Azure's high availability and resiliency strategy.

They are designed to help you ensure that your application remains operational even when hardware failures or other infrastructure issues occur.

Here's everything you need to know about Fault Domains in Azure.

Definition of Fault Domain (FD)

A Fault Domain is a logical grouping of physical hardware within an Azure data center.

It represents a single point of failure in the infrastructure, such as a server rack, power supply, or network switch.

Azure uses fault domains to distribute VMs across different physical resources to ensure that if one part of the infrastructure fails, not all of your VMs are impacted.

Fault Domain = A physical unit of failure (e.g., a server rack or a power supply).

Azure spreads the VMs in the same Availability Set or Virtual Machine Scale Set (VMSS) across the available fault domains, so that no single fault domain holds all of them.

Purpose of Fault Domains

The primary purpose of fault domains is to provide protection against hardware failures by isolating VMs into different physical infrastructure groups.

This isolation ensures that if a hardware failure occurs in one fault domain (e.g., server failure, network issue, or power outage), only the VMs in that specific fault domain are affected, and the other VMs remain unaffected.

Key Points

Hardware Failures

Fault domains protect against failures at the physical hardware level, such as the failure of a specific server rack, network switch, or power supply.

Isolation

VMs in different fault domains are isolated from each other at the hardware level, meaning that a failure in one fault domain doesn’t affect VMs in other fault domains.

High Availability

Distributing your VMs across multiple fault domains ensures high availability and resilience in the event of hardware failures.

How Fault Domains Work

When you create an Availability Set or Virtual Machine Scale Set (VMSS), Azure automatically distributes VMs across multiple fault domains.

A fault domain typically represents one server rack in a data center that shares resources such as networking, power, and cooling.

Azure ensures that VMs in an Availability Set are placed across multiple fault domains to protect them from hardware failures.

Example

Suppose you deploy 6 VMs in an Availability Set with 3 fault domains. Azure assigns the VMs to fault domains in rotation as they are created, so the result looks like this:

  1. Fault Domain 1: VMs 1, 4

  2. Fault Domain 2: VMs 2, 5

  3. Fault Domain 3: VMs 3, 6

If there is a hardware failure in Fault Domain 1, only VMs 1 and 4 will be affected.

VMs 2, 3, 5, and 6 (which are in other fault domains) will continue to run without disruption.

Fault Domain Limits

Default Fault Domains

In an Availability Set, Azure typically provides 2 fault domains by default, but it can support up to 3 fault domains in most regions and VM sizes.

Fault Domain Maximum

The maximum number of fault domains varies depending on the region and the VM size you are using.

Azure generally supports up to 3 fault domains in most regions.

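For VM Scale Sets (VMSS), the limits differ: a regional (non-zonal) scale set can typically use up to 5 fault domains, while a zonal scale set commonly uses a fault domain count of 1 per zone, which tells Azure to spread instances across as many racks as possible on a best-effort basis. The exact options depend on the orchestration mode and the region.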

Example

If your region supports 3 fault domains, and you deploy 9 VMs in an Availability Set, Azure will distribute them across the 3 fault domains:

  1. Fault Domain 1: VMs 1, 4, 7

  2. Fault Domain 2: VMs 2, 5, 8

  3. Fault Domain 3: VMs 3, 6, 9

In this case, if Fault Domain 1 experiences a failure, only the VMs in Fault Domain 1 will be impacted, leaving the VMs in Fault Domains 2 and 3 unaffected.
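
The distribution above can be reproduced with a small Python sketch. It is a simplified model that assumes a plain rotating assignment; the real placement is decided by the Azure platform, and the function names `place_round_robin` and `unaffected_vms` are made up for this illustration.

```python
def place_round_robin(vm_count, fd_count):
    """Assign VMs 1..vm_count to fault domains 1..fd_count in rotation.

    Simplified model only: the real placement is done by the platform.
    """
    placement = {fd: [] for fd in range(1, fd_count + 1)}
    for vm in range(1, vm_count + 1):
        placement[(vm - 1) % fd_count + 1].append(vm)
    return placement


def unaffected_vms(placement, failed_fd):
    """Return the VMs that keep running when one fault domain fails."""
    return sorted(
        vm
        for fd, vms in placement.items()
        if fd != failed_fd
        for vm in vms
    )


placement = place_round_robin(9, 3)
print(placement)                               # {1: [1, 4, 7], 2: [2, 5, 8], 3: [3, 6, 9]}
print(unaffected_vms(placement, failed_fd=1))  # [2, 3, 5, 6, 8, 9]
```

Losing Fault Domain 1 removes three of the nine VMs, so two thirds of the deployment keeps running.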

Fault Domain vs. Update Domain

| Aspect | Fault Domain (FD) | Update Domain (UD) |
| --- | --- | --- |
| Definition | A grouping of VMs that share physical hardware (e.g., a rack, power supply, or network switch) | A grouping of VMs that are updated together during planned maintenance |
| Purpose | To protect against hardware failures (e.g., server crashes, power failures) | To protect against downtime during planned updates (e.g., patches, reboots) |
| Scope | Hardware-level isolation | Logical grouping for maintenance and updates |
| Impact | A failure in a fault domain affects only the VMs in that domain | Maintenance is applied to one update domain at a time, so only the VMs in that domain are briefly unavailable |
| Number of Domains | Typically 2 or 3 fault domains in most regions | 5 update domains by default, up to 20 |
| Management | Managed by Azure to distribute VMs across hardware resources for redundancy | Managed by Azure for software updates and maintenance |
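
To make the difference concrete, here is a minimal Python sketch in which every VM gets both a fault domain and an update domain. It assumes zero-based VM indexes, 3 fault domains, 5 update domains, and a plain rotating assignment; `domains_for` is a hypothetical helper, not an Azure API.

```python
def domains_for(vm_index, fd_count=3, ud_count=5):
    """Return the (fault domain, update domain) pair for a VM.

    Assumption: both domains are assigned in rotation by VM index.
    """
    return vm_index % fd_count, vm_index % ud_count


for i in range(6):
    fd, ud = domains_for(i)
    print(f"VM {i}: FD {fd}, UD {ud}")
```

In this model VM 0 and VM 3 share a fault domain (a rack failure hits both), while VM 0 and VM 5 share an update domain (planned maintenance reboots them together).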

Fault Domain and High Availability

Deploying VMs across multiple fault domains is a best practice to ensure high availability in the event of a hardware failure.

By distributing VMs across separate fault domains, you ensure that if one part of the hardware infrastructure (e.g., a server rack or power supply) fails, the other VMs continue to run without being impacted.

High Availability Considerations

To qualify for the 99.95% uptime SLA, deploy at least two VMs in the same Availability Set so that they are spread across multiple fault domains.

For more critical applications, consider deploying more VMs and spreading them across 3 fault domains for even better resilience.

Example

If you have an application running on 2 VMs in an Availability Set with 2 fault domains, the failure of one fault domain takes down one of the two VMs, so the application keeps running but loses half of its capacity.

If you instead run 3 VMs across 3 fault domains, a single hardware failure removes only one third of your capacity.
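
The effect of the fault domain count on capacity can be estimated with a few lines of Python. This is a rough sketch that assumes VMs are spread as evenly as possible and that exactly one fault domain fails; `worst_case_loss` is a hypothetical helper name.

```python
import math


def worst_case_loss(vm_count, fd_count):
    """Fraction of VMs lost when the fullest fault domain fails.

    Assumption: VMs are spread as evenly as possible across the domains.
    """
    used_domains = min(vm_count, fd_count)
    largest_group = math.ceil(vm_count / used_domains)
    return largest_group / vm_count


print(worst_case_loss(2, 2))  # 0.5
print(worst_case_loss(6, 2))  # 0.5
print(worst_case_loss(6, 3))  # one third of the VMs
```

Note that an uneven split reduces the benefit: 4 VMs over 3 fault domains still puts 2 VMs in one domain, so the worst case is again half of the capacity.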

Availability Sets and Fault Domains

When you use Availability Sets in Azure, you can specify the number of fault domains across which the VMs should be distributed.

The Availability Set helps you achieve high availability by spreading VMs across multiple fault domains.

Fault Domain Allocation

Azure will automatically allocate VMs in the Availability Set across available fault domains.

By default, it uses 2 fault domains, but you can configure up to 3 fault domains in most regions.
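
As an illustration, the following Python sketch builds the JSON for an Availability Set resource as it might appear in an ARM template. The property names `platformFaultDomainCount` and `platformUpdateDomainCount` come from the `Microsoft.Compute/availabilitySets` resource type; the resource name, location, and `apiVersion` value are assumptions chosen for the example, and the upper bound of 3 follows the limit described above.

```python
import json


def availability_set_resource(name, location, fault_domains=2, update_domains=5):
    """Build an ARM-style resource definition for an Availability Set."""
    # Assumption: at most 3 fault domains, as described in this article.
    if not 1 <= fault_domains <= 3:
        raise ValueError("fault_domains must be between 1 and 3")
    return {
        "type": "Microsoft.Compute/availabilitySets",
        "apiVersion": "2023-03-01",  # assumed version; use one valid for your subscription
        "name": name,
        "location": location,
        "sku": {"name": "Aligned"},  # required for VMs with managed disks
        "properties": {
            "platformFaultDomainCount": fault_domains,
            "platformUpdateDomainCount": update_domains,
        },
    }


# "web-avset" and "eastus" are placeholder values.
print(json.dumps(availability_set_resource("web-avset", "eastus", fault_domains=3), indent=2))
```

Validating the count before deployment is a design choice for the sketch; Azure itself rejects values that the chosen region does not support.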

Redundancy

By deploying VMs across fault domains, you ensure redundancy for your application.

If one fault domain goes down (due to power failure or server failure), VMs in other fault domains will continue to serve traffic.

Best Practices

Minimum of 2 VMs

Ensure that you have at least 2 VMs in an Availability Set with 2 or 3 fault domains to meet the 99.95% uptime SLA.

More Fault Domains

If possible, spread VMs across 3 fault domains to increase resilience against infrastructure failures.

Fault Domains in Virtual Machine Scale Sets (VMSS)

Virtual Machine Scale Sets (VMSS) also support fault domains for distributing VMs across physical infrastructure. VMSS is particularly useful when scaling your application automatically based on demand.

VMSS Fault Domain Configuration

When configuring VMSS, you can specify the number of fault domains to use for distributing instances.

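The default depends on the orchestration mode and on whether the scale set is zonal, so set the fault domain count explicitly (the platformFaultDomainCount property) if your design depends on a specific value.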

Automatic Scaling

VMSS automatically distributes new instances across the configured fault domains to ensure that instances are not placed in the same physical hardware group.
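
One way to picture this balancing during scale-out is a "least-populated domain first" rule. The Python sketch below is only a model under that assumption; the actual allocation logic is internal to Azure, and `add_instance` is a hypothetical name.

```python
def add_instance(counts):
    """Place one new instance in the least-populated fault domain.

    `counts` holds the number of instances per fault domain and is
    updated in place. Returns the index of the chosen fault domain.
    """
    fd = min(range(len(counts)), key=lambda i: counts[i])
    counts[fd] += 1
    return fd


counts = [2, 1, 1]  # instances currently in FD 0, FD 1, FD 2
for _ in range(3):
    print("new instance placed in FD", add_instance(counts))
print(counts)  # [3, 2, 2]
```

With this rule the instance counts of any two fault domains never differ by more than one once the set is balanced.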

Fault Tolerance

By deploying VMs in different fault domains, VMSS ensures that your application remains available even in the event of hardware failure.

Monitoring and Alerts for Fault Domains

Azure provides several tools to help you monitor the health and performance of VMs across fault domains:

Azure Monitor

Use Azure Monitor to track the health and availability of your VMs and to raise alerts when instances become unavailable. A hardware fault in one fault domain typically shows up as the VMs in that domain becoming unavailable at the same time.

Azure Service Health

Azure Service Health reports service issues, planned maintenance, and health advisories for the regions and services you use.

It provides updates on platform issues that might impact the VMs in your fault domains.

Health Monitoring

Use Health Probes (e.g., from Azure Load Balancer) to continuously monitor the health of VMs across fault domains and ensure that traffic is routed only to healthy instances.
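
The routing behavior can be sketched in Python as follows. This is a toy model, not the Azure Load Balancer implementation: the backend list, the `healthy` flag, and the helpers `fail_fault_domain` and `route` are all invented for the illustration.

```python
from itertools import cycle


def fail_fault_domain(backends, fd):
    """Return a copy of the backends with every VM in `fd` marked unhealthy."""
    return [{**b, "healthy": b["healthy"] and b["fd"] != fd} for b in backends]


def route(backends, request_count):
    """Distribute requests round-robin across healthy backends only."""
    pool = [b["name"] for b in backends if b["healthy"]]
    if not pool:
        raise RuntimeError("no healthy backends available")
    picker = cycle(pool)
    return [next(picker) for _ in range(request_count)]


backends = [
    {"name": "vm1", "fd": 0, "healthy": True},
    {"name": "vm2", "fd": 1, "healthy": True},
    {"name": "vm3", "fd": 2, "healthy": True},
]

print(route(backends, 4))                        # ['vm1', 'vm2', 'vm3', 'vm1']
print(route(fail_fault_domain(backends, 0), 4))  # ['vm2', 'vm3', 'vm2', 'vm3']
```

After fault domain 0 fails, traffic continues to flow to the VMs in the remaining fault domains without any change on the client side.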

Fault Domain and Availability Zones

For protection against the failure of an entire data center, use Availability Zones (AZs) in addition to thinking about fault domains.

Availability Zones are physically separate locations within a region, each made up of one or more data centers with independent power, cooling, and networking. Each AZ effectively acts as a fault domain at data center scale.

By distributing your VMs across multiple Availability Zones, you get fault isolation at the data center level, which is more resilient than just using fault domains within a single data center. Note that a VM can belong to an Availability Set or be pinned to an Availability Zone, but not both.

Example

If you deploy VMs in an Availability Set with 3 fault domains, you’ll be protected from rack-level hardware failures, but all of the VMs still sit in a single data center.

If you instead deploy the VMs (or a zone-spanning VMSS) across multiple Availability Zones, you’ll gain extra resilience, as the entire application is spread across physically isolated data centers.

Summary

Fault Domains in Azure are essential for protecting against hardware failures and ensuring high availability.

By distributing VMs across multiple fault domains, Azure isolates your VMs from potential disruptions caused by server rack failures, power issues, or network problems.

Key considerations include:

  1. Fault Domain Limits: Typically, 2-3 fault domains are supported in Azure regions.

  2. VM Placement: Azure automatically distributes VMs across fault domains in Availability Sets or VMSS.

  3. High Availability: Run at least 2 VMs in an Availability Set, spread across 2 or 3 fault domains, to qualify for Azure’s 99.95% uptime SLA.

  4. Load Balancing: Use an Azure Load Balancer to ensure traffic is routed to healthy VMs in different fault domains.

By understanding and properly configuring fault domains, you can increase the reliability and uptime of your Azure workloads, even in the face of physical infrastructure failures.

Rajnish, MCT