Fault Domains (FDs) are a critical concept in Azure's high availability and resiliency strategy.
They are designed to help you ensure that your application remains operational even when hardware failures or other infrastructure issues occur.
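This article covers how Fault Domains work in Azure, their limits, and how they relate to Update Domains, Availability Sets, Virtual Machine Scale Sets, and Availability Zones.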
Definition of Fault Domain (FD)
A Fault Domain is a logical grouping of physical hardware within an Azure data center.
It represents a single point of failure in the infrastructure, such as a server rack, power supply, or network switch.
Azure uses fault domains to distribute VMs across different physical resources to ensure that if one part of the infrastructure fails, not all of your VMs are impacted.
Fault Domain = A physical unit of failure (e.g., a server rack or a power supply).
Azure spreads the VMs in the same availability set or virtual machine scale set (VMSS) across different fault domains. When there are more VMs than fault domains, several VMs share a fault domain.
Purpose of Fault Domains
The primary purpose of fault domains is to provide protection against hardware failures by isolating VMs into different physical infrastructure groups.
This isolation ensures that if a hardware failure occurs in one fault domain (e.g., server failure, network issue, or power outage), only the VMs in that specific fault domain are affected, and the other VMs remain unaffected.
Key Points
Hardware Failures
Fault domains protect against failures at the physical hardware level, such as the failure of a specific server rack, network switch, or power supply.
Isolation
VMs in different fault domains are isolated from each other at the hardware level, meaning that a failure in one fault domain doesn’t affect VMs in other fault domains.
High Availability
Distributing your VMs across multiple fault domains ensures high availability and resilience in the event of hardware failures.
How Fault Domains Work
When you create an Availability Set or Virtual Machine Scale Set (VMSS), Azure automatically distributes VMs across multiple fault domains.
A fault domain typically represents one server rack in a data center that shares resources such as networking, power, and cooling.
Azure ensures that VMs in an Availability Set are placed across multiple fault domains to protect them from hardware failures.
Example
Suppose you deploy 6 VMs in an Availability Set with 3 fault domains. Azure decides the exact assignment, but a balanced placement could look like this:
Fault Domain 1: VMs 1, 2
Fault Domain 2: VMs 3, 4
Fault Domain 3: VMs 5, 6
If there is a hardware failure in Fault Domain 1, only VMs 1 and 2 will be affected.
VMs 3, 4, 5, and 6 (which are in other fault domains) will continue to run without disruption.
Fault Domain Limits
Default Fault Domains
In an Availability Set, Azure typically provides 2 fault domains by default, but it can support up to 3 fault domains in most regions and VM sizes.
Fault Domain Maximum
The maximum number of fault domains varies depending on the region and the VM size you are using.
Azure generally supports up to 3 fault domains in most regions.
For VM Scale Sets (VMSS), the limit is not always 3: it depends on the orchestration mode, on whether the scale set is regional or zonal, and on the region, and it can be higher than 3 (regional scale sets can use up to 5 fault domains in some regions).
Example
If your region supports 3 fault domains, and you deploy 9 VMs in an Availability Set, Azure will spread them evenly across the 3 fault domains, for example:
Fault Domain 1: VMs 1, 4, 7
Fault Domain 2: VMs 2, 5, 8
Fault Domain 3: VMs 3, 6, 9
In this case, if Fault Domain 1 experiences a failure, only the VMs in Fault Domain 1 will be impacted, leaving the VMs in Fault Domains 2 and 3 unaffected.
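The even distribution in this example can be sketched in a few lines of Python. This is a minimal illustration that assumes a simple round-robin assignment; `assign_fault_domains` and `surviving_vms` are hypothetical helper names, and the real placement is decided by Azure's allocator.

```python
from typing import Dict, List


def assign_fault_domains(vm_count: int, fd_count: int) -> Dict[int, List[int]]:
    """Assign VMs 1..vm_count to fault domains 1..fd_count in round-robin order.

    Assumption: round-robin is used only for illustration.
    """
    if vm_count < 0 or fd_count < 1:
        raise ValueError("vm_count must be >= 0 and fd_count must be >= 1")
    placement: Dict[int, List[int]] = {fd: [] for fd in range(1, fd_count + 1)}
    for vm in range(1, vm_count + 1):
        fd = (vm - 1) % fd_count + 1  # VM 1 -> FD 1, VM 2 -> FD 2, ...
        placement[fd].append(vm)
    return placement


def surviving_vms(placement: Dict[int, List[int]], failed_fd: int) -> List[int]:
    """Return the VMs that keep running when one fault domain fails."""
    return sorted(
        vm for fd, vms in placement.items() if fd != failed_fd for vm in vms
    )


placement = assign_fault_domains(9, 3)
print(placement)
print(surviving_vms(placement, failed_fd=1))
```

With 9 VMs and 3 fault domains, the sketch reproduces the layout listed in the example, and a failure of Fault Domain 1 leaves six VMs running.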
Fault Domain vs. Update Domain
Aspect | Fault Domain (FD) | Update Domain (UD) |
---|---|---|
Definition | A grouping of VMs across different physical hardware (e.g., racks, power supplies) | A grouping of VMs that are updated together during planned maintenance |
Purpose | To protect against hardware failures (e.g., server crashes, power failures) | To protect against downtime during planned updates (e.g., patches, reboots) |
Scope | Hardware-level isolation | Logical grouping for maintenance and updates |
Impact of Failure | A hardware failure in a fault domain affects only the VMs in that domain | Planned maintenance reboots one update domain at a time, so only the VMs in that domain are briefly unavailable |
Number of Domains | Typically 2 or 3 fault domains in most regions | Typically up to 20 update domains |
Management | Managed by Azure to distribute VMs across hardware resources for redundancy | Managed by Azure for software updates and maintenance |
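To make the difference between the two groupings concrete, the following Python sketch gives every VM both a fault domain and an update domain. It is an illustration only: the modulo assignment, the `Placement` type, and the helper names are assumptions, not Azure's actual placement logic.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    vm: int
    fault_domain: int
    update_domain: int


def place(vm_count: int, fd_count: int = 3, ud_count: int = 5) -> List[Placement]:
    """Give each VM a fault domain and an update domain (0-based, modulo assignment)."""
    return [Placement(vm, vm % fd_count, vm % ud_count) for vm in range(vm_count)]


def hit_by_hardware_failure(placements: List[Placement], fd: int) -> List[int]:
    """VMs that go down when the given fault domain fails."""
    return [p.vm for p in placements if p.fault_domain == fd]


def hit_by_maintenance(placements: List[Placement], ud: int) -> List[int]:
    """VMs that reboot when the given update domain is being patched."""
    return [p.vm for p in placements if p.update_domain == ud]


vms = place(6)
print(hit_by_hardware_failure(vms, fd=0))
print(hit_by_maintenance(vms, ud=0))
```

In this sketch a failure of fault domain 0 and maintenance on update domain 0 hit different sets of VMs, which is why the two domain types are tracked separately.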
Fault Domain and High Availability
Deploying VMs across multiple fault domains is a best practice to ensure high availability in the event of a hardware failure.
By distributing VMs across separate fault domains, you ensure that if one part of the hardware infrastructure (e.g., a server rack or power supply) fails, the other VMs continue to run without being impacted.
High Availability Considerations
To qualify for the 99.95% uptime SLA, deploy at least two VMs in the same Availability Set so that they are spread across multiple fault domains.
For more critical applications, consider deploying more VMs and spreading them across 3 fault domains for even better resilience.
Example
If you have an application running on 2 VMs in an Availability Set with 2 fault domains, the failure of one fault domain takes down one of the two VMs, leaving the application on half its capacity.
However, if you distribute 3 or more VMs across 3 fault domains, a single hardware failure removes only about a third of your capacity.
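A rough way to compare layouts is to compute the share of VMs that keeps running after the most heavily loaded fault domain fails. The Python sketch below assumes VMs are spread as evenly as possible; `worst_case_remaining` is a hypothetical helper, not an Azure API.

```python
import math


def worst_case_remaining(vm_count: int, fd_count: int) -> float:
    """Fraction of VMs still running after the fullest fault domain fails.

    Assumes VMs are spread as evenly as possible across the fault domains.
    """
    if vm_count < 1 or fd_count < 1:
        raise ValueError("vm_count and fd_count must both be >= 1")
    used_domains = min(vm_count, fd_count)  # cannot use more domains than VMs
    largest_group = math.ceil(vm_count / used_domains)
    return (vm_count - largest_group) / vm_count


for vms, fds in [(2, 2), (3, 3), (6, 2), (6, 3)]:
    print(f"{vms} VMs / {fds} FDs -> {worst_case_remaining(vms, fds):.0%} remain")
```

Under these assumptions, 2 fault domains leave half the VMs running after a single failure, while 3 fault domains leave about two thirds.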
Availability Sets and Fault Domains
When you use Availability Sets in Azure, you can specify the number of fault domains across which the VMs should be distributed.
The Availability Set helps you achieve high availability by spreading VMs across multiple fault domains.
Fault Domain Allocation
Azure will automatically allocate VMs in the Availability Set across available fault domains.
By default, it uses 2 fault domains, but you can configure up to 3 fault domains in most regions.
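As a sketch of how the fault domain count is expressed in a deployment, the Python snippet below builds an ARM-style resource definition for an Availability Set. The `platformFaultDomainCount` and `platformUpdateDomainCount` properties belong to the `Microsoft.Compute/availabilitySets` resource type; the resource name, location, and `apiVersion` value are placeholder assumptions, and the 1 to 3 range check simply mirrors the limit described in this article (the real maximum depends on the region).

```python
import json
from typing import Any, Dict


def availability_set_resource(
    name: str,
    location: str,
    fault_domains: int = 2,
    update_domains: int = 5,
) -> Dict[str, Any]:
    """Build an ARM-style Availability Set definition as a plain dict."""
    if not 1 <= fault_domains <= 3:
        raise ValueError("fault_domains must be between 1 and 3 in this sketch")
    if not 1 <= update_domains <= 20:
        raise ValueError("update_domains must be between 1 and 20")
    return {
        "type": "Microsoft.Compute/availabilitySets",
        "apiVersion": "2023-03-01",  # assumption: pick the version you target
        "name": name,
        "location": location,
        "sku": {"name": "Aligned"},  # used with VMs that have managed disks
        "properties": {
            "platformFaultDomainCount": fault_domains,
            "platformUpdateDomainCount": update_domains,
        },
    }


# "web-avset" and "eastus" are hypothetical example values.
print(json.dumps(availability_set_resource("web-avset", "eastus", 3), indent=2))
```

The resulting JSON could be placed in the `resources` array of an ARM template; validate it against the region you deploy to before relying on a count of 3.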
Redundancy
By deploying VMs across fault domains, you ensure redundancy for your application.
If one fault domain goes down (due to power failure or server failure), VMs in other fault domains will continue to serve traffic.
Best Practices
Minimum of 2 VMs
Ensure that you have at least 2 VMs in an Availability Set with 2 or 3 fault domains to meet the 99.95% uptime SLA.
More Fault Domains
If possible, spread VMs across 3 fault domains to increase resilience against infrastructure failures.
Fault Domains in Virtual Machine Scale Sets (VMSS)
Virtual Machine Scale Sets (VMSS) also support fault domains for distributing VMs across physical infrastructure. VMSS is particularly useful when scaling your application automatically based on demand.
VMSS Fault Domain Configuration
When configuring VMSS, you can specify the number of fault domains to use for distributing instances.
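The default depends on the orchestration mode and on whether the scale set is regional or zonal, so check the fault domain count of your own deployment rather than assuming the Availability Set default of 2.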
Automatic Scaling
VMSS automatically distributes new instances across the configured fault domains to ensure that instances are not placed in the same physical hardware group.
Fault Tolerance
By deploying VMs in different fault domains, VMSS ensures that your application remains available even in the event of hardware failure.
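The balancing behaviour during scale-out can be pictured with a small Python sketch that always adds a new instance to the least-populated fault domain. This is an assumed strategy for illustration only; `pick_fault_domain` and `scale_out` are hypothetical names, and the actual VMSS placement algorithm is not described here.

```python
from typing import Dict


def pick_fault_domain(counts: Dict[int, int]) -> int:
    """Return the fault domain with the fewest instances (lowest index wins ties)."""
    if not counts:
        raise ValueError("at least one fault domain is required")
    return min(sorted(counts), key=lambda fd: counts[fd])


def scale_out(counts: Dict[int, int], new_instances: int) -> Dict[int, int]:
    """Add instances one at a time, each to the currently emptiest fault domain."""
    updated = dict(counts)  # leave the caller's mapping untouched
    for _ in range(new_instances):
        updated[pick_fault_domain(updated)] += 1
    return updated


before = {0: 2, 1: 1, 2: 1}  # instances per fault domain
print(scale_out(before, 3))
```

Starting from an uneven layout, the three new instances first fill the emptier fault domains, so no single domain ends up holding most of the capacity.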
Monitoring and Alerts for Fault Domains
Azure provides several tools to help you monitor the health and performance of VMs across fault domains:
Azure Monitor
Use Azure Monitor to track the health and availability of individual VMs and to alert on issues. A rack or power supply failure typically shows up as several VMs in the same fault domain becoming unavailable at once.
Azure Service Health
Azure Service Health helps you monitor the availability of the Azure services and regions you use.
It provides updates on platform issues and planned maintenance that might impact the VMs in your fault domains.
Health Monitoring
Use Health Probes (e.g., from Azure Load Balancer) to continuously monitor the health of VMs across fault domains and ensure that traffic is routed only to healthy instances.
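The effect of health probes can be illustrated with a short Python simulation: instances whose fault domain has failed are marked unhealthy, and only the remaining instances receive traffic. The `Instance` type and both helper functions are hypothetical and only model the idea; they do not call Azure Load Balancer.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    name: str
    fault_domain: int
    healthy: bool = True


def fail_fault_domain(instances: List[Instance], failed_fd: int) -> List[Instance]:
    """Mark every instance in the failed fault domain as unhealthy."""
    return [
        i._replace(healthy=False) if i.fault_domain == failed_fd else i
        for i in instances
    ]


def routable(instances: List[Instance]) -> List[str]:
    """Names of instances that would pass the probe and receive traffic."""
    return [i.name for i in instances if i.healthy]


pool = [Instance(f"vm{n}", fault_domain=n % 3) for n in range(6)]
after_failure = fail_fault_domain(pool, failed_fd=0)
print(routable(after_failure))
```

After fault domain 0 fails, traffic in this model goes only to the four instances in fault domains 1 and 2.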
Fault Domain and Availability Zones
For even greater protection against data center failures, you can combine Fault Domains with Availability Zones (AZs).
Availability Zones provide physical isolation of resources across separate data centers within a region. Each AZ acts as a larger fault domain of its own.
By distributing your VMs across multiple Availability Zones, you get fault isolation at the data center level, which is more resilient than relying on fault domains within a single data center.
Example
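A VM is placed either in an Availability Set or in an Availability Zone, not both, so a zonal deployment typically uses a Virtual Machine Scale Set or individually zoned VMs. Instances that run in Availability Zone 1 can still be spread across fault domains within that zone, which protects you from hardware failures in that zone.
If you extend the deployment to multiple Availability Zones, you’ll gain extra resilience, as the entire application is spread across physically isolated data centers.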
Summary
Fault Domains in Azure are essential for protecting against hardware failures and ensuring high availability.
By distributing VMs across multiple fault domains, Azure isolates your VMs from potential disruptions caused by server rack failures, power issues, or network problems.
Key considerations include:
Fault Domain Limits: Typically, 2-3 fault domains are supported in Azure regions.
VM Placement: Azure automatically distributes VMs across fault domains in Availability Sets or VMSS.
High Availability: Deploy at least 2 VMs across 2-3 fault domains to ensure high availability and qualify for Azure’s 99.95% uptime SLA.
Load Balancing: Use an Azure Load Balancer to ensure traffic is routed to healthy VMs in different fault domains.
By understanding and properly configuring fault domains, you can increase the reliability and uptime of your Azure workloads, even in the face of physical infrastructure failures.