
High Availability Systems - Active-Active and Active-Passive Architectures

A comprehensive guide to high availability system design, focusing on Active-Active and Active-Passive architectures, how they work, their trade-offs, failover strategies, load balancing, replication, and real-world production patterns used in large distributed systems.


Introduction

Modern digital platforms are expected to be available almost all the time.

Users expect the services they rely on to work 24/7 with minimal downtime.

However, real systems face many failures:

| Failure Type | Example |
|---|---|
| Hardware failure | Server crash |
| Network failure | Data center network outage |
| Software bugs | Memory leaks |
| Infrastructure outage | Entire region down |
| Deployment issues | Faulty release |

To handle such failures, systems are designed with High Availability (HA).

High availability ensures that:

A system continues operating even when components fail.

One of the most important architectural techniques used to achieve HA is redundancy.

Two common redundancy patterns are:

| Pattern | Description |
|---|---|
| **Active-Active** | Multiple nodes actively serve traffic simultaneously |
| **Active-Passive** | One node serves traffic while another remains on standby |

What is High Availability?

High availability is the ability of a system to remain operational for long periods with minimal downtime.

Availability is usually expressed as uptime percentage.

| Availability | Downtime per Year |
|---|---|
| 99% | ~3.65 days |
| 99.9% | ~8.76 hours |
| 99.99% | ~52 minutes |
| 99.999% (Five Nines) | ~5 minutes |

Large-scale systems aim for four or five nines availability.
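The downtime figures in the table follow directly from the uptime percentage. The sketch below reproduces them, assuming a 365-day year; the function name `downtime_minutes_per_year` is illustrative and not part of any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

For example, 99.9% leaves 525.6 minutes per year, which is the ~8.76 hours shown above.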


Why High Availability Matters

Without HA architecture:

Diagram
flowchart LR Users --> AppServer AppServer --> Database

If AppServer fails, the entire system stops working.

Single points of failure are dangerous in distributed systems.


High Availability Through Redundancy

High availability is achieved by running multiple instances of critical components.

Diagram
flowchart LR Users --> LoadBalancer LoadBalancer --> Server1 LoadBalancer --> Server2 LoadBalancer --> Server3

If one server fails, others continue serving traffic.
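A rough calculation shows why extra instances help. The sketch below assumes that instances fail independently and that the service is up whenever at least one instance is up; real failures are often correlated (shared power, network, or a bad release), so treat the result as an optimistic estimate. The function name is illustrative.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances, assuming independent failures.

    The service is down only when every instance is down at the same time,
    which happens with probability (1 - a) ** N.
    """
    if instances < 1:
        raise ValueError("at least one instance is required")
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return 1 - (1 - instance_availability) ** instances


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "instance(s):", f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two instances that are each 99% available give roughly 99.99% combined availability.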

This redundancy can be implemented using two primary patterns:

  1. Active-Active Architecture
  2. Active-Passive Architecture

Active-Active Architecture

Concept

In an Active-Active system, multiple servers actively handle traffic at the same time.

All nodes are live and processing requests.

Diagram
flowchart LR Users --> LoadBalancer LoadBalancer --> Server1 LoadBalancer --> Server2 LoadBalancer --> Server3 Server1 --> DatabaseCluster Server2 --> DatabaseCluster Server3 --> DatabaseCluster

If one server fails:

  • the load balancer stops routing traffic to it
  • other servers continue serving users

Users typically do not notice any disruption.
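The routing behavior described above can be sketched as a round-robin balancer that skips servers marked unhealthy. This is a minimal in-memory illustration, not how any particular load balancer is implemented; the class `RoundRobinBalancer` and its methods are hypothetical names.

```python
class RoundRobinBalancer:
    """Rotate requests across servers, skipping any that are marked down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        self._healthy.discard(server)

    def mark_up(self, server):
        if server in self._servers:
            self._healthy.add(server)

    def pick(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["Server1", "Server2", "Server3"])
    lb.mark_down("Server2")  # simulate a failed node
    print([lb.pick() for _ in range(4)])
```

With `Server2` marked down, requests alternate between `Server1` and `Server3` without any change on the client side.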


Example Scenario

Consider a video streaming platform like Netflix.

Requests arrive globally.

Traffic is distributed across multiple active servers:

Diagram
flowchart TD Users --> GlobalLoadBalancer GlobalLoadBalancer --> Region1 GlobalLoadBalancer --> Region2 Region1 --> App1 Region1 --> App2 Region2 --> App3 Region2 --> App4

All application instances handle requests simultaneously.


Characteristics of Active-Active Systems

| Characteristic | Explanation |
|---|---|
| All nodes serve traffic | No idle resources |
| High scalability | Add more nodes to scale |
| Automatic failover | Traffic shifts automatically |
| Better resource utilization | Every node works |

Active-Active Data Replication

To maintain consistency, data must be replicated across nodes.

Two approaches:

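| Method | Description |
|---|---|
| Synchronous replication | A write is acknowledged only after the other nodes have also applied it |
| Asynchronous replication | A write is acknowledged immediately and propagated to the other nodes afterwards, so they may briefly lag |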
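The difference between the two approaches is easiest to see in terms of when a replica observes a write. The toy model below uses in-memory dictionaries in place of real databases; `Primary`, `Replica`, and `replicate_pending` are hypothetical names chosen for illustration.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    """Accepts writes and copies them to replicas, either inline or later."""

    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = list(replicas)
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet shipped (asynchronous mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Every replica applies the write before the caller gets an ack.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Ack right away; replicas catch up when replicate_pending() runs.
            self.pending.append((key, value))
        return "ack"

    def replicate_pending(self):
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(key, value)


if __name__ == "__main__":
    replica = Replica()
    primary = Primary([replica], synchronous=False)
    primary.write("user:1", "alice")
    print("before catch-up:", replica.data)
    primary.replicate_pending()
    print("after catch-up:", replica.data)
```

In the asynchronous case the replica is empty until the pending writes are shipped. Any write still in that window is lost if the primary fails first.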

Multi-Region Active-Active Example

Diagram
flowchart LR Users --> GlobalDNS GlobalDNS --> USRegion GlobalDNS --> EuropeRegion GlobalDNS --> AsiaRegion USRegion --> DB_US EuropeRegion --> DB_EU AsiaRegion --> DB_ASIA

Each region serves traffic.

Data is replicated across regions.
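One way a global routing layer might choose a region is to send each user to the healthy region with the lowest measured latency. The source does not say which routing policy is used, so the sketch below is an assumption, and the latency numbers are made up; region names follow the diagram.

```python
def pick_region(latency_ms_by_region: dict, healthy_regions: set) -> str:
    """Return the healthy region with the lowest latency for this user."""
    candidates = {
        region: latency
        for region, latency in latency_ms_by_region.items()
        if region in healthy_regions
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical latencies as seen by a user located in Europe.
    latencies = {"USRegion": 120, "EuropeRegion": 30, "AsiaRegion": 200}
    print(pick_region(latencies, {"USRegion", "EuropeRegion", "AsiaRegion"}))
    print(pick_region(latencies, {"USRegion", "AsiaRegion"}))  # Europe is down
```

When the nearest region is unhealthy, the user is served from the next closest one, at the cost of higher latency.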


Benefits of Active-Active Architecture

| Benefit | Explanation |
|---|---|
| High availability | No single point of failure |
| Scalability | Nodes can scale horizontally |
| Load distribution | Traffic shared across nodes |
| Faster response | Requests served closer to users |

Challenges of Active-Active Systems

| Challenge | Explanation |
|---|---|
| Data conflicts | Simultaneous writes |
| Replication complexity | Maintaining consistency |
| Higher infrastructure cost | Multiple active nodes |
| Debugging complexity | Distributed failures |
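To make the data-conflict challenge concrete, here is a sketch of one common resolution strategy, last-write-wins. The source does not name a strategy, so this choice is an assumption; it also presumes reasonably synchronized clocks and silently discards the losing write. `VersionedValue` and `resolve_last_write_wins` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedValue:
    value: str
    timestamp: float  # time of the write, as reported by the writing node
    node_id: str      # used only to break ties deterministically


def resolve_last_write_wins(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Keep the write with the newer timestamp; break ties by node id."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


if __name__ == "__main__":
    from_us = VersionedValue("shipping=standard", timestamp=100.0, node_id="us")
    from_eu = VersionedValue("shipping=express", timestamp=100.5, node_id="eu")
    print(resolve_last_write_wins(from_us, from_eu).value)
```

Because the comparison is deterministic, every region that sees both versions converges on the same winner regardless of arrival order.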

Active-Passive Architecture

Concept

In Active-Passive architecture, only one node actively serves traffic.

The other node remains standby (passive).

Diagram
flowchart LR Users --> LoadBalancer LoadBalancer --> ActiveServer ActiveServer --> Database PassiveServer -. Standby .- Database

If the active node fails:

  1. Passive node becomes active
  2. Traffic shifts to the new node

This process is called failover.
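The two failover steps can be sketched as a small state machine. This is an in-memory illustration under simplifying assumptions (health is supplied by the caller, and promotion is instantaneous); `ActivePassivePair` is a hypothetical name.

```python
class ActivePassivePair:
    """Route to the active node; promote the standby when the active fails."""

    def __init__(self, active: str, passive: str):
        self.active = active
        self.passive = passive

    def fail_over(self, healthy_nodes: set) -> None:
        if self.passive not in healthy_nodes:
            raise RuntimeError("no healthy standby to promote")
        # Step 1: the passive node becomes active.
        # The failed node takes the standby slot until it is repaired.
        self.active, self.passive = self.passive, self.active

    def handle_request(self, healthy_nodes: set) -> str:
        if self.active not in healthy_nodes:
            self.fail_over(healthy_nodes)
        # Step 2: traffic now goes to whichever node is currently active.
        return self.active


if __name__ == "__main__":
    pair = ActivePassivePair("ActiveServer", "PassiveServer")
    print(pair.handle_request({"ActiveServer", "PassiveServer"}))
    print(pair.handle_request({"PassiveServer"}))  # the active node has failed
```

If both nodes are unhealthy, the sketch raises an error rather than routing to a node that cannot serve the request.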


Active-Passive Failover Process

Diagram
sequenceDiagram participant User participant LoadBalancer participant ActiveServer participant PassiveServer User->>LoadBalancer: Request LoadBalancer->>ActiveServer: Forward request ActiveServer-->>User: Response Note over ActiveServer: Server failure LoadBalancer->>PassiveServer: Promote to active PassiveServer-->>User: Handle requests

Example: Database Replication

Active-passive is commonly used for databases.

Primary database handles writes.

Replica database stays on standby.

Diagram
flowchart LR App --> PrimaryDB PrimaryDB --> ReplicaDB

If primary fails:

Diagram
flowchart LR App --> ReplicaDB

Replica becomes the new primary.


Characteristics of Active-Passive Systems

| Characteristic | Explanation |
|---|---|
| One active node | Handles all traffic |
| Passive standby | Ready for failover |
| Simpler architecture | Easier to maintain |
| Fewer write conflicts | Only one writer |

Benefits of Active-Passive Architecture

| Benefit | Explanation |
|---|---|
| Simpler design | Easier implementation |
| Easier consistency | Single writer model |
| Reliable failover | Predictable recovery |

Limitations of Active-Passive Systems

| Limitation | Explanation |
|---|---|
| Idle resources | Passive node unused |
| Failover delay | Detection + promotion time |
| Lower scalability | Single active node |
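The failover delay in the table can be estimated with simple arithmetic. The sketch below assumes detection works by counting consecutive failed health checks, and it ignores check timeouts and traffic re-routing time. All parameter names and example numbers are hypothetical.

```python
def estimated_failover_seconds(
    check_interval_s: float,
    failures_before_unhealthy: int,
    promotion_s: float,
) -> float:
    """Rough worst-case failover delay: detection time plus promotion time."""
    detection_s = check_interval_s * failures_before_unhealthy
    return detection_s + promotion_s


if __name__ == "__main__":
    # Checks every 5 s, 3 consecutive failures required, 20 s to promote the standby.
    print(estimated_failover_seconds(5, 3, 20), "seconds")
```

Shortening the check interval or lowering the failure threshold reduces the delay, but makes false positives from brief network blips more likely.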

Failover Mechanisms

Failover detection usually relies on health checks.

Diagram
flowchart TD HealthMonitor --> ActiveServer ActiveServer -->|Healthy| Continue ActiveServer -->|Failure| PromotePassive PromotePassive --> PassiveServer

Common detection methods:

| Method | Description |
|---|---|
| Heartbeat checks | Periodic ping |
| Health endpoints | `/health` API |
| Monitoring alerts | Infrastructure monitoring |
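A heartbeat check can be sketched as a monitor that records when each node last reported in and flags nodes that have been silent for too long. The current time is passed in explicitly so the example is deterministic; a real monitor would more likely read a monotonic clock. `HeartbeatMonitor` and its methods are hypothetical names.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per node and report nodes that went silent."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of most recent heartbeat

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=5)
    monitor.record_heartbeat("ActiveServer", now=0)
    monitor.record_heartbeat("PassiveServer", now=0)
    monitor.record_heartbeat("PassiveServer", now=8)
    print(monitor.failed_nodes(now=10))
```

Here `ActiveServer` has been silent for 10 seconds against a 5-second timeout, so it is the only node reported as failed.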

Active-Active vs Active-Passive Comparison

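| Feature | Active-Active | Active-Passive |
|---|---|---|
| Nodes serving traffic | Multiple | One |
| Resource utilization | High | Lower |
| Complexity | Higher | Simpler |
| Failover time | Near-instant (limited by health-check detection) | Delayed by detection plus promotion |
| Conflict handling | Required | Minimal |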

When to Use Active-Active

Active-active works best when:

| Scenario | Example |
|---|---|
| High traffic systems | Streaming platforms |
| Global applications | Multi-region services |
| Horizontal scaling | Microservices |

Large-scale operators such as Google often use multi-region active-active architectures.


When to Use Active-Passive

Active-passive works best for:

| Scenario | Example |
|---|---|
| Databases | Primary-replica setup |
| Stateful services | Legacy systems |
| Simpler infrastructure | Small to medium systems |

Hybrid High Availability Architecture

Many real systems combine both patterns.

Example:

Diagram
flowchart TD Users --> GlobalLoadBalancer GlobalLoadBalancer --> Region1 GlobalLoadBalancer --> Region2 Region1 --> ActiveDB Region1 --> PassiveDB Region2 --> ActiveDB2 Region2 --> PassiveDB2

Applications run active-active, while databases use active-passive replication.


Multi-Region Disaster Recovery

To survive regional failures, systems replicate across regions.

Example architecture:

Diagram
flowchart LR Users --> DNS DNS --> PrimaryRegion DNS --> SecondaryRegion PrimaryRegion --> DB1 SecondaryRegion --> DB2

If primary region fails:

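  • DNS redirects traffic to the secondary region. Clients keep using cached records until they expire, so the switchover time depends on the DNS record TTL.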

Monitoring High Availability Systems

Important monitoring metrics:

| Metric | Purpose |
|---|---|
| Node health | Detect failures |
| Replication lag | Data consistency |
| Error rates | System reliability |
| Latency | Performance |

Monitoring tools help detect failures early.
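A simple way to act on these metrics is to compare each sample against a threshold and alert on any breach. The metric names and limits below are hypothetical; suitable values depend on the system.

```python
# Hypothetical alert thresholds; tune these for the system being monitored.
THRESHOLDS = {
    "replication_lag_s": 10.0,
    "error_rate": 0.01,
    "p99_latency_ms": 500.0,
}


def breached_metrics(sample: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return the names of metrics whose sampled value exceeds its threshold.

    Metrics missing from the sample are treated as 0.0 and never breach.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if sample.get(name, 0.0) > limit
    )


if __name__ == "__main__":
    sample = {"replication_lag_s": 42.0, "error_rate": 0.002, "p99_latency_ms": 510.0}
    print(breached_metrics(sample))
```

In this sample, replication lag and p99 latency exceed their limits while the error rate stays within budget.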


Best Practices

Remove Single Points of Failure

Always deploy redundant components.


Use Load Balancers

Distribute traffic intelligently.


Monitor System Health

Continuous monitoring ensures faster recovery.


Test Failover Regularly

Chaos engineering helps validate HA systems.

Companies like Netflix use tools like Chaos Monkey to simulate failures.
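The idea behind a failure drill can be shown with a toy simulation: take one randomly chosen server out of rotation and check that a request can still be served. This is only an illustration of the principle and does not reflect how Chaos Monkey itself works; `run_failure_drill` is a hypothetical name.

```python
import random


def run_failure_drill(servers: list, seed: int = 0) -> tuple:
    """Simulate losing one random server and report who serves the next request.

    Returns (victim, served_by); served_by is None when no server survives.
    """
    rng = random.Random(seed)  # seeded so the drill is repeatable
    victim = rng.choice(servers)
    survivors = [server for server in servers if server != victim]
    served_by = survivors[0] if survivors else None
    return victim, served_by


if __name__ == "__main__":
    victim, served_by = run_failure_drill(["App1", "App2", "App3"])
    print(f"terminated {victim}; request served by {served_by}")
```

Running the same drill against a single-server deployment returns `None` for `served_by`, which exposes the single point of failure.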


Final Architecture Overview

Diagram
flowchart TD Users --> GlobalDNS GlobalDNS --> Region1 GlobalDNS --> Region2 Region1 --> LoadBalancer1 Region2 --> LoadBalancer2 LoadBalancer1 --> App1 LoadBalancer1 --> App2 LoadBalancer2 --> App3 LoadBalancer2 --> App4 App1 --> DBCluster App3 --> DBCluster

The system remains available even when servers, nodes, or regions fail.


Key Takeaways

| Concept | Insight |
|---|---|
| High Availability | Ensures systems remain operational during failures |
| Active-Active | Multiple nodes handle traffic simultaneously |
| Active-Passive | One active node with standby backup |
| Failover | Automatic switching to backup systems |
| Hybrid architecture | Combines both models in real-world systems |

High availability design is a core principle of distributed system architecture, enabling modern platforms to deliver reliable services at global scale.