Asynchronous Replication: Speed and Availability at Scale

Learn how asynchronous replication works in distributed databases, including eventual consistency implications, lag monitoring, and practical use cases where speed outweighs strict consistency.

published: reading time: 19 min read author: GeekWorkBench

Asynchronous Replication: Speed and Availability at Scale

Introduction

Asynchronous replication is the workhorse of distributed databases. Writes confirm immediately on the primary. Replicas receive the changes eventually, usually milliseconds later but sometimes much longer. This gives you speed and availability, but it introduces a window where data can be lost if the primary fails.

Most production databases run async by default. The performance benefit is real. The trade-off is accepting a small probability of losing recent writes during failover. For many applications, this is acceptable. For others, it is not.

How Asynchronous Replication Works

The primary processes a write, confirms it to the client, and sends the change to replicas in the background. The client does not wait for replica confirmation.

sequenceDiagram
    participant Client
    participant Primary
    participant Replica

    Client->>Primary: BEGIN; UPDATE...; COMMIT
    Primary->>Primary: Write to local WAL
    Primary-->>Client: COMMIT confirmed
    Note over Primary,Replica: (Background replication)

    rect rgb(200, 200, 200)
        Primary->>Replica: Send WAL entry
        Replica-->>Primary: ACK
    end

This non-blocking behavior is why async replication is fast. Write latency is essentially local disk latency plus a small amount of processing. There is no waiting for remote replica confirmation.

The Replication Lag Problem

With async replication, replicas fall behind the primary. This lag can be seconds or minutes depending on load, network conditions, and replica performance.

# This sequence can fail with async replication
def update_and_read(user_id, new_email):
    db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)
    # Write confirmed immediately
    # But replica might not have the update yet

    user = db_replica.execute("SELECT email FROM users WHERE id = ?", user_id)
    # Might return old email if replica is lagging
    return user['email']

This is the fundamental tension. Async replication is fast, but reads from replicas can return stale data.

Statement-Based vs WAL-Based Replication

Different databases use different mechanisms.

Statement-based replication sends the SQL statements to replicas. MySQL’s original replication used this approach.

-- Primary executes:
UPDATE users SET email = 'new@example.com' WHERE id = 1;

-- Replica executes the same statement:
UPDATE users SET email = 'new@example.com' WHERE id = 1;

The problem: nondeterministic functions like NOW() or RAND() produce different results on different nodes.

WAL-based replication sends the binary WAL entries. PostgreSQL uses this. WAL entries are deterministic because they contain the actual bytes that changed.

# WAL entry contains the actual bytes
wal_entry = {
    'type': 'UPDATE',
    'table': 'users',
    'key': 'id=1',
    'old_data': {'email': 'old@example.com'},
    'new_data': {'email': 'new@example.com'},
    'timestamp': 1711234567
}

Row-based replication sends the actual row changes. MySQL’s row-based replication combines the benefits of both.

Eventual Consistency

Async replication provides eventual consistency. Given enough time without new writes, all replicas will converge to the same state.

flowchart LR
    P[Primary] -->|immediate| R1[Replica 1]
    P -->|immediate| R2[Replica 2]

    R1 -.->|eventually consistent| Same[Same State]
    R2 -.->|eventually consistent| Same

The “eventually” part is important. During the lag window, reads from different replicas can return different results. This is observable inconsistency, not just a theoretical concern.

# User experience with eventual consistency
def get_user_email(user_id):
    # Route to random replica
    replica = random.choice([replica1, replica2, replica3])
    return replica.execute("SELECT email FROM users WHERE id = ?", user_id)

# Two consecutive calls might return different results
email1 = get_user_email(123)  # 'old@example.com'
email2 = get_user_email(123)  # 'new@example.com' if replica2 has newer data

Use Cases Where Async Makes Sense

Async replication is the right choice when:

  • Web sessions and preferences: User profile updates can lag a few seconds without causing real problems
  • Analytics and reporting: Reads for dashboards do not need millisecond freshness
  • Social media feeds: Timeline updates do not need to be immediately consistent
  • Logging and metrics: Time-series data tolerates small gaps
  • Geographic distribution: Replicating across continents with sync would be prohibitively slow
# Good async use case: user preferences
def update_user_preference(user_id, key, value):
    # Write immediately, replicate async
    db.execute("UPDATE preferences SET value = ? WHERE user_id = ? AND key = ?",
               value, user_id, key)
    # User sees update instantly on primary
    # Other replicas catch up shortly

# Bad async use case: account balance
def transfer_funds(from_id, to_id, amount):
    # This needs synchronous replication for correctness
    db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
               amount, from_id)
    db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
               amount, to_id)

Monitoring Replication Lag

Lag monitoring is critical with async replication. You need to know how far behind your replicas are.

-- PostgreSQL: Check replication lag
SELECT client_addr, state,
       sent_lsn, replay_lsn,
       sent_lsn - replay_lsn AS lag_bytes
FROM pg_stat_replication;

-- MySQL: Check replica lag
SHOW REPLICA STATUS\G
# Look at: Seconds_Behind_Master, Relay_Log_Pos
# Application-level lag monitoring
def check_replica_health(replica):
    last_replicated_ts = replica.get_last_replicated_timestamp()
    current_ts = time.time()
    lag_seconds = current_ts - last_replicated_ts

    if lag_seconds > 30:
        alert(f"Replica {replica.name} lag: {lag_seconds}s")
        # Route reads away from this replica
        remove_from_read_pool(replica)

Key metrics to track:

MetricNormalWarningCritical
Replication lag< 1s1-10s> 10s
Replica IO threadRunningStopped
Replica SQL threadRunningStopped
WAL backlog< 100MB100MB-1GB> 1GB

The CAP Theorem and Async Replication

Async replication is the availability (A) choice in the CAP theorem trade-off. During a network partition, an async primary can continue accepting writes. Those writes might not survive if the primary fails before replicas receive them.

flowchart TD
    subgraph Partition
        P[Primary] --> R1[(Replica 1)]
        P -.- X[(Partition)]
        R2[(Replica 2)] -.- X
    end

    P -->|writes continue| Client1[Client]
    R2 -.->|disconnected| P

If you need consistency during partitions, you need synchronous replication. If you need availability during partitions, async is your choice.

For more on this trade-off, see CAP Theorem and Consistency Models.

Reducing Lag in Async Setups

If lag is a problem, there are techniques to reduce it.

Upgrade replica hardware: Replicas that cannot keep up with the primary’s write rate will always lag. Give them faster disks and more CPU.

Reduce write workload: If the primary writes 10,000 transactions per second and replicas can only handle 8,000, lag will grow indefinitely.

Use parallel replication: Modern MySQL and PostgreSQL support parallel replica apply.

-- MySQL: Enable parallel replication
STOP REPLICA;
SET GLOBAL slave_parallel_workers = 8;
START REPLICA;

-- PostgreSQL: Increase wal_sender_timeout for long-running queries
ALTER SYSTEM SET wal_sender_timeout = '60s';

Cascade replication: Chain replicas so the primary does not feed every replica directly. This reduces primary load.

flowchart TD
    P[Primary] --> R1[Replica 1]
    R1 --> R2[Replica 2]
    R1 --> R3[Replica 3]
    R2 --> R4[Replica 4]

Read Routing with Async Replicas

Application code should handle stale reads gracefully when using async replicas.

# Read from replica with fallback to primary
def read_user(user_id):
    try:
        # Try replica first
        replica = get_read_replica()
        user = replica.execute("SELECT * FROM users WHERE id = ?", user_id)
        return user
    except ReplicaTooLagError:
        # Fall back to primary if replica is too far behind
        return primary.execute("SELECT * FROM users WHERE id = ?", user_id)

# Read-after-write with async replication
def update_and_read(user_id, new_email):
    # Always read from primary after writes
    db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)
    return primary.execute("SELECT email FROM users WHERE id = ?", user_id)

When Async Is Not Enough

Some data genuinely cannot tolerate async replication:

  • Financial transactions: If you write a payment and the primary fails before replicas receive it, money disappears
  • Inventory management: Overselling happens when replicas do not have the latest stock counts
  • Locking operations: If you acquire a lock and the primary fails before it replicates, you have a zombie lock

For these cases, see Synchronous Replication for strong consistency options.

Real-world Failure Scenarios

Scenario 1: The Overwhelmed Replica

A replica running on undersized hardware falls behind during peak traffic. The primary continues accepting writes at 5,000 TPS while the replica can only apply 3,000. Over hours, the WAL backlog grows to 50GB. When the primary finally fails, the replica is promoted but 50GB of committed transactions are lost. The engineering team had set alert thresholds but the on-call engineer dismissed the warning, believing it was a transient load spike.

Scenario 2: The Partition That Wasn’t

During a network partition, an application continued writing to the primary. Without realizing replicas were unreachable, the team confirmed to users that orders were “saved.” When the partition healed, they discovered the primary had accumulated 2 hours of writes that now conflicted with the recovered replica’s state. Manual reconciliation took three days.

Scenario 3: The Untested Failover

A database administrator manually promoted a replica during what they thought was a primary hardware issue. The replica had 4 hours of lag due to a background analytics query consuming replica resources. Four hours of customer orders, profile updates, and payment confirmations were permanently lost. The RTO was 15 minutes, but the RPO was 4 hours - far exceeding the company’s 1-hour RPO commitment.

Scenario 4: The Replication Slot Catastrophe

A replica lost network connectivity for 6 hours. During that time, the primary’s replication slot was dropped (due to a misconfigured cleanup policy). When the replica reconnected, there was no way to resume replication - the WAL segments it needed had been purged. The team had to rebuild the entire replica from a new base backup, causing 8 hours of reduced read capacity.

Common Pitfalls / Anti-Patterns

  1. Assuming replica data is current: After a write, do not assume replicas have the update immediately. Always read from primary for critical data after writes.

  2. Not monitoring lag: Lag accumulates silently. By the time users complain, you might be hours behind. Monitor proactively.

  3. Promoting a lagging replica: If a replica is 2 hours behind and you promote it, you just lost 2 hours of data. Always check lag before promoting.

  4. Ignoring replication slot retention: If a replica falls behind and the replication slot is lost, you cannot recover without a full resync.

  5. Testing in ideal conditions: Replication lag grows under load. Test with production-level write rates.

Quick Recap

  • Async replication confirms writes immediately without waiting for replicas
  • Replication lag means replicas can return stale data
  • Best for: non-critical reads, analytics, geographically distributed systems
  • Monitor lag actively and alert on threshold breaches
  • Use read-after-write consistency when needed by reading from primary after writes
  • For critical data requiring zero RPO, use synchronous replication

Read-Your-Writes Consistency Checklist

When using async replicas, ensuring read-after-write consistency requires explicit handling:

# Strategy 1: Read from primary after writes
def read_after_write(session, user_id, new_email):
    session.execute(
        "UPDATE users SET email = ? WHERE id = ?",
        new_email, user_id
    )
    # Must read from primary to see our own write
    return primary.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )

# Strategy 2: Track LSN and wait for replication
def read_after_wait(session, user_id, new_email):
    lsn = session.execute(
        "UPDATE users SET email = ? WHERE id = ?",
        new_email, user_id
    )
    # Wait for replica to catch up to our LSN
    replica.wait_for_lsn(lsn, timeout=30)
    return replica.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )

# Strategy 3: Session pinning (sticky to primary)
def read_with_session_pin(session, user_id):
    # Session always routes to primary for this user
    session.set_pin(user_id, to_node='primary')
    return session.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )
StrategyProsConsBest For
Read from primarySimple, always correctAdds primary loadWrite-heavy workloads
Wait for LSNWorks with any replicaAdds latencyOccasional consistency needs
Session pinningConsistent experienceReduces flexibilityUser-specific data

Trade-off Analysis

AspectAsynchronous ReplicationSynchronous Replication
Write LatencyLow (local disk + processing)High (waits for replica confirmation)
Data SafetyRisk of losing writes during failoverNo data loss if replica confirms
AvailabilityHigh - continues during partitionsLower - blocked if replica unavailable
Replica RequirementsCan be geographically distantMust be low-latency for performance
Use Case FitRead-heavy, eventual consistency OKCritical data, zero RPO required
Failover ComplexityHigher - may lose recent writesLower - replica is up-to-date
Cross-DC DeploymentFeasible across continentsImpractical due to latency

Interview Questions

1. What is the fundamental difference between synchronous and asynchronous replication in terms of write acknowledgment?

Expected answer points:

  • Synchronous replication waits for replica confirmation before acknowledging the write to the client
  • Asynchronous replication confirms the write immediately after the primary records it locally, then sends changes to replicas in the background
  • This difference makes async replication faster but introduces the possibility of data loss during failover
2. What is replication lag and how does it affect read operations in an asynchronous replication setup?

Expected answer points:

  • Replication lag is the time difference between when a write is committed on the primary and when it is applied on a replica
  • Reads from lagging replicas can return stale or outdated data
  • The lag can range from milliseconds to minutes depending on network conditions, load, and replica performance
  • Applications must handle potentially inconsistent reads during the lag window
3. Explain the three main types of replication mechanisms: statement-based, WAL-based, and row-based. What are their trade-offs?

Expected answer points:

  • Statement-based: Sends SQL statements to replicas. Simple but problematic for nondeterministic functions like NOW() or RAND() that produce different results on different nodes
  • WAL-based: Sends binary Write-Ahead Log entries containing actual changed bytes. Deterministic and compact. Used by PostgreSQL
  • Row-based: Sends actual row changes. Combines benefits of both - deterministic and complete. Used by MySQL's row-based replication
4. What is eventual consistency and why is it important in asynchronous replication?

Expected answer points:

  • Eventual consistency guarantees that if no new writes are made, all replicas will eventually converge to the same state
  • The "eventually" window is critical - during lag, different replicas can return different results for the same query
  • This observable inconsistency is a real-world concern, not just theoretical
  • Suitable for applications where strict immediate consistency is not required
5. How does asynchronous replication position a database in the CAP theorem trade-off?

Expected answer points:

  • CAP theorem states you can only guarantee two of three: Consistency, Availability, and Partition tolerance
  • Async replication chooses Availability (A) over Consistency (C) during network partitions
  • The primary can continue accepting writes even if replicas are disconnected
  • Risk: writes may be lost if the primary fails before replicas receive them
6. Describe three strategies for achieving read-after-write consistency with asynchronous replicas.

Expected answer points:

  • Read from primary after writes: Route reads to the primary immediately following a write. Simple but adds primary load
  • Track LSN and wait: After a write, track the Log Sequence Number and wait for the replica to catch up to that LSN before reading. Adds latency but works with any replica
  • Session pinning: Pin a user's session to the primary for consistent reads. Reduces flexibility but provides consistent experience per user
7. What critical metrics should you monitor to detect replication lag issues in production?

Expected answer points:

  • Replication lag (bytes/time): Difference between sent and replayed LSN. Warning: 1-10s, Critical: >10s
  • Replica IO thread status: Must be running. Stopped thread means replication has stopped
  • Replica SQL thread status: Must be running. Stopped thread means events are not being applied
  • WAL backlog size: Warning: 100MB-1GB, Critical: >1GB
8. What are the dangers of promoting a lagging replica during a failover scenario?

Expected answer points:

  • If a replica is hours behind and gets promoted, all writes in that gap are permanently lost
  • The new primary will have incomplete data compared to what was committed
  • Always check lag metrics before promoting a replica
  • Establish a maximum acceptable lag threshold (e.g., RPO) before allowing automatic failover
9. What is cascade replication and what problem does it solve?

Expected answer points:

  • Cascade replication chains replicas: Primary → Replica 1 → Replica 2 → Replica 3
  • Reduces load on the primary, which no longer needs to feed every replica directly
  • Trade-off: adds latency as changes propagate through the chain
  • Useful for read-heavy deployments with many replicas across geographic regions
10. Why is asynchronous replication generally NOT suitable for financial transactions?

Expected answer points:

  • If the primary fails after write confirmation but before replicas receive the change, that data is lost
  • In financial systems, losing committed transactions (even small amounts) is unacceptable
  • Money disappearing due to replication lag can cause regulatory and customer trust issues
  • Financial systems typically require synchronous replication or consensus-based approaches
11. What techniques can reduce replication lag in an async setup?

Expected answer points:

  • Upgrade replica hardware: Faster disks and more CPU so replicas can keep up with the primary's write rate
  • Reduce write workload: If primary writes 10,000 TPS and replicas handle only 8,000, lag grows indefinitely
  • Enable parallel replication: Modern MySQL and PostgreSQL support applying events in parallel on replicas
  • Cascade replication: Chain replicas to reduce primary load
12. What is the difference between row-based replication and statement-based replication in MySQL?

Expected answer points:

  • Statement-based: Replicas execute the same SQL statements as the primary. Can produce different results with nondeterministic functions
  • Row-based: Replicas receive the actual row changes (before and after images). Deterministic and complete
  • Row-based replication produces larger binary logs but avoids subtle replication bugs
  • MySQL allows switching between modes based on workload
13. How does replication slot retention interact with replica lag?

Expected answer points:

  • Replication slots prevent the primary from removing WAL segments that replicas haven't yet received
  • If a replica falls behind significantly and the replication slot is lost, the gap cannot be recovered
  • Result: requires a full resync from the primary instead of resuming from where replication stopped
  • Monitor replication slot status and ensure replicas stay connected
14. What are good use cases for asynchronous replication?

Expected answer points:

  • Web sessions and preferences: User profile updates can lag seconds without real problems
  • Analytics and reporting: Dashboard reads do not need millisecond freshness
  • Social media feeds: Timeline updates tolerate eventual consistency
  • Logging and metrics: Time-series data tolerates small gaps
  • Geographic distribution: Cross-continent replication would be prohibitively slow with sync
16. What happens when a primary fails and there is a significant replication lag? How should you handle such a scenario?

Expected answer points:

  • If a primary fails while replicas are lagging, promoted replicas will lose the uncommitted transactions
  • RPO (Recovery Point Objective) is violated - data written but not yet replicated is lost
  • Before promoting any replica, check its lag metrics against your acceptable RPO threshold
  • If no replica is within acceptable lag, prefer manual reconciliation over automatic failover
17. Explain the concept of a replication slot and why losing it can be catastrophic.

Expected answer points:

  • Replication slots are PostgreSQL features that prevent WAL segment removal until replicas confirm receipt
  • If a replica is disconnected and the slot is dropped (due to misconfigured retention), WAL segments needed for recovery are purged
  • Result: the replica cannot resume replication and must be rebuilt from a fresh base backup
  • This can cause hours of reduced read capacity and service degradation
18. How does cascade replication reduce load on the primary?

Expected answer points:

  • In cascade replication, the primary feeds only one or two replicas directly, and those replicas feed others
  • The primary no longer needs to maintain concurrent connections for every replica
  • Network bandwidth on the primary is significantly reduced in large replica deployments
  • Trade-off: adds propagation latency as changes flow through intermediate replicas
19. What are the advantages and disadvantages of WAL-based replication compared to statement-based replication?

Expected answer points:

  • WAL advantages: Deterministic (contains actual changed bytes), compact, works for any SQL operation
  • WAL disadvantages: Binary format tied to specific database version, harder to inspect and debug
  • Statement advantages: Human-readable, portable across MySQL versions
  • Statement disadvantages: Nondeterministic functions cause divergence between primary and replica
20. Why might an application see different results from the same query executed against different replicas?

Expected answer points:

  • During the replication lag window, replicas are at different points in the transaction stream
  • If queries are load-balanced across replicas randomly, consecutive reads can return different results
  • This observable inconsistency is the price of async replication's speed
  • Applications must either accept staleness or implement read-after-write consistency by routing critical reads to the primary

Further Reading

Conclusion

Async replication is the right choice for most read-heavy workloads. The key is understanding which data needs strong consistency and which can tolerate eventual consistency, then routing reads accordingly.

Category

Related Posts

CRDTs: Conflict-Free Replicated Data Types

Learn how CRDTs enable strongly consistent eventual consistency in distributed databases. Explore G-Counters, PN-Counters, LWW-Registers, OR-Sets, and practical applications.

#distributed-systems #crdt #eventual-consistency

Synchronous Replication: Data Consistency Across Nodes

Learn about synchronous replication patterns in distributed databases, including Quorum-based replication and write-ahead log shipping for zero RPO.

#distributed-systems #replication #consistency

Database Replication: Master-Slave and Failover Patterns

Database replication explained: master-slave, multi-master, synchronous vs asynchronous strategies, failover patterns, and consistency.

#databases #replication #availability