Asynchronous Replication: Speed and Availability at Scale

Learn how asynchronous replication works in distributed databases, including eventual consistency implications, lag monitoring, and practical use cases where speed outweighs strict consistency.

published: March 24, 2026 reading time: 24 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

With asynchronous replication, the primary confirms writes immediately while replicas catch up in the background, giving you low latency and high availability at the cost of a small window where data might be lost during failover. Replication lag is the silent killer: replicas can fall seconds or hours behind depending on load and network conditions, so reads from replicas can return stale data without any indication to the application. The practical approach is to route critical reads back to the primary after writes (read-after-write consistency), monitor lag proactively with alerts on bytes or seconds behind, and reserve async replicas for analytics, dashboards, and other workloads where slight staleness is acceptable. If your RPO must be zero, synchronous replication is the only honest answer.

Asynchronous Replication: Speed and Availability at Scale

Introduction

Asynchronous replication is the workhorse of distributed databases. Writes confirm immediately on the primary. Replicas receive the changes eventually, usually milliseconds later but sometimes much longer. This gives you speed and availability, but it introduces a window where data can be lost if the primary fails.

Most production databases run async by default. The performance benefit is real. The trade-off is accepting a small probability of losing recent writes during failover. For many applications, this is acceptable. For others, it is not.

How Asynchronous Replication Works

The primary processes a write, confirms it to the client, and sends the change to replicas in the background. The client does not wait for replica confirmation.

sequenceDiagram
    participant Client
    participant Primary
    participant Replica

    Client->>Primary: BEGIN; UPDATE...; COMMIT
    Primary->>Primary: Write to local WAL
    Primary-->>Client: COMMIT confirmed
    Note over Primary,Replica: (Background replication)

    rect rgb(200, 200, 200)
        Primary->>Replica: Send WAL entry
        Replica-->>Primary: ACK
    end

This non-blocking behavior is why async replication is fast. Write latency is essentially local disk latency plus a small amount of processing. There is no waiting for remote replica confirmation.

The Replication Lag Problem

With async replication, replicas fall behind the primary. This lag can be seconds or minutes depending on load, network conditions, and replica performance.

# This sequence can fail with async replication
def update_and_read(user_id, new_email):
    db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)
    # Write confirmed immediately
    # But replica might not have the update yet

    user = db_replica.execute("SELECT email FROM users WHERE id = ?", user_id)
    # Might return old email if replica is lagging
    return user['email']

This is the fundamental tension. Async replication is fast, but reads from replicas can return stale data.

Statement-Based vs WAL-Based Replication

Different databases use different mechanisms.

Statement-based replication sends the SQL statements to replicas. MySQL’s original replication used this approach.

-- Primary executes:
UPDATE users SET email = 'new@example.com' WHERE id = 1;

-- Replica executes the same statement:
UPDATE users SET email = 'new@example.com' WHERE id = 1;

The problem: nondeterministic functions like NOW() or RAND() produce different results on different nodes.

WAL-based replication sends the binary WAL entries. PostgreSQL uses this. WAL entries are deterministic because they contain the actual bytes that changed.

# WAL entry contains the actual bytes
wal_entry = {
    'type': 'UPDATE',
    'table': 'users',
    'key': 'id=1',
    'old_data': {'email': 'old@example.com'},
    'new_data': {'email': 'new@example.com'},
    'timestamp': 1711234567
}

Row-based replication sends the actual row changes. MySQL’s row-based replication combines the benefits of both.

Eventual Consistency

Async replication provides eventual consistency. Given enough time without new writes, all replicas will converge to the same state.

flowchart LR
    P[Primary] -->|immediate| R1[Replica 1]
    P -->|immediate| R2[Replica 2]

    R1 -.->|eventually consistent| Same[Same State]
    R2 -.->|eventually consistent| Same

The “eventually” part is important. During the lag window, reads from different replicas can return different results. This is observable inconsistency, not just a theoretical concern.

# User experience with eventual consistency
def get_user_email(user_id):
    # Route to random replica
    replica = random.choice([replica1, replica2, replica3])
    return replica.execute("SELECT email FROM users WHERE id = ?", user_id)

# Two consecutive calls might return different results
email1 = get_user_email(123)  # 'old@example.com'
email2 = get_user_email(123)  # 'new@example.com' if replica2 has newer data

Use Cases Where Async Makes Sense

Async replication is the right choice when:

Web sessions and preferences: User profile updates can lag a few seconds without causing real problems
Analytics and reporting: Reads for dashboards do not need millisecond freshness
Social media feeds: Timeline updates do not need to be immediately consistent
Logging and metrics: Time-series data tolerates small gaps
Geographic distribution: Replicating across continents with sync would be prohibitively slow

# Good async use case: user preferences
def update_user_preference(user_id, key, value):
    # Write immediately, replicate async
    db.execute("UPDATE preferences SET value = ? WHERE user_id = ? AND key = ?",
               value, user_id, key)
    # User sees update instantly on primary
    # Other replicas catch up shortly

# Bad async use case: account balance
def transfer_funds(from_id, to_id, amount):
    # This needs synchronous replication for correctness
    db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
               amount, from_id)
    db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
               amount, to_id)

Monitoring Replication Lag

Lag monitoring is critical with async replication. You need to know how far behind your replicas are.

-- PostgreSQL: Check replication lag
SELECT client_addr, state,
       sent_lsn, replay_lsn,
       sent_lsn - replay_lsn AS lag_bytes
FROM pg_stat_replication;

-- MySQL: Check replica lag
SHOW REPLICA STATUS\G
# Look at: Seconds_Behind_Master, Relay_Log_Pos

# Application-level lag monitoring
def check_replica_health(replica):
    last_replicated_ts = replica.get_last_replicated_timestamp()
    current_ts = time.time()
    lag_seconds = current_ts - last_replicated_ts

    if lag_seconds > 30:
        alert(f"Replica {replica.name} lag: {lag_seconds}s")
        # Route reads away from this replica
        remove_from_read_pool(replica)

Key metrics to track:

Metric	Normal	Warning	Critical
Replication lag	< 1s	1-10s	> 10s
Replica IO thread	Running	—	Stopped
Replica SQL thread	Running	—	Stopped
WAL backlog	< 100MB	100MB-1GB	> 1GB

The CAP Theorem and Async Replication

Async replication is the availability (A) choice in the CAP theorem trade-off. During a network partition, an async primary can continue accepting writes. Those writes might not survive if the primary fails before replicas receive them.

flowchart TD
    subgraph Partition
        P[Primary] --> R1[(Replica 1)]
        P -.- X[(Partition)]
        R2[(Replica 2)] -.- X
    end

    P -->|writes continue| Client1[Client]
    R2 -.->|disconnected| P

If you need consistency during partitions, you need synchronous replication. If you need availability during partitions, async is your choice.

For more on this trade-off, see CAP Theorem and Consistency Models.

Reducing Lag in Async Setups

If lag is a problem, there are techniques to reduce it.

Upgrade replica hardware: Replicas that cannot keep up with the primary’s write rate will always lag. Give them faster disks and more CPU.

Reduce write workload: If the primary writes 10,000 transactions per second and replicas can only handle 8,000, lag will grow indefinitely.

Use parallel replication: Modern MySQL and PostgreSQL support parallel replica apply.

-- MySQL: Enable parallel replication
STOP REPLICA;
SET GLOBAL slave_parallel_workers = 8;
START REPLICA;

-- PostgreSQL: Increase wal_sender_timeout for long-running queries
ALTER SYSTEM SET wal_sender_timeout = '60s';

Cascade replication: Chain replicas so the primary does not feed every replica directly. This reduces primary load.

flowchart TD
    P[Primary] --> R1[Replica 1]
    R1 --> R2[Replica 2]
    R1 --> R3[Replica 3]
    R2 --> R4[Replica 4]

Read Routing with Async Replicas

Application code should handle stale reads gracefully when using async replicas.

# Read from replica with fallback to primary
def read_user(user_id):
    try:
        # Try replica first
        replica = get_read_replica()
        user = replica.execute("SELECT * FROM users WHERE id = ?", user_id)
        return user
    except ReplicaTooLagError:
        # Fall back to primary if replica is too far behind
        return primary.execute("SELECT * FROM users WHERE id = ?", user_id)

# Read-after-write with async replication
def update_and_read(user_id, new_email):
    # Always read from primary after writes
    db.execute("UPDATE users SET email = ? WHERE id = ?", new_email, user_id)
    return primary.execute("SELECT email FROM users WHERE id = ?", user_id)

When Async Is Not Enough

Some data genuinely cannot tolerate async replication:

Financial transactions: If you write a payment and the primary fails before replicas receive it, money disappears
Inventory management: Overselling happens when replicas do not have the latest stock counts
Locking operations: If you acquire a lock and the primary fails before it replicates, you have a zombie lock

For these cases, see Synchronous Replication for strong consistency options.

Real-world Failure Scenarios

Scenario 1: The Overwhelmed Replica

A replica running on undersized hardware falls behind during peak traffic. The primary continues accepting writes at 5,000 TPS while the replica can only apply 3,000. Over hours, the WAL backlog grows to 50GB. When the primary finally fails, the replica is promoted but 50GB of committed transactions are lost. The engineering team had set alert thresholds but the on-call engineer dismissed the warning, believing it was a transient load spike.

Key metrics to catch this earlier: replica replay lag in seconds and WAL backlog size in bytes. In PostgreSQL, pg_stat_replication returns replay_lsn showing how far the replica is behind the primary’s write-ahead log. A query like SELECT extract(epoch from now() - pg_last_xact_replay_timestamp()) AS lag_seconds gives real-time lag. The on-call engineer would have seen lag climbing from seconds to minutes to hours if they had run this query. Alert fatigue is the human factor here: warnings fire frequently without consequences until the threshold breach finally represents a real problem. The instinct to wait and see if it resolves on its own is wrong in distributed systems because lag accumulates asymmetrically. Writes go in immediately but catching up requires the replica to have spare capacity it does not have.

When the primary fails, any transactions committed to the WAL but not yet replicated to the replica are gone. The 50GB figure represents roughly two hours of writes at 5,000 TPS with average transaction size around 50 bytes. Those transactions were acknowledged to clients as committed. The replica promotion made them permanent losses. When WAL backlog exceeds your RPO threshold, the right move is to add replicas, shunt reads away from the lagging replica, and investigate the performance bottleneck. Automated alerts with severity levels help here. A PagerDuty alert at severity 2 with a runbook link is harder to dismiss than a Slack message that scrolls off the screen. The runbook should specify exact steps: which queries to run, what values trigger action, and who to escalate to if the values do not improve within 15 minutes.

Scenario 2: The Partition That Wasn’t

During a network partition, an application continued writing to the primary. Without realizing replicas were unreachable, the team confirmed to users that orders were “saved.” When the partition healed, they discovered the primary had accumulated 2 hours of writes that now conflicted with the recovered replica’s state. Manual reconciliation took three days.

The name is slightly misleading. The team thought the network issue was brief or contained, but it was actually a sustained disconnection. The application layer had no visibility into replica reachability. The primary was accepting writes normally, so from the application’s perspective everything looked fine. The database driver reported successful commits. What the team did not know was that all replicas had lost their connection to the primary and were accumulating divergence. “Conflicted” means the replica had older state: orders that existed on the primary were absent on the replica, and in some cases the replica had different values for records that had been updated on the primary during the partition. Merging these states required reading every affected table from both nodes, identifying which writes came from which timeline, and deciding per-record which version to keep. This is not a simple last-write-wins merge. An order record that was created on the primary and updated on the replica represents a genuine conflict requiring business logic to resolve.

Real-time replica disconnect detection is well-understood. PostgreSQL replicas send heartbeats through the wal_sender_timeout mechanism; if the primary does not receive expected messages, it logs a disconnection. MySQL has SHOW REPLICA STATUS with Slave_IO_Running and Slave_SQL_Running columns that indicate replication thread health. Application-level monitoring should poll these values and raise an alert if either thread stops. The conflict resolution strategy they should have used depends on their requirements. Last-write-wins works for idempotent data like preferences. For financial records, the correct approach is to reject the writes during the partition period and prompt users to resubmit. For this team, three days of manual reconciliation suggests the data volumes were large and the business logic for merging was complex. Automated conflict detection and alerting when replica lag exceeds a defined threshold would have caught this hours earlier.

Scenario 3: The Untested Failover

A database administrator manually promoted a replica during what they thought was a primary hardware issue. The replica had 4 hours of lag due to a background analytics query consuming replica resources. Four hours of customer orders, profile updates, and payment confirmations were permanently lost. The RTO was 15 minutes, but the RPO was 4 hours - far exceeding the company’s 1-hour RPO commitment.

The mistake here is straightforward: promoting a replica without checking its lag. In MySQL, SHOW REPLICA STATUS exposes Seconds_Behind_Master which tells you how far the replica is behind. In PostgreSQL, pg_stat_replication shows the difference between sent_lsn and replay_lsn. The analytics query caused lag because long-running read queries hold the SQL replay thread in MySQL, preventing it from applying new events. In PostgreSQL, a heavy analytical query can consume shared buffers and cause bloat that slows the replay process. Either way, the replica was functionally behind and should not have been promoted.

RTO measures how long it takes to restore service. RPO measures how much data you can afford to lose. The team restored service in 15 minutes by promoting the replica, which is good RTO. But they lost 4 hours of committed transactions, which is catastrophic RPO. RTO and RPO are independent. A failover that restores availability quickly but loses customer payment records is not a successful failover by any measure.

The correct failover checklist before promoting any replica includes: checking lag in seconds or bytes against your RPO threshold, verifying the replica SQL and IO threads are running, confirming the replica has applied all events up to a specific LSN or timestamp, and comparing that timestamp against your oldest acceptable commit time. If lag exceeds your RPO, you have three options: wait for the replica to catch up, perform manual reconciliation, or accept the data loss and document it for compliance. None of these involve blindly promoting the lagging replica. Automated failover tools should enforce lag checks before promotion. If your automation cannot enforce this, manual runbooks with explicit lag checks should be mandatory before any promotion action.

Scenario 4: The Replication Slot Catastrophe

A replica lost network connectivity for 6 hours. During that time, the primary’s replication slot was dropped due to a misconfigured cleanup policy. When the replica reconnected, there was no way to resume replication because the WAL segments it needed had been purged. The team had to rebuild the entire replica from a new base backup, causing 8 hours of reduced read capacity.

Replication slots hold WAL segments that replicas have not yet confirmed receiving. When a replica connects, it gets a slot. The primary keeps WAL segments until the replica acknowledges receipt. If the replica disconnects and the slot is dropped before reconnection, the gap is permanent. There is no way to recover missing segments.

The misconfiguration was a WAL cleanup policy that ran on a schedule without checking whether replicas were using their slots. It assumed disconnected replicas would reconnect and resume automatically. But resuming requires the WAL segments to still exist. Once they are purged, the replica cannot pick up where it left off. A full rebuild from base backup is the only option, and for large databases that takes hours.

Query pg_replication_slots to check slot status. Look at restart_lsn — the oldest WAL point the replica needs. If restart_lsn falls significantly behind the primary’s current LSN, or if a slot disappears entirely, raise a critical alert. This is a situation that demands immediate attention, not a ticket to file for later.

When a replica disconnects, verify the replication slot is still there before anything else. WAL cleanup policies should not remove segments that a disconnected replica will eventually need. If the replica will be offline for hours, archive WAL segments manually with pg_receivewal or copy them to a holding location. The alternative is a full rebuild, which means reduced read capacity for hours and extra load on whatever replicas remain.

Common Pitfalls / Anti-Patterns

Assuming replica data is current: After a write, do not assume replicas have the update immediately. Always read from primary for critical data after writes.
Not monitoring lag: Lag accumulates silently. By the time users complain, you might be hours behind. Monitor proactively.
Promoting a lagging replica: If a replica is 2 hours behind and you promote it, you just lost 2 hours of data. Always check lag before promoting.
Ignoring replication slot retention: If a replica falls behind and the replication slot is lost, you cannot recover without a full resync.
Testing in ideal conditions: Replication lag grows under load. Test with production-level write rates.

Quick Recap

Async replication confirms writes immediately without waiting for replicas
Replication lag means replicas can return stale data
Best for: non-critical reads, analytics, geographically distributed systems
Monitor lag actively and alert on threshold breaches
Use read-after-write consistency when needed by reading from primary after writes
For critical data requiring zero RPO, use synchronous replication

Read-Your-Writes Consistency Checklist

When using async replicas, ensuring read-after-write consistency requires explicit handling:

# Strategy 1: Read from primary after writes
def read_after_write(session, user_id, new_email):
    session.execute(
        "UPDATE users SET email = ? WHERE id = ?",
        new_email, user_id
    )
    # Must read from primary to see our own write
    return primary.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )

# Strategy 2: Track LSN and wait for replication
def read_after_wait(session, user_id, new_email):
    lsn = session.execute(
        "UPDATE users SET email = ? WHERE id = ?",
        new_email, user_id
    )
    # Wait for replica to catch up to our LSN
    replica.wait_for_lsn(lsn, timeout=30)
    return replica.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )

# Strategy 3: Session pinning (sticky to primary)
def read_with_session_pin(session, user_id):
    # Session always routes to primary for this user
    session.set_pin(user_id, to_node='primary')
    return session.execute(
        "SELECT email FROM users WHERE id = ?",
        user_id
    )

Strategy	Pros	Cons	Best For
Read from primary	Simple, always correct	Adds primary load	Write-heavy workloads
Wait for LSN	Works with any replica	Adds latency	Occasional consistency needs
Session pinning	Consistent experience	Reduces flexibility	User-specific data

Trade-off Analysis

Aspect	Asynchronous Replication	Synchronous Replication
Write Latency	Low (local disk + processing)	High (waits for replica confirmation)
Data Safety	Risk of losing writes during failover	No data loss if replica confirms
Availability	High - continues during partitions	Lower - blocked if replica unavailable
Replica Requirements	Can be geographically distant	Must be low-latency for performance
Use Case Fit	Read-heavy, eventual consistency OK	Critical data, zero RPO required
Failover Complexity	Higher - may lose recent writes	Lower - replica is up-to-date
Cross-DC Deployment	Feasible across continents	Impractical due to latency

Interview Questions

1. What is the fundamental difference between synchronous and asynchronous replication in terms of write acknowledgment?

Expected answer points:

Synchronous replication waits for replica confirmation before acknowledging the write to the client
Asynchronous replication confirms the write immediately after the primary records it locally, then sends changes to replicas in the background
This difference makes async replication faster but introduces the possibility of data loss during failover

2. What is replication lag and how does it affect read operations in an asynchronous replication setup?

Expected answer points:

Replication lag is the time difference between when a write is committed on the primary and when it is applied on a replica
Reads from lagging replicas can return stale or outdated data
The lag can range from milliseconds to minutes depending on network conditions, load, and replica performance
Applications must handle potentially inconsistent reads during the lag window

3. Explain the three main types of replication mechanisms: statement-based, WAL-based, and row-based. What are their trade-offs?

Expected answer points:

Statement-based: Sends SQL statements to replicas. Simple but problematic for nondeterministic functions like NOW() or RAND() that produce different results on different nodes
WAL-based: Sends binary Write-Ahead Log entries containing actual changed bytes. Deterministic and compact. Used by PostgreSQL
Row-based: Sends actual row changes. Combines benefits of both - deterministic and complete. Used by MySQL's row-based replication

4. What is eventual consistency and why is it important in asynchronous replication?

Expected answer points:

Eventual consistency guarantees that if no new writes are made, all replicas will eventually converge to the same state
The "eventually" window is critical - during lag, different replicas can return different results for the same query
This observable inconsistency is a real-world concern, not just theoretical
Suitable for applications where strict immediate consistency is not required

5. How does asynchronous replication position a database in the CAP theorem trade-off?

Expected answer points:

CAP theorem states you can only guarantee two of three: Consistency, Availability, and Partition tolerance
Async replication chooses Availability (A) over Consistency (C) during network partitions
The primary can continue accepting writes even if replicas are disconnected
Risk: writes may be lost if the primary fails before replicas receive them

6. Describe three strategies for achieving read-after-write consistency with asynchronous replicas.

Expected answer points:

Read from primary after writes: Route reads to the primary immediately following a write. Simple but adds primary load
Track LSN and wait: After a write, track the Log Sequence Number and wait for the replica to catch up to that LSN before reading. Adds latency but works with any replica
Session pinning: Pin a user's session to the primary for consistent reads. Reduces flexibility but provides consistent experience per user

7. What critical metrics should you monitor to detect replication lag issues in production?

Expected answer points:

Replication lag (bytes/time): Difference between sent and replayed LSN. Warning: 1-10s, Critical: >10s
Replica IO thread status: Must be running. Stopped thread means replication has stopped
Replica SQL thread status: Must be running. Stopped thread means events are not being applied
WAL backlog size: Warning: 100MB-1GB, Critical: >1GB

8. What are the dangers of promoting a lagging replica during a failover scenario?

Expected answer points:

If a replica is hours behind and gets promoted, all writes in that gap are permanently lost
The new primary will have incomplete data compared to what was committed
Always check lag metrics before promoting a replica
Establish a maximum acceptable lag threshold (e.g., RPO) before allowing automatic failover

9. What is cascade replication and what problem does it solve?

Expected answer points:

Cascade replication chains replicas: Primary → Replica 1 → Replica 2 → Replica 3
Reduces load on the primary, which no longer needs to feed every replica directly
Trade-off: adds latency as changes propagate through the chain
Useful for read-heavy deployments with many replicas across geographic regions

10. Why is asynchronous replication generally NOT suitable for financial transactions?

Expected answer points:

If the primary fails after write confirmation but before replicas receive the change, that data is lost
In financial systems, losing committed transactions (even small amounts) is unacceptable
Money disappearing due to replication lag can cause regulatory and customer trust issues
Financial systems typically require synchronous replication or consensus-based approaches

11. What techniques can reduce replication lag in an async setup?

Expected answer points:

Upgrade replica hardware: Faster disks and more CPU so replicas can keep up with the primary's write rate
Reduce write workload: If primary writes 10,000 TPS and replicas handle only 8,000, lag grows indefinitely
Enable parallel replication: Modern MySQL and PostgreSQL support applying events in parallel on replicas
Cascade replication: Chain replicas to reduce primary load

12. What is the difference between row-based replication and statement-based replication in MySQL?

Expected answer points:

Statement-based: Replicas execute the same SQL statements as the primary. Can produce different results with nondeterministic functions
Row-based: Replicas receive the actual row changes (before and after images). Deterministic and complete
Row-based replication produces larger binary logs but avoids subtle replication bugs
MySQL allows switching between modes based on workload

13. How does replication slot retention interact with replica lag?

Expected answer points:

Replication slots prevent the primary from removing WAL segments that replicas haven't yet received
If a replica falls behind significantly and the replication slot is lost, the gap cannot be recovered
Result: requires a full resync from the primary instead of resuming from where replication stopped
Monitor replication slot status and ensure replicas stay connected

14. What are good use cases for asynchronous replication?

Expected answer points:

Web sessions and preferences: User profile updates can lag seconds without real problems
Analytics and reporting: Dashboard reads do not need millisecond freshness
Social media feeds: Timeline updates tolerate eventual consistency
Logging and metrics: Time-series data tolerates small gaps
Geographic distribution: Cross-continent replication would be prohibitively slow with sync

16. What happens when a primary fails and there is a significant replication lag? How should you handle such a scenario?

Expected answer points:

If a primary fails while replicas are lagging, promoted replicas will lose the uncommitted transactions
RPO (Recovery Point Objective) is violated - data written but not yet replicated is lost
Before promoting any replica, check its lag metrics against your acceptable RPO threshold
If no replica is within acceptable lag, prefer manual reconciliation over automatic failover

17. Explain the concept of a replication slot and why losing it can be catastrophic.

Expected answer points:

Replication slots are PostgreSQL features that prevent WAL segment removal until replicas confirm receipt
If a replica is disconnected and the slot is dropped (due to misconfigured retention), WAL segments needed for recovery are purged
Result: the replica cannot resume replication and must be rebuilt from a fresh base backup
This can cause hours of reduced read capacity and service degradation

18. How does cascade replication reduce load on the primary?

Expected answer points:

In cascade replication, the primary feeds only one or two replicas directly, and those replicas feed others
The primary no longer needs to maintain concurrent connections for every replica
Network bandwidth on the primary is significantly reduced in large replica deployments
Trade-off: adds propagation latency as changes flow through intermediate replicas

19. What are the advantages and disadvantages of WAL-based replication compared to statement-based replication?

Expected answer points:

WAL advantages: Deterministic (contains actual changed bytes), compact, works for any SQL operation
WAL disadvantages: Binary format tied to specific database version, harder to inspect and debug
Statement advantages: Human-readable, portable across MySQL versions
Statement disadvantages: Nondeterministic functions cause divergence between primary and replica

20. Why might an application see different results from the same query executed against different replicas?

Expected answer points:

During the replication lag window, replicas are at different points in the transaction stream
If queries are load-balanced across replicas randomly, consecutive reads can return different results
This observable inconsistency is the price of async replication's speed
Applications must either accept staleness or implement read-after-write consistency by routing critical reads to the primary

Conclusion

Async replication is the right choice for most read-heavy workloads. The key is understanding which data needs strong consistency and which can tolerate eventual consistency, then routing reads accordingly.

Asynchronous Replication: Speed and Availability at Scale

Introduction

How Asynchronous Replication Works

The Replication Lag Problem

Statement-Based vs WAL-Based Replication

Eventual Consistency

Use Cases Where Async Makes Sense

Monitoring Replication Lag

The CAP Theorem and Async Replication

Reducing Lag in Async Setups

Read Routing with Async Replicas

When Async Is Not Enough

Real-world Failure Scenarios

Scenario 1: The Overwhelmed Replica

Scenario 2: The Partition That Wasn’t

Scenario 3: The Untested Failover

Scenario 4: The Replication Slot Catastrophe

Common Pitfalls / Anti-Patterns

Quick Recap

Read-Your-Writes Consistency Checklist

Trade-off Analysis

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

CRDTs: Conflict-Free Replicated Data Types

Synchronous Replication: Data Consistency Across Nodes

Database Replication: Master-Slave and Failover Patterns