Change Data Capture (CDC)

A deep dive into Change Data Capture (CDC) — how systems stream database changes to other services in real time, including architecture patterns, implementation strategies, event pipelines, and real-world distributed system use cases.

Introduction

Modern distributed systems rarely consist of a single database and a single application. Instead, large-scale systems include multiple services that must react to changes in data across the system.

For example:

  • When a user updates their profile, several systems might need to know:
    • Search indexes
    • Analytics systems
    • Recommendation engines
    • Notification services
    • Cache layers

If every service constantly queries the database for updates, the database becomes overwhelmed.

This is where Change Data Capture (CDC) becomes essential.

Change Data Capture (CDC) is a technique that detects changes in a database and streams those changes to other systems in real time.

Instead of repeatedly asking:

“Has anything changed?”

CDC allows systems to say:

“Notify me whenever something changes.”


The Problem CDC Solves

Imagine a simple e-commerce system.

We have:

  • Orders database
  • Inventory system
  • Analytics pipeline
  • Notification service
  • Search index

When a new order is created, several downstream systems must update.

Without CDC, the system might look like this.

Diagram
flowchart LR
    Service --> Database
    Analytics --> Database
    Search --> Database
    Notification --> Database
    Inventory --> Database

Problems:

| Problem | Explanation |
| --- | --- |
| High database load | Multiple systems repeatedly query for updates |
| Data inconsistency | Systems may fetch stale data |
| High latency | Changes are not propagated immediately |
| Tight coupling | Services depend directly on the database |

CDC solves this by streaming database changes once and distributing them to multiple consumers.


CDC Architecture Overview

With CDC, the system becomes event-driven.

Diagram
flowchart LR
    DB[(Database)]
    CDC[CDC Connector]
    MQ[(Event Stream / Message Queue)]
    DB --> CDC
    CDC --> MQ
    MQ --> Inventory
    MQ --> Search
    MQ --> Analytics
    MQ --> Notifications

Flow:

  1. A change happens in the database.
  2. CDC captures the change.
  3. CDC converts the change into an event.
  4. The event is published to a streaming platform.
  5. Multiple services consume the event.

This creates a loosely coupled architecture.
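
The five steps above can be sketched with a toy in-memory stream. This is only an illustration: `EventStream` and `capture_change` are hypothetical names, and a real pipeline would use a CDC connector and a durable broker instead of Python lists.

```python
from collections import defaultdict
from typing import Callable


class EventStream:
    """Toy in-memory stand-in for an event stream / message queue."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Step 5: every subscriber of the topic receives the same event.
        for handler in self._subscribers[topic]:
            handler(event)


def capture_change(stream: EventStream, table: str, operation: str, row: dict) -> dict:
    """Steps 2-4: turn a row change into an event and publish it once."""
    event = {"table": table, "operation": operation, "data": row}
    stream.publish(table, event)
    return event


stream = EventStream()
search_index, analytics = [], []          # stand-ins for two consumers
stream.subscribe("orders", search_index.append)
stream.subscribe("orders", analytics.append)

# Step 1: a change happens; here it is simulated by calling the capture hook.
capture_change(stream, "orders", "INSERT", {"order_id": 1001, "amount": 120})
```

The producer publishes once and never needs to know how many consumers exist, which is what keeps the services loosely coupled.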


Types of Database Changes Captured

CDC typically captures three kinds of operations.

| Operation | Meaning |
| --- | --- |
| INSERT | New record created |
| UPDATE | Existing record modified |
| DELETE | Record removed |

Example table:

Users
| id | name | email |
| --- | --- | --- |
| 1 | Alice | alice@email.com |

If Alice changes her email:

UPDATE Users SET email = 'alice@newmail.com' WHERE id = 1;

CDC emits an event:

{
  "operation": "UPDATE",
  "table": "Users",
  "before": {
    "id": 1,
    "email": "alice@email.com"
  },
  "after": {
    "id": 1,
    "email": "alice@newmail.com"
  }
}

This event can be consumed by many systems.
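
A consumer often only cares about which columns changed. The sketch below assumes the event shape shown above (`before` and `after` objects); `changed_fields` is a hypothetical helper, not part of any CDC tool.

```python
def changed_fields(event: dict) -> dict:
    """Return {column: (old_value, new_value)} for columns whose value differs."""
    before = event.get("before") or {}   # missing or null for an INSERT
    after = event.get("after") or {}     # missing or null for a DELETE
    return {
        column: (before.get(column), after.get(column))
        for column in sorted(before.keys() | after.keys())
        if before.get(column) != after.get(column)
    }


event = {
    "operation": "UPDATE",
    "table": "Users",
    "before": {"id": 1, "email": "alice@email.com"},
    "after": {"id": 1, "email": "alice@newmail.com"},
}

changes = changed_fields(event)
# changes == {"email": ("alice@email.com", "alice@newmail.com")}
```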


Where CDC Is Used

CDC is heavily used in modern architectures.

| Use Case | Example |
| --- | --- |
| Data replication | Syncing databases |
| Search indexing | Updating search engines |
| Analytics pipelines | Streaming data to data warehouses |
| Event-driven microservices | Triggering workflows |
| Cache invalidation | Updating distributed caches |

Example:

Diagram
flowchart LR
    DB[(Orders DB)]
    CDC
    Stream[(Event Stream)]
    DB --> CDC
    CDC --> Stream
    Stream --> Analytics
    Stream --> FraudDetection
    Stream --> Recommendation
    Stream --> Notifications

CDC Implementation Strategies

There are several ways to implement CDC.

1. Polling-Based CDC

The simplest approach is periodic polling.

A service repeatedly queries the database for changes.

Example:

SELECT * FROM orders
WHERE updated_at > :last_checked_time;  -- bind parameter: time of the previous poll

Diagram
sequenceDiagram
    participant Poller
    participant Database
    loop every 5 seconds
        Poller->>Database: Query updated rows
        Database-->>Poller: Return results
    end

Problems

| Issue | Explanation |
| --- | --- |
| High DB load | Constant polling |
| Delay | Changes not captured instantly |
| Inefficient | Many empty queries |

Because of these limitations, large systems rarely rely on polling.
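
To make the polling loop concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `orders` schema, the integer `updated_at` column, and the `poll_changes` helper are assumptions made for illustration.

```python
import sqlite3


def poll_changes(conn: sqlite3.Connection, last_checked: int):
    """Return rows changed after last_checked, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_checked,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_checked
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 'created', 100)")
conn.execute("INSERT INTO orders VALUES (2, 'created', 105)")

rows, watermark = poll_changes(conn, last_checked=0)     # first poll sees both rows

conn.execute("UPDATE orders SET status = 'shipped', updated_at = 110 WHERE id = 1")
rows2, watermark2 = poll_changes(conn, watermark)        # second poll sees only order 1

empty, watermark3 = poll_changes(conn, watermark2)       # nothing changed: an empty query
```

The third call shows the inefficiency from the table above: the poller pays for a query even when nothing changed. A deleted row also never matches this query, so polling on `updated_at` cannot report DELETE operations.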


2. Trigger-Based CDC

Another approach uses database triggers.

A trigger runs automatically when a row changes.

Example:

-- MySQL syntax: record which order changed and when
CREATE TRIGGER order_update_trigger
AFTER UPDATE ON orders
FOR EACH ROW
INSERT INTO order_events VALUES (NEW.id, NOW());

Architecture

Diagram
flowchart LR
    App --> DB
    DB --> Trigger
    Trigger --> EventTable

Advantages

  • Immediate capture of changes
  • Simple implementation

Disadvantages

| Problem | Explanation |
| --- | --- |
| Database overhead | Triggers slow down writes |
| Complex maintenance | Hard to manage many triggers |
| Coupling | Logic lives inside database |

Large-scale systems often avoid heavy trigger usage.
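
The trigger approach can be tried end to end with Python's built-in `sqlite3` module. Note that this sketch uses SQLite's trigger syntax (`BEGIN ... END`), which differs slightly from the example above, and the table layouts are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE order_events (order_id INTEGER, old_status TEXT, new_status TEXT);

-- Runs inside the same transaction as the UPDATE that fired it.
CREATE TRIGGER order_update_trigger
AFTER UPDATE ON orders
FOR EACH ROW
BEGIN
    INSERT INTO order_events VALUES (NEW.id, OLD.status, NEW.status);
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'created')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")

events = conn.execute("SELECT * FROM order_events").fetchall()
# events == [(1, 'created', 'shipped')]
```

Every update now performs an extra insert, which is the write overhead listed in the table above.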


3. Log-Based CDC (Most Scalable)

The most powerful CDC method reads the database transaction log.

Most production databases maintain a durable log of changes, which they already use for crash recovery and replication.

Examples:

| Database | Log Type |
| --- | --- |
| MySQL | Binlog |
| PostgreSQL | WAL |
| SQL Server | Transaction Log |
| MongoDB | Oplog |

Instead of polling tables, CDC tools read the log stream.

Architecture

Diagram
flowchart LR
    App --> DB[(Database)]
    DB --> Log[(Transaction Log)]
    Log --> CDC
    CDC --> Stream[(Event Stream)]
    Stream --> ServiceA
    Stream --> ServiceB
    Stream --> ServiceC

Advantages:

| Benefit | Explanation |
| --- | --- |
| No query overhead | Reads database logs |
| Real-time streaming | Changes captured instantly |
| Scalable | Suitable for high throughput systems |

This is the approach most commonly used in large production systems.
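
The core idea of log-based capture is tailing an append-only log from a remembered position. The sketch below simulates this with a Python list standing in for the binlog or WAL. `LogConnector` is a hypothetical class; real connectors also persist the offset so they can resume after a restart.

```python
class LogConnector:
    """Toy connector that tails an append-only log from a saved offset."""

    def __init__(self, log: list) -> None:
        self.log = log      # stand-in for the transaction log; the database appends to it
        self.offset = 0     # position of the next unread entry

    def poll(self) -> list:
        """Return entries appended since the last poll and advance the offset."""
        new_entries = self.log[self.offset:]
        self.offset = len(self.log)
        return new_entries


transaction_log = []
connector = LogConnector(transaction_log)

transaction_log.append({"op": "INSERT", "table": "orders", "id": 1001})
transaction_log.append({"op": "UPDATE", "table": "orders", "id": 1001})
first = connector.poll()     # both entries, in commit order

transaction_log.append({"op": "DELETE", "table": "orders", "id": 1001})
second = connector.poll()    # only the DELETE
third = connector.poll()     # nothing new
```

Unlike the polling query, the connector never touches the `orders` table itself, and the DELETE is captured like any other change.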


CDC Event Pipeline

A typical CDC pipeline looks like this.

Diagram
flowchart LR
    DB[(Database)]
    Log[(Transaction Log)]
    CDC[CDC Connector]
    Broker[(Event Broker)]
    ConsumerA
    ConsumerB
    ConsumerC
    DB --> Log
    Log --> CDC
    CDC --> Broker
    Broker --> ConsumerA
    Broker --> ConsumerB
    Broker --> ConsumerC

Components:

| Component | Responsibility |
| --- | --- |
| Database | Source of truth |
| Transaction Log | Record of changes |
| CDC Connector | Reads log and converts to events |
| Event Broker | Distributes events |
| Consumers | Services reacting to changes |

Data Flow Example

Consider an order creation event.

Step-by-step flow.

Diagram
sequenceDiagram
    participant App
    participant DB
    participant CDC
    participant Stream
    participant Inventory
    participant Analytics
    App->>DB: Insert Order
    DB-->>CDC: Log change
    CDC->>Stream: Publish OrderCreated
    Stream->>Inventory: Update stock
    Stream->>Analytics: Record sale

One change can trigger multiple downstream updates.


CDC Event Structure

A CDC event typically carries metadata alongside the changed row data.

Example:

{
  "event_id": "9821",
  "table": "orders",
  "operation": "INSERT",
  "timestamp": "2026-01-10T10:30:00Z",
  "data": {
    "order_id": 1001,
    "user_id": 42,
    "amount": 120
  }
}

Common fields:

| Field | Meaning |
| --- | --- |
| event_id | Unique event identifier |
| table | Table where change occurred |
| operation | INSERT / UPDATE / DELETE |
| timestamp | When change occurred |
| data | Row data |
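
On the consumer side, these fields can be parsed into a typed object. The following is a sketch that assumes the JSON layout shown above; `ChangeEvent` and `parse_event` are hypothetical names, and real CDC tools define their own envelope formats.

```python
import json
from dataclasses import dataclass
from datetime import datetime

VALID_OPERATIONS = {"INSERT", "UPDATE", "DELETE"}


@dataclass(frozen=True)
class ChangeEvent:
    event_id: str
    table: str
    operation: str
    timestamp: datetime
    data: dict


def parse_event(raw: str) -> ChangeEvent:
    """Parse a JSON change event and reject unknown operations."""
    payload = json.loads(raw)
    if payload["operation"] not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation: {payload['operation']}")
    # Older Python versions do not accept a trailing "Z", so normalise it first.
    timestamp = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    return ChangeEvent(
        event_id=payload["event_id"],
        table=payload["table"],
        operation=payload["operation"],
        timestamp=timestamp,
        data=payload["data"],
    )


raw = """{
  "event_id": "9821", "table": "orders", "operation": "INSERT",
  "timestamp": "2026-01-10T10:30:00Z",
  "data": {"order_id": 1001, "user_id": 42, "amount": 120}
}"""
event = parse_event(raw)
```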

Handling Ordering and Consistency

CDC systems must maintain event ordering.

Why?

Because consumers must apply changes in the same sequence in which they were committed to the database.

Example:

  1. Order created
  2. Order shipped
  3. Order delivered

If the delivered event arrives before the shipped event, a consumer can end up in an invalid state.

To ensure order:

  • Use log offsets
  • Use partitioning keys
  • Use ordered message streams
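
The partitioning-key idea can be sketched as follows: all events with the same key are routed to the same partition, so their relative order is preserved even when partitions are consumed in parallel. `partition_for` and `route` are hypothetical helpers, and the choice of SHA-256 is only an assumption; brokers use their own hash functions.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a stable partition number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events: list, num_partitions: int) -> list:
    """Append each event to the partition chosen by its order_id."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        index = partition_for(str(event["order_id"]), num_partitions)
        partitions[index].append(event)
    return partitions


events = [
    {"order_id": 1001, "status": "created"},
    {"order_id": 1002, "status": "created"},
    {"order_id": 1001, "status": "shipped"},
    {"order_id": 1001, "status": "delivered"},
]
partitions = route(events, num_partitions=4)

# All events for order 1001 sit in one partition, in their original order.
order_1001 = [
    e["status"]
    for e in partitions[partition_for("1001", 4)]
    if e["order_id"] == 1001
]
```

Ordering is guaranteed only within a key. Events for different orders may be processed in any relative order, which is usually acceptable.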

Idempotency in CDC

Consumers may receive duplicate events.

Example causes:

  • Retry
  • Consumer restart
  • Network failure

Consumers must process events idempotently.

Example strategy:

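processed_ids = set()  # assumption: a durable store in production

def handle(event, process_event):
    if event["event_id"] in processed_ids:
        return False  # duplicate: ignore
    process_event(event)
    processed_ids.add(event["event_id"])
    return True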

Handling Schema Changes

Database schemas evolve.

Example:

ALTER TABLE users ADD COLUMN phone VARCHAR(20);

CDC pipelines must handle:

| Challenge | Solution |
| --- | --- |
| Schema evolution | Versioned schemas |
| Backward compatibility | Schema registry |
| Consumer mismatch | Graceful parsing |
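
Graceful parsing usually means that a consumer tolerates columns it does not know about and supplies defaults for columns that older events lack. Here is a minimal sketch for the `phone` column added above; `parse_user` is a hypothetical helper.

```python
def parse_user(row: dict) -> dict:
    """Read a users row while tolerating schema drift."""
    return {
        "id": row["id"],              # required in every schema version
        "email": row["email"],        # required in every schema version
        "phone": row.get("phone"),    # added later; None for events from the old schema
    }
    # Columns this consumer does not know about are simply ignored.


old_event_row = {"id": 1, "email": "alice@email.com"}
new_event_row = {"id": 1, "email": "alice@email.com", "phone": "555-0100", "nickname": "al"}

old_user = parse_user(old_event_row)
new_user = parse_user(new_event_row)
```

Both rows parse without errors, so the consumer keeps working while producers roll out the new schema.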

Real World Example

Consider an online marketplace.

When a product changes:

  • Search index must update
  • Recommendation engine must retrain
  • Analytics must record metrics
  • Cache must invalidate

Architecture:

Diagram
flowchart LR
    ProductDB[(Product DB)]
    CDC
    Stream[(Event Stream)]
    ProductDB --> CDC
    CDC --> Stream
    Stream --> SearchIndex
    Stream --> CacheInvalidator
    Stream --> Analytics
    Stream --> Recommendation

Instead of services querying the database, they react to events.
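
As one example of a consumer in this architecture, a cache invalidator might look like the sketch below. The cache is a plain dictionary, and the `products` table name, the `product_id` field, and the `product:<id>` key format are all assumptions for illustration.

```python
def invalidate_on_change(event: dict, cache: dict) -> bool:
    """Drop the cached product when its row is updated or deleted."""
    if event["table"] != "products":
        return False
    if event["operation"] not in {"UPDATE", "DELETE"}:
        return False
    key = f"product:{event['data']['product_id']}"
    # pop with a default is a no-op when the key is not cached.
    return cache.pop(key, None) is not None


cache = {
    "product:7": {"name": "Lamp", "price": 30},
    "product:8": {"name": "Desk", "price": 200},
}
event = {"table": "products", "operation": "UPDATE", "data": {"product_id": 7, "price": 25}}

removed = invalidate_on_change(event, cache)
```

The next read for product 7 misses the cache and reloads fresh data, while product 8 stays cached.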


Benefits of CDC

| Benefit | Explanation |
| --- | --- |
| Real-time updates | Changes propagate instantly |
| Reduced database load | No repeated polling |
| Event-driven architecture | Services react to events |
| Loose coupling | Systems independent |
| Scalability | Supports many consumers |

Challenges of CDC

CDC introduces complexities.

| Challenge | Explanation |
| --- | --- |
| Event ordering | Must maintain change sequence |
| Schema evolution | Handling database changes |
| Exactly-once processing | Avoid duplicates |
| Operational complexity | Managing pipelines |
| Data consistency | Handling partial failures |

These challenges require careful system design.


Best Practices

Use Log-Based CDC

Prefer log-based CDC over polling.


Ensure Idempotent Consumers

Consumers must handle duplicate events safely.


Maintain Event Ordering

Partition events using a consistent key.


Use Schema Versioning

Support backward compatibility.


Monitor the CDC Pipeline

Track:

  • event lag
  • processing delays
  • failed consumers
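
Event lag can be measured as the difference between the time a change occurred and the time a consumer handles it. The sketch below assumes events carry an ISO 8601 `timestamp` in UTC, as in the event structure shown earlier; `event_lag_seconds` is a hypothetical helper.

```python
from datetime import datetime, timezone


def event_lag_seconds(event_timestamp: str, now: datetime) -> float:
    """Seconds between when the change occurred and when it is being processed."""
    # Older Python versions do not accept a trailing "Z", so normalise it first.
    occurred_at = datetime.fromisoformat(event_timestamp.replace("Z", "+00:00"))
    return (now - occurred_at).total_seconds()


# A fixed "now" keeps the example deterministic; a consumer would pass
# datetime.now(timezone.utc) instead.
now = datetime(2026, 1, 10, 10, 30, 5, tzinfo=timezone.utc)
lag = event_lag_seconds("2026-01-10T10:30:00Z", now)
# lag == 5.0
```

A lag that keeps growing suggests that consumers cannot keep up with the rate of changes.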

Summary

Change Data Capture enables real-time propagation of database changes across distributed systems.

Instead of services repeatedly querying databases, CDC allows systems to stream events whenever data changes.

Key ideas:

| Concept | Meaning |
| --- | --- |
| CDC | Capturing database changes |
| Event streaming | Broadcasting changes |
| Log-based capture | Reading transaction logs |
| Event consumers | Services reacting to changes |

CDC is a foundational technology for building event-driven, scalable, and loosely coupled architectures in modern distributed systems.

Without CDC, large-scale systems would struggle to keep multiple services synchronized with continuously changing data.