Introduction
Today, many companies need to move data quickly from one system to another. For example, when a new order is placed in an online shop, that order data must be sent to many other systems: payment, inventory, shipping, and reporting. Doing this manually is slow and error-prone. This is where Debezium comes in.
What is Debezium?
Debezium is an open-source platform that monitors a database and captures every change made to it. This process is called Change Data Capture (CDC).
In simple terms, Debezium can:
- Detect when new data is added (INSERT).
- Detect when existing data is changed (UPDATE).
- Detect when data is removed (DELETE).
Then, Debezium sends these changes to other systems in real time.
What is Change Data Capture (CDC)?
Change Data Capture means listening to a database and recording every change.
- Example: In a shop database, a new customer buys a product. CDC will capture that new record.
- If the customer changes their address, CDC will capture the update.
- If the record is deleted, CDC will capture the delete.
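The three cases above map directly onto the change events Debezium emits. A minimal sketch of that envelope (simplified; real events also carry source metadata, schemas, and timestamps) looks like this, using Debezium's actual op codes: "c" for create, "u" for update, "d" for delete, and "r" for a snapshot read:

```python
# Simplified sketch of Debezium-style change events (illustrative only;
# real events also include source metadata, schema info, and timestamps).

def describe(event):
    """Map a Debezium op code to a human-readable change type."""
    ops = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}
    return ops[event["op"]]

# A new customer row is created: "before" is None, "after" holds the new row.
insert_event = {"op": "c", "before": None,
                "after": {"id": 42, "city": "Berlin"}}

# The customer changes their address: both states are present.
update_event = {"op": "u",
                "before": {"id": 42, "city": "Berlin"},
                "after": {"id": 42, "city": "Munich"}}

# The row is deleted: "after" is None.
delete_event = {"op": "d",
                "before": {"id": 42, "city": "Munich"}, "after": None}

print(describe(insert_event))  # insert
print(describe(update_event))  # update
print(describe(delete_event))  # delete
```

Having both the old ("before") and new ("after") state in every event is what lets downstream systems reconstruct exactly what happened, not just what the row looks like now.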
This is very useful for keeping systems synchronized and always up to date.
How Debezium Works
Instead of querying tables repeatedly, Debezium connects to the transaction log of a database.
- For MySQL, it reads the binlog.
- For PostgreSQL, it reads the WAL (Write-Ahead Log).
These logs store every change that happens in the database. Debezium then sends the change events to a system like Apache Kafka, which can deliver them to many applications at once.
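In a typical deployment, Debezium connectors run inside Kafka Connect and are registered with a JSON configuration. As a rough sketch, a MySQL connector configuration might look like the following (all hostnames, credentials, and table names are placeholders; note that property names vary by Debezium version, e.g. older releases used database.server.name instead of topic.prefix):

```json
{
  "name": "shop-mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "shop",
    "table.include.list": "inventory.customers,inventory.orders",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.shop"
  }
}
```

This configuration is typically POSTed to the Kafka Connect REST API; from then on, the connector reads the binlog and publishes change events without any application code.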
Core Flow
Database → Debezium → Kafka → Consumer
```mermaid
flowchart LR
    %% Sources
    subgraph SRC[Source Databases]
        A1[MySQL binlog]
        A2[PostgreSQL WAL]
        A3[MongoDB oplog]
        A4[SQL Server CDC]
    end
    %% Debezium Connect
    subgraph DBZ[Debezium Connectors]
        C1[MySQL Connector]
        C2[Postgres Connector]
        C3[MongoDB Connector]
        C4[SQL Server Connector]
    end
    %% Kafka
    subgraph KAFKA[Apache Kafka]
        T1[Topic: table1]
        T2[Topic: table2]
    end
    %% Consumers
    subgraph SINKS[Downstream Systems]
        S1[Stream Processor - Flink]
        S2[Search Index - Elasticsearch]
        S3[Data Warehouse - BigQuery]
        S4[Microservices]
        S5[Cache - Redis]
    end
    %% Edges
    A1 --> C1
    A2 --> C2
    A3 --> C3
    A4 --> C4
    C1 --> T1
    C2 --> T1
    C3 --> T2
    C4 --> T2
    T1 --> S1
    T1 --> S2
    T1 --> S3
    T1 --> S4
    T2 --> S5
```
Source Databases (Left side)
- These are the original systems where data is stored, such as MySQL, PostgreSQL, MongoDB, or SQL Server.
- Each database writes changes (insert, update, delete) into its transaction log (e.g., binlog for MySQL, WAL for PostgreSQL).
Debezium Connectors (Middle layer)
- Debezium has a specific connector for each type of database.
- The connector reads the database’s transaction log and captures every change event.
- Example: if a new row is added to a table in MySQL, the MySQL Connector will detect this event.
Apache Kafka (Center)
- Debezium sends the change events to Kafka topics.
- Each table usually maps to its own topic (for example, table1 and table2).
- Kafka acts as a buffer and message broker, allowing many consumers to read the events at their own pace.
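On the consuming side, each Kafka message value is a serialized change event. A minimal sketch of decoding one message is shown below (this assumes the JSON converter without embedded schemas, and topic names of the form prefix.database.table; a real consumer would receive these bytes from a Kafka client rather than build them inline):

```python
import json

# Sketch of what a consumer does with one raw Debezium message. Assumes the
# JSON converter without schemas and topic names like "<prefix>.<db>.<table>".

def handle_message(topic, value_bytes):
    """Decode one Debezium message and summarize the change it describes."""
    event = json.loads(value_bytes)
    table = topic.split(".")[-1]   # "shop.inventory.orders" -> "orders"
    op = event["op"]               # "c", "u", "d", or "r"
    # For deletes the row state lives in "before"; otherwise in "after".
    row = event["after"] if op != "d" else event["before"]
    return {"table": table, "op": op, "row": row}

# Simulate one message as it would arrive from Kafka.
raw = json.dumps({"op": "c", "before": None,
                  "after": {"order_id": 1001, "status": "NEW"}}).encode()
summary = handle_message("shop.inventory.orders", raw)
print(summary)  # {'table': 'orders', 'op': 'c', 'row': {'order_id': 1001, 'status': 'NEW'}}
```

Because Kafka retains events, several such consumers can process the same topic independently and at different speeds.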
Downstream Systems (Right side)
- These are systems or applications that consume the data changes from Kafka:
- Stream Processors (e.g., Flink, Kafka Streams) for real-time computation.
- Search Index (e.g., Elasticsearch) to keep search results up to date.
- Data Warehouse (e.g., BigQuery, Snowflake) for analytics and reporting.
- Microservices that need to react to data changes instantly.
- Caches (e.g., Redis) to maintain fast in-memory views.
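The cache case is the simplest to sketch: inserts and updates write the new row state, deletes evict it. The example below uses a plain dict as a stand-in for Redis (a real consumer would read the events from Kafka and write them with a Redis client):

```python
# Keeping a cache in sync from change events: a minimal sketch using a
# plain dict as a stand-in for Redis.

def apply_to_cache(cache, event, key_field="id"):
    """Inserts and updates store the new row state; deletes evict the key."""
    if event["op"] in ("c", "u", "r"):        # create, update, snapshot read
        row = event["after"]
        cache[row[key_field]] = row
    elif event["op"] == "d":                  # delete
        cache.pop(event["before"][key_field], None)

cache = {}
apply_to_cache(cache, {"op": "c", "before": None,
                       "after": {"id": 7, "qty": 3}})
apply_to_cache(cache, {"op": "u", "before": {"id": 7, "qty": 3},
                       "after": {"id": 7, "qty": 5}})
print(cache)  # {7: {'id': 7, 'qty': 5}}
apply_to_cache(cache, {"op": "d", "before": {"id": 7, "qty": 5},
                       "after": None})
print(cache)  # {}
```

The same pattern generalizes to the other sinks: a search index upserts documents on "c"/"u" and removes them on "d", and a warehouse loader appends or merges the row states.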