Introduction
As previously explained, CDC (Change Data Capture) is one of the best ways to interconnect an OLTP database system with other systems, such as a Data Warehouse, caches, Spark, or Hadoop.
Debezium is an open-source project developed by Red Hat that aims to simplify this process by allowing you to extract changes from various database systems (e.g. MySQL, PostgreSQL, MongoDB) and push them to Apache Kafka.
In this article, we are going to see how you can extract events from MySQL binary logs using Debezium.
Debezium Architecture
First, you need a database-specific Debezium connector that can extract the transaction log of the database in question: the Redo Log for Oracle, the Binary Log for MySQL, or the Write-Ahead Log for PostgreSQL.
You also need to have Kafka running so that you can push the extracted log events and make them available to other services in your enterprise system. Apache ZooKeeper is not needed by Debezium itself, but by Kafka, which relies on it for consensus as well as linearizability guarantees.
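To give an idea of what consuming these change events looks like from the other side, here is a minimal sketch of a plain Kafka consumer reading the customer change events produced later in this article. It assumes a recent kafka-clients dependency on the classpath and the dbserver1.inventory.customers topic that the MySQL connector configured below will populate; the class name is illustrative only.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CustomerChangeEventConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "customer-change-consumer");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Debezium creates one topic per captured table,
            // named <server-name>.<database>.<table>
            consumer.subscribe(
                Collections.singletonList("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a Debezium change event, serialized as JSON
                    System.out.println(record.value());
                }
            }
        }
    }
}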
Installing Debezium
If you want to give Debezium a try, you can follow this very extensive tutorial offered in the Debezium documentation section.
Basically, you have to run the following Docker containers:
> docker run -it --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper:0.5
> docker run -it --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka:0.5
> docker run -it --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:0.5
> docker run -it --name kafka-connect -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my_connect_configs -e OFFSET_STORAGE_TOPIC=my_connect_offsets --link zookeeper:zookeeper --link kafka:kafka --link mysql:mysql debezium/connect:0.5
> docker run -it --name kafka-watcher --link zookeeper:zookeeper debezium/kafka:0.5 watch-topic -a -k dbserver1.inventory.customers
Afterward, you should have the following containers listed by Docker:
> docker ps -a
CONTAINER ID   IMAGE                        NAMES
bbfeafd9125c   debezium/kafka:0.5           kafka-watcher
4cfffedae69c   debezium/connect:0.5         kafka-connect
36734bc82864   debezium/example-mysql:0.5   mysql
daaaab6f3206   debezium/kafka:0.5           kafka
8a7affd3e2a4   debezium/zookeeper:0.5       zookeeper
Using bash, you need to create a new connector by posting its configuration to the Kafka Connect REST API:
curl -i -X POST \
  -H "Accept:application/json" \
  -H "Content-Type:application/json" \
  localhost:8083/connectors/ \
  -d '{
    "name": "inventory-connector",
    "config": {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "tasks.max": "1",
      "database.hostname": "mysql",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "dbz",
      "database.server.id": "184054",
      "database.server.name": "dbserver1",
      "database.whitelist": "inventory",
      "database.history.kafka.bootstrap.servers": "kafka:9092",
      "database.history.kafka.topic": "dbhistory.inventory"
    }
  }'
Notice that kafka-watcher was started in interactive mode so that we can see in the console the CDC log events captured by Debezium.
Testing time
Now, if we connect to the MySQL Docker container using the root user and the debezium password, we can issue various SQL statements and inspect the kafka-watcher container console output.
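One way to open such a MySQL session (assuming the mysql command-line client is available inside the container, as is the case for the debezium/example-mysql image) is:

> docker exec -it mysql mysql -u root -pdebezium inventory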
INSERT
When inserting a new customer row:
INSERT INTO `inventory`.`customers` (
    `first_name`,
    `last_name`,
    `email`
)
VALUES (
    'Vlad',
    'Mihalcea',
    'vlad@acme.org'
)
In the kafka-watcher console, we can now find the following JSON entry:
{
"payload":{
"before":null,
"after":{
"id":1005,
"first_name":"Vlad",
"last_name":"Mihalcea",
"email":"vlad@acme.org"
},
"source":{
"name":"dbserver1",
"server_id":223344,
"ts_sec":1500369632,
"gtid":null,
"file":"mysql-bin.000003",
"pos":364,
"row":0,
"snapshot":null,
"thread":13,
"db":"inventory",
"table":"customers"
},
"op":"c",
"ts_ms":1500369632095
}
}
The before object is null while the after object shows the newly inserted value. Notice that the op attribute value is c, meaning it’s a CREATE event.
UPDATE
When updating the customer row:
UPDATE `inventory`.`customers`
SET
`email` = 'vlad.mihalcea@acme.org'
WHERE
`id` = 1005
We can now find the following log event:
{
"payload":{
"before":{
"id":1005,
"first_name":"Vlad",
"last_name":"Mihalcea",
"email":"vlad@acme.org"
},
"after":{
"id":1005,
"first_name":"Vlad",
"last_name":"Mihalcea",
"email":"vlad.mihalcea@acme.org"
},
"source":{
"name":"dbserver1",
"server_id":223344,
"ts_sec":1500369929,
"gtid":null,
"file":"mysql-bin.000003",
"pos":673,
"row":0,
"snapshot":null,
"thread":13,
"db":"inventory",
"table":"customers"
},
"op":"u",
"ts_ms":1500369929464
}
}
The op attribute value is u, meaning we have an UPDATE log event. The before object shows the row state before the update, while the after object captures the current state of the updated customer database row.
DELETE
When issuing a DELETE statement:
DELETE FROM `inventory`.`customers`
WHERE id = 1005;
The following event is recorded by the kafka-connect Docker container:
{
"payload":{
"before":{
"id":1005,
"first_name":"Vlad",
"last_name":"Mihalcea",
"email":"vlad.mihalcea@acme.org"
},
"after":null,
"source":{
"name":"dbserver1",
"server_id":223344,
"ts_sec":1500370394,
"gtid":null,
"file":"mysql-bin.000003",
"pos":1025,
"row":0,
"snapshot":null,
"thread":13,
"db":"inventory",
"table":"customers"
},
"op":"d",
"ts_ms":1500370394589
}
}
The op attribute value is d, meaning we have a DELETE log event, and the after object is now null. The before object captures the database row state before it got deleted.
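Putting the three operation types together, a downstream service could keep, for instance, a local cache in sync by dispatching on the op attribute of each change event. The following is a minimal sketch, assuming Jackson is used to parse the change event JSON; the class name and the cache structure are illustrative only.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CustomerCacheSynchronizer {

    private static final ObjectMapper JSON_MAPPER = new ObjectMapper();

    // id -> latest known row state
    private static final Map<Long, JsonNode> CUSTOMER_CACHE =
        new ConcurrentHashMap<>();

    public static void onChangeEvent(String changeEventJson) throws Exception {
        if (changeEventJson == null) {
            // Debezium also emits a tombstone (null value) after a delete,
            // so that Kafka log compaction can discard the deleted key
            return;
        }
        JsonNode payload = JSON_MAPPER.readTree(changeEventJson).get("payload");
        switch (payload.get("op").asText()) {
            case "c": // CREATE: cache the newly inserted row
            case "u": // UPDATE: overwrite the cached row with the after state
                JsonNode after = payload.get("after");
                CUSTOMER_CACHE.put(after.get("id").asLong(), after);
                break;
            case "d": // DELETE: evict the row using the before state
                JsonNode before = payload.get("before");
                CUSTOMER_CACHE.remove(before.get("id").asLong());
                break;
            default:
                // other op values (e.g. snapshot reads) are ignored here
                break;
        }
    }
}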
Brilliant, right?