Introduction

As previously explained, CDC (Change Data Capture) is one of the best ways to connect an OLTP database system to other systems such as a data warehouse, caches, Spark, or Hadoop.

Debezium is an open-source project developed by Red Hat that aims to simplify this process by allowing you to extract changes from various database systems (e.g. MySQL, PostgreSQL, MongoDB) and push them to Apache Kafka.

In this article, we are going to see how you can extract events from MySQL binary logs using Debezium.

Debezium Architecture


First, you need a database-specific Debezium connector to be able to extract the Redo Log (e.g. Oracle), Binary Log (e.g. MySQL), or Write-Ahead Logs (e.g. PostgreSQL).

You also need to have Kafka running so that you can push the extracted log events and make them available to other services in your enterprise system. Debezium itself does not need Apache ZooKeeper, but Kafka does, since it relies on ZooKeeper for consensus as well as linearizability guarantees.

Installing Debezium

If you want to give Debezium a try, you can follow this very extensive tutorial offered in the Debezium documentation section.

Basically, you have to run the following Docker containers:

> docker run -it --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper:0.5

> docker run -it --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka:0.5

> docker run -it --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:0.5

> docker run -it --name kafka-connect -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my_connect_configs -e OFFSET_STORAGE_TOPIC=my_connect_offsets --link zookeeper:zookeeper --link kafka:kafka --link mysql:mysql debezium/connect:0.5

> docker run -it --name kafka-watcher --link zookeeper:zookeeper debezium/kafka:0.5 watch-topic -a -k dbserver1.inventory.customers

Afterward, you should have the following containers listed by Docker:

> docker ps -a

CONTAINER ID        IMAGE                          NAMES
bbfeafd9125c        debezium/kafka:0.5             kafka-watcher
4cfffedae69c        debezium/connect:0.5           kafka-connect
36734bc82864        debezium/example-mysql:0.5     mysql
daaaab6f3206        debezium/kafka:0.5             kafka
8a7affd3e2a4        debezium/zookeeper:0.5         zookeeper

Using bash, you need to create a new connector:

curl -i -X POST \
 -H "Accept:application/json" \
 -H "Content-Type:application/json" \
 localhost:8083/connectors/ \
 -d '{ "name": "inventory-connector", "config": { "connector.class": "io.debezium.connector.mysql.MySqlConnector", "tasks.max": "1", "database.hostname": "mysql", "database.port": "3306", "database.user": "debezium", "database.password": "dbz", "database.server.id": "184054", "database.server.name": "dbserver1", "database.whitelist": "inventory", "database.history.kafka.bootstrap.servers": "kafka:9092", "database.history.kafka.topic": "dbhistory.inventory" } }'
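If you would rather register the connector from a script, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not part of the Debezium tutorial: the function names build_connector_payload and register_connector are hypothetical, and the URL assumes the kafka-connect container publishes port 8083 on localhost, as in the docker command above.

```python
import json
import urllib.request

# Assumption: Kafka Connect is reachable on the port published by the
# kafka-connect container started earlier.
CONNECT_URL = "http://localhost:8083/connectors/"


def build_connector_payload():
    """Return the same connector definition that the curl command sends."""
    return {
        "name": "inventory-connector",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "tasks.max": "1",
            "database.hostname": "mysql",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "dbz",
            "database.server.id": "184054",
            "database.server.name": "dbserver1",
            "database.whitelist": "inventory",
            "database.history.kafka.bootstrap.servers": "kafka:9092",
            "database.history.kafka.topic": "dbhistory.inventory",
        },
    }


def register_connector(url=CONNECT_URL):
    """POST the connector definition and return the HTTP status code."""
    body = json.dumps(build_connector_payload()).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Printing the payload has no side effects. Call register_connector()
    # only once the containers listed above are running.
    print(json.dumps(build_connector_payload(), indent=2))
```

Keeping the payload in a separate function makes it easy to inspect or tweak the configuration before anything is sent to Kafka Connect.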

Notice that kafka-watcher was started in interactive mode so that we can see the CDC log events captured by Debezium in its console.

Testing time

Now, if we connect to the MySQL Docker container using the root user and the debezium password, we can issue various SQL statements and inspect the kafka-watcher container console output.

INSERT

When inserting a new customer row:

INSERT INTO `inventory`.`customers`
(
    `first_name`,
    `last_name`,
    `email`)
VALUES
(
    'Vlad',
    'Mihalcea',
    'vlad@acme.org'
);

In the kafka-watcher, we can now find the following JSON entry:

{  
   "payload":{  
      "before":null,
      "after":{  
         "id":1005,
         "first_name":"Vlad",
         "last_name":"Mihalcea",
         "email":"vlad@acme.org"
      },
      "source":{  
         "name":"dbserver1",
         "server_id":223344,
         "ts_sec":1500369632,
         "gtid":null,
         "file":"mysql-bin.000003",
         "pos":364,
         "row":0,
         "snapshot":null,
         "thread":13,
         "db":"inventory",
         "table":"customers"
      },
      "op":"c",
      "ts_ms":1500369632095
   }
}

The before object is null while the after object shows the newly inserted value. Notice that the op attribute value is c, meaning it’s a CREATE event.
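To make the event structure more concrete, here is a minimal Python sketch of how a consumer might summarize such an event. The describe_event function and the OPERATIONS mapping are hypothetical names, the mapping covers only the three op codes shown in this article, and the sample event is a trimmed copy of the JSON above.

```python
import json

# Operation codes that appear in this article's events.
OPERATIONS = {"c": "CREATE", "u": "UPDATE", "d": "DELETE"}


def describe_event(raw_event):
    """Summarize a Debezium change event given as a JSON string."""
    payload = json.loads(raw_event)["payload"]
    source = payload["source"]
    operation = OPERATIONS.get(payload["op"], "UNKNOWN")
    # For a CREATE event, "before" is null and "after" holds the new row.
    row = payload["after"] if payload["after"] is not None else payload["before"]
    return f'{operation} on {source["db"]}.{source["table"]}: id={row["id"]}'


# Trimmed copy of the INSERT event shown above.
insert_event = """
{"payload": {"before": null,
  "after": {"id": 1005, "first_name": "Vlad", "last_name": "Mihalcea",
            "email": "vlad@acme.org"},
  "source": {"name": "dbserver1", "db": "inventory", "table": "customers"},
  "op": "c", "ts_ms": 1500369632095}}
"""

print(describe_event(insert_event))
# CREATE on inventory.customers: id=1005
```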

UPDATE

When updating the customer row:

UPDATE `inventory`.`customers`
SET
    `email` = 'vlad.mihalcea@acme.org'
WHERE
    `id` = 1005;

We can now find the following log event:

{
    "payload":{ 
      "before":{ 
         "id":1005,
         "first_name":"Vlad",
         "last_name":"Mihalcea",
         "email":"vlad@acme.org"
      },
      "after":{ 
         "id":1005,
         "first_name":"Vlad",
         "last_name":"Mihalcea",
         "email":"vlad.mihalcea@acme.org"
      },
      "source":{ 
         "name":"dbserver1",
         "server_id":223344,
         "ts_sec":1500369929,
         "gtid":null,
         "file":"mysql-bin.000003",
         "pos":673,
         "row":0,
         "snapshot":null,
         "thread":13,
         "db":"inventory",
         "table":"customers"
      },
      "op":"u",
      "ts_ms":1500369929464
   }
}

The op attribute value is u, meaning we have an UPDATE log event. The before object shows the row state before the update while the after object captures the current state of the updated customer database row.
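Because an UPDATE event carries both row states, a consumer can work out which columns actually changed. The following Python sketch illustrates the idea. The changed_columns helper is a hypothetical name, and the payload is a trimmed copy of the event above, without the source block.

```python
def changed_columns(payload):
    """Return {column: (old, new)} for columns that differ in an UPDATE event."""
    before = payload["before"] or {}
    after = payload["after"] or {}
    return {
        column: (before.get(column), after.get(column))
        for column in after
        if before.get(column) != after.get(column)
    }


# Trimmed copy of the UPDATE event payload shown above.
update_payload = {
    "before": {"id": 1005, "first_name": "Vlad", "last_name": "Mihalcea",
               "email": "vlad@acme.org"},
    "after": {"id": 1005, "first_name": "Vlad", "last_name": "Mihalcea",
              "email": "vlad.mihalcea@acme.org"},
    "op": "u",
}

print(changed_columns(update_payload))
# {'email': ('vlad@acme.org', 'vlad.mihalcea@acme.org')}
```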

DELETE

When issuing a DELETE statement:

DELETE FROM `inventory`.`customers`
WHERE id = 1005;

The following event is recorded by Debezium, which runs in the kafka-connect Docker container:

{ 
    "payload":{  
      "before":{  
         "id":1005,
         "first_name":"Vlad",
         "last_name":"Mihalcea",
         "email":"vlad.mihalcea@acme.org"
      },
      "after":null,
      "source":{  
         "name":"dbserver1",
         "server_id":223344,
         "ts_sec":1500370394,
         "gtid":null,
         "file":"mysql-bin.000003",
         "pos":1025,
         "row":0,
         "snapshot":null,
         "thread":13,
         "db":"inventory",
         "table":"customers"
      },
      "op":"d",
      "ts_ms":1500370394589
   }
}

The op attribute value is d, meaning we have a DELETE log event and the after object is now null. The before object captures the database row state before it got deleted.
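Taken together, the three event types are enough to keep a downstream copy of the table in sync, such as a cache. The Python sketch below replays trimmed versions of the events above against a plain dictionary. The apply_event function is a hypothetical name, and a real consumer would read these payloads from the Kafka topic rather than from a hard-coded list.

```python
def apply_event(cache, payload):
    """Apply one change event payload to a dict keyed by customer id."""
    op = payload["op"]
    if op in ("c", "u"):
        # CREATE and UPDATE both carry the current row state in "after".
        row = payload["after"]
        cache[row["id"]] = row
    elif op == "d":
        # "after" is null for a DELETE, so the key comes from "before".
        cache.pop(payload["before"]["id"], None)
    else:
        raise ValueError(f"unexpected op: {op!r}")
    return cache


# Trimmed versions of the INSERT, UPDATE, and DELETE events from this article.
events = [
    {"op": "c", "before": None,
     "after": {"id": 1005, "email": "vlad@acme.org"}},
    {"op": "u", "before": {"id": 1005, "email": "vlad@acme.org"},
     "after": {"id": 1005, "email": "vlad.mihalcea@acme.org"}},
    {"op": "d", "before": {"id": 1005, "email": "vlad.mihalcea@acme.org"},
     "after": None},
]

cache = {}
for event in events[:2]:
    apply_event(cache, event)
print(cache)  # {1005: {'id': 1005, 'email': 'vlad.mihalcea@acme.org'}}

apply_event(cache, events[2])
print(cache)  # {}
```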

Brilliant, right?