Change Data Capture (CDC) is a process that captures the changes made in a database and seamlessly replicates them to a destination, typically a Data Warehouse. It is considered ideal for zero-downtime Cloud-based migrations.
CDC is primarily known for reducing the resources required for an ETL process. It does this either by using a source Database’s Binlog (binary log) or by depending on trigger functions to assimilate only the data that has changed since the previous ETL operation.
CDC provides real-time or near real-time movement of data by processing and moving the data continuously as new database events occur.
You need to replicate transactional data from your database into a Data Warehouse or Analytical Database because these are far more efficient than Transactional Databases at executing large analytical workloads.
In a transactional database, data is stored, accessed, and updated by rows, whereas Analytical Databases adopt a column-based structure in which each column of data is stored together. This architecture allows you to work quickly on a large set of values within a given column.
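To make the distinction concrete, here is a toy sketch (not any particular database's storage engine) contrasting a row-oriented and a column-oriented layout, and why an analytical aggregate favors the latter:

```python
# Toy illustration of row-oriented vs. column-oriented storage.
# Table: (order_id, customer, amount)

# Row-oriented: each record is stored together (good for transactional access
# such as reading or updating one whole order at a time).
rows = [
    (1, "alice", 120.0),
    (2, "bob", 75.5),
    (3, "carol", 210.0),
]

# Column-oriented: each column is stored together (good for analytical scans).
columns = {
    "order_id": [1, 2, 3],
    "customer": ["alice", "bob", "carol"],
    "amount": [120.0, 75.5, 210.0],
}

# An analytical query like SUM(amount) must touch every field of every row
# in the row layout, but only a single contiguous column in the columnar one.
total_row_store = sum(r[2] for r in rows)
total_col_store = sum(columns["amount"])

print(total_row_store, total_col_store)  # 405.5 405.5
```

In a real columnar engine the per-column layout also enables compression and vectorized scans, which is what makes large aggregations fast.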
A simple way to replicate a source database to a destination is through a Bulk Load Update (or full replication) of the source to the destination. In this Batch Processing approach, the system processes and groups data based on a schedule.
This method is considered easy to implement as it keeps the destination updated with the latest records. A drawback of this method is that it utilizes a lot of system resources. This makes this replication method unsuitable for larger datasets.
The complexity of the replication method increases with the data volume. This is because new data is being constantly added, and existing data is continually changing.
Every action taken by a business’s employees, suppliers, customers, and partners potentially produces one or more rows to replicate. As opposed to bulk data updates, continuous Change Data Capture scales more efficiently and makes data available for analysis faster.
You can carry out CDC with continuous streaming for real-time updates. You can also do the same asynchronously by looking at the source periodically and replicating the modified data.
Here are the three different types of CDC methods you can implement for your business case.
Enterprise databases keep all the changes in a transaction log for system recovery if it crashes. Every change in data made by the users is written to the transaction logs. A Log-based CDC method reads the changes that go into the logs and moves them to the destination Data Warehouse in real-time.
This is an asynchronous method known for providing high reliability. Because transaction logs exist to ensure that the database can recover to a consistent state, they account for every change made by the users.
The database records and transaction logs are maintained separately which means the process has no impact on the transactions of the source database. Therefore, there is no need to write additional procedures that increase the system load.
This process operates with minimum latency. It tracks the changes in the transaction log and verifies that these changes are moved to the destination (or various destinations). A common limitation of Log-based CDC is that log formats are proprietary with a lot of them lacking sufficient documentation.
It is also limited by the fact that transaction logs are frequently archived. This poses a problem because the CDC software must either read the logs before they are archived or be able to read the archived logs.
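The core idea of log-based CDC can be sketched in a few lines. The example below is a deliberately simplified model: a real implementation would read the database's own transaction log (e.g. MySQL's binlog or PostgreSQL's WAL) rather than a Python list, and the entry format shown is illustrative:

```python
# Minimal sketch of log-based CDC: an append-only change log is replayed
# into a destination table, and a saved offset marks how far we have read.

# Each log entry: (operation, primary_key, row_data_or_None)
transaction_log = [
    ("INSERT", 1, {"name": "alice"}),
    ("INSERT", 2, {"name": "bob"}),
    ("UPDATE", 1, {"name": "alicia"}),
    ("DELETE", 2, None),
]

destination = {}  # stands in for the Data Warehouse table

def apply_log(log, dest, from_offset=0):
    """Replay log entries starting at from_offset; return the new offset."""
    for op, key, row in log[from_offset:]:
        if op in ("INSERT", "UPDATE"):
            dest[key] = row
        elif op == "DELETE":
            dest.pop(key, None)
    return len(log)

offset = apply_log(transaction_log, destination)
print(destination)  # {1: {'name': 'alicia'}}

# New events appended later are picked up from the saved offset, so the
# destination stays in sync without re-reading old entries.
transaction_log.append(("INSERT", 3, {"name": "carol"}))
offset = apply_log(transaction_log, destination, offset)
print(destination)  # {1: {'name': 'alicia'}, 3: {'name': 'carol'}}
```

The saved offset is also why log archival matters: if entries are archived before the reader reaches them, the offset points at data the CDC process can no longer fetch.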
A CDC solution can also be created at the application level, relying on one or more fields that indicate that the data in a row has changed. To carry this out, you may have to change the schema to add a field depicting when the record was created or updated: a datetime column, a boolean status indicator, or a version number.
Here you can try auditing columns by utilizing the “DATE_MODIFIED” or “LAST_UPDATED” columns.
Here are the steps involved in this process:
- Step 1: Obtain the maximum value of both the target table’s ‘Updated_Time’ and ‘Created_Time’ columns.
- Step 2: Select all the rows from the data source with ‘Created_Time’ larger than (>) the target table’s maximum ‘Created_Time’. This gives you all the newly created rows since the last CDC process was executed.
- Step 3: Next, select all the rows from the source table with an ‘Updated_Time’ greater than (>) the target table’s maximum ‘Updated_Time’ but a ‘Created_Time’ less than or equal to (<=) its maximum ‘Created_Time’. Rows created after the target table’s maximum ‘Created_Time’ were already captured in Step 2, so they are excluded here.
- Step 4: Insert the new rows from Step 2 into the target, and update the existing target rows with the modified rows from Step 3.
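The steps above can be sketched end-to-end with an in-memory SQLite database. Table and column names here are illustrative, not a prescribed schema:

```python
import sqlite3

# Audit-column CDC sketch: 'created_time'/'updated_time' play the role of
# the DATE_MODIFIED / LAST_UPDATED auditing columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, val TEXT,
                         created_time INTEGER, updated_time INTEGER);
    CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT,
                         created_time INTEGER, updated_time INTEGER);
    INSERT INTO source VALUES (1, 'new-val', 100, 250);  -- updated since last sync
    INSERT INTO source VALUES (2, 'fresh',   300, 300);  -- created since last sync
    INSERT INTO target VALUES (1, 'old-val', 100, 100);  -- stale copy in target
""")

# Step 1: maximum audit values already present in the target.
max_created, max_updated = conn.execute(
    "SELECT COALESCE(MAX(created_time), 0), COALESCE(MAX(updated_time), 0)"
    " FROM target"
).fetchone()

# Step 2: rows created since the last CDC run.
new_rows = conn.execute(
    "SELECT * FROM source WHERE created_time > ?", (max_created,)
).fetchall()

# Step 3: rows updated since the last run but created before it
# (newer creations were already captured in Step 2).
changed_rows = conn.execute(
    "SELECT * FROM source WHERE updated_time > ? AND created_time <= ?",
    (max_updated, max_created),
).fetchall()

# Step 4: insert the new rows and apply the updates.
conn.executemany("INSERT INTO target VALUES (?, ?, ?, ?)", new_rows)
conn.executemany(
    "UPDATE target SET val = ?, updated_time = ? WHERE id = ?",
    [(val, upd, pk) for pk, val, _, upd in changed_rows],
)

result = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
print(result)  # [(1, 'new-val'), (2, 'fresh')]
```

Note that this approach cannot see deletions: a row removed from the source never appears in either query, which is one reason audit-column CDC is often paired with soft deletes.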
Triggers are software functions written to capture event-based changes. Generally, triggers fire when changes are made to a table’s data, at timings defined by clauses such as “AFTER INSERT” or “BEFORE INSERT”.
Triggers can hinder performance because they run on the database while data changes are being made. With every transaction, changes are recorded in a different table (and in the transaction log). Therefore, the system is slowed by the additional load of Trigger-based Capture.
A common issue with Trigger-based CDC is latency. At certain intervals, the CDC code extracts the changes from the triggered tables. This can put an extra load on the system. Since triggers have to be defined for every table, operational resources might be impacted if you aim to replicate multiple tables.
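The pattern can be demonstrated with SQLite, which supports triggers out of the box (the table and trigger names below are illustrative): an AFTER INSERT/UPDATE trigger copies each change into a change table, which a separate CDC process drains periodically.

```python
import sqlite3

# Trigger-based CDC sketch: every write to 'orders' is mirrored into
# 'orders_changes' by the triggers, as a side effect of the transaction.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE orders_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, order_id INTEGER, status TEXT
    );

    CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('INSERT', NEW.id, NEW.status);
    END;

    CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('UPDATE', NEW.id, NEW.status);
    END;
""")

# Normal application writes; each one now does double the work,
# which is the performance cost described above.
conn.execute("INSERT INTO orders VALUES (1, 'placed')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")

# The CDC process later reads (and would then ship and clear) the change table.
changes = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY change_id"
).fetchall()
print(changes)  # [('INSERT', 1, 'placed'), ('UPDATE', 1, 'shipped')]
```

The latency mentioned above comes from the polling step at the end: changes sit in `orders_changes` until the next extraction run picks them up.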
To summarize the article so far: Log-based CDC is considered highly reliable. Capturing database changes with event-based triggers is comparatively easy for a software engineer to set up and is also considered a dependable solution. Writing a CDC script at the application level needs more development time than the other two methods, and improper code could result in changes being recorded incorrectly or not captured at all.
This article talks about Change Data Capture in great detail. It covers the importance of Data Replication and the different methods you can implement to capture data changes for your business.
Extracting complex data from a diverse set of data sources to carry out insightful analysis can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from Databases or SaaS applications into a BI tool for visualization, for free. Hevo Data is fully automated and hence does not require you to write code. Hevo, an ETL tool with a minimal learning curve, can be set up in just a few minutes, allowing users to load data without compromising performance.