BigQuery Change Data Capture (CDC) using Pub/Sub | by Ajith Urimajalu | Google Cloud - Community | Sep, 2024 | Medium
There are many ways to load data into BigQuery. If you are looking for an API to
load data from your external applications, Google Cloud offers the following two
choices:
1. The legacy streaming API: You use the legacy tabledata.insertAll method, which
is primarily designed for appending new rows. Although it offers best-effort
de-duplication, relying on it needs careful planning. Of the two choices, the
legacy streaming API is the only one with a REST endpoint.
2. The BigQuery Storage Write API: This is the newer API for streaming data into
BigQuery. A feature exclusive to this API is change data capture (CDC), which
updates your BigQuery tables by processing and applying streamed changes to
existing data. The BigQuery Storage Write API is exclusively a gRPC API and does
not offer a REST endpoint.
To achieve de-duplication and CDC functionality with the legacy streaming API
(which always appends new rows), you can use the MERGE statement. However, this
approach can be operationally expensive for large data volumes.
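The best-effort de-duplication of the legacy API relies on an insertId sent with each row. The following is a minimal sketch of how a tabledata.insertAll request could be assembled, not a definitive implementation. The project, dataset, and table names are hypothetical, deriving the insertId from a hash of the row content is an assumption made for illustration, and the request is only built, never sent.

```python
import hashlib
import json

# Hypothetical identifiers, used only for illustration.
PROJECT, DATASET, TABLE = "my-project", "my_dataset", "my_table"


def insert_all_request(rows):
    """Build the URL and JSON body for a tabledata.insertAll call."""
    url = (
        "https://bigquery.googleapis.com/bigquery/v2/"
        f"projects/{PROJECT}/datasets/{DATASET}/tables/{TABLE}/insertAll"
    )
    body = {
        "rows": [
            {
                # Assumption: a hash of the row content serves as the insertId,
                # so a retried row carries the same ID as the original attempt.
                "insertId": hashlib.sha256(
                    json.dumps(row, sort_keys=True).encode("utf-8")
                ).hexdigest(),
                "json": row,
            }
            for row in rows
        ]
    }
    return url, body


url, body = insert_all_request([{"date": "2024-09-15", "int_value": 100}])
```

Because the ID depends only on the row content, a retry of the same row reuses it. Rows that change over time still arrive as new appended rows, which is why the MERGE step remains necessary with this API.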
https://medium.com/google-cloud/bigquery-change-data-capture-cdc-using-pub-sub-b7881075acb8 1/15
If your system, platform, or application only supports REST, are you out of luck
when it comes to the out-of-the-box CDC functionality of the Storage Write API?
Not necessarily. Pub/Sub offers a REST API, allowing your external app, system,
or platform to publish messages to a Pub/Sub topic. These messages can then be
streamed in real time into BigQuery via a BigQuery subscription, which internally
uses the Storage Write API.
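As an illustration of the REST route, here is a minimal sketch that builds a Pub/Sub publish request using only the Python standard library. The project name, topic name, and access token are placeholders, obtaining an OAuth 2.0 token is out of scope, and the request is constructed but not sent.

```python
import base64
import json
import urllib.request


def build_publish_request(project, topic, payload, access_token):
    """Build (without sending) a Pub/Sub REST publish request."""
    url = (
        "https://pubsub.googleapis.com/v1/"
        f"projects/{project}/topics/{topic}:publish"
    )
    # The REST API expects the message data as a base64-encoded string.
    data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    body = json.dumps({"messages": [{"data": data}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; urllib.request.urlopen(request) would send the request.
request = build_publish_request("my-project", "my-topic", {"int_value": 100}, "TOKEN")
```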
Flow diagram
The BigQuery subscription documentation page provides key details on schema
compatibility, CDC, and handling failures, among other important aspects of using
the BigQuery subscription.
Next, create an Avro schema in Pub/Sub. Use the definition below to match the
BigQuery table. Note the two additional fields added for CDC: _CHANGE_TYPE and
_CHANGE_SEQUENCE_NUMBER.
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "date",
      "type": "string"
    },
    {
      "name": "int_value",
      "type": "long"
    },
    {
      "name": "last_updated",
      "type": "string"
    },
    {
      "name": "int_timestamp",
      "type": "long"
    },
    {
      "name": "operation_flag",
      "type": "string"
    },
    {
      "name": "_CHANGE_TYPE",
      "type": "string"
    },
    {
      "name": "_CHANGE_SEQUENCE_NUMBER",
      "type": "long"
    }
  ]
}
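Before publishing, it can help to check a message against this schema on the client side. The sketch below is a simplified, hypothetical checker (schema_errors is not part of any Google library) that covers only the string and long types used here and the two CDC change types, UPSERT and DELETE.

```python
# Field names and Avro types copied from the schema above.
SCHEMA_FIELDS = {
    "date": "string",
    "int_value": "long",
    "last_updated": "string",
    "int_timestamp": "long",
    "operation_flag": "string",
    "_CHANGE_TYPE": "string",
    "_CHANGE_SEQUENCE_NUMBER": "long",
}
AVRO_TO_PYTHON = {"string": str, "long": int}


def schema_errors(message):
    """Return a list of problems found in a candidate message (empty if none)."""
    errors = []
    for name, avro_type in SCHEMA_FIELDS.items():
        if name not in message:
            errors.append(f"missing field: {name}")
            continue
        value = message[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, AVRO_TO_PYTHON[avro_type]):
            errors.append(f"{name}: expected {avro_type}")
    for name in message:
        if name not in SCHEMA_FIELDS:
            errors.append(f"unexpected field: {name}")
    if message.get("_CHANGE_TYPE") not in ("UPSERT", "DELETE"):
        errors.append("_CHANGE_TYPE must be UPSERT or DELETE")
    return errors
```

A message with every field present and correctly typed yields an empty list. A missing field or a numeric value sent as a string is reported by name.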
Finally, create a BigQuery subscription and choose the BigQuery table created
earlier. Ensure you choose ‘Use Topic Schema’ for the schema configuration.
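For teams that script this step instead of using the console, the same settings can be expressed as a Pub/Sub REST request. This is a rough sketch that only assembles the method, URL, and body. All resource names are hypothetical, and the bigqueryConfig field names should be verified against the current Pub/Sub API reference before use.

```python
def bigquery_subscription_request(project, subscription, topic, table):
    """Assemble the pieces of a subscription-create call for a BigQuery subscription."""
    url = (
        "https://pubsub.googleapis.com/v1/"
        f"projects/{project}/subscriptions/{subscription}"
    )
    body = {
        "topic": f"projects/{project}/topics/{topic}",
        "bigqueryConfig": {
            # Table is given as "project.dataset.table".
            "table": table,
            # Counterpart of the 'Use Topic Schema' option in the console.
            "useTopicSchema": True,
        },
    }
    return "PUT", url, body


# Hypothetical resource names, for illustration only.
method, url, body = bigquery_subscription_request(
    "my-project", "my-bq-subscription", "my-topic", "my-project.my_dataset.my_table"
)
```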
[Screenshots: BigQuery Table, Avro Schema, Pub/Sub Topic]
Testing
Let’s simulate an external application by using the built-in publish feature of
the Pub/Sub console to publish the JSON message below.
{
  "date": "2024-09-15",
  "int_value": 100,
  "last_updated": "2024-09-15T05:52:37Z",
  "int_timestamp": 20240915055237,
  "operation_flag": "I",
  "_CHANGE_TYPE": "UPSERT",
  "_CHANGE_SEQUENCE_NUMBER": 20240915055237
}
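In this sample, the sequence number is the last_updated timestamp written as digits (YYYYMMDDhhmmss). A minimal sketch of that derivation follows. The function name is hypothetical, and this is only one possible convention for producing an increasing number.

```python
from datetime import datetime, timezone


def sequence_number(last_updated):
    """Turn an ISO 8601 UTC timestamp such as 2024-09-15T05:52:37Z into an integer."""
    moment = datetime.strptime(last_updated, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return int(moment.strftime("%Y%m%d%H%M%S"))


print(sequence_number("2024-09-15T05:52:37Z"))  # 20240915055237
```

With one-second resolution, two changes to the same row within the same second would receive the same number, so a finer-grained source may be needed for busy rows.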
{
  "date": "2024-09-15",
  "int_value": 300,
  "last_updated": "2024-09-15T06:05:13Z",
  "int_timestamp": 20240915060513,
  "operation_flag": "U",
  "_CHANGE_TYPE": "UPSERT",
  "_CHANGE_SEQUENCE_NUMBER": 20240915060513
}

{
  "date": "2024-09-15",
  "int_value": 400,
  "last_updated": "2024-09-15T00:00:00Z",
  "int_timestamp": 20240915000000,
  "operation_flag": "U",
  "_CHANGE_TYPE": "UPSERT",
  "_CHANGE_SEQUENCE_NUMBER": 20240915000000
}
Let’s publish one more message with _CHANGE_TYPE set to ‘DELETE’. This should
delete the row from BigQuery.
{
  "date": "2024-09-15",
  "int_value": 100,
  "last_updated": "2024-09-15T06:08:36Z",
  "int_timestamp": 20240915060836,
  "operation_flag": "D",
  "_CHANGE_TYPE": "DELETE",
  "_CHANGE_SEQUENCE_NUMBER": 20240915060836
}
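To reason about what the four test messages should leave in the table, here is a toy in-memory model of the CDC behavior. It is a simplification under stated assumptions and not how BigQuery is implemented: `date` is assumed to be the table's primary key, a change is skipped when its sequence number is lower than one already applied for that key, and ties are not modeled. Fields not needed for the illustration are omitted from the messages.

```python
def apply_changes(messages, key="date"):
    """Apply UPSERT/DELETE messages in arrival order and return the resulting rows."""
    table = {}       # primary key -> current row
    latest_seq = {}  # primary key -> highest sequence number applied so far
    for msg in messages:
        k = msg[key]
        seq = msg["_CHANGE_SEQUENCE_NUMBER"]
        if k in latest_seq and seq < latest_seq[k]:
            continue  # stale change: an update with a higher number was applied
        latest_seq[k] = seq
        if msg["_CHANGE_TYPE"] == "DELETE":
            table.pop(k, None)
        else:
            # Store the row without the CDC pseudo-columns.
            table[k] = {f: v for f, v in msg.items() if not f.startswith("_CHANGE_")}
    return table


# The four test messages, reduced to the fields that matter here.
MESSAGES = [
    {"date": "2024-09-15", "int_value": 100, "_CHANGE_TYPE": "UPSERT",
     "_CHANGE_SEQUENCE_NUMBER": 20240915055237},
    {"date": "2024-09-15", "int_value": 300, "_CHANGE_TYPE": "UPSERT",
     "_CHANGE_SEQUENCE_NUMBER": 20240915060513},
    {"date": "2024-09-15", "int_value": 400, "_CHANGE_TYPE": "UPSERT",
     "_CHANGE_SEQUENCE_NUMBER": 20240915000000},
    {"date": "2024-09-15", "int_value": 100, "_CHANGE_TYPE": "DELETE",
     "_CHANGE_SEQUENCE_NUMBER": 20240915060836},
]

after_upserts = apply_changes(MESSAGES[:3])
after_delete = apply_changes(MESSAGES)
```

In this model, the third message (int_value 400) carries an older sequence number than the second, so the row keeps int_value 300 until the DELETE removes it.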
The utility prepares a JSON message and publishes it to the configured Pub/Sub
topic. The BigQuery subscription then writes the records to the BigQuery table.
Conclusion
As described above, a BigQuery subscription in Pub/Sub is a great alternative for
systems that cannot natively call the BigQuery Storage Write API.