2024/11/4 10:17 PM BigQuery Change Data Capture (CDC) using Pub/Sub | by Ajith Urimajalu | Google Cloud - Community | Sep, 2024 | Medium


BigQuery Change Data Capture (CDC) using Pub/Sub
Ajith Urimajalu
Published in Google Cloud - Community
5 min read · Sep 15, 2024


There are many ways to load data into BigQuery. If you are looking for an API to load
data from your external applications, Google Cloud offers the following two choices:

1. The legacy streaming API: You use the legacy tabledata.insertAll method, which is
primarily designed for appending new rows. Although it offers best-effort
de-duplication, it needs careful planning. Of the two, the legacy streaming API is
the only one with a REST endpoint.

2. The BigQuery Storage Write API: This is the newer API for streaming data into
BigQuery. An exclusive feature of this API is change data capture (CDC), which
updates your BigQuery tables by processing and applying streamed changes to
existing data. The BigQuery Storage Write API is exclusively a gRPC API and does
not offer a REST endpoint.

Systems without a built-in gRPC client


Many systems and platforms, including SAP ERP systems such as S/4HANA, lack gRPC
clients and offer only REST clients, so they cannot call the Storage Write API
directly.

To achieve de-duplication and CDC functionality with the legacy streaming API
(which always appends new rows), you can use the MERGE statement. However, this
approach can be operationally expensive for large data volumes.
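
To make the MERGE approach concrete, here is a minimal sketch that renders such a statement. It assumes the appended rows land in a staging table with the same columns as the target, and that a flag column marks deletes with 'D'. The function name, the table names, and the column names in the usage lines are all hypothetical.

```python
def build_merge_sql(target, staging, key, order_col, columns, flag_col):
    """Render a MERGE that applies the newest staged change per key to the target."""
    assignments = ", ".join(f"{c} = s.{c}" for c in columns)
    return f"""MERGE `{target}` AS t
USING (
  -- Keep only the newest staged row for each key.
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY {key} ORDER BY {order_col} DESC) AS rn
    FROM `{staging}`
  )
  WHERE rn = 1
) AS s
ON t.{key} = s.{key}
WHEN MATCHED AND s.{order_col} >= t.{order_col} AND s.{flag_col} = 'D' THEN
  DELETE
WHEN MATCHED AND s.{order_col} >= t.{order_col} THEN
  UPDATE SET {assignments}
WHEN NOT MATCHED AND s.{flag_col} != 'D' THEN
  INSERT ROW"""


# Hypothetical table and column names, for illustration only.
merge_sql = build_merge_sql(
    target="my_dataset.target",
    staging="my_dataset.staging",
    key="id",
    order_col="changed_at",
    columns=["amount", "changed_at", "flag"],
    flag_col="flag",
)
print(merge_sql)
```

A statement like this has to be scheduled, rescans the staging table on every run, and needs the staging rows cleaned up afterwards, which is where the operational cost comes from.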


If your system, platform, or application only supports REST, are you out of luck when
it comes to the out-of-the-box CDC functionality of the Storage Write API?

Pub/Sub BigQuery subscriptions to the rescue!

Stream data to BigQuery with CDC using Pub/Sub

New to Pub/Sub? Get familiar with some key concepts here.

Pub/Sub offers a REST API, allowing your external app, system, or platform to
publish messages to a Pub/Sub topic. These messages can then be streamed in real-
time into BigQuery via a BigQuery subscription, which internally uses the Storage
Write API.
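
As a rough illustration of the REST call, the sketch below builds (but does not send) a publish request using only the Python standard library. The project ID, topic ID, and access token are placeholders, and obtaining an OAuth access token is assumed to happen elsewhere.

```python
import base64
import json
import urllib.request


def build_publish_request(project_id, topic_id, payload, access_token):
    """Build (but do not send) a Pub/Sub REST publish request for one message."""
    url = (
        f"https://pubsub.googleapis.com/v1/projects/{project_id}"
        f"/topics/{topic_id}:publish"
    )
    # Pub/Sub expects the message payload as a base64-encoded string.
    data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    body = json.dumps({"messages": [{"data": data}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder identifiers; replace with your own project, topic, and token.
request = build_publish_request(
    "my-project", "my-topic", {"int_value": 100}, "ACCESS_TOKEN"
)
# To actually publish, pass the request to urllib.request.urlopen(request).
```

Any platform that can issue an authenticated HTTPS POST with a JSON body can do the same thing, which is what makes this route usable from systems without a gRPC client.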

Flow diagram

The documentation page provides key details of schema compatibility, CDC, and
handling failures among other important aspects of using the BigQuery
subscription.

Step by step walkthrough with an example


First, create a BigQuery table. In the BigQuery console, click Compose new query and
run the statement below:

CREATE OR REPLACE TABLE `PUBSUB_CDC_TEST.SAP_DATA` (
  date DATE NOT NULL,
  int_value INT64,
  last_updated TIMESTAMP,
  int_timestamp INT64,
  operation_flag STRING,
  PRIMARY KEY (date) NOT ENFORCED
);

Next, create an Avro schema in Pub/Sub. Use the definition below to match the
BigQuery table. Note that we have added two additional fields for CDC:
_CHANGE_TYPE and _CHANGE_SEQUENCE_NUMBER.

{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "date",
"type": "string"
},
{
"name": "int_value",
"type": "long"
},
{
"name": "last_updated",
"type": "string"
},
{
"name": "int_timestamp",
"type": "long"
},
{
"name": "operation_flag",
"type": "string"
},
{
"name": "_CHANGE_TYPE",
"type": "string"
},
{
"name": "_CHANGE_SEQUENCE_NUMBER",
"type": "long"
}
]
}

Then, create a topic and associate the schema created above.

Finally, create a BigQuery subscription and choose the BigQuery table created
earlier. Ensure you choose ‘Use Topic Schema’ for the schema configuration.
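
The walkthrough uses the console, but the same three resources can also be created through the Pub/Sub REST API. The sketch below only assembles the requests; it reflects my understanding of the REST resource shapes, so check the Pub/Sub REST reference before relying on it. All IDs are placeholders, and the empty field list stands in for the full schema definition.

```python
import json


def build_setup_requests(project_id, schema_id, topic_id, subscription_id,
                         table, avro_definition):
    """Return (method, url, body) tuples for the schema, topic, and subscription."""
    base = f"https://pubsub.googleapis.com/v1/projects/{project_id}"
    schema_name = f"projects/{project_id}/schemas/{schema_id}"
    topic_name = f"projects/{project_id}/topics/{topic_id}"
    return [
        # 1. Avro schema; the definition is passed as a JSON string.
        ("POST", f"{base}/schemas?schemaId={schema_id}",
         {"type": "AVRO", "definition": json.dumps(avro_definition)}),
        # 2. Topic associated with the schema, accepting JSON-encoded messages.
        ("PUT", f"{base}/topics/{topic_id}",
         {"schemaSettings": {"schema": schema_name, "encoding": "JSON"}}),
        # 3. BigQuery subscription that uses the topic schema.
        ("PUT", f"{base}/subscriptions/{subscription_id}",
         {"topic": topic_name,
          "bigqueryConfig": {"table": table, "useTopicSchema": True}}),
    ]


# Placeholder IDs; the table is given as project.dataset.table.
setup_requests = build_setup_requests(
    "my-project", "sap-cdc-schema", "sap-cdc-topic", "sap-cdc-bq-sub",
    "my-project.PUBSUB_CDC_TEST.SAP_DATA",
    {"type": "record", "name": "Avro", "fields": []},
)
for method, url, body in setup_requests:
    print(method, url)
```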

By the end of this, you should have the following created:

BigQuery Table

Avro Schema

Pub/Sub Topic


Pub/Sub Subscription to BigQuery

Testing
Let’s simulate an external application by using the Pub/Sub console’s built-in
publish feature to send the JSON message below.

{
"date": "2024-09-15",
"int_value": 100,
"last_updated": "2024-09-15T05:52:37Z",
"int_timestamp": 20240915055237,
"operation_flag": "I",
"_CHANGE_TYPE": "UPSERT",
"_CHANGE_SEQUENCE_NUMBER": 20240915055237
}

The entry is successfully created in BigQuery
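
An external application would build this payload in code rather than by hand. Here is a minimal sketch that mirrors the test messages in this post: it derives the sequence number from the change time as YYYYMMDDHHMMSS and maps the operation flag 'D' to DELETE. The function name is hypothetical, and the change time is assumed to be in UTC.

```python
from datetime import datetime, timezone


def build_cdc_message(date, int_value, changed_at, operation_flag):
    """Build a message shaped like the test payloads in this walkthrough."""
    # Use the change time, formatted as YYYYMMDDHHMMSS, as the sequence number.
    sequence = int(changed_at.strftime("%Y%m%d%H%M%S"))
    return {
        "date": date,
        "int_value": int_value,
        "last_updated": changed_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "int_timestamp": sequence,
        "operation_flag": operation_flag,
        # Assumption of this sketch: 'D' means delete, anything else is an upsert.
        "_CHANGE_TYPE": "DELETE" if operation_flag == "D" else "UPSERT",
        "_CHANGE_SEQUENCE_NUMBER": sequence,
    }


message = build_cdc_message(
    "2024-09-15", 100,
    datetime(2024, 9, 15, 5, 52, 37, tzinfo=timezone.utc), "I",
)
print(message)
```

A timestamp with one-second resolution only orders changes correctly if a key never changes twice within the same second, so a real producer may need a finer-grained sequence.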

Let’s publish one more message, this time with a higher _CHANGE_SEQUENCE_NUMBER,
which should update the existing row.


{
"date": "2024-09-15",
"int_value": 300,
"last_updated": "2024-09-15T06:05:13Z",
"int_timestamp": 20240915060513,
"operation_flag": "U",
"_CHANGE_TYPE": "UPSERT",
"_CHANGE_SEQUENCE_NUMBER": 20240915060513
}

Record updated successfully

Let’s publish one more message, this time with a lower _CHANGE_SEQUENCE_NUMBER,
which should not update the existing row.

{
"date": "2024-09-15",
"int_value": 400,
"last_updated": "2024-09-15T00:00:00Z",
"int_timestamp": 20240915000000,
"operation_flag": "U",
"_CHANGE_TYPE": "UPSERT",
"_CHANGE_SEQUENCE_NUMBER": 20240915000000
}


No updates to existing row because of lower _CHANGE_SEQUENCE_NUMBER

Let’s publish one more message with _CHANGE_TYPE set to ‘DELETE’. This should
delete the row from BigQuery.

{
"date": "2024-09-15",
"int_value": 100,
"last_updated": "2024-09-15T06:08:36Z",
"int_timestamp": 20240915060836,
"operation_flag": "D",
"_CHANGE_TYPE": "DELETE",
"_CHANGE_SEQUENCE_NUMBER": 20240915060836
}

Record Successfully deleted from BigQuery
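
The behaviour seen in these four tests can be summarised with a small in-memory model. This is only a simplified sketch of the observable outcome, not how BigQuery applies changes internally; in particular, letting a change with an equal sequence number win is an assumption of this sketch.

```python
def apply_changes(changes, key="date"):
    """Apply UPSERT/DELETE changes in arrival order and return the rows by key."""
    rows = {}        # current table contents, keyed by primary key
    latest_seq = {}  # highest sequence number applied so far, per key
    for change in changes:
        k = change[key]
        seq = change["_CHANGE_SEQUENCE_NUMBER"]
        if k in latest_seq and seq < latest_seq[k]:
            continue  # stale change: an equal or newer one was already applied
        latest_seq[k] = seq
        if change["_CHANGE_TYPE"] == "DELETE":
            rows.pop(k, None)
        else:
            # Store the data columns only, without the CDC pseudo-columns.
            rows[k] = {f: v for f, v in change.items()
                       if not f.startswith("_CHANGE_")}
    return rows


def make_change(int_value, change_type, seq):
    """Abbreviated version of the test messages above."""
    return {"date": "2024-09-15", "int_value": int_value,
            "_CHANGE_TYPE": change_type, "_CHANGE_SEQUENCE_NUMBER": seq}


changes = [
    make_change(100, "UPSERT", 20240915055237),  # insert
    make_change(300, "UPSERT", 20240915060513),  # newer: updates the row
    make_change(400, "UPSERT", 20240915000000),  # older: ignored
    make_change(100, "DELETE", 20240915060836),  # newest: deletes the row
]
after_three = apply_changes(changes[:3])
after_all = apply_changes(changes)
print(after_three)
print(after_all)
```

After the first three messages the row holds int_value 300, and after the fourth the table is empty, matching the results above.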

A Sample Utility for ABAP SDK for Google Cloud


As stated earlier, SAP ERP systems lack a gRPC client for calling the BigQuery
Storage Write API. This limitation can be overcome with the sample utility class
ZGOOG_CL_PUBSUB_TO_BQ, hosted on GitHub. The utility is built on top of the
ABAP SDK for Google Cloud.

The utility prepares a JSON message and publishes it to the configured Pub/Sub
topic. The BigQuery subscription then inserts the records into the BigQuery table.

The GitHub repo also includes a demo program, ZR_DEMO_PUBSUB_TO_BQ, which shows
how to use the utility.


Selection screen of demo program

Output of the demo program

Data successfully inserted to BigQuery

A quick note on pricing


Pub/Sub BigQuery subscriptions cost $50 per TiB. BigQuery’s legacy streaming
inserts (tabledata.insertAll) cost $0.01 per 200 MB, which works out to $50 per TB.
Considering only ingestion costs, legacy streaming inserts and Pub/Sub BigQuery
subscriptions are therefore similarly priced.
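
The per-terabyte figure follows directly from the per-200-MB rate. The short calculation below works it through, assuming the 200 MB unit is a decimal megabyte; it also converts the same rate to a per-TiB figure so that both prices are in the same unit.

```python
MB_PER_TB = 10**6            # decimal megabytes in one terabyte
MB_PER_TIB = 2**40 / 10**6   # decimal megabytes in one tebibyte


def legacy_streaming_cost_usd(megabytes):
    """Cost at the quoted rate of $0.01 per 200 MB."""
    return megabytes / 200 * 0.01


per_tb = legacy_streaming_cost_usd(MB_PER_TB)
per_tib = legacy_streaming_cost_usd(MB_PER_TIB)
print(f"legacy streaming: ${per_tb:.2f} per TB, ${per_tib:.2f} per TiB")
```

Under that assumption the legacy rate comes to about $55 per TiB against $50 per TiB for the subscription, a difference of roughly 10%, so "similarly priced" holds.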

Conclusion
As described above, a BigQuery subscription in Pub/Sub is a great alternative for
systems that cannot natively call the BigQuery Storage Write API.

Just Launched: Vertex AI SDK for ABAP

Still here? Google Cloud just launched the Vertex AI SDK for ABAP. You can now
empower your SAP applications with AI. The Vertex AI SDK for ABAP brings Google
Cloud’s cutting-edge AI capabilities directly to your SAP environment. Check out the
launch blog to learn more.

Written by Ajith Urimajalu



SAP Application Engineer at Google
