Change Data Capture From Oracle With StreamSets Data Collector - StreamSets
By Pat Patterson Posted in Engineering
June 12, 2018
Editor’s Note: StreamSets no longer relies on the continuous mine function in Oracle. Here is
an update on Oracle 19c Bulk Ingest and CDC.
Today’s guest post is by Franck Pachot, an Oracle Consultant
at dbi services in Switzerland. Franck has over 20 years of
experience in Oracle, covering every aspect of the database
from architecture and data modeling to tuning and
operation. Franck recently documented his experiences
testing StreamSets Data Collector’s Oracle CDC origin, and
kindly allowed us to repost his blog entry here.
With the trend toward CQRS architectures, where transactions are streamed to a bunch of
heterogeneous, eventually consistent, polyglot-persistence microservices, logical replication and
change data capture (CDC) become important components, even at the architecture design
phase. This is good for existing product vendors such as Oracle GoldenGate (which must be
licensed even to use only the CDC part in the Oracle Database, as Streams is going to be
desupported) or Dbvisit Replicate to Kafka. It is also good for open source projects. There are
some ideas for running on Debezium and VOODOO, but they are not yet released.
Today I tested the StreamSets Oracle CDC origin. StreamSets Data Collector is an open source
project, started by former Cloudera and Informatica employees, to define streaming data
pipelines from data sources. It is simple to use and offers a bunch of possible destinations. The
Oracle CDC origin is based on LogMiner, which means that it is easy to set up but may have some
limitations (mainly datatypes, DDL replication and performance).
Install
The installation guide is online. I chose the easiest way for testing, as they provide a Docker
container.
The GUI looks simple and efficient. There’s a home page where you define the ‘pipelines’ and
monitor them running. In the pipelines, we define sources and destinations. Some connectors
are already installed; others can be installed automatically. For Oracle, as usual, you need to
download the JDBC driver yourself because Oracle doesn’t allow it to be embedded for legal
reasons. I’ll do something simple here just to check the mining from Oracle.
In ‘Package Manager’ (the little gift icon on the top) go to JDBC and check ‘install’ for the
streamsets-datacollector-jdbc-lib library.
Then in ‘External Libraries’, install (with the ‘upload’ icon at the top) the Oracle JDBC driver
(ojdbc8.jar).
"configuration": [
{
"name": "hikariConf.connectionString",
"value": "jdbc:oracle:thin:scott/tiger@//192.168.56.188:1521/pdb1"
},
{
"name": "hikariConf.useCredentials",
"value": true
},
{
"name": "hikariConf.username",
"value": "sys as sysdba"
},
{
"name": "hikariConf.password",
"value": "oracle"
},
I provided SYSDBA credentials and only the PDB service, but it seems that StreamSets figured
out automatically how to connect to the CDB (as LogMiner can be started only from
CDB$ROOT). The advantage of using LogMiner here is that you need only a JDBC connection to
the source. The cost, of course, is that it will use CPU and memory resources on the source
database host.
Then I defined the replication in the Oracle CDC tab: Schema to ‘SCOTT’ and Table Name Pattern
to ‘%’. I set Initial Change to ‘From Latest Change’, as I just want to see the changes and not
actually replicate for this first test. But of course, we can define an SCN here, which is what
must be used to ensure consistency between the initial load and the replication. I set Dictionary
Source to ‘Online Catalog’: this is what LogMiner will use to map the object and column IDs to
table names and column names. But be careful, as table structure changes may not be managed
correctly with this option.
{
"name": "oracleCDCConfigBean.baseConfigBean.schemaTableConfigs",
"value": [
{
"schema": "SCOTT",
"table": "%"
}
]
},
{
"name": "oracleCDCConfigBean.baseConfigBean.changeTypes",
"value": [
"INSERT",
"UPDATE",
"DELETE",
"SELECT_FOR_UPDATE"
]
I’ve left the defaults. I can’t yet think of a reason for capturing the ‘select for update’, but it is
there.
{
"instanceName": "NamedPipe_01",
"library": "streamsets-datacollector-basic-lib",
"stageName": "com_streamsets_pipeline_stage_destination_fifo_FifoDTarget",
"stageVersion": "1",
"configuration": [
{
"name": "namedPipe",
"value": "/tmp/scott"
},
{
"name": "dataFormat",
"value": "JSON"
},
...
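With the destination above writing JSON to the named pipe /tmp/scott, a downstream consumer only has to read that pipe. Here is a minimal sketch in Python, assuming the pipeline emits one JSON object per line. The function name `read_changes` is hypothetical and is not part of StreamSets.

```python
import json
from typing import Iterable, Iterator


def read_changes(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Usage against the named pipe (assumed path, taken from the configuration above).
# Opening a FIFO for reading blocks until a writer, here the pipeline, opens it:
#
#     with open("/tmp/scott", encoding="utf-8") as pipe:
#         for record in read_changes(pipe):
#             print(record)
```

Reading line by line keeps the consumer streaming: each change is handled as soon as the pipeline writes it, without waiting for the pipe to close.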
Supplemental logging
The Oracle redo log stream is by default focused only on recovery (replay of transactions in the
same database) and contains only the minimal physical information required for it. In order to
get enough information to replay them in a different database, we need supplemental logging
for the database and for the objects involved.
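As an illustration of those two levels, the sketch below generates the corresponding Oracle DDL: one database-level statement, then one statement per table. It assumes primary-key supplemental logging is sufficient for the tables involved (logging `(ALL) COLUMNS` is the heavier alternative), and the helper name `supplemental_logging_ddl` and the table list are hypothetical. It only builds the statements; running them requires a suitably privileged session.

```python
from typing import Iterable, List


def supplemental_logging_ddl(schema: str, tables: Iterable[str]) -> List[str]:
    """Build DDL enabling supplemental logging for the database and each table."""
    # Database level: minimal supplemental logging.
    statements = ["ALTER DATABASE ADD SUPPLEMENTAL LOG DATA"]
    for table in tables:
        # Table level: always log the primary key columns so that each
        # change record identifies the row it applies to.
        statements.append(
            f'ALTER TABLE "{schema}"."{table}" '
            "ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS"
        )
    return statements


# Example with two tables of the SCOTT schema (assumed for illustration).
for statement in supplemental_logging_ddl("SCOTT", ["DEPT", "EMP"]):
    print(statement + ";")
```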
Run
And that’s all. Just run the pipeline and look at the logs.
The StreamSets Oracle CDC origin pulls continuously from LogMiner to get the changes. Here are
the queries that it uses for that:
BEGIN DBMS_LOGMNR.START_LOGMNR(
STARTTIME => :1 ,
ENDTIME => :2 ,
OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG
+ DBMS_LOGMNR.CONTINUOUS_MINE
+ DBMS_LOGMNR.NO_SQL_DELIMITER);
END;
This starts to mine between two timestamps. I suppose that it will read the SCNs to get finer
granularity and avoid overlapping information.
The next query reads the redo records from LogMiner. The operation codes 7 and 36 are for
commits and rollbacks. The operations 1, 3, 2, 25 are those that we want to capture (insert,
update, delete, select for update) and were defined in the configuration. In that query, the
pattern ‘%’ for the SCOTT schema has been expanded to the table names. As far as I know,
there’s no DDL mining here to automatically capture new tables.
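To make the operation codes concrete, here is a small sketch that maps the configured change types to the codes named above and builds a filter predicate from them. The exact SQL that StreamSets generates is not reproduced here, so the predicate text and the function name `operation_filter` are illustrative assumptions.

```python
# LogMiner operation codes, as described in the text above.
OPERATION_CODES = {
    "INSERT": 1,
    "DELETE": 2,
    "UPDATE": 3,
    "COMMIT": 7,
    "SELECT_FOR_UPDATE": 25,
    "ROLLBACK": 36,
}


def operation_filter(change_types):
    """Return a predicate selecting the given change types plus commit/rollback."""
    codes = [OPERATION_CODES[name] for name in change_types]
    # Commits and rollbacks are always needed to learn each transaction's outcome.
    codes += [OPERATION_CODES["COMMIT"], OPERATION_CODES["ROLLBACK"]]
    unique_sorted = sorted(set(codes))
    return "OPERATION_CODE IN (" + ", ".join(str(c) for c in unique_sorted) + ")"


print(operation_filter(["INSERT", "UPDATE", "DELETE", "SELECT_FOR_UPDATE"]))
# prints: OPERATION_CODE IN (1, 2, 3, 7, 25, 36)
```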
Capture
Then I ran a simple insert into the DEPT table (I had added a primary key on this table, as
utlsampl.sql does not create one).
And I committed (as it seems that StreamSets buffers the changes until the end of the
transaction):
SQL> commit;
$ cat /tmp/scott
{"LOC":"Cloud","DEPTNO":50,"DNAME":"IT"}
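The behaviour observed here, where nothing reaches the pipe until the commit, can be sketched as a per-transaction buffer. This illustrates the general technique only and is not StreamSets' implementation; the class name `TransactionBuffer` and the transaction identifiers are hypothetical.

```python
from collections import defaultdict

COMMIT, ROLLBACK = 7, 36  # operation codes named in the text above


class TransactionBuffer:
    """Hold changes per transaction and release them only on commit."""

    def __init__(self):
        self._pending = defaultdict(list)

    def handle(self, xid, operation_code, row=None):
        """Process one redo record; return the rows to emit (possibly none)."""
        if operation_code == COMMIT:
            # Transaction succeeded: hand over everything buffered for it.
            return self._pending.pop(xid, [])
        if operation_code == ROLLBACK:
            # Transaction failed: discard its uncommitted changes.
            self._pending.pop(xid, None)
            return []
        self._pending[xid].append(row)
        return []


buffer = TransactionBuffer()
buffer.handle("tx1", 1, {"LOC": "Cloud", "DEPTNO": 50, "DNAME": "IT"})  # buffered
print(buffer.handle("tx1", COMMIT))  # the row is released at commit
```

Buffering in memory like this is also why big transactions put pressure on memory: every uncommitted change has to be held until the outcome is known.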
I’ve tested some bulk loads (direct-path inserts) and it seems to be managed correctly. Actually,
this Oracle CDC is based on LogMiner so it is fully supported (no mining of proprietary redo
stream format) and limitations are clearly documented.
Monitoring
Remember that the main work is done by LogMiner, so don’t forget to look at the alert.log on
the source database. With big transactions, you may need a large PGA (but you can also choose
to buffer to disk). If you have the Oracle Tuning Pack, you can also monitor the main query,
which retrieves the redo information from LogMiner.
You will see a different SQL_ID from mine because the filter predicates use literals instead of
bind variables (which is not a problem here).
Conclusion
This product is very easy to test, so you can do a proof of concept within a few hours and test
for your context: supported datatypes, operations and performance. By easy to test, I mean:
very good documentation, very friendly and responsive graphical interface, and very clear error
messages.
Thanks, Franck! You can try StreamSets Data Collector’s Oracle CDC integration today, on the cloud
platform of your choice.