Deduplicating and analysing time-series data with Apache Beam and QuestDB
NYC 2023
Javier Ramírez
QuestDB
@supercoco9 / @j@chaos.social
About me: I like databases & open source
2022-today. Developer relations at an open source database vendor
● QuestDB, PostgreSQL, MongoDB, Timescale, InfluxDB, Apache Flink
2019-2022. Data & Analytics specialist at a cloud provider
● Amazon Aurora, Neptune, Athena, Timestream, DynamoDB, DocumentDB, Kinesis Data Streams, Kinesis Data Analytics, Redshift, ElastiCache for Redis, QLDB, Elasticsearch, OpenSearch, Cassandra, Spark…
2013-2018. Data Engineer/Big Data & Analytics consultant
● PostgreSQL, Redis, Neo4j, Google BigQuery, Bigtable, Google Cloud Spanner, Apache Spark, Apache Beam, Apache Flink, HBase, MongoDB, Presto
2006-2012. Web developer
● MySQL, Redis, PostgreSQL, SQLite, Elasticsearch
Late nineties to 2005. Desktop/CGI/Servlets/EJBs/CORBA
● MS Access, MySQL, Oracle, Sybase, Informix
As a student/hobbyist (late eighties to early nineties)
● Amsbase, dBase III, dBase IV, FoxPro, Microsoft Works, Informix
The pre-SQL years
The licensed SQL period
The libre and open SQL revolution / The NoSQL rise
The Hadoop dark ages / The Python hegemony / The cloud database big migrations
The streaming era / The database as a service singularity
The SQL revenge / The realtime database / The embedded database
BEAM SUMMIT NYC 2023
Agenda
● The problem of data duplication
● The problem of data duplication
● The problem of data duplication
● The problem of data duplication
● Behold: a dashboard!
● The many challenges of time-series data
● QuestDB to the rescue
● Down the rabbit hole of writing a custom BEAM Sink
○ Finding several needles in a documentation haystack
○ When I sadly discovered Python streaming support is meh
○ The unsung hero saves the day (again): implementing the Sink in Java
Duplication
WHY
Duplication
HOW
Duplication
WHAT
My lazy approach to choosing a database
If you can use only one database for everything, go with PostgreSQL*
* Or any other major and well-supported RDBMS
Imagine…
a factory floor with 500 machines, or
a fleet with 500 vehicles, or
50 trains, with 10 cars each, or
500 users with a mobile phone, or
500 financial instruments generating tick data
…sending data every second
A conventional database’s nightmare
43,200,000 rows a day
302,400,000 rows a week
1,314,144,000 rows a month
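The figures above can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions that the slide does not state: each of the 500 sources emits exactly one row per second, and a month is taken as an average of 30.42 days (365 / 12, rounded).

```python
# Assumptions (not stated on the slide): one row per source per second,
# and an average month of 30.42 days (365 / 12, rounded).
SOURCES = 500
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_day = SOURCES * SECONDS_PER_DAY
rows_per_week = rows_per_day * 7
# 30.42 days is written as 3042 / 100 so the result stays an exact integer.
rows_per_month = rows_per_day * 3042 // 100

print(f"{rows_per_day:,} rows a day")
print(f"{rows_per_week:,} rows a week")
print(f"{rows_per_month:,} rows a month")
```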
Timestamps are hard
Time-series analytics in a nutshell
Working with timestamped data in a database is tricky*
* Especially when running analytics on data that changes over time or arrives at a high rate
We’d like to be known for
● Performance
○ Better performance with smaller machines
● Developer Experience
● Proudly Open Source (Apache 2.0)
A quick overview of some interesting queries
Try it live on https://demo.questdb.io
WHERE … TIME RANGE
SELECT * from trips WHERE pickup_datetime in '2018';
SELECT * from trips WHERE pickup_datetime in '2018-06';
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59';
SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10;
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;1d;7'
SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;-1d;7'
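The semicolon-separated form in the last queries can be read as start;span;period;repetitions. The sketch below expands such a string by hand, as an illustration only: `expand_intervals` is a hypothetical helper, not a QuestDB API, and it assumes the repetition count is the total number of windows. Exact boundary handling (inclusive ends, how a partial timestamp is widened) is deliberately left out.

```python
from datetime import datetime, timedelta


def expand_intervals(start, span, period=None, count=1):
    """Return `count` (lo, hi) windows. The first begins at `start` and lasts
    `span`; each following window is shifted by `period`, which may be negative."""
    windows = []
    for i in range(count):
        lo = start + (period * i if period else timedelta(0))
        windows.append((lo, lo + span))
    return windows


# Rough reading of '2018-06-21T23:59:58;4s;1d;7':
# a 4-second window, repeated every day, 7 windows in total.
windows = expand_intervals(
    datetime(2018, 6, 21, 23, 59, 58),
    span=timedelta(seconds=4),
    period=timedelta(days=1),
    count=7,
)
for lo, hi in windows:
    print(lo.isoformat(), "->", hi.isoformat())
```

Passing `period=timedelta(days=-1)` walks backwards in time instead, which is how the `-1d` variant reads.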
SAMPLE BY
Aggregates data in homogeneous time chunks
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 15m ALIGN TO CALENDAR;
SELECT timestamp, min(tempF),
max(tempF), avg(tempF)
FROM weather SAMPLE BY 1M;
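To make the bucketing concrete, here is a plain-Python sketch of what the first SAMPLE BY query computes: trades are grouped into calendar-aligned 15-minute chunks and a volume-weighted average price is calculated per chunk. The function name and the sample trades are made up for illustration; this shows the idea, not how QuestDB executes it.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sample_by_vwap(trades, bucket=timedelta(minutes=15)):
    """Group (timestamp, price, amount) tuples into calendar-aligned buckets
    and return {bucket_start: (vwap_price, volume)}."""
    epoch = datetime(1970, 1, 1)
    totals = defaultdict(lambda: [0.0, 0.0])  # bucket -> [sum(price*amount), sum(amount)]
    for ts, price, amount in trades:
        # Floor the timestamp to the start of its bucket.
        bucket_start = epoch + ((ts - epoch) // bucket) * bucket
        totals[bucket_start][0] += price * amount
        totals[bucket_start][1] += amount
    return {
        start: (notional / volume, volume)
        for start, (notional, volume) in sorted(totals.items())
    }


# Made-up trades: two fall in the 10:00 bucket, one in the 10:15 bucket.
trades = [
    (datetime(2023, 6, 1, 10, 1), 100.0, 1.0),
    (datetime(2023, 6, 1, 10, 14), 200.0, 3.0),
    (datetime(2023, 6, 1, 10, 16), 150.0, 2.0),
]
result = sample_by_vwap(trades)
for start, (vwap, volume) in result.items():
    print(start.isoformat(), vwap, volume)
```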
SAMPLE BY … FILL
Fills missing time chunks using one of several strategies: NULL, a constant, LINEAR interpolation, or PREV (the previous value)
SELECT
timestamp,
sum(price * amount) / sum(amount) AS vwap_price,
sum(amount) AS volume
FROM trades
WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now())
SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
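The fill strategies can be illustrated on a list of per-bucket aggregates where `None` marks an empty bucket. This is a sketch of the idea only: `fill` is a hypothetical helper, and leaving a trailing gap empty under LINEAR is an assumption of this sketch rather than documented QuestDB behaviour.

```python
def fill(values, strategy="NULL", constant=None):
    """Fill None gaps in a list of per-bucket aggregates."""
    out = list(values)
    for i, value in enumerate(out):
        if value is not None:
            continue
        if strategy == "CONSTANT":
            out[i] = constant
        elif strategy == "PREV":
            # out[i - 1] is already filled, so the last known value carries forward.
            out[i] = out[i - 1] if i > 0 else None
        elif strategy == "LINEAR":
            prev_i = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
            next_i = next((j for j in range(i + 1, len(values)) if values[j] is not None), None)
            if prev_i is not None and next_i is not None:
                frac = (i - prev_i) / (next_i - prev_i)
                out[i] = values[prev_i] + frac * (values[next_i] - values[prev_i])
        # strategy == "NULL": leave the gap as None
    return out


buckets = [10.0, None, 14.0, None]
filled_null = fill(buckets, "NULL")
filled_prev = fill(buckets, "PREV")
filled_linear = fill(buckets, "LINEAR")
filled_constant = fill(buckets, "CONSTANT", constant=0.0)
print(filled_null, filled_prev, filled_linear, filled_constant)
```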
LATEST ON … PARTITION BY …
Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple time series are stored in the same table.
SELECT * FROM trades
WHERE symbol in ('BTC-USD', 'ETH-USD')
LATEST ON timestamp PARTITION BY symbol, side;
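In plain Python, the same "latest row per key" idea looks roughly like the sketch below. The helper name and the sample rows are hypothetical, and the database does this far more efficiently than a full scan.

```python
def latest_on(rows, ts_key, partition_keys):
    """Keep, for each combination of partition keys, the row with the
    greatest timestamp."""
    latest = {}
    for row in rows:
        key = tuple(row[k] for k in partition_keys)
        if key not in latest or row[ts_key] >= latest[key][ts_key]:
            latest[key] = row
    return list(latest.values())


# Made-up rows; timestamps are plain integers to keep the example short.
rows = [
    {"timestamp": 1, "symbol": "BTC-USD", "side": "buy", "price": 100.0},
    {"timestamp": 2, "symbol": "BTC-USD", "side": "sell", "price": 101.0},
    {"timestamp": 3, "symbol": "BTC-USD", "side": "buy", "price": 102.0},
    {"timestamp": 2, "symbol": "ETH-USD", "side": "buy", "price": 10.0},
]
latest_rows = latest_on(rows, "timestamp", ["symbol", "side"])
for row in latest_rows:
    print(row)
```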
ASOF JOIN / LT JOIN / SPLICE JOIN
ASOF JOIN joins two different time-series. For each row in the first time-series, it takes from the second time-series the row whose timestamp meets both of the following criteria:
● It is the closest to the first timestamp.
● It is earlier than or equal to the first timestamp.
WITH trips2018 AS (
SELECT * from trips WHERE pickup_datetime in '2018'
)
SELECT pickup_datetime, fare_amount, tempF, windDir
FROM trips2018
ASOF JOIN weather;
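The matching rule can be sketched with a binary search over the second series. This assumes both inputs are already sorted by timestamp; `asof_join` and the sample data are hypothetical, and integer timestamps are used for brevity.

```python
from bisect import bisect_right


def asof_join(left, right):
    """left, right: lists of (timestamp, value) sorted by timestamp.
    For each left row, attach the value of the latest right row whose
    timestamp is earlier than or equal to the left timestamp."""
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, value in left:
        i = bisect_right(right_ts, ts)  # number of right rows with timestamp <= ts
        match = right[i - 1][1] if i > 0 else None
        joined.append((ts, value, match))
    return joined


# Made-up data: trips on the left, weather readings on the right.
trips = [(5, "trip-a"), (10, "trip-b"), (12, "trip-c")]
weather = [(6, 60.0), (10, 61.0)]
joined = asof_join(trips, weather)
for row in joined:
    print(row)
```

The first trip has no earlier weather reading, so it is paired with `None`.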
Building a Sink connector
QuestDB cannot do in-stream deduplication.
Apache Beam can help
The Python QuestDB Sink
● WriteToQuestDB(PTransform) class
○ Receives the args you need to pass to the sink
○ Implements the expand method, which receives the PCollection and applies ParDo with _WriteToQuestDBFn
● _WriteToQuestDBFn(DoFn) class
○ Instantiates _QuestDBSink on start_bundle
○ Flushes/releases _QuestDBSink on finish_bundle
○ Implements display_data to show info on the UI
○ Calls _QuestDBSink.write in the process method
● _QuestDBSink class
○ Deals with the QuestDB connection itself
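The three-class layout above can be sketched without Beam installed. The classes below are hypothetical stand-ins, not the code from the repository: in a real pipeline the second class would subclass `apache_beam.DoFn`, and `send` would write to a QuestDB connection. Rows are formatted as simplified InfluxDB line protocol, assuming float fields only, no escaping, and a server-assigned timestamp.

```python
class QuestDBSinkSketch:
    """Hypothetical stand-in for _QuestDBSink: buffers rows and sends them in batches."""

    def __init__(self, table, symbols, columns, send, batch_size=1000):
        self.table = table
        self.symbols = symbols
        self.columns = columns
        self.send = send  # callable that receives one batch as a string
        self.batch_size = batch_size
        self.buffer = []

    def write(self, row):
        tags = ",".join(f"{s}={row[s]}" for s in self.symbols)
        fields = ",".join(f"{c}={row[c]}" for c in self.columns)
        head = self.table + ("," + tags if tags else "")
        self.buffer.append(f"{head} {fields}\n")
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send("".join(self.buffer))
            self.buffer.clear()


class WriteToQuestDBFnSketch:
    """Mirrors the DoFn lifecycle: open on start_bundle, write on process,
    flush and release on finish_bundle."""

    def __init__(self, table, symbols, columns, send, batch_size=1000):
        self.args = (table, symbols, columns, send, batch_size)
        self.sink = None

    def start_bundle(self):
        self.sink = QuestDBSinkSketch(*self.args)

    def process(self, element):
        self.sink.write(element)

    def finish_bundle(self):
        self.sink.flush()
        self.sink = None


# Drive the lifecycle by hand, collecting batches in a list instead of a socket.
sent = []
fn = WriteToQuestDBFnSketch("trades", ["symbol"], ["price"], sent.append, batch_size=2)
fn.start_bundle()
for row in [
    {"symbol": "BTC-USD", "price": 100.0},
    {"symbol": "BTC-USD", "price": 101.0},
    {"symbol": "ETH-USD", "price": 10.0},
]:
    fn.process(row)
fn.finish_bundle()
print(sent)
```

Creating the sink per bundle keeps the connection lifetime short and guarantees a flush at every bundle boundary, at the cost of reconnecting more often.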
The Python QuestDB Sink
https://github.com/javier/questdb-beam/tree/main/python
pcoll | WriteToQuestDB(table,
symbols=[list_of_symbols],
columns=[list_of_columns],
host=host,
port=port,
batch_size=optionalSizeOfBatch,
tls=optionalBoolean,
auth=optionalAuthDict)
The Java QuestDB Sink
● QuestDbIO.Write class, extends PTransform
○ Receives the args you need to pass to the sink
○ Uses @AutoValue to generate classes “magically”
○ Implements the expand method, which receives the PCollection and applies ParDo with QuestDbIO.Write.WriteFn (with optional deduplication)
○ Implements populateDisplayData
● QuestDbIO.Write.WriteFn class, extends DoFn
○ Instantiates QuestDBSender on start_bundle
○ Flushes/closes QuestDBSender on finish_bundle
○ Parses/sends the QuestDbRow to QuestDB on the process method
Where the magic happens
https://github.com/javier/questdb-beam/blob/main/java/src/main/java/org/apache/beam/sdk/io/questdb/QuestDbIO.java
// Key every row by its hash code so identical rows share a key.
PCollection<KV<String, QuestDbRow>> keyedRows =
    input.apply(WithKeys.of(new SerializableFunction<QuestDbRow, String>() {
      @Override
      public String apply(QuestDbRow r) {
        return String.valueOf(r.hashCode());
      }
    }));

// Put each key into session windows; the gap is configured in milliseconds.
PCollection<KV<String, QuestDbRow>> windowedRows =
    keyedRows.apply(
        Window.<KV<String, QuestDbRow>>into(
            Sessions.withGapDuration(
                Duration.millis(deduplicationDurationMillis()))));

// Drop repeated keys within each window, then discard the keys.
PCollection<QuestDbRow> uniqueRows =
    windowedRows
        .apply(Deduplicate.<String, QuestDbRow>keyedValues())
        .apply(Values.<QuestDbRow>create());
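The effect of that pipeline fragment can be approximated in a few lines of plain Python: a row is emitted only if an equal row has not been seen within the gap. This is a simplified sketch, not Beam's implementation. It assumes rows arrive in event-time order and keys on the row itself, whereas the Java code keys on the row's hashCode rendered as a string.

```python
def deduplicate(rows, gap_seconds):
    """rows: iterable of (event_time_seconds, payload) in event-time order.
    Emit a payload only if no equal payload was seen within gap_seconds."""
    last_seen = {}
    unique = []
    for ts, payload in rows:
        previous = last_seen.get(payload)
        if previous is None or ts - previous > gap_seconds:
            unique.append((ts, payload))
        # Every sighting, duplicate or not, extends the session for this key.
        last_seen[payload] = ts
    return unique


# Made-up events: "a" repeats after 1 second (dropped) and again 9 seconds
# later (kept, because the 5-second gap has expired).
events = [(0, "a"), (1, "a"), (2, "b"), (10, "a")]
unique_events = deduplicate(events, gap_seconds=5)
print(unique_events)
```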
The Java QuestDB Sink
https://github.com/javier/questdb-beam/tree/main/java
// The collection passed to QuestDbIO.write() must contain QuestDbRow elements
PCollection<QuestDbRow> parsedLines = pcoll.apply(ParDo.of(new LineToMapFn()));
parsedLines.apply(QuestDbIO.write()
    .withUri("your-instance-host.questdb.com:YOUR_PORT")
    .withTable("beam_demo")
    .withDeduplicationEnabled(true)
    .withDeduplicationByValue(false)
    .withDeduplicationDurationMillis(5L)
    .withSSLEnabled(true)
    .withAuthEnabled(true)
    .withAuthUser("admin")
    .withAuthToken("verySecretToken"));
QUESTIONS?
Javier Ramirez
@supercoco9
https://github.com/javier/questdb-beam
https://github.com/javier/questdb-quickstart
https://github.com/questdb/questdb
https://demo.questdb.io
https://cloud.questdb.com
Thanks!
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Ad

Deduplicating and analysing time-series data with Apache Beam and QuestDB

  • 2. NYC 2023 Javier Ramírez QuestDB @supercoco9 / @j@chaos.social Deduplicating And Analysing Time-Series Data With Apache Beam And QuestDB
  • 3. About me: I like databases & open source 2022- today. Developer relations at an open source database vendor ● QuestDB, PostgreSQL, MongoDB, Timescale, InfluxDB, Apache Flink 2019-2022. Data & Analytics specialist at a cloud provider ● Amazon Aurora, Neptune, Athena, Timestream, DynamoDB, DocumentDB, Kinesis Data Streams, Kinesis Data Analytics, Redshift, ElastiCache for Redis, QLDB, ElasticSearch, OpenSearch, Cassandra, Spark… 2013-2018. Data Engineer/Big Data & Analytics consultant ● PostgreSQL, Redis, Neo4j, Google BigQuery, BigTable, Google Cloud Spanner, Apache Spark, Apache BEAM, Apache Flink, HBase, MongoDB, Presto 2006-2012 - Web developer ● MySQL, Redis, PostgreSQL, Sqlite, ElasticSearch late nineties to 2005. Desktop/CGI/Servlets/ EJBs/CORBA ● MS Access, MySQL, Oracle, Sybase, Informix As a student/hobbyist (late eighties - early nineties) ● Amsbase, DBase III, DBase IV, Foxpro, Microsoft Works, Informix The pre-SQL years The licensed SQL period The libre and open SQL revolution / The NoSQL rise The hadoop dark ages / The python hegemony/ The cloud database big migrations The streaming era/ The database as a service singularity The SQL revenge/ the realtime database/the embedded database
  • 4. BEAM SUMMIT NYC 2023 # Agenda ● The problem of data duplication ● The problem of data duplication ● The problem of data duplication ● The problem of data duplication ● Behold: a dashboard! ● The many challenges of time-series data ● QuestDB to the rescue ● Down the rabbit hole of writing a custom BEAM Sink ○ Finding several needles on a documentation haystack ○ When I sadly discovered Python streaming support is meh ○ The unsung hero saves the day (again): implementing the Sink in Java
  • 5. BEAM SUMMIT NYC 2023 # Duplication WHY
  • 6. BEAM SUMMIT NYC 2023 # Duplication HOW
  • 7. BEAM SUMMIT NYC 2023 # Duplication WHAT
  • 8. BEAM SUMMIT NYC 2023 # My lazy approach to choosing a database If you can use only one database for everything, go with PostgreSQL* * Or any other major and well supported RDBMS
  • 9. BEAM SUMMIT NYC 2023 # Imagine… a factory floor with 500 machines, or a fleet with 500 vehicles, or 50 trains, with 10 cars each, or 500 users with a mobile phone, or 500 financial instruments generating tick data …sending data every second
  • 10. BEAM SUMMIT NYC 2023 # A conventional database’s nightmare 43,200,000 rows a day……. 302,400,000 rows a week…. 1,314,144,000 rows a month
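The row counts on slide 10 follow from the scenario on slide 9: 500 sources, each sending one row per second. A minimal sketch of the arithmetic is below. The 30.42-day average month is an assumption inferred from the slide's monthly figure, not something the slide states.

```python
SOURCES = 500                    # machines, vehicles, users or instruments (slide 9)
SECONDS_PER_DAY = 24 * 60 * 60   # one row per source per second

rows_per_day = SOURCES * SECONDS_PER_DAY
rows_per_week = rows_per_day * 7
# Assumption: an average month of 30.42 days, kept in integer arithmetic.
rows_per_month = rows_per_day * 3042 // 100

print(f"{rows_per_day:,} rows a day")
print(f"{rows_per_week:,} rows a week")
print(f"{rows_per_month:,} rows a month")
```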
  • 11. BEAM SUMMIT NYC 2023 # Timestamps are hard
  • 12. BEAM SUMMIT NYC 2023 # Time-series analytics in a nutshell Working with timestamped data in a database is tricky* * especially when running analytics on data that changes over time or arrives at a high rate
  • 14. BEAM SUMMIT NYC 2023 # We’d like to be known for ● Performance ○ Better performance with smaller machines ● Developer Experience ● Proudly Open Source (Apache 2.0)
  • 15. NYC 2023 A quick overview of some interesting queries
  • 16. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io WHERE … TIME RANGE SELECT * from trips WHERE pickup_datetime in '2018'; SELECT * from trips WHERE pickup_datetime in '2018-06'; SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59'; SELECT * from trips WHERE pickup_datetime in '2018;2M' LIMIT -10; SELECT * from trips WHERE pickup_datetime in '2018;10s' LIMIT -10; SELECT * from trips WHERE pickup_datetime in '2018;-3d' LIMIT -10; SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;1d;7'; SELECT * from trips WHERE pickup_datetime in '2018-06-21T23:59:58;4s;-1d;7';
  • 17. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io SAMPLE BY Aggregates data in homogeneous time chunks SELECT timestamp, sum(price * amount) / sum(amount) AS vwap_price, sum(amount) AS volume FROM trades WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now()) SAMPLE BY 15m ALIGN TO CALENDAR; SELECT timestamp, min(tempF), max(tempF), avg(tempF) FROM weather SAMPLE BY 1M;
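To make the SAMPLE BY query on slide 17 concrete, here is a plain-Python sketch of the same idea: floor each timestamp to a fixed-size bucket, then compute the volume-weighted average price and total volume per bucket. The function name `sample_by_vwap` and the sample rows are hypothetical. This illustrates the concept only and is not how QuestDB implements SAMPLE BY.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def sample_by_vwap(trades, bucket=timedelta(minutes=15)):
    """Return (bucket_start, vwap_price, volume) for each time bucket.

    trades: iterable of (timestamp, price, amount) with timezone-aware timestamps.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # bucket_start -> [sum(price*amount), sum(amount)]
    for ts, price, amount in trades:
        # Floor the timestamp to the start of its bucket, measured from the epoch.
        start = EPOCH + ((ts - EPOCH) // bucket) * bucket
        sums[start][0] += price * amount
        sums[start][1] += amount
    return [(start, notional / volume, volume)
            for start, (notional, volume) in sorted(sums.items())]

# Hypothetical sample rows: two trades in the 10:00 bucket, one in the 10:15 bucket.
utc = timezone.utc
trades = [
    (datetime(2023, 6, 14, 10, 1, tzinfo=utc), 100.0, 1.0),
    (datetime(2023, 6, 14, 10, 14, tzinfo=utc), 200.0, 3.0),
    (datetime(2023, 6, 14, 10, 16, tzinfo=utc), 150.0, 2.0),
]
result = sample_by_vwap(trades)
for row in result:
    print(row)
```

Buckets with no rows simply do not appear in the result, which is the gap that the FILL clause addresses.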
  • 18. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io SAMPLE BY … FILL Can fill missing time chunks using different strategies (NULL, constant, LINEAR, PREVious value) SELECT timestamp, sum(price * amount) / sum(amount) AS vwap_price, sum(amount) AS volume FROM trades WHERE symbol = 'BTC-USD' AND timestamp > dateadd('d', -1, now()) SAMPLE BY 1s FILL(NULL) ALIGN TO CALENDAR;
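The four fill strategies named on slide 18 can be illustrated on a list of per-bucket values in which `None` marks a bucket with no rows. This is a sketch under assumptions: `fill_missing` is a hypothetical helper, and QuestDB's exact behaviour at the edges of a series may differ from what is shown here.

```python
def fill_missing(samples, strategy="NULL", constant=None):
    """Fill None entries in a list of consecutive per-bucket values."""
    filled = list(samples)
    for i, value in enumerate(samples):
        if value is not None:
            continue
        if strategy == "CONSTANT":
            filled[i] = constant
        elif strategy == "PREV":
            # Carry the previous (possibly already filled) value forward.
            filled[i] = filled[i - 1] if i > 0 else None
        elif strategy == "LINEAR":
            # Interpolate between the nearest real values on each side.
            prev_i = next((j for j in range(i - 1, -1, -1) if samples[j] is not None), None)
            next_i = next((j for j in range(i + 1, len(samples)) if samples[j] is not None), None)
            if prev_i is not None and next_i is not None:
                step = (samples[next_i] - samples[prev_i]) / (next_i - prev_i)
                filled[i] = samples[prev_i] + step * (i - prev_i)
        # strategy == "NULL" leaves the gap as None
    return filled

samples = [10.0, None, None, 16.0]
for strategy in ("NULL", "CONSTANT", "PREV", "LINEAR"):
    print(strategy, fill_missing(samples, strategy, constant=0.0))
```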
  • 19. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io LATEST ON … PARTITION BY … Retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple time series are stored in the same table. SELECT * FROM trades WHERE symbol in ('BTC-USD', 'ETH-USD') LATEST ON timestamp PARTITION BY symbol, side;
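The semantics of LATEST ON … PARTITION BY described on slide 19 can be sketched in a few lines of Python: keep, for every combination of partition keys, the row with the greatest timestamp. The helper `latest_on` and the sample rows are hypothetical and only illustrate the result shape, not QuestDB's execution strategy.

```python
def latest_on(rows, timestamp_key, partition_keys):
    """Return the most recent row for each combination of partition keys."""
    latest = {}
    for row in rows:
        key = tuple(row[k] for k in partition_keys)
        # Keep the row with the greatest timestamp seen so far for this key.
        if key not in latest or row[timestamp_key] >= latest[key][timestamp_key]:
            latest[key] = row
    return list(latest.values())

# Hypothetical rows; timestamps are plain integers for brevity.
trades = [
    {"timestamp": 1, "symbol": "BTC-USD", "side": "buy", "price": 100.0},
    {"timestamp": 2, "symbol": "BTC-USD", "side": "sell", "price": 101.0},
    {"timestamp": 3, "symbol": "BTC-USD", "side": "buy", "price": 102.0},
    {"timestamp": 4, "symbol": "ETH-USD", "side": "buy", "price": 10.0},
]
latest = latest_on(trades, "timestamp", ["symbol", "side"])
for row in latest:
    print(row)
```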
  • 20. BEAM SUMMIT NYC 2023 # Try it live on https://demo.questdb.io ASOF JOIN / LT JOIN / SPLICE JOIN ASOF JOIN joins two different time series. For each row in the first time series, the ASOF JOIN takes from the second time series the row whose timestamp meets both of the following criteria: ● The timestamp is the closest to the first timestamp. ● The timestamp is prior to or equal to the first timestamp. WITH trips2018 AS ( SELECT * from trips WHERE pickup_datetime in '2016' ) SELECT pickup_datetime, fare_amount, tempF, windDir FROM trips2018 ASOF JOIN weather;
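The two ASOF JOIN criteria on slide 20 (closest timestamp, and not later than the left-hand timestamp) map naturally onto a binary search over the right-hand series. The sketch below assumes both inputs are already sorted by timestamp. `asof_join` and the sample data are hypothetical names used for illustration.

```python
from bisect import bisect_right

def asof_join(left, right):
    """Join each left row with the latest right row at or before its timestamp.

    left, right: lists of (timestamp, payload) sorted by timestamp.
    Returns (timestamp, left_payload, right_payload_or_None) tuples.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, payload in left:
        # Count of right-hand rows whose timestamp is <= ts.
        i = bisect_right(right_ts, ts)
        match = right[i - 1][1] if i > 0 else None
        joined.append((ts, payload, match))
    return joined

# Hypothetical data: trip fares joined with the most recent temperature reading.
trips = [(5, 12.5), (10, 8.0), (12, 30.0)]
weather = [(6, 50), (10, 55)]
joined = asof_join(trips, weather)
for row in joined:
    print(row)
```

The first trip has no earlier weather reading, so its right-hand side is `None`.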
  • 21. BEAM SUMMIT NYC 2023 # Building a Sink connector QuestDB cannot do in-stream deduplications. Apache BEAM can help
  • 22. BEAM SUMMIT NYC 2023 # The Python QuestDB Sink ● WriteToQuestDB(PTransform) class ○ Receives the args you need to pass to the sink ○ Implements the expand method, which receives the PCollection then invokes ParDo to _WriteToQuestDBFn ● _WriteToQuestDBFn(DoFn) class ○ Instantiates _QuestDBSink on start_bundle ○ Flushes/releases _QuestDBSink on finish_bundle ○ Implements display_data to show info on the UI ○ Calls _QuestDBSink.write on the process method ● _QuestDBSink class ○ Deals with the QuestDB connection itself
  • 23. BEAM SUMMIT NYC 2023 # The Python QuestDB Sink https://github.com/javier/questdb-beam/tree/main/python pcoll | WriteToQuestDB(table, symbols=[list_of_symbols], columns=[list_of_columns], host=host, port=port, batch_size=optionalSizeOfBatch, tls=optionalBoolean, auth=optionalAuthDict)
  • 25. BEAM SUMMIT NYC 2023 # The Java QuestDB Sink ● QuestDbIO.Write class, extends PTransform ○ Receives the args you need to pass to the sink ○ Uses @AutoValue to generate classes “magically” ○ Implements the expand method, which receives the PCollection then invokes ParDo to QuestDbIO.Write.WriteFn (with optional deduplication) ○ Implements populateDisplayData ● QuestDbIO.Write.WriteFn class, extends DoFn ○ Instantiates QuestDBSender on start_bundle ○ Flushes/closes QuestDBSender on finish_bundle ○ Parses/sends the QuestDbRow to QuestDB on the process method
  • 26. BEAM SUMMIT NYC 2023 # Where the magic happens https://github.com/javier/questdb-beam/blob/main/java/src/main/java/org/apache/beam/sdk/io/questdb/QuestDbIO.java keydAndWindowed = (PCollection) input.apply(WithKeys.of(new SerializableFunction<QuestDbRow, String>() { @Override public String apply(QuestDbRow r) { return String.valueOf(r.hashCode()); } })); PCollection windowedItems = (PCollection) keydAndWindowed.apply(Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardSeconds(deduplicationDurationMillis())))); PCollection<QuestDbRow> uniqueRows = (PCollection<QuestDbRow>) ((PCollection) keydAndWindowed.apply(Deduplicate.keyedValues())).apply(Values.create());
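The idea behind slide 26 (key each row, group repeats that arrive within a gap of one another, emit one row per group) can be sketched without Beam. The generator below is a plain-Python approximation, not the Beam implementation: `deduplicate` is a hypothetical name, rows are keyed by value rather than by hash code, events are assumed to arrive in event-time order, and treating a repeat exactly one gap later as a new session is an assumption.

```python
def deduplicate(events, gap_seconds):
    """Yield only the first occurrence of each row within a session.

    events: iterable of (event_time_seconds, row), in event-time order;
    rows must be hashable. A repeat seen less than gap_seconds after the
    previous occurrence is dropped and extends the session.
    """
    last_seen = {}  # row -> event time of its most recent occurrence
    for event_time, row in events:
        previous = last_seen.get(row)
        last_seen[row] = event_time
        if previous is None or event_time - previous >= gap_seconds:
            yield event_time, row

events = [(0, "a"), (1, "a"), (2, "b"), (4, "a"), (20, "a")]
unique = list(deduplicate(events, gap_seconds=5))
print(unique)
```

Unlike a windowed pipeline, this sketch never evicts old keys, so memory grows with the number of distinct rows. A streaming runner can discard that state once a window closes.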
  • 27. BEAM SUMMIT NYC 2023 # The Java QuestDB Sink https://github.com/javier/questdb-beam/tree/main/java // pcoll needs to be of type QuestDbRow pcoll.apply(ParDo.of(new LineToMapFn())); parsedLines.apply(QuestDbIO.write() .withUri("your-instance-host.questdb.com:YOUR_PORT") .withTable("beam_demo") .withDeduplicationEnabled(true) .withDeduplicationByValue(false) .withDeduplicationDurationMillis(5L) .withSSLEnabled(true) .withAuthEnabled(true) .withAuthUser("admin") .withAuthToken("verySecretToken"));