© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
Part 1 of 3: The basics of real-time streaming analytics
Getting started with streaming analytics
Javier Ramirez
AWS Developer Advocate
@supercoco9
Agenda
Why real-time analytics and data streaming?
Challenges of streaming analytics
Useful concepts to reason about streaming data
Components of a streaming analytics pipeline
Overview of popular Open Source components for streaming analytics: Apache Kafka, Apache Spark, Apache Flink, Apache Cassandra, Apache HBase, Elasticsearch
AWS toolbox for streaming analytics: Amazon MSK, Amazon EMR, Amazon Kinesis, Amazon Keyspaces, Amazon DynamoDB, Amazon Elasticsearch
Why streaming analytics
• The number of “smart” devices is projected to be 200 billion by 2020 (over 100X increase in ten years)
• 90% of the data in the world was generated in the last 2 years
• There are 2.5 quintillion bytes of data created each day, and this pace is accelerating
Source: BI Intelligence Estimates
Source: Forbes – How much data do we produce
Data streaming technology enables a customer to ingest, process, and analyze high volumes of high-velocity data from a variety of sources
The value of data diminishes over time
Source: Perishable insights, Mike Gualtieri, Forrester
[Figure: value of data to decision-making plotted against time (real time, seconds, minutes, hours, days, months). The information half-life in decision-making runs from preventive/predictive and actionable (time-critical decisions) to reactive and historical (traditional “batch” business intelligence).]
Can't I just use batch big data analytics tools?
https://aws.amazon.com/streaming-data/
Data is never complete
You don’t know the volume of the data before you start
Low latency is expected
Data can arrive out of order
The system should remain available during upgrades
A simple problem (until you know the details)
I want to calculate the total and average of several numbers
A simple big data problem (until you know the details)
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory,
or in a single hard drive
A simple streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
A simplish streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We
will be adding and removing sensors all the time
A quite standard streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a
while and then send a bunch of stale data
An elastic and scalable streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then
send a bunch of stale data
Flow will not be constant (from a few events per second to thousands)
An almost real-life streaming analytics scenario
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then
send a bunch of stale data
Flow will not be constant (from a few events per second to thousands)
And I don’t want just the total average, but total per month, per
week, per day, per hour, per minute…
A real business use case for streaming
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We will be adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
Flow will not be constant (from a few events per second to thousands)
And I don’t want just the total average, but total per month, per week, per day, per hour, per minute…
We need pretty dashboards with current status, comparison with the
past, trends, and anomaly detection
To run this reliably, we need advanced monitoring, alerts, and
autoscaling
No, I am not hiring a whole new operations team to manage the
system
http://gunshowcomic.com/648
Probably less than you think
~20 lines of Java code (plus a few hundred with imports, POJOs, and boilerplate, because Java)
OR
a simple GROUP BY statement in SQL with streaming extensions (plus a few lines of boilerplate for schema definition)
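As a rough illustration of how small the core logic can be, here is a minimal Python sketch of the "total and average per minute" computation, roughly what a streaming GROUP BY would express. It is not taken from the talk: the `(epoch_seconds, value)` event shape and the names `RunningStats` and `total_and_average_per_minute` are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Total and average kept in constant memory, updated one value at a time."""
    count: int = 0
    total: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


def total_and_average_per_minute(events):
    """Roughly: SELECT minute, SUM(value), AVG(value) ... GROUP BY minute.

    `events` is any iterable of (epoch_seconds, value) pairs (hypothetical
    shape). It is consumed one element at a time, so the numbers themselves
    never have to fit in memory; only one small state object per minute does.
    """
    per_minute = defaultdict(RunningStats)
    for epoch_seconds, value in events:
        minute = int(epoch_seconds // 60) * 60  # start of the minute bucket
        per_minute[minute].add(value)
    return {m: (s.total, s.average) for m, s in sorted(per_minute.items())}


readings = [(0, 10.0), (30, 20.0), (61, 5.0), (119, 15.0), (125, 7.0)]
result = total_and_average_per_minute(readings)
print(result)
```

The sketch only returns once the input ends. A real stream never ends, so a streaming engine also has to decide when each bucket is emitted, which is where windowing comes in.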
Streaming analytics concepts
Streaming data pipeline overview
Ingest → Transform → Analyze → React → Persist
What are the key requirements?
• Durable
• Stateful
• Continuous
• Fast
• Correct
• Reactive
• Reliable
Durability and reliability
Need to store intermediate data
You might want to be able to replay the stream
Self-healing architecture. If one component goes down
while data is in-flight, the system needs to re-balance and
data needs to be reassigned seamlessly
Monitoring
Stateful processing
Working on per-element streams is relatively easy (e.g. changing the format of each item, or filtering out records based on their own properties)
The real fun starts when you need to do transforms/aggregations over groups of elements: group by, count, max, average, joins, filtering based on properties from related records, or complex pattern detection
[Figure: stream elements plotted against processing time, 8:00 to 14:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
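The difference between per-element and stateful processing can be sketched in a few lines of Python. This is an illustrative example only: the `(sensor, value)` event shape, the threshold, and the function names are assumptions, not part of the original deck.

```python
from collections import defaultdict


def to_fahrenheit(event):
    """Stateless: the output depends only on the element itself."""
    sensor, celsius = event
    return sensor, celsius * 9 / 5 + 32


def detect_overheating(events, threshold=100.0, run_length=3):
    """Stateful pattern detection.

    Yields a sensor id when that sensor reports `run_length` consecutive
    readings above `threshold`. The streak counter per sensor is state that
    must survive between elements (and, in a real system, between failures).
    """
    streak = defaultdict(int)  # current run of hot readings per sensor
    for sensor, value in events:
        if value > threshold:
            streak[sensor] += 1
            if streak[sensor] == run_length:
                yield sensor
        else:
            streak[sensor] = 0


events = [
    ("a", 101), ("b", 50), ("a", 102), ("b", 120),
    ("a", 103), ("a", 104), ("b", 99), ("a", 90), ("a", 101),
]
alerts = list(detect_overheating(events))
print(alerts)
```

The stateless function can be run on any worker in any order. The stateful one needs all readings for a given sensor routed to the same place, in order, which is why keyed state is one of the harder parts of a streaming engine.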
Continuous and fast
Data can come in spikes, faster than we can process it. We need reliable persistent storage for data while it is in-flight
You will need to think about how to update a system that never stops receiving data
Since data is never complete, in the case of stateful computations, we need to decide when to output data (windowing)
Processing-Time based windows
[Figure: stream elements grouped into windows by processing time, 8:00 to 14:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
Event-Time Based Windows
[Figure: input and output elements grouped into windows by event time, shown against processing time, 10:00 to 15:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
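To make the contrast between processing-time and event-time windows concrete, the following Python sketch assigns the same three events to one-hour tumbling windows twice: once by arrival time and once by the time the event actually happened. The dictionary field names and the sample values are assumptions made for the example.

```python
from collections import defaultdict

HOUR = 3600  # one-hour tumbling windows, in seconds


def window_start(timestamp, size=HOUR):
    """Start of the fixed-size window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def sum_per_window(events, time_field):
    """Sum event values per window, keyed by the chosen notion of time."""
    totals = defaultdict(int)
    for event in events:
        totals[window_start(event[time_field])] += event["value"]
    return dict(totals)


events = [
    {"event_time": 12 * HOUR + 10, "processing_time": 12 * HOUR + 15, "value": 5},
    # Happened at 12:00:50 but only arrived at 13:00:05.
    {"event_time": 12 * HOUR + 50, "processing_time": 13 * HOUR + 5, "value": 7},
    {"event_time": 13 * HOUR + 20, "processing_time": 13 * HOUR + 25, "value": 3},
]

by_processing_time = sum_per_window(events, "processing_time")
by_event_time = sum_per_window(events, "event_time")
print(by_processing_time)
print(by_event_time)
```

With processing time, the delayed event is counted in the 13:00 window; with event time it is counted in the 12:00 window where it belongs. Processing-time windows are simpler to compute, while event-time windows give results that do not depend on network delays.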
Session Windows
[Figure: input and output elements grouped into session windows by event time, shown against processing time, 10:00 to 15:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
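A session window has no fixed size: it groups events that are close together in time and closes after a period of inactivity. A minimal Python sketch, assuming events are represented only by their timestamps and that the inactivity gap is chosen by the caller (both are assumptions for illustration):

```python
def sessionize(timestamps, gap):
    """Group event times into sessions.

    A new session starts whenever the time since the previous event is
    larger than `gap`. Sorting here stands in for event-time ordering;
    a streaming engine would instead merge sessions incrementally as
    events arrive.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity gap exceeded: new session
    return sessions


sessions = sessionize([40, 0, 5, 100, 8, 42], gap=10)
print(sessions)
```

Because the input is sorted before grouping, the result is the same regardless of arrival order, which mirrors the event-time behaviour shown in the figure.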
Correctness: Late-arriving data
Event-time vs Processing-time
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
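One common way to handle late-arriving data is a watermark: an estimate of how far event time has progressed, used to decide when a window can be closed. The sketch below is a simplified Python illustration under stated assumptions, not the mechanism of any particular engine: the watermark is taken to be the largest event time seen minus a fixed `allowed_lateness`, and the class name `EventTimeWindows` is hypothetical.

```python
from collections import defaultdict


class EventTimeWindows:
    """Tumbling event-time windows with a simple watermark (illustrative)."""

    def __init__(self, size, allowed_lateness):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None
        self.open = defaultdict(int)  # window start -> running sum
        self.closed = {}              # window start -> emitted sum
        self.dropped = []             # events that arrived too late

    def add(self, event_time, value):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.allowed_lateness

        start = event_time - (event_time % self.size)
        if start + self.size <= watermark:
            # The window this event belongs to is already considered complete.
            self.dropped.append((event_time, value))
        else:
            self.open[start] += value

        # Emit every open window that ends at or before the watermark.
        for s in sorted(self.open):
            if s + self.size <= watermark:
                self.closed[s] = self.open.pop(s)


windows = EventTimeWindows(size=60, allowed_lateness=30)
for event_time, value in [(10, 1), (70, 2), (50, 4), (95, 8), (20, 16), (130, 32)]:
    windows.add(event_time, value)
print(windows.closed, windows.dropped, dict(windows.open))
```

The event at time 50 arrives out of order but within the allowed lateness, so it is still counted. The event at time 20 arrives after its window was emitted and is dropped. Choosing the lateness is a trade-off between correctness and how long results are delayed.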
Correctness: Delivery semantics
• Exactly once
• At least once
• At most once
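The practical effect of these semantics shows up when a message is delivered more than once. The Python sketch below simulates at-least-once delivery (a message is re-sent when its acknowledgement is lost) and compares a naive consumer with an idempotent one that remembers the ids it has already processed. The message shape, the `id` field, and the helper names are assumptions for illustration; real systems use a variety of techniques to reach exactly-once results.

```python
def deliver_at_least_once(messages, lost_ack_ids):
    """Simulate a sender that redelivers when an acknowledgement is lost."""
    for message in messages:
        yield message
        if message["id"] in lost_ack_ids:
            yield message  # redelivery: the consumer sees a duplicate


class IdempotentTotal:
    """Consumer that ignores messages whose id it has already processed."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def process(self, message):
        if message["id"] in self.seen_ids:
            return  # duplicate: processing it again would double count
        self.seen_ids.add(message["id"])
        self.total += message["value"]


messages = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 20},
    {"id": 3, "value": 30},
]

naive_total = sum(m["value"] for m in deliver_at_least_once(messages, {2}))

consumer = IdempotentTotal()
for m in deliver_at_least_once(messages, {2}):
    consumer.process(m)

print(naive_total, consumer.total)
```

With at-most-once delivery the opposite failure appears: the sender never retries, so a lost message is simply missing from the total.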
Reactive
All the components need to be designed for low-latency
[Figure: value of data to decision-making plotted against time (real time, seconds, minutes, hours, days, months), from time-critical decisions to traditional “batch” business intelligence]
Source: Perishable insights, Mike Gualtieri, Forrester
Components of a streaming analytics pipeline
Streaming analytics components
• Devices and/or applications that produce real-time data at high velocity
• Data from tens of thousands of data sources can be written to a single stream
• Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
• Records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Database (NoSQL most common), Message broker, Notification system, File Storage, or Data Lake
• Analytics dashboard
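The in-stream storage behaviour described above (records kept in arrival order for a set retention period, readable and replayable by any consumer during that time) can be modelled with a small in-memory log. This is a toy Python sketch, not how any real broker is implemented; `StreamLog`, its methods, and the explicit `now` parameter are assumptions made for the example.

```python
class StreamLog:
    """Toy append-only log with offsets and time-based retention."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []      # (offset, arrival_time, payload), in order
        self.next_offset = 0

    def append(self, payload, now):
        """Store a record in arrival order and return its offset."""
        offset = self.next_offset
        self.records.append((offset, now, payload))
        self.next_offset += 1
        return offset

    def expire(self, now):
        """Drop records older than the retention period."""
        self.records = [
            r for r in self.records if now - r[1] < self.retention_seconds
        ]

    def read_from(self, offset):
        """Read every retained record at or after `offset`, in order."""
        return [(o, p) for o, _, p in self.records if o >= offset]


log = StreamLog(retention_seconds=100)
log.append("a", now=0)
log.append("b", now=10)
log.append("c", now=50)

first_read = log.read_from(0)
replay = log.read_from(0)      # reading does not consume: replay is free
log.expire(now=105)
after_expiry = log.read_from(0)
print(first_read, after_expiry)
```

Because reading does not remove records, several independent consumers can process the same stream at their own pace, and a consumer that fails can resume from its last offset.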
The (excellent) Open Source ecosystem
Ingestion/in-stream storage: Apache Kafka
A distributed streaming platform
Concepts:
Producers
Topics
Brokers
Consumers
Ingestion/in-stream storage: Apache Flume
Distributed, reliable, and available service for collecting,
aggregating, and moving large amounts of log data
Stream Processing: Apache Spark
Unified Analytics Engine for large-scale data processing
Concepts:
Driver/Workers
Data Source
Discretized Stream
Transforms
Streaming SQL
Outputs
Stream Processing: Apache Flink
Stateful computation over Data Streams
Concepts:
Job Manager/Workers
Source
DataStream
Transforms/Operators
TableAPI/SQL
Sinks
Stream Storage: Apache Cassandra
Manage massive amounts of data, fast, without losing sleep
https://cassandra.apache.org/
Concepts:
Nodes
Token Ring
Consistency Levels
Column Families
Stream Storage: Apache HBase
The Hadoop database, a distributed, scalable, big data store
https://hbase.apache.org/book.html
First, make sure you have enough data. If you have
hundreds of millions or billions of rows, then HBase
is a good candidate. If you only have a few
thousand/million rows, then using a traditional
RDBMS might be a better choice due to the fact
that all of your data might wind up on a single node
(or two) and the rest of the cluster may be sitting
idle.
Concepts: HBase Master, Regions, Region Servers, Data Nodes, Column Families
Dashboard: Elasticsearch with Kibana
Elasticsearch is a distributed JSON-based search and analytics engine. Kibana gives shape to your data
https://www.elastic.co/kibana
Wikimedia has a live interactive dashboard powered by Kibana at https://wikimedia.biterg.io/
Concepts:
Master Node
Data Nodes
Shard
Index
Dashboard: Grafana
Grafana allows you to query, visualize, alert on and understand your metrics no matter where they are stored.
https://grafana.com/grafana/
Wikimedia also has a live interactive metrics dashboard powered by Grafana at https://grafana.wikimedia.org/
Concepts:
Data Source
Challenges of data streaming components
• Difficult to set up
• Tricky to scale
• Hard to achieve high availability
• Integration requires development
• Error-prone and complex to manage
• Expensive to maintain
AWS services for streaming analytics
Both managed services and native services
Streaming real-time data with AWS
* Some services scale up and down elastically, while others allow you to automate when to scale up/down
** It is possible to have a serverless data streaming pipeline, in which you pay only for what you use. In the case of managed
non-serverless services, you can dynamically adapt to your traffic
Services for Ingestion/in-stream storage
Amazon Managed Streaming for Apache Kafka
Fully managed version of Apache Kafka
Amazon Kinesis Data Streams
Massively scalable, elastic, and durable real-time data streaming
Amazon Kinesis Data Firehose
Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data
into data lakes, data stores, and analytics services.
AWS Glue with serverless streaming
Simple, flexible, and cost-effective ETL
Services for stream processing
Amazon Kinesis Data Analytics for Apache Flink
Fully managed, elastic, version of Apache Flink
Amazon Kinesis Data Analytics for SQL Applications
Process and analyze streaming data using standard SQL
Amazon EMR
Easily run and scale Apache Spark and other big data frameworks. You can also
run Apache Flink and Apache HBase on EMR
AWS Glue with serverless streaming
Simple, flexible, and cost-effective ETL. Supports Spark for serverless ETL
Services for stream storage
Amazon Keyspaces for Apache Cassandra
Scalable, highly available, and managed Apache Cassandra compatible db service
Amazon DynamoDB
Fast and flexible NoSQL database service for any scale (for example, in 2017 Samsung
Cloud Service was serving 300M users with a total storage of 860TB)
Amazon EMR
Easily run and scale Apache HBase and other big data frameworks. You can also run
Apache Flink and Apache Spark on EMR
Services for analytics dashboards
Amazon Elasticsearch Service
Fully managed, scalable, and secure Elasticsearch service
Amazon QuickSight
Fast, cloud-powered business intelligence service that makes it easy to deliver
insights to everyone in your organization.
A serverless data stream (per element processing)
[Diagram: a data producer continuously streams data into Kinesis Data Streams. The Lambda service continuously polls the stream for new data (1 poll per second) and automatically invokes your function(s), Lambda function A and Lambda function B, when data is found. Amazon SNS and DynamoDB are shown as destinations.]
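For per-element processing, the function invoked by the Lambda service receives a batch of Kinesis records whose payloads are base64-encoded. A minimal Python handler sketch follows. The event layout (`Records[].kinesis.data`) is the documented Kinesis event shape; the assumption that each payload is a JSON object with a numeric `value` field, and the summary returned at the end, are hypothetical and chosen only for this example.

```python
import base64
import json


def handler(event, context):
    """Entry point invoked by Lambda with a batch of Kinesis records."""
    count = 0
    total = 0.0
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside the event.
        payload = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(payload)  # assumed: JSON with a "value" field
        total += reading["value"]
        count += 1
        # A real function would write to DynamoDB or publish to SNS here.
    return {"count": count, "total": total}


def make_record(obj):
    """Build a fake Kinesis record for local testing (helper for this sketch)."""
    data = base64.b64encode(json.dumps(obj).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}


sample_event = {
    "Records": [
        make_record({"sensor": "a", "value": 1.5}),
        make_record({"sensor": "b", "value": 2.5}),
    ]
}
summary = handler(sample_event, None)
print(summary)
```

Each invocation sees only its own batch, so anything that must span batches (running totals, windows) has to live in an external store or in a stateful stream processor.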
Fully managed stateful streaming analytics
Getting Started
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
A great write-up on streaming analytics challenges
https://aws.amazon.com/streaming-data/
Streaming data
https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html
Getting started with Apache Kafka/Amazon MSK
https://aws.amazon.com/kinesis/
Amazon Kinesis Services for streaming data
https://aws.amazon.com/elasticsearch-service/
Amazon Elasticsearch Service
https://dl.acm.org/doi/10.1145/543613.543615
Research about Models and Issues in data stream systems
Thanks
Javier Ramirez
AWS Developer Advocate
@supercoco9
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
javier ramirez
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analytics
javier ramirez
 
Getting started with streaming analytics: streaming basics (1 of 3)

  • 1. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark. Part 1 of 3: The basics of real-time streaming analytics. Getting started with streaming analytics. Javier Ramirez, AWS Developer Advocate, @supercoco9
  • 2. Agenda: Why real-time analytics and data streaming? Challenges of streaming analytics. Useful concepts to reason about streaming data. Components of a streaming analytics pipeline. Overview of popular open source components for streaming analytics: Apache Kafka, Apache Spark, Apache Flink, Apache Cassandra, Apache HBase, Elasticsearch. AWS toolbox for streaming analytics: Amazon MSK, Amazon EMR, Amazon Kinesis, Amazon Keyspaces, Amazon DynamoDB, Amazon Elasticsearch Service.
  • 3. Why streaming analytics: The number of “smart” devices is projected to be 200 billion by 2020 (over 100X increase in ten years). 90% of the data in the world was generated in the last 2 years. There are 2.5 quintillion bytes of data created each day, and this pace is accelerating. Sources: BI Intelligence Estimates; Forbes, “How much data do we produce”. Data streaming technology enables a customer to ingest, process, and analyze high volumes of high-velocity data from a variety of sources.
  • 4. The value of data diminishes over time. [Figure: value of data to decision-making plotted against time (real time, seconds, minutes, hours, days, months). Stages: preventive/predictive, actionable, reactive, historical. Regions: time-critical decisions; traditional “batch” business intelligence. Annotation: information half-life in decision-making.] Source: Perishable insights, Mike Gualtieri, Forrester.
  • 5. Can’t I just use batch big data analytics tools? https://aws.amazon.com/streaming-data/
  • 6. Can’t I just use batch big data analytics tools? Data is never complete. You don’t know the volume of the data before you start. Low latency is expected. Data can come out of order. The system should remain available during upgrades.
  • 7. A simple problem (until you know the details): I want to calculate the total and average of several numbers.
  • 8. A simple big data problem (until you know the details): I want to calculate the total and average of several numbers. They might be MANY numbers, more than you can store in memory or on a single hard drive.
  • 9. A simple streaming problem: I want to calculate the total and average of several numbers. They might be MANY numbers, more than you can store in memory or on a single hard drive. The dataset is not static: new numbers are coming all the time.
  • 10. A simplish streaming problem: I want to calculate the total and average of several numbers. They might be MANY numbers, more than you can store in memory or on a single hard drive. The dataset is not static: new numbers are coming all the time. They come from different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time.
  • 11. A quite standard streaming problem: I want to calculate the total and average of several numbers. They might be MANY numbers, more than you can store in memory or on a single hard drive. The dataset is not static: new numbers are coming all the time. They come from different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time. And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data.
  • 12. An elastic and scalable streaming problem: I want to calculate the total and average of several numbers. They might be MANY numbers, more than you can store in memory or on a single hard drive. The dataset is not static: new numbers are coming all the time. They come from different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time. And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data. Flow will not be constant (from a few events per second to thousands).
  • 13. An almost real-life streaming analytics scenario: I want to calculate the total and average of several numbers. They might be MANY numbers, more than you can store in memory or on a single hard drive. The dataset is not static: new numbers are coming all the time. They come from different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time. And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data. Flow will not be constant (from a few events per second to thousands). And I don’t want just the overall total and average, but the total per month, per week, per day, per hour, per minute…
  • 14. A real business use case for streaming: I want to calculate the total and average of several numbers. They might be MANY numbers, more than you can store in memory or on a single hard drive. The dataset is not static: new numbers are coming all the time. They come from different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time. And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data. Flow will not be constant (from a few events per second to thousands). And I don’t want just the overall total and average, but the total per month, per week, per day, per hour, per minute… We need pretty dashboards with current status, comparison with the past, trends, and anomaly detection. To run this reliably, we need advanced monitoring, alerts, and autoscaling. No, I am not hiring a whole new operations team to manage the system.
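
The core of the problem built up in the slides above, a total and an average over numbers that never stop arriving, can be sketched as a running aggregate that keeps only two values in memory. This is a minimal illustration, not code from the talk: the `RunningStats` name and the sample readings are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-memory total and average over an unbounded stream."""
    count: int = 0
    total: float = 0.0

    def add(self, value: float) -> None:
        # Update the aggregate in place; the value itself is not stored.
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        # No values seen yet: report 0.0 rather than dividing by zero.
        return self.total / self.count if self.count else 0.0


stats = RunningStats()
for reading in [10.0, 20.0, 30.0]:  # stands in for a never-ending source
    stats.add(reading)

print(stats.total, stats.average)
```

Because the state is just a count and a sum, the same idea extends to one `RunningStats` per sensor or per time bucket, which is where the later slides on windowing come in.
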
  • 17. Probably less than you think: ~20 lines of Java code (plus a few hundred for imports, POJOs, and boilerplate, because Java), or a simple GROUP BY statement in SQL with streaming extensions (plus a few lines of boilerplate for schema definition).
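
To give a feel for what such a GROUP BY computes, here is a small Python sketch that groups events by sensor and by one-minute bucket and derives the total and average per group. It is an illustration on a finite list, not the Java or SQL the slide refers to; the event tuples and field layout are hypothetical.

```python
from collections import defaultdict

# (sensor_id, event_time_in_seconds, value): hypothetical sample events
events = [
    ("s1", 5, 10.0),
    ("s2", 20, 4.0),
    ("s1", 59, 30.0),
    ("s1", 61, 7.0),
]

# Similar in spirit to grouping by sensor and minute in SQL and selecting
# SUM(value) and AVG(value) for each group.
groups = defaultdict(lambda: [0, 0.0])  # key -> [count, total]
for sensor, ts, value in events:
    key = (sensor, ts // 60)  # one-minute bucket index
    groups[key][0] += 1
    groups[key][1] += value

# key -> (total, average)
result = {key: (total, total / count) for key, (count, total) in groups.items()}
print(result)
```
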
  • 18. Streaming analytics concepts
  • 19. Streaming data pipeline overview: Ingest, Transform, Analyze, React, Persist. What are the key requirements? Durable, stateful, continuous, fast, correct, reactive, reliable.
  • 20. Durability and reliability: Need to store intermediate data. You might want to be able to replay the stream. Self-healing architecture: if one component goes down while data is in flight, the system needs to rebalance and data needs to be reassigned seamlessly. Monitoring.
  • 21. Stateful processing: Working on per-element streams is relatively easy (e.g. changing the format of each item, or filtering out records based on their own properties). The real fun starts when you need to do transforms/aggregations over groups of elements: group by, count, max, average, joins, filtering based on properties from related records, or complex pattern detection. [Figure: processing-time axis, 8:00 to 14:00.] Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
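
The difference between per-element and stateful processing can be sketched with two small generators. This is an illustrative example under assumed names: the `(sensor_id, value)` record shape, `to_fahrenheit`, and `running_max` are not from the talk.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (sensor_id, value), a hypothetical record shape


def to_fahrenheit(events: Iterable[Event]) -> Iterator[Event]:
    """Stateless: each output depends only on the element itself."""
    for sensor, celsius in events:
        yield sensor, celsius * 9 / 5 + 32


def running_max(events: Iterable[Event]) -> Iterator[Event]:
    """Stateful: the operator must remember the maximum seen per sensor."""
    state: Dict[str, float] = {}
    for sensor, value in events:
        state[sensor] = max(value, state.get(sensor, value))
        yield sensor, state[sensor]


readings = [("a", 10.0), ("b", 3.0), ("a", 7.0), ("a", 12.0)]
converted = list(to_fahrenheit(readings))
maxima = list(running_max(readings))
print(maxima)
```

The stateless function can be restarted or scaled out freely, while the `state` dictionary in the stateful one is exactly what a streaming engine has to checkpoint and redistribute when workers fail or are added.
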
  • 22. Continuous and fast: Data can come in spikes, faster than we can process it, so we need reliable persistent storage for data in flight. You will need to think about how to update a system that never stops receiving data. Since data is never complete, stateful computations need to decide when to output results (windowing).
  • 23. Processing-time based windows. [Figure: processing-time axis, 8:00 to 14:00.] Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
  • 24. Event-time based windows. [Figure: input and output panels, event time against processing time, 10:00 to 15:00.] Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
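
A small sketch can show how the same records land in different windows depending on which clock is used. It assumes hypothetical records carrying both an event time and an arrival time, and one-hour tumbling windows; none of the names come from the talk.

```python
from collections import defaultdict

WINDOW = 3600  # one-hour tumbling windows, in seconds

# (event_time, arrival_time, value) in seconds since midnight. Hypothetical
# records: the second one was produced at 10:50 but only arrived at 12:05.
records = [
    (10 * 3600 + 600, 10 * 3600 + 605, 1),
    (10 * 3600 + 3000, 12 * 3600 + 300, 2),
    (11 * 3600 + 60, 11 * 3600 + 62, 3),
]


def window_sums(records, time_index):
    """Sum values per window, using the timestamp at `time_index`."""
    sums = defaultdict(int)
    for record in records:
        window = record[time_index] // WINDOW  # window index, here the hour
        sums[window] += record[2]
    return dict(sums)


by_processing_time = window_sums(records, time_index=1)
by_event_time = window_sums(records, time_index=0)
print(by_processing_time, by_event_time)
```

With processing time the delayed record is counted in the 12:00 window, where it arrived; with event time it is counted in the 10:00 window, where it actually happened.
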
  • 25. Session windows. [Figure: input and output panels, event time against processing time, 10:00 to 15:00.] Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
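
Session windows group events that are close together in time and start a new window after a period of inactivity. The sketch below is a batch simplification under assumed names (`session_windows`, a 120-second gap); a real streaming engine would build and merge sessions incrementally as events arrive.

```python
from typing import List


def session_windows(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group event times into sessions separated by more than `gap` seconds."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):  # sort by event time to absorb disorder
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity exceeded the gap
    return sessions


clicks = [0, 30, 400, 50, 420, 1000]  # hypothetical click times, out of order
sessions = session_windows(clicks, gap=120)
print(sessions)
```
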
  • 26. Correctness: late-arriving data. Event time vs processing time. Graphics from The Beam Model, by Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
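
One common way to handle late-arriving data is a watermark: an estimate of how far event time has progressed, after which windows are considered closed. The sketch below assumes a simple heuristic (largest event time seen minus a fixed allowed lateness) and drops events whose window has already closed; the constants and function name are illustrative, and real engines offer more options, such as routing late events to a separate output.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size in seconds
ALLOWED_LATENESS = 30  # assumed bound on how out of order events may be


def process(events):
    """Return (per-window sums, list of events dropped as too late)."""
    sums = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, value in events:  # events in arrival order
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        window_end = (event_time // WINDOW + 1) * WINDOW
        if window_end <= watermark:
            dropped.append((event_time, value))  # window already closed
        else:
            sums[event_time // WINDOW] += value
    return dict(sums), dropped


arrivals = [(10, 1), (70, 1), (55, 1), (130, 1), (20, 1)]
sums, dropped = process(arrivals)
print(sums, dropped)
```

The event at time 55 arrives after the one at 70 but is still accepted, because its window has not passed the watermark; the event at time 20 arrives after 130 and is dropped.
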
  • 27. Correctness: delivery semantics. Exactly once, at least once, at most once.
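
The practical effect of delivery semantics shows up when a message is redelivered. The following sketch, with a hypothetical `(message_id, amount)` message shape, compares a naive consumer under at-least-once delivery with one that deduplicates by message id so that redeliveries do not change the result.

```python
from typing import Iterable, Set, Tuple

Message = Tuple[str, int]  # (message_id, amount), hypothetical shape


def total_naive(messages: Iterable[Message]) -> int:
    """Naive consumer: redelivered messages are counted again."""
    return sum(amount for _, amount in messages)


def total_idempotent(messages: Iterable[Message]) -> int:
    """Deduplicate by message id so redeliveries do not change the result."""
    seen: Set[str] = set()
    total = 0
    for message_id, amount in messages:
        if message_id in seen:
            continue
        seen.add(message_id)
        total += amount
    return total


# "m2" is delivered twice, as can happen after a retry under at-least-once.
delivered = [("m1", 5), ("m2", 7), ("m2", 7), ("m3", 1)]
print(total_naive(delivered), total_idempotent(delivered))
```

Keeping every id in memory is only viable for a toy example; the point is that at-least-once delivery plus idempotent processing can yield results equivalent to exactly-once.
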
  • 28. Reactive: All the components need to be designed for low latency. [Figure: value of data to decision-making over time, as on slide 4.] Source: Perishable insights, Mike Gualtieri, Forrester.
  • 29. Components of a streaming analytics pipeline
  • 30. Streaming analytics components: Devices and/or applications that produce real-time data at high velocity. Data from tens of thousands of data sources can be written to a single stream. Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time. Records are read in the order they are produced, enabling real-time analytics or streaming ETL. Database (NoSQL most common), message broker, notification system, file storage, or data lake. Analytics dashboard.
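
The storage behaviour described above (records kept in arrival order for a limited time and replayable during that time) can be modelled with a toy append-only log. This sketch is not how any particular product is implemented: the `StreamLog` class is hypothetical, and retention is expressed as a record count rather than a duration to keep the example short.

```python
class StreamLog:
    """Toy append-only log: records keep arrival order and can be re-read."""

    def __init__(self, retention: int):
        self.retention = retention  # max records kept (stands in for time)
        self.base_offset = 0        # offset of the oldest retained record
        self.records = []

    def append(self, record) -> int:
        """Store a record and return its offset in the stream."""
        self.records.append(record)
        if len(self.records) > self.retention:
            self.records.pop(0)     # the oldest record expires
            self.base_offset += 1
        return self.base_offset + len(self.records) - 1

    def read_from(self, offset: int):
        """Return retained records from `offset` onwards, in order."""
        start = max(offset, self.base_offset) - self.base_offset
        return self.records[start:]


log = StreamLog(retention=3)
offsets = [log.append(item) for item in ["a", "b", "c", "d"]]
print(log.read_from(2), log.read_from(2))  # the same data can be read twice
```
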
  • 31. The (excellent) open source ecosystem
  • 32. Ingestion/in-stream storage: Apache Kafka. A distributed streaming platform. Concepts: producers, topics, brokers, consumers.
  • 33. Ingestion/in-stream storage: Apache Flume. Distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data.
  • 34. Stream processing: Apache Spark. Unified analytics engine for large-scale data processing. Concepts: driver/workers, data source, discretized stream, transforms, streaming SQL, outputs.
  • 35. Stream processing: Apache Spark. Unified analytics engine for large-scale data processing.
  • 36. Stream processing: Apache Flink. Stateful computation over data streams. Concepts: job manager/workers, source, DataStream, transforms/operators, Table API/SQL, sinks.
  • 37. Stream processing: Apache Flink. Stateful computation over data streams.
  • 38. Stream processing: Apache Flink. Stateful computation over data streams.
  • 39. Stream storage: Apache Cassandra. Manage massive amounts of data, fast, without losing sleep. https://cassandra.apache.org/ Concepts: nodes, token ring, consistency levels, column families.
  • 40. Stream storage: Apache Cassandra. Manage massive amounts of data, fast, without losing sleep.
  • 41. Stream storage: Apache HBase. The Hadoop database, a distributed, scalable, big data store. From https://hbase.apache.org/book.html: “First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.” Concepts: HBase Master, regions, region servers, data nodes, column families.
  • 42. Dashboard: Elasticsearch with Kibana. Elasticsearch is a distributed JSON-based search and analytics engine. Kibana gives shape to your data. https://www.elastic.co/kibana Wikimedia has a live interactive dashboard powered by Kibana at https://wikimedia.biterg.io/ Concepts: master node, data nodes, shard, index.
  • 43. Dashboard: Grafana. Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. https://grafana.com/grafana/ Wikimedia also has a live interactive metrics dashboard powered by Grafana at https://grafana.wikimedia.org/ Concepts: data source.
  • 44. Challenges of data streaming components: Difficult to set up. Tricky to scale. Hard to achieve high availability. Integration requires development. Error-prone and complex to manage. Expensive to maintain.
  • 45. AWS services for streaming analytics. Both managed services and native services.
  • 46. Streaming real-time data with AWS. * Some services scale up and down elastically, while others allow you to automate when to scale up/down. ** It is possible to have a serverless data streaming pipeline, in which you pay only for what you use. In the case of managed non-serverless services, you can dynamically adapt to your traffic.
  • 47. Services for ingestion/in-stream storage. Amazon Managed Streaming for Apache Kafka: fully managed version of Apache Kafka. Amazon Kinesis Data Streams: massively scalable, elastic, and durable real-time data streaming. Amazon Kinesis Data Firehose: the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. AWS Glue with serverless streaming: simple, flexible, and cost-effective ETL.
  • 48. Services for stream processing. Amazon Kinesis Data Analytics for Apache Flink: fully managed, elastic version of Apache Flink. Amazon Kinesis Data Analytics for SQL Applications: process and analyze streaming data using standard SQL. Amazon EMR: easily run and scale Apache Spark and other big data frameworks; you can also run Apache Flink and Apache HBase on EMR. AWS Glue with serverless streaming: simple, flexible, and cost-effective ETL; supports Spark for serverless ETL.
  • 49. Services for stream storage. Amazon Keyspaces for Apache Cassandra: scalable, highly available, and managed Apache Cassandra-compatible database service. Amazon DynamoDB: fast and flexible NoSQL database service for any scale (for example, in 2017 Samsung Cloud Service was serving 300M users with a total storage of 860TB). Amazon EMR: easily run and scale Apache HBase and other big data frameworks; you can also run Apache Flink and Apache Spark on EMR.
  • 50. Services for analytics dashboards. Amazon Elasticsearch Service: fully managed, scalable, and secure Elasticsearch service. Amazon QuickSight: fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization.
  • 51. A serverless data stream (per-element processing). [Diagram: a data producer continuously streams data into Kinesis Data Streams. The Lambda service continuously polls for new data (1 poll per second) and automatically invokes your function(s) when data is found. Other elements shown: Lambda function A, Lambda function B, Amazon SNS, DynamoDB.]
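
A per-element Lambda function for a Kinesis stream could look roughly like the sketch below. The event shape (a `Records` list whose items carry base64-encoded data under `kinesis.data`) follows the Lambda event format for Kinesis; everything else is an assumption made for illustration: the JSON payload, the `sensor` and `value` fields, and the threshold of 100. Writing to DynamoDB or publishing to SNS is left out to keep the example self-contained.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and process it on its own."""
    results = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(payload)  # assumes producers send JSON
        # Per-element logic: flag readings above a hypothetical threshold.
        results.append(
            {"sensor": reading["sensor"], "alert": reading["value"] > 100}
        )
    return results


# Local invocation with a hand-built event shaped like a Kinesis batch.
encoded = base64.b64encode(
    json.dumps({"sensor": "s1", "value": 120}).encode()
).decode()
sample = {"Records": [{"kinesis": {"data": encoded}}]}
print(handler(sample, None))
```
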
  • 52. Fully managed stateful streaming analytics
  • 53. Getting started. A great write-up on streaming analytics challenges: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying Streaming data: https://aws.amazon.com/streaming-data/ Getting started with Apache Kafka/Amazon MSK: https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html Amazon Kinesis services for streaming data: https://aws.amazon.com/kinesis/ Amazon Elasticsearch Service: https://aws.amazon.com/elasticsearch-service/ Research about models and issues in data stream systems: https://dl.acm.org/doi/10.1145/543613.543615
  • 54. Thanks. Javier Ramirez, AWS Developer Advocate, @supercoco9