
Mining Data Streams

Batch Processing
Batch processing involves collecting and processing data in large batches at scheduled intervals, suitable for
tasks like historical reporting or large-scale data analysis.
Data Streams
• A data stream is a continuous flow of data, typically arriving at high velocity,
generated in real time from various sources such as sensors, devices, applications,
or social media platforms.
Characteristics:
• Continuous flow
• High velocity
• Dynamic nature
• Ephemeral data
• Real-time processing
Stream Queries
• Stream queries are specialized queries designed to extract insights and perform
analysis on data that is continuously generated and processed in real-time.
Key Features:
• Real Time Processing
• Continuous Execution
• Limited Storage
Technical Considerations
• Data Velocity
• Windowing
• Resource Optimization
• Event Timing & Ordering
Stream Processing
• Stream processing analyzes data in real-time as it arrives, making
it ideal for applications requiring low latency and immediate
responses, such as fraud detection or real-time analytics.
Sample Query
Find the number of unique users over the past month using SQL.

SELECT COUNT(DISTINCT name) AS unique_users
FROM Logins
WHERE time >= NOW() - INTERVAL 30 DAY;
Terminologies in stream data
• Stream: an unbounded sequence of records
• Event: a single record in the stream
• Window: a finite slice of the stream, bounded by time or count
• State: context retained across events
• Event Time: when an event actually occurred at the source
• Processing Time: when the system processes the event
• Watermark: a marker of how far event time has progressed, used to handle late data
Data Stream Architecture

Data Source → Stream Ingestion Layer → Stream Processing Layer → Working Stage →
Archival Stage → Query Engine → Visualizations → Output
Stream Processing Tools

Feature/Tool | Kafka Streams | Apache Flink | Spark Streaming | Kinesis Data Analytics | Apache Beam
Processing Model | Stream processing | Stream & batch processing | Micro-batch processing | Stream processing with SQL | Unified stream & batch
Latency | Low latency | Low latency | Low latency (micro-batches) | Real-time SQL queries | Low latency (depends on engine)
Fault Tolerance | Yes (via Kafka distributed sys) | Yes (checkpointing) | Yes (checkpointing) | Yes (via AWS) | Yes (via execution engine)
Stateful Operations | Yes | Yes | Yes | Yes | Yes
Ease of Use | High (Java library) | Medium (complex setup) | Medium (Java/Scala API) | High (SQL) | Medium (API for multiple engines)
Deployment | Standalone (client library) | Cluster or Cloud | Cluster or Cloud | AWS managed | Any supported engine (Flink, Spark, etc.)
Data Source

Category | Tools/Platforms
1. Sensors and IoT Devices | MQTT (Message Queuing Telemetry Transport), CoAP (Constrained Application Protocol), Amazon IoT Core, Google Cloud IoT Core
2. Web Logs | Fluentd, Logstash, Kafka
3. Social Media | Twitter API, Facebook Graph API, Google Ads API

Role | Tool Name
Logger (Producer) | Filebeat, Logstash
Messenger (Kafka) | Apache Kafka
Helper (Consumer) | Apache Flink, Apache Spark, Custom Consumers
Data Ingestion Layer
Message Brokers:
• Function: Manage data flow between producers (IoT devices, applications) and consumers (data processing systems).
• Example Technologies: Apache Kafka, RabbitMQ, Amazon SQS
• Role: Queue and route data reliably for further processing, ensuring no data loss.

Streaming Services:
• Function: Handle real-time data processing, enabling immediate insights and actions.
• Example Technologies: Amazon Kinesis, Apache Flink, Apache Pulsar
• Role: Process data in real time for analytics, monitoring, or triggering actions.
Stream Processing Layer
• It performs transformations, aggregations, filtering, and other operations on
streaming data, enabling real-time analytics, decision-making, and actions.
Two types of processing:
• Stateless Processing: Processes each event independently (e.g., filtering,
transformation).
• Stateful Processing: Maintains a context or state for operations (e.g., counting
events over a time window).
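A minimal, framework-free Python sketch of the stateless/stateful distinction; the event shapes and the 10-second tumbling window are illustrative assumptions, not from the slide:

```python
from collections import defaultdict

events = [
    {"user": "a", "type": "click", "ts": 1},
    {"user": "b", "type": "view",  "ts": 3},
    {"user": "a", "type": "click", "ts": 12},
]

# Stateless: each event is handled entirely on its own (filtering/transformation).
clicks = [e for e in events if e["type"] == "click"]

# Stateful: counting events per 10-second window requires remembered context.
window_counts = defaultdict(int)
for e in events:
    window = e["ts"] // 10          # ts 0-9 -> window 0, ts 10-19 -> window 1
    window_counts[window] += 1

print(clicks)               # the two click events
print(dict(window_counts))  # {0: 2, 1: 1}
```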
Data Stream Architecture Tools

Category | Tools
Data Sources | MQTT, Fluentd, Twitter API, Debezium
Stream Ingestion | Apache Kafka, RabbitMQ, Amazon Kinesis, Google Cloud Pub/Sub
Stream Processing | Apache Flink, Apache Storm, Spark Streaming, AWS Lambda
Working Storage | Redis, Memcached, InfluxDB, Prometheus
Archival Storage | Amazon S3, HDFS, Snowflake
Query Engine | KSQL, Flink SQL, Presto, Apache Calcite
Visualization | Tableau, Power BI
Output Actions | PagerDuty, Apache NiFi, AWS EventBridge
Sampling Data Streams
Sampling data streams involves selecting a subset of data points from a continuous,
high-velocity data flow to make analysis manageable and efficient.

Why sampling matters:
 Handling large data volumes
 Real-time analysis
 Resource efficiency
 Preserving representativeness
 Dealing with concept drift
 Reducing noise
Sampling Techniques
 Reservoir Sampling
 Sliding Window Sampling
 Systematic Sampling
 Stratified Sampling
 Priority Sampling
 Time-Based Sampling
 Bernoulli Sampling
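Of these, reservoir sampling is the classic one-pass technique. A minimal Python sketch of Algorithm R; the function name and the k = 5 example are illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length, using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # item i+1 is kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 items from a stream of 1000 integers.
print(reservoir_sample(range(1000), 5))
```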
Filtering Streams
Filtering Streams involves selecting or removing specific data elements from a data
stream based on predefined criteria.
Filtering criteria:
 Value-based filtering
 Pattern matching
 Time-based filtering
 Threshold-based filtering
 Condition-based filtering
Filtering Mechanisms
 Pre Filter
 Inline Filter
 Post Filter
Filtering Techniques
 Simple Filters
 Bloom Filters
 Sliding Window Filters (time, count)
 Statistical Filters
 Attribute-Based Filtering
 Content-Based Filtering
 Hierarchical Filters
 Real-Time Adaptive Filters
 De-duplication Filters
 Noise Reduction Filters
Bloom Filter
• Consider a Bloom filter of size 5 with 2 hash functions:
• H1(x) = x mod 5
• H2(x) = (2x + 6) mod 5

Insert 10 and 7, then check whether 14 and 15 are present. (A worked sketch follows below.)


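A minimal Python sketch of this example; the 5-slot bit array is the layout implied by the two mod-5 hash functions:

```python
# 5-bit Bloom filter with h1(x) = x mod 5 and h2(x) = (2x + 6) mod 5.
bits = [0] * 5
h1 = lambda x: x % 5
h2 = lambda x: (2 * x + 6) % 5

def insert(x):
    bits[h1(x)] = 1
    bits[h2(x)] = 1

def maybe_contains(x):
    # True means "possibly present" (could be a false positive);
    # False means "definitely not present".
    return bits[h1(x)] == 1 and bits[h2(x)] == 1

insert(10)   # h1=0, h2=1 -> sets bits 0 and 1
insert(7)    # h1=2, h2=0 -> sets bit 2   => bits = [1, 1, 1, 0, 0]

print(maybe_contains(14))  # False: bit h1(14) = 4 is 0, so 14 is surely absent
print(maybe_contains(15))  # True: bits 0 and 1 are both set -> a false positive
```

Note that 15 was never inserted, yet the filter answers "possibly present": this is exactly the one-sided error a Bloom filter permits.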
Counting Distinct Elements
 Exact Counting
 HyperLogLog
 Flajolet-Martin
 Count-Min Sketch
 Bloom Filters
 Sliding Window
Estimate Distinct Elements in a Data Stream Using the HyperLogLog Algorithm
Stream: {23, 14, 8, 23, 23, 8, 19, 14, 19}
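The slide does not fix a hash function or register count, so the MD5-based hash and m = 16 registers below are assumptions. This toy sketch shows the HyperLogLog idea: the first b hash bits pick a register, each register keeps the longest leading-zero run (plus one) seen in the remaining bits, and a harmonic mean combines the registers:

```python
import hashlib
import math

def hll_estimate(stream, b=4):
    m = 2 ** b
    registers = [0] * m
    for x in stream:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) & 0xFFFFFFFF
        idx = h >> (32 - b)                       # first b bits -> register index
        rest = h & ((1 << (32 - b)) - 1)          # remaining 32 - b bits
        rank = (32 - b) - rest.bit_length() + 1   # leading zeros in the rest, + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.673 if m == 16 else 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)   # small-range (linear counting) correction
    return raw

# Estimate for the slide's stream; the true answer is 4 ({23, 14, 8, 19}).
print(hll_estimate([23, 14, 8, 23, 23, 8, 19, 14, 19]))
```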
Estimate Distinct Elements in a Data Stream Using the Flajolet-Martin Algorithm
1) Stream: {1, 4, 2, 1, 2, 4, 3}
Hash function: h(x) = (3x + 1) mod 5

2) Stream: {4, 2, 5, 9, 1, 6, 3, 7}
Hash function: h(x) = (x + 6) mod 32
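A worked Python sketch of exercise 1. Flajolet-Martin hashes each element, tracks R, the longest run of trailing zeros seen, and estimates the distinct count as 2^R; the convention that h(x) = 0 contributes r = 0 is an assumption (some texts treat it differently):

```python
def trailing_zeros(n):
    if n == 0:
        return 0            # convention assumed here for h(x) = 0
    r = 0
    while n % 2 == 0:
        n //= 2
        r += 1
    return r

stream = [1, 4, 2, 1, 2, 4, 3]
h = lambda x: (3 * x + 1) % 5

# h(1)=4 (r=2), h(4)=3 (r=0), h(2)=2 (r=1), h(3)=0 (r=0)  =>  R = 2
R = max(trailing_zeros(h(x)) for x in stream)
print(2 ** R)   # 2^2 = 4; the true number of distinct elements is also 4
```

The same code with h(x) = (x + 6) mod 32 on exercise 2's stream gives R = 3 (from h(2) = 8 = 1000₂), so the estimate is 2^3 = 8, which again matches the true distinct count.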
Moments
• Moments are statistical measures that summarize the distribution of elements in a
stream.
• Computing "moments" means analyzing the distribution of frequencies of the various
elements in a stream: how many different things there are, how many times each thing
appears, and how even or uneven the appearances are.
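In symbols, if element i appears mᵢ times, the k-th moment of the stream is Mₖ = Σᵢ (mᵢ)ᵏ. So M₀ counts the distinct elements (treating 0⁰ as 0), M₁ is the total length of the stream, and M₂ grows as the frequency distribution becomes more uneven.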
Calculating Moments
0th Moment (M₀): there are 3 distinct products in the stream.
1st Moment (M₁): a total of 6 products were sold.
2nd Moment (M₂): the distribution of sales is not equal, with Shoes being the most popular product.
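The slide's underlying stream is not shown; the sales stream below is an assumed example chosen to be consistent with the answers above (3 distinct products, 6 sales, Shoes most popular):

```python
from collections import Counter

sales = ["Shoes", "Shirt", "Shoes", "Hat", "Shoes", "Shirt"]  # assumed stream
freq = Counter(sales)                    # {'Shoes': 3, 'Shirt': 2, 'Hat': 1}

m0 = len(freq)                           # 0th moment: 3 distinct products
m1 = sum(freq.values())                  # 1st moment: 6 products sold
m2 = sum(c ** 2 for c in freq.values())  # 2nd moment: 3^2 + 2^2 + 1^2 = 14

print(m0, m1, m2)  # 3 6 14
```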
2nd Moment: the surprise number. The 2nd moment measures how uneven, or "surprising", the frequency distribution is; it is smallest when every element appears equally often.
DGIM Algorithm
• The Datar-Gionis-Indyk-Motwani (DGIM) algorithm is a clever and efficient way to
approximate the count of 1s in the last N bits of a binary data stream.
• A bucket is a compact, summarized representation of a segment of the stream and the
1s it contains.
• DGIM is designed to reduce memory usage while keeping track of the approximate
count of 1s in the most recent N bits of the stream.
Bucket Rules
• Every bucket must represent at least one 1.
• Every bucket ends at a position holding a 1 (a bucket is created when a 1 arrives).
• Bucket sizes are powers of 2.
• There are at most two buckets of any given size.
• The size of a bucket is the number of 1s in it.
• Bucket sizes never decrease as we look further back in the stream.
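A minimal Python sketch implementing these rules; the class shape and window handling are my own assumptions, while the query (all buckets fully, plus half of the oldest) is the standard DGIM estimate:

```python
class DGIM:
    def __init__(self, n):
        self.n = n              # window length N
        self.t = 0              # current timestamp
        self.buckets = []       # newest first: [(timestamp_of_rightmost_1, size)]

    def add(self, bit):
        self.t += 1
        # Drop buckets that have slid entirely out of the window.
        self.buckets = [(ts, sz) for ts, sz in self.buckets
                        if ts > self.t - self.n]
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            # Whenever three buckets share a size, merge the two oldest
            # of them into one bucket of double size.
            size = 1
            while sum(1 for _, sz in self.buckets if sz == size) == 3:
                idx = [i for i, (_, sz) in enumerate(self.buckets) if sz == size]
                i, j = idx[1], idx[2]            # the two oldest of that size
                ts = self.buckets[i][0]          # keep the more recent timestamp
                del self.buckets[j]
                self.buckets[i] = (ts, size * 2)
                size *= 2

    def count_ones(self):
        # Sum all bucket sizes, counting only half of the oldest bucket.
        if not self.buckets:
            return 0
        total = sum(sz for _, sz in self.buckets)
        return total - self.buckets[-1][1] // 2

# Estimate the 1s in the most recent 18 bits of the first exercise's stream.
d = DGIM(18)
for b in "101011110010001100101":
    d.add(int(b))
print(d.count_ones())
```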
Find the 1s
• Stream: 101011110010001100101
• Count the number of 1s in the most recent 18 bits.

• Stream: 10110110110110010
• Count the number of 1s.
Decaying Windows
• A decaying window is a technique used in streaming data analysis where the
contribution of older data decreases over time, typically according to a decay
factor.
Types:
• Exponential Decay
• Linear Decay
• Log Decay
Exponential Decay
• The weight of data decreases exponentially over time.

Example:
• Stream: 10, 20, 30, 40, 50
• α = 0.8
• 5 time slots
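A worked sketch of these numbers, assuming α = 0.8 is the per-step retention factor, so a value that is k slots old has weight 0.8ᵏ. The running update form shows why a decaying window needs only O(1) state per arrival:

```python
alpha = 0.8
stream = [10, 20, 30, 40, 50]    # oldest value arrives first

score = 0.0
for v in stream:
    score = alpha * score + v    # O(1) update; no buffer of old values needed
print(score)  # 50 + 40*0.8 + 30*0.8^2 + 20*0.8^3 + 10*0.8^4 = 115.536
```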
Linear Decay
• The weight of data decreases linearly over time.
Log Decay
• The weight of data decreases according to a logarithmic function.
Thank you
