Mining Data Streams
Batch Processing
Batch processing involves collecting and processing data in large batches at scheduled intervals, suitable for
tasks like historical reporting or large-scale data analysis.
Data Streams
• A data stream is a continuous flow of data, typically arriving at high velocity, that is
generated in real time from various sources, such as sensors, devices, applications,
or social media platforms.
Characteristics:
• Continuous flow
• High velocity
• Dynamic nature
• Ephemeral data
• Real-time processing
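These characteristics can be illustrated with a minimal Python generator (the event fields and values are illustrative assumptions): events arrive one at a time and must be consumed as they appear, with no complete dataset to load up front.

```python
import time
import random

def sensor_stream(n_events=5):
    """Simulate a continuous, real-time data stream: events are produced
    one at a time and are ephemeral -- consume them as they arrive."""
    for i in range(n_events):
        yield {"sensor_id": i % 2,
               "value": round(random.uniform(20.0, 25.0), 2),
               "ts": time.time()}

# Consume the stream incrementally -- there is no "whole dataset" to load.
for event in sensor_stream():
    print(event)
```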
Stream Queries
• Stream queries are specialized queries designed to extract insights and perform
analysis on data that is continuously generated and processed in real-time.
Key Features:
• Real-Time Processing
• Continuous Execution
• Limited Storage
Technical Considerations
• Data Velocity
• Windowing
• Resource Optimization
• Event Timing & Ordering
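Windowing is the key technique here: since the stream never ends, queries run over bounded slices of it. A minimal sketch of a tumbling (fixed, non-overlapping) window count, grouping events by event time:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Group events into fixed, non-overlapping (tumbling) windows by
    event time and count the events in each window."""
    counts = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

# Events as (event_time_seconds, value); a 10-second tumbling window.
events = [(1, "a"), (4, "b"), (9, "c"), (12, "d"), (25, "e")]
print(tumbling_window_counts(events, 10))  # {0: 3, 10: 1, 20: 1}
```

Sliding windows differ only in that consecutive windows overlap, so one event can contribute to several windows.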
Stream Processing
• Stream processing analyzes data in real-time as it arrives, making
it ideal for applications requiring low latency and immediate
responses, such as fraud detection or real-time analytics.
Sample Query
Find the number of unique users over the past month using SQL.
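This query can be sketched with SQLite from Python; the `events` table, its columns, and the sample rows are assumptions, and a fixed cutoff date stands in for "the past month" to keep the example deterministic (a real engine would compute the cutoff from the current date):

```python
import sqlite3

# Hypothetical events table (user_id, event_date); names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "2024-05-03"), ("u2", "2024-05-10"),
     ("u1", "2024-05-20"), ("u3", "2024-04-01")],  # u3 falls outside the month
)

# Count distinct users whose events fall within the past month.
(unique_users,) = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events"
    " WHERE event_date >= '2024-05-01'"
).fetchone()
print(unique_users)  # 2
```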
[Architecture diagram: Data Source → Stream Ingestion Layer → Stream Processing Layer]
Streaming Services:
• Function: Handle real-time data processing, enabling immediate insights and actions.
• Example Technologies: Amazon Kinesis, Apache Flink, Apache Pulsar
• Role: Process data in real time for analytics, monitoring, or triggering actions.
Stream Processing Layer
• It performs transformations, aggregations, filtering, and other operations on
streaming data, enabling real-time analytics, decision-making, and actions.
Two types of processing:
• Stateless Processing: Processes each event independently (e.g., filtering,
transformation).
• Stateful Processing: Maintains a context or state for operations (e.g., counting
events over a time window).
Data Stream Architecture Tools
Data Sources: MQTT, Fluentd, Twitter API, Debezium
Stream Ingestion: Apache Kafka, RabbitMQ, Amazon Kinesis, Google Cloud Pub/Sub
Stream Processing: Apache Flink, Apache Storm, Spark Streaming, AWS Lambda
Key Features:
• Handling large data volumes
• Real-time analysis
• Resource efficiency
• Preserving representativeness
• Dealing with concept drift
• Reducing noise
Sampling Techniques
• Reservoir Sampling
• Sliding Window Sampling
• Systematic Sampling
• Stratified Sampling
• Priority Sampling
• Time-Based Sampling
• Bernoulli Sampling
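Reservoir sampling (Algorithm R) is the classic example from this list: it keeps a uniform random sample of k items from a stream of unknown length using only O(k) memory. A minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: maintain a uniform random sample of k items from a
    stream of unknown length, using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1),
            # which keeps every item equally likely to be in the sample.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1000), 5))
```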
Filtering Streams
Filtering Streams involves selecting or removing specific data elements from a data
stream based on predefined criteria.
Filtering criteria:
• Value-based filtering
• Pattern matching
• Time-based filtering
• Threshold-based filtering
• Condition-based filtering
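Each criterion amounts to a predicate applied to events as they arrive. A short sketch (the event fields, patterns, and thresholds are illustrative assumptions):

```python
import re

# One predicate per filtering criterion.
def value_filter(e):     return e["type"] == "click"                       # value-based
def pattern_filter(e):   return re.search(r"error", e["msg"]) is not None  # pattern matching
def time_filter(e):      return 9 <= e["hour"] < 17                        # time-based
def threshold_filter(e): return e["latency_ms"] > 500                      # threshold-based

events = [
    {"type": "click", "msg": "ok",         "hour": 10, "latency_ms": 120},
    {"type": "view",  "msg": "disk error", "hour": 22, "latency_ms": 800},
]
# Condition-based filtering combines predicates into a compound condition.
flagged = [e for e in events if threshold_filter(e) or pattern_filter(e)]
print(len(flagged))  # 1
```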
Filtering Mechanisms
• Pre-Filter
• Inline Filter
• Post-Filter
Filtering Techniques
• Simple Filters
• Bloom Filters
• Sliding Window Filters (time-, count-based)
• Statistical Filters
• Attribute-Based Filtering
• Content-Based Filtering
• Hierarchical Filters
• Real-Time Adaptive Filters
• De-duplication Filters
• Noise Reduction Filters
Bloom Filter
• Consider a Bloom filter with a bit array of 5 slots and 2 hash functions:
• H1(x) = x mod 5
• H2(x) = (2x + 6) mod 5
• Stream: 10110110110110010
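This filter can be sketched in Python using the slide's bit-array size and hash functions. The inserted elements (1 and 3) are assumptions for illustration, since the slide's stream encoding is ambiguous; note how 8 shows a false positive, because its hash bits collide with those already set.

```python
class TinyBloom:
    """Bloom filter matching the slide: a 5-bit array with
    H1(x) = x mod 5 and H2(x) = (2x + 6) mod 5."""
    def __init__(self):
        self.bits = [0] * 5

    def _hashes(self, x):
        return x % 5, (2 * x + 6) % 5

    def add(self, x):
        for h in self._hashes(x):   # set both hashed bit positions
            self.bits[h] = 1

    def might_contain(self, x):
        # All hashed bits set -> "maybe present"; any unset -> definitely absent.
        return all(self.bits[h] for h in self._hashes(x))

bf = TinyBloom()
for x in (1, 3):                   # assumed sample elements
    bf.add(x)

print(bf.bits)               # [0, 1, 1, 1, 0]
print(bf.might_contain(1))   # True  (was inserted)
print(bf.might_contain(2))   # False (bit 0 unset -> definitely absent)
print(bf.might_contain(8))   # True  (false positive: hashes collide with 3)
```

This shows the Bloom filter's defining trade-off: no false negatives, but possible false positives.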
Linear Decay
• The weight of data decreases linearly over time.
Log Decay
• The weight of data decreases according to a logarithmic function.
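A minimal sketch of both decay schemes. The linear form drops from 1 to 0 over a fixed horizon; the logarithmic form shown is one common choice and is an assumption, since the slide does not fix a formula.

```python
import math

def linear_decay(age, horizon):
    """Weight falls linearly from 1 to 0 over `horizon` time units."""
    return max(0.0, 1.0 - age / horizon)

def log_decay(age):
    """An assumed logarithmic form: weight shrinks slowly as age grows,
    so old data keeps some influence for much longer than linear decay."""
    return 1.0 / (1.0 + math.log(1.0 + age))

for age in (0, 10, 50, 100):
    print(age, round(linear_decay(age, 100), 2), round(log_decay(age), 2))
```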
Thank you