Big Data Unit III
Data Streams:
• In many data mining situations, we do not know the entire data set in
advance
• Stream Management is important when the input rate is controlled
externally:
- Google queries
- Twitter or Facebook status updates
• We can think of the data as infinite and non-stationary (the
distribution changes over time).
Sampling in a Stream
Sampling is the process of selecting a subset of elements from the data stream
such that:
• It fits in memory.
• It approximates the full stream for analysis (averages, counts, trends,
etc.).
Common Stream Sampling Techniques
1. Reservoir Sampling :
• Most popular method.
• Maintains a fixed-size random sample from an unknown-length stream.
Algorithm Steps (for reservoir size k):
1. Store the first k elements in the reservoir.
2. For the ith element (i > k):
o With probability k/i, replace a random element in the
reservoir with this new element.
3. Result: All elements in the stream have equal chance of being in the
reservoir.
Example:
• If reservoir size = 100, and 10,000 elements come in:
o The 10,000th element still has a 1% chance of being in the sample.
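The steps above can be sketched in Python (a minimal illustration; the `reservoir_sample` name is chosen here, not from the source):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)                 # step 1: keep the first k elements
        elif random.random() < k / i:              # step 2: with probability k/i ...
            reservoir[random.randrange(k)] = item  # ... replace a random slot
    return reservoir

sample = reservoir_sample(range(10_000), 100)
print(len(sample))  # 100
```

Every element of the stream ends up in the reservoir with equal probability k/n, regardless of when it arrived.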
Filtering Streams :
• Data streams contain huge volumes of information, much of which may
be irrelevant for a particular task.
• Filtering helps:
o Reduce data volume
o Focus processing power on relevant data
o Enable real-time decision making
Examples of Filtering in Streams :
Example Stream    Filter Condition        Filtered Result
Twitter feed      Language = "English"    Only English tweets
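A minimal Python sketch of stream filtering (the tweet records and the `stream_filter` helper are made up for illustration):

```python
def stream_filter(stream, condition):
    """Yield only the elements that satisfy the filter condition."""
    for item in stream:
        if condition(item):
            yield item

tweets = [
    {"lang": "English", "text": "hello"},
    {"lang": "French",  "text": "bonjour"},
    {"lang": "English", "text": "streams"},
]
english = list(stream_filter(tweets, lambda t: t["lang"] == "English"))
print(len(english))  # 2: only the English tweets pass
```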
Types of Filtering
1. Simple Predicate Filtering
• Based on a condition like:
o value > threshold
o type == "error"
o location == "India"
2. Pattern-Based Filtering
• Match specific patterns in data.
• E.g., log messages with keywords like "failed" or "unauthorized".
3. Time-Window Filtering
• Filter data within a specific time range.
• E.g., events in the last 5 minutes.
4. Bloom Filter (Advanced)
• A probabilistic data structure used to test whether an element may be in
a set.
• Very space-efficient.
• Useful in distributed stream processing.
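A toy Bloom filter in Python, assuming salted MD5 digests stand in for the k independent hash functions (class and parameter names are illustrative):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: m bits and k salted hash functions.
    May report false positives, but never false negatives."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = 0  # an int used as a bit array

    def _positions(self, item):
        for salt in range(self.k):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits & (1 << p) for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True (guaranteed for added items)
```

Membership tests on items never added may occasionally return True (a false positive), which is the price of the small, fixed memory footprint.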
Counting Distinct Elements in a Stream :
The Problem :
If we try to store every item we have seen (e.g., in a hash set), memory use grows with the number of distinct elements, which is infeasible for large streams.
So we need memory-efficient algorithms that approximate the count of distinct
elements.
Efficient Techniques for Counting Distinct Elements :
1. Flajolet-Martin Algorithm (FM Algorithm) :
A probabilistic algorithm that estimates the number of distinct elements using
bit patterns.
How It Works:
• Hash each element to a bit string (e.g., 110010...).
• Track the maximum number of trailing zeros in any hash.
• If the longest run of trailing zeros seen is r, the estimated number of
distinct elements is around 2^r.
Example:
If the max trailing zeros seen in any hash is 6, then estimated count ≈ 2⁶ = 64
unique items.
Pros:
• Very space-efficient
• Works well for very large streams
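The hashing-and-trailing-zeros idea can be sketched as follows (a single hash function, so the estimate is coarse; function names are illustrative):

```python
import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits in n (0 for n == 0)."""
    return (n & -n).bit_length() - 1 if n else 0

def fm_estimate(stream):
    """Flajolet-Martin: estimate the distinct count as 2^r, where r is the
    maximum number of trailing zeros over all hashed elements."""
    r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        r = max(r, trailing_zeros(h))
    return 2 ** r

print(fm_estimate(range(1000)))  # a rough power-of-two estimate of 1000
```

In practice, many independent hash functions are used and their estimates are combined (e.g., median of means) to reduce the variance.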
Summary Table :
Algorithm          Memory Usage   Accuracy   Best Use Case
Flajolet-Martin    Very low       Moderate   Quick estimation in high-speed streams
Estimating Moments :
Moment    Meaning
1st       Mean (average)
Problem:
We cannot store the full stream, so we need to estimate moments using
limited memory and one-pass algorithms.
So we estimate:
• E[x] using an exponential moving average (EMA)
• E[x²] using the same running-average method applied to the squared values
Then:
Estimated Variance = E[x²] − (E[x])²
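The one-pass running-average estimates can be sketched as follows (`alpha` is an assumed decay parameter, not from the source):

```python
def ema_moments(stream, alpha=0.1):
    """One-pass estimates of E[x] and E[x^2] via exponential moving averages;
    the variance then follows as E[x^2] - (E[x])^2."""
    mean = mean_sq = 0.0
    for x in stream:
        mean = (1 - alpha) * mean + alpha * x
        mean_sq = (1 - alpha) * mean_sq + alpha * x * x
    return mean, mean_sq - mean ** 2

m, v = ema_moments([5.0] * 200)
print(round(m, 3), round(v, 3))  # ~5.0 and ~0.0 for a constant stream
```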
Efficient Data Structures: AMS Algorithm (Alon–Matias–Szegedy)
For higher-order frequency moments:
F_k = Σ_i (f_i)^k
Where:
• f_i is the frequency of element i
• k is the moment number (k ≥ 2)
AMS uses:
• Random variables and hashing
• Very small space
• Accurate estimate of 2nd moment (self-join size)
Common in network traffic analysis, database query estimation, etc.
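A sketch of the AMS second-moment (F₂) estimator, using the standard random-position formulation rather than any specific implementation from the source:

```python
import random

def ams_second_moment(stream, trials=1000):
    """AMS estimator for F2 = sum_i f_i^2: pick a random position t holding
    element a; if r counts the occurrences of a from t to the end, then
    n*(2r - 1) is an unbiased estimate. Average many independent trials."""
    n = len(stream)
    total = 0
    for _ in range(trials):
        t = random.randrange(n)
        a = stream[t]
        r = sum(1 for x in stream[t:] if x == a)
        total += n * (2 * r - 1)
    return total / trials

stream = [1, 1, 2, 3, 3, 3]       # exact F2 = 2^2 + 1^2 + 3^2 = 14
print(ams_second_moment(stream))  # close to 14
```

The true AMS algorithm picks the random positions online and tracks the counters in one pass; this offline version shows the same estimator on a stored toy stream.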
Counting in Windows :
Often we only care about the recent portion of the stream. For example,
imagine a stream of user IDs arriving every second: we may want counts over
only the most recent window of arrivals, not the whole stream.
Types of Windows :
1. Sliding Window :
o Moves forward by one item or one time unit.
o Keeps updating the count with each new data point.
2. Tumbling Window :
o Non-overlapping intervals.
o Count is reset after each window.
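Both window types can be sketched as follows (helper names are illustrative):

```python
from collections import deque

def tumbling_counts(stream, size):
    """Non-overlapping windows: emit a count per window, then reset."""
    counts, current = [], 0
    for i, _ in enumerate(stream, start=1):
        current += 1
        if i % size == 0:
            counts.append(current)
            current = 0
    return counts

def sliding_sums(stream, size):
    """Overlapping windows: running sum over the most recent `size` items."""
    window, sums = deque(maxlen=size), []
    for x in stream:
        window.append(x)
        sums.append(sum(window))
    return sums

print(tumbling_counts(range(10), 5))  # [5, 5]
print(sliding_sums([1, 2, 3, 4], 3))  # [1, 3, 6, 9]
```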
DGIM Method :
• DGIM is a solution that does not assume uniformity of the 1's in the window
• We store only O(log² N) bits per stream
• The solution gives an approximate answer, never off by more than 50%
- The error factor can be reduced to any fraction ε > 0, with a more
complicated algorithm and proportionally more stored bits
DGIM: Timestamps :
• Each bit in the stream has a timestamp, starting 1, 2, ...
• Record timestamps modulo N (the window size), so any relevant timestamp
can be represented in O(log N) bits
Rules for forming the buckets :
• The right end of a bucket must be a 1 (if the right end is a 0, that 0 is
not part of the bucket).
• E.g., 1001011 → a bucket of size 4, having four 1's and ending with a 1 at
its right end.
• Every bucket must contain at least one 1, else no bucket can be formed.
• All bucket sizes must be powers of 2.
• Bucket sizes cannot decrease as we move to the left (sizes are
non-decreasing towards the older end).
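A simplified DGIM sketch under the rules above, assuming at most two buckets per size and keeping raw timestamps instead of storing them modulo N (names are illustrative):

```python
def dgim_update(buckets, timestamp, bit, N):
    """buckets: list of (timestamp, size) pairs, newest first.
    Allow at most two buckets of each size; when a third appears,
    merge the two oldest of that size, keeping the newer timestamp."""
    if bit == 1:
        buckets.insert(0, (timestamp, 1))
        size = 1
        while True:
            same = [i for i, b in enumerate(buckets) if b[1] == size]
            if len(same) <= 2:
                break
            i, j = same[-2], same[-1]            # the two oldest of this size
            buckets[i] = (buckets[i][0], size * 2)
            del buckets[j]
            size *= 2
    while buckets and buckets[-1][0] <= timestamp - N:
        buckets.pop()                            # expire buckets outside the window

def dgim_estimate(buckets):
    """Count all buckets fully, but only half of the oldest bucket."""
    if not buckets:
        return 0
    return sum(size for _, size in buckets) - buckets[-1][1] // 2

buckets, N = [], 16
for t in range(1, 13):               # twelve 1-bits arrive
    dgim_update(buckets, t, 1, N)
print(dgim_estimate(buckets))        # estimate of the 12 ones, within the 50% bound
```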
Decaying Windows :
• Useful in applications which need identification of the most common
elements
• The decaying window concept assigns more weight to recent elements
• The technique computes a smooth aggregation of all the 1's ever seen in
the stream, with decaying weights
• The further back an element appears in the stream, the less weight it is
given
• The effect of exponentially decaying weights is to spread the weights of
the stream elements over the whole history of the stream
Example:
Counting Items
- "Currently" most popular movies
- Trending topics on Twitter in the last 10 hours
- News flashing on a news channel
- Post sharing on a social network
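The decaying-weight update can be sketched as follows (this naive version rescales every score on each arrival; real systems rescale lazily, and the decay constant `c` is an assumed parameter):

```python
def decaying_scores(stream, c=0.01):
    """On each arrival, multiply every score by (1 - c), then add 1 to the
    arriving element's score. Recent elements therefore dominate."""
    scores = {}
    for item in stream:
        for key in scores:
            scores[key] *= (1 - c)
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores

scores = decaying_scores(["A"] * 50 + ["B"] * 50)
print(scores["B"] > scores["A"])  # True: B is more recent, so it scores higher
```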