Big Data Unit III
Data Streams:
• In many data mining situations, we do not know the entire data set in
advance
• Stream Management is important when the input rate is controlled
externally:
- Google queries
- Twitter or Facebook status updates
• We can think of the data as infinite and non-stationary (the
distribution changes over time).
Sampling in a Stream
Sampling is the process of selecting a subset of elements from the data stream
such that:
• It fits in memory.
• It approximates the full stream for analysis (averages, counts, trends,
etc.).
Common Stream Sampling Techniques
1. Reservoir Sampling :
• Most popular method.
• Maintains a fixed-size random sample from an unknown-length stream.
Algorithm Steps (for reservoir size k):
1. Store the first k elements in the reservoir.
2. For the ith element (i > k):
o With probability k/i, replace a random element in the
reservoir with this new element.
3. Result: All elements in the stream have equal chance of being in the
reservoir.
Example:
• If reservoir size = 100, and 10,000 elements come in:
o The 10,000th element still has a 1% chance of being in the sample.
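The steps above can be sketched in Python (a minimal illustration; the `reservoir_sample` name is chosen here, not from the source):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)                 # step 1: keep the first k elements
        elif random.random() < k / i:              # step 2: with probability k/i ...
            reservoir[random.randrange(k)] = item  # ... replace a random slot
    return reservoir

sample = reservoir_sample(range(10_000), 100)
print(len(sample))  # 100
```

Every element of the stream ends up in the reservoir with equal probability k/n, regardless of when it arrived.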
Filtering Streams :
• Data streams contain huge volumes of information, much of which may
be irrelevant for a particular task.
• Filtering helps:
o Reduce data volume
o Focus processing power on relevant data
o Enable real-time decision making
Examples of Filtering in Streams :
Example Stream    Filter Condition        Filtered Result
Twitter feed      Language = "English"    Only English tweets
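A minimal Python sketch of stream filtering (the tweet records and the `stream_filter` helper are made up for illustration):

```python
def stream_filter(stream, condition):
    """Yield only the elements that satisfy the filter condition."""
    for item in stream:
        if condition(item):
            yield item

tweets = [
    {"lang": "English", "text": "hello"},
    {"lang": "French",  "text": "bonjour"},
    {"lang": "English", "text": "streams"},
]
english = list(stream_filter(tweets, lambda t: t["lang"] == "English"))
print(len(english))  # 2: only the English tweets pass
```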
Types of Filtering
1. Simple Predicate Filtering
• Based on a condition like:
o value > threshold
o type == "error"
o location == "India"
2. Pattern-Based Filtering
• Match specific patterns in data.
• E.g., log messages with keywords like "failed" or "unauthorized".
3. Time-Window Filtering
• Filter data within a specific time range.
• E.g., events in the last 5 minutes.
4. Bloom Filter (Advanced)
• A probabilistic data structure used to test whether an element may be in
a set.
• Very space-efficient.
• Useful in distributed stream processing.
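A toy Bloom filter in Python, assuming salted MD5 digests stand in for the k independent hash functions (class and parameter names are illustrative):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: m bits and k salted hash functions.
    May report false positives, but never false negatives."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = 0  # an int used as a bit array

    def _positions(self, item):
        for salt in range(self.k):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits & (1 << p) for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True (guaranteed for added items)
```

Membership tests on items never added may occasionally return True (a false positive), which is the price of the small, fixed memory footprint.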
Counting Distinct Elements in a Stream :
The Problem :
If we try to store every item we have seen (e.g., in a hash set), memory use grows with the number of distinct elements, which is infeasible for large streams.
So we need memory-efficient algorithms that approximate the count of distinct
elements.
Efficient Techniques for Counting Distinct Elements :
1. Flajolet-Martin Algorithm (FM Algorithm) :
A probabilistic algorithm that estimates the number of distinct elements using
bit patterns.
How It Works:
• Hash each element to a bit string (e.g., 110010...).
• Track the maximum number of trailing zeros in any hash.
• If the longest run of trailing zeros seen is r, the estimated number of
distinct elements is around 2^r.
Example:
If the max trailing zeros seen in any hash is 6, then estimated count ≈ 2⁶ = 64
unique items.
Pros:
• Very space-efficient
• Works well for very large streams
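The hashing-and-trailing-zeros idea can be sketched as follows (a single hash function, so the estimate is coarse; function names are illustrative):

```python
import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits in n (0 for n == 0)."""
    return (n & -n).bit_length() - 1 if n else 0

def fm_estimate(stream):
    """Flajolet-Martin: estimate the distinct count as 2^r, where r is the
    maximum number of trailing zeros over all hashed elements."""
    r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        r = max(r, trailing_zeros(h))
    return 2 ** r

print(fm_estimate(range(1000)))  # a rough power-of-two estimate of 1000
```

In practice, many independent hash functions are used and their estimates are combined (e.g., median of means) to reduce the variance.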
Summary Table :
Algorithm          Memory Usage   Accuracy   Best Use Case
Flajolet-Martin    Very low       Moderate   Quick estimation in high-speed streams
Estimating Moments :
Moment    Meaning
1st       Mean (average)
Problem:
We cannot store the full stream, so we need to estimate moments using
limited memory and one-pass algorithms.
So we estimate:
• E[x] using an exponential moving average (EMA)
• E[x²] using the same running-average method applied to the squared values
Then:
Estimated Variance = E[x²] − (E[x])²
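The one-pass running-average estimates can be sketched as follows (`alpha` is an assumed decay parameter, not from the source):

```python
def ema_moments(stream, alpha=0.1):
    """One-pass estimates of E[x] and E[x^2] via exponential moving averages;
    the variance then follows as E[x^2] - (E[x])^2."""
    mean = mean_sq = 0.0
    for x in stream:
        mean = (1 - alpha) * mean + alpha * x
        mean_sq = (1 - alpha) * mean_sq + alpha * x * x
    return mean, mean_sq - mean ** 2

m, v = ema_moments([5.0] * 200)
print(round(m, 3), round(v, 3))  # ~5.0 and ~0.0 for a constant stream
```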
Efficient Data Structures: AMS Algorithm (Alon–Matias–Szegedy)
For higher-order frequency moments:
F_k = Σ_i (f_i)^k
Where:
• f_i is the frequency of element i
• k is the moment number (k ≥ 2)
AMS uses:
• Random variables and hashing
• Very small space
• Accurate estimate of 2nd moment (self-join size)
Common in network traffic analysis, database query estimation, etc.
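A sketch of the AMS second-moment (F₂) estimator, using the standard random-position formulation rather than any specific implementation from the source:

```python
import random

def ams_second_moment(stream, trials=1000):
    """AMS estimator for F2 = sum_i f_i^2: pick a random position t holding
    element a; if r counts the occurrences of a from t to the end, then
    n*(2r - 1) is an unbiased estimate. Average many independent trials."""
    n = len(stream)
    total = 0
    for _ in range(trials):
        t = random.randrange(n)
        a = stream[t]
        r = sum(1 for x in stream[t:] if x == a)
        total += n * (2 * r - 1)
    return total / trials

stream = [1, 1, 2, 3, 3, 3]       # exact F2 = 2^2 + 1^2 + 3^2 = 14
print(ams_second_moment(stream))  # close to 14
```

The true AMS algorithm picks the random positions online and tracks the counters in one pass; this offline version shows the same estimator on a stored toy stream.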
Counting in Windows :
Often we only care about the recent portion of the stream. For example,
imagine a stream of user IDs arriving every second: we may want counts over
only the most recent window of arrivals, not the whole stream.
Types of Windows :
1. Sliding Window :
o Moves forward by one item or one time unit.
o Keeps updating the count with each new data point.
2. Tumbling Window :
o Non-overlapping intervals.
o Count is reset after each window.
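Both window types can be sketched as follows (helper names are illustrative):

```python
from collections import deque

def tumbling_counts(stream, size):
    """Non-overlapping windows: emit a count per window, then reset."""
    counts, current = [], 0
    for i, _ in enumerate(stream, start=1):
        current += 1
        if i % size == 0:
            counts.append(current)
            current = 0
    return counts

def sliding_sums(stream, size):
    """Overlapping windows: running sum over the most recent `size` items."""
    window, sums = deque(maxlen=size), []
    for x in stream:
        window.append(x)
        sums.append(sum(window))
    return sums

print(tumbling_counts(range(10), 5))  # [5, 5]
print(sliding_sums([1, 2, 3, 4], 3))  # [1, 3, 6, 9]
```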
DGIM Method :
• DGIM is a solution that does not assume uniformity of the 1's in the window
• We store only O(log² N) bits per stream
• The solution gives an approximate answer, never off by more than 50%
- The error factor can be reduced to any fraction ε > 0, with a more
complicated algorithm and proportionally more stored bits
DGIM: Timestamps :
• Each bit in the stream has a timestamp, starting 1, 2, ...
• Record timestamps modulo N (the window size), so any relevant timestamp
can be represented in O(log N) bits
Rules for forming the buckets :
• The right end of a bucket must be a 1 (if the right end is a 0, that 0 is
not part of the bucket).
• E.g., 1001011 → a bucket of size 4, having four 1's and ending with a 1 at
its right end.
• Every bucket must contain at least one 1, else no bucket can be formed.
• All bucket sizes must be powers of 2.
• Bucket sizes cannot decrease as we move to the left (sizes are
non-decreasing towards the older end).
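A simplified DGIM sketch under the rules above, assuming at most two buckets per size and keeping raw timestamps instead of storing them modulo N (names are illustrative):

```python
def dgim_update(buckets, timestamp, bit, N):
    """buckets: list of (timestamp, size) pairs, newest first.
    Allow at most two buckets of each size; when a third appears,
    merge the two oldest of that size, keeping the newer timestamp."""
    if bit == 1:
        buckets.insert(0, (timestamp, 1))
        size = 1
        while True:
            same = [i for i, b in enumerate(buckets) if b[1] == size]
            if len(same) <= 2:
                break
            i, j = same[-2], same[-1]            # the two oldest of this size
            buckets[i] = (buckets[i][0], size * 2)
            del buckets[j]
            size *= 2
    while buckets and buckets[-1][0] <= timestamp - N:
        buckets.pop()                            # expire buckets outside the window

def dgim_estimate(buckets):
    """Count all buckets fully, but only half of the oldest bucket."""
    if not buckets:
        return 0
    return sum(size for _, size in buckets) - buckets[-1][1] // 2

buckets, N = [], 16
for t in range(1, 13):               # twelve 1-bits arrive
    dgim_update(buckets, t, 1, N)
print(dgim_estimate(buckets))        # estimate of the 12 ones, within the 50% bound
```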
Decaying Windows :
• Useful in applications which need identification of the most common
elements
• The decaying window concept assigns more weight to recent elements
• The technique computes a smooth aggregation of all the 1's ever seen in
the stream, with decaying weights
• The further back an element appears in the stream, the less weight it is
given
• The effect of exponentially decaying weights is to spread the weights of
the stream elements over the whole history of the stream
Example:
Counting Items
- "Currently" most popular movies
- Trending topics on Twitter in the last 10 hours
- News flashing on a news channel
- Post sharing on a social network
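The decaying-weight update can be sketched as follows (this naive version rescales every score on each arrival; real systems rescale lazily, and the decay constant `c` is an assumed parameter):

```python
def decaying_scores(stream, c=0.01):
    """On each arrival, multiply every score by (1 - c), then add 1 to the
    arriving element's score. Recent elements therefore dominate."""
    scores = {}
    for item in stream:
        for key in scores:
            scores[key] *= (1 - c)
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores

scores = decaying_scores(["A"] * 50 + ["B"] * 50)
print(scores["B"] > scores["A"])  # True: B is more recent, so it scores higher
```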