Big Data Unit III

The document discusses the stream data model, which involves continuous data flow and the challenges of processing it effectively. It covers various techniques for stream processing, sampling, filtering, counting distinct elements, estimating moments, and counting in windows, emphasizing the importance of memory-efficient algorithms. Additionally, it highlights real-life applications and use cases for each technique in fields like web analytics, network monitoring, and sensor data analysis.


UNIT 3

1]. STREAM DATA MODEL

• In English, a stream is "a continuous flow."
• In computing, to stream means to transmit or receive data (especially video
and audio material) over the Internet as a steady, continuous flow.

Data Streams:
• In many data mining situations, we do not know the entire data set in
advance
• Stream Management is important when the input rate is controlled
externally:
- Google queries
- Twitter or Facebook status updates
• We can think of the data as infinite and non-stationary (the
distribution changes over time).

Streaming Data Examples:
• Sensors in transportation vehicles, industrial equipment, and farm machinery
• A financial institution
• A real-estate website
• A solar power company
• A media publisher
• An online gaming company
Stream Processing:

Parameter     Stream processing
Data scope    Queries or processing over data within a rolling time window, or on just the most recent data record.
Data size     Individual records or micro-batches consisting of a few records.
Performance   Requires latency on the order of seconds or milliseconds.
Analyses      Simple response functions, aggregates, and rolling metrics.

Challenges in Working with Streaming Data:

Streaming applications need two layers:
• A storage layer
• A processing layer
Both layers need a plan for scalability, data durability, and fault tolerance.

The Stream Model :


• Input elements enter at a rapid rate, at one or more input ports (i.e.,
streams)
- We call elements of the stream tuples
• The system cannot store the entire stream accessibly
• Q: How do you make critical calculations about the stream using a
limited amount of (secondary) memory?
General Stream Processing Model:
In the usual picture, one or more input streams feed a stream processor, which
answers standing or ad-hoc queries using a limited working store (memory or
disk), with an optional archival store that is too slow to query in real time.

Types of queries one wants to answer on a data stream:


• Filtering a data stream
- Select elements with property x from the stream
• Counting distinct elements
- Number of distinct elements in the last k elements of the stream
• Estimating moments
- Estimate avg./std. dev. of last k elements
• Finding frequent elements
Applications :
• Mining query streams
- Google wants to know what queries are more frequent today than
yesterday
• Mining click streams
- Yahoo wants to know which of its pages are getting an unusual
number of hits in the past hour
• Mining social network news feeds
- E.g., look for trending topics on Twitter, Facebook
• Telephone call records
- Data feeds into customer bills as well as settlements between
telephone companies
• IP packets monitored at a switch
- Gather information for optimal routing
- Detect denial-of-service attacks

2]. Sampling Data in the Stream


In data stream environments:
• Data arrives continuously and rapidly.
• We can’t store or process the entire stream due to memory and time
limits.
Solution: Select a representative sample from the stream using smart sampling
techniques.

Sampling in a Stream
Sampling is the process of selecting a subset of elements from the data stream
such that:
• It fits in memory.
• It approximates the full stream for analysis (averages, counts, trends,
etc.).
Common Stream Sampling Techniques
1. Reservoir Sampling :
• Most popular method.
• Maintains a fixed-size random sample from an unknown-length stream.
Algorithm Steps (for reservoir size k):
1. Store the first k elements in the reservoir.
2. For the ith element (i > k):
o With probability k/i, replace a random element in the
reservoir with this new element.
3. Result: All elements in the stream have equal chance of being in the
reservoir.
Example:
• If reservoir size = 100, and 10,000 elements come in:
o The 10,000th element still has a 1% chance of being in the sample.
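
A minimal Python sketch of the algorithm above (the function name and the
list-based reservoir are illustrative choices):

    import random

    def reservoir_sample(stream, k):
        # Maintain a uniform random sample of size k from a stream of unknown length.
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= k:
                reservoir.append(item)       # step 1: store the first k elements
            else:
                j = random.randint(1, i)     # step 2: with probability k/i ...
                if j <= k:
                    reservoir[j - 1] = item  # ... replace a uniformly chosen slot
        return reservoir

For example, reservoir_sample(range(10000), 100) keeps a uniform 100-element
sample; the 10,000th element enters with probability 100/10,000 = 1%.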

2. Biased Reservoir Sampling :


• Variation of reservoir sampling.
• Recent elements are more likely to be sampled.
• Useful for applications where newer data is more relevant (e.g., trends,
recent behaviors).

3. Sliding Window Sampling :


• Keep only the most recent N elements of the stream.
• Discards old data.
• Useful for real-time trend analysis (e.g., current user activity).
4. Stratified Sampling
• The stream is divided into subgroups (strata) based on some property.
• Sample each stratum independently.
• Ensures representativeness across groups (e.g., by user type, region).

Applications of Stream Sampling


Use Case                                        Sampling Technique
Monitoring average latency of network packets   Reservoir or Sliding Window
Detecting recent website usage patterns         Biased or Sliding Window
Estimating total active users from live logs    Reservoir Sampling
Analyzing sensor data in real time              Sliding Window or Stratified

3]. Filtering Data Streams


Filtering in data streams means selecting only those elements (or tuples) from
the stream that satisfy a specific condition or property.

Why Filtering Is Important:
• Data streams contain huge volumes of information, much of which may
be irrelevant for a particular task.
• Filtering helps:
o Reduce data volume
o Focus processing power on relevant data
o Enable real-time decision making
Examples of Filtering in Streams:

Example Stream      Filter Condition          Filtered Result
Twitter feed        Language = "English"      Only English tweets
Network logs        Source IP = "192.168.*"   Logs from specific IPs
Sensor data         Temperature > 100°C       Overheating events
E-commerce events   Event type = "purchase"   Only purchase events

How Filtering Works in Practice


In a data stream model, you define a filtering rule or predicate:

Example:
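
A minimal runnable sketch in Python (the record fields, the threshold, and the
print() call stand in for a real stream source and downstream consumer):

    # A hypothetical stream of sensor readings (in practice, an unbounded source).
    stream = [{"id": 1, "temperature": 72}, {"id": 2, "temperature": 104}]

    def keep(record):
        # The predicate: the filtering rule applied to every tuple.
        return record["temperature"] > 100

    for record in stream:
        if keep(record):
            print(record)    # forward only matching elements downstream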

Types of Filtering
1. Simple Predicate Filtering
• Based on a condition like:
o value > threshold
o type == “error”
o location == “India”
2. Pattern-Based Filtering
• Match specific patterns in data.
• E.g., log messages with keywords like "failed" or "unauthorized".
3. Time-Window Filtering
• Filter data within a specific time range.
• E.g., events in the last 5 minutes.
4. Bloom Filter (Advanced)
• A probabilistic data structure used to test whether an element may be in
a set.
• Very space-efficient.
• Useful in distributed stream processing.
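
A minimal sketch of a Bloom filter (the sizes m and k and the salted SHA-256
hashing are illustrative choices, not a standard library API):

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=3):
            self.m, self.k = m, k        # m bits, k hash functions
            self.bits = [False] * m

        def _positions(self, item):
            # Derive k bit positions from k salted hashes of the item.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def might_contain(self, item):
            # False: definitely never added. True: possibly added
            # (false positives are possible, false negatives are not).
            return all(self.bits[pos] for pos in self._positions(item))

An item that was never added returns False almost always; the space saving
comes from accepting a small false-positive rate.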

Real-Life Use Cases:

Use Case              Filtering Criteria
Fraud detection       Transactions above ₹10,00,000
News trend analysis   Tweets with #BreakingNews
Sensor monitoring     Vibration readings > threshold
System logs           Error or warning messages only

• Filtering is usually done as early as possible in the stream pipeline to reduce:
  o Memory usage
  o Processing time
  o Network traffic (in distributed systems)
4]. Counting Distinct Elements in a Stream

Why Count Distinct Elements


In large data streams, it’s often important to estimate the number of unique
items, for example:

Application          Use Case
Web analytics        Count of unique visitors to a site
Network monitoring   Count of unique IP addresses seen
Financial systems    Count of distinct users making trades
E-commerce           Number of unique products sold

The Problem :
If we try to store all seen items (e.g., using a hash set), it consumes too much
memory.
So we need memory-efficient algorithms to approximate the count of distinct
elements.
Efficient Techniques for Counting Distinct Elements :
1. Flajolet-Martin Algorithm (FM Algorithm) :
A probabilistic algorithm that estimates the number of distinct elements using
bit patterns.
How It Works:
• Hash each element to a bit string (e.g., 110010...).
• Track the maximum number of trailing zeros seen across all hash values.
• If that maximum is r, the estimated number of distinct elements is about 2^r.
Example:
If the maximum run of trailing zeros seen in any hash is 6, then the estimated
count ≈ 2⁶ = 64 unique items.
Pros:
• Very space-efficient
• Works well for very large streams
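
A minimal Python sketch of the idea (the single MD5-based hash is an
illustrative choice; real deployments average over many hash functions to cut
the variance):

    import hashlib

    def trailing_zeros(n, width=32):
        # Count trailing zero bits in n, capped at the hash width.
        if n == 0:
            return width
        return (n & -n).bit_length() - 1

    def fm_estimate(stream):
        # Flajolet-Martin: keep only the maximum trailing-zero run r; estimate 2^r.
        r = 0
        for item in stream:
            h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
            r = max(r, trailing_zeros(h))
        return 2 ** r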

2. LogLog & HyperLogLog:

Improved versions of the FM algorithm with better accuracy and lower variance.
• HyperLogLog is widely used in systems like Redis, Google BigQuery, and
Apache Druid.
• Hashes each element into one of many buckets and tracks the maximum run of
trailing zeros in each bucket.
• Combines the per-bucket values into a single estimate (HyperLogLog uses a
harmonic mean).
3. Linear Counting (for small cardinalities):
• Maintains a bit vector of size m.
• Each element hashes to a bit position, which is set to 1.
• After scanning, the fraction of bits that remain unset is used to estimate
how many distinct values were inserted.
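
A minimal sketch, assuming an m-bit vector and an MD5-based hash (both
illustrative):

    import hashlib
    import math

    def linear_count(stream, m=1024):
        # Hash each element into an m-bit vector, then estimate the
        # cardinality from the fraction of bits that remain unset.
        bits = [False] * m
        for item in stream:
            bits[int(hashlib.md5(str(item).encode()).hexdigest(), 16) % m] = True
        unset = bits.count(False)
        if unset == 0:
            return m    # vector saturated; the estimate is at least m
        return round(-m * math.log(unset / m))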

Summary Table:

Algorithm         Memory Usage   Accuracy                  Best Use Case
Flajolet-Martin   Very low       Moderate                  Quick estimation in high-speed streams
HyperLogLog       Low            High                      Real-world systems (e.g., Redis, BigQuery)
Linear Counting   Moderate       Good (for small counts)   Small-scale unique counts

Applications in Real Life :


• Counting unique visitors on a website
• Number of distinct hashtags on Twitter
• Estimating distinct IPs passing through a network router
• Counting unique transactions or users in financial systems.
5].Estimating Moments in a Data Stream

What are Moments


In statistics and data mining, moments describe the distribution of data.
The k-th moment gives insight into the shape of the data:

Moment   Meaning
1st      Mean (average)
2nd      Variance (spread)
3rd      Skewness (asymmetry)
4th      Kurtosis (peakedness)


In stream processing, we often want to estimate 1st and 2nd moments in real-
time.

Problem:
We cannot store the full stream, so we need to estimate moments using
limited memory and one-pass algorithms.

1. Estimating the 1st Moment (Mean)

Simple Moving Average
If we only care about the last k elements:

    mean_t = (x_{t-k+1} + x_{t-k+2} + ... + x_t) / k

This can be implemented with a sliding window over the last k values.

Exponential Moving Average (EMA)
More efficient for streams:

    EMA_t = α · x_t + (1 − α) · EMA_{t-1}

• x_t = current data point
• α = smoothing factor (e.g., 0.1 or 0.2)
• EMA_{t-1} = previous estimate
We don't store past data; we just keep one variable updated.
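
A minimal sketch of the update rule (alpha = 0.1 is an illustrative default):

    def ema_stream(stream, alpha=0.1):
        # One variable is updated per element; no past data is stored.
        ema = None
        for x in stream:
            ema = x if ema is None else alpha * x + (1 - alpha) * ema
            yield ema    # the current running estimate of the mean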

2. Estimating the 2nd Moment (Variance)

We use the identity:

    Var(x) = E[x²] − (E[x])²

So we estimate:
• E[x] using an EMA of the values
• E[x²] using a similar EMA of the squared values:

    S_t = α · x_t² + (1 − α) · S_{t-1}

Then:

    Estimated variance = S_t − (EMA_t)²
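
A sketch combining the two EMAs (initializing both from the first element is an
illustrative choice):

    def ema_variance(stream, alpha=0.1):
        # Track EMAs of x and x^2; variance = E[x^2] - (E[x])^2.
        mean = mean_sq = None
        for x in stream:
            if mean is None:
                mean, mean_sq = x, x * x
            else:
                mean = alpha * x + (1 - alpha) * mean
                mean_sq = alpha * x * x + (1 - alpha) * mean_sq
            yield mean_sq - mean * mean    # current variance estimate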
Efficient Data Structures: AMS Algorithm (Alon–Matias–Szegedy)
For higher-order moments of the form:

    F_k = Σ_i (f_i)^k

Where:
• f_i is the frequency of element i
• k is the moment number (k ≥ 2)
AMS uses:
• Random variables and hashing
• Very small space
• Accurate estimates of the 2nd moment (self-join size)
It is common in network traffic analysis, database query estimation, etc.
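
A sketch of the sampling variant of AMS for the 2nd moment (the input is taken
as a list here so random positions can be drawn up front; a true streaming
version grows its counters as elements arrive):

    import random

    def ams_second_moment(elements, trials=100):
        # Estimate F2 = sum of f_i^2 by averaging n*(2*c - 1) over random positions,
        # where c counts occurrences of the sampled element from that position onward.
        n = len(elements)
        estimates = []
        for _ in range(trials):
            pos = random.randrange(n)
            a = elements[pos]
            c = sum(1 for x in elements[pos:] if x == a)
            estimates.append(n * (2 * c - 1))
        return sum(estimates) / trials

Each single trial is an unbiased estimate of F2; averaging many trials reduces
the variance.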

Real-Life Use Cases :

Moment Estimation   Application Example
Mean (1st moment)   Average temperature from a sensor stream
Variance (2nd)      Stability/volatility in a stock stream
Higher moments      Pattern detection in user clickstreams
Self-join size      Network traffic similarity detection


6]. Counting Once in a Window
In stream data processing, "counting once in a window" means counting the
distinct elements that appear at least once in a sliding or fixed time window,
regardless of how many times they appear.

Example:
Imagine a stream of user IDs arriving every second, for instance:

    A, B, A, C, B, A, C, D, A, B

With a window size of 5 elements, the last 5 items are A, C, D, A, B, so the
count of distinct users = 4 (A, B, C, D).

Types of Windows :
1. Sliding Window :
o Moves forward by one item or one time unit.
o Keeps updating the count with each new data point.
2. Tumbling Window :
o Non-overlapping intervals.
o Count is reset after each window.

How to Efficiently Count Distinct in a Window :


1. Naive Approach (not ideal) :
• Store all items in the current window.
• Use a Set to get the count of unique elements.
Problem: Requires too much memory for large or fast streams.
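
A sketch of this naive exact approach for a count-based window of size n (a
Counter plus a deque stand in for the "Set" above):

    from collections import Counter, deque

    def distinct_in_window(stream, n):
        # Keep the last n items and a count per item; the number of keys
        # with a nonzero count is the exact distinct count in the window.
        window, counts = deque(), Counter()
        for item in stream:
            window.append(item)
            counts[item] += 1
            if len(window) > n:
                old = window.popleft()
                counts[old] -= 1
                if counts[old] == 0:
                    del counts[old]
            yield len(counts)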
2. DGIM Algorithm (Datar-Gionis-Indyk-Motwani):
Designed for counting the number of 1's in the last k bits of a binary stream.
• Can be adapted to count items once in a window using bitmaps and buckets.

3. Sliding HyperLogLog or Count-Min Sketch with Time Decay :


• Use probabilistic data structures like:
o HyperLogLog (for distinct count)
o Count-Min Sketch (for frequency count)
• Add timestamps or window-aware logic to forget older elements.

Use Cases:

Scenario                      What Is Counted Once
Web analytics                 Unique users in the last 10 minutes
IoT sensor monitoring         Unique sensor IDs reporting in a time window
Network intrusion detection   Unique IP addresses accessing the system
E-commerce                    Unique products viewed in the last hour

Example Notes:

Counting Bits (1):
• Problem:
  o Given a stream of 0s and 1s
  o Be prepared to answer queries of the form: How many 1s are in the last k bits? (where k ≤ N)
• Obvious solution:
  o Store the most recent N bits
  o When a new bit comes in, discard the (N+1)st bit
Counting Bits (2):
• You cannot get an exact answer without storing the entire window.
• Real problem: what if we cannot afford to store N bits?
  o E.g., we are processing 1 billion streams and N = 1 billion

    0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0
    (past → future)

• But we are happy with an approximate answer.

DGIM Method:
• A DGIM solution does not assume uniformity of the 1's in the window.
• We store O(log² N) bits per stream.
• The solution gives an approximate answer, never off by more than 50%.
  o The error factor can be reduced to any fraction greater than 0, with a more
    complicated algorithm and proportionally more stored bits.

DGIM: Timestamps:
• Each bit in the stream has a timestamp, starting 1, 2, ...
• Record timestamps modulo N (the window size), so any relevant timestamp can
  be represented in O(log N) bits.
Rules for forming the buckets:
• The right end of a bucket is always a position with a 1 (positions holding a
  0 at the right end are neglected).
• Every bucket must contain at least one 1, else no bucket can be formed.
• The number of 1's in each bucket must be a power of 2.
• There are one or two buckets of each size, never more.
• Bucket sizes cannot decrease as we move to the left (sizes run in
  non-decreasing order toward the left, i.e., back in time).
• E.g., 1001011 → a bucket of size 4, having four 1's and a 1 at its right end.
7]. Decaying Windows:
• Useful in applications that need to identify the most common ("currently
  popular") elements.
• The decaying-window concept assigns more weight to recent elements.
• The technique computes a smooth aggregation of all the 1's ever seen in the
  stream, with decaying weights: on each new arrival, the running sum is
  multiplied by (1 − c), for a small constant c, and the new element's
  contribution is added.
• The further back an element appeared in the stream, the less weight it is
  given.
• The effect of exponentially decaying weights is to spread the weight over the
  entire history of the stream, rather than cutting it off sharply as a sliding
  window does.
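
A minimal sketch for scoring one item with a decaying window (the constant
c = 10^-6 is an illustrative value):

    def decaying_score(stream, target, c=1e-6):
        # On each arrival, decay the old score by (1 - c) and add 1 if the
        # new element is the item being tracked; recent appearances count more.
        score = 0.0
        for item in stream:
            score = score * (1 - c) + (1 if item == target else 0)
        return score

Keeping one such score per candidate item gives a cheap way to rank the
"currently" most popular elements in the examples below.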

Example:
Counting Items
- "Currently" most popular movies
- Trends on Twitter in the last 10 hours
- News flashing on a news channel
- Post sharing on a social network
