Big Data Analytics_Unit 3
Example:
Web sites often like to report the number of unique
users over the past month. If we think of each login as
a stream element, we can maintain a window that is
all logins in the most recent month and associate the
arrival time with each login.
If we think of the window as a relation Logins(name,
time), then it is simple to get the number of unique
users over the past month.
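As a rough illustration of the Logins(name, time) idea, the sketch below keeps the window as a list of (name, time) pairs and counts the distinct names whose time falls inside the last month. The function name `unique_users`, the 30-day month, and the sample logins are assumptions made for this example only.

```python
from datetime import datetime, timedelta

def unique_users(logins, now, window=timedelta(days=30)):
    """Count distinct names among logins with now - window < time <= now."""
    cutoff = now - window
    return len({name for name, time in logins if cutoff < time <= now})

# Hypothetical window contents: (name, arrival time) pairs.
now = datetime(2024, 3, 31)
logins = [
    ("alice", datetime(2024, 3, 5)),
    ("bob", datetime(2024, 3, 15)),
    ("alice", datetime(2024, 3, 20)),  # repeat login, counted once
    ("carol", datetime(2024, 1, 5)),   # older than the window, ignored
]
print(unique_users(logins, now))
```

In a real stream system the expired logins would be evicted from the window as time advances rather than filtered on every query.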
• Sequence-Based
The most recent n elements from the data stream
Assumes a (possibly implicit) sequence number for
each element
• Timestamp-Based
All elements from the data stream in the last m units of
time (e.g. last 1 week)
Assumes a (possibly implicit) arrival timestamp for
each element
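The two window types can be sketched as follows. This is a minimal illustration, assuming elements arrive in non-decreasing timestamp order; the class names `SequenceWindow` and `TimestampWindow` are made up for this example.

```python
from collections import deque

class SequenceWindow:
    """Keeps the most recent n elements of the stream."""
    def __init__(self, n):
        self.items = deque(maxlen=n)  # deque discards the oldest item itself

    def add(self, element):
        self.items.append(element)

class TimestampWindow:
    """Keeps the elements that arrived within the last m time units."""
    def __init__(self, m):
        self.m = m
        self.items = deque()  # (timestamp, element) pairs, oldest first

    def add(self, timestamp, element):
        self.items.append((timestamp, element))
        self.expire(timestamp)

    def expire(self, now):
        # Drop elements whose timestamp is m or more time units old.
        while self.items and self.items[0][0] <= now - self.m:
            self.items.popleft()

seq = SequenceWindow(3)
for x in "abcde":
    seq.add(x)

tw = TimestampWindow(10)
tw.add(0, "a")
tw.add(5, "b")
tw.add(12, "c")
```

Note that the sequence-based window has a fixed memory bound of n elements, while the timestamp-based window can hold any number of elements, depending on the arrival rate.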
Sampling From a Data Stream
Inputs:
Sample size k
Window size n >> k (alternatively, time duration
m)
Stream of data elements that arrive online
Output:
k elements chosen uniformly at random from the
last n elements (alternatively, from all elements
that have arrived in the last m time units)
Goal:
maintain a data structure that can produce the
desired output at any time upon request
Simple Approach (sampling)
•Choose a random subset X = {x1, …, xk}, X ⊆ {0, 1, …, n-1}
•The sample always consists of the non-expired
elements whose indexes are equal to x1, …,xk (modulo
n)
•Only uses O(k) memory
•Technically produces a uniform random sample of
each window, but unsatisfying because the sample is
highly periodic and may be unsuitable for many real
applications.
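The simple approach can be sketched as below. This is an illustrative sketch, not a recommended sampler; the class name `PeriodicSampler` and the `seed` parameter are assumptions for the example.

```python
import random

class PeriodicSampler:
    """Keeps the elements whose index modulo n lies in a fixed random set X."""
    def __init__(self, n, k, seed=None):
        rng = random.Random(seed)
        self.n = n
        self.slots = set(rng.sample(range(n), k))  # X, a k-subset of {0,...,n-1}
        self.sample = {}  # slot -> (index, element); O(k) memory
        self.count = 0    # number of elements seen so far

    def add(self, element):
        slot = self.count % self.n
        if slot in self.slots:
            # Overwrites the element that expired exactly n steps earlier.
            self.sample[slot] = (self.count, element)
        self.count += 1

    def current_sample(self):
        oldest = self.count - self.n  # smallest index still in the window
        return [e for (i, e) in self.sample.values() if i >= oldest]

sampler = PeriodicSampler(n=10, k=3, seed=0)
for i in range(25):
    sampler.add(i)  # the element equals its own index, for easy checking
picked = sampler.current_sample()
```

The periodicity is visible here: the sampled positions repeat every n elements, so a stream with period n would always show the sampler the same phase.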
Bloom Filter
Bit array (all bits initially 0): 00000000000
Looking up y: bit 5 is 1, but bit 3 is 0.
We conclude that y has not been seen before.
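The lookup rule (an element is reported as "not seen before" as soon as any of its hashed bits is 0) can be sketched as follows. This is a minimal Bloom filter sketch; the class name, the method names, and the use of salted SHA-256 as the hash family are assumptions for illustration, not something the slides prescribe.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False means "definitely not seen"; True may be a false positive.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
empty_lookup = bf.might_contain("y")  # no bits set yet
bf.add("x")
seen_lookup = bf.might_contain("x")
```

A negative answer is always correct, because adding an element sets all of its bits; a positive answer can be wrong when other elements happen to have set the same bits.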
Counting Distinct Elements
Problem:
◦ Data stream consists of elements chosen from a
universe (set) of size N
◦ Maintain a count of the number of distinct
elements seen so far
Obvious approach:
Maintain the set of elements seen so far
◦ That is, keep a hash table of all the distinct
elements seen so far
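The obvious approach can be written in a few lines. The function name `count_distinct` is an assumption for this sketch.

```python
def count_distinct(stream):
    """Exact distinct count: remembers every distinct element seen so far."""
    seen = set()  # Python sets are hash tables
    for element in stream:
        seen.add(element)
    return len(seen)

print(count_distinct("aabbbaba"))
```

Memory grows with the number of distinct elements, which is why this approach stops working once the set no longer fits in main memory.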
Applications
How many different words are found
among the Web pages being crawled at
a site?
◦ Unusually low or high numbers could indicate
artificial pages (spam?)
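The kth moment of the stream is $\sum_{i \in A} (m_i)^k$,
where A is the set of items and $m_i$ is the number of
times item i occurs in the stream
Special Cases
◦ 0th moment: number of distinct elements
◦ 1st moment: length of the stream
◦ 2nd moment: $S = \sum_{i \in A} (m_i)^2$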
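For reference, the kth moment (the sum over all items i of the kth power of the count of i) can be computed exactly when the counts fit in memory. The sketch below does this; the function name `moment` is an assumption for the example.

```python
from collections import Counter

def moment(stream, k):
    """Exact kth moment: sum over items i of (count of i) ** k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

stream = "aabbbaba"  # a occurs 4 times, b occurs 4 times
m0 = moment(stream, 0)  # number of distinct items
m1 = moment(stream, 1)  # length of the stream
m2 = moment(stream, 2)  # 2nd moment
```

Streaming estimators exist precisely because this exact computation needs one counter per distinct item.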
We pick and keep track of many
variables X:
◦ For each variable X we store X.el and X.val
X.el corresponds to the item i
X.val corresponds to the count of item i
◦ Note this requires a count in main memory,
so number of Xs is limited
Our goal is to compute $S = \sum_i m_i^2$
One Random Variable (X)
How to set X.val and X.el?
◦ Assume stream has length n (we relax this later)
◦ Pick some random time t (t<n) to start,
so that any time is equally likely
◦ Suppose the stream has item i at time t. We set
X.el = i
◦ Then we maintain a count c (X.val = c) of the
number of i's in the stream starting from the chosen
time t
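Then the estimate of the 2nd moment ($\sum_i m_i^2$) is: $f(X) = n (2c - 1)$
Stream: a a b b b a b a
2nd moment is $S = \sum_i m_i^2$
$c_t$ … number of times the item at time t appears
from time t onwards ($c_1 = m_a$, $c_2 = m_a - 1$, $c_3 = m_b$)
$m_i$ … total count of item i in the stream
(we are assuming stream has length n)
$E[f(X)] = \frac{1}{n} \sum_{t=1}^{n} n (2c_t - 1) = \frac{1}{n} \sum_i n (1 + 3 + 5 + \dots + (2m_i - 1)) = \frac{1}{n} \sum_i n \, m_i^2 = \sum_i m_i^2 = S$
We have the 2nd moment (in expectation)!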
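The claim that n (2c - 1) equals the 2nd moment in expectation can be checked on the example stream by averaging the estimate over every possible start time t. This is a sketch under the stated assumption that the stream length n is known; the function name `second_moment_estimate` is made up for the example, and positions are 0-based here.

```python
def second_moment_estimate(stream, t):
    """n * (2c - 1) for a variable X started at position t (0-based)."""
    n = len(stream)
    el = stream[t]                              # X.el
    c = sum(1 for x in stream[t:] if x == el)   # X.val: count from t onwards
    return n * (2 * c - 1)

stream = list("aabbbaba")
estimates = [second_moment_estimate(stream, t) for t in range(len(stream))]
expected = sum(estimates) / len(estimates)  # uniform choice of t
exact = 4 ** 2 + 4 ** 2                     # m_a = 4, m_b = 4
```

A single estimate can be far from S (here the individual values range from 8 to 56), which is why many variables are combined in practice.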
Higher-Order Moments
For estimating kth moment we
essentially use the same algorithm but
change the estimate:
◦ For k=2 we used $n (2c - 1)$
◦ For k=3 we use: $n (3c^2 - 3c + 1)$ (where
c=X.val)
Why?
◦ For k=2: Remember we had $1 + 3 + 5 + \dots + (2m_i - 1)$
and we showed terms 2c-1 (for c=1,…,m) sum to $m^2$
So: $2c - 1 = c^2 - (c-1)^2$
◦ For k=3: $c^3 - (c-1)^3 = 3c^2 - 3c + 1$
Generally: Estimate $= n (c^k - (c-1)^k)$
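The general estimate n (c^k - (c-1)^k) can be checked the same way as the k=2 case: averaged over all start times, the terms telescope to the exact kth moment. The function names below are assumptions for this sketch, and the stream length n is assumed known.

```python
def kth_moment_estimate(stream, t, k):
    """n * (c**k - (c-1)**k), where c counts stream[t] from position t onwards."""
    n = len(stream)
    c = sum(1 for x in stream[t:] if x == stream[t])
    return n * (c ** k - (c - 1) ** k)

def expected_estimate(stream, k):
    """Average of the estimate over a uniformly random start position t."""
    n = len(stream)
    return sum(kth_moment_estimate(stream, t, k) for t in range(n)) / n

stream = list("aabbbaba")          # m_a = 4, m_b = 4
e2 = expected_estimate(stream, 2)  # exact 2nd moment is 4**2 + 4**2
e3 = expected_estimate(stream, 3)  # exact 3rd moment is 4**3 + 4**3
```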
Combining Samples
In practice:
◦ Compute $f(X) = n (2c - 1)$ for
as many variables X as you can fit in memory
◦ Average them in groups
◦ Take median of averages
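The combining step can be sketched as below. The function name `median_of_averages` and the `group_size` parameter are assumptions for the example; the input is taken to be a list of per-variable estimates.

```python
from statistics import mean, median

def median_of_averages(estimates, group_size):
    """Average the estimates in consecutive groups, then take the median."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    # The last group may be smaller if the list does not divide evenly.
    return median([mean(g) for g in groups])

# Hypothetical estimates; the last group contains two extreme values.
combined = median_of_averages([1, 3, 5, 7, 100, 200], group_size=2)
```

Averaging within a group reduces variance, and taking the median across groups keeps a few extreme group averages from dominating the result.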
… end.
The number of 1’s in the bucket. This …
… with a 1.
Every position with a 1 is in some bucket.
No position is in more than one bucket.
There are one or two buckets of any given …
Stream processing: