
Unit III

Mining Data Streams


Unit III
Mining Data Streams: Introduction to Streams Concepts, Stream Data Model and Architecture, Stream Computing, Sampling Data in a Stream, Filtering Streams, Counting Distinct Elements in a Stream, Estimating Moments, Counting Ones in a Window, Decaying Window, Real-Time Analytics Platform (RTAP) Applications, Case Studies, Real-Time Sentiment Analysis, Stock Market Predictions.
Mining Data Streams

•Most of the algorithms described in the literature assume that we are mining a database.

•That is, all our data is available when and if we want it.

•In the topics that follow, we make another assumption: data arrives in a stream or streams, and if it is not processed immediately or stored, it is lost forever.

•Moreover, we assume that the data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database) and then interact with it at the time of our choosing.
The Stream Data Model
In this topic:

•We begin by discussing the elements of streams and stream processing.

•We explain the difference between streams and databases, and the special problems that arise when dealing with streams.

•We then examine some typical applications in which the stream model applies.
Data-Stream-Management System
•In analogy to a database-management system, we can view a stream processor as a kind of data-management system (the high-level organization of which is suggested in the figure).

•Any number of streams can enter the system. Each stream can provide elements on its own schedule; they need not have the same data rates or data types, and the time between elements of one stream need not be uniform.

•The fact that the rate of arrival of stream elements is not under the control of the system distinguishes stream processing from the processing of data that goes on within a database-management system.

•The latter system controls the rate at which data is read from the disk, and therefore never has to worry about data getting lost as it attempts to execute queries.

•Streams may be archived in a large archival store, but it is generally not feasible to answer queries from the archival store.

•There is also a working store, into which summaries or parts of streams may be placed, and which can be used for answering queries.

•The working store might be disk, or it might be main memory, depending on how fast we need to process queries.

•But either way, it is of sufficiently limited capacity that it cannot store all the data from all the streams.
THE STREAM DATA MODEL
Examples of Stream Sources
1. Image Data:
• Satellites often send down to earth streams consisting of many terabytes of images per day.

• Surveillance cameras produce images with lower resolution than satellites, but there can be many of them, each producing a stream of images at intervals like one second.

2. Internet and Web Traffic:
• A switching node in the middle of the Internet receives streams of IP packets from many inputs and routes them to its outputs. Normally, the job of the switch is to transmit data and not to retain it or query it.

•But there is a tendency to put more capability into the switch, e.g., the ability to detect denial-of-service attacks or the ability to reroute packets based on information about congestion in the network.

•Web sites receive streams of various types. For example, Google receives several hundred million search queries per day.

•Yahoo! accepts billions of “clicks” per day on its various sites.

•Many interesting things can be learned from these streams. For example, an increase in queries like “sore throat” enables us to track the spread of viruses.
Stream Queries
•There are two ways that queries are asked about streams.
•From the figure we see a place within the processor where standing queries are stored.

•These queries are, in a sense, permanently executing, and produce outputs at appropriate times.

Example: (a question asked every time a new element arrives)

The stream produced by an ocean-surface-temperature sensor might have a standing query to output an alert whenever the temperature exceeds 25 degrees centigrade. This query is easily answered, since it depends only on the most recent stream element.

Alternatively, we might have a standing query that, each time a new reading arrives, produces the average of the 24 most recent readings. That query also can be answered easily if we store the 24 most recent stream elements: when a new stream element arrives, we can drop the 25th most recent element from the working store. A minimal sketch of both standing queries appears below.
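The following Python sketch shows how both standing queries could be maintained with a fixed-size buffer; the stream source, threshold constant, and function names are illustrative assumptions, not part of the original example.

from collections import deque

WINDOW = 24           # number of most recent readings to keep
ALERT_THRESHOLD = 25  # degrees centigrade, as in the example above

def process_readings(readings):
    window = deque(maxlen=WINDOW)  # the (WINDOW+1)-th most recent element is dropped automatically
    for temperature in readings:
        # Standing query 1: depends only on the most recent element.
        if temperature > ALERT_THRESHOLD:
            print(f"ALERT: temperature {temperature} exceeds {ALERT_THRESHOLD}")
        # Standing query 2: average of the 24 most recent readings.
        window.append(temperature)
        print(f"average of last {len(window)} readings: {sum(window) / len(window):.2f}")

# Example usage with a made-up stream of sensor readings:
process_readings([22.0, 23.5, 26.1, 24.0])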

The other form of query is ad-hoc: a question asked once about the current state of a stream or streams.
 If we do not store all streams in their entirety (normally we cannot), we cannot expect to answer arbitrary queries about streams.

 If we have some idea what kind of queries will be asked through the ad-hoc query interface, then we can prepare for them by storing appropriate parts or summaries of streams.

If we want the facility to ask a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store.

A sliding window can be the most recent n elements of a stream, or it can be all the elements that arrived within the last t time units.

If we regard each stream element as a tuple, we can treat the window as a relation and query it with any SQL query.

Example:
Web sites often like to report the number of unique users over the past month. If we think of each login as a stream element, we can maintain a window that is all logins in the most recent month, and associate the arrival time with each login.

If we think of the window as a relation Logins(name, time), then it is simple to get the number of unique users over the past month.

The SQL query is:

SELECT COUNT(DISTINCT(name))
FROM Logins
WHERE time >= t;

Note that we must be able to maintain the entire stream of logins for the past month in working storage.

However, even for the largest sites, that data is not more than a few terabytes, and so surely can be stored on disk.
Issues in Stream Processing
•Streams often deliver elements very rapidly. We must process elements in real time, or we lose the opportunity to process them at all without accessing the archival storage.

•It often is important that the stream-processing algorithm is executed in main memory, without access to secondary storage or with only rare accesses to secondary storage.

[Processing each stream vs. all streams; example: ocean sensors]

•Many problems about streaming data would be easy to solve if we had enough memory, but they become rather hard and require the invention of new techniques in order to execute them at a realistic rate on a machine of realistic size.
There are two general points about stream algorithms:

• Often, it is much more efficient to get an approximate answer to our problem than an exact solution.

• A variety of techniques related to hashing turn out to be useful.
DBMS vs. Stream Management

In a DBMS, the input is under the control of the programming staff; for example, data enters via SQL INSERT statements.

In stream management, the input rate is controlled externally.
Example: the Google search service, whose query stream arrives at a rate the system does not control.
Streaming Sample: Sampling From a Moving Window over Streaming Data
Why window?
• Timeliness matters
• Old/obsolete data is not useful
• Scalability matters
• Querying the entire history may be impractical
• Solution: restrict queries to a window of recent data
• As new data arrives, old data “expires”
• Addresses timeliness and scalability
Types of window

◦ Sequence-based
 The most recent n elements from the data stream
 Assumes a (possibly implicit) sequence number for each element

◦ Timestamp-based
 All elements from the data stream in the last m units of time (e.g. the last 1 week)
 Assumes a (possibly implicit) arrival timestamp for each element
Sampling From a Data Stream

Inputs:
 Sample size k
 Window size n >> k (alternatively, time duration m)
 Stream of data elements that arrive online

Output:
 k elements chosen uniformly at random from the last n elements (alternatively, from all elements that have arrived in the last m time units)

Goal:
 maintain a data structure that can produce the desired output at any time upon request
Simple Approach (sampling)
•Choose a random subset X = {x1, …, xk}, X ⊆ {0, 1, …, n−1}
•The sample always consists of the non-expired elements whose indexes are equal to x1, …, xk (modulo n)
•Only uses O(k) memory
•Technically produces a uniform random sample of each window, but is unsatisfying because the sample is highly periodic and may be unsuitable for many real applications. (A minimal sketch of this approach follows.)
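A minimal Python sketch of this simple approach, assuming elements are indexed 0, 1, 2, … as they arrive; the class and method names are illustrative.

import random

class PeriodicWindowSample:
    def __init__(self, k, n):
        self.n = n                                       # window size
        self.offsets = set(random.sample(range(n), k))   # X = {x1, ..., xk}, a subset of {0, ..., n-1}
        self.sample = {}                                 # offset -> most recent element with that index mod n

    def add(self, index, element):
        # index is the element's position in the stream (0, 1, 2, ...)
        r = index % self.n
        if r in self.offsets:
            self.sample[r] = element                     # the element it replaces has just expired

    def current_sample(self):
        return list(self.sample.values())

# Example usage over a toy stream of 40 integers with a window of size 10:
s = PeriodicWindowSample(k=3, n=10)
for i, x in enumerate(range(100, 140)):
    s.add(i, x)
print(s.current_sample())   # three elements, one per chosen offset, all from the current window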
Bloom Filter

Application: Consider a Web crawler

• List of URLs to crawl
• Parallel tasks fetch URLs
• For each URL: seen before, or not seen?
• Time and space (main and secondary memory) are at stake
• A Bloom filter can be used to reduce the space and time requirement
• It will occasionally give false positives, but definitely no false negatives

•A Bloom filter is implemented as an array of bits together with a number of hash functions.
•The argument of each hash function is a stream element, and it returns a position in the array.
•Initially all bits are 0.
Examples of Bloom Filter
 Akamai's web servers use Bloom filters to prevent "one-hit-
wonders" from being stored in its disk caches. One-hit-wonders are
web objects requested by users just once, something that Akamai
found applied to nearly three-quarters of their caching
infrastructure. Using a Bloom filter to detect the second request for
a web object and caching that object only on its second request
prevents one-hit wonders from entering the disk cache,
significantly reducing disk workload and increasing disk cache hit
rates.
 Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
 The Google Chrome web browser used to use a Bloom filter to
identify malicious URLs. Any URL was first checked against a local
Bloom filter, and only if the Bloom filter returned a positive result
was a full check of the URL performed (and the user warned, if that
too returned a positive result).
Examples of Bloom Filter
 The Squid Web Proxy Cache uses Bloom filters for
cache digests.
 Bitcoin uses Bloom filters to speed up wallet synchronization.
 The Venti archival storage system uses Bloom filters to detect
previously stored data.
 The SPIN model checker uses Bloom filters to track the
reachable state space for large verification problems.
 The Cascading analytics framework uses Bloom filters to
speed up asymmetric joins, where one of the joined data sets
is significantly larger than the other (often called Bloom join in
the database literature).
 The Exim mail transfer agent (MTA) uses Bloom filters in its
rate-limit feature.
 Medium uses Bloom filters to avoid recommending articles a
user has previously read.
When an input x arrives, we set to 1 the bits h(x) for each hash function h.

Example: we use an array of N = 11 bits.

Stream elements: integers

Use two hash functions:

h1(x) is computed as follows:
• take the odd-numbered bits (counting from the right) in the binary representation of x,
• treat them as an integer i,
• the result is i mod 11.

h2(x) is computed the same way, but using the even-numbered bits.
Bloom Filter

An example of a Bloom filter representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.
Worked example:

Initial filter contents: 00000000000

Stream element 25 = 11001:       h1 = 101 = 5, 5 mod 11 = 5       h2 = 10 = 2, 2 mod 11 = 2
Filter contents: 00100100000

Stream element 159 = 10011111:   h1 = 0111 = 7, 7 mod 11 = 7      h2 = 1011 = 11, 11 mod 11 = 0
Filter contents: 10100101000

Stream element 585 = 1001001001: h1 = 01001 = 9, 9 mod 11 = 9     h2 = 10010 = 18, 18 mod 11 = 7
Filter contents: 10100101010
Query the Bloom Filter
How to test membership:
• Suppose element y appears in the stream and we want to know whether we have seen y before.

• Compute h(y) for each hash function h.

• If all the resulting bit positions are 1, we conclude that y has been seen before.

• If at least one of these positions is 0, we conclude that y has not been seen before.
From the previous example we have filter contents 10100101010.

We want to look up 118 → 1110110:

h1(y): the odd-numbered bits give 1110 = 14, and 14 mod 11 = 3

h2(y): the even-numbered bits give 101 = 5, and 5 mod 11 = 5

Bit 5 is 1, but bit 3 is 0, so we conclude y has not been seen before. (A minimal sketch reproducing this example follows.)
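The Python sketch below reproduces this example: an 11-bit filter with h1 and h2 defined exactly as above (odd-/even-numbered bits of x read as an integer, modulo 11). The class and helper names are illustrative.

N_BITS = 11

def _select_bits(x, start):
    # Collect every second bit of x counting from the right, starting at
    # bit position `start` (1 = odd-numbered bits, 2 = even-numbered bits),
    # and read the selected bits as an integer.
    rev = bin(x)[2:][::-1]              # rev[0] is bit position 1
    picked = rev[start - 1::2]
    return int(picked[::-1], 2) if picked else 0

def h1(x):
    return _select_bits(x, 1) % N_BITS  # odd-numbered bits, mod 11

def h2(x):
    return _select_bits(x, 2) % N_BITS  # even-numbered bits, mod 11

class BloomFilter:
    def __init__(self):
        self.bits = [0] * N_BITS

    def add(self, x):
        for h in (h1, h2):
            self.bits[h(x)] = 1

    def might_contain(self, x):
        # True  -> x was possibly seen before (false positives are possible)
        # False -> x was definitely not seen before (no false negatives)
        return all(self.bits[h(x)] == 1 for h in (h1, h2))

# Reproduce the worked example: insert 25, 159 and 585, then look up 118.
bf = BloomFilter()
for element in (25, 159, 585):
    bf.add(element)
print("".join(map(str, bf.bits)))   # 10100101010
print(bf.might_contain(118))        # False: bit 3 is 0, so 118 was not seen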
Counting Distinct Elements
 Problem:
◦ A data stream consists of elements chosen from a universe (a set) of size N
◦ Maintain a count of the number of distinct elements seen so far

 Obvious approach:
◦ Maintain the set of elements seen so far, that is, keep a hash table of all the distinct elements seen so far
Applications
 How many different words are found among the Web pages being crawled at a site?
◦ Unusually low or high numbers could indicate artificial pages (spam?)

 How many different Web pages does each customer request in a week?

 How many distinct products have we sold in the last week?
Using Small Storage
 Real problem: What if we do not have space to maintain the set of elements seen so far?
 Estimate the count in an unbiased way
 Accept that the count may have a little error, but limit the probability that the error is large
Flajolet-Martin Approach
 Pick a hash function h that maps each of the N elements to at least log2 N bits

 For each stream element a, let r(a) be the number of trailing 0s in h(a)
◦ r(a) = position of the first 1, counting from the right
 E.g., say h(a) = 12; then 12 is 1100 in binary, so r(a) = 2

 Record R = the maximum r(a) seen
◦ R = maxa r(a), over all the items a seen so far

 Estimated number of distinct elements = 2^R
Why It Works: Intuition
 A very rough and heuristic intuition for why Flajolet-Martin works:
◦ h(a) hashes a with equal probability to any of N values
◦ Then h(a) is a sequence of log2 N bits, where a 2^(−r) fraction of all a's have a tail of r zeros
 About 50% of a's hash to ***0
 About 25% of a's hash to **00
 So, if we saw the longest tail of r = 2 (i.e., an item hash ending *100), then we have probably seen about 4 distinct items so far
◦ In other words, it takes about 2^r items hashed before we see one with a zero-suffix of length r
Why It Works: More Formally
 Now we show why Flajolet-Martin works
 Formally, we show that the probability of finding a tail of r zeros:
◦ goes to 1 if m >> 2^r
◦ goes to 0 if m << 2^r
where m is the number of distinct elements seen so far in the stream
 Thus, 2^R will almost always be around m!
Why It Doesn’t Work
 E[2^R] is actually infinite
◦ The probability halves when R → R+1, but the value doubles
 The workaround involves using many hash functions hi and getting many samples of Ri
 How are the samples Ri combined?
◦ Average? What if one value is very large?
◦ Median? All estimates are a power of 2
◦ Solution:
 Partition your samples into small groups
 Take the median of each group
 Then take the average of the medians
Algorithm
Step 1: Compute the hash function h(n) for each input n in the stream.
Step 2: Write down the binary equivalent of the hashed value.
Step 3: Count the number of trailing zeros in each.
Step 4: Identify the maximum number of trailing zeros, i.e. R.
Step 5: Report the number of distinct elements as 2^R.
(A minimal sketch of these steps follows.)
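A minimal Python sketch of these steps; it uses Python's built-in hash (truncated to 32 bits) as a stand-in for h, so the estimate varies from run to run. In practice many independent hash functions are used and their estimates combined as described above.

def trailing_zeros(x):
    # r(a): number of trailing 0s in the binary representation of the hash value.
    if x == 0:
        return 0          # convention for the (rare) all-zero hash value
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    R = 0                 # maximum number of trailing zeros seen so far
    for a in stream:
        R = max(R, trailing_zeros(hash(a) & 0xFFFFFFFF))
    return 2 ** R         # estimated number of distinct elements

# Example usage: 4 distinct elements in the stream.
print(fm_estimate(["cat", "dog", "cat", "bird", "dog", "fish"]))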
Computing Moments
Generalization: Moments
 Suppose a stream has elements chosen from a set A of N values

 Let mi be the number of times value i occurs in the stream

 The kth moment is Σi∈A (mi)^k
Special Cases

Σi∈A (mi)^k

 0th moment = number of distinct elements
◦ The problem just considered
 1st moment = count of the number of elements = length of the stream
◦ Easy to compute
 2nd moment = surprise number S = a measure of how uneven the distribution is
Example: Surprise Number
 Stream of length 100
 11 distinct values

 Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9
Surprise S = 10^2 + 10·9^2 = 910
 Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Surprise S = 90^2 + 10·1^2 = 8,110
AMS Method (Alon, Matias, Szegedy)
 The AMS method works for all moments
 It gives an unbiased estimate
 We will concentrate on the 2nd moment S
 We pick and keep track of many variables X:
◦ For each variable X we store X.el and X.val
 X.el corresponds to the item i
 X.val corresponds to the count of item i
◦ Note this requires a count in main memory, so the number of Xs is limited
 Our goal is to compute S = Σi (mi)^2
One Random Variable (X)
 How to set X.val and X.el?
◦ Assume the stream has length n (we relax this later)
◦ Pick some random time t (t < n) to start, so that any time is equally likely
◦ Let the stream have item i at time t. We set X.el = i
◦ Then we maintain a count c (X.val = c) of the number of i's in the stream starting from the chosen time t
 Then the estimate of the 2nd moment (Σi (mi)^2) is: f(X) = n (2·c − 1)

 Note, we will keep track of multiple Xs (X1, X2, …, Xk), and our final estimate will be the average of the f(Xj). (A minimal sketch follows.)
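A minimal Python sketch of the AMS estimate for a stream of known length n, keeping k variables with uniformly random start times; the function name and the toy stream are illustrative.

import random

def ams_second_moment(stream, k):
    n = len(stream)
    starts = sorted(random.sample(range(n), k))   # k random start times t, all distinct
    variables = []                                # each variable is [X.el, X.val]
    for t, item in enumerate(stream):
        if starts and starts[0] == t:
            variables.append([item, 0])           # open a new variable: X.el = item at time t
            starts.pop(0)
        for var in variables:
            if var[0] == item:
                var[1] += 1                       # X.val counts X.el from the start time onwards
    estimates = [n * (2 * c - 1) for _, c in variables]
    return sum(estimates) / len(estimates)        # average the k per-variable estimates

# Example usage: in the stream a a b b b a b a, ma = mb = 4,
# so the true second moment is 4^2 + 4^2 = 32.
print(ams_second_moment(list("aabbbaba"), k=3))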
Expectation Analysis
Stream: a a b b b a b a   (running count of a: 1, 2, 3, …, ma)

 The 2nd moment is S = Σi (mi)^2
 ct = the number of times the item seen at time t appears from time t onwards (e.g. c1 = ma, c2 = ma − 1, c3 = mb)
 mi = the total count of item i in the stream (we are assuming the stream has length n)

 E[f(X)] = (1/n) Σt n (2·ct − 1)
 Group the times t by the value seen: for item i, the time when the last i is seen gives ct = 1, the time when the penultimate i is seen gives ct = 2, …, and the time when the first i is seen gives ct = mi
 So E[f(X)] = (1/n) Σi n (1 + 3 + 5 + … + (2mi − 1))

◦ Little side calculation: 1 + 3 + 5 + … + (2mi − 1) = Σ j=1..mi (2j − 1) = (mi)^2

 Then E[f(X)] = (1/n) Σi n (mi)^2
 So E[f(X)] = Σi (mi)^2 = S
 We have the second moment (in expectation)!
Higher-Order Moments
 For estimating the kth moment we essentially use the same algorithm but change the estimate:
◦ For k = 2 we used n (2·c − 1)
◦ For k = 3 we use: n (3·c^2 − 3c + 1)   (where c = X.val)
 Why?
◦ For k = 2: remember we showed that the terms 2c − 1 (for c = 1, …, m) sum to m^2
 So: c^2 − (c − 1)^2 = 2c − 1
◦ For k = 3: c^3 − (c − 1)^3 = 3c^2 − 3c + 1
 Generally: estimate = n (c^k − (c − 1)^k)
Combining Samples
 In practice:
◦ Compute f(X) = n (2·c − 1) for as many variables X as you can fit in memory
◦ Average them in groups
◦ Take the median of the averages

 Problem: streams never end
◦ We assumed there was a number n, the number of positions in the stream
◦ But real streams go on forever, so n is a variable – the number of inputs seen so far
Streams Never End: Fixups
 (1) The variables X have n as a factor – keep n separately; just hold the count in X
 (2) Suppose we can only store k counts. We must throw some Xs out as time goes on:
◦ Objective: each starting time t is selected with probability k/n
◦ Solution: (fixed-size, i.e. reservoir, sampling!)
 Choose the first k times for the k variables
 When the nth element arrives (n > k), choose it as a new start time with probability k/n
 If you choose it, throw out one of the previously stored variables X, chosen with equal probability
(A minimal sketch follows.)
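A minimal Python sketch of this fixup: n is kept separately, at most k variables are stored, and a newly arrived element replaces a randomly chosen stored variable with probability k/n. The class name is illustrative.

import random

class StreamingSecondMoment:
    def __init__(self, k):
        self.k = k
        self.n = 0              # number of stream elements seen so far
        self.variables = []     # each variable is [X.el, X.val]

    def add(self, item):
        self.n += 1
        # every stored variable tracking this item sees one more occurrence
        for var in self.variables:
            if var[0] == item:
                var[1] += 1
        if len(self.variables) < self.k:
            self.variables.append([item, 1])        # the first k times start the k variables
        elif random.random() < self.k / self.n:
            victim = random.randrange(self.k)       # throw out one stored variable at random
            self.variables[victim] = [item, 1]      # the new variable starts at the current time

    def estimate(self):
        # average of n(2c - 1) over the stored variables
        return sum(self.n * (2 * c - 1) for _, c in self.variables) / len(self.variables)

# Example usage: repeating a a b b b a b a ten times gives ma = mb = 40,
# so the true second moment is 40^2 + 40^2 = 3200.
ams = StreamingSecondMoment(k=5)
for x in "aabbbaba" * 10:
    ams.add(x)
print(ams.estimate())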
Counting Itemsets
 New Problem: Given a stream of baskets, which items appear more than s times in the window?
 Possible solution: think of the stream of baskets as one binary stream per item
◦ 1 = item present; 0 = not present
◦ Use DGIM to estimate the count of 1s for each item

[Figure: a binary stream 1001110001010010001011011011100101011001101 over a window of length N, with its DGIM buckets marked.]
DGIM algorithm
Suppose we have a window of length N on a
binary stream. We want at all times to be
able to answer queries of the form “how
many 1’s are there in the last k bits?” for
any k≤ N.

For this purpose we use the DGIM algorithm.


DGIM algorithm
 The basic version of the algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the number of 1’s in the window with an error of no more than 50%.
 To begin, each bit of the stream has a timestamp, the
position in which it arrives. The first bit has timestamp
1, the second has timestamp 2, and so on.
 Since we only need to distinguish positions within the
window of length N, we shall represent timestamps
modulo N, so they can be represented by log2 N bits.
If we also store the total number of bits ever seen in
the stream (i.e., the most recent timestamp) modulo
N, then we can determine from a timestamp modulo N
where in the current window the bit with that
timestamp is.
DGIM algorithm
We divide the window into buckets, each consisting of:
 The timestamp of its right (most recent) end.
 The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the size of the bucket.
DGIM algorithm
To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its right end. To represent the number of 1’s we only need log2 log2 N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at most log2 N, it requires log2 log2 N bits. Thus, O(log N) bits suffice to represent a bucket.
DGIM algorithm
There are six rules that must be followed when representing a stream by buckets.
 The right end of a bucket is always a position with a 1.
 Every position with a 1 is in some bucket.
 No position is in more than one bucket.
 There are one or two buckets of any given size, up to some maximum size.
 All sizes must be a power of 2.
 Buckets cannot decrease in size as we move to the left (back in time).
(A minimal sketch of bucket maintenance and querying follows.)
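A minimal Python sketch of DGIM bucket maintenance and querying under the rules above. For clarity it stores raw timestamps rather than timestamps modulo N, and the class and method names are illustrative.

class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0            # timestamp of the most recent bit
        self.buckets = []     # (right-end timestamp, size), most recent bucket first

    def add(self, bit):
        self.t += 1
        # drop the oldest bucket once its right end falls out of the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))   # a 1 always starts a new bucket of size 1
        # restore the invariant: at most two buckets of any given size
        size = 1
        while [b[1] for b in self.buckets].count(size) > 2:
            idxs = [j for j, b in enumerate(self.buckets) if b[1] == size]
            j1, j2 = idxs[-2], idxs[-1]       # the two oldest buckets of this size
            # merge them, keeping the more recent right-end timestamp
            self.buckets[j1] = (self.buckets[j1][0], size * 2)
            del self.buckets[j2]
            size *= 2

    def count_ones(self, k):
        # Estimate the number of 1s among the last k bits (k <= N).
        total, oldest_size = 0, 0
        for ts, size in self.buckets:
            if ts <= self.t - k:
                break                          # this bucket's right end is outside the last k bits
            total += size
            oldest_size = size
        return total - oldest_size // 2        # count only half of the oldest included bucket

# Example usage:
dgim = DGIM(window_size=20)
for b in [1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1]:
    dgim.add(b)
print(dgim.count_ones(10))   # approximate number of 1s in the last 10 bits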
Case Study: Real-Time Sentiment Analysis in Social Media Streams

•Stream data is generated every minute related to an event.
•The stream can be processed to analyze human behaviour over the network during the event's life span.
•The focus is real-time, machine-learning-based sentiment analysis of streaming Twitter data.
•Manual processing of the stream is not feasible; a system is required to process streams such that:
o The system must be reliable, to avoid information loss
o It should deal with high throughput rates
o The sentiment classifier should work within limited time
•The training phase is important, as it should handle an unbalanced distribution of data.

• Corpus: a manual collection with proper annotation
Stream processing:

•The figure illustrates the difference between the computations performed on static data and those performed on streaming data.
•In static data computation, questions are asked of static data.
•In streaming data computation, data is continuously evaluated by static questions.
Platforms such as IBM InfoSphere Streams, Storm, and Apache S4 support real-time processing of streaming data, enable the results of continuous queries to be updated over time, and can detect insights within data streams that are still in motion.
Case Study 2: Stock Market Prediction
•Predict the stock market using information on the web.
•The Financial Times, Dow Jones, etc. are a few sources of real-time news.
•Prediction techniques:
• Classification-based approaches
• Nearest Neighbour
• Neural Network
• Bayesian Classification
• Decision Tree
• Sequential behaviour
• Hidden Markov Model
• Temporal Belief Network, etc.