Mining Data Streams
Suja A. Alex,
Assistant Professor,
Department of Information Technology,
St. Xavier’s Catholic College of Engineering,
Nagercoil
Agenda
Introduction to stream concepts
Stream Data Model
Sampling data in a stream
Filtering Stream
Stream counting
Decaying Window
RTAP
Real Time Data streaming tools and applications
Introduction to Stream Concepts
A data stream is a continuously arriving flow of data
Examples: Data produced in dynamic environments like
Power supply
network traffic
Stock exchange
Video surveillance
Characteristics of Data streams
Continuous flow of data
Examples
Network traffic
Sensor data
Call center records
Stream Data Model
A Data Stream Management System (DSMS) handles multiple
data streams.
A stream data query processing architecture includes three
parts:
end user,
query processor,
scratch space
Characteristics of DSMS
Transient streams (and persistent relations)
Continuous queries
Sequential access
Unpredictable data arrival and characteristics
Bounded main memory
Types of Queries
One time queries
Continuous queries
Traditional DBMS Vs DSMS
[Figure: in a DBMS, data moves between disk and main memory; in a DSMS, data stream(s) pass through main memory.]
Sampling Data Stream
Since we cannot store the entire stream, one obvious
approach is to store a sample.
Types of sampling
Reservoir Sampling
Concise Sampling
Types of sampling
Reservoir Sampling – A reservoir of size n is maintained
The first n points are added
The (t+1)th point is added with probability n/(t+1), replacing a randomly chosen point in the reservoir
Biased Reservoir Sampling – Bias function is used to
regulate sampling from stream
Concise Sampling – distinct values in stream are stored in
main memory
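The reservoir sampling steps above can be sketched as follows. This is a minimal illustration, not a reference implementation; the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced for this example.

```python
import random

def reservoir_sample(stream, n, rng=None):
    """Keep a uniform random sample of n items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream):  # t = number of items seen before this one
        if t < n:
            # The first n points fill the reservoir.
            reservoir.append(item)
        elif rng.random() < n / (t + 1):
            # The (t+1)th point is kept with probability n/(t+1)
            # and replaces a randomly chosen point.
            reservoir[rng.randrange(n)] = item
    return reservoir

sample = reservoir_sample(range(1000), 5)
```

Memory use stays at n items regardless of how long the stream runs.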
Problems on Data Streams
Types of queries one wants to answer on a stream:
◦ Filtering a data stream
Select elements with property x from the stream
◦ Counting distinct elements
Number of distinct elements in the last k elements
of the stream
◦ Estimating moments
Estimate avg./std. dev. of last k elements
◦ Finding frequent elements
Filtering Streams
Filtering decides whether an item from an infinite stream must be
stored
Example- web browser storing the blacklist of
dangerous URLs
Each element of the data stream is a tuple; given a list of keys S, determine which tuples in the stream have a key in S
Publish-subscribe systems
◦ You are collecting lots of messages (news articles)
◦ People express interest in certain sets of keywords
◦ Determine whether each message matches user’s interest
Bloom Filter
Consider a set S of keys with |S| = m and a bit array B with |B| = n
Use k independent hash functions h1, …, hk
Initialization:
◦ Set B to all 0s
◦ Hash each element s ∈ S using each hash function hi, set B[hi(s)] = 1 (for each i = 1, ..., k)
Run-time:
◦ When a stream element with key x arrives
If B[hi(x)] = 1 for all i = 1,..., k then declare that x is in S
That is, x hashes to a bucket set to 1 for every hash function
hi(x)
Otherwise discard the element x
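The initialization and run-time steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `BloomFilter` and its methods are hypothetical, and the k hash functions are simulated by salting SHA-256 with the index i, which is one convenient choice rather than anything the method prescribes.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits)  # B, all 0s (one byte per bit for clarity)

    def _positions(self, key):
        # Assumed stand-in for h1, ..., hk: SHA-256 salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key):
        # Initialization step: set B[hi(s)] = 1 for each i.
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # Run-time step: declare x in S only if every B[hi(x)] is 1.
        return all(self.bits[p] for p in self._positions(key))

blacklist = BloomFilter(n_bits=1024, k=3)
for url in ["bad.example", "worse.example"]:
    blacklist.add(url)
print(blacklist.might_contain("bad.example"))  # True
```

A key in S always passes the filter. A key outside S can occasionally pass as well (a false positive), because its k positions may all have been set by other keys.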
Counting Distinct Elements
Problem:
◦ Data stream consists of a universe of elements chosen from
a set of size N
◦ Maintain a count of the number of distinct elements seen so
far
Solution:
Maintain the set of elements seen so far
◦ That is, keep a hash table of all the distinct elements seen so
far
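The hash-table solution can be sketched in a few lines. The function name `count_distinct` is an assumption made for this example; Python's built-in `set` plays the role of the hash table.

```python
def count_distinct(stream):
    """Yield the number of distinct elements seen so far after each arrival."""
    seen = set()  # hash table of all distinct elements seen so far
    for element in stream:
        seen.add(element)
        yield len(seen)

print(list(count_distinct("abcab")))  # [1, 2, 3, 3, 3]
```

The count is exact, but memory grows with the number of distinct elements, which can approach N.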
Applications
How many different words are found among the
Web pages being crawled at a site?
◦ Unusually low or high numbers could indicate artificial pages
(spam?)
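The kth moment of a stream is $\sum_{i \in A} (m_i)^k$, where A is the set of distinct elements and m_i is the number of times element i occurs in the stream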
Special cases of estimating moments
0th moment = number of distinct elements
◦ The problem just considered
1st moment = count of the number of elements =
length of the stream
◦ Easy to compute
2nd moment = surprise number S =
a measure of how uneven the distribution is
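As a check on the three special cases, the moments can be computed exactly when the stream is small enough to count every element. The function name `kth_moment` and the 15-element example stream are assumptions chosen for illustration.

```python
from collections import Counter

def kth_moment(stream, k):
    """Exact kth moment: sum of (m_i)^k over the distinct elements i."""
    counts = Counter(stream)  # m_i for each distinct element i
    return sum(m ** k for m in counts.values())

stream = list("abcbdacdabdcaab")  # a occurs 5 times, b 4, c 3, d 3
print(kth_moment(stream, 0))  # 4  (distinct elements)
print(kth_moment(stream, 1))  # 15 (length of the stream)
print(kth_moment(stream, 2))  # 59 (surprise number: 25 + 16 + 9 + 9)
```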
Alon-Matias-Szegedy (AMS) Algorithm
AMS Method
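A minimal sketch of the AMS estimate for the 2nd moment, assuming the usual formulation: each variable X records the element at a chosen stream position and counts its occurrences from that position onward (X.value), and each variable contributes the estimate n(2·X.value − 1), where n is the stream length. The function name `ams_second_moment` and the fixed `positions` argument are assumptions for this example; in practice the positions would be chosen at random.

```python
def ams_second_moment(stream, positions):
    """Average of n * (2 * X.value - 1) over variables at the given 0-based positions."""
    positions = set(positions)
    variables = {}  # position -> [element, value]
    n = 0
    for t, item in enumerate(stream):
        n += 1
        # Every existing variable tracking this element counts one more occurrence.
        for var in variables.values():
            if var[0] == item:
                var[1] += 1
        # Start a new variable at each chosen position.
        if t in positions:
            variables[t] = [item, 1]
    estimates = [n * (2 * value - 1) for _, value in variables.values()]
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")
print(ams_second_moment(stream, [2, 7, 12]))  # 55.0, true 2nd moment is 59
```

Only one counter per variable is stored, so memory depends on the number of variables rather than on the number of distinct elements. Averaging over every position of this stream gives exactly 59.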
Sliding window concept
[Figure: a window of the N = 6 most recent elements slides over the stream q w e r t y u i o p a s d f g h j k l z x c v b n m as new elements arrive, moving from past to future.]
Counting ones – DGIM Algorithm
Datar-Gionis-Indyk-Motwani Algorithm
A bucket in the DGIM method is a record consisting
of:
1. The timestamp of its end [O(log N) bits]
2. The number of 1s between its beginning and end [O(log log N) bits]
Constraint on buckets:
Number of 1s must be a power of 2
◦ That explains the O(log log N) in (2)
Example: Bucketized Stream
Buckets from left (oldest) to right (newest): at least 1 of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1
1010110001011010101010101011010101010101110101010111010100010110
Properties we maintain:
- Either one or two buckets with the same power-of-2 number of 1s
- Buckets do not overlap in timestamps
- Buckets are sorted by size
Example: Updating Buckets
1010110001011010101010101011010101010101110101010111010100010110
0101100010110101010101010110101010101011101010101110101000101100
01011000101101010101010101101010101010111010101011101010001011001
1100010110101010101010110101010101011101010101110101000101100101
Query
To estimate the number of 1s in the
most recent N bits:
1. Sum the sizes of all buckets but the last, i.e. the oldest bucket, which may extend beyond the window
(note “size” means the number of 1s in the bucket)
2. Add half the size of the last bucket
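The bucket rules and the query above can be put together in a short sketch. This is an illustration under stated assumptions rather than a definitive implementation: the class name `DGIM` and its methods are hypothetical, buckets are kept as (end timestamp, size) pairs with the newest first, and whenever three buckets of the same size exist the two oldest are merged.

```python
class DGIM:
    """Approximate count of 1s in the last `window` bits of a bit stream."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []  # (end_timestamp, size) pairs, newest first

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its end timestamp leaves the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Three buckets of one size: merge the two oldest, then check the next size.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end, size = self.buckets[i + 1]
            self.buckets[i + 1] = (end, 2 * size)  # keeps the newer end timestamp
            del self.buckets[i + 2]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0.0
        sizes = [size for _, size in self.buckets]
        # All buckets but the oldest, plus half the size of the oldest.
        return sum(sizes[:-1]) + sizes[-1] / 2

dgim = DGIM(window=8)
for bit in [1] * 8:
    dgim.add(bit)
print(dgim.estimate())  # 6.0, while the true count is 8
```

After eight 1s the buckets have sizes 1, 1, 2 and 4, so the estimate is 1 + 1 + 2 + 4/2 = 6. The error comes only from the oldest bucket, since it is unknown how many of its 1s are still inside the window.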