
Mining Data Streams

Suja A. Alex,
Assistant Professor,
Department of Information Technology,
St. Xavier’s Catholic College of Engineering,
Nagercoil
Agenda
 Introduction to stream concepts
 Stream Data Model
 Sampling data in a stream
 Filtering Stream
 Stream counting
 Decaying Window
 RTAP
 Real Time Data streaming tools and applications
Introduction to Stream Concepts
 A data stream is a continuously arriving flow of data
 Examples: data produced in dynamic environments such as
 Power supply
 Network traffic
 Stock exchange
 Video surveillance
Characteristics of Data Streams
 Continuous flow of data
 Examples:
 Network traffic
 Sensor data
 Call center records
Stream Data Model
 A Data Stream Management System (DSMS) handles multiple data streams.
 A stream-data query processing architecture includes three parts:
 end user,
 query processor, and
 scratch space
Characteristics of DSMS
 Transient streams (and persistent relations)
 Continuous queries
 Sequential access
 Unpredictable data arrival and characteristics
 Bounded main memory
Types of Queries
 One time queries
 Continuous queries
Traditional DBMS vs DSMS

[Figure: side-by-side query-processing architectures]
 DBMS: an SQL query runs once over data held in main memory and on disk, producing a one-time result.
 DSMS: a continuous query (CQ) runs over incoming data stream(s) held in main memory, producing a continuously updated result.
Sampling Data Stream
 Since we cannot store the entire stream, one obvious approach is to store a sample.
 Types of sampling:
 Reservoir Sampling
 Biased Reservoir Sampling
 Concise Sampling
Types of sampling
 Reservoir Sampling – a reservoir of size n is maintained:
 The first n points are added directly
 The (t+1)-th point is added with probability n/(t+1), replacing a randomly chosen point in the reservoir
 Biased Reservoir Sampling – a bias function is used to regulate sampling from the stream
 Concise Sampling – distinct values in the stream are stored in main memory
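The reservoir rule above can be sketched in Python (a minimal sketch; the function name and use of `random.Random` are illustrative choices):

```python
import random

def reservoir_sample(stream, n, seed=None):
    """Maintain a uniform random sample of n elements from a stream of
    unknown length: keep the first n items, then let the (t+1)-th item
    replace a random slot with probability n/(t+1)."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream):   # t is 0-based, so this is item t+1
        if t < n:
            reservoir.append(item)
        else:
            j = rng.randrange(t + 1)    # uniform in [0, t]
            if j < n:                   # happens with probability n/(t+1)
                reservoir[j] = item
    return reservoir
```

If the stream is shorter than n, the whole stream is kept; otherwise every element ends up in the sample with equal probability n/t.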
Problems on Data Streams
 Types of queries one wants answered on a stream:
◦ Filtering a data stream
 Select elements with property x from the stream
◦ Counting distinct elements
 Number of distinct elements in the last k elements of the stream
◦ Estimating moments
 Estimate the avg./std. dev. of the last k elements
◦ Finding frequent elements
Filtering Streams
 Filtering decides whether an item from an infinite stream should be stored
 Example: a web browser storing a blacklist of dangerous URLs
 Each element of the data stream is a tuple; given a list of keys S, determine which tuples of the stream are in S
 Obvious solution: a hash table
 But suppose we do not have enough memory to store all of S in a hash table
Filtering - Applications
 Example: Email spam filtering
◦ We know 1 billion “good” email addresses
◦ If an email comes from one of these, it is NOT spam

 Publish-subscribe systems
◦ You are collecting lots of messages (news articles)
◦ People express interest in certain sets of keywords
◦ Determine whether each message matches user’s interest
Bloom Filter
 Consider: |S| = m, |B| = n
 Use k independent hash functions h1, …, hk
 Initialization:
◦ Set B to all 0s
◦ Hash each element s ∈ S using each hash function hi, and set B[hi(s)] = 1 (for each i = 1, …, k)
 Run-time:
◦ When a stream element with key x arrives:
 If B[hi(x)] = 1 for all i = 1, …, k, then declare that x is in S
 That is, x hashes to a bucket set to 1 for every hash function hi
 Otherwise discard the element x
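A minimal sketch of this scheme in Python; using salted SHA-256 digests as the k hash functions is an illustrative choice, not part of the original definition:

```python
import hashlib

class BloomFilter:
    """Bit array B of n bits with k hash functions: a lookup returning
    False is definitely not in S; True may be a false positive."""
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _positions(self, key):
        # derive k "independent" hash values by salting one strong hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))
```

Membership tests never produce false negatives; the false-positive rate grows as more of B fills with 1s.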
Counting Distinct Elements
 Problem:
◦ Data stream consists of a universe of elements chosen from
a set of size N
◦ Maintain a count of the number of distinct elements seen so
far
 Naive solution:
◦ Maintain the set of elements seen so far
◦ That is, keep a hash table of all the distinct elements seen so far
Applications
 How many different words are found among the Web pages being crawled at a site?
◦ Unusually low or high numbers could indicate artificial pages (spam?)
 How many different Web pages does each customer request in a week?
 How many distinct products have we sold in the last week?
Flajolet-Martin Approach
 Pick a hash function h that maps each of the N elements to at least log2 N bits
 For each stream element a, let r(a) be the number of trailing 0s in h(a)
◦ r(a) = position of the first 1, counting from the right
 E.g., if h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
 Record R = the maximum r(a) seen
◦ R = maxa r(a), over all the items a seen so far
 Estimated number of distinct elements = 2^R
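A sketch of the estimator in Python (the SHA-256-based hash is an illustrative choice; in practice many hash functions are used and their estimates combined to reduce variance):

```python
import hashlib

def trailing_zeros(x):
    """r(a): number of trailing 0s in the binary representation of x."""
    if x == 0:
        return 0                       # convention for the all-zero hash
    return (x & -x).bit_length() - 1   # isolate lowest set bit

def fm_estimate(stream, num_bits=32):
    """Flajolet-Martin: track R = max trailing zeros over all hashed
    elements; 2**R estimates the number of distinct elements."""
    R = 0
    for a in stream:
        h = int(hashlib.sha256(str(a).encode()).hexdigest(), 16) % (1 << num_bits)
        R = max(R, trailing_zeros(h))
    return 2 ** R
```

Note that duplicates cannot raise R: a repeated element always hashes to the same value, which is why the estimate tracks distinct elements only.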
Computing Moments
 Suppose a stream has elements chosen from a set A of N values
 Let mi be the number of times value i occurs in the stream
 The k-th moment is ∑_{i∈A} (mi)^k
Special cases of estimating moments
 0th moment = number of distinct elements
◦ The problem just considered
 1st moment = count of the number of elements = length of the stream
◦ Easy to compute
 2nd moment = surprise number S = a measure of how uneven the distribution is
 Estimated by the Alon-Matias-Szegedy (AMS) Algorithm
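These special cases can be checked directly on a small stream (an exact, non-streaming computation, shown only to make the definitions concrete):

```python
from collections import Counter

def moment(stream, k):
    """k-th frequency moment: sum of (m_i)^k over distinct values i."""
    return sum(m ** k for m in Counter(stream).values())

s = "abcab"          # counts: a=2, b=2, c=1
print(moment(s, 0))  # 3 -> number of distinct elements
print(moment(s, 1))  # 5 -> length of the stream
print(moment(s, 2))  # 9 -> surprise number (2^2 + 2^2 + 1^2)
```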
AMS Method

Sliding window concept, N = 6:

qwertyuiopasdfghjklzxcvbnm
(Past ← … → Future)

 A window covering the most recent N = 6 elements slides along the stream, shifting forward one position as each new element arrives.
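The slides leave the AMS estimator itself unstated; the following is a sketch of the standard second-moment estimator. For simplicity the random positions are chosen up front (a true one-pass version would pick them online via reservoir sampling), and all names are illustrative:

```python
import random

def ams_second_moment(stream, num_vars=50, seed=0):
    """AMS estimate of the 2nd moment (surprise number).
    Each variable picks a random position t, records the element there,
    and counts its occurrences from t onward; n*(2*count - 1) is an
    unbiased estimate, averaged over num_vars variables."""
    stream = list(stream)
    n = len(stream)
    rng = random.Random(seed)
    picks = [rng.randrange(n) for _ in range(num_vars)]
    elems = [None] * num_vars
    counts = [0] * num_vars
    for t, a in enumerate(stream):      # the single pass over the stream
        for i, p in enumerate(picks):
            if t == p:                  # variable i starts tracking here
                elems[i], counts[i] = a, 1
            elif t > p and a == elems[i]:
                counts[i] += 1
    estimates = [n * (2 * c - 1) for c in counts]
    return sum(estimates) / num_vars
```

For the stream "aaabbc" the true second moment is 3² + 2² + 1² = 14, and each individual estimate lies between 6 and 30, so averaging more variables tightens the result around 14.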
Counting ones – DGIM Algorithm
 Datar-Gionis-Indyk-Motwani Algorithm
 A bucket in the DGIM method is a record consisting
of:
1. The timestamp of its end [O(log N) bits]
2. The number of 1s between its beginning and end [O(log log
N) bits]

 Constraint on buckets:
Number of 1s must be a power of 2
◦ That explains the O(log log N) in (2)
Example: Bucketized Stream

1010110001011010101010101011010101010101110101010111010100010110

From oldest to newest: at least one bucket of size 16 (partially beyond the window), two of size 8, two of size 4, one of size 2, and two of size 1.

Properties we maintain:
- Either one or two buckets with the same power-of-2 number of 1s
- Buckets do not overlap in timestamps
- Buckets are sorted by size
Example: Updating Buckets
1010110001011010101010101011010101010101110101010111010100010110

0101100010110101010101010110101010101011101010101110101000101100

01011000101101010101010101101010101010111010101011101010001011001

1100010110101010101010110101010101011101010101110101000101100101

1100010110101010101010110101010101011101010101110101000101100101

1100010110101010101010110101010101011101010101110101000101100101
Query
 To estimate the number of 1s in the most recent N bits:
1. Sum the sizes of all buckets but the last (note: "size" means the number of 1s in the bucket)
2. Add half the size of the last bucket

 Remember: we do not know how many 1s of the last bucket are still within the wanted window
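Putting the bucket rules and the query together, a compact sketch (the class name and list-based bucket store are illustrative choices):

```python
class DGIM:
    """Approximate count of 1s in the last N bits of a 0/1 stream.
    Buckets are (end_timestamp, size) pairs stored newest-first; sizes
    are powers of 2, with at most two buckets of each size."""
    def __init__(self, N):
        self.N = N
        self.t = 0           # current timestamp
        self.buckets = []    # newest first

    def add(self, bit):
        self.t += 1
        # drop the oldest bucket once its end timestamp leaves the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # merge: whenever three buckets share a size, combine the two oldest
        size = 1
        while True:
            idxs = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idxs) < 3:
                break
            j = idxs[-2]                 # more recent of the two oldest
            ts = self.buckets[j][0]      # merged bucket keeps its end time
            self.buckets[j:j + 2] = [(ts, 2 * size)]
            size *= 2

    def count(self):
        """Estimate the number of 1s in the last N bits: sum all bucket
        sizes except the oldest, plus half the oldest bucket's size."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets[:-1]) + self.buckets[-1][1] // 2
```

For example, after feeding five 1s with N = 10 the buckets are [(5,1), (4,2), (2,2)], giving the estimate 1 + 2 + 2//2 = 4 against a true count of 5, within the guaranteed 50% error on the oldest bucket.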
Decaying Window
 Uses a decay factor instead of the sliding-window technique
 Applications:
 The problem of most common elements
 Example: movie tickets purchased all over the world
 Describing a decaying window: each element's weight is multiplied by (1 − c) at every time step, so recent elements count for more than old ones
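A sketch of such decaying counts for the most-common-elements problem; the decay constant c and the drop threshold of 1/2 follow the textbook convention, while the class and method names are illustrative:

```python
class DecayingWindow:
    """Exponentially decaying counts: on each arrival every score is
    multiplied by (1 - c), the arriving element's score is incremented
    by 1, and scores that fall below the threshold are dropped."""
    def __init__(self, c=1e-6, threshold=0.5):
        self.c, self.threshold = c, threshold
        self.scores = {}

    def add(self, item):
        for k in list(self.scores):
            self.scores[k] *= (1 - self.c)
            if self.scores[k] < self.threshold:
                del self.scores[k]     # keeps memory bounded
        self.scores[item] = self.scores.get(item, 0.0) + 1.0

    def most_common(self):
        return max(self.scores, key=self.scores.get)
```

Because small scores are discarded, at most about 2/c items are tracked at any time, regardless of how many distinct items the stream contains.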
RTAP
 A real-time analytics platform (RTAP) enables organizations to make the most of real-time data by helping them extract valuable information and trends from it.
 Real-time analytics is the process of delivering information about events as they occur.
 Some examples:
 Financial industry – fraud detection, trading
 E-commerce – recommendations
 Telecom industry – machine-to-machine communication
 Supply chain management
 Business activity monitoring
Types of RTAP
 On Demand Real Time Analytics
 Continuous Real Time Analytics
Key Capabilities
 Delivering In-Memory Transaction Speed
 Quickly moving unneeded data to disk for long-term
storage
 Distributing Data and Processing for speed
 Supporting continuous queries for real-time events
 Embedding Data into Apps or Apps into databases
 Additional requirements: fault tolerance, low latency
Technologies that support
 Processing in memory
 In-database analytics
 Data warehouse appliances
 In-memory analytics
 Massively Parallel Processing (MPP)
Applications
 Real-Time Sentiment Analysis (RTSA)
 Real-Time Stock Prediction (RTSP)
Reference
 Anand Rajaraman and Jeffrey David Ullman,
"Mining of Massive Datasets", Cambridge
University Press, 2012.
Thank You!
