Bigdata-Mining Data Streams
Introduction
Traditional DBMSs store data that are finite and persistent: the data are available whenever we want them. In big data analytics and data mining, data are often assumed to arrive in streams; if a stream is not processed immediately, it is lost.
Data streams are continuous flows of data. Sensor data, network traffic, call center records, satellite images, and data from electric power grids are some popular examples of data streams. Data streams possess several unique properties:
● Infinite
● Massive
● Fast changing
● New classes of data may evolve that are difficult to fit into the existing classes (concept evolution)
● The relation between input data and output data may change over time (concept drift)
Apart from these unique characteristics, there are some potential challenges in data stream mining:
● It is not feasible to manually label all the data points in the stream.
● It is not feasible to store or archive these streams in a conventional database.
● Concept drift
● Concept evolution
● The speed and huge volume make the data difficult to mine; only single-scan algorithms are feasible.
● The lack of schema and structure makes the streams difficult to query with SQL-based tools.
[Figure: a data-stream management system — multiple streams enter a stream query processor, which emits continuous results; only limited archival storage is available.]
Sliding Window
🌕 The sliding window approach can be used to answer ad-hoc queries.
🌕 A sliding window stores the most recent n elements of the stream, for some n.
🌕 Alternatively, it stores all the elements that arrived within the last t time units, e.g., one day.
🌕 The length of the sliding window is specified by its range. Its stride specifies how far the window advances each step, i.e., the portion of the old window that is dropped when a new window is created.
🌕 There are 2 types of sliding windows:
■ time-based
Range and stride are specified as time intervals. For example, a sliding window with range = 10 mins and stride = 2 mins produces windows that cover the data of the last 10 mins, and a new window is created every 2 mins.
■ count-based
Range and stride are specified in terms of numbers of elements (a concrete sketch appears below).
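As a concrete illustration, here is a minimal Python sketch of a count-based sliding window; the class name and parameters are illustrative, not from the notes.

from collections import deque

class CountBasedWindow:
    """Count-based sliding window: range and stride are numbers of elements."""

    def __init__(self, range_size, stride):
        self.range_size = range_size   # how many recent elements a window covers
        self.stride = stride           # how many new arrivals trigger a new window
        self.buffer = deque(maxlen=range_size)
        self.since_last_window = 0

    def add(self, element):
        """Add one stream element; return a new window when the stride is reached."""
        self.buffer.append(element)
        self.since_last_window += 1
        if self.since_last_window >= self.stride:
            self.since_last_window = 0
            return list(self.buffer)   # snapshot of the current window
        return None

# Example: range = 5 elements, stride = 2 elements.
window = CountBasedWindow(range_size=5, stride=2)
for x in range(1, 11):
    w = window.add(x)
    if w is not None:
        print(w)   # [1, 2], [1, 2, 3, 4], [2, 3, 4, 5, 6], ...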
Sampling Data in a Stream
Consider a search engine that receives a stream of (user, query, time) tuples and wants to study the behaviour of typical users, but has space to store only 1/10th of the stream.
🌕 The obvious approach would be to generate a random number, say an integer from 0 to 9, in response to each search query.
🌕 Store the tuple if and only if the random number is 0.
🌕 Each user then has, on average, 1/10th of their queries stored.
🌕 However, a sample built this way answers per-user questions (e.g., what fraction of a typical user's queries are duplicates) incorrectly, because each user's queries are sampled independently.
Solution
A solution for the above scenario is to sample users rather than individual queries: pick 1/10th of the users and take all of their searches into the sample.
Use a hash function that hashes the user name or user id uniformly into 10 buckets. Each time a search query arrives in the stream, hash the user; if the user hashes to bucket 0, accept the query for the sample, and if not, then not.
By using a hash function this way, one avoids keeping an explicit list of sampled users.
General Solution
The stream consists of tuples with keys. The key is some subset of each tuple's components; e.g., if the tuple is (user, search, time), the key is user. The choice of key depends on the application.
To get a sample of fraction a/b: hash each tuple's key uniformly into b buckets and pick the tuple if its hash value is less than a.
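A minimal Python sketch of this general scheme follows; the use of MD5 as the uniform hash and the helper names are assumptions for the sketch.

import hashlib

def bucket(key, b):
    """Hash a key uniformly into b buckets using a stable hash (MD5 here)."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % b

def sample(stream, key_of, a, b):
    """Keep a tuple iff its key hashes into one of buckets 0..a-1,
    so an a/b fraction of keys is sampled, with all tuples for a sampled key."""
    for tup in stream:
        if bucket(key_of(tup), b) < a:
            yield tup

# Example: tuples are (user, search, time); the key is the user.
stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "cats", 3)]
for tup in sample(stream, key_of=lambda t: t[0], a=1, b=10):
    print(tup)   # all of a sampled user's queries, none of the others'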
Filtering Streams
Filtering involves selecting the stream elements that satisfy a particular criterion. There are different methods for selecting elements; the process becomes hard when selection requires testing membership in a very large set.
Each element of the data stream is a tuple.
Given a list of keys S, determine which tuples of the stream are in S.
Applications of filtering
Email spam filtering
◦ We know 1 billion “good” email addresses
◦ If an email comes from one of these, it is NOT spam
Publish-subscribe systems
◦ You are collecting lots of messages (news articles)
◦ People express interest in certain sets of keywords
◦ Determine whether each message matches a user's interests
Example
Suppose we want to create a Gmail account. Gmail maintains a list of the usernames of everyone who already has an account. When we enter our preferred username, we may get the message "username already exists". Gmail checks availability by searching the millions of usernames registered with it. There are several ways to do this search.
Linear search: obviously a bad idea, because there may be billions of accounts.
Binary search: the usernames must be stored in sorted order, and even then keeping billions of usernames sorted and searching them for every request is impractical.
The solution is the Bloom filter technique.
Bloom Filter Technique
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
The Bloom filter method uses hashing. A hash function takes an input and outputs a fixed-length value, which is used as an index into a bit array.
Working of Bloom Filter
An empty Bloom filter is a bit array of m bits, all set to zero.
We need k hash functions to calculate the hashes for a given input. When we want to add an item to the filter, the bits at the k indices h1(x), h2(x), ..., hk(x) are set to 1, where the indices are calculated using the hash functions.
Example – Suppose we want to enter "good" in the filter, using 3 hash functions and a bit array of length 10, all set to 0 initially. First we calculate the hashes as follows:
h1("good") % 10 = 1
h2("good") % 10 = 4
h3("good") % 10 = 7
We set the bits at indices 1, 4 and 7 to 1.
Next we want to enter "bad"; similarly we calculate the hashes:
h1("bad") % 10 = 3
h2("bad") % 10 = 5
h3("bad") % 10 = 4
Now suppose we want to check whether "cat" is in the filter. Calculating the hashes in the same way gives
h1("cat") % 10 = 1
h2("cat") % 10 = 3
h3("cat") % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that "cat" was never added to the filter: bits 1 and 7 were set when we added "good", and bit 3 was set when we added "bad". This is a false positive.
Generalization
The Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0's.
2. A collection of hash functions h1, h2, ..., hk. Each hash function maps "key" values to n buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream
elements whose keys are in S, while rejecting most of the stream
elements whose keys are not in S.
To initialize the bit array, begin with all bits 0. Take each key value in S and hash it using each of the k hash functions. Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
To test a key K that arrives in the stream, check that all of h1(K), h2(K), ..., hk(K) are 1's in the bit-array. If all are 1's, then let the stream element through. If one or more of these bits are 0, then K cannot be in S, so reject the stream element.
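A minimal Python sketch of such a filter follows; deriving the k hash functions from two base hashes (the Kirsch-Mitzenmacher trick) is an implementation choice for the sketch, not something the notes prescribe.

import hashlib

class BloomFilter:
    """Bit array of n bits with k hash functions; no false negatives,
    occasional false positives."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _indices(self, key):
        # Derive k indices from two base hashes taken from one SHA-256 digest.
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, key):
        for i in self._indices(key):
            self.bits[i] = 1

    def might_contain(self, key):
        # True means "possibly in S"; False means "definitely not in S".
        return all(self.bits[i] for i in self._indices(key))

bf = BloomFilter(n=1000, k=3)
for name in ["good", "bad"]:
    bf.add(name)
print(bf.might_contain("good"))  # True
print(bf.might_contain("zzz"))   # almost certainly False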
Medium uses Bloom filters to recommend posts to users by filtering out posts a user has already seen.
Quora implemented a sharded Bloom filter in the feed backend to filter out stories that people have seen before.
The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
Google BigTable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce the disk lookups for non-existent rows or columns.
Definition
Given a stream of elements {x1, x2, ..., xs} with repetitions, and an integer m, let the n distinct elements of the stream be {e1, e2, ..., en}.
Estimating Moments
Estimating moments involves computing the distribution of frequencies of the different elements in the stream.
Definition of Moments
Consider a data stream of elements drawn from a universal set. Let mi be the number of occurrences of the ith element, for any i. Then the kth-order moment of the stream is Σi (mi)^k, the sum of the kth powers of the mi's.
The 1st moment is the sum of the mi's, which is simply the length of the stream.
The 2nd moment is the sum of the squares of the mi's; it measures how uneven the distribution of elements is, and is therefore sometimes called the "surprise number".
Example
Suppose we have a stream of length 100 in which eleven different elements appear. The most even distribution of these eleven elements would have one appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number is 10² + 10 × 9² = 910. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would be 90² + 10 × 1² = 8110.
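For concreteness, a short Python check of both surprise numbers (the two streams are constructed to match the example):

from collections import Counter

def second_moment(stream):
    """Exact second moment: sum of squared frequencies."""
    return sum(m * m for m in Counter(stream).values())

even = [0] * 10 + [e for e in range(1, 11) for _ in range(9)]   # 10 + 10*9 = 100 elements
skewed = [0] * 90 + list(range(1, 11))                          # 90 + 10*1 = 100 elements
print(second_moment(even))    # 910
print(second_moment(skewed))  # 8110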
As the example shows, moments of any order can be computed exactly as long as the stream fits in main memory. When the stream does not fit in memory, we can estimate the kth moment by keeping a limited number of values and computing an estimate from them. The following algorithm computes the second moment this way.
The Alon-Matias-Szegedy Algorithm for Second Moments
Even if there is not enough storage space to hold the whole stream, the second moment can still be estimated using the AMS algorithm.
Algorithm
Consider a stream of length n. Without storing all the elements, we compute some number of sample variables.
For each variable X:
Choose a position in the stream between 1 and n, uniformly at random.
Assign X.element to be the element found at that position.
Initialize X.value = 1 for that element.
Scan the rest of the stream and add 1 to X.value each time another occurrence of X.element is encountered.
Derive an estimate of the second moment from the variable X as n(2 × X.value − 1).
Calculate the average of all the estimates n(2 × X.value − 1) to obtain the final estimate of the second moment.
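A minimal Python sketch of the estimator follows; it assumes the stream length n is known in advance and that the stream fits in a list, simplifications a real streaming implementation would avoid.

import random

def ams_second_moment(stream, num_vars):
    """Estimate the second moment with AMS sample variables."""
    n = len(stream)
    positions = set(random.sample(range(n), num_vars))
    variables = []                      # each entry is [X.element, X.value]
    for t, x in enumerate(stream):
        if t in positions:
            variables.append([x, 0])    # start a variable at this position
        for var in variables:
            if var[0] == x:
                var[1] += 1             # another occurrence of X.element
    # Each variable yields the estimate n * (2 * X.value - 1); average them.
    return sum(n * (2 * v - 1) for _, v in variables) / len(variables)

stream = list("abcbabcabc")             # exact 2nd moment: 3^2 + 4^2 + 3^2 = 34
print(ams_second_moment(stream, num_vars=4))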
The DGIM Algorithm
The DGIM (Datar-Gionis-Indyk-Motwani) algorithm estimates the number of 1s among the last N bits of a binary stream using only O(log² N) bits of storage. It divides the window into buckets whose sizes are powers of two, with at most one or two buckets of each size, each bucket recording the timestamp of its most recent 1.
Example
Consider the bit stream,
..1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
The aim is to divide the stream into buckets that satisfy the DGIM rules.
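Below is a minimal Python sketch of DGIM bucket maintenance under these rules; the class and method names are illustrative.

class DGIM:
    """Approximate count of 1s in the last window_size bits of a 0/1 stream."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        self.buckets = []   # (timestamp of most recent 1, size), newest first

    def add(self, bit):
        self.time += 1
        # Drop buckets that have slid entirely out of the window.
        self.buckets = [(ts, sz) for ts, sz in self.buckets
                        if ts > self.time - self.window_size]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            # Whenever three buckets share a size, merge the two oldest of them.
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i+1][1] == self.buckets[i+2][1]:
                    ts, sz = self.buckets[i+1]   # newer of the two oldest
                    del self.buckets[i+1:i+3]
                    self.buckets.insert(i+1, (ts, 2 * sz))
                else:
                    i += 1

    def estimate(self):
        """Sum all bucket sizes, subtracting half the size of the oldest bucket."""
        if not self.buckets:
            return 0
        return sum(sz for _, sz in self.buckets) - self.buckets[-1][1] // 2

d = DGIM(window_size=10)
for b in [1,0,1,1,0,1,1,0,0,0,1,0,1,1,1,0,1,1,0,0,1,0,1,1,0]:
    d.add(b)
print(d.estimate())   # approximate number of 1s in the last 10 bits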
Decaying Windows
This approach is used for finding the most popular elements in a stream; it can be seen as an extension of the ideas behind the DGIM algorithm. The aim is to weight the recent elements more heavily. Typical applications include:
recording the popularity of items sold at Amazon, and
tracking the rate at which different Twitter users tweet.
Let a1 be the first element to arrive and at be the current element. Let c be a small constant, such as 10⁻⁶. The exponentially decaying window of the stream is the sum
Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i
When a new element a_{t+1} arrives at the stream input, all we need to do is:
1. Multiply the current sum by (1 − c).
2. Add a_{t+1}.
The result is the updated sum Σ_{i=0}^{t} a_{t+1−i} (1 − c)^i.
At the end of the sequence, the score of "fifa" is 2.135 while the score of "ipl" is 3.7264, so "ipl" is trending more than "fifa". Even though both occur almost the same number of times in the input, their scores differ because more recent occurrences carry more weight.
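A minimal Python sketch of this scoring scheme, with an illustrative stream and a deliberately large c = 0.1 so the decay is visible (the notes' fifa/ipl input sequence is not reproduced here):

def decaying_scores(stream, c=0.1):
    """Maintain one decaying-window score per item.
    On each arrival: multiply every score by (1 - c), then add 1
    to the score of the item that just arrived."""
    scores = {}
    for item in stream:
        for key in scores:
            scores[key] *= (1 - c)
        scores[item] = scores.get(item, 0) + 1
    return scores

stream = ["fifa", "ipl", "fifa", "ipl", "ipl", "fifa", "ipl", "ipl"]
scores = decaying_scores(stream, c=0.1)
print(max(scores, key=scores.get), scores)   # "ipl": more recent occurrences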
Advantages of the Decaying Window Algorithm:
1. Sudden spikes or spam data are handled gracefully.
2. New elements are given more weight, which yields the right trending output.