
Bigdata Analytics Unit – II

UNIT II MINING DATA STREAMS
Introduction To Streams Concepts – Stream Data Model and Architecture – Stream Computing –
Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream –
Estimating Moments – Counting Ones in a Window – Decaying Window – Real time Analytics
Platform (RTAP) Applications – Case Studies – Real Time Sentiment Analysis, Stock Market
Predictions.

Part – A
1.What factors lead to Concept Drift? [CO3-L1]
The constantly changing patterns in data streams can affect the induced data mining
models in multiple ways such as
 Changes in the class label of an existing data variable,
 Change in the available feature information.
Both these lead to a phenomenon called concept drift.
2. How are continuous queries evaluated? [CO3-L1]
Continuous queries are evaluated continuously as data streams continue to arrive. The answer to
a continuous query is produced over time, always reflecting the stream data seen so far.
Continuous query answers may be stored and updated as new data arrives, or they may be
produced as data streams themselves.
3. What are called Ad-Hoc Queries? [CO3-L1]
An ad-hoc query is issued online after the data streams have already begun. Ad-hoc queries can
be either one-time queries or continuous queries. Ad-hoc queries are basically questions asked
once about the current state of a stream or streams.
4. List out few challenges of data mining algorithms. [CO3-L2]
Data streams pose several challenges for data mining algorithm design, The most
important of them are,
 Algorithms must make use of limited resources (time and memory).
 Algorithms must deal with data whose distribution changes over time.
5. Why traditional data mining algorithms could not be used on data streams? [CO3-L1]
Many traditional data mining algorithms can be modified to work with larger datasets, but they
cannot handle continuous supply of data.
If a traditional algorithm has learnt and induced a model of the data seen until now, it cannot
immediately update the model when new information keeps arriving at continuous intervals.
Instead, the entire training process must be repeated with the new examples included.
6. What are one time queries? [CO3-L1]
One-time queries are queries that are evaluated once over a point-in-time snapshot of the data
set, with the answer returned to the user. For example, a stock price checker may alert the user
when a stock price crosses a particular price point.
7. What are continuous queries? [CO3-L1]
Continuous queries are evaluated continuously as data streams continue to arrive. The answer to
a continuous query is produced over time, always reflecting the stream data seen so far.
Continuous query answers may be stored and updated as new data arrives, or they may be
produced as data streams themselves.
8. What is a predefined query? [CO3-L1]
A pre-defined query is one that is supplied to the DSMS before any relevant data has arrived.
Pre-defined queries are most commonly continuous queries.
9. List out the major issues in Data Stream Query Processing. [CO3-L2]
The major issues in Data Stream Query Processing are as follows,
 Unbounded Memory Requirements
 Approximate Query Answering
 Sliding Windows
 Batch Processing, Sampling and Synopses
 Blocking Operators
10. What is Reservoir Sampling? [CO3-L1]
Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from
a list of n items, where n is either a very large or unknown number. Typically n is large enough
that the list doesn’t fit into main memory. For example, a list of search queries in Google
11. How is Biased Reservoir Sampling different from Reservoir Sampling? [CO3- L1]
Biased reservoir sampling uses a bias function to regulate the sampling from the stream. This bias
gives a higher probability of selecting data points from recent parts of the stream as compared to
distant past. This bias function is quite effective since it regulates the sampling in a smooth way
so that the queries over recent horizons are more accurately resolved.
12. What does the term “Filtering a Data Stream” mean? [CO3-L1]
“Filtering” tries to observe an infinite stream of data, look at each of its items and
determine whether the item is of interest and should be stored for further evaluation.
Hashing has been the popular solution to designing algorithms that approximate some value that
we would like to maintain.
13. What is a Bloom Filter? [CO3-L1]
A Bloom filter is a space-efficient probabilistic data structure. It is used to test whether an
element is a member of a set. False positive matches are possible, but false negatives are not,
thus a Bloom filter has a 100% recall rate.
14. What is a cardinality estimation problem? [CO3-L1]
The count-distinct problem, also known as the cardinality estimation problem, is the problem of
finding the number of distinct elements in a data stream with repeated elements. This is a well-
known problem with numerous applications.
15. What is Flajolet–Martin algorithm used for? [CO3-L1]
The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements
in a stream with a single pass and space-consumption which is logarithmic in the maximum
number of possible distinct elements in the stream.
16. How are “moments” estimated? [CO3-L1]
The method of moments is a method of estimation of population parameters. One starts with
deriving equations that relate the population moments (i.e., the expected values of powers of the
random variable under consideration) to the parameters of interest.
17. What is called the decay of insight? [CO3-L1]
The length of time that analytic insight has value is rarely considered in big data and analytics
projects.
The concept of the half-life of insight can be used to understand the expectation of the magnitude
of insight after a period of time. It gives the expectation M(t) of the magnitude of the insight after
time t.
18. What is Real-Time Analysis? [CO3-L1]
Real-time analytics is the use of, or the capacity to use, all available enterprise data and resources
when they are needed. It consists of dynamic analysis and reporting, based on data entered into a
system less than one minute before the actual time of use. Real-time analytics is also known as
real-time data analytics.
19. What is a Data Stream Management System? [CO3-L1]
A Data stream management system (DSMS) is a computer program to manage continuous data
streams. It is similar to a database management system (DBMS), which is, however, designed for
static data in conventional databases. A DSMS also offers a flexible query processing so that the
information need can be expressed using queries.

20. What is Event Stream Processing? [CO3-L1]


Event stream processing, or ESP, is a set of technologies designed to assist the construction of
event-driven information systems. ESP technologies include event visualization, event databases,
event-driven middleware, and event processing languages, or complex event processing (CEP).
21. What is called Data Stream Mining? [CO3-L1]
Data Stream Mining is the process of extracting knowledge structures from continuous, rapid
data records. A data stream is an ordered sequence of instances that in many applications of data
stream mining can be read only once or a small number of times using limited computing and
storage capabilities.
Part – B

1.Explain data stream management systems in detail. [CO3-L2]


Data Streams
In recent years, a new class of data-intensive applications has become widely recognized, that
is, applications in which the data is modelled best as transient data streams instead of as static
records in a traditional database.
Real-time analytics over data streams is needed to manage the data currently generated, at an
ever-increasing rate, by these applications. Examples of such applications include financial
applications, network monitoring, security, telecommunications data management, web
applications, manufacturing, sensor networks, call detail records, email, blogging, twitter posts
and others.
In the data stream model, individual data items may be relational tuples. Example: network
measurements, call records, web page visits, sensor readings, etc.
The continuous arrival of data in multiple, rapid, time-varying, unpredictable and unbounded
streams create new research problems.
In all of the applications mentioned above, it is not feasible to simply load the arriving data into a
traditional database management system (DBMS) and operate on it there.
Traditional DBMSs are not designed for rapid and continuous loading of individual data items,
and they do not directly support the continuous queries that are typical of data stream
applications.
Also the data in a data stream is lost forever if it is not processed immediately or stored. But the
data in most data streams arrive so rapidly that it is not feasible to store it all in active storage
(i.e., in a conventional database), and then interact with it at a later time.
Therefore the algorithms that process data streams must work under very strict constraints of
space and time.
In addition to this data streams pose several challenges for data mining algorithm design.
 Algorithms must make use of limited resources (time and memory).
 Algorithms must deal with data whose distribution changes over time.
Data Stream Management Systems
Traditional relational databases store and retrieve records of data that are static in nature.
It does not keep track of time unless time is added as an attribute to the database during
designing the schema itself.
This model was sufficient for most of the legacy applications (old applications) and older
repositories of information.
But many current and emerging applications require support for online analysis of rapidly
arriving and changing data streams. This has generated a great deal of research activity that
attempts to build new models to manage streaming data.
This has resulted in DATA STREAM MANAGEMENT SYSTEMS (DSMS), with importance
on continuous query languages and query evaluation.
The generic model for such a DSMS is as follows.
Data Stream Model
A data stream is a real-time, continuous and ordered sequence of items. The ordering may be
done implicitly using arrival time or explicitly by using a time-stamp.


It is not possible to control the order in which the items arrive. It is also not feasible to locally
store a stream fully in any memory device.
Moreover, a query made over streams will actually run continuously over a period of time and
incrementally return new results as new data arrives. Therefore, these are known as long-
running, continuous, standing and persistent queries.
Any generic model that attempts to store and retrieve data streams must have the
following characteristics,
1. The data model and query processor must allow both order-based and time-based
operations
2. Because of the inability to store a complete stream some approximate summary structures must
be used. As a result of this summarization queries over the summaries may not return exact
answers.
3. Streaming query must not use any operators that require the entire input before any results are
produced. Such operators will block the query processor indefinitely.
4. Any query that requires backtracking over a data stream is infeasible. This is due to the storage
and performance constraints imposed by a data stream. Thus any online stream algorithm is
restricted to make only one pass over the data.
5. Applications that monitor streams in real-time must react quickly to unusual data values. Thus,
long-running queries must be prepared for changes in system conditions any time during their
execution lifetime (e.g., they may encounter variable stream rates).
6. As per the Scalability requirements parallel and shared execution of many continuous queries
must be possible.
An abstract architecture for a typical DSMS is depicted in the figure below. An input monitor
may regulate the input rates, possibly by dropping packets. Data are typically stored in three
partitions: temporary working storage (e.g., for window queries), summary storage, and static
storage for meta-data (e.g., the physical location of each source). Long-running queries are
registered in the query repository and placed into groups for shared processing. It is also
possible to pose one-time queries over the current state of the stream. The query processor
communicates with the input monitor and may re-optimize the query plans in response to
changing input rates. Results are streamed to the users or temporarily buffered.
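As an illustration of such a long-running query, the following minimal Python sketch maintains a continuous windowed average over an incoming stream. The window size and the stream values are purely illustrative, and no particular DSMS product is assumed; the point is only that the answer is produced incrementally, after every arrival.

from collections import deque

def continuous_window_average(stream, window_size=5):
    # Continuously report the average of the last window_size items, emitting an
    # updated answer after every arrival, like a long-running query in a DSMS.
    window = deque(maxlen=window_size)       # temporary working storage
    for item in stream:
        window.append(item)                  # old items fall out automatically
        yield sum(window) / len(window)      # incremental, always-current answer

# Example with a hypothetical finite stream of sensor readings
readings = [10, 12, 11, 15, 14, 13, 20, 18]
for answer in continuous_window_average(readings, window_size=3):
    print(answer)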

2.Explain Sampling in Data Streams and its types[CO3-L2]


Sampling in Data Streams
Sampling is a common practice for selecting a subset of data to be analysed. Instead of dealing
with an entire data stream, we select instances at periodic intervals. Sampling is used to compute
statistics of the stream. Sampling methods reduce the amount of data to process, and hence the
computational cost, but they can also be a source of errors. The main problem is to obtain a
representative sample, a subset of data that has approximately the same properties as the original
data.

Reservoir Sampling
Many mining algorithms can be applied if only we can draw a representative sample of the data
from the stream. Imagine there is a really large stream of data elements.
The goal is to efficiently return a random sample of k elements evenly distributed from the
original stream.
A simple way is to generate k random integers between 0 and N − 1 (where N is the length of the
stream), then retrieve the elements at those indices and you have your answer.
To make this sampling without replacement, we simply need to note whether or not our sample
already has that random number and, if so, choose a new random number.
This can make the algorithm very expensive if the sample size is very close to N.
Further, in the case of a data stream we don’t know N, the size of the stream, in advance and we
cannot index directly into it.
We can count it, but that requires making two passes of the data, which is not possible. Thus, the
general sampling problem in the case of a data stream is, “How to ensure such a sample is
drawn uniformly, given that the stream is continuously growing?”
For example, if we want to draw a sample of 100 items and the stream has a length of only 1000,
then we want to sample roughly one in ten items. But if a further million items arrive, we must
ensure that the probability of any item being sampled is more like one in ten thousand. If we retain
the same 100 items, then this cannot be considered a representative sample.
Several solutions are possible to ensure that we continuously maintain a uniform sample from
the stream.
Reservoir-based methods were originally proposed for one-pass access of data from magnetic
storage devices such as tapes. Similarly to the case of streams, the number of records is not
known in advance and the sampling must be performed dynamically as the records from the tape
are read.
Assume that we wish to obtain an unbiased sample of size k from the data stream, and we maintain
a reservoir of size k. The first k points in the data stream are added to the
reservoir for initialization.
Then, when the n-th point from the data stream is received, it is added to the reservoir with
probability k/n.
In order to make room for the new point, one of the current k points in the reservoir is chosen
with equal probability and subsequently removed.
Thus, if we draw a sample of size k, we initialize the sample with the first k items from the stream.
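A minimal Python sketch of this procedure (keep the first k items, then accept the n-th item with probability k/n and evict a uniformly chosen current member) is shown below; the function and variable names are illustrative.

import random

def reservoir_sample(stream, k):
    # Maintain a uniform random sample of size k over a stream of unknown length.
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)        # initialize with the first k items
        else:
            j = random.randrange(n)       # uniform integer in [0, n-1]
            if j < k:                     # true with probability k/n
                reservoir[j] = item       # evict a uniformly chosen current member
    return reservoir

# Example: a uniform sample of 10 items from a stream of one million integers
sample = reservoir_sample(range(1_000_000), k=10)
print(sample)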
Biased Reservoir Sampling
In many cases, the stream data may evolve over time, and the corresponding data mining or
query results may also change over time. Thus, the results of a query over a more recent window
may be quite different from the results of a query over a more distant window.
Similarly, the entire history of the data stream may not be relevant for use in a repetitive data
mining application such as classification.
The simple reservoir sampling algorithm can be adapted to sample from a moving window
over data streams. This is useful in many data stream applications where a small amount of
recent history is more relevant than the entire previous stream.
However, this can sometimes be an extreme solution, since for some applications we may need
to sample from varying lengths of the stream history.
While recent queries may be more frequent, it is also not possible to completely disregard
queries over more distant horizons in the data stream.
Biased reservoir sampling uses a bias function to regulate the sampling from the stream. This bias
gives a higher probability of selecting data points from recent parts of the stream as compared to
distant past. This bias function is quite effective since it regulates the sampling in a smooth way
so that the queries over recent horizons are more accurately resolved.
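One simple way to realize such a recency bias, sketched below under the assumption of an exponential bias function (in the spirit of memory-less biased reservoir sampling), is to always insert the arriving point and let it overwrite a random existing slot with probability equal to the current fill fraction, so that older points survive with exponentially decreasing probability. The parameter names are illustrative.

import random

def biased_reservoir_sample(stream, capacity):
    # Recency-biased reservoir (sketch): every point is inserted; with probability
    # equal to the current fill fraction it overwrites a random existing slot,
    # otherwise it is appended, so older points survive with decreasing probability.
    reservoir = []
    for item in stream:
        fill_fraction = len(reservoir) / capacity
        if random.random() < fill_fraction:
            reservoir[random.randrange(len(reservoir))] = item   # overwrite an old point
        else:
            reservoir.append(item)                               # reservoir not yet saturated
    return reservoir

recent_biased = biased_reservoir_sample(range(10_000), capacity=100)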
Concise Sampling
The size of the reservoir is often restricted by the available main memory. It is
desirable to increase the sample size within the available main memory restrictions.
For this purpose, the technique of concise sampling is quite effective. Concise sampling uses the
fact that the number of distinct values of an attribute is often significantly smaller than the size of
the data stream.
In many applications, sampling is performed based on a single attribute in multi-dimensional
data. For example, for customer data on an e-commerce site, sampling may be done based only on
customer IDs. The number of distinct customer IDs is definitely much smaller than the size of the
entire stream.
The repeated occurrence of the same value can be exploited in order to increase the sample size
beyond the relevant space restrictions.
We note that when the number of distinct values in the stream is smaller than the main memory
limitations, the entire stream can be maintained in main memory, and therefore, sampling may
not even be necessary.
For current systems in which the memory sizes may be of several gigabytes, very large sample
sizes can be held in main memory as long as the number of distinct values does not exceed the
memory constraints. On the other hand, for more challenging streams with an unusually large
number of distinct values, we can use the following approach.
1. The sample S is maintained as a set of (value, count) pairs.
2. For those pairs in which the value of count is one, we do not maintain the count explicitly, but
we maintain the value as a singleton.
3. The number of elements in this representation is referred to as the footprint and is bounded
above by the available space n.
4. We use a threshold parameter τ that defines the probability of successive sampling from the
stream. The value of τ is initialized to 1.
5. As the points in the stream arrive, we add them to the current sample with probability 1/τ.
6. We note that if the corresponding value-count pair is already included in the set S, then we
only need to increment the count by 1. Therefore, the footprint size does not increase.
7. On the other hand, if the value of the current point is distinct from all the values encountered
so far, or it exists as a singleton, then the footprint increases by 1. This is because either a new
singleton needs to be added, or a singleton gets converted to a value-count pair with a count of 2.
8. The increase in footprint size may potentially require the removal of an element from sample S
in order to make room for the new insertion.
9. When this situation arises, we pick a new (higher) value of the threshold τ′, and apply this
threshold to the footprint in repeated passes.
10. In each pass, we reduce the count of a value with probability τ/τ′, until at least one value-count
pair reverts to a singleton or a singleton is removed.
11. Subsequent points from the stream are sampled with probability 1/τ′.
In practice, τ′ may be chosen to be about 10% larger than the value of τ. The choice of different
values of τ provides different trade-offs between the average (true) sample size and the
computational requirements of reducing the footprint size.
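The bookkeeping described above can be sketched as follows. This is a simplified, illustrative version (single-attribute values, binomial thinning of counts) rather than a full implementation, and the parameter names are assumptions.

import random

def concise_sample(stream, max_footprint, tau_growth=1.1):
    # Simplified concise-sampling sketch: store (value, count) pairs instead of repeats.
    # counts maps value -> count; a count of 1 plays the role of a singleton.
    # tau is the inverse sampling probability: points are admitted with probability 1/tau.
    counts = {}
    tau = 1.0
    for item in stream:
        if random.random() < 1.0 / tau:          # admit the point with probability 1/tau
            if item in counts:
                counts[item] += 1                # footprint does not grow
            else:
                counts[item] = 1                 # new singleton: footprint grows by 1
                while len(counts) > max_footprint:
                    # raise the threshold and thin the sample (binomial thinning of counts)
                    new_tau = tau * tau_growth
                    for value in list(counts):
                        kept = sum(random.random() < tau / new_tau
                                   for _ in range(counts[value]))
                        if kept:
                            counts[value] = kept
                        else:
                            del counts[value]
                    tau = new_tau
    return counts, tau

# Example: sample from a synthetic stream with a limited number of distinct values
sample_counts, final_tau = concise_sample(
    (random.randint(0, 500) for _ in range(100_000)), max_footprint=200)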

3. Explain the process of Data Stream Mining with suitable examples. [CO3-L2]


Data Stream Mining
Data Stream Mining is the process of extracting useful knowledge from continuous, rapid data
streams.
Many traditional data mining algorithms can be modified to work with larger datasets, but they
cannot handle continuous supply of data.
For example, if a traditional algorithm has learnt and induced a model of the data seen until now,
it cannot immediately update the model when new information keeps arriving at continuous
intervals. Instead, the entire training process must be repeated with the new examples included.
With big data, this limitation is both undesirable and highly inefficient.
Mining big data streams faces three principal challenges:
 Volume
 Velocity
 Volatility
Volume and velocity require a high volume of data to be processed in limited time.
From the beginning, the amount of available data constantly increases from zero to potentially
infinity. This requires incremental approaches that incorporate information as it becomes
available, and online processing if not all data can be kept.
Volatility indicates that the environment is never stable and has constantly changing patterns.
In this scenario, old data is of limited use, even if it could be saved and processed again later.
The constantly changing patterns can affect the induced data mining models in multiple ways:
 Changes in the class label of an existing data variable,
 Change in the available feature information.
Both these lead to a phenomenon called concept drift.
Example of Concept Drift
Stock Market Application
Consider a stock market application which labels a particular stock as “hold” or “sell”; it may
need to change these labels rapidly based on the current stream of input information. Changes in
the available feature information can arise when new features become available.
Weather Forecasting Application
Consider a continuous weather forecasting application that may need to consider more attributes
as new sensors are added continuously. Existing features might need to be excluded due to
regulatory requirements, or a feature might change in its scale, if data from a more precise
instrument becomes available.
Thus, CONCEPT DRIFT is a phenomenon that occurs because of feature changes or changes
in the behaviour of the data itself.
This indicates that one important ingredient of mining data streams is ONLINE MINING
OF CHANGES. This means we are looking to manage data that arrives online, often in
real-time, forming a stream which is potentially infinite.
Even if the patterns are discovered in snapshots of data streams, the changes to the patterns may
be more critical and informative. With data streams, people are often interested in mining queries
like
 “Compared to the past few days, what are the distinct features of the current status?”
 “What are the relatively stable factors over time?”
Clearly, to answer the above queries, we have to examine the changes. Further,
mining data streams is challenging in the following two respects.
On one hand, random access to fast and large data streams may be impossible. Thus, multi-pass
algorithms (i.e., ones that load data items into main memory multiple times) are often infeasible.
On the other hand, the exact answers from data streams are often too expensive to compute.
The main assumption of data stream processing is that training examples can be briefly inspected
a single time only, that is, they arrive in a high speed stream, and then must be discarded to make
room for subsequent data.
The algorithm processing the stream has no control over the order of the data seen, and must
update its model incrementally as each data element is inspected.
Another desirable property is that we must be able to apply the algorithm at any point of time
even in between successive arrivals of data elements in the stream.
All these challenges have resulted in creating a new set of algorithms written exclusively for data
streams. These algorithms can naturally cope with very large data sizes and can tackle
challenging real-time applications not previously tackled by traditional data mining. The most
common data stream mining tasks are CLUSTERING, CLASSIFICATION and FREQUENT
PATTERN MINING.

4. Explain and analyse the Bloom Filter in detail with the algorithm. [CO3-L2]
The Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard
Bloom in 1970, that is used to test whether an element is a member of a set. False positive
matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In
other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be
added to the set, but not removed (though this can be addressed with a “counting” filter). The
more elements that are added to the set, the larger the probability of false positives.
The Bloom filter for a set is much smaller than the set itself, which makes it appropriate when
filtering on a data stream has to be done within the limited main memory itself.
The advantage of this data structure is that it uses considerably less space than any exact method,
but pays for this by introducing a small probability of error. Depending on the space available,
this error can be made arbitrarily small.
Let us assume we want to represent sets S of m elements from a very large universe U, with |U|
much larger than m.
We want to support insertions and membership queries (that is to say, “Given x in U, is x in S?”) so that:
If the answer is No, then x is certainly not in S.
If the answer is Yes, then
x may or may not be in S, but the probability that x is not in S (a false positive) is low.

Both insertions and membership queries should be performed in constant time. A Bloom filter
is a bit vector B of n bits, with k independent hash functions that map each key in U to
the set {0, 1, ..., n − 1}. We assume that each hash function maps a uniformly at random
chosen key to each element of {0, 1, ..., n − 1} with equal probability. Since we assume the hash functions
are independent, it follows that the vector (h1(x), ..., hk(x)) is equally likely to be any of the n^k k-tuples
of elements from {0, 1, ..., n − 1}.
Algorithm The Bloom Filter

A Bloom filter consists of:


1. An array of n bits, initially all 0’s.

2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps
“key” values to n buckets, corresponding to the n bits of the bit-array.

3. A set S of m key values.


The purpose of the Bloom filter is to allow through all stream elements whose
keys are in S, while rejecting most of the stream elements whose keys are not
in S.

To initialize the bit array, begin with all bits 0. Take each key value in S
and hash it using each of the k hash functions. Set to 1 each bit that is hi(K)
for some hash function hi and some key value K in S.

To test a key K that arrives in the stream, check that all of


h1(K), h2(K), . . . , hk(K)
are 1’s in the bit-array. If all are 1’s, then let the stream element through. If
one or more of these bits are 0, then K could not be in S, so reject the stream
element.

Initially all bits of B are set to 0.


Insert x into S: compute h1(x), h2(x), . . . , hk(x) and set B[h1(x)] = B[h2(x)] = . . . = B[hk(x)] = 1.
Query whether x is in S: compute h1(x), h2(x), . . . , hk(x).
If B[h1(x)] = B[h2(x)] = . . . = B[hk(x)] = 1, then answer Yes, else answer No.

Clearly, if an item is inserted into the filter, it is found when searched for. Hence, there is no
false negative error. The only possibility for error is when an item is not inserted into the filter,
but each of the locations that the hash functions map it to are all turned on. We will show that
this error can be kept small while using considerably less space than any exact method.
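A minimal Python sketch of such a filter is given below. Deriving the k positions from salted SHA-256 digests is an illustrative choice of hash functions, not something mandated by the algorithm, and the sizes used in the example are arbitrary.

import hashlib

class BloomFilter:
    # A minimal Bloom filter: an n-bit array probed by k hash functions.
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = bytearray(n_bits)        # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1               # "blind" writes; bits are never cleared

    def __contains__(self, key):
        # "Possibly in set" only if every probed bit is 1.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(n_bits=10_000, k_hashes=4)
for word in ["alice", "bob", "carol"]:
    bf.add(word)
print("bob" in bf)        # True: no false negatives
print("mallory" in bf)    # almost certainly False; True would be a false positive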
Analysis of the Bloom Filter
If a key value is in S, then the element will surely pass through the Bloom filter. However, if the
key value is not in S, it might still pass. This is called a false positive. We need to understand
how to calculate the probability of a false positive, as a function of n, the bit-array length, m, the
number of members of S, and k, the number of hash functions. The computation can be done as
follows:
1. The probability that one hash function does not set a given bit is (1 − 1/n).
2. The probability that the bit is not set by any of the k hash functions applied to one element is (1 − 1/n)^k.
3. Hence, after all m elements of S have been inserted into the Bloom filter, the probability that a
specific bit is still 0 is (1 − 1/n)^(km), which is approximately e^(−km/n). (Note that this uses the assumption that the hash
functions are independent and perfectly random.)
4. The probability of a false positive is the probability that a specific set of k bits are all 1, which is
(1 − e^(−km/n))^k.
5. Thus, we can identify three performance metrics for Bloom filters that can be adjusted to tune its
performance. First the computation time (corresponds to the number k of hash functions), second
the size (corresponds to the number n of bits), and finally the probability of error (corresponds to the
false positive rate).
6. Suppose we are given the ratio n/m and want to optimize the number of hash functions k to
minimize the false positive rate ε. Note that more hash functions increase the precision but also
the number of 1s in the filter, thus making false positives both less and more likely at the same time.
Setting the derivative of ε with respect to k to zero gives the optimum k = (n/m) ln 2, which is about 0.693 n/m.
As n grows in proportion to m, the false positive rate decreases. Some reasonable values for n/m, k and ε are:
a. n/m = 6, k = 4, ε ≈ 0.05
b. n/m = 8, k = 6, ε ≈ 0.02
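The false-positive formula and the optimal k can be checked numerically; the short Python sketch below reproduces the example values quoted above (the key count m is an arbitrary illustrative choice).

import math

def false_positive_rate(n_bits, m_keys, k_hashes):
    # Approximate Bloom filter false-positive probability: (1 - e^(-km/n))^k
    return (1 - math.exp(-k_hashes * m_keys / n_bits)) ** k_hashes

def optimal_k(n_bits, m_keys):
    # The k that minimizes the false-positive rate: (n/m) ln 2
    return (n_bits / m_keys) * math.log(2)

m = 1_000_000                       # number of keys stored in S
for ratio in (6, 8):                # bits per key, n/m
    n = ratio * m
    k = round(optimal_k(n, m))
    print(ratio, k, round(false_positive_rate(n, m, k), 4))
    # prints approximately: 6 4 0.0561  and  8 6 0.0216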
Bloom filters have some nice properties that make it ideal for data streaming applications. When
an item is inserted into the filter, all the locations can be updated “blindly” in parallel. The reason
for this is that the only operation performed is to make a bit 1, so there can never be any data
race conditions. Also for an insertion, there need to be a constant (that is, k) number of writes
and no reads. Again, when searching for an item, there are at most a constant number of reads.
These properties make Bloom filters very useful for high-speed applications, where memory
access can be the bottleneck.

5. Explain the Flajolet-Martin Algorithm in detail. [CO3-L2]


The Flajolet-Martin Algorithm

The principle behind the FM algorithm is that it is possible to estimate the number of distinct
elements by hashing the elements of the universal set to a bit-string that is sufficiently long.
This means that the length of the bit-string must be such that there are more possible results of
the hash function than there are elements of the universal set.
Before we can estimate the number of distinct elements, we first choose an upper bound L on the
number of distinct elements. This bound gives us the maximum number of distinct elements that we
might be able to detect. Choosing L too small will influence the precision of our
measurement. Choosing an L that is far bigger than the number of distinct elements will only use too
much memory. Here, the memory that is required is of the order of log2(L) bits.
For most applications, a 64-bit array is sufficiently large. The array needs to be initialized to
zero. We will then use one or more adequate hashing functions. These hash functions will map
the input to a number that is representable by our bit-array. This number will then be analyzed
for each record. If the resulting number contains k trailing zeros, we will set the k-th bit in the bit
array to one.
Finally, we can estimate the currently available number of distinct elements by taking the index
of the first zero bit in the bit-array. This index is usually denoted by R. The number of unique
elements N can then be estimated as roughly 2^R (up to a constant correction factor). The algorithm is as follows:

Algorithm
Pick a hash function h that maps each of the N elements to at least log2 N bits. For each
stream element a, let r(a) be the number of trailing 0s in h(a). Record R = the maximum r(a) seen.
Estimate = 2^R.

Example
r(a) = position of the first 1 counting from the right; say h(a) = 12, then 12 is 1100 in binary,
so r(a) = 2.

Example
Suppose the stream is 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1 .... The transformed stream
(the chosen hash function h applied to each item) is 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4.
Each of the above elements is converted into its binary equivalent: 100, 101, 10, 100, 10, 101,
11, 101, 100, 10, 101, 100.
We compute r of each item in the above stream: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2. So R = max r,
which is 2. Output 2^R = 2^2 = 4.
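A compact Python sketch of the algorithm is given below. It is fed directly with the already-hashed values from the worked example, so the hash function itself is left abstract and passed in as hash_fn.

def trailing_zeros(x):
    # Number of trailing zero bits in the binary form of x (define r(0) = 0).
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def flajolet_martin(stream, hash_fn):
    # Single-hash FM estimate: 2**R, where R is the longest tail of zeros seen.
    R = 0
    for item in stream:
        R = max(R, trailing_zeros(hash_fn(item)))
    return 2 ** R

# Reproducing the worked example with the already-hashed values 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4
hashed = [4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4]
print(flajolet_martin(hashed, hash_fn=lambda x: x))    # prints 4 (R = 2)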
A very simple and heuristic intuition as to why Flajolet-Martin works can be explained as follows:
1. h(a) hashes a with equal probability to any of N values.
2. Then h(a) is a sequence of log2 N bits, where a 2^(−r) fraction of all a’s have a tail of r zeros
 About 50% of a’s hash to ***0
 About 25% of a’s hash to **00
 So, if we saw the longest tail of r = 2 (i.e., item hashes ending in *100) then we have probably
seen about four distinct items so far
3. So, it takes hashing about 2^r items before we see one with a zero-suffix of length r. More
formally, we can see that the algorithm works because the probability that a given hash value h(a)
ends in at least r zeros is 2^(−r). In the case of m different elements, the probability that R ≥ r (R is the max. tail
length seen so far) is 1 − (1 − 2^(−r))^m.
Variations to the FM Algorithm
There are reasons why the simple FM algorithm won’t work with just a single hash function. The
expected value of 2^R is actually infinite: the probability halves when R is increased to R + 1, however,
the value 2^R doubles. In order to get a much smoother estimate, that is also more reliable, we can
use many hash functions. Another problem with the FM algorithm in the above form is that the
results vary a lot. A common solution is to run the algorithm multiple times with different hash
functions, and combine the results from the different runs.
One idea is to take the mean of the results from all the hash functions, obtaining a single
estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers
(which are likely here).
A different idea is to use the median, which is less prone to be influenced by outliers. The
problem with this is that the result can only take the form of a power of 2. Thus, no matter how
many hash functions we use, should the correct value of m lie between two powers of 2, say 400,
then it will be impossible to obtain a close estimate. A common solution is to combine both the
mean and the median:
1. Create many hash functions and split them into k distinct groups (each of equal size).
2. Within each group take the average (mean) of the results.
3. Finally take the median of the group averages as the final estimate.

Sometimes, an outsized 2^R will bias some of the groups and make them too large. However, taking
the median of the group averages will reduce the influence of this effect almost to nothing.
Moreover, if the groups themselves are large enough, then the averages can be essentially any
number, which enables us to approach the true value m as long as we use enough hash functions.
Groups should be of size at least some small multiple of log2 m.
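A small illustrative sketch of this combination step, assuming the per-hash-function estimates have already been computed, could look as follows; the example values are made up to show how the median suppresses an outlier.

import statistics

def combine_estimates(estimates, group_size):
    # Average the estimates within each group, then take the median of the group averages.
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    return statistics.median(statistics.mean(g) for g in groups)

# Twelve single-hash-function estimates split into four groups of three;
# the outlier 256 is suppressed by the median over the group averages.
print(combine_estimates([4, 8, 4, 16, 8, 8, 4, 4, 8, 256, 8, 4], group_size=3))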

The Flajolet-Martin Algorithm

It is possible to estimate the number of distinct elements by hashing the elements of the universal set
to a bit-string that is sufficiently long. The length of the bit-string must be sufficient that there are
more possible results of the hash function than there are elements of the universal set.
For example, 64 bits is sufficient to hash URL’s. We shall pick many different hash functions
and hash each element of the stream using these hash functions. The important property of a hash
function is that when applied to the same element, it always produces the same result. Notice that this
property was also essential for the sampling technique .

The idea behind the Flajolet-Martin Algorithm is that the more different elements we see in
the stream, the more different hash-values we shall see. As we see more different hash-values, it
becomes more likely that one of these values will be “unusual.” The particular unusual property we
shall exploit is that the value ends in many 0’s, although many other options exist.

Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in
some number of 0’s, possibly none. Call this number the tail length for a and h. Let R be the
maximum tail length of any a seen so far in the stream. Then we shall use the estimate 2^R for the
number of distinct elements seen in the stream.

This estimate makes intuitive sense. The probability that a given stream element a has h(a)
ending in at least r 0’s is 2^(−r). Suppose there are m distinct elements in the stream. Then the
probability that none of them has tail length at least r is (1 − 2^(−r))^m. This sort of expression should be
familiar by now.
We can rewrite it as ((1 − 2^(−r))^(2^r))^(m 2^(−r)). Assuming r is reasonably large, the inner expression
is of the form (1 − ε)^(1/ε), which is approximately 1/e. Thus, the probability of not finding a stream
element with as many as r 0’s at the end of its hash value is e^(−m 2^(−r)). We can conclude:

1. If m is much larger than 2^r, then the probability that we shall find a tail
of length at least r approaches 1.
2. If m is much less than 2^r, then the probability of finding a tail of length at least r approaches 0.
We conclude from these two points that the proposed estimate of m, which is 2^R (recall R is the
largest tail length for any stream element), is unlikely to be either much too high or much too low.
6. Discuss in detail the role of Decaying Windows in data stream analysis. [CO3-L2]
Decaying Windows
Pure sliding windows are not the only way by which the evolution of data streams can be taken
into account during the mining process. A second way is to introduce a decay factor into the
computation. Specifically, the weight of each transaction is multiplied by a factor of (1 − λ), for
some small constant λ, when a new transaction arrives. The overall effect of such an approach is to
create an exponential decay function on the arrivals in the data stream. Such a model is quite
effective for evolving data streams, since recent transactions are counted more significantly during
the mining process.
Specifically, the decay factor is applied only to those itemsets whose counts are affected by the
current transaction. However, the decay factor will have to be applied in a modified way by
taking into account the last time that the itemset was touched by an update. This approach works
because the counts of each itemset reduce by the same decay factor in each iteration, as long as a
transaction count is not added to it. Such an approach is also applicable to other mining problems
where statistics are represented as the sum of decaying values.
We discuss a few applications of decaying windows to find interesting aggregates over data
streams.
The Problem of Most-Common Elements
Suppose we have a stream whose elements are the movie tickets purchased all over the world,
with the name of the movie as part of the element. We want to keep a summary of the stream
that gives the most popular movies “currently.” While the notion of “currently” is imprecise,
intuitively, we want to discount the popularity of an older movie that may have sold many
tickets, but most of these decades ago. Thus, a newer movie that sold n tickets in each of the last
10 weeks is probably more popular than a movie that sold 2n tickets last week but nothing in
previous weeks.
One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the ith
ticket is for that movie, and 0 otherwise. Pick a window size N, which is the number of most
recent tickets that would be considered in evaluating popularity. Then, use the method of the
DGIM algorithm to estimate the number of tickets for each movie, and rank movies by their
estimated counts.
This technique might work for movies, because there are only thousands of movies, but it would
fail if we were instead recording the popularity of items sold at Amazon, or the rate at which
different Twitter-users tweet, because there are too many Amazon products and too many
tweeters. Further, it only offers approximate answers.
Describing a Decaying Window
One approach is to re-define the question so that we are not asking for a simple count of 1s in a
window. We compute a smooth aggregation of all the 1s ever seen in the stream, but with
decaying weights. The further in the past a 1 is found, the lesser is the weight given to it.
Formally, let a stream currently consist of the elements a1, a2, ..., at,
where a1 is the first element to arrive and at is the current element. Let c be a small
constant, such as 10^(-6) or 10^(-9). Define the exponentially decaying window for this stream to be the
sum

Σ (i = 0 to t−1) a(t−i) (1 − c)^i

The effect of this definition is to spread out the weights of the stream elements as far back in
time as the stream goes. In contrast, a fixed window with the same sum of the weights, 1/c, would
put equal weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous
elements. This is illustrated in the figure below.
It is much easier to adjust the sum in an exponentially decaying window than in a sliding
window of fixed length. In the sliding window, we have to somehow take into consideration the
element that falls out of the window each time a new element arrives. This forces us to keep the
exact elements along with the sum, or to use some approximation scheme such as DGIM. But in
the case of a decaying window, when a new element arrives at the stream input, all we need to do
is the following:

Illustrating decaying windows.


1. Multiply the current sum by (1 − c).
2. Add a(t+1).
The reason this method works is that each of the previous elements has now moved one
position further from the current element, so its weight is multiplied by (1 − c). Further,
the weight on the current element is (1 − c)^0 = 1, so adding a(t+1) is the correct way to
include the new element’s contribution.
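A minimal Python sketch of this update rule is given below; the constant c and the sample stream are illustrative, with an exaggerated c so the decay is visible.

def decayed_sum(stream, c=1e-6):
    # Maintain the exponentially decaying sum of a stream of numbers.
    s = 0.0
    for a in stream:
        s = s * (1 - c) + a    # step 1: decay the old weights; step 2: add the new element
        yield s

# Example: a small 0/1 stream
for s in decayed_sum([1, 0, 1, 1, 0], c=0.1):
    print(round(s, 3))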
Now we can try to solve the problem of finding the most popular movies in a stream of
ticket sales. We can use an exponentially decaying window with a constant c, say 10^(-9).
We are approximating a sliding window that holds the last one billion ticket sales. For
each movie, we can imagine a separate stream with a 1 each time a ticket for that movie
appears in the stream, and a 0 each time a ticket for some other movie arrives. The
decaying sum of the 1s thus measures the current popularity of the movie.
To optimize this process, we can avoid performing these counts for the unpopular
movies. If the popularity score for a movie goes below a small threshold, its score is dropped from the
counting. A good threshold value to use is 1/2.
When a new ticket arrives on the stream, do the following:
1. For each movie whose score is currently maintained, multiply its score by (1 − c).
2. Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that
score. If there is no score for M, create one and initialize it to 1.
3. If any score is below the threshold 1/2, drop that score.
A point to be noted is that the sum of all scores is 1/c. Thus, there cannot be more than
2/c movies with a score of 1/2 or more, or else the sum of the scores would exceed 1/c. Thus,
2/c is a limit on the number of movies being counted at any time. Of course, in practice,
the number of actively counted movies would be much less than 2/c. If the number of items
is very large then other more sophisticated techniques are required.
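The three-step procedure above can be sketched in a few lines of Python. The movie names and the exaggerated value of c are illustrative; a real system would apply the decay lazily for efficiency rather than touching every score on every ticket.

def update_scores(scores, movie, c=1e-9, threshold=0.5):
    # Process one arriving ticket: decay every tracked score, credit the movie, prune.
    for m in list(scores):
        scores[m] *= (1 - c)                      # step 1: decay all maintained scores
    scores[movie] = scores.get(movie, 0.0) + 1    # step 2: credit the ticket's movie
    for m in [name for name, s in scores.items() if s < threshold]:
        del scores[m]                             # step 3: drop scores below 1/2
    return scores

scores = {}
for ticket in ["Movie-A", "Movie-B", "Movie-A", "Movie-A", "Movie-C"]:
    update_scores(scores, ticket, c=1e-3)
print(max(scores, key=scores.get))                # currently most popular movie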

7. Explain the generic design of an RTAP in detail. [CO3-L2]


Real-Time Analytics Platform (RTAP)

Real-time analytics makes use of all available data and resources when they are needed.
It consists of dynamic analysis and reporting, based on data entered into a system less
than one minute before the actual time of use. Real-time denotes the ability to process
data as it arrives, rather than storing the data and retrieving it at some point in the future.
For example, consider an e-merchant like Flipkart or Snapdeal; real time means the time
elapsed from the time a customer enters the website to the time the customer logs out.
Any analytics procedure, like providing the customer with recommendations or offering
a discount based on the current value in the shopping cart, etc., will have to be done within
this timeframe, which may be about 15 minutes to an hour.
But from the point of view of a military application where there is constant monitoring,
say of the air space, the time needed to analyze a potential threat pattern and make a decision
may be a few milliseconds.
“Real-Time Analytics” is thus discovering meaningful patterns in data for something
urgent. There are two specific and useful types of real-time analytics - On-Demand and
Continuous.
1. On-Demand Real-Time Analytics is reactive because it waits for users to request a
query and then delivers the analytics. This is used when someone within a company
needs to take a pulse on what is happening right this minute. For instance, a movie
producer may want to monitor the tweets and identify sentiments about his movie on the
first day first show and be prepared for the outcome.
2. Continuous Real-Time Analytics is more proactive and alerts users with continuous
updates in real time. The best example could be monitoring stock market trends and
providing analytics to help users make a decision to buy or sell, all in real time.
Real-Time Analytics Applications
Analytics falls along a spectrum. On one end of the spectrum sit batch analytical
applications, which are used for complex, long-running analyses. They tend to have
slower response times (up to minutes, hours, or days) and lower requirements for
availability.
Examples of batch analytics include Hadoop-based workloads. On the other end of the
spectrum sit real-time analytical applications, which provide lighter-weight analytics
very quickly. Latency is low (sub-second) and availability requirements are high (e.g.,
99.99%). Figure below illustrates this.

Batch versus real-time analytics.


Example applications include:

1. Financial Services: Analyse ticks, tweets, satellite imagery, weather trends, and any
other type of data to inform trading algorithms in real time.
2. Government: Identify social program fraud within seconds based on program history,
citizen profile, and geospatial data.
3. E-Commerce Sites: Real-time analytics will help to tap into user preferences while
people are on the site or using a product. Knowing what users like at run time can
help the site decide what relevant content to make available to that user. This can result
in a better customer experience overall, leading to an increase in sales. Let us take a look at
how this works for these companies. For example, Amazon recommendations change
after each new product you view so that they can upsell customers throughout the
session. Real-time recommendations create a personal shopping experience for each and
every customer. With more insight into their customers on an individual level, Amazon
is able to effectively upsell and cross-sell products at every interaction point.
4.Insurance Industry: Digital channels of customer interaction (such as online
channels) as well as conversations online (such as social media) have created new
streams of real-time event data. Insurance, being a data-rich industry and a high
customer lifetime value business, can gain immensely from real-time analytics. The
following are a few scenarios where an insurance firm can benefit from real-time
analytics:
 A prospective customer visits the website looking to get a quote. Real-time
analytics can be used to predict the propensity of the customer to leave the site without
applying for a quote. This, in turn, can be used to trigger actions like free consultation,
some more schemes, etc.
 In the insurance industry, fast-tracking of claims improves customer satisfaction
significantly. However, this can increase the risk of fraud. Real-time analytics can be
used to reduce the risk of fraud even while accelerating the speed of processing claims.
 Some auto insurers are already collaborating with automobile firms to gather real-
time information from vehicles on a continuous basis. With GPS-enabled telemetry
devices in place, insurers can devise innovative policies where the premium could be
thought of as a car gas-tank that is filled up at a station: just as the actual consumption of
gas changes dynamically based on a variety of conditions, the premium can be
“consumed” in real time based on driving behaviour; say, if one drives safely the premium
lasts longer than when driving rashly.
The list of applications for real-time analytics is endless. At the end of this section we
shall discuss two popular applications, real-time sentiment analysis and real-time stock
predictions, in greater detail.
Generic Design of an RTAP
Companies like Facebook and Twitter generate petabytes of real-time data. This data
must be harnessed to provide real-time analytics to make better business decisions.
Further, in today’s context, billions of devices are already connected to the internet, with
more connecting each day. With the evolution of the Internet of Things (IoT), we have a
large number of new data sources such as smart meters, sensors and wearable medical
devices. Real-time analytics will leverage information from all these devices to apply
analytics algorithms and generate automated actions within milliseconds of a trigger.
To create an environment where you can do “Real-Time Analytics”, the following three
aspects of the data flow into your system are important:
1. Input: An event happens (new sale, new customer, someone enters a high security
zone etc.).
2. Process and Store Input: Capture the data of the event, and analyze the data without
leveraging resources that are dedicated to operations.
3. Output: Consume this data without disturbing operations (reports, dashboard, etc.).
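As an illustration only (a real platform would use a message broker and a stream-processing engine rather than an in-process queue), the three stages can be sketched as follows; the event fields and the aggregation are made up for the example.

import queue
import threading

events = queue.Queue()     # 1. Input: events arrive here (new sale, sensor reading, ...)
dashboard = {}             # 3. Output: a continuously updated view consumed by reports

def analytics_worker():
    # 2. Process and store: consume events without blocking the producers.
    while True:
        event = events.get()
        key = event["type"]
        dashboard[key] = dashboard.get(key, 0) + 1   # toy aggregation per event type
        events.task_done()

threading.Thread(target=analytics_worker, daemon=True).start()

# Simulated event source (the producer side)
for i in range(5):
    events.put({"type": "sale", "amount": 10 * i})
events.join()              # wait until the worker has processed everything
print(dashboard)           # e.g. {'sale': 5}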
