Bigdata-Mining Data Streams

Unit II: Mining Data Streams

Introduction to Streams Concepts – Stream Data Model and Architecture – Stream Computing – Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Ones in a Window – Decaying Window – Real-time Analytics Platform (RTAP) Applications – Case Studies – Real-Time Sentiment Analysis, Stock Market Predictions.

Introduction
Traditional DBMSs store data that is finite and persistent; the data is available whenever we want it. In big data analytics and data mining, data is often assumed to arrive in streams: if it is not processed immediately, it is lost.
A data stream is a continuous flow of data. Sensor data, network traffic, call center records, satellite images, and data from electric power grids are some popular examples of data streams. Data streams possess several unique properties:
● Infinite
● Massive
● Fast changing
● New classes of data may evolve that are difficult to fit into the existing classes (concept evolution).
● The relation between input data and output data may change over time (concept drift).
Apart from these unique characteristics, there are some potential challenges in data stream mining:
● It is not feasible to manually label all the data points in the stream.
● It is not feasible to store or archive these data streams in a conventional database.
● Concept drift
● Concept evolution
● The speed and huge volume make it difficult to mine the data; only single-scan algorithms are feasible.
● It is difficult to query streams with SQL-based tools due to the lack of schema and structure.

Stream Data Model


A stream data architecture is a framework of software components built to ingest and process large volumes of streaming data from various sources.
A streaming data model processes data immediately as it is generated, and only then stores it. This architecture also includes various additional tools for real-time processing, data manipulation and analysis.
Benefits of stream processing
 Able to deal with infinite or never-ending streams of data
 Real-time processing
 Easy data scalability
🌕 Modern stream processing infrastructure is hyper-
scalable, able to deal with Gigabytes of data per
second with a single stream processor. This allows you
to easily deal with growing data volumes without
infrastructure changes.
The following figure shows a Data Stream Management System (DSMS): multiple input streams enter a stream query processor, which produces continuous results for the user or application and is backed by a limited working storage and an archival storage.
There are four building blocks of a stream architecture:


1. Stream Processor/ Message broker
Takes data coming from various sources, translates it into a standard format, and streams it on an ongoing basis.
Two popular stream processing tools are Apache Kafka and Amazon Kinesis Data Streams.
2. Batch and Real-time ETL tools
ETL stands for Extract, Transform and Load. It is the process of moving huge volumes of unstructured data from one source to another; ETL is basically a data integration process.
ETL tools aggregate data streams from one or more message brokers.
The ETL tool or platform receives queries from users, fetches events from the message queues, and applies the query to generate a result. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream.
A few examples of open-source ETL tools for streaming data
are Apache Storm, Spark Streaming and WSO2 Stream
Processor.
3. Query Engine
After streaming data is prepared for consumption by the stream
processor, it must be analyzed to provide value. Some of the
commonly used data analytics tools are Amazon Athena,
Amazon Redshift, Elasticsearch, Cassandra.
4. Data Storage
Streams may be archived in a large archival store, but it is not feasible to answer queries quickly from the archival store.
A working store of limited capacity is used, into which summaries or parts of streams may be placed and which can be used for answering queries. The working store might be disk, or it might be main memory, depending on how fast we need to process queries.
The advent of low-cost storage technologies paved the way for organizations to store streaming event data.
There are several ways in which events can be stored: in a database or a data warehouse, in the message broker, or in a data lake.
A data lake is the most flexible and inexpensive option for storing event data, but its latency (the time required to retrieve data from storage) is too high for real-time analysis.
Streaming data architectures enable developers to build applications that use both bounded and unbounded data in new ways. For example, Alibaba’s search infrastructure team uses a streaming data architecture powered by Apache Flink to update product details and inventory information in real time. Netflix also uses Flink to support its recommendation engines, and ING, the global bank based in the Netherlands, uses the architecture to prevent identity theft and provide better fraud protection. Other platforms that can accommodate both stream and batch processing include Apache Spark and Apache Storm.
Stream Queries
There are two types of stream queries: standing queries and ad-hoc
queries
 Standing queries
🌕 Permanently executing, produce output at appropriate times.
Consider a temperature sensor bobbing about in the ocean,
sending back to a base station a reading of the surface
temperature each hour. The data produced by this sensor is a
stream of real numbers. In this case we can ask a query: what is the maximum temperature ever recorded by the sensor?
For answering this query we need not store the entire
stream. When a new stream element arrives, we compare it
with the stored maximum, and set the maximum to whichever
is larger. Similarly, if we want the average temperature over
all time, we have only to record two values: the number of
readings ever sent in the stream and the sum of those
readings.
 Ad-hoc queries
🌕 Asked only when particular information is needed; not permanently executing.
🌕 Asked once about the current state of streams. Such queries
are difficult to answer as we are not archiving the entire stream.
🌕 For answering such ad-hoc queries we have to store parts or summaries of the stream.
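To make the standing-query example above concrete, here is a minimal sketch (the class and variable names are illustrative, not from the text) that answers both standing queries, the maximum and the average, while storing only three numbers rather than the stream itself.

```python
class TemperatureStandingQueries:
    """Maintains answers to two standing queries over a sensor stream:
    the maximum reading ever seen and the average of all readings."""

    def __init__(self):
        self.maximum = float("-inf")  # running maximum
        self.count = 0                # number of readings seen
        self.total = 0.0              # sum of readings seen

    def new_reading(self, temperature: float) -> None:
        # Compare the new element with the stored maximum.
        if temperature > self.maximum:
            self.maximum = temperature
        # For the average we only need the count and the sum.
        self.count += 1
        self.total += temperature

    def average(self) -> float:
        return self.total / self.count if self.count else float("nan")


queries = TemperatureStandingQueries()
for reading in [21.5, 22.1, 20.9, 23.4]:
    queries.new_reading(reading)
print(queries.maximum, queries.average())   # 23.4 21.975
```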

 Sliding Window
🌕 Sliding window approach can be used to answer ad-hoc queries.
🌕 Each sliding window stores the most recent n elements of the
stream for some n.
🌕 Or it can contain all the elements that arrived within the last t time units, for example one day.
🌕 The length of the sliding window is specified by its range. The stride specifies the portion of the window that is dropped when the window moves forward.
🌕 2 types of sliding windows
■ time-based
 Range and stride are specified by time intervals.
 For example, a sliding window with range = 10 minutes and stride = 2 minutes produces windows that cover the data of the last 10 minutes; a new window is created every 2 minutes.
■ count-based
 Range and stride are specified in terms of the number of elements (see the sketch after this list).
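A minimal sketch of a count-based sliding window, assuming a range of n elements and a stride of one element (the class name and the sum query are illustrative choices):

```python
from collections import deque

class CountBasedWindow:
    """Keeps only the most recent `range_n` stream elements."""

    def __init__(self, range_n: int):
        self.window = deque(maxlen=range_n)  # old elements fall off the left

    def add(self, element):
        self.window.append(element)

    def query_sum(self):
        # Any ad-hoc query is answered from the window contents only.
        return sum(self.window)


w = CountBasedWindow(range_n=3)
for x in [4, 8, 15, 16, 23]:
    w.add(x)
print(list(w.window), w.query_sum())   # [15, 16, 23] 54
```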

Sampling Data Stream


As mentioned earlier, a data stream is a massive, effectively infinite dataset, so it is not possible to store the entire stream. A typical question while mining a data stream is: how can we answer critical queries without storing the entire stream? In some cases we can get an answer from a sample of the stream, without examining the entire stream. Here we have to keep two things in mind: first, the sample should be unbiased; second, the sample should be able to answer the queries of interest. Choosing the right sample is critical, because carelessly chosen samples can distort the results of the query. While sampling we must therefore watch out for some pitfalls.
An Example
Consider a search engine like Google that receives a stream of queries. Google wants to study the behavior of its users. A typical question that can be asked is: “What fraction of the typical user’s queries were repeated over the past month?” Assume we have space to store only 1/10th of the stream elements.
The obvious approach would be to generate a random number, say
an integer from 0 to 9, in response to each search query. Store the tuple if
and only if the random number is 0. If we do so, each user has, on
average, 1/10th of their queries stored. Statistical fluctuations will
introduce some noise into the data, but if users issue many queries, the
law of large numbers will assure us that most users will have a fraction
quite close to 1/10th of their queries stored.
This scheme gives us the wrong answer to the query asking for
the average number of duplicate queries for a user. Suppose a user has
issued s search queries one time in the past month, d search queries twice,
and no search queries more than twice. If we have a 1/10th sample of queries, we shall see in the sample for that user an expected s/10 of the
search queries issued once. Of the d search queries issued twice, only
d/100 will appear twice in the sample; that fraction is d times the
probability that both occurrences of the query will be in the 1/10th
sample. Of the queries that appear twice in the full stream, 18d/100 will
appear exactly once.
[The sample will contain s/10 of the singleton queries and 19d/100 of the duplicate queries at least once, but only d/100 pairs of duplicates: d/100 = 1/10 × 1/10 × d.]
To see why, note that 18/100 is the probability that one of the two occurrences will be in the 1/10th of the stream that is selected, while the other is in the 9/10th that is not selected.
[Of the d “duplicates”, 18d/100 appear exactly once: 18d/100 = ((1/10 × 9/10) + (9/10 × 1/10)) × d.]
The correct answer to the query about the fraction of repeated searches is d/(s + d). However, the answer we shall obtain from the sample is d/(10s + 19d): in the sample, d/100 of the queries appear twice, while s/10 + 18d/100 appear once. Thus, the fraction appearing twice in the sample is d/100 divided by d/100 + s/10 + 18d/100, which simplifies to d/(10s + 19d).

Solution
A solution for the above scenario is sample the users.
 Pick 1/10th of users and take all their searches in the sample.
 Use a hash function that hashes the user name or user id
uniformly into 10 buckets.
 Each time a search query arrives in the stream, we look up the
user to see whether or not they are in the sample. If so, we add
this search query to the sample, and if not, then not.
 By using a hash function, one can avoid keeping the list of users.
 Hash each user name to one of ten buckets, 0 through 9. If the user
hashes to bucket 0, then accept this search query for the sample, and
if not, then not.
General Solution
 Stream of tuples with keys.
 The key is some subset of each tuple’s components, e.g., if the tuple is (user, search, time), the key is the user. The choice of key depends on the application.
 To get a sample of size a/b: hash each tuple’s key uniformly into b buckets, and pick the tuple if its hash value is less than a (see the sketch below).
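A minimal sketch of this general key-based sampling scheme, hedged on a few assumptions: the tuple layout (user, search, time), the use of MD5 as a stable hash, and the example values of a and b are all illustrative.

```python
import hashlib

def key_bucket(key: str, b: int) -> int:
    """Hash a key uniformly into b buckets using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % b

def sample_stream(tuples, a: int, b: int):
    """Keep a tuple iff its key hashes into one of the first a of b buckets,
    giving roughly an a/b sample of keys, with all of each key's tuples
    either kept or dropped together."""
    for user, search, time in tuples:
        if key_bucket(user, b) < a:        # the key here is the user
            yield (user, search, time)

stream = [("alice", "flink", 1), ("bob", "kafka", 2), ("alice", "spark", 3)]
print(list(sample_stream(stream, a=1, b=10)))
```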

Filtering Streams
Filtering involves selecting the stream elements that satisfy a particular criterion. There are different methods for selecting elements. The process is hard when the criterion requires testing membership in a set.
 Each element of the data stream is a tuple.
 We are given a list of keys S.
 Determine which tuples of the stream have keys in S.
Applications of filtering
 Email spam filtering
◦ We know 1 billion “good” email addresses
◦ If an email comes from one of these, it is NOT spam
 Publish-subscribe systems
◦ You are collecting lots of messages (news articles)
◦ People express interest in certain sets of keywords
◦ Determine whether each message matches user’s interest
Example
Suppose we want to create an account in Gmail. Gmail maintains a list of the usernames of those persons who already have an account. When we give our preferred username, we may get a message that the “username already exists”. Gmail checks the availability of a username by searching the millions of usernames registered with it. There are several methods to do the search.
 Linear Search : obviously this is a bad idea because there may be
billions of accounts.
 Binary search: the usernames must be stored in sorted order. Even then it may not be fast enough to search billions of entries.
Solution is Bloom Filter Technique
Bloom Filter Technique
 A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
 The Bloom filter method uses hashing.
 A hash function takes an input and outputs a fixed-length identifier that is used to identify the input.
Working of Bloom Filter
 An empty Bloom filter is a bit array of m bits, all set to zero.
 We need k hash functions to calculate the hashes for a given input.
 When we want to add an item to the filter, the bits at the k indices h1(x), h2(x), … hk(x) are set, where the indices are calculated using the hash functions.
 Example – Suppose we want to enter “good” in the filter. We are using 3 hash functions and a bit array of length 10, all set to 0 initially. First we’ll calculate the hashes as follows:
h1(“good”) % 10 = 1
h2(“good”) % 10 = 4
h3(“good”) % 10 = 7

[These outputs are random, for explanation only.]

Now we will set the bits at indices 1, 4 and 7 to 1.

 Again we want to enter “bad”; similarly we’ll calculate the hashes:
h1(“bad”) % 10 = 3
h2(“bad”) % 10 = 5
h3(“bad”) % 10 = 4

 Set the bits at indices 3, 5 and 4 to 1.

Now if we want to check whether a username is present in the list or not, we do the reverse process.
 We calculate the respective hashes using h1, h2 and h3 and check if all these indices are set to 1 in the bit array.
 If all the bits are set, then we can say that the username is probably present.
 If any of the bits at these indices is 0, then the username is definitely not present.
 False Positive in Bloom Filters
The result “probably present” carries uncertainty. Let’s understand this with an example. Suppose we want to check whether “cat” is present or not. We’ll calculate the hashes using h1, h2 and h3:
h1(“cat”) % 10 = 1
h2(“cat”) % 10 = 3
h3(“cat”) % 10 = 7

If we check the bit array, the bits at these indices are set to 1, but we know that “cat” was never added to the filter. The bits at indices 1 and 7 were set when we added “good”, and the bit at index 3 was set when we added “bad”.

So, because the bits at the calculated indices were already set by some other items, the Bloom filter erroneously claims that “cat” is present, generating a false positive result.

 By controlling the size of the Bloom filter we can control the probability of getting false positives.

 Use more hash functions and a larger bit array.

 Probability of false positives: let m be the size of the bit array, k be the number of hash functions and n be the number of elements expected to be inserted in the filter. Then the probability of a false positive p can be calculated (approximately) as:

p ≈ (1 − e^(−kn/m))^k
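A small sketch that evaluates this approximation; the helper for choosing k ≈ (m/n) ln 2, the usual rule of thumb for the number of hash functions, is an added detail not stated above.

```python
import math

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate Bloom filter false positive probability:
    p ≈ (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> int:
    """Rule-of-thumb number of hash functions minimizing the false positive rate."""
    return max(1, round((m / n) * math.log(2)))

m, n = 10_000, 1_000          # 10,000 bits, 1,000 expected elements
k = optimal_k(m, n)           # about 7 hash functions
print(k, false_positive_rate(m, n, k))   # roughly 0.008
```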

Generalization
The Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n buckets, corresponding to the n bits of the bit array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while rejecting most of the stream elements whose keys are not in S.
 To initialize the bit array, begin with all bits 0.
 Take each key value in S and hash it using each of the k hash functions.
 Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
 To test a key K that arrives in the stream, check that all of h1(K), h2(K), . . . , hk(K) are 1’s in the bit array.
 If all are 1’s, then let the stream element through. If one or more of these bits are 0, then K could not be in S, so reject the stream element.

The hash functions used in Bloom filters should be independent and uniformly distributed, and they should be as fast as possible.
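A minimal sketch of the Bloom filter just described; simulating the k hash functions by salting a single SHA-256 hash is an illustrative choice, not part of the original description.

```python
import hashlib

class BloomFilter:
    """Bit array of n bits with k salted hash functions."""

    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits          # initially all 0's

    def _indices(self, key: str):
        # Derive k bucket indices by salting the key with the function number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key: str) -> None:
        for idx in self._indices(key):
            self.bits[idx] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely not in S"; True means "probably in S".
        return all(self.bits[idx] for idx in self._indices(key))


bf = BloomFilter(n_bits=1000, k=5)
for name in ["good", "bad"]:
    bf.add(name)
print(bf.might_contain("good"), bf.might_contain("cat"))  # True, almost surely False
```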

Applications of Bloom filters

 Medium uses Bloom filters for recommending posts to users by filtering out posts the user has already seen.
 Quora implemented a sharded Bloom filter in the feed backend to filter out stories that people have seen before.
 The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
 Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce the disk lookups for non-existent rows or columns.

Counting Distinct Elements in a Stream


The task is to count the number of distinct elements in a data stream that contains repeated elements.

The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of sensor/RFID networks.

Definition
A stream of elements {x1, x2, . . . , xs} with repetitions, and an integer m. Let n be the number of distinct elements, namely n = |{x1, x2, . . . , xs}|, and let these distinct elements be {e1, e2, . . . , en}.

Objective: find an estimate of n using only m storage units, where m ≪ n.

 The data stream consists of elements chosen from a universe of size N.
 Maintain a count of the number of distinct elements seen so far.

Solution:
 Maintain the set of elements seen so far; that is, keep a hash table of all the distinct elements seen so far.
Counting distinct elements is very important in many
practical applications. For example:
 How many different words are found among the Web
pages being crawled at a site?
◦ Unusually low or high numbers could indicate
artificial pages (spam?)

 How many different Web pages does each


customer request in a week?
 How many distinct products have we sold in the
last week?
Example
Suppose Google wants to gather statistics on the unique users it has seen in each month. Google does not require a unique login to issue a search query, so the only way to recognize users is to identify the IP addresses from which the queries are issued. In this case the 4 billion IP addresses serve as the universal set.
Solution
 Keep the list of all elements seen so far in a hash table or a search
tree in the main memory.
 When a new query arrives, check whether the IP address from which the query was issued is in the list or not.
 If it is not there, add the new IP address; otherwise discard it.
The above solution works well as long as the set of distinct elements fits in main memory; when it does not, we need to estimate the count instead.
Flajolet-Martin Algorithm
This algorithm approximates the number of distinct elements in a stream
or a database in one pass.
Suppose the stream consists of n elements, m of which are unique.
 Then the time complexity of the algorithm is O(n).
 The algorithm requires O(log m) memory.
Algorithm
For a given input stream and a hash function h:
 Step 1: Apply the hash function h(x) to each element x in the stream.
 Step 2: Write the binary equivalent of each hash value obtained.
 Step 3: Count the number of trailing zeros (zeros at the end) of each binary hash value; this is called the tail length.
 Step 4: Note the maximum number of trailing zeros observed. Let this number be r.
 Step 5: Estimate the number of distinct elements as R = 2^r.
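A minimal Flajolet-Martin sketch, assuming a single MD5-based hash function for illustration (in practice many hash functions are used and their estimates are combined):

```python
import hashlib

def tail_length(value: int) -> int:
    """Number of trailing zero bits in the binary representation."""
    if value == 0:
        return 0          # convention chosen here: treat h(x) = 0 as tail length 0
    return (value & -value).bit_length() - 1

def flajolet_martin(stream, num_bits: int = 32) -> int:
    """Estimate the number of distinct elements as 2^r, where r is the
    maximum tail length among the hashed stream elements."""
    r = 0
    for x in stream:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) % (2 ** num_bits)
        r = max(r, tail_length(h))
    return 2 ** r

print(flajolet_martin([3, 1, 4, 1, 5, 9, 2, 6, 5]))
```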
Example
Consider a data stream of integers, 3, 1, 4, 1, 5, 9, 2, 6, 5. Determine the
tail length for each stream element and the resulting estimate of the
number of distinct elements if the hash function is:
(a) h(x) = 2x + 1 mod 32.
(b) h(x) = 3x + 7 mod 32.
(c) h(x) = 4x mod 32.
(Treat the result of the hash functions as a 5-bit binary integer.)
Solution
Since the data stream is small we can readily count the number of distinct elements: there are 7 distinct elements.
a) Using the hash function h(x) = 2x + 1 mod 32:
h(3) = 7 (00111), h(1) = 3 (00011), h(4) = 9 (01001), h(5) = 11 (01011), h(9) = 19 (10011), h(2) = 5 (00101), h(6) = 13 (01101).
Every hash value is odd, so every binary value ends in 1 and every tail length is 0. Hence the maximum number of trailing zeros is r = 0, and the estimated count of distinct elements is 2^r = 2^0 = 1.

b) Using the hash function h(x) = 3x + 7 mod 32:
h(3) = 16 (10000), h(1) = 10 (01010), h(4) = 19 (10011), h(5) = 22 (10110), h(9) = 34 mod 32 = 2 (00010), h(2) = 13 (01101), h(6) = 25 (11001).
The tail lengths are 4, 1, 0, 1, 1, 0 and 0, so the maximum number of trailing zeros is r = 4, and the estimated count of distinct elements is 2^r = 2^4 = 16.

c) Using the hash function h(x) = 4x mod 32:
h(3) = 12 (01100), h(1) = 4 (00100), h(4) = 16 (10000), h(5) = 20 (10100), h(9) = 36 mod 32 = 4 (00100), h(2) = 8 (01000), h(6) = 24 (11000).
The tail lengths are 2, 2, 4, 2, 2, 3 and 3, so the maximum number of trailing zeros is r = 4, and the estimated count of distinct elements is 2^r = 2^4 = 16.

A single hash function therefore gives only a rough power-of-two estimate (1 or 16 here, against the true count of 7). In practice the Flajolet-Martin algorithm uses many hash functions and combines their estimates, for example by taking the median of averages over groups of estimates.

Estimating Moments
Estimating moments involves computing the distribution of the frequencies of the different elements in the stream.
Definition of Moments

Consider a data stream of elements from a universal set. Let mi be the number of occurrences of the i-th element, for any i. Then the k-th moment of the stream is the sum over all i of (mi)^k.

 The 0th moment is the count of distinct elements in the stream.
 The 1st moment is the sum of the mi’s, that is, the length of the stream.
 The 2nd moment is the sum of the squares of the mi’s. It is also called the “surprise number”.

Example
Suppose we have a stream of length 100 in which eleven different elements appear. The most even distribution of these eleven elements would have one appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each. Then the surprise number would be 90^2 + 10 × 1^2 = 8110.
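A quick way to verify these numbers, and to compute any k-th moment exactly when the stream fits in main memory (a simple sketch, with made-up element names):

```python
from collections import Counter

def kth_moment(stream, k: int) -> int:
    """Exact k-th moment: sum of (m_i)^k over the element counts m_i."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

# The two distributions from the example above.
even_stream = ["a"] * 10 + [f"e{i}" for i in range(10) for _ in range(9)]
skewed_stream = ["a"] * 90 + [f"e{i}" for i in range(10)]
print(kth_moment(even_stream, 2), kth_moment(skewed_stream, 2))   # 910 8110
```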
As explained in the above example moments of any order can be
computed as long as the stream fits in the main memory. In the case
where the stream does not fit in the memory, we can compute the kth
moment by keeping a limited number of values and computing an
estimate from these values. We can use the following algorithm for
computing the second moment.
The Alon-Matias-Szegedy Algorithm for Second Moments
Even if there is not enough storage space, the second moment can still be estimated using the AMS algorithm.
Algorithm
Consider a stream of length n. Without storing all the elements, maintain some sample variables.
For each variable X:
 Choose a position in the stream between 1 and n, uniformly at random.
 Assign X.element to be the element found at that position.
 Assign X.value = 1 for that element.
 Continue scanning the stream and add 1 to X.value each time another occurrence of X.element is encountered.
 Derive an estimate of the second moment from the variable X as n(2 · X.value − 1).
Calculate the average of all the estimates.

In general, for any k ≥ 2, the k-th moment can be estimated using n(v^k − (v − 1)^k), where v is the X.value for some variable X in the stream.
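A minimal AMS sketch for the second moment, under two simplifying assumptions: the stream length n is known in advance, and the sample positions are drawn uniformly at random up front.

```python
import random
from collections import Counter

def ams_second_moment(stream, num_vars: int = 50) -> float:
    """Estimate the 2nd moment as the average of n * (2 * X.value - 1)
    over variables X anchored at randomly chosen stream positions."""
    n = len(stream)
    positions = sorted(random.randrange(n) for _ in range(num_vars))
    variables = []                      # each variable is [element, value]
    pos_iter = iter(positions)
    next_pos = next(pos_iter, None)
    for i, x in enumerate(stream):
        while next_pos == i:            # start a new variable at this position
            variables.append([x, 0])
            next_pos = next(pos_iter, None)
        for var in variables:           # count later occurrences of each X.element
            if var[0] == x:
                var[1] += 1
    return sum(n * (2 * value - 1) for _, value in variables) / len(variables)

stream = list("abcabcacbabcaaabcbca")
exact = sum(m ** 2 for m in Counter(stream).values())
print("exact:", exact, "estimate:", ams_second_moment(stream))
```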

Counting Ones in a Window


The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
 Designed to count the number of 1s in a data stream window.
 Allows us to estimate the number of 1s in a window with an error of no more than 50%.
◦ Each bit that arrives in the stream has a timestamp, equal to the position of the bit in the stream.
▪ The first bit has a timestamp of 1, the second bit a timestamp of 2, and so on.
◦ We only need to distinguish positions within the window of length N, so a timestamp can be recorded modulo N and represented with log2 N bits.
◦ Divide the window into buckets, each consisting of:
▪ The timestamp of its right (most recent) end.
▪ The number of 1’s in the bucket. This number must be a power of 2 and is referred to as the size of the bucket.
 Example: for the bucket 1001011 the number of 1s is 4, hence the bucket size is 4.
 Rules that must be followed when representing a stream by buckets:
◦ The right end of a bucket is always a position with a 1, and every position with a 1 belongs to exactly one bucket.
◦ There are one or two buckets of any given size, and every size is a power of 2.
◦ The sizes of the buckets must not decrease as we move to the left (towards older bits).

Example
Consider the bit stream,
..1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
The aim is to divide the stream into buckets that satisfy the DGIM rules.

Maintaining DGIM rules


Whenever a new bit enters, we may have to modify the buckets so as
to represent the window and continue to maintain the DGIM rules.
 When a new bit enters:
◦ Check the leftmost bucket. If its timestamp has now reached
the current timestamp minus N , then this bucket no longer has
any of its 1’s in the window. Therefore, drop it from the list of
buckets.
◦ Consider whether the new bit is 0 or 1.
▪ If it is 0, then no further change to the buckets is needed.
▪ If the new bit is a 1,
1. Create a new bucket with the current timestamp and size 1.

2. If there are more than 2 buckets of size 1, combine the


earliest two buckets of size 1.
3. To combine any two adjacent buckets of the same size,
replace them by one bucket of twice the size.
4. The timestamp of the new bucket is the
timestamp of the rightmost of the two buckets.
▪ The time complexity of processing each new bit is O(log N).
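A compact sketch of DGIM bucket maintenance, assuming buckets are kept as (timestamp, size) pairs ordered from newest to oldest; the estimate adds the full sizes of all buckets except the oldest, plus half of the oldest (rounded up to at least 1).

```python
class DGIM:
    """Approximate count of 1s among the last N bits of a 0/1 stream."""

    def __init__(self, n: int):
        self.n = n
        self.time = 0
        self.buckets = []            # (timestamp, size) pairs, newest first

    def add_bit(self, bit: int) -> None:
        self.time += 1
        # Drop the oldest bucket once it lies entirely outside the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit == 0:
            return                   # a 0 never changes the buckets
        # A 1 starts a new bucket of size 1 with the current timestamp.
        self.buckets.insert(0, (self.time, 1))
        # Keep at most two buckets of each size: merge the two OLDER of any
        # three equal-size buckets into one of twice the size, keeping the
        # more recent of their two timestamps; merges may cascade leftwards.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                newer_ts = self.buckets[i + 1][0]
                merged_size = 2 * self.buckets[i + 1][1]
                self.buckets[i + 1:i + 3] = [(newer_ts, merged_size)]
            else:
                i += 1

    def estimate_ones(self) -> int:
        if not self.buckets:
            return 0
        *newer, oldest = self.buckets
        return sum(size for _, size in newer) + max(1, oldest[1] // 2)


d = DGIM(n=16)
for b in [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0]:
    d.add_bit(b)
print(d.estimate_ones())
```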

Decaying Windows
This approach is used for finding the most popular elements in the stream. It can be considered an extension of the DGIM algorithm, with the aim of weighting recent elements more heavily. Typical applications:
 Recording the popularity of items sold at Amazon.
 The rate at which different Twitter users tweet.

Let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first element to arrive and at is the current element. Let c be a small constant, such as 10^−6 or 10^−9.

 Define the exponentially decaying window for this stream to be the sum

∑_{i=0}^{t−1} a_{t−i} (1 − c)^i

When a new element a_{t+1} arrives at the stream input, all we need to do is:

1. Multiply the current sum by 1 − c.

2. Add a_{t+1}.

Decaying Window Algorithm

Identify the most popular elements (trending, in other words) in an


incoming data stream.
 Tracks the most recurring elements in an incoming data stream.
 Discounts any random spikes or spam requests that might have
boosted an
element’s frequency.
 Assign a score or weight to every element of the incoming data
stream.
 Calculate the aggregate sum for each distinct element by adding all
the weights assigned to that element.
 The element with the highest total score is listed as trending or the
most popular.

1. Assign each element a weight/score.
2. Calculate the aggregate sum for each distinct element by adding all the weights assigned to that element.
 Assign more weight to newer elements.
 For a new element, first reduce the weight of all the existing elements by the constant factor (1 − c) and then assign the new element its own weight.
 The aggregate sum of the exponentially decaying weights can be calculated using the following formula:

∑_{i=0}^{t−1} a_{t−i} (1 − c)^i

1. Multiply the current sum/score by the value (1 − c).
2. Add the weight corresponding to the new element.

Weight decays exponentially over time

For example, consider the sequence of Twitter tags below:

fifa, ipl, fifa, ipl, ipl, ipl, fifa

Also, let’s say each element in the sequence has a weight of 1, and let c be 0.1. At every step the current score of a tag is multiplied by (1 − c) = 0.9, and 1 is added only when the incoming tag matches. The aggregate score of each tag at the end of the above stream is calculated as below:

fifa
fifa - 0 * (1-0.1) + 1 = 1
ipl - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is different from fifa)
fifa - 0.9 * (1-0.1) + 1 = 1.81 (adding 1 because the current tag is fifa)
ipl - 1.81 * (1-0.1) + 0 = 1.629
ipl - 1.629 * (1-0.1) + 0 = 1.4661
ipl - 1.4661 * (1-0.1) + 0 = 1.3195
fifa - 1.3195 * (1-0.1) + 1 = 2.1875

ipl
fifa - 0 * (1-0.1) + 0 = 0
ipl - 0 * (1-0.1) + 1 = 1
fifa - 1 * (1-0.1) + 0 = 0.9 (adding 0 because the current tag is different from ipl)
ipl - 0.9 * (1-0.1) + 1 = 1.81
ipl - 1.81 * (1-0.1) + 1 = 2.629
ipl - 2.629 * (1-0.1) + 1 = 3.3661
fifa - 3.3661 * (1-0.1) + 0 = 3.0295

At the end of the sequence the score of fifa is about 2.19, while the score of ipl is about 3.03, so ipl is trending more than fifa. Even though the two tags occur a similar number of times in the input, their scores are quite different.
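A small sketch that maintains the decaying score of every tag in a single pass and reproduces the numbers above (c = 0.1 as in the example):

```python
def decaying_scores(stream, c: float = 0.1):
    """Maintain an exponentially decaying score for every distinct element."""
    scores = {}
    for tag in stream:
        # Step 1: decay every existing score by (1 - c).
        for key in scores:
            scores[key] *= (1 - c)
        # Step 2: add the weight (1) of the arriving element to its own score.
        scores[tag] = scores.get(tag, 0.0) + 1.0
    return scores

tags = ["fifa", "ipl", "fifa", "ipl", "ipl", "ipl", "fifa"]
print(decaying_scores(tags))   # {'fifa': ~2.19, 'ipl': ~3.03}
```

In practice, rather than touching every score on each arrival, scores that drop below a small threshold (for example 1/2) can simply be discarded, which keeps the number of tracked elements manageable.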
Advantages of the Decaying Window Algorithm:
1. Sudden spikes or spam data are taken care of.
2. Newer elements are given more weight by this mechanism, to achieve the right trending output.
