MC5502 Bigdata Unit 2 Notes
Part – A
1. What factors lead to Concept Drift? [CO3-L1]
The constantly changing patterns in data streams can affect the induced data mining
models in multiple ways, such as:
Changes in the class label of an existing data variable, and
Changes in the available feature information.
Both of these lead to a phenomenon called Concept Drift.
2. How are continuous queries evaluated? [CO3-L1]
Continuous queries are evaluated continuously as data streams continue to arrive. The answer to
a continuous query is produced over time, always reflecting the stream data seen so far.
Continuous query answers may be stored and updated as new data arrives, or they may be
produced as data streams themselves.
3. What are called Ad-Hoc Queries? [CO3-L1]
An ad-hoc query is issued online after the data streams have already begun. Ad-hoc queries can
be either one-time queries or continuous queries. Ad-hoc queries are basically questions asked
once about the current state of a stream or streams.
4. List out few challenges of data mining algorithms. [CO3-L2]
Data streams pose several challenges for data mining algorithm design. The most
important of them are:
Algorithms must make use of limited resources (time and memory).
Algorithms must deal with data whose distribution changes over time.
5. Why traditional data mining algorithms could not be used on data streams? [CO3-L1]
Many traditional data mining algorithms can be modified to work with larger datasets, but they
cannot handle continuous supply of data.
If a traditional algorithm has learnt and induced a model of the data seen until now, it cannot
immediately update the model when new information keeps arriving at continuous intervals.
Instead, the entire training process must be repeated with the new examples included.
6. What are one time queries? [CO3-L1]
One-time queries are queries that are evaluated once over a point-in-time snapshot of the data
set, with the answer returned to the user. For example, a query that reports the current price of a
stock at the moment it is asked is a one-time query.
7. What are continuous queries? [CO3-L1]
Continuous queries are evaluated continuously as data streams continue to arrive. The answer to
a continuous query is produced over time, always reflecting the stream data seen so far.
Continuous query answers may be stored and updated as new data arrives, or they may be
produced as data streams themselves.
8. What is a predefined query? [CO3-L1]
A pre-defined query is one that is supplied to the DSMS before any relevant data has arrived.
Pre-defined queries are most commonly continuous queries.
9. List out the major issues in Data Stream Query Processing. [CO3-L2]
The major issues in Data Stream Query Processing are as follows,
Unbounded Memory Requirements
Approximate Query Answering
Sliding Windows
Batch Processing, Sampling and Synopses
Blocking Operators
10. What is Reservoir Sampling? [CO3-L1]
Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from
a list of n items, where n is either a very large or unknown number. Typically n is large enough
that the list doesn’t fit into main memory; for example, the list of search queries submitted to Google.
11. How is Biased Reservoir Sampling different from Reservoir Sampling? [CO3-L1]
Biased reservoir sampling uses a bias function to regulate the sampling from the stream. This bias
gives a higher probability of selecting data points from recent parts of the stream as compared to
the distant past. The bias function is quite effective since it regulates the sampling in a smooth way,
so that queries over recent horizons are resolved more accurately.
12. What does the term “Filtering a Data Stream” mean? [CO3-L1]
“Filtering” means observing an infinite stream of data, looking at each of its items, and
deciding whether the item is of interest and should be stored for further evaluation.
Hashing has been the popular tool for designing algorithms that approximate some value that
we would like to maintain.
13. What is a Bloom Filter? [CO3-L1]
A Bloom filter is a space-efficient probabilistic data structure. It is used to test whether an
element is a member of a set. False positive matches are possible, but false negatives are not,
thus a Bloom filter has a 100% recall rate.
14. What is a cardinality estimation problem? [CO3-L1]
The count-distinct problem, also known as the cardinality estimation problem, is the problem of
finding the number of distinct elements in a data stream with repeated elements. This is a well-
known problem with numerous applications.
15. What is Flajolet–Martin algorithm used for? [CO3-L1]
The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements
in a stream with a single pass and space-consumption which is logarithmic in the maximum
number of possible distinct elements in the stream.
16. How are “moments” estimated? [CO3-L1]
The method of moments is a method of estimation of population parameters. One starts with
deriving equations that relate the population moments (i.e., the expected values of powers of the
random variable under consideration) to the parameters of interest.
17. What is called the decay of insight? [CO3-L1]
The length of time that analytic insight has value is rarely considered in big data and analytics
projects.
The concept of the half-life of insight can be used to understand the expectation of the magnitude
of insight after a period of time. It gives the expectation M(t) of the magnitude of the insight after
time t.
18. What is Real-Time Analysis? [CO3-L1]
Real-time analytics is the use of, or the capacity to use, all available enterprise data and resources
when they are needed. It consists of dynamic analysis and reporting, based on data entered into a
system less than one minute before the actual time of use. Real-time analytics is also known as
real-time data analytics.
19. What is a Data Stream Management System? [CO3-L1]
A Data stream management system (DSMS) is a computer program to manage continuous data
streams. It is similar to a database management system (DBMS), which is, however, designed for
static data in conventional databases. A DSMS also offers a flexible query processing so that the
information need can be expressed using queries.
Part – B
Reservoir Sampling
Many mining algorithms can be applied if only we can draw a representative sample of the data
from the stream. Imagine there is a really large stream of data elements.
The goal is to efficiently return a random sample of elements evenly distributed from the
original stream.
A simple way is to generate k random integers between 0 and N − 1 (where N is the length of the
stream and k the desired sample size), then retrieve the elements at those indices, and you have your answer.
To make this sampling without replacement, we simply need to note whether or not our sample
already has that random index and, if so, choose a new random number.
This can make the algorithm very expensive if the sample size is very close to N.
Further, in the case of a data stream we do not know N, the size of the stream, in advance, and we
cannot index directly into it.
We could count it, but that requires making two passes over the data, which is not possible. Thus, the
general sampling problem in the case of a data stream is, “How to ensure such a sample is
drawn uniformly, given that the stream is continuously growing?”
For example, if we want to draw a sample of 100 items and the stream has length of only 1000,
then we want to sample roughly one in ten items. But if a further million items arrive, we must
ensure that the probability of any item being sampled is more like one in a million. If we retain
the same 100 items, then this cannot be considered a representative sample.
Several solutions are possible to ensure that we continuously maintain a uniform sample from
the stream.
Reservoir-based methods were originally proposed for one-pass access of data from magnetic
storage devices such as tapes. Similarly to the case of streams, the number of records is not
known in advance and the sampling must be performed dynamically as the records from the tape
are read.
Assume that we wish to obtain an unbiased sample of size k from the data stream, and so we maintain
a reservoir of size k. The first k points in the data stream are added to the
reservoir for initialization.
Then, when the nth point from the data stream is received (for n > k), it is added to the reservoir with
probability k/n.
In order to make room for the new point, one of the current k points in the reservoir is chosen
with equal probability and subsequently removed.
In the simplest case, if we draw a sample of size k = 1, we initialize the sample with the first item from the stream.
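As a minimal illustration (not part of the original notes), the Python sketch below implements this reservoir scheme, often known as Algorithm R; the stream used in the example is just a range of integers standing in for real data.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k over a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Initialization: the first k points fill the reservoir.
            reservoir.append(item)
        else:
            # The nth point is kept with probability k/n ...
            j = random.randrange(n)          # uniform in {0, ..., n-1}
            if j < k:
                # ... and it replaces a uniformly chosen resident point.
                reservoir[j] = item
    return reservoir

# Example: sample 5 items uniformly from a stream of 10,000 integers.
print(reservoir_sample(range(10_000), k=5))
```

At any moment the reservoir is a uniform sample of everything seen so far, using only O(k) memory and a single pass over the stream.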
Biased Reservoir Sampling
In many cases, the stream data may evolve over time, and the corresponding data mining or
query results may also change over time. Thus, the results of a query over a more recent window
may be quite different from the results of a query over a more distant window.
Similarly, the entire history of the data stream may not be relevant for use in a repetitive data
mining application such as classification.
The simple reservoir sampling algorithm can be adapted to sample from a moving window
over data streams. This is useful in many data stream applications where a small amount of
recent history is more relevant than the entire previous stream.
However, this can sometimes be an extreme solution, since for some applications we may need
to sample from varying lengths of the stream history.
While recent queries may be more frequent, it is also not possible to completely disregard
queries over more distant horizons in the data stream.
Biased reservoir sampling uses a bias function to regulate the sampling from the stream. This bias
gives a higher probability of selecting data points from recent parts of the stream as compared to
the distant past. The bias function is quite effective since it regulates the sampling in a smooth way,
so that queries over recent horizons are resolved more accurately.
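As an illustrative sketch (not the notes' own algorithm), one simple way to realize an exponential, memoryless bias is shown below: the reservoir capacity 1/lam and the "replace with probability equal to the current fill fraction" rule are assumptions of this sketch, chosen so that older points become progressively more likely to have been overwritten.

```python
import random

def biased_reservoir_sample(stream, lam=0.01):
    """
    Sketch of biased reservoir sampling with an exponential (memoryless)
    bias: recent points are more likely to survive in the sample than
    older ones.  The bias rate lam implies a maximum reservoir capacity.
    """
    capacity = int(1.0 / lam)   # maximum reservoir size implied by the bias rate
    reservoir = []
    for item in stream:
        fill_fraction = len(reservoir) / capacity
        if random.random() < fill_fraction:
            # Replace a uniformly chosen old point with the new one.
            reservoir[random.randrange(len(reservoir))] = item
        else:
            # Otherwise the new point is simply added and the reservoir grows.
            reservoir.append(item)
    return reservoir

sample = biased_reservoir_sample(range(100_000), lam=0.01)
print(len(sample), sample[:10])
```

Because every arriving point is inserted, while replacements hit older residents uniformly, the probability that a point from the distant past is still present decays roughly exponentially with its age.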
Concise Sampling
The size of the reservoir is often restricted by the available main memory. It is
desirable to increase the sample size within the available main memory restrictions.
For this purpose, the technique of concise sampling is quite effective. Concise sampling uses the
fact that the number of distinct values of an attribute is often significantly smaller than the size of
the data stream.
In many applications, sampling is performed based on a single attribute of multi-dimensional
data. For example, for customer data on an e-commerce site, sampling may be done based only on
customer ids. The number of distinct customer ids is definitely much smaller than the size of the
entire stream.
The repeated occurrence of the same value can be exploited in order to increase the sample size
beyond the relevant space restrictions.
We note that when the number of distinct values in the stream is smaller than the main memory
limitations, the entire stream can be maintained in main memory, and therefore, sampling may
not even be necessary.
For current systems, in which memory sizes may be several gigabytes, very large sample
sizes can be held in main memory as long as the number of distinct values does not exceed the
memory constraints. On the other hand, for more challenging streams with an unusually large
number of distinct values, we can use the following approach.
1. The sample S is maintained as a set of (value, count) pairs.
2. For those pairs in which the value of count is one, we do not maintain the count explicitly, but
we maintain the value as a singleton.
3. The number of elements in this representation is referred to as the footprint, and it is bounded
above by the available space.
4. We use a threshold parameter τ that defines the probability of sampling from the
stream. The value of τ is initialized to 1.
5. As the points in the stream arrive, we add them to the current sample with probability 1/τ.
6. We note that if the corresponding (value, count) pair is already included in the set S, then we
only need to increment the count by 1. Therefore, the footprint size does not increase.
7. On the other hand, if the value of the current point is distinct from all the values encountered
so far, or if it exists as a singleton, then the footprint increases by 1. This is because either a new
singleton needs to be added, or a singleton gets converted to a (value, count) pair with a count of 2.
8. The increase in footprint size may potentially require the removal of an element from sample S
in order to make room for the new insertion.
9. When this situation arises, we pick a new (higher) value of the threshold τ', and apply this
threshold to the footprint in repeated passes.
10. In each pass, we reduce the count of a value with probability τ/τ', until at least one (value, count)
pair reverts to a singleton or a singleton is removed.
11. Subsequent points from the stream are sampled with probability 1/τ'.
In practice, τ' may be chosen to be about 10% larger than the value of τ. The choice of different
values of τ provides different trade-offs between the average (true) sample size and the
computational requirements of reducing the footprint size.
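A hedged Python sketch of this idea follows; max_footprint and the 10% threshold increase are illustrative parameters, and the thinning pass is a simplified stand-in for steps 9–11 above.

```python
import random

def concise_sample(stream, max_footprint=100):
    """
    Sketch of concise sampling: the sample is kept as a set of
    (value, count) pairs; values seen once are stored as singletons.
    tau is the sampling threshold: points enter with probability 1/tau.
    """
    tau = 1.0
    counts = {}                       # value -> count (count 1 == singleton)

    def footprint():
        # Singletons cost one slot, (value, count) pairs cost two.
        return sum(1 if c == 1 else 2 for c in counts.values())

    for item in stream:
        if item in counts:
            counts[item] += 1         # no increase in footprint
        elif random.random() < 1.0 / tau:
            counts[item] = 1          # new singleton; footprint grows by one
            while footprint() > max_footprint:
                # Raise the threshold and thin the sample in a pass.
                new_tau = tau * 1.1   # ~10% higher, as suggested in the text
                for value in list(counts):
                    # Each unit of count survives with probability tau/new_tau.
                    survivors = sum(1 for _ in range(counts[value])
                                    if random.random() < tau / new_tau)
                    if survivors:
                        counts[value] = survivors
                    else:
                        del counts[value]
                tau = new_tau
    return counts, tau

sample, tau = concise_sample(random.randint(0, 50) for _ in range(10_000))
print(len(sample), tau)
```

Because repeated values only increment a count, the effective sample size can be far larger than the footprint when the number of distinct values is small.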
5. Explain and analyse the Bloom Filter in detail with the algorithm. [CO3-L2]
The Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard
Bloom in 1970, that is used to test whether an element is a member of a set. False positive
matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In
other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be
added to the set, but not removed (though this can be addressed with a “counting” filter). The
more elements that are added to the set, the larger the probability of false positives.
The Bloom filter for a set is much smaller than the set itself, which makes it appropriate for
storing the large amounts of data required when filtering a data stream in main memory.
The advantage of this data structure is that it uses considerably less space than any exact method,
but pays for this by introducing a small probability of error. Depending on the space available,
this error can be made arbitrarily small.
Let us assume we want to represent n-element sets S from a very large universe U, with |U| much larger than n.
We want to support insertions and membership queries (that is, “Given x in U, is x in S?”) so that:
If the answer is No, then x is definitely not in S.
If the answer is Yes, then x may or may not be in S, but the probability that x is not in S (a false positive) is low.
Both insertions and membership queries should be performed in constant time. A Bloom filter
is a bit vector B of m bits, with k independent hash functions h1, . . . , hk that map each key in U to
the set {0, 1, . . . , m − 1}. We assume that each hash function maps a uniformly at random
chosen key to each element of {0, 1, . . . , m − 1} with equal probability. Since we assume the hash functions
are independent, it follows that the vector (h1(x), . . . , hk(x)) is equally likely to be any k-tuple of elements
from {0, 1, . . . , m − 1}.
Algorithm: The Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps
“key” values to n buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
To initialize the bit array, begin with all bits 0. Take each key value in S
and hash it using each of the k hash functions. Set to 1 each bit that is hi(K)
for some hash function hi and some key value K in S.
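For illustration, here is a minimal Python sketch of such a filter; using salted SHA-256 digests to stand in for the k independent hash functions is an assumption of this sketch, not something prescribed by the text.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: n bits, k hash functions derived from SHA-256."""

    def __init__(self, n_bits=1000, k=4):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _hashes(self, key):
        # Derive k bucket indices by hashing the key with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h] = 1          # bits are only ever set to 1, never cleared

    def __contains__(self, key):
        # "Possibly in set" only if every hashed bit is 1; otherwise definitely absent.
        return all(self.bits[h] for h in self._hashes(key))

bf = BloomFilter(n_bits=1000, k=4)
for word in ["cat", "dog", "fish"]:
    bf.add(word)
print("dog" in bf)    # True (no false negatives)
print("lion" in bf)   # Usually False; a True here would be a false positive
```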
Clearly, if an item is inserted into the filter, it is found when searched for. Hence, there is no
false negative error. The only possibility for error is when an item has not been inserted into the filter,
but each of the locations that the hash functions map it to has already been turned on. We will show that
this error can be kept small while using considerably less space than any exact method.
Analysis of the Bloom Filter
If a key value is in S, then the element will surely pass through the Bloom filter. However, if the
key value is not in S, it might still pass. This is called a false positive. We need to understand
how to calculate the probability of a false positive as a function of n, the bit-array length, m, the
number of members of S, and k, the number of hash functions. The computation can be done as
follows:
1. The probability that one hash function does not set a given bit is (1 − 1/n).
2. The probability that the bit is not set by any of the k hash functions when one element is inserted is (1 − 1/n)^k.
3. Hence, after all m elements of S have been inserted into the Bloom filter, the probability that a
specific bit is still 0 is (1 − 1/n)^(km), which is approximately e^(−km/n). (Note that this uses the assumption that the hash
functions are independent and perfectly random.)
4. The probability of a false positive is the probability that a specific set of k bits are all 1, which is
(1 − (1 − 1/n)^(km))^k, or approximately (1 − e^(−km/n))^k.
5. Thus, we can identify three performance metrics for Bloom filters that can be adjusted to tune their
performance: first, the computation time (corresponding to the number of hash functions k); second,
the size (corresponding to the number of bits n); and finally, the probability of error (the
false positive rate f).
6. Suppose we are given the ratio n/m and want to optimize the number of hash functions k to
minimize the false positive rate f. Note that more hash functions increase the precision, but they also
increase the number of 1s in the filter, thus making false positives both less and more likely at the same time.
The optimum is found by setting the derivative of f with respect to k to 0, which gives k = (n/m) ln 2.
As n grows in proportion to m, the false positive rate decreases. Some reasonable values for n/m, k and f are:
a. n/m = 6, k = 4, f ≈ 0.05
b. n/m = 8, k = 6, f ≈ 0.02
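The short sketch below (illustrative only) evaluates the approximation f ≈ (1 − e^(−km/n))^k and the optimum k = (n/m) ln 2 for the two parameter choices above, so the quoted rates can be checked numerically.

```python
import math

def false_positive_rate(n, m, k):
    """Approximate false positive rate of a Bloom filter with n bits,
    m inserted keys and k hash functions: (1 - e^(-km/n))^k."""
    return (1.0 - math.exp(-k * m / n)) ** k

def optimal_k(n, m):
    """The rate is minimized at k = (n/m) * ln 2."""
    return (n / m) * math.log(2)

# n/m = 6 bits per key: optimal k ~ 4.2, f ~ 0.056 with k = 4
print(optimal_k(6, 1), false_positive_rate(6, 1, 4))
# n/m = 8 bits per key: optimal k ~ 5.5, f ~ 0.022 with k = 6
print(optimal_k(8, 1), false_positive_rate(8, 1, 6))
```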
Bloom filters have some nice properties that make them ideal for data streaming applications. When
an item is inserted into the filter, all the locations can be updated “blindly” in parallel. The reason
for this is that the only operation performed is to make a bit 1, so there can never be any data
race conditions. Also for an insertion, there need to be a constant (that is, k) number of writes
and no reads. Again, when searching for an item, there are at most a constant number of reads.
These properties make Bloom filters very useful for high-speed applications, where memory
access can be the bottleneck.
The Flajolet–Martin (FM) Algorithm
The principle behind the FM algorithm is that it is possible to estimate the number of distinct
elements by hashing the elements of the universal set to a bit-string that is sufficiently long.
This means that the length of the bit-string must be such that there are more possible results of
the hash function than there are elements of the universal set.
Before we can estimate the number of distinct elements, we first choose an upper bound L on the number
of distinct elements. This bound gives us the maximum number of distinct elements that we
might be able to detect. Choosing L too small will affect the precision of our
measurement; choosing L far bigger than the number of distinct elements will only use too
much memory. Here, the memory that is required is on the order of log2(L) bits.
For most applications, a 64-bit array is sufficiently large. The array needs to be initialized to
zero. We then use one or more adequate hash functions. These hash functions map
the input to a number that is representable by our bit-array. This number is then analyzed
for each record. If the resulting number contains k trailing zeros, we set the kth bit in the bit-array to one.
Finally, we can estimate the current number of distinct elements by taking the index
of the first zero bit in the bit-array. This index is usually denoted by R. We can then estimate the
number of unique elements N to be approximately 2^R. The algorithm is as follows:
Algorithm
Pick a hash function h that maps each of the n elements to at least log2(n) bits. For each
stream element a, let r(a) be the number of trailing 0s in h(a). Record R, the maximum r(a) seen.
Estimate = 2^R.
Example
r(a) = position of the first 1, counting from the right. Say h(a) = 12; then 12 is 1100 in binary,
so r(a) = 2.
Example
Suppose the stream is 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1 .... Let h be a hash function with
h(1) = 4, h(2) = 2, h(3) = 5, h(4) = 3. So the transformed stream
(h applied to each item) is 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4.
Each of the above elements is converted into its binary equivalent: 100, 101, 10, 100, 10, 101,
11, 101, 100, 10, 101, 100.
We compute r of each item in the above stream: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2. So R = max r,
which is 2. Output 2^R = 2^2 = 4.
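A small Python sketch of the single-hash FM estimator is shown below; the dictionary-based hash h simply reproduces the mapping used in the worked example above and is not a general-purpose hash function.

```python
def trailing_zeros(x):
    """r(a): number of trailing zero bits in the binary representation of x."""
    if x == 0:
        return 0                     # convention for the rare all-zero hash value
    count = 0
    while x % 2 == 0:
        x //= 2
        count += 1
    return count

def flajolet_martin(stream, h):
    """Single-hash FM sketch: estimate the distinct count as 2^R,
    where R is the largest tail length observed."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(h(a)))
    return 2 ** R

# Reproduce the worked example: h(1)=4, h(2)=2, h(3)=5, h(4)=3.
h = {1: 4, 2: 2, 3: 5, 4: 3}.get
stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
print(flajolet_martin(stream, h))   # R = 2, so the estimate is 2^2 = 4
```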
A simple and heuristic intuition as to why Flajolet–Martin works can be explained as follows:
1. h(a) hashes a with equal probability to any of N values.
2. Then h(a) is a sequence of log2(N) bits, where a 2^(−r) fraction of all a's have a tail of r zeros.
About 50% of a's hash to ***0.
About 25% of a's hash to **00.
So, if we saw a longest tail of r = 2 (i.e., an item hash ending in *100), then we have probably
seen about four distinct items so far.
3. So, it takes hashing about 2^r items before we see one with a zero-suffix of length r. More
formally, we can see that the algorithm works because the probability that a given hash value h(a)
ends in at least r zeros is 2^(−r). In the case of m different elements, the probability that R ≥ r (where R
is the maximum tail length seen so far) is 1 − (1 − 2^(−r))^m.
Variations to the FM Algorithm
There are reasons why the simple FM algorithm will not work with just a single hash function. The
expected value of 2^R is actually infinite: each time R increases by 1, the probability halves, but
the value of the estimate doubles. In order to get a much smoother estimate that is also more reliable, we can
use many hash functions. Another problem with the FM algorithm in the above form is that the
results vary a lot. A common solution is to run the algorithm multiple times with different hash
functions, and combine the results from the different runs.
One idea is to take the mean of the results from each hash function, obtaining a single
estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers
(which are likely here).
A different idea is to use the median, which is less prone to being influenced by outliers. The
problem with this is that the result can only take the form of some power of 2. Thus, no matter how
many hash functions we use, should the correct value of m lie between two powers of 2, say at 400,
it will be impossible to obtain a close estimate. A common solution is to combine both the
mean and the median:
1. Create k · l hash functions and split them into k distinct groups (each of size l).
2. Within each group, take the mean (average) of the results.
3. Finally, take the median of the group averages as the final estimate.
Sometimes an outsized 2^R will bias some of the groups and make their averages too large. However, taking
the median of the group averages will reduce the influence of this effect almost to nothing.
Moreover, if the groups themselves are large enough, then the averages can be essentially any
number, which enables us to approach the true value m as long as we use enough hash functions.
Groups should be of size at least some small multiple of log2(m).
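The combination step can be sketched as follows (illustrative Python with made-up estimate values); note how a single outsized estimate inflates one group average but is discarded by the median.

```python
import statistics

def combine_fm_estimates(estimates, group_size):
    """
    Combine many single-hash FM estimates: average within each group
    (which smooths the power-of-2 granularity), then take the median of
    the group averages (which suppresses occasional outsized 2^R values).
    """
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    group_means = [sum(g) / len(g) for g in groups]
    return statistics.median(group_means)

# e.g. 12 estimates from 12 different hash functions, in groups of 4;
# the 256 outlier inflates one group mean but not the final median.
print(combine_fm_estimates([4, 8, 4, 16, 8, 4, 256, 8, 4, 8, 16, 4], 4))
```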
It is possible to estimate the number of distinct elements by hashing the elements of the universal set
to a bit-string that is sufficiently long. The length of the bit-string must be sufficient that there are
more possible results of the hash function than there are elements of the universal set.
For example, 64 bits is sufficient to hash URL’s. We shall pick many different hash functions
and hash each element of the stream using these hash functions. The important property of a hash
function is that when applied to the same element, it always produces the same result. Notice that this
property was also essential for the sampling technique .
The idea behind the Flajolet-Martin Algorithm is that the more different elements we see in
the stream, the more different hash-values we shall see. As we see more different hash-values, it
becomes more likely that one of these values will be “unusual.” The particular unusual property we
shall exploit is that the value ends in many 0’s, although many other options exist.
Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in
some number of 0’s, possibly none. Call this number the tail length for a and h. Let R be the
maximum tail length of any a seen so far in the stream. Then we shall use estimate 2R for the
number of distinct elements seen in the stream.
This estimate makes intuitive sense. The probability that a given stream element a has h(a)
ending in at least r 0’s is 2^(−r). Suppose there are m distinct elements in the stream. Then the
probability that none of them has tail length at least r is (1 − 2^(−r))^m. This sort of expression should be
familiar by now.
We can rewrite it as ((1 − 2^(−r))^(2^r))^(m·2^(−r)). Assuming r is reasonably large, the inner expression
is of the form (1 − ε)^(1/ε), which is approximately 1/e. Thus, the probability of not finding a stream
element with as many as r 0’s at the end of its hash value is e^(−m·2^(−r)). We can conclude:
1. If m is much larger than 2^r, then the probability that we shall find a tail
of length at least r approaches 1.
2. If m is much less than 2^r, then the probability of finding a tail of length at least r approaches 0.
We conclude from these two points that the proposed estimate of m, which is 2^R (recall R is the
largest tail length of any stream element), is unlikely to be either much too high or much too low.
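To see the two cases numerically, the short sketch below (illustrative only) evaluates e^(−m·2^(−r)) for r = 10 (so 2^r = 1024) and several values of m; the probability of missing a long tail drops from near 1 to near 0 as m passes 2^r.

```python
import math

# Probability of NOT seeing any hash value with at least r trailing zeros,
# among m distinct elements: approximately exp(-m * 2**-r).
def p_no_long_tail(m, r):
    return math.exp(-m * 2 ** (-r))

for m in (100, 1_000, 10_000, 100_000):
    # With r = 10, the probability flips from ~0.9 to ~0 as m passes 1024,
    # which is why 2^R tracks the number of distinct elements.
    print(m, round(p_no_long_tail(m, 10), 4))
```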
6. Discuss in detail the role of Decaying Windows in data stream analysis. [CO3-L2]
Decaying Windows
Pure sliding windows are not the only way by which the evolution of data streams can be taken
into account during the mining process. A second way is to introduce a decay factor into the
computation. Specifically, the weight of each transaction is multiplied by a decay factor λ < 1 when
a new transaction arrives. The overall effect of such an approach is to create an exponential
decay function on the arrivals in the data stream. Such a model is quite effective for evolving
data stream, since recent transactions are counted more significantly during the mining process.
Specifically, the decay factor is applied only to those itemsets whose counts are affected by the
current transaction. However, the decay factor will have to be applied in a modified way by
taking into account the last time that the itemset was touched by an update. This approach works
because the counts of each itemset reduce by the same decay factor in each iteration, as long as a
transaction count is not added to it. Such an approach is also applicable to other mining problems
where statistics are represented as the sum of decaying values.
We discuss a few applications of decaying windows to find interesting aggregates over data
streams.
The Problem of Most-Common Elements
Suppose we have a stream whose elements are the movie tickets purchased all over the world,
with the name of the movie as part of the element. We want to keep a summary of the stream
that gives the most popular movies “currently.” While the notion of “currently” is imprecise,
intuitively, we want to discount the popularity of an older movie that may have sold many
tickets, but most of these decades ago. Thus, a newer movie that sold n tickets in each of the last
10 weeks is probably more popular than a movie that sold 2n tickets last week but nothing in
previous weeks.
One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the ith
ticket is for that movie, and 0 otherwise. Pick a window size N, which is the number of most
recent tickets that would be considered in evaluating popularity. Then, use the method of the
DGIM algorithm to estimate the number of tickets for each movie, and rank movies by their
estimated counts.
This technique might work for movies, because there are only thousands of movies, but it would
fail if we were instead recording the popularity of items sold at Amazon, or the rate at which
different Twitter-users tweet, because there are too many Amazon products and too many
tweeters. Further, it only offers approximate answers.
Describing a Decaying Window
One approach is to redefine the question so that we are not asking for a simple count of 1s in a
window. Instead, we compute a smooth aggregation of all the 1s ever seen in the stream, but with
decaying weights: the further in the past a 1 is found, the less weight it is given.
Formally, let a stream currently consist of the elements a1, a2, . . . , at,
where a1 is the first element to arrive and at is the current element. Let c be a small
constant, such as 10^(−6) or 10^(−9). Define the exponentially decaying window for this stream to be the
sum over i = 0, 1, . . . , t − 1 of a_(t−i) · (1 − c)^i.
The effect of this definition is to spread out the weights of the stream elements as far back in
time as the stream goes. In contrast, a fixed window with the same sum of weights, 1/c, would
put equal weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous
elements. This is illustrated in the figure below.
It is much easier to adjust the sum in an exponentially decaying window than in a sliding
window of fixed length. In the sliding window, we have to somehow take into consideration the
element that falls out of the window each time a new element arrives. This forces us to keep the
exact elements along with the sum, or to use some approximation scheme such as DGIM. But in
the case of a decaying window, when a new element a_(t+1) arrives at the stream input, all we need to do
is the following:
1. Multiply the current sum by (1 − c).
2. Add a_(t+1).
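Applied to the most-common-elements problem above, this update rule can be sketched in Python as follows (an illustrative sketch, not the notes' own code): every tracked score is decayed on each arrival, 1 is added for the arriving item, and scores that fall below a small threshold are dropped so that only currently popular items remain tracked. This is the straightforward version; a lazier implementation would decay a score only when it is touched.

```python
def decayed_counts(stream, c=1e-6, threshold=0.5):
    """
    Maintain exponentially decaying counts ("scores") for the items seen
    in a stream, e.g. movie titles on a stream of ticket sales.
    """
    scores = {}
    for item in stream:
        # Step 1: decay every current score by the factor (1 - c).
        for key in list(scores):
            scores[key] *= (1.0 - c)
            if scores[key] < threshold:
                del scores[key]      # keeps the number of tracked items bounded
        # Step 2: add 1 for the element that just arrived.
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores

tickets = ["MovieA", "MovieB", "MovieA", "MovieC", "MovieA", "MovieB"]
print(decayed_counts(tickets, c=0.1))
```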
Real-Time Analytics
Real-time analytics makes use of all available data and resources when they are needed.
It consists of dynamic analysis and reporting, based on data entered into a system less
than one minute before the actual time of use. Real-time denotes the ability to process
data as it arrives, rather than storing the data and retrieving it at some point in the future.
For example, consider an e-merchant like Flipkart or Snapdeal; real time means the time
elapsed from the moment a customer enters the website to the moment the customer logs out.
Any analytics procedure, like providing the customer with recommendations or offering
a discount based on the current value of the shopping cart, etc., will have to be done within
this timeframe, which may be about 15 minutes to an hour.
But from the point of view of a military application, where there is constant monitoring,
say, of the air space, the time needed to analyze a potential threat pattern and make a decision
may be a few milliseconds.
“Real-Time Analytics” is thus discovering meaningful patterns in data for something
urgent. There are two specific and useful types of real-time analytics - On-Demand and
Continuous.
1. On-Demand Real-Time Analytics is reactive because it waits for users to request a
query and then delivers the analytics. This is used when someone within a company
needs to take a pulse on what is happening right this minute. For instance, a movie
producer may want to monitor the tweets and identify sentiments about his movie on the
first day first show and be prepared for the outcome.
2. Continuous Real-Time Analytics is more proactive and alerts users with continuous
updates in real time. The best example is monitoring stock market trends and providing
analytics to help users make a decision to buy or sell, all in real time.
Real-Time Analytics Applications
Analytics falls along a spectrum. On one end of the spectrum sit batch analytical
applications, which are used for complex, long-running analyses. They tend to have
slower response times (up to minutes, hours, or days) and lower requirements for
availability.
Examples of batch analytics include Hadoop-based workloads. On the other end of the
spectrum sit real-time analytical applications, which provide lighter-weight analytics
very quickly. Latency is low (sub-second) and availability requirements are high (e.g.,
99.99%). The figure below illustrates this.
1. Financial Services: Analyse ticks, tweets, satellite imagery, weather trends, and any
other type of data to inform trading algorithms in real time.
2. Government: Identify social program fraud within seconds based on program history,
citizen profile, and geospatial data.
3. E-Commerce Sites: Real-time analytics helps to tap into user preferences while
people are on the site or using a product. Knowing what users like at run time helps
the site decide which relevant content to make available to that user. This can result
in a better customer experience overall, leading to an increase in sales. Let us take a look at
how this works in practice. For example, Amazon recommendations change
after each new product you view so that they can upsell customers throughout the
session. Real-time recommendations create a personal shopping experience for each and
every customer. With more insight into their customers at an individual level, Amazon
is able to effectively upsell and cross-sell products at every interaction point.
4. Insurance Industry: Digital channels of customer interaction (such as online
channels) as well as conversations online (such as social media) have created new
streams of real-time event data. Insurance, being a data-rich industry and a high
customer lifetime value business, can gain immensely from real-time analytics. The
following are a few scenarios where an insurance firm can benefit from real-time
analytics:
A prospective customer visits the website looking to get a quote. Real-time
analytics can be used to predict the propensity of the customer to leave the site without
applying for a quote. This, in turn, can be used to trigger actions like offering a free consultation,
additional schemes, etc.
In the insurance industry, fast-tracking of claims improves customer satisfaction
significantly. However, this can increase the risk of fraud. Real-time analytics can be
used to reduce the risk of fraud even while accelerating the speed of processing claims.
Some auto insurers are already collaborating with automobile firms to gather real-time
information from vehicles on a continuous basis. With GPS-enabled telemetry
devices in place, insurers can devise innovative policies where the premium could be
thought of as a car's gas tank that is filled up at a station: just as the actual consumption of
gas changes dynamically based on a variety of conditions, the premium can be
“consumed” in real time based on driving behaviour - if one drives safely, the premium
lasts longer than when driving rashly.
The list of applications for real-time analytics is endless. At the end of this section we
shall discuss two popular applications, real-time sentiment analysis and real-time stock
predictions, in greater detail.
Generic Design of an RTAP
Companies like Facebook and Twitter generate petabytes of real-time data. This data
must be harnessed to provide real-time analytics to make better business decisions.
Further, in today’s context, billions of devices are already connected to the internet, with
more connecting each day. With the evolution of the Internet of Things (IoT), we have a
large number of new data sources such as smart meters, sensors and wearable medical
devices. Real-time analytics will leverage information from all these devices to apply
analytics algorithms and generate automated actions within milliseconds of a trigger.
To create an environment where you can do “Real-Time Analytics”, the following three
aspects of data flows to your system are important:
1. Input: An event happens (new sale, new customer, someone enters a high security
zone etc.).
2. Process and Store Input: Capture the data of the event, and analyze the data without
leveraging resources that are dedicated to operations.
3. Output: Consume this data without disturbing operations (reports, dashboard, etc.).
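As a purely illustrative sketch (none of these event names or values come from the text), the toy Python program below mirrors this three-part flow: events are pushed onto an input queue, a worker processes them as they arrive, and a small dashboard dictionary plays the role of the output consumed by reports.

```python
import queue
import threading

# Minimal sketch of the three-part real-time flow described above:
# input events arrive on a queue, are processed as they arrive, and the
# output (a running dashboard) can be read without disturbing ingestion.
events = queue.Queue()
dashboard = {"sales": 0, "alerts": 0}    # "output" consumed by reports/dashboards

def process():
    while True:
        event = events.get()             # "input": an event happens
        if event is None:
            break                        # sentinel: stop the worker
        # "process and store": analyze the event as it arrives
        if event["type"] == "new_sale":
            dashboard["sales"] += event["amount"]
        elif event["type"] == "intrusion":
            dashboard["alerts"] += 1
        events.task_done()

worker = threading.Thread(target=process, daemon=True)
worker.start()

events.put({"type": "new_sale", "amount": 250})
events.put({"type": "intrusion"})
events.put(None)
worker.join()
print(dashboard)                         # {'sales': 250, 'alerts': 1}
```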