Unit-2 BDA
the clickstream records and generates a security alert if the clickstream shows
suspicious behavior.
b) In financial institutions, customer portfolios are adjusted based on configured
constraints in order to track market changes.
c) In power grids, an alert or notification is generated when throughput reaches
certain thresholds.
d) In news sources, articles relevant to the audience are selected by analyzing the
clickstream records from various sources together with the readers' demographic
information.
e) In network management and web traffic engineering, the streams of packets are
collected and processed to detect anomalies.
The streams fed to the stream processor have to be stored in a temporary store or
working store. The temporary store in the data stream model is a transient store that
holds the parts of streams which can be queried for processing. The temporary store can
be a disk or main memory, depending on how fast the queries have to be processed. The
results of the queries are stored in a large archival storage; archival data cannot normally
be used for query processing, but it can be used in special circumstances. A streaming
query is a continuous query that executes over the streaming data. Such queries are
similar to database queries used for analyzing data, but differ by operating continuously
on data as they arrive incrementally in real time. The stream processor supports two
types of queries, namely ad-hoc queries and standing queries. An ad-hoc query produces
variable-dependent results, where each query generates different results depending on
the value of the variable. A common approach for ad-hoc queries is to store a sliding
window of each stream in the working or temporary store. Since the streams cannot be
stored in their entirety, arbitrary queries about the streams are answered by storing
appropriate parts or summaries of the streams. Ad-hoc queries are issued for a specific
purpose, in contrast to a predefined query.
Alternatively, standing queries are continuous queries that execute over the
streaming data and whose functions are predetermined. A standing query produces
updated aggregate results each time a new stream element arrives.
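The two query types can be illustrated with a small Python sketch; the running-maximum standing query, the sliding-window average and the window size of 100 elements are illustrative assumptions rather than features of any particular stream processor.

from collections import deque

def standing_max(stream):
    """Standing query : continuously report the maximum value seen so far."""
    maximum = float("-inf")
    for value in stream:
        maximum = max(maximum, value)
        yield maximum              # an updated answer is produced for every arrival

def adhoc_average(window):
    """Ad-hoc query : the answer is computed on demand over the stored sliding window."""
    return sum(window) / len(window) if window else None

readings = [3, 7, 2, 9, 4]
window = deque(maxlen=100)         # working store keeps only the most recent 100 elements
for value, running_max in zip(readings, standing_max(readings)):
    window.append(value)
    print("standing query answer :", running_max)
print("ad-hoc query answer :", adhoc_average(window))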
The ETL platform receives queries from users; based on them, it fetches the events from
message queues and applies the queries to the stream data, performing additional joins,
transformations and aggregations to generate a result. The result may be an API call, an
action, a visualization, an alert, or in some cases a new data stream. Popular ETL
tools for streaming data are Apache Storm, Spark, Flink and Samza.
The third component of the data stream architecture is the query engine, which is used
once the streaming data has been prepared for consumption by the stream processor;
such data must be analyzed to provide valuable insights. The fourth component is
streaming data storage, which stores the streaming event data in different storage
mediums such as data lakes.
Stream data processing provides several benefits, such as the ability to deal with never-
ending streams of events, real-time data processing, detection of patterns in time-series
data and easy data scalability. Some of its limitations are network latency, limited
throughput, slow processing, support for only window-sized portions of streams and
restrictions related to in-memory access to stream data.
The common examples of data stream applications are
Sensor networks : A huge source of data occurring as streams, used in numerous
situations that require constant monitoring of several variables, on which important
decisions are based.
Network traffic analysis : In which network service providers constantly obtain
information about Internet traffic, heavily used routes, etc. to identify and predict
potential congestion or to identify potentially fraudulent activities.
Financial applications : In which online analysis of stock prices is performed, which is
used for making sell decisions, quickly identifying correlations with other products,
understanding fast-changing trends and, to an extent, forecasting future valuations.
Queries over continuous data streams have much in common with queries in a
traditional DBMS. Two types of queries can be identified as typical over data streams,
namely one-time queries and continuous queries :
a) One-time queries : One-time queries are queries that are evaluated once over a
point-in-time snapshot of the data set, with the answer returned to the user. For
example, a stock price checker might report the current price of a stock at the moment
the query is issued.
b) Continuous queries : Continuous queries, on the other hand, are evaluated
continuously as data streams continue to arrive. The answer to a continuous query is
produced over time, always reflecting the stream data seen so far. Continuous query
answers may be stored and updated as new data arrives, or they may be produced as
data streams themselves.
analyze and act on insights, and to move from batch processing to real time analytical
decisions. Stream computing supports low-latency velocities and massively parallel
processing architectures to obtain useful knowledge from big data. Consequently, the
stream computing model is a new trend for high-throughput computing in big data
analytics. Organizations that use stream computing include telecommunications, health
care, utility companies, municipal transit, security agencies and many more. Two
popular use cases of stream computing are distribution load forecasting, condition-based
maintenance and smart meter analytics in the energy industry, and monitoring a
continuous stream of sensor data to generate alerts when an intrusion is detected on a
network.
An integrated development environment (IDE) is used for debugging and testing
stream processing applications that process streams using streaming operators; it supports
visual development of applications and provides filtering, aggregation and correlation
methods for streamed data, along with a user interface for time-window analysis. Database
connectors provide rule engines and stream processing engines for processing streamed
data with multiple DBMS features. Conventional main-memory DBMSs and rule engines
have to be redesigned for use in stream computing. The streaming analytics engine allows
management, monitoring and real-time analytics for real-time streaming data, and a data
mart is used for storing live data for processing, with additional features such as
operational business intelligence. It also provides automated alerts for events.
In data stream processing, the three important operations used are sampling, filtering
and counting distinct elements from the stream, which are explained in the subsequent
sections.
This method works well as long as we can keep the list of all users and their in/out
decisions in main memory. By using a hash function, one can avoid keeping the list of
users : each user name is hashed to one of ten buckets, 0 to 9. If the user hashes to
bucket 0, then the search query is accepted for the sample, and if not, it is rejected.
Effectively, we use the hash function as a random number generator and, without storing
the in/out decision for any user, we can reconstruct that decision any time a search query
by that user arrives.
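A minimal sketch of this hashing idea is given below; the use of hashlib (so the hash is deterministic across runs) and the example user names are assumptions made for illustration.

import hashlib

def user_bucket(user, buckets=10):
    """Hash a user name to one of `buckets` buckets, 0 to buckets - 1."""
    digest = hashlib.md5(user.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def accept_for_sample(user):
    """Accept the search query for the 1/10th sample iff the user hashes to bucket 0."""
    return user_bucket(user) == 0

# The in/out decision is never stored; it is recomputed from the hash whenever
# a (user, query, time) tuple for that user arrives.
for user, query in [("alice", "laptops"), ("bob", "flights")]:
    if accept_for_sample(user):
        print("keep :", user, query)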
The generalized sampling problem consists of streams of tuples with n components. A
subset of the components are the key components, on which the selection of the sample is
based. In our example, the tuples consist of user, query and time, and the user is the key.
However, we could instead sample queries by taking the query as the key attribute.
In general, to generate a sample containing a fraction a/b of the key values, we hash the
key value of each tuple to b buckets and accept the tuple for the sample if the hash value
is less than a. The result will be a sample consisting of all tuples with certain key values,
and the selected key values will be approximately a/b of all the key values appearing in
the stream. While sampling methods reduce the amount of data to process, and
consequently the computational cost, they can also be a source of errors. The main
problem is to obtain a representative sample, i.e. a subset of the data that has
approximately the same properties as the original data.
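A hedged sketch of this generalized a/b key sampling is given below; the key position, the bucket count b and the sample tuples are illustrative assumptions.

import hashlib

def key_bucket(key, b):
    """Deterministically hash a key value into one of b buckets."""
    return int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16) % b

def sample_stream(tuples, key_index, a, b):
    """Keep a tuple iff its key hashes to a bucket below a, giving an a/b sample of key values."""
    for t in tuples:
        if key_bucket(t[key_index], b) < a:
            yield t

# Example : sample roughly 3/10 of the users from (user, query, time) tuples.
stream = [("alice", "laptops", 1), ("bob", "flights", 2), ("carol", "books", 3)]
print(list(sample_stream(stream, key_index=0, a=3, b=10)))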
In reservoir sampling, randomized algorithms are used for randomly choosing samples
from a list of items, where the list of items is either very large or of unknown length.
For example, imagine you are given a really large stream of data and your goal is to
efficiently return a random sample of 1000 elements evenly distributed over the original
stream. If the stream length N is known in advance, a simple way is to generate random
integers between 0 and (N − 1); retrieving the elements at those indices gives the answer.
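When the stream length is unknown, the classic single-pass reservoir algorithm keeps a uniform sample of fixed size; the sketch below is a standard formulation (Algorithm R), with the sample size of 1000 taken from the example above.

import random

def reservoir_sample(stream, k):
    """Return k elements chosen uniformly at random from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir with the first k elements
        else:
            j = random.randint(0, i)      # random position in 0 .. i (inclusive)
            if j < k:
                reservoir[j] = item       # replace an old element with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), 1000)
print(len(sample))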
4.4.1.2 Biased Reservoir Sampling
In biased reservoir sampling, a bias function is used to regulate the sampling from the
stream. In many cases, the stream data may evolve over time, and the corresponding data
mining or query results may also change over time. Thus, the results of a query over a
more recent window may be quite different from the results of a query over a more
distant window. Similarly, the entire history of the data stream may not be relevant for use
in a repetitive data mining application such as classification. The simple reservoir
sampling algorithm can be adapted to a sample from a moving window over data
streams. This is useful in many data stream applications where a small amount of recent
history is more relevant than the entire previous stream. This will give a higher
probability of selecting data points from recent parts of the stream as compared to distant
past. The bias function in sampling is quite effective since it regulates the sampling in a
smooth way so that the queries over recent horizons are more accurately resolved.
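One memoryless way to realize an exponential bias function, in the spirit of biased reservoir sampling, is sketched below; the bias rate lam, the reservoir capacity 1/lam and the replacement rule are assumptions made for illustration, not the only possible choice of bias function.

import random

def biased_reservoir_sample(stream, lam):
    """Biased reservoir sampling sketch with exponential bias rate lam (0 < lam < 1).
    Recent elements end up in the reservoir with higher probability than distant ones."""
    capacity = int(1.0 / lam)               # maximum reservoir size
    reservoir = []
    for item in stream:
        fill_fraction = len(reservoir) / capacity
        if random.random() < fill_fraction:
            # replace a randomly chosen slot, ejecting an (often older) element
            reservoir[random.randrange(len(reservoir))] = item
        elif len(reservoir) < capacity:
            reservoir.append(item)           # still room : simply insert the new point
    return reservoir

recent_biased = biased_reservoir_sample(range(10_000), lam=0.01)
print(len(recent_biased))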
4.4.1.3 Concise Sampling
The size of the reservoir is often restricted by the available main memory. It is
desirable to increase the sample size within the available main memory restrictions, and
for this purpose the technique of concise sampling is quite effective. Concise sampling
exploits the fact that the number of distinct values of an attribute is often significantly
smaller than the size of the data stream. In many applications, sampling is performed on
a single attribute of multi-dimensional data; this type of sampling is called concise
sampling. For example, for customer data in an e-commerce site, sampling may be done
based on customer ids alone. The number of distinct customer ids is definitely much
smaller than n, the size of the entire stream.
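A simplified sketch of the idea behind concise sampling is shown below : the sample is kept as (value, count) pairs instead of repeated values, and the sampling rate is coarsened whenever the footprint exceeds the available space. The capacity, the doubling of the threshold and the re-subsampling policy are simplifying assumptions, not the full published algorithm.

import random

def concise_sample(stream, capacity):
    """Keep an approximate sample as {value: count}, exploiting repeated attribute values."""
    sample = {}          # value -> count of sampled occurrences
    tau = 1.0            # current sampling rate is 1/tau
    for value in stream:
        if random.random() < 1.0 / tau:
            sample[value] = sample.get(value, 0) + 1
        if len(sample) > capacity:
            # footprint exceeded : raise tau and re-subsample the stored counts
            tau *= 2.0
            for v in list(sample):
                kept = sum(random.random() < 0.5 for _ in range(sample[v]))
                if kept:
                    sample[v] = kept
                else:
                    del sample[v]
    return sample, tau

counts, tau = concise_sample((random.choice("ABCDE") for _ in range(10_000)), capacity=3)
print(counts, tau)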
email itself as a pair. As each email address consumes 20 bytes or more of space, it is not
reasonable to store the set S in main memory; we would have to use disk to store and
access it.
Suppose we want to use main memory as a bit array; then we need an array of eight
million bits and a hash function h that maps email addresses to eight million buckets.
Since there are one million members of S, approximately 1/8th of the bits will be 1 and
the rest will be 0. As soon as a stream element arrives, we hash its email address; if the
bit at the hashed position is 1, we let the email through, else we drop the stream element.
Sometimes spam email will still get through, so to eliminate every spam message we
need to check membership in the set S for the good and bad emails that pass the filter.
The Bloom filter is used in such cases to eliminate most of the tuples which do
not meet the selection criterion.
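A minimal Bloom filter sketch for the e-mail example is given below; the bit-array size, the single hash function and the use of salted hashlib digests are illustrative assumptions.

import hashlib

class BloomFilter:
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = bytearray((n_bits + 7) // 8)   # bit array of n_bits bits

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Example : allowed addresses hashed into an eight-million-bit array.
allowed = BloomFilter(n_bits=8_000_000, k_hashes=1)
allowed.add("alice@example.com")
print(allowed.might_contain("alice@example.com"))     # True
print(allowed.might_contain("spammer@example.com"))   # False with high probability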
the stream element passes through, else it is discarded. That is, if one or more of these
bits remain 0, then K cannot be in S, so the stream element is rejected. To find out how
many elements pass incorrectly, we need to calculate the probability of a false positive
as a function of n, the bit-array length, m, the number of members of the set S, and k, the
number of hash functions.
Let us take an example, where we have a model of throwing darts at targets. Suppose
we have T targets and D darts, and any dart is equally likely to hit any target. The
analysis of how many targets we can expect to be hit at least once proceeds as follows :
The probability that a given dart will not hit a given target is (T − 1) / T.
The probability that none of the D darts hits a given target is ((T − 1) / T)^D.
Writing ((T − 1) / T)^D as (1 − 1/T)^(T(D/T)) and using (1 − 1/T)^T ≈ e^(−1) for large T,
the probability that none of the D darts hits a given target is approximately e^(−D/T).
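In the Bloom filter setting there are n targets (bits) and k·m darts (k hash applications for each of the m members), so the fraction of 1-bits is about 1 − e^(−km/n), and a false positive occurs when all k bits checked for a non-member are 1. The small sketch below, assuming the standard approximation (1 − e^(−km/n))^k, checks the running example numerically.

import math

def false_positive_rate(n_bits, m_members, k_hashes):
    """Approximate Bloom filter false-positive probability : (1 - e^(-k*m/n))^k."""
    return (1.0 - math.exp(-k_hashes * m_members / n_bits)) ** k_hashes

# The running example : one million members, eight million bits, one hash function.
print(false_positive_rate(8_000_000, 1_000_000, 1))   # about 0.1175, roughly 1/8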
would be {3, 4, 2, 3}; the count of distinct numbers in the fourth pass is 3. Therefore, the
final counts of distinct numbers are 3, 4, 4, 3.
Let us take another example : suppose we want to find out how many unique users
have accessed a particular website, say Amazon, in a given month, based on gathered
statistics. Here the universal set would be the set of logins, or the IP addresses
(sequences of four 8-bit bytes) from which queries for that site are sent. The easiest way
to solve this problem is to keep in main memory a list of all the elements seen in the
stream, arranged in a search structure such as a hash table or search tree so that new
elements can be added quickly; this gives an exact count of the distinct elements
appearing in the stream. However, if the number of distinct elements is too large, we
cannot store them all in main memory. One solution is to use several machines, each
handling only one or a few of the streams, and to store most of the data structure in
secondary memory.
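When an exact count does not fit in main memory, a standard space-saving alternative is the Flajolet-Martin estimate, which keeps only the largest number of trailing zero bits seen among hashed elements; the sketch below is a minimal single-hash version, with hashlib used as an assumed hash function.

import hashlib

def trailing_zeros(x):
    """Number of trailing zero bits in the binary representation of x (x > 0)."""
    return (x & -x).bit_length() - 1

def flajolet_martin_estimate(stream):
    """Estimate the number of distinct elements as 2^R, where R is the largest
    tail length observed among the hashed stream elements."""
    max_tail = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode("utf-8")).hexdigest(), 16)
        if h:
            max_tail = max(max_tail, trailing_zeros(h))
    return 2 ** max_tail

# Example : estimate the distinct users in a stream of repeated logins.
print(flajolet_martin_estimate("u%d" % (i % 500) for i in range(10_000)))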
stream consists of elements chosen from a universal set U whose elements can be ordered,
and let mi be the number of occurrences of the ith element. Then the kth-order moment of
the stream is the sum over all i, i.e.
Fk = Σi (mi)^k
Here, the 0th moment of the stream is the sum of 1 for each mi > 0, i.e. the number of
distinct elements. The 1st moment of the stream is the sum of all the mi, which is simply
the length of the stream. The 2nd moment of the stream is the sum of the squares of the
mi; it is also called the surprise number S, because it measures how uneven the
distribution of elements in the stream is. The smaller the value of the second moment,
the less uneven (skewed) the distribution.
For example, suppose we have a stream of length 100, in which eleven different
elements appear. The most even distribution of these eleven elements would be one
element appearing 10 times and the other ten appearing 9 times each. In this case, the
surprise number would be 1 × 10^2 + 10 × 9^2 = 910. Since we cannot keep a count for
every element appearing in a stream in main memory, we need to estimate the kth
moment of a stream by keeping a limited number of values in main memory and
computing an estimate from these values.
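The surprise number of the example above can be checked directly from the definition F2 = Σ (mi)^2, as in the small sketch below (the element labels are arbitrary).

from collections import Counter

def second_moment(stream):
    """Surprise number : sum of the squares of the element frequencies mi."""
    return sum(m * m for m in Counter(stream).values())

# One element appearing 10 times and ten elements appearing 9 times each (length 100).
stream = ["a"] * 10 + ["e%d" % i for i in range(10) for _ in range(9)]
print(len(stream), second_moment(stream))   # 100 910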
Examples :
Consider streams in which the elements occur with the following frequencies, and
calculate the surprise number :
1) Frequencies 5, 5, 5, 5, 5 : Surprise number = 5 × 5^2 = 125
2) Frequencies 9, 9, 5, 1, 1 : Surprise number = 2 × 9^2 + 1 × 5^2 + 2 × 1^2 = 189
To estimate the second moment of the stream with a limited amount of main memory,
we can use the Alon-Matias-Szegedy (AMS) algorithm; the more space we use, the more
accurate the estimate will be. In this algorithm, we compute some number of variables X.
For each variable X, we store a particular element of the universal set, which we refer to
as X.element, and an integer X.value. To define a variable X, we select a position in the
stream between 1 and n at random; X.element is the element found at that position, and
X.value is initialized to 1. As we read the rest of the stream, we add 1 to X.value each time
we encounter another occurrence of X.element. Technically, the estimate of the second and
higher moments assumes that the stream length n is a constant, whereas in practice n
grows with time; hence we store only the values of the variables and multiply an
appropriate function of each value by n when it is time to estimate the moment.
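A sketch of the Alon-Matias-Szegedy estimator is given below, assuming for simplicity that the whole stream is available as a list of known length n; each variable contributes the estimate n × (2 × X.value − 1), and the contributions are averaged. The stream contents and the number of variables are illustrative.

import random
from statistics import mean

def ams_second_moment(stream, num_vars):
    """Estimate F2 = sum_i mi^2 with the Alon-Matias-Szegedy method (stream length known)."""
    n = len(stream)
    starts = sorted(random.sample(range(n), num_vars))       # random positions in the stream
    variables = [{"pos": p, "element": stream[p], "value": 0} for p in starts]
    for t, element in enumerate(stream):
        for var in variables:
            # X.value counts occurrences of X.element from its chosen position onwards
            if t >= var["pos"] and element == var["element"]:
                var["value"] += 1
    # each variable contributes the estimate n * (2 * X.value - 1); average them
    return mean(n * (2 * var["value"] - 1) for var in variables)

stream = [random.choice("abcde") for _ in range(1000)]
print("AMS estimate :", ams_second_moment(stream, num_vars=20))
print("exact F2     :", sum(stream.count(c) ** 2 for c in set(stream)))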
Fig. 4.8.1 : Bitstream divided into buckets following the DGIM rules
(The buckets shown have sizes 4, 4, 2, 2, 1 and 1, i.e. 2^2, 2^2, 2^1, 2^1, 2^0 and 2^0,
from oldest to newest.)
Here, when a new bit comes in, drop the last (oldest) bucket if its timestamp is prior to N
time units before the current time. If the new bit that arrives is a 0, say with timestamp
101, then no changes to the buckets are needed; but if the new bit that arrives is a 1, then
we need to make some changes.
101011 000 10111 0 11 00 101 1 0 1 1
New bits to be entered
Since the current bit is 1, create a new bucket of size 1 with the current timestamp. If
there were only one or two buckets of size 1, nothing more would need to be done.
However, there are now three buckets of size 1 (the buckets with timestamps 100, 102 and
103), so we combine the leftmost (oldest) two buckets of size 1 into a bucket of size 2, as
shown below.
101011 000 10111 0 11 00 101 1001 1 1
(Bucket sizes, oldest to newest : 4, 4, 2, 2, 2, 1)
To combine any two adjacent buckets of the same size, replace them by one bucket of
twice the size. The timestamp of the new bucket is the timestamp of the more recent
(rightmost) of the two buckets. This combining operation is applied repeatedly wherever
three buckets of the same size appear, so that the resulting buckets again contain at most
two of each size.
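A minimal sketch of this DGIM bucket maintenance is given below; the timestamps, the combine-when-three-of-a-size rule and the estimate that takes half of the oldest bucket follow the description above, while the concrete data structure and the example window of 16 bits are assumptions.

from collections import deque

class DGIM:
    """Maintain DGIM buckets (timestamp, size) for counting 1s in the last N bits."""
    def __init__(self, window_size):
        self.N = window_size
        self.time = 0
        self.buckets = deque()          # newest bucket on the left, sizes are powers of 2

    def add_bit(self, bit):
        self.time += 1
        # drop the oldest bucket if its timestamp falls outside the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            self.buckets.pop()
        if bit == 0:
            return                      # a 0 never creates or changes buckets
        self.buckets.appendleft((self.time, 1))
        # whenever three buckets share a size, merge the two oldest into one of twice the size
        i = 0
        while True:
            same = [j for j, b in enumerate(self.buckets) if b[1] == self.buckets[i][1]]
            if len(same) < 3:
                break
            older, oldest = same[-2], same[-1]
            merged = (self.buckets[older][0], self.buckets[older][1] * 2)
            del self.buckets[oldest]
            self.buckets[older] = merged
            i = older                   # the merged bucket may itself need merging

    def estimate_ones(self):
        """Estimate of 1s in the window : all buckets fully, only half of the oldest one."""
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2

dgim = DGIM(window_size=16)
for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]:
    dgim.add_bit(b)
print(dgim.estimate_ones())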
In a decaying window, it is easier to adjust the sum exponentially than to maintain a
sliding window of fixed length. The effect of this definition is to spread the weights of
the stream elements out as far back in time as the stream goes. In a sliding window, the
element that falls out of the window each time a new element arrives needs to be handled
explicitly. In contrast, a fixed window with the same sum of weights, 1/c, would put equal
weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous
elements, which is illustrated in Fig. 4.9.1. When a new element a(t+1) arrives at the
stream input, we first multiply the current sum by 1 − c and then add a(t+1).
In this method, each of the previous elements moves one position further from the
current element, so its weight is multiplied by 1 − c. The weight on the current element is
(1 − c)^0 = 1, so adding a(t+1) is the correct way to include the new element's
contribution.
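A minimal sketch of maintaining such an exponentially decaying sum is shown below; the constant c = 0.01 (a decaying window roughly equivalent to the most recent 1/c = 100 elements) and the toy bit stream are illustrative assumptions.

def decaying_sum(stream, c):
    """Exponentially decaying sum : on each arrival, multiply the sum by (1 - c) and add the new element."""
    s = 0.0
    for value in stream:
        s = (1.0 - c) * s + value
        yield s

current = None
for current in decaying_sum([1, 0, 1, 1, 0, 1], c=0.01):
    pass
print(round(current, 4))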
smart meters, along with a huge number of new data sources. Real-time analytics
leverages information from all these devices to apply analytics algorithms and generate
automated actions within milliseconds of a trigger. A real-time analytics platform is
composed of three components, namely :
Input : which is generated when an event happens (such as a new sale, a new customer,
or someone entering a high-security zone).
Processing unit : which captures the data of the event and analyzes it without leveraging
resources dedicated to operations; this also involves executing different standing and
ad-hoc queries over the streamed data.
Output : which consumes this data without disturbing operations, explores it for better
insights and generates analytical results by means of different visual reports on a
dedicated dashboard. The general architecture of a real-time analytics platform is shown
in Fig. 4.10.1.
The various requirements for real-time analytics platform are as follows :
1. It must support continuous queries for real-time events.
2. It must consider features like robustness, fault tolerance, low-latency reads and
updates, incremental analytics and learning, and scalability.
3. It must provide improved in-memory transaction speed.
4. It should quickly move data that is not needed into secondary disk for persistent
storage.
5. It must support distributing data from various sources with speedy processing.
The basic building blocks of a real-time streaming platform are shown in Fig. 4.10.2.
The streaming data is collected from various flexible data sources by producer
connectors, which move the data from the sources to the queuing system. The queuing
system is fault tolerant and persistent in nature. The streamed data is then buffered to be
consumed by the stream processing engine. The queuing system is a high-throughput,
low-latency system which provides high availability and fail-over capabilities. There are
many technologies that support real-time analytics, such as :
SAP HANA : It is a streaming analytical tool that allows SAP users to capture, stream
and analyze data with active event monitoring and event driven response to
applications.
Apache Spark : It is a streaming platform for big data analytics in real-time developed
by Apache.
Cisco Connected Streaming Platform : It is used for finding the insights from high
velocity streams of live data over the network with multiple sources with enabled
immediate actions.
Oracle Stream Analytics : It provides a graphical interface for performing analytics over
real-time streamed data.
Google Real Time Analytics : It is used for performing real-time analytics over the
cloud data collected over different applications.
comments and feedback with emotional states such as “angry”, “sad” and “happy”. It
tries to identify and extract the sentiments expressed within the text. The analysis of
sentiments can be either document based, where the sentiment of the entire document is
summarized as positive, negative or objective, or sentence based, where the individual
sentences in the text that bear sentiments are classified.
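A toy sketch of sentence-level and document-level sentiment scoring with a tiny hand-made lexicon is shown below; the word lists, the majority vote and the example text are purely illustrative assumptions, whereas production systems rely on trained models.

POSITIVE = {"happy", "great", "love", "excellent", "good"}
NEGATIVE = {"angry", "sad", "bad", "terrible", "hate"}

def sentence_sentiment(sentence):
    """Classify a sentence as positive, negative or objective by counting lexicon hits."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "objective"

def document_sentiment(text):
    """Summarize a document by majority vote over its sentence-level labels."""
    labels = [sentence_sentiment(s) for s in text.split(".") if s.strip()]
    return max(set(labels), key=labels.count) if labels else "objective"

print(document_sentiment("I love this phone. The battery is bad. Overall a great buy."))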
Sentiment analysis is widely applied to reviews and social media for a variety of
applications, ranging from marketing to customer service. In the context of analytics,
sentiment analysis is “the automated mining of attitudes, opinions and emotions from
text, speech and database sources”. With the proliferation of reviews, ratings,
recommendations and other forms of online expression, online opinion has turned into a
kind of virtual currency for businesses looking to market their products, identify new
opportunities and manage their reputations.
Some of the popular applications of real-time sentiment analysis are,
1) Collecting and analyzing sentiments on Twitter, which has become a central site
where people express their opinions and views on political parties and candidates.
Emerging events or news are often followed almost instantly by a burst
in Twitter volume, which if analyzed in real time can help explore how these events
affect public opinion. While traditional content analysis takes days or weeks to
complete, real time sentiment analysis can look into the entire Twitter traffic about
the election, delivering results instantly and continuously. It offers the public, the
media, politicians and scholars a new and timely perspective on the dynamics of the
electoral process and public opinion.
2) Analyzing the sentiments of messages posted to social networks or online forums
can generate countless business values for the organizations which aim to extract
timely business intelligence about how their products or services are perceived by
their customers. As a result, proactive marketing or product design strategy can be
developed to effectively increase the customer base.
3) Tracking crowd sentiment while TV commercials are being viewed, enabling
advertising agencies to decide which commercials result in positive sentiment and
which do not.
4) A news media website is interested in getting an edge over its competitors by
featuring site content that is immediately relevant to its readers where they use
social media to know the topics relevant to their readers by doing real time
sentiment analysis on Twitter data. Specifically, to identify what topics are trending in
real time on Twitter, they need real-time analytics about tweet volume and sentiment for
key topics.
5) In Marketing, real-time sentiment analysis can be used to gauge public reactions to
products or services supplied by an organization. The analysis determines which
products or services people like or dislike and how they can be improved.
6) In Quality Assurance, real-time sentiment analysis can be used to detect errors in
products based on actual users' experience.
7) In Politics, real-time sentiment analysis can be used to determine the views of
people regarding specific situations, i.e. whether they are angry or happy about them.
8) In Finance, real-time sentiment analysis tries to detect the sentiment towards a
brand in order to anticipate its market moves.
The best example of real-time sentiment analysis is predicting the pricing or
promotions of a product being offered through social media and the web. The solution for
price or promotion prediction can be implemented with software solutions like RADAR
(Real-Time Analytics Dashboard Application for Retail) and Apache Storm. RADAR is a
software solution for retailers, built using a Natural Language Processing (NLP) based
sentiment analysis engine that utilizes different Hadoop technologies, including HDFS,
Apache Storm, Apache Solr, Oozie and ZooKeeper, to help enterprises maximize sales
through data-based continuous re-pricing. Apache Storm is a distributed real-time
computation system for processing large volumes of high-velocity data. It is part of the
Hadoop ecosystem. Storm is extremely fast, with the ability to process over a million
records per second per node on a cluster of modest size. Apache Solr is another tool from
the Hadoop ecosystem which provides highly reliable, scalable search facility at real time.
RADAR uses Apache STORM for real-time data processing and Apache SOLR for
indexing and data analysis. The generalized architecture of RADAR for retail is shown in
Fig. 4.11.1.
For retailers, RADAR can be customized so that they can track, for any number of
products / services in their portfolio, the social sentiment for each product or service they
are offering and the competitive pricing / promotions being offered through social media
and the web. With this solution, retailers can create continuous re-pricing campaigns and
implement them in real time in their pricing systems, track the impact of re-pricing on
sales and continuously compare it with social sentiment.
Traditionally, stock market prediction algorithms checked historical stock prices and
tried to predict the future using different models. The traditional approach is not effective
in real time because stock market trends change continually based upon economic forces,
regulations, competition, new products, world events and even (positive or negative)
tweets, all of which affect stock prices. Thus, predicting stock prices using real-time
analytics is a necessity. The generalized architecture for real-time stock prediction has
three basic steps, as shown in Fig. 4.12.1.
Fig. 4.12.2 : Detailed representation of real-time stock prediction using machine learning