Unit-2 BDA

The document discusses the concepts of stream processing in big data analytics, highlighting its advantages over traditional batch processing, such as real-time data analysis and the ability to handle continuous data streams. It outlines various applications of data streaming, sources of streamed data, and the architecture of data stream management systems. Additionally, it addresses the challenges and advantages of stream computing, emphasizing its role in enabling organizations to make timely decisions based on rapidly changing data.


4.1 Introduction to Streams Concepts


A stream is a sequence of data elements that flows as a group. Big data analytics processes data that is either stored in databases or generated in real time. Traditionally, data was processed in batches after being stored in databases. Batch processing is simply the processing of a block of data that has been accumulated in a database over a period of time. Such data may contain millions of records generated in a day, which are stored as files or records and processed at the end of the day for various kinds of analysis. Batch processing is capable of handling huge volumes of stored data, but with long periods of latency. For example, all the transactions performed by a financial firm in a week can be processed together to obtain weekly analytics.
A stream is also the communication of bytes or characters over a socket in a computer network. Stream processing is another big data technology; it is used to query continuous data streams generated in real time in order to find insights or detect conditions and take action within a small period of time. Data streaming is the process of sending data continuously rather than in batches. In some applications, the data arrives as a continuous stream of events. If we use batch processing in such applications, we have to stop the data collection at some point, store the data as a batch and process it, and when the next batch arrives we must aggregate its results with those of previously processed batches, which is difficult and time consuming. In contrast, stream processing supports never-ending data streams : it can detect patterns and inspect results at multiple levels on real-time data without storing the data in batches, and aggregation of the processed data is simpler. For example, with stream processing you can receive an alert when a stock price crosses a threshold, or get a notification when the temperature reaches the freezing point, by querying the data stream coming from a temperature sensor. Data streaming is ideally suited for time-series analysis and for detecting hidden patterns over time, and it can process multiple streams simultaneously. In the data stream model, individual data items may be relational tuples, for example network measurements, call records, web page visits, sensor readings and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable and unbounded streams appears to yield some fundamentally new research problems.
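To make the contrast concrete, here is a minimal Python sketch (the temperature_readings() generator is a hypothetical stand-in for a real sensor feed) that processes each reading the moment it arrives and raises an alert when the value reaches the freezing point, instead of waiting for an end-of-day batch.

```python
import random
import time

def temperature_readings():
    """Hypothetical stand-in for a live sensor feed: yields one reading at a time."""
    while True:
        yield random.uniform(-5.0, 10.0)   # degrees Celsius
        time.sleep(0.1)

def stream_alerts(readings, freezing_point=0.0):
    """Process the stream element by element and act immediately on each reading."""
    for value in readings:
        if value <= freezing_point:
            print(f"ALERT: temperature reached {value:.1f} degrees (freezing point)")

# stream_alerts(temperature_readings())   # runs until interrupted
```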

4.1.1 Applications of Data Streaming


The popular applications of data streaming are :
a) In an e-commerce site, to find anomalous behavior, the clickstream records are streamed and a security alert is generated if the clickstream shows suspicious behavior.
b) In financial institutions, the market changes are tracked and customer portfolios are adjusted based on configured constraints.
c) In a power grid, an alert or notification is generated based on throughput when certain thresholds are reached.
d) In a news source, articles that are relevant to the audience are selected by analyzing the clickstream records from various sources based on demographic information.
e) In network management and web traffic engineering, the streams of packets are
collected and processed to detect anomalies.

4.1.2 Sources of Streamed Data


There are various sources of streamed data which provide data for stream processing. The sources of streaming data range from computer applications to Internet of Things (IoT) sensors. They include satellite data, sensor data, IoT applications, websites, social media data and so on. Examples of sources of stream data are listed below :
a) Sensor data : Data is received from different kinds of wired or wireless sensors. For example, real-time data generated by a temperature sensor is provided to the stream processing engine so that action can be taken when a threshold is met.
b) Satellite image data : Data is received from satellites as streams to earth stations, and may consist of many terabytes of images per day. Surveillance cameras fitted in a satellite produce images that are streamed for processing at stations on earth.
c) Web data : Real-time streams of IP packets generated on the internet are provided to a switching node, which runs queries to detect denial-of-service or other attacks and then reroutes packets based on information about congestion in the network.
d) Data in online retail stores : A retail firm collects, stores and processes data about product purchases and services by particular customers in order to analyze customer behavior.
e) Social web data : Data generated through social media websites like Twitter and Facebook is used by third-party organizations for sentiment analysis and prediction of human behavior.

The applications which use data streams are :

 Real-time maps which use location-based services to find the nearest point of interest
 Location-based advertisements or notifications
 Watching streamed videos or listening to streamed music
 Subscribing to online news alerts or weather forecasting services
 Monitoring
 Performing fraud detection on live online transactions
 Detecting anomalies in network applications
 Monitoring and detecting potential system failures using network monitoring tools
 Monitoring embedded systems and industrial machinery in real time using surveillance cameras
 Subscribing to real-time updates on social media like Twitter, Facebook etc.

4.2 Stream Data Model and Architecture


The stream data model is responsible for receiving and processing real-time data on analytical platforms. Unlike a database management system, the stream data model uses a data stream management system. It consists of stream processors for managing and processing the data streams. The input for a stream processor is provided by the applications, and multiple streams may enter the system for processing.

4.2.1 Data Stream Management System


The traditional relational databases are intended for storing and retrieving records of
data that are static in nature. Further these databases do not perceive a notion of time
unless time is added as an attribute to the database during designing the schema itself.
While this model was adequate for most of the legacy applications and older repositories
of information, many current and emerging applications require support for online
analysis of rapidly arriving and changing data streams. This has prompted the development of new models to manage streaming data, resulting in data stream management systems (DSMS), with an emphasis on continuous query languages and query evaluation. Each input stream in a data stream management system may have a different data type and data rate. The typical architecture of a data stream management system is shown in Fig. 4.2.1.

Fig. 4.2.1: Architecture of data stream management system

The streams which are input to the stream processor have to be stored in a temporary store, or working store. The temporary store in the data stream model is a transient store used for holding the parts of the streams that can be queried for processing. The temporary store can be disk or main memory, depending on how fast queries need to be processed. The results of the queries are stored in a large archival store; archival data cannot normally be used for query processing, but it can be used in special circumstances. A streaming query is a continuous query that executes over the streaming data. Streaming queries are similar to the database queries used for analyzing data, but differ in that they operate continuously on the data as it arrives incrementally in real time. The stream processor supports two types of queries, namely ad-hoc queries and standing queries. An ad-hoc query produces variable-dependent results : each query generates different results depending on the value of the variable. A common approach for ad-hoc queries is to store a sliding window of each stream in the working or temporary store. The system does not store all the streams entirely; instead it expects to answer arbitrary queries about the streams by storing appropriate parts or summaries of the streams. Ad-hoc queries are intended for a specific purpose, in contrast to a predefined query.
Alternatively, standing queries are continuous queries that execute over the streaming data and whose functions are predetermined. For a standing query, each time a new stream element arrives, the query produces updated aggregate results.
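As a small illustration (the class and method names below are invented for this sketch, not part of any particular DSMS), a standing query can be maintained incrementally as each element arrives, while an ad-hoc query is answered from a sliding window kept in the working store.

```python
from collections import deque

class ToyStreamProcessor:
    """Toy stream processor: one standing query plus a sliding window for ad-hoc queries."""

    def __init__(self, window_size=1000):
        self.window = deque(maxlen=window_size)   # working (temporary) store
        self.max_so_far = float("-inf")           # state maintained by the standing query

    def on_element(self, value):
        # Standing query: predetermined, evaluated every time a new element arrives.
        if value > self.max_so_far:
            self.max_so_far = value
            print(f"standing query: new maximum {value}")
        # Keep only a sliding window of recent elements for later ad-hoc queries.
        self.window.append(value)

    def adhoc_average(self):
        # Ad-hoc query: answered from whatever part of the stream is still stored.
        return sum(self.window) / len(self.window) if self.window else None

sp = ToyStreamProcessor(window_size=5)
for x in [3, 9, 2, 11, 7, 4]:
    sp.on_element(x)
print("ad-hoc query over the window:", sp.adhoc_average())
```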

4.2.2 Data Streaming Architecture


A streaming data architecture is a framework for processing huge volumes of streaming data from multiple sources. Traditional data solutions concentrated on consuming and processing data in batches, whereas a streaming data architecture consumes data immediately as it is produced, stores it in a storage medium and performs real-time data processing, manipulation and analytics. Most streaming architectures are built from solutions specific to problems such as stream processing, data integration, data storage and real-time analytics. The generalized streaming architecture is composed of four components, the message broker or stream processor, ETL tools, the query engine and streaming data storage, as shown in Fig. 4.2.2.
The first component of the data streaming architecture is the message broker or stream processor. It produces the data for the rest of the pipeline by translating the incoming streams into a standard message format, and the other components in the architecture can consume the messages passed by the broker. Popular legacy message brokers are RabbitMQ and Apache ActiveMQ, which are based on message-oriented middleware, while the more recent messaging platforms for stream processing are Apache Kafka and Amazon Kinesis.
The second component of the data stream architecture is batch or real-time ETL (Extract, Transform and Load) tools, which stream data from one or more message brokers and aggregate or transform it into a well-defined structure before the data can be analyzed with SQL-based analytics tools.

Fig. 4.2.2 : Generalized data streaming architecture



The ETL platforms receive queries from users; based on these, they fetch events from the message queues and apply the query to the stream data to generate a result, performing additional joins, transformations or aggregations. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream. Popular ETL tools for streaming data are Apache Storm, Spark, Flink and Samza.
The third component of the data stream architecture is the query engine, which is used once the streaming data has been prepared for consumption by the stream processor; such data must be analyzed to provide valuable insights. The fourth component is streaming data storage, which is used to store the streaming event data in different storage mediums such as data lakes.
Stream data processing provides several benefits, such as the ability to deal with never-ending streams of events, real-time data processing, detecting patterns in time-series data and easy data scalability, while some of its limitations are network latency, limited throughput, slow processing, support only for window-sized portions of the streams, and limitations related to in-memory access to stream data.
The common examples of data stream applications are :
 Sensor networks : These are a huge source of data occurring in streams and are used in numerous situations that require constant monitoring of several variables, based on which important decisions are made.
 Network traffic analysis : Network service providers can constantly obtain information about Internet traffic, heavily used routes, etc., to identify and predict potential congestion or identify potentially fraudulent activities.
 Financial applications : Online analysis of stock prices is performed and used for making sell decisions about a product, quickly identifying correlations with other products, understanding fast-changing trends and, to an extent, forecasting future valuations of the product.
Queries over continuous data streams have much in common with queries in a traditional DBMS. Two types of queries can be identified as typical over data streams, namely one-time queries and continuous queries :
a) One-time queries : One-time queries are queries that are evaluated once over a
point-in-time snapshot of the data set, with the answer returned to the user. For
example, a stock price checker may alert the user when a stock price crosses a
particular price point.
b) Continuous queries : Continuous queries, on the other hand, are evaluated
continuously as data streams continue to arrive. The answer to a continuous query is
produced over time, always reflecting the stream data seen so far. Continuous query
answers may be stored and updated as new data arrives, or they may be produced as
data streams themselves.

4.2.3 Issues in Data Stream Query Processing


Apart from benefits, there are some issues in data stream query processing which are
explained as follows
a) Unbounded memory requirements : Since data streams are potentially unbounded
in size, the amount of storage required to compute an exact answer to a data stream
query may also grow without bound. Algorithms that use external memory are not
well-suited to data stream applications since they do not support continuous queries.
For this reason, we are interested in algorithms that are able to confine themselves to
main memory without accessing disk.
b) Approximate query answering : When we are limited to a bounded amount of
memory, it is not always possible to produce exact answers for the data stream
queries; however, high-quality approximate answers are often acceptable in lieu of
exact answers.
c) Sliding windows : One technique for approximate query answering is to evaluate the query not over the entire past history of the data streams, but rather only over sliding windows of recent data from the streams. Imposing sliding windows on data streams is a natural method of approximation that has several attractive properties, and for many applications sliding windows are in fact required as part of the desired query semantics, explicitly expressed in the user's query.
d) Blocking operators : A blocking query operator is a query operator that is unable to
produce an answer until it has seen its entire input.

4.3 Stream Computing


Stream computing is a computing paradigm that reads data from collections of sensors in stream form and, as a result, computes over continuous real-time data streams. Stream computing enables graphics processors (GPUs) to work in coordination with low-latency, high-performance CPUs to solve complex computational problems. A data stream in stream computing is a sequence of data sets, and a continuous stream carries an infinite sequence of data sets. Stream computing can be applied to high-velocity streams of data from real-time sources such as market data, mobile devices, sensors, clickstreams and even transactions. It empowers organizations to analyze and follow up on rapidly changing data in real time, upgrade existing models with new bits of insight, capture, analyze and act on insights, and move from batch processing to real-time analytical decisions. Stream computing supports low-latency velocities and massively parallel processing architectures to obtain useful knowledge from big data. Consequently, the stream computing model is a new trend for high-throughput computing in big data analytics. Organizations that use stream computing include telecommunications, health care, utility companies, municipal transit, security agencies and many more. Two popular use cases of stream computing are distribution load forecasting, conditional maintenance and smart meter analytics in the energy industry, and monitoring a continuous stream of data from a network sensor to generate alerts when an intrusion is detected.

4.3.1 Stream Computing Architecture


The architecture of stream computing consists of five components namely: Server,
integrated development environment, database connectors, streaming analytics engine
and data mart. The generalized architecture of stream computing is shown in Fig. 4.3.1.
In this architecture, the server is responsible for processing the real-time streaming
event data with high throughputs and low latency. The low latency is provided by means
of processing the streams in a main memory.

Fig. 4.3.1 : Generalized architecture of stream computing



The integrated development environment (IDE) is used for debugging and testing stream processing applications that process streams using streaming operators. It supports visual development of applications, provides filtering, aggregation and correlation methods for streamed data, along with a user interface for time-window analysis. The database connectors are used to provide rule engines and stream processing engines for processing streamed data with multiple DBMS features; common main-memory DBMSs and rule engines have to be redesigned for use in stream computing. The streaming analytics engine allows management, monitoring and real-time analytics for real-time streaming data, and the data mart is used for storing live data for processing, with additional features such as operational business intelligence. It also provides automated alerts for events.

4.3.2 Advantages of stream computing


The advantages of stream computing are listed as follows :
 It supports both simple and extremely complex analytics with agility
 It is scalable as per computational intensity
 It supports a wide range of relational and non-relational data types
 It can analyze continuous, massive volumes of data at rates up to petabytes
 Performs complex analytics of heterogeneous data types including text, images,
audio, voice, VoIP, video, web traffic, email, GPS data, financial transaction data,
satellite data, sensors, and any other type of digital information that is relevant to
your business.
 Leverages sub-millisecond latencies to react to events and trends as they are
unfolding, while it is still possible to improve business outcomes.
 It can seamlessly deploy applications on computer clusters of any size and adapts to work in rapidly changing environments.

4.3.3 Limitations of Stream Computing


 In extreme cases, security and data confidentiality are the main concerns in stream computing.
 Flexibility, resiliency and data type handling are serious considerations in stream computing.

In data stream processing, the three important operations used are sampling, filtering and counting distinct elements from the stream, which are explained in the subsequent sections.

4.4 Sampling Data in a Stream


Sampling in a data stream is the process of collecting a representative sample of the elements of the stream. The sample is usually a much smaller subset of the entire stream, but it is designed to retain the original characteristics of the stream. The elements that are not stored within the sample are lost forever and cannot be retrieved. The sampling process is intended to extract reliable samples from a stream. Data stream sampling uses many stream algorithms and techniques to extract the sample, but the most popular technique is hashing.
Sampling is used when we want to run queries on a subset of the stream that is statistically representative of the stream as a whole. In such cases ad-hoc queries can be run on the sample, with hashing used to select it.
For example, suppose we want to study a user's behavior on a search engine, and the search engine receives multiple streams of queries. Here, we assume that the stream consists of tuples (user, query, time). Suppose we run an ad-hoc query to find out what portion of the typical user's search queries were repeated over the past month. One approach would be to generate a random number between 0 and 9 in response to each search query, and store the tuple if and only if the random number is 0, so that on average 1/10th of each user's queries are stored. But due to statistical fluctuations, noise gets introduced into the data when users issue large numbers of queries, and this scheme gives the wrong answer to the query asking for the average number of duplicate queries per user. To see this, suppose a user issued S search queries once in the month and T search queries twice.
For a 1/10th sample of queries, we can expect S/10 of the single search queries to appear in the sample, but of the queries issued twice, only about T/100 will appear twice in the sample, since T/100 is the fraction T times the probability 1/100 that both occurrences of the query end up in the 1/10th sample. For the full stream, however, the correct fraction of repeated searches would be T/(S + T).
To obtain a representative sample, we mark each user as "in" or "out". If we have previous search records for the user performing the current search, we do nothing further; but if we have no search record for the user, we generate a random integer between 0 and 9. If the number generated is 0, we add the user to our list with value "in", otherwise we add the user with the value "out".

This method works well as long as we can keep the list of all users and their in/out decisions in main memory. By using a hash function, one can avoid keeping the list of users : we hash each user name to one of ten buckets, 0 to 9. If the user hashes to bucket 0, then we accept that user's search queries for the sample, and if not, we do not. Effectively, we use the hash function as a random number generator; without storing the in/out decision for any user, we can reconstruct that decision any time a search query from that user arrives.
In the generalized sampling problem, the stream consists of tuples with n components. A subset of the components are the key components, on which the selection of the sample is based. In our example, the components are user, query and time, and the user alone is the key. We can thus sample the queries on the key attribute to obtain the desired outcome.
In general, to generate a sample of size a/b of the stream, we hash the key value of each tuple to b buckets, and accept the tuple for the sample if the hash value is less than a. The result will be a sample consisting of all tuples with certain key values, and the selected key values will be approximately a/b of all the key values appearing in the stream. While sampling methods reduce the amount of data to process, and consequently the computational cost, they can also be a source of errors. The main problem is to obtain a representative sample, a subset of the data that has approximately the same properties as the original data.
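A minimal sketch of this key-based sampling, assuming the tuples are (user, query, time) triples with user as the key and using Python's hashlib to provide a deterministic hash into b buckets:

```python
import hashlib

def key_bucket(key, b):
    """Hash a key value deterministically into one of b buckets (0 .. b-1)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % b

def sample_stream(tuples, a, b):
    """Keep a tuple iff its key hashes below a, i.e. keep about a/b of all key values."""
    for user, query, ts in tuples:
        if key_bucket(user, b) < a:
            yield (user, query, ts)

stream = [("alice", "laptop", 1), ("bob", "shoes", 2), ("alice", "laptop", 3)]
print(list(sample_stream(stream, a=1, b=10)))   # roughly 1/10 of the users are kept
```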

4.4.1 Types of Sampling


There are three basic types of sampling, explained as follows :
4.4.1.1 Reservoir Sampling

In reservoir sampling, randomized algorithms are used to choose a random sample from a list of items, where the list is either very large or of unknown length. For example, imagine you are given a really large stream of data and your goal is to efficiently return a random sample of 1000 elements evenly distributed over the original stream. If the length N of the stream were known, a simple way would be to generate random integers between 0 and (N – 1) and retrieve the elements at those indices; reservoir sampling achieves the same effect without knowing N in advance.
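A sketch of the classical reservoir sampling procedure (often called Algorithm R), which maintains a uniform sample of k elements from a stream of unknown length:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k elements from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)          # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```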
4.4.1.2 Biased Reservoir Sampling

Biased reservoir sampling uses a bias function to regulate the sampling from the stream. In many cases the stream data may evolve over time, and the corresponding data mining or query results may also change over time. Thus, the results of a query over a more recent window may be quite different from the results of a query over a more distant window. Similarly, the entire history of the data stream may not be relevant for use in a repetitive data mining application such as classification. The simple reservoir sampling algorithm can be adapted to sample from a moving window over the data stream. This is useful in many data stream applications where a small amount of recent history is more relevant than the entire previous stream. Biased sampling gives a higher probability of selecting data points from recent parts of the stream as compared to the distant past. The bias function is quite effective since it regulates the sampling in a smooth way, so that queries over recent horizons are more accurately resolved.
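One simple way to realize such a bias is sketched below (a memory-less, exponentially biased reservoir; the specific bias function is an assumption chosen for illustration, not the only possible one): every arriving element is inserted, and with a probability equal to the current fill fraction of the reservoir it overwrites a random existing element, so that older elements decay away.

```python
import random

def biased_reservoir(stream, capacity):
    """Exponentially biased reservoir: recent elements are more likely to survive."""
    reservoir = []
    for item in stream:
        fill_fraction = len(reservoir) / capacity
        if random.random() < fill_fraction:
            # Replace a random existing element (deletion followed by insertion).
            reservoir[random.randrange(len(reservoir))] = item
        else:
            # Otherwise simply add the new element; the reservoir never exceeds capacity.
            reservoir.append(item)
    return reservoir

sample = biased_reservoir(range(100_000), capacity=100)
print(min(sample), max(sample))   # the sample is skewed towards recent elements
```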
4.4.1.3 Concise Sampling

The size of the reservoir is sometimes restricted by the available main memory, and it is desirable to increase the sample size within the available main memory restrictions. For this purpose, the technique of concise sampling is quite effective. Concise sampling exploits the fact that the number of distinct values of an attribute is often significantly smaller than the size of the data stream. In many applications, sampling is performed based on a single attribute of multi-dimensional data; this type of sampling is called concise sampling. For example, for customer data in an e-commerce site, sampling may be done based only on customer ids. The number of distinct customer ids is definitely much smaller than n, the size of the entire stream.

4.5 Filtering Streams


Data stream processing offers another operation called selection, or filtering. Filtering is the process of accepting the tuples in the stream that meet a selection criterion; accepted tuples are passed to another process as a stream, while rejected tuples are dropped. If the selection criterion is a property that can be computed from the tuple itself, filtering is easy; but if the criterion involves a lookup for membership in a set, filtering becomes hard when the set is too large to store in main memory.
Hashes are the individual entries in a hash table that act like an index. A hash function is used to produce the hash values : the input is an element containing complex data, and the output is a simple number that acts as an index to that element. A hash function is deterministic in nature because it produces the same number every time you feed it a specific data input.
Let us take an example. Suppose we have a set S of one million allowed email addresses, which are known not to be spam. The stream consists of pairs of an email address and the email itself. As each email address consumes 20 bytes or more of space, it is not reasonable to store the set S in main memory, so we have to use disk to store and access it.
Suppose instead we use main memory as a bit array; then we can use an array of eight million bits and a hash function h that maps email addresses to eight million buckets. Since there are one million members of S, approximately 1/8th of the bits will be 1 and the rest will be 0. As soon as a stream element arrives, we hash its email address; if the bit at the hashed position is 1 then we let the email through, else we drop the stream element. Sometimes a spam email will still get through, so to eliminate every spam we need to check, for the emails that get through the filter, whether they are actually members of the set S. The Bloom filter is used in such cases to eliminate the tuples which do not meet the selection criterion.

4.5.1 Bloom Filter


The purpose of the Bloom filter is to allow through all the stream elements whose keys K lie in the set S, while rejecting most of the stream elements whose keys K are not part of S. The basic algorithm of a bloom filter consists of test and add methods, where test is used to check whether a given element is in the set or not. If it returns false, we conclude the element is definitely not in the set; if it returns true, we consider the element to be probably in the set, and the false positive rate is a function of the bloom filter's size and the number of independent hash functions used. The add method simply adds an element to the set; removal is not possible without introducing false negatives, although extensions to the bloom filter that allow it are possible.
Typically, a Bloom filter consists of three basic parts :
a) An array (vector) of n bits, whose bits are initially all set to 0.
b) A group of hash functions {h1, h2, . . . , hk}, where each hash function maps "key" values K to the n buckets corresponding to the n bits of the array.
c) A set S of m key values.
In bloom filtering, we first initialize the n-bit array by setting all bits to 0. We then take each key value K in the set S and hash it using each of the k hash functions; every bit that is hi(K) for some hash function hi and some key value K in S is set to 1.
To test a key K that arrives in the stream, check whether all of the bits h1(K), h2(K), . . . , hk(K) are 1's in the bit array. If all of these bits are 1, then let the stream element pass through; otherwise discard it. That is, if one or more of these bits remain 0, then K cannot be in S, so we reject the stream element. To find out how many unwanted elements are passed, we need to calculate the probability of a false positive as a function of n, the bit-array length, m, the number of members of the set S, and k, the number of hash functions.
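A compact sketch of a Bloom filter along these lines; the use of two hashlib digests combined by double hashing to simulate k independent hash functions is an implementation choice made for this example, and each bit is stored as one byte for simplicity:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = bytearray(n_bits)       # a) n-bit array, all 0 (one byte per bit here)

    def _positions(self, key):
        # Simulate k hash functions by double hashing two digests of the key.
        h1 = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        h2 = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, key):
        # b), c) for each key in S, set the bits h_i(K) to 1.
        for p in self._positions(key):
            self.bits[p] = 1

    def test(self, key):
        # True means "probably in S" (false positives possible); False means "definitely not".
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter(n_bits=8_000_000, k_hashes=3)
bf.add("alice@example.com")
print(bf.test("alice@example.com"))    # True
print(bf.test("spammer@example.com"))  # almost certainly False
```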
To analyze the false positive rate, consider a model in which darts are thrown at targets. Suppose we have T targets and D darts, and any dart is equally likely to hit any target. Then the analysis of how many targets we can expect to be hit at least once proceeds as follows :
 The probability that a given dart will not hit a given target is (T – 1)/T.
 The probability that none of the D darts will hit a given target is ((T – 1)/T)^D.
 Using the approximation (1 – 1/T)^T ≈ e^(–1), the probability that none of the D darts hits a given target is approximately e^(–D/T).

4.6 Counting Distinct Elements in a Stream


After sampling and filtering on a data stream, the third kind of processing is the count-distinct problem. As with sampling and filtering, the goal is to keep the space needed per stream within a reasonable amount of main memory, by using a variety of hashing techniques and randomized algorithms.

4.6.1 The Count-Distinct Problem


The count-distinct problem is used for finding the number of distinct elements in a
data stream with repeated elements. Suppose stream elements are chosen from some
universal set. We would like to know how many different elements have appeared in the
stream, counting either from the beginning of the stream or from some known time in the
past. A simple solution is to traverse the given array, consider every window in it and
count distinct elements in the window.
For example : Given an array of size n and an integer k, return the count of distinct numbers in all windows of size k, where k = 4 and the input array is {1, 2, 1, 3, 4, 2, 3}.
Here, as the window size is k = 4, in the first pass the window is {1, 2, 1, 3}, so the count of distinct numbers in the first pass is 3. In the second pass the window is {2, 1, 3, 4} and the count of distinct numbers is 4. In the third pass the window is {1, 3, 4, 2} and the count of distinct numbers is 4, and in the fourth pass the window is {3, 4, 2, 3}, so the count of distinct numbers is 3. Therefore, the final counts of distinct numbers are 3, 4, 4, 3.
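The exact, brute-force version of this windowed count can be written directly; a short sketch:

```python
def distinct_counts(arr, k):
    """Return the count of distinct elements in every window of size k."""
    return [len(set(arr[i:i + k])) for i in range(len(arr) - k + 1)]

print(distinct_counts([1, 2, 1, 3, 4, 2, 3], k=4))   # [3, 4, 4, 3]
```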
Let us take another example : suppose we want to find out how many unique users have accessed a particular website, say Amazon, in a given month, as part of gathering statistics. Here the universal set would be the set of logins, or the set of IP addresses (sequences of four 8-bit bytes) from which queries are sent to the site. The easiest way to solve this problem is to keep in main memory a list of all the elements seen so far in the stream, arranged in a search structure such as a hash table or search tree so that new elements can be added quickly; this yields the exact number of distinct elements appearing in the stream. However, if the number of distinct elements is too large, we cannot store them all in main memory. Therefore, the solution to this problem is to use several machines, each handling one or more of the streams, and to store most of the data structure in secondary memory.

4.7 Estimating Moments

The generalization of the problem of counting distinct elements in a stream is an interesting issue by itself. The problem, called computing "moments", involves the distribution of frequencies of the different elements in the stream. Suppose a stream consists of elements chosen from a universal set A whose elements can be ordered, so that we may speak of the i-th element, and let m_i be the number of occurrences of the i-th element. Then the k-th order moment of the stream is the sum over all elements i in A of (m_i)^k, i.e.

F_k = Σ_{i ∈ A} (m_i)^k

Here, the 0th moment of the stream is the sum of 1 for each m_i > 0, which is the number of distinct elements. The 1st moment of the stream is the sum of all the m_i, which must be the length of the stream. The 2nd moment of the stream is the sum of the squares of the m_i; it is sometimes called the surprise number S, since it measures how uneven the distribution of elements in the stream is. The second moment describes the "skewness" of the distribution : the smaller the value of F_2, the less skewed the distribution is.
For example, suppose we have a stream of length 100 in which eleven different elements appear. The most even distribution of these eleven elements would be one element appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number would be 1 × 10² + 10 × 9² = 910. We cannot keep a count in main memory for every element that appears in the stream, so we need to estimate the k-th moment of a stream by keeping a limited number of values in main memory and computing an estimate from these values.
Examples :
Consider the following data streams, given as the frequencies of their elements, and calculate the surprise number :
1) 5, 5, 5, 5, 5 → Surprise number = 5 × 5² = 125
2) 9, 9, 5, 1, 1 → Surprise number = 2 × 9² + 1 × 5² + 2 × 1² = 189
To estimate the second moment of the stream with a limited amount of main memory space we can use the Alon-Matias-Szegedy (AMS) algorithm; the more space we use, the more accurate the estimate will be. In this algorithm, we compute some number of variables X. For each variable X we store X.element, a particular element of the universal set, and X.value, an integer that is the value of the variable. To determine a variable X, we select a position in the stream between 1 and n at random, set X.element to the element found there, and initialize X.value to 1. As we read the rest of the stream, we add 1 to X.value each time we encounter another occurrence of X.element. Technically, the estimates of the second and higher moments assume that the stream length n is a constant, whereas in practice n grows with time. Here, we store only the values of the variables and multiply some function of that value by n when it is time to estimate the moment.
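A sketch of the AMS estimator for the second moment on a finite stream, assuming the standard single-variable estimate n × (2 × X.value − 1) and averaging over several variables:

```python
import random
from collections import Counter

def ams_second_moment(stream, num_vars):
    """Estimate F2 (the sum of squared element frequencies) with AMS variables."""
    n = len(stream)
    estimates = []
    for _ in range(num_vars):
        pos = random.randrange(n)                              # random position in the stream
        element = stream[pos]                                  # X.element
        value = sum(1 for x in stream[pos:] if x == element)   # X.value
        estimates.append(n * (2 * value - 1))                  # single-variable estimate
    return sum(estimates) / len(estimates)                     # average over the variables

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
exact = sum(c * c for c in Counter(stream).values())
print("exact F2:", exact, "AMS estimate:", ams_second_moment(stream, num_vars=50))
```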

4.8 Counting ones in a Window


Now let us consider counting problems for streams. Suppose we have a window of length N on a binary stream and want to find out how many 1's there are in the last k bits, for any k ≤ N. Since in practice we cannot afford to store the entire window of the stream in memory, to calculate the number of 1's in the last k bits we use an approximation algorithm, which is explained in the subsequent section.
To answer such queries exactly, it is necessary to store all N bits of the window, as a representation using fewer than N bits cannot work. Since there are 2^N sequences of N bits, a representation with fewer than 2^N possible values would force two different bit strings x and y to have the same representation, and if x ≠ y they must differ in at least one bit.

4.8.1 The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm


The DGIM algorithm is used to estimate the number of 1's in a window of a bit stream. This algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the number of 1's in the window with an error of no more than 50 %. In this algorithm, each bit of the stream has a timestamp, which is the position at which it arrives : the first bit has timestamp 1, the second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we represent timestamps modulo N, so they can be represented by log₂ N bits. If we also store the total number of bits ever seen in the stream modulo N, we can determine where in the current window the bit with a given timestamp lies. We divide the window into buckets, each consisting of the timestamp of its right (most recent) end and the number of 1's in the bucket. This number must be a power of 2, and we refer to the number of 1's as the size of the bucket. To represent a bucket, we need log₂ N bits to represent the timestamp (modulo N) of its right end, and to represent the number of 1's we need only log₂ log₂ N bits. Thus, O(log N) bits suffice to represent a bucket. There are six rules that must be followed when representing a stream by buckets :
A) The right end of every bucket is a position with a 1; a bucket cannot end with a 0. For example, in 1001011 a bucket of size 4 is possible, as the string has four 1's and ends with a 1 at its right end.
B) Every bucket contains at least one 1; i.e. every position with a 1 is in some bucket.
C) No position is in more than one bucket.
D) There are one or two buckets of any given size, up to some maximum size.

E) All bucket sizes must be a power of 2.


F) Buckets cannot decrease in size as we move to the left.
Suppose, given stream is . . 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0. The bitstream
divided into buckets following the DGIM rules is shown in Fig. 4.8.1.

Fig. 4.8.1 : Bitstream divided into buckets following the DGIM rules

For example : Suppose the input bit stream is ….101011000101110110010110; estimate the total number of 1's and the number of buckets. Here the window size is N = 24.
Now, create buckets whose rightmost bit is 1. In our example we find 5 buckets, as shown below (oldest on the left) :
101011      bucket of size 4 (i.e. 2² = 4)
000 10111   bucket of size 4 (i.e. 2² = 4)
0 11        bucket of size 2 (i.e. 2¹ = 2)
00 101      bucket of size 2 (i.e. 2¹ = 2)
1           bucket of size 1 (i.e. 2⁰ = 1)
0           most recent bit, not part of any bucket
When a new bit comes in, we drop the oldest (leftmost) bucket if its timestamp is more than N time units before the current time. If the new bit that arrives is 0 (say with timestamp 101), then no changes are needed to the buckets, but if the new bit that arrives is 1, then we need to make some changes.
101011 000 10111 0 11 00 101 1 0 | 1 1   (two new bits to be entered)
When the current bit is 1, we create a new bucket with the current timestamp and size 1. If there was only one bucket of size 1 before, nothing more needs to be done. However, if there are now three buckets of size 1 (say, the buckets with timestamps 100, 102 and 103), then we combine the leftmost (oldest) two of them into a bucket of size 2, as shown below.
101011      bucket of size 4
000 10111   bucket of size 4
0 11        bucket of size 2
00 101      bucket of size 2
101         bucket of size 2 (formed by combining the two oldest size-1 buckets)
1           bucket of size 1
To combine any two adjacent buckets of the same size, replace them by one bucket of
twice the size. The timestamp of the new bucket is the timestamp of the rightmost of the
two buckets. By performing combining operation on buckets, the resulting buckets would
be


101011      bucket of size 4
000 10111   bucket of size 4
01100101    bucket of size 4 (formed by combining the two size-2 buckets)
101         bucket of size 2
1           bucket of size 1
Sometimes, combining two buckets of size 1 into a bucket of size 2 creates a third bucket of size 2. If so, we combine the leftmost (oldest) two buckets of size 2 into a bucket of size 4, and this process may ripple through the bucket sizes. We also continue dropping the leftmost bucket whenever the difference between the current timestamp and its timestamp reaches N, i.e. 24, so that only buckets within the window remain.
So finally, by counting the sizes of the buckets in the last 20 bits, we get the solution to the problem, i.e. 11 ones.
Each bucket can be represented by O(log N) bits. If the window has length N, then there are no more than N 1's and hence O(log N) buckets, so the total space required for all the buckets representing a window of size N is O(log² N). To answer the query of how many 1's there are in the last k bits of the window, for some 1 ≤ k ≤ N, find the bucket b with the earliest timestamp that includes at least some of the k most recent bits; then estimate the number of 1's as the sum of the sizes of all the buckets to the right of (more recent than) bucket b, plus half the size of b itself.
To add a new bit to a window of length N represented by buckets, we may need to modify the buckets so that the DGIM conditions continue to hold. First, whenever a new bit enters, check the leftmost bucket : if its timestamp has now reached the current timestamp minus N, then this bucket no longer has any of its 1's in the window, so drop it from the list of buckets. If the new bit is 0, nothing further needs to be done; if it is 1, create a new bucket with the current timestamp and set its size to 1. If there are now three buckets of size 1, we fix this by combining the leftmost two buckets of size 1. To combine any two adjacent buckets of the same size, replace them by one bucket of twice the size; the timestamp of the new bucket is the timestamp of the rightmost (later in time) of the two buckets. This combining may ripple to larger sizes, but as a result any new bit can be processed in O(log N) time.
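A compact sketch of this bucket maintenance and query procedure (for readability, timestamps are kept here as absolute positions rather than modulo N, and buckets are merged whenever three of the same size appear):

```python
from collections import deque

class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0                       # current timestamp (absolute, for readability)
        self.buckets = deque()           # (timestamp_of_right_end, size), newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once it falls completely outside the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return                       # a 0 never changes the buckets
        self.buckets.appendleft((self.t, 1))   # new bucket of size 1 for this 1
        size = 1
        while True:
            idxs = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idxs) <= 2:
                break                    # at most two buckets of this size: done
            oldest, next_oldest = idxs[-1], idxs[-2]
            merged = (self.buckets[next_oldest][0], 2 * size)   # keep the later timestamp
            items = list(self.buckets)
            del items[oldest]
            items[next_oldest] = merged
            self.buckets = deque(items)
            size *= 2                    # the merge may ripple to the next size

    def estimate_ones(self, k):
        """Estimate the number of 1s among the last k bits (1 <= k <= N)."""
        total, last_size = 0, 0
        for ts, size in self.buckets:    # newest first
            if ts > self.t - k:
                total += size
                last_size = size         # earliest bucket overlapping the last k bits
            else:
                break
        return total - last_size // 2    # count that bucket as only half its size

d = DGIM(window_size=24)
for b in "101011000101110110010110":
    d.add(int(b))
print("buckets (newest first):", list(d.buckets))
print("estimated 1s in the last 10 bits:", d.estimate_ones(10))
```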

4.9 Decaying Window


The decaying window is used for finding the most common "recent" elements in a stream. Suppose a stream consists of the elements a_1, a_2, . . . , a_t, where a_1 is the first element to arrive and a_t is the current element. Let c be a small constant, such as 10^-6 or 10^-9. Then the exponentially decaying window for this stream is defined as

Σ_{i=0}^{t-1} a_{t-i} (1 - c)^i

With a decaying window, it is easier to adjust the sum than with a sliding window of fixed length. The effect of this definition is to spread out the weights of the stream elements as far back in time as the stream goes. With a sliding window, we must take care of the element that falls out of the window each time a new element arrives. In contrast, a fixed window with the same sum of weights, 1/c, would put equal weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous elements, as illustrated in Fig. 4.9.1. With the decaying window, when a new element a_{t+1} arrives at the stream input, we simply multiply the current sum by (1 - c) and then add a_{t+1}.
In this method, each of the previous elements moves one position further from the current element, so its weight is multiplied by (1 - c). Further, the weight on the current element is (1 - c)^0 = 1, so adding a_{t+1} is the correct way to include the new element's contribution.
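A sketch of using this idea to keep decayed scores for "recently popular" items: on each arrival every stored score is multiplied by (1 − c), 1 is added to the arriving item's score, and scores that drop below a small threshold are discarded.

```python
def decaying_counts(stream, c=1e-2, threshold=0.5):
    """Keep a decayed score per item: on each arrival every score is multiplied by
    (1 - c), then 1 is added to the arriving item's score; tiny scores are dropped."""
    scores = {}
    for item in stream:
        for key in list(scores):
            scores[key] *= (1 - c)
            if scores[key] < threshold:
                del scores[key]                 # forget items that are no longer "recent"
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores

stream = ["a"] * 50 + ["b"] * 5 + ["c"] * 30    # "c" is the most recent popular item
print(decaying_counts(stream, c=0.05))
```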

4.10 Real Time Analytics Platform (RTAP)


A real-time analytics platform helps organizations extract the most valuable information and trends from real-time data. Such platforms help in measuring data from the business point of view in real time. An ideal real-time analytics platform helps in analyzing the data, correlating it and predicting the outcomes on a real-time basis. It helps organizations track things in real time, thereby supporting the decision-making process, and connects the data sources for better analytics and visualization. RTAP is concerned with the responsiveness of data, which needs to be processed immediately upon generation; sometimes the information needs to be updated at the same rate at which it is received. The RTAP analyzes the data, correlates it and predicts the outcomes in real time, and thus helps in timely decision making.
As we know, social media platforms like Facebook and Twitter generate petabytes of real-time data. This data must be harnessed to provide real-time analytics for making better business decisions. Further, in today's context billions of devices are connected to the internet, such as mobile phones, personal computers, laptops, wearable medical devices and smart meters, creating a huge number of new data sources. Real-time analytics leverages information from all these devices to apply analytics algorithms and generate automated actions within milliseconds of a trigger. The real-time analytics platform is composed of three components, namely :
Input : which is generated when an event happens (like a new sale, a new customer, someone entering a high-security zone etc.),
Processing unit : which captures the data of the event and analyzes it without leveraging resources that are dedicated to operations. It also involves executing different standing and ad-hoc queries over the streamed data, and
Output : which consumes this data without disturbing operations, explores it for better insights and generates analytical results by means of different visual reports on a dedicated dashboard. The general architecture of a real-time analytics platform is shown in Fig. 4.10.1.
The various requirements for real-time analytics platform are as follows :
1. It must support continuous queries for real-time events.
2. It must consider features like robustness, fault tolerance, low-latency reads and updates, incremental analytics and learning, and scalability.
3. It must have improved in-memory transaction speed.
4. It should quickly move data that is not needed into secondary disk for persistent storage.
5. It must support distributing data from various sources with speedy processing.

Fig. 4.10.1 : Architecture of Real-Time Analytics Platform



The basic building blocks of Real Time Streaming Platform are shown in Fig. 4.10.2.
The streaming data is collected from various flexible data sources by producer connectors, which move data from the sources to the queuing system. The queuing system is fault tolerant and persistent in nature; the streamed data is buffered there to be consumed by the stream processing engine. The queuing system is a high-throughput, low-latency system which provides high availability and fail-over capabilities. There are many technologies that support real-time analytics, such as :

Fig. 4.10.2 : Basic building blocks of Real-Time Analytics Platform

1. Processing In Memory (PIM), a chip architecture in which the processor is integrated into a memory chip to reduce latency.
2. In-database Analytics, a technology that allows data processing to be conducted
within the database by building analytic logic into the database itself.
3. Data Warehouse Appliances, combination of hardware and software products
designed specifically for analytical processing. An appliance allows the purchaser to
deploy a high-performance data warehouse right out of the box.
4. In-memory Analytics, an approach to querying data when it resides in Random
Access Memory (RAM), as opposed to querying data that is stored on physical disks.
5. Massively Parallel Programming (MPP), the coordinated processing of a program
by multiple processors that work on different parts of the program, with each
processor using its own operating system and memory.
Some of the popular Real Time Analytics Platforms are :
 IBM Info Streams : It is used as a streaming platform for analyzing a broad range of real-time unstructured data like text, videos, geospatial images, sensor data etc.


 SAP HANA : It is a streaming analytical tool that allows SAP users to capture, stream
and analyze data with active event monitoring and event driven response to
applications.
 Apache Spark : It is a streaming platform for big data analytics in real-time developed
by Apache.
 Cisco Connected Streaming Platform : It is used for finding insights from high-velocity streams of live data arriving over the network from multiple sources, enabling immediate actions.
 Oracle Stream Analytics : It provides a graphical interface for performing analytics over real-time streamed data.
 Google Real Time Analytics : It is used for performing real-time analytics over the
cloud data collected over different applications.

4.10.1 Applications of Realtime Analytics Platforms


There are many real-time applications which use real-time analytics platforms; some of them are listed below :
 Click analytics for online product recommendation
 Automated event actions for emergency services like fires, accidents or any disasters in
the industry
 Notification for any abnormal measurement in healthcare which requires immediate
actions
 Log analysis for understanding user’s behavior and usage pattern
 Fraud detection for online transactions
 Push notifications to the customers for location-based advertisements for retail
 Broadcasting news to the users which are relevant to them

4.11 Real Time Sentiment Analysis


Sentiment analysis (also referred to as opinion mining) is a natural language processing and information extraction task that aims to obtain the feelings expressed in positive or negative comments, questions and requests by analyzing a large number of documents over the web. In real-time sentiment analysis, the sentiments are collected and analyzed in real time with live data from the web. It uses natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. The goal of sentiment analysis is to allow organizations, political parties and common people to track sentiments by identifying the feelings, attitude and state of mind of people towards a product or service, classifying them as positive, negative or neutral, from the tremendous amount of data in the form of reviews, tweets, comments and feedback, with emotional states such as "angry", "sad" and "happy". It tries to identify and extract the sentiments within the text. The analysis of sentiments can either be document based, where the sentiment in the entire document is summarized as positive, negative or objective, or sentence based, where the individual sentences bearing sentiments in the text are classified.
Sentiment analysis is widely applied to reviews and social media for a variety of
applications, ranging from marketing to customer service. In the context of analytics,
sentiment analysis is “the automated mining of attitudes, opinions and emotions from
text, speech and database sources”. With the proliferation of reviews, ratings,
recommendations and other forms of online expression, online opinion has turned into a
kind of virtual currency for businesses looking to market their products, identify new
opportunities and manage their reputations.
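As a toy illustration of the idea (a tiny lexicon-based scorer invented for this sketch, not a production NLP pipeline), each incoming message can be classified as positive, negative or neutral the moment it arrives:

```python
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "angry", "sad"}

def classify(message):
    """Very small lexicon-based sentiment score for a single message."""
    words = message.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def sentiment_stream(messages):
    """Classify each message as it arrives, as a real-time pipeline would."""
    for msg in messages:
        print(f"{classify(msg):8s} | {msg}")

sentiment_stream([
    "I love this product, it is excellent",
    "Terrible service, I am angry",
    "The package arrived on Tuesday",
])
```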
Some of the popular applications of real-time sentiment analysis are,
1) Collecting and analyzing sentiments over Twitter. Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, which, if analyzed in real time, can help explore how these events
affect public opinion. While traditional content analysis takes days or weeks to
complete, real time sentiment analysis can look into the entire Twitter traffic about
the election, delivering results instantly and continuously. It offers the public, the
media, politicians and scholars a new and timely perspective on the dynamics of the
electoral process and public opinion.
2) Analyzing the sentiments of messages posted to social networks or online forums
can generate countless business values for the organizations which aim to extract
timely business intelligence about how their products or services are perceived by
their customers. As a result, proactive marketing or product design strategy can be
developed to effectively increase the customer base.
3) Tracking crowd sentiments during commercial viewing on TV allows advertising agencies to decide which commercials are resulting in positive sentiments and which are not.
4) A news media website is interested in getting an edge over its competitors by featuring site content that is immediately relevant to its readers. It uses social media to know the topics relevant to its readers by doing real-time sentiment analysis on Twitter data. Specifically, to identify what topics are trending in real time on Twitter, it needs real-time analytics about the tweet volume and sentiment for key topics.
5) In marketing, real-time sentiment analysis can be used to learn the public's reactions to the products or services supplied by an organization. The analysis is performed to determine which products or services they like or dislike and how these can be improved.
6) In quality assurance, real-time sentiment analysis can be used to detect errors in your products based on your actual users' experience.
7) In politics, real-time sentiment analysis can be used to determine the views of the people regarding specific situations, about which they may be angry or happy.
8) In finance, real-time sentiment analysis tries to detect the sentiment towards a brand in order to anticipate its market moves.
The best example of real time sentiment analysis is predicting the pricing or
promotions of a product being offered through social media and the web. The solution for
price or promotion prediction can be implemented using software solutions like RADAR (Real-Time Analytics Dashboard Application for Retail) and Apache Storm. RADAR is a software solution for retailers built using a Natural Language Processing (NLP) based sentiment analysis engine that utilizes different Hadoop technologies, including HDFS, Apache Storm, Apache Solr, Oozie and Zookeeper, to help enterprises maximize sales through data-based continuous re-pricing. Apache Storm is a distributed real-time
computation system for processing large volumes of high-velocity data. It is part of the
Hadoop ecosystem. Storm is extremely fast, with the ability to process over a million
records per second per node on a cluster of modest size. Apache Solr is another tool from
the Hadoop ecosystem which provides highly reliable, scalable search facility at real time.
RADAR uses Apache STORM for real-time data processing and Apache SOLR for
indexing and data analysis. The generalized architecture of RADAR for retail is shown in
Fig. 4.11.1.


Fig. 4.11.1 : Generalized architecture of RADAR for retail

For retailers, RADAR can be used to customize their environment so that, for any number of products / services in their portfolio, they can track the social sentiment for each product or service they offer as well as the competitive pricing / promotions being offered through social media and the web. With this solution, retailers can create continuous re-pricing campaigns and implement them in real time in their pricing systems, track the impact of re-pricing on sales and continuously compare it with social sentiment.

4.12 Stock Market Predictions


Stock market prediction is the act of trying to determine the future value of a company
stock or other financial instrument traded on an exchange. The successful prediction of a
stock's future price could yield significant profit.
Predicting stock prices is a challenging problem in itself because of the number of variables involved. The stock market process is full of uncertainty and is affected by many factors; hence stock market prediction is one of the important undertakings in business and finance. As the market produces a large amount of data every day, it is very difficult for an individual to consider all the current and past information when predicting the future trend of a stock.

Traditionally, stock market prediction algorithms examined historical stock prices and tried to predict the future using different models. The traditional approach is not effective in real time because stock market trends continually change based upon economic forces, regulations, competition, new products, world events and even (positive or negative) tweets, all of which are factors that affect stock prices. Thus, predicting stock prices using real-time analytics becomes a necessity. The generalized architecture for real-time stock prediction has three basic steps, as shown in Fig. 4.12.1.

Fig. 4.12.1 : Generalized architecture for real-time stock prediction

There are three basic components :


1. In the first step, the incoming real-time trading data is captured and stored into a
persistent storage as it becomes historical data over the period of time.
2. Secondly, the system must be able to learn from historical trends in the data and
recognize patterns and probabilities to inform decisions.
3. Third, the system needs to do a real-time comparison of new, incoming trading data
with the learned patterns and probabilities based on historical data. Then, it
predicts an outcome and determines an action to take.
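A minimal sketch of this three-step loop, using a toy moving-average "model" in place of a real machine learning pipeline and a hypothetical in-memory price feed (all names below are invented for illustration):

```python
from collections import deque

def trading_feed():
    """Hypothetical stand-in for a live trading data source."""
    for price in [101.0, 101.5, 102.2, 101.9, 103.5, 99.2, 98.7, 104.1]:
        yield price

def predict_and_act(feed, window=5, threshold=0.02):
    history = deque(maxlen=window)          # step 1: capture the incoming data (recent history)
    for price in feed:
        if len(history) == history.maxlen:
            learned_level = sum(history) / len(history)    # step 2: "learned" pattern (toy model)
            deviation = (price - learned_level) / learned_level
            # step 3: compare new data with the learned pattern and decide on an action
            if deviation > threshold:
                print(f"price {price:.2f}: {deviation:+.1%} above trend -> consider selling")
            elif deviation < -threshold:
                print(f"price {price:.2f}: {deviation:+.1%} below trend -> consider buying")
        history.append(price)

predict_and_act(trading_feed())
```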
A more detailed picture with machine learning approach for stock prediction is given
in Fig. 4.12.2.

Fig. 4.12.2 : Detailed representation of real-time stock prediction using machine learning

The following steps are followed :


1. Live data from Yahoo! Finance or any other finance news RSS feed is read and processed. The data is then stored in memory in a fast, consistent, resilient and linearly scalable system.
2. Using the live, hot data from Apache Geode, a Spark MLlib application creates and trains a model, comparing new data to historical patterns. The models could also be supported by other toolsets, such as Apache MADlib or R.
3. Results of the machine learning model are pushed to other interested applications
and also updated within Apache Geode for real-time prediction and decisioning.
4. As data ages and starts to become cool, it is moved from Apache Geode to Apache
HAWQ and eventually lands in Apache Hadoop™. Apache HAWQ allows for SQL-
based analysis on petabyte-scale data sets and allows data scientists to iterate on
and improve models.
5. Another process is triggered to periodically retrain and update the machine
learning model based on the whole historical data set. This closes the loop and
creates ongoing updates and improvements when historical patterns change or as
new models emerge.
The most common advantages of stock prediction using big data approach are
 It stabilizes the online trading
 Real-time data analysis with a rapid speed

 Improves the relationship between investors and stock trading firms


 Provides the best estimation of outcomes and returns
 Mitigates probable risks in online stock trading and supports making the right investment decisions
 Enhances the machine learning ability to produce accurate predictions
