DA Unit 3

Data streaming involves processing continuous data in real-time as it is generated, as opposed to batch processing of stored data. A data stream consists of a continuous flow of data elements ordered in a sequence. Unlike batch processing, data streaming allows processing data as soon as it is created. Key features of data streams include their continuous flow, infinite length, high velocity, and variability. Data streams are essential for modern data processing and decision making by enabling real-time insights from continuous data sources.


WHAT IS A DATA STREAM:

Data streaming is a modern approach to processing and analyzing
data in real time, as opposed to batch processing methods. A data
stream is a continuous flow of data elements that are ordered in a
sequence and processed as they are generated. Data streams differ
from traditional batch processing in that they are continuous,
unbounded, and potentially high-velocity with high variability.
Unlike traditional data processing, where data is collected and
processed in batches, data streams deliver data continuously,
making it possible to process each element as soon as it is created.
Key features of data streams include their continuous flow,
unbounded (potentially infinite) length, high velocity, and
potentially high variability. They are often handled by stream
processing systems such as Apache Spark Streaming.
Importance of data streams in modern data processing
Data streams play a critical role in modern data processing, enabling
real-time insights and automated actions.
In the healthcare industry, data streams enable continuous
monitoring of patient data, allowing for early detection and
intervention in the event of critical health issues. Data streaming can
also be used in machine learning algorithms to derive insights from
continuous data and improve predictive analytics.
Evidently, data streams are essential for modern data processing
and decision-making, enabling businesses to derive valuable insights
from the continuous stream of data generated by their internal
IT systems and external data sources.
• Examples of stream sources: sensor data, image data, Internet and
Web traffic.
Issues in Stream Processing
• Streams often deliver elements very rapidly.
• We must process elements in real time.
• It is important that the stream-processing algorithm executes in
main memory, without access to secondary storage.
Sampling data in stream
• Data sampling is a statistical analysis technique used to select,
manipulate and analyze a representative subset of data
points to identify patterns and trends in the larger data
set being examined.
• The method of collecting data from a population by taking a
sample (a group of items) and examining it to draw conclusions
is known as the sample method.
• Probability sampling allows every member of the population a
chance to get selected. It is mainly used in quantitative research
when you want to produce results representative of the whole
population.
• In non-probability sampling, not every individual has a chance
of being included in the sample. This sampling method is easier
and cheaper but also has high risks of sampling bias.
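A standard probability-sampling technique for streams is reservoir sampling, which keeps a uniform random sample of fixed size k from a stream of unknown length. A minimal Python sketch (the function name and stream are illustrative):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random reservoir slot with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(10_000), 5)
print(sample)  # 5 items drawn uniformly from the stream
```

Each element ends up in the sample with equal probability k/n, no matter how long the stream turns out to be.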
Filtering:
Another common operation on streams is selection, or filtering.
We want to accept those tuples in the stream that meet a criterion.
Accepted tuples are passed on to another process as a stream, while
the other tuples are dropped.
Bloom filtering is a space-efficient way to eliminate most of the
tuples that do not meet the criterion.
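As a sketch, a Bloom filter keeps an array of m bits and k hash functions; adding a key sets k bits, and a lookup answers "definitely absent" or "possibly present". A minimal Python version (the parameters m and k, and the use of salted SHA-256 to simulate k hash functions, are illustrative choices):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m              # number of bits in the filter
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions by salting one hash function k ways.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, k=3)
bf.add("alice")
print(bf.might_contain("alice"))   # True
```

Tuples whose keys fail the test can be dropped immediately; tuples that pass are forwarded for an exact check, since false positives are possible but false negatives are not.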
RTAP (Real-Time Analytics Platform)
A real-time analytics platform enables organizations to make
the most out of real-time data by helping them to extract the
valuable information and trends from it.
Such platforms help in measuring data from the business
point of view in real time, further making the best use of
data.
An ideal real-time analytics platform would help in analyzing
the data, correlating it and predicting the outcomes on a real-
time basis.

Widely used RTAPs:

1. Apache Spark Streaming: a big-data platform for data
stream analytics in real time.
2. Oracle Stream Analytics (OSA): a platform that provides
a graphical interface to "fast data".
3. SAP HANA: a streaming analytics tool that also performs
real-time analysis.
4. SQLstream Blaze: an analytics platform offering a real-time,
easy-to-use, and powerful visual development environment.
RTAP Applications:
 Fraud detection systems for online transactions.
 Social media analytics
 Click analysis for online recommendations.
Advantages of RTAP:
 Create your own interactive analytics tools.
 Make use of machine learning.
 Transparent dashboards allow users to share information.

Stock market prediction:

What is the Stock Market

A stock market is a public market where you can buy and sell shares of publicly listed
companies. The stocks, also known as equities, represent ownership in the company. The
stock exchange is the mediator that enables the buying and selling of shares.

Importance of Stock Market

 Stock markets help companies to raise capital.

 It helps generate personal wealth.

 Stock markets serve as an indicator of the state of the economy.

 It is a widely used source for people to invest money in companies with high
growth potential.

Stock Price Prediction

Stock price prediction using machine learning helps you discover the future value of a
company's stock and other financial assets traded on an exchange. The entire idea of
predicting stock prices is to gain significant profits. Predicting how the stock market will
perform is a hard task. Many factors are involved in the prediction, such as physical and
psychological factors, rational and irrational behaviour, and so on. All these factors combine
to make share prices dynamic and volatile, which makes it very difficult to predict stock
prices with high accuracy.

FLAJOLET-MARTIN ALGORITHM (counting
distinct elements in a stream):
The Flajolet-Martin algorithm is a probabilistic
algorithm that is mainly used to estimate the number of
distinct elements in a stream or database.

The steps for the Flajolet-Martin algorithm are:

 First, choose a hash function that maps the elements in
the dataset to fixed-length binary strings. The length of
the binary string can be chosen based on the accuracy
desired.
 Apply the hash function to each data item in the dataset
to obtain its binary string representation.
 For each binary string, determine the length of its
trailing run of zeros (equivalently, the position of the
rightmost 1-bit).
 Compute R, the maximum number of trailing zeros over
all the binary strings.
 Estimate the number of distinct elements in the dataset
as 2 raised to the power R.

Pseudo Code-Stepwise Solution:

1. Select a hash function h that maps each element in the
set to a binary string of at least log2 n bits.
2. For each element x, r(x) = number of trailing zeroes in
h(x).
3. R = max(r(x))

=> Distinct elements ≈ 2^R

Example:
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5
Assume b = 5

h(1) = 7 mod 5 = 2 = (010)2, so r(1) = 1
h(2) = 13 mod 5 = 3 = (011)2, so r(2) = 0
h(3) = 19 mod 5 = 4 = (100)2, so r(3) = 2
h(4) = 25 mod 5 = 0 = (000)2, so r(4) = 0 (taking r = 0 for an all-zero string)

R = max( r(a) ) = 2
So no. of distinct elements = N = 2^2 = 4, which matches the
true count of distinct values {1, 2, 3, 4}.
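The hand calculation above can be checked with a short Python sketch of the algorithm (helper names are illustrative):

```python
def trailing_zeros(v):
    # Tail length of v's binary representation; take r = 0 for v == 0.
    if v == 0:
        return 0
    r = 0
    while v & 1 == 0:
        v >>= 1
        r += 1
    return r

def fm_estimate(stream, h):
    # Flajolet-Martin: estimate the distinct count as 2^R, where R is
    # the maximum tail length over the hashed elements of the stream.
    R = max(trailing_zeros(h(x)) for x in stream)
    return 2 ** R

S = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
h = lambda x: (6 * x + 1) % 5
print(fm_estimate(S, h))  # → 4, matching the hand calculation
```

In practice many hash functions are used and their estimates are combined (e.g. median of means) to reduce the variance of the 2^R estimate.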

DGIM (Datar-Gionis-Indyk-Motwani, for counting
1's in a window):
• Suppose we have a window of length N on a binary
stream. We want at all times to be able to answer
queries of the form “how many 1’s are there in the last
k bits?” for any k≤ N. For this purpose we use the DGIM
algorithm.
• The basic version of the algorithm uses O(log² N) bits
to represent a window of N bits, and allows us to
estimate the number of 1's in the window with an error
of no more than 50%.
• To begin, each bit of the stream has a timestamp, the
position in which it arrives. The first bit has timestamp
1, the second has timestamp 2, and so on.
• We divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1's in the bucket. This number must
be a power of 2, and we refer to the number of 1's
as the size of the bucket.
There are six rules that must be followed when representing
a stream by buckets.
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to
some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left
(back in time).
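Under these rules, updating and querying can be sketched in Python; this minimal version keeps buckets as (timestamp, size) pairs, newest first, and merges whenever three buckets share a size (class and method names are illustrative):

```python
class DGIM:
    def __init__(self, N):
        self.N = N          # window length in bits
        self.t = 0          # current timestamp
        self.buckets = []   # (timestamp of right end, size), newest first

    def add(self, bit):
        self.t += 1
        # Drop buckets that have slid entirely out of the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            # Restore the invariant: at most two buckets of any size.
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                    # Merge the two OLDEST of the three equal-size buckets,
                    # keeping the more recent right-end timestamp.
                    ts = self.buckets[i + 1][0]
                    size = self.buckets[i + 1][1] * 2
                    self.buckets[i + 1:i + 3] = [(ts, size)]
                else:
                    i += 1

    def count_ones(self, k):
        # Estimate the 1's among the last k bits: sum the sizes of all
        # qualifying buckets, then subtract half of the oldest one,
        # since it may straddle the query boundary.
        total = last = 0
        for ts, size in self.buckets:
            if ts > self.t - k:
                total += size
                last = size
        return total - last // 2

d = DGIM(N=10)
for _ in range(10):
    d.add(1)
print(d.count_ones(10))  # → 8 (true answer is 10; within the 50% bound)
```

Only O(log N) buckets exist at any time, and each needs O(log N) bits, giving the O(log² N) space bound quoted above.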

ESTIMATING MOMENTS:
Estimating moments is a generalization of the problem of
counting distinct elements in a stream. The problem, called
computing "moments," involves the distribution of
frequencies of the different elements in the stream. The k-th
frequency moment of a stream is the sum, over all distinct
elements, of the k-th power of each element's frequency:
F0 is the number of distinct elements, F1 is the length of
the stream, and F2 (the "surprise number") measures how
uneven the frequency distribution is.
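With enough memory to count every distinct element, the k-th moment is straightforward to compute exactly, which is a useful baseline before turning to streaming estimators. A small Python sketch:

```python
from collections import Counter

def moment(stream, k):
    # k-th frequency moment: sum over distinct elements of (frequency)^k.
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

s = ["a", "b", "a", "c", "a", "b"]   # frequencies: a=3, b=2, c=1
print(moment(s, 0))  # → 3  (distinct elements)
print(moment(s, 1))  # → 6  (stream length)
print(moment(s, 2))  # → 14 (surprise number: 9 + 4 + 1)
```

For streams too large to store per-element counts, F2 is typically estimated with the Alon-Matias-Szegedy (AMS) sketch instead.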
