Suja A. Alex,
Assistant Professor,
Department of Information Technology,
St. Xavier’s Catholic College of Engineering,
Nagercoil
 Introduction to stream concepts
 Stream Data Model
 Sampling data in a stream
 Filtering Stream
 Stream counting
 Decaying Window
 RTAP
 Real Time Data streaming tools and applications
 A data stream is a continuously arriving flow of data
 Examples: data produced in dynamic environments such as
 Power supply
 Network traffic
 Stock exchange
 Video surveillance
Continuous flow of data
 Examples
Network traffic
Sensor data
Call center records
 A Data Stream Management System (DSMS) handles multiple
data streams.
 A stream data query processing architecture includes three
parts:
end user,
query processor, and
scratch space
 Transient streams (and persistent relations)
 Continuous queries
 Sequential access
 Unpredictable data arrival and characteristics
 Bounded main memory
Types of Queries
 One time queries
 Continuous queries
[Figure: two query-processing diagrams. Labels: SQL Query, Query Processing, Main Memory, Disk, Result; Continuous Query (CQ), Query Processing, Main Memory, Data Stream(s), Result]
Mining Data Streams
 Since we cannot store the entire stream, one obvious
approach is to store a sample.
 Types of sampling
Reservoir Sampling
Biased Reservoir Sampling
Concise Sampling
 Reservoir Sampling – A reservoir of size n is maintained
The first n points are added to the reservoir
The (t+1)th point is added with probability n/(t+1), replacing a randomly chosen point in the reservoir
 Biased Reservoir Sampling – Bias function is used to
regulate sampling from stream
 Concise Sampling – distinct values in stream are stored in
main memory
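The reservoir sampling rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example, and the reservoir is filled with the first n points before the n/(t+1) rule applies.

```python
import random

def reservoir_sample(stream, n, rng=None):
    """Keep a uniform random sample of n items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream):       # t items were seen before this one
        if t < n:
            reservoir.append(item)          # the first n points fill the reservoir
        elif rng.random() < n / (t + 1):    # keep point t+1 with probability n/(t+1)
            reservoir[rng.randrange(n)] = item  # evict a uniformly chosen resident
    return reservoir
```

Only the reservoir itself is held in memory, so the stream can be arbitrarily long.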
 Types of queries one wants to answer on a stream:
◦ Filtering a data stream
 Select elements with property x from the stream
◦ Counting distinct elements
 Number of distinct elements in the last k elements
of the stream
◦ Estimating moments
 Estimate avg./std. dev. of last k elements
◦ Finding frequent elements
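A query over the last k elements, such as the average and standard deviation mentioned in the list above, might be answered exactly when k is small enough to buffer. The sketch below assumes that case; the name `sliding_stats` is hypothetical, and larger windows would need the approximate techniques covered later in the deck.

```python
from collections import deque
import statistics

def sliding_stats(stream, k):
    """Yield (mean, population std. dev.) of the last k elements after each arrival."""
    window = deque(maxlen=k)   # the oldest element is dropped automatically
    for x in stream:
        window.append(x)
        yield statistics.fmean(window), statistics.pstdev(window)
```

Because the function is a generator, it behaves like a continuous query: each arriving element produces an updated answer.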
 Filtering decides whether an item from an infinite stream must be
stored
 Example: a web browser storing the blacklist of
dangerous URLs
 Each element of the data stream is a tuple; given a list of
keys S, determine which tuples of the stream are in S
 Obvious solution: Hash table
 But suppose we do not have enough memory to store
all of S in a hash table
 Example: Email spam filtering
◦ We know 1 billion “good” email addresses
◦ If an email comes from one of these, it is NOT spam
 Publish-subscribe systems
◦ You are collecting lots of messages (news articles)
◦ People express interest in certain sets of keywords
◦ Determine whether each message matches user’s interest
 Consider: a set S of m keys (|S| = m) and a bit array B of n bits (|B| = n)
 Use k independent hash functions $h_1, \ldots, h_k$
 Initialization:
◦ Set B to all 0s
◦ Hash each element $s \in S$ using each hash function $h_i$, set
$B[h_i(s)] = 1$ (for each $i = 1, \ldots, k$)
 Run-time:
◦ When a stream element with key x arrives
 If $B[h_i(x)] = 1$ for all $i = 1, \ldots, k$, then declare that x is in S
 That is, x hashes to a bucket set to 1 for every hash
function $h_i$
 Otherwise discard the element x
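The initialization and run-time steps above describe a Bloom filter. A minimal sketch follows, under stated assumptions: the class name `BloomFilter`, the method names, and the choice of deriving the k hash functions by salting SHA-256 with the index i are all illustrative and not taken from the slides. Any k roughly independent hash functions would serve.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits)   # one byte per bit, for clarity

    def _positions(self, key):
        # Assumption: h_i(key) is SHA-256 of "i:key", reduced modulo n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key):
        # Initialization step: set B[h_i(s)] = 1 for every i.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # Run-time step: accept x only if every B[h_i(x)] is 1.
        return all(self.bits[pos] for pos in self._positions(key))
```

A key that was added is always accepted, so there are no false negatives. A key that was never added can still be accepted if all of its k positions happen to be set by other keys, which is why the method is named `might_contain`.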
 Problem:
◦ Data stream consists of a universe of elements chosen from
a set of size N
◦ Maintain a count of the number of distinct elements seen so
far
 Solution:
Maintain the set of elements seen so far
◦ That is, keep a hash table of all the distinct elements seen so
far
 How many different words are found among the
Web pages being crawled at a site?
◦ Unusually low or high numbers could indicate artificial pages
(spam?)
 How many different Web pages does each customer
request in a week?
 How many distinct products have we sold in the last
week?
 Pick a hash function h that maps each of the N
elements to at least $\log_2 N$ bits
 For each stream element a, let r(a) be the number of
trailing 0s in h(a)
◦ r(a) = position of first 1 counting from the right
 E.g., say h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
 Record R = the maximum r(a) seen
◦ $R = \max_a r(a)$, over all the items a seen so far
 Estimated number of distinct elements = $2^R$
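The estimate described above (the Flajolet-Martin approach) can be sketched as follows. The function names `trailing_zeros` and `fm_estimate` and the use of SHA-256 as the hash function h are assumptions for illustration. A single hash function gives a coarse estimate that is always a power of two; combining several hash functions, which this sketch does not do, is the usual way to refine it.

```python
import hashlib

def trailing_zeros(x):
    """Number of trailing 0 bits in x (requires x > 0)."""
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    """Estimate the number of distinct elements as 2**R."""
    R = 0
    for a in stream:
        # Assumption: h(a) is the first 64 bits of SHA-256 of str(a).
        h = int.from_bytes(hashlib.sha256(str(a).encode()).digest()[:8], "big")
        if h:                          # h == 0 has no rightmost 1; skip it
            R = max(R, trailing_zeros(h))
    return 2 ** R
```

For the slide's example, `trailing_zeros(12)` is 2, since 12 is 1100 in binary. Repeated elements hash to the same value, so they never change R.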
 Suppose a stream has elements chosen
from a set A of N values
 Let $m_i$ be the number of times value i occurs in the
stream
 The kth moment is $\sum_{i \in A} (m_i)^k$
 0th moment = number of distinct elements
◦ The problem just considered
 1st moment = count of the number of elements =
length of the stream
◦ Easy to compute
 2nd moment = surprise number S =
a measure of how uneven the distribution is
 Alon-Matias-Szegedy (AMS) Algorithm
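The moments listed above, and the idea behind the AMS estimate of the 2nd moment, can be illustrated with a short sketch. The function names and the example stream are assumptions. For clarity the sketch keeps the whole stream in a list and recounts occurrences, whereas the point of the actual algorithm is to maintain each variable in a single pass with little memory.

```python
from collections import Counter

def moment(stream, k):
    """Exact k-th moment: sum of m_i**k over the values i that occur."""
    return sum(m ** k for m in Counter(stream).values())

def ams_second_moment(stream, positions):
    """Average the AMS estimates n * (2 * value - 1) over the given start positions."""
    n = len(stream)
    estimates = []
    for p in positions:
        element = stream[p]
        # value = number of occurrences of `element` from position p onward
        value = sum(1 for x in stream[p:] if x == element)
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")   # a occurs 5 times, b 4, c 3, d 3
print(moment(stream, 0), moment(stream, 1), moment(stream, 2))  # 4 15 59
print(ams_second_moment(stream, [2, 7, 12]))                    # 55.0
```

With three start positions the estimate is 55 against a true value of 59. Averaging over every possible start position recovers the 2nd moment exactly, which is why each single variable is an unbiased estimator.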
[Figure: sliding window of length N = 6 over a stream of characters (q w e r t y u i o p a s d f g h j k l z x c v b n m), shown at four successive positions from past to future]
 Datar-Gionis-Indyk-Motwani Algorithm
 A bucket in the DGIM method is a record consisting
of:
1. The timestamp of its end [O(log N) bits]
2. The number of 1s between its beginning and end [O(log log
N) bits]
 Constraint on buckets:
Number of 1s must be a power of 2
◦ That explains the O(log log N) in (2)
[Figure: DGIM buckets over the following bit stream for a window of length N: at least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1]
1001010110001011010101010101011010101010101110101010111010100010110010
Properties we maintain:
- Either one or two buckets with the same power-of-2 number of 1s
- Buckets do not overlap in timestamps
- Buckets are sorted by size
[Figure: successive states of the bit stream as new bits arrive and the window shifts]
 To estimate the number of 1s in the most
recent N bits:
1. Sum the sizes of all buckets but the last, i.e. the oldest one
(note “size” means the number of 1s in the bucket)
2. Add half the size of the last bucket
 Remember: We do not know how many 1s
of the last bucket are still within the wanted
window
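The bucket rules and the estimation steps above can be combined into a compact sketch of the DGIM method. The class name `DGIM` and its method names are hypothetical. Timestamps are kept as absolute integers here for simplicity, whereas a space-efficient version would store them modulo N to stay within O(log N) bits.

```python
class DGIM:
    """Approximate count of 1s in the last `window` bits (illustrative sketch)."""

    def __init__(self, window):
        self.window = window
        self.t = 0            # current timestamp (absolute, for simplicity)
        self.buckets = []     # (end_timestamp, size) pairs, newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end timestamp leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # Restore the invariant of at most two buckets per size: whenever
        # three share a size, merge the two oldest into one of twice the size.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end, size = self.buckets[i + 1]
            self.buckets[i + 1:i + 3] = [(end, 2 * size)]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0
        sizes = [size for _, size in self.buckets]
        # All buckets but the oldest, plus half of the oldest.
        return sum(sizes[:-1]) + sizes[-1] / 2
```

Each merge keeps the end timestamp of the newer of the two buckets, so buckets never overlap and remain sorted by size. The estimate is off by at most half the true count, since only the oldest bucket is uncertain.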
 Decay factor instead of sliding window technique
 Applications:
 The problem of Most Common Elements
Example- movie tickets purchased all over the world
 Describing a Decaying Window
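One way to read the decaying-window idea for the most-common-elements problem is sketched below, loosely following the treatment in the Rajaraman and Ullman textbook cited at the end of the deck. Each arrival multiplies every existing score by (1 - c) and adds 1 to the arriving item's score, and scores that fall below a threshold are dropped. The function name `decayed_counts` and the default values of `c` and `threshold` are assumptions chosen for the example.

```python
def decayed_counts(stream, c=1e-3, threshold=0.5):
    """Exponentially decayed popularity scores for the items in a stream."""
    scores = {}
    for item in stream:
        # Age every score by the decay factor; forget scores below the threshold.
        scores = {k: v * (1 - c) for k, v in scores.items()
                  if v * (1 - c) >= threshold}
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores

print(decayed_counts(["a", "a", "b"], c=0.5))       # {'a': 0.75, 'b': 1.0}
print(decayed_counts(["a", "b", "b", "b"], c=0.5))  # {'b': 1.75}
```

A small c makes old arrivals fade slowly and approximates a long window, while a large c (as in the toy calls above) emphasizes only the most recent items. No explicit window boundary has to be tracked.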
 A real-time analytics platform helps organizations
extract valuable information and trends from real-time data.
 Real-time analytics is the process of delivering
information about events as they occur.
 Some Examples
Financial Industry - Fraud Detection, Trading
E-commerce - Recommendations
Telecom Industry - Machine to Machine communication
Supply Chain Management
Business Activity Monitoring
 On Demand Real Time Analytics
 Continuous Real Time Analytics
 Delivering In-Memory Transaction Speed
 Quickly moving unneeded data to disk for long-term
storage
 Distributing Data and Processing for speed
 Supporting continuous queries for real-time events
 Embedding Data into Apps or Apps into databases
 Additional requirements: fault tolerance, low latency
 Processing in memory
 In-database analytics
 Data warehouse appliances
 In-memory analytics
 Massively Parallel Processing (MPP)
 Real Time Sentiment Analysis (RTSA)
 Real Time Stock Prediction (RTSP)
 Anand Rajaraman and Jeffrey David Ullman,
"Mining of Massive Datasets", Cambridge
University Press, 2012.
Mining Data Streams