Data Stream Mining
George Tzinos
Introduction
▪ Large volumes of streaming data are generated every day.
▪ Efficient knowledge discovery from such data streams is an
active, emerging research area in data mining with broad
applications.
▪ Data streams typically arrive continuously, at high speed and in huge
volume, and their data distribution changes over time.
▪ This raises new issues that need to be considered.
▪ Data mining techniques that require multiple scans of the entire
data set cannot be applied directly to stream data, which
usually allows only one scan and demands a fast response time.
Applications
▪ Network traffic
▪ Sensor data
▪ Call center records
Requirements
1. Process one example at a time, and inspect it at most once
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any point
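The four requirements above can be illustrated with a minimal test-then-train loop. This is only a sketch: the nearest-centroid learner and the names OnlineCentroidClassifier, learn_one and prequential_accuracy are assumptions chosen for illustration, not something the slides prescribe.

```python
from collections import defaultdict


class OnlineCentroidClassifier:
    """Toy incremental classifier (hypothetical): one running mean per class."""

    def __init__(self):
        self.counts = defaultdict(int)  # examples seen per class
        self.means = {}                 # class label -> running mean vector

    def predict(self, x):
        # Requirement 4: a prediction is available at any point in the stream.
        if not self.means:
            return None  # nothing learned yet

        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.means[label]))

        return min(self.means, key=sq_dist)

    def learn_one(self, x, y):
        # Requirements 1-3: each example is seen once and never stored;
        # work and memory per example do not grow with the stream length.
        self.counts[y] += 1
        n = self.counts[y]
        mean = self.means.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            mean[i] += (value - mean[i]) / n


def prequential_accuracy(stream, model):
    """Test-then-train: predict each example first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


stream = [((0.0, 0.0), "a"), ((10.0, 10.0), "b"),
          ((0.5, 0.2), "a"), ((9.5, 10.1), "b")]
model = OnlineCentroidClassifier()
accuracy = prequential_accuracy(stream, model)
```

The first two predictions are necessarily wrong because neither class has been seen yet, which is typical of evaluating a learner on the same stream it learns from.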
Traditional Techniques vs Stream
                   Traditional    Stream
No. of passes      Multiple       Single
Processing time    Unlimited      Restricted
Memory usage       Unlimited      Restricted
Type of result     Accurate       Approximate
Basic Techniques
▪ Sampling
▪ Load shedding
▪ Sketching
▪ Synopsis data structures
▪ Aggregation
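As one concrete instance of sampling from the list above, reservoir sampling (Algorithm R) keeps a uniform random sample of fixed size k from a stream of unknown length in a single pass. The slides do not say which sampling scheme they have in mind, so this is an illustrative choice; the function name and the rng parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from `stream` in one pass.

    Memory use is O(k) regardless of how long the stream is.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
```

If the stream holds fewer than k items, the function simply returns all of them.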
Forgetting mechanisms
▪ The learner should be able to react to a changing concept by forgetting
outdated data while learning new class descriptions.
▪ The open question is how to select the range of data to remember.
Utilization of time and space
▪ Sliding Window
▪ Algorithm Output Granularity (AOG)
Windowing techniques - 1
▪ The most popular approach to dealing with time-changing data
involves the use of sliding windows.
▪ Windows provide a way of limiting the number of examples
introduced to the learner,
▪ thereby eliminating data points that come from an old concept.
Windowing techniques - 2
Windowing techniques - 3 (Fixed Window)
▪ Each example updates the window, and the classifier is later
updated from that window.
▪ In the simplest approach, sliding windows have a fixed size
and include only the most recent examples from the data stream.
▪ With each new data point, the oldest example that no longer fits in
the window is thrown away.
▪ With windows of fixed size, the user is caught in a tradeoff.
▪ A small window lets the classifier react quickly
to changes, but it may lose accuracy in periods of stability.
▪ A large window increases accuracy in periods of
stability, but fails to adapt to rapidly changing concepts.
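The fixed-window tradeoff can be seen in a small sketch. It replaces the classifier with a simple running mean so the effect is easy to read; the class name FixedWindowMean, the window sizes and the synthetic stream are assumptions made for illustration, and the window size is assumed to be at least 1.

```python
from collections import deque


class FixedWindowMean:
    """Mean over the most recent `size` values only (size >= 1 assumed)."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # deque evicts the oldest item itself
        self.total = 0.0

    def update(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # value about to be thrown away
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


# A stream whose "concept" switches from 0 to 1 after 50 examples.
stream = [0.0] * 50 + [1.0] * 50
small, large = FixedWindowMean(5), FixedWindowMean(40)
for t, value in enumerate(stream):
    small_estimate = small.update(value)
    large_estimate = large.update(value)
    if t == 54:  # five examples after the change
        print(small_estimate, large_estimate)  # prints: 1.0 0.125
```

Five examples after the change, the small window has fully adopted the new concept while the large one still mostly reflects the old one. On a noisy but stable stream the roles reverse: the large window averages over more examples and gives the steadier estimate.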
Windowing techniques - 4
▪ Weights:
▫ A simple way of making the forgetting process more
dynamic is to give the window a decay function
that assigns a weight to each example.
▫ Older examples receive smaller weights and are treated
as less important by the base classifier.
▫ (See "Maintaining time-decaying stream aggregates".)
▪ FISH
▪ ADWIN
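The weighting idea can be sketched with an exponential decay function, where an example of age a gets weight exp(-lam * a). Exponential decay is only one possible choice of decay function, and the names DecayedMean and lam are assumptions; timestamps are assumed to be non-decreasing. This sketch does not implement FISH or ADWIN.

```python
import math


class DecayedMean:
    """Time-decayed mean: older values count less, none is stored."""

    def __init__(self, lam):
        self.lam = lam            # decay rate; larger means faster forgetting
        self.weighted_sum = 0.0   # sum of weight * value
        self.weight_total = 0.0   # sum of weights
        self.last_time = None

    def update(self, value, timestamp):
        if self.last_time is not None:
            # Age every earlier example by the elapsed time in one step.
            decay = math.exp(-self.lam * (timestamp - self.last_time))
            self.weighted_sum *= decay
            self.weight_total *= decay
        self.weighted_sum += value   # the newest example has weight 1
        self.weight_total += 1.0
        self.last_time = timestamp
        return self.weighted_sum / self.weight_total


# With lam = ln 2, the weight of an example halves every time unit.
estimator = DecayedMean(math.log(2))
first = estimator.update(0.0, timestamp=0)
second = estimator.update(1.0, timestamp=1)  # (0.5 * 0 + 1 * 1) / 1.5
```

Because the decay is multiplicative, only two running totals are needed, so memory stays constant however long the stream runs.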
Classification in Data Streams
▪ Classification, i.e. learning a model in order to assign labels to new,
unlabeled data points, is a well-studied supervised machine
learning task.
▪ Methods include naive Bayes, k-nearest neighbors, classification
trees, support vector machines, rule-based classifiers and many
more (Hastie et al. 2001).
▪ However, as with clustering, these algorithms need to access the
complete training data several times and are thus not suitable for
data streams with constantly arriving new training data and
concept drift.
Classification in Data Streams - 2
▪ Wang et al. proposed a general framework for mining
concept-drifting data streams.
▪ Domingos et al. proposed VFDT (Very Fast Decision Tree).
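VFDT decides when a leaf has seen enough examples to split by using the Hoeffding bound: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The sketch below shows only that split test, not a full tree; the function names and the example numbers are assumptions for illustration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical gains measured at a leaf, with R = 1 and delta = 1e-7.
after_100 = can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100)
after_1000 = can_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
```

With these numbers the same gain difference of 0.15 is not enough after 100 examples but is enough after 1000, because epsilon shrinks as more examples arrive at the leaf.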
Tools for Data Streams
▪ scikit-learn (out-of-core learning)
▪ MOA (Massive Online Analysis)
THANKS!
Any questions?