Data stream mining

George Tzinos

Introduction
▪ Large volumes of streaming data are generated every day.
▪ Efficient knowledge discovery from such data streams is an
emerging, active research area in data mining with broad
applications.
▪ Data streams typically arrive continuously, at high speed, in huge
volumes, and with a changing data distribution.
▪ This raises new issues that need to be considered.
▪ Data mining techniques that require multiple scans of the entire
data set cannot be applied directly to stream data, which
usually allows only one scan and demands a fast response time.

Applications
▪ Network traffic
▪ Sensor data
▪ Call center records

Requirements
1. Process one example at a time, and inspect it at most once
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any point
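
The four requirements above can be illustrated with a test-then-train (prequential) loop. This is a minimal sketch, not a reference implementation: `MajorityClassLearner` and `prequential_accuracy` are hypothetical names introduced here, and the learner is deliberately trivial so that the loop structure stays visible.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label so far."""

    def __init__(self):
        # Memory is bounded by the number of distinct labels, not by stream length.
        self.counts = Counter()

    def predict(self, x):
        # Ready to predict at any point; returns None before any example is seen.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Constant-time update from a single example.
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each example first, then train on it; each example is inspected once."""
    correct = 0
    seen = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn(x, y)
        seen += 1
    return correct / seen if seen else 0.0
```

Any learner exposing `predict` and `learn` could be dropped into the same loop; the stream can be a generator, so the full data set never needs to be held in memory.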

Traditional Techniques vs Stream
                   Traditional   Stream
No. of passes      Multiple      Single
Processing time    Unlimited     Restricted
Memory usage       Unlimited     Restricted
Type of result     Accurate      Approximate

Basic Techniques
▪ Sampling
▪ Load shedding
▪ Sketching
▪ Synopsis data structures
▪ Aggregation
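
As one concrete instance of the sampling technique listed above, reservoir sampling keeps a uniform random sample of fixed size from a stream of unknown length in a single pass. The slides do not name a specific sampling method, so this choice is an assumption made for illustration; `reservoir_sample` and its parameters are hypothetical names.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable, in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Memory use is proportional to `k` regardless of how many items the stream delivers.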

Forgetting mechanisms
▪ A learner should be able to react to a changing concept by forgetting
outdated data while learning new class descriptions.
▪ The open question is how to select the range of data to remember.

Utilization of time and space
▪ Sliding Window
▪ Algorithm Output Granularity (AOG)

Windowing techniques - 1
▪ The most popular approach to dealing with time-changing data
involves the use of sliding windows.
▪ Windows provide a way of limiting the number of examples
introduced to the learner.
▪ This eliminates data points that come from an old concept.

Windowing techniques - 3 (Fixed Window)
▪ Each example updates the window, and the classifier is then
updated from that window.
▪ In the simplest approach, sliding windows are of fixed size.
▪ They include only the most recent examples from the data stream.
▪ With each new data point, the oldest example that no longer fits in
the window is discarded.
▪ When using windows of fixed size, the user faces a tradeoff.
▪ A small window lets the classifier react quickly
to changes, but it may lose accuracy in periods of stability.
▪ A large window increases accuracy in periods of
stability, but it fails to adapt to rapidly changing concepts.
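
The fixed-window tradeoff can be sketched with a bounded deque. This is an illustrative toy under stated assumptions: `FixedWindowMajority` is a hypothetical class, and a majority vote over the window stands in for a real classifier that would be retrained from the window contents.

```python
from collections import Counter, deque


class FixedWindowMajority:
    """Toy learner (hypothetical) that predicts the majority label in a fixed-size window."""

    def __init__(self, window_size):
        # deque(maxlen=...) discards the oldest element automatically when full.
        self.window = deque(maxlen=window_size)

    def learn(self, label):
        self.window.append(label)

    def predict(self):
        if not self.window:
            return None
        return Counter(self.window).most_common(1)[0][0]


# A concept change: 50 examples of label "a", then 10 examples of label "b".
small = FixedWindowMajority(window_size=5)
large = FixedWindowMajority(window_size=60)
for label in ["a"] * 50 + ["b"] * 10:
    small.learn(label)
    large.learn(label)
```

After the change, the small window holds only "b" labels and has already adapted, while the large window still holds 50 "a" labels against 10 "b" labels and keeps predicting the old concept.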

Windowing techniques - 4
▪ Weights:
▫ A simple way of making the forgetting process more
dynamic is to give the window a decay function
that assigns a weight to each example.
▫ Older examples receive smaller weights and are treated
as less important by the base classifier.
▫ (Maintaining time-decaying stream aggregates)
▪ FISH
▪ ADWIN
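
The weighting idea above can be sketched as an exponentially decayed running mean. The slides do not specify which decay function is used, so exponential decay is an assumption here, and `DecayedMean` with its `decay` parameter is a hypothetical name. This sketch does not implement FISH or ADWIN.

```python
class DecayedMean:
    """Running mean in which an example that is t steps old has weight decay ** t."""

    def __init__(self, decay):
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.weighted_sum = 0.0
        self.weight_total = 0.0

    def update(self, value):
        # Shrink the contribution of all past examples, then add the new one with weight 1.
        self.weighted_sum = self.decay * self.weighted_sum + value
        self.weight_total = self.decay * self.weight_total + 1.0

    def value(self):
        if self.weight_total == 0.0:
            return 0.0
        return self.weighted_sum / self.weight_total
```

Only two numbers are stored, so no window of raw examples has to be kept. With `decay=1.0` the result reduces to the ordinary mean; smaller values forget older examples faster.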

Classification in Data Streams
▪ Classification, i.e. learning a model in order to assign labels to new,
unlabeled data points, is a well-studied supervised machine
learning task.
▪ Methods include naive Bayes, k-nearest neighbors, classification
trees, support vector machines, rule-based classifiers, and many
more (Hastie et al. 2001).
▪ However, as with clustering, these algorithms need access to the
complete training data several times and are thus not suitable for
data streams with constantly arriving new training data and
concept drift.

Classification in Data Streams - 2
▪ Wang et al. proposed a general framework for mining
concept-drifting data streams.
▪ Domingos et al. proposed VFDT (Very Fast Decision Tree).

Tools for Data Streams
▪ scikit-learn (out-of-core learning)
▪ MOA (Massive Online Analysis)

References
▪ [1] Geoff Hulten et al., Mining Time-Changing Data Streams
▪ [2] Qin Zhang et al., Towards Mining Trapezoidal Data Streams
▪ [3] Neha Gupta, Indrjeet Rajput, Stream Data Mining: A Survey
▪ [4] Johns Hopkins, Data Stream Mining: A Review of Learning Methods and Frameworks
▪ [5] Jiawei Han et al., Data Mining: Concepts and Techniques
▪ [6] Albert Bifet et al., Data Stream Mining: A Practical Approach
▪ [7] Oded Maimon, Lior Rokach, Data Mining and Knowledge Discovery Handbook
▪ [8] Neha Gupta, Indrjeet Rajput, Stream Data Mining: A Survey, International Journal of Engineering
Research and Applications
▪ [9] Dariusz Brzeziński, Mining Data Streams with Concept Drift
