Machine Learning: Foundations, Methodologies, and Applications

Eva Bartz · Thomas Bartz-Beielstein, Editors

Online Machine Learning
A Practical Guide with Examples in Python

Machine Learning: Foundations, Methodologies, and Applications

Series Editors
Kay Chen Tan, Department of Computing, Hong Kong Polytechnic University,
Hong Kong, China
Dacheng Tao, University of Technology, Sydney, Australia
Books published in this series focus on the theory and computational foundations, advanced methodologies and practical applications of machine learning, ideally combining mathematically rigorous treatments of contemporary topics in machine learning with specific illustrations in relevant algorithm designs and demonstrations in real-world applications. The intended readership includes research students and researchers in computer science, computer engineering, electrical engineering, data science, and related areas seeking a convenient medium to track the progress made in the foundations, methodologies, and applications of machine learning.
Topics considered include all areas of machine learning, including but not limited
to:
- Decision tree
- Artificial neural networks
- Kernel learning
- Bayesian learning
- Ensemble methods
- Dimension reduction and metric learning
- Reinforcement learning
- Meta learning and learning to learn
- Imitation learning
- Computational learning theory
- Probabilistic graphical models
- Transfer learning
- Multi-view and multi-task learning
- Graph neural networks
- Generative adversarial networks
- Federated learning
This series includes monographs, introductory and advanced textbooks, and state-
of-the-art collections. Furthermore, it supports Open Access publication mode.
Eva Bartz · Thomas Bartz-Beielstein
Editors

Online Machine Learning

A Practical Guide with Examples in Python

Editors
Eva Bartz
Bartz & Bartz GmbH
Gummersbach, Germany

Thomas Bartz-Beielstein
Institute for Data Science, Engineering, and Analytics
TH Köln
Gummersbach, Germany
ISSN 2730-9908 ISSN 2730-9916 (electronic)


Machine Learning: Foundations, Methodologies, and Applications
ISBN 978-981-99-7006-3 ISBN 978-981-99-7007-0 (eBook)
https://doi.org/10.1007/978-981-99-7007-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore

Paper in this product is recyclable.


Foreword

Do you hear the rumble of the drums? That’s the world of data analytics moving
towards real time. A lot of effort is being poured into turning batch data warehouses
into real-time data warehouses. It seems inevitable that more advanced use cases,
such as machine learning, will also move towards real time. And yet, the field of
online machine learning has already existed for decades. In fact, a lot of modern
deep learning is powered by online learning methods. However, online machine
learning is yet to be fully appreciated. The fact a model operates online is only
scratching the surface. Doing online machine learning can reap great benefits if done
comprehensively and properly. But it also requires a different mental model from the one most practitioners are used to.
I’ve been working on online machine learning for over 5 years. Admittedly, I have
not observed a great shift towards online machine learning. In spite of that, I’ve never
been more convinced online machine learning has enormous merits that are yet to
be uncovered and held in high regard. I believe there are several ways for online
machine learning to grow in popularity. First of all, although it's quite clear Big Tech
companies are running online models in production, there are not enough public
details of how they do it. Practitioners have to be convinced by real and concrete
examples. Secondly, there are not enough tools and libraries that make it easy to do
online machine learning, akin to what scikit-learn did for batch machine learning.
This is something I tried to resolve by creating River, although there are other great
tools out there, such as Vowpal Wabbit. Thirdly, there is a lack of educational material
that explains how to do online machine learning.
This book is a wonderful attempt to address the third point. It covers all the standard
topics of machine learning, but with an online twist. It’s a great introduction to online
machine learning. I hope it will inspire more people to do online machine learning, to
appreciate the value of processing data online, and to do so properly. Once you have
understood the concepts in this book, you will be able to view the world of machine
learning through a different lens. You will be able to see the world as a stream of
data, and you will be able to process it as such. You will be able to build models that
learn from data as it arrives. You will be able to build models that adapt to change.
You will be able to build models that are always up-to-date. You will be able to build models that are always learning. You will be able to build models that are evaluated
in real time. Trust me, it’s worth it.

Paris, France
July 2023

Max Halford
Head of Data at Carbonfact and Co-creator of River
Preface

This book deals with the exciting, seminal topic of Online Machine Learning (OML).
It is divided into three parts: First, we look in detail at the theoretical foundations of
OML. We describe what OML is and ask how it can be compared to Batch Machine
Learning (BML) and what criteria one should develop for a meaningful comparison.
In the second part, we provide practical considerations, and in the third part, we
substantiate them with concrete practical applications.
Why OML? Among other things, it is about the decisive time advantage. This
can be months, weeks, days, hours, or even just seconds. This time advantage can
arise if Artificial Intelligence (AI) can evaluate data continuously, i.e., online. It
does not have to wait until a complete set of data is available, but can already use a
single observation to update the model. Does OML have other advantages besides
the obvious time advantage? If so, what are they? We ask whether there are limitations of BML that OML overcomes. It must also be examined carefully at what price these advantages come. How high is the memory requirement compared to
conventional methods? Memory requirements also mean financial costs, e.g., due to
higher energy requirements. Is OML possibly energy-saving and thus more sustain-
able, i.e., Green IT? Is it possible to obtain comparably good results? Does the quality
(performance) suffer, do the results become less accurate? In order to answer these
questions reliably, we first give an understandable introduction to OML in the theo-
retical part, which is suitable for beginners as well as for advanced users. Then we
justify the criteria we found for the comparability of OML and BML, namely a well-
comprehensible representation of quality, time, and memory requirements. In the
second part, we address the question of exactly how OML can be used in practice.
We are joined by experts from the field who report on their practical experiences,
e.g., requirements for official statistics. We give reasons for recommendations for
the practical use of OML.
We comprehensively present the software packages currently available for OML,
especially "River",1 and offer the Sequential Parameter Optimization Toolbox for River (spotRiver), a software package we developed specifically for OML. We deal in detail with

1 https://riverml.xyz/.


special problems that can occur with data streams. The central problem for data
streams is drift. We deal with the explainability of AI models, interpretability, and
reproducibility as required in upcoming regulations for AI systems. These aspects
can contribute to higher acceptance of AI.
In the application section, we present two detailed studies, one of which uses a large data set with one million observations. We provide evidence of when OML performs
better than BML. Of particular interest is the study on hyperparameter tuning of
OML. Here we show how OML can perform significantly better by optimizing
hyperparameters.

Notebook
Supplementary program code for the applications and examples from this book
can be found in so-called "Jupyter Notebooks" in the GitHub repository https://github.com/sn-code-inside/online-machine-learning/. The notebooks are organized by chapter.

The consulting firm Bartz & Bartz GmbH2 laid the foundation for this book when
it was awarded a contract from a tender of the Federal Statistical Office of Germany
in 2023.3 The Federal Statistical Office of Germany wanted to know whether it makes
sense to use OML now for the treasure trove of data and the evaluation on behalf
of the public sector (see the comments in Chap. 7). The slightly sobering result of our assessment was: interesting perspectives are opening up for the future, but at the
moment there is no immediate prospect of using it. In some cases, there are technical
and organizational hurdles to adapting processes in such a way that the advantages of
OML can really come into play. In some cases, OML processes and implementations
are not yet mature enough.
The topic fascinated us so much that we decided to pursue it further. Prof. Dr.
Thomas Bartz-Beielstein took the question of the practical relevance of OML with
him to the TH Köln, where he continued his research in the field, which had been
ongoing for years. Under his guidance, the research group at the Institute for Data
Science, Engineering, and Analytics (IDE+A)4 was able to develop software to the point where, we believe, its practical suitability has advanced considerably. Thus, we have
combined the expertise of Bartz & Bartz GmbH with the research at the TH Köln,
which resulted in this book.
Overall, the book is equally suitable as a reference manual for experts dealing
with OML, as a textbook for beginners who want to deal with OML, and as a scien-
tific publication for scientists dealing with OML, since it reflects the latest state of
research. But it can also serve as quasi-OML consulting, as decision-makers and

2 https://bartzundbartz.de.
3 https://destatis.de.
4 https://www.th-koeln.de/idea.

practitioners can use our explanations to tailor OML to their needs and use it for
their application, and ask whether the benefits of OML might outweigh the costs.
To name just a few examples from military and civilian practice:
- You use state-of-the-art sensor systems to predict floods. Here, faster prediction can save lives.
- You need to fend off terrorist attacks and use underwater sensors to do so. Here, it can be crucial that the AI "recognizes" more quickly whether harmless water sports enthusiasts are involved.
- You are responsible for observing the airspace. Reconnaissance drones, for example, can be used more efficiently if they can be programmed and trained with very recent AI data evaluations.
- You must be very expeditious in adjusting the production of critical infrastructure goods, such as vaccines, protective clothing, or medical equipment. Here, it can be useful to keep the entire production process, including the raw materials to be used, as up-to-date as possible. This can be achieved by real-time evaluation and translation into requirements based on hospital bed occupancy or sick notes.
- You are a payment service provider and you need to detect fraud attempts virtually in real time.
In conclusion, we note: OML will soon become practical; it is worthwhile to get
involved with it now. This book already presents some tools that will facilitate the
practice of OML in the future. A promising breakthrough is to be expected, because
practice shows that due to the large amounts of data that accumulate, the previous
BML is no longer sufficient. OML is the solution to evaluate and process data streams
in real time and to deliver results that are relevant for practice. Specifically, the book
covers the following topics:
Chapter 1 describes the motivation for this book and the objective. It describes
the drawbacks and limitations of BML and the need for OML. Chapter 2 gives an
overview and evaluation of methods and algorithms with special focus on supervised
learning (classification and regression). Chapter 3 describes procedures for drift
detection. Updateability of OML procedures is discussed in Chap. 4. Chapter 5
explains procedures for the evaluation of OML methods. Chapter 6 deals with special
requirements for OML. Possible OML applications are presented in Chap. 7 and
evaluated by experts in official statistics. The availability of the algorithms in software
packages, especially for R and Python, is presented in Chap. 8.
The computational effort required for updating the OML models, also in compar-
ison to an algorithmically similar offline procedure (BML), is examined experimen-
tally in Chap. 9. There, it is also discussed to what extent the model quality could be
affected, especially in comparison to similar offline methods. Chapter 10 describes
hyperparameter tuning for OML. Chapter 11 presents a summary and gives important
recommendations for practitioners.

Gummersbach, Germany
July 2023

Eva Bartz
Contents

1 Introduction: From Batch to Online Machine Learning
  Thomas Bartz-Beielstein
2 Supervised Learning: Classification and Regression
  Thomas Bartz-Beielstein
3 Drift Detection and Handling
  Thomas Bartz-Beielstein and Lukas Hans
4 Initial Selection and Subsequent Updating of OML Models
  Thomas Bartz-Beielstein
5 Evaluation and Performance Measurement
  Thomas Bartz-Beielstein
6 Special Requirements for Online Machine Learning Methods
  Thomas Bartz-Beielstein
7 Practical Applications of Online Machine Learning
  Steffen Moritz, Florian Dumpert, Christian Jung, Thomas Bartz-Beielstein, and Eva Bartz
8 Open-Source Software for Online Machine Learning
  Thomas Bartz-Beielstein
9 An Experimental Comparison of Batch and Online Machine Learning Algorithms
  Thomas Bartz-Beielstein and Lukas Hans
10 Hyperparameter Tuning
  Thomas Bartz-Beielstein
11 Summary and Outlook
  Thomas Bartz-Beielstein and Eva Bartz

Appendix A: Definitions and Explanations
Appendix B: Supplementary Materials
Glossary
Index
Contributors

Eva Bartz Bartz & Bartz GmbH, Gummersbach, Germany


Thomas Bartz-Beielstein Institute for Data Science, Engineering, and Analytics,
TH Köln, Gummersbach, Germany
Florian Dumpert Federal Statistical Office of Germany, Wiesbaden, Germany
Lukas Hans Institute for Data Science, Engineering, and Analytics, TH Köln,
Gummersbach, Germany
Christian Jung SMS Group GmbH, Siegen, Germany
Steffen Moritz Federal Statistical Office of Germany, Wiesbaden, Germany

Chapter 1
Introduction: From Batch to Online Machine Learning

Thomas Bartz-Beielstein

Abstract Batch Machine Learning (BML), which is also referred to as "offline machine learning", reaches its limits when dealing with very large amounts of data. This is especially true for available memory, handling drift in data streams, and processing new, unknown data. Online Machine Learning (OML) is an alternative to BML that overcomes the limitations of BML. In this chapter, the basic terms and concepts of OML are introduced and the differences to BML are shown.

1.1 Streaming Data

The volume of data generated from various sources has increased tremendously in recent years. Technological advances have enabled the continuous collection of data. The "three Vs" (volume, velocity, and variety) were initially used as criteria to describe big data.1 Volume here refers to the large amount of data, velocity refers to the high speed at which the data is generated, and variety refers to the large variety of data.
The data streams (streaming data) considered in this book pose an even greater
challenge to Machine Learning (ML) algorithms than big data. In addition to the
three big data Vs, there are other challenges that arise, in particular, from volatility
and the possibility that abrupt changes (“drift”) can occur.
Definition 1.1 (Streaming Data) Streaming data is data that is generated in a continuous data stream. It is loosely structured, volatile, always "flowing", and contains unpredictable, sometimes abrupt, changes. Streaming data is a subset of big data with the following characteristics:
- Volume: Streaming data is generated in very large quantities.

1 The three Vs were expanded over time by adding veracity and value to the “five Vs”.


- Velocity: Streaming data is generated at a very high rate.
- Variety: Streaming data is available in very different formats. We refer to this property as "vertical variety".
- Variability: Streaming data is structureless and varies over time. For example, drift can occur gradually or abruptly. We refer to this property as "horizontal variety".
- Volatility: Streaming data is available only once.

Example: Streaming Data

A great deal of data is generated during various daily transactions, such as online
shopping, online banking, or online stock trading. In addition, there is sensor data,
social media data, data from operational monitoring and data from the Internet of
Things, to name just a few examples.

Streaming data requires real-time or near real-time analysis. Since the data stream
is constantly being produced and never ends, it is not possible to store these enormous
volumes of data.
Definition 1.2 (Static Data) By static data we mean data that have been collected at a certain point in time and are no longer changed. They are used in the field of classical ML and have the following properties:
- Volume: Static data usually have a manageable volume.
- Persistence: Static data can be retrieved as often as required. They do not change their structure.
- Structure: Static data are usually structured and are available in tabular form.

The idea for this book is based on a study conducted for the German Federal
Statistical Office. The algorithms described here may also become relevant for official
statistics. One of the main objectives of the Federal Statistical Office is to publish
statistics at regular intervals. New data is continuously being accumulated, which has
to be evaluated. The publication intervals and data volumes are still manageable, but
the current trend is towards new digital data and shorter publication cycles. The
large data volumes and analysis requirements that will then exist could necessitate
novel ML algorithms. This issue is explored in Sect. 7.1.

1.2 Disadvantages of Batch Learning

In this book, we distinguish between algorithms and models: Models are built using
algorithms and data. Most ML algorithms use static data in three steps to build
models:

1. After loading the data, the data is preprocessed.
2. Then, a model is fitted to the data. This step is also called training. During training, the data can be used multiple times.
3. Finally, the performance of the model is calculated on test data that was not used during training.
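To make the three steps concrete, here is a minimal BML sketch in Python; the synthetic data set and the choice of scikit-learn's logistic regression are illustrative assumptions, not taken from the book's notebooks:

# Minimal sketch of the three BML steps (load/preprocess, train, test).
# The synthetic data set and the model choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Step 1: load and preprocess the (static, fully available) data.
X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: fit the model; the training data may be passed over multiple times.
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Step 3: evaluate on test data that was not used during training.
print("test accuracy:", model.score(X_test, y_test))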
Definition 1.3 (Batch Machine Learning) Machine learning that (classically) uses
the entire data set or large subsets of the data set (training data) to build the model
is called “batch machine learning” (BML). Instances can be used multiple times.
Relatively large amounts of time and memory are available. BML is also referred to
as offline machine learning.
BML reaches its limits when data streams are processed. In this case, volatile
data is present that cannot be used multiple times. In addition, batch models may
become outdated due to concept drift (i.e., the data distribution changes over time).
This book presents approaches to solving the following problems that cannot be
solved with classical batch models:
1. Large memory requirements
2. Drift
3. Unknown data
4. Accessibility of data.
These problems are described in detail in Sects. 1.2.1–1.2.4.

1.2.1 Memory Requirements

BML problems occur when the size of the data set exceeds the size of the available amount of Random Access Memory (RAM). Possible solutions are
- optimization of the data types ("sparse representations"),
- use of a partial data set ("out-of-core learning"), i.e., dividing the data into blocks or mini-batches, see Spark MLlib2 or Dask,3 and
- use of highly simplified models.
In these solutions, the data is fitted to the model rather than the model to the data. Therefore, the full potential of online data is not used.
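As an illustration of out-of-core learning, the following sketch feeds mini-batches to scikit-learn's partial_fit interface; the batch generator is a hypothetical stand-in for reading blocks from disk:

# Out-of-core learning sketch: the model sees one mini-batch at a time, so
# the full data set never has to fit into RAM. The batch generator below is
# an illustrative stand-in for reading blocks from disk or a database.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def batches(n_batches=100, batch_size=1_000):
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 10))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

model = SGDClassifier()
for X, y in batches():
    # partial_fit updates the model incrementally; the class labels must be
    # declared because not every class has to appear in each mini-batch.
    model.partial_fit(X, y, classes=[0, 1])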

1.2.2 Drift

In general, structural changes (“drift”) in the data cause problems for ML algorithms.4
For example, in energy consumption forecasting, previously known consumption

2 https://spark.apache.org/mllib/.
3 https://examples.dask.org/machine-learning.html.
4 This section describes the different types of drift. The OML algorithms for drift detection and

handling are described in Chap. 3.



levels are only one element needed for modeling. In practice, future demand is
driven by a number of non-stationary forces such as climate variability, population
growth, or by the introduction of disruptive clean energy technologies. These may
require both gradual and sudden domain adjustments.
Drift causes problems for ML models because models can become outdated—
they become unreliable over time because the relationships they capture are no longer
valid. This leads to a decrease in the performance of these models. Therefore, pre-
diction, classification, regression, or anomaly detection approaches should be able
to detect and respond to concept deviations in a timely manner so that the model can
be updated as soon as possible.
In time-series applications in many fields, such as finance, e-commerce, economics, and healthcare, the statistical properties of the time series may change,
rendering forecasting models useless. Although the concept of the drift problem has
been well studied in the literature, surprisingly little effort has been made to solve it.
We can distinguish three types of drift: feature, label, and concept drift.
In the following, (X, y) denotes a sample, where X is a set of features and y is the target variable. Features can be derived from attributes. Attributes are also referred to as independent variables, and target variables correspondingly as dependent variables. In classification problems, the target variable is a class label, in regression problems the predicted value. Often y is not only determined by X but also by a set of unknown underlying conditions. This leads to the definition of the concept:
Definition 1.4 (Concept) A concept is a relationship between X and y given a set of unknown constraints (a context K).
Definition 1.5 (Feature Drift) Feature drift describes a change in the independent variable X.
A regulatory intervention is an example of feature drift: New laws can change
consumer behavior (Auffarth, 2021; Castle et al., 2021).
Definition 1.6 (Label Drift) Label drift is a change in the target y.
The increase in the average value of goods at retail is given here as an example of label drift.
Definition 1.7 (Concept Drift) Concept drift is a change in the concept, i.e., the relationship between the independent variables X and the target variable y.
ML models cannot observe the underlying conditions that determine a concept and therefore have to make an assumption about which relationship applies to each sample. This is difficult when conditions change, leading to a change in the concept, which is called concept drift. The synthetically generated Friedman-Drift data set provides a vivid example of concept drift.
Definition 1.8 (The Friedman-Drift Data Set) Each observation in the Friedman-Drift data set consists of ten features. Each feature value is drawn uniformly from the interval [0, 1]. Only the first six features, x_0 to x_5, are relevant. The dependent variable is defined by two functions that depend on whether drift is present:

$$
f(x) =
\begin{cases}
10 \sin(\pi x_0 x_1) + 20 (x_2 - 0.5)^2 + 10 x_3 + 5 x_4, & \text{if drift occurs} \\
10 \sin(\pi x_3 x_5) + 20 (x_1 - 0.5)^2 + 10 x_0 + 5 x_1, & \text{otherwise.}
\end{cases}
$$

Note the change in active variables, e.g., from x_0 to x_3, which implements the change in concept.
The synthetically generated Friedman-Drift data set is used in Sect. 9.2.
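A plain-NumPy sketch of a stream generator following Definition 1.8; the stream length and drift point are illustrative parameters (River also ships a ready-made Friedman-Drift generator):

# Sketch of a Friedman-Drift stream following Definition 1.8. The drift
# point is an illustrative parameter; before it, the "otherwise" concept is
# active, afterwards the active features change (e.g., x_0 -> x_3).
import numpy as np

rng = np.random.default_rng(42)

def friedman_drift_stream(n_samples, drift_at):
    for t in range(n_samples):
        x = rng.uniform(0.0, 1.0, size=10)  # ten features, only x_0..x_5 relevant
        if t >= drift_at:
            y = 10 * np.sin(np.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2 + 10 * x[3] + 5 * x[4]
        else:
            y = 10 * np.sin(np.pi * x[3] * x[5]) + 20 * (x[1] - 0.5) ** 2 + 10 * x[0] + 5 * x[1]
        yield x, y

# Example: 10,000 observations with an abrupt concept change at t = 5,000.
stream = friedman_drift_stream(10_000, drift_at=5_000)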

Concept Drift in Practice

An example of concept drift may be the prediction of ozone levels (y) at a particular location based on sensor data (X). We may be able to predict y based on X, where the relationship may depend on the wind direction (the context K), which is not detected by the sensors.

A Simple Example of Concept Drift

The example shown in Fig. 1.1 illustrates how a simple concept drift occurs by
combining three data sets.5

Abrupt and Gradual Concept Drift

The changes in data streams or concept drift patterns can be either gradual or abrupt. Abrupt changes in data streams, or abrupt concept drift, mean a sudden change in the properties of the data, for example a change in mean or a change in variance. It is important to recognize these changes because they have practical implications for applications in quality control, system monitoring, fault detection, and other areas.
If the changes in the distributions of the data in the data streams occur slowly but over a long period of time, then this is referred to as gradual concept drift. Gradual concept drift is relatively difficult to detect. Figure 1.2 shows the difference between gradual and abrupt drift.

In recurrent concept drift, certain features of older data streams reappear after
some time.

5 The example is based on the "Concept Drift" section in the River documentation, see https://riverml.xyz/dev/introduction/getting-started/concept-drift-detection/.

Fig. 1.1 Synthetically generated drift created by combining 1,000 data points each from three different distributions. For the first thousand data points, a normal distribution with mean μ1 = 0.8 and standard deviation σ1 = 0.05 was used. The second thousand data points use μ2 = 0.4 and σ2 = 0.02, and the last thousand use μ3 = 0.6 and σ3 = 0.1. The left figure shows the data over time; histograms of the three data sets are shown on the right

Fig. 1.2 Gradual and abrupt concept drift. The data were generated synthetically using the SEA synthetic data set (SEA) drift generator described in Sect. 5.4.2. Concept changes occurring after 25,000, 50,000, and 75,000 steps are indicated by red vertical lines

Drift and Non-stationarity


Please note that the term drift is often used in a broader sense for non-stationary
behavior. Non-stationary behavior occurs, among other things, when new prod-
ucts are introduced, during hacker attacks, due to vacation periods, when
weather conditions change, when the economic conditions change, or when
sensors are poorly calibrated or new sensors are installed.

In BML, drift handling can be on-demand or periodic: ML models can be retrained regularly, i.e., at specified times (e.g., weekly) or according to specified criteria (event-based, e.g., when new data arrives) to avoid performance degradation.6 Alternatively, training can be triggered on an as-needed basis, i.e., either based on performance monitoring of the models or based on change-detection methods.
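The following toy sketch illustrates such an event-based trigger; the window size and threshold are illustrative assumptions, and Chap. 3 presents principled drift detectors:

# Toy change detector for event-based re-training: compare the mean of a
# fixed reference window with the mean of the most recent window and flag
# drift when they diverge. Window size and threshold are illustrative
# assumptions; see Chap. 3 for principled drift detectors.
from collections import deque

class WindowMeanDetector:
    def __init__(self, window_size=100, threshold=0.5):
        self.reference = []                      # frozen once filled
        self.recent = deque(maxlen=window_size)  # sliding window
        self.window_size = window_size
        self.threshold = threshold

    def update(self, value):
        if len(self.reference) < self.window_size:
            self.reference.append(value)
            return False
        self.recent.append(value)
        if len(self.recent) < self.window_size:
            return False
        ref_mean = sum(self.reference) / self.window_size
        rec_mean = sum(self.recent) / self.window_size
        return abs(rec_mean - ref_mean) > self.threshold  # True: re-train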

1.2.3 New, Unknown Data

Another problem for BML is that it cannot easily learn from new data that contains
unknown attributes. When unknown attributes appear in the data, the model must
learn from scratch with a new data set composed of the old data and the new data.
This is especially difficult in a situation where data with new attributes may occur
every week, day, hour, minute, or even every time a measurement is taken.

Recommender Systems

For example, if a recommender system is to be created for an e-commerce app, then the model probably needs to be trained from scratch every week. With the popularity of the recommender service, the data set with which the model is trained grows. This leads to longer training times and may require additional hardware.

1.2.4 Accessibility and Availability of the Data

Features are generated from the attributes of the data when models are trained. Feature
generation can improve ML performance by generating new features that correlate
better with the target variable and are therefore easier to learn.
Definition 1.9 (Feature Generation) Feature generation describes the process of
creating new features from one or more attributes of the data set.

Feature Generation

Feature generation can be accomplished by transforming an attribute into a new feature using a mathematical function such as the logarithmic transformation. A new feature can also be formed by calculating the distances between multiple attributes.
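A minimal sketch of both variants; the attribute names are hypothetical:

# Feature-generation sketch for the two variants above. The attribute
# names (price, two coordinates) are hypothetical examples.
import numpy as np

def generate_features(price, x_coord, y_coord):
    log_price = np.log1p(price)                  # logarithmic transformation
    dist = np.sqrt(x_coord ** 2 + y_coord ** 2)  # distance derived from two attributes
    return log_price, dist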

In practical applications, some attributes are no longer available after some time,
e.g., because they have been overwritten or simply have been deleted. Thus, features

6 This approach is implemented in the context of “mini batch machine learning”, cf. Definition 1.11.

Table 1.1 Problems and solutions for BML for streaming data

Problem | BML solution | Disadvantages of solution
Memory requirements | Optimization of data types, mini-batch learning, simplified models | Performance degradation, lower accuracy
Drift | Re-training | High effort
New, unknown data | Re-training | High effort
Accessibility, availability of data | No general solution available | —

that were available recently may no longer be available at the current time. In general,
the provision of all data at the same time and in the same place is not always possible
(Auffarth, 2021). Table 1.1 summarizes the problems and solutions for BML.

1.2.5 Other Problems

Furthermore, the common ML assumption that data is independent and identically distributed (IID) (stationarity) is false for most streaming data, since attributes and labels are often correlated. For example, for systems that detect attacks directed against a computer system or computer network, called intrusion detection systems, only the label "no-intrusion" occurs over a long period of time.
Furthermore, most assumptions made in the field of time series analysis do not
apply to streaming data. This refers in particular to temporal correlations used to
decompose time series (trend, seasonality, and residual component). Decomposition
cannot be applied to streaming data because they have little structure and abrupt
changes may occur.

1.3 Incremental Learning, Online Learning, and Stream Learning

The challenges of processing data streams described in Sect. 1.2 led to the devel-
opment of a class of methods known as incremental or online learning methods,
the development of which has been heavily promoted in recent years (Bifet, 2018;
Losing et al., 2018). In particular, the development of the River7 framework has
helped incremental learning gain popularity in recent years (Montiel et al., 2021).

7 https://riverml.xyz/.

The purpose of incremental learning is to fit an ML model to a stream of data. In this process, the data are not fully available; instead, observations are provided one at a time.
Since the term “online learning” is often used in the context of educational
research, we use the term “online machine learning”, or OML for short, in the follow-
ing. Also in common use are the terms “incremental learning” and “stream learning”.
For the conventional ML approach, the term “batch machine learning” introduced in
Definition 1.3 is used.
Definition 1.10 (Online Machine Learning (OML)) Online machine learning is ML
that uses single instances to create and update the model. Instances can only be used
once. There is relatively little time (real-time capability) and relatively little memory
available.
While writing this book, it became clear that defining another class would be
useful:
Definition 1.11 (Mini-Batch Machine Learning) Mini-batch machine learning is
ML that uses subsets, so-called mini-batches, of the entire data set (training data)
to repeatedly build the model. Mini-batches are usually used only once. There is
relatively little time and relatively little memory available.
From the requirements described in this section, the axioms for stream learning
can be derived.
Definition 1.12 (Axioms for Stream Learning) In the literature, e.g., Bifet (2018),
the following five axioms are used:
1. Each instance can be used only once.
2. Processing time is severely limited.
3. Memory is limited.
4. The algorithm must be able to return a result at any time (“anytime property”).
5. Data streams are assumed to change over time, i.e., data sources are not stationary.
Korstanje (2022) also distinguishes between “incremental learning” and “adap-
tive learning”. Incremental learning uses models that can be updated with a single
observation at a time. Adaptive learning is defined as follows:
Definition 1.13 (Adaptive Learning) Adaptive methods adjust the model to new
situations. New trends that occur in the underlying data are taken into account.
We use the term OML for incremental and adaptive learning methods that provide
methods for dealing with the axioms for stream learning presented in Definition 1.12.

1.4 Transitioning Batch to Online Machine Learning

An incremental learning algorithm can be approximated by using a batch learner with a sliding window. In this case, the model is re-trained each time a new window (of data points) arrives. Any batch learner, such as linear regression, can be applied to a data stream with a sliding window to approximate an incremental learning algorithm. Examples include the mini-batch gradient descent method and the Stochastic Gradient Descent (SGD) method.
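A sketch of this sliding-window approximation; the window size and re-training interval are illustrative assumptions:

# Sliding-window approximation of incremental learning: a batch learner
# (ordinary linear regression) is re-trained on the w most recent
# observations at a fixed interval. Window size and interval are
# illustrative assumptions.
from collections import deque
import numpy as np
from sklearn.linear_model import LinearRegression

class SlidingWindowRegressor:
    def __init__(self, window_size=500, retrain_every=100):
        self.window = deque(maxlen=window_size)  # keeps only the newest pairs
        self.retrain_every = retrain_every
        self.model = LinearRegression()
        self.seen = 0

    def learn_one(self, x, y):
        self.window.append((x, y))
        self.seen += 1
        if self.seen % self.retrain_every == 0 and len(self.window) == self.window.maxlen:
            X = np.array([xi for xi, _ in self.window])
            Y = np.array([yi for _, yi in self.window])
            self.model.fit(X, Y)  # full batch re-training on the current window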

Stochastic Gradient Descent

The gradient descent method8 is a popular batch method to find the minimum of a function (the so-called cost or objective function). For large data sets, a single
update of the parameters takes a long time because the entire training data set is used
for this purpose.
The SGD method is an iterative optimization method and can be considered as
a stochastic approximation of the gradient descent method. The gradient, which is
calculated from the entire data set in the gradient descent method, is replaced by
an estimate that uses only a randomly selected subset of the data set. The SGD
algorithm is an example of an OML algorithm that updates the model parameters at
each training observation.
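A plain-NumPy sketch of such a per-observation update for a linear model; the learning rate and squared-error loss are illustrative choices:

# Per-observation SGD sketch for a linear model: each incoming example
# (x, y) triggers one gradient step on the squared-error loss, i.e., the
# parameters are updated at every training observation. Learning rate and
# loss are illustrative choices.
import numpy as np

class SGDLinearRegressor:
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict_one(self, x):
        return self.w @ np.asarray(x, dtype=float) + self.b

    def learn_one(self, x, y):
        x = np.asarray(x, dtype=float)
        error = self.predict_one(x) - y
        self.w -= self.lr * error * x  # gradient of 0.5 * error**2 w.r.t. w
        self.b -= self.lr * error      # gradient w.r.t. the intercept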

Notebook Example: Stochastic Gradient Descent


The Jupyter Notebook in the GitHub repository https://github.com/sn-code-inside/online-machine-learning/ explains the difference between classical and stochastic gradient descent.

References

Auffarth, B. (2021). Machine learning for time-series with Python: Forecast, predict, and detect. Packt.
Bifet, A., et al. (2018). Machine learning for data streams with practical examples in MOA. MIT
Press.
Castle, S., Schwarzenberg, R., & Pourvali, M. (2021). Detecting covariate drift with explana-
tions. Natural Language Processing and Chinese Computing: 10th CCF International Confer-
ence, NLPCC 2021, Qingdao, China, October 13–17, 2021, Proceedings, Part II (pp. 317–322).
Springer.

8 See Definition A.1 in the Appendix.



Korstanje, J. (2022). Machine learning for streaming data with Python. Packt.
Losing, V., Hammer, B., & Wersing, H. (2018). Incremental on-line learning: A review and com-
parison of state of the art algorithms. Neurocomputing, 275, 1261–1274.
Montiel, J., et al. (2021). River: Machine learning for streaming data in Python. Journal of Machine
Learning Research, 22(1), 4945–4952. ISSN: 1532-4435.
Chapter 2
Supervised Learning: Classification and Regression

Thomas Bartz-Beielstein

Abstract This chapter provides an overview and evaluation of Online Machine Learning (OML) methods and algorithms, with a special focus on supervised learning. First, methods from the areas of classification (Sect. 2.1) and regression (Sect. 2.2) are presented. Then, ensemble methods are described in Sect. 2.3. Clustering methods are briefly mentioned in Sect. 2.4. An overview is given in Sect. 2.5.

2.1 Classification

2.1.1 Baseline Algorithms

In the area of OML classification, there are so-called “baseline algorithms” that are
briefly presented here, as they serve as building blocks for more complex OML
methods.
The Majority-Class classifier counts the occurrences of the individual classes
and selects the class with the highest frequency for new instances. The No-Change
classifier selects the last class from the data stream. The Lazy Classifier is a classifier
that stores some already observed instances and their classes. A new instance is
classified into the class of the nearest already observed instance.

Example: k-NN Classifier

Examples of lazy classifiers are k-nearest neighbor (k-NN) algorithms. In k-NN, the class membership is determined based on a majority decision: The class that occurs most frequently in the neighborhood of the data point to be classified is chosen. k-NN is a "lazy classifier" because no training process is run. Only the training data set is stored. Learning only takes place when a classification is made.
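A minimal sketch of a lazy k-NN classifier; in a streaming setting, the stored instances would additionally be bounded, e.g., by a sliding window:

# Lazy k-NN sketch: "training" merely stores instances; all work happens at
# prediction time, when the k nearest stored neighbors vote. In a streaming
# setting the storage would be bounded, e.g., by a sliding window.
import numpy as np
from collections import Counter

class LazyKNN:
    def __init__(self, k=3):
        self.k = k
        self.X, self.y = [], []

    def learn_one(self, x, y):
        self.X.append(np.asarray(x, dtype=float))  # no training step, just storage
        self.y.append(y)

    def predict_one(self, x):
        x = np.asarray(x, dtype=float)
        dists = [np.linalg.norm(x - xi) for xi in self.X]
        nearest = np.argsort(dists)[: self.k]       # indices of the k closest
        votes = Counter(self.y[i] for i in nearest)
        return votes.most_common(1)[0][0]           # majority decision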


2.1.2 The Naive-Bayes Classifier

The Naive-Bayes classifier is based on the Bayes theorem (see Theorem A.1 in
the appendix). It calculates the probabilities of the individual classes based on the
attributes and then selects the class with the highest probability. Since the Naive-
Bayes classifier is a simple and inexpensive incremental method, it is briefly presented
here. In addition, its elements play an important role in the creation of Hoeffding
trees, which are presented in Sect. 2.1.3.1.

Naive-Bayes Classifier
We assume that there are k discrete attributes x_1, x_2, ..., x_k and n_c different classes. In the following, v_j denotes the value of an attribute and c the class to which an observation belongs. The information from the training data is summarized in a table that stores a counter n_{i,j,c} for each triple (x_i, v_j, c).
For example, if the observations shown in Table 2.1 are available and a new observation B with the values

(x_1 = 1, x_2 = 1, x_3 = 1, x_4 = 0)

arrives, whose class membership is to be determined, then the two probabilities

P(Y = 0 | B) ∝ P(Y = 0) P(B | Y = 0)
P(Y = 1 | B) ∝ P(Y = 1) P(B | Y = 1)

are calculated using Bayes' theorem. For the two classes "0" and "1", we obtain Table 2.2, the table of absolute frequencies. The Laplace correction is applied to calculate the frequencies for the classes that do not occur in the training data. The Laplace correction results from n_{i,j,c} + 1, i.e., the frequency for each class c is increased by 1.
After applying the Laplace correction, we obtain the values shown in Table 2.3, with which we can calculate the probabilities for P(B | Y = 0) and P(B | Y = 1). It holds:

Table 2.1 Labeled observations used as training data

y | x1 | x2 | x3 | x4
1 | 0  | 1  | 0  | 1
0 | 1  | 0  | 0  | 0
1 | 1  | 1  | 1  | 1
0 | 1  | 0  | 0  | 1

P(B | Y = 0) = P(x_1 = 1, x_2 = 1, x_3 = 1, x_4 = 0 | Y = 0)
             = 1/2 × 1/4 × 1/4 × 1/2 = 1/64

and thus P(Y = 0 | B) = 1/2 × 1/64 = 1/128. In comparison,

P(B | Y = 1) = P(x_1 = 1, x_2 = 1, x_3 = 1, x_4 = 0 | Y = 1)
             = 1/2 × 3/4 × 1/2 × 1/4 = 3/64

and thus P(Y = 1 | B) = 1/2 × 3/64 = 3/128. Since P(Y = 1 | B) > P(Y = 0 | B), the Naive-Bayes classifier selects the class "1" for the new observation B.
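The worked example can be reproduced in a few lines of Python; the Laplace-corrected counts are taken directly from Table 2.3:

# Reproducing the worked example with the Laplace-corrected counts from
# Table 2.3: counts[c][i] holds the corrected counts (v = 0, v = 1) for
# attribute x_{i+1} given class c.
from fractions import Fraction

counts = {
    0: [(2, 2), (3, 1), (3, 1), (2, 2)],  # Table 2.3(a), columns x1..x4
    1: [(2, 2), (1, 3), (2, 2), (1, 3)],  # Table 2.3(b), columns x1..x4
}
prior = Fraction(1, 2)   # two of four training observations per class
B = (1, 1, 1, 0)         # new observation (x1, x2, x3, x4)

def posterior(c):
    p = prior
    for i, v in enumerate(B):
        col = counts[c][i]
        p *= Fraction(col[v], sum(col))  # P(x_i = v | Y = c)
    return p

print(posterior(0), posterior(1))  # 1/128 and 3/128 -> class "1" is chosen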

The table entries shown in Table 2.2 play an important role as statistics in trees, see Definition 2.1. They can be represented as triples (x_i, v_j, c) with values n_{i,j,c}. For the first entry (x_i = 1, v_j = 0, c = 0) in Table 2.2b, we obtain n_{1,0,0} = 1. For the last entry (x_i = 4, v_j = 1, c = 1) in Table 2.2b we obtain n_{4,1,1} = 2.
Prominent in the area of OML classification are tree-based methods (so-called "trees"), such as Hoeffding Trees (HTs) and Hoeffding Adaptive Trees (HATs). In addition to the tree-based methods presented in Sect. 2.1.3, we also present more specific methods such as Support Vector Machines (SVMs) and Passive-Aggressive (PA) classifiers in Sect. 2.1.4.

Table 2.2 Absolute frequencies without Laplace correction

(a) Frequencies for Y = 0
y | x1 | x2 | x3 | x4
0 | 1  | 2  | 2  | 1
1 | 1  | 0  | 0  | 1

(b) Frequencies for Y = 1
y | x1 | x2 | x3 | x4
0 | 1  | 0  | 1  | 0
1 | 1  | 2  | 1  | 2

Table 2.3 Absolute frequencies after Laplace correction

(a) Frequencies for Y = 0
y | x1 | x2 | x3 | x4
0 | 2  | 3  | 3  | 2
1 | 2  | 1  | 1  | 2

(b) Frequencies for Y = 1
y | x1 | x2 | x3 | x4
0 | 2  | 1  | 2  | 1
1 | 2  | 3  | 2  | 3


Fig. 2.1 Tree for the classification of the SEA data set. The root of the tree is a node where the first test of the attribute x_1 takes place: it is tested whether x_1 is greater or less than 4.5455. The branches represent the results of the test. They lead to more nodes until the final nodes or leaves are reached. The leaves are the predictions for the classes Y = 0 and Y = 1. The color scale symbolizes the relative class frequencies in the nodes: from dark blue for high probability "false" to light blue and light orange to dark orange for high probability "true"

2.1.3 Tree-Based Methods

A challenge in processing data streams is the high memory requirement. It is impossible to store all data. Since trees allow a compact representation, they are popular methods in the area of OML. Figure 2.1 shows an example tree for the classification of the SEA synthetic data set (SEA).
Trees have the following elements:
1. Node: Testing an attribute
2. Branch: Result of the test
3. Leaf or terminal node: Prediction (of a class in classification).
We introduce two important representatives of tree-based OML methods in this section: HTs, also known as Very Fast Decision Trees (VFDTs), in Sect. 2.1.3.1, and the Extremely Fast Decision Tree classifier in Sect. 2.1.3.2.

2.1.3.1 Hoeffding Trees

A Batch Machine Learning (BML) tree uses instances multiple times to calculate
the best split attributes (“splits”). Therefore, the use of BML decision tree methods
such as Classification And Regression Tree (CART) (Breiman, 1984) is not possible
in a streaming-data context. Hoeffding trees are the OML counterpart to the BML
trees (Domingos & Hulten, 2000). However, they do not use the instances multiple
times, but work directly on the incoming instances. They thus fulfill the first axiom
for stream learning (Definition 1.12).
Hoeffding trees are better suited for OML as incremental decision tree learners.
They are based on the idea that a small sample is often sufficient to select an optimal
split attribute. This idea is supported by the statistical result known as the Hoeffding
bound, see Theorem A.2 in the appendix. This result can be illustrated with the following example:

Example: Urn Model

An urn contains a very large number of red and black balls. We want to answer the
question of whether the urn contains more black or more red balls. To do this, we
draw a ball from the urn and observe its color, where the process can be repeated as
often as desired.
After the process has been carried out ten times, we have obtained 4 red and 6
black balls, after one hundred attempts 47 red and 53 black balls, after one thousand
attempts 501 red and 499 black balls. We can now (with a small uncertainty) say that
the urn contains the same number of black and red balls, without having to draw all
the balls. The probability that we are wrong is very low.
The Hoeffding bound depends on the number of observations and the permitted uncertainty. This uncertainty can be determined at the beginning using a confidence bound s.

Definition 2.1 (Hoeffding Tree (HT)) The Hoeffding tree stores the statistics S in each node to perform a split. For discrete attributes, this is the same information that is also used by the Naive-Bayes predictor: For each triple (x_i, v_j, c), a table with the counter n_{i,j,c} of the instances with x_i = v_j and class value C = c is managed.
The HT uses two input parameters, the data stream D with labeled examples and a confidence bound s. The following pseudocode shows an algorithmic description of the HT algorithm according to Bifet (2018):

HoeffdingTree(D, s):
    let HT be a tree with a single leaf (the root)
    initialize the counts n_{i,j,c} at the root
    for each example (x, y) in D:
        HTGrow((x, y), HT, s)
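In practice, Hoeffding trees are available off the shelf. A minimal sketch using River's HoeffdingTreeClassifier on the built-in Phishing stream follows; the grace_period value is an illustrative choice:

# Minimal Hoeffding-tree sketch with River (https://riverml.xyz). The model
# is evaluated test-then-train: each instance is first used for prediction,
# then for a single incremental update, and is never stored or reused.
# The grace_period value is an illustrative choice.
from river import datasets, evaluate, metrics, tree

model = tree.HoeffdingTreeClassifier(grace_period=200)
metric = metrics.Accuracy()

# Progressive validation over the built-in binary Phishing data stream.
print(evaluate.progressive_val_score(datasets.Phishing(), model, metric))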