
Master’s thesis

Improving Internet Congestion Control With Packet Loss Prediction Using Machine Learning
Is it possible to predict if a packet will be lost before sending it?

Maximilian von Stephanides

Informatics: Programming and System Architecture


60 ECTS study points

Department of Informatics
Faculty of Mathematics and Natural Sciences

Autumn 2023
Supervisors:
Safiqul Islam
Michael Welzl
Abstract

Congestion and resulting packet loss in TCP connections can lead to performance degradation and reduced Quality of Service (QoS) for end users.
If a packet is dropped between two endpoints that use TCP, it results in delays
while the lost packet is retransmitted and finds its way to the destination.
Some TCP congestion control algorithms, such as Cubic, adjust their
sending rate reactively after experiencing congestion signals in the form of
packet losses. This leads to behavior where a sender keeps increasing its
sending rate until congestion and therefore packet losses occur. Reacting
proactively instead could allow the sender to reduce its sending rate before
congestion occurs and could therefore lead to a reduction in packet loss and the resulting retransmissions, which in turn means lower latency and improved QoS for the end users.
There is some regularity in when a congestion control mechanism will lose
a packet. For example, a single TCP connection alone could lose packets at
regular intervals — or, if competing with other traffic, it might lose packets at regular intervals caused by other connections (related to the round-trip times of the cross-traffic). This thesis answers the question: Can
we use machine learning to predict packet loss, and use the result of the
prediction to perform actions like proactively reducing the sending rate?
This thesis investigates the idea of real-time packet loss prediction. We
identify which input variables and machine learning models could be applied
to tackle this problem, and train and evaluate multiple models which are
tested in real time on running TCP connections. We find that the models
yield good results and successfully present a proof of concept for real-time
packet loss prediction on TCP connections. We also show how Explicit
Congestion Notification (ECN) can be leveraged to proactively reduce the
sending rate based on model predictions and, by doing so, reduce the number of retransmissions that occur due to congestion and resulting packet loss.

Contents

1 Introduction
    1.1 Problem statement
    1.2 Research questions
    1.3 Contributions
    1.4 Organization
2 Background
    2.1 Congestion control and packet loss
        2.1.1 Congestion induced packet loss
        2.1.2 The problem with congestion
        2.1.3 Queue size and BDP
        2.1.4 Congestion control
        2.1.5 Components involved in congestion control mechanisms
        2.1.6 Slow start and congestion avoidance
        2.1.7 Fast retransmit and fast recovery
        2.1.8 Common congestion control algorithms
        2.1.9 ECN
        2.1.10 Non-loss-based congestion control algorithms
    2.2 Machine learning
        2.2.1 The machine learning process
        2.2.2 Data collection and transformation
        2.2.3 Features
        2.2.4 Supervised learning
        2.2.5 Binary classification
        2.2.6 Ensemble learning
        2.2.7 Bias and variance
        2.2.8 Reinforcement learning
        2.2.9 Sequence models
    2.3 Related works
        2.3.1 Machine learning based congestion control
        2.3.2 Packet loss prediction
3 Machine learning model design and evaluation
    3.1 Model choice
    3.2 Tools
        3.2.1 Network tools
        3.2.2 Machine learning libraries
    3.3 Data collection
        3.3.1 Network configuration
        3.3.2 Initial approach
        3.3.3 Final approach
    3.4 Data transformation
        3.4.1 Parsing the output from the data collection process
        3.4.2 Feature selection
        3.4.3 Extracting the relevant data
        3.4.4 Labeling the training data
        3.4.5 Creating a CSV file
        3.4.6 Creating the final dataset
        3.4.7 Cleaning the data
        3.4.8 Creating the training, validation, and test datasets
    3.5 Phase one classifiers
        3.5.1 Model creation
        3.5.2 Results
    3.6 Phase two classifiers
        3.6.1 Model creation
        3.6.2 Results
    3.7 Phase three classifiers
        3.7.1 Model creation
        3.7.2 Results
4 Model inference and results
    4.1 TCP connection setup and model inference
    4.2 Data preparation module
    4.3 Prediction module
    4.4 Results
        4.4.1 Single flow results
        4.4.2 Background traffic results
5 Conclusion
    5.1 Addressing research questions
    5.2 Contributions
    5.3 Future research
        5.3.1 Real-life tests
        5.3.2 Different machine learning models
        5.3.3 Different machine learning features
        5.3.4 Different training data
        5.3.5 A common model for multiple congestion control algorithms
        5.3.6 Dynamic FEC

List of Figures

2.1 The congestion window behavior of TCP Reno
2.2 The congestion window behavior of TCP Cubic
2.3 The ECN signaling process between a sender and receiver
2.4 The congestion window behavior of BBR
2.5 Confusion matrix visualizing the number of True Positives, False Positives, True Negatives, and False Negatives

4.1 The test setup showing the flow of data when the model was applied to perform predictions on a connection in real time
4.2 cwnd plot from a connection configured using TCP Reno with 30 ms delay, 50 Mbps bandwidth, and 1 BDP queue size
4.3 cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70 ms delay, 50 Mbps bandwidth, 0.5 BDP queue size, and 0.25 classification threshold
4.4 cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 30 ms delay, 50 Mbps bandwidth, 1 BDP queue size, and 0.25 classification threshold
4.5 cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 30 ms delay, 50 Mbps bandwidth, 1 BDP queue size, and 0.1 classification threshold
4.6 Retransmission reduction for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
4.7 Throughput change for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
4.8 Retransmission reduction for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds
4.9 Throughput change for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds
4.10 Trade-off between retransmission reduction and throughput change for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
4.11 Trade-off between retransmission reduction and throughput change for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds
4.12 cwnd plot from a connection configured using TCP Reno with 30 ms delay, 50 Mbps bandwidth, 0.5 BDP queue size, and 0.1 classification threshold, resulting in zero retransmissions
4.13 cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70 ms delay, 50 Mbps bandwidth, 0.25 BDP queue size, background traffic, and 0.5 classification threshold
4.14 cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70 ms delay, 50 Mbps bandwidth, 0.25 BDP queue size, background traffic, and 0.5 classification threshold
4.15 cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70 ms delay, 50 Mbps bandwidth, 0.5 BDP queue size, background traffic, and 0.5 classification threshold
4.16 cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70 ms delay, 50 Mbps bandwidth, 0.5 BDP queue size, background traffic, and 0.5 classification threshold
4.17 cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70 ms delay, 50 Mbps bandwidth, 1 BDP queue size, background traffic, and 0.5 classification threshold
4.18 cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70 ms delay, 50 Mbps bandwidth, 1 BDP queue size, background traffic, and 0.5 classification threshold
4.19 Retransmission reduction for TCP Reno when running with background traffic with model inference enabled at various classification thresholds
4.20 Throughput change for TCP Reno when running with background traffic with model inference enabled at various classification thresholds
4.21 Retransmission reduction for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds
4.22 Throughput change for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds
4.23 Trade-off between retransmission reduction and throughput change for TCP Reno when running with background traffic with model inference enabled at various classification thresholds
4.24 Trade-off between retransmission reduction and throughput change for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds

List of Tables

3.1 Programs used for network configuration and data collection, and their versions
3.2 Features selected from or based on values in the ss output
3.3 Relevant fields from the ss output and their ss man page descriptions [85]
3.4 Network connection parameters used to configure the data collection procedure in phase one
3.5 The proportion of samples for each class in the training, validation, and test sets for the Reno phase one dataset
3.6 The proportion of samples for each class in the training, validation, and test sets for the Cubic phase one dataset
3.7 Reno phase one results
3.8 Cubic phase one results
3.9 Reno phase one confusion matrices
3.10 Cubic phase one confusion matrices
3.11 Feature importances for the Reno phase one model
3.12 Feature importances for the Cubic phase one model
3.13 An excerpt from the Reno phase one training dataset
3.14 An excerpt from the Cubic phase one training dataset
3.15 Connection parameters for phase two
3.16 The proportion of samples for each class in the training, validation, and test sets for the Reno phase two dataset
3.17 The proportion of samples for each class in the training, validation, and test sets for the Cubic phase two dataset
3.18 Reno phase two results
3.19 Cubic phase two results
3.20 Reno phase two confusion matrices
3.21 Cubic phase two confusion matrices
3.22 An excerpt from the Reno phase two training dataset, shown as groups of five consecutive samples taken from different parts of the aggregated training data; each group contains one True sample, the two False samples that occurred immediately before it, and the two False samples that occurred immediately after it
3.23 An excerpt from the Cubic phase two training dataset, shown in the same group-of-five format as Table 3.22
3.24 Feature importances for the Reno phase two model
3.25 Feature importances for the Cubic phase two model
3.26 Feature importances difference for the Reno phase one and phase two models
3.27 Feature importances difference for the Cubic phase one and phase two models
3.28 Connection parameters for phase three
3.29 Hyperparameters for the Reno phase three model
3.30 Hyperparameters for the Cubic phase three model
3.31 The proportion of samples for each class in the training, validation, and test sets for the Reno phase three dataset
3.32 The proportion of samples for each class in the training, validation, and test sets for the Cubic phase three dataset
3.33 Reno phase three results
3.34 Cubic phase three results
3.35 Reno phase three confusion matrices
3.36 Cubic phase three confusion matrices
3.37 An excerpt from the Reno phase three training dataset, shown in the same group-of-five format as Table 3.22
3.38 An excerpt from the Cubic phase three training dataset, shown in the same group-of-five format as Table 3.22
3.39 Feature importances for the Reno phase three model
3.40 Feature importances for the Cubic phase three model
3.41 Differences in feature importances for Reno models across phases
3.42 Differences in feature importances for Cubic models across phases

4.1 The scenarios that were considered when running the model inference related tests
4.2 Results for a specific Reno single flow scenario with a configured delay of 30 ms, bandwidth of 50 Mbps, and queue size of 0.5 BDP
4.3 Difference in retransmissions between the single flow and background traffic cases for TCP Reno for the various scenarios without model inference
4.4 Difference in retransmissions between the single flow and background traffic cases for TCP Cubic for the various scenarios without model inference

Preface

First of all, I would like to thank my fantastic supervisors: Dr. Safiqul Islam and Dr.
Michael Welzl. It goes without saying that this thesis would not have been possible
without you, or at the very least, it would not have been as good. You have always
helped when I was stuck with a problem or simply had a few questions. Through
multiple meetings and discussions, you have either guided me on the correct path or
helped me find out where to go next. I have been truly inspired by your enthusiasm for
research and congestion control in particular. On numerous occasions, I have witnessed
you both being visibly excited over the problem discussed in this thesis, and this has
been a great source of motivation for me throughout the months I have spent working on
it. I also want to thank you both for going above and beyond by sometimes answering
my emails during weekends or way past normal working hours. I have learned a lot from
both of you, and will always carry this experience with me.
I would like to thank my family, especially my parents, for always being supportive and
believing in me. Your support, both emotional and financial, throughout my years spent
studying means a lot to me, and is something that I do not take for granted.
I would like to thank my girlfriend. Not only have you been supportive and helped me when I was stressed or worried about something; your family has been as well. The
way you have discussed my thesis and my progress with your family has been inspiring
and shows me that you truly care.
Lastly, I would like to thank my friends and everyone at “Oslo Styrkeløftklubb” for
cheering me up and keeping me sane throughout this process.

Chapter 1

Introduction

1.1 Problem statement


Congestion and resulting packet loss can lead to performance degradation and reduced
Quality of Service (QoS) for end users. This is especially prevalent in TCP where the
Head-of-Line (HOL) blocking [46] problem results in packets in a queue having to wait
because a packet at the head of the queue (line) cannot move forward due to congestion.
If a single packet is dropped or lost in the network somewhere between two endpoints
that use TCP, it results in the entire TCP connection being brought to a halt while the
lost packet is retransmitted and finds its way to the destination. Packet loss and HOL
blocking can lead to increased delay and reduced QoS for the end users relying on the
TCP flow.
Recent protocols such as QUIC [45] have tried to solve this problem. However, this
requires using and deploying an entirely new protocol instead of relying on existing ones such as TCP with the Cubic congestion control algorithm, which is the default in the Linux kernel and is therefore used by millions of machines around the world.
More traditional TCP congestion control algorithms, such as Cubic, only adjust their
sending rate reactively after experiencing congestion signals in the form of packet losses
and resulting retransmissions. This leads to behavior where the sender keeps increasing
its sending rate until congestion occurs, which results in packet losses. These packet
losses are then detected by mechanisms such as the retransmission timeout (RTO) or
Duplicate ACKs (DupACKs), and the sender is informed and adjusts its sending rate.
Reacting proactively instead could allow the sender to reduce its sending rate before
congestion occurs and could therefore lead to a reduction in packet loss and the resulting retransmissions, which in turn means lower latency and improved QoS for the end users.

1.2 Research questions


The problem statement above leads us to define the following main research question
that this thesis attempts to answer, and that is also the subtitle of this thesis:
Is it possible to predict if a packet will be lost before sending it?
In addition, we have decided to split the main research question into multiple sub-
questions, all of which are defined below:


Plug-and-play solution Is it possible to do this for multiple congestion control algorithms?
Simplicity and performance Can we do this using a simple machine learning model,
such as a tree-based classifier?
Retransmission reduction Can we use a mechanism like Explicit Congestion
Notification (ECN) to proactively reduce the sending rate before congestion occurs
and packets are dropped and, by doing so, reduce the number of retransmissions?

1.3 Contributions
In this thesis, we have done extensive research on the topic of real-time packet loss
prediction, including which TCP state variables could be informative to a potential
solution and could be used to select or construct machine learning features. The following
bullet points summarize our contributions in this thesis related to data collection, data
transformation, and model design and evaluation:
• Performed extensive data collection in the form of tests involving TCP connections
configured as a single flow between a sender and receiver with and without the
presence of background traffic and with multiple congestion control algorithms.
• Trained and tuned multiple machine learning models, which we have evaluated
using multiple performance metrics and through more manual inspections and
analysis.
• Investigated classification performance differences for models that were trained
and evaluated on data collected with and without the presence of background
traffic, and models that were trained and evaluated on data collected only without
background traffic.
In addition, we have performed real-time model inference tests using a custom test setup.
The following bullet points summarize our contributions in this thesis related to real-time
model inference and proactive sending rate adjustments using ECN:
• Investigated, through various tests, how the models perform in real time and how accurate and informative the predictions are.
• Ran tests for various congestion control algorithms and for connections with and without background traffic.
• Calculated metrics such as retransmission reduction and throughput change, and
created various plots for showing the correlation between model inference and these
metrics, where we have shown how the results vary with the chosen classification
threshold.
• Shown how, through proactive sending rate adjustments using model predictions
and ECN, we can reduce the number of retransmissions in a TCP connection.
Finally, we have briefly touched on potential improvements that could be made to both
the training data and models, as well as future work that could be performed to further
investigate and potentially improve the performance of our models.


1.4 Organization
The remainder of this thesis is organized into the following chapters:
Chapter 2: Background Chapter 2 presents background information on the most
relevant concepts: mainly congestion control and machine learning. This chapter
also presents related works and discusses how they relate to the work presented in this thesis. This chapter therefore exists to give context to the work we have performed here and to better inform the reader about concepts they may be less familiar with.
Chapter 3: Model design and evaluation Chapter 3 discusses how we developed
and evaluated the machine learning models we created in this thesis for the purpose
of real-time packet loss prediction. In this chapter, we discuss topics such as the
following:
• How the network was configured in order to collect data.
• How we collected data, including the type of data and how much.
• How we transformed the data multiple times in order to extract or construct
various machine learning features and create various datasets that were used
for training and evaluation when creating the machine learning models.
• How we created the models using machine learning algorithms such as
LightGBM or XGBoost, and various results in the form of metrics such as
precision, recall, F1 score, and feature importances.
• Potential reasons for the various feature importances and how the feature
importances changed with the training data, due to factors such as the added
presence of background traffic.
Chapter 4: Model inference and results Chapter 4 discusses how we performed
model inference by creating a custom test setup for testing the machine learning
models in real time. In this chapter, we explain how we leveraged ECN to
reduce the sending rate proactively based on model predictions, and the resulting
retransmission reductions. We also discuss how metrics such as the retransmission
reduction and throughput change varied with the classification threshold that the
model was configured with.
Chapter 5: Conclusion In Chapter 5 we address the research questions, summarize
our contributions, present concluding remarks, and explore possible areas for future
research.

Chapter 2

Background

Both congestion control and machine learning are broad topics with a lot of active
research. The latter has gained a lot of popularity in recent years, especially with the growing prominence of Artificial Intelligence (AI), while the former has long been a topic of interest for improving the way the Internet works.
Congestion control — a technique used by transport protocols like TCP — is strongly
related to the concept of packet loss, with a more congested network leading to a
higher probability of packet loss [94]. Since packet loss is typically undesired, various
mechanisms are in place today in order to combat congestion.
Machine learning, on the other hand, is a subset of AI that focuses on the development
of specific algorithms, known as machine learning algorithms, that allow computers to
learn and make decisions from data. These algorithms together with the data form what
is known as machine learning models [54]. These models, instead of being explicitly
programmed to perform a task, use patterns and inference to produce decisions based
on new data.
In recent years there has been a convergence of these two fields, with many researchers
exploring the application of machine learning in order to optimize congestion control.
While traditional congestion control mechanisms are mainly based on predefined rules
and heuristics, more dynamic congestion control mechanisms partially or fully based on
machine learning could possess the ability to adapt and learn from varying conditions
and therefore potentially be more efficient.

2.1 Congestion control and packet loss


The Internet provides its users with a service called best effort — meaning that the
network does its best to deliver its users’ data as quickly as possible. But there are
no guarantees. Network performance when doing things like visiting a website or
downloading a file can fluctuate greatly due to different external factors, one of them
being congestion.
Congestion occurs when resource demands exceed the capacity of the network. In the
simplest case, Internet packets are transferred across a link between a sender and a
receiver. This link has a certain capacity which is measured in Megabits per second
(Mbps). If the transmission rate exceeds this capacity, there are excess packets that cannot be transferred across the link. These packets can either be buffered or dropped.
If the packets are buffered, this typically means that they are placed in a queue at a
router, which often works as a basic FIFO (First In, First Out) queue and only drops
packets if the queue is full — the underlying assumption being that a reduction in
throughput would eventually drain the queue. As the queue grows, the network is said
to become congested.

2.1.1 Congestion induced packet loss


If more packets are being sent even though the queue is already full, these packets will
arrive without being stored in the queue. Since they cannot be processed, they will be
dropped. This is known as congestion induced packet loss.
Performance at a node is often measured not only in terms of delay, but also
in terms of the probability of packet loss [44].

2.1.2 The problem with congestion


There are two problems when storing packets in a queue, as discussed in the previous
section:
1. Storing packets in a queue adds delay, depending on the length of the queue.
2. Internet traffic does not strictly follow a Poisson distribution, that is, the
assumption that there are as many upward fluctuations as there are downward
fluctuations may be wrong [94].
Because of the second problem, packet loss may occur no matter the length of the queue. Because of the first problem, queues should generally be kept short — because, as already mentioned, growing queues mean a congested network, which will manifest itself in increasing delay and, at worst, packet loss.
A network is said to be congested from the perspective of a user if the service
quality noticed by the user decreases because of an increase in network load
[94].

2.1.3 Queue size and BDP


As mentioned in the previous sections, if more packets are being sent even though the
queue is already full, these packets will arrive without being stored in the queue and
therefore be dropped, leading to congestion induced packet loss. Queue size is therefore
an important consideration, and there are some general recommendations that have been
proposed related to the sizing of queues. The “rule-of-thumb” approach to sizing queues
states that the queue size should be equal to the Bandwidth Delay Product (BDP) [89],
where the BDP refers to the product of the data link’s capacity — often measured in bits
or Megabits per second — and its round-trip delay time (RTT). The BDP is therefore
a value in bits or bytes, depending on how the calculation is performed and represents
the maximum amount of data that can be in flight at any time.
There has however been research indicating that the “rule-of-thumb” approach to sizing
queues by setting the queue size equal to 1 BDP is not always optimal, with sometimes
larger queue sizes resulting in better performance [74].
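
To make the calculation concrete, the following short Python sketch computes the BDP for a hypothetical link; the 50 Mbps bandwidth and 60 ms RTT are illustrative values only, not parameters taken from the experiments in this thesis.

    def bdp_bytes(bandwidth_mbps: float, rtt_ms: float) -> float:
        """Bandwidth Delay Product: link capacity multiplied by the RTT.

        bandwidth_mbps: link capacity in Megabits per second
        rtt_ms:         round-trip time in milliseconds
        Returns the BDP in bytes.
        """
        bits = bandwidth_mbps * 1e6 * (rtt_ms / 1e3)  # capacity (bit/s) * RTT (s)
        return bits / 8  # convert bits to bytes

    # Example: a 50 Mbps link with a 60 ms RTT can hold
    # 50e6 bit/s * 0.060 s = 3e6 bits = 375,000 bytes in flight.
    print(bdp_bytes(50, 60))  # 375000.0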


2.1.4 Congestion control


Since congestion can occur in networks due to the abovementioned reasons, and
congestion typically leads to decreased service quality noticed by the users, congestion
control mechanisms are in place today to combat this problem. The goal of these
mechanisms is to use the network as efficiently as possible by attaining the highest
possible throughput while maintaining a low loss ratio and small delay. The term
“congestion avoidance” is sometimes used when referring to these mechanisms.

2.1.5 Components involved in congestion control mechanisms


When designing a congestion control mechanism, there are typically three components
that need to be considered:
1. The sender: This is where the traffic originates from, and where the first decisions
are made.
2. Intermediate nodes: Depending on the specific network scenario, each packet
usually traverses a certain number of intermediate nodes such as routers. This is
typically where the queues are found that grow in the presence of congestion.
3. The receiver: This is where the traffic eventually arrives. The ultimate goal of
almost any network communication code is to maximize the satisfaction of a user
at this network node [94].
Traffic can be controlled at the sender and intermediate nodes. This means that some
sort of congestion control scheme could be deployed there.

2.1.6 Slow start and congestion avoidance


TCP has been, and still is, one of the most common transport-layer protocols in use on
the Internet and makes use of various congestion control mechanisms — two of them
being slow start and congestion avoidance.
Slow start and congestion avoidance are two sender side algorithms used by TCP to
control the amount of outstanding data being injected into the network. They make use
of various state variables to accomplish this, which are added on a per-connection basis
[9].

TCP
TCP, explained in full detail in RFC 9293 [21], is one of the most important transport-
layer protocols and has been in use on the Internet for multiple decades. It is a
connection-oriented protocol, meaning that a connection is established between the
sender and receiver before communicating. This is accomplished by the well-known
TCP three-way handshake.
TCP makes it possible to transfer a reliable stream of data from a sender to a receiver
even in the presence of factors like packet losses. One of the ways this reliability is
achieved is by sending a packet, waiting for a signal (ACK) from the receiver, and
retransmitting the packet after a while if the ACK does not arrive — the duration is
specified by a variable called the retransmission timeout (RTO). The RTO is a dynamic
value and is calculated based on an estimate of the RTT [21].
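
The text above only notes that the RTO is derived from an RTT estimate. For concreteness, the Python sketch below shows the standard smoothing rules of RFC 6298: a smoothed RTT (SRTT), an RTT variance estimate (RTTVAR), and RTO = SRTT + 4 · RTTVAR with a one-second floor. Real stacks differ in details, so treat this as an illustration rather than any particular implementation.

    ALPHA, BETA, K = 1 / 8, 1 / 4, 4  # smoothing constants from RFC 6298

    def update_rto(srtt, rttvar, rtt_sample):
        """Update smoothed RTT, RTT variance, and RTO from one RTT sample."""
        if srtt is None:  # the first measurement initializes the estimators
            srtt, rttvar = rtt_sample, rtt_sample / 2
        else:
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - rtt_sample)
            srtt = (1 - ALPHA) * srtt + ALPHA * rtt_sample
        rto = max(1.0, srtt + K * rttvar)  # RFC 6298 recommends a 1 s minimum
        return srtt, rttvar, rto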


TCP congestion control and reliability


TCP tries to follow what is called the conservation of packets principle, which refers
to the idea that a new packet isn’t put into the network before an old packet leaves.
A connection that fulfills this principle is said to be stable. Van Jacobson argues that
systems with this property should be robust in the face of congestion and that there are
only three ways this could fail [39]:
1. The connection does not get to equilibrium.
2. A sender injects a new packet before an old packet has exited.
3. The equilibrium cannot be reached because of resource limits along the path.
The second failure relates to the RTO timer, while the first and third failures are tackled
by the slow start algorithm and congestion avoidance algorithm respectively [94].

Slow start
Slow start is a mechanism in TCP that allows the sender to quickly reach a reasonable sending rate when network conditions are unknown. A variable called the congestion
window (cwnd) is used for this purpose. The value of cwnd limits the amount of data that
the sender can inject into the network before receiving an ACK, and changes dynamically
based on feedback from the receiver. cwnd is initialized to a certain value and increased by one segment for each ACK that arrives. Since cwnd grows by one segment per ACK, slow start exhibits exponential growth: for each ACK that is received, two packets leave the sender, each of which results in an ACK that causes two more packets to leave the sender, and so on, roughly doubling the window every RTT.
In addition to the cwnd variable, another state variable called the slow start threshold
(ssthresh) is used to decide which of the two algorithms slow start and congestion
avoidance is used to control the data transmission.
The slow start algorithm is employed at the beginning of TCP transmissions, but also
after repairing loss detected by the RTO. ssthresh and cwnd are therefore important
state variables in the context of packet loss [94] [9].

Congestion avoidance
If the value of cwnd is larger than the value of ssthresh, the congestion avoidance
algorithm is employed instead of slow start.
In contrast to the slow start algorithm, the congestion avoidance algorithm uses Additive
Increase Multiplicative Decrease (AIMD) when increasing the cwnd state variable.
Instead of incrementing cwnd by one for each ACK, congestion avoidance increases it as
follows:
cwnd = cwnd + MSS * MSS/cwnd
This results in the window being increased by at most one segment per RTT — this
being the Additive Increase part of the algorithm — leading to linear growth of the
congestion window in this phase.
When a congestion signal is detected, typically in the form of a packet loss or increased
delay, congestion avoidance uses Multiplicative Decrease to decrease the value of cwnd multiplicatively, often by halving the current window size. This is done to reduce
congestion in the network and can be seen in Figure 2.1 for TCP Reno [9] [94].
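
The per-ACK window updates described above can be summarized in a few lines of Python. This is a simplified, byte-based sketch following RFC 5681, not code from this thesis; cwnd and ssthresh are in bytes and mss is the maximum segment size.

    def on_ack(cwnd: int, ssthresh: int, mss: int) -> int:
        """Grow cwnd on receipt of a new ACK (simplified)."""
        if cwnd < ssthresh:
            # Slow start: one segment per ACK, i.e. exponential growth per RTT.
            return cwnd + mss
        # Congestion avoidance: cwnd += MSS*MSS/cwnd, which adds roughly
        # one segment per RTT, i.e. additive (linear) increase.
        return cwnd + max(1, mss * mss // cwnd)

    def on_congestion(cwnd: int, mss: int) -> tuple[int, int]:
        """Multiplicative decrease on a congestion signal: halve the window."""
        ssthresh = max(cwnd // 2, 2 * mss)
        return ssthresh, ssthresh  # new (cwnd, ssthresh)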

2.1.7 Fast retransmit and fast recovery


Fast retransmit and fast recovery are two further sender-side algorithms that form part of TCP's congestion control mechanisms. They are designed to allow the connection to recover
more quickly from packet losses.
The idea in fast retransmit and fast recovery is to use duplicate ACKs (DupACKs) as an
indication of packet loss in addition to the RTO [94].

Duplicate ACKs
If a sender transmits packets numbered from 1 to 5, and the receiver only successfully
receives segments 1, 3, 4, and 5, the receiver’s response to the reception of segment
1 will typically be “ACK 2”. This is an acknowledgment to the sender, indicating
that it is now awaiting the receipt of segment 2. Upon the arrival of segments 3, 4,
and 5, the receiver repeats its acknowledgment for the still-missing segment 2. These repeated acknowledgments are interpreted by the sender as duplicate acknowledgments
(DupACKs), suggesting the potential loss of a packet [94].

Fast retransmit
Fast retransmit uses the loss-detection scheme described in the section above to let the sender retransmit the repeatedly requested segment without waiting for the RTO timer to expire; in practice, the arrival of three DupACKs is commonly used as the trigger [94].
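
A minimal Python sketch of this DupACK-based trigger follows, using the conventional threshold of three duplicate ACKs from RFC 5681; it illustrates the idea only and is not how any real stack implements it.

    DUPACK_THRESHOLD = 3  # conventional fast-retransmit trigger (RFC 5681)

    class DupAckDetector:
        """Counts duplicate ACKs and flags when to fast retransmit."""

        def __init__(self) -> None:
            self.last_ack = None
            self.dup_count = 0

        def on_ack(self, ack_no: int) -> bool:
            """Return True when the missing segment should be retransmitted."""
            if ack_no == self.last_ack:
                self.dup_count += 1
                return self.dup_count == DUPACK_THRESHOLD
            self.last_ack = ack_no  # a new cumulative ACK resets the count
            self.dup_count = 0
            return False

    # Segments 1, 3, 4, and 5 arrive; segment 2 is lost, so "ACK 2" repeats:
    detector = DupAckDetector()
    for ack in [2, 2, 2, 2]:  # the original ACK 2 followed by three DupACKs
        if detector.on_ack(ack):
            print("fast retransmit segment", ack)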

Fast recovery
After fast retransmit has done its job and sent what appeared to be the missing segment,
the fast recovery algorithm controls the transmission of new data until a normal ACK
arrives. Since the receiver can only generate a DupACK when a packet has arrived,
receiving a DupACK at the sender does not necessarily mean that a packet loss occurred
— it can also mean that a packet was received out of order. Therefore, switching to slow
start mode is not necessary, and cwnd is directly set to half the current amount of data
in flight [9] [94].

2.1.8 Common congestion control algorithms


Multiple different congestion control algorithms have been proposed and/or utilized in
TCP since its first introduction. One of the earlier ones was TCP Reno [9], while
a more recent and very popular one has been TCP Cubic [77]. Both of these
algorithms implement slow start and congestion avoidance, but the window increase
function in Cubic is a bit different than the linear increase that is seen in Reno.
Congestion control algorithms can be split into three primary groups based on how they
work — the three groups being loss-based algorithms, delay-based algorithms, and
signal-based algorithms [30].


TCP Reno
TCP Reno was one of the earlier congestion control algorithms introduced and
implemented in TCP. It utilizes slow start and congestion avoidance, in addition to
fast retransmit and fast recovery.
In slow start the congestion window increases exponentially: it grows by one segment for each ACK that is received, which roughly doubles it every RTT. In congestion avoidance and fast recovery the congestion window increases linearly, growing by at most one segment per RTT. This can be seen in Figure 2.1.
Slow start is used in Reno at the beginning of a transmission and whenever a loss is
detected through an RTO. Congestion avoidance is used in Reno at the end of the slow
start phase or after a loss is detected via DupACKs [9].

Figure 2.1: The congestion window behavior of TCP Reno

TCP Cubic
TCP Cubic was first implemented in the Linux kernel in 2006 [33], and features many
of the same components as Reno with some key differences.
To achieve better network utilization and stability, Cubic uses both the concave and
convex profiles of a cubic function to adjust the cwnd. This is in contrast to some other
non-Reno congestion control algorithms that only use convex functions. Like Reno,
Cubic responds to congestion events that are detected by DupACKs (fast retransmit
and fast recovery). But unlike Reno, Cubic registers the congestion window size when
that happens and stores it in a state variable called W_max.


Cubic can run in three different modes depending on the value of the current cwnd and
W_max, all of which are listed below:
The TCP-friendly region This mode ensures that Cubic achieves at least the same
throughput as standard TCP and is used in networks where standard TCP
performs well. Such networks include those with short RTTs and low bandwidth.
The concave region This mode is used if Cubic is not in the TCP-friendly region and
cwnd is less than W_max.
The convex region This mode is used if Cubic is not in the TCP-friendly region and
cwnd is greater than W_max.
This pattern of first increasing the cwnd using the concave profile of the cubic function, followed by switching to its convex profile, promotes high network utilization and stability [77].
How the congestion window increases and decreases in response to congestion is shown
in Figure 2.2, where the cubic increase and decrease pattern can be clearly seen.

Figure 2.2: The congestion window behavior of TCP Cubic
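
To make the concave and convex profiles concrete, the following Python sketch implements the cubic window growth function as specified in RFC 8312, W(t) = C · (t − K)^3 + W_max with K = (W_max · (1 − β) / C)^(1/3); the constants C = 0.4 and β = 0.7 are the RFC defaults, and the window is expressed in segments. This is an illustration of the published formula, not code from this thesis.

    C = 0.4     # scaling constant (RFC 8312 default)
    BETA = 0.7  # multiplicative decrease factor (RFC 8312 default)

    def cubic_k(w_max: float) -> float:
        """Time in seconds for the window to grow back to W_max after a loss."""
        return (w_max * (1 - BETA) / C) ** (1 / 3)

    def w_cubic(t: float, w_max: float) -> float:
        """Target window t seconds after the last congestion event.

        For t < K the curve is concave (quickly approaching W_max);
        for t > K it is convex (cautiously probing beyond W_max)."""
        return C * (t - cubic_k(w_max)) ** 3 + w_max

    # After a loss at W_max = 100 segments, the window starts at 70 (0.7 * 100),
    # climbs back toward 100, plateaus, and then probes above it:
    for t in (0.0, 2.0, 4.0, 6.0):
        print(t, round(w_cubic(t, 100.0), 1))  # 70.0, 95.6, 100.0, 102.3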
There have been multiple studies comparing the performance of Reno and Cubic, with
some key findings summarized below:
• Cubic seems to generally outperform Reno in networks with long fat pipes — long
fat pipes referring to networks with high bandwidth and high latency [108].
• In mixed environments where there are both Reno and Cubic flows, the Cubic flows
could be more aggressive and take a larger share of the bandwidth than Reno flows,
but this is not necessarily the case [24].
• The cwnd growth behavior can be smoother in Cubic than in Reno. There are also
scenarios where Cubic can have less abruptly falling cwnd values and more stable
cwnd values over time [40].
One interesting aspect of Cubic is fast convergence. This is a heuristic added to Cubic
to improve convergence speed in cases where a new flow joins the network and existing
flows have to sacrifice some of their bandwidth. Fast convergence is designed for network
environments with multiple Cubic flows and can make it difficult to predict its behavior
[77].

2.1.9 ECN
Explicit Congestion Notification (ECN) is a mechanism that lets routers mark packets instead of dropping them, with the sender then reacting as if the packet had been dropped. ECN was the first
feasible TCP/IP congestion control solution that incorporated explicit feedback — where
the feedback is in the form of header bits — and is able to reduce loss in the presence
of routers that use active queue management [94].

Active queue management


ECN relies on Active Queue Management (AQM): mechanisms that detect congestion before the buffer is full and packets are dropped, and that provide a signal of this congestion to the end nodes involved in the connection. The goal of ECN and other mechanisms that rely on AQM is to keep queues from building up and to prevent packet losses,
because both large queues and packet losses can lead to increased delay for the traffic sharing that queue. In addition, AQM means that protocols like TCP — which rely
on packet drops as a congestion signal — have a way of detecting congestion before the
packets are dropped [28].

RED
Random Early Detection (RED) is a well-known AQM mechanism that aims to reduce
end-to-end delay by keeping the average queue size low while still allowing for occasional
bursts of traffic [94].
RED dynamically calculates the average queue size using a low-pass filter with an exponentially weighted moving average (EWMA), as described by the following formula [27]:

avg_q ← (1 − w_q) · avg_q + w_q · q

where avg_q is the average queue length estimate, q is the instantaneous queue length, and w_q is a weight that controls how fast the EWMA adapts to fluctuations in the queue length.
avg_q is compared to a minimum threshold min_th and a maximum threshold max_th, and packets are marked — which could manifest in dropped packets, a bit in the IP or TCP header being altered, or other viable actions being executed — according to the following scheme [27], sketched in code below:


avg_q < min_th : No packets are marked.

min_th ≤ avg_q < max_th : Each packet is marked with a probability p_a, where p_a is a function of avg_q.

avg_q ≥ max_th : All packets are marked.
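
A compact Python sketch of this logic follows. Between the two thresholds, the marking probability p_a is taken here to grow linearly with avg_q up to a ceiling max_p, which is one common formulation; all parameter values are illustrative, and details such as RED's count-based probability adjustment are omitted.

    import random

    class RedQueue:
        """Minimal sketch of RED's average-queue tracking and marking."""

        def __init__(self, w_q=0.002, min_th=5.0, max_th=15.0, max_p=0.1):
            self.w_q = w_q        # EWMA weight
            self.min_th = min_th  # minimum threshold (in packets)
            self.max_th = max_th  # maximum threshold (in packets)
            self.max_p = max_p    # marking probability at max_th
            self.avg_q = 0.0      # average queue length estimate

        def on_packet(self, q: float) -> bool:
            """Update avg_q from the instantaneous queue length q and
            return True if this packet should be marked (or dropped)."""
            self.avg_q = (1 - self.w_q) * self.avg_q + self.w_q * q
            if self.avg_q < self.min_th:
                return False  # no packets are marked
            if self.avg_q >= self.max_th:
                return True   # all packets are marked
            # Between the thresholds: mark with probability p_a(avg_q).
            p_a = self.max_p * (self.avg_q - self.min_th) / (self.max_th - self.min_th)
            return random.random() < p_a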

Header bits
As previously mentioned, ECN incorporates explicit feedback in the form of header bits.
The ECN field in the IP header is used for this purpose, where two bits are used to
indicate one of four ECN codepoints as follows [28]:
00 Not ECT : Used to indicate that ECN should not be used.
01 ECT(1) : Used to indicate that the end-nodes in a transmission are ECN capable,
and is confirmed in TCP in the pre-negotiation during the connection setup phase.
10 ECT(0) : Same as ECT(1).
11 Congestion Experienced (CE) : Used to mark packets that encounter congestion
with a probability proportional to the average queue size.
In addition to the IP header bits, there are two TCP header bits that are used for ECN
as well; these are [28]:
ECN-Echo (ECE) : Set by the receiver to signal back to the sender: “I saw a CE bit,
so reduce your rate as if the packet had been dropped” [94].
Congestion Window Reduced (CWR) : Used by the sender to signal to the
receiver that it has reduced its cwnd in response to a congestion notification.
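
As a small illustration of these header bits, the Python sketch below decodes the two-bit ECN field, which occupies the two least-significant bits of the IP TOS / traffic class byte; it is for illustration only.

    # The ECN field is the low two bits of the IP TOS / traffic class byte.
    ECN_CODEPOINTS = {
        0b00: "Not-ECT (ECN not used)",
        0b01: "ECT(1) (ECN-capable transport)",
        0b10: "ECT(0) (ECN-capable transport)",
        0b11: "CE (Congestion Experienced)",
    }

    def ecn_of_tos(tos: int) -> str:
        """Extract and name the ECN codepoint from a TOS/traffic-class byte."""
        return ECN_CODEPOINTS[tos & 0b11]

    print(ecn_of_tos(0x02))  # ECT(0)
    print(ecn_of_tos(0x03))  # CE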

The ECN signaling process


The ECN signaling process between a sender and receiver, as shown in Figure 2.3, can
be summarized as follows [28]:
1. A packet with ECT = 1 and CE = 0 is sent by the sender to indicate that it is
ECN capable.
2. When a router between the sender and receiver detects congestion, the packet
header is checked, and since ECT = 1, CE is set to 1 instead of dropping the
packet.
3. The receiver informs the sender about the congestion event by setting ECE = 1 in
subsequent ACKs even if CE = 0.
4. The sender reduces its congestion window followed by setting CWR = 1 so the
receiver can be informed and stop setting ECE = 1 until the network is eventually
congested again. This is done to avoid losing an ACK with ECE = 1.

Figure 2.3: The ECN signaling process between a sender and receiver

Benefits of ECN
There are potential benefits to using ECN instead of solely relying on packet drops as
the only congestion signal [25]:

Improved throughput Implementing ECN can improve the throughput of an application. However, if the number of packets that are dropped due to congestion is small, this benefit could be relatively small as well.
Reduced head-of-line blocking Implementing ECN can reduce head-of-line blocking
issues, and can therefore reduce the delays that occur because of this.
Reduced probability of RTO expiry Implementing ECN can reduce the probability
of loss and therefore also reduce the probability of RTO expiry, which typically can
cause issues as a result of a sudden and significant change in the allowed rate at
which an application can forward packets.
Applications that do not retransmit lost packets Implementing ECN can be useful for applications that do not support retransmitting lost packets, such as latency-critical applications. If ECN is used in such applications, they can adjust their sending rate following detection of congestion. This allows for a reduction in sending rate before experiencing congestion-induced packet losses.
Making incipient congestion visible Implementing ECN exposes the presence of
congestion on a network path to the transport and network layers. This allows
information to be collected about the presence of incipient congestion, which can
be used for purposes such as monitoring the level of congestion by an application
or a network operator.
Opportunities for new transport mechanisms ECN can enable the design and
deployment of new congestion control algorithms. This has allowed for congestion
control algorithms such as Data Center TCP (DCTCP), described below.

Data Center TCP


As the name suggests, Data Center TCP (DCTCP) is a TCP-like protocol designed for
traffic in data centers, where low-cost switches with limited queue capacities are often
employed. Small queues combined with the typical traffic patterns seen in data centers lead to normal TCP not being optimal, causing high latencies and frequent packet
losses.
DCTCP tries to solve the abovementioned problems by leveraging and extending ECN
to estimate the fraction of bytes that encounter congestion and scaling the congestion
window based on this estimate. This allows the various senders in the data center to
react proportionally to congestion instead of simply halving their window, and provides
the data center servers with high burst tolerance, low latency, and high throughput [8].
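
DCTCP's proportional reaction can be written down compactly. The Python sketch below follows the update rules from the DCTCP paper [8]: α is an exponentially weighted estimate of the fraction F of ECN-marked bytes per window of data, and the window is reduced by a factor of α/2 on congestion. The gain g = 1/16 is a typical value from the paper, and the code is a simplified illustration.

    G = 1.0 / 16  # EWMA gain g; 1/16 is a typical value

    def update_alpha(alpha: float, marked_bytes: int, total_bytes: int) -> float:
        """alpha <- (1 - g) * alpha + g * F, where F is the fraction of
        bytes that carried an ECN mark in the last window of data."""
        f = marked_bytes / total_bytes if total_bytes else 0.0
        return (1 - G) * alpha + G * f

    def react_to_congestion(cwnd: float, alpha: float) -> float:
        """cwnd <- cwnd * (1 - alpha / 2): with alpha = 1 (all bytes marked)
        this halves the window like Reno; with small alpha the reduction
        is proportionally gentler."""
        return cwnd * (1 - alpha / 2)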

2.1.10 Non-loss-based congestion control algorithms


Unlike the loss-based algorithms Reno and Cubic, there are also congestion control algorithms that do not rely on loss as a congestion signal but rather on delay or other forms of signaling.
The behavior of these congestion control algorithms can typically be harder to predict, at least with regard to packet losses, since they do not necessarily provoke loss. For this reason, these congestion control algorithms have not been considered for the purpose of packet loss prediction in this thesis.

BBR
BBR is a congestion control algorithm developed by Google that primarily relies on
different signals than packet loss to detect congestion, with the goal being to achieve
higher bandwidths and lower latencies for traffic on the Internet [11].
The paper from Google [11] argues that loss-based congestion control algorithms such
as Cubic are not optimal — causing bufferbloat in the case of large buffers and
misinterpreting packet loss as a congestion signal in the case of small buffers, leading to
low throughput.
BBR works by continually measuring the Bottleneck Bandwidth and the Round-trip propagation time (hence the name BBR) in order to control its sending rate.
How the congestion window varies throughout the connection for BBR can be seen in
Figure 2.4.

Figure 2.4: The congestion window behavior of BBR

LEDBAT
Low Extra Delay Background Transport (LEDBAT) — first technically documented in
2012 [82] — is a congestion control algorithm that relies on delay as a signal of congestion.
It measures the time a packet travels from a given sender to a receiver, not just the round
trip time. This is done by applying a timestamp to every packet transferred from the
sender that the receiver then uses to subtract from its local time and returns the result to
the original sender — the result being the one-way delay from the sender to the receiver.
The sender then uses this information to consider the difference in delay over time [30].
LEDBAT is primarily designed for one-way bulk transfer applications like file-sharing
and file-updates, and is therefore used by multiple operating systems for the purpose of
operating system updates [78].
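
The controller itself can be sketched in a few lines in the spirit of RFC 6817 [82]. The target queuing delay of 100 ms is the ceiling given in the RFC; the gain and the once-per-RTT update below are simplifications for illustration, not LEDBAT as deployed.

    TARGET = 0.100  # target queuing delay in seconds (RFC 6817 ceiling)
    GAIN = 1.0      # controller gain (simplified)

    def ledbat_update(cwnd: float, one_way_delay: float, base_delay: float) -> float:
        """One simplified LEDBAT window update (roughly once per RTT).

        one_way_delay: latest timestamp-based sender-to-receiver delay sample
        base_delay:    smallest one-way delay seen so far (the "empty queue"
                       estimate); their difference approximates queuing delay."""
        queuing_delay = one_way_delay - base_delay
        off_target = (TARGET - queuing_delay) / TARGET
        # Grow while below the target delay, shrink once above it.
        return max(1.0, cwnd + GAIN * off_target)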


2.2 Machine learning


Around the world, computers and other digital components capture and store vast
amounts of data every day. For example, if you log in to a website, the website will potentially store data about you such as your name, your email address, where you live, your birth date, and even who your parents and siblings are. This data can be used for
various purposes like advertising and personalization, making data collection a very
lucrative business with large corporations being built on this idea that personal data is
worth something — some even say more than gold.
Outside of the context of this personal data collection, more anonymous data can also
be useful for various purposes. A banking system can for example make use of data in
order to learn the behaviors of credit card fraud and create a system to detect it.
The technique that makes these things possible is called machine learning, which is a
branch of computational algorithms that are designed to emulate human intelligence by
learning from the surrounding environment — or data [22].
Machine learning, in the general sense, is about making machines — computers in most
cases — modify or adapt their actions over time so that these actions get more accurate
with regards to a certain metric [54]. The machines modify or adapt their actions by
learning from experience, where experience is in the form of data. It can therefore be
said that the main goal of machine learning is to develop learning algorithms that build
machine learning models from data [111].


We can generally classify machine learning algorithms into three different categories
depending on how they work and their application areas; these are [54]:
Supervised learning In supervised learning a dataset of examples with the correct
responses is provided — this is typically referred to as a labeled training set —
and, based on this dataset, the algorithm tries to learn from it and generalize to
respond correctly to all possible inputs, so that the algorithm can be applied to
new, unseen data.
Unsupervised learning Unlike supervised learning, in unsupervised learning there is
no labeled training dataset, meaning that the correct responses are not provided for
the input data that is used to train the model. Unsupervised learning techniques
use the input data to identify similarities between the inputs so that inputs that
are similar are categorized together.
Reinforcement learning Reinforcement learning lies somewhere between supervised
and unsupervised learning. As the name suggests, reinforcement learning
techniques get told whether or not the answer is correct, but do not get told
how to improve it. The algorithm therefore has to explore and try out different
strategies until it works out how to get the answer right.
In the context of this thesis, supervised learning is the most relevant because we are
dealing with a case of binary classification, which is a subset of supervised learning.
Reinforcement learning is partly relevant because it’s being used in many newer
TCP congestion control mechanisms [107] [48] [66] and could — with the correct
implementation and dataset — be applied to the problem discussed in this thesis.
Unsupervised learning is not relevant for the purpose of this thesis since it deals with
problems that are different in nature — more related to clustering and seeing patterns
in data.

2.2.1 The machine learning process


Whenever there is a problem that could potentially be solved by a machine learning
algorithm, there are multiple steps involved in going from said problem to a potential
solution in the form of a machine learning model. These steps can be referred to as the
machine learning process.
The machine learning process typically consists of the following components, depending
on whether data is already available or not [54]:
Data collection In machine learning, the data often has to be collected from one or
multiple sources. The process of collecting the appropriate type and amount of
data from the appropriate data source or data sources is known as data collection.
Data transformation The data collected is not necessarily in the appropriate format
for a given machine learning algorithm. Sometimes the data has to be “massaged”
in various ways, for example by converting the data from one format to another, by
removing noise, by merging the data from multiple sources, amongst other things.
Feature selection For a given machine learning problem, there are probably some
aspects of the problem that are more important to the solution than others.
These various aspects can be found in the data collected in the data collection
and transformation steps and are known as features. Features are different
characteristics that together describe a datapoint in the dataset. Selecting the
appropriate features typically requires some prior knowledge about the problem
and the data.
Algorithm choice The choice of an appropriate machine learning algorithm is crucial
to the outcome. It can be a good idea to choose the appropriate algorithm before
the data is collected if data is not already available, so that the data collection and
transformation and the feature selection steps can be done in accordance with the
selected machine learning algorithm and what it typically expects as inputs.
Parameter and model selection Many machine learning models require certain
parameters to be set manually. These parameters are known as hyperparameters.
When using an optimized machine learning model, these parameters typically have
very good default values, but can be tuned in order to improve model performance
for a specific problem.
Training When given a training dataset, algorithm, and appropriate hyperparameters,
training is the act of using computational resources in order to construct a model
of the data to predict outputs on new, unseen data.
Evaluation A machine learning model typically needs to be evaluated during
development or before deploying it. The process of evaluating a machine learning
model should be done on data it was not trained on.

2.2.2 Data collection and transformation


As already mentioned, data collection is a crucial part of the machine learning process
and involves collecting the appropriate type and amount of data from the appropriate
data source or data sources.
Since the data collected is not necessarily in the appropriate format or there are aspects
of the data that need to be changed in order to work with it, data collection is typically
followed by data transformation where the data is potentially mutated in various ways to
construct a final dataset that is more appropriate for a given machine learning algorithm.

Data collection
If data is not already available, the first step in the machine learning process is typically
data collection. The data collection step can often be merged together with the feature
selection step so that the feature selection step guides the data collection process in
order to only collect the relevant data. If we want to train a machine learning model on
data from birds, where the features should be the height, weight, and sex of the birds, it
makes sense to collect only this data in the data collection step to save time and make
it more feasible.
The amount of data that needs to be collected typically varies from problem to problem
and is a trade-off between computational overhead and model performance. One
approach to ensure that the data collected is valuable before collecting vast amounts
of data is to first assemble a reasonably small dataset with all of the features that are
believed to be useful, and to experiment with that dataset by training and evaluating
a model before choosing the best features and collecting the full, much larger dataset
[54].


Data transformation
The data gathered in the data collection process is sometimes not in a format that is
suitable or optimal for a given machine learning model. For example, many machine
learning models expect numerical features instead of string representations of a given
value — meaning, the number 180 instead of the string 180cm or similar.
Data transformation is therefore the step in the machine learning process that takes raw
data from the data collection step and converts it to a format that the machine can
understand and work with.
Depending on the type of data, the data transformation step can include one or many
of the common techniques listed below:
Data cleaning Data cleaning is the process of detecting and repairing incorrect or
incomplete information in the dataset. This can be done by adding or repairing
missing values, dealing with outliers, or converting a feature to the same format
when the raw data has different formats for the same feature, amongst other things
[14].
Feature extraction Feature extraction typically involves both feature construction
and feature selection, which can be roughly explained as reducing a large amount
of raw information down to a smaller set of more useful variables by constructing
and selecting relevant and informative variables that describe the data in question
which are referred to as features, in addition to potentially adding extra information
to the dataset where none existed previously — the latter being a form of data
augmentation [32].
Data scaling Data scaling can involve scaling or casting the features to a certain range,
for example 0–1, with the goal being to transform them to be on a similar scale so
that each feature has an equal contribution to the model performance [83].
Data aggregation Sometimes, the data used to train a machine learning model may be
sourced from a variety of places, resulting in multiple distinct datasets. Combining
these datasets into a single large dataset is called data aggregation. The opposite
process of splitting a single large dataset into multiple smaller subsets is referred
to as data disaggregation.
Sampling/splitting A common approach in machine learning — especially in
supervised learning techniques — is to separate a large dataset into training,
validation, and optionally a final testing dataset. This is done to ensure that the
data can be trained on the training dataset but evaluated on the validation dataset.
The final testing dataset is used to test the model and check its performance on
data that it has not encountered during the training and evaluation phase. There
are many ways this splitting can be done, and the exact technique can impact the
performance of the model by potentially adding more or less variance [76].
Methods such as the ones mentioned above can in many cases improve the performance
and stability of a model, and data transformation is therefore usually an important step
in most machine learning processes.
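As an illustration, the following sketch combines two of the techniques above, scaling
features to the range 0–1 and splitting a dataset into training, validation, and test subsets
with scikit-learn; the data here is randomly generated purely for the example:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 100 samples with 3 features and binary labels.
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First split off 60% for training, then split the rest evenly into
# validation and test subsets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Fit the scaler on the training data only, then apply it to all subsets,
# so that no information leaks from the validation/test data into training.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)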


2.2.3 Features
Given a dataset, machine learning features can be regarded as the various variables that
describe the elements of the dataset. For example, if the dataset consists of data from
various people belonging to a certain demographic, the features can be variables such as
height, weight, sex, and so on. The elements of a dataset can therefore be described
as feature vectors where each element in the vector contains a feature that describes an
aspect of the element.
Features can either be extracted directly from the original data or constructed by
combining or transforming different aspects of the original data or other features.
In the context of a given machine learning problem, an algorithm used to solve the
problem, and collected data for training, not all features can be considered equal with
regards to their impact on model performance. Feature selection is therefore usually an
important step when dealing with machine learning features for a given problem [54].

Feature construction
Sometimes, machine learning features can be constructed by combining or transforming
different aspects of the other features. To expand on the example from the first paragraph
— namely the features height, weight, and sex — a constructed feature could be BMI,
which is created from the weight and height.
For certain problems, the difference between the current and previous values for a given
feature could be more interesting than the values in isolation. This could be the case if
upwards or downwards trends in the given feature are relevant to the problem.
Constructed features can sometimes have a higher correlation with the target label in
binary classification tasks — meaning that they are more likely to be informative for the
prediction [34].
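The following pandas sketch shows both kinds of constructed features mentioned above:
a combined feature (BMI) and a difference feature capturing the change relative to the
previous value; the data is made up for the example:

import pandas as pd

# Hypothetical dataset with the height and weight features from the example above.
df = pd.DataFrame({"height_m": [1.80, 1.65, 1.72], "weight_kg": [80.0, 60.0, 70.0]})

# Combined feature: BMI, constructed from weight and height.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Difference feature: change relative to the previous row, useful when
# upwards or downwards trends matter more than the values in isolation.
df["weight_diff"] = df["weight_kg"].diff()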

Feature selection
When dealing with data in machine learning, there are usually many features that can
be extracted from said data. Not all of these features are useful however, and some are
more useful than others — useful referring to how informative they are to the solution.
For this reason, feature selection is a critical step when dealing with a given machine
learning problem.
Feature selection consists of identifying the features that are most useful for the problem
and its solution and should be supplied to the machine learning algorithm to construct
a model [54]. This usually requires knowledge of both the problem and the data.
In addition to identifying and selecting the features that are most useful for the machine
learning algorithm, it is important that the features can be collected without significant
expense or time and that they are robust to noise and other inaccuracies in the data
[54].

2.2.4 Supervised learning


In machine learning, supervised learning algorithms are methods that are designed to
predict or classify an outcome of interest [42]. The prediction can be binary — meaning
that either a yes or no is predicted for some given input — or non-binary, so that either
one of more than two options is predicted for a given input. The former is referred to
as binary classification while the latter is referred to as multi-class classification. For
the purpose of this thesis, we are only interested in binary classification because we are
dealing with packet loss, which is a situation that can either happen or not.
As briefly mentioned in Section 2.2.2, when dealing with supervised learning techniques,
a training and test dataset is typically required. The training dataset is used to learn
the behavior of the target function, while the test dataset is used to test the performance
of the model.
There are many different types of supervised learning methods. Some of the most
common and relevant to this thesis are listed below [64] [13]:
Decision trees Decision trees consist of different nodes, starting with a root node.
The root node has no incoming edges, while the rest of the nodes have exactly one
incoming edge. The rest of the nodes are further split into internal nodes, which have
outgoing edges, and leaf nodes, which do not. Each leaf is assigned to one class
that represents the most appropriate target value, so that inputs are classified by
going down the tree starting at the root and ending at a leaf.
Linear regression Linear regression uses regression analysis to specify a relationship
between one or more features and a target label by fitting a straight line to the
data — hence the name, linear regression. Linear regression is typically used to
predict a continuous variable, meaning a numeric variable, and is therefore usually
not suitable for classification tasks.
Logistic regression The goal of logistic regression is to find the relationship between
some given features and the probability of a given outcome — or in the context of
classification, the probability that the datapoint represented by the given features
belongs to a certain class. Rather than fitting a straight line to the data like
linear regression, logistic regression fits an S-shaped curve to the data using a
sigmoid function called the logistic function — hence the name, logistic regression.
Unlike linear regression, logistic regression is well suited for classification problems,
especially binary ones [5].
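For reference, the logistic function mentioned above is the standard sigmoid, which maps
any real-valued input z to a number between 0 and 1:

σ(z) = 1 / (1 + e^(−z))

so that the predicted probability of the positive class for a feature vector x is σ(w · x + b),
where w and b are the weights and intercept learned during training.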

Training and test data


When dealing with supervised learning algorithms, a training dataset is needed in order
to fit the model to the data so that it hopefully can be applied to new, unseen data.
To make sure that the model is correctly fit, a test dataset is usually needed as well.
This dataset is used to evaluate the model during and/or after the process of fitting the
model to the training data. Since the final model should be optimally tested on data
that it has never seen before, a third dataset is sometimes used as well for the purpose
of final testing.
A common approach when dealing with supervised learning algorithms is to split a large
dataset into subsets or gather independent datasets to be used for training, validation,
and optionally final testing. This is done to ensure that the model can be trained on the
training dataset but evaluated on the validation dataset. The reason for not using the
same dataset for both is related to the bias-variance trade-off and the concept of over-
and underfitting. If the model is trained to perfectly fit the training dataset, it will most
likely be overfit and show great results on this dataset when being evaluated on it —
but this does not mean that it will show the same results for data it has not encountered
during training, potentially leading to a poor model that does not generalize well.
When the model has been trained and evaluated one or more times, an optional final
testing dataset can be used to test the model and check its performance on data that it
has not encountered during either training or evaluation.

Data labeling
As mentioned in the previous section, when dealing with supervised learning algorithms,
training data is needed. This training data needs to be labeled, where labeled refers to
the correct output being assigned to each row in the dataset. For example, if the rows
represent emails, the label could be Spam or Not Spam.
The reason that the training data needs to be labeled is that supervised learning
algorithms fit a function to some given training data in order to minimize the error
with respect to the difference between the function outputs and the actual labels. The
labels are therefore used to guide the model in a certain direction so that it can hopefully
detect the desired data pattern and predict this pattern for new, unseen inputs that were
not encountered during training.
Sometimes, the way that the data should be labeled is very straightforward. To expand
on the email example above, it is quite easy to assign a label of spam or not spam when
creating training data containing emails. Or when dealing with weather data, it is easy
to assign a label of rain or no rain. But this is not always the case, and care should
therefore be taken when labeling training data.
For the reasons mentioned above, the way that the data is labeled can have a great
influence on the model performance.

2.2.5 Binary classification


Binary classification is a subset of supervised learning that consists of taking input
vectors consisting of some chosen features and deciding which of exactly two classes they
belong to, based on training from exemplars of each class [54].

Imbalanced datasets
In binary classification, one of the classes is usually referred to as the positive class
and the other the negative class. Depending on the dataset and the problem under
examination, there might be more or fewer samples from the positive class present in the
set. When there are very few or very many — depending on which class is referred
to as the positive one — positive examples in the training set, the data is referred to
as imbalanced. In this case, the class that makes up the large majority of the set is
called the majority class, while the other class is called the minority class. Typically,
the positive class is the minority class, while the negative class is the majority class.
When dealing with imbalanced datasets, there are various techniques that can be applied
to try to combat the problem, with a common technique being to try to rebalance the
dataset artificially by upsampling and/or downsampling, where samples are replicated
from the minority class or ignored from the majority class respectively.
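A minimal sketch of upsampling with scikit-learn's resample utility; the tiny DataFrame
and the column name label are made up for the example:

import pandas as pd
from sklearn.utils import resample

# Hypothetical training data with a heavily imbalanced binary label.
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

# Upsampling: replicate minority samples (with replacement) until the
# classes are balanced; downsampling would instead drop majority samples.
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])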


Performance metrics
When dealing with machine learning problems, one needs to know if a given model is
good or not — where good refers to how well it can solve the problem — and if a model is
getting better or worse throughout the process of training. Various performance metrics
therefore exist to evaluate the performance of machine learning models.
Some of the most common metrics when dealing with binary classification problems are
listed below [16]:
Accuracy Accuracy is defined as the number of correct predictions — positive or
negative — divided by the total number of predictions.
Precision Precision is defined as the number of true positives divided by the sum of
true positives and false positives, where true and false positives refer to inputs that
were classified as positive and actually were positive and inputs that were classified
as positive but were actually negative, respectively. Precision therefore identifies
the proportion of inputs classified as positive that actually were positive.
Recall Recall is defined as the number of true positives divided by the sum of the true
positives and false negatives, where true positives refer to the same as for precision
and false negatives refer to inputs that were classified as negative but were actually
positive. Recall therefore identifies the proportion of actually positive inputs that
were correctly classified as positive.
F1 score The F1 score is the harmonic mean of precision and recall and therefore a
useful metric to evaluate model performance with regards to both precision and
recall.
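For reference, expressed in terms of the counts of true/false positives and negatives (TP,
FP, TN, FN, defined in the next subsection), these metrics are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · (Precision · Recall) / (Precision + Recall)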

Confusion matrices
When dealing with binary classification problems, it is often useful to visualize the
number of True and False Positives (TP, FP) and True and False Negatives (TN, FN).
A TP refers to a sample that was both labeled and classified as True, while a FP refers
to a sample that was classified as True but labeled as False. Similarly, a TN refers to a
sample that was both labeled and classified as False, while a FN refers to a sample that
was classified as False but labeled as True.
A confusion matrix shows the number of TP, FP, TN, and FN in a table with two rows
and columns, as shown in Figure 2.5. It therefore shows how often the model correctly
predicts the positive or negative class and can be useful for understanding the behavior
and performance of a classifier.

Classification threshold
As already briefly mentioned in Section 2.2.4, the output of a logistic regression model
is a probability — more specifically a number between 0 and 1 — which represents the
probability that the input belongs to a certain class. Typically, an output closer to 1
indicates that the input most likely belongs to the positive class, while an output closer
to 0 indicates that the input most likely belongs to the negative class. But what about
the cases where the output is 0.51 or 0.49? There needs to be a way to decide when the
input should be classified as either positive or negative — depending on the probability
given by the model. This is achieved by making use of a classification threshold, typically
with a value of 0.5 [112]. When the threshold is set to 0.5, it typically means that inputs
with an output of below 0.5 are classified as the negative class, while inputs with an
output of above or equal to 0.5 are classified as the positive class.

Figure 2.5: Confusion matrix visualizing the number of True Positives, False Positives, True
Negatives, and False Negatives
When dealing with imbalanced datasets, the default threshold of 0.5 is usually not
optimal for model performance [112] [23]. This is because the probability distribution
for imbalanced data tends to be biased toward the majority class [43], while the minority
class is often the one that is interesting when dealing with imbalanced datasets. Due to
these reasons, the threshold should typically be adjusted when dealing with imbalanced
datasets to improve model performance.
While it is possible to compute the best classification threshold, it is also possible to do
manual testing in order to find a good threshold.
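A small sketch of applying a custom threshold to predicted probabilities instead of the
default 0.5; the probability values are made up, standing in for the output of a classifier's
predict_proba method:

import numpy as np

# Hypothetical predicted probabilities of the positive class, e.g. the
# output of model.predict_proba(X)[:, 1] from a logistic regression model.
probabilities = np.array([0.05, 0.31, 0.49, 0.51, 0.92])

# Lowering the threshold below the default 0.5 classifies more inputs as
# positive, which typically trades precision for recall on imbalanced data.
threshold = 0.3
predictions = (probabilities >= threshold).astype(int)  # -> [0, 1, 1, 1, 1]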

2.2.6 Ensemble learning


A common machine learning technique is to train more than one model on the same
dataset — each of these models producing different results, with some learning certain
things and some other things — and putting them together to form a final model in
the hope that the result of this model is better than any of the results produced by the
individual models. This is called ensemble learning [54].

Boosting
One of the most popular ensemble methods is called boosting. In boosting, many low-
quality models are put together in a useful way to produce a final model that hopefully
gives good results [54].
In boosting algorithms, the process of creating the final ensemble model is typically done
by starting with a base model and then adjusting the distribution of the training samples
according to the results of this model so that incorrectly classified samples receive more
attention by the subsequent base models. This way the second base model is trained
with the adjusted training samples, and the result is used to adjust the training sample
distribution again so that the next base model is trained with the adjusted training
samples, and so on [111].

Gradient boosting
Gradient boosting — a now common boosting technique first introduced in 1999 by
Jerome Friedman [29] — is often used in classification tasks and produces a prediction
model in the form of an ensemble of weak prediction models. The weak prediction models
in gradient boosting can be decision trees [65]. When this is the case, the booster can
be referred to as a gradient-boosted tree.

2.2.7 Bias and variance


When training a machine learning algorithm using some data to create a machine
learning model, some choices have to be made regarding which algorithm to use and
which parameters should be used for that specific model. The higher the degree of
freedom of the algorithm, the more complicated the model can be.
When constructing a supervised learning model, the goal of the chosen supervised
learning algorithm is to best estimate a target function for an output variable Y given
some input data X. The algorithm can assume more or less about the form of this
target function, potentially leading to different results. If the algorithm makes few
assumptions about the form of the target function, this could lead to the final model
not being accurate and not matching the data well. On the other hand, if the algorithm
makes many assumptions about the form of the target function, this could lead to the
final model not being very precise and there being a lot of variation in the results. The
former is referred to as the bias of the model while the latter is referred to as the variance
of the model.
Reducing the bias will typically increase the variance and vice versa. This is related to
a phenomenon known as the bias-variance trade-off [54].

The trade-off between bias and variance


A perfect machine learning model would have both low bias and low variance, because
this would result in very good model performance for both data encountered during
training and new, unseen data. However, the bias-variance trade-off in machine learning
says that decreasing the bias will increase the variance and vice versa. The goal when
training and tuning a model in machine learning is therefore often to balance out bias and
variance, with the most important of them being related to the problem being solved.

Overfitting and generalization


A model is said to overfit if it has learned about the noise and inaccuracies in the data
as well as the actual target function. This can happen if the model is trained for too
long. Such a model will not be able to generalize well to new, unseen data that it has
not encountered during training and is therefore not very useful — depending on the
application [54].
A model is said to underfit if it has not managed to learn the target function and has
rather made many simplifying assumptions about its form. Such a model will behave
consistently on new, unseen data but will still have poor performance because the target
function has not been learned correctly, e.g. because important details in the data have
been missed [15].
Overfitting and generalization of a model is strongly correlated to the bias-variance
trade-off. A model with high variance (and therefore low bias) will typically have a
higher risk of overfitting, while a model with high bias (and therefore low variance) will
typically have a higher risk of underfitting [6].

2.2.8 Reinforcement learning


Reinforcement learning is a class of machine learning algorithms lying at the intersection
between supervised and unsupervised learning. Reinforcement learning algorithms get
feedback in the form of a reward that is produced by a reward function. Unlike in
supervised learning, where the algorithm is taught the correct answer based on a labeled
training set, the reward function in reinforcement learning only evaluates the current
solution but does not suggest how to improve it. The algorithm therefore has to explore
by trying out different strategies to find the best one. Reinforcement learning can be
regarded as a search problem where an algorithm searches over a state space of possible
inputs and outputs in order to maximize some reward [54].
In reinforcement learning, there is an agent and an environment. The agent is the
thing that is learning and the environment is where the agent is learning and what it
is learning about. The environment is always in a certain state. The things that the
agent can do given the current state are known as the actions. Reinforcement learning
algorithms know about the current input (the state), and the possible things it can do
given that input (the actions), and its aim is to maximize the reward produced by the
reward function. How the algorithm chooses the action that should be performed given
the current state is known as the policy [54].

2.2.9 Sequence models


Sequence models are a group of machine learning models that have been designed to
handle problems that involve sequential information. In contrast to models that use
static forms of data, in sequence models the order and relationship between the various
samples matters. One example of sequence data is time-series data, which is a type
of temporal data that is sampled based on a time-based dimension like days, months,
years, and so on. Time-series data is therefore indexed based on a date or time attribute.
Weekly sales, daily stock prices, and yearly temperature changes are all examples of
time-series data.
Sequence models can be applied to time-series data in order to predict future values based
on past values, such as today’s stock price given the stock prices of the last few years.
This is referred to as time-series forecasting. Since the order and relationship between
the various samples matters in sequence data, missing values can have a great impact on
model performance and need to be handled carefully. A different machine learning
model that does not rely on the temporal aspect of the data can sometimes handle such
missing values better.
Some common algorithms for dealing with sequence data are Recurrent Neural Networks
(RNNs) and Long Short Term Memory (LSTM) networks, the latter being a special type
of RNN designed to avoid long-term dependency problems [52].

2.3 Related works


Using machine learning to improve congestion control is not a novel concept, and there
have been multiple studies investigating how machine learning methods can be applied
to either develop new or improve existing congestion control mechanisms. Many of these
studies and approaches described therein have made use of Reinforcement Learning (RL)
in an effort to construct congestion control mechanisms that perform better than the
more traditional mechanisms that are in use today, which are mainly based on predefined
rules and heuristics.
There have been multiple studies on the topic of packet loss or congestion prediction in
networks. However, predicting packet loss (or congestion) in real time, on the order of
milliseconds, is a topic that, to the best of our knowledge, has not been described in a
study or thesis before. The material in this thesis therefore presents the first work on
this topic, and should be regarded as a proof of concept investigating if this is possible
or not.

2.3.1 Machine learning based congestion control


T. Zhang and S. Mao [110] discussed advancements in networks and how novel congestion
control mechanisms can be designed to better support these networks using machine
learning. In their paper, they made a case for Deep Reinforcement Learning (DRL)
as a suitable approach for creating ML based congestion control mechanisms. They
also discussed how ML based approaches can leverage past experience and information
about the network environment instead of simply using a fixed set of rules to potentially
improve performance.
K. Winstein and H. Balakrishnan [95] presented the first work on machine learning based
congestion control. In their study, they described how they developed a program called
“Remy” in order to generate congestion control algorithms that run at the endpoints.
Using Remy, they argued that protocol designers would be able to specify their knowledge
or assumptions about a network and objectives the algorithm should try to achieve —
such as high throughput and low queueing delay — in order to generate a suitable
congestion control algorithm that could achieve those desired objectives. They argued
that there is no single congestion control method that is the best in all situations, and
that computer generated approaches, such as the ones that are produced by Remy,
could be more suitable. However, the algorithms generated by Remy are more suitable
for specific networks with known and defined characteristics than for the Internet in
general, and Remy is therefore not widely used today.
W. Wei, H. Gu, and B. Li [93] examined and compared some of the most prominent
research results on the use of machine learning to design congestion control mechanisms.
In their paper, they pointed to challenges involving designing congestion control
mechanisms that work with a wide variety of network characteristics, where they argued
that a good congestion control mechanism needs to be able to operate effectively over a
large range of BDPs.
Their paper discusses and summarizes various congestion control protocols based on
offline or online learning — where offline learning refers to a protocol that uses a pre-
trained machine learning model, and online learning refers to a protocol that learns in
real time. The paper discusses the following machine learning based congestion control
protocols:
Remy The paper briefly discusses Remy [95], in order to give a first example of
a congestion control mechanism that uses optimization and machine learning
techniques to learn and optimize for the dynamic behavior of the network path
between a source and a destination, rather than using a hand-tuned heuristic such
as BBR. It briefly touches on Remy’s disadvantage related to offline optimization
by mentioning that when the network environment deviates from the input
assumptions and network models made, performance may degrade due to a
mismatch. This is related to the concept of bias and variance in machine learning.
PCC The paper briefly discusses Performance-oriented Congestion Control (PCC) [20],
which, compared to Remy, learns in an online fashion using multiple micro-
experiments. Each of these micro-experiments sends at two different rates, and
evaluates which rate leads to better performance. Using such micro-experiments,
the algorithm can learn in real-time and move in the direction of improved
performance, with the key idea being to learn the relationship between rate control
actions and the performance that is empirically observed — where, similarly to
Remy, performance is defined by a utility function that describes an objective,
such as to achieve high utilization of the bottleneck bandwidth with low loss rates.
PCC Vivace PCC Vivace [19] is an evolution of PCC that mainly differs in its utility
function, which incorporates not only throughput and loss rates as in PCC, but
also RTT gradients. The paper mentions that PCC Vivace is more TCP-friendly,
converges faster, and reacts more swiftly to changes in network conditions.
Indigo Like PCC and PCC Vivace, Indigo [109] is a novel congestion control based
on machine learning. However, unlike PCC and PCC Vivace, Indigo uses offline
learning using a supply of network emulators to generate training data. The paper
argues that it can be difficult to use online learning to train a congestion control
algorithm, because many ML algorithms require long training times (hours to
weeks), while the condition of network paths can evolve in much shorter time scales
(seconds). They therefore touch on the fact that offline training based approaches
have an advantage in that they are using pre-trained models.
The paper discusses the trade-off between offline and online learning, and how —
depending on the network environment — one approach can be better than the other.
They argue that offline training based approaches, such as Remy and Indigo, perform
better than online training based approaches, such as PCC and PCC Vivace, under the
assumption that the training environment does not deviate significantly from the actual
network environment. This highlights an important consideration when designing and
training a machine learning based congestion control algorithm, and is related to the
general concept in machine learning of bias and variance (Section 2.2.7).
The paper also discusses and summarizes various congestion control algorithms based
on Deep Reinforcement Learning (DRL), where they discuss opportunities for a fully-
automated mechanism to train a DRL agent by interacting with a real-world network
environment, in order to avoid hand-tuned heuristics as much as possible. Some of these
are described below.
Aurora Aurora [41] is a rate-based congestion control protocol based on DRL, where
the agent uses changes in the sending rate as its actions, and uses statistics about
latencies, as well as the ratio of packets sent to those acknowledged, as its states. It
therefore uses the aforementioned information about latencies and ratio of packets
vs. acknowledged to either increase or decrease its sending rate. The reward
function of Aurora, which the algorithm uses to know if it is improving or not, is
formulated as a linear function based on throughput, latency, and loss. The paper
discusses performance improvements in Aurora when compared to protocols such
as TCP Cubic, but also drawbacks related to missing results or research for various
types of network links, real-world networking environments, and how Aurora reacts
to competing flows.
R3Net Like Aurora, R3Net [26] is a congestion control protocol based on DRL with a
focus on minimizing packet latencies. R3Net was developed by Microsoft with its
design targeting video streaming and real-time conferencing applications. It uses a
simulator based on trace replays to train the DRL agent using simulated network
links and cross traffic. Since R3Net was designed specifically for low-latency real-
time traffic and not tuned for the general Internet, it has not been evaluated against
existing heuristics that are intended for the general Internet, and its performance
in this case is not clear.
MVFST-RL MVFST-RL [84] is, like Aurora and R3Net, a DRL-based congestion
control protocol. It was developed by Meta and uses a non-blocking DRL agent,
where a sender does not need to wait for the agent to produce an action. This
gives an advantage with respect to the number of bytes transmitted.
Orca Orca [1] is a hybrid congestion control protocol that depends on TCP for fine-
grained control actions. It uses a DRL agent to adjust the size of the TCP
congestion window (cwnd). The paper claims that Orca is able to achieve very good
performance in typical network environments while incurring little computation
overhead. Since Orca uses TCP Cubic under the hood, it is friendly to competing
Cubic flows.
The paper wraps up by discussing some challenges and future directions related to
machine learning based congestion controls. They point to challenges related to the
design of the agent in the various DRL-based congestion control algorithms, discussing
that the agent needs to be well-designed, including its training algorithm, its neural
network model, its state and action spaces, as well as its reward function. They therefore
argue that, to some extent, even more components need to be hand-tuned compared to
traditional heuristics, without intuitive links between the designs and their corresponding
outcomes.
In addition they highlight challenges in realistic deployments of DRL-based congestion
controls related to the feasibility of implementing such protocols efficiently. Since
congestion control protocols reside in the transport layer, and are traditionally part
of the operating system kernel, they point to practical limitations on what can be
implemented in the kernel, where they mention that TCP Cubic, for example, uses
approximation algorithms to avoid floating-point computation in the kernel due to these
limitations. They claim that DRL algorithms could require computational power that
may have to be carried out in user space, where they use Orca as an example, this being
a DRL-based algorithm that implements its agent in user space with TensorFlow, and
communicates with TCP Cubic in the kernel via socket options.
In addition to the machine learning based congestion control protocols discussed above
and in the referenced paper by W. Wei, H. Gu, and B. Li [93], there have been multiple
other such protocols presented. Some of these are briefly mentioned below.
QTCP QTCP [48] is an RL-based congestion control mechanism that uses online
learning to enable senders to gradually learn the optimal congestion control
policy. Unlike traditional heuristics, it does not use hard-coded rules and can
be generalized to a variety of different networking scenarios. The authors claim
higher throughput while maintaining low transmission latency compared to the
traditional rule-based TCP [48].
TCP-Drinc Deep reinforcement learning based congestion control (TCP-Drinc) [107]
is another DRL-based congestion control mechanism which adjusts the cwnd based
on past experience in the form of a set of measured features — such as differences
in cwnd and RTT, minimum RTT, and the inter-arrival time of ACKs — that are
stored in a buffer as historical data.
SmartCC SmartCC [49] is an RL-based multipath congestion control mechanism
designed to deal with the diversities of multiple communication paths in
heterogeneous networks. It adjusts the subflows' cwnd values adaptively to fit
different network scenarios. Compared to some other RL-based congestion control
mechanisms, in SmartCC, the model training and execution steps are decoupled,
and the learning process therefore does not introduce additional overhead in the
form of added delay.

2.3.2 Packet loss prediction


Multiple studies have investigated topics related to packet loss prediction; a short
summary of them is given below.
L. Roychoudhuri and E. S. Al-Shaer [79] proposed a novel framework for predicting
packet loss and congestion in real-time audio streams, based on end-to-end delay
variation and trends, enabling proactive error recovery and congestion avoidance. In
their approach, they designated the minimum delay of a path as the baseline delay,
signifying the delay under no congestion. They then identified the delay at the capacity
saturation point of a path as the loss threshold delay, after which packet loss is more
likely. They then used the increase patterns or trends of the delay as an indication
of congestion causing packet loss. This way, they used delay trends to determine
the likelihood of packet loss. This paper is somewhat related to the topic discussed
in this thesis, however, their framework for predicting packet loss was only based on
delay, not based specifically on TCP traffic, and not based on any sort of machine
learning, but rather employed more traditional heuristics. The way they describe how
the framework could be applied is, however, very interesting and closely mimics what
is being investigated in this thesis — proactive congestion avoidance based on real-time
packet loss prediction.
In their paper titled “A machine learning approach for packet loss prediction in science
flows”, A. Giannakou, D. Dwivedi, and S. Peisert [31] discuss how they developed
a machine learning tool based on Random Forest regression for predicting packet
retransmissions in science flows. Their work discusses predicting the amount of
retransmissions using regression, not packet loss or congestion in real time proactively,
and is therefore not directly dealing with the same problem that this thesis is tackling.
However, the fact that they used a tree-based algorithm to construct a machine learning
model based on various features such as cwnd and rtt is very interesting and closely
related to the work discussed in this thesis.
There have been multiple other papers [4] [55] that are quite similar to the paper
discussed above by A. Giannakou, D. Dwivedi, and S. Peisert, in the sense that the
topic of discussion is either predicting the amount of packet loss or the rate of packet
loss, but not if packet loss will occur at any given point in time or not.
There have been multiple papers published on the topic of congestion prediction [35] [2]
[75] [70]. However, these papers are more related to predicting congestion in the network
as a whole or in specific links in the network, sometimes hours or minutes beforehand.
They are therefore dealing with a larger time-scale than the topic that is discussed in
this thesis — which deals with real-time prediction and resulting actions based on the
outcome of the prediction in the order of milliseconds.
A very recent paper by H. Benadji, L. Zitoune, and V. Vèque [7] discusses the topic of
loss ratio prediction using deep learning, where they describe how they used time series
data and Deep Learning (DL) models to predict the loss ratio in IoT networks. Like
some of the studies discussed above, this is not directly related to the topic discussed in
this thesis, which deals with actual packet loss prediction in real-time, not packet loss
rate prediction. The former is binary and the latter is a number, and is a quite different
metric. However, they also mention future work with the goal of designing a proactive
congestion control solution based on packet loss rate prediction, which would be very
relevant to what is discussed in this thesis.
A bit less recent but still relevant work by W. Na, B. Bae, S. Cho and N. Kim [63]
discusses a deep-learning based TCP (DL-TCP) protocol for a disaster 5G mmWave
network that learns the node’s mobility information and signal strength, and adjusts the
TCP cwnd by predicting when the network is disconnected and reconnected, leading
to better network stability and higher network throughput than existing protocols
such as TCP NewReno, TCP Cubic, and TCP BBR. The approach described in the
paper predicts the duration for which the transmitting signal is blocked in a 5G
mmWave network based on mobile base stations using deep learning and indicators
such as mobility, location, signal-to-noise ratio, and value of the terminal. The
predicted blockage duration is then used to fix cwnd and perform buffering for the
corresponding time to utilize the mmWave capacity if the blockage duration is less than
the retransmission timeout (RTO).
A paper by B.A. Arouche Nunes, K. Veenstra, W. Ballenthin, S. Lukin, and K. Obraczka
[3] represents one of the earlier works that used machine learning algorithms to predict
network performance. It presents a novel approach to RTT estimation using machine
learning. Their findings indicated that their machine learning based RTT estimation was
more accurate than the Exponentially Weighted Moving Average (EWMA) estimation
that is used in some TCP congestion control mechanisms. They ran experiments showing
a reduction in the number of retransmitted packets and an increase in goodput compared
to the traditional approach using an EWMA. This can be attributed to the fact that TCP
uses RTT estimates to compute its RTO timer value. More accurate RTT estimations
can therefore result in more accurate RTO values and fewer false retransmissions due to
RTO expirations in cases where the RTO should have been higher. The experiments
comparing retransmissions and goodput are closely related to the experiments that were
run and discussed in this thesis, at least with regards to the relevant metrics considered.
Recent work by L. Diez, A. Fernández, M. Khan, Y. Zaki, and R. Agüero [18] studied
the application of machine learning techniques to predict the congestion status of 5G
mmWave networks. They identified transport-layer metrics relevant to the congestion
state, such as delay and inter-arrival time, and studied their correlation with the
perceived congestion. They did this by generating transport-layer traces and analyzing
the information provided therein to derive meaningful congestion metrics. They point to
a clear correlation between metrics such as the moving standard deviation of the delay
and congestion. However, they point to a weak correlation between the moving average
of the delay and congestion. These findings could be informative for the machine learning
feature selection and construction process discussed in this thesis, described in detail in
Section 3.4. However, since the traces were generated by 5G mmWave connections, they
are not necessarily informative for this thesis, which deals with wired connections.

Chapter 3

Machine learning model design and evaluation

To do anything interesting with machine learning — especially in the case of supervised
learning problems (Section 2.2.4) — data is often needed. In order to collect data, one
needs to know from where and how to collect it. The former was the first aspect tackled
in this work, followed by the latter. In addition, the data is not necessarily collected in
a format that is suitable for the chosen machine learning algorithm. Data collection and
transformation is therefore usually a large and time-consuming step when dealing with
a problem of this nature (Section 2.2.2).
This thesis tackles a prediction problem. More specifically, it tackles a binary
classification problem (Section 2.2.5), because packets should be classified as either
Lost (1) or Not Lost (0). Further, since the topic of discussion is how to improve
TCP congestion control (Section 2.1), data was collected from a single TCP connection
between a sender and receiver in a virtual network configured using Mininet. Exactly
how the network was configured is further described in Section 3.3. The Linux network
utility ss was used to collect the data. How ss works and some of the data it collects
is discussed in Section 3.2. Various bash and Python scripts were used to automate the
data collection and data transformation process as much as possible. This is further
described in Section 3.3 and Section 3.4.
The work relating to the machine learning model was organized into three main
phases, with both the machine learning model and data collection and transformation
procedures becoming increasingly complex throughout the phases — the final model
being trained on data captured from measurements using various congestion control
algorithms, combinations of connection parameters like bandwidth, delay, and queue
size, and presence or absence of background traffic.
Phase one An initial machine learning model trained on data from a single flow
between a single sender and receiver, where the measurements were taken with
only one combination of bandwidth, delay, and queue size.
Phase two An iteration of the initial model trained on data from a single flow between
a single sender and receiver, but with merged data created from many different
measurements with various combinations of connection parameters.
Phase three A final model trained on data from a single flow between a single sender
and receiver, but with merged data created from many different measurements
with various combinations of connection parameters and the presence or absence
of background traffic.
The phase three models should be given the most attention, seeing as they are to be
considered the “general” models trained on the largest dataset and should optimally be
able to predict packet loss in many different cases — where cases refers to connections
with different configurations with respect to connection parameters like bandwidth,
delay, queue size, presence or absence of background traffic, and so on. They are therefore
the models that are most discussed here and the only models that were exported and
later tested, which is the topic of Chapter 4.

3.1 Model choice


The choice of machine learning algorithm to create a suitable model was mainly based on
performance considerations, such as training time and required computational resources.
Since this thesis tackles a binary classification problem, the choice of algorithm was
constrained to supervised learning algorithms. A sequence model (Section 2.2.9), such
as a Recurrent Neural Network (RNN) or, more specifically, a Long Short Term Memory
(LSTM) network, was considered as a suitable choice. A TCP connection can be
interpreted as a time-series of TCP states, where the state of the connection at any given
point, in the form of state variables such as cwnd and rtt, reflects the probability of
congestion.
Using a sequence model would potentially give the benefit of learning from history —
for example, how state variables such as cwnd and rtt evolve over the course of a TCP
connection. However, considering the time constraints of the thesis and our aim of
constructing a proof of concept for real-time packet-loss prediction, we chose not to use
an LSTM. This decision was due to LSTMs being slow to train and often requiring
substantial computational resources, particularly with large datasets. In addition, as
described in Section 3.3, there were some cases where the data had missing values and
was therefore not used to create a sample. As discussed in Section 2.2.9, since the order
and relationship between the various samples matters in sequence models, such missing
values can have a great impact on model performance and need to be handled carefully.
Simpler models that use static data can handle this better. We therefore decided to use
a simpler tree-based model instead, opting for XGBoost [103], which research has shown
to achieve performance comparable to LSTMs for prediction problems [12].

3.2 Tools
Multiple tools and programming libraries were used when creating the machine learning
models, with the two main categorizations being network tools and machine learning
libraries.

3.2.1 Network tools


Various network tools were investigated and/or used when dealing with the problem
described in this thesis. The network tools were mainly relevant in the data collection
process and during testing, where they were used to either configure the experiments
or collect data. The two crucial ones were Mininet and ss. In addition to these, iPerf
and TShark were also important and used for generating traffic and capturing packet
information, respectively.

Mininet
Mininet is a utility for creating a virtual network on a local laptop or other personal
computer [56]. Mininet creates a virtual network, running real kernel, switch, and
application code, on a single machine.
Mininet ships with the Mininet CLI [58], where one can interact with the network,
including, but not limited to, the nodes and switches in it. It is therefore possible to
configure the network to suit one’s personal needs and preferences. It is also possible
to run commands from the different nodes in the network after it has been started.
Assuming that the network consists of two hosts: h1 and h2, it is possible to ping h2
from h1 using the following command:
h1 ping h2
In addition to the CLI, Mininet ships with a Python API [57] that makes it possible to
configure the network and run commands from the different nodes in it using a custom
Python script.
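As a minimal illustration of the Python API (not the configuration script used in this
thesis), the following builds a two-host topology and runs a command from one of the
hosts:

from mininet.net import Mininet
from mininet.topo import Topo

class SimpleTopo(Topo):
    def build(self):
        h1 = self.addHost("h1")
        h2 = self.addHost("h2")
        self.addLink(h1, h2)

net = Mininet(topo=SimpleTopo())
net.start()
h1, h2 = net.get("h1"), net.get("h2")
print(h1.cmd("ping -c 3", h2.IP()))  # run a command from inside h1
net.stop()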

iPerf
iPerf is a tool for active measurements of the maximum available bandwidth of IP
networks. It supports various protocols, including TCP, and can be used to generate
traffic from one host to another [37].
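For example, a simple TCP measurement between two hosts could look as follows; the
address and duration are illustrative:

iperf -s                 # on the receiver: start an iPerf server
iperf -c 10.0.0.2 -t 60  # on the sender: send TCP traffic to the server for 60 seconds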

ss
ss is a Linux network utility for dumping socket statistics [85]. ss can be used to display
internal TCP state information like cwnd, ssthresh, and rtt. It is possible to supply
ss with filter options such as source and destination IP address and port. This makes
it possible to, for example, only capture TCP information from outgoing traffic from a
specific IP address and port pair.
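As an illustration, an invocation of the following form (the address is made up) dumps
TCP socket information for traffic towards a given destination, where -t selects TCP
sockets, -i includes internal TCP state such as cwnd and rtt, and -n disables name
resolution:

ss -tin dst 10.0.0.2

Running such a command repeatedly at a fixed interval yields a series of TCP state
samples over the lifetime of a connection.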

TShark
TShark is a network protocol analyzer for capturing packet data. TShark can also read
packets from a capture file. TShark uses the pcap library to capture traffic from a given
network interface [88].

3.2.2 Machine learning libraries


Like the network tools, various machine learning libraries were investigated and/or used
when dealing with the problem described in this thesis. Unlike the network tools, the
machine learning libraries were only relevant in the model creation process, where they
were mainly used for importing classifier algorithms in order to train the models.
It was concluded that gradient boosting algorithms (Section 2.2.6) were most likely
the best fit for the problem tackled in this thesis, due to research indicating better
performance than neural networks on tabular data [10] as long as the datasets are not
very large:


Deep neural network-based methods for heterogeneous tabular data are still
inferior to machine learning methods based on decision tree ensembles for
small- and medium-sized datasets (less than ∼1M samples) [10].
Due to the reasons described above, only two gradient boosting libraries were applied
to create various machine learning models which were then evaluated based on various
performance metrics (Section 2.2.5).

XGBoost
XGBoost, or eXtreme Gradient Boosting, is an optimized gradient-boosted decision tree
machine learning library. XGBoost provides parallel tree boosting and can be applied
to many different problems, but especially regression and classification problems [103].

LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms
[51]. Compared to other gradient boosting frameworks such as XGBoost, LightGBM offers
faster training speeds [17] and higher efficiency while producing similar [50] or better
[53] results. This makes it well suited for research applications such as the one tackled
in this thesis, where training speed and efficiency are often of great importance.

3.3 Data collection


Multiple options and approaches were initially explored when tackling the problem of
data collection. The first approach involved the use of both TShark and ss in order
to capture both packet data and socket statistics for each of the packets being passed
from the sender to the receiver, while the second and final approach involved the use of
only ss to generate a series of TCP data by capturing the TCP state information at the
sender as it sent packets to the receiver.

3.3.1 Network configuration


The virtual network was configured using Mininet and consisted of three nodes: a sender
host h1, a receiver host h2, and a router r, connected using the following basic topology:
h1-r-h2
This was accomplished using a custom Python script that was supplied to the Mininet
program. Said script also supported configuring different variations of bandwidth, delay,
queue size, number of background flows, as well as selecting the congestion control
algorithm used.
The three nodes were configured using custom Python classes inheriting from the Mininet
Node class [59]. The network configuration settings mentioned above were passed to the
different nodes in the network as a Python dictionary called params, from which they
could be extracted and initialized as instance variables in the relevant class. The hosts
could then be configured using these instance variables by overriding the config method
[61] of the Node class in Mininet. Calling the cmd method of the Node class [60] inside
the config method made it possible to automate the entire network configuration step,
by calling various commands such as shown in Listing 3.1 for the different nodes.


def config(self, **params):
    super(LinuxRouter, self).config(**params)
    self.cmd("sysctl net.ipv4.ip_forward=1")
    self.cmd("sudo tc qdisc del dev r-h2 root")
    self.cmd(
        f"sudo tc qdisc add dev r-h2 root handle 2: netem delay {self.delay}ms"
    )
    self.cmd("sudo tc qdisc add dev r-h2 parent 2: handle 3: htb default 10")
    self.cmd(
        f"sudo tc class add dev r-h2 parent 3: classid 10 htb rate {self.bandwidth}Mbit"
    )
    self.cmd(
        f"sudo tc qdisc add dev r-h2 parent 3:10 handle 11: bfifo limit {self.queue_size}"
    )
    self.cmd(f"sudo sysctl -w net.ipv4.tcp_congestion_control={self.cc_algorithm}")
    self.cmd("sudo sysctl -w net.ipv4.tcp_window_scaling=1")
    self.cmd("sudo ethtool -K r-h1 tso off")
    self.cmd("sudo ethtool -K r-h2 tso off")

Listing 3.1: Configuring the router node in the virtual network

The commands above were run from the router node when starting the Mininet virtual
network, configuring the router with specified values for delay, bandwidth, queue size,
and congestion control algorithm. The delay was only configured on the egress interface,
meaning the interface between r and h2. There was therefore no added delay on the
ingress interface or the interface between r and h1, giving a resulting RTT roughly
equal to the delay. Similar commands were applied to the sender and receiver nodes
to configure the congestion control algorithm used and enable ECN. The reason for
enabling ECN (setting ECT to 1) at the nodes was to make it possible to toggle the CE
bit dynamically at the router based on model predictions. This is further described in
Chapter 4.
To support collecting data with background traffic, and because iPerf3 was used to create
the flows needed to generate traffic for data collection, the receiver node was configured
to automatically start the required number of iPerf3 servers and configure them with
port numbers starting from 5201, so that the iPerf3 clients on the sender side could later
connect to them. This configuration is shown in Listing 3.2.
def config(self, **params):
    super(Receiver, self).config(**params)

    # Start the required number of iperf3 servers.
    base_port = 5201
    for server in range(self.background_flows + 1):
        self.cmd(f"iperf3 -s -p {base_port + server} -D")

    self.cmd(f"sudo sysctl -w net.ipv4.tcp_congestion_control={self.cc_algorithm}")
    self.cmd("sudo sysctl -w net.ipv4.tcp_window_scaling=1")

Listing 3.2: Configuring the receiver node in the virtual network


In addition, a special scenario command line argument was supported by the network
configuration script that configured all background flows to use either Cubic only, Reno
only, or half Reno and half Cubic. When the data collection procedure was configured
to run with 6 background flows and the scenario was half, 3 of the background flows
would use Reno and the other 3 would use Cubic.
Python was used to run the custom script which started and configured the network as
described above, and started the Mininet CLI.
The machine used to run the virtual network was a personal computer with an Intel®
Core™ i5-6500 CPU running Ubuntu.
The machine used to run the machine learning related code was a personal MacBook
Pro with an M1 Max CPU running macOS.
Table 3.1 shows the programs and their versions that were used to configure the network
and collect data.

Program Version
Mininet 2.3.0
Python 3.10.12
Ubuntu 22.04.1
iPerf 3.9
ss iproute2-5.15.0

Table 3.1: Programs used for network configuration and data collection and their versions

3.3.2 Initial approach


Initially, the idea was that both the packet information for the individual packets and
TCP state information could be useful. The packet information was captured using the
TShark network protocol analyzer, while the TCP information was captured using the
ss utility. The final dataset in this case was a collection of rows, each row representing a
single packet with columns representing the various machine learning features (Section
2.2.3).
One big problem with this approach was that it would be necessary to match the
data captured for the individual packets using TShark with the TCP state information
captured using ss — since the rows in the final dataset should represent the individual
packets. While TShark could easily be started in capture mode and write packet data to
a file as packets were sent, this was not the case for ss — which seemingly had to be run
manually each time TCP information should be captured. A quite primitive solution
to this was to only send packets at specific intervals using hping3 [36] while running ss
at the same interval. This way, TCP information would be captured directly after the
packet was sent for each of the packets.
A Python script for merging the two resulting output files was used to produce the final
dataset. However, inducing congestion with this approach proved to be difficult. In
addition, it proved to be a challenge to sync the packet capturing with the ss polling as
the data capture procedure grew more complex.
With no way of inducing congestion — and the approach being very error prone due to
the abovementioned syncing issues — it was hard or impossible to produce congestion-induced
packet loss. This meant that labeled training data was impossible to produce and
this approach was abandoned.

3.3.3 Final approach


Only TCP state information captured by ss was considered for the final approach. The
data was collected from iPerf tests between a sender h1 and receiver h2 in a virtual
network created using Mininet. More details about the network configuration can be
found in Section 3.3.1.
In these tests, iPerf was used to generate traffic from h1 to h2, being configured to
run for a given amount of time. The entire data collection process from starting
and configuring the virtual network, running the tests for a specified amount of time,
continuously collecting data using ss, and producing the output in the form of a txt
file was automated by a bash script that relied on a series of bash functions. The script
for starting the entire process supported specifying the duration to run the tests for,
the congestion control algorithm to use, the number of flows to configure iPerf with,
the delay to configure the router r with, the bandwidth to configure the router with,
the queue size to configure the router with, and the presence or absence of background
traffic. Both the bash script used for data collection and the bash functions it relied on
evolved throughout the different phases, becoming increasingly advanced when moving
to capturing data with background traffic.
ss was run at regular intervals throughout the iPerf tests, with the interval bounded by
the execution time of the ss utility itself. Multiple timings of the ss utility indicated
that it required at least 10ms to run, and the minimum configured delay was therefore
set to 30ms when capturing the data from the different TCP connections, to ensure that
at least one ss measurement was captured for each RTT.
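The actual polling was implemented with bash functions (Listing 3.5 shows the ss invocation), but the logic amounts to a loop along the following lines. This is a sketch: the duration handling and file layout are assumptions made for illustration.

# A sketch of the ss polling loop (the real implementation was a bash script).
import subprocess
import time

def poll_ss(duration_s: float, out_path: str) -> None:
    """Append one ss measurement per iteration until duration_s has elapsed."""
    end = time.time() + duration_s
    with open(out_path, "a") as out:
        while time.time() < end:
            subprocess.run(
                ["ss", "-i", "-o", "src", "10.1.1.100:5001", "dst", "10.2.2.100:5201"],
                stdout=out,
            )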
The output from ss was saved to a txt file and looked mostly like the output in Listing 3.3.


Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 0 78252 10.1.1.100:47680 10.2.2.100:5001 timer:(on,204ms,0)
     ts sack cubic wscale:9,9 rto:304 rtt:100.156/50.078 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_sent:13092 bytes_acked:1 segs_out:12 segs_in:1 data_segs_out:10 send 1.16Mbps lastsnd:8 lastrcv:12 lastack:12 pacing_rate 2.31Mbps delivered:1 busy:8ms unacked:10 rcv_space:14480 rcv_ssthresh:42242 notsent:65160 minrtt:100.156
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 0 78252 10.1.1.100:47680 10.2.2.100:5001 timer:(on,180ms,0)
     ts sack cubic wscale:9,9 rto:304 rtt:100.156/50.078 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_sent:13092 bytes_acked:1 segs_out:12 segs_in:1 data_segs_out:10 send 1.16Mbps lastsnd:32 lastrcv:36 lastack:36 pacing_rate 2.31Mbps delivered:1 busy:32ms unacked:10 rcv_space:14480 rcv_ssthresh:42242 notsent:65160 minrtt:100.156
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 28 110048 10.1.1.100:47680 10.2.2.100:5001 timer:(on,152ms,0)
     ts sack cubic wscale:9,9 rto:280 rtt:76.004/39.053 ato:40 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:19 bytes_sent:37708 bytes_acked:11645 bytes_received:28 segs_out:30 segs_in:7 data_segs_out:27 data_segs_in:1 send 2.9Mbps lastsnd:12 lastrcv:12 lastack:12 pacing_rate 5.79Mbps delivery_rate 2.03Mbps delivered:10 app_limited busy:60ms unacked:18 rcv_space:14480 rcv_ssthresh:42242 notsent:83984 minrtt:50.012

Listing 3.3: Example output from ss

For the ss output in Listing 3.3, every group of three lines represented a single
measurement, but only the second and third lines — the lines starting with tcp and
ts sack in the output above — were of interest and contained the data that should be
present in the final dataset. However, the measurements did not always follow this
pattern with regards to the fields being recorded: sometimes fields were missing, other
times additional fields were present. A large and time-consuming part of the thesis was
therefore dedicated to data transformation, with the idea being that the final dataset
should consist of a series of rows, each row representing an individual ss measurement,
while the columns represented either raw or transformed data from the relevant
measurement in the ss output.
All the iPerf tests for the various connections ran for 300 seconds (5 minutes) in order
to collect a sufficient amount of data for each scenario with regards to the specific
combination of connection parameters.
In addition to capturing TCP state information with ss for training data, TShark was
run to capture packet information and save the result to a pcap file. This pcap file was
not used for training data purposes, but was relevant later when analyzing results and
comparing connections with and without model inference enabled, which is the topic of
Chapter 4. TShark was configured to capture all packets sent and received on the h1-r
interface. This was done because h1 always acted as the sender in the tests — sender in
this case referring to this being the node that hosted the iPerf clients — while h2 acted
as the receiver and hosted the iPerf servers. How TShark was configured is shown in
Listing 3.4 below.


# Captures traffic from the first Mininet host using tshark and writes it
# to the specified output file.
# Parameters:
#   $1: Output file path where the traffic data will be saved.
capture_traffic_tshark() {
    local output_file=$1

    sudo tshark -i h1-r -f "tcp" -w "$output_file"
}

Listing 3.4: How TShark was configured

As briefly mentioned in the beginning of this chapter, the phase three data consisted of
training data captured from a single flow between a single sender and receiver, merged
from many different measurements with different combinations of connection parameters
and with or without background traffic.
Single flow in this case refers to the fact that the ss data was always captured from
one specific flow, even though there were multiple flows between the sender and receiver
when adding background traffic. The foreground flow was separated from the background
flows by simply specifying the IP address and port pair of the source and destination
when filtering the traffic with the ss command, as shown in Listing 3.5. This way, ss was
always configured to capture traffic sent from the first iPerf client started at the sender
and received at the first iPerf server started at the receiver.
# Use ss to capture socket statistics from the first Mininet host and
# append to given file.
# Parameters:
#   $1: File to write socket statistics to.
capture_ss() {
    local file="$1"

    ss -i -o src 10.1.1.100:5001 dst 10.2.2.100:5201 >> "$file"
}

Listing 3.5: How ss was configured

Phase one
As mentioned in the previous section, all the iPerf tests for the various connections ran
for 300 seconds (5 minutes). This was also the case for the phase one data collection,
where the data collection procedure in the form of the relevant script ran for 5 minutes
and was configured with the following values for bandwidth, delay, and queue size:
• bandwidth: 50Mbit
• delay: 70ms
• queue size: 437500 bytes (1 BDP)
The test was configured as a single flow between a single sender and receiver without
any background traffic.
The phase one data collection resulted in a single txt file for both Reno and Cubic that
contained the various ss outputs.


Phase two
Compared to phase one, the phase two data collection was much more intricate and
resulted in significantly more data. Like phase one, all the iPerf tests for the various
connections ran for 300 seconds (5 minutes), but instead of running just one test with
one specific combination of bandwidth, delay, and queue size, 75 different tests were run,
all with different combinations of the mentioned connection parameters.
The possible values for bandwidth, delay, and queue size — where queue size is
represented as multipliers of BDP in bytes — are shown below:
• bandwidths: 10Mbit, 20Mbit, 30Mbit, 40Mbit, 50Mbit
• delays: 30ms, 40ms, 50ms, 60ms, 70ms
• queue sizes: 0.25 BDP, 0.5 BDP, 1 BDP
Each test represented a specific permutation of the abovementioned options, so that
each test was configured with a specific value for delay, bandwidth, and queue size in
the form of a BDP multiplier value. Given a delay of 30ms, a bandwidth of 10Mbit, and
a queue size of 1 BDP, that specific permutation represented a test where the delay was
configured to 30ms, the bandwidth was 10Mbit, and the queue size was equal to 1 BDP
where the BDP was always calculated from the configured delay and bandwidth for that
specific test. If the queue size was configured to 0.25 BDP, it would represent 1/4 of the
calculated BDP based on the configured delay and bandwidth.
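As a concrete check of this arithmetic, the 437500-byte (1 BDP) queue used in phase one follows directly from the configured bandwidth and delay:

# BDP (bytes) = bandwidth (bits/s) / 8 * RTT (s).
bandwidth = 50_000_000   # 50 Mbit/s
rtt = 0.07               # 70 ms; the RTT is roughly equal to the configured delay
bdp = bandwidth / 8 * rtt
print(bdp)               # 437500.0 bytes, i.e. the 1 BDP queue size used in phase one
print(bdp * 0.25)        # 109375.0 bytes for a 0.25 BDP queue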
Like phase one, the tests were configured as single flows between a single sender and
receiver without any background traffic.
The phase two data collection resulted in 75 distinct txt files for both Reno and Cubic.
Each file contained the various ss outputs from a test with a specific combination of
connection parameters.

Phase three
Compared to phase two, the phase three data collection was even more intricate and
also resulted in significantly more data. Like phase one and two, all the iPerf tests for
the various connections ran for 300 seconds (5 minutes), but now background traffic was
added.
Explained in more detail in Section 3.3.1, background traffic was added by configuring
the receiver to automatically start the desired amount of iPerf servers, and configuring
the sender to start the same amount of clients in order to connect to the servers.
In all cases, in addition to the single foreground flow, 6 background flows were added to
the connection. These 6 background flows were configured to use either Reno or Cubic
or half/half — meaning that 3 of them used Reno and the other 3 Cubic.
The same values for bandwidth, delay, and queue size — where queue size is represented
as multipliers of BDP in bytes — as in phase two were used, but in addition, three
different scenarios for the background traffic flows were considered:
• bandwidths: 10Mbit, 20Mbit, 30Mbit, 40Mbit, 50Mbit
• delays: 30ms, 40ms, 50ms, 60ms, 70ms
• queue sizes: 0.25 BDP, 0.5 BDP, 1 BDP


• scenarios: Reno, Cubic, Half


This resulted in 225 different combinations, in addition to the 75 different combinations
that were already produced by the phase two data collection.
The phase three data collection resulted in 225 new distinct txt files for both Reno and
Cubic. Each file contained the various ss outputs from a test with a specific combination
of connection parameters and background traffic scenario.
The final dataset for phase three consisted of the 75 txt files produced by the phase two
data collection in addition to the 225 new txt files produced by the phase three data
collection, resulting in 300 distinct txt files.
Since both the phase two and phase three data collection resulted in multiple txt files,
the final datasets can be seen as aggregated data because the data came from multiple
sources — the sources being the different txt files. That being said, the output from
the data transformation process, described in detail in the next section, was always a
single csv file that contained all the ss measurements from the various txt files.

3.4 Data transformation


As mentioned in the previous section, data transformation was a large and time
consuming part of this thesis. This is because there was quite a large discrepancy
between the output from the data collection process and the final dataset that was used
for training and tuning the machine learning model — with the final dataset consisting
of a pandas.DataFrame object [67], while the output produced by the data collection
process were simple txt files that looked like the output in Listing 3.3.
The first step in the data transformation process was therefore dedicated to parsing
these txt files in order to extract and/or construct the relevant features that should
be included for each row, combining the parsed data if there were multiple files, and
converting the result to a spreadsheet in the form of a csv file that could easily be read
into a pandas.DataFrame object.

3.4.1 Parsing the output from the data collection process


Python was an obvious choice for parsing the output from the data collection process,
as it offers good built-in support for working with input files [72]. A Python script
was therefore created for this purpose, which first read all the relevant txt files into
memory by creating a Python dictionary [73] where the keys represented the path to the
file containing the outputs and the values represented lists containing a string for each
measurement with the relevant data for that measurement.
This step removed the lines that only contained labels without data, so that the output
for each measurement looked more like the output in Listing 3.6.


tcp ESTAB 0 78252 10.1.1.100:47680 10.2.2.100:5001 timer:(on,204ms,0)
     ts sack cubic wscale:9,9 rto:304 rtt:100.156/50.078 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_sent:13092 bytes_acked:1 segs_out:12 segs_in:1 data_segs_out:10 send 1.16Mbps lastsnd:8 lastrcv:12 lastack:12 pacing_rate 2.31Mbps delivered:1 busy:8ms unacked:10 rcv_space:14480 rcv_ssthresh:42242 notsent:65160 minrtt:100.156
tcp ESTAB 0 78252 10.1.1.100:47680 10.2.2.100:5001 timer:(on,180ms,0)
     ts sack cubic wscale:9,9 rto:304 rtt:100.156/50.078 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_sent:13092 bytes_acked:1 segs_out:12 segs_in:1 data_segs_out:10 send 1.16Mbps lastsnd:32 lastrcv:36 lastack:36 pacing_rate 2.31Mbps delivered:1 busy:32ms unacked:10 rcv_space:14480 rcv_ssthresh:42242 notsent:65160 minrtt:100.156
tcp ESTAB 28 110048 10.1.1.100:47680 10.2.2.100:5001 timer:(on,152ms,0)
     ts sack cubic wscale:9,9 rto:280 rtt:76.004/39.053 ato:40 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:19 bytes_sent:37708 bytes_acked:11645 bytes_received:28 segs_out:30 segs_in:7 data_segs_out:27 data_segs_in:1 send 2.9Mbps lastsnd:12 lastrcv:12 lastack:12 pacing_rate 5.79Mbps delivery_rate 2.03Mbps delivered:10 app_limited busy:60ms unacked:18 rcv_space:14480 rcv_ssthresh:42242 notsent:83984 minrtt:50.012

Listing 3.6: Example output from ss after removing irrelevant lines

The result of parsing the output from the data collection process was therefore a Python
dictionary where the keys represented the paths to the files containing the outputs, and
the values were lists in which each element was a string containing the second and third
output lines of a specific ss measurement for that connection. In other words, the first
two lines in Listing 3.6 — which contain the data for one specific ss measurement —
were represented as a single string containing all the information for that measurement,
as shown below:
{
path: [
ss_measurement_string,
ss_measurement_string
],
path2: [
ss_measurement_string,
ss_measurement_string
],
}
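A sketch of how such a dictionary could be built is shown below. The real parsing script had to tolerate the irregular outputs discussed earlier, so the simple pairing of consecutive lines here is an assumption made for illustration:

# Simplified sketch: read each ss output file and join the "tcp ..." and
# "ts sack ..." line pair of every measurement into a single string,
# skipping the repeated "Netid ..." header lines.
def parse_ss_files(paths: list) -> dict:
    measurements = {}
    for path in paths:
        with open(path) as f:
            lines = [line.strip() for line in f if not line.startswith("Netid")]
        # Consecutive pairs of the remaining lines belong to one measurement.
        measurements[path] = [
            lines[i] + " " + lines[i + 1] for i in range(0, len(lines) - 1, 2)
        ]
    return measurements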

3.4.2 Feature selection


Referring to the ss man page [85], there are many fields that are more or less useful —
useful in this context referring to how informative a given field is for the prediction —
for the problem tackled in this thesis. It was quickly concluded that the cwnd would
most likely be a very useful feature, in addition to the rtt. This is because both the
cwnd and rtt grow as packet loss becomes more and more likely. A high value for one of
these could therefore indicate congestion and should be an indicator for the model that
packet loss could happen at any time in the near future.
The reason for the cwnd being relevant is made immediately apparent when looking at a


cwnd plot of either Reno (Figure 2.1) or Cubic (Figure 2.2). As can be seen in both plots,
the cwnd grows until it reaches a peak. Assuming that the connection is in congestion
avoidance mode (Section 2.1.6), the peak is where a congestion signal is detected —
usually in the form of a packet loss. Multiplicative Decrease is used to reduce congestion
in the network by reducing the rate at which the sender can inject packets. The cwnd is
therefore informative of when packet loss will happen — because the chance of packet
loss increases with an increasing cwnd, and a maximum value of cwnd is reached just
before packet loss happens.
For the same reasons as described above, the min_cwnd and max_cwnd features were
added. A cwnd that is quite close to the max_cwnd seen so far most likely indicates a
high chance of congestion — especially in cases when the max_cwnd does not fluctuate
much. Similarly, a cwnd value close to min_cwnd most likely indicates a small chance
of congestion, because congestion has just been experienced and reacted to, with the
Multiplicative Decrease scheme briefly explained above and in more detail in Section
2.1.6.
The min_x and max_x features served as a way to incorporate historical data into the
algorithm. By doing so, theoretically, as the connection progresses, the algorithm’s
predictions could improve by learning the connection’s behavior.
All the chosen features and a short description of each feature are shown in Table 3.2.
In addition, the rationale behind choosing each feature is explained below:
timer_name The information for this feature was extracted directly from the ss field
present in the output, specifically from part of the timer data. Among the multiple
timers, the retransmission timeout (RTO) was identified as relevant due to its
association with congestion. It was therefore deemed important to distinguish this
timer from the others.
expire_time Chosen because the expire_time is a function of the RTT. Therefore, a
relatively high value for this feature could indicate congestion, and vice versa.
retrans Chosen because it should theoretically show how many times a retransmission
occurred, according to the ss man page [85], which could be relevant after
congestion in order to determine the degree of congestion.
rto Chosen because the retransmission timeout (RTO) is related to congestion and is a
function of the RTT — a higher RTO value indicates a higher RTT value, which
could indicate congestion in the network and imminent packet loss. While the
RTT is a very direct measure, the RTO is more of a long-term average [21].
rtt Chosen because the RTT is strongly related to network congestion with a higher
RTT value indicating queue growth and therefore congestion in the network, and
vice versa.
rtt_variance Chosen because, like the RTO, the RTT variance is a function of the
RTT, but it is a slightly more direct measure that estimates how much the RTT
fluctuates.
cwnd Chosen because the cwnd value strongly correlates with the likelihood of
congestion. A higher value of cwnd indicates a greater probability of congestion,
with the opposite being true for a lower value. This relationship can be observed
in the cwnd plots (Figure 2.1 and Figure 2.2).


cwnd_diff Chosen because trends in cwnd can indicate if congestion has just occurred
or not. If the cwnd is still growing, it indicates that the connection is in congestion
avoidance mode — assuming that the cwnd is larger than the ssthresh (Section
2.1.6) — while if the cwnd is decreasing, it indicates that congestion just occurred
because a peak has been reached and the fast recovery phase has begun (Section
2.1.7). This value could also indicate how quickly the cwnd is increasing or
decreasing, which could potentially indicate a higher chance of congestion or not.
ssthresh Chosen because the ssthresh contains information about the maximum cwnd
value that has worked in the past. Could be used as a supplement to the max_cwnd
feature discussed further below.
data_segments_sent This feature represented the difference in the number of
segments sent containing a positive length data segment between the current and
previous ss measurement, and was extracted from the data_segs_out field of the
ss output, explained in Table 3.3. Included in the feature set because it could help
distinguish between slow start or not. However, the initial slow start peak was
not included in the training data or considered when later applying the model to
predict packet loss in real time, as discussed in Chapter 4.
last_send Chosen because, due to ACK-clocking, there could be a positive correlation
between congestion from other traffic and this value.
pacing_rate Like many of the other chosen features, this is a function of the RTT,
and was chosen because it approximates the sending rate as cwnd/rtt, which is in
principle roughly as important as the cwnd.
min_rtt Chosen because any queue growth will cause the RTT to grow in relation
to this value; the absolute RTT should not really matter when the data is being
sourced from many different connections that have been configured with different
delays and therefore contain samples with different RTT values.
max_rtt Chosen for the same reason as the min_rtt as explained above; The RTT
should be considered in relation to the current max_rtt value, where an RTT close
to the max indicates congestion.
min_cwnd Chosen because a cwnd value close to min_cwnd most likely indicates a small
chance of congestion, because congestion has just been experienced and reacted to.
max_cwnd Chosen because a cwnd that is quite close to the max_cwnd seen so far
most likely indicates a high chance of congestion — especially in cases when the
max_cwnd does not fluctuate much — which is the case for single flow connections
that do not have any background traffic.
min_ssthresh Chosen because the ssthresh in isolation does not reveal much about
the probability of congestion when the data has been aggregated from many
different connections.
max_ssthresh Chosen for the same reason as the min_ssthresh. The ssthresh needs
to be considered in relation to this value and the min.

3.4.3 Extracting the relevant data


The same Python script used for parsing the ss output files was also used for extracting
the relevant data from the measurement strings created as a result of the parsing step.


Feature              Description                                                       ss field
timer_name           Name of the timer                                                 timer_name
expire_time          How long until the timer expires                                  expire_time
retrans              How many times the retransmission occurred                        retrans
rto                  TCP re-transmission timeout value                                 rto
rtt                  Average round-trip time                                           rtt
rtt_variance         Mean deviation of RTT                                             rttvar
min_rtt              Minimum RTT so far                                                Derived from rtt
max_rtt              Maximum RTT so far                                                Derived from rtt
cwnd                 Congestion window size                                            cwnd
cwnd_diff            Difference between the current cwnd value and the previous        Derived from cwnd
                     one that was not the same
min_cwnd             Minimum cwnd so far                                               Derived from cwnd
max_cwnd             Maximum cwnd so far                                               Derived from cwnd
ssthresh             Slow start threshold                                              ssthresh
min_ssthresh         Minimum ssthresh so far                                           Derived from ssthresh
max_ssthresh         Maximum ssthresh so far                                           Derived from ssthresh
data_segments_sent   Data segments sent between the current and previous measurement   data_segs_out
last_send            Time since the last packet was sent                               lastsnd
pacing_rate          Pacing rate                                                       pacing_rate

Table 3.2: Features selected from or based on values in the ss output

Referring to the ss man page [85], there is quite a lot of information that can be
extracted, but not all of it is necessarily relevant for the purpose of predicting congestion-induced
packet loss. The fields from the ss output described in Table 3.3 were deemed
to be relevant based on the discussion in Section 3.4.2.

ss field        Description
timer_name      The name of the timer
expire_time     How long until the timer expires
retrans         How many times the retransmission occurred
cong_alg        Congestion algorithm used
rto             TCP re-transmission timeout value
rtt             Average round-trip time
rttvar          Mean deviation of RTT
cwnd            Congestion window size
ssthresh        Slow start threshold
data_segs_out   Number of segments sent containing a positive length data segment
lastsnd         Time since the last packet was sent
pacing_rate     The pacing rate

Table 3.3: Relevant fields from the ss output and their ss man page descriptions [85]

As an intermediary step, it seemed sensible to represent the individual ss measurements
as Python dictionaries instead of just strings, where the keys represented the ss fields
and the values represented the values for those fields, as shown in the example below:


{
cwnd: 10,
ssthresh: 10
...
}
This was done to make it easier to manipulate each measurement and later convert the
merged dataset to a csv file.
For the purpose of extracting the ss fields in Table 3.3, multiple functions like the one
shown in Listing 3.7 were created. Each took the measurement string discussed in the
previous section as an argument, defined a regex for filtering out a specific field from
the measurement string, and added the field to the dictionary representing the
measurement if the string matched the regex.
def add_cwnd(ss_dict: dict, measurement: str) -> tuple:
    """Add the cwnd from the given measurement to the given dictionary.

    Args:
        ss_dict: The dictionary to add the cwnd to.
        measurement: The measurement from ss.

    Returns:
        A tuple containing the cwnd and True if the measurement contained
        cwnd information, 0 and False otherwise.
    """
    cwnd_regex = re.compile(r"cwnd:(\d+)")
    cwnd_match = re.search(cwnd_regex, measurement)
    if cwnd_match:
        cwnd = cwnd_match.group(1)
        cwnd = int(cwnd)
        ss_dict["cwnd"] = cwnd
        return cwnd, True

    return 0, False

Listing 3.7: Function for adding the cwnd feature

Features like the cwnd shown above or the ssthresh were quite straightforward to extract
from the data, because they just needed to be read directly from the measurement string
and converted from a string to a number, without any special intermediary steps.
However, referring to the discussion in Section 3.4.2 and Table 3.2, there were some
features that should be included in the final dataset that were not possible to extract
directly from the ss measurements. These features were mostly dependent on the values
in the fields of the other measurements and therefore needed special treatment. For
example, the features min_cwnd and max_cwnd were dependent on a dynamic minimum
and maximum value of cwnd that updated itself as the function dealt with more and
more measurements — meaning that, if the cwnd value for the first measurement was
10, this would be the min_cwnd value for all measurements until an even smaller value
was found, and so on.
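A sketch of how such running values can be maintained while iterating over the parsed measurements is shown below, simplified to the cwnd only:

# Sketch: maintain running min_cwnd/max_cwnd while walking the measurements.
def add_min_max_cwnd(ss_dicts: list) -> None:
    min_cwnd = None
    max_cwnd = None
    for ss_dict in ss_dicts:
        cwnd = ss_dict["cwnd"]
        min_cwnd = cwnd if min_cwnd is None else min(min_cwnd, cwnd)
        max_cwnd = cwnd if max_cwnd is None else max(max_cwnd, cwnd)
        ss_dict["min_cwnd"] = min_cwnd
        ss_dict["max_cwnd"] = max_cwnd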
As mentioned in the previous section on data collection, all the measurements for
producing traffic and gathering data from said traffic ran for 5 minutes when capturing
training data. Referring to the cwnd plots of Reno and Cubic in Figure 2.1 and Figure
2.2 respectively, the initial slow start peak (Section 2.1.6) can be seen in both figures,
where the cwnd is much higher than for the remainder of the connection. In order to
avoid outliers in the data, it was decided that only measurements in congestion avoidance
should be included, meaning that the first few seconds of measurements from the initial
slow start phase were not included in the training data. To make the min_cwnd and
max_cwnd values more consistent with the cwnd values from the congestion avoidance
phase, these values were therefore not set until after the initial slow start peak. This was
also the case for the min_ssthresh and max_ssthresh values. However, the min_rtt
and max_rtt values were taken from the beginning of the connection.
To keep things simple and because all the measurements ran for 5 minutes, the first 30
seconds of measurements were not included in the final training data to make sure that
no measurements from the initial slow start peak were included.
A feature that proved to be quite a challenge to deal with was the cwnd_diff, as
described in Table 3.2. Referring to said description, if the cwnd value of the current
measurement was 10 and the previous one was also 10, the one before the previous one
would have to be considered, and so on, until a value was found that was different from
the current. The reason for including this feature is described in detail in Section 3.4.2.
Adding this feature was accomplished by a recursive function, as shown in Listing 3.8.
def add_cwnd_diff(
    ss_dicts: list, ss_dict: dict, cwnd: int, prev_cwnd: int, current_index: int
) -> None:
    """Add the difference between the current congestion window and the
    previous congestion window value that was not the same.

    The previous value in this case refers to an earlier value in the
    same measurement series that was not the same as the current. Meaning
    that if the current congestion window is 10 and the one directly
    before it was also 10, the previous congestion window will be the one
    before that, and so on.

    Args:
        ss_dicts: The list of dictionaries that contain the measurements from ss.
        ss_dict: The dictionary to add the cwnd diff to.
        cwnd: The current congestion window value.
        prev_cwnd: The previous congestion window value that was not the same.
        current_index: The index of the current measurement in ss_dicts.
    """
    if current_index == 0:
        ss_dict["cwnd_diff"] = 0
        return

    if cwnd == prev_cwnd:
        # Scan further back until a differing cwnd value is found.
        prev_cwnd = ss_dicts[current_index - 1]["cwnd"]
        if prev_cwnd == cwnd:
            return add_cwnd_diff(
                ss_dicts, ss_dict, cwnd, prev_cwnd, current_index - 1
            )

    # prev_cwnd now differs from cwnd, so the difference can be recorded.
    ss_dict["cwnd_diff"] = cwnd - prev_cwnd

Listing 3.8: Function for adding the cwnd_diff feature

As mentioned in the data collection section (Section 3.3), there were sometimes cases
when the output from ss did not include all the relevant fields that should be present for
each row in the final dataset. These measurements were not included in the final dataset,
but their cwnd, rtt, and ssthresh values were still used to update min_cwnd, max_cwnd,
min_rtt, max_rtt, min_ssthresh, and max_ssthresh when these fields were present
while others were missing.
The final result of this step in the data transformation process was a two-dimensional list
of lists where all the inner lists contained dictionaries where each dictionary included the
relevant fields for each of the measurements. Each of these dictionaries and their items
represented what should be the rows and columns in the final dataset respectively.

3.4.4 Labeling the training data


The way that the training data should be labeled was not straightforward, but it
was concluded that labeling the data by looking at the cwnd values of the various
measurements made sense. Two different heuristics for labeling data based on the cwnd
were chosen:
Simple labeling In the simple labeling case, the data was labeled by comparing the
cwnd value of the current measurement (the one that should be labeled) with
the cwnd value of the previous measurement. If the cwnd value of the previous
measurement was larger than the current, the current measurement was labeled as
lost. Otherwise, the current measurement was labeled as not lost.
Complex labeling In the complex labeling case, the data was labeled by comparing
the cwnd value of the current measurement (the one that should be labeled) with
the cwnd values of both the previous and the next measurements. If the cwnd value
of the previous measurement was the same or smaller than the current, and the
cwnd value of the next measurement was smaller, the current measurement was
labeled as lost. Otherwise, the current measurement was labeled as not lost.
Referring to the cwnd plots of Reno and Cubic in Figure 2.1 and 2.2 respectively, the
simple labeling case was meant to capture downward movement and therefore meant to
capture the cases where packet loss had just happened.
Referring to the same plots as above, the complex labeling case was meant to capture
the “peaks” that occur right before packet loss happens.
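A sketch of the complex labeling heuristic over a plain sequence of cwnd values is shown below; the actual implementation operated on the measurement dictionaries, so this is a simplification:

# Sketch of the complex labeling heuristic: a measurement is labeled "lost"
# when the previous cwnd is the same or smaller and the next cwnd is smaller,
# i.e. the measurement sits on a cwnd peak.
def label_complex(cwnds: list) -> list:
    labels = []
    for i in range(len(cwnds)):
        if i == 0 or i == len(cwnds) - 1:
            labels.append(False)  # no previous/next measurement to compare with
        else:
            labels.append(cwnds[i - 1] <= cwnds[i] and cwnds[i + 1] < cwnds[i])
    return labels

print(label_complex([10, 11, 12, 12, 8, 9]))  # the peak at index 3 is labeled lost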
It was hypothesized that the complex labeling method made the most sense. This is
because the simple labeling method would simply capture downward trends, and could
therefore lead to a situation where the model reacts after packet loss — and keeps
reacting as the cwnd decreases during the fast recovery phase (Section 2.1.7) and the
network is no longer congested. After some initial experiments with models that were
trained on data labeled using the simple labeling method, this seemed to be the case.
For brevity and to keep things clear, the simple models and results will therefore not be
further discussed.

3.4.5 Creating a CSV file


Having created a two-dimensional list of lists of dictionaries, where each dictionary
represented what should be a row in the final dataset and the dictionary items
represented its columns, and having labeled the data using the chosen labeling method
(Section 3.4.3 and Section 3.4.4), creating the final csv file was easy. This was
accomplished by creating a DictWriter object [71] from the Python csv module, as
shown in Listing 3.9.
def create_csv(ss_dicts: List[List[Dict]], path: str) -> None:
    """Create a csv file from the ss measurements in the given list of
    lists of dictionaries.

    Args:
        ss_dicts: A list of lists, each containing dictionaries representing
            the measurements from ss.
        path: Where to create the csv file.
    """
    flattened_ss_dicts = [ss_dict for sublist in ss_dicts for ss_dict in sublist]

    with open(path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=flattened_ss_dicts[0].keys())

        writer.writeheader()
        for ss_dict in flattened_ss_dicts:
            writer.writerow(ss_dict)

    print("Created csv file under path:", path)

Listing 3.9: Function for creating a csv file

The final result of this step in the data transformation process was a csv file containing
all the ss measurements as rows and the machine learning features as columns.

3.4.6 Creating the final dataset


Creating the final dataset in the form of a pandas.DataFrame [69] object was very simple
because the data had already been transformed into a suitable format: loading the data
into a pandas.DataFrame object representing the tabular dataset was accomplished by
simply calling the pandas.read_csv method [68].
One DataFrame object was created for each congestion control algorithm for each phase:
one for Reno and one for Cubic, resulting in two DataFrames per phase. Before the
datasets could be used to create any of the machine learning models, they needed to
be transformed further — mainly by creating training and test data, but also by doing
some simple data cleaning.

3.4.7 Cleaning the data


The data contained some categorical features (Section 2.2.3) that would need to be
converted before using the data with the chosen machine learning algorithms.
These categorical features — the features being the lost, timer_name, and timestamp
columns in the DataFrame (Section 3.4.2) — were converted to numerical features, for
example by converting True to 1 and False to 0.
The timestamp column was removed due to it being superfluous and not relevant for the
prediction task. It was concluded that this feature would most likely only add noise to
the data when training the model, but it was useful to have in the csv file for debugging
purposes.
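A sketch of this cleaning step is shown below. The file name is hypothetical, and the encoding of timer_name is an assumption for illustration, as it merely distinguishes the retransmission timer (shown as on in the ss output) from the other timers:

import pandas as pd

# Sketch of the cleaning step described above (encodings are illustrative).
df = pd.read_csv("dataset.csv")                            # hypothetical file name
df["lost"] = df["lost"].astype(int)                        # True/False -> 1/0
df["timer_name"] = (df["timer_name"] == "on").astype(int)  # assumed encoding
df = df.drop(columns=["timestamp"])                        # superfluous for prediction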


The final result of the data cleaning step was a dataset with only numerical features in
either float or integer format.

3.4.8 Creating the training, validation, and test datasets


As discussed in Section 2.2.4, training and test data is needed for this particular type of
machine learning problem — this being a supervised learning problem, more specifically
a case of binary classification (Section 2.2.5).
Three datasets for each model were created for the purpose of training and evaluating
the machine learning model:
1. A training dataset for training the model by fitting it to the data.
2. A validation dataset for evaluating the performance of the model when tuning
the hyperparameters or doing other adjustments to the model.
3. A final testing dataset for testing the final model on data that it had not
encountered during training or tuning.
The three abovementioned datasets were created from the original labeled and cleaned
dataset that contained all the different measurements captured in the data collection
process. To ensure that all sets had a consistent ratio of lost and not lost packets,
stratified sampling was used where the measurements were sampled based on the lost
column. This way the proportion of samples for each class (i.e., lost and not lost packets)
was consistent across the training, validation, and test sets.
The data was split into a 50/25/25 ratio according to the following scheme:
• 50% for training.
• 25% for validation.
• 25% for final testing.
The final test set was not used until the end for final testing purposes.
The seed was fixed for shuffling and sampling. This was done to ensure that we obtained
the same split each time, aiding in reproducibility of the results.
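A sketch of this split using scikit-learn is shown below, splitting twice to obtain the 50/25/25 ratio; df stands for the cleaned DataFrame from the previous steps, and the seed value is illustrative:

from sklearn.model_selection import train_test_split

# Sketch: 50/25/25 stratified split on the "lost" column with a fixed seed.
train_df, rest_df = train_test_split(
    df, test_size=0.5, stratify=df["lost"], random_state=42
)
val_df, test_df = train_test_split(
    rest_df, test_size=0.5, stratify=rest_df["lost"], random_state=42
)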

3.5 Phase one classifiers


Starting with the simplest case, which is a single flow between a single sender and
receiver, and a single combination of the relevant connection parameters bandwidth,
delay, and queue size, an initial machine learning model was trained and evaluated to
check if the performance with regards to packet loss prediction was satisfactory on a
simple dataset that should be easy to predict on.
Referring to the cwnd plots from Reno (Figure 2.1) and Cubic (Figure 2.2), it is easy to
see where the packet losses happen, and it was therefore hypothesized that a machine
learning model should be able to learn this same pattern. Referring to the cwnd plot of
BBR (Figure 2.4), the same could not be said here, with BBR not reacting to congestion
in the same way as Reno and Cubic, discussed in more detail in Section 2.1.10.


3.5.1 Model creation


For all the three phases, one classifier for Reno and another classifier for Cubic was
created, resulting in two classifiers per phase.
Instead of creating one “common” classifier for both Reno and Cubic, it was concluded
that creating a separate set of classifiers for each of the congestion control algorithms
made more sense — due to them exhibiting slightly different behavior. For this reason,
training data was collected (Section 3.3) from only Reno connections and only Cubic
connections, labeled using the complex labeling method (Section 3.4.4), and then used
for training the appropriate classifier, resulting in the following two classifiers:
1. Reno phase one classifier trained on training data gathered from a single
connection configured with Reno and the connection parameters shown in Table
3.4 that was labeled using the complex labeling method.
2. Cubic phase one classifier trained on training data gathered from a single
connection configured with Cubic and the connection parameters shown in Table
3.4 that was labeled using the complex labeling method.
Initially, LightGBM [51] was used for creating the phase one binary classifiers. The
lightgbm.LGBMClassifier class [47] was used for this purpose. More details about
LightGBM and the rationale behind initially choosing it over other gradient boosting
frameworks such as XGBoost can be found in Section 3.2.
As the thesis work progressed, the datasets grew larger and larger, and the initial
approach of training the models on the Linux machine used for the data collection step
using LightGBM was no longer viable because of hardware restrictions. This could be
circumvented by using remote compute, but it was decided to move the machine learning
related computations to the author’s personal MacBook Pro — due to the author having
a very powerful machine with an Apple M1 Max chip and 32GB of memory.
Multiple issues were encountered when switching from the Linux machine to the author’s
personal MacBook Pro with regards to running the machine learning model, more
specifically the code involving the use of LightGBM. According to a Stack Overflow
thread [38] there seemed to be no officially supported release of LightGBM for the new
ARM Apple Silicon Macs at the time of writing. As the author’s personal MacBook Pro
had such a processor, it was decided that the model should be updated to use XGBoost
instead — which was officially supported [104].
Updating the model to use XGBoost instead of LightGBM proved to be quite painless
and did not involve large changes to other parts of the code. LightGBM was therefore
replaced by XGBoost as the chosen gradient boosting framework going forward.
The two phase one classifiers were created using the XGBClassifier [96] class from the
XGBoost library. The options supplied to the XGBClassifier constructor are explained
below:
objective="binary:logistic" To specify that the classifier should use the objective
function for binary classification. This is also the default, so it was strictly speaking
not necessary to supply to the constructor.
random_state=42 To fix the seed when training the model for more reproducible
results.


The classifier was trained on the training data (Section 3.4.8) using the fit method of
the XGBClassifier class [98]. No hyperparameter tuning was performed, so no other
parameters were supplied to the XGBClassifier constructor.
The predictions were performed using the predict method of the XGBClassifier class
[100].
Results were evaluated by calculating the accuracy, precision, recall, and F1 score
(Section 2.2.5) of the various classifiers. In addition, a confusion matrix (Section 2.2.5)
was computed for each classifier. All metrics and the confusion matrices were computed
using the sklearn.metrics module [81].
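Putting the pieces described above together, a minimal sketch of the training and evaluation flow is shown below; the DataFrame variables continue the earlier sketches and are illustrative:

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from xgboost import XGBClassifier

# Sketch of the phase one training/evaluation flow (variable names illustrative).
features = [c for c in train_df.columns if c != "lost"]
model = XGBClassifier(objective="binary:logistic", random_state=42)
model.fit(train_df[features], train_df["lost"])

y_pred = model.predict(val_df[features])
y_true = val_df["lost"]
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(dict(zip(features, model.feature_importances_)))  # per-feature importances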

Connection parameter Value


Bandwidth 50Mbps
Delay 70ms
Queue size 437500 bytes (1 BDP)

Table 3.4: Network connection parameters used to configure the data collection procedure in
phase one

3.5.2 Results
The phase one datasets displayed a significant imbalance between the majority and
minority class: no packet loss and packet loss, as illustrated in Table 3.5 for Reno and
Table 3.6 for Cubic. This imbalance can be explained by packet loss at the transport layer
being relatively rare. Given this imbalance, relying solely on accuracy as a performance
metric when evaluating the performance of the classifiers proved to be insufficient. This
is because an imbalanced dataset can lead to a situation where the classifier predicts the
majority class all the time, and still has 99% accuracy because the dataset consists of
99% majority class samples and only 1% minority class samples (Section 2.2.5).

Dataset Total samples Samples marked as lost Proportion


Training 3853 5 0.001
Validation 1898 2 0.001
Test 1918 2 0.001

Table 3.5: The proportion of samples for each class in the training, validation, and test sets for
the Reno phase one dataset

Dataset Total samples Samples marked as lost Proportion


Training 4111 17 0.004
Validation 2026 8 0.004
Test 2046 9 0.004

Table 3.6: The proportion of samples for each class in the training, validation, and test sets for
the Cubic phase one dataset

For this reason, precision, recall, and the combined F1 score were mainly used when
evaluating the classifiers. It was hypothesized that a high precision would be more
important than a high recall value — due to a high number of false positives (low
precision) leading to unnecessary backoffs if the model were to be applied to
dynamically reduce the sending rate based on model predictions — so special attention
was paid to this metric. A confusion matrix was therefore also computed, in order to
visualize the number of True and False Positives (TP, FP) and True and False Negatives
(TN, FN).
In addition to the metrics above, the feature importances were extracted using the
feature_importances_ property of the XGBClassifier [97].
The results for Reno and Cubic, shown in Table 3.7 and Table 3.8 respectively, were
poor, with low scores for both precision and recall.

Dataset Precision Recall F1 score


validation 0.0000 0.0000 0.0000
test 0.3333 1.0000 0.5000

Table 3.7: Reno phase one results

Dataset Precision Recall F1 score


validation 0.4000 0.2500 0.3077
test 0.3333 0.1111 0.1667

Table 3.8: Cubic phase one results

The confusion matrices for Reno and Cubic are shown in Table 3.9 and Table 3.10,
respectively, and show that most of the samples marked as lost were not classified
correctly, with only a few correct positive classifications in both the Reno and Cubic
cases.

Validation Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 1895 1
Actual: Lost 2 0

Testing Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 1912 4
Actual: Lost 0 2

Table 3.9: Reno phase one confusion matrices

The feature importances for Reno and Cubic are shown in Table 3.11 and Table 3.12,
respectively, and show that the cwnd and cwnd_diff features were very important for
the classification for Reno. While the cwnd feature was also important for Cubic, the
cwnd_diff feature was not used at all. However, the rtt, data_segments_sent, and
the ssthresh features were of quite high importance for the Cubic classifier.
In the case of Reno, when examining the training data, the feature importances can be explained by the cwnd reaching a maximum value for the True cases compared to the others, as visualized in Table 3.13, which shows an excerpt from the Reno training data.


Validation Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 2015 3
Actual: Lost 6 2

Testing Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 2035 2
Actual: Lost 8 1

Table 3.10: Cubic phase one confusion matrices

Feature Importance
timer_name 0.00
expire_time 0.04
retrans 0.00
rto 0.00
rtt 0.07
rtt_variance 0.05
cwnd 0.33
ssthresh 0.00
data_segments_sent 0.03
last_send 0.00
pacing_rate 0.10
min_rtt 0.00
max_rtt 0.00
cwnd_diff 0.38
min_cwnd 0.00
max_cwnd 0.00
min_ssthresh 0.00
max_ssthresh 0.00

Table 3.11: Feature importances for the Reno phase one model

By combining this information with the cwnd_diff feature, which shows that the cwnd is growing, the model could spot a trend where the True cases consistently have high cwnd values and positive cwnd_diff values. As the feature importances show, the rtt also seems to be a factor here, with the value reaching a near maximum right before packet loss happens — indicating that the queue is growing and that there is congestion.
For Cubic, much like Reno, the cwnd feature was highly significant, with the significance
being explained by the same reasoning as for Reno. However, as already mentioned,
features such as rtt, data_segments_sent, and the ssthresh also held considerable
importance.
The reason behind the importance of the rtt feature is the same as for Reno, with this
being a value that seems to increase until it reaches a near maximum right before packet
loss happens.


Feature Importance
timer_name 0.00
expire_time 0.04
retrans 0.00
rto 0.00
rtt 0.10
rtt_variance 0.05
cwnd 0.41
ssthresh 0.15
data_segments_sent 0.11
last_send 0.00
pacing_rate 0.06
min_rtt 0.00
max_rtt 0.00
cwnd_diff 0.00
min_cwnd 0.07
max_cwnd 0.00
min_ssthresh 0.00
max_ssthresh 0.00

Table 3.12: Feature importances for the Cubic phase one model

cwnd expire_time rtt cwnd_diff ssthresh data_segments_sent pacing_rate lost


577 288 139.262 1 288 16 57.6 False
577 288 139.273 1 288 12 57.6 False
577 288 139.256 1 288 16 57.6 False
577 200 139.317 1 288 17 57.6 True
397 340 140.182 -7 289 8 74.7 False
389 340 140.189 -8 289 7 75.5 False
290 152 72.781 1 289 13 55.4 False

Table 3.13: An excerpt from the Reno phase one training dataset

The reason behind the data_segments_sent feature being of quite high importance
seemed to be a pattern where the values were consistently smaller right after packet
loss. When examining the training data, this was made clear by looking at the values
for True cases and the cases right after the True cases. The True cases consistently had
a higher value than the False cases right after, which seemed to indicate the pattern
described earlier. This can be seen in Table 3.14.

cwnd expire_time rtt cwnd_diff ssthresh data_segments_sent pacing_rate lost


592 280 135.924 1 437 15 60.5 False
597 284 137.073 1 437 15 60.5 False
604 288 138.766 1 437 15 60.5 False
605 284 139.062 1 437 15 60.5 True
508 336 139.982 -4 437 9 77.7 False
503 336 139.983 -5 437 12 78.5 False
495 336 139.934 -8 437 12 79.7 False

Table 3.14: An excerpt from the Cubic phase one training dataset

However, it was harder to find a clear reason for the importance value of the ssthresh
feature. Upon analyzing the training data, it was observed that the values for this feature
predominantly hovered around approximately 405 for the Cubic phase one training data.
There seemed to be no distinct pattern — values occasionally increased following packet loss, while at other times they decreased.


The reason for the poor classification performance might be partially attributed to the
data providing few indications of packet loss, especially when comparing the data points
labeled as True to the preceding data points labeled as False. This could cause the model
to react earlier than it should, because it could label samples as Lost, even though they
should be labeled as Not Lost according to the training data — these being the samples
that directly precede the actual Lost samples. This would not necessarily be an issue
depending on how much earlier the model reacts — earlier in this case referring to the
time before the actual packet loss happens. Referring to the training data excerpt for
Reno (Table 3.13), there is very little difference in the various features for the sample
that was labeled as Lost compared to the four preceding samples that were labeled as
Not Lost.
As discussed here and visualized in the various tables, the results for the phase one
classifiers were generally quite poor. The reason for the poor performance can be
partially explained by the heavy data imbalance in the training data. In addition, there
did not seem to be a very clear pattern in the data with regards to there being little
distinction between Lost cases and the cases labeled as Not Lost that directly preceded
the Lost cases. In this part of the data, there were only small differences in some of the
features, and capturing this pattern therefore seemed to prove difficult for the machine
learning model.

3.6 Phase two classifiers


As discussed in the previous section, the results for the phase one classifiers were quite
disappointing. We expected the performance to stay mostly the same when training
and evaluating the phase two classifiers due to the training data being quite similar —
similar referring to the values and patterns in the data. However, unlike the phase one
training data, the phase two training data was aggregated from many different scenarios,
each scenario representing a single TCP flow using either Reno or Cubic with a specific
combination of connection parameters like bandwidth, delay, and queue size. This is
explained in detail in Section 3.3.
The main difference between the phase one and phase two classifiers can therefore be
found in the training data that the classifiers were trained on. When evaluating the
performance, we expected that the main difference between the phase one and phase
two classifiers would be found in the feature importances, because features like the cwnd
and rtt were not as constant as they were for the phase one classifier.

3.6.1 Model creation


As in phase one, one classifier was created for Reno and another for Cubic. The Reno classifier was trained on flows that were configured to only use Reno, and the Cubic classifier on flows configured to only use Cubic.
This resulted in the following two classifiers in phase two:
1. Reno phase two classifier trained on aggregated training data gathered
from many different connections configured with Reno and a specific set of the
connection parameters shown in Table 3.15 that was labeled using the complex
labeling method.

2. Cubic phase two classifier trained on aggregated training data gathered from many different connections configured with Cubic and a specific set of the connection parameters shown in Table 3.15 that was labeled using the complex labeling method.
As in phase one, the phase two classifiers were created using the XGBClassifier [96]
class from the XGBoost library. The options supplied to the XGBClassifier constructor
were the same as for phase one (Section 3.5).
Training and model evaluation were also done in the same way as with the phase one
classifiers, where no hyperparameter tuning was performed, and the precision, recall, F1
score, feature importances, and confusion matrices were calculated and evaluated.

Connection parameter Values


Bandwidth (Mbps) 10, 20, 30, 40, 50
Delay (ms) 30, 40, 50, 60, 70
Queue size (proportion of BDP) 0.25, 0.5, 1

Table 3.15: Connection parameters for phase two

3.6.2 Results
As in phase one, the data was still very imbalanced in phase two, as illustrated in
Table 3.16 and 3.17 for Reno and Cubic, respectively. However, the proportion of lost samples increased considerably, especially for Reno, going from 0.001 in phase one to 0.007 in phase two. This can be attributed to the fact that there was simply much
more training data in phase two, and the training data was also aggregated from many
different connections instead of just a single connection with a specific set of connection
parameters. Some of these connections were configured with smaller BDPs than the one
that was used for data collection in phase one, and therefore had more frequent packet
loss.

Dataset Total samples Samples marked as lost Proportion


Training 489464 3294 0.007
Validation 241080 1622 0.007
Test 243515 1639 0.007

Table 3.16: The proportion of samples for each class in the training, validation, and test sets for
the Reno phase two dataset

Dataset Total samples Samples marked as lost Proportion


Training 433243 3513 0.008
Validation 213389 1730 0.008
Test 215545 1748 0.008

Table 3.17: The proportion of samples for each class in the training, validation, and test sets for
the Cubic phase two dataset

As hypothesized and illustrated in Table 3.18 and Table 3.19 for Reno and Cubic, respectively, the classification performance for the phase two models was quite poor, indicating that this could be a difficult problem for a machine learning model to handle. Precision was decent, indicating that samples predicted as Lost were often actually Lost. Recall was very poor, indicating that many samples that should have been classified as Lost were classified as Not Lost. The latter suggests that the model missed many cases where packets should have been labeled as Lost.

Dataset Precision Recall F1 score


Validation 0.6384 0.1720 0.2710
Test 0.6364 0.1836 0.2850

Table 3.18: Reno phase two results

Dataset Precision Recall F1 score


Validation 0.6055 0.1393 0.2265
Test 0.6450 0.1476 0.2402

Table 3.19: Cubic phase two results

Validation Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 239300 158
Actual: Lost 1343 279

Testing Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 241704 172
Actual: Lost 1338 301

Table 3.20: Reno phase two confusion matrices

Validation Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 211502 157
Actual: Lost 1489 241

Testing Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 213655 142
Actual: Lost 1490 258

Table 3.21: Cubic phase two confusion matrices

As illustrated in Table 3.22 and Table 3.23 for Reno and Cubic, respectively, the samples
in the training data right before the samples that were labeled as True seem to be very
similar to the True sample. This could indicate that the model often gets things “almost

60
3.6. Phase two classifiers

right”, by labeling a sample as True even if it is not right at the peak yet, but rather
approaching it. This could make the model react to congestion just a bit earlier than
the optimal case, which would be right before the peak.

cwnd expire_time rtt cwnd_diff ssthresh data_segments_sent pacing_rate lost


262 164 78.795 1 132 12 46.2 False
262 168 78.835 1 132 12 46.2 False
262 168 78.793 1 132 12 46.2 True
180 280 80.494 -6 133 6 60.4 False
174 280 80.503 -6 133 6 61.5 False
116 204 138.078 1 58 2 11.7 False
116 204 138.081 1 58 2 11.7 False
116 204 138.08 1 58 4 11.7 True
115 20 137.804 -1 58 2 11.9 False
96 340 138.582 -2 58 1 13.6 False
309 160 74.309 1 155 17 57.8 False
309 160 74.239 1 155 16 57.9 False
309 160 74.521 1 155 13 57.6 True
200 272 75.217 -8 155 7 76.9 False
175 272 75.231 -8 155 7 81.5 False
295 200 118.669 1 149 10 34.6 False
295 204 118.675 1 149 10 34.6 False
295 200 119.089 1 149 8 34.4 True
278 320 120.162 -4 148 5 36.4 False
263 320 120.06 -4 148 5 38.2 False

Table 3.22: An excerpt from the Reno phase two training dataset, shown as groups of five consecutive samples taken from different parts of the aggregated training data. Each group consists of one True sample, the two False samples that occurred right before it, and the two False samples that occurred right after it

cwnd expire_time rtt cwnd_diff ssthresh data_segments_sent pacing_rate lost


298 200 119.499 1 217 10 34.7 False
298 200 119.498 1 217 11 34.7 False
298 200 119.508 1 217 10 34.7 True
214 320 120.346 -3 208 6 55.8 False
212 320 120.214 -2 208 7 56.5 False
428 200 137.985 1 324 13 43.1 False
428 200 139.701 1 324 9 42.6 False
429 288 140.195 1 324 9 42.5 True
425 340 140.087 -5 301 10 42.8 False
341 328 129.42 -3 301 9 66.9 False
111 96 44.431 1 78 10 34.7 False
111 96 44.43 1 78 11 34.7 False
111 96 45.201 1 78 10 34.1 True
95 244 45.441 -2 78 7 45.0 False
92 244 45.397 -3 78 6 46.8 False
265 164 79.775 1 186 15 46.2 False
265 168 79.761 1 186 15 46.2 False
265 168 79.758 1 186 12 46.2 True
208 280 80.616 -3 185 9 66.9 False
192 280 80.493 -4 185 8 73.2 False

Table 3.23: An excerpt from the Cubic phase two training dataset, shown as groups of five consecutive samples taken from different parts of the aggregated training data. Each group consists of one True sample, the two False samples that occurred right before it, and the two False samples that occurred right after it

As hypothesized and illustrated in Table 3.26 and Table 3.27 for Reno and Cubic, respectively, the main difference in results between the phase one and phase two classifiers was found in the feature importances. Features like the cwnd that had high importance in the phase one classifiers were no longer of great importance in phase two, because the training data was aggregated from many different connections with different combinations of connection parameters. Unlike in phase one, where the cwnd could be used directly to spot a trend where the cwnd reached a peak before packet loss, in phase two this was not possible because the peak value differed between the various connections that the models were trained on. This can be seen in Table 3.22 and Table 3.23. The same reasoning can be applied to the pacing_rate feature, which also decreased in importance for both phase two classifiers.

Feature Importance
timer_name 0.00
expire_time 0.14
retrans 0.00
rto 0.00
rtt 0.10
rtt_variance 0.02
cwnd 0.04
ssthresh 0.02
data_segments_sent 0.02
last_send 0.03
pacing_rate 0.04
min_rtt 0.10
max_rtt 0.14
cwnd_diff 0.17
min_cwnd 0.03
max_cwnd 0.04
min_ssthresh 0.04
max_ssthresh 0.06

Table 3.24: Feature importances for the Reno phase two model

One group of features that had more or less no importance in the phase one classifiers was the group of various min and max features, as explained in Section 3.4.2. All of these increased in importance for the phase two classifiers, for much the same reason that features like the cwnd decreased in importance: when the cwnd could no longer be used directly to spot the trend where it reached a maximum value before packet loss, features like max_cwnd and max_rtt were perhaps used to aid the model when deciding if a data point should be labeled as Lost or not.
The cwnd_diff feature had quite high and about the same importance for both phase
two classifiers, as illustrated in Table 3.24 and 3.25 for Reno and Cubic, respectively.
Referring to the cwnd plots of Reno and Cubic in Figure 2.1 and Figure 2.2, respectively,
this was probably used to make sure that the model did not label samples where the
cwnd was decreasing as Lost. As briefly mentioned in Section 3.4.4, some initial models
were trained on data labeled using the simple labeling method — where packets were
labeled as Lost if the cwnd value of the previous packet was smaller than the current —
and evaluated. While the models had great classification performance, they probably
just used the cwnd_diff feature to see that the cwnd was decreasing and always classified
packets as Lost in that case. The final models trained on training data labeled using the
complex labeling method avoided classifying packets as Lost if the cwnd was decreasing.


Feature Importance
timer_name 0.00
expire_time 0.06
retrans 0.00
rto 0.02
rtt 0.08
rtt_variance 0.02
cwnd 0.05
ssthresh 0.04
data_segments_sent 0.05
last_send 0.02
pacing_rate 0.04
min_rtt 0.09
max_rtt 0.14
cwnd_diff 0.19
min_cwnd 0.02
max_cwnd 0.08
min_ssthresh 0.03
max_ssthresh 0.09

Table 3.25: Feature importances for the Cubic phase two model

Feature Phase one importance Difference in importance (phase two - phase one)
timer_name 0.00 0.00
expire_time 0.04 +0.10
retrans 0.00 0.00
rto 0.00 0.00
rtt 0.07 +0.03
rtt_variance 0.05 -0.03
cwnd 0.33 -0.29
ssthresh 0.00 +0.02
data_segments_sent 0.03 -0.01
last_send 0.00 +0.03
pacing_rate 0.10 -0.06
min_rtt 0.00 +0.10
max_rtt 0.00 +0.14
cwnd_diff 0.38 -0.21
min_cwnd 0.00 +0.03
max_cwnd 0.00 +0.04
min_ssthresh 0.00 +0.04
max_ssthresh 0.00 +0.06

Table 3.26: Feature importances difference for the Reno phase one and phase two models

This was an important reason for why the simple labeling case was deemed unsuitable
and clearly highlights the impact data labeling can have on model performance and
results.


Feature Phase one importance Difference in importance (phase two - phase one)
timer_name 0.00 0.00
expire_time 0.04 +0.02
retrans 0.00 0.00
rto 0.00 +0.02
rtt 0.10 -0.02
rtt_variance 0.05 -0.03
cwnd 0.41 -0.36
ssthresh 0.15 -0.11
data_segments_sent 0.11 -0.06
last_send 0.00 +0.02
pacing_rate 0.06 -0.02
min_rtt 0.00 +0.09
max_rtt 0.00 +0.14
cwnd_diff 0.00 +0.19
min_cwnd 0.07 -0.05
max_cwnd 0.00 +0.08
min_ssthresh 0.00 +0.03
max_ssthresh 0.00 +0.09

Table 3.27: Feature importances difference for the Cubic phase one and phase two models

3.7 Phase three classifiers


As discussed in the previous sections (Section 3.5 and Section 3.6), the results for both
the phase one and phase two classifiers were quite disappointing. The performance was
expected to stay mostly the same or be worse when training and evaluating the phase
three classifiers due to the training data being quite similar but with the addition of
background traffic, which should in theory make the situation less predictable because of
the foreground flow being affected by the background flows. Like the phase two training
data, the phase three training data was aggregated from many different scenarios, each
scenario representing a single TCP flow with or without background traffic using either
Reno or Cubic with a specific combination of connection parameters like bandwidth,
delay, queue size, and type of background traffic in the cases where background traffic
was present, explained in detail in Section 3.3.
The main difference between the phase two and phase three classifiers can therefore be
found in the training data that the classifiers were trained on. Similarly, we expected that
the main difference between the phase two and phase three classifiers when evaluating the
performance would be found in the feature importances and partially in the classification
performance, because the values for all the features fluctuated even more in the training
data compared to both phase one and phase two.

3.7.1 Model creation


As in phase one and phase two, one classifier was created for Reno and another for Cubic, where the former was trained on flows that were configured to only use Reno and the latter on flows configured to only use Cubic.
This resulted in the following two classifiers in phase three:


1. Reno phase three classifier trained on aggregated training data gathered from
many different connections configured with Reno, the presence or absence of
background traffic, and a specific set of the connection parameters shown in Table
3.28 that was labeled using the complex labeling method.
2. Cubic phase three classifier trained on aggregated training data gathered from
many different connections configured with Cubic, the presence or absence of
background traffic, and a specific set of the connection parameters shown in Table
3.28 that was labeled using the complex labeling method.

Connection parameter Values


Bandwidth (Mbps) 10, 20, 30, 40, 50
Delay (ms) 30, 40, 50, 60, 70
Queue size (proportion of BDP) 0.25, 0.5, 1
Scenario reno, cubic, half

Table 3.28: Connection parameters for phase three

Like phase one and phase two, the phase three classifiers were created using the
XGBClassifier [96] class from the XGBoost library. The options supplied to the
XGBClassifier constructor were the same as for phase one and phase two (Section
3.5).
Model evaluation was also done in the same way as with the phase one and phase
two classifiers, where the precision, recall, F1 score, feature importances, and confusion
matrices were calculated and evaluated.
Unlike phase one and phase two, in phase three various hyperparameters were tuned
with the goal of improving model performance. The choice of which parameters to
tune was based on the relevant section in the XGBoost documentation titled “Notes
on Parameter Tuning” [106]. The chosen hyperparameters and their default and tuned
values are shown in Table 3.29 for Reno and Table 3.30 for Cubic.

Hyperparameter Default value Tuned value


Scale pos weight 1 5
Learning rate 0.3 0.3
Gamma 0 9
Max depth 6 12
Min child weight 1 1
Subsample 1 1
Colsample bytree 1 1

Table 3.29: Hyperparameters for the Reno phase three model

The hyperparameters were tuned manually by supplying various values to the XGBClassifier when training, comparing the F1 scores of the resulting classifiers, and choosing the best value based on the score.
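The spirit of this manual search can be captured by a loop along the following lines — a sketch only, with hypothetical candidate values, and with the training and validation sets (X_train, y_train, X_val, y_val) assumed to exist:

from itertools import product

from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Hypothetical candidate values; the actual values that were tried are not listed.
candidates = {
    "scale_pos_weight": [1, 5, 10],
    "gamma": [0, 3, 9],
    "max_depth": [6, 10, 12],
}

best_score, best_params = 0.0, None
for values in product(*candidates.values()):
    params = dict(zip(candidates.keys(), values))
    clf = XGBClassifier(**params)
    clf.fit(X_train, y_train)
    score = f1_score(y_val, clf.predict(X_val))
    if score > best_score:
        best_score, best_params = score, params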
One of the hyperparameters that was very important and required special attention was the scale_pos_weight parameter. From the XGBoost documentation:
Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances) [105].

Hyperparameter Default value Tuned value

Scale pos weight 1 5
Learning rate 0.3 0.3
Gamma 0 9
Max depth 6 10
Min child weight 1 8
Subsample 1 1
Colsample bytree 1 0.8

Table 3.30: Hyperparameters for the Cubic phase three model
The scale_pos_weight parameter is used to tune the behavior of the XGBClassifier
for imbalanced classification problems, such as the one tackled in this thesis. Since the
phase three training data was still heavily imbalanced — as illustrated in Table 3.31
and Table 3.32 for Reno and Cubic, respectively — tuning this parameter had a large
impact on the classification performance when considering the F1 score. As the XGBoost
documentation suggests, sum(negative instances) / sum(positive instances) was one of
the values that was considered when tuning, but values closer to the default 1 seemed
to provide better results, with the final value being 5 for both Reno and Cubic.
The final tuned phase three models for Reno and Cubic were exported to a file using the
save_model function of the XGBClassifier class [102].
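A condensed sketch of these last steps — with hypothetical variable and file names, and with the final tuned Reno values from Table 3.29:

from xgboost import XGBClassifier

# The documented heuristic: the ratio of negative to positive training samples
# (y_train is a hypothetical name for the training labels).
heuristic = (y_train == 0).sum() / (y_train == 1).sum()

# The heuristic was considered, but a value closer to the default of 1
# gave a better F1 score, with the final value being 5.
clf = XGBClassifier(scale_pos_weight=5, gamma=9, max_depth=12)
clf.fit(X_train, y_train)
clf.save_model("reno_phase3_model.json")  # hypothetical file name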

3.7.2 Results
As in phase one and phase two, the data was still very imbalanced in phase three, as
illustrated in Table 3.31 and 3.32 for Reno and Cubic, respectively. However, like when
going from phase one to phase two, the proportion of lost samples did improve from
phase two to phase three. Similarly to phase two, this can be partially attributed to
there being much more training data in phase three, and perhaps packet loss being more
frequent in the presence of background traffic.

Dataset Total samples Samples marked as lost Proportion


Training 7283737 86882 0.012
Validation 3587513 42792 0.012
Test 3623750 43224 0.012

Table 3.31: The proportion of samples for each class in the training, validation, and test sets for
the Reno phase three dataset

Dataset Total samples Samples marked as lost Proportion


Training 7120069 65805 0.009
Validation 3506900 32412 0.009
Test 3542324 32739 0.009

Table 3.32: The proportion of samples for each class in the training, validation, and test sets for
the Cubic phase three dataset


As hypothesized and illustrated in Table 3.33 and Table 3.34 for Reno and Cubic, respectively, and similar to phase one and phase two, the classification performance for the phase three models was quite poor with regards to the relevant metrics. Precision was greatly reduced compared to phase two, indicating many false positives — that is, samples classified as Lost that were actually labeled as Not Lost.

Dataset Precision Recall F1 score


Validation 0.3411 0.4876 0.4014
Test 0.3438 0.4920 0.4048

Table 3.33: Reno phase three results

Dataset Precision Recall F1 score


Validation 0.3233 0.4206 0.3656
Test 0.3203 0.4168 0.3622

Table 3.34: Cubic phase three results

Validation Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 3504417 40304
Actual: Lost 21927 20865

Testing Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 3539936 40590
Actual: Lost 21958 21266

Table 3.35: Reno phase three confusion matrices

Validation Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 3445956 28532
Actual: Lost 18779 13633

Testing Dataset

Predicted: Not Lost Predicted: Lost


Actual: Not Lost 3480623 28962
Actual: Lost 19092 13647

Table 3.36: Cubic phase three confusion matrices

The reason for the low precision could be partially explained by there being very little
difference in feature values for the samples labeled as Lost and the samples directly preceding them that were labeled as Not Lost. This is illustrated in Table 3.37 and
Table 3.38 for Reno and Cubic, respectively. Even though many samples were wrongfully
labeled as Lost, they could belong to this group of samples that occurs right before the
actual Lost samples, as they have very similar feature values. The model could therefore
be reacting a bit early. As explained when discussing the results for phase two, this
could indicate that the model often gets things almost right, by labeling a case as True
even if it is not right at the peak yet, but rather approaching it.

cwnd expire_time rtt cwnd_diff ssthresh data_segments_sent pacing_rate lost


262 164 78.795 1 132 12 46.2 False
262 168 78.835 1 132 12 46.2 False
262 168 78.793 1 132 12 46.2 True
180 280 80.494 -6 133 6 60.4 False
174 280 80.503 -6 133 6 61.5 False
495 200 119.45 1 248 16 57.6 False
495 200 119.437 1 248 21 57.6 False
496 248 119.427 1 248 16 57.7 True
488 316 119.598 -8 248 6 57.9 False
298 320 120.294 -7 248 8 79.5 False
115 204 138.074 1 58 2 11.6 False
116 204 138.072 1 58 2 11.7 False
116 20 137.931 1 58 5 11.8 True
114 336 137.957 -2 58 1 11.9 False
77 340 137.286 -3 58 2 15.7 False
18 276 105.988 -2 13 0 4.46 False
18 268 105.988 -2 13 0 4.46 False
18 264 105.988 -2 13 0 4.46 True
15 304 105.669 -3 13 2 4.74 False
15 304 105.615 -3 13 1 4.87 False

Table 3.37: An excerpt from the Reno phase three training dataset, shown as groups of five consecutive samples taken from different parts of the aggregated training data. Each group consists of one True sample, the two False samples that occurred right before it, and the two False samples that occurred right after it

Another reason for the low precision can be attributed to the scale_pos_weight
parameter used for tuning the model. Increasing this value from the default of 1 seemed
to improve the F1 score at the expense of precision since it greatly improved recall.
Adjusting the classification threshold (Section 2.2.5) after tuning showed that the model
improved with regards to both precision and recall compared to no tuning.
Even though the performance was poor, the results between the validation and test set
were very similar, indicating that the model generalizes well and is not overfitting to the
training data (Section 2.2.7). This was also the case for the phase one and phase two
models.
As with the phase two classifiers and illustrated in Table 3.41 and Table 3.42 for Reno
and Cubic, respectively, the feature importances changed quite a bit, however, not in
exactly the same way as they did when going from phase one to phase two, where more
static features like the cwnd had a great reduction in importance and more dynamic
features like the max_rtt seemed to become more important. This was also the case for
the phase three classifiers when comparing to phase one, but in addition features like the
expire_time and last_send seemed to become significantly more important compared
to both phase one and phase two.

cwnd expire_time rtt cwnd_diff ssthresh data_segments_sent pacing_rate lost


300 240 114.847 1 128 9 36.3 False
301 240 115.312 1 128 9 36.3 False
302 240 115.724 1 128 9 36.3 True
299 320 120.028 -3 217 6 38.7 False
295 316 119.985 -4 217 6 39.3 False
112 96 44.623 1 78 6 34.9 False
112 92 44.623 1 78 0 34.9 False
112 88 44.623 1 78 0 34.9 True
78 244 47.288 -4 78 9 33.2 False
78 236 39.652 -4 78 9 48.4 False
7 256 56.066 1 4 0 1.98 False
7 252 56.066 1 4 0 1.98 False
7 248 56.066 1 4 0 1.98 True
6 248 51.91 -1 4 1 2.68 False
4 248 49.041 -2 4 2 2.83 False
111 96 44.74 1 78 10 34.5 False
112 96 44.514 1 78 8 35.0 False
112 96 44.461 1 78 10 35.0 True
81 72 32.289 1 78 8 34.9 False
82 72 32.627 1 78 8 34.9 False

Table 3.38: An excerpt from the Cubic phase three training dataset, shown as groups of five consecutive samples taken from different parts of the aggregated training data. Each group consists of one True sample, the two False samples that occurred right before it, and the two False samples that occurred right after it

Looking at the training data excerpts from Reno and Cubic in Table 3.37 and Table 3.38, respectively, the reason for the expire_time feature having such high
importance in the phase three classifiers could be partially attributed to a pattern where
the value seemed to be higher for the False samples that directly follow a True sample.
However, at least for Cubic, this did not always seem to be the case, with the values for the False samples directly following a True sample sometimes being lower than for the True sample. This can be seen in the last group of samples in Table 3.38, where the expire_time value is larger for the True sample before it decreases for the False samples directly after.
The reason for the high importance for the last_send feature could be partially
explained by this being a value that seemed to increase as the samples got closer to a
True case, meaning that it seemed to increase with increasing congestion, before dropping
down to a substantially smaller value again for the False samples directly following a
True case. This was especially true for samples coming from outputs where the BDP
was generally smaller. In cases where the samples came from outputs where the BDP
was larger, the last_send values seemed to be mostly stable, only sometimes increasing
for a True sample before dropping down again for the False samples afterwards. For
example, the value could be 4 for almost all the samples from a specific ss output, only
sometimes being a different value, such as 8, for some of the True cases.
As in phase two, the cwnd_diff feature had quite high and about the same importance
for both phase three classifiers, as illustrated in Table 3.39 and 3.40 for Reno and Cubic,
respectively. Like discussed for the phase two classifiers, this was probably used to make
sure that the model did not label cases where the cwnd was decreasing as Lost. However,
the cwnd_diff was not always negative for some of the False samples that followed a
True sample. There were also many True samples that had a negative cwnd_diff value
in the phase three training data, where one such example is shown in Table 3.37 for
the Reno phase three training data. There seemed to be a pattern where some of

69
Chapter 3. Machine learning model design and evaluation

Feature Importance
timer_name 0.00
expire_time 0.26
retrans 0.00
rto 0.04
rtt 0.04
rtt_variance 0.02
cwnd 0.07
ssthresh 0.05
data_segments_sent 0.04
last_send 0.14
pacing_rate 0.02
min_rtt 0.03
max_rtt 0.06
cwnd_diff 0.13
min_cwnd 0.03
max_cwnd 0.02
min_ssthresh 0.04
max_ssthresh 0.02

Table 3.39: Feature importances for the Reno phase three model

Feature Importance
timer_name 0.00
expire_time 0.23
retrans 0.00
rto 0.05
rtt 0.04
rtt_variance 0.02
cwnd 0.06
ssthresh 0.05
data_segments_sent 0.03
last_send 0.13
pacing_rate 0.03
min_rtt 0.03
max_rtt 0.06
cwnd_diff 0.15
min_cwnd 0.03
max_cwnd 0.03
min_ssthresh 0.04
max_ssthresh 0.02

Table 3.40: Feature importances for the Cubic phase three model

There seemed to be a pattern where some of the connections with generally small cwnd values had scenarios where the cwnd first
decreased, then stayed at the same level for a while before it decreased even more. For
example, cwnd decreased from 15 to 12, then there were many samples with cwnd 12,
before it decreased further to 10, and so on. In cases where the cwnd values generally
were larger, the changes in values from one sample to the next seemed larger as well,
and this issue seemed to be less present.

Feature Phase one importance Phase two - phase one Phase three - phase two
timer_name 0.00 +0.00 +0.00
expire_time 0.04 +0.10 +0.12
retrans 0.00 +0.00 +0.00
rto 0.00 +0.00 +0.03
rtt 0.07 +0.03 -0.06
rtt_variance 0.05 -0.03 +0.00
cwnd 0.33 -0.28 +0.02
ssthresh 0.00 +0.02 +0.03
data_segments_sent 0.03 -0.01 +0.02
last_send 0.00 +0.03 +0.11
pacing_rate 0.10 -0.06 -0.02
min_rtt 0.00 +0.10 -0.07
max_rtt 0.00 +0.14 -0.08
cwnd_diff 0.38 -0.22 -0.04
min_cwnd 0.00 +0.03 +0.01
max_cwnd 0.00 +0.04 -0.02
min_ssthresh 0.00 +0.04 -0.002
max_ssthresh 0.00 +0.06 -0.04

Table 3.41: Differences in feature importances for Reno models across phases

Feature Phase one importance Phase two - phase one Phase three - phase two
timer_name 0.00 0.00 0.00
expire_time 0.04 +0.01 +0.17
retrans 0.00 +0.00 +0.00
rto 0.00 +0.02 +0.04
rtt 0.10 -0.02 -0.04
rtt_variance 0.05 -0.03 +0.00
cwnd 0.41 -0.37 +0.02
ssthresh 0.15 -0.12 +0.01
data_segments_sent 0.11 -0.06 -0.02
last_send 0.00 +0.02 +0.11
pacing_rate 0.06 -0.02 -0.01
min_rtt 0.00 +0.09 -0.06
max_rtt 0.00 +0.14 -0.08
cwnd_diff 0.00 +0.19 -0.04
min_cwnd 0.07 -0.05 +0.01
max_cwnd 0.00 +0.08 -0.04
min_ssthresh 0.00 +0.03 +0.01
max_ssthresh 0.00 +0.09 -0.07

Table 3.42: Differences in feature importances for Cubic models across phases

The reason for there being some samples labeled as False in the training data immediately following True cases that also had a positive cwnd_diff value, as illustrated in Table
3.38, could be explained by the way that the cwnd_diff was assigned to the samples.
As discussed in Section 3.3, data was captured using the ss network utility [85]. Most
of the time, the output from ss contained the various fields that should be used for
the machine learning features in the training data, such as cwnd, rtt, and so on, but
there were cases where some fields were missing. These cases were discarded in the data transformation step when parsing the outputs from the data collection step, and were
therefore not included in the final training data as samples. However, when calculating
the cwnd_diff value using the function in Listing 3.8, as long as the cwnd value was
present in the output, it was considered and added to a separate data structure from
the one that contained the samples that should be present in the final training data.
There were therefore two data structures: one for all the outputs that contained the
cwnd field from ss and another for all the valid outputs that contained all the required
fields that should be present in the final samples. The first one was used to assign the
cwnd_diff, but since both data structures contained references to the same Python
dictionary object, this was the value present in the final samples as well.
This can be explained more clearly by considering a scenario where you have three
outputs from ss that should be parsed and potentially included in the final training
data as samples. The first output contains all the fields that need to be present in
order to extract all the machine learning features and construct a valid sample, and is
therefore regarded as “valid”. The third output is the same, also “valid”. The second
output however, is missing one field, for example the ssthresh. This second output is
therefore not valid because it is missing a field that needs to be extracted in order to
construct the relevant feature, and should not be included in the final training data.
However, this second output is not missing the cwnd field. It is therefore added to the separate data structure that contains all the outputs that had a valid cwnd field with a value. All three outputs have valid cwnd fields with the following values: 4, 2, and
3 for the first, second, and third output, respectively. For the valid outputs, adding the
cwnd_diff feature for the first output is not possible, because it is the first and there
is therefore no previous sample to use for calculating it. Adding the cwnd_diff feature for the third output is possible, however, and since the second output had a valid cwnd
field, the cwnd_diff feature value for the third output is 3 − 2 = 1. If the second output
had not been considered at all, the value would have been 3 − 4 = −1 instead.
This could help explain the cases where the False samples have a positive cwnd_diff
value. Only using the valid outputs for this instead could perhaps have prevented this.
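A small sketch reproducing the scenario above, with the three outputs reduced to hypothetical dictionaries:

outputs = [
    {"cwnd": 4, "ssthresh": 288},  # valid
    {"cwnd": 2},                   # invalid: ssthresh missing, discarded later
    {"cwnd": 3, "ssthresh": 288},  # valid
]

# cwnd_diff is assigned against the previous output that had a cwnd field,
# regardless of whether that output was valid.
prev = None
for out in outputs:
    if "cwnd" in out:
        if prev is not None:
            out["cwnd_diff"] = out["cwnd"] - prev["cwnd"]
        prev = out

# Only the valid outputs become training samples, but they keep the diff
# computed against the discarded output.
valid = [out for out in outputs if "ssthresh" in out]
print(valid[1]["cwnd_diff"])  # 1 (3 - 2), not -1 (3 - 4)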
Fully explained in Section 3.4.4, the labeling procedure used to label the phase three
training data labeled samples by looking at the previous, current, and next samples.
For a given sample referred to as the current sample in this case, the cwnd value was
compared to the previous and next sample. If the previous was smaller or the same and
the next was smaller, the current sample was labeled as Lost — otherwise, the sample
was labeled as Not Lost. This was meant to simulate the peaks that can be seen in
the cwnd plot of, for example, Reno in Figure 2.1. Making the labeling less sensitive to
changes in cwnd could perhaps have solved the problem of True samples having negative
cwnd_diff values and improved model performance, by making sure that packets were
only labeled as Lost if the next measurement had an appropriately smaller value, instead
of just checking if it was smaller. Another thing that could have combatted the problem,
would be to check if there is currently a downward trend — meaning that for a current
sample, the next sample has a smaller cwnd value — and only label a packet as Lost
if this is the first occurrence of a downward trend in the current downward trend (and
reset this when cwnd is growing again).
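A minimal sketch of the peak rule as originally described — the actual labeling code is not shown here and may differ in details:

def label_lost(cwnds: list) -> list:
    """Label a sample as Lost if its cwnd is a local peak: the previous
    value is smaller or equal and the next value is smaller."""
    labels = [False] * len(cwnds)
    for i in range(1, len(cwnds) - 1):
        if cwnds[i - 1] <= cwnds[i] and cwnds[i + 1] < cwnds[i]:
            labels[i] = True
    return labels

# Example: the peak value 5 is labeled as Lost.
assert label_lost([3, 4, 5, 2, 2]) == [False, False, True, False, False]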
Seeing as the number of samples in the phase three datasets was very large, with about ∼7M samples in both the Reno and Cubic phase three training datasets, exploring a different approach using a deep neural network instead of a gradient booster could perhaps have been a better option, according to research comparing the two [10], which suggests that for very large tabular datasets — where “very large” in this case refers to datasets with roughly ∼10M samples — with predominantly continuous features, modern neural network architectures may have an advantage over gradient boosting frameworks such as the ones that were used in this thesis.

Chapter 4

Model inference and results

Training and evaluating a machine learning model on labeled data, as discussed in detail
in Chapter 3, can give a good indication of its performance and what it might be used
for. However, applying a trained machine learning model to handle some task in real
time can provide an even better indication of its performance and usefulness.
For example, a trained binary classifier could classify a sample as True even though it was labeled as False in the data used for model evaluation, and this could still be a useful prediction when
applying the model to a given problem. When evaluating the model in the training
and evaluation steps, these cases would be regarded as wrongfully classified, and would
therefore contribute to worse performance when looking at the relevant performance
metrics, such as precision, recall, and F1 score (Section 2.2.5). Therefore, looking at
performance metrics alone is not always enough and does not necessarily give a complete
picture of the potential usefulness of a machine learning model.
For the problem discussed in this thesis, if a sample was classified as True (meaning
packet loss) even though it was not a sample that should be classified as True according
to the training data — meaning that it was not the sample that was taken right at the
peak before the packet loss happened — it could still be a “correct” prediction in the
sense that it was very close to the actual True case. Referring to the cwnd plot of Reno
(Figure 2.1), if samples that were close enough to the peak were classified as lost, this
could prove useful in the sense that it could serve as an indicator of congestion and most
likely imminent packet loss.
Model inference is described by Google in the following way:
Machine learning inference is the process of running data points into a
machine learning model to calculate an output such as a single numerical
score. This process is also referred to as “operationalizing a machine learning
model” or “putting a machine learning model into production.” [62]
The data points were produced using ss [85] from a TCP connection, transformed and
prepared as input data by a data preparation module, and the output calculated as a
single numerical score by a prediction module, all of which are briefly explained below
and in detail in their own sections. The entire setup is visualized in Figure 4.1. The goal
was to apply the model and investigate its real-time prediction performance on a TCP
connection.


Figure 4.1: The test setup showing the flow of data when the model was applied to perform
predictions on a connection in real time

TCP connection A script that started a TCP connection configured with the desired
parameters, started polling data from the connection using ss, and started TShark
[88] in capture mode to capture packet information from the connection for later
analysis.
Data preparation module A Python script that was started and run as a daemon in
the background by the abovementioned connection script to create the prepared
input data in the form of a csv file for the prediction module.
Prediction module Similarly to the data preparation module, the prediction module was a Python script that was started and run as a daemon in the background by the
connection script. The prediction module loaded the relevant exported classifier
on startup, and used the classifier to perform predictions based on the input data
that was prepared by the data preparation module.
The final output of the prediction module was a prediction in the form of a boolean value
(1 or 0) that was written to a file which was watched by the connection for changes.
Based on the contents of said file, the connection toggled ECN (Section 2.1.9) on or
off at the router, in order to signal congestion and reduce the sending rate dynamically
based on model predictions instead of waiting for packet loss to happen.
Only the phase three models (Section 3.7) were considered and used when running the
tests described in this chapter.

4.1 TCP connection setup and model inference


As briefly mentioned in the beginning of this chapter, a bash script was created that
was responsible for starting and configuring the connection as desired for one specific
test, in addition to starting the other required components of the test setup. This script
supported multiple command line arguments for the various connection parameters, such
as: duration to run the test for, which congestion control algorithm to use, the delay
to configure the connection with, the bandwidth to configure the connection with, how
many background flows should be started, and so on. The script also supported running
the connection with either model inference enabled or not enabled. In the former case,
multiple other modules would need to be started to support the entire model inference
procedure, as briefly explained in the beginning of this chapter, and as illustrated in
Figure 4.1. In the latter case, where no model inference was enabled, the script more
or less mimicked the same script that was used in the data collection step, described in
Section 3.3.
The connection was created as either a single flow between a sender and receiver or as
a single flow with background traffic — where the background traffic was generated in
the form of multiple background flows that were only used to impact the behavior of
the single flow that was considered the foreground flow and that was used for the data
analysis and input data creation — depending on the supplied command line arguments.
In both cases, the connection was created and started using Mininet [56], using the same
setup that was described in Section 3.3.1. To briefly summarize: The connection was
created using three nodes: a sender h1, a router r, and a receiver h2. These nodes
were configured with various options on startup using a custom Python script that was
supplied to the Mininet command. For each flow, an iPerf3 server was started at the
receiver, while an iPerf3 client was started at the sender in order to connect to the server.
ECN was enabled at the sender and receiver nodes by setting the ECT bit to 1, so that
packets could be marked at the router using the CE bit later (Section 2.1.9) based on
model predictions.
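As a rough sketch of such a three-node topology using Mininet's Python API — omitting the IP address and route configuration needed to actually send traffic through r, and using hypothetical parameter values:

from mininet.net import Mininet
from mininet.link import TCLink

net = Mininet(link=TCLink)
h1 = net.addHost('h1')  # sender
r = net.addHost('r')    # router
h2 = net.addHost('h2')  # receiver
net.addLink(h1, r, bw=50, delay='35ms')
net.addLink(r, h2, bw=50, delay='35ms')
net.start()

r.cmd('sysctl -w net.ipv4.ip_forward=1')       # let r forward packets
h2.cmd('iperf3 -s -D')                         # iPerf3 server at the receiver
h1.cmd(f'iperf3 -c {h2.IP()} -C cubic -t 60')  # client; blocks for the test
net.stop()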
The connection was configured to watch a given file for changes based on a configured
interval and, based on the contents of said file, either CE mark or not CE mark outgoing
packets. This was run from the router node in the connection, so that the outgoing
packets from the router were CE marked based on the contents of said file. This was
done using an iptables rule, as shown in Listing 4.1. The watch interval was configured
so that the file contents would be checked three times per RTT. This was done to ensure that a prediction happened at least once per RTT. The data preparation (Section 4.2)
and prediction (Section 4.3) modules were timed multiple times using the Python timeit
module [87] in order to get an idea of the average running time of the modules when
doing model inference. The timings indicated that the average running time on the Linux
machine that was used for testing was about 0.15ms and 5ms for the data preparation
module and prediction module, respectively.
toggle_ecn() {
    local input_file_path="$1"
    local delay="$2"
    local watch_interval=$((delay / 3))
    local prev_content=""

    while true; do
        local file_content=$(cat "$input_file_path")

        if [[ "$file_content" != "$prev_content" ]]; then
            # Clear previous rules.
            iptables -t mangle -F OUTPUT

            if [[ "$file_content" == "1" ]]; then
                # Enable ECN.
                iptables -t mangle -A POSTROUTING -p tcp -j TOS --set-tos 3
            elif [[ "$file_content" == "0" ]]; then
                # Disable ECN.
                iptables -t mangle -D POSTROUTING -p tcp -j TOS --set-tos 3
            fi

            # Update previous content.
            prev_content="$file_content"
        fi

        sleep $((watch_interval / 1000)).$((watch_interval % 1000))
    done
}

Listing 4.1: Bash function for watching a given file for changes and either enabling or disabling ECN using an iptables rule
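The running-time measurements mentioned above could be taken with a small helper along these lines, where prepare_input and predict_once are hypothetical stand-ins for the entry points of the two modules:

import timeit

# Run each callable 1000 times and report the average duration in milliseconds.
prep_avg_s = timeit.timeit(lambda: prepare_input(ss_output), number=1000) / 1000
pred_avg_s = timeit.timeit(lambda: predict_once(input_row), number=1000) / 1000
print(f"data preparation: {prep_avg_s * 1e3:.3f} ms")
print(f"prediction: {pred_avg_s * 1e3:.3f} ms")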

By CE marking all outgoing packets from the router if the contents of the watched file
containing the model prediction indicated that this should be done, the receiver h2 could
see this bit and send ECN-Echoes (ECE) to the sender h1 as part of the TCP header in
the ACKs. The sender then reacted to this by reducing its sending rate and setting the
Congestion Window Reduced (CWR) bit in the TCP header of the packets that were sent
to the receiver, as explained in Section 2.1.9. This way, the sending rate was dynamically
reduced based on model predictions instead of as a result of packet loss (Section 2.1.6).
The script also started a process in the background for continuously polling data using
ss. In addition to appending the ss data to a file to produce a txt with all the various
outputs, ss data was written to a separate txt file that only contained a single ss
output. This file was used by the data preparation module, as described in Section 4.2.
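The polling itself was part of the bash connection script; as an illustration only, an equivalent Python sketch of the loop — with hypothetical paths and polling interval — could look like this:

import subprocess
import time

def poll_ss(dst_ip: str, all_path: str, latest_path: str, interval_s: float) -> None:
    """Poll TCP socket statistics with ss and persist them for later use."""
    while True:
        out = subprocess.run(["ss", "-ti", "dst", dst_ip],
                             capture_output=True, text=True).stdout
        with open(latest_path, "w") as f:  # single, most recent output only
            f.write(out)
        with open(all_path, "a") as f:     # full history for later analysis
            f.write(out)
        time.sleep(interval_s)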
In addition to starting a process for polling data using ss, the connection script started
TShark in capture mode to capture both outgoing and incoming packets on the h1-r
interface. This resulted in pcap files that were later used to calculate metrics such
as throughput and retransmissions, in order to compare the results between the various
tests. The resulting pcap files were also analyzed to check the behavior of the connection

78
4.2. Data preparation module

with both model inference enabled and disabled. In the former case, it was of great
interest if the router correctly CE marked outgoing packets and the ECE bit was set by
the receiver on the ACKs going to the sender.
The script supported being run in model inference mode or no model inference mode,
where the former referred to the case where model predictions should be used to either
CE mark or not CE mark outgoing packets at the router based on said predictions.
However, the script also supported being run in timestamp mode, referring to a mode
where model predictions were enabled but packets were not CE marked at the router,
meaning that the connection was running as normal. When the connection was run in
timestamp mode, all the model predictions and their timestamps were written to an
output file, which was later used to produce plots of the cwnd overlaid by the model
predictions. These are shown and discussed in Section 4.4.
The connection script created a directory based on the passed command line arguments,
where all the relevant output files or files that were used by the other components of the
test setup resided.

4.2 Data preparation module


The data preparation module was a Python script that was started and run in the
background as a daemon by the connection script if model inference was turned on
based on the relevant command line flag. The data preparation module was configured
to watch a given directory for changes to a specific file, this file being the ss output
file that was created by the connection script when polling data from the connection.
To accomplish this, the data preparation module supported command line arguments
such as the relevant directory path that should be watched, the input file that should
be loaded and parsed, and where the output should be saved.
The relevant directory as specified by the abovementioned command line argument was
watched for changes using the Watchdog module [90], where an Observer instance
[92] was created and configured with a custom event handler class by extending the
FileSystemEventHandler class [91], as shown in Listing 4.2. This meant that the
directory produced by the connection script was observed for changes, and each time
there was a change, the relevant file, as specified by the input_file command line
argument was loaded and parsed.


def observe(dir_path: str, file_path: str, output_path: str) -> None:
    """Observe the directory with the given path for changes and prepare input data.

    Observes the directory with the given path for changes. If there is a change,
    an event handler is called that loads the file located at the given path
    and prepares the input data for the classifier.

    Args:
        dir_path: The directory path that should be watched.
        file_path: The path to the text file that should be loaded and parsed.
        output_path: Path to where the output should be saved.
    """
    observer = Observer()
    event_handler = EventHandler(dir_path, file_path, output_path)
    observer.schedule(event_handler, path=dir_path)
    observer.start()

    print("Data preparation module started. Waiting for changes...\n\n")

    try:
        while observer.is_alive() and time.time() < event_handler.timeout:
            observer.join(1)
    finally:
        observer.stop()
        observer.join()

Listing 4.2: Python function using the Observer class from Watchdog to observe a given directory for changes and using a custom event handler to handle various events such as created, modified, deleted, and so on

Loading and parsing the ss output file was done using a modified version of the approach
described in Section 3.4, where various functions were defined and used to extract the
relevant values from the ss output, such as cwnd, rtt, rto, and so on to construct the
final machine learning features that should be present in the input data to the prediction
module.
The more special features, such as the various min and max features were handled by
being defined as instance variables on the event handler class, and therefore continuously
updated as the connection progressed. This was also the reason why the prediction
module was run as a daemon for the entire duration of the connection — so that the
various features could be persisted and used for subsequent outputs to create input data
for the prediction module.
One thing that should be noted is the way that the cwnd_diff feature was
handled by the data preparation module. This feature, fully explained in Section 3.4.2,
represented the difference between the current cwnd value and the previous one that
was not the same. However, in the data preparation module, a simplified version of this
was calculated and added as part of the final input file that should be supplied to the
prediction module. Here, only the difference between the current and previous cwnd was
used to calculate the cwnd_diff, meaning that the value was often 0. This was done
to make it simpler and reduce computation overhead when running everything in real
time.
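A condensed sketch of how this state could be kept on the event handler — names are hypothetical and the actual implementation may differ:

from watchdog.events import FileSystemEventHandler

class EventHandler(FileSystemEventHandler):
    def __init__(self) -> None:
        super().__init__()
        self.min_rtt = float("inf")
        self.max_rtt = float("-inf")
        self.prev_cwnd = None

    def update_features(self, sample: dict) -> dict:
        # Running min/max features persist across ss outputs because the
        # module runs as a daemon for the whole connection.
        self.min_rtt = min(self.min_rtt, sample["rtt"])
        self.max_rtt = max(self.max_rtt, sample["rtt"])
        sample["min_rtt"], sample["max_rtt"] = self.min_rtt, self.max_rtt

        # Simplified cwnd_diff: the difference to the immediately preceding
        # cwnd value, so it is often 0 (unlike the training-data definition).
        sample["cwnd_diff"] = (0 if self.prev_cwnd is None
                               else sample["cwnd"] - self.prev_cwnd)
        self.prev_cwnd = sample["cwnd"]
        return sample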


In the data collection step, fully explained in Section 3.3, it was decided that the initial
slow start peak (Section 2.1.6) should be skipped when collecting the data used for
training. This was done to avoid outliers and produce more consistent data with regards
to the machine learning features used for the final model. For the same reason, predictions
were not performed before the initial slow start peak was over when running model
inference in real time on a TCP connection. The data preparation module therefore simply
returned None for the first few seconds of the connection — which the prediction module
was configured to handle by returning False — so that model inference did not kick in
before the connection had reached congestion avoidance mode and the various features were
more consistent with the data that the final model had been trained on.
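A hedged sketch of this guard, where the duration constant and function names are assumptions for illustration rather than the actual implementation, could look like this:

import time

SLOW_START_SKIP_SECONDS = 3  # assumed value; the text only says "a few seconds"

def prepare_input(start_time: float, parsed_features: dict):
    """Return the prepared features, or None while still in the slow start peak."""
    if time.time() - start_time < SLOW_START_SKIP_SECONDS:
        return None
    return parsed_features

def predict_or_skip(prepared_input, predict_fn) -> bool:
    """Map None (connection not yet in congestion avoidance) to False."""
    if prepared_input is None:
        return False
    return bool(predict_fn(prepared_input))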
Having read and parsed the relevant fields from the ss output file produced by the
connection script, a csv file was created in the same directory as all the other files. This
csv file represented the final prepared input data, containing the same machine learning
features that were used to train the model.

4.3 Prediction module


Like the data preparation module, the prediction module was a Python script that
was started and run in the background as a daemon by the connection script if model
inference was turned on via the relevant command line flag. Similarly, the prediction
module was configured to watch a given directory for changes to a specific file, namely
the csv output file created by the data preparation module, which contained the prepared
input data for performing a prediction. This input data contained all the features fully
transformed and ready for the model to use.
On startup, the prediction module simply loaded the relevant exported classifier — depending
on the passed command line argument — using the XGBClassifier.load_model
method [99], and watched the output file from the data preparation module for changes.
Each time there was a change to the file, it was loaded as a pandas.DataFrame [69]
object using the pandas.read_csv method [68], similarly to what was described in the
data transformation step for the training data (Section 3.4).
The prediction module then used the loaded classifier together with the input data
from the data preparation module and an optional classification threshold to perform a
prediction, outputting either a probability or the result directly using the predict_proba
[101] or predict [100] method of the XGBClassifier [96] class, respectively. If a custom
classification threshold was supplied to the connection script, and therefore to the prediction
module, the predict_proba method was first used to output a probability, and the final
prediction was obtained by returning 1 if the probability was greater than or equal to the
threshold and 0 otherwise (Section 2.2.5).
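A minimal sketch of this logic, assuming illustrative file names and a threshold of 0.25, could look as follows:

import pandas as pd
from xgboost import XGBClassifier

# Load the exported classifier and the prepared input data; the file names
# here are assumptions for illustration.
clf = XGBClassifier()
clf.load_model("reno_model.json")
features = pd.read_csv("prepared_input.csv")

threshold = 0.25  # optional custom classification threshold
if threshold is not None:
    # predict_proba returns one row per sample: [P(Not Lost), P(Lost)].
    prob_lost = clf.predict_proba(features)[:, 1]
    prediction = (prob_lost >= threshold).astype(int)
else:
    # Without a custom threshold, predict applies the default 0.5 cut-off.
    prediction = clf.predict(features)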
As mentioned in Section 4.1, the connection script supported being run in timestamp mode.
In this mode, the predictions were performed as usual but not used for anything other than
appending the timestamp and prediction result to a file, meaning that the router did not
CE mark outgoing packets when running in this mode. When running the connection
script in timestamp mode, the prediction module was configured to run in the same
mode using a command line flag. This way, the input data could be loaded and a
prediction could be performed as usual, but the prediction, together with the timestamp
— calculated by subtracting the time when the prediction module was started from the
current time — was appended to a file that resided in the same directory as all the other
relevant files, as described in Section 4.1.
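Sketched with assumed names, the timestamp calculation and file append amount to:

import time

module_start = time.time()  # recorded when the prediction module starts

def append_prediction(prediction: int, path: str = "predictions.txt") -> None:
    """Append the elapsed time since module start and the prediction result."""
    elapsed = time.time() - module_start
    with open(path, "a") as f:
        f.write(f"{elapsed:.3f},{prediction}\n")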
The final output of the prediction module was a boolean value (1 or 0) representing the
prediction, written to a file which the connection script watched for changes, as described
in Section 4.1.

4.4 Results
Results were analyzed by calculating the throughput and retransmissions for various
scenarios. For each scenario, consisting of a specific combination of delay, bandwidth,
and queue size connection parameters, tests were run with and without model inference
enabled. For the model inference related part of the tests, three classification thresholds
(Section 2.2.5) were considered: 0.1, 0.25, and 0.5. This resulted in four different groups
of results for each scenario:
1. No model inference enabled: This group contained the results where no model
inference was enabled and served as a baseline for comparison with the other
groups where model inference was enabled and various classification thresholds
were considered.
2. 0.1 threshold: This group contained the results where model inference was
enabled and the classification threshold was configured to 0.1, so that samples
with an output probability at or above 0.1 were classified as Lost, and samples
with an output probability of less than 0.1 were classified as Not Lost.
3. 0.25 threshold: Same as for 0.1 but with a classification threshold of 0.25.
4. 0.5 threshold: Same as for 0.1 and 0.25 but with a classification threshold of 0.5.
For each scenario and each of these groups, the relevant test was run five times in order
to get an average result with regards to retransmissions and throughput. The tests were
run multiple times because there seemed to be large variability between runs for a given
scenario with regards to the number of retransmissions and throughput — especially for
connections with added background traffic, where the behavior of the connection, and
therefore the results, were less predictable due to the flow being impacted by the background
traffic. For a given scenario, this resulted in 20 different files with results: five for each of
the four groups. Each of these files contained the calculated throughput and the number
of retransmissions that occurred. The average throughput and retransmissions for each
scenario were then calculated and saved to a new file that represented the average results
for that specific scenario.
The abovementioned results in the form of throughput and retransmissions were
calculated by parsing the pcap file that was output by TShark when capturing packet
data for the duration of the test, as explained in Section 4.1. This was done using
a Python script that first parsed the relevant pcap file and read the various packets
into memory using the rdpcap function from the scapy module [80]. Since TShark was
configured to capture all incoming and outgoing packets at the h1-r interface between
the sender and receiver (Section 4.1), the resulting packet list from the rdpcap function
was filtered to only include the packets that originated from the sender h1. Throughput


was then calculated by first creating a new list which contained all the packets excluding
retransmissions, and dividing the total bytes sent by the duration to get the megabits
per second (Mbps), as shown in Listing 4.3.
from scapy.all import TCP

def get_throughput(packets: list, duration: int) -> float:
    """Calculate the throughput in Mbps.

    Args:
        packets: List of packets that should be included in the calculation.
        duration: The duration of the measurement in seconds.

    Returns:
        The throughput in Mbps.
    """
    total_bytes = sum(len(packet[TCP].payload) for packet in packets if TCP in packet)
    throughput_mbps = (total_bytes * 8) / (1000000 * duration)

    return throughput_mbps

Listing 4.3: Python function for calculating the throughput in Mbps given a list of packets and
the duration of the connection

The reason for filtering out the retransmissions before calculating the throughput was
to avoid counting the same packets multiple times when calculating the total bytes sent.
Had this not been done, the throughput values would have been generally higher, but
they would have been consistently inflated across all tests, so the comparisons between
groups would have remained the same.
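A minimal sketch of such a filter, mirroring the duplicate-sequence-number check in Listing 4.4 below and using illustrative names, keeps only the first packet seen for each TCP sequence number:

from scapy.all import TCP

def without_retransmissions(packets: list) -> list:
    """Return the packet list with retransmitted TCP packets removed."""
    seen_seq = set()
    unique_packets = []
    for packet in packets:
        if TCP in packet:
            seq = packet[TCP].seq
            if seq in seen_seq:
                continue  # duplicate sequence number: treat as retransmission
            seen_seq.add(seq)
        unique_packets.append(packet)
    return unique_packets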
Retransmissions were calculated by first filtering the original packet list to only include
the packets sent after the initial slow start peak (Section 2.1.6) by skipping the first
second. This was done to make the measurements more consistent, and to follow the
same approach used for creating the training data (Section 3.3) and when performing
the predictions during the model inference tests (Section 4.2). The number of
retransmissions was then calculated by going through the packet list and checking
for duplicate sequence numbers, as shown in Listing 4.4.


from collections import defaultdict

from scapy.all import TCP

def get_retransmissions(packets) -> int:
    """Calculate the number of retransmissions in the pcap file.

    Args:
        packets: List of packets that should be included in the calculation.

    Returns:
        The number of retransmissions.
    """
    seq_numbers = defaultdict(int)
    retransmissions = 0

    for packet in packets:
        if TCP in packet:
            seq = packet[TCP].seq
            seq_numbers[seq] += 1
            if seq_numbers[seq] > 1:
                retransmissions += 1

    return retransmissions

Listing 4.4: Python function for calculating the number of retransmissions given a list of packets

The final metrics, in the form of throughput and number of retransmissions, were saved to
a txt file in the relevant directory for the specific test of a specific scenario, resulting in
five such txt files with metrics for each scenario and each group of results, as described
earlier in this section.
The possible scenarios with regards to delay, bandwidth, and queue size (as a multiplier
of the BDP) that were considered are shown in Table 4.1 below:

Delay (ms)   Bandwidth (Mbps)   Queue size (BDP multiplier)
30           10                 0.25
30           10                 0.5
30           10                 1.0
30           50                 0.25
30           50                 0.5
30           50                 1.0

Table 4.1: The scenarios that were considered when running the model inference related tests

Each scenario represented a specific permutation of the connection parameters shown in
Table 4.1, where one permutation consisted of a specific value for delay, bandwidth, and
queue size in the form of a BDP multiplier. Given a delay of 30ms, a bandwidth of 10Mbps,
and a queue size of 1 BDP, the permutation represented the scenario where the delay was
configured to 30ms, the bandwidth was 10Mbps, and the queue size was equal to 1 BDP,
with the BDP always calculated from the delay and bandwidth of the specific scenario.
If the queue size was configured to 0.25 BDP, it represented 1/4 of the BDP calculated
from the configured delay and bandwidth.
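As a worked example, assuming the BDP is computed as bandwidth multiplied by delay with the configured delay taken as the round-trip time (the text does not spell out the exact formula), the queue sizes follow directly from the parameters:

def queue_size_bytes(bandwidth_mbps: float, delay_ms: float, multiplier: float) -> float:
    """Queue size as a multiple of the bandwidth-delay product, in bytes."""
    bdp_bytes = (bandwidth_mbps * 1_000_000 / 8) * (delay_ms / 1000)
    return multiplier * bdp_bytes

# 30ms delay and 50Mbps bandwidth give a BDP of 187,500 bytes, so a
# 0.25 multiplier corresponds to a queue of 46,875 bytes.
print(queue_size_bytes(50, 30, 0.25))  # 46875.0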
For each of these six scenarios, TCP experienced at least 10 sawteeth in congestion
avoidance (Section 2.1.6), as shown in Figure 4.2.


Figure 4.2: cwnd plot from a connection configured using TCP Reno with 30ms delay, 50Mbps
bandwidth, and 1 BDP queue size

As mentioned in the beginning of this section, the tests were run with and without model
inference enabled, where 0.1, 0.25, and 0.5 were considered as classification thresholds
in the latter case. Since each test was run five times, the final result consisted of 120
distinct txt files with metrics in the form of throughput and number of retransmissions.
In addition to calculating metrics for all the scenarios and comparing the results, tests
were run in timestamp mode — described in Section 4.1 — to gather an output file with
predictions and their timestamps, used to produce plots of the cwnd over time with the
predictions overlaid, such as the one shown in Figure 4.3.
The abovementioned tests for the various scenarios — producing 120 distinct txt files —
and the tests in timestamp mode were done for both Reno and Cubic, both as a single
flow and with background traffic.

4.4.1 Single flow results


As mentioned in the previous section, the various scenarios in Table 4.1 were considered
when collecting the results for the single flow case. For each of these scenarios, tests
were run using the connection script described in Section 4.1, where the connection
was configured as a single flow without background traffic using either Reno or Cubic
and a specific combination of the relevant connection parameters as shown in Table
4.1. Tests were run with and without model inference, where 0.1, 0.25, and 0.5 were
considered as classification thresholds. For each test, the total throughput and number of
retransmissions were calculated as described in Section 4.4. As already mentioned there,
each test was run five times to get average values for throughput and retransmissions.
The results were then aggregated by classification threshold, so that the final results
were grouped into four distinct groups, as described in the beginning of Section 4.4: no
model inference, 0.1 threshold, 0.25 threshold, and 0.5 threshold.
In addition to the tests described above and in more detail in Section 4.4, tests were run
in timestamp mode — described in detail in Section 4.1 — for both Reno and Cubic in
order to get a plot of the cwnd overlaid by the model predictions. These plots show the
regular behavior of the connection without model inference enabled, with marks showing
where the prediction module (Section 4.3) decided to classify a sample as Lost (1) — and
therefore where the router would have toggled on CE marking on the outgoing interface
if model inference was turned on to allow the sender to back off.
Figure 4.3 shows a single flow configured with Reno and a quite large BDP to illustrate
packet loss prediction as a first proof of concept. As mentioned, the red marks in the
plot show where the prediction module decided to mark a sample as Lost (1) and where
the sender would have backed off in order to avoid congestion if model inference was
enabled. For this specific test, the predictions were very accurate, which also seemed to
be a general trend throughout the tests — depending on the classification threshold and
configured BDP.

Figure 4.3: cwnd plot overlaid with model predictions from a connection configured using TCP
Reno with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size, and 0.25 classification threshold

Figure 4.4 and Figure 4.5 show a single flow configured with Reno or Cubic respectively,
and a smaller BDP than the connection illustrated in Figure 4.3. For both Reno and
Cubic, the model predictions were mostly accurate. However, in the case of Reno, the
model seemed to miss a sawtooth entirely — as can be seen at the fourth sawtooth in
Figure 4.4 — while it missed considerably more for Cubic. This seemed to be the case
for many of the tests, especially for connections configured with larger BDPs and larger
thresholds. The reason could partially be that a higher classification threshold made the
prediction module less likely to mark a given sample as Lost, because the output
probabilities were mostly very low; a higher BDP seemed to result in more stable values
with fewer fluctuations than smaller BDPs (Section 3.7), which in turn kept the output
probabilities low.

Figure 4.4: cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 30ms delay, 50Mbps bandwidth, 1 BDP queue size, and 0.25 classification threshold
Figure 4.5: cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 30ms delay, 50Mbps bandwidth, 1 BDP queue size, and 0.1 classification threshold

There seemed to be a general trend where connections with larger BDPs performed
better than connections with smaller BDPs. This could partially be attributed to what
was discussed in Section 3.7, where some samples in the training data seemed to be
labeled as True even though they should have been labeled as False. This was mostly
the case for connections with smaller BDPs, where the cwnd fluctuated between values
such as 14 and 15 — for example, a series of five samples with cwnd values 14, 14, 14,
15, 14. According to the heuristic used by the labeling procedure (Section 3.4.4), the
sample with a cwnd value of 15 would be labeled as True in this case, even though this
was not necessarily a peak. In addition, there seemed to be cases for low BDPs where
the cwnd value reached a peak and a sample was correctly labeled as True, before the
values decreased gradually: for example 30 at the peak, then 25 for two samples, then
down to 23 for a few samples, then down to 21, and so on. According to the heuristic of
the labeling procedure, some of these samples in the decreasing phase were labeled as
True because, for a given sample, the cwnd value of the previous sample was the same
and the value of the next was smaller. This resulted in multiple samples in the decreasing
phase — with negative cwnd_diff values — being labeled as True when they should have
been labeled as False. These wrongfully labeled samples could help explain the
comparatively worse performance for connections with smaller BDPs.
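Both failure modes can be reproduced with a small sketch of the plateau-aware peak rule described above (the exact heuristic is defined in Section 3.4.4; boundary handling here is simplified for illustration):

def label_peaks(cwnd_values: list) -> list:
    """Label a sample True when the previous cwnd is not larger and the next is smaller."""
    labels = []
    for i in range(len(cwnd_values)):
        prev_v = cwnd_values[i - 1] if i > 0 else None
        next_v = cwnd_values[i + 1] if i < len(cwnd_values) - 1 else None
        is_peak = (
            prev_v is not None and next_v is not None
            and prev_v <= cwnd_values[i] and next_v < cwnd_values[i]
        )
        labels.append(is_peak)
    return labels

# Small-BDP fluctuation: the lone 15 is labeled True despite not being a real peak.
print(label_peaks([14, 14, 14, 15, 14]))          # [False, False, False, True, False]
# Gradual decrease: the last sample of each plateau (25 and 23) is also labeled
# True, even though the connection is in the decreasing phase.
print(label_peaks([30, 25, 25, 23, 23, 23, 21]))  # True at the 25 and 23 plateau ends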
As illustrated in Figure 4.4 and Figure 4.5 for Reno and Cubic respectively, what was
hypothesized in Section 3.6 and Section 3.7 seemed to hold, with the model often getting
things almost right by marking a sample as Lost slightly before the actual peak, and
sometimes marking packets a bit further from the peak. As the figures show for both
Reno and Cubic, the predictions were still very close to the actual peak, and could be
considered a sensible place to back off when reducing the sending rate to avoid congestion.
Figure 4.6, Figure 4.7, Figure 4.8, and Figure 4.9 show, for Reno and Cubic respectively,
the reduction in retransmissions for the various classification thresholds when model
inference was enabled compared to the baseline case with no model inference, together
with the corresponding throughput change. In all cases, outliers have been removed from
the plots to illustrate the general trends more clearly. For both Reno and Cubic, the
number of retransmissions was almost always reduced when model inference was enabled,
with one exception in the Reno case with a 0.5 classification threshold. This exception
with negative retransmission reduction can mainly be explained by variance between test
runs, where the number of retransmissions sometimes varied greatly between runs. Some
added overhead from running the connection with model inference enabled could also
have contributed to increased retransmissions in cases where very few samples were
marked as Lost. If the tests were run more than five times per scenario, for example 100
times, the retransmission reduction would be expected to always be positive with model
inference enabled.
There seemed to be a negative correlation between the classification threshold and
retransmission reduction: a lower classification threshold resulted in a greater reduction
in retransmissions. However, this always seemed to come at the expense of somewhat
reduced throughput, with a positive correlation between throughput and classification
threshold: a lower threshold resulted in lower throughput and vice versa. These results
therefore illustrate a trade-off between the number of retransmissions and the throughput,
where a classification threshold could potentially be chosen based on which metric is
deemed the most important: fewer retransmissions or higher throughput.
The trade-off between retransmission reduction and throughput change for the single flow
case is illustrated in Figure 4.10 and Figure 4.11 for Reno and Cubic respectively, showing
that as the retransmission reduction grows, the throughput change decreases, and vice
versa. The center coordinates of each circle are determined by the difference in the median
retransmissions and throughput for the relevant threshold compared to the baseline (no
model inference). The diameter of the circles is based on the variance in the retransmission
reduction for the relevant threshold, calculated as the interquartile range by subtracting
the lower quartile from the upper quartile. The plots therefore also illustrate that there
was great variance in the results for the various scenarios, with some scenarios having
greater retransmission reductions for a specific threshold than others compared to the
baseline. This could be attributed to an observed trend, especially for Reno, where
connections with smaller BDPs resulted in more aggressive prediction behavior with
regards to how many samples were labeled as Lost, compared to connections with higher
BDPs, where a lower threshold was often needed in order to produce any predictions at
all. The latter case is illustrated in Figure 4.5, which shows a connection configured with
the largest BDP of the considered scenarios discussed in Section 4.4, where predictions
occurred with a 0.1 classification threshold. As can be seen in the plot, the model was
relatively conservative with the markings, even given the very low classification threshold.
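For clarity, the circle geometry can be expressed as a small sketch (variable names and data shapes are assumptions):

import numpy as np

def circle_for_threshold(reductions, tput_changes):
    """Circle center and diameter for one threshold, relative to the baseline.

    reductions and tput_changes hold the per-scenario retransmission
    reduction and throughput change compared to the no-inference baseline.
    """
    x = np.median(reductions)                      # center, retransmission axis
    y = np.median(tput_changes)                    # center, throughput axis
    q1, q3 = np.percentile(reductions, [25, 75])
    diameter = q3 - q1                             # interquartile range
    return x, y, diameter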


Figure 4.6: Retransmission reduction for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
Figure 4.7: Throughput change for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
Figure 4.8: Retransmission reduction for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds
Figure 4.9: Throughput change for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds

Variance in the retransmission reduction for a specific threshold could also be attributed
to variance between test runs: some test runs for some scenarios could have resulted in
greater retransmission reductions than others, leading to a situation where, for a given
threshold, the difference from the baseline was greater at the upper quartile than at the
lower quartile.
The model predictions seemed to be generally more aggressive for Reno than for Cubic,
which could help explain the difference in retransmission reduction between the two.
As illustrated in Figure 4.6 and Figure 4.8 for Reno and Cubic respectively, the median
retransmission reduction is about the same, with a slightly higher value for Cubic.
However, the upper quartile and maximum retransmission reduction were considerably
higher for Reno, reaching 100% at the maximum. This can be attributed to one specific
scenario with model inference enabled using the lowest threshold (0.1), which resulted in
0 retransmissions, while the baseline case for the same scenario resulted in 45
retransmissions. However, this came with a considerable reduction in throughput, as
shown in Table 4.2.


Figure 4.10: Trade-off between retransmission reduction and throughput change for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
Figure 4.11: Trade-off between retransmission reduction and throughput change for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds

Model inference        Avg. throughput (Mbps)   Avg. retransmissions
No                     44.65                    45
Yes (0.1 threshold)    28.19                    0

Table 4.2: Results for a specific Reno single flow scenario with a configured delay of 30ms, bandwidth of 50Mbps, and queue size of 0.5 BDP

The pcap files from the various test runs of the scenario with model inference
enabled shown in Table 4.2 were analyzed, and the packets were filtered to only show
retransmissions and ECN-Echoes (ECE) by applying the filter below:
tcp.flags.ece==1 or tcp.analysis.retransmission
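For reference, the same expression can be applied offline as a display filter when reading a capture with TShark, for example (the file name here is an assumption):
tshark -r capture.pcap -Y 'tcp.flags.ece==1 or tcp.analysis.retransmission'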
It was observed that the connection backed off at regular intervals without losing any
packets. The cwnd plot from one of the five test runs for this scenario that resulted
in zero retransmissions is shown in Figure 4.12. This demonstrates a successful proof
of concept showing that a machine learning model can be used to predict packet loss
and the sending rate can be reduced proactively by making use of ECN (Section 2.1.9)
to avoid losing packets and resulting retransmissions due to buffer overflows caused by
congestion.
The results seemed to be mostly the same for Reno and Cubic, with some differences in
result variation across thresholds, and with Reno generally producing somewhat higher
retransmission reductions and negative throughput changes. The generally higher values
for Reno could be explained by the aforementioned observation that the Reno model
seemed to be a bit more aggressive with marking samples as Lost — illustrated in Figure
4.4 and Figure 4.5, which show the Reno model marking more packets as Lost even
though its classification threshold was higher, for a connection with the same delay,
bandwidth, and queue size as the Cubic one. The difference in result variation across
thresholds seemed to apply to both algorithms, and this was the reason the tests were
run multiple times to get an average. If the average was smoothed out further by running
the tests many more times, the variance in results for the two algorithms would be
expected to be closer.

Figure 4.12: cwnd plot from a connection configured using TCP Reno with 30ms delay, 50Mbps
bandwidth, 0.5 BDP queue size, and 0.1 classification threshold resulting in zero retransmissions

4.4.2 Background traffic results


As for the single flow results, the various scenarios in Table 4.1 were considered when
collecting the results with background traffic. Tests were run using the connection
script (Section 4.1), where the connection was configured as a single flow with
background traffic using either Reno or Cubic and a specific combination of the
connection parameters shown in the abovementioned table. Background traffic was
always configured as six background flows: three configured with Reno and three with
Cubic.
As in the single flow case, tests were run with and without model inference enabled, where
the same classification thresholds were considered. The same metrics and average values
were calculated as described in Section 4.4. Results were aggregated by classification
threshold, so that the final results for the background traffic case were grouped into four
distinct groups, as described in the beginning of Section 4.4: no model inference, 0.1
threshold, 0.25 threshold, and 0.5 threshold.
As in the single flow case, tests were also run in timestamp mode (Section 4.1) to get
plots of the cwnd overlaid by the model predictions. However, compared to the single
flow case, more plots were produced for the background traffic case to show the relation
between queue size and model performance.
Figure 4.13, Figure 4.15, and Figure 4.17 show a flow configured with Reno and
background traffic with various queue sizes in the form of BDP multipliers. The
connections in all of these cases were configured using the same delay and bandwidth;
only the queue size was varied between 0.25, 0.5, and 1 BDP. In addition, the same
classification threshold (0.5) was used when running the prediction module. Figure 4.14,
Figure 4.16, and Figure 4.18 show the same scenarios for Cubic.
As illustrated by these results, the predictions seem to be mostly accurate — even in the
presence of background traffic. This could indicate that the model has actually learned
a pattern related to when congestion happens, instead of just using a simple heuristic
such as checking whether the cwnd is above a certain value and classifying samples based
on that. Unlike in the single flow cases, in the presence of background traffic the model
has no way of knowing which cwnd value represents a threshold — where higher values
most likely indicate congestion and lower values most likely do not — because the cwnd
values fluctuate much more, and the maximum value of one sawtooth is not the same as
the maximum value of the next. Based on the model evaluation results discussed in
Section 3.7, this indicates that the model used features other than the cwnd to determine
whether a given sample should be classified as True or False.
As also illustrated by these results, the predictions for Reno especially seemed to be
very accurate, with the model predictions being very close to the peaks of each sawtooth,
with some exceptions where the predictions were further from the peak — but always
on the correct side, meaning the side where the cwnd was still growing before congestion
occurred and Multiplicative Decrease (MD) (Section 2.1.6) was used to decrease the
cwnd. As discussed in Section 3.7, this indicates that the model potentially used the
cwnd_diff feature to avoid marking samples as Lost when the cwnd seemed to be
decreasing. As explained in Section 4.2, the model inference setup used a simplified
version of the cwnd_diff feature in the data preparation module — considering only the
difference between the current and previous cwnd values, rather than the difference to
the previous value that was not the same, as used for the training data (Section 3.4.2) —
in order to reduce computation overhead and speed up processing. The accurate behavior
could therefore indicate that the cwnd decreased fast enough after congestion occurred
and MD reduced the cwnd that the simplified cwnd_diff values were negative for many
of these samples.
Compared to the single flow case, higher classification thresholds generally seemed to
work better for the background traffic connections. While very few packets seemed
to be marked as Lost by the model in the single flow case when 0.5 was used as the
classification threshold, this was not the case for background traffic, where the same
threshold generally produced quite favorable results by providing a good balance between
retransmission reduction and negative throughput change, as illustrated in Figure 4.19
and Figure 4.21 for Reno and Cubic respectively. This could partially be attributed to
the fact that the background traffic connections had many more retransmissions on
average, as shown in Table 4.3 and Table 4.4 for Reno and Cubic respectively, which
show the difference in average retransmissions between the single flow and background
traffic cases for each scenario.
Lower classification thresholds, such as 0.1 and 0.25, seemed to be very aggressive on the
background traffic connections with regards to marking samples as Lost. This also
seemed to be the case for the scenarios with the largest BDPs.


Delay (ms)   Bandwidth (Mbps)   Queue size (BDP)   Retrans. (single flow)   Retrans. (bg)
30           10                 0.25               162                      2760
30           10                 0.5                144                      2258
30           10                 1                  92                       1678
30           50                 0.25               82                       1518
30           50                 0.5                45                       1071
30           50                 1                  31                       638

Table 4.3: Difference in retransmissions between the single flow and background traffic cases for TCP Reno for the various scenarios without model inference

Delay (ms)   Bandwidth (Mbps)   Queue size (BDP)   Retrans. (single flow)   Retrans. (bg)
30           10                 0.25               100                      2622
30           10                 0.5                62                       2132
30           10                 1                  42                       1560
30           50                 0.25               49                       1395
30           50                 0.5                63                       1072
30           50                 1                  40                       630

Table 4.4: Difference in retransmissions between the single flow and background traffic cases for TCP Cubic for the various scenarios without model inference

For the highest BDP and queue size scenario — meaning the one configured with a delay
of 30ms, bandwidth of 50Mbps, and queue size of 1 BDP — it was observed that model
inference tests with a classification threshold of 0.1 produced a large number of True
predictions. While many of these predictions were quite close to the peaks (where they
should optimally be), a considerable number of them were also much further from the
peaks, with some even on the “wrong” side of the peaks, meaning the side where the cwnd
was decreasing. This did not seem to be the case for higher classification thresholds, such
as 0.5, which generally produced fewer and more conservative predictions that were
generally more accurate in the sense that they were closer to the peaks. However, it was
also observed that, even with a high classification threshold such as 0.5, the predictions
were not very accurate for the scenarios with lower BDPs, and the models for both Reno
and Cubic were very aggressive with marking samples as Lost.
There seemed to be a positive correlation between BDP and model performance, where
connections with larger BDPs — and larger queue sizes — seemed to have the most
accurate predictions. This could partially be attributed to what was discussed in Section
3.7: for connections with smaller BDPs in the training data, there seemed to be many
samples that were wrongfully labeled as True, while values seemed to be more stable for
the connections with larger BDPs. For connections with smaller BDPs, even a threshold
of 0.5 seemed to be too aggressive, probably leading to the retransmission reduction
values shown in Figure 4.19 and Figure 4.21 being quite close to 60% for the 0.5
classification threshold. The connections with smaller BDPs probably also led to the
retransmission reduction values for the 0.1 and 0.25 classification thresholds being quite
close to 100% for Reno and 80% for Cubic. Interestingly, however, the throughput did
not seem to decrease that much, with the maximum negative throughput change being
a bit below 25% for both.


As in the single flow case, what was hypothesized in Section 3.6 and Section 3.7 seemed
to hold for the background traffic case as well, with the model again often getting things
almost right by marking samples as Lost a bit before the peak. Especially in the cases
with the largest queue sizes, as illustrated in Figure 4.17 and Figure 4.18 for Reno and
Cubic respectively, the predictions were quite accurate and very close to the actual peaks
— with some exceptions showing that the model was sometimes a bit eager. These
exceptions could perhaps be partially attributed to the previously mentioned problem
with the training data, where samples were sometimes incorrectly labeled as True, as
explained for the single flow case and also discussed in Section 3.7.
On the other hand, Figure 4.13 and Figure 4.14 for Reno and Cubic respectively illustrate
that the model was sometimes very eager to classify samples as Lost — especially for
Cubic — which again illustrates the previously mentioned relationship between BDP,
classification performance, and threshold, where higher thresholds seemed to work better
with lower BDPs and vice versa. Using a threshold of 0.75, or perhaps even higher, could
have produced better results for the background traffic scenarios with lower BDPs.
Figure 4.19, Figure 4.20, Figure 4.21, and Figure 4.22 show, for Reno and Cubic
respectively, the reduction in retransmissions for the various classification thresholds
when model inference was enabled compared to the baseline case with no model inference,
together with the corresponding throughput change. In all cases, outliers have been
removed from the plots to illustrate the general trends more clearly. As in the single
flow case, for both Reno and Cubic there was a general trend of reduced retransmissions
when model inference was enabled, with a negative correlation between the classification
threshold and retransmission reduction. Also like the single flow case, there was a positive
correlation between the classification threshold and the throughput, with higher thresholds
generally resulting in higher throughput. These results again illustrate a trade-off between
the number of retransmissions and the throughput, as discussed for the single flow case,
and as further illustrated in Figure 4.23 and Figure 4.24, which show the trade-off between
retransmission reduction and negative throughput change for Reno and Cubic respectively.
Compared to the single flow case, the retransmission reduction was more significant for
higher classification thresholds in the background traffic tests, with a median value of
around 15% for Reno and just below 10% for Cubic at the highest threshold of 0.5. For
the more aggressive threshold of 0.1, the median retransmission reduction seemed
somewhat lower for both Reno and Cubic in the background traffic case — but still
significant, with a median value of just below 40% for Reno and around 35% for Cubic.
However, as in the single flow case, the negative throughput change was quite large for
the lower thresholds of 0.1 and 0.25, at about 20% for both Reno and Cubic. Seeing that
the difference in median retransmission reduction was not that large compared to the
single flow case, it could be argued that a threshold of 0.5 generally seemed best for the
background traffic case, where a relatively small reduction in throughput could be a
decent trade-off for an application that benefits from fewer retransmissions. But, as
Figure 4.23 and Figure 4.24 show, this relationship seemed to be quite linear, and the
choice of threshold could therefore be treated as a parameter where the application
dictates the best value depending on what is more important: fewer retransmissions
(which cause head-of-line blocking delays while packets are retransmitted) or higher
throughput. Also, as discussed previously, since the optimal threshold seemed to vary
depending on the BDP — with lower BDPs generally requiring higher thresholds in
order not to overreact — the threshold could perhaps be tuned dynamically depending
on network conditions such as delay and available bandwidth.
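A hypothetical sketch of such dynamic tuning, with purely illustrative breakpoints that were not evaluated in this thesis, could map the estimated BDP to a threshold:

def pick_threshold(bandwidth_mbps: float, delay_ms: float) -> float:
    """Choose a classification threshold from the estimated BDP (illustrative)."""
    bdp_bytes = (bandwidth_mbps * 1_000_000 / 8) * (delay_ms / 1000)
    if bdp_bytes < 50_000:
        return 0.5   # small BDP: higher threshold to avoid overreacting
    if bdp_bytes < 150_000:
        return 0.25
    return 0.1       # large BDP: lower threshold needed to produce predictions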


Figure 4.13: cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70ms delay, 50Mbps bandwidth, 0.25 BDP queue size, background traffic, and 0.5 classification threshold
Figure 4.14: cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70ms delay, 50Mbps bandwidth, 0.25 BDP queue size, background traffic, and 0.5 classification threshold
Figure 4.15: cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size, background traffic, and 0.5 classification threshold
Figure 4.16: cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size, background traffic, and 0.5 classification threshold
Figure 4.17: cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70ms delay, 50Mbps bandwidth, 1 BDP queue size, background traffic, and 0.5 classification threshold
Figure 4.18: cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70ms delay, 50Mbps bandwidth, 1 BDP queue size, background traffic, and 0.5 classification threshold


Figure 4.19: Retransmission reduction for TCP Reno when running with background traffic with model inference enabled at various classification thresholds
Figure 4.20: Throughput change for TCP Reno when running with background traffic with model inference enabled at various classification thresholds
Figure 4.21: Retransmission reduction for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds
Figure 4.22: Throughput change for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds
Figure 4.23: Trade-off between retransmission reduction and throughput change for TCP Reno when running with background traffic with model inference enabled at various classification thresholds
Figure 4.24: Trade-off between retransmission reduction and throughput change for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds

Chapter 5

Conclusion

In this thesis we have thoroughly investigated the topic of real-time packet loss
prediction. We have shown how a machine learning model can be trained on data
collected from multiple TCP connections and applied to tackle this problem in order to
implement proactive congestion control measures using mechanisms such as ECN.
In the introduction we posed some research questions, which we have tried to answer to
the best of our ability throughout this thesis. A summary of our findings and answers
to the research questions is given below. In addition, we have highlighted the main
contributions that this thesis brings to the scientific community, which we briefly
mentioned in the introduction as well. Finally, we have proposed some possibilities for
future research that could expand on the work presented in this thesis. All the results
presented in this thesis are reproducible, and the source code is publicly available [86].

5.1 Addressing research questions


The main research question that this thesis has attempted to answer, and that we
presented in the introduction, is:
Is it possible to predict if a packet will be lost before sending it?
To answer this question, in Chapter 3 we created various machine learning models using
XGBoost [103], which we trained on TCP state information from multiple TCP connections
created using Mininet [56]. In Chapter 4, we created a custom test setup where we
performed model inference in order to investigate the real-time prediction performance
of the various models on a running TCP connection with Reno and Cubic, with and
without background traffic. Our findings presented in Chapter 4 indicate that we
successfully answered this research question, and that this does indeed seem to be possible.
This is shown in the various figures (4.3, 4.4, 4.5, 4.13, 4.15, 4.17, 4.14, 4.16, 4.18) that
plot the cwnd from various connections overlaid with the predictions of the models we
developed. These figures clearly show that the model predictions are mostly accurate
and close to the peaks where congestion occurs and the resulting packet loss happens.
In addition to the main research question, we also presented various sub-questions in
the introduction. These are discussed below:


Is it possible to do this for multiple congestion control algorithms?


To answer this question, we created separate sets of machine learning models for two
congestion control algorithms: TCP Reno and TCP Cubic. We then applied these two
sets of models to predict packet loss in real time on running TCP connections configured
with either Reno or Cubic. Our findings, presented in Chapter 4, indicate that this is
indeed possible for both Reno and Cubic, and therefore for multiple congestion control
algorithms.
Can we do this using a simple machine learning model, such as a tree-based
classifier?
The models we developed were based on a gradient-boosted decision tree algorithm and
created using XGBoost — more specifically, the XGBClassifier Python class [96]. We
have therefore proved that this is indeed possible using a simple machine learning model,
such as a tree-based classifier.
Can we use a mechanism like ECN to proactively reduce the sending rate
before congestion occurs and packets are dropped and, by doing so, reduce the
amount of retransmissions?
Our work presented in Chapter 4 discusses how we applied various machine learning
models to TCP connections to perform real-time prediction and subsequent sending rate
adjustments using ECN. Our results indicate that the number of retransmissions was
reduced in almost all cases when model inference was enabled.

5.2 Contributions
The main contributions of this thesis, as briefly mentioned in the introduction, are
summarized below:
As described in detail in Chapter 3, we have done the following:
• Done extensive research on the topic of real-time packet loss prediction, including
which TCP state variables could be informative to a potential solution and could
be used to select or construct machine learning features to create models.
• Performed extensive data collection in the form of tests involving TCP connections
configured as a single flow between a sender and receiver with and without the
presence of background traffic and with multiple congestion control algorithms.
• Trained and tuned multiple machine learning models, which we have evaluated
using multiple performance metrics and through manual inspection and analysis.
We have investigated classification performance differences between models trained
and evaluated on data collected with and without the presence of background
traffic, and models trained and evaluated on data collected only without background
traffic.
In addition, as described in Chapter 4, we have done the following:
• Performed real-time model inference where, through various tests, we have
investigated how the models perform in real time and how accurate and informative
the predictions are for a potential solution involving the use of proactive sending
rate reduction.


• Done such tests for various congestion control algorithms and for connections
with and without background traffic, where we have calculated metrics such as
retransmission reduction and throughput change, and created various plots for
showing the correlation between model inference and these metrics.
• Shown how ECN can be leveraged to proactively adjust the sending rate based on
model predictions to reduce the amount of retransmissions.

5.3 Future research


Future research could focus on multiple topics; some examples are discussed below.

5.3.1 Real-life tests


This thesis has not done any real-life testing. All the TCP connections were created in
virtual environments using Mininet, and traffic was generated using iPerf. The results
presented here are therefore not necessarily transferable to a real-life scenario involving
real end users communicating over a network link using TCP; rather, they serve as a
proof of concept indicating that this could be possible.

5.3.2 Different machine learning models


Research creating models using machine learning algorithms other than the XGBoost-based
models from our work would be very interesting. Results could then be compared
between models, given that the training data remained the same. One of the reasons for
using a gradient booster such as XGBoost was research indicating better performance on
tabular data [10] when the dataset is not very large. However, throughout the thesis
work the dataset eventually grew quite large, and an approach using a neural network
could therefore perhaps have resulted in better performance than what we have presented
here. In addition, we chose to use a static model like XGBoost primarily for performance
reasons. Training and evaluating a sequence model (Section 2.2.9), such as a Long
Short-Term Memory (LSTM) model, would be interesting and could perhaps result in
more accurate predictions.

5.3.3 Different machine learning features


Research exploring the use of different features, or different combinations of features,
than those discussed here (Section 3.4) could be interesting in order to investigate which
features or combinations of features are more informative to a solution than others. Our
features were extracted from outputs produced by ss [85] containing TCP state
information, such as cwnd, rtt, and ssthresh. However, the ss outputs included multiple
other fields that we did not use as features in the final training data, and that could have
been used directly or indirectly to construct new features. In addition, as shown in the
section evaluating the final models (Section 3.7), a few features had very little or no
importance. Investigating and comparing model performance with and without certain
features could therefore be quite interesting.


5.3.4 Different training data


Another possible avenue for future work could be exploring models trained with different,
or at least refined, training data. The labeling method in particular seemed to greatly
influence model results, so exploring different methods for creating labeled training data
could be very interesting and would show whether alternative labeling methods perform
better or worse.

5.3.5 A common model for multiple congestion control algorithms


In our work, we created separate models for TCP Reno and TCP Cubic. An interesting
topic would be exploring the creation and application of a “common” model trained on
both the Reno and Cubic datasets.

5.3.6 Dynamic FEC


In our work, we leveraged ECN to reduce the sending rate proactively based on model
predictions in order to reduce the number of retransmissions. Instead of reducing the
sending rate, research could investigate the use of Forward Error Correction (FEC) to
dynamically encode redundant information at the sender based on model predictions.
If implemented effectively, this approach could theoretically lower the overhead commonly
associated with FEC implementations, since packet loss at the transport layer is relatively
rare and most of the encoded information therefore often remains unused.

Bibliography

[1] Soheil Abbasloo, Chen-Yu Yen, and H. Jonathan Chao. “Classic Meets Modern: A
Pragmatic Learning-Based Congestion Control for the Internet.” In: Proceedings
of the Annual Conference of the ACM Special Interest Group on Data
Communication on the Applications, Technologies, Architectures, and Protocols
for Computer Communication. SIGCOMM ’20. Virtual Event, USA: Association
for Computing Machinery, 2020, pp. 632–647. isbn: 9781450379557. doi: 10.1145/
3387514.3405892. url: https://doi.org/10.1145/3387514.3405892.
[2] Davide Andreoletti et al. “Network Traffic Prediction based on Diffusion
Convolutional Recurrent Neural Networks.” In: IEEE INFOCOM 2019 - IEEE
Conference on Computer Communications Workshops (INFOCOM WKSHPS).
2019, pp. 246–251. doi: 10.1109/INFCOMW.2019.8845132.
[3] Bruno Astuto Arouche Nunes et al. “A machine learning framework for TCP
round-trip time estimation.” In: EURASIP Journal on Wireless Communications
and Networking 2014 (2014), pp. 1–22.
[4] Amir F. Atiya et al. “Packet Loss Rate Prediction Using the Sparse Basis
Prediction Model.” In: IEEE Transactions on Neural Networks 18.3 (2007),
pp. 950–954. doi: 10.1109/TNN.2007.891681.
[5] Vedant Bahel, Sofia Pillai, and Manit Malhotra. “A Comparative Study on
Various Binary Classification Algorithms and their Improved Variant for Optimal
Performance.” In: 2020 IEEE Region 10 Symposium (TENSYMP). 2020, pp. 495–
498. doi: 10.1109/TENSYMP50017.2020.9230877.
[6] Mikhail Belkin et al. “Reconciling modern machine-learning practice and the
classical bias–variance trade-off.” In: Proceedings of the National Academy of
Sciences 116.32 (2019), pp. 15849–15854. doi: 10.1073/pnas.1903070116. eprint:
https://www.pnas.org/doi/pdf/10.1073/pnas.1903070116. url: https://www.pnas.org/doi/abs/10.1073/pnas.1903070116.
[7] Hanane Benadji, Lynda Zitoune, and Véronique Vèque. “Predictive Modeling of
Loss Ratio for Congestion Control in IoT Networks Using Deep Learning.” In: the
IEEE Global Communications Conference (GLOBECOM). 2023.
[8] Stephen Bensley et al. Data Center TCP (DCTCP): TCP Congestion Control
for Data Centers. RFC 8257. Oct. 2017. doi: 10.17487/RFC8257. url: https://www.rfc-editor.org/info/rfc8257.
[9] Ethan Blanton, Dr. Vern Paxson, and Mark Allman. TCP Congestion Control.
RFC 5681. Sept. 2009. doi: 10.17487/RFC5681. url: https://www.rfc-editor.org/info/rfc5681.


[10] Vadim Borisov et al. “Deep Neural Networks and Tabular Data: A Survey.” In:
IEEE Transactions on Neural Networks and Learning Systems (2022), pp. 1–21.
doi: 10.1109/TNNLS.2022.3229161.
[11] Neal Cardwell et al. “BBR: Congestion-Based Congestion Control.” In: ACM
Queue 14, September-October (2016), pp. 20–53. url: http://queue.acm.org/detail.cfm?id=3022184.
[12] Selene Cerna Ñahuis et al. “A Comparison of LSTM and XGBoost for Predicting
Firemen Interventions.” In: June 2020, pp. 424–434. isbn: 978-3-030-45690-0. doi:
10.1007/978-3-030-45691-7_39.
[13] Rene Y Choi et al. “Introduction to machine learning, neural networks, and deep
learning.” In: Translational vision science & technology 9.2 (2020), pp. 14–14.
[14] Xu Chu et al. “Data Cleaning: Overview and Emerging Challenges.” In:
Proceedings of the 2016 International Conference on Management of Data.
SIGMOD ’16. San Francisco, California, USA: Association for Computing
Machinery, 2016, pp. 2201–2206. isbn: 9781450335317. doi: 10.1145/2882903.
2912574. url: https://doi.org/10.1145/2882903.2912574.
[15] Pádraig Cunningham and Sarah Jane Delany. “Underestimation Bias and
Underfitting in Machine Learning.” In: Trustworthy AI - Integrating Learning,
Optimization and Reasoning. Ed. by Fredrik Heintz, Michela Milano, and Barry
O’Sullivan. Cham: Springer International Publishing, 2021, pp. 20–31. isbn: 978-
3-030-73959-1.
[16] Hercules Dalianis. “Evaluation Metrics and Evaluation.” In: Clinical Text Mining:
Secondary Use of Electronic Patient Records. Cham: Springer International
Publishing, 2018, pp. 45–53. isbn: 978-3-319-78503-5. doi: 10.1007/978-3-319-78503-5_6. url: https://doi.org/10.1007/978-3-319-78503-5_6.
[17] Essam Al Daoud. “Comparison between XGBoost, LightGBM and CatBoost
Using a Home Credit Dataset.” In: International Journal of Computer and
Information Engineering 13.1 (2019), pp. 6–10. issn: eISSN: 1307-6892. url:
https://publications.waset.org/vol/145.
[18] Luis Diez et al. “Can We Exploit Machine Learning to Predict Congestion over
mmWave 5G Channels?” In: Applied Sciences 10.18 (2020). issn: 2076-3417. doi:
10.3390/app10186164. url: https://www.mdpi.com/2076-3417/10/18/6164.
[19] Mo Dong et al. “PCC Vivace: Online-Learning Congestion Control.” In: 15th
USENIX Symposium on Networked Systems Design and Implementation (NSDI
18). Renton, WA: USENIX Association, Apr. 2018, pp. 343–356. isbn: 978-1-
939133-01-4. url: https://www.usenix.org/conference/nsdi18/presentation/dong.
[20] Mo Dong et al. “PCC: Re-architecting Congestion Control for Consistent High
Performance.” In: 12th USENIX Symposium on Networked Systems Design and
Implementation (NSDI 15). Oakland, CA: USENIX Association, May 2015,
pp. 395–408. isbn: 978-1-931971-218. url: https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/dong.
[21] Wesley Eddy. Transmission Control Protocol (TCP). RFC 9293. Aug. 2022. doi:
10.17487/RFC9293. url: https://www.rfc-editor.org/info/rfc9293.


[22] Issam El Naqa and Martin J. Murphy. “What Is Machine Learning?” In: Machine
Learning in Radiation Oncology: Theory and Applications. Ed. by Issam El Naqa,
Ruijiang Li, and Martin J. Murphy. Cham: Springer International Publishing,
2015, pp. 3–11. isbn: 978-3-319-18305-3. doi: 10.1007/978-3-319-18305-3_1. url: https://doi.org/10.1007/978-3-319-18305-3_1.
[23] Carmen Esposito et al. “GHOST: Adjusting the Decision Threshold to Handle
Imbalanced Data in Machine Learning.” In: Journal of Chemical Information
and Modeling 61.6 (2021). PMID: 34100609, pp. 2623–2640. doi: 10.1021/acs.jcim.1c00160. eprint: https://doi.org/10.1021/acs.jcim.1c00160. url: https://doi.org/10.1021/acs.jcim.1c00160.
[24] A Esterhuizen and AE Krzesinski. “TCP congestion control comparison.” In:
SATNAC, September (2012).
[25] Gorry Fairhurst and Michael Welzl. The Benefits of Using Explicit Congestion
Notification (ECN). RFC 8087. Mar. 2017. doi: 10.17487/RFC8087. url: https://www.rfc-editor.org/info/rfc8087.
[26] Joyce Fang et al. “Reinforcement learning for bandwidth estimation and
congestion control in real-time communications.” In: CoRR abs/1912.02222
(2019). arXiv: 1912.02222. url: http://arxiv.org/abs/1912.02222.
[27] S. Floyd and V. Jacobson. “Random early detection gateways for congestion
avoidance.” In: IEEE/ACM Transactions on Networking 1.4 (1993), pp. 397–413.
doi: 10.1109/90.251892.
[28] Sally Floyd, Dr. K. K. Ramakrishnan, and David L. Black. The Addition of
Explicit Congestion Notification (ECN) to IP. RFC 3168. Sept. 2001. doi: 10.17487/RFC3168. url: https://www.rfc-editor.org/info/rfc3168.
[29] Jerome H. Friedman. “Greedy Function Approximation: A Gradient Boosting
Machine.” In: The Annals of Statistics 29.5 (2001), pp. 1189–1232. issn: 0090-5364.
url: http://www.jstor.org/stable/2699986 (visited on 08/16/2023).
[30] Moritz Geist and Benedikt Jaeger. “Overview of TCP congestion control
algorithms.” In: Network 11 (2019).
[31] Anna Giannakou, Dipankar Dwivedi, and Sean Peisert. “A machine learning
approach for packet loss prediction in science flows.” In: Future Generation
Computer Systems 102 (2020), pp. 190–197. issn: 0167-739X. doi: 10.1016/j.future.2019.07.053. url: https://www.sciencedirect.com/science/article/pii/S0167739X19305850.
[32] Isabelle Guyon et al. Feature extraction: foundations and applications. Vol. 207.
Springer, 2008.
[33] Sangtae Ha, Injong Rhee, and Lisong Xu. “CUBIC: A New TCP-Friendly High-
Speed TCP Variant.” In: SIGOPS Oper. Syst. Rev. 42.5 (July 2008), pp. 64–74.
issn: 0163-5980. doi: 10.1145/1400097.1400105. url: https://doi.org/10.1145/1400097.1400105.
[34] Mark A Hall. “Correlation-based feature selection of discrete and numeric class
machine learning.” In: (2000).
[35] Erwin Harahap et al. “A router-based management system for prediction of
network congestion.” In: 2014 IEEE 13th International Workshop on Advanced
Motion Control (AMC). 2014, pp. 398–403. doi: 10.1109/AMC.2014.6823315.
[36] hping3. url: https://www.kali.org/tools/hping3/ (visited on 04/12/2023).
[37] iPerf. url: https://iperf.fr/ (visited on 04/12/2023).
[38] Is LightGBM available for Mac M1? url: https://stackoverflow.com/questions/
74568115/is-lightgbm-available-for-mac-m1 (visited on 06/07/2023).
[39] V. Jacobson. “Congestion Avoidance and Control.” In: Symposium Proceedings
on Communications Architectures and Protocols. SIGCOMM ’88. Stanford,
California, USA: Association for Computing Machinery, 1988, pp. 314–329. isbn:
0897912799. doi: 10.1145/52324.52356. url: https://doi.org/10.1145/52324.52356.
[40] Habibullah Jamal and Kiran Sultan. “Performance analysis of TCP congestion
control algorithms.” In: International Journal of Computers and Communications 2.1 (2008), pp. 18–24.
[41] Nathan Jay et al. “A Deep Reinforcement Learning Perspective on Internet
Congestion Control.” In: Proceedings of the 36th International Conference on
Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov.
Vol. 97. Proceedings of Machine Learning Research. PMLR, June 2019, pp. 3050–
3059. url: https://proceedings.mlr.press/v97/jay19a.html.
[42] T. Jiang, J. L. Gradus, and A. J. Rosellini. “Supervised Machine Learning: A Brief Primer.” In: Behavior Therapy 51.5 (Sept. 2020), pp. 675–687.
[43] Gary King and Langche Zeng. “Logistic Regression in Rare Events Data.” In:
Political Analysis 9.2 (2001), pp. 137–163. doi: 10.1093/oxfordjournals.pan.a004868.
[44] James F. Kurose and Keith W. Ross. Computer Networking: A Top-Down
Approach. 7th ed. Pearson, 2017. isbn: 978-1-292-15359-9. url: https://www.pearson.com/us/higher-education/program/PGM1101673.html.
[45] Adam Langley et al. “The QUIC Transport Protocol: Design and Internet-Scale Deployment.” In: Proceedings of the Conference of the ACM Special Interest Group on Data Communication. SIGCOMM ’17. Los Angeles, CA, USA: Association for Computing Machinery, 2017.
[46] Gary Lee. “Chapter 3 - Switch Fabric Technology.” In: Cloud Networking. Ed. by
Gary Lee. Boston: Morgan Kaufmann, 2014, pp. 37–64. isbn: 978-0-12-800728-0. doi: 10.1016/B978-0-12-800728-0.00003-5. url: https://www.sciencedirect.com/science/article/pii/B9780128007280000035.
[47] LGBMClassifier. url: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.
LGBMClassifier.html (visited on 04/20/2023).
[48] Wei Li et al. “QTCP: Adaptive Congestion Control with Reinforcement Learning.”
In: IEEE Transactions on Network Science and Engineering 6.3 (2019), pp. 445–
458. doi: 10.1109/TNSE.2018.2835758.
[49] Wenzhong Li et al. “SmartCC: A Reinforcement Learning Approach for Multipath
TCP Congestion Control in Heterogeneous Networks.” In: IEEE Journal on
Selected Areas in Communications 37.11 (2019), pp. 2621–2633. doi: 10.1109/
JSAC.2019.2933761.
[50] Yunxin Liang et al. “Product Marketing Prediction Based on XGboost and
LightGBM Algorithm.” In: Proceedings of the 2nd International Conference
on Artificial Intelligence and Pattern Recognition. AIPR ’19. Beijing, China:
Association for Computing Machinery, 2019, pp. 150–153. isbn: 9781450372299.
doi: 10.1145/3357254.3357290. url: https://doi.org/10.1145/3357254.3357290.
[51] LightGBM. url: https://lightgbm.readthedocs.io/en/latest/ (visited on 04/12/2023).
[52] Benoit Liquet, Sarat Moka, and Yoni Nazarathy. The mathematical engineering
of deep learning. 2023.
[53] Marcos Roberto Machado, Salma Karray, and Ivaldo Tributino de Sousa.
“LightGBM: an Effective Decision Tree Gradient Boosting Method to Predict
Customer Loyalty in the Finance Industry.” In: 2019 14th International
Conference on Computer Science & Education (ICCSE). 2019, pp. 1111–1116.
doi: 10.1109/ICCSE.2019.8845529.
[54] Stephen Marsland. Machine learning: an algorithmic perspective. CRC press,
2015.
[55] H.R. Mehrvar and M.R. Soleymani. “Packet loss rate prediction using a
universal indicator of traffic.” In: ICC 2001. IEEE International Conference on
Communications. Conference Record (Cat. No.01CH37240). Vol. 3. 2001, pp. 647–653. doi: 10.1109/ICC.2001.937277.
[56] Mininet. url: http://mininet.org/ (visited on 04/12/2023).
[57] Mininet API. url: http://mininet.org/api/annotated.html (visited on 04/12/2023).
[58] Mininet CLI. url: http://mininet.org/walkthrough/#interact-with-hosts-and-switches
(visited on 04/12/2023).
[59] Mininet Node class. url: http://mininet.org/api/classmininet_1_1node_1_1Node.
html (visited on 04/12/2023).
[60] Mininet Node cmd method. url: http://mininet.org/api/classmininet_1_1node_1_
1Node.html#a6e1338af3c4a0348963a257ac548153b (visited on 04/12/2023).
[61] Mininet Node config method. url: http://mininet.org/api/classmininet_1_1node_1_
1Node.html#ae1c80e11ed708d3f0d3c98acd4299ed4 (visited on 04/12/2023).
[62] ModelInference. url: https://cloud.google.com/bigquery/docs/inference-overview (visited on 10/28/2023).
[63] Woongsoo Na et al. “DL-TCP: Deep Learning-Based Transmission Control
Protocol for Disaster 5G mmWave Networks.” In: IEEE Access 7 (2019),
pp. 145134–145144. doi: 10.1109/ACCESS.2019.2945582.
[64] Vladimir Nasteski. “An overview of the supervised machine learning methods.”
In: HORIZONS.B 4 (Dec. 2017), pp. 51–62. doi: 10.20544/HORIZONS.B.04.1.17.P05.
[65] Alexey Natekin and Alois Knoll. “Gradient boosting machines, a tutorial.” In:
Frontiers in neurorobotics 7 (2013), p. 21.
[66] Xiaohui Nie et al. “Dynamic TCP Initial Windows and Congestion Control
Schemes Through Reinforcement Learning.” In: IEEE Journal on Selected Areas
in Communications 37.6 (2019), pp. 1231–1247. doi: 10.1109/JSAC.2019.2904350.
[67] pandas. url: https://pandas.pydata.org (visited on 04/13/2023).
[68] pandas.read_csv. url: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html (visited on 04/13/2023).
[69] pandas.DataFrame. url: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html (visited on 04/13/2023).
[70] Konstantinos Poularakis et al. “Generalizable and Interpretable Deep Learning for Network Congestion Prediction.” In: 2021 IEEE 29th International Conference on Network Protocols (ICNP). 2021, pp. 1–10. doi: 10.1109/ICNP52444.2021.9651937.
[71] Python csv.DictWriter. url: https://docs.python.org/3/library/csv.html#csv.DictWriter (visited on 04/18/2023).
[72] Python Input and Output. url: https://docs.python.org/3/tutorial/inputoutput.html
(visited on 04/13/2023).
[73] Python List. url: https://docs.python.org/3/tutorial/datastructures.html#dictionaries
(visited on 10/02/2023).
[74] Darijo Raca, Ahmed H. Zahran, and Cormac J. Sreenan. “Sizing Network Buffers:
An HTTP Adaptive Streaming Perspective.” In: 2016 IEEE 4th International
Conference on Future Internet of Things and Cloud Workshops (FiCloudW).
2016, pp. 369–376. doi: 10.1109/W-FiCloud.2016.80.
[75] Elad Rapaport et al. “Predicting traffic overflows on private peering.” In: CoRR
abs/2010.01380 (2020). arXiv: 2010.01380. url: https://arxiv.org/abs/2010.01380.
[76] Zuzana Reitermanova et al. “Data splitting.” In: WDS. Vol. 10. Matfyzpress
Prague. 2010, pp. 31–36.
[77] Injong Rhee et al. CUBIC for Fast Long-Distance Networks. RFC 8312. Feb.
2018. doi: 10.17487/RFC8312. url: https://www.rfc-editor.org/info/rfc8312.
[78] David Ros and Michael Welzl. “Less-than-Best-Effort Service: A Survey of End-
to-End Approaches.” In: IEEE Communications Surveys & Tutorials 15.2 (2013),
pp. 898–908. doi: 10.1109/SURV.2012.060912.00176.
[79] Lopamudra Roychoudhuri and Ehab S. Al-Shaer. “Real-time packet loss
prediction based on end-to-end delay variation.” In: IEEE Transactions on
Network and Service Management 2.1 (2005), pp. 29–38. doi: 10.1109/TNSM.2005.4798299.
[80] Scapy rdpcap. url: https://scapy.readthedocs.io/en/latest/api/scapy.utils.html#scapy.utils.rdpcap (visited on 10/29/2023).
[81] scikit-learn metrics and scoring. url: https://scikit-learn.org/stable/modules/model_evaluation.html (visited on 04/12/2023).
[82] Stanislav Shalunov et al. Low Extra Delay Background Transport (LEDBAT).
RFC 6817. Dec. 2012. doi: 10.17487/RFC6817. url: https://www.rfc-editor.org/
info/rfc6817.
[83] Dalwinder Singh and Birmohan Singh. “Investigating the impact of data
normalization on classification performance.” In: Applied Soft Computing 97
(2020), p. 105524. issn: 1568-4946. doi: 10.1016/j.asoc.2019.105524. url: https://www.sciencedirect.com/science/article/pii/S1568494619302947.
[84] Viswanath Sivakumar et al. “MVFST-RL: An Asynchronous RL Framework for
Congestion Control with Delayed Actions.” In: CoRR abs/1910.04054 (2019).
arXiv: 1910.04054. url: http://arxiv.org/abs/1910.04054.
[85] ss. url: https://man7.org/linux/man-pages/man8/ss.8.html (visited on 04/12/2023).
[86] Thesis source code. url: https://www.maximilian.no/thesis/download (visited on
11/14/2023).
[87] timeit. url: https://docs.python.org/3/library/timeit.html (visited on 11/12/2023).
[88] TShark. url: https://www.wireshark.org/docs/man-pages/tshark.html (visited on
04/12/2023).
[89] Curtis Villamizar and Cheng Song. “High Performance TCP in ANSNET.” In:
SIGCOMM Comput. Commun. Rev. 24.5 (Oct. 1994), pp. 45–60. issn: 0146-4833.
doi: 10.1145/205511.205520. url: https://doi.org/10.1145/205511.205520.
[90] Watchdog. url: https://pypi.org/project/watchdog/ (visited on 10/26/2023).
[91] Watchdog FileSystemEventHandler. url: https://pythonhosted.org/watchdog/api.html#watchdog.events.FileSystemEventHandler (visited on 10/26/2023).
[92] Watchdog Observer. url: https://pythonhosted.org/watchdog/api.html#watchdog.observers.Observer (visited on 10/26/2023).
[93] Wenting Wei, Huaxi Gu, and Baochun Li. “Congestion Control: A Renaissance
with Machine Learning.” In: IEEE Network 35.4 (2021), pp. 262–269. doi: 10.1109/MNET.011.2000603.
[94] Michael Welzl. Network Congestion Control: Managing Internet Traffic. John Wiley & Sons, May 2006, pp. 1–263. doi: 10.1002/047002531X.
[95] Keith Winstein and Hari Balakrishnan. “TCP Ex Machina: Computer-Generated
Congestion Control.” In: Proceedings of the ACM SIGCOMM 2013 Conference
on SIGCOMM. SIGCOMM ’13. Hong Kong, China: Association for Computing
Machinery, 2013, pp. 123–134. isbn: 9781450320566. doi: 10.1145/2486001.2486020. url: https://doi.org/10.1145/2486001.2486020.
[96] XGBClassifier. url: https://xgboost.readthedocs.io/en/stable/python/python_api.
html#xgboost.XGBClassifier (visited on 10/03/2023).
[97] XGBClassifier feature_importances_. url: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier.feature_importances_ (visited on 10/03/2023).
[98] XGBClassifier fit. url: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier.fit (visited on 10/03/2023).
[99] XGBClassifier load_model. url: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.load_model (visited on 10/19/2023).
[100] XGBClassifier predict. url: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier.predict (visited on 10/03/2023).
[101] XGBClassifier predict_proba. url: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier.predict_proba (visited on 10/28/2023).
[102] XGBClassifier save_model. url: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.save_model (visited on 10/19/2023).
[103] XGBoost. url: https://xgboost.readthedocs.io/en/stable/ (visited on 04/12/2023).
[104] XGBoost installation guide. url: https://xgboost.readthedocs.io/en/stable/install.html (visited on 06/07/2023).
[105] XGBoost parameters. url: https://xgboost.readthedocs.io/en/stable/parameter.html (visited on 10/19/2023).
[106] XGBoost parameter tuning. url: https://xgboost.readthedocs.io/en/stable/tutorials/param_tuning.html (visited on 10/19/2023).
[107] Kefan Xiao, Shiwen Mao, and Jitendra K. Tugnait. “TCP-Drinc: Smart
Congestion Control Based on Deep Reinforcement Learning.” In: IEEE Access
7 (2019), pp. 11892–11904. doi: 10.1109/ACCESS.2019.2892046.
[108] T. Yamamoto. “Estimation of the advanced TCP/IP algorithms for long distance
collaboration.” In: Fusion Engineering and Design 83.2 (2008). Proceedings of
the 6th IAEA Technical Meeting on Control, Data Acquisition, and Remote
Participation for Fusion Research, pp. 516–519. issn: 0920-3796. doi: 10.1016/j.fusengdes.2007.10.006. url: https://www.sciencedirect.com/science/article/pii/S0920379607005078.
[109] Francis Y. Yan et al. “Pantheon: the training ground for Internet congestion-
control research.” In: 2018 USENIX Annual Technical Conference (USENIX ATC
18). Boston, MA: USENIX Association, July 2018, pp. 731–743. isbn: 978-1-939133-01-4. url: https://www.usenix.org/conference/atc18/presentation/yan-francis.
[110] Ticao Zhang and Shiwen Mao. “Machine Learning for End-to-End Congestion
Control.” In: IEEE Communications Magazine 58.6 (2020), pp. 52–57. doi: 10.1109/MCOM.001.1900509.
[111] Zhi-Hua Zhou. Machine learning. Springer Nature, 2021.
[112] Quan Zou et al. “Finding the Best Classification Threshold in Imbalanced
Classification.” In: Big Data Research 5 (2016). Big data analytics and
applications, pp. 2–8. issn: 2214-5796. doi: 10.1016/j.bdr.2015.12.001. url: https://www.sciencedirect.com/science/article/pii/S2214579615000611.