Real-Time Packet Loss Prediction in TCP Using Machine Learning
Department of Informatics
Faculty of Mathematics and Natural Sciences
Autumn 2023
Maximilian von Stephanides
Supervisors:
Safiqul Islam
Michael Welzl
Abstract
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Problem statement . . . . . . . . . . . . . . . . . . . . 1
1.2 Research questions . . . . . . . . . . . . . . . . . . . . 1
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . 3
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Congestion control and packet loss . . . . . . . . . . . . . . 5
2.1.1 Congestion induced packet loss . . . . . . . . . . . . 6
2.1.2 The problem with congestion . . . . . . . . . . . . . 6
2.1.3 Queue size and BDP . . . . . . . . . . . . . . . . 6
2.1.4 Congestion control . . . . . . . . . . . . . . . . . 7
2.1.5 Components involved in congestion control mechanisms . . 7
2.1.6 Slow start and congestion avoidance . . . . . . . . . . 7
2.1.7 Fast retransmit and fast recovery . . . . . . . . . . . 9
2.1.8 Common congestion control algorithms . . . . . . . . . 9
2.1.9 ECN . . . . . . . . . . . . . . . . . . . . . . 11
2.1.10 Non-loss-based congestion control algorithms . . . . . . 15
2.2 Machine learning . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 The machine learning process . . . . . . . . . . . . 17
2.2.2 Data collection and transformation . . . . . . . . . . . 18
2.2.3 Features . . . . . . . . . . . . . . . . . . . . . 20
2.2.4 Supervised learning . . . . . . . . . . . . . . . . 20
2.2.5 Binary classification . . . . . . . . . . . . . . . . 22
2.2.6 Ensemble learning . . . . . . . . . . . . . . . . . 24
2.2.7 Bias and variance . . . . . . . . . . . . . . . . . 25
2.2.8 Reinforcement learning . . . . . . . . . . . . . . . 26
2.2.9 Sequence models . . . . . . . . . . . . . . . . . 26
2.3 Related works . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Machine learning based congestion control . . . . . . . . 27
2.3.2 Packet loss prediction . . . . . . . . . . . . . . . . 30
3 Machine learning model design and evaluation . . . . . . . . . . . . 33
3.1 Model choice . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Tools . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Network tools . . . . . . . . . . . . . . . . . . . 34
3.2.2 Machine learning libraries . . . . . . . . . . . . . . 35
3.3 Data collection . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Network configuration . . . . . . . . . . . . . . . . 36
3.3.2 Initial approach . . . . . . . . . . . . . . . . . . 38
List of Figures
4.1 The test setup showing the flow of data when the model was applied to perform
predictions on a connection in real time . . . . . . . . . . . . . . 76
4.2 cwnd plot from a connection configured using TCP Reno with 30ms delay,
50Mbps bandwidth, and 1 BDP queue size . . . . . . . . . . . . . 85
4.3 cwnd plot overlaid with model predictions from a connection configured using
TCP Reno with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size, and 0.25
classification threshold . . . . . . . . . . . . . . . . . . . . . 86
4.4 cwnd plot overlaid with model predictions from a connection configured using
TCP Reno with 30ms delay, 50Mbps bandwidth, 1 BDP queue size, and 0.25
classification threshold . . . . . . . . . . . . . . . . . . . . . 87
4.5 cwnd plot overlaid with model predictions from a connection configured using
TCP Cubic with 30ms delay, 50Mbps bandwidth, 1 BDP queue size, and 0.1
classification threshold . . . . . . . . . . . . . . . . . . . . . 87
4.6 Retransmission reduction for TCP Reno when running as a single flow with
model inference enabled at various classification thresholds . . . . . . . 89
4.7 Throughput change for TCP Reno when running as a single flow with model
inference enabled at various classification thresholds . . . . . . . . . 89
4.8 Retransmission reduction for TCP Cubic when running as a single flow with
model inference enabled at various classification thresholds . . . . . . . 89
4.9 Throughput change for TCP Cubic when running as a single flow with model
inference enabled at various classification thresholds . . . . . . . . . 89
4.10 Trade-off between retransmission reduction and throughput change for TCP
Reno when running as a single flow with model inference enabled at various
classification thresholds . . . . . . . . . . . . . . . . . . . . 90
4.11 Trade-off between retransmission reduction and throughput change for TCP
Cubic when running as a single flow with model inference enabled at various
classification thresholds . . . . . . . . . . . . . . . . . . . . 90
4.12 cwnd plot from a connection configured using TCP Reno with 30ms delay,
50Mbps bandwidth, 0.5 BDP queue size, and 0.1 classification threshold
resulting in zero retransmissions . . . . . . . . . . . . . . . . . 91
4.13 cwnd plot overlaid with model predictions from a connection configured using
TCP Reno with 70ms delay, 50Mbps bandwidth, 0.25 BDP queue size,
background traffic, and 0.5 classification threshold . . . . . . . . . . 96
4.14 cwnd plot overlaid with model predictions from a connection configured using
TCP Cubic with 70ms delay, 50Mbps bandwidth, 0.25 BDP queue size,
background traffic, and 0.5 classification threshold . . . . . . . . . . 96
4.15 cwnd plot overlaid with model predictions from a connection configured
using TCP Reno with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size,
background traffic, and 0.5 classification threshold . . . . . . . . . . 96
4.16 cwnd plot overlaid with model predictions from a connection configured
using TCP Cubic with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size,
background traffic, and 0.5 classification threshold . . . . . . . . . . 96
4.17 cwnd plot overlaid with model predictions from a connection configured using
TCP Reno with 70ms delay, 50Mbps bandwidth, 1 BDP queue size, background
traffic, and 0.5 classification threshold . . . . . . . . . . . . . . . 96
4.18 cwnd plot overlaid with model predictions from a connection configured
using TCP Cubic with 70ms delay, 50Mbps bandwidth, 1 BDP queue size,
background traffic, and 0.5 classification threshold . . . . . . . . . . 96
4.19 Retransmission reduction for TCP Reno when running with background traffic
with model inference enabled at various classification thresholds . . . . . 97
4.20 Throughput change for TCP Reno when running with background traffic with
model inference enabled at various classification thresholds . . . . . . . 97
4.21 Retransmission reduction for TCP Cubic when running with background traffic
with model inference enabled at various classification thresholds . . . . . 97
4.22 Throughput change for TCP Cubic when running with background traffic with
model inference enabled at various classification thresholds . . . . . . . 97
4.23 Trade-off between retransmission reduction and throughput change for TCP
Reno when running with background traffic with model inference enabled at
various classification thresholds . . . . . . . . . . . . . . . . . 97
4.24 Trade-off between retransmission reduction and throughput change for TCP
Cubic when running with background traffic with model inference enabled at
various classification thresholds . . . . . . . . . . . . . . . . . 97
List of Tables
3.1 Programs used for network configuration and data collection and their versions 38
3.2 Features selected from or based on values in the ss output . . . . . . . 47
3.3 Relevant fields from the ss output and their ss man page descriptions [85] . 47
3.4 Network connection parameters used to configure the data collection procedure
in phase one . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 The proportion of samples for each class in the training, validation, and test
sets for the Reno phase one dataset . . . . . . . . . . . . . 54
3.6 The proportion of samples for each class in the training, validation, and test
sets for the Cubic phase one dataset . . . . . . . . . . . . . . . 54
3.7 Reno phase one results . . . . . . . . . . . . . . . . . . . . 55
3.8 Cubic phase one results . . . . . . . . . . . . . . . . . . . . 55
3.9 Reno phase one confusion matrices . . . . . . . . . . . . . . . . 55
3.10 Cubic phase one confusion matrices . . . . . . . . . . . . . 56
3.11 Feature importances for the Reno phase one model . . . . . . . . . . 56
3.12 Feature importances for the Cubic phase one model . . . . . . . 57
3.13 An excerpt from the Reno phase one training dataset . . . . . . . . . 57
3.14 An excerpt from the Cubic phase one training dataset . . . . . . . . . 57
3.15 Connection parameters for phase two . . . . . . . . . . . . . . . 59
3.16 The proportion of samples for each class in the training, validation, and test
sets for the Reno phase two dataset . . . . . . . . . . . . . . . . 59
3.17 The proportion of samples for each class in the training, validation, and test
sets for the Cubic phase two dataset . . . . . . . . . . . . . 59
3.18 Reno phase two results . . . . . . . . . . . . . . . . . . . . 60
3.19 Cubic phase two results . . . . . . . . . . . . . . . . . . . . 60
3.20 Reno phase two confusion matrices . . . . . . . . . . . . . . . . 60
3.21 Cubic phase two confusion matrices . . . . . . . . . . . . . . . . 60
3.22 An excerpt from the Reno phase two training dataset. The training data is
shown as groups of five and five samples from different parts of the aggregated
training data, meaning that each group of five samples came from the same
part, and that these five samples in each group occurred in succession. Each
group of five samples includes one True sample, the two False samples that
occurred right before, and the two False samples that occurred right after . . 61
3.23 An excerpt from the Cubic phase two training dataset. The training data is
shown as groups of five and five samples from different parts of the aggregated
training data, meaning that each group of five samples came from the same
part, and that these five samples in each group occurred in succession. Each
group of five samples includes one True sample, the two False samples that
occurred right before, and the two False samples that occurred right after . . 61
4.1 The scenarios that were considered when running the model inference related
tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 Results for a specific Reno single flow scenario with a configured delay of 30ms,
bandwidth of 50Mbit, and queue size of 0.5BDP . . . . . . . . . . . 90
4.3 Difference in retransmissions for the single flow and background traffic case for
TCP Reno for the various scenarios without model inference . . . . . . 93
4.4 Difference in retransmissions for the single flow and background traffic case for
TCP Cubic for the various scenarios without model inference . . . . . . 93
Preface
First of all, I would like to thank my fantastic supervisors: Dr. Safiqul Islam and Dr.
Michael Welzl. It goes without saying that this thesis would not have been possible
without you, or at the very least, it would not have been as good. You have always
helped when I was stuck with a problem or simply had a few questions. Through
multiple meetings and discussions, you have either guided me on the correct path or
helped me find out where to go next. I have been truly inspired by your enthusiasm for
research and congestion control in particular. On numerous occasions, I have witnessed
you both being visibly excited over the problem discussed in this thesis, and this has
been a great source of motivation for me throughout the months I have spent working on
it. I also want to thank you both for going above and beyond by sometimes answering
my emails during weekends or way past normal working hours. I have learned a lot from
both of you, and will always carry this experience with me.
I would like to thank my family, especially my parents, for always being supportive and
believing in me. Your support, both emotional and financial, throughout my years spent
studying means a lot to me, and is something that I do not take for granted.
I would like to thank my girlfriend. Not only have you supported me and helped
me when I was stressed or worried about something, but your family has as well. The
way you have discussed my thesis and my progress with your family has been inspiring
and shows me that you truly care.
Lastly, I would like to thank my friends and everyone at “Oslo Styrkeløftklubb” for
cheering me up and keeping me sane throughout this process.
Chapter 1
Introduction
1.3 Contributions
In this thesis, we have investigated the topic of real-time packet loss prediction
extensively, including which TCP state variables could inform a potential solution
and could be used to select or construct machine learning features. The following
bullet points summarize our contributions in this thesis related to data collection, data
transformation, and model design and evaluation:
• Performed extensive data collection in the form of tests involving TCP connections
configured as a single flow between a sender and receiver with and without the
presence of background traffic and with multiple congestion control algorithms.
• Trained and tuned multiple machine learning models, which we have evaluated
using multiple performance metrics as well as manual inspection and analysis.
• Investigated classification performance differences for models that were trained
and evaluated on data collected with and without the presence of background
traffic, and models that were trained and evaluated on data collected only without
background traffic.
In addition, we have performed real-time model inference tests using a custom test setup.
The following bullet points summarize our contributions in this thesis related to real-time
model inference and proactive sending rate adjustments using ECN:
• Through various tests, we have investigated how the models perform in real time
and how accurate and informative the predictions are.
• Performed tests for various congestion control algorithms and for connections with
and without background traffic.
• Calculated metrics such as retransmission reduction and throughput change, and
created various plots for showing the correlation between model inference and these
metrics, where we have shown how the results vary with the chosen classification
threshold.
• Shown how, through proactive sending rate adjustments using model predictions
and ECN, we can reduce the number of retransmissions in a TCP connection.
Finally, we have briefly touched on potential improvements that could be made to both
the training data and models, as well as future work that could be performed to further
investigate and potentially improve the performance of our models.
1.4 Organization
The remainder of this thesis is organized into the following chapters:
Chapter 2: Background Chapter 2 presents background information on the most
relevant concepts, mainly congestion control and machine learning. This chapter
also presents related works and discusses how they relate to the work presented in
this thesis. It thereby provides context for the work we have performed here and
introduces the reader to concepts they may be unfamiliar with.
Chapter 3: Model design and evaluation Chapter 3 discusses how we developed
and evaluated the machine learning models we created in this thesis for the purpose
of real-time packet loss prediction. In this chapter, we discuss topics such as the
following:
• How the network was configured in order to collect data.
• How we collected data, including the type of data and how much.
• How we transformed the data multiple times in order to extract or construct
various machine learning features and create various datasets that were used
for training and evaluation when creating the machine learning models.
• How we created the models using machine learning algorithms such as
LightGBM or XGBoost, and various results in the form of metrics such as
precision, recall, F1 score, and feature importances.
• Potential reasons for the various feature importances and how the feature
importances changed with the training data, due to factors such as the added
presence of background traffic.
Chapter 4: Model inference and results Chapter 4 discusses how we performed
model inference by creating a custom test setup for testing the machine learning
models in real time. In this chapter, we explain how we leveraged ECN to
reduce the sending rate proactively based on model predictions, and the resulting
retransmission reductions. We also discuss how metrics such as the retransmission
reduction and throughput change varied with the classification threshold that the
model was configured with.
Chapter 5: Conclusion In Chapter 5 we address the research questions, summarize
our contributions, present concluding remarks, and explore possible areas for future
research.
Chapter 2
Background
Both congestion control and machine learning are broad topics with a great deal of
active research. The latter has gained enormous popularity in recent years, especially
with the growing prominence of Artificial Intelligence (AI), while the former has long
been of interest as a way to improve how the Internet works.
Congestion control — a technique used by transport protocols like TCP — is strongly
related to the concept of packet loss, with a more congested network leading to a
higher probability of packet loss [94]. Since packet loss is typically undesired, various
mechanisms are in place today in order to combat congestion.
Machine learning, on the other hand, is a subset of AI that focuses on the development
of specific algorithms, known as machine learning algorithms, that allow computers to
learn and make decisions from data. These algorithms together with the data form what
is known as machine learning models [54]. These models, instead of being explicitly
programmed to perform a task, use patterns and inference to produce decisions based
on new data.
In recent years there has been a convergence of these two fields, with many researchers
exploring the application of machine learning in order to optimize congestion control.
While traditional congestion control mechanisms are mainly based on predefined rules
and heuristics, more dynamic congestion control mechanisms partially or fully based on
machine learning could possess the ability to adapt and learn from varying conditions
and therefore potentially be more efficient.
When more data arrives at a link than the link can carry, the excess packets
cannot be transferred across the link. These packets can either be buffered or dropped.
If the packets are buffered, this typically means that they are placed in a queue at a
router, which often works as a basic FIFO (First In, First Out) queue and only drops
packets if the queue is full — the underlying assumption being that a reduction in
throughput would eventually drain the queue. As the queue grows, the network is said
to become congested.
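The queue sizes used throughout this thesis are expressed in multiples of the bandwidth-delay product (BDP). A minimal sketch of the calculation (the helper name is mine):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the amount of data that fits 'in flight'
    on the path (bits/s * seconds, converted to bytes)."""
    return bandwidth_bps * rtt_s / 8

# One of the configurations used later: 50 Mbps bandwidth and 30 ms delay.
bdp = bdp_bytes(50e6, 0.030)   # 187,500 bytes
queue = 1.0 * bdp              # a "1 BDP" queue
```

A router queue sized to 1 BDP can absorb roughly one full round trip's worth of traffic before it must start dropping packets.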
TCP
TCP, explained in full detail in RFC 9293 [21], is one of the most important transport-
layer protocols and has been in use on the Internet for multiple decades. It is a
connection-oriented protocol, meaning that a connection is established between the
sender and receiver before communicating. This is accomplished by the well-known
TCP three-way handshake.
TCP makes it possible to transfer a reliable stream of data from a sender to a receiver
even in the presence of factors like packet losses. One of the ways this reliability is
achieved is by sending a packet, waiting for a signal (ACK) from the receiver, and
retransmitting the packet after a while if the ACK does not arrive — the duration is
specified by a variable called the retransmission timeout (RTO). The RTO is a dynamic
value and is calculated based on an estimate of the RTT [21].
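RFC 6298 specifies the standard way this estimate is maintained, using a smoothed RTT (SRTT) and an RTT variance (RTTVAR). A sketch of one update step (the function name is mine):

```python
ALPHA, BETA, K = 1 / 8, 1 / 4, 4   # smoothing gains and multiplier from RFC 6298

def update_rto(srtt, rttvar, sample, min_rto=1.0):
    """Fold one RTT measurement (in seconds) into SRTT/RTTVAR and derive the RTO."""
    if srtt is None:                   # the first measurement initializes both
        srtt, rttvar = sample, sample / 2
    else:
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
        srtt = (1 - ALPHA) * srtt + ALPHA * sample
    return srtt, rttvar, max(min_rto, srtt + K * rttvar)
```

The variance term makes the RTO back off automatically on paths with jittery RTTs, which reduces spurious retransmissions.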
Slow start
Slow start is a mechanism in TCP that allows the sender to quickly reach a reasonable
sending rate when network conditions are unknown. A variable called the congestion
window (cwnd) is used for this purpose. The value of cwnd limits the amount of data that
the sender can inject into the network before receiving an ACK, and changes dynamically
based on feedback from the receiver. cwnd is initialized to a small initial value
and increased by one segment for each ACK that arrives. Because cwnd grows by one
segment per ACK, slow start exhibits exponential growth: for each ACK that is received,
twice as many packets leave the sender, all of which trigger ACKs that in turn cause
twice as many packets to leave the sender again, and so on.
In addition to the cwnd variable, another state variable called the slow start threshold
(ssthresh) is used to decide which of the two algorithms slow start and congestion
avoidance is used to control the data transmission.
The slow start algorithm is employed at the beginning of TCP transmissions, but also
after repairing loss detected by the RTO. ssthresh and cwnd are therefore important
state variables in the context of packet loss [94] [9].
Congestion avoidance
If the value of cwnd is larger than the value of ssthresh, the congestion avoidance
algorithm is employed instead of slow start.
In contrast to the slow start algorithm, the congestion avoidance algorithm follows
Additive Increase, Multiplicative Decrease (AIMD) when adjusting the cwnd state variable.
Instead of incrementing cwnd by one for each ACK, congestion avoidance increases it as
follows:
cwnd = cwnd + MSS * MSS/cwnd
This results in the window being increased by at most one segment per RTT — this
being the Additive Increase part of the algorithm — leading to linear growth of the
congestion window in this phase.
When a congestion signal is detected, typically in the form of a packet loss or increased
delay, congestion avoidance uses Multiplicative Decrease to decrease the value of cwnd
multiplicatively, often by halving the current window size. This is done to reduce
congestion in the network and can be seen in Figure 2.1 for TCP Reno [9] [94].
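The two growth modes and the multiplicative decrease can be sketched as follows (byte-based and simplified, with illustrative names; real stacks add details such as ACK clocking and byte counting):

```python
MSS = 1448  # bytes per segment (an illustrative value)

def on_ack(cwnd, ssthresh):
    """Window growth for one ACK: exponential below ssthresh, linear above."""
    if cwnd < ssthresh:
        return cwnd + MSS              # slow start: +1 MSS per ACK
    return cwnd + MSS * MSS / cwnd     # congestion avoidance: ~+1 MSS per RTT

def on_congestion(cwnd):
    """Multiplicative decrease: halve the window and continue from there."""
    ssthresh = max(cwnd / 2, 2 * MSS)
    return ssthresh, ssthresh          # new (cwnd, ssthresh)
```

Feeding a stream of ACKs through `on_ack` reproduces the familiar sawtooth: exponential ramp-up, linear climb, then a halving on each congestion event.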
Duplicate ACKs
If a sender transmits packets numbered from 1 to 5, and the receiver only successfully
receives segments 1, 3, 4, and 5, the receiver’s response to the reception of segment
1 will typically be “ACK 2”. This is an acknowledgment to the sender, indicating
that it is now awaiting the receipt of segment 2. Upon the arrival of segments 3, 4,
and 5, the receiver sends additional acknowledgments for the missing segment 2. These
additional acknowledgments are interpreted by the sender as duplicate acknowledgments
(DupACKs), suggesting the potential loss of a packet [94].
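The cumulative-ACK behavior in this example can be sketched directly (the function name is mine):

```python
def cumulative_acks(arrivals):
    """For each arriving segment, return the cumulative ACK a TCP receiver
    would send: the number of the next segment it still expects."""
    buffered, expected, acks = set(), 1, []
    for seg in arrivals:
        buffered.add(seg)
        while expected in buffered:    # advance past contiguously received data
            expected += 1
        acks.append(expected)
    return acks

# Segment 2 is lost: one normal ACK followed by three duplicate ACKs.
cumulative_acks([1, 3, 4, 5])   # -> [2, 2, 2, 2]
```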
Fast retransmit
Fast retransmit uses the duplicate-ACK scheme described above to let the sender
retransmit the repeatedly requested segment immediately, without waiting for the RTO
timer to expire [94].
Fast recovery
After fast retransmit has done its job and sent what appeared to be the missing segment,
the fast recovery algorithm controls the transmission of new data until a normal ACK
arrives. Since the receiver can only generate a DupACK when a packet has arrived,
receiving a DupACK at the sender does not necessarily mean that a packet loss occurred
— it can also mean that a packet was received out of order. Therefore, switching to slow
start mode is not necessary, and cwnd is directly set to half the current amount of data
in flight [9] [94].
TCP Reno
TCP Reno was one of the earlier congestion control algorithms introduced and
implemented in TCP. It utilizes slow start and congestion avoidance, in addition to
fast retransmit and fast recovery.
In slow start the congestion window increases exponentially: cwnd grows by one segment
for each ACK that is received, which lets the sender transmit two new segments per
ACK. In congestion avoidance and fast recovery the congestion window grows linearly,
by roughly one segment per RTT. This can be seen in Figure 2.1.
Slow start is used in Reno at the beginning of a transmission and whenever a loss is
detected through an RTO. Congestion avoidance is used in Reno at the end of the slow
start phase or after a loss is detected via DupACKs [9].
TCP Cubic
TCP Cubic was first implemented in the Linux kernel in 2006 [33], and features many
of the same components as Reno with some key differences.
To achieve better network utilization and stability, Cubic uses both the concave and
convex profiles of a cubic function to adjust the cwnd. This is in contrast to some other
non-Reno congestion control algorithms that only use convex functions. Like Reno,
Cubic responds to congestion events that are detected by DupACKs (fast retransmit
and fast recovery). But unlike Reno, Cubic registers the congestion window size when
that happens and stores it in a state variable called W_max.
Cubic can run in three different modes depending on the value of the current cwnd and
W_max, all of which are listed below:
The TCP-friendly region This mode ensures that Cubic achieves at least the same
throughput as standard TCP and is used in networks where standard TCP
performs well, such as networks with short RTTs or low bandwidth.
The concave region This mode is used if Cubic is not in the TCP-friendly region and
cwnd is less than W_max.
The convex region This mode is used if Cubic is not in the TCP-friendly region and
cwnd is greater than W_max.
This pattern of first increasing the cwnd using the concave profile of the cubic function,
then switching to its convex profile, promotes high network utilization and
stability [77].
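The window function itself is given in RFC 8312 as W_cubic(t) = C(t - K)^3 + W_max, where K is the time the curve takes to climb back to W_max after a loss. A direct sketch:

```python
C, BETA = 0.4, 0.7   # constants from RFC 8312

def w_cubic(t, w_max):
    """Cubic window (in segments) t seconds after the last congestion event."""
    k = (w_max * (1 - BETA) / C) ** (1 / 3)   # time to climb back to w_max
    return C * (t - k) ** 3 + w_max
```

Right after a loss (t = 0) the window sits at BETA * w_max; it rises steeply, flattens out near w_max (the concave region), and then accelerates again past it (the convex region).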
How the congestion window increases and decreases in response to congestion is shown
in Figure 2.2, where the cubic increase and decrease pattern can be clearly seen.
There have been multiple studies comparing the performance of Reno and Cubic, with
some key findings summarized below:
• Cubic seems to generally outperform Reno in networks with long fat pipes — long
fat pipes referring to networks with high bandwidth and high latency [108].
• In mixed environments where there are both Reno and Cubic flows, the Cubic flows
could be more aggressive and take a larger share of the bandwidth than Reno flows,
but this is not necessarily the case [24].
• The cwnd growth behavior can be smoother in Cubic than in Reno. There are also
scenarios where Cubic can have less abruptly falling cwnd values and more stable
cwnd values over time [40].
One interesting aspect of Cubic is fast convergence, a heuristic added to Cubic to
improve convergence speed in cases where a new flow joins the network and existing
flows have to give up some of their bandwidth. Fast convergence is designed for network
environments with multiple Cubic flows and can make Cubic's behavior harder to predict
[77].
2.1.9 ECN
Explicit Congestion Notification (ECN) is a mechanism that lets a router mark a packet
instead of dropping it, with the sender then behaving as if the packet had been dropped.
ECN was the first feasible TCP/IP congestion control solution to incorporate explicit
feedback (in the form of header bits) and it is able to reduce loss in the presence
of routers that use active queue management [94].
Active Queue Management (AQM) lets routers manage their queues before they overflow,
keeping queuing delay low for all of the flows
sharing that queue. In addition, AQM means that protocols like TCP — which rely
on packet drops as a congestion signal — have a way of detecting congestion before the
packets are dropped [28].
RED
Random Early Detection (RED) is a well-known AQM mechanism that aims to reduce
end-to-end delay by keeping the average queue size low while still allowing for occasional
bursts of traffic [94].
RED dynamically calculates the average queue size using a low-pass filter with an
exponential weighted moving average (EWMA), as described by the following formula
[27]:
avg_q ← (1 − w_q) · avg_q + w_q · q
where avg_q is the average queue length estimate, q is the instantaneous queue length,
and w_q is a factor that controls how fast the EWMA adapts to fluctuations in the queue
length.
avg_q is compared to a minimum threshold min_th and a maximum threshold max_th, and
packets are marked (which could manifest in dropped packets, a bit in the IP or TCP
header being altered, or other viable actions being executed) according to the following
scheme [27]:
• If avg_q < min_th, no packets are marked.
• If min_th ≤ avg_q < max_th, arriving packets are marked with a probability that
increases linearly with avg_q, up to a configured maximum.
• If avg_q ≥ max_th, every arriving packet is marked.
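A minimal sketch of the EWMA filter and the marking rule (the names and the max_p default are illustrative):

```python
def update_avg(avg_q, q, w_q=0.002):
    """EWMA low-pass filter over the instantaneous queue length q."""
    return (1 - w_q) * avg_q + w_q * q

def mark_probability(avg_q, min_th, max_th, max_p=0.1):
    """RED marking probability: zero below min_th, a linear ramp between
    the thresholds, and certain marking at or above max_th."""
    if avg_q < min_th:
        return 0.0
    if avg_q >= max_th:
        return 1.0
    return max_p * (avg_q - min_th) / (max_th - min_th)
```

The small default weight w_q makes the average track sustained congestion while ignoring short traffic bursts, which is exactly the behavior RED aims for.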
Header bits
As previously mentioned, ECN incorporates explicit feedback in the form of header bits.
The ECN field in the IP header is used for this purpose, where two bits are used to
indicate one of four ECN codepoints as follows [28]:
00 Not ECT : Used to indicate that ECN should not be used.
01 ECT(1) : Used to indicate that the end-nodes in a transmission are ECN capable,
and is confirmed in TCP in the pre-negotiation during the connection setup phase.
10 ECT(0) : Same as ECT(1).
11 Congestion Experienced (CE) : Used to mark packets that encounter congestion
with a probability proportional to the average queue size.
In addition to the IP header bits, there are two TCP header bits that are used for ECN
as well; these are [28]:
ECN-Echo (ECE) : Set by the receiver to signal back to the sender: “I saw a CE bit,
so reduce your rate as if the packet had been dropped” [94].
Congestion Window Reduced (CWR) : Used by the sender to signal to the
receiver that it has reduced its cwnd in response to a congestion notification.
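The ECN field occupies the two low-order bits of the IP TOS/traffic class byte, so the codepoints can be manipulated directly (the function names are mine):

```python
# The four codepoints carried in the two ECN bits [28].
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

def ecn_codepoint(tos: int) -> int:
    """Extract the ECN codepoint from a TOS/traffic class byte."""
    return tos & 0b11

def mark_ce(tos: int) -> int:
    """What an ECN-capable router does instead of dropping: set CE,
    leaving the DSCP bits in the rest of the byte untouched."""
    return (tos & ~0b11) | CE
```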
Benefits of ECN
There are potential benefits to using ECN instead of relying solely on packet drops as
the congestion signal; several of these are discussed in [25].
Figure 2.3: The ECN signaling process between a sender and receiver
2.1.10 Non-loss-based congestion control algorithms
DCTCP
Data Center TCP (DCTCP) targets data center networks, whose traffic patterns and
shallow switch buffers can lead to normal TCP not being optimal, causing high latencies
and frequent packet losses.
DCTCP tries to solve the abovementioned problems by leveraging and extending ECN
to estimate the fraction of bytes that encounter congestion and scaling the congestion
window based on this estimate. This allows the various senders in the data center to
react proportionally to congestion instead of simply halving their window, and provides
the data center servers with high burst tolerance, low latency, and high throughput [8].
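The core of this estimate is a running average alpha of the fraction of CE-marked bytes, which then scales the window reduction (a sketch with illustrative names; the gain g follows the paper's suggestion):

```python
G = 1 / 16   # EWMA gain suggested in the DCTCP paper [8]

def update_alpha(alpha, marked_bytes, total_bytes):
    """Fold one window's fraction of CE-marked bytes into the estimate."""
    frac = marked_bytes / total_bytes if total_bytes else 0.0
    return (1 - G) * alpha + G * frac

def cwnd_on_congestion(cwnd, alpha):
    """Reduce the window in proportion to the congestion seen: with alpha == 1
    this is Reno's halving; with mild congestion the cut is much gentler."""
    return cwnd * (1 - alpha / 2)
```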
BBR
BBR is a congestion control algorithm developed by Google that primarily relies on
different signals than packet loss to detect congestion, with the goal being to achieve
higher bandwidths and lower latencies for traffic on the Internet [11].
The paper from Google [11] argues that loss-based congestion control algorithms such
as Cubic are not optimal — causing bufferbloat in the case of large buffers and
misinterpreting packet loss as a congestion signal in the case of small buffers, leading to
low throughput.
BBR works by continually measuring the bottleneck bandwidth and the round-trip
propagation time (the quantities that give BBR its name) in order to control its sending rate.
How the congestion window varies throughout the connection for BBR can be seen in
Figure 2.4.
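BBR's model of the path boils down to two filtered measurements whose product is the BDP it tries to keep in flight. A simplified sketch (real BBR uses windowed filters and pacing-gain cycling, omitted here):

```python
def path_model(bw_samples_bps, rtt_samples_s):
    """Estimate BtlBw (the max delivery-rate sample), RTprop (the min RTT
    sample), and the resulting bandwidth-delay product in bytes."""
    btl_bw = max(bw_samples_bps)
    rt_prop = min(rtt_samples_s)
    return btl_bw, rt_prop, btl_bw * rt_prop / 8
```

Keeping roughly one BDP of data in flight is what lets BBR fill the pipe without building a standing queue at the bottleneck.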
LEDBAT
Low Extra Delay Background Transport (LEDBAT) — first technically documented in
2012 [82] — is a congestion control algorithm that relies on delay as a signal of congestion.
It measures the time a packet takes to travel from the sender to the receiver, not just
the round-trip time. This is done by applying a timestamp to every packet leaving the
sender; the receiver subtracts this timestamp from its local time and returns the result
(the one-way delay from sender to receiver) to the sender, which then tracks how this
delay changes over time [30].
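The delay bookkeeping described above can be sketched as follows (function names are mine; real LEDBAT additionally maintains a target delay and nudges cwnd toward it):

```python
def one_way_delay(send_timestamp, receiver_clock):
    """Receiver-side: local clock minus the packet's timestamp. Any clock
    offset between the hosts is constant, so *changes* in this value
    reflect queuing delay."""
    return receiver_clock - send_timestamp

def queuing_delay(owd_samples):
    """LEDBAT-style estimate: the current one-way delay minus the smallest
    delay observed (the base delay), isolating the queue-induced part."""
    return owd_samples[-1] - min(owd_samples)
```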
LEDBAT is primarily designed for one-way bulk transfer applications like file-sharing
and file-updates, and is therefore used by multiple operating systems for the purpose of
operating system updates [78].
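The delay computation described above can be illustrated with a small sketch (a simplification of the idea, not an implementation of the LEDBAT specification; the TARGET and GAIN values are assumptions based on RFC 6817's suggested defaults):

```python
TARGET = 0.1  # allowed queuing delay in seconds; assumed RFC 6817 default
GAIN = 1.0    # window gain per RTT; assumed value

def one_way_delay(receiver_clock, sender_timestamp):
    # Clocks need not be synchronized: a constant offset cancels out
    # when delays are compared against the observed minimum below.
    return receiver_clock - sender_timestamp

def ledbat_cwnd(cwnd, delay_samples):
    base = min(delay_samples)            # best-known propagation delay
    queuing = delay_samples[-1] - base   # current queuing delay estimate
    off_target = (TARGET - queuing) / TARGET
    return cwnd + GAIN * off_target      # grow below target, shrink above
```

Because the controller only reacts to the queuing delay relative to the observed minimum, background LEDBAT flows back off before queues grow large enough to disturb foreground traffic.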
Chapter 2. Background
2.2. Machine learning
We can generally classify machine learning algorithms into three different categories
depending on how they work and on their application areas [54]:
Supervised learning In supervised learning a dataset of examples with the correct
responses is provided — this is typically referred to as a labeled training set —
and, based on this dataset, the algorithm tries to learn from it and generalize to
respond correctly to all possible inputs, so that the algorithm can be applied to
new, unseen data.
Unsupervised learning Unlike supervised learning, in unsupervised learning there is
no labeled training dataset, meaning that the correct responses are not provided for
the input data that is used to train the model. Unsupervised learning techniques
use the input data to identify similarities between the inputs so that inputs that
are similar are categorized together.
Reinforcement learning Reinforcement learning lies somewhere between supervised
and unsupervised learning. As the name suggests, reinforcement learning
techniques get told whether or not the answer is correct, but do not get told
how to improve it. The algorithm therefore has to explore and try out different
strategies until it works out how to get the answer right.
In the context of this thesis, supervised learning is the most relevant because we are
dealing with a case of binary classification, which is a subset of supervised learning.
Reinforcement learning is partly relevant because it is used in many newer
TCP congestion control mechanisms [107] [48] [66] and could — with the correct
implementation and dataset — be applied to the problem discussed in this thesis.
Unsupervised learning is not relevant for the purpose of this thesis since it deals with
problems that are different in nature — more related to clustering and seeing patterns
in data.
Data collection
If data is not already available, the first step in the machine learning process is typically
data collection. The data collection step can often be merged together with the feature
selection step so that the feature selection step guides the data collection process in
order to only collect the relevant data. If we want to train a machine learning model on
data from birds, where the features should be the height, weight, and sex of the birds, it
makes sense to collect only this data in the data collection step to save time and make
it more feasible.
The amount of data that needs to be collected typically varies from problem to problem
and is a trade-off between computational overhead and model performance. One
approach to ensure that the data is valuable before collecting vast amounts of it is to
first assemble a reasonably small dataset with all of the features that are believed to be
useful, experiment with that dataset by training and evaluating a model, and only then
choose the best features and collect the full, much larger dataset [54].
Data transformation
The data gathered in the data collection process is sometimes not in a format that is
suitable or optimal for a given machine learning model. For example, many machine
learning models expect numerical features instead of string representations of a given
value — meaning, the number 180 instead of the string 180cm or similar.
Data transformation is therefore the step in the machine learning process that takes raw
data from the data collection step and converts it to a format that the machine can
understand and work with.
Depending on the type of data, the data transformation step can include one or more
of the common techniques listed below:
Data cleaning Data cleaning is the process of detecting and repairing incorrect or
incomplete information in the dataset. This can be done by adding or repairing
missing values, dealing with outliers, or converting a feature to the same format
when the raw data has different formats for the same feature, amongst other things
[14].
Feature extraction Feature extraction typically involves both feature construction
and feature selection. Roughly, this means reducing a large amount of raw
information down to a smaller set of more useful variables — referred to as
features — by constructing and selecting relevant and informative variables
that describe the data in question. It can also involve adding extra information
to the dataset where none existed previously, the latter being a form of data
augmentation [32].
Data scaling Data scaling can involve scaling or casting the features to a certain range,
for example 0–1, with the goal being to transform them to be on a similar scale so
that each feature has an equal contribution to the model performance [83].
Data aggregation Sometimes, the data used to train a machine learning model may be
sourced from a variety of places, resulting in multiple distinct datasets. Combining
these datasets into a single large dataset is called data aggregation. The opposite
process of splitting a single large dataset into multiple smaller subsets is referred
to as data disaggregation.
Sampling/splitting A common approach in machine learning — especially in
supervised learning techniques — is to separate a large dataset into training,
validation, and optionally a final testing dataset. This is done to ensure that the
data can be trained on the training dataset but evaluated on the validation dataset.
The final testing dataset is used to test the model and check its performance on
data that it has not encountered during the training and evaluation phase. There
are many ways this splitting can be done, and the exact technique can impact the
performance of the model by potentially adding more or less variance [76].
Methods such as the ones mentioned above can in many cases improve the performance
and stability of a model, and data transformation is therefore usually an important step
in most machine learning processes.
2.2.3 Features
Given a dataset, machine learning features can be regarded as the various variables that
describe the elements of the dataset. For example, if the dataset consists of data from
various people belonging to a certain demographic, the features can be variables such as
height, weight, sex, and so on. The elements of a dataset can therefore be described
as feature vectors where each element in the vector contains a feature that describes an
aspect of the element.
Features can either be extracted directly from the original data or constructed by
combining or transforming different aspects of the original data or other features.
For a given machine learning problem, the algorithm used to solve it, and the data
collected for training, not all features can be considered equal with regard to their
impact on model performance. Feature selection is therefore usually an important step
when dealing with machine learning features for a given problem [54].
Feature construction
Sometimes, machine learning features can be constructed by combining or transforming
different aspects of the other features. To expand on the example from the first paragraph
— namely the features height, weight, and sex — a constructed feature could be BMI,
which is created from the weight and height.
For certain problems, the difference between the current and previous values for a given
feature could be more interesting than the values in isolation. This could be the case if
upwards or downwards trends in the given feature are relevant to the problem.
Constructed features can sometimes have a higher correlation with the target label in
binary classification tasks — meaning that they are more likely to be informative for the
prediction [34].
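The two constructions mentioned above can be sketched as follows (illustrative only; heights are assumed to be in meters and weights in kilograms):

```python
def bmi(weight_kg, height_m):
    # New feature combined from two existing features
    return weight_kg / height_m ** 2

def deltas(series):
    # Difference between each value and its predecessor, capturing trends;
    # one element shorter than the input, so the first sample has no delta
    return [b - a for a, b in zip(series, series[1:])]

print(round(bmi(70, 1.80), 1))   # -> 21.6
print(deltas([10, 12, 11, 15]))  # -> [2, -1, 4]
```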
Feature selection
When dealing with data in machine learning, there are usually many features that can
be extracted from said data. Not all of these features are useful however, and some are
more useful than others — useful referring to how informative they are to the solution.
For this reason, feature selection is a critical step when dealing with a given machine
learning problem.
Feature selection consists of identifying the features that are most useful for the problem
and its solution and should be supplied to the machine learning algorithm to construct
a model [54]. This usually requires knowledge of both the problem and the data.
In addition to identifying and selecting the features that are most useful for the machine
learning algorithm, it is important that the features can be collected without significant
expense or time and that they are robust to noise and other inaccuracies in the data
[54].
one of more than two options is predicted for a given input. The former is referred to
as binary classification while the latter is referred to as multi-class classification. For
the purpose of this thesis, we are only interested in binary classification because we are
dealing with packet loss, which is a situation that can either happen or not.
As briefly mentioned in Section 2.2.2, when dealing with supervised learning techniques,
a training and test dataset is typically required. The training dataset is used to learn
the behavior of the target function, while the test dataset is used to test the performance
of the model.
There are many different types of supervised learning methods. Some of the most
common and relevant to this thesis are listed below [64] [13]:
Decision trees Decision trees consist of different nodes, starting with a root node.
The root node has no incoming edges, while the rest of the nodes have exactly one
incoming edge. These remaining nodes are further split into internal nodes, which
have outgoing edges, and leaf nodes, which do not. Each leaf is assigned to one class
that represents the most appropriate target value, so that inputs are classified by
going down the tree starting at the root and ending at a leaf.
Linear regression Linear regression uses regression analysis to specify a relationship
between one or more features and a target label by fitting a straight line to the
data — hence the name, linear regression. Linear regression is typically used to
predict a continuous variable, meaning a numeric variable, and is therefore usually
not suitable for classification tasks.
Logistic regression The goal of logistic regression is to find the relationship between
some given features and the probability of a given outcome — or in the context of
classification, the probability that the datapoint represented by the given features
belongs to a certain class. Rather than fitting a straight line to the data like
linear regression, logistic regression fits an S-shaped curve to the data using a
sigmoid function called the logistic function — hence the name, logistic regression.
Unlike linear regression, logistic regression is well suited for classification problems,
especially binary ones [5].
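To illustrate the last of these, the sketch below shows how a logistic regression model with already-fitted parameters turns a feature vector into a class probability (the weights and bias here are made up for illustration; in practice they are learned from labeled training data):

```python
import math

def sigmoid(z):
    # The logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    # Linear combination of the features, squashed into a probability
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)  # probability of the positive class

p = predict_proba([2.0, -1.0], weights=[1.5, 0.5], bias=-1.0)
print(p > 0.5)  # -> True: classified as positive at the default threshold
```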
but this does not mean that it will show the same results for data it has not encountered
during training, potentially leading to a poor model that does not generalize well.
When the model has been trained and evaluated one or more times, an optional final
testing dataset can be used to test the model and check its performance on data that it
has not encountered during either training or evaluation.
Data labeling
As mentioned in the previous section, when dealing with supervised learning algorithms,
training data is needed. This training data needs to be labeled, where labeled refers to
the correct output being assigned to each row in the dataset. For example, if the rows
represent emails, the label could be Spam or Not Spam.
The reason that the training data needs to be labeled is that supervised learning
algorithms fit a function to some given training data in order to minimize the error
with respect to the difference between the function outputs and the actual labels. The
labels are therefore used to guide the model in a certain direction so that it can hopefully
detect the desired data pattern and predict this pattern for new, unseen inputs that were
not encountered during training.
Sometimes, the way that the data should be labeled is very straightforward. To expand
on the email example above, it is quite easy to assign a label of spam or not spam when
creating training data containing emails. Or when dealing with weather data, it is easy
to assign a label of rain or no rain. But this is not always the case, and care should
therefore be taken when labeling training data.
For the reasons mentioned above, the way that the data is labeled can have a great
influence on the model performance.
Imbalanced datasets
In binary classification, one of the classes is usually referred to as the positive class
and the other the negative class. Depending on the dataset and the problem under
examination, there might be more or fewer samples from the positive class present in the
set. When there are very few or very many — depending on which class is referred
to as the positive one — positive examples in the training set, the data is referred to
as imbalanced. In this case, the class that makes up the large majority of the set is
called the majority class, while the other class is called the minority class. Typically,
the positive class is the minority class, while the negative class is the majority class.
When dealing with imbalanced datasets, there are various techniques that can be applied
to try to combat the problem, with a common technique being to try to rebalance the
dataset artificially by upsampling and/or downsampling, where samples are replicated
from the minority class or ignored from the majority class respectively.
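A naive version of this rebalancing can be sketched as follows (an illustration only; libraries such as imbalanced-learn provide more principled variants):

```python
import random

def upsample(minority, target_size, seed=0):
    # Replicate minority samples by drawing with replacement
    rng = random.Random(seed)
    return minority + [rng.choice(minority)
                       for _ in range(target_size - len(minority))]

def downsample(majority, target_size, seed=0):
    # Keep only a random subset of the majority class
    return random.Random(seed).sample(majority, target_size)

majority = [("no-loss", i) for i in range(95)]
minority = [("loss", i) for i in range(5)]
balanced = downsample(majority, 50) + upsample(minority, 50)
print(len(balanced))  # -> 100, now with a 50/50 class split
```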
Performance metrics
When dealing with machine learning problems, one needs to know if a given model is
good or not — where good refers to how well it can solve the problem — and if a model is
getting better or worse throughout the process of training. Various performance metrics
therefore exist to evaluate the performance of machine learning models.
Some of the most common metrics when dealing with binary classification problems are
listed below [16]:
Accuracy Accuracy is defined as the number of correct predictions — positive or
negative — divided by the total number of predictions.
Precision Precision is defined as the number of true positives divided by the sum of
true positives and false positives, where true positives refer to inputs that were
classified as positive and actually were positive, and false positives to inputs that
were classified as positive but were actually negative. Precision therefore identifies
the proportion of inputs classified as positive that actually were positive.
Recall Recall is defined as the number of true positives divided by the sum of the true
positives and false negatives, where true positives refer to the same as for precision
and false negatives refer to inputs that were classified as negative but were actually
positive. Recall therefore identifies the proportion of actually positive inputs that
were correctly classified as positive.
F1 score The F1 score is the harmonic mean of precision and recall and therefore a
useful metric to evaluate model performance with regards to both precision and
recall.
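The four metrics above can be written out directly from the raw counts of true/false positives and negatives (a simple sketch; libraries such as scikit-learn compute these from predictions and labels):

```python
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Made-up counts for an imbalanced problem: few positives, many negatives
acc, prec, rec, f1 = metrics(tp=8, fp=2, tn=85, fn=5)
print(round(prec, 2), round(rec, 2))  # -> 0.8 0.62
```

Note that accuracy here is 0.93 even though a third of the actual positives were missed, which is why precision, recall, and F1 are usually preferred over accuracy for imbalanced data.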
Confusion matrices
When dealing with binary classification problems, it is often useful to visualize the
amount of True and False Positives (TP, FP), and True and False Negatives (TN, FN).
A TP refers to a sample that was both labeled and classified as True, while a FP refers
to a sample that was classified as True but labeled as False. Similarly, a TN refers to a
sample that was both labeled and classified as False, while a FN refers to a sample that
was classified as False but labeled as True.
A confusion matrix shows the amount of TP, FP, TN, and FN in a table with two rows
and columns, as shown in Figure 2.5. It therefore shows how often the model correctly
predicts the positive or negative class and can be useful for understanding the behavior
and performance of a classifier.
Classification threshold
As already briefly mentioned in Section 2.2.4, the output of a logistic regression model
is a probability — more specifically a number between 0 and 1 — which represents the
probability that the input belongs to a certain class. Typically, an output closer to 1
indicates that the input most likely belongs to the positive class, while an output closer
to 0 indicates that the input most likely belongs to the negative class. But what about
the cases where the output is 0.51 or 0.49? There needs to be a way to decide when the
input should be classified as either positive or negative — depending on the probability
given by the model. This is achieved by making use of a classification threshold, typically
with a value of 0.5 [112]. When the threshold is set to 0.5, it typically means that inputs
with an output below 0.5 are classified as the negative class, while inputs with an
output greater than or equal to 0.5 are classified as the positive class.
Figure 2.5: Confusion matrix visualizing the amount of True Positives, False Positives, True
Negatives, and False Negatives
When dealing with imbalanced datasets, the default threshold of 0.5 is usually not
optimal for model performance [112] [23]. This is because the probability distribution
for imbalanced data tends to be biased toward the majority class [43], while the minority
class is often the one that is interesting when dealing with imbalanced datasets. Due to
these reasons, the threshold should typically be adjusted when dealing with imbalanced
datasets to improve model performance.
While it is possible to compute the best classification threshold, it is also possible to do
manual testing in order to find a good threshold.
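Such manual testing can be sketched as a simple threshold sweep that picks the threshold maximizing the F1 score on a validation set (the probabilities and labels below are made up for illustration):

```python
def f1_at(threshold, probs, labels):
    preds = [p >= threshold for p in probs]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Made-up validation outputs: model probabilities and true labels
probs = [0.05, 0.2, 0.3, 0.35, 0.6, 0.9]
labels = [False, False, True, True, False, True]

# Sweep thresholds 0.01..0.99 and keep the one with the best F1 score
best = max((t / 100 for t in range(1, 100)),
           key=lambda t: f1_at(t, probs, labels))
```

On this toy data the best threshold lies below the default of 0.5, matching the observation above that imbalanced probability distributions often call for an adjusted threshold.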
Boosting
One of the most popular ensemble methods is called boosting. In boosting, many low-
quality models are put together in a useful way to produce a final model that hopefully
gives good results [54].
In boosting algorithms, the process of creating the final ensemble model is typically done
by starting with a base model and then adjusting the distribution of the training samples
according to the results of this model so that incorrectly classified samples receive more
attention by the subsequent base models. This way the second base model is trained
with the adjusted training samples, and the result is used to adjust the training sample
distribution again so that the next base model is trained with the adjusted training
samples, and so on [111].
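One round of this reweighting can be sketched in the style of AdaBoost, a classic boosting algorithm (an illustration with made-up numbers, not the full procedure described in [111]):

```python
import math

def reweight(weights, correct, error_rate):
    # alpha grows as the base model's weighted error shrinks
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)
    # Misclassified samples gain weight, correct ones lose weight
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]  # renormalize to a distribution

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # the base model got sample 4 wrong
weights = reweight(weights, correct, error_rate=0.25)
print(round(weights[3], 2))  # -> 0.5: the missed sample now dominates
```

The next base model is then trained against this adjusted distribution, so it focuses on the samples the previous model got wrong.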
Gradient boosting
Gradient boosting — a now common boosting technique first introduced in 1999 by
Jerome Friedman [29] — is often used in classification tasks and produces a prediction
model in the form of an ensemble of weak prediction models. The weak prediction models
in gradient boosting can be decision trees [65]. When this is the case, the booster can
be referred to as a gradient-boosted tree.
An important part of training and tuning a model in machine learning is therefore to
balance bias and variance, where the more important of the two depends on the problem
being solved.
2.3. Related works
Weekly sales, daily stock prices, and yearly temperature changes are all examples of
time-series data.
Sequence models can be applied to time-series data in order to predict future values based
on past values, such as today’s stock price given the stock prices of the last few years.
This is referred to as time-series forecasting. Since the order and relationship between
the various samples matters in sequence data, missing values can have a great impact on
model performance and need to be handled carefully. A different machine learning
model that does not rely on the temporal aspect of the data can sometimes handle such
missing values better.
Some common algorithms for dealing with sequence data are Recurrent Neural Networks
(RNNs) and Long Short-Term Memory (LSTM) networks, the latter being a special type
of RNN designed to avoid long-term dependency problems [52].
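The forecasting setup described above is commonly framed with sliding windows, where each training example pairs the last n observations with the next value as the label (a minimal sketch with made-up prices):

```python
def make_windows(series, n):
    # Each example: (last n values, the value that follows them)
    return [(series[i:i + n], series[i + n])
            for i in range(len(series) - n)]

prices = [101, 103, 102, 105, 107]
print(make_windows(prices, n=3))
# -> [([101, 103, 102], 105), ([103, 102, 105], 107)]
```

A sequence model such as an LSTM is then trained on these (history, next value) pairs to forecast future values.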
could be more suitable. However, the algorithms generated by Remy are tailored to
specific networks with known and defined characteristics rather than the Internet in
general, and are therefore not widely used today.
W. Wei, H. Gu, and B. Li [93] examined and compared some of the most prominent
research results on the use of machine learning to design congestion control mechanisms.
In their paper, they pointed to challenges involving designing congestion control
mechanisms that work with a wide variety of network characteristics, where they argued
that a good congestion control mechanism needs to be able to operate effectively over a
large range of BDPs.
Their paper discusses and summarizes various congestion control protocols based on
offline or online learning — where offline learning refers to a protocol that uses a pre-
trained machine learning model, and online learning refers to a protocol that learns in
real time. The paper discusses the following machine learning based congestion control
protocols:
Remy The paper briefly discusses Remy [95], in order to give a first example of
a congestion control mechanism that uses optimization and machine learning
techniques to learn and optimize for the dynamic behavior of the network path
between a source and a destination, rather than using a hand-tuned heuristic such
as BBR. It briefly touches on Remy’s disadvantage related to offline optimization
by mentioning that when the network environment deviates from the input
assumptions and network models made, performance may degrade due to a
mismatch. This is related to the concept of bias and variance in machine learning.
PCC The paper briefly discusses Performance-oriented Congestion Control (PCC) [20],
which, compared to Remy, learns in an online fashion using multiple micro-
experiments. Each of these micro-experiments sends at two different rates, and
evaluates which rate leads to better performance. Using such micro-experiments,
the algorithm can learn in real-time and move in the direction of improved
performance, with the key idea being to learn the relationship between rate control
actions and the performance that is empirically observed — where, similarly to
Remy, performance is defined by a utility function that describes an objective,
such as to achieve high utilization of the bottleneck bandwidth with low loss rates.
PCC Vivace PCC Vivace [19] is an evolution of PCC that mainly differs in its utility
function, which incorporates not only throughput and loss rates as in PCC, but
also RTT gradients. The paper mentions that PCC Vivace is more TCP-friendly,
converges faster, and reacts more swiftly to changes in network conditions.
Indigo Like PCC and PCC Vivace, Indigo [109] is a novel congestion control based
on machine learning. However, unlike PCC and PCC Vivace, Indigo uses offline
learning using a supply of network emulators to generate training data. The paper
argues that it can be difficult to use online learning to train a congestion control
algorithm, because many ML algorithms require long training times (hours to
weeks), while the condition of network paths can evolve in much shorter time scales
(seconds). They therefore touch on the fact that offline training based approaches
have an advantage in that they are using pre-trained models.
The paper discusses the trade-off between offline and online learning, and how —
depending on the network environment — one approach can be better than the other.
They argue that offline training based approaches, such as Remy and Indigo, perform
better than online training based approaches, such as PCC and PCC Vivace, under the
assumption that the training environment does not deviate significantly from the actual
network environment. This highlights an important consideration when designing and
training a machine learning based congestion control algorithm, and is related to the
general concept in machine learning of bias and variance (Section 2.2.7).
The paper also discusses and summarizes various congestion control algorithms based
on Deep Reinforcement Learning (DRL), where they discuss opportunities for a fully-
automated mechanism to train a DRL agent by interacting with a real-world network
environment, in order to avoid hand-tuned heuristics as much as possible. Some of these
are described below.
Aurora Aurora [41] is a rate-based congestion control protocol based on DRL, where
the agent uses changes in the sending rate as its actions, and uses statistics about
latencies, as well as the ratio of packets sent to those acknowledged, as its states. It
therefore uses the aforementioned information about latencies and ratio of packets
vs. acknowledged to either increase or decrease its sending rate. The reward
function of Aurora, which the algorithm uses to know if it is improving or not, is
formulated as a linear function based on throughput, latency, and loss. The paper
discusses performance improvements in Aurora when compared to protocols such
as TCP Cubic, but also drawbacks related to missing results or research for various
types of network links, real-world networking environments, and how Aurora reacts
to competing flows.
R3Net Like Aurora, R3Net [26] is a congestion control protocol based on DRL with a
focus on minimizing packet latencies. R3Net was developed by Microsoft with its
design targeting video streaming and real-time conferencing applications. It uses a
simulator based on trace replays to train the DRL agent using simulated network
links and cross traffic. Since R3Net was designed specifically for low-latency real-
time traffic and not tuned for the general Internet, it has not been evaluated against
existing heuristics that are intended for the general Internet, and its performance
in this case is not clear.
MVFST-RL MVFST-RL [84] is, like Aurora and R3Net, a DRL-based congestion
control protocol. It was developed by Meta and uses a non-blocking DRL agent,
where a sender does not need to wait for the agent to produce an action. This
gives an advantage with respect to the number of bytes transmitted.
Orca Orca [1] is a hybrid congestion control protocol that depends on TCP for fine-
grained control actions. It uses a DRL agent to adjust the size of the TCP
congestion window (cwnd). The paper claims that Orca is able to achieve very good
performance in typical network environments while incurring little computation
overhead. Since Orca uses TCP Cubic under the hood, it is friendly to competing
Cubic flows.
The paper wraps up by discussing some challenges and future directions related to
machine learning based congestion controls. They point to challenges related to the
design of the agent in the various DRL-based congestion control algorithms, discussing
that the agent needs to be well-designed, including its training algorithm, its neural
network model, its state and action spaces, as well as its reward function. They therefore
argue that, to some extent, even more components need to be hand-tuned compared to
traditional heuristics, without intuitive links between the designs and their corresponding
outcomes.
In addition they highlight challenges in realistic deployments of DRL-based congestion
controls related to the feasibility of implementing such protocols efficiently. Since
congestion control protocols reside in the transport layer, and are traditionally part
of the operating system kernel, they point to practical limitations on what can be
implemented in the kernel, where they mention that TCP Cubic, for example, uses
approximation algorithms to avoid floating-point computation in the kernel due to these
limitations. They claim that DRL algorithms could require computations that may have
to be carried out in user space, where they use Orca as an example, this being
a DRL-based algorithm that implements its agent in user space with TensorFlow, and
communicates with TCP Cubic in the kernel via socket options.
In addition to the machine learning based congestion control protocols discussed above
and in the referenced paper by W. Wei, H. Gu, and B. Li [93], there have been multiple
other such protocols presented. Some of these are briefly mentioned below.
QTCP QTCP [48] is an RL-based congestion control mechanism that uses online
learning to enable senders to gradually learn the optimal congestion control
policy. Unlike traditional heuristics, it does not use hard-coded rules and can
be generalized to a variety of different networking scenarios. The authors claim
higher throughput while maintaining low transmission latency compared to the
traditional rule-based TCP [48].
TCP-Drinc Deep reinforcement learning based congestion control (TCP-Drinc) [107]
is another DRL-based congestion control mechanism which adjusts the cwnd based
on past experience in the form of a set of measured features — such as differences
in cwnd and RTT, minimum RTT, and the inter-arrival time of ACKs — that are
stored in a buffer as historical data.
SmartCC SmartCC [49] is an RL-based multipath congestion control mechanism
designed to deal with the diversities of multiple communication paths in
heterogeneous networks. It adjusts the subflows' cwnd values adaptively to fit
different network scenarios. Compared to some other RL-based congestion control
mechanisms, in SmartCC the model training and execution steps are decoupled,
and the learning process therefore does not introduce additional overhead in the
form of added delay.
delay, not based specifically on TCP traffic, and not based on any sort of machine
learning, but rather employed more traditional heuristics. The way they describe how
the framework could be applied is, however, very interesting and closely mimics what
is being investigated in this thesis — proactive congestion avoidance based on real-time
packet loss prediction.
In their paper titled “A machine learning approach for packet loss prediction in science
flows”, A. Giannakou, D. Dwivedi, and S. Peisert [31] discuss how they developed
a machine learning tool based on Random Forest regression for predicting packet
retransmissions in science flows. Their work discusses predicting the amount of
retransmissions using regression, not packet loss or congestion in real time proactively,
and is therefore not directly dealing with the same problem that this thesis is tackling.
However, the fact that they used a tree-based algorithm to construct a machine learning
model based on various features such as cwnd and rtt is very interesting and closely
related to the work discussed in this thesis.
There have been multiple other papers [4] [55] that are quite similar to the paper
discussed above by A. Giannakou, D. Dwivedi, and S. Peisert, in the sense that the
topic of discussion is either predicting the amount of packet loss or the rate of packet
loss, but not if packet loss will occur at any given point in time or not.
There have been multiple papers published on the topic of congestion prediction [35] [2]
[75] [70]. However, these papers are more related to predicting congestion in the network
as a whole or in specific links in the network, sometimes hours or minutes beforehand.
They are therefore dealing with a larger time-scale than the topic that is discussed in
this thesis — which deals with real-time prediction and resulting actions based on the
outcome of the prediction in the order of milliseconds.
A very recent paper by H. Benadji, L. Zitoune, and V. Vèque [7] discusses the topic of
loss ratio prediction using deep learning, where they describe how they used time series
data and Deep Learning (DL) models to predict the loss ratio in IoT networks. Like
some of the studies discussed above, this is not directly related to the topic discussed in
this thesis, which deals with actual packet loss prediction in real-time, not packet loss
rate prediction. The former is a binary outcome and the latter a numeric rate, which
makes them quite different metrics. However, they also mention future work with the goal of designing a proactive
congestion control solution based on packet loss rate prediction, which would be very
relevant to what is discussed in this thesis.
A bit less recent but still relevant work by W. Na, B. Bae, S. Cho and N. Kim [63]
discusses a deep-learning based TCP (DL-TCP) protocol for a disaster 5G mmWave
network that learns the node’s mobility information and signal strength, and adjusts the
TCP cwnd by predicting when the network is disconnected and reconnected, leading
to better network stability and higher network throughput than existing protocols
such as TCP NewReno, TCP Cubic, and TCP BBR. The approach described in the
paper predicts the duration for which the transmitting signal is blocked in a 5G
mmWave network based on mobile base stations using deep learning and indicators
such as mobility, location, signal-to-noise ratio, and value of the terminal. The
predicted blockage duration is then used to fix cwnd and perform buffering for the
corresponding time to utilize the mmWave capacity if the blockage duration is less than
the retransmission timeout (RTO).
A paper by B.A. Arouche Nunes, K. Veenstra, W. Ballenthin, S. Lukin, and K. Obraczka
[3] represents one of the earlier works that used machine learning algorithms to predict
network performance. It presents a novel approach to RTT estimation using machine
learning. Their findings indicated that their machine learning based RTT estimation was
more accurate than the Exponentially Weighted Moving Average (EWMA) estimation
that is used in some TCP congestion control mechanisms. They ran experiments showing
a reduction in the number of retransmitted packets and an increase in goodput compared
to the traditional approach using an EWMA. This can be attributed to the fact that TCP
uses RTT estimates to compute its RTO timer value. More accurate RTT estimations
can therefore result in more accurate RTO values and fewer spurious retransmissions due
to RTO expirations in cases where the RTO should have been higher. The experiments
comparing retransmissions and goodput are closely related to the experiments that were
run and discussed in this thesis, at least with regard to the relevant metrics considered.
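The baseline EWMA estimator referenced here is the standard one specified in RFC 6298; a minimal sketch of one update step (illustrative code, not taken from either work):

```python
def update_rto(srtt: float, rttvar: float, sample: float,
               alpha: float = 1 / 8, beta: float = 1 / 4) -> tuple:
    """One RTT-sample update of the RFC 6298 EWMA estimator (times in seconds)."""
    # RFC 6298 updates RTTVAR before SRTT, using the old SRTT value.
    rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
    srtt = (1 - alpha) * srtt + alpha * sample
    # RTO = SRTT + 4 * RTTVAR, clamped to the recommended 1 second minimum.
    rto = max(1.0, srtt + 4 * rttvar)
    return srtt, rttvar, rto

# A single delayed sample inflates RTTVAR, and the RTO grows accordingly.
srtt, rttvar, rto = update_rto(0.100, 0.010, 0.300)
print(round(srtt, 4), round(rttvar, 4), rto)  # 0.125 0.0575 1.0 (the 1 s floor dominates here)
```

As the output shows, a single outlier sample moves the RTO mostly through the variance term, which is why a more accurate estimator can reduce spurious RTO expirations.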
Recent work by L. Diez, A. Fernández, M. Khan, Y. Zaki, and R. Agüero [18] studied
the application of machine learning techniques to predict the congestion status of 5G
mmWave networks. They identified transport-layer metrics relevant to the congestion
state, such as delay and inter-arrival time, and studied their correlation with the
perceived congestion. They did this by generating transport-layer traces and analyzing
the information provided therein to derive meaningful congestion metrics. They point to
a clear correlation between metrics such as the moving standard deviation of the delay
and congestion. However, they point to a weak correlation between the moving average
of the delay and congestion. These findings could be informative for the machine learning
feature selection and construction process discussed in this thesis, described in detail in
Section 3.4. However, since their traces were generated over 5G mmWave connections, the
findings are not necessarily transferable, as this thesis deals with wired connections.
Chapter 3
Machine learning model design and evaluation
and receiver, but with merged data created from many different measurements
with various combinations of connection parameters and the presence or absence
of background traffic.
The phase three models should be given the most attention, seeing as they are to be
considered the “general” models trained on the largest dataset and should optimally be
able to predict packet loss in many different cases — where cases refers to connections
with different configurations with respect to connection parameters like bandwidth,
delay, queue size, presence or absence of background traffic, and so on. They are therefore
the models that are most discussed here and the only models that were exported and
later tested, which is the topic of Chapter 4.
3.2 Tools
Multiple tools and programming libraries were used when creating the machine learning
models, with the two main categorizations being network tools and machine learning
libraries.
iPerf and TShark were also important and were used for generating traffic and capturing
packet information, respectively.
Mininet
Mininet is a utility for creating a virtual network on a local laptop or other personal
computer [56]. Mininet creates a virtual network, running real kernel, switch, and
application code, on a single machine.
Mininet ships with the Mininet CLI [58], where one can interact with the network,
including, but not limited to, the nodes and switches in it. It is therefore possible to
configure the network to suit one’s personal needs and preferences. It is also possible
to run commands from the different nodes in the network after it has been started.
Assuming that the network consists of two hosts: h1 and h2, it is possible to ping h2
from h1 using the following command:
h1 ping h2
In addition to the CLI, Mininet ships with a Python API [57] that makes it possible to
configure the network and run commands from the different nodes in it using a custom
Python script.
iPerf
iPerf is a tool for active measurements of the maximum available bandwidth of IP
networks. It supports various protocols, including TCP, and can be used to generate
traffic from one host to another [37].
ss
ss is a Linux network utility for dumping socket statistics [85]. ss can be used to display
internal TCP state information like cwnd, ssthresh, and rtt. It is possible to supply
ss with filter options such as source and destination IP address and port. This makes
it possible to, for example, only capture TCP information from outgoing traffic from a
specific IP address and port pair.
TShark
TShark is a network protocol analyzer for capturing packet data. TShark can also read
packets from a capture file. TShark uses the pcap library to capture traffic from a given
network interface [88].
Deep neural network-based methods for heterogeneous tabular data are still
inferior to machine learning methods based on decision tree ensembles for
small- and medium-sized datasets (less than ∼1M samples) [10].
Due to the reasons described above, only two gradient boosting libraries were applied
to create various machine learning models, which were then evaluated based on the
performance metrics described in Section 2.2.5.
XGBoost
XGBoost, or eXtreme Gradient Boosting, is an optimized gradient-boosted decision tree
machine learning library. XGBoost provides parallel tree boosting and can be applied
to many different problems, but especially regression and classification problems [103].
LightGBM
LightGBM is a gradient boosting framework that uses tree based learning algorithms
[51]. Compared to other gradient boosting frameworks such as XGBoost, LightGBM has
faster training speeds [17] and higher efficiency while producing similar [50] or better
[53] results. This makes it well suited for research applications such as the one tackled
in this thesis, where training speeds and efficiency often are of great importance.
3.3 Data collection
def config(self, **params):
    super(LinuxRouter, self).config(**params)
    self.cmd("sysctl net.ipv4.ip_forward=1")
    self.cmd("sudo tc qdisc del dev r-h2 root")
    self.cmd(
        f"sudo tc qdisc add dev r-h2 root handle 2: netem delay {self.delay}ms"
    )
    self.cmd("sudo tc qdisc add dev r-h2 parent 2: handle 3: htb default 10")
    self.cmd(
        f"sudo tc class add dev r-h2 parent 3: classid 10 htb rate {self.bandwidth}Mbit"
    )
    self.cmd(
        f"sudo tc qdisc add dev r-h2 parent 3:10 handle 11: bfifo limit {self.queue_size}"
    )
    self.cmd(f"sudo sysctl -w net.ipv4.tcp_congestion_control={self.cc_algorithm}")
    self.cmd("sudo sysctl -w net.ipv4.tcp_window_scaling=1")
    self.cmd("sudo ethtool -K r-h1 tso off")
    self.cmd("sudo ethtool -K r-h2 tso off")
The commands above were run from the router node when starting the Mininet virtual
network, configuring the router with specified values for delay, bandwidth, queue size,
and congestion control algorithm. The delay was only configured on the egress interface,
meaning the interface between r and h2. There was therefore no added delay on the
ingress interface or the interface between r and h1, giving a resulting RTT roughly
equal to the delay. Similar commands were applied to the sender and receiver nodes
to configure the congestion control algorithm used and enable ECN. The reason for
enabling ECN (setting ECT to 1) at the nodes was to make it possible to toggle the CE
bit dynamically at the router based on model predictions. This is further described in
Chapter 4.
To support collecting data with background traffic, and because iPerf3 was used to create
the flows needed to generate traffic for data collection, the receiver node was configured
to automatically start the required number of iPerf3 servers and configure them with
port numbers starting from 5201, so that the iPerf3 clients at the sender side could later
connect to these. How this configuration was done is shown in Listing 3.2.
def config(self, **params):
    super(Receiver, self).config(**params)

    # Start the required amount of iperf3 servers.
    base_port = 5201
    for server in range(self.background_flows + 1):
        self.cmd(f'iperf3 -s -p {base_port + server} -D')

    self.cmd(f'sudo sysctl -w net.ipv4.tcp_congestion_control={self.cc_algorithm}')
    self.cmd('sudo sysctl -w net.ipv4.tcp_window_scaling=1')
In addition, a special scenario command line argument was supported by the network
configuration script that configured all background flows to use either Cubic, Reno, or
half Reno and half Cubic. When the data collection procedure was configured to run
with 6 background flows and the scenario was half, 3 of the background flows would
use Reno and the other 3 would use Cubic.
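The scenario mapping described above can be sketched as follows (a hypothetical helper; the function name and signature are invented for illustration and do not appear in the actual configuration script):

```python
def background_cc_algorithms(scenario: str, background_flows: int) -> list:
    """Map a scenario name to one congestion control algorithm per background flow."""
    if scenario == "cubic":
        return ["cubic"] * background_flows
    if scenario == "reno":
        return ["reno"] * background_flows
    if scenario == "half":
        # First half Reno, second half Cubic, as in the half scenario above.
        half = background_flows // 2
        return ["reno"] * half + ["cubic"] * (background_flows - half)
    raise ValueError(f"Unknown scenario: {scenario}")

print(background_cc_algorithms("half", 6))  # 3 x reno followed by 3 x cubic
```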
Python was used to run the custom script which started and configured the network as
described above, and started the Mininet CLI.
The machine used to run the virtual network was a personal computer with an Intel®
Core™ i5-6500 CPU running Ubuntu.
The machine used to run the machine learning related code was a personal MacBook
Pro with an M1 Max CPU running macOS.
Table 3.1 shows the programs and their versions that were used to configure the network
and collect data.
Program Version
Mininet 2.3.0
Python 3.10.12
Ubuntu 22.04.1
iPerf 3.9
ss iproute2-5.15.0
Table 3.1: Programs used for network configuration and data collection and their versions
For the ss output in Listing 3.3, the combination of every first, second, and third line
represented a single measurement, but only the second and third lines — meaning the
lines that start with tcp and ts sack in the outputs above — were of interest and
contained the data that should be present in the final dataset. However, there were cases
where these lines were not present, and the measurements did not always look like this
with regard to the fields being recorded: sometimes fields were missing, other times
additional fields were present. A large and time-consuming part of the thesis was
therefore dedicated to data transformation — with the idea being that the final dataset
should consist of a series of rows, each row representing an individual ss measurement,
while the columns represented either raw or transformed data from the relevant
measurement in the ss output.
All the iPerf tests for the various connections ran for 300 seconds (5 minutes) in order
to collect a sufficient amount of data for each scenario with regards to the specific
combination of connection parameters.
In addition to capturing TCP state information with ss for training data, TShark was
run to capture packet information and save the result to a pcap file. This pcap file was
not used for training data purposes, but was relevant later when analyzing results and
comparing connections with and without model inference enabled, which is the topic of
Chapter 4. TShark was configured to capture all packets sent and received on the h1-r
interface. This was done because h1 always acted as the sender in the tests — sender in
this case referring to this being the node that hosted the iPerf clients — while h2 acted
as the receiver and hosted the iPerf servers. How TShark was configured is shown in
Listing 3.4 below.
As briefly mentioned in the beginning of this chapter, the phase three data consisted of
training data captured from a single flow between a single sender and receiver but with
merged data created from many different measurements with different combinations of
connection parameters and the presence or absence of background traffic.
Single flow in this case refers to the fact that the ss data was always captured from
one specific flow, even though there were multiple flows between the sender and receiver
when adding background traffic. The foreground flow was separated from the
background flows by specifying the IP address and port pair of the source and
destination when filtering the traffic with the ss command, as shown in Listing 3.5.
This way, ss was always configured to capture traffic sent from the first iPerf client
started at the sender and received at the first iPerf server started at the receiver.
# Use ss to capture socket statistics from the first Mininet host and append to given file.
# Parameters:
#   $1: File to write socket statistics to.
capture_ss() {
    local file="$1"

    ss -i -o src 10.1.1.100:5001 dst 10.2.2.100:5201 >> "$file"
}
Phase one
As mentioned in the previous section, all the iPerf tests for the various connections ran
for 300 seconds (5 minutes). This was also the case for the phase one data collection,
where the data collection procedure in the form of the relevant script ran for 5 minutes
and was configured with the following values for bandwidth, delay, and queue size:
• bandwidth: 50Mbit
• delay: 70ms
• queue size: 437500 (1 BDP) bytes
The test was configured as a single flow between a single sender and receiver without
any background traffic.
The phase one data collection resulted in a single txt file for both Reno and Cubic that
contained the various ss outputs.
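The listed queue size of 437500 bytes follows directly from the configured bandwidth and delay: since delay was only added on the egress interface, the RTT roughly equals the configured delay, so the BDP is simply the product of bandwidth and delay. A small sketch (helper names invented for illustration):

```python
def bdp_bytes(bandwidth_mbit: float, delay_ms: float) -> float:
    """Bandwidth-delay product in bytes, assuming RTT ~= the configured delay."""
    return bandwidth_mbit * 1_000_000 / 8 * delay_ms / 1000

def queue_size_bytes(bandwidth_mbit: float, delay_ms: float, bdp_multiplier: float) -> int:
    """Queue size as a multiple of the BDP, as used in the data collection phases."""
    return round(bdp_bytes(bandwidth_mbit, delay_ms) * bdp_multiplier)

print(queue_size_bytes(50, 70, 1))  # 437500, the phase one queue size (1 BDP)
```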
Phase two
Compared to phase one, the phase two data collection was much more intricate and
resulted in significantly more data. Like phase one, all the iPerf tests for the various
connections ran for 300 seconds (5 minutes), but instead of running just one test with
one specific combination of bandwidth, delay, and queue size, 75 different tests were run,
all with different combinations of the mentioned connection parameters.
The possible values for bandwidth, delay, and queue size — where queue size is
represented as multipliers of BDP in bytes — are shown below:
• bandwidths: 10Mbit, 20Mbit, 30Mbit, 40Mbit, 50Mbit
• delays: 30ms, 40ms, 50ms, 60ms, 70ms
• queue sizes: 0.25 BDP, 0.5 BDP, 1 BDP
Each test represented a specific permutation of the abovementioned options, so that
each test was configured with a specific value for delay, bandwidth, and queue size in
the form of a BDP multiplier value. Given a delay of 30ms, a bandwidth of 10Mbit, and
a queue size of 1 BDP, that specific permutation represented a test where the delay was
configured to 30ms, the bandwidth was 10Mbit, and the queue size was equal to 1 BDP
where the BDP was always calculated from the configured delay and bandwidth for that
specific test. If the queue size was configured to 0.25 BDP, it would represent 1/4 of the
calculated BDP based on the configured delay and bandwidth.
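The 75 tests correspond to every combination of the listed values, which can be enumerated with itertools.product (an illustrative sketch, not the thesis's test script):

```python
from itertools import product

bandwidths = [10, 20, 30, 40, 50]   # Mbit
delays = [30, 40, 50, 60, 70]       # ms
queue_sizes = [0.25, 0.5, 1]        # BDP multipliers

tests = list(product(bandwidths, delays, queue_sizes))
print(len(tests))  # 5 * 5 * 3 = 75 parameter combinations
```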
Like phase one, the tests were configured as single flows between a single sender and
receiver without any background traffic.
The phase two data collection resulted in 75 distinct txt files for both Reno and Cubic.
Each file contained the various ss outputs from a test with a specific combination of
connection parameters.
Phase three
Compared to phase two, the phase three data collection was even more intricate and
also resulted in significantly more data. Like phase one and two, all the iPerf tests for
the various connections ran for 300 seconds (5 minutes), but now background traffic was
added.
Explained in more detail in Section 3.3.1, background traffic was added by configuring
the receiver to automatically start the desired amount of iPerf servers, and configuring
the sender to start the same amount of clients in order to connect to the servers.
In all cases, in addition to the single foreground flow, 6 background flows were added to
the connection. These 6 background flows were configured to use either Reno or Cubic
or half/half — meaning that 3 of them used Reno and the other 3 Cubic.
The same values for bandwidth, delay, and queue size — where queue size is represented
as multipliers of BDP in bytes — as in phase two were used, but in addition, three
different scenarios for the background traffic flows were considered:
• bandwidths: 10Mbit, 20Mbit, 30Mbit, 40Mbit, 50Mbit
• delays: 30ms, 40ms, 50ms, 60ms, 70ms
• queue sizes: 0.25 BDP, 0.5 BDP, 1 BDP
3.4 Data transformation
The result of parsing the output from the data collection process was therefore a Python
dictionary where the keys represented the paths to the files containing the outputs and
the values were lists in which each element was a string containing the second and third
line of a specific ss measurement for that connection. In this way, lines 1 and 2 in Listing
3.6 — which contain the data for one specific ss measurement — were represented as a
single string containing all the information for that measurement, as shown below:
{
path: [
ss_measurement_string,
ss_measurement_string
],
path2: [
ss_measurement_string,
ss_measurement_string
],
}
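A minimal sketch of the grouping step for a single file (assuming, as described earlier, that each ss measurement spans three lines of which only the second and third are kept; the function name and sample strings are invented for illustration). The per-path dictionary then maps each file path to the list returned here:

```python
def group_ss_measurements(raw_lines: list) -> list:
    """Join the second and third line of every three-line ss measurement into one string."""
    measurements = []
    for i in range(0, len(raw_lines) - 2, 3):
        measurements.append(raw_lines[i + 1].strip() + " " + raw_lines[i + 2].strip())
    return measurements

raw = [
    "State Recv-Q Send-Q ...",       # first line: not of interest
    "tcp   ESTAB  0 ...",            # second line: starts with tcp
    "ts sack cubic wscale:7,7 ...",  # third line: starts with ts sack
]
print(group_ss_measurements(raw))  # one merged measurement string
```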
cwnd plot of either Reno (Figure 2.1) or Cubic (Figure 2.2). As can be seen in both plots,
the cwnd grows until it reaches a peak. Assuming that the connection is in congestion
avoidance mode (Section 2.1.6), the peak is where a congestion signal is detected —
usually in the form of a packet loss — at which point Multiplicative Decrease is used to reduce congestion
in the network by reducing the rate at which the sender can inject packets. The cwnd is
therefore informative of when packet loss will happen — because the chance of packet
loss increases with an increasing cwnd, and a maximum value of cwnd is reached just
before packet loss happens.
For the same reasons as described above, the min_cwnd and max_cwnd features were
added. A cwnd that is quite close to the max_cwnd seen so far most likely indicates a
high chance of congestion — especially in cases when the max_cwnd does not fluctuate
much. Similarly, a cwnd value close to min_cwnd most likely indicates a small chance
of congestion, because congestion has just been experienced and reacted to, with the
Multiplicative Decrease scheme briefly explained above and in more detail in Section
2.1.6.
The min_x and max_x features served as a way to incorporate historical data into the
algorithm. By doing so, theoretically, as the connection progresses, the algorithm’s
predictions could improve by learning the connection’s behavior.
All the chosen features and a short description of each feature are shown in Table 3.2.
In addition, the rationale behind choosing each feature is explained below:
timer_name The information for this feature was extracted directly from the ss field
present in the output, specifically from part of the timer data. Among the multiple
timers, the retransmission timeout (RTO) was identified as relevant due to its
association with congestion. It was therefore deemed important to distinguish this
timer from the others.
expire_time Chosen because the expire_time is a function of the RTT. Therefore, a
relatively high value for this feature could indicate congestion, and vice versa.
retrans Chosen because it should theoretically show how many times a retransmission
occurred, according to the ss man page [85], which could be relevant after
congestion in order to determine the degree of congestion.
rto Chosen because the retransmission timeout (RTO) is related to congestion and is a
function of the RTT — a higher RTO value indicates a higher RTT value, which
could indicate congestion in the network and imminent packet loss. While the
RTT is a very direct measure, the RTO is more of a long-term average [21].
rtt Chosen because the RTT is strongly related to network congestion with a higher
RTT value indicating queue growth and therefore congestion in the network, and
vice versa.
rtt_variance Chosen because, like the RTO, the RTT variance is a function of the
RTT, but it is a slightly more direct measure that estimates how much the RTT
fluctuates.
cwnd Chosen because the cwnd value strongly correlates with the likelihood of
congestion. A higher value of cwnd indicates a greater probability of congestion,
with the opposite being true for a lower value. This relationship can be observed
in the cwnd plots (Figure 2.1 and Figure 2.2).
cwnd_diff Chosen because trends in cwnd can indicate if congestion has just occurred
or not. If the cwnd is still growing, it indicates that the connection is in congestion
avoidance mode — assuming that the cwnd is larger than the ssthresh (Section
2.1.6) — while if the cwnd is decreasing, it indicates that congestion just occurred
because a peak has been reached and the fast recovery phase has begun (Section
2.1.7). This value could also indicate how quickly the cwnd is increasing or
decreasing, which could potentially indicate a higher chance of congestion or not.
ssthresh Chosen because the ssthresh contains information about the maximum cwnd
value that has worked in the past. Could be used as a supplement to the max_cwnd
feature discussed further below.
data_segments_sent This feature represented the difference in the number of
segments sent containing a positive length data segment between the current and
previous ss measurement, and was extracted from the data_segs_out field of the
ss output, explained in Table 3.3. Included in the feature set because it could help
distinguish between slow start or not. However, the initial slow start peak was
not included in the training data or considered when later applying the model to
predict packet loss in real time, as discussed in Chapter 4.
last_send Chosen because, due to ACK-clocking, there could be a positive correlation
between congestion from other traffic and this value.
pacing_rate Like many of the other chosen features, this is a function of the RTT,
and was chosen because it approximates the sending rate as cwnd/rtt, which is in
principle roughly as important as the cwnd.
min_rtt Chosen because any queue growth will cause the RTT to grow in relation
to this value; the absolute RTT should not really matter when the data is being
sourced from many different connections that have been configured with different
delays and therefore contain samples with different RTT values.
max_rtt Chosen for the same reason as the min_rtt explained above; the RTT
should be considered in relation to the current max_rtt value, where an RTT close
to the max indicates congestion.
min_cwnd Chosen because a cwnd value close to min_cwnd most likely indicates a small
chance of congestion, because congestion has just been experienced and reacted to.
max_cwnd Chosen because a cwnd that is quite close to the max_cwnd seen so far
most likely indicates a high chance of congestion — especially in cases when the
max_cwnd does not fluctuate much — which is the case for single flow connections
that do not have any background traffic.
min_ssthresh Chosen because the ssthresh in isolation does not reveal much about
the probability of congestion when the data has been aggregated from many
different connections.
max_ssthresh Chosen for the same reason as the min_ssthresh. The ssthresh needs
to be considered in relation to this value and the min.
Referring to the man pages of ss [85], there is quite a lot of information that can be
extracted but not all of it is necessarily relevant for the purpose of predicting congestion
induced packet loss. The fields described in Table 3.3 from the ss output were deemed
to be relevant based on the discussion in Section 3.4.2.
ss field Description
timer_name The name of the timer
expire_time How long time the timer will expire
retrans How many times the retransmission occurred
cong_alg Congestion algorithm used
rto TCP re-transmission timeout value
rtt Average round-trip time
rttvar Mean deviation of RTT
cwnd Congestion window size
ssthresh Slow start threshold
data_segs_out Number of segments sent containing a positive length data segment
lastsnd Time since the last packet was sent
pacing_rate The pacing rate
Table 3.3: Relevant fields from the ss output and their ss man page descriptions [85]
{
cwnd: 10,
ssthresh: 10
...
}
This was done to make it easier to manipulate each measurement and later convert the
merged dataset to a csv file.
For the purpose of extracting the ss fields in Table 3.3, multiple functions like the one
shown below in Listing 3.7 were created. Each function took the measurement in the
form of a string — as discussed in the previous section — as an argument and defined and
used a regex for filtering out a specific field from the measurement string in order to add
it to the dictionary representing the measurement, if the measurement string matched
the regex.
import re

def add_cwnd(ss_dict: dict, measurement: str) -> tuple:
    """Add the cwnd from the given measurement to the given dictionary.

    Args:
        ss_dict: The dictionary to add the cwnd to.
        measurement: The measurement from ss.

    Returns:
        A tuple containing the cwnd and True if the measurement contained cwnd information,
        0 and False otherwise.
    """
    cwnd_regex = re.compile(r"cwnd:(\d+)")
    cwnd_match = re.search(cwnd_regex, measurement)
    if cwnd_match:
        cwnd = cwnd_match.group(1)
        cwnd = int(cwnd)
        ss_dict["cwnd"] = cwnd
        return cwnd, True

    return 0, False
Features like the cwnd shown above or the ssthresh were quite straightforward to extract
from the data, because they just needed to be extracted and used directly without any
special intermediary steps other than converting from a string.
However, referring to the discussion in Section 3.4.2 and Table 3.2, there were some
features that should be included in the final dataset that were not possible to extract
directly from the ss measurements. These features were mostly dependent on the values
in the fields of the other measurements and therefore needed special treatment. For
example, the features min_cwnd and max_cwnd were dependent on a dynamic minimum
and maximum value of cwnd that updated itself as the function dealt with more and
more measurements — meaning that, if the cwnd value for the first measurement was
10, this would be the min_cwnd value for all measurements until an even smaller value
was found, and so on.
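The running min/max update described above can be sketched as follows (illustrative only; the actual implementation may differ):

```python
def add_min_max_cwnd(ss_dicts: list) -> None:
    """Annotate each measurement with the smallest and largest cwnd seen so far."""
    min_cwnd = max_cwnd = ss_dicts[0]["cwnd"]
    for ss_dict in ss_dicts:
        min_cwnd = min(min_cwnd, ss_dict["cwnd"])
        max_cwnd = max(max_cwnd, ss_dict["cwnd"])
        ss_dict["min_cwnd"] = min_cwnd
        ss_dict["max_cwnd"] = max_cwnd

measurements = [{"cwnd": 10}, {"cwnd": 14}, {"cwnd": 7}]
add_min_max_cwnd(measurements)
print(measurements[1])  # {'cwnd': 14, 'min_cwnd': 10, 'max_cwnd': 14}
print(measurements[2])  # {'cwnd': 7, 'min_cwnd': 7, 'max_cwnd': 14}
```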
As mentioned in the previous section on data collection, all the measurements for
producing traffic and gathering data from said traffic ran for 5 minutes when capturing
training data. Referring to the cwnd plots of Reno and Cubic in Figure 2.1 and Figure
2.2 respectively, the initial slow start peak (Section 2.1.6) can be seen in both figures
where the cwnd is much higher than for the remainder of the connection. In order to
avoid outliers in the data, it was decided that only measurements in congestion avoidance
should be included, meaning that the first few seconds of measurements from the initial
slow start phase were not included in the training data. To make the min_cwnd and
max_cwnd values more consistent with the cwnd values from the congestion avoidance
phase, these values were therefore not set until after the initial slow start peak. This was
also the case for the min_ssthresh and max_ssthresh values. However, the min_rtt
and max_rtt values were taken from the beginning of the connection.
To keep things simple and because all the measurements ran for 5 minutes, the first 30
seconds of measurements were not included in the final training data to make sure that
no measurements from the initial slow start peak were included.
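Assuming each measurement carries a hypothetical elapsed_seconds field relative to the start of the connection (the field name is invented for illustration), the trimming step amounts to a simple filter:

```python
SLOW_START_CUTOFF_SECONDS = 30

def trim_initial_slow_start(measurements: list) -> list:
    """Drop measurements from the first 30 seconds, which cover the initial slow start peak."""
    return [m for m in measurements
            if m["elapsed_seconds"] >= SLOW_START_CUTOFF_SECONDS]

samples = [{"elapsed_seconds": 5}, {"elapsed_seconds": 31}, {"elapsed_seconds": 250}]
print(trim_initial_slow_start(samples))  # only the measurements taken after 30 seconds
```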
A feature that proved quite challenging to deal with was the cwnd_diff,
as described in Table 3.2. Referring to said description, if the cwnd value of the current
measurement was 10 and the previous one was also 10, the one before the previous one
would have to be considered, and so on until a value was found that was different from
the current. The reason for including this feature is described in detail in Section 3.4.2.
Adding this feature was accomplished by a recursive function as shown in Listing 3.8.
def add_cwnd_diff(
    ss_dicts: list, ss_dict: dict, cwnd: int, prev_cwnd: int, current_index: int
) -> None:
    """Add the difference between the current congestion window and the previous
    congestion window value that was not the same.

    The previous value in this case refers to an earlier value in the same
    connection that was not the same as the current. Meaning, that if the
    current congestion window is 10 and the one directly before it was also 10,
    the previous congestion window will be the one before that, and so on.

    Args:
        ss_dicts: The list of dictionaries that contain the measurements from ss.
        ss_dict: The dictionary to add the cwnd diff to.
        cwnd: The current congestion window value.
        prev_cwnd: The previous congestion window value that was not the same.
        current_index: The index of the current measurement in ss_dicts.
    """
    if current_index == 0:
        ss_dict["cwnd_diff"] = 0
        return

    if cwnd == prev_cwnd:
        prev_cwnd = ss_dicts[current_index - 1]["cwnd"]
        if prev_cwnd == cwnd:
            return add_cwnd_diff(ss_dicts, ss_dict, cwnd, prev_cwnd,
                                 current_index - 1)
        else:
            ss_dict["cwnd_diff"] = cwnd - prev_cwnd
            return
As mentioned in the data collection section (Section 3.3), there were sometimes cases
when the output from ss did not include all the relevant fields that should be present for
each row in the final dataset. These measurements were not included in the final dataset,
but their cwnd, rtt, and ssthresh values were used to calculate min_cwnd, max_cwnd,
min_rtt, max_rtt, min_ssthresh, and max_ssthresh whenever those fields were
present even though others were missing.
The final result of this step in the data transformation process was a list of lists,
where each inner list contained dictionaries holding the relevant fields for each
measurement. These dictionaries and their items represented what would become the
rows and columns of the final dataset, respectively.
3.4. Data transformation
Creating the csv file was accomplished by creating a DictWriter object [71] from the
Python csv module, as shown in Listing 3.9.
def create_csv(ss_dicts: List[List[Dict]], path: str) -> None:
    """Create a csv file from the ss measurements in the given list of lists of dictionaries.

    Args:
        ss_dicts: A list of lists, each containing dictionaries representing the
            measurements from ss.
        path: Where to create the csv file.
    """
    flattened_ss_dicts = [ss_dict for sublist in ss_dicts for ss_dict in sublist]

    with open(path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=flattened_ss_dicts[0].keys())

        writer.writeheader()
        for ss_dict in flattened_ss_dicts:
            writer.writerow(ss_dict)

    print("Created csv file under path:", path)
The final result of this step in the data transformation process was a csv file containing
all the ss measurements as rows and the machine learning features as columns.
Chapter 3. Machine learning model design and evaluation
The final result of the data cleaning step was a dataset with only numerical features in
either float or integer format.
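The cleaning steps themselves are not shown in this excerpt. As an illustrative sketch only (the raw values and the pandas-based approach below are assumptions, not the thesis's actual code), a string field such as timer_name can be encoded as integer codes while numeric strings are cast to their proper types:

```python
import pandas as pd

# Hypothetical raw rows; the field names follow the dataset, the values are invented.
df = pd.DataFrame({
    "timer_name": ["on", "on", "persist"],  # string field from ss
    "cwnd": ["10", "12", "12"],             # numbers parsed as strings
    "rtt": ["1.5", "2.0", "2.25"],
})

# Encode the categorical field as integer codes and cast the numeric strings.
df["timer_name"] = df["timer_name"].astype("category").cat.codes
df["cwnd"] = df["cwnd"].astype(int)
df["rtt"] = df["rtt"].astype(float)
```

After these casts every column is numeric, matching the float/integer requirement described above.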
3.5. Phase one classifiers
The classifier was trained on the training data (Section 3.4.8) using the fit method of
the XGBClassifier class [98]. No hyperparameter tuning was performed, so no other
parameters were supplied to the XGBClassifier constructor.
The predictions were performed using the predict method of the XGBClassifier class
[100].
Results were evaluated by calculating the accuracy, precision, recall, and F1 score
(Section 2.2.5) of the various classifiers. In addition, a confusion matrix (Section 2.2.5)
was computed for each classifier. All metrics and the confusion matrices were computed
using the sklearn.metrics module [81].
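As a sketch of this evaluation step, the metrics and the confusion matrix can be computed with sklearn.metrics as follows (the label arrays are illustrative, not thesis data):

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix,
)

# Illustrative true labels and predictions (1 = Lost, 0 = Not Lost).
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
```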
Table 3.4: Network connection parameters used to configure the data collection procedure in
phase one
3.5.2 Results
The phase one datasets displayed a significant imbalance between the majority and
minority class: no packet loss and packet loss, as illustrated in Table 3.5 for Reno and
Table 3.6 for Cubic. This imbalance can be explained by packet loss at the transport layer
being relatively rare. Given this imbalance, relying solely on accuracy as a performance
metric when evaluating the classifiers proved to be insufficient. With an imbalanced
dataset, a classifier that predicts the majority class all the time still achieves 99%
accuracy if the dataset consists of 99% majority class samples and only 1% minority
class samples (Section 2.2.5).
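This accuracy trap is easy to reproduce with a constructed example (not thesis data): a degenerate classifier that always predicts the majority class scores 99% accuracy while detecting no loss events at all.

```python
from sklearn.metrics import accuracy_score, recall_score

# 990 "no packet loss" samples and 10 "packet loss" samples.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # always predict the majority class

acc = accuracy_score(y_true, y_pred)  # 0.99, despite the model being useless
rec = recall_score(y_true, y_pred)    # 0.0: every loss event is missed
```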
Table 3.5: The proportion of samples for each class in the training, validation, and test sets for
the Reno phase one dataset
Table 3.6: The proportion of samples for each class in the training, validation, and test sets for
the Cubic phase one dataset
For this reason, precision, recall, and the combined F1 score were mainly used
when evaluating the classifiers. It was hypothesized that high precision would be
more important than high recall, as a high number of false positives (low precision)
was expected to be more harmful than missed detections.
The confusion matrices for Reno and Cubic are shown in Table 3.9 and Table 3.10,
respectively, and show that most Lost samples were not classified correctly, apart
from a few samples in both the Reno and Cubic cases.
The feature importances for Reno and Cubic are shown in Table 3.11 and Table 3.12,
respectively, and show that the cwnd and cwnd_diff features were very important for
the classification for Reno. While the cwnd feature was also important for Cubic, the
cwnd_diff feature was not used at all. However, the rtt, data_segments_sent, and
the ssthresh features were of quite high importance for the Cubic classifier.
In the case of Reno, when examining the training data, the feature importances can be
explained by the cwnd reaching a maximum value in the True cases compared to the other
cases, as visualized in Table 3.13, which shows an excerpt from the Reno training data. By
Feature Importance
timer_name 0.00
expire_time 0.04
retrans 0.00
rto 0.00
rtt 0.07
rtt_variance 0.05
cwnd 0.33
ssthresh 0.00
data_segments_sent 0.03
last_send 0.00
pacing_rate 0.10
min_rtt 0.00
max_rtt 0.00
cwnd_diff 0.38
min_cwnd 0.00
max_cwnd 0.00
min_ssthresh 0.00
max_ssthresh 0.00
Table 3.11: Feature importances for the Reno phase one model
utilizing this information in combination with the cwnd_diff feature that shows that
the cwnd is growing, this could make it possible for the model to spot a trend where
the True cases consistently have high cwnd values and positive cwnd_diff values. As
the feature importances show, the rtt also seems to be a factor here, with the value
reaching a near maximum right before packet loss happens — indicating that the queue
is growing and that there is congestion.
For Cubic, much like Reno, the cwnd feature was highly significant, with the significance
being explained by the same reasoning as for Reno. However, as already mentioned,
features such as rtt, data_segments_sent, and the ssthresh also held considerable
importance.
The reason behind the importance of the rtt feature is the same as for Reno, with this
being a value that seems to increase until it reaches a near maximum right before packet
loss happens.
Feature Importance
timer_name 0.00
expire_time 0.04
retrans 0.00
rto 0.00
rtt 0.10
rtt_variance 0.05
cwnd 0.41
ssthresh 0.15
data_segments_sent 0.11
last_send 0.00
pacing_rate 0.06
min_rtt 0.00
max_rtt 0.00
cwnd_diff 0.00
min_cwnd 0.07
max_cwnd 0.00
min_ssthresh 0.00
max_ssthresh 0.00
Table 3.12: Feature importances for the Cubic phase one model
Table 3.13: An excerpt from the Reno phase one training dataset
The reason behind the data_segments_sent feature being of quite high importance
seemed to be a pattern where the values were consistently smaller right after packet
loss. When examining the training data, this was made clear by looking at the values
for True cases and the cases right after the True cases. The True cases consistently had
a higher value than the False cases right after, which seemed to indicate the pattern
described earlier. This can be seen in Table 3.14.
Table 3.14: An excerpt from the Cubic phase one training dataset
However, it was harder to find a clear reason for the importance value of the ssthresh
feature. Upon analyzing the training data, it was observed that the values for this feature
predominantly hovered around 405 for the Cubic phase one training data.
There seemed to be no distinct pattern, with values occasionally increasing following packet loss.
3.6. Phase two classifiers
3.6.2 Results
As in phase one, the data was still very imbalanced in phase two, as illustrated in
Table 3.16 and 3.17 for Reno and Cubic, respectively. However, the proportion of Lost
samples improved considerably, especially for Reno, increasing from 0.001 in phase one
to 0.007 in phase two. This can be attributed to the fact that there was simply much
more training data in phase two, and the training data was also aggregated from many
different connections instead of just a single connection with a specific set of connection
parameters. Some of these connections were configured with smaller BDPs than the one
that was used for data collection in phase one, and therefore had more frequent packet
loss.
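As a quick refresher on why (Section 2.1.3 covers BDP in detail): the bandwidth-delay product bounds the amount of data that can be in flight, so a smaller BDP means a smaller queue that overflows sooner. A rough calculation, with arbitrary example values:

```python
def bdp_packets(bandwidth_mbps: float, rtt_ms: float, mss_bytes: int = 1500) -> float:
    """Bandwidth-delay product expressed in full-sized packets."""
    bdp_bytes = (bandwidth_mbps * 1e6 / 8) * (rtt_ms / 1e3)
    return bdp_bytes / mss_bytes

# A 10 Mbit/s link with a 50 ms RTT can hold roughly 42 full-sized packets.
in_flight = bdp_packets(10, 50)
```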
Table 3.16: The proportion of samples for each class in the training, validation, and test sets for
the Reno phase two dataset
Table 3.17: The proportion of samples for each class in the training, validation, and test sets for
the Cubic phase two dataset
As hypothesized and illustrated in Table 3.18 and Table 3.19 for Reno and Cubic,
respectively, the classification performance for the phase two models was quite poor,
indicating that this could be a difficult problem for a machine learning model to handle.
Precision was decent, indicating that when a sample was classified as Lost, it was
often actually Lost. Recall was very poor, indicating that many samples that should
have been classified as Lost were classified as Not Lost. The latter suggests that
the model missed many cases where packets should be labeled as Lost.
As illustrated in Table 3.22 and Table 3.23 for Reno and Cubic, respectively, the samples
in the training data right before the samples that were labeled as True seem to be very
similar to the True sample. This could indicate that the model often gets things “almost
right”, by labeling a sample as True even if it is not right at the peak yet, but rather
approaching it. This could make the model react to congestion just a bit earlier than
the optimal case, which would be right before the peak.
Table 3.22: An excerpt from the Reno phase two training dataset. The training data is shown
as groups of five samples taken from different parts of the aggregated training data; the
samples in each group came from the same part and occurred in succession. Each group of
five samples includes one True sample, the two False samples that occurred right before,
and the two False samples that occurred right after
Table 3.23: An excerpt from the Cubic phase two training dataset. The training data is shown
as groups of five samples taken from different parts of the aggregated training data; the
samples in each group came from the same part and occurred in succession. Each group of
five samples includes one True sample, the two False samples that occurred right before,
and the two False samples that occurred right after
As hypothesized and illustrated in Table 3.26 and Table 3.27 for Reno and Cubic,
respectively, the main difference in results between the phase one and phase two
classifiers was found in the feature importances. Features like the cwnd that had high
importance in the phase one classifiers were no longer of great importance in phase
two because the training data was aggregated from many different connections with
different combinations of connection parameters. Unlike in phase one, where the cwnd
could be used directly to spot a trend where the cwnd reached a peak before packet loss,
in phase two this was not possible because the peak value was different between the
various connections that the models were trained on. This can be seen in Table 3.22 and
Table 3.23. The same reasoning can be applied to the pacing_rate feature, which also
decreased in importance for both phase two classifiers.
Feature Importance
timer_name 0.00
expire_time 0.14
retrans 0.00
rto 0.00
rtt 0.10
rtt_variance 0.02
cwnd 0.04
ssthresh 0.02
data_segments_sent 0.02
last_send 0.03
pacing_rate 0.04
min_rtt 0.10
max_rtt 0.14
cwnd_diff 0.17
min_cwnd 0.03
max_cwnd 0.04
min_ssthresh 0.04
max_ssthresh 0.06
Table 3.24: Feature importances for the Reno phase two model
One group of features that had little to no importance in the phase one classifiers
was the various min and max features, as explained in Section 3.4.2. All of these increased
in importance for the phase two classifiers. This can be attributed to much the
same reason that features like the cwnd decreased in importance. When the cwnd
could no longer directly be used to spot the trend where the cwnd reached a maximum
value before packet loss, features like the max_cwnd and max_rtt were perhaps used to
aid the model when deciding if a datapoint should be labeled as lost or not.
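The min and max features described in Section 3.4.2 are running extrema over a connection's measurements; a minimal sketch of how they can be maintained (an illustrative reconstruction, not the thesis's actual transformation code):

```python
def add_running_extrema(samples: list, field: str) -> None:
    """Attach running min_<field> and max_<field> values to each sample dict."""
    lo = hi = None
    for sample in samples:
        value = sample[field]
        lo = value if lo is None else min(lo, value)
        hi = value if hi is None else max(hi, value)
        sample[f"min_{field}"] = lo
        sample[f"max_{field}"] = hi

samples = [{"cwnd": 10}, {"cwnd": 14}, {"cwnd": 7}]
add_running_extrema(samples, "cwnd")
# The last sample now carries min_cwnd = 7 and max_cwnd = 14.
```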
The cwnd_diff feature had quite high and about the same importance for both phase
two classifiers, as illustrated in Table 3.24 and 3.25 for Reno and Cubic, respectively.
Referring to the cwnd plots of Reno and Cubic in Figure 2.1 and Figure 2.2, respectively,
this was probably used to make sure that the model did not label samples where the
cwnd was decreasing as Lost. As briefly mentioned in Section 3.4.4, some initial models
were trained on data labeled using the simple labeling method — where packets were
labeled as Lost if the cwnd value of the previous packet was smaller than the current —
and evaluated. While the models had great classification performance, they probably
just used the cwnd_diff feature to see that the cwnd was decreasing and always classified
packets as Lost in that case. The final models trained on training data labeled using the
complex labeling method avoided classifying packets as Lost if the cwnd was decreasing.
Feature Importance
timer_name 0.00
expire_time 0.06
retrans 0.00
rto 0.02
rtt 0.08
rtt_variance 0.02
cwnd 0.05
ssthresh 0.04
data_segments_sent 0.05
last_send 0.02
pacing_rate 0.04
min_rtt 0.09
max_rtt 0.14
cwnd_diff 0.19
min_cwnd 0.02
max_cwnd 0.08
min_ssthresh 0.03
max_ssthresh 0.09
Table 3.25: Feature importances for the Cubic phase two model
Feature    Phase one importance    Difference in importance (phase two - phase one)
timer_name 0.00 0.00
expire_time 0.04 +0.10
retrans 0.00 0.00
rto 0.00 0.00
rtt 0.07 +0.03
rtt_variance 0.05 -0.03
cwnd 0.33 -0.29
ssthresh 0.00 +0.02
data_segments_sent 0.03 -0.01
last_send 0.00 +0.03
pacing_rate 0.10 -0.06
min_rtt 0.00 +0.10
max_rtt 0.00 +0.14
cwnd_diff 0.38 -0.21
min_cwnd 0.00 +0.03
max_cwnd 0.00 +0.04
min_ssthresh 0.00 +0.04
max_ssthresh 0.00 +0.06
Table 3.26: Feature importances difference for the Reno phase one and phase two models
This was an important reason why the simple labeling method was deemed unsuitable,
and it clearly highlights the impact data labeling can have on model performance and
results.
Feature    Phase one importance    Difference in importance (phase two - phase one)
timer_name 0.00 0.00
expire_time 0.04 +0.02
retrans 0.00 0.00
rto 0.00 +0.02
rtt 0.10 -0.02
rtt_variance 0.05 -0.03
cwnd 0.41 -0.36
ssthresh 0.15 -0.11
data_segments_sent 0.11 -0.06
last_send 0.00 +0.02
pacing_rate 0.06 -0.02
min_rtt 0.00 +0.09
max_rtt 0.00 +0.14
cwnd_diff 0.00 +0.19
min_cwnd 0.07 -0.05
max_cwnd 0.00 +0.08
min_ssthresh 0.00 +0.03
max_ssthresh 0.00 +0.09
Table 3.27: Feature importances difference for the Cubic phase one and phase two models
3.7. Phase three classifiers
1. Reno phase three classifier, trained on aggregated training data gathered from
many different connections configured with Reno, with and without background
traffic, and with a specific set of the connection parameters shown in Table 3.28;
the data was labeled using the complex labeling method.
2. Cubic phase three classifier, trained on aggregated training data gathered from
many different connections configured with Cubic, with and without background
traffic, and with a specific set of the connection parameters shown in Table 3.28;
the data was labeled using the complex labeling method.
Like phase one and phase two, the phase three classifiers were created using the
XGBClassifier [96] class from the XGBoost library. The options supplied to the
XGBClassifier constructor were the same as for phase one and phase two (Section
3.5).
Model evaluation was also done in the same way as with the phase one and phase
two classifiers, where the precision, recall, F1 score, feature importances, and confusion
matrices were calculated and evaluated.
Unlike phase one and phase two, in phase three various hyperparameters were tuned
with the goal of improving model performance. The choice of which parameters to
tune was based on the relevant section in the XGBoost documentation titled “Notes
on Parameter Tuning” [106]. The chosen hyperparameters and their default and tuned
values are shown in Table 3.29 for Reno and Table 3.30 for Cubic.
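The tuning procedure can be sketched as enumerating candidate combinations with sklearn's ParameterGrid. The parameter names below are real XGBoost parameters, but the value grids are illustrative, not the values from Tables 3.29 and 3.30:

```python
from sklearn.model_selection import ParameterGrid

# Illustrative value grids for three of the tuned XGBoost parameters.
param_grid = ParameterGrid({
    "max_depth": [4, 6, 8],
    "subsample": [0.8, 1.0],
    "scale_pos_weight": [1, 50, 100],
})

candidates = list(param_grid)  # 3 * 2 * 3 = 18 combinations
# Each candidate would then be evaluated in turn, e.g.:
# model = XGBClassifier(**candidate); model.fit(X_train, y_train); ...
```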
3.7.2 Results
As in phase one and phase two, the data was still very imbalanced in phase three, as
illustrated in Table 3.31 and 3.32 for Reno and Cubic respectively. However, like when
going from phase one to phase two, the proportion of lost samples did improve from
phase two to phase three. Similarly to phase two, this can be partially attributed to
there being much more training data in phase three, and perhaps packet loss being more
frequent in the presence of background traffic.
Table 3.31: The proportion of samples for each class in the training, validation, and test sets for
the Reno phase three dataset
Table 3.32: The proportion of samples for each class in the training, validation, and test sets for
the Cubic phase three dataset
As hypothesized and illustrated in Table 3.33 and Table 3.34 for Reno and Cubic,
respectively, similar to phase one and phase two, the classification performance for the
phase three models was quite poor with regards to the relevant metrics. Precision was
greatly reduced compared to phase two, indicating many false positives — false positives
in this case referring to samples being classified as Lost while they were actually labeled
as Not Lost.
The reason for the low precision could be partially explained by there being very little
difference in feature values for the samples labeled as Lost and the samples directly
preceding them that were labeled as Not Lost. This is illustrated in Table 3.37 and
Table 3.38 for Reno and Cubic, respectively. Even though many samples were wrongly
classified as Lost, they could belong to this group of samples that occurs right before the
actual Lost samples, as they have very similar feature values. The model could therefore
be reacting a bit early. As explained when discussing the results for phase two, this
could indicate that the model often gets things almost right, by labeling a case as True
even if it is not right at the peak yet, but rather approaching it.
Table 3.37: An excerpt from the Reno phase three training dataset. The training data is shown
as groups of five samples taken from different parts of the aggregated training data; the
samples in each group came from the same part and occurred in succession. Each group of
five samples includes one True sample, the two False samples that occurred right before,
and the two False samples that occurred right after
Another reason for the low precision can be attributed to the scale_pos_weight
parameter used for tuning the model. Increasing this value from the default of 1 seemed
to improve the F1 score at the expense of precision since it greatly improved recall.
Adjusting the classification threshold (Section 2.2.5) after tuning showed that the model
improved with regards to both precision and recall compared to no tuning.
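The effect of adjusting the classification threshold can be illustrated as follows (the probabilities and labels are invented for the example; the real model would supply them via predict_proba):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative predicted probabilities and labels, not actual model output.
y_true = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])
y_prob = np.array([0.10, 0.20, 0.30, 0.40, 0.70, 0.35, 0.20, 0.55, 0.45, 0.50])

def predict_at(threshold: float) -> np.ndarray:
    """Map predicted probabilities to Lost (1) / Not Lost (0) at a given cutoff."""
    return (y_prob >= threshold).astype(int)

# Raising the threshold trades recall for precision:
p50, r50 = precision_score(y_true, predict_at(0.5)), recall_score(y_true, predict_at(0.5))
p60, r60 = precision_score(y_true, predict_at(0.6)), recall_score(y_true, predict_at(0.6))
```

Raising the cutoff from 0.5 to 0.6 here lifts precision from 2/3 to 1.0 while halving recall, which is exactly the trade-off being balanced during tuning.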
Even though the performance was poor, the results between the validation and test set
were very similar, indicating that the model generalizes well and is not overfitting to the
training data (Section 2.2.7). This was also the case for the phase one and phase two
models.
As with the phase two classifiers, and illustrated in Table 3.41 and Table 3.42 for Reno
and Cubic, respectively, the feature importances changed quite a bit. However, they did
not change in exactly the same way as when going from phase one to phase two, where more
static features like the cwnd saw a great reduction in importance and more dynamic
features like the max_rtt seemed to become more important. This was also the case for
the phase three classifiers when compared to phase one, but in addition, features like the
expire_time and last_send seemed to become significantly more important compared
to both phase one and phase two.
Looking at the training data excerpts from Reno and Cubic as illustrated in Table 3.37
Table 3.38: An excerpt from the Cubic phase three training dataset. The training data is shown
as groups of five samples taken from different parts of the aggregated training data; the
samples in each group came from the same part and occurred in succession. Each group of
five samples includes one True sample, the two False samples that occurred right before,
and the two False samples that occurred right after
and Table 3.38, respectively, the reason for the expire_time feature having such high
importance in the phase three classifiers could be partially attributed to a pattern where
the value seemed to be higher for the False samples that directly follow a True sample.
However, at least for Cubic, this was not always the case, with the values for
the False samples directly following a True sample sometimes being lower than for the
True sample. This can be seen in the last three columns of Table 3.38, where the value
is larger for the True sample before it decreases for the False samples directly after.
The reason for the high importance of the last_send feature could be partially
explained by this being a value that seemed to increase as the samples got closer to a
True case, meaning that it seemed to increase with increasing congestion, before dropping
down to a substantially smaller value again for the False samples directly following a
True case. This was especially true for samples coming from outputs where the BDP
was generally smaller. In cases where the samples came from outputs where the BDP
was larger, the last_send values seemed to be mostly stable, only sometimes increasing
for a True sample before dropping down again for the False samples afterwards. For
example, the value could be 4 for almost all the samples from a specific ss output, only
sometimes being a different value, such as 8, for some of the True cases.
As in phase two, the cwnd_diff feature had quite high and about the same importance
for both phase three classifiers, as illustrated in Table 3.39 and 3.40 for Reno and Cubic,
respectively. As discussed for the phase two classifiers, this was probably used to make
sure that the model did not label cases where the cwnd was decreasing as Lost. However,
the cwnd_diff was not always negative for the False samples that followed a
True sample. There were also many True samples that had a negative cwnd_diff value
in the phase three training data, where one such example is shown in Table 3.37 for
the Reno phase three training data. There seemed to be a pattern where some of
Feature Importance
timer_name 0.00
expire_time 0.26
retrans 0.00
rto 0.04
rtt 0.04
rtt_variance 0.02
cwnd 0.07
ssthresh 0.05
data_segments_sent 0.04
last_send 0.14
pacing_rate 0.02
min_rtt 0.03
max_rtt 0.06
cwnd_diff 0.13
min_cwnd 0.03
max_cwnd 0.02
min_ssthresh 0.04
max_ssthresh 0.02
Table 3.39: Feature importances for the Reno phase three model
Feature Importance
timer_name 0.00
expire_time 0.23
retrans 0.00
rto 0.05
rtt 0.04
rtt_variance 0.02
cwnd 0.06
ssthresh 0.05
data_segments_sent 0.03
last_send 0.13
pacing_rate 0.03
min_rtt 0.03
max_rtt 0.06
cwnd_diff 0.15
min_cwnd 0.03
max_cwnd 0.03
min_ssthresh 0.04
max_ssthresh 0.02
Table 3.40: Feature importances for the Cubic phase three model
the connections with generally small cwnd values had scenarios where the cwnd first
decreased, then stayed at the same level for a while before decreasing even more. For
example, cwnd decreased from 15 to 12, then there were many samples with cwnd 12,
before it decreased further to 10, and so on. In cases where the cwnd values generally
were larger, the changes in values from one sample to the next seemed larger as well,
and this issue seemed to be less present.
The reason for there being some samples labeled as False in the training data immediately
Table 3.41: Differences in feature importances for Reno models across phases
Table 3.42: Differences in feature importances for Cubic models across phases
following True cases that also had a positive cwnd_diff value, as illustrated in Table
3.38, could be explained by the way that the cwnd_diff was assigned to the samples.
As discussed in Section 3.3, data was captured using the ss network utility [85]. Most
of the time, the output from ss contained the various fields that should be used for
the machine learning features in the training data, such as cwnd, rtt, and so on, but
there were cases where some fields were missing. These cases were discarded in the data
transformation step when parsing the outputs from the data collection step, and were
therefore not included in the final training data as samples. However, when calculating
the cwnd_diff value using the function in Listing 3.8, as long as the cwnd value was
present in the output, it was considered and added to a separate data structure from
the one that contained the samples that should be present in the final training data.
There were therefore two data structures: one for all the outputs that contained the
cwnd field from ss and another for all the valid outputs that contained all the required
fields that should be present in the final samples. The first one was used to assign the
cwnd_diff, but since both data structures contained references to the same Python
dictionary object, this was the value present in the final samples as well.
This can be explained more clearly by considering a scenario where you have three
outputs from ss that should be parsed and potentially included in the final training
data as samples. The first output contains all the fields that need to be present in
order to extract all the machine learning features and construct a valid sample, and is
therefore regarded as “valid”. The third output is the same, also “valid”. The second
output however, is missing one field, for example the ssthresh. This second output is
therefore not valid because it is missing a field that needs to be extracted in order to
construct the relevant feature, and should not be included in the final training data.
However, this second output is not missing the cwnd field. It is therefore added to a
separate data structure that contains all the outputs that had a valid cwnd field with a
value. All three outputs have a valid cwnd field, with the following values: 4, 2, and
3 for the first, second, and third output, respectively. For the valid outputs, adding the
cwnd_diff feature for the first output is not possible, because it is the first and there
is therefore no previous sample to use for calculating it. Adding the cwnd_diff feature
for the third output is possible, however, and since the second output had a valid cwnd
field, the cwnd_diff feature value for the third output is 3 − 2 = 1. Had the second output
not been considered at all, the value would have been 3 − 4 = −1 instead.
This could help explain the cases where the False samples have a positive cwnd_diff
value. Only using the valid outputs for this instead could perhaps have prevented this.
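The shared-reference behaviour described above can be reproduced in a few lines. This is a simplified reconstruction: the field names match the dataset, but the values and the one-step cwnd_diff calculation are illustrative, not the thesis's actual parsing code.

```python
# Three parsed ss outputs; the middle one is "invalid" because ssthresh is missing.
raw_outputs = [
    {"cwnd": 4, "ssthresh": 200},  # valid: all required fields present
    {"cwnd": 2},                   # invalid: ssthresh missing
    {"cwnd": 3, "ssthresh": 180},  # valid
]

cwnd_outputs = []   # every output that at least has a cwnd field
valid_samples = []  # only outputs with all required fields

for output in raw_outputs:
    cwnd_outputs.append(output)
    if "ssthresh" in output:
        valid_samples.append(output)  # same dict object, not a copy

# cwnd_diff is computed over cwnd_outputs, so the invalid middle output is used:
for i, output in enumerate(cwnd_outputs):
    output["cwnd_diff"] = 0 if i == 0 else output["cwnd"] - cwnd_outputs[i - 1]["cwnd"]

# Because both lists reference the same dicts, the final valid sample now
# carries cwnd_diff = 3 - 2 = 1 instead of 3 - 4 = -1.
```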
As fully explained in Section 3.4.4, the procedure used to label the phase three
training data labeled samples by looking at the previous, current, and next samples.
For a given sample, referred to here as the current sample, the cwnd value was
compared to those of the previous and next samples. If the previous was smaller or the
same and the next was smaller, the current sample was labeled as Lost; otherwise, it
was labeled as Not Lost. This was meant to simulate the peaks that can be seen in
the cwnd plot of, for example, Reno in Figure 2.1. Making the labeling less sensitive to
changes in cwnd could perhaps have solved the problem of True samples having negative
cwnd_diff values and improved model performance, by making sure that packets were
only labeled as Lost if the next measurement had an appropriately smaller value, instead
of just checking if it was smaller. Another thing that could have combatted the problem,
would be to check if there is currently a downward trend — meaning that for a current
sample, the next sample has a smaller cwnd value — and only label a packet as Lost
if this is the first occurrence of a downward trend in the current downward trend (and
reset this when cwnd is growing again).
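The peak-labeling rule described at the start of this paragraph can be sketched as follows (an illustrative reconstruction, not the thesis's actual labeling code):

```python
def label_lost(cwnds: list) -> list:
    """Label each sample per the described peak rule: Lost when the previous
    cwnd is smaller or equal and the next cwnd is smaller."""
    labels = []
    for i, cwnd in enumerate(cwnds):
        if i == 0 or i == len(cwnds) - 1:
            labels.append(False)  # no previous or next sample to compare with
            continue
        labels.append(cwnds[i - 1] <= cwnd and cwnds[i + 1] < cwnd)
    return labels

# The plateau at 16 is labeled Lost only at its final sample, right at the peak:
labels = label_lost([2, 4, 8, 16, 16, 8, 9, 10])
```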
Seeing as the number of samples in the phase three datasets was very large, with
roughly 7M samples in both the Reno and Cubic phase three training datasets,
exploring a different approach using a deep neural network instead of a gradient booster
could perhaps have been a better option, according to research comparing the two [10],
which suggests that for very large tabular datasets with predominantly continuous
features, where "very large" in this case refers to datasets with roughly 10M samples,
modern neural network architectures may have an advantage over gradient boosting
frameworks such as the ones that were used in this thesis.
Chapter 4
Model inference and results
Training and evaluating a machine learning model on labeled data, as discussed in detail
in Chapter 3, can give a good indication of its performance and what it might be used
for. However, applying a trained machine learning model to handle some task in real
time can provide an even better indication of its performance and usefulness.
For example, a trained binary classifier may classify a sample as True even though
the evaluation data, which contains the correct labels, says otherwise, and the
prediction could still be useful when applying the model to a given problem.
When evaluating the model in the training
therefore contribute to worse performance when looking at the relevant performance
metrics, such as precision, recall, and F1 score (Section 2.2.5). Therefore, looking at
performance metrics alone is not always enough and does not necessarily give a complete
picture of the potential usefulness of a machine learning model.
For the problem discussed in this thesis, if a sample was classified as True (meaning
packet loss) even though it was not a sample that should be classified as True according
to the training data — meaning that it was not the sample that was taken right at the
peak before the packet loss happened — it could still be a “correct” prediction in the
sense that it was very close to the actual True case. Referring to the cwnd plot of Reno
(Figure 2.1), if samples that were close enough to the peak were classified as lost, this
could prove useful in the sense that it could serve as an indicator of congestion and most
likely imminent packet loss.
Model inference is described by Google in the following way:
Machine learning inference is the process of running data points into a
machine learning model to calculate an output such as a single numerical
score. This process is also referred to as “operationalizing a machine learning
model” or “putting a machine learning model into production.” [62]
The data points were produced using ss [85] from a TCP connection, transformed and
prepared as input data by a data preparation module, and the output calculated as a
single numerical score by a prediction module, all of which are briefly explained below
and in detail in their own sections. The entire setup is visualized in Figure 4.1. The goal
was to apply the model and investigate its real-time prediction performance on a TCP
connection.
Figure 4.1: The test setup showing the flow of data when the model was applied to perform
predictions on a connection in real time
TCP connection A script that started a TCP connection configured with the desired
parameters, started polling data from the connection using ss, and started TShark
[88] in capture mode to capture packet information from the connection for later
analysis.
Data preparation module A Python script that was started and run as a daemon in
the background by the abovementioned connection script to create the prepared
input data in the form of a csv file for the prediction module.
Prediction module Similarly to the data preparation module, the prediction module
was a Python script that was started and run as a daemon in the background by the
connection script. The prediction module loaded the relevant exported classifier
on startup, and used the classifier to perform predictions based on the input data
that was prepared by the data preparation module.
The final output of the prediction module was a prediction in the form of a boolean value
(1 or 0) that was written to a file which was watched by the connection for changes.
Based on the contents of said file, the connection toggled ECN (Section 2.1.9) on or
off at the router, in order to signal congestion and reduce the sending rate dynamically
based on model predictions instead of waiting for packet loss to happen.
Only the phase three models (Section 3.7) were considered and used when running the
tests described in this chapter.
that a prediction happened at least once per RTT. The data preparation (Section 4.2)
and prediction (Section 4.3) modules were timed multiple times using the Python timeit
module [87] in order to get an idea of the average running time of the modules when
doing model inference. The timings indicated that the average running time on the Linux
machine that was used for testing was about 0.15ms and 5ms for the data preparation
module and prediction module, respectively.
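As a rough illustration of how such timings can be obtained with the Python timeit module, the sketch below times a hypothetical stand-in function; the real data preparation pass parses ss output as described in Section 4.2, and prepare_input here is purely illustrative:

```python
import timeit

def prepare_input():
    # Hypothetical stand-in for one data preparation pass; the real
    # module extracts fields such as cwnd, rtt, and rto from ss output.
    fields = {"cwnd": 10, "rtt": 30.0, "rto": 204}
    return ",".join(str(value) for value in fields.values())

# timeit returns the total time in seconds for `number` calls, so
# dividing by the number of runs gives the average time per call.
runs = 1000
total_seconds = timeit.timeit(prepare_input, number=runs)
avg_ms = (total_seconds / runs) * 1000
print(f"average running time: {avg_ms:.4f} ms")
```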
toggle_ecn() {
    local input_file_path="$1"
    local delay="$2"
    local watch_interval=$((delay / 3))
    local prev_content=""

    while true; do
        local file_content=$(cat "$input_file_path")

        if [[ "$file_content" != "$prev_content" ]]; then
            # Clear previous rules.
            iptables -t mangle -F OUTPUT

            if [[ "$file_content" == "1" ]]; then
                # Enable ECN.
                iptables -t mangle -A POSTROUTING -p tcp -j TOS --set-tos 3
            elif [[ "$file_content" == "0" ]]; then
                # Disable ECN.
                iptables -t mangle -D POSTROUTING -p tcp -j TOS --set-tos 3
            fi

            # Update previous content.
            prev_content="$file_content"
        fi

        sleep $((watch_interval / 1000)).$((watch_interval % 1000))
    done
}
Listing 4.1: Bash function for watching a given file for changes and either enabling or disabling
ECN using an iptables rule
By CE marking all outgoing packets from the router if the contents of the watched file
containing the model prediction indicated that this should be done, the receiver h2 could
see this bit and send ECN-Echoes (ECE) to the sender h1 as part of the TCP header in
the ACKs. The sender then reacted to this by reducing its sending rate and setting the
Congestion Window Reduced (CWR) bit in the TCP header of the packets that were sent
to the receiver, as explained in Section 2.1.9. This way, the sending rate was dynamically
reduced based on model predictions instead of as a result of packet loss (Section 2.1.6).
The script also started a process in the background for continuously polling data using ss. In addition to appending each ss output to a file to produce a txt file with all the outputs, the latest ss output was written to a separate txt file containing only a single ss output. This file was used by the data preparation module, as described in Section 4.2.
In addition to starting a process for polling data using ss, the connection script started
TShark in capture mode to capture both outgoing and incoming packets on the h1-r
interface. This resulted in pcap files that were later used to calculate metrics such
as throughput and retransmissions, in order to compare the results between the various
tests. The resulting pcap files were also analyzed to check the behavior of the connection
with both model inference enabled and disabled. In the former case, it was of great interest whether the router correctly CE marked outgoing packets and whether the ECE bit was set by the receiver on the ACKs going to the sender.
The script supported being run in model inference mode or no model inference mode, where the former referred to the case where model predictions were used to decide whether outgoing packets should be CE marked at the router.
However, the script also supported being run in timestamp mode, referring to a mode
where model predictions were enabled but packets were not CE marked at the router,
meaning that the connection was running as normal. When the connection was run in
timestamp mode, all the model predictions and their timestamps were written to an
output file, which was later used to produce plots of the cwnd overlaid by the model
predictions. These are shown and discussed in Section 4.4.
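A minimal sketch of what writing predictions in timestamp mode could look like is shown below; the function name and the comma-separated line format are assumptions for illustration, not taken from the thesis code:

```python
import os
import tempfile

def record_prediction(path, timestamp, prediction):
    # Append a prediction (1 or 0) together with its timestamp in
    # seconds since the prediction module started.
    with open(path, "a") as f:
        f.write(f"{timestamp:.3f},{int(prediction)}\n")

# Usage: record two predictions to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "predictions.txt")
record_prediction(path, 1.504, True)
record_prediction(path, 1.712, False)
with open(path) as f:
    print(f.read(), end="")
```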
The connection script created a directory based on the passed command line arguments, where all the relevant output files, as well as the files used by the other components of the test setup, resided.
Listing 4.2: Python function using the Observer class from Watchdog to observe a given directory
for changes and using a custom event handler to handle various events such as created, modified,
deleted, and so on
Loading and parsing the ss output file was done using a modified version of the approach
described in Section 3.4, where various functions were defined and used to extract the
relevant values from the ss output, such as cwnd, rtt, rto, and so on to construct the
final machine learning features that should be present in the input data to the prediction
module.
The more special features, such as the various min and max features, were handled by being defined as instance variables on the event handler class, and were therefore continuously updated as the connection progressed. This was also the reason why the data preparation module was run as a daemon for the entire duration of the connection: so that the various features could be persisted and used for subsequent outputs to create input data for the prediction module.
One thing to note is the way the cwnd_diff feature was handled by the data preparation module. This feature, fully explained in Section 3.4.2, represented the difference between the current cwnd value and the previous one that was not the same. In the data preparation module, however, a simplified version was calculated and added to the final input file supplied to the prediction module: only the difference between the current and previous cwnd was used, meaning that the value was often 0. This was done to keep the module simple and reduce computation overhead when running everything in real time.
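A sketch of the simplified computation is shown below; the function name is illustrative, and the logic is only what the paragraph above describes:

```python
def simplified_cwnd_diff(current_cwnd, previous_cwnd):
    # Simplified cwnd_diff: difference between the current and the
    # immediately preceding cwnd sample. Often 0, since the cwnd
    # frequently stays the same between consecutive ss outputs.
    if previous_cwnd is None:
        return 0
    return current_cwnd - previous_cwnd

# Example stream of consecutive cwnd samples.
diffs = []
prev = None
for cwnd in [10, 10, 11, 11, 9]:
    diffs.append(simplified_cwnd_diff(cwnd, prev))
    prev = cwnd
print(diffs)  # → [0, 0, 1, 0, -2]
```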
In the data collection step, fully explained in Section 3.3, it was decided that the initial
slow start peak (Section 2.1.6) should be skipped when collecting the data to be used
for training. This was done in order to avoid outliers and produce more consistent data
with regards to the various machine learning features that should be used for the final
model. In the same way, it was decided that the predictions should not be performed
before the initial slow start peak was over when running the model inference in real time
on a TCP connection. Therefore, the data preparation module simply returned None for
the first few seconds of the connection — which the prediction module was configured
to handle so that it returned False — so that the model inference did not kick in before
the connection had reached congestion avoidance mode and the various features were
more consistent with the data that the final model had been trained on.
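The behavior described above can be sketched roughly as follows; the cutoff value and function names are illustrative assumptions, since the thesis only states that predictions were suppressed for the first few seconds:

```python
import time

SKIP_SECONDS = 3.0  # illustrative cutoff; not a value from the thesis

def prepare_sample(start_time, ss_fields):
    # Return prepared input features, or None while the initial slow
    # start peak is assumed to still be in progress.
    if time.monotonic() - start_time < SKIP_SECONDS:
        return None
    return ss_fields

def predict(classifier, prepared):
    # The prediction module returns False when no prepared input is
    # available yet, so that inference does not kick in before the
    # connection has reached congestion avoidance.
    if prepared is None:
        return False
    return classifier(prepared)
```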
Having read and parsed the various ss fields from the ss output file produced by the
connection script, a csv file was created in the same directory as all the other files. This
csv file represented the final prepared input data containing the same machine learning
features that were used to train the model.
— where the timestamp was calculated by subtracting the time when the prediction module was started from the current time — was appended to a file that resided in the same directory as all the other relevant files, as described in Section 4.1.
The final output of the prediction module was a boolean value (1 or 0) representing the
prediction that was written to a file which was watched by the connection for changes,
as described in Section 4.1.
4.4 Results
Results were analyzed by calculating the throughput and retransmissions for various scenarios. For each scenario, consisting of a specific combination of delay, bandwidth, and queue size connection parameters, tests were run with and without model inference enabled. For the model inference part of the tests, three classification thresholds (Section 2.2.5) were considered: 0.1, 0.25, and 0.5. This resulted in four different groups of results for one specific scenario:
1. No model inference enabled: This group contained the results where no model
inference was enabled and served as a baseline for comparison with the other
groups where model inference was enabled and various classification thresholds
were considered.
2. 0.1 threshold: This group contained the results where model inference was
enabled and the classification threshold was configured to 0.1, so that samples
with an output probability at or above 0.1 were classified as Lost, and samples
with an output probability of less than 0.1 were classified as Not Lost.
3. 0.25 threshold: Same as for 0.1 but with a classification threshold of 0.25.
4. 0.5 threshold: Same as for 0.1 and 0.25 but with a classification threshold of 0.5.
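The threshold rule described for the groups above amounts to the following simple check (an illustrative sketch, not the thesis code):

```python
def classify(probability, threshold):
    # A sample is classified as Lost (True) when the model's output
    # probability is at or above the classification threshold.
    return probability >= threshold

print(classify(0.12, 0.1))   # → True: at or above the 0.1 threshold
print(classify(0.12, 0.25))  # → False: below the 0.25 threshold
```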
For each scenario and each of these groups, the relevant test was run five times in order to obtain an average result for retransmissions and throughput. The tests were run multiple times because there appeared to be large variability between runs for a given scenario in both the number of retransmissions and the throughput, especially for connections with added background traffic, where the behavior of the connection, and therefore the results, was less predictable due to the flow being impacted by the background traffic. For a given scenario, this resulted in 20 different files with results: five for each of the four groups. Each of these files contained the calculated throughput and the number of retransmissions that occurred. The average throughput and retransmissions for each scenario were then calculated and saved to a new file representing the average results for that specific scenario.
The abovementioned results in the form of throughput and retransmissions were
calculated by parsing the pcap file that was output by TShark when capturing packet
data for the duration of the test, as explained in Section 4.1. This was done using
a Python script that first parsed the relevant pcap file and read the various packets
into memory using the rdpcap function from the scapy module [80]. Since TShark was
configured to capture all incoming and outgoing packets at the h1-r interface between
the sender and receiver (Section 4.1), the resulting packet list from the rdpcap function
was filtered to only include the packets that originated from the sender h1. Throughput
was then calculated by first creating a new list which contained all the packets excluding
retransmissions, and dividing the total bytes sent by the duration to get the megabits
per second (Mbps), as shown in Listing 4.3.
def get_throughput(packets: list, duration: int) -> float:
    """Calculate the throughput in Mbps.

    Args:
        packets: List of packets that should be included in the calculation.
        duration: The duration of the measurement in seconds.

    Returns:
        The throughput in Mbps.
    """
    total_bytes = sum(len(packet[TCP].payload) for packet in packets if TCP in packet)
    throughput_mbps = (total_bytes * 8) / (1000000 * duration)

    return throughput_mbps
Listing 4.3: Python function for calculating the throughput in Mbps given a list of packets and
the duration of the connection
The retransmissions were filtered out before calculating the throughput so that the same packets were not counted multiple times in the total bytes sent. Had they been included, the throughput values would have been generally higher, but since they would have been consistently higher across all tests, the comparisons would have remained the same.
Retransmissions were calculated by first filtering the original packet list to only include the packets sent after the initial slow start peak (Section 2.1.6) by skipping the first second. This was done to make the measurements more consistent, and to follow the same approach that was used for creating the training data (Section 3.3) and when performing the predictions during the model inference tests (Section 4.2). The number of retransmissions was then calculated by going through the packet list and checking for duplicate sequence numbers, as shown in Listing 4.4.
def get_retransmissions(packets) -> int:
    """Calculate the number of retransmissions in the pcap file.

    Args:
        packets: List of packets that should be included in the calculation.

    Returns:
        The number of retransmissions.
    """
    seq_numbers = defaultdict(int)
    retransmissions = 0

    for packet in packets:
        if TCP in packet:
            seq = packet[TCP].seq
            seq_numbers[seq] += 1
            if seq_numbers[seq] > 1:
                retransmissions += 1

    return retransmissions
Listing 4.4: Python function for calculating the number of retransmissions given a list of packets
The final metrics, in the form of throughput and number of retransmissions, were saved to a txt file in the relevant directory for the specific test of a specific scenario, resulting in five such txt files with metrics for each scenario and each group of results, as described earlier in this section.
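The per-scenario averaging can be sketched as below; the function name and tuple layout are illustrative assumptions, not taken from the thesis code:

```python
def average_metrics(results):
    # Average (throughput_mbps, retransmissions) tuples from the
    # repeated runs of one group of one scenario.
    n = len(results)
    avg_throughput = sum(r[0] for r in results) / n
    avg_retransmissions = sum(r[1] for r in results) / n
    return avg_throughput, avg_retransmissions

# Five hypothetical runs for one group of one scenario.
runs = [(42.1, 30), (40.5, 45), (41.8, 38), (43.0, 25), (40.6, 42)]
avg = average_metrics(runs)
print(avg)
```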
The possible scenarios with regards to delay, bandwidth, and queue size as a multiplier
of BDP that were considered are shown in Table 4.1 below:
Table 4.1: The scenarios that were considered when running the model inference related tests
Figure 4.2: cwnd plot from a connection configured using TCP Reno with 30ms delay, 50Mbps
bandwidth, and 1 BDP queue size
As mentioned in the beginning of this section, the tests were run with and without model
inference enabled, where 0.1, 0.25, and 0.5 were considered as classification thresholds
in the latter case. Since each test was run five times, the final result consisted of 120
distinct txt files with metrics in the form of throughput and amount of retransmissions.
In addition to calculating metrics for all the scenarios and comparing the results, tests were run in timestamp mode (described in Section 4.1) to gather results in the form of an output file with predictions and their timestamps, used to produce plots of the cwnd versus time with the predictions overlaid, such as the one shown in Figure 4.3.
The abovementioned tests for the various scenarios producing 120 distinct txt files, as well as the tests in timestamp mode, were run for both Reno and Cubic, both as a single flow and with background traffic.
each test was run five times to get an average value for throughput and retransmissions. The results were then aggregated by classification threshold, so that the final results were grouped into four distinct groups, as described in the beginning of Section 4.4: no model inference, 0.1 threshold, 0.25 threshold, and 0.5 threshold.
In addition to the tests described above and in more detail in Section 4.4, tests were run
in timestamp mode — described in detail in Section 4.1 — for both Reno and Cubic in
order to get a plot of the cwnd overlaid by the model predictions. These plots show the
regular behavior of the connection without model inference enabled, with marks showing
where the prediction module (Section 4.3) decided to classify a sample as Lost (1) — and
therefore where the router would have toggled on CE marking on the outgoing interface
if model inference was turned on to allow the sender to back off.
Figure 4.3 shows a single flow configured with Reno and a relatively large BDP to illustrate
packet loss prediction as a first proof of concept. As mentioned, the red marks in the
plot show where the prediction module decided to mark a sample as Lost (1) and where
the sender would have backed off in order to avoid congestion if model inference was
enabled. For this specific test, the predictions were very accurate, which also seemed to
be a general trend throughout the tests — depending on the classification threshold and
configured BDP.
Figure 4.3: cwnd plot overlaid with model predictions from a connection configured using TCP
Reno with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size, and 0.25 classification threshold
Figure 4.4 and Figure 4.5 show a single flow configured with Reno or Cubic respectively,
and a smaller BDP than the connection illustrated in Figure 4.3. For both Reno and
Cubic, the model predictions were mostly accurate. However, in the case of Reno, the
model seemed to miss a sawtooth entirely, as can be seen at the fourth sawtooth in Figure 4.4, while it missed many peaks for Cubic. This seemed to be the case for many of the tests, especially for connections configured with larger BDPs and larger thresholds. The reason could partially be that a higher classification threshold made the prediction module less likely to mark a given sample as Lost (1), since the output probabilities were mostly very low. The output probabilities were low because a higher BDP seemed to result in more stable values with fewer fluctuations compared to connections configured with smaller BDPs (Section 3.7).
Figure 4.4: cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 30ms delay, 50Mbps bandwidth, 1 BDP queue size, and 0.25 classification threshold
Figure 4.5: cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 30ms delay, 50Mbps bandwidth, 1 BDP queue size, and 0.1 classification threshold
There seemed to be a general trend where connections with larger BDPs performed
better than connections with smaller BDPs. This could partially be attributed to what
was discussed in Section 3.7, where there seemed to be samples in the training data that
were labeled as True even though they should have been labeled as False. This seemed
to be mostly the case for connections with smaller BDPs, where the cwnd fluctuated between values such as 14 and 15, with the value sometimes being, for example, 14, 14, 14, 15, 14 for a series of five samples. According to the heuristic used
by the labeling procedure (Section 3.4.4), the sample with a cwnd value of 15 would be
labeled as True in this case, even though this was not necessarily the peak. In addition,
there seemed to be cases for low BDPs where the cwnd value reached a peak and a sample
was correctly labeled as True, before the values decreased but decreased gradually so
that the value was for example 30 at the peak, then 25 for two samples before going
further down to 23, where it stayed at 23 for a few samples before going down to 21,
and so on. According to the heuristic of the labeling procedure, this meant that some of these samples in the decreasing phase were labeled as True because, for a given sample, the cwnd value of the previous sample was the same and the value of the next was smaller. This resulted in multiple samples in the decreasing phase being labeled as True, with negative cwnd_diff values, even though they should have been labeled as False. These
wrongfully labeled samples could help explain the comparatively worse performance for
connections with smaller BDPs.
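To make the failure mode concrete, the sketch below implements a heuristic consistent with the behavior described above; it is an illustration, not the exact labeling code from Section 3.4.4:

```python
def label_samples(cwnd_values):
    # A sample is labeled True (a peak) when the previous cwnd value is
    # less than or equal to the current one and the next value is
    # smaller, mirroring the behavior described in the text.
    labels = []
    for i, cwnd in enumerate(cwnd_values):
        prev_ok = i > 0 and cwnd_values[i - 1] <= cwnd
        next_ok = i < len(cwnd_values) - 1 and cwnd_values[i + 1] < cwnd
        labels.append(prev_ok and next_ok)
    return labels

# Fluctuating case: the 15 is labeled True even though it is not
# necessarily a real peak.
print(label_samples([14, 14, 14, 15, 14]))
# Gradually decreasing case: a sample in the decreasing phase is also
# labeled True because its predecessor is equal and its successor smaller.
print(label_samples([30, 25, 25, 23, 23, 21]))
```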
As illustrated in Figure 4.4 and Figure 4.5 for Reno and Cubic respectively, what was
hypothesized in Section 3.6 and Section 3.7 seemed to be true, with the model often
getting things almost right by marking a sample as Lost even though it was not right at
the peak yet, and sometimes marking packets that were a bit further from the peak. As
illustrated in the figures for both Reno and Cubic, the predictions were still very close
to the actual peak, and could be considered a sensible place to back off in the case of
sending rate reduction to avoid congestion.
Figure 4.6 and Figure 4.8 show the reduction in retransmissions for the various classification thresholds when model inference was enabled, compared to the baseline case with no model inference, for Reno and Cubic respectively, while Figure 4.7 and Figure 4.9 show the corresponding throughput change. In all cases, outliers have been removed from the plots to illustrate the general trends more clearly. For both Reno and Cubic, the number of retransmissions was almost always reduced when model inference was enabled, with one exception in the Reno case with a 0.5 classification threshold. This exception, with a negative retransmission reduction, can mainly be explained by variance between test runs, where the number of retransmissions sometimes varied greatly between runs. There could also be some added overhead
from running the connection with model inference enabled potentially contributing to
increased retransmissions in some cases where very few samples were marked as Lost.
If the tests were run more than five times for each scenario, for example 100 times,
it would be expected that the retransmission reduction would always be positive with
model inference enabled.
There seemed to be a negative correlation between the classification threshold and
retransmission reduction, where a lower classification threshold resulted in a greater
reduction in retransmissions. However, this seemed to always come at the expense
of somewhat reduced throughput, with there being a positive correlation between the
throughput and classification threshold, where a lower threshold resulted in lower
throughput and vice versa. These results therefore illustrate a trade-off between the number of retransmissions and the throughput, where a classification threshold could potentially be chosen based on which metric is deemed the most important: having fewer retransmissions or having higher throughput.
The trade-off between retransmission reduction and throughput change is illustrated in
Figure 4.10 and Figure 4.11 for Reno and Cubic respectively for the single flow case,
showing that when the retransmission reduction grows, the throughput change decreases,
and vice versa. The center coordinates of each circle are determined by the difference
in the median retransmissions and throughput for the relevant threshold compared to
the baseline (no model inference). The diameter of the circles is based on the variance
in the retransmission reduction for the relevant threshold, calculated by subtracting the
upper quartile from the lower quartile. The plots therefore also illustrate that there
was a great variance in the results for the various scenarios, with some scenarios having
greater retransmission reductions for a specific threshold than the others compared to
the baseline case where no model inference was enabled. This could be attributed to
an observed trend, especially for Reno, where connections with smaller BDPs proved to
result in more aggressive prediction performance with regards to how many samples were
labeled as Lost compared to connections with higher BDPs, where a lower threshold was
often needed in order to produce any predictions at all. This latter case is illustrated in
Figure 4.5, which shows a connection configured with the largest BDP of the considered
scenarios discussed in Section 4.4, and where the model predictions happened in the case
of a 0.1 classification threshold. As can be seen in the plot, the model was relatively
conservative with the markings given the very low classification threshold.
Figure 4.6: Retransmission reduction for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
Figure 4.7: Throughput change for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
Figure 4.8: Retransmission reduction for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds
Figure 4.9: Throughput change for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds
Variance in the retransmission reduction for a specific threshold could also be attributed
to variance between test runs, where some test runs for some of the scenarios for a
specific threshold could have resulted in greater retransmission reductions compared to
the others, leading to a situation where the difference in the upper quartile for a given
threshold compared to the baseline was greater compared to the lower quartile for the
same threshold.
The model predictions seemed to be generally more aggressive for Reno compared
to Cubic, which could help explain the difference in retransmission reduction for
Reno compared to Cubic. As illustrated in Figure 4.6 and Figure 4.8 for Reno and
Cubic respectively, the median retransmission reduction is about the same, with a
slightly higher value for Cubic. However, the upper quartile and maximum value for
retransmission reduction were considerably higher for Reno, reaching a 100% reduction at
the maximum. This can be attributed to one specific scenario with model inference
enabled using the lowest threshold (0.1), which resulted in 0 retransmissions, while the
baseline case for the same scenario resulted in 45 retransmissions. This came with a
considerable reduction in throughput however, as shown in Table 4.2.
Figure 4.10: Trade-off between retransmission reduction and throughput change for TCP Reno when running as a single flow with model inference enabled at various classification thresholds
Figure 4.11: Trade-off between retransmission reduction and throughput change for TCP Cubic when running as a single flow with model inference enabled at various classification thresholds
Table 4.2: Results for a specific Reno single flow scenario with a configured delay of 30ms, bandwidth of 50Mbps, and queue size of 0.5 BDP
The pcap files from the various test runs of the scenario with model inference
enabled shown in Table 4.2 were analyzed, and the packets were filtered to only show
retransmissions and ECN-Echoes (ECE) by applying the filter below:
tcp.flags.ece==1 or tcp.analysis.retransmission
It was observed that the connection backed off at regular intervals without losing any
packets. The cwnd plot from one of the five test runs for this scenario that resulted
in zero retransmissions is shown in Figure 4.12. This demonstrates a successful proof
of concept showing that a machine learning model can be used to predict packet loss
and the sending rate can be reduced proactively by making use of ECN (Section 2.1.9)
to avoid losing packets and resulting retransmissions due to buffer overflows caused by
congestion.
The results seemed to be mostly the same for Reno and Cubic, with some differences in result variation across the thresholds, and with Reno generally producing somewhat higher retransmission reductions and negative throughput changes. The generally higher values for Reno could be explained by the aforementioned observation that the Reno model seemed to be somewhat more aggressive in marking samples as Lost, as illustrated in Figure 4.4 and Figure 4.5, which show how the Reno model marked more packets as Lost even though its configured classification threshold was higher, for a connection with the same values for delay, bandwidth, and queue size as the Cubic one. The difference in result variation across the thresholds seemed to be a general trend for both algorithms, and this was the reason why the tests were run multiple times to get an average. If the average value was smoothed out further by running the tests many more times, it would be expected that the variance in results would be reduced.
Figure 4.12: cwnd plot from a connection configured using TCP Reno with 30ms delay, 50Mbps
bandwidth, 0.5 BDP queue size, and 0.1 classification threshold resulting in zero retransmissions
background traffic with various queue sizes in the form of multipliers of BDP. The connections in all of these cases were configured using the same delay and bandwidth; only the queue size was varied, using either 0.25, 0.5, or 1 BDP. In addition, the same classification threshold (0.5) was used when running the prediction module. Figure 4.14, Figure 4.16, and Figure 4.18 show the same scenarios for Cubic.
As illustrated by these results, the predictions seem to be mostly accurate, even in the presence of background traffic. This could indicate that the model has actually learned a pattern related to when congestion happens, instead of relying on a simple heuristic such as classifying samples based on whether the cwnd is above a certain value. Unlike in the single-flow cases, in the presence of background traffic the model has no way of knowing which cwnd value represents such a threshold (where higher values most likely indicate congestion and lower values most likely do not), because the cwnd values fluctuate much more and the maximum value of one sawtooth differs from the maximum value of the next. Combined with the model evaluation results discussed in Section 3.7, this indicates that the model used features other than the cwnd to determine whether a given sample should be classified as True or False.
As these results also illustrate, the predictions for Reno in particular were very accurate, with the model predictions falling very close to the peaks of each sawtooth. There were some exceptions where the predictions were further from the peak, but they were always on the correct side, meaning the side where the cwnd was still growing before congestion occurred and Multiplicative Decrease (MD) (Section 2.1.6) was used to decrease the cwnd. As discussed in Section 3.7, this indicates that the model potentially used the cwnd_diff feature to avoid marking samples as Lost when the cwnd appeared to be decreasing.
As explained in Section 4.2, the model inference setup used a simplified version of the cwnd_diff feature in the data preparation module to reduce computation overhead and speed up processing: it only considered the difference between the current and the immediately previous cwnd value, whereas the version used to create the training data considered the difference between the current value and the most recent previous value that differed from it (Section 3.4.2). That the simplified feature still worked could indicate that the cwnd decreased fast enough for the samples taken after congestion occurred and MD reduced the cwnd, so that the feature values were negative for many of these samples.
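To make the distinction concrete, here is a small Python sketch of the two cwnd_diff variants described above. The function names and sample values are illustrative assumptions, not the thesis implementation: the simplified inference-time variant differences against the immediately previous sample, while the training-data variant differences against the most recent earlier cwnd value that differs from the current one.

```python
def cwnd_diff_simple(cwnds):
    """Simplified inference-time variant: difference between the current
    sample's cwnd and the immediately preceding sample's cwnd."""
    return [None] + [cur - prev for prev, cur in zip(cwnds, cwnds[1:])]

def cwnd_diff_distinct(cwnds):
    """Training-data variant: difference between the current cwnd and the
    most recent earlier cwnd value that is not equal to it."""
    diffs = [None]
    for i in range(1, len(cwnds)):
        prev = next((c for c in reversed(cwnds[:i]) if c != cwnds[i]), None)
        diffs.append(cwnds[i] - prev if prev is not None else None)
    return diffs

# After Multiplicative Decrease the cwnd drops sharply, so both variants
# yield a negative value for the sample that follows congestion:
samples = [10, 10, 12, 12, 8]
print(cwnd_diff_simple(samples))    # [None, 0, 2, 0, -4]
print(cwnd_diff_distinct(samples))  # [None, None, 2, 2, -4]
```

The sketch illustrates why the simplification is cheaper (no backwards scan for a distinct value) while still going negative right after an MD event.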
Compared to the single-flow case, higher classification thresholds generally worked better for the background traffic connections. While very few packets were marked as Lost by the model in the single-flow case when 0.5 was used as the classification threshold, this was not the case with background traffic, where the same threshold generally produced quite favorable results, offering a good balance between retransmission reduction and negative throughput change, as illustrated in Figure 4.19 and Figure 4.21 for Reno and Cubic respectively. This could be partially attributed to the fact that the background traffic connections had many more retransmissions on average, as shown in Table 4.3 and Table 4.4 for Reno and Cubic respectively, which list the difference in average retransmissions between the single-flow and background traffic cases for each scenario.
Lower classification thresholds, such as 0.1 and 0.25, were very aggressive in marking samples as Lost on the background traffic connections. This also seemed to be the case for the scenarios with the largest BDPs. For the highest BDP and queue
Table 4.3: Difference in retransmissions for the single flow and background traffic case for TCP
Reno for the various scenarios without model inference
Table 4.4: Difference in retransmissions for the single flow and background traffic case for TCP
Cubic for the various scenarios without model inference
size scenario, meaning the one configured with a delay of 30ms, bandwidth of 50Mbps, and queue size of 1 BDP, it was observed that model inference tests with a classification threshold of 0.1 produced a large number of True predictions. While many of these predictions were quite close to the peaks (where they should optimally be), a considerable number were also much further from the peaks, with some even being on the “wrong” side of the peaks, referring to the side where the cwnd was decreasing. This did not seem to be the case for higher classification thresholds, such as 0.5, which generally produced fewer and more conservative predictions that were more accurate in the sense that they were closer to the peaks. However, it was also observed that, even with a high classification threshold such as 0.5, the predictions were not very accurate for the scenarios with lower BDPs, and the models for Reno and Cubic were very aggressive in marking samples as Lost.
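The effect of lowering the threshold can be illustrated with a minimal Python sketch (the per-sample loss probabilities below are made up for illustration): every sample whose predicted probability meets the threshold is marked as Lost, so a lower threshold always marks at least as many samples as a higher one.

```python
def count_lost(probs, threshold):
    """Number of samples a given classification threshold marks as Lost."""
    return sum(p >= threshold for p in probs)

# Hypothetical per-sample loss probabilities from a model:
probs = [0.05, 0.12, 0.30, 0.55, 0.80, 0.95]
for t in (0.1, 0.25, 0.5):
    print(t, count_lost(probs, t))  # 0.1 -> 5, 0.25 -> 4, 0.5 -> 3
```

This monotone relationship is why the 0.1 and 0.25 thresholds behave so aggressively: they admit every prediction the 0.5 threshold admits, plus many lower-confidence ones further from the peaks.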
There seemed to be a positive correlation between BDP and model performance: connections with larger BDPs, and thus larger queue sizes, had the most accurate predictions. This could partially be attributed to what was discussed in Section 3.7, where for connections with smaller BDPs in the training data there were many samples that were wrongly labeled as True, while the values were more stable for connections with larger BDPs. For connections with smaller BDPs, even a threshold of 0.5 seemed to be too aggressive, probably explaining the retransmission reduction values shown in Figure 4.19 and Figure 4.21 of close to 60% for the 0.5 classification threshold. The connections with smaller BDPs probably also led to the retransmission reduction values for the 0.1 and 0.25 classification thresholds of close to 100% for Reno and 80% for Cubic. Interestingly, however, the throughput did not seem to decrease that much, with the maximum negative throughput change
(which cause head-of-line blocking delays when packets are retransmitted) or higher throughput. Also, as discussed previously, since the optimal threshold seemed to vary depending on the BDP, with lower BDPs generally requiring higher thresholds in order not to overreact, the threshold could perhaps be tuned dynamically depending on network conditions such as delay and available bandwidth.
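One way such dynamic tuning could be sketched in Python is shown below. The cutoff and threshold values are purely illustrative assumptions (they were not measured in this work): compute the BDP from the available bandwidth and RTT, and pick a higher, more conservative classification threshold on small-BDP paths, where the models tended to over-predict loss.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes."""
    return bandwidth_bps * rtt_s / 8

def pick_threshold(bandwidth_bps, rtt_s, cutoff_bytes=125_000):
    """Illustrative policy: small-BDP paths get a higher classification
    threshold to avoid overreacting; large-BDP paths get a lower one.
    The 125 kB cutoff and the 0.5/0.25 values are assumptions."""
    return 0.5 if bdp_bytes(bandwidth_bps, rtt_s) < cutoff_bytes else 0.25

# 50 Mbps at 30 ms RTT: BDP = 187,500 bytes -> lower threshold
print(pick_threshold(50e6, 0.03))  # 0.25
# 10 Mbps at 30 ms RTT: BDP = 37,500 bytes -> higher threshold
print(pick_threshold(10e6, 0.03))  # 0.5
```

In practice the sender would re-estimate the BDP from its RTT samples and delivery-rate estimates and adjust the threshold as conditions change.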
Figure 4.13: cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70ms delay, 50Mbps bandwidth, 0.25 BDP queue size, background traffic, and 0.5 classification threshold

Figure 4.14: cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70ms delay, 50Mbps bandwidth, 0.25 BDP queue size, background traffic, and 0.5 classification threshold

Figure 4.15: cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size, background traffic, and 0.5 classification threshold

Figure 4.16: cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70ms delay, 50Mbps bandwidth, 0.5 BDP queue size, background traffic, and 0.5 classification threshold

Figure 4.17: cwnd plot overlaid with model predictions from a connection configured using TCP Reno with 70ms delay, 50Mbps bandwidth, 1 BDP queue size, background traffic, and 0.5 classification threshold

Figure 4.18: cwnd plot overlaid with model predictions from a connection configured using TCP Cubic with 70ms delay, 50Mbps bandwidth, 1 BDP queue size, background traffic, and 0.5 classification threshold
Figure 4.19: Retransmission reduction for TCP Reno when running with background traffic with model inference enabled at various classification thresholds

Figure 4.20: Throughput change for TCP Reno when running with background traffic with model inference enabled at various classification thresholds

Figure 4.21: Retransmission reduction for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds

Figure 4.22: Throughput change for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds

Figure 4.23: Trade-off between retransmission reduction and throughput change for TCP Reno when running with background traffic with model inference enabled at various classification thresholds

Figure 4.24: Trade-off between retransmission reduction and throughput change for TCP Cubic when running with background traffic with model inference enabled at various classification thresholds
Chapter 5
Conclusion
In this thesis we have thoroughly investigated the topic of real-time packet loss
prediction. We have shown how a machine learning model can be trained on data
collected from multiple TCP connections and applied to tackle this problem in order to
implement proactive congestion control measures using mechanisms such as ECN.
In the introduction we posed some research questions, which we have tried to answer to the best of our ability throughout this thesis. A summary of our findings and our answers to the research questions is given below. In addition, we highlight the main contributions that this thesis brings to the scientific community, which were also briefly mentioned in the introduction. Finally, we propose some possibilities for future research that could expand on the work presented in this thesis. All the results presented in this thesis are reproducible, and the source code is publicly available [86].
Chapter 5. Conclusion
5.2 Contributions
The main contributions of this thesis, as briefly mentioned in the introduction, are
summarized below:
As described in detail in Chapter 3, we have done the following:
• Conducted extensive research on the topic of real-time packet loss prediction, including which TCP state variables could be informative for a potential solution and could be used to select or construct machine learning features for creating models.
• Performed extensive data collection in the form of tests involving TCP connections
configured as a single flow between a sender and receiver with and without the
presence of background traffic and with multiple congestion control algorithms.
• Trained and tuned multiple machine learning models, which we evaluated using multiple performance metrics as well as manual inspection and analysis. We investigated differences in classification performance between models trained and evaluated on data collected with and without the presence of background traffic, and models trained and evaluated only on data collected without background traffic.
In addition, as described in Chapter 4, we have done the following:
• Performed real-time model inference, investigating through various tests how the models perform in real time and how accurate and informative their predictions are for a potential solution involving proactive sending rate reduction.
5.3. Future research
• Ran such tests for various congestion control algorithms and for connections with and without background traffic, calculated metrics such as retransmission reduction and throughput change, and created various plots showing the correlation between model inference and these metrics.
• Shown how ECN can be leveraged to proactively adjust the sending rate based on
model predictions to reduce the amount of retransmissions.
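For reference, the two headline metrics above can be computed as simple percentages. This Python sketch uses the definitions implied by the text; the sign conventions (positive retransmission reduction, negative throughput change for a throughput loss) are assumptions consistent with how the results are described.

```python
def retransmission_reduction(baseline_retx, inference_retx):
    """Percentage reduction in retransmissions versus a baseline run
    without model inference (positive means fewer retransmissions)."""
    return 100.0 * (baseline_retx - inference_retx) / baseline_retx

def throughput_change(baseline_tput, inference_tput):
    """Percentage change in throughput (negative means throughput lost)."""
    return 100.0 * (inference_tput - baseline_tput) / baseline_tput

# Hypothetical numbers: 200 retransmissions without inference vs 50 with,
# and 50.0 Mbps baseline throughput vs 47.5 Mbps with inference.
print(retransmission_reduction(200, 50))  # 75.0
print(throughput_change(50.0, 47.5))      # -5.0
```

Plotting one metric against the other, as in Figures 4.23 and 4.24, makes the trade-off between fewer retransmissions and lost throughput directly visible.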
Bibliography
[1] Soheil Abbasloo, Chen-Yu Yen, and H. Jonathan Chao. “Classic Meets Modern: A
Pragmatic Learning-Based Congestion Control for the Internet.” In: Proceedings
of the Annual Conference of the ACM Special Interest Group on Data
Communication on the Applications, Technologies, Architectures, and Protocols
for Computer Communication. SIGCOMM ’20. Virtual Event, USA: Association
for Computing Machinery, 2020, pp. 632–647. isbn: 9781450379557. doi: 10.1145/3387514.3405892. url: https://doi.org/10.1145/3387514.3405892.
[2] Davide Andreoletti et al. “Network Traffic Prediction based on Diffusion
Convolutional Recurrent Neural Networks.” In: IEEE INFOCOM 2019 - IEEE
Conference on Computer Communications Workshops (INFOCOM WKSHPS).
2019, pp. 246–251. doi: 10.1109/INFCOMW.2019.8845132.
[3] Bruno Astuto Arouche Nunes et al. “A machine learning framework for TCP
round-trip time estimation.” In: EURASIP Journal on Wireless Communications
and Networking 2014 (2014), pp. 1–22.
[4] Amir F. Atiya et al. “Packet Loss Rate Prediction Using the Sparse Basis
Prediction Model.” In: IEEE Transactions on Neural Networks 18.3 (2007),
pp. 950–954. doi: 10.1109/TNN.2007.891681.
[5] Vedant Bahel, Sofia Pillai, and Manit Malhotra. “A Comparative Study on
Various Binary Classification Algorithms and their Improved Variant for Optimal
Performance.” In: 2020 IEEE Region 10 Symposium (TENSYMP). 2020, pp. 495–
498. doi: 10.1109/TENSYMP50017.2020.9230877.
[6] Mikhail Belkin et al. “Reconciling modern machine-learning practice and the
classical bias–variance trade-off.” In: Proceedings of the National Academy of
Sciences 116.32 (2019), pp. 15849–15854. doi: 10.1073/pnas.1903070116. eprint: https://www.pnas.org/doi/pdf/10.1073/pnas.1903070116. url: https://www.pnas.org/doi/abs/10.1073/pnas.1903070116.
[7] Hanane Benadji, Lynda Zitoune, and Véronique Vèque. “Predictive Modeling of
Loss Ratio for Congestion Control in IoT Networks Using Deep Learning.” In: the
IEEE Global Communications Conference (GLOBECOM). 2023.
[8] Stephen Bensley et al. Data Center TCP (DCTCP): TCP Congestion Control
for Data Centers. RFC 8257. Oct. 2017. doi: 10.17487/RFC8257. url: https://www.rfc-editor.org/info/rfc8257.
[9] Ethan Blanton, Dr. Vern Paxson, and Mark Allman. TCP Congestion Control.
RFC 5681. Sept. 2009. doi: 10.17487/RFC5681. url: https://www.rfc-editor.org/
info/rfc5681.
[10] Vadim Borisov et al. “Deep Neural Networks and Tabular Data: A Survey.” In:
IEEE Transactions on Neural Networks and Learning Systems (2022), pp. 1–21.
doi: 10.1109/TNNLS.2022.3229161.
[11] Neal Cardwell et al. “BBR: Congestion-Based Congestion Control.” In: ACM
Queue 14, September-October (2016), pp. 20–53. url: http://queue.acm.org/detail.cfm?id=3022184.
[12] Selene Cerna Ñahuis et al. “A Comparison of LSTM and XGBoost for Predicting
Firemen Interventions.” In: June 2020, pp. 424–434. isbn: 978-3-030-45690-0. doi:
10.1007/978-3-030-45691-7_39.
[13] Rene Y Choi et al. “Introduction to machine learning, neural networks, and deep
learning.” In: Translational vision science & technology 9.2 (2020), pp. 14–14.
[14] Xu Chu et al. “Data Cleaning: Overview and Emerging Challenges.” In:
Proceedings of the 2016 International Conference on Management of Data.
SIGMOD ’16. San Francisco, California, USA: Association for Computing
Machinery, 2016, pp. 2201–2206. isbn: 9781450335317. doi: 10.1145/2882903.2912574. url: https://doi.org/10.1145/2882903.2912574.
[15] Pádraig Cunningham and Sarah Jane Delany. “Underestimation Bias and
Underfitting in Machine Learning.” In: Trustworthy AI - Integrating Learning,
Optimization and Reasoning. Ed. by Fredrik Heintz, Michela Milano, and Barry
O’Sullivan. Cham: Springer International Publishing, 2021, pp. 20–31. isbn: 978-
3-030-73959-1.
[16] Hercules Dalianis. “Evaluation Metrics and Evaluation.” In: Clinical Text Mining:
Secondary Use of Electronic Patient Records. Cham: Springer International
Publishing, 2018, pp. 45–53. isbn: 978-3-319-78503-5. doi: 10.1007/978-3-319-78503-5_6. url: https://doi.org/10.1007/978-3-319-78503-5_6.
[17] Essam Al Daoud. “Comparison between XGBoost, LightGBM and CatBoost
Using a Home Credit Dataset.” In: International Journal of Computer and
Information Engineering 13.1 (2019), pp. 6–10. issn: eISSN: 1307-6892. url:
https://publications.waset.org/vol/145.
[18] Luis Diez et al. “Can We Exploit Machine Learning to Predict Congestion over
mmWave 5G Channels?” In: Applied Sciences 10.18 (2020). issn: 2076-3417. doi:
10.3390/app10186164. url: https://www.mdpi.com/2076-3417/10/18/6164.
[19] Mo Dong et al. “PCC Vivace: Online-Learning Congestion Control.” In: 15th
USENIX Symposium on Networked Systems Design and Implementation (NSDI
18). Renton, WA: USENIX Association, Apr. 2018, pp. 343–356. isbn: 978-1-
939133-01-4. url: https://www.usenix.org/conference/nsdi18/presentation/dong.
[20] Mo Dong et al. “PCC: Re-architecting Congestion Control for Consistent High
Performance.” In: 12th USENIX Symposium on Networked Systems Design and
Implementation (NSDI 15). Oakland, CA: USENIX Association, May 2015,
pp. 395–408. isbn: 978-1-931971-218. url: https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/dong.
[21] Wesley Eddy. Transmission Control Protocol (TCP). RFC 9293. Aug. 2022. doi:
10.17487/RFC9293. url: https://www.rfc-editor.org/info/rfc9293.
[22] Issam El Naqa and Martin J. Murphy. “What Is Machine Learning?” In: Machine
Learning in Radiation Oncology: Theory and Applications. Ed. by Issam El Naqa,
Ruijiang Li, and Martin J. Murphy. Cham: Springer International Publishing,
2015, pp. 3–11. isbn: 978-3-319-18305-3. doi: 10.1007/978-3-319-18305-3_1. url: https://doi.org/10.1007/978-3-319-18305-3_1.
[23] Carmen Esposito et al. “GHOST: Adjusting the Decision Threshold to Handle
Imbalanced Data in Machine Learning.” In: Journal of Chemical Information
and Modeling 61.6 (2021). PMID: 34100609, pp. 2623–2640. doi: 10.1021/acs.jcim.1c00160. eprint: https://doi.org/10.1021/acs.jcim.1c00160. url: https://doi.org/10.1021/acs.jcim.1c00160.
[24] A Esterhuizen and AE Krzesinski. “TCP congestion control comparison.” In:
SATNAC, September (2012).
[25] Gorry Fairhurst and Michael Welzl. The Benefits of Using Explicit Congestion
Notification (ECN). RFC 8087. Mar. 2017. doi: 10.17487/RFC8087. url: https:
//www.rfc-editor.org/info/rfc8087.
[26] Joyce Fang et al. “Reinforcement learning for bandwidth estimation and
congestion control in real-time communications.” In: CoRR abs/1912.02222
(2019). arXiv: 1912.02222. url: http://arxiv.org/abs/1912.02222.
[27] S. Floyd and V. Jacobson. “Random early detection gateways for congestion
avoidance.” In: IEEE/ACM Transactions on Networking 1.4 (1993), pp. 397–413.
doi: 10.1109/90.251892.
[28] Sally Floyd, Dr. K. K. Ramakrishnan, and David L. Black. The Addition of
Explicit Congestion Notification (ECN) to IP. RFC 3168. Sept. 2001. doi: 10.17487/RFC3168. url: https://www.rfc-editor.org/info/rfc3168.
[29] Jerome H. Friedman. “Greedy Function Approximation: A Gradient Boosting
Machine.” In: The Annals of Statistics 29.5 (2001), pp. 1189–1232. issn: 00905364.
url: http://www.jstor.org/stable/2699986 (visited on 08/16/2023).
[30] Moritz Geist and Benedikt Jaeger. “Overview of TCP congestion control
algorithms.” In: Network 11 (2019).
[31] Anna Giannakou, Dipankar Dwivedi, and Sean Peisert. “A machine learning
approach for packet loss prediction in science flows.” In: Future Generation
Computer Systems 102 (2020), pp. 190–197. issn: 0167-739X. doi: https://doi.org/10.1016/j.future.2019.07.053. url: https://www.sciencedirect.com/science/article/pii/S0167739X19305850.
[32] Isabelle Guyon et al. Feature extraction: foundations and applications. Vol. 207.
Springer, 2008.
[33] Sangtae Ha, Injong Rhee, and Lisong Xu. “CUBIC: A New TCP-Friendly High-
Speed TCP Variant.” In: SIGOPS Oper. Syst. Rev. 42.5 (July 2008), pp. 64–74.
issn: 0163-5980. doi: 10.1145/1400097.1400105. url: https://doi.org/10.1145/1400097.1400105.
[34] Mark A Hall. “Correlation-based feature selection of discrete and numeric class
machine learning.” In: (2000).
[35] Erwin Harahap et al. “A router-based management system for prediction of
network congestion.” In: 2014 IEEE 13th International Workshop on Advanced
Motion Control (AMC). 2014, pp. 398–403. doi: 10.1109/AMC.2014.6823315.
[52] Benoit Liquet, Sarat Moka, and Yoni Nazarathy. The mathematical engineering
of deep learning. 2023.
[53] Marcos Roberto Machado, Salma Karray, and Ivaldo Tributino de Sousa.
“LightGBM: an Effective Decision Tree Gradient Boosting Method to Predict
Customer Loyalty in the Finance Industry.” In: 2019 14th International
Conference on Computer Science & Education (ICCSE). 2019, pp. 1111–1116.
doi: 10.1109/ICCSE.2019.8845529.
[54] Stephen Marsland. Machine learning: an algorithmic perspective. CRC press,
2015.
[55] H.R. Mehrvar and M.R. Soleymani. “Packet loss rate prediction using a
universal indicator of traffic.” In: ICC 2001. IEEE International Conference on
Communications. Conference Record (Cat. No.01CH37240). Vol. 3. 2001, 647–
653 vol.3. doi: 10.1109/ICC.2001.937277.
[56] Mininet. url: http://mininet.org/ (visited on 04/12/2023).
[57] Mininet API. url: http://mininet.org/api/annotated.html (visited on 04/12/2023).
[58] Mininet CLI. url: http://mininet.org/walkthrough/#interact-with-hosts-and-switches
(visited on 04/12/2023).
[59] Mininet Node class. url: http://mininet.org/api/classmininet_1_1node_1_1Node.html (visited on 04/12/2023).
[60] Mininet Node cmd method. url: http://mininet.org/api/classmininet_1_1node_1_1Node.html#a6e1338af3c4a0348963a257ac548153b (visited on 04/12/2023).
[61] Mininet Node config method. url: http://mininet.org/api/classmininet_1_1node_1_1Node.html#ae1c80e11ed708d3f0d3c98acd4299ed4 (visited on 04/12/2023).
[62] ModelInference. url: https://cloud.google.com/bigquery/docs/inference-overview (visited on 10/28/2023).
[63] Woongsoo Na et al. “DL-TCP: Deep Learning-Based Transmission Control
Protocol for Disaster 5G mmWave Networks.” In: IEEE Access 7 (2019),
pp. 145134–145144. doi: 10.1109/ACCESS.2019.2945582.
[64] Vladimir Nasteski. “An overview of the supervised machine learning methods.”
In: HORIZONS.B 4 (Dec. 2017), pp. 51–62. doi: 10.20544/HORIZONS.B.04.1.
17.P05.
[65] Alexey Natekin and Alois Knoll. “Gradient boosting machines, a tutorial.” In:
Frontiers in neurorobotics 7 (2013), p. 21.
[66] Xiaohui Nie et al. “Dynamic TCP Initial Windows and Congestion Control
Schemes Through Reinforcement Learning.” In: IEEE Journal on Selected Areas
in Communications 37.6 (2019), pp. 1231–1247. doi: 10 . 1109 / JSAC . 2019 .
2904350.
[67] pandas. url: https://pandas.pydata.org (visited on 04/13/2023).
[68] pandas.read_csv. url: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html (visited on 04/13/2023).
[69] pandas.DataFrame. url: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html (visited on 04/13/2023).
[107] Kefan Xiao, Shiwen Mao, and Jitendra K. Tugnait. “TCP-Drinc: Smart
Congestion Control Based on Deep Reinforcement Learning.” In: IEEE Access
7 (2019), pp. 11892–11904. doi: 10.1109/ACCESS.2019.2892046.
[108] T. Yamamoto. “Estimation of the advanced TCP/IP algorithms for long distance
collaboration.” In: Fusion Engineering and Design 83.2 (2008). Proceedings of
the 6th IAEA Technical Meeting on Control, Data Acquisition, and Remote
Participation for Fusion Research, pp. 516–519. issn: 0920-3796. doi: https://doi.org/10.1016/j.fusengdes.2007.10.006. url: https://www.sciencedirect.com/science/article/pii/S0920379607005078.
[109] Francis Y. Yan et al. “Pantheon: the training ground for Internet congestion-
control research.” In: 2018 USENIX Annual Technical Conference (USENIX ATC
18). Boston, MA: USENIX Association, July 2018, pp. 731–743. isbn: 978-1-
939133-01-4. url: https://www.usenix.org/conference/atc18/presentation/yan-francis.
[110] Ticao Zhang and Shiwen Mao. “Machine Learning for End-to-End Congestion
Control.” In: IEEE Communications Magazine 58.6 (2020), pp. 52–57. doi: 10.
1109/MCOM.001.1900509.
[111] Zhi-Hua Zhou. Machine learning. Springer Nature, 2021.
[112] Quan Zou et al. “Finding the Best Classification Threshold in Imbalanced
Classification.” In: Big Data Research 5 (2016). Big data analytics and
applications, pp. 2–8. issn: 2214-5796. doi: https://doi.org/10.1016/j.bdr.2015.12.001. url: https://www.sciencedirect.com/science/article/pii/S2214579615000611.