

Improving the Transfer of Machine Learning-Based Video QoE Estimation Across Diverse Networks

Michael Seufert, Member, IEEE, and Irena Orsolic

Abstract—With video streaming traffic generally being encrypted end-to-end, there is a lot of interest from network operators to find novel ways to evaluate streaming performance at the application layer. Machine learning (ML) has been extensively used to develop solutions that infer application-level Key Performance Indicators (KPI) and/or Quality of Experience (QoE) from the patterns in encrypted traffic. Having such insights provides the means for more user-centric traffic management and enables the mitigation of QoE degradations, thus potentially preventing customer churn. The ML-based QoE/KPI estimation solutions proposed in the literature are typically trained on a limited set of network scenarios, and it is often unclear how the obtained models perform if applied in a previously unseen setting (e.g., if the model is applied at the premises of a different network operator). In this paper, we address this gap by cross-evaluating the performance of QoE/KPI estimation models trained on 4 separate datasets generated from streaming 48000 video sessions. The paper evaluates a set of methods for improving the performance of models when applied in a different network. The analyzed methods require no or considerably less application-level ground-truth data collected in the new setting, thus significantly reducing the extensiveness of required data collection.

Index Terms—Video streaming, traffic encryption, machine learning.

Manuscript received 22 May 2023; revised 2 September 2023; accepted 24 September 2023. Date of publication 23 October 2023; date of current version 12 July 2024. This work was partly funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under grant SE 3163/3-1, project number 500105691 (UserNet), and partly funded by the Croatian Science Foundation under the project IP-2019-04-9793 (Q-MERSIVE). The associate editor coordinating the review of this article and approving it for publication was D. Puthal. (Corresponding author: Michael Seufert.)
Michael Seufert was with the Chair of Communication Networks, University of Würzburg, 97070 Würzburg, Germany. He is now with the Chair of Networked Embedded Systems and Communication Systems, University of Augsburg, 86159 Augsburg, Germany (e-mail: [email protected]).
Irena Orsolic was with the Faculty of Electrical Engineering and Computing, University of Zagreb, 10000 Zagreb, Croatia. She is now with Ericsson AB, Stockholm, Sweden (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNSM.2023.3326664
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).

I. INTRODUCTION

Video streaming services are held accountable for the largest portion of globally generated network traffic. According to the Ericsson Mobility Report from Nov. 2021 [1], video traffic is estimated to account for 69 percent of all mobile data traffic, and its share is expected to increase to 79 percent by 2027. During the same time, total global mobile data traffic is expected to grow 4.4 times. Additional strain was put on networks amid the COVID-19 pandemic. Dramatic surges in network traffic have been observed corresponding to Web conferencing, video-on-demand, and gaming services [2], [3], with billions of people opting for or being forced into using online meetings, education, entertainment, etc.

In the context of such immense amounts of generated network traffic, it is becoming more and more important to manage both the services and the networks in a way to efficiently use the available resources while keeping the customers satisfied. This requires an understanding of how measurable network- and application-level parameters influence end-users' Quality of Experience (QoE), and thereupon invoking QoE-aware service/network management mechanisms [4], [5]. A special research focus in this direction has been put on video streaming services, due to their impact on the global network traffic. Aiming to meet the users' expectations, streaming services may employ various quality adaptation strategies (e.g., adaptive streaming in compliance with MPEG-DASH [6]), utilize increasingly efficient compression techniques [7], etc. Management mechanisms from the network perspective may include, for example, QoE-aware resource (re)allocation and (re)routing [8]. While the impact of application-level parameters on QoE is described in the literature and embodied in existing QoE models [9], [10], the relationship between network-level metrics and QoE is far more complex to unveil, particularly given the widespread use of encryption.

Motivated by the high interest from the industry, plenty of research efforts have been put into describing the relationship between network-level metrics and video streaming performance [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21]. In these studies, streaming performance is commonly expressed in terms of QoE and/or application-level key performance indicators (KPI) such as startup delay, video encoding bitrate, and video resolution. Most of the aforementioned studies exploit methods from the domain of machine learning (ML) in order to map the network traffic patterns to quality degradations perceived by users. While the proposed approaches have proven to perform well on individual use-cases focused on a particular streaming service and when applied on a dataset collected in one particular experiment setup, it is often unclear how well they generalize across various use-cases. In this paper we go beyond related work by investigating the applicability of such ML models across different network setups and over longer periods of time.

The goals of this study are 1) to investigate how well QoE/KPI estimation models perform when applied on data collected in a different network setting and in a different time period, and 2) to propose potential solutions for adapting the models in order to perform well in new setups and over extended periods of time.


The paper evaluates various data science methods for detecting and eliminating dissimilarities in data from different sources, with the aim to reduce the extensiveness of data collection while developing and maintaining accurate QoE/KPI estimation solutions. We believe that this research is of high interest to companies developing user experience analytics solutions based on ML, as the presented analysis and methods may help them reduce the time to customize the solution for a specific network as well as reduce maintenance costs.

The contributions of the paper can be summarized as follows:

1) Large Video on Demand Streaming Datasets: During this study, four datasets were collected using the measurement setups described in Section III-A. The datasets were collected during the period from July 2020 to August 2021 and differ in the location where the measurement campaign was run (Würzburg, Zagreb) as well as in the type of the access network (Ethernet, Wi-Fi). All datasets include data corresponding to the same set of 2000 distinct videos being streamed to a laptop, with and without using an ad-blocking plugin, under 3 different bandwidth constraints. This results in 48000 streamed video sessions. The description of the dataset is given in Section IV and we will publish the processed dataset (ready for ML analysis) online.¹

2) Dataset Analysis: We analyzed and compared the datasets in terms of both application-level data (Section IV-A) and network traffic features. In Section IV-B we analyze network traffic features on a per-video level, while in Section IV-C network traffic features are analyzed on a per-second level. We note that in this paper we consider session-level (per-video) classification only, but both types of datasets are made publicly available, and thus described.

3) Session-Level QoE/KPI Estimation Models: We trained numerous models for estimating QoE, startup delay, video resolution, video bitrate, and rebuffering occurrence. For this, we focus on shallow learning only, which proved to work well for QoE/KPI estimation in related works. Moreover, since QoE datasets are typically small due to the expensive dataset creation, the added value and utility of training deep learning models might be very small, such that we do not consider deep learning in this work. Using a systematic approach, we compare i) the performance of network-specific models (trained and tested on a single dataset), ii) the performance of general models (trained and tested on all datasets merged), and iii) the cross-applicability performance of network-specific models (trained on a single dataset and tested on all other datasets separately). The models are trained and their performance is analyzed for session-level (per-video) QoE/KPI estimation (Section VI).

4) Analysis of Methods for Improving the Model Cross-Applicability: In order to reduce the exhaustiveness of data collection for model training purposes, it would be valuable if once-trained models could be reused for other use-cases (e.g., different networks/locations) besides the one they were specifically trained for. Thus, we aim to investigate potential solutions for improving the cross-applicability of the models (Section V-C). We apply and evaluate methods based on scaling, decomposition, manifold learning, ML-based feature representation transfer, drift elimination, and enrichment (Section VI).

The initial results of the study were published in [22], where we trained and tested network-specific, cross-network, and general models for session-level QoE/KPI classification on a smaller sample of around 5000 videos collected in 2020. This paper presents a much more comprehensive analysis of cross-network model applicability and extends the initial study by using a 10 times larger and more diverse dataset, giving a deeper insight into the differences in data across measurement scenarios. This paper builds on top of the previous work by assessing various additional methods for improving the model cross-applicability.

The paper is organized as follows. Section II outlines background and related work on ML-based in-network QoE/KPI estimation models, the applicability of such models across different usage settings, and the potential for model adaptation and transfer. Section III describes the measurement setup, data collection campaigns, and data processing. The dataset is then portrayed and analyzed in Section IV. Section V presents the modeling procedure and methods for improving model transfer. Results are shown and discussed in Section VI, followed by a conclusion and outlook in Section VII.

¹https://urn.nsk.hr/urn:nbn:hr:168:227338

II. BACKGROUND AND RELATED WORK

A. In-Network Video Streaming QoE/KPI Estimation

The idea of applying ML for estimating QoE and video streaming KPIs in the network was suggested in [11]. The proposed Prometheus approach outperformed existing solutions that required control over the app services and domain expertise. Soon thereafter, many of the popular video streaming services started introducing encryption to their flows, thus making some of the network traffic features needed as input inaccessible. Hence, the study in [15] relied exclusively on features obtainable from the encrypted traffic, but the application-level ground truth was still derived from non-encrypted flows captured at a Web proxy. Building on top of these ideas, the studies published in [23], [24] resulted in approaches fully applicable in the context of encrypted traffic, i.e., not needing access to packet payloads at any phase of the model training. These studies have proven the feasibility of classifying YouTube video streams into QoE classes in a per-video (session-level) manner based only on the statistical properties of the encrypted traffic volume.

In parallel with the further development of session-level QoE/KPI estimation solutions [25], [26], a number of real-time KPI estimation approaches were proposed [13], [16], [18], [19], [20], [27], [28], [29]. Such solutions, as opposed to session-level ones, might be more appropriate for network operators looking to dynamically manage QoE and optimize resource allocation. Session-level approaches, on the other hand, provide per-video session metrics and may be more appropriate for network planning purposes.

Focusing on the deployment of these ML-based models in 5G networks, in [30], [31] the authors present their simulations used to assess such models when embedded in the NWDAF (Network Data Analytics Function), an analytics entity in 5G with machine learning capabilities [32], [33].

Another interesting research avenue on QoE monitoring for HTTP adaptive streaming (HAS) is the inclusion of user behavior and its impact on traffic patterns and, consequently, on model performance. In [21], [34], the authors explore this important issue, given its relation to deploying robust QoE monitoring solutions in the network.

B. Applicability of QoE/KPI Estimation Models Across Different Scenarios

Related work has barely scratched the surface of the problem of inherent dimensionality originating from the variety of possible video streaming usage scenarios. The cross-testing efforts described in [25], [35] were focused on the applicability of YouTube QoE/KPI classification models trained in a lab setting on data collected in an operational mobile network. Similarly, models trained on data collected on the Android platform were cross-tested with data collected on iOS [25]. The paper reports limited cross-applicability, with models demonstrating a decrease in performance. On the other hand, the performance of general models trained on a dataset containing samples from both platforms is comparable to that of models trained for a specific platform [20]. Similar conclusions, but focused on different services rather than platforms, have been found in [18]. The authors show that developing well-performing general models is feasible if the training set includes data from all services. Applying a model trained on Amazon, YouTube, and Twitch data to Netflix data resulted in a significant drop in model performance. Regarding generalization efforts, interesting approaches can be found in [36], [37], where the authors investigate challenges related to model sharing and a transfer-learning approach which allows local models to learn a generic base model for MOS and then consider additional features for location-specific QoE models. However, both approaches rely on application-level KPIs and do not consider estimating QoE from encrypted network traffic.

C. Adaptation and Transfer of QoE/KPI Estimation Models

In real-life applications, the data used as input for prediction models often changes over time, making the performance of the models degrade as newly generated data is presented to them. Similarly, there might be slight differences between the same type of data generated by different sources. In general, this phenomenon is known as dataset shift or dataset drift [38], [39]. Dataset shift can be divided into three categories: 1) covariate shift (a shift in the independent variables), 2) prior probability shift (a shift in the target variable), and 3) concept drift (the target concept depends on hidden contexts that are not explicitly provided to the learning algorithm) [40], [41]. Depending on the domain, concept drift can appear suddenly (sudden, abrupt, instantaneous concept drift), appear gradually (gradual concept drift), or temporarily disappear and reappear (reoccurring concept drift) [42], [43], [44].

The strategies for handling the drift differ among these types. Covariate shift can be detected using ML, and the drifting features can be excluded from model training later on. This is done in the Drift elimination method applied in this paper. For sudden drifts, a common approach is to re-train the model on instances captured both before and after the sudden drift occurrence [38], [45]. This principle is adopted in another method for improving transfer proposed in this paper: Enrichment. Aiming to reduce the drift between the datasets, we also test various feature transformation and dimensionality reduction techniques, including Scaling, Decomposition, Manifold learning [46], [47], and ML-based feature representation transfer. These methods are described in more detail in Section V-C.

III. DATASET PREPARATION

A. Measurement Setup

The measurements were conducted using a Java-based framework similar to [16], [48]. The measurement framework is able to automatically start a Chrome browser using the Selenium browser automation tool.² The browser was configured to log all HTTP requests to a file (--log-net-log) and QUIC traffic was enabled (--enable-quic). Optionally, the browser could also load and install a Chrome extension during startup. For a single measurement run, the browser creates a new and isolated browsing session independent of browsing history or previously stored session or user data (e.g., cookies), and accesses the video streaming service main page. After the page has fully loaded and occasional pop-ups have been handled, the framework spawns a separate thread, which captures the network traffic using tshark.³ Next, the browser accesses a single video page and injects a JavaScript-based monitoring script [49], [50] into the webpage, which periodically polls the current timestamp, the current video playtime, buffered playtime, video resolution, and player state every 250 ms. The video is then streamed for 180 s or until the video ends, and the application-layer information about the streaming session is logged to a file. Afterwards, the framework closes the browser and terminates the network traffic capture, before a new measurement run can be started.

²https://www.selenium.dev/
³https://www.wireshark.org/docs/man-pages/tshark.html
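For illustration, the browser-automation part of such a run could look as follows in Python (a minimal sketch, not the authors' Java framework; the Chrome flags match the text, while the function name, paths, and the omitted polling logic are placeholders):

```python
# Sketch of one measurement run: Chrome via Selenium with net-log and
# QUIC flags, plus a tshark capture process. Paths/URL are placeholders.
import subprocess
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def run_measurement(video_url: str, net_log: str, pcap: str) -> None:
    opts = Options()
    opts.add_argument(f"--log-net-log={net_log}")  # log all HTTP requests
    opts.add_argument("--enable-quic")             # allow QUIC traffic
    driver = webdriver.Chrome(options=opts)        # fresh, isolated session
    capture = subprocess.Popen(["tshark", "-w", pcap])  # traffic capture
    try:
        driver.get(video_url)
        # ... inject the JavaScript monitoring script and poll the player
        # state every 250 ms for up to 180 s (omitted in this sketch) ...
    finally:
        capture.terminate()
        driver.quit()
```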
A list of 2000 videos was selected according to the popularity of the video content, such that the full range of video popularity, ranging from below 100 views to over billions of views, was represented in the list. The measurements were conducted in a high-speed optical fiber campus network at the University of Würzburg, Germany, in a cable broadband network of an ISP in Würzburg, Germany, and in a cable broadband network of an ISP in Zagreb, Croatia. In all locations, the framework was installed on a laptop and a Raspberry Pi 4 was used as a bridge to connect the laptop to the network. The Raspberry Pi acted as a network emulator and was able to limit the bandwidth using Linux traffic control (tc). Three different network conditions were emulated in both locations, namely, no limitation, a fixed limitation of 1 Mbps, as well as a stochastic limitation following an exponential distribution with a mean of 1 Mbps.
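For illustration, the fixed 1 Mbps limitation could be applied with Linux traffic control roughly as follows (a sketch under assumptions: the interface name and the token bucket filter parameters are illustrative, not the testbed's actual configuration; the stochastic limitation is not shown):

```python
# Sketch: apply/remove a fixed 1 Mbps limit via tc on the bridge.
# "eth0" and the tbf parameters are illustrative assumptions.
import subprocess

def limit_bandwidth(iface: str = "eth0", rate: str = "1mbit") -> None:
    # token bucket filter as a simple fixed-rate limiter
    subprocess.run(["tc", "qdisc", "add", "dev", iface, "root", "tbf",
                    "rate", rate, "burst", "32kbit", "latency", "400ms"],
                   check=True)

def clear_limit(iface: str = "eth0") -> None:
    subprocess.run(["tc", "qdisc", "del", "dev", iface, "root"],
                   check=True)
```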

The whole list of 2000 videos was measured both without and with an ad-blocking Chrome extension, in Würzburg and Zagreb, in all three network conditions, both in 2020 and 2021, which results in a dataset of roughly 48000 streamed video sessions. The measurement runs were conducted over five months in 2020 (Jul. – Nov.) and over three months in 2021 (Apr. – Jun.). A summary of the datasets is given in Table I.

[Table I: Summary of collected datasets.]

B. Dataset Preprocessing

For each measurement run denoted with year (2020, 2021), location (Wue, Zag), bandwidth limitation setting (unlimited, 1 Mbps, stochastic), and ad-blocking plugin status (on, off), the raw datasets contain HTTP logs, measurement logs (application-level events), and network traffic traces. From these logs, we generate two datasets that can later on be used to train the QoE/KPI estimation models. We refer to these datasets as the session-level ML dataset and the real-time ML dataset, as opposed to the term raw datasets, which we use to describe the initial logs. This is summarized in Figure 1.

[Fig. 1. The collected raw datasets are processed into two ML datasets that can be used for 1) session-level QoE/KPI estimation and 2) real-time KPI estimation.]

Both datasets are .csv files with rows containing network traffic features (statistical properties of the encrypted traffic) and QoE/KPI labels. While in the session-level ML dataset each row represents a single video with features and labels describing the whole video session, in the real-time ML dataset a row represents one second of a video streaming session. These two ML datasets are publicly accessible online⁴ and are briefly described in the continuation of the paper. Due to paper length limitations and limitations with regards to the computation environment, in this study we train models using the session-level variant of the dataset only.

⁴https://urn.nsk.hr/urn:nbn:hr:168:227338

IV. DATASET CHARACTERISTICS

Out of the scheduled 48000 video streams, 43207 were played, i.e., not skipped due to a video being deleted or its availability settings changing over the course of the measurement campaigns. Out of these videos, for further analysis we used: 8833 videos from Wue_2020, 9410 from Zag_2020, 5310 from Wue_2021, and 6640 from Zag_2021. The videos were selected based on log consistency criteria in order to exclude incomplete logs and measurement errors. The ML datasets that we provide to the research community are filtered based on these criteria.

A. Application-Level Data

We observe in Figure 2(a) that measurement durations follow similar distributions in all datasets. The only notable difference is an offset of around 20 s for 10-12% of the videos, which could be attributed to a change of advertisement strategy between 2020 and 2021. In Figure 2(b) it can be seen that the bandwidth limitations were applied as configured, as roughly two thirds of the sessions have an average downlink bandwidth of at most 1 Mbps with overlapping CDFs. Regarding the sessions streamed with unlimited bandwidth, it can be seen that Wue_2021 had slightly higher average bandwidths than the other three datasets. However, the highest average bandwidth was observed in the Wue_2020 data with 36.84 Mbps, compared to 10.61 Mbps for Wue_2021, 4.37 Mbps for Zag_2021, and 4.33 Mbps for Zag_2020. When comparing the ratio of active download time and session duration in Figure 2(c), we see a clear difference between the 2020 and 2021 measurements, which indicates changes in the adaptive bitrate logic and the resulting download behavior of the streaming service. Nevertheless, within each year, the distribution of active download ratios is similar across both locations.

We now look in more detail into the application-layer KPIs among the four collected datasets. While most curves have similar shapes across datasets, an interesting observation is that average video resolutions were often higher in Wue_2021 as compared to the other datasets (Figure 2(e)), consequently resulting in longer startup delays (Figure 2(d)) and higher average bitrates (Figure 2(f)). With the exception of Wue_2021, which has a large portion of videos played in 1080p, the most commonly observed resolutions were 480p and 720p (Figure 2(e)). Most of the average video encoding bitrates are under 1000 kbps, and the values very rarely exceed 3000 kbps (Figure 2(f)). Stalling was noted in approximately 40% of the played videos (Figure 2(h)), rarely occurring multiple times in a single video, and half of the videos with stalling events had a total stalling time of less than 2 seconds (Figure 2(i)). MOS values in all datasets cover a range from roughly 3.0 to 4.7, with half of the values above 4.5 (Figure 2(j)).

[Fig. 2. Distributions of KPIs/QoE across datasets.]

B. Session-Level Network Traffic Features

We aggregated the packet-level traces (consisting of timestamp, src/dst IP address, src/dst port, protocol type, and packet size) of each video session into a set of 109 features, which characterize the traffic using statistical descriptors, see Table II. These features were selected based on domain knowledge and related works, e.g., [27], and include session duration, as well as packet count, count of packets greater than 100 bytes, volume, and average throughput for both uplink and downlink traffic. We also add the active download ratio and the average downlink throughput in the first 1/5/10 seconds. Moreover, we consider the distribution of packet size, packet size of packets greater than 100 bytes, packet inter-arrival time, and data volume in 0.1/1/10-second time slots for both uplink and downlink traffic. From these distributions, we compute the mean, variance, standard deviation, coefficient of variation, skewness, kurtosis, minimum, and maximum, and also add these descriptors to the set of features.

[Table II: Overview of recorded data from the stream of encrypted packets and the derived features.]
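A condensed sketch of how such session-level descriptors can be derived from a packet trace (pandas-based; the column and feature names are illustrative and do not reproduce the published 109-feature schema):

```python
# Sketch: derive per-session descriptors from a packet trace loaded as a
# DataFrame with columns ts (s), direction ("up"/"down"), size (bytes).
import pandas as pd

def session_features(pkts: pd.DataFrame) -> dict:
    feats = {"duration": pkts.ts.max() - pkts.ts.min()}
    for d in ("up", "down"):
        df = pkts[pkts.direction == d]
        sizes = df["size"]
        iat = df.ts.sort_values().diff().dropna()  # inter-arrival times
        feats[f"{d}_pkts"] = len(df)
        feats[f"{d}_pkts_gt100B"] = int((sizes > 100).sum())
        feats[f"{d}_volume"] = sizes.sum()
        feats[f"{d}_throughput"] = sizes.sum() / max(feats["duration"], 1e-9)
        for name, dist in (("size", sizes), ("iat", iat)):
            feats[f"{d}_{name}_mean"] = dist.mean()
            feats[f"{d}_{name}_std"] = dist.std()
            feats[f"{d}_{name}_skew"] = dist.skew()
            feats[f"{d}_{name}_kurt"] = dist.kurtosis()
            feats[f"{d}_{name}_min"] = dist.min()
            feats[f"{d}_{name}_max"] = dist.max()
    return feats
```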
The distribution of each feature was compared between all six pairs of datasets using the univariate Wald-Wolfowitz runs test and the univariate Kolmogorov-Smirnov (KS) test. We found that at most 14% (WW runs test) or 6% (KS test), respectively, of the columns do not show a significant difference with respect to the 5% significance level, and thus can be considered similar. These findings persist even when scaling the data to standard scores, both when using a common scaling for both datasets and when using individual scaling per dataset. We additionally conducted the multivariate Friedman-Rafsky runs test and the multivariate Kolmogorov-Smirnov test according to [51] over all features. The tests were conducted on a 5% random sample of the datasets due to the computational complexity of the multivariate tests. Here, when using Euclidean pairwise distance, we find high p-values for some pairs, namely, for Wue_2020 and Wue_2021 (p = 0.60), Wue_2020 and Zag_2021 (p = 0.34), Zag_2020 and Wue_2021 (p = 0.57), Zag_2020 and Zag_2021 (p = 0.65), and Wue_2021 and Zag_2021 (p = 0.09). This suggests that these pairs of datasets are not too different from a statistical point of view. However, considering Mahalanobis distance or applying standard scaling minimizes the p-values for all pairs to below 0.01, and thus confirms that the session-level features become highly different when streaming videos from a different network. This can cause problems when applying ML methods trained on one dataset to another dataset, as we will showcase below. Thus, it is required to research and apply mechanisms which can improve the model transfer.
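The univariate comparison can be reproduced with standard tooling; a minimal sketch of the per-feature two-sample KS test (src and dst are hypothetical feature DataFrames for one dataset pair):

```python
# Sketch: two-sample Kolmogorov-Smirnov test per feature column for a
# pair of datasets; src/dst are hypothetical feature DataFrames.
from scipy.stats import ks_2samp

def similar_columns(src, dst, alpha=0.05):
    # columns with no significant distributional difference at level alpha
    return [c for c in src.columns
            if ks_2samp(src[c], dst[c]).pvalue >= alpha]
```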

Considering the labels, the datasets contain all KPIs in Figure 2(d)-(j), namely, startup delay, average resolution, average video bitrate, number of quality changes, number of stalling events, total stalling time, and MOS as estimated by the standardized QoE model from ITU-T Recomm. P.1203 [9]. We additionally apply a simple binary classification into high (≥ 700p) or low average resolution, existing (true) or non-existing (false) stalling, short (< 5 s) or long startup delay, and high (≥ 500 kbps) or low average bitrate, according to the thresholds listed in Table III. The resulting distribution of samples across datasets and classes is depicted in Figure 3. It can be seen that all classes have a substantial amount of instances, with a minimum of 1861 instances.

[Table III: Overview of classification targets per QoE metric and respective split conditions. *MOS and startup delay are only considered in session-level estimation.]

[Fig. 3. Distribution of samples across datasets and classes for session-level QoE/KPI classification.]
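The mapping from KPIs to these binary targets is a direct thresholding; a minimal sketch (threshold values from Table III, column names illustrative):

```python
# Sketch: derive the binary classification targets of Table III from the
# application-level KPIs; column names are illustrative.
import pandas as pd

def binary_targets(kpis: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "high_resolution": kpis["avg_resolution"] >= 700,  # >= 700p
        "stalling":        kpis["num_stalling"] > 0,       # true/false
        "long_startup":    kpis["startup_delay"] >= 5.0,   # short is < 5 s
        "high_bitrate":    kpis["avg_bitrate"] >= 500.0,   # >= 500 kbps
    })
```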
C. Real-Time Network Traffic Features

The real-time network traffic features are based on the same raw data as the session-level features. However, here, the packet-level traces are split into small time slots of 1 second, which results in a dataset containing more than 5 million time slots. For each time slot, the traffic is described with a set of features and labeled with the corresponding application-level KPIs, namely, the current resolution and average bitrate of the played-out representation and whether the video is currently stalling or not. When we apply binary classification using the same thresholds as above (Table III), this results in the distribution of samples across datasets and classes depicted in Figure 4. Having these features and labels with one-second granularity allows to train ML models which can estimate the video streaming KPIs every second, and thus allows for a fine-granular real-time tracing of the QoE.

[Fig. 4. Distribution of samples across datasets and classes for real-time KPI classification.]

The feature set is the same set that was used in [16], [17]. First, we compute packet count (total, uplink, downlink), traffic volume (total, uplink, downlink), the uplink and downlink ratio of packet count and traffic volume, and the number, volume, and ratio of TCP and UDP packets. In addition, we consider the time from time slot start until the first uplink and downlink packet and the time from the last uplink and downlink packet until time slot end, which give the burst duration (total, uplink, downlink) within the time slot. We compute the time slot throughput (total, uplink, downlink), burst throughput (total, uplink, downlink), as well as distributional statistics (mean, variance, standard deviation, coefficient of variation, skewness, kurtosis, minimum, maximum) for both the packet size and inter-arrival time distributions for uplink and downlink traffic. Finally, we compute the slope and intercept of the regression line which fits the cumulative upload and download volume and the cumulative uplink and downlink inter-arrival times within the time slot. This results in a total of 69 features per time slot.

Additionally, we compute features to track trends as well as the overall state of the ongoing session. The trend features use a sliding window of length 3 seconds, i.e., three consecutive time slots. The time slots are considered as a single trend macro time slot of 3 seconds for which the same 69 features are computed as for the micro time slot of 1 second. In a similar fashion, we compute the features of a session macro time slot, which is based on all past time slots including the current one. This results in a total of 208 features (time slot number + 69 time slot features + 69 trend features + 69 session features).
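As an illustration of this feature layering, one time slot's row could be assembled as follows (compute_slot_features and merge are hypothetical helpers standing in for the 69-feature extraction over a set of packets):

```python
# Sketch of the micro/trend/session feature layering per 1-second slot.
# compute_slot_features and merge are hypothetical helpers standing in
# for the 69-feature extraction over a collection of packets.
def realtime_row(slots: list, t: int) -> list:
    micro = compute_slot_features(slots[t])                           # 1 s
    trend = compute_slot_features(merge(slots[max(0, t - 2):t + 1]))  # 3 s
    session = compute_slot_features(merge(slots[:t + 1]))  # all past slots
    return [t] + micro + trend + session  # 1 + 3 * 69 = 208 features
```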
V. METHODOLOGY FOR TESTING CROSS-APPLICABILITY

The methodology for testing the cross-applicability of models is summarized in Fig. 5, while the following subsections provide more detailed information about the depicted steps.

[Fig. 5. Methodology for testing model cross-applicability, repeated for each combination of dataset and KPI.]

A. Model Selection

To find the best model and hyperparameters, we follow a two-stage approach using scikit-learn. First, we do a broad study on different model types and very few hyperparameter combinations. For this, we focus on each dataset individually and additionally filter out all videos which contain an advertisement. We compare the performance of different classifiers for the binary classes described above. We undersample the classes to obtain a balanced dataset and avoid any preprocessing of the features, such as scaling. We perform an 80:20 training/test split, apply 3-fold cross-validation on the training set, and compare the performance of Gaussian Naive Bayes, Stochastic Gradient Descent, k-Nearest Neighbors, Decision Tree, Random Forest, Extremely Randomized Trees, Gradient Tree Boosting, XGBoost, and Support Vector Machine classifiers. Similarly, we study the performance for continuous P.1203 MOS estimation on the session-level datasets using Bayesian Ridge, Stochastic Gradient Descent, k-Nearest Neighbors, Decision Tree, Random Forest, Extremely Randomized Trees, Gradient Tree Boosting, XGBoost, and Support Vector Machine regressors.

Our results confirmed earlier findings, e.g., [17], that tree-based models provide the best trade-off between estimation accuracy and training and inference speed. In the study, we compared the obtained F1 scores for all algorithms. The results have shown that Random Forest either outperformed all other algorithms or shared the top result (when the F1 score is rounded to 2 decimals) with one of the following: Support Vector Machine, XGBoost, Gradient Tree Boosting. For consistency purposes and to reduce the number of test iterations, we decided to test the subsequent steps of the methodology using Random Forest only. We also note that some algorithms, e.g., Support Vector Machine, gave results comparable to Random Forest, but are computationally much more intensive, additionally motivating not considering them further. Thus, we focus on Random Forest models for the rest of this work.

As a second step of the hyperparameter study, we then focus exclusively on Random Forest and explore a larger set of hyperparameter combinations for that algorithm. This includes different maximum tree depth values of 3, 5, 8, 10, or None (no limit), and different numbers of trees, ranging from 20 to 200 in steps of 20. Additionally, we allow feature selection using the SelectKBest algorithm, selecting either all features or the features with the k highest scores. Here, parameter k ranges from 5 to 50 in steps of 5, and scores are computed either based on mutual information or F-score.
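This second stage maps naturally onto a scikit-learn grid search; a minimal sketch with the stated grid (the pipeline structure is our own assumption, the parameter values follow the text):

```python
# Sketch: second-stage hyperparameter study for Random Forest with
# optional SelectKBest feature selection, as described above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (SelectKBest, f_classif,
                                       mutual_info_classif)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("select", SelectKBest()),
                 ("rf", RandomForestClassifier())])
param_grid = {
    "select__k": list(range(5, 55, 5)) + ["all"],
    "select__score_func": [mutual_info_classif, f_classif],
    "rf__max_depth": [3, 5, 8, 10, None],
    "rf__n_estimators": list(range(20, 220, 20)),
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
# search.fit(X_train, y_train)  # balanced training split of one dataset
```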
B. Baseline Establishment

Using Random Forest with the maximum tree depth and number of trees determined in the hyperparameter study, we evaluate the performance in the following cases:

• Network-specific evaluation – Each model is trained on a subset (66%) of the dataset and tested on the rest (34%) of the same dataset. This results in 20 models, estimating 5 QoE metrics for 4 datasets.
• Cross-testing – Each of the 20 network-specific models (trained as defined above) is tested on the remaining 3 datasets.
• General model evaluation – Models are trained on a subset (66%) of the merged datasets (containing samples from all 4 datasets) and tested on the rest of the merged dataset (34%).

These results are then used as a baseline for assessing the methods for improving the transfer. Our expectation is that our methods for improving transfer would yield model performance between the one achieved with cross-testing and network-specific evaluation. The goal of the transfer improvement is to get as close as possible to the results obtained with network-specific and general models without needing all the data that was used for the baseline models.
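A condensed sketch of the network-specific and cross-testing cases (datasets is a hypothetical mapping from dataset name to feature matrix and labels; make_rf builds the tuned Random Forest):

```python
# Sketch: network-specific training plus cross-testing. `datasets` is a
# hypothetical dict {name: (X, y)}; make_rf builds the tuned model.
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate(datasets, make_rf):
    scores = {}
    for name, (X, y) in datasets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34)
        model = make_rf().fit(X_tr, y_tr)
        # network-specific evaluation on the held-out 34%
        scores[(name, name)] = f1_score(y_te, model.predict(X_te))
        # cross-testing on every other dataset
        for other, (Xo, yo) in datasets.items():
            if other != name:
                scores[(name, other)] = f1_score(yo, model.predict(Xo))
    return scores
```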
C. Improving Transfer

For the remainder of this work, we will refer to the dataset on which a model was originally trained as the source dataset or source domain. We will consider the other dataset, on which the trained model is tested, as the transfer dataset or transfer domain. Aiming to improve the model transfer in comparison to the results obtained via cross-testing, we identify and assess methods based on scaling, decomposition, manifold learning, ML-based feature representation transfer, elimination of drifting features, and enrichment of the source dataset with data from the transfer domain. This section describes the used methods.

Scaling. We apply min-max scaling and standard scaling to each feature individually based on three modes. First, for a pair of datasets (source and transfer), we train the scalers only on the source dataset and apply them to both source and transfer datasets (source (S)). Second, we train and apply two scalers independently of each other, one on the source dataset and one on the transfer dataset (source & transfer (S&T)). Finally, we add 10% of the transfer dataset to the source dataset only for training a single scaler. Afterwards, the trained scaler is applied to both original source and transfer datasets (source merged (SM)).
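The three modes differ only in which data the scaler sees during fitting; a minimal sketch with standard scaling (min-max scaling is analogous):

```python
# Sketch of the three scaler-fitting modes; standard scaling shown,
# min-max scaling is analogous. Xs/Xt are source/transfer feature arrays.
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_source(Xs, Xt):            # source (S)
    sc = StandardScaler().fit(Xs)
    return sc.transform(Xs), sc.transform(Xt)

def scale_source_transfer(Xs, Xt):   # source & transfer (S&T)
    return (StandardScaler().fit_transform(Xs),
            StandardScaler().fit_transform(Xt))

def scale_source_merged(Xs, Xt, frac=0.1, seed=0):  # source merged (SM)
    rng = np.random.default_rng(seed)
    sample = Xt[rng.choice(len(Xt), int(frac * len(Xt)), replace=False)]
    sc = StandardScaler().fit(np.vstack([Xs, sample]))
    return sc.transform(Xs), sc.transform(Xt)
```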
Decomposition. We use the decomposition-based methods Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA), which allow to reduce the dimensionality of datasets by concentrating most of the dataset's variance in fewer dimensions using a linear transformation. For PCA, we linearly transform the datasets keeping all components, i.e., without reducing dimensionality. Note that the subsequent feature selection can act as dimensionality reduction in this case. CCA is a supervised approach requiring labels. It reduces the datasets to a single component, i.e., a single dimension. For both PCA and CCA, we again apply all three modes: source, source & transfer, and source merged.

Manifold learning. We apply manifold learning, which is a non-linear dimensionality reduction technique, in particular Locally Linear Embedding (LLE) and Isomap. While LLE computes a lower-dimensional projection of the dataset that preserves distances within local neighborhoods, Isomap computes a lower-dimensional embedding which considers distances between all data points. For both LLE and Isomap, we again apply all three modes: source, source & transfer, and source merged. Additionally, we implemented semi-supervised manifold alignment (SSMA) [46], [47] in mode source & transfer, which considers the distances of pairs of corresponding instances between the source and transfer datasets. We construct these pairs from the video sessions in both datasets which streamed the same video under the same bandwidth limitation, and compute their distance according to their application-layer KPIs, namely initial delay, number of stalling events, total stalling time, number of quality changes, average quality, and average bitrate. While the embedding is learned from the corresponding pairs, it can then be computed also for all other data points.
ML-based feature representation transfer. We implemented an ML-based feature representation transfer (MLFRT) in mode source & transfer, which again considers the same pairs of corresponding instances between the source and transfer datasets as above. It trains a multi-output Random Forest regressor to transform the features from the transfer dataset into the corresponding features of the source dataset. It performs a 3-fold cross-validation to find the best hyperparameters from the considered Random Forest hyperparameter set described above. Afterwards, it transforms the transfer dataset into its source domain representation.
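In essence, MLFRT is a supervised multi-output mapping between feature spaces; a minimal sketch (the pairing of corresponding sessions is assumed to be done already; param_grid stands for the Random Forest grid from Section V-A):

```python
# Sketch of MLFRT: learn a mapping from transfer-domain features to
# source-domain features on paired sessions, then re-express the whole
# transfer dataset. Xt_pair/Xs_pair are the corresponding instances.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def mlfrt(Xt_pair, Xs_pair, Xt_all, param_grid):
    search = GridSearchCV(RandomForestRegressor(), param_grid, cv=3)
    search.fit(Xt_pair, Xs_pair)  # multi-output regression over pairs
    return search.best_estimator_.predict(Xt_all)  # source-domain view
```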
Drift elimination. For a pair of datasets (source and transfer), we find the features that differ between the two datasets and eliminate them prior to model training on the source dataset. The assumption here is that the model would perform better on the transfer dataset if it relies only on features that are similar in the two datasets. To identify drifting features, we evaluate how well models based on one feature can classify the origin of a sample (source or transfer dataset). Concretely, we take the network traffic features of both datasets and label the samples with their origin (dataset). Then, for each feature, we train a Random Forest model that classifies the origin based on that feature only and evaluate it through 2-fold cross-validation. If the model performs well, the feature is considered as drifting. In the analysis described later, we eliminate the features based on the area under the receiver operating characteristic curve (ROC_AUC), which is a measure for the goodness of the classification performance. This means we eliminate features that exceed a drifting threshold, set to ROC_AUC > 0.7 and ROC_AUC > 0.8.
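A sketch of this per-feature drift test (the single-feature Random Forest and the 2-fold ROC_AUC follow the text; array handling is simplified):

```python
# Sketch of drift elimination: score each feature by how well a
# single-feature Random Forest separates source from transfer samples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drifting_features(Xs, Xt, threshold=0.7):
    X = np.vstack([Xs, Xt])
    origin = np.r_[np.zeros(len(Xs)), np.ones(len(Xt))]  # dataset label
    drifting = []
    for j in range(X.shape[1]):
        auc = cross_val_score(RandomForestClassifier(), X[:, [j]],
                              origin, cv=2, scoring="roc_auc").mean()
        if auc > threshold:          # 0.7 or 0.8 in the analysis
            drifting.append(j)
    return drifting  # feature indices to drop before training
```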

Enrichment. The source dataset is enriched with a labeled subsample from the transfer dataset. We evaluated three intensities of enrichment, which we later refer to as 10%-, 20%-, and 30%-enrichment, depending on how much data from the transfer domain is added to the source domain training set. Note that 10%-enrichment means that to a balanced source dataset of size N samples, we add 0.1N samples from the transfer dataset. The rest of the transfer dataset samples are used for testing.
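A minimal sketch of the enrichment split (10% intensity shown; the random subsampling is our simplification):

```python
# Sketch of 10%-enrichment: add 0.1*N labeled transfer samples to a
# balanced source training set of size N; the rest stays for testing.
import numpy as np

def enrich(Xs, ys, Xt, yt, intensity=0.1, seed=0):
    k = int(intensity * len(Xs))
    idx = np.random.default_rng(seed).permutation(len(Xt))
    train_idx, test_idx = idx[:k], idx[k:]
    X_train = np.vstack([Xs, Xt[train_idx]])
    y_train = np.concatenate([ys, yt[train_idx]])
    return X_train, y_train, Xt[test_idx], yt[test_idx]
```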

VI. EVALUATION

We established the baseline by training and testing models in a network-specific fashion. The performance of such models is depicted in Figure 6. The figure also shows the performance of general models trained and tested on the merged dataset consisting of samples from all four datasets (denoted by "all" in the figure). For all KPIs, general models outperformed network-specific models. This indicates that introducing data from different sources into model training results in a better coverage of the feature space, thus increasing the performance over all datasets. The estimation of P.1203 MOS is an exception in that regard, which could be explained by MOS generally being a more complex concept, but also by the poor results obtained on individual datasets (e.g., Zag_2021). The figure shows that models trained and tested on Wue_2020 perform better in comparison to other network-specific models. This may be attributed to a somewhat clearer separation between the classes for that particular dataset (cf. Figure 2).

[Fig. 6. Performance of baseline and general models.]

Figure 7 evaluates the classification performance of the investigated transfer methods on the transfer dataset. For this, we compare the performance of models which are trained on the source dataset and transferred to the transfer dataset to the performance of the best model trained on the transfer dataset, which acts as the baseline. The different methods to improve transfer are depicted on the x-axis. The plot shows boxplots for each method, which describe the distribution of the difference in F1 score compared to the baseline over all 12 source/transfer dataset combinations and all 4 classification targets, i.e., in total 48 combinations. The median (50-percentile) is depicted as an orange line, the box extends from the lower to the upper quartile (25- to 75-percentile), whiskers reach from the 5- to the 95-percentile, and extreme values beyond the whiskers are plotted as individual points. Positive values indicate improvements with respect to the baseline, while negative values show a decrease in F1 score. For example, considering the leftmost box for "No Preprocessing", the lowest point at -0.38 indicates that there was a combination of source dataset, transfer dataset, and target for which the transferred model had an F1 score that was 0.38 lower than the baseline result. This shows that the performance of trained models can substantially drop when models are applied to other datasets for which they have not been trained and no method for improving the transfer is applied. Nevertheless, the median difference when applying no preprocessing is at -0.05 and the upper whisker extends to the maximum value of -0.01, which suggests that the performance of the transferred models without any preprocessing can also be close to and only slightly below the baseline.

[Fig. 7. Transfer dataset classification performance of models trained on the source dataset using different methods for improving transfer, compared to the best model trained on the transfer dataset (baseline). Positive values indicate improvement with respect to the baseline.]

We see a similar range of performance differences for most of the investigated source mode (S) methods. In particular, we find that more complex methods cannot outperform simple scaling approaches. When investigating those distributions more closely, we could not find any regularity concerning source dataset, transfer dataset, or target. Thus, we infer that the actual combination itself and its peculiarities in the concrete training and test sets define whether a certain method performs close to the baseline or not. When investigating the methods in S&T mode, we see that, except for standard scaling, the boxes and whiskers extend to much lower values, which means that those methods can result in much higher performance degradation compared to the baseline. In addition, LLE and Isomap have a strongly negative 95-percentile (upper whisker), while PCA, SSMA, and MLFRT even have a strongly negative maximum value in S&T mode, such that they will almost always (LLE, Isomap) or always (PCA, SSMA, MLFRT) result in a substantial performance loss, and thus cannot be recommended. Considering the methods in SM mode, where some transfer data is merged into the source data before training the methods, we see similar performance as for the methods in S mode. Thus, also here, transfer performance can be close to the baseline, but can also lead to a substantial F1 score reduction. The most consistent performance is reached by CCA (SM), which reaches values close to the baseline in almost all cases, but requires labels from the transfer domain. In contrast, all unsupervised approaches cannot outperform simple scaling methods, and thus cannot ensure high performance when models are transferred.

Finally, we investigate the performance when eliminating drifting features or enriching the source dataset with a labeled subsample of the transfer dataset, which is depicted in the five rightmost boxes in Figure 7. Here, we can see that all boxes intersect the red zero line, having values closer to the baseline and in some cases even extending to positive values. Here, again, the fluctuations around the zero line might depend on the peculiarities of the actual data. Nevertheless, generally, the boxes are smaller and we see a much more consistent performance with respect to the baseline, which means that these methods almost always avoid severe performance degradation when using the transferred models. Thus, we conclude that both methods are well suited for improving the transfer performance. While enrichment gives the best results, which was expected, it has the drawback that a labeled subsample of the transfer dataset is required, which might typically not be available. However, drift elimination provides an almost equally good performance without requiring labels from the transfer dataset. Therefore, it is much more applicable in typical transfer scenarios, and thus proves to be the most valuable method for improving transfer here.

[Fig. 8. Transfer dataset regression performance of models trained on the source dataset using different methods for improving transfer, compared to the best model trained on the transfer dataset (baseline). Negative values indicate improvement with respect to the baseline.]

Similarly to Figure 7, Figure 8 depicts the transfer dataset performance for P.1203 MOS estimation, which is a regression task. The figure shows a boxplot for the distribution of the difference in root mean square error (RMSE) between the transfer methods and the baseline result over all 12 source/transfer dataset combinations. Note that RMSE results are on the same scale as MOS values, and that, as the differences (errors) between the estimated and actual P.1203 MOS values need to become smaller, in this figure negative values indicate an improvement with respect to the baseline. The best results on the transfer dataset are again reached by enrichment. We can see that these boxes are located around the baseline, in some cases extending to negative values, i.e., to smaller RMSEs than the baseline models.

For the other methods, it can be seen that the results generally resemble the previously shown results for classification performance. In particular, we see that most methods result in a modest increase of RMSE with respect to the baseline, between 0.05 and 0.15, when the models were trained on the source dataset, with only few outliers. We can also observe that S and SM mode models perform similarly, and that S&T mode models typically perform worse. Again, the most consistent performance is given by CCA (SM), which can reach RMSE values closely above the baseline in all cases, but requires labels from the transfer domain. Looking at the results for drift elimination, we can see that it gives slightly worse RMSE compared to the baseline in most cases, and can only improve the RMSE in a single source/transfer dataset combination. This is different from the classification results above, and shows that this method cannot always reach a transfer performance on a par with the baseline models. Still, it provides the best performance when considering only unsupervised methods, which do not require labels from the transfer domain.

In the following, we investigate drifting features in more detail. Figure 9 shows the most severely drifting features across all datasets. The severity of drift is measured as the ROC_AUC in the 2-fold cross-validation of the model classifying the dataset origin. Since feature drift is tested for distinct dataset pairs, this gives 6 evaluations of ROC_AUC (one for Wue_2020 and Wue_2021, one for Wue_2020 and Zag_2020, etc.). The figure shows the number of evaluations (dataset pairs) in which the feature was identified as drifting and the average value of ROC_AUC. Note that the average only takes into account evaluations in which the ROC_AUC was higher than 0.7. For example, the maximum size of uplink packets greater than 100 B was drifting in all 6 evaluations, with an average drift across the 6 evaluations equal to 0.9577. The maximum download data volume in slots of 100 ms was drifting in 2 evaluations, with the average drift across those 2 evaluations being 0.8068.

[Fig. 9. Features identified as most drifting.]

The most severely drifting features are related to the maximum length of uplink and downlink packets. As these features correspond to the largest packet of each video, the values are expected to be defined by link properties and should not vary a lot across video samples originating from the same dataset. We confirm this with Figure 10(a), where it is visible that the maximum downlink packet length is typically consistent within a dataset, but the feature clearly separates distinct datasets.

[Fig. 10. Distributions of selected drifting features.]

In Figure 10(b), we inspect the drift of the minimum inter-arrival time of uplink packets across datasets. The minimum value of the inter-arrival time is typically reached when there are two consecutive packets on the uplink which are very small and the link is not congested. In this case, this feature can be considered a proxy for the available link capacity. The figure shows that, especially for larger values, the feature clearly separates the datasets.

Finally, we look into the distributions of the minimum size of non-trivial uplink packets in Figure 10(c), i.e., packets greater than 100 B, which excludes pure acknowledgement packets. It can be seen that datasets collected in the same year are more similar when it comes to this feature. Thus, while our previous examples showed drifts in the network characteristics, this hints at a drift happening on the application side. In particular, we assume that the size of the non-trivial uplink packets has increased by a few bytes, most likely due to a new field having been introduced into the uplink data before our 2021 measurements.

To sum up, our evaluations showed that a consistent performance on the transfer dataset close to the baseline, i.e., a model specifically trained on the transfer dataset, can only be reached by using labelled data from the transfer dataset. The best method is enrichment of the source dataset with labelled data from the transfer domain, which can even improve on the baseline in some cases. The reason is that this method not only allows to use a larger dataset for training the model but also allows to consider examples collected in the transfer domain, which help to learn meaningful and more general concepts that apply to both the source and transfer domains.

However, in a typical transfer scenario, such labels from the transfer domain will not be available, which prevents the use of enrichment. In this case, while most methods could not outperform simple scaling, we found drift elimination to be the best method for reaching performance close to the transfer baseline. By eliminating drifting features, which does not require labels from the transfer domain, the trained models need to focus on features that are similar in both domains. This removes domain-specific peculiarities, such as the maximum downlink packet size, from the data, and thus forces models to ignore them and focus on learning more general concepts instead. Consequently, models trained only on similar features should also give similar performance on both datasets. Our results showed that, for MOS regression, the performance of this method was only slightly worse than the baseline performance, while for the classification tasks, its performance was mostly on a par with the baseline or the enrichment method. This shows that drift elimination can be recommended for typical transfer scenarios.

VII. CONCLUSION AND OUTLOOK

Machine learning has been heavily used for estimating KPIs and QoE of video streaming flows in the network. While providing promising results in terms of estimation model performance, such approaches typically require extensive data collection, both of network traffic features and application-level performance information. Considering the wide variety of scenarios in which video streaming could be used (e.g., different devices, networks, operating systems), it is clear that including all of the possible scenarios in the data collection is demanding, if not impossible. In this paper, we tackled the problem of data collection extensiveness by evaluating whether an existing model trained on data from one network could be adapted and used on data obtained from a different network. The adaptation methods used in the paper require only network traffic features without labels (which can be calculated from the traffic generated by real users, without test devices), or require a significantly smaller labelled dataset from the network for which the model is adapted.

The paper evaluated adaptation methods based on scaling, decomposition, manifold learning, ML-based feature representation transfer, drift elimination, and enrichment. While the evaluation pointed out that enrichment of the source dataset with labelled data from the transfer domain and the elimination of drifting features are the most promising methods for improving the transfer of models in this context, subsequent studies are needed to address related research questions. An interesting way forward may be to explore combinations of the described methods. For example, eliminated drifting features could be transformed and introduced back into the dataset.

Apart from the session-level QoE/KPI estimation models considered in this work, we will investigate whether the same methods are as promising in the case of real-time KPI estimation using the collected real-time ML dataset, or when considering transfer between different bandwidth limitation conditions. Moreover, the methodology could be repeated for new datasets obtained in scenarios differing in aspects other than the network (e.g., mobile operating system, streaming service).

When it comes to collecting new datasets, a big challenge is obtaining a good coverage of the feature space. For the model to learn to detect QoE degradations, the training set needs to include such degradations. Existing literature, however, does not provide guidelines on how to collect data to ensure a good coverage of scenarios that may occur in an operational network. Another challenge is related to assessing the needed size of the dataset to train robust QoE/KPI estimation models.

Additional questions arise when considering practical application of the models. The research focus in the area has mainly been on developing methodologies, while deploying solutions based on proposed methodologies would require experimentation, adaptation, and customization. The amount of resources (compute, storage, network) needed for 1) real-time processing of the traffic and calculating the traffic features, and 2) executing the model on the calculated features will depend on the available infrastructure. A number of solution design choices may be considered in that regard. For example, an ISP may be willing to sacrifice some model performance (e.g., by using simpler models) if that would significantly reduce the computation cost. Moreover, they might not be interested in assessing the QoE of each and every session, but rather in sampling the traffic in a meaningful way.
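One way such meaningful sampling could look is deterministic hash-based selection, which monitors a fixed share of sessions and consistently keeps or skips all packets of a flow without per-flow state; the flow-key format and the 5% rate below are illustrative assumptions.

```python
# Sketch of hash-based session sampling: a stable flow identifier is hashed
# to a value in [0, 1), and only flows below the target rate are monitored.
# Flow-key format and sampling rate are illustrative assumptions.
import hashlib

def monitor_session(flow_key: str, rate: float = 0.05) -> bool:
    """Deterministically decide whether a flow/session is monitored."""
    digest = hashlib.sha256(flow_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

print(monitor_session("10.0.0.1:51234->203.0.113.7:443/TCP"))
```

Because the decision depends only on the flow key, every packet of a sampled session is processed consistently, while the monitored subset remains an unbiased sample of the traffic mix.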
With respect to the methods that may be considered in future work, we also see deep learning based methods as worth exploring when large datasets are available. In particular, this includes approaches using pre-trained and frozen layers obtained from the source dataset together with newly added trainable layers, which can adapt the model decisions to the transfer dataset. Another promising direction is models for learning more meaningful representations, such as transformers, which can potentially learn embeddings that are more independent of certain network peculiarities and thus allow for better transfer performance. We will investigate these approaches in future works.
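A minimal sketch of this frozen-backbone idea, assuming a small feed-forward network in PyTorch (architecture, layer sizes, and the training snippet are illustrative, not a proposed design):

```python
# Sketch: reuse layers pre-trained on the source dataset as a frozen feature
# extractor and train only a newly added head on transfer-domain data.
# Architecture and sizes are illustrative assumptions.
import torch
import torch.nn as nn

n_features, n_classes = 50, 3  # e.g., traffic features -> KPI classes

backbone = nn.Sequential(       # pre-trained on the source dataset;
    nn.Linear(n_features, 64),  # weights would be loaded in practice
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
)
for param in backbone.parameters():
    param.requires_grad = False  # freeze the source-domain representation

head = nn.Linear(32, n_classes)  # newly added trainable layer
model = nn.Sequential(backbone, head)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X_batch = torch.randn(128, n_features)         # placeholder transfer batch
y_batch = torch.randint(0, n_classes, (128,))
optimizer.zero_grad()
loss = loss_fn(model(X_batch), y_batch)
loss.backward()                                # gradients only for the head
optimizer.step()
```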
ACKNOWLEDGMENT
The authors alone are responsible for the content.

REFERENCES
[1] “Ericsson mobility report,” Ericsson, Stockholm, Sweden, Rep., Nov. 2021. [Online]. Available: https://www.ericsson.com/4ad7e9/assets/local/reports-papers/mobility-report/documents/2021/ericsson-mobility-report-november-2021.pdf
[2] A. Feldmann et al., “A year in lockdown: How the waves of COVID-19 impact Internet traffic,” Commun. ACM, vol. 64, no. 7, pp. 101–108, 2021.
[3] “The global Internet phenomena report,” Sandvine, Waterloo, ON, Canada, Rep., Jan. 2022. [Online]. Available: https://www.sandvine.com/hubfs/Sandvine_Redesign_2019/Downloads/2022/Phenomena%20Reports/GIPR%202022/Sandvine%20GIPR%20January%202022.pdf
[4] R. Schatz, M. Fiedler, and L. Skorin-Kapov, “QoE-based network and application management,” in Quality of Experience. Cham, Switzerland: Springer, 2014, pp. 411–426.
[5] L. Skorin-Kapov, M. Varela, T. Hoßfeld, and K.-T. Chen, “A survey of emerging concepts and challenges for QoE management of multimedia services,” ACM Trans. Multimedia Comput. Commun. Appl. (TOMM), vol. 14, no. 2s, pp. 1–29, 2018.
[6] I. Sodagar, “MPEG-DASH: The standard for multimedia streaming over Internet,” Int. Org. Standard., Redmond, WA, USA, White Paper ISO/IEC JTC1/SC29/WG11 W13533, Apr. 2012.
[7] B. Bross, J. Chen, J.-R. Ohm, G. J. Sullivan, and Y.-K. Wang, “Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC),” Proc. IEEE, vol. 109, no. 9, pp. 1463–1493, Sep. 2021.
[8] A. A. Barakabitze et al., “QoE management of multimedia streaming services in future networks: A tutorial and survey,” IEEE Commun. Surveys Tuts., vol. 22, no. 1, pp. 526–565, 1st Quart., 2019.
[9] “ITU-T recommendation P.1203: Parametric bitstream-based quality assessment of progressive download and adaptive audiovisual streaming services over reliable transport,” Int. Telecommun. Union, Geneva, Switzerland, 2017. [Online]. Available: https://www.itu.int/rec/T-REC-P.1203/en
[10] “ITU-T recommendation P.1204: Video quality assessment of streaming services over reliable transport for resolutions up to 4K,” Int. Telecommun. Union, Geneva, Switzerland, 2020. [Online]. Available: https://www.itu.int/rec/T-REC-P.1204-202001-P/en
[11] V. Aggarwal, E. Halepovic, J. Pang, S. Venkataraman, and H. Yan, “Prometheus: Toward quality-of-experience estimation for mobile apps from passive network measurements,” in Proc. 15th Workshop Mobile Comput. Syst. Appl., 2014, pp. 1–6.
[12] V. Krishnamoorthi, N. Carlsson, E. Halepovic, and E. Petajan, “BUFFEST: Predicting buffer conditions and real-time requirements of HTTP(S) adaptive streaming clients,” in Proc. 8th ACM Multimedia Syst. Conf., 2017, pp. 76–87.
[13] M. H. Mazhar and Z. Shafiq, “Real-time video quality of experience monitoring for HTTPS and QUIC,” in Proc. Conf. Comput. Commun. (INFOCOM), 2018, pp. 1331–1339.
[14] D. Tsilimantos, T. Karagkioules, and S. Valentin, “Classifying flows and buffer state for YouTube’s HTTP adaptive streaming service in mobile networks,” in Proc. ACM Multimedia Syst. Conf. (MMSys), Jun. 2018, pp. 1–13.
[15] G. Dimopoulos, I. Leontiadis, P. Barlet-Ros, and K. Papagiannaki, “Measuring video QoE from encrypted traffic,” in Proc. Internet Meas. Conf., 2016, pp. 513–526.
[16] M. Seufert, P. Casas, N. Wehner, L. Gang, and K. Li, “Stream-based machine learning for real-time QoE analysis of encrypted video streaming traffic,” in Proc. 22nd Conf. Innovat. Clouds Internet Netw. Workshops (ICIN), 2019, pp. 76–81.
[17] S. Wassermann, M. Seufert, P. Casas, L. Gang, and K. Li, “ViCrypt to the rescue: Real-time, machine-learning-driven video-QoE monitoring for encrypted streaming traffic,” IEEE Trans. Netw. Service Manag., vol. 17, no. 4, pp. 2007–2023, Dec. 2020.
[18] F. Bronzino, P. Schmitt, S. Ayoubi, G. Martins, R. Teixeira, and N. Feamster, “Inferring streaming video quality from encrypted traffic: Practical models and deployment experience,” Meas. Anal. Comput. Syst., vol. 3, no. 3, pp. 1–25, 2019.
[19] C. Gutterman et al., “Requet: Real-time QoE detection for encrypted YouTube traffic,” in Proc. 10th ACM Multimedia Syst. Conf., 2019, pp. 48–59.
[20] I. Orsolic and L. Skorin-Kapov, “A framework for in-network QoE monitoring of encrypted video streaming,” IEEE Access, vol. 8, pp. 74691–74706, 2020.
[21] I. Bartolec, I. Orsolic, and L. Skorin-Kapov, “Impact of user playback interactions on in-network estimation of video streaming performance,” IEEE Trans. Netw. Service Manag., vol. 19, no. 3, pp. 3547–3561, Sep. 2022.
[22] I. Orsolic and M. Seufert, “On machine learning based video QoE estimation across different networks,” in Proc. 16th Int. Conf. Telecommun. (ConTEL), 2021, pp. 62–69.
[23] I. Orsolic, D. Pevec, M. Suznjevic, and L. Skorin-Kapov, “YouTube QoE estimation based on the analysis of encrypted network traffic using machine learning,” in Proc. IEEE Globecom Workshops, 2016, pp. 1–6.
[24] I. Orsolic, D. Pevec, M. Suznjevic, and L. Skorin-Kapov, “A machine learning approach to classifying YouTube QoE based on encrypted network traffic,” Multimedia Tools Appl., vol. 76, no. 21, pp. 22267–22301, 2017.
[25] I. Orsolic, M. Suznjevic, and L. Skorin-Kapov, “YouTube QoE estimation from encrypted traffic: Comparison of test methodologies and machine learning based models,” in Proc. 10th Int. Conf. Qual. Multimedia Exp. (QoMEX), 2018, pp. 1–6.
[26] P. Casas et al., “Predicting QoE in cellular networks using machine learning and in-smartphone measurements,” in Proc. 9th Int. Conf. Qual. Multimedia Exp. (QoMEX), Erfurt, Germany, 2017, pp. 1–6.
[27] M. Seufert, P. Casas, N. Wehner, L. Gang, and K. Li, “Features that matter: Feature selection for on-line stalling prediction in encrypted video streaming,” in Proc. Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), 2019, pp. 688–695.
[28] S. Wassermann, M. Seufert, P. Casas, L. Gang, and K. Li, “Let me decrypt your beauty: Real-time prediction of video resolution and bitrate for encrypted video streaming,” in Proc. Netw. Traffic Meas. Anal. Conf. (TMA), 2019, pp. 199–200.
[29] S. Wassermann, M. Seufert, P. Casas, L. Gang, and K. Li, “I see what you see: Real time prediction of video quality from encrypted streaming traffic,” in Proc. 4th Internet-QoE Workshop QoE-Based Anal. Manag. Data Commun. Netw., 2019, pp. 1–6.
[30] S. Schwarzmann, C. C. Marquezan, M. Bosk, H. Liu, R. Trivisonno, and T. Zinner, “Estimating video streaming QoE in the 5G architecture using machine learning,” in Proc. 4th Internet-QoE Workshop QoE-Based Anal. Manag. Data Commun. Netw., 2019, pp. 7–12.
[31] S. Schwarzmann, C. C. Marquezan, R. Trivisonno, S. Nakajima, and T. Zinner, “Accuracy vs. cost trade-off for machine learning based QoE estimation in 5G networks,” in Proc. Int. Conf. Commun. (ICC), 2020, pp. 1–6.
[32] “System architecture for the 5G system,” Eur. Telecommun. Stand. Inst., Sophia Antipolis, France, Rep. TS 123 501, version 17.8.0, 2023.
[33] “Architecture enhancements for 5G system (5GS) to support network data analytics services,” Eur. Telecommun. Stand. Inst., Sophia Antipolis, France, Rep. TS 123 288, version 17.8.0, 2023.
[34] M. Seufert, S. Wassermann, and P. Casas, “Considering user behavior in the quality of experience cycle: Towards proactive QoE-aware traffic management,” IEEE Commun. Lett., vol. 23, no. 7, pp. 1145–1148, Jul. 2019.
[35] I. Orsolic, P. Rebernjak, M. Suznjevic, and L. Skorin-Kapov, “In-network QoE and KPI monitoring of mobile YouTube traffic: Insights for encrypted iOS flows,” in Proc. 14th Int. Conf. Netw. Service Manag. (CNSM), 2018, pp. 233–239.
[36] S. Ickin, K. Vandikas, F. Moradi, J. Taghia, and W. Hu, “Ensemble-based synthetic data synthesis for federated QoE modeling,” in Proc. 6th IEEE Conf. Netw. Softwarizat. (NetSoft), 2020, pp. 72–76.
[37] S. Ickin, M. Fiedler, and K. Vandikas, “QoE modeling on split features with distributed deep learning,” Network, vol. 1, no. 2, pp. 165–190, 2021.
[38] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, “Learning with drift detection,” in Brazilian Symposium on Artificial Intelligence. Berlin, Germany: Springer, 2004, pp. 286–295.
[39] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. Cambridge, MA, USA: MIT Press, 2009.
[40] G. Widmer and M. Kubat, “Learning in the presence of concept drift and hidden contexts,” Mach. Learn., vol. 23, no. 1, pp. 69–101, 1996.
[41] A. Tsymbal, “The problem of concept drift: Definitions and related work,” Comput. Sci. Dept., Univ. Dublin, Dublin, Ireland, Rep. TCD-CS-2004-15, 2004.
[42] A. Bifet, J. Gama, M. Pechenizkiy, and I. Žliobaitė, “Handling concept drift: Importance, challenges and solutions,” Dept. Comput. Sci., Univ. Waikato, Hamilton, New Zealand, 2011. [Online]. Available: https://www.cs.waikato.ac.nz/~abifet/PAKDD2011/PAKDD11Tutorial_Handling_Concept_Drift.pdf
[43] K. O. Stanley, “Learning concept drift with a committee of decision trees,” Dept. Comput. Sci., Univ. Texas Austin, Austin, TX, USA, Rep. UT-AI-TR-03-302, 2003.
[44] I. Žliobaitė, M. Pechenizkiy, and J. Gama, “An overview of concept drift applications,” in Big Data Analysis: New Algorithms for a New Society. Cham, Switzerland: Springer, 2016, pp. 91–114.
[45] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldà, and R. Morales-Bueno, “Early drift detection method,” in Proc. 4th Int. Workshop Knowl. Discov. Data Streams, vol. 6, 2006, pp. 77–86.
[46] J. Ham, D. Lee, and L. Saul, “Semisupervised alignment of manifolds,” in Proc. Int. Workshop Artif. Intell. Statist., 2005, pp. 120–127.
[47] J. Wang, X. Zhang, X. Li, and J. Du, “Semi-supervised manifold alignment with few correspondences,” Neurocomputing, vol. 230, pp. 322–331, Mar. 2017.
[48] M. Seufert, R. Schatz, N. Wehner, and P. Casas, “QUICker or not? An empirical analysis of QUIC vs TCP for video streaming QoE provisioning,” in Proc. 3rd Int. Workshop Qual. Exp. Manag. (QoE-Manage), 2019, pp. 7–12.
[49] F. Wamser, M. Seufert, P. Casas, R. Irmer, P. Tran-Gia, and R. Schatz, “YoMoApp: A tool for analyzing QoE of YouTube HTTP adaptive streaming in mobile networks,” in Proc. Eur. Conf. Netw. Commun. (EuCNC), 2015, pp. 239–243.
[50] M. Seufert, “Quality of experience and access network traffic management of HTTP adaptive video streaming,” Ph.D. thesis, Faculty Math. Comput. Sci., Univ. Würzburg, Würzburg, Germany, 2017. [Online]. Available: https://opus.bibliothek.uni-wuerzburg.de/files/15413/Seufert_Michael_Thomas_HTTP.pdf
[51] J. H. Friedman and L. C. Rafsky, “Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests,” Ann. Statist., vol. 7, no. 4, pp. 697–717, 1979.

Michael Seufert (Member, IEEE) received the bachelor’s degree in economathematics and the Diploma, Ph.D., and Habilitation degrees in computer science from the University of Würzburg, Germany, and holds the First State Examination degree in mathematics, computer science, and education for teaching in secondary schools. He is a Full Professor with the University of Augsburg, Germany, heading the Chair of Networked Embedded Systems and Communication Systems. His research focuses on user-centric communication networks, including QoE of Internet applications, AI/ML for QoE-aware network management, as well as group-based communications.

Irena Orsolic received the M.Sc. degree in information and communication technology and the Ph.D. degree in computer science from the Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia, in 2016 and 2020, respectively, where she was a Postdoctoral Researcher and a member of the Multimedia Quality of Experience Research Lab. Since 2023, she has been working as an Experienced Core Network Researcher with Ericsson AB, Stockholm, Sweden. The focus of her research is on quality of experience estimation of encrypted video streaming by using machine learning methods.