

A Big Data-enabled Hierarchical Framework for Traffic Classification
Giampaolo Bovenzi, Giuseppe Aceto, Domenico Ciuonzo, Senior Member, IEEE,
Valerio Persico, and Antonio Pescapé, Senior Member, IEEE

Abstract—According to the critical requirements of the Internet, a wide range of privacy-preserving technologies are available, e.g. proxy sites, virtual private networks, and anonymity tools. Such mechanisms are challenged by traffic-classification endeavors, which are crucial for network-management tasks and have recently become a milestone in their privacy-degree assessment, both from attacker and designer standpoints. Further, the new Internet era is characterized by the capillary distribution of smart devices leveraging high-capacity communication infrastructures: this results in a huge amount of heterogeneous network traffic, i.e. big data. Hence, herein we present BDeH, a novel hierarchical framework for traffic classification of anonymity tools. BDeH is enabled by the big-data paradigm and capitalizes the machine-learning workhorse for operating with encrypted traffic. In detail, our proposal allows for seamless integration of the data parallelism provided by big-data technologies with the model parallelism enabled by hierarchical approaches. Results prove that the so-achieved double parallelism carries no negative impact on traffic-classification effectiveness at any granularity level and achieves non-negligible performance enhancements with respect to non-hierarchical architectures (+4.5% F-measure). Also, it significantly gains over either pure data or pure model parallelism (resp. centralized) approaches by reducing both training completion time—up to 78% (resp. 90%)—and cloud-deployment cost—up to 31% (resp. 10%).

Index Terms—big data; dark web; encrypted traffic; hierarchical classification; traffic classification.

Manuscript received 1st Dec. 2019; revised 12th May and 18th June 2020; accepted 13th July 2020.
G. Bovenzi, G. Aceto, D. Ciuonzo, V. Persico and A. Pescapé are with the University of Napoli Federico II (Italy). E-mail: {giampaolo.bovenzi, giuseppe.aceto, domenico.ciuonzo, valerio.persico, pescape}@unina.it.

I. INTRODUCTION

The increasing criticality that the Internet has been gaining since its birth puts privacy and security assurance in the spotlight. To accommodate these requirements, a wide range of privacy-preserving technologies have been developed, such as proxy sites, virtual private networks, and Anonymity Tools (ATs). The latter act as facilitators for Internet users, allowing them to obfuscate the communication as well as the nature of the exchanged contents to any eavesdropping entity. On the one hand, ATs challenge authorities in the discovery of cyber-crimes, e.g., selling copyrighted or malicious software, drugs, guns, child porn, and stolen digital identities, or hiding online frauds, extremism, hacking, and abuses. On the other hand, they are essential in sharing crucial information through the Internet, e.g. when censorship is enforced by non-democratic actors, or for the sole right to privacy [1]. This last-mentioned aspect confirms their original significance to keep the Internet as a fully-available common good. As a result, ATs like Tor are currently widespread, with a turnout of up to 2 million users.¹

¹ https://tinyurl.com/y5cucfno

At the same time, understanding the nature of the communications flowing through the Internet is critical for operators to properly manage the networks. Such process is commonly referred to as traffic analysis. Traffic Classification (TC) in particular—i.e. inferring the type of traffic that a network entity is generating—is a building block of the utmost importance for quality-of-service enforcement, traffic engineering, network security [2], etc. Leveraging Machine Learning (ML) approaches to accomplish this task also fulfills the requirement of preserving privacy by performing the classification solely based on the statistical features of the encrypted traffic, without decrypting its content [3]. Also, the adoption of hierarchical approaches allows for performance gains, by splitting the TC task into sub-problems. Equally important, though hierarchical approaches may result in increased training complexity, they can leverage model parallelism, due to their scalability and modularity. As a result, hierarchical ML-based TC has recently appealed to the scientific community [4], [5], [6], [7]. However, the capillary diffusion of Internet-enabled devices and the growth of network-link capacity and coverage introduce new challenges in implementing effective and efficient TC. Such emergence calls for the design and the deployment of highly-scalable architectures, to permit a feasible (time-constrained) fine-grained analysis of huge amounts of heterogeneous network traffic. This results in a Big Data (BD) scenario to be suitably managed and capitalized. In this direction, initial effort has been put forward by the scientific community and industry to apply BD technologies, e.g. Apache Spark or Apache Hadoop, to ML-based TC [8], [9], [10], exploiting data parallelism.

As TC becomes more and more challenging, benefiting from both model and data parallelization approaches is of clear appeal, but neither of the two is trivially applicable by itself. On the one hand, model parallelism requires accurate architecture planning to fit the specific (classification) problem in order to reap the potential benefits in terms of classification effectiveness and design advantages [7]. On the other hand, data parallelization via the adoption of BD does not represent a transparent enabler, as it may imply classification-performance degradation, trading efficiency for effectiveness [11]. Accordingly, their combination is far from being trivial, as their interplay is not known a priori [12]. Nevertheless, jointly leveraging model and data parallelism is extremely promising
to accommodate the needs arising from recent scenarios in computer networks, which call for tools able to process huge amounts of data produced by heterogeneous devices (e.g., those generated by IoT platforms) in a timely manner and at a predictable cost (e.g., leveraging cloud or fog platforms [13]). While the separate adoption of either model or data parallelization has been investigated to a certain extent, to the best of our knowledge their combination has not been explored in the TC literature. This has motivated the present work.

Accordingly, our goal is providing and evaluating a novel TC framework to address the classification of traffic generated by anonymity tools. Based on the above motivations, the proposed framework boasts the benefits of both model and data parallelism and is able to provide the appealing characteristics of modularity, scalability, and fast retraining, which make it suitable for working with the traffic of today's (and next-generation) networks. To meet the above desiderata, we investigate the interplay of model and data parallelism and evaluate their interaction along multiple dimensions. As a result, this work paves the way to a novel approach for designing TC algorithms.

In detail, the technical contributions of the paper are the following. First, we survey the works that either apply hierarchical TC or enhance standard TC with the integration of BD technologies (Tab. I), to highlight the lack of BD-enabled hierarchical TC approaches based on ML. Secondly, we design a BD-enabled hierarchical (BDeH) framework implementing double parallelism, i.e. one that integrates the advantages of model parallelism given by hierarchical TC [7] with those originated by data parallelism, provided by BD technologies. Thirdly, we evaluate the proposed system for the classification of the traffic generated by ATs, showing the effective gain against (a) flat (i.e. non-hierarchical) BD-enabled TC and (b) hierarchical centralized (non-BD) counterparts. We evaluate our proposal along three different dimensions, i.e. (i) TC effectiveness, (ii) training-completion time, and (iii) cost incurred for deploying this task on a public cloud. As a result, we show that BDeH gains over each counterpart and effectively capitalizes the advantages arising from both types of parallelism. The BDeH source code is publicly released² and the validation has been performed on a publicly available dataset [14], for the sake of repeatability and reproducibility [15].

² https://github.com/jmpr0/hierarchical-framework

The rest of the manuscript is organized as follows: Sec. II discusses related literature on either hierarchical TC or standard TC with the integration of BD technologies; Sec. III presents the proposed BDeH framework, whereas Sec. IV reports the corresponding numerical evaluation; finally, Sec. V provides conclusions and points to future perspectives.

II. RELATED WORKS

In this section, we review works regarding either hierarchical TC or BD-enabled TC. To this end, in Tab. I we provide their comparative overview along multiple key aspects, highlighted by the corresponding columns.

The first high-level distinction pertains to the nature of the approach (hierarchical vs. flat, and BD-enabled vs. centralized). While these two aspects are non-exclusive, related works are partitioned into two sets according to them, with no intersection: this first aspect contributes to motivate our research. Still, the novelty of our approach lies in addressing the challenging integration of both model and data parallelism (due to their interplay), aiming at a sophisticated and highly-effective TC system. Also, we investigate whether the training phase of the reviewed hierarchical architectures leverages the intrinsic model parallelism of this paradigm, even via simple parallel scheduling techniques. Indeed, such categorization points to the lack of this facet in all TC works based on hierarchical methods. On the other hand, the application of (manifold) BD technologies to TC tasks is significantly represented in the literature as well, by recent works proposing approaches to exploit the advantages of distributed computing in TC. In particular, they adopt BD technologies to define distributed ML models to enhance scalability or classification performance, and also to meet real-time analysis requirements. By investigation of all the above works, the most used BD technologies result to be Apache Hadoop and Apache Spark.

Additionally, the works in Tab. I focus on different kinds of Traffic Types, including web services (e.g. Facebook, Gmail, Skype, Google), network attacks (i.e. DDoS), and typologies associated to different contexts (e.g. P2P, video, Tor, and mobile), for both hierarchical and BD-enabled approaches. For hierarchical TC, this results in proposals that consider hierarchies with up to three levels of classes and classification models with increasingly-refined granularity. The considered Traffic Objects include, for both approaches, flows, biflows, and TCP connections. In addition, several BD-related works focus on the finest granularity, i.e. packets. Concerning Input Data, all the reviewed approaches feed the classifiers with different sets of statistical features of the considered TOs. The most common features are related to the inter-arrival time, byte count, TCP flags, packet count, and payload length.

Also, classifiers are mostly common between the two approaches, which leverage state-of-art ML models, like Support Vector Machine (SVM), Decision Tree (DT) and related evolutions, k-Nearest Neighbour (k-NN), and Neural Networks (NN). Furthermore, more than half of the reviewed works rely on a private dataset, thus precluding further comparisons and advancements. The only exceptions are few works releasing only a part of their considered traffic data [11], [22]. Still, nearly all the works considering hierarchical TC use a private dataset. Analogously, a significant share of reviewed works provides enough details for a reproducible implementation of the approach proposed therein, e.g. [8], [19], [21]. Nonetheless, only our recent work publicly releases the code implementation of the proposed method [7].

Finally, a common limitation of existing works on hierarchical approaches is the lack of performance evaluation in terms of training time when model parallelism is enforced, i.e. versus a flat approach. Moreover, with regards to BD-enabled approaches, no cost analysis of a real-scenario deployment is provided (except for our recent work [11]), despite this giving a clear view of the trade-off with inference performance, looking for the optimization of a distributed deployment.
Table I: Summary of previous works on either Hierarchical TC or BD-enabled TC.

| Hier. | BD | Levels | BD Tech. | TO | Input Data | Traffic Type | ML Model | Open Dataset | Reprod. impl. | Paper |
|---|---|---|---|---|---|---|---|---|---|---|
| ◐ | ○ | 3 | — | B | 39 statistics (PR, PS, WS, JIT, DUR, FLGT, etc.) | L1) P2P / non-P2P; L2) P2P type; L3) Application | SVM (L1), SVDD (L2), OC-SVM (L3) | ○ | ○ | J. Yu et al., KSII TIIS 2010 [4] |
| ◐ | ○ | 3 | — | F | 200 statistics (PT, FLGT, TTL, etc.) | L1) Known/Unseen; L2) Protocol; L3) Application/Site | DT, NN, and SVM | ○ | ○ | L. Grimaudo et al., IEEE IWCMC 2012 [6] |
| ○ | ● | — | H | F | 7 statistics (PTs, PR, PC, FL, DUR) | Applications | SVM | ○ | ◐ | V. D'Alessandro et al., IEEE ICCC 2015 [8] |
| ◐ | ○ | 2-3 | — | B | Multiview with 4 views, different hierarchy per view | — | — | ○ | ○ | S.-H. Yoon et al., IEEE APNOMS 2015 [16] |
| ○ | ● | — | S | P | 30 packet fields (TTL, PR, FLGI, CHK, etc.) | DDoS vs. Normal | GA | ◐ | ◐ | M. Mizukoshi et al., IEEE CEC 2015 [17] |
| ◐ | ○ | 2 | — | C | 18 statistics (PC, IaT, and PL) | L1) HTTPS services class; L2) HTTPS services folds | NB, RT, DT, and RF | ○ | ○ | W. M. Shbair et al., IEEE/IFIP NOMS 2016 [18] |
| ○ | ● | — | H | F | 6 statistics (BC and PS at different ISO/OSI layers) | Applications | DT | ○ | ◐ | Z. Yuan et al., IEEE ICOACS 2016 [9] |
| ○ | ● | — | S | F | 100 Tstat metrics | Web Services | Proposal | ○ | ◐ | M. Trevisan et al., IEEE Big Data 2016 [19] |
| ○ | ● | — | H | P | TS, SRC, DST, PR and other header info | DDoS vs. Normal | — | ○ | ◐ | S. Hameed et al., IEEE/IFIP NOMS 2016 [20] |
| ◐ | ○ | 2 | — | F | 4 statistics (BC, PS, FC, and IaT) | L1) Asymmetric/Symmetric; L2) Video traffic type | k-NN | ○ | ○ | Y. n. Dong et al., Elsevier ComNet 2017 [5] |
| ◐ | ○ | 2 | — | F | 4 statistics (PL, PS, and IaT) | L1) Tor / Normal traffic; L2) Tor application | DT (L1), Tri-training Alg. (L2) | ○ | ◐ | J. Lingyu et al., IEEE ICCSN 2017 [21] |
| ○ | ● | — | S, I | B | PL for the first 5 packets, per-direction PC and BC, and PTs | Web Services | GBT, RF, SVM, and NN | ○ | ○ | L.-V. Le et al., SSE TNC 2018 [10] |
| ○ | ● | — | SS | B | 248 statistics (PTs, IaT, BC at different ISO/OSI layers, per-direction PC, FLGT, WS, IaT, RTT, DA, etc.) | Applications | PM | ◐ | ◐ | X. Li et al., IEEE CIS 2018 [22] |
| ◐ | ○ | 3 | — | F | 74 statistics (BC, PC, PL, IaT, FLGT, WS, TOS, TTL, etc.) | L1) AT; L2) Service; L3) Application | NB, BN, DT, and RF | ● | ● | A. Montieri et al., IEEE TNSE 2019 [7] |
| ○ | ● | — | S | P | 49 (12 after selection) packet fields (SRC, DST, PR, TTL, PT, etc.) | DDoS vs. Normal | NB, DT, and RF | ○ | ◐ | A. Alsirhani et al., IEEE TNSM 2019 [23] |
| ○ | ● | — | S | B | (PL, IaT, PT, FLGT) for the first 20 packets, payload of first 784 bytes | Mobile Apps | CNN, LSTM | ◐ | ◐ | G. Aceto et al., IEEE TMA 2019 [11] |
| ● | ● | 3 | S | F | 74 statistics (BC, PC, PL, IaT, FLGT, WS, TOS, TTL, etc.) | L1) AT; L2) Service; L3) Application | RF | ● | ● | BDeH (this paper) |

Legend of acronyms (— when not applicable):
Hierarchical (Hier., w. model parallelism): ○ (Flat), ◐ (Hierarchical w/o model parallelism), ● (Hierarchical w/ model parallelism);
BD-enabled (BD, data parallelism): ○ (w/o BD framework), ● (w/ BD framework);
BD Tech.: H (Apache Hadoop), I (IBM InfoSphere), S (Apache Spark), SS (Apache Spark Streaming);
TO: B (biflow), C (TCP connection), F (flow), P (packet);
Input Data: BC (Byte Count), CHK (IP checksum), DA (duplicate ACK flag), DST (destination IP), DUR (duration), FC (fragment count), FL (flow length), FLGI (IP flags), FLGT (TCP flags), IaT (inter-arrival time), JIT (jitter), PC (packet count), PL (payload length), PR (protocol), PS (packet size), PT (port), RTT (Round Trip Time), SRC (source IP), TOS (type of service), TS (timestamp), TTL (time to live), WS (window size);
Traffic Type: AT (Anonymity Tool), DDoS (Distributed Denial-of-Service), P2P (Peer-to-peer);
ML Model: BN (Bayesian Network), CNN (Convolutional NN), DT (Decision Tree), GA (Genetic Algorithm), GBT (Gradient Boosted Tree), k-NN (k-Nearest Neighbour), LSTM (Long Short-Term Memory), NB (Naïve Bayes), NN (Neural Network), OC-SVM (One-class SVM), PM (Pattern Matching), RF (Random Forest), RT (Random Tree), SVDD (Support Vector Data Description), SVM (Support Vector Machine);
Open Dataset: ○ (w/o publicly available dataset), ◐ (at least one publicly available dataset), ● (w/ publicly available dataset);
Reproducible implementation: ○ (w/o details for reproducibility), ◐ (w/ details for reproducibility), ● (w/ publicly available implementation).
In summary, compared with the state-of-art, our novel approach takes place in the non-trivial intersection of these two approaches to TC. By doing so, it capitalizes the strong advantages of the two complementary parallelisms, and provides a detailed analysis of their integration. Indeed, regarding the latter aspect, a public dataset [14] is used for the experimental validation, and the implementation code is publicly released.

III. BDeH FRAMEWORK DESIGN

Herein, we describe the design choices underlying our BDeH framework for TC. Precisely, we first introduce the operational workflow and the requirements of its training phase (Sec. III-A). Secondly, we discuss the concepts of data and model parallelism, respectively, outlining their benefits and implications (Sec. III-B). Then, we present the BD infrastructure supporting our framework (Sec. III-C). Last, we discuss the evaluation metrics adopted for its validation (Sec. III-D).

A. Traffic-classification Workflow via the BDeH Framework

Figure 1 highlights how the proposed BDeH framework is integrated in the overall TC workflow, which receives raw traffic as input and produces labeled traffic objects. In detail, the operation of BDeH requires an input of raw network traffic that can be effectively modeled in a hierarchical fashion. In other words, the action of associating network traffic to a label must be doable at different degrees of granularity. This is a generic requirement that matches many and diverse real-life traffic analysis or management scenarios [5], [6], and can be easily tailored to specific needs. To provide a clear example with a valued practical application, we refer to traffic generated by Anonymity Tools (ATs). Still, we remark that the proposed framework is not limited to ATs, albeit being motivated also by this practical application. In fact, BDeH is designed to benefit from any Hierarchical Classification (HC) model of network traffic, e.g. mobile apps (arranged as categories, apps, and versions) [24], also including flat models as simpler cases. Having defined a hierarchy of classes, the first pre-requisite for TC is the choice of a criterion for aggregating packets sharing the same label. The next paragraph is dedicated to this part.

Traffic Segmentation. The raw traffic is to be segmented in atomic Traffic Objects (TOs), based on a criterion dependent on the intended application of the classification results (cf. Fig. 1a). Indeed, each TO is assigned a label, therefore the actions following the classification can discriminate traffic at such unit of granularity. Most proposals segment traffic in flows or biflows [25]. In detail, a flow is a stream of packets sharing the same 5-tuple (i.e. source IP and port, destination IP and port, and transport-level protocol), thus taking into account their direction. Differently, in a biflow the source and destination (IP address, port) pairs can be exchanged. In both cases the termination is defined based on a user-defined timeout. Other works employ diverse TOs, e.g. TCP connections. The latter differ from biflows only in the initiation and termination heuristics. Once the relevant TO is selected, the workflow details of our BDeH framework can be summarized as follows.

BDeH at work. Each TO is provided as input to the proposed BDeH framework. Its output is a set of {ℓ̂1, . . . , ℓ̂G} predicted labels, each corresponding to a given TC granularity level (cf. Fig. 1a). In detail, the g-th level corresponds to Lg classes to discriminate from, with Lg growing at finer granularities (i.e. Lg+1 > Lg). For example, in our considered scenario, we face G = 3 granularity levels for TC. The first level focuses on identifying the AT used to transport traffic, i.e. assigning ℓ̂1 from L1 = 3 classes. The second delves into the classification of the type of transport service offered by the specific AT, i.e. assigning ℓ̂2 from L2 = 7 classes. Finally, the finest granularity is associated to labeling the specific application tunneled in the AT with a given transport service, i.e. assigning ℓ̂3 from L3 = 21 classes.

In order to assign the G labels to each TO, our BDeH relies on the HC paradigm, which imposes a tree dependence for classes belonging to different levels. Specifically, each class at the (g+1)-th TC level has at most one parent class, which belongs to the g-th TC level. In other words, BDeH is made of multiple classifier nodes (whose number is denoted with Nc) arranged as a tree, which are traversed in a top-down fashion [6], [7]. Additionally, for each node, we adopt the widely-used local-classifier per-parent-node approach [26], imposing the design of a multi-class classifier for each parent node in the class hierarchy.³

Based on these design choices, for each TO to be classified BDeH first predicts ℓ̂1, corresponding to the most generic class. The above label is then used to select the classifier node in charge of providing the label ℓ̂2. This procedure is repeated until the G-th TC level. We remark that the allowed classes for ℓ̂2 are only the children of ℓ̂1, thus narrowing the choice of classes to be predicted at the second level. Making the different nodes in the tree aware only of a subset of the entire classification space is the natural outcome of a divide-et-impera approach. This results in a simplification of the TC problem, reducing the number of classes (L̄1, . . . , L̄Nc) among which each node has to discriminate. Indeed, each node is trained to distinguish only among its children nodes. For example, in our AT scenario, we obtain Nc = 8 classifier nodes (with corresponding classes to discriminate from) as reported in Fig. 2.

Although errors at a given class level could propagate downwards the hierarchy, the HC choice promotes architecture modularity and enables model parallelism, specifically suitable for BD architectures (later shown in Sec. III-B). Further, the HC approach enables a fine-grained (per-node) optimization of the feature set, the classifier, the hyperparameters, and even the TO [7]. As a result, HC is likely to achieve a significant TC performance gain against a flat counterpart, i.e. a single classifier solving the finest (g = G) TC task [7].

Training Requirements of BDeH. To operate in the test phase, BDeH needs to be previously initialized by a training phase (Fig. 1b), which trains all the Nc ML classifier nodes of the hierarchy.

³ The aforementioned choice is usually preferred to (a) local-classifier per-node and (b) local-classifier per-level approaches, since it avoids combinatorial growth of the number of classifiers Nc and membership inconsistency in the label set {ℓ̂1, . . . , ℓ̂G}, respectively [26].
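To make the top-down traversal described above concrete, the following minimal Python sketch shows how a local-classifier per-parent-node hierarchy could be queried at test time. It is an illustrative reconstruction, not the released BDeH code: the dictionaries `classifiers` (one trained per-parent-node model each) and `children` (the class hierarchy of Fig. 2) are hypothetical placeholders.

```python
# Minimal sketch of top-down inference with local classifiers per parent node.
# Assumption: each entry of `classifiers` is a trained multi-class model
# (e.g. a scikit-learn estimator) that discriminates among its children only.

def hierarchical_predict(features, classifiers, children, root="ROOT"):
    """Return the list of predicted labels [l1, ..., lG] for one traffic object."""
    labels = []
    node = root
    while node in classifiers:                 # stop at a leaf of the hierarchy
        child_classes = children[node]
        if len(child_classes) == 1:            # degenerate (single-class) classifier
            predicted = child_classes[0]
        else:
            predicted = classifiers[node].predict([features])[0]
        labels.append(predicted)
        node = predicted                       # the prediction selects the next node
    return labels

# Example (hypothetical hierarchy mirroring Fig. 2):
# children = {"ROOT": ["Tor", "I2P", "JonDonym"], "Tor": [...], ...}
# labels = hierarchical_predict(x, classifiers, children)   # e.g. ["Tor", "TorApp", ...]
```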
Figure 1: BDeH Framework: (a) TC workflow process (testing phase) and (b) training phase.

Figure 2: Hierarchical traffic classifier for the Anon17 dataset. Classifiers (solid black squares) distinguish among several classes (dots). Dashed grey squares correspond to degenerate (i.e. single-class) classifiers.

For this phase, the input is a collection of Nc training sets {T1, . . . , TNc}, all obtained starting from the total training set T and retaining only a subset of its samples. Specifically, Tn contains only the (training) samples associated to the L̄n labels constituting the TC task to be solved by the n-th node. Accordingly, this implies |Tn| ≪ |T| due to the reduced number of classes (|·| denotes the training-set size), except for the root node (n = 1). Indeed, in general ∪_{n∈Ng} Tn = T and ∩_{n∈Ng} Tn = ∅, where Ng denotes the set of classifiers concurring to the g-th TC granularity level. Referring to our AT scenario, the training-set size of each classifier node |Tn| (compared with |T|, required when training a flat TC approach) is shown in Fig. 4d.

The output is the set of trained classifier nodes composing the tree hierarchy. We remark that the training process of each node is independent of the others. Accordingly, although this phase may be implemented by a single entity training all Nc nodes in a sequential fashion, we show hereinafter that our BDeH approach can leverage BD infrastructures to exploit parallelism of both the classifiers (model) and the data. This allows obtaining increased time-efficiency and, with suitable design, also cost-effectiveness.

B. Data and Model Parallelism

The BD paradigm enables both batch and streaming distributed analysis, addressing the issues related to high data variability, volume, and velocity. Usually, BD technologies are used to obtain a time-performance speedup and also contribute to shorten the training phase in ML applications. This may be achieved by exploiting both the notions of data and model parallelism.

In a nutshell, data parallelism is based on the split of the training set associated to a ML algorithm into non-overlapping subsets, each one assigned to a different worker. The latter performs learning (a) on its own portion of data and (b) by synchronizing with the other workers through partial information exchanges. Such process is typically accomplished thanks to a coordinating entity (named master), which is also in charge of collecting the final result of the training process based on data parallelism [11]. It is worth noting that the fragmentation of the training set could degrade classification performance. Furthermore, a higher number of workers may also negatively impact the temporal gain, due to the burden imposed by the synchronization overhead at the controller side [11].

Differently, model parallelism resorts to splitting the model associated to a ML algorithm (with no partitioning of the training set), with the aim of simplifying the learning task at each worker. In our peculiar case, model parallelism can leverage dependencies among traffic classes (i.e. hierarchical dependency) to perfectly parallelize the learning task over the different workers, each one assigned to a sub-problem. Indeed, the breakdown of the classification procedure along a tree-like architecture enables the training of distinct ML (sub-)models in parallel (e.g. the classifier nodes). Accordingly, there is no TC performance loss due to model partitioning (by construction) when considering HC. This is one of the peculiarities of HC approaches as opposed to flat counterparts, where model partitioning is not performed based on a hierarchical class representation. Nevertheless, from a time-related perspective, advantages against a flat approach are not guaranteed and need to be investigated. In other words, the advantages arising from the training of multiple but less complex classifiers, instead of the training of a single but more complex classifier, are not obvious.

Accordingly, the key idea behind our proposal is to combine the benefits of data parallelism enabled by BD paradigms with those deriving from the model parallelism granted by the adoption of the HC architecture. In TC problems, since the duration of the learning phase is directly proportional to both the number of samples used for training and the number of classes to choose from, data parallelism can be employed to reduce the number of samples, whereas model parallelism reduces the number of classes.

Referring to our AT scenario, model parallelism (induced by HC) leads to the decomposition of the original classifier with L3 = 21 classes, supported by the whole training set T, into Nc = 8 independently-trainable simpler nodes considering at most five classes (cf. Fig. 2) with smaller training sets Tn.
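As a concrete illustration of the label-wise splitting mentioned above (and in footnote 4 below), the following Python sketch builds the per-node training sets {T1, . . . , TNc} from a flat dataset whose samples carry the full label path (ℓ1, ℓ2, ℓ3). It is only a sketch under the stated assumptions; the variable names (`dataset`, `hierarchy`) are illustrative and do not come from the released implementation.

```python
from collections import defaultdict

def build_node_training_sets(dataset, hierarchy, root="ROOT"):
    """
    dataset:   iterable of (features, label_path) pairs, e.g. (x, ("Tor", "TorApp", "AppX"))
    hierarchy: dict mapping a parent node to the list of its child classes (tree of Fig. 2)
    Returns a dict mapping each parent node n to its training set Tn, where each
    sample is relabeled with the child class that node must predict.
    """
    node_sets = defaultdict(list)
    for features, label_path in dataset:
        parent = root
        for child in label_path:          # one (parent -> child) pair per granularity level
            if parent in hierarchy:       # skip leaves (no classifier to train there)
                node_sets[parent].append((features, child))
            parent = child
    return node_sets

# The root set contains every sample (|T1| = |T|), while deeper nodes only
# receive the samples of their own sub-tree, hence |Tn| << |T| in general.
```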
Figure 3: BDeH framework with focus on the BD infrastructure: scheduling of training tasks over buckets (top, model parallelism) and internal structure of a bucket (bottom, data parallelism).

The impact on training time of each Tn (viz. its complexity) is further reduced by data parallelism, which allows splitting each of them by the number of workers considered. Our experimental validation (Sec. IV) shows how, in the application of the proposed framework, there actually is an improvement in terms of both training time/cost and classification performance. To this end, henceforth we describe the BD infrastructure supporting the BDeH framework.

C. Description of the BD Infrastructure

BDeH is supported by a BD infrastructure, shown in Fig. 3 and detailed as follows. The BD infrastructure manages the computing resources in B units (buckets) that are instantiated on the BD framework (e.g. Apache Spark) and are handled by a scheduler. Workload assignment is based on a master-slave architecture, with a controller and several workers, and leverages a distributed file system (e.g. Spark SQL, Hadoop Distributed File System, etc.). Hence, within each bucket there is one master node and a pool of worker nodes, with each master assigning and coordinating the work of the pool. Buckets are assumed to have the same number of workers (Nw), as the scheduler acts as a load balancer that distributes tasks uniformly across buckets.

In BDeH, the scheduler implements model parallelism by scheduling the training task of each node of the HC architecture⁴ to one of the buckets and forwarding the corresponding training set, along with the ML model specifications. Hence, training tasks assigned to the same bucket are executed sequentially, whereas training tasks assigned to different buckets are run in parallel.

Once an ML training task is assigned to a given bucket, BDeH performs the latter job by implementing data parallelism through master-workers exchanges. In other terms, data parallelism is implemented within each bucket.

In detail, data parallelism for training each ML model (associated to a classifier node) is enforced by a BD framework (e.g. Spark) through two main phases, i.e. data spreading and model-parameters exchange (Fig. 3, bottom). First, during the data-spreading phase, the training set (already a model-specific subset of the full training set) is further split into several non-overlapping portions (i.e. <split>). Accordingly, the master node acquires the required resources in the worker nodes and assigns each set portion to a given worker (i.e. <allocate>). Then, during the model-parameters exchange phase, status information is repeatedly exchanged between workers and the master to (i) enable the aggregation of model parameters from all worker instances (<merge>, workers→master) and (ii) synchronize the workers with the updated master status (<update>, master→workers).

Ideally, scheduling aims at minimizing the training completion time ttot of the HC architecture (i.e. the makespan). Since the latter requires all the nodes to be trained in order to be put in the test phase, the above time corresponds to:

    ttot ≜ max_{b=1,…,B} tb        (1)

i.e., the longest completion time among the buckets {t1, . . . , tB}, where B is the number of buckets. In detail, the completion time of the b-th bucket can be written as tb = Σ_{n=1}^{Nc} ψb,n tn, where tn is the training time required for the n-th classifier node and ψb,n ∈ {0, 1} is an indicator variable being one (resp. zero) when the training task of the n-th node is assigned (resp. not assigned) to the b-th bucket.⁵

Accordingly, the optimal scheduler provides the solution to the following optimization:

    Ψ* ≜ arg min_Ψ max_{b=1,…,B} { tb(Ψ) = Σ_{n=1}^{Nc} ψb,n tn }        (2)

where Ψ ∈ {0, 1}^{B×Nc} is the selection matrix whose (b, n)-th entry equals ψb,n. Hence, the optimization is carried out over the space of selection matrices Ψ (the column sum is constrained to one, i.e. each task is assigned to exactly one bucket). It is worth noticing that the solution to the optimization in Eq. (2) is infeasible since, in real scenarios, the time required for each training task (namely t1, . . . , tNc) is unknown a priori, and due to the NP-completeness of the optimization problem. We now explain how we circumvent these two technical issues.

⁴ It is worth underlining that a preliminary phase is required to build the training sets {T1, . . . , TNc} associated to the nodes of the hierarchy (starting from T as a result of label-wise splitting operations).
⁵ For simplicity, we suppose the completion time of each node tn does not vary with the bucket assignment. Still, the BDeH framework could be generalized to heterogeneous buckets, i.e. having different resource budgets.
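For small instances, the makespan of Eq. (1) and the optimal assignment of Eq. (2) can be checked by brute force, which also shows why an exhaustive search does not scale (there are B^Nc candidate assignments). The sketch below is illustrative only and assumes the per-node training times tn are known, which, as noted above, is not the case in practice.

```python
from itertools import product

def makespan(assignment, times, n_buckets):
    """Eq. (1): longest per-bucket completion time; assignment[n] is the bucket of node n."""
    loads = [0.0] * n_buckets
    for node, bucket in enumerate(assignment):
        loads[bucket] += times[node]          # t_b = sum of the t_n assigned to bucket b
    return max(loads)

def optimal_assignment(times, n_buckets):
    """Eq. (2) by exhaustive search: O(B^Nc) candidates, feasible only for tiny Nc."""
    best, best_cost = None, float("inf")
    for assignment in product(range(n_buckets), repeat=len(times)):
        cost = makespan(assignment, times, n_buckets)
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost

# Toy example with Nc = 5 hypothetical node training times (seconds) and B = 2 buckets:
# optimal_assignment([120, 30, 45, 10, 60], 2)
# returns an assignment with makespan 135.0 (e.g. nodes {0, 3} on one bucket, {1, 2, 4} on the other)
```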
First, in place of the task completion time, we define a surrogate function ρ(·, ·) associated to the completion time. The need for defining a surrogate metric originates from the lack of general and explicit expressions for the complexity of ML classifiers as a function of the relevant parameters considered: (i) the size of the training set and (ii) the number of classes. In detail, for the n-th classifier the waiting time tn depends on the number of samples of its training set (|Tn|) and on the number of classes of the node's TC task (L̄n), namely:

    ρn ≜ ρ(|Tn|, L̄n) ≜ |Tn| L̄n / ( Σ_{m=1}^{Nc} |Tm| L̄m )        (3)

Hence, we replace tn with ρn to perform the optimization in Eq. (2). It is worth noticing that other complexity measures monotonically growing with both |Tn| and L̄n could be considered as well.

Secondly, we resort to a priority scheduling approach⁶ that has the advantage of being O(Nc). It is based on the aforementioned surrogate, and aims at assigning training tasks so as to balance the completion time among all buckets. In the following, we assume that all scheduling strategies deal with Nc > B, since they collapse into trivial assignments for Nc ≤ B.

We design two scheduling strategies, namely offline and online scheduling. Both sort the tasks by decreasing ρn first and assign one task per bucket accordingly. The former statically assigns each remaining task to the bucket having the lowest current sum of the ρn's already assigned to it. The latter strategy dynamically evaluates the state of the buckets and, when a bucket completes its currently-assigned task, it is assigned the remaining task with the highest ρn. Differently from the offline scheduling, this strategy exploits completion-time feedback at the cost of monitoring the buckets' state. Indeed, feedback has been shown to "repair" the degrading effect of uncertainties on scheduling results (due to the unavailability of the tn's and their replacement with the ρn's) and to provide general beneficial effects (i.e. independently of the specific scheduling approach adopted) by monitoring the workers' workload status [27].

⁶ This approach is also referred to as the "longest processing time" approach in the scheduling literature.
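A minimal sketch of the offline variant described above is given below, assuming the surrogate priorities ρn of Eq. (3) have already been computed (e.g. as |Tn|·L̄n, since the normalizing denominator does not affect the ordering). This illustrates the longest-processing-time idea, not the released scheduler; the online variant would differ only in that the next task is dispatched when a bucket signals completion rather than being pre-assigned.

```python
def offline_priority_schedule(rho, n_buckets):
    """
    rho: list of surrogate complexities, rho[n] for classifier node n (Eq. (3)).
    Returns a list of buckets, each a list of node indices, built greedily:
    tasks are sorted by decreasing rho and each one goes to the least-loaded bucket.
    """
    buckets = [[] for _ in range(n_buckets)]
    loads = [0.0] * n_buckets
    for node in sorted(range(len(rho)), key=lambda n: rho[n], reverse=True):
        target = min(range(n_buckets), key=lambda b: loads[b])   # lowest current sum of rho
        buckets[target].append(node)
        loads[target] += rho[node]
    return buckets

# Example with hypothetical priorities for Nc = 8 nodes and B = 3 buckets:
# offline_priority_schedule([0.40, 0.18, 0.12, 0.10, 0.08, 0.06, 0.04, 0.02], 3)
# -> [[0], [1, 4, 6], [2, 3, 5, 7]]   (tasks within a bucket run sequentially)
```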
D. Evaluation Metrics

Our analysis takes into account the following three complementary (and intertwined) performance aspects, which are essential for a complete evaluation of a BD-enabled TC system [11]: (i) TC performance; (ii) training completion time; (iii) cloud-deployment cost.

Since BD technologies do not constitute a transparent accelerator for the training phase of ML-based traffic classifiers (they can impact the resulting performance), we evaluate the TC performance of our BDeH framework leveraging the well-known F-measure to assess classification effectiveness. This metric represents the harmonic mean of per-class precision and recall, arithmetically averaged over all the considered classes.

Also, because reducing the processing time required for task completion is arguably the major driver in the adoption of BD architectures, we provide a detailed evaluation of this key aspect. In detail, we focus on the training completion time ttot, because TC systems are expected to require frequent re-training operations, due to the aging of training data as a result of the quick evolution of network applications and their usage.

Finally, beyond the performance in terms of TC effectiveness and execution time, a key aspect when deploying applications on the cloud is the cost of the analysis according to the pay-per-use model. This cost is proportional to the duration of the analysis (the training phase in our case) and to the configuration of the BD architecture (i.e. the degree of parallelism). In turn, the configuration cost is proportional to the number of master machines and the number of worker machines. Hence, the total cost ctot scales according to:

    ctot ≜ B (cM + Nw cW) ttot        (4)

where cM (resp. cW) denotes the master (resp. worker) hourly cost. Herein, to reflect a realistic deployment, we consider the current fees of a public cloud infrastructure, i.e. Amazon Web Services (AWS), shown in Tab. II. Also, AWS, like most public cloud providers, provides per-second billing, with a 60 s minimum.⁷

⁷ Since all the training completion times observed in this work exceed 60 s, such constraint does not affect our cost evaluation.
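As a quick sanity check of Eq. (4), the snippet below computes ctot for a hypothetical deployment using the per-hour AWS fees of Tab. II (0.101 $/h for a master, 0.051 $/h for a worker); it is only an illustrative calculation, not part of the released code, and the example figures (B, Nw, ttot) are made up.

```python
def deployment_cost(n_buckets, n_workers, t_tot_seconds, c_master=0.101, c_worker=0.051):
    """Eq. (4): c_tot = B * (c_M + N_w * c_W) * t_tot, with hourly fees and t_tot in hours."""
    hourly_rate = n_buckets * (c_master + n_workers * c_worker)
    return hourly_rate * (t_tot_seconds / 3600.0)

# Hypothetical example: B = 2 buckets, N_w = 2 workers per bucket, t_tot = 250 s
# -> 2 * (0.101 + 2 * 0.051) * (250 / 3600) ≈ 0.028 $
```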
IV. EXPERIMENTAL VALIDATION

In this section, we describe the experimental scenario considered for the assessment of BDeH and discuss the resulting outcomes. In detail, in Sec. IV-A we first outline the dataset used for experimentation. Then, in Sec. IV-B we describe the experimental testbed and the technologies adopted for the BDeH implementation. Finally, in Sec. IV-C we discuss the performance results according to the proposed evaluation metrics.

A. Dataset Description

The dataset used to validate our proposal is Anon17 [14], collected in a real network environment at the NIMS Lab within '14–'17 and publicly available.⁸ This dataset fits the proposed HC architecture, consisting of AT traffic (generated via Tor, I2P, and JonDonym) with a three-level (G = 3) labeling scheme, i.e., from coarser to finer granularity: Anonymity Tool (L1 = 3 classes), Traffic Type (L2 = 7 classes), and Application (L3 = 21 classes). Precisely, the dataset contains ≈ 1.46M flows for a total of ≈ 430M packets.⁹

The dataset is released in the form of 74 statistics per flow (which can be extracted also from encrypted traffic). Thus, the TO we refer to in our study is the flow. We leverage the full set of features in our experimentation for all nodes. These choices are dictated by the type of analysis to perform, which could be either batch (also termed post-mortem) or streaming (also termed real-time). Our choices are suitable for the batch TC task we aim to address. This notwithstanding, our BDeH framework can accommodate alternative design choices.

⁸ https://web.cs.dal.ca/~shahbar/data.html
⁹ In this paper, we down-sample to 5% the most-populated traffic-type classes of the original dataset, using a pre-processing strategy similar to [7], [28], so as to mitigate class imbalance toward "majority" classes.

B. Experimental Setup

To conduct the proposed analysis, we deploy a naïve configuration of the HC architecture in which all the nodes present the same configuration in terms of ML algorithm.
Table II: Pricing of Amazon Web Services (AWS) equivalent machines in July 2019.

| AWS Machine Name | Role   | vCPU [#] | RAM [GiB] | Cost [$/h] |
|------------------|--------|----------|-----------|------------|
| a1.xlarge        | Master | 4        | 8         | 0.101      |
| a1.large         | Worker | 2        | 4         | 0.051      |

Based on the related literature [7], [29], [30], we have initially identified four relevant ML models: two tree-based models (i.e. Random Forest (RF) and Decision Tree) and two Bayesian-based models (i.e. Naïve Bayes and Bayesian Network). Among this set of ML algorithms, our attention is restricted to the sole RF, being the best classifier in terms of F-measure (among the above-mentioned models) and the slowest to train, due to the ensemble nature characterizing it. We remark that selecting the optimal ML classifier per node and/or optimizing the corresponding set of features (such as in [7]) falls outside the goal of this work and would not result in any change to the proposed methodology. Hence, other ML solutions like Gradient Boosted Trees (GBTs) are not evaluated, but they are left to future developments.

The proposed BDeH framework is implemented in Python, with the Scikit-learn module used to compute TC metrics. To conduct the experimental evaluation, we selected Apache Spark (with the PySpark module) as the most suitable BD technology, in terms of performance and ease of deployment. The experimental testbed is deployed on the OpenStack cluster at the University of Naples "Federico II". We deploy Spark clusters in standalone mode, using its own resource manager in the master to manage several slaves, or workers (limiting to one the maximum number of cores per worker). The characteristics of the adopted instances are in Tab. II, where we also report the AWS-equivalent name and pricing¹⁰.

Finally, we underline that the experiments are conducted through a stratified 10-fold cross-validation procedure to assure the stability of results. Hence, for each performance metric, we report the mean (µ) and standard deviation (σ), as a µ ± 3σ confidence interval.

C. Experimental Results

In the following, we first show the performance achievable with pure data parallelism. Then, we evaluate the impact of adopting different scheduling strategies to assign tasks to buckets, assessing pure model parallelism. Finally, we experimentally evaluate the benefits originated by the capitalization of data and model double-parallelism in our BDeH framework.

Pure Data Parallelism. In our first campaign we inspect the impact of sole data parallelism; hence we do not take into account model parallelism, i.e. we assign all the nodes to a single bucket (B = 1, no scheduler needed). In detail, we perform the training of the Nc classifiers of the hierarchy sequentially, submitting one training task at a time to the single master. Accordingly, we analyze the trend of the metrics varying the number of workers Nw linked to the master, namely we perform per-classifier data parallelism. We compare the adoption of data parallelism on HC (curve "Hierarchical") against the distributed flat approach (curve "Flat"), in terms of training completion time (Fig. 4a) and related costs (Fig. 4b). By distributed flat approach, we mean a single RF classifier trained at granularity level g = G = 3 using data parallelism (e.g. splitting the whole training set T among the Nw workers, which are then coordinated by the master for performing the learning task, see Sec. III-B). For completeness, we also report the respective theoretical baselines ("Hierarchical BL" and "Flat BL"), defined as the completion time and cost of the centralized version divided by the number of workers Nw. In other words, the latter are ideal curves which do not take into account the synchronization overhead and allow to appreciate its impact (by comparison) on realistic data-parallel approaches.

As shown in Fig. 4a, the training completion time shows an asymptotic behaviour, almost flattening at around 8–10 workers. The comparison between hierarchical and flat training completion time shows an almost constant gain of ≈ 50% for the former approach, e.g. with 10 workers the training phase is completed by the HC and flat approach in 316.97 ± 19.83 s and 643.01 ± 28.32 s, respectively. The above gain originates from the simpler classification tasks associated to the nodes in the hierarchy and the smaller training sets Tn (cf. Sec. III-A and Fig. 4d). Notably, all the training-completion-time curves do not follow the respective theoretical baselines, as the actual speedup is burdened by the overhead imposed by the master–workers communication. Accordingly, as shown in Fig. 4b, the reduction of the training completion time does not result in cost savings, with cost curves showing a monotonically increasing trend.

In Fig. 4c we compare the F-measure vs. the number of workers of the hierarchical and flat classifiers against their centralized counterparts (Nw = 1). The F-measure is shown for all the three levels of TC granularity. Beyond the effective gain of HC [7] (e.g., red vs. green curves), it is apparent that performance remains stable when Nw grows. This result may seem counterintuitive at first, since the fragmentation of the training set could negatively impact the performance of ML classifiers [11]. Still, such weak dependence could be explained by the parallel and ensemble nature of the RF classifier. Indeed, the aforementioned peculiarity leads to a federated RF version (on Spark) which suffers less from compressed intermediate exchanges among workers (via the master), as opposed to other algorithms. Hence, this makes it more resilient to the fragmentation of the training set. We remark that, in general, this is not the case for ML approaches, e.g. see neural networks in [11] for a mobile TC task.

Finally, in Fig. 4d we provide a fine-grained analysis of the training completion time for the flat approach as well as for each node of the HC hierarchy (as shown in Fig. 2). Therein, the diameter of each circle is proportional to the training completion time of the related classifier (tn). In detail, for each classifier the inner circle refers to the optimized data-parallel configuration (the optimal number of workers¹¹ is reported in square brackets close to node names), while the outer one refers to the centralized (single-worker) execution.

¹⁰ https://aws.amazon.com/it/ec2/pricing/on-demand/
¹¹ Here the optimal number is the minimum number of workers that allows achieving the lowest completion time observed.
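The evaluation protocol described above (stratified 10-fold cross-validation of an RF classifier, reporting each metric as µ ± 3σ) can be sketched with scikit-learn as follows. This is a simplified, centralized illustration under assumed variable names (`X` as a NumPy array of flow features, `y` as the labels); the actual experiments run the training on Spark workers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validated_fmeasure(X, y, n_splits=10, random_state=0):
    """Stratified 10-fold CV; returns (mean, 3*std) of the macro-averaged F-measure."""
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(random_state=random_state)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        # F-measure: per-class harmonic mean of precision/recall, arithmetically averaged
        scores.append(f1_score(y[test_idx], pred, average="macro"))
    return float(np.mean(scores)), 3.0 * float(np.std(scores))

# mu, conf = cross_validated_fmeasure(X, y)   # report as mu ± conf (i.e. µ ± 3σ)
```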
Figure 4: On the left: completion time (a) and cost (b) of the training phase for hierarchical and flat approaches against their respective theoretical baselines (— BL), varying the number of workers. On the right: classification performance (F-measure) (c), at all 3 granularity levels, of hierarchical and flat approaches against their respective centralized versions (— C), varying the number of worker machines; dashed lines refer to the configuration with only one worker machine, i.e. no data parallelism. Complexity map (d), where each circle represents a classifier: for each, inner and outer borders refer to the training completion time with the optimized data-parallel configuration (the optimal number of workers is shown in square brackets) and the centralized configuration (with one worker), respectively. Graph values are provided as µ ± 3σ, corresponding to a confidence interval of 99.75%.

Figure 5: Training completion time (left y-axis) and cost (right y-axis) achievable with different scheduling strategies in the case of pure model parallelism (bars report average and standard deviation over 10 folds, markers report the sole average).

The difference between the outer and the inner diameter shows the gain achieved with data parallelism. Circles are scattered according to the number of samples of the related training sets (x-axis) and the number of classes that must be told apart (y-axis). From this plot, we infer that a higher complexity corresponds to a higher reduction in training time when data parallelism is leveraged. This positive effect is balanced by the unavoidable saturation due to the synchronization bottleneck between master and workers.

Pure Model Parallelism. In this campaign we evaluate, in the case of pure model parallelism (i.e. Nw = 1, resulting in each bucket having one master and one slave), how the different available scheduling strategies impact the training completion time (ttot) and the training cost (ctot) achieved by the architecture. Notably, for this experimental analysis we only consider ttot and ctot, since TC effectiveness is not impacted by the scheduling selection (viz. model parallelism).

In detail, we evaluate the two proposed variants of the priority scheduling (Sec. III-C), i.e. its offline and online versions, against two baselines. The former is an oracle baseline which realizes the same scheduling with clairvoyant knowledge of the training completion time of each node {t1, . . . , tNc}, which
is used in place of {ρ1, . . . , ρNc}. The latter is the optimal baseline, which derives from the full exploration of all scheduling choices, namely the solution to Eq. (2). Figure 5 reports the results vs. B = 1, . . . , 7 buckets. The general trend witnesses the benefits of adopting pure model parallelism, which allows improving the average performance from a ttot perspective. For instance, when considering B = 5 with the online scheduler, there is a relative ttot improvement of up to ≈ +74% with respect to B = 1 (i.e. the centralized case), at the expense of a slight ctot increase (i.e. ≈ +33%). Also, the proposed priority scheduling strategies achieve results not far from the oracle and the optimal baselines, i.e. at most ≈ −27.1% and ≈ −27.3% relative loss of training time, respectively.

In detail, the online variant outperforms the offline counterpart in most of the cases, i.e. up to ≈ +26% relative gain in terms of training time and at most ≈ −20% relative loss of training cost. This gain is due to the side knowledge of node completion times originating from the online feedback.

Finally, regardless of the scheduling strategy adopted, the obtained results show a saturation, with the completion time settling at ≈ 200 s for deployments with B ≥ 4. This highlights the inherent limitations of pure model parallelism, which need to be overcome by leveraging data and model parallelism together, as shown in the following analysis. More specifically, we focus on the online variant of the proposed priority scheduling since it guarantees better performance and a feasible implementation.

Data+Model Double-Parallelism. Based on the outcomes of the previous experimentation, we now evaluate the performance of the general BDeH framework. In detail, we jointly consider data (Nw > 1 workers) and model (B > 1 buckets) parallelism. Accordingly, to measure performance we consider the training completion time and the resulting cost: TC effectiveness is not considered since the effect of data parallelism on RF has been shown to be negligible.¹²

The above two metrics are explored by varying the number of (a) available buckets B and (b) workers per bucket Nw. For completeness, in Fig. 6 we again report, in dotted horizontal and dashed vertical boxes, the configurations pertaining to pure data parallelism (Nw > 1 and B = 1) and pure model parallelism (Nw = 1 and B > 1), respectively. Notably, the intersection of the two boxes also reports the time-cost performance of the centralized HC approach (Nw = 1 and B = 1) presented¹³ in [7].

By looking at the results, the minimum training completion time (Fig. 6 top) is obtained with Nw = 11 workers and B = 5 buckets (marked in Fig. 6), corresponding to 74.54 ± 13.44 s on average. This configuration leads to +77.58%, +61.79%, and +90.02% relative reduction of the training completion time with respect to optimized pure data parallelism (with Nw = 11), pure model parallelism (with B = 5), and centralized approaches, respectively. Differently, considering the cost (Fig. 6 middle), the cheapest configuration is obtained with Nw = 2 and B = 2 (marked in Fig. 6), corresponding to a cost of 0.028 $. This configuration leads to +24.42%, +31.02%, and +9.9% relative cost reduction w.r.t. optimized pure data parallelism (with Nw = 2), pure model parallelism (with B = 2), and centralized approaches, respectively.

Finally, we also consider a time-cost score s(ttot, ctot) taking into account both the training completion time (ttot) and the cost (ctot). In detail, we evaluate the weighted average of the two normalized metrics, which ranges from 0 to 1 (the lower, the better). More specifically, the formula is:

    s(ttot, ctot) = λ1 r̃ttot + λ2 r̃ctot        (5)

where r̃ttot and r̃ctot represent the corresponding min-max normalized counterparts¹⁴, with weights λ1 = λ2 = 0.5. It is worth remarking that these weights could be configured to get the desired trade-off.

Thus, the best-balanced configuration is attained with Nw = 6 workers and B = 5 buckets (marked in Fig. 6) and shows a training completion time of 82.81 ± 16.89 s and a cost of 0.047 $, corresponding to a score of 0.05. This configuration leads to +76.89%, +58.78%, and +90.28% relative reduction considering pure data parallelism (with Nw = 6), pure model parallelism (with B = 5), and centralized approaches, respectively.

¹² Indeed, increasing B only affects the degree of model parallelism (i.e. there is a higher number of buckets which can process the training tasks associated to the classifier nodes) and does not alter the TC performance of HC approaches. Conversely, increasing Nw incurs the same TC-performance insensitivity observed in Fig. 4c.
¹³ Time-cost performance was not addressed in [7], given the focus on TC effectiveness and the lack of a BD infrastructure.
¹⁴ Specifically, r̃x ≜ (x − xmin)/(xmax − xmin).
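The time-cost score of Eq. (5) is straightforward to reproduce; the sketch below min-max normalizes the two metrics over the set of explored configurations (as per footnote 14) and combines them with equal weights. The configuration values in the example are hypothetical and only serve to illustrate the computation.

```python
def time_cost_scores(times, costs, lambda_t=0.5, lambda_c=0.5):
    """Eq. (5): s = lambda_t * norm(t_tot) + lambda_c * norm(c_tot), lower is better."""
    def minmax(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]   # footnote 14: (x - x_min)/(x_max - x_min)
    t_norm, c_norm = minmax(times), minmax(costs)
    return [lambda_t * t + lambda_c * c for t, c in zip(t_norm, c_norm)]

# Hypothetical configurations (t_tot in seconds, c_tot in $):
# scores = time_cost_scores([750, 320, 120, 83], [0.026, 0.045, 0.060, 0.047])
# min(scores) then identifies the best-balanced (N_w, B) configuration.
```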
Data+Model Double-Parallelism. Based on the outcomes of the previous experiments, we now evaluate the performance of the general BDeH framework. In detail, we jointly consider data (N_w > 1 workers) and model (B > 1 buckets) parallelism. Accordingly, to measure performance we consider the training completion time and the resulting cost: TC effectiveness is not considered, since the effect of data parallelism on RF has been shown to be negligible.12
12 Indeed, increasing B only affects the degree of model parallelism (i.e. there is a higher number of buckets that can process the training tasks associated with the classifier nodes) and does not alter the TC performance of HC approaches. Conversely, increasing N_w incurs the same TC-performance insensitivity observed in Fig. 4c.

The above two metrics are explored by varying the number of (a) available buckets B and (b) workers per bucket N_w. For completeness, in Fig. 6 we again report, in dotted horizontal and dashed vertical boxes, the configurations pertaining to pure data parallelism (N_w > 1 and B = 1) and pure model parallelism (N_w = 1 and B > 1), respectively. Notably, the intersection of the two boxes also reports the time-cost performance of the centralized HC approach (N_w = 1 and B = 1) presented in [7].13
13 Time-cost performance was not addressed in [7], given the focus on TC effectiveness and the lack of a BD infrastructure.

Figure 6: Completion time (top), cost (center), and time-cost score (bottom) of the training task, varying the number of buckets and the number of workers per bucket. Results for pure data parallelism and pure model parallelism correspond to row 1 and column 1, respectively; centralized HC corresponds to cell (1,1). For each metric the optimum combination is marked.

By looking at the results, the minimum training completion time (Fig. 6, top) is obtained with N_w = 11 workers and B = 5 buckets, corresponding to 74.54 ± 13.44 s on average. This configuration leads to +77.58%, +61.79%, and +90.02% relative reduction of the training completion time with respect to optimized pure data parallelism (with N_w = 11), pure model parallelism (with B = 5), and centralized approaches, respectively. Differently, the minimum training cost (Fig. 6, center) is obtained with N_w = 2 and B = 2, corresponding to a cost of 0.028$. This configuration leads to +24.42%, +31.02%, and +9.9% relative cost reduction w.r.t. optimized pure data parallelism (with N_w = 2), pure model parallelism (with B = 2), and centralized approaches, respectively.

Finally, we also consider a time-cost score s(t_tot, c_tot), taking into account both the training completion time (t_tot) and the cost (c_tot). In detail, we evaluate the weighted average of the two normalized metrics, which ranges from 0 to 1 (the lower, the better). More specifically, the formula is:

s(t_tot, c_tot) = λ1 r̃_{t_tot} + λ2 r̃_{c_tot}    (5)

where r̃_{t_tot} and r̃_{c_tot} represent the corresponding min-max normalized counterparts,14 with weights λ1 = λ2 = 0.5. It is worth remarking that these weights could be configured to get the desired trade-off.
14 Specifically, r̃_x ≜ (x − x_min)/(x_max − x_min).
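As an illustration of Eq. (5), the following sketch (ours; variable names and the explored grid are illustrative and not taken from the released implementation) computes the score over a set of (N_w, B) configurations and selects the best-balanced one.

```python
import numpy as np

def time_cost_score(t_tot, c_tot, lam_t=0.5, lam_c=0.5):
    """Eq. (5): weighted sum of the min-max normalized completion time and
    cost (cf. footnote 14); lower is better."""
    t = np.asarray(t_tot, dtype=float)
    c = np.asarray(c_tot, dtype=float)
    minmax = lambda x: (x - x.min()) / (x.max() - x.min())
    return lam_t * minmax(t) + lam_c * minmax(c)

# Hypothetical usage over the explored grid of configurations:
# configs = [(1, 1), (2, 2), (6, 5), (11, 5), ...]                 # (N_w, B) pairs
# best = configs[int(np.argmin(time_cost_score(times, costs)))]    # lowest score wins
```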
Thus, the best-balanced configuration is attained with N_w = 6 workers and B = 5 buckets, showing a training completion time of 82.81 ± 16.89 s and a cost of 0.047$, corresponding to a score of 0.05. This configuration leads to +76.89%, +58.78%, and +90.28% relative reduction considering pure data parallelism (with N_w = 6), pure model parallelism (with B = 5), and centralized approaches, respectively.

Summarizing, in light of the reported results and by comparison with existing literature, some important take-home messages can be drawn. First, our validation of the sole data parallelism highlights the need for a complete investigation that includes the related cost analysis. Indeed, we have found that a naive increase in the number of workers incurs a saturation of the time gain (due to synchronization overhead), which negatively impacts the cost of the training architecture. Such complementary and close-to-deployment investigation has been previously addressed only by our recent work [11], which was limited to a non-hierarchical scenario. Secondly, from a time-cost perspective, (even) pure model parallelism is beneficial for hierarchical approaches to TC: nonetheless, this requires the careful design of a feasible scheduling strategy for its effective capitalization. We believe the importance of this aspect is likely to increase due to the rising trend toward large-scale TC problems [24]. Hence, our proposal complements the lack of this analysis in similar studies on HC [5], [6], [7]. Finally, the results concerning the integration of both model (via the designed scheduler) and data parallelism granted by our BDeH approach have highlighted the high performance (in terms of cost, time, and TC effectiveness) achieved by our framework, as well as its flexibility, in comparison to existing hierarchical TC implementations (e.g. [6], [7]).
V. CONCLUSIONS AND FUTURE PERSPECTIVES

In this paper we presented BDeH, a Big Data-enabled Hierarchical framework aimed at tackling new challenges in network TC. Our framework capitalizes on two complementary types of parallelism: model and data parallelism. These are enabled by HC approaches and BD technologies, respectively.

Neither HC nor BD grants improvements with a naive approach: the proposal and analysis of their combination for TC is the main contribution of this work. In line with the recent focus on privacy and anonymity concerns, we evaluated BDeH on Anon17, a public dataset gathering traffic traces from Tor, I2P, and JonDonym. This, together with the public release of our implementation of BDeH, maximally fosters repeatability and further advances on this hot topic.

Our analysis inspected the training completion time and the cost resulting from deploying this task on public-cloud environments, in addition to TC performance. In fact, the latter may be affected by the fragmentation of the training set caused by data parallelism. We compared our proposal with both distributed flat (i.e. non-hierarchical) and centralized hierarchical (i.e. non-parallel) approaches. In the former case, there is a gain in terms of all the performance metrics (e.g. ≈ +4.5% F-measure at L3), even in the case of the sole use of data parallelism. In the latter case, there is a significant relative reduction of both training completion time and cost (e.g. ≈ 90% and ≈ 10%, respectively). Moreover, the benefits of double parallelism are also confirmed by the gain w.r.t. the two configurations constrained to either parallelism type: indeed, a relative improvement of up to ≈ 78% in terms of completion-time reduction and up to ≈ 31% in terms of cost reduction was observed. Although our experimental analysis considered only RF-based classifier nodes, our BDeH framework is quite general and the proposed methodology is not restricted to a specific ML approach. Hence, the numerical evaluation of less complex and appealing alternative classifiers, such as GBT, is feasible and of certain interest. Indeed, GBT complexity is O(n p n_t) (as opposed to O(n² p n_t) for RF), n being the number of samples within the training set, p the number of features, and n_t the number of trees.
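As a minimal illustration of this point, the per-node learner could be swapped without touching the hierarchy or the data-parallel layer; the sketch below uses scikit-learn only for conciseness and is not the released BDeH code.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

def make_node_classifier(kind="rf", n_trees=100):
    """Hypothetical factory for a single HC classifier node."""
    if kind == "rf":
        return RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    if kind == "gbt":
        return GradientBoostingClassifier(n_estimators=n_trees)
    raise ValueError(f"unknown node classifier: {kind}")

# node_model = make_node_classifier("gbt").fit(X_node, y_node)   # per-node training partition
```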
Accordingly, the proposed BDeH framework suggests the following future directions: (i) the use of more sophisticated classifier nodes (e.g. GBT) and alternative hierarchical approaches; (ii) advanced BD-enabled HC optimization, e.g. progressive censoring [7] with optimal per-node reject thresholds; (iii) stream-based learning implementations [31] of BDeH accounting for concept drift (see the sketch below); (iv) further optimization from the viewpoints of both HC and BD technologies, e.g. the per-node optimization of the tree structure and the usage of heterogeneous buckets with a diverse number of workers, respectively; (v) advanced scheduling based on TC-level prioritization.
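As a rough illustration of direction (iii), a classifier node could be updated incrementally on successive mini-batches drawn from the traffic stream; the sketch below is a generic, hypothetical example (synthetic data, scikit-learn's partial_fit) and does not address drift detection, for which stream-based frameworks such as the one in [31] are better suited.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Placeholder stream: mini-batches of (flow features, labels) routed to one HC node.
traffic_stream = ((rng.normal(size=(64, 10)), rng.integers(0, 3, size=64)) for _ in range(100))

node = SGDClassifier()                         # incremental linear learner (illustrative)
for i, (X_batch, y_batch) in enumerate(traffic_stream):
    if i == 0:
        node.partial_fit(X_batch, y_batch, classes=[0, 1, 2])  # classes required on the first call
    else:
        node.partial_fit(X_batch, y_batch)
```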
REFERENCES
[1] G. Aceto and A. Pescapé, "Internet censorship detection: A survey," Computer Networks, vol. 83, pp. 381–421, 2015.
[2] A. Dainotti, A. Pescapé, and G. Ventre, "Worm traffic analysis and characterization," in IEEE International Conference on Communications (ICC), 2007, pp. 1435–1442.
[3] V. F. Taylor, R. Spolaor, M. Conti, and I. Martinovic, "Robust smartphone app identification via encrypted network traffic analysis," IEEE Trans. Inf. Forensics Security, vol. 13, no. 1, pp. 63–78, 2017.
[4] J. Yu, H. Lee, Y. Im, M.-S. Kim, and D. Park, "Real-time classification of Internet application traffic using a hierarchical multi-class SVM," KSII Transactions on Internet & Information Systems, vol. 4, no. 5, 2010.
[5] Y. N. Dong, J. J. Zhao, and J. Jin, "Novel feature selection and classification of Internet video traffic based on a hierarchical scheme," Computer Networks, vol. 119, pp. 102–111, 2017.
[6] L. Grimaudo, M. Mellia, and E. Baralis, "Hierarchical learning for fine grained internet traffic classification," in IEEE 8th International Wireless Communications and Mobile Computing Conference (IWCMC), 2012, pp. 463–468.
[7] A. Montieri, D. Ciuonzo, G. Bovenzi, V. Persico, and A. Pescapé, "A dive into the dark web: Hierarchical traffic classification of anonymity tools," IEEE Trans. Netw. Sci. Eng., 2019.
[8] V. D'Alessandro, B. Park, L. Romano, C. Fetzer et al., "Scalable network traffic classification using distributed support vector machines," in IEEE 8th International Conference on Cloud Computing (ICCC), 2015, pp. 1008–1012.
[9] Z. Yuan and C. Wang, "An improved network traffic classification algorithm based on Hadoop decision tree," in IEEE International Conference of Online Analysis and Computing Science (ICOACS), 2016, pp. 53–56.
[10] L.-V. Le, B.-S. Lin, and S. Do, "Applying big data, machine learning, and SDN/NFV for 5G early-stage traffic classification and network QoS control," Transactions on Networks and Communications, vol. 6, no. 2, p. 36, 2018.
[11] G. Aceto, D. Ciuonzo, A. Montieri, V. Persico, and A. Pescapé, "Know your big data trade-offs when classifying encrypted mobile traffic with deep learning," in IEEE/ACM Network Traffic Measurement and Analysis Conference (TMA), 2019.
[12] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu, "Petuum: A new platform for distributed machine learning on big data," IEEE Transactions on Big Data, vol. 1, no. 2, pp. 49–67, 2015.
[13] K. Yang, H. Ma, and S. Dou, "Fog intelligence for network anomaly detection," IEEE Network, vol. 34, no. 2, pp. 78–82, 2020.
[14] K. Shahbar and A. N. Zincir-Heywood, "Packet momentum for identification of anonymity networks," Journal of Cyber Security and Mobility, vol. 6, no. 1, pp. 27–56, 2017.
[15] V. Bajpai, A. Brunstrom, A. Feldmann, W. Kellerer, A. Pras, H. Schulzrinne, G. Smaragdakis, M. Wählisch, and K. Wehrle, "The Dagstuhl beginners guide to reproducibility for experimental networking research," ACM SIGCOMM Computer Communication Review, vol. 49, no. 1, pp. 24–30, 2019.
[16] S.-H. Yoon, K.-S. Shim, S.-K. Lee, and M.-S. Kim, "Framework for multi-level application traffic identification," in 17th IEEE Asia-Pacific Network Operations and Management Symposium (APNOMS), 2015, pp. 424–427.
[17] M. Mizukoshi and M. Munetomo, "Distributed denial of services attack protection system with genetic algorithms on Hadoop cluster computing framework," in IEEE Congress on Evolutionary Computation (CEC), 2015, pp. 1575–1580.
[18] W. M. Shbair, T. Cholez, J. François, and I. Chrisment, "A multi-level framework to identify HTTPS services," in IEEE/IFIP Network Operations and Management Symposium (NOMS), 2016, pp. 240–248.
[19] M. Trevisan, I. Drago, M. Mellia, H. H. Song, and M. Baldi, "WHAT: A big data approach for accounting of modern web services," in IEEE International Conference on Big Data (Big Data), 2016, pp. 2740–2745.
[20] S. Hameed and U. Ali, "Efficacy of live DDoS detection with Hadoop," in IEEE/IFIP Network Operations and Management Symposium (NOMS), 2016, pp. 488–494.
[21] J. Lingyu, L. Yang, W. Bailing, L. Hongri, and X. Guodong, "A hierarchical classification approach for Tor anonymous traffic," in IEEE 9th International Conference on Communication Software and Networks (ICCSN), 2017, pp. 239–243.
[22] X. Li, Y. Wang, W. Ke, and H. Feng, "Real-time network traffic classification based on CDH pattern matching," in IEEE 14th International Conference on Computational Intelligence and Security (CIS), 2018, pp. 130–134.
[23] A. Alsirhani, S. Sampalli, and P. Bodorik, "DDoS detection system: Using a set of classification algorithms controlled by fuzzy logic system in Apache Spark," IEEE Transactions on Network and Service Management, vol. 16, no. 3, pp. 936–949, 2019.
[24] G. Aceto, D. Ciuonzo, A. Montieri, V. Persico, and A. Pescapé, "Mirage: Mobile-app traffic capture and ground-truth creation," in 4th IEEE International Conference on Computing, Communications and Security (ICCCS), 2019, pp. 1–8.
[25] A. Dainotti, A. Pescapé, and K. C. Claffy, "Issues and future directions in traffic classification," IEEE Netw., vol. 26, no. 1, pp. 35–40, 2012.
[26] C. N. Silla and A. A. Freitas, "A survey of hierarchical classification across different application domains," Data Mining and Knowledge Discovery, vol. 22, no. 1-2, pp. 31–72, 2011.
[27] S. R. Lawrence and E. C. Sewell, "Heuristic, optimal, static, and dynamic schedules when processing times are uncertain," Journal of Operations Management, vol. 15, no. 1, pp. 71–82, 1997.
[28] A. Montieri, D. Ciuonzo, G. Aceto, and A. Pescapé, "Anonymity services Tor, I2P, JonDonym: Classifying in the dark (web)," IEEE Trans. Depend. Sec. Comput., pp. 1–1, 2018.
[29] K. Shahbar and A. N. Zincir-Heywood, "An analysis of Tor pluggable transports under adversarial conditions," in IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2017.
[30] K. Shahbar and A. N. Zincir-Heywood, "Traffic flow analysis of Tor pluggable transports," in IEEE 11th International Conference on Network and Service Management (CNSM), 2015, pp. 178–181.
[31] P. Mulinka and P. Casas, "Stream-based machine learning for network security and anomaly detection," in ACM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks (Big-DAMA), 2018, pp. 1–7.

Giampaolo Bovenzi has been a PhD student at DIETI, University of Napoli Federico II, since November 2018. He received his MS degree (summa cum laude) from the same University in October 2018. His research interests focus on (anonymized and encrypted) traffic classification, network security (with a focus on IoT), and blockchain.

Giuseppe Aceto is an Assistant Professor at University of Napoli Federico II, where he received his PhD in Telecommunication Engineering. His research concerns network performance and censorship, both in traditional networks and SDN, and ICTs applied to health. He received the best paper award at IEEE ISCC 2010 and the 2018 Best Journal Paper Award by IEEE CSIM.

Domenico Ciuonzo (S'11-M'14-SM'16) is an Assistant Professor at University of Napoli Federico II. He holds a PhD from University of Campania L. Vanvitelli (IT) and, from 2011, he has held several visiting researcher appointments. Since 2014 he has been an editor of several IEEE, IET, and Elsevier journals. His research interests include data fusion, wireless sensor networks, the Internet of Things, network analytics, and machine learning.

Valerio Persico is an Assistant Professor at DIETI, University of Napoli Federico II, where he received the PhD in Computer and Automation Engineering in 2016. His work concerns network measurements, cloud-network monitoring, and Internet path tracing and topology discovery. He has co-authored more than 30 papers in international journals and conference proceedings.

Antonio Pescapé (SM'09) is a Full Professor of computer engineering at the University of Napoli Federico II. His work focuses on measurement, monitoring, and analysis of the Internet. He has co-authored more than 200 conference and journal papers and is the recipient of a number of research awards. Also, he has served as an independent reviewer/evaluator of research projects/project proposals co-funded by a number of governments and agencies.