
A Proxy Identifier Based On Patterns in Traffic Flows

This document discusses methods for identifying proxy traffic based on analyzing patterns in traffic logs without access to the proxy server or clients. It introduces the challenges of proxy identification and different types of proxies. The goal is to design a machine learning approach to identify high-anonymity proxy traffic in encrypted and non-encrypted conditions based on passive traffic analysis.


2015 IEEE 16th International Symposium on High Assurance Systems Engineering

A Proxy Identifier based on Patterns in Traffic Flows

Vahid Aghaei-Foroushani and A. Nur Zincir-Heywood


Faculty of Computer Science
Dalhousie University
Halifax, NS, Canada
Email: {vahid, zincir}@cs.dal.ca

978-1-4799-8111-3/15 $31.00 © 2015 IEEE. DOI 10.1109/HASE.2015.26

Abstract—Proxies are used commonly on today's Internet. On one hand, end users can choose to use proxies for hiding their identities for privacy reasons. On the other hand, ubiquitous systems can use them for intercepting the traffic for purposes such as caching. In addition, attackers can use such technologies to anonymize their malicious behaviours and hide their identities. Identification of such behaviours is important for defense applications since it can facilitate the assessment of security threats. The objective of this paper is to identify proxy traffic as seen in a traffic log file without any access to the proxy server or the clients behind it. To achieve this: (i) we employ a mixture of log files to represent real-life proxy behavior, and (ii) we design and develop a data driven machine learning based approach to provide recommendations for the automatic identification of such behaviours. Our results show that we are able to achieve our objective with a promising performance even though the problem is very challenging.

Keywords—Traffic Flow; Network Security; Proxy; Behavior Analysis

I. INTRODUCTION

In general, a proxy server is a host which intercepts the network traffic in order to manipulate some of its properties. For example, the most commonly known proxy is a web caching proxy, which was originally invented to enhance the performance of web browsing by intercepting the traffic to check whether the requested web object is in the proxy cache or not.

However, today proxies are also used to meet the need for anonymous web surfing [1]. Users can anonymously surf the web without revealing their own IP (Internet Protocol) addresses by using a proxy server as a stepping-stone. In doing so, the user's actual IP address is hidden once it goes through a proxy. Most of the time, when a proxy server retrieves information (objects) from web sites on behalf of the user, it provides only its own identity to the sites visited. In this way, users' connections look as if they are targeting the proxy server rather than the services they request. This feature of proxy servers is very advantageous to users, especially when they are forced to use stepping stones in order to access internet services that are blocked by their governments, service providers or organizations.

While web surfing over a proxy is an effective way to protect one's anonymity and privacy, it is also a double-edged sword. It may raise security problems, too [1]. In other words, attackers can use it to hide their identity as well! Under such a scheme, users are no longer accountable because their identity from the server's perspective is not trustworthy. Normally, a server identifies a user (client) by its IP address. However, any user can easily access a server via an unaccountable proxy, not to mention malicious users. Therefore, a server can no longer be sure whether the IP address associated with a connection is actually the address of a client or that of a stepping-stone, i.e. a proxy. Moreover, in the context of malicious users, the usage of a proxy is usually associated with botnets, which have become a common infrastructure for cyber threats and online crime. On one hand, bots are often offered or sold as proxies to anyone who does not want to be traced for their activities on the Internet. On the other hand, the use of proxies increases the challenge of tracing the originator, which might be a bot or a regular user.

The major challenge to the above problems lies in the lack of a capability to unambiguously identify the different behaviours of a proxy and the clients behind a proxy. When a server receives a request such as an HTTP (HyperText Transfer Protocol) request from a host (client machine), there is no systematic way to determine whether the host itself generates the request, or it is a proxy that is relaying the request for another host.

Proxies come in different varieties. (i) Transparent proxy: This type of proxy identifies itself as a proxy to the visited server. Moreover, it reveals the user's IP address, so it will not hide the user's identity. (ii) Anonymous Proxy: This type of proxy identifies itself as a proxy server. It is detectable (as a proxy), but provides reasonable anonymity for most users by hiding their IP addresses. (iii) Distorting Proxy: This type of proxy identifies itself as a proxy server, but makes an incorrect originating IP address available through the HTTP headers. So it provides anonymity by creating a false identity. (iv) High-Anonymity Proxy: This type of proxy does not identify itself as a proxy server and does not reveal the original IP address of a user.

In this research, we aim to analyze different behaviours of high-anonymity proxies in more detail since they are the most challenging ones to identify in network traffic logs. To this end, we study and evaluate a machine learning based approach on different types of traffic logs to understand how far we can push this approach to identify the incoming proxy-based traffic on the server side. Specifically, we are interested in identifying proxies based on their behaviour seen in the traffic log files that are captured on a server that is outside of the proxy network. In other words, we assume that we do not have any a priori information or access to the proxy or to the client using the proxy. By using a machine learning based approach, we aim to discover patterns in the traffic without analyzing the payload and without checking a static feature such as an IP address, a port number or a proxy identifier. To achieve
this, we (i) employ a mixture of log files to represent real-life proxy behavior, and we (ii) design and develop a data driven machine learning based approach to provide recommendations for the automatic identification of traffic from an anonymous proxy. Finally, we investigate all of the above under both encrypted and non-encrypted traffic conditions. Our results are very promising in terms of identifying traffic systematically coming from proxies.

II. BACKGROUND

In the literature, there are few studies that aim to identify proxy traffic on a server that is outside of the proxy network [2]–[6]. Even then, these require some information about either the proxy or the client behind the proxy. There are also some previous works on identifying Network Address Translation (NAT) traffic [3], [7], which in some aspects is similar to proxy traffic. In [7], we have shown that NAT traffic can be identified by analyzing traffic flows. However, there are some major differences between a NAT server and a proxy server that make their traffic identification completely different. A NAT server modifies the IP address in the header of an IP packet to allow a private IP address to be used for traffic within a LAN (Local Area Network) and a public IP address for any communication with the rest of the Internet. On the other hand, a proxy server is located between a client (that is looking for a resource) and some other servers (that provide the resource) and acts as a mediator. The client requesting the resource connects to the proxy server and the proxy evaluates the request based on its filtering rules. In this sense, proxies are more like firewalls since they intercept the traffic, as a cache server does.

In general, we can group the previous works on identifying proxy traffic into two general categories: (i) Active measurement based schemes, and (ii) Passive measurement based schemes. In the rest of this section, we summarize these schemes and discuss their limitations.

A. Active Measurement Based Schemes

Under these schemes, it is assumed that the traffic generated by a regular client (without going through a proxy) would have different behaviours compared with the traffic relayed by a proxy that a client uses to forward his/her packets. Some researchers in the field have used this assumption for developing different schemes to detect the presence of a proxy using the inter-arrival times and payload sizes of individual packets arriving at a server, such as a web server. Using such schemes, different researchers [4]–[6] claimed to achieve approximately 90% in detecting proxy traffic.

However, most of the time, these schemes employ additional packets, called "active probes", to measure the inter-arrival times, or the delays on the network. So active probes are injected (sent) into the traffic on the network and their transit times are used to estimate (sample) the network delay, namely RTT (Round Trip Time), on the path the probes follow at the time they are sent.

One of the major problems of this scheme is that the client should be configured to reply to the active probes, i.e. measurement packets, sent from the host that is analyzing the RTTs. For example, if a web server is analyzing the traffic, then the server will send ICMP (Internet Control Message Protocol) Ping packets, or whichever "active probe" scheme is implemented. However, such a scheme would only work if the client replies to these probing requests. Another challenge of these schemes is that most of the time routers handle ICMP packets or known active probing packets in their slow path (leading to overestimation of RTT), or they simply discard them. Furthermore, most proxy servers with default configurations discard such ICMP probing packets as well.

B. Passive Measurement Based Schemes

Schemes of this type [2], [3] also rely on the assumption regarding the delays experienced and use them to identify proxies, but some passive schemes also enhance passive delay measurements with other information such as operating system or web browser information, again passively measured or fingerprinted. Passive schemes do not introduce additional packets, i.e. active probe packets, onto the network. Instead, they make use of the existing information in the traffic or other log files. We summarize the well-known passive measurement techniques below.

1) OWD – One Way Delay: This technique measures one-way delay by noting the time it takes an arbitrary packet to transit between two precisely synchronized measurement points [8], [9]. The major limitation of this technique is that OWD needs to be set up at several measurement points along the path (network), and the clocks of these measurement points need to be synchronized. Indeed, such a technique is not realistic when we only have access to the servers where the analysis is made (for example a web server for forensic analysis), but not to the other hosts (for example the proxy or the clients behind the proxy) on the network.

2) Single Measuring Point: In this case, the RTT is calculated from the time between a request packet being sent to a server and a matching reply packet coming back from the same server [10], [11]. Request/response packet-pairs are matched based on well-known fields in the packet header or payload (e.g. sequence numbers in TCP or ICMP echo packets). One of the major limitations of this approach is that it requires measurements on the client machine. Thus, this scheme works only when we are at the position of a client machine (one using the proxy server) and want to calculate the RTT to the server, but not when we are at the position of the web server and want to calculate the RTT to the client.

3) SPP – Synthetic Packet Pairs: This technique, SPP, estimates the RTT between two measurement points along a network path. Traffic is observed at both measurement points, and the RTT between the two measurement points is estimated from pairs of packets seen travelling in each direction [12]. Again, the main limitation of this technique is that it requires traffic traces from both the server and the client sides. Therefore, it is not relevant when we aim to identify the proxy traffic behaviour based only on the traffic on the server side.

III. METHODOLOGY

Given the limitations of the state of the art techniques discussed above, in this research, we propose a machine learning based approach to identify high level behaviour of proxy machines in a given network traffic trace. To this end,

we have employed two classification based learning techniques to evaluate on our traffic data sets. These are the C4.5 decision tree classifier and the Naive Bayes classifier. The reason we chose these two learning algorithms is twofold. The C4.5 learning technique provides the solution it learns from the data in a tree form using an if-then-else format. This makes it easy for a human expert to analyze the solution and to understand what the algorithm learned. In other words, the solution is no longer a black box. Moreover, a C4.5 decision tree classifier can identify not only the linear patterns in the data but also the non-linear patterns [13]. On the other hand, Naive Bayes is one of the well-known statistical learning algorithms (albeit with an opaque solution) that can identify linear patterns in the data very efficiently [13], so it naturally represents a standard baseline classifier for this work.

Classification is a supervised learning technique, where the aim is to learn a mapping from the input space to the output space whose correct values (labels) are provided by a supervisor (ground truth). Thus, both of the learning techniques employed in this work require a training phase to learn the patterns and/or mappings on the input data. Then the learned models are evaluated on unseen test data.

A. Test Bed Network

To be able to evaluate our approach, we have set up the testbed network shown in Figure 1. On our testbed, we have created three separate networks to be able to generate traffic for different scenarios and test our proposed system. These are:

• Local Proxy Network: This network is located on the Dalhousie University network and directly connected to the Proxy server. The only way this network can access the Internet is by going through the proxy.

• Remote Proxy Network: This network is located outside the Dalhousie University network, and connected to the Proxy server through the Internet. This network is configured to forward all of its traffic through the proxy server.

• Direct network: This network is directly connected to the Internet without using a proxy server.

Fig. 1: Test Bed Network

B. Squid Proxy Server

In this research, the Squid proxy server is employed to generate the proxy traffic representing different behaviours. Squid is a caching proxy [14] for the Web supporting protocols such as HTTP, HTTPS and FTP. It reduces bandwidth and improves response times by caching frequently-requested web pages and serving them from the cache. Squid has extensive access controls, runs on most available operating systems, including Windows, and is licensed under the GNU GPL.

Squid has some features that make it very suitable for analyzing proxy traffic. First of all, Squid is a free and open source proxy server. Secondly, it is widely used by Internet Service Providers (ISPs) all over the world. This enables the results of this research to potentially be used in practice. Thirdly, Squid can anonymize connections by disabling or changing specific header fields of a client's HTTP requests. Last but not least, the Squid proxy server can also be configured as a high anonymity proxy that does not identify itself as a proxy server, so that the web servers would not be able to recognize (under normal conditions) that the traffic is coming from a proxy server.

C. Generating Proxy Data Sets

To generate our data sets, we have used the URLs of the top 500 websites that are provided by Alexa. This URL list¹ is then used to generate HTTP requests under different scenarios using the different proxy networks described in Section III-A.

¹ The list of Alexa's URLs is available at: http://web.cs.dal.ca/~vahid/proxy/alexa.txt

Fig. 2: The Categories of the Generated Proxy Data Sets

In the first scenario, all the HTTP requests to the 500 Alexa websites are generated on the direct network and captured at the edge router. This traffic can be used to investigate the behaviors of unencrypted HTTP traffic that does not go through a proxy.

Then, we designed two scenarios for generating proxy traffic: (i) the proxy and the client are both on the same network, which we call local proxy; or (ii) the proxy and the client are on different networks, which we call remote proxy. In these two scenarios, the proxy traffic generation process is repeated several times, each time with a different configuration mode of the Squid proxy server, to understand how these configurations could affect (if at all) the identification of proxy traffic. Different modes of the Squid proxy server include:

• Configuring the proxy server in the No-Cache mode, so that the proxy server just relays the traffic between the users and the web servers.

• Configuring the proxy server in the cache mode. In this case, we have generated two data sets:
    ◦ The cache server is empty, so that the proxy server has to refer to the web servers for every single request by the users, but at the same time it caches the traffic.
    ◦ The cache server has already cached the requests to the 500 web sites, so at this time the proxy server can respond to some of the users' requests from its own cache.

Figure 2 summarizes the data sets generated for this work. The six blue boxes indicate the six different proxy data sets generated. Once we generated these 6 proxy data sets, we found that the default configuration of the Squid proxy server is the transparent proxy mode. This means that the Squid proxy server embeds all the information about the users in the HTTP user agents. To understand what exactly the Squid proxy reveals in this mode, we visit the following URL:

http://pgl.yoyo.org/http/browser-headers.php

Table I shows the information revealed about the client who accesses the above URL on the direct network we set up (without a proxy). Table II shows the information revealed about the client who accesses the above URL via the Local or Remote Proxy, where the Squid Proxy is running in the transparent mode.

    Host: pgl.yoyo.org
    Connection: keep-alive
    Accept: text/html, application/xhtml+xml, application/xml; q=0.9, image/webp, */*; q=0.8
    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36
    Accept-Encoding: gzip,deflate,sdch
    Accept-Language: en-US,en;q=0.8,fa;q=0.6

TABLE I: A Sample of the HTTP User Agent without Using a Proxy, or in a Proxy Network running in High Anonymity Mode

    Host: pgl.yoyo.org
    Connection: keep-alive
    Accept: text/html, application/xhtml+xml, application/xml; q=0.9, image/webp, */*; q=0.8
    User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36
    Accept-Encoding: gzip,deflate,sdch
    Accept-Language: en-US,en;q=0.8,fa;q=0.6
    Via: 1.1 Localhost (squid/3.1.1g)
    X-Forwarded-For: 192.168.100.6
    Cache-Control: max-age=0
    Connection: keep-alive

TABLE II: A Sample of the HTTP User Agent in a Proxy Network with Default Configuration, i.e. running in Transparent Mode.

These tables show that when the proxy server is in the transparent mode, it sends all the information of its client(s) to the web server. This enables the web server (or anyone analyzing the traffic) to infer that the traffic is coming from a host (client computer) behind the proxy server. In this case, the web server receives the IP address of the client from the proxy that is running in the transparent mode.

However, finding the existence of a proxy server and the client behind it is not always this simple. As discussed before, there are four operation (configuration) modes of proxy servers: (i) Transparent mode, (ii) Anonymous mode, (iii) Distorting mode, and (iv) High Anonymity mode. The aforementioned examples are all in the transparent mode. A user / attacker may use other modes of a proxy server to hide his/her identity for privacy or malicious reasons. The anonymous and the distorting modes provide some level of anonymity, but the highest anonymity mode of a proxy server is the fourth one, High Anonymity mode. In this mode, not only is the identity of the proxy clients not sent to the web server(s) visited, but the visited web server also cannot recognize that it is communicating with a proxy server (Table I). In other words, traffic coming through a high anonymity proxy will look just the same as traffic not using any proxy. In this case, the client can completely hide his/her identity from the web server. This mode is the most ambiguous mode from the perspective of the server that is visited (giving the service requested). We configured the Squid proxy server to operate in the high anonymity mode by adding some access controls to the Squid proxy configuration².

Then we again generated all six types of proxy traffic explained before with the new proxy server configuration (high anonymity mode). This is what the (web) servers see as the most challenging case because there is no obvious way to find that this traffic is coming from a proxy. In Table III, we have summarized the information of all the 13 data sets generated.

Furthermore, Squid can also serve as a proxy for HTTPS (secure HTTP) traffic. However, since the traffic is encrypted, it is not possible to cache it. In such traffic, the communication between the client and the server is encrypted. Thus, when a proxy intercepts the traffic between the server and the client, it only changes the port numbers and the IP addresses. There is no way for the proxy to change the user agent, because everything in an HTTPS communication is encrypted. So, web caching and HTTP user agent configurations do not apply to encrypted proxy traffic.

To generate examples of such traffic, we created a list of web sites (166) that use only HTTPS as their communication protocol³ (such a list is not provided by Alexa). In this case, we again used the same network setup as we did for generating the HTTP (unencrypted traffic) proxy data sets. To this end, first of all, we ran the HTTPS web requests to all the web sites on the list on the direct network and captured the resulting traffic at the edge router. This traffic is used to investigate the behaviors of the non-proxy HTTPS traffic in this work. Then, we repeated this process for the clients on the local proxy network and the remote proxy network. In Table IV, we have summarized the information of our HTTPS (encrypted) proxy data sets.

² The added access controls to the Squid configuration can be found at: http://web.cs.dal.ca/~vahid/proxy/config.txt
³ The list of HTTPS URLs is available at: http://web.cs.dal.ca/~vahid/proxy/https.txt
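The header differences between Tables I and II can be turned into a simple server-side check. The following is a minimal sketch, not the authors' method: the function name `proxy_evidence` and its return labels are hypothetical, and it assumes that `Via` and `X-Forwarded-For` are the only proxy indicators. It also illustrates why header inspection fails for high-anonymity proxies, whose requests carry neither header and are therefore indistinguishable from direct traffic.

```python
from typing import Mapping


def proxy_evidence(headers: Mapping[str, str]) -> str:
    """Label a request by the proxy-related headers it carries.

    Assumption of this sketch: only the two headers seen in Table II
    (Via, X-Forwarded-For) are inspected.
    """
    names = {name.lower() for name in headers}
    if "x-forwarded-for" in names:
        # A client address is forwarded; headers alone cannot tell whether
        # it is real (transparent proxy) or false (distorting proxy).
        return "transparent-or-distorting"
    if "via" in names:
        # The proxy announces itself but hides the client address.
        return "anonymous"
    # Either no proxy at all or a high-anonymity proxy.
    return "no-header-evidence"


direct_request = {
    "Host": "pgl.yoyo.org",
    "Connection": "keep-alive",
    "Accept-Encoding": "gzip,deflate,sdch",
}
transparent_request = dict(
    direct_request,
    **{"Via": "1.1 Localhost (squid/3.1.1g)", "X-Forwarded-For": "192.168.100.6"},
)

print(proxy_evidence(direct_request))
print(proxy_evidence(transparent_request))
```

Because the last branch covers both the direct and the high-anonymity case, a check of this kind motivates the flow-based features used in the rest of the paper.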

Proxy     Header     Location  Cache mode     Size (bytes)  Number of Packets  Duration (hh:mm)  Data Set ID
No Proxy  -          -         -              80,747,884    96484              1:33              1
Proxy     Header     Local     Cache (empty)  80,616,017    100956             1:15              2
Proxy     Header     Local     Cache (full)   61,014,664    77242              1:05              3
Proxy     Header     Local     No Cache       81,077,862    95305              1:02              4
Proxy     Header     Remote    Cache (empty)  71,855,827    89135              1:07              5
Proxy     Header     Remote    Cache (full)   51,100,296    62702              1:44              6
Proxy     Header     Remote    No Cache       69,429,793    84440              1:21              7
Proxy     No Header  Local     Cache (empty)  80,935,673    97891              1:15              8
Proxy     No Header  Local     Cache (full)   58,901,656    70714              0:52              9
Proxy     No Header  Local     No Cache       79,659,292    93854              1:00              10
Proxy     No Header  Remote    Cache (empty)  66,834,471    79906              1:07              11
Proxy     No Header  Remote    Cache (full)   51,607,605    62691              0:56              12
Proxy     No Header  Remote    No Cache       66,265,355    79644              1:05              13

TABLE III: Summary of the Generated Unencrypted Proxy Data Sets

Data Set               Size (bytes)  Number of Packets  Duration (hh:mm)  Data Set ID
No Proxy               135,702,128   153821             1:26              14
Local Proxy Network    166,926,162   245598             2:17              15
Remote Proxy Network   216,829,519   309518             2:31              16

TABLE IV: Summary of the Generated Encrypted Proxy Data Sets

total_fpackets   total_fvolume    total_bpackets   total_bvolume
min_fpktl        mean_fpktl       max_fpktl        std_fpktl
min_bpktl        mean_bpktl       max_bpktl        std_bpktl
min_fiat         mean_fiat        max_fiat         std_fiat
min_biat         mean_biat        max_biat         std_biat
duration         min_active       mean_active      max_active
std_active       min_idle         mean_idle        max_idle
std_idle         sflow_fpackets   sflow_fbytes     sflow_bpackets
sflow_bbytes     fpsh_cnt         bpsh_cnt         furg_cnt
burg_cnt         total_fhlen      total_bhlen

TABLE V: Features Employed in the Trained Models of the Classifiers [17]

D. Features Employed
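To make the feature names in Table V concrete, the sketch below computes a small subset of them for one bidirectional flow, where a flow is taken to be the time-ordered packets sharing one five-tuple. It is an illustration only and not Netmate's implementation. Several points are assumptions: `fpktl`/`bpktl` are read as forward/backward packet length and `fiat`/`biat` as forward/backward inter-arrival time, times are in seconds, the standard deviation is the population form, and the `Packet` class and `flow_features` function are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List


@dataclass
class Packet:
    timestamp: float  # capture time in seconds (unit is an assumption)
    length: int       # packet length in bytes
    forward: bool     # True if sent in the direction of the flow's first packet


def _summary(name: str, values: List[float]) -> Dict[str, float]:
    """Return min/mean/max/std of one series; all zeros if it is empty."""
    if not values:
        return {f"{stat}_{name}": 0.0 for stat in ("min", "mean", "max", "std")}
    return {
        f"min_{name}": float(min(values)),
        f"mean_{name}": float(mean(values)),
        f"max_{name}": float(max(values)),
        f"std_{name}": float(pstdev(values)),  # population std (assumed)
    }


def _gaps(packets: List[Packet]) -> List[float]:
    """Inter-arrival times between consecutive packets of one direction."""
    times = [p.timestamp for p in packets]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def flow_features(packets: List[Packet]) -> Dict[str, float]:
    """Compute a subset of Table V style statistics for one time-ordered flow."""
    fwd = [p for p in packets if p.forward]
    bwd = [p for p in packets if not p.forward]
    feats: Dict[str, float] = {
        "total_fpackets": len(fwd),
        "total_fvolume": sum(p.length for p in fwd),
        "total_bpackets": len(bwd),
        "total_bvolume": sum(p.length for p in bwd),
        "duration": packets[-1].timestamp - packets[0].timestamp if packets else 0.0,
    }
    feats.update(_summary("fpktl", [float(p.length) for p in fwd]))
    feats.update(_summary("bpktl", [float(p.length) for p in bwd]))
    feats.update(_summary("fiat", _gaps(fwd)))
    feats.update(_summary("biat", _gaps(bwd)))
    return feats


example_flow = [
    Packet(0.0, 100, True),
    Packet(0.5, 200, False),
    Packet(1.0, 300, True),
    Packet(2.0, 100, True),
]
features = flow_features(example_flow)
```

Note that no IP address or port number appears in the resulting feature dictionary, which matches the feature selection described in this section.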

In this work, we converted our packet based traffic traces (tcpdump files) to traffic flows. To this end, the Netmate open source tool [15] is employed to generate the flows and compute the statistical features for each flow. Once the flows are generated, we do not use the source and destination IP addresses or the source and destination port numbers in the feature set that represents the flow traffic to our classifiers, because port numbers can be assigned dynamically and IP addresses can be anonymized very easily. Moreover, such information may bias the results. Our aim here is to find patterns in the traffic without using any biased features. Indeed, to be able to apply our approach to both the encrypted and the unencrypted traffic, we do not employ any payload (application layer) information as features for our classifiers.

1) Netmate Features: Netmate [15] is an open source flow generator (exporter). Flows are bidirectional and the first packet of the flow identified by Netmate determines the forward (source to destination) direction. A flow can be uniquely identified by five parameters within a certain time period. These parameters are source and destination IP addresses, source and destination port numbers, and protocol. Netmate considers only UDP and TCP flows. Moreover, UDP flows are terminated by a flow timeout, whereas TCP flows are terminated upon proper connection teardown or by a flow timeout, whichever occurs first. The flow timeout value employed in this work is 600 seconds as recommended by the IETF [16]. The statistical features generated by the Netmate tool are shown in Table V.

E. Overview of the Proposed System

As discussed above, our proposed system is a machine learning based approach using the network flow features. To this end, our 16 proxy traffic traces (discussed earlier) are employed for flow feature extraction, using the Netmate tool. Brief statistics on these traffic traces are given in Tables III and IV.

The next step is to randomly sample (using uniform probability) data sets from the different categories of flows. Then we use a hierarchical approach to train our classifiers on these randomly sampled flows as Proxy versus No-Proxy traffic. For this work, we employ the C4.5 and Naive Bayes learning techniques for classification purposes. Then the output of this process becomes the input for the Encrypted Proxy vs Unencrypted Proxy classifier for identifying high level behavior of the proxy traffic. After this, all flows identified as proxy, encrypted and unencrypted, run through the Local Proxy vs Remote Proxy classifier. This classifier detects whether the clients behind the proxy server are located on the same network as the proxy server or on a different network. Whether the unencrypted proxy traffic is classified as local or remote, it still runs through two other classifiers: the Cache vs No-Cache classifier (which detects whether the proxy server caches the forwarded traffic or not) and the Header vs No-Header classifier (which detects whether there is any fingerprint of the proxy device in the header of the HTTP request or not). Figure 3 gives an overview of the prototype system developed and used in these evaluations.

It should be noted here that all of this analysis is performed on a machine (web server) that is on a different network than the one where the proxy and its clients are located. The following summarizes the machine learning techniques employed.

1) Decision Tree Algorithm: C4.5 is an algorithm that generates a decision tree using information gain. A decision tree is a hierarchical data structure for implementing a divide-and-conquer strategy. C4.5 is an efficient non-parametric technique that can be used for both classification and regression

$$ \mathit{DR} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}} \qquad (1) $$

Whereas the FPR reflects the number of out-class (anything that is not in-class) flows incorrectly classified (as in-class) using Eq. 2:

$$ \mathit{FPR} = \frac{\mathit{FP}}{\mathit{FP} + \mathit{TN}} \qquad (2) $$

Naturally, a high DR and a low FPR are the most desirable outcomes. Moreover, False Negative, FN, reflects the in-class traffic that is classified as out-class traffic, and True Negative, TN, reflects the out-class traffic that is classified correctly. In the following we present the results of the C4.5 and Naive Bayes classification algorithms on our data sets.
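As a worked illustration of Eqs. 1 and 2, the following sketch derives the four confusion counts from true and predicted labels and then computes DR and FPR for one chosen in-class. The function names and the toy labels are hypothetical and are not taken from the paper's experiments; returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[str], y_pred: Sequence[str], in_class: str
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) treating `in_class` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == in_class:
            if guess == in_class:
                tp += 1
            else:
                fn += 1
        else:
            if guess == in_class:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def detection_rate(tp: int, fn: int) -> float:
    """Eq. 1: DR = TP / (TP + FN); 0.0 if there are no in-class samples."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Eq. 2: FPR = FP / (FP + TN); 0.0 if there are no out-class samples."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy labels, for illustration only.
y_true = ["proxy", "proxy", "proxy", "noproxy", "noproxy"]
y_pred = ["proxy", "proxy", "noproxy", "proxy", "noproxy"]

tp, fp, tn, fn = confusion_counts(y_true, y_pred, in_class="proxy")
dr = detection_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
```

For multi-class cases, the same computation can be repeated once per class, each time treating that class as the in-class and all others as out-class.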

Fig. 3: An Overview of our Prototype System

problems. C4.5 constructs decision trees from a set of training data by applying the concept of information entropy. The training data is a set, such that each input of the set is an instance of already classified samples. Each sample in the set is a vector where each element in the vector represents a feature of the sample. C4.5 can split the data into smaller subsets using the fact that each feature of the data can be used to make a decision (one class versus another class). The feature with the highest information gain is used to make the decision of the split. A more detailed explanation of the C4.5 algorithm can be found in [13].

2) Naive Bayes Algorithm: A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (Naive) independence assumptions. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Depending on the precise nature of the probability model, Naive Bayes classifiers can be trained efficiently in a supervised learning approach. More detailed information on the Naive Bayesian algorithm can be found in [13].

IV. EXPERIMENTS AND RESULTS

In this work, the learning models of the C4.5 and Naive Bayes algorithms are trained and tested using WEKA [18]. To this end, we use the default parameters for all the techniques employed in WEKA. As discussed earlier, the aforementioned traffic data sets, summarized in Tables III and IV, are used during these evaluations⁴. In these experiments, we did not perform any parameter sensitivity analysis, because our aim is to understand how far we can push such an approach as a black box system.

⁴ The NIMS Lab proxy traffic data sets are available for testing and benchmarking purposes at: http://web.cs.dal.ca/~vahid/proxy.html

In traffic classification, two metrics are typically used in order to quantify the performance of the classifier: Detection Rate (DR) and False Positive Rate (FPR). In this case, DR reflects the number of in-class (the class that we are interested in) flows correctly classified and is calculated using Eq. 1.

A. Results of the Classification Experiments

Given our approach discussed above, to identify the proxy traffic, we considered 11 different cases, including:
1) “NoProxy Unencrypted” vs. “Proxy Unencrypted”
2) “NoProxy Encrypted” vs. “Proxy Encrypted”
3) “NoProxy” vs. “Proxy”
4) “NoProxy” vs. “Proxy Encrypted” vs. “Proxy Unencrypted”
5) “Proxy Encrypted” vs. “Proxy Unencrypted”
6) “NoProxy Unencrypted” vs. “Proxy Unencrypted Local” vs. “Proxy Unencrypted Remote”
7) “NoProxy Encrypted” vs. “Proxy Encrypted Local” vs. “Proxy Encrypted Remote”
8) “NoProxy” vs. “Proxy Local” vs. “Proxy Remote”
9) “NoProxy Unencrypted” vs. “Proxy Unencrypted Cache” vs. “Proxy Unencrypted NoCache”
10) “NoProxy Unencrypted” vs. “Proxy Unencrypted Header” vs. “Proxy Unencrypted NoHeader”
11) “Proxy Unencrypted Header” vs. “Proxy Unencrypted NoHeader”

For each case, we performed two different sets of tests: one for the C4.5 and one for the Naive Bayes classifiers. In both sets of experiments, balanced training data sets are employed. Tables III and IV present the ID of each of our proxy traffic data sets for ease of referencing, and Table VI presents the results of all of the proxy identification experiments performed on these data sets in the scope of this work.

B. Analysis of Results

When we analyzed the results presented in Table VI, we saw that identifying proxy traffic from a given traffic file of a host such as a web server that does not have access to, or a priori information about, the proxy network is a very challenging problem. The main reason behind this is the fact that proxy behavior is very diverse. The diversity comes from: (i) the different kinds of proxies used, (ii) the location (relative to the client) of the proxy used, and (iii) whether the traffic is encrypted or not.

We have looked into 11 different cases using 16 different traffic files to investigate whether we can differentiate such diverse traffic behaviours using a machine learning approach.
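
To make the split criterion from the C4.5 description above concrete, the following sketch computes entropy and information gain for a binary split on a single numeric flow feature. It is an illustration only: the feature values, labels and thresholds are made up, the function names are hypothetical, and C4.5 implementations usually normalise this quantity into a gain ratio, whereas the sketch shows the plain information gain named in the text.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Hypothetical toy data: minimum forward packet length of eight flows.
min_fwd_len = [40, 40, 52, 52, 60, 60, 66, 66]
label = ["NoProxy"] * 4 + ["Proxy"] * 4

gain_good = information_gain(min_fwd_len, label, threshold=52)  # separates the classes
gain_poor = information_gain(min_fwd_len, label, threshold=40)  # leaves them mixed
print(gain_good, gain_poor)
```

On this toy data the threshold of 52 separates the two classes completely, so its gain equals the full entropy of the labels (1 bit), while the threshold of 40 yields a smaller gain. A tree learner would evaluate such candidate splits for every feature and keep the best one.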

                                        C4.5        Naive Bayes
Case  Class                         DR (%) FPR (%)  DR (%) FPR (%)  Data Set ID
1     NoProxy-Unencrypted             92.5    10.8    88.6    33.5  1
1     Proxy-Unencrypted               89.2     7.5    66.5    11.4  2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
2     NoProxy-Encrypted               97.7     3.8      16       2  14
2     Proxy-Encrypted                 96.2     2.3      98      84  15, 16
3     NoProxy                         94.1     7.9    17.9     4.6  1, 14
3     Proxy                           92.1     5.9    95.4    82.1  2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16
4     NoProxy                         90.5     4.3    17.2     4.4  1, 14
4     Proxy-Encrypted                 97.3     1.6    95.6    43.6  15, 16
4     Proxy-Unencrypted               93.5     3.5    74.3     8.5  2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
5     Proxy-Encrypted                 99.8     0.2    96.5    21.6  15, 16
5     Proxy-Unencrypted               99.8     0.2    78.4     3.5  2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
6     NoProxy-Unencrypted               88     7.1      43    14.7  1
6     Proxy-Unencrypted-Local         62.3    16.5     5.8       3  2, 3, 4, 8, 9, 10
6     Proxy-Unencrypted-Remote          65    18.7    80.7    67.6  5, 6, 7, 11, 12, 13
7     NoProxy-Encrypted               95.4     2.2    14.9     2.1  14
7     Proxy-Encrypted-Local           85.1     8.3    73.6      67  15
7     Proxy-Encrypted-Remote          84.7       7    37.1    18.1  16
8     NoProxy                         89.8     5.5    16.8     4.2  1, 14
8     Proxy-Local                     72.8    13.3     7.5     3.9  2, 3, 4, 8, 9, 10, 15
8     Proxy-Remote                    73.8    13.1    90.9    84.3  5, 6, 7, 11, 12, 13, 16
9     NoProxy-Unencrypted             88.5     7.3    36.2    11.3  1
9     Proxy-Unencrypted-Cache         41.4    17.7       7     7.1  2, 3, 5, 6, 8, 9, 11, 12
9     Proxy-Unencrypted-NoCache       62.2    28.9    82.8    68.6  4, 7, 10, 13
10    NoProxy-Unencrypted             88.2     7.9    86.2    32.6  1
10    Proxy-Unencrypted-Header        92.1     3.7     9.3     3.6  2, 3, 4, 5, 6, 7
10    Proxy-Unencrypted-NoHeader      85.5     5.5      62      35  8, 9, 10, 11, 12, 13
11    Proxy-Unencrypted-Header        98.6       2    32.1    20.4  2, 3, 4, 5, 6, 7
11    Proxy-Unencrypted-NoHeader        98     1.7    79.6    67.9  8, 9, 10, 11, 12, 13

TABLE VI: The Employed Data Sets and the Results (Detection Rate and False Positive Rate) for 11 Proxy Identification Experiments, Using C4.5 and Naive Bayes Algorithms
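
In a two-class experiment the two metrics are tied together: every flow of one class that is missed counts as a false positive for the other class, so the FPR of one class equals 100 minus the DR of the other. The sketch below checks this relation on the C4.5 values of cases 1 and 2 copied from Table VI. The dictionary layout, the function name and the rounding tolerance are illustrative choices, not part of the original evaluation.

```python
# C4.5 results for two of the two-class cases, copied from Table VI (in percent).
# Each entry maps a class name to its (DR, FPR) pair.
c45_two_class = {
    "case 1": {"NoProxy-Unencrypted": (92.5, 10.8), "Proxy-Unencrypted": (89.2, 7.5)},
    "case 2": {"NoProxy-Encrypted": (97.7, 3.8), "Proxy-Encrypted": (96.2, 2.3)},
}


def is_consistent(rates, tol=0.05):
    """True if the FPR of each class equals 100 - DR of the other class (within rounding)."""
    (dr_a, fpr_a), (dr_b, fpr_b) = rates.values()
    return abs(fpr_a - (100 - dr_b)) <= tol and abs(fpr_b - (100 - dr_a)) <= tol


checks = {case: is_consistent(rates) for case, rates in c45_two_class.items()}
print(checks)
```

The relation does not carry over to the three-class cases, where the flows missed for one class can be assigned to either of the other two.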

Our results show that our approach is promising when the C4.5 machine learning technique is used to classify different behaviours using the traffic flow features. Specifically, we obtain very high performance with the C4.5 classifier when identifying proxy behavior under encrypted traffic conditions. Surprisingly enough, the problem is more challenging when traffic is unencrypted. We think that it is easier to mimic normal behaviour when data is unencrypted, whereas it becomes more challenging to mimic normal behaviour when data is encrypted. However, under unencrypted traffic conditions, if proxy header information is available, the problem again becomes easier and our performance increases (Table VI).

When we analyzed the patterns that are automatically discovered by the learning algorithms, we saw that the flow (Netmate) features selected by the C4.5 classifier under different cases seem to estimate the delay and the size of the flows. This is similar to what the passive measurement techniques perform, but we avoid their limitations such as setting up the method in several measurement points and synchronizing the time between them, or having access to the traffic log files of the clients’ network. In our analysis, the most important flow features contributing to finding behavioural patterns in the traffic to identify proxies are: 1) the number of packets in the forward direction, 2) the minimum forward packet length, 3) the average of the forward packet length, 4) the maximum backward packet length, and 5) the minimum backward inter-arrival time.

V. VISUALIZATION OF DIFFERENT BEHAVIOURS

We designed and developed a prototype system based on our C4.5 machine learning based approach to analyze different proxy traffic behaviours. In this section, we present the visualization of these behaviours by giving screenshots of the prototype system. In this case, we have employed three classifiers. These include: “Proxy” vs. “No-Proxy”, “Encrypted Proxy” vs. “Unencrypted Proxy”, and “Unencrypted Proxy with Header” vs. “Unencrypted Proxy without Header”. Our system works on the traffic flows and classifies them according to the patterns that are discovered by the C4.5 learning algorithm. As such, traffic flows are classified based on the high level application behavior type identified in the flow. For every flow in each type (class), the source IP, the source port, the destination IP, the destination port, and the protocol type (TCP or UDP) are stored so that, if necessary, one can dig down to packet level from the flows.

In our system we have the ability to visualize the data in either a rectangle view or a tree view. For this part of the system, we employ the “treemap” [19] and “spacetree” [20] open source programs to create the visualization component. Figure 4 is an example of visualizing the proxy traffic classification in the rectangle view. The dimensions of the rectangles are based on the number of flows of a specific proxy behaviour. As can be seen in this figure, the input data has been classified into “No-Proxy” (Yellow rectangle) and “Proxy” (left big rectangle). The Proxy class is also classified into “Encrypted” (Blue rectangle) and “Unencrypted”. The Unencrypted Proxy class is then classified into “Unencrypted Proxy with Header” (Green rectangle) and “Unencrypted Proxy without Header” (Orange rectangle). Figure 5 represents an example of visualizing the proxy traffic classification using the tree view. In this figure, the number in each node shows the number of occurrences of the specific type (class) of flows it represents. As can be seen,
the input data has been classified into “Proxy” and “NoProxy” nodes based on the trained C4.5 classification models. For each of these nodes, there are other sub-nodes, which can be viewed by clicking on the node of interest and digging further down in the traffic.

Fig. 4: Visualizing the Proxy Traffic Classification in Rectangle View

Fig. 5: Visualizing the Proxy Traffic Classification in Tree View

VI. CONCLUSIONS AND FUTURE WORKS

In this research, we study how to identify the traffic coming from different clients behind a proxy device, whether the traffic is encrypted or unencrypted. To this end, we employed a machine learning based approach using only traffic flow information. To achieve this, we evaluated two learning techniques, namely C4.5 and Naive Bayes, using a flow exporter, Netmate. In doing so, we compared the performances of two different learning techniques for the same traffic capture.

Moreover, the proposed system includes a visualization component that enables the user to analyze the traffic flows and their high level application behaviors in terms of two views. These are: (i) Rectangle view; and (ii) Tree view.

Our results show that it is possible to identify different behaviours of the proxy traffic using the C4.5 based classifier based on the Netmate flow features. Our results show that different statistical features in the traffic flows indicate consistent patterns for traffic coming through different types of proxies. Our analysis shows that the most challenging behaviours are hidden in the unencrypted channels and are under no-cache proxy traffic. It should be noted here that we perform this analysis without using any a priori information or any access information on the proxy server or the clients behind it.

Future research will investigate different types of proxies and anonymizers in both encrypted and unencrypted tunnels using different flow feature sets.

ACKNOWLEDGMENT

This research is supported by the Canadian Safety and Security Program (CSSP) E-Security grant. The CSSP is led by the Defense Research and Development Canada, Centre for Security Science (CSS) on behalf of the Government of Canada and its partners across all levels of government, response and emergency management organizations, nongovernmental agencies, industry and academia. This research is conducted as part of the Dalhousie NIMS Lab at http://projects.cs.dal.ca/projectx/.

REFERENCES

[1] B. Li, E. Erdin, M. H. Gunes, G. Bebis, and T. Shipley. Review: An overview of anonymity technology usage. Computer Communications, Elsevier, 36(12):1–37, 2013.
[2] R. Beverly. A robust classifier for passive tcp/ip fingerprinting. In Passive and Active Measurement Workshop: 5th international workshop, France: Springer, 3015:158–167, 2004.
[3] G. Maier, F. Schneider, and A. Feldmann. Nat usage in residential broadband networks. Proceedings of the 12th international conference on Passive and active measurement, pages 32–41, 2011.
[4] Han-Wei Hsiao and Wei-Cheng Fan. Detecting step-stone with network traffic mining approach. Fourth International Conference on Innovative Computing, Information and Control (ICICIC), pages 1176–1179, 2009.
[5] HC. Wu and SH. Huang. Neural network based detection of stepping stone intrusion. Expert Systems with Applic., 32(2):1431–1437, 2010.
[6] RM. Lin, YC. Chou, and KT. Chen. Stepping stone detection at the server side. IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 964–969, 2011.
[7] Yasemin Gokcen, Vahid Aghaei-Foroushani, and A. Nur Zincir-Heywood. Can we identify nat behavior by analyzing traffic flows? IEEE Security and Privacy Workshops, International Workshop on Cyber Crime (IWCC), 2014.
[8] T. Zseby, S. Zander, and G. Carle. Evaluation of building blocks for passive one-way-delay measurement. Passive and Active Measurement Workshop, April 2001.
[9] S. Niccolini, M. Molina, F. Raspall, and S. Tartarelli. Design and implementation of a one way delay passive measurement system. 9th IEEE/IFIP Network Operations and Management Symposium (NOMS), 1:469–482, April 2004.
[10] Hao Jiang and Constantinos Dovrolis. Passive estimation of tcp round-trip times. ACM Computer Communication Review (SIGCOMM), 32(3), July 2002.
[11] B. Veal, K. Li, and D. Lowenthal. New methods for passive estimation of tcp round-trip times. 6th international conference on Passive and Active Network Measurement, Springer-Verlag, pages 121–134, 2005.
[12] Sebastian Zander, Grenville Armitage, Thuy Nguyen, Lutz Mark, and Brandon Tyo. Passive estimation of tcp round-trip times. IEEE 38th Conference on Local Computer Networks (LCN), pages 264–267, October 2013.
[13] E. Alpaydin. Introduction to Machine Learning. MIT Press, Massachusetts, 2004.
[14] Squid, accessed September 10, 2014. http://www.squid-cache.org.
[15] Netmate, accessed September 10, 2014. http://ip-measurement.org/tools/netmate/.
[16] Ietf realtime traffic flow measurement (rtfm), accessed September 10, 2014. http://www.ietf.org/proceedings/38/97apr-final/xrtftr70.htm.
[17] The output features of the netmate, accessed September 10, 2014. https://code.google.com/p/netmate-flowcalc/wiki/Features.
[18] Weka: Data mining software in java, accessed September 10, 2014. http://www.cs.waikato.ac.nz/ml/weka/.
[19] Treemap space visualization, accessed September 10, 2014. http://www.cs.umd.edu/hcil/treemap/.
[20] Spacetree tree browser, accessed September 10, 2014. http://www.cs.umd.edu/hcil/spacetree/.
