A Proxy Identifier Based On Patterns in Traffic Flows
Abstract—Proxies are used commonly on today’s Internet. On one hand, end users can choose to use proxies for hiding their identities for privacy reasons. On the other hand, ubiquitous systems can use them for intercepting the traffic for purposes such as caching. In addition, attackers can use such technologies to anonymize their malicious behaviours and hide their identities. Identification of such behaviours is important for defense applications since it can facilitate the assessment of security threats. The objective of this paper is to identify proxy traffic as seen in a traffic log file without any access to the proxy server or the clients behind it. To achieve this: (i) we employ a mixture of log files to represent real-life proxy behavior, and (ii) we design and develop a data driven machine learning based approach to provide recommendations for the automatic identification of such behaviours. Our results show that we are able to achieve our objective with a promising performance even though the problem is very challenging.

Keywords—Traffic Flow; Network Security; Proxy; Behavior Analysis

I. INTRODUCTION

In general, a proxy server is a host that intercepts network traffic in order to manipulate some of its properties. For example, the most commonly known proxy is a web caching proxy, which was originally invented to enhance the performance of web browsing by intercepting the traffic to check whether the requested web object is in the proxy cache or not.

However, today proxies are also used to meet the need for anonymous web surfing [1]. Users can anonymously surf the web without revealing their own IP (Internet Protocol) addresses by using a proxy server as a stepping-stone. In doing so, the user’s actual IP address is hidden once the traffic goes through a proxy. Most of the time, when a proxy server retrieves information (objects) from web sites on behalf of the user, it provides only its own identity to the sites visited. In this way, users’ connections look as if they are targeting the proxy server rather than the services they request. This feature of proxy servers is very advantageous to users, especially when they are forced to use stepping stones in order to access Internet services that are blocked by their governments, service providers or organizations.

While web surfing over a proxy is an effective way to protect one’s anonymity and privacy, it is also a double-edged sword. It may raise security problems, too [1]. In other words, attackers can use it to hide their identities as well. Under such a scheme, users are no longer accountable because their identity from the server’s perspective is not trustworthy. Normally, a server identifies a user (client) by its IP address. However, any user can easily access a server via an unaccountable proxy, not to mention malicious users. Therefore, a server can no longer be sure whether the IP address associated with a connection is actually the address of a client or that of a stepping-stone, i.e. a proxy. Moreover, in the context of malicious users, the usage of a proxy is usually associated with botnets, which have become a common infrastructure for cyber threats and online crime. On one hand, bots are often offered or sold as proxies to anyone who does not want to be traced for their activities on the Internet. On the other hand, the use of proxies makes it harder to trace the originator, which might be a bot or a regular user.

The major challenge in the above problems lies in the lack of a capability to unambiguously identify the different behaviours of a proxy and the clients behind a proxy. When a server receives a request such as an HTTP (HyperText Transfer Protocol) request from a host (client machine), there is no systematic way to determine whether the host itself generated the request, or whether it is a proxy that is relaying the request for another host.

Proxies come in different varieties. (i) Transparent Proxy: This type of proxy identifies itself as a proxy to the visited server. Moreover, it reveals the user’s IP address, so it will not hide the user’s identity. (ii) Anonymous Proxy: This type of proxy identifies itself as a proxy server. It is detectable (as a proxy), but provides reasonable anonymity for most users by hiding their IP addresses. (iii) Distorting Proxy: This type of proxy identifies itself as a proxy server, but makes an incorrect originating IP address available through the HTTP headers. So it provides anonymity by creating a false identity. (iv) High-Anonymity Proxy: This type of proxy does not identify itself as a proxy server and does not reveal the original IP address of a user.

In this research, we aim to analyze different behaviours of high-anonymity proxies in more detail since they are the most challenging ones to identify in network traffic logs. To this end, we study and evaluate a machine learning based approach on different types of traffic logs to understand how far we can push this approach to identify the incoming proxy-based traffic on the server side. Specifically, we are interested in identifying proxies based on their behaviour seen in the traffic log files that are captured on a server that is outside of the proxy network. In other words, we assume that we do not have any a priori information or access to the proxy or to the client using the proxy. By using a machine learning based approach, we aim to discover patterns in the traffic without analyzing the payload and without checking a static feature such as an IP address, a port number or a proxy identifier. To achieve
Fig. 1: Test Bed Network
Host: pgl.yoyo.org
Connection: keep-alive
Accept: text/html, application/xhtml+xml, application/xml; q=0.9, image/webp, */*; q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,fa;q=0.6

TABLE I: A Sample of the HTTP User Agent without Using a Proxy, or in a Proxy Network running in High Anonymity Mode

Host: pgl.yoyo.org
Connection: keep-alive
Accept: text/html, application/xhtml+xml, application/xml; q=0.9, image/webp, */*; q=0.8
User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,fa;q=0.6
Via: 1.1 Localhost (squid/3.1.1g)
X-Forwarded-For: 192.168.100.6
Cache-Control: max-age=0
Connection: keep-alive

TABLE II: A Sample of the HTTP User Agent in a Proxy Network with Default Configuration, i.e. running in Transparent Mode.
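The difference between Tables I and II comes down to a few extra request headers. The following is a minimal sketch, not the method of this paper, of how a server could flag a proxy that announces itself. The header set `PROXY_HEADERS` and the helper names are illustrative assumptions, and the sample header strings are shortened copies of the two tables. A high anonymity proxy would pass this check unnoticed, which is why a flow-based approach is studied here.

```python
# Headers that a self-announcing proxy typically adds to a request.
# This set is an illustrative assumption, not an exhaustive list.
PROXY_HEADERS = {"via", "x-forwarded-for"}


def parse_headers(raw):
    """Parse 'Name: value' lines into a dict keyed by lower-cased name."""
    headers = {}
    for line in raw.strip().splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            headers[name.strip().lower()] = value.strip()
    return headers


def revealing_headers(headers):
    """Return the sorted proxy-revealing header names found in `headers`."""
    return sorted(PROXY_HEADERS & set(headers))


# Shortened version of Table I (direct access or high anonymity mode).
DIRECT = """Host: pgl.yoyo.org
Connection: keep-alive
Accept-Encoding: gzip,deflate,sdch"""

# Shortened version of Table II (transparent mode).
TRANSPARENT = """Host: pgl.yoyo.org
Via: 1.1 Localhost (squid/3.1.1g)
X-Forwarded-For: 192.168.100.6
Cache-Control: max-age=0
Connection: keep-alive"""

print(revealing_headers(parse_headers(DIRECT)))       # []
print(revealing_headers(parse_headers(TRANSPARENT)))  # ['via', 'x-forwarded-for']
```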
• Configuring the proxy server in the No-Cache mode, so that the proxy server just relays the traffic between the users and the web servers.
• Configuring the proxy server in the cache mode. In this case, we have generated two data sets:
    ◦ The cache server is empty, so that the proxy server has to refer to the web servers for every single request by the users, but at the same time it caches the traffic.
    ◦ The cache server has already cached the requests to the 500 web sites, so at this time the proxy server can respond to some of the users’ requests from its own cache.

Figure 2 summarizes the data sets generated for this work. The six blue boxes indicate the six different proxy data sets generated. Once we generated these 6 proxy data sets, we found that the default configuration of the Squid proxy server is the transparent proxy mode. This means that the Squid proxy server embeds all the information about the users in the HTTP user agents. To understand what exactly the Squid proxy reveals in this mode, we visit the following URL:

http://pgl.yoyo.org/http/browser-headers.php

Table I shows the information revealed about the client who accesses the above URL on the direct network we set up (without a proxy). Table II shows the information revealed about the client who accesses the above URL via the Local or Remote Proxy, where the Squid proxy is running in the transparent mode. These tables show that when the proxy server is in the transparent mode, it sends all the information of its client(s) to the web server. This enables the web server (or anyone analyzing the traffic) to infer that the traffic is coming from a host (client computer) behind the proxy server. In this case, the web server receives the IP address of the client from the proxy that is running in the transparent mode.

However, finding the existence of a proxy server and the client behind it is not always this simple. As discussed before, there are four operation (configuration) modes of proxy servers: (i) Transparent mode, (ii) Anonymous mode, (iii) Distorting mode, and (iv) High Anonymity mode. The aforementioned examples are all in the transparent mode. A user or attacker may use other modes of a proxy server to hide his/her identity for privacy or malicious reasons. The anonymous and the distorting modes provide some levels of anonymity, but the highest anonymity mode of a proxy server is the fourth one, High Anonymity mode. In this mode, not only is the identity of the proxy clients not sent to the web server(s) visited, but the visited web server also cannot recognize that it is communicating with a proxy server (Table I). In other words, traffic coming through a high anonymity proxy will look just the same as the traffic not using any proxy. In this case, the client can completely hide his/her identity from the web server. This mode is the most ambiguous mode from the perspective of the server that is visited (giving the service requested). We configured the Squid proxy server to operate in the high anonymity mode by adding some access controls to the Squid proxy configuration².

Then we again generated all the six types of proxy traffic explained before with the new proxy server configuration (high anonymity mode). This is what the (web) servers see as the most challenging case because there is no obvious way to find that this traffic is coming from a proxy. In Table III, we have summarized the information of all the 13 data sets generated.

Furthermore, Squid can also serve as a proxy for HTTPS (secure HTTP) traffic. However, since the traffic is encrypted, it is not possible to cache it. In such traffic, the communication between the client and the server is encrypted. Thus, when a proxy intercepts the traffic between the server and the client, it only changes the port numbers and the IP addresses. There is no way for the proxy to change the user agent, because everything in an HTTPS communication is encrypted. So, web caching and HTTP user agent configurations do not apply to encrypted proxy traffic.

To generate examples of such traffic, we created a list of web sites (166) that use only HTTPS as their communication protocol³ (such a list is not provided by Alexa). In this case, we again used the same network setup as we did for generating the HTTP (unencrypted traffic) proxy data sets. To this end, first of all, we ran the HTTPS web requests to all the web sites on the list on the direct network and captured the resulting traffic at the edge router. This traffic is used to investigate the behaviors of the non-proxy HTTPS traffic in this work. Then, we repeated this process for the clients on the local proxy network and the remote proxy network. In Table IV, we have summarized the information of our HTTPS (encrypted) proxy data sets.

² The added access controls to the Squid configuration can be found at: http://web.cs.dal.ca/~vahid/proxy/config.txt
³ The list of HTTPS URLs is available at: http://web.cs.dal.ca/~vahid/proxy/https.txt
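Data Set                                  Size (bytes)  Number of Packets  Duration (hh:mm)  Data Set ID
No Proxy                                  80,747,884    96484              1:33              1
Proxy / Header / Local / Cache Empty      80,616,017    100956             1:15              2
Proxy / Header / Local / Cache Full       61,014,664    77242              1:05              3
Proxy / Header / Local / No Cache         81,077,862    95305              1:02              4
Proxy / Header / Remote / Cache Empty     71,855,827    89135              1:07              5
Proxy / Header / Remote / Cache Full      51,100,296    62702              1:44              6
Proxy / Header / Remote / No Cache        69,429,793    84440              1:21              7
Proxy / No Header / Local / Cache Empty   80,935,673    97891              1:15              8
Proxy / No Header / Local / Cache Full    58,901,656    70714              0:52              9
Proxy / No Header / Local / No Cache      79,659,292    93854              1:00              10
Proxy / No Header / Remote / Cache Empty  66,834,471    79906              1:07              11
Proxy / No Header / Remote / Cache Full   51,607,605    62691              0:56              12
Proxy / No Header / Remote / No Cache     66,265,355    79644              1:05              13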
Data Set              Size (bytes)  Number of Packets  Duration (hh:mm)  Data Set ID
No Proxy              135,702,128   153821             1:26              14
Local Proxy Network   166,926,162   245598             2:17              15
Remote Proxy Network  216,829,519   309518             2:31              16

TABLE IV: Summary of the Generated Encrypted Proxy Data Sets

total_fpackets  total_fvolume   total_bpackets  total_bvolume
min_fpktl       mean_fpktl      max_fpktl       std_fpktl
min_bpktl       mean_bpktl      max_bpktl       std_bpktl
min_fiat        mean_fiat       max_fiat        std_fiat
min_biat        mean_biat       max_biat        std_biat
Duration        min_active      mean_active     max_active
std_active      min_idle        mean_idle       max_idle
std_idle        sflow_fpackets  sflow_fbytes    sflow_bpackets
sflow_bbytes    fpsh_cnt        bpsh_cnt        furg_cnt
burg_cnt        total_fhlen     total_bhlen
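To make a few of the listed flow features concrete, the sketch below computes five of them for a single flow: the forward packet count, the minimum and mean forward packet length, the maximum backward packet length, and the minimum backward inter-arrival time. It is a simplified illustration and not the flow exporter used in this work. The packet representation (timestamp in milliseconds, direction `"f"` or `"b"`, length in bytes) and the sample values are assumptions made for the example.

```python
from statistics import mean


def flow_features(packets):
    """Compute a handful of per-flow statistics.

    `packets` is a list of (timestamp_ms, direction, length_bytes) tuples,
    where direction is "f" (forward) or "b" (backward). This tuple layout
    is a hypothetical format chosen for the example.
    """
    fwd_len = [length for _, d, length in packets if d == "f"]
    bwd_len = [length for _, d, length in packets if d == "b"]
    bwd_times = sorted(t for t, d, _ in packets if d == "b")
    # Inter-arrival times between consecutive backward packets.
    biat = [later - earlier for earlier, later in zip(bwd_times, bwd_times[1:])]
    return {
        "total_fpackets": len(fwd_len),
        "min_fpktl": min(fwd_len) if fwd_len else 0,
        "mean_fpktl": mean(fwd_len) if fwd_len else 0,
        "max_bpktl": max(bwd_len) if bwd_len else 0,
        "min_biat": min(biat) if biat else 0,
    }


# Hypothetical five-packet flow: two forward and three backward packets.
EXAMPLE_FLOW = [
    (0, "f", 60),
    (50, "b", 1500),
    (60, "f", 100),
    (200, "b", 500),
    (250, "b", 40),
]

print(flow_features(EXAMPLE_FLOW))
```

Because none of these values depend on the payload, they can be computed in the same way for encrypted and unencrypted traffic.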
$$DR = \frac{TP}{TP + FN} \tag{1}$$

$$FPR = \frac{FP}{FP + TN} \tag{2}$$
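As a worked illustration of Equations (1) and (2), the sketch below derives both rates from true and predicted labels for one class of interest. The function names and the small label lists are hypothetical and are not taken from the data sets of this work.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FN, FP and TN, treating `positive` as the class of interest."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def detection_rate(tp, fn):
    """Equation (1): DR = TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Equation (2): FPR = FP / (FP + TN)."""
    return fp / (fp + tn)


# Hypothetical labels for eight flows.
Y_TRUE = ["Proxy", "Proxy", "Proxy", "Proxy", "NoProxy", "NoProxy", "NoProxy", "NoProxy"]
Y_PRED = ["Proxy", "Proxy", "Proxy", "NoProxy", "NoProxy", "NoProxy", "NoProxy", "Proxy"]

tp, fn, fp, tn = confusion_counts(Y_TRUE, Y_PRED, positive="Proxy")
print(detection_rate(tp, fn))       # 0.75
print(false_positive_rate(fp, tn))  # 0.25
```

In a two-class experiment, the FPR of one class equals one minus the DR of the other class, since every missed flow of one class is a false alarm for the other.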
                                                                            C4.5          Naive Bayes
Class                       Data Set ID                                     DR     FPR    DR     FPR

NoProxy-Unencrypted         1                                               92.5   10.8   88.6   33.5
Proxy-Unencrypted           2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13          89.2   7.5    66.5   11.4

NoProxy-Encrypted           14                                              97.7   3.8    16     2
Proxy-Encrypted             15, 16                                          96.2   2.3    98     84

NoProxy                     1, 14                                           94.1   7.9    17.9   4.6
Proxy                       2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16  92.1   5.9    95.4   82.1

NoProxy                     1, 14                                           90.5   4.3    17.2   4.4
Proxy-Encrypted             15, 16                                          97.3   1.6    95.6   43.6
Proxy-Unencrypted           2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13          93.5   3.5    74.3   8.5

Proxy-Encrypted             15, 16                                          99.8   0.2    96.5   21.6
Proxy-Unencrypted           2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13          99.8   0.2    78.4   3.5

NoProxy-Unencrypted         1                                               88     7.1    43     14.7
Proxy-Unencrypted-Local     2, 3, 4, 8, 9, 10                               62.3   16.5   5.8    3
Proxy-Unencrypted-Remote    5, 6, 7, 11, 12, 13                             65     18.7   80.7   67.6

NoProxy-Encrypted           14                                              95.4   2.2    14.9   2.1
Proxy-Encrypted-Local       15                                              85.1   8.3    73.6   67
Proxy-Encrypted-Remote      16                                              84.7   7      37.1   18.1

NoProxy                     1, 14                                           89.8   5.5    16.8   4.2
Proxy-Local                 2, 3, 4, 8, 9, 10, 15                           72.8   13.3   7.5    3.9
Proxy-Remote                5, 6, 7, 11, 12, 13, 16                         73.8   13.1   90.9   84.3

NoProxy-Unencrypted         1                                               88.5   7.3    36.2   11.3
Proxy-Unencrypted-Cache     2, 3, 5, 6, 8, 9, 11, 12                        41.4   17.7   7      7.1
Proxy-Unencrypted-NoCache   4, 7, 10, 13                                    62.2   28.9   82.8   68.6

NoProxy-Unencrypted         1                                               88.2   7.9    86.2   32.6
Proxy-Unencrypted-Header    2, 3, 4, 5, 6, 7                                92.1   3.7    9.3    3.6
Proxy-Unencrypted-NoHeader  8, 9, 10, 11, 12, 13                            85.5   5.5    62     35

Proxy-Unencrypted-Header    2, 3, 4, 5, 6, 7                                98.6   2      32.1   20.4
Proxy-Unencrypted-NoHeader  8, 9, 10, 11, 12, 13                            98     1.7    79.6   67.9

TABLE VI: The Employed Data Sets and the Results (Detection Rate and False Positive Rate) for 11 Proxy Identification Experiments, Using C4.5 and Naive Bayes Algorithms
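C4.5 builds its decision tree by choosing, at each node, the attribute test with the best gain ratio (information gain normalized by the split information). The sketch below shows that criterion for a single binary threshold test on one numeric flow feature. It is a simplified illustration that omits pruning and the other refinements of a full C4.5 implementation, and the feature values and labels are hypothetical rather than drawn from the data sets above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary test `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the test does not separate anything
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * log2(w_left) + w_right * log2(w_right))
    return gain / split_info


# Hypothetical minimum forward packet lengths for six flows.
MIN_FPKTL = [40, 40, 52, 60, 60, 66]
LABELS = ["NoProxy", "NoProxy", "NoProxy", "Proxy", "Proxy", "Proxy"]

# A threshold of 52 separates the two classes perfectly; 40 does not.
print(gain_ratio(MIN_FPKTL, LABELS, 52))
print(gain_ratio(MIN_FPKTL, LABELS, 40))
```

A feature whose best threshold yields a high gain ratio ends up near the root of the tree.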
Our results show that our approach is promising when the C4.5 machine learning technique is used to classify different behaviours using the traffic flow features. Specifically, we obtain very high performance with the C4.5 classifier in identifying proxy behavior under encrypted traffic conditions. Surprisingly enough, the problem is more challenging when traffic is unencrypted. We think that it is easier to mimic normal behaviour when data is unencrypted, whereas it becomes more challenging to mimic normal behaviour when data is encrypted. However, under unencrypted traffic conditions, if proxy header information is available, the problem again becomes easier and our performance increases (Table VI).

When we analyzed the patterns that are automatically discovered by the learning algorithms, we saw that the flow (Netmate) features selected by the C4.5 classifier under different cases seem to estimate the delay and the size of the flows. This is similar to what the passive measurement techniques perform, but we avoid their limitations such as setting up the method at several measurement points and synchronizing the time between them, or having access to the traffic log files of the clients’ network. In our analysis, the most important flow features contributing to finding behavioural patterns in the traffic to identify proxies are: 1) the number of packets in the forward direction, 2) the minimum forward packet length, 3) the average forward packet length, 4) the maximum backward packet length and 5) the minimum backward inter-arrival time.

V. VISUALIZATION OF DIFFERENT BEHAVIOURS

We designed and developed a prototype system based on our C4.5 machine learning based approach to analyze the different proxy traffic behaviours. In this section, we present the visualization of these behaviours by giving screenshots of the prototype system. In this case, we have employed three classifiers. These include: “Proxy” vs “No-Proxy”, “Encrypted Proxy” vs “Unencrypted Proxy”, and “Unencrypted Proxy with Header” vs “Unencrypted Proxy without Header”. Our system works on the traffic flows and classifies them according to the patterns that are discovered by the C4.5 learning algorithm. As such, traffic flows are classified based on the high level application behavior type identified in the flow. For every flow in each type (class), the source IP, the source port, the destination IP, the destination port, and the protocol type (TCP or UDP) are stored so that, if necessary, one can drill down from the flows to the packet level.

In our system we have the ability to visualize the data in either a rectangle view or a tree view. For this part of the system, we employ the “treemap” [19] and “spacetree” [20] open source programs to create the visualization component. Figure 4 is an example of visualizing the proxy traffic classification in the rectangle view. The dimensions of the rectangles are based on the number of flows of a specific proxy behaviour. As can be seen in this figure, the input data has been classified into “No-Proxy” (Yellow rectangle) and “Proxy” (left big rectangle). The Proxy class is also classified into “Encrypted” (Blue rectangle) and “Unencrypted”. The Unencrypted Proxy class is then classified into “Unencrypted Proxy with Header” (Green rectangle) and “Unencrypted Proxy without Header” (Orange rectangle). Figure 5 represents an example of visualizing the proxy traffic classification using the tree view. In this figure, the number in each node shows the number of occurrences of the specific type (class) of flows it represents. As can be seen,
ACKNOWLEDGMENT
This research is supported by the Canadian Safety and Security Program (CSSP) E-Security grant. The CSSP is led by the Defense Research and Development Canada, Centre for Security Science (CSS) on behalf of the Government of Canada and its partners across all levels of government, response and emergency management organizations, nongovernmental agencies, industry and academia. This research is conducted as part of the Dalhousie NIMS Lab at http://projects.cs.dal.ca/projectx/.