The Internet Is For Porn Measurement and Analysis of Online Adult Traffic

1) The document analyzes a large dataset of over 323 terabytes of traffic from 80 million users to dozens of major adult websites over a week. 2) It finds adult traffic is primarily video and images, with unique temporal access patterns. Content popularity distributions are skewed. 3) It provides insights into aggregate traffic characteristics, content types and sizes, popularity and temporal dynamics, and user engagement that can inform optimizations to adult content delivery.

The Internet is For Porn:

Measurement and Analysis of Online Adult Traffic


Faraz Ahmed∗, M. Zubair Shafiq†, Alex X. Liu∗
∗ Department of Computer Science and Engineering, Michigan State University. {farazah, alexliu}@cse.msu.edu
† Department of Computer Science, The University of Iowa. [email protected]

Abstract—Adult (or pornographic) websites attract a large number of visitors and account for a substantial fraction of the global Internet traffic. However, little is known about the makeup and characteristics of online adult traffic. In this paper, we present the first large-scale measurement study of online adult traffic using HTTP logs collected from a major commercial content delivery network. Our data set contains approximately 323 terabytes worth of traffic from 80 million users, and includes traffic from several dozen major adult websites and their users on four different continents. We analyze several characteristics of online adult traffic including content and traffic composition, device type composition, temporal dynamics, content popularity, content injection, and user engagement. Our analysis reveals several unique characteristics of online adult traffic. We also analyze implications of our findings on adult content delivery. Our findings suggest several content delivery and cache performance optimizations for adult traffic, e.g., modifications to website design, content delivery, cache placement strategies, and cache storage configurations.

Index Terms—Adult websites; Content delivery; Porn

I. INTRODUCTION

Background. As the saying goes: “The Internet is for porn” [7]. While it is difficult to estimate how much porn content is available on the Internet [27], there are several widely varying estimates available. According to one estimate, there are at least 4 million adult websites on the Internet, which constitute approximately 12% of all websites [3]. It has also been reported that adult websites have more monthly unique visitors than Netflix, Twitter, and Amazon combined [4], [19]. Furthermore, multiple websites in Alexa’s top-500 global list serve adult content [1]. Moreover, a recent measurement study of a tier-1 ISP reported that at least 15% of all mobile video traffic in the United States is from adult content providers [9]. Overall, these statistics indicate that online adult content attracts a large number of users and accounts for a substantial fraction of the global Internet traffic. However, despite its significant volume, little is known about the makeup and characteristics of online adult traffic. Understanding online adult traffic is important for optimizing its content delivery, which involves complex interactions between content delivery networks and ISPs.

Limitation of Prior Art. There is little prior work on dedicated analysis of online adult websites. Only recently, Tyson et al. [25], [26] studied behavioral aspects of two popular adult sites: YouPorn and PornHub. These studies reveal several unique characteristics of adult content such as the elasticity of adult content consumption: users consume whatever they find on the front pages of adult websites. However, the utility of these studies is limited for three main reasons: (1) they rely on website data obtained by crawling, which is limited in terms of both temporal coverage and granularity; (2) their analysis cannot distinguish among users because they rely on aggregate view counts of adult videos; and (3) the focus of these studies is on the behavioral and demographic aspects of two specific adult websites, and whether their findings are representative of other adult websites is unclear.

Main Findings. In this paper, we conduct the first large-scale measurement study of online adult traffic using HTTP logs collected from a major commercial content delivery network. The week-long HTTP logs include traffic from several dozen major adult websites and their users on four different continents. Overall, our HTTP logs account for approximately 323 terabytes worth of traffic from 80 million users. Based on these traces, we present some detailed characteristics of five popular adult websites. We choose these websites on the basis of their popularity, e.g., several of these adult websites have been ranked in the global Alexa top-500 list [1]. This selection represents a broad variety of adult websites, ranging from traditional YouTube-style adult video services and adult image sharing services to adult social networking services.

We provide an in-depth analysis of online adult traffic including its aggregate, content, and user dynamics. Below, we summarize our key findings and their implications on adult content delivery.

1) Aggregate Analysis: Adult traffic primarily consists of video and image multimedia content. For many popular adult websites, up to 99% of traffic volume consists of video and image content. While the majority of users access adult websites from desktops, smartphones and other mobile devices account for a non-trivial fraction of visitors.

2) Content Analysis: Adult content has widely varying sizes: images are generally less than 1 megabyte and videos are on the order of tens of megabytes. Content popularity distributions exhibit the expected skewness. The temporal access patterns of adult websites are unique and different from the typical diurnal access patterns of traditional web content. Our clustering analysis of content popularity reveals groups of objects with diurnal, long-lived, and short-lived temporal access patterns.
3) User Analysis: User engagement with adult websites is shorter than with non-adult websites. However, adult content can be addictive, i.e., some users repeatedly access certain content. For example, at least 10% of video objects have more than 10 requests per unique user.

4) Implications: Our findings have implications on adult website design and content delivery infrastructure management. For example, a vast majority of users do not visit adult websites on smartphones. This finding highlights the need for adult websites to improve their web interfaces and content delivery strategies for mobile devices. As another example, due to their unique diurnal access patterns, it is important to separately account for adult traffic in the traffic forecasting models and network resource allocation. We also find that adult content providers cannot rely on the browser cache to store locally popular content because of the prevalent use of incognito/private web browsing. Adult content providers can instead optimize content caching performance through customized networked cache configuration. For example, content delivery networks can improve performance and reduce network traffic by pushing copies of popular adult objects to locations closer to their end-users.

Paper Organization. The rest of the paper is organized as follows. Section II reviews prior work on adult traffic analysis. We provide background and details of our data collection methodology in Section III. Section IV discusses measurement and analysis of online adult traffic. We discuss implications of our findings in Section V. Section VI concludes the paper.

II. RELATED WORK

Despite a large number of studies on general web traffic analysis (e.g., [6], [18], [21], [24], [29], [32]), little is known about the makeup and characteristics of online adult traffic. To fill this gap, this paper presents the first in-depth and large-scale characterization of online adult traffic. To provide some context for our study, we review prominent related work below.

Wondracek et al. studied the economic and security issues of the online adult industry [28]. In their study, the authors used manual inspection and automated crawling to investigate the characteristics of adult websites. They classified adult websites based on their functionality into the following categories: paysites, link collections, search engines, domain redirector services, keyword-based redirectors, and traffic brokers. They also created two adult websites from scratch for traffic profiling and vulnerability assessment. They concluded that several prevalent practices in the online adult industry are questionable and can be used to conduct malicious activities. This work provides a broad understanding of the economic and security issues of the online adult industry. In contrast, we analyze detailed HTTP access logs of several popular online adult websites to understand their content and user traffic patterns.

More recently, Tyson et al. conducted measurement studies of two Web 2.0 adult websites, YouPorn and PornHub [25], [26]. In [25], the authors crawled YouPorn to understand user interactions with the website. This measurement study shows that the fraction of user-generated content on YouPorn is much smaller than on non-adult video websites like YouTube and Vimeo. However, they found that YouPorn has more average views per video than other non-adult video websites. The authors concluded that the video viewing behavior on YouPorn depends on two main factors: front page browsing patterns and the number of categories assigned to a particular video. They found that the videos with the most views are located on the front page of the website and that the number of views a video receives is correlated with the number of categories assigned to it.

In [26], Tyson et al. crawled PornHub, an adult social networking website, to analyze the demographic makeup of users. The study shows that a majority of profiles on PornHub belong to young males; however, female profiles are more popular in terms of receiving more comments and profile views. The authors also found that active profiles have larger social groups and that there is a positive correlation between the content uploaded by a profile and its number of connections. The data analyzed in these studies was collected by periodic crawling of these two adult websites, which is generally limited to meta-data like view counts, ratings, and user profile information. In contrast, our work is based on analysis of detailed HTTP access logs of multiple adult websites, which allows us to track individual user content requests at a fine-grained timescale.

Some prior web traffic measurement studies have reported basic statistics of adult traffic in the context of overall traffic makeup. For example, Du et al. studied HTTP traffic from Internet kiosks in two developing countries and reported that adult content accounts for less than 1% of total traffic volume [8]. Erman et al. studied video traffic in a large 3G cellular network and reported that adult content accounts for approximately 15% of total mobile video traffic [9]. While these studies are useful, they do not specifically analyze adult traffic in terms of its unique characteristics as compared to non-adult traffic. To the best of our knowledge, our work provides the first in-depth analysis of online adult traffic at scale.

III. DATA

Most major adult and non-adult content publishers use third-party content delivery networks (CDNs) to efficiently deliver their content to end-users. According to recent estimates, a significant fraction of Internet traffic is served by CDNs [15]. For instance, Akamai alone delivers between 15-30% of all web traffic [2]. A CDN operator typically places content at multiple geographically distributed data centers. A user’s request for content (such as a web page, an image, or a video file) is redirected to the closest data center via DNS redirection, anycast, or other CDN-specific methods [11], [20].

For this study, we collected HTTP access logs from multiple data centers of a major commercial CDN for the duration of one week. The HTTP logs include traffic from several dozen major adult websites and their users on four different continents. Overall, our HTTP logs account for approximately 323 terabytes worth of traffic from 80 million users. All personally
identifiable information in the HTTP logs (e.g., IP addresses) is anonymized to protect the privacy of end users without affecting the usefulness of our analysis. Each record in our trace includes information about an HTTP request, containing the publisher identifier, hashed URL, object file type, object size in bytes, user agent, and the timestamp when the request was received. We use the user agent field to distinguish between different device types, operating systems, and web browsers [10]. Each record in our trace also includes information about the corresponding HTTP response sent by the CDN server, containing the cache status for the requested object. A HIT value indicates that the requested object was found in the CDN cache and a MISS value indicates that the file does not exist in the CDN cache. We use the HTTP response codes and cache status information to measure the caching performance of proprietary CDN caching algorithms.

Through an extensive manual analysis of publisher identifiers, we separated adult content publishers from the rest (dubbed “non-adult”). We select five popular adult websites in our data set for further in-depth analysis. The names of these websites are anonymized due to business confidentiality agreements. Of these five websites, two primarily serve YouTube-style adult video content (termed V-1 and V-2), two provide image-heavy adult content (termed P-1 and P-2), and one is an adult social networking website (termed S-1).

IV. MEASUREMENT & ANALYSIS

In this section, we analyze aggregate statistics of adult websites (e.g., content composition), content dynamics (e.g., popularity, new content injection), and user dynamics (e.g., session length, addiction to adult content).

A. Aggregate Analysis

Content Composition. We first analyze the content composition of adult websites on CDN servers. To this end, we categorize objects based on their file types into video (e.g., FLV, MP4, MPG, AVI, WMV), image (e.g., JPG, PNG, GIF, TIFF, BMP), and other (e.g., text, audio, HTML, CSS, XML, JS). From Figure 1, we note that V-1 primarily stores video objects on the CDN servers: 98% of all its objects are videos. In contrast, V-2 stores a mix of image (84%) and video (15%) objects. V-2 uses a large number of GIF images to show a video summary when users hover the cursor over the video. P-1, P-2, and S-1 mainly store images (99%) on CDN servers.

Fig. 1. Content composition of five adult websites (y-axis: # of objects, log scale; categories: Video, Image, Other). V-1 uses 6.6K objects, V-2 uses 55.6K objects, P-1 uses 16.3K objects, P-2 uses 29.6K objects, and S-1 uses 22.9K objects. We break down content into 3 categories: (1) video, (2) image, and (3) other. The other category includes objects that are not classified as video or image.

Fig. 2. Traffic composition of five adult websites: (a) Request Count; (b) Request Size (KB). Both panels use a log-scale y-axis with categories Video, Image, and Other. We note that video and image multimedia content dominates adult traffic.

Traffic Composition. We next analyze the traffic composition of adult websites in terms of the request count and request size during the data collection period. Request count is the total number of requests received from website visitors for all objects. Request size is the total size of objects requested by website visitors for all objects. We again break down content across the video, image, and other categories. Figure 2(a) shows the traffic composition distributions with respect to their request count. We observe that the majority of traffic on adult websites consists of video and images. Only V-1 has significantly more video traffic than other content types: V-1 traffic includes 3.1M requests for video objects. V-2 has a smaller percentage of video content traffic as compared to images or other categories. For V-2, 359K requests are for video content whereas 657K requests are for image content. For P-1 and P-2, 719K and 175K requests are for images, respectively. For S-1, 231K requests are for images. Figure
2(b) shows the traffic composition distributions with respect to their request size. In contrast to Figure 2(a), we note that video content accounts for disproportionately more traffic volume. Since video files are significantly larger than image files, videos tend to dominate the traffic in terms of byte volume. Video traffic in V-1 alone accounts for 258 gigabytes worth of traffic. This traffic mix is composed mostly of multimedia content and is representative of popular free (ad-driven) and subscription-based adult websites [28]. Our findings highlight that adult content publishers and their respective CDNs need to design and provision their infrastructure to primarily serve multimedia content.

Temporal Access Patterns. We next analyze the temporal access patterns for adult websites. Figure 3 plots the normalized hourly timeseries of traffic volume across the day. We converted the timestamps to local timezones to calculate hourly traffic volumes. Overall, we observe that access patterns of adult websites are not typical diurnal patterns. Prior literature (e.g., [21], [24], [29]) reported content access peaks during 7-11 pm and troughs in late night and early morning hours. In contrast, for example, we note that V-1 traffic volume peaks at late-night and early morning hours. This pattern for V-1 is almost the opposite of the typical diurnal pattern reported in prior literature. The temporal access patterns for V-2, P-1, P-2, and S-1 have less pronounced variations than V-1, yet they are different from typical diurnal patterns. The differences between peak access of adult and other content are likely due to the unique nature of and viewing preferences for adult content. Thus, it is important for network operators to separately account for adult traffic in the traffic forecasting models and network resource allocation.

Fig. 3. Hourly traffic volume timeseries of five adult websites (x-axis: Hour; y-axis: Percentage Traffic Volume).

Device/OS Usage. Web traffic is known to have been gradually shifting from traditional desktops to smartphones and tablets over the last several years [16]. Recall that we extract user agent information from HTTP headers to identify the device/OS of a user. We next investigate the device/OS composition of user requests to adult websites. All adult websites discussed in this paper have mobile-friendly websites/apps. Figure 4 plots the device distribution of users accessing the adult websites. We observe that the desktop category dominates the smartphone (Android and iOS) and miscellaneous (tablets and other mobile devices) categories. For instance, more than 95% of V-2 users access content from traditional desktop devices. We observe that image-heavy and social networking websites receive relatively more visitors from smartphone devices as compared to video websites. For instance, more than one-third of users access S-1 from the smartphone and miscellaneous device categories. The differences across different adult websites can be partially explained by user preference to view adult content on larger screens. CDNs can customize content delivery (e.g., compression, encoding) depending on different video playback devices. Since a vast majority of users do not visit adult websites on smartphones, our findings highlight the need for adult websites to further improve their web interfaces and content delivery strategies for mobile devices.

Fig. 4. Device type composition (y-axis: Users (%); categories: Desktop, Android, iOS, Misc).

B. Content Dynamics

Content Size. We next investigate content sizes for adult websites. Figure 5 plots the Cumulative Distribution Functions (CDFs) of content sizes. Overall, we observe that content sizes vary in the range of a few kilobytes (KB) to hundreds of megabytes (MB), where the majority of requested video objects have sizes greater than 1 MB and image objects are less than 1 MB in size. Figure 5(a) shows the content size distribution of video objects. Video adult websites V-1 and V-2 have a majority of objects larger than 1 MB. P-1 and S-1 have a relatively small number of video objects. P-2 has the largest video object sizes. Figure 5(b) shows the content size distribution of image objects. We note that multiple adult websites have bi-modal distributions, indicating thumbnail-sized images as well as large images of sizes up to 1 MB. These observations have significant implications for CDN/ISP caching optimization. For instance, ISPs/CDNs can employ separate caching platforms to optimally serve small and large objects. The caching platform for small objects can be optimized for high-throughput I/O, whereas the caching platform for large objects can be optimized for more storage capacity.

Content Popularity. We now investigate object popularity for adult websites. We quantify object popularity in terms
of request count. Understanding the popularity of objects is important because CDNs typically optimize their caching performance by focusing on popular objects and reduce storage costs by ignoring unpopular and dynamically changing content. Additionally, analyzing the popularity of adult objects can provide insights for identifying similarities and differences between adult objects and non-adult objects. We plot the request count distribution of video and image objects for adult websites in Figure 6. We observe long-tail distributions for all adult websites. This observation indicates that a significant fraction of adult objects are requested infrequently and a small fraction of adult objects are very popular. From a content delivery perspective, this information is useful as CDNs can improve their caching performance by caching heavily requested objects. The long-tailed distribution is similar to those reported for traditional web and video content in prior literature, where a small fraction of viral media content is heavily requested. The long-tailed distribution also potentially highlights some social aspects of adult content. In comparison, the large number of requests for typical non-adult objects is mainly because of word-of-mouth content sharing in online social networking websites [22]. We would not typically expect adult content viewers to share adult content on popular social networking websites, though recent work has shown that users use social features in adult websites [26]. We next investigate unique aspects of adult content popularity by further analyzing individual object request patterns.

Fig. 5. Content size distributions: (a) Video; (b) Image (x-axis: File Size (Bytes), log scale; y-axis: CDF). We note bi-modal distributions for image objects. Small images are low-resolution thumbnails and large images are high-resolution pictures.

Fig. 6. Content popularity distributions: (a) Video; (b) Image (x-axis: # of Requests, log scale; y-axis: CDF).

Impact of Content Injection and Aging on Popularity. We would expect more requests for objects when they are new, and as the content ages we expect its popularity to decrease accordingly. To understand this phenomenon for adult websites, we plot the fraction of adult objects requested at different ages in Figure 7. The plot shows that a declining fraction of objects are requested as their age increases. In particular, about 20% of objects are not requested after 3 days for most adult websites. Only about 10% of objects are requested throughout the trace duration of one week.

Fig. 7. Fraction of total objects requested for adult websites at different ages (x-axis: Content Age (days); y-axis: Objects Requested).
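The age analysis above can be approximated from a raw request log. The following is a minimal Python sketch, not the authors' code. It assumes each log record has been reduced to a hypothetical `(object_id, unix_timestamp)` pair and that an object's age is measured from its first request in the trace, since the text does not state how injection time is determined.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400


def fraction_requested_by_age(requests, max_age_days=7):
    """Return a list f where f[d] is the fraction of objects that receive
    at least one request on their d-th day of age (d = 0 is the first day).

    `requests` is a list of (object_id, unix_timestamp) pairs (hypothetical
    record layout). Age is counted from an object's first request in the
    trace, which is an assumption of this sketch.
    """
    # First pass: earliest request time per object (proxy for injection time).
    first_seen = {}
    for obj, ts in requests:
        if obj not in first_seen or ts < first_seen[obj]:
            first_seen[obj] = ts

    # Second pass: record which objects are requested at each age in days.
    active = defaultdict(set)
    for obj, ts in requests:
        age = int((ts - first_seen[obj]) // SECONDS_PER_DAY)
        if age < max_age_days:
            active[age].add(obj)

    total = len(first_seen)
    if total == 0:
        return [0.0] * max_age_days
    return [len(active[d]) / total for d in range(max_age_days)]
```

Objects first seen late in the trace cannot reach high ages, so in practice one would likely restrict the analysis to objects first seen on the first day; that filtering is left out here for brevity.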
We further explore the popularity trends of objects by clustering them with respect to their temporal popularity patterns. To identify temporal popularity patterns, we analyze the time series of request count for individual objects. Characterizing the request count time series helps us assess how fast an adult object reaches its maximum popularity, identify governing popularity trends of image and video objects, and develop insights for improving the caching performance of CDNs.

We identify distinct popularity trends by measuring the similarities in shape between the normalized request count time series. We use Dynamic Time Warping (DTW) to compute the similarity between two request count time series [17]. DTW uses a dynamic programming approach to obtain a minimum distance alignment between two time series. More specifically, for two time series $T_1$ and $T_2$, DTW obtains an optimal alignment between $T_1$ and $T_2$ by warping the time dimension of $T_1$ and $T_2$. DTW achieves an optimal alignment between two time series by obtaining a non-linear mapping of points on one time series to points of the second time series. This non-linear mapping is also known as the warping path. More specifically, for $T_1 = (a_1, a_2, \ldots, a_N)$ and $T_2 = (b_1, b_2, \ldots, b_M)$ of length $N$ and $M$ respectively, a warping path $w = (w_1, w_2, \ldots, w_L)$ defines an alignment between $T_1$ and $T_2$, where $w_1 = (a_1, b_1)$ defines the mapping of element $a_1 \in T_1$ to $b_1 \in T_2$. A cost function is used to compute the optimality of a warping path. Large cost values indicate low similarity in the shapes of two time series and small cost values indicate high similarity. The total cost of a warping path $w$ is defined as:

$$C_w = \sum_{i=1}^{L} c(a_i, b_i)$$

Here $(a_i, b_i)$ denotes the pair of elements mapped by $w_i$. Intuitively, the cost function $c$ is defined as the area between the time warped time series. Using a dynamic programming approach, DTW evaluates all possible sets of mappings (warping paths) between two time series. The optimal warping path is the path $w'$ that has the minimum total cost among all possible warping paths. The total cost of the optimal warping path is defined as the DTW distance. We use the DTW distance as a metric for quantifying the similarity between two request count time series. We compute pairwise DTW distances for all request count time series and obtain a similarity matrix. We then use the pairwise DTW distance matrix to obtain hierarchical clusters for the request count time series. We use agglomerative hierarchical clustering to obtain a dendrogram for each adult website.

Fig. 8. Content clustering dendrograms of two adult websites: (a) Video V-2; (b) Image P-2 (x-axis: cluster labels with membership percentages; y-axis: Distance).

Figures 8(a) and (b) show two example dendrograms of video and image objects for V-2 and P-2, respectively. The x-axis of a dendrogram shows the cluster labels and their memberships and the y-axis represents the DTW distance. We observe the following dominant popularity trends in our clustering analysis: diurnal, long-lived, and short-lived. We also observe that some objects in the V-2 and P-2 websites have other popularity trends that cannot be neatly assigned to any of the aforementioned categories. We categorize these objects as outliers.

To visualize the unique popularity trends, we next identify a representative sample object from each cluster and plot its normalized request count time series with point-wise standard deviations. To identify the representative sample object for each cluster, we identify its medoid, where a medoid is defined as the most centrally located point of a cluster [14]. We calculate point-wise standard deviations by looking at the normalized request count time series of all objects in the cluster. We plot the normalized request count time series of the medoids of unique clusters in Figure 9. The shaded regions represent the standard deviation of all time series from the mean of the cluster and the solid line represents the medoid of the cluster. Figures 9(a) and 10(a) show the medoids for the diurnal popularity cluster. Diurnal popularity trends of video objects highlight that certain video objects are requested continuously with regular day/night time variations. Figures 9(b) and 10(b) show the cluster medoids of objects with the long-lived popularity trend. The request count of these objects reaches its maximum within the first day of their injection. Their request count decays in a diurnal fashion and completely dies down after a few days. Figures 9(c) and 10(c) show the cluster medoids of objects with the short-lived popularity trend. The request count of these objects reaches its maximum during the first day of their arrival but decays sharply and completely dies down within a few hours. Our further analysis reveals that video objects with diurnal trends are smaller in size compared to video objects with long-lived
and short-lived trends. The long-lived video objects have the largest size followed by short-lived video objects.

Fig. 9. Cluster medoids for V-2 adult website: (a) Diurnal-A; (b) Long-lived; (c) Short-lived (x-axis: day of week, Sat to Fri; y-axis: Normalized Request Count).

Fig. 10. Cluster medoids for P-2 adult website: (a) Diurnal; (b) Long-lived; (c) Short-lived (x-axis: day of week, Sat to Fri; y-axis: Normalized Request Count).

Overall, our analysis reveals that objects served by adult websites have diverse popularity trends. Each website has a different composition of these popularity trends. A large fraction of image and video objects have diurnal request patterns. There are two possible explanations for diurnal patterns. First, prior work [25] suggests that users discover content on adult websites through front page browsing. Objects with diurnal patterns are most likely image and video objects displayed on the front page of a website, and users always access these objects when they visit the website. Second, we note that the potentially addictive nature of adult content could drive users to access specific content repeatedly, similar to non-adult media content.

CDNs can utilize this information to optimize cache control by re-validating diurnal objects less frequently and other objects more frequently, for example, hourly for objects with short-lived access patterns and daily for objects with long-lived access patterns. This can also be achieved by setting longer expiry times for objects with diurnal and long-lived access patterns. Furthermore, CDNs can reduce network traffic by pushing copies of objects with diurnal and long-lived request patterns to locations closer to end-users.

C. User Dynamics

User Request Inter-arrival Time. We characterize the user request arrival process in terms of its inter-arrival time (IAT) distribution. Figure 11 plots the user request IAT distributions for all adult websites. Comparing different adult websites, we observe that video adult websites have shorter request IATs as compared to image-heavy adult websites. For video objects in adult websites, the median request IAT is less than 10 minutes, whereas it is more than 1 hour for image-heavy adult websites. We later use these observations for estimating user session lengths.

Fig. 11. User request inter-arrival time distributions (x-axis: Inter-arrival time, from 1 sec to 1 week; y-axis: CDF).

User Session Length. A key metric from the perspective of content publishers and CDNs is user engagement [23], which is typically quantified in terms of website bounce time [12].
From the network-side logs, we can estimate user engagement in terms of user session length,¹ where a session consists of consecutive user requests within a timeout interval. We set the timeout value for user sessions at 10 minutes based on our earlier analysis of user request IAT distributions. We plot user session length distributions for adult websites in Figure 12. We observe that the median session lengths for most adult websites are around one minute. Our findings indicate that user engagement for adult content consists of relatively short-lived sessions as compared to non-adult content. We further verified our observations using the engagement statistics reported by Alexa. We note that average session lengths for popular non-adult websites tend to be much longer than those for popular adult websites. For example, the average session length for YouTube [31] is approximately two minutes, whereas the average session length for XVIDEOS [30] is less than one minute.

¹ It is noteworthy that the user session length is a strict lower bound on the traditional bounce time because we cannot tell from network-side logs how long a user continues to watch the downloaded content.

[Figure 12: CDF of user session length (1 sec to 1 hr) for V-1, V-2, P-1, P-2, and S-1]

Fig. 12. User session length distributions

User Addiction. To further investigate user engagement, we next analyze user addiction. We analyze repeated content accesses by a user to investigate content addiction. For each object, we compute the total number of requests and the total number of unique users who make these requests. Figure 13(a) shows the scatter plot highlighting repeated access of video objects for V-1 adult website. Each data point in the plot represents a distinct video object. We observe that certain video objects are requested by a user multiple times, i.e., data points above the diagonal. In some cases, an object is requested a large number of times by an individual user. For example, some objects have up to two orders of magnitude more requests than unique users. This observation indicates that some video objects are popular due to repeated access by a certain user (i.e., addiction), whereas other video objects are popular due to multiple users accessing the content (i.e., viral). Figure 13(b) shows the scatter plot of repeated access of objects for P-1 adult website. A comparison of repeated access reveals that video objects are more likely to get repeated user requests than image objects. This also highlights that video content is more addictive/engaging as compared to image content.

[Figure 13: scatter plots, (a) V-1, (b) P-1; x-axis: # of Users (log scale); y-axis: # of Requests (log scale)]

Fig. 13. Repeated access of objects

[Figure 14: two panels, (a) Video, (b) Image; CDF of Requests per User (log scale) for V-1, V-2, P-1, P-2, and S-1]

Fig. 14. CDF of repeated content access by users
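The session definition used above (consecutive requests from one user, split whenever the gap exceeds a 10-minute timeout) can be sketched as follows. This is a minimal illustration and not the authors' code: the `(user_id, timestamp)` record layout and the function name `session_lengths` are assumptions introduced here, since the paper does not describe the log schema.

```python
from collections import defaultdict

SESSION_TIMEOUT_S = 10 * 60  # 10-minute timeout, the value chosen in the text


def session_lengths(requests, timeout=SESSION_TIMEOUT_S):
    """Return {user_id: [session_length_in_seconds, ...]}.

    `requests` is an iterable of (user_id, unix_timestamp) pairs.
    This layout is a hypothetical stand-in for the CDN log schema.
    """
    by_user = defaultdict(list)
    for user_id, ts in requests:
        by_user[user_id].append(ts)

    lengths = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        start = prev = stamps[0]
        sessions = []
        for ts in stamps[1:]:
            if ts - prev > timeout:
                # Gap exceeds the timeout: close the current session.
                sessions.append(prev - start)
                start = ts
            prev = ts
        sessions.append(prev - start)  # close the last open session
        lengths[user_id] = sessions
    return lengths
```

A session with a single request has length zero under this definition, which is consistent with the footnote's point that network-side session length only lower-bounds the time a user actually spends on the content.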
[Figure 15: two panels, (a) Image, (b) Video; CDF of Hit Ratio (0 to 1) for V-1, V-2, P-1, P-2, and S-1]

Fig. 15. Hit ratio for image and video objects

[Figure 16: two panels, (a) Video, (b) Image; Request Count (log scale) per HTTP response code (200, 204, 206, 304, 403, 416) for V-1, V-2, P-1, P-2, and S-1]

Fig. 16. Response codes
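The repeated-access measurement described under User Addiction (total requests versus unique users per object) can be sketched as below. This is an illustrative sketch only: the `(object_id, user_id)` record layout and the function name `repeated_access` are assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict


def repeated_access(log):
    """Map object_id -> (total_requests, unique_users, requests_per_user).

    `log` is an iterable of (object_id, user_id) pairs, a hypothetical
    layout standing in for the CDN request logs.
    """
    total = Counter()
    users = defaultdict(set)
    for object_id, user_id in log:
        total[object_id] += 1
        users[object_id].add(user_id)
    return {
        obj: (total[obj], len(users[obj]), total[obj] / len(users[obj]))
        for obj in total
    }
```

Objects whose requests-per-user ratio is well above 1 correspond to points above the diagonal in the scatter plots, while a ratio close to 1 with many unique users corresponds to content that is popular across users.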

To further analyze user addiction to content, we plot the distribution of repeated user access for all objects. Figure 14 plots the CDF of the number of requests per user for all adult websites. We note that less than 1% of image objects are requested more than 10 times by a user, whereas at least 10% of video objects have more than 10 requests per unique user. From a CDN's perspective, this information is particularly useful as it can differentiate between objects that are popular only due to requests from multiple users and objects that are popular because of repeated accesses by a user. As we discuss later, this information is also useful in optimizing local browser caching and proxy caches deployed by many ISPs. Objects accessed multiple times by a single user or a small number of users should be locally cached closer to end-users.

V. IMPLICATIONS

We discuss potential implications of our measurement and analysis of online adult traffic. We are particularly interested in understanding the impact of different content access patterns on CDN caching. To this end, we analyze the caching performance for adult websites by looking at server-side HTTP response codes and cache hit ratios.

CDN Cache Hit Ratios. We first investigate the cache performance of CDN servers by analyzing server-side cache hit ratios. Figure 15 plots the distributions of cache hit ratios for objects in all adult websites. A hit ratio value of 0 indicates that the requested objects were not present in the CDN cache. A hit ratio value of 1 indicates that all requested objects were served from the CDN cache. Note that the CDN treats video chunks as separate objects for the sake of caching. Comparing image and video objects, we observe that image objects have a better overall cache hit ratio than video objects. S-1 has the smallest percentage of objects added to the CDN cache. To understand the cache performance of objects with different popularity, we compute correlations between hit ratio and object popularity. As expected, we find that popular objects tend to have higher hit ratios (more than 0.9 correlation coefficient for all adult websites). Thus, while the distributions in Figure 15 may indicate otherwise, overall CDN cache hit ratios range between 80% and 90% for different adult websites. It is noteworthy that CDNs often customize cache configuration and performance for individual publishers. Thus, some differences in cache hit ratios may also reflect differences in priorities for different content publishers. Furthermore, customized caching strategies for streaming video content can also be implemented by the CDN.

HTTP Response Codes. We next analyze HTTP response codes for adult websites. Figure 16 shows the number of requests associated with each type of response code. The most common HTTP response codes for both video and image objects include: 200, 206, 304, and 403. We note that a majority of response codes are 200. Of particular interest to a CDN operator is the 304 response code, which indicates
that the client's requested object is not modified and the local cache copy is up to date. We note that 304 responses constitute a small fraction of all requests, which indicates the potential for improved localized content caching to improve user performance and reduce traffic load on CDN content replica servers. Despite the cacheability of popular adult objects, 304 response counts are particularly low for adult websites because users are known to browse adult content in incognito/private browsing modes [5]. Web browsers discard local cache content when users close the incognito/private browser windows. In contrast, note that Facebook reported that more than 65% of their photo requests are served from local browser caches [13]. Unfortunately, while traditional non-adult websites can fully utilize the browser cache to improve performance and reduce traffic, adult content publishers cannot solely rely on it in designing their content and caching mechanisms.

VI. CONCLUSION

In this paper, we presented a large-scale and in-depth measurement and analysis of online adult traffic. We provide an in-depth analysis of its aggregate, content, and user dynamics. We find that the temporal access patterns of adult websites are unique and different from typical diurnal access patterns. While a majority of users access adult websites from desktops, smartphones and other mobile devices account for a non-trivial fraction of visitors. Our clustering analysis of content popularity reveals groups of objects with diurnal, long-lived, and short-lived temporal access patterns. User engagement on adult websites is shorter than on non-adult websites; however, adult content can be addictive, i.e., some users repeatedly access certain content. Our findings have implications for adult website design and content delivery infrastructure management. For instance, adult content providers cannot rely on the browser cache to store locally popular content because of the prevalent usage of incognito/private web browsing. Content delivery networks can also improve performance and reduce network traffic by pushing copies of popular adult objects to locations closer to the end-users.

REFERENCES

[1] Alexa Top 500 Global Sites. http://www.alexa.com/topsites.
[2] Facts & Figures - Akamai. http://www.akamai.com/html/about/facts_figures.html.
[3] Internet Filter: Internet Pornography Statistics. http://internet-filter-review.toptenreviews.com/internet-pornography-statistics.html, 2006.
[4] Porn Sites Get More Visitors Each Month Than Netflix, Amazon And Twitter Combined. http://www.huffingtonpost.com/2013/05/03/internet-porn-stats_n_3187682.html, 2013.
[5] G. Aggarwal, E. Bursztein, C. Jackson, and D. Boneh. An analysis of private browsing modes in modern browsers. In USENIX Security Symposium, 2010.
[6] X. An and G. Kunzmann. Understanding mobile Internet usage behavior. In IFIP Networking, 2014.
[7] AvenueQ. The Internet Is for Porn. https://www.youtube.com/watch?v=LTJvdGcb7Fs.
[8] B. Du and M. D. E. Brewer. Analysis of WWW Traffic in Cambodia and Ghana. In WWW, 2006.
[9] J. Erman, A. Gerber, K. Ramakrishnan, S. Sen, and O. Spatscheck. Over The Top Video: The Gorilla in Cellular Networks. In ACM Internet Measurement Conference (IMC), 2011.
[10] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. RFC 2616: Hypertext Transfer Protocol – HTTP/1.1. Technical report, Network Working Group, Internet Engineering Task Force, 1999.
[11] B. Frank, I. Poese, G. Smaragdakis, A. Feldmann, B. M. Maggs, S. Uhlig, V. Aggarwal, and F. Schneider. Recent Advances in Networking, chapter Collaboration Opportunities for Content Delivery and Network Infrastructures. ACM SIGCOMM, 2013.
[12] I. Grigorik. Breaking the 1000 ms Time to Glass Mobile Barrier. In SF HTML5 meetup, 2013.
[13] Q. Huang, K. Birman, R. van Renesse, W. Lloyd, S. Kumar, and H. C. Li. An Analysis of Facebook Photo Caching. In SOSP, 2013.
[14] L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons, 2009.
[15] C. Labovitz. First Data on Changing Netflix and CDN Market Share, June 2012. http://www.deepfield.net/2012/06/first-data-on-changing-netflix-and-cdn-market-share/.
[16] A. Lipsman. Major Mobile Milestones in May: Apps Now Drive Half of All Time Spent on Digital. http://www.comscore.com/Insights/Blog/Major-Mobile-Milestones-in-May-Apps-Now-Drive-Half-of-All-Time-Spent-on-Digital, June 2014.
[17] M. Müller. Dynamic Time Warping. Information retrieval for music and motion, pages 69–84, 2007.
[18] D. Naboulsi, M. Fiore, S. Ribot, and R. Stanica. Large-scale Mobile Traffic Analysis: a Survey. IEEE Communications Surveys & Tutorials, 2015.
[19] O. Ogas and S. Gaddam. A billion wicked thoughts: What the world's largest experiment reveals about human desire. Dutton New York, NY, 2011.
[20] M. Pathan and R. Buyya. Content Delivery Networks, Chapter 2: A Taxonomy of CDNs. Springer, 2008.
[21] U. Paul, A. P. Subramanian, M. M. Buddhikot, and S. R. Das. Understanding Traffic Dynamics in Cellular Data Networks. In IEEE Infocom, 2011.
[22] T. Rodrigues, F. Benevenuto, M. Cha, K. P. Gummadi, and V. Almeida. On Word-of-Mouth Based Discovery of the Web. In ACM Internet Measurement Conference (IMC), 2011.
[23] M. Z. Shafiq, J. Erman, L. Ji, A. X. Liu, J. Pang, and J. Wang. Understanding the Impact of Network Dynamics on Mobile Video User Engagement. In ACM SIGMETRICS, 2014.
[24] M. Z. Shafiq, L. Ji, A. X. Liu, and J. Wang. Characterizing and Modeling Internet Traffic Dynamics of Cellular Devices. In ACM SIGMETRICS, 2011.
[25] G. Tyson, Y. Elkhatib, N. Sastry, and S. Uhlig. Demystifying Porn 2.0: A look into a major adult video streaming website. In ACM Internet Measurement Conference (IMC), pages 417–426. ACM, 2013.
[26] G. Tyson, Y. Elkhatib, N. Sastry, and S. Uhlig. Are People Really Social on Porn 2.0? In AAAI Conference on Web and Social Media (ICWSM), 2015.
[27] M. Ward. Web porn: Just how much is there? http://www.bbc.com/news/technology-23030090, July 2013.
[28] G. Wondracek, T. Holz, C. Platzer, E. Kirda, and C. Kruegel. Is the Internet for Porn? An Insight Into the Online Adult Industry. In Workshop on the Economics of Information Security (WEIS), 2010.
[29] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman. Identifying diverse usage behaviors of smartphone apps. In ACM Internet Measurement Conference (IMC), 2011.
[30] xvideos.com Site Overview - Alexa. http://www.alexa.com/siteinfo/xvideos.com.
[31] youtube.com Site Overview - Alexa. http://www.alexa.com/siteinfo/youtube.com.
[32] Y. Zhang and A. Arvidsson. Understanding the characteristics of cellular data traffic. ACM SIGCOMM Computer Communication Review, 42(4):461–466, 2012.
