
Design, Implementation and Test of a Flexible Tor-Oriented Web Mining Toolkit

Alessandro Celestini
Institute for Applied Computing (IAC-CNR)
Via dei Taurini 19, Rome, Italy
[email protected]

Stefano Guarino
Institute for Applied Computing (IAC-CNR)
Via dei Taurini 19, Rome, Italy
[email protected]

ABSTRACT
Searching and retrieving information from the Web is a primary activity needed to monitor the development and usage of Web resources. Possible benefits include improving user experience (e.g. by optimizing query results) and enforcing data/user security (e.g. by identifying harmful websites). Motivated by the lack of ready-to-use solutions, in this paper we present a flexible and accessible toolkit for structure and content mining, able to crawl, download, extract and index resources from the Web. While being easily configurable to work in the "surface" Web, our suite is specifically tailored to explore the Tor dark Web, i.e. the ensemble of Web servers composing the world's most famous darknet. Notably, the toolkit is not just a Web scraper, but it includes two mining modules, respectively able to prepare content to be fed to an (external) semantic engine, and to reconstruct the graph structure of the explored portion of the Web. Other than discussing in detail the design, features and performance of our toolkit, we report the findings of a preliminary run over Tor that clarify the potential of our solution.

CCS CONCEPTS
• Information systems → Web mining; Deep web; Web crawling; • Networks → Topology analysis and generation;

KEYWORDS
Dark Web, Tor Web graph

ACM Reference format:
Alessandro Celestini and Stefano Guarino. 2017. Design, Implementation and Test of a Flexible Tor-Oriented Web Mining Toolkit. In Proceedings of WIMS '17, Amantea, Italy, June 19-22, 2017, 10 pages. DOI: 10.1145/3102254.3102266

1 INTRODUCTION
As the Web has become the main means for information exchange and retrieval, a whole body of work focuses on gaining a better understanding of its content and shape, in order to improve usability and security. A few seminal papers [19, 22, 23] allowed drawing the first publicly known portrait of the "surface" Web, thus paving the way for further research which, over the last two decades, provided significant results: the optimization of search algorithms [14] and Web data extraction [18], the identification of efficient compressed representations [11, 15, 16], the proposal of suitable representation (i.e. graph based) models [12, 24], just to name a few. Web mining becomes an even more interesting and challenging task when the target includes the submerged Internet contents usually known as the "deep" Web (estimated to sum up to more than 90% of the whole World Wide Web (WWW)), not crawled/indexed by traditional search engines. In general, while search engines are specifically tailored to respond to user queries with popularity as one of the key preferential criteria, the heterogeneous and unstructured nature of data available on the WWW calls for flexible Web mining tools able to extensively gather, organize, index and analyse Internet resources.

A recent research trend is especially focused on the subset of the deep Web usually called the "dark" Web. This is the collection of web resources that exist on darknets, describable as overlay networks which, despite leaning on the public Internet, require specific software, configuration or authorization to access. Among darknets, Tor (The Onion Router, https://www.torproject.org/) is probably the most known and used. It is a communication network designed as a low-latency, anonymity-guaranteeing and censorship-resistant network, relying on an implementation of the so-called onion routing protocol [17]. Its servers, run by volunteers over the Internet, work as routers to allow Tor users to access the Internet anonymously, evading traditional network surveillance and traffic analysis mechanisms. Other than guaranteeing anonymous access to normal websites, Tor allows running anonymous and untraceable Web servers, known as hidden services, that can only be accessed using a Tor-enabled browser. A hidden service is identified by its onion url and is not associated with any (visible) IP address. Tor is able to interpret such urls and forward data packets to and from a hidden service, guaranteeing anonymity in both directions. Hidden services can offer various kinds of services, not just Web servers, without needing to reveal the location of the site. The main reason why both the research community [1, 5, 6, 31, 33] and society/media are increasingly showing interest in Tor is that a completely anonymous network represents the perfect breeding ground for illegal activities: Tor hidden services have been accused of providing protection to terrorists (see, e.g., http://www.bloomberg.com/news/articles/2014-10-15/how-anti-is-spies-fight-terrorism-with-digital-tools), and are known to host marketplaces for drugs, weapons and child pornography [2]. Mining the structure and content of the Tor dark web is a primary instrument to gain a better understanding of how Tor works and how it is being used, with a direct impact on both user experience and security.

Although publicly available and effective crawlers do exist [9, 10, 20, 27, 28], we feel that the research community lacks a sufficiently general-purpose, configurable solution for addressing (deep, dark) Web mining. The purpose of this paper is exactly to fill that gap, by presenting the design, implementation and test of a toolkit which can be easily used to automatically explore the Web. In our suite, Web crawling is made very "user-friendly" by letting a coordinator module control the spider, i.e. the software concretely responsible for jumping from site to site along hyperlinks. All collected data are extracted and indexed, and can be easily forwarded for content mining to any external semantic engine accessible through RESTful APIs. Additionally, our solution integrates a module that reconstructs the explored Web graph, thus allowing structural/topological analysis. We introduce our toolkit in a Tor-oriented guise, but it can be tuned to work on any subset of the Web, at the cost of a minor configuration effort.

To support the potential of our solution, we also report the highlights of a preliminary (yet wide) exploration of the Tor dark Web. These preliminary tests were run in the context of the "IANCIS" ISEC Project (DG Home Affairs ISEC Programme 2013 - Project IANCIS "Indexing of Anonymous Networks for Crime Information Search", 2014-2016, GA n. HOME/2013/ISEC/AG/INT/4000005222 - www.iancis.eu), and the presence of Expert System (http://www.expertsystem.com/) as project partner gave us access to their data analysis solutions. Specifically, we had the opportunity of using their tool Cogito (http://www.expertsystem.com/cogito/) as external semantic engine, as reported in Section 4. An extensive analysis of our findings lies beyond the scope of the present paper, whose focus is on design and implementation, and is to be found in [4]. However, by making our toolkit available on demand and providing a preview of its capabilities, we are confident to give new impetus to the Web mining literature, and to pave the way for the extension of the body of work related to Tor and other similar darknets.

This paper is organized as follows: Section 2 presents a review of related work; Section 3 details the design and the performance of our framework; Section 4 discusses a set of preliminary findings that highlight the relevance of the proposed tool; finally, Section 5 reports a few concluding remarks and directions for future work.

2 RELATED WORK
The Tor network has attracted significant attention from the research community, interested both in assessing its security with respect to de-anonymization attacks, and in understanding which threats a publicly available anonymous communication system could expose society to. One of the main research directions over Tor analysis consists in studying how Tor is used and what kind of contents its websites host, with the aim of understanding whether the charge of providing protection and anonymity to criminals is indeed well deserved. Spitters et al. [31] present a study focused on the analysis of Tor hidden services contents. They apply classification and topic model techniques to analyse the contents of over a thousand Tor hidden services, in order to model their thematic organization and linguistic diversity. Their results indicate that most hidden services in their data set exhibit illegal or controversial contents. Along the same line, Biryukov et al. [5] give an overview of Tor hidden services, studying 39824 hidden service descriptors collected in 2013 [6], and analysing and classifying their contents. At the time of the crawl they were able to connect to 7114 destinations, but half of them were excluded because inappropriate for classification, ending up with 3050 destinations. Their results show that resources devoted to criminal activities (drugs, adult content, counterfeits, weapons, etc.) constitute 44%, whereas the remaining 56% are devoted to a number of different topics: "Politics" (9%) and "Anonymity" (8%) are among the most popular. Biryukov et al. also estimate the popularity of hidden services by looking at the request rates for hidden service descriptors by clients. They find that, while the content of Tor hidden services is rather varied, the most popular hidden services are related to botnets. Even if the approach used by Biryukov et al. for hidden service descriptor collection cannot be reproduced (they exploited a bug of Tor, fixed in recent versions of the software), other authors presented similar studies. Owen and Savage [29] analysed hidden services' contents resorting to manual classification, while Biryukov et al. used topic modelling techniques. Other works focus their attention on specific subsets of Tor hidden services, like blogs, forums or marketplaces. An example is [30], where Soska and Christin present a study focused on Tor marketplaces showing interesting evolutionary properties of the anonymous marketplace ecosystem. The authors perform a specific Tor Web crawling, collecting data from 16 different marketplaces over more than two years, without focusing on a specific category of products. The results of their study suggest that marketplaces are quite resilient to law enforcement takedowns and large-scale frauds. They also show that the majority of the products being sold belong to the drugs category.

Web crawling is a well-studied topic, and several successful crawler designs have been proposed [9, 20, 28]. Nevertheless, research interest in the subject is still significant, with recent studies concentrating on focused and specialized crawlers. As a matter of fact, to extract specific information from Web sources like blogs, forums or websites it is usually necessary to design a customized crawler. An example is the BlogForever Crawler [7] presented by Blanvillain et al., whose goal was to develop a crawler able to automatically extract information from blog posts (e.g. articles, authors, dates and comments). The crawler is actually a component of the BlogForever platform, whose aim is to harvest, preserve, manage and reuse blog content. Zhao et al. [34] propose SmartCrawler, a framework for efficiently harvesting deep Web interfaces. The authors show that SmartCrawler is capable of achieving both wide coverage of deep Web interfaces and highly efficient crawling. Another example of a special-purpose crawler is TwitterEcho, an open source crawler proposed by Bošnjak et al. [13]. TwitterEcho is a Twitter crawler which allows retrieval of Twitter data from a focused community of interest. The crawler is designed as a distributed application enabling the crawling of high volumes of data, while respecting the limits imposed by Twitter. The authors state that TwitterEcho was designed to support academic researchers or analysts who wish to carry out research with focused Twitter data.

We observe that in general the crawling of particular types of websites, such as Tor's hidden services, requires the use of tailored approaches and customized tools. Specifically, we evaluated different alternatives for the development of our general-purpose crawler, including the implementation of a crawler prototype from scratch, and finally chose BUbiNG [10] as the base for our crawler component. We additionally support the use of focused crawlers to improve data collection, with the possibility to merge data gathered resorting to different crawlers in a single dataset. This approach drove the design and implementation of our toolkit.

3 THE TOOLKIT
We now present our toolkit, whose components and workflow are summarized in Figure 1. It consists of a core application for massive Web crawling, indexing and text mining, and of two external independent modules for customized/focused crawling and structure mining, respectively. (Researchers interested in using/testing our toolkit are invited to contact us. Other than sharing the latest version, we will be available for support in all configuration aspects, including the integration of different semantic text analysis modules.)

Figure 1: Components and workflow.

3.1 The Core Application
At the heart of the toolkit is a core application in charge of automatically exploring Tor websites and collecting their contents, while indexing and clustering the gathered data. Based on an analysis of other applications oriented to the collection and analysis of data, we designed the core of our toolkit around four main components, all written in Java: a coordinator, a crawler, an extractor and an analyser. To manage the operations of the Java application, two scripts are provided to start and stop its execution.

In the following, we describe the toolkit's core application in more detail: we start with an overview of its components and workflow, which prompts us to report on implementation and operational aspects; we then summarize all possible configuration options, pointing out the most important ones; finally, we focus on the crawler, discussing its design and functionalities.

3.1.1 Components and Workflow. The main process of the core application, which we developed from scratch, is the coordinator, responsible for organizing the operations of the other units. When launching the tool, the coordinator is activated and it starts reading the configuration file and setting up the application. It checks the existence of the database and creates one if it does not exist. After the initialization procedures, the coordinator activates the crawler, the process which concretely performs the task of visiting Tor websites. The crawler carries out a breadth-first search of Tor websites and stores all retrieved resources in a WARC archive. A list of known hidden services, the root set, is provided to the crawler to determine the data collection starting points. When the archive reaches a threshold size, the coordinator stops the crawler. The threshold size of the archive is a configuration parameter, which can be modified in the configuration file. Once the crawler stops, the coordinator moves the WARC archive and creates a new empty file, which will be used by the crawler at its next activation; then the extractor starts.

The extractor is a multi-thread process that concurrently reads and extracts text from resources contained in the WARC archive (and/or in a file system directory, see Section 3.1.2). The extractor reads the WARC archive, extracts texts from the collected data and stores them in a database. To extract text from digital documents, the extractor uses the Apache Tika API (https://tika.apache.org). Apache Tika is a project of the Apache Software Foundation, providing a Java toolkit able to detect and extract metadata and text from over a thousand different file types. The extractor only stores texts from web pages that replied successfully during the crawling (HTTP status code "200 OK"), and it filters extracted texts according to specific configuration options (e.g. on a language basis). All data are stored in a document-oriented database, in which each web resource is stored as a document. For each web resource the extractor stores: (i) the HTTP response header, (ii) the WARC record header, (iii) the metadata provided by Tika, (iv) the extracted text, and (v) the language of the document's text. Specifically, we use ArangoDB (https://www.arangodb.com), a multi-model, open-source, NoSQL database with flexible data models for documents, graphs, and key-values. It supports ACID transactions if required and provides an SQL-like query language as well as JavaScript extensions. Once the extractor terminates its operations, the coordinator activates the analyser.

The analyser is a multi-thread process that concurrently reads documents from the database and sends their texts to the semantic engine for analysis. For each document the analyser prepares a RESTful request, containing its text and language, and sends the request to the engine. The engine sends back the analysis results, which are again stored by the analyser in the initial database, together with the document's other information. Once all documents have been analysed, the coordinator stops the analyser.

In our test, reported in Section 4, the analyser relies on Cogito, a multi-language semantic engine developed by Expert System, which can understand the meaning in context within unstructured text. Cogito is able to find hidden relationships, trends and events, transforming unstructured data into structured information. Through several analyses, it identifies three different types of entities (people, places and companies/organizations), categorizes documents on the basis of several taxonomies and is able to extract entity co-occurrences. Since Cogito is proprietary software, it cannot be included in any freely available version of our toolkit. Yet, our analyser can be used out-of-the-box with any semantic engine providing RESTful APIs. Embedding the analyser with other options for text mining is our primary goal for further development of the toolkit.
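To make the exchange just described concrete, the following is a minimal sketch of how an analyser thread might submit one document to a RESTful semantic engine. The endpoint layout, form-field names and JSON response are assumptions modelled on the configuration parameters of Table 1 (engineHost, enginePath, engineKey, engineKeyFieldName), not the actual Cogito API.

```python
# Hypothetical sketch of one analyser request; field names and endpoint are assumptions.
import requests

def analyse_document(doc: dict, engine_host: str, engine_path: str,
                     engine_key: str, key_field_name: str) -> dict:
    """Send a document's text and language to a RESTful semantic engine."""
    url = f"http://{engine_host}{engine_path}"
    payload = {
        key_field_name: engine_key,   # access key, per 'engineKey'/'engineKeyFieldName'
        "text": doc["text"],          # extracted text (field name is an assumption)
        "language": doc["language"],  # language detected by the extractor
    }
    response = requests.post(url, data=payload, timeout=60)
    response.raise_for_status()
    return response.json()            # analysis results, stored back in the database
```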

3.1.2 Configuration. The application behaviour can be set up by modifying a Java properties file, named "config.properties" (see Table 1 for a list of all available configuration parameters). First of all, through the configuration file it is possible to determine what operations the application will perform, by selecting any subset of the three building blocks, i.e. crawling, extraction, and analysis. If the extractor is enabled, either or both of two supported modalities can be selected, namely extraction from WARC and extraction from file. In the first modality the extractor uses a WARC file, in the latter it uses a file system directory to retrieve the resources to elaborate. A number of options can be specified through the configuration file, including: (i) the number of threads used by the extractor and the analyser agents; (ii) the threshold size of WARC archives generated by the crawler – when that size is reached, the archive is passed to the extractor and a new archive is created by the crawler; (iii) the directory used by the crawler to store data; (iv) the name of the database to be created or used to store documents; (v) the name of the database collections used by the extractor and the analyser agents; (vi) the directory used by the extractor to retrieve files – in case the extraction from file modality has been activated. Moreover, it is possible to activate the file system data storage option, i.e. data produced by the crawler, extractor and analyser can be stored locally in the file system, keeping in mind that a copy of the same data is stored in the database by default. Finally, the configuration file contains the address and other parameters used to contact the RESTful web service exposed by the semantic engine. The settings concerning the RESTful web service allow the use of different engines, whether remote or local, for the analysis of textual data.

Table 1: List of all available configuration parameters (file 'config.properties').

Parameter              | Description
extractorThreads       | number of threads used by the extractor
analyserThreads        | number of threads used by the analyser
dbUser                 | username used to access the database
dbPassword             | password used to access the database
dbName                 | name of the database used to store documents
extractedDocCollection | name of the database collection used to store extracted text
analysedDocCollection  | name of the database collection used to store analysed text
crawling               | enable/disable the crawling operation
extraction             | enable/disable the extraction operation
analysis               | enable/disable the analysis operation
crawlerDir             | directory used by the crawler to create WARC files
extractWARC            | enable/disable extraction from WARC, requires 'extraction=true'
extractFile            | enable/disable extraction from file, requires 'extraction=true'
fileDir                | directory containing files to extract, requires 'extractFile=true'
languageFilter         | comma-separated list of languages; only texts in these languages are stored and analysed
warcSizeThreshold      | threshold size of the WARC archive; when reached, the crawler is stopped and the extraction agent is started
storeFilesExtraction   | enable/disable storage of extracted texts on the file system
dirFilesExtraction     | directory used to store extracted texts, requires 'storeFilesExtraction=true'
storeFilesAnalysis     | enable/disable storage of analysis results on the file system
dirFilesAnalysis       | directory used to store analysis results, requires 'storeFilesAnalysis=true'
storeFilesCrawling     | enable/disable storage of crawled data on the file system
dirFilesCrawling       | directory used to store crawled data, requires 'storeFilesCrawling=true'
engineHost             | host name of the RESTful web service
enginePath             | invoked endpoint of the RESTful web service
engineKey              | key used to invoke the RESTful web service
engineKeyFieldName     | name of the form field used to store 'engineKey'

3.1.3 The Crawler. We evaluated different alternatives for the development of the crawler. In particular, we implemented a crawler prototype from scratch and evaluated it against three main existing candidates: Apache Nutch (http://nutch.apache.org) [21], Heritrix (https://webarchive.jira.com/wiki/display/Heritrix) [27] and BUbiNG [10]. Considering several criteria, such as performance, configurability, extensibility and supportability, we found BUbiNG to be the most appropriate choice as the base for our crawler component. BUbiNG is a high-performance, scalable, distributed, open-source crawler, written in Java, and developed by the Laboratory for Web Algorithmics (LAW) at the Computer Science Department of the University of Milan.

Significant efforts were needed for the integration of BUbiNG within our toolkit, so as to turn it into an application component under the control of the coordinator. Moreover, we needed to enable BUbiNG to operate in Tor instead of the surface Web. Whereas, by default, BUbiNG presents a set of threads that perform DNS requests, in our application we avoid these requests and send them directly to an HTTP proxy. We chose to use Privoxy (https://www.privoxy.org), after testing other alternatives. In particular, we decided not to use Polipo (https://www.irif.fr/~jch/software/polipo/), which is often used in combination with Tor, because it is no longer maintained and currently seems unable to correctly manage the format of some HTTP responses. However, any HTTP proxy configured to use Tor can be used. Through the crawler configuration file, which is a Java properties file named 'crawler.properties', it is possible to specify which HTTP proxy the crawler must use. Several other parameters can be set through that configuration file, including: (i) the number of threads the crawler will use for its operations (parsing, DNS resolution, fetching); (ii) which resources are collected from websites (e.g. HTML pages, media files, digital documents); (iii) network timeouts used when contacting websites; (iv) where collected data will be stored; (v) how and whether to manage cookies; (vi) the delays between requests to the same website. For a complete list of configuration parameters we refer the reader to BUbiNG's documentation (http://law.di.unimi.it/software/bubing-docs/overview-summary.html). With our toolkit we provide a standard crawler configuration, tested for the Tor network, so editing that file is not needed for standard usage.

For what concerns the crawling process, in BUbiNG a pool of software agents is responsible for both exploring the Web and collecting (and partially elaborating) the data. Such agents work in parallel, each handling in turn several threads, and by default implementing a breadth-first search. Due to Tor's high volatility, the results of the crawling process are theoretically susceptible to fluctuations based on the order in which links are followed. Yet, we did not have any reason to prefer one order over another, and we therefore did not modify this setting. Summing up, in our experiments BUbiNG was used as follows:

• A predetermined set of hidden services, the root set, is inserted in an url list.
• The first fetching thread available extracts the first onion url from the list and exports the content of the corresponding hidden service in main memory.
• The first parsing thread available analyses that content, aiming at extracting new onion urls to visit.
• The new onion urls found are passed to a sieve able to verify whether those urls were already visited before.
• If so, the urls are discarded; otherwise they are added at the tail of the url list.
• Fetching threads attempt to contact each url for a maximum of three times, after which the url is considered not available.
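The list above maps onto a classic frontier-plus-sieve loop. The sketch below is not BUbiNG's code; it is a minimal single-threaded illustration, with a hypothetical fetch_and_parse helper, of the breadth-first logic and of the three-attempt policy just described.

```python
# Minimal, single-threaded illustration of the frontier/sieve logic described above.
# fetch_and_parse() is a hypothetical helper returning the onion urls found in a page.
from collections import deque

def crawl(root_set, fetch_and_parse, max_attempts=3):
    frontier = deque(root_set)   # url list, initialised with the root set
    seen = set(root_set)         # sieve: urls already scheduled or visited
    failures = {}                # per-url count of failed contact attempts

    while frontier:
        url = frontier.popleft()             # breadth-first: take the head of the list
        try:
            outlinks = fetch_and_parse(url)  # fetch the hidden service, extract onion urls
        except OSError:
            failures[url] = failures.get(url, 0) + 1
            if failures[url] < max_attempts: # retry at most three times,
                frontier.append(url)         # otherwise consider the url not available
            continue
        for link in outlinks:
            if link not in seen:             # already-visited urls are discarded,
                seen.add(link)
                frontier.append(link)        # new ones go to the tail of the url list
```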

3.2 External Modules
Alongside the core application, our toolkit comprises two external modules which can be launched independently: a set of Scrapy Spiders, written in Python, and a Graph Builder, written in C. The spiders enhance the crawling process by permitting focused crawling, supporting semi-automated procedures, and offering anti-detection settings. The Graph Builder supports the reconstruction of the graph associated with the crawling process, enabling topological studies of the explored portion of the Web.

3.2.1 Scrapy Spiders. Besides BUbiNG, we developed and integrated in our toolkit a set of customized spiders which can be managed as modules and used to boost the crawling process. Specifically, our spiders were written relying on Scrapy (https://scrapy.org), a Python framework for crawling websites and extracting data. Using ad-hoc spiders besides the main crawler allows for focused crawling, which supports a semi-automated (i.e. human assisted) procedure needed to gain access to hidden contents requiring a login procedure or a captcha solver. Moreover, through configuration settings, we are able to choose a breadth-first or depth-first search strategy for the crawling procedure. A further customization required for our spiders consists in avoiding crawling detection, which may be implemented by our targets. To that purpose, we included configuration settings concerning the robots exclusion protocol, network timeouts and (again) delays between requests to the same website. Furthermore, we programmed our spiders to visit only specific sections of targeted websites and to carry out only legal and not suspicious actions. Finally, the data collected by the Scrapy Spiders are stored as files in a directory and are integrated in the core application via the extractor module. Indeed, through configuration settings, the extractor unit can be instructed to retrieve files from file system directories. The Scrapy Spiders module can be used as a template to create other spiders with the same framework, or as an example of how to integrate data collected by other crawlers.

3.2.2 Graph Builder. The Graph Builder is written in C and supports the reconstruction of the graphs associated with the crawling process, using the WARC archive created by BUbiNG. To parse HTML files and extract links we use the myhtml library (https://github.com/lexborisov/myhtml). The Graph Builder module is a multi-thread application; it takes as input the name of the WARC file to read and the number of threads to use to build the graphs. For each WARC file, three directed graphs are created by the module, a page graph, a host graph and a service graph, which are represented as lists of edges. In the page graph an edge exists between page A and page B if there is at least a link from page A to page B. In the host graph an edge exists between host A and host B if there is at least a page of host A that links to a page of host B. The service graph is the higher level of grouping, in which each node represents a hidden service, identified by a 16-character string (base32 encoded). In this graph an edge exists between service A and service B if there is at least a page of service A that links to a page of service B. Each graph is represented by two files that are written as output by the module: a ".index" and a ".edges" file. The ".index" file contains the id of each graph node with the corresponding url, while the ".edges" file contains the list of the graph's edges represented as id pairs.
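As an illustration of how the Graph Builder's output can be consumed, the short sketch below loads one ".index"/".edges" pair into memory. The whitespace-separated layout assumed here (id and url per index line, id pairs per edge line) is an assumption for the example, since only the logical content of the two files is specified above.

```python
# Sketch: load a graph produced by the Graph Builder (".index" + ".edges" files).
# The whitespace-separated layout assumed here is an illustration, not the exact format.

def load_graph(basename: str):
    id_to_url = {}
    with open(basename + ".index") as index_file:
        for line in index_file:
            node_id, url = line.split(None, 1)     # node id and corresponding url
            id_to_url[int(node_id)] = url.strip()

    edges = []
    with open(basename + ".edges") as edges_file:
        for line in edges_file:
            src, dst = line.split()                # each edge is a pair of node ids
            edges.append((int(src), int(dst)))

    return id_to_url, edges
```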

3.3 Usage
To manage the execution of the core application we provide three scripts, which are used to start, stop and reload the configuration file. Moreover, the core application handles the following POSIX signals: SIGTERM, SIGHUP and SIGINT. If the application receives a SIGINT or SIGTERM it activates the stop procedure; if it receives a SIGHUP it schedules the reload of the configuration file. The other two modules, namely the Graph Builder and the Scrapy Spiders, have to be started separately. The user can decide to use any subset of the modules and operations provided by the toolkit. In particular, the Graph Builder should be activated only at the end of the crawling procedure, whereas the Scrapy Spiders can be activated whenever needed. For the Scrapy Spiders the only requirements are the activation of the core application and the activation of the extraction-from-file modality. In case these requirements are not met, data collected by the spiders won't be extracted, analysed and stored in the database by the core application.

4 TESTING THE TOOLKIT
To assess the performance and the potential of our toolkit, we tested its ability to mine the contents and structure of the Tor Web. In this section, we report relevant findings of this preliminary, yet wide, exploration run. More details can be found in [4], where we thoroughly investigate the topology of the Tor Web graph, the semantics of extracted texts, and their mutual correlation. In the following, we first briefly recap what the Tor Web is and how it works, in order to both motivate and contextualize the successive analysis.

4.1 Tor Principles
The expression Tor Web refers to the network made of all Tor's hidden services, connected through hyperlinks. This network is not to be confused with the network of Tor's relays, i.e. routers managed by volunteers over the Internet upon which Tor's anonymity relies, through multiple layers of encryption that are stripped off one by one along the route that connects source to destination [17]. To contact a hidden service, a client needs to obtain the service's onion url (an 80-bit excerpt of the SHA-1 hash digest of its public key), which is used to download a signed descriptor, containing the service's public key and a list of introduction points. Communication between client and hidden service takes place through secure circuits to a commonly known relay, known as rendezvous point. The rendezvous point is chosen by the client and communicated to the hidden service via one of the introduction points. The security of Tor relies on cryptography at three different levels: (i) encryption for data privacy within Tor, (ii) authentication between clients and relays, and (iii) integrity/authenticity of the list of relays, stored by special nodes of the network called directory authorities, endowed with their own directory signing key.
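As a concrete illustration of the onion url format mentioned above (the 16-character hidden service addresses in use at the time of this work), the following sketch derives the name from a service's DER-encoded public key; obtaining the key bytes themselves is outside the scope of the example.

```python
# Sketch: derive a hidden service name from its public key (16-character onion address).
import base64
import hashlib

def onion_address(der_public_key: bytes) -> str:
    digest = hashlib.sha1(der_public_key).digest()  # SHA-1 hash digest of the public key
    truncated = digest[:10]                         # 80-bit excerpt (first 10 bytes)
    name = base64.b32encode(truncated).decode("ascii").lower()
    return name + ".onion"                          # 16 base32 characters + ".onion"
```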

Developing suitable tools to explore and analyse the Tor Web is of primary importance for a number of reasons. First, searching through the Tor Web is not possible using standard approaches, since Tor's hidden services are not indexed by traditional search engines like Google or Bing. Specific Tor search engines exist, accessible through the surface Web (e.g. Ahmia, https://ahmia.fi, and Onion.link, https://onion.link) or through Tor only (e.g. DuckDuckGo, http://3g2upl4pq6kufc4m.onion, and TORCH, http://xmh57jrzrnw6insl.onion), but they are not likewise reliable, mostly because Tor is very volatile and not as connected as the surface Web. Even the size of the Tor Web can hardly be estimated, despite a few crawling attempts [5, 30, 31]. Although the analysis of the surface Web graph has flourished in the past, to the best of our knowledge no similar result/work for the Tor Web exists, and consequently little or no information over the structure of the Tor hidden services network is known to date. A few attempts at classifying the content of Tor hidden services can be found in the literature [5, 30, 31], but each presents limitations related to either the scale or the scope of the crawling, or to the techniques – mostly, topic model-based – used to mine the collected textual data. Both novel instruments and further studies are needed to fully characterize the Tor Web and to collect as much information as possible on its structure and content, which is instrumental in reaching a clear understanding of its functioning and of the habits of its users (hidden service owners and clients).

4.2 Data Collection
As shown by past attempts [5, 25], exploring the entire Tor Web is not feasible/practical for a number of reasons: it can be unstructured, or volatile/temporary; furthermore, relaying timeouts often occur. Additionally, many hidden services could deliberately try to limit their visibility (e.g. by not advertising themselves). We therefore argue that a complete scan of the Tor Web is only possible by supporting the crawler with some sort of brute-force searcher, able to continuously feed the url list of the crawler with new hidden services found by participating in Tor (e.g. to collect hidden service descriptors), or by trying to access random onion urls. We followed a viable alternative, used in related studies [30], consisting in looking at the portion of the Tor Web that is accessible from the surface Web. Specifically, our root set was composed of a (large) set of hidden services obtained by merging together the results of both standard (e.g. Google) and Tor-specific (e.g. Ahmia) search engines, other than onion urls advertised on a few Tor wikis and link directories, like "The Hidden Wiki" page (wikitjerrta4qgz4.onion). Notably, our root set contains the complete list of hidden services indexed by Ahmia at the time our crawling activity started. This choice can be read as follows: we focus on the portion of the Tor Web that is accessible to "common" users (our notion of a common user includes even advanced users that know how to run a crawler, but not users who are given the url of a hidden service by the owner of that service), only leaving out the most isolated and (probably) most volatile hidden services (not very interesting, at least from a structural point of view), and thus measuring the size of the "public" part of the Tor Web.

We ran the crawler for about six weeks. Since preliminary tests showed that the latency of the Tor network is much higher than expected, we set the connection and socket timeouts to 360 seconds. At the end of the process our dataset contained 1119048 records, thus the crawler tried to connect to 1119048 urls. However, the actually visited urls were 918885, 824324 of which replied with a success HTTP status code (200), whereas the other 94561 replied with a 3xx status code (a 3xx status code is related to Web redirection, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). All remaining connection attempts ended with an error code: 96816 urls replied with a 4xx status code, and 103347 urls with a 5xx status code. The numbers of our crawling are summarized in Table 2. Let us underline that these numbers do support the quality of both our root set and the crawling process: the amount of hidden services we were able to analyse is comparable to those used in previous similar attempts in the literature [5, 31].

Table 2: Crawling results

Number of pages                      918885
Number of domains/subdomains         5420
Number of hidden services            5144
Total number of hyperlinks           20446513
Number of bidirectional hyperlinks   2483366

4.3 Structure Mining
By feeding the Graph Builder module of our platform with all data gathered in the crawling phase, we were able to reconstruct the graph of the explored portion of the Tor Web. The outer border of the Tor network, i.e. pages of the surface Web that link to or are linked by Tor pages, is not reached by the crawler, and is therefore not included in the graph (but this behaviour can be easily adjusted by modifying the configuration file of the crawler). Quite naturally, it is possible to define three different graphs: (i) a page graph, whose nodes are visited pages and whose edges are hyperlinks between pages; (ii) a host graph (HG), where pages belonging to the same host (i.e. the same Tor domain or subdomain) are grouped into a single node, and all hyperlinks among pages of two different groups collapse into a single edge; (iii) a service graph, analogous to the host graph, except that grouping occurs at the hidden service level. For our structural analysis, we focused on the host graph, which we believe better synthesizes the connectivity and navigability of the Tor Web. We considered both its directed version (the direction of each edge is of course induced by the corresponding hyperlink) and its undirected version, which exhibit significant differences. Table 3 summarizes the characteristics of the two graphs.

Figure 2: In-degree and out-degree distributions for the directed host graph.

Figure 3: Joint in-out-degree distribution for the directed host graph.

Figures 2 and 3 provide a snapshot of the in- and out-degree distributions of the directed host graph, focusing, for the sake of clarity, on the most relevant parts of the plots. From Figure 2 we see that, despite the two distributions having a somewhat similar qualitative behaviour, the out-degree is significantly more squashed at 0. Specifically, ∼ 72% of the hosts have out-degree 0, i.e., no outgoing hyperlinks, and adding up hosts with out-degree 1 we reach ∼ 90% of the whole graph, while hosts with out-degree 2 or 3 are one order of magnitude rarer than those with out-degree 1. On the other hand, having in-degree 0 or 1 is far less likely for a host, which is reasonable if we think of how the host graph is built based on the page graph. The distance between the two distributions is notable up to degree ∼ 30; after that the two trends become very comparable. Overall, Figure 2 suggests that most hosts have only a few inbound connections, and do not provide any link to any other host. This is further confirmed by the joint in-out-degree distribution, depicted in Figure 3, which shows that almost half of the hosts concurrently have out-degree 0 and in-degree ≤ 10. Among other things, these figures underline the importance that specific hosts (most likely, link directories) have in the topology of the Tor Web graph, providing links to a very large number of other hosts with very small in-degree and no outgoing edges, in a star-like structure.
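The distributions shown in Figures 2 and 3 follow directly from the host graph's edge list. The sketch below, under the assumption that the directed edges are available as (source, target) pairs over node ids 0..num_nodes-1, computes the fraction of hosts per in- and out-degree.

```python
# Sketch: in-/out-degree distributions of the directed host graph from its edge list.
from collections import Counter

def degree_distributions(edges, num_nodes):
    out_deg = Counter(src for src, dst in edges)   # out-degree of each host
    in_deg = Counter(dst for src, dst in edges)    # in-degree of each host

    def distribution(deg):
        # Fraction of nodes having each degree value; nodes absent from `deg` have degree 0.
        hist = Counter(deg.get(node, 0) for node in range(num_nodes))
        return {k: v / num_nodes for k, v in sorted(hist.items())}

    return distribution(in_deg), distribution(out_deg)
```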

Table 3: The host graphs

Graph      | Nodes | Edges | CCs (GCS)                        | Diameter | Max Degree
Undirected | 5420  | 64379 | 1 (5420)                         | 4        | 4759
Directed   | 5420  | 65716 | WCCs: 1 (5420); SCCs: 4514 (670) | ∞        | In: 303; Out: 4751

(W/S)CCs = (Weakly/Strongly) Connected Components – GCS = Giant Component Size

Figure 4: Degree distribution for the undirected host graph.

Figure 5: Normalized BC distribution for the host graphs.

If we consider the volatility of the Tor network, and that even the surface Web has been shown to present similar properties [22], it is not surprising that the network of Tor hosts is very disconnected if represented as a directed graph. However, in Figure 4 we see that the undirected version of the graph is significantly more connected than its directed counterpart. In particular, zooming in on degrees ≤ 100, we see that no node has degree 0 (in fact, there is only one giant connected component), and that degrees 1, 2 and 3 together account for only ∼ 10% of the whole graph.

Finally, to have a better understanding of the importance of single hosts in the Tor network, in Figure 5 we plot the statistical distribution of the normalized Betweenness Centrality (BC), comparing the directed and undirected host graph. The BC score of a node is proportional to the number of shortest paths passing through it; roughly speaking, it quantifies the importance of that node in the overall connectivity of the graph and in the way information flows through it. Although some slight differences emerge, the overall trend is the same for the two graphs: there are a few very central nodes, surrounded by a vast majority of peripheral nodes. For the computation of the BC we resorted to a multi-GPU implementation [3, 32].
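For graphs of this size an exact BC computation is expensive, which is why we resorted to the multi-GPU implementation of [3, 32]. Purely as an illustration of the measure plotted in Figure 5, the normalized BC of a modest-size host graph could be obtained with an off-the-shelf library as follows.

```python
# Illustrative only: normalized betweenness centrality with networkx,
# feasible for small graphs; the paper's computation uses a multi-GPU code [3, 32].
import networkx as nx

def normalized_bc(edges, directed=True):
    graph = nx.DiGraph(edges) if directed else nx.Graph(edges)
    # Brandes' algorithm on an unweighted graph; scores are normalized to [0, 1].
    return nx.betweenness_centrality(graph, normalized=True)
```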

4.4 Content Mining
In order to evaluate the content of all collected data, we made use of the Cogito semantic engine, which can be integrated into our platform thanks to its RESTful APIs, and used to analyse all textual data collected (we focused on English text, but Cogito's taxonomy can be extended to other languages). Cogito's ability to understand what a text is about relies on a customizable semantic network called Sensigrafo, and on a disambiguation engine called Essex. Both modules strongly depend on the specific taxonomy used to model the concepts of interest. In our case, the taxonomy was defined within the scope of the "IANCIS" ISEC Project, with categories chosen so as to cover a wide range of topics, including, but not limited to, illegal activities often associated with Tor. These categories, together with an overall score that quantifies their measured relevance in our dataset, are reported in Table 4. Specifically, in our framework Cogito was configured to return, for each page p, a value Sp[i] ∈ [0, 1] that quantifies to which extent the content of p can be associated to the i-th category of our taxonomy. Putting together the scores Sp[i] for all i yields a semantic vector Sp of size n (n = 17 in our case) that describes the "position" of page p with respect to the taxonomy.

The overall score of category i reported in Table 4 is obtained by averaging Sp[i] over all pages p whose semantic vector is not all null, i.e. whose content is somewhat relevant with respect to at least one category of our taxonomy. What emerges from Table 4 is that "Information System" is by far the most relevant category in the dataset, probably due to its wide scope, followed by "Social Network", which captures the abundance of forums and markets where Tor users interact. Not surprisingly, topics related with cyber criminality are also significantly relevant, as well as "Illicit Drugs" and "Terrorism". Conversely, "Child Pornography" related content appeared in our dataset less frequently than one could imagine, based on previous works.

Table 4: Relevance scores of the categories of our taxonomy in crawled pages

Category              | Score
Information System    | 0.65872592
Social Network        | 0.16532867
Cyber Security        | 0.13046941
Illicit Drugs         | 0.09255990
Terrorism             | 0.07622640
Weaponary             | 0.05358337
Cyber Deception       | 0.05026633
Fraud                 | 0.03810798
Cyber Attack          | 0.03700734
Murder                | 0.01545559
Child Pornography     | 0.01166300
Rape                  | 0.00722616
Racism                | 0.00467250
IT Services Companies | 0.00149265
Media Companies       | 0.00121687
Arms Trafficking      | 0.00066599
Gambling              | 0.00024497

Finally, the content of two pages p1 and p2 can be compared based on the cosine similarity of the two corresponding semantic vectors, cosine(Sp1, Sp2), defined as the cosine of the angle between Sp1 and Sp2. If cosine(Sp1, Sp2) = 0 the two vectors are orthogonal and the two pages have nothing in common. Conversely, if cosine(Sp1, Sp2) = 1 the two pages can be considered fully equivalent, at least with respect to our taxonomy. We computed the pairwise cosine similarity of the semantic vectors associated to all pages of our dataset with the goal of assessing their semantic uniformity. Figure 6 shows the statistical distribution of the cosine similarity between any two pages of our crawl. In most cases cosine(Sp1, Sp2) ≈ 1 or cosine(Sp1, Sp2) = 0, which suggests that most semantic vectors are significantly unbalanced (i.e. the corresponding page can be associated with only a small subset of the categories considered), so that any two vectors are either very similar or completely unrelated. This fosters the intuition that the Tor Web is very topic-oriented, with most hidden services focusing on a specific topic, and only a few hubs, mostly marketplaces, forums, wikis, or link directories, that can be related to several different topics.

Figure 6: Statistical distribution of the cosine similarity among pages of our crawl.
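The two quantities discussed in this section, the per-category scores of Table 4 and the pairwise cosine similarities of Figure 6, follow directly from the semantic vectors. A minimal sketch, assuming the vectors are available as the rows of a NumPy array S with one column per category, is given below.

```python
# Sketch: category scores (Table 4) and cosine similarity (Figure 6) from semantic vectors.
import numpy as np

def category_scores(S: np.ndarray) -> np.ndarray:
    """Average each column of S over the pages whose semantic vector is not all null."""
    relevant = S[np.any(S > 0, axis=1)]   # keep pages relevant to at least one category
    return relevant.mean(axis=0)          # one overall score per category

def cosine_similarity(s1: np.ndarray, s2: np.ndarray) -> float:
    """Cosine of the angle between two semantic vectors (0 = unrelated, 1 = equivalent)."""
    denom = np.linalg.norm(s1) * np.linalg.norm(s2)
    return float(np.dot(s1, s2) / denom) if denom > 0 else 0.0
```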

5 CONCLUSIONS
In this paper, we presented a novel Web mining toolkit, designed with the aim of providing a wide-scope, flexible, easy-to-use instrument to thoroughly explore (portions of) the Web, collecting and analysing data, and reconstructing the graph structure of the crawled network. Motivated by the increasing attention paid by both researchers and society to submerged and anonymous Web servers and resources, we tailored our toolkit to operate on the Tor dark Web, but we left full control to the user for configuring/adjusting/modifying its behaviour.

Other than describing in detail our design and implementation choices, we reported the results of a test run executed over three months, exploring the portion of the Tor Web that can be reached starting from the surface Web. Our findings, analysed more in depth in [4], provide a fascinating portrait of the world's most famous darknet, clarifying the effectiveness and potential of our toolkit. By making our suite available on demand, we expect to pave the way for further research able to provide meaningful insights on the characteristics of the most hidden and intriguing sides of the Web.

The only element of our toolkit that we cannot release is the semantic engine used in our test run, which is proprietary software. Even though our analyser can interact with any engine providing RESTful APIs, implementing our own text mining tool to replace Cogito is the primary direction for future extensions of our toolkit. Luckily, the literature flourishes with compelling ideas [8, 26] which can be developed into algorithms and applications for addressing communal desiderata of Web content mining, such as extracting topics/clusters and assessing document similarities.

REFERENCES
[1] Daniel Arp, Fabian Yamaguchi, and Konrad Rieck. 2015. Torben: A Practical Side-Channel Attack for Deanonymizing Tor Communication. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security (ASIA CCS '15). ACM, New York, NY, USA, 597–602. DOI: https://doi.org/10.1145/2714576.2714627
[2] Monica J. Barrat. 2012. Silk Road: Ebay for Drugs. Addiction 107, 3 (2012), 683–683. DOI: https://doi.org/10.1111/j.1360-0443.2011.03709.x
[3] Massimo Bernaschi, Giancarlo Carbone, and Flavio Vella. 2016. Scalable Betweenness Centrality on multi-GPU Systems. In Proceedings of the ACM International Conference on Computing Frontiers (CF '16). ACM, New York, NY, USA, 29–36. DOI: https://doi.org/10.1145/2903150.2903153
[4] Massimo Bernaschi, Alessandro Celestini, Stefano Guarino, and Flavio Lombardi. 2017. Exploring and Analyzing the Tor Hidden Services Graph. ACM Transactions on the Web (2017), To appear.
[5] Alex Biryukov, Ivan Pustogarov, Fabrice Thill, and Ralf-Philipp Weinmann. 2014. Content and Popularity Analysis of Tor Hidden Services. In Distributed Computing Systems Workshops (ICDCSW), 2014 IEEE 34th International Conference on. 188–193. DOI: https://doi.org/10.1109/ICDCSW.2014.20
[6] Alex Biryukov, Ivan Pustogarov, and Ralf-Philipp Weinmann. 2013. Trawling for Tor Hidden Services: Detection, Measurement, Deanonymization. In Proceedings of the 2013 IEEE Symposium on Security and Privacy (SP '13). IEEE Computer Society, Washington, DC, USA, 80–94. DOI: https://doi.org/10.1109/SP.2013.15
[7] Olivier Blanvillain, Nikos Kasioumis, and Vangelis Banos. 2014. BlogForever crawler: techniques and algorithms to harvest modern weblogs. In Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14). ACM, 7.
[8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[9] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience 34, 8 (2004), 711–726.
[10] Paolo Boldi, Andrea Marino, Massimo Santini, and Sebastiano Vigna. 2014. BUbiNG: Massive crawling for the masses. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion. 227–228.
[11] Paolo Boldi and Sebastiano Vigna. 2004. The Webgraph Framework I: Compression Techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW '04). ACM, New York, NY, USA, 595–602. DOI: https://doi.org/10.1145/988672.988752
[12] Anthony Bonato. 2005. A Survey of Models of the Web Graph. In Combinatorial and Algorithmic Aspects of Networking, Alejandro López-Ortiz and Angèle M. Hamel (Eds.). Lecture Notes in Computer Science, Vol. 3405. Springer Berlin Heidelberg, 159–172. DOI: https://doi.org/10.1007/11527954_16
[13] Matko Bošnjak, Eduardo Oliveira, José Martins, Eduarda Mendes Rodrigues, and Luís Sarmento. 2012. TwitterEcho: a distributed focused crawler to support open research with twitter data. In Proceedings of the 21st International Conference on World Wide Web. ACM, 1233–1240.
[14] Soumen Chakrabarti, Amit Pathak, and Manish Gupta. 2011. Index Design and Query Processing for Graph Conductance Search. The VLDB Journal 20, 3 (June 2011), 445–470. DOI: https://doi.org/10.1007/s00778-010-0204-8
[15] Francisco Claude and Susana Ladra. 2011. Practical Representations for Web and Social Graphs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11). ACM, New York, NY, USA, 1185–1190. DOI: https://doi.org/10.1145/2063576.2063747
[16] Francisco Claude and Gonzalo Navarro. 2010. Fast and Compact Web Graph Representations. ACM Trans. Web 4, 4, Article 16 (Sept. 2010), 31 pages. DOI: https://doi.org/10.1145/1841909.1841913
[17] Roger Dingledine, Nick Mathewson, and Paul Syverson. 2004. Tor: The Second-Generation Onion Router. In Proceedings of the 13th Usenix Security Symposium.
[18] Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner. 2014. Web data extraction, applications and techniques: A survey. Knowledge-Based Systems 70 (2014), 301–323. DOI: https://doi.org/10.1016/j.knosys.2014.07.007
[19] Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans M. Coetzee. 2002. Self-Organization and Identification of Web Communities. IEEE Computer 35 (2002), 66–71.
[20] Allan Heydon and Marc Najork. 1999. Mercator: A scalable, extensible web crawler. World Wide Web 2, 4 (1999), 219–229.
[21] Rohit Khare, Doug Cutting, Kragen Sitaker, and Adam Rifkin. 2004. Nutch: A flexible and scalable open-source web search engine. Oregon State University 1 (2004), 32–32.
[22] Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew S. Tomkins. 1999. The Web as a Graph: Measurements, Models, and Methods. In Computing and Combinatorics, Takano Asano, Hideki Imai, D.T. Lee, Shin-ichi Nakano, and Takeshi Tokuyama (Eds.). Lecture Notes in Computer Science, Vol. 1627. Springer Berlin Heidelberg, 1–17. DOI: https://doi.org/10.1007/3-540-48686-0_1
[23] Raymond Kosala and Hendrik Blockeel. 2000. Web Mining Research: A Survey. SIGKDD Explor. Newsl. 2, 1 (June 2000), 1–15. DOI: https://doi.org/10.1145/360402.360406
[24] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, and Eli Upfal. 2000. Stochastic models for the Web graph. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on. 57–65. DOI: https://doi.org/10.1109/SFCS.2000.892065
[25] Damon McCoy, Kevin Bauer, Dirk Grunwald, Tadayoshi Kohno, and Douglas Sicker. 2008. Shining Light in Dark Places: Understanding the Tor Network. In Privacy Enhancing Technologies, Nikita Borisov and Ian Goldberg (Eds.). Lecture Notes in Computer Science, Vol. 5134. Springer Berlin Heidelberg, 63–76. DOI: https://doi.org/10.1007/978-3-540-70630-4_5
[26] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[27] Gordon Mohr, Michael Stack, Igor Ranitovic, Dan Avery, and Michele Kimpton. 2004. An Introduction to Heritrix: An open source archival quality web crawler. In IWAW'04, 4th International Web Archiving Workshop. Citeseer.
[28] Christopher Olston, Marc Najork, and others. 2010. Web crawling. Foundations and Trends® in Information Retrieval 4, 3 (2010), 175–246.
[29] Gareth Owen and Nick Savage. 2016. Empirical analysis of Tor hidden services. IET Information Security 10, 3 (2016), 113–118.
[30] Kyle Soska and Nicolas Christin. 2015. Measuring the Longitudinal Evolution of the Online Anonymous Marketplace Ecosystem. In 24th USENIX Security Symposium, USENIX Security 15, Washington, D.C., USA, August 12-14, 2015. 33–48.
[31] Martijn Spitters, Stefan Verbruggen, and Mark van Staalduinen. 2014. Towards a Comprehensive Insight into the Thematic Organization of the Tor Hidden Services. In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint. 220–223. DOI: https://doi.org/10.1109/JISIC.2014.40
[32] Flavio Vella, Giancarlo Carbone, and Massimo Bernaschi. 2016. Algorithms and Heuristics for Scalable Betweenness Centrality Computation on Multi-GPU Systems. CoRR abs/1602.00963 (2016). http://arxiv.org/abs/1602.00963
[33] Zachary Weinberg, Jeffrey Wang, Vinod Yegneswaran, Linda Briesemeister, Steven Cheung, Frank Wang, and Dan Boneh. 2012. StegoTorus: A Camouflage Proxy for the Tor Anonymity System. In Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS '12). ACM, New York, NY, USA, 109–120. DOI: https://doi.org/10.1145/2382196.2382211
[34] Feng Zhao, Jingyu Zhou, Chang Nie, Heqing Huang, and Hai Jin. 2016. SmartCrawler: A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces. IEEE Transactions on Services Computing 9, 4 (2016), 608–620.
