A STUDY ON
“WEB WAREHOUSING”
Session: 2009–11
Presented at
Acknowledgement
The beatitude, bliss and euphoria that accompany the successful completion of any task
would not be complete without expressing appreciation to the people who made it
possible.
So, it is my immense pleasure to express wholehearted thanks to all the faculty
members who guided me all the way in making this project successful.
I am also thankful to Mrs. MAHIMA RAI (H.O.D.) for her guidance & cooperation in
this work.
Preface
The underlying aim of the seminar on a contemporary issue, as an integral part of the MBA
program, is to provide the students with practical exposure to an organization's
working environment.
Such a presentation helps a student visualize and realize the congruencies between
theoretical learning within the college premises and the actual practices followed by an
organization. It gives knowledge of the application aspect of the theories learnt in the
classroom.
Executive Summary
Users require applications to help them obtain knowledge from the web. However, the
specific characteristics of web data make it difficult to create these applications. One possible
solution to facilitate this task is to extract information from the web, transform it and load it into a
Web Warehouse, which provides uniform access methods for automatic processing of the
data. Web Warehousing is conceptually similar to the Data Warehousing approaches used to
integrate relational information from databases. However, the structure of the web is very
dynamic and cannot be controlled by the Warehouse designers. Web models frequently do
not reflect the current state of the web. Thus, Web Warehouses must be redesigned at a late
stage of development. These changes have high costs and may jeopardize entire projects.
This thesis addresses the problem of modelling the web and its influence on the design of
Web Warehouses. A model of a web portion was derived and, based on it, a Web Warehouse
prototype was designed. The prototype was validated in several real-usage scenarios. The
obtained results show that web modelling is a fundamental step of the web data integration
process.
Index
1. Introduction
1.1 Objectives and methodology
1.2 Contributions
2. Web characterization
2.1 Web characterization
2.2 Terminology
2.3 Sampling methods and identification
3. Crawling
3.1 Crawler types and functioning
3.2 Requirements
3.3 Web partitioning and assignment
3.4 Crawler examples
4. Designing a Web Warehouse
4.1 The Versus repository
4.1.1 Content Manager
4.1.1.1 Elimination of partial duplicates in a WWh
4.1.1.2 Data model
4.1.1.3 An algorithm for eliminating duplicates
4.1.1.4 Fake duplicates
4.1.1.5 Content Manager architecture
4.1.1.6 Implementation
4.1.2 Catalog
4.1.2.1 Data model
4.1.2.2 Operational model
4.1.2.3 Implementation
Chapter 1
Introduction
The web is the largest source of information ever built. It provides a quick, cheap and simple
publishing medium. However, its full potential is far from being completely explored. Users
require applications that aid them in finding, summarizing and extracting useful knowledge
from web data. However, the web was designed to provide information to be interpreted by
humans, not to be automatically processed by software applications. The size and transience of
web data make it difficult to design efficient systems able to harness its complexity in useful
time, and the heterogeneity and disregard for standards make it difficult to automatically
interpret the data. Thus, the automatic processing of web data is a challenging task.
to determine commercial strategies (www.webanalyticsassociation.org). So, Web
Warehousing is a research area that has received growing interest in recent years. Web
Warehousing is conceptually similar to Data Warehousing. Figure 1.1 presents the data
integration process in both. Data Warehouses integrate data gathered from tables in relational
databases. The data is migrated from its source models into a uniform data model. Then,
Data Mining applications generate statistical reports that summarize the knowledge contained
in the data. Web Warehouses integrate hypertextual documents gathered from sites on the
web. Web Warehouses also store information according to a uniform model that enables its
automatic processing by Web Mining applications. The characteristics of data influence the
design of an information system, so the first step in the design of a Warehouse is to analyze
the data sources. Data Warehousing assumes the existence of a well-defined model of the
data sources. They are usually On-Line Transaction Processing (OLTP) databases that respect
relational data models. On the other hand, the source of information that feeds Web
Warehouses is the web and not relational databases. Unlike relational databases, the structure
of the web cannot be controlled by the people that design the WWh and it does not follow a
static structured data model. Models and characterizations of the web are scarce and
frequently outdated, not reflecting its current state, which makes it difficult to make realistic
assumptions in the design of a Web Warehouse. Frequently, Web Warehouses must be
redesigned at a late stage of development because problems are detected only when the WWh
leaves the experimental setup and begins to integrate information gathered from the real web.
These changes have high costs and may jeopardize entire projects.
The main objective of this work is to address the problem of modelling the web and its
influence on Web Warehousing. This thesis aims to answer the following research
questions:
I believe that the task of modelling the web must be part of the process of web data
integration, because accurate models are crucial in making important design decisions at an
early WWh development stage. Web models also enable the tuning of a WWh to reflect the
evolution of the web. The methodology used in this research was mainly experimental. I
derived a model of a portion of the web and, based on it, I developed Webhouse, a WWh for
investigating the influence of web characteristics on Web Warehouse design. This
development was performed in collaboration with other members of my research group.
Figure 1.2 presents an overview of the components of Webhouse. Each one addresses one
stage of the integration process: modelling, extraction, transformation and loading. Although
the integration process is decomposed into several steps, they are not independent of each
other.
The influence of web characteristics was studied during the design of each one of them. The
extraction is the most sensitive stage of the integration process, because this software component
interacts directly with the web and must address unpredictable situations. This thesis focuses
mainly on the aspects of extracting information from the web and loading it into the WWh.
The transformation of web data is not thoroughly discussed in this work. The efficiency of
Webhouse as a complete system was validated through its application in several real-usage
scenarios.
This research was validated by applying the Engineering Method (Zelkowitz & Wallace,
1998). Several versions of Webhouse were iteratively developed and tested until the design
could not be significantly improved. The Portuguese Web was chosen as a case study to
analyze the impact of web characteristics on the design of a WWh. Models of the web were
extracted through the analysis of the information integrated in the WWh. In turn, a WWh
requires models of the web to be designed. The performance of each version of the WWh was
measured to gradually improve it. So, although this thesis presents a sequential structure, the
actual research was conducted as an iterative process.
1.2 Contributions
Web Characterization: concerns the monitoring and modelling of the web;
Web Crawling: investigates the automatic extraction of contents from the web;
Web Characterization:
Web Crawling:
A novel architecture for a scalable, robust and distributed crawler (Gomes &
Silva, 2006b);
An analysis of techniques to partition the URL space among the processes of a
distributed crawler;
A study of bandwidth and storage saving techniques that avoid the download of
duplicates and invalid URLs.
Web Warehousing:
A new architecture for a WWh that addresses all the stages of web data
integration, from its extraction from the web to its processing by mining
applications;
An analysis of the impact of web characteristics on the design and performance of
a Web Warehouse;
An algorithm that eliminates duplicates at storage level in a distributed system
(Gomes et al., 2006b).
Chapter 2
Web characterization
The design of efficient Web Warehouses requires combining knowledge from Web
characterization and Crawling. Web Characterization concerns the analysis of data samples to
model characteristics of the web. Crawling studies the automatic harvesting of web data.
Crawlers are frequently used to gather samples of web data in order to characterize it. Web
warehouses are commonly populated with crawled data. Research in crawling contributes to
optimizing the extraction stage of the web data integration process.
2.2 Terminology
As the web evolves, new concepts emerge and existing terms gain new meanings. Studies in
web characterization are meant to be used as historical documents that enable the analysis of
the evolution of the web. However, there is no standard terminology and the current
meaning of the terms may become obscure in the future.
Between 1997 and 1999, the World-Wide Web Consortium (W3C) promoted the Web
Characterization Activity with the purpose of defining and implementing mechanisms to
support web characterization initiatives (W3C, 1999a). The scope of this activity was to
characterize the web as a general distributed system, not focusing on specific users or sites. In
1999, the W3C released a working draft defining a web characterization terminology (W3C,
1999b). The definitions used in this thesis were derived from that draft:
Media type: identification of the format of a content through a Multipurpose Internet Mail
Extension (MIME) type (Freed & Borenstein, 1996a);
Meta-data: information that describes the content. Meta-data can be generated during the
download of a content (e.g. time spent to be downloaded), gathered from HTTP header fields
(e.g. date of last modification) or extracted from a content (e.g. HTML meta-tags);
Page: content with the media type text/html (Connolly & Masinter, 2000);
Home page: content identified by an URL where the file path component is empty or a '/' only;
Site: collection of contents referenced by URLs that share the same host name (Fielding et
al., 1999);
Web server: a machine connected to the Internet that provides access to contents through the
HTTP protocol;
Duplicate hosts (duphosts): sites with different names that simultaneously serve the same
content (Henzinger, 2003);
Subsite: cluster of contents within a site, maintained by a different publisher than that of the
parent site;
Virtual hosts: sites that have different names but are hosted on the same IP address and web
server;
Publisher or author: entity responsible for publishing information on the web. Some of the
definitions originally proposed in the draft are controversial and had to be adapted to become
more explicit. The W3C draft defined that a page was a collection of information, consisting
of one or more web resources, intended to be rendered simultaneously, and identified by a
single URL.
Contents that are byte-wise equal are duplicates. However, there are also similar contents that
replicate a part of another content (partial duplicates). Defining a criterion that identifies
contents as being similar enough to be considered the same is highly subjective. If multiple
contents only differ on the value of a visit counter that changes on every download, they
could reasonably be considered the same. However, when the difference between them is
only as short as a number on the date of a historical event, this small difference could be very
significant.
Web characterizations are derived from samples of the web. Ideally, each sample would be
instantly gathered to be a representative snapshot of the web. However, contents cannot be
accessed immediately, because of the latency times of the Internet and web server responses.
Hence, samples must be gathered within a limited time interval named the time span of the
sample. Structural properties of the web are derived from a snapshot of the web extracted
within a short time span (Heydon & Najork, 1999). On the other hand, researchers also harvest
samples with a long time span to study the evolution of the web (Fetterly et al., 2003).
Traffic logs. The accesses to web contents through a given service are registered on log files.
Traffic logs can be obtained from web proxies (Bent et al., 2004; Mogul, 1999b), web servers
(Arlitt & Williamson, 1997; Davison, 1999; Iyengar et al., 1999; Marshak & Levy, 2003;
Rosenstein, 2000), search engines (Beitzel et al., 2004; Silverstein et al., 1999), web clients
(Cunha et al., 1995; Gribble & Brewer, 1997) or gateways (Caceres et al., 1998; Douglis et
al., 1997). The samples gathered from traffic logs are representative of the portion of the web
accessed by the users of a given service and not of the general information available on the
web;
The main problem with this approach is that virtual hosts are not analyzed (OCLC, 2001). The
web is designed to break all the geographical barriers and make information universally
available. However, a WWh cannot store all the information from the web. So, it gathers data
from selected and well-defined web portions.
As the web is the product of multiple user groups, it is possible to identify portions within it
containing the sites of interest to them. These are designated as community webs and can be
defined as the set of documents that refer to a certain subject or are of interest to a community
of users. The detection of a community web is not always obvious, even if methods for
identifying its boundaries are available. If one is interested in a small and static set of
contents, then enumerating all the URLs that compose the community web can be adequate.
However, it becomes very expensive to maintain the list of URLs if it grows or changes
frequently (Webb, 2000).
Chapter 3
Crawling
Web warehouses use crawlers to extract data from the web. This section presents crawler
types and their operation. Then, it discusses architectural options to design a crawler and
strategies to divide the URL space among several crawling processes. Finally, it provides
crawler examples and compares their design and performance.
Crawlers can be classified into four major classes according to their harvesting strategies:
Broad. Collect the largest amount of information possible within a limited time interval
(Najork & Heydon, 2001);
Incremental. Revisit previously fetched pages, looking for changes (Edwards et al., 2001);
Focused. Harvest information relevant to a specific topic, usually with the help of a
classification algorithm to filter irrelevant contents (Chakrabarti et al., 1999);
Deep. Harvest information relevant to a specific topic but, unlike focused crawlers, have the
capacity to fill in forms in web pages and collect the returned pages (Ntoulas et al., 2005;
Raghavan & Garcia-Molina, 2001).
Although each type of crawler has specific requirements, they all operate in a similar way. A
crawl of the web is bootstrapped with a list of URLs, called the seeds, which
are the access nodes to the portion of the web to crawl.
For instance, to crawl a portion of the web containing all the contents hosted in the .GOV
domain, URLs from that domain should be used as seeds. Then, a crawler iteratively extracts
links to new URLs and collects their contents. The seeds should be carefully chosen to
prevent the crawler from wasting resources visiting URLs that do not reference accessible or
relevant contents. They can be gathered from different sources:
User submissions. The seeds are posted by the users of a given service. However, many of
them are invalid because they were incorrectly typed or reference sites still under
construction;
Previous crawls. The seeds are extracted from a previous crawl. The main problem of this
source of seeds is that URLs have short lives and an old crawl could supply many invalid
seeds;
Domain Name System listings. The seeds are generated from domain names. However, the
domains reference servers on the Internet and some of them are not web servers. So, the
generated seeds may not be valid. Another problem is that the lists of the top-level domains
of the web portion to be crawled are usually not publicly available.
3.2 Requirements
There are several types of crawlers. Although each one has specific requirements, they all
share ethical principles and address common problems. A crawler must be:
Polite. A crawler should not overload web servers. Ideally, the load imposed while crawling
should be equivalent to that of a human while browsing. A crawler should expose the
purposes of its actions and not impersonate a browser, so that webmasters can track and
report inconvenient actions;
Robust. The publication of information on the web is uncontrolled. A crawler must be robust
against hazardous situations that may affect its performance or cause its mal-functioning;
Fault tolerant. Even a small portion of the web is composed of a large number of contents,
which may take several days to be harvested. Crawlers frequently present a distributed
architecture comprising multiple components hosted on different machines; a failure in one of
them should not compromise the whole crawl.
Able to collect meta-data. There is meta-data temporarily available only during the crawl
(e.g. the date of crawl). A crawler should keep this meta-data because it is often required by the
WWh clients. For instance, the Content-Type HTTP header field identifies the media type of a
content. If this meta-data element is lost, the content type must be guessed later;
Configurable. A crawler should be highly configurable to enable the harvesting of different
portions of the web without suffering major changes;
Scalable. The crawl of a portion of the web must be completed within a limited time and the
download rate of a crawler must be adequate to the requirements of the application that will
process the harvested data. A WWh that requires weekly refreshments of data cannot use a
crawler that takes months to harvest the required web data.
Economic. A crawler should be parsimonious in the use of external resources, such as
bandwidth, because they are outside of its control. A crawler may connect to the Internet
through a large-bandwidth link, but many of the visited web servers do not;
Manageable. A crawler must include management tools that enable the quick detection of its
faults or failures. For instance, a hardware failure may
require human intervention. On the other hand, the actions of a crawler may be deemed
unacceptable by some webmasters. So, it is important to keep track of the actions executed by
the crawler for later identification and correction of undesirable behaviours.
A partitioning function maps an URL to its partition. The main objective of partitioning the
URL space is to distribute the workload among the Crawling Processes, creating groups of
URLs that can be harvested independently. After partitioning, each CP is responsible for
harvesting exclusively one partition at a time. The partitioning strategy has implications on
the operation of the crawler. In general, the following partitioning strategies may be
considered:
Site partitioning. Each partition contains the URLs of a site. This partitioning schema differs
from IP partitioning, because several sites may be hosted on the same IP address (virtual hosts)
and each will be crawled separately;
Page partitioning. Each partition contains a fixed number of URLs independently of their
physical location. A partition may contain URLs hosted on different sites and IP addresses.
Page partitioning is suitable to harvest a selected set of independent pages spread on the web.
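As an illustration of how a partitioning function can map an URL to a partition, the following Java sketch contrasts site and page partitioning. The class and method names (UrlPartitioner, partitionOf) are illustrative and do not belong to any crawler described in this thesis.

import java.net.URI;

enum PartitioningStrategy { SITE, PAGE }

final class UrlPartitioner {
    private final PartitioningStrategy strategy;
    private final int numPartitions; // used only by page partitioning

    UrlPartitioner(PartitioningStrategy strategy, int numPartitions) {
        this.strategy = strategy;
        this.numPartitions = numPartitions;
    }

    /** Maps an URL to a partition identifier. */
    String partitionOf(String url) {
        URI u = URI.create(url); // assumes absolute URLs such as http://www.site.com/page.html
        if (strategy == PartitioningStrategy.SITE) {
            // Site partitioning: all URLs sharing the same host name fall into the
            // same partition, even when several sites share one IP address.
            return u.getHost().toLowerCase();
        }
        // Page partitioning: URLs are spread over a fixed number of partitions,
        // independently of the site or IP address that hosts them.
        return Integer.toString(Math.floorMod(url.hashCode(), numPartitions));
    }
}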
The Googlebot is presented in the Google original research paper (Brin &
Page, 1998).
Silva et al. (1999) described the CobWeb crawler, one of the components of a
search engine for the Brazilian web that used proxy servers to reduce implementation
costs and save network bandwidth when updating a set of documents.
Boldi et al. (2002b) presented the Ubicrawler, giving special attention to its fault
tolerance and scalability features.
Table 1: Crawler design options.
Chapter 4
The Web is a powerful source of information, but its potential can only be harnessed with
applications specialized in aiding web users. However, most of these applications cannot
retrieve information from the web on-the-fly, because it takes too long to download the data.
Pre-fetching the required information and storing it in a Web Warehouse (WWh) is a
possible solution. This approach enables the reuse of the stored data by several applications
and users.
A WWh architecture must be adaptable so that it may closely track the evolution of the web,
supporting distinct selection criteria and gathering methods. Meta-data must ensure the
correct interpretation and preservation of the stored data. The storage space must
accommodate the collected data and it should be accessible to humans and machines,
supporting complementary access methods to fulfil the requirements of distinct usage
contexts. These access methods should provide views on past states of the stored data to
enable historical analysis.
The focus of the chapter is on the design of the Webhouse prototype, discussing the
extraction, loading and management of web data.
Figure 4.2: Versus architecture.
Figure 4.2 represents the Versus repository architecture. It is composed of the Content
Manager and the Catalog. The Content Manager provides storage space for the contents
(Gomes et al., 2006b). The Catalog provides high-performance access to structured meta-data.
It keeps information about each content, such as the date when it was collected or the
reference to the location where it was stored in the Content Manager.
Web warehousing involves a large amount of data and calls for storage systems able to
address the specific characteristics of web collections. The duplication of contents is
prevalent in web collections. It is difficult to avoid downloading duplicates during the crawl
of a large set of contents, because they are commonly referenced by distinct and apparently
unrelated URLs (Bharat & Broder, 1999; Kelly & Mogul, 2002; Mogul, 1999a). Plus, the
contents kept by a WWh have an additional number of duplicates, because it is built
incrementally and many contents remain unchanged over time, being repeatedly stored.
Delta storage, or delta encoding, is a technique used to save space that consists of storing only the
difference from a previous version of a content (MacDonald, 1999). There are pages that
suffer only minor changes, such as the number of visitors received or the current date. Delta
storage enables storing only the part of the content that has changed, eliminating partial
duplicates.
Figure 4.3: Storage structure of a volume: a tree holding blocks on the leaves.
4.1.1.2 Data model
The data model of the Webhouse Content Manager relies on three main classes:
Instance, Volume and Block. The Instance class provides a centralized view of a storage space
composed of volumes containing blocks. Each block keeps a content and related operational
meta-data. The signature is the number obtained by applying a fingerprinting algorithm to
the content. A content key contains the signature of the content and the volume where it was
stored. A block holds a unique content within the volume. It is composed of a header and a
data container (see Figure 4.3).
The location of a block within the volume tree is obtained by applying a function called
sig2location to the content's signature. Assuming that the signature of a content is unique,
two contents have the same location within a volume if they are duplicates. Consider a
volume tree with depth n and a signature with m bytes of length. Sig2location uses the (n - 1)
most significant bytes of the signature to identify the path to follow in the volume tree. The
ith byte of the signature identifies the tree node with depth i. The remaining bytes of the
signature (m - n + 1) identify the block name on the leaf of the tree. For instance, considering a
volume tree with depth 3, the block holding a content with signature ADEE2232AF3A4355
would be found in the tree by following the nodes AD, EE and leaf 2232AF3A4355.
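The mapping performed by sig2location can be sketched in Java as follows, assuming the signature is given as a hexadecimal string. The code merely reproduces the rule and the worked example above; it is not the Content Manager's implementation.

final class Sig2Location {
    /**
     * Maps a content signature to its block location in a volume tree of the
     * given depth: the first (depth - 1) bytes name the directories to follow,
     * the remaining bytes name the block on the leaf.
     */
    static String locate(String hexSignature, int depth) {
        StringBuilder path = new StringBuilder();
        int i = 0;
        for (int level = 1; level < depth; level++, i += 2) {
            path.append(hexSignature, i, i + 2).append('/'); // one byte = two hex digits
        }
        return path.append(hexSignature.substring(i)).toString();
    }

    public static void main(String[] args) {
        // Reproduces the worked example: depth 3, signature ADEE2232AF3A4355.
        System.out.println(locate("ADEE2232AF3A4355", 3)); // prints AD/EE/2232AF3A4355
    }
}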
The detection of duplicates is performed during the storage of each content, ensuring that
each distinct content is stored in a single block within the instance. When a client requests the
storage of a content, the system performs a sequence of tasks:
1. Compute the signature s of the content by applying the fingerprinting algorithm;
2. Apply sig2location to the signature and obtain the location l of the corresponding
block;
3. Search for a block in location l within the n volumes that compose the instance,
multicasting requests to the volumes;
4. If a block is found in a volume, the content is considered to be a duplicate and its
reference counter is incremented. Otherwise, the content is stored in a new block with
location l in the volume identified by s mod n;
Theoretically, if two contents have the same signature they are duplicates. However,
fingerprinting algorithms present a small probability of collision, which causes the generation of
the same signature for two different contents (Rabin, 1979). Relying exclusively on the
comparison of signatures to detect duplicates within a large collection of contents could
cause some contents to be wrongly identified as duplicates and not stored. These situations
are called fake duplicates.
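The storage sequence described above can be sketched as follows. The Volume interface is an assumption introduced for illustration, and the byte-wise comparison is one possible guard against fake duplicates rather than the Content Manager's documented behaviour.

import java.util.Arrays;
import java.util.List;

final class ContentStoreSketch {
    interface Volume {
        byte[] findBlock(String location);               // returns null if no block exists
        void storeBlock(String location, byte[] content);
        void incrementReferences(String location);
    }

    /** Stores a content, reusing an existing block when a duplicate is detected. */
    static void store(byte[] content, long signature, List<Volume> volumes) {
        String hex = String.format("%016X", signature);
        // Block location for a volume tree of depth 3, as in the example above.
        String location = hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/" + hex.substring(4);
        int n = volumes.size();
        // Search the n volumes for a block at this location (the real system
        // multicasts the requests; here the volumes are probed in sequence).
        for (Volume v : volumes) {
            byte[] existing = v.findBlock(location);
            if (existing != null && Arrays.equals(existing, content)) {
                v.incrementReferences(location); // duplicate: reuse the existing block
                return;
            }
            // A block with different bytes would be a fake duplicate caused by a
            // fingerprint collision and would need separate handling.
        }
        // Not a duplicate: store it in a new block in the volume identified by s mod n.
        volumes.get((int) Math.floorMod(signature, (long) n)).storeBlock(location, content);
    }
}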
In turn, the Content Manager is platform-independent and runs at application level
without imposing changes in the configuration of the underlying operating system. Peer-to-peer
file systems, such as Oceanstore (Rhea et al., 2003), are designed to manage a large and
highly variable set of nodes with small storage capacity, distributed over wide-area networks
(typically the Internet). This raises specific problems and imposes complex intra-node
communication protocols that guarantee properties such as security, anonymity or fault
tolerance, which unnecessarily limit throughput on controlled networks. An experiment
performed by the authors of Oceanstore showed that it is on average 8.3 times slower than
NFS (Callaghan et al., 1995) on a local-area network (LAN).
Figure 4.4 depicts the architecture of the Content Manager.
An instance is composed of a thin middleware library, the connector object and the volume
servers. Clients access an instance through a connector object that keeps references to the
volumes that compose the instance. A change in the composition of the instance, such as the
addition of a new volume, implies an update of the connector. Each volume server manages
the requests and executes the corresponding low-level operations to access the contents. The
contents are transmitted between the library and the servers in a compressed format to reduce
network traffic and data processing on the server.
4.1.1.6 Implementation
The storage structure of a volume was implemented as a directory tree over the file system
where the blocks are files residing at the leaf directories. The block header is written in
ASCII format so that it can be easily interpreted, enabling access to the content kept in the
block independently from the Web store software. A 64-bit implementation of Rabin's
fingerprinting algorithm was used to generate the content signatures (Rabin, 1979). The
Content Manager supports Zlib as the built-in compression method, but other compression
algorithms can be included. This way, contents can be compressed using adequate algorithms
and accommodate new formats. The connector object was implemented as an XML file. The
library and the volume servers were written in Java using JDK
25
Figure 4.5: Versus Content Manager data model.
1.4.2 (6 132 lines of code). The communication between them is through Berkeley sockets.
Currently, clients access volume servers in sequence, (a multicast protocol is not yet
implemented). Volume servers are multi-threaded, launching a thread for handling each
request. Volume servers guarantee that each block is accessed in exclusive mode through
internal block access lists.
4.1.2 Catalog
This section describes the data model that supports the Catalog and the operational model that
enables parallel loading and access to the data stored in the Versus repository.
Figure 4.5 presents the UML class model of the Catalog. This model is generic to enable its
usage for long periods of time independently from the evolution of web formats. Plus, it also
enables the usage of the repository in contexts different from Web Warehousing. For
instance, it can be applied to manage meta-data on the research articles kept in a Digital
Library. However, it is assumed that the contents are inserted into the repository in bulk
loads.
A Versus client application is composed of a set of threads that process data in parallel. Each
application thread does its specific data processing and Versus is responsible for managing
and synchronizing them. The operational model of Versus was inspired by the model
proposed by Campos (2003). It is composed of three workspaces with different features that
keep the contents' meta-data:
Archive (AW). Stores meta-data permanently. It keeps version history for the contents to
enable the reconstruction of their earlier views. The AW is an append-only storage; the data
stored cannot be updated or deleted;
Group (GW). Keeps a temporary view of the meta-data shared by all application threads. It
enables the synchronization among the application threads and data cleaning before the
archival of new data;
Private (PW). Provides local storage and fast access to data by application threads. Private
workspaces are independent from each other and reside in the application threads' address
space. An application thread can be classified into three categories according to the task it
executes:
Loader. Generates or gathers data from an information source and loads it into Versus;
Processor. Accesses data stored in Versus, processes it and loads the resulting data;
Reader. Accesses data stored in Versus and does not change it, nor does it generate new
information.
The data stored is partitioned to enable parallel processing. A Working Unit (WU) is a data
container used to transfer partitions of meta-data across the workspaces. The Working Units
are transferred from one workspace to another via check-out and check-in operations (Katz,
1990). When a thread checks out a WU, the corresponding meta-data is locked in the source
workspace and copied to the destination workspace. When the thread finishes the
processing of the WU, it integrates the resulting data into the source workspace (check-in).
The threads that compose an application share the same partitioning function. There are two
classes of Working Units:
Strict. Contain exclusively the Versions that belong to the Working Unit. Strict Working
Units should be used by applications that do not need to create new Versions;
Extended. The Extended Working Units may also contain Versions that do not belong to the
WU, named the External Versions.
Access
Figure 4.6 depicts the access to information stored in the Catalog. Each workspace provides
different access levels to the meta-data. The Archive Workspace provides centralized access
to all the archived Versions. The applications define the time span of the Versions they want
to access through a Resolver object. For instance, an application can use a Resolver that
chooses the last Version archived from each Source. The Group Workspace also provides
centralized access to the stored information but it holds at most one Version from each
Source.
It does not provide historical views on the data. The Private Workspaces are hosted on the
application threads and keep one WU at a time enabling parallel processing. The Archive and
Group Workspace should be hosted on powerful machines while the Private Workspaces can
be hosted on commodity servers.
The workflow of a parallel application that accesses data stored in Versus is the following:
1. The application checks-out the required meta-data from the AW to the GW;
2. Several application threads are launched in parallel. Each one of them starts its own PW
and iteratively checks out one WU at a time, processes it and executes the check-in into the
GW. The contents cannot be updated after they are set and any change on a content must be
stored as a new Facet;
3. When there are no unprocessed Working Units, the new data kept in the GW is checked in
to the AW.
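The workflow above can be sketched in Java as follows. The workspace and Working Unit interfaces are assumptions introduced for illustration; in particular, checkOut is assumed to be thread-safe and to return null when no Working Unit is left.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class ParallelAccessSketch {
    interface WorkingUnit { }
    interface GroupWorkspace {
        WorkingUnit checkOut();               // null when there is no WU left
        void checkIn(WorkingUnit processed);
    }
    interface PrivateWorkspace {
        WorkingUnit process(WorkingUnit wu);  // application-specific processing
    }

    static void run(GroupWorkspace gw, Supplier<PrivateWorkspace> pwFactory, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                PrivateWorkspace pw = pwFactory.get(); // each thread starts its own PW
                WorkingUnit wu;
                // Iteratively check out one WU at a time, process it and check it in.
                while ((wu = gw.checkOut()) != null) {
                    gw.checkIn(pw.process(wu));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        // At this point the new data kept in the GW would be checked in to the AW.
    }
}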
The contents are not transferred in the check-out operations; they are retrieved on demand
from the Content Manager. There are two assumptions behind this design decision. The first
is that the contents belonging to a WU represent an amount of data much larger than the
corresponding meta-data.
Load
Figure 4.7 depicts the loading of data into Versus. The meta-data is loaded by the application
threads into the Private Workspaces, while the contents are stored in the Content Manager,
which eliminates duplicates at storage level. However, if a content is identified as a duplicate,
the corresponding meta-data, such as its URL, is still stored in the Catalog PW. This way, the
Webhouse clients can later access the warehoused collection independently of the duplicate
elimination mechanism. The workflow of a parallel application that loads
information into Versus is the following:
2. Several parallel threads are launched. Each one of them starts its own PW and iteratively
checks out one empty WU, loads it with meta-data extracted from the Sources and executes
the check-in into the GW;
3. When there are no unprocessed Working Units, the new data kept in the GW is checked in
to the AW.
Figure 4.7: Loading data into Versus
The references to the new contents loaded into the Content Manager are kept as meta-data in
the PW. If an application thread fails before the check-in, for instance due to a power failure,
the references to the contents would be lost, originating orphan contents that could not be
later accessed. Versus provides recovery methods to restart the processing of a Working Unit
and remove the orphan contents if an application thread fails. Versus also supports direct
loads to the Group or Archive Workspaces but they should be used for small amounts of data
because parallel loading is not supported.
4.1.2.3 Implementation
The Catalog was mainly implemented using the Java environment and relational database
management systems (DBMS). The AW and GW were implemented using Oracle 9i DBMS
(Oracle Corporation, 2004). The advanced administration features of this DBMS, such as
partitioning or query optimizations, enable the configuration of the system to be used in the
context of Web Warehousing, addressing efficiently the processing of large amounts of data.
The use of the SQL language for data manipulation enabled the reuse of the code in the three
kinds of workspaces, although each one also had particular data structures and optimization
profiles.
The PW used the HyperSonic SQL DBMS (HyperSonicSQL). It is written in Java and can be
configured to run in three modes:
Memory. The DBMS runs inside the client application. The data is kept exclusively in
memory. If the client application fails, the data is lost;
File. The DBMS runs inside the client application but the data is stored in files;
Client/server. The DBMS and client applications run independently and communicate
through a network connection using JDBC. The data can be kept in memory or in files.
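As an illustration, the three modes correspond to different JDBC connection URLs in HSQLDB; the database names, file path and credentials below are placeholders, not Webhouse's actual configuration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

final class PrivateWorkspaceConnections {
    static Connection open(String mode) throws SQLException {
        switch (mode) {
            case "memory":
                // In-process; data kept exclusively in memory and lost on failure.
                return DriverManager.getConnection("jdbc:hsqldb:mem:pw", "sa", "");
            case "file":
                // In-process; data persisted to files.
                return DriverManager.getConnection("jdbc:hsqldb:file:/tmp/pw", "sa", "");
            case "client/server":
                // Separate server process, reached over the network via JDBC.
                return DriverManager.getConnection("jdbc:hsqldb:hsql://localhost/pw", "sa", "");
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }
}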
A WWh crawls data from the web to extract new information. The permanent evolution of
the web and the emergence of new usage contexts demand continuous
research in crawling systems. Kahle (2002), the founder of the Internet Archive, revealed that
their commercial crawler is rewritten every 12 to 18 months to reflect changes in the structure of
the web. Although a crawler is conceptually simple, its development is expensive and time
consuming, because most problems arise when the crawler leaves the experimental
environment and begins harvesting the web.
A suitable partitioning function that divides the URL space across the set of Crawling
Processes that compose a distributed crawler must be chosen according to the characteristics
of the portion of the web being harvested. Otherwise, the requirements for a crawler may not
be fulfilled. Three partitioning strategies were analyzed: IP, site and page partitioning (see
Chapter 3). The number of URLs contained in a partition should ideally be constant to
facilitate load balancing.
Page partitioning is the most adequate according to this criterion. IP partitioning
tends to create some extremely large partitions due to servers that host thousands of sites,
such as Geocities (www.geocities.com) or Blogger (www.blogger.com). Site partitioning
is more likely to create partitions containing a single URL, due to sites under construction or
presenting an error message.
Table 2 summarizes the relative merits of each strategy, which are characterized by the
following determinants:
DNS caching. A CP executes a DNS lookup to map the site name contained in an URL into
an IP address, establishes a TCP connection to the correspondent web server and then
downloads the content. The DNS lookups are responsible for 33% of the time spent to
download a content (Habib & Abrams, 2000). Hence, caching a DNS response and using it to
download several contents from the same site optimizes crawling. A CP does not execute any
DNS lookup during the crawl when harvesting an IP partition, because all the URLs are
hosted on the IP address that identifies the partition.
A site partition requires one DNS lookup to be harvested because all its URLs have the same
site name. A page partition contains URLs from several different sites, so a CP would not
benefit from caching DNS responses;
Keep-alive connections. A page partition contains URLs hosted on different servers, so a CP
does not benefit from using keep-alive connections. On the other hand, with IP partitioning an
entire server can be crawled through one single keep-alive connection. When a crawler uses
site partitioning, a single keep-alive connection can be used to crawl a site. However, the same
web server may be configured to host several virtual hosts. Then, each site will be crawled
through a new connection;
Reuse of site meta-data. Sites contain meta-data, such as the Robots Exclusion file, that
influences crawling. The page partitioning strategy is not suitable to reuse the site's meta-data
because the URLs of a site are spread across several partitions. With the IP partitioning, the
site's meta-data can be reused, but it requires additional data structures to keep the
correspondence between the sites and the meta-data.
Independence. The site and page partitioning enable the assignment of an URL to a partition
independently of external resources. The IP partitioning depends on the DNS servers to
retrieve the IP address of an URL and cannot be applied if the DNS server becomes
unavailable. If the site of an URL is relocated to a different IP address during a crawl, two
invocations of the function for the same URL would return different partitions.
This section details the design of the VN crawler. It was designed as a Versus client
application to take advantage of the distribution features provided by the Versus repository.
VN has a hybrid Frontier, uses site partitioning and dynamic-pull assignment:
Hybrid frontier. Each CP has an associated Local Frontier where it stores the meta-data
generated during the crawl of a partition. The meta-data on the seeds and crawled URLs is
stored on the Global Frontier. A CP begins the crawl of a new site partition by transferring a
seed from the Global to its Local Frontier (check-out). Then, the URLs that match the site are
harvested by the CP. When the crawl of the partition is finished, the correspondent meta-data
is transferred to the Global Frontier (check-in).
Site partitioning. Besides the advantages discussed in the previous section, three additional
reasons lead to the adoption of the site partitioning strategy. First, a CP frequently accesses
the Local Frontier to execute the URL-seen test. As Portuguese sites are typically small and
links are mostly internal to the sites, the Local Frontier can be maintained in memory during
the crawl of the site to optimize the execution of the URL-seen test. Second, web servers are
designed to support access patterns typical of human browsing. The crawling of one site at a
time enables the reproduction of the behaviour of browsers, so that the actions of the
crawler do not disturb the normal operation of web servers. Third, site partitioning facilitates
the implementation of robust measures against spider traps;
Figure 4.8: VN architecture.
The check-in moves the partition from the second to the third list. Figure 4.8 describes VN's
architecture. It is composed of a Global Frontier, a Manager that provides tools to execute
administrative tasks, and several Crawling Nodes (CNodes). The Manager is composed of:
The Seeder, which generates seeds for a new crawl from user submissions, DNS
listings and home pages of previously crawled sites and inserts them in the Global
Frontier;
The Reporter, which gets statistics on the state of the system and emails them to a
human Administrator;
The Cleaner, which releases resources acquired by faulty Crawling Processes.
Each CNode hosts:
Figure 4.9: Sequence diagram: crawling a site.
The scheduling of the execution of the Crawling Processes within a CNode is delegated to the
operating system. It is assumed that when a CP is blocked, for instance while executing IO
operations, another CP is executed.
Crawlers get a seed to a site and follow the links within it to harvest its contents. They usually
impose a depth limit to avoid the harvesting of infinite sites (Baeza-Yates & Castillo, 2004).
There are three policies to traverse links within a site (Cothey, 2004):
Best-first. The crawler chooses the most relevant URLs to be crawled first according to a
given criterion as, for instance, their PageRank value (Brin & Page, 1998);
Breadth-first. The crawler iteratively harvests all the URLs available at each level of depth
within a site;
Depth-first. The crawler iteratively follows all the links from the seed until the maximum
depth is achieved.
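A breadth-first traversal with a depth limit can be sketched as follows; fetchAndExtractLinks stands in for the download and link-extraction steps and is not part of VN.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

final class BreadthFirstSiteCrawl {
    /** Harvests a site level by level, never going deeper than maxDepth. */
    static Set<String> crawlSite(String seed, int maxDepth,
                                 Function<String, List<String>> fetchAndExtractLinks) {
        Set<String> seen = new HashSet<>();       // URL-seen test for this site
        Queue<String> currentLevel = new ArrayDeque<>();
        Queue<String> nextLevel = new ArrayDeque<>();
        currentLevel.add(seed);
        seen.add(seed);
        for (int depth = 0; depth <= maxDepth && !currentLevel.isEmpty(); depth++) {
            while (!currentLevel.isEmpty()) {
                String url = currentLevel.poll();
                for (String link : fetchAndExtractLinks.apply(url)) {
                    if (seen.add(link)) {         // enqueue only unseen URLs
                        nextLevel.add(link);
                    }
                }
            }
            Queue<String> tmp = currentLevel;     // move to the next depth level
            currentLevel = nextLevel;
            nextLevel = tmp;
        }
        return seen;
    }
}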
To face hazardous situations while crawling the web and possible hardware problems on the
underlying cluster of machines, VN was designed to tolerate faults at different levels in its
components.
The URL-seen test is executed in two steps: first, when the URLs are inserted in the Local
Frontier, and second, upon the check-in to the Global Frontier. 81% of the links embedded in pages
reference URLs internal to their site (Broder et al., 2003).
The URL-seen test for internal URLs is done locally because all the seen URLs belonging to
the site are covered by the Local Frontier. So, when the CP finishes harvesting the site, it can
check-in the internal URLs to the Global Frontier without further testing.
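The two-step URL-seen test can be sketched as follows; the GlobalFrontier interface and the host comparison are simplifying assumptions, not VN's actual API.

import java.net.URI;
import java.util.HashSet;
import java.util.Set;

final class UrlSeenSketch {
    interface GlobalFrontier {
        void checkIn(Set<String> crawledInternal, Set<String> discoveredExternal);
    }

    private final String site;                               // e.g. "www.mysite.com"
    private final Set<String> localSeen = new HashSet<>();   // in-memory Local Frontier
    private final Set<String> external = new HashSet<>();

    UrlSeenSketch(String site) { this.site = site; }

    /** Returns true if the URL is internal to the site and was not seen before. */
    boolean shouldCrawl(String url) {
        String host = URI.create(url).getHost();
        if (host == null || !host.equalsIgnoreCase(site)) {
            external.add(url);      // external links are only resolved at check-in
            return false;
        }
        return localSeen.add(url);  // first step of the URL-seen test, done locally
    }

    /** Second step: transfer the site's URLs to the Global Frontier at check-in. */
    void finishSite(GlobalFrontier global) {
        global.checkIn(localSeen, external);
    }
}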
Home page. The home page policy assumes that all the contents within a site are accessible
through a link path from its home page. Hence, a CP replaces every external URL by its site's
home page before inserting it in the Local Frontier (see Figure 4.10). The home page policy
reduces the number of external URLs to check in. However, if a CP cannot follow links from
the home page, the remaining pages of the site will not be harvested;
Deep link. A deep link references an external URL different than the home page. The deep
link policy assumes that there are pages not accessible through a link path from the home
page of the site. The CP inserts the external URLs without any change in the Local Frontier
to maximize the coverage of the crawl. For instance, in Figure 4.10 the URL
www.othersite.com/orphan.html is not accessible from the home page of the site but it is
linked from the site www.mysite.com. However, if the external URL references a content
without links, such as a postscript document, the crawl of the site would be limited to this
content.
Combined. Follows deep links but always visits the home page of the sites. This policy is
intended to maximize coverage. Even if a deep link references a content without links, the
remaining site accessible through a link path from the home page will be harvested.
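The home page policy amounts to reducing an external URL to its site's home page before insertion in the Local Frontier, roughly as sketched below (absolute URLs are assumed):

import java.net.URI;

final class HomePagePolicy {
    /** Replaces an external URL by the home page of its site. */
    static String toHomePage(String externalUrl) {
        URI u = URI.create(externalUrl);
        // e.g. http://www.othersite.com/orphan.html -> http://www.othersite.com/
        return u.getScheme() + "://" + u.getHost() + "/";
    }
}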
Discarding malformed URLs. For instance, an URL containing white spaces is syntactically
incorrect. However, there are web servers that enable the usage of malformed URLs;
Discarding URLs that reference unregistered sites. The site name of an URL must be
registered in the DNS. Otherwise, the crawler would not be able to map the domain name into
an IP address to establish a connection to the server and download the content. Thus, an URL
referencing an unregistered site name is invalid. However, testing if the site names of the
URLs are registered before inserting them into the Frontier imposes an additional overhead
on the DNS servers.
Duplicates occur when two or more different URLs reference the same content. A crawler
should avoid harvesting duplicates to save on processing, bandwidth and storage space. The
crawling of duplicates can be avoided through the normalization of URLs:
1. Case normalization: the hexadecimal digits within a percent-encoding triplet (e.g., "%3a"
versus "%3A") are case-insensitive and therefore should be normalized to use uppercase
letters for the digits A-F;
3. Convert the site name to lower case: domain names are case-insensitive; thus, the URLs
www.site.com/ and WWW.SITE.COM/ reference the same content;
6. Add a trailing '/' when the path is empty: the HTTP specification states that if the path name
is not present in the URL, it must be given as '/' when used as a request for a resource
(Fielding et al., 1999). Hence, the transformation must be done by the client before sending a
request. This normalization rule prevents URLs such as www.site.com and
www.site.com/ from originating duplicates;
7. Remove trailing anchors: anchors are used to reference a part of a page
(e.g. www.site.com/file#anchor). However, the crawling of URLs that differ only in their
anchors would result in repeated downloads of the same page;
8. Add the prefix "www." to site names that are second-level domains: the following section will
show that most of the sites named with a second-level domain are also available under the
site name with the prefix "www.";
9. Remove well-known trailing file names: two URLs that are equal except for a well-known
trailing file name, such as "index.html", "index.htm", "index.shtml", "default.html" or
"default.htm", usually reference the same content. The results obtained in experiments
crawling the Portuguese Web showed that removing these trailing file names reduced the
number of duplicates by 36%. It is technically possible that the URLs with and without the
trailing file name reference different contents. However, situations of this kind were not found in the
experiments. The conclusion is that this heuristic does not noticeably reduce the coverage of a
crawler.
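The rules listed above can be sketched as a normalization routine such as the one below, which applies rules 1, 3, 6, 7, 8 and 9 with simple string operations. It assumes absolute http URLs and treats any host name with a single dot as a second-level domain, which is a simplification; it is not VN's implementation.

import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlNormalizerSketch {
    private static final List<String> DEFAULT_FILES = Arrays.asList(
            "index.html", "index.htm", "index.shtml", "default.html", "default.htm");
    private static final Pattern PERCENT = Pattern.compile("%[0-9a-fA-F]{2}");

    static String normalize(String url) {
        // Rule 7: remove trailing anchors.
        int hash = url.indexOf('#');
        if (hash >= 0) url = url.substring(0, hash);

        // Rule 1: uppercase the hexadecimal digits of percent-encoded triplets.
        Matcher m = PERCENT.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) m.appendReplacement(sb, m.group().toUpperCase());
        m.appendTail(sb);

        URI u = URI.create(sb.toString());
        // Rule 3: convert the site name to lower case.
        String host = u.getHost().toLowerCase();
        // Rule 8: add the "www." prefix to second-level domain names (simplified test).
        if (host.chars().filter(c -> c == '.').count() == 1) host = "www." + host;

        // Rule 6: add a trailing '/' when the path is empty.
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        // Rule 9: remove well-known trailing file names.
        for (String f : DEFAULT_FILES) {
            if (path.endsWith("/" + f)) {
                path = path.substring(0, path.length() - f.length());
                break;
            }
        }
        String query = (u.getQuery() == null) ? "" : "?" + u.getQuery();
        return u.getScheme() + "://" + host + path + query;
    }
}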
4.2.2.4 Implementation
The VN web crawler integrates components developed within the XLDB Group and external
software. It was mainly written in Java using JDK 1.4.2 (3,516 lines of code), but it also
includes software components implemented in native code. The Crawling Processes use hash
tables to keep the list of duphosts and the DNS cache. The Parser was based on WebCAT, a
Java package for extracting and mining meta-data from web documents (Martins & Silva,
2005b).
The Classifier used to harvest the Portuguese Web includes a language identifier (Martins &
Silva, 2005a). The Robots Exclusion file interpreter was generated using JLex (Berk &
Ananian, 2005). The Seeder and the Cleaner are Java applications. The Reporter and
Watchdog were implemented using shell scripts that invoke operating system commands,
such as ps or iostat.
The web is very heterogeneous and there are hazardous situations for crawling that disturb the
extraction of web data to be integrated in a WWh. Some of them are malicious, while others
are caused by malfunctioning web servers or authors that publish information on the web
without realizing that it will be automatically processed. Crawler developers must be aware
of these situations to design robust crawlers capable of coping with them.
Heydon & Najork (1999) defined a spider trap as an URL or set of URLs that cause a crawler
to crawl indefinitely. In this thesis the definition was relaxed and situations that significantly
degrade the performance of a crawler were also considered as spider traps, although they may
not originate infinite crawls.
DNS wildcards. A zone administrator can use a DNS wildcard to synthesize resource records
in response to queries that otherwise do not match an existing domain (Mockapetris, 1987).
In practice, any site under a domain using a wildcard will have an associated IP address, even
if nobody registered it. DNS wildcards are used to make sites more accepting of
typographical errors in URLs because they redirect any request to a site under a given domain
to a default doorway page (ICANN, 2004).
Malfunctions and infinite size contents. Malfunctioning sites are the cause of many spider
traps. These traps usually generate a large number of URLs that reference a small set of pages
containing default error messages. Thus, they are detectable by the abnormally large number
of duplicates within the site. For instance, sites that present highly volatile information, such
as online stores, generate their pages from information kept in a database.
If the database connection breaks, these pages are replaced by default error messages
informing that the database is not available. A crawler can mitigate the effects of this kind of
trap by not following links within a site once it exceeds a given number of duplicates.
Session identifiers and cookies. HTTP is a stateless protocol that does not allow tracking of
user reading patterns by itself. However, this is often required by site developers, for
instance, to build profiles of typical users. A session identifier embedded in the URLs linked
from pages allows maintaining state about a sequence of requests from the same user.
Directory list reordering. Apache web servers generate pages to present lists of files
contained in a directory. This feature is used to easily publish files on the web. Figure 4.11
presents a directory list and its embedded links. The directory list contains 4 links to pages
that present it reordered by Name, Last-Modified date, Size and Description, in ascending or
descending order.
Figure 4.11: Apache directory list page and the linked URLs.
Growing URLs. A spider trap can be set with a symbolic link from a directory /spider to the
directory / and a page /index.html that contains a link to the /spider directory. Following the
links will create an infinite number of URLs (www.site.com/spider/spider/...) (Jahn, 2004).
Although this example may seem rather academic, these traps exist on the web. There are
also advertisement sites that embed the history of the URLs followed by a user in the
links of their pages.
Crawlers interpret the harvested contents to extract valuable data such as links or texts. If a
crawler cannot extract the linked URLs from a page, it will not be able to iteratively harvest
the web. The page text is important for focused crawlers that use classification algorithms to
determine the relevance of the contents, for instance, a focused crawler could be interested in
harvesting contents containing a set of words. However, the extraction of data from contents
is not straightforward because there are situations on the web that make contents hard to
interpret:
Wrong identification of media type. The media type of a content is identified through the
HTTP header field Content-Type. HTTP clients choose the adequate software to interpret the
content according to its media type. For instance, a content with the Content-Type
application/pdf is commonly interpreted by the Adobe Acrobat software.
Malformed pages. A malformed content does not comply with its media type format
specification, which may prevent its correct interpretation. Malformed HTML contents are
prevalent on the web. One reason for this is that authors commonly validate their pages
through visualization on their browsers, which tolerate format errors to enable the
presentation of pages to humans without visible errors. As a result, the HTML interpreter
used by a crawler should also be tolerant to common syntax errors, such as unmatched tags
(Martins & Silva, 2005b; Woodruff et al., 1996);
Cloaking. A cloaking web server provides different contents to crawlers than to other clients.
This may be advantageous if the content served is a more crawler-friendly representation of
the original. For instance, a web server can serve a Macromedia Shockwave Flash Movie to a
browser and an alternative XML representation of the content to a crawler. However,
spammers use cloaking to deceive search engines without inconveniencing human visitors.
JavaScript links. It is common to find pages where normal links are JavaScript programs activated by
clicking on pictures or selecting options from a drop-down list (Thelwall, 2002).
Duplicate hosts (duphosts) are sites with different names that simultaneously serve the same
content. Technically, duphosts can be created through the replication of contents among
several machines, the usage of virtual hosts or the creation of DNS wildcards. There are
several situations that originate duphosts:
Mirroring. The same contents are published on several sites to backup data, reduce the load
on the original site or to be quickly accessible to some users;
Domain squatting. Domain squatters buy domain names desirable to specific businesses, to
make a profit on their resale. The requests to these domains are redirected to a site that presents a
sale proposal. To protect against squatters, companies also register multiple domain names
related to their trademarks and point them to the company's site;
Temporary sites. Web designers buy domains for their customers and point them temporarily
to the designer's site or to a default "under construction" page. When the customer's site is
deployed, the domain starts referencing it.
SameHome. Both sites present equal home pages. The home page describes the content of a
site. So, if two sites have the same home page they probably present the same contents.
However, there are home pages that permanently change their content, for instance to include
advertisements, and two home pages in the data set may be different although the remaining
contents of the sites are equal.
SameHomeAnd1Doc. Both sites present equal home pages and at least one other equal
content. This approach follows the same intuition as SameHome
but tries to overcome the problem of transient duphosts composed of a single page;
DupsP. Both sites present a minimum percentage (P) of equal contents and have at least two equal contents. Between the crawls of the duphosts used to build the data set, some pages may change, including the home page. This approach assumes that if the majority of the contents of two sites are equal, the sites are duphosts. A minimum of two equal contents was imposed to reduce the influence of sites under construction.
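A minimal sketch of the DupsP heuristic is given below, assuming each site is represented by the set of content fingerprints (e.g. SHA-1 digests) gathered for it during a crawl; the threshold value and the choice of measuring P against the smaller site are illustrative assumptions.

    # Sketch of the DupsP heuristic: two sites are flagged as duphosts when they
    # share at least two equal contents and the shared contents reach a minimum
    # percentage P of the smaller site's contents.
    def are_duphosts(fingerprints_a, fingerprints_b, p=0.7):
        if not fingerprints_a or not fingerprints_b:
            return False
        shared = fingerprints_a & fingerprints_b
        if len(shared) < 2:                      # at least two equal contents
            return False
        smaller = min(len(fingerprints_a), len(fingerprints_b))
        return len(shared) / smaller >= p        # minimum percentage P of equal contents

    site_a = {"d1", "d2", "d3", "d4"}
    site_b = {"d1", "d2", "d3", "d9"}
    print(are_duphosts(site_a, site_b, p=0.7))   # True: 3 of the 4 contents are shared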
Chapter 5
Validation
Several versions of the system were successively released until its design could not be
significantly improved. The final version of the system was subject to several experiments to
validate its efficiency. The data used to validate Webhouse was obtained through controlled
and observational methods. The controlled methods were used to validate the Webhouse
components individually. Replicated experiments measured differences before and after using
a new component. Dynamic analysis experiments collected performance results in the
production environment of the tumba! search engine.
The execution of experiments based on simulations with artificial data was minimized,
because the web is hard to reproduce in a controlled environment and the obtained results might not be representative of reality. The data collected to validate Webhouse as a
complete system was mainly gathered using observational methods. Webhouse was validated
using a case study approach during the development of a search engine for the Portuguese
Web. Data was obtained to measure the effectiveness of each new version of the system. The
final version of Webhouse was used in several projects to collect feedback on its performance
(field study validation method).
5.1 Crawler evaluation
This section presents the results obtained from experiments performed while harvesting the
Portuguese Web with the VN crawler. These crawls ran in June and July, 2005, with the
purpose of evaluating the crawler's performance. The analysis of the results enabled the
detection of bottlenecks and malfunctions, and helped tune the crawler's configuration according to the characteristics of the harvested portion of the web.
Web Warehouses require storage systems able to address the specific characteristics of web
collections. One peculiar characteristic of these collections is the existence of large amounts
of duplicates. The Versus Content Manager was designed to efficiently manage duplicates
through a manageable, lightweight and flexible architecture, so that it could be easily
integrated in existing systems.
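To make the idea of duplicate elimination at storage level concrete, the sketch below shows a minimal content-addressable store that keeps a single physical copy of each distinct content, keyed by the digest of its bytes. It illustrates the general technique only and is not a description of the actual Versus Content Manager implementation.

    # Sketch: duplicate elimination at storage level. Contents are stored under
    # the digest of their bytes, so a duplicate resolves to the copy already on
    # disk. Illustrative only; not the Versus design.
    import hashlib
    import os

    class DedupStore:
        def __init__(self, root):
            self.root = root
            os.makedirs(root, exist_ok=True)

        def put(self, content):
            # Store a content and return its key; duplicates are written only once.
            key = hashlib.sha1(content).hexdigest()
            path = os.path.join(self.root, key)
            if not os.path.exists(path):
                with open(path, "wb") as f:
                    f.write(content)
            return key

        def get(self, key):
            with open(os.path.join(self.root, key), "rb") as f:
                return f.read()

    store = DedupStore("/tmp/dedup-store")
    k1 = store.put(b"<html>same page</html>")
    k2 = store.put(b"<html>same page</html>")   # duplicate: no new file is created
    assert k1 == k2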
This section presents the results gathered from four experiments run on the Content Manager against NFS. These experiments replicate its application in several usage contexts. NFS was chosen as the
baseline, because it is widely known and accessible, enabling the reproducibility of the
experiments.
This section describes the main applications of Webhouse, covering the features and
operation of the tumba! search engine. It also describes how Webhouse was used in several
other research experiments, discusses selection criteria to populate a national web archive and
describes the use of Webhouse in a web archive prototype.
Chapter 6
Conclusions
The web is a powerful source of information, but additional tools are required to help users take advantage of its potential. One of the problems that these tools must address is
how to cope with a data source which was not designed to be automatically interpreted by
software applications. Web warehousing is an approach to tackle this problem. It consists of extracting data from the web, storing it locally and then providing uniform access methods
that facilitate its automatic processing and reuse by different applications. This approach is
conceptually similar to Data Warehousing approaches, used to integrate information from
relational databases. However, the peculiar characteristics of the web, such as its dynamics
and heterogeneity, raise new problems that must be addressed to design an efficient Web
Warehouse (WWh).
In general, the characteristics of the data sources have a major influence on the design of the
information systems that process the data. A major challenge in the design of Web
Warehouses is that web data models are scarce and quickly become stale. Previous web
characterization studies showed that the web is composed of distinct portions with peculiar
characteristics. It is important to accurately define the boundaries of these portions and model
them, so that the design of a WWh can reflect the characteristics of the data it will store. The
methodology used to sample the web influences the derived characterizations. Hence, the
samples used to model a portion of the web must be gathered using a methodology that
emulates the extraction stage of the web data integration process.
The characterization of the sites, contents and link structure of a web portion is crucial to
design an efficient Web Warehouse to store it. Some features derived from the analysis of the
global web may not be representative of more restricted domains, such as national webs.
However, these portions can be of interest to large communities, and characterizing a small portion of the web is quite feasible and can be done with great accuracy. Some web
characteristics, such as web data persistence, require periodical samples of the web to be
modelled. These metrics should be included in web models because they enable the identification of evolution trends that are decisive for the design of efficient Web Warehouses, which keep incrementally built data collections.
A set of selection criteria delimits the boundaries of a web portion and is defined according to the requirements of the application that will process the harvested data. The selection criteria should be easy to implement as an automatic harvesting policy. Selection criteria based on content classification and domain restrictions proved to be suitable options.
The methodology used to gather web samples influences the obtained characterizations. In the context of Web Warehousing, crawling is an adequate sampling method because it is the one most commonly used in web data extraction. However, the configuration and technology used in the crawler, and the existence of hazardous situations on the web, influence the derived models. Thus, the interpretation of statistics gathered from a web portion, such as a national web, goes beyond a mathematical analysis. It requires knowledge about web technology and the social reality of a national community.
How persistent is information on the web? Web data persistence cannot be modelled through
the analysis of a single snapshot of the web. Hence, models for the persistence of URLs and
contents of the Portuguese Web were derived from several snapshots gathered over three years.
The lifetime of URLs and contents follows an exponential distribution. Most URLs have short lives and the death rate is higher in the first months, but a minority persists for long periods of time.
The obtained half-life of URLs was two months and the main causes of death were the
replacement of URLs and the deactivation of sites. Persistent URLs are mostly static, short
and tend to be linked from other sites. The lifetime of sites (half-life of 556 days) is significantly longer than the lifetime of URLs. The obtained half-life for contents was just two days.
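As an illustrative check of these figures, assuming a simple exponential survival model (a modelling assumption, not a result reported here), the half-life fixes the decay rate and the expected surviving fraction after a given period:

    % Exponential survival model (illustrative assumption):
    % S(t) is the fraction of URLs still alive after time t, with the decay
    % rate lambda fixed by the half-life t_{1/2}.
    \[
      S(t) = e^{-\lambda t}, \qquad \lambda = \frac{\ln 2}{t_{1/2}}
    \]
    % With a URL half-life of two months, lambda = ln(2)/2 per month, so after
    % six months S(6) = (1/2)^{6/2} = 1/8, i.e. only about 12.5% of the URLs
    % would still be alive, consistent with the observation that most URLs
    % have short lives.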
The comparison of the obtained results with previous works suggests that the lifetime of contents is decreasing. Persistent contents were not related to depth and were not concentrated in particular sites. About half of the persistent URLs referenced the same content during their lifetime.
Web data models help with important design decisions in the initial phases of Web Warehousing projects. Duplication of contents is prevalent on the web, and it is difficult to
avoid the download of duplicates during a crawl because the duplicates are commonly
referenced by distinct and apparently unrelated URLs. A collection of contents built
incrementally presents an additional number of duplicates because many contents remain
unchanged over time and are repeatedly stored. Hence, eliminating duplicates at storage level
in a WWh is an appealing feature. However, the mechanisms adopted for the elimination of duplicates must address URL transience, which may prevent the implementation of algorithms based on historical analysis.