0% found this document useful (0 votes)

180 views

Web Ware Housing

This document provides an introduction and overview of web warehousing. It discusses how web warehousing aims to extract, transform, and load web data into a centralized system (called a Web Warehouse) to facilitate automatic processing and analysis of that data by applications like web mining tools. However, designing effective web warehouses is challenging due to the size, dynamic nature, heterogeneity, and lack of standards of web data. The document outlines the objectives and contributions of the research presented, which aims to address challenges in modeling the web and its influence on the design of web warehouses.

Uploaded by

rahulsogani123

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

180 views

Web Ware Housing

Uploaded by

rahulsogani123

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 50

CONTEMPORARY ISSUE ON SEMINAR

A STUDY ON

“WEB WAREHOUSING”

Session: 2009–11
Presented at

Submitted By: - Submitted To:-

Dishant Sisodia Mr. Rajat Mendiratta
MBA II Sem.

1
Acknowledgement

The beatitude, bliss & euphoria that accompany successful completion any task
would not be completed without the expression of appreciation of simple virtues to
the people who made it possible.

So, I take my immense pleasure in expressing a whole hearted thanks to all the
faculty members who guided me all the way making this project successful.

It is my privilege to express a deep sense of gratitude and thanks to Mr. RAJAT

MENDIRATTA for providing us various information directly related to project.

I am also thankful to Mrs. MAHIMA RAI (H.O.D.) for her guidance & cooperation in
this work.

I extend my gratitude and thankfulness to Apex Institute of Management & Science.

Date: Submitted By:

(Dishant Sisodia)
Place: Jaipur

2
Preface

The underlying aim of the seminar on contemporary issue as an integral part of MBA
program is to provide the students with practical aspects of the organization –
working environment.

Such type of presentation helps a student to visualize and realize about the
congruencies between the theoretical learning in the premises of college and actual
followed by the organization. It gives the knowledge of application aspect of the
theories learnt in the classroom.

The seminar project in WEB WAREHOUSING is a complete experience in itself,

which provide me with the understanding. This has become as inspirable of my
knowledge of management being learned in MBA program.

3
Executive Summary

Users require applications to help them obtaining knowledge from the web. However, the
specific characteristics of web data make it difficult to create these applications. One possible
solution to facilitate this task is to extract information from the web, transform and load it to a
Web Warehouse, which provides uniform access methods for automatic processing of the
data. Web Warehousing is conceptually similar to Data Warehousing approaches used to
integrate relational information from databases. However, the structure of the web is very
dynamic and cannot be controlled by the Warehouse designers. Web models frequently do
not reflect the current state of the web. Thus, Web Warehouses must be redesigned at a late
stage of development .These changes have high costs and may jeopardize entire projects.
This thesis addresses the problem of modelling the web and its influence in the design of
Web Warehouses. A model of a web portion was derived and based on it, a Web Warehouse
prototype was designed .The prototype was validated in several real-usage scenarios. The
obtained results show that web modelling is a fundamental step of the web data integration
process.

4
Index

1. Introduction ..........................................................1-5
1.1 Objectives and methodology...................................3
1.2 Contributions...........................................................4
2. Web characterization............................................6-9
2.1 Web characterization...............................................6
2.2 Terminology............................................................6
2.3 Sampling methods and identification .....................8
3. Crawling................................................................10-14
3.1 Crawler types and functioning................................10
3.2 Requirements..........................................................11
3.3 Web partitioning and assignment...........................13
3.4 Crawler examples...................................................13
4. Designing a Web Warehouse...............................15-38
4.1 The Versus repository...........................................16
4.1.1 Content Manager....................................................16
4.1.1.1 Elimination of partial duplicates in a WWh..............16
4.1.1.2 Data model.................................................................17
4.1.1.3 An algorithm for eliminating duplicates....................17
4.1.1.4 Fake duplicate............................................................18
4.1.1.5 Content Manager architecture....................................18
4.1.1.6 Implementation...........................................................19

4.1.2 Catalog.....................................................................20
4.1.2.1 Data model............................................................. ....20
4.1.2.2 Operational model.......................................................21

5
4.1.2.3 Implementation............................................................24

4.2 The VN crawler.....................................................25

4.2.1 Partitioning strategies............................................ 25
4.2.2 Crawler architecture.............................................. 27
4.2.2.1 Crawling algorithm...................................................29
4.2.2.2 Fault management.....................................................30
4.2.2.3 URL-seen test........................................................... 31
4.2.2.4 Implementation.........................................................33

4.3 Coping with hazardous situations.........................33

4.3.1 Spider traps....................................................34
4.3.2 Hard to interpret contents.......................................36
4.3.3 Duplicate hosts..............................................37
5. Validation......................................................39
5.1 Crawler evaluation................................................40
5.2 Versus content management................................40
5.3 Webhouse applications........................................40
6. Conclusions...................................................41-43
7. References.....................................................44

Chapter 1

6
Introduction

The web is the largest source of information ever built. It provides a quick, cheap and simple
publishing media. However, its full potential is far from being completely explored. Users
require applications for aiding them finding, summarizing and extracting useful knowledge
from web data. However, the web was designed to provide information to be interpreted by
humans and not automatically processed by software applications. The size and transience of
web data make it difficult to design efficient systems able to harness its complexity in useful
time and the heterogeneity and disrespect of standards make it difficult to automatically
interpret the data. Thus, the automatic processing of web data is a challenging task.

One possible solution to address the above challenges

is adopting a Data Warehousing approach. The idea is to extract information from the web,
transform and load it to a system, called a Web Warehouse (WWh), which provides uniform
access methods to facilitate the automatic processing of the data. Besides overcoming
accessibility problems, Web Warehousing extends the lifetime of web contents and its reuse
by different applications across time. Web warehousing has been widely used to support
numerous applications. The integration of web data for offline automatic processing by web
mining applications is a requirement of a wide range of systems. Web companies, such as
Google (www.google.com) or Amazon A9 (www.a9.com), rely on Web Warehouses to
support their businesses. The Internet Archive harvests and preserves web data for historical
purposes, following a Web Warehousing approach (www.archive.org).Marketeers analyze
web data

7
to determine commercial strategies (www.webanalyticsassociation.org). So, Web
Warehousing is a research area that has received a growing interest in the last years. Web
Warehousing is conceptually similar to Data Warehousing. Figure 1.1 presents the data
integration process in both. Data Warehouses integrate data gathered from tables in relational
databases. The data is migrated from its source models into an uniform data model. Then,
Data Mining applications generate statistical reports that summarize the knowledge contained
in the data. Web Warehouses integrate hyper textual documents gathered from sites on the
web .Web Warehouses also store information according to an uniform model that enables its
automatic processing by Web Mining applications .The characteristics of data influence the
design of an information system, so the first step in the design of a Warehouse is to analyze
the data sources .Data Warehousing assumes the existence of a well-defined model of the
data sources. They are usually On-Line Transaction Processing (OLTP) databases that respect
relational data models. On the other hand, the source of information that feeds Web
Warehouses is the web and not relational databases. Unlike relational databases, the structure
of the web cannot be controlled by the people that design the WWh and does not follow a
static structured data model. Models and characterizations of the web are scarce and
frequently outdated, not reflecting its current state, making it difficult to make realistic
assumptions in the design of a Web Warehouse. Frequently, Web Warehouses must be
redesigned at a late stage of development because problems are detected only when the WWh
leaves the experimental setup and begins to integrate information gathered from the real web.
These changes have high costs and may jeopardize entire projects.

1.1 Objectives and methodology

8
This work addresses the problem of modelling the web and its influence on Web
Warehousing as the main objective. This thesis aims to answer the following research
questions:

 Which features should be considered in a model of the web?

 How can the boundaries of a portion of the web be defined?
 What can bias a web model?

 How persistent is information on the web?

 How do web characteristics affect the design of Web Warehouses?

I believe that the task of modelling the web must be part of the process of web data
integration, because accurate models are crucial in making important design decisions at an
early WWh development stage. Web models also enable the tuning of a WWh to reflect the
evolution of the web. The methodology used in this research was mainly experimental. I
derived a model of a portion of the web and, based on it, I developed Webhouse, a WWh for
investigating the influence of web characteristics in Web Warehouses design. This
development was performed in collaboration with other members of my research group.
Figure 1.2 presents an overview of the components of Webhouse. Each one addresses one
stage of the integration process: modelling, extraction, transformation and loading. Although

9
the integration process is decomposed in several steps, they are not independent from each
other.

 Viúva Negra (VN): extracts information from the web by iteratively

following the linked URLs embedded in web pages. These systems are
broadly known as crawlers;
 WebCat: transforms web documents into an uniform data model (Martins&
Silva, 2005b). This component was designed and developed by BrunoMartins;
 Versus: loads and stores web data;
 Webstats: models the web, generating statistics on the documents and
correspondent meta-data stored in Versus.

The influence of web characteristics was studied during the design of each one of them. The
extraction most sensitive stage of the integration process, because the software component
interacts directly with the web and must address unpredictable situations. This thesis focuses
mainly on the aspects of extracting information from the web and loading it into the WWh.
The transformation of web data is not thoroughly discussed in this work. The efficiency of
Webhouse as a complete system was validated through its application in several real usage
scenarios.

This research was validated by applying the Engineering Method (Zelkowitz & Wallace,
1998). Several versions of Webhouse were iteratively developed and tested until the design
could not be significantly improved. The Portuguese Web was chosen as a case study to
analyze the impact of web characteristics in the design of a WWh. Models of the web were
extracted through the analysis of the information integrated in the WWh. On its turn, a WWh
requires models of the web to be designed. The performance of each version of the WWh and
gradually improve it. So, although this thesis presents a sequential structure, the actual
research was conducted as an iterative process.

1.2 Contributions

Designing Web Warehouses is complex and requires combining knowledge from

different domains. This thesis provides contributions in multiple aspects of webdata
integration research:

10
Web Characterization: concerns the monitoring and modelling of the web;

Web Crawling: investigates the automatic extraction of contents from the web;

Web Warehousing: studies the integration of web data.

My specific contributions in each field are:

Web Characterization:

 A thorough characterization of the structural properties of the Portuguese Web

(Gomes & Silva, 2005);
 New models for estimating URL and content persistence on the web.
 Despite the ephemeral nature of the web, there is persistent information and this
thesis presents a characterization of it (Gomes & Silva,2006a);
 A detailed description of hazardous situations on the web that make it difficult to
automate the processing of web data.

Web Crawling:

 A novel architecture for a scalable, robust and distributed crawler (Gomes &
Silva, 2006b);
 An analysis of techniques to partition the URL space among the processes of a
distributed crawler;
 A study of bandwidth and storage saving techniques, that avoid the download of
duplicates and invalid URLs.

Web Warehousing:

 A new architecture for a WWh that addresses all the stages of web data
integration, from its extraction from the web to its processing by mining
applications;
 An analysis of the impact of web characteristics in the design and performance of
a Web Warehouse;
 An algorithm that eliminates duplicates at storage level in a distributed system
(Gomes et al., 2006b).

11
Chapter 2

Web characterization

The design of efficient Web Warehouses requires combining knowledge from Web
characterization and Crawling. Web Characterization concerns the analysis of data samples to
model characteristics of the web. Crawling studies the automatic harvesting of web data.
Crawlers are frequently used to gather samples of web data in order to characterize it. Web
warehouses are commonly populated with crawled data. Research in crawling contributes to
optimizing the extraction stage of the web data integration process.

2.1 Web characterization

A characterization of the web is of great importance. It reflects technological and sociological

aspects and enables the study of the web evolution. An accurate characterization of the web
improves the design and performance of applications that use it as a source of information
(Cho & Garcia-Molina, 2000a). This section introduces the terminology adopted to clarify
web characterization concepts. It discusses sampling methodologies and the identification of
contents belonging to web communities. Finally, it presents previous works on the
characterization of the structural properties and information persistence on the web.

2.2 Terminology

As the web evolves, new concepts emerge and existing terms gain new meanings .Studies in
web characterization are meant to be used as historical documents that enable the analysis of
the evolution of the web. However, there is not a standard terminology and the current
meaning of the terms may become obscure in the future.

Between 1997 and 1999, the World-Wide Web Consortium (W3C) promoted the Web
Characterization Activity with the purpose of defining and implementing mechanisms to
support web characterization initiatives (W3C, 1999a). The scope of this activity was to
characterize the web as a general distributed system, not focusing on specific users or sites. In

12
1999, the W3C released a working draft defining a web characterization terminology (W3C,
1999b). The definitions used in this thesis were derived from that draft:

Content: file resulting from a successful HTTP download;

Media type: identification of the format of a content through a Multipurpose Internet Mail
Extension (MIME) type (Freed & Borenstein , 1996a);

Meta-data: information that describes the content. Meta-data can be generated during the
download of a content (e.g. time spent to be downloaded), gathered from HTTP header fields
(e.g. date of last modification) or extracted from a content (e.g. HTML meta-tags);

Page: content with the media type text/html (Connolly & Masinter, 2000);

Home page: content identified by an URL where the le path component is empty or a '/' only;

Link: hypertextual reference from one content to another;

Site: collection of contents referenced by URLs that share the same host name (Fielding et
al., 1999);

Invalid URL: an URL that references a content that cannot be downloaded;

Web server: a machine connected to the Internet that provides access to contents through the
HTTP protocol;

Duplicates: a set of contents that are bytewise equal;

Partial duplicates: a set of contents that replicate a part of a content;

Duplicate hosts (duphosts): sites with different names that simultaneously serve the same
content (Henzinger, 2003);

Subsite: cluster of contents within a site, maintained by a different publisher than that of the
parent site;

Virtual hosts: sites that have different names but are hosted on the same IP address and web
server;

Publisher or author: entity responsible for publishing information on the web. Some of the
definitions originally proposed in the draft are controversial and had to be adapted to become

13
more explicit. The W3C draft defined that a page was a collection of information, consisting
of one or more web resources, intended to be rendered simultaneously, and identified by a
single URL.

According to this definition, it is confusing to determine the contents that should be

considered as part of a page. For instance, consider an HTML content and its embedded
images. This information is meant to be rendered simultaneously but the images are
referenced by several URLs different from the URL of the HTML content. Researchers
commonly describe their experimental data sets providing the number of pages (Cho &
Garcia-Molina, 2000a; Fetterly et al., 2003; Lavoie et al.,1997). According to the W3C
definition, a data set containing one million pages should include embedded images.
However, most researchers considered that a page was a single HTML document.

A set of byte wise equal contents are duplicates. However, there are also similar contents that
replicate a part of another content (partial duplicates). Defining a criterion that identifies
contents as being similar enough to be considered the same is highly subjective. If multiple
contents only differ on the value of a visit counter that changes on every download, they
could reasonably be considered the same. However, when the difference between them is
only as short as a number on the date of a historical event, this small difference could be very
significant.

2.3 Sampling methods and identification of community webs

Web characterizations are derived from samples of the web. Ideally, each sample would be
instantly gathered to be a representative snapshot of the web. However, contents cannot be
accessed immediately, because of the latency times of the Internet and web server responses.
Hence, samples must be gathered within a limited time interval named the time span of the
sample. Structural properties of the web are derived from a snapshot of the web extracted
within a short timespan(Heydon & Najork, 1999). On the other hand, researchers also harvest
samples with a long times pan to study the evolution of the web (Fetterly et al., 2003).

There are two main sources of samples:

Traffic logs. The accesses to web contents through a given service are registered on log files.
Traffic logs can be obtained from web proxies (Bent et al., 2004;Mogul, 1999b), web servers
(Arlitt &Williamson, 1997; Davison, 1999; Iyengar et al., 1999; Marshak & Levy, 2003;

14
Rosenstein, 2000), search engines(Beitzel et al., 2004; Silverstein et al., 1999), web clients
(Cunha et al.,1995; Gribble & Brewer, 1997) or gateways (Caceres et al., 1998; Douglis et
al., 1997). The samples gathered from traffic logs are representative of the portion of the web
accessed by the users of a given service and not of the general information available on the
web;

Crawls. Contents are automatically harvested by a crawler to be characterized off-line.

Search engines are responsible for the largest crawls of the web.However, these crawls are
biased because search engines are mostly interested in gathering popular pages to present
them as search results to their users (Cho et al., 1998). Henzinger et al. (2000) showed that
the usage of random walks combined with the visit ratio and the PageRank values of the
pages visited can be used to gather unbiased samples of the web. O'Neill et al. (2003)
sampled the web by getting a list of randomly generated IP addresses and then attempting to
connect to the default HTTP port (80) at each address to find and harvest sites.

The main problem with this approach is that virtual hosts are not analyzed (OCLC, 2001).The
web is designed to break all the geographical barriers and make information universally
available. However, a WWh cannot store all the information from the web. So, it gathers data
from selected and well-defined web portions.

As the web is the product of multiple user groups, it is possible to identify portions within it
containing the sites of interest to them. These are designated as community webs and can be
defined as the set of documents that refer to a certain subject or are of interest to a community
of users. The detection of a community web is not always obvious, even if methods for
identifying its boundaries are available. If one is interested in a small and static set of
contents, then enumerating all the URLs that compose the community web can be adequate.
However, it becomes very expensive to maintain the list of URLs if it grows or changes
frequently (Webb, 2000).

Chapter 3
15
Crawling

Designing a crawler to harvest a small set of well-defined URLs is simple. However,

harvesting information spread across millions of pages requires adequate selection criteria
and system architectures. Most crawlers adopt distributed architectures that enable parallel
crawling to cope with the large size of the web.

Web warehouses use crawlers to extract data from the web. This section presents crawler
types and their operation. Then, it discusses architectural options to design a crawler and
strategies to divide the URL space among several crawling processes. Finally, it provides
crawler examples and compares their design and performance.

3.1 Crawler types and functioning

Crawlers can be classified in four major classes according to their harvesting strategies:

Broad. Collect the largest amount of information possible within a limited time interval
(Najork & Heydon, 2001);

Incremental. Revisit previously fetched pages, looking for changes (Edwards et al., 2001);

Focused. Harvest information relevant to a specific topic, usually with the help of a
classification algorithm, to filter irrelevant contents (Chakrabarti et al.,1999);

Deep. Harvest information relevant to a specific topic but, unlike focused crawlers,have the
capacity of filling forms in web pages and collect the returned pages(Ntoulas et al., 2005;
Raghavan & Garcia-Molina, 2001).

Although each type of crawler has specific requirements, they all present a similar
functioning. A crawl of the web is bootstrapped with a list of URLs, called the seeds, which
are the access nodes to the portion of the web to crawl.

For instance, to crawl a portion of the web containing all the contents hosted in the .GOV
domain, URLs from that domain should be used as seeds. Then, a crawler iteratively extracts
links to new URLs and collects their contents. The seeds should be carefully chosen to

16
prevent the crawler from wasting resources visiting URLs that do not reference accessible or
relevant contents. They can be gathered from different sources:

User submissions. The seeds are posted by the users of a given service. However, many of
them are invalid because they were incorrectly typed or reference sites still under
construction;

Previous crawls. The seeds are extracted from a previous crawl. The main problem of this
source of seeds is that URLs have short lives and an old crawl could supply many invalid
seeds;

Domain Name System listings. The seeds are generated from domain names. However, the
domains reference servers on the Internet and some of them are not web servers. So, the
generated seeds may not be valid. Another problem is that the lists of the top-level domains
of the web portion to be crawled are usually not publicly available.

3.2 Requirements

There are several types of crawlers. Although each one has specific requirements ,they all
share ethical principles and address common problems. A crawler must be:

Polite. A crawler should not overload web servers. Ideally, the load imposed while crawling
should be equivalent to that of a human while browsing. A crawler should expose the
purposes of its actions and not impersonate a browser, so that webmasters can track and
report inconvenient actions

Robust. The publication of information on the web is uncontrolled. A crawler must be robust
against hazardous situations that may affect its performance or cause its mal-functioning;

Fault tolerant. Even a small portion of the web is composed by a large number of contents,
which may take several days to be harvested. Crawlers frequently present a distributed
architecture comprising multiple components hosted on different machines.

Able to collect meta-data. There is meta-data temporarily available only during the crawl
(e.g. date of crawl). A crawler should keep these meta-data because it is often required by the
WWh clients. For instance, the Content-Type HTTP header field identifies the media type of
content. If this meta-data element is lost, the content type must be guessed later;

17
Configurable. A crawler should be highly configurable to enable the harvesting of different
portions of the web without suffering major changes;

Scalable. The crawl of a portion of the web must be completed within a limited time and the
download rate of a crawler must be adequate to the requirements of the application that will
process the harvested data. A WWh that requires weekly refreshments of data cannot use a
crawler that takes months to harvest the required web data.

Economic. A crawler should be parsimonious with the use of external resources, such as
bandwidth, because they are outside of its control. A crawler may to connect the Internet
through a large bandwidth link but many of the visited web servers do not;

Manageable. A crawler must include management tools that enable the quick detection of its
faults or failures. For instance, a hardware failure may

Figure 3.1: Crawler architectures.

require human intervention. On the other hand, the actions of a crawler may be deemed
unacceptable to some webmasters. So, it is important to keep track of the actions executed by
the crawler for latter identification and correction of undesirable behaviours.

3.3 Web partitioning and assignment

18
A partitioning function maps an URL to its partition. The main objective of partitioning the
URL space is to distribute the workload among the Crawling Processes, creating groups of
URLs that can be harvested independently. After partitioning, each CP is responsible for
harvesting exclusively one partition at a time. The partitioning strategy has implications on
the operation of the crawler.In general, the following partitioning strategies may be
considered:

IP partitioning. Each partition contains the URLs hosted on a given IP address;

Site partitioning. Each partition contains the URLs of a site. This partitioning schema differs
from the above, because several sites may be hosted on the same IP address (virtual hosts)
and each will be crawled separately;

Page partitioning. Each partition contains a fixed number of URLs independently from their
physical location. A partition may contain URLs hosted on different sites and IP addresses.
Page partitioning is suitable to harvest a selected set of independent pages spread on the web.

3.4 Crawler examples

 The Googlebot is present in the Google original research paper (Brin &
Page,1998). The Googlebot is present in the Google original research paper (Cho
et al., 2004).
 Silva et al. (1999) described the CobWeb crawler, one of the components of a
search engine for the Brazilian web that used proxy servers to reduce implemen-
tation costs and save network bandwidth when updating a set of documents.
 Boldi et al. (2002b) presented the Ubicrawler, giving special attention to its fault
tolerance and scalability features.

19
Table -1: Crawler design options.

Chapter 4

Designing a Web Warehouse

The Web is a powerful source of information, but its potential can only be harnessed with
applications specialized in aiding web users. However, most of these applications cannot

20
retrieve information from the web on-the-y, because it takes too long to download the data.
Pre-fetching the required information and storing it in a Web Warehouse (WWh) is a
possible solution. This approach enables the reuse of the stored data by several applications
and users.

AWWh architecture must be adaptable so it may closely track the evolution of the web,
supporting distinct selection criteria and gathering methods. Meta-data must ensure the
correct interpretation and preservation of the stored data. The storage space must
accommodate the collected data and it should be accessible to humans and machines,
supporting complementary access methods to fulfil the requirements of distinct usage
contexts. These access methods should provide views on past states of the stored data to
enable historical analysis.

Figure 4.1: Webhouse architecture.

The focus of the chapter is on the design of the Webhouse prototype, discussing the
extraction, loading and management of web data.

21
Figure 4.2: Versus architecture.

4.1 The Versus repository

Figure 4.2 represents the Versus repository architecture. It is composed by the Content
Manager and the Catalog. The Content Manager provides storage space for the contents
(Gomes et al., 2006b). The Catalog provides high performance access to structured meta-
data. It keeps information about each content, such as the date when it was collected or the
reference to the location where it was stored in the Content Manager.

4.1.1 Content Manager

Web warehousing involves a large amount of data and claims for storage systems able to
address the specific characteristics of web collections. The duplication of contents is
prevalent in web collections. It is difficult to avoid downloading duplicates during the crawl
of a large set of contents, because they are commonly referenced by distinct and apparently
unrelated URLs (Bharat & Broder, 1999; Kelly & Mogul, 2002; Mogul, 1999a). Plus, the
contents kept by a WWh have an additional number of duplicates, because it is built
incrementally and many contents remain unchanged over time, being repeatedly stored.

4.1.1.1 Elimination of partial duplicates in a WWh

Delta storage or encoding, is a technique used to save space that consists on storing only the
difference from a previous version of a content (MacDonald,1999). There are pages that
suffer only minor changes, such as the number of visitors received or the current date. Delta
storage enables storing only the part of the content that has changed, eliminating partial
duplicates.

Figure 4.3: Storage structure of a volume: a tree holding blocks on the leafs.

22
4.1.1.2 Data model

The data model of the Webhouse Content Manager relies on three main classes:

Instance, volume and block. The instance class provides a centralized view of a storage space
composed of volumes containing blocks. Each block keeps a content and related operational
meta-data. The signature is the number obtained from applying a fingerprinting algorithm to
the content. A content key contains the signature of the content and the volume where it was
stored. A block holds an unique content within the volume. It is composed by a header and a
data container (see Figure 4.3).

4.1.1.3 An algorithm for eliminating duplicates

The location of a block within the volume tree is obtained by applying a function called
sig2location to the content's signature. Assuming that the signature of a content is unique,
two contents have the same location within a volume if they are duplicates. Consider a
volume tree with depth n and a signature with m bytes of length. Sig2location uses the (n - 1)
most significant bytes in the signature to identify the path to follow in the volume tree. The
ith byte of the signature identifies the tree node with depth i. The remaining bytes of the
signature (m-n-1) identify the block name on the leaf of the tree. For instance,considering a
volume tree with depth 3, the block holding a content with signature ADEE2232AF3A4355
would be found in the tree by following the nodes AD, EEand leaf 2232AF3A4355.

The detection of duplicates is performed during the storage of each content, ensuring that
each distinct content is stored in a single block within the instance. When a client requests the
storage of a content, the system performs a sequence of tasks:

1. Generate a signature s for the content;

2. Apply sig2location to the signature and obtain the location l of the corresponding
block;

3. Search for a block in location l within the n volumes that compose the instance,
multicasting requests to the volumes;

23
4. If a block is found in volume, the content is considered to be a duplicate and its
reference counter is incremented. Otherwise, the content is stored in a new block with
location l in the volume identified by s mod n;

5. Finally, a content key referencing the block is returned to the client.

4.1.1.4 Fake duplicates

Theoretically, if two contents have the same signature they are duplicates. However,
fingerprinting algorithms present a small probability of collision that causes the generation of
the same signature for two different contents (Rabin, 1979).Relying exclusively on the
comparison of signatures to detect duplicates within a large collection of contents, could
cause some contents to be wrongly identified as duplicates and not stored. These situations
are called fake duplicates.

4.1.1.5 Content Manager architecture

The Versus Content Manager presents a distributed architecture composed by a set of

autonomous nodes that provide disk space with no central point of coordination, extending
storage capacity by adding new nodes without imposing major changes in the system.
Although network file systems also provide distributed access to data, they are executed at
operating system kernel level and require administrative privileges to be installed and
operated (Rodeh & Teperman, 2003).

On its turn, the Content Manager is platform independent and runs at application level
without imposing changes in the configuration of the underlying operating system. Peer-to-
peer file systems, such as Oceanstore (Rhea et al., 2003), are designed to manage a large and
highly variable set of nodes with small storage capacity, distributed over wide-area networks
(typically the Internet). This raises specific problems and imposes complex intra-node
communication protocols that guarantee properties such as security, anonymity or fault
tolerance that unnecessarily limit throughput on controlled networks. An experiment
performed by the authors of Ocean store showed that it is on average 8.3 times slower than
NFS(Callaghan et al., 1995) on a local-area network (LAN).

24
Figure 4.4 depicts the architecture of the Content Manager.

An instance is composed by a thin middleware library, the connector object and the volume
servers. Clients access an instance through a connector object that keeps references to the
volumes that compose the instance. A change in the composition of the instance, such as the
addition of a new volume, implies an update of the connector. Each volume server manages
the requests and executes the correspondent low-level operations to access the contents. The
contents are transmitted between the library and the servers in a compressed format to reduce
network traffic and data processing on the server.

4.1.1.6 Implementation

The storage structure of a volume was implemented as a directory tree over the file system
where the blocks are files residing at the leaf directories. The block header is written in
ASCII format so that it can be easily interpreted, enabling access to the content kept in the
block independently from the Web store software. A 64-bit implementation of Rabin's
fingerprinting algorithm was used to generate the content signatures (Rabin, 1979). The
Content Manager supports Zlib as the built-in compression method, but other compression
algorithms can be included. This way, contents can be compressed using adequate algorithms
and accommodate new formats. The connector object was implemented as an XML file. The
library and the volume servers were written in Java using JDK

25
Figure 4.5: Versus Content Manager data model.

1.4.2 (6 132 lines of code). The communication between them is through Berkeley sockets.
Currently, clients access volume servers in sequence, (a multicast protocol is not yet
implemented). Volume servers are multi-threaded, launching a thread for handling each
request. Volume servers guarantee that each block is accessed in exclusive mode through
internal block access lists.

4.1.2 Catalog

This section describes the data model that supports the Catalog and the operational model that
enables parallel loading and access to the data stored in the Versus repository.

4.1.2.1 Data model

Figure 4.5 presents the UML class model of the Catalog. This model is generic to enable its
usage for long periods of time independently from the evolution of web formats. Plus, it also
enables the usage of the repository in contexts different from Web Warehousing. For
instance, it can be applied to manage meta-data on the research articles kept in a Digital
Library. However, it is assumed that the contents are inserted into the repository in bulk
loads.

4.1.2.2 Operational model

26
A Versus client application is composed by a set of threads that process data in parallel. Each
application thread does its specific data processing and Versus is responsible for managing
and synchronizing them. The operational model of Versus was inspired on the model
proposed by Campos (2003). It is composed by three workspaces with different features that
keep the contents meta-data:

Archive (AW). Stores meta-data permanently. It keeps version history for the contents to
enable the reconstruction of their earlier views. The AW is an append-only storage, the data
stored cannot be updated or deleted;

Group (GW). Keeps a temporary view of the meta-data shared by all application threads. It
enables the synchronization among the application threads and data cleaning before the
archival of new data;

Private (PW). Provides local storage and fast access to data by application threads. Private
workspaces are independent from each other and reside on the application threads addressing
space. An application thread can be classified in three categories according to the task it
executes:

Loader. Generates or gathers data from an information source and loads it into Versus;

Processor. Accesses data stored in Versus, processes it and loads the resulting data;

Reader. Accesses data stored in Versus and does not change it, neither generates new
information.

The data stored is partitioned to enable parallel processing. A Working Unit (WU) is a data
container used to transfer partitions of meta-data across the workspaces. The Working Units
are transferred from one workspace to another via check-out and check-in operations (Katz,
1990). When a thread checks-out a WU, the corresponding meta-data is locked in the source
workspace and it is copied to the destination workspace. When the thread finishes the
processing of the WU, it integrates the resulting data into the source workspace (check-in).

The threads that compose an application share the same partitioning function. There are two
classes of Working Units:

Strict. Contain exclusively the Versions that belong to the Working Unit. Strict Working
Units should be used by applications that that do not need to create new Versions;

27
Extended. The Extended Working Units may also contain Versions that do not belong to the
WU, named the External Versions.

Figure 4 .6: Accessing information stored in Versus.

Access

Figure 4.6 depicts the access to information stored on the Catalog. Each work space provides
different access levels to the meta-data. The Archive Work-space provides centralized access
to all the archived Versions. The applications define the time space of the Versions they want
to access through a Resolver object. For instance, an application can use a Resolver that
chooses the last Version archived from each Source. The Group Workspace also provides
centralized access to the stored information but it holds at most one Version from each
Source.

It does not provide historical views on the data. The Private Workspaces are hosted on the
application threads and keep one WU at a time enabling parallel processing. The Archive and
Group Workspace should be hosted on powerful machines while the Private Workspaces can
be hosted on commodity servers.

The workflow of a parallel application that accesses data stored in Versus is the following:

1. The application checks-out the required meta-data from the AW to the GW;

2. Several application threads are launched in parallel. Each one of them starts its own PW
and iteratively checks-out one WU at a time, processes it and executes the check-in into the

28
GW. The contents cannot be updated after they are set and any change on a content must be
stored as new Facet;

3. When there are no unprocessed Working Units the new data kept in the GW is checked-in
into the AW.

The contents are not transferred in the check-out operations, they are retrieved on-demand
from the Content Manager. There are two assumptions behind this design decision. The first
is that the contents belonging to a WU represent an amount of data much larger than the
corresponding meta-data.

Load

Figure 4.7 depicts the loading of data into Versus. The meta-data is loaded by the application
threads into the Private Workspaces, while the contents are stored in the Content Manager
that eliminates duplicates at storage level. However, if a content is identified as a duplicate,
the corresponding meta-data, such as URL, is still stored in the Catalog PW. This way, the
Webhouse clients can later access the warehoused collection independently from the
elimination of duplicates mechanism. The work flow of a parallel application that loads
information into Versus is the following:

1. The application creates a new empty layer on the GW;

2. Several parallel threads are launched. Each one of them starts its own PW and iteratively
checks-out one emptyWU, loads it with meta-data extracted from the Sources and executes
the check-in into the GW;

3. When there are no unprocessed Working Units the new data kept in the GW is checked-in
into the AW

29
Figure 4.7: Loading data into Versus

The references to the new contents loaded into the Content Manager are kept as meta-data in
the PW. If an application thread fails before the check-in, for instance due to a power failure,
the references to the contents would be lost, originating orphan contents that could not be
later accessed. Versus provides recovery methods to restart the processing of a Working Unit
and remove the orphan contents if an application thread fails. Versus also supports direct
loads to the Group or Archive Workspaces but they should be used for small amounts of data
because parallel loading is not supported.

4.1.2.3 Implementation

The Catalog was mainly implemented using the Java environment and relational database
management systems (DBMS). The AW and GW were implemented using Oracle 9i DBMS
(Oracle Corporation, 2004). The advanced administration features of this DBMS, such as
partitioning or query optimizations, enable the configuration of the system to be used in the
context of Web Warehousing, addressing efficiently the processing of large amounts of data.
The use of the SQL language for data manipulation enabled the reuse of the code in the three
kinds of workspaces, although each one also had particular data structures and optimization
profiles.

The PW used Hyper Sonic SQL DBMS (HyperSonicSQL). It is written in Java and can be
configured to run in three modes:

30
Memory. The DBMS runs inside the client application. The data is kept exclusively in-
memory. If there is failure of the client application the data is lost;

File. The DBMS run inside the client application but the data is stored in files;

Client/server. The DBMS and client applications run independently and communicate
through a network connection using JDBC. The data can be kept in memory or in files.

4.2 The VN crawler

A WWh crawls data from the web to extract new information. The permanent evolution of
the web and the upcoming of new usage contexts demands continuous

Table 2: Comparison of the partitioning strategies

research in crawling systems. Kahle (2002), the founder of the Internet Archive,revealed that
their commercial crawler is rewritten every 1218 months to reflect changes in the structure of
the web. Although a crawler is conceptually simple,its development is expensive and time
consuming, because most problems arise when the crawler leaves the experimental
environment and begins harvesting the web.

4.2.1 Partitioning strategies

A suitable partitioning function that divides the URL space across the set of Crawling
Processes that compose a distributed crawler must be chosen according to the characteristics
of the portion of the web being harvested. Otherwise, the requirements for a crawler may not
be fulfilled. Three partitioning strategies were analyzed: IP, site and page partitioning (see
Chapter 2). The number of URLs contained in a partition should be ideally constant to
facilitate load balancing.

The page partitioning is the most adequate according to this criterion. The IP partitioning
tends to create some extremely large partitions due to servers that host thousands of sites,
such as Geocities (www.geocities.com) or Blogger (www.blogger.com). The site partitioning

31
is more likely to create partitions containing a single URL, due to sites under construction or
presenting an error message.

Table 2 summarizes the relative merits of each strategy, which are characterized by the
following determinants:

DNS caching. A CP executes a DNS lookup to map the site name contained in an URL into
an IP address, establishes a TCP connection to the correspondent web server and then
downloads the content. The DNS lookups are responsible for 33% of the time spent to
download a content (Habib & Abrams, 2000). Hence, caching a DNS response and using it to
download several contents from the same site optimizes crawling. A CP does not execute any
DNS lookup during the crawl when harvesting an IP partition,because all the URLs are
hosted on the IP address that identifies the partition.

A site partition requires one DNS lookup to be harvested because all its URLs have the same
site name. A page partition contains URLs from several different sites, so a CP would not
benefit from caching DNS responses;

Use of keep-alive connections. Establishing a TCP connection to a web server takes on

average 23% of the time spent to download a content (Habib & Abrams, 2000). However,
HTTP keep-alive connections enable the download of several contents reusing the same TCP
connection to a server (Fielding et al., 1999).

A page partition contains URLs hosted on different servers,so a CP does not benefit from
using keep-alive connections. On the other hand, with IP partitioning an entire server can be
crawled through one single keep-alive connection. When a crawler uses site partitioning, a
single keep-alive connection can be used to crawl a site. However, the same web server may
be congured to host several virtual hosts. Then, each site will be crawled through a new
connection;

Server overloading. In general, a crawler should respect a minimum interval of time

between consecutive requests to the same web server to avoid overloading it. This is called
the courtesy pause. Page partitioning is not suitable to guarantee courtesy pauses, because the
URLs of a server are spread among several partitions.

Reuse of site meta-data. Sites contain meta-data, such as the Robots Exclusion file, that
influences crawling. The page partitioning strategy is not suitable to reuse the site's meta-data

32
because the URLs of a site are spread across several partitions. With the IP partitioning, the
site's meta-data can be reused, but it requires additional data structures to keep the
correspondence between the sites and the meta-data.

Independency. The site and page partitioning enable the assignment of an URL to a partition
independently from external resources. The IP partitioning depends on the DNS servers to
retrieve the IP address of an URL and it cannot be applied if the DNS server becomes
unavailable. If the site of an URL is relocated to a different IP address during a crawl, two
invocations of the function for the same URL would return different partitions.

4.2.2 Crawler architecture

This section details the design of the VN crawler. It was designed as a Versus client
application to take advantage of the distribution features provided by the Versus repository.
VN has a hybrid Frontier, uses site partitioning and dynamic-pull assignment:

Hybrid frontier. Each CP has an associated Local Frontier where it stores the meta-data
generated during the crawl of a partition. The meta-data on the seeds and crawled URLs is
stored on the Global Frontier. A CP begins the crawl of a new site partition by transferring a
seed from the Global to its Local Frontier (check-out). Then, the URLs that match the site are
harvested by the CP. When the crawl of the partition is finished, the correspondent meta-data
is transferred to the Global Frontier (check-in).

A CP successively checks-out a partition containing a seed, harvests the corresponding

information from the web and checks-in the resultant meta-data, until there are no unvisited
seeds in the Global Frontier;

Site partitioning. Besides the advantages discussed in the previous Section, three additional
reasons lead to the adoption of the site partitioning strategy. First, a CP frequently accesses
the Local Frontier to execute the URL-seen test. As Portuguese sites are typically small and
links are mostly internal to the sites, the Local Frontier can be maintained in memory during
the crawl of the site to optimize the execution of the URL-seen test. Second, web servers are
designed to support access pat-terns typical of human browsing. The crawling of one site at a
time enables the reproduction of the behavior of browsers, so that the the actions of the
crawler do not disturb the normal operation of web servers. Third, site partitioning facilitates
the implementation of robust measures against spider traps;

33
Figure 4.8: VN architecture.

Dynamic-pull assignment. The Global Frontier assigns partitions to Crawling Processes as

they pull them. The Global Frontier guarantees that a site is never harvested simultaneously
by two Crawling Processes. The Global Frontier identifies each partition with the site's hash
and manages three lists: i) sites to crawl; ii) sites being crawled and; iii) sites crawled. When
a CP checks-out a partition, it is moved from the first to the second list.

The checks-in moves the partition from the second to the third list.Figure 4.8 describes VN's
architecture. It is composed by a Global Frontier, a Manager that provides tools to execute
administrative tasks and several Crawling Nodes (CNodes). The Manager is composed by:

 The Seeder that generates seeds to a new crawl from user submissions, DNS
listings and home pages of previously crawled sites and inserts them in the Global
Frontier;
 The Reporter that gets statistics on the state of the system and emails them to a
human Administrator;
 The Cleaner allows to release resources acquired by faulty Crawling Processes.
Each CNode hosts:

34
Figure 4.9: Sequence diagram: crawling a site.

 The Crawling Processes and the corresponding Local Frontiers;

 One Volume that stores the harvested contents;
 One Watchdog that restarts the Crawling Processes if they are considered dead
(inactive for a given period of time).

The scheduling of the execution of the Crawling Processes within a CNode is delegated to the
operating system. It is assumed that when a CP is blocked, for instance while executing IO
operations, another CP is executed.

4.2.2.1 Crawling algorithm

Crawlers get a seed to a site and follow the links within it to harvest its contents.They usually
impose a depth limit to avoid the harvesting of infinite sites (Baeza Yates & Castillo, 2004).
There are three policies to traverse links within a site (Cothey, 2004):

Best-first. The crawler chooses the most relevant URLs to be crawled first according to a
given criteria as, for instance, their PageRank value (Brin & Page, 1998);

35
Breadth-first. The crawler iteratively harvests all the URLs available at each level of depth
within a site;

Depth-first. The crawler iteratively follows all the links from the seed until the maximum
depth is achieved.

4.2.2.2 Fault management

To face hazardous situations while crawling the web and possible hardware problems on the
underlying cluster of machines, VN was designed to tolerate faults at different levels in its
components.

A CP launches an independent thread (Collector) to execute sensitive tasks, such as the

download, meta-data extraction and parsing of a content. The CP terminates the execution of
the Collector after a limited time. Hence, a fault caused by the harvest and processing of a
single content does not compromise the crawl of the remaining contents at the site. However,
this imposes the overhead of launching a new thread to crawl each URL.

Figure 4.10: Deep vs. home page seeding policies.

4.2.2.3 URL-seen test

36
The URL-seen test is executed in two steps: first, when the URLs are inserted in the Local
Frontier and upon the check-in to the Global Frontier. 81. % of the links embedded in pages
reference URLs internal to its site (Broder et al., 2003).

The URL-seen test for internal URLs is done locally because all the seen URLs belonging to
the site are covered by the Local Frontier. So, when the CP finishes harvesting the site, it can
check-in the internal URLs to the Global Frontier without further testing.

Home page. The home page policy assumes that all the contents within a site are accessible
through a link path from its home page. Hence, a CP replaces every external URL by its site
home page before inserting it in the Local Frontier (see Figure 5.10). The home page policy
reduces the number of external URLs to check-in. However, if a CP cannot follow links from
the home page, the remaining pages of the site will not be harvested;

Deep link. A deep link references an external URL different than the home page. The deep
link policy assumes that there are pages not accessible through a link path from the home
page of the site. The CP inserts the external URLs without any change in the Local Frontier
to maximize the coverage of the crawl. For instance, in Figure 4.10 the URL
www.othersite.com/orphan.html is not accessible from the home page of the site but it is
linked from the site www.mysite.com. However, if the external URL references a content
without links, such as a postscript document, the crawl of the site would be limited to this
content.

Combined. Follows deep links but always visits the home page of the sites. This policy is
intended to maximize coverage. Even if a deep link references a content without links, the
remaining site accessible through a link path from the home page will be harvested.

VN supports the home page and combined policies. As an optimization, when VN is

configured to follow the home page policy, it discards the external URLs hosted on the sites
contained in the seeds of the crawl, because they were already inserted in the Global Frontier
by the Seeder. Discarding external URLs contained in the initial seed list breaks the deep link
and combined policies, because a link may reference a page from a site contained in the
initial seed list that is not accessible from the home page.

Discarding malformed URLs. A malformed URL is syntactically incorrect (Berners-Lee et

al., 2005). Malformed URLs are most likely caused by a typing errors. For instance, an URL

37
containing white spaces is syntactically incorrect. However, there are web servers that enable
the usage of malformed URLs;

Discarding URLs that reference unregistered sites. The site name of an URL must be
registered in the DNS. Otherwise, the crawler would not be able to map the domain name into
an IP address to establish a connection to the server and download the content. Thus, an URL
referencing an unregistered site name is invalid. However, testing if the site names of the
URLs are registered before inserting them into the Frontier imposes an additional overhead
on the DNS servers.

Duplicates occur when two or more different URLs reference the same content. A crawler
should avoid harvesting duplicates to save on processing, bandwidth and storage space. The
crawling of duplicates can be avoided through the normalization of URLs:

1. Case normalization: the hexadecimal digits within a percent-encoding triplet (e.g., "%3a"
versus "%3A") are case-insensitive and therefore should be normalized to use uppercase
letters for the digits A-F;

2. Percent-Encoding Normalization: decode any percent-encoded octet that corresponds to an

unreserved character;

3. Convert site name to lower case: the domain names are case insensitive thus, the URLs
www.site.com/ and WWW.SITE.COM/ reference the same content;

4. Convert relative to absolute le paths: For instance, www.site.com/dir/../index.html to

www.site.com/index.html;

5. Remove identification of the HTTP default port 80 : For instance,

changewww.site.com:80/index.html to www.site.com/index.html;

6. Add trailing '/' when the path is empty: The HTTP specification statesthat if the path name
is not present in the URL, it must be given as '/'when used as a request for a resource
(Fielding et al., 1999). Hence, the transformation must be done by the client before sending a
request. This rule of normalization prevents that URLs, such as www.site.com and www.
site.com/, originate duplicates;

38
7. Remove trailing anchors: anchors are used to reference a part of a page
(e.gwww.site.com/file#anchor). However, the crawling of URLs that divert only on the
anchors would result in repeated downloads of the same page;

8. Add prex "www." to site names that are second-level domains: the following section will
show that most of the sites named with a second-level domain are also available under the
site name with the prex "www.";

9. Remove well-known trailing le names: two URLs that are equal except for a well known
trailing le name such as "index.html", "index.htm", "index.shtml", "default.html" or
"default.htm", usually reference the same content. The results obtained in experiments
crawling the Portuguese Web showed that removing these trailing le names reduced the
number of duplicates by 36%. It is technically possible that the URLs with and without the
trailing le reference different contents. However, situations of this kind were not found in the
experiments. The conclusion is that this heuristic does not reduce the coverage of a crawler
noticeably.

4.2.2.4 Implementation

The VN web crawler integrates components developed within the XLDB Group and external
software. It was mainly written in Java using jdk1.4.2 (3 516 lines of code), but it also
includes software components implemented in native code. The Crawling Processes use hash
tables to keep the list of duphosts and the DNS cache. The Parser was based on WebCAT, a
Java package for extracting and mining meta-data from web documents (Martins & Silva,
2005b).

The Classifier used to harvest the Portuguese Web includes a language identifier (Martins &

Silva, 2005a). The Robots Exclusion le interpreter was generated using Jlex (Berk &
Ananian, 2005). The Seeder and the Cleaner are Java applications. The Reporter and
Watchdog were implemented using shell scripts that invoke operating system commands,
such as ps or iostat.

4.3 Coping with hazardous situations

The web is very heterogeneous and there are hazardous situations to crawling that disturb the
extraction of web data to be integrated in a WWh. Some of them are malicious, while others

39
are caused by mal-functioning web servers or authors that publish information on the web
without realizing that it will be automatically processed. Crawler developers must be aware
of these situations to design robust crawlers capable of coping with them.

The description of hazardous situations to crawling is scarce in the scientific literature,

because most experiments are based on simulations or short-term crawls that do not enable
their identification. Hazardous situations are commonly ignored in academic studies because
the scientific hypotheses being tested assume a much simpler model of the web than observed
in reality. Hence, the detection of hazardous situations on the web is a recurrent problem that
must be addressed by every new system developed to process web data. Moreover, new
hazardous situations arise as the web evolves, so their monitoring and identification requires
a continuous effort.

4.3.1 Spider traps

Heydon & Najork (1999) defined a spider trap as an URL or set of URLs that cause a crawler
to crawl indefinitely. In this thesis the definition was relaxed and situations that significantly
degrade the performance of a crawler were also considered as spider traps, although they may
not originate infinite crawls.

DNS wildcards. A zone administrator can use a DNS wildcard to synthesize resource records
in response to queries that otherwise do not match an existing domain (Mockapetris, 1987).
In practice, any site under a domain using a wildcard will have an associated IP address, even
if nobody registered it. DNS wildcards are used to make sites more accepting of
typographical errors in URLs because they redirect any request to a site under a given domain
to a default doorway page (ICANN, 2004).

Malfunctions and infinite size contents. Malfunctioning sites are the cause of many spider
traps. These traps usually generate a large number of URLs that reference a small set of pages
containing default error messages. Thus, they are detectable by the abnormally large number
of duplicates within the site. For instance, sites that present highly volatile information, such
as online stores, generate their pages from information kept in a database.

If the database connection breaks, these pages are replaced by default error messages
informing that the database is not available. A crawler can mitigate the effects of this kind of
traps by not following links within a site when it tops a number of duplicates.

40
Session identifiers and cookies. HTTP is a stateless protocol that does not allow tracking of
user reading patterns by itself. However, this is often required by site developers, for
instance, to build proles of typical users.A session identifier embedded in the URLs linked
from pages allows maintaining state about a sequence of requests from the same user.

Directory list reordering. Apache web servers generate pages to present lists of files
contained in a directory. This feature is used to easily publish files on the web. Figure 5.11
presents a directory list and its embedded links. The directory list contains 4 links to pages
that present it reordered by Name, Last-Modified date, Size and Description, in ascendant or
descendent order.

Figure 4.11: Apache directory list page and the linked URLs.

Growing URLs. A spider trap can be set with a symbolic link from a directory/spider to the
directory / and a page /index.html that contains a link to the /spider directory. Following the
links will create an infinite number of URLs (www.site.com/spider/spider/...) (Jahn, 2004).
Although, this example may seem rather academic, these traps exist on the web. There are

41
also advertisement sites that embedded the history of the URLs followed by an user on the
links of their pages.

4.3.2 Hard to interpret contents

Crawlers interpret the harvested contents to extract valuable data such as links or texts. If a
crawler cannot extract the linked URLs from a page, it will not be able to iteratively harvest
the web. The page text is important for focused crawlers that use classification algorithms to
determine the relevance of the contents, for instance, a focused crawler could be interested in
harvesting contents containing a set of words. However, the extraction of data from contents
is not straightforward because there are situations on the web that make contents hard to
interpret:

Wrong identification of media type. The media type of a content is identified through the
HTTP header field Content-Type. HTTP clients choose the adequate software to interpret the
content according to its media type. For instance, a content with the Content-Type
application/pdf is commonly interpreted by the Adobe Acrobat software.

Malformed pages. A malformed content does not comply with its media type format
specification, which may prevent its correct interpretation. Malformed HTML contents are
prevalent on the web. One reason for this is that authors commonly validate their pages
through visualization on their browsers, which tolerate format errors to enable the
presentation of pages to humans without visible errors. As a result, the HTML interpreter
used by a crawler should also be tolerant to common syntax errors, such as unmatched tags
(Martins & Silva, 2005b; Woodruffet al., 1996);

Cloaking. A cloaking web server provides different contents to crawlers than to other clients.
This may be advantageous if the content served is a more crawler-friendly representation of
the original. For instance, a web server can serve a Macromedia Shockwave Flash Movie to a
browser and an alternative XML representation of the content to a crawler. However,
spammers use cloaking to deceive search engines without inconveniencing human visitors.

JavaScript-intensive pages. JavaScript is a programming language created to write

functions, embedded in HTML pages that enable the generation of presentations that were not
possible using HTML alone. The AJAX (Asynchronous JavaScript And XML) libraries
contributed to the widespread usage of this language in pages (Paulson, 2005). It is becoming

42
common to and pages where normal links are JavaScript programs activated through a
clicking on pictures or selecting options from a drop-down list (Thelwall,2002).

4.3.3 Duplicate hosts

Duplicate hosts (duphosts) are sites with different names that simultaneously serve the same
content. Technically, duphosts can be created through the replication of contents among
several machines, the usage of virtual hosts or the creation of DNS wildcards. There are
several situations that originate duphosts:

Mirroring. The same contents are published on several sites to backup data, reduce the load
on the original site or to be quickly accessible to some users;

Domain squatting. Domain squatters buy domain names desirable to specific businesses, to
make prot on their resale. The requests to these domains are redirected to a site that presents a
sale proposal. To protect against squatters, companies also register multiple domain names
related to their trade marks and point them to the company's site;

Temporary sites. Web designers buy domains for their customers and point them temporally
to the designer's site or to a default "under construction" page. When the customer's site is
deployed the domain starts referencing it.

SameHome. Both sites present equal home pages. The home page describes the content of a
site. So, if two sites have the same home page they probably present the same contents.
However, there are home pages that permanently change their content, for instance to include
advertisements, and two home pages in the data set may be different although the remaining
contents of the sites are equal.

SameHomeAnd1Doc. Both sites present equal home pages and at least one other equal
content. This approach follows the same intuition than the Same Home for the home pages

43
but tries to overcome the problem of transient duphosts composed by a single page;

Table 3: Results from the five approaches to detect duphosts.

DupsP. Both sites present a minimum percentage (P) of equal contents and have at least two
equal contents. Between the crawl of the duphosts to build the data set, some pages may
change, including the home page. This approach assumes that if the majority of the contents
are equal between two sites, they are duphosts. A minimum of two equal contents was
imposed to reduce the presence of sites under construction.

44
Chapter 5

Validation

Several versions of the system were successively released until its design could not be
significantly improved. The final version of the system was subject to several experiments to
validate its efficiency. The data used to validate Webhouse was obtained through controlled
and observational methods. The controlled methods were used to validate the Webhouse
components individually. Replicated experiments measured differences before and after using
a new component. Dynamic analysis experiments collected performance results in the
production environment of the tumba! search engine.

The execution of experiments based on simulations with artificial data was minimized,
because the web is hardly reproducible in a controlled environment and the obtained results
might not be representative of the reality. The data collected to validate Webhouse as a
complete system was mainly gathered using observational methods. Webhouse was validated
using a case study approach during the development of a search engine for the Portuguese
Web. Data was obtained to measure the effectiveness of each new version of the system. The
final version of Webhouse was used in several projects to collect feedback on its performance
(field study validation method).

Table 4: Hardware used to support crawling experiments.

45
5.1 Crawler evaluation

This section presents the results obtained from experiments performed while harvesting the
Portuguese Web with the VN crawler. These crawls ran in June and July, 2005, with the
purpose of evaluating the crawler's performance. The analysis of the results enabled the
detection of bottlenecks, mal-functions and helped on tuning the crawler's configuration
according to the characteristics of the harvested portion of the web.

5.2 Versus content management

Web Warehouses require storage systems able to address the specific characteristics of web
collections. One peculiar characteristic of these collections is the Existence of large amounts
of duplicates. The Versus Content Manager was designed to efficiently manage duplicates
through a manageable, lightweight and flexible architecture, so that it could be easily
integrated in existing systems.

This section presents the results gathered from four experiments ran on the Content Manager
against NFS. These replicate its application in several usage contexts. NFS was chosen as
baseline, because it is widely known and accessible, enabling the reproducibility of the
experiments.

5.3 Webhouse applications

This section describes the main applications of Webhouse, covering the features and
operation of the tumba! search engine. It also describes how Webhouse was used in several
other research experiments, discusses selection criteria to populate a national web archive and
describes the use of Webhouse in a web archive prototype.

46
Chapter 6

Conclusions

The web is a powerful source of information, but additional tools to help users in taking
advantage from its potential are required. One of the problems that these tools must address is
how to cope with a data source which was not designed to be automatically interpreted by
software applications. Web warehousing is an approach to tackle this problem. It consists on
extracting data from the web, storing it locally and then, providing uniform access methods
that facilitate its automatic processing and reuse by different applications. This approach is
conceptually similar to Data Warehousing approaches, used to integrate information from
relational databases. However, the peculiar characteristics of the web, such as its dynamics
and heterogeneity, raise new problems that must addressed to design an efficient Web
Warehouse (WWh).

In general, the characteristics of the data sources have a major influence on the design of the
information systems that process the data. A major challenge in the design of Web
Warehouses is that web data models are scarce and become quickly stale. Previous web
characterization studies showed that the web is composed of distinct portions with peculiar
characteristics. It is important to accurately define the boundaries of these portions and model
them, so that the design of a WWh can reject the characteristics of the data it will store. The
methodology used to sample the web influences the derived characterizations. Hence, the
samples used to model a portion of the web must be gathered using a methodology that
emulates the extraction stage of the web data integration process.

Which features should be considered in a model of the web?

The characterization of the sites, contents and link structure of a web portion is crucial to
design an efficient Web Warehouse to store it. Some features derived from the analysis of the
global web may not be representative of more restricted domains, such as national webs.
However, these portions can be of interest to large communities and characterizing a small
portion of the web is quite accessible and can be done with great accuracy. Some web

47
characteristics, such as web data persistence, require periodical samples of the web to be
modelled. These metrics should be included in web models because they enable the
identification of evolution tendencies that are determinant to the design of efficient Web
Warehouses, which keep incrementally built data collections.

How can the boundaries of a portion of the web be defined?

A set of se-lection criteria delimits the boundaries of a web portion and it is defined
according to the requirements of the application that will process the harvested data. The
selection criteria should be easily implemented as an automatic harvesting policy. Selection
criteria based on content classification and domain restrictions revealed to be suitable
options.

What can bias a web model?

The methodology used to gather web samples influences the obtained characterizations. In
the context of Web Ware-housing, crawling is an adequate sampling method because it is the
ost174commonly used in web data extraction. However, the configuration and technology
used in the crawler, and the existence of hazardous situations on the web influences the
derived models. Thus, the interpretation of statistics gathered from a web portion, such as a
national web, is beyond a mathematical analysis. It requires knowledge about web technology
and social reality of a national community.

How persistent is information on the web? Web data persistence cannot be modelled through
the analysis of a single snapshot of the web. Hence, models for the persistence of URLs and
contents of the Portuguese Web were derived from several snapshots gathered for three years.
The lifetime of URLs and contents follows a exponential distribution. Most URLs have short
lives and the death rate is higher in the first months, but there is a minority that persists for
long periods of time.

The obtained half-life of URLs was two months and the main causes of death were the
replacement of URLs and the deactivation of sites. Persistent URLs are mostly static, short
and tend to be linked from other sites. The lifetime of sites (half-life of 556 days) is
significantly larger than the lifetime of URLs. The obtained half-life for contents was just two
days.

48
The comparison of the obtained results with previous works suggests that the lifetime of
contents is decreasing. Persistent contents were not related to depth and were not particularly
distributed among sites. About half of the persistent URLs referenced the same content
during their lifetime.

How do web characteristics affect Web Warehouses design?

Web data models help on important design decisions in the initial phases of Web
Warehousing projects. Duplication of contents is prevalent on the web and it is difficult to
avoid the download of duplicates during a crawl because the duplicates are commonly
referenced by distinct and apparently unrelated URLs. A collection of contents built
incrementally presents an additional number of duplicates because many contents remain
unchanged over time and are repeatedly stored. Hence, eliminating duplicates at storage level
in a WWh is an appealing feature. However, the mechanisms adopted for the elimination of
duplicates must address URL transience that may prevent the implementation of algorithms
based on historical analysis.

49
Chapter 7

References:

Bibliography :

1) Baeza-Yates, R. & Poblete, B. (2006). Dynamics of the chilean web

structure. Comput. Networks, 50, 14641473.
2) Barroso, L.A., Dean, J. & Hölzle, U. (2003). Web search for a planet: The
Google cluster architecture. IEEE Micro, 23, 2228
3) Berners-Lee, T., Fielding, R. & Masinter, L. (2005). Uniform Resource
Identier (URI): Generic Syntax .
4) Bhowmick, S., Madria, S. & Ng, W.K. (2003). Web Data Management .
Springer-Verlag New York, Inc., Secaucus, NJ, USA.
5) Brewington, B.E. & Cybenko, G. (2000). How dynamic is the web?
Computer Networks, 33, 257276
Webliography :

1) http://www.jwz.org/doc/threading.html.
2) http://www.w3.org/XML/Query.
3) http://www-rocq.inria.fr/gemo/Gemo/Prokects/npp/.
4) http://www.xfra.net/qizxopen.