Google’s PageRank and Beyond:
The Science of Search Engine Rankings

Amy N. Langville and Carl D. Meyer

PRINCETON UNIVERSITY PRESS

PRINCETON AND OXFORD


Contents

Preface ix

Chapter 1. Introduction to Web Search Engines 1


1.1 A Short History of Information Retrieval 1
1.2 An Overview of Traditional Information Retrieval 5
1.3 Web Information Retrieval 9

Chapter 2. Crawling, Indexing, and Query Processing 15


2.1 Crawling 15
2.2 The Content Index 19
2.3 Query Processing 21

Chapter 3. Ranking Webpages by Popularity 25


3.1 The Scene in 1998 25
3.2 Two Theses 26
3.3 Query-Independence 30

Chapter 4. The Mathematics of Google’s PageRank 31


4.1 The Original Summation Formula for PageRank 32
4.2 Matrix Representation of the Summation Equations 33
4.3 Problems with the Iterative Process 34
4.4 A Little Markov Chain Theory 36
4.5 Early Adjustments to the Basic Model 36
4.6 Computation of the PageRank Vector 39
4.7 Theorem and Proof for Spectrum of the Google Matrix 45

Chapter 5. Parameters in the PageRank Model 47


5.1 The α Factor 47
5.2 The Hyperlink Matrix H 48
5.3 The Teleportation Matrix E 49

Chapter 6. The Sensitivity of PageRank 57


6.1 Sensitivity with respect to α 57
6.2 Sensitivity with respect to H 62


6.3 Sensitivity with respect to v^T 63
6.4 Other Analyses of Sensitivity 63
6.5 Sensitivity Theorems and Proofs 66

Chapter 7. The PageRank Problem as a Linear System 71


7.1 Properties of (I − αS) 71
7.2 Properties of (I − αH) 72
7.3 Proof of the PageRank Sparse Linear System 73

Chapter 8. Issues in Large-Scale Implementation of PageRank 75


8.1 Storage Issues 75
8.2 Convergence Criterion 79
8.3 Accuracy 79
8.4 Dangling Nodes 80
8.5 Back Button Modeling 84

Chapter 9. Accelerating the Computation of PageRank 89


9.1 An Adaptive Power Method 89
9.2 Extrapolation 90
9.3 Aggregation 94
9.4 Other Numerical Methods 97

Chapter 10. Updating the PageRank Vector 99


10.1 The Two Updating Problems and their History 100
10.2 Restarting the Power Method 101
10.3 Approximate Updating Using Approximate Aggregation 102
10.4 Exact Aggregation 104
10.5 Exact vs. Approximate Aggregation 105
10.6 Updating with Iterative Aggregation 107
10.7 Determining the Partition 109
10.8 Conclusions 111

Chapter 11. The HITS Method for Ranking Webpages 115


11.1 The HITS Algorithm 115
11.2 HITS Implementation 117
11.3 HITS Convergence 119
11.4 HITS Example 120
11.5 Strengths and Weaknesses of HITS 122
11.6 HITS’s Relationship to Bibliometrics 123
11.7 Query-Independent HITS 124
11.8 Accelerating HITS 126
11.9 HITS Sensitivity 126
Chapter 12. Other Link Methods for Ranking Webpages 131


12.1 SALSA 131
12.2 Hybrid Ranking Methods 135
12.3 Rankings based on Traffic Flow 136

Chapter 13. The Future of Web Information Retrieval 139


13.1 Spam 139
13.2 Personalization 142
13.3 Clustering 142
13.4 Intelligent Agents 143
13.5 Trends and Time-Sensitive Search 144
13.6 Privacy and Censorship 146
13.7 Library Classification Schemes 147
13.8 Data Fusion 148

Chapter 14. Resources for Web Information Retrieval 149


14.1 Resources for Getting Started 149
14.2 Resources for Serious Study 150

Chapter 15. The Mathematics Guide 153


15.1 Linear Algebra 153
15.2 Perron–Frobenius Theory 167
15.3 Markov Chains 175
15.4 Perron Complementation 186
15.5 Stochastic Complementation 192
15.6 Censoring 194
15.7 Aggregation 195
15.8 Disaggregation 198

Chapter 16. Glossary 201

Bibliography 207

Index 219
Preface

Purpose
As teachers of linear algebra, we wanted to write a book to help students and the general
public appreciate and understand one of the most exciting applications of linear algebra
today—the use of link analysis by web search engines. This topic is inherently interesting,
timely, and familiar. For instance, the book answers such curious questions as: How do
search engines work? Why is Google so good? What’s a Google bomb? How can I
improve the ranking of my homepage in Teoma?
We also wanted this book to be a single source for material on web search engine rank-
ings. A great deal has been written on this topic, but it’s currently spread across numerous
technical reports, preprints, conference proceedings, articles, and talks. Here we have
summarized, clarified, condensed, and categorized the state of the art in web ranking.

Our Audience
We wrote this book with two diverse audiences in mind: the general science reader
and the technical science reader. The title echoes the technical content of the book, but
in addition to being informative on a technical level, we have also tried to provide some
entertaining features and lighter material concerning search engines and how they work.
The Mathematics
Our goal in writing this book was to reach a challenging audience consisting of the
general scientific public as well as the technical scientific public. Of course, a complete
understanding of link analysis requires an acquaintance with many mathematical ideas.
Nevertheless, we have tried to make the majority of the book accessible to the general sci-
entific public. For instance, each chapter builds progressively in mathematical knowledge,
technicality, and prerequisites. As a result, Chapters 1-4, which introduce web search and
link analysis, are aimed at the general science reader. Chapters 6, 9, and 10 are particularly
mathematical. The last chapter, Chapter 15, “The Mathematics Guide,” is a condensed but
complete reference for every mathematical concept used in the earlier chapters. Through-
out the book, key mathematical concepts are highlighted in shaded boxes. By postponing
the mathematical definitions and formulas until Chapter 15 (rather than interspersing them
throughout the text), we were able to create a book that our mathematically sophisticated
readers will also enjoy. We feel this approach is a compromise that allows us to serve both
audiences: the general and technical scientific public.
Asides
An enjoyable feature of this book is the use of Asides. Asides contain entertaining news
stories, practical search tips, amusing quotes, and racy lawsuits. Every chapter, even the
particularly technical ones, contains several asides. Often a light aside provides the
perfect break after a stretch of serious mathematical thinking. Brief asides appear in shaded
boxes while longer asides that stretch across multiple pages are offset by horizontal bars
and italicized font. We hope you enjoy these breaks—we found ourselves looking forward
to writing them.

Computing and Code


Truly mastering a subject requires experimenting with the ideas. Consequently, we have
incorporated Matlab code to encourage and jump-start the experimentation process. While
any programming language is appropriate, we chose Matlab for three reasons: (1) its matrix
storage architecture and built-in commands are particularly suited to the large sparse link
analysis matrices of this text, (2) among colleges and universities, Matlab is a market leader
in mathematical software, and (3) it’s very user-friendly. The Matlab programs in this book
are intended to be instruction, not production, code. We hope that, by playing with these
programs, readers will be inspired to create new models and algorithms.

Acknowledgments
We thank Princeton University Press for supporting this book. We especially enjoyed
working with Vickie Kearn, the Senior Editor at PUP. Vickie, thank you for displaying just
the right combination of patience and gentle pressure. For a book with such timely mate-
rial, you showed amazing faith in us. We thank all those who reviewed our manuscripts
and made this a better book. Of course, we also thank our families and friends for their
encouragement. Your pride in us is a powerful driving force.

Dedication
We dedicate this book to mentors and mentees worldwide. The energy, inspiration, and
support that are sparked through such relationships can inspire great products. For us, it
produced this book, but more importantly, a wonderful synergistic friendship.
Chapter One
Introduction to Web Search Engines

1.1 A SHORT HISTORY OF INFORMATION RETRIEVAL


Today we have museums for everything—the museum of baseball, of baseball players, of
crazed fans of baseball players, museums for world wars, national battles, legal fights, and
family feuds. While there’s no shortage of museums, we have yet to find a museum ded-
icated to this book’s field, a museum of information retrieval and its history. Of course,
there are related museums, such as the Library Museum in Boras, Sweden, but none con-
centrating on information retrieval. Information retrieval1 is the process of searching
within a document collection for a particular information need (called a query). Although
dominated by recent events following the invention of the computer, information retrieval
actually has a long and glorious tradition. To honor that tradition, we propose the cre-
ation of a museum dedicated to its history. Like all museums, our museum of information
retrieval contains some very interesting artifacts. Join us for a brief tour.
The earliest document collections were recorded on the painted walls of caves. A
cave dweller interested in searching a collection of cave paintings to answer a particular
information query had to travel by foot, and stand, staring in front of each painting. Un-
fortunately, it’s hard to collect an artifact without being gruesome, so let’s fast forward a
bit.
Before the invention of paper, ancient Romans and Greeks recorded information on
papyrus rolls. Some papyrus artifacts from ancient Rome had tags attached to the rolls.
These tags were an ancient form of today’s Post-it Note, and make an excellent addition to
our museum. A tag contained a short summary of the rolled document, and was attached
in order to save readers from unnecessarily unraveling a long irrelevant document. These
abstracts also appeared in oral form. At the start of Greek plays in the fifth century B.C.,
the chorus recited an abstract of the ensuing action. While no actual classification scheme
has survived from the artifacts of Greek and Roman libraries, we do know that another
elementary information retrieval tool, the table of contents, first appeared in Greek scrolls
from the second century B.C. Books were not invented until centuries later, when necessity
required an alternative writing material. As the story goes, the Library of Pergamum (in
what is now Turkey) threatened to overtake the celebrated Library of Alexandria as the
best library in the world, claiming the largest collection of papyrus rolls. As a result, the
Egyptians ceased the supply of papyrus to Pergamum, so the Pergamenians invented an
alternative writing material, parchment, which is made from thin layers of animal skin. (In
fact, the root of the word parchment comes from the word Pergamum.) Unlike papyrus,

1 The boldface terms that appear throughout the book are also listed and defined in the Glossary, which begins
on page 201.

parchment did not roll easily, so scribes folded several sheets of parchment and sewed them
into books. These books outlasted scrolls and were easier to use. Parchment books soon
replaced the papyrus rolls.
The heights of writing, knowledge, and documentation of the Greek and Roman
periods were contrasted with their lack during the Dark and Middle Ages. Precious few
documents were produced during this time. Instead, most information was recorded orally.
Document collections were recorded in the memory of a village’s best storyteller. Oral
traditions carried in poems, songs, and prayers were passed from one generation to the
next. One of the most legendary and lengthy tales is Beowulf, an epic about the adventures
of a sixth-century Scandinavian warrior. The tale is believed to have originated in the
seventh century and been passed from generation to generation through song. Minstrels
often took poetic license, altering and adding verses as the centuries passed. An inquisitive
child wishing to hear stories about the monster Grendel waited patiently while the master
storyteller searched his memory to find just the right part of the story. Thus, the result of the
child’s search for information was biased by the wisdom and judgement of the intermediary
storyteller. Fortunately, the invention of paper, the best writing medium yet, superior to
even parchment, brought renewed acceleration to the written record of information and
collections of documents. In fact, Beowulf passed from oral to written form around A.D.
1000, a date over which scholars still debate. Later, monks, the possessors of treasured
reading and writing skills, sat in scriptoriums working as scribes from sunrise to sunset.
The scribes’ works were placed in medieval libraries, which initially were so small that
they had no need for classification systems. Eventually the collections grew, and it became
common practice to divide the holdings into three groups: theological works, classical
authors of antiquity, and contemporary authors on the seven arts. Lists of holdings and
tables of contents from classical books make nice museum artifacts from the medieval
period.
Other document collections sprang up in a variety of fields. This growth accelerated dramatically
with the re-invention of the printing press by Johann Gutenberg in 1450. The
wealthy proudly boasted of their private libraries, and public libraries were instituted in
America in the 1700s at the prompting of Benjamin Franklin. As library collections grew
and became publicly accessible, the desire for focused search became more acute. Hierar-
chical classification systems were used to group documents on like subjects together. The
first use of a hierarchical organization system is attributed to the Roman author Valerius
Maximus, who used it in A.D. 30 to organize the topics in his book, Factorum ac dictorum
memorabilium libri IX (Nine Books of Memorable Deeds and Sayings). Despite these
rudimentary organization systems, word of mouth and the advice of a librarian were the
best means of obtaining accurate quality information for a search. Of course, document
collections and their organization expanded beyond the limits of even the best librarian’s
memory. More orderly ways of maintaining records of a collection’s holdings were de-
vised. Notable artifacts that belong in our information retrieval museum are a few lists
of individual library holdings, sorted by title and also author, as well as examples of the
Dewey decimal system (1876), the card catalog (early 1900s), microfilm (1930s), and the
MARC (MAchine Readable Cataloging) system (1960s).
These inventions were progress, yet still search was not completely in the hands of
the information seeker. It took the invention of the digital computer (1940s and 1950s) and
the subsequent inventions of computerized search systems to move toward that goal. The
first computerized search systems used special syntax to automatically retrieve book and
article information related to a user’s query. Unfortunately, the cumbersome syntax kept
search largely in the domain of librarians trained on the systems. An early representative
of computerized search such as the Cornell SMART system (1960s) [146] deserves a place
in our museum of information retrieval.
In 1989 the storage, access, and searching of document collections was revolution-
ized by an invention named the World Wide Web by its founder Tim Berners-Lee [79]. Of
course, our museum must include artifacts from this revolution such as a webpage, some
HTML, and a hyperlink or two. The invention of linked document collections was truly
original at this time, despite the fact that Vannevar Bush, once Director of the Office of
Scientific Research and Development, foreshadowed its coming in his famous 1945 essay,
“As We May Think” [43]. In that essay, he describes the memex, a futuristic machine
(with shocking similarity to today’s PC and Web) that mirrors the cognitive processes of
humans by leaving “trails of association” throughout document collections. Four decades
of progress later, remnants of Bush’s memex formed the skeleton of Berners-Lee’s Web. A
drawing of the memex (Figure 1.1) by a graphic artist and approved by Bush was included
in LIFE magazine’s 1945 publishing of Bush’s prophetic article.

Figure 1.1 Drawing of Vannevar Bush’s memex appearing in LIFE. Original caption read: “Memex
in the form of a desk would instantly bring files and material on any subject to the operator’s
fingertips. Slanting translucent viewing screens magnify supermicrofilm filed by code numbers.
At left is a mechanism which automatically photographs longhand notes, pictures, and
letters, then files them in the desk for future reference.”

The World Wide Web became the ultimate signal of the dominance of the Informa-
tion Age and the death of the Industrial Age. Yet despite the revolution in information
storage and access ushered in by the Web, users initiating web searches found themselves
floundering. They were looking for the proverbial needle in an enormous, ever-growing
information haystack. In fact, users felt much like the men in Jorge Luis Borges’ 1941
short story [35], “The Library of Babel”, which describes an imaginary, infinite library.

When it was proclaimed that the Library contained all books, the first im-
pression was one of extravagant happiness. All men felt themselves to be
the masters of an intact and secret treasure. There was no personal or world
problem whose eloquent solution did not exist in some hexagon.
. . . As was natural, this inordinate hope was followed by an excessive depression.
The certitude that some shelf in some hexagon held precious books and
that these precious books were inaccessible seemed almost intolerable.

Much of the information in the Library of the Web, like that in the fictitious Library
of Babel, remained inaccessible. In fact, early web search engines did little to ease user
frustration; search could be conducted by sorting through hierarchies of topics on Yahoo, or
by sifting through the many (often thousands of) webpages returned by the search engine,
clicking on pages to personally determine which were most relevant to the query. Some
users resorted to the earliest search techniques used by ancient queriers—word of mouth
and expert advice. They learned about valuable websites from friends and linked to sites
recommended by colleagues who had already put in hours of search effort.
All this changed in 1998 when link analysis hit the information retrieval scene
[40, 106]. The most successful search engines began using link analysis, a technique that
exploited the additional information inherent in the hyperlink structure of the Web, to im-
prove the quality of search results. Web search improved dramatically, and web searchers
religiously used and promoted their favorite engines like Google and AltaVista. In fact, in
2004 many web surfers freely admit their obsession with, dependence on, and addiction
to today’s search engines. Below we include the comments [117] of a few Google fans
to convey the joy caused by the increased accessibility of the Library of the Web made
possible by the link analysis engines. Incidentally, in May 2004 Google held the largest
share of the search market with 37% of searchers using Google, followed by 27% using
the Yahoo conglomerate, which includes AltaVista, AlltheWeb, and Overture.2

• “It’s not my homepage, but it might as well be. I use it to ego-surf. I use
it to read the news. Anytime I want to find out anything, I use it.”—Matt
Groening, creator and executive producer, The Simpsons
• “I can’t imagine life without Google News. Thousands of sources from
around the world ensure anyone with an Internet connection can stay in-
formed. The diversity of viewpoints available is staggering.”—Michael
Powell, chair, Federal Communications Commission
• “Google is my rapid-response research assistant. On the run-up to a
deadline, I may use it to check the spelling of a foreign name, to acquire
an image of a particular piece of military hardware, to find the exact
quote of a public figure, check a stat, translate a phrase, or research the
background of a particular corporation. It’s the Swiss Army knife of
information retrieval.”—Garry Trudeau, cartoonist and creator, Doones-
bury

Nearly all major search engines now combine link analysis scores, similar to those
used by Google, with more traditional information retrieval scores. In this book, we record
the history of one aspect of web information retrieval. That aspect is the link analysis
or ranking algorithms underlying several of today’s most popular and successful search

2 These market share statistics were compiled by comScore, a company that counted the number of searches
done by U.S. surfers in May 2004 using the major search engines. See the article at
http://searchenginewatch.com/reports/article.php/2156431.

engines, including Google and Teoma. Incidentally, we’ll add the PageRank link analysis
algorithm [40] used by Google (see Chapters 4-10) and the HITS algorithm [106] used by
Teoma (see Chapter 11) to our museum of information retrieval.

1.2 AN OVERVIEW OF TRADITIONAL INFORMATION RETRIEVAL


To set the stage for the exciting developments in link analysis to come in later chapters, we
begin our story by distinguishing web information retrieval from traditional information
retrieval. Web information retrieval is search within the world’s largest linked
document collection, whereas traditional information retrieval is search within smaller,
more controlled, nonlinked collections. The traditional nonlinked collections existed be-
fore the birth of the Web and still exist today. Searching within a university library’s col-
lection of books or within a professor’s reserve of slides for an art history course—these
are examples of traditional information retrieval.
These document collections are nonlinked, mostly static, and are organized and cate-
gorized by specialists such as librarians and journal editors. These documents are stored in
physical form as books, journals, and artwork as well as electronically on microfiche, CDs,
and webpages. However, the mechanisms for searching for items in the collections are
now almost all computerized. These computerized mechanisms are referred to as search
engines, virtual machines created by software that enables them to sort through virtual
file folders to find relevant documents. There are three basic computer-aided techniques
for searching traditional information retrieval collections: Boolean models, vector space
models, and probabilistic models [14]. These search models, which were developed in
the 1960s, have had decades to grow, mesh, and morph into new search models. In fact,
as of June 2000, there were at least 3,500 different search engines (including the newer
web engines) [37], which means that there are possibly 3,500 different search techniques.
Nevertheless, since most search engines rely on one or more of the three basic models, we
describe these in turn.

1.2.1 Boolean Search Engines

The Boolean model of information retrieval, one of the earliest and simplest retrieval meth-
ods, uses the notion of exact matching to match documents to a user query. Its more refined
descendants are still used by most libraries. The adjective Boolean refers to the use of
Boolean algebra, whereby words are logically combined with the Boolean operators AND,
OR, and NOT. For example, the Boolean AND of two logical statements x and y means that
both x AND y must be satisfied, while the Boolean OR of these two statements means that
at least one of these statements must be satisfied. Any number of logical statements can be
combined using the three Boolean operators. The Boolean model of information retrieval
operates by considering which keywords are present or absent in a document. Thus, a doc-
ument is judged as relevant or irrelevant; there is no concept of a partial match between
documents and queries. This can lead to poor performance [14]. More advanced fuzzy set
theoretic techniques try to remedy this black-white Boolean logic by introducing shades of
gray. For example, a title search for car AND maintenance on a Boolean engine causes
the virtual machine to return all documents that use both words in the title. A relevant doc-
ument entitled “Automobile Maintenance” will not be returned. Fuzzy Boolean engines
use fuzzy logic to categorize this document as somewhat relevant and return it to the user.
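
The exact-match behavior described above can be illustrated with a minimal Python sketch. It assumes a tiny, made-up collection of three titles and a simple word-to-documents index; the function names (build_index, boolean_and, boolean_or, boolean_not) are hypothetical and chosen only for this illustration, not taken from any particular engine.

```python
from collections import defaultdict

def build_index(titles):
    """Map each lowercase word to the set of document ids whose title contains it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for word in title.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents whose titles contain every term (exact keyword match only)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, *terms):
    """Documents whose titles contain at least one of the terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

def boolean_not(index, term, n_docs):
    """Documents whose titles do not contain the term."""
    return set(range(n_docs)) - index.get(term.lower(), set())

# Hypothetical three-title collection used only for this example.
titles = [
    "Car Maintenance Basics",
    "Automobile Maintenance",
    "Car Racing History",
]
index = build_index(titles)

# Only document 0 uses both words; "Automobile Maintenance" is missed.
print(sorted(boolean_and(index, "car", "maintenance")))  # [0]
```

A document either satisfies the query or it does not, so the relevant title in position 1 never appears, which is the black-and-white behavior that fuzzy variants try to soften.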
The car maintenance query example introduces the main drawbacks of Boolean
search engines; they fall prey to two of the most common information retrieval problems,
synonymy and polysemy. Synonymy refers to multiple words having the same meaning,
such as car and automobile. A standard Boolean engine cannot return semantically related
documents whose keywords were not included in the original query. Polysemy refers to
words with multiple meanings. For example, when a user types bank as their query, does
he or she mean a financial center, a slope on a hill, a shot in pool, or a collection of objects
[24]? The problem of polysemy can cause many documents that are irrelevant to the user’s
actual intended query meaning to be retrieved. Many Boolean search engines also require
that the user be familiar with Boolean operators and the engine’s specialized syntax. For
example, to find information about the phrase iron curtain, many engines require quo-
tation marks around the phrase, which tell the search engine that the entire phrase should
be searched as if it were just one keyword. A user who forgets this syntax requirement
would be surprised to find retrieved documents about interior decorating and mining for
iron ore.
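
The iron curtain example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents are invented, an unquoted query is assumed to match any of its keywords, and the helper names (tokenize, any_keyword, exact_phrase) are hypothetical. Real engines differ in how they treat unquoted queries and punctuation.

```python
def tokenize(text):
    """Lowercase and split on whitespace (punctuation handling is omitted)."""
    return text.lower().split()

def any_keyword(docs, query):
    """Unquoted query: return ids of documents containing any of the keywords."""
    words = set(tokenize(query))
    return [i for i, d in enumerate(docs) if words & set(tokenize(d))]

def exact_phrase(docs, phrase):
    """Quoted query: the words must appear adjacent and in the same order."""
    target = tokenize(phrase)
    n = len(target)
    hits = []
    for i, d in enumerate(docs):
        toks = tokenize(d)
        if any(toks[j:j + n] == target for j in range(len(toks) - n + 1)):
            hits.append(i)
    return hits

# Hypothetical documents used only for this example.
docs = [
    "the iron curtain divided europe",
    "choosing a curtain for the living room",
    "mining for iron ore",
]

print(any_keyword(docs, "iron curtain"))   # [0, 1, 2]
print(exact_phrase(docs, "iron curtain"))  # [0]
```

Treating the phrase as a single unit removes the documents about curtains and iron ore, which is what the quotation marks ask the engine to do.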
Nevertheless, variants of the Boolean model do form the basis for many search en-
gines. There are several reasons for their prevalence. First, creating and programming a
Boolean engine is straightforward. Second, queries can be processed quickly; a quick scan
through the keyword files for the documents can be executed in parallel. Third, Boolean
models scale well to very large document collections. Accommodating a growing collec-
tion is easy. The programming remains simple; merely the storage and parallel processing
capabilities need to grow. References [14, 75, 107] all contain chapters with excellent
introductions to the Boolean model and its extensions.

1.2.2 Vector Space Model Search Engines

Another information retrieval technique uses the vector space model [147], developed by
Gerard Salton in the early 1960s, to sidestep some of the information retrieval problems
mentioned above. Vector space models transform textual data into numeric vectors and ma-
trices, then employ matrix analysis3 techniques to discover key features and connections
in the document collection. Some advanced vector space models address the common text
analysis problems of synonymy and polysemy. Advanced vector space models, such as LSI
[64] (Latent Semantic Indexing), can access the hidden semantic structure in a document
collection. For example, an LSI engine processing the query car will return documents
whose keywords are related semantically (in meaning), e.g., automobile. This ability to
reveal hidden semantic meanings makes vector space models, such as LSI, very powerful
information retrieval tools.
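
To make the idea of turning text into numeric vectors concrete, here is a minimal Python sketch of the basic (not the LSI) vector space model. It assumes raw term counts as vector entries and cosine similarity as the comparison measure; both are common choices but are assumptions here, and the documents and function names (term_vector, cosine, rank) are hypothetical.

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Represent text as a list of term counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(docs, query):
    """Return (score, doc_id) pairs sorted from most to least similar to the query."""
    vocabulary = sorted({w for d in docs for w in d.lower().split()})
    q = term_vector(query, vocabulary)  # query words outside the vocabulary are ignored
    scored = [(cosine(term_vector(d, vocabulary), q), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))

# Hypothetical documents used only for this example.
docs = [
    "car maintenance and car repair",
    "automobile maintenance",
    "history of the bank",
]

for score, doc_id in rank(docs, "car maintenance"):
    print(f"{score:.2f}  {docs[doc_id]}")
# 0.80  car maintenance and car repair
# 0.50  automobile maintenance
# 0.00  history of the bank
```

Because the counts are nonnegative, every score falls between 0 and 1. The second document earns its partial score only through the shared word maintenance; recognizing that automobile and car are related would require a more advanced model such as LSI.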
Two additional advantages of the vector space model are relevance scoring and rel-
evance feedback. The vector space model allows documents to partially match a query by
assigning each document a number between 0 and 1, which can be interpreted as the like-
lihood of relevance to the query. The group of retrieved documents can then be sorted by
degree of relevancy, a luxury not possible with the simple Boolean model. Thus, vec-
tor space models return documents in an ordered list, sorted according to a relevance
score. The first document returned is judged to be most relevant to the user’s query.

3 Mathematical terms are defined in Chapter 15, the Mathematics Chapter, and are italicized throughout.

Some vector space search engines report the relevance score as a relevancy percentage.
For example, a 97% next to a document means that the document is judged as 97% rele-
vant to the user’s query. (See the Federal Communications Commission’s search engine,
http://www.fcc.gov/searchtools.html, which is powered by Inktomi, once known
to use the vector space model. Enter a query such as taxes and notice the relevancy score
reported on the right side.) Relevance feedback, the other advantage of the vector space
model, is an information retrieval tuning technique that is a natural addition to the vec-
tor space model. Relevance feedback allows the user to select a subset of the retrieved
documents that are useful. The query is then resubmitted with this additional relevance
feedback information, and a revised set of generally more useful documents is retrieved.
A drawback of the vector space model is its computational expense. At query time,
distance measures (also known as similarity measures) must be computed between each
document and the query. And advanced models, such as LSI, require an expensive singu-
lar value decomposition [82, 127] of a large matrix that numerically represents the entire
document collection. As the collection grows, the expense of this matrix decomposition
becomes prohibitive. This computational expense also exposes another drawback—vector
space models do not scale well. Their success is limited to small document collections.
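
The relevance scoring described above can be sketched in a few lines of Python. This is a minimal illustration, assuming raw term counts as vector entries and the cosine of the angle between vectors as the similarity measure; the term list, the document vectors, and the function names are hypothetical. With nonnegative entries the cosine lies between 0 and 1.

```python
import math

# Vector entries follow this term order (illustrative vocabulary).
TERMS = ["car", "engine", "insurance"]

# Hypothetical term-count vectors, one per document.
DOC_VECTORS = {
    "d1": [2, 1, 0],
    "d2": [0, 1, 0],
    "d3": [1, 0, 3],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_vector, doc_vectors=DOC_VECTORS):
    """Score every document against the query and sort by descending score."""
    scores = {d: cosine(query_vector, v) for d, v in doc_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Query containing only the term "car".
for doc_id, score in rank([1, 0, 0]):
    print(doc_id, round(score, 3))
```

Note that `rank` touches every document vector for every query, which is the query-time cost the text refers to.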

Understanding Search Engines

The informative little book by Michael Berry and Murray Browne, Understanding
Search Engines: Mathematical Modeling and Text Retrieval [23], provides an
excellent explanation of vector space models, especially LSI, and contains several
examples and sample code. Our mathematical readers will enjoy this book and its
application of linear algebra algorithms in the context of traditional information
retrieval.

1.2.3 Probabilistic Model Search Engines

Probabilistic models attempt to estimate the probability that the user will find a particular
document relevant. Retrieved documents are ranked by their odds of relevance (the ratio
of the probability that the document is relevant to the query divided by the probability that
the document is not relevant to the query). The probabilistic model operates recursively:
the underlying algorithm must guess at initial parameters and then iteratively try
to improve this initial guess to obtain a final ranking of relevancy probabilities.
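
The odds-of-relevance ranking can be sketched as follows. This is only an illustration of the final ranking step, assuming that the probability of non-relevance is the complement of the probability of relevance and that the probability estimates are already available; the estimates and function names are hypothetical, and the iterative parameter estimation is not shown.

```python
def odds_of_relevance(p_relevant):
    """Odds = P(relevant) / P(not relevant), assuming P(not relevant) = 1 - P(relevant)."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("probability must lie in [0, 1)")
    return p_relevant / (1.0 - p_relevant)

# Hypothetical relevance probabilities for one query.
ESTIMATES = {"d1": 0.8, "d2": 0.5, "d3": 0.2}

def rank_by_odds(estimates):
    """Return document ids sorted by descending odds of relevance."""
    return sorted(estimates, key=lambda d: odds_of_relevance(estimates[d]), reverse=True)

print(rank_by_odds(ESTIMATES))
```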
Unfortunately, probabilistic models can be very hard to build and program. Their
complexity grows quickly, deterring many researchers and limiting their scalability. Prob-
abilistic models also require several unrealistic simplifying assumptions, such as indepen-
dence between terms as well as documents. Of course, the independence assumption is
restrictive in most cases. For instance, in this document the most likely word to follow in-
formation is the word retrieval, but the independence assumption judges each word
as equally likely to follow the word information. On the other hand, the probabilistic
framework can naturally accommodate a priori preferences, and thus, these models do of-
fer promise of tailoring search results to the preferences of individual users. For example, a
user’s query history can be incorporated into the probabilistic model’s initial guess, which
generates better query results than a democratic guess.

1.2.4 Meta-search Engines

There’s actually a fourth model for traditional search engines, meta-search engines, which
combines the three classic models. Meta-search engines are based on the principle that
while one search engine is good, two (or more) are better. One search engine may be great
at a certain task, while a second search engine is better at another task. Thus, meta-search
engines such as Copernic (www.copernic.com) and SurfWax (www.surfwax.com) were
created to simultaneously exploit the best features of many individual search engines.
Meta-search engines send the query to several search engines at once and return the re-
sults from all of the search engines in one long unified list. Some meta-search engines
also include subject-specific search engines, which can be helpful when searching within
one particular discipline. For example, Monster (www.monster.com) is an employment
search engine.
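
One simple way to build the unified list is sketched below in Python. The merging rule (round-robin interleaving with duplicates dropped) is an assumption made for illustration; the text does not say how any particular meta-search engine combines its results, and the function name is hypothetical.

```python
from itertools import zip_longest

def merge_results(*result_lists):
    """Interleave several ranked result lists, keeping the first copy of each URL."""
    seen = set()
    merged = []
    # zip_longest yields the i-th result of every engine, padding short lists with None.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

engine_a = ["a", "b", "c"]
engine_b = ["b", "d"]
print(merge_results(engine_a, engine_b))
```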

1.2.5 Comparing Search Engines

Annual information retrieval conferences, such as TREC [3], SIGIR, CIR [22] (for tradi-
tional information retrieval), and WWW [4] (for web information retrieval), are used to
compare the various information retrieval models underlying search engines and help the
field progress toward better, more efficient search engines. The two most common rat-
ings used to differentiate the various search techniques are precision and recall. Precision
is the ratio of the number of relevant documents retrieved to the total number of docu-
ments retrieved. Recall is the ratio of the number of relevant documents retrieved to the
total number of relevant documents in the collection. The higher the precision and recall,
the better the search engine is. Of course, search engines are tested on document collec-
tions with known parameters. For example, the commonly used test collection Medlars
[6], containing 5,831 keywords and 1,033 documents, has been examined so often that
its properties are well known. For instance, there are exactly 24 documents relevant to
the phrase neoplasm immunology. Thus, the denominator of the recall ratio for a user
query on neoplasm immunology is 24. If only 10 of these relevant documents were retrieved by a search
engine for this query, then a recall of 10/24 ≈ .417 is reported. Recall and precision
are information retrieval-specific performance measures, but, of course, when evaluating
any computer system, time and space are always performance issues. All else held con-
stant, quick, memory-efficient search engines are preferred to slower, memory-inefficient
engines. A search engine with fabulous recall and precision is useless if it requires 30
minutes to perform one query or stores the data on 75 supercomputers. Some other perfor-
mance measures take a user-centered viewpoint and are aimed at assessing user satisfaction
and frustration with the information system. A book by Robert Korfhage, Information Stor-
age and Retrieval [107], discusses these and several other measures for comparing search
engines. Excellent texts for information retrieval are [14, 75, 163].
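
The two ratios can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below reuses the 24 relevant documents of the Medlars example; the document ids and the 5 irrelevant retrieved documents are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 24 relevant documents; the engine returns 10 of them plus 5 irrelevant ones.
relevant = set(range(24))
retrieved = set(range(10)) | {100, 101, 102, 103, 104}

p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```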

1.3 WEB INFORMATION RETRIEVAL


1.3.1 The Challenges of Web Search

Tim Berners-Lee and his World Wide Web entered the information retrieval world in 1989
[79]. This event caused a branch that focused specifically on search within this new docu-
ment collection to break away from traditional information retrieval. This branch is called
web information retrieval. Many web search engines are built on the techniques of tradi-
tional search engines, but they differ in many important ways. We list the properties that
make the Web such a unique document collection. The Web is:

• huge,
• dynamic,
• self-organized, and
• hyperlinked.

The Web is indeed huge! In fact, it’s so big that it’s hard to get an accurate count of
its size. By January 2004, it was estimated that the Web contained over 10 billion pages,
with an average page size of 500KB [5]. With a world population of about 6.4 billion,
that’s almost 2 pages for each inhabitant. The early exponential growth of the Web has
slowed recently, but it is still the largest document collection in existence. The Berkeley
information project, “How Much Information,” estimates that the amount of information
on the Web is about 20 times the size of the entire Library of Congress print collection [5].
Bigger still, a company called BrightPlanet sells access to the so-called Deep Web, which
they estimate to contain over 92,000TB of data spread over 550 billion pages [1]. Bright-
Planet defines the Deep Web as the hundreds of thousands of publicly accessible databases
that create a collection over 500 times larger than the Surface Web. Deep webpages cannot
be found by casual, routine surfing. Surfers must request information from a particular
database, at which point, the relevant pages are served to the user dynamically within a
matter of seconds. As a result, search engines cannot easily find these dynamic pages since
they do not exist before or after the query. However, Yahoo appears to be the first search
engine aiming to index parts of the Deep Web.
The Web is dynamic! Contrast this with traditional document collections which
can be considered static in two senses. First, once a document is added to a traditional
collection, it does not change. The books sitting on a bookshelf are well behaved. They
don’t change their content by themselves, but webpages do, very frequently. A study by
Junghoo Cho and Hector Garcia-Molina [52] in 2000 reported that 40% of all webpages in
their dataset changed within a week, and 23% of the .com pages changed daily. In a much
more extensive and recent study, the results of Fetterly et al. [74] concur. About 35% of
all webpages changed over the course of their study, and also pages that were larger in size
changed more often and more extensively than their smaller counterparts. Second, for the
most part, the size of a traditional document collection is relatively static. It is true that
abstracts are added to MEDLINE each year, but how many? Hundreds, maybe thousands.
These are minuscule additions by Web proportions. Billions of pages are added to the Web
each year. The dynamics of the Web make it tough to compute relevancy scores for queries
when the collection is a moving, evolving target.

The Web is self-organized! Traditional document collections are usually collected
and categorized by trained (and often highly paid) specialists. However, on the Web, anyone
can post a webpage and link away at will. There are no standards and no gatekeepers
policing content, structure, and format. The data are volatile; there are rapid updates, bro-
ken links, and file disappearances. One 2002 U.S. study reporting on “link rot” suggested
that up to 50% of URLs cited in articles in two information technology journals were inaccessible
within four years [1]. The data are heterogeneous, existing in multiple formats,
languages, and alphabets. And often these volatile, heterogeneous data are posted multiple
times. In addition, there is no editorial review process, which means errors, falsehoods,
and invalid statements abound. Further, this self-organization opens the door for sneaky
spammers who capitalize on the mercantile potential offered by the Web. Spammers was
the name originally given to those who send mass advertising emails. With one click of
the send button, spammers can send their advertising message to thousands of potential
customers in a matter of seconds. With web search and online retailing, this name was
broadened to include those using deceptive webpage creation techniques to rank highly in
web search listings for particular queries. Spammers resorted to using minuscule text font,
hidden text (white on a white background), and misleading metatag descriptions to fool
early web search engines (like those using the Boolean technique of traditional informa-
tion retrieval). The self-organization of the Web also means that webpages are created for
a variety of different purposes. Some pages are aimed at surfers who are shopping, others
at surfers who are researching. In fact, search engines must be able to answer many types
of queries, such as transactional queries, navigational queries, and informational queries.
All these features of the Web combine to make the job for web search engines Herculean.
Ah, but the Web is hyperlinked! This linking feature, the foundation of Vannevar
Bush’s memex, is the saving grace for web search engines. Hyperlinks make the new
national pastime of surfing possible. But much more importantly, they make focused, ef-
fective searching a reality. This book is about ways that web search engines exploit the
additional information available in the Web’s sprawling link structure to improve the qual-
ity of their search results. Consequently, we focus on just one aspect of the web information
retrieval process, but one we believe is the most exciting and important. However, the ad-
vantages resulting from the link structure of the Web did not come without negative side
effects. The most interesting side effects concern those sneaky spammers. Spammers soon
caught wind of the link analysis employed by major search engines, and immediately set
to work on link spamming. Link spammers carefully craft hyperlinking strategies in the
hope of increasing traffic to their pages. This has created an entertaining game of cat and
mouse between the search engines and the spammers, which many, the authors included,
enjoy spectating. See the asides on pages 43 and 52.
An additional information retrieval challenge for any document collection, but espe-
cially pertinent to the Web, concerns precision. Although the amount of accessible infor-
mation continues to grow, a user’s ability to look at documents does not. Users rarely look
beyond the first 10 or 20 documents retrieved [94]. This user impatience means that search
engine precision must increase just as rapidly as the number of documents is increasing.
Another dilemma unique to web search engines concerns their performance measurements
and comparison. While traditional search engines are compared by running tests on famil-
iar, well studied, controlled collections, this is not realistic for web engines. Even small
web collections are too large for researchers to catalog, count, and create estimates of the
precision and recall numerators and denominators for dozens of queries. Comparing two
search engines is usually done with user satisfaction studies and market share measures in
addition to the baseline comparison measures of speed and storage requirements.

1.3.2 Elements of the Web Search Process

This last section of the introductory chapter describes the basic elements of the web in-
formation retrieval process. Their relationship to one another is shown in Figure 1.2. Our
purpose in describing the many elements of the search process is twofold: first, it helps
emphasize the focus of this book, which is the ranking part of the search process, and sec-
ond, it shows how the ranking process fits into the grand scheme of search. Chapters 3-12
are devoted to the shaded parts of Figure 1.2, while all other parts are discussed briefly in
Chapter 2.

Figure 1.2 Elements of a search engine
[Diagram components: WWW, Crawler Module, Page Repository, Indexing Module, Indexes (Content Index, Structure Index, Special-purpose indexes), Query Module, Ranking Module, User; labels: Queries, Results, query-independent.]

• Crawler Module. The Web’s self-organization means that, in contrast to traditional
document collections, there is no central collection and categorization organization.
Traditional document collections live in physical warehouses, such as the college’s
library or the local art museum, where they are categorized and filed. On the other
hand, the web document collection lives in a cyber warehouse, a virtual entity that
is not limited by geographical constraints and can grow without limit. However,
this geographic freedom brings one unfortunate side effect. Search engines must
do the data collection and categorization tasks on their own. As a result, all web
search engines have a crawler module. This module contains the software that col-
lects and categorizes the web’s documents. The crawling software creates virtual
robots, called spiders, that constantly scour the Web gathering new information and
webpages and returning to store them in a central repository.
• Page Repository. The spiders return with new webpages, which are temporarily
stored as full, complete webpages in the page repository. The new pages remain in
the repository until they are sent to the indexing module, where their vital informa-
tion is stripped to create a compressed version of the page. Popular pages that are
repeatedly used to serve queries are stored here longer, perhaps indefinitely.
• Indexing Module. The indexing module takes each new uncompressed page and
extracts only the vital descriptors, creating a compressed description of the page that
is stored in various indexes. The indexing module is like a black box function that
takes the uncompressed page as input and outputs a “Cliffnotes” version of the page.
The uncompressed page is then tossed out or, if deemed popular, returned to the page
repository.
• Indexes. The indexes hold the valuable compressed information for each webpage.
This book describes three types of indexes. The first is called the content index.
Here the content, such as keyword, title, and anchor text for each webpage, is stored
in a compressed form using an inverted file structure. Chapter 2 describes the in-
verted file in detail. Further valuable information regarding the hyperlink structure
of pages in the search engine’s index is gleaned during the indexing phase. This
link information is stored in compressed form in the structure index. The crawler
module sometimes accesses the structure index to find uncrawled pages. Special-
purpose indexes are the final type of index. For example, indexes such as the image
index and pdf index hold information that is useful for particular query tasks.
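
To make the content index concrete, here is a minimal Python sketch of an inverted file: a map from each term to the list of pages that use it. The toy pages, the whitespace tokenization, and the function name are illustrative assumptions; a real content index would also record details such as title and anchor text and would be stored in compressed form.

```python
from collections import defaultdict

# Hypothetical page ids and page texts.
PAGES = {
    1: "google pagerank ranks webpages",
    2: "webpages link to other webpages",
    3: "pagerank uses link structure",
}

def build_inverted_index(pages):
    """Map each term to the sorted list of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

INDEX = build_inverted_index(PAGES)
print(INDEX["pagerank"])
print(INDEX["webpages"])
```

Looking up a query term is then a single dictionary access rather than a scan over every page.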

The four modules above (crawler, page repository, indexing module, indexes) and their
corresponding data files exist and operate independently of users and their queries. Spiders
are constantly crawling the Web, bringing back new and updated pages to be indexed and
stored. In Figure 1.2 these modules are circled and labeled as query-independent. Unlike
the preceding modules, the query module is query-dependent and is initiated when a user
enters a query, to which the search engine must respond in real-time.

• Query Module. The query module converts a user’s natural language query into
a language that the search system can understand (usually numbers), and consults
the various indexes in order to answer the query. For example, the query module
consults the content index and its inverted file to find which pages use the query
terms. These pages are called the relevant pages. Then the query module passes the
set of relevant pages to the ranking module.
• Ranking Module. The ranking module takes the set of relevant pages and ranks
them according to some criterion. The outcome is an ordered list of webpages such
that the pages near the top of the list are most likely to be what the user desires.
The ranking module is perhaps the most important component of the search pro-
cess because the output of the query module often results in too many (thousands
of) relevant pages that the user must sort through. The ordered list filters the less
relevant pages to the bottom, making the list of pages more manageable for the user.
(In contrast, the similarity measures of traditional information retrieval often do not
filter out enough irrelevant pages.) Actually, this ranking, which carries valuable
discriminatory power, is arrived at by combining two scores, the content score and
the popularity score. Many rules are used to give each relevant page a relevancy
or content score. For example, many web engines give pages using the query word
in the title or description a higher content score than pages using the query word in
the body of the page [39]. The popularity score, which is the focus of this book,
is determined from an analysis of the Web’s hyperlink structure. The content score
is combined with the popularity score to determine an overall score for each rele-
vant page [30]. The set of relevant pages resulting from the query module is then
presented to the user in order of their overall scores.
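
The hand-off from the query module to the ranking module can be sketched as follows. Everything specific here is an assumption made for illustration: the tiny inverted file, the score values, the rule that a relevant page must use every query term, and the weighted sum controlled by the hypothetical parameter `alpha`. The text does not specify how real engines combine the content and popularity scores.

```python
# Hypothetical inverted file and precomputed scores for three pages.
INVERTED = {"pagerank": {1, 3}, "link": {2, 3}}
CONTENT_SCORE = {1: 0.9, 2: 0.4, 3: 0.6}
POPULARITY_SCORE = {1: 0.2, 2: 0.7, 3: 0.8}

def relevant_pages(query, index=INVERTED):
    """Query module: pages that use every query term (an assumed matching rule)."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

def rank_pages(pages, alpha=0.5):
    """Ranking module: sort pages by an assumed weighted sum of the two scores."""
    overall = {
        p: alpha * CONTENT_SCORE[p] + (1 - alpha) * POPULARITY_SCORE[p]
        for p in pages
    }
    return sorted(overall, key=overall.get, reverse=True)

print(rank_pages(relevant_pages("pagerank")))
```

In this toy data, page 1 has the higher content score for the query, yet page 3 is listed first because its popularity score is much higher.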

Chapter 2 gives an introduction to all components of the web search process, ex-
cept the ranking component. The ranking component, specifically the popularity score,
is the subject of this book. Chapters 3 through 12 provide a comprehensive treatment of
the ranking problem and its suggested solutions. Each chapter progresses in depth and
mathematical content.
Chapter Two
Crawling, Indexing, and Query Processing

Spiders are the building blocks of search engines. Decisions about the design of the crawler
and the capabilities of its spiders affect the design of the other modules, such as the index-
ing and query processing modules.
So in this chapter, we begin our description of the basic components of a web search
engine with the crawler and its spiders. We purposely exclude one component, the ranking
component, since it is the focus of this book and is covered in the remaining chapters.
The goals and challenges of web crawlers are introduced in section 2.1, and a simple
program for crawling the Web is provided. Indexing a collection of documents as enormous
as the Web creates special storage challenges (section 2.2), and also has search engines
constantly increasing the size of their indexes (see the aside on page 20). The size of
the Web makes the real-time processing of queries an astounding feat, and section 2.3
describes the structures and mechanisms that make this possible.

2.1 CRAWLING
The crawler module contains a short software program that instructs robots or spiders on
how and which pages to retrieve. The crawling module gives a spider a root set of URLs
to visit, instructing it to start there and follow links on those pages to find new pages.
Every crawling program must address several issues. For example, which pages should the
spiders crawl? Some search engines focus on specialized search, and as a result, conduct
specialized crawls, through only .gov pages, or pages with images, or blog files, etc. For
instance, Bernhard Seefeld’s search engine, search.ch, crawls only Swiss webpages and
stops at the geographical borders of Switzerland. Even the most comprehensive search
engine indexes only a small portion of the entire Web. Thus, crawlers must carefully select
which pages to visit.
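
The root-set idea can be sketched without touching the network. The fragment below is a minimal illustration, assuming a breadth-first visiting order and a toy in-memory link graph in place of real page fetches; the URLs, the `fetch_links` callback, and the `max_pages` limit are all hypothetical. Page selection rules, revisit scheduling, and polite use of a host's resources are deliberately left out.

```python
from collections import deque

# Toy link graph standing in for the Web: page -> pages it links to.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(root_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first from the root set, following links to new pages."""
    frontier = deque(root_urls)   # pages discovered but not yet visited
    seen = set(root_urls)         # pages already queued, to avoid repeats
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["a.example/"], lambda url: TOY_WEB.get(url, [])))
```

The `max_pages` cap reflects the point made above: a crawler can only ever visit a portion of the Web, so it needs some rule for when to stop.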
How often should pages be crawled? Since the Web is dynamic, last month’s crawled
page may contain different content this month. Therefore, crawling is a never-ending pro-
cess. Spiders return exhausted, carrying several new and many updated pages, only to be
immediately given another root URL and told to start over. Theirs is an endless task like
Sisyphus’s uphill ball-rolling. However, some pages change more often than others, so a
crawler must decide which pages to revisit and how often. Some engines make this deci-
sion democratically, while others refresh pages in proportion to their perceived freshness
or importance levels. In fact, some researchers have proposed a crawling strategy that uses
the PageRank measure of Chapters 3 and 4 to decide which pages to update [31].
How should pages be crawled ethically? When a spider visits a webpage, it con-
sumes resources, such as bandwidth and hits quotas, belonging to the page’s host and the
Random documents with unrelated
content Scribd suggests to you:
Expectat dampnum qui facit inde forum:
Testis erit Iudas quid erit sibi fine doloris;
Dum crepuit medius, culpa subibat onus.
250 Penituit culpam, que semel nisi fecerat illam,
Quod tulit et lucrum reddidit ipse statim;776
Set nec eo veniam meruit nec habere salutem,777
Iam valet exemplum tale mouere virum.
Vendidit ipse semel iustum, nos cotidianum
Ob lucri precium vendimus omne malum:
Ille restaurauit, set nos restringimus aurum;
Penituit, set nos absque pauore sumus.
Sic et auaricia tanta feritate perurget
Corda viri, quod ab hoc vix homo liber abit:
260 Cessat iusticia, cessatque fides sociata,
Fraus dolus atque suum iam subiere locum:
Plebs sine iure manet, non est qui iura tuetur,
Non est qui dicat, iura tenere decet:
Omnibus in causis, vbi gentes commoda querunt,
Nunc modus est que fides non habuisse fidem.778
Vox leuis illa Iacob, Esau manus hispida nuper,
Que foret ista dies, signa futura dabant:
Alterius casu stat supplantator, et eius
Qui fuerat socius fraude subintrat opes:
270 Ex dampno fratris frater sua commoda querit;
Vnus si presit, inuidet alter ei:
Filius ante diem patruos iam spectat in annos,
Nec videt ex oculis ceca cupido suis:
Nunc amor est solus, nec sentit habere secundum,
Stans odioque tibi diligit ipse tua.
Quid modo, cumque manus mentitur dextra
sinistre,
Dicam? set caueat qui sapienter agit.
Viuitur ex velle, non amplius est via tuta,
Cuncta licent cupido, dum vacat ipse lucro;
280 Arma, rapina, dolus, amor ambiciosus habendi,
Amplius ad proprium velle sequntur iter:
Lex silet et nummus loquitur, ius dormit et aurum
Peruigil insidiis vincit vbique suis:
Hasta nocet ferri, gladius set plus nocet auri;
Regna terit mundi, nilque resistit ei.
Set quia mors dubium concludit ad omnia finem,
Est nichil hic certum preter amare deum:
Rebus in humanis semper quid deficit, et sic
Ista nichil plenum fertile vita tenet:
290 Quod tibi dat proprium mundus, tibi tollit id ipsum,
Deridensque tuum linquit inane forum:
Quam prius in finem mundi deuenerit huius,
Nulla potest certo munere vita frui.
Heu, quid opes opibus cumulas, qui propria queris,
Cum se nemo queat appropriare sibi?
Hunc igitur mundum quia perdes, quere futurum;
Est aliter vacuum tempus vtrumque tuum.
Mammona transibit et auara cupido peribit,
In cineres ibit, mors tua fata bibit,
300 Pauper ab hac vita, sic princeps, sic heremita,
Mortuus ad merita transiet omnis ita.
Quicquid homo voluit, mors mundi Salomon:
cuncta reuoluit, Memorare
Nemoque dissoluit, quin morti nouissima, et
ineternum non
debita soluit:
peccabis.
Hec qui mente capit gaudia, raro
sapit.
Set sibi viuenti qui consilio sapienti Idem: Omnia
Prouidet, ingenti merito placet fac cum
omnipotenti. consilio, et
Tempore presenti que sunt mala ineternum non
penitebis.
proxima genti,779
Ex oculo flenti Gower canit ista legenti:
Quisque sue menti qui concipit aure patenti
310 Mittat, et argenti det munera largus egenti;
Stat nam mortalis terra repleta malis.780
Hoc ego bis deno Ricardi regis in anno
Compaciens animo carmen lacrimabile scribo.781
Vox sonat in populo, fidei iam deficit ordo,
Vnde magis solito cessat laus debita Cristo,
Quem peperit virgo genitum de flamine sacro.
Hic deus est et homo, perfecta salus manet in quo,
Eius ab imperio processit pacis origo,
Que dabitur iusto, paciens qui credit in ipso.
320 Vir qui vult ideo pacem componere mundo,
Pacificet primo iura tenenda deo.

FOOTNOTES:
751 Carmen super multiplici, &c. The MSS. referred to are
SCEHL with Fairfax 3 (F) and Bodley 294 (B).
752 Title and Preface ll. 1-12 om. EL
753 Title and Preface consequenter] hic precipue F
754 Title and Preface ‘Putruerunt,’ &c. om. E
755 Title and Preface pro salute interpellam] pro salute efficacius
interpellem F
756 Anno] In Anno F
757 l. 13 ad om. S
758 35 palliet F (corr.) palleat SCHLB paleat E
759 58 scintilla CEL
760 62 non] nec F
761 63 nec] non CEH
762 86 No paragr. here FL stude SCEHLB time F
763 88 Que fantasias aliter tibi FB
764 90 Paragr. FL
765 126 stat om. S
766 138 f. Two lines om. FL The section ll. 142-224 is omitted
here in E and inserted after l. 321
767 143 legit C
768 154 omnis SFLB viuus CEH
769 159 fatuis om. F
770 190 inde] ille FL
771 200 fata EHLB
772 203 hoc EH
773 210 mortis] cordis S
774 217 statum S
775 234 modernus SFLB modernos CEH
776 251 Quot C
777 252 Sic CEH
778 265 est qui CEH
779 307 Paragr. SE no paragr. CHFLB
780 After 311 one line space F
781 312-321 Ten lines om. L
782TRACTATUS DE LUCIS SCRUTINIO

Incipit tractatus de lucis Scrutinio quam a


diu viciorum tenebre, prothdolor,
suffocarunt, 783 secundum illud in euangelio,
Qui ambulat in tenebris nescit quo vadat.

Heu, quia per crebras humus est Nota quod


viciata tenebras, eorum lucerna
Vix iter humanum locus vllus habet minime,
clarescit quos
sibi planum.
in ecclesia per
Si Romam pergas, vt ibi tua lumina Antipapam
tergas, Auaricia
Lumina mira cape, quia Rome sunt promotos
duo pape; ditescit.
Et si plus cleri iam debent lumina
queri,
Sub modio tecta latitat lucerna reiecta:
Presulis officia mundus tegit absque sophia,
Stat sua lux nulla, dum Simonis est ibi bulla;
Est iter hoc vile, qui taliter intrat ouile,
10 Nec bene discernit lucem qui lumina spernit.
Sic caput obscurum de membris nil fore purum784
Efficit, et secum sic cecus habet sibi cecum.785
Aut si vis gressus claros, non De luce ordinis
ordo professus professi.
Hos tibi prestabit, quos caucius vmbra fugabit.
Ordine claustrali manifestius in speciali786
Lux ibi pallescit, quam mens magis inuida nescit:
Lux et moralis tenebrescit presbiteralis,
Clara dies transit, nec eis lucerna remansit;
Sunt ibi lucerne, iocus, ocia, scorta, taberne,
20 Quorum velamen viciis fert sepe iuuamen.
Sic perit exemplum lucis, quo turbida templum
Nebula perfudit, que lumina queque recludit:787
Sic vice pastorum quos Cristus ab ante bonorum
Legerat, ecce chorum statuit iam mundus eorum.
Si lux presentum scrutetur in Nota quod, si
orbe regentum, regum lucerna
Horum de guerra pallet sine lumine in manu
caritatis
terra:
deuocius
Ne periant leges, iam Roma petit gestaretur,
sibi reges, ecclesia nunc
Noscat vt ille pater que sit sibi diuisa eorum
credula mater. auxilio
Scisma modernorum patrum discrecius
nouitate duorum reformaretur,
eciam et
Reges delerent, si Cristi iura
incursus
30 viderent; paganorum a
Lux ita regalis decet ecclesiam Cristi finibus
specialis, eorum probitate
Qua domus alma dei maneat sub eminus
spe requiei. expelleretur.
Teste paganorum bello furiente deorum,
Raro fides crescit vbi regia lux tenebrescit:
Hec tamen audimus, set et hec verissima
scimus788
Nec capit hec mentis oculus de luce regentis.
Vlterius quere, cupias si lumen habere,
Lumina namque Dauid sibi ceca magis titulauit.
Si regni proceres aliter pro De luce
lumine queres, procerum.
40 Aspice quod plenum non est ibi tempus amenum,
Dumque putas stare, palpabis iter, quia clare
Nemo videt, quando veniet de turbine grando,
Diuicie cece fallunt sine lumine sese;
Quam prius ille cadat, vix cernit habens vbi vadat:
Sic via secura procerum non est sine cura.
Stans honor ex onere sibi conuenit acta videre;
Qui tamen extentum modo viderit experimentum,
De procerum spera non surgunt lumina vera.
Si bellatorum lucem scrutabor De luce Militum
eorum, et aliorum qui
Lucerne lator tenebrosus adest bella sequntur.
50 gladiator,
Sunt ibi doctrina luxus, iactura, rapina,
Que non splendorem querunt set habere cruorem;
Et sic armatus lucem pre labe reatus
Non videt, vnde status suus errat in orbe grauatus.
Si lex scrutetur, ibi lux n o n De luce
i n u e n i e t u r, legistarum.
Quin vis aut velle ius concitat esse rebelle:
Non populo lucet index quem mammona ducet,
Efficit et cecum quo sepe reflectitur equm.
Ius sine iure datur, si nummus in aure loquatur,
60 Auri splendore tenebrescit lumen in ore,
Omnis legista viuit quasi lege sub ista,
Quo magis ex glosa loculi fit lex tenebrosa.
Si Mercatorum querantur lumina De luce
morum, Mercatorum.
Lux non fulgebit, vbi fraus cum ciue manebit.
Contegit vsure subtilis forma figure
Vultum laruatum, quem diues h a b e t
s i m i l a t u m.
Si dolus in villa tua possit habere sigilla,
Vix reddes clarus bona que tibi prestat auarus;
Et sic maiores fallunt quamsepe minores,
70 Vnde dolent turbe sub murmure plebis in vrbe.
Sic inter ciues errat sine lumine diues,
Dumque fidem nescit, lux pacis ab vrbe recessit.
[Margin: De luce vulgari, que patriam conseruat.]
Si patriam quero, nec ibi michi lumina spero;
Nam via vulgaris tenebris viciatur amaris:
Plebs racione carens hec est sine moribus arens,789
Cuius subiectam vix Cristus habet sibi sectam.
Sunt aliqui tales, quos mundus habet speciales,
Fures, raptores, homicide, turbidiores:
Sunt et conducti quidam pro munere ducti,
80 Quos facit assisa periuros luce rescisa.790
Rustica ruralis non est ibi spes aliqualis,
Quo nimis obscura pallent sine lumine rura:
Sic magis illicebras mundanas quisque tenebras791
Nunc petit, et vota non sunt ad lumina mota.
Sic prior est mundus, et si deus esse secundus
Posset, adhuc talis foret in spe lux aliqualis:
Set quasi nunc totus deus est a plebe remotus;792
Sic absente duce perit orbis iter sine luce.
[Margin: Hic in fine793 tenebras deplangens pro luce optinenda deum exorat.]
O nimis orbatus varii de labe reatus,
90 Omnis in orbe status modo stat quasi preuaricatus.
Cum tamen errantes alios sine lege vagantes794
Cecos deplango, mea propria viscera tango:795
Cecus vt ignorat quo pergere, dumque laborat,796
Sic iter explorat mea mens, que flebilis orat:
Et quia perpendo quod lucis ad vltima tendo,
Nunc iter attendo quo perfruar in moriendo.
Tu, qui formasti lucem tenebrasque creasti,
Crimina condones, et sic tua lumina dones:
In terram sero tunc quando cubicula quero,
100 Confer candelam, potero qua ferre medelam.
Hec Gower scribit, lucem dum querere quibit;
Sub spe transibit, vbi gaudia lucis adibit:
Lucis solamen det sibi Cristus. Amen.

FOOTNOTES:
782 Text of S collated with CEHL
783 Title 2 Suoffocarunt S
784 11 oscurum CH
785 12 cecus] secus C
786 15 manifestus L
787 22 Nebula] Lumina L
788 35 et hec] per hec L
789 75 hec om. L
790 80 rescisa SEHL recisa C (corr.)
791 83 illecebras EL
792 87 a plebe] a luce CEH
793 89 margin Hic in fine] Nota quod Iohannes Gower auctor
huius libr hic in fine E
794 91 sine luce L
795 92 de plango C
796 93 S has here lost a leaf
ECCE PATET TENSUS etc.797
Ecce patet tensus ceci Cupidinis arcus,
Vnde sagitta volans ardor amoris erit.
Omnia vincit amor, cecus tamen errat vbique,
Quo sibi directum carpere nescit iter.
Ille suos famulos ita cecos ducit amantes,
Quod sibi quid deceat non videt vllus amans:
Sic oculus cordis carnis caligine cecus
Decidit, et racio nil racionis habet.
Sic amor ex velle viuit, quem ceca voluptas
10 Nutrit, et ad placitum cuncta ministrat ei;
Subque suis alis mundus requiescit in vmbra,
Et sua precepta quisquis vbique facit.
Ipse coronatus inopes simul atque potentes
Omnes lege pari conficit esse pares.
Sic amor omne domat, quicquid natura creauit,
Et tamen indomitus ipse per omne manet:
Carcerat et redimit, ligat atque ligata resoluit,
Vulnerat omne genus, nec sibi vulnus habet.
Non manet in terris qui prelia vincit amoris,
20 Nec sibi quis firme federa pacis habet:
Sampsonis vires, gladius neque Dauid in istis
Quid laudis, sensus aut Salomonis, habent.
O natura viri, poterit quam tollere nemo,
Nec tamen excusat quod facit ipsa malum!
O natura viri, que naturatur eodem
Quod vitare nequit, nec licet illud agi!
O natura viri, duo que contraria mixta
Continet, amborum nec licet acta sequi!
O natura viri, que semper habet sibi bellum
30 Corporis ac anime, que sua iura petunt!
Sic magis igne suo Cupido perurit amantum
Et quasi de bello corda subacta tenet.
Qui vult ergo sue carnis compescere flammam,
Arcum preuideat vnde sagitta volat.
Nullus ab innato valet hoc euadere morbo,
Sit nisi quod sola gracia curet eum.

* * * * *
The MS. has here lost a leaf.

FOOTNOTES:
797 ‘Ecce patet tensus’ &c. This follows the Cinkante Balades in
the Trentham MS.
EST AMOR etc.798

Carmen quod Iohannes Gower super amoris
multiplici varietate sub compendio metrice
composuit.799

Est amor in glosa pax bellica, lis pietosa,
Accio famosa, vaga sors, vis imperiosa,
Pugna quietosa, victoria perniciosa,
Regula viscosa, scola deuia, lex capitosa,
Cura molestosa, grauis ars, virtus viciosa,
Gloria dampnosa, flens risus et ira iocosa,
Musa dolorosa, mors leta, febris preciosa,
Esca venenosa, fel dulce, fames animosa,
Vitis acetosa, sitis ebria, mens furiosa,
10 Flamma pruinosa, nox clara, dies tenebrosa,800
Res dedignosa, socialis et ambiciosa,
Garrula, verbosa, secreta, silens, studiosa,
Fabula formosa, sapiencia prestigiosa,
Causa ruinosa, rota versa, quies operosa,
Vrticata rosa, spes stulta fidesque dolosa.
Magnus in exiguis variatus vt est tibi clamor,
Fixus in ambiguis motibus errat amor.
Instruat audita tibi leccio sic repetita;
Mors, amor et vita participantur ita.

Lex docet auctorum quod iter carnale bonorum
Tucius est, quorum sunt federa coniugiorum:
Fragrat vt ortorum rosa plus quam germen agrorum,
Ordo maritorum caput est et finis amorum.
5 Hec est nuptorum carnis quasi regula morum,
Que saluandorum sacratur in orbe virorum.
Hinc vetus annorum Gower sub spe meritorum
Ordine sponsorum tutus adhibo thorum.

FOOTNOTES:
798 Text of S, collated with F. See also vol. i. p. 392
799 Title Carmen de variis in amore passionibus breuiter
compilatum F
800 10 tenobrosa S
QUIA VNUSQUISQUE etc.801

Quia vnusquisque, prout a deo accepit, aliis
impartiri tenetur, Iohannes Gower super hiis
que deus sibi sensualiter donauit villicacionis
sue racionem802 secundum aliquid alleuiare
cupiens, tres precipue libros per ipsum, dum
vixit, doctrine causa compositos ad aliorum
noticiam in lucem seriose produxit.803
Primus liber Gallico sermone editus in
decem diuiditur partes, et tractans de viciis et
virtutibus, necnon et de variis huius seculi
gradibus, viam qua peccator transgressus ad
sui creatoris agnicionem redire debet, recto
tramite docere conatur. Titulusque libelli istius
Speculum Meditantis nuncupatus est.
Secundus enim liber sermone Latino metrice
compositus tractat de variis infortuniis tempore
regis Ricardi secundi in Anglia contingentibus:
vnde non solum regni proceres et communes
tormenta passi sunt, set et ipse crudelissimus
rex, suis ex demeritis ab alto corruens, in
foueam quam fecit finaliter proiectus est.
Nomenque voluminis huius Vox Clamantis
intitulatur.
Tercius vero804 liber, qui ob reuerenciam
strenuissimi domini sui domini Henrici de
Lancastria, tunc Derbeie Comitis, Anglico
sermone conficitur, secundum Danielis
propheciam super huius mundi regnorum
mutacione a tempore regis Nabugodonosor
vsque nunc tempora distinguit. Tractat eciam
secundum Aristotilem super hiis quibus rex
Alexander tam in sui regimen quam aliter eius
disciplina edoctus fuit. Principalis tamen huius
operis materia super amorem et infatuatas
Amantum passiones fundamentum habet:
nomenque sibi appropriatum Confessio
Amantis specialiter sortitus est.

FOOTNOTES:
801 ‘Quia vnusquisque’ &c. Text of S, collated with CHGF. See
also vol. iii. p. 479 f.
802 3 racionem SCH racionem, dum tempus instat, GF
803 4 ff. tres—produxit] inter labores et ocia ad aliorum noticiam
tres libros doctrine causa forma subsequenti propterea
composuit GF
804 18 vero] iste F
ENEIDOS BUCOLIS etc.805

Carmen, quod quidam Philosophus in
memoriam Iohannis Gower super
consummacione suorum trium librorum forma
subsequenti806 composuit, et eidem gratanter
transmisit.807

Eneidos Bucolis que Georgica metra perhennis
Virgilio laudis serta dedere scolis;
Hiis tribus ille libris prefertur honore poetis,
Romaque precipuis laudibus instat eis.
Gower, sicque tuis tribus est dotata libellis
Anglia, morigeris quo tua scripta seris.
Illeque Latinis tantum sua metra loquelis
Scripsit, vt Italicis sint recolenda notis;
Te tua set trinis tria scribere carmina linguis
10 Constat, vt inde viris sit scola lata magis:
Gallica lingua prius, Latina secunda, set ortus
Lingua tui pocius Anglica complet opus.808
Ille quidem vanis Romanas obstupet aures,
Ludit et in studiis musa pagana suis;
Set tua Cristicolis fulget scriptura renatis,
Quo tibi celicolis laus sit habenda locis.

FOOTNOTES:
805 ‘Eneidos Bucolis’ &c. Text of S, collated with CHGF
806 forma subsequenti] versificatum F
807 Title Epistola cuiusdam Philosophi Iohanni Gower super
consummacione suorum trium librorum, prout inferius patet,
gratanter transmissa G
808 12 Anglia F
O DEUS IMMENSE etc.809

Carmen quod Iohannes Gower adhuc viuens
super principum regimine vltimo composuit.810

O deus immense, sub quo dominantur in ense
Quidam morosi Reges, quidam viciosi,
Disparibus meritis sic pax sic mocio litis
Publica regnorum manifestant gesta suorum:
Quicquid delirant Reges, plectuntur Achiui,
Quo mala respirant, vbi mores sunt fugitiui.
Laus et honor Regum foret obseruacio legum,
Ad quas iurati sunt prima sorte vocati:
Vt celeste bonum puto concilium fore donum,
10 Quo prius in terris pax contulit oscula guerris:
Consilium dignum Regem facit esse benignum,
Est aliter signum quo spergitur omne malignum.
In bonitate pares sumat sibi consiliares
Rex bonus, et cuncta venient sibi prospera iuncta:
Qui regit optentum de consilio sapientum
Regnum, non ledit set ab omni labe recedit:
Consilium tortum scelus omne refundit abortum
Regis in errorem, regni quo perdit amorem.
‘Ve qui predaris,’ Ysaias clamat auaris;
20 Sic verbis claris loquitur tibi qui dominaris.
Rex qui plus aurum populi quam corda thesaurum
Computat, a mente populi cadit ipse repente.
Os vbi vulgare non audet verba sonare,
Stat magis obscura sub murmure mens loqutura:
Que stupet in villa cicius plebs murmurat illa,
Vnde malum crescit, sapiens quo sepe pauescit.
Est tibi credendum murmur satis esse timendum;
Cum sit commune, tunc te super omnia mune.811
Lingua nequit fari mala, cor nec premeditari,
30 Que parat obliqus sub fraude dolosus amicus:
Mundus erit testis, vir talis vt altera pestis
Inficit occulto regnum de crimine multo.
Blandus adulator et auarus consiliator,
Quamuis non velles, plures facit esse rebelles:
Sepius ex herbis morbus curatur acerbis,
Sepe loquela grauis iuuat et nocet illa suauis.
Qui falsum pingunt sub fraudeque vera refingunt,
Hii sunt qui blando sermone nocent aliquando:
Rex qui conducit tales, sibi scandala ducit,
40 Nomen et abducit quod nobile raro reducit:
Quod viguit mane, sibi vespere transit inane,
Dummodo creduntur que verba dolosa loquntur.
Consilio tali regnum magis in speciali
Vndique turbatur, quo Regis honor variatur:
Nunc ita sicut heri poterit res ista videri,
Vnde magis plangit populus, quem lesio tangit.
Set premunitus non fallitur inde peritus;
Quod videt ante manum, fugit omne notabile vanum:
Cum laqueatur auis, cauet altera, sicque suauis
50 Rex pius in cura semper timet ipse futura.
Rex insensatus nullos putat esse reatus,
Quam prius ante fores casus sibi sint grauiores;
Set qui prescire vult causas, expedit ire,
Plebis et audire voces per easque redire:
Si sit in errore Regis vel in eius honore,
Hoc de clamore populi prefertur ab ore.
Est qui morosus, Rex non erit ambiciosus,
Set sub eo tutum regni manet omne statutum:
Nomine preclarus nunquam fuit vllus auarus,
60 Larga manus nomen cum laude meretur et omen:
Nomen regale populi vox dat tibi, quale
Sit, bene siue male, deus illud habet speciale.
[Margin: Nota.]
Rex qui tutus eris, si temet noscere queris,
Ad vocem plebis aures sapienter habebis:
Culpe vel laudis ex plebe creatur, vt audis,
Fama ferens verba que dulcia sunt et acerba.
Fama cito crescit, subito tamen illa vanescit,
Saltem fortuna stabilis quia non manet vna:
Principio scire fortunam seu stabilire,
70 Non est humanum super hoc quid ponere planum;
Fine set expertum valet omnis dicere certum,
Qualia sunt facta, quia tunc probat exitus acta.
Rex qui laudari cupit et de fine beari,
Sint sua facta bona, recoletur vt inde corona.
Regia precedant benefacta que crimina cedant,
Viuat vt eterno sic Rex cum Rege superno:
Absque deo vana cum sit tibi cotidiana
Pompa, recorderis, sine laude dei morieris.
Rex sibi qui mundum prefert Cristumque secundum
80 Linquit, adherebit vbi finis laude carebit:
Regis enim vita cum sit sine laude sopita,
Nomen erat quale, dabit vltima cronica tale.
Et sic concludo breuiter de carmine nudo,
Ordine quo regnant Reges, sua nomina pregnant.
Quo caput infirmum, nichil est de corpore firmum,
Plebs neque firmatur, vbi virtus non dominatur.
Rex qui securam laudis vult carpere curam,
Cristum preponat, Reges qui laude coronat:
Nam qui presumit de se, cum plus sibi sumit,
90 Fine carens laude stat fama retrograda caude.
Omni viuenti scola pertinet ista regenti,
Displicet hic genti qui non placet omnipotenti,
Gracia succedit, meritis vbi culpa recedit:
Qui sic non credit, sua Rex regalia ledit.
Non ex fatali casu set iudiciali
Pondere regali stat medicina mali.
Plebs vt ouile gregis, mors vitaque, regula legis,
Sub manibus Regis sunt ea quanta legis.
