dynamically generated Web pages (see Table I).

The first NEC study (Lawrence and Giles, 1998) estimated that the visible Web contained at least 320 million Web pages in December 1997, whilst the second study (Lawrence and Giles, 1999) estimated that the visible Web had risen to 800 million Web pages, representing six terabytes of text data, as of February 1999. Owing to its highly disparate structure and range of data types, there has as yet been no scientific research conducted to determine the size of the invisible Web.

However, most publishers distribute their data on the Web by integrating huge databases, often gigabytes in size, with a front-end search interface. By virtue of its commercial, professionally published origin, such information is typically of high value and more highly structured and indexed than the visible Web. The user's search enquiry will generate customised, as opposed to generic, results. Therefore, for professional researchers, it can be said that information is increasingly accessed via the Web, rather than on it.

Nonetheless, the "visible" Web constitutes a significant contribution to the dissemination of human knowledge and, as the NEC studies acknowledged, "much of [this] material is not available in traditional databases". It is no surprise that several surveys, such as Nielsen NetRatings or Media Matrix (www.mediamatrix.com), consistently show that search engines are amongst the most popular destination sites on the Web.

Figure 1 Dynamic Web page generation

Table I Comparison of static and dynamically generated Web pages

Static Web pages        Dynamic Web pages
Manually produced       Computer generated
Generic information     Customised information
Most are indexable      Not indexable
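To make Table I concrete, here is a minimal sketch of dynamic Web page generation in the spirit of Figure 1: a query arriving at a front-end search interface is answered by assembling HTML from a database on the fly, so the customised page never exists as a static, indexable file. The database contents and field names below are invented for illustration.

```python
# Minimal sketch of dynamic page generation: the page is built per enquiry.
# The "database" and its fields are hypothetical stand-ins.
COMPANY_DB = {
    "acme": {"name": "Acme Corp", "sector": "Manufacturing"},
    "globex": {"name": "Globex", "sector": "Energy"},
}

def render_page(query: str) -> str:
    """Build a customised HTML page from the database for one enquiry."""
    record = COMPANY_DB.get(query.lower())
    if record is None:
        return "<html><body><p>No match found.</p></body></html>"
    return (
        "<html><body>"
        f"<h1>{record['name']}</h1>"
        f"<p>Sector: {record['sector']}</p>"
        "</body></html>"
    )

# Generated on demand, so a crawler never finds it as a stored page.
print(render_page("Acme"))
```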
Web directories and search engines

Web directories explained
What is the difference between a Web directory and a search engine? A Web directory is:
. a pre-defined list of Web sites;
. compiled by human editors;
. categorised according to subject/topic;
. selective.
Because humans compile Web directories, a qualitative decision concerning the content of each listed Web site has already been made. Consequently Web directories are popular with Internet users looking for particular information, because they feel that they have a head start in identifying "the best of the Web" for the topic they are interested in.

In using a Web directory the user can navigate through the listings or search across the entire directory (see Appendix). The major Web directories also license search engine indexes to provide secondary results whenever their human-compiled directory fails to produce results matching the user's query. For example, the world's largest Web directory, Yahoo!, licenses the Inktomi search index for just this purpose.

As a result of the manual compilation process, Web sites that have been indexed by Web directories will remain listed within that directory unless, in what is a highly unlikely event, they are manually removed. This permanent presence is not guaranteed for a listing within a search engine index, thus making a listing within a popular Web directory such as Yahoo! highly desirable.

Broadly speaking, any Web site that comprises several pages of organised links can be considered a Web directory. Many individuals, whether experts in their field or those passionate about a particular subject, have compiled such sites. One such voluntary Web directory which has exploded to global status, becoming a real rival to world leader Yahoo!, is the Open Directory. Other Web directories of specific relevance to information professionals include:
. Sheila Webber's excellent Business Information Sources on the Internet - www.dis.strath.ac.uk/business
. Business Researcher's Interests - www.brint.com/interest.html

Search engines explained
When using a search engine, the user is searching a database of indexed Web sites. All
Web site had to offer as much value to the user as possible. As users were ceding control of their initial Internet experiences (they still do), Microsoft, Netscape, AOL, etc. were able to control users' primary destination Web site each time the user logged on. As search sites were the first desired destination sites of users, these companies licensed search engine and Web directory providers to provide search services. Thus emerged the concept of the portal.

Portals became a huge success. However, in the rush for users and their attendant dollars, many search providers began to neglect their core service - their search index or Web directory. From late 1996 until September 1997 the growth of the main search engine indexes and Web directories was negligible (Sullivan, 1999a), despite the continued relentless growth of the Web. The spurt in growth of most search engine indexes in the first half of 1996 was primarily attributable to the arrival, at the end of 1995, of AltaVista, with the largest search index of the time. This period was also marked by a distinct lack of search technology innovation. Although meta search engines such as Mamma, Dogpile and MetaCrawler first rose to prominence during this period, their search functionality was essentially based on the "location and frequency of keywords" approach developed by the main search engines. Meanwhile, the distinction between search engines and Web directories became somewhat blurred for the user as the search engines licensed Web directories and vice versa, whilst AOL, Netscape, Microsoft, etc. licensed both. This cross-fertilisation resulted in portals becoming all-encompassing search sites. It was not until the arrival of a second generation of search engine providers in 1998 that new approaches to indexing and searching the Web became available.

Evolution of search technology

Anyone who has ever seen a diagrammatic representation of the evolution of life on our planet, as we currently understand it, would notice that basic cellular lifeforms were around for a very long time before the evolution of more complex biological entities. However, once this point had been reached, the rapid diversification of life into ever more organised and intelligent forms occurred in ever-decreasing timescales. The same can be said for Web search technology. By focusing their efforts on e-commerce and portalisation, the first generation of search sites - the "big five" - neglected their core search functionality. While they reigned supreme for several years, this neglect, and failure to adapt appropriately to a changing environment, created niche opportunities which were soon exploited by new types of search providers.

Meta search engines
Meta search engines enable the user to search across several search engines and Web directories simultaneously. Some of the most popular meta search engines include the following.

Dogpile (www.dogpile.com) searches 14 different engines and directories but does not eliminate duplicates. It was acquired in August 1999 by search engine GO2 for US$40 million in stock and a further US$15 million in cash. At the time of acquisition, Dogpile had only five employees (The Wall Street Journal, 5 August 1999).

Mamma (www.mamma.com) searches seven engines but removes duplicates and re-orders results according to its own relevance-ranking algorithm.

Others include:
. 2Q - www.2q.com
. Infind - www.infind.com
. Isleuth - www.isleuth.com
. Surfy - www.surfy.com
. Webtaxi - www.webtaxi.com
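A minimal sketch of what a meta search engine in the Mamma mould does may clarify the mechanism: fan the query out, merge the result lists, drop duplicate URLs and re-rank. The engine names and canned results below are placeholders; a real service would issue HTTP requests to each engine, and each meta engine applies its own relevance-ranking policy (here, a simple rank-weighted vote).

```python
# Hedged sketch of a meta search engine: fan-out, de-duplicate, re-rank.
from collections import defaultdict

ENGINES = ["altavista", "lycos", "infoseek"]  # illustrative names only

def fetch_results(engine: str, query: str) -> list[str]:
    """Placeholder: would issue an HTTP request and parse the result page."""
    canned = {
        "altavista": ["http://a.example", "http://b.example"],
        "lycos": ["http://b.example", "http://c.example"],
        "infoseek": ["http://a.example", "http://c.example"],
    }
    return canned[engine]

def meta_search(query: str) -> list[str]:
    votes = defaultdict(int)  # URL -> accumulated weight across engines
    for engine in ENGINES:
        for rank, url in enumerate(fetch_results(engine, query)):
            votes[url] += len(ENGINES) - rank  # earlier rank, more weight
    # One re-ranking policy among many: order by accumulated votes.
    return sorted(votes, key=votes.get, reverse=True)

print(meta_search("web searching"))
```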
Popularity-based analysis
The first generation of search engines created indexes by spidering Web sites and analysing the location and frequency of words; Web directories were compiled manually. Launched in April 1998, Direct Hit (www.directhit.com) represented a radical new departure from these approaches, and dubbed its methodology "the third way". The system was claimed to be user-controlled, as the ranking of results is based on the Web sites that users have visited. Like many of the second-generation search technologies, it is not a separate search engine with its own index that can be accessed directly. Instead it provides a second-level analysis of search results where it is incorporated within existing search engines, one being HotBot.

Prior to licensing Direct Hit, HotBot returned a list of results based on the standard
methodology of matching search terms with content on the Web sites in its index. Now, Direct Hit will run a second-level analysis on the user's set of results. From its database it will identify those Web sites which are popular, according to the number of visits each Web site has received, and then re-rank the search results accordingly, with the most popular Web sites that match your search term presented first in the list of results.
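A loose sketch of this kind of popularity-based re-ranking follows. It is not Direct Hit's actual algorithm, merely the idea as described above: click-through counts accumulated from earlier searchers reorder the next searcher's results, and every click feeds back into the tally. The URLs and figures are invented.

```python
# Sketch of click-popularity re-ranking (invented data, not Direct Hit's code).
click_counts = {
    "http://obscure-but-good.example": 842,  # hypothetical figures
    "http://well-ranked.example": 120,
    "http://rarely-chosen.example": 3,
}

def rerank(results: list[str]) -> list[str]:
    """Second-level analysis: most-visited pages float to the top."""
    return sorted(results, key=lambda url: click_counts.get(url, 0), reverse=True)

def record_click(url: str) -> None:
    """Each selection feeds back into the next searcher's ranking."""
    click_counts[url] = click_counts.get(url, 0) + 1

print(rerank([
    "http://rarely-chosen.example",
    "http://well-ranked.example",
    "http://obscure-but-good.example",
]))
```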
However, the popularity of a Web site can be largely determined by its search engine rankings, and there are all sorts of ways to manipulate those if you have a good understanding of how search engines work. Direct Hit tries to compensate for this by boosting obscure sites. For example, a Web site could provide lots of valuable information about a particular topic but could nevertheless feature well down the list of results of search engines. If a searcher has been tenacious enough to dig down as far as result number 100 (presumably an information professional), and click on it, then Direct Hit's algorithms will give this site a big boost up the list of results the next time it appears in other searchers' lists of results. If other users do not click on this obscure Web site, then it will drop down the list of results for subsequent searchers, because it did not prove popular (Green, 1999a). Since its launch, the company has successfully licensed its technology to ten search sites, including AOL, HotBot, Lycos, MSN and LookSmart, and it is available within Netscape Communicator 4.5 and Apple's Sherlock search utility.

Natural language searching
As already discussed, the first generation of search engines operated by matching the keywords submitted by the user to the contents of the Web pages in their databases. They did not consider the context of the search terms, i.e. the syntactical relationships between the search terms and other vocabulary within their index. Furthermore, they search for literal exact matches and therefore fail to consider semantics or use thesauri (Green, 1999b). Most search engines also automatically ignore frequently used words such as "or", "to", "not", etc. In June 1998 there was a major breakthrough in addressing these limitations. Two new search engines were launched within weeks of each other. Both offered natural language searching, but adopted different philosophies in developing their solutions.

Ask Jeeves (www.askjeeves.com) was launched on 1 June 1998. Billed as "the first natural language search agent on the Internet", it operates by matching a user's query against a database of 7 million template questions. If there is no match then the user is presented with the nearest alternatives from the database and asked to select the most appropriate. It will also conduct a metasearch across AltaVista, Go (Infoseek), Lycos and Yahoo! It has now been licensed by AltaVista for its own search site. However, artificial intelligence (AI) experts have criticised the company's natural language claims. It was named after the resourceful butler in P.G. Wodehouse's novels.

The Electric Monk (www.electricmonk.com) was launched a few weeks later. This search service conducts a syntactical analysis of the query using natural language algorithms. These algorithms will also make use of thesauri to consider alternative related words. The natural language search is then translated into a complex Boolean query and submitted to AltaVista. It was named after a character in a Douglas Adams novel.
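The following sketch illustrates the general idea of turning a natural language question into a Boolean query, as the Electric Monk is described as doing: discard frequently used words, expand the remainder through a thesaurus, and join the alternatives with OR inside AND-ed clauses. The stop-word list and thesaurus here are toy stand-ins, not the product's own.

```python
# Sketch of natural-language-to-Boolean translation (toy rules, not
# the Electric Monk's actual algorithms).
STOP_WORDS = {"or", "to", "not", "the", "how", "do", "i", "a", "what", "is"}
THESAURUS = {"car": ["automobile"], "fix": ["repair", "mend"]}

def to_boolean(question: str) -> str:
    # Drop stop words, then expand each remaining term via the thesaurus.
    terms = [w for w in question.lower().rstrip("?").split()
             if w not in STOP_WORDS]
    clauses = []
    for term in terms:
        alternatives = [term] + THESAURUS.get(term, [])
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)

print(to_boolean("How do I fix a car?"))
# -> (fix OR repair OR mend) AND (car OR automobile)
```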
Links-based analysis
The first-generation search engines have focused on building huge indexes with the goal of answering every possible kind of general query. They focus on the content of each specific page they visit, with little consideration of how these pages interrelate and connect. As already discussed, the indexing methodologies they use fail to consider the complexity of human language: syntax (sentence structure), synonyms (different words for the same meaning) and polysemy (different meanings for the same word).

Links-based analysis attempts to overcome these problems by examining the relationships between pages - the 1 billion or so hyperlinks that weave the Web together (Clever Team, 1999). By examining how Web pages link together, links-based analysis offers methodologies for identifying authoritative sources of topic-specific information, eliciting quality, highly relevant results to users' queries. Not surprisingly, links-based analysis has quickly gained prominence amongst Internet users and is attracting a lot of
attention from both computer information scientists and corporate Internet investors.

Google (www.google.com), like Yahoo!, was developed by students at Stanford University. This technology uses a methodology known as PageRank (named after Larry Page, one of its creators) to crawl the Web and analyse how Web sites link to each other. Results are ranked on importance, i.e. how many other Web sites link to them. If you, as a Web site author, have included hyperlinks to other sites that you deem important, then you have exercised some editorial judgement. In the same way that Web directories, such as Yahoo!, are compiled by editors on a manual basis, Google seeks to capitalise on the editorial judgement of millions of Web site authors on an automated basis.

As a result, of course, it can analyse far more Web sites than the humans who build directories such as Yahoo!. In fact, unlike search engines that become less useful the larger their index of Web sites becomes, Google claims to return even better results with a bigger index. Google also seeks to capitalise on the accompanying editorial commentary by processing the text around each hyperlink (Green, 1999a).
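A worked sketch of the intuition behind PageRank may be useful. This is the textbook power-iteration formulation, not Google's production system, and the three-page graph is invented: each page repeatedly passes a share of its importance along its out-links, so pages that attract links from important pages end up ranked highest.

```python
# Hedged sketch of the PageRank idea via power iteration (toy graph).
DAMPING = 0.85
LINKS = {           # page -> pages it links to (dangling pages not handled)
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links: dict[str, list[str]], iterations: int = 50) -> dict[str, float]:
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - DAMPING) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = DAMPING * rank[page] / len(outlinks)
            for target in outlinks:  # each page passes rank to its targets
                new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(LINKS))  # "c", with two in-links, scores highest
```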
Links-based analysis does feature in the relevance-ranking algorithms of some search engine providers, such as Excite and HotBot. However, Google is the only publicly available search engine exclusively focused on links-based, Web-wide searching. The company estimates that its index is between 70 million and 100 million pages but, through the links analysis, enables users to reach an estimated 300 million Web pages. Google's combination of extensive reach and greater accuracy of results has quickly catapulted this relative latecomer to top ten status in search engine popularity. Data released by Nielsen NetRatings in August 1999 showed that Google gained the largest month-on-month increase in unique audience figures: visits to Google increased by a massive 88 per cent, compared to an average of 2.1 per cent for the other top ten search engines that month. Later that month Google signed its first licensing deal, with AOL subsidiary Netscape, to be the main search provider on the Netcenter portal.

Clever (www.almaden.ibm.com/cs/k53/clever.html) came about when a team of IBM researchers examining search engine effectiveness developed a system that was referred to internally as HITS (Hyperlink-Induced Topic Search). The project later became known as "Clever".

Related to the scientific citation index (the study of how scientific papers refer to one another), Clever examines the hypertext context of a keyword search. Like Google, Clever examines hyperlinks and the surrounding commentary. Unlike Google, which crawls the Web, Clever first submits the query to a search engine such as AltaVista, and then conducts its links analysis on a set of pages from the results produced by that search engine - typically about 200 pages. By adding all the pages that link to and from these 200 pages, Clever creates what is called a root set - usually between 1,000 and 5,000 pages. Using linear algebraic analysis, Clever then begins an iterative process of analysing this root set of results to divide the pages into two categories: authorities and hubs (Clever Team, 1999). Authorities are Web pages about a particular topic that have lots of links to them, i.e. they are authoritative sources of information. Hubs are Web pages which are a guide to, or list, authoritative sources, i.e. they do the most citing.
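The iterative calculation can be sketched as follows, under the standard HITS formulation (the root-set graph is invented, and the real system works over thousands of pages with linear algebra): authority scores are recomputed from the hubs that cite a page, hub scores from the authorities a page cites, and the two reinforce each other until they settle.

```python
# Sketch of the hub/authority iteration described for HITS (toy root set).
LINKS = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2", "auth3"],
    "auth1": [], "auth2": [], "auth3": [],
}

def hits(links: dict[str, list[str]], iterations: int = 20):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        for p in pages:  # good authorities are cited by good hubs
            auth[p] = sum(hub[q] for q in pages if p in links[q])
        for p in pages:  # good hubs cite good authorities
            hub[p] = sum(auth[t] for t in links[p])
        norm_a = sum(auth.values()) or 1.0   # normalise so scores converge
        norm_h = sum(hub.values()) or 1.0
        auth = {p: v / norm_a for p, v in auth.items()}
        hub = {p: v / norm_h for p, v in hub.items()}
    return hub, auth

hub, auth = hits(LINKS)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # e.g. auth1, hub2
```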
Hubs are similar to portals in that they act as a jump point for anyone interested in the particular topic they cover. Unlike Google, which retains rankings for individual Web sites in its index independently of the user's search query, Clever will always create a new root set for each query and prioritise each page according to the context of the specific search statement. While not yet available for Web-wide searching, IBM's research team is currently refining the Clever search engine and has been experimenting with Clever to automatically develop Web directories.

Focused Crawler is another search engine technology being developed by IBM, though it is not yet as developed as Clever. Unlike other search engines (including Google and Clever), which perform their analysis after they have crawled through a collection of hyperlinks, Focused Crawler, as its name suggests, seeks to identify collections of data highly relevant to a topic-specific search by crawling the Web with a specific goal and ignoring irrelevant sections of the Web. In other words, it only crawls Web sites of relevance to the user's query, rather than identifying a subset of relevant Web sites as a
result of an analysis of a larger set of crawled sites. Focused Crawler crawls the Web guided by a relevance and popularity mechanism that has two parts: a classifier that evaluates the relevance of a Web page to the user's search query, and a distiller that identifies "hypertext nodes that are great access points to many relevant pages within a few links" (Chakrabarti et al., n.d.).
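A compact sketch of that classifier/distiller split follows, with a hypothetical fetch() and a toy scoring rule standing in for the real components: only pages the classifier judges on-topic are kept, and only their out-links are pursued, so irrelevant regions of the Web are never entered.

```python
# Hedged sketch of a focused crawl; fetch() and the classifier are stand-ins.
from collections import deque

def fetch(url: str) -> tuple[str, list[str]]:
    """Placeholder: would download the page; returns (text, out-links)."""
    return "recycling of industrial plastics", []

def relevance(text: str, topic: str) -> float:
    """Toy classifier: fraction of topic words present in the page text."""
    words = set(text.split())
    topic_words = topic.split()
    return sum(w in words for w in topic_words) / len(topic_words)

def focused_crawl(seeds: list[str], topic: str, threshold: float = 0.5):
    frontier, visited, relevant = deque(seeds), set(), []
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch(url)
        if relevance(text, topic) >= threshold:  # classifier gate
            relevant.append(url)
            frontier.extend(links)  # distiller role: follow good access points
    return relevant

print(focused_crawl(["http://seed.example"], "plastics recycling"))
```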
Newsgroup searching
The Internet delivers two primary benefits: content and connectivity. Although distinct, the two are often closely interrelated. Portals are a perfect example: they represent the synergistic exploitation of both content and connectivity to create e-commerce opportunities. However, while the Web is the primary repository of human knowledge on the Internet, it is not the only one. Newsgroups, where collections of individuals share their experiences, knowledge and opinions on a subject of common interest, constitute an important area of consideration for information retrieval. The distinction between the Web and newsgroups is that the Web primarily represents a large body of explicit human knowledge, whilst newsgroups primarily represent a large body of implicit knowledge. Explicit, codifiable knowledge can help individuals and organisations, but it is implicit knowledge - the realm of experience, creativity and ideas - that offers the greatest potential. In an increasingly knowledge-based information society, it is implicit knowledge that will be needed to successfully exploit explicit knowledge, create new opportunities and develop adaptability.

Considering this, the role of specialised newsgroup search engines will become more important as individuals use the Internet to seek out experts (or indeed anyone who is qualified) to help with their problems. This prediction is based not merely on a belief in human altruism, but also on phenomena such as the emergent sociology of citations on the Web (Chakrabarti et al., n.d.), the explosive growth of the volunteer-based Open Directory (see Appendix) and the emphasis on people/expert connectivity in many corporate intranets.

There are literally thousands of newsgroups covering all manner of topics. These are organised in a tree-like structure with eight main categories: Comp, Rec, Sci, Soc, Talk, News, Alt and Misc. Owing to the huge number of groups available, specialised search engines have emerged to identify relevant groups and postings.

Deja News (www.dejanews.com) is probably the most widely known newsgroup search engine. It contains a directory of selected newsgroups which users can browse through, or search for a particular group, topic or posting. A power-search facility enables users to search by author, date and language, plus options on how the results are displayed.

Reference.com (www.reference.com) is similar to Deja News, but also enables searches in Web forums (Web-based bulletin boards) and mailing lists (where each posting is sent to your e-mail address). Users also have the option to save searches.

Liszt's Newsgroups Directory (http://liszt.com/news) uses Deja News for searching newsgroups, but has its own extensive list of mailing lists and IRC channels. There is also a directory of newsgroups, with descriptions for most.

Subject-specific indexes

Company information
There are many sites (usually from company and business information providers) that any researcher can visit. The amount and quality of information provided for free varies. However, all such sites are Web-enabled versions of commercial databases, rather than true search indexes. In a test of the ability of leading search engines and directories to deliver relevant results for searches on company names, conducted by the online industry publication Search Engine Watch, HotBot and Google were ranked joint first among search engines, while Netscape Search was ranked first among Web directories (Sullivan, 1999b). However, company research is not the exclusive focus of these search sites.

Launched in August 1999, 1Jump (www.1jump.com) is a specialised search index that focuses exclusively on information and news about companies. In addition to providing news, this search engine also provides details of company executives (titles, age, background and e-mail addresses), patents (every patent owned by a company) and "peers" (subsidiaries, parents and related companies). It also enables the user to visit other Web pages that are relevant to a particular company, e.g. an industry association.
Multimedia and image files
According to industry analyst organisation Future Image, in its report "Comparative Evaluation of Web Image Search Engines", almost 70 per cent of the Web is non-textual. Considering that humans assimilate and process information in visual format more readily than in textual format, and the greater availability of broadband capacity in the near future, the role of multimedia search engines will continue to grow. The three main specialised search engines in this area are:
(1) Ditto - www.ditto.com
(2) Scour - www.scour.net
(3) AltaVista PhotoFinder - www.altavista.com

Some other specialised search indexes include:
. Finding People - www.whowhere.com
. Law - www.fastsearch.com/law
. Health - www.drkoop.com
. Movies - www.imdb.com/search
. Ask an Expert - www.vrd.org/locator/subject.html
. Information Please - www.infoplease.com

Search utilities and intelligent agents

As already discussed, meta-search sites such as Dogpile and Mamma have grown in popularity because they allow users to search across different search indexes simultaneously, with duplicates removed and results re-ranked (depending on the meta-search service used). Search utilities represent the logical evolution of this functionality. Unlike meta-search engines, where the processing power to refine results remains on the server the user is interrogating, search utilities are programs installed on the user's hard drive. By shifting processing power away from the server and on to the user's own desktop, search utilities offer a much greater range of search and results-analysis functionality.

Like several of the second-generation search technologies that have emerged (Electric Monk, Google), many of these search utilities incorporate intelligent agents (or bots). Indeed, many of the powerful features offered by search utilities - such as language-independent searching, filtering, automatic refinement of results, document summaries, active hyperlinking of query words and live highlighting of search terms - are possible because of the nature of intelligent agents. Unlike a standard software program that will execute specific functions within clearly defined parameters, agents/bots:
. are adaptive - they can interpret monitored events to make appropriate decisions;
. are self-organising - they assimilate both information and experience;
. can communicate with both the user and other bots (Green, 1999c).

Agents can search across a wide range of document types and formats. They can provide a uniform interface for search queries across different sources and are true "infomediaries" in that they can identify and search appropriate resources that may or may not be known to the searcher. The adaptive element of intelligent agents is central to the functionality of many search products that incorporate agents. The following popular search utilities, which all contain agent technology, are available as free downloads and as more comprehensive paid versions.

Mata Hari (www.thewebtools.com) can learn one set of power-search commands and then automatically translate these for each search service/database that it queries for the user.

BullsEye Pro (www.intelliseek.com) incorporates 11 different intelligent agents, including technology from Verity, to conduct what it calls "Web mining". The different agents are used to target specific types of information, such as business news, in over 450 sources on both the visible and invisible Web. It will automatically run searches and allows import/export of searches to other users, whilst users can choose to receive change alerts by HTML e-mail, pager or other hand-held data devices.

Copernic (www.copernic.com) can translate a search statement for different services and then simultaneously submit the query to these search engines, Web directories and databases. There are also about 20 categories, such as business and finance, science, etc., with predetermined Web sources to search in.
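Mata Hari and Copernic are both described as translating one canonical query into each target service's own syntax before fanning it out. A hedged sketch of that translation step follows; the three syntax styles are illustrative stand-ins, not any particular product's real rules.

```python
# Sketch of per-service query translation (illustrative syntax styles only).
def translate(terms: list[str], phrase: str, service: str) -> str:
    if service == "boolean-and":      # e.g. engines expecting AND operators
        return " AND ".join(terms + [f'"{phrase}"'])
    if service == "plus-prefix":      # e.g. engines using +term requirements
        return " ".join("+" + t for t in terms) + f' +"{phrase}"'
    if service == "all-words-form":   # e.g. an "all of the words" form field
        return " ".join(terms) + " " + phrase
    raise ValueError(f"unknown service: {service}")

for service in ("boolean-and", "plus-prefix", "all-words-form"):
    print(service, "->",
          translate(["intelligent", "agents"], "web searching", service))
```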
Recognising the advantages offered by search utilities, some search providers have released a variety of free basic search utility programs as "plug-ins". As the name suggests, once installed they are incorporated within
the user's Web browser and enable the search engine provider to offer more features. Search providers that have released search utilities include Infoseek (Infoseek Express), AltaVista (AltaVista Discovery) and, more recently, Lycos (See More).

A common function of agents is that they allow the user to specify a high-level goal instead of issuing explicit instructions, leaving the "how" and "when" decisions to the agent. This, combined with their ability to search across data in unstructured formats, to automatically learn and adapt to user preferences and to identify patterns, is giving agent technology an ever-increasing role in Web searching.

XML

HTML is dead. Since XML was completed by the World Wide Web Consortium (W3C, the body responsible for developing technical standards for the Web) in early 1998, it has attracted an almost evangelical response. Most Web pages are currently produced in hypertext mark-up language (HTML). While HTML's ease of use fuelled its widespread adoption, it is somewhat limited in that it is primarily concerned with the design/layout of a Web page, rather than with the information that actually appears on that page. Considering that a primary use of the Web is information retrieval, this design focus is something of a drawback. HTML is a spin-off from SGML, a much more robust mark-up language that was approved by the International Organisation for Standardisation (ISO) in 1986. However, SGML is too complex for the Web. Seeking to address the limitations of HTML, the W3C developed XML as a subset of SGML that would address the semantic and structural considerations of information retrieval and exchange, and that would work on the Web.

XML is an open technology that offers tremendous possibilities for electronic publishing, e-commerce, information retrieval and data exchange. It consists of rules that enable anyone to create their own mark-up language. XML describes information using pairs of tags that are nested inside one another to multiple levels (Bosak and Bray, 1999). These create a tree structure of nested hierarchies. This convention allows users direct access to just the particular segment of the information that they are interested in; for example, hyperlinks can go through to the relevant section of a document rather than to the entire document. It also enables powerful structured searching, akin to database field searching, but on textual Web pages. In other words, XML not only enables explicit description of Web page content, but also describes the rules for manipulating each data set contained within the information. This enables a small program, such as a JavaScript routine, to process the information on the user's local hard drive according to the user's requirements, rather than the user requesting a new Web page from the central server. Multiplied across millions of Web users, this capability will dramatically decrease the demands on Web servers and improve network traffic (Green, 1999d). Based on open standards, XML will allow data exchange between different computer systems regardless of operating system or hardware.
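A small example shows why nested tags make field-style searching possible. The document, its tag names and the "company" schema are invented for illustration; ElementTree is part of Python's standard library. The point is that a query can be confined to one branch of the tree, and the answer returned is the matching data segment rather than the whole document.

```python
# Sketch of field/tag searching over nested XML (invented document).
import xml.etree.ElementTree as ET

DOCUMENT = """
<company>
  <name>Acme Corp</name>
  <executive><title>CEO</title><surname>Smith</surname></executive>
  <patent id="US1234567">Widget fastener</patent>
</company>
"""

root = ET.fromstring(DOCUMENT)

# Field search: look for "Smith" only inside executive surnames,
# ignoring any incidental mention of Smith elsewhere in the page...
hits = [e for e in root.iter("surname") if e.text == "Smith"]
print(bool(hits))  # True

# ...and return just the matching data segment, not the entire document.
print(root.find("executive/surname").text)  # Smith
```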
As XML is also based on Unicode, a character encoding system that supports the intermingling of text in all of the world's major languages, it will also allow the exchange of information across national and cultural boundaries (Bosak and Bray, 1999). Using various XML style sheets (XSL), publishers will also be able to automatically redesign their content for various devices. There are even style sheets that will read the text of a Web page aloud, which is of great benefit to the visually impaired.

However, while XML will deliver great benefits for searching, publishing and exchanging information, these benefits will not be realised without some effort.

First, each industry will need to agree on standards for the tags used to describe information that is specific to its discipline. Mathematicians, genealogists and chemists have already agreed on standards to facilitate the realisation of XML's benefits. In other areas, standards are yet to be agreed upon, and there will be struggles over who controls the standard ("XML and search", n.d.).

Second, Web publishers will require greater sophistication than simply a knowledge of HTML, graphics and a few other applications. They will need new XML tools, and computer programmers and information scientists who will be able to interpret the content of the information being published and provide the appropriate data trees/nested hierarchies, hyperlink structures, metadata, style sheets and document type definitions (DTDs).
Finally, search engines will need to learn the standard tag structures agreed by each industry/interest group. They will also need to change their search interfaces to offer users the choice between text searching and field/tag searching. Currently, text-based search engines return a list of documents that will contain some information relating to the user's request. XML-enabled query searching, like any other query language, will return the relevant data extracted from a document, rather than the entire document. Such query-based searching can also be used to perform computational analysis and to manipulate the presentation of retrieved data items ("XML and search", n.d.).

To facilitate the transition to XML, in August 1999 the W3C released a hybrid of HTML 4.0 and XML (XHTML 1.0) for review. It is highly unlikely that there will ever be an HTML 5.0. Earlier, in April, IBM launched the Internet's first search engine exclusively focused on XML data, called xCentral. This search engine is available from IBM's XML Web site.

The Future

Micro-payments for searches?
Jakob Nielsen predicts "... that in the future we will have micro-payments for search. Realistically, to provide quality information over the long term requires serious effort. Companies have to be compensated for providing that" (Janah, 1999). If users are to be charged micro-payments then they are going to start demanding better refinement technologies for their searches. Much of the search technology innovation over the last 18 months has come from second-generation search companies. By focusing on portalisation and e-commerce, the first generation of search firms has ceded control of technological innovation to its newer equivalents. According to data from PricewaterhouseCoopers and research firm IPO Monitor, in the last year search engine companies have raised more than $274.7 million in private funds and another $282 million in public offerings. Almost all of these funds are going to this second generation of search firms (Investor's Business Daily, 1999).

Searching outsourced?
It is interesting that none of the second-generation search companies has adopted the portal model so beloved of their predecessors, the "big five". Also, if the main portals wish to introduce micro-payments for searches, these second-generation search companies will provide the refinement technologies. In the evolution of Web searching this has created a symbiotic relationship between the two generations: to succeed in attracting as many users as possible, and to generate as many e-commerce sales opportunities as possible, the first generation of search firms will continue to focus their efforts on portalisation and e-commerce. However, they will need the new search technology offered by second-generation firms to provide consumers with the search capabilities they demand - capabilities that will in turn fuel e-commerce consumer buying. The second-generation search firms, for their part, need the portals to attract the consumers who will use their search services.

Taken to its logical conclusion, it is quite possible that one or more of the "big five" search portal firms will drop out of the sporadic yet ongoing search index size war. Instead, they may decide to contract out all search functionality to second-generation firms whose core focus is providing better search technology. By co-opting the Open Directory, and relegating the results from its own index to second place behind those from the Open Directory, Lycos has hinted at the shape of things that may yet come. However, until the market matures somewhat, the "big five" first-generation search portals may feel uncomfortable about completely relinquishing control of search functionality. Instead, they may develop their existing relationships with second-generation firms into an outsourced/partnership model with clearly defined service-level agreements, etc. Such a strategic realignment of their business operations would be in line with current business process outsourcing (BPO) trends and would prove popular with their institutional investors.

Portals go mobile
Using wireless application protocol (WAP), search sites and publishers alike will be able to extend their reach beyond the PC to mobile phones and other hand-held devices such as PDAs. One such portal already launched is Zingo, which has been jointly developed by Lucent Technologies and Netscape. Aimed at telecommunications providers, Zingo also enables HTML pages to be converted into
VXML (voice extensible mark-up language) for audio applications on hand-held devices. Coupled with the reduced bandwidth demand that XML promises to deliver, the future of information retrieval can be anywhere you need it.

Search engine standardisation
Launched by Danny Sullivan of Search Engine Watch (www.searchenginewatch.com), the Search Engines Standards Project aims to foster standards amongst the major search services. Participants include representatives from the largest Web search engines, academics and industry analysts. Some of the common standards that the project has helped to develop include a common syntax for the command to narrow a search to a specific Web site, and the ability for all major search sites to locate an exact URL within their indexes using the URL: command. Future proposals include additional commands for searching and meta tags for controlling search indexing robots.

This voluntary initiative parallels the voluntary efforts to develop standardised XML tag sets for specific industries and interest groups. It would appear that the connectivity provided by the Internet is also encouraging greater collaboration in general. These and other collaborative efforts (such as the Open Directory) represent admirable attempts to create a degree of order.

Order vs. chaos
The tension between these two diametrically opposed forces can be witnessed on the Internet. The relentless growth in activity is reducing the Web to a state of digital chaos. Against this are the commendable efforts of paid indexers and volunteers attempting to create an ordered structure. Could the Web prove to be self-organising, just like any other biological system that evolves towards greater complexity and organisation? In the 9 September 1999 issue of Nature, two surprising research papers were published (Albert et al., 1999; Huberman and Adamic, 1999). Mathematicians had expected the Internet to follow the model of random inanimate networks, but both studies discovered that the Internet did indeed appear to be "evolving" and that its growth resembled organic life. The Internet is evolving according to the universal "power principle" of physics. This power principle governs the order found in many things, ranging from plants to galaxies. Knowing this will help search engine providers develop better algorithms that exploit the predictive behaviour of systems governed by the power principle. Now emerging from its nascent stages, the Web may evolve into a highly organised, vastly diverse and complicated system.
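As a hedged aside, the "power principle" referred to here corresponds to the power-law distributions reported in the two cited Nature papers: the proportion of pages with a given number of incoming links falls off not exponentially, as it would in a random network, but as a power of that number. Schematically (the exponent range is recalled from that literature, not stated in this article):

```latex
% Schematic power-law form for Web link distributions:
% P(k) = probability that a page has k incoming links.
P(k) \propto k^{-\gamma}, \qquad 2 < \gamma < 3
```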
References

Albert, R., Jeong, H. et al. (1999), "Diameter of the World-Wide Web", Nature, 9 September.
Bosak, J. and Bray, T. (1999), "XML and the second-generation Web", Scientific American, May.
Chakrabarti, S., Van den Berg, M. and Dom, B. (n.d.), "Focused crawling: a new approach to topic-specific Web resource discovery", www.almaden.ibm.com/almaden/feat/www8
Clever Team (1999), "Hypersearching the Web", Scientific American, June.
Green, D. (1998a), "Search insider", Information World Review, Vol. 14, 1 November.
Green, D. (1998b), "First through the portal: the business potential of highly trafficked Web sites", Business Information Review, Vol. 15 No. 3.
Green, D. (1999a), "Search insider", Information World Review, Vol. 14, 6 April.
Green, D. (1999b), "In search of success", The Independent, 29 March.
Green, D. (1999c), "Search insider", Information World Review, Vol. 14, 7 May.
Green, D. (1999d), "Here come the X Files", Information World Review, February.
Huberman, B.A. and Adamic, L.A. (1999), "Growth dynamics of the World Wide Web", Nature, 9 September.
Investor's Business Daily (1999), "Computers and technology - investors betting on big hit in new Web search engines", Investor's Business Daily, 2 August.
Janah, M. (1999), "Web directories' profit motive complicates searches by consumers", San Jose Mercury News, 16 August.
Lawrence, S. and Giles, C.L. (1998), "Searching the World Wide Web", Science, Vol. 280, April.
Lawrence, S. and Giles, C.L. (1999), "Accessibility of information on the Web", Nature, 8 July.
Sullivan, D. (1999a), "Search engine sizes", Search Engine Watch, September, www.searchenginewatch.com/reports/sizes.html
Sullivan, D. (1999b), "Company names test", Search Engine Watch, August, www.searchenginewatch.com/reports/names.html
"The top ten referring search engines" (1999), September, www.statmarket.com
The Wall Street Journal (1999), "Web-search firm acquired in $55 million transaction", The Wall Street Journal, 5 August.
"XML and search" (n.d.), Search Tools, www.searchtools.com/related/xml.html
any other search engine provider. Like most other second-generation search providers, the company is focusing on co-branding its technology rather than building its own search portal. In August the company signed a deal with AOL subsidiary Netscape to be the main search provider on the Netcenter portal.
www.google.com

HotBot
Launched in May 1996 by Wired. Acquired by Lycos in October 1998, but continues to be run as a separate service from the Lycos search engine. Accesses the Inktomi search engine index, rather than compiling its own index. However, primary results are derived from Direct Hit, the popularity-based search provider (see above). Directory listings are derived from the Open Directory (see below).
www.hotbot.com

Inktomi
Founded in February 1996, Inktomi is probably the most famous search engine index. It powers the search results for several famous portals and search sites, including HotBot (where it debuted), Yahoo!, AOL, MSN Search and SNAP. However, not all of these companies access Inktomi's full 110 million page index, and there are variations in results between the different search sites due to the different filtering and relevance-ranking algorithms Inktomi provides to each partner company. It is not possible to interrogate the Inktomi index directly.
www.inktomi.com

LookSmart
Launched in October 1996. Like Yahoo!, LookSmart is a human-compiled directory. In addition to providing its own search site, the company also licenses its directory to other companies, including AltaVista (who in turn provide search results to LookSmart whenever there is no match to a user's query within the directory) and, since August 1999, Excite (replacing Excite's own directory). During that same month, the company raised US$92.4 million on its public listing of 7.7 million shares at US$12 each.
www.looksmart.com

Lycos
Launched as a search engine in May 1994. The company rapidly diversified into other areas (AngelFire, Tripod, WiseWire, etc.) and e-commerce has become its primary focus. Although it acquired rival search engine HotBot in October 1998, it switched to a Web directory format in April 1999. Primary results are now derived from the Open Directory (see below), with secondary results coming from its own index. It has also added almost 8,000 databases of information specific to different industries. HotBot continues to be operated as a separate venture.
www.lycos.com

Northern Light
Launched in August 1997. Has continually been one of the largest indexes, gradually increasing in size until it became the biggest search engine (indexing 16 per cent of the Web). This leading position has since been superseded by the launch of FAST in May 1999. The company also offers a special collection of non-Web material, such as newspaper and magazine articles. Whilst it is free to search within the special collection, users must pay a charge (up to US$4) to view any articles from this collection. Search results are clustered in folders by topic. Like AltaVista, this search engine is popular with researchers due to its scope and functionality.
www.northernlight.com

Open Directory
Launched in June 1998. This directory uses volunteer editors to catalogue the Web. The initiative quickly gained prominence and was acquired by Netscape in November of that year. Netscape pledged to allow anyone to use the directory. In April Lycos re-launched itself as a directory service, deriving its primary results from the Open Directory.
http://dmoz.org

RealNames
Launched in 1998. Formerly known as Centraal Corp., RealNames charges companies an annual US$100 to register individual keywords, such as a company name or a brand name. Obviously many companies want to, and do, register many keywords to protect their brands, etc. This has proved a very successful economic model for the company, and in August 1999 it successfully raised over US$70 million from venture capitalists in a third round of financing. Although the index is directly available as a download from the company's Web site, and
is incorporated within Microsoft's Internet Explorer 5 browser, its most notable success has been access from search engines that license its index, such as AltaVista and Go (Infoseek).
www.realnames.com

Yahoo!
Launched in late 1994, Yahoo! has become the most popular search site on the Web, accounting for a staggering 43.5 per cent of all search engine referrals in August 1999 ("Top ten", 1999). It is the Web's largest human-compiled directory, listing over 1 million sites. These directory listings are also supplemented by search results derived from Inktomi's 110 million page search index. Yahoo! launched a new photo search service during the summer.
www.yahoo.com