
An Adaptive Real-Time Web Search Engine

Augustine Chidi Ikeji
Computer Science Department
Eastern Michigan University
Ypsilanti, MI 48197
(734) 487-1063
[email protected]

Farshad Fotouhi
Computer Science Department
Wayne State University
Detroit, MI 48202
(313) 577-3107
[email protected]

Abstract

The Internet provides a wealth of information scattered all over the world. The fact that the information may be located anywhere makes it both convenient for placing information on the Web and difficult for others to find. Conventional search engines can only locate information that is in their search index, and users do not have much choice in limiting or expanding the search parameters. Some web pages, like those of news services, change frequently and do not work well with index-based search engines because the indexed information may become obsolete at any moment.

We propose an efficient algorithm for finding information on the Web that gives the user greater control over the search path and over what to search for. Unlike the conventional techniques, our algorithm does not use an index and works in real time. We save on space, we can search any part of the Internet indicated by the user, and since the search is performed in real time, the result is current.

1. Introduction

The Internet provides huge amounts of information. The amount and variety of information is never going to shrink but rather keeps on growing. One main challenge has been providing fast and efficient techniques for finding information on the net. To complicate matters, because of the huge amount of information available for processing, it is impossible to search the entire Internet. There is an ongoing worldwide project named the Cooperative Online Resource Catalog (CORC) to catalog the Web much like the way libraries were cataloged many years ago, so that books can be found based on author or title. The effort is sponsored by the Online Computer Library Center in Ohio, and it involves many universities across the United States. Even those working on the project acknowledge that not every Web page will be cataloged [3]. This underscores the fact that there will always be information out there that is hard to reach. The goal of the project is to create a huge index of some of the Web sites based on content. This will help, but search engines still have to search this huge index for the desired values.

There are several search engines on the Internet. Each search engine has several indexes that it uses to aid the search. This approach has some drawbacks. The search engines do not use the same indexes and thus give different results for the same query. Also, due to the dynamic nature of the net, the information or links indexed may change or get deleted, which results in users getting wrong results. There is an enormous cost associated with bringing the index up to date, because all links and page contents associated with the index may have to be reviewed.

If there is no index entry for what the user is searching for, the search engines may not find it. Not every possible search phrase can be indexed, but search engines should do better than leaving users on their own for hard-to-find information. It would be nice for users to have the ability to guide the search space, limit the query result set, or limit the query processing time. Some search engines allow the user to restrict the search space by selecting from a list of their own supported indexes, or to limit the result set by specifying how the search keywords are to be matched. However, users cannot indicate an index that is not supported by the search engine. This is quite damaging, because sometimes users have an idea of where what they are searching for may be located, and such user input should be allowed. In addition, the results presented to users are limited to URLs only. Users cannot request other web page attributes like last modified date, title, size, or images.

Conventional search engines do not allow users to issue queries that go beyond whether or not the pages contain certain phrases, such as queries on the date or size of a web page, titles in the page, or multiple occurrences of a phrase. For example, it is not possible to issue a query such as: find the URLs of web pages with multiple references to "President Clinton", no reference to "Lewinsky", with an image that has "Clinton" as part of the title, a size of 5K bytes or larger, last modified within the past 6 months, where the search should start from www.whitehouse.gov and complete in one minute. There is an answer to such a query, but conventional search engines cannot handle it due to their over-dependency on indexed information. Queries where users can specify starting URLs are important because users can then trust the source of the results.

In an environment where the user can place constraints, such as indicating a search path and limiting the search time, the search has to be done in real time since the search path is dynamic. Due to time and space constraints, the search engine will not be able to search all the links originating from the user-specified URLs and thus will have to render a partial result based on a sample of the reachable URLs. The issue then is for the processor to generate an optimal result under the constraints.

In this paper, we propose a real-time search technique that allows a user to influence the search path and the search attributes, as well as place a limit on the search time. The search attributes may involve virtually any information that relates to a web page, such as title, size, date, or images. Users specify starting URLs with their query, and the search is limited to links reachable from the specified URLs. Given a query and the URLs to originate the search from, we process the pages and traverse the links originating from the starting URLs in an order based on the paths most likely to satisfy the query requirements. For example, when searching for information on "admissions" starting from the URL www.ucla.edu, if the two links www.ucla.edu/alumni/alumni and www.ucla.edu/admissions/admissions are found at www.ucla.edu, we will rank www.ucla.edu/admissions/admissions higher than the alumni site because it contains more of the search keyword "admissions". Since we seek out the pages most likely to satisfy the user query and process them first, our technique yields a result that is more pertinent to what the user wants within the given time scope. Our algorithm does not need a-priori information such as an index and works in real time. We save on the space for storing an index, we can search any part of the Internet indicated by the user, and since the search is in real time, the result we present is current.

The rest of the paper is arranged as follows. Section 2 provides an overview of general and Web search techniques as well as their pros and cons. In Section 3, we present our technique and its analysis. We end with the conclusion and areas of future work in Section 4.

2. Search Techniques

The main search techniques can be lumped into two categories - those supported by a-priori information and those that are not. The two approaches are presented below, followed by the Web search techniques.

2.1 A-Priori Information

A-priori information is additional information maintained on top of the actual data. It may be kept on some search keywords in a database, and whenever applicable it is used to aid query processing. The cost of maintaining the information is justified by distributing it across all queries that use it. Different types of a-priori information may be maintained. Some systems maintain what is traditionally known as an index, but a-priori information may also take the form of statistics, like the distribution of certain values, that can be useful for computing aggregates.

A-priori information such as indexes or statistics is not always maintained on all query select conditions. When a query involves a condition that does not have any a-priori information to aid in its processing, the search engine has to process the query as best it can using other means, or reject the query totally. For example, since indexes are not kept on the image titles contained in web pages, conventional search engines will reject a search based on such titles.

2.2 Brute Force (Linear Search)

When there is no index, nor other a-priori statistics to guide a search, the search engine may process the items by brute force [5]. This means that the processor may retrieve as many items as possible and test whether or not each item satisfies the query conditions.

The drawback of the linear search technique is that the search engine makes no attempt to dynamically locate pockets of concentration of data more likely to satisfy the query condition and process such pockets first. This is especially damaging when there is not enough of some resource, such as time, to process all the items. The WWW underscores this problem.
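To make the brute-force approach concrete, here is a minimal Java sketch (our illustration, not part of the original paper): every item is retrieved and tested against the query condition, with nothing guiding the order of the scan.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    // Brute-force (linear) search: retrieve every item and test whether
    // it satisfies the query condition. No index or statistics guide
    // the scan, so items are examined in arrival order.
    class LinearSearch {
        static <T> List<T> search(Iterable<T> items, Predicate<T> condition) {
            List<T> results = new ArrayList<>();
            for (T item : items) {
                if (condition.test(item)) {  // keep only matching items
                    results.add(item);
                }
            }
            return results;
        }
    }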
2.3 Web Search Techniques

Web search engines have three basic components - a robot, a catalog, and a query processor. The robot roams the Internet searching for new information to return to the catalog engine [4]. Information returned by the robot is entered in the DB and indexed so it can be found and retrieved faster in the future. The robot is the most important part of the whole process, because if it does not return a web page, or some other page with a link to it, such a page may never be part of the index and thus may never be included in a search. Each robot has its own algorithm for navigating the Web and determining whether a web page should be returned to the catalog engine. In addition, cycles in the web links are often encountered. If not handled properly, cycles may result in multiple catalog entries or the omission of some sites. Most robots return the main page information at the root of a site and do not bother much with other links that can be reached from the root. For example, a university's main page may be returned by a robot, but links to pages like professors' home pages, academic programs, or sports programs directly or transitively anchored to the page may not be returned or added to the index. Some indexing schemes base the importance of a page on the amount of traffic it receives. Highly trafficked pages are added or retained in the index while lesser-trafficked ones are not.

The catalog is essentially a DB containing information on web pages. Not all catalogs contain the same information. It is obvious that what is returned by the robot greatly influences what goes into the catalog. In addition, the catalog engine has to decide what keywords or phrases to use for indexing the web page. For example, if one web page contains the title "Presidents of United States" and another contains the title "Presidents in United States", both pages are likely to be indexed under "Presidents" and "United States". Although the former title may relate to a list of US presidents and the latter to the role of US presidents, this distinction is lost in the index. A search for "Presidents in United States" using the index is likely to yield both pages.

The query processor is the part that takes user queries, which most of the time are in the form of keywords or phrases, and presents the user with links to the sites that contain information related to their search. Normally, this process involves matching the search phrase with entries in the index and then returning the associated links and summaries of the matched entries to the user. Different techniques may be used in the match process - for example, an exact match, a case-sensitive match, or a "similar" match may be applied. In a similar match, "book" may be matched with "books", "US" with "United States", and "index" with "indices".

One major drawback of these search engines is that the robot and the query processor do not work hand in hand in real time.

The query processor depends on the information in the catalog, and due to the dynamic nature of the Web, such information may change, get deleted completely, or the host may be down temporarily or permanently. Another component may be added to the search engine to periodically check the validity of the catalog information and update it appropriately. This is a major undertaking because of the huge amount of information to be dealt with and how often changes occur. For instance, the web pages of most universities are likely to change from semester to semester, while the pages of news organizations may change every minute.

The search engines traverse their index looking for matching information in their own order, i.e., the user cannot influence what is searched or in what order except via the keywords. The problem is that users sometimes have an idea of where what they are searching for is located and would like the search to start from such sites. For example, one may want to search for information on US embassies originating from www.us.gov, because one wants to trust the source of the result. Also, these search engines display only URLs and sometimes a summary of the page, which most of the time is not a true reflection of the page content. Users cannot specify that the result should include other page attributes like date, size, title, or images in the document.

Given the same query, each search engine may yield a different result, because the results are determined by the way the robots work and the indexing technique used. Even the same search engine may not give the same result when given the same query several times, and this has nothing to do with how much time elapsed between the queries. Some search engines use a bidding process to determine what results to present to the user and in what order. For example, if you search for the keyword "flowers", the search engine may solicit bids from flower shops on how much they will pay for their links to be presented as part of the result. The ultimate result is presented to the user with the highest bidders at the top of the list. Sometimes the best match for what the user wants is at the bottom of the list, or is not presented at all because those pages did not bid or did not bid high enough.

3. The Real-Time Web Search Engine

Our algorithm takes care of the above problems by allowing users to control what needs to be searched, the URLs to start the search from, what is presented in the result, and the amount of time to invest in the search. It also operates in real time and does not need a-priori information such as an index.

The grammar for our search engine is similar to WebSQL [2] with some extra features. For example, we added a time constraint and the ability to specify multiple occurrences of a phrase in the WHERE clause by placing a number just before the phrase. The simplified grammar for the query language is as follows:

    <QUERY> ->
        SELECT <attribute> [, <attribute>]*
        FROM [+<UnsignedInt>]<URL> [, [+<UnsignedInt>]<URL>]*
        WHERE [not] (<phraseCond> | <dateCond> | <sizeCond>)
              [(and | or) [not] (<phraseCond> | <dateCond> | <sizeCond>)]*
        TIME <UnsignedInt> ;
    <attribute>   -> title | lastModifiedDate | size | URL
    <UnsignedInt> -> (0|1|..|9)+
    <phraseCond>  -> (url | title | body | imagetitle) contains [[(+|=)] <UnsignedInt>] <phrase>
    <dateCond>    -> lastModifiedDate <relop> <date>
    <sizeCond>    -> size <relop> <UnsignedInt>
    <relop>       -> < | <= | = | > | >= | !=
    <date>        -> <UnsignedInt>/<UnsignedInt>/<UnsignedInt>

The user may select web page attributes like title, URL address, size, or last modified date. One or more URLs must be specified in the FROM clause. To indicate a maximum depth for the search, a URL may be preceded by a plus sign and a positive integer in the FROM clause. For example, +2www.yahoo.com means that the search should go no more than two path lengths from www.yahoo.com. The WHERE clause may include the Boolean operators "and", "or", and "not". PhraseCond indicates phrases to be matched in the URL, title, body, or image titles of the web page. DateCond and SizeCond compare web page dates and sizes respectively. If the WHERE clause is empty, it is assumed to be true. Lastly, a time limit is required; the unit is seconds.
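To make the grammar concrete, the following Java sketch shows one possible in-memory form of a parsed query. It is our own illustration: all class and field names (Query, StartUrl, Condition, and so on) are assumptions that mirror the grammar's nonterminals, not the authors' actual parser.

    import java.util.List;

    // One possible in-memory form of a parsed query (illustrative only).
    class Query {
        List<String> selectAttributes;  // title | lastModifiedDate | size | URL
        List<StartUrl> fromUrls;        // one or more starting URLs
        List<Condition> where;          // WHERE conditions; the and/or/not
                                        // connectives are omitted for brevity;
                                        // an empty list means "true"
        int timeLimitSeconds;           // required TIME value, in seconds
    }

    class StartUrl {
        String url;                     // e.g. "www.yahoo.com"
        Integer maxDepth;               // from "+2www.yahoo.com"; null if absent
    }

    // A leaf condition, e.g.  body contains +2 "President Clinton"
    // or  size >= 5000  or  lastModifiedDate > 5/30/1999
    class Condition {
        String target;                  // url | title | body | imagetitle | size | lastModifiedDate
        String op;                      // contains | < | <= | = | > | >= | !=
        String phrase;                  // phrase to match, for phrase conditions
        Integer count;                  // multiplicity, if given
        boolean exactCount;             // true for =n (exactly n), false for +n (at least n)
        boolean negated;                // condition preceded by "not"
    }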
We have both a command-line and an applet-based search. In the command-line version, users may enter queries based on the grammar. The applet has checkboxes for the select attributes, three text fields for the FROM, WHERE, and TIME values respectively, as well as a GO button to press when done entering the parameters. The result of the query is presented in a list, and users may click on a URL in the list to have the web page opened. Figure 1 shows what the applet looks like.

Figure 1. The browser-based search engine.

The following are some sample queries that may be entered.

1. Select URL, lastModifiedDate
   From +3www.ucla.edu, +5www.harvard.edu
   Where body contains "research"
   Time 60;

This will display the URLs and last modified dates of web pages originating from www.ucla.edu and www.harvard.edu that contain the phrase "research" in their body. The maximum depth of the links searched in the www.ucla.edu part is three, and the depth for www.harvard.edu is five. Also, the search concludes in 60 seconds.
2. Select title
   From www.whitehouse.gov
   Where body contains +2 "President Clinton" and
         not body contains "Lewinsky" and
         body contains =3 "Kosovo" and
         imageTitle contains "Clinton" and
         lastModifiedDate > 5/30/1999
   Time 120;

Query 2 displays the titles of web pages originating from whitehouse.gov that contain two or more references to "President Clinton", no reference to "Lewinsky", exactly three references to "Kosovo", and an image title containing "Clinton", and that were last modified after May 30th, 1999; the query should end in 120 seconds.
Given a query, we assign the highest priority to each of the URLs in the FROM clause and place them in a heap. They get the highest priority because we want to make sure that all URLs in the initial set given by the user get processed, as long as the time is sufficient. Next, we create threads that work on the query. The threads work in parallel and communicate results directly back to the client. Each thread takes the current highest-ranking node in the heap and processes it. Nodes removed from the heap are replaced with the last node in the heap, if available, before a percolate-down (towards the leaf nodes) is applied from the root to maintain the heap-order property. One heap is used, and all threads interact with this heap. Insertion and removal of heap nodes are synchronized so that only one thread can be in that section at any time. This is essential because if one thread is reading from the heap while another is writing to it, the heap may become inconsistent. The use of threads is essential because hosts are sometimes slow in replying to a server program with web page information, especially when the requested page is not available. With threads, this slowness is not as damaging: should one thread block waiting for a reply from a host, the other threads have a chance of continuing and sending something back to the user.
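Such a shared heap can be sketched in Java as follows. This is our own minimal rendering (the class and method names are assumptions, not the paper's source), using synchronized methods so that only one thread at a time can insert or remove a node:

    import java.util.PriorityQueue;

    // A URL waiting to be processed, ordered by its rank (highest first).
    class UrlNode implements Comparable<UrlNode> {
        final String url;
        final int rank;
        UrlNode(String url, int rank) { this.url = url; this.rank = rank; }
        public int compareTo(UrlNode other) {
            // Reversed comparison turns Java's min-heap into a max-heap.
            return Integer.compare(other.rank, this.rank);
        }
    }

    // Shared heap of pending URLs. All worker threads insert extracted
    // URLs and remove the current highest-ranking node through these
    // synchronized methods, so the heap cannot become inconsistent.
    class SharedUrlHeap {
        private final PriorityQueue<UrlNode> heap = new PriorityQueue<>();

        synchronized void insert(UrlNode node) {
            heap.add(node);      // percolates internally to keep heap order
        }

        synchronized UrlNode removeHighest() {
            return heap.poll();  // null when the heap is currently empty
        }
    }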
The processing of a node involves retrieving the web page contents corresponding to the URL and searching them. Searching a page entails matching the content against the WHERE clause of the query and extracting the other URLs contained in the page. If the page URL and contents satisfy the WHERE clause, the SELECT attributes are presented to the user. URLs extracted from processed pages are assigned priorities based on how well their address text, and the page they are contained in (the anchor), match the query's WHERE clause. The ranking process is as follows. First, the extracted URL inherits half the rank of its parent or 1000 points, whichever is smaller. Next, the rank is increased by 800 points for each desirable phrase in the WHERE clause that the URL contains, and decreased by 800 points for each undesirable phrase in the WHERE clause that it contains. For example, if the WHERE clause in the query states "body contains "admissions" and not body contains "alumni"", the URL www.ucla.edu/admissions will get its rank increased by 800 points, the URL www.ucla.edu/alumni will decrease in rank by 800 points, and the URL www.ucla.edu/registration will not be affected. This is the rank used in placing the URL in the heap, and it determines how soon the corresponding page will be processed. When a URL is removed from the heap and processed, we modify the rank again based on the new information obtained from the page content, size, and last modified date. This rank modification only affects the children URLs extracted from the page, since they inherit 1000 points or half the final rank of their parent. The rank is increased by 400 points for each size or date condition the page satisfies and decreased by 400 points for each size or date condition it fails. The rank is increased by 700 points for each title or image condition satisfied, and reduced by 700 points for each unmatched title or image condition. Similarly, phrases in the body of the page have a weight of 300 points.

Our rationale for the above values is as follows. Sometimes the relationship between a parent web page and a child web page is useful in determining whether the child is likely to satisfy a query, given that the parent satisfied it. We assume the child's URL is anchored in the parent's web page. For example, suppose we are searching for the keyword "computers" and start with two web pages given by the user. If one of them has more occurrences of "computers" than the other, we would hope that the children of the page with more occurrences of "computers" are more likely to relate to "computers". It is also possible, however, that some of the children descending from the page with more occurrences of "computers" have no information relating to "computers". Since some children may have the desired values and others may not, we let the children inherit half the parent's rank, up to a maximum of 1000 points. Allowing a child to inherit all of the parent's points could lead our algorithm down the wrong path for a long time. Given several children within a parent, we pad a child's inherited rank if its URL (the child's page is never retrieved at this point) contains desirable phrases from the WHERE clause. The weights added during the processing of a parent are justified as follows. We deem phrases matched in URLs the most important: for example, if a web page has the URL www.computers.com, it is very likely that the corresponding page will contain many occurrences of "computers". Next in importance are phrases in the title and image parts of the page. Phrases in the body are not as important, but several phrases matched in the body can outweigh phrases matched in other areas. Lastly, the size and date parameters are not good indicators of the relationship between parent and child, but since there is only one size and one date per page, we weighted them slightly higher than a single phrase matched in the body. The final URL rank is shown next to each URL in the result list. This may serve as a guide to the user by showing how well the URL and its corresponding page content satisfy the query conditions.

    harvard.edu (priority 1300)
        harvard.edu/academics/ (priority 600)
            nytimes.com (priority 300)
            aol.com (priority 400)
        news.harvard.edu (priority 600)
            ebay.com (priority 500)
            amazon.com (priority 200)
        ...

Figure 2. Heap data structure for URLs waiting to be processed.
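The point scheme above can be summarized in code. The following Java sketch is our own condensed rendering of the ranking rules (the names and the exact condition bookkeeping are assumptions, not the authors' implementation), covering both the initial rank of an extracted URL and the adjustment made once its page has been retrieved:

    // Point weights taken from the ranking rules described above.
    class Ranker {
        static final int PHRASE_IN_URL  = 800;  // WHERE-clause phrase found in the URL text
        static final int TITLE_OR_IMAGE = 700;  // title or image condition matched/unmatched
        static final int SIZE_OR_DATE   = 400;  // size or date condition satisfied/failed
        static final int PHRASE_IN_BODY = 300;  // phrase matched/unmatched in the page body
        static final int MAX_INHERITED  = 1000; // cap on rank inherited from the parent

        // Initial rank of an extracted URL: inherit half the parent's rank
        // (at most 1000 points), then add or subtract 800 points for each
        // desirable or undesirable WHERE-clause phrase in the URL text.
        static int initialRank(int parentRank, String urlText,
                               Iterable<String> desirablePhrases,
                               Iterable<String> undesirablePhrases) {
            int rank = Math.min(parentRank / 2, MAX_INHERITED);
            for (String p : desirablePhrases)
                if (urlText.contains(p)) rank += PHRASE_IN_URL;
            for (String p : undesirablePhrases)
                if (urlText.contains(p)) rank -= PHRASE_IN_URL;
            return rank;
        }

        // Adjustment once the page itself has been retrieved: +/-400 per
        // size or date condition, +/-700 per title or image condition,
        // +/-300 per phrase condition on the body.
        static int adjustForPage(int rank,
                                 int sizeDateMatched, int sizeDateFailed,
                                 int titleImageMatched, int titleImageFailed,
                                 int bodyMatched, int bodyFailed) {
            rank += SIZE_OR_DATE   * (sizeDateMatched   - sizeDateFailed);
            rank += TITLE_OR_IMAGE * (titleImageMatched - titleImageFailed);
            rank += PHRASE_IN_BODY * (bodyMatched       - bodyFailed);
            return rank;
        }
    }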
Processing of nodes removed from the heap and insertion of new nodes into the heap are repeated until the allotted time is exhausted. A sample tree structure of the heap and pseudo-code for the algorithm are summarized in Figures 2 and 3 respectively.

    // Java-based pseudo-code of the algorithm.

    - get query
    - parse query
    - place starting URLs into the heap
    - while currentTime < endTime {
          // if the heap is empty, retry, because one of the other
          // threads may insert something later
          if ((h = heap.getNextItem()) == null) {
              - update currentTime
              - continue
          }
          // get web page information
          try {
              - c = connection to web page at h.URL
              - if needed, get modified date and size of web page
              - read web page via connection c
          }
          catch (Exceptions) {
              - handle exceptions
          }
          - extract URLs from the web page and compare the WHERE
            clause to the web page values
          - rank the web page based on how well it fits the WHERE clause
          - if the WHERE clause is satisfied, send the SELECT
            attributes to the client
          - insert the extracted URLs into the heap
          - update currentTime
      } // while currentTime < endTime

Figure 3. Pseudo-code of the algorithm.
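For readers who want something closer to running code, the following Java sketch fills in one worker thread's loop. It is our own reconstruction under stated assumptions (the SharedUrlHeap and UrlNode classes sketched earlier, URLs that include a scheme, and hypothetical matchesWhere/sendToClient/extractUrls helpers), not the authors' source:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // One worker thread: repeatedly take the highest-ranked URL from the
    // shared heap, fetch the page, test it against the WHERE clause, and
    // insert the URLs it contains back into the heap.
    class SearchWorker implements Runnable {
        private final SharedUrlHeap heap;
        private final long endTimeMillis;

        SearchWorker(SharedUrlHeap heap, long endTimeMillis) {
            this.heap = heap;
            this.endTimeMillis = endTimeMillis;
        }

        public void run() {
            while (System.currentTimeMillis() < endTimeMillis) {
                UrlNode node = heap.removeHighest();
                if (node == null) continue;  // another thread may insert later
                try {
                    URL url = new URL(node.url);  // assumes scheme is present
                    HttpURLConnection c = (HttpURLConnection) url.openConnection();
                    long lastModified = c.getLastModified();    // date, if needed
                    long size = c.getContentLengthLong();       // size, if needed
                    StringBuilder body = new StringBuilder();
                    try (BufferedReader in = new BufferedReader(
                             new InputStreamReader(c.getInputStream()))) {
                        String line;
                        while ((line = in.readLine()) != null)
                            body.append(line).append('\n');
                    }
                    // Hypothetical helpers: test the WHERE clause, report
                    // the SELECT attributes, then rank and enqueue children.
                    // if (matchesWhere(body, size, lastModified)) sendToClient(node);
                    // for (String child : extractUrls(body))
                    //     heap.insert(new UrlNode(child, Ranker.initialRank(...)));
                } catch (Exception e) {
                    // unreachable host, malformed URL, etc. - skip this node
                }
            }
        }
    }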

The algorithm was implemented in the Java programming language on a 450 MHz IBM PC compatible with a 56K modem and 64MB of RAM. We tested it with different queries; the following are some of the results obtained from the command-line version of the algorithm.

Query 1:  Select URL, size
          From www.whitehouse.gov
          Where URL contains "kids"
          Time 60;
Result 1: 67 distinct URLs/sizes were presented.

Query 2:  Select URL, lastModifiedDate
          From +3www.ucla.edu, +5www.harvard.edu
          Where body contains "research"
          Time 60;
Result 2: 46 distinct URLs/dates were presented.

Query 3:  Select URL
          From www.harvard.edu
          Where body contains "gun"
          Time 120;
Result 3: 150 distinct URLs were presented.

Query #3 illustrates another application of our algorithm - it can be used to periodically check for occurrences of certain keywords or phrases at web sites, or at links descending from such sites.

4. Conclusion

In this paper, we presented an algorithm that allows a user to indicate any attribute of a web page - size, date, URL, title, image, or contents - as part of their query condition or result. The algorithm also allows the user to indicate when a keyword is to be matched multiple times, the starting URLs, as well as a time limit for the search. The search process prioritizes the URLs based on their likelihood of satisfying the user query, and we process the URLs in order of highest priority. By spending most of the time on pages more likely to match the query, the chances of obtaining a "good" result in terms of quality and quantity are increased. In addition, the whole process happens in real time, so when a user is presented with a result, there is a strong likelihood that it is valid. The algorithm does not require a-priori information such as an index. None of the index-based search engines can do what is presented in our algorithm. Even if they extended their indexes to cover every parameter, such as the number of occurrences of each keyword and the date and size of each page, there is no way this can be done for all sites out there - our algorithm works with any site.

A possible extension of this algorithm is to combine it with index-based search engines, since indexes can be faster when applicable. The algorithm can also be extended to do join operations on web pages. For example, if a user wants to know what the web pages descending from www.fbi.gov and www.cia.gov have in common, this can be done by extracting URLs starting from www.fbi.gov and comparing the web pages to those from www.cia.gov. A limit would have to be specified on the maximum path length to traverse, and the comparison might even be limited to URLs, page titles, or image titles only.

References

[1] K. Mahalingam and M. Huhns, "A Tool for Organizing Web Information", IEEE Computer, pp. 80-85, June 1997.

[2] A. Mendelzon, G. Mihaila, and T. Milo, "Querying the World Wide Web", Journal of Digital Libraries 1(1), pp. 68-88, 1997.

[3] J. Miller, "WCC will be riding herd in efforts to catalog Web sites", The Ann Arbor News, p. B2, Wednesday, February 24, 1999.

[4] R. Peters and R. Sikorski, "Smarter Searching", Science Magazine, Vol. 277, p. 976, August 1998.

[5] R. Elmasri and S. B. Navathe, "Fundamentals of Database Systems", The Benjamin/Cummings Publishing Company, Inc., Redwood City, California, 1989.

