CHAPTER 2
LITERATURE SURVEY
Recent studies show that a significant fraction of Web content cannot be reached by simply following links as in [17, 18]. In particular, a large part of the Web is hidden behind search forms and is reachable only when users type a set of keywords, or queries, into these forms. These pages are often referred to as the Hidden Web as in [19] or the Deep Web as in [17], because search engines typically cannot index the pages and do not return them in their results (thus, the pages are essentially hidden from ordinary Web users).
According to many studies, the size of the Hidden Web is increasing rapidly as more organizations make their valuable content accessible online through easy-to-use Web interfaces as in [17]. In [18], Chang et al. estimate that well over 100,000 Hidden-Web sites currently exist on the Web. Moreover, the content provided by many Hidden-Web sites is often of very high quality and can be extremely valuable to many users as in [17]. For example, PubMed as in [20] hosts many high-quality papers on medical research selected through careful peer-review processes, while the site of the US Patent and Trademark Office as in [21] makes existing patent documents available, helping potential inventors examine prior art. This dissertation studies how to build a search engine for the Hidden Web, i.e., how to effectively collect data from the Hidden Web and enable users to search it.
Information retrieval (IR) is broadly concerned with the representation, storage, organization of, and access to information items. The overall goal of an IR system can be stated as
to provide items that are relevant to a user's information need. In the context of text retrieval, which is the focus of this thesis, information items typically correspond to documents. Since an information need exists only in the mind of the user, the IR system can at best estimate it. This task is further aggravated by the fact that both queries and documents are semantically ambiguous expressions of the underlying needs and contents, which rules out an exact match between information needs and items, as would be the case in a data retrieval system. In order to satisfy a query, an IR system must be able to first understand the information need underlying this query. In turn, this information need may convey distinct user intents, from a general search for information about a topic, to a search for a particular website.
Web search engines are perhaps the most widely used instantiation of an IR system. A recent report revealed that at least 100 billion searches are conducted on the leading commercial web search engine each month, amounting to over 3.3 billion searches each day. Besides understanding the information needs of such a mass of users with varying interests and backgrounds, web search engines must also strive to cope with the sheer scale of the Web, which comprises tens of billions of uniquely addressable documents. While the lack of a central control is key for the Web's growth, it also results in great variability in the produced content, from its language and writing style, to its authoritativeness and trustworthiness. Another characteristic that sets the Web apart from traditional information repositories is its interconnected nature. Indeed, not only do web authors publish massive amounts of information, but they also create links (also known as hyperlinks) between the published information. As a result, the Web can be viewed as a directed graph, with documents represented as nodes and the hyperlinks between documents represented as directed edges. Understanding the web graph is crucial for understanding the structure and dynamics of the Web itself, but it also plays a fundamental role in designing effective and efficient web search engines. The scale, diversity, and dynamic nature of the Web make it a particularly challenging environment for search. To cope with this challenge, web search engines are typically designed with three core components: a crawler, an indexer, and a query processor.
2.1.1 Crawling
Crawling is the process by which search engines collect documents from the
Web into a local corpus. Such a corpus can then be processed by the search engine in
order to allow users to efficiently locate information. The overall goal of crawling is to gather as many useful documents as possible in as little time as possible. To this end, a web crawler must maximize its crawling rate, while making efficient use of its own resources, as well as the resources of the servers that host the desired documents.
Crawling the Web can be seen as a graph traversal problem. As shown in the figure
below, at all times, the crawler maintains a list of URLs to be visited, the so-called
crawling frontier, which is initially filled with a few seed URLs. While the frontier is
not empty, the next URL to be visited is removed from it and downloaded by a fetcher
module, after a DNS resolver translates the URL domain into an IP address. The
fetched document is processed by the crawl controller and the extracted contents are
stored locally for indexing. The URLs extracted from this document (and, for continuous crawls, the document's own URL) are inserted back into the frontier, so that they can be visited in subsequent iterations.
While new documents are created and existing ones are modified at a massive
scale, the resources available for crawling, notably storage and bandwidth, are
limited. To make crawling scalable, web crawlers must consider carefully which
URLs to visit, and how often to revisit each URL. The decision of which URLs to
visit depends on the predicted usefulness of each URL regardless of any particular
query. Such a decision could be based on the global importance of the document
referred to by the URL or its perceived quality. However, in practice, it has been observed that even a simple breadth-first visiting policy identifies important pages early in the crawling process. The decision of how often to
revisit a particular URL can be even more involved. With the dynamic nature of the
Web, by the time a web crawler has finished crawling its frontier, many events could
have happened. These events can include the creation, update, or deletion of documents. Moreover, different pages change at very different rates; for instance, documents related to news, sports, and personal pages tend to change more frequently than others. In addition, recent
years have witnessed the emergence of social media, which encourage real-time
publishing on collaborative projects, blogs, microblogs, social networking sites, and
virtual game worlds. To provide access to the wealth of information on the Web, a
crawler must be able to adapt itself to the publishing patterns of such heterogeneous
outlets, e.g., by crawling more often those pages that change more often.
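To make the crawling workflow described above concrete, the following Python sketch shows a minimal frontier-driven crawl loop. It is only an illustration of the generic process (the regular-expression link extractor and the simple in-memory frontier are simplifying assumptions), not the crawler developed in this dissertation.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def extract_links(html, base_url):
    # Resolve relative links against the page URL and keep only http(s) URLs.
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)
        if absolute.startswith("http"):
            links.append(absolute)
    return links

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URLs waiting to be visited (seeded initially)
    visited = set()
    corpus = {}                          # local store: URL -> page content
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()         # next URL from the crawling frontier
        if url in visited:
            continue
        visited.add(url)
        try:
            # Fetcher: DNS resolution and download are delegated to urllib.
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                     # skip unreachable or malformed pages
        corpus[url] = html               # store content locally for indexing
        for link in extract_links(html, base_url=url):
            if link not in visited:      # push newly discovered URLs into the frontier
                frontier.append(link)
    return corpus

A production crawler would additionally honour robots.txt, limit the request rate per host, and prioritize the frontier according to the predicted importance and change rate of each URL, as discussed above.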
2.1.2 Indexing
Indexing is the process of deriving a representation of the documents in the local corpus suitable for automatic processing by a search engine. The devised
document representations are then stored in appropriate data structures for efficient
access by the query processor. Given a corpus of documents (e.g., crawled from the
Web), each document is indexed following the general process illustrated in Figure
below. Initially, a parser extracts the textual content from each document. The
extracted content is then processed by a tokenizer, which splits the raw text into individual tokens, or terms. Finally, an indexer assigns an identifier to each distinct term and records their occurrences in each document. In this process, two main data
structures are created, which are at the core of modern indexing architectures. The
first of these is a lexicon, which stores information for all unique terms in the corpus,
such as their total number of occurrences and the number of documents where they
occur. The second structure is an inverted file, which stores, for each term in the lexicon, a posting list recording information about the occurrences of the term in different documents, such as the frequency of the term in each document. To enable
efficient storage and retrieval, both structures are typically compressed. Indexing may
be performed in a single batch, in which case the whole corpus must be re-indexed whenever it changes, or incrementally, as new documents arrive. In either case, parsing web documents can be a complex task. With the global and democratic nature of the
Web, web documents can have a variety of content types and character encodings,
which may not be immediately identifiable from the document itself (in an HTML
header) or from its provider (in an HTTP response header). Even pure textual content
may contain noise. Indeed, web documents typically comprise irrelevant content
besides their core topic, such as advertisements, client-side scripting code, and
frequently a whole HTML template structure. Such noisy content can hurt not only
the effectiveness of a search engine, but also its efficiency, since more content needs
to be stored and processed. In order to remove noise and extract cleaner content for indexing, template and boilerplate removal techniques are commonly applied before tokenization. Tokenization itself is a relatively trivial task for most western languages, in which tokens can be separated by white space and punctuation. However, some languages such as German do not separate compound words. In the extreme, East Asian languages such as Chinese, Japanese, and Korean have no word boundaries at all; similar segmentation problems arise in several other languages.
Not all identified tokens are directly useful for search. For this reason, each
token can be analyzed and transformed through a series of text operations before
being indexed. For instance, a search engine can choose not to index too common
terms. Such terms, known as stop words, possess little discriminative power for distinguishing relevant from non-relevant documents; moreover, their presence can also impact efficiency, since their posting lists can be almost as long as the number of documents in the corpus. Besides stopword removal, another common text operation is stemming, a process that reduces multiple word variants to their common root, or stem. Stemming can improve recall, since it allows the engine to match documents that contain a different variant of the query terms. For instance, after stemming, the terms retrieval, retriever, and retrieving can all be reduced to their common root,
retrieve. Alternatively, the search engine may choose to index all the identified
tokens in their original form, in which case text operations are delayed until the query
processing stage. This choice is more flexible, as it allows for text operations to be
deployed only when they are predicted to be helpful. Different information about
terms, documents, and the occurrence of terms in documents can be indexed. The
most basic information, which is one of the pillars for query-dependent ranking, is the
frequency of a term in a document. Recording the position where each term occurs in
each document can also help improve the effectiveness of a search engine. For
instance, the terms information and retrieval appearing next to each other can be a
strong indicator of the relevance of a document for the query information retrieval.
In addition, term frequency and positional information can be recorded for different
fields of a document, such as its title, URL, or body. Another valuable source of
evidence, which conveys how a document is described by the rest of the Web, is the
anchor text of the incoming hyperlinks to this document. Finally, several other
features that can help infer the prior relevance of a document regardless of any query can also be computed and stored at indexing time.
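To illustrate the lexicon and inverted file described above, the following Python sketch builds both structures over a toy corpus; the tokenization rule, the stopword list, and the crude suffix-stripping stemmer are simplifying assumptions for illustration only, not the actual text operations of any particular engine.

import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}   # toy stopword list

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (western-language assumption).
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(term):
    # Extremely crude suffix stripping, for illustration only.
    for suffix in ("ing", "er", "al", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_index(corpus):
    """corpus: dict mapping document id -> raw text."""
    lexicon = defaultdict(lambda: {"df": 0, "cf": 0})   # document and collection frequency
    inverted = defaultdict(dict)                        # term -> {doc_id: [positions]}
    for doc_id, text in corpus.items():
        for position, token in enumerate(tokenize(text)):
            if token in STOPWORDS:
                continue
            term = stem(token)
            postings = inverted[term]
            if doc_id not in postings:
                postings[doc_id] = []
                lexicon[term]["df"] += 1                # one more document contains the term
            postings[doc_id].append(position)
            lexicon[term]["cf"] += 1                    # one more occurrence overall
    return lexicon, inverted

docs = {1: "Information retrieval on the Web", 2: "Retrieving web documents"}
lexicon, inverted = build_index(docs)
print(inverted["retriev"])   # positions of the stemmed term in each document

In a production system, both structures would be compressed and stored on disk, and term positions and field information would be encoded compactly within the posting lists.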
When a user poses a query, the search engine examines its index structures to locate
the most relevant documents for this query. Given the size of the Web and the short
length of typical web search queries, there may be billions of matching documents for
a single query. The search engine must therefore rank the matching documents, so that the most relevant documents are presented ahead of less relevant
ones. Query processing consists of three basic operations. Initially, the search engine must interpret the submitted query as an expression of the user's information need. This query may go through a series of query understanding operations, aimed at overcoming the gap between the user's information need and the ill-defined representation of this need in the form of a query. This stage is important, since misinterpreting the user's information need implies that relevant documents may
never be returned, regardless of how sophisticated the subsequent retrieval is. Once a
suitable representation of the user's query has been created, a matching process
retrieves the indexed documents that contain the query terms. Lastly, to ensure that
the user is presented with the most likely relevant documents for the query, the
retrieved documents are scored and sorted by a ranking process. Query understanding
aims at deriving a representation of the user's query that is better suited for a search engine than the query originally submitted. Typical query understanding operations are query topic classification, aimed at restricting the scope
of the retrieved documents, and query expansion, to enhance the query representation
with useful terms from the local corpus or from external resources, such as a query
log or a knowledge base such as Wikipedia. Users typically expect instant responses
from a web search engine. This makes it inefficient to fully score all documents that match a query. Instead, query processing is commonly organized as a layered matching process. In the first layer, matching documents from the entire corpus are returned as an unordered set using a standard Boolean retrieval approach. The second layer then applies a relatively inexpensive ranking function to produce an approximate overall ordering of the initially matched documents at a low cost. This cost can be made even lower by deploying efficient matching techniques, so as to short-circuit the examination of the posting lists of documents that will not make the final ranked list.
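The layered matching process can be illustrated with the small Python sketch below, which assumes a toy inverted index of the kind built in the earlier indexing example; the Boolean AND match produces the unordered candidate set of the first layer, and a simple, purely illustrative TF-IDF sum stands in for the inexpensive ranking function of the second layer.

import math

inverted = {                       # toy index: term -> {doc_id: [positions]}
    "retriev": {1: [1, 3], 2: [0]},
    "web": {1: [4], 2: [1]},
    "document": {2: [2]},
}
NUM_DOCS = 2

def boolean_and_match(query_terms, index):
    # First layer: intersect posting lists to obtain an unordered candidate set.
    postings = [set(index.get(term, {})) for term in query_terms]
    return set.intersection(*postings) if postings else set()

def rank(query_terms, candidates, index, num_docs):
    # Second layer: score the candidates with a simple TF-IDF sum and sort them.
    scores = {}
    for doc_id in candidates:
        score = 0.0
        for term in query_terms:
            positions = index.get(term, {}).get(doc_id, [])
            if positions:
                tf = 1.0 + math.log(len(positions))
                idf = math.log(1.0 + num_docs / len(index[term]))   # smoothed IDF
                score += tf * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

query = ["retriev", "web"]        # query terms after the same text operations as indexing
candidates = boolean_and_match(query, inverted)
print(rank(query, candidates, inverted, NUM_DOCS))   # ranked (document id, score) pairs

Production engines further reduce this cost with dynamic pruning techniques that stop examining the posting lists of documents which cannot enter the final ranked list.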
2.2.1 Benefits
There are several benefits in making Hidden-Web content searchable. Since a large portion of Web users depend on search engines to discover pages, pages that are not indexed by search engines are unlikely to be viewed by many users; unless users go directly to Hidden-Web sites and issue queries there, they cannot access these pages. Even when users are aware of a number of relevant Hidden-Web sites, they currently need to waste a considerable amount of time and effort visiting each of them and inspecting the results. By making the Hidden-Web pages searchable at a central location, this wasted time and effort can be reduced significantly. Moreover, because so many Web users rely on search engines for finding information, search engines influence how users perceive the Web as in [22]. Users do not necessarily perceive what actually exists on the Web, but what is indexed by search engines as in [22]. According to a recent article as in [23], a few organizations have already recognized the importance of exposing their Hidden-Web content to search engines.
The Deep Web comprises information that exists on the Web but is unavailable to text search engines through conventional crawling and indexing as in [24]. The most common way to access this information is to manually fill in HTML search forms. The Deep Web is growing at such a fast pace that accurately estimating its size has itself become a difficult problem as in [26, 27, 28, 29]. In the past, researchers have proposed numerous solutions for making Deep Web information more accessible to users. Ru and Horowitz as in [30] group these solutions into two classes, the first being the dynamic content repository. More broadly, existing solutions fall into the following goal-based categories: (i) solutions that increase the visibility of Deep Web content on text search engines, for example harvesting/indexing dynamic pages as in [31, 32] and building repositories of dynamic page content as in [33, 34, 35]; (ii) solutions that increase the intra-domain query capability, for example meta-search engines as in [36, 37, 38, 39, 40, 41]; and (iii) solutions that perform knowledge aggregation, for example ontology-based approaches. Automatic understanding of search interfaces is
essential to achieving any of the previously stated solutions. The Deep Web holds no fewer than 10 million high-quality search interfaces as in [43], and automatic processing of such interfaces is difficult: an interface consists of content labels and structural components (text boxes, selection lists, and so on), it does not follow a standard layout of components, and there is an unbounded number of possible layout designs as in [45]. Thus, while human users can easily interpret an interface based on past experience and surface cues, machine processing of an interface is challenging.
2.2.2 Challenges
Given the enormous size of the Hidden Web and the fact that the information is only accessible through a search form, building a search engine for the Hidden Web poses several challenges, described below. Crawlers are programs that traverse the Web automatically and download pages for search engines. Traditional crawlers rely mainly on the hyperlinks present on Web pages to discover and download documents. Due to the lack of links pointing to Deep Web documents, current search engines cannot discover and index the Hidden-Web documents.
The data on the Web nowadays is highly volatile. New and improved content is constantly added to serve the users' needs in a better way. At the same time, potentially useful content is removed from the Web at an administrator's whim. Once our crawler has downloaded the information from the Hidden-Web, it needs to periodically refresh its local copy in order to enable users to search for up-to-date information. Since refreshing also consumes resources, it is very important for the crawler to select carefully which documents to refresh: if the crawler downloads pages that have not changed, then it is simply wasting time and bandwidth instead of updating the local copy. Therefore, one challenge that the crawler has to face is being able to identify which pages on the Hidden-Web have changed since they were last downloaded.
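One simple way to detect such changes, sketched below in Python under the assumption that cheap probe pages (for example, result index pages returned for a probe query) are available, is to compare a content digest of each page against the digest recorded during the previous crawl. This is only an illustrative heuristic, not the refresh policy adopted in this dissertation.

import hashlib

def digest(content):
    # A stable fingerprint of the page text; identical text yields an identical digest.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def pages_to_refresh(previous_digests, current_pages):
    """previous_digests: dict url -> digest from the last crawl.
    current_pages: dict url -> freshly fetched content (e.g., probe result pages).
    Returns the URLs whose content has changed or that are new."""
    changed = []
    for url, content in current_pages.items():
        if previous_digests.get(url) != digest(content):
            changed.append(url)
    return changed

old = {"http://example.org/a": digest("result listing v1")}
new = {"http://example.org/a": "result listing v2",
       "http://example.org/b": "a newly discovered page"}
print(pages_to_refresh(old, new))   # both URLs need (re-)downloading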
Once the Hidden-Web pages have been downloaded, the users must be enabled to search them. Search engines typically maintain large inverted indexes which are replicated dozens of times for scalability. Given the enormous size of the available information on the Hidden Web, this task can become very costly. One way to reduce the cost is to replace some of the full inverted indexes with smaller, pruned inverted indexes as in [46, 47], where only the most frequently accessed portions of the index are stored. While this approach can provide major efficiency gains, it may hurt the search results when the top answers are computed only from the pruned index. Since our goal is to provide the users with Hidden-Web results of the highest quality, a further challenge is how to avoid any degradation of result quality caused by index pruning.
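A minimal sketch of such pruning is shown below in Python; the choice of keeping the k highest-frequency postings per term is an assumption made purely for illustration, while references [46, 47] describe more principled pruning criteria.

def prune_index(inverted, k):
    """inverted: term -> {doc_id: term_frequency}.
    Keep, for each term, only the k postings with the highest term frequency."""
    pruned = {}
    for term, postings in inverted.items():
        top = sorted(postings.items(), key=lambda item: item[1], reverse=True)[:k]
        pruned[term] = dict(top)
    return pruned

full_index = {"patent": {1: 7, 2: 1, 3: 4}, "medical": {2: 5, 3: 2}}
print(prune_index(full_index, k=2))
# {'patent': {1: 7, 3: 4}, 'medical': {2: 5, 3: 2}}

One common strategy is to evaluate a query against the pruned index first and fall back to the full index only when the pruned postings cannot guarantee the correct top answers.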
Lately, many sites on the Web observe an ever-increasing share of their traffic originating from search engine referrals, and for many commercial Web sites this traffic translates directly into revenue. As a consequence, some malicious Web site operators try to promote their pages within search results by crafting bogus or spam Web pages which aim solely at manipulating search engines and are useless to any human user. In the case of the Hidden-Web, these malicious Web site operators may try to pollute our index by injecting spam content into their Hidden-Web databases so that our crawler will download it. Therefore, one more challenge to address is to create an effective Web spam detection method, in order to ensure the quality of the data that is returned to users.
Crawlers are programs that automatically traverse the Web graph, retrieving pages and building a local repository of the portion of the Web that they visit. Depending on the application at hand, the pages in the repository are either used to build search indexes or are otherwise analyzed. Traditionally, crawlers have only targeted a portion of the Web called the publicly indexable Web (PIW) as in [48], i.e., the set of pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. However, a number of recent studies as in [49, 48, 50] have observed that a significant fraction of Web content in fact lies outside the PIW. Specifically, large portions of the Web are "hidden" behind search forms, in searchable structured and unstructured databases (called the hidden Web as in [51] or the deep Web as in [49]). Pages in the hidden Web are dynamically generated in response to queries submitted via the search forms. The hidden Web continues to grow as organizations with large amounts of high-quality information (e.g., the Census Bureau, the Patent and Trademark Office, news media organizations) are placing their content online, providing Web-accessible search facilities over existing databases. For example, the site InvisibleWeb.com lists over 10000 such databases ranging from archives of job postings to directories, news archives, and electronic catalogs. Recent estimates as in [49] put the size of the hidden Web (in terms of generated HTML pages) at around 500 times the size of the PIW.
The task of crawling the hidden Web becomes much simpler if the site hosting the database is cooperative. For example, a crawler may be deployed within an organization to index the databases available on its local intranet. In that case, the web servers running on the internal network can be configured to recognize requests from the crawler and, in response, export the entire database for indexing. A similar arrangement is already used by some e-business sites, which recognize requests from the crawlers of major search engine companies and, in response, export their entire catalog/database for indexing.
Recent studies indicate that a significant portion of Web content cannot be reached by following links as in [17, 18]. Specifically, a large part of the Web is "hidden" behind search forms and is reachable only when users type a set of keywords, or queries, into these forms. Such pages are often referred to as the Hidden Web as in [52] or the Deep Web as in [17], because search engines typically cannot index the pages and do not return them in their results (thus, the pages are essentially "hidden" from an ordinary Web user). According to many studies, the size of the Hidden Web is increasing rapidly as more organizations put their valuable content online through easy-to-use Web interfaces as in [17]. In [18], Chang et al. estimate that well over 100,000 Hidden-Web sites currently exist on the Web. In addition, the content provided by many Hidden-Web sites can be extremely valuable to many users as in [17]. For instance, PubMed hosts many high-quality papers on medical research that were selected through careful peer-review processes, while the website of the US Patent and Trademark Office makes existing patent documents available, helping potential inventors examine "prior art". An effective Hidden-Web crawler can therefore have a tremendous impact on how users search the Web, at a small fraction of the cost and effort otherwise needed. There are two core challenges in achieving a successful Hidden-Web crawler: (a) the crawler must have the ability to understand and model a query interface, and (b) the crawler needs to come up with meaningful queries to issue to the query interface. The first challenge was addressed in [53], where a technique for learning query interfaces was presented.
Here, the focus is on the second challenge, i.e., how a crawler can automatically generate queries so that it can discover and download the Hidden-Web pages. Clearly, when the query forms list all possible values for a field (e.g., through a drop-down list), the solution is straightforward: the crawler can exhaustively issue all possible queries, one query at a time. When the query forms have a "free text" input, however, an infinite number of queries is possible, so which queries is it advisable to pick? How can the crawler automatically come up with meaningful queries? Given that the only "entry point" to the pages of a Hidden-Web site is its query form, a Hidden-Web crawler should follow the three steps described earlier. That is, the crawler needs to generate a query, issue it to the Web site, download the result index page, and follow the links to download the actual pages. In general, a crawler has limited time and network resources, so it has to carefully decide which queries to issue.
Algorithm:
(1) while (there are available resources) do
      // select a term to send to the site
(2)   qi = SelectTerm()
      // send query and acquire result index page
(3)   R(qi) = QueryWebSite(qi)
      // download the pages of interest
(4)   Download(R(qi))
(5) done
Figure 2.4 above shows the generic algorithm for a Hidden-Web crawler. For simplicity, it is assumed that the crawler issues single-term queries only. The crawler first decides which query term it is going to use (Step (2)), issues the query, and retrieves the result index page (Step (3)). Finally, based on the links found on the result index page, it downloads the Hidden-Web pages from the site (Step (4)). This same process is repeated until all the available resources are exhausted (Step (1)).
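The generic algorithm of Figure 2.4 can be rendered in Python as sketched below. The term-selection policy shown (picking the most frequent not-yet-issued term from the pages downloaded so far) is only one possible choice, and the helpers query_website and download_pages stand in for site-specific logic that is assumed here rather than specified.

import re
from collections import Counter

def select_term(downloaded_pages, issued_terms):
    # One simple policy: pick the most frequent term in the pages seen so far
    # that has not been issued yet (a random dictionary term could also be used).
    counts = Counter()
    for page in downloaded_pages:
        counts.update(re.findall(r"[a-z]+", page.lower()))
    for term, _ in counts.most_common():
        if term not in issued_terms:
            return term
    return None

def hidden_web_crawl(query_website, download_pages, seed_term, max_queries=50):
    """query_website(term) -> result index page (str);
    download_pages(result_page) -> list of page texts.
    Both are site-specific helpers assumed to be available."""
    issued, corpus = set(), []
    term = seed_term
    while term is not None and len(issued) < max_queries:    # (1) while resources remain
        issued.add(term)
        result_page = query_website(term)                    # (3) send query, get result index page
        corpus.extend(download_pages(result_page))           # (4) download the pages of interest
        term = select_term(corpus, issued)                    # (2) select the next term
    return corpus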
Given this algorithm, the most critical decision that the crawler has to make is what query to issue next. If the crawler can issue successful queries that return many matching pages, it can finish its crawl early while using a minimum amount of resources. In contrast, if the crawler issues completely random queries that do not return any matching pages, it may waste all of its resources simply issuing queries without ever retrieving actual pages. Therefore, how the crawler selects the next query can greatly affect its effectiveness. This selection problem can be formalized as
follows: assume that the crawler downloads pages from a Web site that has a set of pages S (the rectangle in Figure 2.5). Every Web page in S is represented as a point (the dots in Figure 2.5). Every potential query qi that may be issued can be viewed as a subset of S, containing all the points (pages) that are returned when qi is issued to the site.
Every subset is associated with a weight that represents the cost of issuing the corresponding query. Under this formalization, the objective is to find the subsets (queries) that cover the maximum number of points (Web pages) with the minimum total weight (cost).
This abstraction is equivalent to the set-covering problem as in [18]. There are two main challenges to address in this formalization. First, in a practical scenario, the crawler does not know which Web pages will be returned by which queries, so the subsets of S are not known in advance. Without knowing these subsets, the crawler cannot decide which queries to pick in order to maximize the coverage. Second, the set-covering problem is known to be NP-Hard as in [18], so an efficient algorithm to solve this problem optimally in polynomial time has yet to be discovered.
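Although the set-covering problem is NP-Hard, a greedy approximation is commonly used: repeatedly pick the query whose estimated new coverage per unit cost is largest. The Python sketch below illustrates this greedy selection over known subsets; in a real crawler the subsets would have to be estimated rather than known in advance, as noted above.

def greedy_query_selection(subsets, costs, universe):
    """subsets: query -> set of page ids the query returns.
    costs: query -> cost of issuing the query.
    Greedily choose queries maximizing newly covered pages per unit cost."""
    covered, chosen = set(), []
    while covered != universe:
        best_query, best_ratio = None, 0.0
        for query, pages in subsets.items():
            new = len(pages - covered)
            if new == 0:
                continue
            ratio = new / costs[query]
            if ratio > best_ratio:
                best_query, best_ratio = query, ratio
        if best_query is None:          # remaining pages cannot be covered by any query
            break
        chosen.append(best_query)
        covered |= subsets[best_query]
    return chosen, covered

subsets = {"liver": {1, 2, 3}, "cancer": {3, 4, 5, 6}, "gene": {5, 6}}
costs = {"liver": 1.0, "cancer": 1.0, "gene": 0.5}
universe = {1, 2, 3, 4, 5, 6}
print(greedy_query_selection(subsets, costs, universe))
# e.g. (['cancer', 'liver'], {1, 2, 3, 4, 5, 6})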
All Web crawlers should satisfy some predefined features. The main features are the following:
i. Robustness: It may happen that a Web server creates spider traps and generates web pages that mislead the crawler into getting stuck fetching an infinite number of pages in a particular domain. The crawler must be designed to be resilient to such traps.
ii. Politeness: Web servers have their own limitations in terms of implicit and explicit policies for regulating the crawler visit rate. So a Web crawler should have a politeness policy that respects these limits and avoids overloading any server (a simple per-host politeness sketch is given after this list).
iii. Distributed: It is a requirement of the time that a Web crawler should have the ability to execute in a distributed fashion across multiple machines.
iv. Scalable: The crawler architecture should provide support to expand the crawl rate by adding extra machines and bandwidth.
v. Quality: As it is not possible to index all WWW pages, and a large fraction of web pages is useless for end users, the crawler should be biased towards downloading the useful pages first.
vi. Performance and efficiency: The crawler should utilize the available system resources, such as processor, storage, and network bandwidth, in the most efficient
manner possible.
vii. Freshness: As a large fraction of web pages is dynamic in nature, the crawler should adopt some policy to revisit the crawled pages repeatedly in order to keep its local copies fresh.
viii. Extensible: The crawler should be designed in such a way that it can cope with changes such as new data formats, new fetch protocols, and so on. This may be achieved by adopting a modular crawler architecture. The optimum crawlers are those which support all the above-mentioned features, but a few features, such as robustness and politeness, are a must, while the others are desirable.
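As referenced in the politeness feature above, the following Python sketch shows one simple way a crawler can enforce a per-host visit-rate limit; the one-second default delay is an arbitrary illustrative value, and a real crawler would additionally honour robots.txt directives.

import time
from urllib.parse import urlparse

class PolitenessPolicy:
    """Delay successive requests to the same host by at least min_delay seconds."""
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_access = {}            # host -> timestamp of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_access.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)   # back off before hitting the host again
        self.last_access[host] = time.time()

policy = PolitenessPolicy(min_delay=2.0)
for url in ["http://example.org/a", "http://example.org/b"]:
    policy.wait(url)                     # the second call sleeps about 2 seconds
    # fetch(url) would go here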