MODULE_1
Dr.N.PARVIN
ASSISTANT PROFESSOR
DEPARTMENT OF COMPUTER APPLICATIONS
Introduction
The World Wide Web (or the Web for short) has impacted almost every aspect of our lives.
The World Wide Web is officially defined as a “wide-area hypermedia information retrieval initiative aiming to give
universal access to a large universe of documents.”
Not only can we find needed information on the Web, but we can also easily share our information and knowledge with
others.
The World Wide Web (WWW) is a system of interconnected webpages and information that you can access using
the Internet.
It was created to help people share and find information easily, using links that connect different pages together.
The Web allows us to browse websites, watch videos, shop online, and connect with others around the world through our
computers and phones.
WWW is defined as the collection of different websites around the world, containing different information
shared via local servers (or computers).
Web pages are linked together using HTML-formatted hyperlinks; hyperlinked text is also referred to as hypertext.
The benefit of hypertext is that it allows you to select a word or phrase in the text and follow its link to other pages that have more
information about it.
A Web browser is used to access web pages.
Web browsers can be defined as programs which display text, data, pictures, animation and video on the Internet.
Hyperlinked resources on the World Wide Web can be accessed using software interfaces provided by Web browsers.
The Web operates on the client-server architecture of the Internet.
When a user requests a web page or other information, the web browser sends a request to the web server; the server returns the
requested resource to the browser, which presents it to the user who made the request.
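The request-response flow above can be sketched with the Python standard library: a tiny HTTP server plays the web server, and `http.client` plays the browser. The page content and path are made up for illustration.

```python
# Minimal sketch of the Web's client-server request flow, using only the
# Python standard library. The page content and path are illustrative.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server answers the browser's request with an HTML page.
        body = b"<html><body><h1>Hello, Web!</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "browser" side: request the page and read the server's response.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/index.html")
response = conn.getresponse()
status = response.status            # 200
html = response.read().decode()     # the HTML page served above
print(status, html)
server.shutdown()
```

Real browsers add many layers on top of this (caching, rendering, TLS), but the basic protocol exchange is the same.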
Web browsers can be used for several tasks including conducting searches, mailing, transferring files, and much more.
A Brief History of the Web and the Internet
1.Creation of the Web
The Web was invented in 1989 by Tim Berners-Lee, who, at that time, worked at CERN (Conseil Européen pour la
Recherche Nucléaire, or European Laboratory for Particle Physics) in Switzerland.
He coined the term "World Wide Web", and wrote the first Web server, httpd, and the first client program (a browser and editor), "WorldWideWeb".
It began in March 1989 when Tim Berners-Lee submitted a proposal titled
“Information Management: A Proposal” to his superiors at CERN.
In the proposal, he discussed the disadvantages of hierarchical information organization and outlined the
advantages of a hypertext-based system.
The proposal called for a simple protocol that could request information stored in
remote systems through networks, and for a scheme by which information could be exchanged in a common format and
documents of individuals could be linked by hyperlinks to other documents.
It also proposed methods for reading text and graphics using the display technology available at the time.
The proposal essentially outlined a distributed hypertext system, which is the basic architecture of the Web.
In 1990, Berners-Lee re-circulated the proposal and received the support to begin the work.
With his team, he introduced the server and browser, the protocol used for communication between clients and the server, the
HyperText Transfer Protocol (HTTP), the HyperText Markup Language (HTML) used for authoring Web documents, and
the Uniform Resource Locator (URL).
2.Mosaic and Netscape Browsers
In February of 1993, Marc Andreessen from the University of Illinois' NCSA (National Center for Supercomputing
Applications) and his team released the first "Mosaic for X" graphical Web browser for UNIX.
A few months later, different versions of Mosaic were released for Macintosh and Windows operating systems.
For the first time, a Web client, with a consistent and simple point-and-click graphical user interface, was implemented for
the three most popular operating systems available at the time.
In mid-1994, Silicon Graphics founder Jim Clark collaborated with Marc Andreessen, and they founded the company
Mosaic Communications (later renamed Netscape Communications).
Its Netscape browser was released to the public, which started the explosive growth of the Web.
Microsoft's Internet Explorer entered the market in August 1995 and began to challenge Netscape.
3.Internet
The Internet began in the 1960s as a medium for sharing information among government researchers.
The Internet started with the computer network ARPANET (Advanced Research Projects Agency Network).
The first ARPANET connections were made in 1969, and in 1972, it was demonstrated at the First
International Conference on Computers and Communication, held in Washington D.C.
At the conference, ARPA scientists linked computers together from 40 different locations.
The Transmission Control Protocol/Internet Protocol (TCP/IP) suite, developed in the 1970s, was adopted as the new communication protocol for
ARPANET in 1983.
The technology enabled various computers on different networks to communicate with each other.
4.Search Engines
With information being shared worldwide, there was a need for individuals to find information.
The search system Excite was introduced in 1993 by six Stanford University students.
EINet Galaxy was established in 1994 as part of the MCC Research Consortium at the University of Texas.
Jerry Yang and David Filo created Yahoo! in 1994, which started out as a listing of their favorite Web sites and offered
directory search.
In subsequent years, many search systems emerged, e.g., Lycos, Infoseek, AltaVista, Inktomi, Ask Jeeves,
Northern Light, etc.
Google was launched in 1998 by Sergey Brin and Larry Page based on their research project at Stanford University.
Microsoft started to commit to search in 2003, and launched the MSN search engine in spring 2005.
Yahoo! provided a general search capability in 2004 after it purchased Inktomi in 2003.
Top Search Engines in 2024
Google
Microsoft Bing
Yahoo!
Yandex
DuckDuckGo
Baidu
Ask.com
Naver
Ecosia
AOL
Web usage mining refers to the discovery of user access patterns from Web usage logs, which record every
click made by each user.
Web usage mining applies many data mining algorithms.
One of the key issues in Web usage mining is the pre-processing of clickstream data in usage logs
in order to produce the right data for mining.
In Web mining, data collection can be a substantial task, especially for Web structure
and content mining, which involves crawling a large number of target Web pages.
Once the data is collected, we go through the same three-step process: data pre-processing, Web data
mining, and post-processing.
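A central step in pre-processing clickstream data is grouping each user's clicks into sessions. A common heuristic is an inactivity timeout, often 30 minutes. The sketch below assumes a toy log format of (user, timestamp, URL) records; real server logs also carry IP address, referrer, user agent, and so on.

```python
from datetime import datetime, timedelta

# Toy usage log: one (user, timestamp, URL) record per click.
log = [
    ("u1", "2024-05-01 09:00:00", "/home"),
    ("u1", "2024-05-01 09:05:00", "/products"),
    ("u1", "2024-05-01 10:30:00", "/home"),      # >30 min gap: new session
    ("u2", "2024-05-01 09:01:00", "/home"),
]

def sessionize(log, timeout=timedelta(minutes=30)):
    """Group each user's clicks into sessions using an inactivity timeout."""
    sessions = {}   # user -> list of sessions, each a list of URLs
    last_seen = {}  # user -> timestamp of that user's previous click
    for user, ts, url in sorted(log, key=lambda r: (r[0], r[1])):
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if user not in last_seen or t - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])   # start a new session
        sessions[user][-1].append(url)
        last_seen[user] = t
    return sessions

sessions = sessionize(log)
print(sessions["u1"])   # [['/home', '/products'], ['/home']]
```

The resulting sessions are the unit that access-pattern mining algorithms (e.g., sequential pattern mining) then operate on.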
Relevance Feedback
In relevance feedback, the user marks some of the returned documents as relevant or irrelevant, and the system uses these
judgments to revise the query (Equation 1) so that it better reflects the user's information need.
Issues of RF: relevance feedback requires extra effort from users, who are often reluctant to provide judgments.
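The slide's Equation 1 did not survive conversion. The standard relevance feedback formulation is the Rocchio method, reproduced here from the IR literature (it is an assumption that this is the exact equation the slide showed):

```latex
% Rocchio query revision for relevance feedback
\mathbf{q}_e \;=\; \alpha\,\mathbf{q}
  \;+\; \frac{\beta}{|D_r|}\sum_{\mathbf{d}\in D_r}\mathbf{d}
  \;-\; \frac{\gamma}{|D_{ir}|}\sum_{\mathbf{d}\in D_{ir}}\mathbf{d}
```

Here q is the original query vector, D_r and D_ir are the sets of document vectors judged relevant and irrelevant, and α, β, γ are tuning parameters controlling how strongly the feedback moves the query toward relevant documents and away from irrelevant ones.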
Text Pre-Processing
Before the documents in a collection are used for retrieval, some
pre-processing tasks are usually performed.
For traditional text documents (no HTML tags), the tasks include stopword removal and the handling of hyphens, punctuation marks, and cases of letters.
For Web pages, additional tasks such as HTML tag removal and
identification of main content blocks also require careful considerations.
Stopword Removal
Stopwords are frequently occurring and insignificant words in a
language that help construct sentences but do not represent any
content of the documents.
Articles, prepositions and conjunctions and some pronouns are
natural candidates.
Common stop words in English include:
a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or,
that, the, these, this, to, was, what, when, where, who, will, with
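Stopword removal can be sketched with the list above; real systems use longer lists of a few hundred words.

```python
import re

# Stopword list taken from the examples above (real lists are much longer).
STOPWORDS = {
    "a", "about", "an", "are", "as", "at", "be", "by", "for", "from",
    "how", "in", "is", "of", "on", "or", "that", "the", "these", "this",
    "to", "was", "what", "when", "where", "who", "will", "with",
}

def remove_stopwords(text):
    """Tokenize on letter runs and drop stopwords (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("The Web is a system of interconnected pages"))
# ['web', 'system', 'interconnected', 'pages']
```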
Hyphens
Breaking hyphens is usually applied to deal with inconsistency of usage.
For example, some people use “state-of-the-art”, but others use “state of the art”.
If the hyphens in the first case are removed, we eliminate the inconsistency problem.
However, some words may have a hyphen as an integral part of the word, e.g., “Y-21”.
Thus, in general, the system can follow a general rule (e.g., removing all hyphens) and
also have some exceptions.
Note that there are two types of removal:
(1) each hyphen is replaced with a space, and
(2) each hyphen is simply removed without leaving a space, so that "state-of-the-art"
becomes either "state of the art" or "stateoftheart". In some
systems both forms are indexed, as it is hard to determine which is correct;
e.g., if "pre-processing" is converted to "pre processing", then some relevant
pages will not be found if the query term is "preprocessing".
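The two removal types, and the index-both-forms strategy, can be sketched as:

```python
def hyphen_variants(term):
    """Return both removal variants for a hyphenated term; some systems
    index both, since it is hard to tell which form users will query."""
    return {
        term.replace("-", " "),  # type (1): hyphen -> space
        term.replace("-", ""),   # type (2): hyphen dropped, no space
    }

print(sorted(hyphen_variants("state-of-the-art")))
# ['state of the art', 'stateoftheart']
print(sorted(hyphen_variants("pre-processing")))
# ['pre processing', 'preprocessing']
```

An exception list (e.g., for terms like "Y-21" where the hyphen is integral) would be checked before applying either rule.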
Punctuation Marks:
Punctuation marks can be dealt with similarly to hyphens.
Case of Letters: All the letters are usually converted to either the upper or lower
case.
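Punctuation handling and case folding together form a simple normalization step; the sketch below applies type-1 removal (punctuation replaced with spaces) and lower-casing.

```python
import string

def normalize(text):
    """Lower-case the text and replace punctuation with spaces
    (type-1 removal), then split into tokens."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

print(normalize("Web mining: usage, content, and structure!"))
# ['web', 'mining', 'usage', 'content', 'and', 'structure']
```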
Web Page Pre-Processing
For Web pages, additional tasks such as HTML tag removal and identification of main
content blocks also require careful considerations.
1. Identifying different text fields:
In HTML, there are different text fields, e.g., title, metadata, and body.
Identifying them allows the retrieval system to treat terms in different fields
differently.
For example, in search engines, terms that appear in the title field of a page are
regarded as more important than terms that appear in other fields and are
assigned higher weights, because the title is usually a concise description
of the page.
In the body text, those emphasized terms (e.g., under header tags <h1>, <h2>, …,
bold tag <b>, etc.) are also given higher weights.
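Field-dependent weighting can be sketched with Python's `html.parser`. The specific weight values below are assumptions for illustration; the weights real search engines use are proprietary.

```python
from html.parser import HTMLParser

# Illustrative field weights (assumed values, for the sketch only).
FIELD_WEIGHT = {"title": 4.0, "h1": 3.0, "h2": 2.5, "b": 2.0, "body": 1.0}

class FieldExtractor(HTMLParser):
    """Collect (term, weight) pairs, weighting each term by its field."""
    def __init__(self):
        super().__init__()
        self.stack = []          # open tags, innermost last
        self.weighted_terms = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        top = self.stack[-1] if self.stack else "body"
        field = top if top in FIELD_WEIGHT else "body"
        for term in data.lower().split():
            self.weighted_terms.append((term, FIELD_WEIGHT[field]))

page = ("<html><head><title>Web Mining</title></head>"
        "<body><h1>Introduction</h1>plain text</body></html>")
p = FieldExtractor()
p.feed(page)
print(p.weighted_terms)
# [('web', 4.0), ('mining', 4.0), ('introduction', 3.0),
#  ('plain', 1.0), ('text', 1.0)]
```

These weighted term occurrences would then feed into the index and the ranking function.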
2. Identifying anchor text:
Anchor text associated with a hyperlink is treated specially in search engines
because the anchor text often represents a more accurate description of the
information contained in the page pointed to by its link.
In the case that the hyperlink points to an external page (not in the same site), it is
especially valuable because it is a summary description of the page given by other
people rather than the author/owner of the page, and is thus more trustworthy.
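Extracting anchor text and the page each link points to can be sketched as:

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (anchor_text, href) pairs; the anchor text serves as a
    description of the page the link points to."""
    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.href = None
        self.text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.in_anchor:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            self.anchors.append(("".join(self.text).strip(), self.href))
            self.in_anchor = False

p = AnchorExtractor()
p.feed('See the <a href="https://www.w3.org/">W3C home page</a> for standards.')
print(p.anchors)   # [('W3C home page', 'https://www.w3.org/')]
```

In indexing, the anchor text would be credited to the *target* URL, not to the page containing the link.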
3. Removing HTML tags:
The removal of HTML tags can be dealt with similarly to punctuation. One issue
needs careful consideration: if a tag is deleted without leaving any whitespace, words from
adjacent elements may be joined together, which affects proximity queries and phrase queries.
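A simple way to avoid gluing words together is to replace each tag with a space and then collapse runs of whitespace. (Regex-based stripping is a simplification; production systems use a real HTML parser.)

```python
import re

def strip_tags(html):
    """Replace each HTML tag with a space rather than deleting it outright,
    so words from adjacent elements are not joined (which would break
    phrase and proximity queries)."""
    text = re.sub(r"<[^>]+>", " ", html)        # tag -> space
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

print(strip_tags("<td>web</td><td>mining</td>"))
# 'web mining'  (not 'webmining')
```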
4. Identifying main content blocks:
A typical Web page contains many blocks besides the main content, e.g., navigation bars,
advertisements, and copyright notices. Researchers have shown that search and data mining results can be improved
significantly if only the main content blocks are used.
There are two techniques for finding such blocks in Web pages.
Partitioning based on visual cues:
This method uses visual information to help find main content blocks in a
page.
Visual or rendering information of each HTML element in a page can be
obtained from the Web browser.
For example, Internet Explorer provides an API that can output the X and Y
coordinates of each element.
A machine learning model can then be built based on the location and
appearance features for identifying main content blocks of pages.
Of course, a large number of training examples need to be manually labeled.
Tree matching:
This method is based on the observation that in most commercial Web sites pages are
generated by using some fixed templates.
The method thus aims to find such hidden templates.
Since HTML has a nested structure, it is easy to build a tag tree for each
page.
Tree matching of multiple pages from the same site can be performed to find such
templates.
Once a template is found, we can identify which blocks are likely to be the main content
blocks based on the following observation:
The text in the main content blocks is usually quite different across different pages generated from the
same template, but the non-main-content blocks are often quite similar across different
pages.
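The observation above can be sketched as follows: build a tag tree for each page, align two trees from the same site node by node, and flag nodes whose text differs across pages as candidate main content blocks. This is a deliberately simplified alignment; real tree matching uses tree-matching or tree-edit-distance algorithms and many pages per template.

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Build a simple nested tag tree: each node is [tag, text, children]."""
    def __init__(self):
        super().__init__()
        self.root = ["root", "", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, "", []]
        self.stack[-1][2].append(node)
        self.stack.append(node)

    def handle_data(self, data):
        self.stack[-1][1] += data.strip()

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

def tag_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root

def differing_blocks(a, b, out=None):
    """Walk two aligned tag trees; nodes whose text differs between pages
    of the same template are candidate main content blocks."""
    if out is None:
        out = []
    if a[0] == b[0]:                      # same tag: part of the template
        if a[1] and a[1] != b[1]:         # text differs -> likely content
            out.append((a[0], a[1]))
        for ca, cb in zip(a[2], b[2]):
            differing_blocks(ca, cb, out)
    return out

page1 = "<div><p>Site menu</p><p>Article about Web mining</p></div>"
page2 = "<div><p>Site menu</p><p>Article about search engines</p></div>"
print(differing_blocks(tag_tree(page1), tag_tree(page2)))
# [('p', 'Article about Web mining')]
```

The shared "Site menu" block is treated as template boilerplate, while the block whose text changes between pages is kept as main content.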