MODULE_1

The document outlines the course CADX 153: Web Mining, detailing the impact of the World Wide Web on information sharing and retrieval. It covers the history of the Web, the evolution of web browsers, the development of search engines, and the significance of web data mining, including its various types and processes. Additionally, it discusses the challenges of web data mining and the importance of pre-processing for effective information retrieval.

COURSE CODE : CADX 153

COURSE NAME : WEB MINING

Dr.N.PARVIN
ASSISTANT PROFESSOR
DEPARTMENT OF COMPUTER APPLICATIONS
Introduction
The World Wide Web (or the Web for short) has impacted almost every aspect of our lives.
The World Wide Web is officially defined as a “wide-area hypermedia information retrieval initiative aiming to give
universal access to a large universe of documents.”
Not only can we find needed information on the Web, but we can also easily share our information and knowledge with
others.
The World Wide Web (WWW) is a system of interconnected webpages and information that you can access using
the Internet.
It was created to help people share and find information easily, using links that connect different pages together.
The Web allows us to browse websites, watch videos, shop online, and connect with others around the world through our
computers and phones.
WWW is defined as the collection of different websites around the world, containing different information
shared via local servers (or computers).
Web pages are linked together using hyperlinks, which are HTML-formatted links also referred to as hypertext.
The benefit of hypertext is that it allows you to pick a word or phrase from the text and click through to other pages that have more information about it.
A Web browser is used to access web pages.
Web browsers can be defined as programs which display text, data, pictures, animation and video on the Internet.
Hyperlinked resources on the World Wide Web can be accessed using software interfaces provided by Web browsers.
The Web operates on the client-server architecture of the Internet.
When a user requests a web page or other information, the web browser sends the request to the web server; the server returns the requested resource to the browser, and the browser presents it to the user who made the request.
Web browsers can be used for several tasks including conducting searches, mailing, transferring files, and much more.
A Brief History of the Web and the Internet
1.Creation of the Web
The Web was invented in 1989 by Tim Berners-Lee, who, at that time, worked at CERN (Conseil Européen pour la Recherche Nucléaire, or the European Laboratory for Particle Physics) in Switzerland.
He coined the term “World Wide Web”, wrote the first Web server, httpd, and the first client program (a browser and editor), “WorldWideWeb”.
It began in March 1989 when Tim Berners-Lee submitted a proposal titled
“Information Management: A Proposal” to his superiors at CERN.
In the proposal, he discussed the disadvantages of hierarchical information organization and outlined the
advantages of a hypertext-based system.
The proposal called for a simple protocol that could request information stored in
remote systems through networks, and for a scheme by which information could be exchanged in a common format and
documents of individuals could be linked by hyperlinks to other documents.
It also proposed methods for reading text and graphics using the display technology.
The proposal essentially outlined a distributed hypertext system, which is the basic architecture of the Web.
In 1990, Berners-Lee re-circulated the proposal and received the support to begin the work.
He and his colleagues introduced their server and browser, the protocol used for communication between clients and servers, i.e., the HyperText Transfer Protocol (HTTP), the HyperText Markup Language (HTML) used for authoring Web documents, and the Universal Resource Locator (URL).
2.Mosaic and Netscape Browsers
In February of 1993, Marc Andreessen from the University of Illinois’ NCSA (National Center for Supercomputing Applications) and his team released the first "Mosaic for X" graphical Web browser for UNIX.
A few months later, different versions of Mosaic were released for Macintosh and Windows operating systems.
For the first time, a Web client, with a consistent and simple point-and-click graphical user interface, was implemented for
the three most popular operating systems available at the time.
In mid-1994, Silicon Graphics founder Jim Clark collaborated with Marc Andreessen, and they founded the company
Mosaic Communications.
The Netscape browser was released to the public, which started the explosive growth of the Web.
Internet Explorer from Microsoft entered the market in August 1995 and began to challenge Netscape.
3.Internet
The Internet started in the 1960s as a medium for sharing information among government researchers.
The Internet started with the computer network ARPANET (Advanced Research Projects Agency Network).
The first ARPANET connections were made in 1969, and in 1972, it was demonstrated at the First
International Conference on Computers and Communication, held in Washington D.C.
 At the conference, ARPA scientists linked computers together from 40 different locations.
 The Transmission Control Protocol/Internet Protocol (TCP/IP), developed in the 1970s, was adopted as the new communication protocol for ARPANET in 1983.
 The technology enabled various computers on different networks to communicate with each other.
4.Search Engines
 With information being shared worldwide, there was a need for individuals to find information.
 The search system Excite was introduced in 1993 by six Stanford University students.
 EINet Galaxy was established in 1994 as part of the MCC Research Consortium at the University of Texas.
 Jerry Yang and David Filo created Yahoo! in 1994, which started out as a listing of their favorite Web sites and offered directory search.
 In subsequent years, many search systems emerged, e.g., Lycos, Infoseek, AltaVista, Inktomi, Ask Jeeves, Northern Light, etc.
 Google was launched in 1998 by Sergey Brin and Larry Page based on their research project at Stanford University.
Microsoft started to commit to search in 2003, and launched the MSN search engine in spring 2005.
 Yahoo! provided a general search capability in 2004 after it purchased Inktomi in 2003.
 Top search engines in 2024:
 Google
 Microsoft Bing
 Yahoo!
 Yandex
 DuckDuckGo
 Baidu
 Ask.com
 Naver
 Ecosia
 AOL

5.W3C (The World Wide Web Consortium):


 W3C was formed in December 1994 by MIT and CERN as an international organization to lead the development of the Web.
 W3C's main objective was “to promote standards for the evolution of the Web and interoperability between WWW
products by producing specifications and reference software.”
 The first International Conference on World Wide Web (WWW) was also held in 1994. From 1995 to 2001, the
growth of the Web boomed.

Web Data Mining


 The rapid growth of the Web makes it the largest publicly accessible data source in the world.
 The Web has many unique characteristics, which make mining useful information and knowledge a fascinating and challenging task.
 Characteristics of web:
 The amount of data/information on the Web is huge and still growing. The coverage of the information is
also very wide and diverse. One can find information on almost anything on the Web.
 Data of all types exist on the Web, e.g., structured tables, semi-structured Web pages, unstructured texts, and multimedia files (images, audio, and video).
 Information on the Web is heterogeneous. Due to the diverse authorship of Web pages, multiple pages may
present the same or similar information using completely different words and/or formats. This makes integration of
information from multiple pages a challenging problem.
 A significant amount of information on the Web is linked. Hyperlinks exist among Web pages within a site and
across different sites. Within a site, hyperlinks serve as information organization mechanisms. Across different
sites, hyperlinks represent implicit conveyance of authority to the target pages. That is, those pages that are
linked (or pointed) to by many other pages are usually high quality pages or authoritative pages simply
because many people trust them.
 The information on the Web is noisy. The noise comes from two main sources. First, a typical Web page contains many pieces of information, e.g., the main content of the page, navigation links, advertisements, copyright notices,
privacy policies, etc. For a particular application, only part of the information is useful. The rest is considered
noise. To perform fine-grain Web information analysis and data mining, the noise should be removed. Second,
due to the fact that the Web does not have quality control of information, i.e., one can write almost anything that one
likes, a large amount of information on the Web is of low quality, erroneous, or even misleading.
 The Web is also about services. Most commercial Web sites allow people to perform useful operations at their sites,
e.g., to purchase products, to pay bills, and to fill in forms.
1.Data Mining

Data mining is also called knowledge discovery in databases (KDD).


It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g., databases,
texts, images, the Web, etc.
The patterns must be valid, potentially useful, and understandable.
Data mining is a multi-disciplinary field involving machine learning, statistics, databases, artificial intelligence,
information retrieval, and visualization.
There are many data mining tasks. Some of the common ones are supervised learning (or classification),
unsupervised learning (or clustering), association rule mining, and sequential pattern mining.
A data mining application usually starts with an understanding of the application domain by data analysts (data
miners), who then identify suitable data sources and
the target data.
With the data, data mining can be performed, which is usually carried out in three main steps:
 Pre-processing:
 The raw data is usually not suitable for mining due to various reasons.
 It may need to be cleaned in order to remove noise or abnormalities.
 The data may also be too large and/or involve many irrelevant attributes, which call for data reduction
through sampling and attribute selection.
 Data mining:
 The processed data is then fed to a data mining algorithm which will produce patterns or knowledge.
 Post-processing:
 In many applications, not all discovered patterns are useful.
 This step identifies those useful ones for applications.
 Various evaluation and visualization techniques are used to make the decision.
 The whole process (also called the data mining process) is almost always iterative. It usually takes many rounds to achieve satisfactory final results, which are then incorporated into real-world operational tasks (a minimal code sketch of the three steps follows this list).
 Traditional data mining uses structured data stored in relational tables, spread sheets, or flat files in the tabular
form.
 With the growth of the Web and text documents, Web mining and text mining are becoming increasingly
important and popular.
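As a rough, hedged illustration of the three steps, the sketch below runs them on a toy text collection. It assumes Python with scikit-learn installed (not mentioned in these notes); the documents, the number of clusters, and the post-processing check are invented for illustration only.

```python
# Minimal sketch of the three-step data mining process on toy text data.
# scikit-learn is an assumed dependency; the documents are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "web mining discovers knowledge from hyperlinks and page content",
    "search engines rank pages using the link structure of the web",
    "customers buy products and pay bills on commercial web sites",
    "online shops let users purchase products and fill in forms",
]

# 1. Pre-processing: lower-case, drop English stop words, build TF-IDF vectors.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)

# 2. Data mining: unsupervised learning (clustering) on the processed data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# 3. Post-processing: inspect the discovered patterns and keep the useful ones,
#    here simply by printing each cluster for manual evaluation.
for label in sorted(set(kmeans.labels_)):
    members = [d for d, l in zip(docs, kmeans.labels_) if l == label]
    print(f"cluster {label}: {members}")
```

In practice this loop is repeated: the clusters are evaluated, the pre-processing or parameters are adjusted, and the mining step is run again.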
2.Web Mining
 Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data.
 Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three
types:
 Web structure mining:
 Web structure mining discovers useful knowledge from hyperlinks (or links for short), which represent
the structure of the Web.
 For example, from the links, we can discover important Web pages, which, incidentally, is a key
technology used in search engines.
 We can also discover communities of users who share common interests.
 Web content mining:
 Web content mining extracts or mines useful information or knowledge from Web page contents.
 For example, we can automatically classify and cluster Web pages according to their topics.
 We can also discover patterns in Web pages to extract useful data such as descriptions of products, postings of forums, etc., for many purposes.
 Furthermore, we can mine customer reviews and forum postings to discover consumer sentiments.

Web usage mining:

 Web usage mining refers to the discovery of user access patterns from Web usage logs, which record every
click made by each user.
 Web usage mining applies many data mining algorithms.
 One of the key issues in Web usage mining is the pre-processing of clickstream data in usage logs in order to produce the right data for mining (a small sketch of such pre-processing follows this list).
 In Web mining, data collection can be a substantial task, especially for Web structure and content mining, which involves crawling a large number of target Web pages.
 Once the data is collected, we go through the same three-step process: data pre- processing, Web data
mining and post-processing.
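As a hedged illustration of clickstream pre-processing, the sketch below parses raw server log lines and groups each visitor's clicks into sessions. The Apache-style log format, the sample entries, and the 30-minute session timeout are all assumptions made for this sketch, not a prescribed method.

```python
# Sketch: clickstream pre-processing. Parse access-log lines and group each
# visitor's clicks into sessions using a 30-minute inactivity timeout.
# The log format and sample lines are assumptions for illustration.
import re
from datetime import datetime, timedelta
from collections import defaultdict

LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*"')
TIMEOUT = timedelta(minutes=30)

raw_log = [
    '10.0.0.1 - - [10/Mar/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Mar/2024:10:05:20 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.1 - - [10/Mar/2024:11:30:00 +0000] "GET /index.html HTTP/1.1" 200 512',
]

# 1. Parse each line into (visitor, timestamp, requested URL).
clicks = defaultdict(list)
for line in raw_log:
    m = LOG_RE.match(line)
    if m:
        ip, ts, _method, url = m.groups()
        when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
        clicks[ip].append((when, url))

# 2. Sessionize: start a new session after 30 minutes of inactivity.
sessions = []
for ip, visits in clicks.items():
    visits.sort()
    current = [visits[0]]
    for prev, cur in zip(visits, visits[1:]):
        if cur[0] - prev[0] > TIMEOUT:
            sessions.append((ip, current))
            current = []
        current.append(cur)
    sessions.append((ip, current))

for ip, session in sessions:
    print(ip, [url for _, url in session])
```

The resulting sessions are the kind of "right data" that Web usage mining algorithms then work on.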
Relevance Feedback
Issues of RF
Text Pre-Processing
Before the documents in a collection are used for retrieval, some pre-processing tasks are usually performed.
For traditional text documents (no HTML tags), the tasks include:

• Stop word removal,
• Stemming, and
• Handling of
 digits,
 hyphens,
 punctuation, and
 cases of letters.

For Web pages, additional tasks such as HTML tag removal and identification of main content blocks also require careful consideration.
Stop word Removal
 Stopwords are frequently occurring and insignificant words in a
language that help construct sentences but do not represent any
content of the documents.
 Articles, prepositions and conjunctions and some pronouns are
natural candidates.
 Common stop words in English include:
a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or,
that, the, these, this, to, was, what, when, where, who, will, with

 Such words should be removed before documents are indexed and stored.
 Stop words in the query are also removed before retrieval is
performed.
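A minimal sketch of stop word removal before indexing is given below; the stop word list is the small example list above (not a complete one), and the sample sentence is invented.

```python
# Sketch: remove stop words from a document before it is indexed.
# The stop word list is the small example list given above, not a full list.
STOP_WORDS = {
    "a", "about", "an", "are", "as", "at", "be", "by", "for", "from", "how",
    "in", "is", "of", "on", "or", "that", "the", "these", "this", "to",
    "was", "what", "when", "where", "who", "will", "with",
}

def remove_stop_words(text: str) -> list[str]:
    """Tokenize on whitespace and drop stop words (case-insensitive)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(remove_stop_words("The Web is a system of interconnected pages"))
# -> ['web', 'system', 'interconnected', 'pages']
```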
Stemming
Stemming refers to the process of reducing words to their stems or
roots.
A stem is the portion of a word that is left after removing its prefixes
and suffixes.
In English, most variants of a word are generated by the introduction
of suffixes (rather than prefixes).
Thus, stemming in English usually means suffix removal, or stripping.
For example, “computer”, “computing”, and “compute” are reduced
to “comput”. “walks”, “walking” and “walker” are reduced to “walk”.
Stemming enables different variations of the word to be considered in
retrieval, which improves the recall.
There are several stemming algorithms, also known as stemmers.
In English, the most popular stemmer is perhaps Martin Porter's stemming algorithm, which uses a set of rules for stemming.
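As a hedged sketch, the code below applies Porter's algorithm through the NLTK library (an assumed dependency, not mentioned in these notes); the exact outputs depend on the stemmer implementation, so the words are simply printed for inspection.

```python
# Sketch: suffix stripping with Porter's stemmer via NLTK (assumed installed).
# Exact outputs depend on the stemmer; run the loop to see what it produces.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computing", "compute", "walks", "walking", "walker"]:
    print(word, "->", stemmer.stem(word))
```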
Other Pre-Processing Tasks for Text
Digits
 Numbers and terms that contain digits are removed in traditional IR systems, except for some specific types, e.g., dates, times, and other pre-specified types expressed with regular expressions.
 However, in search engines, they are usually indexed.

Hyphens
 Breaking hyphens is usually applied to deal with the inconsistency of usage.
 For example, some people use “state-of-the-art”, but others use “state of the art”.
 If the hyphens in the first case are removed, we eliminate the inconsistency problem.
However, some words may have a hyphen as an integral part of the word, e.g., “Y-21”.
 Thus, in general, the system can follow a general rule (e.g., removing all hyphens) and
also have some exceptions.
Note that there are two types of removal:
(1) each hyphen is replaced with a space, and
(2) each hyphen is simply removed without leaving a space, so that “state-of-the-art” may be replaced with “state of the art” or “stateoftheart”. In some systems both forms are indexed, as it is hard to determine which is correct, e.g., if “pre-processing” is converted to “pre processing”, then some relevant pages will not be found if the query term is “preprocessing”.

Punctuation Marks:
 Punctuation can be dealt with similarly to hyphens.
 Case of Letters: All the letters are usually converted to either the upper or lower
case.
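The sketch below combines the rules above into one hedged normalization pass: digit-only tokens are dropped (a traditional-IR choice), each hyphenated term is indexed in both its space-split and joined forms, edge punctuation is stripped, and letters are lower-cased. The policy choices and the sample phrase are illustrative assumptions, not a standard.

```python
# Sketch: simple term normalization combining the rules described above.
# Policy choices (drop digit-only tokens, index both hyphen variants,
# strip punctuation, lower-case everything) are illustrative, not standard.
import string

def normalize(text: str) -> list[str]:
    terms = []
    for token in text.lower().split():
        # Strip edge punctuation but keep hyphens for the step below.
        token = token.strip(string.punctuation.replace("-", ""))
        if not token or token.isdigit():          # drop empty and digit-only tokens
            continue
        if "-" in token:
            terms.extend(token.split("-"))        # variant 1: hyphen -> space
            terms.append(token.replace("-", ""))  # variant 2: hyphen removed
        else:
            terms.append(token)
    return terms

print(normalize("State-of-the-art pre-processing, revised in 2024!"))
# -> ['state', 'of', 'the', 'art', 'stateoftheart',
#     'pre', 'processing', 'preprocessing', 'revised', 'in']
```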
Web Page Pre-Processing
For Web pages, additional tasks such as HTML tag removal and identification of main content blocks also require careful consideration.
1. Identifying different text fields:
 In HTML, there are different text fields, e.g., title, metadata, and body.
 Identifying them allows the retrieval system to treat terms in different fields
differently.
 For example, in search engines, terms that appear in the title field of a page are regarded as more important than terms that appear in other fields and are assigned higher weights, because the title is usually a concise description of the page.
 In the body text, those emphasized terms (e.g., under header tags <h1>, <h2>, …,
bold tag <b>, etc.) are also given higher weights.
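As a hedged sketch of field-dependent term weighting, the code below separates title, header/bold, and body text and gives their terms the illustrative weights 3, 2, and 1. The BeautifulSoup library, the HTML snippet, and the weight values are assumptions, not the weighting scheme of any particular search engine.

```python
# Sketch: extract different HTML text fields and give their terms different
# weights (title > emphasized text > body). Weights and HTML are illustrative.
from collections import Counter
from bs4 import BeautifulSoup

html = """
<html><head><title>Web Mining Course</title></head>
<body><h1>Introduction</h1>
<p>Web mining discovers <b>useful knowledge</b> from page content.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
weights = Counter()

title = soup.title.get_text() if soup.title else ""
for term in title.lower().split():
    weights[term] += 3                      # title terms weighted highest

for tag in soup.find_all(["h1", "h2", "h3", "b", "strong"]):
    for term in tag.get_text().lower().split():
        weights[term] += 2                  # emphasized terms weighted higher

for term in soup.get_text(separator=" ").lower().split():
    weights[term] += 1                      # every term gets the base body weight

print(weights.most_common(5))
```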
2. Identifying anchor text:
 Anchor text associated with a hyperlink is treated specially in search engines
because the anchor text often represents a more accurate description of the
information contained in the page pointed to by its link.
 In the case that the hyperlink points to an external page (not in the same site), it is
especially valuable because it is a summary description of the page given by other
people rather than the author/owner of the page, and is thus more trustworthy.
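A minimal sketch of anchor text extraction follows; it again assumes BeautifulSoup, and the page URL and HTML are invented. It collects each link's anchor text and flags links that point to external sites, whose anchor text the discussion above treats as especially valuable.

```python
# Sketch: collect anchor texts and mark links to external sites, whose anchor
# text is treated as a more trustworthy description of the target page.
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

page_url = "http://example.com/articles/web-mining.html"   # assumed page URL
html = """
<a href="/articles/history.html">History of the Web</a>
<a href="http://other-site.org/survey.html">Web mining survey</a>
"""

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=True):
    target = urljoin(page_url, a["href"])
    external = urlparse(target).netloc != urlparse(page_url).netloc
    label = "(external)" if external else "(internal)"
    print(a.get_text(strip=True), "->", target, label)
```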
3. Removing HTML tags:
 The removal of HTML tags can be dealt with similarly to punctuation. One issue
needs careful consideration, which affects proximity queries and phrase queries.

 HTML is inherently a visual presentation language. In a typical commercial page, information is presented in many rectangular blocks.
 Simply removing HTML tags may cause problems by joining text that should not be
joined.
 For example, in a two-column page, “cite this article” at the bottom of the left column would be joined with “Main Page” on the right, but they should not be joined.
 They will cause problems for phrase queries and proximity queries.
 This problem had not been dealt with satisfactorily by search engines at the time
when this book was written.
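The hedged sketch below (BeautifulSoup assumed, and the two-column HTML invented to mirror the example above) strips tags but keeps a separator between elements, so that text from different blocks is not accidentally joined into one phrase.

```python
# Sketch: strip HTML tags without joining text across block boundaries.
# get_text(separator=...) keeps a boundary marker between elements so that
# phrase and proximity matching does not cross blocks. The HTML is an
# invented two-column layout like the example discussed above.
from bs4 import BeautifulSoup

html = ("<table><tr>"
        "<td><p>Main article text</p><p>cite this article</p></td>"
        "<td><p>Main Page</p><p>navigation links</p></td>"
        "</tr></table>")

soup = BeautifulSoup(html, "html.parser")

naive = soup.get_text()   # text from different blocks runs together
blocks = [t.strip() for t in soup.get_text(separator="\n").splitlines() if t.strip()]

print(naive)    # 'Main article textcite this articleMain Pagenavigation links'
print(blocks)   # ['Main article text', 'cite this article', 'Main Page', 'navigation links']
```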
4. Identifying main content blocks:
 A typical Web page, especially a commercial page, contains a large amount of
information that is not part of the main content of the page.
 For example, it may contain banner ads, navigation bars, copyright notices, etc.,
which can lead to poor results for search and mining.
 For example, on a page like the Wikipedia home page, the main content block is the block containing “Today’s featured article.”
 It is not desirable to index anchor texts of the navigation links as a part of the
content of this page.
 Several researchers have studied the problem of identifying main content
blocks.

 They showed that search and data mining results can be improved
significantly if only the main content blocks are used.
 There are two techniques for finding such blocks in Web pages.
 Partitioning based on visual cues:
 This method uses visual information to help find main content blocks in a
page.
 Visual or rendering information of each HTML element in a page can be
obtained from the Web browser.
 For example, Internet Explorer provides an API that can output the X and Y
coordinates of each element.
 A machine learning model can then be built based on the location and
appearance features for identifying main content blocks of pages.
 Of course, a large number of training examples need to be manually labeled.
Tree matching:
 This method is based on the observation that in most commercial Web sites pages are
generated by using some fixed templates.
 The method thus aims to find such hidden templates.
 Since HTML has a nested structure, it is easy to build a tag tree for each page.
 Tree matching of multiple pages from the same site can be performed to find such
templates.
 Once a template is found, we can identify which blocks are likely to be the main content
blocks based on the following observation:
 The text in main content blocks is usually quite different across different pages of the same template, but the non-main-content blocks are often quite similar in different pages.
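As a hedged sketch of the tree-matching idea, the code below implements a simplified top-down matching of two tag trees written from scratch for illustration (it is not the exact algorithm of any particular paper or system). The two small tag trees imitate pages generated from the same template; a high score suggests a shared template.

```python
# Sketch: simplified top-down matching of two tag trees. A tree is
# (tag, [children]); the trees below are invented template-like pages.

def tree_match(a, b):
    """Return the size of the largest matching between trees a and b."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # Dynamic programming over the two ordered child sequences.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i][j - 1], M[i - 1][j],
                          M[i - 1][j - 1] + tree_match(kids_a[i - 1], kids_b[j - 1]))
    return M[m][n] + 1   # +1 for the matched roots

page1 = ("html", [("body", [("div", [("h1", []), ("p", []), ("p", [])]),
                            ("div", [("a", []), ("a", [])])])])
page2 = ("html", [("body", [("div", [("h1", []), ("p", [])]),
                            ("div", [("a", []), ("a", []), ("a", [])])])])

print("matched nodes:", tree_match(page1, page2))
# The higher the score relative to the tree sizes, the more template-like
# the two pages are.
```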

Duplicate Detection


 Duplicate documents or pages are not a problem in traditional IR.
 However, in the context of the Web, it is a significant issue.
 There are different types of duplication of pages and contents on the Web.
 Copying a page is usually called duplication or replication, and copying an entire
site is called mirroring.
 Duplicate pages and mirror sites are often used to improve efficiency of browsing
and file downloading worldwide due to limited bandwidth across different
geographic regions and poor or unpredictable network performances.
 Of course, some duplicate pages are the result of plagiarism. Detecting such pages and sites can reduce the index size and improve search results.
 Several methods can be used to find duplicate information.
 The simplest method is to hash the whole document, e.g., using the MD5
algorithm, or computing an aggregated number (e.g., checksum).
 However, these methods are only useful for detecting exact duplicates.
 On the Web, one seldom finds exact duplicates.
 For example, even different mirror sites may have different URLs, different Web
masters, different contact information, different advertisements to suit local
needs, etc.
 One efficient duplicate detection technique is based on n-grams (also called
shingles).
 An n-gram is simply a consecutive sequence of words of a fixed window size n.
 For example, the sentence, “John went to school with his brother,” can be
represented with five 3-gram phrases “John went to”, “went to school”, “to school
with”, “school with his”, and “with his brother”.
 Note that a 1-gram is simply an individual word.
 Let Sn(d) be the set of distinct n-grams (or shingles) contained in document d. Each n-gram may be coded with a number or an MD5 hash (which is usually a 32-digit hexadecimal number).
 Given the n-gram representations of the two documents d1 and d2, Sn(d1) and Sn(d2), the Jaccard coefficient can be used to compute the similarity of the two documents:

sim(d1, d2) = |Sn(d1) ∩ Sn(d2)| / |Sn(d1) ∪ Sn(d2)|

 A threshold is used to determine whether d1 and d2 are likely to be duplicates of each other.
 For a particular application, the window size n and the similarity threshold are chosen through experiments.
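A hedged sketch of the whole technique follows: each document is reduced to a set of MD5-hashed 3-gram shingles, and the Jaccard coefficient is compared against a threshold. The window size n = 3, the threshold 0.9, and the two sample documents are illustrative assumptions.

```python
# Sketch: near-duplicate detection with n-gram shingles and the Jaccard
# coefficient. The window size n=3 and threshold 0.9 are illustrative choices.
import hashlib

def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of distinct n-grams of a document, MD5-hashed."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}

def jaccard(s1: set, s2: set) -> float:
    """Compute |Sn(d1) ∩ Sn(d2)| / |Sn(d1) ∪ Sn(d2)|."""
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

d1 = "John went to school with his brother early in the morning"
d2 = "John went to school with his brother early in the evening"

sim = jaccard(shingles(d1), shingles(d2))
print(f"similarity = {sim:.2f}, duplicates: {sim >= 0.9}")
```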
LIST OF UNSUPERVISED LEARNING ALGORITHM
