Unit-Ii Text and Web Page Pre-Processing: Stop Words
Stemming:
In many languages, a word has various syntactical forms depending on the contexts that it is
used. For example, in English, nouns have plural forms, verbs have gerund forms (by adding
“ing”), and verbs used in the past tense are different from the present tense. These are considered
as syntactic variations of the same root form. Such variations cause low recall for a retrieval
system because a relevant document may contain a variation of a query word but not the exact
word itself. This problem can be partially dealt with by stemming.
Stemming refers to the process of reducing words to their stems or roots. A stem is the portion
of a word that is left after removing its prefixes and suffixes. In English, most variants of a word
are generated by the introduction of suffixes (rather than prefixes). Thus, stemming in English
usually means suffix removal, or stripping.
For example,
“computer”, “computing”, and “compute” are reduced to “comput”.
“walks”, “walking” and “walker” are reduced to “walk”.
Stemming enables different variations of the word to be considered in retrieval, which improves
the recall. There are several stemming algorithms, also known as stemmers. In English, the
most popular stemmer is perhaps Martin Porter's stemming algorithm, which uses a set of
rules for stemming.
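To make this concrete, here is a minimal stemming sketch in Python; it assumes the NLTK library (not mentioned in the text) is installed, which provides an implementation of Porter's algorithm.

# A minimal stemming sketch, assuming the NLTK library is available (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computing", "compute", "walks", "walking", "walker"]:
    print(word, "->", stemmer.stem(word))

# Note: the actual outputs follow Porter's measure-based rules and may differ slightly
# from the idealized examples above (e.g., "walker" may not be reduced to "walk").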
Over the years, many researchers evaluated the advantages and disadvantages of using
stemming. Clearly, stemming increases the recall and reduces the size of the indexing structure.
However, it can hurt precision because many irrelevant documents may be considered relevant.
For example, both “cop” and “cope” are reduced to the stem “cop”. However, if one is looking
for documents about police, a document that contains only “cope” is unlikely to be relevant.
Although many experiments have been conducted by researchers, there is still no conclusive
evidence one way or the other. In practice, one should experiment with the document collection
at hand to see whether stemming helps.
Stemming is performed to improve the efficiency and accuracy of natural language processing
tasks by reducing words to their base or root form. This process helps in standardizing different
variations of a word to a common form, enabling better matching and comparison. For example,
the words "running", "runner", and "ran" can all be reduced to the stem "run". By doing this,
search engines and text analysis tools can recognize that these words are related and treat them as
equivalent, which enhances information retrieval, indexing, and text mining. Stemming also
reduces the dimensionality of the data, making algorithms more efficient and improving their
performance in tasks such as document classification, clustering, and topic modeling. Overall,
stemming helps in capturing the underlying meaning of words and their relationships, leading to
more accurate and meaningful results in various applications.
Other Pre-Processing Tasks for Text
Digits: Numbers and terms that contain digits are removed in traditional IR systems except
some specific types, e.g., dates, times, and other pre-specified types expressed with regular
expressions. However, in search engines, they are usually indexed.
Hyphens: Hyphens are usually broken (removed) to deal with the inconsistency of usage. For
example, some people use “state-of-the-art”, but others use “state of the art”. If the hyphens
in the first case are removed, we eliminate the inconsistency problem. However, some words
may have a hyphen as an integral part of the word, e.g., “Y-21”. Thus, in general, the system can
follow a general rule (e.g., removing all hyphens) and also have some exceptions.
Note that there are two types of removal, i.e.,
(1) each hyphen is replaced with a space and
(2) each hyphen is simply removed without leaving a space so that “state-of-the-art” may be
replaced with “state of the art” or “stateoftheart”.
In some systems both forms are indexed as it is hard to determine which is correct, e.g., if “pre-
processing” is converted to “pre processing”, then some relevant pages will not be found if the
query term is “preprocessing”.
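The two removal options can be illustrated with a short, hypothetical snippet (simple string operations, not any particular system's code):

# Illustration of the two hyphen-removal options discussed above.
term = "state-of-the-art"

replaced_with_space = term.replace("-", " ")  # option (1): "state of the art"
simply_removed = term.replace("-", "")        # option (2): "stateoftheart"

print(replaced_with_space)
print(simply_removed)

# Some systems index both forms, e.g., for "pre-processing" both
# "pre processing" and "preprocessing" would be indexed.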
Punctuation Marks: Punctuation can be dealt with similarly as hyphens.
Case of Letters: All the letters are usually converted to either the upper or lower case.
3. Removing HTML tags: When HTML tags are stripped from a page, text from different layout blocks may be
concatenated; for example, text in the left column will join “Main Page” on the right, but they should not be
joined. Such spurious joins cause problems for phrase queries and proximity queries. This problem has not been
dealt with satisfactorily by search engines.
4. Identifying main content blocks: A typical Web page, especially a commercial page, contains a
large amount of information that is not part of the main content of the page. For example, it may contain
banner ads, navigation bars, copyright notices, etc., which can lead to poor results for search and mining.
In Fig. the main content block of the page is the block containing “Today’s featured article.” It is not
desirable to index anchor texts of the navigation links as a part of the content of this page. Several
researchers have studied the problem of identifying main content blocks. They showed that search and
data mining results can be improved significantly if only the main content blocks are used. We briefly
discuss two techniques for finding such blocks in Web pages.
Partitioning based on visual cues: This method uses visual information to help find main
content blocks in a page. Visual or rendering information of each HTML element in a page
can be obtained from the Web browser. For example, Internet Explorer provides an API
that can output the X and Y coordinates of each element. A machine learning model can then
be built based on the location and appearance features for identifying main content blocks of
pages. Of course, a large number of training examples need to be manually labeled.
Tree matching: This method is based on the observation that in most commercial Web sites
pages are generated by using some fixed templates. The method thus aims to find such hidden
templates. Since HTML has a nested structure, it is thus easy to build a tag tree for each
page.
Tree matching of multiple pages from the same site can be performed to find such templates.
Once a template is found, we can identify which blocks are likely to be the main content blocks
based on the following observation: the text in main content blocks is usually quite different
across different pages of the same template, but the non-main content blocks are often quite
similar in different pages. To determine the text similarity of corresponding blocks (which are
subtrees), the shingle method can be used.
When dealing with web pages, especially commercial ones, a significant challenge is filtering
out non-essential information like ads, navigation bars, and other distractions to focus on the
main content. Let's explain the process of identifying main content blocks with an example,
using two techniques: Partitioning Based on Visual Cues and Tree Matching.
Example Application:
By applying the visual-cues technique, a machine learning model would identify the “Today’s featured
article: AI advancements” section as the main content block. For tree matching, the text similarity of
corresponding blocks is measured with the shingle method, described below.
The Shingle Method is a technique used to measure the similarity between two pieces of
text by breaking them down into smaller, overlapping chunks called "shingles." This
method is particularly useful for comparing texts to find out how similar they are, even if
the texts have slight variations.
Shingles are small, overlapping subsequences of words or characters taken from a text. For
example, if we use a shingle size of 3 words (3-gram shingles), the sentence "The quick brown
fox jumps over the lazy dog" can be broken down into the following shingles: "The quick brown",
"quick brown fox", "brown fox jumps", "fox jumps over", "jumps over the", "over the lazy", and
"the lazy dog".
Duplicate Detection
Duplicate documents or pages are not a problem in traditional IR. However, in the context of the
Web, it is a significant issue. There are different types of duplication of pages and contents on
the Web.
Copying a page is usually called duplication or replication, and copying an entire site is
called mirroring. Duplicate pages and mirror sites are often used to improve efficiency of
browsing and file downloading worldwide due to limited bandwidth across different geographic
regions and poor or unpredictable network performances. Of course, some duplicate pages are
the results of plagiarism. Detecting such pages and sites can reduce the index size and improve
search results.
Several methods can be used to find duplicate information. The simplest method is to hash the
whole document, e.g., using the MD5 algorithm, or computing an aggregated number (e.g.,
checksum). However, these methods are only useful for detecting exact duplicates. On the
Web, one seldom finds exact duplicates. For example, even different mirror sites may have
different URLs, different Web masters, different contact information, different advertisements to
suit local needs, etc.
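As a sketch of the hashing approach (using Python's standard hashlib module; the documents below are made-up examples):

# Exact-duplicate detection by hashing whole documents with MD5.
import hashlib

def md5_digest(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

d1 = "John went to school with his brother."
d2 = "John went to school with his brother."
d3 = "John went to school with his brother!"  # one character differs

print(md5_digest(d1) == md5_digest(d2))  # True: exact duplicates
print(md5_digest(d1) == md5_digest(d3))  # False: a single change defeats exact hashing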
One efficient duplicate detection technique is based on n-grams (also called shingles). An n-
gram is simply a consecutive sequence of words of a fixed window size n. For example, the
sentence, “John went to school with his brother,” can be represented with five 3-gram phrases
“John went to”, “went to school”, “to school with”, “school with his”, and “with his brother”.
Note that a 1-gram is simply an individual word. Let Sn(d) be the set of distinctive n-grams (or
shingles) contained in document d. Each n-gram may be coded with a number or an MD5 hash
(which is usually a 32-digit hexadecimal number). Given the n-gram representations of the two
documents d1 and d2, Sn(d1) and Sn(d2), the Jaccard coefficient can be used to compute the
similarity of the two documents:

sim(d1, d2) = |Sn(d1) ∩ Sn(d2)| / |Sn(d1) ∪ Sn(d2)|
A threshold is used to determine whether d1 and d2 are likely to be duplicates of each other. For
a particular application, the window size n and the similarity threshold are chosen through
experiments.
Example: Compare Text 1, "The quick brown fox", with Text 2, "The quick brown dog", using 2-word shingles.
Text 1 Shingles:
1. "The quick"
2. "quick brown"
3. "brown fox"
Text 2 Shingles:
1. "The quick"
2. "quick brown"
3. "brown dog"
Jaccard Similarity = 2/4 = 0.5
So, the similarity score between "The quick brown fox" and "The quick brown dog" is 0.5,
indicating that they share 50% of their shingles.
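The shingle computation and the Jaccard coefficient can be sketched in a few lines of Python (a simple illustration, not a production implementation), reproducing the example above:

# Word-level n-gram shingles and the Jaccard coefficient.
def shingles(text, n):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

s1 = shingles("The quick brown fox", 2)   # {'The quick', 'quick brown', 'brown fox'}
s2 = shingles("The quick brown dog", 2)   # {'The quick', 'quick brown', 'brown dog'}
print(jaccard(s1, s2))                    # 0.5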
Social Network Analysis
From a social network we can study the properties of its structure, and the role, position and
prestige of each social actor. We can also find various kinds of sub-graphs, e.g., communities
formed by groups of actors.
Social network analysis is useful for the Web because the Web is essentially a virtual society,
and thus a virtual social network, where each page can be regarded as a social actor and each
hyperlink as a relationship.
Many of the results from social networks can be adapted and extended for use in the Web
context. The ideas from social network analysis are indeed instrumental to the success of Web
search engines.
We introduce two types of social network analysis, centrality and prestige, which are closely
related to hyperlink analysis and search on the Web. Both centrality and prestige are measures
of degree of prominence of an actor in a social network.
Centrality
Centrality defines how important a node is within a network.
Important or prominent actors are those that are linked or involved with other actors
extensively. In the context of an organization, a person with extensive contacts (links) or
communications with many other people in the organization is considered more important than
a person with relatively fewer contacts.
The links can also be called ties. A central actor is one involved in many ties. Fig. shows a
simple example using an undirected graph. Each node in the social network is an actor and each
link indicates that the actors on the two ends of the link communicate with each other.
Intuitively, we see that the actor i is the most central actor because he/she can communicate
with most other actors.
Types of Centrality
1. Degree Centrality:
o Definition: Measures the number of direct connections (ties) an actor has.
o Application: An actor with a high degree centrality is considered well-connected
and influential within the network.
o Example: In a social network, a person who communicates frequently with many
others has high degree centrality.
2. Betweenness Centrality:
o Definition: Measures the extent to which an actor lies on the shortest paths
between other actors.
o Application: An actor with high betweenness centrality acts as a bridge or broker,
facilitating communication between different parts of the network.
o Example: A person who connects different groups within an organization and
controls information flow has high betweenness centrality.
3. Closeness Centrality:
o Definition: Measures how close an actor is to all other actors in the network,
based on the shortest paths.
o Application: An actor with high closeness centrality can quickly interact with all
others, making them efficient in spreading information.
o Example: A person who can quickly reach and influence all members of a
network has high closeness centrality.
These types of centrality help in identifying key actors in a network based on their connectivity,
bridging role, and efficiency in communication.
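These measures can be computed directly with the networkx library (an assumption, not part of the text); the small communication graph below is made up for illustration:

# Degree, betweenness, and closeness centrality on a small undirected network.
import networkx as nx

G = nx.Graph([("a", "i"), ("b", "i"), ("c", "i"), ("d", "i"), ("i", "j"), ("j", "k")])

print(nx.degree_centrality(G))       # number of direct ties (normalized)
print(nx.betweenness_centrality(G))  # how often a node lies on shortest paths
print(nx.closeness_centrality(G))    # closeness to all other nodes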
Prestige
Prestige is a more refined measure of prominence of an actor than centrality. We need to
distinguish between ties sent (out-links) and ties received (in-links). A prestigious actor is
defined as one who is the object of extensive ties as a recipient. In other words, to compute the
prestige of an actor, we only look at the ties (links) directed or pointed to the actor (in-links).
Hence, the prestige cannot be computed unless the relation is directional or the graph is
directed. The main difference between the concepts of centrality and prestige is that centrality
focuses on out-links while prestige focuses on in-links.
We define three prestige measures. The third prestige measure (i.e., rank prestige) forms the
basis of most Web page link analysis algorithms, including PageRank and HITS.
Rank Prestige
The first two prestige measures (degree prestige and proximity prestige) are based on in-degrees and distances. However, an important
factor that has not been considered is the prominence of individual actors who do the “voting”
or “choosing.”
In the real world, a person i chosen by an important person is more prestigious than one chosen by a
less important person. For example, a company CEO voting for a person is much more
important than a worker voting for the person.
If one’s circle of influence is full of prestigious actors, then one’s own prestige is also high. Thus
one’s prestige is affected by the ranks or statuses of the involved actors. Based on this intuition,
the rank prestige PR(i) is defined as a linear combination of links that point to i:

PR(i) = A1i PR(1) + A2i PR(2) + ... + Ani PR(n)

where Aji = 1 if actor j points to (chooses) actor i, and 0 otherwise, and PR(j) is the rank prestige of actor j.
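A minimal numpy sketch (an illustration under the definition above, with a made-up adjacency matrix) computes rank prestige by repeatedly applying this equation until the values stabilize:

# Rank prestige by power iteration: P = A^T P (up to normalization).
import numpy as np

# A[j, i] = 1 if actor j chooses (points to) actor i.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

p = np.ones(A.shape[0])
for _ in range(100):
    p = A.T @ p                 # prestige flows along in-links
    p = p / np.linalg.norm(p)   # normalize to keep values bounded

print(p)  # converged rank-prestige vector (principal eigenvector of A^T)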
Key Differences:
1. Degree Prestige:
o Definition: Measures the number of direct in-links an actor has.
o Application: An actor with many incoming ties is considered prestigious.
o Example: In a corporate setting, a person frequently contacted by others for
advice has high degree prestige.
2. Proximity Prestige:
o Definition: Measures how close an actor is to all other actors, considering only
the in-links.
o Application: An actor who can be quickly reached by others has high proximity
prestige.
o Example: A manager who can be quickly accessed by employees for guidance
has high proximity prestige.
3. Rank Prestige:
o Definition: Considers the prestige of the actors linking to the actor, not just the
number of in-links.
o Application: An actor linked by other prestigious actors has higher rank prestige.
o Example: A researcher cited by other highly influential researchers has high rank
prestige.
o Significance: Forms the basis for web page link analysis algorithms like
PageRank and HITS, where the importance of a page is determined by the
importance of pages linking to it.
HITS
PageRank is static because it calculates the importance of web pages based on the overall
structure of the web and the links between pages. The PageRank values are precomputed
and do not change based on specific search queries.
HITS is search query dependent: because it computes two scores, Authority and Hub,
dynamically for each query. HITS identifies relevant pages (authorities) and the pages that link
to them (hubs) based on the specific search results, making it responsive to the particular query
entered by the user.
HITS stands for Hypertext Induced Topic Search. Unlike PageRank, which is a static ranking
algorithm, HITS is search query dependent. When the user issues a search query, HITS first
expands the list of relevant pages returned by a search engine and then produces two rankings of
the expanded set of pages, authority ranking and hub ranking.
An authority is a page with many in-links. The idea is that the page may have authoritative
content on some topic and thus many people trust it and thus link to it. A hub is a page with
many out-links. The page serves as an organizer of the information on a particular topic and
points to many good authority pages on the topic. When a user comes to this hub page, he/she
will find many useful links which take him/her to good content pages on the topic. Fig. 7.8
shows an authority page and a hub page.
The key idea of HITS is that a good hub points to many good authorities and a good authority is
pointed to by many good hubs.
Thus, authorities and hubs have a mutual reinforcement relationship. Fig. 7.9 shows a set of
densely linked authorities and hubs (a bipartite sub-graph).
HITS Algorithm
Let us first describe how HITS collects pages to be ranked. Given a broad search query, q, HITS
collects a set of pages as follows:
1. It sends the query q to a search engine system. It then collects t (t = 200 is used in the HITS
paper) highest ranked pages, which are assumed to be highly relevant to the search query. This set is
called the root set W.
2. It then grows W by including any page pointed to by a page in W and any page that points to a
page in W. This gives a larger set called S.
However, this set can be very large. The algorithm restricts its size by allowing each page in W
to bring at most k pages (k = 50 is used in the HITS paper) pointing to it into S. The set S is
called the base set.
HITS then works on the pages in S, and assigns every page in S an authority score and a hub
score. Let the number of pages to be studied be n. We again use G = (V, E) to denote the
(directed) link graph of S. V is the set of pages (or nodes) and E is the set of directed edges (or
links). We use L to denote the adjacency matrix of the graph.
The reason we use the transpose of the adjacency matrix L to calculate the authority scores is
based on the directional nature of links:
Hub Scores Calculation: The hub score of a page is determined by the authority scores
of the pages it points to. This is given by multiplying the adjacency matrix L with the
authority vector a:

h = L a
Authority Scores Calculation: The authority score of a page is determined by the hub
scores of the pages that point to it. Here, we need to consider the incoming links, so we
use the transpose of the adjacency matrix L:

a = L^T h

The transpose L^T effectively reverses the direction of the links in L. This allows us to calculate
the authority score of a page by summing the hub scores of all pages that link to it.
Continue this process until the authority and hub scores converge to stable values. This will
typically happen when the change in scores between iterations falls below a predefined
threshold.
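The iteration can be sketched with numpy as follows (the link matrix L is a small made-up example, not taken from the text):

# HITS power iteration: a = L^T h, h = L a, normalizing after each step.
import numpy as np

# L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

a = np.ones(L.shape[0])  # authority scores
h = np.ones(L.shape[0])  # hub scores

for _ in range(100):
    a = L.T @ h                 # authority: sum of hub scores of in-linking pages
    h = L @ a                   # hub: sum of authority scores of pages linked to
    a = a / np.linalg.norm(a)
    h = h / np.linalg.norm(h)

print("authority:", a)
print("hub:      ", h)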
Converged Scores
After several iterations, the scores will converge, providing the final authority and hub scores for
each node in the graph. The nodes with the highest authority scores are considered the most
authoritative, and those with the highest hub scores are the best hubs.
This process helps in ranking the nodes based on their authority and hub values in a linked
network.
The final authority and hub scores reflect the relative importance of each page as an authority and as a
hub in the given web graph. This iterative process highlights the most authoritative pages and the best
hubs based on the structure of the web links.
Finding Other Eigenvectors
In linear algebra, an eigenvector of a matrix is a vector that, when multiplied by the matrix,
yields a scalar multiple of itself. This scalar is called an eigenvalue. Mathematically, for a matrix
M and a vector v, if λ is an eigenvalue, then Mv = λv.
Concept: Eigenvector
Example: Let
A = | 2 1 |
    | 1 1 |
An eigenvector of A is a non-zero vector v that, when multiplied by A, results in a scaled version of itself. Consider the candidate vector
v = | 1 |
    | 1 |
Then
Av = | 2 1 | | 1 | = | 3 |
     | 1 1 | | 1 |   | 2 |
For v to be an eigenvector, we would need
| 3 |     | 1 |
| 2 | = λ | 1 |
which gives λ = 3 from the first component and λ = 2 from the second. Since λ must be the same in both equations, v = (1, 1) is not an eigenvector of A.
Note that this is a simple example; in general, finding eigenvalues and eigenvectors can be more complex and requires more sophisticated methods.
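For comparison, the actual eigenvalues and eigenvectors of the matrix A above can be obtained numerically (a small sketch, assuming numpy is available):

# Eigenvalues and eigenvectors of A = [[2, 1], [1, 1]].
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # approximately [2.618, 0.382]
print(eigenvectors)  # columns are the corresponding eigenvectors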
The HITS (Hyperlink-Induced Topic Search) algorithm is used to rank web pages by identifying
two types of pages: authorities and hubs.
The algorithm works by finding the principal eigenvectors of the authority matrix (L^T L) and the
hub matrix (L L^T), which represent the most important authorities and hubs. However, a single
dominant community is not always the whole story; the non-principal eigenvectors can reveal other
communities of pages. A query term can be:
1. Ambiguous: A term like "jaguar" could refer to an animal or a car. The principal
eigenvector might capture the most common use, but non-principal eigenvectors can help
identify other meanings.
2. Broadly Used: A term like "classification" could appear in different contexts, such as
biology or machine learning. Non-principal eigenvectors can help uncover these different
contexts.
3. Polarizing: For issues like "abortion," the web pages might be divided into different
communities (pro and anti), with each community not linking to the other. Non-principal
eigenvectors can help identify these separate groups.
Example to Illustrate
1. Principal Eigenvector: Suppose you search for "jaguar." The principal eigenvector
might identify the most popular pages about the animal "jaguar."
2. Non-Principal Eigenvectors: These could identify clusters of pages about the "Jaguar"
car, a community of web pages about "Jaguar" in mythology, etc.
By identifying these additional clusters, we get a richer, more complete understanding of how the
term is used across different contexts on the web.
Summary
This approach is useful for handling ambiguous queries, broadly used terms, and polarized
issues, allowing us to see multiple relevant communities.
Relationships with Co-Citation and Bibliographic Coupling
Co-citation is used to measure the similarity of two papers (or publications). If papers i and j are both
cited by paper k, then they may be said to be related in some sense to each other, even though they do not
directly cite each other. Fig. 7.3 shows that papers i and j are co-cited by paper k. If papers i and j are
cited together by many papers, it means that i and j have a strong relationship or similarity. The more
papers they are cited by, the stronger their relationship is.
Let L be the citation matrix. Each cell of the matrix is defined as follows: Lij = 1 if paper i cites paper j,
and 0 otherwise. Co-citation (denoted by Cij) is a similarity measure defined as the number of papers
that co-cite i and j, and is computed with

Cij = Σk=1..n Lki Lkj
where n is the total number of papers. Cii is naturally the number of papers that cite i. A square matrix C
can be formed with Cij, and it is called the cocitation matrix. Co-citation is symmetric, Cij = Cji, and is
commonly used as a similarity measure of two papers in clustering to group papers of similar topics
together.
Bibliographic Coupling
Bibliographic coupling operates on a similar principle, but in a way it is the mirror image of co-citation.
Bibliographic coupling links papers that cite the same articles so that if papers i and j both cite paper k,
they may be said to be related, even though they do not directly cite each other. The more papers they
both cite, the stronger their similarity is. Fig. 7.4 shows both papers i and j citing (referencing) paper k.
Fig. 7.4. Both paper i and paper j cite paper k
We use Bij to represent the number of papers that are cited by both papers i and j:

Bij = Σk=1..n Lik Ljk
Bii is naturally the number of references (in the reference list) of paper i. A square matrix B can be formed
with Bij, and it is called the bibliographic coupling matrix. Bibliographic coupling is also symmetric
and is regarded as a similarity measure of two papers in clustering.
Authority pages and hub pages have their matches in the bibliometric citation context. An
authority page is like an influential research paper (publication) which is cited by many subsequent
papers.
A hub page is like a survey paper which cites many other papers (including those influential
papers). It is no surprise that there is a connection between authority and hub, and co-citation and
bibliographic coupling.
Recall that co-citation of pages i and j, denoted by Cij, is computed as

Cij = Σk=1..n Lki Lkj = (L^T L)ij

This shows that the authority matrix (L^T L) of HITS is in fact the co-citation matrix C in the Web context.
Likewise, recall that bibliographic coupling of two pages i and j, denoted by Bij, is computed as

Bij = Σk=1..n Lik Ljk = (L L^T)ij

which shows that the hub matrix (L L^T) of HITS is the bibliographic coupling matrix B in the Web context.
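These relationships are easy to verify numerically (a numpy sketch with a made-up citation matrix):

# With L[i, j] = 1 when page i links to (cites) page j:
import numpy as np

L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

C = L.T @ L   # co-citation matrix: C[i, j] = number of pages that cite both i and j
B = L @ L.T   # bibliographic coupling matrix: B[i, j] = number of pages cited by both i and j

print(C)
print(B)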
The main strength of HITS is its ability to rank pages according to the query topic, which may be able to
provide more relevant authority and hub pages. The ranking may also be combined with information
retrieval based rankings. However, HITS has several disadvantages.
First of all, it does not have the anti-spam capability of PageRank. It is quite easy to influence HITS by
adding out-links from one’s own page to point to many good authorities. This boosts the hub score of the
page. Because hub and authority scores are interdependent, it in turn also increases the authority score of
the page.
Another problem of HITS is topic drift. In expanding the root set, it can easily collect many pages
(including authority pages and hub pages) which have nothing to do with the search topic because out-links of
a page may not point to pages that are relevant to the topic and in-links to pages in the root set may be
irrelevant as well because people put hyperlinks for all kinds of reasons, including spamming.
Query-time evaluation is also a major drawback: getting the root set, expanding it, and then
performing the eigenvector computation are all time-consuming operations.