Unit-Ii Text and Web Page Pre-Processing: Stop Words
Stemming:
In many languages, a word has various syntactical forms depending on the contexts that it is
used. For example, in English, nouns have plural forms, verbs have gerund forms (by adding
“ing”), and verbs used in the past tense are different from the present tense. These are considered
as syntactic variations of the same root form. Such variations cause low recall for a retrieval
system because a relevant document may contain a variation of a query word but not the exact
word itself. This problem can be partially dealt with by stemming.
Stemming refers to the process of reducing words to their stems or roots. A stem is the portion
of a word that is left after removing its prefixes and suffixes. In English, most variants of a word
are generated by the introduction of suffixes (rather than prefixes). Thus, stemming in English
usually means suffix removal, or stripping.
For example,
“computer”, “computing”, and “compute” are reduced to “comput”.
“walks”, “walking” and “walker” are reduced to “walk”.
Stemming enables different variations of the word to be considered in retrieval, which improves
the recall. There are several stemming algorithms, also known as stemmers. In English, the
most popular stemmer is perhaps Martin Porter's stemming algorithm, which uses a set of
rules for stemming.
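To make this concrete, here is a minimal stemming sketch in Python; it assumes the NLTK library (not mentioned in the text) is installed, which provides an implementation of Porter's algorithm.

# A minimal stemming sketch, assuming the NLTK library is available (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computing", "compute", "walks", "walking", "walker"]:
    print(word, "->", stemmer.stem(word))

# Note: the actual outputs follow Porter's measure-based rules and may differ slightly
# from the idealized examples above (e.g., "walker" may not be reduced to "walk").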
Over the years, many researchers evaluated the advantages and disadvantages of using
stemming. Clearly, stemming increases the recall and reduces the size of the indexing structure.
However, it can hurt precision because many irrelevant documents may be considered relevant.
For example, both “cop” and “cope” are reduced to the stem “cop”. However, if one is looking
for documents about police, a document that contains only “cope” is unlikely to be relevant.
Although many experiments have been conducted by researchers, there is still no conclusive
evidence one way or the other. In practice, one should experiment with the document collection
at hand to see whether stemming helps.
Stemming is performed to improve the efficiency and accuracy of natural language processing
tasks by reducing words to their base or root form. This process helps in standardizing different
variations of a word to a common form, enabling better matching and comparison. For example,
the words "running", "runner", and "ran" can all be reduced to the stem "run". By doing this,
search engines and text analysis tools can recognize that these words are related and treat them as
equivalent, which enhances information retrieval, indexing, and text mining. Stemming also
reduces the dimensionality of the data, making algorithms more efficient and improving their
performance in tasks such as document classification, clustering, and topic modeling. Overall,
stemming helps in capturing the underlying meaning of words and their relationships, leading to
more accurate and meaningful results in various applications.
Other Pre-Processing Tasks for Text
Digits: Numbers and terms that contain digits are removed in traditional IR systems except
some specific types, e.g., dates, times, and other pre-specified types expressed with regular
expressions. However, in search engines, they are usually indexed.
Hyphens: Hyphens are usually broken (removed) to deal with the inconsistency of usage. For
example, some people use “state-of-the-art”, but others use “state of the art”. If the hyphens
in the first case are removed, we eliminate the inconsistency problem. However, some words
may have a hyphen as an integral part of the word, e.g., “Y-21”. Thus, in general, the system can
follow a general rule (e.g., removing all hyphens) and also have some exceptions.
Note that there are two types of removal, i.e.,
(1) each hyphen is replaced with a space and
(2) each hyphen is simply removed without leaving a space so that “state-of-the-art” may be
replaced with “state of the art” or “stateoftheart”.
In some systems both forms are indexed as it is hard to determine which is correct, e.g., if “pre-
processing” is converted to “pre processing”, then some relevant pages will not be found if the
query term is “preprocessing”.
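The two removal options can be illustrated with a short, hypothetical snippet (simple string operations, not any particular system's code):

# Illustration of the two hyphen-removal options discussed above.
term = "state-of-the-art"

replaced_with_space = term.replace("-", " ")  # option (1): "state of the art"
simply_removed = term.replace("-", "")        # option (2): "stateoftheart"

print(replaced_with_space)
print(simply_removed)

# Some systems index both forms, e.g., for "pre-processing" both
# "pre processing" and "preprocessing" would be indexed.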
Punctuation Marks: Punctuation can be dealt with similarly as hyphens.
Case of Letters: All the letters are usually converted to either the upper or lower case.
3. Removing HTML tags: When HTML tags are stripped from a page, text from different layout blocks may be
concatenated; for example, text in the left column will join “Main Page” on the right, but they should not be
joined. Such spurious joins cause problems for phrase queries and proximity queries. This problem has not been
dealt with satisfactorily by search engines.
4. Identifying main content blocks: A typical Web page, especially a commercial page, contains a
large amount of information that is not part of the main content of the page. For example, it may contain
banner ads, navigation bars, copyright notices, etc., which can lead to poor results for search and mining.
In Fig. the main content block of the page is the block containing “Today’s featured article.” It is not
desirable to index anchor texts of the navigation links as a part of the content of this page. Several
researchers have studied the problem of identifying main content blocks. They showed that search and
data mining results can be improved significantly if only the main content blocks are used. We briefly
discuss two techniques for finding such blocks in Web pages.
Partitioning based on visual cues: This method uses visual information to help find main
content blocks in a page. Visual or rendering information of each HTML element in a page
can be obtained from the Web browser. For example, Internet Explorer provides an API
that can output the X and Y coordinates of each element. A machine learning model can then
be built based on the location and appearance features for identifying main content blocks of
pages. Of course, a large number of training examples need to be manually labeled.
Tree matching: This method is based on the observation that in most commercial Web sites
pages are generated by using some fixed templates. The method thus aims to find such hidden
templates. Since HTML has a nested structure, it is thus easy to build a tag tree for each
page.
Tree matching of multiple pages from the same site can be performed to find such templates.
Once a template is found, we can identify which blocks are likely to be the main content blocks
based on the following observation: the text in main content blocks is usually quite different
across different pages of the same template, but the non-main content blocks are often quite
similar in different pages. To determine the text similarity of corresponding blocks (which are
subtrees), the shingle method can be used.
When dealing with web pages, especially commercial ones, a significant challenge is filtering
out non-essential information like ads, navigation bars, and other distractions to focus on the
main content. Let's explain the process of identifying main content blocks with an example,
using two techniques: Partitioning Based on Visual Cues and Tree Matching.
Example Application:
By applying the visual-cues technique, a machine learning model would identify the “Today’s featured
article: AI advancements” section as the main content block. For tree matching, the text similarity of
corresponding blocks is measured with the shingle method, described below.
The Shingle Method is a technique used to measure the similarity between two pieces of
text by breaking them down into smaller, overlapping chunks called "shingles." This
method is particularly useful for comparing texts to find out how similar they are, even if
the texts have slight variations.
Shingles are small, overlapping subsequences of words or characters taken from a text. For
example, if we use a shingle size of 3 words (3-gram shingles), the sentence "The quick brown
fox jumps over the lazy dog" can be broken down into the following shingles: "The quick brown",
"quick brown fox", "brown fox jumps", "fox jumps over", "jumps over the", "over the lazy", and
"the lazy dog".
Duplicate Detection
Duplicate documents or pages are not a problem in traditional IR. However, in the context of the
Web, it is a significant issue. There are different types of duplication of pages and contents on
the Web.
Copying a page is usually called duplication or replication, and copying an entire site is
called mirroring. Duplicate pages and mirror sites are often used to improve efficiency of
browsing and file downloading worldwide due to limited bandwidth across different geographic
regions and poor or unpredictable network performances. Of course, some duplicate pages are
the results of plagiarism. Detecting such pages and sites can reduce the index size and improve
search results.
Several methods can be used to find duplicate information. The simplest method is to hash the
whole document, e.g., using the MD5 algorithm, or computing an aggregated number (e.g.,
checksum). However, these methods are only useful for detecting exact duplicates. On the
Web, one seldom finds exact duplicates. For example, even different mirror sites may have
different URLs, different Web masters, different contact information, different advertisements to
suit local needs, etc.
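As a sketch of the hashing approach (using Python's standard hashlib module; the documents below are made-up examples):

# Exact-duplicate detection by hashing whole documents with MD5.
import hashlib

def md5_digest(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

d1 = "John went to school with his brother."
d2 = "John went to school with his brother."
d3 = "John went to school with his brother!"  # one character differs

print(md5_digest(d1) == md5_digest(d2))  # True: exact duplicates
print(md5_digest(d1) == md5_digest(d3))  # False: a single change defeats exact hashing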
One efficient duplicate detection technique is based on n-grams (also called shingles). An n-
gram is simply a consecutive sequence of words of a fixed window size n. For example, the
sentence, “John went to school with his brother,” can be represented with five 3-gram phrases
“John went to”, “went to school”, “to school with”, “school with his”, and “with his brother”.
Note that a 1-gram is simply an individual word. Let Sn(d) be the set of distinctive n-grams (or
shingles) contained in document d. Each n-gram may be coded with a number or an MD5 hash
(which is usually a 32-digit hexadecimal number). Given the n-gram representations of the two
documents d1 and d2, Sn(d1) and Sn(d2), the Jaccard coefficient can be used to compute the
similarity of the two documents:

sim(d1, d2) = |Sn(d1) ∩ Sn(d2)| / |Sn(d1) ∪ Sn(d2)|
A threshold is used to determine whether d1 and d2 are likely to be duplicates of each other. For
a particular application, the window size n and the similarity threshold are chosen through
experiments.
Example: Compare Text 1, "The quick brown fox", with Text 2, "The quick brown dog", using 2-word shingles.
Text 1 Shingles:
1. "The quick"
2. "quick brown"
3. "brown fox"
Text 2 Shingles:
1. "The quick"
2. "quick brown"
3. "brown dog"
Jaccard Similarity = 2/4 = 0.5
So, the similarity score between "The quick brown fox" and "The quick brown dog" is 0.5,
indicating that they share 50% of their shingles.
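The shingle computation and the Jaccard coefficient can be sketched in a few lines of Python (a simple illustration, not a production implementation), reproducing the example above:

# Word-level n-gram shingles and the Jaccard coefficient.
def shingles(text, n):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

s1 = shingles("The quick brown fox", 2)   # {'The quick', 'quick brown', 'brown fox'}
s2 = shingles("The quick brown dog", 2)   # {'The quick', 'quick brown', 'brown dog'}
print(jaccard(s1, s2))                    # 0.5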
Social Network Analysis
From a social network we can study the properties of its structure, and the role, position and
prestige of each social actor. We can also find various kinds of sub-graphs, e.g., communities
formed by groups of actors.
Social network analysis is useful for the Web because the Web is essentially a virtual society,
and thus a virtual social network, where each page can be regarded as a social actor and each
hyperlink as a relationship.
Many of the results from social networks can be adapted and extended for use in the Web
context. The ideas from social network analysis are indeed instrumental to the success of Web
search engines.
We introduce two types of social network analysis, centrality and prestige, which are closely
related to hyperlink analysis and search on the Web. Both centrality and prestige are measures
of degree of prominence of an actor in a social network.
Centrality
Centrality defines how important a node is within a network.
Important or prominent actors are those that are linked or involved with other actors
extensively. In the context of an organization, a person with extensive contacts (links) or
communications with many other people in the organization is considered more important than
a person with relatively fewer contacts.
The links can also be called ties. A central actor is one involved in many ties. Fig. shows a
simple example using an undirected graph. Each node in the social network is an actor and each
link indicates that the actors on the two ends of the link communicate with each other.
Intuitively, we see that the actor i is the most central actor because he/she can communicate
with most other actors.
Types of Centrality
1. Degree Centrality:
o Definition: Measures the number of direct connections (ties) an actor has.
o Application: An actor with a high degree centrality is considered well-connected
and influential within the network.
o Example: In a social network, a person who communicates frequently with many
others has high degree centrality.
2. Betweenness Centrality:
o Definition: Measures the extent to which an actor lies on the shortest paths
between other actors.
o Application: An actor with high betweenness centrality acts as a bridge or broker,
facilitating communication between different parts of the network.
o Example: A person who connects different groups within an organization and
controls information flow has high betweenness centrality.
3. Closeness Centrality:
o Definition: Measures how close an actor is to all other actors in the network,
based on the shortest paths.
o Application: An actor with high closeness centrality can quickly interact with all
others, making them efficient in spreading information.
o Example: A person who can quickly reach and influence all members of a
network has high closeness centrality.
These types of centrality help in identifying key actors in a network based on their connectivity,
bridging role, and efficiency in communication.
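These measures can be computed directly with the networkx library (an assumption, not part of the text); the small communication graph below is made up for illustration:

# Degree, betweenness, and closeness centrality on a small undirected network.
import networkx as nx

G = nx.Graph([("a", "i"), ("b", "i"), ("c", "i"), ("d", "i"), ("i", "j"), ("j", "k")])

print(nx.degree_centrality(G))       # number of direct ties (normalized)
print(nx.betweenness_centrality(G))  # how often a node lies on shortest paths
print(nx.closeness_centrality(G))    # closeness to all other nodes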
Prestige
Prestige is a more refined measure of prominence of an actor than centrality. We need to
distinguish between ties sent (out-links) and ties received (in-links). A prestigious actor is
defined as one who is the object of extensive ties as a recipient. In other words, to compute the
prestige of an actor, we only look at the ties (links) directed or pointed to the actor (in-links).
Hence, the prestige cannot be computed unless the relation is directional or the graph is
directed. The main difference between the concepts of centrality and prestige is that centrality
focuses on out-links while prestige focuses on in-links.
We define three prestige measures. The third prestige measure (i.e., rank prestige) forms the
basis of most Web page link analysis algorithms, including PageRank and HITS.
Rank Prestige
The first two prestige measures (degree prestige and proximity prestige) are based on in-degrees and distances. However, an important
factor that has not been considered is the prominence of individual actors who do the “voting”
or “choosing.”
In the real world, a person i chosen by an important person is more prestigious than one chosen by a
less important person. For example, a company CEO voting for a person is much more
important than a worker voting for the person.
If one’s circle of influence is full of prestigious actors, then one’s own prestige is also high. Thus
one’s prestige is affected by the ranks or statuses of the involved actors. Based on this intuition,
the rank prestige PR(i) is defined as a linear combination of links that point to i:

PR(i) = A1i PR(1) + A2i PR(2) + ... + Ani PR(n)

where Aji = 1 if actor j points to (chooses) actor i, and 0 otherwise, and PR(j) is the rank prestige of actor j.
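A minimal numpy sketch (an illustration under the definition above, with a made-up adjacency matrix) computes rank prestige by repeatedly applying this equation until the values stabilize:

# Rank prestige by power iteration: P = A^T P (up to normalization).
import numpy as np

# A[j, i] = 1 if actor j chooses (points to) actor i.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

p = np.ones(A.shape[0])
for _ in range(100):
    p = A.T @ p                 # prestige flows along in-links
    p = p / np.linalg.norm(p)   # normalize to keep values bounded

print(p)  # converged rank-prestige vector (principal eigenvector of A^T)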
Key Differences:
1. Degree Prestige:
o Definition: Measures the number of direct in-links an actor has.
o Application: An actor with many incoming ties is considered prestigious.
o Example: In a corporate setting, a person frequently contacted by others for
advice has high degree prestige.
2. Proximity Prestige:
o Definition: Measures how close an actor is to all other actors, considering only
the in-links.
o Application: An actor who can be quickly reached by others has high proximity
prestige.
o Example: A manager who can be quickly accessed by employees for guidance
has high proximity prestige.
3. Rank Prestige:
o Definition: Considers the prestige of the actors linking to the actor, not just the
number of in-links.
o Application: An actor linked by other prestigious actors has higher rank prestige.
o Example: A researcher cited by other highly influential researchers has high rank
prestige.
o Significance: Forms the basis for web page link analysis algorithms like
PageRank and HITS, where the importance of a page is determined by the
importance of pages linking to it.
HITS
PageRank is static because it calculates the importance of web pages based on the overall
structure of the web and the links between pages. The PageRank values are precomputed
and do not change based on specific search queries.
HITS is search query dependent: because it computes two scores, Authority and Hub,
dynamically for each query. HITS identifies relevant pages (authorities) and the pages that link
to them (hubs) based on the specific search results, making it responsive to the particular query
entered by the user.
HITS stands for Hypertext Induced Topic Search. Unlike PageRank, which is a static ranking
algorithm, HITS is search query dependent. When the user issues a search query, HITS first
expands the list of relevant pages returned by a search engine and then produces two rankings of
the expanded set of pages, authority ranking and hub ranking.
An authority is a page with many in-links. The idea is that the page may have authoritative
content on some topic and thus many people trust it and thus link to it. A hub is a page with
many out-links. The page serves as an organizer of the information on a particular topic and
points to many good authority pages on the topic. When a user comes to this hub page, he/she
will find many useful links which take him/her to good content pages on the topic. Fig. 7.8
shows an authority page and a hub page.
The key idea of HITS is that a good hub points to many good authorities and a good authority is
pointed to by many good hubs.
Thus, authorities and hubs have a mutual reinforcement relationship. Fig. 7.9 shows a set of
densely linked authorities and hubs (a bipartite sub-graph).
HITS Algorithm
Let us first describe how HITS collects pages to be ranked. Given a broad search query, q, HITS
collects a set of pages as follows:
1. It sends the query q to a search engine system. It then collects t (t = 200 is used in the HITS
paper) highest ranked pages, which are assumed to be highly relevant to the search query. This set is
called the root set W.
2. It then grows W by including any page pointed to by a page in W and any page that points to a
page in W. This gives a larger set called S.
However, this set can be very large. The algorithm restricts its size by allowing each page in W
to bring at most k pages (k = 50 is used in the HITS paper) pointing to it into S. The set S is
called the base set.
HITS then works on the pages in S, and assigns every page in S an authority score and a hub
score. Let the number of pages to be studied be n. We again use G = (V, E) to denote the
(directed) link graph of S. V is the set of pages (or nodes) and E is the set of directed edges (or
links). We use L to denote the adjacency matrix of the graph.
The reason we use the transpose of the adjacency matrix L to calculate the authority scores is
based on the directional nature of links:
Hub Scores Calculation: The hub score of a page is determined by the authority scores
of the pages it points to. This is given by multiplying the adjacency matrix L with the
authority vector a:

h = L a
Authority Scores Calculation: The authority score of a page is determined by the hub
scores of the pages that point to it. Here, we need to consider the incoming links, so we
use the transpose of the adjacency matrix L:

a = L^T h

The transpose L^T effectively reverses the direction of the links in L. This allows us to calculate
the authority score of a page by summing the hub scores of all pages that link to it.
Continue this process until the authority and hub scores converge to stable values. This will
typically happen when the change in scores between iterations falls below a predefined
threshold.
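The iteration can be sketched with numpy as follows (the link matrix L is a small made-up example, not taken from the text):

# HITS power iteration: a = L^T h, h = L a, normalizing after each step.
import numpy as np

# L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

a = np.ones(L.shape[0])  # authority scores
h = np.ones(L.shape[0])  # hub scores

for _ in range(100):
    a = L.T @ h                 # authority: sum of hub scores of in-linking pages
    h = L @ a                   # hub: sum of authority scores of pages linked to
    a = a / np.linalg.norm(a)
    h = h / np.linalg.norm(h)

print("authority:", a)
print("hub:      ", h)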
Converged Scores
After several iterations, the scores will converge, providing the final authority and hub scores for
each node in the graph. The nodes with the highest authority scores are considered the most
authoritative, and those with the highest hub scores are the best hubs.
This process helps in ranking the nodes based on their authority and hub values in a linked
network.
The final authority and hub scores reflect the relative importance of each page as an authority and as a
hub in the given web graph. This iterative process highlights the most authoritative pages and the best
hubs based on the structure of the web links.
Finding Other Eigenvectors
In linear algebra, an eigenvector of a matrix is a vector that, when multiplied by the matrix,
yields a scalar multiple of itself. This scalar is called an eigenvalue. Mathematically, for a matrix
M and a vector v, if λ is an eigenvalue, then Mv = λv.
Concept: Eigenvector
Example: Let
A = | 2 1 |
    | 1 1 |
An eigenvector of A is a non-zero vector v that, when multiplied by A, results in a scaled version of itself. Consider the candidate vector
v = | 1 |
    | 1 |
Then
Av = | 2 1 | | 1 | = | 3 |
     | 1 1 | | 1 |   | 2 |
For v to be an eigenvector, we would need
| 3 |     | 1 |
| 2 | = λ | 1 |
which gives λ = 3 from the first component and λ = 2 from the second. Since λ must be the same in both equations, v = (1, 1) is not an eigenvector of A.
Note that this is a simple example; in general, finding eigenvalues and eigenvectors can be more complex and requires more sophisticated methods.
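For comparison, the actual eigenvalues and eigenvectors of the matrix A above can be obtained numerically (a small sketch, assuming numpy is available):

# Eigenvalues and eigenvectors of A = [[2, 1], [1, 1]].
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # approximately [2.618, 0.382]
print(eigenvectors)  # columns are the corresponding eigenvectors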
The HITS (Hyperlink-Induced Topic Search) algorithm is used to rank web pages by identifying
two types of pages: authorities and hubs.
The algorithm works by finding the principal eigenvectors of the authority matrix (L^T L) and the
hub matrix (L L^T), which represent the most important authorities and hubs. However, a single
dominant community is not always the whole story; the non-principal eigenvectors can reveal other
communities of pages. A query term can be:
1. Ambiguous: A term like "jaguar" could refer to an animal or a car. The principal
eigenvector might capture the most common use, but non-principal eigenvectors can help
identify other meanings.
2. Broadly Used: A term like "classification" could appear in different contexts, such as
biology or machine learning. Non-principal eigenvectors can help uncover these different
contexts.
3. Polarizing: For issues like "abortion," the web pages might be divided into different
communities (pro and anti), with each community not linking to the other. Non-principal
eigenvectors can help identify these separate groups.
Example to Illustrate
1. Principal Eigenvector: Suppose you search for "jaguar." The principal eigenvector
might identify the most popular pages about the animal "jaguar."
2. Non-Principal Eigenvectors: These could identify clusters of pages about the "Jaguar"
car, a community of web pages about "Jaguar" in mythology, etc.
By identifying these additional clusters, we get a richer, more complete understanding of how the
term is used across different contexts on the web.
Summary
This approach is useful for handling ambiguous queries, broadly used terms, and polarized
issues, allowing us to see multiple relevant communities.
Relationships with Co-Citation and Bibliographic Coupling
Co-citation is used to measure the similarity of two papers (or publications). If papers i and j are both
cited by paper k, then they may be said to be related in some sense to each other, even though they do not
directly cite each other. Fig. 7.3 shows that papers i and j are co-cited by paper k. If papers i and j are
cited together by many papers, it means that i and j have a strong relationship or similarity. The more
papers they are cited by, the stronger their relationship is.
Let L be the citation matrix. Each cell of the matrix is defined as follows: Lij = 1 if paper i cites paper j,
and 0 otherwise. Co-citation (denoted by Cij) is a similarity measure defined as the number of papers
that co-cite i and j, and is computed with

Cij = Σk=1..n Lki Lkj
where n is the total number of papers. Cii is naturally the number of papers that cite i. A square matrix C
can be formed with Cij, and it is called the cocitation matrix. Co-citation is symmetric, Cij = Cji, and is
commonly used as a similarity measure of two papers in clustering to group papers of similar topics
together.
Bibliographic Coupling
Bibliographic coupling operates on a similar principle, but in a way it is the mirror image of co-citation.
Bibliographic coupling links papers that cite the same articles so that if papers i and j both cite paper k,
they may be said to be related, even though they do not directly cite each other. The more papers they
both cite, the stronger their similarity is. Fig. 7.4 shows both papers i and j citing (referencing) paper k.
Fig. 7.4. Both paper i and paper j cite paper k
We use Bij to represent the number of papers that are cited by both papers i and j:

Bij = Σk=1..n Lik Ljk
Bii is naturally the number of references (in the reference list) of paper i. A square matrix B can be formed
with Bij, and it is called the bibliographic coupling matrix. Bibliographic coupling is also symmetric
and is regarded as a similarity measure of two papers in clustering.
Authority pages and hub pages have their matches in the bibliometric citation context. An
authority page is like an influential research paper (publication) which is cited by many subsequent
papers.
A hub page is like a survey paper which cites many other papers (including those influential
papers). It is no surprise that there is a connection between authority and hub, and co-citation and
bibliographic coupling.
Recall that co-citation of pages i and j, denoted by Cij, is computed as

Cij = Σk=1..n Lki Lkj = (L^T L)ij

This shows that the authority matrix (L^T L) of HITS is in fact the co-citation matrix C in the Web context.
Likewise, recall that bibliographic coupling of two pages i and j, denoted by Bij, is computed as

Bij = Σk=1..n Lik Ljk = (L L^T)ij

which shows that the hub matrix (L L^T) of HITS is the bibliographic coupling matrix B in the Web context.
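These relationships are easy to verify numerically (a numpy sketch with a made-up citation matrix):

# With L[i, j] = 1 when page i links to (cites) page j:
import numpy as np

L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

C = L.T @ L   # co-citation matrix: C[i, j] = number of pages that cite both i and j
B = L @ L.T   # bibliographic coupling matrix: B[i, j] = number of pages cited by both i and j

print(C)
print(B)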
The main strength of HITS is its ability to rank pages according to the query topic, which may be able to
provide more relevant authority and hub pages. The ranking may also be combined with information
retrieval based rankings. However, HITS has several disadvantages.
First of all, it does not have the anti-spam capability of PageRank. It is quite easy to influence HITS by
adding out-links from one’s own page to point to many good authorities. This boosts the hub score of the
page. Because hub and authority scores are interdependent, it in turn also increases the authority score of
the page.
Another problem of HITS is topic drift. In expanding the root set, it can easily collect many pages
(including authority pages and hub pages) which have nothing to do with the search topic because out-links of
a page may not point to pages that are relevant to the topic and in-links to pages in the root set may be
irrelevant as well because people put hyperlinks for all kinds of reasons, including spamming.
Query-time evaluation is also a major drawback: getting the root set, expanding it, and then
performing the eigenvector computation are all time-consuming operations.