Towards The Semantic Web: Collaborative Tag Suggestions
ABSTRACT
Content organization over the Internet went through several interesting phases of evolution: from structured directories to unstructured Web search engines and, more recently, to tagging as a way of aggregating information, a step towards the semantic web vision. Tagging allows ranking and data organization to directly utilize inputs from end users, enabling machine processing of Web content. Since tags are created by individual users in a free form, one important problem facing tagging is to identify the most appropriate tags while eliminating noise and spam. For this purpose, we define a set of general criteria for a good tagging system. These criteria include high coverage of multiple facets to ensure good recall, least effort to reduce the cost involved in browsing, and high popularity to ensure tag quality. We propose a collaborative tag suggestion algorithm using these criteria to spot high-quality tags. The proposed algorithm employs a goodness measure for tags derived from collective user authorities to combat spam. The goodness measure is iteratively adjusted by a reward-penalty algorithm, which also incorporates other sources of tags, e.g., content-based auto-generated tags. Our experiments based on My Web 2.0 show that the algorithm is effective.

Keywords
Classification, tagging, information retrieval, collaborative filtering, Web 2.0.

1. INTRODUCTION
Effectively organizing information over the World Wide Web has been a challenging problem since the beginning. In the early days of the Internet, portal services organized Web content into hierarchical directories, assuming that the Web can be organized by strict structures of topics. However, the manually supervised directories have been gradually superseded by crawler-based search engines for at least two reasons: data explosion and the unstructured nature of Web content. While search engines work well for users to access Web information by issuing ad hoc queries, they use very limited semantic information of the Web content by parsing content and exploiting the hyperlink structure established by Web masters. The pull model used by search engines makes it hard to discover new and dynamic content. According to Brightplanet, the deep Web can be 500 times larger than the surface Web. In addition, personalization and spam detection require human inputs. Furthermore, it is difficult for people to share massive unstructured Web pages among each other or recover them later. A push model that directly takes inputs from users solves these problems. Tagging is a process by which users assign labels (in the form of keywords) to Web objects with the purpose of sharing, discovering, and recovering them. Discovery enables users to find new content of interest shared by other users. Recovery enables a user to recall content that was discovered before. Further, tagging allows ranking and data organization to utilize metadata from individual users directly. It brings some benefits of the semantic Web into the current HTML-dominated Web.

We are witnessing an increasing number of tagging services on the web, such as Flickr [11], Delicious [10], My Web 2.0 [12], Rawsugar [14], and Shadows [15]. Flickr enables users to tag photos and share them with others. Delicious users can tag URLs and share their bookmarks with the public. My Web 2.0 provides a Web-scale social search engine to enable users to find, use, share, and expand human knowledge. It allows users to save and tag Web pages so that they can easily browse and search for the content again. It also enables users to share Web pages within a personalized community or with the public by setting access privileges. Further, My Web 2.0 provides scoped search within a user’s trusted social networks, e.g., friends or friends of friends. Consequently, the search results are personalized and spam-filtered by the trusted networks.

Tagging advocates a grassroots approach to form a so-called “folksonomy”, which is neither hierarchical nor exclusive. With tagging, a user can enter labels in a free form to tag any object; it therefore relieves users of much of the burden of fitting objects into a universal ontology. Meanwhile, a user can use a certain tag combination to express interest in objects tagged by other users, e.g., tags (renewable, energy) for objects tagged with both the keywords renewable and energy.

Ontology works well when the corpus is small or in a constrained domain, the objects to be categorized are stable, and the users are experts [8]. A universal ontology is difficult and expensive to construct and maintain when it involves hundreds of millions of users with diverse backgrounds. When used to organize Web objects, ontology faces two hard problems: unlike physical objects, digital content is seldom semantically pure enough to fit in a specific category; and it is difficult to predict the paths through which a user would explore to discover a digital object [8]. Taking Yahoo directory as an example, a recipe book belongs to both the categories Shopping and Health, since it is hard to predict which category an end user would perceive to be the best fit.
Tagging bridges some of the gap between browsing and search. Browsing enumerates all objects and finds the desirable one by exerting the recognition aspect of the human brain, whereas search uses association and dives directly to the objects of interest, and thus is mentally less obnoxious [9].

The benefits of tagging do not come without a cost. For instance, the number of tags in a social network multiplies like rabbits [13]. The structure of a traditional hierarchy disappears. Tagging relates to faceted classification, which uses clearly defined, mutually exclusive, and collectively exhaustive aspects to describe objects. For instance, a music piece can be identified by facets such as artist, album, genre, and composer. Faceted systems fail to dictate a linear order in which to experience the facets, a step crucial for guiding the users to explore this system. Since tags are created by end-users in a free form, they can be chaotic when compared with a faceted system constructed by experts. This lack of order and depth can result in a disaster, leaving the users muddled in a “hodgepodge” [13].

To remedy the shortcomings of tagging, we advocate using collaborative filtering to automatically identify high-quality tags for users, leveraging the collective wisdom of Web users. Specifically, this paper makes the following contributions:

• We discuss the desirable properties of a good tagging system, which include: (a) high coverage of multiple facets, (b) high popularity, and (c) least-effort. Faceted and generic tags can facilitate the aggregation of objects entered by different users, which makes discovery and recovery of tagged content easier. Tags used by a large number of people for a given object are less likely to be spam and more likely to be used by a new user for the same object. Least-effort has two meanings: the number of objects identified by the suggested tags should be small, and the number of tags for identifying an object should be minimized as well. This enables efficient recovery of the tagged objects.

• We propose collaborative tagging techniques that suggest tags for an object based on what other users use to tag the object. This not only addresses the vocabulary divergence problem, but also relieves users of the obnoxious task of having to come up with a good set of tags.

• We propose a reputation score for each user based on the quality of the tags contributed by the user.

• By introducing the notion of “virtual” users, our tag suggestion algorithm incorporates not only user-generated tags but also other sources of tags, such as tags auto-generated via content-based or context-based analysis.

• We have implemented a simplified tag suggestion scheme in My Web 2.0. Our experience shows that this simple scheme is quite effective in suggesting appropriate tags that possess the properties we propose for a good tagging system.

The rest of the paper is organized as follows: Section 2 discusses an important usage of tags for relational browsing. Section 3 describes a set of criteria for selecting high-quality tags and proposes an algorithm for tag suggestion. In Section 4, we illustrate our algorithm with a few examples. We conclude in Section 5.

2. RELATIONAL TAG BROWSING
Tagging is a tool to organize objects for the purposes of recovery and discovery. Unlike scientific classification, which forces a hierarchical structure on objects, tagging organizes objects in a network structure, thus making it suitable to organize Web objects, which lack a clear hierarchical structure by nature. Tagging, when combined with search technology, becomes a powerful tool to discover interesting Web objects. With the help of search technology, tagged objects can be browsed or searched for. The way tags work is analogous to filters. They are treated as logical constraints to filter the objects. Refinement of results is done by strengthening the constraints, whereas generalization is done by weakening them. For example, the tag combination (2006, calendar) strengthens both the tag (2006) and the tag (calendar).

Figure 1. Tag browsing via filtering. The objects tagged by the tag “folksonomy” intersect with those tagged by the tags “tagging” and “ontology.” Therefore, the tags “tagging” and “ontology” are related to the tag “folksonomy.”

Figure 1 illustrates how tags can be used as a filtering mechanism for browsing and searching for objects. In My Web 2.0, we explore the co-occurrence of tags to enable tag
browsing through progressive refinement. When a user selects a tag combination, the system returns the set of objects tagged with the combination. Meanwhile, it also returns the tags that relate to the selected tags, which are those that co-occur with the selected tags. In Figure 1, the tags (tagging) and (ontology) relate to the tag (folksonomy).

In the next section, we describe our collaborative tag suggestion algorithm.

3. COLLABORATIVE TAG SUGGESTION
3.1 A taxonomy of tags
Before presenting the algorithm, we first describe the categories of tags that we observe on My Web 2.0.

1. Content-based tags: Tags that describe the content of an object or the categories that the object belongs to, e.g., Autos, Honda Odyssey, batman, open source, Lucene, and German Embassy. These tags are usually specific terms and are common in My Web 2.0.

2. Context-based tags: Tags that provide the context in which an object was created or saved, e.g., tags describing locations and time such as San Francisco, Golden Gate Bridge, and 2005-10-19.

3. Attribute tags: Tags that are inherent attributes of an object but may not be derivable from the content directly, e.g., the author of a piece of content, such as Jeremy’s Blog and Clay Shirky.

4. Subjective tags: Tags that express a user’s opinion and emotion, e.g., funny or cool.

5. Organizational tags: Tags that identify personal stuff, e.g., my paper or my work, and tags that serve as a reminder of certain tasks such as to-read or to-review. This type of tag is usually not useful for global tag aggregation with other users’ tags.

Golder and Huberman have also discussed tag categorization [3].

3.2 Criteria for good tags
In a large-scale tagging system like My Web 2.0, an object is usually identified by a group of tags. A specific tag is efficient for identifying an object but less useful for other people to discover new objects. In contrast, a generic tag is useful for discovery but not effective for narrowing down objects. Tagging an object with a good set of tags helps both discovery and recovery. We argue that a good tag combination should have the following properties.

High coverage of multiple facets. A good tag combination should include multiple facets of the tagged objects. For example, tags for a URL to a travel attraction site may include generic tags such as category (travel), location (San Francisco), time (2005), specific tag (Golden Gate Bridge), and subjective tag (cool). Generic tags facilitate the aggregation of the content entered by different users and thus are often used for a large number of objects. The larger the number of facets, the more likely a user is able to recall the tagged content.

High popularity. If a set of tags is used by a large number of people for a particular object, these tags are less likely to be spam. They are also more likely to uniquely identify the tagged content and more likely to be used by a new user for the given object. This is analogous to term frequency in traditional information retrieval.

Least-effort. The number of tags for identifying an object should be minimized, and the number of objects identified by the tag combination should be small. As a result, a user can reach any tagged object in a small number of steps via tag browsing.

Uniformity (normalization). Since there is no universal ontology, tags can diverge dramatically. Different people can use different terms for the same concept. We have observed two general types of divergence: those due to syntactic variance, e.g., blogs, blogging, and blog; and those due to synonymy, e.g., cell-phone and mobile-phone, which are different syntactic terms that refer to the same underlying concept. These kinds of divergence are a double-edged sword. On the one hand, they introduce noise into the system; on the other hand, they can increase recall. The right thing to do is to allow the users to use whatever form they like but to collapse the variants to an internal canonical representation.

Exclusion of certain types of tags. For example, personally used organizational tags are less likely to be shared by different users. Thus, they should be excluded from public usage. Rather than ignoring these tags, My Web 2.0 includes a feature that auto-completes tags as they are being typed by matching the prefixes of the tags entered by the user before. This not only improves the usability of the system but also enables the convergence of tags.

Our criteria are based on a study of tag usage by real users in My Web 2.0. Figure 2 shows the rank of a tag versus the number of URLs labeled by the tag in a log-log scale, which demonstrates a Zipf-like distribution. The figure only shows a subset of data publicly shared by users. We excluded three system-introduced tags, which are automatically generated for Web objects imported from other services. Our data shows that people naturally select some popular and generic tags to label the Web objects they are interested in. The most popular tags include music, news, software, blog, rss, web, programming, and design. These tags are convenient for users to recover and share with other users.
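The rank-versus-count view behind the tag usage study can be reproduced on any collection of tagging events. The following is a minimal sketch, not the procedure used for My Web 2.0: the function name `rank_frequency` and the (url, tag) input format are assumptions made for illustration.

```python
from collections import Counter


def rank_frequency(url_tag_pairs):
    """Rank tags by the number of distinct URLs each one labels.

    url_tag_pairs: iterable of (url, tag) tuples, one per tagging event
    (a hypothetical input format, not the My Web 2.0 schema).
    Returns a list of (rank, tag, url_count); rank 1 is the most used tag.
    """
    # Deduplicate so a URL tagged "news" by many users counts once for "news".
    distinct_pairs = set(url_tag_pairs)
    counts = Counter(tag for _url, tag in distinct_pairs)
    # Sort by descending count; break ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, tag, n) for rank, (tag, n) in enumerate(ordered, start=1)]


# Toy data, invented for this example.
pairs = [
    ("u1", "music"), ("u2", "music"), ("u3", "music"),
    ("u1", "news"), ("u2", "news"),
    ("u3", "rss"),
    ("u1", "music"),  # repeated event, counted once
]
table = rank_frequency(pairs)
```

Plotting the logarithm of the rank against the logarithm of the URL count for such a table gives a roughly straight line when the distribution is Zipf-like.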
Figure 2. Tag popularity

Figure 3. Distribution of the number of Web objects tagged with the corresponding number of tags
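The co-occurrence idea from Section 2 (tags act as filters, and the tags that co-occur with the current selection are offered as related tags) can be sketched as follows. This is an illustrative sketch under assumptions: the function name `browse` and the dictionary layout are hypothetical and do not describe the My Web 2.0 implementation.

```python
from collections import Counter


def browse(tagged_objects, selected_tags):
    """Filter objects by a tag combination and collect the related tags.

    tagged_objects: dict mapping an object id to the set of tags on it
    (hypothetical layout). Returns (matches, related), where matches is the
    set of objects carrying every selected tag and related counts the other
    tags that co-occur with the selection on those objects.
    """
    selected = set(selected_tags)
    # Each selected tag is a logical constraint; all must hold (AND).
    matches = {obj for obj, tags in tagged_objects.items() if selected <= tags}
    related = Counter()
    for obj in matches:
        related.update(tagged_objects[obj] - selected)
    return matches, related


# Toy data, invented for this example.
tagged_objects = {
    "url1": {"folksonomy", "tagging"},
    "url2": {"folksonomy", "ontology"},
    "url3": {"calendar", "2006"},
    "url4": {"calendar"},
}
matches, related = browse(tagged_objects, ["folksonomy"])
refined, _ = browse(tagged_objects, ["2006", "calendar"])
```

Adding a tag to the selection can only shrink the set of matches, which is the refinement step described in Section 2; removing one generalizes the result.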
• Pa(ti|tj): the probability that any object is tagged with ti, given that it is already tagged with tj by any user. Such correlation can be measured as the number of people who have used both ti and tj over the number of people who have used tj. This probability indicates the overlap in terms of the concepts between ti and tj. To ensure that the suggested tags cover multiple facets, our algorithm attempts to minimize the

This minimizes the overlap of the concepts identified by the suggested tags.

• Reward tag t’ if it co-occurs with the selected tag ti when users tag object o.

S(t’,o) = S(t’,o) + Ps(t’|ti;o)*S(ti,o)

Since a user is not likely to tag a given URL using tags that are syntactic variants, e.g., blogs, blogging, and blog, this rewarding mechanism also improves the uniformity of the suggested tags.

This simple principle ensures that the suggested tag combination has a good balance between coverage and popularity.

The algorithm is summarized in Table 1. T is the set of tags assigned to a given object o by all users. The algorithm suggests a pre-specified number K of tags for object o to users based on the tags in T. The suggested tags are stored in R.

Table 1. Basic Algorithm

R = {};  // result tag set
T = all the tags assigned to object o by all users;
X = a set of excluded tags;
K = pre-specified maximum number of suggested tags;
T = T – X;
Compute S(t,o) for each t in T;
while (T ≠ empty AND |R| < K) {
    // find the tag with the highest additional contribution
    ti = the tag in T with S(ti,o) ≥ S(tj,o) for all tj∈T, j≠i;
    // remove the chosen tag from T
    T = T – {ti};
    // adjust the additional contribution of the remaining tags
    foreach tag t’∈T {
        S(t’,o) = S(t’,o) – Pa(t’|ti)*S(ti,o) + Ps(t’|ti;o)*S(ti,o);
    }
    // record the chosen tag
    R = R ∪ {ti};
}

Note that we have adopted a greedy approach to penalize and reward the tag score because of its efficiency, which is important for dealing with Web-scale data. Other more sophisticated algorithms are under investigation.

3.4 Tag Spam Elimination
As tagging becomes more and more popular, tag spam could become a serious problem. In order to combat tag spam, we introduce an authority score (or reputation score) for each user. The authority score measures how well each user has tagged in the past. This can be modeled as a voting problem. Each time a user votes correctly (that is, consistently with the majority of other users), the user gets a higher authority score; the user gets a lower score with more bad votes.

Let a(u) be the authority score of a given user u. As we have mentioned before, the goodness measure of a (tag, object) pair is the sum of the authority scores of all users who have tagged the object with the tag, that is

$$ S(t, o) = \sum_{u \in \mathrm{user}(t, o)} a(u) \tag{1} $$

Here user(t,o) denotes the set of users who have tagged a given object o with the tag t.

One simple way to measure the authority of a user is to assign the authority score of the user according to the average quality of this user’s tags (see Equation (2)).

$$ a(u) = \frac{\sum_{o \in \mathrm{object}(u)} \sum_{t \in \mathrm{tag}(o, u)} S(t, o)}{\sum_{o \in \mathrm{object}(u)} |\mathrm{tag}(o, u)|} \tag{2} $$

In Equation (2), object(u) is the set of objects tagged by the user u, and tag(o, u) denotes the set of tags assigned to object o by user u. Equation (2) measures the average quality of a given user’s tags. The authority score a(u) can be computed via an iterative algorithm similar to HITS [7]. Initially, we can set the weight of each user to be the same, e.g., 1.0.

The above formula treats heavy users the same way as light users. It does not distinguish people who introduce original tags from those who follow in the steps of others. People who introduce original and high-quality tags should be assigned higher authority than those who follow, and similarly for people who are heavy users of the system. One way to handle this is to give the user who introduces an original tag some bonus credit each time the tag is reinforced by another user.

If a tagging application also allows users to rate other users or tagged objects, as in many open rating systems [4][5], the authority score from such open rating systems can be incorporated into our collaborative tag suggestion algorithm.

3.5 Content-based Tag Suggestions
In addition to using tags entered by the real end-users as a source for tag suggestion, we can also suggest content-based (and context-based) tags based on analysis and classification of the tagged content and context. This not only solves the cold start problem, but also increases the tag quality of those objects that are less popular.

One simple way to incorporate auto-generated tags is to introduce a virtual user and assign an authority score to this user. The auto-generated tags are then attributed to this virtual user. The algorithm described in Table 1 remains
intact. This mechanism allows us to incorporate multiple sources of tag suggestions under the same framework.

3.6 Tag Normalization
Collapsing syntactic variants of the same term can fit in the same algorithmic framework, for instance, by computing the bi-grams (shingles of two characters [1]) of the tags in the currently chosen tag set C. To adjust the additional contribution of another tag, we compute the set of bi-grams (S) of the tag. The additional contribution of the tag can be computed by multiplying its current value
∩

Table 2. Suggested Tags for the URL http://wiki.osfoundation.org/bin/view/Projects/AjaxLibraries

     Base case      Pa             Ps             Pa AND Ps      Pa AND Ps AND Syntactic Variance Elimination
1    ajax           ajax           ajax           ajax           ajax
2    javascript     library        javascript     javascript     javascript
3    library        ajax library   programming    library        library
4    ajax library   development    webdev         programming    programming

In the second case, we consider the penalty adjustment in the column labeled by Pa. In this case, javascript and webdev are pushed down in the list. This is due to the relatively large overlap between ajax and javascript and the overlap between ajax and webdev. In our system, Pa(javascript|ajax) = 0.37, and Pa(webdev|ajax) = 0.22.

In the third case (see the third column of Table 2), we consider the rewarding mechanism without factoring in penalties. As a result, the tags programming and