
Personal Information Systems

Gaurav Veda, Sarvesh Dwivedi


Email: {gveda, dwivedi}@cse.iitk.ac.in
Guide: Prof. Sumit Ganguly
Guide’s Email: [email protected]

Department of Computer Science & Engineering


Indian Institute of Technology
Kanpur, UP, INDIA - 208016

Abstract— We are living in the information age. In fact, we are now facing the problem of information overload. There is so much information present in this world that we often find ourselves trying to find a needle in a haystack. On the web, we often succeed in retrieving what we want by using search engines like Google that make use of the web's underlying graph structure. However, the problem remains largely open for desktop systems, in spite of a few recent software tools.
Here, we present an information model which addresses this problem. Using our model, users can retrieve even those documents which do not contain the search keywords, but contain information that is related to the search keywords. Since our model is personalized, the property of two documents being related depends on the user of the system. We also introduce a notion of activities which, we believe, is the way in which humans think about information. Due to the little computation involved, our model enables us to answer queries efficiently and requires very little storage space.

I. INTRODUCTION

Today, we are witnessing an information revolution. We are being constantly flooded with information, be it on the internet or due to other sources (like emails), and it is becoming increasingly difficult to manage such huge amounts of information. An even more formidable task is to find something relevant in this great mass of data. The problem has more or less been addressed for the internet, with the advent of various good search engines (like Google) which can find the required data given a few keywords. Most of these search engines make use of the well understood hyperlink structure of the web.
This problem of efficient information retrieval, however, persists for the data stored on the hard disks of our computers. One of the primary reasons for this is the absence, till date, of any information model that relates documents on a desktop system in the way that hyperlinks relate documents on the web. Moreover, the problem is escalating day by day with the burgeoning size of secondary storage. Consequently, it is becoming increasingly difficult for a user to store data logically and access it easily.
We believe that the units of information on a user's hard drive collectively form an information system that is much more than a decoupled collection of information units. The information system that is built out of these units is usually not modeled in explicit form at any place; rather, it resides in the mind of the user. Currently, a user stores her information in a hierarchical fashion using files and directories and tries to encode this information system in her naming conventions. However, the problems with this approach are that, firstly, it is very difficult to maintain and is not scalable; secondly, what should a user do when a given piece of data qualifies for being kept in multiple directories; and thirdly, what if the user wishes to arrange the data according to a completely new classification scheme, or someone else takes over the system (e.g., in a corporate setting)? The aim of this project is to solve this problem, so that the user does not have to care much about how to arrange data and is still able to easily retrieve the required data. To do this, we propose a new information model for storing information in a desktop system.

II. MOTIVATION

The aim of this project is to build a system which does automatic classification and then allows a user to retrieve a document even if she does not know the exact words it contains, but only has an idea of what the document is about. In other words, we want to build a system which is able to capture the way in which humans think about information. We feel that the restrictions imposed by the hierarchical directory system are very artificial. In a hierarchical directory structure, a file is supposed to be placed in only a single directory. However, this does not make much sense, since human beings tend to classify a single information unit into various categories. For example, a user will associate a paper on Peer to Peer Systems with both Networks and Databases. Currently, there is also no mechanism to capture the temporal relation between data items. We believe that people often recall a particular piece of information or an event using events that were done by them at the same time, even though in terms of content the events might be totally unrelated. As an example, a person might relate a book on Algebra to Switzerland, because she read it while on vacation there. Thus there are various types of associations. These cannot be captured in the directory structure.
Also, the directory structure scales badly because the user has to manually create directories and links to files (if she wants a document to appear at more than one place).
Arranging and maintaining data on an 80GB/160GB hard disk (the current size of the hard disk on a recently bought personal computer) requires too much time and energy, which users are rarely able to devote.
We want to build a system in which the user doesn't have to change directories. In fact, she doesn't even need to know how files are stored on the computer internally. Whenever she wants to retrieve a file, she just queries for it using some words (not necessarily keywords) and the system displays a ranked list of matching documents. To save a file, she just has to specify its name and optionally some words that describe it. The system then classifies it and stores it in a way that makes subsequent retrieval of the file easy.
As stated above, we want to enable non-keyword based search. All the systems that are in place today rely on a more or less purely keyword based scheme to search for information. Such a system might very well cater to the kind of queries that we make on the web. This is because on the web, we are mainly interested in finding information that has not been entered by us. However, even on the web, we are sometimes plagued by the problem of being unable to retrieve the information we seek, mainly because of purely keyword based search systems. This problem takes gigantic proportions when it comes to a desktop. On a desktop, we often want to make queries which simply cannot be answered if we just do a keyword based lookup. Using the words input by the user to describe a document, we can say that a given document is related to another one. Using all this information, the system will come to have an idea of how the user relates various things, thus paving the way for a personalized system.
From our own experience and from observing other users, we believe that an average user looks at most at two to three pages of search results. This translates to approximately 20-30 search results. This implies that a good ranking algorithm is the heart of any search engine. Without a good ranking scheme, a search engine is useless, however fast it might be. Our aim is to calculate the most relevant results as quickly and efficiently as possible and present them to the user.
Once the whole system is in place, it would simplify information retrieval and make it more effective. For example, a professor would be able to retrieve a question that she wrote for a database course exam some time back if she just remembers the basic idea behind the question. She will not have to recall where she stored it. She would just have to type in a few words describing the question (some or all of which might not even be contained in the document), and the system will come up with the required document.

III. RELATED WORK

As of now, there are desktop search tools available from a number of vendors, viz. Google[1], Yahoo[2], Apple[3], Copernic[4] and others. All of these systems essentially do a full content indexing of the data present on the system, and after this is done, the user can do a keyword based search on the index for whatever data she wishes to find. Although this forms the basis of all these systems, they also provide a few additional features. Along with doing complete content indexing, these systems also integrate mail clients, chat clients and web browsers, so that users can search their mails, chat logs and browsing history along with other files. Another feature of the above mentioned systems is that they have format specific readers for popular file formats like MS Word, MS Excel etc. Using these format readers, they are able to extract metadata from the documents and index this metadata also. This enhances the quality of search. Additionally, during search this feature can be used to find specific types of files. Some of the tools provide another feature, which is the "up-to-the-moment accurate" search capability. This capability means that within a short time after the creation/addition of a new file to the system, or the modification of an existing file, the tool will be able to index it and users will be able to search through the new/modified data.
The tool Spotlight, which is offered by Apple in their upcoming operating system Tiger, has a special feature of saving search results in "Smart Folders". Using this feature, a user can arrange her data in a logical way and surpass the artificial barriers of the directory structure, as the user can now view documents in a personalized way which is also dynamic. Also, these smart folders get updated as new data arrives on the system or old data gets deleted. This takes care of classification to some extent, though in no way is this classification automatic.
Another tool is Haystack[5], which is an information management tool. It lets users view their data in whatever way they wish using a semi-structured data model. This data model comprises the actual documents as well as the metadata of these documents. The user can build various relations among documents. This is done using the Resource Description Framework (RDF)[6]. Haystack also uses format specific readers, thus enabling the user to view a document as only an information unit. It provides support for annotating documents. Additionally, a user has the flexibility to define new data types which can have personalized attributes.
The problem with the search tools discussed above is that they do only keyword-based search, and as stated earlier, this does not suffice for a desktop system. Also, except Spotlight, they lack the capability for any classification as such. On the other hand, though Haystack lets users arrange and view data in a personalized fashion, it demands too much from the user if she wishes to attain any reasonable classification, since there is no automatic classification done as part of the system.
One of the problems with plain keyword based search is the estimation of the relevance of a keyword to a document. One obvious way can be to use the number of times a word occurs in the document. However, the number of occurrences is not a true indicator of the relevance of the word to the actual content of the document. Moreover, a given keyword might not occur in the document, even though it is highly relevant to the content.
[Fig. 1. An example of the activity hierarchy]

For example, suppose that there is a document containing a discussion on the primal-dual algorithm for linear programming, and further suppose that the keyword(s) simplex algorithm do not appear in the document. However, documents that are highly relevant to simplex algorithm are relevant to the current document, since the simplex algorithm is highly relevant to linear programming. The solution in the context of the web, as devised by Google (and its variants), is to model documents as web pages, and to use hyperlinks from one page to another to recursively quantify relevance. So, continuing the example above, although a web page might not contain the keyword(s) 'simplex algorithm', there might be many links to this page from other pages where this keyword occurs in the hyperlink text. However, there is no direct notion of hyperlinks in a personal information system, and hence, a Google-like search and retrieval mechanism does not extend obviously to this domain.
In our model, we also do a full content indexing of the documents in the system. Additionally, by using format specific readers, various formats can be supported by us. Document classification in our system is done automatically, with optional user intervention. The user can decide her level of involvement in the classification process. Therefore, we support all the features provided by the existing search utilities. In addition, we also provide the capability of non-keyword based search and the concept of activities (as described later). This makes our system radically different from any existing search system.

IV. OUR INFORMATION MODEL

We conjecture that human beings tend to think and organize information in terms of activities and information classes.
We teach classes, enroll in courses, take vacations, participate in projects etc. Each of these actions constitutes an activity. Activities have a clear beginning and end in terms of time, e.g., the course CS625 (being offered this semester) started in January 2005 and will end in May 2005.
At a given point of time, we indulge in a number of activities. Corporate activity is often organized into a set of concurrently running projects with reasonably well-defined goals.
New activities are often related to older activities, e.g., if an instructor teaches the course CS719 every fall semester, then the activity of teaching the course in Fall 2004 is related to the activity of teaching the course in Fall 2005.
Information processed during the span of an activity is typically related. The challenge lies in forming a model that is able to capture this relation in an automated system.
An information class is like an activity, but it does not have any start time or end time associated with it. Information classes are timeless things; e.g., book writing is an activity, but a book belongs to an information class.
An information unit is primarily classified based on its content. Whenever a new file enters the system, it is first classified into one or more activities or information classes.
Each activity or information class has an associated description. This description consists of a set of keywords along with a number between 0 and 1 (which we call 'relevance') associated with each keyword. For example, let us consider the course 'CS719' (An Introduction to Data Streaming) to be an activity. Its description tag might look like:

{data streams (1.0), algorithms (0.7), randomized algorithms (0.9), database (0.3)}

As stated earlier, the relevance of a keyword is a number between 0 and 1. It is a measure of how well a keyword describes an activity, or how relevant the keyword is to that particular activity. In the above example, we observe that the course CS719 is accurately described by the keyword 'data streams'. The course comprises mainly randomized algorithms and hence that keyword has a relevance of 0.9. The set of keywords together with their relevance scores, and the start and end time, forms the signature of an activity. An information class is also associated with a similar signature, that is, keyword-relevance score pairs. In this sense, an activity is distinguished from an information class by the presence of the time field, that is, the duration. We discuss the relation between activities and information classes in greater detail in Section 6 below.

A. Information Classes Hierarchy

Information classes may have a containment hierarchy which resembles a directed acyclic graph. For example, a book consists of a foreword, a preface, a table of contents, chapters, a bibliography and an index. A chapter is made up of sections, sections have textual parts and figures, and so on. The above description can often be adequately captured using an XML schema definition. We therefore propose that non-atomic information unit classes, that are built out of instances of simpler information unit classes, are specified using an XML schema definition (or, equivalently, using an object-oriented schema definition).
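To make the signature notion concrete, the following is a minimal sketch in code of the structures described above. The class and field names are ours (illustrative only, since this is a design document and not an implemented API), and the CS719 dates are made up:

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional

# A signature is a set of keyword-relevance pairs; each relevance lies in [0, 1].
Signature = Dict[str, float]

@dataclass
class InformationClass:
    """A timeless unit of the model: described only by its signature."""
    name: str
    signature: Signature

@dataclass
class Activity(InformationClass):
    """An activity is an information class plus a duration (start/end time)."""
    start: Optional[date] = None
    end: Optional[date] = None   # None can model a still-running activity

# The CS719 example from the text (the dates are hypothetical):
cs719 = Activity(
    name="CS719",
    signature={"data streams": 1.0, "algorithms": 0.7,
               "randomized algorithms": 0.9, "database": 0.3},
    start=date(2004, 8, 1),
    end=date(2004, 12, 15),
)
```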
B. Activity Hierarchy

Akin to information classes, activities too can be organized into an activity - sub-activity hierarchy. For example, the course CS719 activity may have Problem Set as a sub-activity; analogously, Mid Semester Exam 1, Mid Semester Exam 2, End Semester Exam and Course Projects may be sub-activities of the activity CS719. The sub-activity Problem Set may have individual problem sets, say Assignment 1 through Assignment 5, as further sub-activities. Each sub-activity is an activity, and therefore is specified by a set of keywords with weights and a duration.
It appears to be reasonable to label a sub-activity with a subset of the set of keywords that its parent activity is labelled with, possibly with changed relevance measures. In addition, it may be labelled with a few other keywords. For example, the Problem Set sub-activity may have problem set and assignment as keywords.
Schema definition for activities: Some activities may have an XML schema associated with them. For example, the activity Course (of which CS719 is an instance) has simple attributes, such as the name of the course, instructor name(s), the list of student ids enrolled in the course, their e-mail ids etc. The sub-activity Problem Set has date of handout and due date as attributes. This means that each instance of the sub-activity Problem Set has these attributes (unless they are null). In general, the information wizard provided by the system allows the specification of new classes of activities. In addition, the system may be provided with a generic set of information classes that are useful to a wide variety of users, and specialized information classes for use by specific sets of users. For example, doctors, lawyers, academicians, software developers, accountants etc. will possibly have a different view of the world, which is encapsulated by a set of domain-specific information classes. The usefulness of a personal information system is expected to be greatly enhanced by having a library of domain-specific information classes and activities.

V. PROBLEM DEFINITION

Our system is based on the information model described in Section 4. We need to address a host of problems so that our system is able to do automatic document classification and non-keyword based retrieval of documents, and is able to support activities. This involves, firstly, classification of a new/modified document so that it can enter our information model, i.e., it has to be put in its proper place(s) in the activity model. Later, while searching, we need to rank the documents. We use relevance propagation as the ranking methodology, using which non-keyword based search is possible. We now describe each problem in more detail:

A. Document Classification

This is the first action which is performed on a document entering the system. In this step, we assign a proper place to the document in our information model. A document might qualify to be put in two or more places in the model. Once this step is over, the document becomes a leaf node in the directed acyclic graph which corresponds to activities. For document classification, we have to follow these steps:
1) Keyword Extraction: When we get a document, we have to parse the document and extract its keywords based on the content. Additionally, we have to assign relevance scores to the keywords. The "relevance score" denotes the relevance of a given keyword to the document. These keywords and relevance scores are later used to relate one document to another.
2) Inserting Documents in the Activity Model: In this step, we have to decide the leaf node(s) where this document should go in the activity model. This decision should be based on the keywords as well as their relevance scores with respect to the given document. A document should be placed as the leaf node on a path only if all the nodes (i.e., activities and sub-activities) on the path are relevant (at least beyond a certain threshold) to the keywords of this document.
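The placement rule of step 2 can be sketched as follows. This is only one possible reading: the text does not fix how a node's relevance to a document is computed, so the sum-of-products scoring, the helper names (signature, children) and the value of THRESHOLD below are illustrative assumptions of ours:

```python
THRESHOLD = 0.5  # illustrative; the text only requires "a certain threshold"

def node_relevance(signature, doc_keywords):
    # One plausible score: sum relevance products over shared keywords.
    return sum(signature[k] * w for k, w in doc_keywords.items() if k in signature)

def placement_paths(model, node, doc_keywords, path=()):
    """Yield every root-to-leaf path in the activity DAG on which ALL nodes
    clear the threshold; the document is then attached below each such leaf."""
    if node_relevance(model.signature(node), doc_keywords) < THRESHOLD:
        return  # one irrelevant node disqualifies the whole path
    path = path + (node,)
    children = model.children(node)
    if not children:
        yield path  # reached a leaf activity/sub-activity
    for child in children:
        yield from placement_paths(model, child, doc_keywords, path)
```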
B. Relevance Propagation

To support non-keyword based search, we must define the notion of two documents being related. Assuming such a relation exists between two documents, we can say that if a given word is related to the first document (or, simply speaking, the given word is a keyword of the first document), then it would also be related to the second document. We define the calculation of the strength of the relation between the given word and the second document as the first subproblem of relevance propagation. In other words, we have to find a suitable function to pass the relevance of a given word from one document to another.
The other subproblem is to find an efficient way to calculate the relevance of documents to a given word over the whole activity graph. In essence, given a word, relevance propagation will assign a relevance score to each document in the activity graph.

C. Defining Activities and Sub-activities

Till now we have assumed that an activity graph exists. As stated above, an activity has a duration associated with it. It is very difficult for an automated system like ours to know when a new activity starts and when an existing activity ends. In other words, when a new document is entered into the system, our system has to decide whether it belongs to one or more of the current activities or whether it starts a new activity. The same is true for sub-activities.
This problem is somewhat simpler for information classes, since they are static. Our system will have many built-in information classes and hence would rarely face the problem of defining new information classes. We would also have certain information classes that are meant for a certain category of people, e.g., lawyers.
Although there would also be a mechanism by which the user would be able to define new information classes and activities, there should be a method to automatically infer these.
In our project, we concentrate on the problem of relevance propagation, since this is basic to the realization of our system. In Section 9, we discuss various models for relevance propagation along with their advantages and shortcomings. In Section 10, we describe our current model. Since there has been a lot of research on document classification, we do not address this problem here. We assume that in the future there will be classifiers that give the set of defining keywords along with their relevance scores for a given input document. Towards the end, we also discuss a few approaches that could be used for the activity - sub-activity definition problem. The ideas presented for this problem are currently in the conceptual stage and need some experimentation and research before they can be used as a viable solution for the problem.

VI. ACTIVITIES, INFORMATION CLASSES AND A DESIGN RATIONALE

This section describes the relation between information classes and activities, and briefly describes the rationale behind this design.
Information classes are designed to allow us to model what is commonly understood as data in a database system. An information class has a structured component, namely, the (recursive) schema definition of the class-subclass kind. In addition, it may have unstructured parts as well. On the other hand, activities are a special kind of information class, which specify a duration of the activity.
Consider the following simple example. Suppose that a very well-known author has written a book that has many editions. The contents of each of those editions of the book are an instance of the 'Book' information class. The Book information class may have a large and pre-defined hierarchy of content, in the form of foreword, preface, table of contents, chapters, references, index etc., in various formats, namely, source (e.g., tex, fig, word, etc.), printable (e.g., ps or pdf) and intermediate (e.g., dvi). However, the Book information class is not an activity, because the information in this class is independent of time (or, in other words, timeless!). In general, information classes represent timeless pieces of information.
On the other hand, the Book Writing information class can be an activity, strongly related to the specific edition of the book whose writing is being modelled. Writing a book is an activity, and includes correspondences with other authors, designated editors, proof-readers, versions of chapters, exercises etc. The process of book writing, and the information generated during this process, is organized in this class. The Book Writing class may include the final version of the book as well (and may share it with the Book class). Queries about retrieving a specific correspondence with a co-author when a certain milestone was reached are answered by this class.
In our opinion, one of the main mechanisms used by human beings for correlating events is temporal proximity. For example, a professor may remember a discussion with a co-author on self-organizing information systems while at Toronto. The role of Toronto is completely secondary to the issue of self-organizing information systems; it however serves as a temporal key to identify that piece of information. Queries of the kind "find X when I was doing Y" seem to be particularly relevant for personal information systems. We therefore believe that there should be a significant role allocated to the notion of activity, which has the notion of duration built in as a primary feature. Together with the notion of an activity hierarchy, the concept of activity offers an approach to an event modelling mechanism that more closely matches the way we interact with the world.

VII. RETRIEVAL OF INFORMATION IN A PERSONAL INFORMATION SYSTEM

There are various types of queries that we intend to support in our system. A query could consist of just a number of words that describe the document that we are seeking. A query might also contain the usual boolean operators such as 'and' and 'or'. This kind of querying is known as content based querying. Apart from this, a query may also have a structural part. A query must have at least one of the two parts (it can have both parts simultaneously). The structural part is similar to traditional query languages, where the schema information is used by the query. We first consider the content based specification and then present the structural part.

A. Content based querying

In its simplest form, a content based query is a collection of keywords. A slightly more complex form of a content based query constructs the query using more complicated boolean expressions over keywords. Finally, the use of the WHEN operator allows the specification of temporal joins in the query statement.
Consider the simplest form of a content based query, that is, a query which is a conjunction of keywords. The problem faced by the system is to retrieve all information units, instances of information classes, activities and sub-activities that exhibit a high relevance to the set of keywords specified. Alternatively, the system retrieves the top-k information units, classes and activities, in terms of a certain relevance metric, where k is a parameter. The parameter k may be constrained by the output device of the user, such as the screen size, the bandwidth available to the user etc.

B. Structural part of the query

The following examples illustrate the structural part of the query.
Example 1: Suppose the query pertains to an email:

Class = Email AND Sender = Avi AND Receiver = Jeffrey

This can be simplified to:

E-mail Sender = Avi AND Receiver = Jeffrey
Example 2: Suppose the query pertains to a course:

Courses TaughtIn Fall 2003

to indicate that the user is interested in courses taught in Fall 2003. The structural component of the query is similar to standard query languages for structured data (e.g., relational, XML or object-oriented languages).

C. Temporal joins

The use of temporal association in mapping information units seems to be a significant way by which human beings associate, map and recall information. In view of this technique, we introduce the WHEN join predicate, as illustrated by the following example.

.jpg WHEN Activity = CS719

The above query asks for all photographs taken when the activity with the name CS719 was in progress. The following query asks for all vacation activities taken while writing the paper "estimating frequency moments":

Vacation WHEN PaperWriting AND Title = "estimating frequency moments" AND Author CONTAINS "Avi" ?
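Operationally, WHEN is an interval test: a unit qualifies if its timestamp falls within the duration of the named activity. A minimal sketch of this (the attribute names created, start and end, and the lookup helpers in the usage line, are assumptions of ours):

```python
def when_join(units, activity):
    """Evaluate e.g. `.jpg WHEN Activity = CS719`: keep the units whose
    creation time lies inside the activity's [start, end] duration."""
    return [u for u in units
            if activity.start <= u.created <= activity.end]

# photos = when_join(jpg_files, activities["CS719"])  # hypothetical lookups
```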
VIII. RELEVANCE PROPAGATION

We first look at the idea of relevance joins, which is fundamental to relevance propagation. After this, we describe various models for relevance propagation.

A. Relevance Joins: basic idea

We first consider the problem of retrieving all information units that exhibit a high relevance to the set of keywords. Conceptually, the retrieval process works as follows. A relevance score is associated with each information unit in the system, and is initialized to 0. Recall that each information unit has associated with it a set of keywords together with their relevance scores. For each keyword in the query, and for each information unit in the system, if the keyword is present in the list of keywords associated with the information unit, then the specified relevance of the keyword is added to the value of the relevance score for that information unit. More formally, let A be an information unit and K be a keyword. If K is a keyword appearing in the list of keywords of A, we denote by r(A, K) the score of how relevant the information unit A is for K. Let r(Q, A) denote a score of how relevant the information unit A is to the query Q. The initialization step is performed as follows:

r(Q, A) = Σ_{K ∈ Q} r(A, K)

Next, we define a relevance join operation and illustrate it with an example. Let A and B be two information units; for example, suppose that A is an article on processor scheduling and B is an article on memory management. Suppose that A has two keywords, namely processor scheduling, with relevance 1, and operating systems, with relevance 0.5. Similarly, suppose that B has two keywords, namely memory management, with relevance 1, and operating systems, with relevance 0.5. Suppose that the query was on the keyword processor scheduling. Since both A and B have a common keyword, namely operating systems, the relevance join of A with B on the keyword operating systems gives B a relevance score, given as follows:

s(r(A, processor scheduling), r(A, operating systems), r(B, operating systems))   (1)

where s is a certain composition function. The above function gives the relevance score of the information unit B upon navigation from the information unit A. A is called the anchor unit and B is called the destination unit of the join operation. The anchor keyword(s) for a join operation is the keyword in the anchor unit whose relevance has made the anchor unit relevant to the query. In this case, the anchor keyword is processor scheduling. The join keyword(s) is the (set of) keywords that are common to both units. In this example, the join keyword is operating systems. The destination keyword(s) are all the other keywords of the destination unit. In our simplest model, all the remaining keywords of the destination unit share the same join relevance as computed by equation (1).
A specific calculation for the function s in equation (1) is as follows. Let the anchor unit be A, the destination unit be B, the anchor keyword in A be K and the join keyword be J. Let r(Q, A, K) denote the relevance function of the keyword K in the information unit A for the query Q. For now, let us assume that we make a single word query. In this case, Q = K and r(Q, A, K) = r(A, K).

r1(Q, A, B, J) = r(A, K) · r(A, J) · r(B, J) · α

r(Q, A, B) = Σ_J r1(Q, A, B, J)

r1(Q, A, B, J) denotes the relevance of document B to the query Q on account of the join keyword J and the document A. r(Q, A, B) denotes the total relevance that is passed to the document B by the document A corresponding to the query Q.
The above equations make the implicit assumption that the relevance function r is always less than 1. The constant α is chosen to be less than 1 so that navigation always reduces the relevance by at least a factor α. This implies that applying relevance joins multiple times reduces the relevance geometrically (i.e., the first level joins reduce relevance by at least α, the next level reduces the relevance by at least α², the next by α³, and so on). Applying this to our example, with α = 0.9 we have:

r(processor scheduling, A) = 1.0

r(processor scheduling, A, B, operating systems) = 1 · 0.5 · 0.5 · 0.9 = 0.225
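The two equations and the worked example translate directly into code. In the sketch below, a unit is represented just by its signature (a keyword-to-relevance dict), r(A, K) is taken to be 0 for absent keywords, and the query word itself is not used as a join keyword; these conventions are our reading of the text:

```python
ALPHA = 0.9  # the damping constant α; must be chosen below 1

def r(unit, keyword):
    """Stored relevance r(A, K); zero when the keyword is absent."""
    return unit.get(keyword, 0.0)

def r1(q, anchor, dest, j):
    """Relevance passed via one join keyword j for a single-word query q
    (so r(Q, A, K) = r(A, K)): r(A, K) * r(A, J) * r(B, J) * α."""
    return r(anchor, q) * r(anchor, j) * r(dest, j) * ALPHA

def join_relevance(q, anchor, dest):
    """Total relevance passed from the anchor unit to the destination unit:
    the sum of r1 over all join (i.e., shared) keywords."""
    joins = (set(anchor) & set(dest)) - {q}
    return sum(r1(q, anchor, dest, j) for j in joins)

A = {"processor scheduling": 1.0, "operating systems": 0.5}
B = {"memory management": 1.0, "operating systems": 0.5}
print(join_relevance("processor scheduling", A, B))  # 1 * 0.5 * 0.5 * 0.9 = 0.225
```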
IX. RELEVANCE PROPAGATION MODELS

In this section, we present the various mathematical models that we conceived for the problem of calculating the relevance of information units for a given query (i.e., a set of keywords). We also discuss the pros and cons of each model. This forms a foundation for the next section, in which we define our current model for relevance propagation.

A. Eigenvector approach

In this model, we want to ensure that only incremental relevances are passed on. This can be understood as follows. Suppose documents A and B are related to each other. Also suppose that A is related to document C and B is related to document D. Also, let us assume that in step 1, A and B get some relevance from sources other than C and D and pass part of it to each other. In step 2, A gets some more relevance due to C and B gets some more relevance due to D. However, now A passes a portion of only that relevance which it gets from C to B, and similarly, B passes only a portion of the relevance that it gets from D to A.
In this model, we compute relevance as follows. Let M be an n × n square matrix, where n denotes the number of information units in the system. Thus, M has a row and a column for each information unit. Let A and B be two information units numbered iA and iB respectively. The entry M[iB, iA] is a real number that is calculated as follows:

M[iB, iA] = Σ_{common keywords J} r(A, J) · r(B, J)   (1)

That is, the joint relevance coefficient M[iB, iA] is calculated as the sum of the products of the relevance values for each join keyword. The assumption is that each relevance value r(A, J) and r(B, J) is a real number between 0 and 1. Also, we initialize M[i, i] to 0. This guarantees the property of passing on only incremental relevances.
Let λ be an n-dimensional column vector that is initialized as follows based on the query Q:

λ(0)_{iA} = Σ_{K ∈ Q, K ∈ Keyword(A)} r(A, K),   1 ≤ iA ≤ n   (2)

The term λ(0)_{iA} calculates the relevance of the information unit A based on the query Q using the relevance information directly available, before performing any relevance joins. The iterative step is given by the following equation:

λ(i+1) := M λ(i)   (3)

Since the relevance values should lie between 0 and 1, we normalize λ(j) at the end of each iteration j, j = 0, 1, ..., as follows:

λ(j) := (1 / max_{r=1..n} λ(j)_r) λ(j),   if max_{r=1..n} λ(j)_r ≥ 1   (4)

This method is exactly analogous to the power method for finding the dominant eigenvector of a matrix, and hence it converges if λ2 / λ1 < 1. We calculate the relevance vector as

λ = λ(0) + λ(1) + . . .

The problem with this approach is that we cannot say for certain that the quantity that we get at the end will indeed be a ranking of the documents in terms of relevance. Also, although we know that the power method for computing the dominant eigenvector converges, its speed depends upon the ratio of the first eigenvalue to the second eigenvalue. Since we cannot say anything about this ratio, we cannot say anything about the rate of convergence or the computation required for this method.
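Equations (1)-(4) amount to a damped power iteration, which can be sketched with numpy as follows. The fixed iteration count is our addition, since the text gives no explicit stopping rule:

```python
import numpy as np

def build_M(signatures):
    """M[b, a] = sum over common keywords J of r(A, J) * r(B, J),
    with the diagonal kept at 0 (equation (1))."""
    n = len(signatures)
    M = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            if a != b:
                common = set(signatures[a]) & set(signatures[b])
                M[b, a] = sum(signatures[a][j] * signatures[b][j] for j in common)
    return M

def relevance_vector(signatures, query, steps=10):
    """Initialize lam by equation (2), iterate lam := M @ lam (equation (3)),
    rescale by the maximum entry whenever it reaches 1 (equation (4)),
    and accumulate lam(0) + lam(1) + ... as the final relevance vector."""
    M = build_M(signatures)
    lam = np.array([sum(s.get(k, 0.0) for k in query) for s in signatures])
    total = lam.copy()
    for _ in range(steps):
        lam = M @ lam
        peak = lam.max()
        if peak >= 1:
            lam = lam / peak
        total = total + lam
    return total  # one (unnormalized) relevance score per information unit
```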
B. All Paths Approach

In this model, we first define the paths along which relevance propagation must take place. We then present the various computation models for this approach, which we tried but found to be unsatisfactory for tackling the problem.
For this model, we consider a document-keyword graph as follows. Every document corresponds to a single circle node in the graph. Every keyword corresponds to a square node in the graph. If a keyword is in the list of keywords corresponding to a document, then there is an edge between the document node and the keyword node.
For us, a relevance passing path is any path between two documents that uses a single keyword only once. In other words, all the square nodes (that correspond to keywords) occurring in the path should be unique and there should be no repetition. Note that a single document might appear multiple times in the path, since we are not imposing any condition on the circle nodes (that correspond to documents).
At the start of relevance passing, only nodes which have the query keyword in their keyword list have any relevance scores with respect to this query. We call these nodes "source nodes". In the first step of propagation, the source nodes pass relevance to nodes which are directly related to them, i.e., the two documents contain some common keyword (which is not the query keyword). This is done by using the relevance passing function defined in Section 8A. In terms of the graph described above, this refers to those paths that originate from source nodes, contain only one square node (that does not correspond to the query keyword) and terminate at a circle node. The new nodes which have now gained relevance are added to the pool of source nodes, and these new source nodes now pass relevance to other nodes related to them through common keywords (other than the ones through which they gained relevance).
It should be noted that the relevance scores that we use for passing relevance at each node in the graph correspond to the original relevance of a node with respect to a keyword. We do not consider the new relevance scores which might be acquired as relevance passing progresses. This is different from the eigenvector approach above.
However, in this model, as stated above, if two documents contain two or more common keywords, then relevance passing between them occurs multiple times (because of the different paths).
To calculate the relevance of a node to a given query keyword, we consider all paths which start from a source node and end at this node. Therefore, the relevance of this node (and hence the document to which it corresponds) to the given query can be defined as the sum of the relevance scores this node gets over all such possible paths.
Though this model captures the basic notion of what relevance is, it suffers from serious implementation issues. The all paths approach as such involves the calculation of all possible paths over the graph, which in itself is computationally complex (since there could be an exponential number of paths in general). This makes the addition and deletion of nodes computationally infeasible. A good algorithm for this approach should be incremental in nature. By incremental, we mean that on the addition or deletion of a node, or on changing the relevance of a keyword to a node, we should be able to propagate the changes without doing computations over the whole graph again.
1) Normalization Model: In this model, we maintain two matrices. The first is an N × N matrix which is used to store the total relevance passed from a node N1 to another node N2. The second matrix is an N × W normalization matrix. The need for this second matrix is explained below.
In the all paths approach, we consider only those paths which do not contain the query keyword as an edge. This makes the results query specific. This also creates a problem: the relevance passed from one node to another is keyword specific. In this approach, our main matrix stores the total relevance passed between nodes using all the keywords. To incorporate keyword exclusion, we could add one more dimension W to the first matrix, so that an entry in the matrix stores the relevance passed from N1 to N2 for a given keyword W using the all paths algorithm over a graph that is similar to the above graph except that it doesn't have the node corresponding to W. Instead, we choose to use another matrix, which we call the normalization matrix. This matrix stores the factor by which we need to normalize the total relevance passed to get the specific relevance for this keyword.
To calculate the relevance of a node to a given keyword, we take the entries in the column for this node in the first matrix, suitably modify these values according to the normalization matrix, and then add them up.
The problem is that on the addition or deletion of a node we will have to recompute the first matrix, which is computationally infeasible for each addition/deletion. Also, calculating the normalization matrix exactly is infeasible and we can only approximate it.
2) Differential Model: In this model, we use an N × N × (N × W) matrix, where N is the number of nodes and (N × W) covers the list of keywords for each node. An entry (N1, N2, N3, W3) in this 4-dimensional matrix represents the change in the relevance passed from N1 to N2 when there is a change in the relevance of W3 with respect to the node N3. This model assumes that a linear function is used to pass the relevance from one node to another. Using this model, it is very easy to propagate changes when only the relevance scores change. The biggest problem with this model is that effectively we have to recalculate the whole matrix every time an addition or deletion of a node takes place. Also, it requires a huge amount of storage space.

X. CURRENT APPROACH

In this section, we describe our current relevance propagation model. As stated earlier, we assume that an average user looks at only the top 20-30 items. Therefore, we want an approach by which we can compute the top 50 or so results quickly. Also, we want the insertion, modification and deletion of files in the system to be efficient. We want to avoid the huge computation required by the previously stated models for these actions, since these actions might be quite frequent.
We form a graph comprising all the documents in the system. Two documents are connected if they have a common keyword. An edge in the graph means that relevance can be propagated between the two documents that are connected by the edge. As can be seen, this graph is an undirected graph. For the purpose of calculating the relevance propagation between documents we do not need the edges of this graph to be given any weights. However, as will be explained later (in Section 11A), we might need to assign certain weights to the edges in this graph if we want to come up with activity definitions (one of the problems mentioned in Section 5) according to the approach discussed in Section 11A.
In this approach, for a given query, we always maintain the top c · k documents (in terms of relevance) at each step in the document discovery process. k is a parameter (around 30-40) that represents the number of documents that we expect the user to go through in response to her query. c is a small number (around 2-3). As will be explained in Section 12A, we will produce the final ranked list of documents using rank aggregation of two lists (one of them being this list and the other being the ranked list output by the plain keyword-based search module). We output a list of c · k elements from each module, even though we present only the top-k documents to the user, so that we do not miss out on any relevant document and the final top-k elements output by us are the k most relevant documents for the query. We break the algorithm into three steps as follows (a sketch in code follows the list):
• Initialization Step: Whenever we are given a query, we first find the documents that directly contain the query keyword(s) in their keyword list. We create a list of c · k elements and add all these documents to the list, if there are at most c · k such documents. If there are more documents than what the list can contain, we add only the top c · k documents (in terms of relevance) to the list. The documents in the list after this step are the documents from which we start the search for other relevant documents that do not contain the query keyword(s).
• Iteration Step: Using the document graph, look at all the documents that are reachable from the documents in the current c · k list. A document might be reachable from multiple documents in the list. After identifying all the documents, assign relevance to each one of them by using the relevance propagation function described in Section 8. If a document is reachable from multiple documents, it will get relevance due to each one of them, and its net relevance will be the sum of all these individual relevances.
Now look at the documents in the c · k list. If there is a document in the list that has relevance lower than any of the documents discovered in this iteration step, then it will be evicted from the list and the newly discovered item will be added to the list. In other words, we now consider all the documents in the current c · k list and the newly discovered documents together, and populate the c · k list with the top c · k documents (in terms of relevance) out of these documents.
• Termination Step: If after one step of the iteration the c · k list remains unchanged, i.e., no new documents are added to the list, we terminate our search. This is because any further iteration steps will not cause any change in this list of documents. Also, if at any step in the iteration, a document that is added to the c · k list has relevance less than a certain threshold, we stop the algorithm and output the current c · k list along with the relevance scores of the documents in this list. This second condition is imposed both to speed up the algorithm and to avoid documents that are very marginally related to the query.
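The three steps above can be sketched as follows. Here graph maps each document to its neighbours (documents sharing a keyword), direct_hits gives the initial relevance of the documents containing the query keyword(s), and propagate stands for the relevance passing function of Section 8; the parameter values are only indicative:

```python
def discover_top_ck(graph, direct_hits, propagate, c=2, k=30, threshold=0.05):
    cap = c * k
    # Initialization: at most the top c*k documents containing the query.
    best = dict(sorted(direct_hits.items(), key=lambda kv: -kv[1])[:cap])
    frontier = set(best)
    while frontier:
        # Iteration: score everything reachable from the current list, summing
        # the relevance received from each list member it is reachable from.
        found = {}
        for doc in frontier:
            for nbr in graph[doc]:
                if nbr not in best:
                    found[nbr] = found.get(nbr, 0.0) + propagate(doc, nbr)
        merged = sorted({**best, **found}.items(), key=lambda kv: -kv[1])[:cap]
        new_best = dict(merged)
        added = set(new_best) - set(best)
        best = new_best
        # Termination: the list is stable, or a newly added document falls
        # below the relevance threshold.
        if not added or any(best[d] < threshold for d in added):
            break
        frontier = added
    return best  # document -> relevance, to be rank-aggregated later
```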
The above steps can be carried out quite efficiently by designing the data structures carefully (and/or by using insertion sort or some other sorting algorithm etc.). Therefore, answering queries using this model is a simple computation and can be done quite efficiently.

A. Document insertion, modification and deletion

Since we are doing all computation on the fly using the graph of all the documents in the system, document insertion, modification and deletion are all accomplished very easily and efficiently. All we have to do is to delete some links in the existing graph and/or add some new links. So when a new document is added to the system, we create a node representing the document in the graph. We then look at all the keywords in its keyword list and create an edge between this document and any other document that contains even one of these keywords in its keyword list. Upon deletion of a document, we simply remove the node corresponding to it from the graph, along with all the edges linked to it. When a document is modified, it is equivalent to deleting the old document from the system and adding a new document to the system. Therefore, to summarize, this approach can handle document insertion, modification and deletion very efficiently.

B. Advantages of this model

This model requires us to store very little information. Specifically, we only need to store the graph corresponding to all the documents. Moreover, it is very easy to handle changes in the system (document insertion, modification and deletion). The model is also quite computationally efficient.

C. Rationale behind the model

In this model, at every step, we are storing only the top c · k items that are relevant to the query. Moreover, the document discovery process is designed so that only documents reachable from these documents are considered. Intuitively, this means that we consider only those documents to be relevant to the query which get their relevance because of other highly relevant documents. We are ruling out those documents that accumulate relevance because they are connected to lots of other documents in the graph which might be only moderately or minimally related to the query. This will be the case if there is a document which is tagged by general/common keywords and is some kind of general reference; e.g., a book on operating systems will be related to almost any query related to computer science, although the relevance might be more or less depending on the query. We assume that a user is usually looking for only a specific document or a few documents that are quite specific to the query. Every document will have certain specific features and certain general features. We expect the user to use the specific features of the document when she is searching for it. So, going back to our example, if the user is indeed looking for the book on operating systems, she should probably write the name of the book or the names of the authors or some such specific thing.

XI. SOME APPROACHES FOR THE ACTIVITY DEFINITION PROBLEM

Here we present some of the possible approaches for the problem of coming up with activities and sub-activities (defined in Section 5). As mentioned there, these are just some ideas that seem quite promising. These ideas need some amount of experimentation and analysis before they can be used.

A. Clustering

In this approach, we make use of the graph consisting of all the documents that is described in Section 10 above. In addition, we give weights to all the edges in the graph. The weight is like a distance measure between the documents connected by the edge. Documents are at a distance of 0 if they have exactly the same keywords with the same weights. If there is a high amount of relevance propagation between two documents, then the distance between them should be small. So, naively, we can consider the distance function to be 1 minus the relevance propagation between the two documents. By relevance propagation here we mean the function

Σ_J α · r(A, J) · r(B, J)
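In code, the naive weighting looks as follows; the clipping at 0 is our addition, to keep the weight a usable distance when the propagated relevance exceeds 1:

```python
ALPHA = 0.9  # the same damping constant α as in Section 8

def edge_distance(sig_a, sig_b):
    """Naive distance: 1 minus the relevance propagation between two
    documents, where each document is given by its keyword signature."""
    shared = set(sig_a) & set(sig_b)
    propagated = sum(ALPHA * sig_a[j] * sig_b[j] for j in shared)
    return max(0.0, 1.0 - propagated)

def weighted_document_graph(signatures):
    """Yield (doc1, doc2, weight) for every pair of documents sharing a keyword."""
    docs = list(signatures)
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            if set(signatures[a]) & set(signatures[b]):
                yield a, b, edge_distance(signatures[a], signatures[b])
```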
Once we have this graph, we run a clustering algorithm over it. Our intuition is that we will get various clusters corresponding to various activities. Once we identify these top level clusters, we could go inside each cluster to identify sub-clusters that correspond to sub-activities. However, the biggest problem with this approach is that the clusters that we identify will be non-overlapping, whereas activities, by definition, could be overlapping (i.e., two or more activities can contain the same document).

B. Mining the relevance graph

In this approach also we make use of the document graph. We use data-mining techniques on the graph to identify sets of keywords for which a lot of documents contain all or almost all of them. So if the keywords w1, w2 and w3 are found in many documents and with quite similar relevances, it means that these keywords, along with the average of the associated relevance scores, represent an activity or sub-activity. Naively, if for a set S the number of documents that are returned is large, then it is an activity, and larger sets (i.e., sets containing S as a subset) are probably sub-activities of this activity. A technique like the Apriori algorithm might be used to output all such possible sets of keywords.
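A sketch of the mining step, in the spirit of Apriori; the support counting and the min_support/max_size parameters are our illustrative choices, and a real implementation would also prune candidates and average the relevance scores per set:

```python
from itertools import combinations

def frequent_keyword_sets(doc_keywords, min_support=20, max_size=3):
    """doc_keywords maps each document to its set of keywords. A keyword set
    is 'frequent' if at least min_support documents contain all its words;
    frequent supersets of a frequent set S suggest sub-activities of S."""
    def support(ks):
        return sum(1 for words in doc_keywords.values() if ks <= words)

    words = {w for ws in doc_keywords.values() for w in ws}
    level = [frozenset({w}) for w in words if support(frozenset({w})) >= min_support]
    frequent = list(level)
    while level and len(next(iter(level))) < max_size:
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
        frequent.extend(level)
    return frequent
```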
XII. FULL CONTENT INDEXING

Any desktop based search system should have support for keyword based search. This means that given a keyword, we should be able to retrieve the documents which contain the given keyword. All current desktop search tools more or less provide only this capability.
Our system enhances the current search tools with beyond keyword (or non-keyword) based search, temporal joins and other features. All these extra features help our system find documents that are relevant to the query and produce a ranking of the documents. In addition to this, we also do a plain keyword based search and get another ranking of documents. To support keyword based search, we need to do full content indexing.
During full content indexing, we also assign a relevance score to every word with respect to the given document. For example, if a document contains only ten words, then for this particular document, each of these ten words will have an associated relevance score. These relevance scores are assigned based on various parameters like:
• Frequency: the number of times a given word occurs in the document. Naively, if a word occurs more often in a document, then that word is highly related to the document. However, the relevance should not be a linear function of frequency. Instead, there should be a threshold frequency, below which the relevance behaves as a linear function of frequency, but it tapers off above this threshold. This threshold might be a constant for all documents, or it might depend on the total number of words in the document.
• Capitalization: whether the word is in small letters or capital letters. A word in uppercase usually implies higher relevance.
• Relative Font Size: a higher font size as compared to the other words present in the document signifies a higher relevance score.
• Bold/Italic: a word in bold or italics is being emphasized and hence is quite relevant to the document in question.
• Proximity: for multi-word queries, a very important factor is how close the query words occur in the document. The closer they are, the more relevant the document is to the query. E.g., if I search for "operating systems", then a document that contains the two words (operating and systems) together is much more likely to be the one that I am searching for, as compared to a document that contains both the words separated from each other.
• Order: for multi-word queries, a document that contains all the query words in the order in which they appear in the query is likely to be more relevant to the search, as compared to a document that does not respect the order.
We have adopted the above parameters from Google's web search engine ([7]). These parameters are used by Google in both their web search and their desktop search utility. Along with the above parameters, depending on specific formats, other parameters can also be used. For example, we can use the tags present in html pages (such as title etc.).

A. Rank aggregation

Given a query, our system will compute two ranked lists of the documents. One ranked list will be obtained by doing a beyond keyword based search using the activity model and relevance propagation. Doing a keyword based search using full content indexing will give us another ranked list. Our system will aggregate these two lists and then present the combined result to the user. The problem is to meaningfully combine the two results. However, there are various well known and easily implementable algorithms for combining top k-lists ([8]). As stated previously, we assume that the user will not be looking at more than 20-30 search results, and hence rank aggregation will in itself take little time.
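The choice of aggregation algorithm is left to the literature ([8]); as one simple, easily implementable instance, a Borda-style count over the two c · k lists looks like this:

```python
def borda_aggregate(list_a, list_b, k=30):
    """Each list awards a document (list length - position) points; the
    combined top-k is returned, with ties broken by document id."""
    scores = {}
    for ranked in (list_a, list_b):
        for pos, doc in enumerate(ranked):
            scores[doc] = scores.get(doc, 0) + (len(ranked) - pos)
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]

# final_list = borda_aggregate(propagation_ranking, keyword_ranking, k=30)
```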
associated relevance score. These relevance scores are assigned XIII. I MPLEMENTATION I SSUES
based on various parameters like: The implementation of the automatic document classifica-
• Frequency: the number of times a given word occurs in tion and retrieval system, presented in the previous sections,
the document. Naively, if a word occurs more number can easily be divided in several well defined modules. As
of times in a document, then that word is highly related the aim of our system is to enhance the current desktop
to the document. However, the relevance should not be information organizing and search utilities, hence our system
a linear function of frequency. Instead, there should be a will support the features currently available on such systems.
threshold frequency, below which the relevance behaves One of the main features of such systems are format readers
as a linear function of frequency, but it tapers off above for different file types. The parser for our system will take its
this threshold. This threshold might be a constant for all input in only a single format, namely XML ([9]). Since we
documents, or it might depend on the total number of use a modular approach in our implementation, we can easily
words in the document. build support for various file formats. We just need to write a
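As an illustration of how these parameters could be combined per word, here is a minimal sketch under assumed constants: the threshold is taken as a fraction of the document length and the capitalization boost is a made-up factor, neither of which is fixed by our design.

    import math

    def word_relevance(word, tokens, threshold_fraction=0.01):
        """Relevance of `word` in one document: linear in frequency up
        to a length-dependent threshold, tapering off (logarithmically
        here) above it, with a boost for uppercase occurrences."""
        threshold = max(1, int(threshold_fraction * len(tokens)))
        matches = [t for t in tokens if t.lower() == word.lower()]
        freq = len(matches)
        if freq == 0:
            return 0.0
        if freq <= threshold:
            score = freq / threshold                   # linear region
        else:
            score = 1.0 + math.log(freq / threshold)   # tapered region
        if any(t.isupper() for t in matches):          # capitalization cue
            score *= 1.25                              # assumed boost factor
        return score

Proximity and order depend on the query, so in this sketch they would be evaluated at search time, using the stored proximity information, rather than folded into the per-word score.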
A. Rank aggregation

Given a query, our system will compute two ranked lists of the documents. One ranked list will be obtained by doing a beyond-keyword search using the activity model and relevance propagation; doing a keyword-based search using full content indexing will give us another ranked list. Our system will aggregate these two lists and then present the combined result to the user. The problem is to combine the two results meaningfully. However, there are various well-known and easily implementable algorithms for combining top-k lists [8]. As stated previously, we assume that the user will not be looking at more than 20-30 search results, and hence rank aggregation will in itself take little time.
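The aggregation step itself can be as simple as a positional (Borda-style) merge, shown below as a stand-in for the more careful top-k methods of [8]; the score for documents missing from one list is our own assumption.

    def aggregate_top_k(list_a, list_b, k=30):
        """Borda-style aggregation of two ranked lists of docIds.

        A document scores (k - rank) in each list it appears in;
        documents absent from a list contribute nothing for that list.
        The top k documents by total score form the combined ranking.
        """
        def scores(ranked):
            return {doc: k - i for i, doc in enumerate(ranked[:k])}

        sa, sb = scores(list_a), scores(list_b)
        combined = {doc: sa.get(doc, 0) + sb.get(doc, 0)
                    for doc in set(sa) | set(sb)}
        return sorted(combined, key=combined.get, reverse=True)[:k]

With k around 20-30, as assumed above, this runs in negligible time.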
XIII. IMPLEMENTATION ISSUES

The implementation of the automatic document classification and retrieval system presented in the previous sections can easily be divided into several well-defined modules. As the aim of our system is to enhance the current desktop information organizing and search utilities, our system will support the features currently available on such systems. One of the main features of such systems is format readers for different file types. The parser for our system takes its input in only a single format, namely XML [9]. Since we use a modular approach in our implementation, we can easily build support for various file formats: we just need to write a module (or use an already existing one) for the required file format, which converts the given document to our custom XML format. The XML data thus generated is fed to the parser, which then parses it and generates a content-based index.
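As an illustration, a format-reader module for plain text might look like the sketch below. The element names (document, meta, content) describe a hypothetical version of the custom XML format, since the exact schema is not specified here.

    import xml.etree.ElementTree as ET

    def plain_text_to_xml(path, doc_id):
        """Format-reader module for plain text: wraps the file content
        in an illustrative custom XML document for the parser.
        The tag names used here are hypothetical, not a fixed schema."""
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()

        root = ET.Element("document")
        meta = ET.SubElement(root, "meta")
        ET.SubElement(meta, "docId").text = str(doc_id)
        ET.SubElement(meta, "path").text = path
        ET.SubElement(root, "content").text = text
        return ET.tostring(root, encoding="unicode")

A reader for HTML would additionally map format-specific cues, such as the title tag mentioned in Section XII, into the meta element.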
Another feature is "up-to-the-moment accurate" search. Implementing this is also quite easy. We can capture the write system call, or use some other technique to send a signal to our program (which runs constantly in the background) whenever a new file is saved or an existing file is modified or deleted. Depending on the event, we can take the appropriate action, such as indexing a newly written file or deleting index data when an existing file is deleted.
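On Linux, one such "other technique" is the kernel's inotify facility; the sketch below is a minimal illustration using the third-party watchdog library (our own choice for the sketch, not part of the system), with hypothetical indexer hooks.

    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class IndexUpdater(FileSystemEventHandler):
        """Reacts to file events so the index stays up-to-the-moment
        accurate."""

        def on_created(self, event):
            if not event.is_directory:
                index_file(event.src_path)         # hypothetical: index new file

        def on_modified(self, event):
            if not event.is_directory:
                reindex_file(event.src_path)       # hypothetical: refresh entries

        def on_deleted(self, event):
            if not event.is_directory:
                remove_from_index(event.src_path)  # hypothetical: drop index data

    observer = Observer()
    observer.schedule(IndexUpdater(), path="/home/user", recursive=True)  # example path
    observer.start()        # runs in a background thread, like our daemon
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()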
Our system comprises five modules, which do the following:
• Full Content Indexing
• Document Classification
• Information Classes
• Activity Definition
• Relevance Propagation Mechanism
The first module is completely independent of the other modules. It is responsible for keyword-based search. The next four modules handle automatic classification and beyond-keyword search. The first module has been implemented and is described in the next section.
The second module extracts keywords and assigns them relevance scores. But, as previously mentioned, we did not address this problem, and no such engine is available either. Because of this, a working system based on our information model is not currently realizable. However, as soon as such a document classification engine becomes available, our system is easily implementable. A workaround to test the capability of our system is to manually tag some documents with keywords and relevance scores and give them as input to the other three modules. However, manually tagging documents is a time-consuming and strenuous job, and because of the time constraints we were not able to test our system using this approach.
XIV. CURRENT IMPLEMENTATION

Presently, we have built a small search engine which does full content indexing and retrieves file names given multi-word queries. Currently, it treats every file as plain text while doing the indexing. It ranks the results based on the frequency of occurrence and the capitalization of the query word(s) in the document. While indexing, we also store proximity information, i.e. how close two given words are. However, the current implementation does not incorporate this in the ranking algorithm for multi-word queries.
The engine has a GUI front-end based on Qt. It lets the user do a quick search as well as a full search. In a quick search, we only search the top 64 entries of each word. If the number of results is less than a given threshold, we promote the quick search to a full search.
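The promotion logic is a few lines; quick_lookup and full_lookup below are hypothetical wrappers over the main and per-word index files described below.

    RESULT_THRESHOLD = 20   # assumed cutoff, in line with 20-30 viewed results

    def search(query_words):
        """Quick search over the top-64 main-index entries first;
        promote to a full search if too few results come back."""
        results = quick_lookup(query_words)     # hypothetical: main index files only
        if len(results) < RESULT_THRESHOLD:
            results = full_lookup(query_words)  # hypothetical: per-word index files
        return results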
In our implementation, we hash each word we encounter (using double hashing) and give every file we index a unique docId. The whole indexing mechanism works using four kinds of files:
A. DocInfo Files

For every file we index, we have a corresponding docInfo file. It is currently used to store the proximity information in the form of offsets from the start of the file. We also store the docId of the corresponding file, the number of words occurring in the file, and the hashes of the words. This latter information is needed to update the index when a file is changed.

B. Hash File

We have a single file in which the hash value of each word is stored. Along with this, we also store the offset of this word in the index file (described below).

C. DocId File

This is a single file used to store the docIds of the files present in the system. The docId assigned to a new file is the last assigned docId plus one. Also, if a file is updated, it is assigned a new docId: all the information corresponding to the earlier version of the file is removed and the modified file is treated like a new file.

D. Index Files

We have 256 main index files, plus an index file for each word separately. The former are used to store the top 64 entries (by relevance) for a word, while the latter store entries for all the files in which the word occurs. While searching, we initially search the main index files only, unless the results are too few or the user demands a full listing. The entries in both the main index files and the subsidiary ones are sorted by docId; this helps in multi-word queries. Each entry contains the docId, the relevance, and the offset and size of the proximity information in the corresponding docInfo file.
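Because the entries are docId-sorted, the posting lists of two query words can be intersected with a single linear merge; the sketch below uses an Entry record following the layout just described (the field names are ours).

    from collections import namedtuple

    # One index entry, following the layout described above.
    Entry = namedtuple("Entry", "doc_id relevance prox_offset prox_size")

    def intersect_postings(list_a, list_b):
        """Linear merge of two docId-sorted posting lists, returning
        the documents that contain both words (both entries are kept
        so the ranking step can combine their relevances)."""
        i, j, out = 0, 0, []
        while i < len(list_a) and j < len(list_b):
            a, b = list_a[i], list_b[j]
            if a.doc_id == b.doc_id:
                out.append((a, b))
                i += 1
                j += 1
            elif a.doc_id < b.doc_id:
                i += 1
            else:
                j += 1
        return out

The matched pairs carry the offsets and sizes needed to fetch proximity information from the docInfo files for the final ranking.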
XV. CONCLUSION AND FUTURE WORK

From the recent spate of tools released by all the major industry players, it is clear that the field of desktop search systems is currently in focus. However, most of the work till now is quite trivial and does not create a personalized search experience for the user, which, we believe, is essential for a desktop system. In our work, probably the first attempt in this direction, we present an information model and algorithms that enable efficient querying based on the model. We also incorporate all the features available in the other current desktop search systems. The two most striking features of our model are its support for non-keyword queries and for temporal queries based on activities. In the future, we envision that in response to a search query, the user will get ranked lists of activities and information classes (based on relevance and temporal joins) in addition to a unified ranked list of documents. A user could click on any of the activities to find sub-activities and documents that belong to that activity and are relevant to the query.

There is a lot of work that can be done to enhance our system. Firstly, a document classification system or a manually tagged corpus must be used to tag a large number of documents and test the effectiveness of our system. Also, the approaches for activity and sub-activity definition presented here need to be experimented with and researched further. Once all these things are in place, we hope that an actual system incorporating all these features can be implemented efficiently.
ACKNOWLEDGMENT

We would like to express our heartfelt gratitude to Prof. Sumit Ganguly, without whose guidance, inspiration and constant motivation this project would not have been possible. He introduced us to the problem, and most of the ideas described in this paper were conceived during our meetings with him. The help he extended in writing this report is invaluable.
REFERENCES

[1] Google Desktop Search. [Online]. Available: http://desktop.google.com/about.html
[2] Yahoo! Desktop Search. [Online]. Available: http://desktop.yahoo.com/features
[3] Apple Spotlight. [Online]. Available: http://www.apple.com/macosx/tiger/spotlight.html
[4] Copernic Desktop Search. [Online]. Available: http://www.copernic.com/en/products/desktop-search/
[5] Haystack Universal Information Client. [Online]. Available: http://haystack.lcs.mit.edu/
[6] Resource Description Framework (RDF). [Online]. Available: http://www.w3.org/RDF/
[7] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, vol. 30, 1998.
[8] R. Fagin, R. Kumar, and D. Sivakumar, “Comparing top k lists,” in SODA ’03: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003, pp. 28–36.
[9] Extensible Markup Language (XML). [Online]. Available: http://www.w3.org/XML/