CHAPTER 2
LITERATURE SURVEY
This thesis presents various problems and issues that exist in digital libraries and exhibits how the application of agents solves some of them. The detailed classification of intelligent agents based on various criteria is reviewed first, and then the tools used for the development of intelligent agents are discussed. This survey of intelligent agents enables one to identify the various characteristics of agent-based systems developed for solving different application problems. The survey has also been used to identify existing tools suitable for the development of new applications.
In this section, the survey of the various tools available, their capabilities for intelligent-agent-based system development, and the problems that exist with these tools are presented first (Jeff Nelson 1999, Danny Lange 1998, Joseph P. Bigus 2001, Jeff Heatson 2002). Section 2.3.1 provides a summary of the survey of intelligent agent tools and infrastructure. Section 2.3.2 discusses the various problems and capabilities of these existing tools. Finally, Section 2.3.3 presents the features of IBM Aglet and its advantages.
2.3.1 Survey on Various Tools and Features for Intelligent-Agent-Based System Development
Generally, certain criteria are taken into account while selecting the proper software for agent development, in compliance with basic abstract agent characteristics. These characteristics are platform independence, secure execution, event handling, dynamic class loading, multithreaded programming, object serialization, agent communication, easy integration, resource control, object ownership control, platform-independent object mobility, preservation and resumption of the execution state, etc.
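To illustrate why object serialization and preservation of execution state matter for agent mobility, the following minimal Python sketch serializes an agent's state so it can be restored at a destination host. This is only an analogy (the class and host names are hypothetical), not the mechanism of any particular agent toolkit, which would typically use Java object serialization:

```python
import pickle

class SimpleAgent:
    """Hypothetical agent whose state must survive a move between hosts."""
    def __init__(self, name):
        self.name = name
        self.visited = []          # itinerary state preserved across moves

    def work_at(self, host):
        self.visited.append(host)

# Serialize the agent before "dispatching" it, then restore it as the
# receiving host would -- work resumes with the preserved state.
agent = SimpleAgent("crawler-1")
agent.work_at("host-a")

payload = pickle.dumps(agent)          # byte stream sent over the network
restored = pickle.loads(payload)       # reconstructed at the destination
restored.work_at("host-b")

print(restored.visited)
```

Note that only data state is captured here; preserving and resuming the execution state itself (strong mobility) is harder and is one of the criteria on which agent tools differ.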
[Figure: Clients' interaction with an aglet through its proxy.]
Disposal: The disposal of an aglet will halt its current execution and
remove it from its current context.
[Figure: Aglet dispatch, retract and dispose operations between Context A and Context B, with aglet class files stored on disk.]
Clone Listener: Listens for cloning events. One can customize this
listener to take specific actions when an aglet is about to be cloned, when the
clone is actually created, and after the cloning has taken place.
Mobility Listener: Listens for mobility events. One can use this
listener to take action when an aglet is about to be dispatched to another
context, when it is about to be retracted from a context, and when it actually
arrives in a new context.
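The listener call-backs described above follow the classic observer pattern. A minimal Python sketch of that pattern (hypothetical names and simplified signatures; the real call-backs are Java methods) might look like:

```python
class RecordingMobilityListener:
    """Hypothetical mobility listener: records each event it observes."""
    def __init__(self):
        self.events = []

    def on_dispatching(self, agent, dest):
        self.events.append(("dispatching", agent, dest))

    def on_arrival(self, agent, dest):
        self.events.append(("arrival", agent, dest))


class MiniAgent:
    """Toy agent host that fires listener call-backs around a move."""
    def __init__(self, name):
        self.name = name
        self.listeners = []

    def add_mobility_listener(self, listener):
        self.listeners.append(listener)

    def dispatch(self, dest):
        for l in self.listeners:
            l.on_dispatching(self.name, dest)   # about to leave
        # ... agent state would be serialized and shipped here ...
        for l in self.listeners:
            l.on_arrival(self.name, dest)       # arrived at new context


agent = MiniAgent("searcher")
log = RecordingMobilityListener()
agent.add_mobility_listener(log)
agent.dispatch("context-B")
print(log.events)
```

A clone listener would be structured the same way, with call-backs before, during and after cloning instead of around a move.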
Reply Set: A reply set can contain multiple future reply objects and is
used to get results as they become available. With this object, the sender can
also choose to get the first result and ignore subsequent replies.
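The reply-set idea, collecting several future replies and consuming whichever arrives first, can be sketched with Python's standard futures, purely as an analogy to the construct described above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def slow_reply(name, delay):
    """Stand-in for a remote agent that answers after some delay."""
    time.sleep(delay)
    return f"reply from {name}"

# A "reply set": several future replies collected together; the sender
# consumes results as they become available, or keeps only the first.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [
        pool.submit(slow_reply, "agent-a", 0.3),
        pool.submit(slow_reply, "agent-b", 0.1),
        pool.submit(slow_reply, "agent-c", 0.2),
    ]
    first = next(as_completed(futures)).result()  # take the first result
    print(first)                                  # and ignore the rest
```

Here `as_completed` plays the role of the reply set's "next available result" operation.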
There are many issues in digital libraries, and these are explained in Section 1.4. Most of these issues are dynamic and knowledge sensitive. With a view to improving retrieval effectiveness, four such problems are taken up for the present investigation: optimal content allocation in federated digital libraries, intelligent domain-specific proxy design to improve caching, distributed text classification, and user-adaptive retrieval.
The huge rate of increase in data necessitates massive-scale storage architectures, since data grows faster than processors and disks improve. In addition, reliance on hardware improvements alone is not sufficient to keep pace with the demand for data storage.
None of the above methods takes content access semantics into account. In the present work, a multi-agent-based, user-access-pattern-oriented optimal content allocation method for federated digital libraries is proposed; it takes into account the content semantics and the frequency of content access in a particular region, in addition to storage and bandwidth. Normally, no guarantee can be given for the frequency of content access. However, by considering a group of user accesses, the semantics of the content access pattern can be learned using a multi-agent system. Taking the a priori information of a particular pattern for a particular period, the content is moved to the corresponding region. Content access varies with region, subject, culture, research interest and so on. A cost-effective method for content allocation in a dynamic environment becomes possible by considering the content semantics of user communities. As the number of requests for an object is initially unknown, it is assumed to be zero and the object is stored in the root server. If it is always accessed from the root, the total traffic in the system will increase unpredictably as the number of requests increases. User access pattern learning aids in allocating the (i+1)th object to the same locations, if the (i+1)th object belongs to the ith object pattern.
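The frequency-driven part of such an allocation policy can be sketched in a few lines. This is only a sketch under simplifying assumptions (hypothetical class and region names, raw counts only); the proposed method additionally learns content semantics and access patterns:

```python
from collections import defaultdict

class ContentAllocator:
    """Sketch: place each object at the region that requests it most;
    unseen objects default to the root server (assumed request count 0)."""
    def __init__(self, root="root"):
        self.root = root
        # object -> region -> observed request count
        self.counts = defaultdict(lambda: defaultdict(int))

    def record_access(self, obj, region):
        self.counts[obj][region] += 1

    def placement(self, obj):
        regions = self.counts.get(obj)
        if not regions:                 # no observed requests: keep at root
            return self.root
        return max(regions, key=regions.get)   # most-requesting region

alloc = ContentAllocator()
for _ in range(5):
    alloc.record_access("thesis.pdf", "asia")
alloc.record_access("thesis.pdf", "europe")

print(alloc.placement("thesis.pdf"))   # hottest region for this object
print(alloc.placement("unseen.pdf"))   # falls back to the root server
```

Pattern learning would extend this by placing a new (i+1)th object directly at the locations of the pattern its predecessors established, before any requests are observed.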
Content caching is one of the important issues that decide fast and economical retrieval of information in web digital libraries (Frew J. 2000, Leon Zhao J. 1999). Caching proxies (Abrams M. 1995, Almeida J. 1998, Elangelos P. Markatos 2002, Dilley J. 1999, Anawat Chankhunthod 1995, Brooks S. M. C. 1995) have become a critical component for handling web traffic and reducing both network and client latency.
System proxies are designed for general-purpose caching. Such proxies consider neither user access patterns nor content semantics, and it is hard to add user-pattern or content-semantic learning to them. All are installed with static policies, so a very high degree of performance cannot be expected. Usually, application proxies are developed to perform many application-oriented tasks (Brooks S. M. C. 1995). Considering all these issues, in the present work an intelligent agent proxy has been developed for web digital libraries; it considers content semantics and learns user access patterns.
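As a rough sketch of how learned access frequencies can replace a static cache policy, the following toy proxy cache illustrates the idea. The names are hypothetical and the policy is plain least-frequently-used eviction; the proxy developed in this work additionally learns content semantics:

```python
from collections import Counter

class LearningProxyCache:
    """Sketch of a proxy cache that evicts by learned access frequency
    (LFU) rather than by a static, installation-time policy."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}              # url -> cached document
        self.freq = Counter()        # learned per-object access pattern

    def fetch(self, url, origin_fetch):
        self.freq[url] += 1
        if url in self.store:
            return self.store[url]               # cache hit
        doc = origin_fetch(url)                  # cache miss: go to origin
        if len(self.store) >= self.capacity:     # evict least-used object
            victim = min(self.store, key=self.freq.get)
            del self.store[victim]
        self.store[url] = doc
        return doc

cache = LearningProxyCache(capacity=2)
origin = lambda url: f"<doc {url}>"
for url in ["/a", "/a", "/b", "/c"]:   # "/a" is hot; "/b" gets evicted
    cache.fetch(url, origin)
print(sorted(cache.store))
```

The point of the sketch is that eviction decisions adapt to the observed workload, which a statically configured system proxy cannot do.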
Automatic document classification has been of long-standing interest (… 1998, David D. Lewis 1995, Ellen Riloff 1993, Kamal Nigam 1998, Andrew McCallum 1999, Hsinchun Chen 2002, Dimitris Meretakis 2000, David Camacho 2001, Chandra Chekuri 1997, George Forman 2003, Chintan Patel 2003, Anne Kao 2003, Dou Shen 2004, Giuseppe Attardi 1999, Madhusudhan Kongovi 2002, Thorsten Joachims 2001, Gerard Salton 1989, Neelamagam A. 2002a, Neelamagam A. 2002b, Neelamagam A. 2002c, Shian-Hua Lin 2002, Ricardo Baeza-Yates 2004, Darrell Laham 2003, Roel Popping 2000, Guha R. 2004, Stefan Decker 2000, William B. Frakes 1992, Klaus Krippendorff 2004, Behnak Yalaghian 2002, http://wordnet.princeton.edu, Y. Li and C. Zhang 2003, Y. Li and C. Zhang 1999) to information and cognitive scientists. The aim of this classification is to build an internal semantic hierarchy that allows the user to search relevant documents either by browsing a topic hierarchy or by directly retrieving documents. This document classification is done using different types of document inputs. A variety of such text documents/contents is taken for classification, such as titles, abstracts, labeled documents, unlabeled documents, etc. For example, the title, abstract and keywords of research literature are taken for the present experiment.
There are two general principles for creating categories: cognitive economy and perceived world structure (Thorsten Joachims 2001). The principle of cognitive economy means that the function of categories is to provide maximum information with the least cognitive effort. The principle of perceived world structure means that the perceived world is not an unstructured set of arbitrary or unpredictable attributes. The attributes that an individual perceives, and thus uses for categorization, are determined by the needs of that individual, and these needs change as the physical and social environment changes. Psychologists agree that similarity plays a central role in combining different items into a single category.
The set-theoretic models (Ricardo Baeza-Yates 2004) are the fuzzy set model (David Camacho 2001, Ricardo Baeza-Yates 2004) and the extended Boolean model (Ricardo Baeza-Yates 2004). In the fuzzy set model, representing documents with a set of keywords yields descriptions that are only partially related to the real semantic contents of the respective documents. This can be modeled by considering that each term defines a fuzzy set and that each document has a degree of membership in this set; the key idea is to associate a membership function with the elements of the class. The extended Boolean model is a form of the vector model with the functionality of partial matching and term weighting. Algebraic models fall into three categories: the generalized vector space model (Liao Y. 2002, Chandra Chekuri 1997, Thorsten Joachims 2001), the latent semantic indexing model (Darrell Laham 2003) and the neural network model (Andrew McCallum 1999, Ricardo Baeza-Yates 2004). In the vector space models, a vector associates a weight with every index term. Independence of index terms in the vector model implies that the set of term vectors is linearly independent and forms a basis for the subspace of interest. The main idea in the latent semantic indexing model is to map each test document and trained index into a lower-dimensional space associated with concepts; instead of index matching, the system does concept matching. The neural network is used as a tool for pattern matching and learning in most of the conventional information retrieval models discussed.
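The index-matching step of the vector space model can be sketched in a few lines. This sketch uses raw term frequencies and cosine similarity only; practical systems would add tf-idf weighting and the preprocessing steps discussed later:

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Vector-space similarity: term-frequency vectors compared by the
    cosine of the angle between them."""
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

query = "digital library agents"
docs = ["intelligent agents for the digital library",
        "relational database tuning"]
scores = [cosine(query, d) for d in docs]
print(scores.index(max(scores)))   # index of the best-matching document
```

Latent semantic indexing differs from this sketch only in where the comparison happens: documents and queries are first projected into a lower-dimensional concept space, and the same similarity computation is carried out there.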
The text around a link to a document typically contains enough hints about its content to induce someone to read it. Such hints are also sufficient to classify the document referred to, and this kind of link analysis is indeed used to classify documents. However, the user's navigation style cannot always be guaranteed. The concept of the Semantic Web is a long-standing dream of cognitive and information scientists (Chintan Patel 2003, Guha R. 2004, Stefan Decker 2000). Cognitive modeling attempts have been made to design the semantic network that gives the best inference. Such a network may give the best classification if the whole document is converted into it. One such attempt was made at Stanford University through the TAP project (Guha R. 2004). In such systems the internal concept relationships are mapped through a completely connected mesh network, creating a coherent semantic web from disparate chunks; but the links become unmanageable, and the practical implementation of such a system is basically questionable. Carnegie Mellon University has made another attempt to design such a semantic web: the well-known RDF data model (Stefan Decker 2000). This method provides aggregation at the data-model level. Higher-level differences between data sources sometimes make it inappropriate to merge data from them directly, and a mechanism that allows these assumptions and differences to be represented and reasoned with explicitly has yet to be developed. Classifying documents with such an ontology is highly difficult.
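The triple-based structure underlying the RDF data model can be sketched as a toy in-memory store. The facts and names below are hypothetical illustrations, not part of the actual RDF or TAP implementations:

```python
class TripleStore:
    """Minimal sketch of an RDF-style store: facts are
    (subject, predicate, object) triples; None acts as a wildcard."""
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

kb = TripleStore()
kb.add("doc42", "hasTopic", "InformationRetrieval")
kb.add("doc42", "citedBy", "doc99")
kb.add("doc7",  "hasTopic", "Databases")

# All documents about information retrieval:
print(kb.query(p="hasTopic", o="InformationRetrieval"))
```

Aggregation at the data-model level means two such stores can simply pool their triples; the difficulty noted above is that differing source assumptions make a naive union of triples semantically unsafe.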
All the above models are normally used along with the conventional text-information-processing steps (Neelamagam A. 2002c, Shian-Hua Lin 2002, Ricardo Baeza-Yates 2004, William B. Frakes 1992, Klaus Krippendorff 2004). These steps involve tokenizing, stop word elimination, stemming, lemmatization, indexing, ranking, etc. After completing these steps, the user identifies the terms, and then any one of the above models is applied to categorize the documents.
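These preprocessing steps can be sketched as a small pipeline. The stop-word list and the suffix-stripping stemmer below are deliberately naive stand-ins for the real resources (a full stop-word list, a Porter-style stemmer) such systems use:

```python
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "for"}

def crude_stem(token):
    """Naive suffix stripping -- a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, stem: the usual first steps before
    any of the retrieval models above are applied."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]

print(preprocess("Indexing and ranking of the retrieved documents."))
```

The surviving stemmed terms are what get indexed and weighted by whichever retrieval model is applied afterwards.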
Multi-agent systems have also been applied to information integration. Just as the current information sources are independently constructed, information agents can be developed and maintained separately. Every information agent contains an ontology of its domain and its expertise. Each concept matrix together with the ACM classification represents the ontology. The ontology consists of descriptions of objects and their relationships (noun and verb phrases). The model provides a semantic description of the domain, which is used extensively for query processing.
Otherwise, the user has to visit the related department server and browse for the information that is needed. Instead, the user expects a system that automatically searches and presents/recommends the specific set of literature on the desktop from different servers. This is called the information integration problem (Shahram Rahimi 2003).
In this case, collecting the relevant literature from all these portals together is the most important problem to be solved. Most retrieval systems just retrieve passive results (a set of articles) at the time of searching; they are not able to retrieve literature that is added later. The main advantage of a recommendation system is that it can actively recommend newly added literature even after the searching is over.
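The active-recommendation idea, standing user interests re-matched whenever new literature arrives, can be sketched as follows. The names are hypothetical and matching here is simple keyword overlap, far cruder than a real recommender:

```python
class LiteratureRecommender:
    """Sketch of active recommendation: standing queries are re-matched
    whenever new literature arrives, instead of only at search time."""
    def __init__(self):
        self.subscriptions = {}    # user -> set of interest terms

    def subscribe(self, user, terms):
        self.subscriptions[user] = {t.lower() for t in terms}

    def publish(self, title):
        """Called when a new article is added; returns who to notify."""
        words = set(title.lower().split())
        return [u for u, terms in self.subscriptions.items()
                if terms & words]

rec = LiteratureRecommender()
rec.subscribe("alice", ["agents", "caching"])
rec.subscribe("bob", ["databases"])

# A newly added article is pushed to matching users, even though
# their original searches are long over.
print(rec.publish("Mobile agents for digital libraries"))
```

The contrast with passive retrieval is that `publish` is driven by the arrival of content, not by a fresh user query.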
User personalization (Monica Bonett 2001) is one approach to user-centric computing and one of the important issues that assist fast, relevant and economical retrieval of information in federated or distributed digital libraries. It involves gathering user information during interaction with the user, which is then used to deliver appropriate content and services tailored to the user's needs. This experience is used to serve the customer better by anticipating needs, making the interaction efficient and satisfying for both parties, and building a relationship that encourages the customer to return for subsequent operations. The main difference between customization and personalization is that customization occurs when the user configures an interface and creates a profile manually, adding and removing elements in the profile; here the user profile mapping is explicit and user-driven. In personalization, by contrast, the user is seen as passive, or at least somewhat less in control: it is the system's responsibility to monitor, analyze and react to behavior; for example, the content offered can be based on tracking surfing decisions. A personalized service need not always be based on individual user behavior or user input.
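A minimal sketch of implicit personalization, under the assumptions above (hypothetical names; the system passively counts topic clicks and ranks content by the learned interests, with no manual profile editing), might be:

```python
from collections import Counter

class UserProfile:
    """Sketch of implicit personalization: the system observes which
    topics a user reads and ranks content accordingly. Manual profile
    editing (customization) never occurs here."""
    def __init__(self):
        self.interest = Counter()

    def observe_click(self, topic):
        self.interest[topic] += 1          # passive monitoring of behavior

    def rank(self, items):
        """items: list of (title, topic); best-matching interests first."""
        return sorted(items, key=lambda it: -self.interest[it[1]])

profile = UserProfile()
for topic in ["agents", "agents", "databases"]:
    profile.observe_click(topic)

items = [("Query tuning", "databases"), ("Mobile agents", "agents")]
print(profile.rank(items)[0][0])
```

The same ranking could instead be driven by community-level behavior, matching the point that a personalized service need not rest on an individual user's input alone.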
In this chapter, the survey methodology is presented first, and then agent system classification is reviewed based on different criteria. Taking these criteria as the basis, new criteria have been proposed for agent classification. An agent of a particular functionality under a specific architecture may possess a set of characteristics. That is, an information agent, which is a functional agent, may be designed using a reactive architecture and possess characteristics such as mobility, communication and cooperation. Thus, an agent may satisfy one or more of the above classification criteria, combined or interchangeably. Agent technology is an emerging field, so new techniques based on functionality, characteristics or architecture may emerge and lead to new methods of classification in the future.