Research on ontology based information retrieval techniques

Kausar Mukadam, Fuzail Misarwala, Sindhu Nair
International Journal of Innovations & Advancement in Computer Science
IJIACS
ISSN 2347 – 8616
Volume 4, Issue 10
October 2015
Research on Ontology Based Information Retrieval Techniques
Kausar Mukadam
Undergraduate Student,
Department of Computer Engineering
Dwarkadas J. Sanghvi College of
Engineering, Mumbai, India.
Fuzail Misarwala
Undergraduate Student,
Sindhu Nair
Assistant Professor,
ABSTRACT
Information retrieval can be a daunting task owing to the
colossal amount of information available
on the web. Search engines have to be precise and efficient
with the information they retrieve. They have to be
efficient in terms of time, space, and most importantly,
relevance of the documents retrieved. Users searching
with keywords want results that are accurate and
match their intent. In this paper, we study and
compare a few novel methodologies for information
retrieval in terms of their relevance scores and precision
ratings of the search results. The empirical data put forth
in this paper are directly obtained from the calculations
and results presented by the authors of the respective
proposed information retrieval techniques. We compare
the algorithms used in these proposals and their targeted
domains.
KEYWORDS
Ontology, Information, Retrieval.
INTRODUCTION
The advent of the World Wide Web brought with
it a large volume of data and information that is
readily available for public use. Information
retrieval presents a means to gather this information
by reducing information overload. Information
retrieval (IR) can be defined as finding material or
documents of an unstructured nature, usually in the
form of text, which satisfies an information need
from large collections of data.
As most of the data is unstructured, numerous
information retrieval techniques have been
developed to help deal with the huge amounts of
unstructured knowledge accessible over networks.
The information retrieval techniques commonly
used are keyword-based: they use lists of
keywords to describe the information content. The
main drawback of this method is that it provides no
information about the semantic relationships
between these keywords, which makes
these systems difficult for ordinary users.
Describing their needs and then translating them into
a keyword-based request is a problem, as information
needs cannot always be expressed appropriately in
system terms.
One widely used approach to combat this issue is to
incorporate ontologies into the information system.
Ontologies represent the essential concepts in a
subject area along with the semantic relationships
among them. An ontology provides metadata elements
and a familiar vocabulary for annotating
resources, and uses its class hierarchy and class
relations for metadata interpretation.
1. INFORMATION RETRIEVAL MODELS
To retrieve related documents, the documents are usually
transformed into an appropriate representation.
Each information retrieval strategy uses a
specialized model for the representation of
documents, which guides research and provides a
blueprint for implementing a retrieval system. The
model predicts what will be relevant to the user
given the user query. To make these predictions, the
models are grounded in some branch of
mathematics, in order to formalise the model, ensure
consistency, and establish that it can be
implemented within a real system [1].
The major models for retrieval of information are
the Boolean model, the Statistical model (including
vector space and probabilistic retrieval model) and
the Linguistic and Knowledge-based models.
1.1. Boolean Model:
The Boolean model is one of the first models of
information retrieval and is a much criticised model.
In this model, a query term is treated as an unambiguous
definition of a document set. For example, the query
term 'finance' defines the set of all documents that
are indexed with the term finance. The
operators of George Boole's mathematical logic
(logical product AND, logical sum OR and logical
difference NOT) can be combined along with query
terms and sets of documents to form new document
sets.
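The set view of the Boolean model can be sketched in a few lines of Python. This is only an illustration: the inverted index, its terms and the document ids below are hypothetical and do not come from any of the surveyed systems.

```python
# Hypothetical inverted index: term -> set of ids of documents indexed with it.
index = {
    "finance": {1, 2, 4},
    "banking": {2, 3, 4},
    "sports": {4, 5},
}


def docs(term):
    """Return the set of documents indexed with the term (empty if unknown)."""
    return index.get(term, set())


# Logical product: finance AND banking
and_result = docs("finance") & docs("banking")

# Logical sum: finance OR banking
or_result = docs("finance") | docs("banking")

# Logical difference: finance NOT sports
not_result = docs("finance") - docs("sports")
```

Each operator maps directly onto a set operation, which is why the result is an unranked set: a document either belongs to it or does not.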
1.2. Extended Boolean Model:
Several methods have been developed to overcome
the disadvantages of the traditional model. The
traditional method has no provision for ranking,
does not support the assignment of weights to query
or document terms, and its operators are too strict.
The Smart Boolean approach and extended Boolean
models (for example, the P-norm and Fuzzy Logic
approaches) provide relevance ranking to users.
The P-norm method allows query and document
terms to carry weights that are computed
from term frequency statistics with proper
normalization procedures. The normalized weights
are used to rank documents in decreasing order of
distance from the point where all term weights are 0
for an OR query, and in increasing order of
distance from the point where all term weights are 1
for an AND query. The operators also have
an associated coefficient (P) to indicate the degree
of strictness of the Boolean operator (from 1 for
lowest strictness to infinity for highest strictness).
This method uses a distance-based measure.
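A minimal sketch of the P-norm similarity functions is given below, assuming an unweighted query (all query-term weights equal to 1) and document term weights already normalized to the range 0 to 1. The function names and the sample weights are illustrative choices, not taken from the paper.

```python
def pnorm_or(weights, p):
    """OR similarity: normalized p-norm distance from the all-zero point."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def pnorm_and(weights, p):
    """AND similarity: 1 minus the normalized p-norm distance from the all-one point."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# Hypothetical document with a strong weight for one query term and a weak one for the other.
doc = [0.9, 0.1]

loose_or = pnorm_or(doc, 1)     # P = 1: least strict
loose_and = pnorm_and(doc, 1)
strict_or = pnorm_or(doc, 50)   # large P: close to strict Boolean behaviour
strict_and = pnorm_and(doc, 50)
```

With P = 1 both operators reduce to the mean of the term weights, so AND and OR are indistinguishable. As P grows, the OR score approaches the largest weight and the AND score approaches the smallest, which is the strict Boolean behaviour.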
In Fuzzy Set theory, each element has a
degree of membership in a set, in direct
contrast to traditional binary membership. The
index term weight for a given document reflects the
degree to which the term describes the document
content. The weight is an indication of membership
of the document in the associated fuzzy set.
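The fuzzy interpretation can be sketched as follows, assuming the standard fuzzy-set operators (minimum for AND, maximum for OR, complement for NOT). The paper does not state which operators a given system uses, and the membership values below are hypothetical.

```python
# Hypothetical index-term weights in [0, 1]: the degree of membership of each
# document in the fuzzy set associated with the term.
membership = {
    "finance": {"d1": 0.8, "d2": 0.3},
    "banking": {"d1": 0.4, "d2": 0.9},
}


def mu(term, doc):
    """Degree to which the term describes the document (0.0 if not indexed)."""
    return membership.get(term, {}).get(doc, 0.0)


def fuzzy_and(a, b, doc):
    """Membership in the intersection: the smaller of the two degrees."""
    return min(mu(a, doc), mu(b, doc))


def fuzzy_or(a, b, doc):
    """Membership in the union: the larger of the two degrees."""
    return max(mu(a, doc), mu(b, doc))


def fuzzy_not(a, doc):
    """Membership in the complement."""
    return 1.0 - mu(a, doc)
```

Because the result of a query is a degree rather than a yes/no answer, documents can be ranked by it.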
1.3. Statistical Model:
These models use statistical information (term
frequencies) to determine the relevance of
documents with reference to the query, to produce a
list of documents ranked by an estimated relevance.
Some common types are vector space and
probabilistic models.
1.4. Vector Space Models:
The basic requirement of the vector space model is
that information retrieval objects are modelled as
elements in a vector space. Terms, documents,
queries, and concepts are all represented as vectors in
the vector space. This implies that the system has
linear properties, i.e. any two elements of the system
can be added to create a new element and can also
be multiplied by a real number [2]. The index
representations and the query are represented as vectors
embedded in a high-dimensional Euclidean space, in
which each term is assigned a separate dimension.
The similarity measure is usually the cosine of the
angle that separates the two vectors d and q, where
d represents the document's index representation and
q represents the query.
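The cosine measure can be sketched as below. The three-term vocabulary, the raw term counts used as weights and the document names are hypothetical; real systems typically use weighted schemes such as tf-idf.

```python
import math


def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_d * norm_q)


# One dimension per term: ["ontology", "retrieval", "finance"]
documents = {
    "d1": [2, 1, 0],
    "d2": [0, 1, 3],
}
query = [1, 1, 0]

# Rank documents by decreasing similarity to the query.
ranked = sorted(documents, key=lambda name: cosine(documents[name], query), reverse=True)
```

Here d1 shares both query terms and so ranks ahead of d2, which shares only one and is dominated by an unrelated term.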
1.5. Probabilistic Models:
Probabilistic models consider the information
retrieval process as a probabilistic inference.
The similarity of a document to a query is assessed as the probability
that the document is pertinent to the query. These
models use probabilistic theorems such as
Bayes' theorem. The probabilistic retrieval model
implements the Probability Ranking Principle that
specifies that an IR system should rank the
documents on the basis of their probability of
relevance to the user query, using all the available
information. A variety of evidence sources are used
in this method, the most common one being the
statistical distribution of terms in the relevant and
non-relevant documents. Other probabilistic models
include Bayesian Network Models, 2-Poisson
Model, Probabilistic Indexing Model, etc.
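As one concrete instance of ranking by the distribution of terms in relevant and non-relevant documents, the sketch below uses the classical Robertson/Sparck Jones relevance weight with 0.5 smoothing. This particular weight is chosen here for illustration and is not prescribed by the text above; the collection statistics are hypothetical.

```python
import math


def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones term weight with 0.5 smoothing.

    r: relevant documents containing the term
    R: relevant documents known for the query
    n: documents containing the term
    N: documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def score(doc_terms, query_terms, stats, R, N):
    """Sum the weights of the query terms that occur in the document."""
    return sum(rsj_weight(stats[t][0], R, stats[t][1], N)
               for t in query_terms if t in doc_terms)


# Hypothetical statistics: term -> (r, n), for R = 10 relevant documents out of N = 100.
stats = {"ontology": (8, 20), "the": (10, 100)}
R, N = 10, 100

doc_a = {"ontology", "the"}
doc_b = {"the"}
query = ["ontology", "the"]

score_a = score(doc_a, query, stats, R, N)
score_b = score(doc_b, query, stats, R, N)
```

A term concentrated in the relevant documents receives a positive weight, while a term that occurs in every document contributes nothing useful, so doc_a scores higher than doc_b.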
1.6. Linguistic and Knowledge-Based Models:
Linguistic and knowledge-based approaches, which
have been developed to address various problems in
information retrieval, perform a semantic and
syntactic analysis in order to retrieve documents
more effectively.
2. ONTOLOGY:
The Oxford English Dictionary [3] defines ontology
as “A set of concepts and categories in a subject
area or domain that shows their properties and the
relations between them.” An ontology is a specification
of a conceptualization, which consists of a list of
terms (names and definitions) and the relationships
between them. The terms are used to represent
important concepts, or classes of objects, of the
domain. For example, in the university domain,
faculty, students, lecture rooms, courses and
departments can be some important concepts. The
ontology concept has found use in Artificial
Intelligence, Computer Science, and Knowledge
Engineering in a myriad of related applications
including natural language processing, e-commerce,
information retrieval, and the Semantic
Web [4]. In information retrieval, ontologies have
been used to overcome the limitations of traditional
keyword-based search, to provide a vocabulary for
classification of the content, and to improve search
through class-hierarchy-based query expansion,
multifaceted browsing and searching, etc.
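Class-hierarchy-based query expansion can be illustrated with a toy ontology for the university domain. The class names and the dictionary representation below are assumptions made for this sketch; a real system would load the hierarchy from an ontology file.

```python
# Hypothetical university ontology: class -> list of its direct subclasses.
subclasses = {
    "person": ["faculty", "student"],
    "faculty": ["professor", "lecturer"],
    "student": ["undergraduate", "postgraduate"],
}


def expand(term):
    """Return the term followed by all of its descendants in the class hierarchy."""
    expanded = [term]
    for child in subclasses.get(term, []):
        expanded.extend(expand(child))
    return expanded


faculty_terms = expand("faculty")
person_terms = expand("person")
```

A query for "faculty" is thus widened to also match documents that mention only "professor" or "lecturer", which a plain keyword search would miss.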

Figure 1. Basic process of text mining information retrieval based on ontology [5].
3. SURVEY OF NOVEL METHODOLOGIES
3.1. Retrieval Model for Traditional Chinese Medicine:
Some shortcomings of traditional information
retrieval methods are discussed in [6]. The biggest
problems faced in information retrieval in the
Traditional Chinese Medicine (TCM) field are those of
low coverage and high redundancy. The motivation for
working with traditional Chinese medicinal
literature and databases is the lack of research in
TCM, despite extensive and relevant research
carried out by scholars in other fields.
The TCM domain ontology is constructed using a seven-step
method. The paper then summarizes the
implementation process of its ontology-based
information retrieval technique, which is a two-step
technique. The next part deals with concept
similarity. It is stated that in the field of ontology,
the correlation between pieces of information is
measured by the correlation between
concepts. In the domain of TCM, ontology follows
a clear hierarchical architecture. The relevance
between concepts is measured on a scale of 0 to 1.
If two concepts are unrelated, i.e. there is no
relevance, the relevance score is 0. This is an
effective method to quantify the correlation between
concepts and hence the effectiveness of the retrieval
system can be measured.
The paper defines three levels of correlation
between concepts. Thus, by determining the degree
of correlation, that is, the relevance between the retrieved
information and the information resources, the most
relevant information can be gathered.
In the final step of sorting the search results, a
sorting algorithm is proposed since the traditional
algorithms for sorting of search results may not be
very effective for the TCM domain. The measure of
concept similarity is used for the sorting of the
search results.
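The idea of scoring concept relevance between 0 and 1 in a hierarchy and sorting results by it can be sketched as follows. The similarity formula (the inverse of one plus the path length through the nearest common ancestor) is a generic stand-in chosen for this illustration and is not the measure defined in [6]; the concept names and result list are hypothetical.

```python
# Hypothetical concept hierarchy: concept -> parent concept.
parent = {
    "ginseng": "root_herb",
    "licorice": "root_herb",
    "root_herb": "herb",
    "chrysanthemum": "flower_herb",
    "flower_herb": "herb",
    "acupuncture": "therapy",
}


def ancestors(concept):
    """Path from the concept up to its root, starting with the concept itself."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def similarity(a, b):
    """Score in [0, 1]: 1 for identical concepts, 0 for unrelated ones."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = [c for c in path_a if c in path_b]
    if not common:
        return 0.0  # no shared ancestor: the concepts are unrelated
    nearest = common[0]
    distance = path_a.index(nearest) + path_b.index(nearest)
    return 1.0 / (1.0 + distance)


# Hypothetical search results, each annotated with one concept.
results = [("doc_a", "chrysanthemum"), ("doc_b", "ginseng"),
           ("doc_c", "acupuncture"), ("doc_d", "licorice")]

# Sort by decreasing similarity to the query concept.
ranked = sorted(results, key=lambda item: similarity("ginseng", item[1]), reverse=True)
ranked_ids = [doc_id for doc_id, _ in ranked]
```

Documents annotated with the query concept come first, followed by siblings, more distant relatives, and finally unrelated concepts.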
Figure 2. Information Retrieval framework for TCM[6].

inference that the newly proposed ontology based
information retrieval system was efficient and
effective in the Traditional Chinese Medicine
domain.
3.2. Semantic Indexing based Information Retrieval Model:
Some limitations of traditional methods, such as the inability to describe
relations between search terms, are dealt with in [7].
The proposed framework addresses three important
issues in semantic search and information
retrieval: scalability, usability, and
retrieval performance. To improve
scalability, a semantic indexing approach
based on an entity retrieval model is suggested.
Usability is improved through the adoption of a
keyword-based interface. The use of domain-specific
information extraction, rules, and inference
is proposed to improve retrieval performance.
The proposed framework is based on three key
processes: representation of semantic
knowledge, semantic indexing, and querying. For the
representation of semantic knowledge, an
existing ontology is reused in the implementation of
information retrieval in transport systems, expressed in the OWL Web
Ontology Language. At the end of the first
step, useful OWL files are obtained that are indexed
for the search.
The next step is semantic indexing. An indexing
system is designed using an entity retrieval model, because
the knowledge base is composed of entities
defined in RDF, RDFS and OWL. The knowledge
base is a
weighted and labelled graph where the edges are
properties and the nodes are resources. The
graph is a set of RDF triples, each of which consists of three
components: subject, predicate, and object.
The subject identifies the resource
described by the triple, the predicate names the
property of that resource being described, and the
object gives the value of that property. The EAV
(Entity Attribute Value) model is adopted
for the indexing system. The indexing structure is
then described, as it largely affects the retrieval
performance.
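The mapping from RDF triples to an entity-attribute-value structure can be sketched as below. The transport-domain triples and the nested-dictionary layout are hypothetical and only illustrate the idea; they are not the index structure described in [7].

```python
from collections import defaultdict

# Hypothetical transport-domain triples: (subject, predicate, object).
triples = [
    ("bus_12", "type", "Bus"),
    ("bus_12", "servesStop", "central_station"),
    ("bus_12", "servesStop", "airport"),
    ("central_station", "type", "Stop"),
]


def build_eav(triples):
    """Group triples by entity (subject), then attribute (predicate), collecting values (objects)."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        index[subject][predicate].add(obj)
    return index


eav = build_eav(triples)
```

Each subject becomes an entity, each predicate an attribute, and each object a value, so everything known about one entity can be read in a single lookup.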
The next step is semantic querying. It is the process
of querying the EAV graph after the semantic
knowledge is represented and indexed. Three
types of queries are supported: full text,
structural, and semi-structural. SIRE is used to run the search query, and results
are obtained using a Boolean combination of
attribute-value pairs based on logical operators.
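A Boolean combination of attribute-value pairs over an EAV structure might look like the sketch below. It does not reproduce the SIRE interface; the `match` helper and the sample entities are hypothetical.

```python
# Hypothetical EAV data: entity -> attribute -> set of values.
eav = {
    "bus_12": {"type": {"Bus"}, "servesStop": {"airport", "central_station"}},
    "bus_7": {"type": {"Bus"}, "servesStop": {"harbour"}},
    "tram_3": {"type": {"Tram"}, "servesStop": {"airport"}},
}


def match(attribute, value):
    """Return the entities whose attribute holds the given value."""
    return {entity for entity, attrs in eav.items()
            if value in attrs.get(attribute, set())}


# type = Bus AND servesStop = airport
buses_to_airport = match("type", "Bus") & match("servesStop", "airport")

# servesStop = airport OR servesStop = harbour
airport_or_harbour = match("servesStop", "airport") | match("servesStop", "harbour")

# type = Bus AND NOT servesStop = airport
buses_not_airport = match("type", "Bus") - match("servesStop", "airport")
```

Each attribute-value pair selects a set of entities, and the logical operators combine those sets.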
The proposed information retrieval method is then
evaluated using a set of pre-set queries, showing a
high rate of precision.
Figure 3. Framework for Semantic Indexing based Information Retrieval [7].

3.3. Semantic Extension Retrieval Model:
A new technique aimed at tackling some key
problems posed by traditional keyword-based
methods is proposed in [8]. Firstly, keywords
do not always convey the full
meaning of the content, so the retrieved
information may be irrelevant. Secondly, a
keyword may have different meanings in different
contexts, which leads to difficulties in the
processing of query features. Thirdly, due to
polysemy and synonymy in natural
language, keyword-based retrieval can only cover
information containing the same word, while other
information with a similar meaning but different
words is missed [8]. To overcome these
issues, an information retrieval technique based on
semantic extension is proposed.
The semantic retrieval is based on semantic
extension. The strategy considers whether or not the
result is suitable for the user's query. The proposed
model differs from the traditional
model of expressing content features through
keywords, since it uses
ontology annotation to summarize the
semantic features of the information and makes use
of semantic extension for retrieval. The model
has two parts: ontology
annotation and retrieval of text based on semantic
extension.
Firstly, ontology annotation indexes documents
based on the ontology of the domain. This serves as the
foundation for the text retrieval. Next, the query
keyword is extended and turned into
a full-text search. The results obtained are then
reordered. The indexing is executed by an index
writer, which adds documents to the index and
serves as the core component for the construction of
the index. The core component for retrieval is the
index reader, which reads the index. The analyser
pre-treats the documents through ontology
annotations and sends the content to the index
writer. It also matches the keyword from the query
to the domain. The analyser has a subcomponent,
the ontology encoder, which processes the
elements of the domain ontology into a multi-tree
that is in turn used for annotation and keyword
matching. The results obtained after further
processing are reordered.
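The flow of extending a keyword, running a full-text search and reordering the results can be sketched as follows. The tiny ontology, the `Index` class with its `write` and `read` methods, and the hit-count reordering are all assumptions made for illustration and do not reproduce the components built in [8].

```python
from collections import defaultdict

# Hypothetical domain ontology as a tree; each concept has synonyms and child concepts.
ontology = {
    "vehicle": {"synonyms": [], "children": ["car", "bus"]},
    "car": {"synonyms": ["automobile"], "children": []},
    "bus": {"synonyms": ["coach"], "children": []},
}


def extend(keyword):
    """Extend a keyword with its synonyms and all descendant concepts."""
    node = ontology.get(keyword)
    if node is None:
        return [keyword]
    terms = [keyword] + node["synonyms"]
    for child in node["children"]:
        terms.extend(extend(child))
    return terms


class Index:
    """Toy inverted index playing the roles of index writer and index reader."""

    def __init__(self):
        self.postings = defaultdict(set)

    def write(self, doc_id, text):
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def read(self, term):
        return self.postings.get(term, set())


def search(index, keyword):
    """Full-text search over the extended terms, reordered by number of matching terms."""
    hits = defaultdict(int)
    for term in extend(keyword):
        for doc_id in index.read(term):
            hits[doc_id] += 1
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


index = Index()
index.write("d1", "An automobile is also called a car")
index.write("d2", "The coach left early")
index.write("d3", "Weather report")

results = search(index, "vehicle")
```

None of the sample documents contains the word "vehicle", so a plain keyword lookup returns nothing, while the extended search finds d1 and d2 through related terms.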
For the performance evaluation of the proposed
technique, 1000 papers were collected and tested
on. Precision and recall measures were used for
evaluation, and the experimental results show a
fairly high rate of recall and precision as compared
to traditional keyword based information retrieval.
Figure 4. Proposed framework for Semantic Extension Retrieval Model [8].

4. PERFORMANCE MEASURES:
An ideal performance measure for an information
retrieval system would take into account the
resources used by the system to perform a retrieval
operation, the amount of effort and time spent by a user
to obtain the needed information, and the ability of the
system to retrieve useful items. Such a measure is
extremely hard to implement. In practice, the user wants
the system to retrieve as many
relevant items as possible and to reduce the number of
non-relevant items in the response. The former
criterion is represented by Recall, and the latter
by Precision [9].
Recall (also known as sensitivity in binary
classification) can be defined as the fraction of the
relevant documents that are retrieved in response to the user
query.
Precision (also known as positive predictive value)
is the fraction of the retrieved documents which
are relevant to the information requirement of the
user.
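Both measures can be computed from the set of retrieved documents and the set of relevant documents, as in the sketch below. The document ids are hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

p = precision(retrieved, relevant)  # 2 of the 4 retrieved documents are relevant
r = recall(retrieved, relevant)     # 2 of the 5 relevant documents were retrieved
```

The two measures share a numerator (the relevant documents that were retrieved) and differ only in the denominator, which is why a system can trade one against the other by returning more or fewer results.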
5. RESULT ANALYSIS:
Precision and Recall are the best performance
measures for any novel technique proposed by
researchers, as the primary goal of these information
retrieval techniques is the return of relevant pages
as the search result. All of the techniques studied
above report a high success rate
in their authors' own experiments. The reported average
precision and recall rates show a significant rise
when compared to traditional keyword-based
methods, and serve as evidence that the
proposed methodologies have overcome the
difficulties that they set out to address in their respective
domains.
6. CONCLUSION AND FUTURE WORK:
Various novel ontology-based information retrieval
techniques have been proposed by researchers
for one or more
domains, aimed at overcoming certain problems
encountered with traditional keyword-based
algorithms. The techniques surveyed here are reported to
overcome the stated issues and return high recall
and precision rates, and hence merit wider use
for information retrieval.
REFERENCES
[1] D. Hiemstra, "Information Retrieval Models," in
A. Goker and J. Davies (eds.), Information Retrieval:
Searching in the 21st Century, 2009.
[2] V. V. Raghavan, "A critical analysis of vector space
model in information retrieval," Journal of American
Society for Information Science, 1986.
[3] "Oxford English Dictionary," [Online]. Available:
http://www.oxforddictionaries.com/definition/english/ontology.
[4] S. S. Yasodha, "An Ontology-Based Framework for
Semantic Web Content Mining," International
Conference on Computer Communication and
Informatics (IEEE), 2014.
[5] M. Q. Song Yibing, "Research of literature
information retrieval method based on ontology,"
IEEE, 2014.
[6] Y. Z. D. Z. H. L. H. R. Aziguli Wulamu, "The
Research and Application of Ontology-Based
Information Retrieval," IEEE 9th Conference on
Industrial Electronics and Applications (ICIEA),
2014.
[7] M. A. Amir Zidi, "A Generalized Framework for
Ontology-Based Information Retrieval," IEEE, 2013.
[8] H. L. Rui Zhang, "Design and Realization of
Semantic Extension Information Retrieval
Mechanism," Third International Conference on
Information Science and Technology, 2013.
[9] V. V. Raghavan, "A Critical Investigation of Recall
and Precision as Measures of Retrieval System
Performance," ACM Transactions on Information
Systems, 1989.
