A Knowledge Based Approach To Resolve Word Level Ambiguity for Machine Translation
Available at http://www.ijcsonline.com/
Abstract
Researchers in the natural language processing community have proposed various approaches to resolving word
ambiguity, i.e. determining the correct meaning of a word in a particular context. These approaches range from
knowledge-based methods to machine learning. This paper presents an approach that integrates various knowledge
sources, i.e. sources from which data or information about an ambiguous word can be acquired, to resolve the
ambiguity of all open-class words after removing stop words from the target sentence.
Keywords: Word Sense Disambiguation, Natural language processing, Corpus, WordNet, Word Sense, Lexical knowledge.
206 | International Journal of Computer Systems, ISSN-(2394-1065), Vol. 02, Issue 05, May, 2015
Roshan R Karwa et al A Knowledge based Approach to Resolve Word Level Ambiguity for Machine Translation
if-then-else approach; the most appropriate feature in a decision list is one sense per collocation (Agirre and Martinez, 2000). Exemplar-based learning, an algorithm that depends mostly on examples, was studied by the same researchers who examined Naive Bayes (Escudero et al. [7], 2000). The most recent algorithm is the Support Vector Machine (SVM), a binary classifier that separates the relevant senses from the irrelevant ones (Navigli, 2009 [10]). These are some of the supervised approaches. The main problem with supervised approaches is the effort of producing a manually tagged corpus.

To overcome this disadvantage of supervised approaches, researchers proposed unsupervised methods. Mihalcea and Moldovan (2001) [9] applied an unsupervised approach, called the feature selection method, to an untagged corpus. This method is automatic in nature. Researchers then tried other unsupervised approaches, such as the rank-based Personalized PageRank algorithm (E. Agirre and A. Soroa [4], 2009) and similarity-based algorithms (R. Navigli and M. Lapata [10], 2010). A clear disadvantage is that, so far, the performance of unsupervised systems lies well below that of supervised systems, owing to clustering issues.

III. OPEN SOURCE TOOLS

For this project, we use a number of open source tools that are referenced throughout. The tools and resources include WordNet, a Java interface to WordNet, a part of speech tagger, and SemCor. Some of these tools, WordNet for example, provide definitions and relations. Some resources, like SemCor, provide examples of correctly tagged text. The sections below explain what each tool or resource is and how this project uses it.

3.1 WordNet

WordNet is a publicly available lexical database developed by Princeton University (Miller). WordNet 3.0 contains 206,941 word-sense pairs across 117,659 synsets, which are groups of synonyms. This means that 117,659 unique definitions are available. This project uses WordNet to obtain the correct definition of a sense. According to Fellbaum (1998) [5], WordNet qualifies as a lexical database because it records the semantic relations among words. WordNet is so rich in information and so well executed that it is one of the most familiar tools for word sense disambiguation.

3.2 WordNet Interface

The WordNet interface gives access to WordNet through the Java API to WordNet. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. Methods such as getDict, getLemma, getIndexTerms, getSynonym and getSynset are used for different purposes. WordNet is helpful in identifying whether a word is ambiguous or not.

3.3 SemCor Corpus

Princeton University developed SemCor, which originates from the Brown Corpus (Princeton University, 2011). The SemCor files contain over 20,000 tagged words across 352 files. Every tag contains the part of speech, the lemma, and the correct WordNet sense. This makes SemCor extremely useful for researchers using WordNet. SemCor provides a professionally tagged resource against which the accuracies of word sense disambiguation algorithms can be compared. Because the data is not in a directly usable structure, we used Java's DOM parser to transform the corpus from its XML format into four fields: word form, POS, lemma, and word sense number.

3.4 Part Of Speech Tagger

We used Stanford's MaxentTagger for part of speech tagging, accessed through its Java library. The package contains two taggers: a bi-directional dependency network tagger, whose reported accuracy is 97.32%, and a tagger that uses only left second-order sequence information, whose reported accuracy is 96.92%. We used the Java API: a MaxentTagger is created with a constructor that takes as its argument the location of the parameter files for a trained tagger. It assigns each word a part of speech, which later plays a very important role in disambiguation.

3.5 JAWS Library

Java API for WordNet Searching (JAWS) is a library that we use for retrieving relations from the WordNet lexical database. It is maintained by the CSE department at Southern Methodist University. Retrieving the information associated with an ambiguous term, i.e. its word forms, phrases etc., supplies knowledge about that term.

IV. PROPOSED SYSTEM

We propose a hybrid approach, i.e. a system that selects the best sense based on both the corpus and world knowledge. For feature extraction, the hybrid approach integrates all knowledge sources: the POS and word form of the ambiguous word, the context words within a window of size (+2, -2), the root form of the word, and the root form and POS of each context word. The steps for disambiguation are: representation of the context, preprocessing of the target sentence containing the ambiguous word, feature extraction, classification, and extraction of the correct sense. In the literature survey, we came across many algorithms that do not address irrelevant data in the corpus. This work is an attempt to obtain the correct sense of an ambiguous word in the given context.
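
The preprocessing and context-window steps of the proposed system can be illustrated with a minimal Java sketch. It is not the authors' implementation: the class name ContextWindowSketch, the tiny stop-word list and the example sentence are assumptions made for illustration, and the POS and root-form features are omitted because they would require the tagger and WordNet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only: stop-word removal followed by a (-2, +2) context window. */
public class ContextWindowSketch {

    // Hypothetical, deliberately small stop-word list; the paper does not specify one.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "to", "in", "on", "at", "and", "is", "was", "he", "she", "it"));

    /** Lower-cases the sentence, splits on non-letters, and drops stop words. */
    public static List<String> removeStopWords(String sentence) {
        List<String> kept = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Returns up to 'size' words on each side of the first occurrence of 'target'. */
    public static List<String> contextWindow(List<String> words, String target, int size) {
        List<String> context = new ArrayList<>();
        int index = words.indexOf(target);
        if (index < 0) {
            return context; // target not present: empty context
        }
        int from = Math.max(0, index - size);
        int to = Math.min(words.size() - 1, index + size);
        for (int i = from; i <= to; i++) {
            if (i != index) {
                context.add(words.get(i));
            }
        }
        return context;
    }

    public static void main(String[] args) {
        List<String> words =
                removeStopWords("He sat on the bank of the river and watched the water");
        System.out.println(words);                           // remaining open-class words
        System.out.println(contextWindow(words, "bank", 2)); // neighbours of the target
    }
}
```

Removing stop words before taking the window means the two positions on each side are filled with content words rather than function words, which is one plausible reading of the order of the steps above.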
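
The SemCor transformation mentioned in Section 3.3 (DOM parsing into word form, POS, lemma and word sense number) might look roughly like the sketch below. It assumes an XML rendering of SemCor in which each token is a wf element carrying pos, lemma and wnsn attributes; the class name SemCorDomSketch and the sample fragment are hypothetical, and the layout of the authors' actual files is not given in the paper.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative sketch only: flattens assumed SemCor-style XML into four fields per token. */
public class SemCorDomSketch {

    /** Returns one {wordForm, pos, lemma, senseNumber} row per wf element. */
    public static List<String[]> extractRows(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList tokens = doc.getElementsByTagName("wf");
            List<String[]> rows = new ArrayList<>();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element wf = (Element) tokens.item(i);
                // getAttribute returns "" when an attribute is absent (e.g. untagged words).
                rows.add(new String[] {
                        wf.getTextContent().trim(),
                        wf.getAttribute("pos"),
                        wf.getAttribute("lemma"),
                        wf.getAttribute("wnsn")
                });
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse corpus fragment", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment in the assumed format, not taken from the corpus.
        String xml = "<s snum=\"1\">"
                + "<wf pos=\"DT\">The</wf>"
                + "<wf pos=\"NN\" lemma=\"bank\" wnsn=\"1\">bank</wf>"
                + "</s>";
        for (String[] row : extractRows(xml)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Each row could then be written to a flat file or table, so that later stages read the four fields without parsing XML again.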