
International Journal of Computer Systems (ISSN: 2394-1065), Volume 02– Issue 05, May, 2015

Available at http://www.ijcsonline.com/

A Knowledge based Approach to Resolve Word Level Ambiguity for Machine Translation
Roshan R Karwa (A), Dr. M.B. Chandak (B), Deepak Pande (C)
(A) M.Tech Scholar, CSE Department, SRCOEM, Nagpur, India
(B) Professor and Head, CSE Department, SRCOEM, Nagpur, India
(C) Technical SME, Project Management Office, Amdocs-Rogers-Brampton, Canada

Abstract

Professionals in the natural language processing community have proposed various approaches to resolve word ambiguity, that is, to obtain the correct meaning of a word in a particular context. These approaches range from knowledge based to machine learning. This paper deals with an approach based on the integration of various knowledge sources, from which information related to an ambiguous word can be acquired, to resolve the ambiguity of all open-class words after removing stop words from the target sentence.

Keywords: Word Sense Disambiguation, Natural language processing, Corpus, WordNet, Word Sense, Lexical knowledge.

I. INTRODUCTION
Human language is ambiguous, as words can have multiple meanings. For example, 'play' in English can have a meaning related to sport or a meaning related to a drama performance. Obtaining the correct meaning of an ambiguous word comes automatically to human beings, but it is difficult for a system because it lacks real-world knowledge. The task of identifying the real sense, that is, determining the correct meaning of words in context, is called Word Sense Disambiguation (WSD). The important step in WSD is as follows: given a set of words, a classifier is applied that makes use of one or more sources of knowledge to find the most appropriate senses for the words in context. Knowledge sources are of two types: corpus evidence, which is either unlabelled or annotated with word senses, and dictionary-related resources such as machine-readable dictionaries and thesauruses. Without knowledge sources, it is difficult for both humans and machines to identify the correct sense. Several WSD techniques have been proposed in the past, ranging from knowledge-based to supervised to unsupervised methods. Supervised and unsupervised methods rely on corpus evidence. Knowledge-based methods rely on knowledge resources such as machine-readable dictionaries and thesauruses. Because a fully supervised approach requires a large annotated corpus, a supervised approach with less annotation is to be used.

This paper is organized as follows: Section 2 covers related work in WSD; Section 3 covers open source tools; the proposed approach is in Section 4; implementation and execution are in Section 5.

II. RELATED WORK IN WSD
After facing problems related to natural language processing, researchers proposed various approaches to overcome them; some approaches are based on dictionaries and some on corpus evidence.

Knowledge-based approaches depend on knowledge resources such as WordNet, which is a dictionary, and various thesauruses. They are also referred to as dictionary-based approaches, because they depend on dictionaries to obtain the correct sense. Agirre et al. [1], 1996 proposed Word Sense Disambiguation with the Conceptual Density method. The basic idea of this method is to select a sense based on conceptual distance, that is, how the ambiguous word and its context words are related. This result was later extended by the same researcher, Agirre et al. [2], 2001, with a changed approach to finding the correct sense, which they called the selectional preference method. This method looks for probable associations between word categories; the simplest measure of this word-to-word relation is a frequency count. Overlap-based approaches such as Lesk and Extended Lesk are based purely on matching the word and its context words. This approach was suggested by Satanjeev Banerjee and Ted Pedersen [14], 2002. The basic problem with this approach is that it depends heavily on dictionaries, which have some restrictions on acquiring common-sense knowledge.

Machine learning approaches are based purely on a corpus, which is either tagged or untagged; supervised and unsupervised methods both come under machine learning. Supervised WSD learning uses a tagged corpus and includes training and testing modules: during training, preprocessing is done first and a learning algorithm is then applied, and an unknown sample is tested based on the trained data. Naïve Bayes, decision lists and support vector machines are some of the supervised approaches. The Naïve Bayes learning approach is a mathematical one: to find the correct sense, it depends on a simple conditional probability calculation over features such as collocation, co-occurrence and part of speech (Gerard Escudero et al. [7], 2000). The decision list algorithm is a simple if-then-else approach; the most appropriate feature in a decision list is one sense per collocation (Agirre, E. and Martinez, D. [3], 2000). Exemplar-based learning, an algorithm that depends mostly on examples, was researched by the same researchers as Naïve Bayes (Gerard Escudero et al. [7], 2000). The latest algorithm is Support Vector Machines (SVM), which is based on binary classes: it separates the classes into relevant and irrelevant senses (Navigli [10], 2009). The main problem with supervised approaches is the effort of generating the manually tagged corpus.

To overcome this disadvantage of supervised approaches, namely manually creating a tagged corpus, researchers proposed unsupervised methods. Mihalcea and Moldovan [9], 2001 used an unsupervised approach on an untagged corpus, called the feature selection method. This method is automatic in nature. Researchers then tried other unsupervised approaches, such as the rank-based Personalized PageRank algorithm (E. Agirre and A. Soroa [4], 2009) and similarity-based algorithms (R. Navigli and M. Lapata [12], 2010). A clear disadvantage is that, so far, the performance of unsupervised systems lies a lot lower than that of supervised systems due to clustering issues.

III. OPEN SOURCE TOOLS
For this project, we use a number of open source tools that are referenced throughout the project. The tools and resources include WordNet, a Java interface to WordNet, a part-of-speech tagger, and SemCor. Some of these tools, WordNet for example, provide definitions and relations. Some resources, like SemCor, provide examples of correctly sense-tagged text. The sections below explain what each tool or resource is and how this project uses it.

3.1 WordNet
WordNet is a publicly available lexical database developed by Princeton University (Miller). There are 206941 word-sense pairs across 117659 synsets, which are groups of synonyms, in WordNet 3.0. This means that there are 117659 unique definitions available. This project uses WordNet to obtain the correct definition of a sense. According to Fellbaum [5], 1998, WordNet is a lexical database because it records the semantic relations between words. WordNet is so rich in information and so well executed that it is one of the most familiar tools for word sense disambiguation.

3.2 WordNet Interface
The WordNet interface is an interface to WordNet through the Java API for WordNet. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. Methods such as getDict, getLemma, getIndexTerms, getSynonym and getSynset are used for different purposes. WordNet is helpful in identifying whether a word is ambiguous or not.

3.3 SemCor Corpus
Princeton University developed SemCor, which originates from the Brown Corpus (Princeton University, 2011). The SemCor files contain over 20,000 tagged words across 352 files. Every tag contains the part of speech, the lemma, and the correct WordNet sense. This makes SemCor extremely useful for researchers using WordNet. SemCor provides a professionally tagged resource for comparing the accuracies of word sense disambiguation algorithms. As the data is unstructured, to make it suitable for automatic processing, we used Java's DOM parser and transformed the corpus XML format into four fields: word form, POS, lemma, and word sense number.

3.4 Part Of Speech Tagger
We used Stanford's MaxentTagger for part-of-speech tagging, through its Java library. There are two taggers in this package: one is a bidirectional dependency network tagger whose reported accuracy is 97.32%, and the second uses only left second-order sequence information, with a reported accuracy of 96.92%. Through the Java API, a MaxentTagger can be created with a constructor that takes as its argument the location of the parameter files for a trained tagger. It gives the proper part of speech, which later plays a very important role in disambiguation.

3.5 JAWS Library
Java API for WordNet Searching (JAWS) is a library that we use for retrieving relations from the WordNet lexical database. It is maintained by the CSE department at Southern Methodist University. We can provide knowledge about ambiguous terms by retrieving information associated with them, such as word forms and phrases.

IV. PROPOSED SYSTEM
We propose a hybrid approach, that is, the system gives the best sense depending on the corpus as well as world knowledge. For feature extraction, the hybrid approach integrates all knowledge sources: the POS and word form of the ambiguous word, the context words within a window of size (+2, -2), the root form of the word, and the root forms and POS of the context words. The steps for disambiguation are: representation of context, preprocessing of the target sentence containing the ambiguous word, extraction of features, the classification task, and extraction of the correct sense. In the literature survey, we came across many algorithms that do not address irrelevant data in the corpus. This work is an attempt to obtain the correct sense of an ambiguous word in the given context.
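The feature set described for the proposed system (word form, POS and root form of the ambiguous word, plus the root forms and POS of the context words in a (+2, -2) window) can be illustrated with a minimal Python sketch. The paper's implementation is in Java and is not shown; the function name `extract_features`, the token tuple layout and the feature key names below are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical token layout: (word form, POS tag, lemma).
Token = Tuple[str, str, str]


def extract_features(tokens: List[Token], target: int, window: int = 2) -> Dict[str, str]:
    """Collect features for tokens[target] from a (+window, -window) context."""
    word, pos, lemma = tokens[target]
    features = {"word": word, "pos": pos, "lemma": lemma}
    for offset in range(-window, window + 1):
        idx = target + offset
        if offset == 0 or idx < 0 or idx >= len(tokens):
            continue  # skip the target itself and positions outside the sentence
        _, ctx_pos, ctx_lemma = tokens[idx]
        features[f"lemma[{offset:+d}]"] = ctx_lemma
        features[f"pos[{offset:+d}]"] = ctx_pos
    return features


# Content words of an example sentence, already tagged and lemmatized.
sentence = [
    ("theater", "NN", "theater"),
    ("play", "NN", "play"),
    ("bang", "NN", "bang"),
    ("released", "VBN", "release"),
]
print(extract_features(sentence, target=1))
```

Positions that fall outside the sentence are simply omitted here; padding them with a placeholder symbol would be an equally reasonable choice.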


Figure 1 Block Diagram of Proposed System

As SemCor text is an unstructured source of information, it is transformed into a structured format to make it suitable for an automatic method. For this reason, both the corpus and the target sentence need to be preprocessed. After both are preprocessed, features are extracted to train a classifier, and finally the classification method is applied to extract the correct sense.

Algorithm 1 for Without Lexical Knowledge
//training
1. Extract features from the corpus sentences.
2. Train the classifier with the features extracted above.
//disambiguation
3. Select some words around the target word.
4. Compare the ambiguous word with the words of the feature set, which is the integration of knowledge sources, in each class.
5. Calculate the probabilities for each sense.
//calculation of winner sense
6. Compare the results for each sense and assign the sense with the maximum probability.

Algorithm 2 for With Lexical Knowledge
//training
1. Extract features from the corpus sentences.
2. Train the classifier with the features extracted above.
//disambiguation
3. Select some words around the target word.
4. Include world knowledge.
5. Compare the ambiguous word (also considering here the lexical knowledge of the ambiguous word) with the words of the feature set, which is the integration of knowledge sources, in each class.
6. Calculate the probabilities for each sense.
//calculation of winner sense
7. Compare the results for each sense and assign the sense with the maximum probability.

V. IMPLEMENTATION AND EXECUTION
The above algorithms were implemented in Java, and the GUI was also built in Java.

5.1. Representation of Context
We used the SemCor corpus, which is in XML format as shown in Figure 2.

Figure 2 Corpus XML file format

The data we obtained is in XML form. Each data element has attributes such as pos, lemma and its WordNet sense, with the actual word as its value. Attributes such as POS, lemma and wnsn (word sense number) help in feature creation. As text is an unstructured source of information, it is transformed into a structured format to make it suitable for an automatic method. For that purpose, we use a DOM parser to parse the XML files and represent the information for each word in four fields: word, POS, lemma and word sense number, as shown in Figure 3. During this preprocessing, stop words, that is, unnecessary information, are also removed.

Figure 3 Extracted data from Corpus

5.2 Preprocessing of target sentence containing ambiguous word
The main step is to find the ambiguous word in the target sentence. The target sentence may contain more than one target word; that is, rather than the lexical-sample WSD task, the focus is on all-words WSD. To find the ambiguous word, part-of-speech tagging of the sentence is performed as in Figure 4. Here, we consider only those words whose part of speech is NOUN, VERB, ADJECTIVE or ADVERB.
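As a rough illustration of this filtering step, the following Python sketch keeps only open-class words from tagger output written as `word_TAG` tokens. It assumes Penn Treebank style tags (prefixes NN, VB, JJ, RB) and a tiny hand-picked stop-word list; the stop-word list actually used by the authors is not given, and the function name `content_words` is hypothetical.

```python
from typing import List

# Penn Treebank tag prefixes for nouns, verbs, adjectives and adverbs.
OPEN_CLASS_PREFIXES = ("NN", "VB", "JJ", "RB")

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "is", "at", "a", "an", "of", "in", "on"}


def content_words(tagged_sentence: str) -> List[str]:
    """Return the open-class, non-stop words of a 'word_TAG word_TAG ...' string."""
    kept = []
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("_")  # split on the last underscore
        if word.lower() in STOP_WORDS:
            continue  # drop stop words even if their tag is open-class (e.g. is_VBZ)
        if tag.startswith(OPEN_CLASS_PREFIXES):
            kept.append(word)
    return kept


tagged = "the_DT theater_NN play_NN bang_NN is_VBZ released_VBN at_IN princeton_NN hall_NN"
print(content_words(tagged))
# ['theater', 'play', 'bang', 'released', 'princeton', 'hall']
```

Lemmatization or stemming of the retained words would follow as a separate step and is not covered by this sketch.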


    Enter Sentence: the theater play bang is released at princeton hall

    Loading default properties from tagger ./english-bidirectional-distsim.tagger
    Reading POS tagger model from ./english-bidirectional-distsim.tagger ... done [6.0 sec].

    the_DT theater_NN play_NN bang_NN is_VBZ released_VBN at_IN princeton_NN hall_NN

Figure 4 Part of Speech tagging

Only the content words are considered, as shown in Figure 5. Each word is also brought to its root form, that is, lemmatization is performed.

    theater
    play
    bang
    releas
    princeton
    hall

Figure 5 Consideration of Content words only

A word together with its POS is checked against WordNet to see whether it has multiple senses; if the system returns multiple meanings, that is, synsets, the word is considered ambiguous. In our example, the ambiguous word is 'play'.

    play

Figure 6 Ambiguous word

As the word 'play' is a noun in our example, we need to look for noun sense ambiguity only. The noun 'play' has 19 senses. Therefore, the word 'play' is ambiguous. In our corpus, SemCor, the word 'play' occurs 156 times.

    Play 156

Figure 7 Matching in corpus

5.3 Extraction of Feature
After preprocessing of the corpus, we get sentences that do not contain any irrelevant data and whose words are in root form. Features are extracted from that data. The window size of the feature vector is [-2, +2]. Here the features are the words themselves and their parts of speech. These features help to train the classifier. The feature set is then used directly for comparison with the features of the target sentence for disambiguation. We also count the frequency of the senses present in the training set. All feature sets obtained for the training dataset are stored in the system for further use in testing.

The words are compared with the words of the feature set in each class. Before that, each feature of the training feature set is compared with the target feature set to find how many times each feature appears for a particular sense.

    Sense  POS  Count  Gloss
    1      NN   20     a dramatic work intended for performance by actors on a stage
    2      NN   7      a theatrical performance of a drama
    3      NN   8      a preset plan of action in team sports
    4      NN   5      a deliberate coordinated movement requiring dexterity and skill
    5      NN   2      a state in which action is feasible
    6      NN   1      utilization or exercise
    7      NN   1      an attempt to get something
    8      NN   1      activity by children that is guided more by imagination than by fixed rules

Figure 8 Interrelation of Corpus texts and target sentence

5.4 Classification Task
We use the Naïve Bayes algorithm, a probabilistic approach, because the approach is a hybrid one and we use a probabilistic measure. The parameters in probabilistic WSD are Pr(s), the probability of a sense, and Pr(Vwi|s), the probability of a feature with respect to a particular sense. Before the probabilities are calculated, each feature of the training feature set is compared with the target feature set. Each row below gives the sense number followed by the match counts for four features; the final value is the total count.

    Sense  f1  f2  f3  f4
    1      0   10  6   2
    2      0   5   4   2
    3      0   2   5   1
    4      0   2   1   0
    5      0   0   0   0
    6      0   1   0   1
    7      0   0   1   0
    8      0   1   0   0
    Total count: 45

Figure 9 Training and Target feature set comparison

Then the prior probabilities for each sense of the ambiguous word(s) are calculated and referred to as P(s_i):

    Pr(s) = count(s, w) / count(w)

    Sense 1: 20/45 = 0.44444445
    Sense 2: 7/45 = 0.15555556
    Sense 3: 8/45 = 0.17777778
    Sense 4: 5/45 = 0.11111111
    Sense 5: 2/45 = 0.044444446
    Sense 6: 1/45 = 0.022222223
    Sense 7: 1/45 = 0.022222223
    Sense 8: 1/45 = 0.022222223

Figure 10 Prior probabilities of the senses of the word 'play'

The next step is to calculate the probability of each feature present in a particular sense, that is, P(f_j | s_i):

    Pr(Vwi|s) = count(Vwi, s, w) / count(s, w)
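The two estimates Pr(s) and Pr(Vwi|s) and their combination into a Naïve Bayes score can be sketched in Python as follows. The counts are taken from Figures 8 and 9 for the first three senses of 'play'. How zero counts are handled is not stated in the paper; the constant factor `floor=0.1` used below is purely an assumption of this sketch, as are the function names.

```python
from typing import Dict, List, Tuple


def prior(sense_count: int, word_count: int) -> float:
    """Pr(s) = count(s, w) / count(w)."""
    return sense_count / word_count


def likelihood(feature_count: int, sense_count: int, floor: float = 0.1) -> float:
    """Pr(Vwi | s) = count(Vwi, s, w) / count(s, w).

    Zero counts are replaced by an assumed constant floor so that a single
    unseen feature does not force the whole product to zero.
    """
    if feature_count == 0:
        return floor
    return feature_count / sense_count


def score(feature_counts: List[int], sense_count: int, word_count: int) -> float:
    """Prior multiplied by the per-feature likelihoods (Naive Bayes score)."""
    result = prior(sense_count, word_count)
    for count in feature_counts:
        result *= likelihood(count, sense_count)
    return result


# sense number -> (count(s, w), match counts of the four features)
senses: Dict[int, Tuple[int, List[int]]] = {
    1: (20, [0, 10, 6, 2]),
    2: (7, [0, 5, 4, 2]),
    3: (8, [0, 2, 5, 1]),
}
word_count = 45

scores = {s: score(feats, n, word_count) for s, (n, feats) in senses.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

With these inputs the highest-scoring sense is sense 2. A standard alternative to a constant floor would be add-one (Laplace) smoothing, which changes the absolute scores but serves the same purpose.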


    Sense 1: (10/20)*(6/20)*(2/20)*(20/45) = 6.66E-4
    Sense 2: (5/7)*(4/7)*(2/7)*(7/45) = 0.0018140592
    Sense 3: (2/8)*(5/8)*(1/8)*(8/45) = 3.4722223E-4
    Sense 4: (2/5)*(1/5)*(5/45) = 8.88889E-5
    Sense 5: 1*(2/45) = 4.444444E-6
    Sense 6: (1/1)*(1/1)*(1/45) = 2.2222222E-4
    Sense 7: (1/1)*(1/45) = 2.222222E-5
    Sense 8: (1/1)*(1/45) = 2.222222E-5

Figure 11 Feature-wise probability calculation for each sense

The displayed products list only the non-zero features. The reported values are consistent with each zero-count feature in Figure 9 contributing an additional factor of 0.1; for example, for sense 2, 0.1 * (5/7)*(4/7)*(2/7)*(7/45) = 0.0018140592.

5.5 Extracting correct sense
Finally, the disambiguation process is performed using the Bayes decision rule. The disambiguation computes the score of each sense of the ambiguous word and, on the basis of the maximum score, decides the most appropriate sense for the given word in the target sentence. In this way, we get the most appropriate meaning of the ambiguous word.

    2 a theatrical performance of a drama

Figure 12 Extracting correct sense of word 'Play'

The steps mentioned above describe the disambiguation process without lexical knowledge. According to Xiaohua Zhou and Hyoil Han (2005), lexical knowledge is associative information related to the ambiguous word. Lexical knowledge is acquired automatically using the JAWS library. The count for the word 'play' and the related world knowledge, that is, the words associated with the ambiguous word 'play', are shown in Figure 13.

    play 538
    sport, frolic, fun, drama, manoeuvre, romp, turn, looseness, gambol, gaming, caper, maneuver, bid, swordplay, shimmer, gambling

Figure 13 Lexical knowledge of word 'Play'

The other steps are the same as explained above.

    Sense  POS  Count  Gloss
    1      NN   98     a dramatic work intended for performance by actors on a stage
    12     NN   22     verbal wit or mockery (often at another's expense but not to be taken seriously)
    14     NN   2      gay or light-hearted recreational activity for diversion or amusement
    15     NN   21     (game) the activity of doing something in an agreed succession
    16     NN   4      the act of playing for stakes in the hope of winning (including the payment of a price for a chance to win a prize)
    2      NN   7      a theatrical performance of a drama
    3      NN   8      a preset plan of action in team sports
    4      NN   10     a deliberate coordinated movement requiring dexterity and skill
    5      NN   2      a state in which action is feasible
    6      NN   1      utilization or exercise
    7      NN   2      an attempt to get something
    8      NN   1      activity by children that is guided more by imagination than by fixed rules

Figure 14 Interrelation of Corpus texts and target sentence

    Sense  f1  f2  f3  f4  Prior    Score
    1      0   47  36  12  0.55056  0.0011877073
    12     0   12  9   1   0.12359  1.2535985E-4
    14     0   1   0   0   0.01123  5.6179774E-6
    15     0   8   13  8   0.11797  0.0010598997
    16     0   3   0   0   0.02247  1.6853934E-5
    2      0   5   4   2   0.03932  4.5861048E-4
    3      0   2   5   1   0.04494  8.77809E-5
    4      0   4   3   2   0.05617  1.3483148E-4
    5      0   0   0   0   0.01123  1.1235954E-6
    6      0   1   0   1   0.00561  5.6179775E-5
    7      0   0   2   1   0.01123  5.6179775E-5
    8      0   1   0   0   0.00561  5.6179774E-6
    Total count: 178

Figure 15 Feature matching; Prior probabilities and Final Score

    1 a dramatic work intended for performance by actors on a stage

Figure 16 Winner Sense (including lexical knowledge)

These are the steps of the disambiguation process, and they are repeated whenever a new ambiguous word is found.

VI. PERFORMANCE MEASURES
To measure our system, we focus on two performance measures, precision and recall. Precision is the proportion of classified instances that are classified correctly, and recall is the proportion of all instances to be classified that are classified correctly. After testing 50 ambiguous words, we found that without lexical knowledge our system gives a precision of 86% (partial Brown corpus) and 78.78% (complete Brown corpus), and with lexical knowledge 70% (partial Brown corpus) and 72% (complete Brown corpus). Recall is 70% (partial Brown corpus) and 72% (complete Brown corpus) without lexical knowledge, and 70% (partial Brown corpus) and 72% (complete Brown corpus) with lexical knowledge.

VII. CONCLUSIONS
Based on our study of WSD scenarios, we make the following conclusions:

1. Considering the disadvantages of existing approaches (knowledge-based approaches require exhaustive enumeration search and knowledge resources; supervised approaches suffer from data sparseness and require a huge number of parameters to be trained; unsupervised algorithms fail to distinguish between the finer senses of an ambiguous word), an effort has been made to resolve these issues by suggesting the hybrid approach.

2. The integration of various knowledge resources into the feature set, such as part of speech, the morphological form (lemma) of the word, neighboring words (in the form of a collocation vector) and verb-noun syntactic relations, helps us obtain good classification accuracy.


3. The system achieves higher accuracy when irrelevant information is removed from the sentences and when the training data is increased.
REFERENCES
[1] Agirre, Eneko & German Rigau. 1996. "Word sense
disambiguation using conceptual density", in Proceedings of the
16th International Conference on Computational Linguistics
(COLING), Copenhagen, Denmark, 1996.
[2] Agirre, Eneko & David Martínez. 2001. Learning class-to-class
selectional preferences. Proceedings of the Conference on
Natural Language Learning, Toulouse, France, 15–22.
[3] Agirre, E. and Martinez, D. 2000. Exploring automatic word sense
disambiguation with decision lists and the web. In Proceedings of
the 18th International Conference on Computational Linguistics
(COLING), Saarbrücken, Germany, 11–19.
[4] E. Agirre and A. Soroa, “Personalizing PageRank for Word Sense
Disambiguation,” Proc. 12th Conf. European Chapter of the Assoc.
for Computational Linguistics (EACL 09), Assoc. for
Computational Linguistics, 2009, pp. 33–4.
[5] Fellbaum. WordNet: An Electronic Lexical Database. MIT Press,
Cambridge, Massachusetts, 1998.
[6] Judita Preiss, 2006: Probabilistic word sense disambiguation:
Analysis and techniques for combining knowledge sources,
Technical report, University of Cambridge, UCAM-CL-TR-673
ISSN 1476-2986, Number 673
[7] Gerard Escudero, Lluís Màrquez and German Rigau, “Naïve
Bayes and Exemplar-based approaches to Word Sense
Disambiguation Revisited”, arXiv:CS/0007011v1, 2000.
[8] Mitesh M. Khapra, Anup Kulkarni, Saurabh Sohoney, and Pushpak
Bhattacharyya. 2010. All words domain adapted WSD: Finding
middle ground between Supervision and unsupervision. In Jan
Hajic, Sandra Carberry, and Stephen Clark, editors, ACL, pages
1532-1541.
[9] Mihalcea and D.I. Moldovan. Pattern Learning and Automatic
Feature Selection for Word Sense Disambiguation. In Proceedings
of the Second international Workshop on Evaluating Word Sense
Disambiguation Systems(SENSEVAL-2), 2001.
[10] Navigli, Roberto, “Word sense disambiguation: A survey”, ACM
Computing Surveys, 41(2), ACM Press, pp. 1-69, 2009.
[11] Ping Chen and Chris Bowes, University of Houston-Downtown and
Wei Ding and Max Choly, University of Massachusetts, Boston
Word Sense Disambiguation with Automatically Acquired
Knowledge, 2012 IEEE INTELLIGENT SYSTEMS published by
the IEEE Computer Society.
[12] R. Navigli and M. Lapata, “An Experimental Study of Graph
Connectivity for Unsupervised Word Sense Disambiguation,” IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 32,no. 4,
2010, pp. 678–692.
[13] Roshan R. Karwa , M.B.Chandak "Word Sense Disambiguation:
Hybrid Approach with Annotation Up To Certain Level – A
Review", International Journal of Engineering Trends and
Technology (IJETT), V18(7),328-330 Dec 2014. ISSN:2231-5381.
www.ijettjournal.org. published by seventh sense research group.
[14] Satanjeev Banerjee, Ted Pedersen, “An adaptive Lesk Algorithm
for Word Sense Disambiguation Using WordNet”, Proceedings of
the Third International Conference on Computational Linguistics
and Intelligent Text Processing, page no: 136-145, 2002.
