LNCS 7934

Natural Language Processing and Information Systems

18th International Conference on Applications
of Natural Language to Information Systems, NLDB 2013
Salford, UK, June 2013, Proceedings
Lecture Notes in Computer Science 7934
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Elisabeth Métais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran, and Sunil Vadera (Eds.)
Natural Language Processing and Information Systems

18th International Conference on Applications
of Natural Language to Information Systems, NLDB 2013
Salford, UK, June 19-21, 2013
Proceedings
Volume Editors
Elisabeth Métais
Conservatoire National des Arts et Métiers
Paris, France
E-mail: [email protected]
Farid Meziane
Mohamad Saraee
Sunil Vadera
University of Salford
Salford, Lancashire, UK
E-mail: {f.meziane, m.saraee, s.vadera}@salford.ac.uk
Vijayan Sugumaran
Oakland University
Rochester, MI, USA
E-mail: [email protected]
CR Subject Classification (1998): I.2.7, H.3, H.2.8, I.5, J.5, I.2, I.6, J.1
This volume of Lecture Notes in Computer Science (LNCS) contains the papers presented at the 18th International Conference on Applications of Natural Language to Information Systems (NLDB 2013), held at MediaCityUK, University of Salford, during June 19-21, 2013. Since its foundation in 1995, the NLDB conference has attracted state-of-the-art presentations and has closely followed developments in the application of natural language to databases and information systems, in the wider sense of the term.
The current proceedings reflect developments in the field and encompass areas such as sentiment analysis and mining, forensic computing, the Semantic Web, and information search, in addition to more traditional topics such as requirements engineering, question answering systems, and named entity recognition. NLDB is now an established conference and attracts researchers and practitioners from all over the world. Indeed, this year's conference saw submissions of work covering a large number of natural languages, including Chinese, Japanese, Arabic, Hebrew, and Farsi.
We received 80 papers, and each paper was reviewed by at least three reviewers, with the majority receiving four or five reviews. The Conference Co-chairs and Program Committee Co-chairs held a final consultation meeting to look at all the reviews and made the final decisions on the papers to be accepted. We accepted 21 papers as long/regular papers, 15 short papers, and 17 poster presentations.
We would like to thank all the reviewers for their time and effort, and for completing their assignments on time despite tight deadlines. Many thanks to the authors for their contributions.
Conference Chairs
Elisabeth Métais Conservatoire National des Arts et Métiers,
Paris, France
Farid Meziane University of Salford, UK
Sunil Vadera University of Salford, UK
Programme Committee
Jacky Akoka CNAM, France
Frederic Andres National Institute of Informatics, Japan
Apostolos Antonacopoulos University of Salford, UK
Eric Atwell University of Leeds, UK
Abdelmajid Ben Hamadou Sfax University, Tunisia
Bettina Berendt Leuven University, Belgium
Johan Bos Groningen University, The Netherlands
Gosse Bouma Groningen University, The Netherlands
Philipp Cimiano Universität Bielefeld, Germany
Isabelle Comyn-Wattiau ESSEC, France
Walter Daelemans University of Antwerp, Belgium
Zhou Erqiang University of Electronic Science and
Technology, China
Stefan Evert University of Osnabrück, Germany
Vladimir Fomichov National Research University Higher School
of Economics, Russia
Alexander Gelbukh Mexican Academy of Science, Mexico
Jon Atle Gulla NTNU, Norway
Karin Harbusch Koblenz University, Germany
Dirk Heylen University of Twente, The Netherlands
Helmut Horacek Saarland University, Germany
Full Papers

Extraction of Statements in News for a Media Response Analysis
Thomas Scholz and Stefan Conrad

Short Papers

MOSAIC: A Cohesive Method for Orchestrating Discrete Analytics in a Distributed Model
Ransom Winder, Joseph Jubinski, John Prange, and Nathan Giles

Poster Papers

Phrase Table Combination Deficiency Analyses in Pivot-Based SMT
Yiming Cui, Conghui Zhu, Xiaoning Zhu, Tiejun Zhao, and Dequan Zheng

Analysing Customers Sentiments: An Approach to Opinion Mining and Classification of Online Hotel Reviews
Juan Sixto, Aitor Almeida, and Diego López-de-Ipiña

An Improved Discriminative Category Matching in Relation Identification
Yongliang Sun, Jing Yang, and Xin Lin

Extracting Fine-Grained Entities Based on Coordinate Graph
Qing Yang, Peng Jiang, Chunxia Zhang, and Zhendong Niu

NLP-Driven Event Semantic Ontology Modeling for Story
Chun-Ming Gao, Qiu-Mei Xie, and Xiao-Lan Wang

The Development of an Ontology for Reminiscence
Collette Curry, James O'Shea, Keeley Crockett, and Laura Brown

Chinese Sentence Analysis Based on Linguistic Entity-Relationship Model
Dechun Yin

A Dependency Graph Isomorphism for News Sentence Searching
Kim Schouten and Flavius Frasincar

Unsupervised Gazette Creation Using Information Distance
Sangameshwar Patil, Sachin Pawar, Girish K. Palshikar, Savita Bhat, and Rajiv Srivastava

A Multi-purpose Online Toolset for NLP Applications
Maciej Ogrodniczuk and Michal Lenart

A Test-Bed for Text-to-Speech-Based Pedestrian Navigation Systems
Michael Minock, Johan Mollevik, Mattias Åsander, and Marcus Karlsson

Automatic Detection of Arabic Causal Relations
Jawad Sadek

A Framework for Employee Appraisals Based on Inductive Logic Programming and Data Mining Methods
Darah Aqel and Sunil Vadera

A Method for Improving Business Intelligence Interpretation through the Use of Semantic Technology
Shane Givens, Veda Storey, and Vijayan Sugumaran
Extraction of Statements in News for a Media Response Analysis

Thomas Scholz and Stefan Conrad

1 Motivation
Fig. 1. An example news text with marked statements (the figure content is not recoverable from the source)
the most important information for the SPD (the governing party of the region Brandenburg), and both are annotated with a negative sentiment. The marked sentences are not relevant for another party; for Greenpeace, e.g., other sentences are relevant: a relevant statement would be the last two sentences of the text snippet. So, the results of an MRA depend on the analysis objects (in general, the customer of an MRA and its competitors, or, in the case of the pressrelations dataset, the German parties SPD and CDU). In this paper, we concentrate on the extraction of relevant statements, because they are essential for an MRA; moreover, other approaches show that a well-considered selection of text parts can improve Sentiment Analysis for opinion-bearing text [8] or even work with statements directly [11].
Task Definition: Let d ∈ D be a document, where D is a collection of news articles. The task is to find, for every d ∈ D, a partition P of the set of all its sentences S_d, such that P has ν elements, of which ν − 1 are relevant statements (ν is unknown before the analysis):
f_p : d \to P = \{p_1, \ldots, p_\nu\} = \{\underbrace{\{s_j, s_{j+1}, \ldots\}, \ldots, \{s_k, s_{k+1}, \ldots\}}_{\nu-1}, \{s_l, s_m, \ldots\}\} \quad (1)
p_ν contains all non-relevant sentences, and every p_i with i ∈ {1, ..., ν − 1} contains a relevant statement. A statement is a consecutive sequence of relevant sentences (a statement usually consists of up to four sentences). In general, partitions with only one element (all sentences are non-relevant) and elements with only one sentence (e.g., p = {s_i}) are possible.
As Figure 1 shows, the relevant statements are not only sentences in which certain search strings (such as 'SPD', 'Platzeck', or 'Greenpeace') appear. Sometimes coreference resolution is needed (cf. the last sentence in the first statement), but sometimes even such resolution would not help (cf. the last sentence in the second statement). In our evaluation we will show that this is often the case. Moreover, the antepenultimate sentence contains the word 'Platzeck' and is still not relevant, because it contains only additional information. So, we propose a machine learning technique which is based on significant features of relevant sentences and filters misclassified sentences by density-based clustering.
The rest of the paper is organized as follows: we discuss related work in the next section. In Section 3 we explain our machine-learning-based method for statement extraction. We evaluate and compare the results of our approach with other techniques in Section 4, before concluding in the last section.
2 Related Work
The extraction of relevant statements for an MRA is related to several areas: the automated creation of Text Summaries [1,6,7,12], Information Extraction [3,13], and Opinion Mining [8,9,11].
Automatic Text Summarization has a long history. An early approach works with coreference chains [1] to select the sentences of a summary. Turney extracts important phrases by learned rules [12], while Mihalcea and Tarau build graphs using PageRank and a similarity function between two sentences [7]. A language-independent approach for Text Summarization proposed by Litvak et al. [6] is called DegExt. The approach transforms a given text into a graph representation in which words become nodes. Within this graph, the important words are identified as nodes with high connectivity. These words are extracted as keywords of the text, and the summary consists of all sentences which contain keywords. They report better results than TextRank [7] and GenEx [12] on the benchmark corpus of summarized news articles of the 2002 DUC when extracting 15 keywords. We therefore took DegExt as one of our comparison methods.
The task of this contribution is also related to Information Extraction tasks such as the extraction of statements for market forecasts [13]. There, a statement consists of a 5-tuple of topic, geographic scope, period of time, amount of money or growth rate, and statement time, whereby the relation between time and money information is particularly important. Hong et al. [3] extract events from sentences. Their event extraction covers the determination of the type of the event, its participants, and their roles. Neither these definitions of statements/events nor the corresponding methods fit our task.
In the field of Opinion Mining, the identification of Opinion Holders is an important task [4]. In an MRA, however, we already know the objects of an analysis (e.g., organisations or persons). Still, the automatic extraction of statements is very interesting for Opinion Mining tasks [10], for classifying the tonality [11] as well as the viewpoints of extracted statements [9]. The approach of Sarvabhotla et al. [8], called RSUMM (Review Summary), creates summaries of reviews for Opinion Mining tasks. They weight sentences by the importance of the words they contain and by subjectivity, and in this way select the most important and subjective sentences for their subjective excerpt [8]. We apply two variants of this approach in our evaluation.
In the news domain, the MPQA corpus [15] is a very important test corpus for Opinion Mining. Unfortunately, it contains no extracted statements, because it is not designed for an MRA. The pressrelations dataset [10] is a publicly available corpus for an MRA. It contains 617 news articles with 1,521 statements for the two biggest political parties in Germany. Overall, the articles include 15,089 sentences, of which 3,283 are relevant for the two parties. This dataset is part of our evaluation. To evaluate our approaches, we use metrics from the Text Summarization area, because this field has several things in common with our task. Lin [5] proposes widely acknowledged metrics to estimate the quality of text summaries; we use the ROUGE-L score to determine the quality of the extracted statements.
As shown in the examples (Figures 1 and 2), statements are not just consecutive sentences or whole paragraphs that contain certain search strings such as the name of a person or a party. In Figure 2, the last sentence of each statement does not contain a keyword such as 'SPD'.
Fig. 2. A second example of marked statements (the figure content is not recoverable from the source)
are a product and a (subsidiary) company, e.g.). In an MRA, some people are of particular importance because they are press spokespersons of a relevant organisation (the customer's company or a competitor) or because they serve as an advertising medium. Media analysts collect lists of these entities in so-called codebooks [9], because it is very difficult for humans to remember all relevant persons and organisations.
For Named Entity Recognition (NER), we apply GATE (General Architecture for Text Engineering) to extract the persons and organisations in the text. We have designed new JAPE rules to improve our NER: the new rules handle all important entities from our codebook with the highest priority, which ensures that these entities are found with a very high probability. Furthermore, we improved the coreference resolution by adding a German pronominal coreferencer. We divided our entity list into three parts: female persons, male persons, and neuter entities. In this way, we obtained the gender information for our NER. For part-of-speech tagging and lemmatisation, we use the TreeTagger.
Table 1. Feature set for our SVM classifier and our density-based clustering (the table body is not recoverable from the source)
After the classification of each sentence, we select all sentences which are classified as relevant. For every such sentence, we count the frequency of every word in the sentence and use these frequencies as input features (cf. Table 1) for a DBSCAN clustering [2]. We set the parameter Eps [2] (the radius of a neighbourhood) to 1.0, and MinPts [2] (the minimum number of points in an Eps-neighbourhood) to 2. This ensures that the clusters are very similar and, at the same time, that a similar misclassification occurs at least three times.
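The filtering step can be sketched with an off-the-shelf DBSCAN implementation; the snippet below assumes scikit-learn and plain bag-of-words vectors, neither of which is named in the paper.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import CountVectorizer

def noise_sentences(relevant_sentences):
    """Return the sentences that DBSCAN leaves unclustered (label -1).

    Following the paper, clusters collect recurring misclassifications,
    while genuinely relevant sentences tend to remain noise.
    """
    # Word frequencies of each sentence as input features (cf. Table 1).
    vectors = CountVectorizer().fit_transform(relevant_sentences)
    # Eps = 1.0 and MinPts = 2 as in the paper; note that scikit-learn's
    # min_samples counts the query point itself, so the exact mapping to
    # the original MinPts definition is an assumption here.
    labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(vectors)
    return [s for s, label in zip(relevant_sentences, labels) if label == -1]
```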
In a clustering approach, the clusters are usually the objects of interest, because they identify objects which share commonalities. But statements represent many different pieces of information and opinions across a large document corpus [14]. So, our approach works the other way round and filters out clusters of non-relevant sentences: really relevant sentences tend to be noise, whereas the same classification mistakes appear several times and thus form clusters. Thereby, we use only sentences which are noise from a clustering perspective (cf. next section). Since only the sentences classified as relevant are used for our clustering, computation time is saved when performing the clustering.
Our technique combines sentences which are classified as relevant by our SVM and which do not belong to any cluster in the DBSCAN clustering. The input parameters of the algorithm are the set of all sentences, the trained classification model, and the computed clustering model. The method combine takes two consecutive statements and appends the second one to the first. R contains all p_i with i ∈ {1, ..., ν − 1}, and p_ν contains all sentences which are not part of any element of R.
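A plausible realization of the described procedure is sketched below, with classify_relevant and is_noise standing in for the trained SVM and the DBSCAN noise test (both names are illustrative).

```python
def extract_statements(sentences, classify_relevant, is_noise):
    """Group consecutive sentences that the SVM classifies as relevant
    and that the clustering step leaves as noise into statements."""
    statements, current = [], []
    for s in sentences:
        if classify_relevant(s) and is_noise(s):
            current.append(s)           # extend the running statement
        elif current:
            statements.append(current)  # a non-relevant sentence closes it
            current = []
    if current:
        statements.append(current)
    # 'statements' corresponds to p_1, ..., p_{nu-1}; all remaining
    # sentences together form p_nu.
    return statements
```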
4 Evaluation
4.1 Experimental Setup
The Text Summarization method DegExt [6] is largely language-independent, because the only required NLP resource is a tokenizer. DegExt allows choosing the number of keywords (referred to as N) and, as a consequence, the size of the summaries. We test several values for N, because the experiments of Litvak et al. show that the choice of N is important for the quality of the results [6]. Consecutive sentences of a summary are combined into a statement.
We evaluate the RSUMM method [8] in two variants. The 'classical' method (denoted as RSUMM X%) computes the lexical similarity between each sentence, represented as a vector, and the vectors of the most important words and of the most subjective terms, respectively [8]. A final score is computed by adding the Jaccard similarities of both [8], and the top X% of sentences with the highest scores are selected. We use 20% of our training examples to create the vectors adf (average document frequency) and asm (average subjective measure) [8].
As a second variant, we use both RSUMM scores as input values for a classifier (denoted as RSUMM(+SVM)) and classify every sentence. Sarvabhotla et al. use the SVMlight package (http://svmlight.joachims.org/), so we apply this learner. But we obtain a very low accuracy (e.g., 16.43% when using 50% of the data for training), because the classifier tends to label every sentence as relevant. As a consequence, we use the SVM of our own technique, which achieved better results (cf. Section 4.2).
As two further baselines, we construct simple bags of words for every sentence and classify the sentences with our classifier (denoted as TSF-Matrix 5%, where TSF stands for term-sentence frequency and 5% is the size of the training data). Likewise, we use only the extracted coreference chains of our important entities to identify statements (denoted as Coreference Chains): if an element of a chain of an important entity appears in a sentence, the sentence is relevant, and consecutive relevant sentences are combined into statements.
We test the methods on two datasets: the pressrelations dataset [10], which has 617 articles with 1,521 gold statements, and our own dataset of 5,000 articles from an MRA about a financial service provider and 4 competitors (called Finance), whose articles include 7,498 statements. The codebook for the Finance dataset includes 384 persons, 19 organisations, and 10 products, while the codebook for the pressrelations dataset contains 386 persons (all party members of the 17th German Bundestag, the German parliament; collected from http://www.bundestag.de) and 18 entries for organisations (names and synonyms of the parties, and concepts such as 'government' or 'opposition' [9]). The same codebooks are used in [9].
4.2 Results
For the step of learning relevant sentences, Tables 2 and 3 show the results for classifying single sentences as relevant or not. As the tables show, our classifier needs only very limited training data (5% or 0.5%, respectively) to obtain good results (there is nearly no difference between using 15% or 5% of the pressrelations dataset). On Finance, the classifier requires even less data for good results.
The results show that it is more difficult to identify the relevant sentences, while precision and recall for non-relevant examples are very high. One reason is, of course, the unequal distribution of the two classes: Finance includes 13,084 relevant and 145,219 non-relevant sentences. However, the tables show that our method achieves better results at the sentence level than RSUMM (+SVM). For our further experiments, we use only 5% of the pressrelations dataset and 0.5% of the Finance dataset for training, because these values achieve good results and, for a practical solution, a technique should require as little training data as possible.
Here, we measure how many statements match the annotated statements of the two datasets (denoted as Gold Standard Match). In addition, we use the ROUGE-L score [5], which is based on the idea that two summaries are similar if their longest common subsequence (LCS) [5] is large.
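Following [5], with X a reference statement of length m, Y an extracted statement of length n, and β a weighting parameter, the standard ROUGE-L measures are:

R_{lcs} = \frac{LCS(X,Y)}{m}, \qquad P_{lcs} = \frac{LCS(X,Y)}{n}, \qquad F_{lcs} = \frac{(1+\beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2\, P_{lcs}}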
methods) tend to extract more statements than the number of gold statements?
Two media analysts examined all extracted statements on the pressrelations dataset in a blind study (they did not know which extraction method produced a statement) and reviewed every extracted statement: a statement is correct when it is relevant (for the analysis objects) and a tonality [10] with a viewpoint [9] can be estimated. We use all methods with their best parameters (based on F_lcs). The findings are depicted in Table 6. Here, our F-score is almost 24% higher than that of the second-best approach (RSUMM (20%)). This analysis shows that the approach extracts many more relevant statements which are not part of the gold annotation. There are several reasons for this: in an MRA [14], sometimes only a number of top-N statements are used. So, besides the gold statements which are found exactly or partially, the machine-based approaches find further statements which are less important but nevertheless adequate. Furthermore, many of these statements are neutral, so they were not all annotated, because too many neutral statements may dilute the tonality in a practical analysis.
5 Conclusion
Our approach outperforms all comparison methods on both datasets. The find-
ings point out that the extraction of statements for a MRA could not be solved
only by Text Summarization. Furthermore, our evaluation shows that our tech-
nique can find many adequate statements. On the one hand, this approach can
be utilized to help media analysts who could save time by extracting relevant
statements. And on the other hand, our method closes a gap in an automated
approach for a MRA, because the combination of this approach, the classifica-
tion of the tonality [10,11] and the determination of perspectives [9] represents
a fully automated generation of analysis data for a MRA.
References
1. Azzam, S., Humphreys, K., Gaizauskas, R.: Using coreference chains for text summarization. In: Proc. of the Workshop on Coreference and its Applications, CorefApp 1999, pp. 77–84 (1999)
2. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. of the 2nd Intl. Conf. on Knowledge Discovery and Data Mining (KDD 1996), pp. 226–231 (1996)
3. Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., Zhu, Q.: Using cross-entity inference to improve event extraction. In: Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, HLT 2011, vol. 1, pp. 1127–1136 (2011)
4. Kim, S.-M., Hovy, E.: Extracting opinions, opinion holders, and topics expressed in online news media text. In: Proc. of the Workshop on Sentiment and Subjectivity in Text, SST 2006, pp. 1–8 (2006)
5. Lin, C.-Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proc. of the ACL 2004 Workshop, pp. 74–81. Association for Computational Linguistics (2004)
6. Litvak, M., Last, M., Aizenman, H., Gobits, I., Kandel, A.: DegExt - a language-independent graph-based keyphrase extractor. In: Proc. of the 7th Atlantic Web Intelligence Conference (AWIC 2011), pp. 121–130 (2011)
7. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proc. of the 2004 Conf. on Empirical Methods in Natural Language Processing, EMNLP 2004 (2004)
8. Sarvabhotla, K., Pingali, P., Varma, V.: Sentiment classification: a lexical similarity based approach for extracting subjectivity in documents. Inf. Retr. 14(3), 337–353 (2011)
9. Scholz, T., Conrad, S.: Integrating viewpoints into newspaper opinion mining for a media response analysis. In: Proc. of the 11th Conf. on Natural Language Processing, KONVENS 2012 (2012)
10. Scholz, T., Conrad, S., Hillekamps, L.: Opinion mining on a German corpus of a media response analysis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS, vol. 7499, pp. 39–46. Springer, Heidelberg (2012)
11. Scholz, T., Conrad, S., Wolters, I.: Comparing different methods for opinion mining in newspaper articles. In: Bouma, G., Ittoo, A., Métais, E., Wortmann, H. (eds.) NLDB 2012. LNCS, vol. 7337, pp. 259–264. Springer, Heidelberg (2012)
12. Turney, P.D.: Learning algorithms for keyphrase extraction. Information Retrieval 2(4), 303–336 (2000)
13. Wachsmuth, H., Prettenhofer, P., Stein, B.: Efficient statement identification for automatic market forecasting. In: Proc. of the 23rd International Conference on Computational Linguistics, COLING 2010, pp. 1128–1136 (2010)
14. Watson, T., Noble, P.: Evaluating Public Relations: A Best Practice Guide to Public Relations Planning, Research & Evaluation. PR in Practice Series, ch. 6, pp. 107–138. Kogan Page (2007)
15. Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39(2-3), 165–210 (2005)
Sentiment-Based Ranking of Blog Posts
Using Rhetorical Structure Theory
1 Introduction
Social networks and blogs have rapidly emerged as leading sources of opinions on the Web. These repositories of opinions have become one of the most effective ways to influence people's decisions. In fact, companies are aware of the power of social media, and most enterprises try to monitor their reputation on Twitter, blogs, etc., to infer what people think about their products and to get early warnings about reputation issues. In this paper, we focus on one of the most important sources of opinions in social media, i.e., the blogosphere [1]. In this
scenario, classical information retrieval (IR) techniques are not enough to build an effective system that deals with the opinionated nature of these new sources of information. To mine opinions from blogs, we need to design methodologies for detecting opinions and determining their polarity [2].
In recent years, several works have addressed the detection of opinions in blog posts [1]. Currently, the most popular approach is to consider this mining task as a two-stage process involving a topic retrieval stage (i.e., retrieving relevant posts given a user query) and a re-ranking stage that takes opinion-based features into account [3]. The second stage can be subdivided into two subtasks: an opinion-finding task, whose aim is to find opinionated blog posts related to the query, and a subsequent polarity task to identify the orientation of a blog post with respect to the topic (e.g., positive or negative). For polarity estimation, researchers often apply naive methods (e.g., classifiers based on the frequency of positive/negative terms) [4]. Polarity estimation is a really challenging task with many unresolved issues (e.g., irony, conflicting opinions, etc.). We argue that this difficult estimation problem cannot be solved with regular matching (or count-based) techniques alone. In fact, most lexicon-based polarity classification techniques fail to retrieve more positive/negative documents than baselines without polarity capabilities [3].
This phenomenon arises because the polarity of a document is conveyed not so much by the sentiment-carrying words that people use, but rather by the way in which these words are used. The rhetorical roles of text segments and their relative importance should be accounted for when determining the overall sentiment of a text (e.g., an explanation may contribute differently to the overall sentiment than a contrasting text segment does) [5]. Rhetorical Structure Theory (RST) [6] is a linguistic method for describing natural text, characterizing its structure primarily in terms of relations that hold between parts of the text. Rhetorical relations (e.g., explanation or contrast) are very important for text understanding, because they convey how the parts of a text relate to each other to form a coherent discourse.
Accounting for the rhetorical roles of text segments by means of an RST-based analysis has proven useful for classifying the overall document-level polarity of a limited set of movie reviews [5]. As this success comes at a cost of computational complexity, applying an RST-based analysis to large-scale polarity ranking tasks in IR is challenging. In this paper, we study how to utilize RST in a large-scale polarity ranking task and how RST helps to understand the sentiment expressed by bloggers. More specifically, we aim to identify the rhetorical relations that give good guidance for understanding the sentiment conveyed by blog posts, and to quantify the advantage of exploiting these relations. We also compare our RST-based methods with conventional approaches for large-scale polarity ranking of blog posts.
In the blogosphere, the presence of spam, off-topic information, or relevant but non-opinionated information introduces noise, and this is a major issue that harms the effectiveness of opinion-finding techniques. Therefore, it would not be wise to apply RST to entire blog posts. We build on recent advances in extracting key opinionated sentences for polarity estimation in blog posts [4] and analyse the discourse structure only of selected passages. This helps avoid noisy chunks of text and is also convenient from a computational-complexity perspective, because discourse processing is not lightweight.
2 Method
First, we present the methods to find relevant polar sentences in a blog post. Then, we show how to perform rhetorical analysis over these key evaluative sentences in order to determine the relations between the different spans of text. Finally, we define the overall orientation of a blog post as positive (resp. negative) according to these key evaluative sentences. To this end, we take into account the information provided by the rhetorical relations.
negative) terms tagged in the sentence S, divided by the total number of terms in S. β ∈ [0, 1] is a free parameter.
Different aggregation methods were considered in [4] to compute the final polarity of a blog post from its sentence-level scores, including the average score of all polar sentences, of the first or last k polar sentences, and of the k sentences with the highest pol(S, Q). The last method, PolMeanBestN, was shown to be very robust and, overall, to give the best estimation of the polarity of a blog post. Therefore, in this paper, we use this approach to extract the key sentences that are fed into the RST module. The best configuration obtained in [4] for PolMeanBestN is k = 1, which means that we select just one sentence to estimate the overall polarity of a blog post.
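A minimal sketch of this selection step, assuming that per-sentence polarity scores pol(S, Q) have already been computed (all names are illustrative):

```python
def pol_mean_best_n(sentences, pol_scores, k=1):
    """Average the k sentences with the highest pol(S, Q); with the best
    configuration k = 1 this reduces to picking the single most polar
    on-topic sentence of the post."""
    ranked = sorted(zip(pol_scores, range(len(sentences))), reverse=True)[:k]
    best = [sentences[i] for _, i in ranked]
    score = sum(s for s, _ in ranked) / k
    return score, best
```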
Given an initial list of documents ranked by decreasing relevance score (rel_norm(D, Q)), we re-rank the list to promote on-topic blog posts that are positively (resp. negatively) opinionated.
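A natural instantiation, assuming a linear interpolation parameter γ ∈ [0, 1] and a normalized polarity score pol_norm(D, Q) derived from the selected key sentences (both names are assumptions here, as the exact formula is not shown), is:

score(D, Q) = \gamma \cdot rel_{norm}(D, Q) + (1 - \gamma) \cdot pol_{norm}(D, Q)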
sentiment with respect to a book ("the book is horrible"). The other segment is a satellite with contrasting information with respect to the nucleus, admitting some positive aspects of the book ("Although I like the characters"). For a human reader, the polarity of this sentence is clearly negative, as the overall message has a negative polarity. However, in a classical (word-counting) sentiment analysis approach, all words would contribute equally to the total sentiment, thus yielding a verdict of neutral or, at best, mixed polarity. Exploiting the information contained in the RST structure could result in the nucleus being given a higher weight than the satellite, thus shifting the focus to the nucleus segment and yielding a more reliable sentiment score. As such, in order to exploit the rhetorical relations imposed upon natural language text by an RST analysis, the distinct rhetorical roles of individual text segments should be treated differently when aggregating the sentiment conveyed by these segments. This can be accomplished by assigning different weights to distinct rhetorical roles, quantifying their contribution to the overall sentiment conveyed by a text [5].
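To make this concrete, the sketch below aggregates lexicon-based segment sentiment with role-dependent weights; the lexicon, the RST segmentation, and the example weights are assumptions for illustration (cf. the learned weights in Table 3):

```python
# Illustrative role weights; a negative weight flips the contribution of
# contrasting satellites, as motivated by the book-review example.
ROLE_WEIGHTS = {"nucleus": 1.0, "elaboration": 2.0, "contrast": -1.2}

def segment_sentiment(words, lexicon):
    """Sum of word-level sentiment scores from a polarity lexicon."""
    return sum(lexicon.get(w.lower(), 0.0) for w in words)

def sentence_sentiment(segments, lexicon):
    """segments: (rhetorical_role, words) pairs from an RST parse."""
    return sum(ROLE_WEIGHTS.get(role, 1.0) * segment_sentiment(words, lexicon)
               for role, words in segments)

# "Although I like the characters, the book is horrible." now comes out
# clearly negative: 1.0 * (-1.0) + (-1.2) * (+1.0) = -2.2
lexicon = {"like": 1.0, "horrible": -1.0}
segments = [("contrast", "Although I like the characters".split()),
            ("nucleus", "the book is horrible".split())]
print(sentence_sentiment(segments, lexicon))  # -> -2.2
```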
3 Experiments
In this section, we describe the experiments designed to determine the usefulness of RST in a large-scale multi-topic domain. Concretely, we work with the
(Footnote 5: The SPADE software takes on average 3 seconds to process each sentence on a regular desktop machine.)
Relation: Description
attribution: Clauses containing reporting verbs or cognitive predicates related to reported messages presented in nuclei.
background: Information helping a reader to sufficiently comprehend matters presented in nuclei.
cause: An event leading to a result presented in the nucleus.
comparison: Clauses presenting matters which are examined along with matters presented in nuclei in order to establish similarities and dissimilarities.
condition: Hypothetical, future, or otherwise unrealized situations, the realization of which influences the realization of nucleus matters.
consequence: Information on the effects of events presented in nuclei.
contrast: Situations juxtaposed to situations in nuclei, where the juxtaposed situations are considered the same in many respects, yet differ in a few respects, and are compared with respect to one or more differences.
elaboration: Rhetorical elements containing additional detail about matters presented in nuclei.
enablement: Rhetorical elements containing information increasing a reader's potential ability to perform actions presented in nuclei.
evaluation: An evaluative comment about the situation presented in the associated nucleus.
explanation: Justifications or reasons for situations presented in nuclei.
joint: No specific relation is assumed to hold with the matters presented in the associated nucleus.
otherwise: A situation whose realization is prevented by the realization of the situation presented in the nucleus.
temporal: Clauses describing events with a specific ordering in time with respect to events described in nuclei.
BLOGS06 text collection [15], which is one of the most renowned blog test collections with relevance, subjectivity, and polarity assessments.
We have built a realistic and chronologically organised query dataset with the topics provided by TREC. We have optimised the parameters of our methods (e.g., satellite weights) on the TREC 2006 and TREC 2007 topics, while using the TREC 2008 topics as the testing set. Two different training-testing processes focused on maximising MAP have been run, i.e., one for positive polarity retrieval and another for negative polarity retrieval. To train all the parameters of our models (including the satellite weights) we have used Particle Swarm Optimisation (PSO), which has shown its merits for automatically tuning the parameters of IR methods [16].
3.4 Results
Table 2 shows the results of our polarity approaches. Each run is evaluated in
terms of its ability to retrieve positive (resp. negative) documents higher up
in the ranking. The best value in each column for each baseline is underlined.
Statistical significance is assessed using the paired t-test at the 95% level. The
symbols and indicate a significant improvement or decrease over the corre-
sponding baseline. To specifically measure the benefits of RST techniques in the
estimation of a ranking of positive (resp. negative) blog posts we compare its per-
formance against the performance achieved by a very effective method for blog
polarity estimation (PolMeanBestN [4], presented in Section 2 ). PolMeanBestN
estimates the overall recommendation of a blog post by taking into account the
on-topic sentence in the blog post that has the highest polarity score (e.g., the
(Footnote 6: The baselines were selected by TREC from the runs submitted to the initial ad-hoc retrieval task in the TREC blog track.)
Table 2. Polarity results: mean average precision (MAP) and precision at 10 (P10) for positive and negative rankings of blog posts. Significant improvements (decreases) over the original baselines provided by TREC and over the polMeanBestN method are marked.

                     Negative          Positive
                     MAP      P10      MAP      P10
baseline1            .2402    .2960    .2662    .3680
+polMeanBestN        .2408    .3000    .2698    .3720
+polMeanBestN(RST)   .2516    .3180    .2733    .3740
baseline2            .2165    .2780    .2390    .3340
+polMeanBestN        .2222    .2820    .2368    .3160
+polMeanBestN(RST)   .2261    .3100    .2423    .3560
baseline3            .2488    .2840    .2758    .3500
+polMeanBestN        .2524    .2760    .2755    .3420
+polMeanBestN(RST)   .2584    .2820    .2770    .3380
baseline4            .2636    .2740    .2731    .3580
+polMeanBestN        .2730    .2840    .2705    .3500
+polMeanBestN(RST)   .2825    .3240    .2716    .3620
baseline5            .2238    .3000    .2390    .3600
+polMeanBestN        .2279    .3120    .2404    .3580
+polMeanBestN(RST)   .2393    .3420    .2786    .4380
Table 3. Optimised weights for RST relation types, trained with PSO over positive and negative rankings, and the percentage of presence of each relation in the training data.

                 Positive               Negative
Relation         % Presence   Weight    % Presence   Weight
attribution      .183          0.531    .177          2.000
background       .034         -0.219    .038         -2.000
cause            .009          1.218    .009         -0.011
comparison       .003         -1.219    .003         -2.000
condition        .029         -0.886    .025         -2.000
consequence      .001          0.846    .001          1.530
contrast         .016         -1.232    .017         -2.000
elaboration      .207          2.000    .219          2.000
enablement       .038          2.000    .038          1.221
evaluation       .001          0.939    .001         -2.000
explanation      .007          2.000    .008          2.000
joint            .009         -1.583    .010          1.880
otherwise        .001         -1.494    .001         -0.428
temporal         .003         -2.000    .003         -0.448
and the attribution relation. For both positive and negative documents, satellite segments elaborating on matters presented in nuclei are typically assigned relatively high weights, exceeding those assigned to nuclei. Bloggers may therefore tend to express their sentiment more apparently in elaborating segments than in the core of the text itself. A similar pattern emerges for attributing satellites as well as for persuasive text segments, i.e., those involved in enablement relations, albeit to a more limited extent (lower frequency of occurrence). Interestingly, however, the information in attributing satellites appears to be more important in negative documents than in positive ones. Another important observation is that the sentiment conveyed by elements in contrast satellites receives a negative weight. This makes it possible to appropriately estimate the polarity of sentences such as the one introduced in Section 2 ("Although I like the characters, the book is horrible.").
5 Conclusions
In this paper we have taken the first steps towards studying the usefulness of RST-based polarity analysis in the blogosphere. We found that the use of discourse structure significantly improves polarity detection in blogs. We have applied an effective and efficient strategy to select and analyse key opinion sentences in a blog post, and we have found some trends in the way people express their opinions in blogs. Concretely, there is a clear predominance of the attribution and elaboration rhetorical relations: bloggers tend to express their sentiment more apparently in elaborating and attributing text segments than in the core of the text itself.
Finally, most of the methods proposed in this work are based on a simple combination of scores. As future work, we would like to study more formal combination methods. Related to this, we are also interested in more refined representations of rhetorical relations (e.g., LMs [21]). Another issue is that we use only one sentence to evaluate the polarity of a blog post; under these conditions the benefits of applying rhetorical relations have some limitations (e.g., the selected sentence may not be a good representative of the blog post). In the near future, we plan to explore the benefits of discourse structure while taking more sentences into account in our analysis. Related to this, one of the core problems arising from the use of RST is the processing time required to identify discourse structure in natural language text. Therefore, we would like to explore more efficient methods for identifying the discourse structure of texts.
References
1. Santos, R.L.T., Macdonald, C., McCreadie, R., Ounis, I., Soboroff, I.: Information retrieval on the blogosphere. Found. Trends Inf. Retr. 6(1), 1–125 (2012)
2. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), 1–135 (2007)
3. Ounis, I., Macdonald, C., Soboroff, I.: Overview of the TREC 2008 blog track. In: Proc. of the 17th Text REtrieval Conference, TREC 2008. NIST (2008)
4. Chenlo, J.M., Losada, D.: Effective and efficient polarity estimation in blogs based on sentence-level evidence. In: Proc. 20th ACM Int. Conf. on Information and Knowledge Management, CIKM 2011, Glasgow, UK, pp. 365–374 (2011)
5. Heerschop, B., Goossen, F., Hogenboom, A., Frasincar, F., Kaymak, U., de Jong, F.: Polarity analysis of texts using discourse structure. In: Proc. 20th ACM Int. Conf. on Information and Knowledge Management, CIKM 2011, Glasgow, UK, pp. 1061–1070 (2011)
6. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: Toward a functional theory of text organization. Text 8(3), 243–281 (1988)
7. Gerani, S., Carman, M.J., Crestani, F.: Proximity-based opinion retrieval. In: Proc. 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 403–410. ACM, New York (2010)
8. Santos, R.L.T., He, B., Macdonald, C., Ounis, I.: Integrating proximity to subjective sentences for blog opinion retrieval. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 325–336. Springer, Heidelberg (2009)
9. He, B., Macdonald, C., He, J., Ounis, I.: An effective statistical approach to blog post opinion retrieval. In: Proc. 17th ACM Int. Conf. on Information and Knowledge Management, CIKM 2008, pp. 1063–1072. ACM, New York (2008)
10. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level sentiment analysis. In: Proc. Conf. on Human Language Technology and Empirical Methods in Natural Language Processing, HLT 2005, pp. 347–354. ACL (2005)
11. He, B., Macdonald, C., Ounis, I.: Ranking opinionated blog posts using OpinionFinder. In: SIGIR, pp. 727–728 (2008)
12. Robertson, S.: How Okapi came to TREC. In: Voorhees, E.M., Harman, D.K. (eds.) TREC: Experiments and Evaluation in Information Retrieval, pp. 287–299 (2005)
13. Soricut, R., Marcu, D.: Sentence level discourse parsing using syntactic and lexical information. In: Proc. 2003 Conf. of the North American Chapter of the ACL on Human Language Technology, NAACL 2003, vol. 1, pp. 149–156. ACL, Stroudsburg (2003)
14. Carlson, L., Marcu, D., Okurowski, M.E.: Building a discourse-tagged corpus in the framework of rhetorical structure theory. In: Proc. 2nd SIGdial Workshop on Discourse and Dialogue, SIGDIAL 2001, vol. 16, pp. 1–10. ACL (2001)
15. Macdonald, C., Ounis, I.: The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, Department of Computing Science, University of Glasgow (2006)
16. Parapar, J., Vidal, M., Santos, J.: Finding the best parameter setting: Particle swarm optimisation. In: 2nd Spanish Conf. on IR, CERI 2012, pp. 49–60 (2012)
17. Pang, B., Lee, L.: A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In: Proc. of the ACL, pp. 271–278 (2004)
18. Zirn, C., Niepert, M., Stuckenschmidt, H., Strube, M.: Fine-grained sentiment analysis with structural features. In: Asian Federation of Natural Language Processing, vol. 12 (2011)
19. Somasundaran, S., Namata, G., Wiebe, J., Getoor, L.: Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification. In: Proc. 2009 Conf. on Empirical Methods in Natural Language Processing, EMNLP 2009, vol. 1, pp. 170–179. ACL (2009)
20. Zhou, L., Li, B., Gao, W., Wei, Z., Wong, K.F.: Unsupervised discovery of discourse relations for eliminating intra-sentence polarity ambiguities. In: Proc. Conf. on Empirical Methods in Natural Language Processing, EMNLP 2011, pp. 162–171. ACL, Stroudsburg (2011)
21. Lioma, C., Larsen, B., Lu, W.: Rhetorical relations for information retrieval. In: Proc. 35th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR 2012, pp. 931–940. ACM, New York (2012)
Automatic Detection of Ambiguous Terminology
for Software Requirements
Yue Wang, Irene L. Manotas Gutiérrez, Kristina Winbladh, and Hui Fang
1 Introduction
A Software Requirements Specification (SRS) describes the required behaviour of a
software product, and is often specified as a set of necessary requirements for project
development. An ideal SRS should clearly state the requirements without introducing
any ambiguities. Unfortunately, it is impossible to avoid the ambiguous SRSs since they
are often described using natural languages.
A requirement is ambiguous if it can be interpreted in multiple ways. Ambiguous
requirements can be a major problem in software development [4]. Project partici-
pants tend to subconsciously disambiguate requirements based on their own under-
standing without realizing that they are ambiguous. As a result, different interpreta-
tions often remain undiscovered until later stages of the software life-cycle, when de-
sign and implementation choices materialize the specific interpretations. It costs 50-200
times as much to correct an error late in a software project compared to when it was
introduced [3].
One possible way of preventing ambiguous requirements is manual inspection [17], which is clearly time-consuming and error-prone. Consequently, it is important to study how to automatically detect ambiguous requirements in software requirements specifications.
Establishing a consistent usage of terminology early on in a project is imperative, as it provides a vocabulary for the project and can greatly reduce misunderstandings.
Fig. 1. Overloaded vs. synonymous ambiguity: mappings between concepts and meanings (the figure content is not recoverable from the source)
We propose to formulate the problem as a ranking problem that ranks all the important concepts from an SRS by their ambiguity scores. The ranked list of concepts is expected to help requirements engineers identify ambiguous concepts more efficiently and revise the SRS accordingly. One advantage of formulating the problem this way is that requirements engineers can decide how many concepts they want to go through based on their own situation. For example, some engineers may want to catch all ambiguous concepts, while others may only have limited time to correct the most ambiguous ones. Once the ambiguous concepts are identified and rephrased, the SRS will have higher quality and can be better used in the subsequent stages of the project.
Specifically, we propose two feature-based methods that rank the concepts based on their overloaded and synonymous ambiguities, respectively. Experiments are conducted over four data sets with real-world SRSs. These data sets cover different types and scales of software systems. The results show that the proposed methods are effective in detecting both overloaded concepts and synonyms.
2 Related Work
3 Problem Formulation
An SRS is ambiguous if it can be interpreted in more than one way [2]. There are many different types of ambiguities; here we focus on lexical ambiguities. Lexical ambiguities can be classified into overloaded ambiguity and synonymous ambiguity [25], as shown in Figure 1. We define an overloaded ambiguity as a concept that has lost its specificity in the particular document. For example, consider the concepts user, guest user, and verified user in an SRS. In places where only user is used, a reader may not be able to distinguish which kind of user is intended. In contrast to overloaded ambiguity, synonymous ambiguity arises when multiple concepts refer to the same semantic meaning. For instance, in the SRS of a testing gateway system, the concepts system and testing gateway both refer to the system to be developed. As a result, requirements engineers could use both concepts in the SRS without realizing the potential for conflicts and misunderstandings.
To detect ambiguous concepts in an SRS collection, we first use the C-value method [24,7] to extract candidate concepts, and then rank the extracted concepts or concept pairs by their degree of ambiguity. In particular, for overloaded ambiguity detection, concepts are ranked by the likelihood that a concept has multiple interpretations, while for synonymous ambiguity detection, concept pairs are ranked by the likelihood that they represent the same meaning. The ranked lists are expected to help requirements engineers focus on the concepts that are most likely to be ambiguous, so that they can quickly identify the places that need clarification.
The key challenge is how to estimate the ambiguity score of a concept or a concept pair. We focus on identifying useful features for each type of ambiguity. For overloaded ambiguities, the features are mostly related to the context of a concept, i.e., the words that occur before and after the concept in the same sentence. For synonymous ambiguity detection, the features are based not only on context but also on patterns and on the content of the candidate pairs. With the identified features, we then propose a possible solution to combine them and learn the ambiguity scores for the concepts or concept pairs. Details are provided in the following sections.
– Concept Frequency: Given a concept, this feature computes the frequency of the concept in all the SRSs. The intuition is that a concept is more likely to cause an overloaded ambiguity when it occurs more frequently in the collection.
– Context Diversity: For a given concept, this feature measures how diversified its contexts are. We define a context of a concept as the set of words that occur in the same sentence as the concept. If the concept is overloaded, its contexts should cover the different meanings of the sub-layer entities, so the diversity score should be high. If, on the other hand, the entity that the concept refers to is consistent across contexts, the context diversity should be low. The context diversity score of a concept is computed as the inverse of the average cosine similarity among all its contexts (see the sketch after this list).
– Number of Clusters in the Context: Clustering is one possible way of partitioning the contexts of a concept into groups with similar meanings. Thus, the number of clusters can be a good indicator of the degree of ambiguity of the concept. In this paper, we use hierarchical agglomerative clustering (HAC) [12]. During the training stage of our experiments, we tried the single-link, complete-link, and centroid HAC algorithms; the results suggested that the single-link algorithm consistently outperformed the others, so it was chosen as the final method. We keep grouping similar contexts together until the stopping criterion is reached, i.e., when the minimum similarity between groups falls below a similarity boundary.
– Inter-Cluster Distance: This feature measures the average distance among the different clusters. The intuition is that when a concept is ambiguous, its context clusters cover different information, which leads to a higher inter-cluster distance. The distance is computed as the inverse of the similarity, which can be computed as the cosine similarity based on the context.
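A minimal sketch of the context diversity feature, assuming bag-of-words contexts and scikit-learn for vectorization (the smoothing constant is an assumption to avoid division by zero):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def context_diversity(contexts):
    """Inverse of the average pairwise cosine similarity among the
    sentence-level contexts of a concept."""
    if len(contexts) < 2:
        return 1.0  # a single context carries no diversity signal
    vectors = CountVectorizer().fit_transform(contexts)
    sims = cosine_similarity(vectors)
    n = len(contexts)
    # Average over distinct pairs only (exclude the diagonal of ones).
    avg = (sims.sum() - n) / (n * (n - 1))
    return 1.0 / (avg + 1e-9)
```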
We now discuss how to combine all the features. Since each feature can be used individually to rank the concepts, we compute the ambiguity score of a concept from its ranking positions under each of the features. The concepts are then ranked by these scores.
Formally, let c denote a concept, AS_O(c) the overloaded ambiguity score of the concept, and f_i(c) the value of feature f_i for concept c. We then have

AS_O(c) = \sum_i \alpha_i \cdot PS(f_i(c)),

where \alpha_i is the weight of feature f_i and \sum_i \alpha_i = 1. The weights can be learned from a training set. PS(x) is the relative position score under each feature and is computed as

PS(f_i(c)) = 1 - \frac{PositionInFeature(c, f_i) - 1}{\#TotalConcepts}, \quad (1)

where PositionInFeature(c, f_i) is the rank of concept c according to feature f_i.
Note that there could be other ways of combining these features. We use the relative position instead of the absolute score of each feature because we want to make the results of different features more comparable.
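A sketch of this rank-based combination, assuming each feature is given as a list of all concepts sorted from most to least ambiguous, and that the weights α_i have been learned beforehand:

```python
def overloaded_ambiguity_ranking(concepts, feature_rankings, alphas):
    """Combine per-feature rankings into AS_O(c): each feature contributes
    its position score PS(f_i(c)) = 1 - (position - 1) / #TotalConcepts,
    weighted by alpha_i (assumed to sum to 1)."""
    total = len(concepts)

    def position_score(ranking, c):
        # ranking.index(c) equals (position - 1); every concept is assumed
        # to appear in every feature ranking.
        return 1.0 - ranking.index(c) / total

    scores = {c: sum(a * position_score(r, c)
                     for a, r in zip(alphas, feature_rankings))
              for c in concepts}
    return sorted(concepts, key=scores.get, reverse=True)
```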
– Pattern-Based Similarity: Pattern-based features have been used to detect semantic relationships in large text corpora [9,19,14], and we follow a similar strategy to detect synonym pairs. In particular, we start with a set of known pairs of synonymous concepts and retrieve the sentences that mention both concepts. We then identify patterns, i.e., common phrases or terms, and these patterns are in turn used to retrieve more candidate pairs. The process is repeated until no new patterns can be found. If concepts c_i and c_j follow a discovered pattern P, then we set Sim_P(c_i, c_j) = Sim_P(c_j, c_i) = 1. Following the proposed method, we are able to find the following patterns (see the sketch after this list):
  • c1 abbreviated c2
  • c1 (c2)
  • c1, also known as c2
  • c1, a.k.a. c2
– Textual-Based Similarity: A synonymous concept pair reflects the same semantic meaning, so its textual similarity is likely to be higher than that of other pairs. For example, the concepts account reference number and original account number both refer to the number assigned to a user when opening an account. Thus, we set Sim_T(c_i, c_j) = CosineSimilarity(c_i, c_j).
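A rough sketch of how the surface patterns above can be matched with regular expressions; the concept-matching expression and the helper function are assumptions, since the paper does not specify the matching machinery:

```python
import re

# The four discovered surface patterns, with {c} as a placeholder for a
# concept-matching expression.
SYNONYM_PATTERNS = [
    r"(?P<c1>{c}) abbreviated (?P<c2>{c})",
    r"(?P<c1>{c}) \((?P<c2>{c})\)",
    r"(?P<c1>{c}), also known as (?P<c2>{c})",
    r"(?P<c1>{c}), a\.k\.a\. (?P<c2>{c})",
]

def pattern_pairs(sentence, concept_regex=r"[A-Za-z][\w ]*?"):
    """Return candidate synonym pairs found in a sentence."""
    pairs = set()
    for pattern in SYNONYM_PATTERNS:
        for m in re.finditer(pattern.format(c=concept_regex), sentence):
            pairs.add((m.group("c1").strip(), m.group("c2").strip()))
    return pairs

print(pattern_pairs("The account reference number (arn) must be unique."))
# -> {('The account reference number', 'arn')}
```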
Each of the features captures one aspect of synonymous ambiguity, and each has its own limitations. The context-based similarity feature may fail to detect ambiguous pairs occurring in the same sentence, while the pattern-based feature mainly detects pairs from the same sentence. Textual similarity is only effective when the ambiguous pairs share common terms, and fails on many pairs that do not (e.g., the account reference number and its abbreviation arn). Thus, we propose the following method to combine all the features and improve the performance.
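A formulation consistent with this strategy, writing AS_S(c_i, c_j) for the synonymous ambiguity score of a pair (the exact notation is assumed here), is

AS_S(c_i, c_j) = \begin{cases} 1 & \text{if } Sim_P(c_i, c_j) = 1, \\ \alpha \cdot PS(Sim_C(c_i, c_j)) + (1 - \alpha) \cdot PS(Sim_T(c_i, c_j)) & \text{otherwise,} \end{cases}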
where PS(x) is the relative position score shown in Equation (1). The proposed method trusts the results of the pattern-based similarity more than the other two features. When the two concepts do not follow any learned pattern, we consider their context and textual similarities; the relative importance of these two similarities is determined by the parameter α.
6 Experiment Setup
6.1 Experiment Design
Our system takes a set of SRSs as input, and then returns two separate ranking lists for
the two kinds of ambiguities.
The pre-processing of the SRSs is kept to the minimum. We split the requirements
into sentences, but did not remove stop words or stem the words. Stop words are not
Automatic Detection of Ambiguous Terminology for Software Requirements 31
removed because a word may serve as a stop word in one part of the document but
be used meaningfully in another. For example, the words to and be are generally
considered stop words, but if these two words are removed, the concept system to be
loses its meaning. Word stemming is not used because it may generate new ambiguity.
For example, the concepts programs, programmer, and programming may be used
correctly in a document without ambiguity; if stemming is applied, all three collapse
to program, which could unnecessarily make the problem of overloaded ambiguity
more difficult.
Results are evaluated with three measures: P@N (precision at the top N results),
R@N (recall at the top N), and MAP@N (mean average precision at the top N). P@N
measures the percentage of the top N detected concepts (or concept pairs) that are
indeed ambiguous. R@N measures the percentage of ambiguous concepts (or concept
pairs) that are included in the top N results. MAP@N is a commonly used measure for
evaluating the top N ranked results. Our primary evaluation measure is MAP@10.
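Under the standard definitions of these measures, a small evaluation helper could look like the sketch below (names are ours; normalising average precision by min(|relevant|, N) is one common convention):

import java.util.List;
import java.util.Set;

public class RankingMetrics {

    // P@N: fraction of the top N results that are judged ambiguous.
    static double precisionAtN(List<String> ranked, Set<String> relevant, int n) {
        long hits = ranked.stream().limit(n).filter(relevant::contains).count();
        return hits / (double) n;
    }

    // R@N: fraction of all ambiguous items that appear in the top N.
    static double recallAtN(List<String> ranked, Set<String> relevant, int n) {
        long hits = ranked.stream().limit(n).filter(relevant::contains).count();
        return hits / (double) relevant.size();
    }

    // AP@N for a single ranked list; MAP@N averages this over projects.
    static double averagePrecisionAtN(List<String> ranked, Set<String> relevant, int n) {
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < Math.min(n, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += hits / (double) (i + 1);
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / Math.min(relevant.size(), n);
    }
}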
We conduct experiments over four real-world data sets obtained from different software
projects. These projects were chosen because they span different domains and sizes
and because there has been a consistent effort to revise their requirement documents.
The characteristics of these projects are described in Table 1. The information includes
the project name, project type, project domain, SRS length (in terms), number of
requirements, average requirement length, and the number of revisions to the
requirement documents for each project. The participants involved in PI and PII were
software engineering students and professional developers with varying skills and
experience, while those for PIII and PIV were professional developers.
To quantitatively evaluate the proposed approach, we create judgments on both
ambiguity types for each project. Each judgment indicates whether a concept is
overloaded-ambiguous or whether a concept pair is synonymous-ambiguous. The
judgments were created by five assessors with training in software engineering and
requirements engineering. For overloaded ambiguity, an assessor goes over all the
candidate concepts for a project and decides whether each of them is ambiguous. The
decision is made by first locating all the places where the concept is mentioned and
then checking whether the concept has multiple meanings by reading its contexts. The
process for synonymous ambiguity is similar, except that the assessor needs to compare
the contexts of concept pairs.
The four projects were cross-evaluated by different assessors; for each project, there
are at least three judgments for each type of ambiguity. Given these judgments, a voting
schema is used to make the final decision. For each type of ambiguity in each project,
we consider a candidate concept (or pair of concepts) ambiguous only if two or more
assessors identified it as ambiguous.
Table 2 describes the basic statistics of the created judgments for each project. It
includes the number of candidate concepts (i.e., Concepts), the number of overloaded
concepts (i.e., Overloaded) and the number of synonymous concept pairs (i.e., Synony-
mous). It is surprising to see that a significant portion of the candidate concepts are
still ambiguous even after at least 7 revisions, which reinforces the need for automated
techniques that can help reduce these ambiguities and produce more consistent SRSs.
7 Experiment Results
We now report the results for the proposed methods. There are several parameters in
the proposed methods, so we train the parameter values on one collection (i.e., PI) and
use the learned parameters for the remaining three test collections (i.e. PII, PIII and
PIV). We conduct two sets of experiments to evaluate the effectiveness of the proposed
methods for each ambiguity type, and report the optimal performance on the training
set and the test performance on the testing sets for both sets.
Table 3 shows the optimal performance of the proposed overloaded ambiguity detection
methods for PI. All denotes the method that combines all the features. CDiv.,
CFreq, NClusters, and InterDist correspond to the methods that use a single feature
for ranking, i.e., context diversity, concept frequency, the number of clusters in the
context, and inter-cluster distance, respectively. During training, we also conducted
5-fold cross-validation on PI. The average MAP@10 of the proposed method (i.e.,
combining all features) is 0.334. It is clear that combining all the features consistently
and significantly outperforms the baseline method over all the test collections.
Table 4 shows the testing performance for the three test collections. Note that the
parameters are set to the values learned on the training set, i.e., PI. All still denotes
the performance of combining all the features, and BL denotes the best performance
when using a single feature. Moreover, the parameters learned on the training set seem
to work well on the other test sets even though these are from completely different domains.
Table 4. Test performance of overloaded ambiguity detection on the three test collections

            All                      BL
       MAP@10  P@10  R@10      MAP@10  P@10  R@10
PII     0.21   0.6   0.26       0.11   0.5   0.22
PIII    0.15   0.2   0.18       0.08   0.1   0.09
PIV     0.42   0.4   0.57       0.12   0.1   0.14
The similarity boundary is used as the stop criterion of the HAC method, i.e., when
the maximum similarity value between two clusters is smaller than the similarity
boundary, the clustering procedure stops. Therefore, the value of the similarity
boundary affects the performance of NClusters, InterDist, and All for overloaded
ambiguity detection. We now examine the performance sensitivity with respect to the
value of the similarity boundary. Figure 2 shows the sensitivity curves for all three
methods on the training collection (i.e., PI). It is clear that the similarity boundary can
be neither too large nor too small. When the similarity boundary is too large, we may
separate similar contexts into different groups. On the other hand, when the similarity
boundary is too small, we may not be able to distinguish different contexts. For
example, if the threshold is 0.1, most of the contexts will be grouped together and the
ability to differentiate them is limited. Our preliminary results suggest that the optimal
value for the similarity boundary is around 0.3.
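The stop criterion can be illustrated with a minimal HAC sketch; the average-link merge strategy and cosine similarity over context vectors are our assumptions, since the paper does not fix them here.

import java.util.ArrayList;
import java.util.List;

public class BoundaryHac {

    // Average-link similarity between two clusters of context vectors.
    static double clusterSimilarity(List<double[]> a, List<double[]> b) {
        double sum = 0.0;
        for (double[] x : a) for (double[] y : b) sum += cosine(x, y);
        return sum / (a.size() * b.size());
    }

    // Merge the two most similar clusters until no pair exceeds the boundary.
    static List<List<double[]>> cluster(List<double[]> contexts, double boundary) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] c : contexts) {
            List<double[]> singleton = new ArrayList<>();
            singleton.add(c);
            clusters.add(singleton);
        }
        while (clusters.size() > 1) {
            int bi = -1, bj = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = clusterSimilarity(clusters.get(i), clusters.get(j));
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            if (best < boundary) break; // the stop criterion described above
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return clusters;
    }

    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int k = 0; k < x.length; k++) { dot += x[k] * y[k]; nx += x[k] * x[k]; ny += y[k] * y[k]; }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny) + 1e-12);
    }
}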
Table 6. Test performance of synonymous ambiguity detection on the three test collections

            All                      BL
       MAP@10  P@10  R@10      MAP@10  P@10  R@10
PII     0.38   0.2   0.66       0.16   0.1   0.33
PIII    0.17   0.3   0.42       0.09   0.3   0.42
PIV     0.37   0.3   0.5        0.13   0.2   0.33
With the parameters trained on Project I, we report the test performance on the other
three collections in Table 6. BL denotes the baseline method using a single feature;
we use the textual-based feature in this set of experiments since it is more effective
than the other two features. The results show that it is more effective to combine all
the features, and this conclusion holds for all the test sets.
We also conducted an exit survey with the assessors and asked them about their
experience in making the judgments for synonymous ambiguity detection. We found
that it takes more effort to make judgments for this ambiguity type, and that it is
necessary to consider both the context and the semantic meaning of the concepts to
detect such ambiguities. Furthermore, the assessors also stated that the ranked list is a
good tool that can help them identify the ambiguous pairs more effectively. In
particular, the pairs reminded them of concepts that could be interchangeable, which
was really helpful, especially when the SRS is long.
7.3 Discussions
Identifying ambiguous concepts in natural language is a difficult task, even for human
assessors. To demonstrate this, we evaluated the judgment results from the assessors.
As every project has three sets of judgments, one of them was chosen as the gold
standard to evaluate the remaining two. We iteratively conducted this evaluation for
each project and report the average performance in Table 7. It is worth noticing that
the performance of the manually created results is only around 0.5 in terms of MAP.
This low value shows that ambiguity detection is a challenging task even for
well-trained human assessors.
Table 7. Performance of the manually created judgments

          Overloaded                 Synonymy
       MAP@10  P@10  R@10      MAP@10  P@10  R@10
PI      0.53   0.61  0.79       0.49   0.61  0.61
PII     0.49   0.78  0.49       0.46   0.68  0.76
PIII    0.47   0.48  0.50       0.36   0.61  0.38
PIV     0.52   0.51  0.53       0.57   0.58  0.67
References
1. Berry, D.M.: Ambiguity in natural language requirements documents. In: Paech, B., Martell,
C. (eds.) Monterey Workshop 2007. LNCS, vol. 5320, pp. 1–7. Springer, Heidelberg (2008)
2. Berry, D.M., Kamsties, E., Krieger, M.M.: From contract drafting to software specification:
Linguistic sources of ambiguity (2003),
http://se.uwaterloo.ca/~dberry/handbook/ambiguityHandbook.pdf
3. Boehm, B.W., Papaccio, P.N.: Understanding and controlling software costs. IEEE
Transactions on Software Engineering 14, 1462–1477 (1988)
4. Chantree, F., Nuseibeh, B., de Roeck, A., Willis, A.: Identifying nocuous ambiguities in
natural language requirements. In: Proceedings of the 14th IEEE International Requirements
Engineering Conference, Washington, DC, USA, pp. 56–65 (2006)
5. Cobleigh, R.L., Avrunin, G.S., Clarke, L.A.: User guidance for creating precise and acces-
sible property specifications. In: ACM SIGSOFT 14th International Symposium on Founda-
tions of Software Engineering, pp. 208–218 (2006)
6. Damas, C., Lambeau, B., Dupont, P., van Lamsweerde, A.: Generating annotated behavior
models from end-user scenarios. IEEE Transactions on Software Engineering 31, 1056–1073
(2005)
7. Frantzi, K., Ananiadou, S.: Extracting nested collocations. In: Proceedings of the 16th Con-
ference on Computational Linguistics, vol. 1, pp. 41–46 (1996)
8. Greenspan, S., Mylopoulos, J., Borgida, A.: On formal requirements modeling languages:
Rml revisited. In: Proceedings of the 16th International Conference on Software Engineering,
Los Alamitos, CA, USA, pp. 135–147 (1994)
9. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings
of the 14th Conference on Computational Linguistics, Stroudsburg, PA, USA, vol. 2, pp.
539–545 (1992)
10. Hussain, I., Ormandjieva, O., Kosseim, L.: Automatic Quality Assessment of SRS Text by
Means of a Decision-Tree-Based Text Classifier. In: Seventh International Conference on
Quality Software (QSIC), pp. 209–218 (2007)
11. Ide, N., Véronis, J.: Word sense disambiguation: The state of the art. Computational Linguis-
tics 24, 1–40 (1998)
12. Manning, C.D., Raghavan, P., Schtze, H.: Introduction to Information Retrieval. Cambridge
University Press, New York (2008)
13. Manning, C.D., Schütze, H.: Foundations of statistical natural language processing. MIT
Press, Cambridge (1999)
14. Maynard, D., Funk, A., Peters, W.: Using lexico-syntactic ontology design patterns for on-
tology creation and population. In: Proceedings of WOP 2009 Collocated with ISWC 2009,
vol. 516 (2009)
15. Nikora, A., Hayes, J., Holbrook, E.: Experiments in Automated Identification of Ambiguous
Natural-Language Requirements. In: Proc. 21st IEEE International Symposium on Software
Reliability Engineering, San Jose
16. Porter, A., Votta, L.: Comparing detection methods for software requirements inspections: A
replication using professional subjects. Empirical Software Engineering 3, 355–379 (1998)
17. Porter, A.A., Votta Jr., L.G., Basili, V.R.: Comparing detection methods for software require-
ments inspections: A replicated experiment. IEEE Transactions on Software Engineering 21,
563–575 (1995)
18. Reubenstein, H.B., Waters, R.C.: The requirements apprentice: an initial scenario. SIGSOFT
Software Engineering Notes 14, 211–218 (1989)
19. Roark, B., Charniak, E.: Noun-phrase co-occurrence statistics for semiautomatic semantic
lexicon construction. In: Proceedings of the 17th International Conference on Computational
Linguistics, Stroudsburg, PA, USA, vol. 2, pp. 1110–1116 (1998)
20. Shull, F., Rus, I., Basili, V.: How perspective-based reading can improve requirements in-
spections. Computer 33, 73–79 (2000)
21. Tratz, S., Hovy, D.: Disambiguation of preposition sense using linguistically motivated fea-
tures. In: HLT-NAACL (Student Research Workshop and Doctoral Consortium), pp. 96–100
(2009)
22. Umber, A., Bajwa, I.S.: Minimizing ambiguity in natural language software requirements
specification. In: Digital Information Management (ICDIM), pp. 102–107 (2011)
23. van Lamsweerde, A.: Requirements Engineering: From System Goals to UML Models to
Software Specifications. John Wiley & Sons (2009)
24. Zhang, X., Fang, A.: An ATE system based on probabilistic relations between terms and
syntactic functions. In: 10th International Conference on Statistical Analysis of Textual Data
- JADT 2010 (2010)
25. Zou, X., Settimi, R., Cleland-Huang, J.: Improving automated requirements trace retrieval: a
study of term-based enhancement methods. In: Empirical Software Engineering, vol. 15, pp.
119–146 (2010)
26. Zowghi, D., Gervasi, V., McRae, A.: Using default reasoning to discover inconsistencies
in natural language requirements. In: Proceedings of the Eighth Asia-Pacific on Software
Engineering Conference, Washington, DC, USA, pp. 133–140 (2001)
An OpenCCG-Based Approach to Question
Generation from Concepts
1 Introduction
Speech-based information systems offer dialogue-based access to information
via the phone. They avoid the complexity of computers and websites and make it
possible to access automated systems without being distracted from one's visual
focus (e.g. while driving a car) and without needing one's hands or eyes (e.g. for
visually impaired people or workers with gloves). Most people
have already used spoken dialogue systems (SDS) in order to reserve tickets
for the cinema, to check the credit of a pay-as-you-go phone or to look for the
next bus. Generally, a SDS can be defined as a system that “enables a human
user to access information and services that are available on a computer or
over the Internet using spoken language as the medium of interaction” [11]. In
this way, these systems can offer a convenient way of retrieving information.
However, Bringert [5] identifies three major problems with current interactive
speech applications: They are not natural, not usable and not cheap enough.
Berg [2] found that 71% of users prefer the most natural dialogues when choosing
from three fictional human-machine dialogues. This is in line with the results of
Dautenhahn et al. [7], who also found that 71% of people wish for human-like
communication with robots. Looi and See [14] describe the stereotype of
human-robot dialogue as being monotonous and inhumane. They argue that
the engagement between human and robot can be improved by implementing
politeness maxims, i.e. connecting with humans emotionally through a polite
social dialogue. In order to realise user-friendly and natural dialogue systems
that apply politeness maxims and adapt their style to the user’s language, we
need support from a language generation component.
In this paper we describe a method for generating system questions in
information-seeking dialogue systems. Our aim is to formulate these questions
in different styles (formality and politeness) from abstract descriptions (concept-
to-speech). We hope to increase the user acceptance of dialogue systems by con-
tributing to a method for generating human-like and adaptive utterances.
2 Related Work
As this paper focusses on the generation of questions in different styles, we review
related work in the areas of question generation and linguistic style.
and the need for self-esteem and respect from others and states “that the use of
politeness is dependent on the social distance and the difference of power between
conversational partners, as well as on the threat of the speakers communicative
act towards the hearer”. Gupta et al. [8] state that “Politeness is an integral part
of human language variation, e.g. consider the difference in the pragmatic effect
of realizing the same communicative goal with either ‘Get me a glass of water
mate!’ or ‘I wonder if I could possibly have some water please?’ ”. De Jong et al.
[12] claim that language alignment happens not only at the syntactic level, but
also at the level of linguistic style. They consider linguistic style variations as
an important factor to give virtual agents a personality and make them appear
more socially intelligent. Raskutti and Zukerman [20] also found that naturally
occurring information-seeking dialogues include social segments such as greetings
and closings. Consequently, De Jong et al. [12] describe an alignment model that
adapts to the user’s level of politeness and formality. The model has three dimensions:
politeness, formality, and the T-V distinction. Whereas politeness is associated with
sentence structures, formality depends on the choice of words. In many languages
we have to differentiate between formal and informal forms of address (the T-V
distinction). This feature is clearly related to both formulation and politeness.
However, in De Jong’s model this feature is not influenced by formality or politeness
changes during the conversation, in order to prevent the dialogue from constantly
switching between the two extremes.
3 Style Variation
Style variation is the generation of different formulations with the same goal.
In this paper we focus on task-oriented dialogue systems. This class comprises
question-answering-, command-, and information-seeking/booking systems [16].
In particular, we regard the style variation of interrogatives. We use this term in
order to refer to all kinds of utterances that have the aim of getting information.
This may be a question (“When do you want to leave for London?”) or a request
(“Tell me when you want to leave for London!”). Both interrogatives have the
same intention, i.e. getting the time of departure.
concepts from which the system should generate questions, instead of formulat-
ing static and inflexible questions as strings. In this scenario, AQDs can also
formalise the parser (in this case a date grammar could be provided). They also
help the language understanding component by reducing differently formulated
utterances with the same goal to a common description.
For question generation, however, we need more information about the style.
While the classification by question words has been declared impractical for
describing the intention of an interrogative because different question words can
refer to the same goal, e.g. when and at what time [3], it is still important for
style variation. Hence, the relation between AQD and question word can be
useful for choosing the correct question word. Apart from the question word, we
can also vary grammatical characteristics. When generating a question for the
AQD answer type fact.temporal.date, we can think of different formulations:
We clearly see that all these utterances have the same intention. However, the
AQD is not sufficient for successful generation. In addition to the AQD we also
need to define semantic constraints. In this case we could constrain the type of
date to departures.
With this distinction between politeness (the use of please and subjunctive forms,
choice of question style) and formality (choice of words, T-V distinction) we now
have parameters at hand to change the style of a system interrogative. In the
next section we address the topic of modelling system interrogatives for usage
in a toolkit that allows the realisation of interrogatives with respect to given
parameters.
4 Realisation
[Figure: the question generation pipeline. A determination step fixes the content of the
interrogative (e.g. dimension: temporal, spec: begin, reference: trip) together with the
style parameters (politeness, formality, length, T/V). Lexicalisation then selects question
words (what, where, when, how), nouns (e.g. destination, start of trip), and verbs (e.g.
depart, go, leave, return). One of the question types (Wh-Question, Wh-Request,
N-Request, C-Wh-Question, C-N-Question, Command) is used to create a logical form,
which the realisation step turns into surface text using the CCG grammar (lexicon.xml)
and the realisation lexicon (morph.xml).]
With this information we can create logical forms that can be used as input for
our language generator. We first take a closer look at the language generation
process. Afterwards we describe our knowledge base and how we can use the
results in dialogue systems.
$$\frac{Y/X \quad X}{Y}\,> \qquad\qquad \frac{X \quad Y\backslash X}{Y}\,< \qquad (1)$$
Apart from application combinators, there are composition and type-raising
combinators. Composition refers to the combination of two functions where the
domain of one is the range of the second. It is described with the operator
B together with the application direction. Type-raising turns “arguments into
functions over functions-over-such-arguments” [22], i.e. argument X is turned
into a function that has a complex argument that takes this X as an argument.
It allows “arguments to compose with the verbs that seek them” and is also used
in order to be able to apply all rules in one direction (i.e. incremental
processing; a full left-to-right proof). Type-raising is denoted with a T. Given the
following lexicon, we can derive the sentence “I like science” as in (2), or with
type-raising and composition as a full left-to-right proof as in (3).
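A minimal reconstruction, assuming the standard CCG lexical assignments I := np, like := (s\np)/np, and science := np:

\[
\begin{array}{llll}
\multicolumn{4}{l}{\text{Assumed lexicon: } \text{I} \vdash np, \quad \text{like} \vdash (s\backslash np)/np, \quad \text{science} \vdash np}\\[6pt]
(2) & (s\backslash np)/np \;\; np \;\Rightarrow\; s\backslash np & (>) & \text{like science}\\
    & np \;\; s\backslash np \;\Rightarrow\; s & (<) & \text{I like science}\\[6pt]
(3) & np \;\Rightarrow\; s/(s\backslash np) & (>\mathbf{T}) & \text{type-raising I}\\
    & s/(s\backslash np) \;\; (s\backslash np)/np \;\Rightarrow\; s/np & (>\mathbf{B}) & \text{I like}\\
    & s/np \;\; np \;\Rightarrow\; s & (>) & \text{I like science}
\end{array}
\]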
when:
    s[wh-question]/s[question]    question
    s[iwh-question]/s[b]          indirect request
¹ http://sourceforge.net/projects/openccg/
In our example we apply the first category². This can be read as: The word
‘when’ can become a sentence of type ‘wh-question’ if there is a sentence of type
‘question’ on the right-hand side. Since the definition of wh-words is not enough,
we now take a look at how the right-hand category of the rule is defined. A
question can be created with the help of an auxiliary verb like do:
do
s[question]/s[b]
The word ‘do’ can become a sentence of type ‘question’ if there is a sentence
with a bare infinitive on the right. In order to create such a sentence, we need
a verb. Generally, an intransitive verb is defined as s\np. The feature b denotes
a bare infinitive and to is an infinitive with to [10,9]. The combination of the
lexemes want, to, and go leads to a category that becomes a sentence with a bare
infinitive if there is an np on the left. Figure 2 shows the complete application of
the CCG categories.
The logical form for the wh-question from our last example is depicted in Figure
3. We use thematic roles in order to specify the proposition of a question. In this
case an agent wants a theme. We can also use semantic features to influence the
style of the utterance, i.e. in this case we want to formulate an interrogative sen-
tence with a second person singular agent. As already mentioned, logical forms
s{stype=wh-question}:
  @w0(when ^
    <prop>(w3 ^ want ^
      <mood>interrogative ^
      <agent>(w2 ^ pron ^
        <num>sg ^
        <pers>2nd) ^
      <theme>(w5 ^ go ^
        <agent>x1)))
are the basis for the realisation process. A logical form abstracts from grammar and
word position issues and just reflects the logical meaning of an utterance. Additionally,
we need to know which words to use, i.e. OpenCCG requires a finished lexicalisation
process. While there are words that are only influenced by inflection
(pronoun, sg, 2nd), we also have words that change the style of an utterance³.
Another way of realising the same meaning with a different lexicalisation would
have been “When do you want to leave?”.
5 Concept to Text
As already mentioned, our aim is the automatic generation of system questions
in an information-seeking dialogue system. We want to be able to instruct the
system to create a question that asks for the departure time in a polite but
informal way without mentioning specific words. This is absolutely necessary
to create different levels of formality. So instead of defining words in the logical
form we need meaning representations, i.e. we have to replace the bold-faced
words in Figure 3 with concepts.
³ Indicated with bold face in Figure 3.
However, we need a more abstract formulation that also focusses on the similari-
ties of word senses. In a travel domain go and travel can be synonymous (“When
do you want to go” = “When do you want to travel”) and should therefore have
the same description. Thus, as a first draft, we propose to describe a word with
its type of usage (or context), so that every word w is assigned a:
– Part of Speech π
– Domain δ
– Context γ: Dimension ∧ Specification
– Referent ρ
which results in (w typeof π) ∈ δ = (γ, ρ). In the following examples you can
see how we can describe the meaning of words, together with some exemplary
sentences. The definition of go reads as follows: It is a verb in the travel domain
that can be used in a temporal context to describe the beginning of a trip (start
date) or in a local context to describe the end of a trip (destination). Nouns
follow the same scheme. The only difference is the part of speech.
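By way of illustration, the four components could be captured in a small record type; the names and the (dimension, specification, referent) encoding of a context are our choices.

import java.util.List;

// A word sense description: (w typeof pos) in domain = (context, referent).
record WordDescription(String word, String pos, String domain,
                       List<Context> contexts) {

    // A usage context: a dimension with a specification, plus the referent.
    record Context(String dimension, String specification, String referent) {}

    // The 'go' example: a travel-domain verb usable in a temporal context
    // (begin of a trip) or a local context (end of a trip).
    static WordDescription goExample() {
        return new WordDescription("go", "verb", "travel", List.of(
                new Context("temporal", "begin", "trip"),
                new Context("local", "end", "trip")));
    }
}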
When we take a look at question words, we can see that we have introduced a
general domain and a wildcard referent as well as a wildcard specificator.
Apart from question words and related verbs or nouns (when do you want to
go) we also have verbs like want, tell, have and can that cannot be related to
an answer type. Here we can use the specificator to denote the context, e.g. a
word that can be used in any dimension to indicate a transfer of knowledge with
reference to any object. With this elementary definition of words we are now
able to describe questions. We basically describe a question by γ and ρ, i.e. the
context and the referent. The task of creating a question that asks for the beginning
of a trip would be defined as: ask(fact.temporal.date, begin, trip). According to
our question style definitions from Section 4.2 and their related grammars, we
choose either a verb or a noun to represent γ and ρ. Also neutral words like
want or tell are introduced in this step. In a very formal utterance tell could
be replaced by specify. Apart from the formality, we also have to choose the
correct question style according to the politeness. A high politeness value leads
to the introduction of the word please and changes the mood from indicative
to subjunctive. Moreover, we have a list that assigns a politeness value to every
question type and thus influences the construction of the logical forms. For ex-
ample, a can-question is more polite than a request. These values are currently
based on intuitions which were backed up by the choices of the evaluation par-
ticipants (see Section 6) but in future versions we plan to base them on user
studies.
The result of this step is the representation in Figure 3, which is – at the same
time – the input for the OpenCCG realiser.
int intended_formality = 2;   // 1..5
int intended_politeness = 1;  // -2..5
questions.add(new Question("fact.temporal.date", "begin", "trip"));
questions.add(new Question("fact.temporal.date", "end", "trip"));
questions.add(new Question("fact.location", "begin", "trip"));
questions.add(new Question("fact.location", "end", "trip"));
questions.add(new Question("decision", "possession", "customer_card"));
2 we have increased the politeness value to 4. You can see that the system now
chooses C-Questions and makes use of verbs instead of nouns.
Fig. 4. Survey: the four dialogues (system turns shown; user replies elided as “...”)

Dialogue A:
A: When do you want to set off?
A: Can you now tell me when you want to return?
A: Please tell me your departure city!
A: And where do you want to go?
A: Do you have a customer card?

Dialogue B:
A: Can you please tell me when you want to set off?
A: Could you now please tell me when you want to return?
A: Can you tell me where you want to start from?
A: Can you now please tell me where you want to go?
A: Do you have a customer card?

Dialogue C:
A: Departure date?
A: And the return date?
A: Departure city?
A: And the destination?
A: Customer card?

Dialogue D:
A: Departure date please!
A: Now please tell me your return date!
A: Tell me your departure city!
A: And the destination please!
A: Do you have a customer card?
We began with the sorting task. In general the participants correctly classi-
fied the dialogues according to our intended politeness levels, and 46% put the
dialogues in exactly the right order (C, D, A, B). The participants classified the
two more impolite dialogues as more polite than they are, and the two more
polite ones as less polite, as shown in Table 1, but this can possibly be explained
by people’s tendency to choose values in the middle of a scale rather than at
either extreme. Next, we asked the participants which dialogue they liked most.
77% of them preferred dialogue A, 19% preferred dialogue B, and 4% dialogue
D. When we weight the preferred dialogues with the corresponding politeness
scores and normalise the result (see Equation 4), we get an average preferred
politeness score of 3.2 (original score) and 2.8 (user score), respectively, which
again corresponds to dialogue A.
$$\frac{1}{|user|}\sum_{i=1}^{|dialogues|} score_i \times |votes\ dialogue_i| \qquad (4)$$
In the last step we asked the users to indicate which dialogue might have been
uttered by a human. 88% of the participants think that dialogue A could have
been uttered by a human, 42% think dialogue B might be of human origin, and
19% declare dialogue C and 4% dialogue D as human. This evaluation confirms
that the system is able to create questions at different politeness levels and that
these levels are correctly identified by users.
References
1. Baldridge, J., Kruijff, G.J.M.: Multi-modal combinatory categorial grammar. In:
Proceedings of the Tenth Conference on EACL, Stroudsburg, PA, USA, pp. 211–
218 (2003)
2. Berg, M.M.: Survey on Spoken Dialogue Systems: User Expectations Regarding
Style and Usability. In: XIV International PhD Workshop, Wisla, Poland (October
2012)
3. Berg, M.M., Düsterhöft, A., Thalheim, B.: Towards interrogative types in task-
oriented dialogue systems. In: Bouma, G., Ittoo, A., Métais, E., Wortmann, H.
(eds.) NLDB 2012. LNCS, vol. 7337, pp. 302–307. Springer, Heidelberg (2012)
4. Boyer, K.E., Piwek, P. (eds.): Proceedings of QG 2010: The Third Workshop on
Question Generation, Pittsburgh (2010)
5. Bringert, B.: Programming language techniques for natural language applications
(2008)
6. Brown, P., Levinson, S.C., Gumperz, J.J.: Politeness: Some Universals in Language
Usage. Studies in Interactional Sociolinguistics. Cambridge University Press (1987)
7. Dautenhahn, K., Woods, S., Kaouri, C., Walters, M.L., Koay, K.L., Werry, I.: What
is a robot companion - friend, assistant or butler?, pp. 1488–1493 (2005)
8. Gupta, S., Walker, M.A., Romano, D.M.: Generating politeness in task based in-
teraction: An evaluation of the effect of linguistic form and culture. In: Proceedings
of the Eleventh European Workshop on NLG, Stroudsburg, PA, USA, pp. 57–64
(2007)
9. Hockenmaier, J., Steedman, M.: CCGbank: User’s Manual. Tech. rep (May 2005)
10. Hockenmaier, J., Steedman, M.: CCGbank: A Corpus of CCG Derivations and De-
pendency Structures Extracted from the Penn Treebank. Comput. Linguist. 33(3),
355–396 (2007)
11. Jokinen, K., McTear, M.F.: Spoken Dialogue Systems. Synthesis Lectures on Hu-
man Language Technologies. Morgan & Claypool Publishers (2009)
12. de Jong, M., Theune, M., Hofs, D.: Politeness and alignment in dialogues with a
virtual guide. In: Proceedings of the 7th International Joint Conference on Au-
tonomous Agents and Multiagent Systems, Richland, SC, pp. 207–214 (2008)
13. Kruijff, G.J.M., White, M.: Specifying Grammars for OpenCCG: A Rough Guide
(2005)
14. Looi, Q.E., See, S.L.: Applying politeness maxims in social robotics polite dialogue.
In: Proceedings of the Seventh Annual ACM/IEEE International Conference on
Human-Robot Interaction, pp. 189–190. ACM, New York (2012)
15. Mairesse, F.: Learning to Adapt in Dialogue Systems: Data-driven Models for Per-
sonality Recognition and Generation. Ph.D. thesis, University of Sheffield (Febru-
ary 2008)
16. McTear, M.F., Raman, T.V.: Spoken Dialogue Technology: Towards the Conver-
sational User Interface. Springer (2004)
17. Olney, A.M., Graesser, A.C., Person, N.K.: Question generation from concept
maps. Dialogue and Discourse 3(2), 75–99 (2012)
18. Ou, S., Orasan, C., Mekhaldi, D., Hasler, L.: Automatic Question Pattern Genera-
tion for Ontology-based Question Answering. In: The Florida AI Research Society
Conference, pp. 183–188 (2008)
19. Papasalouros, A.: Automatic generation of multiple-choice questions from domain
ontologies. Engineer (Bateman 1997) (2008)
20. Raskutti, B., Zukerman, I.: Generating queries and replies during information-
seeking interactions. Int. J. Hum.-Comput. Stud. 47(6), 689–734 (1997)
21. Rus, V., Lester, J. (eds.): AIED 2009 Workshops Proceedings: The 2nd Workshop
on Question Generation, Brighton (2009)
22. Steedman, M., Baldridge, J.: Non-Transformational Syntax. In: Combinatory Cat-
egorial Grammar, pp. 181–224. Wiley-Blackwell (2011)
A Hybrid Approach for Arabic Diacritization
1 Introduction
In almost all genres, Modern Standard Arabic (MSA) text is written without
short vowels, called diacritics. The restoration of these diacritics is valuable for natural
language processing applications such as full-text search and text to speech. Devising
a diacritization system for Arabic is a sophisticated task, as Arabic is highly
inflectional and derivational. Moreover, Arabic sentences are characterized by a
relatively free word order. The size of the Arabic vocabulary and the complex
Arabic morphological structure can both be managed efficiently via working on the
morpheme level (constituents of the words) instead of the word level. The system we
have built relies heavily on two core components: the morphological analyzer and the
part of speech (POS) tagger. By leveraging the systematic and compact nature of
Arabic morphology, we have developed a high quality rule-based morphological ana-
lyzer with high recall, driven by a comprehensive lexicon and handcrafted rules.
Moreover, we have developed a lightweight statistical morphological analyzer that is
trained on LDC’s Arabic Treebank corpus (ATB) [8]. The POS tagger is used to re-
solve most of the morphological and syntactic ambiguities in context. In the next
sections, we present our system as follows: Section 2 covers the linguistic description
of Arabic diacritization; Section 3 briefly covers previous related work; in Section 4,
we elaborate on the different components of the diacritizer; and in Section 5, we
report our system’s results compared to others, using the same evaluation setup:
metrics and data.
Diacritic   Name            Example   Pronunciation
ِ           Kasra           بِ        /b/ /i/
ُ           Damma           بُ        /b/ /u/
Double case ending (tanween)
ً           Tanween Fatha   بً        /b/ /an/
ْ           Sukuun          بْ        /b/
In almost all genres of written MSA text, diacritics are omitted, leading to a
combinatorial explosion of ambiguities, since the same Arabic word can have different
parts of speech and meanings based on the associated diacritics, e.g., the undiacritized
string عقد can be read as عَقْد “contract”, عِقْد “necklace”, or عَقَّد “complicate”.
The absence of diacritics adds layers of confusion for novice readers and for automatic
computation. For instance, the absence of diacritics becomes a serious obstacle to
many applications, including text to speech (TTS), intent detection, and automatic
understanding in general. Therefore, automatic diacritization is an essential component
for the automatic processing of Arabic text.
3 Related Work
closest to our work, we introduce new techniques to handle OOVs and generate case
endings, leading to better results.
Habash and Rambow [2] use a morphological analyzer and a disambiguation sys-
tem called MADA [4]. They use a feature set including case, mood, and nunation, and
use SVMTool [7] as a machine learning tool. They use SRILM toolkit [9] to build an
open-vocabulary statistical language model (SLM) with Kneser–Ney smoothing. Ha-
bash and Rambow [2] did experiments using the full-form words and the lexemes
(prefix, stem, and suffix) citation form. The best reported results are the ones they
obtain with the lexeme form and a trigram SLM [2]. The system does not handle
OOV words, i.e., words that are not analyzed or have not been seen during training.
Zitouni et al. [3] have built a diacritization framework that is based on maximum
entropy classification. The classifier is used to restore the missing diacritics on each
word’s letters. They also use a tokenizer (segmenter) and a POS tagger. They use dif-
ferent signals such as the segment n-grams, segment position of the character, the
POS of the current segment, and lexical features, including character and word n-
grams. Although they don’t have a morphological lexicon, they resort to statistical
Arabic morphological analysis to segment Arabic words into morphemes (segments).
These morphemes consist mainly of prefixes, stems, and suffixes. The maximum
entropy model combines all these features together to restore the missing vowels of
the input word sequence.
Emam and Fisher [6] introduced a hierarchical approach for diacritization. The
approach starts with searching a set of dictionaries of sentences, phrases, and words
using a top-down strategy. First they search in a dictionary of sentences; if there is a
matching sentence, they use the whole text. Otherwise the search moves on to a
dictionary of phrases, then a dictionary of words, to restore the missing diacritics. If
there is no match at any of the previous layers, a character n-gram model is used to diacritize
each word. No experimental results of this patented work have been mentioned in the
available patent document.
The first three systems are trained and tested using LDC’s Arabic Treebank
(#LDC2004T11) of diacritized news stories text-part 3, v1.0 [8] that includes 600
documents (340 K words) from the Lebanese newspaper “AnNahar”. The text is split
into a training set (288 K words) and a test set (52 K words) [1], [2], [3]. To our
knowledge, these three systems are currently the best performing systems. We adopt
their metrics and use the same training and test set for fair comparison.
Afterwards, the corrected text is tokenized. Each input word is then analyzed through
both the statistical and rule-based morphological analyzers, yielding zero or more
morphological analyses. Each analysis is composed of zero or more prefixes, the
stem, zero or more suffixes, the morphological pattern, the part of speech tag, and
the word tag probability.
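A sketch of how one such analysis could be represented (the record and field names are our choices):

import java.util.List;

// One morphological analysis of an input word, as described above.
record MorphAnalysis(List<String> prefixes, String stem, List<String> suffixes,
                     String morphPattern, String posTag, double tagProbability) {}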
[Figure: the diacritizer pipeline. (i) The input text is passed through the auto-corrector
and the tokenizer; (ii) each token is analyzed by the lexicon-driven morphological
analyzers and disambiguated by the POS tagger; (iii) the case ending rules are applied,
yielding the diacritized text.]
The next phase is responsible for selecting the most likely sequence of analyses
based on the context. This is achieved by the POS tagger, which is presented with a
lattice of morphological analyses. In addition to selecting the most probable analysis,
this process also disambiguates the residual case ending ambiguities.
The case ending diacritics are resolved in two passes: the first pass is deterministic
and driven by rules, and the second pass resolves the residual case ending ambiguities
through the POS tagger. Out of vocabulary (OOV) words, those not analyzed by the
morphological analyzers, are diacritized by the OOV diacritizer component, which
works on the character level. The contribution of each component has been measured
and the outcomes are reported later in the “Results” section.
4.1 Auto-corrector
We conducted a thorough analysis of spelling mistakes in Arabic text. A corpus of
one thousand articles, picked randomly from public Arabic news sites, has been
semi-manually tagged for spelling mistakes. Each spelling mistake also has an associated
error type. The analysis has shown a WER of 6%, which is considered very high. The
diacritization task is significantly affected by this high error rate, since the morphological
analyzer fails to analyze misspelt words, which are hence left un-diacritized.
Figure 2 shows the distribution of the spelling mistakes in Arabic text according to
our analysis. It was found that more than 95% of these mistakes are classified as
CAMs [5], and they could be categorized as follows: (i) confusion between different
forms of Hamza (أ إ آ); (ii) missing Hamza on plain Alef (ا); (iii) confusion
between Yaa (ي) and Alef-Maqsoura (ى); and (iv) confusion between Haa (ه)
and Taa-Marbouta (ة).
Fig. 3. Auto-corrector workflow: each input token is first checked against the named
entity list; named entities are passed through unchanged, other tokens go to the
morphological analyzer, and tokens that cannot be analyzed have the auto-correction
rules applied before re-analysis.
We used the ATB corpus [8] to train a statistical morphological analyzer that learns the
different possible diacritics, the part of speech tags, and the word tag probability:
hypothesis against the input word. For each valid hypothesis, diacritization rules are
applied to restore omitted vowels. For example, if the input word was مدرسةاً
(mdrspAF), ending with Tanween Fatha, the first two phases would propose an analysis
where the stem equals مدرسة (mdrsp) and the suffix equals اً (AF); however, the
synthesis phase would reject this assumption, since the synthesis rules would actually
generate the word مدرستاً (mdrstAF), converting ة (p) to ت (t).
[Figure: the rule-based analyzer. The stem extractor, driven by the extraction rules and
the stem-affixes matrix, feeds the synthesizer, which applies the synthesis rules and the
case ending (CE) generation rules.]
The synthesizer is also responsible for generating the correct case ending diacritic
and for positioning it on the appropriate letter.
Moreover, the morphological analyzer assigns to each generated morphological
structure a set of morphological, lexical, and syntactic features. Examples of lexical
features are transitivity and verb class; morphological features include definiteness,
gender, and number; and examples of syntactic features are case ending and genitivity.
using sequence labeling techniques. We have trained our POS tagger with the ATB
corpus [8], the same training set used by Rashwan et al. [1], Habash and Rambow [2],
and Zitouni et al. [3].
While in most cases each analysis is mapped to a different tag, sometimes more
than one analysis maps to the same tag; in such cases, we pick the first analysis.
$$\hat{t} = \operatorname*{argmax}_{t} P(t \mid w)$$
The POS tagger has been tested on the ATB test set, and it yields an accuracy of
86.8% at the tag level (including case ending).
5 Results
We adopt the same metrics used by Rashwan et al. [1], Habash and Rambow [2], and
Zitouni et al. [3]; we also use the same test set they used, so that our results can be
compared with the three systems. The metrics that were used are:
1. Count all words, including numbers and punctuation.
2. Each letter or digit in a word is a potential host for a set of diacritics.
3. Count all diacritics on a single letter as a single binary choice.
4. Non-variant diacritization (stem level) is approximated by removing all diacritics
from the final letter (Ignore Last), while counting that letter in the evaluation.
Two error rates are calculated: the diacritic error rate (DER), the proportion of letters
whose diacritics were incorrectly restored, and the word error rate (WER), the
proportion of words having at least one diacritic error.
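Under these definitions, both rates can be computed as in the sketch below; the representation of each word as an array of per-letter diacritic strings and all names are our assumptions.

public class DiacritizationErrorRates {

    // Returns {DER, WER}: DER is the fraction of letters whose restored
    // diacritics differ from the gold ones; WER is the fraction of words
    // containing at least one such letter.
    static double[] derAndWer(String[][] gold, String[][] predicted) {
        int letters = 0, letterErrors = 0, wordErrors = 0;
        for (int w = 0; w < gold.length; w++) {
            boolean wordHasError = false;
            for (int l = 0; l < gold[w].length; l++) {
                letters++;
                if (!gold[w][l].equals(predicted[w][l])) {
                    letterErrors++;
                    wordHasError = true;
                }
            }
            if (wordHasError) wordErrors++;
        }
        return new double[] { letterErrors / (double) letters,
                              wordErrors / (double) gold.length };
    }
}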
As depicted in Table 3, our system provides the best results in terms of WER at
the full-form level, and shows comparable results on the other metrics. We found that
the test set has a noticeable number of errors, such as misspelt words, colloquial
words, undiacritized words, and wrong diacritization¹. We also tested our system against
another blind test set (TestSet2) consisting of 1K sentences that we collected from
different sources and had them manually diacritized. Out of this test set, we derived
two other test sets: one for full-form diacritization and another one for morphological
diacritization. It is worth mentioning here that the way the “ignore last” metric is
handled is not always linguistically correct. In many cases, the syntactic diacritics do
not show on the last letter but rather appear on the last letter of the stem, as in
مَدْرَسَتَهُ. The actual case ending here is the Fatha appearing on the before-last
letter ت. In Table 4 below, the first two rows show the results of our system using
both the ATB test set and TestSet2. The remaining rows in the table show the results
of our system on TestSet2 after disabling the POS tagger, the case ending rules, the
NED, and the auto-corrector, respectively.
¹ The corpus was revised by a set of linguists who reported these errors.
6 Conclusion
We presented in this paper a hybrid approach for Arabic diacritics restoration. The
approach combines both rule-based and data-driven techniques. Our system is trained
and tested using the standard ATB corpus for fair comparison with other systems.
The system shows improved results over the best reported systems in terms of
full-form diacritization. As future work, we speculate that further work on POS
tagging and disambiguation techniques, such as word sense disambiguation, could
further improve our morphological diacritization.
References
1. Rashwan, M.A.A., et al.: A stochastic Arabic diacritizer based on a hybrid of factorized and
unfactorized textual features. IEEE Transactions on Audio, Speech, and Language
Processing 19, 166–175 (2011)
2. Habash, N., Rambow, O.: Arabic diacritization through full morphological tagging. In:
NAACL-Short 2007 Human Language Technologies 2007: The Conference of the North
American Chapter of the Association for Computational Linguistics; Companion Volume,
Short Papers, pp. 53–56 (2007)
3. Zitouni, I., Sorensen, J.S., Sarikaya, R.: Maximum entropy based restoration of Arabic dia-
critics. In: Proceedings of the 21st International Conference on Computational Linguistics
and 44th Annual Meeting of the ACL, pp. 577–584 (2006)
4. Habash, N., Rambow, O.: Arabic tokenization, part-of-speech tagging and morphological
disambiguation in one fell swoop. In: Proceedings of the 43rd Annual Meeting on Associa-
tion for Computational Linguistics, ACL 2005, pp. 573–580 (2005)
5. Buckwalter, T.: Issues in Arabic orthography and morphology analysis. In: Proceedings of
the COLING 2004 Workshop on Computational Approaches to Arabic Script-Based Lan-
guages, pp. 31–34 (2004)
6. Emam, O., Fisher, V.: A hierarchical approach for the statistical vowelization of Arabic
text. Tech. rep., IBM (2004)
7. Giménez, J., Màrquez, L.: SVMTool: A general POS tagger generator based on support vector
machines. In: LREC 2004, pp. 573–580 (2004)
8. Maamouri, M., Bies, A., Buckwalter, T., Mekki, W.: The Penn Arabic Treebank: Building a
large-scale annotated Arabic corpus. In: Arabic Lang. Technol. Resources Int. Conf.;
NEMLAR, Cairo, Egypt (2004)
9. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of the 7th In-
ternational Conference on Spoken Language Processing (ICSLP 2002), pp. 901–904
(2002)
10. Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling
sequence data. In: The Eighteenth International Conference on Machine Learning,
pp. 282–289 (2001)
11. Jurafsky, D., Martin, J.H.: Speech and Language Processing; an Introduction to Natural
Language Processing, Computational Linguistics, and Speech Processing. Prentice-Hall
(2000)
EDU-Based Similarity
for Paraphrase Identification
1 Introduction
However, they only consider some special kinds of data, for which the discourse
structures can be easily obtained.
Complete discourse structures, such as those in the RST Discourse Treebank (RST-DT)
[7], are difficult to obtain, though they can be very useful for paraphrase computation
[29]. In order to produce such complete discourse structures for a text, we
first segment the text into several elementary discourse units (EDUs) (the discourse
segmentation step). Each EDU may be a simple sentence or a clause in a complex
sentence. Consecutive EDUs are then put in relation with each other to create
a discourse tree (the discourse tree building step) [24]. An example of a discourse
tree with three EDUs is shown in Figure 1. Existing fully automatic discourse
parsing systems are neither robust nor very precise [3,29]. Recently, however,
several discourse segmenters with high performance have been introduced [2,19].
The discourse segmenter described in Bach et al. [2] achieves 91.0% in the F1 score
on the RST-DT corpus when using Stanford parse trees [20].
In this paper, we present a new method to compute the similarity between two
sentences based on elementary discourse units (EDU-based similarity). We first
segment two sentences into several EDUs using a discourse segmenter, which is
trained on the RST-DT corpus. These EDUs are then employed for computing
the similarity between two sentences. The key idea is that for each EDU in
one sentence, we try to find the most similar EDU in the other sentence and
compute the similarity between them. We show how our method can be applied
to the paraphrase identification task. Experimental results on the PAN corpus
[23] show that our method is effective for the task. To our knowledge, this is the
first work that employs discourse units for computing similarity as well as for
identifying paraphrases.
The rest of this paper is organized as follows. We first present related work
and our contributions in Section 2. Section 3 describes the relation between
paraphrases and discourse units. Section 4 presents our method, EDU-based
similarity. Experiments on the paraphrase identification task are described in
Section 5. Finally, Section 6 concludes the paper.
There have been many studies on the paraphrase identification task. Finch et
al. [17] use some MT metrics, including BLEU [28], NIST [13], WER [26], and
PER [22] as features for a SVM classifier. Wan et al. [36] combine BLEU features
with some others extracted from dependency relations and tree edit-distance.
They also take SVMs as the learning method to train a binary classifier. Mihal-
cea et al. [25] use pointwise mutual information, latent semantic analysis, and
WordNet to compute an arbitrary text-to-text similarity metric. Kozareva and
Montoyo [21] employ features based on the longest common subsequence (LCS), skip
n-grams, and WordNet. They use a meta-classifier composed of SVMs, k-nearest
neighbor, and maximum entropy models. Rus et al. [30] adapt a graph-based ap-
proach (originally developed for recognizing textual entailment) for paraphrase
identification. Fernando and Stevenson [16] build a matrix of word similarities
between all pairs of words in both sentences. Das and Smith [11] introduce a
probabilistic model which incorporates both syntax and lexical semantics using
quasi-synchronous dependency grammars for identifying paraphrases. Socher et
al. [33] describe a joint model that uses the features extracted from both single
words and phrases in the parse trees of the two sentences.
Most recently, Madnani et al. [23] present an investigation of the impact of
MT metrics on the paraphrase identification task. They examine 8 different MT
metrics, including BLEU [28], NIST [13], TER [31], TERP [32], METEOR [12],
SEPIA [18], BADGER [27], and MAXSIM [8], and show that a system using
nothing but some MT metrics can achieve state-of-the-art results on this task.
In our work, we also employ MT metrics as features of a paraphrase identification
system. The method of using them, however, is very different from the method
in previous work.
Discourse structures have only marginally been considered for paraphrase
computation. Regneri and Wang [29] introduce a method for collecting para-
phrases using discourse information on a special type of data, TV show episodes.
With this kind of data, they assume that discourse structures can be obtained
by taking the sentence sequences of recaps. Our work employs the recent advances
in discourse segmentation. Hernault et al. [19] present a sequence model for seg-
menting texts into discourse units using Conditional Random Fields. Bach et
al. [2] introduce a reranking model for discourse segmentation using subtree fea-
tures. Two segmenters achieve 89.0% and 91.0%, respectively, in the F1 score on
RST-DT when using Stanford parse trees.
The aim of our work is to exploit discourse information for computing para-
phrases in general texts. Our main contributions can be summarized in the
following points:
In this section, we describe the relation between paraphrases and discourse units.
We will show that discourse units are blocks which play an important role in
paraphrasing.
Figure 2 shows an example of a paraphrase sentence pair. In this example, the
first sentence can be divided into three elementary discourse units (EDUs), 1A,
1B, and 1C, and the second sentence can also be segmented into three EDUs, 2A,
2B, and 2C. Comparing these six EDUs, we can see that they make three aligned
pairs of paraphrases: 1A with 2A, 1B with 2B, and 1C with 2C. Therefore, if we
consider the first sentence to be the original sentence, the second sentence can be
created by paraphrasing each discourse unit in the original sentence.
Figure 3 shows a more complex case. The first sentence consists of four EDUs,
3A, 3B, 3C, and 3D; and the second sentence includes four EDUs, 4A, 4B, 4C,
and 4D. In this case, if we consider the first sentence to be the original one, we have
some remarks:
By analyzing paraphrase sentences, we found that discourse units are very im-
portant to paraphrasing. In many cases, a paraphrase sentence can be created
by applying the following operations to the original sentence:
An example of Operation 1 and Operation 2 is the case of units 3A, 3B, and
3C in Figure 3 (reordering 3A and 3B, and then combining 3A and 3C). Unit 3D
illustrates an example of Operation 3. The last operation is the most important
one, and it is applied to almost all discourse units.
4 EDU-Based Similarity
Motivated by the analysis of the relation between paraphrases and discourse
units, we propose a method to compute the similarity between two sentences.
Our method considers each sentence as a sequence of EDUs.
First, we present the notion of ordered similarity functions. Given two arbitrary
texts $t_1$ and $t_2$, an ordered similarity function $Sim_{ordered}(t_1, t_2)$ returns
a real score, which measures how similar $t_1$ is to $t_2$. Note that in this function,
the roles of $t_1$ and $t_2$ are different: $t_2$ can be seen as a gold standard, and
we want to evaluate $t_1$ against $t_2$. Examples of ordered similarity functions
are MT metrics, which evaluate how similar a hypothesis text ($t_1$) is to a
reference text ($t_2$).
Given an ordered similarity function Simordered, we can define the similarity
between two arbitrary texts t1 and t2 as follows:
$$Sim(t_1, t_2) = \frac{Sim_{ordered}(t_1, t_2) + Sim_{ordered}(t_2, t_1)}{2}. \qquad (1)$$
Let (s1 , s2 ) be a sentence pair, then s1 and s2 can be represented as sequences of
elementary discourse units: s1 = (e1 , e2 , . . . , em ) and s2 = (f1 , f2 , . . . , fn ), where
m and n are the numbers of discourse units in s1 and s2 , respectively. We define
an ordered similarity function between s1 and s2 as follows:
$$Sim_{ordered}(s_1, s_2) = \sum_{i=1}^{m} Imp(e_i, s_1) \cdot Sim_{ordered}(e_i, s_2) \qquad (2)$$

where $Imp(e_i, s_1)$ is the importance of unit $e_i$ in $s_1$, taken to be its length share
$|e_i|/|s_1|$, and the ordered similarity of a unit $e_i$ to the sentence $s_2$ is taken to be
its similarity to the best-matching unit of $s_2$. Substituting both into Equation (1)
yields the EDU-based similarity:

$$Sim(s_1, s_2) = \frac{Sim_{ordered}(s_1, s_2) + Sim_{ordered}(s_2, s_1)}{2} = \frac{1}{2}\sum_{i=1}^{m}\frac{|e_i|}{|s_1|}\cdot\max_{j=1}^{n} Sim_{ordered}(e_i, f_j) + \frac{1}{2}\sum_{j=1}^{n}\frac{|f_j|}{|s_2|}\cdot\max_{i=1}^{m} Sim_{ordered}(f_j, e_i). \qquad (6)$$
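A compact sketch of Equation (6), assuming an ordered similarity function such as sentence-level BLEU is supplied externally (all names are ours; for error-rate metrics such as TER, the max would be replaced by a min, as discussed in Section 5.2):

import java.util.List;
import java.util.function.BiFunction;

public class EduSimilarity {

    // One direction of Equation (6): each EDU is matched against its most
    // similar EDU in the other sentence, weighted by its length share.
    static double ordered(List<String> edus1, List<String> edus2, int len1,
                          BiFunction<String, String, Double> sim) {
        double total = 0.0;
        for (String e : edus1) {
            double best = 0.0;
            for (String f : edus2) {
                best = Math.max(best, sim.apply(e, f));
            }
            total += (wordCount(e) / (double) len1) * best;
        }
        return total;
    }

    // Symmetric EDU-based similarity of two sentences (Equation (6)).
    static double similarity(List<String> edus1, List<String> edus2,
                             BiFunction<String, String, Double> sim) {
        int len1 = edus1.stream().mapToInt(EduSimilarity::wordCount).sum();
        int len2 = edus2.stream().mapToInt(EduSimilarity::wordCount).sum();
        return 0.5 * ordered(edus1, edus2, len1, sim)
             + 0.5 * ordered(edus2, edus1, len2, sim);
    }

    static int wordCount(String edu) {
        return edu.trim().split("\\s+").length;
    }
}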
5 Experiments
This section describes our experiments on the paraphrase identification task
using EDU-based similarities as features for an SVM classifier [35]. Like the
Line | Computation
1  | s1: Or his needful holiday has come , and he is staying at a friend 's house , or is thrown into new intercourse at some health-resort .  (Length = 27)
2  | s2: Or need a holiday has come , and he stayed in the house of a friend , or disposed of in a new relationship to a health resort .  (Length = 29)
Sentence-based Similarity
3  | BLEU(s1, s2) = 0.5333
4  | BLEU(s2, s1) = 0.5330
5  | Sim(s1, s2) = (BLEU(s1, s2) + BLEU(s2, s1)) / 2 = 0.5332
Discourse Units
6  | e1: Or his needful holiday has come ,  (Length = 7)
7  | e2: and he is staying at a friend 's house ,  (Length = 10)
8  | e3: or is thrown into new intercourse at some health-resort .  (Length = 10)
9  | f1: Or need a holiday has come ,  (Length = 7)
10 | f2: and he stayed in the house of a friend ,  (Length = 10)
11 | f3: or disposed of in a new relationship to a health resort .  (Length = 12)
EDU-based Similarity
12 | BLEU(e1, f1) = 0.7143   BLEU(e1, f2) = 0.0931   BLEU(e1, f3) = 0.0699
13 | BLEU(e2, f1) = 0.1818   BLEU(e2, f2) = 0.5455   BLEU(e2, f3) = 0.0830
14 | BLEU(e3, f1) = 0.0833   BLEU(e3, f2) = 0        BLEU(e3, f3) = 0.4167
15 | EDU BLEU(s1, s2) = 7/27 * 0.7143 + 10/27 * 0.5455 + 10/27 * 0.4167 = 0.5416
16 | BLEU(f1, e1) = 0.7143   BLEU(f1, e2) = 0.1613   BLEU(f1, e3) = 0.0699
17 | BLEU(f2, e1) = 0.1000   BLEU(f2, e2) = 0.5429   BLEU(f2, e3) = 0
18 | BLEU(f3, e1) = 0.0833   BLEU(f3, e2) = 0.0833   BLEU(f3, e3) = 0.4167
19 | EDU BLEU(s2, s1) = 7/29 * 0.7143 + 10/29 * 0.5429 + 12/29 * 0.4167 = 0.5321
20 | EDU Sim(s1, s2) = (EDU BLEU(s1, s2) + EDU BLEU(s2, s1)) / 2 = 0.5369
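The computation in the table follows directly from Equation (6). The sketch below (function and variable names are ours) reproduces it; the sim argument stands for any ordered similarity function, such as a sentence-level BLEU scorer, which is assumed to be supplied externally:

```python
from typing import Callable, List

def edu_similarity(edus1: List[str], edus2: List[str],
                   sim: Callable[[str, str], float]) -> float:
    # Equation (6): symmetric EDU-based similarity with the importance
    # weight Imp(e, s) = |e| / |s|, counted in tokens of the pre-tokenized EDUs.
    def ordered(src: List[str], tgt: List[str]) -> float:
        total = sum(len(e.split()) for e in src)  # |s| as the sum of EDU lengths
        return sum((len(e.split()) / total) * max(sim(e, f) for f in tgt)
                   for e in src)
    return 0.5 * (ordered(edus1, edus2) + ordered(edus2, edus1))
```

With the six EDUs and the BLEU scores from the table, this function returns 0.5369, matching line 20.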
On average, each sentence in the PAN corpus contains about 4.3 discourse units,
and about 40.1 words in the training set and 41.1 words in the test set. We chose
this corpus for three reasons. First, it is a large corpus for detecting paraphrases.
Second, it contains many long sentences, and because our method computes
similarities based on discourse units, it is well suited to long sentences with
several EDUs. Last, according to Madnani et al. [23], the PAN corpus contains
many realistic examples of paraphrases.
We evaluated the performance of our paraphrase identification system by accuracy
and the F1 score. The accuracy was the percentage of correct predictions over
the whole test set, while the F1 score was computed based only on the paraphrase
sentence pairs1.
5.2 MT Metrics
We investigated our method with six different MT metrics (six types of ordered
similarity functions). These metrics have been shown to be effective for the task
of paraphrase identification [23].
1. BLEU [28] is the most commonly used MT metric. It computes the amount
of n-gram overlap between a hypothesis text (the output of a translation
system) and a reference text.
2. NIST [13] is a variant of BLEU using the arithmetic mean of n-gram over-
laps. Both BLEU and NIST use exact matching. They have no concept of
synonymy or paraphrasing.
3. TER [31] computes the number of edits needed to “fix” the hypothesis text
so that it matches the reference text.
4. TERP [32], or TER-Plus, is an extension of TER that utilizes phrasal sub-
stitutions, stemming, synonyms, and other improvements.
5. METEOR [12] is based on the harmonic mean of unigram precision and
recall. It also incorporates stemming, synonymy, and paraphrase.
6. BADGER [27], a language independent metric, computes a compression dis-
tance between two sentences using the Burrows Wheeler Transformation
(BWT).
Among the six MT metrics, TER and TERP compute a translation error rate
between a hypothesis text and a reference text; the smaller they are, the more
similar the two texts are. When using these metrics to compute EDU-based
similarities, we therefore replaced the max function in Equation (6) with a min
function.
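A sketch of that variant, mirroring the function above (names again ours); err is any error-rate scorer such as TER:

```python
from typing import Callable, List

def edu_error_rate(edus1: List[str], edus2: List[str],
                   err: Callable[[str, str], float]) -> float:
    # Same weighting as Equation (6), but for error-rate metrics (TER, TERP):
    # for each EDU we keep the minimum error against the EDUs on the other
    # side, instead of the maximum similarity.
    def ordered(src: List[str], tgt: List[str]) -> float:
        total = sum(len(e.split()) for e in src)
        return sum((len(e.split()) / total) * min(err(e, f) for f in tgt)
                   for e in src)
    return 0.5 * (ordered(edus1, edus2) + ordered(edus2, edus1))
```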
1 If we consider each sentence pair as an instance with label +1 for paraphrase and label -1 for non-paraphrase, the reported F1 score was the F1 score on label +1.
In all experiments, we chose SVMs [35] as the learning method to train a binary
classifier2 .
First, we investigated each individual MT metric. To see the contribution of
EDU-based similarities, we conducted experiments in two settings. In the first
setting, we applied the MT metric directly to pairs of sentences to obtain the
similarities (sentence-based similarities). In the second, we computed EDU-based
similarities in addition to the sentence-based similarities. Like Madnani et al.
[23], we used BLEU1 through BLEU4 as 4 different features and NIST1 through
NIST5 as 5 different features3. Table 3 shows the experimental results for the two
settings on the PAN corpus. We can see that adding EDU-based similarities
improved the performance of the paraphrase identification system for most of
the MT metrics, especially NIST (3.0%), BLEU (0.6%), and TER (0.3%).
Table 4 shows experimental results with multiple MT metrics on the PAN cor-
pus. For each MT metric, we computed the similarities in both ways, directly on
sentences and based on discourse units. We gradually added MT metrics to the
system one by one. After adding the TERP metric, we achieved 93.1% accuracy
and 93.0% in the F1 score. Adding the two remaining metrics, METEOR and
BADGER, did not improve the performance further.
The last two rows of Table 4 show the results of Madnani et al. [23] when using
4 MT metrics, namely BLEU, NIST, TER, and TERP (Madnani-4), and when
using all 6 MT metrics (Madnani-6)4. Compared with the best previous results,
our method improves accuracy by 0.8% and the F1 score by 0.9%, which
corresponds to a 10.4% error rate reduction. Note also that the previous work
employs a meta-classifier with three constituent classifiers (logistic regression,
SVMs, and instance-based learning), while we use only a single SVM classifier.
We also investigated our method on long and short sentences. We divided the
sentence pairs in the test set into two subsets: Subset1 (long sentences) contains
sentence pairs in which both sentences have at least 4 discourse units5, and
Subset2 (short sentences) contains the remaining sentence pairs.

2 We conducted our experiments with the LIBSVM tool [9], using the RBF kernel.
3 BLEUn and NISTn use n-grams of length up to n.
4 Madnani et al. [23] show that adding more MT metrics does not improve the performance of the paraphrase identification system.

Table 5 shows the information and experimental results on the two subsets.
Subset1 consists of 1317 sentence
pairs (on average, 6.5 EDUs and 56.6 words per sentence), while Subset2 consists
of 1683 sentence pairs (on average, 2.6 EDUs and 27.2 words per sentence). We
can see that our method was particularly effective on the long sentences,
achieving 96.6% accuracy and 94.8% in the F1 score, compared with 90.4%
accuracy and 92.3% in the F1 score on the short sentences.
6 Conclusion
In this paper, we proposed a new method to compute the similarity between
two sentences based on elementary discourse units, EDU-based similarity. This
method was motivated by the analysis of the relation between paraphrases and
discourse units. By analyzing examples of paraphrases, we found that discourse
units play an important role in paraphrasing. We applied EDU-based similarity
to the task of paraphrase identification. Experimental results on the PAN corpus
showed the effectiveness of the proposed method. To the best of our knowledge,
this is the first work to employ discourse units for computing similarity as well as
for identifying paraphrases. Although our method is proposed for computing the
similarity between two sentences, it can also be used to compute the similarity
between two arbitrary texts.
In the future, we would like to apply our method to other datasets for the
paraphrase identification task as well as to other related tasks such as recognizing
textual entailment [5] and semantic textual similarity [1]. Another direction is
to improve the method of computing similarity, especially how to evaluate the
importance of each discourse unit.

5 The number 4 was chosen because, on average, each sentence contains about 4 EDUs (see Table 2).
References
1. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A.: SemEval-2012 Task 6: A Pilot
on Semantic Textual Similarity. In: Proceedings of SemEval, pp. 385–393 (2012)
2. Bach, N.X., Minh, N.L., Shimazu, A.: A Reranking Model for Discourse Segmen-
tation using Subtree Features. In: Proceedings of SIGDIAL, pp. 160–168 (2012)
3. Bach, N.X., Le Minh, N., Shimazu, A.: UDRST: A Novel System for Unlabeled Dis-
course Parsing in the RST Framework. In: Isahara, H., Kanzaki, K. (eds.) JapTAL
2012. LNCS (LNAI), vol. 7614, pp. 250–261. Springer, Heidelberg (2012)
4. Barzilay, R., McKeown, K.R., Elhadad, M.: Information Fusion in the Context of
Multi-Document Summarization. In: Proceedings of ACL, pp. 550–557 (1999)
5. Bentivogli, L., Dagan, I., Dang, H.T., Giampiccolo, D., Magnini, B.: The fifth
Pascal Recognizing Textual Entailment Challenge. In: Proceedings of TAC (2009)
6. Callison-Burch, C., Koehn, P., Osborne, M.: Improved Statistical Machine Trans-
lation Using Paraphrases. In: Proceedings of NAACL, pp. 17–24 (2006)
7. Carlson, L., Marcu, D., Okurowski, M.E.: RST Discourse Treebank. Linguistic
Data Consortium (LDC) (2002)
8. Chan, Y.S., Ng, H.T.: MAXSIM: A Maximum Similarity Metric for Machine Trans-
lation Evaluation. In: Proceedings of ACL-HLT, pp. 55–62 (2008)
9. Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines. ACM
Transactions on Intelligent Systems and Technology 2(3), 27:1-27:27 (2011)
10. Corley, C., Mihalcea, R.: Measuring the Semantic Similarity of Texts. In: Proceed-
ings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and
Entailment, pp. 13–18 (2005)
11. Das, D., Smith, N.A.: Paraphrase Identification as Probabilistic Quasi-Synchronous
Recognition. In: Proceedings of ACL-IJCNLP, pp. 468–476 (2009)
12. Denkowski, M., Lavie, A.: Extending the METEOR Machine Translation Metric
to the Phrase Level. In: Proceedings of NAACL, pp. 250–253 (2010)
13. Doddington, G.: Automatic Evaluation of Machine Translation Quality using N-
gram Co-occurrence Statistics. In: Proceedings of the 2nd International Conference
on Human Language Technology Research, pp. 138–145 (2002)
14. Dolan, B., Quirk, C., Brockett, C.: Unsupervised Construction of Large Paraphrase
Corpora: Exploiting Massively Parallel News Sources. In: Proceedings of COLING,
pp. 350–356 (2004)
15. Duboue, P.A., Chu-Carroll, J.: Answering the Question You Wish They had Asked:
The Impact of Paraphrasing for Question Answering. In: Proceedings of NAACL,
pp. 33–36 (2006)
16. Fernando, S., Stevenson, M.: A Semantic Similarity Approach to Paraphrase De-
tection. In: Proceedings of CLUK (2008)
17. Finch, A., Hwang, Y.S., Sumita, E.: Using Machine Translation Evaluation Tech-
niques to Determine Sentence-level Semantic Equivalence. In: Proceedings of the
3rd International Workshop on Paraphrasing, pp. 17–24 (2005)
18. Habash, N., Kholy, A.E.: SEPIA: Surface Span Extension to Syntactic Dependency
Precision-based MT Evaluation. In: Proceedings of the Workshop on Metrics for
Machine Translation at AMTA (2008)
19. Hernault, H., Bollegala, D., Ishizuka, M.: A Sequential Model for Discourse Seg-
mentation. In: Gelbukh, A. (ed.) CICLing 2010. LNCS, vol. 6008, pp. 315–326.
Springer, Heidelberg (2010)
20. Klein, D., Manning, C.: Accurate Unlexicalized Parsing. In: Proceedings of ACL,
pp. 423–430 (2003)
21. Kozareva, Z., Montoyo, A.: Paraphrase Identification on the Basis of Supervised
Machine Learning Techniques. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala,
T. (eds.) FinTAL 2006. LNCS (LNAI), vol. 4139, pp. 524–533. Springer, Heidelberg
(2006)
22. Leusch, G., Ueffing, N., Ney, H.: A Novel String-to-String Distance Measure with
Applications to Machine Translation Evaluation. In: Proceedings of MT Summit
IX (2003)
23. Madnani, N., Tetreault, J., Chodorow, M.: Re-examining Machine Translation Met-
rics for Paraphrase Identification. In: Proceedings of NAACL-HLT, pp. 182–190
(2012)
24. Mann, W.C., Thompson, S.A.: Rhetorical Structure Theory. Toward a Functional
Theory of Text Organization. Text 8, 243–281 (1988)
25. Mihalcea, R., Corley, C., Strapparava, C.: Corpus-based and Knowledge-based
Measures of Text Semantic Similarity. In: Proceedings of AAAI, pp. 775–780 (2006)
26. Niessen, S., Och, F.J., Leusch, G., Ney, H.: An Evaluation Tool for Machine Trans-
lation: Fast Evaluation for MT Research. In: Proceedings of LREC (2000)
27. Parker, S.: BADGER: A New Machine Translation Metric. In: Proceedings of the
Workshop on Metrics for Machine Translation at AMTA (2008)
28. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A Method for Automatic
Evaluation of Machine Translation. In: Proceedings of ACL, pp. 311–318 (2002)
29. Regneri, M., Wang, R.: Using Discourse Information for Paraphrase Extraction.
In: Proceedings of EMNLP-CONLL, pp. 916–927 (2012)
30. Rus, V., McCarthy, P.M., Lintean, M.C., McNamara, D.S., Graesser, A.C.: Para-
phrase Identification with Lexico-Syntactic Graph Subsumption. In: Proceedings
of FLAIRS Conference, pp. 201–206 (2008)
31. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A Study of Trans-
lation Edit Rate with Targeted Human Annotation. In: Proceedings of the Con-
ference of the Association for Machine Translation in the Americas, AMTA (2006)
32. Snover, M., Madnani, N., Dorr, B., Schwartz, R.: TER-Plus: Paraphrase, Seman-
tic, and Alignment Enhancements to Translation Edit Rate. Machine Transla-
tion 23(2-3), 117–127 (2009)
33. Socher, R., Huang, E.H., Pennington, J., Ng, A.Y., Manning, C.D.: Dynamic Pool-
ing and Unfolding Recursive Autoencoders for Paraphrase Detection. In: Advances
in Neural Information Processing Systems 24 (NIPS), pp. 801–809 (2011)
34. Uzuner, O., Katz, B., Nahnsen, T.: Using Syntactic Information to Identify Plagia-
rism. In: Proceedings of the 2nd Workshop on Building Educational Applications
using Natural Language Processing, pp. 37–44 (2005)
35. Vapnik, V.N.: Statistical Learning Theory. Wiley Interscience (1998)
36. Wan, S., Dras, M., Dale, R., Paris, C.: Using Dependency-Based Features to Take
the “Para-farce” out of Paraphrase. In: Proceedings of the 2006 Australasian Lan-
guage Technology Workshop, pp. 131–138 (2006)
Exploiting Query Logs and Field-Based Models
to Address Term Mismatch in an HIV/AIDS
FAQ Retrieval System
1 Introduction
We have developed an automated SMS-based HIV/AIDS FAQ retrieval system
that can be queried by users to provide answers to HIV/AIDS-related questions.
The system uses, as its information source, the full HIV/AIDS FAQ question-
answer booklet provided by the Ministry of Health (MOH) in Botswana for
its IPOLETSE1 call centre. This FAQ question-answer booklet is made up of
205 question-answer pairs organised into eleven chapters of varying sizes. For
example, there is a chapter on “Nutrition, Vitamins and HIV/AIDS” and a
chapter on “Men and HIV/AIDS”. Below is an example of a question-answer
pair entry that can be found in Chapter Eight, “Introduction to ARV Therapy”:
1 http://www.hiv.gov.bw/content/ipoletse
2 Related Work
Closely related to our work are the document expansion approach proposed in
[2,17] and the query expansion approach in [1]. The document expansion approach
proposed by Billerbeck and Zobel [2] yielded unpromising results, which might be
partly due to the fact that the expansion terms were selected automatically
without using the actual query relevance judgements; this might have resulted in
the wrong terms being used to expand the documents. In this work, we rely on the
query relevance judgements to avoid linking query terms to irrelevant FAQ documents.
In Web IR, there is the notion of document fields and this provides a way to
incorporate the structure of a document in the retrieval process [16]. For example,
the contents of different HTML tags (e.g. anchor text, title, body) are often used
to represent different document fields [13,16]. Earlier work by [10] has shown
that combining evidence from different fields in Web retrieval improves retrieval
performance. In this paper, we split each FAQ document, made up of question-
answer pairs, into a QUESTION and an ANSWER field. We then introduce
a third field, FAQLog, which we use to add additional terms from queries for
which the true relevant FAQ documents are known. We aim to solve the term
mismatch problem in our FAQ retrieval system by combining evidence from
these three fields.
We will evaluate the proposed approach using two different enrichment strate-
gies. First, we enrich the FAQ documents using all the terms from a query log.
In this approach, all the queries from the training set for which the true relevant
FAQ documents are known are added to the newly introduced FAQLog field,
as shown in Table 1. In other words, if an FAQ document is known to be relevant
to a query, then this query is added to its FAQLog field. For the remainder of
this paper we will refer to this approach as the Term Frequency approach. In
the second approach, we enrich the FAQ documents using term occurrences
from a query log. Here, all the unique terms from the training-set queries for
which the true relevant FAQ documents are known are added to the FAQLog
field, as shown in Table 2. In other words, only new query terms that do not
already appear in the FAQLog field are added to that field. For the remainder of
this paper we will refer to this approach as the Term Occurrence approach. We
apply field-based weighting models to the enriched FAQ documents using
PL2F [10] and BM25F [16]. A sketch of the two strategies is given below.
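A minimal sketch of the two enrichment strategies (function and variable names are ours, not from the system itself):

```python
from typing import Dict, List

def enrich_faqlog(faq_to_queries: Dict[str, List[str]],
                  strategy: str = "frequency") -> Dict[str, List[str]]:
    # faq_to_queries maps an FAQ document id to the training queries for
    # which it is the true relevant document.
    faqlog: Dict[str, List[str]] = {}
    for doc_id, queries in faq_to_queries.items():
        terms: List[str] = []
        for q in queries:
            for t in q.lower().split():
                # Term Frequency keeps every occurrence; Term Occurrence
                # adds each distinct term at most once.
                if strategy == "frequency" or t not in terms:
                    terms.append(t)
        faqlog[doc_id] = terms
    return faqlog
```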
The main difference between the two enrichment approaches is that the fre-
quencies with which users use some rare terms in specific FAQ documents can be
captured only under the Term Frequency approach. For example, under
the Term Frequency approach (Table 1), the term frequencies of the terms gender
and infected in the FAQLog field are gender = 2 and infected = 2. Under the
Term Occurrence approach (Table 2), the term frequencies of these terms are 1,
because under this approach query terms can only be added to the field once,
even if they appear in many queries. Since both BM25F and PL2F rely on term
frequencies to calculate the final retrieval score of a relevant document given a
query, our two enrichment strategies will always give different retrieval scores.
We will investigate the usefulness of each enrichment approach in Section 5.
5 Experimental Description
For all our experimental evaluation, we used Terrier-3.52 [12], an open source
Information Retrieval (IR) platform. All the FAQ documents used in this study
were pre-processed before indexing, which involved tokenising the text and
stemming each token using the full Porter [14] stemming algorithm. To filter
out terms that appear in many FAQ documents, we did not use a stopword
list during the indexing and retrieval process; instead, we ignored the terms
that had a low Inverse Document Frequency (IDF) when scoring the documents.
Indeed, all the terms with term frequency higher than the number of FAQ
documents (205) were considered to be low-IDF terms. Earlier work in [9] has
shown that stopword removal using a stopword list from various IR platforms
like Terrier-3.5 can affect retrieval performance in SMS-based FAQ retrieval. The
normalisation parameter for BM25 was set to its default value of b = 0.75. For
BM25F, the normalisation parameter of each field was also set to 0.75, i.e.,
(b.0 = 0.75, b.1 = 0.75, b.2 = 0.75), representing the normalisation parameters
for the QUESTION, ANSWER and FAQLog fields respectively. For PL2, the
normalisation parameter was set to its default value of c = 1. For PL2F, the
normalisation parameters were set to (c.0 = 1.0, c.1 = 1.0, c.2 = 1.0), again
representing the QUESTION, ANSWER and FAQLog fields respectively. A
sketch of the BM25F field combination is given below.
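As an illustration of how BM25F combines the three fields, the following sketch follows the simple per-field normalisation of [16]; the parameter names mirror those above, but the code is our own illustration, not Terrier's implementation:

```python
import math
from typing import Dict

def bm25f_tfn(tf: Dict[str, float], field_len: Dict[str, float],
              avg_len: Dict[str, float], w: Dict[str, float],
              b: Dict[str, float]) -> float:
    # Combined, per-field length-normalised term frequency: each field's raw
    # frequency tf[f] is normalised by its field length before the weighted
    # fields (QUESTION, ANSWER, FAQLog) are summed.
    return sum(
        w[f] * tf[f] / (1.0 + b[f] * (field_len[f] / avg_len[f] - 1.0))
        for f in tf
    )

def bm25f_term(tfn: float, n_docs: int, df: int, k1: float = 1.2) -> float:
    # Saturation (k1) and idf are applied once, on the combined frequency.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return idf * tfn / (k1 + tfn)
```

Because the field weights w scale the term frequencies before saturation, the two enrichment strategies (which produce different FAQLog term frequencies) always yield different retrieval scores, as noted above.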
Table 3. Examples of some of the web pages that were crawled from the web to use
as an external collection for query expansion using collection enrichment approach
The expansion terms, together with the original query terms, were then used for
retrieval on the non-enriched FAQ document collection.
Table 4. The mean and standard deviation for the field weights (w.1 = 1)
[Figure 1 plots: left panel, MRR contour of the Question Field weight (w.0) against the Answer Field weight (w.1) with the FAQLog Field weight fixed (w.2 = 4); right panel, MRR contour of the Question Field weight (w.0) against the FAQLog Field weight (w.2) with the Answer Field weight fixed (w.1 = 1).]
Fig. 1. The marked area denotes the region of highest MRR in relation to field weights
w.0, w.1 and w.2 in these particular contour plots, which were chosen randomly from
our results. The higher MRR values for all the other random splits lie inside the dotted
rectangles.
We point out that these values were averaged taking into consideration that small
changes in the parameter values of these models are known to produce only small
changes in retrieval effectiveness [15]. Our analysis of the various contour plots
also shows that the mean field weights in Table 4 lie within the region of higher
MRR values bounded by the dotted rectangle in Figure 1(b) for all the
training samples.
Table 5. The mean retrieval performance for each collection. There is a significant
improvement in MRR and Recall when the FAQ documents are enriched with queries,
compared with non-enriched FAQ documents, as denoted by ∗ (t-test, p < 0.05). There
was also a significant improvement in MRR and Recall when field weights were
optimised, compared with non-optimised field weights, as denoted by ∗∗ (t-test, p < 0.05).
Test Evaluation | Collection | Enrichment Strategy | Weighting Model | Field Weights (w.1 = 1) | MRR | MAP | Recall
EXV 1 | Q(Only) | No Enrichment | BM25F/BM25 | w.0 = 1 | 0.4312 | 0.2197 | 0.2495
EXV 1 | Q and A | No Enrichment | BM25F/BM25 | w.0 = 1 | 0.4106 | 0.2302 | 0.2380
EXV 4 | Q(Only) and QE | Query Expansion | BM25F/BM25 | w.0 = 1 | 0.4162 | 0.2022 | 0.2528
EXV 4 | Q, A and QE | Query Expansion | BM25F/BM25 | w.0 = 1 | 0.4317 | 0.2692 | 0.2974
EXV 1 and EXV 3 | Q, A and 200SMS | Term Occurrence | BM25F/BM25 | w.0 = 1, w.2 = 1 | 0.6120 | 0.4878 | 0.4951*
EXV 1 and EXV 3 | Q, A and 400SMS | Term Occurrence | BM25F/BM25 | w.0 = 1, w.2 = 1 | 0.6614 | 0.4913 | 0.5466*
EXV 1 and EXV 3 | Q, A and 600SMS | Term Occurrence | BM25F/BM25 | w.0 = 1, w.2 = 1 | 0.6608 | 0.5039 | 0.5924*
EXV 2 and EXV 3 | Q, A and 200SMS | Term Occurrence | BM25F | w.0 = 5.98, w.2 = 5.94 | 0.6774 | 0.5741 | 0.6772**
EXV 2 and EXV 3 | Q, A and 400SMS | Term Occurrence | BM25F | w.0 = 5.98, w.2 = 5.94 | 0.6692 | 0.5867 | 0.7089**
EXV 2 and EXV 3 | Q, A and 600SMS | Term Occurrence | BM25F | w.0 = 5.98, w.2 = 5.94 | 0.6666 | 0.5935 | 0.7009**
EXV 1 and EXV 3 | Q, A and 200SMS | Term Frequency | BM25F/BM25 | w.0 = 1, w.2 = 1 | 0.6492 | 0.5146 | 0.5327*
EXV 1 and EXV 3 | Q, A and 400SMS | Term Frequency | BM25F/BM25 | w.0 = 1, w.2 = 1 | 0.6833 | 0.5491 | 0.5765*
EXV 1 and EXV 3 | Q, A and 600SMS | Term Frequency | BM25F/BM25 | w.0 = 1, w.2 = 1 | 0.6921 | 0.5435 | 0.6043*
EXV 2 and EXV 3 | Q, A and 200SMS | Term Frequency | BM25F | w.0 = 4.02, w.2 = 6.98 | 0.6847 | 0.6035 | 0.6902**
EXV 2 and EXV 3 | Q, A and 400SMS | Term Frequency | BM25F | w.0 = 4.02, w.2 = 6.98 | 0.7179 | 0.6455 | 0.7546**
EXV 2 and EXV 3 | Q, A and 600SMS | Term Frequency | BM25F | w.0 = 4.02, w.2 = 6.98 | 0.7315 | 0.6747 | 0.7484**
EXV 1 | Q(Only) | No Enrichment | PL2F/PL2 | w.0 = 1 | 0.4526 | 0.2720 | 0.2545
EXV 1 | Q and A | No Enrichment | PL2F/PL2 | w.0 = 1 | 0.4106 | 0.2438 | 0.2711
EXV 4 | Q(Only) and QE | Query Expansion | PL2F/PL2 | w.0 = 1 | 0.4297 | 0.2552 | 0.2815
EXV 4 | Q, A and QE | Query Expansion | PL2F/PL2 | w.0 = 1 | 0.4430 | 0.2627 | 0.2764
EXV 1 and EXV 3 | Q, A and 200SMS | Term Occurrence | PL2F/PL2 | w.0 = 1, w.2 = 1 | 0.6068 | 0.5074 | 0.5841*
EXV 1 and EXV 3 | Q, A and 400SMS | Term Occurrence | PL2F/PL2 | w.0 = 1, w.2 = 1 | 0.6310 | 0.5272 | 0.6168*
EXV 1 and EXV 3 | Q, A and 600SMS | Term Occurrence | PL2F/PL2 | w.0 = 1, w.2 = 1 | 0.6831 | 0.5413 | 0.6340*
EXV 2 and EXV 3 | Q, A and 200SMS | Term Occurrence | PL2F | w.0 = 6.68, w.2 = 5.74 | 0.6766 | 0.5866 | 0.6950**
EXV 2 and EXV 3 | Q, A and 400SMS | Term Occurrence | PL2F | w.0 = 6.68, w.2 = 5.74 | 0.6938 | 0.6093 | 0.7188**
EXV 2 and EXV 3 | Q, A and 600SMS | Term Occurrence | PL2F | w.0 = 6.68, w.2 = 5.74 | 0.7004 | 0.6187 | 0.7465**
EXV 1 and EXV 3 | Q, A and 200SMS | Term Frequency | PL2F/PL2 | w.0 = 1, w.2 = 1 | 0.6213 | 0.5432 | 0.5941*
EXV 1 and EXV 3 | Q, A and 400SMS | Term Frequency | PL2F/PL2 | w.0 = 1, w.2 = 1 | 0.6580 | 0.5535 | 0.6268*
EXV 1 and EXV 3 | Q, A and 600SMS | Term Frequency | PL2F/PL2 | w.0 = 1, w.2 = 1 | 0.6990 | 0.5848 | 0.6484*
EXV 2 and EXV 3 | Q, A and 200SMS | Term Frequency | PL2F | w.0 = 5.53, w.2 = 7.04 | 0.6701 | 0.6134 | 0.7246**
EXV 2 and EXV 3 | Q, A and 400SMS | Term Frequency | PL2F | w.0 = 5.53, w.2 = 7.04 | 0.7112 | 0.6515 | 0.7585**
EXV 2 and EXV 3 | Q, A and 600SMS | Term Frequency | PL2F | w.0 = 5.53, w.2 = 7.04 | 0.7254 | 0.6892 | 0.7713**
This has some disadvantages as some queries might be expanded with irrelevant
terms. Despite these disadvantages, a slight gain in MRR and recall was
observed when the question and answer fields were used and query expansion
was applied. However, there was a decrease in retrieval performance when only the
question field was used, suggesting that the terms from the external collection
might be adding noise to the original query.
7 Conclusions
References
1. Billerbeck, B., Scholer, F., Williams, H.E., Zobel, J.: Query Expansion using As-
sociated Queries. In: Proc. of CIKM (2003)
2. Billerbeck, B., Zobel, J.: Document Expansion Versus Query Expansion For Ad-hoc
Retrieval. In: Proc. of ADCS (2005)
3. Fang, H.: A Re-examination of Query Expansion Using Lexical Resources. In: Proc.
ACL:HLT (2008)
4. Hammond, K., Burke, R., Martin, C., Lytinen, S.: FAQ Finder: A Case-Based
Approach to Knowledge Navigation. In: Proc. of CAIA (1995)
5. Jeon, J., Croft, W.B., Lee, J.H.: Finding Similar Questions in Large Question and
Answer Archives. In: Proc. of CIKM (2005)
6. Kim, H., Lee, H., Seo, J.: A Reliable FAQ Retrieval System Using a Query Log
Classification Technique Based on Latent Semantic Analysis. Info. Process. and
Manage. 43(2), 420–430 (2007)
7. Kim, H., Seo, J.: High-Performance FAQ Retrieval Using an Automatic Clustering
Method of Query Logs. Info. Process. and Manage. 42(3), 650–661 (2006)
8. Kwok, K.L., Chan, M.: Improving Two-Stage Ad-hoc Retrieval for Short Queries.
In: Proc. of SIGIR (1998)
9. Leveling, J.: On the Effect of Stopword Removal for SMS-Based FAQ Retrieval.
In: Bouma, G., Ittoo, A., Métais, E., Wortmann, H. (eds.) NLDB 2012. LNCS,
vol. 7337, pp. 128–139. Springer, Heidelberg (2012)
10. Macdonald, C., Plachouras, V., He, B., Lioma, C., Ounis, I.: University of Glasgow
at WebCLEF 2005: Experiments in Per-Field Normalisation and Language Specific
Stemming. In: Proc. of CLEF (2006)
11. Moreo, A., Navarro, M., Castro, J.L., Zurita, J.M.: A High-Performance FAQ Re-
trieval Method Using Minimal Differentiator Expressions. Know. Based Syst. 36,
9–20 (2012)
12. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: A
High Performance and Scalable Information Retrieval Platform. In: Proc. of OSIR
at SIGIR (2006)
13. Plachouras, V., Ounis, I.: Multinomial Randomness Models for Retrieval with Doc-
ument Fields. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS,
vol. 4425, pp. 28–39. Springer, Heidelberg (2007)
14. Porter, M.F.: An Algorithm for Suffix Stripping. Program: Elec. Lib. Info. Syst. 14(3),
130–137 (1980)
15. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and
Beyond. Found. Trends Info. Retr. 3(4), 333–389 (2009)
16. Robertson, S., Zaragoza, H., Taylor, M.: Simple BM25 Extension to Multiple
Weighted Fields. In: Proc. of CIKM (2004)
17. Singhal, A., Pereira, F.: Document Expansion for Speech Retrieval. In: Proc. of
SIGIR (1999)
18. Sneiders, E.: Automated FAQ Answering: Continued Experience with Shallow Lan-
guage Understanding. Question Answering Systems. In: Proc. of AAAI Fall Symp.
(1999)
19. Sneiders, E.: Automated FAQ Answering with Question-Specific Knowledge Rep-
resentation for Web Self-Service. In: Proc. of HSI (2009)
20. Voorhees, E.M.: Query Expansion Using Lexical-Semantic Relations. In: Proc. of
SIGIR, pp. 61–69 (1994)
21. Whitehead, S.D.: Auto-FAQ: an Experiment in Cyberspace Leveraging. Comp.
Net. and ISDN Syst. 28(1-2), 137–146 (1995)
22. Xue, X., Jeon, J., Croft, W.B.: Retrieval Models for Question and Answer Archives.
In: Proc. of SIGIR (2008)
Exploring Domain-Sensitive Features
for Extractive Summarization
in the Medical Domain
1 Introduction
As the problem of information overload grows and the amount of information
available on the internet increases, text summarization is becoming an increas-
ingly important Natural Language Processing (NLP) task. The objective of
automatically summarizing a text document is to provide a shorter version of the document,
which typically has 10-30% of the original text length [1] and still contains the
most important information in a coherent form. In contrast, manual summariza-
tion is costly and time-consuming.
In spite of extensive research on automatic text summarization, there is still
limited research on domain-specific summarization and on domain-adaptation of
summarization for domains such as the medical, chemical, or legal domain. This
paper describes experiments for adapting a generic summarizer to document
summarization in the medical domain. The summarizer produces an extractive
summary consisting of the most important sentences in the original document.
We employ various features that have been described in previous research for
generic summarizers to form a baseline summarizer. The main contribution of
this paper is the investigation of simple domain-specific and semantic features
for adapting summarization to the medical domain.
For domain adaptation, we extend the generic summarizer with domain-
specific features. We investigate whether medical or semantic features or their
combination can contribute to improve the produced summaries in the medical
domain and automatically evaluate the quality of the summaries. We perform
summarization experiments on documents from the medical domain and analyze
the results in detail, by measuring the impact of each individual domain-specific
feature. Then we conduct experiments using a combination of all features and
measure the improvement of system accuracy when using additional features.
Finally, we measure the ROUGE-N score to automatically evaluate the summa-
rization quality with reference to the reference summary from the document. We
obtain 84.08% accuracy on balanced training data.
The rest of this paper is organized as follows: Related work is briefly reviewed
in Section 2. In Section 3 we introduce our text summarizer. The evaluation
approach is presented in Section 4, followed by an analysis and discussion of
results in Section 5. The paper concludes with Section 6.
2 Related Work
Lin et al. [9] used SUMMARIST [1] to compare the effects of eighteen differ-
ent features on summarization. These features are also included in our summa-
rizer: proper names, date/time terms, pronouns, prepositions and quotes. They
used the machine learning algorithm C4.5 to evaluate single features as well as
an optimal feature combination. Comparing the performance of the individual
features with that of the combination showed that the feature combination
achieved the best F-score.
Many automatic summarization techniques have been introduced since
the first research on using machine learning techniques [10]. However, later work
concentrates essentially on comparing different learning algorithms and on how
to categorize feature classes [6].
Evaluation of summarization is notoriously difficult. Summarization tasks
such as DUC1 or the Text Analysis Conference (TAC) encouraged research by
providing large document corpora and summaries. However, judgments for eval-
uation typically relied on human annotators or assessors. The INEX2 task for
evaluating retrieval of snippets (short extracts from documents), introduced in
2011, is also related to summarization. This task aims at evaluating the snippet
extraction to investigate if a user can understand the content of a document
without reading the full document.
There are several summarization systems that have dealt with summarization
of documents in the medical domain. Yang et al. [11] built and evaluated a query-
based automatic summarizer on the domain of mouse genes studied in micro-
array experiments. Their system implemented sentence extraction following the
approach proposed by Edmundson [3]. However, before ranking sentences by
aggregating features such as special keywords, sentence length, and cue phrases,
the gene set was clustered into groups based on free text, and MeSH and GO
terms belonging to a gene ontology. They used Medline abstracts to evaluate the
ranked sentences of the summary output.
The Technical Article Summarizer (TAS) [12] automatically generates a sum-
mary that is suitable for patient characteristics when the input to the system is
a patient record and journal articles. This helps physicians or medical experts
to easily find information relevant to the patient’s situation.
Our REZIME Summarizer system utilizes machine learning to automatically
determine the importance of a sentence based on features. We implemented a
large set of sentence features that have been described in previous research, but
we use a different approach to incorporate these features in a training model. In
this paper, we are particularly interested in how these established and proven,
but generic features can be extended for domain-adaptation. We think that
domain-specific knowledge and semantic information will help to adapt the sum-
marization to the domain we chose for our experiments, the medical domain. We
employed a subset of a collection of documents from BioMed Central (BMC)3 ,
which contain the (reference) abstracts and the full body of texts.
1 http://www.nist.gov/tac/
2 https://inex.mmci.uni-saarland.de/
3 http://www.biomedcentral.com/info/about/datamining/
3 REZIME Summarizer
The objective of the REZIME summarizer4 is to select the most important sen-
tences as a summary that represents the original text.
6 For simplicity, we refer to sentences and words, but the description can be generalized to include text fragments such as phrases.
Non Term Checking Features: The Non Term Checking Feature group in-
cludes more complex features compared to the above group. For a more detailed
description of these features, the reader is again referred to the original literature.
– Cluster Keyword Feature: considers two significant words as related if they
are separated by not more than five insignificant words. Important sentences
will have large clusters of significant words; proposed by Luhn [2].
– The Global Bushy Feature: generates inter-document links based on similar-
ity of paragraphs; paragraphs with many links share vocabulary with many
other paragraphs and are important; proposed by Salton [14].
– Number of Terms/Sentence Length Feature: The number of terms in a sen-
tence, assuming that too long or too short sentences are unimportant for a
summary [3].
– Skimming Feature: The position of a sentence in a paragraph. The under-
lying assumption is that sentences occurring early in a paragraph are more
important for a summary [13,15].
– TS-ISF Feature: similar to TF-IDF, but works on the sentence level: every
sentence is treated like a document. Sentences that contain many keywords
are likely to be included in the summary [16,13] (see the sketch after this list).
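A minimal sketch of the TS-ISF feature (our own illustration; it assumes pre-tokenized sentences, and that the scored sentence is itself drawn from the document's sentence list, so every term occurs in at least one sentence):

```python
import math
from collections import Counter
from typing import List

def ts_isf(sentence: List[str], sentences: List[List[str]]) -> float:
    # TS-ISF: like TF-IDF, but each sentence is treated as a document.
    # Sums tf(t) * log(N / sf(t)) over terms t in the sentence, where sf(t)
    # is the number of sentences in the document containing t.
    n = len(sentences)
    tf = Counter(sentence)
    return sum(
        freq * math.log(n / sum(1 for s in sentences if t in s))
        for t, freq in tf.items()
    )
```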
Semantic Features. The lexical database WordNet7 has a high coverage of terms
from medicine and biology (e.g. names of diseases or drugs). Thus, we define
semantic features based on WordNet.

7 http://wordnet.princeton.edu/
4 Experiments
In this section, we describe the experiments using different machine learning
algorithms to classify sentences and the final evaluation of the summarizer.
Each sentence is labeled (class 1 vs. class 0) according to its term overlap with
the sentences of the reference abstract:

T_ovl(s1, s2) = |s1 ∩ s2| / max(|s1|, |s2|)    (1)
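A direct reading of Equation (1) as code (our own sketch; sentences are represented as sets of terms and assumed non-empty):

```python
from typing import Set

def term_overlap(s1: Set[str], s2: Set[str]) -> float:
    # Equation (1): shared terms, normalised by the longer of the two
    # sentences, so identical sentences score 1 and disjoint ones 0.
    # Assumes both sentences contain at least one term.
    return len(s1 & s2) / max(len(s1), len(s2))
```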
An initial filtering step aims at generating training data of high quality: docu-
ments which do not contain at least one sentence in class 1 were discarded.
We randomly selected 2,000 medical-domain articles (101 Megabytes); after
filtering, 1,263 documents remained as training data.
The number of sentences with a low term overlap score is always much larger
than the number with a high term overlap score. This forms our first training set with
unbalanced data. We also generated a balanced training set for training the
classifier, i.e. where the number of class 1 instances is closer to or equal to the
number of class 0 instances.
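The paper does not state how the balanced and unbalanced sets were sampled; one plausible sketch (ours) is to downsample the class 0 pool:

```python
import random
from typing import List, Tuple

def make_training_set(pos: List[Tuple], neg: List[Tuple],
                      ratio: float = 1.0, seed: int = 0) -> List[Tuple]:
    # Downsample the class-0 instances to `ratio` times the class-1 instances:
    # ratio = 1 gives a balanced set; ratios 2, 4 and 6 reproduce the sizes of
    # the unbalanced sets in Table 1.
    rng = random.Random(seed)
    k = min(len(neg), int(ratio * len(pos)))
    return pos + rng.sample(neg, k)
```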
Table 1. Training results on balanced and unbalanced data for IB1, Naive Bayes (NB),
Logistic Regression (LR), and Bayes Decision Tree (BDT)
Accuracy [%]
Training Data Instances (1/0) IB1 NB LR BDT
Balanced 15,108 (7,554/7,554) 84.16 69.08 76.05 82.71
Unbalanced 1 22,662 (7,554/15,108) 84.39 72.34 77.96 80.73
Unbalanced 2 37,770 (7,554/30,216) 87.18 76.65 83.31 84.31
Unbalanced 3 52,878 (7,554/45,324) 89.12 78.58 86.80 85.18
[Figure: two panels plotting scores against the training instance (0/1) ratio, for ratios from 1 to 16.]
We also report the compression ratio of generated summaries for each setting
in Figure 3. This figure shows that with increasingly more unbalanced data, the
length of the produced summary decreases, i.e. balanced training data would
produce a more even distribution of sentences in class 0 and class 1 (almost 0.5),
which leads to a longer summary.
Our evaluation results show that, surprisingly, training on unbalanced data yields
a higher accuracy than training on balanced data. This can at least in
part be explained by the fact that with larger training data, more instances
of the dominating class are classified correctly. The automatic evaluation ap-
proach shows that considerably good accuracy can be achieved compared to
evaluations based on costly manual judgments.
The additional features for domain adaptation improve summarization per-
formance (accuracy) on the training data. Interestingly, the semantic features
improve the performance more than the medical term features (+6.15% vs.
+1.83%). Extending REZIME's feature set with both the semantic and the
medical term features yields the best performance.
In our ROUGE evaluation, when training with balanced data or with unbal-
anced training data with a low ratio of class 0 to class 1 instances, the F-score is
very low because the generated summary is longer than the abstract of the doc-
ument. In contrast, increasing the percentage of non-relevant sentences makes
the generated summary shorter and more accurate. This leads to an improve-
ment over the baseline system as well as in the F-scores. The ROUGE perfor-
mance on CASE-normalized data is better than that using STOP (stopword
removal). The F-score of the REZIME system is also higher than the baseline
when the training instance ratio increases. The ratio at which the number of
class 0 instances equals ten times the number of class 1 instances shows the best
improvement. In summary, setting the ratio can control both the compression
rate of the produced summaries and the summarization quality.
References
1. Hovy, E., Lin, C.Y.: Automated text summarization in SUMMARIST. In: Mani,
I., Maybury, M.T. (eds.) Advances in Automatic Text Summarization. MIT Press
(1999)
2. Luhn, H.P.: The automatic creation of literature abstracts. IBM Journal of Re-
search and Development 2(2), 159–165 (1958)
3. Edmundson, H.P.: New methods in automatic extracting. Journal of the
ACM 16(2), 264–285 (1969)
Abstract. While there are many large knowledge bases (e.g. Freebase,
Yago, DBpedia) as well as linked data sets available on the web, they
typically lack lexical information stating how the properties and classes
are realized lexically. If at all, typically only one label is attached to these
properties, lacking any deeper syntactic information, e.g. about syn-
tactic arguments and how these map to the semantic arguments of the
property, as well as about possible lexical variants or paraphrases. While
there are lexicon models such as lemon that allow a lexicon to be defined
for a given ontology, the cost involved in creating and maintaining such
lexica is substantial, requiring a high manual effort. Towards lowering
this effort, in this paper we present a semi-automatic approach that
exploits a corpus to find occurrences in which a given property is ex-
pressed, and generalizes over these occurrences by extracting depen-
dency paths that can be used as a basis to create lemon lexicon entries.
We evaluate the resulting automatically generated lexica with respect
to DBpedia as dataset and Wikipedia as corresponding corpus, both in
an automatic mode, by comparing to a manually created lexicon, and
in a semi-automatic mode in which a lexicon engineer inspected the re-
sults of the corpus-based approach, adding them to the existing lexicon
if appropriate.
1 Introduction
The structured knowledge available on the web is increasing. The Linked Data
Cloud, consisting of a large amount of interlinked RDF datasets, has been grow-
ing steadily in recent years, now comprising more than 30 billion RDF triples1 .
Popular and huge knowledge bases exploited for various purposes are Freebase,
DBpedia, and Yago.2 Search engines such as Google are by now also collecting
and exploiting structured data, e.g. in the form of knowledge graphs that are
used to enhance search results.3 As the amount of structured knowledge avail-
able keeps growing, intuitive and effective paradigms for accessing and querying
1 http://www4.wiwiss.fu-berlin.de/lodcloud/state/
2 http://www.freebase.com/, http://dbpedia.org/, http://www.mpi-inf.mpg.de/yago-naga/yago/
3 http://www.google.com/insidesearch/features/search/knowledge.html
this knowledge become more and more important. An appealing way of access-
ing this growing body of knowledge is through natural language. In fact, in
recent years several researchers have developed question answering systems that
provide access to the knowledge in the Linked Open Data Cloud (e.g. [8], [13],
[14], [2]). Further, there have been some approaches to applying natural lan-
guage generation techniques to RDF in order to verbalize knowledge contained
in RDF datasets (e.g. [10], [12], [4]). For all such systems, knowledge about how
properties, classes and individuals are verbalized in natural language is required.
The lemon model4 [9] has been developed for the purpose of creating a standard
format for publishing such lexica as RDF data. However, the creation of lex-
ica for large ontologies and knowledge bases such as the ones mentioned above
involves a high manual effort. Towards reducing the costs involved in building
such lexica, we propose a corpus-based approach for the induction of lexica for
a given ontology which is capable of automatically inducing an ontology lexicon
given a knowledge base or ontology and an appropriate (domain) corpus. Our
approach is intended to be deployed in a semi-automatic fashion by proposing
a set of lexical entries for each property and class, which are to be validated by
a lexicon engineer, e.g. using a web interface such as lemon source 5 .
As an example, consider the property dbpedia:spouse as defined in DB-
pedia. In order to be able to answer natural language questions such as Who
is Barack Obama married to? we need to know the different lexicalizations of
this property, such as to be married to, to be the wife of, and so on. Our ap-
proach is able to find such lexicalizations on the basis of a sufficiently large
corpus. The approach relies on the fact that many existing knowledge bases
are populated with instances, i.e. by triples relating entities through properties
such as the property dbpedia:spouse. Our approach relies on such triples, e.g.
dbpedia:Barack Obama, dbpedia:spouse, dbpedia:Michelle Obama
to find
occurrences in a corpus where both entities, the subject and the object, are men-
tioned in one sentence. On the basis of these occurrences, we use a dependency
parser to parse the relevant context and generate a set of lexicalized patterns
that very likely express the property or class in question.
The paper is structured as follows: in Section 2 we present the general ap-
proach, distinguishing the case of inducing lexical entries for properties and for
classes. The evaluation of our approach with respect to 80 pseudo-randomly se-
lected classes and properties is presented in Section 3. Before concluding, we
discuss some related work in Section 4.
2 Approach
Our approach6 is summarized in Figure 1. The input is an ontology and the
output is a lexicon in lemon format for the input ontology. In addition, it relies
on an RDF knowledge base as well as a (domain) corpus.
4 For detailed information, see http://lemon-model.net/
5 http://monnetproject.deri.ie/lemonsource/
6 Available at https://github.com/swalter2/knowledgeLexicalisation
The processing differs for properties and classes. In what follows, we describe
the processing of properties, while the processing of classes, which does not rely
on the corpus, is explained below in Section 2.5. For each property to be lexical-
ized, all triples from the knowledge base containing this property are retrieved.
The labels of the subject and object entities of these triples are then used for
searching the corpus for sentences in which both occur. Based on a dependency
parse of these sentences, patterns are extracted that serve as basis for the con-
struction of lexical entries. In the following, we describe each of the steps in more
detail.
Given a property, the first step consists in extracting from the RDF knowledge
base all triples containing that property. In the case of DBpedia, for the property
dbpedia:spouse, for example, 44 197 triples are found7, including the triple
relating dbpedia:Barack_Obama to dbpedia:Michelle_Obama shown in the
introduction.
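Such triples can be retrieved with a single SPARQL query. A minimal sketch using the SPARQLWrapper library against the public DBpedia endpoint (the endpoint choice and query shape are our assumptions, not necessarily the authors' exact implementation):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Retrieve (subject, object) label pairs for the dbpedia:spouse property.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    SELECT ?sLabel ?oLabel WHERE {
        ?s dbo:spouse ?o .
        ?s rdfs:label ?sLabel . ?o rdfs:label ?oLabel .
        FILTER (lang(?sLabel) = "en" && lang(?oLabel) = "en")
    } LIMIT 100
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["sLabel"]["value"], "-- spouse --", b["oLabel"]["value"])
```

The returned labels are then used to search the corpus for sentences in which both entities occur.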
x -appos-> wife -prep-> of -pobj-> y    (an extracted dependency path pattern)
All patterns found by the above process, whose relative frequency is above a
given threshold θ, are then transformed into a lexical entry in lemon format. For
instance, the above mentioned pattern is stored as the following entry:
This entry comprises a part of speech (noun), a canonical form (the head noun
wife), a sense referring to the property spouse in the ontology, and a syntac-
tic behavior specifying that the noun occurs with two arguments, a copulative
argument that corresponds to the subject of the property and a prepositional
object that corresponds to the object of the property and is accompanied by a
marker of.9 The specific subcategorization frame is determined by the kind of
dependency relations that occur in the pattern. Currently, our approach covers
nominal frames (e.g. activity and wife of), transitive verb frames (e.g. loves),
and adjectival frames (e.g. Spanish).
The lexicalization process for classes differs from that for properties in that the
corpus is not used. Instead, for each class in the ontology, its label is extracted
as lexicalization. In order to also find alternative lexicalizations, we consult
WordNet to find synonyms. For example, for the class http://dbpedia.org/
ontology/Activity with label activity, we find the additional synonym action,
thus leading to the following two entries in the lemon lexicon10 :
9 From a standard lexical point of view the syntactic behavior might look weird. Instead of viewing the specified arguments as elements that are locally selected by the noun, they should rather be seen as elements that occur in a prototypical syntactic context of the noun. They are explicitly named as it would otherwise be impossible to specify the mapping between syntactic and semantic arguments.
10 As linguistic ontology we use ISOcat (http://isocat.org); in the examples, however, we will use the LexInfo vocabulary (http://www.lexinfo.net/ontology/2.0/lexinfo.owl) for better readability.
These entries specify a part of speech (noun), together with a canonical form
(the class label) and a sense referring to the class URI in the ontology.
3 Evaluation
We pseudo-randomly selected properties from different frequency ranges, i.e.
ranging from properties with very few instances to properties with many instances.
We then filtered out those that turned out either to have no instances (leaving in
only one empty property per set, meltingPoint and sublimationPoint, in order
to be able to evaluate possible fallback strategies) or not to have an intuitive
lexicalization, e.g. espnId. On
average, the properties selected for training have 36 100 instances (ranging from
15 to 229 579), while the properties in the test set have 59 532 instances on aver-
age (ranging from 9 to 444 025). The training and test sets are also used in the
ontology lexicalization task of the QALD-3 challenge11 at CLEF 2013.
We use the training set to determine the threshold θ, and then evaluate the
approach on the unseen properties in the test set.
map(a) = 1, if a in l_auto has been mapped to the same semantic argument of p as in l_gold
       = 0, otherwise
4 Related Work
In this section we briefly discuss related work in the area of extracting lexical
patterns or paraphrases from corpora that verbalize a given relation in an ontol-
ogy. An approach that is similar in spirit to our approach is Wanderlust [1] which
Some pattern-based approaches have been extended to work with web data in
order to overcome data sparseness (e.g. as in [3]). This is clearly an option to
make use of in case not enough instances are available or not enough seed
sentences can be found in the given corpus to bootstrap the pattern acquisition
process.
Acknowledgment. This work has been funded by the European Union’s Sev-
enth Framework Programme (FP7-ICT-2011-SME-DCL) under grant agreement
number 296170 (PortDial).
References
1. Akbik, A., Broß, J.: Wanderlust: Extracting semantic relations from natural lan-
guage text using dependency grammar patterns. In: Proceedings of the Workshop
on Semantic Search in Conjunction with the 18th Int. World Wide Web Conference
(2009)
2. Bernstein, A., Kaufmann, E., Kaiser, C., Kiefer, C.: Ginseng: A guided input nat-
ural language search engine. In: Proceedings of the 15th Workshop on Information
Technologies and Systems, pp. 45–50 (2005)
3. Blohm, S., Cimiano, P.: Using the web to reduce data sparseness in pattern-
based information extraction. In: Kok, J.N., Koronacki, J., Lopez de Mantaras,
R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI),
vol. 4702, pp. 18–29. Springer, Heidelberg (2007)
4. Bouayad-Agha, N., Casamayor, G., Wanner, L.: Natural language generation and
semantic web technologies. Semantic Web Journal (in press)
5. Gerber, D., Ngomo, A.: Bootstrapping the linked data web. In: Proceedings of the
10th International Semantic Web Conference, ISWC (2011)
6. Ittoo, A., Bouma, G.: On learning subtypes of the part-whole relation: Do not
mix your seeds. In: Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics (ACL), pp. 1328–1336 (2010)
7. Lin, D., Pantel, P.: DIRT - Discovery of Inference Rules from Text. In: Proceedings
of the 7th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 323–328. ACM (2001)
8. Lopez, V., Fernandez, M., Motta, E., Stieler, N.: Poweraqua: Supporting users in
querying and exploring the semantic web. Semantic Web Journal, 249–265 (2012)
9. McCrae, J., Spohr, D., Cimiano, P.: Linking lexical resources and ontologies on the
semantic web with lemon. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia,
B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I. LNCS,
vol. 6643, pp. 245–259. Springer, Heidelberg (2011)
10. Mellish, C., Sun, X.: The semantic web as a linguistic resource: opportunities for
natural language generation. In: Proceedings of 26th SGAI International Confer-
ence on Innovative Techniques and Applications of Artificial Intelligence, pp. 298–
303. Elsevier (2006)
11. Pantel, P., Pennacchiotti, M.: Espresso: Leveraging generic patterns for automati-
cally harvesting semantic relations. In: Proceedings of the 21st International Con-
ference on Computational Linguistics (COLING), pp. 113–120. ACM (2006)
12. Third, A., Williams, S., Power, R.: OWL to english: a tool for generating organised
easily-navigated hypertexts from ontologies. In: Proceedings of 10th International
Semantic Web Conference (ISWC), pp. 298–303 (2011)
13. Unger, C., Bühmann, L., Lehmann, J., Ngonga-Ngomo, A.-C., Gerber, D., Cimi-
ano, P.: Sparql template-based question answering. In: Proceedings of the World
Wide Web Conference (WWW), pp. 639–648. ACM (2012)
14. Walter, S., Unger, C., Cimiano, P., Bär, D.: Evaluation of a layered approach to
question answering over linked data. In: Cudré-Mauroux, P., et al. (eds.) ISWC
2012, Part II. LNCS, vol. 7650, pp. 362–374. Springer, Heidelberg (2012)
SQUALL: A Controlled Natural Language
as Expressive as SPARQL 1.1
Sébastien Ferré
1 Introduction
An open challenge of the Semantic Web [12] is semantic search, i.e., the ability
for users to browse and search semantic data according to their needs. Seman-
tic search systems can be classified according to their usability, the expressive
power they offer, their compliance to Semantic Web standards, and their scala-
bility. The most expressive approach by far is to use SPARQL [17], the standard
RDF query language. SPARQL 1.11 features graph patterns, filters, unions, dif-
ferences, optionals, aggregations, expressions, subqueries, ordering, etc. However,
SPARQL is also the least usable approach, as it is defined at a low-level in terms
of relational algebra. There are mostly two approaches to make more usable
semantic search systems: navigation and natural language. Navigation is used
in semantic browsers (e.g., Fluidops Information Workbench2 ), and in seman-
tic faceted search (e.g., SlashFacet [11], BrowseRDF [16], Sewelis [6]). Semantic
faceted search can reach a significant expressiveness, but still much below than
SPARQL 1.1, and it does not scale easily to large datasets such as DBpedia3 .
Natural language is used in search engines in various forms, going from full
natural language (e.g., FREyA [3], Aqualog [14]) to mere keywords (e.g., NLP-
Reduce [13]) through controlled natural languages (e.g., Ginseng [1]). Questions
1 http://www.w3.org/TR/sparql11-query/
2 http://iwb.fluidops.com/
3 http://dbpedia.org
The two basic units of these languages are resources and triples. A resource
can be either a URI (Uniform Resource Identifier), a literal (e.g., a string, a
number, a date), or a blank node, i.e., an anonymous resource. A URI is the
absolute name of a resource, i.e., an entity, and plays the same role as a URL
w.r.t. web pages. Like URLs, a URI can be a long and cumbersome string (e.g.,
http://www.w3.org/1999/02/22-rdf-syntax-ns#type), so that it is often de-
noted by a qualified name (e.g., rdf:type), where rdf: is the RDF namespace.
In the N3 notation, the default namespace : can be omitted for qualified names
that do not collide with reserved keywords (bare qualified names).
A triple (s p o) is made of 3 resources, and can be read as a simple sentence,
where s is the subject, p is the verb (called the predicate), and o is the object. For
instance, the triple (Bob knows Alice) says that “Bob knows Alice”, where Bob
and Alice are the bare qualified names of two individuals, and knows is the bare
qualified name of a property, i.e., a binary relation. The triple (Bob rdf:type
man) says that “Bob has type man”, or simply “Bob is a man”. Here, the resource
man is used as a class, and rdf:type is a property from the RDF namespace. The
triple (man rdfs:subClassOf person) says that “man is a subclass of person”,
or simply “every man is a person”. The set of all triples of a knowledge base
forms an RDF graph.
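These example triples can be written down concretely; a minimal sketch using the rdflib Python library, with an example.org namespace standing in for the default namespace ":":

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")  # stands in for the default namespace ":"

g = Graph()
g.add((EX.Bob, EX.knows, EX.Alice))          # "Bob knows Alice"
g.add((EX.Bob, RDF.type, EX.man))            # "Bob is a man"
g.add((EX.man, RDFS.subClassOf, EX.person))  # "every man is a person"

print(g.serialize(format="n3"))  # the three triples, rendered in N3
```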
Query languages provide the same service for semantic web knowledge bases as
SQL does for relational databases. They generally assume that implicit triples
have been inferred and added to the base. The standard RDF query language,
SPARQL, reuses the SELECT FROM WHERE shape of SQL queries, using graph
patterns in the WHERE clause. A graph pattern G is one of:
Aggregations and expressions can be used in the SELECT clause (e.g., COUNT, SUM,
2 * ?x), and GROUP BY clauses can be added to a query. Solution modifiers can
also be added to the query for ordering results (ORDER BY) or returning a subset
of results (OFFSET, LIMIT). Other query forms allow for closed questions (ASK),
for returning the description of a resource (DESCRIBE), or for returning RDF
graphs as results instead of tables (CONSTRUCT). SPARQL has been extended
into an update language to insert triples into, and delete triples from, a graph. The most general update form is DELETE D INSERT I WHERE G, where I and D must be sets of triple patterns, and G is a graph pattern that defines bindings for the variables occurring in I and D.
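As a concrete illustration of these query and update forms, the sketch below runs a SELECT with aggregation and a DELETE/INSERT/WHERE update over an in-memory rdflib graph; the graph content and the age property are invented for the example:

from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
g = Graph()
for person, age in [("Bob", 42), ("Alice", 37)]:
    g.add((EX[person], EX.age, Literal(age)))

# Aggregations in the SELECT clause (COUNT, AVG).
q = """SELECT (COUNT(?p) AS ?n) (AVG(?a) AS ?avg)
       WHERE { ?p <http://example.org/age> ?a }"""
for row in g.query(q):
    print(row.n, row.avg)

# DELETE D INSERT I WHERE G: the graph pattern G binds ?p and ?a.
g.update("""DELETE { ?p <http://example.org/age> ?a }
            INSERT { ?p <http://example.org/age> 43 }
            WHERE  { ?p <http://example.org/age> ?a . FILTER (?a = 42) }""")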
The syntactic and semantic analysis of SQUALL are formally defined and implemented as a Montague grammar made of around 100 rules. Montague grammars [4] are an approach to natural language semantics that is based on formal logic and λ-calculus. The approach is named after the American logician Richard Montague,
who pioneered this approach [15]. A Montague grammar is a context-free gen-
erative grammar, where each rule is decorated by a λ-term that denotes the
semantics of the syntactic construct defined by the rule. The semantics is de-
fined in a fully compositional style, i.e., the semantics of a construct is always
5 The full Montague grammar can be found in the source code at https://bitbucket.org/sebferre/squall2sparql/src (file syntax.ml), or in a previous paper [5] for an earlier version of SQUALL.
Triple Patterns. Each noun or non-auxiliary verb plays the role of a class or a
predicate in a triple pattern. If a question is about a class or a predicate, the
verbs “belongs” and “relates” are respectively used.
– “Which person is the author of a publication whose publication year is 2012?”
– “To which nationality does John Smith belong?” (here, “nationality” is a meta-class
whose instances are classes of persons: e.g., “French”, “German”).
– “What relates John Smith to Mary Well?”
Queries. SELECT queries are obtained by open questions, using one or several
question words (“which” as a determiner, “what” or “who” as a noun phrase).
Queries with a single selected variable can also be expressed as imperative
sentences. ASK queries are obtained by closed questions, using either the word
“whether” in front of an affirmative sentence, or using auxiliary verbs and subject-
auxiliary inversion.
– “Which person is the author of which publication?”
– “Give me the author-s of Paper42.”
– “Whether John Smith know-s Mary Well?”
– “Does Mary Well know the author of Paper42?”
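For illustration, one plausible SPARQL rendering of the last closed question is the following (this is our sketch, assuming that the genitive “the author of Paper42” yields a triple with Paper42 as subject, and that spaces in names are elided; it is not output quoted from the squall2sparql tool):

ASK { :Paper42 :author ?x . :Mary_Well :knows ?x }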
6 http://caml.inria.fr/ocaml/
Solution Modifiers. The ordering of results (ORDER BY) and partial results
(LIMIT, OFFSET) are expressed with adjectives like “highest”, “2nd lowest”, “10
greatest”.
Built-ins. Built-in functions and operators used in SPARQL filters and expres-
sions are expressed by pre-defined nouns, verbs, and relational adjectives: e.g.,
“month”, “contains”, “greater than”. They can therefore be used like classes and
properties.
– “Which person has a birth date whose month is 3 and whose year is greater than 2000?”
– “Give me the publication-s whose title contains "natural language".”
Join. The coordination “and” can be used with all kinds of phrases. It generates
complex joins at the relational algebra level.
– “John Smith and Mary Well have age 42 and are an author of Paper42 and Paper43.”
Union. Unions of graph patterns are expressed by the coordination “or”, which
can be used with all kinds of phrases, like “and”.
– “Which teacher or student teach-es or attend-s a course whose topic is NL or DB?”
Option. Optional graph patterns are expressed by the adverb “maybe”, which
can be used in front of all kinds of phrases, generally verb phrases.
– “The author-s of Paper42 have which name and maybe have which email?”
Expressions. Arithmetic and string expressions over values can be written directly inside sentences:
– “Which publication has the lastPage - the firstPage greater than 10?”
– “Return concat(the firstname, " ", the lastname) of all author-s of Paper42.”
Property Paths. Property sequences and inverse properties are covered by the
flexible syntax of SQUALL. Alternative and negative paths are respectively cov-
ered by the coordination “or” and the adverb “not”. Reflexive and transitive
closures of properties have no obvious linguistic counterpart, and are expressed
by property suffixes among “?”, “+”, and “*”.
Named Graphs. The GRAPH construct of SPARQL, which serves to restrict graph
pattern solutions to a named graph, can be expressed using “in graph” as a
preposition. A prepositional phrase can be inserted at any location in a sentence,
and its scope is the whole sentence.
Graph Literals. The SPARQL query forms CONSTRUCT and DESCRIBE return
graphs, i.e. sets of triples, instead of sets of solutions. A DESCRIBE query is ex-
pressed by the imperative verb “describe” followed by a resource or a universally-
quantified noun phrase. A CONSTRUCT query is expressed by using curly brackets
to quote sentences and make them a graph literal.
A detailed review of SPARQL 1.1 grammar reveals only a few missing fea-
tures: (1) updates at graph level (e.g., LOAD, DROP), (2) use of results from other
endpoints (e.g., VALUES, SERVICE), (3) transitive closure on complex property
paths (e.g., (^author/author)+ for co-authors of co-authors, and so on).
SQUALL Queries Look Natural. The use of variables is hardly ever necessary in SQUALL (none was used in the 100 training questions), whereas SPARQL queries are cluttered with many variables. No special notations are used, except for namespaces. Only grammatical words are used to provide syntax, and they are used as in natural language. In 9 out of 100 questions, the SQUALL formulation is identical to natural language, up to proper names being replaced by URIs:
– “Is res:Proinsulin a Protein?”
– “What is the currency of res:Czech Republic?”
7 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
Most Discrepancies between Natural Language and SQUALL are a Matter of Vo-
cabulary. Most discrepancies come from the fact that for each concept, a single
word has been chosen in the DBpedia ontology, and related words are not avail-
able as URIs. Because SQUALL sentences use URIs as nouns and verbs, some
reformulation is necessary. In the simplest case, it is enough to replace a word by
another: e.g., “wife” vs “dbp:spouse”. In other cases, a verb has to be replaced by a
noun, which requires changes in the syntactic structure: e.g., “Who developed the
video game World of Warcraft?” vs “Who is the developer of res:World of Warcraft?”.
An interesting example is “Who is the daughter of Bill Clinton married to?” vs “Who is the dbp:spouse of the child of res:Bill Clinton?”. The former question could be expressed in SQUALL if “marriedTo” were made a property equivalent to “dbp:spouse”, and if “daughter” were made a subproperty of “child”. In fact, this kind of discrepancy could be resolved either by enriching the ontology with related words, or by preprocessing SQUALL sentences so that natural words are replaced by URIs. The latter solution has already been studied as a component of existing question answering systems [3,14], and could be combined with the translation from SQUALL to SPARQL.
Some Discrepancies are Deeper in that they Exhibit Conceptual Differences between Natural Language and the Ontology. We briefly discuss three cases:
– “List all episodes of the first season of the HBO television series The Sopranos!” vs “List all TelevisionEpisode-s whose series is res:The Sopranos and whose seasonNumber is 1.”. In natural language, an episode is linked to a season, which in turn is linked to a series. In DBpedia, an episode is linked to a series on the one hand, and to a season number on the other; a season is not an entity, but only an attribute of episodes.
– “Which caves have more than 3 entrances?” vs “Which Cave-s have a dbp:entranceCount greater than 3?”. The natural question is nearly a valid sentence in SQUALL, but it assumes that each cave is linked to each of its entrances. However, DBpedia only has a property “dbp:entranceCount” from a cave to its number of entrances.
– “Which classis does the Millepede belong to?” vs “What is the dbp:classis of res:Millipede?”. The natural question is again a valid SQUALL sentence (after moving “to” to the beginning), but it assumes that res:Millipede is an instance of a class, which is itself an instance of dbp:classis. DBpedia does not define classes of classes, and therefore uses dbp:classis as a property from a species to its classis.
References
1. Bernstein, A., Kaufmann, E., Kaiser, C.: Querying the semantic web with Ginseng: A guided input natural language search engine. In: Workshop on Information Technology and Systems (WITS) (2005)
2. Ceri, S., Gottlob, G., Tanca, L.: What you always wanted to know about datalog
(and never dared to ask). IEEE Trans. Knowl. Data Eng. 1(1), 146–166 (1989)
3. Damljanovic, D., Agatonovic, M., Cunningham, H.: Identification of the question
focus: Combining syntactic analysis and ontology-based lookup through the user
interaction. In: Language Resources and Evaluation Conference (LREC). ELRA
(2010)
8 http://lemon-model.net/index.html
4. Dowty, D.R., Wall, R.E., Peters, S.: Introduction to Montague Semantics. D. Reidel
Publishing Company (1981)
5. Ferré, S.: SQUALL: a controlled natural language for querying and updating RDF
graphs. In: Kuhn, T., Fuchs, N.E. (eds.) CNL 2012. LNCS, vol. 7427, pp. 11–25.
Springer, Heidelberg (2012)
6. Ferré, S., Hermann, A.: Reconciling faceted search and query languages for the
Semantic Web. Int. J. Metadata, Semantics and Ontologies 7(1), 37–54 (2012)
7. Fuchs, N.E., Kaljurand, K., Schneider, G.: Attempto Controlled English meets
the challenges of knowledge representation, reasoning, interoperability and user
interfaces. In: Sutcliffe, G., Goebel, R. (eds.) FLAIRS Conference, pp. 664–669.
AAAI Press (2006)
8. Fuchs, N.E., Schwitter, R.: Web-annotations for humans and machines. In: Fran-
coni, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 458–472.
Springer, Heidelberg (2007)
9. Haase, P., Broekstra, J., Eberhart, A., Volz, R.: A comparison of RDF query lan-
guages. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004.
LNCS, vol. 3298, pp. 502–517. Springer, Heidelberg (2004)
10. Hermann, A., Ferré, S., Ducassé, M.: An interactive guidance process supporting
consistent updates of RDFS graphs. In: ten Teije, A., Völker, J., Handschuh, S.,
Stuckenschmidt, H., d’Aquin, M., Nikolov, A., Aussenac-Gilles, N., Hernandez,
N. (eds.) EKAW 2012. LNCS (LNAI), vol. 7603, pp. 185–199. Springer, Heidelberg
(2012)
11. Hildebrand, M., van Ossenbruggen, J., Hardman, L.: /facet: A browser for hetero-
geneous semantic web repositories. In: Cruz, I., Decker, S., Allemang, D., Preist,
C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS,
vol. 4273, pp. 272–285. Springer, Heidelberg (2006)
12. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies.
Chapman & Hall/CRC (2009)
13. Kaufmann, E., Bernstein, A.: Evaluating the usability of natural language query
languages and interfaces to semantic web knowledge bases. J. Web Semantics 8(4),
377–393 (2010)
14. Lopez, V., Uren, V., Motta, E., Pasin, M.: Aqualog: An ontology-driven question
answering system for organizational semantic intranets. Journal of Web Seman-
tics 5(2), 72–105 (2007)
15. Montague, R.: Universal grammar. Theoria 36, 373–398 (1970)
16. Oren, E., Delbru, R., Decker, S.: Extending faceted navigation for RDF data. In:
Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M.,
Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 559–572. Springer, Heidelberg
(2006)
17. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. In:
Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M.,
Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 30–43. Springer, Heidelberg
(2006)
18. Schwitter, R., Kaljurand, K., Cregan, A., Dolbear, C., Hart, G.: A comparison of
three controlled natural languages for OWL 1.1. In: Clark, K., Patel-Schneider,
P.F. (eds.) Workshop on OWL: Experiences and Directions (OWLED), vol. 258.
CEUR-WS (2008)
19. Smart, P.: Controlled natural languages and the semantic web. Tech. rep.,
School of Electronics and Computer Science University of Southampton (2008),
http://eprints.ecs.soton.ac.uk/15735/
Evaluating Syntactic Sentence
Compression for Text Summarisation
1 Introduction
Text compression has several practical applications in natural language processing
such as text simplification [1], headline generation [2] and text summarization [3].
The goal of automatic text summarization is to produce a shorter version of the
information contained in a text collection and produce a relevant summary [4]. In
extractive summarization, sentences are extracted from the document collection
and assigned a score according to a given topic/query relevance [5] or some other metric that determines how important each sentence is to the final summary. Summaries are usu-
ally bound by a word or sentence limit and within these limits, the challenge is to
extract and include as much relevant information as possible. However, since the
sentences are not processed or modified, they may contain phrases that are irrele-
vant or may not contribute to the targeted summary. As an example, consider the following topic, query, and sentence (1):
Topic: Southern Poverty Law Center
Query: Describe the activities of Morris Dees and the Southern Poverty Law
Center
1 All examples are taken from the TAC 2008 or DUC 2007 corpora.
(1) Since co-founding the Southern Poverty Law Center in 1971, Dees has
wielded the civil lawsuit like a buck knife, carving financial assets out of hate
group leaders who inspire followers to beat, burn and kill.
In sentence (1), some phrases could be dropped without losing much information
relevant to the query. Possible shorter forms of the sentence include:
(1c1) Since co-founding the Southern Poverty Law Center in 1971, Dees has
wielded the civil lawsuit like a buck knife, carving financial assets out of
hate group leaders who inspire followers to beat, burn and kill.
(1c2) Since co-founding the Southern Poverty Law Center in 1971, Dees has
wielded the civil lawsuit like a buck knife, carving financial assets out of
hate group leaders who inspire followers to beat, burn and kill.
(1c3) Since co-founding the Southern Poverty Law Center in 1971, Dees has
wielded the civil lawsuit like a buck knife, carving financial assets out of
hate group leaders who inspire followers to beat, burn and kill.
the likelihood of arriving at the long string t, when s is expanded. Their model was
designed considering two key features: preserving grammaticality and preserving
useful information. In order to calculate the probabilities, they used context-free grammar parses of sentences and a word-based bigram estimation model. They evaluated their system using the Ziff-Davis corpus and showed that their approach could reach compression rates similar to those of human-written compressed texts, but with slightly lower importance and grammaticality. On the other hand, [11] introduced semantic features
to improve a decision tree based classification. Here, the authors used Charniak’s
parser [12] to generate syntactic trees and incorporated semantic information us-
ing WordNet [13]. The evaluation showed a slight improvement in importance of
information preserved in shortened sentences. But again, the effect on summa-
rization was not noted. [14] points out that text compression could be seen as a
problem of finding a global optimum by considering the compression of the whole
text/document. The authors used the syntactic trees of each pair of long and short sentences to define rules that deduce shorter syntactic trees from the original syntactic
trees. They also used the Ziff-Davis corpus for their evaluation as well as human
judgment. They evaluated their technique based on the importance and grammaticality of sentences, and the results were lower than the scores of the human-written abstractions. Similarly, [15] describes the use of an integer linear program-
ming model to infer globally optimal compressions while adhering to linguistically
motivated constraints and show improvement in automatic and human judgment
evaluations. [16] also described an approach to syntactic pruning based on transformed dependency trees and an integer linear model. The authors transformed the dependency trees into graphs where the nodes represent nouns and verbs, and these transformed dependency trees are trimmed based on the results
of an integer linear programming model that decides the importance of each sub-
tree. Their evaluation showed an improvement compared to language-model-based compression techniques.
The previous works described above were evaluated intrinsically by comparing their results to human-generated summaries. A few previous works did, however, measure sentence compression extrinsically for the purpose of text summarization. In particular, [10] took a conservative approach and used a list of keyword phrases to identify less significant parts of the text and remove them from long sentences. The keyword list was implemented in an ad hoc fashion and was used
to omit specific terms. They evaluated their pruning techniques within their summarization system CLASSY [17] on DUC 2005 [18], and showed an improvement in ROUGE scores. In their participation in the DUC 2006 [19] automatic summarization track, their system placed among the top three based on ROUGE scores.
In contrast, [7] used complete dependency parses and applied pruning rules
based on grammatical structures. They used specific grammatical filters includ-
ing prepositional complements of verbs, subordinate clauses, noun appositions
and interpolated clauses. They achieved a compression rate of 74% while retaining the grammaticality and readability of the text. In [20], the authors also used
3 Pruning Heuristics
To evaluate syntactic sentence pruning methods in the context of automatic text summarization, we implemented several syntax-based heuristics and evaluated them on standard summarization benchmarks. We took as input a list of extracted sentences ranked by their relevance score as generated by an automatic summarizer. We then performed a complete parse of these sentences, and applied various syntax-based pruning approaches to each tree node to determine whether or not to prune a particular sub-tree. The pruned sentences were then included in the final summary in place of the original sentences and evaluated for content against the given model summaries. Three basic sentence compression approaches were attempted: syntax-driven pruning, syntax and relevancy based pruning, and relevancy-driven syntactic pruning. Let us describe each approach in detail.
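The overall loop can be sketched as follows, assuming constituency parses are available as nltk.Tree objects and that should_prune encapsulates one of the three heuristics (the function names and the toy parse are ours, not the authors' code):

from nltk import Tree

def prune_tree(tree, should_prune):
    # Leaves (words) are kept as-is; pruned sub-trees are dropped whole.
    if not isinstance(tree, Tree):
        return tree
    kept = [prune_tree(child, should_prune)
            for child in tree
            if not (isinstance(child, Tree) and should_prune(child))]
    return Tree(tree.label(), kept)

# Toy usage: prune all PPs from a small parse and re-generate the text.
parse = Tree.fromstring(
    "(S (NP (NNP Dees)) (VP (VBZ wields) (NP (DT the) (NN lawsuit))"
    " (PP (IN like) (NP (DT a) (NN knife)))))")
compressed = prune_tree(parse, lambda t: t.label() == "PP")
print(" ".join(compressed.leaves()))  # -> Dees wields the lawsuit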
(6) In the Public Records Office in London archivists are creating a catalog
of all British public records regarding the Domesday Book of the 11th century.
Here, the prepositional phrase In the Public Records Office in London is attached to the entire clause, while of all British public records and of the 11th century are attached to the nouns catalog and Domesday Book. PPs attached to NPs often act
as noun modifiers and as a consequence can be pruned like any adjective phrase.
In addition, PPs attached to an entire clause often present complementary infor-
mation that can also be removed. On the other hand, PPs can be attached to verb
phrases, as in:
(7) Australian Prime Minister John Howard today defended the government's decision to go ahead with uranium mining on development and environmental grounds.
where with uranium mining and on development and environmental grounds are attached to go ahead. PPs that modify verb phrases should be pruned with caution, as they may be part of the verb's frame and required to understand the verb phrase; removing them would likely lose the meaning of the sentence. PPs attached to VPs that are positioned after the head verb are therefore not pruned. However, PPs attached to VPs that are positioned before the verb are considered less likely to be mandatory and are removed. Removing PPs based
solely on syntactic information will likely make mistakes. PPs that do not con-
tain necessary information may be kept, and vice-versa. However, the purpose of
this heuristic is to prune as cautiously as possible. Sections 3.2 and 3.3 describe
heuristics that take semantics into account.
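A minimal encoding of this PP rule as a pruning predicate might look as follows; deciding "before/after the head verb" by child position, and limiting clause labels to S, are our simplifications:

from nltk import Tree

def prune_pp(node, parent):
    # Prune PPs attached to NPs or to a clause; under a VP, prune only
    # PPs that appear before the head verb.
    if not (isinstance(node, Tree) and node.label() == "PP"):
        return False
    if parent.label() in ("NP", "S"):
        return True
    if parent.label() == "VP":
        children = list(parent)
        verb_pos = next((i for i, c in enumerate(children)
                         if isinstance(c, Tree) and c.label().startswith("VB")),
                        None)
        return verb_pos is not None and children.index(node) < verb_pos
    return False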
except for noun phrases, verb phrases, or individual words, we calculate its cosine similarity with the topic/query based on tf-idf values. Sub-trees below a certain threshold are pruned; the others are kept. We do not allow the pruning of noun phrases, verb phrases, and individual words in order to preserve minimal grammaticality; all other phrase types are, however, possible candidates for pruning.
For example, consider the following scenario:
Query: What positive and negative developments have there been in Turkey’s
efforts to become a formal member of the European Union?
(8) Turkey had been asking for three decades to join the European Union but its
demand was turned away by the European Union in December 1997 that led
to a deterioration of bilateral relations.
Here, Sentence 8 is the original candidate extracted from the corpus. Its parse tree, generated by the Stanford Parser [22], is shown in Figure 1, with the relevancy
score indicated in bold. For example, the sub-tree rooted by the SBAR (that led
to a deterioration of bilateral relations) was computed to have a relevance of 0.0
with the topic and the query. All sub-trees rooted at a node whose relevance is
smaller than some threshold value are pruned. If we set t = 0 (i.e. any relevance
with topic/query will be considered useful), the above sentence would therefore
be compressed as:
[Fig. 1. Parse tree of Sentence (8) generated by the Stanford Parser, with the relevancy score of each sub-tree shown in bold (e.g., 0.5286 for the main clause and 0.0 for the SBAR “that led to a deterioration of bilateral relations”).]
(8c) Turkey had been asking for three decades to join the European Union but its
demand was turned away by the European Union in December 1997 that led to a
deterioration of bilateral relations.
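The relevancy score used above can be sketched with scikit-learn's tf-idf machinery; the vectorizer configuration and the exact text units fed to it are our assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance(subtree_words, topic_query_text, corpus_sentences):
    # Cosine similarity between the sub-tree's words and the topic/query,
    # with tf-idf weights estimated from the sentence collection.
    vec = TfidfVectorizer().fit(corpus_sentences)
    a = vec.transform([" ".join(subtree_words)])
    b = vec.transform([topic_query_text])
    return cosine_similarity(a, b)[0, 0]

A sub-tree whose score does not exceed the threshold t (here t = 0) is pruned.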
4 Evaluation
To evaluate our pruning techniques extrinsically for the purpose of summary gen-
eration, we used two standard text corpora available for summarization: the Text
Analysis Conference (TAC) 2008 [23], which provides a text corpus created from
blogs and the Document Understanding Conference (DUC) 2007 [18] which pro-
vides a text corpus of news articles. To ensure that our results were not tailored
to one specific summarizer, we used two different systems: BlogSum [24], an au-
tomatic summarizer based on discourse relations and MEAD [25], a generalized
automatic summarization system. In order to generate syntactic trees for our ex-
periment, we used the Stanford Parser [22]. To evaluate each compression technique, we generated summaries with and without compression and compared the results based on two metrics: compression rates and ROUGE scores for content evaluation.
Syntax and Relevancy Based Pruning. Table 2 shows the compression rate achieved by each heuristic using the syntax and relevancy based pruning. As the results show, with both datasets, the compression effect of each heuristic has been toned down, but the relative ranking of the heuristics is the same. This seems to imply that each type of syntactic phrase is equally likely to contain irrelevant information, and that no particular construction should be privileged for pruning purposes. Overall, when all pruning heuristics are combined, the relevancy factor reduces the pruning by about 8 to 11% (from 73–75% to 82–86%).
The following table shows the syntactic structures removed by the relevancy-driven pruning and their relative frequencies. As the results show, the most frequent syntactic structures removed were PPs, and the least frequent were adverbial phrases (Adv). This result correlates with our syntax-driven pruning, as we achieved similar individual compression rates for these phrase structures.
2 The reduction rate is of course proportional to the relevancy threshold used (see Section 3.2). In this experiment, we set the threshold to be the most conservative (t = 0), hence keeping everything that has any relevance to the topic/query.
                         BlogSum                          MEAD
                 TAC 2008        DUC 2007        TAC 2008        DUC 2007
                 No. of  Rel.    No. of  Rel.    No. of  Rel.    No. of  Rel.
                 Phrases Freq.   Phrases Freq.   Phrases Freq.   Phrases Freq.
PP Pruning       395     50.5%   402     62.4%   177     42.3%   408     63.6%
Other            189     24.1%   136     29.3%   157     31.6%   149     30.1%
RC Pruning       94      12.0%   56      8.7%    44      10.5%   59      9.2%
Adj Pruning      75      9.0%    35      5.4%    26      6.2%    20      3.1%
Adv Pruning      29      3.7%    15      2.4%    14      3.3%    5       1.0%
Total            782     100%    644     100%    418     100%    641     100%
Syntax-Driven Pruning. Tables 5 and 6 show the results obtained without and with content filling, respectively. Table 5 shows a drop in ROUGE scores for both summarization systems and both datasets. This goes against our hypothesis that, by default, specific syntactic constructions can be removed without losing much content. In addition, when filling the summary with extra sentences, ROUGE scores do seem to improve (as shown in Table 6); however, Pearson's χ2 and t-tests show that this difference is not statistically significant. What is more surprising is that this phenomenon holds not only for the combined heuristics, but also for each individual pruning heuristic.
Table 5. ROUGE scores of syntax-driven pruning without content filling

                          BlogSum                         MEAD
                  TAC 2008       DUC 2007       TAC 2008       DUC 2007
                  R-2    R-SU4   R-2    R-SU4   R-2    R-SU4   R-2    R-SU4
Original          0.074  0.112   0.088  0.141   0.040  0.063   0.086  0.139
Adv Pruning       0.074  0.113   0.089  0.143   0.039  0.063   0.086  0.139
RC Pruning        0.072  0.109   0.087  0.140   0.039  0.062   0.085  0.138
TC-VP Pruning     0.073  0.111   0.088  0.140   0.040  0.063   0.085  0.137
Adj Pruning       0.068  0.108   0.084  0.140   0.038  0.063   0.080  0.136
PP Pruning        0.065  0.103   0.072  0.121   0.035  0.056   0.069  0.117
Combined          0.060  0.100   0.074  0.128   0.034  0.056   0.068  0.121
Table 6. ROUGE scores of syntax-driven pruning with content filling

                          BlogSum                         MEAD
                  TAC 2008       DUC 2007       TAC 2008       DUC 2007
                  R-2    R-SU4   R-2    R-SU4   R-2    R-SU4   R-2    R-SU4
Original          0.074  0.112   0.088  0.141   0.040  0.063   0.086  0.140
Adv Pruning       0.075  0.114   0.090  0.143   0.044  0.063   0.087  0.140
RC Pruning        0.073  0.111   0.088  0.141   0.039  0.062   0.086  0.140
TC-VP Pruning     0.073  0.111   0.089  0.141   0.040  0.062   0.086  0.139
Adj Pruning       0.075  0.110   0.085  0.142   0.038  0.063   0.082  0.140
PP Pruning        0.070  0.131   0.079  0.131   0.035  0.058   0.076  0.127
Combined          0.065  0.139   0.065  0.139   0.035  0.060   0.077  0.135
Table 7. ROUGE scores of syntax and relevancy based pruning without content filling

                          BlogSum                         MEAD
                  TAC 2008       DUC 2007       TAC 2008       DUC 2007
                  R-2    R-SU4   R-2    R-SU4   R-2    R-SU4   R-2    R-SU4
Original          0.074  0.112   0.088  0.141   0.040  0.063   0.086  0.139
Adv Pruning       0.074  0.113   0.089  0.143   0.039  0.063   0.087  0.140
RC Pruning        0.073  0.110   0.088  0.141   0.039  0.062   0.086  0.139
TC-VP Pruning     0.073  0.111   0.088  0.141   0.040  0.063   0.085  0.138
Adj Pruning       0.070  0.110   0.086  0.142   0.038  0.063   0.082  0.138
PP Pruning        0.072  0.110   0.086  0.137   0.039  0.062   0.079  0.129
Combined          0.069  0.110   0.085  0.140   0.038  0.061   0.078  0.132
Syntax and Relevancy Based Pruning. Recall that the syntax-driven pruning did not consider the relevancy of the sub-tree to prune. When we do take the relevancy into account, surprisingly, the ROUGE scores do not improve significantly either. Tables 7 and 8 show the ROUGE scores of the compressed summaries based on syntax and relevancy, without content filling (Table 7) and with content filling (Table 8). Again, any semblance of improvement is not statistically significant.
4.3 Discussion
Although the compression rates were in line with previous work [7,20], we were surprised by the results of the content evaluation. This might, however, explain why, to our knowledge, so little work can be found in the literature on the evaluation of syntactic sentence pruning for summarization. Our pruning heuristics could of course be fine-tuned to be more discriminating. We could, for example, use verb frames or lexical-grammatical rules to prune PPs, but we do not foresee a significant increase in ROUGE scores. The relevance measure that we
Table 8. ROUGE scores of syntax and relevancy based pruning with content filling

                          BlogSum                         MEAD
                  TAC 2008       DUC 2007       TAC 2008       DUC 2007
                  R-2    R-SU4   R-2    R-SU4   R-2    R-SU4   R-2    R-SU4
Original          0.074  0.112   0.088  0.141   0.040  0.063   0.086  0.140
Adv Pruning       0.075  0.114   0.090  0.143   0.040  0.063   0.087  0.140
RC Pruning        0.072  0.111   0.088  0.141   0.040  0.062   0.086  0.140
TC-VP Pruning     0.074  0.111   0.089  0.141   0.040  0.062   0.086  0.139
Adj Pruning       0.072  0.111   0.086  0.142   0.038  0.063   0.085  0.141
PP Pruning        0.072  0.111   0.088  0.141   0.040  0.062   0.084  0.135
Combined          0.071  0.112   0.088  0.145   0.037  0.062   0.085  0.141
                                         BlogSum                         MEAD
                                 TAC 2008       DUC 2007       TAC 2008       DUC 2007
                                 R-2    R-SU4   R-2    R-SU4   R-2    R-SU4   R-2    R-SU4
Original                         0.074  0.112   0.088  0.141   0.040  0.063   0.086  0.139
Relevancy-Driven Without Filling 0.065  0.100   0.077  0.125   0.034  0.055   0.066  0.110
Relevancy-Driven With Filling    0.068  0.106   0.083  0.135   0.033  0.060   0.078  0.128
used (see Section 3.3) could also be experimented with, but again, we do not ex-
pect much increase from that end. Using a better performing summarizer might
also be a possible avenue of investigation to provide us with better input sentences
and better “filling” sentences after compression.
References
1. Chandrasekar, R., Doran, C., Srinivas, B.: Motivations and Methods for Text Sim-
plification. In: Proceedings of COLING 1996, Copenhagen, pp. 1041–1044 (1996)
2. Dorr, B., Zajic, D., Schwartz, R.: Hedge Trimmer: A Parse-and-Trim Approach to
Headline Generation. In: Proceedings of the HLT-NAACL Workshop on Text Sum-
marization, pp. 1–8 (2003)
3. Knight, K., Marcu, D.: Summarization beyond sentence extraction: A probabilistic
approach to sentence compression. Artificial Intelligence 139(1), 91–107 (2002)
4. Hahn, U., Mani, I.: The Challenges of Automatic Summarization. IEEE Computer 33(11), 29–36 (2000)
5. Murray, G., Joty, S., Ng, R.: The University of British Columbia at TAC 2008. In:
Proceedings of TAC 2008, Gaithersburg, Maryland, USA (2008)
6. Jing, H.: Sentence Reduction for Automatic Text Summarization. In: Proceedings
of the Sixth Conference on Applied Natural Language Processing, Seattle, pp. 310–
315 (April 2000)
7. Gagnon, M., Da Sylva, L.: Text Compression by Syntactic Pruning. In: Lamontagne,
L., Marchand, M. (eds.) Canadian AI 2006. LNCS (LNAI), vol. 4013, pp. 312–323.
Springer, Heidelberg (2006)
8. Jaoua, M., Jaoua, F., Belguith, L.H., Hamadou, A.B.: Évaluation de l’impact
de l’intégration des étapes de filtrage et de compression dans le processus
d’automatisation du résumé. In: Résumé Automatique de Documents. Document
numérique, Lavoisier, vol. 15, pp. 67–90 (2012)
9. Jing, H., McKeown, K.R.: Cut and Paste Based Text Summarization. In: Proceed-
ings of NAACL-2000, Seattle, pp. 178–185 (2000)
10. Conroy, J.M., Schlesinger, J.D., O’Leary, D.P., Goldstein, J.: Back to Basics:
CLASSY 2006. In: Proceedings of the HLT-NAACL 2006 Document Understanding
Workshop, New York City (2006)
11. Nguyen, M.L., Phan, X.H., Horiguchi, S., Shimazu, A.: A New Sentence Reduction
Technique Based on a Decision Tree Model. International Journal on Artificial In-
telligence Tools 16(1), 129–138 (2007)
12. McClosky, D., Charniak, E., Johnson, M.: Effective Self-Training for Parsing. In:
Proceedings of HLT-NAACL 2006, New York, pp. 152–159 (2006)
13. Fellbaum, C.: WordNet: An Electronic Lexical Database. The MIT Press (May
1998)
14. Le Nguyen, M., Shimazu, A., Horiguchi, S., Ho, B.T., Fukushi, M.: Probabilistic
Sentence Reduction Using Support Vector Machines. In: Proceedings of COLING
2004, Geneva, pp. 743–749 (August 2004)
15. Clarke, J., Lapata, M.: Global Inference for Sentence Compression an Integer Linear
Programming Approach. Journal of Artificial Intelligence Research (JAIR) 31(1),
399–429 (2008)
16. Filippova, K., Strube, M.: Dependency Tree Based Sentence Compression. In: Pro-
ceedings of the Fifth International Natural Language Generation Conference, INLG
2008, Stroudsburg, PA, USA, pp. 25–32 (2008)
17. Schlesinger, J.D., O’Leary, D.P., Conroy, J.M.: Arabic/English Multi-document
Summarization with CLASSY: The Past and the Future. In: Gelbukh, A. (ed.) CI-
CLing 2008. LNCS, vol. 4919, pp. 568–581. Springer, Heidelberg (2008)
An Unsupervised Aspect Detection Model for Sentiment Analysis of Reviews
Abstract. With the rapid growth of user-generated content on the internet, sentiment analysis of online reviews has recently become a hot research topic, but due to the variety and wide range of products and services, supervised and domain-specific models are often not practical. As the number of reviews expands, it is essential to develop an efficient sentiment analysis model that is capable of extracting product aspects and determining the sentiments for those aspects. In this paper, we propose an unsupervised model for detecting aspects in reviews. In this model, first, a generalized method is proposed to learn multi-word aspects. Second, a set of heuristic rules is employed to take into account the influence of an opinion word on detecting the aspect. Third, a new metric based on mutual information and aspect frequency is proposed to score aspects, with a new iterative bootstrapping algorithm. The presented bootstrapping algorithm works with an unsupervised seed set. Finally, two pruning methods based on the relations between aspects in reviews are presented to remove incorrect aspects. The proposed model does not require labeled training data and is applicable to other languages or domains. We demonstrate the effectiveness of our model on a collection of product review datasets, where it outperforms other techniques.
1 Introduction
In the past few years, with the rapid growth of user-generated content on the internet, sentiment analysis (or opinion mining) has attracted a great deal of attention from researchers in data mining and natural language processing. Sentiment analysis is a type of text analysis under the broad area of text mining and computational intelligence. Three fundamental problems in sentiment analysis are: aspect detection, opinion word detection, and sentiment orientation identification [1-2].
Aspects are topics on which opinions are expressed. In the field of sentiment analysis, other names for aspect are: features, product features, or opinion targets [1-5].
Aspects are important because, without knowing them, the opinions expressed in a sentence or a review are of limited use. For example, in the review sentence “after using it, I found the size to be perfect for carrying in a pocket”, “size” is the aspect on which an opinion is expressed. Aspect detection is critical to sentiment analysis because its effectiveness dramatically affects the performance of opinion word detection and sentiment orientation identification. Therefore, in this study we concentrate on aspect detection for sentiment analysis.
Existing aspect detection methods can broadly be classified into two major ap-
proaches: supervised and unsupervised. Supervised aspect detection approaches re-
quire a set of pre-labeled training data. Although the supervised approaches can
achieve reasonable effectiveness, building sufficient labeled data is often expensive
and needs much human labor. Since unlabeled data are generally publicly available, it is desirable to develop a model that works with unlabeled data. Additionally, due to the variety and wide range of products and services being reviewed on the internet, supervised, domain-specific, or language-dependent models are often not practical. Therefore the framework for aspect detection must be robust and easily transferable between domains and languages.
In this paper, we present an unsupervised model which addresses the core tasks necessary to detect aspects from review sentences in a sentiment analysis system. The proposed model uses a novel bootstrapping algorithm which needs an initial seed set of aspects. Our model requires no labeled training data or additional information, not even for the seed set, and can easily be transferred between domains or languages. In the remainder of this paper, a detailed discussion of existing work on aspect detection is given in Section 2. Section 3 describes the proposed aspect detection model for sentiment analysis, including the overall process and specific designs. Subsequently, we describe our empirical evaluation and discuss important experimental results in Section 4. Finally, we conclude with a summary and some future research directions in Section 5.
2 Related Work
Several methods have been proposed, mainly in the context of product review mining [1-14]. The earliest attempt at aspect detection was based on the classic information extraction approach of using frequently occurring noun phrases, presented by Hu and Liu [3]. Their work can be considered the pioneering work on aspect extraction from reviews. They use association rule mining (ARM) based on the Apriori algorithm to extract frequent itemsets as explicit product features, only in the form of noun phrases. Their approach works well in detecting aspects that are strongly associated with a single noun, but is less useful when aspects encompass many low-frequency terms. The model proposed in our study works well with low-frequency terms and uses more POS patterns to extract aspect candidates. Wei et al. [4] proposed a semantic-
based product aspect extraction (SPE) method. Their approach begins with a preprocessing task, and then employs association rule mining to identify candidate product aspects. Afterward, on the basis of the list of positive and negative opinion
words, the semantic-based refinement step identifies, and then removes from the set of frequent aspects, possible non-product aspects and opinion-irrelevant product aspects. The SPE approach relies primarily on frequency- and semantic-based extraction for aspect detection; in our study, by contrast, we use frequency-based and inter-connection information between the aspects, and give more importance to multi-word aspects and to aspects with an opinion word in the review sentence. Somprasertsri and Lalitrojwong [8] proposed a supervised model for aspect detection by combining lexical and syntactic features with a maximum entropy technique. They extracted the learning features from an annotated corpus. Their approach uses a maximum entropy classifier for extracting aspects and includes a postprocessing step to discover the remaining aspects in the reviews by matching the list of extracted aspects against each word in the reviews. We use Somprasertsri and Lalitrojwong's work as a comparison for our proposed model, because the model in our study is completely unsupervised. Our work on aspect detection is designed to be as unsupervised as possible, so as to make it transferable across different types of domains, as well as across languages. The motivation is to build a model that works on the characteristics of the words in reviews and the interrelation information between them, and that takes into account the influence of an opinion word on detecting the aspect.
Figure 1 gives an overview of the proposed model for detecting aspects in sentiment analysis. Below, we discuss each of the functions in the aspect detection model in turn.
Fig. 1. The proposed model for aspect detection for sentiment analysis
In the review sentences, some aspects that people talk about consist of more than one word; “battery life”, “signal quality” and “battery charging system” are examples. This step finds useful multi-word aspects in the reviews. A multi-word aspect a is represented by w_1 w_2 \dots w_n, where w_i represents a single word contained in a, and n is the number of words in a. In this paper, we propose a generalized version of the FLR method [17, 18] to rank the extracted multi-word aspects and select the important ones. FLR is a word scoring method that uses the internal structures and frequencies of candidates. The FLR score of an aspect a is calculated as:

FLR(a) = f(a) \times LR(a)    (1)

LR(a) = \Big( \prod_{i=1}^{n} LR(w_i) \Big)^{1/n}    (2)

where f(a) is the frequency of a in the corpus. The left score l(w_i) of each word w_i of a target aspect a is defined as the number of types of words appearing to the left of w_i, and the right score r(w_i) is defined in the same manner. The LR score of a single word w_i is defined as:

LR(w_i) = \sqrt{ l(w_i) \times r(w_i) }    (3)
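Under these definitions, the baseline FLR score can be computed as in the sketch below (tokenized sentences as lists of words, and whitespace-separated aspects, are our assumptions; the authors' positional generalization is described next):

from collections import defaultdict
from math import sqrt

def lr_tables(sentences):
    # l(w): distinct word types immediately left of w; r(w): same on the right.
    left, right = defaultdict(set), defaultdict(set)
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            left[cur].add(prev)
            right[prev].add(cur)
    return left, right

def flr(aspect, freq, left, right):
    # FLR(a) = f(a) * (prod_i LR(w_i))^(1/n), with LR(w) = sqrt(l(w) * r(w)).
    words = aspect.split()
    lr_prod = 1.0
    for w in words:
        lr_prod *= sqrt(len(left[w]) * len(right[w]))
    return freq * lr_prod ** (1.0 / len(words))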
The proposed generalization of the FLR method concerns the definition of the two parameters l(w_i) and r(w_i). We change the definitions to give more importance to aspects containing more words. In the new definition, in addition to the frequency, we consider the position of w_i in aspect a. For the left score of each word w_i of a target aspect, we not only consider a single word on the left of w_i, but also whether there is more than one word on the left. We assign a weight to each position: this weight is equal to one for the first word on the left, two for the second word, and so on. We define the right score r(w_i) in the same manner. In addition, we apply add-one smoothing to both scores to avoid a score of zero when w_i has no connected words.
where a is the current aspect, f(a) is the number of sentences in the corpus in which a appears, f(a, s_i) is the frequency of co-occurrence of aspects a and s_i within a sentence, s_i is the i-th aspect in the list of seed aspects, and N is the number of sentences in the corpus. The A-Score metric is based on mutual information between an aspect and a list of aspects; in addition, it considers the frequency of each aspect. We apply add-one smoothing to the metric, so that all co-frequencies are non-zero. This metric helps to extract more informative and more co-related aspects.
The bootstrapping algorithm is an iterative clustering technique in which, at each iteration, the most interesting and valuable candidate is chosen to adjust the current seed set. This technique continues until a stopping criterion, such as a predefined number of outputs, is satisfied. The important part of an iterative bootstrapping algorithm is how to measure the value score of each candidate in each iteration. The proposed iterative bootstrapping algorithm for detecting aspects is shown in Figure 2. In this algorithm we use the A-Score metric to measure the value score of each candidate in each iteration.
As Figure 2 shows, the task of the proposed iterative bootstrapping algorithm is to enlarge the initial seed set into a final list of aspects. In each iteration, the current version of the seed set and the list of candidate aspects are used to compute the A-Score value of each candidate, resulting in one more aspect being added to the seed set. Finally, the augmented seed set is the final aspect list and the output of the algorithm.
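The enlargement loop can be sketched as follows; score_candidate stands in for the A-Score metric, and the stopping size mirrors the 70–120 range used in Section 4 (this is our sketch, not the authors' code):

def bootstrap(seed, candidates, score_candidate, target_size=100):
    # Grow the seed set by repeatedly adding the highest-scoring candidate.
    aspects = list(seed)
    pool = [c for c in candidates if c not in aspects]
    while pool and len(aspects) < target_size:
        best = max(pool, key=lambda c: score_candidate(c, aspects))
        aspects.append(best)
        pool.remove(best)
    return aspects  # the augmented seed set is the final aspect list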
After finalizing the list of aspects, some redundant ones may remain. For instance, “Suite” and “Free Speakerphone” are both redundant aspects, while “PC Suite” and “Speakerphone” are meaningful ones. Aspect pruning aims to remove these kinds of redundant aspects. For aspect pruning, we propose two kinds of pruning methods: Subset-Support Pruning and Superset-Support Pruning. We derived these methods from the experimental studies in our research.
Subset-Support Pruning
As we can see from Table 1, two of the POS patterns are “JJ NN” and “JJ NN NN”. These patterns extract some useful and important aspects like “remote control” or “optical zoom”, but some redundant and meaningless aspects also match these patterns. Aspects like “free speakerphone” or “rental dvd player” are examples, while their subsets “speakerphone” and “dvd player” are useful aspects. This step checks multi-word aspects that start with an adjective (JJ POS pattern), and removes those that are likely to be meaningless: we remove the adjective part of the aspect and then check, against a threshold, whether the remaining part is meaningful.
Superset-Support Pruning
In this step we remove redundant single-word aspects: we filter out single-word aspects for which a superset aspect exists. “Suite” and “life” are examples of such redundant aspects, for which “PC Suite” and “battery life” are the meaningful supersets.
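Both pruning methods can be sketched as simple filters over the learned aspect list; the exact threshold tests are one plausible reading of the description, with values taken from Section 4:

def subset_support_prune(aspects, freq, is_adjective, threshold=0.5):
    # 'JJ NN...' aspects: drop the leading adjective when the remaining
    # noun part is itself well supported ('free speakerphone' -> 'speakerphone').
    pruned = []
    for a in aspects:
        words = a.split()
        if len(words) > 1 and is_adjective(words[0]):
            rest = " ".join(words[1:])
            if freq.get(rest, 0) / max(freq.get(a, 1), 1) >= threshold:
                a = rest
        pruned.append(a)
    return pruned

def superset_support_prune(aspects, freq, min_freq=3):
    # Drop rare single-word aspects subsumed by a multi-word aspect
    # ('Suite' when 'PC Suite' is in the list).
    multi = [m for m in aspects if " " in m]
    kept = []
    for a in aspects:
        subsumed = " " not in a and any(a in m.split() for m in multi)
        if subsumed and freq.get(a, 0) < min_freq:
            continue
        kept.append(a)
    return kept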
4 Experimental Results
In this section we discuss the experimental results for the proposed model and the presented algorithms. For our evaluation we employed datasets of customer reviews for five products (available at http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets). The dataset focuses on electronic products: Apex AD2600 Progressive-scan DVD player, Canon G3, Creative Labs Nomad Jukebox Zen Xtra 40 GB, Nikon Coolpix 4300, and Nokia 6610. Table 2 shows the number of manually tagged product aspects and the number of reviews for each product in the dataset.
Table 2. Number of reviews and manually tagged product aspects in the dataset

Dataset     Number of reviews   No. of manual aspects
Canon       45                  100
Nikon       34                  74
Nokia       41                  109
Creative    95                  180
Apex        99                  110
algorithm, and the stopping criterion is reached when about 70 to 120 aspects have been learned. For the subset-support pruning method we set the threshold to 0.5. In the superset-support pruning step, an aspect is pruned if its frequency is lower than three and its ratio to the superset aspect is less than an experimentally set threshold of one. Table 3 shows the experimental results of our model at the three main steps described in Section 3: multi-word aspects and heuristic rules, iterative bootstrapping with A-Score, and aspect pruning.
Table 3. Recall and precision at three main steps of the proposed model

              Multi-word aspects    Iterative bootstrapping    Aspect
Dataset       and heuristic rules   with A-Score               pruning
Precision
Canon         26.7                  75.0                       83.1
Nikon         28.4                  69.8                       87.5
Nokia         23.9                  73.5                       79.0
Creative      14.8                  79.2                       88.9
Apex          19.3                  78.8                       82.0
Recall
Canon         85.7                  74.0                       70.1
Nikon         82.4                  72.5                       68.6
Nokia         84.1                  72.5                       71.0
Creative      78.9                  59.2                       56.3
Apex          74.6                  65.1                       65.1
Table 3 gives the precision and recall results at the main steps of the proposed model. Column 1 lists each product; the remaining columns give the precision and recall at each step. Column 2 uses the extracted single-word aspects and the multi-word aspects selected with the generalized FLR approach and the heuristic rules. The results indicate that the extracted aspects contain a lot of errors: using this step alone gives poor precision. Column 3 shows the corresponding results after employing the iterative bootstrapping algorithm with the A-Score metric. We can see that the precision is improved significantly by this step, but the recall drops. Column 4 gives the results after the pruning methods are performed. The results demonstrate the effectiveness of the pruning methods: the precision improves dramatically, while the recall drops by a few percent.
We evaluate the effectiveness of the proposed model against the benchmark results of [4]. Wei et al. proposed a semantic-based product aspect extraction (SPE) method and compared the results of SPE with the association rule mining approach (ARM) given in [3]. The SPE technique exploits a list of positive and negative adjectives defined in the General Inquirer to recognize opinion words semantically and subsequently extract product aspects expressed in customer reviews.
Table 4 shows the experimental results of our model in comparison with the SPE and ARM techniques (the values in this table for ARM and SPE come from the results in [4]). Both the ARM and SPE techniques employ a minimum support threshold of 1% in the frequent aspect identification step for finding aspects according to the association rule mining.
From Table 4, the macro-averaged precision and recall of the existing ARM technique are 47.9% and 60.9% respectively, whereas the macro-averaged precision and recall of the SPE technique are 49.8% and 71.6% respectively. Thus the effectiveness of SPE is better than that of the ARM technique, recording improvements in macro-averaged precision and recall. However, our proposed model outperforms both benchmark techniques in precision, achieving a macro-averaged precision of 84.1%. Specifically, the macro-averaged precision obtained by the proposed model is 36.2% and 34.3% higher than those reached by the existing ARM technique and SPE, respectively. The proposed model reaches a macro-averaged recall of 66.2%, which improves on ARM by 5.3% but is about 5.4% lower than the SPE approach. When considering the micro-averaged measures, we observe results similar to those obtained with the macro-averaged measures.
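For reference, the averaging schemes used in this comparison are the standard ones over the k datasets:

\text{macro-}P = \frac{1}{k}\sum_{i=1}^{k} P_i, \qquad \text{micro-}P = \frac{\sum_{i=1}^{k} TP_i}{\sum_{i=1}^{k} (TP_i + FP_i)},

and analogously for recall; macro-averaging weights each dataset equally, while micro-averaging weights each extracted aspect equally.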
It is notable that we observe a more substantial improvement in precision than in recall with our proposed model and techniques. As Table 4 shows, our model makes significant improvements over the others on all datasets in precision, but in recall SPE performs better. For example, our model records 36.2% and 34.3% improvements in terms of macro-averaged precision over the ARM and SPE techniques respectively, and 37.5% and 35% improvements in terms of micro-averaged precision. However, the proposed model achieves, on average, a higher recall than the ARM technique but a slightly lower recall than the SPE technique. One reason is that, for the iterative bootstrapping algorithm, we limit the number of output aspects to between 70 and 120, so the precision of the output is favored over the recall; another reason for the lower recall is that our model only detects explicit aspects from review sentences.
Figure 3 shows the F-score measures of the different approaches on the different product datasets. In all five datasets, our model achieves the highest F-score.
Fig. 3. F-scores of ARM, SPE, and the Proposed model for each dataset
This comparative evaluation suggests that the proposed model, which uses frequency-based and inter-connection information between the aspects, gives more importance to multi-word aspects, and uses the influence of an opinion word in the review sentence, attains better effectiveness for product aspect extraction. The existing ARM technique depends on the frequencies of nouns or noun phrases for aspect extraction, and SPE relies primarily on frequency- and semantic-based extraction of noun phrases for aspect detection. For example, our model is effective in detecting aspects such as “digital camera” or “battery charging system”, which both ARM and SPE fail to extract. Additionally, we can tune the parameters in our model to extract aspects with fewer or more words; for example, the aspect “canon power shot g3” can be found by the model. Finally, the results show that a completely unsupervised approach to aspect detection in sentiment analysis can achieve promising performance.
As mentioned before, the proposed model is an unsupervised, domain-independent model. We therefore empirically compare the performance of a supervised technique for aspect detection with that of the proposed model. We use the results of the supervised technique from Somprasertsri and Lalitrojwong's work [8]. They proposed an approach for aspect detection that combines lexical and syntactic features with a maximum entropy model, and it uses the same collection of product reviews we experimented on. They extracted the learning features from the annotated corpora of Canon G3 and Creative Labs Nomad Jukebox Zen Xtra 40 GB from the customer review dataset. In their work, the data were split into a training set of 80% and a testing set of 20%, and Maxent version 2.4.0 was employed as the classification tool. Table 5 shows the micro-averaged precision, micro-averaged recall, and micro-averaged F-score of their system output in comparison to our proposed model for the Canon and Creative datasets.
Table 5. Micro-averaged precision, recall and F-score for the supervised maximum entropy model and our unsupervised model
Table 5 shows that, for the proposed model, the precision is improved dramatically by 13.9%, the recall is decreased by 5.6%, and the F-score is increased by 2.6%. Our proposed model and algorithms therefore outperform Somprasertsri and Lalitrojwong's model. The significant difference between our model and theirs is that they use a fully supervised structure for aspect detection, whereas our proposed model is completely unsupervised and domain independent. Although in most applications supervised techniques can achieve reasonable effectiveness, preparing a training dataset is time-consuming, and the effectiveness of supervised techniques greatly depends on the representativeness of the training data. In contrast, unsupervised models automatically extract product aspects from customer reviews without requiring training data. Moreover, unsupervised models seem to be more flexible than supervised ones for environments in which various and frequently expanding products get discussed in customer reviews.
5 Conclusions
This paper proposed a model for the task of identifying aspects in reviews. The model is able to deal with two major bottlenecks: domain dependency and the need for labeled data. We proposed a number of techniques for mining aspects from reviews, using the inter-relation information between words in a review and the influence of an opinion word on detecting an aspect. Our experimental results indicate that our model is quite effective in performing the task. In future work, we plan to further improve and refine the model. We plan to employ clustering methods in conjunction with the model to extract implicit and explicit aspects together, and to summarize the output based on the opinions that have been expressed on them.
Acknowledgments. We would like to thank Professor Dr. Dirk Heylen and his group for giving us the opportunity to work with the Human Media Interaction (HMI) group at the University of Twente.
Cross-Lingual Natural Language Querying
over the Web of Data
Abstract. The rapid growth of the Semantic Web offers a wealth of semantic
knowledge in the form of Linked Data and ontologies, which can be considered
as large knowledge graphs of marked up Web data. However, much of this knowl-
edge is only available in English, affecting effective information access in the
multilingual Web. A particular challenge arises from the vocabulary gap resulting
from the difference in the query and the data languages. In this paper, we present
an approach to perform cross-lingual natural language queries on Linked Data.
Our method includes three components: entity identification, linguistic analysis,
and semantic relatedness. We use Cross-Lingual Explicit Semantic Analysis to
overcome the language gap between the queries and data. The experimental re-
sults are evaluated against 50 German natural language queries. We show that an
approach using a cross-lingual similarity and relatedness measure outperforms
other systems that use automatic translation. We also discuss the queries that can
be handled by our approach.
1 Introduction
1.1 Motivation
In the last decade, the Semantic Web community has been working extensively towards creating standards that increase the accessibility of the information available on the Web by providing structured metadata1. Yahoo! Research recently reported [1] that 30% of all HTML pages on the Web contain structured metadata such as microdata, RDFa, or microformats. This structured metadata enables automatic reasoning and inference. Thus, by embedding such knowledge within web documents, additional key information about the semantic relations among the data objects mentioned in the web pages can be captured.
1 http://events.linkeddata.org/ldow2012/slides/Bizer-LDOW2012-Panel-Background-Statistics.pdf
One of the most difficult challenges in multilingual web research is cross-lingual document retrieval, i.e., retrieval of relevant documents that are written in a language other than the query language. To address this issue we present a method for cross-lingual
natural language querying, which aims to retrieve all relevant information even if it is only available in a language different from the query language. Our approach differs from the state-of-the-art methods, which mainly consist of translating the queries into the document languages ([2], [3]). However, the poor accuracy of automatic translation of short texts like queries makes this approach problematic. Hence, using large knowledge bases as an interlingua [4] may prove beneficial. The approach discussed here considers Linked Data as a structured knowledge graph. The Linked Open Data (LOD) cloud currently contains more than 291 different structured knowledge repositories in RDF2 format, which are linked together using "DBpedia", "Freebase" or "YAGO". It contains a large number of instances in many different languages; however, the vocabulary used to define ontology relations is mainly in English. Thus, querying this knowledge base is not possible in other languages even if the instances are multilingual. Cross-lingual natural language querying is required to access this structured knowledge base, and this is the main objective of our approach.
1.2 Problem
1.3 Contribution
The main focus of our approach is the interpretation of NL-Queries by traversal over the structured knowledge graph, and the construction of a corresponding SPARQL query. As discussed in Section 1.2, translation-based approaches for cross-lingual NL-Queries suffer from the poor quality of automatic translation. Therefore, in this paper, we introduce a novel approach for performing cross-lingual NL-Queries over a structured knowledge base without automatic translation. As an additional contribution, we have created and analyzed a benchmark dataset of 50 NL-Queries in German. We discuss the results of a comparison of our method with an automatic translation method over the 28 NL-Queries that can be addressed by our approach.
2 Resource Description Framework (RDF) is the World Wide Web Consortium (W3C) specification for representing conceptual descriptions. It was designed as a metadata data model.
3 http://www.w3.org/TR/rdf-sparql-query/
Our algorithm can also be used for cross-lingual document retrieval provided that
the document collection is already marked up with a knowledge base, for instance,
Wikipedia articles annotated with DBpedia.
2 Related Work
Most of the proposed approaches that address the task of Cross-Lingual Information Retrieval (CLIR) reduce the problem to a monolingual scenario by translating the search query or the documents into the corresponding language. Many of them perform query translation ([8], [9], [2], [3]) into the language of the documents. However, all of these approaches suffer from the poor performance of machine translation on short texts (queries). Jones et al. [3] performed query translation by restricting the translation to the cultural heritage domain, while Nguyen et al. [2] make use of the Wikipedia cross-lingual link structure.
Without relying on machine translation, some approaches ([10], [11], [12]) make use of distributional semantics: they calculate a cross-lingual semantic relatedness score between the query and the documents. However, none of these approaches takes linguistic information into account or makes use of the large structured knowledge bases available. Under the assumption that documents in different languages are already marked up with a knowledge base (for instance, Wikipedia articles are annotated with DBpedia), the problem of CLIR can be converted into querying over structured data. There is still a language barrier, as queries can be in different languages, while most of the structured data is only available in English. QALL-ME [13] performs NL-Querying over structured information by using textual entailment to convert a natural language question into SPARQL. This system relies on the availability of multilingual structured data and can only retrieve information that is available in the query language. Therefore, this system is not able to perform CLIR. Freitas et al. [5] proposed an approach for natural language querying over Linked Data, based on the combination of entity search, a Wikipedia-based semantic relatedness measure (using ESA), and spreading activation. Their approach is similar to ours, but it cannot deal with different languages.
3 Background
We used DBpedia4 as the knowledge base for our experiments. DBpedia is a large structured knowledge base extracted from Wikipedia infoboxes. It contains data in the form of a large RDF graph, where each node represents an entity or a literal and the edges represent relations between entities. Each RDF statement can be divided into a subject, a predicate and an object. DBpedia contains a large ontology describing more than 3.5 million instances, forming a large general structured knowledge source. It is also very well connected to several other Linked Data repositories in the Semantic Web.
As DBpedia instances are extracted from Wikipedia, they contain cross-links across the different languages; however, the properties (or relations) associated with the instances are mainly defined in English.
4 http://dbpedia.org/
In order to query DBpedia, a structured query is required. SPARQL is the standard
structured query language for RDF, and allows users to write unambiguous queries to
retrieve RDF triples.
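For illustration, a SPARQL query for the running example of Figure 1 ("Who is the daughter of Bill Clinton married to?") could be issued against the public DBpedia endpoint as sketched below in Python with the SPARQLWrapper library; the property choices (dbo:child, dbp:spouse) are our own assumptions for illustration, not the output of the pipeline described in this paper.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbp: <http://dbpedia.org/property/>
    SELECT ?spouse WHERE {
        dbr:Bill_Clinton dbo:child ?daughter .
        ?daughter dbp:spouse ?spouse .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["spouse"]["value"])  # the spouse of Bill Clinton's daughter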
Semantic relatedness of two given terms can be obtained by calculating the similarity between two high-dimensional vectors in a distributional semantic model (DSM). According to the distributional hypothesis, the semantic meaning of a word can (at least to a certain extent) be inferred from its usage in context, that is, its distribution in text. This semantic representation is built through a statistical analysis of the large amount of contextual information in which a term occurs. One popular recent model that calculates this relatedness using distributional semantics is Explicit Semantic Analysis (ESA), proposed by Gabrilovich and Markovitch [14], which represents the semantics of a given term as a high-dimensional vector of explicitly defined concepts. In the original paper, Wikipedia articles were used to build the ESA model. Every dimension of the vector reflects a unique Wikipedia concept or title, and the weight of each dimension is the TF-IDF weight of the given term in the corresponding Wikipedia article.
An interesting characteristic of Wikipedia is that this very large collective knowledge is available in multiple languages, which facilitates an extension of ESA to multiple languages called Cross-Lingual Explicit Semantic Analysis (CL-ESA), proposed by Sorg et al. [15]. The articles in Wikipedia are linked together across languages, and this cross-lingual link structure can provide a mapping of a vector from one language to another. To understand CL-ESA, consider two terms ts in language Ls and tt in language Lt. As a first step, a concept vector for ts is created using the Wikipedia corpus in Ls. Similarly, the concept vector for tt is created in Lt. Then, one of the concept vectors can be converted to the other language by using the cross-lingual links between articles provided by Wikipedia. After obtaining both concept vectors in one language, the relatedness of the terms ts and tt can be calculated using cosine similarity, as in ESA. For better efficiency, we chose to build a multilingual index by composing poly-lingual Wikipedia articles using the cross-lingual mappings. In this case, no conversion of a concept vector from one language to the other is required. Instead, each Wikipedia concept can be represented by a unique name common to all languages, for instance the Uniform Resource Identifier (URI) of the English Wikipedia article.
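A minimal sketch of this computation, assuming aligned article collections in which articles_s[k] and articles_t[k] describe the same Wikipedia concept in the two languages (the use of scikit-learn and all variable names are our own illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def esa_vector(term, articles):
    # ESA concept vector: TF-IDF weight of the term in each aligned article
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(articles)       # rows: concepts, columns: terms
    col = vectorizer.vocabulary_.get(term.lower())
    return None if col is None else X[:, col].T  # shape (1, n_concepts)

def clesa_relatedness(term_s, term_t, articles_s, articles_t):
    # the k-th dimension of both vectors refers to the same concept,
    # so the vectors are directly comparable across languages
    v_s = esa_vector(term_s, articles_s)
    v_t = esa_vector(term_t, articles_t)
    if v_s is None or v_t is None:
        return 0.0
    return float(cosine_similarity(v_s, v_t)[0, 0])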
4 Approach
The first step of the interpretation process is the identification of potential entities, i.e., the Linked Data concepts (classes and instances) present in the NL-Query. A baseline entity identification can be defined as an exact match between the label of a concept and a term appearing in the NL-Query; for example, DBpedia: Bill Clinton shown in Figure 1. "Bill Clinton" is the name of a person, and it appears as a label of the DBpedia: Bill Clinton URI in the database. However, terms such as "Ministerpräsidenten von Spanien" and "Christus im Sturm auf dem See Genezareth" do not appear as labels in the database. Therefore, in order to resolve this issue, we translate the term to get an approximate term in the corresponding language and find the best matching label in the database. We use the Bing translation system6 to perform the automatic translation, but the quality of translation is not very promising and we do not always get the exact translation of a given label. Therefore, we calculate the token edit distance between the translated label and the labels in our database and select the closest match. For instance, the automatic translation of "Christus im Sturm auf dem See Genezareth" is "Christ in the storm on the sea of Galilee", but the label of the appropriate concept is "The Storm on the Sea of Galilee".
In addition, our approach includes a disambiguation process, in which we disambiguate the selected concept candidates based on their associated relations in the knowledge base. For instance, in the NL-Query "Wie viele Angestellte hat Google?"@de7, two different DBpedia entities can be found with the label "Google", i.e., "DBpedia: Google Search" and "DBpedia: Google". We calculate similarity scores with all associated relations of both, and find that the term "Angestellte" in the NL-Query obtains the maximum similarity score with the relation "numberEmployees", which is associated with "DBpedia: Google".
5 Translated from the QALD-2 challenge dataset, which has 100 NL-Queries in English over DBpedia.
6 http://www.bing.com/translator
7 Translation of "How many employees does Google have?" from the English test dataset.
Fig. 1. Query interpretation pipeline for the German NL-Query "Mit wem ist die Tochter von Bill Clinton verheiratet?" ("Who is the daughter of Bill Clinton married to?"@en)
5 Evaluation
5.1 Datasets
In order to evaluate our approach, we created a test set of 50 NL-Queries in German. The benchmark was created by manually translating the English NL-Queries provided by the "Question Answering over Linked Data (QALD-2)" dataset, which consists of 100 NL-Queries in English over DBpedia. All of the NL-Queries are annotated with keywords, corresponding SPARQL queries, and answers retrieved from DBpedia. Also, every NL-Query specifies some additional attributes, for example, whether a mathematical operation such as aggregation, counting or sorting is needed in order to retrieve the appropriate answers. We translated the QALD-2 dataset and divided it into two parts, one for training and one for testing; each part thus contains 50 NL-Queries in German. We performed
a manual analysis to keep the same complexity level in both datasets. We divided all of the NL-Queries into three different categories: simple, template-based and SPARQL aggregation. Simple queries contain DBpedia entities and their relations (DBpedia properties), and do not need a predefined template or rule to construct the corresponding SPARQL query. However, these queries include semantic and linguistic variations; that is, they express DBpedia properties using related terms rather than the exact label of a property. For instance, in the query "How tall is Michael Jordan?", "tall" does not appear in the vocabulary of DBpedia properties, but the answer to the query can be retrieved via the DBpedia property "height" appearing with "DBpedia: Michael Jordan". Queries that require predefined templates or rules are categorized as template-based [6] queries; for example, the query "Give me all professional skateboarders from Sweden." requires a predefined template for retrieving all persons with occupation skateboarding who were born in Sweden. SPARQL aggregation queries require a mathematical operation such as aggregation, counting or sorting, and therefore also require a predefined template.
Following the categorization, we divided the dataset into two parts, keeping an equal number of queries from each category. We then performed our experiments on the prepared test dataset of 50 NL-Queries in German. Table 1 shows statistics for both datasets. We are extending these datasets to other languages, and they are freely available.
Table 2. Error type and its distribution over 50 natural language queries and 28 selected natural
language queries in German
5.2 Experiment
We evaluated the outcome of our approach at all three stages of the processing pipeline: 1) entity identification, 2) linguistic analysis, and 3) semantic similarity and relatedness measures. This way, we can investigate the errors introduced by the individual components. As shown in Figure 1, the third component, "semantic similarity and relatedness measures", relies on the correctness of the constructed DAG, i.e., on the performance of both previous components (entity identification and linguistic analysis). Therefore, it is important to examine the performance of the individual components. We evaluated the outcome of entity identification and linguistic analysis on all 50 NL-Queries of the test dataset. However, all of the template-based and SPARQL aggregation NL-Queries are out of the scope of our setting; therefore, we discuss the results obtained for the remaining 28 NL-Queries. The entity identification component was evaluated in two ways: entity identification without automatic translation and entity identification with automatic translation10. Table 2 shows that appropriate entities could not be found in 10 out of 50 NL-Queries and in 3 out of 28. By using automatic translation, these errors are reduced to 7 and 1 NL-Queries, respectively. To evaluate the performance of the linguistic analysis component, we counted the number of NL-Queries for which the Stanford parser was unable to generate the dependencies. The statistics of the errors in linguistic analysis are shown in Table 2. As explained in Section 4.3, to find the relevant properties associated with a selected DBpedia entity, a comparison of all the properties with the next term from the DAG is needed. This requires a good cross-lingual similarity and relatedness measure. Therefore, to examine the effect of the similarity and relatedness measure over automatic translation, we used three different settings in calculating the scores: a) automatic translation followed by
10 The automatic translation was only used for those entities that could not be found in the database with the given labels.
"system" than "developer". Our approach simply failed to find results for Q14, due to the appearance of more than one highly related property, such as "mission name", "mission duration", "mission" and "launch pad", for the identified entity "Apollo 14", with "Astronauten" and "astronauts".
Our approach can also retrieve a partial set of appropriate results for more complex NL-Queries like "Gib mir alle Menschen, die in Wien geboren wurden und in Berlin gestorben sind"11. Therefore, we also report the performance of our system on the overall test dataset of 50 NL-Queries. The results are shown in Table 4. In this way, we can determine the overall coverage of our approach on all types of NL-Queries.
11 "Give me all people that were born in Vienna and died in Berlin." in the English test dataset.
Acknowledgments. This work has been funded in part by the European Union under
Grant No. 258191 for the PROMISE project, as well as by the Science Foundation
Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
References
1. Mika, P., Potter, T.: Metadata statistics for a large web corpus. In: WWW 2012 Workshop on
Linked Data on the Web (2012)
2. Nguyen, D., Overwijk, A., Hauff, C., Trieschnigg, D.R.B., Hiemstra, D., De Jong, F.: WikiTranslate: query translation for cross-lingual information retrieval using only Wikipedia. In: Proceedings of the 9th CLEF (2009)
3. Jones, G., Fantino, F., Newman, E., Zhang, Y.: Domain-specific query translation for mul-
tilingual information access using machine translation augmented with dictionaries mined
from Wikipedia. In: CLIA 2008, p. 34 (2008)
4. Steinberger, R., Pouliquen, B., Ignat, C.: Exploiting multilingual nomenclatures and
language-independent text features as an interlingua for cross-lingual text analysis appli-
cations. In: Proc. of the 4th Slovenian Language Technology Conf., Information Society
(2004)
5. Freitas, A., Oliveira, J.G., O’Riain, S., Curry, E., Pereira da Silva, J.C.: Querying linked data
using semantic relatedness: a vocabulary independent approach. In: Muñoz, R., Montoyo,
A., Métais, E. (eds.) NLDB 2011. LNCS, vol. 6716, pp. 40–51. Springer, Heidelberg (2011)
6. Unger, C., Bühmann, L., Lehmann, J., Ngomo, A.C.N., Gerber, D., Cimiano, P.: SPARQL template-based question answering. In: 21st International World Wide Web Conference, WWW 2012 (2012)
7. Yahya, M., Berberich, K., Elbassuoni, S., Ramanath, M., Tresp, V., Weikum, G.: Natural
language questions for the web of data. In: EMNLP-CoNLL 2012 (2012)
8. Lu, C., Xu, Y., Geva, S.: Web-based query translation for English-Chinese CLIR. In: Computational Linguistics and Chinese Language Processing (CLCLP), pp. 61–90 (2008)
9. Pirkola, A., Hedlund, T., Keskustalo, H., Järvelin, K.: Dictionary-based cross-language infor-
mation retrieval: Problems, methods, and research findings. Information Retrieval, 209–230
(2001)
10. Littman, M., Dumais, S.T., Landauer, T.K.: Automatic cross-language information retrieval
using latent semantic indexing. In: Cross-Language Information Retrieval, ch. 5, pp. 51–62.
Kluwer Academic Publishers (1998)
11. Zhang, D., Mei, Q., Zhai, C.: Cross-lingual latent topic extraction. In: Proceedings of the 48th
Annual Meeting of the Association for Computational Linguistics, ACL 2010, pp. 1128–
1137. Association for Computational Linguistics, Stroudsburg (2010)
12. Sorg, P., Braun, M., Nicolay, D., Cimiano, P.: Cross-lingual information retrieval based on
multiple indexes. In: Working Notes for the CLEF 2009 Workshop, Cross-Lingual Evaluation
Forum, Corfu, Greece (2009)
13. Ferrández, Ó., Spurk, C., Kouylekov, M., Dornescu, I., Ferrández, S., Negri, M., Izquierdo, R., Tomás, D., Orasan, C., Neumann, G., Magnini, B., Vicedo, J.L.: The QALL-ME framework: A specifiable-domain multilingual question answering architecture. Web Semantics, 137–145 (2011)
14. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1606–1611 (2007)
15. Sorg, P., Cimiano, P.: An experimental comparison of explicit semantic analysis implementa-
tions for cross-language retrieval. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds.)
NLDB 2009. LNCS, vol. 5723, pp. 36–48. Springer, Heidelberg (2010)
Extractive Text Summarization:
Can We Use the Same Techniques for Any Text?
Abstract. In this paper we address two issues. The first one analyzes
whether the performance of a text summarization method depends on
the topic of a document. The second one is concerned with how certain
linguistic properties of a text may affect the performance of a number of
automatic text summarization methods. For this we consider semantic
analysis methods, such as textual entailment and anaphora resolution,
and we study how they are related to proper noun, pronoun and noun
ratios calculated over original documents that are grouped into related
topics. Given the obtained results, we conclude that although our first hypothesis is not supported, since no evident relationship has been found between the topic of a document and the performance of the methods employed, adapting summarization systems to the linguistic properties of input documents benefits the summarization process.
1 Introduction
The first attempts to tackle the task of automatic text summarization were made as early as the middle of the past century [17]. Since then, the capabilities of modern hardware have increased enormously. However, nowadays when we talk about automatic text summarization we mostly focus on extractive summaries and hope to top a recall threshold of 50% [22]. Extractive summaries, as opposed to abstractive ones that involve natural language generation techniques, consist of segments of the original text. The task sounds less challenging than it has proven to be [22].
The extractive summarization systems developed so far have been tested on a number of different corpora [22]. A significant number of systems have been proposed for the task of summarizing newswire articles. Many of those systems emerged due to the Document Understanding Conference (DUC) challenges1 [8,25]. Even though the last challenge was held in 2007, the DUC data is still being used in research [15,16,24,26]. Some experiments were done
1 http://duc.nist.gov/
with the Reuters newswire corpus [2]. Short newswire articles differ from fiction, and the summarization systems that target the latter niche have adapted to its particular characteristics. There has been research on short fiction summarization [12], fairy tales [16], whole books [19], etc. Due to the rapid growth of the amount of web data, the need to summarize becomes even more acute. More recent research has focused on Web 2.0 textual genres, such as forum [30] and blog [11] summarization. The specific language used in blogs and forums makes the task different from newswire article summarization. Between blog and newswire summarization we could place e-mail summarization, which ranges from summarizing a single e-mail message [20] to a whole thread of related e-mails [23]. Automatic text summarization has also been combined with speech recognition to summarize spoken dialogues [9,18].
Summarization systems have been adapted to a number of different domains. In particular, there has been extensive research in summarizing medical documents [1]: a) medical journal articles [6,3]; b) healthcare documents for patients [7]. Another domain that has attracted attention is the legal domain. There have been some experiments with documents from the European Legislation Website2 [3].
However, text documents differ depending on genre, text type, domain, sublanguage, style, the particular topic covered, etc. (for a detailed discussion see [13]). The personal style of a writer, their vocabulary size, word choice, use of expressive means and irony, and sentence length and structure preferences are no less influential. Dialogues and monologues, science fiction and love stories, technical reports and newswire articles, poems and legalese, the use of metaphors and synonyms, anaphoric expressions and proper nouns: all of these carry their own unique properties. Those properties may affect the quality of summaries generated using the techniques developed for automatic text summarization, and in this paper we study this issue.
We adapt our systems to specific domains, genres and text styles, and we develop and implement different summarization techniques and heuristics. But to the best of our knowledge, so far there has been no attempt to treat the documents in a collection differently from each other. If a system makes use of a pronominal anaphora resolution module, it will try to resolve anaphora in all the documents. Now, what if a document contains only a few pronouns? The processing will slow down but the results will stay the same. What if a document contains a high number of pronouns and the chosen anaphora resolution module cannot handle them correctly? The processing will slow down and the resulting summary will be of worse quality. If we consider the word sense disambiguation task in specific domains such as legalese documents, the language used is so precise that synonymy disambiguation will probably introduce no improvement into the quality of summaries.
In this paper we address two issues. The first one is concerned with the problem of preliminary document analysis and how the linguistic properties of a text may affect the performance of a number of automatic text summarization methods.
2 http://eur-lex.europa.eu/en/legis/index.htm
2 Related Work
With the evolution of technology, different methods and heuristics have been used to improve extractive summarization systems. The early systems relied on simple heuristics: i) sentence location (sentences located at the beginning or end of the text, headings, and sentences highlighted in bold, among others, are considered more important and are included in the final summary) [5]; ii) cue phrases (the presence of previously defined words and phrases such as "concluding", "argue", "propose" or "this paper") [5,28]; iii) segment length (sentences with a length below some predefined threshold can be automatically ignored) [28]; iv) the most frequent words (exploring the term distribution of a document allows one to identify the most frequent words, which are assumed to represent at the same time the most important concepts of the document) [17]; a sketch of this last heuristic is given below.
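The frequency heuristic iv) can be sketched in a few lines (our own illustration in the spirit of Luhn [17], not the system evaluated in this paper):

import re
from collections import Counter

STOP = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "are", "was"}

def frequency_summary(text, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    freq = Counter(words)
    def score(sentence):
        # average frequency of the content words in the sentence
        toks = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP]
        return sum(freq[w] for w in toks) / (len(toks) or 1)
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return [s for s in sentences if s in top]  # preserve original order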
Today we apply various methods to structure the information extracted from documents and to analyze it intelligently. Graph theory has been successfully applied to represent the semantic content of a document [24]. Latent Semantic Analysis, which involves a term-by-sentence matrix representation and singular value decomposition, has also been proven to benefit the task of extractive summarization [26,10]. A number of machine learning algorithms, such as decision trees, rule induction, decision forests, Naive Bayes classifiers and neural networks, among others, have been adapted to this task as well [20,4]. Part-of-speech taggers [20], word sense disambiguation algorithms [24], anaphora resolution [26], textual entailment [27,15] and chunking [20] are among the most frequently used linguistic analysis methods.
To the best of our knowledge there has been no attempt to analyze the impact
of shallow linguistic properties of the original text on the quality of automatically
generated summaries.
However, there has been related work involving automatic text summarization and sentence structure. Nenkova et al. [21] focused on how sentence structure can help to predict the linguistic quality of generated summaries. The authors selected a set of structural features that includes:
– sentence length
– parse tree depth
– number of fragment tags in the sentence parse
– phrase type proportion
– average phrase length
– phrase type rate, computed for prepositional, noun and verb phrases by dividing the number of words of each phrase by the sentence length
– phrase length, computed for prepositional, noun and verb phrases by dividing the number of phrases of the given type that appear in the sentence by the sentence length
– length of NPs/PPs contained in a VP
– head noun modifiers
Though this set of features is different and more diverse, the phrase type ratio and phrase length can probably be compared to the noun, pronoun and proper noun ratios selected for our research. A ranking SVM was trained using these features. The summary ranking accuracy of the ranking SVM was compared to other linguistic quality measures, including Coh-Metrix, language models, word coherence and entity coherence measures. The evaluation of results was done at the system and input levels: in the former, all participating systems were ranked according to their performance on the entire test set, while in the latter, all the summaries produced for a single given input were ranked.
Structural features proved to be best suited for input-level evaluation of human summaries, mid-range for input-level evaluation of system summaries, and about the worst class of features for system-level evaluation of automatic summaries, while at the same time being the most stable set of features, varying the least across the chosen evaluation settings.
3 Summarization System
To analyze the impact of proper noun, pronoun and noun ratios, we have chosen the summarization system described in [29]. The system allows a modular combination of anaphora resolution, textual entailment and word sense disambiguation tools with a term or concept frequency based scoring module. In this research we focused on textual entailment and anaphora resolution.
Textual Entailment. The task of textual entailment is to capture the semantic inference between text fragments. There have been a number of summarization systems utilizing textual entailment to aid the summarization process, both in evaluating the final summary and in generating it. In the latter case, textual entailment is often applied to eliminate the semantic redundancy of a document [15].
Anaphora Resolution. A powerful pronominal anaphora resolution tool relates pronouns to their nominal antecedents. This is of use to all summarization methods that rely on term overlap, from simple term frequency to latent semantic analysis. Steinberger et al. [26] report an increase of 1.5% for their
We further grouped the selected articles according to the more general topic covered; e.g., marathon, Olympics and Super Bowl were assigned to the sports topic. This yielded 5 groups, covering the general topics of accidents, natural disasters, politics, sports and famous people (please see Table 1 for more details).
Having grouped the data into different topics, we proceeded with their linguistic analysis. The selected documents were processed with a part-of-speech tagger to obtain the average noun (NR), pronoun (PR) and proper noun (PNR) ratios for each of the 25 topics. These ratios were calculated by dividing the number of words of the respective word class by the total number of words in a document.
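A sketch of this computation with an off-the-shelf POS tagger (NLTK's Penn Treebank tag set; our own illustration, not necessarily the tagger used in this work):

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

def pos_ratios(text):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    words = [t for t in tags if t[0].isalpha()]  # drop punctuation tags
    n = len(words) or 1
    pnr = sum(t in ("NNP", "NNPS") for t in words) / n              # proper nouns
    nr = sum(t.startswith("NN") for t in words) / n                 # all nouns
    pr = sum(t in ("PRP", "PRP$", "WP", "WP$") for t in words) / n  # pronouns
    return pnr, nr, pr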
Table 1. Average proper noun (PNR), noun (NR) and pronoun (PR) ratios and average document size per topic
PNR NR PR size
accidents 1. battleship explosion 0.11466 0.34381 0.03540 670.0
2. ferry accidents 0.10874 0.34006 0.04024 423.666
3. IRA attack 0.10666 0.31853 0.05897 599.625
natural 4. earthquake Iran 0.15052 0.35761 0.02933 444.8
disasters 5. China flood 0.12880 0.38535 0.02032 383.8
6. Hurricane Gilbert 0.13867 0.36818 0.02501 730.923
7. Mount Pinatubo volcano 0.10716 0.33642 0.03041 672.4
8. North American drought 0.10402 0.34050 0.02485 398.0
9. thunderstorm US 0.13165 0.36461 0.02791 718.7
politics 10. Checkpoint Charlie 0.16148 0.34312 0.04406 513.2
11. abortion law 0.11954 0.33815 0.06449 545.833
12. Germany reunification 0.14561 0.33543 0.03163 558.5
13. Honecker protest 0.15005 0.34585 0.03887 286.545
14. Iraq invades Kuwait 0.16509 0.36821 0.03189 552.555
15. Robert Maxwell companies 0.17074 0.37558 0.03782 444.1
16. striking coal miners 0.09267 0.36252 0.02651 507.083
17. US ambassadors 0.20187 0.38561 0.03811 415.545
sports 18. Super Bowl 0.17758 0.39363 0.03184 438.8
19. marathon 0.13454 0.33495 0.05224 810.555
20. Olympics 0.15359 0.35780 0.04059 607.5
famous 21. Leonard Bernstein 0.19529 0.38679 0.04427 596.923
people 22. Lucille Ball 0.13095 0.32332 0.08895 848.714
23. Margaret Thatcher 0.13329 0.31561 0.06571 624.4
24. Sam Walton 0.10693 0.32362 0.05967 566.714
25. Gorbachev 0.09943 0.31251 0.05673 745.9
average 0.13718 0.35031 0.04183 564.191
These shallow linguistic analysis methods were chosen in agreement with the summarization system described in Section 3. The noun ratio was chosen because the topic of a document is usually characterized in the form of noun phrases, and textual entailment (with or without word sense disambiguation) can be used to eliminate semantic redundancy. The anaphora resolution process involves analyzing the pairs of nouns, pronouns and proper nouns in a document.
Table 1 contains the results for the selected features topic-wise. Figures higher than the average are highlighted in bold. Already at this shallow analysis level it can be seen that different topics have different tendencies. Documents that cover political issues and sports tend to have a higher number of proper nouns, while articles about famous people contain a lot of pronouns. The latter led us to the hypothesis that summarization systems that involve anaphora resolution would yield summaries of better quality for those articles, while the former suggested applying a textual entailment heuristic instead. The actual results obtained when applying the selected summarization system to the set of 25 groups of documents are discussed in Section 5.
5.2 Discussion
We pursued two different goals in this research. The initial hypothesis was that the topic of the original document and the quality of the generated summary are related. The second was that the quality of a generated summary rather depends on the linguistic properties of the original text and how they interact with the particular summarization technique chosen to tackle the task. Below we analyze the results with respect to both goals.
Does the Topic of the Original Document Affect the Quality of Generated Summaries? Based on the obtained results, we could not confirm the first hypothesis. If, for example, we consider the sports topic, it becomes evident that:
– the starting values for the ASW setting already range from 0.36541 to 0.50047
But the combination of AR and TEWSD improves over the results yielded by AR. The opposite case is observed for topic 1, where AR helps, TE yields worse results, and their combination is still worse than the results of AR alone. We thus assume that there are further linguistic properties and modes of interaction with the summarization techniques that are the subject of future research. Nevertheless, it becomes clear that the linguistic properties of the original document affect the quality of generated summaries.
References
16. Lloret, E., Palomar, M.: A Gradual Combination of Features for Building Automatic Summarisation Systems. In: Proceedings of the 12th International Conference on Text, Speech and Dialogue (TSD), Pilsen, Czech Republic, pp. 16–23 (2009)
17. Luhn, H.P.: The Automatic Creation of Literature Abstracts. IBM Journal of Re-
search and Development 2(2), 157–165 (1958)
18. McKeown, K., Hirschberg, J., Galley, M., Maskey, S.: From Text to Speech Summa-
rization. In: International Conference on Acoustics, Speech, and Signal Processing,
pp. 997–1000. IEEE, Philadelphia (2005)
19. Mihalcea, R., Ceylan, H.: Explorations in Automatic Book Summarization. In: Pro-
ceedings of the 2007 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp.
380–389 (2007)
20. Muresan, S., Tzoukermann, E., Klavans, J.L.: Combining Linguistic and Machine
Learning Techniques for Email Summarization. In: Proceedings of the 2001 Work-
shop on Computational Natural Language Learning (ConLL 2001). Association for
Computational Linguistics, Stroudsburg (2001)
21. Nenkova, A., Chae, J., Louis, A., Pitler, E.: Empirical Methods in Natural Lan-
guage Generation. Springer, Heidelberg (2010)
22. Nenkova, A.: Automatic Summarization. Foundations and Trends in Information
Retrieval 5, 103–233 (2011)
23. Nenkova, A., Bagga, A.: Facilitating Email Thread Access by Extractive Summary
Generation. In: Nicolov, N., Bontcheva, K., Angelova, G., Mitkov, R. (eds.) Recent
Advances in Natural Language Processing III, Selected Papers from RANLP 2003,
pp. 287–296. John Benjamins, Amsterdam (2003)
24. Plaza, L., Díaz, A.: Using Semantic Graphs and Word Sense Disambiguation Techniques to Improve Text Summarization. Procesamiento del Lenguaje Natural 47, 97–105 (2011)
25. Saggion, H.: Topic-based Summarization at DUC 2005. In: Proceedings of the
Document Understanding Workshop, Vancouver, B.C., Canada, pp. 1–6 (2005)
26. Steinberger, J., Poesio, M., Kabadjov, M.A., Ježek, K.: Two Uses of Anaphora
Resolution in Summarization. Information Processing and Management 43(6),
1663–1680 (2007)
27. Tatar, D., Tamaianu-Morita, E., Mihis, A., Lupsa, D.: Summarization by Logic
Segmentation and Text Entailment. In: 33rd CICLing, pp. 15–26 (2008)
28. Teufel, S., Moens, M.: Sentence extraction as a classification task. In: ACL/EACL
1997 Workshop on Intelligent Scalable Text Summarization, pp. 58–65. Association
for Computational Linguistics, Madrid (1997)
29. Vodolazova, T., Lloret, E., Muñoz, R., Palomar, M.: A Comparative Study of the
Impact of Statistical and Semantic Features in the Framework of Extractive Text
Summarization. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012.
LNCS, vol. 7499, pp. 306–313. Springer, Heidelberg (2012)
30. Yang, J., Cohen, A.M., Hersh, W.: Automatic summarization of mouse gene infor-
mation by clustering and sentence extraction from MEDLINE abstracts. In: AMIA
Annual Symposium, pp. 831–835 (2007)
Unsupervised Medical Subject Heading
Assignment Using Output Label Co-occurrence
Statistics and Semantic Predications
1 Introduction
Information (NCBI). PubMed lets users search over 22 million biomedical citations available in the MEDLINE bibliographic database, curated by the National Library of Medicine (NLM) from over 5000 leading biomedical journals in the world. To keep up with the explosion of information on various topics, users depend on search tasks involving the Medical Subject Headings (MeSH®) that are assigned to each biomedical article. MeSH is a controlled hierarchical vocabulary of medical subjects created by the NLM. Once articles are indexed with MeSH terms, users can quickly search for articles that pertain to a specific subject of interest instead of relying solely on keyword-based searches.
Since MeSH terms are assigned by librarians who look at the full text of an article, they capture semantic content of an article that cannot easily be captured by keyword or phrase searches. Assigning MeSH terms to articles is thus a routine task for the indexing staff at the NLM. It has been empirically shown to be a complex task, with 48% consistency, because it relies heavily on the indexers' understanding of the article and their familiarity with the MeSH vocabulary [1]. As such, the manual indexing task takes a significant amount of time, leading to delays in the availability of indexed articles: it is observed that it takes about 90 days to complete 75% of the citation assignment for new articles [2]. Moreover, manual indexing is also fiscally expensive [3]. For these reasons, there have been many recent efforts to come up with automatic ways of assigning MeSH terms for indexing biomedical articles. However, automated efforts (including ours) have mostly focused on predicting MeSH terms based solely on the abstract and title text of the articles, because most full-text articles are available only under paid licenses to which many researchers do not subscribe.
Many efforts in MeSH term prediction rely on two different methods. The first is the k-nearest neighbor (k-NN) approach, in which k articles that are already tagged with MeSH terms and whose content is found to be "close" to the new abstract to be indexed are retrieved; the MeSH terms from these k articles form a set of candidate terms for the new abstract. The second method is based on applying machine learning algorithms to learn a binary classifier for each MeSH term. A new candidate abstract is then put through all the classifiers, and the MeSH terms corresponding to classifiers that return a positive response are chosen as the indexed terms for the abstract. We note that both the k-NN and machine learning approaches need large sets of abstracts and their corresponding MeSH terms to make predictions for new abstracts. In this paper, we propose an unsupervised ensemble approach to extract MeSH terms and test it on two gold standard datasets. Our approach is based on named entity recognition (NER), relationship extraction, knowledge-based graph mining, and output label co-occurrence statistics. Prior attempts have used NER and graph mining as part of supervised approaches, and we believe this is the first time relationship extraction and output label co-occurrences have been applied to MeSH term extraction. Furthermore, our approach is purely unsupervised in that we do not use a prior set of MEDLINE citations already tagged with their corresponding MeSH terms.
Before we continue, we would like to emphasize that automatic indexing attempts, including our current attempt, are generally not intended to replace trained indexers but are mainly motivated to expedite the indexing process and increase the productivity of the indexing initiatives at the NLM. Hence, in these cases, recall might be more important than precision, although an acceptable trade-off is necessary. In the rest of the paper, we first discuss related work and the context of our paper in Section 2. We describe our datasets and methods in Section 3. We provide an overview of the evaluation measures and present results with discussion in Section 4.
2 Related Work
The NLM initiated efforts in MeSH term extraction with its Medical Text Indexer (MTI) program, which uses a combination of k-NN and NER based approaches with other unsupervised clustering and ranking heuristics in a pipeline [4]. MTI recommends MeSH terms to NLM indexers to help expedite the indexing process1. Another recent approach, by Huang et al. [2], uses a k-NN approach to obtain MeSH terms from a set of k already tagged abstracts and applies learning-to-rank to carefully rank the MeSH terms. They use two different gold standard datasets, one with 200 abstracts and the other with 1000 abstracts, and achieve an F-score of 0.5 and recall of 0.7 on the smaller dataset, compared to MTI's F-score of 0.4 and recall of 0.57. Several other attempts have tried different machine learning approaches with novel feature selection [5] and training data sample selection [6] techniques. A recent effort by Jimeno-Yepes et al. [7] uses a large dataset and meta-learning to train custom binary classifiers for each label, applying the best performing model for each label to new abstracts; we refer the reader to their work for a recent review of machine learning for MeSH term assignment. As mentioned in Section 1, most current approaches rely on large amounts of training data. We take a purely unsupervised approach under the assumption that we have access to output label2 co-occurrence frequencies even when training documents may not be available.
3 Our Approach
We use two different datasets: a smaller 200-abstract dataset and a larger 1000-abstract dataset used by Huang et al. [2]; besides results from their approach, they also report the results produced by NLM's MTI system. We chose these datasets and compare our results with their outcomes because they represent the k-NN and machine learning approaches typically used to address MeSH term extraction. To extract MeSH terms, we used a combination of three methods: NER, knowledge-based graph mining, and output label co-occurrence statistics.
1 For the full architecture of MTI's processing flow, please see: http://skr.nlm.nih.gov/resource/Medical_Text_Indexer_Processing_Flow.pdf
2 Here the 'labels' are MeSH terms; we use 'label' to conform to the notion of classes in multi-label classification problems.
The UMLS3 is a large, domain-expert driven aggregation of over 160 biomedical terminologies and standards. It functions as a comprehensive knowledge base and facilitates interoperability between information systems that deal with biomedical terms. It has three main components: the Metathesaurus, the Semantic Network, and the SPECIALIST lexicon. The Metathesaurus contains terms and codes, henceforth called concepts, from different terminologies. Biomedical terms from different vocabularies that are deemed synonymous by domain experts are mapped to the same Concept Unique Identifier (CUI) in the Metathesaurus. The Semantic Network acts as a typing system organized as a hierarchy of 133 semantic types, such as disease or syndrome, pharmacologic substance, or diagnostic procedure. It also captures 54 important relations (called semantic relations) between biomedical entities in the form of a relation hierarchy, with relations such as treats, causes, and indicates. The Metathesaurus currently has about 2.8 million concepts with more than 12 million relationships connecting them. The relationships take the form C1 → <rel-type> → C2, where C1 and C2 are concepts in the UMLS and <rel-type> is a semantic relation such as treats, causes, or interacts. The semantic interpretation of these relationships (also called triples) is that C1 is related to C2 via the relation <rel-type>. The SPECIALIST lexicon is useful for lexical processing and variant generation of different biomedical terms.
Here M[i][i] = 1 because the numerator is the same as the denominator. We note that, with this definition, M[i][j] is an estimate of the probability P(j-th term | i-th term). Let T and A be the sets of title and abstract MeSH terms extracted using NER, respectively, and let C = T ∪ A be the set of context terms, which includes the MeSH terms extracted from both title and abstract. Let α and β be the thresholds used to identify highly co-occurrent terms and to select those among them that are also contextually relevant, respectively. Details of these thresholds are made clear later in this section. Next we show the pseudocode of the candidate term expansion algorithm.
Algorithm. Expand-Candidate-Terms(T, A, α, β, M[ ][ ])
1: Initialize seed list S = T
2: Set context terms C = T ∪ A
3: S.append(Apply-Context(A, β, C, M[ ][ ]))
{Next, we iterate over the terms in list S}
4: for all terms t in S do
5: Let H = [ ] be an empty list
6: for each i such that M[t][i] > α do
7: H.append(i-th MeSH term)
8: relevantTerms = Apply-Context(H, β, C, M[ ][ ])
9: relevantTerms = relevantTerms − S {avoid adding existing terms}
10: S.append(relevantTerms)
11: return S
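A Python transliteration of this expansion (our own sketch; Apply-Context is not fully specified in this excerpt, so we assume it keeps the candidates whose average co-occurrence score with the context terms exceeds β):

def apply_context(candidates, beta, context, M):
    # assumption: keep candidates whose mean co-occurrence with the
    # context terms exceeds the contextual-relevance threshold beta
    return [t for t in candidates
            if sum(M.get(c, {}).get(t, 0.0) for c in context)
               / max(len(context), 1) > beta]

def expand_candidate_terms(T, A, alpha, beta, M):
    # M is a dict of dicts: M[i][j] estimates P(term j | term i)
    S = list(T)                          # line 1: seed list from title terms
    C = set(T) | set(A)                  # line 2: context terms
    S += [t for t in apply_context(A, beta, C, M) if t not in S]  # line 3
    for t in S:                          # line 4: S grows while we iterate
        H = [j for j in M.get(t, {}) if M[t][j] > alpha]          # lines 5-7
        relevant = [r for r in apply_context(H, beta, C, M) if r not in S]
        S.extend(relevant)               # lines 8-10
    return S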
where c(N, D_i, E_i) is the number of true positives (correct gold standard terms) in the top N ranked candidate terms in E_i for citation D_i. Given this, the micro F-score is F_\mu = 2 P_\mu R_\mu / (P_\mu + R_\mu). We also define the average precision of a citation, computed considering the top N terms, as

AP(D_i, N) = \frac{1}{|G_i|} \sum_{r=1}^{N} I(E_i^r) \cdot \frac{c(r, D_i, E_i)}{r},
where E_i^r is the r-th ranked term in the set of predicted terms E_i for citation D_i, and I(E_i^r) is a Boolean function with value 1 if E_i^r ∈ G_i and 0 otherwise. Finally, the mean average precision (MAP) of the collection of citations D, when considering the top N predicted terms, is given by

MAP(D, N) = \frac{1}{|D|} \sum_{i=1}^{|D|} AP(D_i, N).
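These measures translate directly into code; a short sketch (our own, with gold and predicted MeSH term lists per citation):

def average_precision(gold, predicted, N):
    gold = set(gold)
    hits, ap = 0, 0.0
    for r, term in enumerate(predicted[:N], start=1):
        if term in gold:          # I(E_i^r) = 1 only on this branch
            hits += 1             # hits equals c(r, D_i, E_i) at rank r
            ap += hits / r
    return ap / len(gold) if gold else 0.0

def mean_average_precision(gold_lists, predicted_lists, N):
    pairs = list(zip(gold_lists, predicted_lists))
    return sum(average_precision(g, p, N) for g, p in pairs) / len(pairs)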
6 Check-tags form a special small set of MeSH terms that are always checked by trained coders for all articles. The full check tag list is available at: http://www.nlm.nih.gov/bsd/indexing/training/CHK_010.htm
Remark 2. In our experiments, MeSH terms associated with concepts at a distance greater than 1 from the input concept in the graph mining approach (Section 3.3) did not provide a significant improvement in the results. Hence, here we only report results for the case where the shortest distance between the input concept and the MeSH ancestors is ≤ 1.
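This restriction amounts to collecting MeSH terms among the immediate graph neighbours of each input concept; a sketch with networkx (our own illustration; the triple list and the is_mesh test are assumptions about the surrounding system):

import networkx as nx

def mesh_terms_within_one_hop(triples, input_cuis, is_mesh):
    # triples: iterable of (C1, rel_type, C2) UMLS relationships
    G = nx.Graph()
    for c1, _, c2 in triples:
        G.add_edge(c1, c2)
    found = set()
    for cui in input_cuis:
        if is_mesh(cui):
            found.add(cui)                                           # distance 0
        if cui in G:
            found.update(n for n in G.neighbors(cui) if is_mesh(n))  # distance 1
    return found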
We used two different datasets: the smaller dataset has 200 citations and is called the NLM2007 dataset, while the larger 1000-citation dataset is denoted by L1000. Both datasets can be obtained from the NLM website: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/indexing/paperdat.zip.
Next, we present our best micro-average precision, recall, F-score, and MAP in Table 1, in comparison with the results obtained by the supervised ranking method of [2] and the results obtained when using NLM's MTI program (as reported by Huang et al. in their paper). From the table we see that the performance of our unsupervised methods is comparable (except in the case of the MAP measure) to that of the MTI method, which uses a k-NN approach. However, as can be seen, a supervised ranking approach that relies on training data and uses the k-NN approach performs much better than ours. We emphasize that our primary goal has been to demonstrate the potential of unsupervised approaches, which can complement supervised approaches when training data is available but can also work with reasonable performance when training data is scarce or unavailable, as is often the case in many biomedical applications. Furthermore, unlike in many unsupervised scenarios, we do not even have access to the full artifact to be classified (here, the full text of the article), which further demonstrates the effectiveness of our method.
F-score of at least 14%. This shows that simplistic approaches relying only on NER may not provide reasonable performance.
Whether using unsupervised or supervised approaches, fine-tuning the parameters is always an important task. Next, we discuss how different thresholds (α and β in Section 3.4) and different values of N affect the performance measures. We believe this is important because low threshold values and high cut-off values for N have the potential to increase recall by trading off some precision. We experimented with different threshold ranges for α and β and with different values of N. We show some interesting combinations observed for the L1000 dataset in Table 3. We gained 1% recall by changing N from 25 to 35 with the same thresholds. Lowering the thresholds with N = 35 led to a 5% gain in recall with an equivalent decrease in precision, which decreases the F-score by 5% while increasing the MAP score by 1%.
Table 3. Micro recall, precision, F-score and MAP for different thresholds on the L1000 dataset
Rμ Pμ Fμ MAP
N = 25, α = 0.10, β = 0.05 0.51 0.33 0.40 0.36
N = 25, α = 0.08, β = 0.04 0.56 0.29 0.38 0.38
N = 35, α = 0.08, β = 0.04 0.57 0.28 0.38 0.38
N = 35, α = 0.06, β = 0.03 0.62 0.23 0.33 0.39
Finally, among the ranking approaches we tried, the best is Borda's aggregation of two ranked lists, the first based on average co-occurrence scores and the second on the semantic predication based binning approach, with average co-occurrence as the tie-breaker within each bin. This aggregated ranking is used to obtain the best scores reported in all the tables discussed in this section. The semantic predication based binning provided a 3% improvement in the MAP score on both the NLM2007 and L1000 datasets.
5 Conclusion
References
1. Funk, M., Reid, C.: Indexing consistency in MEDLINE. Bulletin of the Medical Library Association 71(2), 176 (1983)
2. Huang, M., Névéol, A., Lu, Z.: Recommending mesh terms for annotating biomed-
ical articles. J. of the American Medical Informatics Association 18(5), 660–667
(2011)
3. Aronson, A., Bodenreider, O., Chang, H., Humphrey, S., Mork, J., Nelson, S., Rindflesch, T., Wilbur, W.: The NLM indexing initiative. In: Proceedings of the AMIA Symposium, American Medical Informatics Association, p. 17 (2000)
4. Aronson, A., Mork, J., Gay, C., Humphrey, S., Rogers, W.: The NLM indexing initiative: MTI Medical Text Indexer. In: Proceedings of MEDINFO (2004)
5. Yetisgen-Yildiz, M., Pratt, W.: The effect of feature representation on MEDLINE document classification. In: AMIA Annual Symposium Proceedings, American Medical Informatics Association, vol. 2005, pp. 849–853 (2005)
6. Sohn, S., Kim, W., Comeau, D.C., Wilbur, W.J.: Optimal training sets for bayesian
prediction of MeSH assignment. Journal of the American Medical Informatics As-
sociation 15(4), 546–553 (2008)
7. Jimeno-Yepes, A., Mork, J.G., Demner-Fushman, D., Aronson, A.R.: A one-
size-fits-all indexing method does not exist: Automatic selection based on meta-
learning. JCSE 6(2), 151–160 (2012)
8. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification.
Lingvisticae Investigationes 30(1), 3–26 (2007)
9. Aronson, A.R., Lang, F.M.: An overview of metamap: historical perspective and
recent advances. J. American Medical Informatics Assoc. 17(3), 229–236 (2010)
10. Bodenreider, O., Nelson, S., Hole, W., Chang, H.: Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies. In: Proceedings of the AMIA Symposium, pp. 815–819 (1998)
11. Rindflesch, T.C., Fiszman, M.: The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J. of Biomedical Informatics 36(6), 462–477 (2003)
12. Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for
the web. In: Proceedings of the 10th International Conference on World Wide Web,
WWW 2001, pp. 613–622 (2001)
Bayesian Model Averaging and Model Selection
for Polarity Classification
F.A. Pozzi, E. Fersini, and E. Messina
University of Milano-Bicocca
Viale Sarca, 336 - 20126 Milan, Italy
{federico.pozzi,fersini,messina}@disco.unimib.it
1 Introduction
According to the definition reported in [1], sentiment “suggests a settled opinion
reflective of one’s feelings”. The aim of Sentiment Analysis (SA) is therefore to
define automatic tools able to extract subjective information, such as opinions
and sentiments from texts in natural language, in order to create structured
and actionable knowledge to be used by either a Decision Support System or
a Decision Maker. The polarity classification task can be addressed at different
granularity levels, such as word, sentence and document level. The most widely
studied problem is SA at document level [2], in which the naive assumption is
that each document expresses an overall sentiment. When this is not ensured, a
lower granularity level of SA could be more useful and informative. In this work,
polarity classification has been investigated at sentence level. The main polarity
classification approaches are focused on identifying the most powerful model for
classifying the polarity of a text source. However, an ensemble of different models
could be less sensitive to noise and could provide a more accurate prediction [3].
Regarding SA, the study of ensembles is still in its infancy. This is mainly due
to the difficulty of finding a reasonable trade-off between classification accuracy
and increased computational time, which is particularly challenging when dealing
with online and real-time big data. To the best of our knowledge, the existing
voting-system approaches for SA are based on traditional methods such as
Bagging [4] and Boosting [5], and disregard how to select the best ensemble
composition. In this paper we propose a novel Bayesian Model Averaging (BMA)
approach that combines different models selected using a specific heuristic
selection strategy.
$$P(l(s) \mid D) = \sum_{i \in C} P(l(s) \mid i)\, P(i)\, P(D \mid i) \qquad (2)$$
Given an initial set C, $r_i^C$ is iteratively computed, excluding at each iteration the
classifier that achieves the lowest $r_i^C$. In order to define the initial ensemble, the
baseline classifiers in C have to show some level of dissimilarity. This can be
achieved using models that belong to different families (i.e., generative, discriminative
and large-margin models). The proposed strategy allows us to reduce the
search space from $\sum_{k=1}^{n} \frac{n!}{k!(n-k)!}$ to $n-1$ potential candidates for determining
the optimal ensemble. In fact, at each iteration the classifier with the lowest $r_i^C$
is disregarded until the smallest combination is achieved.
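A minimal sketch of this backward-elimination heuristic follows; contribution() stands in for Eq. (4), which is not reproduced in this excerpt, and evaluate() for validation accuracy, both supplied by the caller:

```python
def select_ensemble(classifiers, contribution, evaluate):
    """From the full set C, repeatedly drop the classifier with the lowest
    contribution r_i^C; only n-1 candidate ensembles are ever scored
    instead of all subsets."""
    current = list(classifiers)
    candidates = [tuple(current)]            # the full ensemble
    while len(current) > 2:                  # stop at the smallest pair
        worst = min(current, key=lambda c: contribution(c, current))
        current.remove(worst)
        candidates.append(tuple(current))
    return max(candidates, key=evaluate)
```

With C = {DIC, NB, ME, SVM, CRF}, this yields the four candidate ensembles discussed in Section 4.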
The baseline classifiers considered in this paper are the following:
Naïve Bayes. NB [7] is the simplest generative model that can be applied to
the polarity classification task. It predicts the polarity label l given a vector
representation of textual cues by exploiting Bayes' Theorem.
Support Vector Machines. SVMs [9] are linear learning machines that try to find
the optimal hyperplane discriminating samples of different classes, ensuring the
widest margin.
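Purely as an illustration of these two baseline families (the paper's actual toolchain differs, e.g. LIBSVM and MALLET, and the two sentences below are invented), a bag-of-words sketch with scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

sentences = ["a wonderful, moving film", "dull and utterly predictable plot"]
labels = ["positive", "negative"]

x = CountVectorizer().fit_transform(sentences)   # BOW sentence vectors
nb = MultinomialNB().fit(x, labels)              # generative baseline
svm = LinearSVC().fit(x, labels)                 # large-margin baseline
print(nb.predict(x), svm.predict(x))
```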
4 Experimental Investigation
4.1 Experimental Setup
In this study, three benchmark datasets are considered.
1 http://www.sics.se/people/oscar/datasets/
2 http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
3 www.cs.cornell.edu/people/pabo/movie-review-data/
4 http://www.rottentomatoes.com/
5 www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
6 http://alias-i.com/lingpipe/
7 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[8], and CRF is induced by exploiting regularized likelihood [10]. ME and linear
chain CRF classifiers have been applied using the MALLET package.8
In this section, the performance achieved on the considered datasets, both by the
baseline classifiers and by the ensemble methods (MV and BMA, described in Sect.
2), is presented. To this purpose, we measured Precision (P), Recall (R) and
F1-measure, defined as
$$P = \frac{TP}{TP+FP} \qquad R = \frac{TP}{TP+FN} \qquad F_1 = \frac{2 \cdot P \cdot R}{P+R} \qquad (6)$$
for both the positive and negative labels (denoted in the sequel by P+, R+, F1+
and P−, R−, F1−, respectively). We also measured Accuracy, defined as
$$Acc = \frac{TP+TN}{TP+FP+FN+TN} \qquad (7)$$
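For reference, Eqs. (6) and (7) amount to the following trivial computation over a 2×2 confusion matrix:

```python
def prf_acc(tp, fp, fn, tn):
    """Precision, recall, F1 (Eq. 6) and accuracy (Eq. 7)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r), (tp + tn) / (tp + fp + fn + tn)
```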
8 http://mallet.cs.umass.edu/
achieves 71.07% global accuracy (Table 1), while the best ensemble (composed
of DIC, ME and CRF) achieves an accuracy of 74.24% and 75.85% for MV and
BMA, respectively. The contribution of each classifier belonging to a given
ensemble can be computed a priori by applying the model selection strategy.
Starting from the initial set C={DIC, NB, ME, SVM, CRF}, the classifiers
are sorted with respect to their contribution by computing (4). As shown in
Table 3, the classifier with the lowest contribution at the first iteration is NB.
Then, (4) is re-computed on the ensemble {C \ NB}, highlighting SVM as the
classifier with the lowest contribution. At iteration 3 and 4, the worst classifiers
to be removed from the ensemble are ME and CRF respectively.
As highlighted by the accuracy measure, the model selection heuristic is able
to determine the optimal composition by evaluating four ensemble candidates. In
this case, the optimal solution is found at iteration 3, where the best ensemble is
composed of {DIC, ME, CRF}. For the sake of completeness, all ensemble performances
are depicted in Figure 1, and the cumulative chart of accuracy is reported in Figure 2.
Table 4 reports the performance achieved on ProductDataMD “books”. The
contribution of the best BMA is about 3.55%, compared with 1.6% for MV.
This result can be seen in Table 7, where the best ensemble for
BMA is composed of DIC, ME, SVM and CRF (Iteration 2).
5 Conclusion
In this work we discussed how to explore the potential of ensembles of classi-
fiers for sentence level polarity classification and proposed an ensemble method
References
1. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends
in Information Retrieval 2, 1–135 (2008)
2. Yessenalina, A., Yue, Y., Cardie, C.: Multi-level structured models for document-
level sentiment classification. In: Proc. of the Conf. on Empirical Methods in NLP
(2010)
3. Dietterich, T.G.: Ensemble learning. In: The Handbook of Brain Theory and Neural Networks, pp. 405–508. MIT Press (2002)
4. Whitehead, M., Yaeger, L.: Sentiment mining using ensemble classification models.
In: Sobh, T. (ed.) Innovations and Advances in Computer Sciences and Engineer-
ing, pp. 509–514. Springer Netherlands (2010)
5. Xiao, M., Guo, Y.: Multi-view AdaBoost for multilingual subjectivity analysis. In: 24th Inter. Conf. on Computational Linguistics, COLING 2012, pp. 2851–2866 (2012)
6. Hoeting, J.A., Madigan, D., Raftery, A.E., Volinsky, C.T.: Bayesian model averag-
ing: A tutorial. Statistical Science 14(4), 382–417 (1999)
7. McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categ., pp. 41–48 (1998)
8. McCallum, A., Pal, C., Druck, G., Wang, X.: Multi-conditional learning: Genera-
tive/discriminative training for clustering and classification. In: AAAI, pp. 433–439
(2006)
9. Cortes, C., Vapnik, V.: Support-vector networks. ML 20(3), 273–297 (1995)
10. Sutton, C.A., McCallum, A.: An introduction to conditional random fields. Foun-
dations and Trends in ML 4(4), 267–373 (2012)
11. Täckström, O., McDonald, R.: Semi-supervised latent variable models for sentence-
level sentiment analysis. In: Proc. of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies, pp. 569–574 (2011)
12. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, Boom-boxes and Blenders: Domain adaptation for sentiment classification. In: Association for Computational Linguistics (2007)
13. Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment cate-
gorization with respect to rating scales. In: Proc. of the 43rd Annual Meeting on
Association for Computational Linguistics, pp. 115–124 (2005)
14. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proc. of the 10th
ACM SIGKDD Inter. Conf. on Knowledge Discovery and DM, pp. 168–177 (2004)
An Approach for Extracting and Disambiguating
Arabic Persons' Names Using Clustered Dictionaries
and Scored Patterns
O. Zayed, S. El-Beltagy, and O. Haggag
1 Introduction
Named entity recognition (NER) has become a crucial constituent of many natural
language processing (NLP) and text mining applications. Examples of those applica-
tions include Machine Translation, Text Clustering and Summarization, Information
Retrieval and Question Answering systems. An exhaustive list can be found in [5].
Arabic NER has attracted much attention during the past couple of years, with
research in the area achieving results comparable to those reported for the English
language.
Approaches for recognizing named entities from text have been divided into three
categories: “Rule Based NER”, “Machine Learning Based NER” and “Hybrid
NER”. Rule based NER combines grammar, in the form of handcrafted rules, with
gazetteers to extract named entities. Machine learning based NER utilizes large
datasets and features extracted from text to train a classifier to recognize named
entities; this approach thus converts the named entity recognition task into a
classification task. Machine learning algorithms can be further divided into
supervised and unsupervised. Hybrid NER combines the machine learning approach
with the rule based approach. A comparison between the rule based approach
and the machine learning approach is given in [13]. As mentioned in [1, 13, 17], it is
difficult to extend the rule based approach to new domains because of the necessity of
complicated linguistic analysis to detect the named entities. Conversely, the difficulty
of the machine learning approach lies in its need for a precise selection of features
from a suitably tagged training dataset in order to recognize new entities
from a test dataset in the same domain.
To reach acceptable results, however, the employment of an Arabic parser is a must in
any of the above listed approaches. While this is perfectly valid for extracting named
entities from Modern Standard Arabic (MSA), it is difficult to apply to colloquial
Arabic, which is currently used extensively in micro-blogging and social media
contexts. The main difficulty of applying previously devised approaches to this type
of media is the fact that existing Arabic parsers cannot deal with colloquial Arabic at
any acceptable degree of accuracy. Without the utilization of such parsers, the degree
of ambiguity in Arabic person name detection rises significantly for reasons that are
detailed in Section 2.
This paper introduces an approach for extracting Arabic persons’ names, the most
challenging Arabic named entity, without utilizing any Arabic parsers or taggers. The
presented approach makes use of a limited set of dictionaries integrated with a statis-
tical model based on association rules, a name clustering module, and a set of rules to
detect person names. The main challenges addressed by this work could be summa-
rized as:
─ Overcoming the person name ambiguity problem without the use of parsers, tag-
gers or morphological analyzers.
─ Avoiding the shortcomings of both rule based and machine learning based NER
approaches, including the employment of complex linguistic analysis, huge sets
of gazetteers, huge training sets, and feature extraction from annotated corpora,
in order to be able to extend the approach to new domains, primarily colloquial
Arabic, in our future work.
Evaluation of the presented approach was carried out on a benchmark dataset and
shows that the system outperforms the state of the art machine learning based system.
While the recall of the system falls below the state of the art hybrid system, the preci-
sion of the system is comparable to it.
The rest of the paper is organized as follows: Section 2 discusses Arabic specific
challenges faced when building NER systems; Section 3 describes the proposed ap-
proach in detail. In Section 4, system evaluation on a benchmark dataset is discussed.
Section 5 gives an overview of the literature on NER systems for the Arabic
language. Finally, conclusions and future work are presented in Section 6.
The Arabic language is a complex and rich language, which compounds the challenges
faced by researchers when developing an Arabic natural language processing (ANLP)
application [11]. Recognizing Arabic named entities is a difficult task due to a variety
of reasons as explained in detail in [1, 11]. Those reasons are revisited with examples:
─ One of the major challenges of the Arabic language is that it has many levels of
ambiguity [11]. A significant level of ambiguity is the semantic ambiguity, in which
one word could imply a variety of meanings. For example, the word “ﻧﺒﻴﻪ” could imply
the phrase (his prophet), the adjective (intelligent) or the name of a person (Nabih).
─ Arabic named entities could appear with conjunctions or other connection letters
which complicates the task of extracting persons’ names from Arabic text such as
“( ”وﻣﺤﻤﺪand Mohammed), “( ”آﻤﺤﻤﺪas Mohammed), “( ”ﻟﻤﺤﻤﺪto Mohammed),
“( ”ﻓﻤﺤﻤﺪthen Mohammed) or “( ”ﺑﻤﺤﻤﺪwith Mohammed).
─ Most Arabic text suffers from a lack of diacritization, which causes another level
of ambiguity in which a word could belong to more than one part of speech with
different meanings [1, 11]. For example, the word “ﻧﻬﻲ” without diacritics could
imply the female name (Noha), or the verb (prohibited).
─ Arabic lacks capitalization as it has a unified orthographic case [1]. In English
some named entities can be distinguished because they are capitalized. These in-
clude persons’ names, locations and organizations.
─ Arabic text often contains not only Arabic named entities, but also named entities
translated and transliterated into Arabic [11], which often lack a uniform representation.
For example, the name (Margaret) can be written in Arabic in different ways,
such as “ﻣﺮﺟﺮﻳﺖ”, “ﻣﺎرﺟﺮﻳﺖ”, “ﻣﺮﻏﺮﻳﺖ” or “ﻣﺎرﻏﺮﻳﺖ”.
─ Many persons’ names are either derived from adjectives or can be confused with
other nouns sharing the same script. Examples of ambiguous Arabic male names
include [Adel, Said, Hakim, and Khaled], whose adjective or noun counterparts
are [Just, Happy, Wise, and Immortal]. Examples of some ambiguous female
names include [Faiza, Wafia, Omneya, and Bassma], which could be interpreted as
[Winner, Loyal, Wish, and Smile]. Examples of some ambiguous family/last
names are [Harb, Salama, Khatab and Al-Shaer], which translate to [War, Safety,
Speech/Letter and The Poet].
─ Moreover, some Arabic persons’ names match with verbs, such as [Yahya, Yasser,
and Waked], whose verb counterparts are [Greets, Imprisons, and Emphasized].
In addition, some foreign persons’ names transliterated into Arabic could be
interpreted as prepositions or pronouns, such as [Ho, Anna, Ann, and Lee], which
correspond to [He, I, That, Mine].
The combination of the above listed factors makes Arabic person names the most
challenging Arabic named entities to extract without any parsers. Simply building a
system based on straightforward matching of persons’ names against dictionaries
will often result in mistakes. The traditional solution for this is using
parsers or taggers. However, extracting persons’ names from colloquial Arabic text
invalidates this solution as existing parsers fail to parse colloquial Arabic at an ac-
ceptable level of precision mainly due to sentence irregularity, incompleteness and the
varied word order of colloquial Arabic [17]. In this paper, the ambiguity problem is
addressed in two ways. First, publicly available dictionaries of persons’ names are
grouped into clusters. Second, a statistical model based on association rules is built to
extract patterns that indicate the occurrence of persons’ names. These approaches will
be explained in detail in section 3.
In this work, a rule based approach combined with a statistical model is adopted to
identify and extract person names from Arabic text. Our approach tries to overcome
two of the major shortcomings of using rule based techniques which are the difficulty
of modifying a rule based approach for new domains and the necessity of using huge
sets of gazetteers. Section 5 highlights the differences between the resources needed
by our approach and previous approaches.
Our approach consists of two phases, as shown in Fig. 1. In the first phase, “The
building of resources phase”, person names are collected and clustered, and name
indicating patterns are extracted. In the second phase, “Extraction of persons’ names
phase”, name patterns and clusters are used to extract persons’ names from input text.
Both of these phases are described in depth, in the following subsections.
In this phase the resources on which the system depends are prepared. This phase is
divided into 4 stages. In the first stage, persons’ names are collected from public re-
sources. In the second stage, dictionaries of first, male/middle and family persons’
names are built from collected resources. In the third stage, names are grouped to-
gether into clusters to address the Arabic persons’ names ambiguity problem as will
be detailed later. In the fourth and final stage, a corpus is used to build and score
patterns which indicate the occurrence of a person’s name. Scoring of the patterns is
done using association rules.
Name Collection. Wikipedia1, with its huge collection of names under the people
category, offers an excellent resource for building a database for persons’ names.
Kooora2, which is an Arabic website for sports, also provides a large list of football
1 http://ar.wikipedia.org/wiki/ﺗﺮاﺟﻢ:ﺗﺼﻨﻴﻒ
2 http://www.kooora.com/default.aspx?showplayers=true
and tennis players’ names. In this stage, Wikipedia and Kooora websites were used to
collect a list of about 19,000 persons’ full names. Since the aim of this work is not
just to recognize names of famous people, but instead to identify the name of any
person even if it does not appear in the collected lists, the collection was further
processed and refined in order to achieve this goal in the “Building the dictionaries”
stage.
Building of Dictionaries. In this stage, the list of names collected in the previous
stage (we call this list the “full_names_19000_list”) was processed to separate first
names from family names, in order to create three name lists: first, male/middle,
and family names. Collecting a list of male names is important as a male name
is often used as a family name. It is difficult to know whether a first name is a
male or female name, but any middle name is always a male name.
At the beginning, input names in the list are normalized using the rules presented in
[12]. This step addresses the different variations of Arabic persons’ name representation.
As described in [17], Arabic names can have affixes such as prefixes or embedded
nouns. A word preceded or followed by those affixes must not be split on white
spaces; instead, the word and its affix should be considered a single entity.
For example, the male name ﻋﺒﺪ اﻟﻌﺰﻳﺰ (Abdulaziz) should not be split into ﻋﺒﺪ (Abd),
denoting the first name, and اﻟﻌﺰﻳﺰ (Alaziz), denoting a family name; instead it should
be treated as the single entity ﻋﺒﺪ اﻟﻌﺰﻳﺰ (Abdulaziz) and considered a first name. Table
1 lists the different variations of Arabic persons’ names with examples [17].
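A minimal sketch of this affix-aware splitting follows; the prefix set contains only the ﻋﺒﺪ (Abd) case named above and is an illustrative subset of the affixes listed in Table 1:

```python
# Keep compound names such as "Abd al-Aziz" together instead of splitting
# them on white space. The prefix set is an illustrative subset only.
COMPOUND_PREFIXES = {"ﻋﺒﺪ"}              # Abd-; the paper's affix list is larger

def split_full_name(full_name):
    """Return (first, middle_names, family) from a normalized full name."""
    tokens, parts, i = full_name.split(), [], 0
    while i < len(tokens):
        if tokens[i] in COMPOUND_PREFIXES and i + 1 < len(tokens):
            parts.append(tokens[i] + " " + tokens[i + 1])   # one entity
            i += 2
        else:
            parts.append(tokens[i])
            i += 1
    family = parts[-1] if len(parts) > 1 else None
    return parts[0], parts[1:-1], family
```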
3 All lists mentioned in this paper are available for download from: http://tmrg.nileu.edu.eg/downloads.html
Building of Name Clusters. In a simplistic world, once the name lists are built, they
can be used to identify previously unseen names by stating that a full name is com-
posed of a first name followed by zero or more male names followed by (a male name
or a family name). However, as stated before, the inherent ambiguity of Arabic
names does not lend itself to such a simplistic solution. One of the problems of
simple matching is the possibility of incorrectly extracting a name which is a combination
of an Arabic name and a foreign name. For example, given the phrase اﺗﻬﻢ اﻳﻤﻦ ﺑﻮش
(Ayman accused Bush), a simple matching approach would result in the extraction
of the full name اﻳﻤﻦ ﺑﻮش (Ayman Bush), even though it is highly unlikely that an
Arabic person’s name such as اﻳﻤﻦ (Ayman) will appear beside an American
person’s name such as ﺑﻮش (Bush). In the example above, the translation puts the verb
“accused” between “Ayman” and “Bush”, but in the Arabic representation both
names are placed next to each other and preceded by the verb. Since Arabic text
often contains not only Arabic names, but names from almost any country transliterated
into Arabic, incorrectly identifying those could affect the system’s precision
significantly. A more common form of error resulting from simple matching is
encountered when prepositions or pronouns match names in the compiled name lists,
as explained in Section 2. For example, when the phrase ان ﻣﺤﻤﺪ (That Mohammed) is
encountered, the simple matching approach will result in the incorrect extraction of
the full name ان ﻣﺤﻤﺪ (Ann Mohammed).
Given the fact that the “full_names_19000_list” contains Arabic, English, French,
Spanish, Hindi, and Asian persons’ names, written in the Arabic language, we decided
to cluster these names and allow name combinations only within generated clusters.
As a pre-processing step, the 19,000 persons’ names list is traversed to build a dic-
tionary in which the first name is a key item whose corresponding value is a list of the
other middle and family names that have occurred with it. The variations of writing
Arabic persons’ names mentioned in the previous subsection are considered. This
dictionary is converted to a graph, such that first names, middle names and family
names form separate nodes. Edges are then established between each first name and
its corresponding middle and family names. The resulting graph consisted of 17,393
nodes and 22,518 undirected edges.
The Louvain method [9] was then applied to the graph for finding communities
within the network. A community in this context is a cluster of names that are related.
The Louvain method defines a resolution parameter that controls the size of the
communities. The standard value of the resolution parameter p is 1.0; a smaller
value of p results in the generation of smaller communities, while a larger value
results in larger communities. By trying several values for this resolution parameter on the
ANERcorp4 [3] dataset, the value of p=7 was found to produce the best results.
The outcome was a set of 1995 clusters. Each name is assigned a class number de-
noting which community (cluster) it belongs to.
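Under stated assumptions (networkx for the graph and the python-louvain package for community detection; the paper does not name its implementation), the dictionary-to-clusters step could be sketched as follows, using the tuned resolution p = 7:

```python
import networkx as nx
import community  # python-louvain package

def cluster_names(first_name_dict, resolution=7.0):
    """first_name_dict: {first_name: [middle/family names seen with it]}.
    Returns {name: community id} via Louvain community detection."""
    graph = nx.Graph()
    for first, others in first_name_dict.items():
        for other in others:          # edge: first name -> middle/family name
            graph.add_edge(first, other)
    return community.best_partition(graph, resolution=resolution)
```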
Fig. 2 shows a snapshot of the resulting clusters. It can be observed from visualiz-
ing the data that most of the culturally similar names were grouped together; it can be
noted that most of the names common in the Arabic-speaking regions were grouped
4 http://www1.ccls.columbia.edu/~ybenajiba/downloads.html
Fig. 2. Visualization of generated clusters, to the left are all generated clusters, lone clusters can
be seen on the border and the two largest clusters are those of Arabic names (left) and Western
names (right). To the right is a closer view of a subset of the Arabic names cluster.
together. The same applies to English and French names, and to other names that are
largely unique to their region, such as Asian names.
Extracting Scored Patterns. In this stage, a statistical model is built to automatically
learn patterns which indicate the occurrence of a person’s name.
Initially, each name in the “full_names_19000_list” is used as a query to search
news articles in order to build a learning dataset from the same domain from which
we target the extraction of persons’ names. The Akhbarak5 API and the Google
Custom Search API6 were used to search and retrieve news stories.
Around 200 news article links were crawled (whenever possible) for each person
name in the “full_names_19000_list”. A total of around 3,800,000 links were
collected using this procedure. After downloading the pages associated with these
links, BoilerPipe7 was used to extract the content or body of each news article. Very
similar stories were detected and removed.
Following this step, unigram patterns around each name are extracted, and three
lists are formed. A complete pattern list keeps the set of complete patterns around the
name with their counts. A complete pattern consists of <word1><name><word2>,
where the <name> part simply indicates that a name has occurred between word1
and word2. Two types of unigram pattern lists are kept: a “before” list keeps the
patterns that appear before a name with their counts (example: اآﺪ (confirmed)), and
an “after” list stores the patterns that occur after a name with their counts
(example: ان (that)).
Finally, the support measure employed by association rules [2] is used to score each
pattern in the three lists. Support is calculated as the ratio of the count of a pattern
followed by a name to the total count of all patterns followed by a name. The three
newly created lists of scored patterns are saved in descending order of score.
5 http://www.akhbarak.net/
6 https://developers.google.com/custom-search/v1/overview
7 http://code.google.com/p/boilerpipe/
The above rule is used to extract names from a sentence such as:
... ﻗﺎل اﻟﺮﺋﻴﺲ ﻣﺤﻤﺪ ﻣﺮﺳﻲان ﻣﺼﺮ ﺗﺨﻄﻮ
President Mohammad Morsi said that Egypt is stepping through …
This rule is generalized to extract names from sentences which contain multiple
honorifics before the person’s name, such as:
... ﻗﺎل رﺋﻴﺲ اﻟﻮزراء اﻻﺳﺮاﺋﻴﻠﻲ اﻳﻬﻮد اوﻟﻤﺮت إﻧﻪ ﻋﺎزم
Prime Minister of Israel Ehud Olmert said that he will …
An example of one of the rules used to “learn new names” is to check for a pattern
from the “patterns before” list followed by an unknown name (not in the dictionaries)
with the prefix ﻋﺒﺪ (Abd), followed by a known male name and/or family name (the
previous stopping criterion is used).
Another rule used to learn unknown family names is to check for a pattern from
the “patterns before” list followed by a known first name followed by an unknown
name, such as:
… وﻗﺎل ﻣﺪﻳﺮ اﻟﻤﺆﺳﺴﻪ ﻓﺮﻳﺪون ﻣﻮاﻓﻖان اﻟﻤﺴﺘﺜﻤﺮ
The Director of the Foundation Feridun Mouafiq said that the investor …
In this example, ﻓﺮﻳﺪون (Feridun) is a known first name while ﻣﻮاﻓﻖ (Mouafiq) is an
unknown family name; our system is able to extract this person’s full name correctly.
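A hypothetical rendering of this rule over a tokenized sentence (the dictionary and pattern containers are assumed to have been built in the resource phase):

```python
def learn_family_name(tokens, i, before_patterns, first_names, known_names):
    """If tokens[i] is a high-support "before" pattern, tokens[i+1] a known
    first name and tokens[i+2] unknown, propose a new family name."""
    if (i + 2 < len(tokens)
            and tokens[i] in before_patterns
            and tokens[i + 1] in first_names
            and tokens[i + 2] not in known_names):
        return tokens[i + 1], tokens[i + 2]      # (first, candidate family)
    return None
```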
Other rules are employed, but are not included due to space limitations. The next
section shows how the use of patterns and the use of clusters improve the system
performance.
4 System Evaluation
The presented system was evaluated using the precision, recall and f-score measures
based on what it extracted as names from the benchmark ANERcorp [3] dataset. As
mentioned in [3], ANERcorp consists of 316 articles which contain 150,286 tokens
and 32,114 types. Proper names form 11% of the corpus. Table 2 provides a compari-
son between the results of the presented system with two state of the art systems
which are the hybrid NERA approach [1] and the machine learning approach using
conditional random fields (CRF) [4].
Table 2. Comparison between our system performance in terms of precision, recall and F-score
with the current two state of the art systems
Table 3 shows the effect of using clusters, patterns and disambiguation lists on the
system’s performance.
5 Related Work
The majority of previous work addressing NER in the Arabic language was developed
for formal MSA text, which is the literary language used in newspapers and scientific
books. NER from the informal colloquial Arabic currently being used widely in social
media communication has not been directly addressed. In [17], previous work on
Arabic NER is discussed extensively. The currently used rule based approaches for
extracting named entities from MSA text are dependent on tokenizers, taggers and
parsers combined with a huge set of gazetteers. Although those approaches might be
adequate for extracting persons’ names from a formal domain, it would be hard to
modify them for the colloquial domain [17].
There is some similarity between our approach and another approach based on local
grammar [16] which uses reporting verbs as patterns to indicate the occurrence of per-
sons’ names. However our approach extracts patterns automatically from the domain
under study, while the other approach is limited to a list of reporting verbs. NERA [15]
is a system for extracting Arabic named entities using a rule-based approach in which
linguistic grammar-based techniques are employed. NERA was evaluated on purpose-
built corpora using ACE and Treebank news corpora that were tagged in a semi-
automated way. The work presented in [10] describes a person named entity recognition
system for the Arabic language. The system makes use of heuristics to identify person
names and is composed of two main parts: the General Architecture for Text Engineer-
ing (GATE) environment and the Buckwalter Arabic Morphological Analyzer
(BAMA). The system makes use of a huge set of dictionaries.
As mentioned in [1], the most frequently used approach for NER is the machine
learning approach by which text features are used to classify the input text depending
on an annotated dataset. Benajiba et al. applied different machine learning techniques
[3–8] to extract named entities from Arabic text. The best performing of these makes
use of optimized feature sets [4]. ANERSys [3] was initially developed based on n-
grams and a maximum entropy classifier. A training and test corpora (ANERcorp)
and gazetteers (ANERgazet) were developed to train, evaluate and boost the imple-
mented technique. ANERcorp is currently considered the benchmark dataset for test-
ing and evaluating NER systems. ANERSys 2.0 [7] basically improves the initial
technique used in ANERSys by combining the maximum entropy with POS tags in-
formation. By changing the probabilistic model from Maximum Entropy to Condi-
tional Random Fields the accuracy of ANERSys is enhanced [8].
Hybrid approaches combine machine learning techniques, statistical methods and
predefined rules. The most recent hybrid NER system for Arabic uses a rule based
NER component integrated with a machine learning classifier [1] to extract three
types of named entities which are persons, locations and organizations. The reported
results of the system are significantly better than pure rule-based systems and pure
machine-learning classifiers. In addition the results are also better than the state of the
art Arabic NER system based on conditional random fields [4]. The system was ex-
tended to include more morphological and contextual features [14] and to extract
eleven different types of named entities using the same hybrid approach.
Compared with other approaches, our system utilizes a far more limited set of
resources. All our system requires is a large set of names, which can be easily obtained
from public resources such as Wikipedia, and a list of honorifics. Our system also
avoids the use of parsers or taggers and the need for annotated datasets.
6 Conclusion and Future Work
This paper presented a novel approach for extracting persons’ names from Arabic
text. This approach integrated name dictionaries and name clusters with a statistical
model for extracting patterns that indicate the occurrence of persons’ names. The
approach overcomes major limitations of the rule based approach, namely the need
for a huge set of gazetteers and domain dependence. More importantly, the fact that
the presented work uses no parsers or taggers, and uses publicly available resources to
learn patterns, means that the system can be easily adapted to work on colloquial
Arabic or new domains. Our rule based approach was able to overcome the ambiguity
of Arabic persons’ names using clusters. Building the patterns’ statistical model using
association rules improved the tasks of Arabic persons’ names disambiguation and
extraction from any domain. System evaluation on a benchmark dataset showed that
the performance of the presented technique is comparable to the state of the art ma-
chine learning approach while it still needs some improvements to compete with the
state of the art hybrid approach.
This work is part of an ongoing effort to extract named entities from any type
of Arabic text, whether informal colloquial Arabic or formal MSA. Our
plans for the future are to improve the results obtained by this approach while avoid-
ing model over-fitting. The main intention is to test this approach on a colloquial da-
taset collected from Arabic social media.
References
1. Abdallah, S., Shaalan, K., Shoaib, M.: Integrating rule-based system with classification for
Arabic named entity recognition. In: Gelbukh, A. (ed.) CICLing 2012, Part I. LNCS,
vol. 7181, pp. 311–322. Springer, Heidelberg (2012)
2. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items
in Large Databases. In: Proceedings of the 1993 ACM SIGMOD International Conference
on Management of Data, SIGMOD 1993, New York, pp. 207–216 (1993)
3. Benajiba, Y., Rosso, P., Benedí Ruiz, J.M.: ANERsys: An Arabic Named Entity Recogni-
tion System Based on Maximum Entropy. In: Gelbukh, A. (ed.) CICLing 2007. LNCS,
vol. 4394, pp. 143–153. Springer, Heidelberg (2007)
4. Benajiba, Y., Diab, M., Rosso, P.: Arabic named entity recognition using optimized fea-
ture sets. In: Proceedings of the Conference on Empirical Methods in Natural Language
Processing, EMNLP 2008, pp. 284–293. Association for Computational Linguistics, Mor-
ristown (2008)
5. Benajiba, Y., Diab, M., Rosso, P.: Arabic named entity recognition: A feature-driven
study. IEEE Transactions on Audio, Speech, and Language Processing 17(5), 926–934
(2009)
6. Benajiba, Y., Diab, M., Rosso, P.: Arabic named entity recognition: An svm-based ap-
proach. In: The International Arab Conference on Information Technology, ACIT 2008
(2008)
7. Benajiba, Y., Rosso, P.: Anersys 2.0: Conquering the ner task for the Arabic language by
combining the maximum entropy with pos-tag information. In: IICAI, pp. 1814–1823
(2007)
8. Benajiba, Y., Rosso, P.: Arabic named entity recognition using conditional random fields.
In: Workshop on HLT & NLP within the Arabic World. Arabic Language and Local Lan-
guages Processing: Status Updates and Prospects (2008)
9. Blondel, V.D., Guillaume, J., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities
in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10008 (2008)
10. Elsebai, A., Meziane, F., Belkredim, F.Z.: A rule based persons names Arabic extraction
system. In: The 11th International Business Information Management Association Confe-
rence, IBIMA 2009, Cairo, pp. 1205–1211 (2009)
11. Farghaly, A., Shaalan, K.: Arabic natural language processing: Challenges and solutions.
ACM Transactions on Asian Language Information Processing 8(4), 1–22 (2009)
12. Larkey, L., Ballesteros, L., Connell, M.E.: Light stemming for Arabic information retriev-
al. Arabic Computational Morphology 38, 221–243 (2007)
13. Mansouri, A., Affendey, L.S., Mamat, A.: Named entity recognition using a new fuzzy
support vector machine. In: Proceedings of the 2008 International Conference on Comput-
er Science and Information Technology, ICCSIT 2008, Singapore, pp. 24–28 (2008)
14. Oudah, M., Shaalan, K.: A pipeline Arabic named entity recognition using a hybrid ap-
proach. In: Proceedings of the 24th International Conference on Computational Linguis-
tics, COLING 2012, India, pp. 2159–2176 (2012)
15. Shaalan, K., Raza, H.: NERA: Named entity recognition for Arabic. Journal of the Ameri-
can Society for Information Science and Technology, 1652–1663 (2009)
16. Traboulsi, H.: Arabic named entity extraction: A local grammar-based approach. In: Pro-
ceedings of the International Multiconference on Computer Science and Information
Technology, vol. 4, pp. 139–143 (2009)
17. Zayed, O., El-Beltagy, S., Haggag, O.: A novel approach for detecting Arabic persons’
names using limited resources. In: Complementary Proceedings of 14th International Con-
ference on Intelligent Text Processing and Computational Linguistics, CICLing 2013,
Greece (2013)
ANEAR: Automatic Named Entity Aliasing Resolution
A. Zirikly and M. Diab
1 Introduction
Named Entity Aliasing Resolution is the process where the different instances (aliases
and variants) of an entity are detected and recognized as being referents to the same per-
son within large collections of data. An example of this problem is shown in Figure 1
where each cluster contains several aliases for the same person (e.g. Yasser Arafat,
Abou Ammar). The variation in name aliases can manifest as a difference in spelling
(e.g. Qaddafi, Gazzafi, Qadafi, Qazzafy), difference in the name mention such as Mo-
hamed Hosni Mubarak, vs. Hosni Mubarak, or by using a completely different alias
such as Abou Mazen as an alternate for Mahmoud Abbas. Restricting this problem to
aliases of famous people leads to a relatively easier resolution process since the aliases
are typically publicly known. However, with the proliferation of web based data and
social media, we note the pervasive use of aliases by ordinary people. Nowadays, the
use of aliases and fake names is increasingly spreading among larger groups of peo-
ple and becoming more popular due to political (terrorism, revolutions), criminal and
privacy reasons. Hence, the ability to recognize and identify the different aliases of an
entity improves the quality of information extracted (higher recall) by helping the entity
linking and tracking, leading to better overall information extraction performance.
The NEAR task is relatively close to the Entity Mention Detection (EMD) task.
However they differ in several aspects. In NEAR there is no processing of pronominal
mentions by definition. Moreover, the NEAR task, as defined for this paper, specifi-
cally focuses on detecting aliases for person named entities (PNE) and does not handle
other NE types such as Organizations and Locations addressed in the EMD task. We
should highlight, however, that there is nothing inherent in the NEAR task that bars
it from processing other types of NEs. To date, most work in relating PNEs in
documents relies on external resources such as Wikipedia to provide links between
aliases and PNEs, thus confining the aliasing resolution task to famous people. In this paper, we
build a system, Automatic NEAR (ANEAR), that is domain and language independent
and does not rely on external knowledge resources. We use unsupervised clustering
methods to identify and link the different candidate variants of an entity. We experi-
ment with two languages, Arabic and English, independently. We empirically examine
the impact of morphological processing on the feature space. We also investigate the
usage of part of speech tag information in our models. Finally, we attempt to measure
the effect of various value content modeling approaches on the system such as TF-IDF
and co-occurrence frequency. ANEAR’s best performance is an Fβ=1 score of 70.03% on
Arabic, compared to an Fβ=1 score of 67.85% on the English data.
model. We populate the NFR matrix with different values based on variable weighting
schemes that reflect the relatedness scores. Subsequently, we apply unsupervised clus-
tering algorithms to extract and group the different aliases and variants of an entity in
one cluster. We experiment with two languages English and Arabic and use parallel data
of the same size in order to compare and contrast performance cross-linguistically.
The selection of the features in conjunction with the relatedness scoring scheme has a
significant impact on the performance of the clustering algorithm. The structure of the
matrix is as follows: the row entries of the matrix are the PNEs, the dimensions are
either bag of words (BOW) features or classes derived from them such as POS tags,
and the feature values are some form of the co-occurrence statistic between the PNE
and the feature instance.
2.1.1 Feature Dimensions. Our basic feature set is a BOW feature. We experiment
with several possible tokenization levels for the words in the data collection: (i) LEX:
inflected forms known as lexemes, e.g. babies is a lexeme, and contractions such as
isn’t are spelled out as is not; (ii) LEM: citation forms known as lemmas,3 e.g. the
lexeme babies would be reduced to the lemma baby; likewise, the lexeme is becomes
the lemma be. It is worth noting that for Arabic, a characteristic of the writing system
is that words are typically rendered without short vowels and other pronunciation
markers known as diacritics. For our purposes, the LEM for Arabic will be the fully
diacritized lemma, while the lexeme, LEX, is not diacritized. In order to identify
whether diacritization helps our process on the lexeme and lemma levels, we explore
a third word form in Arabic, the diacritized lexeme DLEX. An example of a diacritized
lexeme in Arabic is the DLEX xaAmiso,4 (fifth), whose undiacritized form is xAms.
Creating the vector space model for English and Arabic varies due to the nature of
the two languages. Arabic has a much more complex morphological structure than
English. Hence, as expected, the number of lexeme dimensions for Arabic far exceeds
that for English. Moreover, the lexeme-to-lemma ratio is much higher in Arabic
than in English. We note that our Arabic data collection has 71,910 diacritized
lexemes compared to 67,125 undiacritized lexemes and 38,537 diacritized lemmas,
corresponding to a 6.65% and 46.41% reduction in the feature space for LEX and LEM,
respectively, compared to DLEX in Arabic. For English, the number of lexemes is
significantly smaller for the same data collection size: 41,317 lexemes corresponding to
32,890 lemmas, representing a relatively smaller reduction in the feature space, going
from LEX to LEM, of 20.4%.
3 It should be noted that lemmas are also lexemes; however, they are a specific inflectional form that is conventionally chosen as a citation form; for example, a typical lemma for a noun is the inflected 3rd person masculine singular form of the noun.
4 All the Arabic used in this paper uses the Buckwalter transliteration scheme as described in http://www.qamus.com
2.1.1.1 Extended Dimensions. In order to reduce the sparseness of the NFR matrix
and add a level of abstraction, we augment the feature space with part of speech (POS)
tag features. Algorithm 1 explains the mechanism of generating the congregated POS
features.
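Algorithm 1 itself is not reproduced in this excerpt; the sketch below is one plausible reading, in which the POS tags of words co-occurring with each PNE are counted and appended as abstract dimensions:

```python
from collections import Counter, defaultdict

def congregated_pos_features(tagged_sentences, pnes):
    """tagged_sentences: lists of (word, pos_tag) pairs; pnes: set of names.
    Returns {pne: Counter of POS tags co-occurring with it}."""
    features = defaultdict(Counter)
    for sentence in tagged_sentences:
        words = {w for w, _ in sentence}
        for pne in pnes & words:
            features[pne].update(t for w, t in sentence if w != pne)
    return features
```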
2.1.1.2 Feature Values. Feature values are assigned based on one of the following metrics:
1. Co-occurrence Frequency (COF): PNE-feature co-occurrence frequency within a
predetermined context window size of a sentence, SENT where the feature and
the PNE co-occur in the same sentence, or a document, DOC, where the feature
and the PNE co-occur in the same document. This results in either COF-SENT or
COF-DOC.
2. Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is calculated over
the entire document collection. We have two settings varying the document size pa-
rameter for TF-IDF: (i) TF-IDF-DOC is based on using the entire collection of
documents, and (ii) TF-IDF-PNE is based on constraining the document collection
to those documents that mention the PNE. Both TF-IDF-DOC and TF-IDF-NE use
the same equations as defined in 2 and 1 for calculating the feature values, however
the former uses the entire document collection to calculate the values for DOC in
the equations, while the latter is constrained to the document collection that men-
tions the PNE of interest, i.e. the vector row entry PNE in the matrix. Intuitively,
both metrics capture the relative importance of the feature with respect to the PNE
in a given document collection.
$$idf(feature, DOC) = \log \frac{|DOCs|}{|\{DOC \in DOCs : feature \in DOC\}|} \qquad (1)$$
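The two TF-IDF variants differ only in the document collection they see; a minimal sketch, assuming documents are represented as token lists (an assumption of this illustration):

```python
import math

def idf(feature, docs):
    """Eq. (1): log of collection size over #documents containing feature."""
    df = sum(1 for d in docs if feature in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_value(feature, pne, docs, constrain_to_pne=False):
    """TF-IDF-DOC uses all docs; TF-IDF-PNE only those mentioning the PNE."""
    if constrain_to_pne:
        docs = [d for d in docs if pne in d]
    if not docs:
        return 0.0
    tf = sum(d.count(feature) for d in docs)
    return tf * idf(feature, docs)
```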
Table 1. Sample NFR matrix illustrating the Feature Value (FV) metrics: COF-DOC values and
their corresponding RRO values for the various PNEs across 6 lemma feature dimensions
replace them with their relative vector rank order value. Table 1 illustrates an ex-
ample of the mapping between the COF-DOC values and the corresponding RRO
values.7
2.1.2 Clustering and Retrieving the Different Groups of PNEs. We apply unsuper-
vised clustering using the cosine similarity function across the feature vectors in order
to produce the multiple groups of entities along with their aliases, i.e. grouping PNEs.
Our chosen clustering approach takes as input the NFR sparse matrix and applies the
Repeated Bisection clustering method that locally and globally optimizes the clustering
solution C which contains multiple groups of entities conjoined with their instances.
$$C = \bigcup_{PNE} c \,:\; c = \bigcup_{e} alias_e \qquad (3)$$
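The paper delegates this step to CLUTO's Repeated Bisection (Section 3). Purely as an approximation, the idea can be sketched with scikit-learn by repeatedly splitting the largest cluster with 2-means on L2-normalized rows, so that Euclidean 2-means tracks cosine similarity:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def repeated_bisection(nfr_matrix, k):
    """nfr_matrix: PNE-by-feature array; returns a cluster label per row."""
    x = normalize(np.asarray(nfr_matrix, dtype=float))   # unit-length rows
    labels = np.zeros(x.shape[0], dtype=int)
    for new_label in range(1, k):
        largest = np.bincount(labels).argmax()           # split biggest cluster
        idx = np.where(labels == largest)[0]
        halves = KMeans(n_clusters=2, n_init=10).fit_predict(x[idx])
        labels[idx[halves == 1]] = new_label
    return labels
```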
3 Evaluation
3.1 Data and Preprocessing Tools
All of our experiments use the GALE Phase (2) Release (1) parallel dataset for En-
glish & Arabic.8 We preprocessed the Arabic and English datasets in order to pro-
duce the NER tags, lexemes, lemmas and the Arabic diacritized lemmas. For all the
7 We experiment with assigning a rank order value of 0 to the features that have a COF/TF-IDF value of 0, versus giving them the lowest rank order value in a given vector. We note that assigning missing features a value of 0 yielded significantly better results than ranking the missing features as the lowest rank order in the vector, due to two factors: assigning the 0 features the lowest rank renders the actual rank variable across different vectors, introducing significant noise, i.e. similar missing features will have different rank order values across different PNE row entries. The effect is exacerbated given the significant sparseness in the matrix.
8 LDC2007E103 (http://www.ldc.upenn.edu).
English preprocessing we use the Stanford CoreNLP toolset [1], for Arabic we use
AMIRA by [2] for lexeme, diacritized lemma and undiacritized lemma generation. We
use NIDA-ANER, the Arabic Named Entity Recognition system by [3], to produce
PNE-tagged data. Figure 2 depicts the ANEAR processing steps.
Due to the lack of annotated evaluation data for the aliasing resolution problem in
Arabic and the limited evaluation data in English, we create our own English and Arabic
evaluation data from the GALE dataset. Building the gold file comprises the following
steps: a) extract and list all the PNEs in the GALE dataset; b) to avoid singleton
cases, set a unigram frequency threshold of ≥ 100 for a PNE to be added to any of
our clusters, which yields a list A; c) extract the transliterations of the PNEs in A
based on string edit distance similarity measures; d) manually identify the aliases
of the PNEs in A in the dataset. The resulting gold
standard file yields 26 PNE clusters in each language along with their respective aliases.
The total number of PNEs in the Arabic set is 116 corresponding to 26 PNE clusters,
and the total number of PNEs in English is 105 corresponding to 26 PNE clusters.
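A sketch of steps (a)–(c) under stated assumptions: difflib's similarity ratio stands in for the unspecified edit-distance measure, and the manual verification of step (d) is of course not automated:

```python
from collections import Counter
from difflib import SequenceMatcher

def candidate_clusters(pne_mentions, min_freq=100, sim_threshold=0.8):
    """pne_mentions: all PNE strings found in the corpus (with repeats).
    Keeps PNEs with frequency >= min_freq, then greedily groups likely
    transliteration variants by string similarity."""
    freq = Counter(pne_mentions)
    a_list = [p for p, c in freq.items() if c >= min_freq]
    clusters = []
    for name in a_list:
        for cluster in clusters:
            if SequenceMatcher(None, name, cluster[0]).ratio() >= sim_threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```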
For automatic clustering, we use the CLUTO software package,9 which employs
multiple classes of k-way clustering algorithms that cluster low- and high-dimensional
datasets with various similarity functions. CLUTO shows robust clustering
performance that outperforms many clustering algorithms such as K-means. We use
the Repeated Bisection algorithm with default parameter settings; this clustering
algorithm is a hard clustering algorithm. For comparison, we also use Matlab10
implementations of the K-means and Hierarchical clustering algorithms.
For each language, we have combinations of the following considerations. For the
feature dimensions: (i) word tokenization level: Lexemes (LEX) vs. lemmas (LEM)
vs. diacritized lexemes (DLEX) (the latter is only for Arabic). For the feature values,
we have the following conditions: (i) simple co-occurrence frequency: COF-SENT and
COF-DOC; (ii) TF-IDF-DOC and TF-IDF-NE; (iii) Rank Order with four settings:
9 http://glaros.dtc.umn.edu/gkhome/views/CLUTO
10 MATLAB and Statistics Toolbox Release 2009, The MathWorks, Inc., Natick, Massachusetts, United States.
3.3 Results
In Table 2, all the ANEAR conditions outperform the random baseline by a significant
margin. ANEAR’s best results for English are obtained in the LEM_COF-DOC
experimental setting, achieving an Fβ=1 score of 67.85% using the augmented POS
features, and the best results for Arabic are achieved in the condition LEM_TF-IDF-DOC
in the BOW+POS setting, with Fβ=1 = 70.03%; the condition LEX_TF-IDF-DOC is a
narrow second with Fβ=1 = 69.58%.
In general with the BOW setting, the TF-IDF conditions outperform the compara-
tive COF conditions. For example, in the English results, we note that LEX_TF-IDF-
DOC|NE both outperform LEX_COF-SENT|DOC conditions (60.63% and 53.57% vs.
49.66% and 41.56%, respectively). Moreover, in the BOW setting, using RRO adversely
impacts performance in both languages.
For both languages, the COF-DOC conditions outperform the COF-SENT condi-
tions across the board. Also the TF-IDF-DOC conditions outperform the TF-IDF-NE
conditions in the BOW setting, suggesting that narrowing the document collection ex-
tent is adverse to system performance.
For English, LEM conditions outperform LEX conditions except in the TF-IDF-
DOC condition. However, in the latter condition, the difference between LEM and LEX
conditions is relatively small (1%). In Arabic, the results are more consistent with LEM
outperforming both LEX and DLEX in all the conditions, in the BOW setting.
Adding POS tag features has an overall positive impact on performance in English.
In Arabic the story is quite different. The COF-SENT conditions in Arabic yield the
worst results. But adding POS tag information to the other models seems to significantly
improve performance.
For the Arabic experiments, under the BOW setting, the best F-score of 68.99% is
obtained from the diacritized dataset (LEM) with TF-IDF-DOC. Using DOC provides
better performance compared to SENT. Similarly to English results, adding POS tags
to the feature space improves performance in both the LEX and LEM conditions, but
not in the DLEX condition. This may be attributed to the level of detail present in the
DLEX forms combined with the detailed POS tag set used. The best performing condition
Table 2. ANEAR Fβ=1 scores performance for both English and Arabic datasets under the dif-
ferent experimental conditions and feature settings, BOW and BOW+POS
4 Discussion
4.1 Balancing the Data
We are cognizant of the unbalanced distribution of the aliases within each cluster of
the dataset, which highly affects the clustering performance. Hence, in addition to testing on
the original dataset, we generate another balanced version that has a more normalized
distribution based on the following approach:
When we balance the evaluation data, we observe an overall significant increase
in absolute performance: the best condition, LEM_COF-SENT, yields an F-score
of 96.05% for English, compared to the best condition in Arabic, LEM_TF-IDF-NE,
which yields an F-score of 96.45%.
Fig. 3. ANEAR performance comparison between balanced and unbalanced Arabic and English datasets
Fig. 4. Comparison between ANEAR and random baseline performance
Arabic shows more robust results and seems less affected (F-score = 70.03%) when
compared to English (F-score = 67.85%). The more balanced distribution scheme adds
a significant performance improvement (≈ +25%), as shown in Figure 3. Based on the
results, we generally notice that diacritized lexemes produce better performance:
despite the higher feature dimensionality, which yields a sparser dataset, the decrease
in ambiguity is a net gain. Figure 4 contrasts ANEAR performance against a random
baseline system, with a gain of ≈ +39% in Arabic and ≈ +30% in English.
Although Hierarchical clustering does not require specifying the number of clusters
as an input parameter (the number of clusters is automatically induced), it yields
much poorer F-score results.
K-Means achieves the best performance under the condition DLEX_TF-IDF-NE
(in Arabic) with an Fβ=1 score of 36.49%. On the other hand, Hierarchical clustering
shows its best performance under the condition LEX_COF-DOC, with an Fβ=1 score of
21.38%. Figure 5 shows a comparison among the different clustering algorithms when
tested on the balanced and unbalanced datasets.
Fig. 5. Comparison among Hierarchical, K-means and CLUTO Repeated Bisection K-way Clus-
tering when tested on the Arabic balanced and unbalanced datasets
5 Related Work
To date, most of the work related to the aliasing resolution problem has been mainly
performed in the area of Named Entity Disambiguation, where two entities share the
same name. Moreover, the NED task has typically focused on English since there are
no annotated data sets for other languages. Our work employs unsupervised techniques
to induce the PNE groups of name aliases while most work that we are aware of to
date, uses predefined lists of PNEs and their corresponding aliases and used for train-
ing in a supervised manner. [4] proposed a framework for alias detection for a given
entity using a logistic regression classifier that relies on a number of features such as
co-occurrence relevance. Similarly, [5] presented a more complicated system that also
relies on an input list of names and their aliases. They first retrieve a list of candidate
aliases for a given entity using lexical patterns that introduce aliases, then they rank the
set of retrieved aliases based on different factors: a) Lexical pattern frequency, b) Co-
occurrence in anchor texts using different metrics such as TF-IDF and cosine similarity
functions, and, c) Page counts of name-alias co-occurrence. [6,7] and [8] proposed a
knowledge-based method that captures and leverages the structural semantic knowledge
in multiple knowledge sources (such as Wikipedia and WordNet) in order to improve
the disambiguation performance. Other disambiguation methods utilize ranked similar-
ity measurements among entity-based summaries [9,10]. [11] used unsupervised
clustering algorithms on a rich feature space that is extracted from biographical facts. In
PNE identification, [12] proposes a lexical pattern-based approach to extract a large set
of candidate aliases from a web search engine. Then, a myriad of ranking scores (lexical
pattern frequency, word co-occurrences and page counts on the web) are integrated into
a single ranking function and fed into a support vector machine (SVM) to identify and
predict aliases for a particular PNE.
Other contributions involve handling structured datasets such as link data sets.
[13] presented a hybrid probabilistic orthographic-semantic supervised learning model
to recognize aliases.
Entity linking tackles a similar problem to NEAR where a name mention is mapped
to an entry in a Knowledge Base (KB). Entity Linking relies heavily on Wikipedia pages
to populate the KB and generates a dictionary that is used in name-variant mappings
as illustrated in [14]. They integrate a number of features in order to choose the best
mapping. These features include the surface forms, semantic links (which assume the
availability of structured data), and weighted bag-of-words features extracted from the
Wikipedia documents. All of the above features assume that the entities to be resolved
with their aliases are celebrities, such that Wikipedia references them and their
aliases.
Our approach provides a broader range of alias identification, since it does not rely
on any lexical or string similarity properties. In addition, the identification process is
executed offline with no dependence on external resources.
6 Conclusion
References
1. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into informa-
tion extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting on
Association for Computational Linguistics, ACL 2005, pp. 363–370. Association for Com-
putational Linguistics, Stroudsburg (2005)
2. Diab, M.: Second generation tools (AMIRA 2.0): Fast and robust tokenization, POS tagging, and
base phrase chunking. In: Choukri, K., Maegaard, B., eds.: Proceedings of the Second Inter-
national Conference on Arabic Language Resources and Tools. The MEDAR Consortium,
Cairo (2009)
3. Benajiba, Y., Diab, M.T., Rosso, P.: Arabic named entity recognition: A feature-driven study.
IEEE Transactions on Audio, Speech & Language Processing 17(5), 926–934 (2009)
4. Jiang, L., Wang, J., Luo, P., An, N., Wang, M.: Towards alias detection without string sim-
ilarity: an active learning based approach. In: Proceedings of the 35th International ACM
SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp.
1155–1156. ACM, New York (2012)
5. Bollegala, D., Matsuo, Y., Ishizuka, M.: Automatic discovery of personal name aliases from
the web. IEEE Trans. on Knowl. and Data Eng. 23(6), 831–844 (2011)
6. Han, X., Zhao, J.: Structural semantic relatedness: A knowledge-based method to named
entity disambiguation. In: Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics, pp. 50–59. Association for Computational Linguistics, Uppsala
(2010)
7. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Pro-
ceedings of EMNLP-CoNLL, vol. 2007, pp. 708–716 (2007)
8. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based ex-
plicit semantic analysis. In: IJCAI 2007: Proceedings of the 20th International Joint Con-
ference on Artificial Intelligence, pp. 1606–1611. Morgan Kaufmann Publishers Inc., San
Francisco (2007)
9. Bagga, A., Baldwin, B.: Entity-based cross-document coreferencing using the vector space
model. In: COLING-ACL, pp. 79–85 (1998)
10. Bagga, A., Biermann, A.W.: A methodology for cross-document coreference. In: Proceed-
ings of the Fifth Joint Conference on Information Sciences (JCIS 2000), pp. 207–210 (2000)
11. Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Daelemans, W.,
Osborne, M. (eds.) Proceedings of CoNLL-2003, pp. 33–40. Edmonton, Canada (2003)
12. Bollegala, D., Matsuo, Y., Ishizuka, M.: Automatic discovery of personal name aliases from
the web. IEEE Trans. Knowl. Data Eng. 23(6), 831–844 (2011)
13. Hsiung, P., Moore, A., Neil, D., Schneider, J.: Alias detection in link data sets. Master’s
thesis, Technical Report CMU-RI-TR-04-22 (March 2004)
14. Charton, E., Gagnon, M.: A disambiguation resource extracted from wikipedia for semantic
annotation. In: LREC, pp. 3665–3671 (2012)
15. Chen, Y., Martin, J.: Towards robust unsupervised personal name disambiguation. In: Pro-
ceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Process-
ing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 190–198. Asso-
ciation for Computational Linguistics, Prague (2007)
16. Sutton, C., Mccallum, A.: Introduction to Conditional Random Fields for Relational Learn-
ing. MIT Press (2006)
Improving Candidate Generation for Entity
Linking
Yuhang Guo1, Bing Qin1,*, Yuqin Li2, Ting Liu1, and Sheng Li1
1 School of Computer Science and Technology,
Harbin Institute of Technology, Harbin, China
2 Beijing Information Science and Technology University, Beijing, China
{yhguo,bqin,tliu,sli}@ir.hit.edu.cn,
[email protected]
1 Introduction
Entity Linking (EL) is the task of identifying the target entity which a name
refers to. It can help text analysis systems to understand the context of the name
in-depth by leveraging known information of the entity. On the other hand, new
knowledge about this entity can be populated by mining information from the
context. Figure 1 illustrates how entity linking can help question answering: knowing
the name Washington refers to actor Denzel Washington (rather than George
Washington or the State of Washington) in the question: Who did Washington
play in Training Day, one can find the corresponding answer (Detective Alonzo
Harris) directly in the knowledge base.
* Corresponding author.
EL can be broken down into two steps: candidate generation and candidate
ranking. The first step generates a set of candidate entities of the target name
and the second step ranks the candidates. Several ranking models have been
proposed for the second step. However, few works have focused on the candidate
generation step. Generating candidates is a critical step for the linking systems.
If the target entity is not included in the candidate set, no ranking model can
return the correct one.
A number of resources have been proposed to improve the generation recall
[2,4,24,6]. By leveraging these resources, the number of candidates can be
very large. Take the target name Washington for example: the generation will
return more than 600 candidates.
Bounding the number of candidates is important in applications of EL. Reducing
the number of candidates lowers the time and memory costs of ranking, and makes
sophisticated, time- and memory-consuming ranking models practicable. How to
generate small candidate sets under the premise of ensuring high recall is an
interesting problem.
In this paper, we propose a novel candidate generation approach. In this
approach, the generator first extracts the target name’s co-reference names in the
context. From this set the generator then selects the most reliable name (i.e. the
least ambiguous name) to generate candidates by leveraging a Wikipedia-derived
name-entity mapping. Next the generator prunes the candidates according to
their frequencies and their similarity to the target name.
Experiments on benchmark data sets show that our candidate generation can
increase the recall and reduce the candidate number effectively. Further analysis
shows that both the accuracy and the speed of the system benefit from the
proposed candidate generation approach, especially for target names with
large candidate sets. The system runtime is effectively reduced relative to the
baseline candidate set. The highest accuracy in the evaluation is improved by
2.2% and 3.4%.
2 Related Work
EL is similar to Word Sense Disambiguation (WSD), a widely-studied natural
language processing task. In WSD the sense of a word (e.g. bank: river bank or a
financial institution) is identified according to the context of the word [10,20,15].
Both WSD and EL disambiguate polysemous words/names according to the
context. The difference between the two tasks is that the disambiguation
targets in WSD are lexical words, whereas in EL they are names. In WSD, the senses
of words are defined in dictionaries, such as WordNet [18]. In EL, however, no
open-domain catalog includes all entities and all of their names. The study of
WSD has a history of several decades [10,20,15]. Recently, with the development of
large-scale open-domain knowledge bases (such as Wikipedia, DBpedia [1],
and Yago [23]), EL has been attracting more and more attention.
Early EL borrowed successful techniques from WSD: take each sense (candidate
entity) as a class and resolve the problem with a multi-class classifier [17,2]. However,
in WSD a word usually has several senses, but in EL a name may have dozens to
hundreds of candidate entities. Under such high polysemy, the accuracy of the
classifier cannot be guaranteed.
EL systems can be broken down into two steps: candidate generation and
candidate ranking[11]. Early candidate generation approaches directly match
the target name in the knowledge base[2]. Recently, several techniques have
been proposed and have achieved certain success in recall.
– Substitute the target name with a longer name in the name's co-reference chain
in the context[4].
– If the target name is an acronym, substitute it with the full name in the
context[4,24].
– Filter acronym expansions with a classifier[26].
– If the exact match fails, then use partial search[24] or fuzzy match[14] (e.g.
return candidates with high Dice coefficient).
The candidate ranking is based on the similarity between the candidate entity
and the context surrounding the target name. A number of features have been
proposed: plain text [24]; concepts, such as Wikipedia categories [2], Wikipedia
concepts [9], and topic model concepts [12,22,26]; and neighboring entities, which
include the entities mapped from unambiguous names [19] and the collectively
disambiguated entities [4,13,8,22]. The entity-context similarity is measured by
cosine similarity [24], language model score [7], and the inner coherence among
neighboring entities measured by link similarity [19,21,22] and collective topic
model similarity [22]. Besides using these similarities directly for ranking,
machine learning methods have been applied to combine them [19,27,5].
Sophisticated ranking models incur heavy computation costs. For example, the
time complexity of the listwise learning-to-rank method is exponential [3,25,27].
In this mapping, a name is mapped to all the entities it may refer to. For exam-
ple, name Washington is mapped to Denzel Washington, George Washington
and State of Washington, etc. All through this work, we use the Aug. 2, 2012
1 An information structure of Wikipedia.
version of the English Wikipedia dump, which contains more than 4.1 million arti-
cles2 . In all, we extract 23,895,819 name-entity pairs with their co-occurrence
frequencies. Summing up this frequency for the same entity, we can get the fre-
quency of the entity in Wikipedia, which will be used in the following part of
the linking system.
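As an illustration of how such a mapping can be assembled, the sketch below counts
name-entity co-occurrence frequencies from (anchor text, target entity) pairs and derives
entity frequencies by summation. The pair-harvesting step and all data shown are assumed
for the example.

# Sketch: building a name-entity mapping (NEM) with co-occurrence
# frequencies from anchor-text/target pairs, e.g. harvested from Wikipedia
# links. The pairs below are illustrative assumptions.
from collections import Counter, defaultdict

anchor_pairs = [
    ("Washington", "Denzel Washington"),
    ("Washington", "George Washington"),
    ("Washington", "George Washington"),
    ("Washington", "State of Washington"),
]

nem = defaultdict(Counter)          # name -> Counter(entity -> frequency)
for name, entity in anchor_pairs:
    nem[name][entity] += 1

# Summing the frequency over all names referring to the same entity yields
# the entity's frequency in Wikipedia, used later in the linking system.
entity_freq = Counter()
for entities in nem.values():
    entity_freq.update(entities)

print(nem["Washington"].most_common())  # most frequent referents first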
– The full form is in front of the enclosed acronym (e.g. ... the newly formed
All Basotho Convention (ABC))
– The acronym is in front of the enclosed full form (e.g. ... at a time when the
CCP (Chinese Communist Party) claims ...)
– The acronym consists of the initial letters of the full name words (e.g. ...
leaders of Merkel’s Christian Democratic Union ... CDU ...)
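These three patterns lend themselves to simple pattern matching. The sketch below
approximates the first pattern with a regular expression and the third with an initials
check; both are simplified illustrations, not the authors' actual extraction rules.

# Sketch: simplified matchers for the acronym/full-form patterns above.
import re

# Pattern 1: full form followed by the enclosed acronym.
full_then_acro = re.compile(r"((?:[A-Z][a-z]+ )+)\(([A-Z]{2,})\)")

def initials_match(acronym, full_form):
    """Pattern 3: the acronym consists of the initial letters of the
    full-form words (e.g. Christian Democratic Union -> CDU)."""
    return acronym == "".join(w[0] for w in full_form.split()).upper()

m = full_then_acro.search("the newly formed All Basotho Convention (ABC)")
print(m.group(1).strip(), "->", m.group(2))                 # All Basotho Convention -> ABC
print(initials_match("CDU", "Christian Democratic Union"))  # True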
consult the next most reliable potential name only if the current name returns
no candidate.
Two points should be considered for the reliability of a potential name:
according to our observation, a longer name has a smaller N, and a higher-frequency
name has a higher P. In order to keep a small candidate set and a high recall at
the same time, the considered potential name should be both high-frequency
and long.
In this work, the potential names are first sorted by their types: longer names,
normalized query names (including acronym expansions and Wiki-style normal-
izations), and shorter names, and then by frequency within the same type.
The back-off strategy prunes candidates from the name aspect, whereas the fol-
lowing strategies prune candidates from the entity aspect. The filter-by-frequency
strategy filters out candidates with low frequency, and the filter-by-similarity
strategy filters out candidates with low similarity to the target name. We define
the similarity between a name and an entity as follows: the target name nt is
similar to a candidate entity e if and only if at least one name (ne) of this entity
is similar to nt.
Here we propose a novel name similarity measurement. The formula is

$$\mathrm{Sim}(n_e, n_t) = \frac{\sum_{w \in n_e} \mathrm{Len}(\mathrm{LCS}(w, n_t))}{\sum_{w \in n_t} \mathrm{Len}(w)} \qquad (1)$$

where Len(s) is the length of string s and LCS(s1, s2) is the longest common substring
of s1 and s2. Note that this similarity is asymmetric.
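A direct implementation of Eq. (1) is straightforward: compute the longest common
substring by dynamic programming and apply the formula. The sketch below illustrates
the formula only and is not the authors' code.

# Sketch implementing Eq. (1): for every word w in the candidate entity's
# name n_e, take the length of its longest common substring with the target
# name n_t, normalized by the total length of the words in n_t.
def lcs_len(s1, s2):
    """Length of the longest common substring of s1 and s2."""
    best = 0
    table = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

def sim(n_e, n_t):
    numer = sum(lcs_len(w, n_t) for w in n_e.split())
    denom = sum(len(w) for w in n_t.split())
    return numer / denom

# The measure is asymmetric, as noted in the text:
print(sim("Washington", "George Washington"))  # 0.625
print(sim("George Washington", "Washington"))  # 1.1 (not bounded by 1)

Under the thresholds reported below, a candidate entity whose names all score under 0.6
on this measure would be filtered out.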
Table 1. Notations
The similarity threshold is used to filter out dissimilar entities, and the candidate
set volume threshold limits the maximum size of the candidate set5.
4 Experiment
The experiment is conducted on four KBP data sets (i.e. KBP2009-KBP2012)
which are taken from the Knowledge Base Population (KBP) Track [16,11]. The
data sets share the same track knowledge base which is derived from Wikipedia
and contains 818,741 entities. We use KBP2009 and KBP2010 as the training
and development data and KBP2011 and KBP2012 as the test data.
In the KBP-EL evaluation, the input is a set of queries. Each query consists of
a target name mention and a context document. The output is the target entity
ID in the knowledge base or NIL if the target entity is absent in the knowledge
base. The number of queries/NIL-answer queries for each data set is: KBP2009:
3904/2229, KBP2010: 2250/1230, KBP2011: 2250/1126, KBP2012: 2250/1049.
Our experiments include two parts. The first part evaluates the recall and av-
eraged candidate set size. The recall is the percentage of the non-NIL queries for
which the candidate set covers the referent entity. The second part evaluates the
final EL system performance, including the micro-averaged accuracy (percentage
of queries linked correctly) and the averaged runtime cost per query.
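For concreteness, the two metrics of the first part can be computed as in the following
sketch; the query representation (a gold entity, None for NIL, and a candidate set) is an
assumed structure.

# Sketch: candidate-generation recall (non-NIL queries whose candidate set
# contains the referent entity) and averaged candidate set size.
def evaluate(queries):
    non_nil = [q for q in queries if q["gold"] is not None]
    hits = sum(1 for q in non_nil if q["gold"] in q["candidates"])
    recall = hits / len(non_nil)
    avg_size = sum(len(q["candidates"]) for q in queries) / len(queries)
    return recall, avg_size

queries = [
    {"gold": "Denzel Washington", "candidates": {"Denzel Washington", "George Washington"}},
    {"gold": None, "candidates": {"George Washington"}},   # NIL query
    {"gold": "State of Washington", "candidates": {"George Washington"}},
]
print(evaluate(queries))  # (0.5, 1.33...)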
5 In this work we set the candidate set volume threshold to 30 and the similarity threshold to 0.6.
Here we compare our context-based approach, CBCG, with the baseline of directly
matching in the NEM, DMatch. Table 2 shows the recall and the averaged candidate
number per query of the candidate generators. From this table, we can see that
the recall of CBCG outperforms DMatch and reaches higher than 93% on
each of the data sets. On KBP2011 and KBP2012, the recall of CBCG outper-
forms DMatch by 15.6% and 5.2% respectively. On the other hand, the number
of candidates from CBCG is only 22.5% and 9.5% of that from DMatch on KBP2011
and KBP2012 respectively. Little of the literature has reported both the recall and
the averaged candidate number. [6] reported a candidate generation recall of 0.878
and an averaged candidate number of 7.2 on KBP2009. Our approach outperforms
that recall by 5.3% and achieves a comparable candidate number on the same
data set.
Table 2. Candidate generation recall and averaged candidate number on KBP data
sets
We add the strategies to the generator in turn to evaluate their contributions.
Figure 2 shows that directly matching the target name in the NEM results in a large
number of candidates. Using the AcroExp, LongName and ShortName strategies, the
recall is improved. Using LongName and fByFreq, the averaged number of
candidates is reduced significantly. Using all of these strategies, we can
obtain balanced candidate sets with high recall and small size.
Fig. 2. Averaged candidate number (left axis) and recall (right axis) on KBP2011 and
KBP2012 as each strategy is added in turn: DMatch, AcroExp, LongName, ShortName,
fByFreq, fBySim
(bar chart over polysemy ranges [0,4), [4,16), [16,32), [32,64), [64,128), [128,INF))
Fig. 4. Accuracy and averaged time cost per query (seconds) of ListNet based on the
DMatch and the CBCG in different polysemy ranges on KBP2011 and KBP2012
5 Conclusion
Candidate generation is essential for the EL task. The candidate number for
a target name may be very large. Generating a small candidate set under the
premise of ensuring high recall is critical for applications of EL systems.
In this paper we propose a novel candidate generation approach. This approach
combines several strategies to balance the recall and the size of the candidate set.
Experimental results on benchmark data sets show that our candidate generation
can significantly improve EL system performance in recall, accuracy and
efficiency over the baseline. On the KBP2011 and KBP2012 data sets, the recall
is improved by 15.6% and 5.2%, the accuracy is improved by 5.4%-11.4%, the
system runtime is reduced by 70.3% and 76.6%, and the highest accuracy in the
evaluation is improved by 2.2% and 5.4% respectively. For the most polysemous
target names on KBP2011 and KBP2012, the accuracy improvement reaches
18.0% and 8.6%, and the runtime is reduced by 90.2% and 85.8% respectively.
References
1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia:
A nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007.
LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
2. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disam-
biguation. In: EACL (2006)
3. Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., Li, H.: Learning to rank: from pairwise
approach to listwise approach. In: Proceedings of the 24th International Conference
on Machine Learning (2007)
4. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data.
In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning (2007)
5. Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T.: Entity disambiguation for
knowledge base population. In: Proceedings of the 23rd International Conference
on Computational Linguistics (2010)
6. Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.: Evaluating
entity linking with wikipedia. Artificial Intelligence 194, 130–150 (2013)
7. Han, X., Sun, L.: A generative entity-mention model for linking entities with knowl-
edge base. In: Proceedings of the 49th Annual Meeting of the Association for Com-
putational Linguistics: Human Language Technologies (2011)
8. Han, X., Sun, L., Zhao, J.: Collective entity linking in web text: a graph-based
method. In: Proceedings of the 34th International ACM SIGIR Conference on
Research and Development in Information Retrieval (2011)
236 Y. Guo et al.
9. Han, X., Zhao, J.: Named entity disambiguation by leveraging wikipedia seman-
tic knowledge. In: Proceeding of the 18th ACM Conference on Information and
Knowledge Management, CIKM 2009 (2009)
10. Ide, N., Véronis, J.: Introduction to the special issue on word sense disambiguation:
the state of the art. Comput. Linguist. 24(1), 2–40 (1998)
11. Ji, H., Grishman, R.: Knowledge base population: Successful approaches and chal-
lenges. In: Proceedings of the 49th Annual Meeting of the Association for Compu-
tational Linguistics: Human Language Technologies (2011)
12. Kataria, S.S., Kumar, K.S., Rastogi, R.R., Sen, P., Sengamedu, S.H.: Entity dis-
ambiguation with hierarchical topic models. In: Proceedings of the 17th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining
(2011)
13. Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annota-
tion of wikipedia entities in web text. In: Proceedings of the 15th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (2009)
14. Lehmann, J., Monahan, S., Nezda, L., Jung, A., Shi, Y.: Lcc approaches to knowl-
edge base population at TAC 2010. In: Proceedings of the Text Analysis Conference
(2010)
15. McCarthy, D.: Word sense disambiguation: An overview. Language and Linguistics
Compass 3(2), 537–558 (2009)
16. McNamee, P., Dang, H.: Overview of the TAC 2009 knowledge base population track.
In: Proceedings of the Second Text Analysis Conference, TAC 2009 (2009)
17. Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge.
In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge
Management, CIKM 2007 (2007)
18. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to
WordNet: An On-line Lexical Database*. Int. J. Lexicography 3, 235–244 (1990)
19. Milne, D., Witten, I.H.: Learning to link with wikipedia. In: Proceeding of the 17th
ACM Conference on Information and Knowledge Management, CIKM 2008 (2008)
20. Navigli, R.: Word sense disambiguation: A survey. ACM Comput. Surv. 41, 1–69
(2009)
21. Ratinov, L., Roth, D., Downey, D., Anderson, M.: Local and global algorithms for
disambiguation to wikipedia. In: Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies (2011)
22. Sen, P.: Collective context-aware topic models for entity disambiguation. In: Pro-
ceedings of the 21st International Conference on World Wide Web (2012)
23. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In:
Proceedings of the 16th International Conference on World Wide Web (2007)
24. Varma, V., Bharat, V., Kovelamudi, S., Bysani, P., Santhosh, G.S.K., Kiran Ku-
mar, N., Reddy, K., Kumar, K., Maganti, N.: IIIT Hyderabad at TAC 2009. In:
Proceedings of the Second Text Analysis Conference, TAC 2009 (2009)
25. Xia, F., Liu, T.-Y., Wang, J., Zhang, W., Li, H.: Listwise approach to learning to
rank: theory and algorithm. In: Proceedings of the 25th International Conference
on Machine Learning (2008)
26. Zhang, W., Sim, Y.C., Su, J., Tan, C.L.: Entity linking with effective acronym
expansion, instance selection, and topic modeling. In: Proceedings of the 22nd
International Joint Conference on Artificial Intelligence, IJCAI 2011, Barcelona,
Catalonia, Spain, July 16-22 (2011)
27. Zheng, Z., Li, F., Huang, M., Zhu, X.: Learning to link entities with knowledge
base. In: Human Language Technologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computational Linguistics (2010)
Person Name Recognition Using the Hybrid Approach
Abstract. Arabic Person Name Recognition has been tackled mostly using
one of two approaches: a rule-based or a Machine Learning (ML) based
approach, each with its strengths and weaknesses. In this paper, the problem of
Arabic Person Name Recognition is tackled by integrating the two approaches
in a pipelined process to create a hybrid system, with the aim of enhancing
the overall performance of Person Name Recognition tasks. Extensive
experiments are conducted using three different ML classifiers to evaluate the
overall performance of the hybrid system. The empirical results indicate that the
hybrid approach outperforms both the rule-based and the ML-based approaches.
Moreover, our system outperforms the state of the art in Arabic Person Name
Recognition in terms of accuracy when applied to the ANERcorp dataset, with
precision 0.949, recall 0.942 and F-measure 0.945.
1 Introduction
Named Entity Recognition (NER) is the task of detecting and classifying proper
names within texts into predefined types, such as Person, Location and Organization
names [19], in addition to the detection of numerical expressions, such as date, time,
and phone number. Many Natural Language Processing (NLP) applications employ
NER as an important preprocessing step to enhance the overall performance.
Arabic is the official language in the Arab world where more than 300 million
people speak Arabic as their native language [22]. Arabic is a Semitic language and
one of the richest natural languages in the world in terms of morphology [22]. Interest
in Arabic NLP has been gaining momentum in the past decade, and some of the tasks,
such as NER, have proven to be challenging due to the language’s rich morphology.
Person Name Recognition for Arabic has been receiving increasing attention, yet
opportunities for improvement in performance remain. Most Arabic
NER systems that have the capability of recognizing Person names have been
developed using one of two approaches: the rule-based approach, notably the NERA
system [24], and the ML-based approach, notably ANERsys 2.0 [6]. Arabic rule-
based NER systems rely on handcrafted grammatical rules acquired from linguists.
Therefore, any maintenance applied to rule-based systems is labor-intensive and time
consuming especially if linguists with the required knowledge are not available [21].
In contrast, ML-based NER systems utilize learning algorithms that make use of
a selected set of features extracted from datasets annotated with named entities (NEs)
for building predictive NER classifiers. The main advantage of ML-based NER
systems is that they are updatable with minimal time and effort as long as
sufficiently large datasets are available.
In this paper, the problem of Arabic Person Name Recognition is tackled through
integrating the ML-based approach with the rule-based approach to develop a hybrid
system in an attempt to enhance the overall performance. Our early hybrid Arabic
NER research [1] provided the capability to detect and classify Person NEs in Arabic
texts in addition to Location and Organization NEs, where only Decision Trees tech-
nique was used within the hybrid system. This technique was applied to a limited set
of selected features. The experimental results were promising and confirmed the quality
of the prototype [1]. As a continuation, we extend the ML feature space to include
morphological and contextual features. In addition to Decision Trees, we investigate
two more ML algorithms, Support Vector Machines and Logistic Regression, in the
recognition of 11 different types of NEs [20]. In this paper, we report our experience
with Arabic Person name recognition in particular. A wider set of standard datasets is
used to evaluate our system. In [20], we reported a set of experimental results which
were indicative of better system performance in terms of accuracy. Thereafter, more
experiments and analysis of results are conducted to assess the quality of the hybrid
system by means of standard evaluation metrics.
The structure of the remainder of this paper is as follows. Section 2 provides some
background on NER, while Section 3 gives a literature review. Section 4 describes the
method followed for data collection. Section 5 illustrates the architecture of the pro-
posed system and then describes the main components in detail. The experimental
results are reported and discussed in Section 6. Section 7 concludes this paper and
gives directions for future work.
2 Background
In the 1990s, at the Message Understanding Conferences (MUC), the task of NER
was first introduced to the research community. Three main NER subtasks were
defined at the 6th MUC: ENAMEX (i.e. Person, Location and Organization), TIMEX
(i.e. temporal expressions), and NUMEX (i.e. numerical expressions).
The role of NER within NLP applications differs from one application to another.
Examples of such NLP applications include (but are not limited to) the following:
ing into account their classified NEs. For example, the word “ ”واﺷﻨﻄﻦwaAšinTun1
“Washington” can be recognized as a Location NE or a Person NE, hence the cor-
rect classification will lead to the extraction of the relevant documents.
• Machine Translation (MT). MT is the task of translating a text into another natu-
ral language. NEs need special handling in order to be translated correctly. Hence,
the quality of NE translation would become an integral part that enhances the per-
formance of the MT system [4]. In the translation from Arabic to Latin languages,
Person names (NEs) can also be found as regular words (non-NEs) in the language
without any distinguishing orthographic characteristics between the two surface
forms. For example, the surface word “ ”وﻓﺎءwafaA’ can be used in Arabic text as a
noun which means trustfulness and loyalty, and also as a Person name.
• Question Answering (QA). QA application is closely related to IR but with more
sophisticated results. A QA system takes questions as input and returns concise and
precise answers. NER can be exploited in recognizing NEs within the questions to
help identifying the relevant documents and then extracting the correct answers
[16]. For instance, the words “ ”إرﻧﺴﺖ وﻳﻮﻧﻎĂirnist wayuwnγ “Ernst & Young” may
be classified as Organization or Person NEs according to the context.
1 We used the Habash-Soudi-Buckwalter transliteration scheme [15].
In this section, we focus on the Arabic NER systems that have the capability to rec-
ognize Person names. They are divided into rule-based and ML-based systems.
4 Data Collection
The linguistic resources are of two main categories: corpora and gazetteers. The cor-
pora used in this research are Automatic Content Extraction 2 (ACE) corpora and
ANERcorp3 dataset. In the literature, they are commonly used for evaluation as well
as comparison with existing systems. The dataset files have been prepared and trans-
formed into XML format using our tag schema. An example of a Person name in
our tag schema is: <Person>هﻨﺎء</Person>. The three ACE corpora used in this re-
search are ACE 2003 (Newswire (NW) and Broadcast News (BN)) and ACE 2004
(NW) datasets. ANERcorp is an annotated dataset provided by [5]. In this study, the
total number of annotated Person NEs covered by all datasets is 6,695 as demonstrat-
ed in Table 1. Another type of linguistic resources used is gazetteers. The gazetteers
required for Person name recognition are collected as is from [24].
In this article, we propose a hybrid architecture that is demonstrably better than the
rule-based or ML-based systems individually. Figure 1 illustrates the architecture of
the proposed hybrid system for Arabic. The system consists of two sequential loosely
coupled components: 1) a rule-based component that produces NE labels based on
lists of NEs/keywords and contextual rules, and 2) an ML-based post-processor in-
tended to make use of rule-based component’s NE decisions as features aiming at
enhancing the overall performance of the NER task.
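A minimal sketch of this coupling is given below: the rule-based component's NE
decision enters the ML feature vector alongside other token features. The feature names
and values are illustrative assumptions, not the system's exact feature set.

# Sketch: the rule-based decision becomes one feature among others for the
# ML post-processor; a classifier learns when to trust or override it.
def token_features(token, rule_label, pos_tag, in_person_gazetteer):
    return {
        "word": token,
        "pos": pos_tag,                        # morphological feature
        "rule_decision": rule_label,           # NE label from the rule-based component
        "in_person_gazetteer": in_person_gazetteer,  # whitelist lookup
    }

# One training instance per token, labeled with the gold NE tag.
print(token_features("Hassan", "Person", "NOUN_PROP", True))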
The rule-based component is a reproduction of the NERA system [24] using the
GATE framework4. It consists of three main modules: Whitelists (lists of full names),
Grammar Rules and a Filtration mechanism (blacklists of invalid names) as illustrated
in Figure 1. In GATE, the rule-based component works as a corpus pipeline where a
corpus is processed through an Arabic tokenizer, resources including a list of gazet-
teers, and local grammatical Rules (implemented as finite-state transducers).
2 Available to us under license agreement from the Linguistic Data Consortium (LDC).
3 Available to download at http://www1.ccls.columbia.edu/~ybenajiba/downloads.html
4 GATE is freely available at http://gate.ac.uk/
Figure 2 illustrates an example of the Person name rules utilized by the rule-based
component. The function of the rule in Figure 2 is to recognize expressions that start
with "اﺑﻮ" or "ام" followed by a First Person Name, with the possibility of having a
First, Middle or Last Name afterwards. Examples of Person names extracted by this
rule are: "اﺑﻮ ﺣﺴﻦ" (the father of Hassan) and "ام ﻋﻤﺮ ﻃﻪ" (the mother of Omar Taha).
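As a rough approximation of such a rule outside GATE, the sketch below expresses the
pattern as a regular expression over a tiny assumed gazetteer; the actual system implements
such rules as finite-state transducers over large whitelists.

# Sketch: "ابو" (father of) or "ام" (mother of) followed by a first name,
# optionally followed by another name. The gazetteers are toy assumptions.
import re

first_names = {"حسن", "عمر"}
other_names = {"طه"}

pattern = re.compile(r"(?:ابو|ام)\s+(?:%s)(?:\s+(?:%s))?"
                     % ("|".join(first_names), "|".join(first_names | other_names)))

for text in ["ابو حسن", "ام عمر طه"]:
    print(text, pattern.fullmatch(text) is not None)   # both True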
Fig. 1. Architecture of the proposed hybrid system: input text is processed by the rule-based
component (exact matching against whitelist gazetteers, parsing with grammar rules, and
filtering against blacklist gazetteers) to produce tagged text, which passes through feature
extraction into the ML component (the ML method and the resulting model/classifier). Rule
annotations take the form :Per.Person={rule="PersonRule1"}.
6 Experimental Results
5 WEKA is available at www.cs.waikato.ac.nz/ml/weka/
6 MADA is available at http://www1.ccls.columbia.edu/MADA/
inclusion/exclusion of feature groups. The reference datasets are the initial datasets
described with their tagging details in Section 4 including ACE corpora and ANER-
corp.
The performance of the rule-based component is evaluated using GATE's built-in
evaluation tool, AnnotationDiff. On the other hand, the ML-based component
applies three different classifiers to the datasets: Decision Trees, SVM and
Logistic Regression, which are available in WEKA as the J48, LibSVM and
Logistic classifiers respectively. In this research, 10-fold cross-validation is
chosen to avoid overfitting. The WEKA tool provides the functionality of
applying the conventional k-fold cross-validation for evaluation.
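The evaluation setup can be reproduced in outline as follows. The sketch uses
scikit-learn analogues of the WEKA classifiers (J48 as a decision tree, LibSVM as SVC,
Logistic as logistic regression) and random placeholder features in place of the real NE
feature vectors; it is an assumed rendering, not the authors' WEKA runs.

# Sketch: three classifiers evaluated with 10-fold cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 12))          # placeholder feature vectors
y = rng.integers(0, 2, 200)        # 1 = Person token, 0 = other

for name, clf in [("J48-like", DecisionTreeClassifier()),
                  ("LibSVM-like", SVC()),
                  ("Logistic", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(name, round(scores.mean(), 3))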
Table 2. The results of applying the proposed hybrid system on ACE2003 (NW & BN),
ACE2004 (NW), & ANERcorp datasets in order to extract Person names
Table 3. The results of ANERsys 1.0, ANERsys 2.0, CRF-based system [7] and Abdallah et al.
[1]’s system compared to our hybrid system’s highest performance when applied to ANERcorp
dataset in order to extract Person names
System                    Precision  Recall  F-measure
ANERsys 1.0 [5]           0.5421     0.4101  0.4669
ANERsys 2.0 [6]           0.5627     0.4856  0.5213
CRF-based system [7]      0.8041     0.6742  0.7335
Abdallah et al. [1]       0.949      0.9078  0.928
Our Hybrid System (J48)   0.949      0.942   0.945
In the literature, the use of either a rule-based approach or a pure ML-based approach is
considered successful for Arabic NER in general and Arabic Person name
References
1. Abdallah, S., Shaalan, K., Shoaib, M.: Integrating Rule-Based System with Classification
for Arabic Named Entity Recognition. In: Gelbukh, A. (ed.) CICLing 2012, Part I. LNCS,
vol. 7181, pp. 311–322. Springer, Heidelberg (2012)
2. AbdelRahman, S., Elarnaoty, M., Magdy, M., Fahmy, A.: Integrated Machine Learning
Techniques for Arabic Named Entity Recognition. IJCSI 7, 27–36 (2010)
3. Abdul-Hamid, A., Darwish, K.: Simplified Feature Set for Arabic Named Entity Recogni-
tion. In: Proceedings of the 2010 Named Entities Workshop, pp. 110–115 (2010)
4. Babych, B., Hartley, A.: Improving Machine Translation Quality with Automatic Named
Entity Recognition. In: Proceedings of the 7th International EAMT workshop on MT and
other Language Technology Tools, Improving MT through other Language Technology
Tools: Resources and Tools for Building MT (EAMT 2003), pp. 1–8 (2003)
5. Benajiba, Y., Rosso, P., BenedíRuiz, J.M.: ANERsys: An Arabic Named Entity Recogni-
tion System Based on Maximum Entropy. In: Gelbukh, A. (ed.) CICLing 2007. LNCS,
vol. 4394, pp. 143–153. Springer, Heidelberg (2007)
6. Benajiba, Y., Rosso, P.: ANERsys 2.0: Conquering the NER task for the Arabic language
by combining the Maximum Entropy with POS-tag information. In: Proceedings of Work-
shop on Natural Language-Independent Engineering, IICAI 2007, pp. 1814–1823 (2007)
7. Benajiba, Y., Rosso, P.: Arabic Named Entity Recognition using Conditional Random
Fields. In: Proceedings of LREC 2008 (2008)
8. Benajiba, Y., Diab, M., Rosso, P.: Arabic Named Entity Recognition: An SVM-Based Ap-
proach. In: Proceedings of (ACIT 2008), pp. 16–18 (2008)
9. Benajiba, Y., Diab, M., Rosso, P.: Arabic Named Entity Recognition Using Optimized
Feature Sets. In: Proceedings of EMNLP 2008, pp. 284–293 (2008)
10. Benajiba, Y., Diab, M., Rosso, P.: Arabic Named Entity Recognition: A Feature-Driven
Study. IEEE Transactions on Audio, Speech and Language Processing 17, 926–934 (2009)
11. Benajiba, Y., Diab, M., Rosso, P.: Using Language Independent and Language Specific
Features to Enhance Arabic Named Entity Recognition. The International Arab Journal of
Information Technology 6, 464–473 (2009)
12. Elsebai, A., Meziane, F., BelKredim, F.Z.: A Rule Based Persons Names Arabic Extrac-
tion System. In: Communications of the IBIMA, pp. 53–59 (2009)
13. Farber, B., Freitag, D., Habash, N., Rambow, O.: Improving NER in Arabic Using a Mor-
phological Tagger. In: Proceedings of Workshop on HLT & NLP within the Arabic World
(LREC 2008), pp. 2509–2514 (2008)
14. Habash, N., Owen, R., Ryan, R.: MADA+TOKAN: A Toolkit for Arabic Tokenization,
Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatiza-
tion. In: Proceedings of the 2nd International Conference on Arabic Language Resources
and Tools, MEDAR (2009)
15. Habash, N., Soudi, A., Buckwalter, T.: On Arabic Transliteration. In: Arabic Computa-
tional Morphology: Knowledge-based and Empirical Methods, pp. 15–22 (2007)
16. Hamadene, A., Shaheen, M., Badawy, O.: ARQA: An Intelligent Arabic Question Answer-
ing System. In: Proceedings of ALTIC 2011 (2011)
17. Maloney, J., Niv, M.: TAGARAB: A Fast, Accurate Arabic Name Recognizer Using
High-Precision Morphological Analysis. In: Proceedings of the Workshop on Computa-
tional Approaches to Semitic Languages (Semitic 1998), pp. 8–15 (1998)
18. Mesfar, S.: Named Entity Recognition for Arabic Using Syntactic Grammars. In: Kedad,
Z., Lammari, N., Métais, E., Meziane, F., Rezgui, Y. (eds.) NLDB 2007. LNCS, vol. 4592,
pp. 305–316. Springer, Heidelberg (2007)
19. Nadeau, D., Sekine, S.: A Survey of Named Entity Recognition and Classification.
Lingvisticae Investigationes 30, 3–26 (2007)
20. Oudah, M.M., Shaalan, K.: A Pipeline Arabic Named Entity Recognition Using a Hybrid
Approach. In: Proceedings of COLING 2012, pp. 2159–2176 (2012)
21. Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D.: Us-
ing Machine Learning to Maintain Rule-based Named-Entity Recognition and Classifica-
tion Systems. In: Proceedings of the Association for Computational Linguistics, pp. 426–433
(2001)
22. Shaalan, K.: Rule-based Approach in Arabic Natural Language Processing. IJICT 3, 11–19
(2010)
23. Shaalan, K., Raza, H.: Person Name Entity Recognition for Arabic. In: Proceedings of the
5th Workshop on Important Unresolved Matters, pp. 17–24 (2007)
24. Shaalan, K., Raza, H.: Arabic Named Entity Recognition from Diverse Text Types. In:
Nordström, B., Ranta, A. (eds.) GoTAL 2008. LNCS (LNAI), vol. 5221, pp. 440–451.
Springer, Heidelberg (2008)
25. Shaalan, K., Raza, H.: NERA: Named Entity Recognition for Arabic. Journal of the Amer-
ican Society for Information Science and Technology 60, 1652–1663 (2009)
26. Zaghouani, W.: RENAR: A Rule-Based Arabic Named Entity Recognition System. ACM
Transactions on Asian Language Information Processing 11, 1–13 (2012)
A Broadly Applicable and Flexible Conceptual
Metagrammar as a Basic Tool for Developing
a Multilingual Semantic Web
Vladimir A. Fomichov
1 Introduction
During the last decade, one has been able to observe, in different parts of the world, a
steady growth of interest in designing natural language (NL) interfaces to applied
intelligent systems and in constructing other kinds of NL processing systems, or
linguistic processors. In particular, a number of projects useful for practice are
described in [1-7].
One of the most acute and large-scale problems is to endow the existing Web with
the ability to extract information from numerous sources in various natural
languages (cross-language information retrieval) and to construct NL interfaces
to a number of knowledge repositories recently developed within the framework of the
Semantic Web project [2, 8-12].
The aim of this paper is to introduce the notion of a broadly applicable and flexible
Conceptual Metagrammar (CM) and to ground the opinion that the first version of
such a CM already exists. More exactly, the definition of the class of SK-languages
(standard knowledge languages) provided by the theory of K-representations
(knowledge representations) [9-13] can be interpreted as the first version of a
broadly applicable and flexible CM. The final part of the paper discusses the
connections with related approaches.
2 Problem Statement
3 Methodology
As late as the middle of the 1960s, researchers had practically only one formal
approach to describing structured meanings (SMs) of NL-texts: first-order logic
(FOL). Due to numerous restrictions of FOL, the search for more powerful and
flexible formal means for describing SMs of NL-texts started in the second half
of the 1960s. As a result, a number of new theories have been developed, first of all
the Theory of Generalized Quantifiers (TGQ), Discourse Representation Theory
(DRT), the Theory of Semantic Nets (TSN), the Theory of Conceptual Graphs (TCG),
Episodic Logic (EL), and the Theory of K-representations (knowledge representations).
The latter theory is an original theory of designing semantic-syntactic analysers of
NL-texts with the broad use of formal means for representing input, intermediary, and
output data [9-13]. This theory also contributes to the development of logic-
informational foundations of (a) Semantic Web of a new generation, (b) E-commerce,
and (c) multi-agent systems theory (agent communication languages) [11-12].
In order to understand the principal distinction of the theory of K-representations
from other mentioned approaches to formalizing semantics of NL, let’s consider an
analogy. Bionics studies the peculiarities of the structure and functioning of the living
beings in order to discover the new ways of solving certain technical problems. Such
theories as TGQ, DRT, TSN, TCG, EL and several other theories were elaborated on
the way of expanding the expressive mechanisms of FOL. To the contrary, the theory
of K-representations was developed as a consequence of analysing the basic
expressive mechanisms of NL and putting forward a conjecture about a system of
partial operations on conceptual structures underpinning these expressive
mechanisms. Of course, the idea was to develop a formal model of this system being
compatible with FOL.
The first basic constituent of the theory of K-representations is the theory of SK-
languages (standard knowledge languages). The kernel of this theory is a
mathematical model describing a system of such 10 partial operations on structured
meanings (SMs) of natural language texts (NL-texts) that, using primitive conceptual
items as "blocks", we are able to build SMs of arbitrary NL-texts (including articles,
textbooks, etc.) and arbitrary pieces of knowledge about the world. The analysis of
the scientific literature shows that today the class of SK-languages opens the broadest
prospects for representing SMs of NL-texts in a formal way.
The second basic constituent of the theory of K-representations is a broadly
applicable mathematical model of a linguistic database [9, 11]. The third basic
constituent of the theory of K-representations is several complex, strongly structured
algorithms carrying out semantic-syntactic analysis of texts from some practically
interesting sublanguages of NL. The algorithm SemSynt1 transforms an NL-text into
its semantic representation, which is a K-representation [11]. The input texts
(statements, commands, and questions of many kinds) can be from the English,
German, and Russian languages. This algorithm is implemented by means of a
program in the Python language.
The paper [14] describes an application of the theory of K-representations to the
elaboration of a new approach to semantic search of documents on the Web. The
subject of the paper is semantic processing of the requests about the achievements or
failures of the organizations (firms, etc.) and people. A generalized request of the end
user is transformed into a set of concrete requests; this is done with the help of a goals
base storing the semantic representations of the goals of active systems. A model of a
goals base is constructed with the help of the theory of K-representations.
Degr(B) is the carrier of the partial algebra; Operations(B) is the set consisting of the
partial unary operations Op[1], …, Op[10] on Degr(B).
The volume of the complete description in [11] of the mathematical model
introducing, in essence, the operations Op[1], …, Op[10] on Degr(B) and, as a
consequence, determining the class of SK-languages considerably exceeds the volume
of this paper. That is why this model cannot be included here. A short outline of the
model can be found in [10].
4 Results
Let’s consider the principal new expressive mechanisms introduced by the definition
of the class of SK-languages.
To sum up, SK-languages allow for describing the semantic structure of sentences
with direct and indirect speech and of discourses with references to the meanings of
phrases and larger parts of a discourse, and for constructing compound designations
of notions, sets, and sequences.
5 Discussion
The advantages of the theory of K-representations in comparison with first-order
logic, Discourse Representation Theory, and Episodic Logic are, in particular, the
possibilities: (1) to distinguish in a formal way objects (physical things, events, etc.)
and notions qualifying them; (2) to build compound representations of notions; (3) to
distinguish in a formal manner objects and sets of objects, concepts and sets of
concepts; (4) to build complex representations of sets, sets of sets, etc.; (5) to describe
set-theoretical relationships; (6) to effectively describe structured meanings (SMs) of
discourses with references to the meanings of phrases and larger parts of discourses;
(7) to describe SMs of sentences with the words "concept", "notion"; (8) to describe
SMs of sentences where the logical connective "and" or "or" joins not the
expressions-assertions but designations of things, sets, or concepts; (9) to build
complex designations of objects and sets; (10) to consider non-traditional functions
with arguments and/or values being sets of objects, of concepts, of texts' semantic
6 Conclusions
The arguments stated above and numerous additional arguments set forth in the
monograph [11] give serious grounds to conclude that the definition of the class of
SK-languages can be interpreted as the first version of a broadly applicable and
flexible Conceptual Metagrammar.
The theory of K-representations was developed as a tool for dealing with numerous
questions of studying the semantics of arbitrarily complex natural language texts: both
sentences and discourses. Grasping the main ideas and methods of this theory requires
considerably more time than is necessary to start constructing formulas of first-order
logic. However, the effort spent studying the foundations of the theory of
K-representations would be highly rewarded. Independently of the application
domain, a designer of an NL processing system will have a convenient tool for
solving various problems.
References
1. Popescu, A.-M., Etzioni, O., Kautz, H.: Towards a Theory of Natural Language
Interfaces to Databases. In: Proc. of the 8th Intern. Conf. on Intelligent User
Interfaces, Miami, FL, pp. 149–157 (2003)
2. Kaufmann, E., Bernstein, A.: How Useful Are Natural Language Interfaces to the
Semantic Web for Casual End-Users? In: Aberer, K., et al. (eds.) ASWC/ISWC
2007. LNCS, vol. 4825, pp. 281–294. Springer, Heidelberg (2007)
A Broadly Applicable and Flexible Conceptual Metagrammar 259
3. Cimiano, P., Haase, P., Heizmann, J., Mantel, M.: ORAKEL: A Portable Natural
Language Interface to Knowledge Bases. Technical Report, Institute AIFB,
University of Karlsruhe, Germany (2007)
4. Frank, A., Krieger, H.-U., Xu, F., Uszkoreit, H., Crysmann, B., Jörg, B., Schaeffer,
U.: Question Answering from Structured Knowledge Sources. J. of Applied
Logic 5(1), 20–48 (2007)
5. Prince, V., Roche, M. (eds.): Information Retrieval in Biomedicine: Natural
Language Processing for Knowledge Integration. IGI Global (2009)
6. Harrington, B., Clark, S.: ASKNet: Creating and Evaluating Large Scale
Integrated Semantic Networks. In: Proceedings of the 2008 IEEE International
Conference on Semantic Computing, pp. 166–173. IEEE Computer Society,
Washington DC (2008)
7. Rindflesch, T.C., Kilicoglu, H., Fiszman, M., Rosemblat, G., Shin, D.: Semantic
MEDLINE: An Advanced Information Management Application for
Biomedicine. Information Services and Use, vol. 1, pp. 15–21. IOS Press (2011)
8. Wilks, Y., Brewster, C.: Natural Language Processing as a Foundation of the
Semantic Web. Foundations and Trends in Web Science. Now Publ. Inc.,
Hanover (2006)
9. Fomichov, V.A.: The Formalization of Designing Natural Language Processing
Systems. MAX Press, Moscow (2005) (in Russian)
10. Fomichov, V.A.: Theory of K-representations as a Source of an Advanced
Language Platform for Semantic Web of a New Generation. In: Web Science
Overlay J. On-line Proc. of the First Intern. Conference on Web Science, Athens,
Greece, March 18-20 (2009), http://journal.webscience.org/221/
1/websci09_submission_128.pdf
11. Fomichov, V.A.: Semantics-Oriented Natural Language Processing:
Mathematical Models and Algorithms. Springer, Heidelberg (2010a)
12. Fomichov, V.A.: Theory of K-representations as a Comprehensive Formal
Framework for Developing a Multilingual Semantic Web. Informatica. An
International Journal of Computing and Informatics 34(3), 387–396 (2010b)
(Slovenia)
13. Fomichov, V.A.: A Mathematical Model for Describing Structured Items of
Conceptual Level. Informatica. An Intern. J. of Computing and Informatics 20(1),
5–32 (1996) (Slovenia)
14. Fomichov, V.A., Kirillov, A.V.: A Formal Model for Constructing Semantic
Expansions of the Search Requests About the Achievements and Failures. In:
Ramsay, A., Agre, G. (eds.) AIMSA 2012. LNCS, vol. 7557, pp. 296–304.
Springer, Heidelberg (2012)
15. Harrington, B., Wojtinnik, P.-R.: Creating a Standardized Markup Language for
Semantic Networks. In: Proceedings of the 2011 IEEE Fifth International
Conference on Semantic Computing, pp. 279–282. IEEE Computer Society,
Washington DC (2011)
16. Turnpenny, P.D., Ellard, S.: Emery’s Elements of Medical Genetics, 12th edn.
Elsevier Limited, Edinburgh (2005)
MOSAIC: A Cohesive Method for Orchestrating Discrete
Analytics in a Distributed Model
1 Introduction
2 Background
The analytic integration problem presented in this work is not new. The DARPA
TIPSTER Text Program was a 9-year multi-million dollar R&D effort to improve
HLT for the handling of multilingual corpora for use within the intelligence process.
Its first phase funded algorithms for Information Retrieval / Extraction, resulting in
pervasive repeated functionality. The second phase sought to develop an architecture
[1]. Despite a timetable of six months, years were required. Any success lay in mak-
ing the architecture palatable for voluntary adoption [2].
From TIPSTER’s Phase III emerged GATE (General Architecture for Text Engi-
neering) [3]. GATE has evolved since 1995 and is widely used by European research-
ers. An important emphasis of GATE was a separation of data storage, execution, and
visualization from the data structures and analytics. Integration in GATE is achieved
by making use of standards of Java and XML to allow inter-analytic communication.
A third relevant HLT architecture is UIMA (Unstructured Information Manage-
ment Architecture), a scalable integration platform for semantic analytics and search
components. Developed by IBM, UIMA is a project at the Apache Software Founda-
tion and has earned support in the HLT community. UIMA was meant to insulate
analytic developers from the system concerns while allowing for tightly-coupled dep-
loyments to be delivered with the ease of service-oriented distributed deployments
[4]. UIMA has a greater scalability than GATE when using UIMA AS to distribute
analyses to operate in parallel as part of a single workflow.
A recent IBM success is Watson, the question answering system [5] capable of de-
feating human champions of Jeopardy, a striking feat of engineering. This was a mul-
ti-year intense research and development project undertaken by a core team of
researchers that produced the DeepQA design architecture, where components that
produce annotations or make assertions were implemented as UIMA annotators.
Recent work in developing an HLT architecture alternative [6] expressed the chal-
lenge of using existing HLT integration platforms in a research environment. In this
project, Curator, the desire was to avoid a single HLT preprocessing framework as
well as all-encompassing systems with steep learning curves, as is the case with
GATE or UIMA. Rather, Curator was designed to directly support the use case of
diverse HLT analytics operating in concert. This articulates the scenario we raised.
3 Methods
Three roles exist with respect to MOSAIC: developers, architects, and users. Developers are analytic crea-
tors, architects are integrators into the architecture, and users make and execute
workflows. MOSAIC must handle complex sequential and concurrent workflows with
analytic modules as black boxes. Discrete analytics must not be tightly coupled to the
workflow.
The inbound gateway handles the input stream of documents. Based on active
workflow instances deployed by the executive, the inbound gateway will submit doc-
uments to the data bus. Documents can be triaged here, such that only documents that
match the active workflow instances’ specifications move forward.
The executive for the system orchestrates all the user-specified behavior in the ex-
ecution of a workflow. Workflows are capable of being deployed persistently and ad
hoc. Workflow information is retained in the output, making results traceable and
repeatable. Our constraints suggest a software framework targeted to analytic
processing which handles crawling through documents, parsing the documents, and
routing the documents to the appropriate analytics, treated as plug-ins to the system.
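A minimal sketch of this orchestration pattern follows: an executive with a plug-in
registry runs a user-specified workflow, routing a document through black-box analytics
and retaining workflow information with the output. All names are assumptions for
illustration, not MOSAIC's actual API.

# Sketch: a plug-in executive that executes a workflow over one document,
# keeping provenance so results are traceable and repeatable.
class Executive:
    def __init__(self):
        self.analytics = {}                    # plug-in registry

    def register(self, name, fn):
        self.analytics[name] = fn

    def run(self, workflow, document):
        artifacts = {"workflow": list(workflow)}   # retained provenance
        for step in workflow:
            artifacts[step] = self.analytics[step](document, artifacts)
        return artifacts

executive = Executive()
executive.register("tokenize", lambda doc, art: doc.split())
executive.register("count", lambda doc, art: len(art["tokenize"]))

print(executive.run(["tokenize", "count"], "a small test document"))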
The data bus is responsible for collecting, managing, and indexing data in a distri-
buted fashion, allowing search and retrieval of documents and artifacts. Artifacts can
last the life of their workflow instance or be made persistent.
Analytics include any existing software packages used extensively for text to in-
formation processing and smaller scale software crafted by subject matter experts.
The system’s flexibility means analytics not yet written can be added as plug-ins later.
Adapters are necessary for maintaining a common interchange format for the data
to be passed between the analytics. A common interchange format can represent any-
thing extracted or generated from the documents. This is a language that the adapters
can interpret when translating data produced by one analytic for use by another.
Analytics typically produce raw formats of their own data objects. With an adapter
layer, there is no expectation placed on analytic developers to write to the common
interchange format or reengineer an existing analytic. Yet this does require that adap-
ters be created for each raw format to convert output to the interchange format. For
each particular input, another adapter must be written that will create data in this for-
mat from data in the common interchange format. Analytics that share a raw model
and format either input or output can use the same adapters.
Note that an analytic is concerned with the extraction or generation of artifacts from
input, while an adapter is concerned with the conversion of one analytic format to
another. Architects who are familiar with both the common model and the individual
raw analytic data models are the appropriate developers of adapters. This underscores
the importance of having a separate adapter layer, as it allows the analytics to be
buffered from the system integration and allows the common model to evolve
without a direct impact on the analytics.
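To make the adapter layer concrete, the following is a minimal sketch of what such a layer might look like. MOSAIC's actual interfaces are not published in this paper, so every name here (CommonArtifact, Adapter, NerAdapter) and the raw output format are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class CommonArtifact:
    """Hypothetical record in the common interchange format."""
    source_doc: str
    kind: str                 # e.g. "entity", "relation"
    payload: dict
    provenance: dict = field(default_factory=dict)

class Adapter(Protocol):
    """An adapter converts one analytic's raw format to and from the
    common interchange format; analytics themselves never see it."""
    def to_common(self, raw: dict) -> list[CommonArtifact]: ...
    def from_common(self, artifacts: list[CommonArtifact]) -> dict: ...

class NerAdapter:
    """Adapter for a hypothetical NER analytic whose raw output is
    {"doc": id, "entities": [(text, label, start), ...]}."""
    def to_common(self, raw):
        return [CommonArtifact(source_doc=raw["doc"], kind="entity",
                               payload={"text": t, "label": lab, "start": s},
                               provenance={"analytic": "ner-v1"})
                for t, lab, s in raw["entities"]]
    def from_common(self, artifacts):
        return {"entities": [(a.payload["text"], a.payload["label"],
                              a.payload["start"])
                             for a in artifacts if a.kind == "entity"]}
```

The division of labor is visible in the types: the analytic only ever produces or consumes its own raw dictionaries, while the adapter alone touches CommonArtifact, and analytics sharing a raw format can reuse the same adapter unchanged.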
The case-study application involves workflows that operate on raw text, generating
results mapped into a knowledge representation preserving annotation and prove-
nance. Results are merged into a knowledge base generated from text source material.
4 Results
Fig. 1. Timing results (ms) for analytics and imposed overhead (orchestration, adaptation, and
write out) using a MySQL-backed store (a) and a variant using F-Logic and OntoServer (b)
5 Discussion
MOSAIC was intended to address the needs of a research environment, but MOSAIC
does not hinder the transition of workflow threads to production. It conformed to the
requirements that the analytics be integrated seamlessly into a workflow that ad-
dresses larger-scope problems but without requiring integration-based rework on the
analytics. The loosely-coupled nature of MOSAIC makes possible the rapid prototyp-
ing of the analytics in these workflows and permits substitution of subcomponents
(i.e., executive, data bus) to allow for new technologies.
We have engineered an implementation of MOSAIC that embraces HLT analytics
(text and speech) and supporting analytics from other domains (e.g., image processing,
metadata analysis) across different workflows. At present, there are 16 HLT analytics
and 5 supporting analytics in this implementation, spanning 8 workflows geared
toward solving larger problems within different genres of documents (textual, auditory,
image, and composites). The typical time to full integration for a new analytic that
requires adapter development is improved over integration into our past efforts.
This adaptation is essential to the design of MOSAIC within the content extraction
domain. Because the final results of the system are knowledge objects, these results
need to have a cohesive representation despite the diversity of the models and formats
of analytic output. Analytic developers cannot be expected to reengineer their
analytics to fit our common representation: it is often impossible to exert control
over external analytic developers who did not model their analytics to the common
representation, and furthermore it is not the role of the analytic developers to perform
and maintain integration into a potentially evolving format. MOSAIC affords a division
of labor such that it is the responsibility of MOSAIC integrators to perform this
adaptation externally to the analytics. This does not remove the necessity of doing the
adaptation work, but it does allow for the proper delineation of work roles such that this
manner of integration is possible for analytics of disparate origins. There are domains
of document processing (e.g., document decomposition, format and language conversion)
which are not founded on producing knowledge results and have no requirements
for adaptation, indicating that MOSAIC could be used in these domains as is.
References
1. Altomari, P.J., Currier, P.A.: Focus of TIPSTER Phases I and II. In: Advances in Text
Processing: TIPSTER Program Phase II, pp. 9–11. Morgan Kaufmann Publishers, Inc.,
San Francisco (April 1994-September 1996)
2. Grishman, R.: Building an Architecture: A CAWG Saga. In: Advances in Text Processing:
TIPSTER Program Phase II, pp. 213–215. Morgan Kaufmann Publishers, Inc., San Fran-
cisco (1996)
3. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: An Architecture for
Development of Robust HLT. In: Proceedings of the 40th Anniversary Meeting of the As-
sociation for Computational Linguistics (ACL 2002), Philadelphia, PA, pp. 168–175
(2002)
4. Ferrucci, D., Lally, A.: UIMA: An Architectural Approach to Unstructured Information
Processing in the Corporate Research Environment. Natural Language Engineering 10(3-
4), 327–348 (2004)
5. Ferrucci, D., et al.: Building Watson: an Overview of the DeepQA Project. AI Maga-
zine 31(3), 59–79 (2010)
6. Clarke, J., Srikumar, V., Sammons, M., Roth, D.: An NLP Curator (or: How I Learned to
Stop Worrying and Love NLP Pipelines). In: Proceedings of LREC 2012 (2012)
7. Boschee, E., Weischedel, R., Zamanian, A.: Automatic Information Extraction. In: Pro-
ceedings of the 2005 International Conference on Intelligence Analysis, pp. 2–4 (2005)
8. Surdeanu, M., Harabagiu, S.: Infrastructure for Open-domain Information Extraction. In:
Proceedings of the Human Language Technology Conference, pp. 325–333 (2002)
9. SRA NetOwl,
http://www.sra.com/netowl/entity-extraction/features.php
10. Alias-I, LingPipe 4.1.0, http://alias-i.com/lingpipe
11. Taylor, M., Carlson, L., Fontaine, S., Poisson, S.: Searching Semantic Resources for Com-
plex Selectional Restrictions to Support Lexical Acquisition. In: Third International Confe-
rence on Advances in Semantic Processing, pp. 92–97 (2009)
Ranking Search Intents Underlying a Query
Yunqing Xia1, Xiaoshi Zhong1, Guoyu Tang1, Junjun Wang1, Qiang Zhou1,
Thomas Fang Zheng1, Qinan Hu2, Sen Na2, and Yaohai Huang2
1 Tsinghua National Laboratory for Information Science and Technology, Department of
Computer Science and Technology, Tsinghua University, Beijing 100084, China
{gytang,yqxia,xszhong,jjwang,zq-lxd,fzheng}@tsinghua.edu.cn
2 Canon Information Technology (Beijing) Co. Ltd., Beijing 100081, China
{huqinan,nasen,huangyaohai}@canon-ib.com.cn
Abstract. Observation of search engine query logs indicates that queries are
usually ambiguous. Similar to document ranking, search intents should be
ranked to facilitate information search. Previous work attempts to rank intents
using relevance score alone. We argue that diversity is also important. In this
work, unified models are proposed to rank intents underlying a query by combining
relevance score and diversity degree, where the latter is reflected by the
non-overlapping ratio of every intent and the aggregated non-overlapping ratio of a
set of intents. Three conclusions are drawn from the experimental results.
Firstly, diversity plays an important role in intent ranking. Secondly, URL is
more effective than similarity in detecting unique subtopics. Thirdly, the aggregated
non-overlapping ratio makes some contribution to similarity-based intent
ranking but little to URL-based intent ranking.
1 Introduction
Search engines receive billions of queries every day, while more than 30 percent of
queries are ambiguous. The ambiguity can be classified into two types: (1) The meaning
of the query cannot be determined. For example, for the query “bat”, it is difficult to know
whether the query refers to a flying mammal or a tool for playing squash. (2) The facet of
the query cannot be determined. For example, for the query “batman”, one cannot figure
out which facet the user wants to know about. Previous work ranks intents with relevance
score and document similarity [1-2], which are insufficient. Observations disclose that
intents usually overlap with each other. For example, the history of
San Francisco always mentions places and persons in the city. Searching with “San
Francisco” usually indicates three overlapping intents: San Francisco history, San
Francisco places and San Francisco people. We argue that the non-overlapping (i.e.,
unique) part amongst the intents plays a vital role in intent ranking.
In this work, unified models are proposed to rank intents underlying a query by
combining relevance score and diversity degree. For the diversity degree, we propose
the non-overlapping ratio to measure difference between intents. When calculating
cosine distance between two documents, we started from the term-based vector and
further proposed the sense-based vector. Three conclusions are drawn from the expe-
rimental results. Firstly, diversity plays an important role in intent ranking. Secondly,
URL is more effective than similarity in detecting unique subtopics. Thirdly, the ag-
gregated non-overlapping ratio makes some contribution in similarity based intent
ranking but little in URL based intent ranking.
The rest of this paper is organized as follows. In Section 2, we summarize related
work. In Section 3, we report intent mining. We then present the non-overlapping
ratio and intent ranking models in Section 4 and Section 5, respectively. Experiments
and discussions are given in Section 6, and we conclude the paper in Section 7.
2 Related Work
Intent mining is a new research topic arising from the NTCIR-9 intent mining task [1]. To
obtain the subtopic candidates, the THUIR system uses Google, Bing, Baidu, Sogou,
Youdao, Soso, Wikipedia and query logs [2], which proved helpful. Clustering on
subtopic candidates is also used to find intents. For example, the Affinity Propagation
algorithm is adopted in the HITCSIR system to find intents [3]. In subtopic ranking, most
NTCIR-9 intent mining systems rely merely on relevance score [2-4]. In contrast, we
incorporate diversity into unified models for intent ranking.
Very recently, diversity has been explored by search engines to obtain diversified
search results. For example, uogTr system applied the xQuAD framework for diversi-
fying search results [5]. Some early diversification algorithms explore similarity func-
tions to measure diversity [6]. Essential Pages algorithm was proposed to reduce in-
formation redundancy and returns Web pages that maximize coverage with respect to
the input query [7]. This work is different as we propose unified intent ranking mod-
els considering both document relevance and intent overlap.
3 Intent Mining

In our intent mining system, intents are discovered from a set of subtopics, which are
text strings reflecting certain aspects of the query.
4 Non-overlapping Ratio
Diversity is in fact reflected by the non-overlapping (NOL) ratio, which is the ratio of
the non-overlapping parts over the overlapping parts within the intent. Consider an intent
$I = \{t_1, t_2, \ldots, t_N\}$, where $t_i$ denotes a subtopic. Using subtopic $t$ as a query, we obtain
a set of search results $\{r_1, r_2, \ldots, r_M\}$ with a search engine, where $r_j$ represents a
search result. For Web search, we can further represent a search result by its unique
URL string and document $d$: $r_j = \{url_j, d_j\}$. Considering overlap, the documents covered
by an intent can be divided into a unique part and a common part.

We define the non-overlapping (NOL) ratio of an intent as the ratio of the unique part to
the common part within the intent. Formally, given an intent $I$ that covers a search
result set $R = \{r_1, r_2, \ldots, r_\Theta\}$, we divide $R$ into $R = R_{uniq} \cup R_{comm}$, where
$R_{uniq} = \{r^1_{uniq}, r^2_{uniq}, \ldots, r^K_{uniq}\}$ and $R_{comm} = \{r^1_{comm}, r^2_{comm}, \ldots, r^L_{comm}\}$ represent the
unique part and the remaining (common) part, respectively, and $K + L = \Theta$. The NOL ratio of
an intent is calculated as follows:

$$ratio_{NOL} = \frac{\|R_{uniq}\| + \beta}{\|R_{comm}\| + \beta} \quad (1)$$

where $\beta$ is set to 1 to avoid division by zero.
We designed two ways to count unique search results. In the first way, we simply
compare the URLs of the search results to determine uniqueness. In the second way,
a search result is deemed unique if it is not semantically similar to any other search
result. We adopt the cosine distance over a vector space model to measure document similarity.
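As an illustration, the sketch below computes the NOL ratio of Eq. (1) under both uniqueness tests. The pooling of all subtopics' results into one list and the pairwise overlap test are our reading of the definitions above, not code from the authors.

```python
from itertools import combinations

def nol_ratio(results, beta=1.0, sim=None, threshold=0.8):
    """NOL ratio of an intent (Eq. 1). `results` pools the search results
    of all subtopics of the intent; each result is a dict with a 'url' key
    and, for the similarity test, a 'doc' vector. A result counts as
    common if it overlaps with some other pooled result: identical URL,
    or cosine similarity >= threshold when `sim` is supplied."""
    common = set()
    for i, j in combinations(range(len(results)), 2):
        if sim is None:
            overlap = results[i]["url"] == results[j]["url"]
        else:
            overlap = sim(results[i]["doc"], results[j]["doc"]) >= threshold
        if overlap:
            common.update((i, j))
    n_uniq = len(results) - len(common)
    return (n_uniq + beta) / (len(common) + beta)
```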
5 Intent Ranking
We present two intent ranking models designed based on the NOL ratio.
1 Freebase: http://www.freebase.com/
$$ratio_{w\text{-}NOL} = \frac{\sum_k w^k_{uniq} + \beta}{\sum_l w^l_{comm} + \beta} \quad (2)$$
The relevance score is calculated with cosine distance. Finally, intents are ranked
according to the weighted NOL (w-NOL) ratio.
$$ratio_{ANOL} = \frac{\|\hat{R}_{unique}\| + \beta}{\|\hat{R}_{common}\| + \beta} \quad (3)$$
Ranking intents with the w-ANOL ratio is an iterative process: it starts from the
top-ranked intent and ends with a complete intent list. Given the n intents
$\Pi_n = \{I_1, I_2, \ldots, I_n\}$ obtained in the n-th step, the (n+1)-th step seeks the intent
$I^*$ among the remaining intents $\bar{\Pi}_n = \Pi - \Pi_n$ that maximizes the w-ANOL-based
selection criterion.
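Since the selection formula itself is not reproduced above, the following sketch illustrates only the natural greedy reading of this process: at each step, pick from the remaining intents the one that scores best when appended to the already ranked list. Combining relevance with the aggregated ratio by multiplication is our assumption, made for illustration.

```python
def rank_intents(intents, relevance, anol):
    """Greedy w-ANOL ranking sketch: `relevance(i)` is the relevance score
    of intent i (cosine based, as above) and `anol(S)` is the aggregated
    NOL ratio (Eq. 3) of a set of intents S."""
    ranked, remaining = [], set(intents)
    while remaining:
        # pick the intent that best extends the list built so far
        best = max(remaining, key=lambda i: relevance(i) * anol(ranked + [i]))
        ranked.append(best)
        remaining.remove(best)
    return ranked
```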
6 Evaluation
- Mean Average Precision (MAP): MAP measures the mean of the average pre-
cision scores for each query.
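For reference, a small sketch of the standard MAP computation over ranked intent lists follows; the ranked lists and gold sets are placeholders for the evaluation data described below.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked intent list against a gold set."""
    hits, score = 0, 0.0
    for k, intent in enumerate(ranked, start=1):
        if intent in relevant:
            hits += 1
            score += hits / k          # precision at each relevant hit
    return score / max(len(relevant), 1)

def mean_average_precision(runs):
    """MAP: the mean of the per-query average precision scores.
    `runs` is a list of (ranked_list, gold_set) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)
```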
Methods: The following intent ranking methods will be evaluated in our experiments.
- RIR: Intent ranking based merely on relevance.
- MIR: Intent ranking with MMR [4].
- UIR: Intent ranking according to w-NOL ratio based on URL.
- SIR: Intent ranking according to w-NOL ratio based on document similarity.
- UAIR: Intent ranking according to w-ANOL ratio based on URL.
- SAIR: Intent ranking according to w-ANOL ratio based on document similari-
ty.
Experiments are conducted to justify the contribution of the unified models. We set the
similarity threshold to 0.8 for unique document determination.
Experimental results of the six methods are presented in Fig. 1. Three observations
are made on the experimental results.
Fig. 1. Experimental results of the intent ranking methods
Firstly, we compare the NOL-ratio-based ranking methods (i.e., UIR, SIR, UAIR
and SAIR) against the traditional relevance-based ranking methods (i.e., RIR and
MIR). As seen from Fig. 1, all the NOL-ratio-based ranking methods outperform the
traditional methods significantly. It can be concluded that the NOL ratio makes a
significant contribution to intent ranking. Secondly, we compare the four NOL-ratio-based
ranking methods. As shown in Fig. 1, the ANOL-ratio-based methods (i.e., UAIR and SAIR)
outperform the NOL-ratio-based methods (i.e., UIR and SIR) on MAP, but on nDCG there is
no consistent advantage. We conclude that the ANOL ratio tends to give the accurate
intents higher ranks but is not necessarily advantageous over the NOL ratio in assigning
the correct ranks. Thirdly, we compare the URL-based intent ranking methods (i.e., UIR
and UAIR) with the similarity-based methods (i.e., SIR and SAIR). As seen in Fig. 1, the
URL-based methods outperform the similarity-based methods consistently. We thus conclude
that similarity does not contribute much to detecting unique search results.
7 Conclusion
This paper seeks to prove that diversity is important in ranking the intents underlying a
query. The contributions of this work are summarized as follows. Firstly, diversity degree
is incorporated into intent ranking. Secondly, the non-overlapping ratio is proposed to
calculate the diversity degree of an intent. Thirdly, intents are ranked with the
non-overlapping ratio in a standalone manner and an aggregating manner, respectively.
Three conclusions are drawn from the experimental results. First, diversity plays an
important role in intent ranking. Second, URL is more effective than similarity in
detecting unique subtopics. Finally, the aggregated non-overlapping ratio makes some
contribution to similarity-based intent ranking but little to URL-based intent ranking.
References
1. Song, R., Zhang, M., Sakai, T., Kato, M., Liu, Y., Sugimoto, M., Wang, Q., Orii, N.:
Overview of the NTCIR-9 INTENT Task. In: Proc. of NTCIR-9 Workshop Meeting,
Tokyo, Japan, December 6-9, pp. 82–104 (2011)
2. Xue, Y., Chen, F., Zhu, T., Wang, C., Li, Z., Liu, Y., Zhang, M., Jin, Y., Ma, S.: THUIR at
NTCIR-9 INTENT Task. In: Proc. of NTCIR-9, Tokyo, Japan, December 6-9 (2011)
3. Song, W., Zhang, Y., Gao, H., Liu, T., Li, S.: HITSCIR System in NTCIR-9 Subtopic
Mining Task. In: Proc. of NTCIR-9, Tokyo, Japan, December 6-9 (2011)
4. Han, J., Wang, Q., Orii, N., Dou, Z., Sakai, T., Song, R.: Microsoft Research Asia at the
NTCIR-9 Intent Task. In: Proc. of NTCIR-9, Tokyo, Japan, December 6-9 (2011)
5. Santos, R.L.T., Macdonald, C., Ounis, I.: University of Glasgow at the NTCIR-9 Intent
task: Experiments with Terrier on subtopic mining and document ranking. In: Proc. of
NTCIR-9, Tokyo, Japan, December 6-9 (2011)
6. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering
documents and producing summaries. In: Proc. of SIGIR 1998, Melbourne, Australia, pp.
335–336 (1998)
7. Swaminathan, A., Mathew, C.V., Kirovski, D.: Essential Pages. In: Proc. of WI 2009, Mi-
lan, Italy, pp. 173–182 (2009)
8. Santamaría, C., Gonzalo, J., Artiles, J.: Wikipedia as sense inventory to improve diversity
in web search results. In: Proc. of ACL 2010, Uppsala, Sweden, pp. 1357–1366 (2010)
9. Brody, S., Lapata, M.: Bayesian word sense induction. In: Proc. of EACL 2009, pp.
103–111 (2009)
10. Dueck, D.: Affinity Propagation: Clustering Data by Passing Messages. University of To-
ronto Ph.D. thesis (June 2009)
Linguistic Sentiment Features for Newspaper
Opinion Mining
1 Introduction
Every day, many news texts are published and distributed over the internet (uploaded
newspaper articles, news from online portals). They contain potentially
valuable opinions. Many organisations analyse the polarity of sentiment in news
items which talk about them. What is the media image of company XY? Is
the sentiment changing after the last advertising campaign? A Media
Response Analysis (MRA), for instance, answers these questions [12]. In an MRA, several
media analysts have to read the collected news, select relevant statements from
the articles and assign a sentiment to each statement. In effect, an
MRA requires substantial human effort. At the same time, the internet contains more
and more potentially relevant articles. As a consequence, media monitoring services
require more machine-aided methods. Opinions are not stated as clearly in
newspaper articles [1]. In the news, some special features are important for the
sentiment, so that a purely word-based method cannot solve this problem.
Formal Task Definition: Given a statement s consisting of the words
wi with i ∈ {1, ..., sn }, the task is to find the polarity of sentiment y for the
statement s.
2 Related Work
Research in Opinion Mining is far-reaching [7]; however, most techniques
tackle this problem in the domain of customer reviews [7]. Many approaches for
Opinion Mining in reviews collect sentiment-bearing words [6]. There are methods
[4] which try to handle linguistic or contextual sentiment such as negations.
Negation, perhaps the most important linguistic factor, is often treated
by heuristic rules [4], which reverse the polarity of sentiment words. Interesting
techniques for the effects of negations have been introduced by Jia et al. [5],
where the scope of a negation is derived from different rules. In addition, we
are interested in linguistic and grammatical context, as in Zhou et al. [13],
who show that conjunctions can be used to avoid ambiguities within sentences.
In the news domain, many approaches to this topic work only with reported
speech objects [1]. News articles are less subjective [1], but quotations in
newspaper articles are often the place where more subjective text and opinions can
be found [1]. However, only opinions which are part of a reported speech object
can be analysed by this method. An analysis [9] shows that in an MRA less than
22% of the opinion-bearing text contains quoted text, and only in less than 5%
of cases is the area of quoted text larger than 50% of the whole relevant opinion.
Here, $s_{cat}$ denotes only the words in statement s which belong to one of the four
important categories (adjectives, nouns, verbs, and adverbs), and $\sigma_{method}$ is one
of the five word-based methods.
(Linguistic Effect Features β). The second technique tries to capture the area
of such an effect and takes the sentiment of that area as the feature value of this
aspect (resulting in Linguistically Influenced Sentiment Features γ); the
feature value is the sum of the sentiment of the influenced words. We implement
techniques from Jia et al. [5], who try to capture different effect areas
for negations. We adapt their candidate scope [5] and delimiter rules [5], using
static and dynamic delimiters for the German language, and expand them to our
non-negation features as well. The static delimiters [5] remove themselves and all
words after them from the scope; static delimiters are words such as “because”,
“when” or “hence” [5]. A conditional delimiter [5] becomes a delimiter if it has
the correct POS-tag, is inside a negation scope, and leads to opinion-bearing
words; examples are words such as “who”, “where” or “like”. In addition, we
have designed a second method which creates a scope around an effect word: all
words in the scope are closer to this effect word than to any other effect word
(measured in the number of words between them).
$$f_{\beta_1}(s) = \frac{p(s)}{p(s) + o(s)} \qquad f_{\beta_2}(s) = \frac{o(s)}{p(s) + o(s)} \quad (3)$$

$$f_{\gamma_1}(s) = \sum_{w \in P_w} \sigma(w) \qquad f_{\gamma_2}(s) = \sum_{w \in O_w} \sigma(w) \quad (4)$$

$$f_{\beta_3}(s) = \begin{cases} 1.0 & \text{if } \exists w \in s: w \text{ is a negation} \\ 0.0 & \text{otherwise} \end{cases} \qquad f_{\gamma_3}(s) = \sum_{w \in N_w} \sigma(w) \quad (5)$$
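A sketch of how the word-based features of Eqs. (3)-(5) can be computed is given below. We read p(s) and o(s) as the counts of positive and negative sentiment words, and P_w, O_w, N_w as the positive, negative and negation-influenced word sets; the one-word negation scope here stands in for the candidate-scope and delimiter rules, so both readings are assumptions rather than the authors' exact definitions.

```python
def word_based_features(words, sigma, negations):
    """Sketch of Eqs. (3)-(5). `sigma` maps a word to its dictionary
    sentiment score; `negations` is a set of negation words."""
    pos = [w for w in words if sigma.get(w, 0.0) > 0]
    neg = [w for w in words if sigma.get(w, 0.0) < 0]
    p, o = len(pos), len(neg)
    denom = (p + o) or 1
    f_b1, f_b2 = p / denom, o / denom                          # Eq. (3)
    f_g1 = sum(sigma[w] for w in pos)                          # Eq. (4)
    f_g2 = sum(sigma[w] for w in neg)
    f_b3 = 1.0 if any(w in negations for w in words) else 0.0  # Eq. (5), type beta
    # N_w: a trivial one-word scope after each negation (placeholder for
    # the candidate scope and delimiter rules described above)
    scoped = [words[i + 1] for i, w in enumerate(words[:-1]) if w in negations]
    f_g3 = sum(sigma.get(w, 0.0) for w in scoped)              # Eq. (5), type gamma
    return f_b1, f_b2, f_g1, f_g2, f_b3, f_g3
```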
The use of conjunctions can also indicate a polarity. We create a test set of 1,600
statements, collect the conjunctions and associate each with a sentiment value
$\nu_c$ according to its appearance in positive and negative statements. Table 1 (left)
shows the different conjunctions and their value for influencing the sentiment.

Table 1. Left: Conjunctions and sentiment value. Right: Hedging auxiliary verbs.

The type-β feature for conjunctions is the sum of the sentiment values $\nu_c$ of all
conjunctions $C_s$ of the statement s. The conjunction-influenced words are $C_w$. The
scope is determined by the candidate scope [5] and delimiter rules [5], but only words after
the conjunction are concerned because the conjunction itself is a delimiter. The
multiplication with νc indicates which type of conjunction influences the affected
words. If the conjunction expresses a contrast (e.g. “but” with νc = −1.0), the
sentiment of the words will be inverted.
$$f_{\beta_4}(s) = \frac{\sum_{c \in C_s} \nu_c}{|C_s|} \qquad f_{\gamma_4}(s) = \sum_{w \in C_w} \nu_c \cdot \sigma(w) \quad (6)$$
A short stretch of quoted text can be a hint of irony in written texts [2], while a
long one can indicate a reported speech object. As a result, a machine learning
approach can better differentiate between irony and reported statements if the
length and the affected words of quoted text are measured. q(s) is the part of a
statement s which appears in quotation marks, l(x) is the length (in characters)
of a text x, and $Q_w$ are the words inside a quotation.
$$f_{\beta_5}(s) = \frac{l(q(s))}{l(s)} \qquad f_{\gamma_5}(s) = \sum_{w \in Q_w} \sigma(w) \quad (7)$$
Modal verbs like “can” or “would” can weaken the strength of the polarity. The
full list of auxiliary verbs for hedging expressions is shown in table 1 (right). The
method counts how often full verbs are influenced by hedging expressions h(s)
in comparison to all full verbs v(s). Hw is the set of words affected by hedging.
Here again, the candidate scope [5] and delimiter rules [5] are used.
$$f_{\beta_6}(s) = \frac{h(s)}{v(s)} \qquad f_{\gamma_6}(s) = \sum_{w \in H_w} \sigma(w) \quad (8)$$
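Assuming scope detection (the candidate scope and delimiter rules from [5]) has already marked which full verbs and words fall inside hedging scopes, Eq. (8) reduces to the following sketch:

```python
def hedging_features(full_verbs, hedged_verbs, hedge_scope_words, sigma):
    """Eq. (8): f_beta6 = h(s)/v(s), the fraction of full verbs influenced
    by a hedging auxiliary; f_gamma6 sums the sentiment of the words H_w
    inside hedging scopes. The three input collections are assumed to be
    the output of a prior scope-detection step."""
    f_b6 = len(hedged_verbs) / max(len(full_verbs), 1)
    f_g6 = sum(sigma.get(w, 0.0) for w in hedge_scope_words)
    return f_b6, f_g6
```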
5 Evaluation
We evaluate our approach on two different datasets: The first corpus, called
Finance, represents a real MRA about a financial service provider. It contains
5,500 statements (2,750 are positive, 2,750 are negative) from 3,452 different
news articles. The second dataset is the pressrelations dataset [10]. We use
approx. 30% of the dataset to construct a sentiment dictionary. This means that
1,600 statements (800 are positive, 800 are negative) are used for Finance and
308 statements for the pressrelations dataset. The sentiment dictionaries contain
words which are weighted by the methods explained in Section 3. We use 20% of
the remaining set to train a classification model. The results are depicted in Table 2
and show that the features β and γ improve sentiment allocation. The features
increased the performance of all methods, except the information gain method on
pressrelations; in all other cases, the methods achieved their best results using all
features. SentiWS, as the dictionary-based approach, obtained the highest improvement
(over 7% on Finance and over 14% on pressrelations). The entropy-based method with
all features achieved the highest accuracy, 75.28% on Finance, an improvement of over
5% over the baseline.
Comparing all results, the influence of feature set β seems to be bigger than that of
feature set γ on Finance, while it is the other way around on the pressrelations
dataset. The reason for this is the nature of the two domains: the political texts are
more complicated, so that a deeper analysis, which exploits the values of the influenced
sentiment-bearing words, provides more benefit. Nevertheless, except for the information
gain method, the combination of all linguistic features achieved an improvement over the
baselines of at least 3%.
6 Conclusion
In conclusion, linguistic features are very useful for Opinion Mining in newspa-
per articles. The evaluation shows that the linguistic features can be integrated
into existing solutions and thereby improve the computation of sentiment. The
improvement is especially large and therefore interesting for dictionary based
approaches. Moreover, this approach achieved high accuracies of over 70% and
in one case an accuracy of over 75%.
References
1. Balahur, A., Steinberger, R., Kabadjov, M., Zavarella, V., van der Goot, E., Halkia,
M., Pouliquen, B., Belyaeva, J.: Sentiment analysis in the news. In: Proc. of the
7th Intl. Conf. on Language Resources and Evaluation, LREC 2010 (2010)
2. Carvalho, P., Sarmento, L., Silva, M.J., de Oliveira, E.: Clues for detecting irony
in user-generated contents: oh..!! it’s “so easy”;-). In: Proc. of the 1st International
CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, TSA 2009, pp.
53–56 (2009)
3. Church, K.W., Hanks, P.: Word association norms, mutual information, and lexi-
cography. In: Proc. of the 27th Annual Meeting on Association for Computational
Linguistics, ACL 1989, pp. 76–83 (1989)
4. Ding, X., Liu, B., Yu, P.S.: A holistic lexicon-based approach to opinion mining.
In: Proc. of the Intl. Conf. on Web Search and Web Data Mining, WSDM 2008,
pp. 231–240 (2008)
5. Jia, L., Yu, C., Meng, W.: The effect of negation on sentiment analysis and retrieval
effectiveness. In: Proc. of the 18th ACM Conference on Information and Knowledge
Management, CIKM 2009, pp. 1827–1830 (2009)
6. Kaji, N., Kitsuregawa, M.: Building lexicon for sentiment analysis from massive
collection of html documents. In: Proc. of the 2007 Joint Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language
Learning, EMNLP-CoNLL (2007)
7. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends
in Information Retrieval 2(1-2), 1–135 (2008)
8. Remus, R., Quasthoff, U., Heyer, G.: SentiWS – a publicly available german-
language resource for sentiment analysis. In: Proc. of the 7th Intl. Conf. on Lan-
guage Resources and Evaluation, LREC 2010 (2010)
9. Scholz, T., Conrad, S.: Integrating viewpoints into newspaper opinion mining for
a media response analysis. In: Proc. of the 11th Conf. on Natural Language Pro-
cessing, KONVENS 2012 (2012)
10. Scholz, T., Conrad, S., Hillekamps, L.: Opinion mining on a german corpus of a
media response analysis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD
2012. LNCS, vol. 7499, pp. 39–46. Springer, Heidelberg (2012)
11. Scholz, T., Conrad, S., Wolters, I.: Comparing different methods for opinion mining
in newspaper articles. In: Bouma, G., Ittoo, A., Métais, E., Wortmann, H. (eds.)
NLDB 2012. LNCS, vol. 7337, pp. 259–264. Springer, Heidelberg (2012)
12. Watson, T., Noble, P.: Evaluating public relations: a best practice guide to public
relations planning, research & evaluation. PR in practice series, ch. 6, pp. 107–138.
Kogan Page (2007)
13. Zhou, L., Li, B., Gao, W., Wei, Z., Wong, K.-F.: Unsupervised discovery of dis-
course relations for eliminating intra-sentence polarity ambiguities. In: Proc. of
the 2011 Conference on Empirical Methods in Natural Language Processing, pp.
162–171 (2011)
Text Classification of Technical Papers
Based on Text Segmentation
1 Introduction
In many research fields, a large number of papers are published every year. When
researchers look for technical papers with a search engine, only papers including the
user's keywords are retrieved, and some of them might be irrelevant to the research
topics the users want to know about. Surveying past research is therefore difficult.
Automatic identification of the research topics of technical papers would be helpful for
such a survey. It is a kind of text classification problem.
Our goal is to design an effective model which determines the categories of a
given technical paper about natural language processing. In our approach, the
model will consider the text segments in the paper. Several models with different
feature sets from different segments are trained and combined. Furthermore, new
features associated with the title of the paper are introduced.
2 Background
Text classification has a long history. Many techniques have been studied to
improve the performance. The commonly used text representation is bag-of-
words [1]. Not words but phrases, word sequences or N-grams [2] are sometimes
used. Most of them focused on words or N-grams extracted from the whole document
with a feature selection or feature weighting scheme. Some of the previous
work aimed at integrating document contents and citation structure [3] [4].
Nomoto supposes the following structure of a document: the nucleus appears
at the beginning of the text, followed by any number of supplementary
adjuncts [5]. Keywords for text classification are then extracted only from the
nucleus. Identification of the nucleus and adjuncts is a kind of text segmentation,
but our text segmentation is fitted to technical papers.
Larkey proposed a method to extract words only from the title, abstract, the
first twenty lines of the summary and the section containing the claims of novelty
for a patent categorization application [6]. His method is similar to our research,
but he classifies patent documents, not technical papers. Furthermore, we
propose a novel method called the back-off model, as described in Subsection 4.4.
There are many approaches for multi-label classification. However, they can be
categorized into two groups: problem transformation and algorithm adaptation
[7]. The former group is based on any algorithms for single-label classification.
They transform the multi-label classification task into one or more single-label
classification. On the other hand, the latter group extends traditional learning
algorithms to deal with multi-label data directly.
3 Dataset
We collect technical papers from the proceedings of the Annual Meeting of the
Association for Computational Linguistics (ACL) from 2000 to 2011. To determine
the categories (research topics) of the papers, we first refer to the category list
used for paper submission to the Language Resources and Evaluation Conference
(LREC). Categories are coarse-grained research topics such as syntactic
parsing, semantic analysis, machine translation and so on. The categories for each
paper in the collection are annotated by the authors. The total number of papers in
the collection is 1,972, while the total number of categories is 38. The average
number of categories per paper is 1.144. Our dataset is available in a
git repository 1 .
Categories with high posterior probability from different perspectives are selected,
where the perspectives are binary-approach methods with different feature sets. Figure 1
shows the architecture of the back-off model. At first, a model with a basic feature
set judges the categories for the paper. The basic feature set is a set of words in the
title with the Title Bi-Gram and/or Title SigNoun feature 2 . The results of model 1
are a list of categories with their posterior probabilities {(Ci , Pi1 )}. The system
outputs the categories Ci whose Pi1 are greater than a threshold T1 . When no class
is chosen, model 2, which uses words in the abstract as well as the basic features, is
applied. Similarly, model 3 (additionally using words in the introduction) and model 4
(additionally using words in the conclusion) are applied in turn. When no class is chosen
by model 4, all categories whose probabilities Pik are greater than 0.5 are chosen; if no
Pik is greater than 0.5, the system chooses the single class with the highest probability.
The threshold Tk for model k is set smaller than that of the previous step. We investigate
several sets of thresholds in the experiments in Section 5. The cascade is sketched below.
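The following is a compact sketch of this cascade; the model interface (each model returns a {category: posterior} dict) is an assumption made for illustration.

```python
def backoff_classify(models, paper, thresholds):
    """Back-off cascade: models[k] scores `paper` with its own feature set
    (basic; +abstract; +introduction; +conclusion) and returns a dict
    {category: posterior}. Thresholds satisfy T1 > T2 > T3 > T4."""
    best_probs = {}
    for model, t in zip(models, thresholds):
        probs = model(paper)                      # {C_i: P_ik}
        for c, p in probs.items():                # remember the best posterior seen
            best_probs[c] = max(p, best_probs.get(c, 0.0))
        chosen = [c for c, p in probs.items() if p > t]
        if chosen:                                # this model fired: stop here
            return chosen
    # no model fired: all categories with P > 0.5, else the single best one
    over_half = [c for c, p in best_probs.items() if p > 0.5]
    if over_half:
        return over_half
    return [max(best_probs, key=best_probs.get)] if best_probs else []
```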
[Fig. 1. Architecture of the back-off model: models 1-4 are applied in turn, each outputting {Ci : Pik > Tk}; if none fires, the category with max(Pi1, Pi2, Pi3, Pi4) is chosen]
5 Evaluation
The proposed methods are evaluated by 10-fold cross validation on the collection
of papers described in Section 3. We used exact match ratio (EMR), accuracy,
precision, recall, and micro-/macro-averaged measures.
2
Three basic feature sets were investigated: Title + Title Bi-Gram (BF1 ), Title +
Title SigNoun (BF2 ) and Title + Title Bi-Gram + Title SigNoun (BF3 ). In our
experiments, BF1 achieved the best results.
[Figure: Performance (%) of ML-kNN, the binary approach, and the back-off model on EMR, accuracy, precision, recall, and micro-/macro-averaged precision, recall and F-measure]
6 Conclusion
To identify research topics of papers, we proposed a feature selection method
based on the structure of the paper and new features derived from the title. We
also proposed back-off model, which combines classifiers with different feature
sets from different segments of the papers. Experimental results indicate that our
methods are effective for text categorization of technical papers. In the future,
we will explore more effective methods of feature selection and feature weighting
to improve the accuracy of text classification.
References
1. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput.
Surv. 34(1), 1–47 (2002)
2. Rahmoun, A., Elberrichi, Z.: Experimenting n-grams in text categorization. Int.
Arab J. Inf. Technol., 377–385 (2007)
3. Cao, M.D., Gao, X.: Combining contents and citations for scientific document clas-
sification. In: Australian Conference on Artificial Intelligence, pp. 143–152 (2005)
4. Zhang, M., Gao, X., Cao, M.D., Ma, Y.: Modelling citation networks for improving
scientific paper classification performance. In: Yang, Q., Webb, G. (eds.) PRICAI
2006. LNCS (LNAI), vol. 4099, pp. 413–422. Springer, Heidelberg (2006)
5. Nomoto, T., Matsumoto, Y.: Exploiting text structure for topic identification. In:
Proceedings of the 4th Workshop on Very Large Corpora, pp. 101–112 (1996)
6. Larkey, L.S.: A patent search and classification system. In: Proceedings of the
Fourth ACM Conference on Digital Libraries, DL 1999, pp. 179–187. ACM, New
York (1999)
7. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O.,
Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 667–685.
Springer US (2010)
8. Zhang, M.L., Zhou, Z.H.: Ml-knn: A lazy learning approach to multi-label learning.
Pattern Recognition 40(7), 2038–2048 (2007)
9. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: A
java library for multi-label learning. Journal of Machine Learning Research 12,
2411–2414 (2011)
10. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
11. Morgan, W.: Statistical hypothesis tests for NLP,
http://cs.stanford.edu/people/wmorgan/sigtest.pdf
Product Features Categorization Using Constrained
Spectral Clustering
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
[email protected],{zniu,sylbit}@bit.edu.cn
1 Introduction
2 Related Work
In this paper, we mainly define and extract two types of constraints between feature
expressions, morphological and contextual constraints, by leveraging prior observations
in domain reviews. According to the constraining direction, these constraints are
generally divided into two classes: direct and reverse constraints, which indicate the
confidence that pair-wise feature expressions should or should not belong to the same
category, respectively.
$$Z_{ij} = \begin{cases} 1, & (x_i, x_j) \in M; \\ -1, & (x_i, x_j) \in R; \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$
Since Z only has limited influence on the local pair-wise feature expressions where
$Z_{ij} \neq 0$, inspired by [10], constraint propagation is employed to spread the local
constraints throughout the whole collection of feature expressions.

Let $\bar{E} = \{E_{ij} : |E_{ij}| \leq 1\}$ denote a set of constraints with associated confidence
scores, where $E_{ij} > 0$ is equivalent to $(x_i, x_j) \in M$ and $E_{ij} < 0$ is equivalent to
$(x_i, x_j) \in R$, with $|E_{ij}|$ being the confidence score. Given the similarity matrix
between feature expressions, we calculate the symmetric weight matrix W. The constraint
propagation algorithm is defined as follows:

(1). Construct the matrix $L = D^{-1/2} W D^{-1/2}$, where D is a diagonal matrix whose
(i, i)-element equals the sum of the i-th row of W.

(2). Iterate $F(t+1) = \alpha L F(t) + (1 - \alpha) Z$ for vertical constraint propagation
until convergence, where $F(0) = Z$ and $\alpha$ is a parameter in the range (0, 1).
We empirically set $\alpha$ to 0.5 in this work.
$$W^*_{ij} = \begin{cases} 1 - (1 - E^*_{ij})(1 - W_{ij}), & E^*_{ij} \geq 0; \\ (1 + E^*_{ij})\, W_{ij}, & E^*_{ij} < 0. \end{cases} \quad (2)$$
Here we obtain a new weight matrix $W^*$ that incorporates the exhaustive set of
propagated constraints $E^*$ obtained by the constraint propagation; we then perform
the spectral clustering algorithm with $W^*$.
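The sketch below strings these pieces together: the propagation step (2) above, the weight adjustment of Eq. (2), and off-the-shelf spectral clustering. The fixed iteration count, the clipping of the propagated scores, the extra horizontal sweep, and the use of scikit-learn's SpectralClustering are our simplifications, not details from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def propagate_constraints(W, Z, alpha=0.5, iters=100):
    """Constraint propagation sketch: F <- alpha * L F + (1 - alpha) * Z
    with L = D^{-1/2} W D^{-1/2}. A fixed iteration count replaces the
    convergence test; the second (horizontal) sweep follows the spirit of
    [10] and is anchored on the vertical result."""
    d = W.sum(axis=1)
    L = W / np.sqrt(np.outer(d, d))       # assumes every row of W has positive degree
    F = Z.astype(float).copy()
    for _ in range(iters):                # vertical propagation, step (2)
        F = alpha * L @ F + (1 - alpha) * Z
    G = F.copy()
    for _ in range(iters):                # horizontal sweep (our addition)
        G = alpha * G @ L + (1 - alpha) * F
    return np.clip(G, -1.0, 1.0)

def constrained_spectral_clustering(W, Z, k):
    """Adjust the weights with the propagated constraints E* (Eq. 2),
    then run spectral clustering on the new affinity matrix W*."""
    E = propagate_constraints(W, Z)
    W_star = np.where(E >= 0, 1 - (1 - E) * (1 - W), (1 + E) * W)
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(W_star)
```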
4 Experimental Setup
Three product domains of customer reviews (digital camera, cell phone and vacuum
cleaner) are employed to evaluate our proposed product features categorization strategy.
Since this paper focuses only on the product features categorization problem, we
assume that feature expressions are already extracted and manually tagged into
meaningful categories as the gold standards. The statistics are described in Table 1.
Entropy and purity are adopted to evaluate the categorization results. Purity measures
the extent to which a category contains only data from one gold partition.
5 Result Analysis
Table 3 compares the results with the baselines on the three domains. It shows
that the proposed CSC method always achieves the best entropy and purity performance.
Compared with basic K-means clustering, CSC achieves clearly better performance.
Compared with spectral clustering without any constraints, CSC also achieves a clear
improvement, which verifies the contribution of the morphological and contextual
constraints to product features categorization.
of this problem are modeled as constraints and exploited as prior knowledge to
improve the categorization performance. The local constraints are spread by
constraint propagation and incorporated globally into spectral clustering of product
features. Empirical evaluation on real-life datasets has demonstrated the effectiveness
of our proposed strategy compared with the state-of-the-art baselines.
Since product features do not always exhibit a flat structure that can be partitioned
clearly, our future work will be devoted to clustering them into a fine-grained
hierarchical structure.
References
1. Liu, B., Hu, M., Cheng, J.: Opinion observer: analyzing and comparing opinions on the
web. In: Proceedings of the 14th International Conference on World Wide Web, WWW
2005, pp. 342–351 (2005)
2. Guo, H., Zhu, H., Guo, Z., et al.: Product feature categorization with multilevel latent se-
mantic association. In: Proceeding of the 18th ACM Conference on Information and
Knowledge Management, CIKM 2009, pp. 1087–1096 (2009)
3. Su, Q., Xu, X., Guo, H., et al.: Hidden sentiment association in chinese web opinion min-
ing. In: Proceedings of the 17th International Conference on World Wide Web, WWW
2008, pp. 959–968 (2008)
4. Carenini, G., Ng, R.T., Zwart, E.: Extracting knowledge from evaluative text. In: Proceed-
ings of the 3rd International Conference on Knowledge Capture, K-CAP 2005, pp. 11–18
(2005)
5. Huang, S., Liu, X., Peng, X., et al.: Fine-grained product features extraction and categori-
zation in reviews opinion mining. In: Proceedings of 2012 IEEE 12th International Confe-
rence on Data Mining Workshops, ICDM 2012, pp. 680–686 (2012)
6. Titov, I., McDonald, R.: Modeling online reviews with multi-grain topic models. In: Pro-
ceedings of the 17th International Conference on World Wide Web, WWW 2008, pp.
111–120 (2008)
7. Zhao, W.X., Jiang, J., Yan, H., Li, X.: Jointly modeling aspects and opinions with a Max-
Ent-LDA hybrid. In: Proceedings of the 2010 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2010, pp. 56–65 (2010)
8. Zhai, Z., Liu, B., Xu, H., Jia, P.: Constrained LDA for grouping product features in opi-
nion mining. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD 2011, Part I. LNCS,
vol. 6634, pp. 448–459. Springer, Heidelberg (2011)
9. Zhai, Z., Liu, B., Xu, H., Jia, P.: Clustering product features for opinion mining. In: Pro-
ceedings of the fourth ACM International Conference on Web Search and Data Mining,
WSDM 2011, pp. 347-354 (2011)
10. Lu, Z., Ip, H.H.S.: Constrained spectral clustering via exhaustive and efficient constraint
propagation. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part VI.
LNCS, vol. 6316, pp. 1–14. Springer, Heidelberg (2010)
A New Approach for Improving Cross-Document
Knowledge Discovery Using Wikipedia
Abstract. In this paper, we present a new model that incorporates the extensive
knowledge derived from Wikipedia for cross-document knowledge discovery.
The model proposed here is based on our previously introduced Concept Chain
Queries (CCQ), a special case of text mining focusing on detecting semantic
relationships between two concepts across multiple documents. We attempt to
overcome the limitations of CCQ by building a semantic kernel for computing
concept closeness to complement the existing knowledge in the text corpus.
The experimental evaluation demonstrates that the kernel-based approach
outperforms the original in ranking the important chains retrieved in the search results.
1 Introduction
Traditionally, text documents are represented as a Bag of Words (BOW) and the
semantic relatedness between concepts is measured based on statistical information
from the corpus, such as the widely used tf-idf weighting scheme [3], [7]. The main
theme of our previously introduced Concept Chain Queries (CCQ) [3] was specifically
to discover semantic relationships between two concepts across documents,
where the relationships found reveal semantic paths linking the two concepts across
multiple text units. However, only the BOW model was used in CCQ for text
representation, and thus the techniques proposed in [3] have inborn limitations.
For example, Ziyad Khaleel, also known as Khalil Ziyad, was a Palestinian-American
al-Qaeda member based in the United States, identified as a "procurement agent" for
Bin Ladin's terrorist organization. Clearly he has a close relationship with Bin Ladin.
Nevertheless, he will not be taken into consideration if his name does not appear in
the document collection where the concept chain queries are performed. To alleviate
such limitations, this effort proposes a new model with a semantic kernel built
in to embed the extensive knowledge from Wikipedia into the original knowledge
base, aiming to take advantage of outside knowledge to improve cross-document
knowledge discovery. Here we employ the Explicit Semantic Analysis (ESA) technique
introduced by Gabrilovich et al. [1] to help build an ESA-based kernel that
captures the semantic closeness of concepts in a much larger knowledge space.
2 Related Work
A great number of text mining algorithms for capturing relationships between
concepts have been developed [3], [5], [6], [7]. However, built on the traditional
Bag-of-Words (BOW) representation with little or no background knowledge taken
into account, those efforts achieved a limited discovery scope. Hotho et al. [2]
exploited WordNet to improve the BOW text representation, and Martin [4] developed a
method for transforming the noun-related portions of WordNet into a lexical ontology
to enhance knowledge representation. These techniques suffer from the relatively limited
coverage and painful maintenance of WordNet compared to Wikipedia, the world's
largest knowledge base to date. [8] embeds background knowledge derived from
Wikipedia into a semantic kernel to enrich document representation for text
classification; the empirical evaluation demonstrates that their approach successfully
achieves improved classification accuracy. However, their method is based on a thesaurus
built from Wikipedia, and constructing the thesaurus requires a considerable amount of
effort. Our proposed solution is motivated by [1], [8]; to tackle the above problems,
we 1) adapt the ESA technique to better suit our task and further develop a
sequence of heuristic strategies to filter out irrelevant terms and retain only the top-k
concepts most relevant to the given topics; 2) build an ESA-based kernel which requires
much less computational effort to measure the closeness between concepts using Wiki
knowledge.
3 Kernel Method
We adapt Explicit Semantic Analysis (ESA) to remove noise concepts derived from
Wikipedia [9], and then build a semantic kernel for computing semantic relatedness.
The basic idea of kernel methods is to embed the data in a suitable feature space (with
more information integrated), such that solving the problem in the new space is easier
(e.g., linear). To be exact, the new space here stands for the space that incorporates
Wikipedia knowledge, and the kernel represents the semantic relationship between two
concepts/topics uncovered in this new space.
The purpose of building the ESA-based kernel is to address the omission of word
semantics in the BOW model, where feature weights are calculated considering only
the number of occurrences. To build the semantic kernel for a given topic, we first need to
transform the concept vector constructed using the BOW model into a different vector
(i.e., a space transformation) with new knowledge embedded. Suppose the topic T is
represented by a weighted vector of concepts $\phi(T) = \langle c_1, c_2, \ldots, c_n \rangle$ using the BOW
model; the value of each element in the vector corresponds to a tf-idf value. We then
define a kernel matrix M for the topic T as shown in Table 1.
M is a symmetric matrix and the elements on the diagonal are all equal to
1, since according to ESA the same concept has the same interpretation vector, which
means the ESA-based similarity between two identical concepts is 1. Formally, M is
defined as:

$$M_{i,j} = \begin{cases} 1 & \text{if } i = j \\ Sim_{ESA}(c_i, c_j) / Sim\_Max & \text{if } i \neq j \end{cases} \quad (1)$$

where $Sim_{ESA}(c_i, c_j)$ is the ESA similarity between $c_i$ and $c_j$, and $Sim\_Max$ is the
maximum value in M excluding the elements on the diagonal. A transformation
of $\phi(T)$ can then be achieved through $\tilde{\phi}(T) = \phi(T) M$, where $\tilde{\phi}(T)$ represents the
topic T in a linear space with much more information integrated. With $\tilde{\phi}(T)$, the
ESA-based kernel between two topics $T_1$ and $T_2$ can be represented as their inner product:

$$k(T_1, T_2) = \tilde{\phi}(T_1)\, \tilde{\phi}(T_2)^{\top} = \phi(T_1)\, M M^{\top} \phi(T_2)^{\top} \quad (2)$$

Therefore, the semantic relationship between two topics is now represented using the
ESA-based kernel $k(T_1, T_2)$, which incorporates Wiki knowledge.
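A sketch of the kernel construction follows. The ESA similarity function is assumed to be given (e.g., cosine over ESA interpretation vectors), and the inner-product form of k(T1, T2) matches the transformation defined above; the code is an illustration, not the authors' implementation.

```python
import numpy as np

def esa_kernel_matrix(concepts, sim_esa):
    """Kernel matrix M of Eq. (1): ones on the diagonal; off-diagonal
    entries are ESA similarities normalized by the largest off-diagonal
    value (Sim_Max)."""
    n = len(concepts)
    M = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = sim_esa(concepts[i], concepts[j])
    off = ~np.eye(n, dtype=bool)
    if n > 1 and M[off].max() > 0:
        M[off] = M[off] / M[off].max()
    return M

def esa_kernel(phi1, phi2, M):
    """k(T1, T2) = (phi(T1) M)(phi(T2) M)^T, the inner product of the
    transformed topic vectors phi~(T) = phi(T) M (Eq. 2)."""
    return float((phi1 @ M) @ (phi2 @ M))
```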
Once the relevant concepts for a topic of interest have been identified using CCQ [3],
we are ready to use the ESA-based kernel to compute the semantic relatedness between
concepts. For example, given the concept "Clinton" as a topic of interest, the BOW-based
concept vector for "Clinton" and the corresponding kernel matrix are shown in Table 2
and Table 3. Table 4 illustrates the improvement obtained by multiplying the BOW-based
concept vector by the kernel matrix. This is consistent with our understanding that
Hillary, as Clinton's wife, should be considered most related to him. Shelton, who
served as Chairman of the Joint Chiefs of Staff during Clinton's terms in office, stays
in second position. Finally, Clancy, who hardly has a relationship with Clinton, is
demoted to the end of the vector.
Table 2. The BOW-based concept vector for "Clinton"
Table 3. The kernel matrix for "Clinton"
4 Empirical Evaluation
An open-source document collection pertaining to the 9/11 attack, including the
publicly available 9/11 Commission Report, was used in our evaluation. The report
consists of the Executive Summary, Preface, 13 chapters, Appendix and Notes; each of
them was considered a separate document, resulting in 337 documents. Query pairs
selected by the assessors, covering various scenarios (e.g., ranging from popular
entities to rare entities), were used as our evaluation data.
Average Rank
Model                   Length 1   Length 2   Length 3   Length 4
BOW-based Approach          1        2.74       3.13       5.20
Kernel-based Approach       1        1.82       2.71       4.00
Average Rank
Model                   Length 1   Length 2   Length 3   Length 4
BOW-based Approach          1       11.15       4.73       9.23
Kernel-based Approach       1        3.05       2.91       7.35
References
1. Gabrilovich, E., Markovitch, S.: Computing Semantic Relatedness using Wikipedia-based
Explicit Semantic Analysis. In: 20th International Joint Conference on Artificial Intelli-
gence, pp. 1606–1611. Morgan Kaufmann, San Francisco (2007)
2. Hotho, A., Staab, S., Stumme, G.: Wordnet improves Text Document Clustering. In: SIGIR
2003 Semantic Web Workshop, pp. 541–544. Citeseer (2003)
3. Jin, W., Srihari, R.: Knowledge Discovery across Documents through Concept Chain Que-
ries. In: 6th IEEE International Conference on Data Mining Workshops, pp. 448–452. IEEE
Computer Society, Washington (2006)
4. Martin, P.A.: Correction and Extension of WordNet 1.7. In: de Moor, A., Ganter, B., Lex,
W. (eds.) ICCS 2003. LNCS (LNAI), vol. 2746, pp. 160–173. Springer, Heidelberg (2003)
5. Srinivasan, P.: Text Mining: Generating hypotheses from Medline. Journal of the American
Society for Information Science and Technology 55(5), 396–413 (2004)
6. Srihari, R.K., Lamkhede, S., Bhasin, A.: Unapparent Information Revelation: A Concept
Chain Graph Approach. In: 14th ACM International Conference on Information and Know-
ledge Management, pp. 329–330. ACM, New York (2005)
7. Swanson, D.R., Smalheiser, N.R.: Implicit Text Linkage between Medline Records: Using
Arrowsmith as an Aid to Scientific Discovery. Library Trends 48(1), 48–59 (1999)
8. Wang, P., Domeniconi, C.: Building Semantic Kernels for Text Classification using Wiki-
pedia. In: 14th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 713–721. ACM, New York (2008)
9. Yan, P., Jin, W.: Improving Cross-Document Knowledge Discovery Using Explicit Seman-
tic Analysis. In: Cuzzocrea, A., Dayal, U. (eds.) DaWaK 2012. LNCS, vol. 7448, pp.
378–389. Springer, Heidelberg (2012)
Using Grammar-Profiles to Intrinsically Expose
Plagiarism in Text Documents
1 Introduction
The huge amount of publicly available text documents makes it increasingly easy
for authors to copy suitable text fragments into their works. On the other
hand, the task of identifying plagiarized passages becomes increasingly difficult
for software algorithms that have to deal with large numbers of possible
sources. An even harder challenge is to find plagiarism in text documents where
the majority of sources consists of books and other literature that is not digitally
available. Nevertheless, recent events increasingly show that especially
in such cases it would be important to have reliable tools that indicate possible
misuse.
The two main approaches for detecting plagiarism in text documents are external
and intrinsic methods, respectively. Given a suspicious document, external
algorithms compare text fragments with any available sources (e.g., collections
from the world wide web), whereas intrinsic algorithms try to detect plagiarism
by inspecting the suspicious document only. Frequently applied techniques
in both areas, as well as in related topics such as authorship identification or
text categorization, include n-gram comparisons and standard IR techniques like
common subsequences [2] combined with machine learning techniques [3].
The idea of the approach described in this paper is to use a syntactical feature,
namely the grammar used by an author, to identify passages that might have
been plagiarized. Since an author has many different choices of how to formulate a
sentence using the existing grammar rules of a natural language, the assumption is
that the way of constructing sentences differs significantly between individual
authors. For example, the famous Shakespeare quote "To be, or not to be: that is
the question." (1) could also be formulated as "The question is whether to be or
not to be." (2) or even "The question is whether to be or not." (3), which is
semantically equivalent but differs significantly in syntax. The main idea of this
approach is to quantify those differences by creating a grammar profile of a
document and to utilize sliding-window techniques to find suspicious text sections.
The rest of this paper is organized as follows: Section 2 describes the algorithm
in detail, while a preliminary evaluation of it is shown in Section 3. Finally,
Section 4 sketches related work and Section 5 summarizes the main ideas and
discusses future work.
Fig. 1. Grammar Trees Resulting From Parsing Sentence (1), (2) and (3)
shifted left and right additionally: if fewer than q nodes exist horizontally, the
corresponding pq-gram is filled with * for the missing nodes. Therefore
the pq-grams [S-VP-*-*-VP], [S-VP-*-VP-CC], [S-VP-RB-VP-*] and
[S-VP-VP-*-*] are also valid. Finally, the pq-gram index contains all valid
pq-grams of a grammar tree, whereby multiple occurrences of the same pq-gram
are also present multiple times in the index.
4. Subsequently, the pq-gram profile of the whole document is calculated by
combining the pq-gram indexes of all sentences. In this step the number of
occurrences is counted for each pq-gram and then normalized by the document
length, i.e., by the total number of distinct pq-grams.
As an example, the three most frequent pq-grams of a selected document
are: {[NP-NN-*-*-*], 2.7%}, {[PP-IN-*-*-*], 2.3%}, {[S-VP-*-*-VBD],
1.1%}. The pq-gram profile then consists of the complete table of pq-grams
and their occurrences in the given document, indicating the preferences or the
style of syntax construction used by the (main) author.
5. The basic idea is now to utilize sliding windows and to calculate, for each
window, the distance to the pq-gram profile of the whole document. A window
has a predefined length l, which defines how many sentences it contains, and
the window step s defines the starting points of the windows.
For each window w, the pq-gram profile P(w) is calculated and compared
to the pq-gram profile of the whole document. For calculating the distance,
the measure proposed in [12] is used, as it is well suited for comparing
short text fragments (the window w) with large text fragments (the
document D); a sketch of the profile and distance computation follows this list:
$$d(w, D) = \sum_{p \in P(w)} \left( \frac{2\,(f_w(p) - f_D(p))}{f_w(p) + f_D(p)} \right)^2$$
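To make steps 4 and 5 concrete, here is a minimal Python sketch, assuming each sentence's pq-gram index has already been extracted (e.g., by parsing with [7] and enumerating pq-grams as above); the function names are illustrative, and the flagging threshold follows the μ + 2σ suspicion level shown in the figure below.

from collections import Counter
from statistics import mean, pstdev

def pq_gram_profile(sentence_indexes):
    """Step 4: merge per-sentence pq-gram indexes and normalize the
    counts by the number of distinct pq-grams, as described above."""
    counts = Counter()
    for index in sentence_indexes:
        counts.update(index)
    total = len(counts)  # the paper normalizes by distinct pq-grams
    return {g: c / total for g, c in counts.items()}

def distance(window_profile, doc_profile):
    """The dissimilarity of [12], summed over the window's pq-grams."""
    return sum((2.0 * (fw - doc_profile.get(g, 0.0)) /
                (fw + doc_profile.get(g, 0.0))) ** 2
               for g, fw in window_profile.items())

def suspicious_windows(sentence_indexes, l, s, n_sigma=2.0):
    """Step 5: slide windows of l sentences with step s and flag those
    whose distance exceeds mu + n_sigma * sigma."""
    doc = pq_gram_profile(sentence_indexes)
    starts = list(range(0, max(1, len(sentence_indexes) - l + 1), s))
    dists = [distance(pq_gram_profile(sentence_indexes[b:b + l]), doc)
             for b in starts]
    mu, sigma = mean(dists), pstdev(dists)
    return [b for b, d in zip(starts, dists) if d > mu + n_sigma * sigma]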
[Figure: distances of the sliding windows, with suspicion thresholds δ_susp = μ + σ and μ + 2σ, comparing the plagiarized sections (solution) with the sections predicted by the algorithm]
3 Evaluation
test corpus (over 4000 documents). The F-scores combine high precision values
with low recall values. For example, the best performance resulted from a
precision of about 75% and a recall of about 31%, indicating that when the
algorithm predicts a text passage to be plagiarized it is often correct, but on
the other hand it fails to find all plagiarized passages.
4 Related Work
An often-applied concept in the field of intrinsic plagiarism detection is the
usage of n-grams [12, 6], where the document is split into chunks of three or
four letters, grouped and, as with the algorithm proposed in this paper, ana-
lyzed through sliding windows. Another approach also uses the sliding window
technique but is based on word frequencies, i.e. the assumption that the set of
words used by authors is significantly different [10]. Approaches in the field of
author detection and genre categorization also use NLP tools to analyze docu-
ments based on syntactic annotations [13]. Word- and text-based statistics like
the average sentence length or the average parse tree depth are used in [5].
Another interesting approach, used in authorship attribution, which tries to
detect the writing style of authors by analyzing the occurrences and variations
of spelling errors, is proposed in [8]. It is based on the assumption that authors
tend to make similar spelling and/or grammar errors and therefore uses this
information to attribute authors to unseen text documents.
Lexicalized tree-adjoining grammars (LTAG) are proposed in [4] as a ruleset
to construct and analyze grammar syntax by using partial subtrees, which may
also be used with this approach as an alternative to pq-gram patterns.
inspections showed that the algorithm produces high precision values, i.e., it
mostly predicts plagiarism only where this is really the case. On the other hand,
it could be improved to find more plagiarism cases, i.e., to increase the recall value.
Future work should also evaluate the approach against a larger and more
representative test set. Additionally, all parameters, such as window length,
window step, pq-gram configuration, and other thresholds, should be optimized.
As currently no lexical information is used, a combination with existing approaches
could enhance the overall performance, as could the adaptation to more (syntac-
tically complex) languages. Finally, the PQ-PlagInn algorithm also seems very
suitable for tasks in the field of authorship attribution/verification and text
categorization, and it should be adjusted and evaluated accordingly.
References
1. Augsten, N., Böhlen, M., Gamper, J.: The pq-Gram Distance between Ordered
Labeled Trees. ACM Transactions on Database Systems, TODS (2010)
2. Gottron, T.: External Plagiarism Detection Based on Standard IR Technol-
ogy and Fast Recognition of Common Subsequences. In: CLEF (Notebook Pa-
pers/LABs/Workshops) (2010)
3. Joachims, T.: Text Categorization with Support Vector Machines: Learning with
Many Relevant Features. In: Proceedings of the 10th European Conference on
Machine Learning, London, UK, pp. 137–142 (1998)
4. Joshi, A.K., Schabes, Y.: Tree-Adjoining Grammars. Handbook of Formal Lan-
guages 3, 69–124 (1997)
5. Karlgren, J.: Stylistic Experiments For Information Retrieval. PhD thesis, Swedish
Institute for Computer Science (2000)
6. Kestemont, M., et al.: Intrinsic Plagiarism Detection Using Character Trigram
Distance Scores. In: CLEF Labs and Worksh. Papers, Amsterdam, Netherlands
(2011)
7. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proc. of the 41st
Meeting on Comp. Linguistics, Stroudsburg, PA, USA, pp. 423–430 (2003)
8. Koppel, M., Schler, J.: Exploiting Stylistic Idiosyncrasies for Authorship Attribu-
tion. In: IJCAI 2003 Workshop on Computational Approaches to Style Analysis
and Synthesis, pp. 69–72 (2003)
9. Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated cor-
pus of English: The Penn Treebank. Comp. Linguistics 19, 313–330 (1993)
10. Oberreuter, G., et al.: Approaches for Intrinsic and External Plagiarism Detection.
In: Notebook Papers of CLEF Labs and Workshops (2011)
11. Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An Evaluation Framework
for Plagiarism Detection. In: Proceedings of the 23rd International Conference on
Computational Linguistics (COLING 2010), Beijing, China (2010)
12. Stamatatos, E.: Intrinsic Plagiarism Detection Using Character n-gram Profiles.
In: CLEF (Notebook Papers/Labs/Workshop) (2009)
13. Stamatatos, E., Kokkinakis, G., Fakotakis, N.: Automatic text categorization in
terms of genre and author. Comput. Linguist. 26, 471–495 (2000)
14. Tschuggnall, M., Specht, G.: Detecting Plagiarism in Text Documents through
Grammar-Analysis of Authors. In: 15. GI-Fachtagung Datenbanksysteme für Busi-
ness, Technologie und Web, Magdeburg, Germany (2013)
Feature Selection Methods in Persian Sentiment Analysis
1 Introduction
In the past decade, with the enormous growth of digital content on the Internet
and in databases, sentiment analysis has received more and more attention from
information retrieval and natural language processing researchers. Up to now, much
research has been conducted on sentiment analysis for English, Chinese, or Russian
[1-9]. However, to our knowledge, there is little work on sentiment analysis for
Persian text [10]. Persian is an Indo-European language, spoken and written
primarily in Iran, Afghanistan, and a part of Tajikistan. The amount of information
in Persian on the Internet has increased in different forms. As the style of writing
in Persian is not firmly defined on the web, there are many web pages in Persian
with completely different writing styles for the same words [11, 12]. Therefore, in
this paper we study a model of feature selection in sentiment classification for
Persian, and test our model on a Persian
product review dataset. In the remainder of this paper, Section 2 describes the
proposed model for sentiment classification of Persian reviews. In Section 3 we
discuss the main experimental results, and finally we conclude with a summary
in the last section.
Persian sentiment analysis suffers from low quality; the main challenges are:
- Lack of comprehensive solutions or tools
- Use of a wide variety of declensional suffixes
- Word spacing
  o In Persian, in addition to white space as an inter-word separator, an intra-
    word space called pseudo-space separates a word's parts.
- Use of many informal or colloquial words
In this paper, we propose a model using n-gram features, stemming, and feature
selection to overcome these Persian-language challenges in sentiment classification.
Feature selection methods sort features on the basis of a numerical measure
computed from the documents in the dataset, and select a subset of the features
by thresholding that measure. In this paper four different information measures
were implemented and tested for the feature selection problem in sentiment
analysis: Document Frequency (DF), Term Frequency Variance (TFV), Mutual
Information (MI) [14], and Modified Mutual Information (MMI). Below we
present the proposed MMI approach.
Table 1. Co-occurrence statistics for a feature f and a class c

              class c    all other classes
 f present       A              B
 f absent        C              D
Table 1 records co-occurrence statistics for features and classes; the total number
of review documents is N = A + B + C + D. These statistics are very useful
for estimating probability values [13, 14]. Using Table 1, MI can be computed
by equations (1) and (2):

$$\mathrm{MI}(f,c) = \log \frac{P(f,c)}{P(f)\,P(c)} \qquad (1)$$

$$\mathrm{MI}(f,c) \approx \log \frac{A \cdot N}{(A+B)(A+C)} \qquad (2)$$
Intuitively, MI measures whether the co-occurrence of f and c is more likely than
their independent occurrence, but it does not measure the co-occurrence of f with
other classes or the co-occurrence of other features with class c. We introduce a
Modified version of Mutual Information, MMI, which considers all possible
combinations of co-occurrences of a feature and a class label. First we define four
parameters as follows:
- $P(f, c)$: probability of co-occurrence of feature f and class c together.
- $P(\bar{f}, \bar{c})$: probability of co-occurrence of all features except f in all classes except c together.
- $P(\bar{f}, c)$: probability of co-occurrence of all features except f in class c.
- $P(f, \bar{c})$: probability of co-occurrence of feature f in all classes except c.
We calculate the MMI score as equations (3) and (4):

$$\mathrm{MMI}(f,c) = \log \frac{P(f,c)\,P(\bar{f},\bar{c})}{P(\bar{f},c)\,P(f,\bar{c})} \qquad (3)$$

$$\mathrm{MMI}(f,c) \approx \log \frac{A \cdot D}{C \cdot B} \qquad (4)$$
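Since equations (3) and (4) are only partially legible in the source, the following sketch implements MI and the log-odds reading of MMI over the contingency counts of Table 1; treat the MMI formula as an assumption rather than the authors' exact definition.

import math

def mi(A, B, C, D):
    """Equations (1)-(2): MI(f, c) ~ log(A*N / ((A+B) * (A+C)))."""
    N = A + B + C + D
    return math.log((A * N) / ((A + B) * (A + C)))

def mmi(A, B, C, D, eps=1e-9):
    """Equations (3)-(4) in the log-odds reading above:
    MMI(f, c) ~ log((P(f,c) * P(~f,~c)) / (P(~f,c) * P(f,~c)))
              = log((A * D) / (C * B)), smoothed against zero counts."""
    return math.log(((A + eps) * (D + eps)) / ((C + eps) * (B + eps)))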
3 Experimental Results
To test our methods we compiled a dataset of 829 online customer reviews in
Persian for different brands of cell phone products. Two annotators labeled each
customer review with a positive or negative polarity at the review level. After
annotation, the dataset comprised 511 positive and 318 negative reviews.
Table 2. F-scores for phases 1 and 2, without and with n-gram features and stemming
In this work we applied four different feature selection approaches, MI, DF, TFV,
and MMI, with the Naive Bayes learning algorithm to the online Persian cellphone
reviews. In the experiments, we found that using feature selection with learning
algorithms can improve the classification of the sentiment polarity of reviews.
Table 3 reports precision, recall, and F-score on the positive and negative classes
for each feature selection approach.
Table 3. Precision, Recall and F-score measures for the feature selection approaches with the Naive Bayes classifier
The results in Table 3 indicate that TFV, DF, and MMI perform better than the
traditional MI approach. In terms of F-score, MMI improves on MI by 21.46% and
32.46% on the negative and positive classes respectively; DF outperforms MI by
19.36% and 32.5% for negative and positive review documents respectively; and
TFV improves on MI by 19.7% and 32.76% for negative and positive documents
respectively. The reason for MI's poor performance is that it only uses the
information between the corresponding feature and the corresponding class and
does not exploit information about other features and other classes. Comparing
DF, TFV, and MMI, we find that MMI beats both DF and TFV on the F-score
for negative review documents, with improvements of 2.1% and 1.76% respectively,
while for positive review documents DF and TFV perform 0.04% and 0.3% better
than MMI, respectively.
To assess the overall performance of the techniques we adopt macro and micro
averages; Figure 1 shows the macro- and micro-averaged F-scores.
Fig. 1. Macro and micro average F-score for MI, DF, TFV and MMI
The figure shows that the proposed MMI approach performs slightly better than
the DF and TFV approaches and significantly better than the MI method. The
basic advantage of MMI is that it uses the whole information about a feature,
both positive and negative factors between features and classes. Overall, MMI
reaches an F-score of 85%. It is worth noting that with a larger training corpus
the feature selection approaches and the learning algorithm could reach higher
performance values. Additionally, the proposed MMI approach is not limited to
Persian reviews and can be applied to other domains and other classification
problems.
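For reference, a small sketch of how the two aggregates are computed (the labels and counts are illustrative): macro-averaging averages the per-class F-scores, while micro-averaging pools the counts before computing a single F-score.

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(per_class):
    """per_class maps a label to (tp, fp, fn) counts.
    Macro: mean of per-class F-scores; micro: F-score of pooled counts."""
    f_scores, TP, FP, FN = [], 0, 0, 0
    for tp, fp, fn in per_class.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f_scores.append(f1(p, r))
        TP, FP, FN = TP + tp, FP + fp, FN + fn
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    return sum(f_scores) / len(f_scores), f1(micro_p, micro_r)

# e.g. macro_micro_f({'pos': (430, 60, 81), 'neg': (250, 70, 68)})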
In this paper we proposed a novel feature selection approach, MMI, for the
sentiment classification problem. In addition, we applied other feature selection
approaches, DF, MI, and TFV, with the Naive Bayes learning algorithm to online
Persian cellphone
reviews. As the results show, using feature selection in sentiment analysis can
improve performance. The proposed MMI method, which uses the positive and
negative factors between features and classes, improves performance compared
with the other approaches. In future work we will focus further on sentiment
analysis of Persian text.
References
1. Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. Mining Text Data.
pp. 415–463 (2012)
2. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine
learning techniques. In: Proceedings of the ACL 2002 Conference on Empirical Methods
in Natural Language Processing, vol. 10, pp. 79–86. ACL (2002)
3. Moraes, R., Valiati, J.F., Gavião Neto, W.P.: Document-level sentiment classification: an
empirical comparison between SVM and ANN. Expert Systems with Applications (2012)
4. Cui, H., Mittal, V., Datar, M.: Comparative experiments on sentiment classification for on-
line product reviews. In: Proceedings of National Conference on Artificial Intelligence,
Menlo Park, Cambridge, London, vol. 21(2), p. 1265 (2006)
5. Yussupova, N., Bogdanova, D., Boyko, M.: Applying sentiment analysis for texts in
Russian based on machine learning approach. In: Proceedings of Second International
Conference on Advances in Information Mining and Management, pp. 8–14 (2012)
6. Popescu, A.M., Etzioni, O.: Extracting product features and opinions from reviews. In:
Proceedings of Conference on Empirical Methods in Natural Language Processing (2005)
7. Zhu, J., Wang, H., Zhu, M., Tsou, B.K., Ma, M.: Aspect-based opinion polling from cus-
tomer reviews. IEEE Transactions on Affective Computing 2(1), 37–49 (2011)
8. Liu, B., Hu, M., Cheng, J.: Opinion observer: analyzing and comparing opinions on the
web. In: Proceedings of Conference on World Wide Web, pp. 342–351 (2005)
9. Turney, P.D., Littman, M.L.: Unsupervised learning of semantic orientation from a hun-
dred-billion-word corpus. Technical Report EGB-1094, National Research Council Cana-
da (2002)
10. Shams, M., Shakery, A., Faili, H.: A non-parametric LDA-based induction method for sen-
timent analysis. In: Proceedings of 16th IEEE CSI International Symposium on Artificial
Intelligence and Signal Processing, pp. 216–221 (2012)
11. Farhoodi, M., Yari, A.: Applying machine learning algorithms for automatic Persian text
classification. In: Proceedings of IEEE International Conference on Advanced Information
Management and Service, pp. 318–323 (2010)
12. Taghva, K., Beckley, R., Sadeh, M.: A stemming algorithm for the Farsi language. In: Pro-
ceedings of IEEE International Conference on Information Technology: Coding and Com-
puting, ITCC, vol. 1, pp. 158–162 (2005)
13. Mitchell, T.: Machine Learning, 2nd edn. McGraw-Hill (1997)
14. Duric, A., Song, F.: Feature selection for sentiment analysis based on content and syntax
models. Decision Support Systems (2012)
Towards the Refinement of the Arabic Soundex
1 Introduction
The general goal of approximate string matching is to perform the string match-
ing of a pattern P in a text T where one or both of them have suffered from
some kind of corruption [8, 9].
In this paper, we address the problem of approximate string matching that
allows phonetic errors. It can be applied to the correction of phonetic spelling
errors after input through a speech-to-text system, to the retrieval of similar
names, or to text searching.
A method to tackle this issue is to encode phonemes using a phonetic
encoding algorithm, so that words that are pronounced in the same way receive
similar codes.
The best-known phonetic encoding algorithm is Soundex [2]. Primarily used
to code names based on the way they sound, the Soundex algorithm keeps the
first letter of the name, reduces the name to its canonical form [6], and uses
three digits to represent the rest of its letters. Many Soundex improvements have
been developed for English [18, 7, 11, 12] and it has been extended to several
languages [13, 1] including Arabic [19, 3, 17].
Although spelling correction for Arabic has recently become a very challeng-
ing field of research, almost all the solutions in the specialized literature are
dedicated to a particular class of Arabic speakers [14]. The same applies to the
Arabic Soundex functions, which are, moreover, proposed for restricted sets of
data such as Arab names [19, 3].
In [10], a Soundex function that takes into account the phonetic features of
the Arabic language was proposed.
value of the resulting binary code is computed, and the generated phonetic code
x can be used as a hash key for indexing.
The indexing of a dictionary D using S generates a hash table where the
words that have the same phonetic code k are included in the same set.
For example, consider the words w = katif ('shoulder'), w′ = kataba ('to write'),
and w″ = katama ('to hide'). The canonical forms of w, w′, and w″ are the words
themselves. Their phonetic codes are then calculated as the concatenation of the
category code C and subcategory code SC of each letter, such that:

S(w) = C(k).SC(k).C(t).SC(t).C(f).SC(f) = 10.0000.00.0110.00.0001 = 131457 = S(w′) = S(w″) = k.

Set_k = {w | S(w) = k}; therefore w, w′, w″ ∈ Set_k, and w, w′, w″ are phoneti-
cally equivalent. Set_k fills the cell k of the dictionary D.
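The bit-level construction of the code can be sketched as follows; the category and subcategory tables are stand-ins keyed by transliterated letters (the real assignments are those of the classification in [10]), but the bit concatenation reproduces the example value 131457.

# Stand-in 2-bit category and 4-bit subcategory tables, keyed by
# transliterated letters; the real assignments are those of [10].
CATEGORY = {'k': 0b10, 't': 0b00, 'f': 0b00}
SUBCATEGORY = {'k': 0b0000, 't': 0b0110, 'f': 0b0001}

def phonetic_code(canonical):
    """Concatenate C(letter).SC(letter) bit groups; the decimal value
    of the resulting binary string serves as the hash key."""
    code = 0
    for letter in canonical:
        code = (code << 2) | CATEGORY[letter]
        code = (code << 4) | SUBCATEGORY[letter]
    return code

assert phonetic_code('ktf') == 131457  # 10.0000.00.0110.00.0001

def index_dictionary(words, canonical):
    """Hash-table indexing: Set_k = {w | S(canonical(w)) = k}."""
    table = {}
    for w in words:
        table.setdefault(phonetic_code(canonical(w)), set()).add(w)
    return table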
3 Refinement Elements
The "Algerian Dialect Refinement" rf_alg is a refinement element that codifies
the Arabic letters with regard to the phonetic confusions common in the Algerian
dialect, such that rf_alg(c) = (N_c, n_c) with:
- N_c = C(c),
- n_c the new phonetic subcategory of c (Table 1).
Given w a canonical form of a word: rf_alg(w) = N_c1.n_c1. ... .N_cn.n_cn = x.
The "Speech Therapy Refinement" rf_st is a refinement element that codifies the
Arabic letters with regard to the phonetic confusions common to children who
cannot correctly pronounce some Arabic phonemes, such that rf_st(c) = (N_c, n_c) with:
- N_c = C(c),
- n_c the new phonetic subcategory of c (Table 2).
Given w a canonical form of a word: rf_st(w) = N_c1.n_c1. ... .N_cn.n_cn = x.
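A refinement element can thus be viewed as swapping in a new subcategory table while keeping the categories; here is a sketch reusing the CATEGORY table and code concatenation from the earlier example (the table values are placeholders, not those of Tables 1 and 2):

# Placeholder refined subcategories (the real values are those of
# Tables 1 and 2); the 2-bit categories N_c = C(c) are unchanged.
SUBCATEGORY_ALG = {'k': 0b0000, 't': 0b0110, 'f': 0b0011}

def refined_code(canonical, subcategories):
    """rf(w) = N_c1.n_c1 ... N_cn.n_cn with refinement-specific n_c."""
    code = 0
    for letter in canonical:
        code = (code << 2) | CATEGORY[letter]       # N_c = C(c)
        code = (code << 4) | subcategories[letter]  # n_c from Table 1/2
    return code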
4 Evaluation
To assess the phonetic encoding of Arabic using the classical Arabic Soundex S,
the Algerian Dialect Refinement rf_alg, and the Speech Therapy Refinement rf_st,
we indexed a dictionary of 2017 triliteral Arabic roots (connected to their inflected
forms). Table 3 summarizes the evaluation of this indexation in terms of distinct
codes, words with the same encoding, and the maximum value of the encoding.
5 Conclusion
Our research contributes to the extension of the Arabic Soundex phonetic en-
coding algorithm by focusing on specific phonetic criteria related to different
sources of phonetic alteration.
This work would help in creating phonetic dictionaries, in resolving the Arabic
spelling correction issue by being combined with a spelling corrector such as
[4, 15, 21, 16] as a module that corrects phonetic spelling mistakes, or in detecting
sources of confusion between phonemes. It can also be extended by supporting
new "refinement elements".
References
[1] Maniez, D.: Cours sur les Soundex,
http://www-info.univ-lemans.fr/~carlier/recherche/soundex.html
[2] National Archives: The Soundex Indexing System,
http://www.archives.gov/research/census/soundex.html
[3] Aqeel, S.U., et al.: On the Development of Name Search Techniques for Arabic.
J. Am. Soc. Inf. Sci. Technol. 57(6), 728–739 (2006)
[4] Ben Hamadou, A.: Vérification et correction automatiques par analyse affixale
des textes écrits en langage naturel: le cas de l’arabe non voyellé. PhD thesis,
University of Sciences, Technology and Medicine of Tunis (2003)
[5] Al Husseiny, A.: Dirassat Qur’aniya-2- Ahkam At-Tajweed Bee Riwayet Arsh An
Nafia An Tariq Al’azraq. Maktabat Arradwan (2005)
[6] Hall, P.A.V., Dowling, G.R.: Approximate String Matching. Computing Sur-
veys 12(4) (1980)
[7] Lait, A., Randell, B.: An Assessment of Name Matching Algorithms. Technical
Report, University of Newcastle upon Tyne (1993)
[8] Navarro, G.: A Guided Tour to Approximate String Matching. ACM Comput.
Surv. 33(1), 31–88 (2001), doi:10.1145/375360.375365
[9] Navarro, G., Baeza-Yates, R.: Very Fast and Simple Approximate String Matching.
Information Processing Letters (1999)
[10] Ousidhoum, N.D., Bensalah, A., Bensaou, N.: A New Classical Arabic Soundex
algorithm. In: Proceedings of the Second Conference on Advances in Communica-
tion and Information Technologies (2012),
http://doi.searchdl.org/03.CSS.2012.3.28
[11] Philips, L.: Hanging on the Metaphone. Computer Language 7(12) (December
1990)
[12] Philips, L.: The Double Metaphone Search Algorithm. Dr Dobb’s (2003)
[13] Precision Indexing Staff: The Daitch-Mokotoff Soundex Reference Guide. Heritage
Quest (1994)
[14] Rytting, C.A., et al.: Error Correction for Arabic Dictionary Lookup. In: Proceed-
ings of the Seventh International Conference on Language Resources and Evalua-
tion, LREC 2010 (2010)
[15] Shaalan, K., Allam, A., Gomah, A.: Towards Automatic Spell Checking for Arabic.
In: Proceedings of the Fourth Conference on Language Engineering, Egyptian
Society of Language Engineering, ELSE (2003)
[16] Shaalan, K., et al.: Arabic Word Generation and Modelling for Spell Checking.
In: Proceedings of the Eight International Conference on Language Resources and
Evaluation, LREC 2012 (2012)
[17] Shaalan, K., Aref, R., Fahmy, A.: An Approach for Analyzing and Correcting
Spelling Errors for Non-native Arabic learners. In: Proceedings of the 7th Inter-
national Conference on Informatics and Systems, INFOS 2010. Cairo University
(2010)
[18] Taft, R.L.: Name Searching Techniques. Technical Report, New York State Iden-
tification and Intelligence System, Albany, N.Y. (1970)
[19] Yahia, M.E., Saeed, M.E., Salih, A.M.: An Intelligent Algorithm For Arabic
Soundex Function Using Intuitionistic Fuzzy Logic. In: International IEEE Con-
ference on Intelligent Systems, IS (2006)
[20] Watson, J.C.E.: The Phonology and Morphology of Arabic. OUP Oxford (2007)
[21] Ben Othmane Zribi, C., Ben Ahmed, M.: Efficient Automatic Correction of Mis-
spelled Arabic Words Based on Contextual Information. In: Palade, V., Howlett,
R.J., Jain, L. (eds.) KES 2003. LNCS, vol. 2773, pp. 770–777. Springer, Heidelberg
(2003)
An RDF-Based Semantic Index
1 Introduction
In this work, we propose a novel semantic indexing technique particularly suitable for
knowledge management applications. Nowadays, in fact, one of the most challenging
aspects in the Information Retrieval (IR) area lies in the ability of information systems
to manage, efficiently and effectively, very large amounts of digital documents by
extracting and indexing the most significant related concepts, which are generally
used to capture and express documents' semantics.
In the literature, the approaches most widely used by IR systems to allow efficient
semantic-based retrieval of textual documents are Conceptual Indexing, Query Expan-
sion, and Semantic Indexing [1]. All these approaches suitably combine knowledge
representation and natural language processing techniques to accomplish their task.
Systems that use the conceptual indexing approach usually rely on catalogs of texts
belonging to specific domains and exploit ad hoc ontologies and taxonomies to asso-
ciate a conceptual description with documents. In particular, document indexing tech-
niques based on ontology-based concept matching are typically used in specialist
domains such as the juridical [2] and medical [3] ones. On the other hand, interesting
approaches based on taxonomic relationships are adopted, as in [4]: the taxonomic
structure is used to organize links between semantically related concepts, and to make
connections between the terms of a request and related concepts in the index. Differently
from the previous ones, systems that use the query expansion technique do not need to
extract any information from the documents or to change their structure; instead, they
act on the query provided by the user. The basic idea is to semantically enrich, during
the retrieval process, the user query with words that have semantic relationships
(e.g., synonyms) with the terms in which the original query is expressed. This
approach requires the use of lexical databases and thesauri (e.g., WordNet) and semantic
disambiguation techniques for the query keywords in order to obtain more accurate
results [5].
In particular, approaches based on query expansion can be used to broaden the set of
retrieved documents, or to increase retrieval precision by using the expansion procedure
to add new terms that refine the results. Finally, systems using semantic indexing
techniques exploit the meaning of documents' keywords to perform indexing opera-
tions. Thus, semantic indexes include word meanings rather than the terms contained
in documents. A proper selection of the most representative words of the documents
and their correct disambiguation is indispensable to ensure the effectiveness of this ap-
proach. In [6], several interesting experiments on using word sense disambiguation
in IR systems are reported. A large study on the applicability of semantics to IR is
discussed, by contrast, in [7], in which the problem of lexical ambiguity is bypassed by
associating a clear indication of word meaning with each relevant term, making poly-
semy and homonymy relationships explicit. Furthermore, a semantic index is built on
the basis of a disambiguated collection of terms in the SMART IR system designed
by [8]. The use of these approaches is obviously limited by the need for specific
thesauri for establishing the correct relationships among concepts.
In this paper, we describe a semantic indexing technique based on an RDF (Resource
Description Framework) representation of the main concepts of a document. With the
development of the Semantic Web, in fact, a large number of native RDF documents
are published on the Web and, as far as digital documents are concerned, several
techniques can be used to transform a text document into an RDF model, i.e.,
subject-verb-object triples [9]. Thus, in our approach, we propose to capture the
semantic nature of a given document, commonly expressed in natural language, by
retrieving a number of RDF triples, and to semantically index the documents on the
basis of the meaning of the triples' elements (i.e., subject, verb, object). The proposed
index can be exploited by current web search engines to improve retrieval effectiveness
with respect to the adopted query keywords, or for automatic topic detection tasks.
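As a toy illustration of such a text-to-triple transformation (not the technique of [9]), a dependency parse can be scanned for subject-verb-object patterns; this sketch uses spaCy and is far cruder than production triple extractors.

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline

def svo_triples(text):
    """Naively extract (subject, verb, object) triples via dependency
    parsing; real systems need far more robust extraction."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subj = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
                obj = [w for w in token.rights if w.dep_ in ("dobj", "obj", "attr")]
                if subj and obj:
                    triples.append((subj[0].lemma_, token.lemma_, obj[0].lemma_))
    return triples

print(svo_triples("The pope decided his resignation."))
# e.g. [('pope', 'decide', 'resignation')]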
The paper is organized as follows. In the next section we illustrate our proposal
for an RDF-based semantic index, discussing the indexing algorithms and providing
some implementation details. Section 3 contains experiments aimed at validating the
effectiveness and efficiency of our proposal. Finally, some conclusions are outlined in
Section 4.
In particular, the lexical database used in this work is WordNet, while the semantic
similarity measure adopted is the Leacock and Chodorow metric [11]; the described
approach is parametric with respect to the chosen similarity measure. In the case of
multiple senses, words are suitably disambiguated by choosing the most fitting sense
for the considered domain using a context-aware and taxonomy-based approach [12],
if necessary. Moreover, it is assumed that evaluating the similarity measure requires
constant time. The distance between two RDF triples is defined as a linear combination
of the distances between the subjects, the predicates, and the objects of the two triples,
and it also requires constant time. Finally, it is assumed that the maximum number of
clusters m is much smaller than the number of triples N. Under these hypotheses, the
main steps of the index-building algorithm are the following:
In step one, a single-pass iterative clustering method is used that randomly chooses
the first triple and creates the first cluster, labeling this triple as the centroid of the
first cluster. Subsequently, the clustering algorithm loops until all input triples have
been processed, in random order¹. In particular, for each triple, the clustering algorithm
finds the most suitable cluster and then adds the triple to it: the most suitable cluster
is the one whose centroid is closest to the current triple. If this distance is less than
the threshold t, the algorithm adds the current triple to the most suitable cluster.
Otherwise, if the current number of clusters is less than the maximum number of
clusters m, a new cluster is created and the algorithm marks the current triple as the
centroid of the new cluster. If instead a new cluster cannot be created, because the
current number of clusters is equal to m, the current triple is added to the most suitable
cluster even if the distance between the triple and the centroid is greater than the
threshold. The complexity of the clustering algorithm is O(mN) because, for each
triple, the algorithm evaluates at most m distances. The second step builds a new
cluster from the centroids of the previous clusters and finds the centroid of this new
cluster as the triple having the minimum average distance from the others. This step
requires O(m) time for the creation of the cluster and O(m²) time to find its centroid.
third step performs a loop over all the clusters obtained in the first step and is composed
by two sub-steps. In sub-step 3(a) the algorithm maps each triple in a point of R3 . The
x-coordinate of the point is the distance between the subject of the triple and the subject
¹ For the semantic clustering aims, we can select any unsupervised clustering algorithm able to
partition the space of documents into several clusters on the basis of the semantic similarity
among triples.
of the centroid of the cluster. In the same way, the algorithm calculates the y- and
z-coordinates using the predicate and object elements.
In sub-step 3(b) the algorithm builds a 3-d tree [13] from the mapped points.
The overall complexity of the third step is O(N log N), the time required to build the
3-d trees. The fourth step repeats sub-step 3(a) for the cluster of centroids obtained
in the second step and sub-step 3(b) for the related mapped points. The complexity
of step 4 is O(m log m) because the cluster of centroids contains at most m triples.
Hence, the presented algorithm builds the data structure behind the semantic index in
O(N log N) time, where N is the number of RDF triples. The O(m²) term is dominated
by O(N log N) because m is much smaller than N.
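A minimal sketch of steps one and two follows, with the WordNet-based element distance abstracted behind a placeholder elem_dist; all names are illustrative.

import random

def elem_dist(a, b):
    """Placeholder semantic distance between two triple elements; the
    paper uses a Leacock-Chodorow-based measure over WordNet."""
    return 0.0 if a == b else 1.0

def triple_dist(t1, t2, w=(1/3, 1/3, 1/3)):
    """Linear combination of subject, predicate and object distances."""
    return sum(wi * elem_dist(a, b) for wi, a, b in zip(w, t1, t2))

def single_pass_cluster(triples, t, m):
    """Step 1: threshold-based single-pass clustering, at most m clusters."""
    order = triples[:]
    random.shuffle(order)                      # random processing order
    centroids, clusters = [order[0]], [[order[0]]]
    for triple in order[1:]:
        dists = [triple_dist(triple, c) for c in centroids]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] < t or len(clusters) == m:
            clusters[best].append(triple)      # join the closest cluster
        else:
            centroids.append(triple)           # open a new cluster
            clusters.append([triple])
    return clusters, centroids

def medoid(items):
    """Step 2: element with minimum average distance to the others."""
    return min(items, key=lambda x: sum(triple_dist(x, y) for y in items))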
Figure 1 shows an example of part of our indexing structure in the case of web
pages reporting news on recent events in Italy (e.g., a search performed using
different combinations of query triples will produce a result set containing the
documents, or parts of them, whose main semantics effectively correspond to the
query's). At the first level, we adopt a 3-d tree to index the triples related to the
centroids of the clusters, while at the second level a 3-d tree is used for each cluster,
and the triples can contain a reference to the original document they came from.
In a similar way, range queries and k-nearest-neighbor queries can be performed on
our indexing structure. Hence, this kind of search can be done efficiently by exploiting
the well-known k-d tree properties.
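Sub-steps 3(a) and 3(b), together with the per-cluster k-nearest-neighbor search, might look as follows, reusing elem_dist from the previous sketch and SciPy's k-d tree as a stand-in for the 3-d tree of [13].

from scipy.spatial import cKDTree

def to_point(triple, centroid):
    """Sub-step 3(a): map a triple to R^3 by its element-wise
    distances from the cluster centroid."""
    return tuple(elem_dist(a, b) for a, b in zip(triple, centroid))

def build_cluster_index(cluster, centroid):
    """Sub-step 3(b): a 3-d tree over the cluster's mapped points."""
    return cKDTree([to_point(tr, centroid) for tr in cluster])

def knn_in_cluster(tree, cluster, centroid, query_triple, k=5):
    """k-nearest-neighbours search within one cluster."""
    k = min(k, len(cluster))
    _, idx = tree.query(to_point(query_triple, centroid), k=k)
    idx = [idx] if k == 1 else list(idx)
    return [cluster[i] for i in idx]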
In this section we describe the adopted experimental protocol, used to evaluate the
efficiency and effectiveness of our indexing structure, and discuss the obtained pre-
liminary experimental results.
Fig. 2. Average Search Times and Average Success Rate using the Semantic Index
Regarding the triples collection, we selected a subset of the Billion Triple Chal-
lenge 2012 dataset², in which data are encoded in the RDF NQuads format, and used
the context information to perform a correct semantic disambiguation of the triples'
elements³. In particular, as evaluation criteria for the retrieval process using our
semantic index, we measured on the one hand the average search time as a function
of the number of indexed triples, and on the other the success rate, i.e., the number of
relevant⁴ returned triples with respect to several k-nearest-neighbor queries performed
on the data collection. The results were studied using different values of the clustering
threshold t (0.1, 0.4, 0.6) and of k (the result set size).
Figure 2 shows the average search times for the different values of t. The search
time exhibits in each case a logarithmic trend, and the asymptotic complexity is
O(log(n) + c), n being the number of triples, as theoretically expected. As for
effectiveness, we computed a kind of average precision of our index in terms of
relevant results with respect to a set of query examples (belonging to different
semantic domains).
² http://km.aifb.kit.edu/projects/btc-2012/
³ The Billion Triple Challenge 2012 dataset consists of over a billion triples collected from a variety of web sources in the shape <subject> <predicate> <object> <context> (e.g., <pope> <decide> <resign> <religion>). The dataset is usually used to demonstrate the scalability of applications as well as the capability to deal with the specifics of data that has been crawled from the public web.
⁴ A result triple is considered relevant if it has a similar semantics to the query triple.
1 Introduction
"Wordplay is a literary technique and a form of wit in which the words that are used
become the main subject of the work, primarily for the purpose of intended effect
or amusement. Puns, phonetic mix-ups such as spoonerisms, obscure words and
meanings, clever rhetorical excursions, oddly formed sentences, and telling character
names are common examples of word play."¹

¹ http://en.wikipedia.org/wiki/Word_play

Pragmaticians and literary scholars have researched puns [1-4]. Pun-generating
software exists [5-6]. Software tools for entertainment are occasionally intended to
stimulate human cognition as well as to give the user a gratifying, playful experience.
This is true of serendipity generators, apps which suggest to a user which way to turn
when out for a stroll [7].
We are primarily interested in an application to Hebrew wordplay, and this
requires a fresh look. Within computational humour, automatically devising puns or
punning riddles has figured prominently [5-6]. Our main contribution in this study is
to show how Hebrew wordplay differs from wordplay in other languages (here,
English).
Rabbinic homiletics revels in wordplay that is only sometimes humorous. A poetic
convention enables gratification that is not necessarily humorous. Cultural exposure
to this tradition apparently conditions human appreciation of the Hebrew outputs of
our tool, so that humour is not necessarily a requirement for gratification from the
playfulness deriving from those outputs. As in other computational humour tools, we
do not have a model of humour in our software, which assists users in experiencing
gratification from onomastic wordplay. The input to the software is a personal given
name (a forename or first name). Our working system segments an input name
and/or introduces minimal modifications into it, so that the resulting list of one or
more components consists of extant words (Hebrew if the input is Hebrew, English
if the input is English), whose juxtaposition is left to the user to make sense of. As it
turns out, such output stimulates creativity in the subjects (ourselves) faced with the
resulting word lists: they often easily "make sense". We segment and/or use
transformations such as: addition of a letter, deletion of a letter, and replacement of a
similar letter.
These are graphemic puns [8]. Such puns, using letter replacements, have
previously been applied in DARSHAN [9]. DARSHAN generates ranked sets of
either one-sentence or one-paragraph homilies using various functions, e.g., pun
generation, numerological interpretations, and word or letter replacements. Our next
step will be to implement the punning module of GALLURA, at present a theoretical
model that generates playful explanations for input proper names (such as place
names) by combining phono-semantic matching (PSM) with story-generation skills
[10-12].
Hebrew is a Semitic language. It is written from right to left. Inflection in Semitic
languages, as in Romance ones, is quite rich, but Semitic morphology is
nonconcatenative (the consonants of the root are "plugged" into "free slots" of a
derivational or inflectional pattern). A survey of Hebrew computational linguistics is
given in [13]. It is important to note that the very nature of the Hebrew script is
somewhat conducive to success: it is a consonantal script, with some letters inserted
in a mute function (matres lectionis) suggesting vowels: w is [v] or [o] or [u]; y is
consonantal [y] or vocalic [i] or long [e].
Section 2 presents the workings of the model. Section 3 presents the results of
experiments and analyzes them. Section 4 provides several illustrative examples and
their analysis. Section 5 concludes the paper and proposes potential future research.
2 The Model
Given a name, our system tries to propose one word or a sequence of words as a
possible playful, punning “explication” in the same language as the input name. The
output should be rather similar to the input word from the spelling viewpoint. These
are “graphemic puns” indeed. To generate similar word(s) from the spelling
viewpoint, we divide the given word into relevant sequential sequences of sub-words,
that compose the input word, and/or apply one or two of the following three
transformations: deletion of a letter, insertion of a letter, replacement of a similar
letter.
In order to avoid straying from the given names, we perform up to two
transformations on each given name (i.e., the maximal allowed Levenshtein distance
is 2).
A similar letter in Hebrew is replaced using groups of Hebrew letters that either
sound similar or are allographs of the same grapheme. Examples of such groups are:
א-ה-ע, ב-ו, ט-ת, כ-ך-ק, מ-ם, נ-ן, פ-ף, ז-צ-ץ, and ס-ש.
Examples of groups of English letters that sound similar: a-e (e.g., see, sea; man,
men), a-w (e.g., groan, grown), b-p (e.g., bin, pin), d-t (e.g., bead, beat), e-w
(e.g., shoe, show), f-p (e.g., full, pull), m-n (e.g., might, night), o-w (e.g., too,
two), s-z (e.g., analyse, analyze), and u-w (e.g., suite, sweet). Apart from the use of
these groups of similar letters, we have no phonetic model in our software.
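A rough sketch of this generate-and-filter scheme for English follows; the similar-letter groups are the subset listed above, and the lexicon is whatever vocabulary dataset is plugged in.

SIMILAR = {'a': 'ew', 'b': 'p', 'd': 't', 'e': 'aw', 'f': 'p',
           'm': 'n', 'o': 'w', 's': 'z', 'u': 'w'}  # illustrative subset
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def variants(word):
    """All forms at edit distance 1: deletion, insertion, and
    replacement by a similar-sounding letter."""
    out = set()
    for i in range(len(word) + 1):
        if i < len(word):
            out.add(word[:i] + word[i + 1:])              # deletion
            for c in SIMILAR.get(word[i], ''):
                out.add(word[:i] + c + word[i + 1:])      # similar-letter swap
        for c in ALPHABET:
            out.add(word[:i] + c + word[i:])              # insertion
    return out

def segmentations(word, lexicon):
    """Split a word into a sequence of lexicon words, if possible."""
    if word in lexicon:
        yield [word]
    for i in range(1, len(word)):
        if word[:i] in lexicon:
            for rest in segmentations(word[i:], lexicon):
                yield [word[:i]] + rest

def explanations(name, lexicon, max_edits=2):
    """Candidate playful 'explanations': apply up to max_edits
    transformations, then segment into extant words."""
    forms = {name.lower()}
    for _ in range(max_edits):
        forms |= {v for f in forms for v in variants(f)}
    return {tuple(seg) for f in forms for seg in segmentations(f, lexicon)}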
3 Experiments
Experiments have been performed in two languages: Hebrew and English. For each
language, we used two main datasets: a list of given names and a lexicon (the
language's vocabulary). Table 1 shows general details about these four datasets.
We chose 50 names in Hebrew and 50 names in English for which our system
seems to produce somewhat "surprising" associations. Each of the results was
evaluated manually by two people who speak both Hebrew and English.
Three evaluation criteria were required for each output: grammatical correctness,
"creative" level, and sound similarity to the original (input) word. For each criterion,
the reviewer was required to give a rating from 5 (the highest) to 1 (the lowest).
The grammatical correctness measure represents the degree of grammatical
correctness of the produced explanation. For instance, if the produced "explanation"
is a phrase containing at least two words, the rating is given according to the
grammatical connections between the words. The creative-level measure represents
the degree of creativity of the produced explanation with regard to surprise and
novelty. The sound similarity measure represents the degree of sound similarity
between the produced explanation and the original word when they are pronounced.
Tables 2 and 3 present the values of these three evaluation criteria for the produced
explanations in Hebrew and English, respectively.
4 Illustrative Examples
Tables 4 and 5 present six detailed examples for Hebrew and English, respectively.
Due to space limitations, we shall explain in detail only one relatively complex
example. The input of the fifth Hebrew example is the Hebrew name ( אביתרEvyatar,
Ebiatar). Firstly, the system segments the input into a sequence of two acceptable
words in Hebrew and then activates two letter additions. The output is the following
sequence: ( יותר2) ( אהב1), which means “he loved more”.
We have presented a system that, when fed a one-word input (a personal given
name), segments it and/or modifies it using one or two transformations (addition of
a letter, deletion of a letter, and replacement of a similar letter) so that the output is
a list of words extant in the lexicon of the same language as the input. Our
experiments show that in Hebrew a reasonable association between input and output
is perceived more often than in English. As for English, the segmentation or
modification is often perceived as underwhelming, but on occasion it contains some
element of surprise. Arguably, the nature of both Hebrew writing and Hebrew
morphology contributes to such differences in perception. However, cultural factors
also help make the associations proposed by the tool for Hebrew more readily
accepted by members of the culture, not necessarily as a joke. English output, if
perceived to be acceptable rather than absurd, is instead accepted as a (mildly)
humorous pun.
We have contrasted our kind of graphemic puns with ones from the Far East [8].
In addition, we have designed a phono-semantic matching (PSM) module [10]
interfacing a future story-generation tool for devising playful explanations for input
proper names [11-12]. This phenomenon is known from human cultures in various
contexts, e.g., [14].
References
1. Redfern, W.D.: Puns. Basil Blackwell, Oxford (1984); 2nd edn. Penguin, London (2000)
2. Sharvit, S.: Puns in Late Rabbinic Literature. In: Schwarzwald, O.R., Shlesinger, Y. (eds.)
Hadassah Kantor Jubilee Volume, pp. 238–250. Bar-Ilan University Press, Ramat-Gan
(1995) (in Hebrew)
3. Sharvit, S.: Play on Words in Late Rabbinic Literature. In: Hebrew Language and Jewish
Studies, Jerusalem, pp. 245–258 (2001) (in Hebrew)
4. Dynel, M.: How do Puns Bear Relevance? In: Kisielewska-Krysiuk, M., Piskorska, A.,
Wałaszewska, E. (eds.) Relevance Studies in Poland. Exploring Translation and
Communication Problems, vol. 3, pp. 105–124. Warsaw Univ. Press, Warsaw (2010)
5. Hempelmann, C.F.: Paronomasic Puns: Target Recoverability towards Automatic
Generation. PhD thesis. Purdue University, Indiana (2003)
6. Waller, A., Black, R., Mara, D.A.O., Pain, H., Ritchie, G., Manurung, R.: Evaluating the
STANDUP Pun Generating Software with Children with Cerebral Palsy. ACM
Transactions on Accessible Computing (TACCESS) 1(3), article no. 16, at the ACM site
(2009)
7. de Lange, C.: Get Out of the Groove. New Scientist 215(2879), 47–49 (2012)
8. HaCohen-Kerner, Y., Cohen, D.N., Nissan, E., Zuckermann, G.: Graphemic Puns, and
Software Making Them Up: The Case of Hebrew, vs. Chinese and Japanese. In: Felecan,
O. (ed.) Onomastics in the Contemporary Public Space. Cambridge Scholars Publishers,
Newcastle (in press)
9. HaCohen-Kerner, Y., Avigezer, T.S., Ivgi, H.: The Computerized Preacher: A Prototype of
an Automatic System that Creates a Short Rabbinic Homily. B.D.D. (Bekhol Derakhekha
Daehu): Journal of Torah and Scholarship 18, 23–46 (2007) (in Hebrew)
10. Nissan, E., HaCohen-Kerner, Y.: The Design of the Phono-Semantic Matching (PSM)
Module of the GALLURA Architecture for Generating Humorous Aetiological Tales. In:
Felecan, O. (ed.), Unconventional Anthroponyms. Cambridge Scholars Publishers,
Newcastle (in press)
11. Nissan, E., HaCohen-Kerner, Y.: Information Retrieval in the Service of Generating
Narrative Explanation: What we Want from GALLURA. In: Proceedings of the 3rd
International Conference on Knowledge Discovery and Information Retrieval (KDIR), pp.
487–492 (2011)
12. Nissan, E., HaCohen-Kerner, Y.: Storytelling and Etymythology: A Multi-agent Approach
(A Discussion through Two “Scandinavian” Stories). In: HaCohen-Kerner, Y., Nissan E.,
Stock, O., Strapparava, C., Zuckermann, G. (eds.), Research into Verbal Creativity,
Humour and Computational Humour. Topics in Humor Research, Benjamins, Amsterdam
(to appear)
13. Wintner, S.: Hebrew Computational Linguistics: Past and Future. Artificial Intelligence
Review 21(2), 113–138 (2004)
14. Zuckermann, G.: “Etymythological Othering” and the Power of “Lexical Engineering” in
Judaism, Islam and Christianity. In: Omoniyi, T., Fishman, J.A. (eds.) Explorations in the
Sociology of Language and Religion, Benjamins, Amsterdam, ch.16, pp. 237–258 (2006)
Collaborative Enrichment of Electronic Dictionaries
Standardized-LMF
1 Introduction
A few years ago, the field of dictionary construction was consolidated by the
publication of the LMF ISO 24613 standard (Lexical Markup Framework) [6]. This
standard offers a framework for modeling large lexical resources in a very refined
way. LMF has proved compatible with the majority of vehicular languages.
In this paper, we highlight the challenges related to the collaborative enrichment of
large, LMF-standardized dictionaries. In such dictionaries, a great deal of knowledge
is required that can be defined only by expert users (e.g., linguists, lexicographers),
such as syntactic behaviors, semantic roles, and semantic predicates. In addition,
specific links are to be considered among several lexical entries, such as morphological
derivation links or semantic relation links (e.g., synonymy, antonymy) between senses.
Other, more complex links are those of syntactic-semantic dependency. The enrichment
difficulty increases when the partners of a link are entered in the dictionary separately.
As an example of this difficulty, consider a synonymy link where the corresponding
sense to be linked has not yet been introduced. In general, the issues affecting the
integrity of the dictionary concern missing mandatory knowledge, wrong links, and
redundant knowledge.
The solution that we propose is a wiki-based approach that benefits from the fine
structure ensured by LMF. This fine structure provides selective access to all know-
ledge in the normalized dictionary and consequently facilitates control over the
enrichment. The proposed approach ensures the properties of completeness, coherence,
and non-redundancy using a set of appropriate rules. In order to illustrate the proposed
approach, we report on the experimentation carried out on an Arabic normalized
dictionary [9].
We start by giving an overview of the main approaches used for the enrichment
of electronic dictionaries. Then, we introduce the LMF standard. Thereafter, we
describe the main issues related to the collaborative enrichment of LMF-standardized
dictionaries. After that, we present the proposed approach. Finally, we detail the
application of the proposed approach to an Arabic normalized dictionary.
In this section we enumerate the main approaches that have been proposed to
accomplish the enrichment of electronic dictionaries. The first one is based on the
massive typing of the content of one or several paper dictionaries. This was the case,
for example, of the dictionary TLFi [5], using the sixteen paper volumes of the old
dictionary TLF ("Trésor de la Langue Française") at the ATILF laboratory in France.
This approach is considered very costly in time and number of typists, although it
provides reference content.
In order to reduce the cost of typing, some works have recourse to the digitization
of old dictionaries. This was the case, for example, of several dictionaries of Arabic
as implemented in the Sakhr environment¹. This approach is inexpensive but provides
unstructured dictionaries, and therefore the search services are very rudimentary.
¹ lexicons.sakhr.com
[Figure: an LMF-modeled example in which the lexical entry كَتَبَ (kataba, 'to write'), with its senses and a transitive subcategorization frame (a subject and complement syntactic arguments), is linked through RelatedForm elements (type = derivedForm, type = stem) to the entry كَاتِب (kātib, 'writer'). Legend: LE = Lexical Entry, SF = Subcategorization Frame, SA = Syntactic Argument]

² www.lexicalmarkupframework.org/
LMF allows building dictionaries with a large linguistic coverage. Thus, the know-
ledge to be introduced for a single lexical entry is varied and leads to relationships
with other entries or directly with knowledge of other entries. Such information and
links are not simple and require linguistic expertise. Consequently, the enrichment
phase, notably with a collaborative approach, might be a difficult task.
From another point of view, LMF has advantages that favor collaborative
enrichment and thus reduce the complexity of this task. Indeed, it ensures the
uniformity of the structure of lexical entries having the same grammatical category.
Thus, the same acquisition models can be employed while ensuring the appropriate
constraints. Moreover, it offers a finely structured model, so that each piece of
knowledge can be accessed separately. Hence, an appropriate set of acquisition
constraints can be provided for each piece of knowledge or relationship.
In conclusion, we can state that LMF can be considered a solution for the
collaborative enrichment of dictionaries with a large linguistic coverage, as it is
already an established solution for modeling such dictionaries.
[Fig. 3: architecture of the enrichment process: an Acquisition & Update step feeds a base of propositions; an Analysis & Integration step checks them against a base of rules; a user Validation step resolves the remaining conflicts]
As shown in Figure 3, the enrichment process is based on three phases and uses,
apart from the normalized dictionary, a base of rules and a base of user propositions.
The enrichment concerns the addition of a new LE or the updating of an old one,
which will be saved in the base of propositions. After that, the analysis and
integration of propositions is launched, which generates the normalized form of the
LE and detects conflicts. These conflicts are studied and resolved by the committee
in the validation phase. All validated LEs pass through the analysis again before
being recorded in the normalized dictionary.
b. Analysis and Integration. This phase must be executed after adding or updating
an LE. It is an automatic phase including the following four steps.
Rule 1 deals with the completeness of a new LE: it checks the mandatory
knowledge.
Rule 1: If (New(LE) and LE = CF) Then Mandatory(POS) and (Lemma = CF) EndIf

Any new LE given as a canonical form (CF) must have a POS and a Lemma to be
stored in the dictionary D. New() checks whether the element received as a parameter
is new, and Mandatory() checks the existence of mandatory knowledge.

Rule 2: If (New(S)) Then Mandatory(Def) EndIf

Each new sense (S) must have at least one non-empty definition (Def).
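Expressed over a simplified dictionary representation (the field names are hypothetical), Rules 1 and 2 amount to simple executable checks:

def check_rule_1(le, is_new):
    """Rule 1: a new LE given as a canonical form must carry a POS,
    and its lemma must equal the canonical form."""
    if is_new and le.get('form') == le.get('canonical_form'):
        return bool(le.get('pos')) and le.get('lemma') == le.get('canonical_form')
    return True

def check_rule_2(sense, is_new):
    """Rule 2: every new sense needs at least one non-empty definition."""
    return not is_new or any(d.strip() for d in sense.get('definitions', []))

def accept_proposition(le, senses):
    """A proposition passes Analysis & Integration only if all
    completeness rules hold; failures go to the validation committee."""
    return check_rule_1(le, True) and all(check_rule_2(s, True) for s in senses)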
The choice of the experimentation case is justified, firstly, by the fact that our team
works on the Arabic language and, secondly, by the existence of a standardized
model and a first version of a dictionary constructed according to LMF³.
At present, the prototype is hosted locally on the intranet of our laboratory. We
designated four users, experts in the lexicography domain, to test the implemented
system. They started by feeding in 10,000 entries: 4,000 verbs, 5,980 nouns, and 20
particles. Each user deals with the entries starting with a given list of Arabic letters.
We noticed that they initially had difficulty discovering the interfaces and the
requested knowledge, which calls for a user guide to help new users. Moreover, the
experts can work offline and connect only to send proposals.
In order to evaluate the developed system, we conducted a qualitative assessment
of the fragment introduced by the human experts. We observed a few gaps at all
levels of control, namely completeness, consistency, and non-redundancy. When
analyzing the results, we noticed that some rules specific to Arabic must be added.
They relate to specific aspects of Arabic morphology, which uses the concepts of
root and pattern (scheme). These rules were then formulated and implemented. In
addition, some problems related to link establishment were noted, notably for
semantic links. Indeed, when the user does not mention the sense of the word
supporting its synonym, the system is unable to establish this kind of link, although
it exists when a human analyzes the entries.
The role of the committee was limited to dealing with requests to cancel some
knowledge that had been entered deliberately to test the system. Indeed, this kind of
update might cause incoherence in the content of the dictionary.
7 Discussion
The effectiveness of the wiki approach as a remote collaboration tool has led to
the appearance of several projects dealing with the collaborative construction of
³ www.miracl.rnu.tn/Arabic-LMF-Dictionary
⁴ www.adobe.com/fr/products/flex.html
lexical resources [7] [4], the most famous being Wiktionary, which is devoted to the
development of dictionaries for various languages [13]. Contribution in these projects
is open to all kinds of users, and the knowledge treated is not detailed. For example,
Wiktionary is an open dictionary based on a simple pattern containing the part of
speech, etymology, senses, definitions, synonyms, and translations. However, it does
not treat syntax and has no means to link synonyms to the related senses. The
simplicity of its structure puts the feeding task within the reach of non-experts, and
its contents suffer from a lack of knowledge and precision.
Furthermore, the use of an LMF-standardized dictionary is a strong point of our
project. The related model is complex and provides a wide linguistic coverage [2].
The robustness of the model puts the enrichment task out of the reach of non-experts,
despite the GUI, which facilitates the editing and checking of information in semi-
structured documents. This task is controlled by a set of rules ensuring the properties
of completeness, coherence, and non-redundancy. Moreover, a validation committee
resolves conflicting propositions to guarantee the consistency of the dictionary
content.
8 Conclusion
References
1. Arregi, X., et al.: Semiautomatic conversion of the Euskal Hiztegia Basque dictionary to a
queryable electronic form. In: L'objet, LMO 2002, pp. 45–57 (August 2002)
2. Baccar, F., Khemakhem, A., Gargouri, B., Haddar, K., Ben Hamadou, A.: LMF standar-
dized model for the editorial electronic dictionaries of Arabic. In: 5th International Work-
shop on Natural Language Processing and Cognitive Science, NLPCS 2008, Barcelone,
Espagne, June 12-13 (2008)
3. Bellynck, V., Boitet, C., Kenwright, J.: Construction collaborative d'un lexique
français-anglais technique dans IToldU: contribuer pour apprendre. In: 7èmes Journées
scientifiques du réseau LTT (Lexicologie Terminologie Traduction) de l'AUF (agence
universitaire de la francophonie), Bruxelles (2005)
4. Daoud, M., Daoud, D., Boitet, C.: Collaborative Construction of Arabic Lexical Re-
sources. In: Proceedings of the International Conference on MEDAR 2009, Cairo, Egypt
(2009)
5. Dendien, J., Pascal, M., Pierrel, J.-M.: Le Trésor de la Langue Française informatisé: Un
exemple d’informatisation d’un dictionnaire de langue de référence. TAL 44(2), 11–39
(2003)
6. Francopoulo, G., George, M.: ISO/TC 37/SC 4 N453 (N330 Rev.16), Language resource
management- Lexical markup framework, LMF (2008)
7. Garoufi, K., Zesch, T., Gurevych, I.: Representational Interoperability of Linguistic and
Collaborative Knowledge Bases. In: Proceedings of KONVENS 2008 Workshop on Lexi-
cal Semantic and Ontological Resources Maintenance, Representation, and Standards, Ber-
lin, Germany (2008)
8. Gurevych, I., Eckle-Kohler, J., Hartmann, S., Matuschek, M., Meyer, C., Wirth, C.: Uby -
A Large-Scale Unified Lexical-Semantic Resource Based on LMF. In: Proceedings of the
13th Conference of the European Chapter of the Association for Computational Linguistics
(EACL 2012), pp. 580–590 (April 2012)
9. Khemakhem, A., Gargouri, B., Haddar, K., Ben Hamadou, A.: LMF for Arabic. In: LMF: Lexical Markup Framework, pp. 83–96. Wiley Editions (March 2013)
10. Khemakhem, A., Elleuch, I., Gargouri, B., Ben Hamadou, A.: Towards an automatic con-
version approach of editorial Arabic dictionaries into LMF-ISO 24613 standardized model.
In: Proceedings of the International Conference on MEDAR 2009, Cairo, Egypt (2009)
11. Mangeot, M., Enguehard, C.: Informatisation de dictionnaires langues africaines-français.
Actes des Journées LTT 2011 (Septembre 15-16, 2011)
12. Mangeot, M., Sérasset, G., Lafourcade, M.: Construction collaborative d’une base lexicale
multilingue, le projet Papillon. TAL 44(2), 151–176 (2003)
13. Sajous, F., Navarro, E., Gaume, B., Prévot, L., Chudy, Y.: Semi-Automatic Endogenous
Enrichment of Collaboratively Constructed Lexical Resources: Piggybacking onto Wiktio-
nary. In: Proceedings of the 7th International Conference on Natural Language Processing
(IceTAL 2010), Reykjavik, Iceland (2010)
14. Sarkar, A.I., Pavel, D.S.H., Khan, M.: Collaborative Lexicon Development for Bangla. In:
Proceedings International Conference on Computer Processing of Bangla (ICCPB 2006),
Dhaka, Bangladesh (2006)
Enhancing Machine Learning Results for Semantic
Relation Extraction
Abstract. This paper describes a large-scale method to extract semantic relations between named entities. It is characterized by a large number of relations and can be applied to various domains and languages. Our approach is based on rule mining from an Arabic corpus using lexical, semantic, and numerical features. Three main steps are needed: firstly, we extract the learning features from annotated examples; then, a set of rules is generated automatically using three learning algorithms, namely Apriori, Tertius, and the decision tree algorithm C4.5; finally, we add a module for significant rule selection in which we use an automatic technique based on many experiments. We achieved satisfactory results when the method was applied to our test corpus.
1 Introduction
Relation extraction is the task of discovering useful relationships between two Named Entities (NEs) in text content. As this task is very useful for information extraction applications such as business intelligence and event extraction, as well as natural language processing tasks such as question answering, many research works have already been carried out. Some works are rule-based and rely on hand-crafted patterns [3]. Others use machine learning algorithms to extract relations between NEs. Among them, unsupervised machine learning methods [5, 11] extract the words between NEs and cluster them in order to produce many clusters of relations. Hence, the considered relations must occur many times between NEs within the same sentence, which is not always the case in Arabic sentences. Indeed, semantic relations can be expressed either before the first NE, between the NEs, or after the second NE. Alternatively, supervised learning methods can be used to automatically extract relation patterns based on an annotated corpus and linguistic features [7, 8, 14]. Inspired by the latter, we propose a supervised method based on rule learning. The remainder of this paper is organized as follows: firstly, we explain our proposed method; then, we
present the evaluation results of our method on a test corpus. Finally, some conclusions are drawn and future work is outlined.
2 Proposed Method
The main idea of our work is to automatically generate rules that are used to extract the semantic relations that may occur between NEs. A rule is defined as a set of dependency paths indicating a semantic link between NEs.
Our method consists of three steps: learning feature identification, automatic generation of rules using machine learning algorithms, and finally the selection of significant rules, which aims to iteratively improve the performance of our system. First of all, we extract sentences that contain at least two NEs from our training corpus. The Arabic language is characterized by its complex structure and its long sentences [7]. When analyzing our Arabic training sentences, we noted that numerous relations are expressed through words surrounding the NEs, which can occur before the first NE, between the NEs, or after the second NE. Additionally, some NEs are unrelated despite their presence in the same sentence. Some previous works like [14] use a dependency path between NEs to estimate whether there is a relation. To address this problem, we seek to limit the context of semantic relation extraction in order to guarantee the existence of a relation between the attested NEs. According to linguistic experts and the study of examples, clause segmentation is a good solution that can tackle this problem in about 80% of cases. This extraction requires an Arabic clause splitter [7] as well as an Arabic NE recognition tool [9].
The Arabic language suffers from a lack of available linguistic resources such as annotated corpora and part-of-speech taggers. Therefore, we need to spend additional effort to label and verify the linguistic resources used for learning. Hence, we need an efficient part-of-speech tagger to produce the morphological tag of each context word or symbol (punctuation mark), given that Arabic is an agglutinative language in which clitics are attached to words. For example, the conjunction (و/and) or the pronoun (هﻢ/them) can be affixed to a noun or a verb, which causes several morphological ambiguities. We developed a simple transducer for surface morphological analysis using the NooJ platform [12], based on the Arabic resources of [9].
Many early algorithms use a variety of linguistic information, including lexical, semantic, and syntactic information [13, 15]. Others, like the DIPRE algorithm [4], use no syntactic information. In our case, we focus only on lexical, numerical, and semantic features, without syntactic information. The features used are as follows:
Our method is distinct from previous works [2, 4], which aim to recognize the semantic class of a relation. We extract the position of the word surrounding the NEs that reflects the semantic relation, so we are not limited to a predefined number of classes.
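The sketch below illustrates this position-based view of the context: the words before the first NE, between the two NEs, and after the second NE are kept as separate lexical features. It is a minimal illustration under assumed data structures (token lists and NE spans), not the authors' implementation.

# Minimal sketch of position-based lexical feature extraction
# (data structures are assumptions, not the paper's code).

def context_features(tokens, ne1_span, ne2_span, window=2):
    """tokens: word list of one clause;
    ne1_span / ne2_span: (start, end) token indices of the two NEs."""
    before = tokens[max(0, ne1_span[0] - window):ne1_span[0]]
    between = tokens[ne1_span[1]:ne2_span[0]]
    after = tokens[ne2_span[1]:ne2_span[1] + window]
    return {"before": before, "between": between, "after": after}

# Example: "<PER Ahmed> visited <LOC Tunis> yesterday"
tokens = ["Ahmed", "visited", "Tunis", "yesterday"]
print(context_features(tokens, (0, 1), (2, 3)))
# {'before': [], 'between': ['visited'], 'after': ['yesterday']}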
From the annotated clauses, we build our training dataset by extracting our learning features. In fact, fifteen features are identified, of which fourteen are retrieved automatically; the last one, the position of the relation, is identified manually. Our training data is composed of 1012 sentences collected from electronic and journalistic articles in Arabic. They contain 8762 tokens including word forms, digits, and delimiters. These sentences contain 2000 NEs, of which only 1360 are related.
The first filtration criterion is the number of attributes that compose one rule. In fact, we believe that each rule composed of only one or two attributes will induce erroneous and redundant results. Hence, rules composed of fewer than 3 attributes are neglected. The second filtration consists in discarding rules by tuning their confidence1 and support2. The higher these values, the more often the rule items are associated together. In our case, a rule whose confidence value is below a threshold is removed. Next, we apply an enrichment step to the obtained rules by generating new rules from the best ones. This enrichment has the advantage of increasing the coverage of our system. That is, for each rule with a set of more than three attributes, we iteratively eliminate one attribute to obtain an equivalent number of derived rules. Next, these derived rules as well as the top rules are applied to our training data set. We thus obtain a very large number of rules, some of which are redundant. We then proceed to a third selection step in which we compare each target rule with its derived rules according to two criteria: if one of the derived rules holds with a confidence value above a specified threshold and has the highest support, it is selected; if all derived rules have confidence values below the threshold, we keep only the target rule and eliminate all its derived rules.
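The following is a schematic sketch of this selection module, not the authors' implementation; rules are assumed to be dicts carrying an attribute set plus the confidence and support measured on the training data.

CONF_THRESHOLD = 0.6  # assumed here; 0.6 is the value fixed later via the curves

def filter_rules(rules):
    """Steps 1 and 2: drop short rules, then low-confidence rules."""
    rules = [r for r in rules if len(r["attributes"]) >= 3]
    return [r for r in rules if r["confidence"] >= CONF_THRESHOLD]

def derive(rule):
    """Enrichment: remove one attribute at a time; the derived rules are
    then re-scored on the training set before the final comparison."""
    attrs = sorted(rule["attributes"])
    return [{"attributes": set(attrs) - {a}} for a in attrs]

def choose(target, scored_derived):
    """Step 3: keep the best derived rule if one clears the threshold,
    otherwise keep only the target rule."""
    good = [d for d in scored_derived if d["confidence"] >= CONF_THRESHOLD]
    if good:
        return max(good, key=lambda d: d["support"])
    return target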
These results demonstrate the improvement of the F-measure values across the rule selection levels (from 62.5% to 70.1%). The best F-measure value is obtained at the final level, which proves the effectiveness of our selection module. Thus, we succeed in finding the best compromise between three parameters: precision, recall, and the number of rules.
1 The confidence shows how frequently the rule head occurs among all the groups containing the rule body.
2 The support presents the number of instances in which the rule is applicable.
For the second selection level, which consists of filtering the obtained rules by comparing them to a confidence threshold, we first have to define the right value of this threshold. Therefore, we plotted the precision/recall curves (Figure 1) while varying the confidence threshold value. As shown in Figure 1, the threshold is fixed to 0.6.
The results presented in Table 3 show that our relation extraction system is quite precise. However, it has a low recall, since it cannot handle exceptional relations between NEs. Indeed, our system is able to extract only explicit relations that are expressed through a special word or a punctuation mark in the sentence, whereas implicit relations that are not indicated directly by specific words are difficult to extract, since the output of our system is the word indicating the relation between NEs. The recall errors are also due to the influence of the NE recognition step. In effect, some Arabic NEs, like organization-type entities, were not recognized, which complicates relation discovery. On the other hand, the difficulty of identifying the right type of NE poses a problem in relation extraction; for instance, "Tunis / تونس" could be either the name of a person or the name of a country. This problem can produce errors when applying a rule to the associated instance. To resolve this kind of problem, it is crucial to have a very efficient NE recognition tool for the Arabic language.
Our proposed method comprises two learning steps: the first automatically generates the rules through the combination of three learning algorithms; the second selects the significant rules from the generated ones. Unlike other recent works, which are interested only in a specific domain, our method is general enough to be applied independently of both domain and language.
As perspectives, there are other possible improvements to enhance the overall system performance. The addition of syntactic features and anaphora resolution can improve the coverage of our system. We also intend to combine human expertise rules with our learned selected rules in a hybrid approach that can improve our system's capabilities. Finally, it would be interesting to evaluate our system on other corpora in different languages and domains.
References
1. Agrawal, R., Srikant, R., Imielinski, T., Swami, A.: Mining Association rules between Sets
of items in Large Databases. In: ACM, pp. 207–216 (1993)
2. Ben Abacha, A., Zweigenbaum, P.: A Hybrid Approach for the Extraction of Semantic Re-
lations from MEDLINE Abstracts. In: Gelbukh, A. (ed.) CICLing 2011, Part II. LNCS,
vol. 6609, pp. 139–150. Springer, Heidelberg (2011)
3. Boujelben, I., Jamoussi, S., BenHamadou, A.: Rule based approach for semantic relation
extraction between Arabic named entities. In: NooJ (2012)
4. Brin, S.: Extracting patterns and relations from the World Wide Web. In: Atzeni, P., Men-
delzon, A.O., Mecca, G. (eds.) WebDB 1998. LNCS, vol. 1590, pp. 172–183. Springer,
Heidelberg (1999)
5. Culotta, A., Bekkerman, R., McCallum, A.: Extracting Social Networks and Contact In-
formation from Email and the Web. In: CEAS (2004)
6. Flach, P.A., Lachiche, N.: The Tertius system (1999),
http://www.cs.bris.ac.uk/Research/MachineLearning/Tertius
7. Keskes, I., Benamara, F., Belguith, L.: Clause-based Discourse Segmentation of Arabic
Texts. In: LREC, pp. 21–27 (2012)
8. Kramdi, S.E., Haemmerl, O., Hernandez, N.: Approche générique pour l’extraction de re-
lations à partir de textes. In: IC (2009)
9. Mesfar, S.: Analyse morpho-syntaxique automatique et reconnaissance des entités
nommées en Arabe standard. University of Franche-Comté, Ecole doctorale langages, es-
paces, temps, socits (2008)
10. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., San Mateo (1993)
11. Shinyama, Y., Sekine, S.: Preemptive information extraction using unrestricted relation
discovery. In: HLT-NAACL, pp. 304–311 (2006)
12. Silberztein, M.: NooJ manual (2003), http://www.nooj4nlp.net
13. Stevenson, S., Swier, R.: Unsupervised Semantic Role Labeling. In: EMNLP, pp. 95–102
(2004)
14. Zelenko, D., Aone, C., Richardella, A.: Kernel Methods for Relation Extraction. JMLR,
1083–1106 (2003)
15. Zhou, G., Zhang, M., Donghong, J., Zhu, Q.: Tree kernel-based relation extraction with
context-sensitive structured parse tree information. In: EMNLP-CoNLL (2007)
GenDesc: A Partial Generalization of Linguistic
Features for Text Classification
1 Introduction
Textual data classification is a problem with many applications, such as sentiment classification or thematic categorization of documents. This paper describes a classification method based on supervised learning algorithms. These algorithms require labelled data (i.e., data composed of an input object and its class). They have a training phase during which they receive features associated with the corresponding class label. After the training phase, the model built by the algorithm can associate a class to a set of features without a label. The quality of classification depends not only on the quality of the learning algorithm, but also on the data representation.
The usual method for textual data representation is the "bag of words" model: each word is an input feature of a learning algorithm [1]. This representation considers each word as a separate entity; it has the advantage of being simple and gives satisfactory results [2]. However, a lot of information is lost: for instance, the position of each word relative to the others is important linguistic information that disappears with the "bag of words" representation. To retain such information, the n-gram model could be convenient. However, the multiplication of features and their lack of 'genericity' make this type of model unsatisfactory, since its ad hoc aspect impedes the quality of the learning algorithm's output. For these reasons, it is necessary to find solutions to generalize features. In this paper, generalizing features means replacing specific features, such as words, by more generic features, such as their POS tags, but it is not restricted to this sole 'generalization'. This track has been quite well explored in the literature.
Using More General Features. The idea of replacing some words with their grammatical category comes from two observations. First, some words are more interesting than others for textual data classification, and depending on the task, they should be used as features while others could be discarded. Secondly, not all information can be transcribed in words only: part of it is provided by the context. The sole occurrence of an important word is not enough by itself; it must be contextualized. For instance, the presence of a number of adverbs or adjectives may be crucial information for detecting the type of processed text (e.g., the opinion associated with a document or a sentence, etc.). In this paper we tackle the identification of words that can be generalized because they are not discriminant. GenDesc is a method that replaces some words by their POS tags, according to a ranking function that evaluates the words' frequency and their discriminative power.
Word Position. Since words are not randomly inserted to make a sentence, their position, especially in languages without declension, is crucial information that is lost in the bag-of-words approach. Word position can be captured by n-grams of words, which can be combined with generalization. So n-grams composed of
words, POS tags, or both can be obtained, thus combining the initial GenDesc approach with the n-gram model. Each type of n-gram gives different information, but each can be useful for the learning algorithm.
General Process. The approach we propose is divided into several steps, sketched in the code below. The first one determines the part-of-speech category of each word in the corpus. The next step selects the words to be generalized; the words that are kept are used directly in their inflected form as features. The final step builds unigrams, bigrams, and trigrams from the remaining words and the generalized word labels. These data form our features, used as training input for the learning algorithm to build the prediction model for the classification task.
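The following is an illustrative sketch of this process under stated assumptions: `discriminance` stands in for one of the ranking functions (e.g., D), not the authors' exact formula, and the 0.3 threshold follows the value reported later.

# GenDesc sketch: replace non-discriminant words by their POS tags,
# then build mixed word/POS n-grams (an illustration, not the paper's code).

def generalize(tagged_tokens, discriminance, threshold=0.3):
    """tagged_tokens: list of (word, pos) pairs for one text."""
    return [word if discriminance(word) >= threshold else pos
            for word, pos in tagged_tokens]

def ngrams(items, n):
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def features(tagged_tokens, discriminance, threshold=0.3):
    seq = generalize(tagged_tokens, discriminance, threshold)
    # unigrams, bigrams and trigrams over the mixed word/POS sequence
    return ngrams(seq, 1) + ngrams(seq, 2) + ngrams(seq, 3)

toy_scores = {"excellent": 0.9, "law": 0.8}   # hypothetical ranking values
tagged = [("the", "DET"), ("law", "NOUN"), ("is", "VERB"), ("excellent", "ADJ")]
print(generalize(tagged, lambda w: toy_scores.get(w, 0.0)))
# ['DET', 'law', 'VERB', 'excellent']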
3 Experiments
3.1 Experimental Protocol and Used Algorithm
Corpus. We tested our method on a subset of the DEFT2007 corpus [10]. It is built from 28000 interventions of French members of parliament about legislation under consideration in the National Assembly. We worked on a subset of the corpus consisting of 1000 texts regarding the legislation, evenly balanced between 'pro' and 'con' sentences. We applied the SYGFRAN morphosyntactic parser [11] for French in order to assign a POS tag to the words of the text. SYGFRAN's recall and precision are quite high for French (more than 98% recall and precision on the DEFT corpus).
3.2 Results
GenDesc. The different ranking functions for generalizing words have been tested. The quality of the obtained results is independent of the learning algorithm (see the last paragraph of this section). Table 1 shows the results obtained using the Naive Bayes algorithm, without the use of n-grams, for different functions and thresholds. The use of POS tags only as features gives an accuracy of 53.75%. The use of words as features gives an accuracy of 60.41%. We consider these results as a baseline. The Discriminence function (D) seems quite appropriate for our purpose. Results show that when combined with TF (Term Frequency), its accuracy increases, while other functions degrade its performance. The optimal threshold changes according to the function. If we consider the function D, the optimal threshold is always around 0.3, regardless of the learning algorithm, and whether n-grams are used or not.

Table 1. Accuracy according to different functions with different thresholds. Values in bold are better than the baseline.
1 http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html
2 http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html
3 http://weka.sourceforge.net/doc/weka/classifiers/functions/SMO.html
Table 2. Tables showing the results obtained with the different learning algorithms
GenDesc:
n-grams u u+b b u+t t u+b+t b+t
SVM 67.65 64.40 60.75 68.46 59.26 65.52 61.46
Bayes 68.26 67.55 62.78 69.67 58.11 68.36 61.66
Tree 59.84 59.74 55.88 60.95 52.23 60.85 55.68
Words:
n-grams u u+b b u+t t u+b+t b+t
SVM 61.36 58.62 57.00 62.98 57.81 59.63 57.81
Bayes 60.14 60.24 59.26 60.95 57.99 60.55 59.53
Tree 55.38 54.77 52.13 57.61 54.16 52.43 55.38
u: unigrams, b: bigrams, t: trigrams
more generic tag in our future work. Similarly, some words that are preserved
can be replaced by their canonical form, in order to generalize information. Fi-
nally, we plan to use the method on another corpus, to test its appropriateness
to other data and other languages.
References
1. Harris, Z.: Distributional structure. Word 10, 146–162 (1954)
2. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
3. Gamon, M.: Sentiment classification on customer feedback data: noisy data, large
feature vectors, and the role of linguistic analysis. In: Proceedings of the 20th In-
ternational Conference on Computational Linguistics, COLING 2004. Association
for Computational Linguistics (2004)
4. Joshi, M., Penstein-Rosé, C.: Generalizing dependency features for opinion mining.
In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort
2009, pp. 313–316. Association for Computational Linguistics (2009)
5. Porter, M.F.: Readings in information retrieval, 313–316. Morgan Kaufmann Pub-
lishers Inc. (1997)
6. Prabhakaran, V., Rambow, O., Diab, M.: Predicting overt display of power in writ-
ten dialogs. In: Proceedings of the 2012 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
pp. 518–522 (June 2012)
7. Kouloumpis, E., Wilson, T., Moore, J.: Twitter sentiment analysis: The good the
bad and the omg! In: Adamic, L.A., Baeza-Yates, R.A., Counts, S. (eds.) ICWSM.
AAAI Press (2011)
8. Matsumoto, S., Takamura, H., Okumura, M.: Sentiment classification using word
sub-sequences and dependency sub-trees. In: Ho, T.-B., Cheung, D., Liu, H. (eds.)
PAKDD 2005. LNCS (LNAI), vol. 3518, pp. 301–311. Springer, Heidelberg (2005)
9. Xia, R., Zong, C.: Exploring the use of word relation features for sentiment clas-
sification. In: Proceedings of the 23rd International Conference on Computational
Linguistics: Posters, COLING 2010, pp. 1336–1344. Association for Computational
Linguistics (2010)
10. Grouin, C., Berthelin, J.B., Ayari, S.E., Heitz, T., Hurault-Plantet, M., Jardino,
M., Khalis, Z., Lastes, M.: Présentation de deft 2007. In: Actes de l’atelier de
clôture du 3eme Défi Fouille de Textes, pp. 1–8 (2007)
11. Chauché, J.: Un outil multidimensionnel de l’analyse du discours. In: Proceedings
of the 10th International Conference on Computational Linguistics, COLING 1984,
pp. 11–15. Association for Computational Linguistics (1984)
12. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
weka data mining software: an update. SIGKDD Explor. Newsl. 11, 10–18 (2009)
Entangled Semantics
1 Introduction
In 2009, the W3C announced a new standard: the Simple Knowledge Organisation System (SKOS), a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary [1]. This meant that existing knowledge organization systems employed by libraries, museums, newspapers, government portals, and others could now be shared, reused, interlinked, or enriched. Since then, SKOS has seen growing acceptance in the Linked Data publishing community, and more than 20% of existing Linked Open Data uses SKOS relations to describe some aspects of its datasets1.
A relevant example for this article is the domain-specific Thesaurus for the Social Sciences (TheSoz)2, which has been released in SKOS format [7]. We use it to investigate its dual role as a knowledge base for semantic annotations and as a language-independent resource for translation. For the experiments, we use the German Indexing and Retrieval Test database (GIRT) and a set of topics from the CLEF 2004-2006 Domain-Specific (DS) track. The focus of the experiments is to determine the value of using a SKOS resource in monolingual and bilingual retrieval, testing two techniques for annotation, explicit and implicit, and their effects on retrieval. Our results show a mixed picture, with better results for bilingual runs but worse average precision performance for the monolingual
1 http://lod-cloud.net/state/
2 http://datahub.io/dataset/gesis-thesoz
ones, when compared to the averages of all submitted runs for the corresponding CLEF DS track between 2004 and 2006.
This article is structured in four sections starting with related work in Section
2, a short description of key elements of SKOS in Section 3, semantic annota-
tion experiments for monolingual and bilingual settings in Section 4, and our
conclusions in Section 5.
2 Related Work
For almost twenty years now, the Semantic Web has been advocated as a space
where things are assigned a well-defined meaning. In the case of text as the
thing, a new technique has been developed for determining the meaning of the
text and mapping it to a semantic model like a thesaurus, ontology, or other
type of knowledge base. We refer to this as semantic annotation (SA) and adhere
to the description specified in [3]: Semantic annotation is a linking procedure,
connecting an analysis of information objects (limited regions in a text) with a
semantic model. The linking is intended to work towards an effective contribution
to a task of interest to end users.
The challenging aspect of semantic annotation is to process text at a deep
level and detangle its meaning before mapping it to classes or objects from a
formalized semantic model. Thesauri have previously been used for SA, for ex-
ample in concept-based cross-language medical information retrieval performed
using the Unified Medical Language System (UMLS) and Medical Subject Head-
ing (MeSH) [5], yet with small impact. Among the reasons brought forward for
this was the lack of coverage for a given document collection, the slow process of
updating a static resource, and the problematic process of integration between
different resources in different languages. The Semantic Web is set to change the
approach by being a platform for interoperable, collaborative creation, dynamic
(self-enriching) and distributed for language resources. Therefore, by taking a
closer look at a interlinked thesauri formulated in SKOS we want to establish,
if there is a positive impact on the overall retrieval.
As an example, consider the TheSoz entry for the concept school, shown below. For each concept in this dataset, the SKOS concept specification incorporates a set of multilingual lexical labels: the unique preferred term, a number of alternative terms, and additional documentation such as definitions and optional notes that describe the concept scheme's domain. TheSoz is also a Linked Open Data resource with interlinks to DBpedia4 (the extracted structured content from the Wikipedia project), the AGROVOC5 thesaurus containing specific terms for agricultural digital goods, as well as the STW Thesaurus for Economics6.
"School"@en
http://zbw.eu/stw/descriptor/11377-5 skos:prefLabel
"Schule"@de
http://dbpedia.org/resource/School
http://aims.fao.org/aos/agrovoc/c_6852
dbpedia-owl:abstract
skos:prefLabel
"A school is an institution designed for the teaching
"École"@fr of students (or pupils) under the direction of
"Schools"@en teachers. Most countries have systems of formal
"SCHULE"@de education, which is commonly compulsory..."@en
"..."@de
In short, a SKOS resource has two levels of structure: a conceptual level, where concepts are identified and their interrelationships established; and a terminological correspondence level, where terms are associated (as preferred or non-preferred labels) with their respective concepts.
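A small sketch of these two levels follows, using the rdflib library (an assumption of this example; the paper does not name its tooling) and a hypothetical concept URI.

# One SKOS concept with language-tagged preferred and alternative labels.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

THESOZ = Namespace("http://lod.gesis.org/thesoz/")  # illustrative base URI

g = Graph()
school = THESOZ["concept_school"]                    # hypothetical URI
g.add((school, RDF.type, SKOS.Concept))              # conceptual level
# terminological level: preferred and alternative labels per language
g.add((school, SKOS.prefLabel, Literal("Schools", lang="en")))
g.add((school, SKOS.prefLabel, Literal("Schule", lang="de")))
g.add((school, SKOS.altLabel, Literal("educational institution", lang="en")))
print(g.serialize(format="turtle"))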
4 Experiments
The experiments used 75 topics in English (EN) and German (DE) from the CLEF DS tracks from 2004-2006 for searching the GIRT collection. This data is distributed by the European Language Resources Association (ELRA)7. Previous results were ambivalent about the improvements that domain-specific resources can bring to CLIR results, and we wanted to set a new baseline and contrast it with previous work.
4 http://wiki.dbpedia.org/DBpediaLive
5 http://aims.fao.org/website/AGROVOC-Thesaurus
6 http://zbw.eu/stw/versions/latest/about
7 http://catalog.elra.info/
alternative labels. For Topic 172 a successful match was parenting style. Also, if a concept has several alternative labels, they are grouped to be handled as synonyms by Terrier's query language. Thus, we rely on good precision at rank one (P@1) and set a threshold for the ranking score. The quality of SAs is hard to establish without a gold standard, and in order to be transparent regarding the output of these two steps, we are releasing the annotated topics and the concept signatures we have built for TheSoz's concepts12. Note that the translations for both types of annotations are performed using TheSoz's multilingual labels, while the topic's title is translated using Google's Translate service. We noticed that for approximately 25% of topics the annotations complement each other and circumscribe the topic's intent (e.g., Topic 174: Poverty and homelessness in cities, with implicit annotation street urchin and explicit annotations homelessness and poverty).
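A simplified sketch of this explicit annotation step is given below: topic terms are matched against each concept's preferred and alternative labels, and the labels of a matched concept are grouped as synonyms. The data structures are assumptions for illustration, not the authors' implementation.

def annotate(topic_terms, concepts):
    """concepts: list of dicts {'pref': str, 'alt': [str, ...]}."""
    annotations = []
    terms = {t.lower() for t in topic_terms}
    for c in concepts:
        labels = {c["pref"].lower()} | {a.lower() for a in c["alt"]}
        if labels & terms:
            # group all labels of the matched concept as query synonyms
            annotations.append(sorted(labels))
    return annotations

concepts = [{"pref": "parenting style", "alt": ["upbringing style"]}]
print(annotate(["parenting", "style", "parenting style"], concepts))
# [['parenting style', 'upbringing style']]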
4.4 Results
After annotating and translating the topics, the necessary search indexes were built. We used language-specific stop-word lists and stemmers and ran a series of query formulation combinations, considering in turn pairings between the title (T), the implicit annotations (A), and the explicit annotations (C). We used PL2 (Poisson estimation for randomness)13 as the matching model, together with the default query expansion.
All results for Mean Average Precision (MAP) are listed in Table 1, and the percentage computations are performed against the second row of the table, which specifies the average MAP for past DS tracks. The best set of runs was obtained in the bilingual contexts in comparison to past experiments, and Google Translate's web service clearly helps to achieve performance comparable to the monolingual runs, at about 90%. Moreover, for combined runs using the topic's title and annotations, we saw an increase in performance relative to the average of all MAP values corresponding to that particular CLEF run. If we also compare across columns in Table 1, we notice that the EN runs are outperformed by the DE-EN runs. Based on a human assessment of annotations for DE topics, and considering that we did not use any word de-compounding tools, we noticed that there is a smaller number of annotations per topic for DE topics (3-4, as opposed to 5-6 for EN topics). This is evidence that performance rises when the right annotations are chosen, but too many annotations of varied granularity lead to mixed results.
In one of the best runs at the CLEF DS Track 2005 that used TheSoz [4], mapping concepts to topics relied on inferred concept signatures based on the co-occurrence of terms from titles and abstracts in documents and the concept terms associated with the document. This presupposes that the collection of documents has been annotated (this is true for GIRT), an assumption we found restrictive. Therefore, for the implicit annotations we used DBpedia descriptions, which sometimes allowed concepts that are too broad or too specific when matching a topic (e.g., Topic 170: Lean production in Japan was matched to the lean management concept).
12 http://bit.ly/XUMrQK
13 http://terrier.org/docs/v3.5/configure_retrieval.html#cite1
5 Conclusions
The results in the previous section show the potential and some of the limitations of using the interlinked TheSoz as a knowledge base and language-independent resource in monolingual and bilingual IR settings. Though the experimental results do not outperform across the board, they set a new and improved baseline for using SKOS-based datasets with the GIRT DS collection, and they are an example of component-based evaluation. Further work will concentrate on refining and extending the annotation process for the document collection and on experimenting with different levels of granularity for annotations in an IR context. We are aiming at generic solutions that are robust to the idiosyncrasies of interlinked SKOS dataset specifications.
References
1. SKOS Simple Knowledge Organization System Primer (February 2009),
http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
2. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern Information Retrieval the concepts
and technology behind search, 2nd edn. Addison-Wesley (2011)
3. Kamps, J., Karlgren, J., Mika, P., Murdock, V.: Fifth workshop on exploiting se-
mantic annotations in information retrieval: ESAIR 2012. In: Proceedings of the
21st ACM International Conference on Information and Knowledge Management,
CIKM 2012, pp. 2772–2773. ACM, New York (2012)
4. Petras, V.: How one word can make all the difference - using subject metadata for
automatic query expansion and reformulation. In: Working Notes for the CLEF 2005
Workshop, Vienna, Austria, September 21-23 (2005)
5. Volk, M., Ripplinger, B., Vintar, S., Buitelaar, P., Raileanu, D., Sacaleanu, B.: Semantic annotation for concept-based cross-language medical information retrieval. International Journal of Medical Informatics 67(1-3), 97–112 (2002)
6. Wartena, C., Brussee, R., Gazendam, L., Huijsen, W.-O.: Apolda: A practical tool
for semantic annotation. In: Proceedings of the 18th International Conference on
Database and Expert Systems Applications, DEXA 2007, pp. 288–292. IEEE Com-
puter Society, Washington, DC (2007)
7. Zapilko, B., Sure, Y.: Converting TheSoz to SKOS. Technical report, GESIS –
Leibniz-Institut für Sozialwissenschaften, Bonn. GESIS-Technical Reports 2009|07
(2009)
Phrase Table Combination Deficiency Analyses
in Pivot-Based SMT
Yiming Cui, Conghui Zhu, Xiaoning Zhu, Tiejun Zhao, and Dequan Zheng
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
{ymcui,chzhu,xnzhu,tjzhao,dqzheng}@mtlab.hit.edu.cn
Abstract. As parallel corpora are not always available, pivot languages were introduced to overcome parallel corpus sparseness in statistical machine translation. In this paper, we carried out several phrase-based SMT experiments and analyzed in detail the reasons that cause the decline in translation performance. Experimental results indicate that both the coverage rate of phrase pairs and the accuracy of translation probabilities affect the quality of translation.
1 Introduction
In order to overcome the limitations of parallel language data, the pivot language method was introduced [1-3]. A pivot language serves as a bridge between a source and a target language for which parallel textual data are not largely available. A language chosen as the pivot should provide a relatively large parallel corpus, either in the source-pivot direction or in the pivot-target direction.
In this paper, we focus on the phrase tables generated for the two directions (source-pivot, pivot-target), that is, on the triangulation method. This method multiplies the corresponding translation probabilities and lexical weights in the source-pivot and pivot-target phrase tables to induce a new source-target phrase table.
2 Related Work
Utiyama and Isahara [3] investigate the performance of three pivot methods. Cohn and Lapata [4] use multi-parallel corpora to alleviate the poor performance obtained with small training sets, but do not reveal the weak points of current phrase-based systems when using a pivot method. What affects pivot-based machine translation quality is discussed in general terms by Michael Paul and Eiichiro Sumita [5], but not explained in detail for any particular aspect.
When combining the two phrase tables generated by source-pivot and pivot-target
corpora, we should take two elements into account.
The first element is the phrase translation probability. We assume that source phrases are independent of target phrases given the pivot phrases. In this way, we can induce the phrase translation probability \varphi(s|t) as in Eq. 1:

\varphi(s \mid t) = \sum_{p} \varphi(s \mid p) \cdot \varphi(p \mid t)    (1)

where s, p, and t denote phrases in the source, pivot, and target languages respectively.
The second element is the lexical weight, that is, the word alignment information a in a phrase pair (s, t) and the lexical translation probability w(s|t) [6].
Let a_1 and a_2 be the word alignments inside the phrase pairs (s, p) and (p, t) respectively; the word alignment a of the phrase pair (s, t) can be obtained by Eq. 2:

a = \{ (s, t) \mid \exists p : (s, p) \in a_1 \wedge (p, t) \in a_2 \}    (2)
Then we can estimate the lexical translation probability from the induced word alignment information, as shown in Eq. 3. In this way, we can use the source-pivot and pivot-target phrase tables to generate a new source-target phrase table.

w(s \mid t) = \frac{count(s, t)}{\sum_{s'} count(s', t)}    (3)
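A minimal sketch of the triangulation in Eq. 1 follows: phrase tables are represented as dicts mapping phrase pairs to probabilities, and the pivot variable is summed out. This is an illustration under assumed data structures, not the authors' code.

from collections import defaultdict

def triangulate(src_pivot, pivot_tgt):
    """src_pivot: {(s, p): phi(s|p)}, pivot_tgt: {(p, t): phi(p|t)}."""
    src_tgt = defaultdict(float)
    by_pivot = defaultdict(list)
    for (p, t), prob in pivot_tgt.items():
        by_pivot[p].append((t, prob))
    for (s, p), prob_sp in src_pivot.items():
        for t, prob_pt in by_pivot.get(p, []):
            src_tgt[(s, t)] += prob_sp * prob_pt   # Eq. 1: sum over pivots
    return dict(src_tgt)

zh_en = {("你好", "hello"): 0.8, ("你好", "hi"): 0.2}      # toy entries
en_jp = {("hello", "こんにちは"): 0.9, ("hi", "こんにちは"): 0.7}
print(triangulate(zh_en, en_jp))   # {('你好', 'こんにちは'): 0.86}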
4 Experiments
In our experiments, English is chosen as the pivot language because of the large availability of bilingual corpora involving it. Our goal is to build a Chinese-Japanese machine translation system. We use the HIT trilingual parallel corpus [7]. There are two ways to divide the corpus: the first is the parallel setting, in which both directions share the same training sets; the second is the non-parallel setting, in which the training sets of the two directions are independent of each other. The statistics are shown in Tables 1 and 2.

Table 1. Phrase table sizes (number of phrase pairs)
              Standard    Pivot
Parallel      1088394     252389200
Non-Parallel  1088394     92063889

Table 2. Numbers of distinct Chinese (zh) and Japanese (jp) phrases
              Parallel              Non-Parallel
              zh        jp          zh        jp
Standard      521709    558819      521709    558819
Pivot         320409    380929      97860     131682
In general, we can see some problems revealed in the tables above. Firstly, though the pivot phrase table may be much larger than the standard one in size (230 times bigger), it contains fewer actual phrases than the standard one (about 60%). This shows that during phrase table combination some phrases are lost; that is to say, the pivot language cannot bridge those phrase pairs in the source-pivot and pivot-target directions. Secondly, due to the larger scale of the phrase table and the lower number of useful phrases, the pivot phrase table brings much noise into the combination. This is a barrier, because the noise affects both the quality and the efficiency of the translation process.
We then carried out the following experiments to show what caused the low phrase coverage. We extracted the phrase pairs (s, t) that exist in the standard model but not in the pivot model. Given a phrase s, we searched the Chinese-English phrase table to get its translation e, and used the corresponding phrase t to search the English-Japanese phrase table to get its origin e'. Then we compared the outputs e and e' to see what caused the failure to connect phrases in the two models. The number of phrase pairs successfully connected by the pivot is given in Table 5.

Table 5. Number of phrase pairs connected by the pivot
                          Parallel           Non-parallel
Connected phrase pairs    310439 (34.75%)    73044 (9.84%)
As we can see above, in the parallel setting only 34.75% of phrase pairs are connected, and in the non-parallel setting the rate goes down to 9.84%. We examined the output file and noticed some phenomena which account for the low number of connected phrase pairs: firstly, Arabic numerals can be converted into English words (e.g., 100 -> one hundred); secondly, words with similar meanings can be converted (e.g., 8.76% -> 8.76 percent); thirdly, punctuation can be removed or added (e.g., over -> over.).
We then built new models restricted to the phrase pairs shared by the standard and pivot phrase tables, keeping the probabilities of each. In this way, we can see how the results differ, for the same phrase pairs, when different translation probabilities are used. The results are shown in Table 6, for which the parameters were not tuned.
Table 6. BLEU scores of the old and new generated models (with parallel data)
        Standard    Pivot
Old     26.88       17.56
New     24.99       21.44
We can see that, in the new models, the difference in probabilities yields a gap of 3.55 BLEU points. We also found a quite unusual phenomenon: though the new pivot model is reduced to 0.85% of its original size, its BLEU score rises to 21.44. This is further proof that there is too much noise in the pivot phrase table. The noise affects translation quality, and translation efficiency is also impacted by the table's large size.
5 Conclusion
The experiments showed that translation results may decrease with changes in the coverage of phrase pairs and in the accuracy of translation probabilities. We still need to improve the coverage rate of phrase pairs, and we should also improve the accuracy of translation probabilities, rather than merely multiplying the individual probabilities.
References
1. de Gispert, A., Marino, J.B.: Catalan-English statistical machine translation without parallel
corpus: bridging through Spanish. In: Proceedings of 5th International Conference on Lan-
guage Resources and Evaluation, pp. 65–68 (2006)
2. Wu, H., Wang, H.: Pivot Language Approach for Phrase-Based Statistical Machine Transla-
tion. In: Proceedings of 45th Annual Meeting of ACL, pp. 856–863 (2007)
3. Utiyama, M., Isahara, H.: A comparison of pivot methods for phrase-based statistical machine translation. In: Proceedings of HLT, pp. 484–491 (2007)
4. Cohn, T., Lapata, M.: Machine Translation by Triangulation: Making Effective Use of Mul-
ti-Parallel Corpora. In: Proceedings of the 45th ACL, pp. 348–355 (2007)
5. Paul, M., Sumita, E.: Translation Quality Indicators for Pivot-based Statistical MT. In: Pro-
ceedings of 5th IJCNLP, pp. 811–818 (2011)
6. Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Human Language
Technology Conference of the North American Chapter of the Association for Computa-
tional Linguistics, pp. 127–133 (2003)
7. Yang, M., Jiang, H., Zhao, T., Li, S.: Construct Trilingual Parallel Corpus on Demand. In:
Huo, Q., Ma, B., Chng, E.-S., Li, H. (eds.) ISCSLP 2006. LNCS (LNAI), vol. 4274, pp.
760–767. Springer, Heidelberg (2006)
Analysing Customers Sentiments: An Approach
to Opinion Mining and Classification of Online
Hotel Reviews
1 Introduction
2 System Framework
Using machine learning and natural language processing (NLP) techniques, we have created a classifier that evaluates reviews as positive or negative. We have implemented a general-purpose classifier that takes a review as input data and assigns a category label to it. Next, the same classifier evaluates the split-up parts of the review. Currently, numerous works on sentiment analysis and text classification use the Maximum Entropy model (MaxEnt) and the Naïve Bayes model, and based on these works we have used the Stanford NLP Classifier2 tool for building the maximum entropy models. The text classification approach requires labelled instances of texts for the supervised classification task; therefore the proposed framework, described in Figure 1, integrates multiple text analysis techniques and covers the whole text processing task. This framework contains separate components based on text mining techniques, expert knowledge, examples, lexico-semantic patterns, and empirical observations, using a modular structure that allows components to be enabled and disabled in order to adapt the software composition and achieve the best results.
When processing the text, we analyse each sentence separately to facilitate the process of filtering into categories. In the first step, we tokenize the sentences to analyse the words independently. During this step, we clean the text of irrelevant symbols and English stop words3. Then we use the Stanford lemmatization and part-of-speech (POS) tagger to use the lemmas, their POS tags, and their raw frequencies as classifier features. These components provide a candidate list of classifier features for polarity evaluation. Lastly, we annotate words using Q-WordNet to establish their polarity and determine whether a text (word, sentence, or document) is associated with a positive or negative opinion.
Using the part-of-speech (POS) tagger information, we are able to relate an adjective to the noun it refers to, generating bigrams with an assigned polarity.
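A sketch of this preprocessing pipeline is shown below. NLTK stands in for the Stanford tools used in the paper, and `polarity_lexicon` is a toy stand-in for Q-WordNet; it requires the nltk 'punkt', 'stopwords', 'averaged_perceptron_tagger', and 'wordnet' data to be downloaded.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
LEMMA = WordNetLemmatizer()
polarity_lexicon = {"great": "positive", "dirty": "negative"}  # toy stand-in

def review_features(sentence):
    # tokenize, drop stop words and non-alphabetic tokens
    tokens = [t for t in nltk.word_tokenize(sentence.lower())
              if t.isalpha() and t not in STOP]
    tagged = nltk.pos_tag(tokens)
    lemmas = [LEMMA.lemmatize(w) for w, _ in tagged]
    # lemma frequencies plus polarity flags as classifier features
    feats = {f"lemma={l}": lemmas.count(l) for l in lemmas}
    for l in lemmas:
        if l in polarity_lexicon:
            feats[f"polarity={polarity_lexicon[l]}"] = 1
    return feats

print(review_features("The room was great but the bathroom was dirty."))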
1 http://wordnet.princeton.edu/
2 http://nlp.stanford.edu/software/classifier.shtml
3 http://www.textfixer.com/resources/common-english-words.txt
3 Experimental Results
The detailed experimental results are presented in Table 1. The experiments were conducted on the corpus of 1000 reviews with two-label classification, using 4-fold random cross-validation. The accuracy of the framework was evaluated by comparing the classifier result with the manually labelled rating of the reviews. We found that the addition of POS tagging to the framework did not provide a significant improvement; in fact, accuracy decreased when we used it without polarity detection. We hypothesize that this is because word-category disambiguation is not relevant when the system only uses unigrams. In spite of this, POS tagging has also been integrated into other system tasks; for example, during the word classifying process we used the word category. The addition of polarity and negative tokens improved the classifier significantly in relation to the base classifier, but we deem that this process is even more relevant during the splitting task, because the text to be analysed is shorter and the sentiment about concrete features matters more than the general sentiment of the review.
4 Conclusions
We have presented a system for extracting opinions from reviews, focused on learning the customers' evaluation of the different features of a hotel. Unlike
References
1. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods
for sentiment analysis. Computational Linguistics (2011)
2. Dave, K., Lawrence, S., Pennock, D.: Mining the peanut gallery: Opinion extraction
and semantic classification of product reviews. In: Proceedings of the 12th Interna-
tional Conference on World Wide Web, WWW 2003 (2003)
3. Huang, J., Etzioni, O., Zettlemoyer, L., Clark, K., Lee, C.: RevMiner: An Extractive
Interface for Navigating Reviews on a Smartphone. In: Proceedings of the 25th ACM
Symposium on User Interface Software and Technology (2012)
4. Zhuang, L., Jing, F., Zhu, X.-Y., Zhang, L.: Movie review mining and summarization. In: Proceedings of the ACM International Conference on Information and Knowledge Management, CIKM (2006)
5. Agerri, R., Garcı́a-Serrano, A.: Q-WordNet: Extracting Polarity from WordNet
Senses. In: Proceedings of the Seventh Conference on International Language Re-
sources and Evaluation (2010)
An Improved Discriminative Category
Matching in Relation Identification
Department of Computer Science and Technology, East China Normal University, China
[email protected], {jyang,xlin}@cs.ecnu.edu.cn
1 Introduction
Along with the rapid growth of digital resources and World Wide Web text information, corpora are becoming large and heterogeneous, and more and more unsupervised relation extraction (URE) methods are being developed. URE aims at finding relations between entity pairs without any prior knowledge. URE was first introduced by Hasegawa [1] with the steps: (1) tagging named entities, (2) getting co-occurring named entity pairs and their context, (3) measuring context similarities, (4) clustering named entity pairs, (5) labeling each cluster of named entity pairs. Rosenfeld and Feldman [3] compare several feature extraction methods and clustering algorithms and leave the identification of relations to further extraction by semi-supervised systems. Wang [4] applied URE to a Chinese corpus, gave an overall summary, and made an improvement based on heuristic rules and a co-kmeans method. In the relation identification step, Hasegawa [1] uses the single most frequent word to represent the relation of a cluster. Chen [2] employs discriminative category matching (DCM) to find typical and discriminative words as labels for clusters, not for entity pairs. Yan [5] uses an entropy-based algorithm to rank features.
To select a good relation word, in this paper we employ the improved DF method [8] to rank the features of low-frequency entity pairs. Then we select a set of features for each cluster. Last, we propose an improved DCM method to select features to describe the relations. Experiments show that our method can select more accurate features to identify relations.
∗ Corresponding author.
This paper uses ICTCLAS 2011 to segment sentences, and we extract all nouns between the two entities, plus two on each side, after removing stopwords. Each entity pair is represented as P_i = {e_{i1}, e_{i2}, (w_1, t_1), (w_2, t_2), ...}. In the k-means clustering step, we determine the range of the cluster number k according to [6] and use the Silhouette index [7] as the evaluation metric to obtain the optimal value of k. The result is {C_1, C_2, C_3, ..., C_k}. To re-rank the features by their importance for relation identification, we use an improved DF method [8] with a threshold θ which is used to roughly distinguish the frequency of w_i. Table 1 is an example of the function f(w), and the importance can be calculated by Equation (2) [9]. We then get a new global feature order W = {w'_1, w'_2, ..., w'_m}. For low-frequency entity pairs whose frequency is less than θ, we rearrange their order and get a new feature sequence P_i = {e_{i1}, e_{i2}, (w'_1, t'_1), (w'_2, t'_2), ...}. The details of the improved DCM method are shown in Table 2.
Table 1. An example of the function f(w)
     w1   w2   w3   ...
P1   1    1    0    ...
P2   1    0    2    ...
C_{i,k} = \frac{\log_2(df_{i,k} + 1)}{\log_2(N_i + 1)}    (3)

where k denotes the kth feature (w_k) in cluster i, df_{i,k} is the number of entity pairs containing w_k in cluster i, and N_i is the number of entity pairs in cluster i.
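A direct transcription of Eq. 3 follows as a sketch; each cluster is assumed to be a list of feature sets, one per entity pair.

import math

def dcm_weight(df_ik, n_i):
    """C_{i,k} = log2(df_{i,k} + 1) / log2(N_i + 1)."""
    return math.log2(df_ik + 1) / math.log2(n_i + 1)

def cluster_weights(cluster):
    """cluster: list of feature sets, one per entity pair."""
    n_i = len(cluster)
    vocab = set().union(*cluster)
    return {w: dcm_weight(sum(w in pair for pair in cluster), n_i)
            for w in vocab}

pairs = [{"president", "visit"}, {"president"}, {"meeting"}]
print(cluster_weights(pairs)["president"])  # log2(3)/log2(4) ≈ 0.79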
3 Experiments
In this paper, we ignore the Named Entity (NE) identification step and suppose that the entity pairs in the corpus are given. Our method is tested on two different corpora. One of them is one month of People's Daily (labeled "PD") from 1998, including 334 distinct entity pairs and 19 relation words. The second is a web corpus for which we used the Baidu search engine to collect co-occurrence sentences (labeled "Baidu"); it contains 343 distinct entity pairs and 8 relation words. Table 2 gives some details of these two datasets.
Table 2. Details of the two datasets
PD                                  Baidu
Relations          Entity pairs     Relations          Entity pairs
主席 (president)    56               总统 (president)    98
书记 (secretary)    31               市长 (mayor)        45
首相 (premier)      7                首都 (capital)      22
...                ...              ...                ...
In order to measure the results automatically, each entity pair's relation is labeled manually as Table 2 shows. Precision can be defined as follows:

P = \frac{N_{correct}}{N_{correct} + N_{error}}    (4)

where N_{correct} and N_{error} are the numbers of correct and erroneous relation extraction results.
According to this evaluation metric, we obtain the final precision of the two methods on our two datasets, as Table 4 shows.

Table 4. Precision of the two methods
               PD        Baidu
DCM            86.60%    86.58%
Improved DCM   91.07%    95.91%
Acknowledgements. This paper was funded by the Shanghai Science and Technology Commission Foundation (No. 11511502203) and the International Cooperation Foundation (No. 11530700300).
References
1. Hasegawa, T., Sekine, S., Grishman, R.: Discovering Relations among Named Entities from
Large Corpora. In: ACL 2004 (2004)
2. Chen, J., Ji, D., Tan, C.L., Niu, Z.: Unsupervised Feature Selection for Relation Extraction.
In: IJCNLP 2005, JejuIsland, Korea (2005)
3. Benjamin, R., Ronen, F.: Clustering for Unsupervised Relation Identification. In: Proceed-
ings of CIKM 2007 (2007)
4. Wang, J.: Research on Unsupervised Chinese Entity Relation Extraction Method, East Chi-
na Normal University (2012)
5. Yan, Y., Naoaki, O., Yutaka, M., Yang, Z., Mitsuru, I.: Unsupervised relation extraction by
mining Wikipedia texts using information from the web. In: Proceedings of the Joint Confe-
rence of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP, Suntec, Singapore, August 2-7, vol. 2 (2009)
6. Zhou, S., Xu, Z., Xu, T.: New method for determining optimal number of clusters in K-
means clustering algorithm. Computer Engineering and Applications 46(16), 27–31 (2010)
7. Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number
of clusters in a dataset. Genome Biology 3(7), 1–21 (2002)
8. Xu, Y., LI, J., Wang, B., Sun, C.: A study of Feature Selection for Text Categorization Base
on Term Frequency. In: Chinese Information Processing Front Progress China Chinese In-
formation Society 25th Anniversary of Academic Conference Proceedings (2006)
9. Xu, Y., Huai, J., Wang, Z.: Reduction Algorithm Based on Discernibility and Its Applica-
tions. Chinese Journal of Computers 26(1) (January 2003)
Extracting Fine-Grained Entities Based on Coordinate Graph
Abstract. Most previous entity extraction studies focus on a small set of coarse-grained classes, such as person. However, the distribution of entities within the query logs of search engines indicates that users are more interested in a wider range of fine-grained entities, such as GRAMMY winner and Ivy League member. In this paper, we present a semi-supervised method to extract fine-grained entities from an open-domain corpus. We build a graph based on entities in coordinate lists, which are html nodes with the same tag path in the DOM tree. Then class labels are propagated over the graph from known entities to unknown ones. Experiments on a large corpus from the ClueWeb09 dataset show that our proposed approach achieves promising results.
1 Introduction
In recent years, the task of harvesting entities from the Web has excited interest in the field of information extraction. Most previous work focused on how to extract entities for a small set of coarse-grained classes. However, the distribution of entities within the query logs of search engines indicates that users are more interested in a wider range of fine-grained entities [1], such as Ivy League member etc. In this paper, we introduce and study a task to serve this growing interest: web fine-grained entity extraction. Compared to traditional entity extraction, the main challenges in fine-grained entity extraction lie in: 1) there are so many fine-grained classes defined

Fig. 1. Coordinate List of entities
3 Corresponding author.
according to different tasks; 2) there is usually no context available. However, website editors usually use coordinate components such as tables or lists to help readers identify similar content or concepts, as in Fig. 1. Entities in the same list or the same column of a table tend to belong to the same conceptual classes. We can build a graph to capture such co-occurrence relationships between entities, and then propagate class labels from known entities to unknown ones. Experiments demonstrate promising results.
The rest of the paper is organized as follows: Section 2 describes our approach to
the task of fine-grained entity extraction. Evaluation is presented in Section 3. In
Section 4, we discuss related work. Section 5 concludes the paper and discusses future
work.
In the task of fine-grained entity extraction, we are given 1) a set of coordinate lists
L = {l_1, l_2, …, l_n} extracted from web pages, and 2) a list of fine-grained classes
C = {c_1, c_2, …, c_m} defined by Wikipedia categories. We aim at inferring the classes
that these entities belong to.
According to the observation that website editors usually express similar contents
with the same HTML tags, we assume all entities in the same coordinate list have
similar classes. HTML lists and tables are two special cases of coordinate lists.
First, we group entities into coordinate lists according to their text nodes' tag paths
rooted at <HTML>, with the following filtering rules: (1) the count of tokens (at
least 2 and at most 50); (2) non-characters (starting, ending, or all); (3) some special
types (e.g., number, date, URL, etc.) are filtered out; (4) the count of entities in a
coordinate list (at least 5). Then, according to the co-occurrence of entities in differ-
ent coordinate lists, we build a coordinate graph G = (V, E, W), where V is the entity
set {v_1, …, v_n}, an edge e_ij ∈ E connects entities v_i and v_j that co-occur in lists,
and w_ij ∈ W reflects the similarity of class labels between the two entities. In this
paper, we use the frequency of co-occurrence w_co(v_i, v_j) and the PMI w_pmi(e_ij)
over different coordinate lists to measure the similarity between two entities. The
PMI can be computed as follows:
w_{pmi}(e_{ij}) = \log \frac{w_{co}(v_i, v_j) \times C}{f(v_i) \times f(v_j)}    (1)
p_{ij} = \begin{cases} 1 & \text{if } v_i \text{ belongs to class } c_j \\ 0 & \text{otherwise} \end{cases}    (2)
In the label propagation process, each entity receives label information from its
neighbors while retaining its initial class state. At time t + 1:

f_i^{t+1} = \alpha \sum_{v_j \in N(v_i)} w_{ij} f_j^t + (1 - \alpha) f_i^0    (3)
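A minimal runnable sketch of Eqs. (1)-(3) follows, assuming the standard PMI reading of Eq. (1) (C as the number of coordinate lists, f(v) as the number of lists containing v) and adding a row-normalization step that the paper does not specify; the toy lists and seed are illustrative, not the authors' code or data.

import math
from collections import defaultdict

def build_pmi_weights(lists):
    # Eq. (1): w_pmi(e_ij) = log(w_co(v_i, v_j) * C / (f(v_i) * f(v_j))),
    # read with C = number of coordinate lists, f(v) = lists containing v.
    C = len(lists)
    freq, co = defaultdict(int), defaultdict(int)
    for lst in lists:
        ents = sorted(set(lst))
        for v in ents:
            freq[v] += 1
        for a in range(len(ents)):
            for b in range(a + 1, len(ents)):
                co[(ents[a], ents[b])] += 1  # w_co: co-occurrence count
    return {(vi, vj): math.log(n * C / (freq[vi] * freq[vj]))
            for (vi, vj), n in co.items()}

def propagate(weights, seeds, classes, alpha=0.8, iters=50):
    nodes = sorted({v for pair in weights for v in pair})
    # Eq. (2): p_ij = 1 iff entity v_i is known to belong to class c_j.
    f0 = {v: [1.0 if (v, c) in seeds else 0.0 for c in classes] for v in nodes}
    f = {v: list(f0[v]) for v in nodes}
    nbrs = defaultdict(list)
    for (vi, vj), w in weights.items():
        if w > 0:  # keep only positively associated pairs (a simplification)
            nbrs[vi].append((vj, w))
            nbrs[vj].append((vi, w))
    for v in nodes:  # row-normalize so the update converges (not in the paper)
        s = sum(w for _, w in nbrs[v])
        nbrs[v] = [(u, w / s) for u, w in nbrs[v]] if s else []
    for _ in range(iters):
        # Eq. (3): f_i^{t+1} = alpha * sum_j w_ij f_j^t + (1 - alpha) f_i^0
        f = {v: [alpha * sum(w * f[u][k] for u, w in nbrs[v])
                 + (1 - alpha) * f0[v][k] for k in range(len(classes))]
             for v in nodes}
    return f

lists = [["harvard", "yale", "cornell"], ["harvard", "yale"], ["mit", "stanford"]]
labels = propagate(build_pmi_weights(lists),
                   seeds={("harvard", "Ivy League member")},
                   classes=["Ivy League member"])
print(labels["cornell"])  # non-zero: the label reached an unlabeled entity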
3 Experiments
This paper aims at extracting fine-grained entities. In line with this, we select two
datasets, the Wiki-list and the Web-list, for comparative studies. The Wiki-list contains
30,525 entities and 73,074 types extracted from Wikipedia pages. The Web-list con-
tains 8,724 entities and 20,593 types extracted from traditional web pages. In the gold
set, there are about 3.7 types per entity. Fig. 2 (a) and (b) show the effect of varying
the unlabeled ratio from 10% to 90%, with a step size of 10%. Clearly, the per-
formance decreases gradually, because prediction is more difficult with fewer known
entities. According to Fig. 2 (a) and (b), for the Wiki-list dataset the PMI measure
achieves better performance than the co-occurrence measure, whereas for the Web-list
dataset it does not. In addition, the performance on the Wiki-list dataset is significantly
better than on the Web-list dataset. We believe this is because traditional
web pages are much noisier than Wikipedia pages.
Fig. 2. AP and MicroF1 (panels (a) and (b)) against the unlabeled ratio (%) for wiki_PMI, wiki_Co, web_PMI, and web_Co
4 Related Work
Recently, there has been some work on building tagging systems using a large num-
ber of fine-grained types. Some focus on person categories (e.g., [4]), and some deal
with around 100 types at most (e.g., [5]); we label entities with a much larger number
of types. Limaye et al. [6] annotate table cells and columns with corresponding catego-
ries and relations from an existing catalog or type hierarchy for a single table; in
contrast, we assign class labels over a list corpus. Talukdar et al. [7] acquire labeled
classes and their instances from both unstructured and structured text sources using
graph random walks, constructing a graph model based on the relationship between
entities and classes. We use structured text sources and construct a graph model based
on the co-occurrence of entities in coordinate lists.
Acknowledgments. This work was supported by the National Natural Science Foun-
dation of China (No. 61272361, No. 61250010).
References
1. Guo, J., et al.: Named entity recognition in query. In: Proceedings of the 32nd International
ACM SIGIR Conference on Research and Development in Information Retrieval, Boston,
MA, USA, pp. 267–274. ACM (2009)
2. Jiang, P., et al.: Wiki3C: exploiting wikipedia for context-aware concept categorization. In:
Proceedings of the Sixth ACM International Conference on Web Search and Data Mining,
Rome, Italy, pp. 345–354. ACM (2013)
3. Wang, F., Zhang, C.: Label propagation through linear neighborhoods. In: Proceedings of
the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, pp.
985–992. ACM (2006)
4. Ekbal, A., et al.: Assessing the challenge of fine-grained named entity recognition and clas-
sification. In: Proceedings of the 2010 Named Entities Workshop, Uppsala, Sweden, pp.
93–101. Association for Computational Linguistics (2010)
5. Ling, X., Weld, D.S.: Fine-Grained Entity Recognition. In: Proceedings of the 26th
Conference on Artificial Intelligence, AAAI (2012)
6. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using enti-
ties, types and relationships. Proc. VLDB Endow. 3(1-2), 1338–1347 (2010)
7. Weischedel, R., Brunstein, A.: BBN pronoun coreference and entity type corpus. Linguistic
Data Consortium, Philadelphia (2005)
NLP-Driven Event Semantic Ontology Modeling
for Story
School of Information Science and Engineering, Hunan University, Changsha 410082, China
{gcm211,xieqiumei0814}@163.com, [email protected]
1 Introduction
Cognitive psychologists consider events the basic units of the real world that human
memory can understand. This paper defines an event e as (p, a1, a2, …, an), where p is
the predicate that triggers the presence of e in the text and cannot be null, while a1, a2,
…, an are the arguments associated with e. Wei Wang [1] obtained 5W1H events from
topic sentences in Chinese news, but that work focused on the interesting information,
not the whole text; Yao-Hua Liu [2] derived verb-based binary relations from news
text, but suffered quality loss when extracting higher-order n-ary facts. We use
Open Information Extraction (OIE) [3] to capture n-ary fact-frames from Chinese
children's stories and convert them into event structures. For complex sentences, we
represent the n-ary facts extracted from relative clauses as structures nested within the
main fact-frame structure. A fact-frame is made of facts, which are attribute-value
pairs: the attribute is obtained from a dependency relation given by the parser or from
regular expressions, while the value is a word from the sentence or an entity-annotated
type. The set of attributes is {subject, predicate, object, instrument, place, time, subject
property, object property, subject amount, object amount, adverb}.
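As a concrete illustration, the following minimal sketch (class and method names are our own, not the authors' code) encodes a fact-frame as attribute-value pairs over the fixed attribute set, with nesting for relative clauses, and flattens it into the event form e = (p, a1, …, an):

from dataclasses import dataclass, field
from typing import Dict, Union

ATTRIBUTES = {"subject", "predicate", "object", "instrument", "place", "time",
              "subject property", "object property", "subject amount",
              "object amount", "adverb"}

@dataclass
class FactFrame:
    # a value is a word from the sentence, an entity-annotated type,
    # or a nested FactFrame extracted from a relative clause
    facts: Dict[str, Union[str, "FactFrame"]] = field(default_factory=dict)

    def add(self, attribute: str, value) -> None:
        assert attribute in ATTRIBUTES, f"unknown attribute: {attribute}"
        self.facts[attribute] = value

    def to_event(self):
        """Event e = (p, a1, ..., an): predicate p plus its arguments."""
        p = self.facts["predicate"]          # the predicate cannot be null
        args = [v for k, v in self.facts.items() if k != "predicate"]
        return (p, *args)

frame = FactFrame()
frame.add("subject", "rabbit")
frame.add("predicate", "run")
frame.add("place", "forest")
print(frame.to_event())   # ('run', 'rabbit', 'forest')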
Ontology is an effective knowledge organization model with semantic expression
and reasoning ability. The SOSDL ontology is designed as a common event model
to describe event information in multimedia data (text, audio, image, etc.), with the
ability to represent quantitative temporal relations (Allen time relations [4]) and spatial
relations (topological, directional, and distance relations). The top
concepts are Entity, Object, Time, Space, Information Entity, Event, and Quality. For
an event e, we incorporate its predicate as an individual of Event and its arguments as
individuals of the relevant SOSDL classes according to their fact attributes. The rela-
tions between the predicate and the arguments are attached by the reification technique,
while the relations between events are ordered by event start time and duration. We
can then use semantic web technology to query and reason for applications such as
question answering engines and text-driven animation generation. This paper mainly
focuses on the extraction of event semantic elements, which is the key step.

For the original text, we perform word segmentation and POS tagging with ICTCLAS,
apply rules to identify named entities (characters, time, location) and Chinese verb
change forms (e.g., 看一看 (look at), 看了又看 (look at repeatedly)), treat character
dialogues as a whole, split the text into sentences, and finally save the result in XML format.
The Stanford Chinese Dependency Parser [5] provides 45 kinds of named grammatical
relations and a default relation dep. Its output consists of sentences represented as
binary dependencies from the head lemma to the dependent lemma: rel(head, dep).
An extraction example is shown in Fig. 1. In total, we have 33 fact-frame element
extraction rules for Chinese sentences that explicitly contain a verb or adjective as
the candidate predicate; these rules cover the main clause, relative clauses, and the
Chinese-specific structures "把 (BA)", "被 (BEI)", etc.

However, some Chinese sentences are composed of topics and comments, which
contain no candidate predicate but still convey a clear meaning (e.g., "他今年8岁。"
(He is 8 years old this year.)). As most language theories focus on verbs, the dependency
parser fails to parse these sentences; in this case "岁 (year)" will be treated as a
verb rather than a quantifier. As a complement, we use regular expressions to extract entities
Compared to the event form, there are three types of n-ary fact-frames that need to be
modified: (1) the fact-frame contains only a time or location fact; (2) the fact-frame has
at most one subject and multiple predicates; (3) the fact-frame has multiple subjects,
each followed by multiple predicates. We use the following heuristics to process them
into candidate event form: for case (1), as our text understanding is based on the event
model, we attach the frame to the next fact-frame that has a predicate fact; for case (2),
common sense tells us that the multiple predicates likely share the same subject, so we
create as many candidate events as there are predicates; for case (3), we split the
fact-frame into sub-frames by subject and handle each sub-frame as in case (2) (a sketch
of these heuristics follows this paragraph). As the Lexical Grammar Model [6]
specifies that verbs in any language can be classified into ten general domains, and each
lexical domain is characterized in terms of the trigger words of a general verb or ge-
nus, we define the event types as the verb classifications and use a trigger/event-type
table to identify event types. Finally, we populate the event elements into SOSDL.
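A sketch of the three heuristics, under the assumption that a frame is a plain dict with optional "subject", "predicates", "subframes", and time/place facts; these shapes are illustrative stand-ins for the authors' fact-frame structures.

def to_candidate_events(frames):
    events, pending = [], {}
    for frame in frames:
        if "predicates" not in frame and "subframes" not in frame:
            # case (1): only a time or location fact -- hold it and attach it
            # to the next fact-frame that has a predicate fact
            pending.update(frame)
            continue
        # case (3): several subjects -> split into sub-frames by subject ...
        subframes = frame.get("subframes") or [frame]
        for sub in subframes:
            # ... and handle each sub-frame as case (2): the predicates
            # likely share the single subject, one candidate event each
            for p in sub.get("predicates", []):
                event = {"subject": sub.get("subject"), "predicate": p}
                event.update(pending)
                events.append(event)
        pending = {}
    return events

frames = [
    {"time": "yesterday"},                                            # case (1)
    {"subject": "fox", "predicates": ["howl", "hunt"]},               # case (2)
    {"subframes": [{"subject": "fox", "predicates": ["run"]},
                   {"subject": "wolf", "predicates": ["chase"]}]},    # case (3)
]
for e in to_candidate_events(frames):
    print(e)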
3 Evaluation
We use 154 Chinese children's stories as the data set and set up four experiments.
Baseline1 is the system whose preprocessing does not identify Chinese verb change
forms. Baseline2 contains only the main-clause extraction rules. Liuyaohua is the
method in [2]. We apply precision P, recall R, F-measure F, and completeness C to
evaluate n-ary fact extraction. Let N_t be the number of facts that should be extracted
from the texts and N_f the number of facts found by a method. For the extracted
facts, we manually judge each as 1) true and complete (N_{t&c}), 2) true and
incomplete (N_{t&inc}), or 3) false. True and incomplete facts either lack arguments
that are present in the sentence or contain underspecified arguments, but are
nevertheless valid statements. Then

R = (N_{t&c} + N_{t&inc}) / N_t,  P = (N_{t&c} + N_{t&inc}) / N_f,  C = N_{t&c} / N_f,  and  F = 2RP / (R + P).
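Restated as a small helper over the manually judged counts (the function and the example counts are illustrative, not the authors' evaluation code):

def fact_metrics(n_true_complete, n_true_incomplete, n_should, n_found):
    recall = (n_true_complete + n_true_incomplete) / n_should       # R
    precision = (n_true_complete + n_true_incomplete) / n_found    # P
    completeness = n_true_complete / n_found                       # C
    f1 = 2 * recall * precision / (recall + precision)             # F
    return recall, precision, completeness, f1

print(fact_metrics(60, 20, 107, 95))   # hypothetical counts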
Table 1. Evaluation results of System, Baseline1, Baseline2 and Liuyaohua with R, P, F and C
From Table 1 we observe a significantly higher number of true and complete facts
for our system, as well as higher overall R, P, F, and C. The R of Liuyaohua is 59.39%,
and it suffers quality loss in n-ary fact extraction. Baseline1 finds more facts than the
full system; this is because un-normalized Chinese verb change forms introduce re-
dundant information, and the Stanford Dependency Parser lacks training for those
types of sentences. From the results of the system and Baseline2, we see that rich ex-
traction rules can effectively improve completeness and recall.
4 Conclusion
We described an almost unsupervised approach to the event semantic understanding
task for Chinese children's texts. One major drawback of our fact extraction concerns
the default dependency dep, which indicates unclear grammatical relationships.
Additionally, wrong segmentation and POS tags may produce cumulative errors in the
dependency parse, for example when a noun is wrongly tagged as a verb, or segmented
into a verb and a noun. Future work will focus on using very fast dependency parsers
and on compiling rich linguistic grammars of Chinese-specific structures to improve
the extraction results.
References
1. Wang, W.: Chinese News Event 5WH Semantic Elements Extraction for Event Ontology
Population. In: Proceedings of the 21st International Conference Companion on World
Wide Web, Lyon, France, pp. 197–202 (2012)
2. Liu, Y.H.: Chinese Event Extraction Based on Syntactic Analysis. MA Thesis. Shanghai
University, China (2009)
3. Gamallo, P., Garcia, M.: Dependency-Based Open Information Extraction. In: Proceedings
of the 13th Conference of the European Chapter of the Association for Computational Lin-
guistics, Avignon, France, pp. 10–18 (2012)
4. Allen, J.F.: Maintaining Knowledge about Temporal Intervals. Communications of the
ACM 26, 832–843 (1983)
5. Chang, P.C., Tseng, H., Jurafsky, D., Manning, C.D.: Discriminative Reordering with Chi-
nese Grammatical Relations Features (2010),
http://nlp.stanford.edu/pubs/ssst09-chang.pdf
6. Ruiz de Mendoza Ibáñez, F.J., Mairal Usón, R.: Levels of description and constraining fac-
tors in meaning construction: an introduction to the Lexical Constructional Model (2008),
http://www.lexicom.es/drupal/files/RM_Mairal_2008_Folia_Linguistica.pdf
The Development of an Ontology for Reminiscence
Abstract. The research presented in this paper investigates the construction of an
ontology of reminiscence and the feasibility of its use in a conversational agent (CA)
with suitable reminiscence mechanisms, for non-clinical use within a healthy aging
population who may experience memory loss as part of normal aging, thereby
improving subjective well-being (SWB).
1 Introduction
1.1 Why an Ontology Is Important
The use of ontologies in computer science has been steadily emerging into the
discipline over several decades. The evolution of the semantic web has encouraged
the development of ontologies. This is because an ontology represents the shared
understanding and the well-defined meaning of a domain of interest, thereby enabling
computers and people to collaborate better [1].
2 Ontology: Production
2.1 Methods
3 Ontology of Reminiscence

Initially, text was written down in natural language that described the reminiscence
domain. This allowed the creation of a glossary of natural language terms and defini-
tions [9] (Figure 1). Ontology was initially proposed by the artificial intelligence
community to model declarative knowledge for knowledge-based systems, to be
shared with other systems. Once the concept of ontology was defined, production of
the Ontology of Reminiscence was begun. A preliminary ontology was created and
mapped to the WordNet hierarchy [3,4,10,11,12], and then implemented within the

Fig. 2. Mapping of the moon landing data
conversational agent ‘Betty’ (CA). The mapping to the WordNet hierarchy was
achieved by breaking down the ontology into nouns, adjectives, opposites, preposi-
tions, verbs, and concepts. These were then mapped to elements within WordNet and
scripted in the program of the CA.
The CA, ‘Betty’, has both short- and long-term memory. This means that ‘Betty’
can listen, talk, and remember, all by using saved variables. What the user has already
said during the conversation can be checked, using conditions to verify whether a
variable was set, as well as rules controlling input parameters. After each input,
the CA first tries to understand the user input, then updates the current state and
generates output to the user. All inputs and outputs are appended to a log. These
logs can be studied to enable the CA to be updated as required; a minimal sketch of
this loop follows.
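A minimal sketch of the input-update-output loop and variable-based memory just described; the rule format and variable names are illustrative assumptions, not the scripting language actually used for ‘Betty’.

class ReminiscenceCA:
    def __init__(self, rules):
        self.rules = rules        # (trigger word, variable to set, response)
        self.memory = {}          # saved variables = short/long term memory
        self.log = []

    def respond(self, user_input: str) -> str:
        self.log.append(("user", user_input))      # keep the full transcript
        output = "Tell me more."
        for trigger, variable, response in self.rules:
            # fire only once per topic: check whether the variable was set
            if trigger in user_input.lower() and variable not in self.memory:
                self.memory[variable] = user_input  # remember what was said
                output = response
                break
        self.log.append(("betty", output))
        return output

betty = ReminiscenceCA(rules=[
    ("moon landing", "moon_landing", "Where were you when Apollo 11 landed?"),
    ("school", "school_days", "What was your favourite subject at school?"),
])
print(betty.respond("I remember the moon landing very well"))
print(betty.respond("I also remember the moon landing"))  # variable already set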
3.1 Experiments
This research conducted a pilot evaluation via a comparative usability test with 5
people, to explore whether the CA ‘Betty’ effectively contributed to reminiscence in
terms of its functionality and interface. For our test, a group of five over-45-year-olds
spoke with ‘Betty’ for a five-minute period. This was to test the precision, recall, and
accuracy of the CA [5]. Further experiments were run to test for user subjective well-
being and memory recall improvement. These were carried out with 30 participants
aged 45+ and showed that well-being was improved by the use of the CA and that the
participants’ recall of past events was increased. Well-being was measured before and
after use of the CA with a general anxiety and depression scale. The
application of an Everyday Memory Questionnaire (EMQ) [7] demonstrated a
noticeable difference in cognitive ability after use of the CA. This more direct as-
sessment of the errors experienced by older adults during their daily activities may be
more useful for directing research into developing an intervention that will have a
practical and therapeutic impact [13].
References
1. Gómez-Pérez, A.: Knowledge sharing and reuse. In: Liebowitz (ed.) Handbook of Applied
Expert Systems. CRC Press, Boca Raton (1998)
2. Gruninger, M., Fox, M.S.: Methodology for the Design and Evaluation of Ontologies. In:
Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI
1995, Montreal (1995)
3. Hendler, J., McGuinness, D.L.: The DARPA Agent Markup Language. IEEE Intelligent
Systems 16(6), 67–73 (2000)
4. Humphreys, B.L., Lindberg, D.A.B.: The UMLS project: making the conceptual
connection between users and the information they need. Bulletin of the Medical Library
Association 81(2), 170 (1993)
5. Nielsen, J., Landauer, T.K.: A mathematical model of the finding of usability problems. In:
Proceedings of ACM INTERCHI 1993 Conference, Amsterdam, The Netherlands, April
24-29, pp. 206–213 (1993)
6. Webster, J.D.: The reminiscence functions scale: a replication. International Journal of
Aging and Human Development 44(2), 137–148 (1997)
7. Wagner, N., Hassanein, K., Head, M.: Computer use by older adults. A multi-disciplinary
review. Computers in Human Behaviour 26, 870–882 (2010)
8. Parker, J.: Positive communication with people who have dementia. In: Adams, T.,
Manthorpe, J. (eds.) Dementia Care, pp. 148–163. Arnold, London (2003)
9. Butler, R.N.: The Life Review: An interpretation of reminiscence in the aged.
Psychiatry 26, 65–76 (1963)
10. Miller, G.A.: WordNet: A Lexical Database for English. Communications of the
ACM 38(11), 39–41 (1995)
11. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge
(1998)
12. WordNet: An Electronic Lexical Database (citations above) is available from MIT Press,
http://mitpress.mit.edu (accessed on January 14, 2013)
13. Duong, C., Maeder, A., Chang, E.: ICT-based visual interventions addressing social
isolation for the aged. Studies Health Technology Inform. 168, 51–56 (2011)
Chinese Sentence Analysis Based on Linguistic
Entity-Relationship Model
Dechun Yin*
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
[email protected]
1 Introduction
Many rule-based and corpus-based methods have been proposed for Chinese syn-
tactic parsing. Rule-based methods need a large number of generation rules that
are often manually edited by developers, and corpus-based methods need a large-
scale corpus to train the linguistic model. To avoid the laborious, costly, and
time-consuming work of editing rules and building corpora, we propose the
linguistic entity-relationship model (LERM). In the model, we use only a few meta-
rules to describe the grammars and parse sentences.
* Corresponding author.
modes, are extracted and generalized from the Chinese corpus with the help of a Chinese
grammar dictionary, and they can therefore describe the most basic grammatical
and semantic logic of Chinese sentences.
Generally, the relationship modes of linguistic entities are lexicalized and built on
the verb, and they can describe the syntactic and semantic structure of a sentence,
such as the relationship between predicate and arguments. Furthermore, the relationship
modes can also be built on the adjective in an adjective-predicate sentence, or on the
noun in a noun-predicate sentence. In this paper we present only the relationship modes
built on the verb, because they are the most important and complex relation-
ship modes compared with the others.
LERM has five basic relationships, and each relationship includes several rela-
tionship modes. The relationship G denotes the subject-verb-object (SVO) or subject-
verb (SV) sentence; D denotes the double-object sentence; C denotes the causative or
imperative sentence; L includes, but is not limited to, the "是" sentence; and E includes,
but is not limited to, the "有" sentence (i.e., the sentence of being or existential
sentence). They are described in detail in Table 1. In the relationship modes, entity
a, b, c, or s is an argument. In particular, s is a special entity consisting of a subsentence,
which reflects the recursive nature of natural language. The relationship G, D, C, L, or
E is the predicate and is usually built on the verb. However, in some special Chinese
sentences, such as the adjective-predicate sentence whose relationship is G, the predicate
is an adjective, and the relationship modes of G include only aG and sG.
The relationship modes are lexicalized by being built on verbs. They are semi-
automatically extracted, manually edited, and stored in the linguistic entity-relationship
dictionary. For example, some relationship modes built on the verb "看" are described,
with the conceptual constraints of their entities, in Table 2; a sketch of such a dictionary
follows.
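A sketch of such a lexicalized dictionary lookup; the mode strings and conceptual constraints below are illustrative guesses loosely modelled on Table 2, not the actual dictionary contents.

LERM_DICT = {
    "看": [  # kan: 'look at' / 'consider', depending on its arguments
        {"meaning": "look at", "mode": "aGb",
         "constraints": {"a": "human", "b": "physical object"}},
        {"meaning": "consider", "mode": "aGs",   # s: sub-sentence argument
         "constraints": {"a": "human", "s": "subsentence"}},
    ],
}

def candidate_modes(verb, argument_types):
    """Return the relationship modes of a verb whose conceptual
    constraints are compatible with the observed argument types."""
    matches = []
    for entry in LERM_DICT.get(verb, []):
        if all(argument_types.get(slot) == typ
               for slot, typ in entry["constraints"].items()):
            matches.append(entry)
    return matches

print(candidate_modes("看", {"a": "human", "b": "physical object"}))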
Parsing then uses a few meta-rules, which are similar to the generation rules of
context-free grammar (CFG). The meta-rules are used for recursively parsing the
subsequences X and Y of CS.
Table 2. Some meanings and relationship modes of the verb "看"
Table 3 shows that the LERM system achieves better performance. The labeled and
root accuracies are especially encouraging. Since the entry point of the parsing is the
verb, which is often the root node of a sentence, the parsing can take global syntactic
and semantic features into account. This ensures that the verb and its arguments are
adequately analyzed and verified. As a result, the root accuracy of the LERM system is
remarkably higher than that of the baseline system, and this also benefits the labeled
accuracy.
A Dependency Graph Isomorphism
for News Sentence Searching
Abstract. Given that the amount of news being published is only in-
creasing, an effective search tool is invaluable to many Web-based com-
panies. With word-based approaches ignoring much of the information in
texts, we propose Destiny, a linguistic approach that leverages the syn-
tactic information in sentences by representing sentences as graphs with
disambiguated words as nodes and grammatical relations as edges. Des-
tiny performs approximate sub-graph isomorphism on the query graph
and the news sentence graphs, exploiting word synonymy as well as hy-
pernymy. Employing a custom corpus of user-rated queries and sentences,
the algorithm is evaluated using the normalized Discounted Cumulative
Gain, Spearman’s Rho, and Mean Average Precision and it is shown
that Destiny performs significantly better than a TF-IDF baseline on
the considered measures and corpus.
1 Introduction
With the Web continuously expanding, humans are required to handle increas-
ingly larger streams of news information. While skimming and scanning can save
time, it would be even better to harness the computing power of modern ma-
chines to perform the laborious tasks of reading all these texts for us. In the past,
several approaches have been proposed, the most prominent being TF-IDF [6],
which uses a bag-of-words approach. Despite its simplicity, it has been shown
to yield good performance for fields like news personalization [1]. However, the
bag-of-words approach does not use any of the more advanced linguistic features
that are available in a text (e.g., part-of-speech, parse tree, etc.).
In this paper we propose a system that effectively leverages these linguistic
features to arrive at a better performance when searching news. The main idea is
to use the dependencies between words, which is the output of any dependency
parser, to build a graph representation of a sentence. Then, each word is denoted
as a node in the graph, and each edge represents a grammatical relation or
dependency between two words. Now, instead of comparing a set of words, we
can perform sub-graph isomorphism to determine whether the sentence or part
of a sentence as entered by the user can be found in any of the sentences in
the database. Additionally, we implemented the simplified Lesk algorithm [4] to
perform word sense disambiguation for each node, so that it will represent the
word together with its sense.
The method we propose to compare two graphs is inspired by the backtracking
algorithm of McGregor [5], but is adjusted to cope with partial matches. The
latter is necessary since we do not only want to find exact matches, but also
sentences that are similar to our query to some extent. As such, we aim to
produce a ranking of all sentences in the database given our query sentence.
2 News Searching
To compare two graphs, we traverse both the query sentence graph and each
of the news sentence graphs in the database in a synchronized manner. Given
a pair of nodes that are suitable to compare, we then recursively compare each
dependency and attached node, assigning points based on similarity of edges and
nodes. In this algorithm, any pair of nouns and any pair of verbs is deemed a
proper starting point for the algorithm. Since this results in possibly more than
one similarity score for this news-query sentence combination, we only retain the
highest one.
The scoring function is implemented as a recursive function, calling itself with
the next nodes in both the query graph and the news item graph that need to
be compared. In this way, it traverses both graphs in parallel until one or more
stopping criteria have been met. The recursion will stop when there are either
no more nodes or edges left to compare in either or both of the graphs, or when
the nodes that are available are too dissimilar to justify comparing more nodes
in that area of the graph. When the recursion stops, the value returned by the
scoring function is the accrued value of all comparisons made between nodes and
edges from the query graph and the news item graph.
A genetic algorithm has been employed to optimize the parameters that weigh the
similarity scores when comparing nodes and edges. Most parameters weigh features,
but an additional parameter controls the recursion: if no edge and node connected to
the current node can exceed this parameter, the recursion stops in that direction.
Computing the similarity score of edges is simply done by comparing the edge
labels, which denote the type of grammatical relation (e.g., subject, object, etc.).
For nodes, we compare five word characteristics: stem, lemma, literal word, basic
POS category (e.g., noun, verb, adjective, etc.), and detailed POS category (plural
noun, proper noun, verb in past tense, etc.). These lexico-syntactic features are
complemented by a check on synonymy and hypernymy using the acquired word
senses and WordNet [2]. Last, by counting all stems in the database, we adjust the
node score to be higher when a rare word rather than a regular word is matched.
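The following condensed sketch illustrates this scoring scheme: nodes are compared on the five lexico-syntactic features plus a synonymy/hypernymy check, edges on their grammatical-relation labels, and the comparison recurses through both graphs in parallel. The feature weights, threshold, and graph encoding are illustrative assumptions (in Destiny the weights are tuned by the genetic algorithm), and the rare-word adjustment is omitted.

FEATURES = ("stem", "lemma", "word", "pos", "detailed_pos")
WEIGHTS = {"stem": 1.0, "lemma": 1.0, "word": 0.5, "pos": 0.5,
           "detailed_pos": 0.5, "synonym": 1.5, "edge": 1.0}
THRESHOLD = 0.5   # stop recursing in a direction that cannot exceed this

def node_score(q, n, synonyms):
    score = sum(WEIGHTS[f] for f in FEATURES if q[f] == n[f])
    if n["lemma"] in synonyms.get(q["lemma"], set()):
        score += WEIGHTS["synonym"]       # WordNet synonymy/hypernymy check
    return score

def match(q_graph, n_graph, q_id, n_id, synonyms, visited=frozenset()):
    """Traverse both graphs in parallel, accruing node and edge similarity."""
    score = node_score(q_graph[q_id]["node"], n_graph[n_id]["node"], synonyms)
    for q_rel, q_child in q_graph[q_id]["edges"]:
        best = 0.0
        for n_rel, n_child in n_graph[n_id]["edges"]:
            if (q_child, n_child) in visited:
                continue
            s = (WEIGHTS["edge"] if q_rel == n_rel else 0.0) + match(
                q_graph, n_graph, q_child, n_child, synonyms,
                visited | {(q_child, n_child)})
            best = max(best, s)
        if best > THRESHOLD:              # too dissimilar: stop this direction
            score += best
    return score

# Toy graphs: id -> {"node": feature dict, "edges": [(relation, child id)]}.
q = {0: {"node": dict(stem="acquir", lemma="acquire", word="acquires",
                      pos="verb", detailed_pos="VBZ"), "edges": [("subj", 1)]},
     1: {"node": dict(stem="googl", lemma="Google", word="Google",
                      pos="noun", detailed_pos="NNP"), "edges": []}}
n = {0: {"node": dict(stem="buy", lemma="buy", word="buys",
                      pos="verb", detailed_pos="VBZ"), "edges": [("subj", 1)]},
     1: {"node": dict(stem="googl", lemma="Google", word="Google",
                      pos="noun", detailed_pos="NNP"), "edges": []}}
# In Destiny, every noun-noun and verb-verb pair is tried as a starting point
# and only the highest resulting score is retained for the sentence pair.
print(match(q, n, 0, 0, synonyms={"acquire": {"buy"}}))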
3 Evaluation
In this section, the performance of the Destiny algorithm is measured and com-
pared with the TF-IDF baseline. To that end, we have created a database of 19
news items, consisting of 1019 sentences in total, and 10 query sentences. All
possible combinations of query sentence and news sentence were annotated by
at least three different persons and given a score between 0 (no similarity) and
3 (very similar). Queries are constructed by rewriting sentences from the set of
news item sentences. In rewriting, the meaning of the original sentence was kept
the same as much as possible, but both words and word order were changed
(for example by introducing synonyms and swapping the subject-object order).
The results are compared using the normalized Discounted Cumulative Gain
(nDCG) over the first 30 results, Spearman’s Rho, and Mean Average Precision
(MAP). Since the latter needs to know whether a result is relevant or not, and
pairs of sentences are marked with a score between 0 and 3, we need a cut-off
value: above a certain similarity score, a result is deemed relevant. Since this
is a rather arbitrary decision, the reported MAP is the average MAP over all
possible cut-off values with a step size of 0.1, from 0 to 3.
Measure   TF-IDF mean score   Destiny mean score   rel. improvement   t-test p-value
nDCG      0.238               0.253                11.2%              < 0.001
MAP       0.376               0.424                12.8%              < 0.001
Sp. Rho   0.215               0.282                31.6%              < 0.001
4 Concluding Remarks
words are not only compared on a lexico-syntactic level, but also on a seman-
tic level by means of the word senses as determined by the word sense disam-
biguation implementation. This also allows for checks on synonymy and hyper-
nymy between words. Last, the performance results on the Mean Average Preci-
sion, Spearman’s Rho, and normalized Discounted Cumulative Gain demonstrate
the significant gain in search results quality when using Destiny compared to
TF-IDF.
Interesting topics for future work include the addition of named entity recog-
nition and co-reference resolution to match multiple referrals to the same entity
even though they might be spelled differently. Our graph-based approach would
especially be suitable for an approach to co-reference resolution like [3], as it also
utilizes dependency structure to find the referred entities.
References
1. Ahn, J., Brusilovsky, P., Grady, J., He, D., Syn, S.Y.: Open User Profiles for Adap-
tive News Systems: Help or Harm? In: 16th International Conference on World Wide
Web (WWW 2007), pp. 11–20. ACM (2007)
2. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press (1998)
3. Haghighi, A., Klein, D.: Coreference Resolution in a Modular, Entity-Centered
Model. In: Human Language Technology Conference of the North American Chapter
of the Association for Computational Linguistics (HLT-NAACL 2010), pp. 385–393.
ACL (2010)
4. Kilgarriff, A., Rosenzweig, J.: English senseval: Report and results. In: 2nd Interna-
tional Conference on Language Resources and Evaluation (LREC 2000), pp. 1239–
1244. ELRA (2000)
5. McGregor, J.J.: Backtrack Search Algorithms and the Maximal Common Subgraph
Problem. Software Practice and Experience 12(1), 23–34 (1982)
6. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill
(1983)
Unsupervised Gazette Creation
Using Information Distance
1 Introduction
The problem of information extraction for agriculture is particularly important
as well as challenging, due to the non-availability of any tagged corpus. Several
domain-specific named entities (NE) occur in documents (such as news)
related to the agriculture domain: CROP (names of crops, including varie-
ties), DISEASE (names of crop diseases and disease-causing agents such as
bacteria, viruses, fungi, insects, etc.), and CHEMICAL TREATMENT (names of
pesticides, insecticides, fungicides, etc.). The NE extraction (NEX) problem consists
of automatically constructing a gazette containing example instances of each
NE of interest. In this paper, we propose a new bootstrapping approach to NEX
and demonstrate its use for creating gazettes of NE in the agriculture domain.
Apart from the new application domain (agriculture) for NE extraction, the most
important contribution of this paper is the use of a new variant of the information
distance [2], [1] to decide whether a candidate phrase is a valid instance of the
NE or not.
4 Experimental Evaluation
The benchmark corpus consists of 30533 documents in English containing 999168
sentences and approximately 19 million words. It was collected using crawler4j (an
open-source web crawler by Yasser Ganjisaffar, code.google.com/p/crawler4j/) by
crawling agriculture news websites such as the FarmPress group (permission for
public release of the corpus is awaited from the content owners). Some
of the seeds used for each NE type are as follows:
– CROP: wheat, cotton, corn, soybean, strawberry, tomato, bt cotton
– DISEASE: sheath blight, wilt, leaf spot, scab, rot, rust, nematode
– CHEMICAL TREATMENT: di-syston, metalaxyl, keyplex, evito, verimark
Starting with the candidate list C and the initial seed list for T, the algorithm
CreateGazette_MED iteratively created the final set of 500 candidates based on
MED_{D,K}. The post-processing step further prunes this list to produce
the final gazette for the NE type T; a sketch of this loop follows.
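A high-level sketch of the bootstrapping loop implied here. The actual MED_{D,K} variant of the information distance is defined in a section omitted above, so `distance` is a pluggable stand-in; the batch size and the toy distance are our own illustrative choices, while the 500-candidate budget follows the text.

def create_gazette(candidates, seeds, distance, budget=500, batch=10):
    gazette = list(seeds)
    pool = [c for c in candidates if c not in seeds]
    while len(gazette) < budget and pool:
        # rank remaining candidates by proximity to the current gazette
        pool.sort(key=lambda c: min(distance(c, g) for g in gazette))
        k = min(batch, budget - len(gazette))
        taken, pool = pool[:k], pool[k:]
        gazette.extend(taken)          # promote the closest batch
    return gazette                     # post-processing / Assessor prunes this

# Toy stand-in distance: phrases sharing a head word are "close".
toy_distance = lambda a, b: 0.0 if a.split()[-1] == b.split()[-1] else 1.0
print(create_gazette(["leaf spot", "tomato", "black rot"],
                     seeds=["rot", "wilt"], distance=toy_distance, budget=4))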
Gazette sizes for each NE type are shown in Fig. 2(a). The detection rate for
each NE is shown in Fig. 2(b). The Assessor improves precision for all NE types
for both measures MED_{D,K} and PMI. We compare the proposed algorithm
with BASILISK [3]. Also, to gauge the effectiveness of MED_{D,K} as a proxim-
ity measure, we compare it with PMI. To highlight the effectiveness of the created
gazettes, we compared our DISEASE gazette with Wikipedia. It was quite en-
couraging to find that our gazette, though created on a corpus of limited size,
contained diseases/pathogens not present in Wikipedia (verified on 30 January 2013).
Some of these are: limb rot, grape colaspis, black shank, glume blotch, mexican rice
borer, hard lock, seed corn maggot, green bean syndrome.
Fig. 2. (a) Number of entries in the final gazette for each NE type. (To use the same
baseline for comparing the precision of the proposed algorithm and BASILISK, we use a
gazette size for BASILISK equal to that of MED_{D,K} with Assessor.) (b) Detection rate
of CreateGazette_MED with Assessor.
5 Conclusions
In this paper, we proposed a new unsupervised (bootstrapping) NEX technique
for automatically creating gazettes of domain-specific named entities. It is based
on a new variant of the Multiword Expression Distance (MED) [1]. We also
compared the effectiveness of the proposed method with PMI and BASILISK [3]. To
the best of our knowledge, this is the first time that NEX techniques have been used
for the agricultural domain.
References
1. Bu, F., Zhu, X., Li, M.: Measuring the non-compositionality of multiword expres-
sions. In: Proc. of the 23rd Conf. on Computational Linguistics, COLING (2010)
2. Bennett, C., Gacs, P., Li, M., Vitanyi, P., Zurek, W.: Information distance. IEEE
Transactions on Information Theory 44(4), 1407–1423 (1998)
3. Thelen, M., Riloff, E.: A bootstrapping method for learning semantic lexicons using
extraction pattern contexts. In: Proceedings of the Conference on Empirical Methods in
Natural Language Processing, EMNLP 2002 (2002)
A Multi-purpose Online Toolset
for NLP Applications
1 Introduction
The idea of making a linguistic toolset available online is not new; among other
initiatives, it has been promoted by CLARIN (Common Language Resources and
Technology Infrastructure, see www.clarin.eu), following its aspirations for gathering
Web services offering language processing tools [3], and by related initiatives
such as WebLicht.
The first version of a toolset for Polish made available in the Web service
framework was proposed in 2011 and called the Multiservice [2]. Its main
purpose was to provide a consistent set of mature annotation tools — previously
tested in many offline contexts, following the open-source paradigm and under
active maintenance — offering basic analytical capabilities for Polish.
Since then, the Multiservice has been thoroughly restructured and new lin-
guistic tools have been added. The framework currently features a morphological
analyzer Morfeusz PoliMorf, two disambiguating taggers Pantera and Concraft,
a shallow parser Spejd, the Polish Dependency Parser, a named entity recognizer
Nerf and a coreference resolver Ruler.
2 Architecture
The Multiservice allows for chaining requests involving integrated language tools:
requests to the Web service are enqueued and processed asynchronously, which
allows for processing larger amounts of text. Each call returns a token used to
check the status of the request and retrieve the result when processing completes;
a client-side sketch of this flow is shown below.
The work reported here was partially funded by the Computer-based methods for
coreference resolution in Polish texts (CORE) project financed by the Polish Na-
tional Science Centre (contract number 6505/B/T02/2011/40).
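A client-side sketch of this token-based flow. The client object and its method names (submit/status/result) are hypothetical stand-ins, since the real service speaks SOAP or the Thrift binary protocol; only the enqueue-poll-retrieve pattern follows the text.

import time

def process_text(client, text, chain=("Morfeusz", "Concraft", "Spejd")):
    token = client.submit(text, chain)       # request is enqueued server-side
    while client.status(token) != "DONE":    # larger texts take a while
        time.sleep(1.0)
    return client.result(token)              # TEI P5 XML or JSON equivalent

class FakeClient:
    """In-memory stand-in so the flow can be exercised without a server."""
    def __init__(self):
        self.jobs = {}
    def submit(self, text, chain):
        token = str(len(self.jobs))
        self.jobs[token] = f"<annotated chain={'|'.join(chain)}>{text}</annotated>"
        return token
    def status(self, token):
        return "DONE"
    def result(self, token):
        return self.jobs[token]

print(process_text(FakeClient(), "Ala ma kota."))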
One of the major changes in the current release is a redesign of the internal
architecture with the Apache Thrift framework (see thrift.apache.org, [1]),
used for internal communication across the service. It features a unified API for
data exchange and RPC, with automatically generated code for the most popular
modern programming languages (including C++, Java, Python, and Haskell),
the ability to create a TCP server implementing such an API in just a few lines
of code, no requirement of using JNI for communication across various languages
(unlike in UIMA), and much better performance than XML-based solutions.
The most important service in the infrastructure is the Request Manager,
using a Web Service-like interface with the Thrift binary protocol instead of
SOAP messages. It accepts new requests, saves them to the database as Thrift
objects, keeps track of associated language tools, selects the appropriate ones
for completing the request, and finally invokes each of them as specified in the
request and saves the result to the database.
Since the Request Manager service runs as a separate process (or even a
separate machine), it can potentially be distributed across multiple machines
or use a different DBMS without significant changes to other components. The
service can easily be extended to support communication APIs other than SOAP
or Thrift and the operation does not create significant overhead (sending data
using Apache Thrift binary format is much less time-consuming than sending
XMLs or doing actual linguistic analysis of texts).
Requests are stored in db4o — an object oriented database management sys-
tem which integrates smoothly with regular Java classes. Each arriving request
is stored directly in the database, without any object-relational mapping code.
Language tools run as servers listening to dedicated TCP ports and may
be distributed across multiple machines. There are several advantages of such
architecture, the first of which is its scalability — when the system is under
heavy load, it is relatively easy to run new service instances. Test versions of
the services can be used in a request chain without any configuration — there
is simply an optional request parameter that tells the address and port of the
service. Plugging-in new language tools is equally easy — Apache Thrift makes
it possible to create a TCP server implementing a given RPC API in just a few
lines of code.
The tools offer two interchangeable formats, supporting chaining and uniform
presentation of linguistic results: TEI P5 XML and its JSON equivalent. The
TEI P5 format is a packaged version of the stand-off annotation used by the
National Corpus of Polish (NKJP [4]), extended with new annotation layers
originally not available in NKJP.
Sample Python and Java clients for accessing the service have been imple-
mented. To facilitate non-programming experiments with the toolset, a sim-
ple Django-based Web interface (see Fig. 1) is offered to allow users to create
toolchains and enter texts to be processed.
4 Conclusions
As compared to its offline installable equivalents, the toolset provides users with
access to the most recent versions of tools in a platform-independent manner and
without any configuration. At the same time, it offers developers a useful and
extensible demonstration platform, prepared for easy integration of new tools
within a common programming and linguistic infrastructure. We believe that
the online toolset will find its use as a common linguistic annotation platform
for Polish, similar to positions taken by suites such as Apache OpenNLP or
Stanford CoreNLP for English.
References
1. Agarwal, A., Slee, M., Kwiatkowski, M.: Thrift: Scalable cross-language services
implementation. Tech. rep., Facebook (April 2007),
http://thrift.apache.org/static/files/thrift-20070401.pdf
2. Ogrodniczuk, M., Lenart, M.: Web Service integration platform for Polish linguistic
resources. In: Proceedings of the 8th International Conference on Language Re-
sources and Evaluation, LREC 2012, pp. 1164–1168. ELRA, Istanbul (2012)
3. Ogrodniczuk, M., Przepiórkowski, A.: Linguistic Processing Chains as Web Services:
Initial Linguistic Considerations. In: Proceedings of the Workshop on Web Services
and Processing Pipelines in HLT: Tool Evaluation, LR Production and Validation
(WSPP 2010) at the 7th Language Resources and Evaluation Conference (LREC
2010), pp. 1–7. ELRA, Valletta (2010)
4. Przepiórkowski, A., Bańko, M., Górski, R.L., Lewandowska-Tomaszczyk, B. (eds.):
Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warsaw (2012)
5. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: BRAT:
a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demon-
strations at the 13th Conference of the European Chapter of the Association for
Computational Linguistics, EACL 2012, pp. 102–107. Association for Computa-
tional Linguistics, Stroudsburg (2012)
A Test-Bed for Text-to-Speech-Based Pedestrian
Navigation Systems
1 Introduction
The automated generation of route directions has been the subject of many
recent academic studies (See for example the references in [1], or the very re-
cent works [2,3]) and commercial projects (e.g. products by Garmin, TomTom,
Google, Apple, etc.). The pedestrian case (as opposed to the automobile case)
is particularly challenging because the location of the pedestrian is not just re-
stricted to the road network and the pedestrian is able to quickly face different
directions. In addition, the scale of the pedestrian’s world is much finer, thus
requiring more detailed data. Finally the task is complicated by the fact that
the pedestrian, for safety, should endeavor to keep their eyes and hands free –
there is no room for a fixed dashboard screen to assist in presenting route di-
rections. We take this last constraint at full force – in our prototype there is no
map display; the only mode of presentation is text-to-speech instruction heard
incrementally through the pedestrian’s earpiece.
We present a system to support eyes-free, hands-free navigation through a city.
Our system operates in two distinct modes: manual and automatic. In manual
The research leading to these results has received funding from the European Commu-
nity's Seventh Framework Programme (FP7/2007-2013) under grant agreement no.
270019 (SpaceBook project, www.spacebook-project.eu) as well as a grant through
the Kempe foundation (www.kempe.com).
Fig. 1. Operator’s interface in manual mode guiding a visitor to ICA Berghem, Umeå
The technical specification and design of our system, with an initial reactive
controller, is described in a technical report [1]. That report gives a snapshot
of our system as of October 2012. In the ensuing months we have worked to
optimize, re-factor, and stabilize the system in preparation for its open-source
release under the working name janus (interested readers are encouraged to contact
us if they wish to receive a beta release). We have also further developed the
infrastructure to integrate FreeSWITCH for speech and some extra mechanisms
to handle image streams. Finally, we have added a facility that logs phone pic-
tures to PostgreSQL BLOBs, the TTS messages to PostgreSQL text fields, and
the audio streams to files on the file system. Aside from server-side PL/pgSQL
functions, the system is written exclusively in Java, and it uses ZeroC ICE for
internal communication. Detailed install instructions exist for Debian "wheezy".
2 Field Tests
We have carried out field tests since late summer 2012. The very first tests
were over an area that covered the Umeå University campus, extending north
to Mariahem (an area of roughly 4 square kilometers, 1788 branching points,
3027 path segments, 1827 buildings). For a period of several weeks, the first au-
thor tested the system 3-4 times per week while walking or riding his bicycle to
work and back. The system was also tested numerous times walking around the
Umeå University campus. A small patch of the campus immediately adjacent
to the MIT-Huset was authored with explicit phrases, overriding the automat-
ically generated phrases of a primitive NLG component (see the example in
[1]). These initial tests were dedicated to validating capabilities, confirming
bug fixes, and getting a feel for what is and is not important in this domain.
For example, problems like the quantity and timing of utterances (too much or
too little speech, utterances issued too late or too early) and oscillations in the
calculation of the facing direction led to a frustrating user experience. Much effort
was directed toward fixing parameters in the underlying system, adding further
communication rules and state variables, etc.
In addition to these tests, in November 2012 we conducted an initial test of our
manual interface in Edinburgh (our database covered an area of roughly 5 square
kilometers, 4754 branching points, 9082 path segments, 3020 buildings) – walk-
ing the exact path used in the Edinburgh evaluations of the initial SpaceBook
prototype developed by SpaceBook partners Heriot-Watt and Edinburgh Uni-
versity [2]. With the PhoneApp running in Edinburgh and all back-end compo-
nents running in Umeå, the latencies introduced by the distance did not render
the system inoperable. Note that we did not test the picture capability at that
time, as it had not yet been implemented.
Due to the long winter, we conducted only a few outdoor tests with
the system from November 2012 to April 2013. The experiments we have run
have been in an area surrounding KTH in Stockholm (an area slightly over 2
square kilometers, 1689 branching points, 3097 path segments, 542 buildings),
in the center of Åkersberga, and in continued tests on the Umeå University campus.
With the warming of the weather we look forward to a series of field tests and
evaluations over the spring and summer of 2013.
3 System Performance
Our optimization efforts have been mostly directed at minimizing latencies and
improving the performance of map rendering in our virtual pedestrian/tracking
tool. There are three latencies to consider from the PhoneApp to the controller
(GPS report, speech packet, image) and one latency to consider from the con-
troller to the PhoneApp (text message transmission). We are still working on
reliable methods to measure these latencies and, more importantly, their vari-
ability. In local installations (e.g. back-end components and PhoneApp running
in Umeå) the system latencies are either sub-second or up to 1-2 seconds – a
perfectly adequate level of performance. Running remotely (e.g. back-end com-
ponents running in Umeå and PhoneApp in Edinburgh) appears to simply add
a fixed constant to all four latencies.
All the map data is based on XML exports of OpenStreetMap data con-
verted to SQL using the tool osm2sb (see [1]). We have limited our attention
References
1. Minock, M., Mollevik, J., Åsander, M.: Towards an active database platform
for guiding urban pedestrians. Technical Report UMINF-12.18, Umeå University
(2012),
https://www8.cs.umu.se/research/uminf/index.cgi?year=2012&number=18
2. Janarthanam, S., Lemon, O., Liu, X., Bartie, P., Mackaness, W., Dalmas, T., Goetze,
J.: A spoken dialogue interface for pedestrian city exploration: integrating naviga-
tion, visibility, and question-answering. In: Proc. of SemDial 2012, Paris, France
(September 2012)
3. Boye, J., Fredriksson, M., Götze, J., Gustafson, J., Königsmann, J.: Walk this way:
Spatial grounding for city exploration. In: Proc. 4th International Workshop on
Spoken Dialogue Systems, IWSDS 2012, Paris, France (November 2012)
4. Minock, M., Mollevik, J.: Prediction and scheduling in navigation systems. In: Pro-
ceedings of the Geographic Human-Computer Interaction (GeoHCI) Workshop at
CHI (April 2013)
Automatic Detection of Arabic Causal Relations
Jawad Sadek
Abstract. The work described in this paper concerns the automatic detection and
extraction of causal relations that are explicitly expressed in Modern Standard
Arabic (MSA) texts. In this initial study, a set of linguistic patterns was derived
to indicate the presence of cause-effect information in sentences from open-do-
main texts. The patterns were constructed from a set of syntactic features
acquired by analyzing a large untagged Arabic corpus, so that the parts of a
sentence representing the cause and those representing the effect can be
distinguished. To the best of the researchers' knowledge, no previous studies have
dealt with this type of relation for the Arabic language.
1 Introduction
Most studies on mining semantic relations focused on the detection of causal relations as
they are fundamental in many disciplines including text generation, information extrac-
tion and question answering systems. Furthermore, they closely relate to other relations
such as temporal and influence relations. These studies attempt to locate causation in
texts using two main approaches; hand-coded patterns [1, 2] and machine learning ap-
proaches that aim to construct syntactic patterns automatically [3]. However, the later has
exploited knowledge resources available for the language they addressed, such as large
annotated corpora, WordNet and Wikipedia. Unfortunately, Arabic Language, so far,
lacks mature knowledge base resources upon which machine learning algorithms can
rely. In this work a set of patterns was identified based on combinations of cue words and
part of speech (POS) labels which tend to appear in causal sentences. The extracted pat-
terns reflect strong causation relations and can be very useful in the future for systems
adopting machine learning techniques in acquiring patterns that indicate causation. The
current study has been developed predominantly to locate intrasentential casual relations
and this is believed to enhance the performance of the previous system for answering
“why” question which was based on finding causality across sentences [4].
two main categories. The first is verbal causality, which can be captured by
the presence of nominal clauses, e.g., المفعول لأجله (accusative of purpose), or by
causality connectors such as لذا (therefore) and من اجل (for), although these connectors
may in many cases signal relations other than causation. The second category is
context-based causality, which can be inferred by the reader using his/her general
knowledge without locating any of the previous indicators. This category includes
various Arabic stylistic structures and is frequently used in rhetorical expressions,
especially in novels, poetry, and the Holy Quran.
The definition of implicit causal relations in Arabic has been controversial among
linguists and has raised many interpretation and acceptance issues. It is not the aim of
this paper to add to these controversies; the study is restricted to the extraction of
explicit relations indicated by ambiguous/unambiguous markers. Altenberg's typology
of causal linkage was of great importance for extracting causal relations in English.
Unfortunately, no such list exists for the Arabic language, so a list of Arabic
causal indicators needed to be created. All the causative connectors from the gram-
marians' perspective mentioned in [5] were surveyed, along with verbs synonymous
with the verb يسبب (cause), such as يؤدي and ينتج, in addition to some
particles commonly used in modern Arabic such as حيث.
The patterns were generated by analyzing a data collection extracted from a large
untagged Arabic corpus called arabiCorpus (http://arabicorpus.byu.edu/search.php).
The pattern development process was based on the same techniques as those used in
[1]; it went through several steps of inductive and deductive reasoning. The two
phases were assembled into a single cycle, so that the patterns continually moved
between them until a set of general patterns was finally reached.
● Inductive Phase: The initial step, which involves making specific observations on a
sample of sentences containing causal relations retrieved from the corpus, and then
detecting regularities and features in the sentences that indicate causation. This leads
to the formulation of tentative patterns specifying cause and effect slots. For example,
pattern (2) was constructed from sentence (1), specifying that the words preceding
نظرا represent the effect part while the words following it represent the cause.
(1) .أﺟﻠﺖ "ﻧﺎﺳﺎ" أﻣﺲ هﺒﻮط ﻣﻜﻮك اﻟﻔﻀﺎء اﺗﻼﻧﺘﺲ وذﻟﻚ ﻧﻈﺮا ﻟﺴﻮء اﻷﺣﻮال اﻟﺠﻮﻳﺔ
"NASA has postponed the landing of the space shuttle Atlantis due to bad weather."
(2) R (&C) [C] AND &This نظرا + [E] &.
● Deductive Phase: This involves examining the patterns formulated in the
previous step. At this stage, the patterns are applied to new text fragments ex-
tracted from the corpus. Three types of errors may be returned in this examination:
■ Undetected Relation: This error occurs when the constructed patterns are unable to
detect the presence of a causal relation in a text fragment. To fix this error, more pat-
terns need to be added so that the missing relation can be identified; in some cases it
may be better to modify a pattern to cover the absent relations by omitting some of
its features. For example, pattern (2) would miss the causal relation in sentence (3),
which omits the word "نظرا". Hence, the new pattern (4) was added.
(3) اوﻟﺖ اﻟﺤﻜﻮﻣﺔ اهﺘﻤﺎﻣﺎ آﺒﻴﺮا ﻟﺘﻄﻮﻳﺮ اﻟﻘﻄﺎع اﻟﺰراﻋﻲ وذﻟﻚ رﻏﺒﺔ ﻣﻨﻬﺎ ﺑﺘﺤﻘﻴﻖ اﻻﻣﻦ اﻟﻐﺬاﺋﻲ
"The government paid great attention to the development of the agricultural sector, aiming to achieve food security."
(4) R (&C) [C] AND ذلك [E] &.
■ Irrelevant Relation: if a word has multiple meanings, the constructed patterns may wrongly recognize a relation as a causal one. For this kind of error, new patterns need to be added and associated with the void value to exclude the expression that caused the defect. For instance, the word "ﻟﺬﻟﻚ" in sentence (5) acts as an anaphoric reference; the new pattern (6) marks it as an irrelevant indicator.
(5) .اﻗﺮأ ﻧﺸﺮة اﻟﺪواء ﻗﺒﻞ ﺗﻨﺎول أي ﺟﺮﻋﺔ ﻣﻨﻪ ﻓﻘﺪ ﻻ ﻳﻜﻮن ﻟﺬﻟﻚ اﻟﺪواء أي ﻋﻼﻗﺔ ﺑﻤﺮﺿﻚ
“Read the drug leaflet before taking it since that drug may not be adequate to your illness”
(6) X C ﻟﺬﻟﻚDTNN C &.
■ Misidentified Slots: in some cases, even though a relevant relation is correctly extracted, the patterns fail to fill the slots properly. A good remedy for this defect is to reorder the patterns so that more specific patterns take priority over more general ones. For example, pattern (8) is not sufficient to correctly fill the cause and effect slots of the causal relation in sentence (7); therefore an additional pattern, such as the one in (10), needs to be inserted before pattern (8).
(7) ﻳﻌﺎﻧﻲ اﻟﻤﻴﺰان اﻟﺘﺠﺎري ﻣﻦ اﻟﺨﻠﻞ وﻟﺬﻟﻚ ﻓﺈن اﻟﺤﻜﻮﻣﺔ ﺑﺪأت ﺑﺈﻗﺎﻣﺔ اﻟﻤﺸﺮوﻋﺎت اﻟﺘﻲ ﺗﻌﺘﻤﺪ ﻋﻠﻰ اﻟﺨﺪﻣﺎت
“Trade deficit has prompted the government to develop the services sector”
(8) R (&C) [E] [ ﻟﺬﻟﻚC] &.
(10) R (&C) [E] (And) [ ﻓﺈن ﻟﺬﻟﻚC] &.
Patterns were formulated as a series of different kinds of tokens separated by spaces. Tokens comprise the following items:
● Particular Word: such as the word "ﻧﻈﺮا" in pattern (2).
● Subpattern Reference: refers to a list containing a sequence of tokens. For instance, the subpattern &This in pattern (2) refers to a list of definite demonstrative nouns.
● Part-of-Speech Tag: represents a certain syntactic category that has been assigned to a text word, such as the definite noun tag in pattern (6).
● Slot: reflects the cause or the effect part of the relation under scrutiny.
● Symbol: instructs the program to take a specific action during the pattern-matching procedure. For example, the plus symbol in pattern (2) instructs the matching procedure to add the word "ﻧﻈﺮا" to the cause slot of the relation (see the sketch below).
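To illustrate how these token types drive the matching procedure, the following sketch is a simplification of our own rather than the paper's pattern engine; the whitespace tokenization, the reduced pattern encoding, and the flag name are assumptions:

# Illustrative matcher (a simplification, not the paper's implementation).
# Each pattern is reduced here to the marker sequence that separates the two
# slots: words before the marker fill the effect slot, words after it fill
# the cause slot, following the description of pattern (2) above.

PATTERNS = [
    {"marker": ["وذلك", "نظرا"], "add_last_marker_to_cause": True},   # like pattern (2)
    {"marker": ["وذلك"], "add_last_marker_to_cause": False},          # like pattern (4)
]

def extract_causal(words):
    """Return (effect, cause) word lists for the first pattern that fires.
    More specific patterns are listed first, mirroring the reordering remedy
    described above for the slot-misidentification errors."""
    for pat in PATTERNS:
        m = pat["marker"]
        for i in range(len(words) - len(m) + 1):
            if words[i:i + len(m)] == m:
                effect, cause = words[:i], words[i + len(m):]
                if pat["add_last_marker_to_cause"]:   # the '+' symbol in (2)
                    cause = [m[-1]] + cause
                return effect, cause
    return None

# Hypothetical usage on a whitespace-tokenized rendering of sentence (1):
# effect, cause = extract_causal(tokens_of_sentence_1)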
4 Experimental Results
The generated patterns were applied to a set of 200 sentences taken from the science category of the Contemporary Arabic Corpus². Three native Arabic speakers were asked to manually identify the presence of causal relations indicated by causal links in each sentence. Out of the 107 relations picked out by the subjects, the patterns discovered a total of 80 relations, giving a Recall of 0.75 and a Precision of 0.77. In reviewing the causal relations missed by the patterns, it turned out that 50% of them had been selected by the subjects based on the occurrence of the "causation fa'a" (ﻓﺎء اﻟﺴﺒﺒﻴﺔ), which was not taken into consideration in this study, while the other half was located by causal links not included in the list.
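For reference, the recall follows directly from these counts; the total number of relations proposed by the patterns is not reported, so the value of roughly 104 below is inferred from the stated precision rather than taken from the paper:

\[ \mathrm{Recall} = \frac{80}{107} \approx 0.75, \qquad \mathrm{Precision} = \frac{80}{N} = 0.77 \;\Rightarrow\; N \approx 104 \]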
The purpose of this study was to develop an approach for the automatic identification of causal relations in Arabic texts. The method performed well using a number of NLP techniques. The extraction system is still under development, as the pattern set is not yet complete. There are some types of verbs whose meaning implicitly induces a causal element; these are called causative verbs, for example "ﻳﻘﺘﻞ" (kill) and "ﻳﻮﻟﺪ" (generate), which can be paraphrased as "to cause to die" and "to cause to happen". Causal relations indicated by such verbs may be explored in future research.
References
1. Khoo, C.S.G., Kornfilt, J., Oddy, R.N., Myaeng, S.H.: Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary and Linguistic Computing 13(4), 177–186 (1998)
2. Garcia, D.: COATIS, an NLP System to Locate Expressions of Actions Connected by Causality Links. In: Plaza, E., Benjamins, R. (eds.) EKAW 1997. LNCS, vol. 1319, pp. 347–352. Springer, Heidelberg (1997)
3. Ittoo, A., Bouma, G.: Extracting Explicit and Implicit Causal Relations from Sparse, Domain-Specific Texts. In: Muñoz, R., Montoyo, A., Métais, E. (eds.) NLDB 2011. LNCS, vol. 6716, pp. 52–63. Springer, Heidelberg (2011)
4. Sadek, J., Chakkour, F., Meziane, F.: Arabic Rhetorical Relations Extraction for Answering
"Why" and "How to" Questions. In: Bouma, G., Ittoo, A., Métais, E., Wortmann, H. (eds.)
NLDB 2012. LNCS, vol. 7337, pp. 385–390. Springer, Heidelberg (2012)
5. Haskour, N.: Al-Sababieh fe Tarkeb Al-Jumlah Al-Arabih. Aleppo University (1990)
² http://www.comp.leeds.ac.uk/eric/latifa/research.htm
A Framework for Employee Appraisals Based on
Inductive Logic Programming and Data Mining Methods
1 Introduction
Employee appraisal systems are widely used for evaluating employee performance [1]. Even though appraisal systems have numerous benefits, some employees question their fairness [2]. Existing commercial appraisal systems focus on recording information rather than on supporting goal setting or ensuring that the objectives are SMART (specific, measurable, achievable, realistic, time-related) [3].
Developing an appraisal system that supports goal setting represents a major challenge: helping employees to write SMART objectives requires finding the rules for writing such objectives. As the objectives are expressed in natural language, natural language processing (NLP) [4] techniques have the potential to be used for defining the process of setting SMART objectives, for example by extracting structured information from unstructured text using automatic methods such as machine learning [5]. Inductive Logic Programming (ILP) [6] is a machine learning discipline that induces rules from examples and background knowledge. As well as having rules that help structure objectives, there is a need to assess whether a stated objective can be met given the available resources and time. Data mining techniques [7] may therefore have the potential to be used for assessing the objectives.
This paper explores the use of machine learning and data mining techniques for
developing a novel system which supports employee appraisals. A new semantic
framework for appraisal systems is proposed. The framework facilitates the setting of
SMART objectives and providing feedback by using ILP to induce rules that define a
grammar for writing SMART objectives. The framework also utilises data mining
techniques for assessing the objectives. Parts of the framework have been
implemented and an empirical evaluation for the framework has been conducted.
The remainder of the paper is organized as follows. Section 2 proposes the system framework. Section 3 describes the corpus and its tagging. Section 4 describes the use
of ILP for writing SMART objectives. Section 5 illustrates the use of data mining
techniques for assessing the objectives. Section 6 presents the empirical evaluation of
the framework and concludes the paper.
A corpus of objectives has been developed containing 150 example sentences related to the sales domain. The corpus has been created by studying what constitutes well-written objectives and reviewing SMART objective examples [3].
The GATE system [8] is utilised for annotating the text in the developed corpus. GATE (General Architecture for Text Engineering) is a publicly available system developed at the University of Sheffield for language engineering. Based on GATE, the sentences (objectives) in the developed corpus are first tokenized; then the part-of-speech (POS) annotations and the named entity (NE) annotations are specified.
The semantic tagging is done on the POS-tagged corpus by using WordNet [9] and the SR-AW software [10] to determine the semantic classes of target words (verbs, nouns) that occur frequently in SMART objectives. Results show that the action verbs (e.g. increase, achieve, boost) that are used frequently in writing SMART objectives fall into one of the following verb semantic classes: "change", "social", "possession", "creation" or "motion". Nouns (e.g. sales) that are commonly used in writing SMART objectives fall into the noun semantic class "possession".
Some SMART objectives related to different domains (e.g. costs, profits) have been
examined semantically as well. Results show that the target words in these objectives
are classified into the same classes as the target words in the developed corpus.
To evaluate the accuracy of SR-AW, a corpus of objective examples manually annotated with WordNet semantic classes is used. For a sample of 30 target words, the software disambiguated 76% of the words correctly; 20% of the words were classified with semantic tagging errors, and the remaining 4% were ambiguous.
The study uses ILP for writing SMART objectives. ILP combines logic programming and machine learning to induce theories from background knowledge and examples [6]. The inductive logic programming system ALEPH¹ is applied to the POS- and semantically tagged corpus to learn a grammar for writing SMART objectives, ensuring that the objectives are "specific", "measurable" and "time-related". An annotated set of 70 sentences is provided to ALEPH as input, together with background knowledge and examples; positive (170 examples) and negative (185 examples) example sets describe the "specific", "measurable" and "time-related" cases. ALEPH induced 24 linguistic rules for writing SMART objectives, achieving an accuracy of 91% on the training data (a 70% split) and 81% on the testing data (a 30% split).
ALEPH has induced several rules, including the following PROLOG rule for
ensuring that an objective is “specific”:
specific(A,B) :-
    to(A,C), change_verb(C,D), product(D,E),
    possession_noun_nns(E,F), preposition_in(F,G), percent(G,B).
The following PROLOG rule is one of the induced rules by ALEPH for ensuring that
an objective is “measurable”:
measurable(A,B) :-
    percent(A,B).
The following PROLOG rule is induced by ALEPH for ensuring that an objective is
“time-related”:
time_related(A, B):-
date(A, B).
¹ www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html
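To illustrate how such induced rules could be applied at checking time, the following sketch is our re-implementation for exposition only, not the authors' system; the tag names and the stylised example objective are assumptions, and tags are matched contiguously, mirroring the variable chaining in the Prolog clauses:

# Illustrative re-implementation (not the authors' code) of the three induced
# rules over a tagged objective; each token is a (word, tag) pair and the tag
# names mirror the predicates of the Prolog rules.

def is_specific(tagged):
    # specific(A,B) :- to(A,C), change_verb(C,D), product(D,E),
    #                  possession_noun_nns(E,F), preposition_in(F,G), percent(G,B).
    chain = ["TO", "CHANGE_VERB", "PRODUCT",
             "POSSESSION_NOUN_NNS", "PREPOSITION_IN", "PERCENT"]
    tags = [tag for _, tag in tagged]
    return any(tags[i:i + len(chain)] == chain
               for i in range(len(tags) - len(chain) + 1))

def is_measurable(tagged):
    # measurable(A,B) :- percent(A,B).
    return any(tag == "PERCENT" for _, tag in tagged)

def is_time_related(tagged):
    # time_related(A,B) :- date(A,B).
    return any(tag == "DATE" for _, tag in tagged)

# Stylised tagged objective, chosen so that the chain is contiguous:
objective = [("to", "TO"), ("increase", "CHANGE_VERB"), ("product", "PRODUCT"),
             ("sales", "POSSESSION_NOUN_NNS"), ("in", "PREPOSITION_IN"),
             ("10%", "PERCENT"), ("by", "BY"), ("Dec 2013", "DATE")]
assert is_specific(objective) and is_measurable(objective) and is_time_related(objective)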
References
1. Murphy, K., Cleveland, J.: Performance Appraisal: An Organizational Perspective, 3rd
edn., 349 pages. Allyn and Bacon (1991)
2. Rowland, C., Hall, R.: Organizational Justice and Performance: is Appraisal Fair?
EuroMed Journal of Business 7(3), 280–293 (2012)
3. Hurd, A.R., et al.: Leisure Service Management. Human Kinetics, 386 pages (2008)
4. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing, 680
pages. MIT Press, Cambridge (1999)
5. Alpaydin, E.: Introduction to Machine Learning, 2nd edn., 584 pages. MIT Press (2010)
6. Muggleton, S., De Raedt, L.: Inductive Logic Programming: Theory and Methods. Journal of Logic Programming 19/20, 629–679 (1994)
7. Witten, I., et al.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd
edn., 664 pages. Morgan Kaufmann, Elsevier (2011)
8. Cunningham, H., et al.: Developing Language Processing Components with GATE Version 7 (A User Guide). The University of Sheffield (2012), http://gate.ac.uk/
9. Miller, G., et al.: Introduction to WordNet: An On-line Lexical Database. International
Journal of Lexicography 3(4), 235–244 (1990)
10. Pedersen, T., Kolhatkar, V.: WordNet::SenseRelate::AllWords - A Broad Coverage Word Sense Tagger that Maximizes Semantic Relatedness. In: NAACL HLT 2009 Demonstration Session, pp. 17–20 (2009)
² www.Infochimps.com/datasets/us-consumer-electronics-sales-and-forecasts-2003-to-2007
A Method for Improving Business Intelligence
Interpretation through the Use of Semantic Technology
1 Introduction
Business intelligence (BI) encompasses a wide range of business-supporting analytical tools for processing increasing amounts of corporate data. The interpretation of the output produced by BI applications, however, continues to be performed manually, as does the determination of appropriate actions to be taken based upon that output (Seufert and Schiefer 2005). Knowledge surrounding these interpretations and actions builds within the members of an organization; when employees depart, they take that knowledge with them, creating a vacuum.
The objectives of this research are to define targeted ontologies and to analyze how to use them to support business intelligence initiatives within an organization through the capture of the knowledge of analysts. To do so, a methodology called the Targeted Ontology for Data Analytics (TODA) is being developed within the context of dynamic capabilities theory. The contribution of the research is an ontology-driven methodology that facilitates the interpretation of BI results.
2 Research Stages
Stage 1: Targeted ontology development. Semantic technologies are intended to address questions of interoperability and the recognition and representation of terms and concepts (Staab et al. 2004).
This research project follows a design science approach of build and evaluate (Hevner
et al. 2004). The artifact is the TODA methodology, the purpose of which is to
provide targeted ontologies that can be applied to improve the interpretation of
business intelligence analytics. The TODA methodology consists of four steps: create
targeted ontology, anchor target objects, apply ontological knowledge to BI output,
and assess output for new knowledge to be collected.
A prototype is being designed that implements the methodology for creating and using targeted ontologies. The TODA architecture is shown in Figure 1 and consists of: a) a user interface, b) an ontology creator, c) a results interpreter, d) the BI environment, and e) external data sources. The prototype is a necessary tool because the targeted ontology development process still requires refinement.
A Sample Targeted Ontology Scenario
Assume a traditional business intelligence report depicts the results of a query asking for dollar sales of a product category across all stores of a major grocery chain over a period of several weeks. In this case, the category is cookies, and four subcategories of cookies are displayed in the report. For each subcategory, an analyst can discern a pattern of sales over the period displayed. That analyst may, however, be missing key pieces of information. For instance, the analyst may not know there was a stock-out of wafer cookies across the company for two weeks in April of 2012. If the company had implemented a targeted ontology, the previous analyst could have input information about the stock-out, creating a set of nodes as described below. This would be performed through the ontology creation module of the TODA architecture.
Assume a sample node set with three nodes. The first is the "*Item*" node, which describes an item involved in the node set; it is also the targeted node. In its physical instantiation, this node includes the information necessary (database keys, for example) to tie it to the corresponding information in the organization's business intelligence environment. The second node describes a stock-out that happened to the item in question and contains detailed information about that stock-out. Finally, the third node contains the period of time during which the stock-out occurred; it holds a month in this example, but it could hold any period of time.
Fig. 1. The TODA architecture: BI task interface, targeted ontology creation module, targeted ontology repository, and results interpreter
If the organization has implemented the TODA method, the information resulting
from the query will be processed through the TODA interpreter (see Figure 1). The
interpreter will analyze the information gathered for the report and determine if any
targeted nodes exist with links to this data. If so, appropriate notes will be added to
the report providing the tacit knowledge from the TODA ontology.
In the original version of the report, there is no contextual information: the analyst only has the numbers provided by the data warehouse to drive any decision-making analysis. In this instance, she may see the dip in sales of wafer cookies in mid-April as a sign of normal seasonality; perhaps there was a build-up of cookie demand leading into Easter and, once that was over, sales dipped. This conclusion would be a guess, but given the data she has to examine, there is little else to guide her analysis. The new version of the report results from the data being processed through a TODA-based system. The data from the organization's data warehouse is pre-processed by the TODA results interpreter, which determines whether any values in the data belong to a dimension linked to the targeted ontology through a linked node. If so, it finds any contextual knowledge stored in those nodes and adds it to the report. The report production process is then completed. A sketch of this interpreter step follows.
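The sketch below illustrates the interpreter step. It is our reading of the scenario; the node fields, the product key, and the report row format are assumptions rather than TODA's actual data model:

from datetime import date

# A targeted node set for the scenario above: the *Item* node carries the
# database key tying it to the BI environment; the other two nodes hold the
# stock-out event and the period during which it occurred.
node_set = {
    "item":   {"dimension": "product", "key": "wafer_cookies"},
    "event":  {"kind": "stock-out",
               "note": "company-wide stock-out of wafer cookies"},
    "period": {"start": date(2012, 4, 9), "end": date(2012, 4, 22)},
}

def interpret(report_rows, node_sets):
    """For each report row, attach the contextual note of any targeted node
    whose item matches the row's dimension value and whose period overlaps
    the row's week."""
    for row in report_rows:
        for ns in node_sets:
            if (row["product"] == ns["item"]["key"]
                    and ns["period"]["start"] <= row["week"] <= ns["period"]["end"]):
                row["note"] = ns["event"]["note"]
    return report_rows

rows = [{"product": "wafer_cookies", "week": date(2012, 4, 16), "sales": 1200.0}]
print(interpret(rows, [node_set]))   # the mid-April dip now carries the note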
The analyst now has access to contextual information about the time period where
wafer cookie sales dropped off. In her previous analysis, she determined this was
normal seasonality. Now she can see that this was due to a supply disruption. If the
analyst was supporting inventory-planning decisions, her previous analysis may have
led her to reduce inventories in mid-April. This would have caused stock-outs all over
again. Now she has the knowledge necessary to see that more inventory is needed
during this period, not less. This is the real contribution of TODA to practice: providing business intelligence analysts with the contextual knowledge needed to make better decisions. For the assessment, the hypotheses to be tested are summarized in Table 1.
Table 1. Hypotheses
H1: Targeted ontologies can improve the performance of BI analysts who are new to an organization.
H2: Targeted ontologies can facilitate analysts providing better interpretations of BI output.
H3: Targeted ontologies can prevent analysts from misinterpreting BI output.
3 Conclusion
This research proposes the use of targeted ontologies to improve the interpretation of
business intelligence data. Challenges include developing good procedures and
heuristics for eliciting targeted ontologies and then creating the techniques and
algorithms needed to effectively apply them. Further work will be needed for
additional validation on multiple sites and the inclusion of other semantic
technologies.
Acknowledgement. The work of the third author has been partly supported by
Sogang Business School’s World Class University Program (R31–20002) funded by
Korea Research Foundation, and Sogang University Research Grant of 2011.
References
Hevner, A.R., March, S.T., Park, J., Ram, S.: Design science in information systems research.
MIS Quarterly 28(1), 75–105 (2004)
Seufert, A., Schiefer, J.: Enhanced business intelligence – supporting business processes with real-time business analytics, pp. 919–925. IEEE (2005)
Staab, S., Gómez-Pérez, A., Daelemans, W., Reinberger, M.L., Noy, N.: Why evaluate ontology technologies? Because it works! IEEE Intelligent Systems 19(4), 74–81 (2004)
Code Switch Point Detection in Arabic
1 Introduction
Linguistic code switching (LCS) refers to the use of more than one language in the same conversation, either inter-utterance or intra-utterance. LCS is pervasively present in informal written genres such as social media. The phenomenon is even more pronounced in diglossic languages like Arabic, in which two forms of the language co-exist; identifying LCS is then more subtle, in particular in the intra-utterance setting.¹ This paper tackles the problem of code-switch point (CSP) detection in a given Arabic sentence. A language-modeling (LM) based approach is presented for the automatic identification of CSPs in hybrid text of modern standard Arabic (MSA) and Egyptian dialect (EDA). We examine the effect of varying the size of the LM as well as the impact of using a morphological analyzer on performance. The results are compared against our previous work [4]; the current system outperforms our previous implementation by a significant margin of an absolute 4.4%, with an Fβ=1 score of 76.5% compared to 72.1%.
¹ For a literature review, we direct the reader to our COLING 2012 paper [4].
2 Approach
The hybrid system introduced here uses an LM with a back-off to a morphological analyzer (MA) to handle out-of-vocabulary (OOV) words, and automatically identifies the CSPs in Arabic utterances. While the MA approach achieves far better coverage of the words of a highly derivational and inflectional language such as Arabic, it is not able to take context into consideration. LMs, on the other hand, yield better disambiguation results because they model context. A sketch of the combined decision procedure is given below.
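The following schematic shows one way the combined decision could work; it is our sketch of the idea rather than the system evaluated here, and the context length, back-off scores, and analyzer interfaces are assumptions:

# Schematic CSP tagger (our sketch; scoring details differ in the real system).
# msa_lm / eda_lm return a log-probability for a token in context, or None if
# the token is OOV for that LM; msa_ma / eda_ma return True if the respective
# morphological analyzer yields an analysis for the token.

OOV_WITH_ANALYSIS = -10.0    # assumed back-off score when only the MA fires
OOV_NO_ANALYSIS = -99.0

def detect_csp(tokens, msa_lm, eda_lm, msa_ma, eda_ma):
    """Label each token MSA or EDA; code-switch points are the positions
    where the label changes."""
    labels = []
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - 2):i]        # trigram-style context
        msa, eda = msa_lm(context, tok), eda_lm(context, tok)
        if msa is None and eda is None:          # OOV in both LMs: back off to the MAs
            msa = OOV_WITH_ANALYSIS if msa_ma(tok) else OOV_NO_ANALYSIS
            eda = OOV_WITH_ANALYSIS if eda_ma(tok) else OOV_NO_ANALYSIS
        msa = OOV_NO_ANALYSIS if msa is None else msa
        eda = OOV_NO_ANALYSIS if eda is None else eda
        labels.append("MSA" if msa >= eda else "EDA")
    switch_points = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    return labels, switch_points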
3 Evaluation Dataset
We use three different sources of web-log data to create our evaluation dataset. The first comes from the Arabic Online Commentary dataset introduced in [8].
² The LDC numbers of these corpora are 2006{E39, E44, E94, G05, G09, G10}, 2008{E42, E61, E62, G05}, 2009{E08, E108, E114, E72, G01}, 2010{T17, T21, T23}, 2011{T03}, 2012{E107, E19, E30, E51, E54, E75, E89, E94, E98, E99}.
³ We use the Buckwalter transliteration scheme, http://www.qamus.org/transliteration.htm
4 Experimental Results
Fig. 1. Weighted average of F-scores of the MSA and DA classes for different experimental setups against the baseline systems, MAJB and COLB
The best setup achieves a weighted average F-score of 76.5% (using the largest LM with back-off to the morphological analyzer), compared to 34.7% for the majority baseline MAJB and 72.1% for our high baseline system, COLB.
5 Conclusion
References
1. Diab, M., Habash, N., Rambow, O., Altantawy, M., Benajiba, Y.: COLABA: Arabic dialect annotation and processing. In: LREC Workshop on Semitic Language Processing, pp. 66–74 (2010)
2. Diab, M., Hawwari, A., Elfardy, H., Dasigi, P., Al-Badrashiny, M., Eskandar, R., Habash, N.: Tharwa: A multi-dialectal multi-lingual machine readable dictionary (forthcoming, 2013)
3. Elfardy, H., Diab, M.: Simplified guidelines for the creation of large scale dialectal Arabic annotations. In: LREC, Istanbul, Turkey (2012)
4. Elfardy, H., Diab, M.: Token level identification of linguistic code switching. In:
COLING, Mumbai, India (2012)
5. Habash, N., Eskander, R., Hawwari, A.: A Morphological Analyzer for Egyptian
Arabic. In: NAACL-HLT Workshop on Computational Morphology and Phonology
(2012)
6. Maamouri, M., Graff, D., Bouziri, B., Krouna, S., Bies, A., Kulick, S.: LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (2010)
7. Stolcke, A.: SRILM – An extensible language modeling toolkit. In: ICSLP (2002)
8. Zaidan, O.F., Callison-Burch, C.: The Arabic Online Commentary dataset: an annotated dataset of informal Arabic with high dialectal content. In: ACL (2011)
SurveyCoder: A System for Classification
of Survey Responses
1 Introduction
Open-ended questions are a vital component of a survey as they elicit subjective feedback. Data from responses to open-ended questions has been found to be a rich source for a variety of purposes. However, the benefits of open-ended questions can be realized only when the unstructured, free-form answers, which are expressed in a natural language (such as English, German or Hindi), are converted to a form that is amenable to analysis.
Survey coding is the process that converts the qualitative input available from the responses to open-ended questions into a quantitative format that supports quick analysis of such responses. The set of customer responses in electronic text format (also known as verbatims) and a pre-specified set of codes, called the code-frame, constitute the input to the survey-coding process. A code-frame consists of a set of tuples (each called a code or label) of the form <code-id, code-description>. Each code-id is a unique identifier assigned to a code, and the code-description usually consists of a short description that "explains" the code. The survey-coding task is to assign one or more codes from the given code-frame to each customer response. As per current practice in the market research industry, it is carried out by specially trained human annotators (also known as coders). Sample output of the survey-coding process is shown in Fig. 1, and the data structures involved are illustrated in the sketch below.
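To make the input and output concrete, the sketch below shows the data involved; the codes and the verbatim are invented examples, not taken from a real code-frame:

# Invented example of the survey-coding input and output.
code_frame = {
    101: "likes the taste",
    102: "price too high",
    199: "other / miscellaneous",
}
verbatim = "Tastes great but it costs too much."

# Coding assigns a subset of code-ids from the code-frame to each verbatim:
assignment = {verbatim: [101, 102]}

for text, codes in assignment.items():
    print(text, "->", [(c, code_frame[c]) for c in codes])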
The research community has approached the problem of automatic code assignment from multiple perspectives. An active research group in this area is led by Sebastiani, Esuli, and colleagues [1–3]. They approach the multiclass coding problem using a combination of active learning and supervised learning. Almost
all the supervised learning techniques mentioned in the current literature need training data which is specific to each survey. This training data is not available with the survey and has to be created by the human annotators to begin with. In most cases, the cost and effort required to create the necessary training data outweigh the benefits of using supervised learning techniques; thus, supervised learning alone is not the best possible solution.
Fig. 1. Output of survey coding: examples of verbatims and codes assigned to them
3 Experimental Results
We have evaluated SurveyCoder using multiple survey datasets from diverse domains such as over-the-counter medicines, household consumer goods (e.g. detergents, fabric softeners), food and snack items, customer satisfaction surveys, and campus recruitment test feedback surveys. Fig. 3 summarizes some of our results for the classification of survey responses (without using any feedback).
4 Conclusion
References
1. Giorgetti, D., Sebastiani, F.: Multiclass text categorization for automated survey
coding. In: Proceedings of ACM Symposium on Applied Computing (SAC) (2003)
2. Esuli, A., Sebastiani, F.: Active learning strategies for multi-label text classification. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 102–113. Springer, Heidelberg (2009)
3. Esuli, A., Sebastiani, F.: Machines that learn how to code open-ended survey data.
International Journal of Market Research 52(6) (2010)
4. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press (1998)
5. Buchanan, B., Shortliffe, E.: Rule Based Expert Systems: The MYCIN Experiments
of the Stanford Heuristic Programming Project. Addison-Wesley, Reading (1984)
ISBN 978-0-201-10172-0
Rhetorical Representation and Vector Representation
in Summarizing Arabic Text
Abstract. This paper examines the benefits of both the Rhetorical Representation and the Vector Representation for Arabic text summarization. The Rhetorical Representation uses Rhetorical Structure Theory (RST) to build the Rhetorical Structure Tree (RS-Tree) and extracts the most significant paragraphs as a summary. The Vector Representation, on the other hand, uses a cosine similarity measure for ranking and extracting the most significant paragraphs as a summary. The framework evaluates both summaries using precision. Statistical results show that the Rhetorical Representation is superior to the Vector Representation. Moreover, the rhetorical summary keeps the text in context and avoids the loss of cohesion that arises when anaphoric references are broken, i.e., it improves the ability to extract the semantics behind the text.
articles. The articles include different kinds of news: general, political, business, regional, and entertainment. On average, an article has five paragraphs and each paragraph has 24 words. The test set is divided into three groups: small-sized articles (1–10 paragraphs), medium-sized articles (11–20 paragraphs), and large-sized articles (21–40 paragraphs). The overall figures of the test set are given in Table 1.
The proposed framework applies RST, as shown in Fig. 1, through four steps. First, the original input text is segmented into paragraphs (as indicated by the HTML <p> tags). Second, paragraphs are classified into nucleus or satellite depending on the algorithm in [3,4], and a JSON (JavaScript Object Notation) representation is produced. Third, the text structure is represented using the JSON code and the RS-Tree is built. Finally, the nucleus nodes (significant paragraphs) are selected from the RS-Tree, as sketched below.
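The last two steps are sketched below; the JSON shape is an assumption on our part (the nucleus/satellite classification itself follows [3,4]) and the paragraph texts are placeholders:

import json

# Assumed JSON shape for the classified structure (placeholders for text).
rs_json = """
{"role": "nucleus", "text": null, "children": [
  {"role": "nucleus",   "text": "paragraph 1 ...", "children": []},
  {"role": "satellite", "text": "paragraph 2 ...", "children": []},
  {"role": "nucleus",   "text": "paragraph 3 ...", "children": []}
]}
"""

def collect_nuclei(node, out=None):
    """Walk the RS-Tree depth-first, keeping leaf paragraphs marked nucleus."""
    out = [] if out is None else out
    if not node["children"] and node["role"] == "nucleus":
        out.append(node["text"])
    for child in node["children"]:
        collect_nuclei(child, out)
    return out

summary = "\n".join(collect_nuclei(json.loads(rs_json)))   # paragraphs 1 and 3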
Table 1. Overall figures of the test set
Category | Figures
Corpus textual size | 25.06 MB
No. of articles | 212
No. of paragraphs | 2249
No. of sentences | 2521
No. of words (exact) | 66448
No. of words (root) | 41260
No. of stop words | 15673
No. of small-sized articles (less than 10 paragraphs) | 104
No. of medium-sized articles (10–20 paragraphs) | 79
No. of large-sized articles (more than 20 paragraphs) | 29
The proposed framework also applies the VSM, as shown in Fig. 1, by representing the article parts (title, paragraphs) as vectors and computing the cosine similarity of each paragraph vector to the title vector. Furthermore, so that long and short paragraphs score comparable weights, the vectors are normalized by dividing each of their components by the vector length. The cosine similarity is computed using the following equation, and the top-ranked paragraphs are selected [6]:
cosine similarities uses the following equation and selects the top [6].
. ∑|V|
, .
| | | | | | | | |V| |V|
Where:
is the tf•idf weight of term ti in the title.
is the tf•idf weight of term ti in the paragraph
The tf•idf vector is composed of the product of a term frequency and the inverse doc-
ument frequency for each title terms that appears in the all article paragraphs.
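The ranking step can be implemented directly from this equation. The sketch below is our plain illustration: a naive whitespace tokenizer stands in for the Arabic preprocessing (root extraction, stop-word removal) reflected in Table 1:

import math
from collections import Counter

def tfidf_vectors(title, paragraphs):
    """Build tf-idf vectors; the document collection is the title plus the
    paragraphs of the article."""
    tokenized = [d.split() for d in [title] + paragraphs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(tokenized)
    vecs = [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in tokenized]
    return vecs[0], vecs[1:]

def cosine(u, v):
    """Length-normalized dot product, as in the equation above."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_paragraphs(title, paragraphs, top_k=2):
    """Return the top_k paragraphs most similar to the title."""
    tvec, pvecs = tfidf_vectors(title, paragraphs)
    order = sorted(range(len(pvecs)), key=lambda i: cosine(tvec, pvecs[i]),
                   reverse=True)
    return [paragraphs[i] for i in order[:top_k]]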
Fig. 2 shows the precision obtained in the experiments: the Y-axis represents the precision, and the X-axis represents the text-size groups. The VSM summary achieves an average precision of 53.13%, whereas the RST summary achieves 56.29%. However, on the large-sized articles the VSM summary achieves 42.7%, more than the RST summary, which achieves only 39.02%. Regarding the intrinsic quality of the summary, the RST summary stays in context, maintains cohesion, and does not break anaphoric references.
Fig. 2. Performance of both VSM and RST summary results with the judgments. Precision per group: RST 74.85% vs. VSM 66.70% (1 ≤ Pn ≤ 10); RST 55.00% vs. VSM 50.00% (10 < Pn ≤ 20); RST 39.02% vs. VSM 42.70% (20 < Pn ≤ 40)
RST is a very effective technique for extracting the most significant text parts. However, its limitation appears when it is applied to large-sized articles. Statistical results show that the RST summary is superior to the VSM summary: the average precision of the RST summary is 56.29%, whereas that of the VSM summary is 53.13%. Moreover, the VSM summary is incoherent and deviates from the context of the original text.
In future work, the two models may be combined into a new model that improves the summary results by inlaying the rhetorical structure tree with the weights of the VSM summary. In addition, the RS-Tree could be used to identify the writing styles of different authors.
References
1. Hammo, B.H., Abu-Salem, H., Evens, M.W.: A Hybrid Arabic Text Summarization Technique Based on Text Structure and Topic Identification. Int. J. Comput. Proc. Oriental Lang. (2011)
2. Alsanie, W., Touir, A., Mathkour, H.: Towards an infrastructure for Arabic text summarization using rhetorical structure theory. M.Sc. Thesis, King Saud University, Riyadh, Saudi Arabia (2005)
3. Ibrahim, A., Elghazaly, T.: Arabic text summarization using Rhetorical Structure Theory.
In: 8th International Conference on Informatics and Systems (INFOS), pp. NLP-34–NLP-38
(2012)
4. Ibrahim, A., Elghazaly, T.: Rhetorical Representation for Arabic Text. In: ISSR Annual
Conference the 46th Annual Conference on Statistics, Computer Science, and Operations
Research (2011)
5. Abd-Elfattah, M., Ren, F.: Automatic text summarization. In: Proceedings of World Academy of Science, Engineering and Technology, Cairo, Egypt, pp. 192–195 (2008)
6. Manning, C., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval, p. 181.
Cambridge University Press (2009)
Author Index