Task                         str   sty   c
Authorship Classification           X
Automatic Essay Scoring       X     X    X
Information Retrieval         X     X    X
Paraphrase Recognition                   X
Plagiarism Detection          X          X
Question Answering                       X
Short Answer Grading          X     X    X
Summarization                 X          X
Text Categorization                      X
Text Segmentation             X          X
Text Simplification                 X    X
Word Sense Alignment                     X

Table 1: Classification of common NLP tasks with respect to the relevant dimensions of text similarity: structure (str), style (sty), and content (c)

their relationships within a text. For example, the task of automatic essay scoring (Attali and Burstein, 2006) typically requires not only that the essay be about a certain topic (content dimension), but also an adequate style and a coherent structure. However, in authorship classification (Holmes, 1998) only style is important.

Taking this dimension-centric view on text similarity also opens up new perspectives. For example, standard information retrieval usually considers only the content dimension (keyword overlap between query and document). However, a scholar in digital humanities might be interested in texts that are similar to a reference document with respect to style and structure, while texts with similar content are of minor interest. In this paper, we only address dimensions inherent to texts, and do not consider dimensions such as user intentions.

the annotator A2 indicates that a different dimension might have been used to judge similarity. To further investigate this issue, we asked the annotators about the reasons for their judgments. A1 and A3 consistently focused only on the content of the texts and completely disregarded other dimensions. A2, however, also took structural similarities into account; e.g., two texts were rated highly similar because of the way they are organized: first, an introduction to the topic is given, then a quotation is stated, and finally the text concludes with a certain reaction of the acting subject.

Content vs. Style   The annotators in the previous study only identified the dimensions content and structure. Style was not addressed, as the text pairs were all of similar style, and hence that dimension was not perceived as salient. Thus, we selected 10 pairs of short texts from Wikipedia (WP) and Simple Wikipedia2 (SWP). We used the first paragraphs of WP articles and the full texts of SWP articles to obtain pairs of similar length. Pairs were formed in all combinations (WP-WP, SWP-WP, and SWP-SWP) to ensure that both similarity dimensions were salient for some pairs. For example, an article from SWP and one from WP about the same topic share the same content but differ in style, while two articles from SWP have a similar style but different content. We then asked three annotators to rate each pair according to the content and style dimensions. The results show that WP-WP and SWP-SWP pairs are perceived as stylistically similar, while WP-SWP pairs are seen as similar with respect to their content.
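To make the analysis of such a study concrete, the following minimal sketch (ours, not part of the original study; the example ratings and the 1-5 scale are assumed for illustration) aggregates per-dimension ratings by pair type, which is all that is needed to reproduce the observation that WP-WP and SWP-SWP pairs score high on style while WP-SWP pairs score high on content:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotation records: (pair_type, content_rating, style_rating),
# one entry per annotator and text pair, on an assumed 1-5 rating scale.
ratings = [
    ("WP-WP",   2, 5), ("WP-WP",   1, 4),
    ("WP-SWP",  5, 2), ("WP-SWP",  4, 1),
    ("SWP-SWP", 2, 5), ("SWP-SWP", 1, 4),
]

def aggregate(records):
    """Average content and style ratings separately for each pair type."""
    by_type = defaultdict(lambda: {"content": [], "style": []})
    for pair_type, content, style in records:
        by_type[pair_type]["content"].append(content)
        by_type[pair_type]["style"].append(style)
    return {
        pair_type: {dim: round(mean(values), 2) for dim, values in dims.items()}
        for pair_type, dims in by_type.items()
    }

print(aggregate(ratings))
# e.g. {'WP-WP': {'content': 1.5, 'style': 4.5},
#       'WP-SWP': {'content': 4.5, 'style': 1.5},
#       'SWP-SWP': {'content': 1.5, 'style': 4.5}}
```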
Dataset                                                    Text Type / Domain    Length in Terms (avg.)   # Pairs   Rating Scale   # Judges per Pair
30 Sentence Pairs (Li et al., 2006)                        Concept Definitions   5–33 (11)                30        0–4            32
50 Short Texts (Lee et al., 2005)                          News (Politics)       45–126 (80)              1,225     1–5            8–12
Computer Science Assignments (Mohler and Mihalcea, 2009)   Computer Science      1–173 (18)               630       0–5            2
Microsoft Paraphrase Corpus (Dolan et al., 2004)           News                  5–31 (19)                5,801     binary         2–3

Table 2: Overview of the analyzed evaluation datasets
Measure                                  r     ρ
Cosine Baseline                         .81   .83
Term Pair Heuristic                     .83   .84

ESA (Wikipedia)                         .61   .77
ESA (Wiktionary)                        .77   .82
ESA (WordNet)                           .75   .80

Kennedy and Szpakowicz (2008)           .87    -
LSA (Tsatsaronis et al., 2010)          .84   .87
OMIOTIS (Tsatsaronis et al., 2010)      .86   .89
STASIS (Li et al., 2006)                .82   .81
STS (Islam and Inkpen, 2008)            .85   .84

Table 3: Results on the 30 Sentence Pairs dataset

Measure                                  r
Cosine Baseline                         .56

ESA (Wikipedia)                         .46
ESA (Wiktionary)                        .53
ESA (WordNet)                           .59

ESA (Gabrilovich and Markovitch, 2007)  .72
LSA (Lee et al., 2005)                  .60
WikiWalk (Yeh et al., 2009)             .77

Table 4: Results on the 50 Short Texts dataset. Statistically significant7 improvements in bold.
that this dataset encodes the content dimension of similarity, but a rather constrained one.

Evaluation Results   Table 3 shows the results of state-of-the-art similarity measures obtained on this dataset. We used a cosine baseline and implemented an additional baseline which disregards the actual texts and only takes the target noun of each sentence into account. We computed their pairwise term similarity using the metric by Lin (1998) on WordNet (Fellbaum, 1998). Our heuristic achieves Pearson r = 0.83 and Spearman ρ = 0.84. The block of results in the middle shows our implementation of Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007) using different knowledge sources (Zesch et al., 2008). The bottom rows show scores previously obtained and reported in the literature. None of the measures significantly5 outperforms the baselines. Given that the dataset encodes term rather than text similarity, and that it is also very small (30 pairs), it is questionable whether it is a suitable evaluation dataset for text similarity.
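For concreteness, a minimal sketch of the two baselines follows. The data loading, the TF-IDF weighting, and the first-sense choice for the target nouns are our assumptions for illustration; the paper does not specify these details.

```python
from nltk.corpus import wordnet as wn, wordnet_ic
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

brown_ic = wordnet_ic.ic("ic-brown.dat")  # information content needed by Lin (1998)

def cosine_baseline(text_a, text_b):
    """Cosine similarity between TF-IDF vectors of the two texts."""
    vectors = TfidfVectorizer().fit_transform([text_a, text_b])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

def term_pair_heuristic(noun_a, noun_b):
    """Lin similarity on WordNet between the two target nouns.
    Assumes both nouns are in WordNet; uses the first sense of each."""
    syn_a = wn.synsets(noun_a, pos=wn.NOUN)[0]
    syn_b = wn.synsets(noun_b, pos=wn.NOUN)[0]
    return syn_a.lin_similarity(syn_b, brown_ic)

def evaluate(pairs, gold):
    """pairs: list of (text_a, text_b, noun_a, noun_b); gold: human similarity scores."""
    cos = [cosine_baseline(a, b) for a, b, _, _ in pairs]
    heur = [term_pair_heuristic(na, nb) for _, _, na, nb in pairs]
    return {
        "cosine": (pearsonr(cos, gold)[0], spearmanr(cos, gold)[0]),
        "heuristic": (pearsonr(heur, gold)[0], spearmanr(heur, gold)[0]),
    }
```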
3.2 50 Short Texts

The dataset by Lee et al. (2005) comprises 50 relatively short texts (45 to 126 words6) which contain newswire from the political domain. In analogy to the study in Section 3.1, we performed an annotation study to show whether the encoded judgments are stable across time and subjects. We asked three annotators to rate "How similar are the given texts?". We used the same uniformly distributed subset as in Section 2.1. The resulting Spearman correlation between the aggregated results of the annotators and the original scores is ρ = 0.88. This shows that judgments are quite stable across time and subjects.

In Section 2.1, two annotators had a content-centric view on similarity, while one subject also considered structural similarity important. When combining only the two content-centric annotators, the correlation is ρ = 0.90, while it is much lower for the other annotator. Thus, we conclude that this dataset encodes the content dimension of text similarity.

Evaluation Results   Table 4 summarizes the results obtained on this dataset. We used a cosine baseline and our implementation of ESA applied to different knowledge sources. The results at the bottom are scores previously obtained and reported in the literature. All of them significantly outperform the baseline.7 In contrast to the 30 Sentence Pairs dataset, this dataset encodes a broader view on the content dimension of similarity. It obviously contains text pairs that are similar (or dissimilar) for reasons beyond partial string overlap. Thus, the dataset might be used to intrinsically evaluate text similarity measures. However, the distribution of similarity scores in this dataset is heavily skewed towards low scores, with 82% of all text pairs having a similarity score between 1 and 2 on a 1–5 scale. This limits the kind of conclusions that can be drawn, as the number of pairs in the most interesting class of highly similar pairs is actually very small.
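Significance statements such as this one rest on comparing correlation coefficients via the Fisher Z-value transformation at the α levels given in footnotes 5 and 7. A minimal sketch of that test, under the simplifying assumption (ours, not spelled out in the paper) that the two correlations come from independent samples:

```python
from math import atanh, sqrt
from scipy.stats import norm

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test whether correlations r1 and r2 differ significantly,
    assuming they were computed on independent samples of size n1 and n2."""
    z1, z2 = atanh(r1), atanh(r2)          # Fisher Z-value transformation
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return 2 * (1 - norm.cdf(abs(z)))      # p-value

# Example with values from Table 4, assuming both correlations
# were computed over all 1,225 text pairs of the dataset:
p = fisher_z_test(0.77, 1225, 0.56, 1225)  # WikiWalk vs. cosine baseline
print(p < 0.01)                            # True: significant at α = .01
```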
Another observation is that we were not able to reproduce the ESA score on Wikipedia reported by Gabrilovich and Markovitch (2007). We found that the difference probably relates to the cut-off value used to prune the vectors, as reported by Yeh et al. (2009). By tuning the cut-off value, we could improve the score to 0.70, which comes very close to the reported score of 0.72. However, as this tuning is done directly on the evaluation dataset, it probably overfits the cut-off value to the dataset.

5 α = .05, Fisher Z-value transformation
6 Lee et al. (2005) report the shortest document having 51 words, probably due to a different tokenization strategy.
7 α = .01, Fisher Z-value transformation
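To illustrate where such a cut-off enters, here is a toy sketch of ESA-style text vectors with pruning. The tiny concept collection, the TF-IDF weighting, and the way the cut-off is applied are our assumptions for illustration, not the exact setup of Gabrilovich and Markovitch (2007), where each concept is a full Wikipedia article.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "concept space"; real ESA uses one document per Wikipedia article.
concepts = ["politics government election", "sports football match", "music band concert"]
vectorizer = TfidfVectorizer().fit(concepts)
concept_matrix = vectorizer.transform(concepts)           # one row per concept

def esa_vector(text, cutoff=0.0):
    """Represent a text by its association strength with each concept;
    associations below the cut-off are pruned (set to zero)."""
    assoc = cosine_similarity(vectorizer.transform([text]), concept_matrix)[0]
    assoc[assoc < cutoff] = 0.0
    return assoc

def esa_similarity(text_a, text_b, cutoff=0.0):
    """Cosine similarity between the pruned concept vectors of two texts."""
    va, vb = esa_vector(text_a, cutoff), esa_vector(text_b, cutoff)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0

print(esa_similarity("the election campaign", "the new government", cutoff=0.05))
```

Tuning the `cutoff` parameter directly on the evaluation data, as discussed above, risks overfitting it to that particular dataset.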
Measure                             r
Cosine Baseline                    .44

ESA (Mohler and Mihalcea, 2009)    .47
LSA (Mohler and Mihalcea, 2009)    .43
Mohler and Mihalcea (2009)         .45

Table 5: Results on the Computer Science Assignments dataset

Measure                             F-measure
Cosine Baseline                    .81
Majority Baseline                  .80

ESA (Wikipedia)                    .80
LSA (Mihalcea et al., 2006)        .81
Mihalcea et al. (2006)             .81
OMIOTIS (Tsatsaronis et al., 2010) .81
PMI-IR (Mihalcea et al., 2006)     .81
Ramage et al. (2009)               .80
STS (Islam and Inkpen, 2008)       .81

Finch et al. (2005)                .83
Qiu et al. (2006)                  .82
Wan et al. (2006)                  .83
Zhang and Patrick (2005)           .81

Table 6: Results on Microsoft Paraphrase Corpus

3.3 Computer Science Assignments

The dataset by Mohler and Mihalcea (2009) was introduced for assessing the quality of short answer grading systems in the context of computer science assignments. The dataset comprises 21 questions, 21 reference answers and 630 student answers. The answers were graded by two teachers – not according to stylistic properties, but to the extent the content of the student answers matched with the content of the reference answers.
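A minimal sketch of this grading-by-similarity setup follows. The data structures and the choice of TF-IDF cosine as the measure are our assumptions; the paper evaluates several measures and reports Pearson r against the teacher grades.

```python
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def grade_by_similarity(reference_answer, student_answers):
    """Score each student answer by its cosine similarity to the reference answer."""
    vectorizer = TfidfVectorizer().fit([reference_answer] + student_answers)
    ref_vec = vectorizer.transform([reference_answer])
    student_vecs = vectorizer.transform(student_answers)
    return cosine_similarity(student_vecs, ref_vec).ravel()

def evaluate(items):
    """items: list of (reference_answer, student_answers, teacher_grades).
    Returns the Pearson correlation between predicted scores and teacher grades."""
    predicted, gold = [], []
    for reference, students, grades in items:
        predicted.extend(grade_by_similarity(reference, students))
        gold.extend(grades)
    return pearsonr(predicted, gold)[0]
```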
Evaluation Results   We summarize the results obtained on this dataset in Table 5. The scores are reported without relevance feedback (Mohler and Mihalcea, 2009), which distorts results by changing the reference answers. None of the measures significantly8 outperforms the baseline. This is not overly surprising, as the textual similarity between the reference and the student answer only constitutes part of what makes an answer the correct one.
More sophisticated measures that also take lexical semantic relationships between terms into account might even worsen the results, as typically a specific answer is required, not just a similar one. We conclude that similarity measures can be used to grade assignments, but it seems questionable whether this dataset is suited to draw any conclusions on the performance of similarity measures outside of this particular task.

8 α = .05, Fisher Z-value transformation

3.4 Microsoft Paraphrase Corpus

Dolan et al. (2004) introduced a dataset of 5,801 sentence pairs taken from news sources on the Web. They collected binary judgments from 2–3 subjects on whether each pair captures a paraphrase relationship or not (83% interrater agreement). The dataset has been used for evaluating text similarity measures as, by definition, paraphrases need to be similar with respect to their content.

Evaluation Results   We summarize the results obtained on this dataset in Table 6. As detecting paraphrases is a classification task, we use an additional majority baseline which classifies all pairs according to the predominant class of true paraphrases. The block of results in the middle contains measures that are not specifically tailored towards paraphrase recognition. None of them beats the cosine baseline. The results at the bottom show measures which are specifically tailored towards the detection of a bidirectional entailment relationship. None of them, however, significantly outperforms the cosine baseline. Obviously, recognizing paraphrases is a very hard task that cannot simply be tackled by computing text similarity, as sharing similar content is a necessary, but not a sufficient, condition for being a paraphrase.
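A minimal sketch of this evaluation setup, assuming gold paraphrase labels and one sentence pair per item are available; the decision threshold of 0.5 is our arbitrary choice, not a value from the paper:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity

def cosine_score(sentence_a, sentence_b):
    """TF-IDF cosine similarity between the two sentences."""
    vectors = TfidfVectorizer().fit_transform([sentence_a, sentence_b])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

def evaluate_paraphrase_detection(pairs, gold_labels, threshold=0.5):
    """pairs: list of (sentence_a, sentence_b); gold_labels: 1 = paraphrase, 0 = not."""
    # Cosine baseline: classify as paraphrase if the similarity exceeds the threshold.
    cosine_pred = [int(cosine_score(a, b) > threshold) for a, b in pairs]
    # Majority baseline: assign every pair the predominant gold class.
    majority_class = Counter(gold_labels).most_common(1)[0][0]
    majority_pred = [majority_class] * len(gold_labels)
    return {
        "cosine F-measure": f1_score(gold_labels, cosine_pred),
        "majority F-measure": f1_score(gold_labels, majority_pred),
    }
```

This setup also makes it easy to see why Table 6 is so flat: a majority baseline that labels every pair as a paraphrase has recall 1.0, so its F-measure of .80 already implies that roughly two thirds of the pairs carry the positive label, leaving little room for improvement in this metric.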
3.5 Discussion

We showed that all four datasets encode the content dimension of text similarity. The Computer Science Assignments dataset and the Microsoft Paraphrase Corpus are tailored quite specifically to a certain task, where factors beyond the similarity of the texts are important. Consequently, none of the similarity measures significantly outperformed the cosine baseline. The evaluation of similarity measures on these datasets is hence questionable outside of the specific application scenario. The 30 Sentence Pairs dataset was found to rather represent the similarity between terms than texts. Obviously, it is not suited for evaluating text similarity measures. However, the 50 Short Texts dataset currently seems to be the best choice. As it is heavily skewed towards low similarity scores, though, the conclusions that can be drawn from the results are limited. Further datasets are necessary to guide the development of measures along other dimensions such as structure or style.

4 Conclusions

In this paper, we reflected on text similarity as a foundational technique for a wide range of tasks. We argued that while similarity is well grounded in psychology, text similarity is less well-defined. We introduced a formalization based on conceptual spaces for modeling text similarity along explicit dimensions inherent to texts. We empirically grounded these dimensions in annotation studies and demonstrated that humans indeed judge similarity along different dimensions. Furthermore, we discussed common evaluation datasets and showed that it is of crucial importance for text similarity measures to address the correct dimensions. Otherwise, these measures fail to outperform even simple baselines.

We propose that future studies aiming at collecting human judgments on text similarity should explicitly state which dimension is targeted in order to create reliable annotation data. Further evaluation datasets annotated according to the structure and style dimensions of text similarity are necessary to guide further research in this field.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Klaus Tschira Foundation under project No. 00.133.2008. We thank György Szarvas for sharing his insights into the ESA similarity measure with us.

References

Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater v.2.0. Journal of Technology, Learning, and Assessment, 4(3).

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. In Proc. of the 20th International Conference on Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Andrew Finch, Young-Sook Hwang, and Eiichiro Sumita. 2005. Using machine translation evaluation techniques to determine sentence-level semantic equivalence. In Proc. of the 3rd Intl. Workshop on Paraphrasing, pages 17–24.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proc. of the 20th Intl. Joint Conference on Artificial Intelligence, pages 1606–1611.

Peter Gärdenfors. 2000. Conceptual Spaces: The Geometry of Thought. MIT Press.

Nelson Goodman. 1972. Seven strictures on similarity. In Problems and Projects, pages 437–446. Bobbs-Merrill.

David I. Holmes. 1998. The Evolution of Stylometry in Humanities Scholarship. Literary and Linguistic Computing, 13(3):111–117.

Aminul Islam and Diana Inkpen. 2008. Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity. ACM Transactions on Knowledge Discovery from Data, 2(2):1–25.

Alistair Kennedy and Stan Szpakowicz. 2008. Evaluating Roget's Thesauri. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 416–424.

Michael D. Lee, Brandon Pincombe, and Matthew Welsh. 2005. An empirical evaluation of models of text document similarity. In Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1254–1259.

Yuhua Li, David McLean, Zuhair Bandar, James O'Shea, and Keeley Crockett. 2006. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8):1138–1150.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, pages 296–304.

Rada Mihalcea, Courtney Corley, and Carlo Strapparava. 2006. Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In Proceedings of the 21st National Conference on Artificial Intelligence.

Michael Mohler and Rada Mihalcea. 2009. Text-to-text Semantic Similarity for Automatic Short Answer Grading. In Proc. of the Europ. Chapter of the ACL, pages 567–575.

Long Qiu, Min-Yen Kan, and Tat-Seng Chua. 2006. Paraphrase Recognition via Dissimilarity Significance Classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 18–26.

Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning. 2009. Random Walks for Text Semantic Similarity. In Proceedings of the Workshop on Graph-based Methods for Natural Language Processing, pages 23–31.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

John Sinclair, editor. 2001. Collins COBUILD Advanced Learner's English Dictionary. HarperCollins, 3rd edition.

Linda B. Smith and Diana Heise. 1992. Perceptual similarity and conceptual structure. In B. Burns, editor, Percepts, Concepts, and Categories. Elsevier.

George Tsatsaronis, Iraklis Varlamis, and Michalis Vazirgiannis. 2010. Text relatedness based on a word thesaurus. Journal of Artificial Intelligence Research, 37:1–39.

Amos Tversky. 1977. Features of similarity. Psychological Review, 84:327–352.

Stephen Wan, Mark Dras, Robert Dale, and Cécile Paris. 2006. Using dependency-based features to take the "para-farce" out of paraphrase. In Proc. of the Australasian Language Technology Workshop, pages 131–138.

Dominic Widdows. 2004. Geometry and Meaning. Center for the Study of Language and Information.

Eric Yeh, Daniel Ramage, Christopher D. Manning, Eneko Agirre, and Aitor Soroa. 2009. WikiWalk: Random walks on Wikipedia for Semantic Relatedness. In Proceedings of the Workshop on Graph-based Methods for Natural Language Processing, pages 41–49.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using Wiktionary for Computing Semantic Relatedness. In Proc. of the 23rd AAAI Conf. on AI, pages 861–867.

Yitao Zhang and Jon Patrick. 2005. Paraphrase Identification by Text Canonicalization. In Proc. of the Australasian Language Technology Workshop, pages 160–166.