Abstract
Semantic similarity is an area of Natural Language Processing that is useful for several downstream applications, such as machine translation, natural language generation, information retrieval, or question answering. The task consists of assessing the extent to which two sentences express the same meaning. To do so, corpora with graded pairs of sentences are required. The grade is positioned on a given scale, usually going from 0 (completely unrelated) to 5 (equivalent semantics). In this work, we introduce such a corpus for French, the first that we know of. It comprises 1,010 sentence pairs with grades from five annotators. We describe the annotation process, analyse these data, and perform a few experiments for the automatic grading of semantic similarity.
Score 0.5 (one annotator): a few identical segments.
Score 1. A1: same topic, loose relation. A2: one summarizes the other. A3: little shared information. A4: inference can be drawn. A5: almost unrelated meaning.
Score 1.5 (one annotator): incomplete main information on one side and extra information missing.
Score 2. A1: same topic, different information. A2: incomplete main information on one side. A3: same function, little shared information. A4: intermediate level. A5: same subject, different information.
Score 2.5 (one annotator): same meaning, radically different expression.
Score 3. A1: same topic, loosely shared information. A2: same meaning, different expression. A3: extra information on one side. A4: main concept of one sentence is missing in the other one. A5: extra information on one side.
Score 3.5 (one annotator): same meaning, paraphrases are found.
Score 4. A1: almost same content, additional information on one side. A2: same meaning, slight rephrasing. A3: same function and almost same information. A4: additional information on one side. A5: one slight difference in the delivered information.
Score 4.5 (one annotator): same meaning, slight syntactic difference.

Table 1: Scales and annotation criteria provided by the five annotators (A1 to A5). Intermediate scores were used by a single annotator.
1. Les effets graves intéressant les systèmes hépatique et/ou dermatologique ainsi que les réactions d'hypersensibilité imposent l'arrêt du traitement. (Severe effects affecting the liver and/or dermatological systems and hypersensitivity reactions require discontinuation of treatment.)

2. Le traitement doit être arrêté en cas de réaction allergique généralisée, éruption cutanée ou altérations de la fonction du foie. (Treatment should be discontinued in the event of a generalized allergic reaction, rash or impaired liver function.)

2.2. Annotation Process
The five annotators involved have received higher education: two of them are trained in Natural Language Processing and one is a medical practitioner. All annotators but one are native French speakers. The authors were not among the annotators. The annotation guidelines provided to the annotators were very simple and short:

• to assign a score of 0 when the sentences are completely unrelated,

• to assign a score of 5 when the sentences mean the same,

• to come up with their own scale and criteria for the intermediate values,

• to define a short description of the annotation criteria.

We preferred not to bias the manual annotations with a priori criteria such as:

1. use the score n for sentence pairs with syntactic modifications,

2. use the score m for sentence pairs with lexical modifications, etc.

Indeed, our motivation was to exploit the linguistic competence of the annotators and to compare their semantic sensitivity and judgements. We also assume that, in this way, the overall semantic scores should better represent the semantic similarity between the sentences.
The annotators estimated that the annotation of the 1,010 pairs of sentences took between seven and fifteen hours.

2.3. Scales and Annotation Criteria according to the Annotators
The scales and criteria used by the annotators can be seen in Table 1. We can observe differences and similarities between the various annotation principles provided by the annotators:
• All annotators but one assigned integer scores [0, 1, 2, 3, 4, 5] to the sentence pairs. One annotator also used intermediary scores [0.5, 1.5, 2.5, 3.5, 4.5];

• The A3 annotator took the sentences strictly as they were given, which means that the unknown context was treated as non-existent. This implies, for example, that pronouns in one sentence were never assumed to refer to an element explicitly mentioned in the other sentence, which increases the likelihood of dissimilarity;

3. Analysis of the Annotations
In this section, we further analyse the annotations: their breakdown by score and the correlation of the scores from the five annotators.

3.1. Breakdown by Score
     A1    A2    A3    A4    A5
A1   1.0   0.77  0.72  0.84  0.81
A2   0.77  1.0   0.64  0.75  0.74
A3   0.72  0.64  1.0   0.75  0.70
A4   0.84  0.75  0.75  1.0   0.80
A5   0.81  0.74  0.70  0.80  1.0

Table 2: Pearson's correlation coefficients between the annotators

2. Percentage of words from one sentence included in the other sentence, computed in both directions. This feature represents possible lexical and semantic inclusion relations between the sentences;

3. Sentence length difference between the specialized and simplified sentences. This feature assumes that simplification may have a stable association with sentence length;
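In the same spirit, the pairwise coefficients reported in Table 2 above can be recomputed from the raw annotations. The sketch below is a minimal example, assuming the scores are available as one list per annotator (the toy values shown are ours); it relies on scipy.stats.pearsonr.

```python
from itertools import combinations
from scipy.stats import pearsonr

# Toy example with five sentence pairs; in practice each list would hold the
# 1,010 scores of one annotator, in the same pair order.
scores = {
    "A1": [0, 1, 3, 4, 5],
    "A2": [0, 2, 3, 4, 5],
    "A3": [1, 1, 2, 4, 5],
    "A4": [0, 1, 3, 5, 5],
    "A5": [0, 1, 4, 4, 5],
}

for a, b in combinations(sorted(scores), 2):
    r, _ = pearsonr(scores[a], scores[b])
    print(f"{a} vs {b}: Pearson r = {r:.2f}")
```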
A1     A2     A3     A4     A5     Avg    Vote
0.82   0.73   0.80   0.79   0.78   0.87   0.78

Table 3: Pearson's correlation coefficients for the regression experiments

The correlations for the individual annotators range from 0.73 (A2) to 0.82 (A1). This shows that the various scales can be automatically reproduced and, even if there are important differences between them, the annotations can be considered coherent.
The most engaging observation is that the best result (0.87) is obtained on the average scores. This may mean that the average scores and the collective perception of semantic similarity remain coherent despite the differences observed during the annotation process.
Interestingly, the result for Vote (0.78) is equal to the mean of the individual results for the five annotators: (0.82 + 0.73 + 0.80 + 0.79 + 0.78) / 5 ≈ 0.78.
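The exact feature set and regression model behind Table 3 are not detailed in this excerpt, so the following is only a generic sketch of such an experiment under our own assumptions: synthetic stand-in data, a Ridge regression from scikit-learn, and Pearson correlation between predicted and reference scores as the evaluation measure.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the corpus: 200 pairs, 3 features each, and a
# similarity score in [0, 5]. In practice X would hold features such as those
# described above and y the manual scores (per annotator, averaged, or voted).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = np.clip(5 * X.mean(axis=1) + rng.normal(0, 0.3, size=200), 0, 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)   # model choice is ours
r, _ = pearsonr(y_test, model.predict(X_test))
print(f"Pearson correlation on held-out pairs: {r:.2f}")
```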
5. Conclusion
We introduced a corpus annotated for semantic textual similarity in French. Such data has so far been missing for French. The corpus is composed of 1,010 sentence pairs that come from comparable corpora aimed towards text simplification. More precisely, the original texts come from the CLEAR corpus and from Wikipedia and Vikidia articles. The corpus comes with grades manually assigned by five annotators. Together with the scores, the annotators provided the annotation scheme they adopted.
We performed an analysis of the resulting data and showed that there are discrepancies in the assigned scores. Those discrepancies can be explained by different annotation factors. We then used these data to automatically predict the scores of the sentence pairs. This set of experiments shows that the scores can be reproduced quite well with automatic approaches. This indicates that the manually created data are reliable and can be used for a variety of experiments where semantic textual similarity is of interest. At the time of publication, the dataset is being used in an NLP challenge and will be made available to the research community.

6. Acknowledgements
We are grateful to our annotators for their valuable work. We would also like to thank the reviewers for their helpful comments. This work was funded by the French National Agency for Research (ANR) as part of the CLEAR project (Communication, Literacy, Education, Accessibility, Readability), ANR-17-CE19-0016-01.

7. Bibliographical References
Barzilay, R. and Elhadad, N. (2003). Sentence alignment for monolingual comparable corpora. In EMNLP, pages 25–32.
Cardon, R. and Grabar, N. (2019). Parallel sentence retrieval from comparable corpora for biomedical text simplification. In Proceedings of Recent Advances in Natural Language Processing, pages 168–177, Varna, Bulgaria, September.
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. (2017). SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, August. Association for Computational Linguistics.
Feitosa, D. and Pinheiro, V. (2017). Análise de medidas de similaridade semântica na tarefa de reconhecimento de implicação textual (Analysis of semantic similarity measures in the recognition of textual entailment task) [in Portuguese]. In Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, pages 161–170, Uberlândia, Brazil, October. Sociedade Brasileira de Computação.
Franco-Salvador, M., Gupta, P., Rosso, P., and Banchs, R. E. (2016). Cross-language plagiarism detection over continuous-space and knowledge graph-based representations of language. Knowledge-Based Systems, 111:87–99.
Grabar, N. and Cardon, R. (2018). CLEAR – Simple Corpus for Medical French. In Workshop on Automatic Text Adaption (ATA), pages 1–11, Tilburg, Netherlands.
Kajiwara, T. and Komachi, M. (2016). Building a monolingual parallel corpus for text simplification using sentence similarity based on alignment between word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1147–1158, Osaka, Japan, December. The COLING 2016 Organizing Committee.
Kirch, W., editor (2008). Pearson's Correlation Coefficient, pages 1090–1091. Springer Netherlands, Dordrecht.
Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, Lake Tahoe, Nevada, United States, December.
Stajner, S., Franco-Salvador, M., Ponzetto, S. P., and Rosso, P. (2018). CATS: A tool for customised alignment of text simplification corpora. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), Miyazaki, Japan, May.
Vadapalli, R., J Kurisinkel, L., Gupta, M., and Varma, V. (2017). SSAS: Semantic similarity for abstractive summarization. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 198–203, Taipei, Taiwan, November. Asian Federation of Natural Language Processing.
Wieting, J., Berg-Kirkpatrick, T., Gimpel, K., and Neubig, G. (2019). Beyond BLEU: Training neural machine translation with semantic similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4344–4355, Florence, Italy, July. Association for Computational Linguistics.
Yasui, G., Tsuruoka, Y., and Nagata, M. (2019). Using semantic similarity as reward for reinforcement learning in sentence generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 400–406, Florence, Italy, July. Association for Computational Linguistics.