
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 6889–6894

Marseille, 11–16 May 2020


© European Language Resources Association (ELRA), licensed under CC-BY-NC

A French Corpus for Semantic Similarity


Rémi Cardon, Natalia Grabar
UMR 8163 STL, CNRS, Université de Lille
Domaine du Pont de bois
59653 Villeneuve d’Ascq CEDEX, France
{remi.cardon, natalia.grabar}@univ-lille.fr

Abstract
Semantic similarity is an area of Natural Language Processing that is useful for several downstream applications, such as machine translation, natural language generation, information retrieval, or question answering. The task consists in assessing the extent to which two sentences express the same meaning. To do so, corpora with graded pairs of sentences are required. The grade is positioned on a given scale, usually going from 0 (completely unrelated) to 5 (equivalent semantics). In this work, we introduce such a corpus for French, the first that we know of. It comprises 1,010 sentence pairs with grades from five annotators. We describe the annotation process, analyse these data, and perform a few experiments for the automatic grading of semantic similarity.

Keywords: semantic similarity, manual annotation, French language, regression

1. Introduction

Semantic textual similarity is a subtask of Natural Language Processing. At the level of sentences, the task consists in evaluating to what extent two sentences express the same meaning. This task is useful for several applications, such as machine translation, text summarization, information retrieval, natural language generation, or text simplification (Wieting et al., 2019; Vadapalli et al., 2017; Yasui et al., 2019; Kajiwara and Komachi, 2016). Computing semantic textual similarity requires corpora with annotated pairs of sentences. The annotation is most of the time performed on a continuous scale where scores range from 0 (the sentences express completely unrelated meanings) to 5 (the meaning is exactly the same in both sentences). Several challenges dedicated to semantic textual similarity (STS) have been held within the SemEval evaluation campaign between 2012 and 2017. STS provides the research community with bilingual and monolingual data. In our work, we are interested in monolingual semantic similarity. For monolingual semantic similarity, data from a few languages (English, Spanish and Arabic) have been exploited (Cer et al., 2017) and made available to the research community. The overall STS benchmark data for English (http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark), with data taken from the editions held from 2012 to 2017, contains 8,628 sentence pairs, while only 250 sentence pairs were proposed for Spanish and for Arabic. Besides, similar data are also proposed for Portuguese through the ASSIN workshop dataset (Feitosa and Pinheiro, 2017), which is composed of 10,000 pairs: 5,000 for Brazilian Portuguese and 5,000 for European Portuguese. All those datasets are taken from general language and various sources: news articles, forum posts and video subtitles. Yet, there is no similar data in French.

In our work, we introduce a semantic textual similarity corpus for French. We first describe the data that have been used and the annotation process, then we present the resulting resource. We also describe an experiment that shows an attempt at reproducing the annotation automatically.

2. Corpus and Annotation Process

In this section, we first present the data provided to the annotators. We then describe the annotation process and analyse the annotation criteria defined by the annotators.

2.1. Data Processed

The same batch of 1,010 sentence pairs was provided to five annotators. The sentence pairs come from a general language corpus containing sentences extracted from Wikipedia (https://fr.wikipedia.org/) and Vikidia (https://fr.vikidia.org) articles, and from texts related to the medical field. In the latter case, the sentences are extracted from the CLEAR corpus (Grabar and Cardon, 2018), which includes information about drugs, medical literature reviews, and medicine-related articles from Wikipedia and Vikidia. The purpose of this corpus is to propose comparable contents which are distinguished by their technicality: technical and difficult to understand texts are paired with the corresponding simple or simplified texts. This is another factor that distinguishes our dataset from the existing datasets in other languages mentioned in section 1. The candidate pairs of sentences were generated automatically while building a classification method (Cardon and Grabar, 2019) and then validated and selected manually. That method is similar to the one described in section 4.1. The main difference is that it relies on the Random Forest classifier, whereas below we use the Random Forest regressor. In the work presented in this paper, the goal is to retain sentence pairs covering various degrees of similarity, in order to be able to train a model to assign values on a continuous scale instead of binary values (aligned or not aligned).

Hence, the semantic similarity between sentences within a given pair is due to their technicality and to the complexity of their contents, which can be lexical, syntactic or semantic. Here is an example from the CLEAR corpus, with an English translation:

1. Les effets graves intéressant les systèmes hépatique et/ou dermatologique ainsi que les réactions d'hypersensibilité imposent l'arrêt du traitement. (Severe effects affecting the liver and/or dermatological systems and hypersensitivity reactions require discontinuation of treatment.)

2. Le traitement doit être arrêté en cas de réaction allergique généralisée, éruption cutanée ou altérations de la fonction du foie. (Treatment should be discontinued in the event of a generalized allergic reaction, rash or impaired liver function.)
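To make the shape of the resource concrete, here is a minimal sketch of how one graded pair could be represented in memory. This is an illustration only: the distribution format of the released corpus is not described here, and the five grades shown are invented placeholders, not the published annotations.

```python
# Minimal sketch of a graded sentence pair as described in this paper:
# two French sentences and one grade (0-5) per annotator.
# The grades below are invented placeholders, NOT the corpus annotations.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GradedPair:
    sentence_a: str              # technical / specialized sentence
    sentence_b: str              # simple / simplified sentence
    grades: Tuple[float, ...]    # one score per annotator (A1..A5)

pair = GradedPair(
    sentence_a=("Les effets graves intéressant les systèmes hépatique et/ou "
                "dermatologique ainsi que les réactions d'hypersensibilité "
                "imposent l'arrêt du traitement."),
    sentence_b=("Le traitement doit être arrêté en cas de réaction allergique "
                "généralisée, éruption cutanée ou altérations de la fonction du foie."),
    grades=(4, 3, 3, 4, 4),      # placeholder values for illustration
)
print(len(pair.grades), "annotator grades for this pair")
```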

Score | A1 | A2 | A3 | A4 | A5
0.5 | | A few identical segments | | |
1 | Same topic, loose relation | One summarizes the other | Little shared information | Inference can be drawn | Almost unrelated meaning
1.5 | | Incomplete main information on one side and extra information missing | | |
2 | Same topic, different information | Incomplete main information on one side | Same function, little shared information | Intermediate level | Same subject, different information
2.5 | | Same meaning, radically different expression | | |
3 | Same topic, loosely shared information | Same meaning, different expression | Extra information on one side | Main concept of one sentence is missing in the other one | Extra information on one side
3.5 | | Same meaning, paraphrases are found | | |
4 | Almost same content, additional information on one side | Same meaning, slight rephrasing | Same function and almost same information | Additional information on one side | One slight difference in the delivered information
4.5 | | Same meaning, slight syntactic difference | | |

Table 1: Annotation criteria defined by the annotators (the intermediate .5 scores were used by A2 only)

2.2. Annotation Process

The five annotators involved have received higher education: two of them are trained in Natural Language Processing and one is a medical practitioner. Except for one, all annotators are native French speakers. The authors were not among the annotators. The annotation guidelines provided to the annotators were very simple and short:

• to assign a score of 0 when the sentences are completely unrelated,

• to assign a score of 5 when the sentences mean the same,

• to come up with their own scale and criteria for the intermediate values,

• to define a short description of the annotation criteria.

We preferred not to bias the manual annotations with a priori criteria, such as:

1. use the score n for sentence pairs with syntactic modifications,

2. use the score m for sentence pairs with lexical modifications, etc.

Indeed, our motivation was to exploit the linguistic competence of the annotators and to compare their semantic sensitivity and judgements. We also assume that, in this way, the overall semantic scores should better represent the semantic similarity between the sentences.

The annotators estimated that the annotation of the 1,010 pairs of sentences took between seven and fifteen hours.

2.3. Scales and Annotation Criteria according to the Annotators

The scales and criteria that were used by the annotators can be seen in Table 1. We can observe differences and similarities between the various annotation principles provided by the annotators:

• Except for one, all the annotators assigned integer scores [0, 1, 2, 3, 4, 5] to the pairs of sentences. One annotator also used intermediate scores [0.5, 1.5, 2.5, 3.5, 4.5];

• The A3 annotator took the sentences strictly as they were given, which means that the unknown context was considered as non-existent. That implies, for example, that pronouns in one sentence were never assumed to be referring to an element explicitly mentioned in the other sentence, which increases the likelihood of dissimilarity;

• The scales from A2 and A3 are much more conservative than the other three, yet they greatly differ from one another. A2 is the only annotator who focused on phrasing: in order to assign the highest score according to their scale, the two sentences have to be identical. The scale given by A3 is more similar to the other ones, but it is conservative because of the strict view related to context not being assumed;

• The specific grades 2, 3 and 4 are quite similar for all the annotators but A2: 2 implies that the sentences have something that differentiates them but deal with the same subject; 3 implies each time that there is shared information but that one sentence expresses information that is not found in the other one; and 4 implies that the information is "almost" or "slightly" the same;

• It is more difficult to analyse the relationship between the descriptions for grade 1. A1 and A4 both mention something in common, the domain, or grounds for inference, but also state that nothing more can reinforce the link between the two sentences. A3 and A5 focus on the lack of relation between the sentences.

To summarize, we can see that the annotators paid attention to several criteria when deciding about the semantic relatedness of the sentences:

• intersection of the meaning, such as missing information, incomplete information or extra words on either side,

• use of paraphrases and different expressions,

• possibility to draw textual inferences.

We also observe that the completeness of information is the criterion most frequently used by all the annotators.

2.4. Global Scores

Using all the scores from the five annotators, we computed two more values:

• the average score for each pair, rounded ("Avg" further down);

• the most frequent score out of the five for each pair ("Vote" further down).

3. Analysis of the Annotations

In this section, we further analyse the annotations: their breakdown by score and the correlation of the scores from the five annotators.

3.1. Breakdown by Score

[Figure 1: Breakdown by category and annotator. Bar chart of the number of pairs (y-axis) per score (x-axis) for each annotator, Avg and Vote.]

Figure 1 shows the breakdown by score and annotator. The x-axis shows the different scores and the y-axis shows the number of pairs. The isolated bars are due to the scale used by A2, which is the only one that included .5 values. We also indicate figures for Avg and Vote.

We can observe that the 0 score is the most used by every annotator but one (A1). The annotator A3 assigned the 0 score to almost half of the pairs, which is coherent with the annotation criteria of this annotator, who did not assume context for coreference and thus had the most conservative approach.

We can also see that every annotator but one (A1) used 4 more often than 5. This can be explained by the nature of the sentence pairs. As stated in section 2.1., the source corpus is aimed towards simplification and the sentence pairs come from document pairs where one document is more technical than the other. In consequence, it can be expected that there are more almost identical sentences than entirely identical ones, as the texts are not written for the same audience and thus do not deliver the exact same information in the exact same way.

Looking at Avg and Vote, we observe that grades 3 and 4 are the most consistent overall. Grade 2 seems to be the most inconsistent, with an average count that is way above the individual counts and a vote count that is low.

3.2. Correlation Coefficients

We computed Krippendorff's α (Krippendorff, 1970) to evaluate the global agreement of the annotations. The α value for the five annotators is 0.69. This value is above the threshold generally considered as reliable (α = 0.67), yet it remains quite low. When we take the average and the vote scores into account for the computation, the α value goes up to 0.77, which is a sign that putting all the annotations together significantly improves the data reliability. In order to explore those results more deeply, we computed the correlation between pairs of annotators.
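As a concrete illustration of how the Avg and Vote scores (section 2.4) and the agreement figures reported here can be computed, here is a minimal sketch. It assumes the annotations are available as a matrix with one row per annotator and one column per sentence pair, and it relies on the third-party krippendorff package and on SciPy; neither library is mentioned in the paper, and the tiny score matrix below is invented, so the printed numbers will not match Table 2.

```python
# Minimal sketch (not the authors' code): Avg, Vote, Krippendorff's alpha
# and pairwise Pearson correlations for a small invented score matrix.
# Requires: pip install numpy scipy krippendorff
import numpy as np
from collections import Counter
from scipy.stats import pearsonr
import krippendorff

# rows = annotators A1..A5, columns = sentence pairs (toy data, not the corpus)
scores = np.array([
    [0, 3, 4, 5, 1, 2],
    [0, 2, 4, 5, 1, 2],
    [0, 3, 3, 4, 0, 2],
    [1, 3, 4, 5, 1, 3],
    [0, 3, 4, 5, 1, 2],
], dtype=float)

# "Avg": rounded average score per pair; "Vote": most frequent score per pair
avg = np.rint(scores.mean(axis=0))
vote = np.array([Counter(col).most_common(1)[0][0] for col in scores.T])

# Global agreement, with and without the two derived rows
alpha_5 = krippendorff.alpha(reliability_data=scores,
                             level_of_measurement="interval")
alpha_7 = krippendorff.alpha(reliability_data=np.vstack([scores, avg, vote]),
                             level_of_measurement="interval")
print(f"alpha (5 annotators) = {alpha_5:.2f}, alpha (+Avg, +Vote) = {alpha_7:.2f}")

# Pairwise Pearson correlations between annotators (cf. Table 2)
for i in range(len(scores)):
    for j in range(i + 1, len(scores)):
        r, _ = pearsonr(scores[i], scores[j])
        print(f"A{i+1} vs A{j+1}: r = {r:.2f}")
```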

   | A1   | A2   | A3   | A4   | A5
A1 | 1.0  | 0.77 | 0.72 | 0.84 | 0.81
A2 | 0.77 | 1.0  | 0.64 | 0.75 | 0.74
A3 | 0.72 | 0.64 | 1.0  | 0.75 | 0.70
A4 | 0.84 | 0.75 | 0.75 | 1.0  | 0.80
A5 | 0.81 | 0.74 | 0.70 | 0.80 | 1.0

Table 2: Pearson's correlation coefficients between the annotators

Table 2 shows the Pearson correlation (Kirch, 2008) for every combination of two annotators. The observations we can make are consistent with Figure 1 and the criteria described in section 2.3.:

• The lowest correlation coefficient (0.64) occurs between A2 and A3: A2 is the annotator who used steps of .5 in his scale, and A3 relied on annotation principles that had him assign 0 to almost half of the pairs. Hence, those two annotators applied the annotation scales and criteria that differ the most from the other ones.

• The correlation coefficients between the other three annotators (A1, A4 and A5) are the highest: 0.84 for A1 and A4, 0.81 for A1 and A5, and 0.80 for A4 and A5.

• The other associations range between 0.70 and 0.77.

Globally, the correlation coefficients show a satisfying reliability for the dataset, with variations according to the different scales that were used. We see that the two scales that stand out have the lowest correlation coefficient with each other, but at the same time they have a good correlation coefficient with the other three. Those other three have strong coefficients with one another.

4. Experiments

In order to study how the resulting corpus can be exploited, we ran an experiment to check how accurately we could automatically reproduce the annotations. In this section, we first describe the automatic approach for scoring the pairs of sentences and then the results obtained.

4.1. Automatic Approach for Scoring the Pairs of Sentences

We exploited a previously proposed method dedicated to the detection of parallel sentences in comparable corpora (Cardon and Grabar, 2019). Yet, in order to predict values on a continuous scale, the Random Forest regressor is exploited instead of the classifier. We compute and use several sets of features, mainly obtained from the lexical and sublexical content of the sentences, their word-based similarity, and the corpus-suggested similarity from word embeddings:

1. Number of common non-stopwords. This feature computes the basic lexical overlap between specialized and simplified versions of sentences (Barzilay and Elhadad, 2003). It concentrates on the non-stopword content of the sentences;

2. Percentage of words from one sentence included in the other sentence, computed in both directions. This feature represents possible lexical and semantic inclusion relations between the sentences;

3. Sentence length difference between specialized and simplified sentences. This feature assumes that simplification may imply a stable association with sentence length;

4. Average length difference in words between specialized and simplified sentences. This feature is similar to the previous one but takes into account the average difference in sentence length;

5. Total number of common bigrams and trigrams. This feature is computed on character n-grams. The assumption is that, at the sub-word level, some sequences of characters may be meaningful for the alignment of sentences if they are shared by them;

6. Word-based similarity measures, exploiting three scores (cosine, Dice and Jaccard). This feature provides a more sophisticated indication of the word overlap between two sentences. The weight assigned to each word is set to 1;

7. Character-based minimal edit distance (Levenshtein, 1966). This is a classical computation of edit distance. It takes into account basic edit operations (insertion, deletion and substitution) at the level of characters. The cost of each operation is set to 1;

8. Word-based minimal edit distance (Levenshtein, 1966). This feature is computed with words as units within the sentence. It takes into account the same three edit operations with the same cost set to 1. This feature computes the cost of the lexical transformation of one sentence into another;

9. WAVG. This feature uses word embeddings. The word vectors of each sentence are averaged, and the similarity score is calculated by comparing the two resulting sentence vectors (Stajner et al., 2018);

10. CWASA. This feature is the continuous word alignment-based similarity analysis, as described in (Franco-Salvador et al., 2016).

For the last two features, we trained the embeddings on the CLEAR corpus using word2vec (Mikolov et al., 2013), and the scores are computed using the CATS tool (Stajner et al., 2018).
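To make the feature set more concrete, the sketch below computes simplified versions of some of the surface features listed above (common non-stopwords, bidirectional word inclusion, length differences, shared character n-grams, cosine/Dice/Jaccard overlap, and the two edit distances) and feeds them to a Random Forest regressor. It is an illustration under assumptions, not the authors' implementation: the stopword list, tokenisation and scikit-learn are our choices, and the embedding-based features (WAVG, CWASA) are omitted.

```python
# Minimal sketch (not the authors' code) of surface features + Random Forest regression.
# Requires: pip install numpy scikit-learn
import numpy as np
from sklearn.ensemble import RandomForestRegressor

STOPWORDS = {"le", "la", "les", "de", "du", "des", "et", "ou", "en", "un", "une"}  # toy list

def char_ngrams(s, n):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def levenshtein(a, b):
    # word- or character-based edit distance, cost 1 per operation
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def features(sent_a, sent_b):
    wa, wb = sent_a.lower().split(), sent_b.lower().split()
    ca, cb = set(wa) - STOPWORDS, set(wb) - STOPWORDS
    inter = ca & cb
    cos = len(inter) / (np.sqrt(len(ca) * len(cb)) or 1.0)
    dice = 2 * len(inter) / ((len(ca) + len(cb)) or 1)
    jacc = len(inter) / (len(ca | cb) or 1)
    ngrams = sum(len(char_ngrams(sent_a, n) & char_ngrams(sent_b, n)) for n in (2, 3))
    return [
        len(inter),                                  # 1. common non-stopwords
        len(inter) / (len(ca) or 1),                 # 2. inclusion, a towards b
        len(inter) / (len(cb) or 1),                 # 2. inclusion, b towards a
        len(sent_a) - len(sent_b),                   # 3. length difference (characters)
        len(wa) - len(wb),                           # 4. length difference (words)
        ngrams,                                      # 5. shared character bi/trigrams
        cos, dice, jacc,                             # 6. word-based similarity scores
        levenshtein(sent_a, sent_b),                 # 7. character-based edit distance
        levenshtein(wa, wb),                         # 8. word-based edit distance
    ]

# Toy usage: X holds feature vectors, y the (here invented) similarity grades
pairs = [("le traitement doit être arrêté", "il faut arrêter le traitement"),
         ("les effets graves imposent l'arrêt", "le chat dort sur le canapé")]
X = np.array([features(a, b) for a, b in pairs])
y = np.array([4.0, 0.0])  # placeholder grades, not corpus annotations
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(X))
```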

4.2. Results

We ran the experiment for every annotator, as well as for Avg and Vote. We randomly split the data into 90% for training and 10% for testing. As there are small variations on each run due to the random splitting, each reported score is the average over twenty runs.

A1   | A2   | A3   | A4   | A5   | Avg  | Vote
0.82 | 0.73 | 0.80 | 0.79 | 0.78 | 0.87 | 0.78

Table 3: Pearson's correlation coefficients on the regression experiments

Table 3 shows the results obtained when scoring the pairs of sentences is tackled as a regression task. For the annotators, the correlation coefficients range from 0.73 (A2) to 0.82 (A1). This shows that the various scales can be automatically reproduced and that, even if there are important differences between them, the annotations can be considered to be coherent.

The most engaging observation is that the best results (0.87) are obtained on the average scores. This may mean that the average scores and the collective perception of the semantic similarity remain coherent despite the differences observed during the annotation process.

Interestingly, the result for Vote equals the mean of the scores obtained for the five annotators individually.
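The evaluation protocol just described can be reproduced with a few lines of code. The sketch below assumes a feature matrix X and a vector of gold grades y are already available (for instance built with surface features like those sketched in section 4.1); scikit-learn and SciPy are our assumptions, and the random placeholder data stand in for the real corpus, so the printed correlation is meaningless in itself.

```python
# Minimal sketch (not the authors' code) of the evaluation protocol:
# 20 random 90/10 train/test splits, Pearson correlation averaged over runs.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1010, 11))          # placeholder features (see section 4.1)
y = rng.uniform(0, 5, size=1010)         # placeholder gold grades on the 0-5 scale

correlations = []
for run in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=run)
    model = RandomForestRegressor(n_estimators=100, random_state=run).fit(X_tr, y_tr)
    r, _ = pearsonr(y_te, model.predict(X_te))
    correlations.append(r)

print(f"mean Pearson r over 20 runs: {np.mean(correlations):.2f}")
```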
5. Conclusion

We introduced a corpus annotated for semantic textual similarity in French. Currently, this kind of data is indeed missing for French. The corpus is composed of 1,010 sentence pairs that come from comparable corpora aimed towards text simplification. More precisely, the original texts come from the CLEAR corpus and from Wikipedia and Vikidia articles. The corpus comes with grades manually assigned by five annotators. Together with the scores, the annotators provided the annotation scheme they adopted. We performed an analysis of the resulting data and showed that there are discrepancies in the scores that have been assigned. Those discrepancies can be explained by different annotation factors. We then used these data to automatically predict the scores of the pairs of sentences. This set of experiments shows that the scores can be reproduced quite well with automatic approaches. This indicates that the manually created data are reliable and can be used for a variety of experiments where semantic textual similarity is of interest. At the time of publication, the dataset is being used in an NLP challenge and will then be made available to the research community.

6. Acknowledgements

We are grateful to our annotators for their valuable work. We would also like to thank the reviewers for their helpful comments. This work was funded by the French National Agency for Research (ANR) as part of the CLEAR project (Communication, Literacy, Education, Accessibility, Readability), ANR-17-CE19-0016-01.

7. Bibliographical References

Barzilay, R. and Elhadad, N. (2003). Sentence alignment for monolingual comparable corpora. In EMNLP, pages 25–32.

Cardon, R. and Grabar, N. (2019). Parallel sentence retrieval from comparable corpora for biomedical text simplification. In Proceedings of Recent Advances in Natural Language Processing, pages 168–177, Varna, Bulgaria, September.

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. (2017). SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, August. Association for Computational Linguistics.

Feitosa, D. and Pinheiro, V. (2017). Análise de medidas de similaridade semântica na tarefa de reconhecimento de implicação textual (Analysis of semantic similarity measures in the recognition of textual entailment task) [in Portuguese]. In Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, pages 161–170, Uberlândia, Brazil, October. Sociedade Brasileira de Computação.

Franco-Salvador, M., Gupta, P., Rosso, P., and Banchs, R. E. (2016). Cross-language plagiarism detection over continuous-space and knowledge graph-based representations of language. Knowledge-Based Systems, 111:87–99.

Grabar, N. and Cardon, R. (2018). CLEAR – Simple Corpus for Medical French. In Workshop on Automatic Text Adaption (ATA), pages 1–11, Tilburg, Netherlands.

Kajiwara, T. and Komachi, M. (2016). Building a monolingual parallel corpus for text simplification using sentence similarity based on alignment between word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1147–1158, Osaka, Japan, December. The COLING 2016 Organizing Committee.

Kirch, W., editor (2008). Pearson's Correlation Coefficient, pages 1090–1091. Springer Netherlands, Dordrecht.

Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Stajner, S., Franco-Salvador, M., Ponzetto, S. P., and Rosso, P. (2018). CATS: A tool for customised alignment of text simplification corpora. In Proceedings of the 11th Language Resources and Evaluation Conference, LREC 2018, Miyazaki, Japan, May 7–12.

Vadapalli, R., J Kurisinkel, L., Gupta, M., and Varma, V. (2017). SSAS: Semantic similarity for abstractive summarization. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 198–203, Taipei, Taiwan, November. Asian Federation of Natural Language Processing.
Wieting, J., Berg-Kirkpatrick, T., Gimpel, K., and Neubig, G. (2019). Beyond BLEU: Training neural machine translation with semantic similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4344–4355, Florence, Italy, July. Association for Computational Linguistics.

Yasui, G., Tsuruoka, Y., and Nagata, M. (2019). Using semantic similarity as reward for reinforcement learning in sentence generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 400–406, Florence, Italy, July. Association for Computational Linguistics.

