
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 6889–6894

Marseille, 11–16 May 2020


© European Language Resources Association (ELRA), licensed under CC-BY-NC

A French Corpus for Semantic Similarity


Rémi Cardon, Natalia Grabar
UMR 8163 STL, CNRS, Université de Lille
Domaine du Pont de bois
59653 Villeneuve d’Ascq CEDEX, France
{remi.cardon, natalia.grabar}@univ-lille.fr

Abstract
Semantic similarity is an area of Natural Language Processing that is useful for several downstream applications, such as machine translation, natural language generation, information retrieval, or question answering. The task consists in assessing the extent to which two sentences express the same meaning. To do so, corpora with graded pairs of sentences are required. The grade is positioned on a given scale, usually going from 0 (completely unrelated) to 5 (equivalent semantics). In this work, we introduce such a corpus for French, the first that we know of. It comprises 1,010 sentence pairs with grades from five annotators. We describe the annotation process, analyse these data, and perform a few experiments for the automatic grading of semantic similarity.

Keywords: semantic similarity, manual annotation, French language, regression

1. Introduction

Semantic textual similarity is a subtask of Natural Language Processing. At the level of sentences, the task consists in evaluating to what extent two sentences express the same meaning. This task is useful for several applications, such as machine translation, text summarization, information retrieval, natural language generation, or text simplification (Wieting et al., 2019; Vadapalli et al., 2017; Yasui et al., 2019; Kajiwara and Komachi, 2016). Computing semantic textual similarity requires corpora with annotated pairs of sentences. The annotation is most of the time performed on a continuous scale where scores range from 0 (the sentences express completely unrelated meanings) to 5 (the meaning is exactly the same in both sentences). Several challenges dedicated to semantic textual similarity (STS) have been held within the SemEval evaluation campaign between 2012 and 2017. STS provides the research community with bilingual and monolingual data. In our work, we are interested in monolingual semantic similarity. For monolingual semantic similarity, data from a few languages (English, Spanish and Arabic) have been exploited (Cer et al., 2017) and made available to the research community. The overall STS benchmark data for English (http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark), with data taken from the editions held from 2012 to 2017, contains 8,628 sentence pairs, while only 250 sentence pairs were proposed for Spanish and for Arabic. Besides, similar data are also proposed for Portuguese through the ASSIN workshop dataset (Feitosa and Pinheiro, 2017), which is composed of 10,000 pairs: 5,000 for Brazilian Portuguese and 5,000 for European Portuguese. All those datasets are taken from general language and various sources: news articles, forum posts and video subtitles. Yet, there is no similar data in French.

In our work, we introduce a semantic textual similarity corpus for French. We first describe the data that have been used and the annotation process, then we present the resulting resource. We also describe an experiment that shows an attempt at reproducing the annotation automatically.

2. Corpus and Annotation Process

In this section, we first present the data provided to the annotators. We then describe the annotation process and analyse the annotation criteria defined by the annotators.

2.1. Data Processed

The same batch of 1,010 sentence pairs was provided to five annotators. The sentence pairs come from a general language corpus containing sentences extracted from Wikipedia (https://fr.wikipedia.org/) and Vikidia (https://fr.vikidia.org) articles, and from texts related to the medical field. In the latter case, the sentences are extracted from the CLEAR corpus (Grabar and Cardon, 2018), which includes information about drugs, medical literature reviews, and medicine-related articles from Wikipedia and Vikidia. The purpose of this corpus is to propose comparable contents which are distinguished by their technicality: technical and difficult to understand texts are paired with the corresponding simple or simplified texts. This is another factor that distinguishes our dataset from the existing datasets in other languages mentioned in section 1. The candidate pairs of sentences were generated automatically while building a classification method (Cardon and Grabar, 2019) and then validated and selected manually. That method is similar to the one described in section 4.1. The main difference is that it relies on the Random Forest classifier, whereas below we use the Random Forest regressor. In the work presented in this paper, the goal is to retain sentence pairs covering various degrees of similarity, in order to be able to train a model to assign values on a continuous scale instead of binary values (aligned or not aligned).

Hence, the semantic similarity between sentences within a given pair is due to their technicality and to the complexity of their contents, which can be lexical, syntactic or semantic. Here is an example from the CLEAR corpus, with an English translation:

1. Les effets graves intéressant les systèmes hépatique et/ou dermatologique ainsi que les réactions d'hypersensibilité imposent l'arrêt du traitement. (Severe effects affecting the liver and/or dermatological systems and hypersensitivity reactions require discontinuation of treatment.)

2. Le traitement doit être arrêté en cas de réaction allergique généralisée, éruption cutanée ou altérations de la fonction du foie. (Treatment should be discontinued in the event of a generalized allergic reaction, rash or impaired liver function.)
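To make the shape of the resource concrete, here is a minimal sketch of how one graded pair could be represented in memory. This is an illustration only: the distribution format of the released corpus is not described here, and the five grades shown are invented placeholders, not the published annotations.

```python
# Minimal sketch of a graded sentence pair as described in this paper:
# two French sentences and one grade (0-5) per annotator.
# The grades below are invented placeholders, NOT the corpus annotations.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GradedPair:
    sentence_a: str              # technical / specialized sentence
    sentence_b: str              # simple / simplified sentence
    grades: Tuple[float, ...]    # one score per annotator (A1..A5)

pair = GradedPair(
    sentence_a=("Les effets graves intéressant les systèmes hépatique et/ou "
                "dermatologique ainsi que les réactions d'hypersensibilité "
                "imposent l'arrêt du traitement."),
    sentence_b=("Le traitement doit être arrêté en cas de réaction allergique "
                "généralisée, éruption cutanée ou altérations de la fonction du foie."),
    grades=(4, 3, 3, 4, 4),      # placeholder values for illustration
)
print(len(pair.grades), "annotator grades for this pair")
```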

Score | A1 | A2 | A3 | A4 | A5
0.5 | | A few identical segments | | |
1 | Same topic, loose relation | One summarizes the other | Little shared information | Inference can be drawn | Almost unrelated meaning
1.5 | | Incomplete main information on one side and extra information missing | | |
2 | Same topic, different information | Incomplete main information on one side | Same function, little shared information | Intermediate level | Same subject, different information
2.5 | | Same meaning, radically different expression | | |
3 | Same topic, loosely shared information | Same meaning, different expression | Extra information on one side | Main concept of one sentence is missing in the other one | Extra information on one side
3.5 | | Same meaning, paraphrases are found | | |
4 | Almost same content, additional information on one side | Same meaning, slight rephrasing | Same function and almost same information | Additional information on one side | One slight difference in the delivered information
4.5 | | Same meaning, slight syntactic difference | | |

Table 1: Annotation criteria defined by the annotators (the intermediate .5 scores were used by A2 only)

2.2. Annotation Process

The five annotators involved have received higher education: two of them are trained in Natural Language Processing and one is a medical practitioner. Except for one, all annotators are native French speakers. The authors were not among the annotators. The annotation guidelines provided to the annotators were very simple and short:

• to assign a score of 0 when the sentences are completely unrelated,

• to assign a score of 5 when the sentences mean the same,

• to come up with their own scale and criteria for the intermediate values,

• to define a short description of the annotation criteria.

We preferred not to bias the manual annotations with a priori criteria, such as:

1. use the score n for sentence pairs with syntactic modifications,

2. use the score m for sentence pairs with lexical modifications, etc.

Indeed, our motivation was to exploit the linguistic competence of the annotators and to compare their semantic sensitivity and judgements. We also assume that, in this way, the overall semantic scores should better represent the semantic similarity between the sentences.

The annotators estimated that the annotation of the 1,010 pairs of sentences took between seven and fifteen hours.

2.3. Scales and Annotation Criteria according to the Annotators

The scales and criteria that were used by the annotators can be seen in Table 1. We can observe differences and similarities between the various annotation principles provided by the annotators:

• Except for one, all the annotators assigned integer scores [0, 1, 2, 3, 4, 5] to the pairs of sentences. One annotator also used intermediate scores [0.5, 1.5, 2.5, 3.5, 4.5];

• The A3 annotator took the sentences strictly as they were given, which means that the unknown context was considered as non-existent. That implies, for example, that pronouns in one sentence were never assumed to be referring to an element explicitly mentioned in the other sentence, which increases the likelihood of dissimilarity;

• The scales from A2 and A3 are much more conservative than the other three, yet they greatly differ from one another. A2 is the only annotator who focused on phrasing: in order to assign the highest score according to their scale, the two sentences have to be identical. The scale given by A3 is more similar to the other ones, but it is conservative because of the strict view related to context not being assumed;

• The specific grades 2, 3 and 4 are quite similar for all the annotators but A2: 2 implies that the sentences have something that differentiates them but deal with the same subject; 3 implies each time that there is shared information but that one sentence expresses information that is not found in the other one; and 4 implies that the information is "almost" or "slightly" the same;

• It is more difficult to analyse the relationship between the descriptions for grade 1. A1 and A4 both mention something in common, the domain, or grounds for inference, but also state that nothing more can reinforce the link between the two sentences. A3 and A5 focus on the lack of relation between the sentences.

To summarize, we can see that the annotators paid attention to several criteria when deciding about the semantic relatedness of the sentences:

• intersection of the meaning, such as missing information, incomplete information or extra words on either side,

• use of paraphrases and different expressions,

• possibility to draw textual inferences.

We also observe that the completeness of information is the criterion most frequently used by all the annotators.

2.4. Global Scores

Using all the scores from the five annotators, we computed two more values:

• the average score for each pair, rounded ("Avg" further down);

• the most frequent score out of the five for each pair ("Vote" further down).

3. Analysis of the Annotations

In this section, we further analyse the annotations: their breakdown by score and the correlation of the scores from the five annotators.

3.1. Breakdown by Score

[Figure 1: Breakdown by category and annotator. Bar chart of the number of pairs (y-axis) per score (x-axis) for each annotator, Avg and Vote.]

Figure 1 shows the breakdown by score and annotator. The x-axis shows the different scores and the y-axis shows the number of pairs. The isolated bars are due to the scale used by A2, which is the only one that included .5 values. We also indicate figures for Avg and Vote.

We can observe that the 0 score is the most used by every annotator but one (A1). The annotator A3 assigned the 0 score to almost half of the pairs, which is coherent with the annotation criteria of this annotator, who did not assume context for coreference and thus had the most conservative approach.

We can also see that every annotator but one (A1) used 4 more often than 5. This can be explained by the nature of the sentence pairs. As stated in section 2.1., the source corpus is aimed towards simplification and the sentence pairs come from document pairs where one document is more technical than the other. In consequence, it can be expected that there are more almost identical sentences than entirely identical ones, as the texts are not written for the same audience and thus do not deliver the exact same information in the exact same way.

Looking at Avg and Vote, we observe that grades 3 and 4 are the most consistent overall. Grade 2 seems to be the most inconsistent, with an average count that is way above the individual counts and a vote count that is low.

3.2. Correlation Coefficients

We computed Krippendorff's α (Krippendorff, 1970) to evaluate the global agreement of the annotations. The α value for the five annotators is 0.69. This value is above the threshold generally considered as reliable (α = 0.67), yet it remains quite low. When we take the average and the vote scores into account for the computation, the α value goes up to 0.77, which is a sign that putting all the annotations together significantly improves the data reliability. In order to explore those results more deeply, we computed the correlation between pairs of annotators.
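As a concrete illustration of how the Avg and Vote scores (section 2.4) and the agreement figures reported here can be computed, here is a minimal sketch. It assumes the annotations are available as a matrix with one row per annotator and one column per sentence pair, and it relies on the third-party krippendorff package and on SciPy; neither library is mentioned in the paper, and the tiny score matrix below is invented, so the printed numbers will not match Table 2.

```python
# Minimal sketch (not the authors' code): Avg, Vote, Krippendorff's alpha
# and pairwise Pearson correlations for a small invented score matrix.
# Requires: pip install numpy scipy krippendorff
import numpy as np
from collections import Counter
from scipy.stats import pearsonr
import krippendorff

# rows = annotators A1..A5, columns = sentence pairs (toy data, not the corpus)
scores = np.array([
    [0, 3, 4, 5, 1, 2],
    [0, 2, 4, 5, 1, 2],
    [0, 3, 3, 4, 0, 2],
    [1, 3, 4, 5, 1, 3],
    [0, 3, 4, 5, 1, 2],
], dtype=float)

# "Avg": rounded average score per pair; "Vote": most frequent score per pair
avg = np.rint(scores.mean(axis=0))
vote = np.array([Counter(col).most_common(1)[0][0] for col in scores.T])

# Global agreement, with and without the two derived rows
alpha_5 = krippendorff.alpha(reliability_data=scores,
                             level_of_measurement="interval")
alpha_7 = krippendorff.alpha(reliability_data=np.vstack([scores, avg, vote]),
                             level_of_measurement="interval")
print(f"alpha (5 annotators) = {alpha_5:.2f}, alpha (+Avg, +Vote) = {alpha_7:.2f}")

# Pairwise Pearson correlations between annotators (cf. Table 2)
for i in range(len(scores)):
    for j in range(i + 1, len(scores)):
        r, _ = pearsonr(scores[i], scores[j])
        print(f"A{i+1} vs A{j+1}: r = {r:.2f}")
```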

   | A1   | A2   | A3   | A4   | A5
A1 | 1.0  | 0.77 | 0.72 | 0.84 | 0.81
A2 | 0.77 | 1.0  | 0.64 | 0.75 | 0.74
A3 | 0.72 | 0.64 | 1.0  | 0.75 | 0.70
A4 | 0.84 | 0.75 | 0.75 | 1.0  | 0.80
A5 | 0.81 | 0.74 | 0.70 | 0.80 | 1.0

Table 2: Pearson's correlation coefficients between the annotators

Table 2 shows the Pearson correlation (Kirch, 2008) for every combination of two annotators. The observations we can make are consistent with Figure 1 and the criteria described in section 2.3.:

• The lowest correlation coefficient (0.64) occurs between A2 and A3: A2 is the annotator who used steps of .5 in his scale, and A3 relied on annotation principles that had him assign 0 to almost half of the pairs. Hence, those two annotators applied the annotation scales and criteria that differ the most from the other ones.

• The correlation coefficients between the other three annotators (A1, A4 and A5) are the highest: 0.84 for A1 and A4, 0.81 for A1 and A5, and 0.80 for A4 and A5.

• The other associations range between 0.70 and 0.77.

Globally, the correlation coefficients show a satisfying reliability for the dataset, with variations according to the different scales that were used. We see that the two scales that stand out have the lowest correlation coefficient with each other, but at the same time they have a good correlation coefficient with the other three. Those other three have strong coefficients with one another.

4. Experiments

In order to study how the resulting corpus can be exploited, we ran an experiment to check how accurately we could automatically reproduce the annotations. In this section, we first describe the automatic approach for scoring the pairs of sentences and then the results obtained.

4.1. Automatic Approach for Scoring the Pairs of Sentences

We exploited a previously proposed method dedicated to the detection of parallel sentences in comparable corpora (Cardon and Grabar, 2019). Yet, in order to predict values on a continuous scale, the Random Forest regressor is exploited instead of the classifier. We compute and use several sets of features, mainly obtained from the lexical and sublexical content of the sentences, their word-based similarity, and the corpus-suggested similarity from word embeddings:

1. Number of common non-stopwords. This feature computes the basic lexical overlap between specialized and simplified versions of sentences (Barzilay and Elhadad, 2003). It concentrates on the non-stopword content of the sentences;

2. Percentage of words from one sentence included in the other sentence, computed in both directions. This feature represents possible lexical and semantic inclusion relations between the sentences;

3. Sentence length difference between specialized and simplified sentences. This feature assumes that simplification may imply a stable association with sentence length;

4. Average length difference in words between specialized and simplified sentences. This feature is similar to the previous one but takes into account the average difference in sentence length;

5. Total number of common bigrams and trigrams. This feature is computed on character n-grams. The assumption is that, at the sub-word level, some sequences of characters may be meaningful for the alignment of sentences if they are shared by them;

6. Word-based similarity measures, exploiting three scores (cosine, Dice and Jaccard). This feature provides a more sophisticated indication of the word overlap between two sentences. The weight assigned to each word is set to 1;

7. Character-based minimal edit distance (Levenshtein, 1966). This is a classical computation of edit distance. It takes into account basic edit operations (insertion, deletion and substitution) at the level of characters. The cost of each operation is set to 1;

8. Word-based minimal edit distance (Levenshtein, 1966). This feature is computed with words as units within the sentence. It takes into account the same three edit operations with the same cost set to 1. This feature computes the cost of the lexical transformation of one sentence into another;

9. WAVG. This feature uses word embeddings. The word vectors of each sentence are averaged, and the similarity score is calculated by comparing the two resulting sentence vectors (Stajner et al., 2018);

10. CWASA. This feature is the continuous word alignment-based similarity analysis, as described in (Franco-Salvador et al., 2016).

For the last two features, we trained the embeddings on the CLEAR corpus using word2vec (Mikolov et al., 2013), and the scores are computed using the CATS tool (Stajner et al., 2018).
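To make the feature set more concrete, the sketch below computes simplified versions of some of the surface features listed above (common non-stopwords, bidirectional word inclusion, length differences, shared character n-grams, cosine/Dice/Jaccard overlap, and the two edit distances) and feeds them to a Random Forest regressor. It is an illustration under assumptions, not the authors' implementation: the stopword list, tokenisation and scikit-learn are our choices, and the embedding-based features (WAVG, CWASA) are omitted.

```python
# Minimal sketch (not the authors' code) of surface features + Random Forest regression.
# Requires: pip install numpy scikit-learn
import numpy as np
from sklearn.ensemble import RandomForestRegressor

STOPWORDS = {"le", "la", "les", "de", "du", "des", "et", "ou", "en", "un", "une"}  # toy list

def char_ngrams(s, n):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def levenshtein(a, b):
    # word- or character-based edit distance, cost 1 per operation
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def features(sent_a, sent_b):
    wa, wb = sent_a.lower().split(), sent_b.lower().split()
    ca, cb = set(wa) - STOPWORDS, set(wb) - STOPWORDS
    inter = ca & cb
    cos = len(inter) / (np.sqrt(len(ca) * len(cb)) or 1.0)
    dice = 2 * len(inter) / ((len(ca) + len(cb)) or 1)
    jacc = len(inter) / (len(ca | cb) or 1)
    ngrams = sum(len(char_ngrams(sent_a, n) & char_ngrams(sent_b, n)) for n in (2, 3))
    return [
        len(inter),                                  # 1. common non-stopwords
        len(inter) / (len(ca) or 1),                 # 2. inclusion, a towards b
        len(inter) / (len(cb) or 1),                 # 2. inclusion, b towards a
        len(sent_a) - len(sent_b),                   # 3. length difference (characters)
        len(wa) - len(wb),                           # 4. length difference (words)
        ngrams,                                      # 5. shared character bi/trigrams
        cos, dice, jacc,                             # 6. word-based similarity scores
        levenshtein(sent_a, sent_b),                 # 7. character-based edit distance
        levenshtein(wa, wb),                         # 8. word-based edit distance
    ]

# Toy usage: X holds feature vectors, y the (here invented) similarity grades
pairs = [("le traitement doit être arrêté", "il faut arrêter le traitement"),
         ("les effets graves imposent l'arrêt", "le chat dort sur le canapé")]
X = np.array([features(a, b) for a, b in pairs])
y = np.array([4.0, 0.0])  # placeholder grades, not corpus annotations
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(X))
```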

4.2. Results

We ran the experiment for every annotator, as well as for Avg and Vote. We randomly split the data into 90% for training and 10% for testing. As there are small variations on each run due to the random splitting, each reported score is the average over twenty runs.

A1   | A2   | A3   | A4   | A5   | Avg  | Vote
0.82 | 0.73 | 0.80 | 0.79 | 0.78 | 0.87 | 0.78

Table 3: Pearson's correlation coefficients on the regression experiments

Table 3 shows the results obtained when scoring the pairs of sentences is tackled as a regression task. For the annotators, the correlation coefficients range from 0.73 (A2) to 0.82 (A1). This shows that the various scales can be automatically reproduced and that, even if there are important differences between them, the annotations can be considered to be coherent.

The most engaging observation is that the best results (0.87) are obtained on the average scores. This may mean that the average scores and the collective perception of the semantic similarity remain coherent despite the differences observed during the annotation process.

Interestingly, the result for Vote equals the mean of the scores obtained for the five annotators individually.
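The evaluation protocol just described can be reproduced with a few lines of code. The sketch below assumes a feature matrix X and a vector of gold grades y are already available (for instance built with surface features like those sketched in section 4.1); scikit-learn and SciPy are our assumptions, and the random placeholder data stand in for the real corpus, so the printed correlation is meaningless in itself.

```python
# Minimal sketch (not the authors' code) of the evaluation protocol:
# 20 random 90/10 train/test splits, Pearson correlation averaged over runs.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1010, 11))          # placeholder features (see section 4.1)
y = rng.uniform(0, 5, size=1010)         # placeholder gold grades on the 0-5 scale

correlations = []
for run in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=run)
    model = RandomForestRegressor(n_estimators=100, random_state=run).fit(X_tr, y_tr)
    r, _ = pearsonr(y_te, model.predict(X_te))
    correlations.append(r)

print(f"mean Pearson r over 20 runs: {np.mean(correlations):.2f}")
```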
5. Conclusion

We introduced a corpus annotated for semantic textual similarity in French. Currently, this kind of data is indeed missing for French. The corpus is composed of 1,010 sentence pairs that come from comparable corpora aimed towards text simplification. More precisely, the original texts come from the CLEAR corpus and from Wikipedia and Vikidia articles. The corpus comes with grades manually assigned by five annotators. Together with the scores, the annotators provided the annotation scheme they adopted. We performed an analysis of the resulting data and showed that there are discrepancies in the scores that have been assigned. Those discrepancies can be explained by different annotation factors. We then used these data to automatically predict the scores of the pairs of sentences. This set of experiments shows that the scores can be reproduced quite well with automatic approaches. This indicates that the manually created data are reliable and can be used for a variety of experiments where semantic textual similarity is of interest. At the time of publication, the dataset is being used in an NLP challenge and will then be made available to the research community.

6. Acknowledgements

We are grateful to our annotators for their valuable work. We would also like to thank the reviewers for their helpful comments. This work was funded by the French National Agency for Research (ANR) as part of the CLEAR project (Communication, Literacy, Education, Accessibility, Readability), ANR-17-CE19-0016-01.

7. Bibliographical References

Barzilay, R. and Elhadad, N. (2003). Sentence alignment for monolingual comparable corpora. In EMNLP, pages 25–32.

Cardon, R. and Grabar, N. (2019). Parallel sentence retrieval from comparable corpora for biomedical text simplification. In Proceedings of Recent Advances in Natural Language Processing, pages 168–177, Varna, Bulgaria, September.

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. (2017). SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, August. Association for Computational Linguistics.

Feitosa, D. and Pinheiro, V. (2017). Análise de medidas de similaridade semântica na tarefa de reconhecimento de implicação textual (Analysis of semantic similarity measures in the recognition of textual entailment task) [in Portuguese]. In Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, pages 161–170, Uberlândia, Brazil, October. Sociedade Brasileira de Computação.

Franco-Salvador, M., Gupta, P., Rosso, P., and Banchs, R. E. (2016). Cross-language plagiarism detection over continuous-space and knowledge graph-based representations of language. Knowledge-Based Systems, 111:87–99.

Grabar, N. and Cardon, R. (2018). CLEAR – Simple Corpus for Medical French. In Workshop on Automatic Text Adaption (ATA), pages 1–11, Tilburg, Netherlands.

Kajiwara, T. and Komachi, M. (2016). Building a monolingual parallel corpus for text simplification using sentence similarity based on alignment between word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1147–1158, Osaka, Japan, December. The COLING 2016 Organizing Committee.

Kirch, W., editor (2008). Pearson's Correlation Coefficient, pages 1090–1091. Springer Netherlands, Dordrecht.

Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Stajner, S., Franco-Salvador, M., Ponzetto, S. P., and Rosso, P. (2018). CATS: A tool for customised alignment of text simplification corpora. In Proceedings of the 11th Language Resources and Evaluation Conference, LREC 2018, Miyazaki, Japan, May 7–12.

Vadapalli, R., J Kurisinkel, L., Gupta, M., and Varma, V. (2017). SSAS: Semantic similarity for abstractive summarization. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 198–203, Taipei, Taiwan, November. Asian Federation of Natural Language Processing.
Wieting, J., Berg-Kirkpatrick, T., Gimpel, K., and Neubig, G. (2019). Beyond BLEU: Training neural machine translation with semantic similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4344–4355, Florence, Italy, July. Association for Computational Linguistics.

Yasui, G., Tsuruoka, Y., and Nagata, M. (2019). Using semantic similarity as reward for reinforcement learning in sentence generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 400–406, Florence, Italy, July. Association for Computational Linguistics.

