
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2110–2119
Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC

Comparing Machine Learning and Deep Learning Approaches
on NLP Tasks for the Italian Language

Bernardo Magnini, Alberto Lavelli, Simone Magnolini
Fondazione Bruno Kessler
via Sommarive 18, Povo - Trento (ITALY)
{magnini, lavelli, magnolini}@fbk.eu

Abstract
We present a comparison between deep learning and traditional machine learning methods for various NLP tasks in Italian. We carried out experiments using available datasets (e.g., from the Evalita shared tasks) on two sequence tagging tasks (i.e., named entity recognition and nominal entity recognition) and four classification tasks (i.e., lexical relations among words, semantic relations among sentences, sentiment analysis and text classification). We show that deep learning approaches outperform traditional machine learning algorithms in sequence tagging, while for classification tasks that heavily rely on semantics, approaches based on feature engineering are still competitive. We think that a similar analysis could be carried out for other languages to provide an assessment of machine learning / deep learning models across different languages.

Keywords: Machine Learning, Deep Learning, Italian Language

1. Introduction
In recent years, the so-called "deep learning revolution" has influenced and changed many fields of Artificial Intelligence (e.g., machine learning and computer vision) and has also affected all areas related to human language technologies. Initial results were obtained with the adoption of deep neural networks in speech recognition, with a significant boost in the performance of automatic speech recognition systems (Graves et al., 2013). In Machine Translation, starting from 2013, the phrase-based statistical approaches that were at the state of the art have been gradually replaced by neural machine translation, based on deep learning architectures, which has been proven to obtain better performance (Bahdanau et al., 2014). The main reason for this increase in performance is that, as more training data become available for both speech recognition and machine translation, large neural networks have proven to be superior to traditional machine learning (ML) algorithms, such as support vector machines.

However, if we consider tasks related to the semantic analysis of text, the limited availability of semantically annotated data, which typically requires specialized human effort, has slowed the adoption of neural approaches. It is only in the last few years that deep learning has obtained high performance across different NLP tasks. These models can often be trained with a single end-to-end model and do not require task-specific feature engineering; thus they not only tend to perform better than traditional ML, but they also require less human effort, making their adoption convenient.

In this paper we provide a comparison between traditional approaches and deep learning applied to NLP tasks in the area of information extraction from Italian texts. We carried out experiments using available datasets on both sequence tagging tasks (i.e., named entity recognition and nominal entity recognition) and classification tasks (i.e., lexical relations among words, semantic relations among sentences, sentiment analysis, and text classification).

We consider this paper as a contribution towards developing benchmarks that encompass a variety of tasks, in order to favour models that share general linguistic knowledge across tasks. This is very much in the spirit of GLUE, the General Language Understanding Evaluation (Wang et al., 2018), a collection of resources for training, evaluating, and analyzing natural language understanding systems.

The paper is structured as follows. Section 2 reports basic notions about deep learning for NLP that will be used in our experiments. Sections 3 and 4 focus on the sequence tagging tasks, named entity recognition and nominal entity recognition, respectively. Sections 5 to 8 report on the classification tasks: lexical relations, textual entailment, sentiment analysis and text classification. Finally, Section 9 discusses our achievements and proposes future work.

2. Deep Learning for NLP
This section provides basic notions on deep learning for NLP, which will be used in the rest of the paper. We introduce word vector representations, pre-trained language models, and long short-term memory architectures.

2.1. Word Embeddings
Word embeddings are vector representations of words, typically learnt by an unsupervised model fed with large amounts of text (e.g., Wikipedia, scientific literature, news articles). These representations capture, among other properties, semantic similarity between words, and are hence very useful to represent words in downstream NLP tasks such as POS tagging and NER. Three families of word embeddings can be identified:

• Bag-of-words based. The original word-order-independent models like Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).

• Attention (Transformer) based. Embeddings generated by BERT (Devlin et al., 2018), which has produced state-of-the-art results to date in downstream tasks like NER, question answering and classification. BERT takes into account the order of words in a sentence, but is based on an attention mechanism as opposed to sequence models like ELMo.

• RNN-family based. Sequence models that produce word embeddings, such as ELMo (Peters et al., 2018). ELMo uses stacked bidirectional LSTMs to generate word embeddings that have different properties depending on the layer that generates them.
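As a quick illustration of how such vectors are used, the sketch below (our own; the 4-dimensional vectors are made up, while real GloVe or Word2vec vectors typically have 50 to 300 dimensions) computes semantic similarity as the cosine of the angle between two word vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: close to 1 for semantically related words."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration; in practice they come from a trained model.
pasta  = np.array([0.8, 0.1, 0.3, 0.0])
pizza  = np.array([0.7, 0.2, 0.4, 0.1])
carpet = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine(pasta, pizza))   # high: related food words
print(cosine(pasta, carpet))  # low: unrelated words
```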

2.2. BERT
BERT (Devlin et al., 2018) is a deep learning model that has produced state-of-the-art results on a wide variety of natural language processing tasks. It stands for Bidirectional Encoder Representations from Transformers. It has been pre-trained on Wikipedia and BooksCorpus and requires task-specific fine-tuning.

BERT is also available pre-trained on domain-specific corpora, e.g., Clinical BERT (BERT pre-trained on a corpus of clinical notes) and SciBERT (pre-trained contextualized embeddings for scientific text). BioBERT (Lee et al., 2019) (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement).

BERT is also available for languages other than English (https://github.com/google-research/bert/blob/master/multilingual.md). In particular, a dedicated model is provided for Chinese and a single multilingual model for all the other languages, including Italian.
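As a concrete illustration of this fine-tuning step, the following minimal sketch fine-tunes the multilingual BERT model for a binary sentence classification task. It assumes the HuggingFace transformers library and PyTorch; the model name, the toy Italian examples and the hyper-parameters are our own assumptions and do not reproduce the exact settings used in the experiments reported later in this paper.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

# Two toy Italian examples with binary labels (illustrative only).
texts = ["Ottimo servizio, torneremo sicuramente.",
         "Esperienza pessima, da evitare."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                       # a few passes over the toy batch
    optimizer.zero_grad()
    out = model(**batch, labels=labels)  # cross-entropy on the [CLS] vector
    out.loss.backward()
    optimizer.step()
```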
2.3. A Sequence Labeling Neural Architecture: NeuroNLP2
In this section we introduce NeuroNLP2 (Ma and Hovy, 2016), a reference neural architecture for sequence labeling in NLP that achieved state-of-the-art performance for named entity recognition for English on the CoNLL-2003 dataset. Specifically, we describe the most recent implementation of the system in PyTorch, distributed by the authors (https://github.com/XuezheMax/NeuroNLP2). We selected this system not only for its state-of-the-art performance and code availability, but also for the peculiar structure of the network, which is common to other works, including Lample et al. (2016). The system is composed of three layers (Figure 1): (i) a CNN that extracts information from the input text without any pre-processing; (ii) a bidirectional LSTM layer that presents each sequence forwards and backwards to two separate LSTMs; (iii) a CRF layer that decodes the best label sequence.

[Figure 1: The main NeuroNLP2 structure. Dashed arrows indicate dropout layers applied on both the input and output vectors of the BLSTM.]

NeuroNLP2 constructs a neural network model by feeding the output vectors of the BLSTM into a CRF layer, as depicted in Figure 1. For each token in the input sequence, a character-level representation is first computed by a CNN with character embeddings as inputs. The character-level representation vector is then concatenated with the word embedding vector to feed the BLSTM network. The CNN for character-level representation is an effective approach to extract morphological information (like the prefix or suffix of a word) from the characters of a word and encode it into neural representations. In NeuroNLP2 the CNN is similar to the one proposed by Chiu and Nichols (2016), except that it uses only character embeddings as inputs, without character type.

At the second layer each input sequence is presented both forwards and backwards to a bidirectional LSTM, whose output captures past and future information. LSTMs (Hochreiter and Schmidhuber, 1997) are variants of recurrent neural networks (RNNs) designed to cope with the vanishing gradient problem. An LSTM unit is composed of three multiplicative gates which control the proportions of information to forget and to pass on to the next time step. The basic idea is to present each sequence forwards and backwards to two separate LSTMs and then to concatenate the outputs, capturing past and future information, respectively. The LSTM's hidden state takes information only from the past, knowing nothing about the future; however, for many tasks it is beneficial to have access to both past (left) and future (right) contexts. A possible solution, whose effectiveness has been proven by previous work (Dyer et al., 2015), is provided by bi-directional LSTMs (BLSTMs). Ma and Hovy (2016) apply a dropout layer on both the input and output vectors of the BLSTM.

Finally, the third layer implemented in NeuroNLP2 is a Conditional Random Fields (CRF) based decoder, which considers dependencies between entity labels in their context and jointly decodes the best chain of labels for a given input sentence. For example, in NER with standard IOB annotation, an I-token cannot follow an O, a constraint which is captured by the CRF layer. Conditional Random Fields (Lafferty et al., 2001) offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax the strong independence assumptions made in those models. For a sequence CRF model (where only interactions between two successive labels are considered), training and decoding can be solved efficiently with the Viterbi algorithm.
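To make the decoding step concrete, the following is a small self-contained sketch (our own, not taken from the NeuroNLP2 codebase) of Viterbi decoding for a linear-chain CRF, where only interactions between two successive labels are scored:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Best label sequence for a linear-chain CRF.

    emissions:   (seq_len, num_labels) per-token label scores (e.g. from a BLSTM)
    transitions: (num_labels, num_labels) score of moving from label i to label j
    """
    seq_len, num_labels = emissions.shape
    score = emissions[0].copy()          # best score of paths ending in each label
    backptr = np.zeros((seq_len, num_labels), dtype=int)

    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in label i at t-1, then i -> j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)

    best = [int(score.argmax())]         # follow back-pointers from the best end
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t][best[-1]]))
    return best[::-1]
```

A disallowed transition, such as an I-token following an O under IOB annotation, can be enforced simply by setting the corresponding entry of the transition matrix to a large negative value.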
3. Named Entity Recognition
Named entities are proper names referring to persons, locations and organizations. A reference paper for the application of deep learning techniques to Named Entity Recognition is Ma and Hovy (2016), whose approach, NeuroNLP2, has been presented in Section 2.3. The system is truly end-to-end, requiring no feature engineering or data pre-processing, thus making it applicable to a wide range of sequence labeling tasks. They evaluate the system on two datasets for two sequence labeling tasks — the Penn Treebank WSJ corpus for part-of-speech tagging and the CoNLL 2003 corpus for named entity recognition (NER) — and obtain state-of-the-art performance on both: 97.55% accuracy for part-of-speech tagging and 91.21% F1 for NER.

Neural architectures for Italian NER have already been investigated by several works. Bonadiman et al. (2015) introduce a Deep Neural Network (DNN) for Named Entity Recognizers (NERs) in Italian. The network uses a sliding window of word contexts to predict tags. It relies on a simple word-level log-likelihood as a cost function and uses a new recurrent feedback mechanism to ensure that the dependencies between the output tags are properly modeled. The evaluation on the Evalita 2009 benchmark (Speranza, 2009) shows that the DNN performs on par with the best NERs, outperforming the state of the art when gazetteer features are used.

Basile et al. (2017) propose a Deep Learning architecture for sequence labeling based on a state-of-the-art model that exploits both word- and character-level representations through the combination of bidirectional LSTM, CNN and CRF. They evaluate the proposed method on three NLP tasks for Italian: PoS-tagging of tweets, Named Entity Recognition and Super-Sense Tagging. Results show that the system achieves state-of-the-art performance in all the tasks and in some cases overcomes the best systems previously developed for Italian.

Magnolini et al. (2019) provide experimental evidence on two datasets (named entities and nominal entities) and two languages (English and Italian), showing that extracting features from a rich model of a gazetteer and then concatenating such features with the input embeddings of a neural model is the best strategy in all experimental settings, significantly outperforming more conventional approaches. For the experiments they used exactly the same network parameters described in Ma and Hovy (2016) and provided as default by the available implementation. As input embeddings they use Stanford's publicly available GloVe 100-dimensional embeddings, trained on 6 billion words from Wikipedia and web texts, for English (in the same way as Ma and Hovy (2016)); for Italian they use Stanford's GloVe 50-dimensional embeddings trained on a Wikipedia dump (of 20/04/2018) with the default setup. For out-of-vocabulary words they use a unique randomly generated vector for every word.

dataset            BERT    Best Evalita 2009   SotA
NER Evalita 2009   85.05   82.00               84.33

Table 1: Application of BERT fine-tuning for Named Entity Recognition.

In Table 1 we report the results we obtained applying BERT (multilingual model) to the NER Evalita 2009 (Speranza, 2009) task. The BERT model is compared against the system that obtained the best result at Evalita 2009 (Zanoli et al., 2009) and against the state of the art for Italian NER (Nguyen et al., 2010).

As suggested by the BERT developers, for sequence labeling BERT-NER (https://github.com/kyzhouhzau/BERT-NER) was used, simply performing some fine-tuning on the training data with default parameters; note that the default parameters for this task differ from the ones used for classification. Another important detail is the great difference in performance between the case-sensitive and the case-insensitive model: the former significantly outperforms the latter.

It should be noticed that the purpose of the experiment is not to obtain a new state of the art (even if in this case it was achieved), but to investigate how deep learning performs on a different task in a language that is not English. In fact, BERT-NER is not the implementation presented in Devlin et al. (2018), which is not freely available, but a third-party implementation that performs slightly worse than the one presented in the paper.
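For reference, this kind of fine-tuning for sequence labeling can be sketched as follows with the HuggingFace transformers API (our own illustration, not the BERT-NER repository's actual code; the label set and example are invented). One NER-specific detail is that word-level IOB tags must be aligned to BERT's wordpieces, with sub-word continuations masked out of the loss:

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

words = ["Bernardo", "Magnini", "lavora", "a", "Trento"]
tags = [1, 2, 0, 0, 3]  # B-PER I-PER O O B-LOC

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to wordpieces: only the first piece of each word
# keeps its tag; special tokens and continuations get the ignore index -100.
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(-100 if wid is None or wid == prev else tags[wid])
    prev = wid

out = model(**enc, labels=torch.tensor([aligned]))
out.loss.backward()  # an optimizer step would follow, as in regular fine-tuning
```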
4. Nominal Entity Recognition
Nominal entities are noun phrase expressions describing an entity. They can be composed of a single noun (e.g., pasta, carpet, parka) or of more than one token (e.g., capri sofa bed beige, red jeans skinny fit, light weight full frame camera, grilled pork belly tacos). Differently from named entities, nominal entities are typically compositional, as they allow morphological and syntactic variations (e.g., for food names, spanish baked salmon, roasted salmon and hot smoked salmon), which makes it possible to combine tokens of one entity name with tokens of another entity name to generate new names (e.g., for food names, salmon tacos is a potential food name given the existence of salmon and tacos).
I would like to order a salami pizza and two mozzarella cheese sandwiches
O O O O O O B-FOOD I-FOOD O O B-FOOD I-FOOD I-FOOD

Table 2: Example of IOB annotation of food nominal entities.

Nominal entity recognition has been approached with systems based on linguistic knowledge, including morpho-syntactic information, chunking, and head identification (Pianta and Tonelli, 2010). In the framework of the ACE program (Doddington et al., 2004) there have been several attempts to develop supervised systems for nominal entities (Haghighi and Klein, 2010), which, however, had to face the problem of the scarcity of annotated data and, for this reason, were developed for few entity types.

Similarly to what is done for named entities, nominal entity recognition has been approached as a sequence labeling task. Given an utterance U = {t1, t2, ..., tn} and a set of entity categories C = {c1, c2, ..., cm}, the task is to label the tokens in U that refer to entities belonging to the categories in C. As an example, using the IOB format (Inside-Outside-Beginning, Ramshaw and Marcus (1995)), the sentence "I would like to order a salami pizza and two mozzarella cheese sandwiches" could be labeled as shown in Table 2. It is worth mentioning that the IOB format does not allow representing nested entities, a potential limitation for nominal entities.
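As a concrete illustration of how the predicted IOB tags are turned back into entities, a decoder along the following lines (our own sketch) recovers the FOOD spans of Table 2:

```python
def iob_to_spans(tokens, tags):
    """Collect (entity_type, text) pairs from an IOB tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):    # the sentinel flushes the last span
        if tag.startswith("I-") and etype == tag[2:]:
            continue                          # the current entity continues
        if start is not None:                 # close the open entity, if any
            spans.append((etype, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-"):              # open a new entity
            start, etype = i, tag[2:]
    return spans

tokens = ("I would like to order a salami pizza and "
          "two mozzarella cheese sandwiches").split()
tags = ["O"] * 6 + ["B-FOOD", "I-FOOD", "O", "O",
                    "B-FOOD", "I-FOOD", "I-FOOD"]
print(iob_to_spans(tokens, tags))
# [('FOOD', 'salami pizza'), ('FOOD', 'mozzarella cheese sandwiches')]
```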
4.1. Datasets for Nominal Entity Recognition
We use DPD – Diabetic Patients' Diary – a dataset in Italian made of diary entries of diabetic patients. Each day the patient writes down what s/he ate in order to keep track of his/her dietary behavior. In DPD all entities of type FOOD have been manually annotated by two annotators (inter-annotator agreement is 96.75 Dice coefficient). Sentences in the dataset have a telegraphic style, e.g. the main verb is often missing, resulting in a list of foods like the following: "<risotto ai multicereali e zucchine>FOOD <insalata>FOOD e <pomodori>FOOD" ("<risotto with multigrain and zucchini> <salad> and <tomatoes>").

Entity Gazetteers. In Table 3 we describe the gazetteers that we have used in our experiments for two datasets (DPD for nominal entities and CoNLL for named entities), reporting, for each entity type, sizes in terms of number of entity names, the average length of the names (in number of tokens), plus the length variability of such names (standard deviation). We also report additional metrics that try to grasp the complexity of entity names in the gazetteer: (i) the normalized type-token ratio (TTR), as a rough measure of how much lexical diversity there is in the nominal entities of a gazetteer, see Richards (1987); (ii) the ratio of type1 tokens, i.e. tokens that can appear in the first position of an entity name but also in other positions, and of type2 tokens, i.e. tokens appearing at the end and elsewhere; (iii) the ratio of entities that contain another entity as a sub-part of their name. With these measures we are able to partially quantify how difficult it is to recognize the length of an entity (SD), how difficult it is to individuate the boundaries of an entity (ratio of type1 and type2 tokens), and how much compositionality there is starting from basic entities, i.e., how many new entities can potentially be constructed by adding new tokens (sub-entity ratio).
dataset  Gaz.   #entities  #tokens  length ± SD   TTR   type1(%)  type2(%)  sub-entity(%)
CoNLL    PER    3613       6454     1.79 ± 0.54   0.96  19.00     04.63     23.60
CoNLL    LOC    1331       1720     1.29 ± 0.69   0.97  04.66     04.33     10.14
CoNLL    ORG    2401       4659     1.94 ± 1.16   0.91  09.35     15.06     19.44
CoNLL    MISC   869        1422     1.64 ± 0.94   0.89  08.61     08.73     19.85
DPD      FOOD   23472      83264    3.55 ± 1.87   0.75  17.22     22.97     11.27

Table 3: Gazetteers used in the experiments for Nominal Entity Recognition. Description is provided in terms of number of entity names, total number of tokens, average length and standard deviation (SD) of entities, type-token ratio (normalized by repeated sampling of 200 tokens), type1 and type2 unique token ratios, and sub-entity ratio.
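Statistics of the kind reported in Table 3 can be computed along the following lines (our own sketch; the type1/type2 ratios are omitted for brevity, and the normalized TTR is obtained, as stated in the caption, by repeated sampling of 200 tokens):

```python
import random
import statistics

def gazetteer_stats(entities, n_samples=1000, sample_size=200, seed=0):
    """entities: list of entity names, each a string of space-separated tokens."""
    rng = random.Random(seed)
    lengths = [len(e.split()) for e in entities]
    tokens = [t for e in entities for t in e.split()]

    # Normalized type-token ratio: average TTR over fixed-size token samples.
    ttrs = []
    for _ in range(n_samples):
        sample = rng.sample(tokens, min(sample_size, len(tokens)))
        ttrs.append(len(set(sample)) / len(sample))

    # Ratio of entities containing another entity as a proper sub-part.
    names = set(entities)
    def has_sub_entity(e):
        toks = e.split()
        return any(" ".join(toks[i:j]) in names
                   for i in range(len(toks))
                   for j in range(i + 1, len(toks) + 1)
                   if (i, j) != (0, len(toks)))

    return {"entities": len(entities), "tokens": len(tokens),
            "avg_len": statistics.mean(lengths),
            "sd_len": statistics.pstdev(lengths),
            "norm_ttr": statistics.mean(ttrs),
            "sub_entity_ratio": sum(map(has_sub_entity, entities)) / len(entities)}
```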

CoNLL DPD
Accuracy Precision Recall F1 Accuracy Precision Recall F1
NeuroNLP2 98.06 91.42 90.95 91.19 88.47 77.17 74.79 75.96
NeuroNLP2 + single token 98.06 91.53 90.51 91.02 88.29 75.63 77.19 76.40
NeuroNLP2 + multi token 98.08 91.41 90.76 91.08 88.98 78.90 76.33 77.59
NeuroNLP2 + NNg 98.05 91.41 91.02 91.22 89.89 79.68 77.36 78.50

Table 4: Results on Nominal Entity Recognition using gazetteers as features together with embeddings.

4.2. Experiments on Nominal Entity Recognition
In our experiments we compare nominal entity recognition on the DPD dataset against named entity recognition on the CoNLL dataset. In both cases we show four configurations: (i) NeuroNLP2, the neural architecture presented in Section 2.3; (ii) NeuroNLP2 with the use of gazetteer features (single-token), computed over the gazetteers reported in Table 3; (iii) NeuroNLP2 with the use of gazetteer features (multi-token); (iv) NeuroNLP2 with the use of gazetteer features based on a dedicated neural model (NNg).

Table 4 shows the results of gazetteer integration as embedding. The NeuroNLP2 model benefits significantly from the gazetteer representation of NNg, especially for the DPD dataset (with an increment of 2.54 in terms of F1). The combination of NeuroNLP2 and NNg reaches state-of-the-art performance on CoNLL-2003 when it is added as an embedding feature, while both the single-token and the multi-token approaches do not improve the overall results. Using gazetteer features as part of the embedding dimensions helps the model to adapt better when the training data are very few, as in the DPD dataset. Furthermore, the results on the DPD dataset of NeuroNLP2 + NNg, compared to the others, show that NNg correctly generalizes nominal entities from the gazetteer, improving both Recall and Precision with respect to the multi-token approach.

5. Lexical Relations among Words
This section addresses the capacity of neural models to detect semantic relations (e.g., synonymy, semantic similarity, entailment, compatibility) between words (or phrases, like the nominal expressions described in Section 4.2). We focus our experiments on the compatibility relation, and adopt the definition of compatibility proposed by Kruszewski and Baroni (2015): two linguistic expressions w1 and w2 are compatible iff, in a reasonably normal state of affairs, they can both truthfully refer to the same thing. If they cannot, then they are incompatible. Under this definition compatibility is a symmetric relation, which is different from subsumption, which is not symmetric, from semantic similarity (Agirre et al., 2012), as two expressions can be compatible although not semantically similar (like aperitif and chips), and from textual entailment (Dagan et al., 2005), as entailment is not a symmetric relation.

5.1. Task definition
The task is defined as follows: given a lexicon L and a query q, the system should retrieve and order all the terms li in L such that q and li are compatible. L is a finite set of n terms, and both the terms li and the query q are nominal expressions composed of one or more words. Accordingly, the expected output is a (possibly empty) list of compatible terms ordered by relevance with respect to q.

In practical scenarios (e.g., ontology matching) the lexicon L can be composed of thousands, or tens of thousands, of terms (e.g., all concept names in DBPedia, all the names of products in a catalogue, all word forms in WordNet, or all the entry names in a dictionary). In our definition we do not consider any relation among terms (e.g., semantic relations such as IS-A), so that terms can be considered independent. Finally, the problem is treated as in information retrieval (IR), assuming that queries are nominal expressions (as they are in most cases in IR) and that the document collection (i.e., our lexicon L) is composed of documents consisting of a single term (i.e., a nominal expression). Compatibility is formulated as a binary classification problem, as two expressions can be either compatible or incompatible. While Kruszewski and Baroni (2015) use a continuous scale from 1 (low compatibility) to 7 (high compatibility) and then estimate a compatibility threshold, in our work we use a three-value scale (from 1 to 3).

5.2. Datasets on compatibility relation
We focused our experiments on compatibility relations among food names. We adopted an existing ontology in the food domain, the HeLiS ontology (Bailoni et al., 2016; http://w3id.org/helis), which we use as the lexicon L. As for the queries q, we built a set of 100 query terms that are completely independent of those contained in HeLiS; in fact, we extracted them from among the dishes or types of food annotated in the Diabetic Patients' Diary (https://hlt-nlp.fbk.eu/technologies/dpd), a corpus of above 1,000 meal descriptions written by diabetic patients (for example, wholemeal pasta with raw ham and tomatoes, cucumbers).

For each query, the annotator was presented with a list of 5 to 10 terms in alphabetical order. The annotator had to annotate each term for compatibility with the query it was associated with, which means they had to decide whether the two given expressions could or could not refer to the same dish or food. More specifically, the task consisted of assigning to each term a compatibility rating on a 3-point scale, where 3 means that they were fully convinced that the two expressions could refer to the same dish or food, while 1 means that they thought it impossible that the two expressions referred to the same dish or food. In the case of chicken with mushrooms and onions and chicken with mushrooms, for example, the expected compatibility rating is 3, since the two expressions can (easily) refer to the same dish (when mentioning a dish, people can easily omit secondary ingredients). On the other hand, annotators would assign a compatibility rating of 1 to the pair cod fillet and pork fillet, since a cod fillet cannot be a pork fillet. Finally, a compatibility rating of 2 would be assigned to pairs like cod fillet with asparagus and rice and cod fillet with fennel and capers; in this case, the secondary ingredients listed in the query and in the proposed term differ, but both refer to a cod fillet. Inter-annotator agreement, computed in terms of the kappa statistic on the dual annotation of a subset of 21 query terms (for a total of 184 terms), is 0.76.

For our experiments, we split the dataset in two parts: half of the data was used as a development set and half as a test set (see Table 5 for detailed statistics about the two datasets and their annotations).

                 Dev           Test          Total
Total queries    50            50            100
Tokens/query     2.58          2.76          2.67
1 terms          223 (54.9%)   261 (60.7%)   484 (57.9%)
2 terms          156 (38.4%)   135 (31.4%)   291 (34.8%)
3 terms          27 (6.7%)     34 (7.9%)     61 (7.3%)
Total terms      406           430           836
Tokens/term      4.01          4.21          4.12
Terms/query      8.12          8.6           8.36

Table 5: Statistics about the dataset for the compatibility relation (n terms indicates the terms with compatibility rating equal to n).
5.3. Experiments on compatibility relation
We conducted experiments with the algorithms below.

Semantic Similarity. A baseline based on the similarity of word embeddings. A term is considered compatible with the query if it is ranked among the best 5 terms according to the cosine similarity of the vector representing the query and the vector representing the term. More precisely, we first extract the vector of each token of the query from a GloVe model (Pennington et al., 2014) and then compute the average of the extracted vectors (i.e., the centroid vector). This method partially includes token overlap; in fact, equal tokens have equal vectors, so expressions composed of the same tokens have the same vector representation. This baseline exclusively uses the information carried by the GloVe vectors, as all the vectors included in the model are used with the same weight.

Semantic Similarity with Threshold. The same as the semantic similarity baseline, with the addition of a compatibility threshold operating over the best 5 terms. A term is considered compatible with the query if it is ranked above the compatibility threshold, which is empirically calculated on WordNet data.

Semantic Head. An approach based on the automatic recognition of the semantic head of a term, without any threshold. A term is considered compatible with the query if it is ranked among the best 5 terms according to both their respective semantic heads and the similarity of their tokens.

Semantic Head with Threshold. The approach based on semantic heads, integrated with a compatibility threshold over the best 5 terms retrieved by the semantic head algorithm.
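A minimal sketch of the semantic similarity baseline follows (our own illustration; it assumes GloVe vectors pre-loaded as a dict from token to numpy array, and ranks the lexicon by cosine similarity with the query centroid):

```python
import numpy as np

def centroid(expr, vectors, dim=50):
    """Average the GloVe vectors of the tokens of a nominal expression."""
    vecs = [vectors[t] for t in expr.split() if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_compatible(query, lexicon, vectors, k=5):
    """Return the k lexicon terms closest to the query by cosine similarity."""
    def cos(a, b):
        n = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / n) if n else 0.0
    q = centroid(query, vectors)
    ranked = sorted(lexicon, key=lambda term: cos(q, centroid(term, vectors)),
                    reverse=True)
    return ranked[:k]   # terms ranked in the best k are predicted compatible
```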
5.4. Evaluation Metrics
Evaluation is based on Mean Reciprocal Rank (MRR) (Craswell, 2009), a standard measure to evaluate retrieval systems (particularly question answering). While MRR is designed for binary classification of retrieved objects (i.e., correct vs incorrect), in our scenario retrieved terms can assume one value on a 3-point compatibility scale. We therefore calculate the MRR for each value, thus obtaining MRR1, MRR2 and MRR3, respectively the MRR of terms that are not compatible with the query (value=1), of terms with low compatibility (value=2) and of terms that are fully compatible (value=3). Results in Table 6 are presented using the three metrics MRR1,3, MRR2,3 and MRR1, as described below.

MRR1,3 is the difference between MRR3 and MRR1, and indicates the ability of the system to rank compatible terms higher than the incompatible ones. This metric is the weighted average of the three MRRs, with the MRR2 weight set to 0 (i.e., weights [-1 0 1]). MRR1,3 is normalized over [-1 1].

MRR2,3 is the weighted average of the three MRRs with weights [0 0.5 1], and it is meant to capture the ability of the algorithm to retrieve only the terms that are compatible (value=3) or almost compatible (value=2). MRR2,3 is normalized over [0 1].

MRR1 (i.e., weights [1 0 0]) is meant to capture the capacity of the system to rank incompatible terms lower than all other terms. Although this information is also captured by MRR1,3, MRR1 alone highlights the effect that different algorithms have on the reduction of misclassifications. MRR1 is normalized over [0 1].

5.5. Results on compatibility relation
Results (see Table 6) show that the semantic head approach systematically outperforms the semantic similarity baseline, both when the threshold is used and when it is not. The algorithms with the threshold strongly reduce MRR1, with a beneficial effect also on MRR1,3; this is actually an expected effect of the threshold, as it enables the system to better distinguish between compatible and incompatible terms. On the other hand, a drawback of introducing the threshold is that it reduces MRR2,3, i.e., the capability of the system to retrieve related terms. It is also interesting to notice that the decrease in MRR2,3 is greater for the semantic similarity baseline, which is due to the fact that it implements only the relatedness threshold.

As a final consideration, we point out that the decrease in performance on the test set as compared to the development set is consistent for the semantic head approach in terms of all the metrics; this shows that the compatibility threshold is not overfitted on the development data, but is general and has the same effect on the development and the test data.

Finally, in the last line of Table 6 we report the results obtained applying the multilingual model of BERT to the compatibility task. BERT was applied through fine-tuning of the multilingual model over the data of the compatibility task; some fine-tuning for the task was performed on the generic model using the parameters suggested in Devlin et al. (2018). BERT performs better than the previous approaches based on semantic similarity among vectors, confirming the high capacity of the BERT model to capture semantic relations among words, even for Italian.

6. Textual Entailment
Driven by the assumption that language understanding crucially depends on the ability to recognize semantic relations among portions of text, several text-to-text inference tasks have been proposed in the last decade, including recognizing paraphrasing (Dolan and Brockett, 2005), recognizing textual entailment (RTE) (Dagan et al., 2005), and semantic similarity (Agirre et al., 2012). A common characteristic of such tasks is that the input is two portions of text, let's call them Text1 and Text2, and the output is a semantic relation between the two texts, possibly with a degree of confidence of the system. For instance, given the following text fragments:

Example 1. Text1: George Clooney's longest relationship ever might have been with a pig. The actor owned Max, a 300-pound pig.
Text2: Max is an animal.

a system should be able to recognize that there is an "entailment" relation between Text1 and Text2.
                                     Development                     Test
Metric                               MRR1,3   MRR2,3    MRR1         MRR1,3   MRR2,3    MRR1
Weights                              [-1 0 1] [0 0.5 1] [1 0 0]      [-1 0 1] [0 0.5 1] [1 0 0]
Semantic similarity                  -0.236   0.345     0.398        -0.213   0.346     0.407
Semantic similarity with Threshold   -0.011   0.259     0.174        -0.028   0.231     0.179
Semantic head                        -0.114   0.396     0.274        -0.184   0.356     0.357
Semantic head with Threshold          0.034   0.357     0.111        -0.015   0.298     0.165
BERT                                 -        -         -            -0.018   0.318     0.132

Table 6: Results on the development and test sets for compatibility relation detection.

6.1. Datasets used for Textual Entailment
We have tested the performance of a neural approach, based on BERT, on two RTE datasets available for Italian.

RTE3 Italian. This is the Italian translation of the RTE-3 dataset, carried out during the EU project EXCITEMENT (https://sites.google.com/site/excitementproject/results/RTE3-ITA_V1_2012-10-04.zip). The RTE-3 dataset for English (Giampiccolo et al., 2007) consists of 1600 text-hypothesis pairs, equally divided into a development set and a test set. While the length of the hypotheses (h) was the same as in the RTE-1 and RTE-2 datasets, a certain number of texts (t) were longer than in previous datasets, up to a paragraph. Four applications – namely IE, IR, QA and SUM – were considered as settings or contexts for the pair generation, and 200 pairs were selected for each application in each dataset.

RTE Evalita 2009. This is the dataset developed for the Evalita 2009 task (Bos et al., 2009). Pairs of texts have been taken from Italian Wikipedia articles, and are constructed by manually annotating contrasting texts taken from the version history provided by Wikipedia. The following is a pair where Text1 entails Text2:

Example 2. Text1: Parla di attivita' nei panni di direttore commerciale e, dopo sei mesi, di direttore generale.
Text2: Parla di attivita' di direttore commerciale e, dopo sei mesi, di direttore generale.
(Roughly: "It speaks of activity in the role of commercial director and, after six months, of general director." / "It speaks of activity as commercial director and, after six months, as general director.")
6.2. Results for textual entailment
In Table 7 we report the results obtained applying a neural model, BERT (multilingual), to the two datasets.

BERT Multilingual. This approach makes use of the BERT multilingual language model (Devlin et al., 2018) in order to establish as many relations as possible between Text1 and Text2. A threshold is then estimated on the training data, and used to separate entailment from no-entailment on the test data. As suggested by the BERT developers for classification tasks, some fine-tuning for the task was performed on the generic model using the parameters suggested in Devlin et al. (2018).

State of the Art: EDITS. The system used for the experiments is the EDITS package (Edit Distance Textual Entailment Suite) (Kouylekov and Magnini, 2005). EDITS implements a distance-based approach for recognizing textual entailment, which assumes that the distance between Text1 and Text2 is a characteristic that separates the positive sentence pairs, for which the entailment relation holds, from the negative pairs, for which the entailment relation does not hold. More specifically, EDITS is based on edit distance algorithms, and computes the T-H distance as the overall cost of the edit operations (i.e., insertion, deletion and substitution) that are necessary to transform Text1 into Text2.
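The token-level edit distance at the core of this approach can be sketched as follows (a plain Levenshtein distance over token sequences with unit costs, our own simplification; EDITS itself supports configurable costs and tree edit operations):

```python
def edit_distance(text1, text2):
    """Minimum number of token insertions, deletions and substitutions
    needed to transform text1 into text2."""
    t1, t2 = text1.split(), text2.split()
    d = [[0] * (len(t2) + 1) for _ in range(len(t1) + 1)]
    for i in range(len(t1) + 1):
        d[i][0] = i                            # delete all remaining tokens
    for j in range(len(t2) + 1):
        d[0][j] = j                            # insert all remaining tokens
    for i in range(1, len(t1) + 1):
        for j in range(1, len(t2) + 1):
            sub = 0 if t1[i - 1] == t2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[len(t1)][len(t2)]
```

Entailment is then predicted when the (suitably normalized) distance falls below a threshold estimated on training data.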
aspect-level opinions expressed in natural language texts.
7
https://sites.google.com/site/ Aspect-based Sentiment Analysis is approached as a se-
excitementproject/results/RTE3-ITA_V1_ quence of two subtasks: Aspect Category Detection (ACD)
2012-10-04.zip and Aspect Category Polarity (ACP).

2116
In Table 8 we report the results obtained applying BERT (multilingual) to SENTIPOLC 2016 (task 2) and to ABSITA 2018 (tasks ACD and ACP). As suggested for classification tasks, some fine-tuning for the task was performed on the generic model using the parameters suggested in Devlin et al. (2018). Some aspects of the SENTIPOLC 2016 dataset are difficult to address with BERT: for example, the dataset is strongly unbalanced, usually an important aspect to take into account with a supervised system like BERT. To reduce this effect we down-sample the most common polarity, but even in this case the result is not competitive with the state of the art.

On the other hand, it is important to notice that in both cases (SENTIPOLC 2016 and ABSITA 2018) the models were not fine-tuned on Italian, but only on the task. According to the paper by Pires et al. (2019), multilingual BERT is able to perform some cross-lingual adaptation, but it is reasonable to think that in a task more related to semantics a deeper process of fine-tuning is needed.

dataset                    BERT    SotA
SENTIPOLC 2016 - Task 2    52.17   66.38
ABSITA 2018 - Task ACD     74.05   81.08
ABSITA 2018 - Task ACP     68.13   76.73

Table 8: Application of BERT to Sentiment Analysis.
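The down-sampling step mentioned above can be implemented along the following lines (our own sketch, not the exact procedure used in the experiments):

```python
import random
from collections import Counter

def downsample(examples, labels, seed=0):
    """Randomly drop examples of the most frequent label so that its
    count matches that of the second most frequent label."""
    counts = Counter(labels)
    (top, n_top), (_, n_second) = counts.most_common(2)
    rng = random.Random(seed)
    top_idx = [i for i, y in enumerate(labels) if y == top]
    dropped = set(rng.sample(top_idx, n_top - n_second))
    kept = [(x, y) for i, (x, y) in enumerate(zip(examples, labels))
            if i not in dropped]
    return [x for x, _ in kept], [y for _, y in kept]
```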
8. Text Classification
Finally, we focus on text classification applied to radiological reports in Italian. Radiological reporting generates a large amount of free-text clinical narratives, a potentially valuable source of information for improving clinical care and supporting research. The use of automatic techniques to analyze such reports is necessary to make their content effectively available to radiologists in an aggregated form.

In (Gerevini et al., 2018) the focus is on the classification of chest computed tomography reports according to a classification schema proposed for this task by radiologists of the Italian hospital ASST Spedali Civili di Brescia. The system is built exploiting a training dataset containing reports annotated by radiologists. Each report is classified according to the schema developed by the radiologists, and textual evidence is marked in the report. The annotations are then used to train different machine-learning-based classifiers. A method based on a cascade of classifiers which make use of a set of syntactic and semantic features is presented. The resulting system is a novel hierarchical classification system for the given task, which was experimentally evaluated.

As a follow-up of the work reported in (Gerevini et al., 2018), in (Putelli et al., submitted) deep learning techniques, and in particular Long Short Term Memory (LSTM) networks (currently the state-of-the-art method for many Natural Language Processing tasks), are applied to the same task, without the use of textual annotations. Each report is classified using a combination of neural network classifiers which make use of syntactic and semantic features. The resulting system is a novel hierarchical classification system for the given task. Table 9 compares its performance with that of the system based on standard machine learning techniques and annotations of relevant snippets.

                      Annotation       Deep Learning
                      Acc     FM       Acc     FM
Exam type             96.0    95.8     96.2    96.0
Result (First Exam)   77.3    76.1     78.3    76.3
Result (Follow-Up)    73.9    65.6     81.9    71.9
Lesion Nature         66.3    62.3     73.2    71.2
Site Lung             93.2    71.9     90.9    76.6
Site Pleura           93.2    75.5     94.4    75.8
Site Mediastinum      92.9    81.0     88.3    72.9

Table 9: Classification of radiological reports (Acc = accuracy, FM = F-measure). Comparison between the approach based on standard ML techniques and textual annotations and the model based on deep learning.

9. Discussion and conclusions
We have presented a comparison between deep learning and traditional machine learning methods for various NLP tasks in Italian. We carried out experiments using available datasets on two sequence tagging tasks (i.e., named entity recognition and nominal entity recognition) and four classification tasks (i.e., lexical relations among words, semantic relations among sentences, sentiment analysis and text classification). Our experiments show that deep learning approaches outperform traditional machine learning algorithms in sequence tagging, while for classification tasks that heavily rely on semantics, approaches based on feature engineering are still competitive. In more detail:

• BERT outperforms previous approaches for named entities, for textual entailment (RTE dataset) and for text classification on clinical reports;

• on nominal entity recognition, a task much more complex than NER, we have shown that the NeuroNLP2 model can be extended with terms contained in a gazetteer, achieving state-of-the-art performance;

• on the three datasets for sentiment analysis on tweets, traditional machine learning outperforms BERT, indicating that more accurate fine-tuning is still necessary;

• on lexical relations (i.e., compatibility among words), a simple BERT fine-tuning achieves results comparable to those obtained by more complex architectures using linguistic features (e.g., the semantic head of the term).

We think that a similar analysis could be carried out for other languages to provide an assessment of machine learning / deep learning models across different languages. As for future work, we believe that progress on language technologies needs benchmarks encompassing a variety of tasks, in order to favour models that share general linguistic knowledge across tasks. This is very much in the spirit of GLUE, the General Language Understanding Evaluation (Wang et al., 2018), a collection of resources for training, evaluating, and analyzing natural language understanding systems. Our next step will be to collect the Italian resources used in this paper and propose them as a single benchmark for NLP tasks on the Italian language.
10. Bibliographical References

Agirre, E., Cer, D., Diab, M., and Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393, Montréal, Canada, 7-8 June. Association for Computational Linguistics.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs.CL].

Bailoni, T., Dragoni, M., Eccher, C., Guerini, M., and Maimone, R. (2016). Healthy lifestyle support: The PerKApp ontology. In OWL: Experiences and Directions – Reasoner Evaluation, pages 15–23. Springer.

Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., and Patti, V. (2016). Overview of the EVALITA 2016 sentiment polarity classification task. In Proceedings of EVALITA 2016, Naples, Italy, December.

Basile, V., Bolioli, A., Patti, V., Rosso, P., and Nissim, M. (2014). Overview of the Evalita 2014 sentiment polarity classification task. In Proceedings of EVALITA 2014, Pisa, Italy, December.

Basile, P., Semeraro, G., and Cassotti, P. (2017). Bi-directional LSTM-CNNs-CRF for Italian sequence labeling. In Proceedings of the Italian Conference on Computational Linguistics (CLiC-it 2017), Roma, Italy, December.

Basile, P., Croce, D., Basile, V., and Polignano, M. (2018). Overview of the EVALITA 2018 aspect-based sentiment analysis task (ABSITA). In Proceedings of EVALITA 2018, Turin, Italy, December.

Bonadiman, D., Severyn, A., and Moschitti, A. (2015). Deep neural networks for named entity recognition in Italian. In Proceedings of the Italian Conference on Computational Linguistics (CLiC-it 2015), Trento, Italy, December.

Bos, J., Zanzotto, F. M., and Pennacchiotti, M. (2009). Textual entailment at EVALITA 2009. In Proceedings of EVALITA 2009, Reggio Emilia, Italy.

Chiu, J. P. and Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Craswell, N. (2009). Mean reciprocal rank. Encyclopedia of Database Systems, pages 1703–1703.

Dagan, I., Glickman, O., and Magnini, B. (2005). The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment, pages 177–190, Southampton, UK.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs.CL].

Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., and Weischedel, R. (2004). The automatic content extraction (ACE) program – tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May. European Language Resources Association (ELRA).

Dolan, W. B. and Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), Asia Federation of Natural Language Processing.

Dyer, C., Ballesteros, M., Ling, W., Matthews, A., and Smith, N. A. (2015). Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.

Gerevini, A., Lavelli, A., Maffi, A., Maroldi, R., Minard, A.-L. M., Serina, I., and Squassina, G. (2018). Automatic classification of radiological reports for clinical care. Artificial Intelligence in Medicine, 91:72–81.

Giampiccolo, D., Magnini, B., Dagan, I., and Dolan, B. (2007). The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague, June. Association for Computational Linguistics.

Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE.

Haghighi, A. and Klein, D. (2010). Coreference resolution in a modular, entity-centered model. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 385–393. Association for Computational Linguistics.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Kouylekov, M. and Magnini, B. (2005). Recognizing textual entailment with tree edit distance algorithms. In Proceedings of the First PASCAL Challenges Workshop on Recognising Textual Entailment, pages 17–20, Southampton, UK.

Kruszewski, G. and Baroni, M. (2015). So similar and yet incompatible: Toward the automated identification of semantically compatible words. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 964–969.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California, June. Association for Computational Linguistics.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2019). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, Sep.

Ma, X. and Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.

Magnolini, S., Piccioni, V., Balaraman, V., Guerini, M., and Magnini, B. (2019). How to use gazetteers for entity recognition with neural models. In Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5), pages 40–49, Macau, China, 12 August. Association for Computational Linguistics.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Nguyen, T.-V. T., Moschitti, A., and Riccardi, G. (2010). Kernel-based reranking for named-entity extraction. In Coling 2010: Posters, pages 901–909, Beijing, China, August. Coling 2010 Organizing Committee.

Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June. Association for Computational Linguistics.

Pianta, E. and Tonelli, S. (2010). KX: A flexible system for keyphrase extraction. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 170–173.

Pires, T., Schlinger, E., and Garrette, D. (2019). How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Putelli, L., Gerevini, A. E., Lavelli, A., and Serina, I. (submitted). Deep learning for classification of radiology reports with a hierarchical schema.

Ramshaw, L. and Marcus, M. (1995). Text chunking using transformation-based learning. In Third Workshop on Very Large Corpora.

Richards, B. (1987). Type/token ratios: What do they really tell us? Journal of Child Language, 14(2):201–209.

Speranza, M. (2009). The named entity recognition task at EVALITA 2009. In Proceedings of EVALITA 2009, Reggio Emilia, Italy.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November. Association for Computational Linguistics.

Zanoli, R., Pianta, E., and Giuliano, C. (2009). Named entity recognition through redundancy driven classifiers. In Proceedings of Evalita 2009.

11. Language Resource References