Sentence Embedding Models For Similarity Detection of Software Requirements
https://doi.org/10.1007/s42979-020-00427-1
ORIGINAL RESEARCH
Abstract
Semantic similarity detection mainly relies on the availability of laboriously curated ontologies, as well as on supervised and unsupervised neural embedding models. In this paper, we present two domain-specific sentence embedding models trained on a natural language requirements dataset in order to derive sentence embeddings specific to the software requirements engineering domain. We use cosine-similarity measures in both these models. The results of the experimental evaluation confirm that the proposed models enhance the performance of textual semantic similarity measures over existing state-of-the-art neural sentence embedding models: we reach an accuracy of 88.35%, which improves by about 10% on existing benchmarks.
Introduction

Context

Software requirement elicitation is mostly performed using natural language, due to the need for effective communication with the client and for documents that accurately describe the application scenarios.

The use of natural language processing (NLP) in various aspects of Requirements Engineering (RE) heavily contributes to speeding up the software production process and to improving the quality of the resulting systems [9]. NLP allows machines to understand and extract patterns from text data by applying various techniques such as semantic textual similarity, information retrieval, document classification, entity recognition and so on. Some of the advantages of using NLP in RE are as follows:

This article is part of the topical collection "Applications of Software Engineering and Tool Support" guest edited by Nabendu Chaki, Agostino Cortesi and Anirban Sarkar.
Several recent works have demonstrated that transfer learning can significantly improve the performance of NLP models on specific tasks, such as classification. Setting off the eruption of universal word embedding research, several works improved previous unsupervised approaches (Word2Vec and state-of-the-art contextual word vectors) by incorporating the supervision of semantic or syntactic knowledge. The most noteworthy are FastText [16] and ELMo [27]. FastText's key enhancement over the initial Word2Vec vectors is the integration of character n-grams, which enables word representations to be determined even for words that did not exist in the training data ("out-of-vocabulary" words). Within ELMo, each word is given a representation that is a function of the whole sentence of the corpus to which it belongs. The embedding (E) is determined from the internal states of a two-layer bidirectional Language Model (LMo) and, hence, the name ELMo.
In the area of sentence embedding, there is a general perception that the simple technique of directly averaging the word vectors of a sentence (the so-called Bag-of-Words approach [23]) provides a good benchmark for several downstream tasks. However, this approach creates variable-length embeddings for sentences of different lengths. In [2], the authors have proposed an algorithm which can take any well-known word embedding mechanism, encode a sentence as a linear weighted combination of its word vectors, and compute the weighted average of those vectors. Finally, the projection of the sentence vectors onto their first principal component is removed.
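To make the scheme of [2] concrete, the following sketch (ours, not code from the cited work) computes frequency-weighted averages of word vectors and then removes the projection onto the first principal component; the word-vector table, the word-frequency table and the smoothing constant a are assumed inputs, not values taken from this paper.

    import numpy as np

    def weighted_sentence_embeddings(sentences, word_vectors, word_freq, a=1e-3):
        """sentences: list of token lists; word_vectors: token -> vector; word_freq: token -> relative frequency."""
        embeddings = []
        for tokens in sentences:
            weights = np.array([a / (a + word_freq.get(t, 0.0)) for t in tokens])
            vectors = np.array([word_vectors[t] for t in tokens])  # assumes every token has a vector
            embeddings.append(np.average(vectors, axis=0, weights=weights))
        embeddings = np.array(embeddings)
        # remove the projection of every sentence vector on the first principal component
        first_pc = np.linalg.svd(embeddings, full_matrices=False)[2][0]
        return embeddings - np.outer(embeddings @ first_pc, first_pc)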
The first major proposals went further than basic averaging of word vectors, using unsupervised training objectives. Skip-Thought Vectors [18] is an unsupervised learning method for sentence embeddings, analogous to the Skip-Gram word embedding model. The model consists of an encoder–decoder based on a Recurrent Neural Network (RNN) that is trained to recreate the surrounding sentences from the current sentence. An improvement over Skip-Thought Vectors is presented in Quick-Thought Vectors [28]; the strength of that model is its speed of training. For a long period of time, supervised learning of sentence embeddings was assumed to generate embeddings of lower quality than unsupervised approaches. This assumption was recently reversed, particularly after the InferSent [8] model emerged. InferSent uses the Stanford Natural Language Inference (SNLI) corpus [5] to train a classifier on top of a sentence encoder. The authors implement the sentence encoder using a bi-directional LSTM coupled with a max-pooling operator. In 2018, Cer et al. from Google Research proposed the Universal Sentence Encoder [7], which has become one of TensorFlow Hub's most downloaded pre-trained text modules, providing versatile sentence embedding models that turn sentences into vector representations. The Universal Sentence Encoder is trained with a Deep Averaging Network (DAN) encoder. It is designed for a variety of tasks to understand natural language dynamically.

Bidirectional Encoder Representations from Transformers (BERT [10]) is a pre-trained model developed by the Google researchers Devlin et al. in 2018. BERT is trained on gigabytes of data from various sources (mostly Wikipedia and the Book Corpus) in an unsupervised fashion. In brief, training is performed by masking a few words (nearly 15% of the words) in a sentence and letting the model predict the masked words. As the model is trained to predict, it also learns to generate an efficient internal representation of words as word embeddings. The uniqueness of the BERT model is that it examines text in both directions to obtain a clearer understanding of the meaning of the context and of the relationships between words. BERT has set new state-of-the-art results for several NLP tasks such as question answering, sentence classification, sentence pair regression and so on. To perform sentence pair regression, BERT accepts two sentences separated by the special token SEP and applies multi-head attention layers. The output is then passed to a simple regression function to provide the final label. Using this architecture, BERT sets a new benchmark for performance on semantic textual similarity among state-of-the-art models. RoBERTa [20] has shown that minor modifications to the pre-training process can further enhance the efficiency of BERT. The major downside of BERT is that no independent sentence embedding mechanism is provided. Two major approaches are used to generate sentence embeddings:

1. Averaging method: the most popular approach creates sentence embeddings by simply averaging the word embeddings of all words in one sentence.
2. CLS vector: alternatively, the embedding of the special CLS token that appears at the beginning of the sentence may be used [21, 35].

The well-known bert-as-a-service repository (https://github.com/hanxiao/bert-as-service/) offers both these options; both are illustrated in the short sketch below.
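The two approaches can be illustrated in a few lines of code. The sketch below uses the Hugging Face transformers package and an arbitrary pre-trained checkpoint; these are our assumptions for illustration, not tools named by the paper.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    encoded = tokenizer("The system shall log every failed login attempt.", return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**encoded).last_hidden_state  # shape: [1, sequence_length, hidden_size]

    sentence_embedding_avg = token_embeddings.mean(dim=1)      # 1. averaging method
    sentence_embedding_cls = token_embeddings[:, 0]            # 2. CLS-token vector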
In [19], the authors have proposed another variant of BERT called ALBERT, where two parameter-reduction techniques have been introduced to decrease the memory consumption and increase the training speed of the model. They have also used a self-supervised loss that focuses on modeling inter-sentence coherence. In [34], the authors have proposed XLNet, which integrates ideas from Transformer-XL and generalizes the auto-regressive pre-training mechanism; it involves bidirectional learning of contexts by maximizing the expected likelihood over all permutations of the factorization order. In another work [30], the authors have come up with the idea of fine-tuning the BERT model on the SNLI dataset. In this work, the modification of the BERT model
includes the inclusion of siamese and triplet network structures to produce semantically relevant sentence embeddings. Another major contribution of that paper is the generation of sentence embeddings that are compatible with cosine-similarity measurements. In Table 1, we have summarized sentence embedding models based on different parameters. These parameters include the model architecture, the learning methods, the training datasets for each of the models and so on. The parameter "Order of Words" specifies whether a particular model is aware of the ordering of words (uni-directional or bi-directional) or not. Another parameter, "Semantic Relationship between Texts", specifies whether the model is context-sensitive or context-free. The STS Benchmark [6] comprises a selection of the English datasets used in the Semantic Textual Similarity (STS) tasks; the datasets consist of text from image captions, headlines and articles from news and user forums. In Table 1, we have also presented the similarity scores measured for the different models on the STS datasets.

The current state of the art features a range of well-known sentence embedding models, widely used for a number of NLP applications in different domains. Requirements Engineering is not unique in incorporating NLP for developing potential solutions to specific problems. However, this domain still lacks a rich domain-specific sentence embedding model, which is the basic foundation for most NLP tasks.

Domain-Specific Sentence Embedding Models

In this section, we introduce two domain-specific sentence embedding models, namely PUBER and FiBER, for finding the similarity (and dissimilarity) between pairs of natural language requirements sentences. Both models use the BERT neural network architecture [10]. Figure 1 depicts the architecture of the BERT model, which is originally trained on Wikipedia data and the Book Corpus.

Our first model, PUBER, uses the BERT architecture to generate a pre-trained model from the PURE dataset. On the other hand, FiBER uses the pre-trained BERT model to derive the cosine similarity between pairs of natural language requirements sentences; this composite architecture is then fine-tuned using the PURE dataset. The main objective of this work is to train and build a vocabulary for our models in order to leverage the benefits of NLP in the Requirements Engineering domain.

PUBER

The PUBER model is built on the same architecture as the BERT sentence embedding model presented in Fig. 1. However, the model is trained on the PURE dataset consisting of 34,268 unlabeled sentences.
Fig. 1 Architecture of the BERT model: the tokens of the two input sentences are mapped through the vocabulary to embedding units E1 ... E1024, processed by hidden layers of transformer blocks (T1-1 ... T1024-24), and passed to a softmax output layer.
A distinctive feature of the BERT architecture is its uniformity across different tasks: there is minimal difference between the pre-trained model architecture and the architecture required for performing several downstream tasks such as semantic similarity, classification, sentiment analysis, etc. PUBER's workflow model contains the multi-layer bidirectional transformer encoder of the BERT architecture. Additionally, the cosine similarity is evaluated between sentence embeddings. The model is shown in Fig. 2.

In our evaluation, we use the original implementation proposed by Vaswani et al. [32]. With respect to Fig. 1, we have kept the number of intermediate transformer blocks (T_i-j) in each hidden layer at 24, the number of hidden layers (E_i to T_i) at 1024 and the number of self-attention heads at 16, as originally proposed. We can describe the workflow model of PUBER with the help of Fig. 1 and the following steps.

1. First, a vocabulary is built from the PURE dataset, so that domain-specific words are represented by a single, unsplit token.
2. Once the vocabulary is prepared, we start training the language model. The vocabulary is then used for the word embeddings and for masking.
3. As the model is based on BERT, we train it on a Masked Language Modeling task [10], which masks some percentage of the input tokens at random and then predicts those masked tokens.
4. The final hidden vectors corresponding to the masked tokens are fed into a softmax layer. The training environment is described as follows (a pre-training sketch based on these settings is given after this list):

   (a) The batch size is set to 32.
   (b) The number of training steps is 100,000.
   (c) The learning rate is kept at 2e-5.
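The paper does not publish its training scripts, so the following is only a plausible sketch of such a masked-language-modeling setup using the Hugging Face transformers library. The file name pure_sentences.txt, the tokenizer checkpoint and the maximum sequence length are our assumptions, and the architecture parameters are read here as the standard BERT-large configuration (24 layers, hidden size 1024, 16 heads); the batch size, step count, learning rate and 15% masking ratio follow the values stated above.

    from datasets import load_dataset
    from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    # Tokenizer/vocabulary: a standard WordPiece vocabulary is reused here for simplicity;
    # the paper instead builds its own vocabulary from the PURE dataset.
    tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")

    # BERT-large-style configuration: 24 layers, hidden size 1024, 16 attention heads.
    config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=1024,
                        num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096)
    model = BertForMaskedLM(config)

    # One PURE sentence per line in a hypothetical text file.
    dataset = load_dataset("text", data_files={"train": "pure_sentences.txt"})["train"]
    dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                          batched=True, remove_columns=["text"])

    collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
    args = TrainingArguments(output_dir="puber-mlm", per_device_train_batch_size=32,
                             max_steps=100_000, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()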
PUBER is, thus, a sentence embedding model that uses the BERT architecture trained on the PURE dataset. Essentially, two sentences are passed to our PUBER model, say s1 and s2. In the next phase, the PUBER model provides sentence embeddings for both sentences, hs1 and hs2, respectively. Finally, the cosine similarity between the two embeddings is measured and a similarity score is obtained.
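A minimal sketch of this scoring step follows, assuming the PUBER weights have been saved to a hypothetical local directory puber-mlm and that mean pooling over the token embeddings is used to obtain hs1 and hs2.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("puber-mlm")  # hypothetical local PUBER checkpoint
    encoder = AutoModel.from_pretrained("puber-mlm")

    def embed(sentence: str) -> torch.Tensor:
        encoded = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            token_embeddings = encoder(**encoded).last_hidden_state
        return token_embeddings.mean(dim=1).squeeze(0)      # mean-pooled sentence embedding

    h_s1 = embed("The system shall notify the operator of any sensor failure.")
    h_s2 = embed("Operators must be alerted when a sensor stops working.")
    score = torch.nn.functional.cosine_similarity(h_s1, h_s2, dim=0).item()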
FiBER

The FiBER architecture and workflow can be described as follows.

(a) At the foundation level, we have the BERT pre-trained model. We already discussed the BERT architecture, with 1024 hidden layers, each with 24 transformer blocks (Fig. 1). Every layer performs multi-headed attention computations on the word representations of the previous layer. The multi-headed attention computations create a new intermediate representation, which is then fed to the next layer of hidden states. We keep the transformer architecture as it is in the original BERT model.
(b) On top of the BERT model, we augment a MEAN pooling component. The pooling mechanism is essential to obtain a fixed-size representation of a sentence. Thus, the transformer model accepts two sentences s1 and s2 and generates fixed-size sentence embeddings, denoted by hs1 and hs2, respectively. Several pooling strategies are available for tasks such as classification and the extraction of word embeddings and sentence embeddings. Four different pooling strategies are described as follows (a code sketch of these modes is given after the list):
1. If pooling is set to 'None', no pooling is applied. This results in a [maximum-sequence-length, 1024] encoding matrix for a sequence, where 1024 is the dimension of the encoder. This mode is useful for solving token-level tasks like word embedding.
2. If pooling is set to 'CLS', only the vector corresponding to the first 'CLS' token is retrieved, and the output encoding matrix will be [batch_size, 1024]. This pooling type is useful for solving sentence-pair classification tasks.
3. If pooling is set to 'MEAN', the embedding will be the average of the hidden states of the encoding layer along the time axis, and the output encoding matrix will be [batch_size, 1024]. This mode is particularly useful for sentence representation tasks.
4. Finally, if pooling is set to 'MAX', it takes the maximum of the hidden states of the encoding layer along the time axis. 'MAX' pooling is also useful for sentence representation tasks.
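As an illustration, the three fixed-size pooling modes could be implemented over the transformer's token embeddings as in the following sketch (our code, not the authors'), with the attention mask used to exclude padding positions, a detail not discussed in the paper.

    import torch

    def pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor, mode: str = "MEAN") -> torch.Tensor:
        """token_embeddings: [batch, seq_len, 1024]; attention_mask: [batch, seq_len], 1 for real tokens."""
        mask = attention_mask.unsqueeze(-1).float()
        if mode == "CLS":
            return token_embeddings[:, 0]                                   # [batch, 1024]
        if mode == "MEAN":
            return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        if mode == "MAX":
            return token_embeddings.masked_fill(mask == 0, -1e9).max(dim=1).values
        raise ValueError(f"unsupported pooling mode: {mode}")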
loss function constrains the distribution of the features in the
The results of experimental comparison of the three same class. It is designed specially for the cosine-similarity
pooling strategies mentioned above are depicted in measurement. This loss function computes the cosine simi-
Table 2. The FiBER model exhibits best performance larity between the sentence embeddings and minimizes the
when adopting the ‘MEAN’ pooling strategy. Thus, we mean squared error loss. Furthermore, we used a batch size
keep our default configuration to ‘MEAN’ pooling. of 32 and the Adam optimizer [17] with learning rate of
(c) To measure the similarity between two test sentences, 3e–5. Finally, we tested our model on 800 pairs of unseen
we need to feed them to the neural network which requirements sentences. We evaluated the performance
updates the weights for generating fixed-sized sentence metric, in this case, the cosine similarity between sentence
embeddings. embeddings is computed. We have considered the threshold
(d) At last, the cosine similarity is measured on these fixed- of similarity metric to be 0.5 which is quite common while
size sentence embeddings. measuring cosine similarity.
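The training step described above maps naturally onto the open-source sentence-transformers library; the sketch below is our reconstruction under that assumption (the paper does not name a library), with placeholder training pairs and labels, while the MEAN pooling, cosine loss, 4 epochs, batch size 32, learning rate 3e-5 and 0.5 decision threshold follow the values reported in this section.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

    word_embedding = models.Transformer("bert-large-uncased", max_seq_length=128)  # checkpoint name assumed
    pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
    model = SentenceTransformer(modules=[word_embedding, pooling])

    # Placeholder pairs; in the paper the pairs are derived from the PURE requirements sentences.
    train_pairs = [("The system shall encrypt stored passwords.",
                    "Stored passwords must be kept encrypted by the software.", 1.0),
                   ("The system shall encrypt stored passwords.",
                    "The user interface shall support a dark theme.", 0.0)]
    train_examples = [InputExample(texts=[s1, s2], label=score) for s1, s2, score in train_pairs]
    loader = DataLoader(train_examples, shuffle=True, batch_size=32)
    loss = losses.CosineSimilarityLoss(model)   # mean-squared error on the cosine similarity
    model.fit(train_objectives=[(loader, loss)], epochs=4, optimizer_params={"lr": 3e-5})

    # Inference: two requirements are considered similar when the cosine similarity exceeds 0.5.
    emb1, emb2 = model.encode(["Sentence one.", "Sentence two."], convert_to_tensor=True)
    is_similar = util.cos_sim(emb1, emb2).item() > 0.5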
Performance of the compared sentence embedding models on the test set of requirements pairs (all values in percentage, one column per model):

Accuracy for finding similar sentences        66.58   71.14   91.40   79.23   78.28   73.03   95.94
Accuracy for finding non-similar sentences    91.86   80.06   85.30   79.52   75.06   80.31   36.48
Average accuracy                              78.50   75.60   88.35   79.37   76.75   76.50   67.62
False positives                                8.39   17.26   14.96   20.20   24.67   19.42   63.25
False negatives                               33.17   20.90    8.59   20.76   21.71   26.96    4.05
References

30. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on empirical methods in natural language processing and the 9th International Joint Conference on natural language processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019; p. 3982–3992. https://doi.org/10.18653/v1/D19-1410.
31. Shirabad JS, Menzies TJ. The PROMISE repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa, Canada, 2005; vol 24.
32. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: NIPS 2017: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. Red Hook, NY: Curran Associates Inc.; 2017. p. 5998–6008.
33. Winkler J, Vogelsang A. Automatic classification of requirements based on convolutional neural networks. In: 2016 IEEE 24th International Requirements Engineering Conference Workshops (REW), IEEE, 2016; p. 39–45.
34. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: generalized autoregressive pretraining for language understanding. In: Advances in neural information processing systems, 2019; p. 5754–5764.
35. Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. BERTScore: evaluating text generation with BERT. In: 8th International Conference on learning representations (ICLR), 2020.
36. Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A, Fidler S. Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE International Conference on computer vision, 2015; p. 19–27.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.