
SN Computer Science (2021) 2:69

https://doi.org/10.1007/s42979-020-00427-1

ORIGINAL RESEARCH

Sentence Embedding Models for Similarity Detection of Software Requirements

Souvick Das1 · Novarun Deb2 · Agostino Cortesi3 · Nabendu Chaki4

Received: 11 August 2020 / Accepted: 11 December 2020


© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd., part of Springer Nature 2021

Abstract
Semantic similarity detection mainly relies on the availability of laboriously curated ontologies, as well as of supervised and unsupervised neural embedding models. In this paper, we present two domain-specific sentence embedding models trained on a natural language requirements dataset in order to derive sentence embeddings specific to the software requirements engineering domain. We use cosine-similarity measures in both these models. The results of the experimental evaluation confirm that the proposed models enhance the performance of textual semantic similarity measures over existing state-of-the-art neural sentence embedding models: we reach an accuracy of 88.35%, which improves by about 10% on existing benchmarks.

Keywords Requirements similarity · Sentence embedding · Semantic similarity

This article is part of the topical collection "Applications of Software Engineering and Tool Support" guest edited by Nabendu Chaki, Agostino Cortesi and Anirban Sarkar.

* Souvick Das (corresponding author): [email protected]
Novarun Deb: [email protected]
Agostino Cortesi: [email protected]
Nabendu Chaki: [email protected]

1 Department of Computer Science and Engineering, University of Calcutta, Kolkata 700106, India
2 Indian Institute of Information Technology, Vadodara, Gandhinagar, Gujarat 382028, India
3 Department of Environmental Science, Informatics, and Statistics, Ca' Foscari University, 30172 Venezia, Italy
4 Department of Computer Science and Engineering, University of Calcutta, Kolkata 700106, India

Introduction

Context

Software requirement elicitation is mostly performed using natural language, due to the need of effective communication with the client and the use of documents that accurately describe the application scenarios.

The use of natural language processing (NLP) in various aspects of Requirements Engineering (RE) heavily contributes to speeding up the software production process and to improving the quality of the resulting systems [9]. NLP allows machines to understand and extract patterns from text data by applying various techniques such as semantic textual similarity, information retrieval, document classification, entity recognition and so on. Some of the advantages of using NLP in RE are as follows:

1. Support to functional and non-functional requirements classification.
2. Provision of design models for model-driven verification.
3. Reusability of software components using similarity measures.

In fact, software engineers are required to understand, analyze, and validate the requirements manually to generate the requirement specification document. Requirements engineers need to extract all related software requirements before preparing the final specifications. This also requires the classification of functional and non-functional requirements, which is a tedious job. Several classification algorithms (like [1, 33]) can be used to achieve classification of FRs and NFRs. Some of the research is based on clustering algorithms to classify FRs and NFRs [12]. On the other hand, Knauss et al. [25] present a socio-technical model for requirements classification. All these approaches are mostly dependent on good sentence embedding mechanisms. For example, a domain-specific corpus of computer science articles was used to generate embeddings of words and capture the context of a word to improve comprehensibility [24]. NLP can be used in RE for finding relationships among multiple functional and non-functional requirements, which is essential to build and analyze design models at different phases [29]. Several works (like [15, 22]) show that reusable software engineering can be facilitated by finding the semantic similarity between natural language requirements.

Research Problem

Leveraging the benefits of NLP in requirements engineering requires mechanisms that are purely dependent on rich sentence (or word) embedding models. State-of-the-art pre-trained models represent generic, common-sense knowledge and fail to achieve high accuracy in specific domains. In spite of the existence of pre-trained word embedding models on Stack Overflow posts ([4, 11]), there is a significant lack of domain-specific pre-trained models in Requirements Engineering which could enhance the processing of natural language requirements documents. This is the main research problem being addressed in this paper. We intend to develop pre-trained sentence embedding models that are specifically designed for processing natural language requirements with high accuracy.

Contribution

In this paper, we propose two new domain-specific models which are specially designed for analyzing software requirements in natural language.

The first model, called PUBER, is based on the BERT [10] architecture and trained on the PURE [13] dataset, which contains 79 natural language requirements documents. The second model, called FiBER, is more powerful than the first one. The original pre-trained BERT model is trained on Wikipedia articles and Book Corpus [36]. The proposed FiBER model utilizes the pre-trained BERT model to find the cosine similarity between pairs of natural language requirements. This composite architecture is then fine-trained again on the PURE requirements dataset. This allows FiBER to utilize a larger vocabulary to understand natural language appropriately and, at the same time, to grasp the semantics of natural language requirements documents.

Both the models are able to generate fixed-length sentence embeddings, which are essential for many different NLP tasks.

Results

To evaluate the effectiveness of the two proposed domain-specific models in the domain of Requirements Engineering, we compared them with state-of-the-art sentence embedding models. The evaluation results show that our PUBER model achieves an accuracy of 79.23% for identifying semantically similar sentences and 79.52% for identifying dissimilar sentence pairs. FiBER, on the other hand, achieves 91.40% and 85.30% accuracy for similar and dissimilar sentences, respectively, i.e., an overall accuracy of 88.35%, which improves by about 10% on existing benchmarks.

Organization of the Paper

The rest of the paper is organized as follows. Section "Related Work" presents the state-of-the-art sentence embedding models as well as several researches in which the RE community integrated NLP techniques with requirements engineering. In Section "Domain-Specific Sentence Embedding Models", we introduce our embedding models. Section "Experimental Evaluation" shows the results of an experimental analysis comparing different sentence embedding models on finding semantic textual similarity in requirements datasets. Section "Conclusion" concludes.

Related Work

Representation of words and sentences as vectors in a low-dimensional space enables us to incorporate various deep learning NLP techniques to accomplish different challenging tasks. Word and sentence embeddings encode words and sentences as fixed-length vectors that drastically enhance the performance of NLP tasks. Word embeddings are considered to be an improvement over the traditional bag-of-words model, which provides large and sparse word vectors. Word2Vec [23] is the first neural embedding model, developed by Google researchers. Word2Vec represents words as multidimensional arrays. Two unsupervised algorithms, namely Skip-gram and CBoW, are used to generate the word embedding. Pennington et al. [26] proposed an unsupervised algorithm (GloVe) for obtaining vector representations of words based on aggregation of global word-to-word co-occurrence statistics from a corpus.

Recently, the search for universal embeddings has been gaining importance. Pre-trained embeddings on a large corpus can be connected with a variety of downstream tasks such as classification, semantic textual similarity, sentiment analysis and so on to improve performance. This form of learning is referred to as Transfer Learning.


In [14], authors have demonstrated that transfer learning can significantly improve the performance of NLP models on specific tasks, such as classification. Setting off the eruption of universal word embedding research, several works improved previous unsupervised approaches (Word2Vec and state-of-the-art contextual word vectors) by incorporating the supervision of semantic or syntactic knowledge. The most noteworthy are FastText [16] and ELMo [27]. FastText's key enhancement over the initial Word2Vec vectors is the integration of character n-grams, which enables word representations to be determined for words that did not even exist in the training data ("out-of-vocabulary" words). Within ELMo, each word is given a representation that is a function of the whole sentence of the corpus to which it belongs. The embedding (E) is determined from the internal states of a two-layer bidirectional Language Model (LMo) and, hence, the name ELMo.

In the area of sentence embedding, there is a general perception that the simple technique of directly averaging the word vectors of a sentence (the so-called Bag-of-Words approach [23]) provides a good benchmark for several downstream tasks. However, this approach creates variable-length embeddings for sentences of different lengths. In [2], authors have proposed an algorithm which can take any well-known word embedding mechanism, encode a sentence as a linear weighted combination of its word vectors, and then compute the weighted average of those vectors. Finally, the projection of the vectors on their first principal component is removed. The first major proposals went further than basic averaging of word vectors, using unsupervised training objectives. Skip-Thought Vectors [18] is an unsupervised learning method for sentence embeddings. It is analogous to the word embedding Skip-Gram model. The model consists of an encoder-decoder based on a Recurrent Neural Network (RNN) that is trained to recreate the surrounding sentences from the current sentence. An improvement of Skip-Thought vectors is presented in Quick-Thought Vectors [28]. The strength of the model is its speed of training. For a long period of time, supervised learning of sentence embeddings was assumed to generate embeddings of lower quality than unsupervised approaches. This assumption was recently reversed, particularly after the InferSent [8] model emerged. InferSent uses the Stanford Natural Language Inference [5] (SNLI) corpus to train a classifier on top of a sentence encoder. The authors implement the sentence encoder using a bi-directional LSTM coupled with a max-pooling operator. In 2018, Cer et al. from Google research proposed the Universal Sentence Encoder [7], which has become one of Tensorflow Hub's most downloaded pre-trained text modules, providing versatile sentence embedding models that turn sentences into vector representations. The Universal Sentence Encoder is trained with a Deep Averaging Network (DAN) encoder. It is designed for a variety of tasks to understand natural languages dynamically.

Bidirectional Encoder Representations from Transformers (BERT [10]) is a pre-trained model developed by Google researchers Devlin et al. in 2018. BERT is trained on gigabytes of data from various sources (mostly from Wikipedia and Book Corpus) in an unsupervised fashion. In brief, the training is performed by masking a few words (nearly 15% of the words) in a sentence and allowing the model to predict the masked words. As the model is trained to predict, it also learns to generate an efficient internal representation of words as word embeddings. The uniqueness of the BERT model is that it explores text representation from both directions to obtain a clearer understanding of the meaning of the context and its relationships. BERT has set new state-of-the-art results for several NLP tasks such as question answering, sentence classification, sentence pair regression and so on. To perform sentence pair regression, BERT accepts two sentences separated by the special token SEP and applies multi-head attention layers. The output is then passed to a simple regression function to provide the final label. Using this architecture, BERT sets a new benchmark for performance on semantic textual similarity among state-of-the-art models. RoBERTa [20] has shown that minor modifications to the pre-training process can further enhance the efficiency of BERT. The major downside of BERT is that no independent sentence embedding mechanism is assessed. Two major approaches are used to generate sentence embeddings:

1. Averaging method: the most popular BERT method for creating sentence embeddings, which simply averages the word embeddings of all words in one sentence.
2. CLS vector: alternatively, the embedding of the CLS special token that appears at the beginning of the sentence may be used ([21, 35]).

The well-known bert-as-a-service repository¹ offers both these options.

¹ https://github.com/hanxiao/bert-as-service/.

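As an illustration of the two strategies listed above, the following sketch shows how a mean-pooled or CLS-based sentence embedding can be extracted from BERT's token-level outputs. It is our own minimal example, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; neither is prescribed by the paper.

```python
# Sketch: deriving a fixed-length sentence embedding from BERT token outputs,
# either by averaging all token vectors (Averaging Method) or by taking the
# vector of the leading CLS token (CLS vector). Library and checkpoint are
# assumptions made for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(sentence: str, strategy: str = "mean") -> torch.Tensor:
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_states = model(**encoded).last_hidden_state[0]  # [seq_len, hidden]
    if strategy == "cls":
        return token_states[0]                     # embedding of the CLS token
    mask = encoded["attention_mask"][0].unsqueeze(-1).float()
    return (token_states * mask).sum(0) / mask.sum()  # average over real tokens

embedding = sentence_embedding("The system shall log every failed login attempt.")
print(embedding.shape)  # torch.Size([768]) for bert-base-uncased
```

Tools such as the bert-as-a-service repository mentioned above expose this choice of pooling as a configuration option.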

Table 1  Comparison of sentence embedding models

Weighted sum of vectors: unsupervised learning; feed-forward neural network model (Skip-gram or CBoW); trained on Wikipedia; order of words not considered; semantic relation between texts not needed; STS Benchmark score [6]: 70.

Skip-Thoughts: unsupervised learning; GRU (Gated Recurrent Units) or LSTM (Long Short-Term Memory); can be trained on any text corpus; order of words considered; semantic relation between texts needed; STS Benchmark score: 72.1.

InferSent: supervised learning; bi-directional LSTM with softmax classifier; trained on GloVe or FastText, SNLI dataset and text corpus; order of words considered; semantic relation between texts needed; STS Benchmark score: 80.1.

Google's USE: unsupervised learning; Transformer or Deep Averaging Network; trained on Wikipedia, web news and web Q/A; order of words considered; semantic relation between texts needed; STS Benchmark score: 87.21.

BERT: unsupervised learning; bi-directional Transformer; trained on Wikipedia and Book Corpus; order of words considered; semantic relation between texts needed; STS Benchmark score: 90.

RoBERTa: unsupervised learning; bi-directional Transformer; trained on Wikipedia, Book Corpus and the CommonCrawl News dataset; order of words considered; semantic relation between texts needed; STS Benchmark score: 92.4.

SBERT: unsupervised learning; fine-tuned BERT on SNLI with softmax classifier; trained on the Stanford Natural Language Inference dataset; order of words considered; semantic relation between texts needed; STS Benchmark score: 79.19.

XLNet: unsupervised learning; Transformer architecture with recurrence; trained on Wikipedia, Book Corpus, Common Crawl, Giga5, Clueweb etc.; order of words considered; semantic relation between texts needed; STS Benchmark score: 91.8.

In [19], authors have proposed another variant of BERT called ALBERT, where two parameter-reduction techniques have been introduced to decrease the memory consumption and enhance the training speed of the model. They have also used a self-supervised loss that focuses on modeling inter-sentence coherence. In [34], authors have proposed XLNet, which integrates ideas from Transformer-XL and generalizes the auto-regressive pre-training mechanism. It involves bidirectional learning of contexts by maximizing the expected likelihood over all permutations of the factorization order. In another work [30], authors have come up with the idea of fine-tuning the BERT model on the SNLI dataset. In this work, the modification of the BERT model includes the inclusion of siamese and triplet network structures to produce semantically relevant sentence embeddings. Another major contribution of the paper is the generation of sentence embeddings that are compatible with cosine similarity measurements. In Table 1, we have summarized sentence embedding models based on different parameters. These parameters include the model architecture, learning methods, training datasets for each of the models and so on. The parameter "Order of Words" specifies whether a particular model is aware of the orderings of words (uni-directional or bi-directional) or not. Another parameter, "Semantic Relationship between Texts", specifies whether the model is context sensitive or context free. The STS [6] Benchmark comprises a selection of the English datasets used in the Semantic Textual Similarity (STS) tasks. The datasets consist of text from image captions, headlines and articles from news and user forums. In Table 1, we have presented the similarity scores measured for the different models on the STS datasets.

The current state of the art features a range of well-known sentence embedding models, widely used for a number of NLP applications in different domains. Requirements Engineering is not unique in incorporating NLP for developing potential solutions to specific problems. However, this domain still lacks a rich domain-specific sentence embedding model, which is the basic foundation for most NLP tasks.

Domain-Specific Sentence Embedding Models

In this section, we introduce two domain-specific sentence embedding models, namely PUBER and FiBER, for finding the similarity (and dissimilarity) between pairs of natural language requirements sentences. Both these models use the BERT neural network architecture [10].

Figure 1 depicts the architecture of the BERT model, which is originally trained on the Wikipedia data and Book Corpus.

Our first model, PUBER, uses the BERT architecture to generate a pre-trained model from the PURE dataset. On the other hand, FiBER uses the pre-trained BERT model to derive the cosine similarity between pairs of natural language requirements sentences. This composite architecture is then fine-trained using the PURE dataset. The main objective of this work is to train and build vocabulary for our models to leverage the benefits of NLP in the Requirements Engineering domain.

PUBER


Fig. 1  BERT architecture (tokens of a sentence pair pass through initial embeddings Ei, intermediate transformer blocks Ti-j in the hidden layers, and final embeddings Ti, followed by a softmax layer)

The PUBER model is built on the same architecture of the BERT sentence embedding model as presented in Fig. 1. However, the model is trained on the PURE dataset consisting of 34,268 unlabeled sentences.

A distinctive feature of the BERT architecture is its uniformity across different tasks. There is minimal difference between the pre-trained model architecture and the architecture required for performing several downstream tasks like semantic similarity, classification, sentiment analysis, etc. PUBER's workflow model contains the multi-layer bidirectional transformer encoder of the BERT architecture. Additionally, the cosine similarity is evaluated between sentence embeddings. The model is shown in Fig. 2.

In our evaluations, we use the original implementation proposed by Vaswani et al. [32]. With respect to Fig. 1, we have kept the number of intermediate transformer blocks (Ti-j) in each hidden layer to be 24, the number of hidden layers (Ei to Ti) to be 1024 and the number of self-attention heads to be 16, as proposed. We can describe the workflow model of PUBER with the help of Fig. 1 and the following steps.

1. The first step in this procedure is to build the domain-specific vocabulary. To build the vocabulary from the alphabet of single bytes, we have used the default WordPiece embedding with a 30,000 token vocabulary. Several characteristics of this vocabulary are presented as follows.
   (a) The classification CLS token is considered as the first token of every sequence.
   (b) Differentiation of sentences is taken care of by using the SEP token.
   (c) Our domain-specific vocabulary is optimized for the PURE dataset. Compared to a generic vocabulary trained for English, more requirements-specific words are represented by a single, unsplit token.
2. Once the vocabulary is prepared, we start training the language model. The vocabulary is then used for the word embeddings and masking.
3. As the model is based on BERT, we train it on a task of Masked Language Modeling [10], which masks some percentage of the input tokens at random and then predicts those masked tokens.
4. The final hidden vectors corresponding to the masked tokens are fed into a softmax layer. The training environment created is described as follows.
   (a) Batch size is set to 32.
   (b) Number of training steps is considered to be 100,000.
   (c) Learning rate is kept as 2e-5.

The described settings enable us to obtain a bidirectional pre-trained domain-specific language model. Once this model is developed, it is ready to be used for different downstream tasks. In this paper, we have measured the semantic similarity based on cosine similarity between two sentence embeddings. Figure 2 depicts the PUBER model, which uses the BERT architecture that is trained on the PURE dataset. Essentially, two sentences are passed to our PUBER model, say s1 and s2. In the next phase, the PUBER model provides sentence embeddings for both the sentences, hs1 and hs2, respectively. Finally, cosine similarity is measured between the two embeddings and a similarity score is evaluated.

Fig. 2  PUBER similarity checking model
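A compact sketch of the pre-training recipe in steps 1 to 4 above (WordPiece vocabulary of 30,000 tokens, masked language modeling, batch size 32, 100,000 steps, learning rate 2e-5) might look as follows. This is an assumed reconstruction using the Hugging Face tokenizers and transformers libraries; the file name pure_sentences.txt and the choice of these libraries are ours, not the authors'.

```python
# Sketch of a PUBER-style pre-training run: build a domain-specific WordPiece
# vocabulary from the PURE sentences, then train a BERT-style encoder with the
# Masked Language Modeling objective. Libraries and file names are assumptions.
from tokenizers import BertWordPieceTokenizer
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

# Step 1: 30,000-token WordPiece vocabulary with CLS/SEP special tokens.
wordpiece = BertWordPieceTokenizer(lowercase=True)
wordpiece.train(files=["pure_sentences.txt"], vocab_size=30_000)
wordpiece.save_model(".")  # writes vocab.txt

# Steps 2-3: masked language modeling over the unlabeled requirement sentences.
tokenizer = BertTokenizerFast(vocab_file="vocab.txt")
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="pure_sentences.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

# Step 4: BERT-large-style dimensions and the training settings given in the text.
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=1024,
                    num_hidden_layers=24, num_attention_heads=16)
model = BertForMaskedLM(config)
args = TrainingArguments(output_dir="puber", per_device_train_batch_size=32,
                         max_steps=100_000, learning_rate=2e-5)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()
```

Once pre-trained in this way, the same encoder produces the sentence embeddings whose cosine similarity is reported by the PUBER similarity checking model in Fig. 2.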

FiBER

Model Architecture

The fine-trained FiBER model is built by augmenting the pre-trained BERT model with a pooling strategy on top of it. Figure 3 shows the architecture of the FiBER transformer model.

Fig. 3  FiBER similarity checking model


(a) At the foundation level, we have the BERT pre-trained model. We already discussed the BERT architecture with 1024 hidden layers, each with 24 transformer blocks (Fig. 1). Every layer does multi-headed attention computations on the word representation of the previous layer. The multi-headed attention computations create a new intermediate representation, which is then fed to the next layer of hidden states. We keep the transformer architecture as it is in the original BERT model.
(b) On top of the BERT model, we augment a MEAN pooling component. The pooling mechanisms are essential to get a fixed representation of a sentence. Thus, the transformer model accepts two sentences s1 and s2 and generates fixed-sized sentence embeddings, denoted by hs1 and hs2, respectively. There are several pooling strategies available to perform certain tasks like classification, extraction of word embeddings and sentence embeddings. Four different pooling strategies are described as follows:

1. If the pooling is set to 'None', no pooling is applied. This will result in a [maximum-sequence-length, 1024] encode matrix for a sequence. This mode is useful for solving token-level tasks like word embedding. Here, 1024 is the dimension of the encoder.
2. If pooling is set to 'CLS', only the vector corresponding to the first 'CLS' token is retrieved and the output encode matrix will be [batch_size, 1024]. This pooling type is useful for solving sentence-pair classification tasks.
3. If pooling is set to 'MEAN', the embeddings will be the average of the hidden states of the encoding layer on the time axis and the output encode matrix will be [batch_size, 1024]. This mode is particularly useful for sentence representation tasks.
4. Finally, if pooling is set to 'MAX', it takes the maximum of the hidden states of the encoding layers on the time axis. 'MAX' pooling is also useful for sentence representation tasks.

The results of the experimental comparison of the three pooling strategies mentioned above are depicted in Table 2. The FiBER model exhibits the best performance when adopting the 'MEAN' pooling strategy. Thus, we keep our default configuration to 'MEAN' pooling.

Table 2  Performance of FiBER for different pooling strategies (average accuracy on the test dataset)
MAX: 84.65%
MEAN: 88.35%
CLS: 80.25%

(c) To measure the similarity between two test sentences, we need to feed them to the neural network, which updates the weights for generating fixed-sized sentence embeddings.
(d) At last, the cosine similarity is measured on these fixed-size sentence embeddings.

Cosine similarity is generally used as a metric that measures the angle between vectors, where the magnitude of the vectors is not considered. It could be the case that we work with sentences of uneven lengths. The number of occurrences of a particular word may be more frequent in one sentence than in the other. These are the situations where the semantic similarity between two sentences can be affected if we consider spatial distance measures. Cosine similarity gives more accuracy for measuring semantic similarity as it measures the angle between two vectors rather than considering the spatial distance.

However, fixed-sized sentence embeddings are compatible with all standard similarity measuring methods (in terms of angle between vectors or spatial distance) like cosine similarity, correlation, Euclidean distance, Jaccard similarity and so on. It is worth mentioning here that we have calculated both cosine similarity and correlation between two fixed-length embeddings to measure the similarity between sentences. Both methods provide almost identical results.

Transformer Model Training

In Fig. 3, we represent the training of our transformer model using the PURE dataset sentences within the dotted rectangular block. The PURE dataset contains 79 publicly available natural language requirements documents collected from the Web. It consists of 34,268 sentences. We have used the cosine loss function [3] for each of the 4 epochs. The cosine loss function constrains the distribution of the features in the same class. It is designed specially for the cosine-similarity measurement. This loss function computes the cosine similarity between the sentence embeddings and minimizes the mean squared error loss. Furthermore, we used a batch size of 32 and the Adam optimizer [17] with a learning rate of 3e-5. Finally, we tested our model on 800 pairs of unseen requirements sentences. We evaluated the performance metric, in this case the cosine similarity between sentence embeddings. We have considered the threshold of the similarity metric to be 0.5, which is quite common while measuring cosine similarity.
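To make the MEAN pooling of item (b) and the scoring of item (d) concrete, here is a small self-contained sketch, assuming a hypothetical encoder that returns a [sequence_length, 1024] matrix of hidden states for a sentence; the random matrices below merely stand in for such encoder output.

```python
# Sketch: MEAN pooling of token-level hidden states into a fixed-size sentence
# embedding, followed by cosine similarity between the two pooled vectors.
# The random matrices stand in for encoder outputs of sentences s1 and s2.
import numpy as np

def mean_pool(token_states: np.ndarray) -> np.ndarray:
    """Average the hidden states over the time axis -> one 1024-d vector."""
    return token_states.mean(axis=0)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Angle-based similarity; the magnitude of the vectors is ignored."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

token_states_s1 = np.random.rand(12, 1024)  # placeholder for the encoder output of s1
token_states_s2 = np.random.rand(18, 1024)  # placeholder for the encoder output of s2

h_s1, h_s2 = mean_pool(token_states_s1), mean_pool(token_states_s2)
score = cosine_similarity(h_s1, h_s2)
print("similar" if score >= 0.5 else "dissimilar", round(score, 3))
```

MAX pooling would replace mean(axis=0) with max(axis=0), and CLS pooling would simply return the first row; the comparison in Table 2 is what led to MEAN being kept as the default.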

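The fine-tuning setup just described (cosine loss between sentence embeddings with a mean-squared-error objective, batch size 32, an Adam-style optimizer with learning rate 3e-5, 4 epochs) can be sketched with the sentence-transformers library as below. This is our hedged reconstruction, not the authors' released code; the base checkpoint and the two example requirement pairs are invented for illustration.

```python
# Sketch of a FiBER-style fine-tuning pass: a pre-trained BERT encoder with a
# MEAN pooling head is trained so that the cosine similarity of two sentence
# embeddings matches a target score (MSE on the cosine values).
from torch.utils.data import DataLoader
from sentence_transformers import (InputExample, SentenceTransformer,
                                   losses, models, util)

word_embedding = models.Transformer("bert-base-uncased", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(),
                         pooling_mode="mean")                 # MEAN pooling head
model = SentenceTransformer(modules=[word_embedding, pooling])

# Invented requirement pairs with similarity targets in [0, 1].
train_examples = [
    InputExample(texts=["The system shall encrypt stored passwords.",
                        "Stored credentials must be kept encrypted."], label=1.0),
    InputExample(texts=["The system shall encrypt stored passwords.",
                        "The user interface shall display the current date."], label=0.0),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.CosineSimilarityLoss(model)   # cosine similarity + MSE loss

model.fit(train_objectives=[(loader, loss)], epochs=4,
          optimizer_params={"lr": 3e-5})

# Scoring two unseen requirements against the 0.5 threshold described above.
embeddings = model.encode(["Requirement A", "Requirement B"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]) >= 0.5)
```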

Table 3  Comparison of sentence embedding techniques on the natural language requirements dataset (all values in percentage)

USE: accuracy for similar sentences 66.58; accuracy for non-similar sentences 91.86; average accuracy 78.50; false positives 8.39; false negatives 33.17
Infersent: accuracy for similar sentences 71.14; accuracy for non-similar sentences 80.06; average accuracy 75.60; false positives 17.26; false negatives 20.9
FiBER: accuracy for similar sentences 91.40; accuracy for non-similar sentences 85.30; average accuracy 88.35; false positives 14.96; false negatives 8.59
PUBER: accuracy for similar sentences 79.23; accuracy for non-similar sentences 79.52; average accuracy 79.37; false positives 20.20; false negatives 20.76
BERT: accuracy for similar sentences 78.28; accuracy for non-similar sentences 75.06; average accuracy 76.75; false positives 24.67; false negatives 21.71
RoBERTa: accuracy for similar sentences 73.03; accuracy for non-similar sentences 80.31; average accuracy 76.50; false positives 19.42; false negatives 26.96
DistillBERT: accuracy for similar sentences 95.94; accuracy for non-similar sentences 36.48; average accuracy 67.62; false positives 63.25; false negatives 4.05

Experimental Evaluation

We compare the performance of our two models, PUBER and FiBER, with other state-of-the-art sentence embedding models, namely Universal Sentence Encoder (USE), BERT, RoBERTa, DistilBERT and Infersent. We have evaluated these mechanisms on 800 pairs of software requirements statements to measure the semantic textual similarity between them. The dataset consists of pairs of sentences annotated with binary labels, 'Yes' and 'No'. The label 'Yes' signifies that the statements are semantically related. The label 'No' signifies the opposite. The dataset is built by manual annotation. The requirements statements are taken from the requirements dataset provided by the OpenScience tera-PROMISE [31] repository. We have presented evaluation results by plotting graphs for different ranges of the number of sentences, from 100 to 800. The values are listed in Table 3.

The evaluation result in Fig. 4a shows the accuracy of the different approaches in identifying semantically similar sentences. The figure shows that FiBER gives over 91.40% accuracy for identifying semantically related sentences, whereas the Universal Sentence Encoder shows a quite poor accuracy score of 66.58% for identifying semantically similar sentences. BERT achieves a 78.28% accuracy score, which is better than RoBERTa. Infersent gives a slightly better accuracy score than the Universal Sentence Encoder but is unable to beat BERT or RoBERTa in the same scenario. Our other model, PUBER, achieves better accuracy than BERT and RoBERTa in identifying semantically similar sentences. DistillBERT is very biased towards identifying sentence pairs as semantically similar and gives 95.94% accuracy. DistillBERT generates almost identical sentence embeddings for every sentence, so that it predicts nearly every pair of sentences as semantically similar. This is the reason why it shows a high false-positive rate of 63.25%.

Figure 4b presents the accuracy of the different approaches in identifying dissimilar sentences. It shows that our FiBER reaches 85.30% accuracy for identifying non-related sentences from the dataset. The Universal Sentence Encoder achieves the highest accuracy of 91.86% for the same. PUBER achieves better accuracy than BERT. RoBERTa gives an 80.31% accuracy score for identifying dissimilar sentences. Finally, as we expect, DistillBERT gives a poor result relative to the other approaches because of its high false positives. It only achieves 36.48% accuracy in identifying dissimilar sentence pairs.

Figure 4c shows how the FiBER model outperforms other state-of-the-art sentence embedding methods when applied to a mix of similar and dissimilar sentences. Our fine-trained model achieves an improvement of almost 10% on average over Google's Universal Sentence Encoder and 12% compared to BERT or RoBERTa. FiBER achieves 88.35% accuracy, which is the highest among all the state-of-the-art sentence embedding models. The PUBER model is also slightly better than the other sentence embedding models (except FiBER). USE, Infersent, BERT and RoBERTa show quite similar accuracy scores, whereas DistillBERT has the worst accuracy.

In the case of finding semantically similar sentences, although DistillBERT shows the best result, its percentage of false positives (63.25%) is also the highest among all the approaches. As a consequence, DistillBERT provides the worst outcome on the detection of dissimilar sentences. Google's Universal Sentence Encoder performs best for identifying dissimilar sentences, whereas its performance for identifying semantically similar sentences is not quite as good. The false-negative percentage is also quite high for USE, approximately 33%. BERT and RoBERTa provide almost similar accuracy on average. On the other hand, our proposed FiBER model achieves the highest accuracy on average, and also performs well for both similar and dissimilar sentence recognition. In addition, its false-positive and false-negative percentages are the second lowest in each case, 14.96% and 8.59%, respectively.

Considering the disjoint vocabulary and the scale of improvement over state-of-the-art well-known models like BERT, Google's Universal Sentence Encoder and Infersent, we conclude that when Requirements Engineering domain-specific vocabulary and sentence embeddings are the key concern, FiBER and PUBER perform the best.
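The accuracy, false-positive and false-negative percentages of the kind reported in Table 3 can be derived from thresholded similarity scores with a short routine like the one below. It is our own illustrative sketch with invented scores and labels; the exact definitions the authors used for the false-positive and false-negative percentages are not spelled out, so the formulas here are assumptions.

```python
# Sketch: turning cosine-similarity scores and 'Yes'/'No' annotations into
# accuracy and error percentages, using the 0.5 threshold described earlier.
# Scores and gold labels below are invented examples, not data from the paper.
def evaluate(scores, labels, threshold=0.5):
    predicted = ["Yes" if score >= threshold else "No" for score in scores]
    similar = [p for p, gold in zip(predicted, labels) if gold == "Yes"]
    dissimilar = [p for p, gold in zip(predicted, labels) if gold == "No"]
    acc_similar = 100.0 * similar.count("Yes") / len(similar)
    acc_dissimilar = 100.0 * dissimilar.count("No") / len(dissimilar)
    return {
        "accuracy for similar sentences": acc_similar,
        "accuracy for non-similar sentences": acc_dissimilar,
        "average accuracy": (acc_similar + acc_dissimilar) / 2,
        "false positives": 100.0 * dissimilar.count("Yes") / len(dissimilar),
        "false negatives": 100.0 * similar.count("No") / len(similar),
    }

scores = [0.91, 0.34, 0.76, 0.12, 0.58, 0.44]      # invented cosine similarities
labels = ["Yes", "No", "Yes", "No", "Yes", "Yes"]  # invented gold annotations
print(evaluate(scores, labels))
```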


Fig. 4  Illustration of performance of different sentence embedding models (panels a to d)

Conclusion

The PUBER model has a rich WordPiece vocabulary for the Requirements Engineering domain. Since PUBER is purely trained on the PURE requirements dataset, it does not have a rich general English vocabulary. This is why the BERT model has been fine-trained on the PURE dataset to build the enhanced version which we call FiBER. The fine-trained model is able to make use of BERT's huge vocabulary and also understand specific words from the Requirements Engineering domain.

Since we have built sentence embedding models for the Software Requirements domain, we can empower different NLP tasks within the Requirements Engineering domain. As a future direction, we aim to apply our proposed sentence embedding models to accomplish several such NLP tasks. These include requirements classification, named entity recognition, and sentiment analysis to understand code quality in code repositories, checking code similarity and so on.

Acknowledgements This work has been partially supported by the Project IN17MO07 "Formal Specification for Secured Software System", under the Indo-Italian Executive Programme of Scientific and Technological Cooperation.

Compliance with Ethical Standards

Conflict of Interest Statement On behalf of all authors, the corresponding author states that there is no conflict of interest.


References

1. Abad ZSH, Karras O, Ghazi P, Glinz M, Ruhe G, Schneider K. What works better? A study of classifying requirements. In: 2017 IEEE 25th International Requirements Engineering Conference (RE), IEEE; 2017; p. 496–501.
2. Arora S, Liang Y, Ma T. A simple but tough-to-beat baseline for sentence embeddings. In: International Conference on learning representations; 2016; p. 1–16.
3. Barz B, Denzler J. Deep learning on small datasets without pre-training using cosine loss. In: The IEEE Winter Conference on applications of computer vision, 2020; p. 1371–380.
4. Biswas E, Vijay-Shanker K, Pollock L. Exploring word embedding techniques to improve sentiment analysis of software engineering texts. In: 2019 IEEE/ACM 16th International Conference on mining software repositories (MSR), IEEE, 2019; p. 68–78.
5. Bowman SR, Angeli G, Potts C, Manning CD. A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on empirical methods in natural language processing, association for computational linguistics, Lisbon, Portugal, 2015; p. 632–42. https://doi.org/10.18653/v1/D15-1075.
6. Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L. Semeval-2017 task 1: semantic textual similarity-multilingual and cross-lingual focused evaluation. In: Association for Computational Linguistics, 2017; p. 1–14. https://www.aclweb.org/anthology/S17-2001, arXiv preprint arXiv:1708.00055.
7. Cer D, Yang Y, Kong Sy, Hua N, Limtiaco N, St John R, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, Strope B, Kurzweil R. Universal sentence encoder for English. In: Proceedings of the 2018 Conference on empirical methods in natural language processing: system demonstrations, association for computational linguistics, Brussels, Belgium, 2018; p. 169–74. https://doi.org/10.18653/v1/D18-2029.
8. Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A. Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 Conference on empirical methods in natural language processing, association for computational linguistics, Copenhagen, Denmark, 2017; p. 670–80. https://doi.org/10.18653/v1/D17-1070.
9. Dalpiaz F, Ferrari A, Franch X, Palomares C. Natural language processing for requirements engineering: the best is yet to come. In: IEEE Software, IEEE, 2018;35:115–19.
10. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019; p. 4171–186. https://doi.org/10.18653/v1/N19-1423.
11. Efstathiou V, Chatzilenas C, Spinellis D. Word embeddings for the software engineering domain. In: Proceedings of the 15th International Conference on mining software repositories, 2018; p. 38–41.
12. Eyal Salman H, Hammad M, Seriai AD, Al-Sbou A. Semantic clustering of functional requirements using agglomerative hierarchical clustering. In: Information, Multidisciplinary Digital Publishing Institute, 2018;9:222.
13. Ferrari A, Spagnolo GO, Gnesi S. PURE: a dataset of public requirements documents. In: 2017 IEEE 25th International Requirements Engineering Conference (RE), IEEE, 2017; p. 502–5.
14. Howard J, Ruder S. Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the association for computational linguistics (Volume 1: Long Papers), association for computational linguistics, Melbourne, Australia, 2018; p. 328–39. https://doi.org/10.18653/v1/P18-1031.
15. Ilyas M, Kung J. A similarity measurement framework for requirements engineering. In: 2009 Fourth International Multi-Conference on computing in the global information technology, IEEE, 2009; p. 31–4.
16. Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the association for computational linguistics: volume 2, short papers, association for computational linguistics, Valencia, Spain, 2017; p. 427–31.
17. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: 3rd International Conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings; 2015.
18. Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S. Skip-thought vectors. In: Advances in neural information processing systems, 2015; p. 3294–302.
19. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: a lite BERT for self-supervised learning of language representations. In: arXiv preprint arXiv:1909.11942, 2019.
20. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: a robustly optimized BERT pretraining approach. In: arXiv preprint arXiv:1907.11692, 2019.
21. May C, Wang A, Bordia S, Bowman SR, Rudinger R. On measuring social biases in sentence encoders. In: Proceedings of the 2019 Conference of the North American Chapter of the association for computational linguistics: human language technologies, volume 1 (Long and Short Papers), association for computational linguistics, Minneapolis, Minnesota, 2019; p. 622–28. https://doi.org/10.18653/v1/N19-1063.
22. Mihany FA, Moussa H, Kamel A, Ezzat E, Ilyas M. An automated system for measuring similarity between software requirements. In: Proceedings of the 2nd Africa and Middle East Conference on software engineering, 2016; p. 46–51.
23. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: NIPS'13: Proceedings of the 26th International Conference on Neural Information Processing Systems, Vol. 2. Red Hook, NY: Curran Associates Inc.; 2013. p. 3111–9.
24. Mishra S, Sharma A. On the use of word embeddings for identifying domain specific ambiguities in requirements. In: 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), IEEE, 2019; p. 234–40.
25. Ott D. Automatic requirement categorization of large natural language specifications at Mercedes-Benz for review improvements. In: International Working Conference on requirements engineering: foundation for software quality, Springer, 2013; p. 50–64.
26. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP), 2014; p. 1532–543.
27. Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the association for computational linguistics: human language technologies, volume 1 (Long Papers), association for computational linguistics, New Orleans, Louisiana, 2018; p. 2227–237. https://doi.org/10.18653/v1/N18-1202.
28. Quan Z, Wang Z, Le Y, Yao B, Li K, Yin J. An efficient framework for sentence similarity modeling. IEEE/ACM Trans Audio Speech Lang Process. 2019;27:853–65.
29. Rahimi M, Mirakhorli M, Cleland-Huang J. Automated extraction and visualization of quality concerns from requirements specifications. In: 2014 IEEE 22nd International Requirements Engineering Conference (RE), IEEE, 2014; p. 253–62.
30. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on empirical methods in natural language processing and the 9th International Joint Conference on natural language processing (EMNLP-IJCNLP), association for computational linguistics, Hong Kong, China, 2019; p. 3982–992. https://doi.org/10.18653/v1/D19-1410.
31. Shirabad JS, Menzies TJ. The PROMISE repository of software engineering databases. In: School of information technology and engineering, University of Ottawa, Canada, 2005; vol 24.
32. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: NIPS 2017: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. Red Hook, NY: Curran Associates Inc.; 2017. p. 5998–6008.
33. Winkler J, Vogelsang A. Automatic classification of requirements based on convolutional neural networks. In: 2016 IEEE 24th International Requirements Engineering Conference Workshops (REW), IEEE, 2016; p. 39–45.
34. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: generalized autoregressive pretraining for language understanding. In: Advances in neural information processing systems, 2019; p. 5754–764.
35. Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. BERTScore: evaluating text generation with BERT. In: 8th International Conference on learning representations, ICLR 2020; 2020.
36. Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A, Fidler S. Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE International Conference on computer vision, 2015; p. 19–27.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
