Enhancing Machine Comprehension of GDPR Privacy Policies Using Recent Advancements in NLP
Department of Digitalisation
cand.merc.(it.)(Data Science)
CDSCO4001E
By:
Alisa Ilina, 141804
Simon Christensen, 111803
Supervisor:
Daniel Hardt
This thesis was written as a part of the Master of Science in Business Administration
and Data Science at CBS. Please note that neither the institution nor the examiners are
responsible – through the approval of this thesis – for the theories and methods used, or
results and conclusions drawn in this work.
Acknowledgement
We would like to express our sincere gratitude to our supervisor Daniel Hardt for his inspiring guidance and constructive criticism.
Acronyms
API Application Programming Interface
BERT Bidirectional Encoder Representations from Transformers
BS Batch Size
CNN Convolutional Neural Network
DNN Deep Neural Network
ECHR European Convention on Human Rights
EM Exact Match
EU European Union
FFN Feed-Forward Network
FN False Negative
FP False Positive
GDPR General Data Protection Regulation
GELU Gaussian Error Linear Unit
HPO Hyperparameter Optimisation
IR Information Retrieval
IoU Intersection over Union
LR Learning Rate
MHA Multi-Head Attention
ML Machine Learning
MLM Masked Language Modelling
NE Named Entity
NLP Natural Language Processing
NLQuAD Non-factoid Long Question Answering Dataset
NSP Next Sentence Prediction
ODG Open Domain General Dataset
POS Part of Speech
QA Question Answering
RNN Recurrent Neural Network
RoBERTa Robustly Optimized BERT Pretraining Approach
SAS Semantic Answer Similarity
SaaS Software as a Service
SGD Stochastic Gradient Descent
SQuAD Stanford Question Answering Dataset
TN True Negative
TP True Positive
Contents
1 Introduction
1.1 Background
1.2 Problem Statement
1.3 Topic Delimitation and Superior Scope
1.4 Research Question
1.5 Thesis Structure
2 Related Work
2.1 NLP in the Legal Domain
2.2 NLP in Privacy Policies
2.3 Privacy Policies in Post-GDPR Regulatory Landscape
3 Theoretical Framework
3.1 GDPR and Privacy Policies
3.2 Information Retrieval
3.2.1 Term Weighting Schemes: TF-IDF and BM25
3.2.2 Inverted Index
3.2.3 Lucene
3.2.4 ElasticSearch
3.3 Deep Neural Language Modelling
3.3.1 Vector Semantics
3.3.2 Transformers
3.3.3 Bidirectional Encoder Representations from Transformers (BERT)
3.3.4 Fine Tuning BERT with MLM and NSP
3.3.5 BERT Variations
3.4 Question Answering (QA)
3.4.1 Types of QA Systems
3.4.2 IR-based QA and BERT
3.4.3 Reading Comprehension and SQuAD
3.4.4 Retriever-Reader
3.4.5 Evaluation Methods for QA
3.5 Transfer Learning and Knowledge Distillation
4 Methodology
4.1 Research Philosophy
4.1.1 Research Approach
4.1.2 Methodological Research Choice
4.1.3 Research Strategy
4.1.4 Research Time Horizon
4.1.5 Techniques and Procedures
4.2 CRISP-DM Methodology
4.3 Business Understanding
4.4 Data Understanding
4.4.1 Exploration of PolicyQA
4.4.2 Acquisition and Exploration of GDPRQA
4.5 Data Preparation and Preprocessing
4.6 Modelling
4.6.1 Modelling Setup
4.6.2 Iteration 1: General Knowledge Acquisition
4.6.3 Iteration 2: Transferring Knowledge to Privacy Policy Domain
4.6.4 Iteration 3: Optimising the Best Performing Model
4.6.5 Parameters
4.6.6 Model Associated Files
4.6.7 Simulation of Production Environment Prototype
4.7 Evaluation
4.7.1 Evaluation of Model Results
4.7.2 Evaluation of Simulated Production Prototype
4.8 Deployment
4.8.1 Application Layer
4.8.2 Data Layer
4.8.3 PolicyCrawler
4.8.4 Machine Learning Layer
4.8.5 Feedback Module
4.8.6 Further Improvements to the Architecture
4.9 Reliability and Validity
5 Results
5.1 Test Environment
5.1.1 Models Pre-trained on SQuAD
5.1.2 Models Fine-Tuned on NLQuAD
5.1.3 Top Model Performance Results
5.1.4 Hyperparameter Optimisation Findings
5.2 Production Setup Results
5.2.1 Production Results - GDPRQA
5.2.2 Production Results - PolicyQA
6 Discussion
6.1 Test Environment Discussion
6.1.1 Iteration 1 - Open Domain General Knowledge Learning
6.1.2 Iteration 2 - Transfer Learning on GDPR Privacy Policy Domain
6.1.3 Iteration 3 - Hyperparameter Optimisation Discussion
6.2 Production Environment Discussion
6.3 Business Use Case - Deployment Discussion
6.3.1 Internal Company Deployment
6.3.2 External Web Deployment
6.3.3 Mobile Application
6.3.4 Recommendations for Deployment
6.4 Answering Research Questions
6.5 Contribution and Implication for Research
6.6 Limitations
6.6.1 Data Limitations
6.6.2 Labelling Subjectivity
6.6.3 Computational Power Limitations
6.7 Learning Reflections
6.8 Future Work
6.8.1 Standardisation of Labelling and Upscaling of the Dataset
6.8.2 Larger Models
6.8.3 Longformers
6.8.4 Dense Passage Retrievers
6.8.5 Text Generation
6.8.6 Evaluation Metrics
6.8.7 Other Machine Reading Comprehension NLP Tasks
7 Conclusion
8 Appendix
References
1 Introduction
1.1 Background
Digital society is growing rapidly in complexity and data needs, making data protection more urgent than ever (Haq, 2021). Data privacy has become a crucial social dilemma, with privacy policies acting as a medium for personal data management. Privacy policies inform users on how companies collect, use, share and manage personal data, as well as the rights users have with regard to their data. Companies disclose their practices by providing privacy policies on their websites and requesting consent from the users. The privacy landscape has evolved vastly since the introduction of privacy laws such as the GDPR in the EU, which require organisations to keep their privacy practices accessible to users and to meet specific expectations of clarity. GDPR has changed the regulatory landscape, giving EU citizens more control over their data.
Digital users increasingly care about their privacy. Our data is continuously tracked, with apps monitoring our environment, bodies, and activity, as the data we share turns into a new kind of "currency". Incidents of malicious intent, misuse of information and data leakage raise the concerns of people who want to be in control of their personally identifiable information, making data awareness a salient issue. Numerous recent incidents have drawn attention to the misuse of personal data. For instance, throughout the 2010s, Cambridge Analytica collected the personal data of millions of Facebook users without consent for political advertising, resulting in a major privacy scandal disclosed in 2018 (Lapowsky, 2019).
Yet, despite rising data protection concerns, research shows that users rarely read privacy policies prior to giving their consent (Alduaij, Chen, & Gangopadhyay, 2016). An array of issues limits users in their right to control their personal data. To start with, verbose, lengthy explanations often result in users neglecting to read the privacy policies governing the collection and use of their data. McDonald and Cranor (2008) estimated that the average person would need 201 hours annually to read through all the privacy policies they encounter. Since the time of their research, the exponential increase in the volume, velocity, variety and veracity of data and increasingly sophisticated technologies have only exacerbated this problem.
1.2 Problem Statement
Moreover, although some companies have improved the coherence of their policies since the introduction of GDPR, most still make little effort at clear elaboration, which remains an obstacle to the comprehension of policies. Such considerable energy and time investment hinders users from making informed decisions (Obar & Oeldorf-Hirsch, 2020). Furthermore, although privacy policies target regular users, the legal language used may be challenging for a normal consumer to understand (Oltramari et al., 2018). While GDPR attempts to force companies to use easily comprehensible language, the regulation does not elaborate on what this entails. Krumay and Klar (2020) find that the readability of privacy policies is quite challenging for users, with at least a college degree expected to understand them.
Users should be aware of the risks that come with their consent in order to be empowered to make informed decisions. The legitimacy of privacy policies depends on the user's comprehension of the document, which, as research shows, rarely happens (Reidenberg, 2000). As with other legal documents, privacy policies operate under the "notice and choice" principle: users are expected to read a policy before consenting to the service, and to consent only if it matches what they expect their data to be used for. Users might claim to have had their data misused when they have in fact consented to an ambiguous privacy policy without fully comprehending its conditions. The compromises users make with regard to sharing their data are highly nuanced, further complicating the matter.
Users are not equipped to comprehend policies at such a large scale, veiled in ambiguous, lengthy language. Thus, stakeholders lack tools and solutions to address the depth and breadth of privacy policies. Certain attempts have been made to address this issue, such as machine-readable formats of policies (Cranor, 2002) and privacy icons introduced by the EU¹ (European Commission, 2016). Yet, these initiatives have been restrained by manual labour and scalability issues, as the human effort required to retrofit the new notices to existing policies and to maintain them over time is disproportionately high.
¹ "Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC"
1.3 Topic Delimitation and Superior Scope
This dynamic manifests the potential to leverage Natural Language Processing (NLP) techniques to interpret privacy policies, empowering users to selectively explore issues salient to their usage of the respective application or website.
This paper builds on recent advancements in natural language processing to allow users to retain control over their privacy in more meaningful ways, addressing the unrealistic expectation that users read numerous policies on an almost daily basis. It aims to initiate the development of a model capable of improving machine comprehension in this domain. The research also attests to the importance of meeting the transparency requirement of GDPR by enhancing the comprehension of privacy policies.
One way to enhance the reading comprehension of privacy policies is to enable automatic answering of privacy-related user inquiries. Successfully accomplishing this scope will result in a Question Answering (QA) system that empowers users to comprehend lengthy, verbose, often linguistically complex privacy policies with ease and in a time-efficient manner. The system will be able to extract data reflecting user needs in the modern post-GDPR regulatory landscape at scale.
The thesis possesses a high degree of complexity and requires a profound technical and
legal understanding of the matter. Given restricted time, scope and resources, the thesis
is designed to contribute to the delimited scope and build the foundation for a scalable
QA system for post-GDPR privacy policy reading comprehension.
1.4 Research Question
Therefore, based on this premise, the paper will aim to answer the following research
questions:
RQ: How can recent advancements in NLP be leveraged to build a Question Answering
system that can answer privacy policy inquiries in the post-GDPR regulatory landscape?
Q1: How can Transformers, Deep Neural Networks, Transfer Learning, and data aug-
mentation aid in adaptation to a post-GDPR privacy domain and improve the current
state-of-the-art?
Q2: How can a developed privacy policy Question Answering system be implemented and evaluated with respect to potential deployment in a real-world scenario?
1.5 Thesis Structure
This section is meant to guide the reader through the structure of the thesis. Following
Introduction, Related Work will provide a literature overview of the relevant research in
the area of legal NLP, specifically within automatisation of privacy policy comprehension.
Next, Theoretical Framework presents the concepts around GDPR and the differences it
has brought to privacy policies. It will further present Information Retrieval systems and
ElasticSearch, Deep Neural Language Modelling, and Transformer architectures, including
variations of BERT, Question Answering and the application of BERT. Additionally, it will
explain the concept of knowledge distillation and transfer learning. Further, Methodology
will display the key aspects of the research design. Guided by the CRISP-DM frame-
work, the thesis storyline evolves from understanding user needs and business objectives,
to in-depth analysis and exploration of privacy policy data. Specifically, the secondary
PolicyQA dataset and primary self-annotated GDPRQA dataset, as well as the collected
privacy policies are explored. Moreover, Methodology elaborates on all iterations of the
modelling – (1) open-domain general knowledge learning, (2) transfer learning of domain-
specific privacy knowledge, and (3) hyperparameter optimisation. Moreover, it discusses
the simulation of the models in a production environment and possible deployment use
case implications. In Results the models are evaluated using the relevant performance
metrics introduced in Theoretical Framework and compared against one another. Discussion then interprets the results, answers the research questions, and reflects on the contribution, limitations, and future work, before Conclusion summarises the findings of the thesis.
2 Related Work
This chapter provides a glimpse into the existing research to understand the foundation that the thesis is built upon. It starts by reviewing the literature on the application and popularity of NLP within the legal domain, discussing the widespread usage of BERT and its domain-specific variations such as LegalBERT and PrivBERT. The introduction of PrivBERT transitions the reader to the next section, which dives into the specifics of NLP research within the area of privacy policies. The section introduces the OPP-115 corpus, the foundation which enabled extensive research to be done and further datasets to be created. Researchers in this field have achieved state-of-the-art results in a variety of NLP tasks, such as classification, text summarisation, and Question Answering. Yet, the prevailing usage of pre-GDPR privacy policy corpora means that the accuracy and scalability of working with GDPR-relevant policies remain largely unexplored. Research has revealed the substantial shifts that GDPR has brought into the regulatory landscape, suggesting the necessity of creating datasets with GDPR-relevant aspects.
2.1 NLP in the Legal Domain
Most laws are expressed in natural language, manifesting the potential for analysing and predicting juridical language at scale with NLP by transforming unstructured natural text into machine-readable representations. Over the last decade, the synergy between law and NLP has seen prominent growth. NLP is able to address inefficiencies of current practice, such as scalability, driven by the increasing number of digital legal repositories as well as growing algorithmic and hardware power. The potential of NLP in the legal domain includes summarisation of context, document relevance scoring, prediction of judicial outcomes, Question Answering, sentiment analysis, and more (Nay, 2018).
Transformer architectures such as BERT, which will be further discussed in Section 3.3, have been in the spotlight of NLP in recent years. It has become common practice to augment pre-training with domain-specific data, allowing models to capture domain implications. BERT language models are often trained to learn the nuances of the specific jargon of a given domain in order to achieve higher performance, for instance, in financial, legal, medical, and other industry-specific fields. Examples include SciBERT, pre-trained on biomedical and computer science publications.
Legal-BERT is a family of BERT models intended for legal NLP applications (Wiratchawa et al., 2021). It was pre-trained on 12 GB of English legal literature, based on EU and UK legislation, rulings of the EU Court of Justice and the European Court of Human Rights, as well as US court cases and contracts. Wiratchawa et al. (2021) found performance gains from using a legal domain-specific BERT for QA on legal documents, achieving state-of-the-art outcomes in several tasks: multi-label classification of ECHR court cases, NER of contract headers and details, and a legal QA system. Domain knowledge plays a significant role when dealing with legal language due to the specific terminology and context used. The researchers also present LegalBERT-SMALL, three times smaller than the original LegalBERT while still maintaining strong performance (Wiratchawa et al., 2021).
Moreover, Srinath, Wilson, and Giles (2021) developed PrivBERT, a privacy domain-oriented language model trained to identify the specifics of privacy policy semantics and lexicon. PrivBERT will be further introduced and discussed in Section 3.3.
2.2 NLP in Privacy Policies
The initial in-depth research combining natural language processing and privacy policies came with the release of the OPP-115 corpus, the first large-scale effort to provide fine-grained, sentence-level annotation of policies (Wilson et al., 2016). It classified each sentence into one of ten categories, with specific attributes attached to each category to ensure specificity:
1. First Party Collection/Use: The sentence includes how or why a controller collects
information about the user.
2. Third Party Sharing/Collection: The sentence includes how user information may be shared with or collected by third parties.
3. User Choice/Control: The sentence includes information about the choices and control options available to users over their data.
4. User Access, Edit & Deletion: The sentence includes information about how and if the users can access, edit or delete their data.
5. Data Retention: The sentence includes information about how long the data is stored.
6. Data Security: The sentence includes information about how the user's information is secured.
7. Policy Change: The sentence contains information about if and how the users will
be informed when a policy changes.
8. Do Not Track: The sentence includes information about how and if the company
tracks information related to advertising.
9. International & Specific Audiences: The sentence includes information that targets
a specific group, such as Californians or Europeans.
10. Other: The sentence includes other information not related to the above-written
categories.
Wilson et al. (2016) acquired 115 randomly selected policies amounting to 266,713 words. In total, they collected 23,194 data practice annotations, which they used to train logistic regression, support vector machine, and hidden Markov models to classify segments of policies into the aforementioned classes.
The introduction of the first fine-grained annotated privacy policy dataset, OPP-115, enabled a broad range of NLP techniques to be implemented to comprehend and analyse privacy policies. The OPP-115 corpus has since been used to train a variety of models, for instance, models to extract opt-out choices from privacy policies (Sathyendra, Schaub, Wilson, & Sadeh, 2016), models to identify policies on websites and their compliance issues (Story et al., 2019), and models to classify privacy practices and answer non-factoid questions about privacy policies through the Polisis framework (Harkous et al., 2018).
Specifically, Polisis was the first automated framework for privacy policy analysis. Polisis uses a hierarchy of neural network classifiers with privacy-specific word embeddings and multi-label classification: a top-level model classifies a segment into the aforementioned categories, and lower-level CNN classifiers utilise the attributes of each category (Harkous et al., 2018, pg. 5). Additionally, Harkous et al. introduced PriBot within the Polisis framework, which used a carefully designed annotation ontology with broad coverage of personal information, enabling its use within advertising, analytics, legal requirements, marketing, etc.
Further corpora similar to the OPP-115 corpus have enabled extensive research on privacy practices at a broad level. The PrivacyQA corpus spans 35 mobile application privacy policies, consisting of 1,750 questions and expert-annotated answers for the privacy policy question answering task (Ravichander, Black, Wilson, Norton, & Sadeh, 2020). The 1,750 questions were crowdsourced through MechanicalTurk, and seven experts with legal training constructed answers to the Turkers' (MechanicalTurk users') questions to identify legally sound answers. The PolicyQA corpus (Ahmad et al., 2020) adopted the same approach, but with several key differences. PolicyQA is based on the OPP-115 policies, spanning 714 questions and 25,017 answers. Instead of sentence selection, it frames the task as reading comprehension, with domain experts performing both question generation and answer annotation. PolicyQA will be utilised and explored further in Section 4. Lebanoff and Liu (2018) created the first corpus for detecting the vagueness of a policy, using human-annotated vague words and sentences in privacy policies. Sathyendra, Wilson, Schaub, Zimmeck, and Sadeh (2017) showcased a dataset with a corresponding model to analyse, identify and categorise opt-out choices offered in privacy policies.
Further datasets without labels were released, such as the one by Zimmeck et al. (2019), which collected over 400k URLs to Android app privacy policy pages by crawling the Google Play store. Amos et al. (2020) focused on the evolution of privacy policies over time and presented an analysis and dataset from approximately 130,000 websites spanning over two decades. Moreover, Zaeem and Barber (2021) gathered around 100,000 privacy policies using company domains from DMOZ, a website which maintains categories of websites on the internet.
Finally, the PrivaSeer corpus of 1,005,380 English-language website privacy policies was introduced by Srinath et al. (2021); the number of unique websites represented in the corpus was about ten times larger than in the second-largest publicly available privacy policy dataset, created by Amos et al. (2021). It further surpassed the aggregate of unique websites represented in all other publicly available web privacy policy corpora combined (Srinath et al., 2021).
2.3 Privacy Policies in Post-GDPR Regulatory Landscape
"The introduction of the General Data Protection Regulation (GDPR) caused significant changes to the practice of digital companies holding personal data, as well as the way they communicate it to their clients" (Linden, Khandelwal, Harkous, & Fawaz, 2020, pg. 1). The research of Linden et al. (2020) revealed that under the transparency requirement of GDPR, privacy policies have become much more detailed, verbose and structured. Linden et al. suggested the necessity of a GDPR-augmented dataset in order to reflect and analyse the changes in privacy policies under GDPR.
Moreover, more companies started complying with the law in order to guard themselves against lawsuits. Degeling et al. (2019) searched for privacy policies on 6,357 websites and estimated a 4.9% increase in the number of websites having a policy after the introduction of GDPR.
Gallé, Christofi, and Elsahar (2019) evaluated the need for a GDPR-related dataset in order to unleash the potential of ML techniques for assessing privacy policies with respect to GDPR compliance. The authors claim that only the missing elements need to be annotated to augment the existing datasets. GDPR has introduced novel ways of handling privacy, with specific mandatory elements to be included in every policy, which are currently unaddressed by the existing datasets. Hence, this signifies the necessity for this research to derive a GDPR dataset.
Gallé et al. (2019) called out "four potential problems that the use of previous datasets can have when training models that would address the new requirement of that regulation" [pg. 11]. These problems include the effect of new elements, multi-linguality, domain shift in the type of companies, and domain shift due to GDPR adaptation.
1. The effect of new elements. To start with, GDPR requires companies to include contact details, as well as the assigned Data Protection Officer (DPO). Existing datasets do not provide such information, yet it could be addressed with, for instance, named entity recognition.
Moreover, GDPR facilitates specific provisions for "sensitive" data, which received little attention in pre-GDPR times. Furthermore, as described above, GDPR provides a set of specific rights. The existing OPP dataset includes only one generic label, "User Choice", lacking specifics on user rights, hence identifying a crucial gap in the existing datasets.
Next, GDPR strongly considers the risks of international transfers, especially those outside of the EU/EEA. Existing annotations only partially refer to data transfer, missing details on where the third parties operate. Furthermore, automated decision-making (Art. 22) is another GDPR aspect not covered in existing datasets.
4. Domain shift due to GDPR adaptation. GDPR has resulted in new terminology, structure and details in privacy policies, contributing to a wholesale shift in the language used and the distribution of the data. ML tools are rather sensitive to such changes, so introducing a GDPR-annotated dataset could benefit the research.
Other research has been done with regard to GDPR, specifically in easing the understanding of privacy policies post-GDPR. Truong, Sun, Lee, and Guo (2020) envisioned a personal data management platform designed around GDPR compliance through the use of blockchain. Tesfay, Hofmann, Nakamura, Kiyomoto, and Serna (2018) created PrivacyGuide, a classification tool for European privacy policies. The tool is built upon 45 annotated privacy policies and is significantly shallower than OPP-115; furthermore, it only covers 11 first-level categories. Torre et al. (2019) presented a UML representation of the GDPR as a step toward automated compliance checking, and Palmirani and Governatori (2018) introduced a framework for compliance checking through the modelling of legal documents.
3 Theoretical Framework
The following section provides the theoretical concepts relevant to the exploration of privacy policy natural language comprehension. Section 3.1 sets the legal foundation of the paper by discussing the key elements of GDPR and privacy policies, including the challenges and opportunities induced by the shift in the regulatory landscape under GDPR. It outlines the effects and novel structural changes under GDPR at both the individual and company level. Moreover, the section refers to the specific differences in the nature of privacy policies post-GDPR, such as terminology, verbosity, and new structural elements. Next, the thesis embarks on a technical theoretical overview comprised of four topics: Information Retrieval (3.2), Deep Neural Language Model architectures and Transformers (3.3), Question Answering (3.4), and transfer learning (3.5). Information Retrieval covers term weighting schemes and the inverted index, as well as Lucene as the foundation of the ElasticSearch search engine. Section 3.3 revolves around Deep Neural Language Modelling and Transformer architectures: it elaborates on vector semantics and discusses in detail the algorithms behind Transformers and, specifically, BERT with its variations, such as RoBERTa. Section 3.4 discusses Information Retrieval based Question Answering systems, the use of encoders such as BERT in Question Answering, and the theory behind Retrievers and Readers. Lastly, Section 3.5 discusses transfer learning and knowledge distillation, and the application of BERT (DistilBERT and TinyBERT) for performing transformer distillation to generate target task-specific knowledge.
3.1 GDPR and Privacy Policies
A privacy policy is a legal document that discloses the practices by which a party collects, uses, manages, processes and discloses personal information. The first notion of a privacy policy was introduced by the Council of Europe during a study of risks in relation to technology's influence on human rights (Murtezić, 2020). This resulted in Convention 108 of 28 January 1981, with the Council recommending the development of a policy to protect personal data. Personal data is thereby considered to be information that identifies a real person.
The General Data Protection Regulation (GDPR) came into effect on the 25th of May, 2018 (Trzaskowski & Sørensen, 2019). It was designed with the purpose of providing EU citizens with more control over their personal data, as well as simplifying the regulatory environment in which businesses navigate across all EU member states. The introduction of GDPR has shifted the way businesses process data and communicate their practices, including new structure, obligatory details and terminology in privacy policies.
GDPR has harmonised privacy regulations across the EU states. It obliges organisations to provide citizens with a privacy policy that is presented "in a concise, transparent, intelligible, and easily accessible form". The policy has to be written in clear and plain language and delivered in a timely manner. Moreover, GDPR requires companies to be transparent in their disclosure of any collection, processing, storage, or transfer of personal data (Trzaskowski & Sørensen, 2019).
GDPR gives EU citizens control over their data by granting the individual a number of specific rights. GDPR also states that the methods for exercising these rights should be disclosed clearly in the privacy policy. The rights include the right to be informed, the right of access, the right to rectification, the right to erasure ("right to be forgotten"), the right to restrict processing, the right to data portability, the right to object, and rights in relation to automated decision-making and profiling.
Furthermore, under GDPR, data controllers and processors face stricter requirements, such as data protection by design and by default (Art. 25) for increased security, the obligation to "implement appropriate technical and organizational measures", and the recording of all processing activities (Art. 30). Organisations are also held accountable for non-compliance, meaning stricter security measures and regulations for processing and managing data must be undertaken. According to Art. 82(2), "any controller involved in processing shall be liable for the damage caused by processing"; failure to comply may result in large fines for data breaches, amounting to up to 20 million euro or 4% of worldwide turnover for the preceding financial year.
Art. 5 states the foundational principles for processing personal data which companies must adhere to. They include the following principles:
1. Lawfulness, fairness and transparency – data processed "lawfully, fairly and in a transparent manner"
2. Purpose limitation – collection of data only for "specified, explicit and legitimate purposes"
3. Data minimisation – data "adequate, relevant and limited to what is necessary" for the purposes
4. Accuracy – data kept "accurate and, where necessary, kept up to date"
5. Storage limitation – data kept in a form which permits identification of data subjects for no longer than is necessary for the purposes
6. Integrity and confidentiality – processing that ensures "appropriate security of the personal data"
GDPR clearly distinguishes between sensitive and non-sensitive personal data. Sensitive data refers to information such as racial or ethnic origin, political opinions, religious or philosophical beliefs, health, trade union membership, genetic and biometric data, and sexual orientation. GDPR obliges companies to undertake more stringent protection rules and safeguards when processing such data.
Moreover, GDPR identifies countries outside the EEA that can be considered "adequate" or "non-adequate" in terms of their level of personal data protection. Transfers to "adequate" countries are permitted under GDPR, while "non-adequate" locations require appropriate safeguards to be implemented. GDPR therefore signifies the importance of stating the location of data storage in the privacy policies exposed to individuals, and provides the necessary list of safeguards for the protection of personal data to guarantee its safety.
3.2 Information Retrieval
Information Retrieval (IR) encompasses the retrieval of information in any form given a specific information need. A developed IR system is often referred to as a search engine. The user formulates a query for the retrieval system, which in turn outputs a set of ranked documents from a specific collection, as seen in Figure 1.
3.2.1 Term Weighting Schemes: TF-IDF and BM25
Computing a term weight for each word in a document allows scoring the match between a document and a query. The most commonly used term weighting schemes are TF-IDF and BM25.
TF-IDF computes a sparse model which captures word meaning by calculating term frequencies (tf) and inverse document frequencies (idf). tf estimates the frequency of a word t in the document d. A popular approach is to dampen the term frequency to mitigate skewness by taking log10, which can be formulated as follows:

$$tf_{t,d} = \begin{cases} 1 + \log_{10}\,\text{count}(t,d), & \text{if count}(t,d) > 0 \\ 0, & \text{otherwise} \end{cases}$$

The idf is calculated as $idf_t = \log_{10}\left(\frac{N}{df_t}\right)$, where N is the total number of documents and $df_t$ is the number of documents which contain the term t. It assigns higher weight to words present in only a few documents. For a term to get a high tf-idf weight, it should have a high term frequency (tf) in the document (the local parameter) and a low document frequency across all documents (the global parameter).
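To make the scheme concrete, the following is a minimal Python sketch of the tf-idf weighting defined above; the toy documents and the simple tokenisation are hypothetical, and a real system would use a proper analyser:

```python
import math

def tf_idf(term, doc, docs):
    """tf-idf weight of `term` in `doc`, with log-scaled tf as above:
    tf = 1 + log10(count) if count > 0 else 0, idf = log10(N / df)."""
    count = doc.count(term)
    tf = 1 + math.log10(count) if count > 0 else 0.0
    df = sum(1 for d in docs if term in d)           # document frequency
    idf = math.log10(len(docs) / df) if df else 0.0
    return tf * idf

# Toy collection of tokenised "policy" passages (invented content)
docs = [["we", "collect", "your", "data"],
        ["we", "share", "data", "with", "partners"],
        ["cookies", "track", "your", "activity"]]
print(tf_idf("cookies", docs[2], docs))  # high weight: term is rare across docs
```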
BM25 is a more powerful and complex variant of term weighting, also referred to as Okapi BM25 (Robertson et al., 1995). BM25 introduces a parameter k, which adapts the balance between tf and idf, and a parameter b, which controls the significance of normalising the document length. Given a query q, the score of a document d can be formulated as follows:

$$\text{score}(d, q) = \sum_{t \in q} \log\left(\frac{N}{df_t}\right) \cdot \frac{tf_{t,d}}{k\left(1 - b + b\,\frac{|d|}{|d_{avg}|}\right) + tf_{t,d}}$$

where $|d_{avg}|$ represents the length of an average document. If k = 0, BM25 does not use term frequency, while a large value of k approaches raw term frequency. b spans from 0 (no document length scaling) to 1 (full scaling by length).
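A sketch of the BM25 scoring function above, under the same assumptions as the tf-idf sketch (pre-tokenised documents, illustrative data only):

```python
import math

def bm25_score(query, doc, docs, k=1.2, b=0.75):
    """Okapi BM25 score of `doc` for `query`, following the formula above.
    k balances tf against idf; b controls document-length normalisation."""
    N = len(docs)
    d_avg = sum(len(d) for d in docs) / N
    score = 0.0
    for t in query:
        df = sum(1 for d in docs if t in d)
        if df == 0:
            continue  # term never occurs: contributes nothing
        tf = doc.count(t)
        idf = math.log10(N / df)
        norm = k * (1 - b + b * len(doc) / d_avg)  # length normalisation
        score += idf * tf / (norm + tf)
    return score
```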
3.2.2 Inverted Index
An inverted index is the most commonly used full-text search structure; search is performed by matching a term to the documents containing it. The inverted index consists of a list of all unique words present in the collection and, for each word, a list of the documents in which it appears. Given a particular query, an inverted index can provide the list of documents containing each query term, as shown in Figure 2. It is based on a dictionary whose keys are tokens and whose values point to the IDs of the documents containing the given term.
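A dictionary-based inverted index of this kind can be sketched in a few lines of Python (document IDs and tokens are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            index[token].add(doc_id)
    return index

index = build_inverted_index([["we", "collect", "data"],
                              ["we", "use", "cookies"]])
print(index["we"])       # {0, 1}
print(index["cookies"])  # {1}
```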
3.2.3 Lucene
Lucene is a full-text search library built in Java. It takes a search query as input and returns a set of documents ranked by their relevance to the query. Lucene offers an array of configurable forms of search, such as boolean queries, term queries, and range queries. Lucene indexes terms, and each term is comprised of a token and a field name, thus searches are always field-specific (Karim, 2017). Once an index segment is committed, it becomes immutable, making Lucene append-only.
3.2.4 ElasticSearch
In ElasticSearch, a document is the unit of search. ElasticSearch stores entire documents and indexes the content of each document. An index consists of one or more documents, and a document consists of one or more fields. Every field of its distributed real-time 'document store' is searchable when indexed (Gormley, 2015). ElasticSearch is able to provide fast response times due to the structure of Lucene and its use of the inverted index. When a query is executed, the inverted index table, as showcased in Figure 2, is looked up to find candidate documents containing the query terms. ElasticSearch utilises indices and shards to separate data, which enables distribution and allows for high horizontal scalability.
• Index (noun): the location where documents are stored so that they can be retrieved and queried.
• Mapping: contains information on the type of data that each field holds (Berman, 2019).
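As an illustration, a minimal interaction with ElasticSearch might look as follows. This is a sketch, not part of the system built in this thesis: it assumes a local node at localhost:9200, version 8.x of the official Python client, and a hypothetical "policies" index:

```python
from elasticsearch import Elasticsearch

# Assumes a running local Elasticsearch node and the `elasticsearch` client
es = Elasticsearch("http://localhost:9200")

# Index (verb) a document into the hypothetical "policies" index (noun)
es.index(index="policies", id=1, document={
    "company": "ExampleCorp",
    "text": "We collect your email address to provide the service.",
})

# A full-text match query; the inverted index is consulted for the "text" field
hits = es.search(index="policies", query={"match": {"text": "email"}})
print(hits["hits"]["total"])
```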
3.3 Deep Neural Language Modelling
3.3.1 Vector Semantics
Vector semantics represents a word as a point in a multidimensional vector space constructed from the distributions of the other words in its proximity. Vector semantics builds on the distributional hypothesis³ about the similarity of word meanings; the embeddings are learnt from word distributions in passages of text, such as privacy policies. Embeddings, a key element of representation learning in NLP, can be static or contextualised, as in BERT, which will be explained in more detail in Section 3.3. Embeddings are short dense vectors which generally perform better than, for instance, n-gram generated sparse vectors, as the smaller parameter space helps prevent overfitting and improves generalisation.
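As a toy illustration of vector semantics, the cosine similarity between dense vectors captures distributional similarity. The 4-dimensional embeddings below are invented for illustration; real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: high when two embedding vectors point the same way."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings for three words (values are made up)
erase  = np.array([0.9, 0.1, 0.3, 0.7])
delete = np.array([0.8, 0.2, 0.4, 0.6])
cookie = np.array([0.1, 0.9, 0.8, 0.1])

print(cosine(erase, delete))  # near 1: similar distributional meaning
print(cosine(erase, cookie))  # lower: the words occur in dissimilar contexts
```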
"Language is a sequence that unfolds in time" (Jurafsky & Martin, 2002, pg. 172). Thus, tools that do not consider the dimension of time and allow access to all inputs at the same time fail to capture the temporal essence of language. Question Answering, like many other NLP challenges, requires access to distant context in text, which is possible with sequential and transformer architectures. These two architectures overcome the constraints of the Markov assumption⁴ and manage to use greater contexts, assigning the next token a conditional probability. Yet, although sequential architectures are able to retain past information, they cannot address what the model should pay attention to, nor how to train the model efficiently in that regard. This is where Transformers and BERT prove why they have revolutionised
NLP, paving the way to becoming the key building block of text analytics.
³ "The distributional hypothesis is the link between similarity in how words are distributed and similarity in what they mean" (Jurafsky & Martin, 2002, p. 96).
⁴ The Markov assumption states that the next state depends only upon the current state (Jurafsky & Martin, 2002).
3.3.2 Transformers
Transformers are comprised of standard network components, but additionally contain self-attention layers, which facilitate grasping information from a vast textual context. According to Vaswani et al. (2017), Transformers rely entirely on self-attention when computing representations of input and output, avoiding convolutional or recurrent neural nets. Hence, the need to transfer information through recurrent connections is removed. A Transformer self-attention layer maps a sequence of input vectors $(x_1, \dots, x_n)$ to output vectors $(y_1, \dots, y_n)$, and each output can be computed independently of the others. The Transformer is thus able to compare one element to the others by its relevance in the provided context, calculated with a score: $\text{score}(x_i, x_j) = x_i \cdot x_j$. The greater the result, the more meaningful the similarity between the vectors. The scores are then normalised with a softmax⁵ function to compute the weights $\alpha_{ij}$. The output value is then computed as a summation of the inputs seen so far by the model, weighted by the relative $\alpha$ values.
$$y_i = \sum_{j \le i} \alpha_{ij} x_j$$
Attention is comprised of the following steps: (1) comparing the relevant elements, (2) normalising to obtain a probability distribution, and (3) generating the weighted sum of the distribution (Jurafsky & Martin, 2002).
A standard Transformer layer includes two sublayers: multi-head attention (MHA) and a fully connected feed-forward network (FFN).

$$A = \frac{QK^T}{\sqrt{d_k}}, \qquad \text{Attention}(Q, K, V) = \text{softmax}(A)V \quad (1)$$

⁵ Softmax is an activation function that generalises the logistic function to multiple dimensions.
Here, $d_k$ denotes the scaling factor given by the dimension of the keys, and A is the attention matrix computed from the dot product of Q and K. The final attention function is a weighted summation of the values V, where the weights are calculated by applying softmax to each column of matrix A. Attention matrices generally capture significant linguistic knowledge, which makes them crucial in knowledge distillation. Concatenating the attention heads yields multi-head attention:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W$$

where $\text{head}_i$ denotes the i-th attention head, h the number of attention heads, and W the linear transformation matrix. The position-wise FFN differs from a typical feed-forward network in that it applies its dense layers to each position in the sequence separately, hence the name position-wise.
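A minimal NumPy sketch of the scaled dot-product attention of Equation (1), with a randomly generated toy sequence standing in for projected token embeddings (a real layer would first compute Q, K, V via learnt linear projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)   # attention matrix A
    return softmax(A) @ V        # weighted sum of the values

# One toy sequence of 3 tokens with d_k = 4
x = np.random.rand(3, 4)
print(attention(x, x, x).shape)  # (3, 4): one output vector per token
```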
3.3.4 Fine Tuning BERT with MLM and NSP
BERT is well known for its two powerful training approaches: masked language modelling (MLM) and next sentence prediction (NSP). Oftentimes it is possible to take a BERT model and apply it directly to the task at hand; however, it is common to fine-tune it first. MLM fine-tuning allows BERT to grasp the specifics of the language of a given domain, which is highly important in such a unique linguistic context as the field of privacy policies (Aroca-Ouellette & Rudzicz, 2020).

MLM works by giving BERT an input sentence and tuning the weights so as to reproduce the same sentence in the output. However, before BERT receives the input sentence, some of its tokens are masked.
First, the text is tokenised, which results in three tensors: input_ids, token_type_ids, and attention_mask. input_ids is the most crucial tensor, providing the tokenised representation of the sentence, which will be further adjusted. Next, a labels tensor is created to calculate the loss and optimise the model during training; the labels tensor is a copy of the input_ids tensor.

Furthermore, a set of tokens inside input_ids is masked randomly. BERT generally masks each token with a 15% probability while pre-training. The input_ids and labels tensors are processed through the model, and the loss between the two is estimated. The loss is used to obtain the gradients that adjust the weights of the model.
The loss is defined by the difference between the predicted probability distributions for each token and the true labels. The (up to) 512 input tokens (the maximum BERT can process) each result in a final output vector, known as the logits, whose length equals the size of the model's vocabulary. The predicted token ids are obtained from the logits by applying softmax and argmax⁷ (Aroca-Ouellette & Rudzicz, 2020).
⁷ Argmax is a function that returns the argument at which a function attains its maximum value.
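A minimal sketch of a single MLM step with the Hugging Face transformers library, assuming the public bert-base-uncased checkpoint; for illustration, one fixed token is masked instead of the random 15% used in pre-training:

```python
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "We may share your personal data with third parties."
inputs = tok(text, return_tensors="pt")  # input_ids, token_type_ids, attention_mask
labels = inputs["input_ids"].clone()     # the labels tensor is a copy of input_ids

inputs["input_ids"][0, 4] = tok.mask_token_id             # mask one token
labels[inputs["input_ids"] != tok.mask_token_id] = -100   # score only masked position

out = model(**inputs, labels=labels)   # cross-entropy loss at the masked position
pred_id = int(out.logits[0, 4].argmax())  # argmax over the vocabulary logits
print(out.loss.item(), tok.decode([pred_id]))
```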
NSP works by providing BERT with two sentences, A and B. BERT predicts whether sentence B comes after sentence A by outputting IsNextSentence or NotNextSentence. In this way, it learns long-term dependencies between sentences.

To start with, tokenisation is performed. The two sentences are unified into the same tensor set, with a [SEP] token placed between them. The token_type_ids tensor includes the segment ids, which identify which sentence the related tokens belong to: sentence A is identified as 0, sentence B as 1.

Furthermore, a classification label is created: a new labels tensor which identifies whether one sentence follows the other. IsNextSentence is represented by 0, NotNextSentence by 1.

Lastly, the loss is estimated. The inputs and labels are processed by the model, which outputs the loss tensor that is optimised during training. The model may also simply be used for inference without training, in which case no labels tensor is used (Aroca-Ouellette & Rudzicz, 2020).
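The NSP objective can be sketched analogously, again assuming the public bert-base-uncased checkpoint; the sentence pair is invented for illustration:

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sent_a = "We collect your email address."               # segment id 0
sent_b = "It is used to send you our newsletter."       # segment id 1, after [SEP]
inputs = tok(sent_a, sent_b, return_tensors="pt")

# label 0 = IsNextSentence, 1 = NotNextSentence
out = model(**inputs, labels=torch.LongTensor([0]))
print(out.loss.item(), int(out.logits.argmax()))  # 0 if B is judged to follow A
```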
3.3.5 BERT Variations
BERT-Base is the baseline BERT model with 12 encoder layers. BERT has numerous model configurations tailored toward various learning task needs. The Robustly Optimized BERT Pretraining Approach (RoBERTa) is a variant of BERT created with the goal of improving the training phase by training the model longer and on more data. RoBERTa builds upon the language masking of BERT, learning to predict purposefully hidden segments of unlabelled text. As opposed to the static masking in BERT-Base, RoBERTa uses dynamic masking, which allows the model to capture varied masking patterns in a given sequence and simultaneously decreases the need for more training instances. It also adjusts BERT's hyperparameters, for instance by removing the next-sentence pretraining objective, which enhances downstream performance. Moreover, RoBERTa uses larger mini-batches and learning rates (Facebook, 2019).
As discussed previously in Section 2, BERT language models are also often trained to learn the nuances of specific domain-related language to achieve higher performance. PrivBERT is an example of such a domain-specific BERT: a privacy policy domain language model trained on roughly one million privacy policies. It was initialised from a pre-trained RoBERTa-base (12 layers, 768 hidden size, 12 attention heads, 110M parameters) and trained to learn the nuances of privacy-policy-specific language to achieve higher accuracy. Training used dynamic masked language modelling for 50,000 steps with a batch size of 512. Most parameters were kept similar to RoBERTa, while the peak learning rate was set to 8e-5. PrivBERT was evaluated on both classification and Question Answering tasks, achieving state-of-the-art results (Srinath et al., 2021).
3.4 Question Answering (QA)
Question Answering is a field of NLP focused on developing systems that can automatically answer users' questions. The system must translate user inquiries into a machine-readable representation of the question in order to provide a relevant answer, which is done by mapping the semantics of the natural language sentences.
3.4.1 Types of QA Systems
There are several ways of classifying QA systems. The main paradigms are IR-based and knowledge-based QA systems. Moreover, QA can be distinguished between open- and closed-domain QA. Lastly, QA can be categorised by the type of questions asked.
To start with, QA systems can be distinguished between two key paradigms: IR-based and knowledge-based QA systems.
1. IR-based QA systems work by finding text segments in a document collection that answer the user's question. IR-based QA consists of question processing, passage retrieval, and answer processing, and will be explained in more detail in Section 3.4.2.
2. Knowledge-based QA systems build a semantic representation of the query and answer it against a structured knowledge base of facts.
Next, QA systems can be distinguished between open- and closed-domain systems.
1. Open-domain QA systems are not limited to any particular field of knowledge. They generally depend on the world wide web and a universal ontology. They use general vocabulary and thus do not require any domain-specific knowledge to answer the question.
2. Closed-domain QA systems work within a specific area of knowledge and thus consist of a restricted repository of domain-related questions, the answers to which are retrieved from a collection of domain-specific documents. The quality of answers is therefore considered to be high. Closed-domain QA uses specific ontology and terminology (Reddy & Madhavi, 2017).
Furthermore, QA systems can also be categorised by the type of questions asked, such as factoid and non-factoid. Factoid questions can generally be described as questions starting with what, which, when, and who. They generally need a short answer (a word or a single sentence) and are simple to answer; the answer type is normally a named entity. Non-factoid questions, on the other hand, are open-ended questions, such as opinions or explanations, and require complex answers and passage-level texts. A distinction can also be drawn between five types of questions: (1) List, (2) Confirmation, (3) Causal, (4) Hypothetical, and (5) Complex (Reddy & Madhavi, 2017).
1. List questions expect a list of named entities as the answer. Generally, they can be answered with rather high accuracy, and techniques used for factoid questions can easily be adjusted to list-type questions.
2. Confirmation questions require a yes or no answer inferred from the available context.
3. Causal questions require more descriptive answers, as they ask for reasons, explanations, and elaborations. Their answers are generally much longer and can be a set of sentences or even a paragraph.
4. Hypothetical questions are questions associated with unspecific answers. Their accuracy depends on the context and the user and is often quite low.
5. Complex questions require inference and synthesis of context. They need deeper neural architectures than, for instance, factoid and list-type questions.
3.4.2 IR-based QA and BERT
Information Retrieval (IR) based QA systems answer a question posed by the user by looking for passages of relevant text. Traditional IR algorithms, such as tf-idf, face the vocabulary mismatch problem: they can only perform given a precise overlap of words between the query and the passage of text. To mitigate such shortcomings, an approach that can handle synonyms and paraphrasing is needed. Hence, dense embeddings and dense passage retrievers can be used as opposed to sparse word-count vectors. An example of a generalised QA system architecture can be seen in Figure 4.

Bi-encoders, such as BERT, make use of one encoder to take in the query ($\text{BERT}_Q$) and one to encode the text in a document ($\text{BERT}_D$). The score is then obtained as the dot
product of the two vectors (Figure 5). The query and the document are represented by their [CLS] tokens, as seen in the equations below.

$$h_Q = \text{BERT}_Q(q)_{[CLS]}, \qquad h_D = \text{BERT}_D(d)_{[CLS]}, \qquad \text{score}(d, q) = h_Q \cdot h_D \quad (3)$$
Dense vectors in the field of IR are still an ongoing research topic, particularly with regard to fine-tuning. Documents are often too long for the encoder to process (such as extra-lengthy post-GDPR privacy policies), thus a common approach is to split them up into paragraphs.
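A sketch of the bi-encoder scoring of Equation (3), using the [CLS] hidden states of two BERT encoders. The checkpoint and the texts are assumptions; without retrieval-specific fine-tuning, the raw dot-product scores are not yet meaningful:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_q = AutoModel.from_pretrained("bert-base-uncased")  # BERT_Q (query encoder)
bert_d = AutoModel.from_pretrained("bert-base-uncased")  # BERT_D (document encoder)

def cls_embedding(model, text):
    """h = BERT(text)[CLS]: the hidden state of the first ([CLS]) token."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt", truncation=True))
    return out.last_hidden_state[0, 0]

q = cls_embedding(bert_q, "How long is my data retained?")
d = cls_embedding(bert_d, "We store personal data for 12 months after closure.")
print(torch.dot(q, d).item())  # score(d, q) = h_Q . h_D
```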
3.4.3 Reading Comprehension and SQuAD
The Reader is commonly trained by crafting a reading comprehension dataset that teaches it to predict the passage span containing the answer. Such datasets are comprised of triples: passage, question, and answer. As the data already contains the passage, IR becomes irrelevant.
The most prominent example of such a dataset is the Stanford Question Answering Dataset (SQuAD), which was manually created from question-answer pairs over Wikipedia articles (Rajpurkar, Zhang, Lopyrev, & Liang, 2016). In its first version, SQuAD 1.1, every question is answerable from the corresponding context passage. SQuAD 2.0 also introduces no-answer annotations, resulting in more than 150,000 QA pairs (Rajpurkar, Jia, & Liang, 2018). SQuAD is one of the most widely used open-domain QA datasets, helping models learn the general foundation of the QA task. NLQuAD is a dataset for non-factoid question answering; due to the nature of non-factoid questions, it requires more descriptive answers and longer answer spans. It is comprised of 31,000 non-factoid questions with long answers from 13,000 BBC news articles. The questions are interrogative and inquire about opinions and reasoning, thus resulting in complex open-ended answers (Soleimani, Monz, & Worring, 2021).
3.4.4 Retriever-Reader
The Reader takes the passage as an input and gives an answer. Answer extraction is based on span labelling: it works by finding a contiguous string corresponding to the answer. In reading comprehension tasks, neural models take in the question q of n tokens q_1, ..., q_n, as well as the passage p of m tokens p_1, ..., p_m. The neural architecture aims to calculate the
probability that any given span constitutes the answer: P(a|q, p) (Jurafsky & Martin, 2002).
Given that a possible answer span starts at a_s and ends at a_e, the probability can be simplified as P(a|q, p) = P_start(a_s|q, p) P_end(a_e|q, p). Therefore, two probabilities, for the start and the end, are estimated for every token p_i.
The baseline for reading comprehension tasks is feeding the question and passage into an encoder such as BERT, with the two strings separated by the token [SEP]. The outcome is an encoded token embedding for each passage token p_i (Figure 6). The first sequence corresponds to the question, and the second to the passage. A linear layer is added during fine-tuning to estimate the start and end span positions. The start-of-span embedding can be expressed as a vector S, and the end of the span as a vector E; both are learnt by the model during fine-tuning. By iterating through all tokens p_i, calculating the dot product of S and p'_i, and applying a softmax for normalisation, we obtain the following formulas for the start and end probabilities of the span (Jurafsky & Martin, 2002).
$$P_{\mathrm{start}_i} = \frac{\exp(S \cdot p'_i)}{\sum_j \exp(S \cdot p'_j)} \qquad P_{\mathrm{end}_i} = \frac{\exp(E \cdot p'_i)}{\sum_j \exp(E \cdot p'_j)} \quad (4)$$
A candidate span ranges from position i to position j, and its score is computed as S · p'_i + E · p'_j. The span with j ≥ i that maximises this score is taken by the model as its final prediction.
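As an illustration, the following is a minimal sketch of extracting the highest-scoring span from a fine-tuned extractive QA model; the model name is assumed to be one of the publicly available SQuAD-tuned checkpoints on HuggingFace's Model Hub, and the greedy start-then-end search is a simplification of the full argmax over all pairs with j ≥ i:

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "deepset/roberta-base-squad2"  # assumed publicly available checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "When did GDPR come into effect?"
passage = "GDPR came into effect on May 25, 2018, replacing the 1995 directive."
inputs = tokenizer(question, passage, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Greedily pick the best start, then the best end at or after it (j >= i)
start = torch.argmax(outputs.start_logits)
end = start + torch.argmax(outputs.end_logits[0, start:])
answer = tokenizer.decode(inputs["input_ids"][0, start : end + 1])
print(answer)  # expected: "May 25, 2018" or a close variant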
Evaluating the performance of QA can be rather challenging, as a correct answer can be phrased in a variety of ways. To assess performance, evaluation metrics able to mimic human judgement should be implemented. Generally, there are two modes of evaluating QA: closed domain and open domain (Su et al., 2016).
Separate metrics exist for evaluating the Retriever and Reader parts of a QA system. To evaluate the performance of a Retriever, it is crucial to establish whether the document containing the correct answer span is among the retrieved candidates. The number of retrieved candidates is set by the parameter top_k. There are two well-known ways of evaluating a Retriever: (1) Recall and (2) Mean Reciprocal Rank (Shao, Guo, Chen, & Hao, 2019).
1. Recall estimates how often the right document is contained among the selected candidates. For a single query, the outcome is binary: either the document is among the retrieved candidates or it is not. Calculated over the whole dataset, the output ranges from zero to one, specifically the percentage of queries for which the relevant document was returned.
2. Mean reciprocal rank (MRR) is another metric used to measure the performance of the Retriever. It takes into consideration the ranking of the answer, meaning its position: formally, MRR is the average over all queries of 1/rank, where rank is the position of the first relevant document. An MRR of 0 means there are no matching responses, while the top score of 1 means the QA system selected the right document as the top candidate for every query. Hence, unlike recall, MRR accounts for where in the ranked list the relevant document appears, not just whether it appears.
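A minimal sketch of both metrics, assuming each query has a single gold document id (all names are illustrative):

def retriever_metrics(ranked_results, gold_ids, top_k=10):
    """Recall@top_k and MRR over a set of queries.

    ranked_results: one ranked list of candidate document ids per query
    gold_ids:       the id of the document holding the answer, per query
    """
    hits = 0
    reciprocal_ranks = []
    for candidates, gold in zip(ranked_results, gold_ids):
        top = candidates[:top_k]
        if gold in top:
            hits += 1
            reciprocal_ranks.append(1.0 / (top.index(gold) + 1))
        else:
            reciprocal_ranks.append(0.0)
    recall = hits / len(gold_ids)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return recall, mrr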
When measuring the performance of the Reader, the degree of correctness of the selected answer span is evaluated.
1. Exact match (EM) estimates the fraction of questions for which the answer predicted by the Reader is completely identical to the right answer. For instance, if for the annotated pair “When did GDPR come into effect? – GDPR came into effect on May 25, 2018”, the Reader predicts the answer “On May 25, 2018, GDPR came into effect”, EM would be zero, as the match is not completely identical.
2. F1 score measures the word overlap between the predicted answer and the ground truth, computed from precision and recall:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \quad (5)$$
3.5 Transfer Learning and Knowledge Distillation
“Transfer learning aims to extract the knowledge from one or more source tasks and apply the knowledge to a target task” (Han, Kamber, & Pei, 2012, p. 435). In transfer learning, a pre-trained model is used as the starting point, and a related task is then optimised by re-purposing the existing one. As it is very challenging to train a model from scratch with limited data, transfer learning offers an opportunity to achieve high performance by first learning from a general-knowledge dataset and then transferring that knowledge to a smaller domain-specific dataset for a related problem. Hence, the target task requires less training data and less training time. Traditional learning is based on the assumption that the train and test datasets are obtained from the same feature space
and data distribution; in case of any shift in feature space or distribution, models are expected to be redeveloped. Transfer learning, however, facilitates learning a different task, so the distribution and data domain can differ, as seen in Figure 7. The most popular type is instance-based transfer learning, which re-weights some of the data in the source task to better support learning the target task. In the field of NLP, domain adaptation has been a key area positively affected by transfer learning (including syntactic parsing, named entity recognition, and QA) (Min, Seo, & Hajishirzi, 2017).
A technique for transferring knowledge is distilling it from a bigger teacher network T to a student S; the student learns to mimic the teacher. Let f^T and f^S denote the behaviour functions of the two networks: they transform network inputs into informative representations, i.e., the output of a given layer in the network. In the case of the distillation of Transformers, the outputs of the MHA, the FFN, or the attention matrix A can serve as behaviour functions. Knowledge distillation can then be described by the following objective function:
$$L_{KD} = \sum_{x \in X} L\left(f^S(x), f^T(x)\right)$$
Here L is the loss function estimating the gap between the teacher and the student, x denotes an input, and X is the training set.
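As a concrete, if simplified, instantiation, the per-example loss L is often a KL divergence between temperature-softened teacher and student output distributions. Note that this logit-based variant is simpler than TinyBERT's layer-wise Transformer distillation discussed below:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # t**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t**2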
Pre-trained language models, such as the BERT models discussed above, are rather computationally expensive, posing an inference challenge. Transformer distillation allows the model size to be reduced while largely retaining accuracy. The knowledge encoded in the teacher BERT can be transferred to a smaller student, TinyBERT. TinyBERT performs Transformer distillation at both the pre-training and the task-specific learning stages (see Figure 8). Hence, TinyBERT is capable of grasping both the general source-domain knowledge and the target task-specific knowledge (Jiao et al., 2020).
4 Methodology
“The purpose of methodology is to enable researchers to plan and examine critically the logic, composition, and protocols of research methods; to evaluate the performance of individual techniques; and to estimate the likelihood of particular research designs to contribute to knowledge” (Krippendorff, 2018, p. 21). The Research Onion framework aids in conceptualising the key pillars of the research: critical realism is employed as the research philosophy, the research approach is abductive, the research undertakes an experimental strategy, the methodological choice is mixed-method, and the time horizon is longitudinal. Next, the section presents
the reader with the techniques used throughout the research, guided by the CRISP-DM framework. First, the business understanding is formulated, revisiting the user's need to comprehend lengthy, legally sophisticated privacy policies in a time-efficient manner. Two datasets are explored: (1) the existing broad PolicyQA dataset with QA annotations from pre-GDPR times, and (2) the manually labelled GDPRQA, based on recent post-GDPR policies varying in the size, industry, and location of the organisations.
Next, the reader is guided through the model development. The configuration files associated with each model, as well as their parameters, are discussed. The modelling iterations
are elaborated on, starting with (1) general knowledge acquisition from open domain QA
datasets — the factoid SQuAD and non-factoid NLQuAD, (2) knowledge transfer onto the
privacy policy domain, and (3) optimisation of the best performing models. Furthermore,
a production environment simulation was developed using the best optimised models and
ElasticSearch as the IR search engine. Evaluation techniques are discussed. Lastly, a
deployment architecture is proposed to conceptualise the GDPRQA Assistant.
4.1 Research Philosophy
Critical realism is leveraged as the philosophical leitmotif of the thesis. Critical realism distinguishes between the real and the observable aspects of research and views scientific phenomena through their causal mechanisms: events are seen as products of emergent mechanisms driven by causation, agency, structure, and relations.
The types of research approaches include induction, deduction and abduction. The induc-
tive approach starts with the collection and analysis of data, followed by the development
of a theory based on the findings. Deduction first crafts a theory and hypothesis which
are further tested. Abduction combines the two in an iterative manner. The topic is thor-
oughly explored and the insights contribute to the conceptual framework which is further
improved with more data (Saunders et al., 2019).
The abductive approach is taken: the privacy policy data, as well as the annotated question, answer, context triples, are explored; the theory is incorporated into the process and expanded based on the findings, which can be further tested with more data.
Data-related techniques involve mono-methods, where strictly one type of data is used (quantitative or qualitative); multi-methods, with more than one data type, each dealt with separately; and mixed-methods, with multiple data types and combined research applied to them (Saunders et al., 2019). This thesis works with textual data, which is discerned as qualitative, as privacy policy texts are collected and analysed. However, the tools and techniques used quantify the textual data, making it quantitative. Hence, mixed-method research is undertaken.
The research question, goals, resources and existing knowledge impact the choice of the
research strategy. The possible variations of the strategy include experiment, survey,
archival research, case study, ethnography, action research, grounded theory, and narrative
inquiry. The experiment is a common strategy in scientific as well as social science research (Saunders et al., 2019). Evaluating the performance of the Question Answering system in a domain-specific context falls under the experimental strategy of the research.
In terms of its time horizon, research can be separated into cross-sectional and longitudinal studies. The former deals with a “snapshot” in time, while the latter explores the evolution of data over time (Saunders et al., 2019). The thesis deals with both pre- and post-GDPR data. The collected policies come from different time stamps, yet no time-series analyses are performed beyond the comparison between pre- and post-GDPR policies. Hence, the study is a mix, with more focus on the longitudinal time horizon.
Techniques and procedures are elaborated on in detail under the CRISP-DM framework presented in the next section and iteratively considered throughout the thesis. The
methodology section will end with a discussion on the reliability and validity of work
(Section 4.9), critically assessing the techniques and procedures applied.
4.2 CRISP-DM Methodology
The methodology follows the CRISP-DM framework, displayed in Figure 10, a widely accepted framework in data science projects. CRISP-DM stands for CRoss-Industry Standard Process for Data Mining and is applicable independently of the industry or tool involved. The rationale for choosing this framework is that it accommodates a model able to encompass both technical and business functions, thus harnessing NLP to derive business insights that address the problem of machine comprehension of GDPR privacy policies. CRISP-DM provides the structure and flexibility needed to achieve high-performing data mining models. It consists of several phases which guide the roadmap of the thesis project, namely: business understanding, data understanding, data preparation, modelling, evaluation, and deployment.
4.3 Business Understanding
This stage is focused on building the essential foundation by analysing the objectives and requirements behind the business value of privacy policy reading comprehension.
To determine the business objectives, the thesis evaluates the users' needs and success criteria. Users need to be able to comprehend lengthy, verbose, often linguistically complex privacy policies with ease. Moreover, they need to be fully aware of the practices applied to their personal data, and of possible accompanying risks, in a time-saving manner. Furthermore, they should be able to easily obtain the details most relevant to them personally. Hence, a QA system which accurately extracts relevant GDPR data privacy practices and reflects the current GDPR landscape would successfully satisfy these needs.
To assess the situation, the available resources are evaluated. In terms of dataset availability, PolicyQA, an existing QA dataset derived from the standard OPP-115 annotated privacy policy dataset, is rather comprehensive and vast. However, due to its lack of GDPR-specific questions, it is deemed necessary to derive a GDPR-relevant dataset and use it to augment PolicyQA. Yet manually creating a dataset yields a limited number of training examples, which would constrain the training of the neural language model. Transfer learning and data augmentation help overcome this challenge: first fine-tuning the QA model on SQuAD or NLQuAD, the open-domain QA datasets, lets the model learn general question answering patterns, and then transfer learning on PolicyQA and GDPRQA adapts that knowledge to the privacy-specific domain. Computational resources could be limited given the large textual volumes to be processed, yet the maximum GPU capacity of the Google Colab Pro+ subscription will be used, as elaborated in the modelling setup in Section 4.6.
4.4 Data Understanding
The data understanding section explores the broad existing PolicyQA secondary dataset, which covers a wide range of pre-GDPR privacy topics relevant to users, as well as the created GDPR-specific GDPRQA dataset, which contains question-answer annotations of policies collected after the introduction of GDPR. Data understanding includes the systematic collection and exploration of privacy policies to achieve a high-quality GDPR-relevant QA dataset.
PolicyQA is a publicly available dataset (Ahmad et al., 2020) which consists of question, answer, context triples based on the classified OPP-115 dataset. The policies used in PolicyQA come from the OPP-115 corpus scraped in 2015-2016, hence before the adoption of GDPR. The average length of a single policy in the dataset is 2,319 words, which would take roughly 14 minutes to read. OPP-115 presents 23,000 data practices with 103,000 annotated text spans, each of which is linked to a segment of a policy with character-level start and end positions.
PolicyQA annotations are created from these annotated spans and segments; segments classified as “Other” and “Unspecified” are excluded. PolicyQA is a vast dataset with 25,017 answer span annotations labelled by two domain experts. It contains 714 individual questions phrased in a generic manner applicable to similar practice categories. The distribution of n-gram prefixes of the questions can be observed in Figure 11. From the distribution, it is seen that the majority of questions are non-factoid, yet various types of questions occur.
Due to the nested JSON structure of the SQuAD format, prior to exploration the dataset was converted to a dataframe by unwrapping the path to the deepest level in the JSON file — ['data','paragraphs','qas','answers']. The different levels of the JSON file were parsed and then combined into a single dataframe (Listing 1).
The answers in the dataset have a rather short span, 13.5 words on average, which allows readers to zoom in on the necessary part of the text and grasp the information quickly. The average length of a question is 11.2 words, while the average length of a passage is 116 words.
{
  "data": [
    {
      "paragraphs": [
        {
          "qas": [
            {
              "question": "Can I edit or change the data that I have provided to you?",
              "id": 311216,
              "answers": [
                {
                  "answer_id": 318379,
                  "document_id": 525657,
                  "question_id": 311216,
                  "text": "You have the right to request rectification of inaccurate personal data concerning yourself, and to complete incomplete data.",
                  "answer_start": 505,
                  "answer_end": 630,
                  "answer_category": null
                }
              ],
              "is_impossible": false
            }
          ],
          "context": "Right of access \nYou have the right to obtain confirmation ...",
          "document_id": 525657
        }
      ]
    }
  ]
}
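A minimal sketch of the flattening step described above, assuming the SQuAD-style field names shown in Listing 1 (the file name is a placeholder):

import json
import pandas as pd

with open("policyqa.json") as f:  # placeholder file name
    raw = json.load(f)

rows = []
for entry in raw["data"]:
    for paragraph in entry["paragraphs"]:
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                rows.append({
                    "question": qa["question"],
                    "answer": answer["text"],
                    "answer_start": answer["answer_start"],
                    "context": paragraph["context"],
                })

df = pd.DataFrame(rows)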
The distribution of the topics covered by the annotations reveals that 44.4% of the annotations refer to first party collection and 34.1% to third party sharing. 11% of annotations reflect user choice, a rather generic category covering individual rights (explained in detail in the manually created GDPRQA). Data security annotations account for 2.2%; data retention — 1.7%; user access, edit, and deletion — 3.1%; policy change — 1.9%; international audiences — 1.5%; and do not track — 0.1%. The most commonly asked questions include “For what purpose do you use my data?”, “Do you collect or use my information? If yes, then what type?”, and “Does the website mention the name of third parties, who gets my data?” as the top three.
First, the collected post-GDPR privacy policies will be explored, followed by the elabora-
tion on the process of manual annotation, and, finally, exploration of the SQuAD format
annotated dataset.
To start with, as related research has shown that previous datasets do not address the variety of companies, 47 privacy policies of companies ranging in industry, size, and location were collected. The companies and their respective sizes, industries, and countries are presented in Table 16. Companies of varying sizes were selected, from European startups with up to 50 employees to corporations with 10,000+ employees. Large companies often cover several services in one policy; their policies are written by in-house legal experts and updated frequently. Smaller companies' policies cover a narrower range of topics, vary in legal sophistication, and are updated less often.
Both European (e.g. Swedish, Danish, Norwegian, French, German, Spanish, Lithuanian,
Hungarian, Greek, British, Dutch, Swiss, Portuguese, Italian, Austrian, etc.) and some
global companies with a European presence were chosen as they also need to comply with
GDPR when operating within the EU. Only policies in English were considered.
The range of industries of the collected policies from the websites and apps is also rather
diverse, such as manufacturing, retail, technology, transport, energy and oil, space, hos-
pitality, automobile, healthcare, pharmaceuticals, banking, food, audio services, telecom-
munication, personal goods, lifestyle, marketplace, education, which can be seen in Table
16.
Depending on the industry, some companies more commonly collect specific types of information. For instance, healthcare companies deal more with sensitive information such as health-related data, while banking companies often use automated decision-making to assess financial eligibility. Privacy policy documents seem to be
rather structured, as they follow specific guidelines and are homogeneous among entities of similar industries.
Exploratory data analysis reveals some curious aspects. On average, policies contain 4,569 words; at an average reading speed of 250 words per minute, it would take 18.2 minutes to read such a policy. The average sentence length is 21 words, which is considered rather verbose and lengthy. On average, policies contain 515 words scored as difficult. Difficult words are estimated as those with more than two syllables that are not present in the list of common words provided by the library Textstat.
Two readability scores are used to measure the lexical complexity of the privacy policies.
1. Flesch Reading Ease scores passages of text based on syllable counts and word and sentence lengths. Higher scores mark easier texts, while low numbers indicate high lexical complexity; the score is used to estimate the level of education needed to comprehend the material. The average score of 62 falls in the “easily understood by high-school students” range, bordering the 50-60 “fairly difficult” category.
2. Gunning Fog Score indicates the number of years of formal education needed for a person to understand an English text on first reading. Texts with a score below 12 are assumed to be intended for a general audience, while universal comprehension would require a score under 8. The GDPR privacy policy dataset scores 15.5 on average in Gunning Fog; hence, the sampled privacy policies may not be deemed readable by a general audience, despite privacy policies being intended to be fully understood by any individual. A score of 15.5 corresponds to a college senior, implying rather strong lexical complexity.
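Both scores, as well as the difficult-word counts used above, can be computed with the Textstat library; a minimal sketch, with the policy text path as a placeholder:

import textstat

policy_text = open("privacy_policy.txt").read()  # placeholder path

print(textstat.flesch_reading_ease(policy_text))  # higher = easier to read
print(textstat.gunning_fog(policy_text))          # years of formal education
print(textstat.difficult_words(policy_text))      # count of difficult words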
Privacy policies are described as lengthy and complex, yet this does not necessarily mean they are dense in pertinent information. A set of vague words derived by Lebanoff and Liu (2018) was used to count frequencies; words such as “may”, “some”, “necessary”, “certain”, “sometimes”, “reasonably”, “appropriate”, and “typically” scored rather high, with a mean of 156 vague words per policy.
Therefore, the collected GDPR privacy policies are indeed rather lengthy, verbose, complex, and vague, as proposed by related research. This signifies that a general audience would struggle to give informed consent to the use of their data, rendering the current practice of giving consent ineffective. Unawareness when consenting to data processing can result in personal data risks, such as misuse and breaches, questioning the legitimate power of privacy policies. Hence, an NLP solution capable of simplifying privacy policy comprehension could greatly assist in solving the outlined problem. All the metrics are presented in Table 1.
        Word Count  Sent Count  Avg Sent Len  Difficult W  Flesch Score  Gunning Fog  Vague Word Count
count   44          44          44            44           44            44           44
mean    4569        225.5       21            514.6        62            15.5         156
std     2786        157         3.8           186          9.1           1.8          88
min     960         44          13.2          159          50.2          10.6         28
25%     2935.7      145.2       18.8          409.0        62.7          14.3         90
50%     4055        190         21.4          480.5        67            15.5         145
75%     5352.7      253         23.7          642.5        74.8          16.5         188.8
max     14604       871         29.2          1106         91.2          19.9         476
Content analysis is a scientific tool that generates new insights and improves the re-
searchers’ comprehension of the subject of research. The process must be reliable and
systematic, producing valid results. It stems from how the subject of the research is con-
ceived. Exploratory in process and inferential in intent, it aids in making newly acquired
findings accessible for systematic analysis further performed by automated means, foster-
ing the development of inquiries in the realities of privacy constructs.
To annotate the policies in the SQuAD format with (question, answer, context) triples, Haystack was used. Haystack is an end-to-end open-source framework which enables the development of scalable NLP pipelines. Built modularly, it integrates with other open-source projects, including HuggingFace and ElasticSearch. It provides the necessary annotation tooling used to label the question-answer pairs.
Each policy was read independently by two annotators, the authors of the thesis, both with GDPR-specific domain knowledge. The annotation followed the Natural Questions technique, in which answers are marked for a set of predefined questions; formulating the questions without having seen the passage of text helps avoid bias. Only one answer per question was allowed per context.
Reflecting the differences GDPR brings to the structure and terminology of privacy policies, the relevant questions were derived, as shown in Table 2.
• Sensitive Data. Not covered in any existing dataset, sensitive data (3.7% of annotations in GDPRQA) is extremely important due to the more stringent requirements a company must meet to process it.
• Data Subject Rights. The elaboration of data subject rights and how to exercise them is one of the key ways the introduction of GDPR gave individuals more control over their data. Previous datasets only include a generic description of “user control”, hence a comprehensive list of questions related to the rights of individuals was composed. In total, questions on data subject rights account for 46.7%, with the most frequent questions concerning user consent, direct marketing, data changes, data erasure, lodging of complaints, and the exercising of rights.
• Other. GDPR also introduced other changes, such as stricter security measures (4.5% of annotations), specific terminology including the obligatory concept of a data controller (3%), and retention policies (6.2%).
The statistical summary reveals that the average length of a question is 8.7 words, while the average length of an answer is 21 words, and of a context — 178 words, as described in Table 3. The distribution of questions in the dataset is rather even, as can be seen in Table 2. It is observed that some policies lack information on certain topics. For instance, the countries where data is stored are often not evidently stated, as this answer appears in only 2.8% of the question, answer, context triples. The low frequency of the sensitive data category may be explained by the fact that few companies deal with sensitive data; moreover, such data is highly regulated. In terms of the data subject rights distribution, the policies provide less information on the right to restriction of processing and the right to data portability. The policies also seem to lack details on the process behind exercising specific rights: information on “How to change data (or options)?” is less frequent.
The distribution of the n-gram prefixes of the questions in the GDPRQA dataset can be seen in Figure 13. The majority of the questions are non-factoid, with causal and complex questions prevailing, such as “How can I exercise my rights?” (elaborating causal). There are a few factoid questions, such as “Who is the DPO?”, and some list questions, for instance, “What rights do I have?”. Questions such as “Can I object to processing?” can be classified as both confirmation and causal, as they also require explanatory and elaborative aspects. Hence, the combination of different types of questions, with the majority being non-factoid, makes the GDPRQA dataset rather challenging, as the expected answers vary in length, span, and structure.
                         PolicyQA                           GDPRQA
Source                   Pre-GDPR Website Privacy Policies  Post-GDPR Privacy Policies
Policies                 115                                47
Avg Words in One Policy  2319                               4569
Questions                714                                28
Annotations              25,017                             1,402
Question Annotators      Domain Experts                     Authors
Form of Q                Reading Comprehension              Reading Comprehension
Answer Type              Sequence of Words                  Sequence of Words
Avg Question Length      11.2                               8.7
Avg Answer Length        13.5                               21
Avg Passage Length       116                                178
4.5 Data Preparation and Preprocessing
Preprocessing of the data aims to reduce noise and render a consistent, transformed final dataset ready for modelling.
As BERT can only take in 512 tokens, the policies were split into paragraphs with a custom paragraph divider function, producing sequences of at most 512 tokens that meet the BERT processing limitation.
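A minimal sketch of such a divider, greedily packing paragraphs into chunks that fit the 512-token limit; the tokenizer choice is illustrative, and special tokens are ignored for simplicity:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def split_policy(paragraphs, max_tokens=512):
    """Greedily pack policy paragraphs into chunks of at most max_tokens."""
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(tokenizer.tokenize(para))
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks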
The privacy policy texts were normalised and preprocessed with tokenization and stop-word removal: tokenization segmented the text into tokens, which were further lowercased. All preprocessing is integrated into the relevant parts of the modelling and explained in Section 4.6.
4.6 Modelling
The modelling takes an iterative approach to find the best solution for answering GDPR privacy policy inquiries, accommodating both the PolicyQA and the GDPRQA dataset. Throughout this paper, several Transformer-based models were trained, together with the accompanying infrastructure.
The modelling is done through three iterations, resulting in the best performing models
which are further tested in a simulated production environment. The modelling focuses
on three BERT models, namely, BERT-base, RoBERTa-base and PrivBERT.
The three iterations are illustrated in Figure 14. Finally, the hyperparameter optimized
model will be tested in a simulated production environment with a Reader, Retriever, and
IR pipeline, as illustrated in Figure 16.
8 HuggingFace is an NLP open-source community aimed at democratising NLP practices, which provides functionality for numerous tasks, in particular based around the Transformers library.
All preprocessing, data analysis, modelling, and testing tasks are run through Google Colaboratory, a cloud-hosted Jupyter notebook environment developed by Google Brain. The Pro+ subscription was used, which grants priority access to GPUs for the increased computational power needed when working with large textual corpora and a variety of computation-intensive models. The subscription provides access to a Tesla V100-SXM2-16GB GPU as well as background execution, which is essential when training time exceeds 10 hours. Essentially, the Tesla V100 consists of numerous cores that compute in parallel.
Deep learning at its core involves a significant amount of matrix and vector computation, which can easily be parallelized; a GPU can therefore aid tremendously in the training and inference of neural networks, whose weights are adjusted iteratively during training. Transformer models thus speed up considerably on a GPU, which allows working with larger datasets. Due to the large number of parameters, inference also benefits from higher computational power.
Initially, all models are trained on one of two datasets, the widely adopted factoid-based SQuAD or the non-factoid NLQuAD, in order for the model to learn the foundations of QA.
To lower training times and ensure replicability, two of the three SQuAD-based models were already fine-tuned on SQuAD and were collected from HuggingFace's Model Hub, specifically BERT⁹ and RoBERTa¹⁰. PrivBERT, however, was fine-tuned on SQuAD manually, as no pre-existing fine-tuned model exists. To ensure consistency with the other two pre-trained and fine-tuned models, the same hyperparameters were used for PrivBERT. The hyperparameters are displayed in Table 4.
All models were fine-tuned on the SQuAD 1.1 dataset to comply with the format of the PolicyQA dataset and ensure consistency across training and evaluation.
To perform well at the specific task of QA in the target privacy domain, the knowledge learnt from the bigger open-domain QA task (source domain) must be transferred onto the smaller, narrower privacy policy QA task. This allows the model to learn the nuances of privacy policy semantics and lexicon.
⁹ bert-base-uncased-squad-v1 model
¹⁰ roberta-base-squad-v1 model
As described in Section 3.5 of the Theoretical Framework, the aim of transfer learning is to improve the learning of the target predictive function f_T(·) in the target domain D_T with the use of knowledge transferred from the source domain D_S and source task T_S.
Transfer learning is enabled through the HuggingFace script run_squad.py, which allows any Transformer model to be fine-tuned on a SQuAD-formatted dataset. It has 45 arguments, allowing much flexibility for adjustments. The code was first modified by Soleimani et al. (2021), creating run_nlquad.py, and further adjusted to fit the needs of this thesis. Essentially, the script downloads the model weights for the given model (BERT-base, RoBERTa-base, or PrivBERT). Next, the training samples are converted to features and saved to a cache file. Given the --do_train argument, training is performed for the specified number of epochs. The model's weights are saved as checkpoints according to the specified --save_steps, and the final weights are saved to --output_dir. Lastly, evaluation provides performance scores.
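For illustration, a hypothetical invocation of the script from Python: the argument names follow the HuggingFace script, while the model choice, file paths, and output directory are placeholders (training values mirror Table 4):

import subprocess

subprocess.run([
    "python", "run_squad.py",
    "--model_type", "roberta",
    "--model_name_or_path", "roberta-base",
    "--do_train",
    "--do_eval",
    "--train_file", "gdprqa_train.json",   # placeholder path
    "--predict_file", "gdprqa_dev.json",   # placeholder path
    "--num_train_epochs", "2",
    "--learning_rate", "3e-5",
    "--per_gpu_train_batch_size", "8",
    "--max_seq_length", "512",
    "--doc_stride", "128",
    "--save_steps", "400",
    "--output_dir", "./gdprqa_model",      # placeholder directory
], check=True)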
The performance of learning algorithms often depends on the correct instantiation of their hyperparameters. While hyperparameter settings often make the difference between standard and state-of-the-art performance (Hutter, Hoos, & Leyton-Brown, 2011), finding an optimal setting is very time-consuming due to the complexity of Transformers and the large amounts of textual data. The issue is particularly important in large-scale problems, where the size of the data can be so large that a quadratic running time is insurmountable (Wang, Feng, Zhou, Xiang, & Mahadevan, 2015).
This thesis makes use of a manually created grid search, as implementing hyperparameter optimization scripts based on Bayesian search or automated grid search in the developed code would be outside the timeframe of this paper. The hyperparameter optimization is
54
4.6 Modelling 4 METHODOLOGY
performed on the best performing models of the second iteration, found in Section 4.6.3, for each of the two datasets, SQuAD and NLQuAD. These two models are optimized over three hyperparameters, namely the number of epochs, the learning rate, and the batch size.
• Epochs: [1, 2, 3, 4]
Firstly, the number of epochs is optimized with otherwise default values to understand its effect on model performance. The best performing number of epochs then serves as a fixed hyperparameter in the subsequent manually performed grid search over learning rate and batch size, sketched below. Each of the 4 learning rates is trained against the 3 batch sizes on the GDPRQA and PolicyQA validation and test sets to ensure consistency in findings. Other parameters, such as warm-up steps and weight decay, are not optimized due to computational limitations.
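A sketch of that manual grid search; the candidate values and the train_and_evaluate helper are illustrative placeholders, not the exact grids used in the thesis:

import itertools

def train_and_evaluate(epochs: int, learning_rate: float, batch_size: int) -> float:
    """Placeholder: fine-tune (e.g., via run_squad.py) and return the F1 score."""
    return 0.0  # stub

best_epochs = 2                            # bound in the first pass
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]  # hypothetical grid
batch_sizes = [4, 8, 16]                   # hypothetical grid

results = {
    (lr, bs): train_and_evaluate(best_epochs, lr, bs)
    for lr, bs in itertools.product(learning_rates, batch_sizes)
}
best_lr, best_bs = max(results, key=results.get)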
The model scoring the highest F1 score across the two datasets is the model chosen for
the simulation of the production environment.
4.6.5 Parameters
The following section highlights the parameters used in the derived models. The overview
of the parameters is presented in Table 4.
Parameter                             Value
optimiser                             AdamW
activation function                   GELU
hidden_size                           768
num_attention_heads                   12
num_hidden_layers                     12
vocab_size (BERT)                     30522
vocab_size (RoBERTa & PrivBERT)       50265
correct_bias                          false
weight_decay                          0.01
learning_rate                         0.00003
max_seq_length                        512
max_answer_length                     512
top_k                                 50
num_training_steps                    400
num_warmup_steps (first iteration)    80
num_warmup_steps (second iteration)   1000
train_batch_size                      8
doc_stride                            128
warmup_proportion                     0.2
num_epochs                            2
AdamW is used as the optimiser. Adaptive optimisers, such as Adam, are optimization
algorithms that update network weights iteratively based on the training dataset, instead
of the conventional stochastic gradient descent (SGD). Adam, which stands for adaptive
moment estimation, is an extension of SGD that has recently been widely adopted in deep
learning and neural language modelling (Géron, 2017).
However, models trained with Adam may not generalise as well in QA tasks, so SGD with momentum could be preferred. Loshchilov and Hutter (2019) found that L2 regularisation is much less effective in adaptive optimisers than in SGD. L2 regularisation (weight decay) builds on the observation that networks with smaller weights normally generalise better and are less prone to overfitting. Loshchilov and Hutter (2019) improve Adam by creating AdamW, in which the weight decay is decoupled from the gradient-based update and is hence proportional only to the weight itself. AdamW results in better training loss and generalisation while retaining Adam's general benefits, such as a decreased need to tune the learning rate, as it is an adaptive learning-rate algorithm.
GELU, short for Gaussian Error Linear Unit, is a high performing activation function used to decide whether the neurons of the developed networks should be activated. The function is presented in Figure 15. Activations with linear units, such as ReLU and ELU, allow for faster convergence of neural networks. Moreover, dropout regularises the model by randomly multiplying activations by zero. Both of these approaches determine the output of a neuron; GELU combines them, offering a more probabilistic neuron output and a new probabilistic understanding of the nonlinearity, which is why it was used. GELU has been observed to result in better performance in a variety of NLP tasks, specifically in QA. It is advised to use it with an optimiser with momentum, which fits well with AdamW (Hendrycks & Gimpel, 2016).
GELU can be formulated as follows, where Φ(x) stands for the cumulative distribution function of the Gaussian distribution:

$$\mathrm{GELU}(x) = x\,\Phi(x)$$
The warm-up proportion is set to 0.2, representing the fraction of training steps taken until the maximum learning rate is reached: the learning rate increases linearly up to the maximum and then decreases at a linear rate. The number of warm-up steps is 80. Warm-up steps are updates with a small learning rate taken at the beginning of training; once they are completed, the model trains to convergence with the regular learning rate. Warm-up helps the model adjust to the data and lets the adaptive optimizer (AdamW) estimate the correct rates needed for the gradients. The model takes in eight samples per batch during training. The W in AdamW stands for weight decay, which is set to 0.01; given a weight decay of 0, AdamW becomes identical to Adam.
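A minimal sketch of this optimiser and warm-up schedule with the values from Table 4; the linear layer is a stand-in for a full Transformer:

import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

model = nn.Linear(768, 2)  # stand-in for a Transformer QA head
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=80,     # linear increase of the learning rate
    num_training_steps=400,  # then linear decay to zero
)

for step in range(400):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()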
Document stride helps split long documents into several features: a sliding window of 128 tokens controls the stride between the chunks of the split text. The maximum sequence length is 512, the largest single text input BERT can handle. The number of attention heads and of hidden layers is 12. top_k is set to 50, the number of answer candidates the Reader takes in; the greater the variety, the higher the probability of obtaining the correct answer, at a trade-off in processing time.
Every model has a set of files associated with it, including its configurations (Section
4.6.6.1), tokenizers (Section 4.6.6.2), and encodings (Section 4.6.6.3), which will be elab-
orated on further below.
4.6.6.1 Configuration
The base class PretrainedConfig.py implements common methods for loading and saving a configuration, either from a local folder or directory or from a pretrained model configuration provided by the HuggingFace Hub. Each derived configuration class implements model-specific attributes. Attributes present in all configuration classes are hidden_size, num_attention_heads, num_hidden_layers, and vocab_size. Each subclass implements a model_type, an identifier for the model that is serialized into the JSON file and used to recreate the correct object. keys_to_ignore_at_inference is a list of tokens to ignore by default when looking at dictionary outputs of the model during inference, and attribute_map is a dictionary that maps model-specific attribute names to the standardized attribute naming.
4.6.6.2 Tokenizers
To enable the model to understand input data, the data needs to be processed into an acceptable format. As models do not understand raw text, the inputs are converted into numbers and assembled into tensors. Tokenizers translate text into model-comprehensible data: they start by splitting text into tokens using one of the algorithms (BPE or WordPiece) described in Section 4.6.6.3, and the tokens are then converted into numbers used to build the input tensors. Any additional inputs required by a model are also added by the tokenizers through the two configuration files, merges.txt and special_tokens_map.json, described below.
1. merges.txt
merges.txt is a file that only exists for HuggingFace's RoBERTa-based language models. As HuggingFace's RoBERTa tokenizer is based on the GPT-2 byte-level BPE tokenizer, merges.txt stores the learned merge rules that the tokenizer applies when splitting text into subword tokens; vocab.json then maps each resulting token to an id, as seen in the example below.
['What', "'s", 'up', 'with', 'the', 'token', 'izer', '?']
Then, according to the values in vocab.json, each of these tokens is replaced by its corresponding index in the vocabulary, which yields the final encoding.
2. special_tokens_map.json
special_tokens_map.json defines the special tokens a model relies on: "[CLS]", "[SEP]", "[UNK]", "[PAD]" and "[MASK]" for BERT; and "<s>", "</s>", "<unk>", "<pad>" and "<mask>" for RoBERTa. The special tokens are described below:
• "[UNK] / "<unk>"" is used when a token in the training data is not covered in
the vocabulary due to the vocabulary size. However, since BERT makes use of
WordPiece and RoBERTa uses BPE, both described in Section 4.6.6.3, there should
exist little to no unknown token mappings when performing tokenization on the
training data.
• "[MASK]" / "<mask>" token is used to enable the deep bidirectional learning as-
pect of a transformer and leverage masked language modelling (MLM). It takes a
percentage of the input tokens and masks them using the ”[MASK]” / ”<mask>” at
random. The model then tries to predict the masked tokens. The predicted tokens
from the model are then fed into an output softmax which results in the final output
words. The models mask 15% of words while training, but not all use the ”[MASK]”
/ ”<mask>” token. About 80% of the masked tokens are labelled as ”[MASK]” or
”<mask>”. 10% of the time it takes a random token and places it instead of the
original token, and 10% of the time it replaces it with unchanged input tokens that
are being masked.
• "<s> and </s>": Similarly to BERT’s separation token, RoBERTa uses ”<s>
and </s>”. The first signifying the classification sequence where the question is
located and the second — the context( <s> Question </s> </s> Context </s>).
• "[SEP]": In addition to MLM, BERT also uses a next sentence prediction task
to pre-train the model for tasks that require an understanding of the relationship
between two sentences. When taking two sentences as input, BERT separates the
sentences with a special [SEP] token. During training, BERT is fed two sentences —
50% of the time the second sentence comes after the first one, and 50% of the time
it is a random sentence. BERT is then asked to predict whether the second sentence
is a random sentence or not.
• "[PAD]" / "<pad>" serves as a padding token for BERT and RoBERTa respec-
tively, as both models receive a fixed length of context as an input. Usually, the
maximum length of a context in the case of this paper 512, depends on the training
data. For contexts that are shorter than the maximum length, paddings are added
(empty tokens) to the context to make up the length.
4.6.6.3 Encodings
Two encodings will be presented: Byte-Pair Encoding (BPE), used by RoBERTa, and WordPiece, used by BERT.
1. Byte-Pair Encoding (BPE)
To provide an example, let us assume that after initial tokenization, the following set of words and their frequencies has been determined:

[("mud", 10), ("cud", 5), ("pun", 12), ("sun", 4), ("puns", 5)]

Consequently, the base vocabulary consists of ["c", "d", "m", "n", "p", "s", "u"]. Splitting all words into symbols of the base vocabulary, we obtain:

[("m" "u" "d", 10), ("c" "u" "d", 5), ("p" "u" "n", 12), ("s" "u" "n", 4), ("p" "u" "n" "s", 5)]
The tokenizer counts the frequency of all adjacent symbol pairs and merges the pair that appears most frequently. In the example above, "u" followed by "n" is the most frequent symbol combination (12 + 4 + 5 = 21) and thus becomes the first merge rule of the tokenizer. The BPE algorithm keeps iterating over the most frequent symbol pairs until the vocabulary limit is reached.
1 (" m " " u " , " d " , 10) , (" c " " u " ," d " 5) , (" p " " un " , 12) , (" b " " un " , 4) , (" p " "
un " " s " , 5)
2. WordPiece
Similarly to BPE, WordPiece is the subword tokenization algorithm used for BERT. The algorithm was described by Schuster and Nakajima (2012) and is very similar to BPE: WordPiece creates a vocabulary that includes all characters present in the training data and iteratively learns merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data. In other words, it selects the symbol pair whose probability, divided by the product of the probabilities of its two parts, is the greatest among all symbol pairs in the vocabulary. Referring to the example, "u" followed by "n" would only have been merged if the probability of "un" divided by the probabilities of "u" and "n" had been greater than that of any other symbol pair.
There can be a general inconsistency between how single-document closed-domain systems (explained in Section 3.4.5) are evaluated in a production setting and how the majority of the research literature evaluates its results. When fine-tuning on a given dataset, the context of a question has to be limited to the maximum token length a given Transformer allows, which means that documents exceeding 512 tokens for BERT and RoBERTa are cropped. To circumvent the cropping, contexts are normally split, as described in Section 4.5.
That, in turn, results in contexts that are smaller and more specific than full privacy policies and that have denoted questions and answers, which leads to a more controlled evaluation environment. In this environment, models tend to perform proportionally better, as the probability of extracting the right answer is higher. However, it cannot be inferred that
a model will perform the same in a production setting, where the data would be larger. When an entire policy is fed into the model, the answer might come in varieties not necessarily corresponding to the labelled ground truth.
To achieve a more realistic estimate of how the models would perform in a production environment, a simulated prototype was created. The simulated production environment includes an information retrieval system, specifically ElasticSearch, a Retriever, and the two hyperparameter-optimized models. ElasticSearch is implemented as a locally hosted database in the Google Colab environment and uses the Haystack framework from deepset.ai to facilitate the indexing and retrieval of documents.
The entire production architecture is reflected in Figure 16. At index time, a privacy policy is split into paragraphs which are sent to ElasticSearch, making them searchable. At query time, a user asks a question about a specific policy. The Retriever preprocesses the query and sends it to ElasticSearch, which returns the documents relevant to the search query. The Reader receives the retrieved documents from the Retriever, tokenizes and encodes them, and feeds them into the QA model. The QA model finds the correct answer span and sends the output to the decoder. The decoded answer span is finally sent back to the user.
After all splits have been performed, each is converted to the Document.py class, using a MarkdownConverter.py if the file is in HTML format or a TextConverter.py if the policy is in text format. Finally, each document is indexed to ElasticSearch with the metadata of the corresponding privacy policy title. An example of the query format sent to ElasticSearch at retrieval time can be seen below.
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Is my data leaving the EU?",
            "type": "most_fields",
            "fields": ["content", "title"]
          }
        }
      ],
      "filter": [
        { "terms": { "title": ["imdb.com"] } }
      ]
    }
  }
}
When a query arrives, ElasticSearch analyses it against its inverted index and uses the BM25 algorithm to return the most relevant documents. This process is referred to as keyword search; it does not consider the phrasing or semantic context of the question. The highest scoring documents according to the BM25 algorithm are returned and converted back into instances of the Document.py class.
The two hyperparameter-optimized models are used to extract answers to questions, given the contexts of the documents returned by the Retriever. The Readers are fed three parameters, specifically a question, the number of answers to return, and the documents provided by the Retriever. Each Reader is inserted into a QA pipeline, a high-level class which abstracts away the complexity of tokenization, encoding, and decoding. An example of the Reader can be seen in Listing 4.
# RETURNS
# [{'answer': 'You have the right to access your data',
#   'start_idx': 34, 'end_idx': 64, 'score': 0.83}]
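To complement the fragment above, a minimal sketch of how such a Retriever-Reader pair might be wired up with Haystack; the class names follow Haystack's v1 API, while the index name and model path are placeholders:

from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, FARMReader

document_store = ElasticsearchDocumentStore(host="localhost", index="policies")
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="./gdprqa_model")  # placeholder path

query = "Can I access my data?"
docs = retriever.retrieve(query=query, top_k=10)
prediction = reader.predict(query=query, documents=docs, top_k=5)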
The architecture consists of the aforementioned IR system, Reader, and Retriever and is created to evaluate the performance of the existing models in a simulated production environment. The performance on the PolicyQA and GDPRQA datasets is evaluated separately, but the IR system contains data from both. Each dataset is evaluated by iterating through it and collecting each question-answer pair. For each question, the Retriever is asked to provide up to 100 documents corresponding to the title of the privacy policy in question. The Reader is then tasked with finding one, three, or five candidate answers for a given question per received document. A snippet of the retrieval and answer extraction can be found in Listing 5.
4.7 Evaluation
The extracted answers are evaluated using a number of performance metrics. The number of shared words between the prediction and the ground truth is the basis of the F1 score, which is computed over the individual words in the prediction against those in the ground truth. Recall is the ratio of the number of shared words to the total number of words in the ground truth; hence, recall represents what share of the ground-truth tokens the prediction recovers. Precision, in turn, is the ratio of the number of shared words to the total number of words in the prediction; it shows what percentage of the tokens in the predicted answer also appears in the ground truth. It is thus much easier to achieve high precision when the answer span is short. Exact match (EM) checks, for each question-answer pair, whether the characters of the prediction identically match those of the ground truth, which is problematic for semantically similar but differently phrased answers. The F1 score is considered the most important metric when evaluating our models.
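A minimal sketch of these token-overlap metrics, assuming both strings have already been normalized as in Listing 6:

from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted and a ground-truth answer."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_shared = sum(common.values())
    if num_shared == 0:
        return 0.0
    precision = num_shared / len(pred_tokens)
    recall = num_shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, ground_truth: str) -> bool:
    return prediction == ground_truth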
Closed-domain evaluation is performed, meaning the evaluation is done only on the privacy policy of the relevant organisation. Each model is evaluated on both test and validation datasets to assess GDPR question answerability and general PolicyQA answerability separately. Every model is evaluated equally, with the focus on the F1 metric; EM is also included in Section 5. For each question and context, their
ground truth answers are compared to the predicted answer. The top 50 candidates are retrieved, and the top-scoring answer according to the model's output score is compared with each of the ground truth answers. The highest F1 and EM of the predicted answer over the ground truth answers are recorded for the given answer.
To realistically evaluate the simulated production prototype, the PolicyQA and GDPRQA datasets were evaluated separately to preserve consistency with other findings. However, both sets are contained in the information retrieval system, ElasticSearch. This has no effect on the evaluation metrics of model performance, as the model is only concerned with the privacy policy of the relevant organisation; however, retrieval times increase as more privacy policies are stored. The performance of the information retrieval system itself is not evaluated, as ElasticSearch is highly distributed and the thesis works with single-document closed-domain tasks.
The evaluation of the model performance, which can be seen in Listing 7, is done on both datasets. Each question in the dataset retrieves all segments of the relevant policy. The model is tasked with finding the top candidate answer per retrieved segment. All answers are merged, and the corresponding 1, 3, and 5 top candidates, dependent on the output score of the model, are evaluated, as seen in Table 14. To ensure common formatting between the extracted answer and its corresponding ground truth, both are preprocessed before any evaluation metrics are calculated. First, both the answer and the ground truth are lowercased, followed by the removal of punctuation and of articles such as a, an, and the. Finally, the '|||' artefacts created by the MarkdownConverter are removed. The normalization steps can be seen in Listing 6.
import re
import string

def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace."""
    s = s.lower()
    s = s.replace('|||', ' ')  # strip MarkdownConverter artefacts
    s = ''.join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)  # remove articles
    return ' '.join(s.split())  # standardize whitespace
When both the answers and the ground truths have been preprocessed, the corresponding EM, F1, precision, and recall are found for each of the one, three, or five most relevant answers in the entire document. If three or five answers are found, only the highest EM, F1, precision, and recall are recorded. In other words, the evaluation reports the extent to which any of the returned answers is able to answer the question according to the ground truth.
4.8 Deployment
The Application Layer displays information about the privacy policy, thus providing the
users with a user-friendly front-end to pose their queries and receive the answers. In this
layer, the Query Module receives the user’s query consisting of the name of the organi-
sation and a question about the related policy (1). The query is transformed using the
preprocessing steps explained in Section 4.6.7.3. These inputs are forwarded to lower lay-
ers (2), which then extract the entire policy in the form of stridden segments. Each of
the stridden segments is sent to the Machine Learning layer (4) for answer extraction. To
resolve the response received from the answer generation (5), each of the stridden segments
is merged into one entire policy which is displayed to the user. The response also includes
the top 5 answers, and their corresponding start and end indices, which will be used to
highlight the answer span in the policy as also shown in Figure 12. It is suggested that
the answers will be shown in a ranked format, depending on the confidence score of the
models.
The Data Layer serves two purposes. Firstly, it performs the cleaning of the policies received from the PolicyCrawler, and thereafter it preprocesses and indexes the cleaned privacy policies. A potential solution could be to use both Converters.py and Preprocessor.py from Haystack, described in Section 4.6.7.2, which segment the policy into semantically coherent and adequately sized documents.
The second purpose of the Data Layer is a knowledge base, potentially leveraged with
ElasticSearch or any other search engine. ElasticSearch receives a formatted query (2)
and utilizes its inverted index and BM25 algorithm to retrieve the relevant answers. As the number of policies stored in the knowledge base grows, it may become necessary to limit the doc-stride or the number of returned stridden segments to preserve low retrieval times. Here in particular, the keyword search of the BM25 algorithm, or potentially a dense passage retriever, becomes important.
4.8.3 PolicyCrawler
There are several potential approaches to creating a PolicyCrawler. The first approach would be to rely on direct URL access provided by users at query time. The PolicyCrawler would receive the URL, extract the information, then clean, preprocess and index the policy into the knowledge base. The knowledge base would thus grow incrementally as usage of the platform increases.
Another possibility is to serve the models through an application. The application would automatically retrieve the names of installed applications. The application names would then be passed to the crawler, which would query Google's Play Store or Apple's App Store, both of which associate structured privacy policy links with each application.
Lastly, the provided title could be used to automatically scan for privacy policies through
search engines. However, as the front-end often changes structure, there are numerous
associated dependencies which result in inconsistency. Hence, this approach seems like
the least reliable option.
The Machine Learning Layer is responsible for encoding the segment, extracting the an-
swer and decoding it to make it user comprehensible. The layer takes as an input the
privacy policy segment from the Query Module (4), and the user query (1) from the Ap-
plication Layer. The Machine Learning Layer encodes the question and context according
to what is described in Section 4.6.6.2, generates an answer as described in Section 3.4
and probabilistically assigns each answer an output score. All answers are then sent back
to the application layer in the following format:
    [{'score': 0.76, 'start': 215, 'end': 240, 'answer': 'The answer to the question'}]
Further additions to the architecture could improve the reliability of the answer extractor. Firstly, a question classification model could predict the question type, enabling the IR system to filter on certain aspects of the question. Furthermore, each segment could be supplied with extra metadata provided by a classifier, such as the XLNet classifier presented by Mustapha et al. (2020) and implemented as in Figure 18.
Validity refers to the accuracy of the measures used throughout the methodology. Content validity evaluates whether the method adequately measures all the content relevant to the variable. The methods used cover the entire domain of privacy policies, as they are built on top of previous studies and augmented with the aspects relevant to the current GDPR regulatory landscape.
Construct validity explains whether any inferences can be made based on the test scores. The test results cannot be generalised to imply that questions about any policy can be answered with the same performance, as this largely depends on the quality of the policy and its compliance with GDPR. The results also exhibit convergent validity, as the scores obtained in the study correlate strongly with previously developed instruments in other privacy policy studies.
5 Results
The results are based on the iteration process described in Section 4.7 and Figure 14. The
results are separated into the following sections:
(a) Models Fine-Tuned on SQuAD: The results all show pretrained models fine-tuned on SQuAD.
(b) Models Fine-Tuned on NLQuAD: The results all show pretrained models fine-
tuned on NLQuAD.
(c) Hyperparameter Optimisation: The results show the best performing model
from the second iteration optimized on various hyperparameters.
ii. Manual GridSearch: Results for cross-training the learning rate with the parameters 0.00001, 0.00002, 0.00003 and 0.00004, and the batch size with the parameters 4, 8 and 16.
(a) PolicyQA Dataset: All results are evaluated on PolicyQA test and validation
dataset for top candidates of 1, 3 and 5 over full privacy policy.
(b) GDPRQA Dataset: All results are evaluated on GDPRQA test and validation
dataset for top candidates of 1, 3 and 5 over full privacy policy.
5.1 Test Environment
To start with, the results of the performance of the models in the testing environment are displayed. The testing is done on the testing sets of GDPRQA and PolicyQA. The data is in SQuAD format, consisting of question, answer and context triples; hence, given the nature of the formatting, the model sees the paragraph (context) in which the answer is contained.
First, the models fine-tuned on SQuAD will be tested on the GDPRQA and PolicyQA test-
ing and validation datasets. The model combinations would include the baseline models
solely tuned on SQuAD, as well as models trained on GDPRQA training data, PolicyQA
training data and combined GDPRQA & PolicyQA data. Next, the same setup will be
applied to models fine-tuned on NLQuAD instead of SQuAD to see whether non-factoid
open domain knowledge helps the models to answer privacy-related questions in compar-
ison to factoid-based SQuAD.
Figure 19 presents an overview of charts with SQuAD and NLQuAD based results for GDPRQA, PolicyQA and the augmented PolicyQA & GDPRQA dataset. The clusters represent different training configurations, while colours stand for the different language models used.
GDPRQA-SQuAD
The best scores for GDPRQA-SQuAD are achieved by a PrivBERT model trained on the GDPRQA dataset, resulting in an 81.1% F1 score on the validation set and 78.4% on the test set, as seen in Table 5. The EM scores are 52% for the test set and 45.8% for the validation set, meaning that for 52% of the questions in the testing set the answer estimated by the Reader is completely identical to the ground truth.
PrivBERT trained only on SQuAD results in 43.9% F1, while RoBERTa trained only on SQuAD results in 20.7%; hence a domain-specific language model indeed makes a difference in performance. BERT trained only on SQuAD performs better than RoBERTa by 11%. Training on PolicyQA almost doubles the performance of the models. Training on GDPRQA instead of PolicyQA, naturally, results in even stronger evaluation metrics.
For BERT and PrivBERT, performance increases by 2%, while for RoBERTa it increases by 16%. The fourth variation of training data combinations is training on the augmented GDPRQA & PolicyQA dataset. BERT improves by 10% and RoBERTa by 5%, while PrivBERT's performance decreases by 4%. Hence, adding PolicyQA training data to GDPRQA does not enhance performance when tested on GDPRQA, suggesting that the privacy policy annotations in GDPRQA differ considerably in content from PolicyQA when modelled with a domain-specific language model.
GDPRQA Dataset (✓ = included in training, ✗ = not included)

Model Name     SQuAD  GDPRQA  PolicyQA  GDPRQA & PolicyQA   Test EM  Test F1  Val. EM  Val. F1
BERT-base        ✓      ✗        ✗             ✗             0.080    0.319    0.083    0.287
BERT-base        ✓      ✗        ✓             ✗             0.320    0.623    0.083    0.422
BERT-base        ✓      ✓        ✗             ✗             0.320    0.646    0.291    0.707
BERT-base        ✓      ✓        ✗             ✓             0.480    0.740    0.333    0.717
RoBERTa-base     ✓      ✗        ✗             ✗             0.040    0.207    0.000    0.280
RoBERTa-base     ✓      ✗        ✓             ✗             0.120    0.566    0.083    0.458
RoBERTa-base     ✓      ✓        ✗             ✗             0.440    0.727    0.333    0.749
RoBERTa-base     ✓      ✓        ✗             ✓             0.480    0.777    0.416    0.754
PrivBERT         ✓      ✗        ✗             ✗             0.160    0.439    0.083    0.531
PrivBERT         ✓      ✗        ✓             ✗             0.480    0.766    0.208    0.527
PrivBERT         ✓      ✓        ✗             ✗             0.520    0.784    0.458    0.811
PrivBERT         ✓      ✓        ✗             ✓             0.440    0.747    0.333    0.764

Table 5: GDPRQA-SQuAD
PolicyQA-SQuAD
PolicyQA based on SQuAD shows its strongest performance when trained on the augmented GDPRQA & PolicyQA dataset, resulting in 60% F1 on validation, and when trained on PolicyQA, in 64.2% on testing, as presented in Table 6.
The baseline models trained only on SQuAD score 28.6% with BERT, 11% with RoBERTa and 35% with PrivBERT. Compared to PrivBERT's 43.9% when tested on GDPRQA, this suggests that, without being trained on a domain-specific dataset, it is easier for a privacy language model to predict on a GDPR-relevant dataset. It could also be due to the fact that PrivBERT was created by feeding it post-GDPR privacy policies.
When adding PolicyQA data into the training, the performance of BERT improves roughly two-fold (from 28% to 60%), that of RoBERTa six-fold (from 11% to 61.5%) and that of PrivBERT two-fold (from 35% to 64.2%), making RoBERTa the most sensitive to privacy-related training
data, and PrivBERT (which is RoBERTa tuned on privacy policies) the best performing model.
When training on the GDPRQA dataset instead, the performance goes down to 34.5% for BERT, 35.5% for RoBERTa and 37.2% for PrivBERT. Hence, GDPR-specific training data only slightly helps with the broad pre-GDPR privacy policy annotations and is not enough for the models to learn how to answer broader privacy questions, which makes sense given that the dataset is limited to GDPR specifics.
However, the augmented dataset (GDPRQA & PolicyQA) helps BERT achieve its best results. Hence, GDPR elements help to answer privacy questions better even for policies written before the adoption of GDPR. Yet, PrivBERT performs better, and best among all models, when trained only on PolicyQA, with the best F1 score of 64.2%. Adding GDPRQA to the training data makes the model perform worse, suggesting more differences between pre- and post-GDPR question-answer pairs.
The creators of PolicyQA previously achieved 56.6% in testing with BERT-base and SQuAD pre-training. Our results beat the current state-of-the-art performance by 7.6%.
PolicyQA Dataset (✓ = included in training, ✗ = not included)

Model Name     SQuAD  GDPRQA  PolicyQA  GDPRQA & PolicyQA   Test EM  Test F1  Val. EM  Val. F1
BERT-base        ✓      ✗        ✗             ✗             0.066    0.286    0.052    0.258
BERT-base        ✓      ✗        ✓             ✗             0.317    0.605    0.279    0.559
BERT-base        ✓      ✓        ✗             ✗             0.066    0.345    0.070    0.336
BERT-base        ✓      ✓        ✗             ✓             0.309    0.607    0.289    0.565
RoBERTa-base     ✓      ✗        ✗             ✗             0.026    0.110    0.022    0.090
RoBERTa-base     ✓      ✗        ✓             ✗             0.340    0.615    0.304    0.574
RoBERTa-base     ✓      ✓        ✗             ✗             0.072    0.355    0.078    0.353
RoBERTa-base     ✓      ✓        ✗             ✓             0.480    0.615    0.306    0.573
PrivBERT         ✓      ✗        ✗             ✗             0.082    0.350    0.084    0.328
PrivBERT         ✓      ✗        ✓             ✗             0.358    0.642    0.313    0.590
PrivBERT         ✓      ✓        ✗             ✗             0.082    0.372    0.091    0.369
PrivBERT         ✓      ✓        ✗             ✓             0.357    0.635    0.321    0.600

Table 6: PolicyQA-SQuAD
GDPRQA-NLQuAD
Tuned on NLQuAD, the GDPRQA dataset yields the highest scores of 82.4% F1 on the validation set for PrivBERT trained on GDPRQA, and 78.9% F1 on the testing set for RoBERTa trained on GDPRQA, illustrated in Table 7. The EM score for the best performing model is 44% on the GDPRQA testing set, which is the proportion of questions whose estimated answer precisely matches the ground truth.
Baseline models (only tuned on NLQuAD) result in 21.2% for BERT, 14.7% for RoBERTa and 30% for PrivBERT. Compared to the SQuAD baseline, only RoBERTa performs better (by 4.6%) with NLQuAD, while both BERT and PrivBERT perform better with SQuAD.
Interestingly, when training on GDPRQA instead of PolicyQA, BERT and RoBERTa show rather large gains: a 17.1% improvement for BERT (reaching 65%) and a 21.2% improvement for RoBERTa (up to 78.9%). However, PrivBERT was already able to answer questions with a rather high accuracy of 73.3% when trained only on PolicyQA, and switching the dataset to GDPRQA gives it an advantage of only 3.2%. Hence, PrivBERT is able to answer more GDPR-relevant questions even when trained on PolicyQA.
Compared to the baseline models tuned on the open domain, adding domain-specific training very strongly improves performance; for instance, for RoBERTa, performance jumps from 14.7% to 78.9% when GDPRQA is introduced into its training.
The results show that GDPRQA-NLQuAD performs slightly better than GDPRQA-SQuAD, suggesting that the non-factoid NLQuAD, with its longer answer spans, indeed benefits GDPRQA, given that the latter also has primarily non-factoid questions with relatively long answer spans.
GDPRQA Dataset (✓ = included in training, ✗ = not included)

Model Name     NLQuAD  GDPRQA  PolicyQA  GDPRQA & PolicyQA   Test EM  Test F1  Val. EM  Val. F1
BERT-base        ✓       ✗        ✗             ✗             0.040    0.212    0.008    0.376
BERT-base        ✓       ✗        ✓             ✗             0.080    0.479    0.083    0.480
BERT-base        ✓       ✓        ✗             ✗             0.320    0.650    0.208    0.682
BERT-base        ✓       ✗        ✗             ✓             0.320    0.663    0.208    0.691
RoBERTa-base     ✓       ✗        ✗             ✗             0.000    0.147    0.041    0.395
RoBERTa-base     ✓       ✗        ✓             ✗             0.280    0.577    0.083    0.428
RoBERTa-base     ✓       ✓        ✗             ✗             0.400    0.789    0.333    0.753
RoBERTa-base     ✓       ✗        ✗             ✓             0.320    0.756    0.416    0.784
PrivBERT         ✓       ✗        ✗             ✗             0.040    0.300    0.083    0.401
PrivBERT         ✓       ✗        ✓             ✗             0.400    0.733    0.125    0.572
PrivBERT         ✓       ✓        ✗             ✗             0.440    0.765    0.541    0.824
PrivBERT         ✓       ✗        ✗             ✓             0.400    0.761    0.375    0.751

Table 7: GDPRQA-NLQuAD
PolicyQA-NLQuAD
With regard to PolicyQA, the strongest performance is observed for the PrivBERT model trained on PolicyQA, resulting in 62.3% F1 on the test set and 58.5% on the validation set, as seen in Table 8. The EM for the best performing model is 33.5% on the test set and 30.8% on the validation set.
The baseline models, those trained only on the open domain NLQuAD, score 8.9% with BERT, 10.5% with RoBERTa and 8.9% with PrivBERT. The baseline performance on PolicyQA-SQuAD was much stronger (e.g. PrivBERT performs worse by 26.1%). The answer spans in the PolicyQA dataset are substantially shorter, hence it evidently benefits more from being SQuAD-based.
Training on GDPRQA instead of PolicyQA halves the evaluation scores: PrivBERT results in 34.7%, BERT in 30.2% and RoBERTa in 34%. Lastly, training on the combined dataset reaches results similar to those obtained when training solely on PolicyQA, though still slightly worse: PrivBERT scores 62.2%, BERT 59.6%, and RoBERTa 60.6%.
Overall, PolicyQA achieved better results when tuned on SQuAD, which can be explained by its substantially shorter, more factoid answer spans.
PolicyQA Dataset (✓ = included in training, ✗ = not included)

Model Name     NLQuAD  GDPRQA  PolicyQA  GDPRQA & PolicyQA   Test EM  Test F1  Val. EM  Val. F1
BERT-base        ✓       ✗        ✗             ✗             0.014    0.089    0.007    0.065
BERT-base        ✓       ✗        ✓             ✗             0.318    0.603    0.278    0.560
BERT-base        ✓       ✓        ✗             ✗             0.055    0.302    0.053    0.293
BERT-base        ✓       ✗        ✗             ✓             0.301    0.596    0.274    0.558
RoBERTa-base     ✓       ✗        ✗             ✗             0.019    0.105    0.011    0.087
RoBERTa-base     ✓       ✗        ✓             ✗             0.326    0.609    0.295    0.568
RoBERTa-base     ✓       ✓        ✗             ✗             0.073    0.340    0.006    0.340
RoBERTa-base     ✓       ✓        ✗             ✓             0.324    0.606    0.296    0.567
PrivBERT         ✓       ✗        ✗             ✗             0.019    0.089    0.011    0.076
PrivBERT         ✓       ✗        ✓             ✗             0.335    0.623    0.308    0.585
PrivBERT         ✓       ✓        ✗             ✗             0.075    0.347    0.071    0.344
PrivBERT         ✓       ✗        ✗             ✓             0.335    0.622    0.305    0.582

Table 8: PolicyQA-NLQuAD
The keen reader might have noticed that PrivBERT performs better on all test and validation datasets. To enable a better understanding, a summary with averages across the two datasets is displayed in Table 9. The table shows that a PrivBERT model trained on the augmented dataset performs significantly better across the two datasets than any PrivBERT model trained on the individual datasets. The PrivBERT models trained on individual datasets perform better on their respective test and validation sets, but worse on those of the other dataset. Despite a slight loss in performance against the individual models on their respective test and validation sets, the average across the two is significantly higher, with an average F1 score of 0.679 against 0.636 and 0.570 for NLQuAD, and 0.686 against 0.631 and 0.584 for SQuAD.
Summarization - Best Performing Models

Model                          ODG      GDPRQA Test    GDPRQA Val.    PolicyQA Test   PolicyQA Val.   Avg. F1
                                        EM     F1      EM     F1      EM     F1       EM     F1
PrivBERT GDPRQA                SQuAD    0.520  0.784   0.458  0.811   0.082  0.372    0.092  0.369    0.584
PrivBERT PolicyQA              SQuAD    0.480  0.766   0.458  0.527   0.358  0.642    0.313  0.590    0.631
PrivBERT GDPRQA & PolicyQA     SQuAD    0.440  0.747   0.333  0.764   0.357  0.635    0.321  0.600    0.686
PrivBERT GDPRQA                NLQuAD   0.440  0.765   0.541  0.824   0.075  0.347    0.071  0.344    0.570
PrivBERT PolicyQA              NLQuAD   0.400  0.765   0.125  0.572   0.335  0.623    0.308  0.585    0.636
PrivBERT GDPRQA & PolicyQA     NLQuAD   0.400  0.761   0.375  0.751   0.335  0.622    0.305  0.582    0.679

Table 9: Summarization - Best Performing Models (ODG = Open Domain General dataset)
The two best performing models, PrivBERT GDPRQA & PolicyQA trained on SQuAD
and PrivBERT GDPRQA & PolicyQA trained on NLQuAD are chosen for hyperparameter
optimisation.
Hyperparameter optimisation was performed on the best performing models from SQuAD and NLQuAD respectively (Section 5.1). As the two models perform similarly, both are hyperparameter optimized to show the potential differences after hyperparameter optimisation.
• Epochs: [1, 2, 3, 4]
The results follow the approach stated in Section 4.6.3. First, different epochs are evaluated. Next, a manual grid search with varying values of batch size and learning rate is evaluated; a minimal sketch of such a grid loop is shown below. The best performing models after HPO are used in the production environment prototype.
Epoch Evaluation
Tables 10 and 11 depict similar trends. Iterating through different epochs, it is clear that the best performance for the GDPRQA test and validation sets is achieved in the first epoch, whereas the performance on PolicyQA is best at the last epoch observed, epoch 4. Hence, observing the evaluation across epochs, one can notice that the model becomes more fitted to the PolicyQA data the more epochs it goes through. This can be explained by the fact that, as there is a substantially larger amount of PolicyQA data than GDPRQA data, the model will tend to be trained more towards PolicyQA. Hence, PolicyQA starts to perform better with an increasing number of epochs, while the performance on GDPRQA goes down. Yet, the difference in performance across epochs is more noticeable for GDPRQA than for PolicyQA. An overview of the epoch results can be seen in Figure 20.
The NLQuAD HPO, as presented in Table 12, shows that the combination of a batch size of 4 and a learning rate of 2e-5 results in the strongest overall performance in most metrics when tested on the GDPRQA testing and validation sets, yielding F1 scores of 79.6% and 81.4%, respectively. Yet, the best EM performance on the GDPRQA validation set is observed for a batch size of 4 and a 3e-5 learning rate. A batch size of 4 with a learning rate of 4e-5 scores equally well for EM on the GDPRQA testing set.
Regarding the HPO performance on PolicyQA, a batch size of 4 with a learning rate of 4e-5 scored best in all metrics for both the testing and validation sets: F1 for the PolicyQA test set is 64.6%, and for PolicyQA validation 59.6%.
Overall, in any grid combination of batch size and learning rate, the batch size of 4 was
superior to the batch size of 8. The best average F1 for both datasets (GDPRQA and
PolicyQA) can be estimated to be 70.7% for the combination of a batch size of 4 and
learning rate of 2e-5.
Table 12: HPO Evaluation NLQuAD: Batch Size and Learning Rate
BS = Batch Size LR = Learning Rate
5.2 Production Setup Results
The production setup results are based on the two best performing models from the testing
environment — PrivBERT PolicyQA & GDPRQA fine-tuned on SQuAD and NLQuAD.
Figure 21 displays the best performing model for each dataset in both the testing and production environments. The results in the production environment are substantially lower compared to those in the testing environment: since the model is not given a single paragraph but the entire privacy policy, it needs to look for an answer in a much larger text. Moreover, the model has to deal with the possibility of answers being present in several parts of the policy (e.g. contact details being mentioned in various paragraphs).
Figure 21: Best Performing Models after 3 Iterations: Testing vs Production Environment, F1 Test Score
The results for the GDPRQA dataset in the simulated production environment show the strongest performance when trained on NLQuAD with top k=5, showcased in Table 14. This configuration results in an F1 of 38.5%, EM of 15.2%, precision of 56.5%, and recall of 32.8%. The F1 score ranges from 24.5% for one answer candidate up to 38.5% for five. The best performance when trained on SQuAD is also at top k=5, with an F1 of 37.7%, precision of 54.1% and recall of 32.5%. The difference in scores between SQuAD and NLQuAD training is rather minimal, only a 1% improvement for NLQuAD, similar to that observed in the test environment. Precision, which is the number of common tokens divided by the length of the prediction, is observed to be stronger than recall.
With regard to PolicyQA, the strongest performance is for the model trained on NLQuAD with top k=5, as displayed in Table 15. It yields the strongest evaluation metrics, with an F1 of 43.6%, precision of 53.2%, recall of 47.5% and EM of 18.3%. The F1 score of the SQuAD-based model is lower by 2.4%, whereas in the test environment the SQuAD-based PolicyQA model performed better.
We can observe that the higher the top k, which represents the number of answer candidates, the stronger the performance of the model on all metrics. The more answer candidates the model returns, the higher the odds that the correct answer is among them.
As depicted in the EDA, the answer spans in GDPRQA are substantially longer than in PolicyQA; hence GDPRQA gets lower recall and higher precision compared to PolicyQA, as the model manages to find the right text span but not the entire answer.
6 Discussion
The CRISP-DM framework suggests an iterative approach to evaluation to make sure the objectives of the work are met. Hence, the holistic process is reviewed and potential improvements are considered in the following chapter. To start with, the discussion covers the elaboration of findings in the test environment and its three iterations: (1) open domain general knowledge learning, (2) transfer learning adapting to the post-GDPR privacy policy domain, and (3) HPO. Next, the production environment results are discussed, followed by recommendations for deployment. Furthermore, the research questions are answered. Finally, the thesis contemplates the contribution to research, limitations, learning reflections and future work.
The findings in the test environment are quite peculiar in numerous aspects. To start with, the first iteration of open domain general knowledge learning is discussed, followed by the second iteration of privacy policy specific transfer learning and, lastly, the third iteration of hyperparameter optimisation.
6.1 Test Environment Discussion
PrivBERT, as expected from the initial assumptions due to its ability to capture the nuances of privacy policy language, performed strongest among all chosen Transformer models, resulting in the highest scores across basically all configurations. Its performance is consistent with the literature, as it enhanced the accuracy by 3% for the PolicyQA dataset and 4% for the GDPRQA dataset. Its performance could be better on GDPRQA because PrivBERT was created by feeding it post-GDPR privacy policies.
The fact that the GDPRQA dataset achieved much higher accuracy than the PolicyQA dataset may suggest that the introduction of GDPR also improved the structure of the policies, as certain language and structure is replicated across policies. The improved structure and similar lexicon result in higher machine understanding, thus supporting the usability of, and need for, the QA model.
RoBERTa performs better than BERT. RoBERTa has the same architecture and token input processing restriction as BERT, so its better performance can likely be attributed to RoBERTa's optimized pre-training scheme, which results in better generalisation to the task of QA. RoBERTa-based models (RoBERTa-base and PrivBERT) are more sensitive to domain-specific training data, as their performance improves much more when the knowledge is transferred to the closed domain, compared to the general domain.
Moreover, the developed model for PolicyQA data was able to achieve results higher
than state-of-the-art PolicyQA, suggesting that both the hyperparameter optimization
and PrivBERT play a significant role in a performance increase for privacy policy related
QA.
Although some hyperparameter optimization was performed on the epochs, batch size and learning rate, the result might still be suboptimal, as the available computational power limited the scope of a proper grid search. As described in Section 4, training times increase substantially the more hyperparameters are optimized. Due
to the increase in training time, only two hyperparameters were chosen for the manually created grid search, specifically batch size and learning rate, due to their significance in the findings of Rangasai (2022). This results in a cross-training of 4x2 parameter combinations, equaling 8 models, with the top performing hyperparameter-optimized model increasing the average across both datasets by 3.5%.
The batch size was decreased to only two parameters, 8 and 4, as initial results showed that increasing the batch size above 8 would result in poorer overall performance. At the same time, a batch size of 4 increased the overall results, raising the question of whether lowering the batch size even further, to 2 or 1, could be beneficial. With regard to the epochs, as the difference in size between the two datasets is substantial, the more epochs were run, the better the PolicyQA dataset performed, whereas GDPRQA performed worse. This can be explained by the model becoming more biased towards PolicyQA due to the imbalance.
Weight decay and warm-up steps were not explored, but as showcased by Kamsetty (2020), the most important hyperparameters for a BERT-based classification problem were the learning rate, followed by weight decay, batch size, warm-up steps and the number of epochs. Given the time to implement a grid search over all of the above parameters, the models might see an increase of 1-2%. However, the exponential increase in time requires substantially more computational power than allowed through a Google Colab Pro+ subscription.
6.2 Production Environment Discussion
Results in the production setting do not reflect the results in the test environment, which is a curious finding in itself. As initially suggested, the results show that a single-document, i.e. closed domain, problem cannot be evaluated solely in a controlled test environment. Several factors play a crucial role. Firstly, the training of BERT- and RoBERTa-based models is bound to a maximum length of 512 tokens. This results in either substantial cropping or the necessity to split documents, thus possibly losing context. Secondly, given the format of SQuAD 1.1, the possibility of not providing an answer is non-existent.
This forces the split segments to always include one or more answers. In other words, there is always an answer to a question in the given context, which may not be true in a real-world scenario. Thirdly, in a real-world scenario, there may be several ways to answer a question, and some may not necessarily reflect the annotated answer, thus affecting the evaluation metrics. There may even exist several answers to a given question of which only a proportion of the span contains the same tokens.
The aforementioned factors play a significant role when evaluating single-document closed domain QA models in a real-world scenario. In the case of privacy policies, these factors are substantial. With the introduction of GDPR, policies of companies with a European presence grew substantially in size and word complexity, due to organizations and states trying to guard themselves against potential lawsuits and adhere to the law. This has several implications for evaluating the PolicyQA and GDPRQA datasets in a traditional test environment. Firstly, GDPR policies are typically 8 times as large as what BERT or RoBERTa allows. This results in a potential loss of reading comprehension context which, according to the findings of Soleimani et al. (2021), plays a substantial role.
Secondly, the length of privacy policies matters. When all answers are evaluated as stated in Section 4.7, only the top candidate with the highest probability output score is used in the evaluation metrics. The strong evaluation metrics are assumed to come from the limited contexts, as it should be substantially easier for the model to find the correctly annotated answer in a short context than in an entire policy.
Furthermore, the limitation of only annotating one answer per question and per context
for the GDPRQA dataset brings forth another difficulty. Questions may have multiple
answers in the same context. An example is the question ”How long would you retain or store my data?”. This question often has multiple answers, dependent on the lawful basis for the processing of the data. The model may correctly identify one of the several answers, but not the ground truth, so the metrics do not necessarily reflect a correct answer. This could potentially be solved by using Semantic Answer Similarity (SAS), as also mentioned in Section 6.8, which would allow for a more nuanced approach to detecting the extent to which the provided answer corresponds to the ground truth.
However, despite a sharp drop in evaluation metrics on both datasets, the ability to extract correct answers is not necessarily reflected by the decrease. In other words, the extracted answer might fully or partially answer the question, yet not share significant lexical overlap with the ground truth. To ensure a more correct and realistic evaluation of the proposed system, human evaluation is a necessity. In recommendation and QA systems, prediction accuracy does not always match user satisfaction (Harkous et al., 2018, p. 13).
In the PriBot application of Harkous et al. (2018), a user study showcased that the models performed substantially better than the predicted model accuracy. The respondents regarded at least one of the top three answers as relevant for 89% of the questions, with the first answer being relevant in 70% of the cases. In comparison, for the top one candidate, the scores were 46% and 48% (Harkous et al., 2018, p. 14).
6.3 Business Use Case — Deployment Discussion
In terms of legal aspects, the GDPRQA Assistant should not replace the legally binding privacy policies. It may only ease and augment user comprehension of the policy by offering a complementary interface to inquire about the relevant details related to privacy. Following the wide recent adoption of NLP tools within the legal domain, and motivated by the rise of conversational agents and automated user support, it can be deployed as a user-friendly solution for privacy policy stakeholders. Yet, a disclaimer should be added that the automatic question answering does not represent the service provider and that its answers are not legally binding.
The GDPRQA Assistant can also be deployed internally by companies as an assistance tool to handle privacy inquiries. Yet, given the legally binding nature of policies, a legal expert would need to be involved to weigh the utility of the GDPRQA Assistant against the possible legal implications.
The models could be deployed as an external, open source SaaS application or leverage a subscription-based business model. The application would initially make use of the created models and use future user feedback to consistently feed the models with additional metadata, as explained in Figure 17. However, as there would be no simple way to crawl policies from all companies and organizations around the world, it would rely largely on privacy policy URLs provided by users.
The models could also be leveraged as part of a privacy policy notification application. The application would make use of the QA model to enable users to directly ask questions about an application, either one installed on the user's phone or tablet or one available in the application store. Furthermore, the application should be able to identify important features of a privacy policy, such as whether location data is collected, whether there is access to files, folders and pictures, and to what extent data is sold or shared.
All this information could be displayed to users, either prior to or after the installation of an application. The potential benefit for users would be to continuously have an open, informative idea of what is currently accepted on their device and to what extent their data is collected. The advantage here is that the application would be able to manage the personal data usage of other applications, thus increasing users' awareness of the state of their personal data. Furthermore, it allows for a direct privacy policy crawl through the Google Play Store and Apple App Store.
The longer privacy policies in the post-GDPR space pose a significant problem for the potential deployment of the created models. As Transformers such as BERT and RoBERTa have a 512 token limit, the documents need to be split into segments of at most 512 tokens. The longer a given privacy policy is, the more splits have to be performed. This essentially results in a higher requirement for computational power, as the number of segments grows linearly with policy length, increasing inference times. In 2009, a study by Forrester Research found that users expected pages to load in less than two seconds, and at three seconds a large share abandon the site (Lohr, 2012). Currently, inference over the 115 policies in the OPP-115 corpus takes an average of 3.52 seconds when extracting the top five candidate answers for 8 typically asked questions. Substantially longer times are found for GDPRQA, with an average inference time of 7.23 seconds. The inference is performed with the GPU available in a Google Colab Pro+ subscription. However, with the use of knowledge distillation, as explained in Section 3.5, inference times can be reduced substantially due to the nature of TinyBERT; note that this often comes at a small cost to accuracy. The second option is to increase the computational power; in a production setting, however, this comes at a cost.
6.4 Answering Research Questions
RQ: How can recent advancements in NLP be leveraged to build a Question Answering system that can answer privacy policy inquiries in the post-GDPR regulatory landscape?
The thesis has revealed the feasibility of leveraging Transformers, Deep Neural Networks and transfer learning to develop a QA system that can automatically extract answers to user inquiries on GDPR privacy policies. It has been observed that the effect of GDPR has resulted in lengthier, vaguer privacy policies written in more sophisticated legal language, making them less comprehensible to a general audience. However, GDPR has also yielded increased thoroughness and structure in privacy policies, manifesting potential for improved machine reading comprehension.
The thesis has contributed the first GDPR-adapted QA system able to answer inquiries about both pre- and post-GDPR privacy policies, with an average F1 of ~71%, filling a gap in the existing research. Moreover, besides utilizing existing datasets, it has also produced a GDPR-adapted QA dataset, GDPRQA, which sets the foundation for further developments in the domain of GDPR privacy policies.
Q1: How can Transformers, Deep Neural Networks, Transfer Learning and data augmentation aid in adaptation to a post-GDPR privacy domain and improve the current state-of-the-art?
The thesis has managed not only to generate a novel model capable of answering both pre- and post-GDPR questions, but also to beat the current state-of-the-art pre-GDPR PolicyQA model by 7.6%. The findings showcased that PrivBERT, a privacy policy specific configuration of RoBERTa, provides a boost in accuracy for privacy policy QA datasets. It was found that PrivBERT improved the PolicyQA F1 score by 3%.
Transformers, specifically BERT, RoBERTa and PrivBERT, assisted in capturing long-distance dependencies in the text of privacy policies. The Transformers' usage of DNNs and position-wise FFNs handles the constraints of the Markov assumption and manages to use greater contexts in text, providing each next token with a conditional probability. The architectures are able to retain both past information and the areas to which the attention of the model should be paid.
Transfer learning enabled the transfer of general QA knowledge from the source open domain to the target post-GDPR privacy policy domain, thus retaining general knowledge of how to answer questions, whilst also specializing the QA system to the knowledge of
the privacy policy domain. The effect of transfer learning was noticed in the comparison between NLQuAD and SQuAD. The source domain clearly had an impact on the performance in the target domain, where NLQuAD performed better on GDPRQA, due to its non-factoid nature and longer answer spans, in contrast to PolicyQA on SQuAD.
Data augmentation aided in adapting general comprehension of the privacy policy domain to the post-GDPR nuances of the privacy lexicon. The results show that PrivBERT models trained on the augmented dataset generally outperform those trained on GDPRQA or PolicyQA individually, increasing the average score across the two datasets by 5-10%.
Q2: How can a developed privacy policy Question Answering system be implemented and evaluated with respect to a potential deployment in a real-world scenario?
The developed QA model can enable users to retain more control over their personal data and extract the answer spans relevant to their needs in a post-GDPR regulatory landscape at scale. The production environment prototype was simulated with the use of ElasticSearch by building an architecture consisting of IR, Reader and Retriever. The results revealed that the metrics evaluated in the test environment do not reflect a real-world production scenario. QA in the domain of privacy policies is identified as a single-document closed domain task, which performs substantially differently in a production environment due to the length of privacy policies. In production, the model is fed the entire policy and needs to retrieve the relevant segments, as opposed to the test environment, where the model sees the relevant segment containing the answer. Moreover, since the evaluation relies on F1 and EM, it might not portray the true performance of the system due to their semantic limitations. It is suggested that either user evaluation, SAS or Intersection over Union is implemented, whereby user evaluation will result in the most realistic and accurate evaluation of such a QA system.
6.5 Contribution and Implication for Research
Analysis of the privacy policies has indeed revealed that the introduction of GDPR has resulted in policies becoming more lengthy, verbose and advanced in legal language, such that they can only be understood by college graduates. These factors make the policies more difficult to comprehend for the general audience, violating the GDPR obligation for policies to be easily understood by any individual and questioning their legitimacy under the ”notice and choice” principle. The effort and time taken to read a policy indeed manifest the importance of a tool that can simplify the process. The increased exhaustiveness of detail and legal structure of the policies can be argued to make them less comprehensible for the general audience yet more comprehensible for machines.
The thesis contributes to research with an annotated GDPRQA dataset that previously did not exist. It further introduces the first GDPR-adapted QA system able to answer privacy policy inquiries both pre- and post-GDPR. Lastly, the thesis contributes realistic real-world evaluation results on privacy policy QA systems.
6.6 Limitations
An array of possible constraints was faced which should be considered when measuring the impact of the research.
The annotation for GDPRQA was forced to follow the SQuAD 1.1 format to be consistent with PolicyQA. However, annotating in the SQuAD 2.0 format would also have allowed for no-answer annotations, enabling the QA system to abstain when asked a question with a non-existing answer.
The analysis relies on the quality of the existing PolicyQA dataset. The quality-assurance techniques behind the creation of the PolicyQA dataset have not been studied. The dataset was claimed to be created by domain experts, yet no further elaboration was provided. It was assumed to have an acceptable degree of quality, yet possible subjectivity bias, which is in general likely to occur in manual work, could have affected the performance of the models developed in this thesis. A major limitation was the different answer lengths of PolicyQA and GDPRQA. Specifically, the mixed question types result in complexity for the QA, as the expected answer varies in length and span structure.
As manual labelling is very time- and effort-intensive, only a limited number of annotation samples was created in the constructed GDPRQA dataset. A small training dataset restricts what the models can learn; more samples could help the models learn more in-depth nuances of the post-GDPR privacy domain lexicon.
The input quality of privacy policies naturally affects the performance of QA. Given a policy of low quality, e.g. one non-compliant with GDPR, the model will fail to comprehend its structure.
Only English-language GDPR privacy policies were considered. Hence, extrapolating the study to policies in other languages might be challenging. Applying the models to other European languages is not covered in the scope of this thesis, yet machine translation could be leveraged for this in the future.
Subjectivity affects the complexity of labelling and evaluation. To ensure labelling consistency, each question was discussed by the labellers to lower subjectivity. The guidelines were agreed on among the labellers, and any edge cases were jointly discussed. Human manual work is very prone to inconsistency, despite employing critical realism and the methods taken to avoid bias. Similar pieces of text could still receive varying annotations from the labellers.
The techniques used in this research faced substantial computational power constraints. Large language models, such as RoBERTa-large or BERT-large, require a lot of GPU memory, which was limited to 16GB, impacting the training duration. Moreover, computational power constraints limited the possibilities for hyperparameter optimisation. Hence, the best resulting models might still be suboptimal, as not all grid search hyperparameter combinations have been explored. For instance, weight decay and warm-up steps were skipped because their exploration was too computationally heavy to perform.
6.7 Learning Reflections
This thesis provided an opportunity to harness the transformative power of natural language processing to derive actionable privacy policy comprehension insights. Attempting to solve an urgent problem around the protection of personal privacy gave us a stronger sense of purpose in our work. Viewing the multidisciplinary nature of this thesis through its legal, technical and business angles generated a profound awareness of the topic, sharpening our data science toolset. Throughout the M.Sc. Data Science degree, we have excelled in the three pillars of data: regulation, business insights, and analytics. All three aspects have been applied throughout this thesis.
We observed that a major limiting factor in our work was the lack of sufficient comput-
ing power. Despite Google Colab access to GPUs, the resources were not sufficient for
utilising the full capabilities of the language models, resulting in potentially suboptimal hyperparameter optimisation of the models presented. Training and operating large language models requires substantial computational power, which relates to hardware access, financial costs and energy consumption, raising the question of the sustainability of large language models. The increasing sizes of language models leave a larger carbon footprint: the emissions from pre-training a BERT model are estimated to be similar to those of a roundtrip flight across the US (Strubell, Ganesh, & McCallum, 2019). Moreover, the exclusiveness of access to well-funded research poses a dilemma for the future development of NLP.
6.8 Future Work
The work unravels opportunities for a wide array of topics which could address the limitations faced throughout the research.
The GDPRQA dataset12 has laid the foundation for the development of a privacy policy dataset with data relevant to the modern regulatory landscape. However, the dataset can be further developed with more annotation pairs and an even wider variety of selected companies and asked questions, improving the size, quality and diversity of the dataset. Moreover, further standardisation of the labelling would result in more consistent, higher-quality annotations.
The large language models BERT-large and RoBERTa-large were not implemented due to constrained computational power and time limits. Yet, it was observed that the larger the model, the better it can perform. However, it can be argued that inference times with such models would be much longer, questioning their applicability in a production environment.
12
Access to the dataset and the project in the annotation tool can be provided by contacting the authors
of the thesis.
6.8.3 Longformers
Longformers leverage varying configurations of the attention heads, which allows processing of the entire sequence. Longformers use an attention technique that scales linearly with the length of the sequence. As a result, Longformers can take in up to 4,096 tokens. Moreover, Longformers use global attention for the question tokens. Findings by Soleimani et al. (2021) suggest that Longformers perform significantly better than typical models on non-factoid QA datasets due to the increased token limit.
TF-IDF and BM25 are sparse retrievers which look for information in the document store. Dense Passage Retrieval (DPR) was introduced in 2020 as an alternative to conventional techniques due to a variety of benefits. Semantically similar words would not be considered a match by typical keyword algorithms such as TF-IDF and BM25, while DPR crafts dense vectors with shared semantic meaning in which such words lie very close together. Moreover, traditional sparse retrievers cannot be trained, while DPR leverages embeddings that can be trained and fine-tuned. However, as DPR requires very large annotated training data and significant computational power during indexing and retrieval (Karpukhin et al., 2020), it was not yet possible to implement it.
Instead of answer extraction, a potential research topic could be to use a generator which produces an answer from the context. Given the majority of non-factoid questions and the implicit complexity of the topic, text generation could hold large potential. It may also be argued that, in terms of user comprehension, long answers would benefit more from text generation than from extraction.
The F1 score is rather limited when evaluating longer answer spans for non-factoid questions; another possible metric is Intersection over Union (IoU), which is able to measure the position-sensitive intersection between the prediction and the label (Soleimani et al., 2021). With the proliferating complexity of QA models producing more free-form and abstract answers, the metrics should be able to resemble human perception even more closely, beyond pure lexical overlap.
More ideas could be implemented to assist users in comprehending the practices surrounding the management of their personal data. For instance, a regulatory industry standard could be developed to measure the aggressiveness of a policy based on whether the data collected and the practices taken match those expected of the given entity. Moreover, policies could be classified based on user-relevant metrics, which could be presented as warning icons, such as ”this company collects your location data” or ”your sensitive data is stored outside of the EU”.
7 Conclusion
The thesis aided in the sensemaking of the issue of privacy policy machine reading comprehension in the post-GDPR regulatory landscape. It unravelled the feasibility of applying a QA system to automatically extract answers to user inquiries related to their personal data. It was determined that the Transformer-based approach to extractive QA, employing transfer learning in the closed domain, can indeed improve machine comprehension of privacy policies in the post-GDPR era, being able to automatically address users' personal data inquiries.
First, the existing research perspective was examined. Aligned with the recent literature, it was determined that a GDPR-relevant QA dataset was needed, hence making its creation necessary in our work. Furthermore, the project scope and research methodology were outlined, following the fundamentals and iterative nature of the CRISP-DM framework. Critical realism with an abductive research approach was employed under a longitudinal time horizon which compared pre- and post-GDPR privacy policies.
The thesis utilises the existing PolicyQA dataset along with the created GDPRQA dataset and an augmentation of the two. The datasets are augmented to enable the introduction of GDPR-related aspects into general privacy policy inquiries. The GDPRQA dataset addresses the shifts in privacy policy structure introduced under the effect of GDPR. GDPRQA is built on 47 GDPR privacy policies collected from companies present in the EU, varying in size, location and industry. The exploration of the collected policies revealed them to be rather long, vague to an extent, and complex to read. The fairly difficult readability scores deem the policies not easily readable by a general audience. Moreover, the vagueness score showed an average of 156 vague words per policy. The average length of the policies was estimated to be 4,569 words, which is almost double the pre-GDPR length of 2,319 words. As the legitimacy of privacy policies depends on users' comprehension under the ”notice and choice” principle, such observations result in high risks of personal data misuse.
A variety of Transformer-based BERT models were explored and developed to improve machine comprehension of pre- and post-GDPR privacy policies. First, models were trained
on open domain general knowledge QA datasets: the factoid SQuAD (used by the majority of QA tasks) and the non-factoid NLQuAD. GDPRQA performed better when trained on the non-factoid NLQuAD, while PolicyQA performed better on SQuAD. The better performance can be explained by GDPRQA having longer answer annotations and a higher number of non-factoid questions.
Next, the models utilized transfer learning to learn the nuances of the privacy policy target domain language. GDPRQA achieved higher scores than PolicyQA in the test environment, which suggests that GDPR is able to standardise policies, with similar language and structure replicated across policies. That signified the potential for stronger machine understanding, manifesting the applicability of NLP tools, even as the complexity and length of the policies make them less readable for humans.
RoBERTa-based models, especially PrivBERT, a privacy policy domain configuration of RoBERTa, were able to capture the specifics of privacy policy language best. This can be explained by them being most optimised for the task of GDPR QA due to their more favourable pre-training scheme and generalisation. The developed models were able to beat the state-of-the-art results for PolicyQA by 7.6% thanks to the usage of the privacy policy specific language model PrivBERT and HPO. Both foundational privacy policy knowledge (obtained through the pre-GDPR PolicyQA dataset) and GDPR-specific context (through GDPRQA) were captured in the best performing models. SQuAD-based PrivBERT scored 70.2% average F1, compared to NLQuAD-based PrivBERT averaging 70.7% F1 across both datasets.
Deployment prototypes and business use cases were further elaborated upon, proposing a GDPRQA Assistant which could be a user-friendly complementary solution for privacy policy stakeholders. However, the GDPRQA Assistant should carry a disclaimer that it does not replace the binding policies themselves. It could be deployed internally as an assistance tool for organisations to comply with legal regulations. Moreover, it could be deployed externally as an open source SaaS application or under a subscription-based business model. The framework could also include vagueness and aggressiveness scoring, opt-out choices and sentence classification tools, which could be deployed as a web browser extension providing real-time feedback on the visited policy. A mobile application could also be developed to provide continuous information on applications' usage of data. Yet, given the length of policies and the number of splits needed due to BERT's limitations, this poses an issue of inference time.
Potential future work involves a variety of activities addressing the limitations faced. For instance, the proposed GDPRQA dataset can be further upscaled and improved, which could result in stronger performance of the developed models. Moreover, larger models and possibly Longformers could be used, though the computational power trade-off must be considered. Furthermore, dense passage retrievers could benefit future work due to their comprehension of semantically similar words. Given the non-factoid nature of the questions, text generation could help create responses that are more comprehensible to users. Novel evaluation metrics, such as SAS and IoU, as well as human evaluation, would be able to evaluate the performance more realistically.
All things considered, this thesis built on the recent advancements in natural language processing to allow users to retain control over their privacy in more meaningful ways, addressing the unrealistic expectation of consuming numerous privacy policies on a nearly daily basis. As a result, models capable of improving state-of-the-art QA in the domain of privacy policies and adapting it to the post-GDPR regulatory landscape were developed, hence enhancing research in the field of machine comprehension of post-GDPR privacy policies.
8 Appendix
if top_k_value > 1:
    # With top-k retrieval, credit the best score achieved by any returned answer.
    em_score = max(compute_exact_match(guess['answer'], answer)
                   for guesses in answer_lst for guess in guesses)
    f1_score = max(compute_f1(guess['answer'], answer)
                   for guesses in answer_lst for guess in guesses)
    rec = max(compute_recall(guess['answer'], answer)
              for guesses in answer_lst for guess in guesses)
    prec = max(compute_precision(guess['answer'], answer)
               for guesses in answer_lst for guess in guesses)
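The fragment above relies on SQuAD-style metric helpers defined earlier in the appendix. As a reading aid, a minimal sketch of how such helpers are conventionally implemented is given below; this is an assumption about their form (simple whitespace tokenisation, lowercasing only), not the thesis's exact code.

from collections import Counter

def compute_exact_match(prediction: str, truth: str) -> int:
    # 1 if the predicted and gold answers are identical after lowercasing.
    return int(prediction.lower().strip() == truth.lower().strip())

def compute_f1(prediction: str, truth: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)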
References
Ahmad, W. U., Chi, J., Tian, Y., & Chang, K. W. (2020). PolicyQA: A reading compre-
hension dataset for privacy policies. In Findings of the association for computational
linguistics: Emnlp 2020. doi: 10.18653/v1/2020.findings-emnlp.66
Alduaij, S., Chen, Z., & Gangopadhyay, A. (2016). Using crowd sourcing to analyze
consumers’ response to privacy policies of online social network and financial insti-
tutions at micro level. International Journal of Information Security and Privacy,
10 . doi: 10.4018/IJISP.2016040104
Amos, R., Acar, G., Lucherini, E., Kshirsagar, M., Narayanan, A., & Mayer, J. (2021).
Privacy policies over time: Curation and analysis of a million-document dataset..
doi: 10.1145/3442381.3450048
Aroca-Ouellette, S., & Rudzicz, F. (2020). On losses for modern language models.. doi:
10.18653/v1/2020.emnlp-main.403
Berman, D. (2019). 10 elasticsearch concepts you need to learn.
https://logz.io/blog/10-elasticsearch-concepts/.
Chen, A., Stanovsky, G., Singh, S., & Gardner, M. (2019). Evaluating question answering
evaluation.. doi: 10.18653/v1/d19-5817
Cranor, L. F. (2002). Web privacy with p3p. O'Reilly Media.
Das, R., Dhuliawala, S., Zaheer, M., & McCallum, A. (2019). Multi-step retriever-reader
interaction for scalable open-domain question answering..
Degeling, M., Utz, C., Lentzsch, C., Hosseini, H., Schaub, F., & Holz, T. (2019). We value
your privacy... now take some cookies: Measuring the gdpr's impact on web privacy.
Informatik-Spektrum, 42. doi: 10.1007/s00287-019-01201-1
European Commission. (2016). Regulation (EU) 2016/679 of the European Parliament
and of the Council of 27 April 2016 on the protection of natural persons with regard
to the processing of personal data and on the free movement of such data, and re-
pealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA
relevance). European Commission. Retrieved from https://eur-lex.europa.eu/
eli/reg/2016/679/oj
Facebook. (2019). Roberta: An optimized method for pretraining self-supervised nlp
systems. Retrieved from https://ai.facebook.com/blog/roberta-an-optimized
-method-for-pretraining-self-supervised-nlp-systems/
Gallé, M., Christofi, A., & Elsahar, H. (2019). The case for a gdpr-specific annotated
dataset of privacy policies. In (Vol. 2335). Retrieved from http://ceur-ws.org/
Vol-2335/1st PAL paper 5.pdf
Gormley, C., & Tong, Z. (2015). Elasticsearch: The definitive guide. O'Reilly
Media. Retrieved from https://books.google.dk/books?id=Ul9aBgAAQBAJ
Géron, A. (2017). Hands-on machine learning with scikit-learn and tensorflow. O'Reilly Media.
Han, J., Kamber, M., & Pei, J. (2012). Data mining concepts and techniques,
third edition. Waltham, Mass.: Morgan Kaufmann Publishers. Retrieved
from http://www.amazon.de/Data-Mining-Concepts-Techniques-Management/
dp/0123814790/
Haq, Q. A. U. (2021). Cyber crime and their restriction through laws and techniques for
protecting security issues and privacy threats (Vol. 341). doi: 10.1007/978-981-33-4996-4_3
Harkous, H., Fawaz, K., Lebret, R., Schaub, F., Shin, K. G., & Aberer, K. (2018). Polisis:
Automated analysis and presentation of privacy policies using deep learning. In
Proceedings of the 27th usenix security symposium.
Hendrycks, D., & Gimpel, K. (2016). Bridging nonlinearities and stochastic regularizers
with gaussian error linear units. arXiv .
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization
for general algorithm configuration. In (Vol. 6683 LNCS). doi: 10.1007/978-3-642-25566-3_40
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., . . . Liu, Q. (2020). Tinybert:
Distilling bert for natural language understanding.. doi: 10.18653/v1/2020.findings-emnlp.372
Jurafsky, D., & Martin, J. (2002). Speech and Language Processing. An Introduction to
Natural Language Processing, Computational Linguistics, and Speech Recognition.
Zeitschrift fur Sprachwissenschaft, 21 (1). doi: 10.1515/zfsw.2002.21.1.134
Kamsetty, A. (2020). Hyperparameter optimization for huggingface transformers: A
guide.. Retrieved from https://medium.com/distributed-computing-with-ray/
hyperparameter-optimization-for-transformers-a-guide-c4e32c6c989b
Karim, M. R. (2017). Searching and indexing with apache lucene. https://dzone.com/
articles/apache-lucene-a-high-performance-and-full-featured.
Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., . . . Yih, W. T. (2020).
Dense passage retrieval for open-domain question answering.. doi: 10.18653/v1/
2020.emnlp-main.550
Krippendorff, K. (2018). Content analysis: An introduction to its methodol-
ogy. SAGE Publications. Retrieved from https://books.google.dk/books?id=
FixGDwAAQBAJ
Krumay, B., & Klar, J. (2020). Readability of privacy policies. In (Vol. 12122 LNCS).
doi: 10.1007/978-3-030-49669-2_22
Lapowsky, I. (2019). How cambridge analytica sparked the great privacy awakening.
Wired .
Lebanoff, L., & Liu, F. (2018). Automatic detection of vague words and sentences in
privacy policies.. doi: 10.18653/v1/d18-1387
Lin, Y. P., & Jung, T. P. (2017). Improving eeg-based emotion classification us-
ing conditional transfer learning. Frontiers in Human Neuroscience, 11 . doi:
10.3389/fnhum.2017.00334
Linden, T., Khandelwal, R., Harkous, H., & Fawaz, K. (2020, 1). The privacy policy
landscape after the gdpr. Proceedings on Privacy Enhancing Technologies, 2020 ,
47-64. Retrieved from https://www.sciendo.com/article/10.2478/popets-2020
-0004 doi: 10.2478/popets-2020-0004
Lohr, S. (2012). For impatient web users, an eye blink is just too long to wait.. Re-
trieved from https://www.immagic.com/eLibrary/ARCHIVES/GENERAL/GENPRESS/
N120229L.pdf
Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization..
McDonald, A. M., & Cranor, L. F. (2009). The cost of reading privacy policies..
Min, S., Seo, M., & Hajishirzi, H. (2017). Question answering through transfer learning
from large fine-grained supervision data. In (Vol. 2, p. 510-517). Association for Com-
putational Linguistics. Retrieved from http://aclweb.org/anthology/P17-2081
doi: 10.18653/v1/P17-2081
Mishra, P., Rajnish, R., & Kumar, P. (2020). Sentiment analysis by novel hybrid
method be-cnn using convolutional neural network and bert. International Jour-
nal of Advanced Trends in Computer Science and Engineering, 9 (4). doi: 10.30534/
IJATCSE/2020/165942020
Murtezić, A. (2020). Convention 108: Present importance and implementation. Strani pravni život.
Su, Y., Sun, H., Sadler, B., Srivatsa, M., Gür, I., Yan, Z., & Yan, X. (2016). On generating
characteristic-rich question sets for qa evaluation.. doi: 10.18653/v1/d16-1054
Tesfay, W. B., Hofmann, P., Nakamura, T., Kiyomoto, S., & Serna, J. (2018, 3). Privacyguide:
Towards an implementation of the eu gdpr on internet privacy policy evaluation.
In (Vol. 2018-January, p. 15-21). ACM. Retrieved from https://
dl.acm.org/doi/10.1145/3180445.3180447 doi: 10.1145/3180445.3180447
Torre, D., Soltana, G., Sabetzadeh, M., Briand, L. C., Auffinger, Y., & Goes, P. (2019).
Using models to enable compliance checking against the gdpr: An experience report..
doi: 10.1109/MODELS.2019.00-20
Truong, N. B., Sun, K., Lee, G. M., & Guo, Y. (2020). Gdpr-compliant personal data man-
agement: A blockchain-based solution. IEEE Transactions on Information Forensics
and Security, 15 . doi: 10.1109/TIFS.2019.2948287
Trzaskowski, J., & Sørensen, M. G. (2019). Gdpr compliance: Understanding the general
data protection regulation. Ex Tuto Publishing.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polo-
sukhin, I. (2017). Attention is all you need. In Advances in neural information
processing systems (Vol. 2017-December).
Wang, L., Feng, M., Zhou, B., Xiang, B., & Mahadevan, S. (2015). Efficient hyper-
parameter optimization for nlp applications.. doi: 10.18653/v1/d15-1253
Wilson, S., Schaub, F., Dara, A. A., Liu, F., Cherivirala, S., Leon, P. G., . . . Sadeh,
N. (2016). The creation and analysis of a Website privacy policy corpus. In 54th
annual meeting of the association for computational linguistics, acl 2016 - long papers
(Vol. 3). doi: 10.18653/v1/p16-1126
Wilson, S., Schaub, F., Liu, F., Sathyendra, K. M., Smullen, D., Zimmeck, S., . . . Smith,
N. A. (2019, 2). Analyzing privacy policies at scale. ACM Transactions on the
Web, 13 , 1-29. Retrieved from https://dl.acm.org/doi/10.1145/3230665 doi:
10.1145/3230665
Wiratchawa, K., Khunthong, T., & Intharah, T. (2021). Legalbert-th: Development of
legal qa dataset and automatic question tagging.. doi: 10.1109/ECTI-CON51831.2021.9454753
Zaeem, R. N., & Barber, K. S. (2021). A large publicly available corpus of website privacy
policies based on dmoz.. doi: 10.1145/3422337.3447827
Zimmeck, S., Story, P., Smullen, D., Ravichander, A., Wang, Z., Reidenberg, J., . . . Sadeh,
N. (2019). Maps: Scaling privacy compliance analysis to a million apps. Proceedings
on Privacy Enhancing Technologies, 2019.