Enhancing Machine Comprehension of GDPR Privacy Policies Using Recent Advancements in NLP

This master's thesis explores enhancing machine comprehension of GDPR privacy policies using recent advancements in natural language processing (NLP). The authors introduce a scalable question answering (QA) system that can answer users' inquiries about relevant privacy issues in a post-GDPR regulatory landscape. They also introduce the first GDPRQA dataset in a SQuAD format. The thesis improves the state-of-the-art on a pre-GDPR policy QA dataset and implements a production prototype with an ElasticSearch information retrieval architecture. Potential business use cases are suggested for a GDPRQA assistant to help with GDPR compliance or as a personal data management tool.


Copenhagen Business School

Department of Digitalisation
cand.merc.(it.)(Data Science)
CDSCO4001E

M.Sc. Business Administration and Data Science


Master Thesis

Enhancing Machine Comprehension of
GDPR Privacy Policies Using Recent
Advancements in NLP
Leveraging Question Answering, Transformers and Deep Learning
in Post-GDPR Regulatory Landscape

By:
Alisa Ilina, 141804
Simon Christensen, 111803

Supervisor:
Daniel Hardt

Contract No: 22779


Pages: 114, Characters: 205 440

This thesis was written as a part of the Master of Science in Business Administration
and Data Science at CBS. Please note that neither the institution nor the examiners are
responsible – through the approval of this thesis – for the theories and methods used, or
results and conclusions drawn in this work.

May 16, 2022


Copenhagen
Abstract
GDPR has shifted the privacy policy regulatory landscape, providing EU citizens more control over their data. Yet, it has also resulted in lengthier, vaguer and more legally sophisticated policies, making them largely incomprehensible to a general audience. This contradicts the GDPR requirement to maintain transparent and easily understandable privacy policies, and it questions their legitimacy under the "notice and choice" principle when users consent to provide their data without full awareness of the privacy practices employed by organisations. On the other hand, the increased detail and structure of privacy policies under GDPR can arguably make them more comprehensible to machines. To address the unrealistic expectation that users read large volumes of privacy policies, and to equip stakeholders with a tool to comprehend policies at scale, this thesis takes a Transformer-based approach to machine comprehension of GDPR privacy policies. A scalable GDPR QA system is introduced which enables users to selectively explore relevant privacy issues and answers their inquiries in a modern post-GDPR regulatory landscape with an average F1 of ~71%. Moreover, the first GDPRQA SQuAD-formatted dataset is introduced. Transfer learning and data augmentation are utilised to learn the nuances of the GDPR privacy domain lexicon. The thesis also improves the current state-of-the-art of the pre-GDPR PolicyQA model by 7.6%. A production environment prototype with an accompanying ElasticSearch information retrieval architecture was implemented and evaluated. Furthermore, the thesis suggests potential business deployment use cases for a GDPRQA Assistant. Such a user-friendly complementary solution could be deployed internally to assist organisations with GDPR compliance, or externally as a personal data management tool.

Keywords: NLP, QA, Deep Neural Language Modelling, Transformers, RoBERTa, Transfer Learning, IR, ElasticSearch, Privacy Policy, GDPR, Legal NLP

Acknowledgement
We would like to express our sincere gratitude to our supervisor Daniel Hardt for his inspiring guidance and constructive criticism.

Further acknowledgements go to Amir Soleimani, Christof Monz and Marcel Worring from the University of Amsterdam for providing invaluable insights, an open factoid QA
dataset, and a code base to optimize SQuAD formatted QA. We would also like to thank
Wasi Ahmad, Jianfeng Chi, Yuan Tian, and Kai-Wei Chang for their contribution to the
PolicyQA dataset.

Copenhagen Business School


Copenhagen, 16 May 2022

Alisa Ilina Simon Christensen

List of Figures

1 The Architecture of an IR System . . . . . . . . . . . . . . . . . . . . . . . 16
2 The Inverted Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Masked Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 General Architecture of Question Answering System . . . . . . . . . . . . . 27
5 BERT for Estimating the Relevance of a Document in a Query . . . . . . . 28
6 BERT for Span QA from Reading Comprehension Task . . . . . . . . . . . 30
7 Traditional Learning vs Transfer Learning . . . . . . . . . . . . . . . . . . . 33
8 TinyBERT Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
9 The "Research Onion" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
10 CRISP-DM Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
11 Distribution of N-gram Prefixes in the Questions of PolicyQA . . . . . . . . 41
12 Example of a Question Answered within the Context . . . . . . . . . . . . . 46
13 Distribution of N-gram Prefixes of Questions in GDPRQA . . . . . . . . . . 49
14 Modelling Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
15 GELU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
16 Production Process Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
17 Potential Deployment Architecture . . . . . . . . . . . . . . . . . . . . . . . 70
18 Architecture with Improvements . . . . . . . . . . . . . . . . . . . . . . . . 73
19 Overview of Performance Based on F1 Test Score . . . . . . . . . . . . . . . 76
20 Epoch Evaluation SQuAD vs NLQuAD, F1 Score . . . . . . . . . . . . . . . 83
21 Best Performing Models after 3 Iterations — Testing vs Production Environment, F1 Test Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

List of Tables

1 Statistics on GDPR Privacy Policies . . . . . . . . . . . . . . . . . . . . . . 45
2 Overview and Frequencies of Questions in GDPRQA . . . . . . . . . . . . . 48
3 Comparison of the Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Parameters Used in Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5 GDPRQA-SQuAD Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 PolicyQA-SQuAD Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7 GDPRQA-NLQuAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8 PolicyQA-NLQuAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9 Best Performing Models Results . . . . . . . . . . . . . . . . . . . . . . . . 81
10 Epoch Optimization - SQuAD . . . . . . . . . . . . . . . . . . . . . . . . . 83
11 Epoch Optimization - NLQuAD . . . . . . . . . . . . . . . . . . . . . . . . 83
12 HPO Evaluation NLQuAD: Batch Size and Learning Rate . . . . . . . . . . 84
13 HPO Evaluation of SQuAD . . . . . . . . . . . . . . . . . . . . . . . . . . 84
14 Production Environment - GDPRQA . . . . . . . . . . . . . . . . . . . . . . 86
15 Production Environment - PolicyQA . . . . . . . . . . . . . . . . . . . . . . 86
16 List of GDPR Privacy Policies Companies . . . . . . . . . . . . . . . . . . . 105

Acronyms
API Application Programming Interface
BERT Bidirectional Encoder Representation from Transformers
BS Batch Size
CNN Convolutional Neural Network
DNN Deep Neural Network
ECHR European Convention on Human Rights
EM Exact Match
EU European Union
FFN Feed-Forward Network
FN False Negative
FP False Positive
GDPR General Data Protection Regulation
GELU Gaussian Error Linear Unit
HPO Hyperparameter Optimisation
IR Information Retrieval
IoU Intersection over Union
LR Learning Rate
MHA Multi-Head Attention
ML Machine Learning
MLM Masked Language Modelling
NE Named Entity
NLP Natural Language Processing
NLQuAD Non-factoid Long Question Answering Dataset
NSP Next Sentence Prediction
ODG Open Domain General Dataset
POS Part of Speech
QA Question Answering
RNN Recurrent Neural Network
RoBERTa Robustly Optimized BERT Pretraining Approach
SAS Semantic Answer Similarity
SaaS Software as a Service

SGD Stochastic Gradient Descent
SQuAD Stanford Question Answering Dataset
TN True Negative
TP True Positive

Contents

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Topic Delimitation and Superior Scope . . . . . . . . . . . . . . . . . . . . . 3
1.4 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Related Work 6
2.1 NLP in the Legal Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 NLP in Privacy Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Privacy Policies in Post-GDPR Regulatory Landscape . . . . . . . . . . . . 10

3 Theoretical Framework 13
3.1 GDPR and Privacy Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Term Weighting Schemes: TF-IDF and BM25 . . . . . . . . . . . . . 17
3.2.2 Inverted Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.3 Lucene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.4 ElasticSearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Deep Neural Language Modelling . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Vector semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.2 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.3 Bidirectional Encoder Representation from Transformers (BERT) . . 22
3.3.4 Fine Tuning BERT with MLM and NSP . . . . . . . . . . . . . . . . 23
3.3.5 BERT Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Question Answering (QA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.1 Types of QA Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.2 IR-based QA and BERT . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.3 Reading Comprehension and SQuAD . . . . . . . . . . . . . . . . . . 28
3.4.4 Retriever-Reader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.5 Evaluation Methods for QA . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Transfer Learning and Knowledge Distillation . . . . . . . . . . . . . . . . . 32

4 Methodology 35
4.1 Research Philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Research Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.2 Methodological Research Choice . . . . . . . . . . . . . . . . . . . . 37
4.1.3 Research Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.4 Research Time Horizon . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.5 Techniques and Procedures . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 CRISP-DM Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Business Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Data Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.1 Exploration of PolicyQA . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.2 Acquisition and Exploration of GDPRQA . . . . . . . . . . . . . . . 43
4.5 Data Preparation and Preprocessing . . . . . . . . . . . . . . . . . . . . . . 50
4.6 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6.1 Modelling Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6.2 Iteration 1: General Knowledge Acquisition . . . . . . . . . . . . . . 53
4.6.3 Iteration 2: Transferring Knowledge to Privacy Policy Domain . . . 53
4.6.4 Iteration 3: Optimising the Best Performing Model . . . . . . . . . . 54
4.6.5 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.6.6 Model Associated Files . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6.7 Simulation of Production Environment Prototype . . . . . . . . . . . 62
4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.7.1 Evaluation of Model Results . . . . . . . . . . . . . . . . . . . . . . . 67
4.7.2 Evaluation of Simulated Production Prototype . . . . . . . . . . . . 68
4.8 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.8.1 Application Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8.2 Data Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.8.3 PolicyCrawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.8.4 Machine Learning Layer . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.8.5 Feedback Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.8.6 Further Improvements to the Architecture . . . . . . . . . . . . . . . 72
4.9 Reliability and Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5 Results 74
5.1 Test Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1 Models Pre-trained on SQuAD . . . . . . . . . . . . . . . . . . . . . 76
5.1.2 Models Fine-Tuned on NLQuAD . . . . . . . . . . . . . . . . . . . . 78
5.1.3 Top Model Performance Results . . . . . . . . . . . . . . . . . . . . 81
5.1.4 Hyperparameter Optimisation Findings . . . . . . . . . . . . . . . . 82
5.2 Production Setup Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.1 Production Results GDPRQA . . . . . . . . . . . . . . . . . . . . . . 85
5.2.2 Production Results - PolicyQA . . . . . . . . . . . . . . . . . . . . . 86

6 Discussion 87
6.1 Test Environment Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1.1 Iteration 1 - Open Domain General Knowledge Learning . . . . . . . 87
6.1.2 Iteration 2 - Transfer Learning on GDPR Privacy Policy Domain . . 88
6.1.3 Iteration 3 - Hyperparameter Optimisation Discussion . . . . . . . . 88
6.2 Production Environment Discussion . . . . . . . . . . . . . . . . . . . . . . 89
6.3 Business Use Case — Deployment Discussion . . . . . . . . . . . . . . . . . 91
6.3.1 Internal Company Deployment . . . . . . . . . . . . . . . . . . . . . 92
6.3.2 External Web Deployment . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.3 Mobile Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.4 Recommendations for Deployment . . . . . . . . . . . . . . . . . . . 93
6.4 Answering Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Contribution and Implication for Research . . . . . . . . . . . . . . . . . . . 96
6.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.6.1 Data Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.6.2 Labelling Subjectivity . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.6.3 Computational Power Limitations . . . . . . . . . . . . . . . . . . . 98
6.7 Learning Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.8 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.8.1 Standardisation of Labelling and Upscaling of the Dataset . . . . . . 99
6.8.2 Larger Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.8.3 Longformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.8.4 Dense Passage Retrievers . . . . . . . . . . . . . . . . . . . . . . . . 100

6.8.5 Text Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.8.6 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.8.7 Other Machine Reading Comprehension NLP Tasks . . . . . . . . . 101

7 Conclusion 102

8 Appendix 105

References 108


1 Introduction

1.1 Background

Digital society is growing in complexity and data needs, underscoring the urgency of data protection more than ever (Haq, 2021). Data privacy has become a crucial social concern, with privacy policies acting as a medium for personal data management. Privacy policies inform users of how companies collect, use, share and manage personal data, alongside the rights users have with regard to their data. Companies disclose their practices by providing privacy policies on their websites and requesting consent from the users. The privacy landscape has evolved considerably since the introduction of privacy laws, such as the GDPR in the EU, which require organisations to make their privacy practices accessible to users under specific expectations. GDPR has changed the regulatory landscape, allowing EU citizens more control over their data.

Digital users increasingly care about their privacy. Our data is continuously tracked, with apps monitoring our environment, bodies, and activity, as the data we share becomes a new form of "currency". Incidents of malicious intent, misuse of information and data leakage raise concerns among people who want to remain in control of their personally identifiable information, making data awareness a salient issue. Numerous recent incidents have drawn attention to the misuse of personal data: for instance, throughout the 2010s, Cambridge Analytica collected the personal data of millions of Facebook users without consent for political advertising, resulting in a major privacy scandal disclosed in 2018 (Lapowsky, 2019).

Yet, despite rising data protection concerns, research shows that users rarely read privacy policies prior to giving their consent (Alduaij, Chen, & Gangopadhyay, 2016). An array of issues limits users in their right to control their personal data. To start with, verbose, lengthy explanations often result in users neglecting to read the privacy policies governing the collection and use of their data. McDonald and Cranor (2008) estimated that the average person would need 201 hours annually to read through all the privacy policies they encounter. Since the time of their research, the exponential increase in the volume, velocity, variety and veracity of data, together with increasingly sophisticated technologies, has only been resulting in lengthier and more complicated privacy policies.

Moreover, although some companies have improved the coherence of their policies since the introduction of GDPR, most still make little effort at clear elaboration, which remains an obstacle to policy comprehension. The considerable energy and time investment required hinders users from making informed decisions (Obar & Oeldorf-Hirsch, 2020). Furthermore, although privacy policies target regular users, the legal language used may be challenging for an ordinary consumer to understand (Oltramari et al., 2018). GDPR attempts to force companies to use easily comprehensible language, but does not elaborate on what that entails. Krumay and Klar (2020) find that the readability of privacy policies is poor, and that at least a college degree is needed to understand them.

Users should be aware of the risks that come with their consent in order to be empowered to make informed decisions. The legitimacy of privacy policies depends on the user's comprehension of the document which, as research shows, rarely happens (Reidenberg, 2000). As with any legal document, privacy policies operate under the "notice and choice" principle: users are expected to read a policy before deciding to consent, on the condition that it matches what they expect their data to be used for. Users might claim to have had their data misused when they have consented to an ambiguous privacy policy without fully comprehending its conditions. The compromises users make with regard to sharing their data are highly nuanced, further complicating the matter.

1.2 Problem Statement

Users are not equipped to comprehend policies at such a scale, veiled as they are in ambiguous, lengthy language. Thus, stakeholders lack tools and solutions to address the depth and breadth of privacy policies. Certain attempts have been made to address this issue,
such as machine-readable formats of policies (Cranor, 2002), and privacy icons introduced by the EU [1] (European Commission, 2016). Yet, these initiatives have been restrained by the manual labour they require and their limited scalability, as the effort needed to retrofit new notices to existing policies and maintain them over time is disproportionately high.

[1] "Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC"

This dynamic manifests the potential to leverage Natural Language Processing (NLP) techniques to interpret privacy policies, empowering users to selectively explore issues salient to their usage of the respective application or website.

1.3 Topic Delimitation and Superior Scope

This paper will build on the recent advancements in natural language processing to allow
users to retain control over their privacy in more meaningful ways, addressing the unre-
alistic expectation of users to read numerous policies on an almost daily basis. It will
aim to initiate the development of a model capable of improving machine comprehension
in this domain. The research also attests to the importance of meeting the transparency
requirement of GDPR by enhancing comprehension of privacy policies.

One way to enhance the reading comprehension of privacy policies is to enable automatic answering of privacy-related user inquiries. Successfully accomplishing this scope will result in a Question Answering (QA) system that empowers users to comprehend lengthy, verbose, often linguistically complex privacy policies with ease and in a time-efficient manner. The system will be able to extract data reflecting user needs in a modern post-GDPR regulatory landscape at scale.

The thesis will take a Transformer-based approach to improve machine comprehension of GDPR privacy policies through a QA system. It will utilize existing datasets as well as
adapt to the GDPR domain through dataset augmentation and transfer learning. More-
over, the paper will simulate and evaluate a prototype of a production environment. Lastly,
it will also propose the deployment architecture for a GDPRQA Assistant.

The thesis possesses a high degree of complexity and requires a profound technical and
legal understanding of the matter. Given restricted time, scope and resources, the thesis
is designed to contribute to the delimited scope and build the foundation for a scalable
QA system for post-GDPR privacy policy reading comprehension.


1.4 Research Question

Therefore, based on this premise, the paper will aim to answer the following research
questions:

RQ: How can recent advancements in NLP be leveraged to build a Question Answering
system that can answer privacy policy inquiries in the post-GDPR regulatory landscape?
Q1: How can Transformers, Deep Neural Networks, Transfer Learning, and data aug-
mentation aid in adaptation to a post-GDPR privacy domain and improve the current
state-of-the-art?
Q2: How can a developed privacy policy Question Answering system be implemented and evaluated with respect to potential deployment in a real-world scenario?

1.5 Thesis Structure

This section is meant to guide the reader through the structure of the thesis. Following
Introduction, Related Work will provide a literature overview of the relevant research in
the area of legal NLP, specifically within the automation of privacy policy comprehension.
Next, Theoretical Framework presents the concepts around GDPR and the differences it
has brought to privacy policies. It will further present Information Retrieval systems and
ElasticSearch, Deep Neural Language Modelling, and Transformer architectures, including
variations of BERT, Question Answering and the application of BERT. Additionally, it will
explain the concept of knowledge distillation and transfer learning. Further, Methodology
will display the key aspects of the research design. Guided by the CRISP-DM frame-
work, the thesis storyline evolves from understanding user needs and business objectives,
to in-depth analysis and exploration of privacy policy data. Specifically, the secondary
PolicyQA dataset and primary self-annotated GDPRQA dataset, as well as the collected
privacy policies are explored. Moreover, Methodology elaborates on all iterations of the
modelling – (1) open-domain general knowledge learning, (2) transfer learning of domain-
specific privacy knowledge, and (3) hyperparameter optimisation. It also discusses
the simulation of the models in a production environment and possible deployment use
case implications. In Results the models are evaluated using the relevant performance
metrics introduced in Theoretical Framework and compared against one another. Discussion presents contextualised results, answering the research questions. It elaborates on the limitations, implications, future work, and learning reflections. Lastly, Conclusion
summarises the thesis.


2 Related Work
This chapter provides a glimpse into the existing research to understand the foundation
that the thesis is built upon. The chapter starts by reviewing the literature on the appli-
cation and popularity of NLP within the legal domain. It discusses the widespread usage
of BERT, with its domain-specific variations such as LegalBERT and PrivBERT. The
introduction of PrivBERT transitions the reader to the next section which dives into more
specifics behind the NLP research done within the area of privacy policies. The section
introduces the OPP-115 corpus, the building foundation which enabled extensive research
to be done and further datasets to be created. The researchers in this field have achieved
state-of-the-art results in a variety of NLP tasks, such as classification, text summarisation, and Question Answering. Yet, the prevailing use of pre-GDPR privacy policy corpora means that the accuracy and scalability of working with GDPR-relevant policies remain largely unexplored. Research has revealed the substantial shifts which GDPR has brought into the regulatory landscape, suggesting the necessity of creating datasets with GDPR-relevant aspects.

2.1 NLP in the Legal Domain

Most laws are expressed in natural language, thus manifesting the potential for analysing
and predicting juridical language at scale with the use of NLP by transforming unstruc-
tured natural text into machine-readable representation. Over the last decade, the synergy
between law and NLP has seen prominent growth. NLP can address some inefficiencies of current practice, such as limited scalability, driven by the increasing number of digital legal repositories as well as growing algorithmic and hardware power. The potential of NLP in the legal domain includes context summarization, document relevance scoring, prediction of judicial outcomes, Question Answering, sentiment analysis, and others (Nay, 2018).

Transformer architectures such as BERT, which will be further discussed in Section 3.3, have been in the spotlight of NLP in recent years. A common practice has become to augment pre-training with domain-specific data, allowing models to capture domain-specific nuances. BERT language models are often trained to learn the jargon of a given domain in order to achieve higher performance, for instance in financial, legal, medical, and other industry-specific fields. Examples include SciBERT (biomedical and computer science literature corpus), FinBERT (financial services corpus), BioBERT (biomedical literature corpus), ClinicalBERT (clinical notes corpus), mBERT (corpora from multiple languages), PatentBERT (patent corpus), and LegalBERT (legal literature corpus) (Wiratchawa, Khunthong, & Intharah, 2021).
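Domain-adaptive pre-training of the kind listed above typically continues BERT's masked language modelling (MLM) objective on in-domain text: tokens are hidden and the model is trained to recover them. The following is a simplified, stdlib-only sketch of how MLM training pairs can be constructed; real pipelines use WordPiece subword tokenisation, dynamic masking and an 80/10/10 mask/random/keep split, none of which are shown here:

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Build one masked-language-modelling training pair (simplified)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")   # hide the token from the model
            labels.append(tok)        # ...and make it the prediction target
        else:
            inputs.append(tok)
            labels.append(None)       # position not scored by the MLM loss
    return inputs, labels

sentence = ("personal data shall be processed lawfully fairly and "
            "in a transparent manner").split()
inputs, labels = make_mlm_example(sentence)

# Masked positions and prediction targets line up one-to-one.
assert all((tok == "[MASK]") == (lab is not None)
           for tok, lab in zip(inputs, labels))
```

Continued pre-training with this objective on in-domain text is what lets models such as PrivBERT pick up privacy-specific lexicon before any task fine-tuning.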

LegalBERT is a family of BERT models intended for legal NLP applications (Wiratchawa et al., 2021). It was pre-trained on 12 GB of English legal literature, drawn from EU and UK legislation, rulings of the EU Court of Justice and the European Court of Human Rights, as well as US court cases and contracts. Wiratchawa et al. (2021) found performance gains from using a legal domain-specific BERT for QA on legal documents. It achieved state-of-the-art outcomes in several tasks: multi-label classification of ECHR court cases, NER of contract headers and details, and a legal QA system. Domain knowledge plays a significant role when dealing with legal language due to the specific terminology and context used. The researchers also present LegalBERT-SMALL, three times smaller than the original LegalBERT while still maintaining strong performance (Wiratchawa et al., 2021).

Moreover, Srinath, Wilson, and Giles (2021) have developed a privacy domain-oriented language model, PrivBERT, trained to capture the specifics of privacy policy semantics and lexicon. PrivBERT will be further introduced and discussed in the theoretical framework in Section 3.3.

2.2 NLP in Privacy Policies

In-depth research combining natural language processing and privacy policies began with the release of the OPP-115 corpus, the first large-scale effort to provide fine-grained, sentence-level annotation of policies (Wilson et al., 2016). Each sentence is classified into one of ten categories, with category-specific attributes to ensure specificity:

1. First Party Collection/Use: The sentence includes how or why a controller collects
information about the user.

2. Third Party Sharing/Collection: The sentence includes how user information may be shared or collected by third parties.

3. User Choice/Control: The sentence includes information on the users’ choices or
control options.

4. User Access, Edit & Deletion: The sentence includes information about how and if
the users can access, edit or delete their data.

5. Data Retention: The sentence includes information about how long the data is stored.

6. Data Security: The sentence includes information about how the user’s information
is secured.

7. Policy Change: The sentence contains information about if and how the users will
be informed when a policy changes.

8. Do Not Track: The sentence includes information about how and if the company
tracks information related to advertising.

9. International & Specific Audiences: The sentence includes information that targets
a specific group, such as Californians or Europeans.

10. Other: The sentence includes other information not related to the above-written
categories.

Wilson et al. (2019) acquired 115 randomly selected policies amounting to 266,713 words.
In total, they collected 23,194 data practice annotations, and they used linear regression,
support vector machines, and a hidden Markov model to classify segments of policies
into the aforementioned classes.

The introduction of the first fine-grained annotated privacy policy dataset, OPP-115, en-
abled a broad range of NLP techniques to be implemented to comprehend and analyze
privacy policies. The OPP-115 corpus has since been used to train a variety of models,
for instance, models to extract opt-out choices from privacy policies (Sathyendra, Schaub,
Wilson, & Sadeh, 2016), models to identify policies on websites and their compliance
issues (Story et al., 2019), and models to classify privacy practices and answer non-factoid
questions of privacy policies through the framework Polisis (Harkous et al., 2018).


Specifically, Polisis was the first to introduce an automated framework for privacy
policy analysis. Polisis uses a hierarchy of neural network classifiers with
privacy-specific word embeddings and multi-label classification. The top-level model
classifies a segment based on the aforementioned categories, and the lower-level CNN
classifier utilises the attributes with respect to each category (Harkous et al., 2018,
pg. 5). Additionally, Harkous et al. introduced PriBot in their Polisis framework,
which used a carefully designed annotation ontology with broad coverage of personal
information, enabling it to be used within advertising, analytics, legal requirements,
marketing, etc.

Further corpora similar to OPP-115 Corpus have enabled extensive research on privacy
practices on a broad level. The PrivacyQA corpus spans 35 mobile application privacy
policies, consisting of 1,750 questions and expert annotated answers for the privacy policy
question answering task (Ravichander, Black, Wilson, Norton, & Sadeh, 2020). The 1,750
questions in the PrivacyQA corpus were crowdsourced through MechanicalTurk, and
seven experts with legal training constructed answers to the Turkers’ (MechanicalTurk
users’) questions to identify legally sound answers. The PolicyQA corpus (Ahmad et al., 2020)
adopted the same approach, but with several key differences. PolicyQA is based on
OPP-115 policies, spanning 714 questions and 25,017 answers. Instead of sentence
selection, it uses reading comprehension, and domain experts performed both question
generation and answer annotation. PolicyQA will be utilized and explored further in
Section 4. Lebanoff and Liu (2018) created the first corpus for detecting the vagueness
of a policy, using human-annotated vague words and sentences in privacy policies.
Sathyendra, Wilson, Schaub, Zimmeck, and Sadeh (2017) showcased a dataset with a
corresponding model to analyze, identify and categorize opt-out choices offered in privacy
policies.

Further datasets without labels were released, such as the one by Zimmeck et al. (2019),
which collected over 400k URLs to Android app privacy policy pages by crawling the
Google Play store. Amos et al. (2020) focused on the evolution of privacy policies over
time and presented an analysis and dataset from approximately 130,000 websites from
over two decades. Moreover, Zaeem and Barber (2021) gathered around 100,000 privacy
policies using company domains from DMOZ, a website which maintains categories of
websites on the internet. Finally, the PrivaSeer corpus of 1,005,380 English language website
privacy policies was introduced by Srinath et al. (2021), where the number of unique web-
sites represented in the corpus was about ten times bigger than the second-largest publicly
available privacy policy dataset created by Amos et al. (2021). It further surpassed the
aggregate of unique websites represented in all other publicly available web privacy policy
corpora combined (Srinath et al., 2021).

2.3 Privacy Policies in Post-GDPR Regulatory Landscape

“The introduction of the General Data Protection Regulation (GDPR) caused significant
changes to the practice of digital companies holding personal data, as well as the way they
communicate it to their clients” (Linden, Khandelwal, Harkous, & Fawaz, 2020, pg. 1).

Research by Linden et al. (2020) has revealed that, under the transparency requirement of
GDPR, privacy policies have become much more detailed, verbose and structured. Linden
et al. have suggested the necessity of a GDPR-augmented dataset in order to reflect and
analyse the changes in privacy policies under GDPR.

Moreover, more companies started complying with the law in order to guard themselves
against lawsuits. Degeling et al. (2019) searched for privacy policies on 6,357 websites
and estimated a 4.9% increase in the number of websites having a policy after the
introduction of GDPR.

Gallé, Christofi, and Elsahar (2019) have evaluated the need for a GDPR-related dataset
in order to unleash the potential of ML techniques to assess privacy policies with respect
to GDPR compliance. The authors claim that only the missing elements need to be annotated
to augment the existing datasets. GDPR has introduced novel ways of handling privacy,
with specific mandatory elements to be included in every policy which are currently
unaddressed by the existing datasets. Hence, this signifies the necessity for this
thesis to derive a GDPR dataset.

Gallé et al. (2019) called out “four potential problems that the use of previous datasets can
have when training models that would address the new requirement of that regulation” [pg.
11]. These problems include the effect of new elements, multi-linguality, domain shift in
the type of companies, and domain shift due to GDPR adaptation.

1. The effect of new elements. To start with, GDPR requires companies to include
contact details, including those of the assigned Data Protection Officer (DPO).
Existing datasets do not provide such information, yet it could be extracted with,
for instance, named entity recognition.

Moreover, GDPR facilitates specific provisions for “sensitive” data, which received
little attention in pre-GDPR times. Furthermore, as described above, GDPR provides
a set of specific rights. The existing OPP dataset includes only one generic label,
“User Choice”, lacking specifics on the user rights, hence identifying a crucial
gap in the existing datasets.

Next, GDPR strongly considers the risks of international transfers, especially those
outside of the EU/EEA. Existing annotations only partially refer to the data trans-
fer, missing details on where the third parties operate. Furthermore, automated
decision-making (Art. 22) is another GDPR aspect that is not covered in existing
datasets.

2. Multi-linguality. Micro-organisations often only operate locally, thus using the
local language for their privacy policies. The fact that the European Union has 24
official languages poses the dilemma of multilingual natural language processing.
However, this issue could be addressed through automatic machine translation and
thus is not deemed an issue when deriving a GDPR dataset.

3. Domain shift on the type of companies. Current datasets focus on popular websites
based on the Alexa (alexa.com) ranking, thus neglecting variation in industry,
location, and size.

4. Domain shift due to GDPR adaptation. GDPR has resulted in new terminol-
ogy, structure and details in privacy policies, thus contributing to a whole shift in
the used language and redistribution of data. ML tools are rather sensitive to such
changes, thus introducing GDPR annotated dataset could benefit the research.


Other research has been done with regard to GDPR, specifically in easing the
understanding of privacy policies post-GDPR. Truong, Sun, Lee, and Guo (2020) envisioned
a personal data management platform designed around GDPR compliance through the
use of blockchain. Tesfay, Hofmann, Nakamura, Kiyomoto, and Serna (2018) created
PrivacyGuide, a classification tool for European privacy policies. The classification
tool is built upon 45 annotated privacy policies and is significantly shallower than
OPP-115. Furthermore, it only covers 11 first-level categories. Torre et al. (2019)
presented a UML representation of the GDPR as a step toward automated compliance
checking, and Palmirani and Governatori (2018) introduced a framework for compliance
checking through modelling of legal documents.

In comparison to the prior Question Answering approaches, we encourage developing a
Question Answering system capable of extracting answers by using PolicyQA as general
privacy policy questions and answers, combined with a newly annotated SQuAD-formatted
GDPR-specific dataset, GDPRQA.


3 Theoretical Framework
The following section will provide the theoretical concepts relevant to the exploration
of the topic of privacy policy natural language comprehension. Section 3.1 will set
the legal foundation of the paper by discussing the key elements of GDPR and privacy
policies, including the challenges and opportunities induced by the shift of the regulatory
landscape with the effect of GDPR. It will outline the effect and novel structural changes
under GDPR on both the individual and company level. Moreover, the section will refer to
the specific differences in the nature of privacy policies post-GDPR, such as terminology,
verbosity, and new structural elements. Next, the thesis embarks on the technical theoret-
ical overview comprised of four topics: Information Retrieval 3.2, Deep Neural Language
Model architectures and Transformers 3.3, Question Answering 3.4, and transfer learning
3.5. Information Retrieval will cover the topics of term weighting schemes, inverted index,
as well as Lucene as the foundation of the ElasticSearch search engine. Next, Section 3.3
will revolve around the topics of Deep Neural Language Modelling and Transformer archi-
tectures. It will elaborate on the vector semantics, discuss in detail the algorithms behind
Transformers and, specifically, BERT with its variations, such as RoBERTa. Section 3.4
will discuss Information Retrieval based Question Answering systems, the use of encoders,
such as BERT, in Question Answering, and the theory behind Retrievers and Readers.
Lastly, Section 3.5 will discuss transfer learning and knowledge distillation, and the appli-
cation of BERT (DistilBERT and TinyBERT) for performing transformer distillation to
generate target task-specific knowledge.

3.1 GDPR and Privacy Policies

A privacy policy is a legal document that discloses the practices in which a party collects,
uses, manages, processes and discloses personal information. Privacy policies were first
addressed by the Council of Europe during a study of the risks posed by technology’s
influence on human rights (Murtezić, 2020). This resulted in Convention 108 of
28 January 1981, with the Council recommending the development of a policy to protect
personal data. Personal data is thereby considered to be information that identifies a
real person.

The General Data Protection Regulation (GDPR) came into effect on the 25th of May,
2018 (Trzaskowski & Sørensen, 2019). It was designed around the purpose of providing
EU citizens with more control over their personal data as well as simplifying the regulatory
environment in which businesses navigate across all EU member states. The introduction
of GDPR has shifted the way businesses process the data and the way they communicate
their practices, including new structure, obligatory details and terminology in privacy
policies.

GDPR has harmonised the privacy regulations across the EU states. GDPR obliges
organizations to provide citizens with a privacy policy that is presented “in a concise,
transparent, intelligible, and easily accessible form”. It has to be written in clear and plain
language and delivered in a timely manner. Moreover, it requires the companies to be
transparent in their disclosure of any collection, processing, storage, or transfer of per-
sonal data (Trzaskowski & Sørensen, 2019).

GDPR demands appropriate protection safeguards, leading to challenges and
opportunities for organisations. Despite it only affecting EU citizens, the effect is
global, as it affects any organisation that deals with the EU market.

GDPR gives EU citizens control over their data by referring the individual to a number
of specific rights. GDPR also states that the exercising methods of the above-mentioned
rights should be disclosed clearly in the privacy policy. The rights include the following:

1. The right to withdraw consent (Art. 7)

2. The right to object, including direct marketing (Art. 21)

3. The right to access (Art. 15)

4. The right to rectification (Art. 16)

5. The right to erasure (the right to be forgotten) (Art. 17)

6. The right to restrict processing (Art. 18)

7. The right to data portability (Recital 68)

8. The right to lodge a complaint (Art. 77)


Furthermore, under GDPR, data controllers and processors also face stricter
requirements, such as data protection by design and by default (Art. 25) for increased
security and to “implement appropriate technical and organizational measures”, as well
as recording all processing activities (Art. 30). Organisations are also held accountable
for non-compliance, meaning stricter security measures and regulations to process and
manage data should be undertaken. According to Art. 82(2), “any controller involved in
processing shall be liable for the damage caused by processing”; failure to comply may
result in large fines for data breaches, amounting to up to €20 million or 4% of worldwide
turnover for the preceding financial year.

Art. 5 states the foundational principles of processing personal data which companies
must adhere to. They include the following principles:

1. Lawfulness, fairness and transparency – should be implemented in the processing in
relation to the data subject

2. Purpose limitation – collection of data only for “specified, explicit and legitimate
purposes”

3. Data minimisation – “adequate, relevant and limited to what is necessary in relation
to the purposes”

4. Accuracy – data kept up to date and ensured in accuracy

5. Storage limitation – data kept in a form which permits identification of data subjects
for no longer than is necessary for the purposes

6. Integrity and confidentiality – “processed in a manner that ensures appropriate
security of the personal data”

7. Accountability – the controller shall be responsible for processing and demonstrating
compliance with GDPR.

GDPR clearly distinguishes between sensitive and non-sensitive personal data. Sensitive data
refers to information such as racial or ethnic origin, political opinions, religious or philo-
sophical beliefs, health, trade union membership, genetic and biometric data, and sexual
orientation. GDPR obliges companies to undertake more stringent protection rules and
stricter requirements for processing.

Art. 22 of GDPR introduces automated decision-making and profiling, which refers to
automated means of processing personal data to conclude certain information about an
individual. It can be beneficial for companies as it can result in quick and consistent
decision-making. However, from a user perspective, it can lead to significant risks when
implemented irresponsibly and with bias. Data subjects have the right “not to be subject
to a decision based solely on automated processing, including profiling, which produces legal
effects concerning him or her or similarly significantly affects him or her.”

Moreover, GDPR identifies which countries outside of the EEA can be considered
“adequate” or “non-adequate” in terms of their level of personal data protection.
Transfers to “adequate” countries are permitted under GDPR, while “non-adequate”
locations require appropriate safeguards to be put in place. Therefore, GDPR signifies
the importance of stating the location of data storage in the privacy policies exposed
to the individuals. GDPR provides the necessary list of safeguards for the protection
of personal data to guarantee its safety.

3.2 Information Retrieval

Information Retrieval, also referred to as IR, encompasses the retrieval of information
in any form given specific information needs. A developed IR system is often referred to
as a search engine. The user structures a query for the retrieval system, which in turn
can output a set of ranked documents from a specific collection, as seen in Figure 1.

Figure 1: The Architecture of an IR System


3.2.1 Term Weighting Schemes: TF-IDF and BM25

Computing a term weight for each word in a document allows scoring the match between
a document and a query. The most commonly used term weighting schemes are TF-IDF
and BM25.

TF-IDF computes a sparse model which captures word meanings by calculating term
frequencies (tf) and inverse document frequencies (idf). tf estimates the frequency of
a word t in the document d. A popular approach is to dampen the term frequency to
mitigate skewness by taking log10. That can be formulated as the following:
tf_{t,d} = 1 + log10(count(t, d))   if count(t, d) > 0
tf_{t,d} = 0                        otherwise

The idf is calculated by idf_t = log10(N / df_t), where N is the total number of documents
and df_t is the number of documents which contain the term t. It allows providing higher weight
to words present in only a few documents. For a term to get a high tf-idf weight, the
term should hold a high term frequency (tf ) in a document (local parameter) and a low
document frequency across all documents (global parameter).
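The scheme above can be sketched in a few lines of Python (a minimal illustration; the toy corpus and whitespace tokenisation are our own assumptions, not part of the thesis prototype):

```python
import math

def tf(term, doc_tokens):
    # Log-dampened term frequency: 1 + log10(count) if the term occurs, else 0.
    count = doc_tokens.count(term)
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(term, docs):
    # Inverse document frequency: log10(N / df_t).
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf(term, doc_tokens, docs):
    # High weight requires high tf in the document and low df across documents.
    return tf(term, doc_tokens) * idf(term, docs)

docs = [["data", "privacy", "policy"],
        ["privacy", "notice"],
        ["cookie", "policy", "policy"]]
# "cookie" is rare (df = 1), so it receives a higher idf than "privacy" (df = 2).
score = tf_idf("cookie", docs[2], docs)
```

Note how the local (tf) and global (idf) parameters combine multiplicatively, matching the description above.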

BM25 is a more powerful and complex variant of term weighting, also referred to as Okapi
BM25 (Robertson et al., 1995). BM25 introduces parameters k, which adapts the balance
between tf and idf , and b, which manages the significance of normalising the length of
a document. Therefore, given a query q, the score of document d can be formulated as
follows:

score(d, q) = Σ_{t∈q} log(N / df_t) · tf_{t,d} / ( k(1 − b + b(|d| / |d_avg|)) + tf_{t,d} )

where |d_avg| represents the length of an average document. If k = 0, BM25 does not use
term frequency, while a large value of k corresponds to raw term frequency. b spans from 0
(no document length scaling) to 1 (full scaling by document length).
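The BM25 score can be sketched as follows (a simplified illustration with pre-tokenised documents; the default parameter values k = 1.5 and b = 0.75 are common choices, not values from the thesis):

```python
import math

def bm25_score(query_terms, doc, docs, k=1.5, b=0.75):
    # score(d, q) = sum over t in q of
    #   log(N / df_t) * tf_{t,d} / (k * (1 - b + b * |d| / |d_avg|) + tf_{t,d})
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)
        tf = doc.count(t)
        if df == 0 or tf == 0:
            continue  # the term contributes nothing to the score
        norm = k * (1 - b + b * len(doc) / avg_len)  # length normalisation
        score += math.log(N / df) * tf / (norm + tf)
    return score
```

Setting k = 0 makes the tf factor collapse to 1 for any occurring term, reproducing the pure idf behaviour described above.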


3.2.2 Inverted Index

An inverted index is the most commonly used full-text search method, which is performed
by matching a term to its containing document. The inverted index consists of a list of
all unique words that are present in a document, as well as a list of documents for each
given word in which it appears. An inverted index, given a particular query, is capable of
providing a list of documents containing the given term, as shown in Figure 2. It is based
on a dictionary where the keys represent tokens, and the values point to the IDs of the
documents which contain the given term.

Figure 2: The Inverted Index
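The dictionary structure described above can be sketched directly (a minimal illustration; the toy documents and lower-case whitespace tokenisation are our own assumptions):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each token to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {1: "we collect personal data",
        2: "data is stored securely",
        3: "we use cookies"}
index = build_inverted_index(docs)
print(sorted(index["data"]))  # → [1, 2]
```

A query term lookup is then a single dictionary access, which is what makes full-text search over large collections fast.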

3.2.3 Lucene

Lucene is a full-text search library built in Java. It takes a search query as input and
returns documents ranked by their relevance to the query. Lucene offers an array of
configurable forms of search, such as boolean queries, term queries, and range queries.
Lucene indexes the terms, and each given term is comprised of a token and a field name,
thus the searches are always field-specific (Karim, 2017). Once an index segment is
committed, it becomes immutable, making Lucene append-only.


3.2.4 ElasticSearch

ElasticSearch, built on Apache Lucene, is a distributed, open-source search and analytics
engine. ElasticSearch provides scalable, rapid and real-time search (full-text and
structured), as well as powerful exploratory capabilities. It is possible to communicate
with ElasticSearch using a RESTful API (Gotmlet, 2015).

In ElasticSearch, a document is the unit of search. It stores entire documents and indexes
the context of each document. An index consists of one or more documents, and the latter
consists of one or more fields. Every field of its distributed real-time ’document store’ is
searchable when indexed (Gotmlet, 2015). ElasticSearch is able to provide a fast response
time due to the structure of Lucene and the use of inverted index. When a query is ex-
ecuted, the inverted index table, as showcased in Figure 2, is looked up to find possible
documents containing terms in the query. It utilizes indices and shards to separate data
which in turn enables distribution and allows for high horizontal scalability.

The core concepts of ElasticSearch include the following:

• Document: A document is a JSON object that is stored in an ElasticSearch index. In
a traditional relational database, it can be compared to a table row.

• Field: A field is the individual unit of data, defined by a data type.

• Shard: A shard is a single Lucene index.

• Replica Shard: A replica shard is ElasticSearch’s fail-safe mechanism, representing
a copy of a shard. It is distributed to a node that does not contain the shard that
the replica is made of. Replica shards can also improve speed by serving read
requests.

• Node: A node is an instance of ElasticSearch. Generally, one node is run on a
single server. Any node in the cluster can be queried and will forward requests to
the nodes which possess the queried data.

• Cluster: A cluster consists of one or more ElasticSearch nodes. As a cluster
increases in size, it will reorganize the data and spread it across the nodes.

• Index (noun): An index is the location of document storage, from which documents
can be retrieved and queried.

• Analyzer: An analyzer is leveraged during indexing to break down or parse terms
from a string.

• Mapping: A mapping contains information on the type of data that each field holds
(Berman, 2019).
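As an illustration of these concepts, a minimal mapping and a full-text query in ElasticSearch's JSON request DSL might look as follows (the `policies` index and its field names are hypothetical examples, not the thesis prototype's actual schema):

```json
PUT /policies
{
  "mappings": {
    "properties": {
      "company": { "type": "keyword" },
      "body":    { "type": "text", "analyzer": "english" }
    }
  }
}

GET /policies/_search
{
  "query": { "match": { "body": "data retention period" } }
}
```

The `match` query runs the search text through the field's analyzer and looks the resulting terms up in the inverted index, returning documents ranked by relevance.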

3.3 Deep Neural Language Modelling

3.3.1 Vector semantics

Vector semantics represents a word in a multidimensional vector space constructed from
the distributions of other words in its proximity. Vector semantics builds on the
distributional hypothesis³ about the similarity of word meanings. It is done by learning
the embeddings from distributions in passages of privacy policies. Embeddings, a key
element of representation learning in NLP, can be static or contextualised, such as BERT,
which will further be explained in more detail in Section 3.3. Embeddings are short dense
vectors which generally perform better compared to, for instance, n-gram generated sparse
vectors. The reason is the small parameter space needed, which in turn prevents
overfitting and improves generalisation.

“Language is a sequence that unfolds in time”, Jurafsky and Martin (2002, pg. 172). Thus,
tools that do not consider the dimension of time and allow access to all inputs at the
same time fail to capture the temporal essence of language. Question Answering, just
like many other NLP challenges, signals the need for access to distant context in text,
which is possible using sequential and transformer architectures. These two architectures
overcome the constraints of the Markov assumption⁴ and manage to use greater contexts
in text, giving the next token in the text a conditional probability. Yet, despite the
sequential architectures being able to retain past information, they cannot address the
issues of what the model's attention should be paid to, and how to train the model efficiently
in that regard. This is when Transformers and BERT prove why they have revolutionised
NLP, carving the way to becoming the key building block for text analytics.

³ “Distributional hypothesis is the link between similarity in how words are distributed and similarity in
what they mean”, Jurafsky and Martin (2002, p. 96).
⁴ The Markov assumption states that the next state is assumed to be dependent only upon the current state
(Jurafsky & Martin, 2002).

3.3.2 Transformers

To start with, Transformers are comprised of the regular network components. However,
on top of that, they consist of self-attention layers, which facilitate grasping information
from a vast and robust textual context. According to Vaswani et al. (2017), Transformers
are fully based on self-attention when computing representations of input and output,
avoiding convolutional or recurrent neural nets. Hence, the need for a transfer of informa-
tion through recurrent connections is mitigated. The layers of Transformers’ self-attention
can map the sequences of input vectors (x_1, ..., x_n) to output vectors (y_1, ..., y_n).
The estimation of each item is not dependent on the other elements. The Transformer is
thus able to benchmark one element against the others by its relevance in the provided
context, which is calculated with a score(x_i, x_j) = x_i · x_j. The greater the result,
the more meaningful the similarity between the vectors. Consequently, the scores get
normalised with a softmax⁵ function to compute the weight vector α_{ij}. The output
value is then computed from a summation of the inputs previously noticed by the model,
weighted with the relative α values.

y_i = Σ_{j≤i} α_{ij} x_j

The attention computation is comprised of the following: (1) comparing the relevant
elements, (2) normalising to obtain a probability distribution, and (3) generating the
weighted sum of the distribution (Jurafsky & Martin, 2002).

A standard Transformer layer includes two sublayers: multi-head attention (MHA) and
fully connected feed-forward network (FFN).

Multi-Head Attention (MHA). The attention function is comprised of queries Q, keys
K and values V. Hence, it can be denoted as follows:

A = QK^T / √d_k

Attention(Q, K, V) = softmax(A)V          (1)
⁵ Softmax is an activation function that generalizes the logistic function to multiple dimensions.


d_k denotes the dimension of the keys, used as a scaling factor, and A is the attention
matrix computed from the dot-product operation between Q and K. The final attention
function represents a weighted summation of the values V, where the weights are
calculated by applying softmax to each column of matrix A. Attention matrices generally
capture significant language knowledge, hence being crucial in knowledge distillation.
Concatenating the attention heads determines the multi-head attention:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W          (2)

head_i denotes the i-th attention head, h denotes the number of attention heads, while W
is the linear transformation matrix.
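The scaled dot-product attention of Equation (1) can be sketched in pure Python for a single head (a minimal illustration using toy lists in place of tensors; the function names are our own):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed row by row.
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot-product score of this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Multi-head attention then runs this computation h times with different learned projections of Q, K and V and concatenates the results, as in Equation (2).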

Position-wise Feed-Forward Network (FFN) is a fully connected feed-forward network
with two linear transformations which use the ReLU⁶ activation function. It can be
computed as the following:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

The position-wise FFN differs from a typical feed-forward network, as it applies the same
dense layers to each position item in the sequence, hence the name position-wise.
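The position-wise application can be made concrete as follows (a minimal pure-Python sketch; the weight matrices are toy placeholders for learned parameters):

```python
def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently to each
    # position vector in the sequence x (the same weights at every position).
    def matvec(v, W, b):
        return [sum(vi * W[i][j] for i, vi in enumerate(v)) + b[j]
                for j in range(len(b))]
    out = []
    for pos in x:
        hidden = [max(0.0, h) for h in matvec(pos, W1, b1)]  # ReLU
        out.append(matvec(hidden, W2, b2))
    return out
```

Because the loop reuses the same W1, b1, W2, b2 for every position, the layer adds no interaction between positions; all cross-position mixing happens in the attention sublayer.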

3.3.3 Bidirectional Encoder Representation from Transformers (BERT)

BERT is a deeply bidirectional pre-trained model architecture. It is able to interpret
the context from both directions, in comparison with classical sequential architectures.
Transformers act as the underlying foundation of BERT. Transformers use encoders and
decoders to capture the contextual relationship, with encoders taking the token sequence
as the input, and decoders generating the predicted output (Mishra, Rajnish, & Kumar,
2020). BERT has managed to enhance state-of-the-art results for a wide range of NLP
tasks, including text classification, part-of-speech tagging and Question Answering. It
mitigates the necessity of developing a task-specific architecture, as it only requires
fine-tuning.
⁶ ReLU (Rectified Linear Activation Function) is a piecewise linear function which outputs the input if
it is positive; otherwise, the function outputs zero.


3.3.4 Fine Tuning BERT with MLM and NSP

BERT is well known for its two powerful training approaches — masked language mod-
elling (MLM), and next sentence prediction (NSP). Oftentimes it is possible to simply
take the BERT model and apply it directly to the needed task; however, it is common to
fine-tune it first. MLM fine-tuning and training can allow BERT to grasp the specifics
of language in a given domain, which is highly important in such a unique linguistic con-
text as the field of privacy policies (Aroca-Ouellette & Rudzicz, 2020).

3.3.4.1 Masked Language Modelling

MLM works by giving BERT an input sentence and then further tuning the weights to
obtain the same sentence in the output. However, before BERT gets the input sentence,
some tokens are masked.

First, the text is tokenized, which results in three tensors: input_ids, token_type_ids,
and attention_mask. input_ids is the most crucial tensor, as it provides a tokenized
representation of the sentence which will be further adjusted. Next, a labels tensor is
created to calculate the loss and optimise the model during training. The labels tensor
is a copy of the input_ids tensor.

Furthermore, a set of tokens inside input_ids should be masked randomly. BERT
generally uses a 15% probability of masking each token while pre-training the model.
The labels and input_ids tensors are processed through the model and the loss between
the two is estimated. The loss is necessary to obtain the gradient changes used to
adjust the weights of the model.

The difference between the predicted probability distributions for each token and the true
labels defines the loss. The 512 tokens (the maximum number of input tokens that BERT
can process) result in the final embedding, known as logits. Each position also has a
vector whose length is the same as the size of the vocabulary of the model. The token
ids are predicted from the logits by applying softmax and argmax⁷ (Aroca-Ouellette &
Rudzicz, 2020).

⁷ Argmax is a function that returns the argument at which the target function attains its maximum value.


Figure 3: Masked Language Modeling
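The masking step described above can be sketched as follows (a simplified illustration with toy token ids; a real pre-training pipeline uses a subword tokenizer and additional replacement rules, and the [MASK] id of 103 is the value used in BERT's uncased WordPiece vocabulary):

```python
import random

MASK_ID = 103  # [MASK] token id in BERT's uncased WordPiece vocabulary

def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    # labels keep the original ids; input_ids get ~15% of positions masked.
    rng = random.Random(seed)
    labels = list(input_ids)
    masked = list(input_ids)
    for i in range(len(masked)):
        if rng.random() < mask_prob:
            masked[i] = MASK_ID
    return masked, labels

ids = [2057, 8145, 3167, 2951]  # toy token ids standing in for input_ids
masked, labels = mask_tokens(ids)
# The loss is then computed between the model's predictions at the masked
# positions and the corresponding entries of `labels`.
```

Keeping labels as an untouched copy of input_ids is what lets the loss compare the model's guess at each masked position against the original token.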

3.3.4.2 Next Sentence Prediction

NSP works by providing BERT with two sentences, A and B. BERT predicts whether
sentence B comes after A by outputting IsNextSentence or NotNextSentence. Hence, it
is able to learn long-term dependencies among the sentences.

To start with, tokenization is performed. The two sentences are unified into the same
tensor set, with a [SEP] token placed between them. The token type ids tensor contains
the segment ids, which identify which sentence each token belongs to. Sentence A is
identified as 0, while sentence B as 1.

Furthermore, a classification label should be created: a new labels tensor which identifies
whether one sentence follows the other. IsNextSentence is represented with 0, while
NotNextSentence — with 1.

Lastly, the loss is estimated. The inputs and labels are processed by the model, which
outputs the loss tensor that is optimised during training. The model might also be used
simply for inference without training, in which case no labels tensor is needed
(Aroca-Ouellette & Rudzicz, 2020).
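The input construction above can be sketched as follows (token strings stand in for ids, and the function name is illustrative):

```python
def build_nsp_example(tokens_a, tokens_b, is_next):
    """Assemble [CLS] A [SEP] B [SEP] with segment ids for NSP training."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment id 0 covers [CLS], sentence A, and its [SEP];
    # segment id 1 covers sentence B and the final [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)          # no padding in this sketch
    label = 0 if is_next else 1                 # 0 = IsNextSentence, 1 = NotNextSentence
    return tokens, token_type_ids, attention_mask, label
```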

3.3.5 BERT Variations

BERT-Base is the baseline BERT model with 12 encoder layers. BERT has numerous
model configurations tailored toward various learning task needs. Robustly Optimized
BERT Pretraining Approach (RoBERTa) is a type of BERT created with the goal
of improving its training phase, by training the model longer and on more data. RoBERTa


builds upon the language masking of BERT, by learning to predict purposefully hidden
segments in unlabelled text. As opposed to static masking in BERT-base, RoBERTa
uses dynamic masking, which generates a new masking pattern each time a sequence is
fed to the model, while simultaneously decreasing the need for additional training
instances. It adjusts the hyperparameters of BERT, for instance, by removing the
next-sentence pretraining objective which enhances downstream performance. Moreover,
RoBERTa uses larger mini-batches and learning rates (Facebook, 2019).

As discussed previously in Section 2, BERT language models are also often trained
to learn the nuances of specific domain-related language to achieve higher performance.
PrivBERT is an example of a domain-specific BERT. It is a privacy policy domain lan-
guage model trained on roughly one million privacy policies, starting from a pre-trained
RoBERTa-base (12 layers, 768 hidden size, 12 attention heads, 110M parameters). It
learns the nuances of privacy-policy-specific language to achieve higher accuracy. It was
trained using dynamic masked language modelling for 50.000 steps with a batch size of
512. Most parameters were kept similar to RoBERTa, while the
peak learning rate was set to 8e-5. PrivBERT was evaluated on both classification and
Question Answering tasks achieving state-of-the-art results (Srinath et al., 2021).

3.4 Question Answering (QA)

Question Answering is a field of NLP which is focused on developing systems that can
automatically answer users’ questions. The system must be able to translate the user
inquiries into a machine-readable representation of the question to provide a relevant
answer. This is done by mapping the semantics of the sentences in natural language.

3.4.1 Types of QA Systems

There are several ways of classifying QA systems. The main paradigms include IR-based
and knowledge-based QA systems. Moreover, QA can be distinguished between open and
closed domain QA. Lastly, QA can be categorised by the type of questions asked.


3.4.1.1 Main QA Paradigms – IR-based and Knowledge-based QA

To start with, QA systems can be distinguished between two key paradigms: IR-based
and knowledge-based QA systems.

1. IR-based QA system works by finding text segments in the document collection that
answer the user’s question. IR-based QA consists of question processing, passage
retrieval, and answer processing. IR-based QA system will be explained in more
detail in Section 3.4.2.

2. Knowledge-based QA system maps the question to a query over a structured
database. Semantic parsers are used for mapping the text string.

3.4.1.2 Open and Closed Domain QA

QA systems can also be categorised into open and closed domains.

1. Open-domain QA systems are not limited to any particular field of knowledge. They
generally are dependent on the world wide web and universal ontology. They use
general vocabulary and thus do not require any domain-specific knowledge to answer
the question.

2. Closed domain QA systems work with a specific area of knowledge and thus consist
of a restricted repository of domain-related questions, the answers to which are
retrieved from a collection of domain-specific documents. Therefore, the quality of
answers is considered to be high. Closed domain QA uses specific ontology and
terminology. (Reddy & Madhavi, 2017)

3.4.1.3 Types of Questions in QA

Furthermore, QA systems can also be categorised by the type of questions asked, such
as factoid and non-factoid. Factoid questions can generally be described as the questions
which start with what, which, when, and who. The questions generally need a short answer
(a word or a single sentence) and are simple to answer. The answer type is normally a
named entity. Non-factoid questions, on the other hand, are open-ended questions, such
as opinions or explanations, and require complex answers and passage level texts. The
distinction can also be established between five types of questions, such as: (1) List, (2)


Confirmation, (3) Causal, (4) Hypothetical, (5) Complex (Reddy & Madhavi, 2017).

1. List questions provide a list of named entities. Generally, they are able to perform
with rather high accuracy. Techniques used with factoid questions can be easily
adjusted to list types of questions.

2. Confirmation questions require yes or no answers. Sometimes they require subjective
information about the event in question, which could pose a constraint in finding an
answer.

3. Causal questions require more descriptive answers, as they refer to reasons, explana-
tions, and elaborations. They are generally much longer and can be a set of sentences
or even a paragraph.

4. Hypothetical questions are questions associated with unspecific answers. Their ac-
curacy is dependent on the context and the user and is often quite low.

5. Complex questions require inference and synthesis of context. They need deeper
neural architectures than, for instance, factoid and list types of questions.

3.4.2 IR-based QA and BERT

Information Retrieval (IR) based QA systems answer a question posed by the user by
looking for the passages of relevant text. The traditional algorithms used in IR, such as
tf-idf, face the vocabulary mismatch problem: they can only perform well given a
precise overlap of words between the query and the passage of text. To mitigate such shortcom-
ings, an approach that can handle synonyms and paraphrasing is needed. Hence, dense
embeddings and dense passage retrievers can be used as opposed to sparse word-count
vectors. An example of a generalised QA system architecture can be seen in Figure 4.

Figure 4: General Architecture of Question Answering System

Bi-encoders built on BERT make use of one encoder to take in the query (BERT_Q), and
one to encode the text of a document (BERT_D). The score is then obtained as the dot


product of the two vectors (Figure 5). The query and the document are each represented
by their [CLS] token embedding, as seen in the equation below.

h_Q = BERT_Q(q)[CLS]
h_D = BERT_D(d)[CLS]                                            (3)
score(d, q) = h_Q · h_D
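The scoring in Equation (3) can be sketched with NumPy, with small vectors standing in for the [CLS] embeddings produced by BERT_Q and BERT_D (the function names are illustrative):

```python
import numpy as np

def bi_encoder_score(h_q, h_d):
    """Relevance of document d for query q: dot product of the [CLS] vectors."""
    return float(np.dot(h_q, h_d))

def rank_documents(h_q, doc_embeddings):
    """Indices of documents sorted by descending score. Document embeddings
    can be pre-computed offline, which makes bi-encoders efficient retrievers."""
    scores = doc_embeddings @ h_q     # one dot product per document
    return list(np.argsort(-scores))
```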

To perform QA, a question-answering head is added on top of BERT, which identifies
the start and end tokens of the answer span. Inside the QA head, there are two sets of
weights for the start and end tokens. The dimensions of the sets correspond to the output
embeddings. Next, the output embeddings of the tokens are inputted into the head. The
dot product between the start token weight and output embeddings is computed, as well
as the dot product between the end token weight and the output embeddings. Lastly,
a softmax activation function is used to generate a probability distribution. The tokens
which have the highest probability are selected as the start and end. BERT is then fine-
tuned with cross-entropy loss (Ramnath, Nema, Sahni, & Khapra, 2020).
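The head computation can be sketched as follows (NumPy only; the two weight vectors stand in for the learned start and end sets):

```python
import numpy as np

def qa_head(token_embeddings, w_start, w_end):
    """Pick the most probable start and end tokens of the answer span.

    token_embeddings: (n_tokens, hidden) output embeddings from BERT;
    w_start, w_end: (hidden,) weight vectors of the QA head."""
    start_logits = token_embeddings @ w_start   # dot product per token
    end_logits = token_embeddings @ w_end

    def softmax(x):
        e = np.exp(x - x.max())                 # numerically stabilised
        return e / e.sum()

    p_start, p_end = softmax(start_logits), softmax(end_logits)
    return int(p_start.argmax()), int(p_end.argmax())
```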

Dense vectors in the field of IR are still an ongoing research topic, particularly with regard
to fine-tuning. Documents are often too long for the encoder to process (such as extra-
lengthy post-GDPR privacy policies), thus a common approach is to split them up into
paragraphs.

Figure 5: BERT for Estimating the Relevance of a Document in a Query

3.4.3 Reading Comprehension and SQuAD

Reading comprehension is a challenging task which requires an understanding of language,
context, and knowledge in the domain. Data for IR-based QA is generally obtained by


crafting a reading comprehension dataset to train the reader to predict a passage span
that contains the answer. Such datasets are comprised of triples: passage, question, and
answer. As the data already contains the relevant passage, the retrieval step becomes unnecessary.

The most prominent example of such a dataset is the Stanford Question Answering Dataset
(SQuAD), which was manually created from question-answer pairs over Wikipedia
articles (Rajpurkar, Zhang, Lopyrev, & Liang, 2016). In its first version, SQuAD 1.1, every
question is answerable in the corresponding context passage. However, SQuAD 2.0 also in-
troduces no-answer annotations, resulting in more than 150.000 QA pairs (Rajpurkar, Jia,
& Liang, 2018). SQuAD is one of the most widely used open-domain QA datasets, which
helps models learn the general foundation of the QA task. NLQuAD is a dataset for
non-factoid question answering. Due to the nature of non-factoid questions, it requires
more descriptive answers and longer answer spans. It is comprised of 31,000 non-factoid
questions and long answers from 13,000 BBC news articles. The questions are interroga-
tive and inquire about the opinions and reasoning, thus resulting in complex open-ended
answers (Soleimani, Monz, & Worring, 2021).

3.4.4 Retriever-Reader

The foundation of IR-based QA is the Retriever-Reader interaction. To start with,
relevant documents are extracted from the corpus with the use of a Retriever. Next, a
Reader, essentially a neural reading comprehension architecture, goes through each doc-
ument to find the span that contains the answer. However, sometimes the QA system
solely focuses on the reading comprehension (further discussed in Section 3.4.3), when the
relevant passage is already given to the Reader. The whole (question, passage, answer)
triple is used for training. At inference time, however, only the question is presented
to the Reader, hence requiring the QA system to perform IR, as the relevant passage is
not given (Das, Dhuliawala, Zaheer, & McCallum, 2019).

The Reader takes the passage as an input and gives an answer. Answer extraction is based
on span labelling: finding a contiguous string corresponding to the answer. In
reading comprehension tasks, neural models take in a question q of n tokens q_1, ..., q_n, as
well as a passage p of m tokens p_1, ..., p_m. The neural architecture aims to calculate the


probability that any given span constitutes the answer: P(a|q, p) (Jurafsky & Martin,
2002).

Given that a possible answer span starts at a_s and ends at a_e, the probability can be
simplified as P(a|q, p) = P_start(a_s|q, p) P_end(a_e|q, p). Therefore, two probabilities,
for the start and for the end, are estimated for every token p_i.

Figure 6: BERT for Span QA from Reading Comprehension Task

The baseline for reading comprehension tasks is feeding the question and passage jointly
into an encoder such as BERT, with the two strings separated by the [SEP] token. The
outcome is an encoded token embedding p'_i provided for each passage token (Figure 6).

The first sequence corresponds to the question, while the second sequence — to the passage.
A linear layer is also added during fine-tuning to estimate the start and end
span positions. The start-of-span embedding can be expressed as a vector S, while the
end of the span can be represented as E. These vectors are learnt by the model
during the fine-tuning stage. By iterating through all tokens p_i, calculating the dot product
of S and p'_i, and further applying a softmax function for normalisation, we may obtain
the following formulas for computing the start and end probabilities
of the span (Jurafsky & Martin, 2002).


P_start_i = exp(S · p'_i) / Σ_j exp(S · p'_j)
                                                                (4)
P_end_i = exp(E · p'_i) / Σ_j exp(E · p'_j)

A candidate span ranges from position i to position j, and its score can be computed as
S · p'_i + E · p'_j. The maximum-scoring span with j ≥ i is taken by the model as
its final prediction.
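Selecting the final span can be sketched as a search over candidate (i, j) pairs, assuming the common constraints that the end may not precede the start and that spans are capped at a maximum length:

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (i, j) maximising start_logits[i] + end_logits[j]
    subject to i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i in range(len(start_logits)):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

Unlike taking the two argmaxes independently, the constrained search can never return an end position that precedes the start.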

3.4.5 Evaluation Methods for QA

Evaluating the performance of QA can be rather challenging, as the correct answer could
be phrased in a variety of ways. To assess the performance of QA, evaluation metrics
able to mimic human judgement should be implemented. Generally, there are two ways
of evaluating QA — closed and open domain modes (Su et al., 2016).

Closed-domain evaluation mode refers to QA performed on a single document. Even if
the two strings are matched and yet occur in different documents, the answer is incorrect.
That is applicable to privacy policies as the question must be answered in the privacy
policy of the relevant organisation. Open-domain evaluation mode, on the other hand, is
only concerned with whether the strings match, regardless of their position in the corpus
(Jurafsky & Martin, 2002).

Separate metrics exist for evaluation of the Retriever and Reader parts of the QA system.
To evaluate the performance of a Retriever, it is crucial to figure out if the document
which has the correct answer span is among the retrieved candidates. The number of
retrieved candidates is set by the parameter top_k. There are two well-known ways of
evaluating a Retriever - (1) Recall and (2) Mean Reciprocal Rank (Shao, Guo, Chen, &
Hao, 2019).

1. Recall estimates the number of times the right document is contained among the
selected candidates. In the case of a singular query, the output of the recall is binary
— whether the document is among the retrieved candidates or not. However, when
calculated over the whole dataset, the output ranges from zero to one — specifically,
the percentage of queries for which the relevant document was retrieved.

2. Mean reciprocal rank (MRR) is another metric used to measure the performance


of the Retriever. It takes into consideration the ranking of the answer, meaning its
position. MRR with a score of 0 means that there are no matching responses, while
the top score of 1 means the QA system was able to select the right document as
the top candidate for every query. Hence, unlike recall, MRR accounts for how highly
the relevant document is ranked among the retrieved candidates.
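Both Retriever metrics can be sketched as follows; each query contributes a ranked list of retrieved document ids and a single relevant id (the function names are illustrative):

```python
def recall_at_k(retrieved_lists, relevant_ids, k):
    """Fraction of queries whose relevant document is among the top-k candidates."""
    hits = sum(rel in retrieved[:k]
               for retrieved, rel in zip(retrieved_lists, relevant_ids))
    return hits / len(relevant_ids)

def mean_reciprocal_rank(retrieved_lists, relevant_ids):
    """Average of 1/rank of the relevant document; 0 when it is not retrieved."""
    total = 0.0
    for retrieved, rel in zip(retrieved_lists, relevant_ids):
        if rel in retrieved:
            total += 1.0 / (retrieved.index(rel) + 1)  # ranks are 1-based
    return total / len(relevant_ids)
```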

When measuring the performance of the Reader, the degree of correctness of the selected
answer span is evaluated.

1. Exact match (EM) is the metric that estimates the fraction of the documents
where the answer estimated by the Reader is completely identical to the right answer.
For instance, if for the annotated pair “When did GDPR come into effect? – GDPR
came into effect on May 25, 2018”, the Reader predicts the answer to be “On May
25, 2018, GDPR came into effect”, EM would result in zero, as the match is not
completely identical.

2. F1 score is more similar to the way a human would judge an answer, as it is
concerned with the overlap of the prediction and the label. The previous example
would get an F1 of 1 (Jurafsky & Martin, 2002).

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)                                         (5)
F1 = 2 · Precision · Recall / (Precision + Recall) = 2 · TP / (2 · TP + FP + FN)
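Both Reader metrics can be sketched as follows, with a simplified version of the normalisation used by the official SQuAD evaluation script (lowercasing and punctuation stripping). Under this normalisation the example above indeed scores an EM of 0 but an F1 of 1:

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    return [t.strip(string.punctuation) for t in text.lower().split()]

def exact_match(prediction, truth):
    """1 only if the normalised token sequences are identical, in order."""
    return int(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Token-level F1 over the overlap of prediction and gold answer."""
    pred, gold = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())  # true positives
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```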

3.5 Transfer Learning and Knowledge Distillation

“Transfer learning aims to extract the knowledge from one or more source tasks and apply
the knowledge to a target task” (Han, Kamber, & Pei, 2012, p. 435). In transfer learning,
a pre-trained model is used as the initial stage, and a related task is then optimised by
re-purposing the existing one. As it is very challenging to train a model from scratch
with limited data, transfer learning offers an opportunity to achieve high performance by
first training on a general knowledge dataset and then transferring that knowledge to a
smaller domain-specific dataset for a related problem. Hence, the target task needs less
training data and less training time. Traditional learning is based on
the assumption that the train and test datasets are obtained from the same feature space


and data distribution; in case of any shifts in feature space and distribution, models are
expected to be redeveloped. However, transfer learning facilitates learning a different
task, thus the distribution and data domain can be different, as seen in Figure 7. The
most popular type of transfer learning is instance-based transfer learning, which re-weights
some data in the source task to aid learning the target task. In
the field of NLP, domain adaptation has been a key area positively affected by transfer
learning (including syntactic parsing, named entity recognition, and QA) (Min, Seo, &
Hajishirzi, 2017).

Figure 7: Traditional Learning vs Transfer Learning

A domain D is composed of a feature space X and a marginal probability distribution
P(X), where X = {x_1, ..., x_n} ∈ X. Hence, the domain can be described
as D = {X, P(X)}. A task can be defined with two elements: a label space Y and an
objective predictive function f : X → Y. The model predicts the label f(x) of
any new instance x. The source task T = {Y, f(x)} is learnt from corresponding pairs of
data {x_i, y_i}, where x_i ∈ X and y_i ∈ Y. The source domain and source task can be
described as D_S and T_S respectively, while the target domain and task are represented
by D_T and T_T, given that D_S ≠ D_T ∨ T_S ≠ T_T. Hence, the goal of transfer learning
is to enhance the learning of the target predictive function f_T(·) in the target domain
D_T with the use of transferred knowledge from the source domain D_S and source task
T_S (Lin & Jung, 2017).

A technique for transferring the knowledge is distilling it from a bigger teacher network T
to a smaller student S; hence, the student mimics the teacher. Given that f^T and f^S
describe the behaviour functions of the networks, they transform network inputs into an information


representation, that is, the output of a given layer in the network. In the case of the
distillation of Transformers, the outputs of the MHA and FFN sub-layers, or the attention
matrix A, can be used as behaviour functions. Knowledge distillation can be described by
the following objective function:

L_KD = Σ_{x ∈ X} L(f^S(x), f^T(x))

L represents the loss function estimating the gap between the teacher and the student, x
denotes an input, and X is the training set.
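The objective can be sketched as follows, assuming mean-squared error as the layer-wise loss L (the loss TinyBERT uses for hidden-state and attention-matrix distillation):

```python
import numpy as np

def distillation_loss(student_outputs, teacher_outputs):
    """L_KD: sum over training inputs of L(f_S(x), f_T(x)), here with
    mean-squared error as L; each element is one layer output f(x)."""
    total = 0.0
    for f_s, f_t in zip(student_outputs, teacher_outputs):
        total += float(np.mean((np.asarray(f_s) - np.asarray(f_t)) ** 2))
    return total
```

In practice, when the two hidden sizes differ, the student's outputs are first mapped through a learned linear projection before the loss is computed.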

Figure 8: TinyBERT Learning

Pre-trained language models, such as BERT discussed above, are rather computationally
expensive, posing an inference challenge. Transformer distillation allows keeping the same
accuracy while reducing the size of the model. The knowledge encoded in the teacher
BERT can get transferred to a smaller student TinyBERT. TinyBERT does the Trans-
former Distillation at the pretraining and task-specific learning (see Figure 8). Hence,
TinyBERT is capable of grasping both the general source domain and the target task-
specific knowledge (Jiao et al., 2020).


4 Methodology
“The purpose of methodology is to enable researchers to plan and examine critically the
logic, composition, and protocols of research methods; to evaluate the performance of indi-
vidual techniques; and to estimate the likelihood of particular research designs to contribute
to knowledge” (Krippendorff, 2018, p. 21). To start with, the Research Onion framework
aids in conceptualising the key pillars of the research. Critical realism is employed as
the research philosophy, whereas the research approach is chosen to be abductive. The re-
search undertakes the experimental strategy, the methodological research choice is mixed-
method, and the time horizon of the research is longitudinal. Next, the section presents
the reader with the techniques used throughout the research, guided by the CRISP-DM
framework. To start with, the business understanding is formulated, revising the necessity
for the user to comprehend lengthy, legally sophisticated privacy policies in a time-efficient
manner. The two datasets are explored, (1) the existing broad PolicyQA dataset with QA
annotations from pre-GDPR time, as well as (2) manually labelled GDPRQA based on
recent post-GDPR policies varying in size, industry and location of the organisations.
Next, the reader is guided through the model development. The configuration files as-
sociated with each model, as well as parameters are discussed. The modeling iterations
are elaborated on, starting with (1) general knowledge acquisition from open domain QA
datasets — the factoid SQuAD and non-factoid NLQuAD, (2) knowledge transfer onto the
privacy policy domain, and (3) optimisation of the best performing models. Furthermore,
a production environment simulation was developed using the best optimised models and
ElasticSearch as the IR search engine. Evaluation techniques are discussed. Lastly, a
deployment architecture is proposed to conceptualise the GDPRQA Assistant.

4.1 Research Philosophy

To elaborate on the research considerations, the “Research Onion” framework is used,
showcased in Figure 9. There are three key pillars of conceptualising research — namely,
ontology, epistemology, and axiology. Ontology refers to the nature of reality as it ques-
tions the foundations of assumptions made. Epistemology deals with what is viewed as
the accepted knowledge. Lastly, axiology explains how value and ethics are judged and
evaluated. The three pillars and their interactions generate five research philosophies —
critical realism, positivism, pragmatism, postmodernism, and interpretivism (Saunders,


Lewis, & Thornhill, 2019).

Critical realism is leveraged as the philosophical leitmotif of the thesis. Critical realism
distinguishes between real and observable aspects of research. It views scientific research
within its causal mechanisms. Hence, it considers that events or phenomena consist of
emergent mechanisms, driven by causation, agency, structure, and relations.

Figure 9: The ”Research Onion”

4.1.1 Research Approach

The types of research approaches include induction, deduction and abduction. The induc-
tive approach starts with the collection and analysis of data, followed by the development
of a theory based on the findings. Deduction first crafts a theory and hypothesis which
are further tested. Abduction combines the two in an iterative manner. The topic is thor-
oughly explored and the insights contribute to the conceptual framework which is further
improved with more data (Saunders et al., 2019).


The abductive approach is taken: the privacy policy data, as well as the annotated
(question, answer, context) triples, are explored. The theory is incorporated into the
process and is expanded based on the findings, which can be further tested with more data.

4.1.2 Methodological Research Choice

Data-related techniques involve mono-methods when strictly one type of data is used
(quantitative or qualitative), multi-methods with beyond one data type which is dealt with
separately, and mixed-method with multiple data types and combined research applied to
those (Saunders et al., 2019). This thesis works with textual data, which is discerned as
qualitative, as the privacy policy texts are collected and analysed. However, the used tools
and techniques quantify the textual data making it quantitative. Hence, the mixed-method
research is undertaken.

4.1.3 Research Strategy

The research question, goals, resources and existing knowledge impact the choice of the
research strategy. The possible variations of the strategy include experiment, survey,
archival research, case study, ethnography, action research, grounded theory, and narrative
inquiry. The experiment is a broadly common strategy taken in scientific as well as social
science research (Saunders et al., 2019). Evaluation of the performance of the Question
Answering system in the domain-specific context falls under the experimental strategy of
the research.

4.1.4 Research Time Horizon

Research in terms of its time horizon can be separated into cross-sectional studies and
longitudinal studies. The former deals with a “snapshot” of time, while the latter explores
the evolution of data over time (Saunders et al., 2019). The thesis deals with data
from pre- and post-GDPR times. The collected policies come from different time stamps,
yet no time-series studies are performed, except for the comparison between pre- and
post-GDPR policies. Hence, the study is a mix, but with more focus on the longitudinal
time horizon.

4.1.5 Techniques and Procedures

Techniques and procedures are elaborated on in detail under the CRISP-DM framework
presented in the next section and iteratively considered throughout the thesis. The


methodology section will end with a discussion on the reliability and validity of work
(Section 4.9), critically assessing the techniques and procedures applied.

4.2 CRISP-DM Methodology

The methodology will follow the CRISP-DM framework, displayed in Figure 10, which is a
widely accepted framework in data science projects. CRISP-DM stands for CRoss-
Industry Standard Process for Data Mining and is a robust methodology applicable
independently of the industry or tool. The rationale for choosing this
framework is that it accommodates the needs for a model that is able to encompass both
technical and business functions, thus harnessing NLP to derive business insights to
solve the problem of machine comprehension of GDPR privacy policies. CRISP-DM facil-
itates the structure and flexibility needed to achieve high performing data mining models.
It consists of several phases which guide us on the roadmap of the thesis project, namely:
business understanding, data understanding, data preparation, modelling, evaluation, and
deployment.

Figure 10: CRISP-DM Framework


4.3 Business Understanding

This stage is focused on building the essential foundation by analysing the objectives and
requirements of the business value behind privacy policy reading comprehension.

4.3.0.1 Business Objectives

To determine the business objectives, the thesis evaluates the users’ needs and success
criteria. The users need to be able to comprehend lengthy, verbose, often linguistically
complex privacy policies with ease. Moreover, they need to be fully aware of practices
applied to their personal data and possible accompanying risks in a time-saving manner.
Furthermore, they should be able to easily obtain the details that are most relevant to
them personally. Hence, a QA system which accurately extracts relevant GDPR data
privacy practices and reflects the current GDPR landscape would be able to successfully
satisfy these needs.

4.3.0.2 Situation Assessment

To assess the situation, we shall evaluate the resources available. In terms of dataset
availability, PolicyQA, an existing QA dataset derived from the privacy policy standard
OPP-115 annotated dataset, is rather comprehensive and vast. However, due to the lack
of GDPR-specific questions, it is deemed necessary to derive a GDPR-relevant dataset
and use it to augment PolicyQA. Yet, manually obtaining a dataset would result
in a limited number of training examples, which would constrain the training of the neu-
ral language model. Transfer learning and data augmentation would be of immense help
in overcoming this challenge. Hence, first fine-tuning the QA model on SQuAD or
NLQuAD, the open domain QA datasets, would help the model to learn general question
answering patterns, and then performing transfer learning on PolicyQA and GDPRQA
would help adapt the knowledge to the privacy-specific domain. Computational power
resources could be limited given the large textual volumes to be processed, yet the maxi-
mum GPU capacity of the Google Colab pro+ subscription will be used, as elaborated in
the modelling setup in Section 4.6.


4.4 Data Understanding

The data understanding section will explore the broad existing PolicyQA secondary dataset,
which covers a wide range of pre-GDPR privacy topics relevant to users. Furthermore, it
will explore the created GDPRQA specific dataset, which contains question-answer anno-
tations of policies collected after the introduction of GDPR. The data understanding will
include systematic collection and exploration of privacy policies to achieve a high-quality
GDPR-relevant QA dataset.

4.4.1 Exploration of PolicyQA

PolicyQA is a publicly available dataset (Ahmad et al., 2020) which consists of question,
answer, context triples based on the classified OPP-115 dataset. The policies used in Pol-
icyQA are those from OPP-115 corpus scraped from 2015-2016, hence before the adoption
of GDPR. The average length of a single policy in the dataset is 2319 words, which on
average would take roughly 14 minutes to read. OPP-115 presents 23.000 data practices
with 103.000 annotated text spans, each of which is linked to a segment of a policy with
character-level start and end positions.

PolicyQA annotations are created from these annotated spans and segments. The seg-
ments classified as “Other” and “Unspecified” are excluded. PolicyQA is a vast dataset
with 25.017 answer span annotations labelled by two domain field experts. It contains 714
individual questions created in a generic manner applicable to similar practice categories.
The distribution of n-gram prefixes of questions can be observed in Figure 11. From the
distribution, it is seen that the majority of questions are non-factoid, yet various types of
questions occur.


Figure 11: Distribution of N-gram Prefixes in the Questions of PolicyQA

Due to the complex nested json structure of the SQuAD-format dataset, prior to its
exploration it was first converted to a dataframe by unwrapping the path to the deepest
level in the json file — [’data’, ’paragraphs’, ’qas’, ’answers’]. The different levels
in the json file were parsed and then combined into a single dataframe (Listing 1).

The answers in the dataset have a rather short span, 13.5 words on average, which allows
readers to zoom in on the necessary part of the text and grasp the information quickly. The
average length of a question is 11.2 words, while the average length of a passage is 116 words.

{
  "data": [
    {
      "paragraphs": [
        {
          "qas": [
            {
              "question": "Can I edit or change the data that I have provided to you?",
              "id": 311216,
              "answers": [
                {
                  "answer_id": 318379,
                  "document_id": 525657,
                  "question_id": 311216,
                  "text": "You have the right to request rectification of inaccurate personal data concerning yourself, and to complete incomplete data.",
                  "answer_start": 505,
                  "answer_end": 630,
                  "answer_category": null
                }
              ],
              "is_impossible": false
            }
          ],
          "context": "Right of access \nYou have the right to obtain confirmation ...",
          "document_id": 525657
        }
      ]
    }
  ]
}

Listing 1: Code Snippet of Data in SQuAD Format

The distribution of the topics covered by the annotations reveals that 44.4% of the
annotations refer to first party collection and 34.1% to third party sharing. 11% of
the annotations reflect user choice, a rather generic category covering individual
rights (explained in detail in the manually created GDPRQA). Data security annotations
account for 2.2%, data retention for 1.7%, user access, edit, and deletion for 3.1%,
policy change for 1.9%, international audiences for 1.5%, and do not track for 0.1%.
The top three most commonly asked questions are "For what purpose do you use my data?",
"Do you collect or use my information? If yes, then what type?", and "Does the website
mention the name of third parties, who gets my data?". Further


exploration can be found in the paper by Ahmad et al. (2020).

4.4.2 Acquisition and Exploration of GDPRQA

First, the collected post-GDPR privacy policies will be explored, followed by the elabora-
tion on the process of manual annotation, and, finally, exploration of the SQuAD format
annotated dataset.

4.4.2.1 GDPR Privacy Policy Selection

To start with, as prior research has shown that previous datasets do not address a variety
of companies, 47 privacy policies of companies ranging in industry, size, and location were
collected. The companies and their respective sizes, industries, and countries are presented
in Table 16. Companies of varying sizes were selected, from European startups with up
to 50 employees to corporations with more than 10,000 employees. Large companies often
cover several services in one policy; their policies are drafted by in-house legal experts
and updated frequently. Smaller companies' policies cover a narrower range of topics,
their language varies in legal sophistication, and they are updated less often.

Both European (e.g. Swedish, Danish, Norwegian, French, German, Spanish, Lithuanian,
Hungarian, Greek, British, Dutch, Swiss, Portuguese, Italian, Austrian, etc.) and some
global companies with a European presence were chosen as they also need to comply with
GDPR when operating within the EU. Only policies in English language were considered.

The range of industries of the collected policies, drawn from websites and apps, is also
rather diverse: manufacturing, retail, technology, transport, energy and oil, space,
hospitality, automobile, healthcare, pharmaceuticals, banking, food, audio services,
telecommunication, personal goods, lifestyle, marketplace, and education (see Table 16).

Depending on the industry, for some companies, it was more common to collect spe-
cific types of information. For instance, healthcare companies dealt more with sensitive
information such as health-related data, while banking companies often use automated
decision-making to determine financial validity. Privacy policy documents seem to be


rather structured as they follow specific guidelines and have a homogeneous nature among
the entities of similar industries.

4.4.2.2 Exploration of the GDPR policies

Exploratory data analysis reveals some curious aspects. On average, policies contain 4569
words. With an average speed of 250 words per minute, it would take 18.2 minutes to
read such a policy on average. The average length of a sentence is 21 words, which is
considered rather verbose and lengthy. On average, policies contain 515 words scored as
difficult. Difficult words are estimated as those with more than two syllables that are
not present in the list of common words provided by the Textstat library.

Two readability scores are used to measure the lexical complexity of the privacy policies.

1. Flesch Reading Ease measures passages of text given the syllable count and the
word and sentence lengths. Higher scores mark easier texts, while low scores indicate
high lexical complexity, and the score can be mapped to the level of education needed
to comprehend the material. The average score of 62 falls in the "easily understood
by high school students" band, bordering the 50-60 "fairly difficult" category.

2. Gunning Fog Score indicates the number of years of formal education needed for
a person to understand an English text on the first reading. Texts with a score below
12 are assumed to be intended for a general audience, while universal comprehension
would require a score under 8. The GDPR privacy policies dataset scores 15.5 on
average in Gunning Fog. Hence, the sampled privacy policies may not be deemed
readable by a general audience, despite the intention that privacy policies be fully
understood by any individual. A score of 15.5 corresponds to a college senior, thus
implying rather strong lexical complexity.
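As an illustration of the first score, a minimal Flesch Reading Ease computation can be sketched in Python. The vowel-group syllable counter is a naive assumption (libraries such as Textstat apply more careful rules); the formula itself is the standard one:

```python
import re

def count_syllables(word):
    # Naive heuristic: one syllable per group of consecutive vowels
    # (an assumption; Textstat applies more careful rules).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Very simple text yields high scores (trivial sentences can even exceed 100), while the legalistic sentences typical of privacy policies pull the score down.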

Privacy policies are described as lengthy and complex, yet this does not necessarily entail
that they are dense in pertinent information. A set of vague words derived by Lebanoff
and Liu (2018) was used to count frequencies; words such as "may", "some", "necessary",
"certain", "sometimes", "reasonably", "appropriate", and "typically" scored rather
high, with a mean of 156 vague words per policy.
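The vague-word frequency count can be sketched as follows (the word set is the subset quoted above, for illustration only; the full list comes from Lebanoff and Liu (2018)):

```python
import re
from collections import Counter

# Subset of the vague words quoted above (illustrative, not the full list).
VAGUE = {"may", "some", "necessary", "certain", "sometimes",
         "reasonably", "appropriate", "typically"}

def vague_word_count(policy_text):
    """Return the total number of vague-word occurrences and a per-word tally."""
    tokens = re.findall(r"[a-z]+", policy_text.lower())
    counts = Counter(t for t in tokens if t in VAGUE)
    return sum(counts.values()), counts
```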


Therefore, the collected GDPR privacy policies are indeed rather lengthy, verbose,
complex, and vague, as suggested by related research. This signifies that the general
audience would struggle to give informed consent to the use of their data, making the
current practice of giving consent inefficient. Unawareness when consenting to data
processing could result in personal data risks, such as misuse and breaches, questioning
the legitimate power of privacy policies. Hence, an NLP solution capable of simplifying
privacy policy comprehension could greatly assist in solving the outlined problem. All
the metrics are presented in Table 1.

       Word Count  Sent Count  Avg Sent Len  Difficult Words  Flesch Score  Gunning Fog  Vague Words
count          44          44            44               44            44           44           44
mean         4569       225.5            21            514.6            62         15.5          156
std          2786         157           3.8              186           9.1          1.8           88
min           960          44          13.2              159          50.2         10.6           28
25%        2935.7       145.2          18.8            409.0          62.7         14.3           90
50%          4055         190          21.4            480.5            67         15.5          145
75%        5352.7         253          23.7            642.5          74.8         16.5        188.8
max         14604         871          29.2             1106          91.2         19.9          476

Table 1: Statistics on GDPR Privacy Policies

4.4.2.3 Content Analysis as a Framework for Manual Labelling

To generate a previously non-existent GDPRQA dataset, conventional manual content
analysis was carried out to annotate a sample of 47 post-GDPR privacy policies.
Krippendorff (2018, p. 403) defines content analysis as "a research technique for making
replicable and valid inferences from data to their contexts."

Content analysis is a scientific tool that generates new insights and improves the
researchers' comprehension of the subject of research. The process must be reliable and
systematic, producing valid results grounded in how the subject of the research is
conceived. Exploratory in process and inferential in intent, it makes newly acquired
findings accessible for systematic analysis by automated means, fostering the development
of inquiries into the realities of privacy constructs.

To annotate the policies in the SQuAD format with (question, answer, context) triples,
Haystack was used. Haystack is an end-to-end open-source framework which enables the
development of scalable NLP pipelines. Built modularly, it integrates with other
open-source projects, including HuggingFace and ElasticSearch. It provides the necessary


functionality to apply the Transformer architecture to a variety of use cases, including
extractive QA.

Figure 12: Example of a Question Answered within the Context

Haystack's Annotation Tool assists in the labelling process by providing an interface to
upload documents and mark them with question and answer spans. The tool outputs the
final dataset in the SQuAD format, which is further used for tuning a QA model. Hence,
the answer is a span within the context, related to the asked question (Figure 12).

Each policy was read independently by two annotators, the authors of the thesis, who
have GDPR-specific domain knowledge. The annotation followed the Natural Questions
technique, in which answers are marked for a set of predefined questions. Formulating
the questions without having seen the passages of text helps to avoid bias. Only one
answer per question was allowed per context.

4.4.2.4 Exploration of Annotated GDPRQA

Reflecting the differences that GDPR brings into the structure and terminology of
privacy policies, the relevant questions were derived, which can be seen in Table 2.

• Contact Details. This category, which in total amounts to 11.9%, is not present
in existing datasets and is mandatory to include under GDPR, as it is required to
allow users' inquiries about the practices applied to their personal data. The Data
Protection Officer annotations account for 6.1% of all questions, and general contact
information for 5.8%.


• Transfers Outside of EU. Annotations in this category provide more detail on
what happens to the data during international transfers, such as whether the data
leaves the EU/EEA (5.3%) and the concrete location of data storage (2.8%). The
total representation in the dataset is 13.8%.

• Automated Decision Making. The topic of the effect of automated decision making
based on personal data was introduced when GDPR came into effect, hence it is also
included in the manually created GDPRQA, accounting for 10.2%.

• Sensitive Data. Not covered by any existing dataset, sensitive data (3.7% of
annotations in GDPRQA) is extremely important due to the more stringent requirements
that must be met for a company to process it.

• Data Subject Rights. Elaboration on data subject rights and how to exercise them
is one of the key reasons why the introduction of GDPR gave individuals more control
over their data. Previous datasets only include a generic description of "user
control", hence a comprehensive list of questions related to the rights of individuals
was composed. In total, the questions on data subject rights account for 46.7%, with
the most frequent questions being about user consent, direct marketing, data changes,
data erasure, lodging complaints, and exercising of rights.

• Other. Some other changes introduced by GDPR, such as stricter security measures
(4.5%), specific terminology and the obligatory inclusion of the concept of data
controller (3%), and retention policies (6.2%), are also covered.


Category                   Question                                                       Frequency

Contact Details            Who is the Data Protection Officer?                            6.1%
                           Who should I contact if I have questions about my data?        5.8%
Transfers outside EU       Does my data leave the EU/EEA?                                 5.3%
                           Which countries store my data?                                 2.8%
                           Are there any international transfers of my data?              5.7%
Automated Decision Making  Is my data used in automated decision making?                  4%
                           Is my data used in profiling?                                  6.2%
Sensitive Data             Do you collect any sensitive personal data?                    3.7%
Data Subject Rights        Can I edit or change the data that I have provided to you?    4.5%
                           Can I object to processing my data?                            3.3%
                           Can I object to receiving emails from you?                     4.7%
                           Can I restrict what information you are processing about me?   2.8%
                           Can I withdraw my consent?                                     5.2%
                           Can I access my data?                                          4.1%
                           Can I download my data?                                        3.4%
                           Can I erase my data?                                           4.6%
                           What rights do I have?                                         1.3%
                           How can I exercise my rights?                                  4.3%
                           How do I complain?                                             4.6%
                           How can I change my options?                                   1%
                           How can I access my personal data?                             0.7%
                           How can I change my data?                                      0.9%
                           How do I delete my data?                                       0.9%
                           How do I download my data?                                     0.4%
Other                      What security measures do you use to protect my data?          4.5%
                           Who is the data controller?                                    3%
                           How long do you store my data for?                             6.2%

Table 2: Overview and Frequencies of Questions in GDPRQA

The statistical summary reveals that the average length of questions is 8.7 words, while
the average length of answers is 21 words and of contexts 178 words, as described in
Table 3. The distribution of questions present in the dataset is rather even, as can be
seen in Table 2. It is observed that some policies generally lack information on certain
topics. For instance, the countries which store the data are often not evidently stated,
as this answer appears in only 2.8% of the (question, answer, context) triples. The low
frequency of the sensitive data category may be explained by the fact that few companies
deal with sensitive data; moreover, such data is highly regulated. In terms of the data
subject rights distribution, the policies provide less information on the right of
restriction of processing and the right to data portability. Moreover, the policies seem
to lack details on the process behind the exercising of specific rights: information on
how to change data or options is less frequent.


Figure 13: Distribution of N-gram Prefixes of Questions in GDPRQA

The distribution of the n-gram prefixes of the questions in the GDPRQA dataset can be
seen in Figure 13. The majority of the questions are non-factoid, with causal and complex
questions prevailing, such as "How can I exercise my rights?" (elaborating causal). There
are a few factoid questions, such as "Who is the DPO?", and some list questions, for
instance, "What rights do I have?". Questions such as "Can I object to processing?" can
be classified as both confirmatory and causal, as they also require explanatory and
elaborative aspects. Hence, the combination of different types of questions, with the
majority being non-factoid, makes the GDPRQA dataset rather challenging, as the expected
answer varies in length, span, and structure.

A comparison of PolicyQA and GDPRQA can be found in Table 3.


PolicyQA GDPRQA
Source Pre-GDPR Website Privacy Policies Post-GDPR Privacy Policies
Policies 115 47
Avg Words in One Policy 2319 4569
Questions 714 28
Annotations 25017 1402
Question Annotators Domain Experts Authors
Form of Q Reading Comprehension Reading Comprehension
Answer Type Sequence of words Sequence of Words
Avg Question Length 11.2 8.7
Avg Answer Length 13.5 21
Avg Passage Length 116 178

Table 3: Comparison of the Datasets

4.5 Data Preparation and Preprocessing

Preprocessing of the data is aimed at reducing the noise and, as a result, rendering a
consistent and transformed final dataset ready to be used in modelling.

As BERT can take in at most 512 tokens, the policies were split into paragraphs with a
purpose-built paragraph divider function, producing sequences of a maximum length of
512 tokens that meet the BERT processing limitation.
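A minimal sketch of such a paragraph divider (illustrative, not the thesis implementation; whitespace tokens stand in for BERT wordpieces, so real wordpiece counts would run somewhat higher):

```python
def split_into_chunks(text, max_tokens=512):
    """Greedily pack blank-line-separated paragraphs into chunks of at most
    max_tokens whitespace tokens (a stand-in for BERT wordpiece counts).
    A single paragraph longer than max_tokens is kept whole here; a real
    divider would need to split it further."""
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```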

The privacy policies' texts were normalised and preprocessed with tokenization and stop
word removal. Tokenization segmented the text into tokens, which were further lowercased.
All the preprocessing is integrated into the relevant parts of modelling and explained
in Section 4.6.

4.6 Modelling

The modelling takes an iterative approach to find the best solution that can answer
GDPR privacy policy inquiries, accommodating both the PolicyQA and the GDPRQA
datasets. Throughout this paper, several Transformer-based models were trained, together
with the accompanying infrastructure.

The modelling is done through three iterations resulting in the best performing models


which are further tested in a simulated production environment. The modelling focuses
on three BERT models, namely, BERT-base, RoBERTa-base and PrivBERT.

1. Iteration 1. General QA Knowledge Acquisition: Fine-tuning on Open
Domain Datasets (SQuAD or NLQuAD).
The task of the first iteration is to teach the model a foundational understanding of
QA from the general open domain (also known as the source domain). Two general
open domain QA datasets were compared: the factoid SQuAD dataset, which is
considered the standard used by the majority of QA tasks, and NLQuAD, a
non-factoid dataset.

2. Iteration 2. Transfer Learning: Adapting General QA Knowledge to
GDPR Privacy Policies (GDPRQA & PolicyQA).
The second iteration aims to transfer the knowledge gained from the source domain to
the target post-GDPR privacy policy domain. Both iterations utilize the HuggingFace8
Transformers library for training.

3. Iteration 3. Hyperparameter Optimisation.
The third iteration optimizes the best results of the trained models through
hyperparameter optimization.

The three iterations are illustrated in Figure 14. Finally, the hyperparameter optimized
model will be tested in a simulated production environment with a Reader, Retriever, and
IR pipeline, as illustrated in Figure 16.

8
HuggingFace is an NLP open-source community aimed at democratising NLP practices, which provides
functionality for numerous tasks, in particular, based around the Transformers library.


Figure 14: Modelling Approach

4.6.1 Modelling Setup

All preprocessing, data analysis, modelling, and testing tasks are run through Google
Colaboratory, a cloud-based Jupyter notebook environment developed by Google Brain.
The Pro+ subscription version was used, which allowed priority access to GPUs for
increased computational power, highly necessary when working with large textual corpora
and a variety of computation-intensive models. The Google Colaboratory Pro+ subscription
allows both access to a Tesla V100-SXM2-16GB GPU and background execution, which is
essential when training time exceeds 10 hours. Essentially, the Tesla V100-SXM2-16GB
GPU consists of numerous cores that are able to compute in parallel.


Deep learning in its very essence involves a significant amount of matrix and vector com-
puting which can be easily parallelized. Thus, GPU can aid tremendously in the training
and inference of neural network modelling. The nodes’ weights are adjusted iteratively
during the training. Thus, Transformer models speed up with the use of GPU which allows
for working with larger datasets. Due to the variety of parameters used, inference also
benefits from higher computational power.

4.6.2 Iteration 1: General Knowledge Acquisition

Initially, all models are trained on one of two datasets, the widely industry-adopted
factoid-based SQuAD or the non-factoid NLQuAD, in order for the model to learn the
foundations of QA.

To lower training times and ensure replicability, two of the three SQuAD-based models,
RoBERTa9 and BERT10, were collected from HuggingFace's Model Hub already fine-tuned on
SQuAD. PrivBERT, however, was fine-tuned on SQuAD manually, as no pre-existing
fine-tuned model exists. To ensure consistency with the other two pre-trained and
fine-tuned models, the same hyperparameters were used for PrivBERT. The hyperparameters
are displayed in Table 4.

All models were fine-tuned on the SQuAD 1.1 dataset to comply with the format of the
PolicyQA dataset and ensure consistency across training and evaluation.

4.6.3 Iteration 2: Transferring Knowledge to Privacy Policy Domain

To perform well at the specific task of QA in the target privacy domain, the knowledge
learnt from the bigger open domain QA (source domain) should be transferred onto a
smaller and narrower privacy policy QA. This allows the model to learn the nuances of
privacy policy semantics and lexicon.

9
roberta-base-squad-v1 model
10
bert-base-uncased-squad-v1 model


1. Source domain DS → General Open Domain (SQuAD/NLQuAD)
   Source task TS → Answering general questions

2. Target domain DT → GDPR Privacy Policies (PolicyQA/GDPRQA)
   Target task TT → Answering GDPR Privacy Policy specific questions

As described in Section 3.5 of the Theoretical Framework, the aim of transfer learning is
to improve the learning of the target predictive function fT(·) in the target domain DT
with the use of knowledge transferred from the source domain DS and source task TS.

The transfer learning is enabled through the HuggingFace script run_squad.py, which
fine-tunes any Transformer model on a SQuAD-formatted dataset. It has 45 arguments,
allowing for much flexibility in adjustments. The code was first modified by Soleimani
et al. (2021), creating run_nlquad.py, and was further adjusted to fit the needs of this
thesis. Essentially, the code downloads the model weights for the given model (BERT-base,
RoBERTa-base, and PrivBERT). Next, the training samples are converted to features and
saved to the cache file. Given the --do_train argument, training is performed for a
specified number of epochs. The model's weights are saved as checkpoints according to the
specified --save_steps. The weights of the final output of the model are saved to
--output_dir. Lastly, evaluation provides performance scores.
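A run could be launched roughly as follows. This is a sketch only: the flag names follow the standard HuggingFace run_squad.py example script, while the model name and paths are hypothetical; the command is assembled as a list without being executed.

```python
def build_finetune_command(model_name, train_file, output_dir,
                           epochs=2, lr=3e-5, batch_size=8):
    """Assemble a run_squad.py-style command line (hypothetical paths;
    flag names follow the standard HuggingFace example script)."""
    return [
        "python", "run_squad.py",
        "--model_type", "bert",
        "--model_name_or_path", model_name,
        "--do_train", "--do_eval",
        "--train_file", train_file,
        "--learning_rate", str(lr),
        "--num_train_epochs", str(epochs),
        "--per_gpu_train_batch_size", str(batch_size),
        "--max_seq_length", "512",
        "--doc_stride", "128",
        "--save_steps", "400",
        "--output_dir", output_dir,
    ]
```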

4.6.4 Iteration 3: Optimising the Best Performing Model

The performance of learning algorithms often depends on the correct instantiation of their
hyperparameters. While hyperparameter settings often make the difference between standard
and state-of-the-art performance (Hutter, Hoos, & Leyton-Brown, 2011), finding an optimal
setting is very time-consuming due to the complexity of Transformers and the large amounts
of textual data. The issue is particularly important in large-scale problems, where the
size of the data can be so large that a quadratic running time is insurmountable
(Wang, Feng, Zhou, Xiang, & Mahadevan, 2015).

This thesis makes use of a manually implemented grid search, as integrating hyperparameter
optimization libraries (Bayesian search or grid search) into the developed code was
outside the timeframe of this paper. The hyperparameter optimization is


performed on the best performing models of the second iteration, found in Section 4.6.3,
for each of the two datasets: SQuAD and NLQuAD. These two models are optimized on three
hyperparameters, namely epochs, learning rate, and batch size.

• Epochs: [1, 2, 3, 4]

• Learning Rate: [1e-5, 2e-5, 3e-5, 4e-5]

• Batch Size: [4, 8, 16]

Firstly, the number of epochs is optimized with default values for the other parameters
to understand its implication on model performance. The best performing number of epochs
is then fixed as a bound hyperparameter in the following manually performed grid search
over learning rate and batch size. Each of the 4 learning rates is trained against the 3
batch sizes and evaluated on the GDPRQA and PolicyQA validation and test sets to ensure
consistency in findings. Other parameters, such as warm-up steps and weight decay, are
not optimized due to computational limitations.
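The manual grid search can be sketched as follows, with `train_and_eval` standing in as a hypothetical callable that fine-tunes a model with the given settings and returns its F1 score:

```python
import itertools

def grid_search(train_and_eval, best_epochs,
                learning_rates=(1e-5, 2e-5, 3e-5, 4e-5),
                batch_sizes=(4, 8, 16)):
    """Manual grid search as described above: epochs fixed to the best value
    found first, then learning rate crossed with batch size (12 runs)."""
    best = (None, -1.0)
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        f1 = train_and_eval(epochs=best_epochs, lr=lr, batch_size=bs)
        if f1 > best[1]:
            best = ((lr, bs), f1)
    return best
```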

The model achieving the highest F1 score across the two datasets is chosen for the
simulation of the production environment.

4.6.5 Parameters

The following section highlights the parameters used in the derived models. The overview
of the parameters is presented in Table 4.


Parameter Value
optimiser AdamW
activation function GELU
hidden_size 768
num_attention_heads 12
num_hidden_layers 12
vocab_size (BERT) 30522
vocab_size (RoBERTa & PrivBERT) 50265
correct_bias false
weight_decay 0.01
learning_rate 0.00003
max_seq_length 512
max_answer_length 512
top_k 50
num_training_steps 400
num_warmup_steps (first iteration) 80
num_warmup_steps (second iteration) 1000
train_batch_size 8
doc_stride 128
warmup_proportion 0.2
num_epochs 2

Table 4: Parameters Used in Modelling

AdamW is used as the optimiser. Adaptive optimisers, such as Adam, are optimization
algorithms that iteratively update network weights with per-parameter learning rates, in
contrast to conventional stochastic gradient descent (SGD). Adam, which stands for
adaptive moment estimation, is an extension of SGD that has been widely adopted in deep
learning and neural language modelling (Géron, 2017).

However, models trained with Adam might not generalise as well in QA tasks, so SGD
with momentum could be preferred. Loshchilov and Hutter (2019) discovered that L2
regularisation is much less effective in adaptive optimisers compared to SGD. L2
regularisation (weight decay) builds on the observation that networks with smaller
weights normally generalise better and are less prone to overfitting. Loshchilov and
Hutter (2019) improve Adam by creating AdamW, where the weight decay is decoupled from
the gradient update and applied after the step, hence being proportional to the weight
itself. AdamW results in better training loss and generalisation while retaining Adam's
general benefits, such as the decreased need to tune the learning rate, as it is an
adaptive learning rate algorithm.


Figure 15: GELU

GELU, short for Gaussian Error Linear Unit, is a high-performing activation function
used to decide whether the neurons of the developed networks should be activated. The
function is presented in Figure 15. Activations with linear units such as ReLU, ELU,
etc. allow for faster convergence of neural networks. Moreover, dropout is able to
regularise the model by randomly multiplying activations by zero. Both of these
approaches determine the output of the neuron. GELU combines them, thus offering a more
probabilistic neuron outcome and a new probabilistic understanding of the nonlinearity,
which is why it was used. GELU has been observed to result in better performance in a
variety of NLP tasks, specifically in QA. It is advised to use it with an optimiser with
momentum, which fits well with AdamW (Hendrycks & Gimpel, 2016).

GELU can be formulated as follows, where Φ(x) stands for the cumulative distribution
function of the standard Gaussian distribution:

GELU(x) = x · Φ(x)
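Using the error function, the exact GELU can be computed directly (a small illustrative implementation; frameworks often use a tanh approximation instead):

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    expressed via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For large positive x the function approaches the identity, for large negative x it approaches zero, and gelu(0) = 0, matching the curve in Figure 15.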

The warm-up proportion is set to 0.2, which represents the fraction of training steps
taken until the maximum learning rate is reached. Until that point, the learning rate
increases linearly, and once it is reached, the rate starts to decrease linearly. The
number of warm-up steps taken is 80. Warm-up steps are updates with a small learning
rate taken at the beginning of training; once they are taken, the model trains to
convergence with the regular learning rate. Warm-up helps the model adjust to the data
and lets the adaptive optimizer (AdamW) estimate the correct rates needed for the
gradients. The model takes in eight samples in one batch for training. The W in AdamW
stands for weight decay, which is set to 0.01; given a weight decay of 0, AdamW becomes
identical to Adam.
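The resulting schedule can be sketched as a simple function of the step number (an illustration using the parameter values from Table 4, not the HuggingFace scheduler itself):

```python
def linear_warmup_schedule(step, peak_lr=3e-5, warmup_steps=80, total_steps=400):
    """Learning rate at a given step: linear increase during warm-up
    (80 of 400 steps, i.e. proportion 0.2), then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```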

Document stride splits long documents into several features. The sliding window of 128
controls the stride between the chunks of the split text. The maximum sequence length is
512, the maximum single text input that BERT can handle. The number of attention heads
and hidden layers is 12. top_k is set to 50, which is the number of answer candidates
that the Reader considers. The more candidates, the higher the probability of obtaining
the correct answer, at the cost of processing time.
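The sliding-window behaviour can be sketched as follows (a simplification: real SQuAD preprocessing operates on wordpiece tokens and reserves space for the question and special tokens):

```python
def sliding_window(token_ids, max_len=512, doc_stride=128):
    """Split a long token sequence into overlapping windows, advancing the
    start by doc_stride tokens each time, so consecutive windows overlap
    by max_len - doc_stride tokens."""
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += doc_stride
    return windows
```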

4.6.6 Model Associated Files

Every model has a set of files associated with it, including its configurations (Section
4.6.6.1), tokenizers (Section 4.6.6.2), and encodings (Section 4.6.6.3), which will be elab-
orated on further below.

4.6.6.1 Configuration

The base class PretrainedConfig implements common methods for loading/saving a
configuration either from a local folder or directory, or from a pretrained model
configuration provided by the HuggingFace Hub. Each derived configuration class
implements model-specific attributes. Common attributes present in all configuration
classes are hidden_size, num_attention_heads, num_hidden_layers, and vocab_size. Each
subclass implements a model_type, an identifier for the model that is serialized into
the JSON file and used to recreate the correct object. keys_to_ignore_at_inference is
a list of keys to ignore by default when looking at dictionary outputs of the model
during inference and, finally, attribute_map is a dictionary that maps model-specific
attribute names to the standardized naming of attributes.


4.6.6.2 Tokenizers

To enable the model to understand input data, the data needs to be processed into a
format acceptable to the model. As models do not understand raw text, the inputs need
to be converted into numbers and assembled into tensors. Tokenizers serve as the way to
translate text into model-comprehensible data. The tokenizers used start by splitting
text into tokens using one of the algorithms (BPE or WordPiece) described in Section
4.6.6.3. The tokens are converted into numbers, which are used to build tensors as
input to the models. Any additional inputs required by a model are also added by the
tokenizers through the two configuration files merges.txt and special_tokens_map.json
described below.

1. merges.txt
merges.txt is a file that only exists within HuggingFace's RoBERTa-based language
models. As HuggingFace's RoBERTa tokenizer is based on the GPT-2 tokenizer, merges.txt
stores the learned byte-pair merge rules: the text is first split into subword tokens
according to these rules, and vocab.json then maps each token to an id, as seen in the
example below.

What's up with the tokenizer?

The tokenizer first tokenizes according to the merges.txt file:

['What', "'s", 'up', 'with', 'the', 'token', 'izer', '?']

Then, according to the values in vocab.json, these tokens are replaced by their
corresponding indices:

[2061, 338, 510, 351, 262, 11241, 7509, 30]

2. special_tokens_map.json
Each model further contains a special_tokens_map.json file that stores mappings of
special tokens, specifically "[UNK]", "[SEP]", "[PAD]", "[CLS]" and "[MASK]" for
BERT; and "<s>", "</s>", "<unk>", "<pad>" and "<mask>" for RoBERTa. The special
tokens are described below:

• "[UNK]" / "<unk>" is used when a token in the training data is not covered in
the vocabulary due to the vocabulary size. However, since BERT makes use of
WordPiece and RoBERTa uses BPE, both described in Section 4.6.6.3, there should
be little to no unknown token mappings when tokenizing the training data.

• "[MASK]" / "<mask>" is used to enable the deep bidirectional learning aspect of
a Transformer and leverage masked language modelling (MLM). A percentage of the
input tokens is selected at random for masking, and the model tries to predict the
selected tokens; its predictions are fed into an output softmax which results in the
final output words. The models select 15% of words while training, but not all of
them receive the "[MASK]" / "<mask>" token: about 80% of the selected tokens are
replaced with "[MASK]" or "<mask>", 10% of the time a random token is placed
instead of the original token, and 10% of the time the original input token is left
unchanged.

• "[CLS]": BERT encoders produce a sequence of hidden states. For classification
tasks, this sequence needs to be reduced to a single vector, which is done by taking
the hidden state corresponding to the first token. To make this pooling scheme
work, BERT prepends a [CLS] token (short for "classification") to the start of each
sentence. In QA tasks, it denotes the start of a question.

• "<s>" and "</s>": Similarly to BERT's separation token, RoBERTa uses "<s>"
and "</s>". The first marks the classification sequence where the question is
located, and the second separates it from the context
(<s> Question </s> </s> Context </s>).

• "[SEP]": In addition to MLM, BERT also uses a next sentence prediction task
to pre-train the model for tasks that require an understanding of the relationship
between two sentences. When taking two sentences as input, BERT separates the
sentences with a special [SEP] token. During training, BERT is fed two sentences —
50% of the time the second sentence comes after the first one, and 50% of the time


it is a random sentence. BERT is then asked to predict whether the second sentence
is a random sentence or not.

• "[PAD]" / "<pad>" serves as a padding token for BERT and RoBERTa respectively,
as both models receive a fixed-length context as input. The maximum context
length, 512 in the case of this paper, depends on the model and the training
data. Contexts that are shorter than the maximum length are padded with empty
tokens to make up the length.
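The 80/10/10 corruption scheme described for "[MASK]" / "<mask>" above can be sketched as follows. The `mask_tokens` helper and the toy vocabulary are illustrative, not the models' actual implementation:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", seed=0):
    """BERT-style corruption: select 15% of the positions at random;
    replace 80% of the selected tokens with the mask token, 10% with a
    random vocabulary token, and leave the remaining 10% unchanged.
    Returns the corrupted sequence and the positions to predict."""
    rng = random.Random(seed)
    out = list(tokens)
    n_select = max(1, round(0.15 * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_select)
    for pos in positions:
        r = rng.random()
        if r < 0.8:
            out[pos] = mask_token          # 80%: replace with the mask token
        elif r < 0.9:
            out[pos] = rng.choice(vocab)   # 10%: replace with a random token
        # else: 10% of the time the original token is left unchanged
    return out, sorted(positions)
```

The model is only trained to predict the tokens at the returned positions; the rest of the sequence serves as bidirectional context.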

4.6.6.3 Encodings

Two encodings will be presented. Byte-Pair Encoding (BPE) is used by RoBERTa, and
WordPiece is used by BERT.

1. Byte-Pair Encoding (BPE)


RoBERTa tokenizers make use of Byte-Pair Encoding (BPE) which was introduced by
Sennrich, Haddow, and Birch (2016). BPE relies on a pre-tokenizer that splits the training
data into words; RoBERTa specifically makes use of simple whitespace tokenization.
The pre-tokenization creates a list of tuples containing the unique words in the
training data and their corresponding frequencies, e.g. [('personal', 4530),
('data', 8731), ...]. Subsequently, the tokenizer creates a vocabulary that contains every symbol that
exists in the set of unique words and learns merging rules to create a new symbol that
represents two symbols in the vocabulary. It consistently performs these merges, until the
vocabulary size set in the hyperparameters is reached.

To provide an example, let’s assume that after initial tokenization, the following set of
words including their frequency has been determined:

[("mud", 10), ("cud", 5), ("pun", 12), ("sun", 4), ("puns", 5)]

Consequently, the base vocabulary consists of ["c", "d", "m", "n", "p", "s", "u"].
Splitting all words into symbols of the base vocabulary, we obtain:

[("m" "u" "d", 10), ("c" "u" "d", 5), ("p" "u" "n", 12), ("s" "u" "n", 4),
("p" "u" "n" "s", 5)]

The tokenizer performs counts of the frequency of all possible symbol pairs and merges
the symbols that appear most frequently. In the example above, ”u” followed by ”n” is


the highest occurring symbol combination (12 + 4 + 5 = 21) and thus the first merge rule
of the tokenizer. The BPE algorithm keeps iterating over the most frequently appearing
symbol pairs in descending order until the vocabulary limit is reached.

[("m" "u" "d", 10), ("c" "u" "d", 5), ("p" "un", 12), ("s" "un", 4),
("p" "un" "s", 5)]
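The merge-learning loop in the worked example above can be written out as a short sketch. The `learn_bpe_merges` helper is illustrative, not the implementation used by the actual tokenizers:

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} map: count all
    adjacent symbol pairs, merge the most frequent pair into a new
    symbol, and repeat until num_merges rules have been learned."""
    # Base vocabulary: every word split into single characters.
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return merges, splits

freqs = {"mud": 10, "cud": 5, "pun": 12, "sun": 4, "puns": 5}
merges, splits = learn_bpe_merges(freqs, num_merges=1)
print(merges)          # [('u', 'n')]
print(splits["puns"])  # ['p', 'un', 's']
```

With the example frequencies, the first learned rule is indeed ("u", "n"), matching the count of 12 + 4 + 5 = 21 derived above.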

2. WordPiece
Similarly to BPE, WordPiece is the subword tokenization algorithm used for BERT. The
algorithm was described by Schuster and Nakajima (2012) and is very similar to BPE.
WordPiece creates a vocabulary that includes all characters present in the training data
and iteratively learns merge rules. In contrast to BPE, WordPiece does not choose the
most frequent symbol pair, but the one that maximizes the probability of the training data.

In other words, maximizing the probability of the training data is equivalent to
choosing the symbol pair whose probability, divided by the product of the proba-
bilities of its two parts, is the greatest among all symbol pairs in the vocabulary.
Referring to the example, "u" followed by "n" would only have been merged if the
probability of "un" divided by the product of the probabilities of "u" and "n" had
been greater than the corresponding ratio of any other symbol pair.
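The difference between the two selection criteria can be illustrated with a small sketch. The `wordpiece_best_pair` helper and the toy word frequencies are hypothetical; the ratio freq(ab) / (freq(a) · freq(b)) is the standard frequency-based approximation of the likelihood gain:

```python
from collections import Counter

def wordpiece_best_pair(word_freqs):
    """Pick the pair WordPiece would merge: the one maximizing
    freq(ab) / (freq(a) * freq(b)), i.e. the pair whose co-occurrence
    is most surprising given its parts, rather than the raw most
    frequent pair that BPE would choose."""
    pair_counts, symbol_counts = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return max(
        pair_counts,
        key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]),
    )

# ("i", "s") and ("i", "t") are 25x more frequent, but ("q", "u") wins
# because "q" is almost always followed by "u".
print(wordpiece_best_pair({"qu": 1, "quiz": 1, "is": 50, "it": 50}))  # ('q', 'u')
```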

4.6.7 Simulation of Production Environment Prototype

There might be a general inconsistency between how single-document closed-domain
systems (explained in Section 3.4.5) are evaluated in a production setting and how
the majority of research literature evaluates their results. When fine-tuning on a given dataset,
the context of a question has to be limited to the maximum token length that a given
transformer allows, which means that documents exceeding 512 tokens for BERT and
RoBERTa are cropped. To circumvent the cropping, contexts are normally split, as de-
scribed in Section 4.5.

That, in turn, results in contexts that are smaller and more specific than full
privacy policies and that have annotated questions and answers, which essentially
leads to a more controlled evaluation environment. In this environment, models
tend to perform better, as the probability of extracting the right answer span is
higher. However, it cannot be inferred that


it will perform the same in a production setting where the data would be larger. When
an entire policy is fed into the model, the answer might come in varieties, not necessarily
corresponding to the labelled ground truth.

To achieve a more realistic evaluation of how the models would perform in a production
environment, a simulated prototype was created. The simulated production environment
includes an information retrieval system, specifically ElasticSearch, a Retriever and the two
hyperparameter-optimized models. ElasticSearch is implemented as a localhost database
in the Google Colab environment and uses the Haystack framework from deepset.ai to
facilitate the indexing and retrieval of documents.

4.6.7.1 Production Process Flow

The entire production architecture is reflected in Figure 16. At index time, the privacy
policy is split into paragraphs, which are then sent to ElasticSearch, making them
searchable. On the other side, a user asks a question about a specific policy. The Retriever
preprocesses the query and sends the preprocessed query to ElasticSearch. Next, Elastic-
Search returns the retrieved documents relevant to the search query. The Reader receives
the retrieved documents from the Retriever, then tokenizes and encodes the retrieved doc-
uments. Next, it feeds them into the QA model. The QA model finds the correct answer
span and sends the output to the decoder that decodes it. The decoded answer span is
finally sent back to the user.


Figure 16: Production Process Flow

4.6.7.2 IR: Indexing of Documents

The indexing of documents in the ElasticSearch database is found in the Retriever.py


file. Firstly, it iterates through all policies in the OPP-115 and GDPRQA corpora.
Each policy is preprocessed: the Preprocessor.py made available by HayStack first
looks at the length of the context, and if it exceeds 384 words, it splits the
policy with a document stride of 128. To ensure that no sentences are broken, the
split is placed at the end of the last sentence that does not exceed 384 words.

After all splits have been performed, they are each converted to the Document.py class
using either a MarkdownConverter.py if the file is in HTML format or a TextConverter.py
if the policy is in a text format. Finally, each document is indexed to ElasticSearch with
the metadata of the corresponding title of the privacy policy. The parameters used for the
preprocessor are the following:

• split_by: "word"

• split_length: 384

• split_respect_sentence_boundary: True

• split_overlap: 128
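The splitting behaviour can be approximated with a short sketch. This simplified version strides over words and omits the sentence-boundary handling that Preprocessor.py performs; `split_with_stride` is a hypothetical helper, not the HayStack implementation:

```python
def split_with_stride(text, split_length=384, overlap=128):
    """Word-level striding split: each segment holds up to split_length
    words, and consecutive segments share `overlap` words so that an
    answer span near a boundary is fully contained in some segment."""
    words = text.split()
    step = split_length - overlap
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break
    return segments
```

With the production parameters, a 1,000-word policy yields four overlapping segments, each sharing its first 128 words with the previous one.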


4.6.7.3 Retriever: Retrieval of Documents

The retrieval of documents is performed through the Retriever.py class provided by


HayStack. The Retriever works as a lightweight high-level class to communicate with
the IR system. The Retriever is fed a number of different parameters to retrieve the
right documents, such as a search query, the number of documents to retrieve and
filters. The search query always consists of a specific question, and the number of
retrieved documents is arbitrarily set to 100 to ensure the retrieval of the entire
policy. The search query filters on the specific title of the privacy policy. The
Retriever will always only return the number of documents available given the
filters, meaning that if a privacy policy contains fewer than 384 × 100 tokens,
fewer than 100 documents will be returned. Below is an example
of both the Retriever code (displayed in Listing 2) and its corresponding converted JSON
format that is further sent to ElasticSearch (Listing 3).

retriever.retrieve(query="Is my data leaving the EU?",
                   top_k=10,
                   filters={"title": ["imdb.com"]})

Listing 2: Retriever Code

{
  "size": 10,
  "query": {
    "bool": {
      "should": [{"multi_match": {
        "query": "Is my data leaving the EU?",
        "type": "most_fields",
        "fields": ["content", "title"]}}],
      "filter": [
        {"terms": {"title": ["imdb.com"]}}
      ]
    }
  }
}

Listing 3: Retriever Code Converted to JSON

When retrieving a query, ElasticSearch analyses the query through its inverted index and
uses the BM25 algorithm to return the most relevant documents. This process is referred


to as keyword search, thus it does not consider phrasing or the context of the question. The
highest scoring documents, according to the BM25 algorithm, are returned and converted
back into an instance of the Document.py class.
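For intuition, a simplified version of the BM25 scoring function (the default similarity in ElasticSearch) can be written as follows. This sketch operates on pre-tokenized documents and omits ElasticSearch's index-level statistics and field boosts:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Simplified BM25 score of one document for a tokenized query.
    corpus is a list of tokenized documents, used for the document
    frequencies and the average document length."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)            # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))     # rare terms weigh more
        tf = doc_terms.count(term)                           # term frequency
        # Term-frequency saturation (k1) and length normalization (b).
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```

Documents that do not contain any query term score zero, which is why this kind of keyword search ignores phrasing and context, as noted above.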

4.6.7.4 Readers: Answer Extraction From Documents

The two hyperparameter optimized models are used to extract answers for questions,
given the contexts of the returned documents of the Retriever. The Readers are fed three
parameters, specifically a question, the number of answers to return and the documents
provided by the Retrievers. Each Reader is inserted into a QA pipeline which serves as
a high-level class which abstracts the code complexity of the tokenization, encoding, and
decoding. An example of the Reader can be seen in Listing 4.

answers = squad_question_answerer(question='What rights do I have?',
                                  context=retriever.context[0], top_k=1)

# RETURNS
# [{'answer': 'You have the right to access your data',
#   'start_idx': 34, 'end_idx': 64, 'score': 0.83}]

Listing 4: Reader Code and Output

4.6.7.5 Production Architecture: IR, Reader, Retriever

The architecture consists of the aforementioned IR, Reader, and Retriever. The architec-
ture is created to evaluate the performance of the existing models in a simulated production
environment. The performance on PolicyQA and GDPRQA datasets is evaluated sepa-
rately, but the IR system contains data from both datasets. Each of the two datasets is
evaluated by iterating through the dataset, where each question-answer pair is collected.
For each question, the Retriever is asked to provide up to 100 documents that correspond
to the title associated with the privacy policy in question. The Reader is then tasked with
finding one, three or five extracted candidate answers for a given question per received
document. A snippet of the retrieving and answer extraction can be found in Listing 5.

for policy in data['data']:
    # Stores the title of the test data
    title = policy['title']
    for paragraph in policy['paragraphs']:
        # Stores the context of a given question
        context = paragraph['context']
        for qas in paragraph['qas']:
            question = qas['question']
            ground_truth = qas['answers'][0]['text']
            predictions = p_retrieval.run(query=question,
                                          params={"Retriever": {"top_k": 10},
                                                  "filters": {"title": [title]}})
            answer_lst = []
            for document in predictions['documents']:
                document.content = document.content.replace('|||', ' ')
                # Finds the answer to a given question
                answer_lst.append(squad_question_answerer(question=question,
                                                          context=document.content))

Listing 5: Retrieval and Answer Extraction Code

4.7 Evaluation

The extracted answers are evaluated using a number of performance metrics. The
number of shared words between the prediction and the ground truth is the basis of
the F1 score, which is computed over the individual words in the prediction against
those in the ground truth. Recall is the ratio of the number of shared words to the
total number of words in the ground truth; hence, recall represents how many of
the ground truth tokens are covered by the prediction. Precision, on the other
hand, is the ratio of the number of shared words to the total number of words in
the prediction; therefore, precision shows what percentage of the tokens in the
predicted answer also appears in the ground truth. Thus, it is much easier to
achieve high precision when the answer span is short. Exact match (EM) checks,
for each question-answer pair, whether the characters in the prediction identically
match those in the ground truth; it therefore ignores semantically similar but
differently worded answers. The F1 score is considered the most important metric
when evaluating our models.
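The metrics described above can be computed with a short sketch. The `token_f1` and `exact_match` helpers are illustrative and assume both strings have already been normalized:

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-overlap precision, recall and F1 between two answer strings."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()
    # Multiset intersection: shared words, counting duplicates.
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)   # shared / predicted length
    recall = num_same / len(truth_tokens)     # shared / ground truth length
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def exact_match(prediction, ground_truth):
    """1 if the normalized strings are identical, else 0."""
    return int(prediction == ground_truth)
```

For example, predicting "right to access" against the ground truth "the right to access your data" gives precision 1.0 (the short span is fully contained in the truth) but recall only 0.5, illustrating why short spans favour precision.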

4.7.1 Evaluation of Model Results

Closed-domain evaluation is performed when the evaluation is done only on the privacy
policy of the relevant organisation. Each model is evaluated on each of the two test
and validation datasets to assess GDPR question answerability and general PolicyQA
answerability separately. Each of the models is evaluated in the same way, with a
focus on the F1 metric; however, EM is also included in Section 5. For each question and context, their


ground truth answers are compared to the predicted answer. The top 50 candidates
are retrieved, and the top-scoring answer, according to the model's output score,
is compared with each of the ground truth answers. The highest F1 and EM of the
predicted answer across the ground truth answers are recorded for the given question.

4.7.2 Evaluation of Simulated Production Prototype

To realistically evaluate the simulated production prototype, both the PolicyQA and
GDPRQA datasets were evaluated separately to preserve consistency with other find-
ings. However, both sets are contained in the information retrieval system, ElasticSearch.
This does not have any effects on evaluation metrics of model performance, as the model is
only concerned with the privacy policy of the relevant organisation. However, an increase
in retrieval times will happen due to more privacy policies being stored. The performance
of the information retrieval system is not evaluated, as ElasticSearch is highly distributed
and the thesis works with single-document closed-domain tasks.

The evaluation of the model performance, which can be seen in Listing 7, is done on both
datasets. Each question in the dataset retrieves all segments of the relevant policy. The
model is tasked with finding the top candidate answer per retrieved segment. All answers
are merged, and the corresponding top 1, 3 and 5 candidates, dependent on the output
score of the model, are evaluated as seen in Table 14. To ensure common formatting
between the extracted answer and its corresponding ground truth, both are preprocessed
before any evaluation metrics are calculated. First, both the answer and the ground truth
are lowercased, followed by the removal of punctuation and articles such as a, an and the.
Finally, the '|||' markers created by the MarkDownConverter are removed. The normalization steps
can be seen in Listing 6.

def normalize_text(s):
    """Lowercasing, removing articles, punctuation and '|||' markers,
    and standardizing whitespace"""
    import string, re

    def remove_three_lines(s):
        return s.replace('|||', ' ')

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(
        remove_three_lines(lower(s)))))

Listing 6: Evaluation Preprocessing

When both the answers and the ground truths have been preprocessed, the corresponding
EM, F1, precision and recall are found for each of the one, three or five most relevant
answers in the entire document. If three or five answers are found in the document, only the
highest EM, F1, precision and recall are recorded. In other words, the evaluation reports
to which extent any of the returned answers are able to answer the question according to
the ground truth.
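The best-of-top-k evaluation rule can be sketched as follows. The `evaluate_top_k` helper is hypothetical and takes an arbitrary metric function as an argument:

```python
def evaluate_top_k(candidates, ground_truth, metric, k):
    """Merge candidate answers from all retrieved segments, keep the k
    highest-scoring ones by model output score, and report the best
    metric value any of them achieves against the ground truth."""
    top_k = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    return max(metric(c["answer"], ground_truth) for c in top_k)

candidates = [{"answer": "x", "score": 0.9},
              {"answer": "the right answer", "score": 0.5},
              {"answer": "y", "score": 0.7}]
em = lambda a, b: float(a == b)
print(evaluate_top_k(candidates, "the right answer", em, k=3))  # 1.0
print(evaluate_top_k(candidates, "the right answer", em, k=1))  # 0.0
```

The usage example shows why top-3 and top-5 scores are upper bounds on top-1: a correct answer with a low model score counts in the larger candidate pools but not in the top-1 evaluation.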

4.8 Deployment

To realize the potential of the QA system, an associated architecture and a front-end


to conceptualise the GDPRQA Assistant are needed. The proposed GDPRQA Assistant
architecture can enable the users to query privacy policy-related questions in a user-
friendly way. The architecture consists of three layers, a policy crawler and a feedback
module. The numbers seen in Figure 17 will be reflected in the description of each specific
layer.


Figure 17: Potential Deployment Architecture

4.8.1 Application Layer

The Application Layer displays information about the privacy policy, thus providing the
users with a user-friendly front-end to pose their queries and receive the answers. In this
layer, the Query Module receives the user’s query consisting of the name of the organi-
sation and a question about the related policy (1). The query is transformed using the
preprocessing steps explained in Section 4.6.7.3. These inputs are forwarded to lower lay-
ers (2), which then extract the entire policy in the form of stridden segments. Each of
the stridden segments is sent to the Machine Learning layer (4) for answer extraction. To
resolve the response received from the answer generation (5), each of the stridden segments
is merged into one entire policy which is displayed to the user. The response also includes
the top 5 answers, and their corresponding start and end indices, which will be used to
highlight the answer span in the policy as also shown in Figure 12. It is suggested that
the answers will be shown in a ranked format, depending on the confidence score of the
models.


4.8.2 Data Layer

The Data Layer serves two purposes. Firstly, it performs the cleaning of the policies
received from the PolicyCrawler and thereafter preprocesses and indexes the cleaned
privacy policies. A potential solution could be to use both Converters.py and
Preprocessor.py from HayStack, described in Section 4.6.7.2, which segment the
policy into semantically coherent and adequately sized documents.

The second purpose of the Data Layer is a knowledge base, potentially leveraged with
ElasticSearch or any other search engine. ElasticSearch receives a formatted query (2)
and utilizes its inverted index and BM25 algorithm to retrieve the relevant answers. With
the scaling of the number of policies stored in the knowledge base, it may be necessary to
limit the doc-stride or the amount of returned stridden segments to preserve low retrieval
times. Especially here, the power of the BM25 algorithm’s keyword search or potentially
a dense passage retriever may be necessary.

4.8.3 PolicyCrawler

There are several potential approaches to creating a PolicyCrawler. The first approach
would be to rely on direct URL access provided by users at query time. The PolicyCrawler
would receive the URL, extract the information, clean, preprocess and index the policy to
the knowledge base. Thus, the knowledge base would grow incrementally as the usage
of the platform increases.

Another possibility is to serve the models through an application. The application would
automatically retrieve the names of installed applications. The application names would
then be transferred to the crawler, which would query Google's Play Store or Apple's
App Store, both of which have structured privacy policy links associated with their
applications.

Lastly, the provided title could be used to automatically scan for privacy policies through
search engines. However, as the front-end often changes structure, there are numerous
associated dependencies which result in inconsistency. Hence, this approach seems like
the least reliable option.


4.8.4 Machine Learning Layer

The Machine Learning Layer is responsible for encoding the segment, extracting the an-
swer and decoding it to make it user comprehensible. The layer takes as an input the
privacy policy segment from the Query Module (4), and the user query (1) from the Ap-
plication Layer. The Machine Learning Layer encodes the question and context according
to what is described in Section 4.6.6.2, generates an answer as described in Section 3.4
and probabilistically assigns each answer an output score. All answers are then sent back
to the application layer in the following format:

[{'score': 0.76, 'start': 215, 'end': 240,
  'answer': 'The answer to the question'}]

4.8.5 Feedback Module

To enable continuous improvement of the QA system, a feedback module is proposed.


The feedback module is fed user feedback to the extracted answer from the Application
Layer (6). The feedback could come in the form of a rating, where the users can rate
the extracted answer with 'Great', 'Neutral' or 'Bad'. The answers are then stored in a
database and used to continuously improve the models, giving positive boosts to
feedback-approved answers and negative boosts to bad answers.

4.8.6 Further Improvements to the Architecture

Further additions to the architecture could improve the reliability of the answer
extractor. Firstly, a question classification model could potentially predict the question
type, enabling the IR system to filter on certain aspects of the question. Furthermore,
each segment could be supplied with extra meta-data provided by a classifier, such as the
XLNet classifier presented by Mustapha et al. (2020) and implemented as in Figure 18.


Figure 18: Architecture with Improvements

4.9 Reliability and Validity

Reliability is described as the consistency of the measures used in the methodology.


Manual annotation is somewhat subjective and may yield unstable or biased results.
Yet, the ground truth was established by two annotators working together in an
attempt to achieve more objectivity. Multiple models were built with consistently reproducible results.
All models were trained with the same parameters and evaluated in the same way.

Validity refers to the extent of accuracy of the measures used throughout the method-
ology. Content validity evaluates whether the method is adequately measuring all the
content relevant to the variable. The methods used cover the entire domain of privacy
policies, as it is built on top of the previous studies and is augmented with the aspects
relevant to the current GDPR regulatory landscape.

Construct validity explains whether any inferences can be made based on the test scores.
The test result cannot generally infer that questions to any policy can be answered with
the same performance scores, as it largely depends on the quality of the policy and its
compliance with GDPR. The results also depict convergent validity as the scores obtained
in the study are largely correlated with previously developed instruments in other privacy
policy studies.


5 Results
The results are based on the iteration process described in Section 4.7 and Figure 14. The
results are separated into the following sections:

1. Test Environment Results

(a) Models Fine-Tuned on SQuAD: The results all show pretrained models fine-
tuned on SQuAD

i. GDPRQA-SQuAD: Results are all evaluated on GDPRQA validation and


test set. Each set of results contains evaluations of BERT, RoBERTa
and PrivBERT fine-tuned on SQuAD, subsequently transfer learned on
PolicyQA, GDPRQA and an augmented version of them, PolicyQA &
GDPRQA.

ii. PolicyQA-SQuAD: Results are all evaluated on PolicyQA validation and


test set. Each set of results contains evaluations of BERT, RoBERTa
and PrivBERT fine-tuned on SQuAD, subsequently transfer learned on
PolicyQA, GDPRQA and an augmented version of them, PolicyQA &
GDPRQA.

(b) Models Fine-Tuned on NLQuAD: The results all show pretrained models fine-
tuned on NLQuAD.

i. GDPRQA-NLQuAD: Results are all evaluated on GDPRQA validation


and test set. Each set of results contains evaluations of BERT, RoBERTa
and PrivBERT fine-tuned on NLQuAD, subsequently transfer learned on
PolicyQA, GDPRQA and an augmented version of them, PolicyQA &
GDPRQA.

ii. PolicyQA-NLQuAD: Results are all evaluated on PolicyQA validation and


test set. Each set of results contains evaluations of BERT, RoBERTa
and PrivBERT fine-tuned on NLQuAD, subsequently transfer learned on
PolicyQA, GDPRQA and an augmented version of them, PolicyQA &
GDPRQA.

(c) Hyperparameter Optimisation: The results show the best performing model
from the second iteration optimized on various hyperparameters.


i. Epoch Optimization: Results for epoch optimization on SQuAD and NLQuAD

ii. Manual GridSearch: Results for a manual grid search over the learning rate
(0.00001, 0.00002, 0.00003 and 0.00004) and the batch size (4, 8 and 16).

2. Production Environment Results: Evaluation of the best performing hyperparameter


optimised models from the third iteration for NLQuAD and SQuAD, respectively.
Each of the two datasets is evaluated separately.

(a) PolicyQA Dataset: All results are evaluated on PolicyQA test and validation
dataset for top candidates of 1, 3 and 5 over full privacy policy.

(b) GDPRQA Dataset: All results are evaluated on GDPRQA test and validation
dataset for top candidates of 1, 3 and 5 over full privacy policy.

5.1 Test Environment

To start with, the results of the performance of the models in the testing environment
are displayed. The testing is done on the testing sets of GDPRQA and PolicyQA. The
data is in SQuAD format (consisting of question, answer, context triples) — hence, given
the nature of the formatting, the model sees the paragraph (context) where the answer is
contained.

First, the models fine-tuned on SQuAD will be tested on the GDPRQA and PolicyQA test-
ing and validation datasets. The model combinations would include the baseline models
solely tuned on SQuAD, as well as models trained on GDPRQA training data, PolicyQA
training data and combined GDPRQA & PolicyQA data. Next, the same setup will be
applied to models fine-tuned on NLQuAD instead of SQuAD to see whether non-factoid
open domain knowledge helps the models to answer privacy-related questions in compar-
ison to factoid-based SQuAD.


Figure 19: Overview of Performance Based on F1 Test Score

Figure 19 presents an overview of charts with SQuAD and NLQuAD based results for both
GDPRQA, PolicyQA and the augmented PolicyQA & GDPRQA dataset. The clusters
represent different training configurations, while colours stand for the different
language models used.

5.1.1 Models Pre-trained on SQuAD

GDPRQA-SQuAD
The best scores for GDPRQA-SQuAD are achieved by a PrivBERT model trained on the
GDPRQA dataset, resulting in an 81.1% F1 score on the validation set and 78.4% on
the test set, as seen in Table 5. The EM scores are 52% for the test set and 45.8%
for the validation set, meaning that for 52% of the documents in the testing set,
the answer estimated by the Reader is completely identical to the ground truth.

PrivBERT trained only on SQuAD results in 43.9% F1, while RoBERTa trained only on
SQuAD results in 20.7%; hence, a domain-specific language model indeed makes a
difference in performance. BERT trained only on SQuAD performs better than RoBERTa
by 11%. Training on PolicyQA almost doubles the performance of the models. Training
on GDPRQA instead of PolicyQA, naturally, results in even stronger evaluation
metrics. For BERT and PrivBERT, performance increases by 2%, while for RoBERTa it
increases by 16%. The fourth variation of training data combinations is training on aug-
mented GDPRQA & PolicyQA. BERT gets a better result by 10%, RoBERTa by 5%,
while PrivBERT’s performance decreases by 4%. Hence, adding PolicyQA training data
to GDPRQA does not enhance performance when tested on GDPRQA, suggesting that the
privacy policy annotations in GDPRQA are quite different in content from PolicyQA
when modelled with a domain-specific language model.

GDPRQA Dataset
                     Training data                          Test           Validation
Model Name     SQuAD GDPRQA PolicyQA GDPRQA & PolicyQA   EM     F1       EM     F1
BERT-base        ✓     ✗       ✗            ✗          0.080  0.319    0.083  0.287
BERT-base        ✓     ✗       ✓            ✗          0.320  0.623    0.083  0.422
BERT-base        ✓     ✓       ✗            ✗          0.320  0.646    0.291  0.707
BERT-base        ✓     ✓       ✗            ✓          0.480  0.740    0.333  0.717
RoBERTa-base     ✓     ✗       ✗            ✗          0.040  0.207    0.000  0.280
RoBERTa-base     ✓     ✗       ✓            ✗          0.120  0.566    0.083  0.458
RoBERTa-base     ✓     ✓       ✗            ✗          0.440  0.727    0.333  0.749
RoBERTa-base     ✓     ✓       ✗            ✓          0.480  0.777    0.416  0.754
PrivBERT         ✓     ✗       ✗            ✗          0.160  0.439    0.083  0.531
PrivBERT         ✓     ✗       ✓            ✗          0.480  0.766    0.208  0.527
PrivBERT         ✓     ✓       ✗            ✗          0.520  0.784    0.458  0.811
PrivBERT         ✓     ✓       ✗            ✓          0.440  0.747    0.333  0.764

Table 5: GDPRQA-SQuAD Dataset

PolicyQA-SQuAD

PolicyQA based on SQuAD has the strongest performance when trained on the augmented
GDPRQA & PolicyQA dataset, resulting in 60% F1 on validation, and when trained on
PolicyQA, 64.2% on testing, as presented in Table 6.

The baseline models trained only on SQuAD score 28.6% with BERT, 11% with RoBERTa,
and 35% with PrivBERT. Compared to PrivBERT's 43.9% when tested on GDPRQA, this
suggests that, without being trained on a domain-specific dataset, it is easier for
a privacy language model to predict on a GDPR-relevant dataset. It could also be
due to the fact that PrivBERT was created by feeding it post-GDPR privacy policies.

When adding PolicyQA data into the training, the performance of BERT roughly doubles
(from 28.6% to 60%), RoBERTa's improves about sixfold (from 11% to 61.5%), and
PrivBERT's roughly doubles (from 35% to 64.2%), making RoBERTa the most sensitive
to privacy-related training data and PrivBERT (which is RoBERTa tuned on privacy
policies) the best performing model.

When training on the GDPRQA dataset instead, the performance drops to 34.5% for
BERT, 35.5% for RoBERTa, and 37.2% for PrivBERT. Hence, GDPR-specific training
data only slightly helps with broad pre-GDPR privacy policy annotations and is not
enough for the models to learn how to answer broader privacy questions, which makes
sense given that the dataset is limited to GDPR specifics.

However, the augmented dataset (GDPR & PolicyQA) helps BERT to achieve the best
results. Hence, GDPR elements help to answer privacy questions better even for policies
which were made before the adoption of GDPR. Yet, in the case of PrivBERT it performs
better, and best among all models, when trained only on PolicyQA, with the best F1
score of 64.2%. Adding GDPRQA to the training data makes the model perform worse,
suggesting more differences between pre- and post-GDPR question-answer pairs.

The creators of PolicyQA previously achieved 56.6% F1 in testing with BERT-base and
SQuAD pre-training. Our results beat this state-of-the-art performance by 7.6%.

PolicyQA Dataset
Model Name     SQuAD  GDPRQA  PolicyQA  GDPRQA & PolicyQA | Test EM  Test F1 | Val. EM  Val. F1
BERT-base        ✓      –        –              –         |  0.066    0.286  |  0.052    0.258
BERT-base        ✓      –        ✓              –         |  0.317    0.605  |  0.279    0.559
BERT-base        ✓      ✓        –              –         |  0.066    0.345  |  0.070    0.336
BERT-base        ✓      –        –              ✓         |  0.309    0.607  |  0.289    0.565
RoBERTa-base     ✓      –        –              –         |  0.026    0.110  |  0.022    0.090
RoBERTa-base     ✓      –        ✓              –         |  0.340    0.615  |  0.304    0.574
RoBERTa-base     ✓      ✓        –              –         |  0.072    0.355  |  0.078    0.353
RoBERTa-base     ✓      –        –              ✓         |  0.480    0.615  |  0.306    0.573
PrivBERT         ✓      –        –              –         |  0.082    0.350  |  0.084    0.328
PrivBERT         ✓      –        ✓              –         |  0.358    0.642  |  0.313    0.590
PrivBERT         ✓      ✓        –              –         |  0.082    0.372  |  0.091    0.369
PrivBERT         ✓      –        –              ✓         |  0.357    0.635  |  0.321    0.600

Table 6: PolicyQA-SQuAD Dataset (✓ = dataset included in training, – = not included)

5.1.2 Models Fine-Tuned on NLQuAD

GDPRQA-NLQuAD


With NLQuAD pre-training, the highest GDPRQA scores are 82.4% F1 on the validation set, for PrivBERT trained on GDPRQA, and 78.9% F1 on the test set, for RoBERTa trained on GDPRQA, as illustrated in Table 7. The EM score of the best performing model is 44% on the GDPRQA test set, i.e. the proportion of predictions that exactly match the annotated answer.
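Throughout these tables, EM and F1 follow the standard SQuAD-style evaluation: EM checks for an exact string match after light normalisation, while F1 measures token overlap between the prediction and the ground truth. A minimal sketch of how these scores are typically computed (the normalisation is a simplified version of the official SQuAD evaluation script, not the exact code used here):

```python
from collections import Counter
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> int:
    return int(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)  # shared tokens / predicted length
    recall = num_same / len(gold_tokens)     # shared tokens / gold length
    return 2 * precision * recall / (precision + recall)

print(exact_match("30 days", "30 days."))                  # → 1
print(round(f1_score("within 30 days", "30 days"), 2))     # → 0.8
```

An EM of 0.440 thus means 44% of predictions matched the annotated span exactly, while F1 also rewards partial overlaps.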

Baseline models (tuned only on NLQuAD) score 21.2% for BERT, 14.7% for RoBERTa, and 30.0% for PrivBERT. Compared to the SQuAD baseline, only RoBERTa performs better with NLQuAD, by 4.6 percentage points, while both BERT and PrivBERT perform better with SQuAD.

When trained on PolicyQA, BERT's performance goes up to 47.9%, RoBERTa's to 57.7%, and PrivBERT's to 73.3%. The biggest difference is thus captured by RoBERTa and PrivBERT, whose performance improved by roughly 43 percentage points.

Interestingly, when training on GDPRQA instead of PolicyQA, BERT and RoBERTa gain considerably: a 17.1 percentage point improvement for BERT (reaching 65.0%) and a 21.2 point improvement for RoBERTa (up to 78.9%). However, PrivBERT was already able to answer questions with a rather high F1 of 73.3% on PolicyQA alone, and switching the dataset to GDPRQA gives it an advantage of only 3.2 points. Hence, PrivBERT is able to answer GDPR-relevant questions even when trained only on PolicyQA.

Compared to the baseline models tuned only on the open domain, adding domain-specific training very strongly improves performance; for instance, RoBERTa jumps from 14.7% to 78.9% when GDPRQA is introduced into its training.

The results show that GDPRQA-NLQuAD performs slightly better than GDPRQA-SQuAD, suggesting that the non-factoid NLQuAD with its longer answer spans indeed benefits GDPRQA, which likewise consists primarily of non-factoid questions with relatively long answer spans.


GDPRQA Dataset
Model Name     NLQuAD  GDPRQA  PolicyQA  GDPRQA & PolicyQA | Test EM  Test F1 | Val. EM  Val. F1
BERT-base         ✓      –        –              –         |  0.040    0.212  |  0.008    0.376
BERT-base         ✓      –        ✓              –         |  0.080    0.479  |  0.083    0.480
BERT-base         ✓      ✓        –              –         |  0.320    0.650  |  0.208    0.682
BERT-base         ✓      –        –              ✓         |  0.320    0.663  |  0.208    0.691
RoBERTa-base      ✓      –        –              –         |  0.000    0.147  |  0.041    0.395
RoBERTa-base      ✓      –        ✓              –         |  0.280    0.577  |  0.083    0.428
RoBERTa-base      ✓      ✓        –              –         |  0.400    0.789  |  0.333    0.753
RoBERTa-base      ✓      –        –              ✓         |  0.320    0.756  |  0.416    0.784
PrivBERT          ✓      –        –              –         |  0.040    0.300  |  0.083    0.401
PrivBERT          ✓      –        ✓              –         |  0.400    0.733  |  0.125    0.572
PrivBERT          ✓      ✓        –              –         |  0.440    0.765  |  0.541    0.824
PrivBERT          ✓      –        –              ✓         |  0.400    0.761  |  0.375    0.751

Table 7: GDPRQA-NLQuAD (✓ = dataset included in training, – = not included)

PolicyQA-NLQuAD
With regard to PolicyQA, the strongest performance is observed for the PrivBERT model trained on PolicyQA, resulting in 62.3% F1 on the test set and 58.5% on the validation set, as seen in Table 8. The EM of the best performing model is 33.5% on the test set and 30.8% on validation.

The baseline models, those trained only on the open-domain NLQuAD, score 8.9% with BERT, 10.5% with RoBERTa, and 8.9% with PrivBERT. The baseline performance on PolicyQA-SQuAD was much stronger (e.g. PrivBERT performs worse here by 26.1 percentage points). The answer spans in the PolicyQA dataset are substantially shorter, hence it evidently benefits more from being SQuAD-based.

Training on PolicyQA drastically improves the performance: for PrivBERT it increases by a significant 53.4 percentage points and reaches 62.3%. It also improves substantially for BERT, resulting in 60.3%, as well as for RoBERTa, amounting to 60.9%.

Training on GDPRQA instead of PolicyQA roughly halves the evaluation scores: PrivBERT results in 34.7%, BERT in 30.2%, and RoBERTa in 34.0%. Lastly, training on the combined dataset reaches results similar to, though still slightly worse than, training solely on PolicyQA; e.g. PrivBERT scores 62.2%, BERT 59.6%, and RoBERTa 60.6%.

Overall, PolicyQA achieved better results when tuned on SQuAD, which can be explained by its shorter answer spans and larger proportion of factoid questions.

PolicyQA Dataset
Model Name     NLQuAD  GDPRQA  PolicyQA  GDPRQA & PolicyQA | Test EM  Test F1 | Val. EM  Val. F1
BERT-base         ✓      –        –              –         |  0.014    0.089  |  0.007    0.065
BERT-base         ✓      –        ✓              –         |  0.318    0.603  |  0.278    0.560
BERT-base         ✓      ✓        –              –         |  0.055    0.302  |  0.053    0.293
BERT-base         ✓      –        –              ✓         |  0.301    0.596  |  0.274    0.558
RoBERTa-base      ✓      –        –              –         |  0.019    0.105  |  0.011    0.087
RoBERTa-base      ✓      –        ✓              –         |  0.326    0.609  |  0.295    0.568
RoBERTa-base      ✓      ✓        –              –         |  0.073    0.340  |  0.006    0.340
RoBERTa-base      ✓      –        –              ✓         |  0.324    0.606  |  0.296    0.567
PrivBERT          ✓      –        –              –         |  0.019    0.089  |  0.011    0.076
PrivBERT          ✓      –        ✓              –         |  0.335    0.623  |  0.308    0.585
PrivBERT          ✓      ✓        –              –         |  0.075    0.347  |  0.071    0.344
PrivBERT          ✓      –        –              ✓         |  0.335    0.622  |  0.305    0.582

Table 8: PolicyQA-NLQuAD (✓ = dataset included in training, – = not included)

5.1.3 Top Model Performance Results

The keen reader might have noticed that PrivBERT performs best on all test and validation datasets. For a better overview, a summary with averages across the two datasets is displayed in Table 9. It shows that the PrivBERT model trained on the augmented dataset performs significantly better across the two datasets than any PrivBERT model trained on an individual dataset. The models trained on individual datasets perform better on their respective test and validation sets, but worse on those of the other dataset. Despite a slight loss against the individually trained models on their respective sets, the average of the augmented model is significantly higher across the two, with an average F1 score of 0.679 against 0.636 and 0.570 for NLQuAD, and 0.686 against 0.631 and 0.584 for SQuAD.
Summarization - Best Performing Models
Model     Fine-Tuning Data    ODG    | GDPRQA Test | GDPRQA Val. | PolicyQA Test | PolicyQA Val. | Avg. F1
                                     |  EM    F1   |  EM    F1   |  EM     F1    |  EM     F1    |
PrivBERT  GDPRQA              SQuAD  | 0.520 0.784 | 0.458 0.811 | 0.082  0.372  | 0.092  0.369  |  0.584
PrivBERT  PolicyQA            SQuAD  | 0.480 0.766 | 0.458 0.527 | 0.358  0.642  | 0.313  0.590  |  0.631
PrivBERT  GDPRQA & PolicyQA   SQuAD  | 0.440 0.747 | 0.333 0.764 | 0.357  0.635  | 0.321  0.600  |  0.686
PrivBERT  GDPRQA              NLQuAD | 0.440 0.765 | 0.541 0.824 | 0.075  0.347  | 0.071  0.344  |  0.570
PrivBERT  PolicyQA            NLQuAD | 0.400 0.765 | 0.125 0.572 | 0.335  0.623  | 0.308  0.585  |  0.636
PrivBERT  GDPRQA & PolicyQA   NLQuAD | 0.400 0.761 | 0.375 0.751 | 0.335  0.622  | 0.305  0.582  |  0.679

Table 9: Best Performing Models Results
ODG = Open Domain General Dataset

The two best performing models, PrivBERT GDPRQA & PolicyQA trained on SQuAD and PrivBERT GDPRQA & PolicyQA trained on NLQuAD, are chosen for hyperparameter optimisation.


5.1.4 Hyperparameter Optimisation Findings

Hyperparameter optimisation was performed on the best performing models from SQuAD and NLQuAD respectively (Section 5.1). As both perform similarly, both are optimised to show the potential differences after hyperparameter optimisation. The search space is:

• Epochs: [1, 2, 3, 4]

• Learning Rate: [1e-5, 2e-5, 3e-5, 4e-5]

• Batch Size: [4, 8, 16, 32]

The results follow the approach stated in Section 4.6.3. First, different numbers of epochs are evaluated. Next, a manual grid search over batch size and learning rate is evaluated. The best performing models after HPO are used in the production environment prototype.
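The two-stage search described above can be sketched as a simple loop. The stage-1 defaults (learning rate 3e-5, batch size 8) and the `train_fn` helper are assumptions for illustration; in practice `train_fn` would run a full fine-tuning and return the average validation F1:

```python
import itertools

# Hyperparameter grid evaluated in this section.
epochs_grid = [1, 2, 3, 4]
learning_rates = [1e-5, 2e-5, 3e-5, 4e-5]
batch_sizes = [4, 8, 16, 32]

def manual_grid_search(train_fn):
    """Stage 1: sweep epochs with fixed defaults (assumed: lr 3e-5, batch size 8).
    Stage 2: grid over learning rate x batch size at the chosen epoch count.
    `train_fn(epochs, lr, batch_size)` is a hypothetical stand-in that
    fine-tunes the model and returns the average validation F1."""
    best_epochs = max(epochs_grid, key=lambda e: train_fn(e, 3e-5, 8))
    best_cfg, best_score = None, float("-inf")
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        score = train_fn(best_epochs, lr, bs)
        if score > best_score:
            best_cfg, best_score = (best_epochs, lr, bs), score
    return best_cfg, best_score
```

In the actual experiments, the second stage was further restricted to batch sizes 4 and 8 after initial results; the sketch runs the full grid for clarity.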

Epoch Evaluation
Tables 10 and 11 depict similar trends. Iterating through different epochs, the best performance on the GDPRQA test and validation sets occurs at the first epoch, whereas PolicyQA performs best at the last epoch observed, epoch 4. Hence, the model becomes more fitted to the PolicyQA data the more epochs it goes through. This can be explained by the substantially larger amount of PolicyQA data compared to GDPRQA data, which biases training towards PolicyQA. Accordingly, PolicyQA starts to perform better with an increasing number of epochs, while the performance on GDPRQA goes down. Yet, the difference in performance across epochs is more noticeable for GDPRQA than for PolicyQA. An overview of the epoch results can be seen in Figure 20.


Figure 20: Epoch Evaluation SQuAD vs NLQuAD, F1 Score

Epoch Evaluation - SQuAD
Model                       Epochs | GDPRQA Test | GDPRQA Val. | PolicyQA Test | PolicyQA Val. | Avg. F1
                                   |  EM    F1   |  EM    F1   |  EM     F1    |  EM     F1    |
PrivBERT GDPRQA & PolicyQA    1    | 0.400 0.808 | 0.333 0.765 | 0.348  0.633  | 0.311  0.591  |  0.699
PrivBERT GDPRQA & PolicyQA    2    | 0.400 0.761 | 0.375 0.751 | 0.335  0.622  | 0.305  0.582  |  0.679
PrivBERT GDPRQA & PolicyQA    3    | 0.440 0.753 | 0.291 0.719 | 0.359  0.637  | 0.328  0.606  |  0.683
PrivBERT GDPRQA & PolicyQA    4    | 0.400 0.711 | 0.375 0.800 | 0.369  0.642  | 0.339  0.609  |  0.6905

Table 10: Epoch Optimization - SQuAD

Epoch Evaluation - NLQuAD
Model                       Epochs | GDPRQA Test | GDPRQA Val. | PolicyQA Test | PolicyQA Val. | Avg. F1
                                   |  EM    F1   |  EM    F1   |  EM     F1    |  EM     F1    |
PrivBERT GDPRQA & PolicyQA    1    | 0.400 0.808 | 0.375 0.768 | 0.305  0.594  | 0.277  0.562  |  0.683
PrivBERT GDPRQA & PolicyQA    2    | 0.400 0.761 | 0.375 0.751 | 0.335  0.622  | 0.305  0.582  |  0.679
PrivBERT GDPRQA & PolicyQA    3    | 0.360 0.732 | 0.458 0.756 | 0.347  0.630  | 0.310  0.588  |  0.676
PrivBERT GDPRQA & PolicyQA    4    | 0.400 0.733 | 0.333 0.729 | 0.352  0.632  | 0.320  0.595  |  0.672

Table 11: Epoch Optimization - NLQuAD

NLQuAD HPO Evaluation: Batch Size and Learning Rate

The NLQuAD HPO, as presented in Table 12, shows that the combination of a batch size of 4 and a learning rate of 2e-5 results in the strongest performance on most metrics when tested on the GDPRQA test and validation sets, with F1 scores of 79.6% and 81.4%, respectively. Yet, the best EM on the GDPRQA validation set is observed for a batch size of 4 and a 3e-5 learning rate, while a batch size of 4 with a learning rate of 4e-5 scores equally well for EM on the GDPRQA test set.

Regarding the HPO performance on PolicyQA, a batch size of 4 with a learning rate of 4e-5 scored best on all metrics for both the test and validation sets: F1 is 64.6% on the PolicyQA test set and 59.6% on validation.


Overall, in every grid combination of batch size and learning rate, the batch size of 4 was superior to the batch size of 8. The best average F1 across both datasets (GDPRQA and PolicyQA) is 70.7%, for the combination of a batch size of 4 and a learning rate of 2e-5.

HPO Evaluation NLQuAD
Model                       BS   LR   | GDPRQA Test | GDPRQA Val. | PolicyQA Test | PolicyQA Val. | Avg. F1
                                      |  EM    F1   |  EM    F1   |  EM     F1    |  EM     F1    |
PrivBERT GDPRQA & PolicyQA  4   1e-5  | 0.291 0.747 | 0.400 0.781 | 0.334  0.619  | 0.288  0.568  |  0.678
PrivBERT GDPRQA & PolicyQA  4   2e-5  | 0.440 0.796 | 0.375 0.814 | 0.350  0.630  | 0.313  0.588  |  0.707
PrivBERT GDPRQA & PolicyQA  4   3e-5  | 0.400 0.748 | 0.416 0.813 | 0.361  0.637  | 0.310  0.587  |  0.696
PrivBERT GDPRQA & PolicyQA  4   4e-5  | 0.440 0.768 | 0.375 0.752 | 0.373  0.646  | 0.316  0.596  |  0.690
PrivBERT GDPRQA & PolicyQA  8   1e-5  | 0.280 0.755 | 0.208 0.682 | 0.255  0.553  | 0.237  0.529  |  0.629
PrivBERT GDPRQA & PolicyQA  8   2e-5  | 0.375 0.745 | 0.250 0.716 | 0.290  0.590  | 0.274  0.560  |  0.652
PrivBERT GDPRQA & PolicyQA  8   3e-5  | 0.400 0.761 | 0.375 0.751 | 0.335  0.622  | 0.305  0.582  |  0.679
PrivBERT GDPRQA & PolicyQA  8   4e-5  | 0.400 0.748 | 0.208 0.695 | 0.315  0.602  | 0.280  0.566  |  0.652

Table 12: HPO Evaluation NLQuAD: Batch Size and Learning Rate
BS = Batch Size, LR = Learning Rate

SQuAD HPO Evaluation

Similarly to the NLQuAD HPO results, the PrivBERT GDPRQA & PolicyQA model with a batch size of 4 and a learning rate of 2e-5, as shown in Table 13, performs best. It performs slightly worse than the NLQuAD version, however, with an average F1 of 70.2% across the two datasets against NLQuAD's 70.7%. Specifically, the SQuAD version performs better on the PolicyQA dataset, where it achieves an F1 of 64.7% on the PolicyQA test set, the second highest recorded score among all results. However, it performs slightly worse on the GDPRQA dataset than the NLQuAD version: 4 percentage points worse on the GDPRQA validation set and 1 point worse on the test set. Noteworthy, however, is that despite the lower overall average F1 on GDPRQA, it achieves the highest exact match of all recorded results on the GDPRQA test set.

HPO Evaluation SQuAD
Model                       BS   LR   | GDPRQA Test | GDPRQA Val. | PolicyQA Test | PolicyQA Val. | Avg. F1
                                      |  EM    F1   |  EM    F1   |  EM     F1    |  EM     F1    |
PrivBERT GDPRQA & PolicyQA  4   1e-5  | 0.520 0.778 | 0.375 0.730 | 0.350  0.628  | 0.305  0.586  |  0.680
PrivBERT GDPRQA & PolicyQA  4   2e-5  | 0.520 0.787 | 0.375 0.770 | 0.368  0.647  | 0.330  0.605  |  0.702
PrivBERT GDPRQA & PolicyQA  4   3e-5  | 0.480 0.732 | 0.333 0.810 | 0.373  0.649  | 0.325  0.600  |  0.697
PrivBERT GDPRQA & PolicyQA  4   4e-5  | 0.440 0.752 | 0.250 0.771 | 0.352  0.632  | 0.317  0.588  |  0.685
PrivBERT GDPRQA & PolicyQA  8   1e-5  | 0.520 0.758 | 0.375 0.728 | 0.332  0.614  | 0.308  0.585  |  0.671
PrivBERT GDPRQA & PolicyQA  8   2e-5  | 0.440 0.747 | 0.375 0.755 | 0.351  0.625  | 0.313  0.590  |  0.679
PrivBERT GDPRQA & PolicyQA  8   3e-5  | 0.440 0.747 | 0.333 0.764 | 0.357  0.635  | 0.321  0.600  |  0.686
PrivBERT GDPRQA & PolicyQA  8   4e-5  | 0.520 0.777 | 0.291 0.778 | 0.363  0.641  | 0.330  0.606  |  0.700

Table 13: HPO Evaluation of SQuAD
BS = Batch Size, LR = Learning Rate


5.2 Production Setup Results

The production setup results are based on the two best performing models from the test environment: PrivBERT PolicyQA & GDPRQA fine-tuned on SQuAD and on NLQuAD. Figure 21 displays the best performing model for each dataset in both the test and production environments. The results in the production environment are substantially lower than those in the test environment: since the model is not given a paragraph but the entire privacy policy, it needs to look for an answer in a much larger text. Moreover, the model has to deal with the possibility of answers being present in several parts of the policy (e.g. contact details being mentioned in various paragraphs).

Figure 21: Best Performing Models after 3 Iterations — Test vs Production Environment, F1 Test Score

5.2.1 Production Results GDPRQA

The results for the GDPRQA dataset in the simulated production environment show the strongest performance when trained on NLQuAD with top k=5, as showcased in Table 14. This configuration yields an F1 of 38.5%, EM of 15.2%, precision of 56.5%, and recall of 32.8%. The F1 score ranges from 24.5% for one answer candidate up to 38.5% for five. The best performance when trained on SQuAD is also at top k=5, with an F1 of 37.7%, precision of 54.1%, and recall of 32.5%. The difference in scores between SQuAD and NLQuAD training is rather minimal, only about a 1 percentage point improvement for NLQuAD, similar to that observed in the test environment. Precision, which here is the number of tokens shared with the ground truth divided by the length of the prediction, is observed to be stronger than recall.


GDPR Dataset Mock-Up Production
Model Name                  Transfer Learning Set  Top K |  EM     F1     Precision  Recall
PrivBERT PolicyQA & GDPRQA  NLQuAD                   1   | 0.101  0.245     0.334    0.214
PrivBERT PolicyQA & GDPRQA  NLQuAD                   3   | 0.135  0.336     0.494    0.290
PrivBERT PolicyQA & GDPRQA  NLQuAD                   5   | 0.152  0.385     0.565    0.328
PrivBERT PolicyQA & GDPRQA  SQuAD                    1   | 0.084  0.185     0.283    0.157
PrivBERT PolicyQA & GDPRQA  SQuAD                    3   | 0.118  0.303     0.440    0.258
PrivBERT PolicyQA & GDPRQA  SQuAD                    5   | 0.152  0.377     0.541    0.325

Table 14: Production Environment - GDPRQA

5.2.2 Production Results - PolicyQA

With regard to PolicyQA, the strongest performance is for the model trained on NLQuAD with top k=5, as displayed in Table 15, resulting in an F1 of 43.6%, precision of 53.2%, recall of 47.5%, and EM of 18.3%. The F1 score of the SQuAD-based model is lower by 2.4 percentage points, whereas in the test environment the SQuAD-based PolicyQA model performed better.

We can observe that the higher the top k, i.e. the number of answer candidates, the stronger the performance of the model on all metrics: the more answer candidates the model produces, the higher the odds that the correct answer is among them.
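This best-of-top-k behaviour can be sketched as follows; the candidate answers and the simplified token-level F1 below are illustrative only, not the exact pipeline implementation:

```python
def token_f1(prediction: str, ground_truth: str) -> float:
    """Simplified token-overlap F1 between a candidate and the gold answer."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    p, r = common / len(pred), common / len(gold)
    return 2 * p * r / (p + r)

def best_of_top_k(candidates, ground_truth, k):
    """Score only the k highest-ranked candidates; a larger k raises the
    odds that the annotated answer is among them."""
    return max((token_f1(c, ground_truth) for c in candidates[:k]), default=0.0)

gold = "we retain your data for 30 days"
candidates = [  # ordered by the model's confidence (hypothetical example)
    "contact our data protection officer",
    "your data is stored securely",
    "we retain your data for 30 days",
]
print(round(best_of_top_k(candidates, gold, k=1), 2))  # → 0.17
print(best_of_top_k(candidates, gold, k=3))            # → 1.0
```

With k=1 only the weak first candidate counts; at k=3 the correct answer enters the candidate set and the score jumps, mirroring the trend in Tables 14 and 15.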

As depicted in the EDA, the answer spans in GDPRQA are substantially longer than in PolicyQA. Hence, GDPRQA gets lower recall and higher precision, as the model manages to find the right text span but not the entire answer.

PolicyQA Dataset Production Environment Simulation
Model Name                  Transfer Learning Set  Top K |  EM     F1     Precision  Recall
PrivBERT PolicyQA & GDPRQA  NLQuAD                   1   | 0.072  0.202     0.254    0.218
PrivBERT PolicyQA & GDPRQA  NLQuAD                   3   | 0.128  0.334     0.416    0.361
PrivBERT PolicyQA & GDPRQA  NLQuAD                   5   | 0.183  0.436     0.532    0.475
PrivBERT PolicyQA & GDPRQA  SQuAD                    1   | 0.077  0.230     0.274    0.255
PrivBERT PolicyQA & GDPRQA  SQuAD                    3   | 0.125  0.367     0.432    0.407
PrivBERT PolicyQA & GDPRQA  SQuAD                    5   | 0.145  0.412     0.498    0.464

Table 15: Production Environment - PolicyQA


6 Discussion
The CRISP-DM framework suggests an iterative evaluation to ensure the objectives of the work are met. Hence, the holistic process is reviewed and potential improvements are considered in the following chapter. To start with, the discussion elaborates on the findings in the test environment and its three iterations: (1) open-domain general knowledge learning, (2) transfer learning adapting to the post-GDPR privacy policy domain, and (3) HPO. Next, the production environment results are discussed, followed by recommendations for deployment. Furthermore, the research questions are answered. Finally, the thesis reflects on its contribution to research, limitations, learning reflections, and future work.

6.1 Test Environment Discussion

The findings in the test environment are peculiar in numerous aspects. First, the iteration of open-domain general knowledge learning is discussed, followed by the second iteration of privacy-policy-specific transfer learning, and, lastly, the third iteration of hyperparameter optimisation.

6.1.1 Iteration 1 - Open Domain General Knowledge Learning

A model without domain-specific understanding, trained solely on an open-domain QA dataset, performs surprisingly decently when tested on GDPRQA (yet not on PolicyQA). This could be because GDPR policies are more structured and the questions rather general, which makes it easier for the model to learn how to answer them even without domain-specific knowledge.

Fine-tuning on the non-factoid NLQuAD instead of the factoid SQuAD yielded some improvements in the evaluation scores on the GDPRQA dataset, suggesting that GDPRQA indeed benefits from learning general knowledge from a more complex dataset with longer answer spans. The average answer span in GDPRQA is much longer than in PolicyQA: 21 words compared to 13.9 on average.


6.1.2 Iteration 2 - Transfer Learning on GDPR Privacy Policy Domain

PrivBERT, as expected from the initial assumptions due to its ability to capture the nuances of privacy policy language, performed strongest among all chosen Transformer models, yielding the highest scores in virtually all configurations. Its performance is consistent with the literature, as it enhanced accuracy by 3 percentage points on the PolicyQA dataset and 4 points on the GDPRQA dataset. Its stronger performance on GDPRQA may stem from PrivBERT having been pre-trained on post-GDPR privacy policies.

The fact that the GDPRQA dataset achieved much higher accuracy than the PolicyQA dataset may suggest that the introduction of GDPR also improved the structure of the policies, as certain language and structure is replicated across them. The improved structure and shared lexicon result in better machine understanding, thus supporting the usability of and need for the QA model.

RoBERTa performs better than BERT. Since RoBERTa has the same architecture and token input restriction as BERT, its better performance can likely be attributed to RoBERTa's optimised pre-training scheme, which generalises better to the QA task. The RoBERTa-based models (RoBERTa-base and PrivBERT) are also more sensitive to domain-specific training data, as their performance improves much more when knowledge is transferred to the closed domain than from general knowledge alone.

Moreover, the developed model for the PolicyQA data achieved results higher than the state-of-the-art on PolicyQA, suggesting that both hyperparameter optimisation and PrivBERT play a significant role in the performance increase for privacy-policy-related QA.

6.1.3 Iteration 3 - Hyperparameter Optimisation Discussion

Although hyperparameter optimisation was performed on the epochs, batch size, and learning rate, the result might still be suboptimal, as the available computational power limited a proper grid search. As described in Section 4, training times increase substantially the more hyperparameters are optimised. Due to the increase in training time, only two hyperparameters were chosen for the manually created grid search, specifically batch size and learning rate, given their significance in the findings of Rangasai (2022). This results in a grid of 4 learning rates by 2 batch sizes, equaling 8 models, with the top performing hyperparameter-optimised model increasing the average F1 across both datasets by 3.5%.

The batch size was reduced to only two values, 8 and 4, as initial results showed that increasing the batch size above 8 resulted in poorer overall performance. At the same time, a batch size of 4 improved the overall results, raising the question of whether lowering it even further, to 2 or 1, could be beneficial. With regard to the epochs, as the size difference between the two datasets is substantial, the more epochs are run, the better the PolicyQA dataset performs, whereas GDPRQA performs worse. This can be explained by the model becoming more biased towards PolicyQA due to the imbalance.

Weight decay and warm-up steps were not explored, but as showcased by Kamsetty (2020), the most important hyperparameters for a BERT-based classification problem were the learning rate, followed by weight decay, batch size, warm-up steps, and the number of epochs. Given the time required for a grid search over all of the above parameters, the models might see an increase of 1-2 percentage points. However, the exponential increase in training time would require substantially more computational power than allowed by a Google Colab Pro+ subscription.

6.2 Production Environment Discussion

Results in the production setting do not reflect the results in the test environment, which is a curious finding in itself. As initially suggested, the results show that a single-document, i.e. closed-domain, problem cannot be evaluated solely in a controlled test environment. Several factors play a crucial role. Firstly, BERT- and RoBERTa-based models are bound to a maximum input length of 512 tokens. This results in either substantial cropping or the necessity to split documents, thus possibly losing context. Secondly, given the format of SQuAD 1.1, the possibility of not providing an answer is non-existent.


This forces the split segments to always include one or more answers. In other words, there is always an answer to a question in the given context, which may not be true in a real-world scenario. Thirdly, in a real-world scenario, there may be several ways to answer a question, and some may not reflect the annotated answer, thus affecting the evaluation metrics. There may even exist several answers to a given question where only a proportion of the span contains the same tokens.

The aforementioned factors play a significant role when evaluating single-document closed-domain QA models in a real-world scenario. In the case of privacy policies, these factors are substantial. With the introduction of GDPR, the policies of companies with a European presence grew substantially in size and word complexity, as organizations and states tried to guard themselves against potential lawsuits and adhere to the law. This has several implications for evaluating the PolicyQA and GDPRQA datasets in a traditional test environment. Firstly, GDPR policies are typically 8 times as long as what BERT or RoBERTa allows. This results in a potential loss of reading comprehension context that, according to the findings of Soleimani et al. (2021), plays a substantial role.

Secondly, the length of privacy policies matters. As all answers are evaluated as stated in Section 4.7, only the top candidate with the highest probability score is used in the evaluation metrics. The strong test-environment metrics are assumed to stem from the limited contexts, as it is substantially easier for the model to find the annotated answer in a short context than in an entire policy.

Furthermore, the limitation of annotating only one answer per question and context in the GDPRQA dataset brings another difficulty: questions may have multiple answers in the same context. An example is the question "How long would you retain or store my data?", which often has multiple answers depending on the lawful basis for the processing of the data. The model may correctly identify one of the several answers, but not the ground truth, so the evaluation metrics do not necessarily reflect the correctness of the found answer. This could potentially be solved by using Semantic Answer Similarity (SAS), as also mentioned in Section 6.8, which would allow a more nuanced assessment of the extent to which the provided answer corresponds to the ground truth.
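As a sketch of the idea behind SAS, the lexical-overlap score is replaced by the cosine similarity of answer embeddings. A real SAS setup would use a trained bi- or cross-encoder (e.g. from the sentence-transformers library) to embed the answers; the `encode` function below is a deliberately crude stand-in, only to make the sketch runnable:

```python
import math

def encode(text: str) -> list:
    """Hypothetical stand-in for a sentence encoder: a toy bag-of-letters
    vector. A real SAS setup would call a trained embedding model here."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_answer_similarity(prediction: str, ground_truth: str) -> float:
    """Score in [0, 1]; unlike token-level F1, a paraphrase that shares few
    tokens with the ground truth can still score high."""
    return cosine(encode(prediction), encode(ground_truth))

pred = "your data is kept for one month"
gold = "we retain personal data for 30 days"
print(round(semantic_answer_similarity(gold, gold), 2))  # → 1.0
print(semantic_answer_similarity(pred, gold) > 0)        # → True
```

The point of the sketch is the scoring interface, not the encoder: swapping in real sentence embeddings is what makes the similarity semantically meaningful.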


The significant drop in performance on the GDPRQA dataset compared to the PolicyQA dataset is a combination of the aforementioned factors, namely the length of the policies, the lack of multiple annotated answers, and a lexical-overlap evaluation method. To compare, the post-GDPR policies collected in the GDPRQA dataset have an average length of 4,569 words, while the policies in PolicyQA contain an average of only 2,319 words. Hence, the length of policies almost doubled after the introduction of GDPR. Given such volumes of text, retrieving the correct passages containing the answers becomes more challenging for the GDPR privacy policies in production.

However, despite a sharp drop in evaluation metrics on both datasets, the ability to extract correct answers is not necessarily reflected by the decrease. In other words, the extracted answer might fully or partially answer the question, yet not share a significant lexical overlap with the ground truth. To ensure a more correct and realistic evaluation of the proposed system, human evaluation is a necessity: in recommendation and QA systems, prediction accuracy does not always match users' satisfaction (Harkous et al., 2018, pg. 13).

In the PriBot application of Harkous et al. (2018), a user study showed that the models performed substantially better than the predicted model accuracy. The respondents regarded at least one of the top three answers as relevant for 89% of the questions, with the first answer being relevant in 70% of the cases. In comparison, for the top one candidate, the automatic scores were 46% and 48% (Harkous et al., 2018, pg. 14).

6.3 Business Use Case — Deployment Discussion

In terms of legal aspects, the GDPRQA Assistant should not replace the legally binding privacy policies. It may only ease and augment user comprehension of a policy by offering a complementary interface to inquire about relevant privacy details. Following the wide recent adoption of NLP tools in the legal domain, and motivated by the rise of conversational agents and automated user support, it can be deployed as a user-friendly solution for privacy policy stakeholders. Yet, a disclaimer should be added that the automatic question answering does not represent the service provider and is only meant to be complementary.

6.3.1 Internal Company Deployment

The GDPRQA Assistant can also be deployed internally by companies as an assistance tool to handle privacy inquiries. Yet, given the legally binding nature of policies, a legal expert would need to be involved to weigh the utility of the GDPRQA Assistant against the possible legal implications.

6.3.2 External Web Deployment

The models could be deployed as an external, open-source SaaS (Software as a Service) application or leverage a subscription-based business model. The application would initially make use of the created models and use future user feedback to consistently feed the models with additional metadata, as explained in Figure 17. However, as there is no simple way to crawl the policies of all companies and organizations around the world, it would rely largely on privacy policy URLs provided by users.

Additionally, it is suggested to create a framework that bundles other privacy policy analysis tools. This framework would include not only the GDPRQA Assistant, but also models to analyse vagueness, opt-out choices, sentence classification, and privacy policy aggressiveness. To further enhance ease of use, Chrome and Firefox extensions could be deployed that automatically leverage the associated models and provide near real-time feedback on the privacy policy of the website the user is visiting.

6.3.3 Mobile Application

The models could also be leveraged as part of a privacy policy notification application. The application would use the QA model to enable users to ask questions directly about an application, either one installed on the user's phone or tablet or one available in the application store. Furthermore, the application should be able to identify important features of a privacy policy, such as whether the provider collects location data, whether it has access to files, folders, and pictures, and to what extent it sells or shares data with third parties.

All this information could be displayed to users either prior to or after the installation of an application. The potential benefit for users would be a continuously available, informative overview of what is currently accepted on their device and to what extent their data is collected. The advantage is that the application could manage the personal data usage of other applications and thus increase users' awareness of the state of their personal data. Furthermore, it allows for a direct privacy policy crawl through the Google Play Store and Apple App Store.

6.3.4 Recommendations for Deployment

The longer privacy policies of the post-GDPR era pose a significant problem for the potential deployment of the created models. As Transformers such as BERT and RoBERTa have a 512-token input limit, documents need to be split into segments of at most 512 tokens. The longer a given privacy policy is, the more splits have to be performed. This raises the computational requirements, as the number of segments grows linearly with policy length, thus increasing inference times. In 2009, a study by Forrester Research found that users expected pages to load in less than two seconds, and that at three seconds a large share abandon the site (Lohr, 2012). Currently, inference over the 115 policies in the OPP-115 corpus takes an average of 3.52 seconds per policy when extracting the top five candidate answers for 8 typically asked questions. Substantially longer times are observed for GDPRQA, with an average inference time of 7.23 seconds. The inference is performed with the GPU available in a Google Colab Pro+ subscription. However, with knowledge distillation, as explained in Section 3.5, inference times can be reduced substantially using a compact model such as TinyBERT, albeit often at a small cost in accuracy. The second option is to increase the computational power, which, in a production setting, comes at a cost.
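As an illustrative sketch of the splitting step (a simple whitespace tokenizer stands in for a real subword tokenizer, and the window and stride values are assumptions, not the thesis' exact settings):

```python
# Split a long policy into overlapping windows that respect the 512-token
# limit of BERT/RoBERTa-style readers. The overlap (stride) reduces the risk
# of an answer span being cut at a window boundary.
def split_policy(text, max_tokens=512, stride=128):
    tokens = text.split()  # placeholder for a real subword tokenizer
    step = max_tokens - stride
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return windows

# A 2,000-token toy policy already needs several windows; at the post-GDPR
# average length the count more than doubles, and inference time with it.
segments = split_policy("word " * 2000)
```

Each window is then paired with the question and passed to the reader, which is why the number of windows, and hence inference time, grows linearly with policy length.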

6.4 Answering Research Questions

RQ: How can recent advancements in NLP be leveraged to build a Question Answering
system that can answer privacy policy inquiries in the post-GDPR regulatory landscape?


The thesis has revealed the feasibility of leveraging Transformers, Deep Neural Networks, and transfer learning to develop a QA system which can automatically extract answers to user inquiries on GDPR privacy policies. It has been observed that the effect of GDPR has resulted in lengthier, vaguer privacy policies written in more sophisticated legal language, making them less comprehensible to a general audience. However, GDPR has also yielded increased thoroughness and structure in privacy policies, manifesting potential for improved machine reading comprehension.

The thesis has contributed the first GDPR-adapted QA system, able to answer privacy policy inquiries both pre- and post-GDPR with an average F1 of ~71%, filling a gap in the existing research. Moreover, in addition to utilizing existing datasets, it has produced a GDPR-adapted QA dataset, GDPRQA, which lays the foundation for further developments in the domain of GDPR privacy policies.

Q1: How can Transformers, Deep Neural Networks, Transfer Learning and data aug-
mentation aid in adaptation to a post-GDPR privacy domain and improve the current
state-of-the-art?

The thesis has not only generated a novel model capable of answering both pre- and post-GDPR questions, but has also beaten the current state-of-the-art pre-GDPR PolicyQA model by 7.6%. The findings showcased that PrivBERT, a privacy-policy-specific configuration of RoBERTa, provides a boost in accuracy on privacy policy QA datasets. It was found that PrivBERT improved the PolicyQA F1 score by 3%.

Transformers, specifically BERT, RoBERTa and PrivBERT, assisted in capturing long-distance dependencies in the text of privacy policies. The Transformers' use of DNNs and position-wise FFNs overcomes the constraints of the Markov assumption and exploits greater context in text, conditioning each next token on that context. The architectures are able to retain past information as well as the areas to which the model's attention should be paid.

Transfer learning enabled the transfer of general QA knowledge from a source open domain to the target post-GDPR privacy policy domain, thus retaining general knowledge of how to answer questions whilst also specializing the QA system to the knowledge of the privacy policy domain. The effect of transfer learning was noticed in the comparison between NLQuAD and SQuAD. The source domain clearly had an impact on performance in the target domain: NLQuAD-based models performed better on GDPRQA due to its non-factoid nature and longer answer spans, in contrast to PolicyQA, which benefited more from SQuAD.

Data augmentation aided in adapting general comprehension of the privacy policy domain to the post-GDPR nuances of the privacy lexicon. The results show that PrivBERT models trained on the augmented dataset generally outperform models trained on GDPRQA or PolicyQA individually, increasing the average score across the two datasets by 5%-10%.

Q2: How can a developed privacy policy Question Answering system be implemented and
evaluated with respect to a potential deployment in the real world scenario?

The developed QA model can enable users to retain more control over their personal data and extract the answer spans relevant to their needs in a post-GDPR regulatory landscape at scale. The production environment prototype was simulated with ElasticSearch by building an information retrieval architecture consisting of a Retriever and a Reader. The results revealed that the metrics evaluated in the test environment do not reflect a real-world production scenario. QA in the privacy policy domain is identified as a single-document, closed-domain task, which performs substantially differently in a production environment due to the length of privacy policies: in production, the model is fed the entire policy and needs to retrieve the relevant segments, as opposed to the test environment, where the model is shown the relevant segment containing the answer. Moreover, as evaluation is done using F1 and EM, it might not portray the true performance of the system due to their semantic limitations. It is suggested that user evaluation, SAS, or Intersection over Union be implemented, whereby user evaluation would yield the most realistic and accurate assessment of such a QA system.

A suggested GDPRQA Assistant could be implemented through three layers, namely an Application Layer, a Data Layer and a Machine Learning Layer
with an additional Feedback and PolicyCrawler module. The Application Layer handles transfers of data between the layers and displays information to and interacts with users. The Data Layer serves two purposes: firstly, it cleans, preprocesses and indexes the policies crawled by the PolicyCrawler into ElasticSearch; secondly, it handles queries to retrieve policies from ElasticSearch. The Machine Learning Layer is responsible for encoding the segment,
extracting the answer and decoding it to make it user comprehensible. The additional
modules, PolicyCrawler and Feedback, enable the collection of privacy policies through
URLs and feedback respectively, essentially allowing the system to function and improve
itself.
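As a hypothetical sketch of the flow between the three layers (the class and method names are illustrative assumptions based on the description above, not the thesis' actual code; a real deployment would back the Data Layer with ElasticSearch and the Machine Learning Layer with PrivBERT):

```python
# Minimal stand-ins for the three layers of the proposed GDPRQA Assistant.
class DataLayer:
    """Indexes crawled policies and serves segment lookups (ElasticSearch stand-in)."""
    def __init__(self):
        self.index = {}

    def add_policy(self, url, segments):
        self.index[url] = segments

    def retrieve(self, url, question):
        # Toy relevance: rank segments by shared words with the question.
        q = set(question.lower().split())
        return max(self.index[url], key=lambda s: len(q & set(s.lower().split())))

class MachineLearningLayer:
    """Encodes a segment and extracts an answer (QA reader stand-in)."""
    def answer(self, question, segment):
        return segment  # a real reader would return a span within the segment

class ApplicationLayer:
    """Mediates between the user and the other layers."""
    def __init__(self, data, ml):
        self.data, self.ml = data, ml

    def ask(self, url, question):
        segment = self.data.retrieve(url, question)
        return self.ml.answer(question, segment)

data = DataLayer()
data.add_policy("https://example.com/privacy",
                ["We retain data for 12 months.", "We never sell your location data."])
app = ApplicationLayer(data, MachineLearningLayer())
answer = app.ask("https://example.com/privacy", "Do you sell my location data?")
```

The PolicyCrawler and Feedback modules would feed new policies and user corrections into the Data Layer, completing the loop described above.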

Despite the decrease in production performance, the deployment of a GDPRQA Assistant is believed to substantially contribute to the machine and user comprehension of privacy policies in the post-GDPR era.

6.5 Contribution and Implication for Research

Analysis of the privacy policies has indeed revealed that the introduction of GDPR has resulted in policies that are lengthier, more verbose and written in advanced legal language that can only be understood by college graduates. These factors make the policies more difficult to comprehend for the general audience, violating the GDPR obligation that policies be easily understood by any individual and questioning their legitimacy under the "notice and choice" principle. The effort and time taken to read a policy indeed manifest the importance of a tool that can simplify the process. The increased exhaustiveness of detail and legal structure of the policies arguably make them less comprehensible for the general audience yet more comprehensible for machines.
The thesis contributes to research with an annotated GDPRQA dataset that previously
did not exist. It further introduces the first GDPR-adapted QA system able to answer
privacy policies’ inquiries both pre- and post-GDPR. Lastly, the thesis contributes with
realistic real-world evaluation results on privacy policy QA systems.

6.6 Limitations

An array of possible constraints were faced which should be considered when measuring
the impact of the research.


6.6.1 Data Limitations

The annotation for GDPRQA was forced to follow the SQuAD 1.1 format to be consistent with PolicyQA. However, annotating in the SQuAD 2.0 format would also have allowed for no-answer annotations, enabling the QA system to abstain when asked a question with no existing answer.

The analysis relies on the quality of the existing PolicyQA dataset. The quality-assurance techniques behind the creation of the PolicyQA dataset have not been studied. The dataset was claimed to be created by domain experts, yet no further elaboration was provided. It was assumed to have an acceptable degree of quality, yet a possible subjectivity bias, which is generally likely to occur in manual work, could have affected the performance of the models developed in this thesis. A major limitation was the differing answer lengths of PolicyQA and GDPRQA. Specifically, the mixed question type adds complexity to the QA task, as the expected answer varies in length and span structure.

As manual labelling is very time- and effort-expensive, only a limited number of annotation samples could be created for the constructed GDPRQA dataset. A small training dataset restricts training; more samples would help the models learn deeper nuances of the post-GDPR privacy domain lexicon.

The input quality of privacy policies naturally affects QA performance. Given a policy of low quality, e.g. one non-compliant with GDPR, the model will fail to comprehend its structure.

Only English-language GDPR privacy policies were considered; hence, extrapolating the study to policies in other languages might be challenging. Applying the models to other European languages is not covered in the scope of this thesis, yet machine translation could be leveraged in further work.


6.6.2 Labelling Subjectivity

Subjectivity affects the complexity of labelling and evaluation. To ensure labelling consistency, each question was discussed by the labellers to lower subjectivity; the guidelines were agreed upon among the labellers, and any edge cases were jointly discussed. Nonetheless, manual human work is prone to inconsistency, and despite employing critical realism and the measures taken to avoid bias, similar pieces of text could still receive varying annotations from the labellers.

6.6.3 Computational Power Limitations

The techniques used in this research faced substantial computational power constraints. Large language models such as RoBERTa-large or BERT-large require considerable GPU memory, which was limited to 16GB, impacting the training duration. Moreover, computational power constraints limited the possibilities for hyperparameter optimisation; hence, the best resulting models might still be suboptimal, as not all grid search hyperparameter combinations have been explored. For instance, weight decay and warm-up steps were skipped, as their exploration was too computationally heavy to perform.

6.7 Learning Reflections

This thesis provided an opportunity to harness the transformative power of natural language processing to derive actionable privacy policy comprehension insights. Attempting to solve an urgent problem around the protection of personal privacy gave us a stronger sense of purpose in our work. Viewing the thesis through the prism of its multidisciplinary nature, with its legal, technical and business angles, generated a profound awareness of the topic and sharpened our data science toolset. Throughout the M.Sc. Data Science degree, we have worked across the three pillars of data: regulation, business insights, and analytics. All three aspects have been exercised throughout this thesis.

We observed that a major limiting factor in our work was the lack of sufficient computing power. Despite Google Colab access to GPUs, the resources were not sufficient for utilising the full capabilities of the language models, resulting in potentially suboptimal
hyperparameter optimisation of the models presented. Training and operating large language models requires sufficient computational power, which entails hardware access, financial costs and energy consumption, raising the question of the sustainability of large language models. The increasing sizes of language models leave a larger carbon footprint: pre-training a BERT model is estimated to emit as much carbon as a roundtrip flight across the US (Strubell, Ganesh, & McCallum, 2019). Moreover, the fact that such resources are largely exclusive to well-funded research poses a dilemma for the future development of NLP.

6.8 Future Work

The work unravels the opportunities for a wide array of topics which could address the
limitations faced throughout the research.

6.8.1 Standardisation of Labelling and Upscaling of the Dataset

The GDPRQA dataset [12] has laid the foundation for the development of a privacy policy dataset with data relevant to the modern regulatory landscape. However, the dataset can be further developed with more annotation pairs and an even greater variety of selected companies and asked questions, improving the size, quality, and diversity of the dataset. Moreover, further standardisation of the labelling would result in more consistent, higher-quality annotations.

6.8.2 Larger Models

Large language models such as BERT-large and RoBERTa-large were not implemented due to constrained computational power and the time limit. Yet, it was observed that the larger the model, the better it tends to perform. However, inference times with such models would be much longer, questioning their applicability in a production environment.
[12] Access to the dataset and the project in the annotation tool can be provided by contacting the authors of the thesis.


6.8.3 Longformers

Longformers leverage varying configurations of the attention heads that allow them to process the entire sequence, using an attention technique that scales linearly with sequence length. As a result, Longformers can take in up to 4,096 tokens. Moreover, Longformers apply global attention to the question tokens. Findings by Soleimani et al. (2021) suggest that Longformers perform significantly better than typical models on non-factoid QA datasets due to the increased input length.
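The scaling argument can be illustrated with a back-of-the-envelope comparison (the window size of 512 is an assumed, illustrative value, not a claim about the exact Longformer configuration):

```python
# Count of pairwise attention scores: full self-attention computes n*n,
# while Longformer-style sliding-window attention computes roughly n*w,
# where w is the local window size, hence linear scaling in n.
def full_attention_scores(n):
    return n * n

def window_attention_scores(n, w=512):
    return n * w

n = 4096  # Longformer's maximum input length
full = full_attention_scores(n)
local = window_attention_scores(n)
ratio = full / local  # full attention is several times more expensive here
```

As the sequence grows, the gap widens further, which is what makes 4,096-token inputs tractable.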

6.8.4 Dense Passage Retrievers

TF-IDF and BM25 are sparse retrievers which look up information in the document store. Dense Passage Retrieval (DPR) was introduced in 2020 as an alternative to these conventional techniques due to a variety of benefits. Semantically similar words are not considered a match by keyword algorithms such as TF-IDF and BM25, whereas DPR crafts dense vectors with shared semantic meaning in which such words lie very close together. Moreover, traditional sparse retrievers cannot be trained, while DPR leverages embeddings that can be trained and fine-tuned. However, as DPR requires very large annotated training data and significant computational power during indexing and retrieval (Karpukhin et al., 2020), it was not yet possible to implement it.
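A toy contrast makes this concrete (the two-dimensional "embeddings" below are illustrative assumptions, not output of a real DPR encoder; they only show why synonyms can match densely but not sparsely):

```python
# Sparse keyword matching vs dense-vector similarity for a synonym pair.
import math

def keyword_overlap(query, passage):
    # BM25/TF-IDF-style retrievers ultimately rely on shared terms.
    return len(set(query.lower().split()) & set(passage.lower().split()))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: the synonyms "erase" and "delete" point in nearly
# the same direction, while the unrelated "share" points elsewhere.
emb = {"erase": (0.9, 0.1), "delete": (0.85, 0.15), "share": (0.1, 0.9)}

sparse = keyword_overlap("erase", "delete")   # no lexical match
dense = cosine(emb["erase"], emb["delete"])   # high semantic similarity
```

A user asking about "erasing" their data would thus miss a policy section phrased in terms of "deletion" under keyword retrieval, while a dense retriever could still surface it.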

6.8.5 Text Generation

Instead of answer extraction, a potential research topic could be to use a generator which produces an answer from the context. Given the majority of non-factoid questions and the inherent complexity of the topic, text generation could hold large potential. It may also be argued that, in terms of user comprehension, long answers would benefit more from text generation than from extractive reading.

6.8.6 Evaluation Metrics

The F1 score is rather limited when evaluating the longer answer spans of non-factoid questions. Another possible metric is Intersection over Union (IoU), which measures the position-sensitive overlap between the prediction and the label (Soleimani et al., 2021). With the proliferating complexity of QA models producing more free-form and abstract answers, metrics should resemble human perception ever more closely, going beyond simple n-gram matching.
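As a minimal sketch (assuming answers are represented by token start/end offsets; this is an illustrative implementation, not the exact formulation of Soleimani et al., 2021), IoU over answer spans could be computed as:

```python
# Position-sensitive Intersection over Union for answer spans.
# Spans are (start, end) token offsets with the end exclusive.
def span_iou(pred, gold):
    start = max(pred[0], gold[0])
    end = min(pred[1], gold[1])
    intersection = max(0, end - start)
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union else 0.0

# A prediction shifted by half its length scores 1/3; a bag-of-words metric
# would ignore the positions entirely.
score = span_iou((0, 10), (5, 15))  # intersection 5, union 15
```

Unlike token-level F1, this rewards predictions that land in the right place in the policy, not merely ones that reuse the right words.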

Furthermore, Semantic Answer Similarity (SAS) may be an interesting way to evaluate single-document, closed-domain QA systems. SAS uses a Transformer-based cross-encoder that estimates semantic similarity rather than just lexical overlap. It therefore considers the meaning behind the strings, such that strings like "May twenty-fifth" and "05-25" would be judged semantically similar (Chen, Stanovsky, Singh, & Gardner, 2019). Moreover, human evaluation would be able to provide a more realistic assessment of the QA system.

6.8.7 Other Machine Reading Comprehension NLP Tasks

Further ideas could be implemented to help users comprehend the practices surrounding the management of their personal data. For instance, a regulatory industry standard could be developed to measure the aggressiveness of a policy based on whether the data collected and the practices taken match those expected of the given entity. Moreover, policies could be classified based on user-relevant metrics, which could be presented as warning icons, such as "this company collects your location data" or "your sensitive data is stored outside of the EU".


7 Conclusion
The thesis aided in making sense of the issue of privacy policy machine reading comprehension in the post-GDPR regulatory landscape. It demonstrated the feasibility of applying a QA system to automatically extract answers to user inquiries related to their personal data. It was determined that a Transformer-based approach to extractive QA employing transfer learning in the closed domain can indeed improve machine comprehension of privacy policies in the post-GDPR era, automatically addressing users' personal data inquiries.

First, the existing research perspective was examined. In line with recent literature, it was determined that a GDPR-relevant QA dataset was needed, making its creation a necessary part of our work. Furthermore, the project scope and research methodology were outlined, following the fundamentals and iterative nature of the CRISP-DM framework. Critical realism with an abductive research approach was employed under a longitudinal time horizon which compared pre- and post-GDPR privacy policies.

The thesis utilises the existing PolicyQA dataset along with the created GDPRQA dataset and an augmentation of the two. The datasets are augmented to introduce GDPR-related aspects into general privacy policy inquiries. The GDPRQA dataset addresses the shifts in privacy policy structure introduced under the effect of GDPR. GDPRQA is built on 47 GDPR privacy policies collected from companies present in the EU, varying in size, location and industry. The exploration of the collected policies revealed them to be rather long, vague to an extent, and complex to read. The fairly difficult readability scores deem the policies not easily readable by a general audience. Moreover, the vagueness score showed an average of 156 vague words per policy, and the average length of the policies was estimated at 4,569 words, almost double the pre-GDPR length of 2,319 words. As the legitimacy of privacy policies depends on users' comprehension under the "notice and choice" principle, such observations imply high risks of personal data misuse.

A variety of Transformer-based BERT models were explored and developed to improve machine comprehension of pre- and post-GDPR privacy policies. First, models were trained on open domain general knowledge QA datasets: the factoid SQuAD (used by the majority of QA tasks) and the non-factoid NLQuAD. GDPRQA performed better when trained on non-factoid NLQuAD, while PolicyQA performed better on SQuAD. The better performance could be explained by GDPRQA having longer answer annotation spans and a higher number of non-factoid questions.

Next, the models utilized transfer learning to learn the nuances of the privacy policy target domain language. GDPRQA achieved higher scores than PolicyQA in the test environment, which suggests that GDPR is able to standardise policies, with similar language and structure replicated across them. This signifies potential for stronger machine understanding, supporting the application of NLP tools, even as the complexity and length of the policies make them less readable by humans.

RoBERTa-based models, especially PrivBERT, a privacy policy domain configuration of RoBERTa, were best able to capture the specifics of privacy policy language. This could be explained by them being the most optimised for the GDPR QA task due to their more favourable pre-training scheme and generalisation. The developed models were able to beat the state-of-the-art results for PolicyQA by 7.6% through the use of the privacy-policy-specific language model PrivBERT and HPO. Both foundational privacy policy knowledge (obtained through the pre-GDPR PolicyQA dataset) and GDPR-specific context (through GDPRQA) were captured in the best-performing models. SQuAD-based PrivBERT scored 70.2% average F1, compared to NLQuAD-based PrivBERT averaging 70.7% F1 across both datasets.

The simulated production environment with an implemented ElasticSearch IR architecture portrays a different performance, proving that closed-domain (single-document) problems cannot be evaluated solely in a controlled testing environment in which the models are served a limited context due to the nature of the SQuAD format and the 512-token limit of BERT and RoBERTa. Moreover, answers might be found in several parts of a policy; the model may correctly predict one of the possible answers, yet not the ground truth, resulting in decreased evaluation performance. The decreased evaluation performance could in future
work be solved by using SAS or user evaluations. Hence, as is commonly done in tasks such as QA and recommendation systems due to the complexity of measuring performance, human evaluation is necessary to provide a realistic understanding of the model performance.

Deployment prototypes and business use cases were further elaborated, proposing a GDPRQA Assistant which could be a user-friendly complementary solution for privacy policy stakeholders. However, the GDPRQA Assistant should carry a disclaimer that it does not replace the binding policies themselves. GDPRQA could be deployed internally as an assistance tool for organisations to comply with legal regulations. Moreover, the GDPRQA Assistant could be deployed externally as an open source SaaS application or under a subscription-based business model. The framework could also include vagueness and aggressiveness scoring, opt-out choices, and sentence classification tools, and could be deployed as a web browser extension providing real-time feedback on the visited policy. A mobile application could also be developed to provide continuous information on applications' usage of data. Yet, given the length of policies and the number of splits needed due to BERT's limitations, inference time remains an issue.

Potential future work can involve a variety of activities addressing the limitations faced. For instance, the proposed GDPRQA dataset can be further upscaled and improved, which could result in stronger performance of the developed models. Moreover, larger models and possibly Longformers could be used, though the computational power trade-off must be considered. Furthermore, dense passage retrievers could benefit future work due to their comprehension of semantically similar words. Given the non-factoid nature of the questions, text generation could help create more user-comprehensible model responses. More novel evaluation metrics, such as SAS and IoU, as well as human evaluation, would be able to evaluate the performance more realistically.

All things considered, this thesis built on recent advancements in natural language processing to allow users to retain control over their privacy in more meaningful ways, addressing the unrealistic expectation of consuming numerous privacy policies on a nearly daily basis. As a result, models capable of improving state-of-the-art QA in the domain of privacy policies and adapting it to the post-GDPR regulatory landscape were developed, enhancing research in the field of machine comprehension of post-GDPR privacy policies.


8 Appendix

Company Industry Size Location


1 Happn Dating App 100-250 France
2 Kahoot Education 100-250 Norway
3 ManoMano Marketplace 500-1000 France
4 Spotify Audio Streaming 5000-10000 Sweden
5 Vinted Marketplace 500-1000 Lithuania
6 Voi Transport 500-1000 Sweden
7 Cisco Telecommunication 10000+ US - Global
8 HM Retail 10000+ Sweden
9 Kry Healthcare 500-1000 Sweden
10 Mercedes Automobile 10000+ Germany
11 Pandora Jewelry 10000+ Denmark
12 Zara Retail 10000+ Spain
13 GomSpace Space 50-100 Denmark
14 Jabra Audio 1000-5000 US - Global
15 MinionMasters Gaming 11-50 Denmark
16 Steam Community Platform 11-50 US - Global
17 UbiSoft Retail 5000-10000 France
18 Veolia Environment 1000-5000 France
19 CBS Education 1000-5000 Denmark
20 Coop Food 5000-10000 UK
21 ESA Space 1000-5000 Netherlands
22 Filippa K Retail 250-500 Sweden
23 Snyk Technology 500-1000 UK
24 UBS Banking 10000+ Switzerland
25 Danubius Hotels Hotel 50-100 Hungary
26 GrundFos Manufacturing 1000-5000 Denmark
27 Heinz Food 10000+ UK
28 Imaginary Cloud Technology 11-50 Portugal
29 Schneider Manufacturing 250-500 Germany
30 Zf Transport 10000+ Germany
31 Bang & Olufsen Audio 1000-5000 Denmark
32 BBC News 10000+ UK
33 Bosch Technology 10000+ Germany
34 Equinor Oil 10000+ Norway
35 Ferrari Automobile 1000-5000 Italy
36 IKEA Furniture 10000+ Sweden
37 KPMG Consulting 10000+ Canada - Global
38 Porsche Automobile 10000+ Germany
39 Roche Pharmaceuticals 10000+ Switzerland
40 SallingGroup Retail 5000-10000 Denmark
41 SATS Fitness 1000-5000 Norway
42 SuperFastFerries Transport 500-1000 Greece
43 Telenor Telecommunication 10000+ Norway
44 Zendesk Technology 1000-5000 US - Global
45 L’Oreal Cosmetics Retailer 10000+ France
46 Pitch Software 100-250 Germany
47 Eccoverde Cosmetics Retailer 50-100 Austria

Table 16: List of GDPR Privacy Policies Companies



def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_accuracy(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # if either the prediction or the truth is no-answer, score 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    common_tokens = set(pred_tokens) & set(truth_tokens)
    return len(common_tokens) / len(truth_tokens)

def compute_precision(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    common_tokens = set(pred_tokens) & set(truth_tokens)
    return len(common_tokens) / len(pred_tokens)

def compute_recall(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # body reconstructed (truncated in the original listing): recall divides
    # the shared tokens by the number of ground-truth tokens
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    common_tokens = set(pred_tokens) & set(truth_tokens)
    return len(common_tokens) / len(truth_tokens)

# Aggregation over the top-k candidate answers: each metric takes the best
# score achieved by any candidate (compute_f1 and normalize_text are defined
# elsewhere in the evaluation code).
if top_k_value > 1:
    em_score = max(compute_exact_match(guess['answer'], answer)
                   for guesses in answer_lst for guess in guesses)
    f1_score = max(compute_f1(guess['answer'], answer)
                   for guesses in answer_lst for guess in guesses)
    rec = max(compute_recall(guess['answer'], answer)
              for guesses in answer_lst for guess in guesses)
    prec = max(compute_precision(guess['answer'], answer)
               for guesses in answer_lst for guess in guesses)
else:
    em_score = max(compute_exact_match(guess['answer'], answer)
                   for guess in answer_lst)
    f1_score = max(compute_f1(guess['answer'], answer)
                   for guess in answer_lst)
    rec = max(compute_recall(guess['answer'], answer)
              for guess in answer_lst)
    prec = max(compute_precision(guess['answer'], answer)
               for guess in answer_lst)
Listing 7: Evaluation Code


References
Ahmad, W. U., Chi, J., Tian, Y., & Chang, K. W. (2020). PolicyQA: A reading comprehension dataset for privacy policies. In Findings of the Association for Computational Linguistics: EMNLP 2020. doi: 10.18653/v1/2020.findings-emnlp.66
Alduaij, S., Chen, Z., & Gangopadhyay, A. (2016). Using crowd sourcing to analyze
consumers’ response to privacy policies of online social network and financial insti-
tutions at micro level. International Journal of Information Security and Privacy,
10 . doi: 10.4018/IJISP.2016040104
Amos, R., Acar, G., Lucherini, E., Kshirsagar, M., Narayanan, A., & Mayer, J. (2021).
Privacy policies over time: Curation and analysis of a million-document dataset..
doi: 10.1145/3442381.3450048
Aroca-Ouellette, S., & Rudzicz, F. (2020). On losses for modern language models.. doi:
10.18653/v1/2020.emnlp-main.403
Berman, D. (2019). 10 Elasticsearch concepts you need to learn. Retrieved from https://logz.io/blog/10-elasticsearch-concepts/
Chen, A., Stanovsky, G., Singh, S., & Gardner, M. (2019). Evaluating question answering
evaluation.. doi: 10.18653/v1/d19-5817
Cranor, L. F. (2002). Web privacy with P3P. O'Reilly Media.
Das, R., Dhuliawala, S., Zaheer, M., & McCallum, A. (2019). Multi-step retriever-reader
interaction for scalable open-domain question answering..
Degeling, M., Utz, C., Lentzsch, C., Hosseini, H., Schaub, F., & Holz, T. (2019). We value
your privacy.. now take some cookies: Measuring the gdpr’s impact on web privacy.
Informatik-Spektrum, 42 . doi: 10.1007/s00287-019-01201-1
European Commission. (2016). Regulation (EU) 2016/679 of the European Parliament
and of the Council of 27 April 2016 on the protection of natural persons with regard
to the processing of personal data and on the free movement of such data, and re-
pealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA
relevance). European Commission. Retrieved from https://eur-lex.europa.eu/
eli/reg/2016/679/oj
Facebook. (2019). Roberta: An optimized method for pretraining self-supervised nlp
systems. Retrieved from https://ai.facebook.com/blog/roberta-an-optimized


-method-for-pretraining-self-supervised-nlp-systems/
Gallé, M., Christofi, A., & Elsahar, H. (2019). The case for a gdpr-specific annotated
dataset of privacy policies. In (Vol. 2335). Retrieved from http://ceur-ws.org/
Vol-2335/1st PAL paper 5.pdf
Gormley, C., & Tong, Z. (2015). Elasticsearch: The definitive guide. O'Reilly Media. Retrieved from https://books.google.dk/books?id=Ul9aBgAAQBAJ
Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow.
Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques (3rd ed.). Waltham, Mass.: Morgan Kaufmann Publishers. Retrieved from http://www.amazon.de/Data-Mining-Concepts-Techniques-Management/dp/0123814790/
Haq, Q. A. U. (2021). Cyber crime and their restriction through laws and techniques for
protecting security issues and privacy threats (Vol. 341). doi: 10.1007/978-981-33-4996-4_3
Harkous, H., Fawaz, K., Lebret, R., Schaub, F., Shin, K. G., & Aberer, K. (2018). Polisis:
Automated analysis and presentation of privacy policies using deep learning. In
Proceedings of the 27th usenix security symposium.
Hendrycks, D., & Gimpel, K. (2016). Bridging nonlinearities and stochastic regularizers
with gaussian error linear units. arXiv.
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization
for general algorithm configuration. In (Vol. 6683 LNCS). doi: 10.1007/978-3-642-25566-3_40
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., . . . Liu, Q. (2020). Tinybert:
Distilling bert for natural language understanding. doi: 10.18653/v1/2020.findings-emnlp.372
Jurafsky, D., & Martin, J. (2002). Speech and Language Processing. An Introduction to
Natural Language Processing, Computational Linguistics, and Speech Recognition.
Zeitschrift für Sprachwissenschaft, 21 (1). doi: 10.1515/zfsw.2002.21.1.134
Kamsetty, A. (2020). Hyperparameter optimization for huggingface transformers: A
guide. Retrieved from https://medium.com/distributed-computing-with-ray/
hyperparameter-optimization-for-transformers-a-guide-c4e32c6c989b
Karim, M. R. (2017). Searching and indexing with apache lucene. https://dzone.com/
articles/apache-lucene-a-high-performance-and-full-featured.


Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., . . . Yih, W. T. (2020).
Dense passage retrieval for open-domain question answering. doi: 10.18653/v1/2020.emnlp-main.550
Krippendorff, K. (2018). Content analysis: An introduction to its methodology.
SAGE Publications. Retrieved from https://books.google.dk/books?id=FixGDwAAQBAJ
Krumay, B., & Klar, J. (2020). Readability of privacy policies. In (Vol. 12122 LNCS).
doi: 10.1007/978-3-030-49669-2_22
Lapowsky, I. (2019). How cambridge analytica sparked the great privacy awakening.
Wired .
Lebanoff, L., & Liu, F. (2018). Automatic detection of vague words and sentences in
privacy policies. doi: 10.18653/v1/d18-1387
Lin, Y. P., & Jung, T. P. (2017). Improving eeg-based emotion classification us-
ing conditional transfer learning. Frontiers in Human Neuroscience, 11 . doi:
10.3389/fnhum.2017.00334
Linden, T., Khandelwal, R., Harkous, H., & Fawaz, K. (2020, 1). The privacy policy
landscape after the gdpr. Proceedings on Privacy Enhancing Technologies, 2020 ,
47-64. Retrieved from https://www.sciendo.com/article/10.2478/popets-2020-0004 doi: 10.2478/popets-2020-0004
Lohr, S. (2012). For impatient web users, an eye blink is just too long to wait. Retrieved
from https://www.immagic.com/eLibrary/ARCHIVES/GENERAL/GENPRESS/N120229L.pdf
Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization.
McDonald, A. M., & Cranor, L. F. (2009). The cost of reading privacy policies.
Min, S., Seo, M., & Hajishirzi, H. (2017). Question answering through transfer learning
from large fine-grained supervision data. In (Vol. 2, p. 510-517). Association for
Computational Linguistics. Retrieved from http://aclweb.org/anthology/P17-2081
doi: 10.18653/v1/P17-2081
Mishra, P., Rajnish, R., & Kumar, P. (2020). Sentiment analysis by novel hybrid
method be-cnn using convolutional neural network and bert. International Jour-
nal of Advanced Trends in Computer Science and Engineering, 9 (4). doi: 10.30534/
IJATCSE/2020/165942020
Murtezić, A. (2020). Convention 108: Present importance and implementation. Strani
pravni život. doi: 10.5937/spz64-26350


Mustapha, M., Krasnashchok, K., Bassit, A. A., & Skhiri, S. (2020). Privacy policy
classification with xlnet (short paper). In (Vol. 12484 LNCS). doi: 10.1007/978-3-030-66172-4_16
Nay, J. (2018). Natural language processing and machine learning for law and policy
texts. SSRN Electronic Journal. Retrieved from https://www.ssrn.com/abstract=3438276
doi: 10.2139/ssrn.3438276
Obar, J. A., & Oeldorf-Hirsch, A. (2020). The biggest lie on the internet: ignoring the
privacy policies and terms of service policies of social networking services. Information,
Communication & Society, 23 (1), 128-147. Retrieved from https://doi.org/10.1080/1369118X.2018.1486870
doi: 10.1080/1369118X.2018.1486870
Oltramari, A., Piraviperumal, D., Schaub, F., Wilson, S., Cherivirala, S., Norton, T. B.,
. . . Sadeh, N. (2018). Privonto: A semantic framework for the analysis of privacy
policies. Semantic Web, 9 . doi: 10.3233/SW-170283
Palmirani, M., & Governatori, G. (2018). Modelling legal knowledge for gdpr compliance
checking. In (Vol. 313). doi: 10.3233/978-1-61499-935-5-101
Rajpurkar, P., Jia, R., & Liang, P. (2018). Know what you don’t know: Unanswerable
questions for squad. In (Vol. 2). doi: 10.18653/v1/p18-2124
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). Squad: 100,000+ questions
for machine comprehension of text. doi: 10.18653/v1/d16-1264
Ramnath, S., Nema, P., Sahni, D., & Khapra, M. M. (2020). Towards interpreting bert
for reading comprehension based qa. doi: 10.18653/v1/2020.emnlp-main.261
Rangasai, K. (2022). How do you pick the right set of hyperparameters for a machine
learning project? Retrieved from https://devblog.pytorchlightning.ai/
how-do-you-pick-the-right-set-of-hyperparameters-for-a-machine
-learning-project-975951644152
Ravichander, A., Black, A., Wilson, S., Norton, T., & Sadeh, N. (2020). Question answer-
ing for privacy policies: Combining computational and legal perspectives. In Emnlp-
ijcnlp 2019 - 2019 conference on empirical methods in natural language processing
and 9th international joint conference on natural language processing, proceedings of
the conference. doi: 10.18653/v1/d19-1500
Reddy, A. C. O., & Madhavi, K. (2017). A survey on types of question answering system.
IOSR Journal of Computer Engineering (IOSR-JCE), 19 .


Reidenberg, J. R. (2000). Resolving conflicting international data privacy rules in cyberspace
(Vol. 52). doi: 10.2307/1229516
Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M. M., & Gatford, M. (1995,
January). Okapi at trec-3. In Overview of the third text retrieval conference (trec-
3) (Overview of the Third Text REtrieval Conference (TREC–3) ed., p. 109-126).
Gaithersburg, MD: NIST. Retrieved from https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/
Sathyendra, K. M., Schaub, F., Wilson, S., & Sadeh, N. (2016). Automatic extraction of
opt-out choices from privacy policies. In (Vol. FS-16-01 - FS-16-05).
Sathyendra, K. M., Wilson, S., Schaub, F., Zimmeck, S., & Sadeh, N. (2017). Identifying
the provision of choices in privacy policy text. doi: 10.18653/v1/d17-1294
Saunders, M., Lewis, P., & Thornhill, A. (2019). Research methods for business students
(8th ed.). Pearson.
Schuster, M., & Nakajima, K. (2012). Japanese and korean voice search. doi: 10.1109/ICASSP.2012.6289079
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words
with subword units. In (Vol. 3). doi: 10.18653/v1/p16-1162
Shao, T., Guo, Y., Chen, H., & Hao, Z. (2019). Transformer-based neural network for
answer selection in question answering. IEEE Access, 7 . doi: 10.1109/ACCESS
.2019.2900753
Soleimani, A., Monz, C., & Worring, M. (2021). Nlquad: A non-factoid long question
answering data set. doi: 10.18653/v1/2021.eacl-main.106
Srinath, M., Wilson, S., & Giles, C. L. (2021). Privacy at scale: Introducing the privaseer
corpus of web privacy policies. In (p. 6829-6839). Association for Computational
Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.532 doi:
10.18653/v1/2021.acl-long.532
Story, P., Zimmeck, S., Ravichander, A., Smullen, D., Wang, Z., Reidenberg, J., . . . Sadeh,
N. (2019). Natural language processing for mobile app privacy compliance. In Ceur
workshop proceedings (Vol. 2335).
Strubell, E., Ganesh, A., & McCallum, A. (2019, July). Energy and policy considerations
for deep learning in NLP. In Proceedings of the 57th annual meeting of the associa-
tion for computational linguistics (pp. 3645–3650). Florence, Italy: Association for
Computational Linguistics. Retrieved from https://aclanthology.org/P19-1355


doi: 10.18653/v1/P19-1355
Su, Y., Sun, H., Sadler, B., Srivatsa, M., Gür, I., Yan, Z., & Yan, X. (2016). On generating
characteristic-rich question sets for qa evaluation. doi: 10.18653/v1/d16-1054
Tesfay, W. B., Hofmann, P., Nakamura, T., Kiyomoto, S., & Serna, J. (2018, 3).
Privacyguide. In (Vol. 2018-January, p. 15-21). ACM. Retrieved from
https://dl.acm.org/doi/10.1145/3180445.3180447 doi: 10.1145/3180445.3180447
Torre, D., Soltana, G., Sabetzadeh, M., Briand, L. C., Auffinger, Y., & Goes, P. (2019).
Using models to enable compliance checking against the gdpr: An experience report.
doi: 10.1109/MODELS.2019.00-20
Truong, N. B., Sun, K., Lee, G. M., & Guo, Y. (2020). Gdpr-compliant personal data man-
agement: A blockchain-based solution. IEEE Transactions on Information Forensics
and Security, 15 . doi: 10.1109/TIFS.2019.2948287
Trzaskowski, J., & Sørensen, M. G. (2019). Gdpr compliance: Understanding the general
data protection regulation. Ex Tuto Publishing.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polo-
sukhin, I. (2017). Attention is all you need. In Advances in neural information
processing systems (Vol. 2017-Decem).
Wang, L., Feng, M., Zhou, B., Xiang, B., & Mahadevan, S. (2015). Efficient hyper-
parameter optimization for nlp applications.. doi: 10.18653/v1/d15-1253
Wilson, S., Schaub, F., Dara, A. A., Liu, F., Cherivirala, S., Leon, P. G., . . . Sadeh,
N. (2016). The creation and analysis of a Website privacy policy corpus. In 54th
annual meeting of the association for computational linguistics, acl 2016 - long papers
(Vol. 3). doi: 10.18653/v1/p16-1126
Wilson, S., Schaub, F., Liu, F., Sathyendra, K. M., Smullen, D., Zimmeck, S., . . . Smith,
N. A. (2019, 2). Analyzing privacy policies at scale. ACM Transactions on the
Web, 13, 1-29. Retrieved from https://dl.acm.org/doi/10.1145/3230665 doi:
10.1145/3230665
Wiratchawa, K., Khunthong, T., & Intharah, T. (2021). Legalbert-th: Development of
legal qa dataset and automatic question tagging. doi: 10.1109/ECTI-CON51831.2021.9454753
Zaeem, R. N., & Barber, K. S. (2021). A large publicly available corpus of website privacy
policies based on dmoz. doi: 10.1145/3422337.3447827
Zimmeck, S., Story, P., Smullen, D., Ravichander, A., Wang, Z., Reidenberg, J., . . . Sadeh,
N. (2019). Maps: Scaling privacy compliance analysis to a million apps. Proceedings
on Privacy Enhancing Technologies, 2019. doi: 10.2478/popets-2019-0037
