Full Text 01
Oskar Lindén
May 30, 2023
Internal supervisor: Monowar Bhuyan
External supervisor: Erik Nordström
Partner: Nasdaq
Spring 2023
Master Thesis, 30 ECTS
Master of Science in computing science and engineering, 300 ECTS
Acknowledgements
First off, I would like to express my deepest gratitude to my internal advisor at Umeå University, Monowar Bhuyan, for his invaluable feedback and experience during our bi-weekly meetings. I am also deeply grateful to my external advisor at Nasdaq, Erik Nordström, for his answers to my many questions.
I am also very thankful to Martin Wuotila, Mikael Svensson and Magnus Larsson for their
invaluable insights during the course of the project.
And last, but not least, I would like to thank my partner Emma for her unwavering love
and support.
Abstract
In order to increase the resiliency and redundancy of a distributed system, it is
common to keep standby systems and backups of data in different locations than the
primary site, separated by a meaningful distance in order to tolerate local outages. Nas-
daq has accomplished this by maintaining primary-standby pairs or primary-standby-
disaster triplets with at least one system residing in a different site. The team at
Nasdaq is experimenting with a redundant deployment scheme in Kubernetes with
three availability zones, located within a single geographical region, in Amazon Web
Services. They want to move the disaster zone to another geographical region in order
to improve the redundancy and resiliency of the system. The aim of this thesis is to
investigate how this could be done and to compare the different approaches.
To compare the different approaches, a simple observable model of the chain replication strategy is implemented. The model is deployed in an Elastic Kubernetes Service cluster
on Amazon Web Services, using Helm. The supporting infrastructure is defined and
created using Terraform. This model is subjected to evaluation through HTTP requests
with different configurations and scenarios, to measure latency and throughput. The
first scenario is a single user making HTTP requests to the system, and the second
scenario is multiple users making requests to the system.
The results show that the throughput is lower and the latency is higher with the
multi-region approach. The relative difference in median throughput is -54.41% and
the relative difference in median latency is 119.20%, in the single-producer case. In the
multi-producer case, both the relative differences in median throughput and latency are reduced when increasing the number of partitions in the system.
Contents
1 Introduction 1
1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Thesis organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Background 3
2.1 Clearing product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Cloud computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Kubernetes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 Helm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Amazon Web Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4.1 Elastic Kubernetes Service . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Inter-region communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5.1 VPC peering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5.2 Transit gateway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6 A Cloud Guru . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 CAP and PACELC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Infrastructure as code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.8.1 Terraform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9 Observability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9.1 OpenTelemetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.2 Jaeger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Related Work 12
4 Solution Design 14
4.1 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3.1 Single region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3.2 Multi-region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 Deployment scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5 Evaluation 19
5.1 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.1 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2 Test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.2.1 Single producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.2.2 Multiple producers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6 Results 21
6.1 Single producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6.1.1 Long-running tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6.1.2 Fixed amount of requests . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2 Multiple producers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2.1 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2.2 Throughput difference . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.2.3 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.2.4 Latency difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7 Discussion 33
7.1 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
8 Conclusion 35
8.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
8.2 Personal remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
References 36
Appendices 39
List of Figures
1 The components of a Kubernetes cluster . . . . . . . . . . . . . . . . . . . . . 4
2 Different type of Pods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 Peering connection between VPCs . . . . . . . . . . . . . . . . . . . . . . . . 7
4 Transit gateway with three VPC attachments . . . . . . . . . . . . . . . . . . 8
5 Trace waterfall diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
6 Typical Jaeger architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
7 Different Jaeger deployment options . . . . . . . . . . . . . . . . . . . . . . . 11
8 Average RTTs between region pairs . . . . . . . . . . . . . . . . . . . . . . . . 12
9 System model architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
10 Dependencies between operations in the gateway process . . . . . . . . . . . . 15
11 Dependencies between operations in the non-primary backend processes . . . 15
12 Dependencies between operations in the primary backend process . . . . . . . 16
13 Single-region infrastructure design . . . . . . . . . . . . . . . . . . . . . . . . 17
14 Multi-region infrastructure design . . . . . . . . . . . . . . . . . . . . . . . . . 18
15 Throughput over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
16 Throughput distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
17 Latency over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
18 Latency distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
19 Latency with different batch sizes for both approaches . . . . . . . . . . . . . 23
20 Median throughput with different batch sizes for both approaches . . . . . . . 24
21 Median effective throughput with different batch sizes for both approaches . . 24
22 Internal gateway operations with batch size 1 . . . . . . . . . . . . . . . . . . 25
23 Internal gateway operations with batch size 1000 . . . . . . . . . . . . . . . . 25
24 Internal primary backend operations with batch size 1 . . . . . . . . . . . . . 26
25 Internal primary backend operations with batch size 1000 . . . . . . . . . . . 26
26 Internal standby backend operations with batch size 1 . . . . . . . . . . . . . 27
27 Internal standby backend operations with batch size 1000 . . . . . . . . . . . 27
28 Internal disaster backend operations with batch size 1 . . . . . . . . . . . . . 27
29 Internal disaster backend operations with batch size 1000 . . . . . . . . . . . 28
30 Effective throughput with different batch sizes and number of partitions . . . 29
31 Relative difference of median effective throughput between the approaches . . 30
32 Latency with different amount of partitions with a batch size of 1 . . . . . . . 30
33 Latency with different amount of partitions with a batch size of 10 . . . . . . 31
34 Latency with different amount of partitions with a batch size of 100 . . . . . 31
35 Latency with different amount of partitions with a batch size of 1000 . . . . . 31
36 Relative difference of median latency between the approaches . . . . . . . . . 32
37 Effective throughput with different batch sizes and amount of partitions . . . 39
38 Effective throughput with different batch sizes and amount of partitions . . . 40
39 Relative difference in median effective throughput between the approaches . . 40
40 Latency with different amount of partitions with a batch size of 1 . . . . . . . 41
41 Latency with different amount of partitions with a batch size of 10 . . . . . . 41
42 Latency with different amount of partitions with a batch size of 100 . . . . . 41
43 Latency with different amount of partitions with a batch size of 1000 . . . . . 42
44 Latency with different amount of partitions with a batch size of 10000 . . . . 42
45 Relative difference in median effective latency between the approaches . . . . 42
List of Tables
1 Comparison of latencies among existing works . . . . . . . . . . . . . . . . . . 13
1 Introduction
In December 2012, Netflix had an 18-hour outage which affected the majority of customers as a result of a regional failure of the Amazon Web Services (AWS) Elastic Load Balancer (ELB) service in the region us-east-1, where Netflix had all of its cloud operations at that
time. After the outage, AWS made several improvements in their networking control plane,
ELB service architecture and configuration management process. Netflix switched to an
active-active cross regional architecture, where if one region fails, traffic will be rerouted to
another healthy region [1].
Separating backup stacks from the primary stack by a meaningful distance is a way of toler-
ating local outages, such as earthquakes and power-grid failures. By isolating the different
stacks in different failure domains, we can ensure that the customers can still be serviced
by the system if any of the stacks experience local outages. AWS offers two different tiers
of geographical failure domains: availability zones and regions.
Market platform solutions, such as the clearing system, place high requirements on resilience and data integrity, which is often achieved with highly redundant and replicated microservices. Nasdaq needs to better understand which capabilities the cloud offers for cross-region redundancy, in order to achieve an even higher level of resiliency than a single-region, multi-availability-zone deployment.
Today, the team at Nasdaq is experimenting with a redundant deployment scheme consisting
of three availability zones within AWS: one primary, one secondary and one disaster zone.
The team is interested in moving the disaster zone to another AWS region in order to
improve the redundancy and resiliency of the system.
1.1 Problem formulation
The goal of this thesis project is to find out how a multi-region approach would work with a simple model of their current system, and what the limitations of such an approach are. The questions to be answered at the end of this project are how such a deployment can be realised and how it compares, in terms of throughput and latency, with the current single-region deployment.
1.2 Thesis contribution
The contribution of this thesis is to quantify some of the limitations that come with a multi-region deployment of a chain replicating system compared to a single-region deployment of the same system.
1.3 Thesis organisation
Section 2 describes relevant background information, tools and theory. Section 3 reports related work on the topic of measuring and comparing inter-region communication in the cloud, including multi-regional deployment. These Sections are the result of an initial literature survey.
Section 4 explains the solution infrastructure for both approaches and the system architecture to be evaluated. Section 5 describes the evaluation methods.
Section 6 describes the results of the evaluation which are later discussed in Section 7.
Section 8 describes the conclusion, future work and personal remarks.
2 Background
The subsequent Sections describe relevant information about the system that is modeled,
tools that are used to build the infrastructure and deploy the system, and other important
concepts.
2.1 Clearing product
The Nasdaq clearing product is a complex system consisting of multiple business and auxil-
iary domains, each consisting of different functional groups and different types of processes.
Within this system, there are two main types of processes: concurrent stateless processes,
and replicating stateful processes.
The current deployment model of the system follows a failover model with different levels
of resiliency. The system processes can be deployed with primary-standby pairs or primary-
standby-disaster triplets. While the primary and standby processes reside within the same
data center in an on-premise solution, the disaster process lives in a separate data center in
order to increase the redundancy and resiliency of the system. If a primary process becomes
unavailable, a standby process immediately takes over and becomes the new primary.
The primary process receives incoming messages, sorts them, and assigns sequence numbers to
them. Then, the process initiates a transaction with the next message in the sequence. The
process synchronously replicates the input to a standby and a disaster process, and waits for
an acknowledgement from both processes. It also performs some sort of business logic with
the incoming message and stores the result in an embedded database, and synchronously
writes the message to disk. The process performs all operations in parallel.
This approach is similar to the chain replication strategy with a primary/backup approach
introduced by van Renesse and Schneider. Their chain replication protocol assumes that the
servers are fail-stop, i.e., each server halts instead of making an erroneous state transition
in response to failures, and that this halted state can be detected by the environment.
The primary sequences client requests, distributes the requests or their resulting updates
to other backups, awaits acknowledgements from all non-faulty backups, and sends a reply
to the client. With this approach, updating and query requests, i.e., writes and reads, are processed by the primary server to ensure strong consistency [2].
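To make the replication step concrete, the following is a minimal Go sketch of the primary's role in such a primary/backup chain: it assigns a sequence number to each incoming message, forwards the message synchronously to its backups in parallel, and only proceeds once every backup has acknowledged. The types and the Replica interface are illustrative assumptions; the sketch is not taken from the clearing product or from van Renesse and Schneider's implementation.

package chainrepl

import (
	"fmt"
	"sync"
)

// Message is an illustrative stand-in for a sequenced client request.
type Message struct {
	Seq     uint64
	Payload []byte
}

// Replica abstracts a standby or disaster process that acknowledges a
// replicated message (in the real system this would be a TCP peer).
type Replica interface {
	Replicate(m Message) error
}

// Primary sequences messages and synchronously replicates them.
type Primary struct {
	seq      uint64
	replicas []Replica // e.g., standby and disaster
}

// Handle assigns the next sequence number, replicates the message to all
// backups in parallel, and returns only after every backup has acknowledged.
func (p *Primary) Handle(payload []byte) error {
	p.seq++
	m := Message{Seq: p.seq, Payload: payload}

	var wg sync.WaitGroup
	errs := make(chan error, len(p.replicas))
	for _, r := range p.replicas {
		wg.Add(1)
		go func(r Replica) {
			defer wg.Done()
			errs <- r.Replicate(m)
		}(r)
	}
	wg.Wait()
	close(errs)

	for err := range errs {
		if err != nil {
			return fmt.Errorf("replication failed: %w", err)
		}
	}
	// Only now would the primary apply the business logic, persist the
	// result and reply to the client.
	return nil
}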
The clearing product can be horizontally scaled by moving processes to new server nodes.
The different processes themselves can also be horizontally scaled by partitioning the data,
i.e., each partition of a certain process only operates on a subset of the entire data.
2.2 Cloud computing
Cloud computing is on-demand access to computing resources, servers, data storage, devel-
opment tools, networking, etc., made available via the Internet by a cloud service provider
(CSP). It also refers to the technology behind cloud computing, i.e., abstracted and virtu-
alised infrastructure that can be pooled and divided to provide the requested resources to
the customer [3].
Cloud services are commonly divided into three service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). With IaaS, CSPs provide the underlying infrastructure, i.e., servers, storage and networking, while the customer manages the software stack on top of it, which gives the highest level of control to the customer; SaaS, in contrast, offers the lowest level of control to the customer. With PaaS, CSPs offer, run and maintain both the infrastructure and the system environment, often built around containers [3], in which customers can deploy their own software and applications [4]. With SaaS, CSPs run and maintain the application software, system environment and infrastructure, which customers can use over the Internet. Examples of SaaS services are Gmail and Google Docs [4].
2.3 Kubernetes
Figure 1: The components of a Kubernetes cluster.
The worker node consists of a kubelet, which makes sure that containers are running in Pods,
a network proxy that maintains network rules on nodes and allows network communication
to the Pods from network sessions inside or outside the cluster, and a container runtime,
which is the software responsible for running the containers [5].
The control plane has several components, e.g., the API server, backing store, scheduler,
controller manager and cloud controller manager. The API server exposes the Kubernetes
API and acts as the front end for the control plane. The backing store stores the cluster
data. The scheduler watches for newly created Pods with no assigned node, and selects
a node for them to run on. The controller manager is a component that runs controller
processes. The cloud controller manager embeds cloud-specific control logic and lets you
link your cluster into your CSP's API, and only runs controllers specific to that CSP [5]. A
Pod is a group of one or more containers and a specification for how to run the containers.
The containers within a Pod share both storage and networking resources. The contents of
the Pod are relatively tightly coupled, are always co-located and co-scheduled, and run in
a shared context. Pods are designed to be ephemeral and disposable, and are not usually
created directly [6]. Figure 2 shows the contents of different types of Pods.
There are workload resources that manage Pods at a higher level of abstraction, e.g., ReplicaSets, Deploy-
ments, StatefulSets, etc. A ReplicaSet maintains a stable set of Pod replicas at any given
time, and guarantees the availability of a specified number of identical pods [8]. A Deploy-
ment provides declarative updates for Pods and ReplicaSets [9]. A StatefulSet manages the
deployment and scaling of a set of Pods, and provides guarantees about the ordering and
uniqueness of these Pods [10].
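To illustrate what such a workload resource specifies, the following Go sketch builds and creates a Deployment using the Kubernetes client libraries. In practice, an equivalent object is normally written as a YAML manifest or rendered by a Helm Chart; the name, labels and container image below are placeholders, and error handling is reduced to a minimum.

package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Load the kubeconfig from its default location (~/.kube/config),
	// the same default Helm uses to talk to the cluster.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	labels := map[string]string{"app": "gateway"} // illustrative label

	// A Deployment declaratively asks for three identical Pods; the
	// Deployment controller creates a ReplicaSet that maintains them.
	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "gateway"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(3),
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "gateway",
						Image: "example.com/gateway:latest", // placeholder image
					}},
				},
			},
		},
	}

	created, err := clientset.AppsV1().Deployments("default").
		Create(context.Background(), dep, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created Deployment", created.Name)
}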
2.3.1 Helm
Helm is a package manager for Kubernetes used to install applications on Kubernetes clus-
ters. Helm has the concept of a Chart, which is a Helm package. The Chart contains all
resource definitions necessary in order to run the application inside a Kubernetes cluster. It
can be used to deploy everything from small sample applications to complex systems. Helm
Chart templates are written in the Go template language, with some extensions, and a template engine is used to generate the resource definitions to be applied in Kubernetes. Values for the templates can be supplied using YAML files or be passed as command line arguments
during Chart installation or upgrading [11].
A Helm Repository is a place where charts can be stored and shared. Repositories can be
added and removed from the local Helm client. The charts stored in repositories can be
directly installed to a Kubernetes cluster or be used as dependencies for local Charts. A
release is an instance of a chart running in a Kubernetes cluster. A chart can be installed multiple times in the same cluster using different release names to keep track of the different releases. The releases can be upgraded when a new update of the chart is available, or when you want to change the values of a release. Releases can also be rolled back to a previous version, or revision, of the release, or uninstalled from the cluster [12]. Helm uses the default location of the Kubernetes configuration, i.e., the one found in ~/.kube/config, to communicate with the cluster. The Kubernetes configuration also contains contexts for each cluster added to the configuration, and Helm includes options to switch the context in order to communicate with a different cluster than the one currently used by the configuration [13].
2.4 Amazon Web Services
Amazon Web Services (AWS) is a CSP that offers a variety of services, e.g., virtual servers,
object storage and managed Kubernetes service. AWS manages, at the time of writing,
99 Availability Zones within 31 geographical regions around the world [14]. A region is an
isolated, physical location in a geographical area where groups of data centers are clustered
into different logical groups called Availability Zones (AZs) [14, 15]. In addition, the AWS
control planes and management console are distributed across regions [14]. Each region
consists of at least three isolated and physically separate AZs within this geographic area
[15].
An AZ consists of one or more physical data centers with independent cooling, physical
security and redundant power. The AZs within a region are interconnected with high-
bandwidth, low-latency networking. The AZs are also separated with a meaningful distance
but within 100 km of each other [15]. Customers can leverage the use of multiple AZs in order
to increase the availability and resiliency of their systems. There are several exceptions to
the isolation and independence between regions in AWS. The Amazon Domain Name Service
(DNS) web service, Route 53, depends on the control plane in region us-east-1 to create,
update or delete DNS records, etc. CloudFront is a content delivery network (CDN) that
also relies on the control plane in us-east-1 to create edge-optimised API endpoints. In
addition to its dependency on the region us-east-1 for multiple control plane actions,
Amazon Simple Storage Service (S3) also depends on the us-west-2 region for its Multi-
Region Access Points [16].
2.4.1 Elastic Kubernetes Service
Amazon Elastic Kubernetes Service (EKS) is a managed PaaS that customers can use to
run Kubernetes on AWS without having to manage a Kubernetes control plane or worker
nodes. EKS manages and scales the control plane across several AZs to ensure high avail-
ability, scales control plane instances based on load, detects and replaces unhealthy control
plane instances and provides automated version updates and patching for them. EKS also
integrates other AWS services into the platform, e.g., container image repositories, external
load balancers, authentication and virtual private clouds (VPCs) for isolation [17]. EKS
clusters are regional and cannot span multiple regions. In order to deploy an application across multiple regions on EKS, you would have to create one or more clusters in each region that you want to deploy your application in.
2.5 Inter-region communication
There are different ways of communicating between AWS regions. Services located in different regions could communicate with each other over the public Internet. However, there are some AWS-native ways of communicating that do not involve the public Internet. AWS offers different services that enable inter-region communication, e.g., VPC peering and Transit Gateway.
2.5.1 VPC peering
AWS uses the concept of VPCs, which are client-defined, logically isolated virtual networks, closely resembling a traditional data center network, used for launching AWS resources. Within the VPC there are a number of features to configure, e.g., subnets, routing tables, gateways and VPN connections [18]. VPC peering is a connection between two VPCs, i.e., a one-to-one relationship. The connected VPCs can belong to different AWS accounts and/or different regions. VPC peering is not transitive: if VPC A is connected to both VPC B and VPC C with a peering connection each, as shown in Figure 3, VPC B and VPC C cannot communicate directly unless a peering connection is established between VPC B and VPC C. This means that the connection complexity grows with the number of VPCs to be connected [19]; fully meshing n VPCs requires n(n − 1)/2 peering connections, e.g., 45 connections for 10 VPCs.
There are several other limitations to VPC peering, some of which are mentioned here. There is a quota on the number of active and pending peering connections per VPC, by default up
to 50 and 25 respectively [20]. There cannot be more than one peering connection between
two VPCs at a time. There cannot be matching or overlapping IPv4 or IPv6 Classless
Inter-Domain Routing (CIDR) blocks in the VPCs [19].
Figure 3: Peering connection between VPCs, recreated from AWS [19].
2.5.2 Transit gateway
A transit gateway acts as a regional virtual router that interconnects VPCs and on-premises
networks. The gateway scales elastically based on network traffic volume. A transit gateway
attachment is both a source and destination of packets. A few different resources can be
attached to the transit gateway, e.g., one or more VPC and one or more VPN connections.
Figure 4 shows a transit gateway with three VPC attachments. The route table for each
VPC includes the local route and routes that send traffic destined for the other two VPCs to
the transit gateway. The CIDR blocks for each VPC propagate to the transit gateway route
table. Therefore, each attachment can route packets to the other attachments. In order to
connect VPCs in different regions, each region needs its own transit gateway and a peering
connection has to be established between the transit gateways [21].
As with VPC peering, there are several limitations or region-specific quotas related to transit gateways. There is, for example, a default maximum of 5 transit gateways per account, 20 route tables per transit gateway, 5,000 attachments (not adjustable) per transit gateway, 5 transit gateways per VPC and 50 peering attachments per transit gateway. The transit gateway
also has a maximum bandwidth of 50 Gbps (not adjustable) per VPC attachment or peered
transit gateway connection [22].
2.6 A Cloud Guru
A Cloud Guru (ACG) is an online learning platform for cloud computing. The site offers
courses and labs for three major CSPs (AWS, Microsoft Azure and Google Cloud Platform).
The platform also offers playgrounds at the same CSPs. These playgrounds are essentially
limited cloud user accounts where learners can practice cloud operations for up to 8 hours
after the playground has been created.
The limitations posed on these user accounts concern the type and capacity of resources that can be provisioned, e.g., a limited number of virtual machines and specific types of virtual machine instances. The limitations are constantly monitored, and if a user goes outside these
limitations, the playground is shut down. One significant limitation is that transit gateways
cannot be provisioned using an ACG playground. Another significant limitation is that the
only two AWS regions available are us-east-1 and us-west-2. Each account is limited to
creating nine virtual machine instances at a time. Creating ten or more instances leads to
the playground being suspended and shut down immediately. The billing information of the
resources the account creates in AWS cannot be viewed by playground users [23].
Figure 4: Transit gateway with three VPC attachments, recreated from AWS [21].
2.7 CAP and PACELC
Eric Brewer theorised that there is a fundamental trade-off between consistency (C), avail-
ability (A) and partition tolerance (P) in a distributed system. He conjectured, in the CAP
theorem, that you can only ever guarantee two out of the three properties [24]. Brewer’s
conjecture was later proved by Gilbert and Lynch, where they also provided a formal model.
They define consistency as atomic or linearisable consistency, under which there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. Availability is defined as the requirement that every request received by a non-failing node must
yield a response. Lastly, for the distributed system to be partition tolerant, it must behave
as intended in the case of a network partition [25]. Generally, since no network is safe from
partitioning, the real trade-off is between consistency and availability. However, in the ab-
sence of partitions, there is no need to choose between availability and consistency. Instead,
the trade-off is between consistency and latency [26].
A more complete model, which takes into account the trade-offs between consistency and latency in the baseline case, is PACELC: in the presence of a partition (P), how does the system trade off availability (A) and consistency (C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C) [26]?
2.8 Infrastructure as code
There are two different choices to make when choosing the IaC solution to use. The first
choice is between mutable and immutable infrastructure. Mutable infrastructure can be
modified or updated after it has been provisioned. This improves the flexibility of the
infrastructure to make customisations. However, this contradicts one of the benefits of
IaC - consistency. Immutable infrastructure, on the other hand, cannot be modified after
it has been provisioned. If the infrastructure needs to be changed, it has to be replaced
altogether. The second choice is between a declarative and an imperative approach. With
the declarative approach, you specify the desired state of the infrastructure you want and
then the IaC software handles the rest. With the imperative, or procedural, approach the
software prepares automation scripts that provision the infrastructure one step at a time
[27].
2.8.1 Terraform
Terraform is an open-source, declarative IaC tool, created by HashiCorp, used for provision-
ing immutable infrastructure across multiple cloud and on-premises data centers. Terraform
uses a high-level configuration language called HashiCorp Configuration Language (HCL)
to describe the desired infrastructure. The software creates a plan to reach the desired state
and then executes that plan [28].
Terraform provides implicit support for multi-region infrastructure creation through the
use of multiple providers, each configured with separate region tags. Provider plugins are
ways to interact with remote systems, e.g., cloud providers, SaaS platforms and other APIs.
Each provider defines a set of resource types and/or data sources that can be used within
Terraform. Providers are distributed separately from Terraform itself and have their own
releases [29]. Resource blocks describe one or more infrastructure objects, e.g., virtual
networks, compute instances or DNS records, and are the most important elements of the
Terraform language [30]. Data sources can be used by Terraform in order to use information
defined outside of Terraform, defined by separate Terraform configurations or modified by
functions [31].
2.9 Observability
Observability is a way of asking questions to a system from the outside without knowing
how the system operates under the hood. Observability is useful for troubleshooting, helps
with handling novel problems, and helps answer why a certain thing is happening. In order
to completely observe a system, it has to be properly instrumented, i.e., the application code
must emit signals such as traces, metrics and logs. It is said that an application is properly
instrumented when the developers do not need to add more instrumentation to troubleshoot
an issue, since they have all the information they need [32].
Timestamped log messages appear almost everywhere in software and are emitted by services
and other components. However, log messages often lack important contextual information,
such as where they were called from, which makes them unfit for tracking code execution.
Log messages become far more useful when they are part of a Span. A Span is a data structure that represents a unit of work or an operation, tracking a specific operation that a request makes. The structure contains the name of the Span, time-related data, structured
log messages and other metadata, e.g., transport protocol, IP addresses, HTTP header data,
etc [32].
Traces record the paths taken by requests as they propagate through the system containing
one or multiple services. A Trace consists of one or more Spans, each representing a unit
of work performed as the request propagates through the system. Tracing a system makes
understanding and breaking down what happens when a request flows through the system
a less daunting task [32]. Figure 5 shows a typical Trace visualised as a waterfall diagram
and the request path through different services.
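As a minimal sketch of how a Trace is built up from Spans, the following assumes the OpenTelemetry Go SDK; the thesis does not state which SDK or language the model system uses, and the span names below are purely illustrative.

package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// handleRequest creates a parent span and a child span, illustrating how a
// trace records one request as a tree of units of work.
func handleRequest(ctx context.Context) {
	tracer := otel.Tracer("example-service") // instrumentation scope name

	ctx, parent := tracer.Start(ctx, "Incoming request")
	defer parent.End()

	// A nested operation becomes a child span because it is started from
	// the context that carries the parent span.
	_, child := tracer.Start(ctx, "Sending to backend")
	// ... perform the actual work here ...
	child.End()
}

func main() {
	// Without a configured TracerProvider the spans are no-ops; Section
	// 2.9.2 shows how they can be exported to a backend such as Jaeger.
	handleRequest(context.Background())
}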
2.9.1 OpenTelemetry
OpenTelemetry is an open-source observability framework that provides a collection of vendor-neutral APIs, SDKs and tools for instrumenting applications and for generating, collecting and exporting telemetry data such as traces, metrics and logs.
2.9.2 Jaeger
The data, i.e., traces, metrics and logs, emitted from instrumented code must be sent to
an Observability backend. One such backend is Jaeger. Jaeger is used for monitoring and
troubleshooting distributed systems based on a microservice architecture [35]. Historically,
Jaeger has provided its own collection of tracing SDKs and agents that listen for spans
sent over UDP. However, the project has since deprecated its own SDKs in favor of the
OpenTelemetry SDKs. Jaeger supports a number of different deployments. Figure 6 shows
an instrumented application which sends its trace data to an instance of the Jaeger Collector,
which runs the traces through a processing pipeline and stores them in a storage backend.
The Collector can be deployed as a part of the host or container, or as a remote service.
The data is then used by the Jaeger Query service, which provides a Web UI for searching
and analysing traces [36].
The Collector also supports receiving data directly from the OpenTelemetry SDKs and can
be paired up with an OpenTelemetry Collector deployed within the host or container, or as
a remote service [36], as shown in Figure 7.
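To connect the two in practice, a tracer provider can be configured to batch finished spans and export them to a Jaeger Collector, which in recent versions accepts OTLP directly (port 4318 for OTLP over HTTP). The sketch below uses the OpenTelemetry Go SDK with an assumed in-cluster collector address; it is an illustration, not the configuration used in this thesis.

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/HTTP to a Jaeger Collector. The hostname is
	// an assumed in-cluster service name.
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("jaeger-collector:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// The batch span processor buffers finished spans and ships them to
	// the collector in the background.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Spans created via otel.Tracer(...) are now recorded and exported.
	_, span := otel.Tracer("example-service").Start(ctx, "example operation")
	span.End()
}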
3 Related Work
This Section presents related research that compares and measures intra- and inter-region
communication and multi-regional deployment in the cloud.
Iqbal et al. have studied the intra-AZ, inter-AZ and inter-region availability and latencies of 21 pairs of AZs, in three different regions, and 210 pairs of regions in AWS. They exchanged Internet Control Message Protocol (ICMP) pings between pairs of AWS compute instances for 100 days in all three hierarchies (intra-AZ, inter-AZ and inter-region) in order to analyse the availability violations at different granularities as well as study the latency by measuring the round-trip time (RTT) of the messages. Their study shows that the inter-region RTT ranges from below 50 ms to above 300 ms, seemingly depending on the distance between the regions, while the average RTT during intra-AZ experiments did not exceed 500 µs and the average RTT during inter-AZ experiments was below 1.2 ms [37]. Figure 8 shows the
average RTT between all region pairs in milliseconds.
A similar work by Ghandi and Chan measures the ping, upload and download times between all pairs of 9 AWS regions and correlates these times with the geographical distance between the pairs of regions [38]. Gorbenko et al. studied read and write throughput of a distributed Cassandra database cluster with three replicas, different consistency settings and different deployment scenarios. Their study shows that stronger consistency models and increasingly distributed (inter-region) deployments lead to reduced database throughput and latency increases of up to 600% during read/write operations [39]. The design and implementation of a latency measurement service for IoT devices that can collect data across cloud regions and CSPs are presented by Vu et al. They measure the one-way latency, i.e., half the RTT, between regions in both AWS and Microsoft Azure as well as between the providers. They find that the one-way latency varies between 100 ms and 200 ms between the AWS regions us-east-2 and us-west-2 [40].
Table 1 shows a comparison of results from studies regarding communication latency within
AZs, between AZs, and between regions. Note that there are no mentions of the specific
configurations, i.e., intra- or inter-region, used when the self ping was measured.
Table 1: Comparison of latencies among existing works
A different paper, by Berenberg and Calder, classifies six different deployment archetypes
with different sub-categories for cloud applications in order to discuss their respective trade-
offs between general availability, latency and geographical constraints. These are zonal,
regional, multi-regional, global, hybrid and multi-cloud. They describe single-zone with
zonal failover as a deployment model for non-critical services or services that can have a
downtime or maintenance window. Failover times between zones can be instantaneous if
application owners maintain an active-active configuration, with an increased cost. They
also describe single-region with region failover as a way for on-premise enterprise legacy
applications to increase their availability by having a disaster recovery option when they
move to the cloud. The trade-off between cold versus warm standby applies here as well
[41].
While the related work presented in this Section is very similar to this study, it is important for Nasdaq to quantify the limitations of a multi-regional approach compared to a single-region approach in the context of its own system and deployment model, in order to inform business decisions.
4 Solution Design
The following Sections describe the system architecture, how the system components work, the infrastructure for each approach, and the deployment scheme used for the evaluation.
4.1 System architecture
The system to be evaluated, shown in Figure 9, is modeled after the clearing product with a single domain consisting of a single business function. The model consists of two different processes: a gateway and a backend. The gateway is a simple, concurrent HTTP server that forwards incoming requests to the backend process via TCP. The gateway exposes a single REST endpoint that accepts a list of JSON objects representing tasks that mutate the data within the system.
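A minimal Go sketch of such a gateway is shown below: an HTTP handler accepts a JSON array of task objects, forwards the batch to the primary backend over a TCP connection and relays the backend's reply. The field names, addresses, port numbers and line-delimited wire format are assumptions made for illustration; the thesis does not specify them.

package main

import (
	"bufio"
	"encoding/json"
	"io"
	"log"
	"net"
	"net/http"
)

// Task is an assumed shape of the JSON objects that mutate system state.
type Task struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

const primaryAddr = "backend-primary:9000" // assumed backend address

func handleBatch(w http.ResponseWriter, r *http.Request) {
	var tasks []Task
	if err := json.NewDecoder(r.Body).Decode(&tasks); err != nil {
		http.Error(w, "invalid JSON batch", http.StatusBadRequest)
		return
	}

	// Forward the whole batch to the primary backend over TCP and wait
	// for its reply before answering the HTTP request.
	conn, err := net.Dial("tcp", primaryAddr)
	if err != nil {
		http.Error(w, "backend unavailable", http.StatusBadGateway)
		return
	}
	defer conn.Close()

	if err := json.NewEncoder(conn).Encode(tasks); err != nil {
		http.Error(w, "failed to write to backend", http.StatusBadGateway)
		return
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil && err != io.EOF {
		http.Error(w, "failed to read backend reply", http.StatusBadGateway)
		return
	}
	w.Write([]byte(reply))
}

func main() {
	http.HandleFunc("/tasks", handleBatch) // single REST endpoint
	log.Fatal(http.ListenAndServe(":8080", nil))
}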
The backend process is a TCP server that synchronously replicates its input to two other
backend processes, and performs a task. The task, in this case, is storing a key-value pair in
an embedded, memory-mapped database. The backend process utilises a chain thread which continuously reads from a channel that buffers incoming requests from multiple threads. The threads read TCP streams from incoming requests and parse them before sending them
to the channel buffer. When the chain thread receives a request, it stores the data locally
and, if the backend is the primary, sends the same request to the standby and the disaster
processes, in parallel. The primary backend process waits until it receives acknowledgements
from the two other backend processes before it sends its reply to the gateway process.
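The chain-thread structure maps naturally onto goroutines and a buffered channel, as in the sketch below: one handler goroutine per TCP connection parses a request and hands it to a single chain goroutine, which stores the data and, when acting as primary, forwards it to the standby and disaster replicas in parallel before letting the handler reply. The replicate helper, addresses and line-delimited format are illustrative assumptions, not the actual implementation.

package main

import (
	"bufio"
	"log"
	"net"
	"sync"
)

type request struct {
	data []byte
	done chan error // lets the connection handler wait for the outcome
}

// chainLoop is the single "chain thread": it drains the buffered channel,
// stores each request locally and, if this process is the primary,
// replicates it synchronously to the standby and disaster processes.
func chainLoop(ch <-chan request, isPrimary bool, replicas []string, store func([]byte)) {
	for req := range ch {
		store(req.data)

		var err error
		if isPrimary {
			var wg sync.WaitGroup
			errs := make(chan error, len(replicas))
			for _, addr := range replicas {
				wg.Add(1)
				go func(addr string) {
					defer wg.Done()
					errs <- replicate(addr, req.data)
				}(addr)
			}
			wg.Wait()
			close(errs)
			for e := range errs {
				if e != nil {
					err = e // reply indicates failure, but the task is still performed
				}
			}
		}
		req.done <- err
	}
}

// replicate is an assumed helper that sends the data to one replica over TCP
// and waits for its acknowledgement.
func replicate(addr string, data []byte) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	if _, err := conn.Write(append(data, '\n')); err != nil {
		return err
	}
	_, err = bufio.NewReader(conn).ReadString('\n')
	return err
}

// handleConn parses one line-delimited request from a TCP connection and
// hands it to the chain goroutine through the buffered channel.
func handleConn(conn net.Conn, ch chan<- request) {
	defer conn.Close()
	line, err := bufio.NewReader(conn).ReadBytes('\n')
	if err != nil {
		return
	}
	req := request{data: line, done: make(chan error, 1)}
	ch <- req
	if err := <-req.done; err == nil {
		conn.Write([]byte("ack\n"))
	} else {
		conn.Write([]byte("err\n"))
	}
}

func main() {
	ch := make(chan request, 1024) // buffer between readers and the chain goroutine
	go chainLoop(ch, true,
		[]string{"backend-standby:9000", "backend-disaster:9000"},
		func(b []byte) { /* store the key-value pair in the embedded database */ })

	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handleConn(conn, ch)
	}
}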
If an error occurs while waiting for a response, the primary sends its response to the gateway,
indicating that something went wrong. However, it will still perform the task. In PACELC
terms, in the presence of a network partition, the system trades consistency for availability,
otherwise it trades latency for consistency.
4.2 Tracing
The following Section explains the instrumentation approach and Jaeger deployment. The
Section also contains an explanation of the dependencies between traced operations. Jaeger
is deployed as a separate containerised process, providing API endpoints for collecting trace
data from the instrumented processes and a Web UI for searching and analysing traces. The
latter endpoint is also used for exporting the data during the evaluation.
Understanding the dependencies of traced operations is important since the duration of each
span is cumulative and depends on the duration of every child span. For example, Figure
10 shows the dependencies between operations in the gateway process. The duration of the
span with label Incoming request is the sum of its own operations as well as the duration of
the span with label Sending to primary, which in turn is the sum of its own operations and
the duration of its three children spans.
Figure 11 shows the dependencies between operations in the non-primary backend process,
i.e., standby and disaster.
Figure 12 shows the dependencies between operations in the primary backend process.
4.3 Infrastructure
The following Section presents the infrastructure design. The infrastructure has been imple-
mented and constructed using Terraform in AWS, with the playground accounts provided
by ACG.
Figure 12: Dependencies between operations in the primary backend process
4.3.1 Single region
Figure 13 shows the infrastructure design for the single-region approach. The infrastructure
consists of a single VPC in a single region, consisting of two different AZs. Each AZ has a
private and a public subnet. The EKS cluster is deployed in the private subnets of the two
AZs and an external-facing load balancer is deployed in the public subnets, balancing load
across all instances of the gateway process. The gateway, primary and standby processes are
deployed to the us-east-1a AZ while the disaster process is deployed to the us-east-1b
AZ.
The region also contains an Elastic Container Registry (ECR) for storing container images.
It contains two repositories, one for each system process, i.e., the gateway and the backend.
There is also an Elastic File System (EFS) used for persistent storage.
Figure 13: Single-region infrastructure design
4.3.2 Multi-region
Figure 14 shows the infrastructure design for the multi-region approach. The infrastructure consists of two different regions within the ACG playground. Each region has a separate VPC spanning two different AZs. Each AZ has a private and a public subnet. The
EKS clusters are deployed in the private subnets of the two regions and, like the single-
region approach, an external-facing load balancer is deployed in the public subnets in the
primary region, balancing incoming requests across the gateway instances. There is also a
VPC peering connection between the VPCs in the different regions, which the Kubernetes
DNS application can route traffic through via internal network load balancers. As with the
single-region infrastructure, the multi-region infrastructure also contains an ECR and an
EFS in each region. However, the ECR in the disaster region is a replica of the ECR in the
primary region. When container images are pushed to the repositories in the primary ECR,
they are replicated to the ECR in the disaster region.
4.4 Deployment scheme
The deployment, for each approach, is completed in two different stages. First, the in-
frastructure is created, using Terraform. The second stage consists of a script that builds
the container images, pushes the images to the ECR registries, deploys the containers and
auxiliary services to the Kubernetes clusters using Helm and configures the cluster DNS
services in order to route traffic between clusters, in the multi-region case. The processes
of the model system, i.e., the gateway and backend, are both containerised. Within the
Kubernetes clusters, the gateway container is deployed as a Deployment and the backend
containers are deployed as StatefulSets, and each partition of the respective backend con-
tainers has a separate persistent storage volume attached. In the event of a crash or a restart
of a Pod running a backend container, the data is intact.
Figure 14: Multi-region infrastructure design
4.5 Limitations
This Section describes the assumptions made for and limitations of the solution.
• The system does not discriminate between incoming requests and does not attempt to route the request to the correct backend partition.
• Neither the incoming requests nor the task that the backend process performs are
modeled after the clearing product.
• The system does not implement a failover mechanism. If the primary region or avail-
ability zone fails, the disaster partition of the system is unreachable.
• The requests received by the TCP server in the backend processes are not sequenced,
and are processed by the chain thread in the order they are buffered in the channel.
• The URL for accessing the gateway process is dynamic, and a new one is generated
each time the Kubernetes service for exposing the gateway process is created. This
URL is also open to the public.
• A failed process partition does not attempt to catch up, i.e., perform the tasks it has
missed during the downtime. Kubernetes just attempts to restart it, and the process resumes working as if nothing has happened.
• The Jaeger trace data is stored in memory, and the Jaeger Pod is prone to crashes if
it receives too many traces.
5 Evaluation
The key metrics for this system are latency and throughput, with throughput being the more important of the two. These metrics are measured by an external load generator application, Grafana K6, with different scenarios for both approaches in order to compare the latency and throughput measurements.
5.1 Experimental evaluation
The experimental evaluation of the system is performed on a total of nine instances of the
t3.medium instance type for both approaches. Due to the limitations of ACG playgrounds,
the number of instances cannot be increased. These instances have a capacity of 2 vCPUs
and 4 GiB memory, each. The CPUs have a sustained all core turbo clock speed of up to
3.1 GHz [42]. The primary AZ/region contains six instances and the secondary AZ/region
contains three instances. The evaluation is performed on multiple different deployments of
the different approaches over multiple weeks.
The evaluation of the different approaches is done through load testing the system using a
closed test model [43] with a certain number of virtual users (VUs). The VUs make HTTP
requests to the system, wait for a response, and do the same thing again without any waiting
time, during a certain time period. The payload of each request sent to the system consists
of a list of JSON objects, each object ranging between 52 and 56 bytes in size. The length of
the list is the batch size. At the end of the test, the evaluation software displays a summary
of the test and writes the results to a file.
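The actual load is generated with Grafana K6; purely to illustrate the closed test model, the following Go sketch runs a fixed number of virtual users that each send a JSON batch, wait for the response and immediately send the next request, with no think time, for a fixed duration. The gateway URL, payload contents and parameter values are placeholders, not values taken from the evaluation itself.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

// run implements a closed test model: each virtual user (VU) loops on
// request/response until the test duration has passed.
func run(url string, vus, batchSize int, duration time.Duration) (ok, failed int64) {
	// Build one JSON batch of small task objects, reused by every request.
	batch := make([]map[string]string, batchSize)
	for i := range batch {
		batch[i] = map[string]string{"key": fmt.Sprintf("k%08d", i), "value": "v"}
	}
	body, _ := json.Marshal(batch)

	deadline := time.Now().Add(duration)
	var wg sync.WaitGroup
	for v := 0; v < vus; v++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := http.Post(url, "application/json", bytes.NewReader(body))
				if err != nil || resp.StatusCode >= 400 {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
				if resp != nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()
	return ok, failed
}

func main() {
	// Example parameters resembling the multi-producer tests: 100 VUs for
	// 30 seconds with a batch size of 10 (the gateway URL is a placeholder).
	ok, failed := run("http://gateway.example/tasks", 100, 10, 30*time.Second)
	fmt.Printf("successful: %d, failed: %d, throughput: %.2f req/s\n",
		ok, failed, float64(ok)/30.0)
}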
5.1.1 Throughput
The throughput is calculated by dividing the number of successful requests made to the system by the period of time during which the requests were made. The effective throughput is calculated by simply multiplying the batch size with the calculated through-
put. When comparing the throughput of the different approaches, the relative difference of
the median throughput between the different approaches is used. Equation 1 shows how the
difference is calculated.
$$\frac{T^{M}_{p,b} - T^{S}_{p,b}}{T^{S}_{p,b}} \qquad (1)$$
where $T^{M}_{p,b}$ is the median throughput of successful requests for the multi-region approach, $T^{S}_{p,b}$ is the median throughput of successful requests for the single-region approach, $p$ is the number of partitions and $b$ is the batch size.
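As a worked example, using the single-producer median throughputs later reported in Section 6.1.1 (6.62 requests per second for the single-region approach and 3.02 requests per second for the multi-region approach), Equation 1 gives

$$\frac{3.02 - 6.62}{6.62} \approx -0.544,$$

i.e., a relative difference in median throughput of about $-54.4\,\%$.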
5.1.2 Latency
The latency is calculated in the evaluation software and is defined as the time spent waiting
for response from the remote host, i.e., time to first byte (TTFB). When comparing the
latency of the different approaches, the relative difference of the median latency between the
different approaches is used. Equation 2 is used to calculate the difference.
$$\frac{L^{M}_{p,b} - L^{S}_{p,b}}{L^{S}_{p,b}} \qquad (2)$$
where $L^{M}_{p,b}$ is the median latency of successful requests for the multi-region approach, $L^{S}_{p,b}$ is the median latency of successful requests for the single-region approach, $p$ is the number of partitions and $b$ is the batch size.
5.2 Test cases
The following Sections describe the test cases that will be evaluated.
5.2.1 Single producer
The goal with this scenario is to measure the latency of individual requests and overall
throughput of the different approaches with a single user interacting with the system. This
scenario consists of two different tests. The first test utilises the evaluation software to
perform a long-running test for three hours with a single VU. This test will be performed
with the different approaches with a batch size of 1. The second test utilises the evaluation
software to perform 1000 requests on the system, after which the tracing data is manually extracted from the Jaeger endpoint. This is done because of the Jaeger limitations described in Section 4.5. The Jaeger trace data is exported and the spans are presented in the dependency order described in Section 4.2, to see which operations in the different processes are the most prone to latency increases when the system is deployed across multiple regions.
5.2.2 Multiple producers
The goal with this scenario is to measure the latency of individual requests and the overall throughput for different numbers of partitions and batch sizes. The number of VUs and the duration of the individual evaluations are 100 and 30 seconds, respectively, for this type of test. The different numbers of partitions are included to see how many times the multi-region approach, potentially, needs to scale the processes in order to approach the same throughput as the single-region approach. The number of partitions is equal to the number of pods in each Deployment and StatefulSet, i.e., two partitions equals two gateway pods, two primary backend pods, two standby backend pods and two disaster backend pods.
The purpose of the different batch sizes is to see the effect of larger requests made to the
system and to see if the latency can be mitigated by utilising more bandwidth. These metrics
are later compared in order to see which has the most effect. The test utilises the evaluation
software to perform three consecutive samples of 30 second burst tests on the system at
different partitioning levels and batch sizes.
6 Results
The following Sections show the results from the experimental evaluation. The tests were
performed on multiple different deployments of both approaches.
6.1 Single producer
The following Sections present the results of the long-running tests and the tests with a fixed amount of requests, explained in Section 5.2.1.
6.1.1 Long-running tests
As previously mentioned, this type of test utilises a single VU to make sequential requests to the system over the course of three hours.
Figure 15 shows the throughput of both approaches each minute from the long-running test,
with fitted trend lines. The Figure shows that the throughput of both approaches stay
relatively flat during the entire test.
Figure 16 shows the throughput distribution of both approaches, each minute from the
long-running test. The density is with respect to the height and not the area. The median
throughput is 6.62 and 3.02 requests per second for the single-region approach and multi-
region approach, respectively. The relative difference in median throughput is −54.41%.
Figure 16: Throughput distribution
Figure 17 shows the latency of each individual request made to both systems during the long-running test, with fitted trend lines. The Figure shows that the latency also stays relatively flat over time for both approaches, despite some outliers.
Figure 18 shows the latency distribution of each individual request made to both systems
during the long-running test. The median latency is 0.14 and 0.31 seconds for the single-
region approach and multi-region approach, respectively. The relative difference in median
latency is 119.20%.
Figure 18: Latency distribution
6.1.2 Fixed amount of requests
Figure 19 shows the latency with different batch sizes for both approaches, measured during
the second type of test. The boxes show the median, mean, the first and third quartile.
The whiskers are extending 1.5 times the interquartile range from the boxes. The fliers are
omitted. The Figure shows an increase in latency as the batch size increases, especially in
the multi-region approach.
Figure 19: Latency with different batch sizes for both approaches
Figure 20 shows the median throughput with different batch sizes for both approaches, mea-
sured during the second type of test. The throughput decreases as the batch size increases.
Figure 20: Median throughput with different batch sizes for both approaches
Figure 21 shows the median effective throughput with different batch sizes for both ap-
proaches, measured during the second type of test. The effective throughput increases as
the batch size increases.
Figure 21: Median effective throughput with different batch sizes for both approaches
Figures 22-29 show the latency of the respective internal operations for each process in both
systems at different batch sizes, measured during the second type of test. Figure 22 shows
the internal gateway operations with batch size 1. The Figure shows a latency difference in
operations Incoming request, Sending to primary and Reading response from primary. There
are no significant differences in the operations Connecting to primary and Writing request
to primary.
Figure 22: Internal gateway operations with batch size 1
Figure 23 shows the internal gateway operations with batch size 1000. The Figure shows similar differences between the approaches as Figure 22. However, all operations except Connecting to primary increase in latency as the batch size increases.
Figure 24 shows the internal primary backend operations with batch size 1. The differences
between the approaches can be traced to the operations Connecting to disaster and Reading
response from disaster.
Figure 24: Internal primary backend operations with batch size 1
Figure 25 shows the internal primary backend operations with batch size 1000. The Figure
shows similar differences between the approaches as Figure 24. However, the operation
Writing request to disaster now shows a significant difference between the approaches.
Figure 25: Internal primary backend operations with batch size 1000
Figure 26 shows the internal standby backend operations with batch size 1. The Figure
shows no significant difference between the different approaches.
Figure 27 shows the internal standby backend operations with batch size 1000. The Fig-
ure shows no significant difference between the different approaches. However, there is an
increase in latency compared to the results shown in Figure 26.
Figure 27: Internal standby backend operations with batch size 1000
Figure 28 shows the internal disaster backend operations with batch size 1. The Figure
shows no significant difference between the different approaches.
Figure 29 shows the internal disaster backend operations with batch size 1000. The Fig-
ure shows no significant difference between the different approaches. However, there is an
increase in latency compared to the results shown in Figure 28.
Figure 29: Internal disaster backend operations with batch size 1000
6.2 Multiple producers
The following Sections show the results from testing with multiple producers, explained in Section 5.2.2, and categorise them into throughput and latency results. These tests utilise 100 VUs to make concurrent requests to the system for 30 seconds, three consecutive times.
The results of multi-producer tests with a higher maximum batch size are included in Appendix A. These tests include a batch size of 10000, which seems to negatively affect the single-region throughput and latency at batch sizes 100, 1000 and 10000.
6.2.1 Throughput
Figure 30 shows the median effective throughput of requests made to the system with a
variable number of partitions and batch size. The throughput is calculated by dividing
the total number of successful requests completed by the total duration of the evaluation,
provided by the evaluation software. The effective throughput is calculated by multiplying
the calculated throughput with the batch size. The Figure shows that the single-region
approach has a higher effective throughput than the multi-region approach. Both approaches
scale relatively well with the number of partitions of the processes within the system at
lower batch sizes. There is a smooth increase in throughput with an increasing number of partitions in the system. Furthermore, each incremental increase of the batch size increases
the overall throughput by almost an order of magnitude for both approaches.
The single-region system does not scale well with higher batch sizes and an increasing number of partitions, exhibiting diminishing and even negative returns, unlike the multi-region approach, which exhibits a similar scaling relationship for all batch sizes.
Figure 30: Effective throughput with different batch sizes and number of partitions
6.2.2 Throughput difference
Figure 31 shows the relative difference of median effective throughput between the single-region and multi-region approaches at different batch sizes and numbers of partitions. With 1 partition, the relative difference is below -90% for each batch size. With an increased number of partitions in the system, the relative difference is reduced to between just above -40% and just below -80%, depending on the batch size.
Figure 31: Relative difference of median effective throughput between the approaches
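The relative-difference metric used in Figures 31 and 36 is consistent with taking the single-region measurement as the baseline (an interpretation based on the signs of the reported values, negative for throughput and positive for latency), i.e., plausibly:

\[ \Delta_{\text{rel}} = \frac{x_{\text{multi}} - x_{\text{single}}}{x_{\text{single}}} \cdot 100\% \]

where x denotes the median effective throughput (Figure 31) or the median latency (Figure 36) of the respective approach.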
6.2.3 Latency
Figures 32-35 show the latency of each successful request made to the system with a variable
number of partitions and batch sizes, for both the single-region and multi-region approaches.
The latencies of the requests sent to the multi-region system are almost an order of magnitude
higher than those of the requests sent to the single-region system.
Increasing the batch size introduces a lot of variation in the latencies, especially in the
single-region approach, as can be seen in Figures 34 and 35 compared to the smaller batch
sizes shown in Figures 32 and 33. The Figures also show no decrease in latency with an
increasing number of partitions for the single-region approach, suggesting diminishing
returns.
Figure 32: Latency with different amount of partitions with a batch size of 1
Figure 33: Latency with different amount of partitions with a batch size of 10
Figure 34: Latency with different amount of partitions with a batch size of 100
Figure 35: Latency with different amount of partitions with a batch size of 1000
6.2.4 Latency difference
Figure 36 shows the relative difference of median latency between the multi-region and
single-region approaches at different batch sizes and number of partitions. With 1 partition,
the relative difference is between 1100% and just above 1600%, depending on the batch
size. With more partitions in the system, the relative difference is reduced to between below
100% and around 300%.
7 Discussion
The results of the single-producer tests show that there is a significant difference in both
throughput and latency between the two approaches. The relative difference in median
throughput is -54.41% and the relative difference in median latency is 119.20%. The dif-
ferences in latency can be traced back to the operations pertaining to the disaster process
in the multi-region case. The latency of the operation Sending to disaster is an order of
magnitude higher in the multi-region case than in the single-region case. This is expected,
since the distance between the primary and disaster regions is much greater than the
distance between the primary and standby availability zones.
As previously mentioned, there is no significant difference in the latency of the operations
in the standby and disaster processes, as shown in Figures 26 and 28. This is expected, since
none of the operations in the standby and disaster processes depend on an acknowledgement
or answer from the primary process.
The results of the multi-producer tests are similar to those of the single-producer tests, but
the relative differences are larger. Both the relative difference in latency and the relative
difference in throughput are larger than in the single-producer scenario, except for the results
at batch size 100 in the multi-producer case. Looking at Figure 30, the reason appears to be
a decrease in single-region throughput at batch size 100: increasing the number of partitions
beyond 10 decreases the throughput, and the relative difference decreases as a result. The
fact that the difference is larger in the multi-producer case indicates that the single-region
approach handles requests from multiple users better than the multi-region approach.
The diminishing returns exhibited by the single-region approach in Figure 30 could be due
to a number of reasons, e.g., bandwidth throttling activating somewhere along the network
path, a temporary decrease in the performance of the virtual machines, or a temporary
decrease in network quality.
Increasing the batch size by an order of magnitude increases the effective throughput by
almost an order of magnitude for both approaches, as shown in Figures 21 and 30. This
indicates that the effective throughput can be increased by utilising the bandwidth more
effectively. Figures 26-27 and 28-29 show that the latency of the internal standby and
disaster backend operations increases by a few milliseconds. However, one could expect
a drastic increase in the latency of these operations if the tasks within the requests were
processed further within the chain thread.
The median was used primarily because the results showed a skewed distribution and a large
number of outliers. This was especially prominent in the latency measurements of the long-
running tests, as can be seen in Figures 17 and 18.
The results are in line with the related work presented in Section 3. Gandhi and Chan show
that the RTT between us-east-1 (NVR) and us-west-2 (ORG) is around 70 ms, while Iqbal
et al. show that the RTT between the same regions is between 50 and 100 ms. Figures
24-25 show that the median connection time to the disaster backend, in the multi-region
case, is around 70 ms. The TCP connection stage consists of a three-way handshake, and the
connect operation returns after sending the last message in the sequence. Assuming that
the one-way latency is half the RTT, the one-way latency of the TCP connection messages
coincides with the one-way latency of the pings from the related works, especially those of
Iqbal et al. and of Gandhi and Chan.
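As an illustration of why the measured connect time approximates one RTT, the handshake can be timed with a plain TCP connect. This is a minimal sketch for illustration only; it is not part of the thesis tooling, and the host name is a placeholder:

import socket
import time

def tcp_connect_ms(host: str, port: int = 443, timeout: float = 5.0) -> float:
    # Time a TCP three-way handshake. connect() returns once the final ACK
    # of the handshake has been sent, so the measured duration is roughly
    # one round-trip time (RTT) to the remote host.
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    sock.close()
    return elapsed_ms

if __name__ == "__main__":
    # Placeholder endpoint; in the evaluation this would be the address of
    # the disaster backend in the remote region.
    print(f"TCP connect time: {tcp_connect_ms('example.com'):.1f} ms")

Halving such a measurement gives an estimate of the one-way latency, which is what the comparison with the ping-based numbers above relies on.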
7.2 Limitations
There are a number of limitations and assumptions made during the solution design process.
These limitations are listed in Section 4.5.
As previously mentioned, the ACG platform does not permit its cloud users to view the
billing information of the AWS sandbox accounts. Therefore, there are no comparisons,
discussions nor conclusions regarding the monetary cost of each approach.
Since there is no failover implementation and the recovery time cannot be measured, there
are no comparisons, discussions nor conclusions regarding the reliability of the different
approaches.
While a solution without these limitations would provide a more complete picture of how a
multi-region deployment of a strongly consistent system compares to a single-region
deployment, the performance evaluation of the simple model still quantifies the impact on
throughput and latency.
8 Conclusion
In order to deploy this system to a cloud environment, there has to be supporting infrastruc-
ture, e.g., virtual machines, networking, availability zones, regions, etc. This infrastructure
can be created and maintained with Terraform, in both the single and multi-region case.
Helm can be used to deploy the system to the Kubernetes clusters in both cases. However,
in order to deploy and maintain the system in the multi-region case, one needs to keep track
of the different Kubernetes context identifiers in order to tell Helm which cluster to interact
with (see the sketch after this paragraph). There are multiple alternatives for multi-region
communication in AWS, described in Section 2.5. However, due to the limitations of the
ACG platform, only the VPC peering method has been used in this project. VPC peering
allows two VPCs, in different regions, to communicate with each other as if they were on the
same network. This method has the benefit of not touching the public Internet at all.
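A minimal sketch of what this context bookkeeping can look like when the deployments are scripted; the context names, release names and chart path below are placeholders and not taken from the thesis, while Helm's global --kube-context flag is what selects the target cluster:

import subprocess

# Placeholder context names; the real names are listed by
# `kubectl config get-contexts` after access to each cluster is configured.
CONTEXTS = {
    "primary": "arn:aws:eks:us-east-1:111111111111:cluster/primary",
    "disaster": "arn:aws:eks:us-west-2:111111111111:cluster/disaster",
}

def helm_upgrade(release: str, chart: str, context: str) -> None:
    # Install or upgrade a release in the cluster selected by --kube-context.
    subprocess.run(
        ["helm", "upgrade", "--install", release, chart,
         "--kube-context", context],
        check=True,
    )

for name, context in CONTEXTS.items():
    helm_upgrade(f"chain-{name}", "./chart", context)

With a single-region deployment the context flag can typically be omitted, since the kubeconfig then only contains one cluster.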
As shown in existing works, there is a significant increase in latency between regions com-
pared to between AZs, an increase that is proportional to the distance between the regions.
As discussed in Section 7, the results of the single-producer tests show that the relative
difference in median throughput is -54.41%, while the relative difference in median latency
is 119.20%. The single-region approach also handles multiple simultaneous users better than
the multi-region approach: the throughput measured during the multi-producer tests is
almost an order of magnitude higher in the single-region case than in the multi-region case,
and the latency measured during the same tests is almost an order of magnitude lower.
However, as the number of partitions increases, the difference becomes smaller.
It would be very interesting to compare not only the performance but also the reliability of
the different approaches. A chaos engineering approach, where one disables an AZ or an
entire region, or kills all pods within it to simulate a region failure, would be suitable for
gathering reliability metrics. This approach would require an implemented failover
mechanism to signal the process in the disaster AZ or region to become the new primary.
Another interesting factor to investigate is cost: which approach is more expensive, and by
how much? Due to the limitations of the AWS accounts provided with ACG, it was not
possible to include the cost of the different approaches in this project.
In order to reduce the complexity of having two different Kubernetes clusters to deal with,
a federated Kubernetes multi-cluster approach could be an option. Historically, KubeFed
has been a popular answer to the issue of centralized management of multiple Kubernetes
clusters. However, the KubeFed project has reached its end of life due to incompatibility of
Kubernetes APIs and a lack of extensibility. Karmada and Open Cluster Management (OCM)
are two projects that each provide alternatives to KubeFed [44].
References
[1] Izrailevsky, Y. and Bell, C. “Cloud reliability”. In: IEEE Cloud Computing 5.3 (2018),
pp. 39–44.
[2] Van Renesse, R. and Schneider, F. B. “Chain Replication for Supporting High Through-
put and Availability.” In: OSDI. Vol. 4. 91–104. 2004.
[3] IBM. What is cloud computing? url: https://www.ibm.com/topics/cloud-computing. Accessed 2023-01-30.
[4] Rashid, A. and Chaturvedi, A. “Cloud computing characteristics and services: a brief
review”. In: International Journal of Computer Sciences and Engineering 7.2 (2019),
pp. 421–426.
[5] The Kubernetes Authors. Kubernetes Components. url: https://kubernetes.io/
docs/concepts/overview/components/. Accessed 2023-02-09.
[6] The Kubernetes Authors. Pods. url: https://kubernetes.io/docs/concepts/workloads/pods/. Accessed 2023-03-28.
[7] The Kubernetes Authors. Viewing Pods and Nodes. url: https://kubernetes.io/docs/tutorials/kubernetes-basics/explore/explore-intro/. Accessed 2023-04-13.
[8] The Kubernetes Authors. ReplicaSets. url: https://kubernetes.io/docs/concepts/
workloads/controllers/replicaset/. Accessed 2023-03-28.
[9] The Kubernetes Authors. Deployments. url: https://kubernetes.io/docs/concepts/
workloads/controllers/deployment/. Accessed 2023-03-28.
[10] The Kubernetes Authors. StatefulSets. url: https://kubernetes.io/docs/concepts/
workloads/controllers/statefulset/. Accessed 2023-03-28.
[11] Helm Authors. Charts. url: https://helm.sh/docs/topics/charts/. Accessed 2023-03-27.
[12] Helm Authors. Using Helm. url: https://helm.sh/docs/intro/using_helm/. Accessed 2023-02-09.
[13] Helm Authors. Helm. url: https://helm.sh/docs/helm/helm/. Accessed 2023-05-09.
[14] Amazon Web Services. AWS Global Infrastructure. url: https://aws.amazon.com/
about-aws/global-infrastructure/?hp=tile&tile=map. Accessed 2023-01-30.
[15] Amazon Web Services. Regions and Availability Zones. url: https://aws.amazon.com/about-aws/global-infrastructure/regions_az/?p=ngi&loc=2. Accessed 2023-01-30.
[16] Amazon Web Services. Global Single-Region operations. url: https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/aws-service-types.html#global-single-region-operations. Accessed 2023-01-30.
[17] Amazon Web Services. What is Amazon EKS? url: https://docs.aws.amazon.
com/eks/latest/userguide/what-is-eks.html. Accessed 2023-02-09.
[18] Amazon Web Services. What is Amazon VPC? url: https://docs.aws.amazon.
com/vpc/latest/userguide/what-is-amazon-vpc.html. Accessed 2023-02-01.
[19] Amazon Web Services. VPC peering basics. url: https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html#vpc-peering-limitations. Accessed 2023-02-01.
[20] Amazon Web Services. VPC peering connection quotas. url: https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-connection-quotas.html. Accessed 2023-02-01.
[21] Amazon Web Services. How transit gateways work. url: https://docs.aws.amazon.
com/vpc/latest/tgw/how-transit-gateways-work.html. Accessed 2023-02-01.
[22] Amazon Web Services. Quotas for your transit gateways. url: https://docs.aws.amazon.com/vpc/latest/tgw/transit-gateway-quotas.html. Accessed 2023-02-01.
[23] Pluralsight. AWS cloud sandbox. url: https://help.pluralsight.com/help/aws-sandbox. Accessed 2023-04-25.
[24] Brewer, E. A. “Towards robust distributed systems”. In: PODC. Vol. 7. 10.1145. Port-
land, OR. 2000, pp. 343477–343502.
[25] Gilbert, S. and Lynch, N. “Brewer’s conjecture and the feasibility of consistent, avail-
able, partition-tolerant web services”. In: Acm Sigact News 33.2 (2002), pp. 51–59.
[26] Abadi, D. “Consistency tradeoffs in modern distributed database system design: CAP
is only part of the story”. In: Computer 45.2 (2012), pp. 37–42.
[27] IBM. What is Infrastructure as Code (IaC)? url: https://www.ibm.com/topics/
infrastructure-as-code. Accessed 2023-02-07.
[28] IBM. What is Terraform? url: https://www.ibm.com/topics/terraform. Accessed
2023-02-07.
[29] HashiCorp. Providers. url: https://developer.hashicorp.com/terraform/language/providers. Accessed 2023-02-08.
[30] HashiCorp. Resource Blocks. url: https://developer.hashicorp.com/terraform/
language/resources/syntax. Accessed 2023-02-08.
[31] HashiCorp. Data Sources. url: https://developer.hashicorp.com/terraform/
language/data-sources. Accessed 2023-02-08.
[32] The OpenTelemetry Authors. Observability Primer. url: https://opentelemetry.
io/docs/concepts/observability-primer/. Accessed 2023-03-27.
[33] Ian Duncan. OpenTelemetry.Trace. url: https://hackage.haskell.org/package/hs-opentelemetry-sdk-0.0.3.1/docs/OpenTelemetry-Trace.html. Accessed 2023-05-17.
[34] The OpenTelemetry Authors. What is OpenTelemetry? url: https://opentelemetry.
io/docs/concepts/what-is-opentelemetry/. Accessed 2023-03-27.
[35] The Jaeger Authors. Features. url: https://www.jaegertracing.io/docs/1.43/
features/. Accessed 2023-03-27.
[36] The Jaeger Authors. Architecture. url: https://www.jaegertracing.io/docs/1.
43/architecture/. Accessed 2023-03-27.
[37] Iqbal, H., Singh, A., and Shahzad, M. “Characterizing the Availability and Latency
in AWS Network From the Perspective of Tenants”. In: IEEE/ACM Transactions on
Networking 30.4 (2022), pp. 1554–1568.
[38] Gandhi, A. and Chan, J. “Analyzing the network for aws distributed cloud computing”.
In: ACM SIGMETRICS Performance Evaluation Review 43.3 (2015), pp. 12–15.
[39] Gorbenko, A., Karpenko, A., and Tarasyuk, O. “Performance evaluation of various
deployment scenarios of the 3-replicated Cassandra NoSQL cluster on AWS”. In: Ra-
dioelectronic and Computer Systems 4 (2021), pp. 157–165.
[40] Vu, T., Mediran, C. J., and Peng, Y. “Measurement and Observation of Cross-Provider
Cross-Region Latency for Cloud-Based IoT Systems”. In: 2019 IEEE World Congress
on Services (SERVICES). Vol. 2642. IEEE. 2019, pp. 364–365.
[41] Berenberg, A. and Calder, B. “Deployment archetypes for cloud applications”. In: ACM
Computing Surveys (CSUR) 55.3 (2022), pp. 1–48.
[42] Amazon Web Services. Amazon EC2 T3 Instances. url: https://aws.amazon.com/
ec2/instance-types/t3/. Accessed 2023-04-18.
[43] Schroeder, B., Wierman, A., and Harchol-Balter, M. “Open versus closed: A cautionary
tale”. In: USENIX. 2006, pp. 239–252.
[44] Eads, D. and Wang, K. Karmada and Open Cluster Management: two new approaches to the multicluster fleet management challenge. Sept. 26, 2022. url: https://www.cncf.io/blog/2022/09/26/karmada-and-open-cluster-management-two-new-approaches-to-the-multicluster-fleet-management-challenge/. Accessed 2023-04-25.
Appendix A Multi-producer tests with higher batch sizes
Figure 37 shows the median effective throughput of the different approaches with different
batch sizes and number of partitions. The maximum batch size in these tests is 10000,
i.e., an order of magnitude higher than in the tests shown in Section 6. The Figure shows a
stagnation in the effective throughput of the single-region approach, especially with batch
sizes 100, 1000 and 10000, despite increasing the number of partitions.
Figure 37: Effective throughput with different batch sizes and amount of partitions
Figure 38 shows the median effective throughput with different batch sizes and amount of
partitions with detailed numbers.
Figure 38: Effective throughput with different batch sizes and amount of partitions
Figure 39 shows the relative difference of the median effective throughput between the single-
region and the multi-region approach. The tests with batch sizes 100, 1000 and 10000 show
that both approaches are almost equally performant with more partitions, with the multi-
region approach surpassing the single-region approach, i.e., a relative difference above 0%.
This is also shown in Figures 37 and 38.
Figure 39: Relative difference in median effective throughput between the approaches
Figures 40-44 show the latency of individual requests made to both approaches. A lot of
variance is introduced in the single-region approach from batch size 100 and above, and from
2 partitions upwards. The Figures also show that, with more partitions, the latency of the
multi-region approach is almost equal to, or in some cases lower than, that of the single-region
approach.
Figure 40: Latency with different amount of partitions with a batch size of 1
Figure 41: Latency with different amount of partitions with a batch size of 10
Figure 42: Latency with different amount of partitions with a batch size of 100
Figure 43: Latency with different amount of partitions with a batch size of 1000
Figure 44: Latency with different amount of partitions with a batch size of 10000
Figure 45 shows that the relative difference in latency between the multi-region approach and
the single-region approach reaches below 0% for batch sizes 100, 1000 and 10000, indicating
that the median latency is higher with the single-region approach than with the multi-region
approach.
Figure 45: Relative difference in median effective latency between the approaches
There seems to be some throttling or limitation imposed on the single-region approach after
the first couple of tests. The higher batch sizes, i.e., 100, 1000 and 10000, show little to no
increase in throughput and no improvement in latency as the number of partitions increases.
The problem largely disappears when batch size 10000 is not tested at all, as seen in
Section 6.
Without any additional data it is hard to conclude why this happens. One reason might be
bandwidth throttling: increasing the batch size increases the bandwidth used by each
individual request, and using too much bandwidth from a single source might trigger network
policies somewhere along the path to the system, resulting in rate limiting. From this data
alone, however, it is hard to determine the culprit.
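A rough way to reason about the load is the following back-of-envelope estimate, where the average serialized message size s is introduced here as an assumption and was not measured in the thesis:

\[ \text{offered load per producer} \approx T_{\text{eff}} \cdot s \]

so increasing the batch size by an order of magnitude increases each producer's bandwidth demand by roughly an order of magnitude, which is where a per-source throttle would start to have an effect.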